A Brief Introduction to Machine Learning for Engineers

Osvaldo Simeone (2017), “A Brief Introduction to Machine Learning for Engineers”, : Vol. XX, No. XX, pp 1–201. DOI: XXX.

Osvaldo Simeone
Department of Informatics
King’s College London
osvaldo.simeone@kcl.ac.uk
Contents
1 Introduction 5
1.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Goals and Outline . . . . . . . . . . . . . . . . . . . . . . 7
2 A Gentle Introduction through Linear Regression 11
2.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . 11
2.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Frequentist Approach . . . . . . . . . . . . . . . . . . . . 15
2.4 Bayesian Approach . . . . . . . . . . . . . . . . . . . . . 31
2.5 Minimum Description Length (MDL) . . . . . . . . . . . . 37
2.6 Interpretation and Causality . . . . . . . . . . . . . . . . . 39
2.7 Information-Theoretic Metrics . . . . . . . . . . . . . . . . 41
2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3 Probabilistic Models for Learning 45
3.1 The Exponential Family . . . . . . . . . . . . . . . . . . . 46
3.2 Maximum Entropy Property . . . . . . . . . . . . . . . . . 51
3.3 Frequentist Learning . . . . . . . . . . . . . . . . . . . . . 52
3.4 Bayesian Learning . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Energy-based Models . . . . . . . . . . . . . . . . . . . . 62
3.6 Supervised Learning via Generalized Linear Models (GLM) 64
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4 Classification 66
4.1 Classification as a Supervised Learning Problem . . . . . . 67
4.2 Stochastic Gradient Descent . . . . . . . . . . . . . . . . 69
4.3 Discriminative Deterministic Models . . . . . . . . . . . . 71
4.4 Discriminative Probabilistic Models . . . . . . . . . . . . . 83
4.5 Generative Probabilistic Models . . . . . . . . . . . . . . . 86
4.6 Multi-Class Classification . . . . . . . . . . . . . . . . . . 88
4.7 Non-linear Discriminative Models: Deep Neural Networks . 90
4.8 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5 Statistical Learning Theory 96
5.1 A Formal Framework for Supervised Learning . . . . . . . 96
5.2 PAC Learnability and Sample Complexity . . . . . . . . . . 101
5.3 PAC Learnability for Finite Hypothesis Classes . . . . . . . 103
5.4 VC Dimension and Fundamental Theorem of PAC Learning 106
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6 Unsupervised Learning 110
6.1 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . 111
6.2 K-Means Clustering . . . . . . . . . . . . . . . . . . . . . 114
6.3 ML, ELBO and EM . . . . . . . . . . . . . . . . . . . . . 116
6.4 Directed Generative Models . . . . . . . . . . . . . . . . 127
6.5 Undirected Generative Models . . . . . . . . . . . . . . . 134
6.6 Discriminative Models . . . . . . . . . . . . . . . . . . . . 137
6.7 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . 138
6.8 Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7 Probabilistic Graphical Models 142
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.2 Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . 146
7.3 Markov Random Fields . . . . . . . . . . . . . . . . . . . 155
7.4 Bayesian Inference in Probabilistic Graphical Models . . . 158
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8 Approximate Inference and Learning 162
8.1 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . 163
8.2 Variational Inference . . . . . . . . . . . . . . . . . . . . . 165
8.3 Monte Carlo-Based Variational Inference . . . . . . . . . . 172
8.4 Approximate Learning . . . . . . . . . . . . . . . . . . . 174
8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
9 Concluding Remarks 177
Appendices 180
A Appendix A: Information Measures 181
A.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
A.2 Conditional Entropy and Mutual Information . . . . . . . . 184
A.3 Divergence Measures . . . . . . . . . . . . . . . . . . . . 186
B Appendix B: KL Divergence and Exponential Family 189
Acknowledgements 191
References 192
A Brief Introduction to Machine
Learning for Engineers
Osvaldo Simeone1
1Department of Informatics, King’s College London;
osvaldo.simeone@kcl.ac.uk
ABSTRACT
This monograph aims at providing an introduction to key concepts, algorithms, and theoretical frameworks in machine learning, including supervised and unsupervised learning, statistical learning theory, probabilistic graphical models and approximate inference. The intended readership consists of electrical engineers with a background in probability and linear algebra. The treatment builds on first principles, and organizes the main ideas according to clearly defined categories, such as discriminative and generative models, frequentist and Bayesian approaches, exact and approximate inference, directed and undirected models, and convex and non-convex optimization. The mathematical framework uses information-theoretic measures as a unifying tool. The text offers simple and reproducible numerical examples providing insights into key motivations and conclusions. Rather than providing exhaustive details on the existing myriad solutions in each specific category, for which the reader is referred to textbooks and papers, this monograph is meant as an entry point for an engineer into the literature on machine learning.
ISSN; DOI XXXXXXXX
© 2017 XXXXXXXX
Notation
• Random variables or random vectors – both abbreviated as rvs – are represented using roman typeface, while their values and realizations are indicated by the corresponding standard font. For instance, the equality $\mathrm{x} = x$ indicates that rv $\mathrm{x}$ takes value $x$.
• Matrices are indicated using uppercase fonts, with roman typeface used for random matrices.
• Vectors will be taken to be in column form.
• $X^T$ and $X^\dagger$ are the transpose and the pseudoinverse of matrix $X$, respectively.
• The distribution of a rv $\mathrm{x}$, either probability mass function (pmf) for a discrete rv or probability density function (pdf) for a continuous rv, is denoted as $p_{\mathrm{x}}$, $p_{\mathrm{x}}(x)$, or $p(x)$.
• The notation $\mathrm{x} \sim p_{\mathrm{x}}$ indicates that rv $\mathrm{x}$ is distributed according to $p_{\mathrm{x}}$.
• For jointly distributed rvs $(\mathrm{x},\mathrm{y}) \sim p_{\mathrm{xy}}$, the conditional distribution of $\mathrm{x}$ given the observation $\mathrm{y} = y$ is indicated as $p_{\mathrm{x}|\mathrm{y}=y}$, $p_{\mathrm{x}|\mathrm{y}}(x|y)$, or $p(x|y)$.
• The notation $\mathrm{x}|\mathrm{y} = y \sim p_{\mathrm{x}|\mathrm{y}=y}$ indicates that rv $\mathrm{x}$ is drawn according to the conditional distribution $p_{\mathrm{x}|\mathrm{y}=y}$.
• The notation $\mathrm{E}_{\mathrm{x}\sim p_{\mathrm{x}}}[\cdot]$ indicates the expectation of the argument with respect to the distribution of the rv $\mathrm{x} \sim p_{\mathrm{x}}$. Accordingly, we will also write $\mathrm{E}_{\mathrm{x}\sim p_{\mathrm{x}|\mathrm{y}}}[\cdot|y]$ for the conditional expectation with respect to the distribution $p_{\mathrm{x}|\mathrm{y}=y}$. When clear from the context, the distribution over which the expectation is computed may be omitted.
• The notation $\Pr_{\mathrm{x}\sim p_{\mathrm{x}}}[\cdot]$ indicates the probability of the argument event with respect to the distribution of the rv $\mathrm{x} \sim p_{\mathrm{x}}$. When clear from the context, the subscript is dropped.
• The notation $\log$ represents the logarithm in base two, while $\ln$ represents the natural logarithm.
• $\mathrm{x} \sim \mathcal{N}(\mu,\Sigma)$ indicates that random vector $\mathrm{x}$ is distributed according to a multivariate Gaussian pdf with mean vector $\mu$ and covariance matrix $\Sigma$. The multivariate Gaussian pdf is denoted as $\mathcal{N}(x|\mu,\Sigma)$ as a function of $x$.
• $\mathrm{x} \sim \mathcal{U}(a,b)$ indicates that rv $\mathrm{x}$ is distributed according to a uniform distribution in the interval $[a,b]$. The corresponding uniform pdf is denoted as $\mathcal{U}(x|a,b)$.
• $\delta(x)$ denotes the Dirac delta function or the Kronecker delta function, as clear from the context.
• $||a||^2 = \sum_{i=1}^N a_i^2$ is the quadratic, or $l_2$, norm of a vector $a = [a_1,\ldots,a_N]^T$. We similarly define the $l_1$ norm as $||a||_1 = \sum_{i=1}^N |a_i|$, and the $l_0$ pseudo-norm $||a||_0$ as the number of non-zero entries of vector $a$ (a brief worked example follows this list).
• $I$ denotes the identity matrix, whose dimensions will be clear from the context. Similarly, $1$ represents a vector of all ones.
• $\mathbb{R}$ is the set of real numbers; and $\mathbb{R}^+$ the set of non-negative real numbers.
• $1(\cdot)$ is the indicator function: $1(x) = 1$ if $x$ is true, and $1(x) = 0$ otherwise.
• $|\mathcal{S}|$ represents the cardinality of a set $\mathcal{S}$.
• $\mathrm{x}_{\mathcal{S}}$ represents a set of rvs $\mathrm{x}_k$ indexed by the integers $k \in \mathcal{S}$.
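As a quick worked illustration of the norm notation above – the vector is an arbitrary choice added here for concreteness, not taken from the original text – let $a = [3, 0, -4]^T$. Then

% illustrative example only: the three (pseudo-)norms defined above, evaluated for a = [3, 0, -4]^T
\begin{align*}
||a||^2 &= 3^2 + 0^2 + (-4)^2 = 25, \quad ||a|| = 5 && \text{($l_2$, or quadratic, norm)}\\
||a||_1 &= |3| + |0| + |-4| = 7 && \text{($l_1$ norm)}\\
||a||_0 &= 2 && \text{(number of non-zero entries).}
\end{align*}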
Acronyms
AI: Artificial Intelligence
AMP: Approximate Message Passing
BN: Bayesian Network
DAG: Directed Acyclic Graph
ELBO: Evidence Lower BOund
EM: Expectation Maximization
ERM: Empirical Risk Minimization
GAN: Generative Adversarial Network
GLM: Generalized Linear Model
HMM: Hidden Markov Model
i.i.d.: independent identically distributed
KL: Kullback-Leibler
LBP: Loopy Belief Propagation
LL: Log-Likelihood
LLR: Log-Likelihood Ratio
LS: Least Squares
MC: Monte Carlo
MCMC: Markov Chain Monte Carlo
MDL: Minimum Description Length
MFVI: Mean Field Variational Inference
ML: Maximum Likelihood
MRF: Markov Random Field
NLL: Negative Log-Likelihood
PAC: Probably Approximately Correct
pdf: probability density function
pmf: probability mass function
PCA: Principal Component Analysis
PPCA: Probabilistic Principal Component Analysis
QDA: Quadratic Discriminant Analysis
RBM: Restricted Boltzmann Machine
SGD: Stochastic Gradient Descent
SVM: Support Vector Machine
rv: random variable or random vector (depending on the context)
s.t.: subject to
VAE: Variational AutoEncoder
VC: Vapnik–Chervonenkis
VI: Variational Inference
1 Introduction
Having taught courses on machine learning, I am often asked by col-
leagues and students with a background in engineering to suggest “the
best place to start” to get into this subject. I typically respond with a
list of books – for a general, but slightly outdated introduction, read
this book; for a detailed survey of methods based on probabilistic mod-
els, check this other reference; to learn about statistical learning, I
found this text useful; and so on. This answer strikes me, and most
likely also my interlocutors, as quite unsatisfactory. This is especially
so since the size of many of these books may be discouraging for busy
professionals and students working on other projects. This monograph
is my first attempt to offer a basic and compact reference that de-
scribes key ideas and principles in simple terms and within a unified
treatment, encompassing also more recent developments and pointers
to the literature for further study.
1.1 Machine Learning
In engineering, pattern recognition refers to the automatic discovery
of regularities in data for decision-making, prediction or data mining.