A Brief Introduction to Machine Learning for Engineers

Osvaldo Simeone (2017), “A Brief Introduction to Machine Learning for Engineers”, Vol. XX, No. XX, pp. 1–201. DOI: XXX.

Osvaldo Simeone
Department of Informatics
King’s College London
[email protected]

Contents

1 Introduction
  1.1 Machine Learning
  1.2 Goals and Outline
2 A Gentle Introduction through Linear Regression
  2.1 Supervised Learning
  2.2 Inference
  2.3 Frequentist Approach
  2.4 Bayesian Approach
  2.5 Minimum Description Length (MDL)
  2.6 Interpretation and Causality
  2.7 Information-Theoretic Metrics
  2.8 Summary
3 Probabilistic Models for Learning
  3.1 The Exponential Family
  3.2 Maximum Entropy Property
  3.3 Frequentist Learning
  3.4 Bayesian Learning
  3.5 Energy-based Models
  3.6 Supervised Learning via Generalized Linear Models (GLM)
  3.7 Summary
4 Classification
  4.1 Classification as a Supervised Learning Problem
  4.2 Stochastic Gradient Descent
  4.3 Discriminative Deterministic Models
  4.4 Discriminative Probabilistic Models
  4.5 Generative Probabilistic Models
  4.6 Multi-Class Classification
  4.7 Non-linear Discriminative Models: Deep Neural Networks
  4.8 Boosting
  4.9 Summary
5 Statistical Learning Theory
  5.1 A Formal Framework for Supervised Learning
  5.2 PAC Learnability and Sample Complexity
  5.3 PAC Learnability for Finite Hypothesis Classes
  5.4 VC Dimension and Fundamental Theorem of PAC Learning
  5.5 Summary
6 Unsupervised Learning
  6.1 Unsupervised Learning
  6.2 K-Means Clustering
  6.3 ML, ELBO and EM
  6.4 Directed Generative Models
  6.5 Undirected Generative Models
  6.6 Discriminative Models
  6.7 Autoencoders
  6.8 Ranking
  6.9 Summary
7 Probabilistic Graphical Models
  7.1 Introduction
  7.2 Bayesian Networks
  7.3 Markov Random Fields
  7.4 Bayesian Inference in Probabilistic Graphical Models
  7.5 Summary
8 Approximate Inference and Learning
  8.1 Monte Carlo Methods
  8.2 Variational Inference
  8.3 Monte Carlo-Based Variational Inference
  8.4 Approximate Learning
  8.5 Summary
9 Concluding Remarks
Appendices
A Appendix A: Information Measures
  A.1 Entropy
  A.2 Conditional Entropy and Mutual Information
  A.3 Divergence Measures
B Appendix B: KL Divergence and Exponential Family
Acknowledgements
References

A Brief Introduction to Machine Learning for Engineers

Osvaldo Simeone
Department of Informatics, King’s College London; [email protected]

ABSTRACT

This monograph aims at providing an introduction to key concepts, algorithms, and theoretical frameworks in machine learning, including supervised and unsupervised learning, statistical learning theory, probabilistic graphical models and approximate inference. The intended readership consists of electrical engineers with a background in probability and linear algebra. The treatment builds on first principles, and organizes the main ideas according to clearly defined categories, such as discriminative and generative models, frequentist and Bayesian approaches, exact and approximate inference, directed and undirected models, and convex and non-convex optimization. The mathematical framework uses information-theoretic measures as a unifying tool. The text offers simple and reproducible numerical examples providing insights into key motivations and conclusions. Rather than providing exhaustive details on the existing myriad solutions in each specific category, for which the reader is referred to textbooks and papers, this monograph is meant as an entry point for an engineer into the literature on machine learning.

Notation

• Random variables or random vectors – both abbreviated as rvs – are represented using roman typeface, while their values and realizations are indicated by the corresponding standard font. For instance, the equality $\mathrm{x} = x$ indicates that rv $\mathrm{x}$ takes value $x$.

• Matrices are indicated using uppercase fonts, with roman typeface used for random matrices.

• Vectors will be taken to be in column form.

• $X^T$ and $X^{\dagger}$ are the transpose and the pseudoinverse of matrix $X$, respectively.

• The distribution of a rv $\mathrm{x}$, either probability mass function (pmf) for a discrete rv or probability density function (pdf) for a continuous rv, is denoted as $p_{\mathrm{x}}$, $p_{\mathrm{x}}(x)$, or $p(x)$.

• The notation $\mathrm{x} \sim p_{\mathrm{x}}$ indicates that rv $\mathrm{x}$ is distributed according to $p_{\mathrm{x}}$.

• For jointly distributed rvs $(\mathrm{x},\mathrm{y}) \sim p_{\mathrm{xy}}$, the conditional distribution of $\mathrm{x}$ given the observation $\mathrm{y} = y$ is indicated as $p_{\mathrm{x}|\mathrm{y}=y}$, $p_{\mathrm{x}|\mathrm{y}}(x|y)$, or $p(x|y)$.

• The notation $\mathrm{x}|\mathrm{y} = y \sim p_{\mathrm{x}|\mathrm{y}=y}$ indicates that rv $\mathrm{x}$ is drawn according to the conditional distribution $p_{\mathrm{x}|\mathrm{y}=y}$.

• The notation $\mathrm{E}_{\mathrm{x} \sim p_{\mathrm{x}}}[\cdot]$ indicates the expectation of the argument with respect to the distribution of the rv $\mathrm{x} \sim p_{\mathrm{x}}$. Accordingly, we will also write $\mathrm{E}_{\mathrm{x} \sim p_{\mathrm{x}|\mathrm{y}}}[\cdot|y]$ for the conditional expectation with respect to the distribution $p_{\mathrm{x}|\mathrm{y}=y}$. When clear from the context, the distribution over which the expectation is computed may be omitted.
• The notation $\mathrm{Pr}_{\mathrm{x} \sim p_{\mathrm{x}}}[\cdot]$ indicates the probability of the argument event with respect to the distribution of the rv $\mathrm{x} \sim p_{\mathrm{x}}$. When clear from the context, the subscript is dropped.

• The notation $\log$ represents the logarithm in base two, while $\ln$ represents the natural logarithm.

• $\mathrm{x} \sim \mathcal{N}(\mu,\Sigma)$ indicates that random vector $\mathrm{x}$ is distributed according to a multivariate Gaussian pdf with mean vector $\mu$ and covariance matrix $\Sigma$. The multivariate Gaussian pdf is denoted as $\mathcal{N}(x|\mu,\Sigma)$ as a function of $x$.

• $\mathrm{x} \sim \mathcal{U}(a,b)$ indicates that rv $\mathrm{x}$ is distributed according to a uniform distribution in the interval $[a,b]$. The corresponding uniform pdf is denoted as $\mathcal{U}(x|a,b)$.

• $\delta(x)$ denotes the Dirac delta function or the Kronecker delta function, as clear from the context.

• $\|a\|^2 = \sum_{i=1}^{N} a_i^2$ is the quadratic, or $\ell_2$, norm of a vector $a = [a_1,\ldots,a_N]^T$. We similarly define the $\ell_1$ norm as $\|a\|_1 = \sum_{i=1}^{N} |a_i|$, and the $\ell_0$ pseudo-norm $\|a\|_0$ as the number of non-zero entries of vector $a$.

• $I$ denotes the identity matrix, whose dimensions will be clear from the context. Similarly, $1$ represents a vector of all ones.

• $\mathbb{R}$ is the set of real numbers, and $\mathbb{R}^+$ is the set of non-negative real numbers.

• $1(\cdot)$ is the indicator function: $1(x) = 1$ if $x$ is true, and $1(x) = 0$ otherwise.

• $|\mathcal{S}|$ represents the cardinality of a set $\mathcal{S}$.

• $\mathrm{x}_{\mathcal{S}}$ represents a set of rvs $\mathrm{x}_k$ indexed by the integers $k \in \mathcal{S}$.

Acronyms

AI: Artificial Intelligence
AMP: Approximate Message Passing
BN: Bayesian Network
DAG: Directed Acyclic Graph
ELBO: Evidence Lower BOund
EM: Expectation Maximization
ERM: Empirical Risk Minimization
GAN: Generative Adversarial Network
GLM: Generalized Linear Model
HMM: Hidden Markov Model
i.i.d.: independent identically distributed
KL: Kullback-Leibler
LBP: Loopy Belief Propagation
LL: Log-Likelihood
LLR: Log-Likelihood Ratio
LS: Least Squares
MC: Monte Carlo
MCMC: Markov Chain Monte Carlo
MDL: Minimum Description Length
MFVI: Mean Field Variational Inference
ML: Maximum Likelihood
MRF: Markov Random Field
NLL: Negative Log-Likelihood
PAC: Probably Approximately Correct
pdf: probability density function
pmf: probability mass function
PCA: Principal Component Analysis
PPCA: Probabilistic Principal Component Analysis
QDA: Quadratic Discriminant Analysis
RBM: Restricted Boltzmann Machine
SGD: Stochastic Gradient Descent
SVM: Support Vector Machine
rv: random variable or random vector (depending on the context)
s.t.: subject to
VAE: Variational AutoEncoder
VC: Vapnik–Chervonenkis
VI: Variational Inference

1 Introduction

Having taught courses on machine learning, I am often asked by colleagues and students with a background in engineering to suggest “the best place to start” to get into this subject. I typically respond with a list of books – for a general, but slightly outdated introduction, read this book; for a detailed survey of methods based on probabilistic models, check this other reference; to learn about statistical learning, I found this text useful; and so on. This answer strikes me, and most likely also my interlocutors, as quite unsatisfactory. This is especially so since the size of many of these books may be discouraging for busy professionals and students working on other projects. This monograph is my first attempt to offer a basic and compact reference that describes key ideas and principles in simple terms and within a unified treatment, encompassing also more recent developments and pointers to the literature for further study.
1.1 Machine Learning

In engineering, pattern recognition refers to the automatic discovery of regularities in data for decision-making, prediction or data mining.
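As a minimal, hypothetical preview of the regression problem studied in Chapter 2, the short Python sketch below fits a polynomial model to noisy observations by Least Squares (LS), using the matrix pseudoinverse introduced in the Notation section. The data set, model order and numerical values are illustrative assumptions and are not taken from the text.

# Illustrative sketch (not from the monograph): polynomial regression by
# Least Squares, previewing the supervised learning problem of Chapter 2.
# All numerical values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# N_tr noisy training examples t = sin(2*pi*x) + noise (a toy regression task).
N_tr = 10
x = rng.uniform(0.0, 1.0, size=N_tr)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N_tr)

# Feature matrix for a degree-3 polynomial model: phi(x) = [1, x, x^2, x^3].
X = np.vander(x, N=4, increasing=True)

# LS solution for the weight vector via the pseudoinverse of X.
w_ls = np.linalg.pinv(X) @ t

# Prediction at a new input.
x_new = 0.25
phi_new = np.array([1.0, x_new, x_new**2, x_new**3])
print("LS weights:", w_ls, "prediction at x = 0.25:", phi_new @ w_ls)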