A FIRST COURSE
IN MACHINE
LEARNING
Second Edition
Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series
SERIES EDITORS
Ralf Herbrich, Amazon Development Center, Berlin, Germany
Thore Graepel, Microsoft Research Ltd., Cambridge, UK
AIMS AND SCOPE
This series reflects the latest advances and applications in machine learning and pattern recognition
through the publication of a broad range of reference works, textbooks, and handbooks. The inclusion of
concrete examples, applications, and methods is highly encouraged. The scope of the series includes, but
is not limited to, titles in the areas of machine learning, pattern recognition, computational intelligence,
robotics, computational/statistical learning theory, natural language processing, computer vision, game
AI, game theory, neural networks, computational neuroscience, and other relevant topics, such as machine
learning applied to bioinformatics or cognitive science, which might be proposed by potential contributors.
PUBLISHED TITLES
BAYESIAN PROGRAMMING
Pierre Bessière, Emmanuel Mazer, Juan-Manuel Ahuactzin, and Kamel Mekhnacha
UTILITY-BASED LEARNING FROM DATA
Craig Friedman and Sven Sandow
HANDBOOK OF NATURAL LANGUAGE PROCESSING, SECOND EDITION
Nitin Indurkhya and Fred J. Damerau
COST-SENSITIVE MACHINE LEARNING
Balaji Krishnapuram, Shipeng Yu, and Bharat Rao
COMPUTATIONAL TRUST MODELS AND MACHINE LEARNING
Xin Liu, Anwitaman Datta, and Ee-Peng Lim
MULTILINEAR SUBSPACE LEARNING: DIMENSIONALITY REDUCTION OF
MULTIDIMENSIONAL DATA
Haiping Lu, Konstantinos N. Plataniotis, and Anastasios N. Venetsanopoulos
MACHINE LEARNING: An Algorithmic Perspective, Second Edition
Stephen Marsland
SPARSE MODELING: THEORY, ALGORITHMS, AND APPLICATIONS
Irina Rish and Genady Ya. Grabarnik
A FIRST COURSE IN MACHINE LEARNING, SECOND EDITION
Simon Rogers and Mark Girolami
STATISTICAL REINFORCEMENT LEARNING: MODERN MACHINE LEARNING APPROACHES
Masashi Sugiyama
MULTI-LABEL DIMENSIONALITY REDUCTION
Liang Sun, Shuiwang Ji, and Jieping Ye
REGULARIZATION, OPTIMIZATION, KERNELS, AND SUPPORT VECTOR MACHINES
Johan A. K. Suykens, Marco Signoretto, and Andreas Argyriou
ENSEMBLE METHODS: FOUNDATIONS AND ALGORITHMS
Zhi-Hua Zhou
Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series
A FIRST COURSE
IN MACHINE
LEARNING
Second Edition
Simon Rogers
University of Glasgow
United Kingdom
Mark Girolami
University of Warwick
United Kingdom
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does
not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks
of a particular pedagogical approach or particular use of the MATLAB® software.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20160524
International Standard Book Number-13: 978-1-4987-3856-9 (eBook - VitalBook)
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
List of Tables xv
List of Figures xvii
Preface to the First Edition xxvii
Preface to the Second Edition xxix
Section I Basic Topics
Chapter 1  Linear Modelling: A Least Squares Approach 3
1.1 LINEAR MODELLING 3
1.1.1 Defining the model 4
1.1.2 Modelling assumptions 5
1.1.3 Defining a good model 6
1.1.4 The least squares solution – a worked example 8
1.1.5 Worked example 12
1.1.6 Least squares fit to the Olympic data 13
1.1.7 Summary 14
1.2 MAKING PREDICTIONS 15
1.2.1 A second Olympic dataset 15
1.2.2 Summary 17
1.3 VECTOR/MATRIX NOTATION 17
1.3.1 Example 25
1.3.2 Numerical example 26
1.3.3 Making predictions 27
1.3.4 Summary 27
1.4 NON-LINEAR RESPONSE FROM A LINEAR MODEL 28
1.5 GENERALISATION AND OVER-FITTING 31
1.5.1 Validation data 31
1.5.2 Cross-validation 32
1.5.3 Computational scaling of K-fold cross-validation 34
1.6 REGULARISED LEAST SQUARES 34
1.7 EXERCISES 37
1.8 FURTHER READING 39
Chapter 2  Linear Modelling: A Maximum Likelihood Approach 41
2.1 ERRORS AS NOISE 41
2.1.1 Thinking generatively 42
2.2 RANDOM VARIABLES AND PROBABILITY 43
2.2.1 Random variables 43
2.2.2 Probability and distributions 44
2.2.3 Adding probabilities 46
2.2.4 Conditional probabilities 46
2.2.5 Joint probabilities 47
2.2.6 Marginalisation 49
2.2.7 Aside – Bayes’ rule 51
2.2.8 Expectations 52
2.3 POPULAR DISCRETE DISTRIBUTIONS 55
2.3.1 Bernoulli distribution 55
2.3.2 Binomial distribution 55
2.3.3 Multinomial distribution 56
2.4 CONTINUOUS RANDOM VARIABLES – DENSITY FUNCTIONS 57
2.5 POPULAR CONTINUOUS DENSITY FUNCTIONS 60
2.5.1 The uniform density function 60
2.5.2 The beta density function 62
2.5.3 The Gaussian density function 63
2.5.4 Multivariate Gaussian 64
2.6 SUMMARY 66
2.7 THINKING GENERATIVELY...CONTINUED 67
2.8 LIKELIHOOD 68
2.8.1 Dataset likelihood 69
2.8.2 Maximum likelihood 70
2.8.3 Characteristics of the maximum likelihood solution 73
2.8.4 Maximum likelihood favours complex models 75
2.9 THE BIAS-VARIANCE TRADE-OFF 75
2.9.1 Summary 76
2.10 EFFECT OF NOISE ON PARAMETER ESTIMATES 77
2.10.1 Uncertainty in estimates 78
2.10.2 Comparison with empirical values 83
2.10.3 Variability in model parameters – Olympic data 84
2.11 VARIABILITY IN PREDICTIONS 84
2.11.1 Predictive variability – an example 86
2.11.2 Expected values of the estimators 86
2.12 CHAPTER SUMMARY 91
2.13 EXERCISES 92
2.14 FURTHER READING 93
Chapter 3  The Bayesian Approach to Machine Learning 95
3.1 A COIN GAME 95
3.1.1 Counting heads 97
3.1.2 The Bayesian way 98
3.2 THE EXACT POSTERIOR 103
3.3 THE THREE SCENARIOS 104
3.3.1 No prior knowledge 104
3.3.2 The fair coin scenario 112
3.3.3 A biased coin 114
3.3.4 The three scenarios – a summary 116
3.3.5 Adding more data 117
3.4 MARGINAL LIKELIHOODS 117
3.4.1 Model comparison with the marginal likelihood 119
3.5 HYPERPARAMETERS 119
3.6 GRAPHICAL MODELS 120
3.7 SUMMARY 122
3.8 A BAYESIAN TREATMENT OF THE OLYMPIC 100m DATA 122
3.8.1 The model 122
3.8.2 The likelihood 124
3.8.3 The prior 124
3.8.4 The posterior 124
3.8.5 A first-order polynomial 126
3.8.6 Making predictions 129
3.9 MARGINAL LIKELIHOOD FOR POLYNOMIAL MODEL ORDER SELECTION 130
3.10 CHAPTER SUMMARY 133
3.11 EXERCISES 133
3.12 FURTHER READING 135
Chapter 4  Bayesian Inference 137
4.1 NON-CONJUGATE MODELS 137
4.2 BINARY RESPONSES 138
4.2.1 A model for binary responses 138
4.3 A POINT ESTIMATE – THE MAP SOLUTION 141
4.4 THE LAPLACE APPROXIMATION 147
4.4.1 Laplace approximation example: Approximating a gamma density 148
4.4.2 Laplace approximation for the binary response model 150
4.5 SAMPLING TECHNIQUES 152
4.5.1 Playing darts 152
4.5.2 The Metropolis–Hastings algorithm 154
4.5.3 The art of sampling 162
4.6 CHAPTER SUMMARY 163
4.7 EXERCISES 163
4.8 FURTHER READING 164
Chapter 5  Classification 167
5.1 THE GENERAL PROBLEM 167
5.2 PROBABILISTIC CLASSIFIERS 168
5.2.1 The Bayes classifier 168
5.2.1.1 Likelihood – class-conditional distributions 169
5.2.1.2 Prior class distribution 169
5.2.1.3 Example – Gaussian class-conditionals 170
5.2.1.4 Making predictions 171
5.2.1.5 The naive-Bayes assumption 172
5.2.1.6 Example – classifying text 174
5.2.1.7 Smoothing 176
5.2.2 Logistic regression 178
5.2.2.1 Motivation 178
5.2.2.2 Non-linear decision functions 179
5.2.2.3 Non-parametric models – the Gaussian process 180
5.3 NON-PROBABILISTIC CLASSIFIERS 181
5.3.1 K-nearest neighbours 181
5.3.1.1 Choosing K 182
5.3.2 Support vector machines and other kernel methods 185
5.3.2.1 The margin 185
5.3.2.2 Maximising the margin 186
5.3.2.3 Making predictions 189
5.3.2.4 Support vectors 189
5.3.2.5 Soft margins 191
5.3.2.6 Kernels 193
5.3.3 Summary 196
5.4 ASSESSING CLASSIFICATION PERFORMANCE 196
5.4.1 Accuracy – 0/1 loss 196
5.4.2 Sensitivity and specificity 197
5.4.3 The area under the ROC curve 198
5.4.4 Confusion matrices 200
5.5 DISCRIMINATIVE AND GENERATIVE CLASSIFIERS 202
5.6 CHAPTER SUMMARY 202
5.7 EXERCISES 202
5.8 FURTHER READING 203
Chapter 6  Clustering 205
6.1 THE GENERAL PROBLEM 205
6.2 K-MEANS CLUSTERING 206
6.2.1 Choosing the number of clusters 208
6.2.2 Where K-means fails 210
6.2.3 Kernelised K-means 210
6.2.4 Summary 212
6.3 MIXTURE MODELS 213