
Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations

Matthias Seeger
Doctor of Philosophy
Institute for Adaptive and Neural Computation
Division of Informatics
University of Edinburgh
2005

Abstract

Non-parametric models and techniques enjoy a growing popularity in the field of machine learning, and among these Bayesian inference for Gaussian process (GP) models has recently received significant attention. We feel that GP priors should be part of the standard toolbox for constructing models relevant to machine learning in the same way as parametric linear models are, and the results in this thesis help to remove some obstacles on the way towards this goal.

In the first main chapter, we provide a distribution-free finite sample bound on the difference between generalisation and empirical (training) error for GP classification methods. While the general theorem (the PAC-Bayesian bound) is not new, we give a much simplified and somewhat generalised derivation and point out the underlying core technique (convex duality) explicitly. Furthermore, the application to GP models is novel (to our knowledge). A central feature of this bound is that its quality depends crucially on task knowledge being encoded faithfully in the model and prior distributions, so there is a mutual benefit between a sharp theoretical guarantee and empirically well-established statistical practices. Extensive simulations on real-world classification tasks indicate an impressive tightness of the bound, in spite of the fact that many previous bounds for related kernel machines fail to give non-trivial guarantees in this practically relevant regime.

In the second main chapter, sparse approximations are developed to address the problem of the unfavourable scaling of most GP techniques with large training sets. Due to its high importance in practice, this problem has received a lot of attention recently. We demonstrate the tractability and usefulness of simple greedy forward selection with information-theoretic criteria previously used in active learning (or sequential design) and develop generic schemes for automatic model selection with many (hyper)parameters. We suggest two new generic schemes and evaluate some of their variants on large real-world classification and regression tasks. These schemes and their underlying principles (which are clearly stated and analysed) can be applied to obtain sparse approximations for a wide regime of GP models far beyond the special cases we studied here.
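For orientation, the display below is a minimal sketch of the PAC-Bayesian bound for Gibbs classifiers in the form that Chapter 3 builds on; the exact statement, conditions and GP-specific version are those given in the thesis, so read this as an assumed illustration rather than the theorem as proved there.

    % Assumed sketch: with probability at least 1 - \delta over an i.i.d. sample S
    % of size n, simultaneously for all posteriors Q over classifiers,
    \[
      \mathrm{D}\bigl[\hat{L}_S(Q) \,\big\|\, L(Q)\bigr]
      \;\le\; \frac{\mathrm{KL}[Q \,\|\, P] + \ln\frac{n+1}{\delta}}{n},
      \qquad
      \mathrm{D}[q \,\|\, p] = q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p},
    \]
    % where P is the data-independent prior, \hat{L}_S(Q) the empirical error of the
    % Gibbs classifier drawn from Q on S, and L(Q) its generalisation error.

Using the binary relative entropy D on the left-hand side, rather than a plain difference, is what keeps bounds of this kind informative when the empirical error is small.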
Acknowledgements

During the course of my PhD studies I have been fortunate enough to benefit from inspiring interactions with people in the machine learning community, and I am delighted to be able to acknowledge these here. My thanks go first of all to my supervisor Christopher Williams, who guided me through my research, and whose comments often sparked my deeper interest in directions which I would have ignored otherwise, or forced me to reconsider intuitive argumentations more carefully. I have also benefitted from his wide overview of relevant literature in many areas of machine learning I have been interested in. I would like to thank the members of the ANC group in Edinburgh for many inspiring discussions and interesting talks covering a wide range of topics. Especially, I would like to thank Amos Storkey and David Barber for having shared their knowledge and pointing me to interesting work in areas beyond the necessarily narrow scope of this thesis. I gratefully acknowledge support through a research studentship from Microsoft Research Ltd.

The "random walk" of my postgraduate studies led me into terrain which was quite unfamiliar to me, and it would have been a much harder and certainly much more boring journey without some people I met on the way. Ralf Herbrich shared some of his deep and pragmatic insights into data-dependent distribution-free bounds and other areas of learning theory and gave many helpful comments, and I very much enjoyed the time I spent working with him and Hugo Zaragoza during an internship at MS Research, Cambridge in the autumn of 2000, not least the late-night pub sessions in the Eagle. I am grateful for many discussions with Neil Lawrence, Michael Tipping, Bernhard Schölkopf and Antonio Criminisi who were there at that time.

I was fortunate enough to work with John Langford, one of whose diverse interests in learning theory is applications and refinements of the PAC-Bayesian theorem. I have learned a lot from his pragmatic approach to learning-theoretical problems, and I would enjoy doing further joint work with him in the future. I am very grateful to Manfred Opper for sharing some of his enormous knowledge about learning-theoretical analyses of Bayesian procedures. Manfred got interested in my work, parts of which reminded him of studies of learning curves for Bayesian methods he had done some time ago with David Haussler, and he invited me to Aston University, Birmingham for a few days. Our discussions there were invaluable in that they helped me to abstract from the details and see the big picture behind the PAC-Bayesian technique, recognising for the first time the huge versatility of convex inequalities. I would also like to thank David McAllester for proposing and proving the remarkable PAC-Bayesian theorems in the first place, and for some very interesting discussions when we met at Windsor, UK and at NIPS.

My interest in sparse Gaussian process approximations was sparked by work with Chris Williams (see Section 4.7.1), but also by my frustration with the running time and memory consumption of my experiments with the PAC-Bayesian theorem. My goal was to provide a PAC result which is practically useful, but of course this calls for a method which practitioners can really use on large datasets. I got interested in the informative vector machine (IVM) presented by Neil Lawrence and Ralf Herbrich in a NIPS workshop contribution and worked out some generalisations and details, such as the expectation propagation (EP) foundation and the randomisation strategy, so that my implementation would be able to handle datasets of the size required for the PAC-Bayesian experiments. I enjoyed this collaboration (which led to more joint work with Neil Lawrence).

I would like to thank John Platt and Christopher Burges for giving me the opportunity to spend the summer of 2002 doing research at Microsoft, Redmond. I enjoyed the work with Chris and would have liked to spend more time with John, who, unfortunately for me but very fortunately for him, became a father at that time. I would also like to thank Patrice Simard for some interesting discussions and for many great football ("soccer") matches (in the end, "youth and stamina" prevailed once more over "experience and wisdom").
Thanks also to Alex Salcianu for joining me on some great hiking trips around Mt. Rainier, for fishing me out of Lake Washington after an unexpected canoeing manoeuvre, and for his persistence in getting the photos of our mind-boggling white water rafting day. I hope we meet again for some hiking, although then I would prefer to do the driving!

Finally, I would like to thank the friends I have made during my time in this beautiful town of Edinburgh. During my first year, Esben and I explored many of the historical sites in this fascinating country, and he gave me a very lively account of truths and myths in Scottish history. I had a great time with Thomas and Björn, who are now out to build artificial worms and to save the rain forest. The amazing scenery and remoteness of the Scottish highlands sparked my interest in hiking and mountain walking, and I would like to thank Karsten, Björn, David, Steve and many others for great experiences and views in this vast wilderness. I am indebted to Emanuela for sharing these years with me, and although we are going separate ways now, I wish her all the best in the future. My special thanks go to my mother and sisters who supported me through these years, especially during the final, hardest one. This thesis is written in memory of my father.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Matthias Seeger)

Table of Contents

1 Introduction
  1.1 Declaration of Previous Work, Collaborations
    1.1.1 Publications during Postgraduate Studies
2 Background
  2.1 Bayesian Gaussian Processes
    2.1.1 Gaussian Processes: The Process and the Weight Space View
    2.1.2 Some Gaussian Process Models
    2.1.3 Approximate Inference and Learning
    2.1.4 Reproducing Kernel Hilbert Spaces
    2.1.5 Penalised Likelihood. Spline Smoothing
    2.1.6 Maximum Entropy Discrimination. Large Margin Classifiers
    2.1.7 Kriging
    2.1.8 Choice of Kernel. Kernel Design
  2.2 Learning Theory
    2.2.1 Probably Approximately Correct
    2.2.2 Concentration Inequalities
    2.2.3 Vapnik-Chervonenkis Theory
    2.2.4 Using PAC Bounds for Model Selection
3 PAC-Bayesian Bounds for Gaussian Process Methods
  3.1 Data-dependent PAC Bounds and PAC-Bayesian Theorems
    3.1.1 The Need for Data-dependent Bounds
    3.1.2 Bayesian Classifiers. PAC-Bayesian Theorems
  3.2 The PAC-Bayesian Theorem for Gibbs Classifiers
    3.2.1 The Binary Classification Case
    3.2.2 Confusion Distributions and Multiple Classes
    3.2.3 Comments
    3.2.4 The Case of General Bounded Loss
    3.2.5 An Extension to the Bayes Classifier
    3.2.6 Some Speculative Extensions
  3.3 Application to Gaussian Process Classification
    3.3.1 PAC-Bayesian Theorem for GP Classification
    3.3.2 Laplace Gaussian Process Classification
    3.3.3 Sparse Greedy Gaussian Process Classification
    3.3.4 Minimum Relative Entropy Discrimination
  3.4 Related Work
    3.4.1 The Theorem of Meir and Zhang
  3.5 Experiments
    3.5.1 The Setup MNIST2/3
    3.5.2 Experiments with Laplace GPC
    3.5.3 Experiments with Sparse Greedy GPC
    3.5.4 Comparison with PAC Compression Bound
    3.5.5 Using the Bounds for Model Selection
    3.5.6 Comparing Gibbs and Bayes Bounds
  3.6 Discussion
4 Sparse Gaussian Process Methods
  4.1 Introduction
  4.2 Likelihood Approximations and Greedy Selection Criteria
    4.2.1 Likelihood Approximations
    4.2.2 Greedy Selection Criteria
  4.3 Expectation Propagation for Gaussian Process Models
  4.4 Sparse Gaussian Process Methods: Conditional Inference
    4.4.1 The Informative Vector Machine
    4.4.2 Projected Latent Variables
  4.5 Model Selection
    4.5.1 Model Selection for PLV
    4.5.2 Model Selection for IVM
    4.5.3 Details about Optimisation
  4.6 Related Work
  4.7 Addenda
    4.7.1 Nyström Approximations
    4.7.2 The Importance of Estimating Predictive Variances
  4.8 Experiments
    4.8.1 IVM Classification: Digit Recognition
    4.8.2 PLV Regression: Robot Arm Dynamics
    4.8.3 IVM Regression: Robot Arm Dynamics
    4.8.4 IVM Classification: Digit Recognition II
  4.9 Discussion
5 Conclusions and Future Work
  5.1 PAC-Bayesian Bounds for Gaussian Process Methods
    5.1.1 Suggestions for Future Work
  5.2 Sparse Gaussian Process Methods
    5.2.1 Suggestions for Future Work
A General Appendix
  A.1 Notation
    A.1.1 Linear Algebra
    A.1.2 Probability. Miscellaneous
  A.2 Linear Algebra. Useful Formulae
    A.2.1 Partitioned Matrix Inverses. Woodbury Formula. Schur Complements
    A.2.2 Update of Cholesky Decomposition
    A.2.3 Some Useful Formulae
  A.3 Convex Functions
  A.4 Exponential Families. Gaussians
    A.4.1 Exponential Families
    A.4.2 I-Projections
    A.4.3 Gaussian Variables
  A.5 Pattern Recognition
  A.6 Bayesian Inference and Approximations
    A.6.1 Probabilistic Modelling
    A.6.2 Bayesian Analysis
    A.6.3 Approximations to Bayesian Inference
    A.6.4 Lower Bound Maximisation. Expectation Maximisation
  A.7 Large Deviation Inequalities
B Appendix for Chapter 3
  B.1 Extended PAC-Bayesian Theorem: an Example
  B.2 Details of Proof of Theorem 3.2
  B.3 Proof of Theorem 3.3
  B.4 Efficient Evaluation of the Laplace GP Gibbs Classifier
  B.5 Proof of Theorem 3.4
    B.5.1 The Case of Regression
  B.6 Proof of a PAC Compression Bound
    B.6.1 Examples of compression schemes
C Appendix for Chapter 4
  C.1 Expectation Propagation
    C.1.1 Expectation Propagation for Exponential Families
    C.1.2 Marginal Likelihood Approximation
    C.1.3 ADF Update for some Noise Models
  C.2 Likelihood Approximations. Selection Criteria
    C.2.1 Optimal Sparse Likelihood Approximations
    C.2.2 Relaxed Likelihood Approximations
    C.2.3 Cheap versus Expensive Selection Criteria
  C.3 Derivations for the Informative Vector Machine
    C.3.1 Update of the Representation
    C.3.2 Exchange Moves
    C.3.3 Model Selection Criterion and Gradient
  C.4 Derivations for Projected Latent Variables
    C.4.1 Site Approximation Updates. Point Inclusions
    C.4.2 Information Gain Criterion
    C.4.3 Extended Information Gain Criterion
    C.4.4 Gradient of Model Selection Criterion
Bibliography
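As a rough illustration of the greedy forward selection with information-theoretic criteria described in the abstract and developed in Chapter 4, the following Python sketch selects an active set for GP regression by repeatedly including the point with the largest differential-entropy (information-gain) score. It is not the thesis's IVM/PLV code: the kernel, the score and the dense rank-one posterior update are generic stand-ins chosen for clarity, and the names rbf_kernel, greedy_active_set, noise_var and d are hypothetical.

    import numpy as np

    def rbf_kernel(X, Y, lengthscale=1.0):
        # Squared-exponential covariance matrix; a simple stand-in kernel.
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    def greedy_active_set(X, noise_var=0.1, d=50):
        # Greedy forward selection for sparse GP regression: at each step,
        # include the point whose noisy observation would shrink the posterior
        # most, i.e. maximise 0.5 * log(1 + posterior_variance_i / noise_var).
        n = X.shape[0]
        A = rbf_kernel(X, X)              # posterior covariance, initially the prior
        active, remaining = [], list(range(n))
        for _ in range(min(d, n)):
            var = np.array([A[i, i] for i in remaining])
            score = 0.5 * np.log1p(var / noise_var)     # information gain
            j = remaining[int(np.argmax(score))]
            # Condition on a noisy observation at point j (rank-one update).
            a_j = A[:, j].copy()
            A -= np.outer(a_j, a_j) / (A[j, j] + noise_var)
            active.append(j)
            remaining.remove(j)
        return active

    # Example: pick 20 of 500 random points.
    # idx = greedy_active_set(np.random.randn(500, 4), noise_var=0.1, d=20)

The dense n-by-n covariance kept here is only for readability; the sparse schemes of Chapter 4 maintain comparable information in a low-rank representation precisely to avoid this memory and time cost.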
