Matrix Approximation for Large-scale Learning

by

Ameet Talwalkar

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
May 2010

Mehryar Mohri, Advisor

© Ameet Talwalkar, All Rights Reserved, 2010

For Aai and Baba

Acknowledgments

I would first like to thank my advisor, Mehryar Mohri, for his guidance throughout my doctoral studies. He gave me an opportunity to pursue a PhD, patiently taught me about the field of machine learning and guided me towards exciting research questions. He also introduced me to my mentors and collaborators at Google Research, Sanjiv Kumar and Corinna Cortes, both of whom have been tremendous role models for me throughout my studies.

I would also like to thank the final two members of my thesis committee, Dennis Shasha and Mark Tygert, as well as Subhash Khot, who sat on my DQE and thesis proposal, for their encouragement and helpful advice.

During my time at Courant and my summers at Google, I have had the good fortune to work and interact with several other exceptional people. In particular, I would like to thank Eugene Weinstein, Ameesh Makadia, Cyril Allauzen, Dejan Jovanović, Shaila Musharoff, Ashish Rastogi, Rosemary Amico, Michael Riley, Henry Rowley and Jeremy Shute for helping me along the way and making my studies and research more enjoyable over these past four years. I would especially like to thank my partner in crime, Afshin Rostamizadeh, for being a supportive officemate and a considerate friend throughout our countless hours working together.

Last, but not least, I would like to thank my friends and family for their unwavering support. In particular, I have consistently drawn strength from my lovely girlfriend Jessica, my brother Jaideep, my sister-in-law Kristen and the three cutest little men in the world, my nephews Kavi, Nayan and Dev. And to my parents, Rohini and Shrirang, to whom this thesis is dedicated, I am infinitely grateful. They are my sources of inspiration and my greatest teachers, and any achievement I may have is a credit to them. Thank you, Aai and Baba.

Abstract

Modern learning problems in computer vision, natural language processing, computational biology, and other areas are often based on large data sets of tens of thousands to millions of training instances. However, several standard learning algorithms, such as kernel-based algorithms, e.g., Support Vector Machines, Kernel Ridge Regression, Kernel PCA, do not easily scale to such orders of magnitude. This thesis focuses on sampling-based matrix approximation techniques that help scale kernel-based algorithms to large-scale datasets. We address several fundamental theoretical and empirical questions including:

1. What approximation should be used? We discuss two common sampling-based methods, providing novel theoretical insights regarding their suitability for various applications and experimental results motivated by this theory. Our results show that one of these methods, the Nyström method, is superior in the context of large-scale learning.

2. Do these approximations work in practice? We show the effectiveness of approximation techniques on a variety of problems. In the largest study to date for manifold learning, we use the Nyström method to extract low-dimensional structure from high-dimensional data to effectively cluster face images. We also report good empirical results for Kernel Ridge Regression and Kernel Logistic Regression.

3. How should we sample columns? A key aspect of sampling-based algorithms is the distribution according to which columns are sampled. We study both fixed and adaptive sampling schemes as well as a promising ensemble technique that can be easily parallelized and generates superior approximations, both in theory and in practice.

4. How well do these approximations work in theory? We provide theoretical analyses of the Nyström method to understand when this technique should be used. We present guarantees on approximation accuracy based on various matrix properties and analyze the effect of matrix approximation on actual kernel-based algorithms.

This work has important consequences for the machine learning community since it extends to large-scale applications the benefits of kernel-based algorithms. The crucial aspect of this research, involving low-rank matrix approximation, is of independent interest within the field of linear algebra.

Contents

Dedication
Acknowledgments
Abstract
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Related Work
  1.3 Contributions

2 Low Rank Approximations
  2.1 Preliminaries
    2.1.1 Notation
    2.1.2 Nyström method
    2.1.3 Column-sampling method
  2.2 Nyström vs Column-sampling
    2.2.1 Singular values and singular vectors
    2.2.2 Low-rank approximation
    2.2.3 Empirical comparison
  2.3 Summary

3 Applications
  3.1 Large-scale Manifold Learning
    3.1.1 Manifold learning
    3.1.2 Approximation experiments
    3.1.3 Large-scale learning
    3.1.4 Manifold evaluation
  3.2 Woodbury Approximation
    3.2.1 Nyström Logistic Regression
    3.2.2 Kernel Ridge Regression
  3.3 Summary

4 Sampling Schemes
  4.1 Fixed Sampling
    4.1.1 Datasets
    4.1.2 Experiments
  4.2 Adaptive Sampling
    4.2.1 Adaptive Nyström sampling
    4.2.2 Experiments
  4.3 Ensemble Sampling