Matrix Approximation for Large-scale Learning
by
Ameet Talwalkar
A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
May 2010
Mehryar Mohri—Advisor
© Ameet Talwalkar
All Rights Reserved, 2010
For Aai and Baba
Acknowledgments
I would first like to thank my advisor, Mehryar Mohri, for his guidance
throughout my doctoral studies. He gave me an opportunity to pursue a
PhD, patiently taught me about the field of machine learning and guided me
towards exciting research questions. He also introduced me to my mentors and
collaborators at Google Research, Sanjiv Kumar and Corinna Cortes, both of
whom have been tremendous role models for me throughout my studies. I
would also like to thank the final two members of my thesis committee, Den-
nis Shasha and Mark Tygert, as well as Subhash Khot, who sat on my DQE
and thesis proposal, for their encouragement and helpful advice.
During my time at Courant and my summers at Google, I have had the
good fortune to work and interact with several other exceptional people. In
particular, I would like to thank Eugene Weinstein, Ameesh Makadia, Cyril
Allauzen, Dejan Jovanović, Shaila Musharoff, Ashish Rastogi, Rosemary Amico,
Michael Riley, Henry Rowley and Jeremy Shute for helping me along the way
and making my studies and research more enjoyable over these past four years.
I would especially like to thank my partner in crime, Afshin Rostamizadeh, for
being a supportive officemate and a considerate friend throughout our count-
less hours working together.
Last, but not least, I would like to thank my friends and family for their
unwavering support. In particular, I have consistently drawn strength from
my lovely girlfriend Jessica, my brother Jaideep, my sister-in-law Kristen and
the three cutest little men in the world, my nephews Kavi, Nayan and Dev.
And to my parents, Rohini and Shrirang, to whom this thesis is dedicated,
I am infinitely grateful. They are my sources of inspiration and my greatest
teachers, and any achievement I may have is a credit to them. Thank you,
Aai and Baba.
Abstract
Modern learning problems in computer vision, natural language processing,
computational biology, and other areas are often based on large data sets
of tens of thousands to millions of training instances. However, several stan-
dard learning algorithms, such as kernel-based algorithms, e.g., Support Vector
Machines, Kernel Ridge Regression and Kernel PCA, do not easily scale to such
orders of magnitude. This thesis focuses on sampling-based matrix approxima-
tion techniques that help scale kernel-based algorithms to large-scale datasets.
We address several fundamental theoretical and empirical questions including:
1. What approximation should be used? We discuss two common sampling-
based methods, providing novel theoretical insights regarding their suit-
ability for various applications and experimental results motivated by
this theory. Our results show that one of these methods, the Nyström
method, is superior in the context of large-scale learning.
2. Do these approximations work in practice? We show the effectiveness of
approximation techniques on a variety of problems. In the largest study
to date for manifold learning, we use the Nyström method to extract low-
dimensional structure from high-dimensional data to effectively cluster
face images. We also report good empirical results for Kernel Ridge
Regression and Kernel Logistic Regression.
3. How should we sample columns? A key aspect of sampling-based algo-
rithms is the distribution according to which columns are sampled. We
study both fixed and adaptive sampling schemes as well as a promising
ensemble technique that can be easily parallelized and generates superior
approximations, both in theory and in practice.
4. How well do these approximations work in theory? We provide theoret-
ical analyses of the Nyström method to understand when this technique
should be used. We present guarantees on approximation accuracy based
on various matrix properties and analyze the effect of matrix approxi-
mation on actual kernel-based algorithms.
This work has important consequences for the machine learning commu-
nity since it extends to large-scale applications the benefits of kernel-based
algorithms. The core technique underlying this research, low-rank matrix
approximation, is also of independent interest within the field of linear algebra.
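
As a brief illustration of the sampling-based approach summarized above, the following is a minimal sketch of the Nyström approximation in Python with NumPy. It assumes a precomputed symmetric positive semidefinite kernel matrix K and uniform column sampling; the function name, toy kernel, and parameter choices are illustrative only and are not drawn from the experiments in this thesis.

    import numpy as np

    def nystrom_approximation(K, l, seed=0):
        """Rank-l Nystrom approximation of a symmetric PSD matrix K,
        built from l uniformly sampled columns: K ~= C pinv(W) C^T."""
        rng = np.random.default_rng(seed)
        n = K.shape[0]
        idx = rng.choice(n, size=l, replace=False)  # sampled column indices
        C = K[:, idx]                               # n x l block of sampled columns
        W = K[np.ix_(idx, idx)]                     # l x l intersection block
        return C @ np.linalg.pinv(W) @ C.T

    # Toy example: Gaussian (RBF) kernel matrix on random points
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / 2.0)
    K_approx = nystrom_approximation(K, l=40)
    rel_err = np.linalg.norm(K - K_approx, "fro") / np.linalg.norm(K, "fro")
    print(f"relative Frobenius error: {rel_err:.4f}")

Because only the n x l block C and the l x l block W are needed, the method never has to decompose the full n x n kernel matrix, which is the source of the scalability gains discussed in the chapters that follow.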
Contents
Dedication
Acknowledgments
Abstract
List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Related Work
1.3 Contributions

2 Low Rank Approximations
2.1 Preliminaries
2.1.1 Notation
2.1.2 Nyström method
2.1.3 Column-sampling method
2.2 Nyström vs. Column-sampling
2.2.1 Singular values and singular vectors
2.2.2 Low-rank approximation
2.2.3 Empirical comparison
2.3 Summary

3 Applications
3.1 Large-scale Manifold Learning
3.1.1 Manifold learning
3.1.2 Approximation experiments
3.1.3 Large-scale learning
3.1.4 Manifold evaluation
3.2 Woodbury Approximation
3.2.1 Nyström Logistic Regression
3.2.2 Kernel Ridge Regression
3.3 Summary

4 Sampling Schemes
4.1 Fixed Sampling
4.1.1 Datasets
4.1.2 Experiments
4.2 Adaptive Sampling
4.2.1 Adaptive Nyström sampling
4.2.2 Experiments
4.3 Ensemble Sampling