Approximation of Positive Semidefinite Matrices
Using the Nyström Method
A dissertation presented
by
Nicholas Arcolano
to
The School of Engineering and Applied Sciences
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the subject of
Applied Mathematics
Harvard University
Cambridge, Massachusetts
May 2011
© 2011 - Nicholas Arcolano
All rights reserved.
Thesis advisor: Patrick J. Wolfe
Author: Nicholas Arcolano
Approximation of Positive Semidefinite Matrices
Using the Nyström Method
Abstract
Positive semidefinite matrices arise in a variety of fields, including statistics, signal
processing, and machine learning. Unfortunately, when these matrices are high-dimensional
and/or must be operated upon many times, expensive calculations such as the spectral
decomposition quickly become a computational bottleneck.
A common alternative is to replace the original positive semidefinite matrices with low-rank
approximations whose spectral decompositions can be more easily computed. In this
thesis, we develop approaches based on the Nyström method, which approximates a
positive semidefinite matrix using a data-dependent orthogonal projection. As the Nyström
approximation is conditioned on a given principal submatrix of its argument, it essentially
recasts low-rank approximation as a subset selection problem.
We begin by deriving the Nyström approximation and developing a number of fundamental
results, including new characterizations of its spectral properties and approximation error.
We then address the problem of subset selection through a study of randomized sampling
algorithms. We provide new bounds for the approximation error under uniformly random
sampling, as well as bounds for two new data-dependent sampling methods. We continue
by extending these results to random positive definite matrices, deriving statistics for the
approximation error of matrices with Wishart and beta distributions, as well as for a broader
class of orthogonally invariant and residual independent matrices.
Once this theoretical foundation has been established, we turn to practical applications
of Nyström methods. We explore new exact and approximate sampling methods for
randomized subset selection, and develop greedy approaches for subset optimization. We
conclude by developing the Nyström approximation as a low-rank covariance estimator that
provides for computationally efficient spectral analysis while shrinking the eigenvalues of
the sample covariance. After deriving expressions for its bias and mean squared error, we
illustrate the effectiveness of the Nyström covariance estimator through empirical examples
in adaptive beamforming and image denoising.
Contents

1 Introduction
  1.1 The Low-Rank Approximation Problem
  1.2 Example Application: Approximate Spectral Clustering
  1.3 Contributions of This Dissertation

2 Review of Selected Topics
  2.1 Preliminaries
  2.2 Eigenvalues and Singular Values
  2.3 Special Classes of Matrices
  2.4 Special Functions of Matrices
  2.5 Unitarily Invariant Norms
  2.6 Matrix-Variate Distributions
  2.7 Summary

3 The Nyström Method
  3.1 Historical Background
  3.2 Nyström Method for Matrix Approximation
  3.3 Properties of the Nyström Approximation
  3.4 Summary

4 Randomized Methods for Nyström Subset Selection
  4.1 Uniform Sampling
  4.2 Data-Dependent Sampling
  4.3 Summary

5 Nyström Approximation of Random Matrices
  5.1 Wishart Random Matrices
  5.2 Beta Random Matrices
  5.3 Orthogonally Invariant and Residual Independent Random Matrices
  5.4 Summary

6 Algorithms for Nyström Approximation
  6.1 Existing Methods for Subset Selection
  6.2 Randomized Methods
  6.3 Optimization Methods
  6.4 Subset Oversampling
  6.5 Empirical Comparisons
  6.6 Summary

7 Low-Rank Covariance Estimation
  7.1 Review of Covariance Estimation
  7.2 The Nyström Covariance Estimator
  7.3 Example Application: Adaptive Beamforming
  7.4 Example Application: Image Denoising
  7.5 Summary

8 Conclusion
  8.1 Summary of Contributions
  8.2 Prominent Themes
  8.3 Future Directions

Bibliography
Acknowledgments
First and foremost, I would like to thank the United States Air Force∗ for their generous
sponsorship of my research through the MIT Lincoln Scholars Program. Second, I would
like to thank my thesis advisor, Patrick Wolfe. A good advisor must act as both advocate
and critic, nurturing and mentoring his students while pushing them toward the limits of
their potential. Professor Wolfe balanced these roles well, and my technical and professional
development benefited greatly from his guidance. I also wish to thank the other members of
my thesis committee—Edoardo Airoldi and Roger Brockett—for their helpful suggestions
and comments.
In addition, I would like to acknowledge the many other students, colleagues, and
supervisors who have assisted me during my time at Harvard, including the members of
the Harvard Statistics and Information Sciences Laboratory (SISL), my former co-workers
and supervisors from Lincoln Laboratory’s Group 36 (BMDS Integration), and my new co-
workers and supervisors from Group 102 (Embedded and High Performance Computing).
In particular, I wish to thank my friend and colleague Daniel Rudoy for engaging me in
many hours of helpful research discussions, as well as my Lincoln Scholars Mentor Keh-Ping
Dunn, who has provided me with a great deal of professional guidance over the years.
On a personal note, I would like to thank my friends and family for their support
and kindness, especially my mother Cheryl. Throughout my life, many opportunities I
encountered were possible only because of great effort and sacrifice on her part. I also wish
to thank her for being a devoted grandparent to my beloved son Maxwell, always drawing
from the same boundless reserves of patience and selflessness she used when raising me.
Above all else, I am truly grateful to my wonderful wife Joy for her tireless encouragement
and unrelenting affection. As a loving spouse and mother, a brilliant teacher, and a talented
artist, she continues to be a source of personal and professional inspiration. All of my
proudest achievements are owed in no small part to her influence.
∗This work is sponsored by the United States Air Force under contract FA8721-05-C-0002. Opinions,
interpretations, recommendations, and conclusions are those of the author and are not necessarily endorsed
by the United States Government.
In regard to the resources needed to support a community of scholars, my favorite author
once wrote: “It takes forty men with their feet on the ground to keep one man with his
head in the air.” Once again, I extend my deepest thanks to all of the people whose help
has allowed me to soar through the lofty heights of academia these past four years.
For my family.
Notation
R               field of real numbers
C               field of complex numbers
x, y, z         (column) vectors
x_i, [x]_i      i-th element of vector x
R^n             set of all real-valued n-length column vectors
C^n             set of all complex-valued n-length column vectors
A, B, C         matrices
a_ij, [A]_ij    (i,j)-th element of matrix A
A^T             matrix transpose
A^H             matrix conjugate transpose (Hermitian transpose)
I_n             n×n identity matrix
rank(A)         rank of matrix A
R(A)            range space (column space) of matrix A
dim(S)          dimension of subspace S
span(X)         vector span of set of vectors X
A_IJ            submatrix of A specified by index sets I and J
A_I             principal submatrix of A specified by I; equivalent to A_II
x ≻ y           x majorizes y
x ≻_w y         x weakly majorizes y
σ_i(A)          i-th singular value of matrix A (in nonincreasing order)
λ_i(S)          i-th (real) eigenvalue of symmetric matrix S (in nonincreasing order)
A^+             Moore-Penrose pseudoinverse of matrix A
A/A_I           Schur complement of A_I in A
C_k(A)          k-th compound of a matrix A
∼               distributed as; used to relate a random variable to its distribution
=^D             equal in distribution
I(·)            indicator function
S*_k            optimal k-th order approximation of S
Ŝ(I)            Nyström approximation of S given I
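To make the final entry above concrete, here is a minimal numerical sketch (not part of the original text) of computing Ŝ(I), assuming the standard Nyström construction Ŝ(I) = S_{:,I} S_I^+ (S_{:,I})^T, which the thesis derives in Chapter 3; the function name and example data are illustrative only.

```python
import numpy as np

def nystrom_approximation(S, I):
    """Nystrom approximation S_hat(I) of a PSD matrix S, given index set I.

    Assumes the standard construction S[:, I] @ pinv(S[I, I]) @ S[I, :].
    """
    I = np.asarray(I)
    C = S[:, I]                       # columns of S indexed by I
    W = S[np.ix_(I, I)]               # principal submatrix S_I
    return C @ np.linalg.pinv(W) @ C.T  # C.T equals S[I, :] since S is symmetric

# Example: a random positive semidefinite matrix and a 3-element index set.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
S = X @ X.T                           # S is positive semidefinite
S_hat = nystrom_approximation(S, [0, 2, 4])
print(np.linalg.norm(S - S_hat))      # Frobenius-norm approximation error
```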
Chapter 1
Introduction
Data is an essential part of engineering and the applied sciences. Nearly every technical
field involves the collection, processing, and analysis of data in some form, and as the
availability and power of computational resources continue to improve, so do our capabilities
for performing these tasks.
In many fields, however, our ability to acquire data far outpaces our capacity to operate on
it. Today, it is common to find applications involving data sets of size and dimensionality
ranging from several thousand to millions (or even larger), such as gene identification in
bioinformatics [1,2], portfolio optimization in finance [3], and community detection in
large-scale graphs and networks [4,5]. For these applications, performing even the most basic
analyses requires a substantial amount of computation.
Fortunately, “data” is not always synonymous with “information”, and we often believe
(or at least hope) that a collection of data vectors can be described by a mechanism much
simpler than might be suggested by its raw size and dimension. If we assume this mechanism
to be linear, one of the most powerful notions of simplicity involves the linear-algebraic
concept of rank.
Consider a p × n matrix X, whose columns are a collection of data vectors {x_1, ..., x_n} ⊂ R^p.
The rank of X, denoted rank(X), is equal to the size of the largest linearly independent set
of its columns. Consequently, rank(X) = k ≤ min(p, n) implies that there exists a set of k
vectors {b_1, ..., b_k} ⊂ R^p such that

    x_j = \sum_{i=1}^{k} a_{ij} b_i,

where {a_ij} is a set of coefficients for i = 1, ..., k and j = 1, ..., n.
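As a small numerical illustration (a sketch added here, not from the original text), one can build such a rank-k matrix directly from its factors and confirm its rank; the dimensions below are arbitrary.

```python
import numpy as np

# Illustrative only: construct a p x n data matrix X of rank k, so that every
# column x_j is a linear combination of the k basis vectors b_1, ..., b_k.
rng = np.random.default_rng(1)
p, n, k = 6, 10, 3
B = rng.standard_normal((p, k))   # columns of B are the basis vectors b_i
A = rng.standard_normal((k, n))   # entries a_ij are the combination coefficients
X = B @ A                         # column j of X equals sum_i a_ij * b_i

print(np.linalg.matrix_rank(X))   # prints 3 (= k), with probability one
```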