The Annals of Statistics
2008, Vol. 36, No. 6, 2577–2604
DOI: 10.1214/08-AOS600
© Institute of Mathematical Statistics, 2008
COVARIANCE REGULARIZATION BY THRESHOLDING
By Peter J. Bickel¹ and Elizaveta Levina²
University of California, Berkeley and University of Michigan

This paper considers regularizing a covariance matrix of p variables estimated from n observations, by hard thresholding. We show that the thresholded estimate is consistent in the operator norm as long as the true covariance matrix is sparse in a suitable sense, the variables are Gaussian or sub-Gaussian, and (log p)/n → 0, and obtain explicit rates. The results are uniform over families of covariance matrices which satisfy a fairly natural notion of sparsity. We discuss an intuitive resampling scheme for threshold selection and prove a general cross-validation result that justifies this approach. We also compare thresholding to other covariance estimators in simulations and on an example from climate data.
1. Introduction. Estimation of covariance matrices is important in a number of areas of statistical analysis, including dimension reduction by principal component analysis (PCA), classification by linear or quadratic discriminant analysis (LDA and QDA), establishing independence and conditional independence relations in the context of graphical models, and setting confidence intervals on linear functions of the means of the components. In recent years, many application areas where these tools are used have been dealing with very high-dimensional datasets, and sample sizes can be very small relative to dimension. Examples include genetic data, brain imaging, spectroscopic imaging, climate data and many others.
It is well known by now that the empirical covariance matrix for samples of size n from a p-variate Gaussian distribution, N_p(µ, Σ_p), is not a good estimator of the population covariance if p is large. Many results in random matrix theory illustrate this, from the classical Marčenko–Pastur law [29]
Received October 2007; revised February 2008.
¹Supported by NSF Grant DMS-06-05236.
²Supported by NSF Grants DMS-05-05424 and DMS-08-05798 and NSA Grant MSPF-04Y-120.
AMS 2000 subject classifications. Primary 62H12; secondary 62F12, 62G09.
Key words and phrases. Covariance estimation, regularization, sparsity, thresholding, large p small n, high dimension low sample size.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2008, Vol. 36, No. 6, 2577–2604. This reprint differs from the original in pagination and typographic detail.
to the more recent work of Johnstone and his students on the theory of the
largest eigenvalues [12, 23, 30] and associated eigenvectors [24]. However,
with the exception of a method for estimating the covariance spectrum [11],
these probabilistic results do not offer alternatives to the sample covariance
matrix.
Alternative estimators for large covariance matrices have therefore at-
tracted a lot of attention recently. Two broad classes of covariance estima-
tors have emerged: those that rely on a natural ordering among variables,
and assume that variables far apart in the ordering are only weakly corre-
lated, and those invariant to variable permutations. The first class includes
regularizing the covariance matrix by banding or tapering [2, 3, 17], which
we will discuss below. It also includes estimators based on regularizing the
Cholesky factor of the inverse covariance matrix. These methods use the
fact that the entries of the Cholesky factor have a regression interpreta-
tion, which allows application of regression regularization tools such as the
lasso and ridge penalties [21], or the nested lasso penalty [28] specifically de-
signed for the ordered variables situation. Banding the Cholesky factor has
also been proposed [3, 34]. These estimators are appropriate for a number
of applications with ordered data (time series, spectroscopy, climate data).
For climate applications and other spatial data, since there is no total order-
ing on the plane, applying the Cholesky factor methodology is problematic;
but as long as there is an appropriate metric on variable indexes (some-
times, simple geographical distance can be used), banding or tapering the
covariance matrix can be applied.
However, there are many applications, for example, gene expression ar-
rays, where there is no notion of distance between variables at all. These ap-
plications require estimators invariant under variable permutations. Shrink-
age estimators are in this category and have been proposed early on [7, 20].
More recently, Ledoit and Wolf [26] proposed an estimator where the op-
timal amount of shrinkage is estimated from data. Shrinkage estimators
shrink the overdispersed sample covariance eigenvalues, but they do not
change the eigenvectors, which are also inconsistent [24], and do not result
in sparse estimators. Several recent papers [5, 31, 35] construct a sparse
permutation-invariant estimate of the inverse of the covariance matrix, also
known as the concentration or precision matrix. Sparse concentration matri-
ces are of interest in graphical models, since zero partial correlations imply
a graph structure. The common approach of [5, 31, 35] is to add an L_1 (lasso) penalty on the entries of the concentration matrix to the normal likelihood, which results in shrinking some of the elements of the inverse
to zero. In [31], it was shown that this method has a rate of convergence
that is driven by (logp)/n and the sparsity of the truth. Computing this
estimator is nontrivial for high dimensions and can be achieved either via
a semidefinite programming algorithm [5, 35] or by using the Cholesky
decomposition to reparametrize the concentration matrix [31], but all of
these are computationally intensive. A faster algorithm that employs the
lasso was proposed by Friedman, Hastie and Tibshirani [16]. This approach
has also been extended to more general penalties like SCAD [15] by Lam
and Fan [25] and Fan, Fan and Lv [14]. In specific applications, there have
been other permutation-invariant approaches that use different notions of
sparsity: Zou, Hastie and Tibshirani [36] apply the lasso penalty to loadings
in PCA to achieve sparse representation; d’Aspremont et al. [6] compute
sparse principal components by semidefinite programming; Johnstone and
Lu [24] regularize PCA by moving to a sparse basis and thresholding; and
Fan, Fan and Lv [13] impose sparsity on the covariance via a factor model,
which is often appropriate in finance applications.
In this paper, we propose thresholding of the sample covariance matrix
as a simple and permutation-invariant method of covariance regulariza-
tion. This idea has been simultaneously and independently developed by
El Karoui [10], who studied it under a special notion of sparsity called β-
sparsity (see details in Section 2.4). Here we develop a natural permutation-
invariant notion of sparsity which, though more specialized than El Karoui's,
seems easier to analyze and parallels the treatment in [3] which defines a
class of models where banding is appropriate. Bickel and Levina [3] showed
that, uniformly over the class of approximately “bandable” matrices, the
banded estimator is consistent in the operator norm (also known as the ma-
trix 2-norm, or spectral norm) for Gaussian data as long as (logp)/n→0.
Here we show consistency of the thresholded estimator in the operator
norm, uniformly over the class of matrices that satisfy our notion of spar-
sity, as longas (logp)/n→0,andobtain explicitrates of convergence. There
arevariousargumentstoshowthatconvergenceintheoperatornormimplies
convergence of eigenvalues and eigenvectors [3, 10], so this norm is particu-
larly appropriate for PCA applications. The rate we obtain is slightly worse
than the rate of banding when the variables are ordered, but the difference is not sharp. This is expected, since in the situation when variables are ordered,
banding takes advantage of the underlying true structure. Thresholding, on
the other hand, is applicable to many more situations. In fact, our treat-
ment is in many respects similar to the pioneering work on thresholding of
Donoho and Johnstone [8] and the recent work of Johnstone and Silverman
[22] and Abramovich et al. [1].
The rest of this paper is organized as follows. In Section 2 we introduce
the thresholding estimator and our notion of sparsity, prove the convergence result and compare to results of El Karoui (Section 2.4) and to banding (Sec-
tion 2.5). In Section 3, we discuss a cross-validation approach to threshold
selection, which is novel in this context, and prove a cross-validation result of
general interest. Section 4 gives simulations comparing several permutation-
invariant estimators and banding. Section 5 gives an example of the thresholding
estimator applied to climate data and Section 6 gives a brief discussion. The
Appendix contains more technical proofs.
2. Asymptotic results for thresholding. We start by setting up notation. We write λ_max(M) = λ_1(M) ≥ ··· ≥ λ_p(M) = λ_min(M) for the eigenvalues of a matrix M. Following the notation of [3], we define, for any 0 ≤ r, s ≤ ∞ and a p×p matrix M,

(1) ‖M‖_{(r,s)} ≡ sup{‖Mx‖_s : ‖x‖_r = 1},

where ‖x‖_r^r = \sum_{j=1}^p |x_j|^r. In particular, we write ‖M‖ = ‖M‖_{(2,2)} for the operator norm, which for a symmetric matrix is given by

‖M‖ = max_{1≤j≤p} |λ_j(M)|.

For symmetric matrices, we have (see, e.g., [18])

(2) ‖M‖ ≤ (‖M‖_{(1,1)} ‖M‖_{(∞,∞)})^{1/2} = ‖M‖_{(1,1)} = max_j \sum_i |m_{ij}|.

We also use the Frobenius matrix norm,

‖M‖_F² = \sum_{i,j} m_{ij}² = tr(MM^T).
We define the thresholding operator by

(3) T_s(M) = [m_{ij} 1(|m_{ij}| ≥ s)],

which we refer to as M thresholded at s. Note that T_s preserves symmetry and is invariant under permutations of variable labels, but does not necessarily preserve positive definiteness. However, if
(4) ‖T_s(M) − M‖ ≤ ε and λ_min(M) > ε,

then T_s(M) is necessarily positive definite, since for all vectors v with ‖v‖_2 = 1 we have v^T T_s(M) v ≥ v^T M v − ε ≥ λ_min(M) − ε > 0.
2.1. A uniformity class of covariance matrices. Recall that the banding operator was defined in [3] as B_k(M) = [m_{ij} 1(|i−j| ≤ k)]. The uniformity class of "approximately bandable" covariance matrices is defined by

(5) U(ε_0, α, C) = {Σ : max_j \sum_i {|σ_{ij}| : |i−j| > k} ≤ C k^{−α} for all k > 0, and 0 < ε_0 ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ 1/ε_0}.
Here we define the uniformity class of covariance matrices invariant under permutations by

U_τ(q, c_0(p), M) = {Σ : σ_{ii} ≤ M, \sum_{j=1}^p |σ_{ij}|^q ≤ c_0(p), for all i},

for 0 ≤ q < 1. Thus, if q = 0,

U_τ(0, c_0(p), M) = {Σ : σ_{ii} ≤ M, \sum_{j=1}^p 1(σ_{ij} ≠ 0) ≤ c_0(p)},
a class of sparse matrices. We will mainly write c_0 for c_0(p) in the future. Note that

λ_max(Σ) ≤ max_i \sum_j |σ_{ij}| ≤ M^{1−q} c_0(p),

by the bound (2). Thus, if we define

U_τ(q, c_0(p), M, ε_0) = {Σ : Σ ∈ U_τ(q, c_0(p), M) and λ_min(Σ) ≥ ε_0 > 0},

we have a class analogous to (5).
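For intuition (our illustration, not from the paper), the quantity that must be bounded by c_0(p) for membership in U_τ(q, c_0(p), M) is the maximal row sum of |σ_{ij}|^q, which reduces to the maximal number of nonzeros per row when q = 0; a hypothetical helper:

import numpy as np

def utau_radius(Sigma, q):
    # Returns max_i sum_j |sigma_ij|^q; Sigma lies in U_tau(q, c0(p), M) when this
    # value is <= c0(p) and, in addition, max_i sigma_ii <= M.
    A = np.abs(Sigma)
    if q == 0:
        return (A > 0).sum(axis=1).max()    # sparsest case: nonzeros per row
    return (A ** q).sum(axis=1).max()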
Naturally, there is a class of covariance matrices that satisfies both banding and thresholding conditions. Define a subclass of U(ε_0, α, C) by

V(ε_0, α, C) = {Σ : |σ_{ij}| ≤ C|i−j|^{−(α+1)} for all i, j with |i−j| ≥ 1, and 0 < ε_0 ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ 1/ε_0},

for α > 0. Evidently,

V(ε_0, α, C) ⊂ U(ε_0, α, C_1)

for C_1 ≤ C(1 + 1/α).
On the other hand, Σ ∈ V(ε_0, α, C) implies

\sum_j |σ_{ij}|^q ≤ ε_0^{−q} + C (α+1)q/((α+1)q − 1),

so that for a suitable choice of c_0 and M,

V(ε_0, α, C) ⊂ U_τ(q, c_0, M)

for q > 1/(α+1).
2.2. Main result. Suppose we observe n i.i.d. p-dimensional observations X_1, ..., X_n distributed according to a distribution F, with EX = 0 (without loss of generality), and E(XX^T) = Σ. We define the empirical (sample) covariance matrix by

(6) Σ̂ = n^{−1} \sum_{k=1}^n (X_k − X̄)(X_k − X̄)^T,

where X̄ = n^{−1} \sum_{k=1}^n X_k, and write Σ̂ = [σ̂_{ij}].
We have the following result which parallels the banding result (Theorem 1) of Bickel and Levina [3].
Theorem 1. Suppose F is Gaussian. Then, uniformly on U_τ(q, c_0(p), M), for sufficiently large M′, if

(7) t_n = M′ √((log p)/n)

and (log p)/n = o(1), then

‖T_{t_n}(Σ̂) − Σ‖ = O_P(c_0(p) ((log p)/n)^{(1−q)/2})

and, uniformly on U_τ(q, c_0(p), M, ε_0),

‖(T_{t_n}(Σ̂))^{−1} − Σ^{−1}‖ = O_P(c_0(p) ((log p)/n)^{(1−q)/2}).
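A minimal numerical sketch (ours) of the estimator in Theorem 1: form the sample covariance (6) and threshold it at t_n = M′√((log p)/n) from (7). The constant M′ = 1 below is an arbitrary illustrative choice; the theorem only requires M′ sufficiently large, and Section 3 discusses choosing the threshold by cross-validation.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 200
Sigma = np.eye(p)                               # a trivially sparse truth, for illustration
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

S = np.cov(X, rowvar=False)                     # sample covariance as in (6)
t_n = 1.0 * np.sqrt(np.log(p) / n)              # threshold (7) with illustrative M' = 1
S_thr = np.where(np.abs(S) >= t_n, S, 0.0)      # T_{t_n}(Sigma-hat), cf. (3)

# Operator-norm losses: thresholded estimator vs. the raw sample covariance
print(np.linalg.norm(S_thr - Sigma, 2), np.linalg.norm(S - Sigma, 2))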
Proof. Recall that, without loss of generality, we assumed EX = 0. Begin with the decomposition

(8) Σ̂ = Σ̂^0 − X̄X̄^T,

where

Σ̂^0 ≡ [σ̂^0_{ij}] = n^{−1} \sum_{k=1}^n X_k X_k^T.

Note that, by (8),

(9) max_{i,j} |σ̂_{ij} − σ_{ij}| ≤ max_{i,j} |σ̂^0_{ij} − σ_{ij}| + max_{i,j} |X̄_i X̄_j|.

By a result of Saulis and Statulevičius [32] adapted for this context in Lemma 3 of [3], and σ_{ii} ≤ M for all i,

(10) P[max_{i,j} |σ̂^0_{ij} − σ_{ij}| ≥ t] ≤ p² C_1 e^{−C_2 n t²},
for |t| < δ, where C_1, C_2 and δ are constants depending only on M. In particular, (10) holds if t = o(1).
For the second term in (9), we have, by the union sum inequality, the Gaussian tail inequality and σ_{ii} ≤ M for all i,

(11) P[max_i |X̄_i|² ≥ t] ≤ p C_3 e^{−C_4 n t}.

Combining (10) and (11), we see that if (log p)/n → 0 and t = t_n is given by (7), then for M′ sufficiently large,

(12) max_{i,j} |σ̂_{ij} − σ_{ij}| = O_P(√((log p)/n)).
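As a quick numerical check of (12) (our illustration, not part of the proof), the maximal entrywise deviation of the sample covariance from the truth stays within a constant multiple of √((log p)/n) for Gaussian data:

import numpy as np

rng = np.random.default_rng(1)
for n, p in [(100, 50), (100, 500), (400, 500)]:
    X = rng.standard_normal((n, p))             # Sigma = I, so sigma_ij = 1(i = j)
    S0 = X.T @ X / n                            # Sigma-hat^0 from (8)
    dev = np.abs(S0 - np.eye(p)).max()          # max_{i,j} |sigma-hat^0_ij - sigma_ij|
    print(n, p, round(dev, 3), round(np.sqrt(np.log(p) / n), 3))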
We now recap an argument of Donoho and Johnstone [8]. Bound

‖T_t(Σ̂) − Σ‖ ≤ ‖T_t(Σ) − Σ‖ + ‖T_t(Σ̂) − T_t(Σ)‖.

The first term above is bounded by

(13) max_i \sum_{j=1}^p |σ_{ij}| 1(|σ_{ij}| ≤ t) ≤ t^{1−q} c_0(p).
On the other hand,

(14) ‖T_t(Σ̂) − T_t(Σ)‖ ≤ max_i \sum_{j=1}^p |σ̂_{ij}| 1(|σ̂_{ij}| ≥ t, |σ_{ij}| < t)
      + max_i \sum_{j=1}^p |σ_{ij}| 1(|σ̂_{ij}| < t, |σ_{ij}| ≥ t)
      + max_i \sum_{j=1}^p |σ̂_{ij} − σ_{ij}| 1(|σ̂_{ij}| ≥ t, |σ_{ij}| ≥ t)
      = I + II + III.
Using (12), we have

III ≤ max_{i,j} |σ̂_{ij} − σ_{ij}| max_i \sum_{j=1}^p |σ_{ij}|^q t^{−q} = O_P(c_0(p) t^{−q} √((log p)/n)).
To bound term I, write

(15) I ≤ max_i \sum_{j=1}^p |σ̂_{ij} − σ_{ij}| 1(|σ̂_{ij}| ≥ t, |σ_{ij}| < t) + max_i \sum_{j=1}^p |σ_{ij}| 1(|σ_{ij}| < t)
      ≤ IV + V.

By (13),

(16) V ≤ t^{1−q} c_0(p).
Now take γ ∈ (0,1). Then,

(17) IV ≤ max_i \sum_{j=1}^p |σ̂_{ij} − σ_{ij}| 1(|σ̂_{ij}| ≥ t, |σ_{ij}| ≤ γt)
       + max_i \sum_{j=1}^p |σ̂_{ij} − σ_{ij}| 1(|σ̂_{ij}| > t, γt < |σ_{ij}| < t)
     ≤ max_{i,j} |σ̂_{ij} − σ_{ij}| max_i N_i(1−γ) + c_0(p)(γt)^{−q} max_{i,j} |σ̂_{ij} − σ_{ij}|,

where N_i(a) ≡ \sum_{j=1}^p 1(|σ̂_{ij} − σ_{ij}| > at).
P
P maxN (1−γ)>0 =P max|σˆ −σ |>(1−γ)t
i ij ij
i i,j
(cid:20) (cid:21) (cid:20) (cid:21)
(18)
≤p2e nδ(1 γ)2t2,
− −
if t=o(1), uniformly on U. By (18) and (16), and 0<γ<1, if
(19) 2logp−nδt2→−∞,
then
logp
(20) IV=O c (p)t q
P 0 −
s n
(cid:18) (cid:19)
and, by (9) and (13),
logp
(21) I=O c (p)t q +c (p)t1 q .
P 0 − 0 −
s n
(cid:18) (cid:19)
For term II, we have

(22) II ≤ max_i \sum_{j=1}^p [|σ̂_{ij} − σ_{ij}| + |σ̂_{ij}|] 1(|σ̂_{ij}| < t, |σ_{ij}| ≥ t)
       ≤ max_{i,j} |σ̂_{ij} − σ_{ij}| max_i \sum_{j=1}^p 1(|σ_{ij}| ≥ t) + t max_i \sum_{j=1}^p 1(|σ_{ij}| ≥ t)
       = O_P(c_0(p) t^{−q} √((log p)/n) + c_0(p) t^{1−q}).
Combining (21) and (22) and choosing t as in (7) establishes the first claim of the theorem. The second claim follows since

‖[T_{t_n}(Σ̂)]^{−1} − Σ^{−1}‖ = Ω_P(‖T_{t_n}(Σ̂) − Σ‖)
uniformly on U_τ(q, c_0(p), M, ε_0), where A = Ω_P(B) means A = O_P(B) and B = O_P(A). □
Theorem 2. Suppose F is Gaussian. Then, uniformly on U_τ(q, c_0(p), M), if t = M′ √((log p)/n) and M′ is sufficiently large,

(23) (1/p) ‖T_t(Σ̂) − Σ‖_F² = O_P(c_0(p) ((log p)/n)^{1−q/2}).

An analogous result holds for the inverse on U_τ(q, c_0(p), M, ε_0). The proof of Theorem 2 is similar to the proof of Theorem 1 and can be found in the Appendix.
2.3. The non-Gaussian case. We consider two cases here. If, for some η > 0,

E e^{tX_{ij}²} ≤ K < ∞ for all |t| ≤ η, for all i, j,

then the proof goes through verbatim, since result (10) still holds. The bound on max_i |X̄_i|² will always be at least the squared rate of max_{i,j} |σ̂_{ij} − σ_{ij}|, hence we do not need normality for (11).
In the second case, if we have, for some γ > 0,

E|X_{ij}|^{2(1+γ)} ≤ K for all i, j,

then by Markov's inequality

(24) P[|σ̂_{ij} − σ_{ij}| ≥ t] ≤ K C(γ) n^{−(1+γ)/2} / t^{1+γ}.

Thus the bound (10) becomes

p² K C(γ) n^{−(1+γ)/2} / t^{1+γ}

and hence,

max_{i,j} |σ̂^0_{ij} − σ_{ij}| = O_P(p^{2/(1+γ)} / n^{1/2}).

Therefore, taking t_n = M p^{2/(1+γ)} / n^{1/2}, we find that

(25) ‖T_{t_n}(Σ̂) − Σ‖ = O_P(c_0(p) (p^{2/(1+γ)} / n^{1/2})^{1−q}),

which is, we expect, minimax, though this needs to be checked.
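To make the two threshold scalings concrete (our arithmetic, with the constants set to 1 for illustration): under only 2(1+γ) moments, the threshold preceding (25) is much larger than the Gaussian-case threshold (7) whenever p grows polynomially in n.

import numpy as np

n, p, gamma = 200, 1000, 3.0
t_gaussian = np.sqrt(np.log(p) / n)                  # (7), with M' = 1
t_moments = p ** (2.0 / (1.0 + gamma)) / np.sqrt(n)  # threshold before (25), with M = 1
print(t_gaussian, t_moments)                         # roughly 0.19 vs 2.24 here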
2.4. Comparison to thresholding results of El Karoui. El Karoui [10] shows as a special case that if:
(i) E|X_j|^r < ∞ for all r, 1 ≤ j ≤ p,
(ii) σ_{jj} ≤ M < ∞ for all j,
(iii) if σ_{ij} ≠ 0, then |σ_{ij}| > C n^{−α_0}, 0 < α_0 < 1/2 − δ_0 < 1/2,
(iv) Σ is β-sparse, β = 1/2 − η, η > 0,
(v) p/n → c ∈ (0, ∞),
then, if t_n = C n^{−α}, α = 1/2 − δ_0 > α_0,

(26) ‖T_{t_n}(Σ̂) − Σ‖ → 0 a.s.
El Karoui's notion of β-sparsity is such that our case q = 0 is β-sparse with c_0(p) = K p^β. Our results yield a rate of

O_P(p^{β + 2/(1+γ)} / n^{1/2})

for γ arbitrarily large. Since β < 1/2 by assumption and p ≍ n, we see that our result implies (26) under (i), (ii), (iv), (v) and our notion of sparsity.
Thus, our result is stronger than his in the all moments case, again under
our stronger notion of sparsity. El Karoui’s full result, in fact, couples a
maximal value of r in (i) with the largest possible value of β. Unfortunately, this coupling involves (iii) which we do not require. Nevertheless, his result implies the corresponding consistency results of ours, if (iii) is ignored, when
only existence of a finite set of moments is assumed. However, according to
El Karoui (personal communication), (iii) is not needed for (26) in the case
when our sparsity condition holds.
2.5. Comparison to banding results of Bickel and Levina. Comparison is readily possible on V(ε_0, α, C). By Theorem 1 of [3] the best rate achievable using banding is

O_P(((log p)/n)^{α/(2(α+1))}).
On the other hand, by our Theorem 1, thresholding yields

O_P(((log p)/n)^{(1−q)/2})
for q > 1/(α+1). Comparing exponents, we see that banding is slightly better in the situation where labels are meaningful, since we must have

1 − q < α/(α+1).

However, since 1 − q can be arbitrarily close to α/(α+1), the difference is not sharp. Not surprisingly, as α → ∞, the genuinely sparse case, the bounds both approach ((log p)/n)^{1/2}.
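For example (our arithmetic): take α = 1 in V(ε_0, α, C). Banding attains exponent α/(2(α+1)) = 1/4, while thresholding requires q > 1/(α+1) = 1/2 and so attains exponent (1−q)/2 < 1/4, approaching 1/4 as q decreases to 1/2.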
n