The Annals of Statistics
2008, Vol. 36, No. 6, 2577–2604
DOI: 10.1214/08-AOS600
© Institute of Mathematical Statistics, 2008

COVARIANCE REGULARIZATION BY THRESHOLDING

By Peter J. Bickel¹ and Elizaveta Levina²

University of California, Berkeley and University of Michigan

This paper considers regularizing a covariance matrix of $p$ variables estimated from $n$ observations, by hard thresholding. We show that the thresholded estimate is consistent in the operator norm as long as the true covariance matrix is sparse in a suitable sense, the variables are Gaussian or sub-Gaussian, and $(\log p)/n \to 0$, and obtain explicit rates. The results are uniform over families of covariance matrices which satisfy a fairly natural notion of sparsity. We discuss an intuitive resampling scheme for threshold selection and prove a general cross-validation result that justifies this approach. We also compare thresholding to other covariance estimators in simulations and on an example from climate data.

Received October 2007; revised February 2008.
¹Supported by NSF Grant DMS-06-05236.
²Supported by NSF Grants DMS-05-05424 and DMS-08-05798 and NSA Grant MSPF-04Y-120.
AMS 2000 subject classifications. Primary 62H12; secondary 62F12, 62G09.
Key words and phrases. Covariance estimation, regularization, sparsity, thresholding, large p small n, high dimension low sample size.

1. Introduction. Estimation of covariance matrices is important in a number of areas of statistical analysis, including dimension reduction by principal component analysis (PCA), classification by linear or quadratic discriminant analysis (LDA and QDA), establishing independence and conditional independence relations in the context of graphical models, and setting confidence intervals on linear functions of the means of the components. In recent years, many application areas where these tools are used have been dealing with very high-dimensional datasets, and sample sizes can be very small relative to dimension. Examples include genetic data, brain imaging, spectroscopic imaging, climate data and many others.

It is well known by now that the empirical covariance matrix for samples of size $n$ from a $p$-variate Gaussian distribution, $N_p(\mu, \Sigma_p)$, is not a good estimator of the population covariance if $p$ is large. Many results in random matrix theory illustrate this, from the classical Marčenko–Pastur law [29] to the more recent work of Johnstone and his students on the theory of the largest eigenvalues [12, 23, 30] and associated eigenvectors [24]. However, with the exception of a method for estimating the covariance spectrum [11], these probabilistic results do not offer alternatives to the sample covariance matrix.

Alternative estimators for large covariance matrices have therefore attracted a lot of attention recently. Two broad classes of covariance estimators have emerged: those that rely on a natural ordering among variables, and assume that variables far apart in the ordering are only weakly correlated, and those invariant to variable permutations. The first class includes regularizing the covariance matrix by banding or tapering [2, 3, 17], which we will discuss below.
It also includes estimators based on regularizing the Cholesky factor of the inverse covariance matrix. These methods use the fact that the entries of the Cholesky factor have a regression interpretation, which allows application of regression regularization tools such as the lasso and ridge penalties [21], or the nested lasso penalty [28] specifically designed for the ordered-variables situation. Banding the Cholesky factor has also been proposed [3, 34]. These estimators are appropriate for a number of applications with ordered data (time series, spectroscopy, climate data). For climate applications and other spatial data, since there is no total ordering on the plane, applying the Cholesky factor methodology is problematic; but as long as there is an appropriate metric on variable indexes (sometimes, simple geographical distance can be used), banding or tapering the covariance matrix can be applied.

However, there are many applications, for example, gene expression arrays, where there is no notion of distance between variables at all. These applications require estimators invariant under variable permutations. Shrinkage estimators are in this category and have been proposed early on [7, 20]. More recently, Ledoit and Wolf [26] proposed an estimator where the optimal amount of shrinkage is estimated from data. Shrinkage estimators shrink the overdispersed sample covariance eigenvalues, but they do not change the eigenvectors, which are also inconsistent [24], and do not result in sparse estimators. Several recent papers [5, 31, 35] construct a sparse permutation-invariant estimate of the inverse of the covariance matrix, also known as the concentration or precision matrix. Sparse concentration matrices are of interest in graphical models, since zero partial correlations imply a graph structure. The common approach of [5, 31, 35] is to add an $L_1$ (lasso) penalty on the entries of the concentration matrix to the normal likelihood, which results in shrinking some of the elements of the inverse to zero. In [31], it was shown that this method has a rate of convergence that is driven by $(\log p)/n$ and the sparsity of the truth. Computing this estimator is nontrivial for high dimensions and can be achieved either via a semidefinite programming algorithm [5, 35] or by using the Cholesky decomposition to reparametrize the concentration matrix [31], but all of these are computationally intensive. A faster algorithm that employs the lasso was proposed by Friedman, Hastie and Tibshirani [16]. This approach has also been extended to more general penalties like SCAD [15] by Lam and Fan [25] and Fan, Fan and Lv [14]. In specific applications, there have been other permutation-invariant approaches that use different notions of sparsity: Zou, Hastie and Tibshirani [36] apply the lasso penalty to loadings in PCA to achieve sparse representation; d'Aspremont et al. [6] compute sparse principal components by semidefinite programming; Johnstone and Lu [24] regularize PCA by moving to a sparse basis and thresholding; and Fan, Fan and Lv [13] impose sparsity on the covariance via a factor model, which is often appropriate in finance applications.

In this paper, we propose thresholding of the sample covariance matrix as a simple and permutation-invariant method of covariance regularization. This idea has been simultaneously and independently developed by El Karoui [10], who studied it under a special notion of sparsity called $\beta$-sparsity (see details in Section 2.4).
Here we develop a natural permutation-invariant notion of sparsity which, though more specialized than El Karoui's, seems easier to analyze and parallels the treatment in [3], which defines a class of models where banding is appropriate. Bickel and Levina [3] showed that, uniformly over the class of approximately "bandable" matrices, the banded estimator is consistent in the operator norm (also known as the matrix 2-norm, or spectral norm) for Gaussian data as long as $(\log p)/n \to 0$. Here we show consistency of the thresholded estimator in the operator norm, uniformly over the class of matrices that satisfy our notion of sparsity, as long as $(\log p)/n \to 0$, and obtain explicit rates of convergence. There are various arguments to show that convergence in the operator norm implies convergence of eigenvalues and eigenvectors [3, 10], so this norm is particularly appropriate for PCA applications. The rate we obtain is slightly worse than the rate of banding when the variables are ordered, but the difference is not sharp. This is expected, since in the situation when variables are ordered, banding takes advantage of the underlying true structure. Thresholding, on the other hand, is applicable to many more situations. In fact, our treatment is in many respects similar to the pioneering work on thresholding of Donoho and Johnstone [8] and the recent work of Johnstone and Silverman [22] and Abramovich et al. [1].

The rest of this paper is organized as follows. In Section 2 we introduce the thresholding estimator and our notion of sparsity, prove the convergence result and compare to results of El Karoui (Section 2.4) and to banding (Section 2.5). In Section 3, we discuss a cross-validation approach to threshold selection, which is novel in this context, and prove a cross-validation result of general interest. Section 4 gives simulations comparing several permutation-invariant estimators and banding. Section 5 gives an example of the thresholding estimator applied to climate data, and Section 6 gives a brief discussion. The Appendix contains more technical proofs.

2. Asymptotic results for thresholding. We start by setting up notation. We write $\lambda_{\max}(M) = \lambda_1(M) \ge \cdots \ge \lambda_p(M) = \lambda_{\min}(M)$ for the eigenvalues of a matrix $M$. Following the notation of [3], we define, for any $0 \le r, s \le \infty$ and a $p \times p$ matrix $M$,

(1)  $\|M\|_{(r,s)} \equiv \sup\{\|Mx\|_s : \|x\|_r = 1\},$

where $\|x\|_r^r = \sum_{j=1}^p |x_j|^r$. In particular, we write $\|M\| = \|M\|_{(2,2)}$ for the operator norm, which for a symmetric matrix is given by

$\|M\| = \max_{1 \le j \le p} |\lambda_j(M)|.$

For symmetric matrices, we have (see, e.g., [18])

(2)  $\|M\| \le (\|M\|_{(1,1)}\|M\|_{(\infty,\infty)})^{1/2} = \|M\|_{(1,1)} = \max_j \sum_i |m_{ij}|.$

We also use the Frobenius matrix norm,

$\|M\|_F^2 = \sum_{i,j} m_{ij}^2 = \operatorname{tr}(MM^T).$

We define the thresholding operator by

(3)  $T_s(M) = [m_{ij}1(|m_{ij}| \ge s)],$

which we refer to as $M$ thresholded at $s$. Note that $T_s$ preserves symmetry and is invariant under permutations of variable labels, but does not necessarily preserve positive definiteness. However, if

(4)  $\|T_s(M) - M\| \le \varepsilon \quad \text{and} \quad \lambda_{\min}(M) > \varepsilon,$

then $T_s(M)$ is necessarily positive definite, since for all vectors $v$ with $\|v\|_2 = 1$ we have

$v^T T_s(M)v \ge v^T M v - \varepsilon \ge \lambda_{\min}(M) - \varepsilon > 0.$

2.1. A uniformity class of covariance matrices. Recall that the banding operator was defined in [3] as $B_k(M) = [m_{ij}1(|i - j| \le k)]$. The uniformity class of "approximately bandable" covariance matrices is defined by

(5)  $U(\varepsilon_0, \alpha, C) = \{\Sigma : \max_j \sum_i \{|\sigma_{ij}| : |i - j| > k\} \le Ck^{-\alpha} \text{ for all } k > 0, \text{ and } 0 < \varepsilon_0 \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le 1/\varepsilon_0\}.$
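To make the two operators concrete, the following is a minimal sketch of the hard-thresholding operator $T_s$ in (3) and the banding operator $B_k$ of [3]. It is an illustration of ours rather than code from the paper, and it assumes NumPy; the function names are hypothetical.

```python
import numpy as np

def hard_threshold(M, s):
    """Hard-thresholding operator T_s(M) = [m_ij 1(|m_ij| >= s)] from (3)."""
    # Keep entries whose absolute value is at least s; set the rest to zero.
    return M * (np.abs(M) >= s)

def band(M, k):
    """Banding operator B_k(M) = [m_ij 1(|i - j| <= k)] from [3]."""
    p = M.shape[0]
    i, j = np.indices((p, p))
    return M * (np.abs(i - j) <= k)

# Example: threshold and band a sample covariance matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))          # n = 100 observations of p = 20 variables
Sigma_hat = np.cov(X, rowvar=False)         # sample covariance (uses 1/(n-1) rather than the 1/n of (6))
Sigma_thr = hard_threshold(Sigma_hat, 0.3)  # permutation-invariant
Sigma_band = band(Sigma_hat, 2)             # requires a meaningful ordering of the variables
```

Note that $T_s$ acts entrywise, so it is invariant under any relabeling of the variables, whereas $B_k$ depends on the ordering of the indexes.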
Here we define the uniformity class of covariance matrices invariant under permutations by

$U_\tau(q, c_0(p), M) = \left\{\Sigma : \sigma_{ii} \le M,\ \sum_{j=1}^p |\sigma_{ij}|^q \le c_0(p), \text{ for all } i\right\},$

for $0 \le q < 1$. Thus, if $q = 0$,

$U_\tau(0, c_0(p), M) = \left\{\Sigma : \sigma_{ii} \le M,\ \sum_{j=1}^p 1(\sigma_{ij} \ne 0) \le c_0(p)\right\},$

a class of sparse matrices. We will mainly write $c_0$ for $c_0(p)$ in the future. Note that

$\lambda_{\max}(\Sigma) \le \max_i \sum_j |\sigma_{ij}| \le M^{1-q}c_0(p),$

by the bound (2). Thus, if we define

$U_\tau(q, c_0(p), M, \varepsilon_0) = \{\Sigma : \Sigma \in U_\tau(q, c_0(p), M) \text{ and } \lambda_{\min}(\Sigma) \ge \varepsilon_0 > 0\},$

we have a class analogous to (5).

Naturally, there is a class of covariance matrices that satisfies both the banding and thresholding conditions. Define a subclass of $U(\varepsilon_0, \alpha, C)$ by

$V(\varepsilon_0, \alpha, C) = \{\Sigma : |\sigma_{ij}| \le C|i - j|^{-(\alpha+1)} \text{ for all } i, j \text{ with } |i - j| \ge 1, \text{ and } 0 < \varepsilon_0 \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le 1/\varepsilon_0\}$

for $\alpha > 0$. Evidently,

$V(\varepsilon_0, \alpha, C) \subset U(\varepsilon_0, \alpha, C_1)$

for $C_1 \le C(1 + 1/\alpha)$. On the other hand, $\Sigma \in V(\varepsilon_0, \alpha, C)$ implies

$\sum_j |\sigma_{ij}|^q \le \varepsilon_0^{-q} + C^q\frac{(\alpha+1)q}{(\alpha+1)q - 1},$

so that for a suitable choice of $c_0$ and $M$,

$V(\varepsilon_0, \alpha, C) \subset U_\tau(q, c_0, M)$

for $q > \frac{1}{\alpha+1}$.

2.2. Main result. Suppose we observe $n$ i.i.d. $p$-dimensional observations $X_1, \ldots, X_n$ distributed according to a distribution $F$, with $EX = 0$ (without loss of generality) and $E(XX^T) = \Sigma$. We define the empirical (sample) covariance matrix by

(6)  $\hat{\Sigma} = \frac{1}{n}\sum_{k=1}^n (X_k - \bar{X})(X_k - \bar{X})^T,$

where $\bar{X} = n^{-1}\sum_{k=1}^n X_k$, and write $\hat{\Sigma} = [\hat{\sigma}_{ij}]$.

We have the following result, which parallels the banding result (Theorem 1) of Bickel and Levina [3].

Theorem 1. Suppose $F$ is Gaussian. Then, uniformly on $U_\tau(q, c_0(p), M)$, for sufficiently large $M'$, if

(7)  $t_n = M'\sqrt{\frac{\log p}{n}}$

and $\frac{\log p}{n} = o(1)$, then

$\|T_{t_n}(\hat{\Sigma}) - \Sigma\| = O_P\left(c_0(p)\left(\frac{\log p}{n}\right)^{(1-q)/2}\right)$

and, uniformly on $U_\tau(q, c_0(p), M, \varepsilon_0)$,

$\|(T_{t_n}(\hat{\Sigma}))^{-1} - \Sigma^{-1}\| = O_P\left(c_0(p)\left(\frac{\log p}{n}\right)^{(1-q)/2}\right).$

Proof. Recall that, without loss of generality, we assumed $EX = 0$. Begin with the decomposition

(8)  $\hat{\Sigma} = \hat{\Sigma}^0 - \bar{X}\bar{X}^T,$

where

$\hat{\Sigma}^0 \equiv [\hat{\sigma}^0_{ij}] = \frac{1}{n}\sum_{k=1}^n X_k X_k^T.$

Note that, by (8),

(9)  $\max_{i,j}|\hat{\sigma}_{ij} - \sigma_{ij}| \le \max_{i,j}|\hat{\sigma}^0_{ij} - \sigma_{ij}| + \max_{i,j}|\bar{X}_i\bar{X}_j|.$

By a result of Saulis and Statulevičius [32], adapted for this context in Lemma 3 of [3], and $\sigma_{ii} \le M$ for all $i$,

(10)  $P\left[\max_{i,j}|\hat{\sigma}^0_{ij} - \sigma_{ij}| \ge t\right] \le p^2 C_1 e^{-C_2 n t^2},$

for $|t| < \delta$, where $C_1$, $C_2$ and $\delta$ are constants depending only on $M$. In particular, (10) holds if $t = o(1)$. For the second term in (9), we have, by the union sum inequality, the Gaussian tail inequality and $\sigma_{ii} \le M$ for all $i$,

(11)  $P\left[\max_i |\bar{X}_i|^2 \ge t\right] \le pC_3 e^{-C_4 n t}.$

Combining (10) and (11), we see that if $\frac{\log p}{n} \to 0$ and $t = t_n$ is given by (7), then for $M'$ sufficiently large,

(12)  $\max_{i,j}|\hat{\sigma}_{ij} - \sigma_{ij}| = O_P\left(\sqrt{\frac{\log p}{n}}\right).$

We now recap an argument of Donoho and Johnstone [8]. Bound

$\|T_t(\hat{\Sigma}) - \Sigma\| \le \|T_t(\Sigma) - \Sigma\| + \|T_t(\hat{\Sigma}) - T_t(\Sigma)\|.$

The first term above is bounded by

(13)  $\max_i \sum_{j=1}^p |\sigma_{ij}|1(|\sigma_{ij}| \le t) \le t^{1-q}c_0(p).$

On the other hand,

(14)  $\|T_t(\hat{\Sigma}) - T_t(\Sigma)\| \le \max_i \sum_{j=1}^p |\hat{\sigma}_{ij}|1(|\hat{\sigma}_{ij}| \ge t, |\sigma_{ij}| < t) + \max_i \sum_{j=1}^p |\sigma_{ij}|1(|\hat{\sigma}_{ij}| < t, |\sigma_{ij}| \ge t) + \max_i \sum_{j=1}^p |\hat{\sigma}_{ij} - \sigma_{ij}|1(|\hat{\sigma}_{ij}| \ge t, |\sigma_{ij}| \ge t) = I + II + III.$

Using (12), we have

$III \le \max_{i,j}|\hat{\sigma}_{ij} - \sigma_{ij}|\,\max_i \sum_{j=1}^p |\sigma_{ij}|^q t^{-q} = O_P\left(c_0(p)t^{-q}\sqrt{\frac{\log p}{n}}\right).$

To bound term I, write

(15)  $I \le \max_i \sum_{j=1}^p |\hat{\sigma}_{ij} - \sigma_{ij}|1(|\hat{\sigma}_{ij}| \ge t, |\sigma_{ij}| < t) + \max_i \sum_{j=1}^p |\sigma_{ij}|1(|\sigma_{ij}| < t) \equiv IV + V.$

By (13),

(16)  $V \le t^{1-q}c_0(p).$

Now take $\gamma \in (0, 1)$.
Then,

(17)  $IV \le \max_i \sum_{j=1}^p |\hat{\sigma}_{ij} - \sigma_{ij}|1(|\hat{\sigma}_{ij}| \ge t, |\sigma_{ij}| \le \gamma t) + \max_i \sum_{j=1}^p |\hat{\sigma}_{ij} - \sigma_{ij}|1(|\hat{\sigma}_{ij}| > t, \gamma t < |\sigma_{ij}| < t) \le \max_{i,j}|\hat{\sigma}_{ij} - \sigma_{ij}|\,\max_i N_i(1 - \gamma) + c_0(p)(\gamma t)^{-q}\max_{i,j}|\hat{\sigma}_{ij} - \sigma_{ij}|,$

where $N_i(a) \equiv \sum_{j=1}^p 1(|\hat{\sigma}_{ij} - \sigma_{ij}| > at)$. Note that, for some $\delta > 0$,

(18)  $P\left[\max_i N_i(1 - \gamma) > 0\right] = P\left[\max_{i,j}|\hat{\sigma}_{ij} - \sigma_{ij}| > (1 - \gamma)t\right] \le p^2 e^{-n\delta(1-\gamma)^2 t^2},$

if $t = o(1)$, uniformly on $U_\tau$. By (18) and (16), and $0 < \gamma < 1$, if

(19)  $2\log p - n\delta t^2 \to -\infty,$

then

(20)  $IV = O_P\left(c_0(p)t^{-q}\sqrt{\frac{\log p}{n}}\right)$

and, by (9) and (13),

(21)  $I = O_P\left(c_0(p)t^{-q}\sqrt{\frac{\log p}{n}} + c_0(p)t^{1-q}\right).$

For term II, we have

(22)  $II \le \max_i \sum_{j=1}^p [|\hat{\sigma}_{ij} - \sigma_{ij}| + |\hat{\sigma}_{ij}|]1(|\hat{\sigma}_{ij}| < t, |\sigma_{ij}| \ge t) \le \max_{i,j}|\hat{\sigma}_{ij} - \sigma_{ij}|\,\max_i \sum_{j=1}^p 1(|\sigma_{ij}| \ge t) + t\max_i \sum_{j=1}^p 1(|\sigma_{ij}| \ge t) = O_P\left(c_0(p)t^{-q}\sqrt{\frac{\log p}{n}} + c_0(p)t^{1-q}\right).$

Combining (21) and (22) and choosing $t$ as in (7) establishes the first claim of the theorem. The second claim follows since

$\|[T_{t_n}(\hat{\Sigma})]^{-1} - \Sigma^{-1}\| = \Omega_P(\|T_{t_n}(\hat{\Sigma}) - \Sigma\|)$

uniformly on $U_\tau(q, c_0(p), M, \varepsilon_0)$, where $A = \Omega_P(B)$ means $A = O_P(B)$ and $B = O_P(A)$. □

Theorem 2. Suppose $F$ is Gaussian. Then, uniformly on $U_\tau(q, c_0(p), M)$, if $t_n = M'\sqrt{\frac{\log p}{n}}$ and $M'$ is sufficiently large,

(23)  $\frac{1}{p}\|T_{t_n}(\hat{\Sigma}) - \Sigma\|_F^2 = O_P\left(c_0(p)\left(\frac{\log p}{n}\right)^{1 - q/2}\right).$

An analogous result holds for the inverse on $U_\tau(q, c_0(p), M, \varepsilon_0)$. The proof of Theorem 2 is similar to the proof of Theorem 1 and can be found in the Appendix.

2.3. The non-Gaussian case. We consider two cases here. If, for some $\eta > 0$,

$Ee^{tX_{ij}^2} \le K < \infty \quad \text{for all } |t| \le \eta, \text{ for all } i, j,$

then the proof goes through verbatim, since result (10) still holds. The bound on $\max_i |\bar{X}_i|^2$ will always be at least the squared rate of $\max_{i,j}|\hat{\sigma}_{ij} - \sigma_{ij}|$, hence we do not need normality for (11).

In the second case, if we have, for some $\gamma > 0$,

$E|X_{ij}|^{2(1+\gamma)} \le K \quad \text{for all } i, j,$

then by Markov's inequality

(24)  $P[|\hat{\sigma}_{ij} - \sigma_{ij}| \ge t] \le KC(\gamma)\frac{n^{-(1+\gamma)/2}}{t^{1+\gamma}}.$

Thus the bound (10) becomes

$p^2 KC(\gamma)\frac{n^{-(1+\gamma)/2}}{t^{1+\gamma}}$

and hence,

$\max_{i,j}|\hat{\sigma}^0_{ij} - \sigma_{ij}| = O_P\left(\frac{p^{2/(1+\gamma)}}{n^{1/2}}\right).$

Therefore, taking $t_n = M\frac{p^{2/(1+\gamma)}}{n^{1/2}}$, we find that

(25)  $\|T_{t_n}(\hat{\Sigma}) - \Sigma\| = O_P\left(c_0(p)\left(\frac{p^{2/(1+\gamma)}}{n^{1/2}}\right)^{1-q}\right),$

which is, we expect, minimax, though this needs to be checked.

2.4. Comparison to thresholding results of El Karoui. El Karoui [10] shows as a special case that if:

(i) $E|X_j|^r < \infty$ for all $r$, $1 \le j \le p$,
(ii) $\sigma_{jj} \le M < \infty$ for all $j$,
(iii) if $\sigma_{ij} \ne 0$, then $|\sigma_{ij}| > Cn^{-\alpha_0}$, $0 < \alpha_0 < \frac{1}{2} - \delta_0 < \frac{1}{2}$,
(iv) $\Sigma$ is $\beta$-sparse, $\beta = \frac{1}{2} - \eta$, $\eta > 0$,
(v) $\frac{p}{n} \to c \in (0, \infty)$,

then, if $t_n = Cn^{-\alpha}$, $\alpha = \frac{1}{2} - \delta_0 > \alpha_0$,

(26)  $\|T_{t_n}(\hat{\Sigma}) - \Sigma\| \xrightarrow{\text{a.s.}} 0.$

El Karoui's notion of $\beta$-sparsity is such that our case $q = 0$ is $\beta$-sparse with $c_0(p) = Kp^\beta$. Our results yield a rate of

$O_P\left(\frac{p^{\beta + 2/(1+\gamma)}}{n^{1/2}}\right)$

for $\gamma$ arbitrarily large. Since $\beta < \frac{1}{2}$ by assumption and $p \asymp n$, we see that our result implies (26) under (i), (ii), (iv), (v) and our notion of sparsity. Thus, our result is stronger than his in the all-moments case, again under our stronger notion of sparsity. El Karoui's full result, in fact, couples a maximal value of $r$ in (i) with the largest possible value of $\beta$. Unfortunately, this coupling involves (iii), which we do not require. Nevertheless, his result implies the corresponding consistency results of ours, if (iii) is ignored, when only existence of a finite set of moments is assumed. However, according to El Karoui (personal communication), (iii) is not needed for (26) in the case when our sparsity condition holds.
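Before turning to the comparison with banding, the following sketch gives a rough numerical illustration of the thresholding recipe of Theorem 1. It is not from the paper: it assumes NumPy, uses an arbitrary tridiagonal $\Sigma$ as the sparse truth, and fixes the constant $M'$ at 1 rather than "sufficiently large".

```python
import numpy as np

def hard_threshold(M, s):
    # Hard-thresholding operator T_s from (3).
    return M * (np.abs(M) >= s)

def op_norm(M):
    # Operator (spectral) norm: largest singular value.
    return np.linalg.norm(M, 2)

rng = np.random.default_rng(1)
n, p = 200, 400

# A sparse, well-conditioned "truth": tridiagonal Sigma (small c_0(p) in U_tau with q = 0).
Sigma = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))

# Draw n i.i.d. N_p(0, Sigma) observations.
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T

Sigma_hat = X.T @ X / n              # sample covariance (6), with the mean known to be 0
t_n = 1.0 * np.sqrt(np.log(p) / n)   # threshold (7) with M' = 1.0, an arbitrary choice

print("sample covariance error:", op_norm(Sigma_hat - Sigma))
print("thresholded error:      ", op_norm(hard_threshold(Sigma_hat, t_n) - Sigma))
```

With this arbitrary choice of $M'$, the thresholded error should come out well below that of the raw sample covariance in the $p > n$ regime, though in practice the constant matters; the cross-validation scheme of Section 3 addresses the choice of threshold.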
2.5. Comparison to banding results of Bickel and Levina. Comparison is readily possible on $V(\varepsilon_0, \alpha, C)$. By Theorem 1 of [3], the best rate achievable using banding is

$O_P\left(\left(\frac{\log p}{n}\right)^{\alpha/(2(\alpha+1))}\right).$

On the other hand, by our Theorem 1, thresholding yields

$O_P\left(\left(\frac{\log p}{n}\right)^{(1-q)/2}\right)$

for $q > \frac{1}{\alpha+1}$. Comparing exponents, we see that banding is slightly better in the situation where labels are meaningful, since we must have

$1 - q < \frac{\alpha}{\alpha+1}.$

However, since $1 - q$ can be arbitrarily close to $\frac{\alpha}{\alpha+1}$, the difference is not sharp. Not surprisingly, as $\alpha \to \infty$, the genuinely sparse case, both bounds approach $\left(\frac{\log p}{n}\right)^{1/2}$.
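As a quick numerical sanity check on this comparison (an illustrative computation of ours, not from the paper), one can tabulate the banding exponent $\alpha/(2(\alpha+1))$ and the thresholding exponent $(1-q)/2$, with $q$ taken just above $1/(\alpha+1)$:

```python
# Rate exponents on V(eps_0, alpha, C): a larger exponent means a faster rate.
for alpha in (0.5, 1.0, 2.0, 5.0, 20.0):
    banding_exp = alpha / (2 * (alpha + 1))
    q = 1 / (alpha + 1) + 0.01              # thresholding requires q > 1/(alpha + 1)
    thresholding_exp = (1 - q) / 2
    print(f"alpha = {alpha:4.1f}   banding: {banding_exp:.3f}   thresholding: {thresholding_exp:.3f}")
```

The banding exponent is always the larger of the two, but the gap vanishes as $q \downarrow 1/(\alpha+1)$, and both exponents tend to $1/2$ as $\alpha \to \infty$.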