Large Dimensional Analysis and Optimization of ✩ Robust Shrinkage Covariance Matrix Estimators Romain Couilleta, Matthew McKayb 5 aTelecommunication department, Sup´elec, Gif sur Yvette,France 1 bHong Kong University of Science and Technology 0 2 n a J Abstract 8 1 This article studies two regularized robust estimators of scatter matrices pro- posed (and proved to be well defined) in parallel in (Chen et al., 2011) and ] R (Pascal et al., 2013), based on Tyler’s robust M-estimator (Tyler, 1987) and P on Ledoit and Wolf’s shrinkage covariance matrix estimator (Ledoit and Wolf, . 2004). These hybrid estimators have the advantage of conveying (i) robustness h tooutliersorimpulsive samplesand(ii)smallsamplesize adequacyto the clas- t a sical sample covariance matrix estimator. We consider here the case of i.i.d. m elliptical zero mean samples in the regime where both sample and population [ sizes are large. We demonstrate that, under this setting, the estimators under 3 studyasymptoticallybehavesimilarto well-understoodrandommatrixmodels. v This characterization allows us to derive optimal shrinkage strategies to esti- 3 mate the population scatter matrix, improving significantly upon the empirical 8 shrinkage method proposed in (Chen et al., 2011). 0 4 Keywords: random matrix theory, robust estimation, linear shrinkage. . 1 0 4 1. Introduction 1 : v Many scientific domains customarily deal with (possibly small) sets of large i dimensional data samples from which statistical inference is performed. This is X in particular the case in financial data analysis where few stationary monthly r a observations of numerous stock indexes are used to estimate the joint covari- ance matrix of the stock returns (Laloux et al., 2000; Ledoit and Wolf, 2003; Rubio et al., 2012), bioinformatics where clustering of genes is obtained based on gene sequences sampled from a small population (Scha¨fer and Strimmer, 2005), computationalimmunology where correlationsamong mutations in viral strains are estimated from sampled viralsequences and used as a basis of novel ✩ Couillet’sworkissupportedbytheERCMOREEC–120133. McKay’sworkissupported bytheHongKongResearchGrantsCouncilundergrantnumber616713. Email addresses: [email protected] (RomainCouillet),[email protected] (Matthew McKay) Preprint submitted toJournal of Multivariate Analysis January 20, 2015 vaccine design (Dahirel et al., 2011; Quadeer et al., 2013), psychology where the covariance matrix of multiple psychological traits is estimated from data collectedon a groupof tested individuals (Steiger, 1980), orelectricalengineer- ing at large where signal samples extracted from a possibly short time window areusedtoretrieveparametersofthesignal(Scharf,1991). Inmanysuchcases, the number n of independent data samples x ,...,x CN (or RN) may not 1 n ∈ be large compared to the size N of the population, suggesting that the empir- ical sample covariance matrix C¯ = 1 n (x x¯)(x x¯)∗, x¯ = 1 n x , N n i=1 i − i − n i=1 i is a poor estimate for C = E[(x E[x ])(x E[x ])∗]. Several solutions N 1 1 1 1 − P − P have been proposed to work around this problem. If the end application is not to retrieve C but some metric of it, recent works on random matrix theory N showed that replacing C in the metric by C¯ often leads to a biased esti- N N mate of the metric (Mestre, 2008b), but that this estimate can be corrected by an improved estimation of the metric itself via the samples x ,...,x (Mestre, 1 n 2008a). However, when the object under interest is C itself and N n, there N ≃ is little hope to retrieve any consistent estimate of C . A popular alternative N proposed originally in (Ledoit and Wolf, 2004) is to “shrink” C¯ , i.e., consider N instead C¯ (ρ) = (1 ρ)C¯ +ρI for an appropriate ρ [0,1] that minimizes N N N the average distance−E[tr((C¯ (ρ) C )2)]. The intere∈st of ρ here is to give N N more orless weightto C¯ dependin−g onthe relevanceof the n samples, so that N in particular ρ is better chosen close to zero when n is large and close to one when n is small. In addition to the problem of scarcity of samples, it is often the case that outliers are present among the set of samples. These outliers may arise from erroneous or inconsistent data (e.g., individuals under psychological or biolog- ical tests incorrectly identified to fit the test pattern), or from the corruption of some samples by external events (e.g., interference by ambient electromag- netic noise in signal processing). These outliers, if not correctly handled, may further corrupt the statistical inference and in particular the estimation of C . N The field of robust estimation intends to deal with this problem (Huber, 1981; Maronna et al., 2006) by proposing estimators that have the joint capability to naturally attenuate the effect of outliers (Huber, 1964) as well as to appro- priately handle samples of an impulsive nature (Tyler, 1987), e.g., elliptically distributed data. A common denominator of such estimators is their belong- ing to the class of M-estimators, therefore taking the form of the solution to an implicit equation. This poses important problems of analysis in small N,n dimensions, resulting mostly in only asymptotic results in the regime N fixed and n (Maronna, 1976; Kent and Tyler, 1991). This regime is however → ∞ inconsistent with the present scenario of scarce data where N n. Nonethe- ≃ less, recent works based on random matrix theory have shown that a certain family of such robust covariance matrix estimators asymptotically behave as N,n and N/n c (0,1) similar to classical random matrices taking → ∞ → ∈ (almost)explicitforms. SuchobservationsweremadefortheclassofMaronna’s M-estimatorsof scatter(Maronna, 1976) for sample vectorswhose independent entriescancontainoutliers(Couillet et al.,2013a)andforellipticallydistributed samples(Couillet et al.,2013b),aswellasforTyler’sM-estimator(Tyler,1987) 2 in (Zhang et al., 2014). In this article, we study two hybrid robust shrinkage covariance matrix estimates Cˆ (ρ) (hereafter referred to as the Abramovich–Pascal estimate) N and Cˇ (ρ) (hereafter referred to as the Chen estimate) proposed in parallel N in (Abramovich and Spencer, 2007; Pascal et al., 2013)1 and in (Chen et al., 2011), respectively. Both matrices, whose definition is introduced in Section 2 below, are empirically built upon Tyler’s M-estimate (Tyler, 1987) originally designed to cope with elliptical samples whose distribution is unknown to the experimenter and upon the Ledoit–Wolf shrinkage estimator (Ledoit and Wolf, 2004). Thisallowsforanimproveddegreeoffreedomforapproximatingthepop- ulation covariance matrix and importantly allows for N >n, which Maronna’s and Tyler’s estimators do not. In (Pascal et al., 2013) and (Chen et al., 2011), Cˆ (ρ) and Cˇ (ρ) were proved to be well-defined as the unique solutions to N N their defining fixed-point matrices. However, little is known of their perfor- manceasestimatorsofC inthe regimeN nofinteresthere. Someprogress N ≃ in this direction was made in (Chen et al., 2011) but this work does not man- age to solve the optimal shrinkage problem consisting of finding ρ such that E[tr((Cˇ (ρ) C )2)]is minimizedandresortsto solvinganapproximateprob- N N − lem instead. The present article studies the matrices Cˆ (ρ) and Cˇ (ρ) from a random N N matrix approach, i.e., in the regime where N,n with N/n c (0, ), → ∞ → ∈ ∞ and under the assumption of the absence of outliers. Our main results are as follows: we show that, under the aforementioned setting, both Cˆ (ρ) and Cˇ (ρ) N N • asymptotically behave similar to well-known random matrix models and prove in particular that both have a well-identified limiting spectral dis- tribution; we prove that, up to a change in the variable ρ, the matrices Cˇ (ρ) and N • Cˆ (ρ)/(1 trCˆ (ρ))areessentiallythe sameforN,nlarge,implyingthat N N N both achieve the same optimal shrinkage performance; we determine the optimal shrinkage parameters ρˆ⋆ and ρˇ⋆ that mini- • mize the almost sure limits lim 1 tr[(Cˆ (ρ)/(1 trCˆ (ρ)) C )]2 and N N N N N − N lim 1 tr[(Cˇ (ρ) C )]2, respectively, both limits being the same. We N N N − N then propose consistentestimates ρˆ andρˇ for ρˆ⋆ andρˇ⋆ whichachieve N N the samelimitingperformance. Wefinallyshowbysimulationsthatasig- nificantgainis obtainedusing ρˆ⋆ (orρˆ )andρˇ⋆ (or ρˇ )comparedto the N N solution ρˇ of the approximate problem developedin (Chen et al., 2011). O Inpractice,theseresultsallowforaproperuseofCˆ (ρ)andCˇ (ρ)inanticipa- N N tionoftheabsenceofoutliers. Inthepresenceofoutliers,itisthenexpectedthat 1To the authors’ knowledge, the first instance of the estimator dates back to (AbramovichandSpencer,2007)althoughthenon-obviousproofofCˆN(ρ)beingwell-defined isonlyfoundlaterin(Pascaletal.,2013). 3 bothAbramovich–Pascaland Chenestimates willexhibit robustnessproperties that their asymptotic random matrix equivalents will not. Note in particular that, although Cˆ (ρ) and Cˇ (ρ) are shown to be asymptotically equivalent in N N the absence of outliers, it is not clear at this point whether one of the two es- timates will show better performance in the presence of outliers. The study of this interesting scenario is left to future work. The remainder of the article is structured as follows. In Section 2, we in- troduce our main results on the large N,n behavior of the matrices Cˆ (ρ) and N Cˇ (ρ). In Section 3, we develop the optimal shrinkage analysis, providing in N particular asymptotically optimal empirical shrinkage strategies. Concluding remarks are provided in Section 4. All proofs of the results of Section 2 and Section 3 are then presented in Section 5. General notations: The superscript ()∗ stands for Hermitian transpose in · the complex case or transpose in the real case. The notation stands for k·k the spectral norm for matrices and the Euclidean norm for vectors. The Dirac measure at point x is denoted δ . The ordered eigenvalues of a Hermitian (or x symmetric)matrixX ofsizeN N aredenotedλ (X) ... λ (X). Forℓ>0 1 N × ≤ ≤ and a positive and positively supported measure ν, we define M = tℓν(dt) ν,ℓ a.s. (may be infinite). The arrow “ ” designates almost sure convergence. The statement X ,Y defines the ne−w→notation X as being equal to Y. R 2. Main results We start by introducing the main assumptions of the data model under study. We consider n sample vectors x ,...,x CN (or RN) having the 1 n ∈ following characteristics. Assumption 1 (Growth rate). Denoting c = N/n, c c (0, ) as N N → ∈ ∞ N . →∞ Assumption 2 (Population model). The vectors x ,...,x CN (or RN) 1 n ∈ are independent with a. x = √τ A y , where y CN¯ (or RN¯), N¯ N, is a random zero miean unitiarNilyi(or orthogion∈ally) invariant vecto≥r with norm y 2 = N¯, i A CN×N¯ is deterministic, and τ ,...,τ is a collectionkofkpositive N 1 n ∈ scalars. We shall denote z =A y . i N i b. C ,A A∗ is nonnegative definite, with trace 1 trC =1 and spectral N N N N N norm satisfying limsup C < . Nk Nk ∞ c. ν , 1 N δ satisfies ν ν weakly with ν = δ almost every- N N i=1 λi(CN) N → 6 0 where. P Since all considerations to come are equally valid over C or R, we will con- sider by default that x ,...,x CN. As the analysis will show, the positive 1 n ∈ 4 scalars τ have no impact on the robust covariance estimates; with this defini- i tion,thedistributionofthevectorsx containsinparticulartheclassofelliptical i distributions. Note thatthe assumptionthaty iszeromeanunitarilyinvariant i with norm N¯ is equivalent to saying that y =√N¯ y˜i with y˜ CN¯ standard i ky˜ik i ∈ Gaussian. This, along with A CN×N¯ and limsup C < , implies in particular that x 2 is of ordeNr N∈. The assumption thNakt νN=kδ ∞almost every- i 0 k k 6 where avoids the degenerate scenario where an overwhelming majority of the eigenvalues of C tend to zero, whose practical interest is quite limited. Fi- N nallynotethattheconstraint 1 trC =1isinconsequentialandinfactdefines N N uniquely both terms in the product τ C . i N ThefollowingtwotheoremsintroducetherobustshrinkageestimatorsCˆ (ρ) N and Cˇ (ρ), and constitute the main technical results of this article. N Theorem 1 (Abramovich–Pascal Estimate). LetAssumptions1and2hold. For ε (0,min 1,c−1 ), define ˆ = [ε + max 0,1 c−1 ,1]. For each ε ρ (ma∈x 0,1 c{−1 ,1]}, let Cˆ (ρ)Rbe the unique so{lution−to } ∈ { − N } N 1 n x x∗ Cˆ (ρ)=(1 ρ) i i +ρI . N − n 1x∗Cˆ (ρ)−1x N i=1 N i N i X Then, as N , →∞ sup Cˆ (ρ) Sˆ (ρ) a.s. 0 N N − −→ ρ∈Rˆε(cid:13) (cid:13) (cid:13) (cid:13) where (cid:13) (cid:13) n 1 1 ρ 1 Sˆ (ρ)= − z z∗+ρI N γˆ(ρ)1 (1 ρ)cn i i N − − i=1 X and γˆ(ρ) is the unique positive solution to the equation in γˆ t 1= ν(dt). γˆρ+(1 ρ)t Z − Moreover, the function ρ γˆ(ρ) thus defined is continuous on (0,1]. 7→ Proof. The proof is deferred to Section 5.1. Theorem 2 (Chen Estimate). LetAssumptions1and2hold. Forε (0,1), define ˇ =[ε,1]. For each ρ (0,1], let Cˇ (ρ) be the unique solution∈to ε N R ∈ Bˇ (ρ) Cˇ (ρ)= N N 1 trBˇ (ρ) N N where 1 n x x∗ Bˇ (ρ)=(1 ρ) i i +ρI . N − n 1x∗Cˇ (ρ)−1x N i=1 N i N i X 5 Then, as N , →∞ sup Cˇ (ρ) Sˇ (ρ) a.s. 0 N N − −→ ρ∈Rˇε (cid:13) (cid:13) where (cid:13) (cid:13) n 1 ρ 1 T Sˇ (ρ)= − z z∗+ ρ I N 1 ρ+T n i i 1 ρ+T N ρ ρ − i=1 − X in which T =ργˇ(ρ)F(γˇ(ρ);ρ) with, for all x>0, ρ 1 1 1 F(x;ρ)= (ρ c(1 ρ))+ (ρ c(1 ρ))2+(1 ρ) 2 − − 4 − − − x r and γˇ(ρ) is the unique positive solution to the equation in γˇ t 1= ν(dt). γˇρ+ 1−ρ t Z (1−ρ)c+F(γˇ;ρ) Moreover, the function ρ γˇ(ρ) thus defined is continuous on (0,1]. 7→ Proof. The proof is deferred to Section 5.2. Theorem 1 and Theorem 2 show that, as N,n with N/n c, the matrices Cˆ (ρ) and Cˇ (ρ), defined as the non-triv→ial∞solution of fix→ed-point N N equations, behave similar to matrices Sˆ (ρ) and Sˇ (ρ), respectively, whose N N characterizationiswell-knownandmuchsimplerthanthatofCˆ (ρ)andCˇ (ρ) N N themselves. Indeed, Sˆ (ρ) and Sˇ (ρ) are random matrices of the sample co- N N variance matrix type thoroughly studied in e.g., (Mar˘cenko and Pastur, 1967; Silverstein and Bai, 1995; Silverstein and Choi, 1995). Notethattheseresultsaresimilarinstatementtotheresultsof(Couillet et al., 2013a,b) for robust estimators of the Maronna-type. Technically speaking, the proof of both Theorem 1 and Theorem 2 unfold from the same technique as produced in these articles. However, while the proof of Theorem 1 comes with no major additional difficulty compared to these works, due to the scale nor- malization imposed in the definition of Cˇ (ρ), the proof of Theorem 2 requires N amoreelaborateapproachthanusedin(Couillet et al.,2013b). Anotherdiffer- ence to previous works lies here in that, unlike Maronna’s estimator that only attenuates the effect of the scale parameters τ , the proposed Tyler-based es- i timators discard this effect altogether. Also, the technical study of Maronna’s estimator can be made under the assumption that C = I (from a natural N N variable change) while here, because of the regularization term ρI , C does N N intervene in an intricate manner in the results. As asideremark,itis shownin(Pascal et al.,2013)thatforeachN,nfixed with n N +1, Cˆ (ρ) Cˆ (0) as ρ 0 with Cˆ (0) defined (almost surely) N N N ≥ → → as one of the (uncountably many) solutions to 1 n x x∗ Cˆ (0)= i i . (1) N n 1x∗Cˆ (0)−1x i=1 N i N i X 6 IntheregimewhereN,n andN/n c,thisresultisdifficulttogeneralize as it is challenging to ha→nd∞le the limit→Cˆ (ρ ) Sˆ (ρ ) for a sequence N N N N k − k ρ ∞ with ρ 0. The requirement that ρ ρ > 0 on any such { N}N=1 N → N → 0 sequenceis indeedatthe coreofthe proofofTheorem1 (see Equations(5)and (6) in Section 5.1 where ρ > 0 is necessary to ensure e+ < 1). This explains 0 whythe set ˆ inTheorem1excludes the region[0,ε). Similarargumentshold ε forCˇ (ρ). ARs amatteroffact,the behaviorofanysolutionCˆ (0)to (1)inthe N N large N,n regime, recently derived in (Zhang et al., 2014), remains difficult to handle with our proof technique. AnimmediateconsequenceofTheorem1andTheorem2isthattheempirical spectraldistributionsofCˆ (ρ)andCˇ (ρ)convergetothewell-knownrespective N N limiting distributionsofSˆ (ρ) andSˇ (ρ), characterizedinthe followingresult. N N Corollary 1 (Limiting spectral distribution). Under the settings of The- orem 1 and Theorem 2, N 1 δ a.s. µˆ , ρ ˆ N λi(CˆN(ρ)) −→ ρ ∈Rε i=1 X N 1 δ a.s. µˇ , ρ ˇ N λi(CˇN(ρ)) −→ ρ ∈Rε i=1 X wheretheconvergencearrowisunderstoodastheweakconvergenceofprobability measures, for almost every sequence x ,...,x ∞ , and where { 1 n}n=1 µˆ =max 0,1 c−1 δ +µˆ ρ { − } ρ ρ µˇ =max 0,1 c−1 δ +µˇ ρ { − } 1−Tρ+ρTρ ρ with µˆ and µˇ continuous finite measures with compact support in [ρ, ) and ρ ρ ∞ [T (1 ρ+T )−1, )respectively, realanalyticwherevertheir densityispositive. ρ ρ − ∞ The measure µˆ is the only measurewith Stieltjes transform m (z) defined, for ρ µˆρ z C with [z]>0, as ∈ ℑ 1 (1 ρ)c 1 m (z)=γˆ − − ν(dt) µˆρ 1 ρ zˆ(ρ)+ t − Z 1+cδˆ(z) where zˆ(ρ) = (ρ z)γˆ(ρ)1−(1−ρ)c and δˆ(z) is the unique solution with positive − 1−ρ imaginary part of the equation in δˆ t δˆ= ν(dt). zˆ(ρ)+ t Z 1+cδˆ The measure µˇ is the only measurewith Stieltjes transform m (z) defined, for ρ µˇρ [z]>0 as ℑ 1 ρ+T 1 ρ m (z)= − ν(dt) µˇρ 1 ρ zˇ(ρ)+ t − Z 1+cδˇ(z) 7 withzˇ(ρ)= 1 T (1 z) zandδˇ(z)theuniquesolutionwithpositiveimaginary 1−ρ ρ − − part of the equation in δˇ t δˇ= ν(dt). zˇ(ρ)+ t Z 1+cδˇ Proof. Thisisanimmediateapplicationof(Silverstein and Bai,1995;Silverstein and Choi, 1995) and Theorems 1 and 2. 1.5 Empiricaleigenvaluedistribution Limitingdensitypˆρ 1 y t si n e D 0.5 0 0 1 2 3 4 Eigenvalues Figure1: Histogram of theeigenvalues of CˆN (Abramovich–Pascal type) forn=2048, N = 256,CN = 31diag(I128,5I128),ρ=0.2,versuslimitingeigenvaluedistribution. From Corollary 1, µˆ is continuous on (ρ, ) so that µˆ (dx) = pˆ (x)dx ρ ρ ρ ∞ where,fromtheinverseStieltjestransformformula(seee.g.,(Bai and Silverstein, 2009)) for all x (ρ, ), ∈ ∞ 1 pˆ (x)= lim m (x+ıε) . ρ ε→0πℑ µˆρ (cid:2) (cid:3) Letting ε>0 small and approximating pˆ (x) by 1 [m (x+ıε)] allows one to ρ πℑ µˆρ depict pˆ approximately. Similarly, µˇ (dx) = pˇ (x)dx for all x (T (1 ρ+ ρ ρ ρ ρ ∈ − T )−1, ) which can be obtained equivalently. This is performed in Figure 1 ρ ∞ andFigure2whichdepictthehistogramoftheeigenvaluesofCˆ (ρ)andCˇ (ρ) N N for ρ = 0.2, N = 256, n = 2048, C = diag(I ,5I ), versus their limiting N 128 128 distributions for c = 1/8. Figure 3 depicts Cˇ (ρ) for ρ = 0.8, N = 1024, N n=512, C =diag(I ,5I ) versus its limiting distribution for c=2. Note N 128 128 that, when c=1/8, the eigenvalues of Cˇ (ρ) concentrate in two bulks close to N 1/3 and 5/3, as expected. Due to the different trace normalization of Cˆ (ρ), N thesamereasoningholdsuptoamultiplicativeconstant. However,whenc=2, 8 3 Empiricaleigenvaluedistribution Limitingdensitypˇρ 2 y t si n e D 1 0 0 0.5 1 1.5 2 2.5 Eigenvalues Figure 2: Histogram of the eigenvalues of CˇN (Chen type) for n = 2048, N = 256, CN = 31diag(I128,5I128),ρ=0.2,versuslimitingeigenvaluedistribution. 1.2 Empiricaleigenvaluedistribution Limitingdensitypˇρ Diracmass1/2 1 0.8 y t si n e 0.6 D 0.4 0.2 0 1 1.5 2 2.5 Eigenvalues Figure 3: Histogram of the eigenvalues of CˇN (Chen type) for n = 512, N = 1024, CN = 13diag(I128,5I128),ρ=0.8,versuslimitingeigenvaluedistribution. the eigenvalues of Cˇ (ρ) are quite remote from masses in 1/3 and 5/3, an N observation known since (Mar˘cenko and Pastur, 1967). AnothercorollaryofTheorem1andTheorem2isthejointconvergence(over bothρandtheeigenvalueindex)oftheindividualeigenvaluesofCˆ (ρ)tothose N ofSˆ (ρ)andoftheindividualeigenvaluesofCˇ (ρ)tothoseofSˇ (ρ),aswellas N N N 9 the joint convergence over ρ of the moments of the empirical spectral distribu- tions of Cˆ (ρ) andCˇ (ρ). These joint convergenceproperties are fundamental N N in problems of optimization of the parameter ρ as discussed in Section 3. Corollary 2 (Joint convergence properties). Under the settings of Theo- rem 1 and Theorem 2, sup max λ (Cˆ (ρ)) λ (Sˆ (ρ)) a.s. 0 i N i N ρ∈Rˆε1≤i≤n(cid:12) − (cid:12)−→ sup max (cid:12)(cid:12)λi(CˇN(ρ)) λi(SˇN(ρ))(cid:12)(cid:12) a.s. 0. ρ∈Rˇε1≤i≤n − −→ (cid:12) (cid:12) (cid:12) (cid:12) This result implies limsup sup Cˆ (ρ) < N k k ∞ N ρ∈Rˆε limsup sup Cˇ (ρ) < . N k k ∞ N ρ∈Rˇε almost surely. This, and the weak convergence of Corollary 1, in turn induce that, for each ℓ N, ∈ 1 sup tr Cˆ (ρ)ℓ M a.s. 0 N N − µˆρ,ℓ −→ ρ∈Rˆε(cid:12)(cid:12) (cid:16) (cid:17) (cid:12)(cid:12) (cid:12) 1 (cid:12) sup(cid:12) tr Cˇ (ρ)ℓ M (cid:12) a.s. 0 N N − µˇρ,ℓ −→ ρ∈Rˇε(cid:12) (cid:12) (cid:12) (cid:0) (cid:1) (cid:12) (cid:12) (cid:12) where we recall that M =(cid:12) tℓµ(dt) (0, ] fo(cid:12)r any probability measure µ µ,ℓ ∈ ∞ with support in R+; in particular, M = 1 1−ρ +ρ and M =1. R µˆρ,1 γˆ(ρ)1−(1−ρ)c µˇρ,1 Proof. The proof is provided in Section 5.3. 3. Application to optimal shrinkage We nowapplyTheorems1and2totheproblemofoptimallinearshrinkage, originally considered in (Ledoit and Wolf, 2004) for the simpler sample covari- ance matrix model. The optimal linear shrinkage problem consists in choosing ρ to be such that a certain distance metric between Cˆ (ρ) (or Cˇ (ρ)) and C N N N is minimized, therefore allowing for a more appropriate estimation of C via N Cˆ (ρ) or Cˇ (ρ). The metric selected here is the squared Frobenius norm of N N the difference between the (possibly scaled) robust estimators and C , which N has the advantage of being a widespread matrix distance (e.g., as considered in (Ledoit and Wolf, 2004)) and a metric amenable to mathematical analysis.2 2Alternative metrics (such as the geodesic distance on the cone of nonnegative definite matrices)canbesimilarlyconsidered. Theappropriatechoiceofsuchametricheavilydepends ontheultimateproblemtooptimize. 10