Submitted to Statistical Science

Sparse estimation by exponential weighting

Philippe Rigollet* and Alexandre B. Tsybakov
Princeton University and CREST-ENSAE

Philippe Rigollet, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA (e-mail: [email protected]). Alexandre Tsybakov, Laboratoire de Statistique, CREST-ENSAE, 3, av. Pierre Larousse, F-92240 Malakoff Cedex, France (e-mail: [email protected]).

* Partially supported by the National Science Foundation (DMS-0906424, DMS-1053987).

Abstract. Consider a regression model with fixed design and Gaussian noise where the regression function can potentially be well approximated by a function that admits a sparse representation in a given dictionary. This paper resorts to exponential weights to exploit this underlying sparsity by implementing the principle of sparsity pattern aggregation. This model selection take on sparse estimation allows us to derive sparsity oracle inequalities in several popular frameworks, including ordinary sparsity, fused sparsity and group sparsity. One striking aspect of these theoretical results is that they hold under no condition on the dictionary. Moreover, we describe an efficient implementation of the sparsity pattern aggregation principle that compares favorably to state-of-the-art procedures on some basic numerical examples.

AMS 2000 subject classifications: Primary 62G08, sparse regression; secondary 62G05, 62J05, 62G20.

Key words and phrases: High-dimensional regression, exponential weights, sparsity, fused sparsity, group sparsity, sparsity oracle inequalities, sparsity pattern aggregation, sparsity prior.

1. INTRODUCTION

Since the 1990s, the idea of exponential weighting has been successfully used in a variety of statistical problems. In this paper, we review several properties of estimators based on exponential weighting, with a particular emphasis on how they can be used to construct optimal and computationally efficient procedures for high-dimensional regression under the sparsity scenario.

Most of the work on exponential weighting deals with a regression learning problem. Some of the results can be extended to other statistical models such as density estimation or classification, cf. Section 6. For the sake of brevity, and to make the presentation more transparent, we focus here on the following framework considered in Rigollet and Tsybakov (2011). Let $\mathcal{Z} = \{(x_1,Y_1),\dots,(x_n,Y_n)\}$ be a collection of independent random couples such that $(x_i,Y_i) \in \mathcal{X} \times \mathbb{R}$, where $\mathcal{X}$ is an arbitrary set. Assume the regression model

(1.1)  $Y_i = \eta(x_i) + \xi_i, \qquad i = 1,\dots,n,$

where $\eta : \mathcal{X} \to \mathbb{R}$ is the unknown regression function and the errors $\xi_i$ are independent Gaussian $\mathcal{N}(0,\sigma^2)$. The covariates $x_1,\dots,x_n$ are deterministic elements of $\mathcal{X}$. For any function $f : \mathcal{X} \to \mathbb{R}$ we define a seminorm $\|\cdot\|$ by

$\|f\|^2 = \frac{1}{n}\sum_{i=1}^n f^2(x_i).$

(Without loss of generality, in what follows we associate all functions with vectors in $\mathbb{R}^n$, since only the values of the functions at the points $x_1,\dots,x_n$ appear in the risk. Thus $\|\cdot\|$ is indeed a norm and, with no ambiguity, we use other related notation such as $\|Y-f\|$, where $Y$ is the vector in $\mathbb{R}^n$ with components $Y_1,\dots,Y_n$.)

We adopt the following learning setup. Let $\mathcal{H} = \{f_1,\dots,f_M\}$ be a dictionary of $M \ge 1$ given functions. For example, the $f_j$ can be some basis functions or some preliminary estimators of $\eta$ constructed from another sample that we consider as frozen (see Section 4 for more details). Our goal is to approximate the regression function $\eta$ by a linear combination $f_\theta(x) = \sum_{j=1}^M \theta_j f_j(x)$ with weights $\theta = (\theta_1,\dots,\theta_M)$, where possibly $M \gg n$.
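To fix ideas, here is a minimal simulation sketch of model (1.1) with a fixed design and a finite dictionary. All concrete choices (sample size, noise level, the sine dictionary, the particular vector $\theta$) are illustrative assumptions, not prescriptions of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, sigma = 100, 50, 0.5          # sample size, dictionary size, noise level (illustrative)

x = np.linspace(0.0, 1.0, n)        # fixed (deterministic) design points
eta = np.sin(2 * np.pi * x)         # unknown regression function (a sine, purely for illustration)
Y = eta + sigma * rng.standard_normal(n)   # model (1.1): Y_i = eta(x_i) + xi_i

# Dictionary H = {f_1, ..., f_M}, stored as an n x M matrix of values f_j(x_i)
F = np.column_stack([np.sin((j + 1) * np.pi * x) for j in range(M)])

def sq_norm(v):
    """Empirical squared (semi)norm ||v||^2 = (1/n) * sum_i v_i^2."""
    return np.mean(v ** 2)

theta = np.zeros(M); theta[0] = 1.0           # one linear combination f_theta = sum_j theta_j f_j
f_theta = F @ theta
print("||f_theta - eta||^2 =", sq_norm(f_theta - eta))   # averaged squared error of f_theta
```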
The performance of a given estimator $\hat f$ of the function $\eta$ is measured in terms of its averaged squared error

$R(\hat f) = \|\hat f - \eta\|^2 = \frac{1}{n}\sum_{i=1}^n \big[\hat f(x_i) - \eta(x_i)\big]^2.$

Let $\Theta$ be a given subset of $\mathbb{R}^M$. In the aggregation problem, we would ideally wish to find an aggregated estimator $\hat f$ whose risk $R(\hat f)$ is as close as possible, in a probabilistic sense, to the minimum risk $\inf_{\theta\in\Theta} R(f_\theta)$. Namely, one can construct estimators $\hat f$ satisfying the following property:

(1.2)  $\mathbb{E} R(\hat f) \le C \inf_{\theta\in\Theta} R(f_\theta) + \delta_{n,M}(\Theta),$

where $\delta_{n,M}(\Theta)$ is a small remainder term characterizing the performance of the given aggregate $\hat f$ and the complexity of the set $\Theta$, $C \ge 1$ is a constant, and $\mathbb{E}$ denotes the expectation. Bounds of the form (1.2) are called oracle inequalities. In some cases, even more general results are available. They have the form

(1.3)  $\mathbb{E} R(\hat f) \le C \inf_{\theta\in\Theta'} \{ R(f_\theta) + \Delta_{n,M}(\theta) \},$

where $\Delta_{n,M}$ is a remainder term that characterizes the performance of the given aggregate $\hat f$ and the complexity of the parameter $\theta \in \Theta' \subseteq \mathbb{R}^M$ (often $\Theta' = \mathbb{R}^M$). To distinguish them from (1.2), we will call bounds of the form (1.3) balanced oracle inequalities. If $\Theta \subseteq \Theta'$, then (1.2) is a direct consequence of (1.3) with $\delta_{n,M}(\Theta) = \sup_{\theta\in\Theta} \Delta_{n,M}(\theta)$.

In this paper, we mainly focus on the case where the complexity of a vector $\theta$ is measured as the number of its nonzero coefficients $|\theta|_0$. In this case, inequalities of the form (1.3) are sometimes called sparsity oracle inequalities. Other measures of complexity, also related to sparsity, are considered in Subsection 5.2. As indicated by the notation and illustrated below, the remainder term $\Delta_{n,M}(\theta)$ depends explicitly on the size $M$ of the dictionary and the sample size $n$. It reflects the interplay between these two fundamental parameters and also the complexity of $\theta$.

When the linear model is misspecified, that is, when there is no $\theta \in \Theta$ such that $\eta = f_\theta$ on the set $\{x_1,\dots,x_n\}$, the minimum risk satisfies $\inf_{\theta\in\Theta} R(f_\theta) > 0$, leading to a systematic bias term. Since this term is unavoidable, we wish to make its contribution as small as possible, and it is therefore important to obtain a leading constant $C = 1$. Many oracle inequalities with leading constant $C > 1$ can be found in the literature for related problems. However, in most of the papers, the set $\Theta = \Theta_n$ depends on the sample size $n$ in such a way that $\inf_{\theta\in\Theta_n} R(f_\theta)$ tends to 0 as $n$ goes to infinity, under additional regularity assumptions. In this paper, we are interested in the case where $\Theta$ is fixed. For this reason, we consider here only oracle inequalities with leading constant $C = 1$ (the so-called exact or sharp oracle inequalities). Because they hold for finite $M$ and $n$, these are truly finite sample results.

One salient feature of the oracle approach, as opposed to standard statistical reasoning, is that it does not rely on an underlying model. Indeed, the goal is not to estimate the parameters of an underlying "true" model but rather to construct an estimator that mimics, in terms of an appropriate oracle inequality, the performance of the best model in a given class, whether this model is true or not.
From a statistical viewpoint, this difference is significant since performance cannot be evaluated in terms of parameters. Indeed, there is no true parameter. However, we can still compare the risk of the estimator with the optimum value, and oracle inequalities offer a framework for such a comparison.

A particular choice of $\Theta$ corresponds to the problem of model selection. Let $\Theta = \Theta^{\mathrm{MC}}$ be the set of canonical vectors of $\mathbb{R}^M$. Then the set of linear combinations $\{f_\theta, \theta \in \Theta^{\mathrm{MC}}\}$ coincides with the initial dictionary of functions $\mathcal{H} = \{f_1,\dots,f_M\}$, so that the goal of model selection is to mimic the best function in the dictionary in the sense of the risk measure $R(\cdot)$. This can be done in different ways leading to different rates $\delta_{n,M}(\Theta^{\mathrm{MC}})$; however, one is mostly interested in the methods that attain the optimal rate, known to be $\delta^*_{n,M}(\Theta^{\mathrm{MC}}) \asymp (\log M)/n$ (see Tsybakov, 2003; Bunea, Tsybakov and Wegkamp, 2007; Rigollet, 2009). Catoni (1999) was the first to show that, for model selection in Gaussian regression with random design, oracle inequalities of the form (1.2) with the optimal rate $(\log M)/n$ are satisfied for the progressive mixture method based on exponential weighting (see also Catoni, 2004). Other popular methods for model selection include AIC, BIC or Mallows' $C_p$. They all consist in selecting a function in the dictionary by minimizing a penalized empirical risk. One of the major novelties offered by exponential weighting is to combine (average) the functions in the dictionary using a convex combination, and not simply to select one of them. From the theoretical point of view, selection of one of the functions has a fundamental drawback since it does not attain the optimal rate $(\log M)/n$ (cf. Section 2).

The rest of the paper is organized as follows. In the next section, we discuss some connections between exponential weighting schemes and penalized empirical risk minimization. In Section 3, we present the first oracle inequalities that demonstrate how exponential weighting can be used to efficiently combine functions in a dictionary. The results of Section 3 are then extended to the case where one wishes to combine not deterministic functions but a special kind of estimators. Balanced oracle inequalities for this problem are derived in Section 4. Section 5 shows how these results can be adapted to deal with sparsity. We introduce the principle of sparsity pattern aggregation, and we derive sparsity oracle inequalities in several popular frameworks including ordinary sparsity, fused sparsity and group sparsity. Finally, we describe an efficient implementation of the sparsity pattern aggregation principle and compare its performance to state-of-the-art procedures on some basic numerical examples.

2. EXPONENTIAL WEIGHTING AND PENALIZED RISK MINIMIZATION

2.1 Suboptimality of selectors

A natural candidate to solve the problem of model selection introduced in the previous section is an empirical risk minimizer. Define the empirical risk by

$\hat R_n(f) = \frac{1}{n}\sum_{i=1}^n [Y_i - f(x_i)]^2 = \|Y - f\|^2$

and the empirical risk minimizer by

(2.1)  $\hat f^{\mathrm{erm}} = \mathop{\mathrm{argmin}}_{f\in\mathcal{H}} \hat R_n(f),$

where ties are broken arbitrarily. However, while this procedure satisfies an exact oracle inequality, it fails to exhibit the optimal remainder term of order $\delta^*_{n,M}(\Theta^{\mathrm{MC}}) \asymp (\log M)/n$.
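Before stating the corresponding result, the following sketch makes the empirical risk and the selection rule (2.1) concrete. The data-generating choices (design, noise level, sine dictionary) are the same illustrative assumptions as in the previous snippet and are not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, sigma = 100, 50, 0.5
x = np.linspace(0.0, 1.0, n)
eta = np.sin(2 * np.pi * x)
Y = eta + sigma * rng.standard_normal(n)
F = np.column_stack([np.sin((j + 1) * np.pi * x) for j in range(M)])  # values f_j(x_i)

# Empirical risk R_n(f_j) = (1/n) sum_i (Y_i - f_j(x_i))^2 for every dictionary element
emp_risk = np.mean((Y[:, None] - F) ** 2, axis=0)

# Empirical risk minimizer (2.1): select a dictionary element with smallest empirical risk
j_erm = int(np.argmin(emp_risk))
f_erm = F[:, j_erm]
print("selected index:", j_erm, " risk of selection:", np.mean((f_erm - eta) ** 2))
```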
The following result shows that this defect is intrinsic not only to empirical risk minimization but also to any method that selects only one function in the dictionary $\mathcal{H}$. This includes traditional methods of model selection by penalized empirical risk minimization such as AIC and BIC. We call estimators $\hat S_n$ taking values in $\mathcal{H}$ selectors.

Theorem 2.1. Assume that $\|f_j\| \le 1$ for any $f_j \in \mathcal{H}$. Any empirical risk minimizer $\hat f^{\mathrm{erm}}$ defined in (2.1) satisfies the following oracle inequality

(2.2)  $\mathbb{E} R(\hat f^{\mathrm{erm}}) \le \min_{1\le j\le M} R(f_j) + 4\sigma\sqrt{\frac{2\log M}{n}}.$

Moreover, assume that

(2.3)  $(\sigma \vee 1)\sqrt{(\log M)/n} \le C_0$

for $0 < C_0 < 1$ small enough. Then, for any selector $\hat S_n$, and in particular for any selector based on penalized empirical risk minimization, there exist a regression function $\eta$ and a dictionary $\mathcal{H} = \{f_1,\dots,f_M\}$ such that $\|\eta\| \le 1$, $\|f_j\| \le 1$ for any $f_j \in \mathcal{H}$, and

(2.4)  $\mathbb{E} R(\hat S_n) \ge \min_{1\le j\le M} R(f_j) + C_*\,\sigma\sqrt{\frac{\log M}{n}},$

for some positive constant $C_*$.

Proof. See appendix.

It follows from the lower bound (2.4) that selecting one of the functions in a finite dictionary $\mathcal{H}$ to solve the problem of model selection is suboptimal, in the sense that it exhibits a too large remainder term, of the order $\sqrt{(\log M)/n}$. Indeed, one can do better by taking a mixture, that is, a convex combination of the functions in $\mathcal{H}$. We will see in Section 3, cf. (3.4), that under a particular choice of weights in this convex combination, namely the exponential weights, one can achieve oracle inequalities with the much better rate $(\log M)/n$. This rate is known to be optimal in a minimax sense in several regression setups including the present one (see Tsybakov, 2003; Bunea, Tsybakov and Wegkamp, 2007; Rigollet, 2009).

2.2 Exponential weighting as a penalized procedure

Penalized empirical risk minimization for model selection has received a lot of attention in the literature, and many choices for the penalty can be considered (see, e.g., Birgé and Massart, 2001; Bartlett, Boucheron and Lugosi, 2002; Lugosi and Wegkamp, 2004; Bunea, Tsybakov and Wegkamp, 2007) to obtain oracle inequalities with the optimal remainder term. However, all these inequalities exhibit a constant $C > 1$ in front of the leading term. This is not surprising, as we have proved in the previous section that it is impossible for selectors to satisfy sharp oracle inequalities like (1.2) with the optimal remainder term. To overcome this limitation of selectors, we look for convex combinations of the functions in the dictionary. Let $\Lambda^M$ denote the flat simplex of $\mathbb{R}^M$ defined by

$\Lambda^M = \Big\{ \lambda \in \mathbb{R}^M : \lambda_j \ge 0, \ \sum_{j=1}^M \lambda_j = 1 \Big\}.$

Let us now examine a few ways to obtain potentially good convex combinations. One candidate is a solution of the following penalized empirical risk minimization problem:

$\min_{\lambda\in\Lambda^M} \big\{ \hat R_n(f_\lambda) + \mathrm{pen}(\lambda) \big\},$

where $\mathrm{pen}(\cdot) \ge 0$ is a penalty function. This choice looks quite natural since it provides a proxy of the right-hand side of the oracle inequality (1.3), where the unknown risk $R(\cdot)$ is replaced by its empirical counterpart $\hat R_n(\cdot)$. The minimum is taken over the simplex $\Lambda^M$ because we are looking for a convex combination. Clearly, the penalty $\mathrm{pen}(\cdot)$ should be carefully chosen and ideally should match the best remainder term $\Delta_{n,M}(\cdot)$. Yet, this problem may be difficult to solve as it involves a minimization over $\Lambda^M$. Instead, we propose to solve a simpler problem.
Consider the following linear upper bound on the empirical risk:

$\sum_{j=1}^M \lambda_j \hat R_n(f_j) \ge \hat R_n(f_\lambda), \qquad \forall \lambda \in \Lambda^M,$

and solve the following optimization problem

(2.5)  $\min_{\lambda\in\Lambda^M} \Big\{ \sum_{j=1}^M \lambda_j \hat R_n(f_j) + \mathrm{pen}(\lambda) \Big\}.$

Note that if $\mathrm{pen} \equiv 0$, the solution $\hat\lambda$ of (2.5) is simply the empirical risk minimizer over the vertices of the simplex, so that $f_{\hat\lambda} = \hat f^{\mathrm{erm}}$. In general, depending on the penalty function, this problem may be more or less difficult to solve. It turns out that the Kullback-Leibler penalty leads to a particularly simple solution and allows us to approximate the best remainder term $\Delta_{n,M}(\cdot)$, thus adding great flexibility to the resulting estimator.

Observe that vectors in $\Lambda^M$ can be associated to probability measures on $\{1,\dots,M\}$. Let $\lambda = (\lambda_1,\dots,\lambda_M)$ and $\pi = (\pi_1,\dots,\pi_M)$ be two probability measures on $\{1,\dots,M\}$ and define the Kullback-Leibler divergence between $\lambda$ and $\pi$ by

$\mathcal{K}(\lambda,\pi) = \sum_{j=1}^M \lambda_j \log\Big(\frac{\lambda_j}{\pi_j}\Big) \ge 0.$

Here and in the sequel, we adopt the convention that $0\log 0 = 0$, $0\log(a/0) = 0$, and $\log(a/0) = \infty$, for any $a > 0$.

Exponential weights can be obtained as the solution of the following minimization problem. Fix $\beta > 0$ and a prior $\pi \in \Lambda^M$, and define the vector $\hat\lambda^\pi$ by

(2.6)  $\hat\lambda^\pi = \mathop{\mathrm{argmin}}_{\lambda\in\Lambda^M} \Big\{ \sum_{j=1}^M \lambda_j \hat R_n(f_j) + \frac{\beta}{n}\mathcal{K}(\lambda,\pi) \Big\}.$

This constrained convex optimization problem has a unique solution that can be expressed explicitly. Indeed, it follows from the Karush-Kuhn-Tucker (KKT) conditions that the components $\hat\lambda^\pi_j$ of $\hat\lambda^\pi$ satisfy

(2.7)  $n\hat R_n(f_j) + \beta\log\Big(\frac{\hat\lambda^\pi_j}{\pi_j}\Big) + \mu - \delta_j = 0, \qquad j = 1,\dots,M,$

where $\mu, \delta_1, \dots, \delta_M \ge 0$ are Lagrange multipliers, and

$\hat\lambda^\pi_j \ge 0, \qquad \delta_j \hat\lambda^\pi_j = 0, \qquad \sum_{j=1}^M \hat\lambda^\pi_j = 1.$

Equation (2.7), together with the above constraints, leads to the following closed form solution:

(2.8)  $\hat\lambda^\pi_j = \frac{\exp(-n\hat R_n(f_j)/\beta)\,\pi_j}{\sum_{k=1}^M \exp(-n\hat R_n(f_k)/\beta)\,\pi_k}, \qquad j = 1,\dots,M,$

called the exponential weights. We see that one immediate effect of penalizing by the Kullback-Leibler divergence is that the solution of (2.6) is not a selector. As a result, it achieves the desired effect of averaging as opposed to selecting.

3. ORACLE INEQUALITIES

An aggregate is an estimator defined as a weighted average of the functions in the dictionary $\mathcal{H}$ with some data-dependent weights. We focus on the aggregate with exponential weights:

$\hat f^\pi = \sum_{j=1}^M \hat\lambda^\pi_j f_j,$

where $\hat\lambda^\pi_j$ is given in (2.8). This estimator satisfies the following balanced oracle inequality.

Theorem 3.1. The aggregate $\hat f^\pi$ with $\beta \ge 4\sigma^2$ satisfies the following balanced oracle inequality

(3.1)  $\mathbb{E} R(\hat f^\pi) \le \min_{\lambda\in\Lambda^M} \Big\{ \sum_{j=1}^M \lambda_j R(f_j) + \frac{\beta}{n+1}\mathcal{K}(\lambda,\pi) \Big\}.$

The proof can be found in the papers of Dalalyan and Tsybakov (2007, 2008), which contain more general results. In particular, their results apply to non-Gaussian distributions of the errors $\xi_i$ and to exponential weights with a general (not necessarily discrete) probability distribution $\pi$ on $\mathbb{R}^M$. Dalalyan and Tsybakov (2007, 2008) show that the corresponding exponentially weighted aggregate $\hat f^\pi_*$ satisfies the following bound

(3.2)  $\mathbb{E} R(\hat f^\pi_*) \le \inf_p \Big\{ \int R(f_\theta)\, p(d\theta) + \frac{\beta}{n+1}\mathcal{K}(p,\pi) \Big\},$

where the minimum is taken over all probability distributions $p$ on $\mathbb{R}^M$, and $\mathcal{K}(p,\pi)$ denotes the Kullback-Leibler divergence between the general probability measures $p$ and $\pi$. The bound (3.1) follows immediately from (3.2) by taking $p$ and $\pi$ to be discrete distributions.
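As a concrete illustration, the sketch below computes the exponential weights (2.8) and the corresponding aggregate $\hat f^\pi$ for a uniform prior, with $\beta = 4\sigma^2$ as allowed by Theorem 3.1. The simulated data and dictionary are the same illustrative assumptions as above; subtracting the maximum before exponentiating is a standard numerical safeguard, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, sigma = 100, 50, 0.5
x = np.linspace(0.0, 1.0, n)
eta = np.sin(2 * np.pi * x)
Y = eta + sigma * rng.standard_normal(n)
F = np.column_stack([np.sin((j + 1) * np.pi * x) for j in range(M)])

emp_risk = np.mean((Y[:, None] - F) ** 2, axis=0)   # empirical risks R_n(f_j), j = 1..M
beta = 4 * sigma ** 2                               # temperature allowed by Theorem 3.1
prior = np.full(M, 1.0 / M)                         # uniform prior pi

# Exponential weights (2.8); the max-shift inside exp() avoids numerical underflow
log_w = -n * emp_risk / beta + np.log(prior)
log_w -= log_w.max()
lam = np.exp(log_w)
lam /= lam.sum()

f_pi = F @ lam                                      # aggregate: sum_j lam_j f_j
print("risk of aggregate:      ", np.mean((f_pi - eta) ** 2))
print("risk of best single f_j:", np.min(np.mean((F - eta[:, None]) ** 2, axis=0)))
```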
A useful consequence of (3.1) can be obtained by restricting the minimum on the right-hand side to the vertices of the simplex $\Lambda^M$. These vertices are precisely the vectors $e^{(1)},\dots,e^{(M)}$ that form the canonical basis of $\mathbb{R}^M$, so that

$\sum_{j=1}^M e^{(k)}_j R(f_j) = R(f_k).$

It yields

(3.3)  $\mathbb{E} R(\hat f^\pi) \le \min_{1\le j\le M} \Big\{ R(f_j) + \frac{\beta}{n+1}\log(\pi_j^{-1}) \Big\}.$

Taking $\pi$ to be the uniform distribution on $\{1,\dots,M\}$ leads to the following oracle inequality

(3.4)  $\mathbb{E} R(\hat f^\pi) \le \min_{1\le j\le M} R(f_j) + \frac{\beta}{n+1}\log M,$

which exhibits a remainder term of the optimal order $(\log M)/n$.

The role of the distribution $\pi$ is to put a prior weight on the functions in the dictionary. When there is no preference, the uniform prior is quite a common choice. However, we will see in Section 5 that choosing non-uniform weights depending on suitable sparsity characteristics can be very useful. Moreover, this methodology can be extended to many cases where one wishes to learn with a prior. It is worth mentioning that, while the terminology is reminiscent of a Bayesian setup, this paper deals only with a frequentist setting (the risk is not averaged over the prior).

4. AGGREGATION OF ESTIMATORS

4.1 Sample splitting

Akin to the setting of the previous section, exponential weights were originally introduced to aggregate deterministic functions from a dictionary. These functions can be chosen in essentially two ways. Either they have good approximation properties, such as an (over-complete) basis of functions, or they are constructed as preliminary estimators using a held-out sample. The latter case corresponds to the problem of aggregation of estimators originally described in Nemirovski (2000). The idea put forward by Nemirovski, and later extended to several scenarios (see, e.g., Yang, 2004; Rigollet and Tsybakov, 2007; Lecué, 2007), consists in splitting the sample at hand into two parts. (More precisely, Nemirovski (2000) considered a randomized procedure rather than sample splitting: the two samples were obtained from the original one by a randomization, and the argument was restricted to the Gaussian model.) The first part is used to construct estimators, while the second is used to perform aggregation, in particular to construct exponential weights. To carry out the analysis, it is standard to work conditionally on the first sample, so that the problem is equivalent to working with a deterministic dictionary of functions.

Nevertheless, the idea of sample splitting does not carry over to independent samples that are not identically distributed, as in the present setup. Indeed, the observations in the first sample no longer have the same distribution as those in the second sample. To overcome this limitation, one would like to aggregate estimators using the same observations for both estimation and aggregation. While for general estimators this approach would clearly lead to overfitting, it has been shown that it yields good oracle inequalities for certain types of estimators, first for projection estimators (Leung and Barron, 2006) and more recently for a more general class of linear (affine) estimators (Dalalyan and Salmon, 2011).

4.2 Aggregation of linear estimators

Suppose that we are given a finite family $\{\hat f_1,\dots,\hat f_K\}$ of linear estimators defined by

(4.1)  $\hat f_j(x) = Y^\top a_j(x),$

where the $a_j(\cdot)$ are given functions with values in $\mathbb{R}^n$. This representation is quite general; for example, the $\hat f_j$ can be ordinary least squares estimators, (kernel) ridge regression estimators, diagonal linear filter estimators, etc. (see Kneip, 1994; Dalalyan and Salmon, 2011, for a longer list of relevant examples). The vector of values $(\hat f_j(x_i), i = 1,\dots,n)$ equals $A_j Y$, where $A_j$ is the $n \times n$ matrix with rows $a_j(x_i)^\top$, $i = 1,\dots,n$.
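The following sketch builds a small family of linear estimators of the form (4.1), here projection (least squares) estimators onto nested subsets of the dictionary, and exposes the matrices $A_j$ through their traces. The helper `projector`, the particular subsets and all numerical choices are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, sigma = 100, 8, 0.5
x = np.linspace(0.0, 1.0, n)
eta = np.sin(2 * np.pi * x)
Y = eta + sigma * rng.standard_normal(n)
F = np.column_stack([np.sin((j + 1) * np.pi * x) for j in range(M)])

def projector(cols):
    """n x n projection matrix A onto the span of the selected dictionary columns."""
    Xs = F[:, cols]
    # A = Xs (Xs^T Xs)^+ Xs^T ; the pseudo-inverse handles possible rank deficiency
    return Xs @ np.linalg.pinv(Xs.T @ Xs) @ Xs.T

# A small family of linear estimators \hat f_j = A_j Y, as in (4.1)
subsets = [[0], [0, 1], [0, 1, 2], list(range(M))]
A_list = [projector(s) for s in subsets]

for s, A in zip(subsets, A_list):
    fhat = A @ Y
    print(f"subset {s}: Tr[A] = {np.trace(A):.1f}, risk = {np.mean((fhat - eta) ** 2):.4f}")
```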
Now we would like to consider mixtures of such estimators rather than mixtures of deterministic functions as in the previous sections. For this purpose, exponential weights have to be slightly modified. Indeed, note that in Section 2 the risk of a deterministic function $f_j$ is simply estimated by the empirical risk $\hat R_n(f_j)$, which is plugged into the expression for the weights. Clearly, $\mathbb{E}\hat R_n(f_j) = R(f_j) + \sigma^2$, so that $\hat R_n(f_j)$ is an unbiased estimator of the risk $R(f_j)$ of $f_j$ up to the additive constant $\sigma^2$, which does not depend on $j$ and therefore does not affect the weights. For a linear estimator $\hat f_j$ defined in (4.1), $\hat R_n(\hat f_j)$ is no longer an unbiased estimator of the risk $R(\hat f_j)$, even up to a constant. It is well known that the risk of the linear estimator $\hat f_j$ has the form

$\mathbb{E} R(\hat f_j) = \|(A_j - I)\eta\|^2 + \frac{\sigma^2}{n}\mathrm{Tr}[A_j^\top A_j],$

where $\mathrm{Tr}[A]$ denotes the trace of a matrix $A$, and $I$ denotes the $n \times n$ identity matrix. Moreover, an unbiased estimator of $\mathbb{E} R(\hat f_j)$ is given by a version of Mallows' $C_p$:

(4.2)  $\tilde R_n^{\mathrm{unb}}(\hat f_j) = \|Y - \hat f_j\|^2 + \frac{2\sigma^2}{n}\mathrm{Tr}[A_j] - \sigma^2.$

Then, for linear estimators, the exponential weights and the corresponding aggregate are modified as follows:

(4.3)  $\hat\lambda^\pi_j = \frac{\exp(-n\tilde R_n^{\mathrm{unb}}(\hat f_j)/\beta)\,\pi_j}{\sum_{k=1}^K \exp(-n\tilde R_n^{\mathrm{unb}}(\hat f_k)/\beta)\,\pi_k}, \qquad \hat f^\pi = \sum_{k=1}^K \hat\lambda^\pi_k \hat f_k.$

Note that for deterministic $f_j$ we naturally define $\tilde R_n^{\mathrm{unb}}(f_j) = \hat R_n(f_j)$, so that definition (4.3) remains consistent with (2.8). With this more general definition of exponential weights, Dalalyan and Salmon (2011) prove the following risk bounds for the aggregate $\hat f^\pi$.

Theorem 4.1. Let $\{\hat f_1,\dots,\hat f_K\}$ be a family of linear estimators defined in (4.1) such that the matrices $A_j$ are symmetric, positive definite, and $A_j A_k = A_k A_j$ for all $1 \le j,k \le K$. Then the exponentially weighted aggregate $\hat f^\pi$ defined in (4.3) with $\beta \ge 8\sigma^2$ satisfies

(4.4)  $\mathbb{E} R(\hat f^\pi) \le \min_{\lambda\in\Lambda^K} \Big\{ \sum_{j=1}^K \lambda_j\, \mathbb{E} R(\hat f_j) + \frac{\beta}{n}\mathcal{K}(\lambda,\pi) \Big\},$

(4.5)  $\mathbb{E} R(\hat f^\pi) \le \min_{j=1,\dots,K} \Big\{ \mathbb{E} R(\hat f_j) + \frac{\beta}{n}\log(\pi_j^{-1}) \Big\}.$

If all the $A_j$ are projection matrices ($A_j^\top = A_j$, $A_j^2 = A_j$), then the above inequalities hold with $\beta \ge 4\sigma^2$.

Here, the bound (4.5) follows immediately from (4.4). In the rest of the paper, we mainly use the last part of this theorem, concerning projection estimators. The bound (4.5) for this particular case was originally proved in Leung and Barron (2006). The result of Dalalyan and Salmon (2011) is, in fact, more general than Theorem 4.1, covering non-discrete priors in the spirit of (3.2), and it applies not only to linear but also to affine estimators $\hat f_j$.
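A minimal sketch of the aggregation of projection estimators: the unbiased risk estimate (4.2) replaces the empirical risk in the exponential weights (4.3). The family of projectors, the uniform prior and all numerical choices are the same illustrative assumptions as above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, sigma = 100, 8, 0.5
x = np.linspace(0.0, 1.0, n)
eta = np.sin(2 * np.pi * x)
Y = eta + sigma * rng.standard_normal(n)
F = np.column_stack([np.sin((j + 1) * np.pi * x) for j in range(M)])

def projector(cols):
    Xs = F[:, cols]
    return Xs @ np.linalg.pinv(Xs.T @ Xs) @ Xs.T

subsets = [[0], [0, 1], [0, 1, 2], list(range(M))]
A_list = [projector(s) for s in subsets]
K = len(A_list)

def unbiased_risk(A):
    """Unbiased risk estimate (4.2): ||Y - A Y||^2 + (2 sigma^2 / n) Tr[A] - sigma^2."""
    fhat = A @ Y
    return np.mean((Y - fhat) ** 2) + 2 * sigma ** 2 * np.trace(A) / n - sigma ** 2

r_unb = np.array([unbiased_risk(A) for A in A_list])

beta = 4 * sigma ** 2                  # projection matrices: beta >= 4 sigma^2 (Theorem 4.1)
prior = np.full(K, 1.0 / K)
log_w = -n * r_unb / beta + np.log(prior)
log_w -= log_w.max()
lam = np.exp(log_w); lam /= lam.sum()  # modified exponential weights (4.3)

f_pi = sum(l * (A @ Y) for l, A in zip(lam, A_list))   # aggregate of the linear estimators
print("weights:", np.round(lam, 3), " risk of aggregate:", np.mean((f_pi - eta) ** 2))
```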
5. SPARSE ESTIMATION

A family of projection estimators that we consider in this section is the family of all $2^M$ least squares estimators, each of which is characterized by its sparsity pattern. We examine properties of these estimators, and show that their mixtures with exponential weights satisfy powerful sparsity oracle inequalities for suitably chosen priors $\pi$.

5.1 Sparsity pattern aggregation

Assume that we are given a dictionary of functions $\mathcal{H} = \{f_1,\dots,f_M\}$. However, we will not aggregate the elements of the dictionary but rather least squares estimators depending on all the $f_j$. We denote by $X$ the $n \times M$ design matrix with elements $X_{i,j} = f_j(x_i)$, $i = 1,\dots,n$, $j = 1,\dots,M$.

A sparsity pattern is a binary vector $p \in \mathcal{P} := \{0,1\}^M$. The terminology comes from the fact that the coordinates of any such vector can be interpreted as indicators of presence ($p_j = 1$) or absence ($p_j = 0$) of a given feature indexed by $j \in \{1,\dots,M\}$. We denote by $|p|$ the number of ones in the sparsity pattern $p$, and by $\mathbb{R}^p$ the space defined by

$\mathbb{R}^p = \{\theta \cdot p : \theta \in \mathbb{R}^M\} \subset \mathbb{R}^M,$

where $\theta \cdot p \in \mathbb{R}^M$ denotes the Hadamard product between $\theta$ and $p$, defined as the vector with coordinates $(\theta \cdot p)_j = \theta_j p_j$, $j = 1,\dots,M$.

For any $p \in \mathcal{P}$, let $\hat\theta_p$ be any least squares estimator defined by

(5.1)  $\hat\theta_p \in \mathop{\mathrm{argmin}}_{\theta\in\mathbb{R}^p} \|Y - f_\theta\|^2 \quad \text{with} \quad f_\theta = \sum_{j=1}^M \theta_j f_j.$

The following simple lemma gives an oracle inequality for the least squares estimator. It follows easily from the Pythagorean theorem. Moreover, the random variables $\xi_1,\dots,\xi_n$ need not be Gaussian for the result to hold.

Lemma 5.1. Fix $p \in \mathcal{P}$. Then any least squares estimator $\hat\theta_p$ defined in (5.1) satisfies

(5.2)  $\mathbb{E}\|f_{\hat\theta_p} - \eta\|^2 = \min_{\theta\in\mathbb{R}^p}\|f_\theta - \eta\|^2 + \sigma^2\frac{R_p}{n} \le \min_{\theta\in\mathbb{R}^p}\|f_\theta - \eta\|^2 + \sigma^2\frac{|p|}{n},$

where $R_p$ is the dimension of the linear subspace $\{X\theta : \theta \in \mathbb{R}^p\}$.

Clearly, if $|p|$ is small compared to $n$, the oracle inequality gives a good performance guarantee for the least squares aggregate $f_{\hat\theta_p}$. Nevertheless, it may be the case that the approximation error $\min_{\theta\in\mathbb{R}^p}\|f_\theta - \eta\|^2$ is quite large. Hence, we are looking for a sparsity pattern $p$ such that $|p|$ is small and that yields a least squares aggregate with small approximation error. This is clearly a model selection problem as described in Section 1.

Observe that for each sparsity pattern $p \in \mathcal{P}$, the function $f_{\hat\theta_p}$ is a projection estimator of the form $f_{\hat\theta_p} = A_p Y$, where the $n \times n$ matrix $A_p$ is the projector onto the linear span of $\{f_j \in \mathcal{H} : p_j = 1\}$ (as above, we identify the functions $f_j$, $f_{\hat\theta_p}$ with the vectors of their values at the points $x_1,\dots,x_n$, since the risk depends only on these values). Therefore $\mathrm{Tr}[A_p] = R_p$. We have seen in the previous section that projection estimators can be aggregated to solve the problem of model selection using exponential weights. Thus, instead of selecting the best sparsity pattern, we resort to taking convex combinations, leading to what is called sparsity pattern aggregation. For any sparsity pattern $p \in \mathcal{P}$, define the exponential weights $\hat\lambda^\pi_p$ and the sparsity pattern aggregate $\tilde f^\pi$ respectively by

$\hat\lambda^\pi_p = \frac{\exp(-n\tilde R_n^{\mathrm{unb}}(f_{\hat\theta_p})/\beta)\,\pi_p}{\sum_{p'\in\mathcal{P}}\exp(-n\tilde R_n^{\mathrm{unb}}(f_{\hat\theta_{p'}})/\beta)\,\pi_{p'}}, \qquad \tilde f^\pi = \sum_{p\in\mathcal{P}} \hat\lambda^\pi_p f_{\hat\theta_p}.$
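A direct, brute-force sketch of sparsity pattern aggregation for a tiny dictionary: all $2^M$ patterns are enumerated, and each least squares fit is weighted using (4.2). The uniform prior over patterns is a placeholder for the sparsity priors $\pi_p$ discussed in the remainder of Section 5, and exhaustive enumeration is only feasible for very small $M$; all data-generating choices remain illustrative assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, M, sigma = 100, 6, 0.5
x = np.linspace(0.0, 1.0, n)
eta = np.sin(2 * np.pi * x)                 # depends only on the first dictionary element here
Y = eta + sigma * rng.standard_normal(n)
X = np.column_stack([np.sin((j + 1) * np.pi * x) for j in range(M)])  # design X_{i,j} = f_j(x_i)

beta = 4 * sigma ** 2
patterns = [np.array(p) for p in product([0, 1], repeat=M)]  # all 2^M sparsity patterns
prior = np.full(len(patterns), 1.0 / len(patterns))          # placeholder uniform prior pi_p

fits, log_w = [], []
for k, p in enumerate(patterns):
    cols = np.flatnonzero(p)
    if cols.size:
        Xp = X[:, cols]
        A = Xp @ np.linalg.pinv(Xp.T @ Xp) @ Xp.T    # projector onto span{f_j : p_j = 1}
        fit, trA = A @ Y, np.trace(A)
    else:
        fit, trA = np.zeros(n), 0.0                  # empty pattern: the zero function
    fits.append(fit)
    r_unb = np.mean((Y - fit) ** 2) + 2 * sigma ** 2 * trA / n - sigma ** 2   # (4.2)
    log_w.append(-n * r_unb / beta + np.log(prior[k]))

log_w = np.array(log_w); log_w -= log_w.max()
lam = np.exp(log_w); lam /= lam.sum()                # exponential weights over patterns
f_spa = np.einsum("k,kn->n", lam, np.array(fits))    # sparsity pattern aggregate
print("risk of sparsity pattern aggregate:", np.mean((f_spa - eta) ** 2))
```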