Optimal learning with Q-aggregation

Guillaume Lecué* and Philippe Rigollet†

CNRS, Université Paris-Est and Princeton University

Abstract. We consider a general supervised learning problem with strongly convex and Lipschitz loss and study the problem of model selection aggregation. In particular, given a finite dictionary of functions (learners) together with a prior, we generalize the results obtained by Dai, Rigollet and Zhang (2012) for Gaussian regression with squared loss and fixed design to this learning setup. Specifically, we prove that the Q-aggregation procedure outputs an estimator that satisfies optimal oracle inequalities both in expectation and with high probability. Our proof techniques somewhat depart from traditional proofs by making most of the standard arguments bear directly on the Laplace transform of the empirical process to be controlled.

AMS 2000 subject classifications: Primary 62H25; secondary 62F04, 90C22.

Key words and phrases: Learning theory, empirical risk minimization, aggregation, empirical processes theory.

Guillaume Lecué, CNRS, LAMA, Université Paris-Est Marne-la-Vallée, 77454 France (e-mail: [email protected]).
Philippe Rigollet, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA (e-mail: [email protected]).
*Supported by French Agence Nationale de la Recherche ANR Grant "Prognostic" ANR-09-JCJC-0101-01.
†Supported in part by NSF grants DMS-0906424, CAREER-DMS-1053987 and a gift from the Bendheim Center for Finance.

1. INTRODUCTION AND MAIN RESULTS

Let 𝒳 be a probability space and let (X,Y) ∈ 𝒳 × IR be a random couple. Broadly speaking, the goal of statistical learning is to predict Y given X. To achieve this goal, we observe a dataset D = {(X_1,Y_1),...,(X_n,Y_n)} that consists of n independent copies of (X,Y) and use these observations to construct a function (learner) f : 𝒳 → IR such that f(X) is close to Y in a certain sense. More precisely, the prediction quality of a (possibly data dependent) function f̂ is measured by a risk function R : IR^𝒳 → IR associated to a loss function ℓ : IR² → IR in the following way

    R(f̂) = IE[ ℓ(Y, f̂(X)) | D ].

We focus hereafter on loss functions ℓ that are convex in their second argument. Moreover, for the sake of simplicity, throughout this article we restrict ourselves to functions f and random variables (X,Y) for which |Y| ≤ b and |f(X)| ≤ b almost surely, for some fixed b ≥ 0. For any real-valued measurable function f on 𝒳 for which this quantity is finite, we define ‖f‖₂ = (IE[f(X)²])^{1/2}.

We are given a finite set F = {f_1,...,f_M} of measurable functions from 𝒳 to IR. This set is called a dictionary. The elements of F may have been constructed using an independent, frozen, dataset at some previous step or may simply be good candidates for the learning task at hand. To focus our contribution on the aggregation problem, we restrict our attention to the case where F consists of deterministic functions. The aim of model selection aggregation [27, 7, 8, 31] is to use the data D to construct a function f̂ whose excess risk R(f̂) − min_{f∈F} R(f) is as small as possible. Namely, we seek the smallest deterministic residual term Δ_n(F) > 0 such that the excess risk is bounded above by Δ_n(F), either in expectation or with high probability, or, in this instance, in both. In the high probability case, such bounds are called oracle inequalities.
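To fix ideas, the short simulation below instantiates these objects on synthetic data: a bounded pair (X,Y), a small dictionary F of deterministic learners, and Monte Carlo approximations of the risks R(f_j). The data-generating mechanism, the dictionary and the use of the squared loss are illustrative choices only and are not part of the analysis of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
b = 1.0  # boundedness level: |Y| <= b and |f(X)| <= b almost surely

def sample(n):
    """Draw n i.i.d. copies of (X, Y) from a toy bounded regression model."""
    X = rng.uniform(-1.0, 1.0, size=n)
    Y = np.clip(np.sin(np.pi * X) + 0.3 * rng.standard_normal(n), -b, b)
    return X, Y

# A small dictionary F = {f_1, ..., f_M} of deterministic candidate learners.
dictionary = [
    lambda x: np.clip(0.5 * x, -b, b),
    lambda x: np.clip(np.sin(np.pi * x), -b, b),
    lambda x: np.clip(x ** 2 - 0.5, -b, b),
]

def loss(y, u):
    """Squared loss; it is Lipschitz and strongly convex on [-b, b]."""
    return (y - u) ** 2

def risk(f, n_mc=200_000):
    """Monte Carlo approximation of R(f) = IE[loss(Y, f(X))]."""
    X, Y = sample(n_mc)
    return loss(Y, f(X)).mean()

risks = np.array([risk(f) for f in dictionary])
print("approximate risks R(f_j):", np.round(risks, 4))
print("oracle value min_j R(f_j):", round(float(risks.min()), 4))
```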
This problem was studied for instance in [2, 3, 6, 7, 14, 27, 18, 19, 23, 31, 32, 33, 34]. From a minimax standpoint, it has been proved that Δ_n(F) = C(log M)/n, C > 0, is the smallest residual term that one can hope for in the regression problem with quadratic loss [31]. An estimator f̂ achieving such a rate (up to a multiplicative constant) is called an optimal aggregate. The aim of this paper is to construct optimal aggregates under general conditions on the loss function ℓ.

Note that the optimal residuals for model selection aggregation are of the order 1/n as opposed to the standard parametric rate 1/√n. This fast rate essentially comes from the strong convexity of the quadratic loss. In what follows we show that, indeed, strong convexity is sufficient to obtain fast rates. It is known that rates of order 1/n cannot be achieved if the loss function is only assumed to be convex. Indeed, it follows from [21], Theorem 2, that if the loss is linear then the best achievable residual term is at least of the order √((log|F|)/n). Recall that a function g is said to be strongly convex on a nonempty convex set C ⊂ IR if there exists a constant c > 0 such that

    g(αa + (1−α)a′) ≤ αg(a) + (1−α)g(a′) − (c/2) α(1−α)(a − a′)²,

for any a, a′ ∈ C and α ∈ (0,1). In this case, c is called the modulus of strong convexity. For technical reasons, we will also need to assume that the loss function is Lipschitz. We now introduce the set of assumptions that are sufficient for our approach.

Assumption 1. The loss function ℓ is such that for any f, g ∈ [−b, b], we have

    |ℓ(Y, f) − ℓ(Y, g)| ≤ C_b |f − g|,  a.s.

Moreover, almost surely, the function ℓ(Y, ·) is strongly convex with modulus of strong convexity C_ℓ on [−b, b].

A central quantity that is used for the construction of aggregates is the empirical risk defined by

(1.1)    R_n(f) = (1/n) Σ_{i=1}^n ℓ(Y_i, f(X_i))

for any real-valued function f defined over 𝒳. A natural aggregation procedure consists in taking the function in F that minimizes the empirical risk. This procedure is called empirical risk minimization (ERM). It has been proved that ERM is suboptimal for the aggregation problem [19, 7, 24, 22, 26, 30]. Somehow, this procedure does not take advantage of the convexity of the loss since the class of functions over which the empirical risk is minimized to construct the ERM is F, a finite set. As it turns out, the performance of ERM relies critically on the convexity of the class of functions over which the empirical risk is minimized [26, 24]. Therefore, a natural idea is to "improve the geometry" of F by taking its convex hull conv(F) and then minimizing the empirical risk over it. However, this procedure is also suboptimal [23, 9]. The weak point of this procedure lies in the metric complexity of the problem: taking the convex hull of F indeed "improves the geometry" of F but it also increases its complexity by too much. The complexity of the convex hull of a set can be much larger than the complexity of the set itself, and this leads to the failure of this naive convexification trick. Nevertheless, a compromise between geometry and complexity was struck in [2] and [23], where optimal aggregates have been successfully constructed. In [2], this improvement is achieved by minimizing the empirical risk over a carefully chosen star-shaped subset of the convex hull of F. In [23], a better geometry was achieved by taking the convex hull of an appropriate subset of F and then minimizing the empirical risk over it.
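For comparison with what follows, here is a minimal sketch of plain ERM over the finite dictionary: the empirical risk (1.1) is computed for each f_j on the sample D and a minimizer is returned. It continues the toy sketch from the beginning of this section and reuses the illustrative objects sample, dictionary and loss defined there.

```python
import numpy as np

def empirical_risk(f, X, Y):
    """R_n(f) = (1/n) * sum_i loss(Y_i, f(X_i)), cf. (1.1)."""
    return loss(Y, f(X)).mean()

def erm(dictionary, X, Y):
    """Empirical risk minimization over the finite dictionary F."""
    emp_risks = np.array([empirical_risk(f, X, Y) for f in dictionary])
    j_hat = int(emp_risks.argmin())
    return j_hat, emp_risks

X, Y = sample(200)
j_hat, emp_risks = erm(dictionary, X, Y)
print("empirical risks R_n(f_j):", np.round(emp_risks, 4))
print("ERM selects f_%d" % (j_hat + 1))
```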
In this paper, we show that a third procedure, called Q-aggregation, which was introduced in [28, 9] for fixed design Gaussian regression, also leads to optimal rates of aggregation. Unlike the above two procedures, which rely on finding an appropriate constraint set for ERM, Q-aggregation is based on a penalization of the empirical risk while the constraint set is kept to be the convex hull of F. Let Θ denote the flat simplex of IR^M defined by

    Θ = { (θ_1,...,θ_M) ∈ IR^M : θ_j ≥ 0, Σ_{j=1}^M θ_j = 1 }

and for any θ ∈ Θ, define the convex combination f_θ = Σ_{j=1}^M θ_j f_j. For any fixed ν, the Q-functional is defined for any θ ∈ Θ by

(1.2)    Q(θ) = (1−ν) R_n(f_θ) + ν Σ_{j=1}^M θ_j R_n(f_j).

We keep the terminology Q-aggregation from [9] on purpose. Indeed, Q stands for quadratic and, while we do not employ a quadratic loss, we exploit strong convexity in the same manner as in [9] and [28]. Indeed, the first term in Q acts as a regularization of the linear interpolation of the empirical risk and is therefore a strongly convex regularization.

We consider the following aggregation procedure. Unlike the procedures introduced in [2, 23], the Q-aggregation procedure allows us to put a prior weight, given by a prior probability π = (π_1,...,π_M), on each element of the dictionary F. This feature turns out to be crucial for applications [1, 10, 11, 13, 14, 15, 12, 16, 29, 30]. Let β > 0 be the temperature parameter and 0 < ν < 1. Consider any vector of weights θ̂ ∈ Θ defined by

(1.3)    θ̂ ∈ argmin_{θ∈Θ} [ (1−ν) R_n(f_θ) + ν Σ_{j=1}^M θ_j R_n(f_j) − (β/n) Σ_{j=1}^M θ_j log π_j ].

It comes out of our analysis that f_θ̂ achieves an optimal rate of aggregation if β satisfies

(1.4)    β > max[ 12 C_b²(1−ν)²/µ , 6√3 b C_b(1−ν) , 3 C_b ν(νC_b + 4µb)/(2µ) ],

where µ = min(ν, 1−ν) C_ℓ /10.

Theorem A. Let F be a finite dictionary of cardinality M and (X,Y) be a random couple in 𝒳 × IR such that |Y| ≤ b and max_{f∈F} |f(X)| ≤ b a.s. for some b > 0. Assume that Assumption 1 holds and that β satisfies (1.4). Then, for any x > 0, with probability greater than 1 − exp(−x),

    R(f_θ̂) ≤ min_{j=1,...,M} [ R(f_j) + (β/n) log(1/π_j) ] + 2βx/n.

Moreover,

    IE[ R(f_θ̂) ] ≤ min_{j=1,...,M} [ R(f_j) + (β/n) log(1/π_j) ].

If π is the uniform distribution, that is π_j = 1/M for all j = 1,...,M, then we recover in Theorem A the classical optimal rate of aggregation (log M)/n and the estimator θ̂ is just the one minimizing the Q-functional defined in (1.2). In particular, no temperature parameter is needed for its construction. As a result, in this case, the parameter b need not be known for the construction of the Q-aggregation procedure.
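Theorem A only requires θ̂ to be some minimizer of the convex program (1.3) and does not prescribe an optimization algorithm (a greedy scheme is analyzed in [9]). The sketch below computes such a minimizer for the squared loss by projected gradient descent over Θ; the choice of optimizer, the step size, the number of iterations and the reuse of the toy dictionary from the earlier sketches are illustrative assumptions, not part of the paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the flat simplex Theta."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def q_aggregate(F_pred, Y, prior, nu=0.5, beta=1.0, n_iter=5000, lr=0.05):
    """Projected-gradient sketch of the Q-aggregation program (1.3).

    F_pred : (n, M) matrix of dictionary predictions f_j(X_i).
    Y      : (n,) response vector.
    prior  : (M,) prior weights pi_j, positive and summing to one.
    The squared loss is used for concreteness; any strongly convex and
    Lipschitz loss with an explicit derivative would do.
    """
    n, M = F_pred.shape
    col_risks = ((Y[:, None] - F_pred) ** 2).mean(axis=0)   # R_n(f_j)
    log_prior = np.log(prior)
    theta = np.full(M, 1.0 / M)                             # start at the barycenter
    for _ in range(n_iter):
        resid = Y - F_pred @ theta                          # Y_i - f_theta(X_i)
        grad_mix = -(2.0 / n) * (F_pred.T @ resid)          # gradient of R_n(f_theta)
        grad = (1.0 - nu) * grad_mix + nu * col_risks - (beta / n) * log_prior
        theta = project_simplex(theta - lr * grad)
    return theta

# Usage with the toy dictionary from the earlier sketches (illustrative only).
X, Y = sample(300)
F_pred = np.column_stack([f(X) for f in dictionary])
prior = np.full(F_pred.shape[1], 1.0 / F_pred.shape[1])
theta_hat = q_aggregate(F_pred, Y, prior, nu=0.5, beta=2.0)
print("aggregation weights theta_hat:", np.round(theta_hat, 3))
```

With a uniform prior the penalty term is constant over Θ, so the iteration simply minimizes the Q-functional (1.2), in line with the remark above.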
2. PRELIMINARIES TO THE PROOF OF THEOREM A

An important part of our analysis is based upon concentration properties of empirical processes. While our proofs are similar to those employed in [28] and [9], they contain genuinely new arguments. In particular, this learning setting, unlike the denoising setting considered in [28, 9], allows us to employ various new tools such as symmetrization and contraction. A classical tool to quantify the concentration of measure phenomenon is given by Bernstein's inequality for bounded variables. In terms of Laplace transform, Bernstein's inequality [5, Theorem 1.10] states that if Z_1,...,Z_n are n i.i.d. real-valued random variables such that for all i = 1,...,n,

    |Z_i| ≤ c a.s. and IE Z_i² ≤ v,

then for any 0 < λ < 1/c,

(2.5)    IE exp( λ Σ_{i=1}^n { Z_i − IE Z_i } ) ≤ exp( nvλ² / (2(1−cλ)) ).

Bernstein's inequality usually yields a bound of order √n for the deviations of a sum around its mean. As mentioned above, such bounds are not sufficient for our purposes and we thus consider the following concentration result.

Proposition 1. Let Z_1,...,Z_n be i.i.d. real-valued random variables and let c_0 > 0. Assume that |Z_1| ≤ c a.s.. Then, for any 0 < λ < (2c_0)/(1+2c_0 c),

    IE exp[ nλ( (1/n) Σ_{i=1}^n (Z_i − IE Z_i) − c_0 IE Z_1² ) ] ≤ 1

and

    IE exp[ nλ( (1/n) Σ_{i=1}^n (IE Z_i − Z_i) − c_0 IE Z_1² ) ] ≤ 1.

Proof. It follows from Bernstein's inequality (2.5) that for any 0 < λ < (2c_0)/(1+2c_0 c),

    IE exp[ nλ( (1/n) Σ_{i=1}^n (Z_i − IE Z_i) − c_0 IE Z_1² ) ] ≤ exp( n IE Z_1² λ² / (2(1−cλ)) ) exp( −nλ c_0 IE Z_1² ) ≤ 1,

where the last inequality holds since 0 < λ < (2c_0)/(1+2c_0 c) implies both λ < 1/c and λ/(2(1−cλ)) ≤ c_0. The second inequality of the proposition is obtained by replacing Z_i by −Z_i.

We will also use the following exponential bound for Rademacher processes: let ε_1,...,ε_n be independent Rademacher random variables and a_1,...,a_n be real numbers; then, by Hoeffding's inequality,

(2.6)    IE exp( Σ_{i=1}^n ε_i a_i ) ≤ exp( (1/2) Σ_{i=1}^n a_i² ).

Our analysis also relies upon a geometric argument. Indeed, the strong convexity of the loss function in Assumption 1 implies the 2-convexity of the risk in the sense of [4]. This translates into a lower bound on the gain obtained when applying Jensen's inequality to the risk function R.

Proposition 2. Let (X,Y) be a random couple in 𝒳 × IR and F = {f_1,...,f_M} be a finite dictionary in L_2(𝒳, P_X) such that |f_j(X)| ≤ b for all j = 1,...,M and |Y| ≤ b a.s.. Assume that, almost surely, the function ℓ(Y,·) is strongly convex with modulus of strong convexity C_ℓ on [−b,b]. Then, it holds that, for any θ ∈ Θ,

(2.7)    R( Σ_{j=1}^M θ_j f_j ) ≤ Σ_{j=1}^M θ_j R(f_j) − (C_ℓ/2) Σ_{j=1}^M θ_j ‖ f_j − Σ_{k=1}^M θ_k f_k ‖₂².

Proof. Define the random function ℓ(·) = ℓ(Y,·). By strong convexity and [17], Theorem 6.1.2, it holds almost surely that for any a, a′ in [−b,b],

    ℓ(a) ≥ ℓ(a′) + (a − a′) ℓ′(a′) + (C_ℓ/2)(a − a′)²,

for any ℓ′(a′) in the subdifferential of ℓ at a′. Plugging in a = f_j(X) and a′ = f_θ(X), we get almost surely

    ℓ(Y, f_j(X)) ≥ ℓ(Y, f_θ(X)) + ( f_j(X) − f_θ(X) ) ℓ′( f_θ(X) ) + (C_ℓ/2) [ f_j(X) − f_θ(X) ]².

Now, multiplying both sides by θ_j and summing over j, we get almost surely,

    Σ_j θ_j ℓ(Y, f_j(X)) ≥ ℓ(Y, f_θ(X)) + (C_ℓ/2) Σ_j θ_j [ f_j(X) − f_θ(X) ]²,

since the first-order terms cancel: Σ_j θ_j ( f_j(X) − f_θ(X) ) = 0. To complete the proof, it remains to take the expectation.
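For instance, for the squared loss ℓ(y,a) = (y−a)² (an illustrative special case, not needed in the sequel), the modulus of strong convexity on [−b,b] is C_ℓ = 2 and (2.7) holds with equality. Indeed, since Σ_j θ_j f_j = f_θ, the cross terms in Y cancel and a direct computation gives

    Σ_{j=1}^M θ_j R(f_j) − R(f_θ) = IE[ Σ_{j=1}^M θ_j f_j(X)² − f_θ(X)² ] = Σ_{j=1}^M θ_j ‖ f_j − f_θ ‖₂²,

which is exactly the gain (C_ℓ/2) Σ_j θ_j ‖f_j − f_θ‖₂² appearing in Proposition 2.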
3. PROOF OF THEOREM A

Let x > 0 and assume that Assumption 1 holds throughout this section. We start with some notation. For any θ ∈ Θ, define

    ℓ_θ(y,x) = ℓ(y, f_θ(x)) and R(θ) = IE ℓ_θ(Y,X) = IE ℓ(Y, f_θ(X)),

where we recall that f_θ = Σ_{j=1}^M θ_j f_j for any θ ∈ IR^M. Let 0 < ν < 1, let (e_1,...,e_M) be the canonical basis of IR^M and for any θ ∈ IR^M define

    ℓ̃_θ(y,x) = (1−ν) ℓ_θ(y,x) + ν Σ_{j=1}^M θ_j ℓ_{e_j}(y,x) and R̃(θ) = IE ℓ̃_θ(Y,X).

We also consider the functions

    θ ∈ IR^M ↦ K(θ) = Σ_{j=1}^M θ_j log(1/π_j) and θ ∈ IR^M ↦ V(θ) = Σ_{j=1}^M θ_j ‖f_j − f_θ‖₂².

Let µ > 0. Consider any oracle θ* ∈ Θ such that

    θ* ∈ argmin_{θ∈Θ} ( R̃(θ) + µ V(θ) + (β/n) K(θ) ).

We start with a geometrical aspect of the problem. The following inequality follows from the strong convexity of the loss function ℓ.

Proposition 3. For any θ ∈ Θ,

    R̃(θ) − R̃(θ*) ≥ µ( V(θ*) − V(θ) ) + (β/n)( K(θ*) − K(θ) ) + ( (1−ν)C_ℓ/2 − µ ) ‖f_θ − f_θ*‖₂².

Proof. Since θ* is a minimizer of the (finite) convex function θ ↦ H(θ) = R̃(θ) + µV(θ) + (β/n)K(θ) over the convex set Θ, there exists a subgradient ∇H(θ*) such that ⟨∇H(θ*), θ − θ*⟩ ≥ 0 for any θ ∈ Θ. It yields

    ⟨∇R̃(θ*), θ − θ*⟩ ≥ µ ⟨∇V(θ*), θ* − θ⟩ + (β/n) ⟨∇K(θ*), θ* − θ⟩
(3.8)        = µ( V(θ*) − V(θ) ) − µ ‖f_θ − f_θ*‖₂² + (β/n)( K(θ*) − K(θ) ).

It follows from the strong convexity of ℓ(y,·) that

    R̃(θ) − R̃(θ*) ≥ ⟨∇R̃(θ*), θ − θ*⟩ + ((1−ν)C_ℓ/2) ‖f_θ − f_θ*‖₂²
        ≥ µ( V(θ*) − V(θ) ) + (β/n)( K(θ*) − K(θ) ) + ( (1−ν)C_ℓ/2 − µ ) ‖f_θ − f_θ*‖₂²,

where the second inequality follows from the previous display.

Let H be the M×M matrix with entries H_{j,k} = ‖f_j − f_k‖₂² for all 1 ≤ j,k ≤ M. Let s and x be positive numbers and consider the random variable

    Z_n = (P − P_n)( ℓ̃_θ̂ − ℓ̃_θ* ) − µ Σ_{j=1}^M θ̂_j ‖f_j − f_θ*‖₂² − µ θ̂ᵀHθ* − (1/s) K(θ̂).

Proposition 4. Assume that 10µ ≤ min(1−ν, ν) C_ℓ and β ≥ 3n/s. Then, it holds

    R(θ̂) ≤ min_{1≤j≤M} [ R(e_j) + (β/n) log(1/π_j) ] + 2 Z_n.

Proof. First note that the following equalities hold:

(3.9)    Σ_{j=1}^M θ̂_j ‖f_j − f_θ*‖₂² = V(θ̂) + ‖f_θ̂ − f_θ*‖₂²

and

(3.10)    θ̂ᵀHθ* = V(θ̂) + V(θ*) + ‖f_θ* − f_θ̂‖₂².

It follows from the definition of θ̂ that

(3.11)    R̃(θ̂) − R̃(θ*) ≤ (P − P_n)( ℓ̃_θ̂ − ℓ̃_θ* ) + (β/n)( K(θ*) − K(θ̂) ).

Plugging (3.9) and (3.10) into (3.11) and using the definition of Z_n, we obtain

(3.12)    R̃(θ̂) − R̃(θ*) ≤ 2µ V(θ̂) + µ V(θ*) + 2µ ‖f_θ̂ − f_θ*‖₂² + (1/s) K(θ̂) + (β/n)( K(θ*) − K(θ̂) ) + Z_n.

Together with Proposition 3 (applied with θ = θ̂), it yields

    ( (1−ν)C_ℓ/2 − 3µ ) ‖f_θ̂ − f_θ*‖₂² ≤ 3µ V(θ̂) + (1/s) K(θ̂) + Z_n.

We plug the above inequality into (3.12) to obtain

(3.13)    R̃(θ̂) − R̃(θ*) ≤ ( 1 + 2µ/((1−ν)C_ℓ/2 − 3µ) ) ( (1/s) K(θ̂) + Z_n ) + (β/n)( K(θ*) − K(θ̂) ) + µ V(θ*) + ( 2µ + 6µ²/((1−ν)C_ℓ/2 − 3µ) ) V(θ̂).

Thanks to the 2-convexity of the risk (cf. Proposition 2), we have R̃(θ̂) ≥ R(θ̂) + ν(C_ℓ/2) V(θ̂). Therefore, it follows from (3.13) that

(3.14)    R(θ̂) ≤ R̃(θ*) + µ V(θ*) + (β/n) K(θ*) + ( 1 + 4µ/((1−ν)C_ℓ − 6µ) ) Z_n + ( 2µ + 12µ²/((1−ν)C_ℓ − 6µ) − ν C_ℓ/2 ) V(θ̂) + ( 1/s + 8µ/(s((1−ν)C_ℓ − 6µ)) − β/n ) K(θ̂).

Note now that 10µ ≤ min(ν, 1−ν) C_ℓ implies that

    4µ/((1−ν)C_ℓ − 6µ) ≤ 1 and 2µ + 12µ²/((1−ν)C_ℓ − 6µ) − ν C_ℓ/2 ≤ 0.

Moreover, together, the two conditions of the proposition yield

    1/s + 8µ/(s((1−ν)C_ℓ − 6µ)) − β/n ≤ 0.

Therefore, it follows from the above three displays that

    R(θ̂) ≤ min_{θ∈Θ} [ R̃(θ) + µ V(θ) + (β/n) K(θ) ] + 2 Z_n ≤ min_{j=1,...,M} [ R(e_j) + (β/n) log(1/π_j) ] + 2 Z_n.

To complete our proof, it remains to prove that P[ Z_n > (βx)/n ] ≤ exp(−x) and IE[Z_n] ≤ 0 under suitable conditions on µ and β. Using a Chernoff bound and Jensen's inequality respectively, it is easy to see that both conditions follow if we prove that IE exp( n Z_n / β ) ≤ 1. It follows from the excess loss decomposition

    ℓ̃_θ̂(y,x) − ℓ̃_θ*(y,x) = (1−ν)( ℓ_θ̂(y,x) − ℓ_θ*(y,x) ) + ν Σ_{j=1}^M ( θ̂_j − θ*_j ) ℓ_{e_j}(y,x),

and the Cauchy-Schwarz inequality that it is enough to prove that

(3.15)    IE exp[ s( (1−ν)(P − P_n)( ℓ_θ̂ − ℓ_θ* ) − µ Σ_{j=1}^M θ̂_j ‖f_j − f_θ*‖₂² − (1/s) K(θ̂) ) ] ≤ 1

and

(3.16)    IE exp[ s( ν(P − P_n)( Σ_{j=1}^M ( θ̂_j − θ*_j ) ℓ_{e_j} ) − µ θ̂ᵀHθ* − (1/s) K(θ̂) ) ] ≤ 1,

for some s ≥ 2n/β.
Let s be as such in the rest of the proof.

We begin by proving (3.15). To that end, define the symmetrized empirical process by h ↦ P_{n,ε} h = n^{−1} Σ_{i=1}^n ε_i h(Y_i, X_i), where ε_1,...,ε_n are n i.i.d. Rademacher random variables independent of the (X_i, Y_i)'s. Moreover, take s and µ such that

(3.17)    s ≤ µn / [2C_b(1−ν)]².

It yields

    IE exp[ s( (1−ν)(P − P_n)( ℓ_θ̂ − ℓ_θ* ) − µ Σ_{j=1}^M θ̂_j ‖f_j − f_θ*‖₂² − (1/s) K(θ̂) ) ]
        ≤ IE exp[ s max_{θ∈Θ} ( (1−ν)(P − P_n)( ℓ_θ − ℓ_θ* ) − µ Σ_{j=1}^M θ_j ‖f_j − f_θ*‖₂² − (1/s) K(θ) ) ]
(3.18)        ≤ IE exp[ s max_{θ∈Θ} ( 2(1−ν) P_{n,ε}( ℓ_θ − ℓ_θ* ) − µ Σ_{j=1}^M θ_j ‖f_j − f_θ*‖₂² − (1/s) K(θ) ) ]
(3.19)        ≤ IE exp[ s max_{θ∈Θ} ( 2C_b(1−ν) P_{n,ε}( f_θ − f_θ* ) − µ Σ_{j=1}^M θ_j ‖f_j − f_θ*‖₂² − (1/s) K(θ) ) ],

where (3.18) follows from the symmetrization inequality [20, Theorem 2.1] and (3.19) follows from the contraction principle [25, Theorem 4.12] applied to the contractions φ_i(t_i) = C_b^{−1}[ ℓ(Y_i, f_θ*(X_i) − t_i) − ℓ(Y_i, f_θ*(X_i)) ] and to the set T ⊂ IR^n defined by T = { t ∈ IR^n : t_i = f_θ*(X_i) − f_θ(X_i), θ ∈ Θ }. Next, using the fact that the maximum of a linear function over a polytope is attained at a vertex, we get

    IE exp[ s( (1−ν)(P − P_n)( ℓ_θ̂ − ℓ_θ* ) − µ Σ_{j=1}^M θ̂_j ‖f_j − f_θ*‖₂² − (1/s) K(θ̂) ) ]
        ≤ Σ_{k=1}^M π_k IE IE_ε exp[ s( 2C_b(1−ν) P_{n,ε}( f_k − f_θ* ) − µ ‖f_k − f_θ*‖₂² ) ]
(3.20)        ≤ Σ_{k=1}^M π_k IE exp[ ( [2C_b(1−ν)s]² / (2n) ) ( P_n − (2µn / ([2C_b(1−ν)]² s)) P )( f_k − f_θ* )² ]
(3.21)        ≤ Σ_{k=1}^M π_k IE exp[ ( (2C_b(1−ν)s)² / (2n) ) ( (P_n − P)( f_k − f_θ* )² − (1/(4b²)) P( f_k − f_θ* )⁴ ) ],

where (3.20) follows from (2.6) and (3.21) follows from (3.17). Together with the above display, Proposition 1 yields (3.15) as long as

(3.22)    s < n / ( 2√3 b C_b(1−ν) ).

We now prove (3.16). We have

    IE exp[ s( ν(P − P_n)( Σ_{j=1}^M ( θ̂_j − θ*_j ) ℓ_{e_j} ) − µ θ̂ᵀHθ* − (1/s) K(θ̂) ) ]
        ≤ Σ_{j=1}^M θ*_j Σ_{k=1}^M π_k IE exp[ s( ν(P − P_n)( ℓ_{e_k} − ℓ_{e_j} ) − µ ‖f_j − f_k‖₂² ) ]
        ≤ Σ_{j=1}^M θ*_j Σ_{k=1}^M π_k IE exp[ sν( (P − P_n)( ℓ_{e_k} − ℓ_{e_j} ) − ( µ / (νC_b²) ) P( ℓ_{e_j} − ℓ_{e_k} )² ) ] ≤ 1,

where the last inequality follows from Proposition 1 when

(3.23)    s < 2µn / ( C_b ν( νC_b + 4µb ) ).

It is now straightforward to see that the conditions of Proposition 4, as well as (3.17), (3.22) and (3.23), are fulfilled when

    s = 3n/β,  µ = min(ν, 1−ν) C_ℓ / 10

and

    β > max[ 12 C_b²(1−ν)²/µ , 6√3 b C_b(1−ν) , 3 C_b ν( νC_b + 4µb )/(2µ) ].

This completes the proof of Theorem A.

REFERENCES

[1] Alquier, P., and Lounici, K. PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electronic Journal of Statistics 5 (2011), 127–145.
[2] Audibert, J.-Y. Progressive mixture rules are deviation suboptimal. Advances in Neural Information Processing Systems (NIPS) (2007).
[3] Audibert, J.-Y. Fast learning rates in statistical inference through aggregation. Ann. Statist. 37, 4 (2009), 1591–1646.
[4] Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. 101, 473 (2006), 138–156.
[5] Boucheron, S., Lugosi, G., and Massart, P. Concentration inequalities with applications. Clarendon Press, Oxford, 2012.
[6] Bunea, F., Tsybakov, A. B., and Wegkamp, M. H. Aggregation for Gaussian regression. Ann. Statist. 35, 4 (2007), 1674–1697.
[7] Catoni, O. Statistical learning theory and stochastic optimization, vol. 1851 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 2004.
Lecture notes from the 31st Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001.
[8] Catoni, O. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Institute of Mathematical Statistics Lecture Notes-Monograph Series, 56. Institute of Mathematical Statistics, Beachwood, OH, 2007.
[9] Dai, D., Rigollet, P., and Zhang, T. Deviation optimal learning using greedy Q-aggregation. Ann. Statist. (March 2012). arXiv:1203.2507.