Optimal learning with Q-aggregation
Guillaume Lecué∗ and Philippe Rigollet†
CNRS, Université Paris-Est and Princeton University
Abstract. We consider a general supervised learning problem with strongly convex and Lipschitz loss and study the problem of model selection aggregation. In particular, given a finite dictionary of functions (learners) together with a prior, we generalize the results obtained by Dai, Rigollet and Zhang (2012) for Gaussian regression with squared loss and fixed design to this learning setup. Specifically, we prove that the Q-aggregation procedure outputs an estimator that satisfies optimal oracle inequalities both in expectation and with high probability. Our proof techniques somewhat depart from traditional proofs: most of the standard arguments are carried out directly on the Laplace transform of the empirical process to be controlled.

AMS 2000 subject classifications: Primary 62H25; secondary 62F04, 90C22.
Key words and phrases: Learning theory, empirical risk minimization, aggregation, empirical processes theory.
1. INTRODUCTION AND MAIN RESULTS
Let $\mathcal{X}$ be a probability space and let $(X,Y) \in \mathcal{X} \times \mathbb{R}$ be a random couple. Broadly speaking, the goal of statistical learning is to predict $Y$ given $X$. To achieve this goal, we observe a dataset $\mathcal{D} = \{(X_1,Y_1),\ldots,(X_n,Y_n)\}$ that consists of $n$ independent copies of $(X,Y)$ and use these observations to construct a function (learner) $f : \mathcal{X} \to \mathbb{R}$ such that $f(X)$ is close to $Y$ in a certain sense. More precisely, the prediction quality of a (possibly data dependent) function $\hat f$ is measured by a risk function $R : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}$ associated to a loss function $\ell : \mathbb{R}^2 \to \mathbb{R}$ in the following way:

$R(\hat f) = \mathbb{E}\big[\ell(Y, \hat f(X)) \,\big|\, \mathcal{D}\big].$

We focus hereafter on loss functions $\ell$ that are convex in their second argument. Moreover, for the sake of simplicity, throughout this article we restrict ourselves to functions $f$ and random variables $(X,Y)$ for which $|Y| \le b$ and $|f(X)| \le b$ almost surely, for some fixed $b \ge 0$.
Guillaume Lecué: CNRS, LAMA, Université Paris-Est Marne-la-Vallée, 77454 France (e-mail: guillaume.lecue@univ-mlv.fr). Philippe Rigollet: Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA (e-mail: rigollet@princeton.edu).
∗Supported by French Agence Nationale de la Recherche ANR Grant "Prognostic" ANR-09-JCJC-0101-01.
†Supported in part by NSF grants DMS-0906424, CAREER-DMS-1053987 and a gift from the Bendheim Center for Finance.
For any real-valued measurable function $f$ on $\mathcal{X}$ for which this quantity is finite, we define $\|f\|_2 = \sqrt{\mathbb{E}[f(X)^2]}$.
We are given a finite set $\mathcal{F} = \{f_1,\ldots,f_M\}$ of measurable functions from $\mathcal{X}$ to $\mathbb{R}$. This set is called a dictionary. The elements in $\mathcal{F}$ may have been constructed using an independent, frozen, dataset at some previous step or may simply be good candidates for the learning task at hand. To focus our contribution on the aggregation problem, we restrict our attention to the case where $\mathcal{F}$ consists of deterministic functions. The aim of model selection aggregation [27, 7, 8, 31] is to use the data $\mathcal{D}$ to construct a function $\hat f$ having an excess risk $R(\hat f) - \min_{f \in \mathcal{F}} R(f)$ as small as possible. Namely, we seek the smallest deterministic residual term $\Delta_n(\mathcal{F}) > 0$ such that the excess risk is bounded above by $\Delta_n(\mathcal{F})$, either in expectation or with high probability, or, in this instance, in both. In the high probability case, such bounds are called oracle inequalities. This problem was studied for instance in [2, 3, 6, 7, 14, 27, 18, 19, 23, 31, 32, 33, 34].
From a minimax standpoint, it has been proved that $\Delta_n(\mathcal{F}) = C(\log M)/n$, $C > 0$, is the smallest residual term that one can hope for in the regression problem with quadratic loss [31]. An estimator $\hat f$ achieving such a rate (up to a multiplicative constant) is called an optimal aggregate. The aim of this paper is to construct optimal aggregates under general conditions on the loss function $\ell$.
Note that the optimal residuals for model selection aggregation are of the order $1/n$, as opposed to the standard parametric rate $1/\sqrt{n}$. This fast rate essentially comes from the strong convexity of the quadratic loss. In what follows we show that, indeed, strong convexity is sufficient to obtain fast rates. It is known that rates of order $1/n$ cannot be achieved if the loss function is only assumed to be convex. Indeed, it follows from [21], Theorem 2, that if the loss is linear then the best achievable residual term is at least of the order $\sqrt{(\log|\mathcal{F}|)/n}$. Recall that a function $g$ is said to be strongly convex on a nonempty convex set $C \subset \mathbb{R}$ if there exists a constant $c$ such that

$g(\alpha a + (1-\alpha)a') \le \alpha g(a) + (1-\alpha)g(a') - \frac{c}{2}\,\alpha(1-\alpha)(a - a')^2,$

for any $a, a' \in C$ and $\alpha \in (0,1)$. In this case, $c$ is called the modulus of strong convexity; for example, the squared loss $a \mapsto (y-a)^2$ is strongly convex with modulus $c = 2$ on any interval. For technical reasons, we will also need to assume that the loss function is Lipschitz. We now introduce the set of assumptions that are sufficient for our approach.
Assumption 1. The loss function $\ell$ is such that for any $f, g \in [-b,b]$, we have

$|\ell(Y,f) - \ell(Y,g)| \le C_b\,|f - g|, \quad \text{a.s.}$

Moreover, almost surely, the function $\ell(Y,\cdot)$ is strongly convex with modulus of strong convexity $C_\ell$ on $[-b,b]$.
A central quantity that is used for the construction of aggregates is the empirical risk, defined by

(1.1)  $R_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i))$
for any real-valued function $f$ defined over $\mathcal{X}$. A natural aggregation procedure consists in taking the function in $\mathcal{F}$ that minimizes the empirical risk. This procedure is called empirical risk minimization (ERM). It has been proved that ERM is suboptimal for the aggregation problem [19, 7, 24, 22, 26, 30]. Somehow, this procedure does not take advantage of the convexity of the loss, since the class of functions over which the empirical risk is minimized to construct the ERM is $\mathcal{F}$, a finite set. As it turns out, the performance of ERM relies critically on the convexity of the class of functions over which the empirical risk is minimized [26, 24]. Therefore, a natural idea is to "improve the geometry" of $\mathcal{F}$ by taking its convex hull $\mathrm{conv}(\mathcal{F})$ and then minimizing the empirical risk over it. However, this procedure is also suboptimal [23, 9]. The weak point of this procedure lies in the metric complexity of the problem: taking the convex hull of $\mathcal{F}$ indeed "improves the geometry" of $\mathcal{F}$, but it also increases its complexity by too much. The complexity of the convex hull of a set can be much larger than the complexity of the set itself, and this leads to the failure of this naive convexification trick. Nevertheless, a compromise between geometry and complexity was struck in [2] and [23], where optimal aggregates have been successfully constructed. In [2], this improvement is achieved by minimizing the empirical risk over a carefully chosen star-shaped subset of the convex hull of $\mathcal{F}$. In [23], a better geometry was achieved by taking the convex hull of an appropriate subset of $\mathcal{F}$ and then minimizing the empirical risk over it.
In this paper, we show that a third procedure, called Q-aggregation, which was introduced in [28, 9] for fixed design Gaussian regression, also leads to optimal rates of aggregation. Unlike the above two procedures, which rely on finding an appropriate constraint set for ERM, Q-aggregation is based on a penalization of the empirical risk, while the constraint set is kept equal to the convex hull of $\mathcal{F}$. Let $\Theta$ denote the flat simplex of $\mathbb{R}^M$ defined by

$\Theta = \Big\{(\theta_1,\ldots,\theta_M) \in \mathbb{R}^M : \theta_j \ge 0,\ \sum_{j=1}^M \theta_j = 1\Big\}$

and for any $\theta \in \Theta$, define the convex combination $f_\theta = \sum_{j=1}^M \theta_j f_j$. For any fixed $\nu$, the Q-functional is defined for any $\theta \in \Theta$ by

(1.2)  $Q(\theta) = (1-\nu)R_n(f_\theta) + \nu\sum_{j=1}^M \theta_j R_n(f_j).$

We keep the terminology Q-aggregation from [9] on purpose. Indeed, Q stands for quadratic and, while we do not employ a quadratic loss, we exploit strong convexity in the same manner as in [9] and [28]. Indeed, the first term in Q acts as a regularization of the linear interpolation of the empirical risk and is therefore a strongly convex regularization.

We consider the following aggregation procedure. Unlike the procedures introduced in [2, 23], the Q-aggregation procedure allows us to put a prior weight, given by a prior probability $\pi = (\pi_1,\ldots,\pi_M)$, on each element of the dictionary $\mathcal{F}$. This feature turns out to be crucial for applications [1, 10, 11, 13, 14, 15, 12, 16, 29, 30]. Let $\beta > 0$ be the temperature parameter and $0 < \nu < 1$. Consider any vector of weights $\hat\theta \in \Theta$ defined by
(1.3)  $\hat\theta \in \operatorname*{argmin}_{\theta \in \Theta}\Big[(1-\nu)R_n(f_\theta) + \nu\sum_{j=1}^M \theta_j R_n(f_j) - \frac{\beta}{n}\sum_{j=1}^M \theta_j \log\pi_j\Big].$

It comes out of our analysis that $f_{\hat\theta}$ achieves an optimal rate of aggregation if $\beta$ satisfies

(1.4)  $\beta > \max\Big[\frac{12C_b^2(1-\nu)}{\mu},\ 6\sqrt{3}\,b\,C_b(1-\nu),\ \frac{3C_b^2\,\nu(\nu C_b + 4\mu b)}{2\mu}\Big],$

where $\mu = \min(\nu, 1-\nu)\,C_\ell/10$.
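To fix ideas, here is a minimal numerical sketch of how the minimization in (1.3) could be carried out in practice for the squared loss, using an off-the-shelf solver over the simplex $\Theta$. This sketch is ours and not part of the paper; the function name `q_aggregate`, the choice of solver, and the toy data are illustrative assumptions.

```python
# Illustrative sketch (not from the paper): computing Q-aggregation weights as in (1.3)
# for the squared loss ell(y, a) = (y - a)^2, using a generic solver on the simplex.
import numpy as np
from scipy.optimize import minimize

def q_aggregate(F, y, prior, nu=0.5, beta=1.0):
    """F: (n, M) array with F[i, j] = f_j(X_i); y: (n,) responses; prior: (M,) weights pi."""
    n, M = F.shape
    risks = np.mean((y[:, None] - F) ** 2, axis=0)           # R_n(f_j) for each dictionary element

    def objective(theta):
        r_mix = np.mean((y - F @ theta) ** 2)                 # R_n(f_theta)
        penalty = -(beta / n) * theta @ np.log(prior)         # -(beta/n) sum_j theta_j log pi_j
        return (1 - nu) * r_mix + nu * theta @ risks + penalty

    theta0 = np.full(M, 1.0 / M)                              # start from the uniform mixture
    res = minimize(objective, theta0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * M,
                   constraints=[{"type": "eq", "fun": lambda t: t.sum() - 1.0}])
    return res.x

# Toy usage: three bounded "learners" on simulated data with b = 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = np.clip(0.3 * X + 0.1 * rng.standard_normal(200), -1.0, 1.0)
F = np.column_stack([np.full_like(X, c) for c in (-0.5, 0.0, 0.5)])
print(q_aggregate(F, y, prior=np.ones(3) / 3, nu=0.5, beta=4.0))
```

Since the objective in (1.3) is convex in $\theta$ for a convex loss, any standard convex solver can be used here; the greedy algorithm of [9] is an alternative that avoids generic optimization.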
Theorem A. Let $\mathcal{F}$ be a finite dictionary of cardinality $M$ and let $(X,Y)$ be a random couple of $\mathcal{X} \times \mathbb{R}$ such that $|Y| \le b$ and $\max_{f \in \mathcal{F}}|f(X)| \le b$ a.s. for some $b > 0$. Assume that Assumption 1 holds and that $\beta$ satisfies (1.4). Then, for any $x > 0$, with probability greater than $1 - \exp(-x)$,

$R(f_{\hat\theta}) \le \min_{j=1,\ldots,M}\Big[R(f_j) + \frac{\beta}{n}\log\Big(\frac{1}{\pi_j}\Big)\Big] + \frac{2\beta x}{n}.$

Moreover,

$\mathbb{E}\big[R(f_{\hat\theta})\big] \le \min_{j=1,\ldots,M}\Big[R(f_j) + \frac{\beta}{n}\log\Big(\frac{1}{\pi_j}\Big)\Big].$

If $\pi$ is the uniform distribution, that is $\pi_j = 1/M$ for all $j = 1,\ldots,M$, then we recover in Theorem A the classical optimal rate of aggregation $(\log M)/n$, and the estimator $\hat\theta$ is just the one minimizing the Q-functional defined in (1.2). In particular, no temperature parameter is needed for its construction. As a result, in this case, the parameter $b$ need not be known for the construction of the Q-aggregation procedure.
2. PRELIMINARIES TO THE PROOF OF THEOREM A
An important part of our analysis is based upon concentration properties of empirical processes. While our proofs are similar to those employed in [28] and [9], they contain genuinely new arguments. In particular, this learning setting, unlike the denoising setting considered in [28, 9], allows us to employ various new tools such as symmetrization and contraction. A classical tool to quantify the concentration of measure phenomenon is given by Bernstein's inequality for bounded variables. In terms of Laplace transform, Bernstein's inequality [5, Theorem 1.10] states that if $Z_1,\ldots,Z_n$ are $n$ i.i.d. real-valued random variables such that, for all $i = 1,\ldots,n$,

$|Z_i| \le c \ \text{a.s.} \quad \text{and} \quad \mathbb{E} Z_i^2 \le v,$

then, for any $0 < \lambda < 1/c$,

(2.5)  $\mathbb{E}\exp\Big(\lambda\sum_{i=1}^n\{Z_i - \mathbb{E} Z_i\}\Big) \le \exp\Big(\frac{n v\lambda^2}{2(1 - c\lambda)}\Big).$

Bernstein's inequality usually yields a bound of order $\sqrt{n}$ for the deviations of a sum around its mean. As mentioned above, such bounds are not sufficient for our purposes and we thus consider the following concentration result.
Proposition 1. Let $Z_1,\ldots,Z_n$ be i.i.d. real-valued random variables and let $c_0 > 0$. Assume that $|Z_1| \le c$ a.s. Then, for any $0 < \lambda < (2c_0)/(1 + 2c_0 c)$,

$\mathbb{E}\exp\Big[n\lambda\Big(\frac{1}{n}\sum_{i=1}^n Z_i - \mathbb{E} Z_i - c_0\,\mathbb{E} Z_i^2\Big)\Big] \le 1$

and

$\mathbb{E}\exp\Big[n\lambda\Big(\frac{1}{n}\sum_{i=1}^n \mathbb{E} Z_i - Z_i - c_0\,\mathbb{E} Z_i^2\Big)\Big] \le 1.$

Proof. It follows from Bernstein's inequality (2.5) that, for any $0 < \lambda < (2c_0)/(1 + 2c_0 c)$,

$\mathbb{E}\exp\Big[n\lambda\Big(\frac{1}{n}\sum_{i=1}^n Z_i - \mathbb{E} Z_i - c_0\,\mathbb{E} Z_i^2\Big)\Big] \le \exp\Big(\frac{n\,\mathbb{E} Z_1^2\,\lambda^2}{2(1 - c\lambda)}\Big)\exp\big[-n\lambda c_0\,\mathbb{E} Z_1^2\big] \le 1.$

The second inequality is obtained by replacing $Z_i$ by $-Z_i$.
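To make Proposition 1 concrete, the following Monte Carlo check (our own illustration, not part of the paper, with an arbitrarily chosen distribution and constants) verifies numerically that the penalized Laplace transform stays below one in a simple case.

```python
# Monte Carlo sanity check (ours, not from the paper) of the first bound in Proposition 1.
# We take Z_i ~ Uniform(-1, 1), so c = 1, and choose c0 = 1 and lambda = 0.5 < 2c0/(1 + 2c0*c) = 2/3.
import numpy as np

rng = np.random.default_rng(2)
n, c0, lam, n_mc = 50, 1.0, 0.5, 200_000
Z = rng.uniform(-1.0, 1.0, size=(n_mc, n))      # n_mc replications of (Z_1, ..., Z_n)
EZ, EZ2 = 0.0, 1.0 / 3.0                        # exact moments of Uniform(-1, 1)
stat = Z.mean(axis=1) - EZ - c0 * EZ2           # (1/n) sum_i Z_i - E[Z_i] - c0 E[Z_i^2]
print(np.mean(np.exp(n * lam * stat)))          # estimate of the Laplace transform; about 0.002 <= 1
```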
We will also use the following exponential bound for Rademacher processes: let $\varepsilon_1,\ldots,\varepsilon_n$ be independent Rademacher random variables and let $a_1,\ldots,a_n$ be real numbers; then, by Hoeffding's inequality,

(2.6)  $\mathbb{E}\exp\Big(\sum_{i=1}^n \varepsilon_i a_i\Big) \le \exp\Big(\frac{1}{2}\sum_{i=1}^n a_i^2\Big).$

Our analysis also relies upon a geometric argument. Indeed, the strong convexity of the loss function in Assumption 1 implies the 2-convexity of the risk in the sense of [4]. This translates into a lower bound on the gain obtained when applying Jensen's inequality to the risk function $R$.
Proposition 2. Let $(X,Y)$ be a random couple in $\mathcal{X} \times \mathbb{R}$ and let $\mathcal{F} = \{f_1,\ldots,f_M\}$ be a finite dictionary in $L_2(\mathcal{X}, P_X)$ such that $|f_j(X)| \le b$, $\forall j = 1,\ldots,M$, and $|Y| \le b$ a.s. Assume that, almost surely, the function $\ell(Y,\cdot)$ is strongly convex with modulus of strong convexity $C_\ell$ on $[-b,b]$. Then, it holds that, for any $\theta \in \Theta$,

(2.7)  $R\Big(\sum_{j=1}^M \theta_j f_j\Big) \le \sum_{j=1}^M \theta_j R(f_j) - \frac{C_\ell}{2}\sum_{j=1}^M \theta_j\Big\|f_j - \sum_{k=1}^M \theta_k f_k\Big\|_2^2.$

Proof. Define the random function $\ell(\cdot) = \ell(Y,\cdot)$. By strong convexity and [17], Theorem 6.1.2, it holds almost surely that, for any $a, a'$ in $[-b,b]$,

$\ell(a) \ge \ell(a') + (a - a')\ell'(a') + \frac{C_\ell}{2}(a - a')^2,$

for any $\ell'(a')$ in the subdifferential of $\ell$ at $a'$. Plugging $a = f_j(X)$, $a' = f_\theta(X)$, we get, almost surely,

$\ell(Y, f_j(X)) \ge \ell(Y, f_\theta(X)) + \big(f_j(X) - f_\theta(X)\big)\ell'(f_\theta(X)) + \frac{C_\ell}{2}\big[f_j(X) - f_\theta(X)\big]^2.$

Now, multiplying both sides by $\theta_j$ and summing over $j$, we get, almost surely,

$\sum_j \theta_j \ell(Y, f_j(X)) \ge \ell(Y, f_\theta(X)) + \frac{C_\ell}{2}\sum_j \theta_j\big[f_j(X) - f_\theta(X)\big]^2.$

To complete the proof, it remains to take the expectation.
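As a numerical illustration (ours, not part of the paper, with an arbitrary grid, dictionary and mixture weights), the following sketch checks (2.7) for the squared loss, for which $C_\ell = 2$; in that case the inequality is in fact an identity, since the gain in Jensen's inequality equals the mixture spread $\sum_j \theta_j\|f_j - f_\theta\|_2^2$ exactly. Expectations are computed exactly as averages over a uniform distribution on a finite grid.

```python
# Numerical check (ours, not from the paper) of the improved Jensen inequality (2.7)
# for the squared loss, where C_ell = 2 and (2.7) holds with equality.
import numpy as np

rng = np.random.default_rng(1)
n_points, M = 1000, 5
X = np.linspace(-1.0, 1.0, n_points)                         # finite design with uniform weights
Y = np.clip(np.sin(2 * X) + 0.2 * rng.standard_normal(n_points), -1.0, 1.0)
F = np.clip(rng.standard_normal((n_points, M)), -1.0, 1.0)   # dictionary values f_j(X_i), bounded by b = 1
theta = rng.dirichlet(np.ones(M))                            # a point of the simplex Theta

risk = lambda pred: np.mean((Y - pred) ** 2)                 # R(f) = E[(Y - f(X))^2]
f_theta = F @ theta
lhs = risk(f_theta)                                          # left-hand side of (2.7)
rhs = theta @ np.array([risk(F[:, j]) for j in range(M)]) \
      - theta @ np.mean((F - f_theta[:, None]) ** 2, axis=0) # right-hand side with C_ell / 2 = 1
print(lhs, rhs)                                              # equal up to rounding for the squared loss
```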
3. PROOF OF THEOREM A
Let $x > 0$ and assume that Assumption 1 holds throughout this section. We start with some notation. For any $\theta \in \Theta$, define

$\ell_\theta(y,x) = \ell(y, f_\theta(x)) \quad \text{and} \quad R(\theta) = \mathbb{E}\,\ell_\theta(Y,X) = \mathbb{E}\,\ell(Y, f_\theta(X)),$

where we recall that $f_\theta = \sum_{j=1}^M \theta_j f_j$ for any $\theta \in \mathbb{R}^M$. Let $0 < \nu < 1$. Let $(e_1,\ldots,e_M)$ be the canonical basis of $\mathbb{R}^M$ and, for any $\theta \in \mathbb{R}^M$, define

$\tilde\ell_\theta(y,x) = (1-\nu)\ell_\theta(y,x) + \nu\sum_{j=1}^M \theta_j \ell_{e_j}(y,x) \quad \text{and} \quad \tilde R(\theta) = \mathbb{E}\,\tilde\ell_\theta(Y,X).$

We also consider the functions

$\theta \in \mathbb{R}^M \mapsto K(\theta) = \sum_{j=1}^M \theta_j \log\Big(\frac{1}{\pi_j}\Big) \quad \text{and} \quad \theta \in \mathbb{R}^M \mapsto V(\theta) = \sum_{j=1}^M \theta_j\|f_j - f_\theta\|_2^2.$

Let $\mu > 0$. Consider any oracle $\theta^* \in \Theta$ such that

$\theta^* \in \operatorname*{argmin}_{\theta \in \Theta}\Big(\tilde R(\theta) + \mu V(\theta) + \frac{\beta}{n}K(\theta)\Big).$
We start with a geometrical aspect of the problem. The following inequality follows from the strong convexity of the loss function $\ell$.

Proposition 3. For any $\theta \in \Theta$,

$\tilde R(\theta) - \tilde R(\theta^*) \ge \mu\big(V(\theta^*) - V(\theta)\big) + \frac{\beta}{n}\big(K(\theta^*) - K(\theta)\big) + \Big(\frac{(1-\nu)C_\ell}{2} - \mu\Big)\|f_\theta - f_{\theta^*}\|_2^2.$

Proof. Since $\theta^*$ is a minimizer of the (finite) convex function $\theta \mapsto H(\theta) = \tilde R(\theta) + \mu V(\theta) + (\beta/n)K(\theta)$ over the convex set $\Theta$, there exists a subgradient $\nabla H(\theta^*)$ such that, for any $\theta \in \Theta$, $\langle\nabla H(\theta^*), \theta - \theta^*\rangle \ge 0$. It yields

$\langle\nabla\tilde R(\theta^*), \theta - \theta^*\rangle \ge \mu\langle\nabla V(\theta^*), \theta^* - \theta\rangle + (\beta/n)\langle\nabla K(\theta^*), \theta^* - \theta\rangle$
(3.8)  $= \mu\big(V(\theta^*) - V(\theta)\big) - \mu\|f_\theta - f_{\theta^*}\|_2^2 + (\beta/n)\big(K(\theta^*) - K(\theta)\big).$

It follows from the strong convexity of $\ell(y,\cdot)$ that

$\tilde R(\theta) - \tilde R(\theta^*) \ge \langle\nabla\tilde R(\theta^*), \theta - \theta^*\rangle + \frac{(1-\nu)C_\ell}{2}\|f_\theta - f_{\theta^*}\|_2^2$
$\ge \mu\big(V(\theta^*) - V(\theta)\big) + \frac{\beta}{n}\big(K(\theta^*) - K(\theta)\big) + \Big(\frac{(1-\nu)C_\ell}{2} - \mu\Big)\|f_\theta - f_{\theta^*}\|_2^2,$

where the second inequality follows from the previous display.
Let $H$ be the $M \times M$ matrix with entries $H_{j,k} = \|f_j - f_k\|_2^2$ for all $1 \le j,k \le M$. Let $s$ and $x$ be positive numbers and, writing $Pg = \mathbb{E}\,g(Y,X)$ and $P_n g = n^{-1}\sum_{i=1}^n g(Y_i,X_i)$ for any function $g$ of $(y,x)$, consider the random variable

$Z_n = (P - P_n)(\tilde\ell_{\hat\theta} - \tilde\ell_{\theta^*}) - \mu\sum_{j=1}^M \hat\theta_j\|f_j - f_{\theta^*}\|_2^2 - \mu\,\hat\theta^\top H\theta^* - \frac{1}{s}K(\hat\theta).$

Proposition 4. Assume that $10\mu \le \min(1-\nu, \nu)\,C_\ell$ and $\beta \ge 3n/s$. Then, it holds that

$R(\hat\theta) \le \min_{1 \le j \le M}\Big[R(e_j) + \frac{\beta}{n}\log\Big(\frac{1}{\pi_j}\Big)\Big] + 2Z_n.$
Proof. First note that the following equalities hold:

(3.9)  $\sum_{j=1}^M \hat\theta_j\|f_j - f_{\theta^*}\|_2^2 = V(\hat\theta) + \big\|f_{\hat\theta} - f_{\theta^*}\big\|_2^2$

and

(3.10)  $\hat\theta^\top H\theta^* = V(\hat\theta) + V(\theta^*) + \big\|f_{\theta^*} - f_{\hat\theta}\big\|_2^2.$

It follows from the definition of $\hat\theta$ that

(3.11)  $\tilde R(\hat\theta) - \tilde R(\theta^*) \le (P - P_n)(\tilde\ell_{\hat\theta} - \tilde\ell_{\theta^*}) + \frac{\beta}{n}\big(K(\theta^*) - K(\hat\theta)\big).$

Plugging (3.9), (3.10) and the definition of $Z_n$ into (3.11), we obtain

(3.12)  $\tilde R(\hat\theta) - \tilde R(\theta^*) \le 2\mu V(\hat\theta) + \mu V(\theta^*) + 2\mu\big\|f_{\hat\theta} - f_{\theta^*}\big\|_2^2 + \frac{1}{s}K(\hat\theta) + \frac{\beta}{n}\big(K(\theta^*) - K(\hat\theta)\big) + Z_n.$

Together with Proposition 3, it yields

$\Big(\frac{(1-\nu)C_\ell}{2} - 3\mu\Big)\big\|f_{\hat\theta} - f_{\theta^*}\big\|_2^2 \le 3\mu V(\hat\theta) + \frac{1}{s}K(\hat\theta) + Z_n.$

We plug the above inequality into (3.12) to obtain

(3.13)  $\tilde R(\hat\theta) - \tilde R(\theta^*) \le \Big(1 + \frac{2\mu}{(1-\nu)C_\ell/2 - 3\mu}\Big)\Big(\frac{1}{s}K(\hat\theta) + Z_n\Big) + \frac{\beta}{n}\big(K(\theta^*) - K(\hat\theta)\big) + \mu V(\theta^*) + \Big(2\mu + \frac{6\mu^2}{(1-\nu)C_\ell/2 - 3\mu}\Big)V(\hat\theta).$

Thanks to the 2-convexity of the risk (cf. Proposition 2), we have $\tilde R(\hat\theta) \ge R(\hat\theta) + \nu\frac{C_\ell}{2}V(\hat\theta)$. Therefore, it follows from (3.13) that

(3.14)  $R(\hat\theta) \le \tilde R(\theta^*) + \mu V(\theta^*) + \frac{\beta}{n}K(\theta^*) + \Big(1 + \frac{4\mu}{(1-\nu)C_\ell - 6\mu}\Big)Z_n + \Big(2\mu + \frac{12\mu^2}{(1-\nu)C_\ell - 6\mu} - \nu\frac{C_\ell}{2}\Big)V(\hat\theta) + \Big(\frac{1}{s} + \frac{8\mu}{s((1-\nu)C_\ell - 6\mu)} - \frac{\beta}{n}\Big)K(\hat\theta).$

Note now that $10\mu \le \min(\nu, 1-\nu)C_\ell$ implies that

$\frac{4\mu}{(1-\nu)C_\ell - 6\mu} \le 1 \quad \text{and} \quad 2\mu + \frac{12\mu^2}{(1-\nu)C_\ell - 6\mu} - \nu\frac{C_\ell}{2} \le 0.$

Moreover, together, the two conditions of the proposition yield

$\frac{1}{s} + \frac{8\mu}{s((1-\nu)C_\ell - 6\mu)} - \frac{\beta}{n} \le 0.$

Therefore, it follows from the above three displays that

$R(\hat\theta) \le \min_{\theta \in \Theta}\Big[\tilde R(\theta) + \mu V(\theta) + \frac{\beta}{n}K(\theta)\Big] + 2Z_n \le \min_{j=1,\ldots,M}\Big[R(e_j) + \frac{\beta}{n}\log\Big(\frac{1}{\pi_j}\Big)\Big] + 2Z_n.$
To complete our proof, it remains to prove that $P[Z_n > (\beta x)/n] \le \exp(-x)$ and $\mathbb{E}[Z_n] \le 0$ under suitable conditions on $\mu$ and $\beta$. Using a Chernoff bound and Jensen's inequality, respectively, it is easy to see that both conditions follow if we prove that $\mathbb{E}\exp(nZ_n/\beta) \le 1$. It follows from the excess loss decomposition

$\tilde\ell_{\hat\theta}(y,x) - \tilde\ell_{\theta^*}(y,x) = (1-\nu)\big(\ell_{\hat\theta}(y,x) - \ell_{\theta^*}(y,x)\big) + \nu\sum_{j=1}^M(\hat\theta_j - \theta^*_j)\ell_{e_j}(y,x)$

and the Cauchy-Schwarz inequality that it is enough to prove that

(3.15)  $\mathbb{E}\exp\Big[s\Big((1-\nu)(P - P_n)(\ell_{\hat\theta} - \ell_{\theta^*}) - \mu\sum_{j=1}^M \hat\theta_j\|f_j - f_{\theta^*}\|_2^2 - \frac{1}{s}K(\hat\theta)\Big)\Big] \le 1$

and

(3.16)  $\mathbb{E}\exp\Big[s\Big(\nu(P - P_n)\Big(\sum_{j=1}^M(\hat\theta_j - \theta^*_j)\ell_{e_j}\Big) - \mu\,\hat\theta^\top H\theta^* - \frac{1}{s}K(\hat\theta)\Big)\Big] \le 1,$

for some $s \ge 2n/\beta$. Let $s$ be as such in the rest of the proof.

We begin by proving (3.15). To that end, define the symmetrized empirical process by $h \mapsto P_{n,\varepsilon}h = n^{-1}\sum_{i=1}^n \varepsilon_i h(Y_i,X_i)$, where $\varepsilon_1,\ldots,\varepsilon_n$ are $n$ i.i.d. Rademacher random variables independent of the $(X_i,Y_i)$'s. Moreover, take $s$ and $\mu$ such that

(3.17)  $s \le \frac{\mu n}{[2C_b(1-\nu)]^2}.$
It yields

$\mathbb{E}\exp\Big[s\Big((1-\nu)(P - P_n)(\ell_{\hat\theta} - \ell_{\theta^*}) - \mu\sum_{j=1}^M \hat\theta_j\|f_j - f_{\theta^*}\|_2^2 - \frac{1}{s}K(\hat\theta)\Big)\Big]$
$\le \mathbb{E}\exp\Big[s\max_{\theta \in \Theta}\Big((1-\nu)(P - P_n)(\ell_\theta - \ell_{\theta^*}) - \mu\sum_{j=1}^M \theta_j\|f_j - f_{\theta^*}\|_2^2 - \frac{1}{s}K(\theta)\Big)\Big]$
(3.18)  $\le \mathbb{E}\exp\Big[s\max_{\theta \in \Theta}\Big(2(1-\nu)P_{n,\varepsilon}(\ell_\theta - \ell_{\theta^*}) - \mu\sum_{j=1}^M \theta_j\|f_j - f_{\theta^*}\|_2^2 - \frac{1}{s}K(\theta)\Big)\Big]$
(3.19)  $\le \mathbb{E}\exp\Big[s\max_{\theta \in \Theta}\Big(2C_b(1-\nu)P_{n,\varepsilon}(f_\theta - f_{\theta^*}) - \mu\sum_{j=1}^M \theta_j\|f_j - f_{\theta^*}\|_2^2 - \frac{1}{s}K(\theta)\Big)\Big],$

where (3.18) follows from the symmetrization inequality [20, Theorem 2.1] and (3.19) follows from the contraction principle [25, Theorem 4.12] applied to the contractions $\varphi_i(t_i) = C_b^{-1}\big[\ell(Y_i, f_{\theta^*}(X_i) - t_i) - \ell(Y_i, f_{\theta^*}(X_i))\big]$ and the set $T \subset \mathbb{R}^n$ defined by $T = \{t \in \mathbb{R}^n : t_i = f_{\theta^*}(X_i) - f_\theta(X_i),\ \theta \in \Theta\}$. Next, using the fact that the maximum of a linear function over a polytope is attained at a vertex, we get

$\mathbb{E}\exp\Big[s\Big((1-\nu)(P - P_n)(\ell_{\hat\theta} - \ell_{\theta^*}) - \mu\sum_{j=1}^M \hat\theta_j\|f_j - f_{\theta^*}\|_2^2 - \frac{1}{s}K(\hat\theta)\Big)\Big]$
$\le \sum_{k=1}^M \pi_k\,\mathbb{E}\,\mathbb{E}_\varepsilon\exp\Big[s\Big(2C_b(1-\nu)P_{n,\varepsilon}(f_k - f_{\theta^*}) - \mu\|f_k - f_{\theta^*}\|_2^2\Big)\Big]$
(3.20)  $\le \sum_{k=1}^M \pi_k\,\mathbb{E}\exp\Big[\frac{[2C_b(1-\nu)s]^2}{2n}\Big(P_n - \frac{2\mu n}{[2C_b(1-\nu)]^2 s}P\Big)(f_k - f_{\theta^*})^2\Big]$
(3.21)  $\le \sum_{k=1}^M \pi_k\,\mathbb{E}\exp\Big[\frac{[2C_b(1-\nu)s]^2}{2n}\Big((P_n - P)(f_k - f_{\theta^*})^2 - \frac{1}{4b^2}P(f_k - f_{\theta^*})^4\Big)\Big],$

where (3.20) follows from (2.6) and (3.21) follows from (3.17). Together with the above display, Proposition 1 yields (3.15) as long as

(3.22)  $s < \frac{n}{2\sqrt{3}\,b\,C_b(1-\nu)}.$
We now prove (3.16). We have

$\mathbb{E}\exp\Big[s\Big(\nu(P - P_n)\Big(\sum_{j=1}^M(\hat\theta_j - \theta^*_j)\ell_{e_j}\Big) - \mu\,\hat\theta^\top H\theta^* - \frac{1}{s}K(\hat\theta)\Big)\Big]$
$\le \sum_{j=1}^M \theta^*_j\sum_{k=1}^M \pi_k\,\mathbb{E}\exp\Big[s\Big(\nu(P - P_n)(\ell_{e_k} - \ell_{e_j}) - \mu\|f_j - f_k\|_2^2\Big)\Big]$
$\le \sum_{j=1}^M \theta^*_j\sum_{k=1}^M \pi_k\,\mathbb{E}\exp\Big[s\nu\Big((P - P_n)(\ell_{e_k} - \ell_{e_j}) - \frac{\mu}{\nu C_b^2}P(\ell_{e_j} - \ell_{e_k})^2\Big)\Big] \le 1,$

where the last inequality follows from Proposition 1 when

(3.23)  $s < \frac{2\mu n}{C_b\,\nu(\nu C_b + 4\mu b)}.$

It is now straightforward to see that the conditions of Proposition 4 and the conditions (3.17), (3.22) and (3.23) are all fulfilled when

$s = \frac{3n}{\beta}, \qquad \mu = \min(\nu, 1-\nu)\,\frac{C_\ell}{10}$

and

$\beta > \max\Big[\frac{12C_b^2(1-\nu)}{\mu},\ 6\sqrt{3}\,b\,C_b(1-\nu),\ \frac{3C_b^2\,\nu(\nu C_b + 4\mu b)}{2\mu}\Big].$
REFERENCES
[1] Alquier, P., and Lounici, K. PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electronic Journal of Statistics 5 (2011), 127–145.
[2] Audibert, J.-Y. Progressive mixture rules are deviation suboptimal. Advances in Neural Information Processing Systems (NIPS) (2007).
[3] Audibert, J.-Y. Fast learning rates in statistical inference through aggregation. Ann. Statist. 37, 4 (2009), 1591–1646.
[4] Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. 101, 473 (2006), 138–156.
[5] Boucheron, S., Lugosi, G., and Massart, P. Concentration inequalities with applications. Clarendon Press, Oxford, 2012.
[6] Bunea, F., Tsybakov, A. B., and Wegkamp, M. H. Aggregation for Gaussian regression. Ann. Statist. 35, 4 (2007), 1674–1697.
[7] Catoni, O. Statistical learning theory and stochastic optimization, vol. 1851 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, 2004. Lecture notes from the 31st Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001.
[8] Catoni, O. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Institute of Mathematical Statistics Lecture Notes-Monograph Series, 56. Institute of Mathematical Statistics, Beachwood, OH, 2007.
[9] Dai, D., Rigollet, P., and Zhang, T. Deviation optimal learning using greedy Q-aggregation. Ann. Statist. (March 2012). arXiv:1203.2507.