Adaptive and optimal online linear regression on ℓ1-balls

Sébastien Gerchinovitz^{a,1,∗}, Jia Yuan Yu^{b}

a) École Normale Supérieure, 45 rue d'Ulm, 75005 Paris, France
b) IBM Research, Damastown Technology Campus, Dublin 15, Ireland

Abstract

We consider the problem of online linear regression on individual sequences. The goal in this paper is for the forecaster to output sequential predictions which are, after $T$ time rounds, almost as good as the ones output by the best linear predictor in a given $\ell_1$-ball in $\mathbb{R}^d$. We consider both the cases where the dimension $d$ is small and large relative to the time horizon $T$. We first present regret bounds with optimal dependencies on $d$, $T$, and on the sizes $U$, $X$ and $Y$ of the $\ell_1$-ball, the input data and the observations. The minimax regret is shown to exhibit a regime transition around the point $d = \sqrt{T}\,U X/(2Y)$. Furthermore, we present efficient algorithms that are adaptive, i.e., that do not require the knowledge of $U$, $X$, $Y$, and $T$, but still achieve nearly optimal regret bounds.

Keywords: Online learning, Linear regression, Adaptive algorithms, Minimax regret

1. Introduction

In this paper, we consider the problem of online linear regression against arbitrary sequences of input data and observations, with the objective of being competitive with respect to the best linear predictor in an $\ell_1$-ball of arbitrary radius. This extends the task of convex aggregation. We consider both low- and high-dimensional input data. Indeed, in a large number of contemporary problems, the available data can be high-dimensional: the dimension of each data point is larger than the number of data points. Examples include analysis of DNA sequences, collaborative filtering, astronomical data analysis, and cross-country growth regression. In such high-dimensional problems, performing linear regression on an $\ell_1$-ball of small diameter may be helpful if the best linear predictor is sparse. Our goal is, in both low and high dimensions, to provide online linear regression algorithms along with regret bounds on $\ell_1$-balls that characterize their robustness to worst-case scenarios.

1.1. Setting

We consider the online version of linear regression, which unfolds as follows. First, the environment chooses a sequence of observations $(y_t)_{t \geq 1}$ in $\mathbb{R}$ and a sequence of input vectors $(x_t)_{t \geq 1}$ in $\mathbb{R}^d$, both initially hidden from the forecaster. At each time instant $t \in \mathbb{N}^* = \{1, 2, \ldots\}$, the environment reveals the data $x_t \in \mathbb{R}^d$; the forecaster then gives a prediction $\hat y_t \in \mathbb{R}$; the environment in turn reveals the observation $y_t \in \mathbb{R}$; and finally, the forecaster incurs the square loss $(y_t - \hat y_t)^2$. The dimension $d$ can be either small or large relative to the number $T$ of time steps: we consider both cases.

In the sequel, $u \cdot v$ denotes the standard inner product between $u, v \in \mathbb{R}^d$, and we set $\|u\|_\infty \triangleq \max_{1 \leq j \leq d} |u_j|$ and $\|u\|_1 \triangleq \sum_{j=1}^d |u_j|$. The $\ell_1$-ball of radius $U > 0$ is the following bounded subset of $\mathbb{R}^d$:
\[
B_1(U) \triangleq \bigl\{ u \in \mathbb{R}^d : \|u\|_1 \leq U \bigr\} .
\]

∗ Corresponding author.
Email addresses: [email protected] (Sébastien Gerchinovitz), [email protected] (Jia Yuan Yu)
1 This research was carried out within the INRIA project CLASSIC hosted by École Normale Supérieure and CNRS.

Given a fixed radius $U > 0$ and a time horizon $T \geq 1$, the goal of the forecaster is to predict almost as well as the best linear forecaster in the reference set $\{ x \in \mathbb{R}^d \mapsto u \cdot x \in \mathbb{R} : u \in B_1(U) \}$, i.e., to minimize the regret on $B_1(U)$ defined by
\[
\sum_{t=1}^T (y_t - \hat y_t)^2 - \min_{u \in B_1(U)} \left\{ \sum_{t=1}^T (y_t - u \cdot x_t)^2 \right\} .
\]
We shall present algorithms along with bounds on their regret that hold uniformly over all sequences^2 $(x_t, y_t)_{1 \leq t \leq T}$ such that $\|x_t\|_\infty \leq X$ and $|y_t| \leq Y$ for all $t = 1, \ldots, T$, where $X, Y > 0$. These regret bounds depend on four important quantities: $U$, $X$, $Y$, and $T$, which may be known or unknown to the forecaster.

Footnote 2: Actually our results hold whether $(x_t, y_t)_{t \geq 1}$ is generated by an oblivious environment or a non-oblivious opponent since we consider deterministic forecasters.

1.2. Contributions and related works

In the next paragraphs we detail the main contributions of this paper in view of related works in online linear regression.

Our first contribution (Section 2) consists of a minimax analysis of online linear regression on $\ell_1$-balls in the arbitrary sequence setting. We first provide a refined regret bound expressed in terms of $Y$, $d$, and a quantity $\kappa = \sqrt{T}\,U X/(2 d Y)$. This quantity $\kappa$ is used to distinguish two regimes: we show a distinctive regime transition^3 at $\kappa = 1$, i.e., $d = \sqrt{T}\,U X/(2Y)$. Namely, for $\kappa < 1$, the regret is of the order of $\sqrt{T}$, whereas it is of the order of $\ln T$ for $\kappa \geq 1$.

Footnote 3: In high dimensions (i.e., when $d \geq \omega T$, for some absolute constant $\omega > 0$), we do not observe this transition (cf. Figure 1).

The derivation of this regret bound partially relies on a Maurey-type argument used under various forms with i.i.d. data, e.g., in [1, 2, 3, 4] (see also [5]). We adapt it in a straightforward way to the deterministic setting. Therefore, this is yet another technique that can be applied to both the stochastic and individual sequence settings.

Unsurprisingly, the refined regret bound mentioned above matches the optimal risk bounds for stochastic settings^4 [6, 2] (see also [7]). Hence, linear regression is just as hard in the stochastic setting as in the arbitrary sequence setting. Using the standard online to batch conversion, we make the latter statement more precise by establishing a lower bound for all $\kappa$ at least of the order of $\sqrt{\ln d}/d$. This lower bound extends those of [8, 9], which only hold for small $\kappa$ of the order of $1/d$.

Footnote 4: For example, $(x_t, y_t)_{1 \leq t \leq T}$ may be i.i.d., or $x_t$ can be deterministic and $y_t = f(x_t) + \varepsilon_t$ for an unknown function $f$ and an i.i.d. sequence $(\varepsilon_t)_{1 \leq t \leq T}$ of Gaussian noise.

The algorithm achieving our minimax regret bound is both computationally inefficient and non-adaptive (i.e., it requires prior knowledge of the quantities $U$, $X$, $Y$, and $T$ that may be unknown in practice). Those two issues were first overcome by [10] via an automatic tuning termed self-confident (since the forecaster somehow trusts himself in tuning its parameters). They indeed proved that the self-confident $p$-norm algorithm with $p = 2 \ln d$ and tuned with $U$ has a cumulative loss $\hat L_T = \sum_{t=1}^T (y_t - \hat y_t)^2$ bounded by
\[
\hat L_T \leq L_T^* + 8 U X \sqrt{(e \ln d)\, L_T^*} + (32\, e \ln d)\, U^2 X^2
\leq L_T^* + 8 U X Y \sqrt{e T \ln d} + (32\, e \ln d)\, U^2 X^2 ,
\]
where $L_T^* \triangleq \min_{u \in \mathbb{R}^d : \|u\|_1 \leq U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \leq T Y^2$. This algorithm is efficient, and our lower bound in terms of $\kappa$ shows that it is optimal up to logarithmic factors in the regime $\kappa \leq 1$, without prior knowledge of $X$, $Y$, and $T$.

Our second contribution (Section 3) is to show that similar adaptivity and efficiency properties can be obtained via exponential weighting. We consider a variant of the EG± algorithm [9]. The latter has a manageable computational complexity and our lower bound shows that it is nearly optimal in the regime $\kappa \leq 1$. However, the EG± algorithm requires prior knowledge of $U$, $X$, $Y$, and $T$. To overcome this adaptivity issue, we study a modification of the EG± algorithm that relies on the variance-based automatic tuning of [11].
The resulting algorithm, called the adaptive EG± algorithm, can be applied to general convex and differentiable loss functions. When applied to the square loss, it yields an algorithm of the same computational complexity as the EG± algorithm that also achieves a nearly optimal regret, but without needing to know $X$, $Y$, and $T$ beforehand.

Our third contribution (Section 3.3) is a generic technique called loss Lipschitzification. It transforms the loss functions $u \mapsto (y_t - u \cdot x_t)^2$ (or $u \mapsto |y_t - u \cdot x_t|^\alpha$ if the predictions are scored with the $\alpha$-loss for a real number $\alpha \geq 2$) into Lipschitz continuous functions. We illustrate this technique by applying the generic adaptive EG± algorithm to the modified loss functions. When the predictions are scored with the square loss, this yields an algorithm (the LEG algorithm) whose main regret term slightly improves on that derived for the adaptive EG± algorithm without Lipschitzification. The benefits of this technique are clearer for loss functions with higher curvature: if $\alpha > 2$, then the resulting regret bound roughly grows as $U$ instead of a naive $U^{\alpha/2}$.

Finally, in Section 4, we provide a simple way to achieve minimax regret uniformly over all $\ell_1$-balls $B_1(U)$ for $U > 0$. This method aggregates instances of an algorithm that requires prior knowledge of $U$. For the sake of simplicity, we assume that $X$, $Y$, and $T$ are known, but explain in the discussion how to extend the method to a fully adaptive algorithm that requires the knowledge neither of $U$, $X$, $Y$, nor $T$.

This paper is organized as follows. In Section 2, we establish our refined upper and lower bounds in terms of the intrinsic quantity $\kappa$. In Section 3, we present an efficient and adaptive algorithm, the adaptive EG± algorithm with or without loss Lipschitzification, that achieves the optimal regret on $B_1(U)$ when $U$ is known. In Section 4, we use an aggregating strategy to achieve an optimal regret uniformly over all $\ell_1$-balls $B_1(U)$, for $U > 0$, when $X$, $Y$, and $T$ are known. Finally, in Section 5, we discuss as an extension a fully automatic algorithm that requires no prior knowledge of $U$, $X$, $Y$, or $T$. Some proofs and additional tools are postponed to the appendix.

2. Optimal rates

In this section, we first present a refined upper bound on the minimax regret on $B_1(U)$ for an arbitrary $U > 0$. In Corollary 1, we express this upper bound in terms of an intrinsic quantity $\kappa \triangleq \sqrt{T}\,U X/(2 d Y)$. The optimality of the latter bound is shown in Section 2.2.

We consider the following definition to avoid any ambiguity. We call online forecaster any sequence $F = (f_t)_{t \geq 1}$ of functions such that $f_t : \mathbb{R}^d \times (\mathbb{R}^d \times \mathbb{R})^{t-1} \to \mathbb{R}$ maps at time $t$ the new input $x_t$ and the past data $(x_s, y_s)_{1 \leq s \leq t-1}$ to a prediction $f_t\bigl(x_t; (x_s, y_s)_{1 \leq s \leq t-1}\bigr)$. Depending on the context, the latter prediction may be simply denoted by $f_t(x_t)$ or by $\hat y_t$.

2.1. Upper bound

Theorem 1 (Upper bound). Let $d, T \in \mathbb{N}^*$, and $U, X, Y > 0$. The minimax regret on $B_1(U)$ for bounded base predictions and observations satisfies
\[
\inf_F \; \sup_{\|x_t\|_\infty \leq X, \, |y_t| \leq Y} \left\{ \sum_{t=1}^T (y_t - \hat y_t)^2 - \inf_{\|u\|_1 \leq U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \right\}
\leq
\begin{cases}
3\, U X Y \sqrt{2 T \ln(2d)} & \text{if } U < \dfrac{Y}{X} \sqrt{\dfrac{\ln(1+2d)}{T \ln 2}} , \\[2ex]
26\, U X Y \sqrt{T \ln\!\left(1 + \dfrac{2 d Y}{\sqrt{T}\,U X}\right)} & \text{if } \dfrac{Y}{X} \sqrt{\dfrac{\ln(1+2d)}{T \ln 2}} \leq U \leq \dfrac{2 d Y}{\sqrt{T}\,X} , \\[2ex]
32\, d Y^2 \ln\!\left(1 + \dfrac{\sqrt{T}\,U X}{d Y}\right) + d Y^2 & \text{if } U > \dfrac{d Y}{X \sqrt{T}} ,
\end{cases}
\]
where the infimum is taken over all forecasters $F$ and where the supremum extends over all sequences $(x_t, y_t)_{1 \leq t \leq T} \in (\mathbb{R}^d \times \mathbb{R})^T$ such that $|y_1|, \ldots, |y_T| \leq Y$ and $\|x_1\|_\infty, \ldots, \|x_T\|_\infty \leq X$.

Theorem 1 improves the bound of [9, Theorem 5.11] for the EG± algorithm.
First, our bound depends logarithmically (as opposed to linearly) on $U$ for $U \geq 2 d Y/(\sqrt{T}\,X)$. Secondly, it is smaller by a factor ranging from $1$ to $\sqrt{\ln d}$ when
\[
\frac{Y}{X} \sqrt{\frac{\ln(1+2d)}{T \ln 2}} \leq U \leq \frac{2 d Y}{\sqrt{T}\,X} . \tag{1}
\]
Hence, Theorem 1 provides a partial answer to a question^5 raised in [9] about the gap of $\sqrt{\ln(2d)}$ between the upper and lower bounds.

Footnote 5: The authors of [9] asked: "For large d there is a significant gap between the upper and lower bounds. We would like to know if it possible to improve the upper bounds by eliminating the ln d factors."

Before proving the theorem (see below), we state the following immediate corollary. It expresses the upper bound of Theorem 1 in terms of an intrinsic quantity $\kappa \triangleq \sqrt{T}\,U X/(2 d Y)$ that relates $\sqrt{T}\,U X/(2Y)$ to the ambient dimension $d$.

Corollary 1 (Upper bound in terms of an intrinsic quantity). Let $d, T \in \mathbb{N}^*$, and $U, X, Y > 0$. The upper bound of Theorem 1 expressed in terms of $d$, $Y$, and the intrinsic quantity $\kappa \triangleq \sqrt{T}\,U X/(2 d Y)$ reads:
\[
\inf_F \; \sup_{\|x_t\|_\infty \leq X, \, |y_t| \leq Y} \left\{ \sum_{t=1}^T (y_t - \hat y_t)^2 - \inf_{\|u\|_1 \leq U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \right\}
\leq
\begin{cases}
6\, d Y^2 \kappa \sqrt{2 \ln(2d)} & \text{if } \kappa < \dfrac{\sqrt{\ln(1+2d)}}{2 d \sqrt{\ln 2}} , \\[2ex]
52\, d Y^2 \kappa \sqrt{\ln(1 + 1/\kappa)} & \text{if } \dfrac{\sqrt{\ln(1+2d)}}{2 d \sqrt{\ln 2}} \leq \kappa \leq 1 , \\[2ex]
32\, d Y^2 \bigl( \ln(1 + 2\kappa) + 1 \bigr) & \text{if } \kappa > 1 .
\end{cases}
\]

The upper bound of Corollary 1 is shown in Figure 1. Observe that, in low dimension (Figure 1(b)), a clear transition from a regret of the order of $\sqrt{T}$ to one of $\ln T$ occurs at $\kappa = 1$. This transition is absent for high dimensions: for $d \geq \omega T$, where $\omega \triangleq \bigl(32 (\ln(3) + 1)\bigr)^{-1}$, the regret bound $32\, d Y^2 \bigl(\ln(1+2\kappa)+1\bigr)$ is worse than the trivial bound of $T Y^2$ when $\kappa \geq 1$.

[Figure 1: The regret bound of Corollary 1 over $B_1(U)$ as a function of $\kappa = \sqrt{T}\,U X/(2 d Y)$; panel (a) shows the high-dimensional case $d \geq \omega T$ and panel (b) the low-dimensional case $d < \omega T$. The constant $c$ is chosen to ensure continuity at $\kappa = 1$, and $\omega \triangleq \bigl(32(\ln(3)+1)\bigr)^{-1}$. We define $\kappa_{\min} = \sqrt{\ln(1+2d)}/(2 d \sqrt{\ln 2})$ and $\kappa_{\max} = \bigl(e^{(T/d-1)/c} - 1\bigr)/2$.]

We now prove Theorem 1. The main part of the proof relies on a Maurey-type argument. Although this argument was used in the stochastic setting [1, 2, 3, 4], we adapt it to the deterministic setting. This is yet another technique that can be applied to both the stochastic and individual sequence settings.

Proof (of Theorem 1): First note from Lemma 5 in Appendix B that the minimax regret on $B_1(U)$ is upper bounded^6 by
\[
\min\left\{ 3\, U X Y \sqrt{2 T \ln(2d)} , \; 32\, d Y^2 \ln\!\left(1 + \frac{\sqrt{T}\,U X}{d Y}\right) + d Y^2 \right\} . \tag{2}
\]
Therefore, the first case $U < \frac{Y}{X} \sqrt{\frac{\ln(1+2d)}{T \ln 2}}$ and the third case $U > \frac{d Y}{X \sqrt{T}}$ are straightforward. Therefore, we assume in the sequel that
\[
\frac{Y}{X} \sqrt{\frac{\ln(1+2d)}{T \ln 2}} \leq U \leq \frac{2 d Y}{\sqrt{T}\,X} .
\]

Footnote 6: As proved in Lemma 5, the regret bound (2) is achieved either by the EG± algorithm, the algorithm SeqSEW$_B^{\tau,\eta}$ of [12] (we could also get a slightly worse bound with the sequential ridge regression forecaster [13, 14]), or the trivial null forecaster.

We use a Maurey-type argument to refine the regret bound (2). This technique was used under various forms in the stochastic setting, e.g., in [1, 2, 3, 4]. It consists of discretizing $B_1(U)$ and looking at a random point in this discretization to study its approximation properties. We also use clipping to get a regret bound growing as $U$ instead of a naive $U^2$.

More precisely, we first use the fact that to be competitive against $B_1(U)$, it is sufficient to be competitive against its finite subset
\[
\tilde B_{U,m} \triangleq \left\{ \left( \frac{k_1 U}{m}, \ldots, \frac{k_d U}{m} \right) : (k_1, \ldots, k_d) \in \mathbb{Z}^d , \; \sum_{j=1}^d |k_j| \leq m \right\} \subset B_1(U) ,
\]
where $m \triangleq \lfloor \alpha \rfloor$ with $\alpha \triangleq \dfrac{U X}{Y} \sqrt{\dfrac{T \ln 2}{\ln\bigl(1 + 2 d Y/(\sqrt{T}\,U X)\bigr)}}$.
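For concreteness, the finite grid $\tilde B_{U,m}$ can be enumerated by brute force when $d$ and $m$ are very small. The following Python sketch is purely illustrative; the proof only uses $\tilde B_{U,m}$ abstractly, and its cardinality grows far too quickly for such an enumeration to be practical.

```python
from itertools import product

def discretized_ball(U, d, m):
    """Enumerate the grid B~_{U,m}: points (k_1 U/m, ..., k_d U/m) with
    integer k_j and sum_j |k_j| <= m.  Brute force, for tiny d and m only."""
    return [tuple(k * U / m for k in ks)
            for ks in product(range(-m, m + 1), repeat=d)
            if sum(abs(k) for k in ks) <= m]

# For instance, d = 2, m = 2, U = 1.0 gives the 13 grid points lying in the l1-ball of radius 1.
print(len(discretized_ball(1.0, 2, 2)))  # 13
```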
By Lemma 7 in Appendix C, and since $m > 0$ (see below), we indeed have
\[
\inf_{u \in \tilde B_{U,m}} \sum_{t=1}^T (y_t - u \cdot x_t)^2
\leq \inf_{u \in B_1(U)} \sum_{t=1}^T (y_t - u \cdot x_t)^2 + \frac{T U^2 X^2}{m}
\leq \inf_{u \in B_1(U)} \sum_{t=1}^T (y_t - u \cdot x_t)^2 + \frac{2}{\sqrt{\ln 2}}\, U X Y \sqrt{T \ln\!\left(1 + \frac{2 d Y}{\sqrt{T}\,U X}\right)} , \tag{3}
\]
where (3) follows from $m \triangleq \lfloor \alpha \rfloor \geq \alpha/2$ since $\alpha \geq 1$ (in particular, $m > 0$ as stated above).

To see why $\alpha \geq 1$, note that it suffices to show that $x \sqrt{\ln(1+x)} \leq 2 d \sqrt{\ln 2}$ where we set $x \triangleq 2 d Y/(\sqrt{T}\,U X)$. But from the assumption $U \geq (Y/X) \sqrt{\ln(1+2d)/(T \ln 2)}$, we have $x \leq 2 d \sqrt{\ln(2)/\ln(1+2d)} \triangleq y$, so that, by monotonicity, $x \sqrt{\ln(1+x)} \leq y \sqrt{\ln(1+y)} \leq y \sqrt{\ln(1+2d)} = 2 d \sqrt{\ln 2}$.

Therefore it only remains to exhibit an algorithm which is competitive against $\tilde B_{U,m}$ at an aggregation price of the same order as the last term in (3). This is the case for the standard exponentially weighted average forecaster applied to the clipped predictions
\[
[u \cdot x_t]_Y \triangleq \min\bigl\{ Y, \max\{ -Y, \, u \cdot x_t \} \bigr\} , \qquad u \in \tilde B_{U,m} ,
\]
and tuned with the inverse temperature parameter $\eta = 1/(8 Y^2)$. More formally, this algorithm predicts at each time $t = 1, \ldots, T$ as
\[
\hat y_t \triangleq \sum_{u \in \tilde B_{U,m}} p_t(u) \, [u \cdot x_t]_Y ,
\]
where $p_1(u) \triangleq 1/|\tilde B_{U,m}|$ (denoting by $|\tilde B_{U,m}|$ the cardinality of the set $\tilde B_{U,m}$), and where the weights $p_t(u)$ are defined for all $t = 2, \ldots, T$ and $u \in \tilde B_{U,m}$ by
\[
p_t(u) \triangleq \frac{\exp\Bigl( -\eta \sum_{s=1}^{t-1} \bigl( y_s - [u \cdot x_s]_Y \bigr)^2 \Bigr)}{\sum_{v \in \tilde B_{U,m}} \exp\Bigl( -\eta \sum_{s=1}^{t-1} \bigl( y_s - [v \cdot x_s]_Y \bigr)^2 \Bigr)} .
\]

By Lemma 6 in Appendix B, the above forecaster tuned with $\eta = 1/(8 Y^2)$ satisfies
\[
\begin{aligned}
\sum_{t=1}^T (y_t - \hat y_t)^2 - \inf_{u \in \tilde B_{U,m}} \sum_{t=1}^T (y_t - u \cdot x_t)^2
&\leq 8 Y^2 \ln \bigl| \tilde B_{U,m} \bigr| \\
&\leq 8 Y^2 \ln\left( \left( \frac{e (2d + m)}{m} \right)^{m} \right) & (4) \\
&= 8 Y^2 m \bigl( 1 + \ln(1 + 2d/m) \bigr) \;\leq\; 8 Y^2 \alpha \bigl( 1 + \ln(1 + 2d/\alpha) \bigr) & (5) \\
&= 8 Y^2 \alpha + 8 Y^2 \alpha \ln\left( 1 + \frac{2 d Y}{\sqrt{T}\,U X} \sqrt{\frac{\ln\bigl(1 + 2 d Y/(\sqrt{T}\,U X)\bigr)}{\ln 2}} \right) & \\
&\leq 8 Y^2 \alpha + 16 Y^2 \alpha \ln\left( 1 + \frac{2 d Y}{\sqrt{T}\,U X} \right) & (6) \\
&\leq \left( \frac{8}{\sqrt{\ln 2}} + 16 \sqrt{\ln 2} \right) U X Y \sqrt{T \ln\!\left( 1 + \frac{2 d Y}{\sqrt{T}\,U X} \right)} . & (7)
\end{aligned}
\]
To get (4) we used Lemma 8 in Appendix C. Inequality (5) follows by definition of $m \leq \alpha$ and the fact that $x \mapsto x \bigl( 1 + \ln(1 + A/x) \bigr)$ is nondecreasing on $\mathbb{R}_+^*$ for all $A > 0$. Inequality (6) follows from the assumption $U \leq 2 d Y/(\sqrt{T}\,X)$ and the elementary inequality $\ln\bigl( 1 + x \sqrt{\ln(1+x)/\ln 2} \bigr) \leq 2 \ln(1+x)$, which holds for all $x \geq 1$ and was used, e.g., at the end of [3, Theorem 2-a)]. Finally, elementary manipulations combined with the assumption that $2 d Y/(\sqrt{T}\,U X) \geq 1$ lead to (7).

Putting Eqs. (3) and (7) together, the previous algorithm has a regret on $B_1(U)$ which is bounded from above by
\[
\left( \frac{10}{\sqrt{\ln 2}} + 16 \sqrt{\ln 2} \right) U X Y \sqrt{T \ln\!\left( 1 + \frac{2 d Y}{\sqrt{T}\,U X} \right)} ,
\]
which concludes the proof since $10/\sqrt{\ln 2} + 16 \sqrt{\ln 2} \leq 26$.
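The forecaster used in the proof admits a direct implementation once $\tilde B_{U,m}$ has been enumerated. The following Python sketch is illustrative only (and exponentially expensive, since it stores one weight per point of $\tilde B_{U,m}$); it computes one prediction $\hat y_t$ from the clipped cumulative losses, with the inverse temperature $\eta = 1/(8Y^2)$ of the proof.

```python
import numpy as np

def clipped_ewa_predict(candidates, past_x, past_y, x_t, Y):
    """One prediction of the exponentially weighted average forecaster of the
    proof of Theorem 1, with predictions clipped to [-Y, Y].
    `candidates` is an (N, d) array listing the points of B~_{U,m}."""
    eta = 1.0 / (8.0 * Y ** 2)                              # inverse temperature
    if len(past_y) == 0:
        weights = np.full(len(candidates), 1.0 / len(candidates))
    else:
        past_x, past_y = np.asarray(past_x), np.asarray(past_y)
        clipped = np.clip(candidates @ past_x.T, -Y, Y)     # [u . x_s]_Y for all u, s
        cum_loss = np.sum((past_y - clipped) ** 2, axis=1)  # cumulative clipped square losses
        w = np.exp(-eta * (cum_loss - cum_loss.min()))      # shift before exp for numerical stability
        weights = w / w.sum()
    return weights @ np.clip(candidates @ x_t, -Y, Y)       # y_hat_t = sum_u p_t(u) [u . x_t]_Y
```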
2.2. Lower bound

Corollary 1 gives an upper bound on the regret in terms of the quantities $d$, $Y$, and $\kappa \triangleq \sqrt{T}\,U X/(2 d Y)$. We now show that for all $d \in \mathbb{N}^*$, $Y > 0$, and $\kappa \geq \sqrt{\ln(1+2d)}/(2 d \sqrt{\ln 2})$, the upper bound cannot be improved^7 up to logarithmic factors.

Footnote 7: For $T$ sufficiently large, we may overlook the case $\kappa < \sqrt{\ln(1+2d)}/(2 d \sqrt{\ln 2})$, i.e., $\sqrt{T} < \bigl(Y/(U X)\bigr) \sqrt{\ln(1+2d)/\ln 2}$. Observe that in this case, the minimax regret is already of the order of $Y^2 \ln(1+d)$ (cf. Figure 1).

Theorem 2 (Lower bound). For all $d \in \mathbb{N}^*$, $Y > 0$, and $\kappa \geq \sqrt{\ln(1+2d)}/(2 d \sqrt{\ln 2})$, there exist $T \geq 1$, $U > 0$, and $X > 0$ such that $\sqrt{T}\,U X/(2 d Y) = \kappa$ and
\[
\inf_F \; \sup_{\|x_t\|_\infty \leq X, \, |y_t| \leq Y} \left\{ \sum_{t=1}^T (y_t - \hat y_t)^2 - \inf_{\|u\|_1 \leq U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \right\}
\geq
\begin{cases}
\dfrac{c_1 \, d Y^2 \kappa \sqrt{\ln(1 + 1/\kappa)}}{\ln\bigl(2 + 16 d^2\bigr)} & \text{if } \dfrac{\sqrt{\ln(1+2d)}}{2 d \sqrt{\ln 2}} \leq \kappa \leq 1 , \\[2ex]
\dfrac{c_2 \, d Y^2}{\ln\bigl(2 + 16 d^2\bigr)} & \text{if } \kappa > 1 ,
\end{cases}
\]
where $c_1, c_2 > 0$ are absolute constants. The infimum is taken over all forecasters $F$ and the supremum is taken over all sequences $(x_t, y_t)_{1 \leq t \leq T} \in (\mathbb{R}^d \times \mathbb{R})^T$ such that $|y_1|, \ldots, |y_T| \leq Y$ and $\|x_1\|_\infty, \ldots, \|x_T\|_\infty \leq X$.

The above lower bound extends those of [8, 9], which hold for small $\kappa$ of the order of $1/d$. The proof is postponed to Appendix A.1. We perform a reduction to the stochastic batch setting, via the standard online to batch conversion, and employ a version of a lower bound of [2].

3. Adaptation to unknown X, Y and T via exponential weights

Although the proof of Theorem 1 already gives an algorithm that achieves the minimax regret, the latter takes as inputs $U$, $X$, $Y$, and $T$, and it is inefficient in high dimensions. In this section, we present a new method that achieves the minimax regret both efficiently and without prior knowledge of $X$, $Y$, and $T$, provided that $U$ is known. Adaptation to an unknown $U$ is considered in Section 4. Our method consists of modifying an underlying efficient linear regression algorithm such as the EG± algorithm [9] or the sequential ridge regression forecaster [14, 13]. Next, we show that automatically tuned variants of the EG± algorithm nearly achieve the minimax regret for the regime $d \geq \sqrt{T}\,U X/(2Y)$. A similar modification could be applied to the ridge regression forecaster (without retaining additional computational efficiency) to achieve a nearly optimal regret bound of order $d Y^2 \ln\bigl( 1 + d \bigl( \sqrt{T}\,U X/(d Y) \bigr)^2 \bigr)$ in the regime $d < \sqrt{T}\,U X/(2Y)$. The latter analysis is more technical and hence is omitted.

3.1. An adaptive EG± algorithm for general convex and differentiable loss functions

The second algorithm of the proof of Theorem 1 is computationally inefficient because it aggregates approximately $d^{\sqrt{T}}$ experts. In contrast, the EG± algorithm has a manageable computational complexity that is linear in $d$ at each time $t$. Next we introduce a version of the EG± algorithm, called the adaptive EG± algorithm, that does not require prior knowledge of $X$, $Y$ and $T$ (as opposed to the original EG± algorithm of [9]). This version relies on the automatic tuning of [11]. We first present a generic version suited for general convex and differentiable loss functions. The application to the square loss and to other $\alpha$-losses will be dealt with in Sections 3.2 and 3.3.

The generic setting with arbitrary convex and differentiable loss functions corresponds to the online convex optimization setting [15, 16] and unfolds as follows: at each time $t \geq 1$, the forecaster chooses a linear combination $\hat u_t \in \mathbb{R}^d$, then the environment chooses and reveals a convex and differentiable loss function $\ell_t : \mathbb{R}^d \to \mathbb{R}$, and the forecaster incurs the loss $\ell_t(\hat u_t)$. In online linear regression under the square loss, the loss functions are given by $\ell_t(u) = (y_t - u \cdot x_t)^2$.

The adaptive EG± algorithm for general convex and differentiable loss functions is defined in Figure 2. We denote by $(e_j)_{1 \leq j \leq d}$ the canonical basis of $\mathbb{R}^d$, by $\nabla \ell_t(u)$ the gradient of $\ell_t$ at $u \in \mathbb{R}^d$, and by $\nabla_j \ell_t(u)$ the $j$-th component of this gradient. The adaptive EG± algorithm uses as a blackbox the exponentially weighted majority forecaster of [11] on $2d$ experts, namely the vertices $\pm U e_j$ of $B_1(U)$, as in [9].
It adapts to the unknown gradient amplitudes $\|\nabla \ell_t\|_\infty$ by the particular choice of $\eta_t$ due to [11] and defined for all $t \geq 2$ by
\[
\eta_t = \min\left\{ \frac{1}{\hat E_{t-1}} , \; C \sqrt{\frac{\ln(2d)}{V_{t-1}}} \right\} , \tag{8}
\]
where $C \triangleq \sqrt{2(\sqrt{2}-1)/(e-2)}$ and where we set, for all $t = 1, \ldots, T$,
\[
z_{j,s}^{+} \triangleq U \nabla_j \ell_s(\hat u_s) \quad \text{and} \quad z_{j,s}^{-} \triangleq -U \nabla_j \ell_s(\hat u_s) , \qquad j = 1, \ldots, d , \; s = 1, \ldots, t ,
\]
\[
\hat E_t \triangleq \inf\left\{ 2^k : k \in \mathbb{Z} , \; 2^k \geq \max_{1 \leq s \leq t} \; \max_{\substack{1 \leq j, k \leq d \\ \gamma, \mu \in \{+,-\}}} \bigl| z_{j,s}^{\gamma} - z_{k,s}^{\mu} \bigr| \right\} ,
\qquad
V_t \triangleq \sum_{s=1}^{t} \sum_{\substack{1 \leq j \leq d \\ \gamma \in \{+,-\}}} p_{j,s}^{\gamma} \left( z_{j,s}^{\gamma} - \sum_{\substack{1 \leq k \leq d \\ \mu \in \{+,-\}}} p_{k,s}^{\mu} z_{k,s}^{\mu} \right)^{2} .
\]
Note that $\hat E_{t-1}$ approximates the range of the $z_{j,s}^{\gamma}$ up to time $t-1$, while $V_{t-1}$ is the corresponding cumulative variance of the forecaster.

Parameter: radius $U > 0$.
Initialization: $p_1 = (p_{1,1}^{+}, p_{1,1}^{-}, \ldots, p_{d,1}^{+}, p_{d,1}^{-}) \triangleq \bigl( 1/(2d), \ldots, 1/(2d) \bigr) \in \mathbb{R}^{2d}$.
At each time round $t \geq 1$,
1. Output the linear combination $\hat u_t \triangleq U \sum_{j=1}^{d} \bigl( p_{j,t}^{+} - p_{j,t}^{-} \bigr) e_j \in B_1(U)$;
2. Receive the loss function $\ell_t : \mathbb{R}^d \to \mathbb{R}$ and update the parameter $\eta_{t+1}$ according to (8);
3. Update the weight vector^a $p_{t+1} = (p_{1,t+1}^{+}, p_{1,t+1}^{-}, \ldots, p_{d,t+1}^{+}, p_{d,t+1}^{-}) \in \mathcal{X}_{2d}$ defined for all $j = 1, \ldots, d$ and $\gamma \in \{+,-\}$ by
\[
p_{j,t+1}^{\gamma} \triangleq \frac{\exp\left( -\eta_{t+1} \sum_{s=1}^{t} \gamma U \nabla_j \ell_s(\hat u_s) \right)}{\sum_{1 \leq k \leq d} \sum_{\mu \in \{+,-\}} \exp\left( -\eta_{t+1} \sum_{s=1}^{t} \mu U \nabla_k \ell_s(\hat u_s) \right)} .
\]
^a For all $\gamma \in \{+,-\}$, by a slight abuse of notation, $\gamma U$ denotes $U$ or $-U$ if $\gamma = +$ or $\gamma = -$ respectively.

Figure 2: The adaptive EG± algorithm for general convex and differentiable loss functions (see Proposition 1).

Proposition 1 (The adaptive EG± algorithm for general convex and differentiable loss functions). Let $U > 0$. Then, the adaptive EG± algorithm on $B_1(U)$ defined in Figure 2 satisfies, for all $T \geq 1$ and all sequences of convex and differentiable^8 loss functions $\ell_1, \ldots, \ell_T : \mathbb{R}^d \to \mathbb{R}$,
\[
\sum_{t=1}^{T} \ell_t(\hat u_t) - \min_{\|u\|_1 \leq U} \sum_{t=1}^{T} \ell_t(u)
\leq 4 U \sqrt{\left( \sum_{t=1}^{T} \|\nabla \ell_t(\hat u_t)\|_\infty^2 \right) \ln(2d)} + U \bigl( 8 \ln(2d) + 12 \bigr) \max_{1 \leq t \leq T} \|\nabla \ell_t(\hat u_t)\|_\infty .
\]
In particular, the regret is bounded by $4 U \max_{1 \leq t \leq T} \|\nabla \ell_t(\hat u_t)\|_\infty \bigl( \sqrt{T \ln(2d)} + 2 \ln(2d) + 3 \bigr)$.

Footnote 8: Gradients can be replaced with subgradients if the loss functions $\ell_t : \mathbb{R}^d \to \mathbb{R}$ are convex but not differentiable.

Proof: The proof follows straightforwardly from a linearization argument and from a regret bound of [11] applied to appropriately chosen loss vectors. Indeed, first note that by convexity and differentiability of $\ell_t : \mathbb{R}^d \to \mathbb{R}$ for all $t = 1, \ldots, T$, we get that
\[
\begin{aligned}
\sum_{t=1}^{T} \ell_t(\hat u_t) - \min_{\|u\|_1 \leq U} \sum_{t=1}^{T} \ell_t(u)
&= \max_{\|u\|_1 \leq U} \sum_{t=1}^{T} \bigl( \ell_t(\hat u_t) - \ell_t(u) \bigr)
\leq \max_{\|u\|_1 \leq U} \sum_{t=1}^{T} \nabla \ell_t(\hat u_t) \cdot (\hat u_t - u) \\
&= \max_{\substack{1 \leq j \leq d \\ \gamma \in \{+,-\}}} \sum_{t=1}^{T} \nabla \ell_t(\hat u_t) \cdot (\hat u_t - \gamma U e_j) & (9) \\
&= \sum_{t=1}^{T} \sum_{\substack{1 \leq j \leq d \\ \gamma \in \{+,-\}}} p_{j,t}^{\gamma} \, \gamma U \nabla_j \ell_t(\hat u_t) - \min_{\substack{1 \leq j \leq d \\ \gamma \in \{+,-\}}} \sum_{t=1}^{T} \gamma U \nabla_j \ell_t(\hat u_t) , & (10)
\end{aligned}
\]
where (9) follows by linearity of $u \mapsto \sum_{t=1}^{T} \nabla \ell_t(\hat u_t) \cdot (\hat u_t - u)$ on the polytope $B_1(U)$, and where (10) follows from the particular choice of $\hat u_t$ in Figure 2.

To conclude the proof, note that our choices of the weight vectors $p_t \in \mathcal{X}_{2d}$ in Figure 2 and of the time-varying parameter $\eta_t$ in (8) correspond to the exponentially weighted average forecaster of [11, Section 4.2] when it is applied to the loss vectors $\bigl( U \nabla_j \ell_t(\hat u_t), \, -U \nabla_j \ell_t(\hat u_t) \bigr)_{1 \leq j \leq d} \in \mathbb{R}^{2d}$, $t = 1, \ldots, T$. Since at time $t$ the coordinates of the last loss vector lie in an interval of length $\hat E_t \leq 2 U \|\nabla \ell_t(\hat u_t)\|_\infty$, we get from [11, Corollary 1] that
\[
\sum_{t=1}^{T} \sum_{\substack{1 \leq j \leq d \\ \gamma \in \{\pm\}}} p_{j,t}^{\gamma} \, \gamma U \nabla_j \ell_t(\hat u_t) - \min_{\substack{1 \leq j \leq d \\ \gamma \in \{\pm\}}} \sum_{t=1}^{T} \gamma U \nabla_j \ell_t(\hat u_t)
\leq 4 U \sqrt{\left( \sum_{t=1}^{T} \|\nabla \ell_t(\hat u_t)\|_\infty^2 \right) \ln(2d)} + U \bigl( 8 \ln(2d) + 12 \bigr) \max_{1 \leq t \leq T} \|\nabla \ell_t(\hat u_t)\|_\infty .
\]
Substituting the last upper bound in (10) concludes the proof.
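A compact way to read Figure 2 and the tuning rule (8) is the following Python sketch. It is only an illustration under our reading of the algorithm, not the authors' code: `grad_fns[t]` is assumed to be a callable returning $\nabla\ell_t$ evaluated at the current $\hat u_t$, and the guards used in the first rounds (zero variance or zero gradients) are arbitrary choices.

```python
import numpy as np

def adaptive_eg_pm(U, grad_fns, d, T):
    """Sketch of the adaptive EG+- algorithm of Figure 2 with the tuning rule (8)."""
    C = np.sqrt(2 * (np.sqrt(2) - 1) / (np.e - 2))    # constant C in the tuning rule (8)
    p = np.full(2 * d, 1.0 / (2 * d))                 # weights (p^+_1,...,p^+_d, p^-_1,...,p^-_d)
    cum_z = np.zeros(2 * d)                           # cumulative signed losses sum_s z^gamma_{j,s}
    V = 0.0                                           # cumulative variance V_t
    max_range = 0.0                                   # running max of the coordinate ranges of z_{.,s}
    iterates = []
    for t in range(T):
        u_hat = U * (p[:d] - p[d:])                   # step 1: output u_hat_t in B_1(U)
        iterates.append(u_hat)
        g = grad_fns[t](u_hat)                        # step 2: receive l_t, take its gradient at u_hat_t
        z = np.concatenate([U * g, -U * g])           # z^+_{j,t} = U grad_j, z^-_{j,t} = -U grad_j
        V += np.sum(p * (z - p @ z) ** 2)             # variance of the forecaster at time t
        max_range = max(max_range, 2 * U * np.max(np.abs(g)))
        E = 2.0 ** np.ceil(np.log2(max_range)) if max_range > 0 else 1.0   # E_t: smallest power of 2 >= range
        eta = min(1.0 / E, C * np.sqrt(np.log(2 * d) / V)) if V > 0 else 1.0 / E   # rule (8)
        cum_z += z
        w = np.exp(-eta * (cum_z - cum_z.min()))      # step 3: exponential weights (shifted for stability)
        p = w / w.sum()
    return iterates
```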
3.2. Application to the square loss

In the particular case of the square loss $\ell_t(u) = (y_t - u \cdot x_t)^2$, the gradients are given by $\nabla \ell_t(u) = -2 (y_t - u \cdot x_t)\, x_t$ for all $u \in \mathbb{R}^d$. Applying Proposition 1, we get the following regret bound for the adaptive EG± algorithm.

Corollary 2 (The adaptive EG± algorithm under the square loss). Let $U > 0$. Consider the online linear regression setting defined in the introduction. Then, the adaptive EG± algorithm (see Figure 2) tuned with $U$ and applied to the loss functions $\ell_t : u \mapsto (y_t - u \cdot x_t)^2$ satisfies, for all individual sequences $(x_1, y_1), \ldots, (x_T, y_T) \in \mathbb{R}^d \times \mathbb{R}$,
\[
\begin{aligned}
\sum_{t=1}^{T} (y_t - \hat u_t \cdot x_t)^2 - \min_{\|u\|_1 \leq U} \sum_{t=1}^{T} (y_t - u \cdot x_t)^2
&\leq 8 U X \sqrt{\left( \min_{\|u\|_1 \leq U} \sum_{t=1}^{T} (y_t - u \cdot x_t)^2 \right) \ln(2d)} + \bigl( 137 \ln(2d) + 24 \bigr) \bigl( U X Y + U^2 X^2 \bigr) \\
&\leq 8 U X Y \sqrt{T \ln(2d)} + \bigl( 137 \ln(2d) + 24 \bigr) \bigl( U X Y + U^2 X^2 \bigr) ,
\end{aligned}
\]
where the quantities $X \triangleq \max_{1 \leq t \leq T} \|x_t\|_\infty$ and $Y \triangleq \max_{1 \leq t \leq T} |y_t|$ are unknown to the forecaster.

Using the terminology of [17, 11], the first bound of Corollary 2 is an improvement for small losses: it yields a small regret when the optimal cumulative loss $\min_{\|u\|_1 \leq U} \sum_{t=1}^{T} (y_t - u \cdot x_t)^2$ is small. As for the second regret bound, it indicates that the adaptive EG± algorithm achieves approximately the regret bound of Theorem 1 in the regime $\kappa \leq 1$, i.e., $d \geq \sqrt{T}\,U X/(2Y)$. In this regime, our algorithm thus has a manageable computational complexity (linear in $d$ at each time $t$) and it is adaptive in $X$, $Y$, and $T$.

In particular, the above regret bound is similar^9 to that of the original EG± algorithm [9, Theorem 5.11], but it is obtained without prior knowledge of $X$, $Y$, and $T$. Note also that this bound is similar to that of the self-confident $p$-norm algorithm of [10] with $p = 2 \ln d$ (see Section 1.2). The fact that we were able to get similar adaptivity and efficiency properties via exponential weighting corroborates the similarity that was already observed in a non-adaptive context between the original EG± algorithm and the $p$-norm algorithm (in the limit $p \to +\infty$ with an appropriate initial weight vector, or for $p$ of the order of $\ln d$ with a zero initial weight vector, cf. [18]).

Proof (of Corollary 2): We apply Proposition 1 with the square loss $\ell_t(u) = (y_t - u \cdot x_t)^2$. It yields
\[
\sum_{t=1}^{T} \ell_t(\hat u_t) - \min_{\|u\|_1 \leq U} \sum_{t=1}^{T} \ell_t(u)
\leq 4 U \sqrt{\left( \sum_{t=1}^{T} \|\nabla \ell_t(\hat u_t)\|_\infty^2 \right) \ln(2d)} + U \bigl( 8 \ln(2d) + 12 \bigr) \max_{1 \leq t \leq T} \|\nabla \ell_t(\hat u_t)\|_\infty . \tag{11}
\]
Using the equality $\nabla \ell_t(u) = -2 (y_t - u \cdot x_t)\, x_t$ for all $u \in \mathbb{R}^d$, we get that, on the one hand, by the upper bound $\|x_t\|_\infty \leq X$,
\[
\|\nabla \ell_t(\hat u_t)\|_\infty^2 \leq 4 X^2 \ell_t(\hat u_t) , \tag{12}
\]
and, on the other hand, $\max_{1 \leq t \leq T} \|\nabla \ell_t(\hat u_t)\|_\infty \leq 2 (Y + U X) X$ (indeed, by Hölder's inequality, $|\hat u_t \cdot x_t| \leq \|\hat u_t\|_1 \|x_t\|_\infty \leq U X$). Substituting the last two inequalities in (11), setting $\hat L_T \triangleq \sum_{t=1}^{T} \ell_t(\hat u_t)$ as well as $L_T^* \triangleq \min_{\|u\|_1 \leq U} \sum_{t=1}^{T} \ell_t(u)$, we get that
\[
\hat L_T \leq L_T^* + 8 U X \sqrt{\hat L_T \ln(2d)} + \underbrace{\bigl( 16 \ln(2d) + 24 \bigr) \bigl( U X Y + U^2 X^2 \bigr)}_{\triangleq \, C} .
\]
Solving for $\hat L_T$ via Lemma 4 in Appendix B, we get that
\[
\begin{aligned}
\hat L_T &\leq L_T^* + C + 8 U X \sqrt{\ln(2d)} \, \sqrt{L_T^* + C} + \bigl( 8 U X \sqrt{\ln(2d)} \bigr)^2 \\
&\leq L_T^* + 8 U X \sqrt{L_T^* \ln(2d)} + 8 U X \sqrt{C \ln(2d)} + 64\, U^2 X^2 \ln(2d) + C .
\end{aligned}
\]
Using that
\[
\begin{aligned}
U X \sqrt{C \ln(2d)} &= U X \ln(2d) \sqrt{\bigl( 16 + 24/\ln(2d) \bigr) \bigl( U X Y + U^2 X^2 \bigr)} \\
&\leq \sqrt{U^2 X^2 + U X Y} \; \ln(2d) \sqrt{\bigl( 16 + 24/\ln(2) \bigr) \bigl( U X Y + U^2 X^2 \bigr)} \\
&= \sqrt{16 + 24/\ln(2)} \; \bigl( U X Y + U^2 X^2 \bigr) \ln(2d)
\end{aligned}
\]
and performing some simple upper bounds concludes the proof of the first regret bound. The second one follows immediately by noting that $\min_{\|u\|_1 \leq U} \sum_{t=1}^{T} (y_t - u \cdot x_t)^2 \leq \sum_{t=1}^{T} y_t^2 \leq T Y^2$ (since $0 \in B_1(U)$).
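As a sanity check of Corollary 2, the generic sketch of Section 3.1 can be fed the square-loss gradients $\nabla\ell_t(u) = -2(y_t - u\cdot x_t)\,x_t$. The data below are synthetic and purely illustrative, and `adaptive_eg_pm` refers to the hypothetical sketch given after Proposition 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, U = 10, 200, 1.0
u_star = np.concatenate([[0.5, -0.3], np.zeros(d - 2)])     # a sparse comparison point in B_1(U)
X_data = rng.uniform(-1.0, 1.0, size=(T, d))
y_data = X_data @ u_star + 0.1 * rng.standard_normal(T)

# Square-loss gradients: grad l_t(u) = -2 (y_t - u . x_t) x_t
grad_fns = [(lambda x, y: (lambda u: -2.0 * (y - u @ x) * x))(x_t, y_t)
            for x_t, y_t in zip(X_data, y_data)]

u_hats = adaptive_eg_pm(U, grad_fns, d, T)
loss_alg = sum((y - u @ x) ** 2 for u, x, y in zip(u_hats, X_data, y_data))
loss_ref = sum((y - u_star @ x) ** 2 for x, y in zip(X_data, y_data))
print(loss_alg - loss_ref)   # regret against the sparse comparison point u_star
```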
Footnote 9: By Theorem 5.11 of [9], the original EG± algorithm satisfies the regret bound $2 U X \sqrt{2 B \ln(2d)} + 2 U^2 X^2 \ln(2d)$, where $B$ is an upper bound on $\min_{\|u\|_1 \leq U} \sum_{t=1}^{T} (y_t - u \cdot x_t)^2$ (in particular, $B \leq T Y^2$). Note that our main regret term is larger by a multiplicative factor of $2\sqrt{2}$. However, contrary to [9], our algorithm does not require the prior knowledge of $X$ and $B$, or, alternatively, of $X$, $Y$, and $T$.