Scale-Free Online Learning✩

Francesco Orabona∗
Stony Brook University, Stony Brook, NY 11794, USA

Dávid Pál∗∗
Yahoo Research, 14th Floor, 229 West 43rd Street, New York, NY 10036, USA

arXiv:1601.01974v2 [cs.LG] 14 Dec 2016

Abstract

We design and analyze algorithms for online linear optimization that have optimal regret and at the same time do not need to know any upper or lower bounds on the norm of the loss vectors. Our algorithms are instances of the Follow the Regularized Leader (FTRL) and Mirror Descent (MD) meta-algorithms. We achieve adaptiveness to the norms of the loss vectors by scale invariance, i.e., our algorithms make exactly the same decisions if the sequence of loss vectors is multiplied by any positive constant. The algorithm based on FTRL works for any decision set, bounded or unbounded. For unbounded decision sets, this is the first adaptive algorithm for online linear optimization with a non-vacuous regret bound. In contrast, we show lower bounds on scale-free algorithms based on MD on unbounded domains.

1. Introduction

Online Linear Optimization (OLO) is a problem where an algorithm repeatedly chooses a point $w_t$ from a convex decision set $K$, observes an arbitrary, or even adversarially chosen, loss vector $\ell_t$ and suffers the loss $\langle \ell_t, w_t \rangle$. The goal of the algorithm is to have a small cumulative loss. The performance of an algorithm is evaluated by the so-called regret, which is the difference of the cumulative losses of the algorithm and of the (hypothetical) strategy that would choose in every round the same best point in hindsight.

OLO is a fundamental problem in machine learning [2, 3, 4]. Many learning problems can be directly phrased as OLO, e.g., learning with expert advice [5, 6, 7, 8] and online combinatorial optimization [9, 10, 11]. Other problems can be reduced to OLO, e.g., online convex optimization [12], [4, Chapter 2], online

✩ A preliminary version of this paper [1] was presented at ALT 2015.
∗ Work done while at Yahoo Research.
∗∗ Corresponding author.
Email addresses: [email protected] (Francesco Orabona), [email protected] (Dávid Pál)

Preprint submitted to Elsevier, December 15, 2016

| Algorithm                       | Decision Set(s)                                  | Regularizer(s)                                            | Scale-Free |
|---------------------------------|--------------------------------------------------|-----------------------------------------------------------|------------|
| Hedge [7]                       | Probability Simplex                              | Negative Entropy                                          | No         |
| GIGA [20]                       | Any Bounded                                      | $\frac{1}{2}\|w\|_2^2$                                    | No         |
| RDA [21]                        | Any                                              | Any Strongly Convex                                       | No         |
| FTRL-Proximal [22, 23]          | Any Bounded                                      | $\frac{1}{2}\|w\|_2^2$ + any convex func.¹                | Yes        |
| AdaGrad MD [24]                 | Any Bounded                                      | $\frac{1}{2}\|w\|_2^2$ + any convex func.                 | Yes        |
| AdaGrad FTRL [24]               | Any                                              | $\frac{1}{2}\|w\|_2^2$ + any convex func.                 | No         |
| AdaHedge [25]                   | Probability Simplex                              | Negative Entropy                                          | Yes        |
| NAG [26]                        | $\{u : \max_t \langle \ell_t, u \rangle \le C\}$ | $\frac{1}{2}\|w\|_2^2$                                    | Partially² |
| Scale invariant algorithms [27] | Any                                              | $\frac{1}{2}\|w\|_p^2$ + any convex func., $1 < p \le 2$  | Partially² |
| Scale-free MD [this paper]      | $\sup_{u,v \in K} B_f(u,v) < \infty$             | Any Strongly Convex                                       | Yes        |
| SOLO FTRL [this paper]          | Any                                              | Any Strongly Convex                                       | Yes        |

Table 1: Selected results for OLO. Best results in each column are in bold.

classification [13, 14] and regression [15], [2, Chapters 11 and 12], multi-armed bandit problems [2, Chapter 6], [16, 17], and batch and stochastic optimization of convex functions [18, 19]. Hence, a result in OLO immediately implies other results in all these domains.

The adversarial choice of the loss vectors received by the algorithm is what makes the OLO problem challenging. In particular, if an OLO algorithm commits to an upper bound on the norm of future loss vectors, its regret can be made arbitrarily large through an adversarial strategy that produces loss vectors with norms that exceed the upper bound.

For this reason, most of the existing OLO algorithms receive as an input (or implicitly assume) an upper bound $B$ on the norm of the loss vectors. The input $B$ is often disguised as the learning rate, the regularization parameter, or the parameter of strong convexity of the regularizer. However, these algorithms have two obvious drawbacks. First, they do not come with any regret guarantee for sequences of loss vectors with norms exceeding $B$.
Second, on sequences of loss vectors with norms bounded by $b \ll B$, these algorithms fail to have an optimal regret guarantee that depends on $b$ rather than on $B$.

¹ Even if, in principle, the FTRL-Proximal algorithm can be used with any proximal regularizer, to the best of our knowledge a general way to construct proximal regularizers is not known. The only proximal regularizer we are aware of is based on the 2-norm.
² These algorithms attempt to produce an invariant sequence of predictions $\langle w_t, \ell_t \rangle$, rather than a sequence of invariant $w_t$.

There is a clear practical need to design algorithms that adapt automatically to the norms of the loss vectors. A natural, yet overlooked, design method to achieve this type of adaptivity is to insist on having a scale-free algorithm. That is, with the same parameters, the sequence of decisions of the algorithm does not change if the sequence of loss vectors is multiplied by a positive constant. The most important property of scale-free algorithms is that both their loss and their regret scale linearly with the maximum norm of the loss vectors appearing in the sequence.

1.1. Previous results

The majority of the existing algorithms for OLO are based on two generic algorithms: Follow The Regularized Leader (FTRL) and Mirror Descent (MD). FTRL dates back to the potential-based forecaster in [2, Chapter 11] and its theory was developed in [28]. The name Follow The Regularized Leader comes from [16]. Independently, the same algorithm was proposed in [29] for convex optimization under the name Dual Averaging and rediscovered in [21] for online convex optimization. Time-varying regularizers were analyzed in [24] and the analysis was tightened in [27]. MD was originally proposed in [18] and later analyzed in [30] for convex optimization. In the online learning literature it makes its first appearance, with a different name, in [15].

Both FTRL and MD are parametrized by a function called a regularizer. Based on different regularizers, different algorithms with different properties can be instantiated.
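As a quick illustration of the scale-free property (our own sketch, not part of the paper), the following Python snippet contrasts online gradient descent with a fixed learning rate, which is not scale-free, against a variant whose step is normalized by the largest loss norm observed so far; this particular normalization rule is chosen only for illustration.

```python
import numpy as np

def ogd_fixed(losses, eta=0.5):
    """Online gradient descent with a fixed learning rate -- NOT scale-free."""
    w, preds = np.zeros(losses.shape[1]), []
    for l in losses:
        preds.append(w.copy())
        w = w - eta * l
    return np.array(preds)

def ogd_normalized(losses, eta=0.5):
    """Same update, but the step is divided by the largest loss norm seen so
    far; dividing by a quantity that scales with the losses makes the
    decision sequence invariant to rescaling the whole loss sequence."""
    w, preds, m = np.zeros(losses.shape[1]), [], 0.0
    for l in losses:
        preds.append(w.copy())
        m = max(m, float(np.linalg.norm(l)))
        if m > 0:
            w = w - eta * l / m
    return np.array(preds)

rng = np.random.default_rng(0)
losses, c = rng.standard_normal((20, 3)), 10.0
# The fixed-learning-rate learner changes its decisions under rescaling...
assert not np.allclose(ogd_fixed(losses), ogd_fixed(c * losses))
# ...while the normalized learner makes exactly the same decisions.
assert np.allclose(ogd_normalized(losses), ogd_normalized(c * losses))
print("scale-invariance illustration passed")
```

Multiplying every $\ell_t$ by $c > 0$ multiplies both the step $\ell_t$ and the running maximum $m$ by $c$, so the normalized update, and hence every decision, is unchanged.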
A summary of algorithms for OLO is presented in Table 1. All of them are instances of FTRL or MD.

Scale-free versions of MD include AdaGrad MD [24]. However, the AdaGrad MD algorithm has a non-trivial regret bound only when the Bregman divergence associated with the regularizer is bounded. In particular, since a bound on the Bregman divergence implies that the decision set is bounded, the regret bound for AdaGrad MD is vacuous for unbounded sets. In fact, as we show in Section 4.1, AdaGrad MD and similar algorithms based on MD incur $\Omega(T)$ regret, in the worst case, if the Bregman divergence is not bounded.

Only one scale-free algorithm based on FTRL was known: the AdaHedge [25] algorithm for learning with expert advice, where the decision set is bounded. An algorithm based on FTRL that is "almost" scale-free is AdaGrad FTRL [24]. This algorithm fails to be scale-free due to an "off-by-one" issue; see [23] and the discussion in Section 3. Instead, FTRL-Proximal [22, 23] solves the off-by-one issue, but it requires proximal regularizers. In general, proximal regularizers do not have a simple form, and even the simple 2-norm case requires bounded domains to achieve non-vacuous regret.

For unbounded decision sets, no scale-free algorithm with a non-trivial regret bound was known. Unbounded decision sets are practically important (see, e.g., [31]), since learning of large-scale linear models (e.g., logistic regression) is done by gradient methods that can be reduced to OLO with decision set $\mathbb{R}^d$.

1.2. Overview of the Results

We design and analyze two scale-free algorithms: SOLO FTRL and Scale-Free MD. A third one, AdaFTRL, is presented in the Appendix. SOLO FTRL and AdaFTRL are based on FTRL. AdaFTRL is a generalization of AdaHedge [25] to arbitrary strongly convex regularizers. SOLO FTRL can be viewed as the "correct" scale-free version of the diagonal version of AdaGrad FTRL [24], generalized to arbitrary strongly convex regularizers. Scale-Free MD is based on MD. It is a generalization of AdaGrad MD [24] to arbitrary strongly convex regularizers.
The three algorithms are presented in Sections 3 and 4 and Appendix B, respectively.

We prove that the regret of SOLO FTRL and AdaFTRL on bounded domains after $T$ rounds is bounded by $O\bigl(\sqrt{\sup_{v \in K} f(v) \sum_{t=1}^T \|\ell_t\|_*^2}\bigr)$, where $f$ is a non-negative regularizer that is 1-strongly convex with respect to a norm $\|\cdot\|$ and $\|\cdot\|_*$ is its dual norm. For Scale-Free MD, we prove $O\bigl(\sqrt{\sup_{u,v \in K} B_f(u,v) \sum_{t=1}^T \|\ell_t\|_*^2}\bigr)$, where $B_f$ is the Bregman divergence associated with a 1-strongly convex regularizer $f$. In Section 5, we show that the $\sqrt{\sum_{t=1}^T \|\ell_t\|_*^2}$ term in the bounds is necessary by proving a $\frac{D}{\sqrt{8}} \sqrt{\sum_{t=1}^T \|\ell_t\|_*^2}$ lower bound on the regret of any algorithm for OLO for any decision set with diameter $D$ with respect to the primal norm $\|\cdot\|$.

For SOLO FTRL, we prove that the regret against a competitor $u \in K$ is at most $O\bigl(f(u) \sqrt{\sum_{t=1}^T \|\ell_t\|_*^2} + \max_{t=1,2,\dots,T} \|\ell_t\|_* \sqrt{T}\bigr)$. As before, $f$ is a non-negative 1-strongly convex regularizer. This bound is non-trivial for any decision set, bounded or unbounded. The result makes SOLO FTRL the first adaptive algorithm for unbounded decision sets with a non-trivial regret bound.

All three algorithms are any-time, i.e., they do not need to know the number of rounds, $T$, in advance, and the regret bounds hold for all $T$ simultaneously. Our proof techniques rely on new homogeneous inequalities (Lemmas 3 and 7) which might be of independent interest.

Finally, in Section 4.1, we show negative results for existing popular variants of MD. We show two examples of decision sets and sequences of loss vectors of unit norm on which these variants of MD have $\Omega(T)$ regret. These results indicate that FTRL is superior to MD in a worst-case sense.

2. Notation and Preliminaries

Let $V$ be a finite-dimensional³ real vector space equipped with a norm $\|\cdot\|$. We denote by $V^*$ its dual vector space. The bilinear map associated with the pair $(V^*, V)$ is denoted by $\langle \cdot, \cdot \rangle : V^* \times V \to \mathbb{R}$. The dual norm of $\|\cdot\|$ is $\|\cdot\|_*$.

³ Many, but not all, of our results can be extended to more general normed vector spaces.
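To make the norm pairing concrete (an illustration of ours, not from the paper): in $V = \mathbb{R}^d$ equipped with the 1-norm, the dual norm is the $\infty$-norm and the pairing $\langle \cdot, \cdot \rangle$ is the ordinary dot product. The snippet below numerically checks the resulting Hölder inequality $\langle \ell, w \rangle \le \|\ell\|_* \|w\|$ and verifies that the dual-norm supremum is attained at a signed coordinate vector.

```python
import numpy as np

def norm_1(w):
    """Primal norm ||w||_1 on R^d."""
    return float(np.abs(w).sum())

def norm_inf(l):
    """Its dual norm ||l||_* = ||l||_inf."""
    return float(np.abs(l).max())

rng = np.random.default_rng(1)
for _ in range(1000):
    w, l = rng.standard_normal(5), rng.standard_normal(5)
    # Hoelder's inequality for the dual pair (||.||_1, ||.||_inf):
    assert np.dot(l, w) <= norm_inf(l) * norm_1(w) + 1e-12

# The supremum sup_{||w||_1 <= 1} <l, w> defining the dual norm is attained
# at a signed coordinate vector, so it equals ||l||_inf exactly:
l = np.array([0.3, -2.0, 1.1])
i = int(np.abs(l).argmax())
w_star = np.sign(l[i]) * np.eye(3)[i]
assert np.isclose(np.dot(l, w_star), norm_inf(l))
print("dual-norm checks passed")
```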
In OLO, in each round $t = 1, 2, \dots$, the algorithm chooses a point $w_t$ in the decision set $K \subseteq V$ and then the algorithm observes a loss vector $\ell_t \in V^*$. The instantaneous loss of the algorithm in round $t$ is $\langle \ell_t, w_t \rangle$. The cumulative loss of the algorithm after $T$ rounds is $\sum_{t=1}^T \langle \ell_t, w_t \rangle$. The regret of the algorithm with respect to a point $u \in K$ is

\[ \mathrm{Regret}_T(u) = \sum_{t=1}^T \langle \ell_t, w_t \rangle - \sum_{t=1}^T \langle \ell_t, u \rangle, \]

and the regret with respect to the best point is $\mathrm{Regret}_T = \sup_{u \in K} \mathrm{Regret}_T(u)$. We assume that $K$ is a non-empty closed convex subset of $V$. Sometimes we will assume that $K$ is also bounded. We denote by $D$ its diameter with respect to $\|\cdot\|$, i.e., $D = \sup_{u,v \in K} \|u - v\|$. If $K$ is unbounded, $D = +\infty$.

2.1. Convex Analysis

The Bregman divergence of a convex differentiable function $f$ is defined as $B_f(u,v) = f(u) - f(v) - \langle \nabla f(v), u - v \rangle$. Note that $B_f(u,v) \ge 0$ for any $u, v$, which follows directly from the definition of convexity of $f$.

The Fenchel conjugate of a function $f : K \to \mathbb{R}$ is the function $f^* : V^* \to \mathbb{R} \cup \{+\infty\}$ defined as $f^*(\ell) = \sup_{w \in K} \left( \langle \ell, w \rangle - f(w) \right)$. The Fenchel conjugate of any function is convex (since it is a supremum of affine functions) and satisfies the Fenchel–Young inequality

\[ \forall w \in K, \ \forall \ell \in V^* : \quad f(w) + f^*(\ell) \ge \langle \ell, w \rangle. \]

Monotonicity of Fenchel conjugates follows easily from the definition: if $f, g : K \to \mathbb{R}$ satisfy $f(w) \le g(w)$ for all $w \in K$, then $f^*(\ell) \ge g^*(\ell)$ for every $\ell \in V^*$.

Given $\lambda > 0$, a function $f : K \to \mathbb{R}$ is called $\lambda$-strongly convex with respect to a norm $\|\cdot\|$ if and only if, for all $x, y \in K$,

\[ f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2} \|x - y\|^2, \]

where $\nabla f(x)$ is any subgradient of $f$ at the point $x$.

The following proposition relates the range of values of a strongly convex function to the diameter of its domain. The proof can be found in Appendix A.

Proposition 1 (Diameter vs. Range). Let $K \subseteq V$ be a non-empty bounded closed convex set. Let $D = \sup_{u,v \in K} \|u - v\|$ be its diameter with respect to $\|\cdot\|$.
Let $f : K \to \mathbb{R}$ be a non-negative lower semi-continuous function that is 1-strongly convex with respect to $\|\cdot\|$. Then, $D \le \sqrt{8 \sup_{v \in K} f(v)}$.

Fenchel conjugates and strongly convex functions have certain nice properties, which we list in Proposition 2 below.

Algorithm 1 FTRL with Varying Regularizer
Require: Non-empty closed convex set $K \subseteq V$
 1: Initialize $L_0 \leftarrow 0$
 2: for $t = 1, 2, 3, \dots$ do
 3:   Choose a regularizer $R_t : K \to \mathbb{R}$
 4:   $w_t \leftarrow \operatorname{argmin}_{w \in K} \left( \langle L_{t-1}, w \rangle + R_t(w) \right)$
 5:   Predict $w_t$
 6:   Observe $\ell_t \in V^*$
 7:   $L_t \leftarrow L_{t-1} + \ell_t$
 8: end for

Proposition 2 (Fenchel Conjugates of Strongly Convex Functions). Let $K \subseteq V$ be a non-empty closed convex set with diameter $D := \sup_{u,v \in K} \|u - v\|$. Let $\lambda > 0$, and let $f : K \to \mathbb{R}$ be a lower semi-continuous function that is $\lambda$-strongly convex with respect to $\|\cdot\|$. The Fenchel conjugate of $f$ satisfies:

1. $f^*$ is finite everywhere and differentiable everywhere.
2. For any $\ell \in V^*$, $\nabla f^*(\ell) = \operatorname{argmin}_{w \in K} \left( f(w) - \langle \ell, w \rangle \right)$.
3. For any $\ell \in V^*$, $f^*(\ell) + f(\nabla f^*(\ell)) = \langle \ell, \nabla f^*(\ell) \rangle$.
4. $f^*$ is $\frac{1}{\lambda}$-strongly smooth, i.e., for any $x, y \in V^*$, $B_{f^*}(x,y) \le \frac{1}{2\lambda} \|x - y\|_*^2$.
5. $f^*$ has $\frac{1}{\lambda}$-Lipschitz continuous gradients, i.e., $\|\nabla f^*(x) - \nabla f^*(y)\| \le \frac{1}{\lambda} \|x - y\|_*$ for any $x, y \in V^*$.
6. $B_{f^*}(x,y) \le D \|x - y\|_*$ for any $x, y \in V^*$.
7. $\|\nabla f^*(x) - \nabla f^*(y)\| \le D$ for any $x, y \in V^*$.
8. For any $c > 0$, $(c f(\cdot))^* = c f^*(\cdot / c)$.

Except for properties 6 and 7, the proofs can be found in [28]. Property 6 is proven in Appendix A. Property 7 trivially follows from property 2.

2.2. Generic FTRL with Varying Regularizer

Two of our scale-free algorithms are instances of FTRL with varying regularizers, presented as Algorithm 1. The algorithm is parametrized by a sequence $\{R_t\}_{t=1}^\infty$ of functions $R_t : K \to \mathbb{R}$ called regularizers. Each regularizer $R_t$ can depend on the past loss vectors $\ell_1, \ell_2, \dots, \ell_{t-1}$ in an arbitrary way. The following lemma bounds its regret.

Lemma 1 (Regret of FTRL). If the regularizers $R_1, R_2, \dots$
chosen by Algorithm 1 are strongly convex and lower semi-continuous, the algorithm's regret is upper bounded as

\[ \mathrm{Regret}_T(u) \le R_{T+1}(u) + R_1^*(0) + \sum_{t=1}^T \left( B_{R_t^*}(-L_t, -L_{t-1}) - R_t^*(-L_t) + R_{t+1}^*(-L_t) \right). \]

The proof of the lemma can be found in [27]. For completeness, we include it in Appendix A.

Algorithm 2 Mirror Descent with Varying Regularizer
Require: Non-empty closed convex set $K \subseteq V$
 1: Choose a regularizer $R_0 : K \to \mathbb{R}$
 2: $w_1 \leftarrow \operatorname{argmin}_{w \in K} R_0(w)$
 3: for $t = 1, 2, 3, \dots$ do
 4:   Predict $w_t$
 5:   Observe $\ell_t \in V^*$
 6:   Choose a regularizer $R_t : K \to \mathbb{R}$
 7:   $w_{t+1} \leftarrow \operatorname{argmin}_{w \in K} \left( \langle \ell_t, w \rangle + B_{R_t}(w, w_t) \right)$
 8: end for

2.3. Generic Mirror Descent with Varying Regularizer

Mirror Descent (MD) is a generic algorithm similar to FTRL but quite different in the details. The algorithm is stated as Algorithm 2. The algorithm is parametrized by a sequence $\{R_t\}_{t=0}^\infty$ of convex functions $R_t : K \to \mathbb{R}$ called regularizers. Each regularizer $R_t$ can depend on past loss vectors $\ell_1, \ell_2, \dots, \ell_t$ in an arbitrary way. If $R_t$ is not differentiable,⁴ the Bregman divergence $B_{R_t}(u,v) = R_t(u) - R_t(v) - \langle \nabla R_t(v), u - v \rangle$ needs to be defined. This is done by choosing a subgradient map $\nabla R_t : K \to V^*$, i.e., a function such that $\nabla R_t(w)$ is a subgradient of $R_t$ at any point $w$. If $R_t$ is a restriction of a differentiable function $R_t'$, it is convenient to define $\nabla R_t(w) = \nabla R_t'(w)$ for all $w \in K$. The following lemma bounds the regret of MD.

Lemma 2 (Regret of MD). Algorithm 2 satisfies, for any $u \in K$,

\[ \mathrm{Regret}_T(u) \le \sum_{t=1}^T \left( \langle \ell_t, w_t - w_{t+1} \rangle - B_{R_t}(w_{t+1}, w_t) + B_{R_t}(u, w_t) - B_{R_t}(u, w_{t+1}) \right). \]

The proof of the lemma can be found in [3, 32]. For completeness, we give a proof in Appendix E.

2.4. Per-Coordinate Learning

An interesting class of algorithms, proposed in [22] and [24], is based on so-called per-coordinate learning rates. As shown in [33], any algorithm for OLO can be used with per-coordinate learning rates as well. Abstractly, we assume that the decision set is a Cartesian product $K = K_1 \times K_2 \times \cdots \times K_d$ of a finite number of convex sets.
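To make the generic updates of Algorithms 1 and 2 concrete, the following sketch (our own illustration, not code from the paper) instantiates both with the time-varying quadratic regularizers $R_t(w) = \frac{\eta_t}{2}\|w\|_2^2$ on a Euclidean ball, for which both argmins reduce to a scaled step followed by a projection; the schedule $\eta_t = \sqrt{t}$ is an arbitrary illustrative choice.

```python
import numpy as np

def project_ball(w, r=1.0):
    """Euclidean projection onto K = {w : ||w||_2 <= r}."""
    n = float(np.linalg.norm(w))
    return w if n <= r else w * (r / n)

def ftrl(losses, eta=lambda t: np.sqrt(t)):
    """Algorithm 1 with R_t(w) = (eta(t)/2)||w||_2^2 on the unit ball:
    w_t = argmin_{w in K} <L_{t-1}, w> + R_t(w) = Proj_K(-L_{t-1}/eta(t))."""
    L, preds = np.zeros(losses.shape[1]), []
    for t, l in enumerate(losses, start=1):
        preds.append(project_ball(-L / eta(t)))
        L = L + l                       # L_t = L_{t-1} + l_t
    return np.array(preds)

def mirror_descent(losses, eta=lambda t: np.sqrt(t)):
    """Algorithm 2 with B_{R_t}(u, v) = (eta(t)/2)||u - v||_2^2 on the unit
    ball: w_{t+1} = argmin_{w in K} <l_t, w> + B_{R_t}(w, w_t)
                  = Proj_K(w_t - l_t/eta(t))."""
    w, preds = np.zeros(losses.shape[1]), []
    for t, l in enumerate(losses, start=1):
        preds.append(w.copy())
        w = project_ball(w - l / eta(t))
    return np.array(preds)

rng = np.random.default_rng(2)
losses = rng.standard_normal((50, 4))
for preds in (ftrl(losses), mirror_descent(losses)):
    # Every decision must lie in the decision set K.
    assert all(np.linalg.norm(w) <= 1.0 + 1e-9 for w in preds)
print("FTRL and MD sketches ran; all decisions lie in the unit ball")
```

Note how FTRL restarts from the cumulative loss $L_{t-1}$ every round, while MD takes an incremental step from the previous decision $w_t$; on unconstrained sets with a fixed quadratic regularizer the two coincide, but in general they differ.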
On each factor $K_j$, $j = 1, 2, \dots, d$, we can run any OLO algorithm separately and we denote by $\mathrm{Regret}_T^{(j)}(u_j)$ its regret with respect to $u_j \in K_j$. The overall regret with respect to any $u = (u_1, u_2, \dots, u_d) \in K$ can be written as

\[ \mathrm{Regret}_T(u) = \sum_{j=1}^d \mathrm{Regret}_T^{(j)}(u_j). \]

⁴ Note that this can happen even when $R_t$ is a restriction of a differentiable function defined on a superset of $K$. If $K$ is bounded and closed, $R_t$ fails to be differentiable at the boundary of $K$. If $K$ is a subset of an affine subspace of a dimension smaller than the dimension of $V$, then $R_t$ fails to be differentiable everywhere.

If the algorithm for each factor is scale-free, the overall algorithm is clearly scale-free as well. Hence, even if not explicitly mentioned in the text, any algorithm we present can be trivially transformed to a per-coordinate version.

3. SOLO FTRL

In this section, we introduce our first scale-free algorithm; it will be based on FTRL. The closest algorithm to a scale-free FTRL in the existing literature is the AdaGrad FTRL algorithm [24]. It uses a regularizer on each coordinate of the form

\[ R_t(w) = R(w) \sqrt{\delta + \sum_{i=1}^{t-1} \|\ell_i\|_*^2}. \]

This kind of regularizer would yield a scale-free algorithm only for $\delta = 0$. In fact, with this choice of $\delta$ it is easy to see that the predictions $w_t$ in line 4 of Algorithm 1 would be independent of the scaling of the $\ell_t$. Unfortunately, the regret bound in [24] becomes vacuous for such a setting in the unbounded case. In fact, it requires $\delta$ to be greater than $\|\ell_t\|_*$ for all time steps $t$, requiring knowledge of the future (see Theorem 5 in [24]). In other words, despite its name, AdaGrad FTRL is not fully adaptive to the norm of the gradient vectors. Similar considerations hold for FTRL-Proximal [22, 23]: the scale-free setting of the learning rate is valid only in the bounded case.

One simple approach would be to use a doubling trick on $\delta$ in order to estimate on the fly the maximum norm of the losses.
Note that a naive strategy would still fail, because the initial value of $\delta$ would have to be data-dependent in order to have a scale-free algorithm. Moreover, we would have to upper bound the regret in all the rounds where the norm of the current loss is bigger than the estimate. Finally, the algorithm would depend on an additional parameter, the "doubling" power. Hence, even if one could prove a regret bound, such a strategy would give the feeling that FTRL needs to be "fixed" in order to obtain a scale-free algorithm.

In the following, we propose a much simpler and better approach: to use Algorithm 1 with the regularizer

\[ R_t(w) = R(w) \sqrt{\sum_{i=1}^{t-1} \|\ell_i\|_*^2}, \tag{1} \]

where $R : K \to \mathbb{R}$ is any strongly convex function. Through a refined analysis, we show that this regularizer suffices to obtain an optimal regret bound for any decision set, bounded or unbounded. We call this variant the Scale-free Online Linear Optimization FTRL algorithm (SOLO FTRL). Our main result is Theorem 1 below, which is proven in Section 3.1.

The regularizer (1) does not uniquely define the FTRL minimizer $w_t = \operatorname{argmin}_{w \in K} R_t(w)$ when $\sqrt{\sum_{i=1}^{t-1} \|\ell_i\|_*^2}$ is zero. This happens if $\ell_1, \ell_2, \dots, \ell_{t-1}$ are all zero (and in particular for $t = 1$). In that case, we define $w_t = \operatorname{argmin}_{w \in K} R(w)$, which is consistent with $w_t = \lim_{a \to 0^+} \operatorname{argmin}_{w \in K} a R(w)$.

Theorem 1 (Regret of SOLO FTRL). Suppose $K \subseteq V$ is a non-empty closed convex set. Let $D = \sup_{u,v \in K} \|u - v\|$ be its diameter with respect to a norm $\|\cdot\|$. Suppose that the regularizer $R : K \to \mathbb{R}$ is a non-negative lower semi-continuous function that is $\lambda$-strongly convex with respect to $\|\cdot\|$. The regret of SOLO FTRL satisfies

\[ \mathrm{Regret}_T(u) \le \left( R(u) + \frac{2.75}{\lambda} \right) \sqrt{\sum_{t=1}^T \|\ell_t\|_*^2} + 3.5 \min\left\{ \frac{\sqrt{T-1}}{\lambda}, \, D \right\} \max_{t \le T} \|\ell_t\|_*. \]

When $K$ is unbounded, we pay a penalty that scales as $\max_{t \le T} \|\ell_t\|_* \sqrt{T}$, which has the same magnitude as the first term in the bound.
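For $K = \mathbb{R}^d$ and $R(w) = \frac{1}{2}\|w\|_2^2$, the SOLO FTRL prediction defined by regularizer (1) has the closed form $w_t = -L_{t-1} / \sqrt{\sum_{i=1}^{t-1} \|\ell_i\|_2^2}$, with $w_t = \operatorname{argmin}_w R(w) = 0$ while the sum is zero. The sketch below implements this special case and checks scale-freeness numerically; it is our own illustration, not code from the paper.

```python
import numpy as np

def solo_ftrl(losses):
    """SOLO FTRL on K = R^d with R(w) = ||w||_2^2 / 2, i.e. regularizer (1)
    R_t(w) = R(w) * sqrt(sum_{i<t} ||l_i||^2).  The FTRL minimizer is
    w_t = -L_{t-1} / sqrt(sum_{i<t} ||l_i||^2), and w_t = 0 while that sum
    is zero (the tie-breaking rule defined in the text)."""
    L = np.zeros(losses.shape[1])    # cumulative loss vector L_{t-1}
    sum_sq = 0.0                     # sum_{i<t} ||l_i||_2^2
    preds = []
    for l in losses:
        preds.append(-L / np.sqrt(sum_sq) if sum_sq > 0 else np.zeros_like(L))
        L, sum_sq = L + l, sum_sq + float(np.dot(l, l))
    return np.array(preds)

rng = np.random.default_rng(3)
losses = rng.standard_normal((100, 2))
preds = solo_ftrl(losses)

# Scale-freeness: multiplying every loss vector by a positive constant
# rescales both L_{t-1} and sqrt(sum ||l_i||^2) by the same factor, so the
# sequence of decisions is unchanged.
assert np.allclose(preds, solo_ftrl(42.0 * losses))

# Regret against a fixed competitor u, as a quick numerical sanity check.
u = np.array([0.1, -0.2])
regret = float(np.sum(preds * losses) - np.sum(losses @ u))
print(f"regret vs u: {regret:.3f}")
```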
On the other hand, when $K$ is bounded, the second term is a constant and we can choose the optimal multiple of the regularizer. We choose $R(w) = \lambda f(w)$, where $f$ is a 1-strongly convex function, and optimize over $\lambda$. The result of the optimization is Corollary 1.

Corollary 1 (Regret Bound for Bounded Decision Sets). Suppose $K \subseteq V$ is a non-empty bounded closed convex set. Suppose that $f : K \to \mathbb{R}$ is a non-negative lower semi-continuous function that is 1-strongly convex with respect to $\|\cdot\|$. SOLO FTRL with regularizer

\[ R(w) = \frac{f(w) \sqrt{2.75}}{\sqrt{\sup_{v \in K} f(v)}} \quad \text{satisfies} \quad \mathrm{Regret}_T \le 13.3 \sqrt{\sup_{v \in K} f(v) \sum_{t=1}^T \|\ell_t\|_*^2}. \]

Proof. Let $S = \sup_{v \in K} f(v)$. Theorem 1 applied to the regularizer $R(w) = \frac{c}{\sqrt{S}} f(w)$, together with Proposition 1 and the crude bound $\max_{t=1,2,\dots,T} \|\ell_t\|_* \le \sqrt{\sum_{t=1}^T \|\ell_t\|_*^2}$, gives

\[ \mathrm{Regret}_T \le \left( c + \frac{2.75}{c} + 3.5\sqrt{8} \right) \sqrt{S \sum_{t=1}^T \|\ell_t\|_*^2}. \]

We choose $c$ by minimizing $g(c) = c + \frac{2.75}{c} + 3.5\sqrt{8}$. Clearly, $g(c)$ has its minimum at $c = \sqrt{2.75}$ and minimal value $g(\sqrt{2.75}) = 2\sqrt{2.75} + 3.5\sqrt{8} \le 13.3$.

3.1. Proof of Regret Bound for SOLO FTRL

The proof of Theorem 1 relies on an inequality (Lemma 3). Related and weaker inequalities, like Lemma 4, were proved in [34] and [35]. The main property of this inequality is that, on the right-hand side, $C$ does not multiply the $\sqrt{\sum_{t=1}^T a_t^2}$ term.

Lemma 3 (Useful Inequality). Let $C, a_1, a_2, \dots, a_T \ge 0$. Then,

\[ \sum_{t=1}^T \min\left\{ \frac{a_t^2}{\sqrt{\sum_{i=1}^{t-1} a_i^2}}, \; C a_t \right\} \le 3.5 C \max_{t=1,2,\dots,T} a_t + 3.5 \sqrt{\sum_{t=1}^T a_t^2}. \]

Proof. Without loss of generality, we can assume that $a_t > 0$ for all $t$, since otherwise we can remove all $a_t = 0$ without affecting either side of the inequality. Let $M_t = \max\{a_1, a_2, \dots, a_t\}$ and $M_0 = 0$. We prove that, for any $\alpha > 1$,

\[ \min\left\{ \frac{a_t^2}{\sqrt{\sum_{i=1}^{t-1} a_i^2}}, \; C a_t \right\} \le 2\sqrt{1+\alpha^2} \left( \sqrt{\sum_{i=1}^t a_i^2} - \sqrt{\sum_{i=1}^{t-1} a_i^2} \right) + \frac{C\alpha}{\alpha-1} (M_t - M_{t-1}), \]

from which the inequality follows by summing over $t = 1, 2, \dots, T$ and choosing $\alpha = \sqrt{2}$. The inequality follows by case analysis.
If $a_t^2 \le \alpha^2 \sum_{i=1}^{t-1} a_i^2$, we have

\[ \min\left\{ \frac{a_t^2}{\sqrt{\sum_{i=1}^{t-1} a_i^2}}, \; C a_t \right\} \le \frac{a_t^2}{\sqrt{\sum_{i=1}^{t-1} a_i^2}} = \frac{a_t^2}{\frac{1}{\sqrt{1+\alpha^2}} \sqrt{\alpha^2 \sum_{i=1}^{t-1} a_i^2 + \sum_{i=1}^{t-1} a_i^2}} \]
\[ = \frac{a_t^2 \sqrt{1+\alpha^2}}{\sqrt{\alpha^2 \sum_{i=1}^{t-1} a_i^2 + \sum_{i=1}^{t-1} a_i^2}} \le \frac{a_t^2 \sqrt{1+\alpha^2}}{\sqrt{a_t^2 + \sum_{i=1}^{t-1} a_i^2}} \le 2\sqrt{1+\alpha^2} \left( \sqrt{\sum_{i=1}^t a_i^2} - \sqrt{\sum_{i=1}^{t-1} a_i^2} \right), \]

where we have used $x^2/\sqrt{x^2+y^2} \le 2\left( \sqrt{x^2+y^2} - \sqrt{y^2} \right)$ in the last step. On the other hand, if $a_t^2 > \alpha^2 \sum_{i=1}^{t-1} a_i^2$, we have

\[ \min\left\{ \frac{a_t^2}{\sqrt{\sum_{i=1}^{t-1} a_i^2}}, \; C a_t \right\} \le C a_t = \frac{C\alpha a_t - C a_t}{\alpha - 1} \le \frac{C}{\alpha - 1} \left( \alpha a_t - \alpha \sqrt{\sum_{i=1}^{t-1} a_i^2} \right) \]
\[ = \frac{C\alpha}{\alpha - 1} \left( a_t - \sqrt{\sum_{i=1}^{t-1} a_i^2} \right) \le \frac{C\alpha}{\alpha - 1} (a_t - M_{t-1}) = \frac{C\alpha}{\alpha - 1} (M_t - M_{t-1}), \]

where we have used that $a_t = M_t$ and $\sqrt{\sum_{i=1}^{t-1} a_i^2} \ge M_{t-1}$.
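Lemma 3 is easy to probe numerically. The following check (our own sanity test, not part of the proof) evaluates both sides of the inequality on random non-negative sequences, treating the ratio $a_t^2 / \sqrt{\sum_{i=1}^{t-1} a_i^2}$ as $+\infty$ when the partial sum is zero, so that the minimum picks $C a_t$ in that case (in particular at $t = 1$).

```python
import math
import random

def lemma3_sides(a, C):
    """Left- and right-hand sides of Lemma 3 for a sequence a_1..a_T, C >= 0."""
    lhs, prefix = 0.0, 0.0              # prefix = sum_{i<t} a_i^2
    for at in a:
        ratio = at ** 2 / math.sqrt(prefix) if prefix > 0 else float("inf")
        lhs += min(ratio, C * at)
        prefix += at ** 2
    rhs = 3.5 * C * max(a) + 3.5 * math.sqrt(sum(x * x for x in a))
    return lhs, rhs

random.seed(0)
for _ in range(2000):
    T = random.randint(1, 30)
    # Sequences spanning several orders of magnitude, since the lemma is
    # homogeneous in the scale of the a_t.
    a = [random.random() * 10 ** random.randint(-3, 3) for _ in range(T)]
    C = random.random() * 100
    lhs, rhs = lemma3_sides(a, C)
    assert lhs <= rhs + 1e-9, (a, C)
print("Lemma 3 verified on 2000 random instances")
```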