Online Learning: Stochastic, Constrained, and Smoothed Adversaries

Alexander Rakhlin
Department of Statistics, University of Pennsylvania
[email protected]

Karthik Sridharan
Toyota Technological Institute at Chicago
[email protected]

Ambuj Tewari
Computer Science Department, University of Texas at Austin
[email protected]

Abstract

Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that an exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable.

1 Introduction

In the papers [1, 10, 11], an array of tools has been developed to study the minimax value of diverse sequential problems under the worst-case assumption on Nature. In [10], many analogues of the classical notions from statistical learning theory have been developed, and these have been extended in [11] to performance measures well beyond the additive regret. The process of sequential symmetrization emerged as a key technique for dealing with complicated nested minimax expressions. In the worst-case model, the developed tools give a unified treatment to such sequential problems as regret minimization, calibration of forecasters, Blackwell's approachability, $\Phi$-regret, and more.

Learning theory has so far focused predominantly on the i.i.d. and the worst-case learning scenarios. Much less is known about learnability in between these two extremes. In the present paper, we make progress towards filling this gap by proposing a framework in which it is possible to variously restrict the behavior of Nature. By restricting Nature to play i.i.d. sequences, the results boil down to the classical notions of statistical learning in the supervised learning scenario. By not placing any restrictions on Nature, we recover the worst-case results of [10]. Between these two endpoints of the spectrum, particular assumptions on the adversary yield interesting bounds on the minimax value of the associated problem. Once again, the sequential symmetrization technique arises as the main tool for dealing with the minimax value, but the proofs require more care than in the i.i.d. or completely adversarial settings.

Adopting the game-theoretic language, we will think of the learner and the adversary as the two players of a zero-sum repeated game. The adversary's moves will be associated with "data", while the moves of the learner with a function or a parameter. This point of view is not new: game-theoretic minimax analysis has been at the heart of statistical decision theory for more than half a century (see [3]). In fact, there is a well-developed theory of minimax estimation when restrictions are put on either the choice of the adversary or the allowed estimators by the player. We are not aware of a similar theory for sequential problems with non-i.i.d. data.

The main contribution of this paper is the development of tools for the analysis of online scenarios where the adversary's moves are restricted in various ways. In addition to general theory, we consider several interesting scenarios which can be captured by our framework. All proofs are deferred to the appendix.
2 Value of the Game

Let $\mathcal F$ be a closed subset of a complete separable metric space, denoting the set of moves of the learner. Suppose the adversary chooses from the set $\mathcal X$. Consider the Online Learning Model, defined as a $T$-round interaction between the learner and the adversary: on round $t = 1, \ldots, T$, the learner chooses $f_t \in \mathcal F$, the adversary simultaneously picks $x_t \in \mathcal X$, and the learner suffers loss $f_t(x_t)$. The goal of the learner is to minimize regret, defined as $\sum_{t=1}^T f_t(x_t) - \inf_{f \in \mathcal F} \sum_{t=1}^T f(x_t)$. It is a standard fact that simultaneity of the choices can be formalized by the first player choosing a mixed strategy; the second player then picks an action based on this mixed strategy, but not on its realization. We therefore consider randomized learners who predict a distribution $q_t \in \mathcal Q$ on every round, where $\mathcal Q$ is the set of probability distributions on $\mathcal F$, assumed to be weakly compact. The set of probability distributions on $\mathcal X$ (mixed strategies of the adversary) is denoted by $\mathcal P$.

We would like to capture the fact that sequences $(x_1, \ldots, x_T)$ cannot be arbitrary. This is achieved by defining restrictions on the adversary, that is, subsets of "allowed" distributions for each round. These restrictions limit the scope of available mixed strategies for the adversary.

Definition 1. A restriction $\mathcal P_{1:T}$ on the adversary is a sequence $\mathcal P_1, \ldots, \mathcal P_T$ of mappings $\mathcal P_t : \mathcal X^{t-1} \mapsto 2^{\mathcal P}$ such that $\mathcal P_t(x_{1:t-1})$ is a convex subset of $\mathcal P$ for any $x_{1:t-1} \in \mathcal X^{t-1}$.

Note that the restrictions depend on the past moves of the adversary, but not on those of the player. We will write $\mathcal P_t$ instead of $\mathcal P_t(x_{1:t-1})$ when $x_{1:t-1}$ is clearly defined. Using the notion of restrictions, we can give names to several types of adversaries that we will study in this paper.

(1) A worst-case adversary is defined by vacuous restrictions $\mathcal P_t(x_{1:t-1}) = \mathcal P$. That is, any mixed strategy is available to the adversary, including any deterministic point distribution.

(2) A constrained adversary is defined by $\mathcal P_t(x_{1:t-1})$ being the set of all distributions supported on the set $\{x \in \mathcal X : C_t(x_1, \ldots, x_{t-1}, x) = 1\}$ for some deterministic binary-valued constraint $C_t$. The deterministic constraint can, for instance, ensure that the length of the path determined by the moves $x_1, \ldots, x_t$ stays below the allowed budget.

(3) A smoothed adversary picks the worst-case sequence, which then gets corrupted by i.i.d. noise. Equivalently, we can view this as a restriction on the adversary, who chooses the "center" (or a parameter) of the noise distribution.

Using techniques developed in this paper, we can also study the following adversaries (omitted due to lack of space):

(4) A hybrid adversary in the supervised learning game picks the worst-case label $y_t$, but is forced to draw the $x_t$-variable from a fixed distribution [6].

(5) An i.i.d. adversary is defined by a time-invariant restriction $\mathcal P_t(x_{1:t-1}) = \{p\}$ for every $t$ and some $p \in \mathcal P$.

For the given restrictions $\mathcal P_{1:T}$, we define the value of the game as
$$\mathcal V_T(\mathcal P_{1:T}) \triangleq \inf_{q_1 \in \mathcal Q} \sup_{p_1 \in \mathcal P_1} \mathbb E_{f_1 \sim q_1, x_1 \sim p_1} \; \inf_{q_2 \in \mathcal Q} \sup_{p_2 \in \mathcal P_2} \mathbb E_{f_2 \sim q_2, x_2 \sim p_2} \cdots \inf_{q_T \in \mathcal Q} \sup_{p_T \in \mathcal P_T} \mathbb E_{f_T \sim q_T, x_T \sim p_T} \left[ \sum_{t=1}^T f_t(x_t) - \inf_{f \in \mathcal F} \sum_{t=1}^T f(x_t) \right] \quad (1)$$
where $f_t$ has distribution $q_t$ and $x_t$ has distribution $p_t$. As in [10], the adversary is adaptive, that is, chooses $p_t$ based on the history of moves $f_{1:t-1}$ and $x_{1:t-1}$. At this point, the only difference from the setup of [10] is in the restrictions $\mathcal P_t$ on the adversary. Because these restrictions might not allow point distributions, the suprema over $p_t$'s in (1) cannot be equivalently written as suprema over $x_t$'s.

A word about the notation. In [10], the value of the game is written as $\mathcal V_T(\mathcal F)$, signifying that the main object of study is $\mathcal F$. In [11], it is written as $\mathcal V_T(\ell, \Phi_T)$ since the focus is on the complexity of the set of transformations $\Phi_T$ and the payoff mapping $\ell$. In the present paper, the main focus is indeed on the restrictions on the adversary, justifying our choice $\mathcal V_T(\mathcal P_{1:T})$ for the notation.
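As an illustration (not part of the original development), the adversary types above are easy to phrase as predicates on mixed strategies. The minimal sketch below uses a finite move set $\mathcal X = \{0, \ldots, k-1\}$ with mixed strategies represented as probability vectors; the finite set, the function names, and this representation are our own assumptions for the demo.

```python
import numpy as np

# A sketch of restrictions over a finite move set X = {0, ..., k-1}, with a
# mixed strategy represented as a probability vector p of length k. Each
# restriction is a predicate deciding whether p is allowed given the history.
k = 5

def worst_case(history, p):
    # (1) Worst-case adversary: every mixed strategy is allowed.
    return True

def iid(history, p, p_fixed):
    # (5) I.i.d. adversary: only one fixed distribution is allowed.
    return np.allclose(p, p_fixed)

def constrained(history, p, C):
    # (2) Constrained adversary: p must be supported on
    # {x : C(x_1, ..., x_{t-1}, x) = 1}.
    return all(p[x] == 0.0 or C(history, x) == 1 for x in range(k))

# Example constraint: the adversary may not repeat its previous move.
C = lambda h, x: 1 if (not h or x != h[-1]) else 0
p = np.full(k, 1.0 / k)
print(worst_case([2], p), constrained([2], p, C))  # True False
```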
The first step is to apply the minimax theorem. To this end, we verify the necessary conditions. Our assumption that $\mathcal F$ is a closed subset of a complete separable metric space implies that $\mathcal Q$ is tight, and Prokhorov's theorem states that compactness of $\mathcal Q$ under the weak topology is equivalent to tightness [14]. Compactness under the weak topology allows us to proceed as in [10]. Additionally, we require that the restriction sets are compact and convex.

Theorem 1. Let $\mathcal F$ and $\mathcal X$ be the sets of moves for the two players, satisfying the necessary conditions for the minimax theorem to hold. Let $\mathcal P_{1:T}$ be the restrictions, and assume that for any $x_{1:t-1}$, $\mathcal P_t(x_{1:t-1})$ satisfies the necessary conditions for the minimax theorem to hold. Then
$$\mathcal V_T(\mathcal P_{1:T}) = \sup_{p_1 \in \mathcal P_1} \mathbb E_{x_1 \sim p_1} \cdots \sup_{p_T \in \mathcal P_T} \mathbb E_{x_T \sim p_T} \left[ \sum_{t=1}^T \inf_{f_t \in \mathcal F} \mathbb E_{x_t \sim p_t}[f_t(x_t)] - \inf_{f \in \mathcal F} \sum_{t=1}^T f(x_t) \right]. \quad (2)$$

The nested sequence of suprema and expected values in Theorem 1 can be re-written succinctly as
$$\mathcal V_T(\mathcal P_{1:T}) = \sup_{p \in \mathbf P} \mathbb E_{x_1 \sim p_1} \mathbb E_{x_2 \sim p_2(\cdot|x_1)} \cdots \mathbb E_{x_T \sim p_T(\cdot|x_{1:T-1})} \left[ \sum_{t=1}^T \inf_{f_t \in \mathcal F} \mathbb E_{x_t \sim p_t}[f_t(x_t)] - \inf_{f \in \mathcal F} \sum_{t=1}^T f(x_t) \right]$$
$$= \sup_{p \in \mathbf P} \mathbb E \left[ \sum_{t=1}^T \inf_{f_t \in \mathcal F} \mathbb E_{x_t \sim p_t}[f_t(x_t)] - \inf_{f \in \mathcal F} \sum_{t=1}^T f(x_t) \right] \quad (3)$$
where the supremum is over all joint distributions $p$ over sequences, such that $p$ satisfies the restrictions as described below. Given a joint distribution $p$ on sequences $(x_1, \ldots, x_T) \in \mathcal X^T$, we denote the associated conditional distributions by $p_t(\cdot|x_{1:t-1})$. We can think of the choice $p$ as a sequence of oblivious strategies $\{p_t : \mathcal X^{t-1} \mapsto \mathcal P\}_{t=1}^T$, mapping the prefix $x_{1:t-1}$ to a conditional distribution $p_t(\cdot|x_{1:t-1}) \in \mathcal P_t(x_{1:t-1})$. We will indeed call $p$ a "joint distribution" or an "oblivious strategy" interchangeably. We say that a joint distribution $p$ satisfies the restrictions if for any $t$ and any $x_{1:t-1} \in \mathcal X^{t-1}$, $p_t(\cdot|x_{1:t-1}) \in \mathcal P_t(x_{1:t-1})$. The set of all joint distributions satisfying the restrictions is denoted by $\mathbf P$. We note that Theorem 1 cannot be deduced immediately from the analogous result in [10], as it is not clear how the per-round restrictions on the adversary come into play after applying the minimax theorem. Nevertheless, it is comforting that the restrictions directly translate into the set $\mathbf P$ of oblivious strategies satisfying the restrictions.

Before continuing with our goal of upper-bounding the value of the game, we state the following interesting facts.

Proposition 2. There is an oblivious minimax optimal strategy for the adversary, and there is a corresponding minimax optimal strategy for the player that does not depend on its own moves.

The latter statement of the proposition is folklore for worst-case learning, yet we have not seen a proof of it in the literature. The proposition holds for all online learning settings with legal restrictions $\mathcal P_{1:T}$, encompassing also the no-restrictions setting of worst-case online learning [10]. The result crucially relies on the fact that the objective is external regret.
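The oblivious-strategy viewpoint is easy to mirror in code: a joint distribution $p$ is just a rule mapping each prefix $x_{1:t-1}$ to a conditional distribution, and sampling from it is a single forward pass. The sketch below is purely illustrative; the lazy random-walk conditional and all names are our own toy choices, not a construction from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
k, T = 5, 8  # finite move set {0, ..., k-1} and horizon

def p_t(history):
    # An oblivious strategy: p_t(.|x_{1:t-1}) depends only on the prefix.
    # Here, a lazy random walk on {0, ..., k-1}; uniform on the first round.
    p = np.zeros(k)
    if not history:
        p += 1.0 / k
    else:
        for x in {max(history[-1] - 1, 0), history[-1], min(history[-1] + 1, k - 1)}:
            p[x] = 1.0
        p /= p.sum()
    return p

# Sampling x_1, ..., x_T from the induced joint distribution p.
history = []
for t in range(T):
    history.append(int(rng.choice(k, p=p_t(history))))
print(history)
```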
3 Symmetrization and Random Averages

Theorem 1 is a useful representation of the value of the game. As the next step, we upper bound it with an expression which is easier to study. Such an expression is obtained by introducing Rademacher random variables. This process can be termed sequential symmetrization and has been exploited in [1, 10, 11]. The restrictions $\mathcal P_t$, however, make sequential symmetrization considerably more involved than in the papers cited above. The main difficulty arises from the fact that the set $\mathcal P_t(x_{1:t-1})$ depends on the sequence $x_{1:t-1}$, and symmetrization (that is, replacement of $x_s$ with $x'_s$) has to be done with care as it affects this dependence. Roughly speaking, in the process of symmetrization, a tangent sequence $x'_1, x'_2, \ldots$ is introduced such that $x_t$ and $x'_t$ are independent and identically distributed given "the past". However, "the past" is itself an interleaving choice of the original sequence and the tangent sequence.

Define the "selector function" $\chi : \mathcal X \times \mathcal X \times \{\pm 1\} \mapsto \mathcal X$ by $\chi(x, x', \epsilon) = x'$ if $\epsilon = 1$ and $\chi(x, x', \epsilon) = x$ if $\epsilon = -1$. When $x_t$ and $x'_t$ are understood from the context, we will use the shorthand $\chi_t(\epsilon) := \chi(x_t, x'_t, \epsilon)$. In other words, $\chi_t$ selects between $x_t$ and $x'_t$ depending on the sign of $\epsilon$.

Throughout the paper, we deal with binary trees, which arise from symmetrization [10]. Given some set $\mathcal Z$, a $\mathcal Z$-valued tree of depth $T$ is a sequence $\mathbf z = (z_1, \ldots, z_T)$ of $T$ mappings $z_i : \{\pm 1\}^{i-1} \mapsto \mathcal Z$. The $T$-tuple $\epsilon = (\epsilon_1, \ldots, \epsilon_T) \in \{\pm 1\}^T$ defines a path. For brevity, we write $z_t(\epsilon)$ instead of $z_t(\epsilon_{1:t-1})$.

Given a joint distribution $p$, consider the "$(\mathcal X \times \mathcal X)^{T-1} \mapsto \mathcal P(\mathcal X \times \mathcal X)$"-valued probability tree $\rho = (\rho_1, \ldots, \rho_T)$ defined by
$$\rho_t(\epsilon_{1:t-1})\big((x_1, x'_1), \ldots, (x_{T-1}, x'_{T-1})\big) = \big(p_t(\cdot|\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1})),\; p_t(\cdot|\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}))\big).$$
In other words, the values of the mappings $\rho_t(\epsilon)$ are products of conditional distributions, where conditioning is done with respect to a sequence made from $x_s$ and $x'_s$ depending on the sign of $\epsilon_s$. We note that the difficulty in intermixing the $x$ and $x'$ sequences does not arise in i.i.d. or worst-case symmetrization. However, in between these extremes the notational complexity seems to be unavoidable if we are to employ symmetrization and obtain a version of Rademacher complexity.

As an example, consider the "left-most" path $\epsilon = -\mathbf 1$ in a binary tree of depth $T$, where $\mathbf 1 = (1, \ldots, 1)$ is a $T$-dimensional vector of ones. Then all the selectors $\chi(x_t, x'_t, \epsilon_t)$ choose the sequence $x_1, \ldots, x_T$. The probability tree $\rho$ on the "left-most" path is, therefore, defined by the conditional distributions $p_t(\cdot|x_{1:t-1})$; on the path $\epsilon = \mathbf 1$, the conditional distributions are $p_t(\cdot|x'_{1:t-1})$.

Slightly abusing the notation, we will write $\rho_t(\epsilon)\big((x_1, x'_1), \ldots, (x_{t-1}, x'_{t-1})\big)$ for the probability tree, since $\rho_t$ clearly depends only on the prefix up to time $t-1$. Throughout the paper, it will be understood that the tree $\rho$ is obtained from $p$ as described above. Since all the conditional distributions of $p$ satisfy the restrictions, so do the corresponding distributions of the probability tree $\rho$. By saying that $\rho$ satisfies the restrictions we then mean that $p \in \mathbf P$.

Sampling of a pair of $\mathcal X$-valued trees from $\rho$, written as $(\mathbf x, \mathbf x') \sim \rho$, is defined as the following recursive process: for any $\epsilon \in \{\pm 1\}^T$, $(x_1(\epsilon), x'_1(\epsilon)) \sim \rho_1(\epsilon)$ and
$$(x_t(\epsilon), x'_t(\epsilon)) \sim \rho_t(\epsilon)\big((x_1(\epsilon), x'_1(\epsilon)), \ldots, (x_{t-1}(\epsilon), x'_{t-1}(\epsilon))\big) \quad \text{for } 2 \le t \le T. \quad (4)$$

To gain a better understanding of the sampling process, consider the first few levels of the tree. The roots $x_1, x'_1$ of the trees $\mathbf x, \mathbf x'$ are sampled from $p_1$, the conditional distribution for $t = 1$ given by $p$. Next, say, $\epsilon_1 = +1$. Then the "right" children of $x_1$ and $x'_1$ are sampled via $x_2(+1), x'_2(+1) \sim p_2(\cdot|x'_1)$, since $\chi_1(+1)$ selects $x'_1$. On the other hand, the "left" children $x_2(-1), x'_2(-1)$ are both distributed according to $p_2(\cdot|x_1)$. Now suppose $\epsilon_1 = +1$ and $\epsilon_2 = -1$. Then $x_3(+1,-1), x'_3(+1,-1)$ are both sampled from $p_3(\cdot|x'_1, x_2(+1))$.
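The recursive process (4) can be made concrete with a short program. The sketch below builds the pair of trees $(\mathbf x, \mathbf x')$ level by level: at a node with sign prefix $\epsilon_{1:t-1}$, both $x_t(\epsilon)$ and $x'_t(\epsilon)$ are drawn i.i.d. from $p_t$ conditioned on the interleaved past $\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1})$. It reuses the toy conditional from the previous sketch; the dictionary representation of trees is our own convention.

```python
import numpy as np

rng = np.random.default_rng(1)
k, T = 5, 3

def p_t(history):
    # Same toy lazy-random-walk conditional as in the previous sketch.
    p = np.zeros(k)
    if not history:
        p += 1.0 / k
    else:
        for x in {max(history[-1] - 1, 0), history[-1], min(history[-1] + 1, k - 1)}:
            p[x] = 1.0
        p /= p.sum()
    return p

# Trees are dictionaries mapping a sign prefix (eps_1, ..., eps_{t-1}) to the
# value x_t(eps) at the corresponding node of depth t.
x_tree, xp_tree = {}, {}

def sample(prefix, past):
    t = len(prefix) + 1
    p = p_t(past)
    x = int(rng.choice(k, p=p))   # x_t(eps) and x'_t(eps) are i.i.d. given
    xp = int(rng.choice(k, p=p))  # "the past", the chi-interleaved prefix
    x_tree[prefix], xp_tree[prefix] = x, xp
    if t < T:
        sample(prefix + (-1,), past + [x])   # chi_t(-1) selects x_t
        sample(prefix + (+1,), past + [xp])  # chi_t(+1) selects x'_t

sample((), [])
print(x_tree)  # root at x_tree[()], children at x_tree[(-1,)] and x_tree[(+1,)]
```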
The proof of Theorem 3 reveals why such an intricate conditional structure arises, and Proposition 5 below shows that this structure greatly simplifies for i.i.d. and worst-case situations. Nevertheless, the process described above allows us to define a unified notion of Rademacher complexity for the spectrum of assumptions between the two extremes.

Definition 2. The distribution-dependent sequential Rademacher complexity of a function class $\mathcal F \subseteq \mathbb R^{\mathcal X}$ is defined as
$$\mathfrak R_T(\mathcal F, p) \triangleq \mathbb E_{(\mathbf x, \mathbf x') \sim \rho}\, \mathbb E_{\epsilon} \left[ \sup_{f \in \mathcal F} \sum_{t=1}^T \epsilon_t f(x_t(\epsilon)) \right]$$
where $\epsilon = (\epsilon_1, \ldots, \epsilon_T)$ is a sequence of i.i.d. Rademacher random variables and $\rho$ is the probability tree associated with $p$.

We now prove an upper bound on the value $\mathcal V_T(\mathcal P_{1:T})$ of the game in terms of this distribution-dependent sequential Rademacher complexity. The result cannot be deduced directly from [10], and it greatly increases the scope of problems whose learnability can now be studied in a unified manner.

Theorem 3. The minimax value is bounded as
$$\mathcal V_T(\mathcal P_{1:T}) \le 2 \sup_{p \in \mathbf P} \mathfrak R_T(\mathcal F, p). \quad (5)$$
More generally, for any measurable function $M_t$ such that $M_t(p, f, \mathbf x, \mathbf x', \epsilon) = M_t(p, f, \mathbf x, \mathbf x', -\epsilon)$,
$$\mathcal V_T(\mathcal P_{1:T}) \le 2 \sup_{p \in \mathbf P} \mathbb E_{(\mathbf x, \mathbf x') \sim \rho}\, \mathbb E_{\epsilon} \left[ \sup_{f \in \mathcal F} \sum_{t=1}^T \epsilon_t \big( f(x_t(\epsilon)) - M_t(p, f, \mathbf x, \mathbf x', \epsilon) \big) \right].$$

The following corollary provides a natural "centered" version of the distribution-dependent Rademacher complexity. That is, the complexity can be measured by relative shifts in the adversarial moves.

Corollary 4. For the game with restrictions $\mathcal P_{1:T}$,
$$\mathcal V_T(\mathcal P_{1:T}) \le 2 \sup_{p \in \mathbf P} \mathbb E_{(\mathbf x, \mathbf x') \sim \rho}\, \mathbb E_{\epsilon} \left[ \sup_{f \in \mathcal F} \sum_{t=1}^T \epsilon_t \big( f(x_t(\epsilon)) - \mathbb E_{t-1} f(x_t(\epsilon)) \big) \right]$$
where $\mathbb E_{t-1}$ denotes the conditional expectation of $x_t(\epsilon)$.

Example 1. Suppose $\mathcal F$ is a unit ball in a Banach space and $f(x) = \langle f, x \rangle$. Then
$$\mathcal V_T(\mathcal P_{1:T}) \le 2 \sup_{p \in \mathbf P} \mathbb E_{(\mathbf x, \mathbf x') \sim \rho}\, \mathbb E_{\epsilon} \left\| \sum_{t=1}^T \epsilon_t \big( x_t(\epsilon) - \mathbb E_{t-1} x_t(\epsilon) \big) \right\|.$$
Suppose the adversary plays a simple random walk (e.g., $p_t(x|x_1, \ldots, x_{t-1}) = p_t(x|x_{t-1})$ is uniform on a unit sphere). For simplicity, suppose this is the only strategy allowed by the set $\mathbf P$. Then $x_t(\epsilon) - \mathbb E_{t-1} x_t(\epsilon)$ are independent increments when conditioned on the history. Further, the increments do not depend on $\epsilon_t$. Thus, $\mathcal V_T(\mathcal P_{1:T}) \le 2\, \mathbb E \left\| \sum_{t=1}^T Y_t \right\|$ where $\{Y_t\}$ is the corresponding random walk.
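For intuition, the bound of Example 1 is easy to probe numerically. The Monte Carlo sketch below estimates $\mathbb E \|\sum_t Y_t\|_2$ for increments uniform on the unit sphere; since $\mathbb E \|\sum_t Y_t\|_2^2 = T$, the estimate sits just below $\sqrt T$, recovering the familiar $O(\sqrt T)$ regret rate. The sample sizes and the Gaussian-normalization trick for sampling from the sphere are our own choices.

```python
import numpy as np

# Monte Carlo estimate of E || sum_t Y_t ||_2 for a random walk with
# increments Y_t uniform on the unit sphere in R^d (cf. Example 1).
rng = np.random.default_rng(2)
d, T, n_mc = 4, 100, 2000

total = 0.0
for _ in range(n_mc):
    y = rng.normal(size=(T, d))
    y /= np.linalg.norm(y, axis=1, keepdims=True)  # uniform on the sphere
    total += np.linalg.norm(y.sum(axis=0))
print(total / n_mc, np.sqrt(T))  # estimate vs. the sqrt(T) = 10 benchmark
```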
We now show that the distribution-dependent sequential Rademacher complexity for i.i.d. data is precisely the classical Rademacher complexity, and further show that the distribution-dependent sequential Rademacher complexity is always upper bounded by the worst-case sequential Rademacher complexity defined in [10].

Proposition 5. First, consider the i.i.d. restrictions $\mathcal P_t = \{p\}$ for all $t$, where $p$ is some fixed distribution on $\mathcal X$, and let $\rho$ be the process associated with the joint distribution $p = p^T$. Then
$$\mathfrak R_T(\mathcal F, p) = \mathcal R_T(\mathcal F, p), \quad \text{where} \quad \mathcal R_T(\mathcal F, p) \triangleq \mathbb E_{x_1, \ldots, x_T \sim p}\, \mathbb E_{\epsilon} \left[ \sup_{f \in \mathcal F} \sum_{t=1}^T \epsilon_t f(x_t) \right] \quad (6)$$
is the classical Rademacher complexity. Second, for any joint distribution $p$,
$$\mathfrak R_T(\mathcal F, p) \le \mathfrak R_T(\mathcal F), \quad \text{where} \quad \mathfrak R_T(\mathcal F) \triangleq \sup_{\mathbf x} \mathbb E_{\epsilon} \left[ \sup_{f \in \mathcal F} \sum_{t=1}^T \epsilon_t f(x_t(\epsilon)) \right] \quad (7)$$
is the sequential Rademacher complexity defined in [10].

In the case of hybrid learning, the adversary chooses a sequence of pairs $(x_t, y_t)$ where the instances $x_t$ are i.i.d. but the labels $y_t$ are fully adversarial. The distribution-dependent Rademacher complexity in such a hybrid case can be upper bounded by a very natural quantity: a random average where the expectation is taken over the $x_t$'s and a supremum is taken over $\mathcal Y$-valued trees. So, the distribution-dependent Rademacher complexity itself becomes a hybrid between the classical Rademacher complexity and the worst-case sequential Rademacher complexity. For more details, see Lemma 17 in the Appendix as another example of an analysis of the distribution-dependent sequential Rademacher complexity.

The distribution-dependent sequential Rademacher complexity enjoys many of the nice properties satisfied by both the classical and worst-case Rademacher complexities. As shown in [10], these properties are handy tools for proving upper bounds on the value in various examples. We have: (a) if $\mathcal F \subset \mathcal G$, then $\mathfrak R(\mathcal F, p) \le \mathfrak R(\mathcal G, p)$; (b) $\mathfrak R(\mathcal F, p) = \mathfrak R(\mathrm{conv}(\mathcal F), p)$; (c) $\mathfrak R(c\mathcal F, p) = |c|\, \mathfrak R(\mathcal F, p)$ for all $c \in \mathbb R$; (d) for any $h$, $\mathfrak R(\mathcal F + h, p) = \mathfrak R(\mathcal F, p)$, where $\mathcal F + h = \{f + h : f \in \mathcal F\}$.

In addition to the above properties, upper bounds on $\mathfrak R(\mathcal F, p)$ can be derived via the sequential covering numbers defined in [10]. This notion of a cover captures the sequential complexity of a function class on a given $\mathcal X$-valued tree $\mathbf x$. One can then show an analogue of the Dudley integral bound, where the complexity is averaged with respect to the underlying process $(\mathbf x, \mathbf x') \sim \rho$.

4 Application: Constrained Adversaries

In this section, we consider adversaries who are deterministically constrained in the sequences of actions they can play. It is often useful to consider scenarios where the adversary is worst-case, yet has some budget or constraint to satisfy while picking the actions. Examples of such scenarios include, for instance, games where the adversary is constrained to make moves that are close in some fashion to the previous move, linear games with bounded variance, and so on. Below we formulate such games quite generally through arbitrary constraints that the adversary has to satisfy on each round, and we easily derive several results to illustrate the versatility of the developed framework.
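Before the formal statements, here is a small sketch (ours, not the paper's) of how a deterministic constraint $C_t$ carves out the per-round move set $\mathcal X_t(x_{1:t-1})$; the $\delta$-ball constraint used here anticipates Proposition 9 below, and the grid discretization of $\mathcal X = [0,1]$ is purely for illustration.

```python
# The per-round move set X_t(x_{1:t-1}) = {x : C_t(x_1, ..., x_{t-1}, x) = 1}
# induced by a "slowly changing" constraint: each move must lie within
# delta of the previous one.
delta = 0.35

def C_t(history, x):
    return 1 if (not history or abs(x - history[-1]) <= delta) else 0

grid = [i / 10 for i in range(11)]           # a discretized X = [0, 1]
history = [0.5, 0.6]
print([x for x in grid if C_t(history, x)])  # [0.3, 0.4, ..., 0.9]
```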
For a $T$-round game, consider an adversary who is only allowed to play sequences $x_1, \ldots, x_T$ such that at round $t$ the constraint $C_t(x_1, \ldots, x_t) = 1$ is satisfied, where $C_t : \mathcal X^t \mapsto \{0, 1\}$ represents the constraint on the sequence played so far. The constrained adversary can be viewed as a stochastic adversary with restrictions on the conditional distribution at time $t$ given by the set of all Borel distributions on the set $\mathcal X_t(x_{1:t-1}) \triangleq \{x \in \mathcal X : C_t(x_1, \ldots, x_{t-1}, x) = 1\}$. Since this set includes all point distributions on each $x \in \mathcal X_t$, the sequential complexity simplifies in a way similar to worst-case adversaries. We write $\mathcal V_T(C_{1:T})$ for the value of the game with the given constraints. Now, assume that for any $x_{1:t-1}$, the set of all distributions on $\mathcal X_t(x_{1:t-1})$ is weakly compact in a way similar to compactness of $\mathcal P$. That is, the $\mathcal P_t(x_{1:t-1})$ satisfy the necessary conditions for the minimax theorem to hold. We have the following corollaries of Theorems 1 and 3.

Corollary 6. Let $\mathcal F$ and $\mathcal X$ be the sets of moves for the two players, satisfying the necessary conditions for the minimax theorem to hold. Let $\{C_t : \mathcal X^t \mapsto \{0, 1\}\}_{t=1}^T$ be the constraints. Then
$$\mathcal V_T(C_{1:T}) = \sup_{p \in \mathbf P} \mathbb E \left[ \sum_{t=1}^T \inf_{f_t \in \mathcal F} \mathbb E_{x_t \sim p_t}[f_t(x_t)] - \inf_{f \in \mathcal F} \sum_{t=1}^T f(x_t) \right] \quad (8)$$
where $p$ ranges over all distributions over sequences $(x_1, \ldots, x_T)$ such that $\forall t$, $C_t(x_{1:t}) = 1$.

Corollary 7. Let $\mathcal T$ be a set of pairs $(\mathbf x, \mathbf x')$ of $\mathcal X$-valued trees with the property that for any $\epsilon \in \{\pm 1\}^T$ and any $t \in [T]$,
$$C_t(\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}), x_t(\epsilon)) = 1 \quad \text{and} \quad C_t(\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}), x'_t(\epsilon)) = 1.$$
The minimax value is bounded as
$$\mathcal V_T(C_{1:T}) \le 2 \sup_{(\mathbf x, \mathbf x') \in \mathcal T} \mathbb E_{\epsilon} \left[ \sup_{f \in \mathcal F} \sum_{t=1}^T \epsilon_t f(x_t(\epsilon)) \right].$$
More generally, for any measurable function $M_t$ such that $M_t(f, \mathbf x, \mathbf x', \epsilon) = M_t(f, \mathbf x, \mathbf x', -\epsilon)$,
$$\mathcal V_T(C_{1:T}) \le 2 \sup_{(\mathbf x, \mathbf x') \in \mathcal T} \mathbb E_{\epsilon} \left[ \sup_{f \in \mathcal F} \sum_{t=1}^T \epsilon_t \big( f(x_t(\epsilon)) - M_t(f, \mathbf x, \mathbf x', \epsilon) \big) \right].$$

Armed with these results, we can recover and extend some known results on online learning against budgeted adversaries. The first result says that if the adversary is not allowed to move by more than $\sigma_t$ away from its previous average of decisions, the player has a strategy to exploit this fact and obtain lower regret. For the $\ell_2$-norm, such "total variation" bounds have been achieved in [4] up to a $\log T$ factor. Our analysis seamlessly incorporates variance measured in arbitrary norms, not just $\ell_2$. We emphasize that such certificates of learnability are not possible with the analysis of [10].

Proposition 8 (Variance Bound). Consider the online linear optimization setting with $\mathcal F = \{f : \Psi(f) \le R^2\}$ for a $\lambda$-strongly convex function $\Psi : \mathcal F \mapsto \mathbb R_+$ on $\mathcal F$, and $\mathcal X = \{x : \|x\|_* \le 1\}$. Let $f(x) = \langle f, x \rangle$ for any $f \in \mathcal F$ and $x \in \mathcal X$. Consider the sequence of constraints $\{C_t\}_{t=1}^T$ given by $C_t(x_1, \ldots, x_{t-1}, x) = 1$ if $\left\| x - \frac{1}{t-1} \sum_{\tau=1}^{t-1} x_\tau \right\|_* \le \sigma_t$ and $0$ otherwise. Then
$$\mathcal V_T(C_{1:T}) \le 2\sqrt{2}\, R \sqrt{\lambda^{-1} \sum_{t=1}^T \sigma_t^2}.$$
In particular, we obtain the following $\ell_2$ variance bound. Consider the case when $\Psi : \mathcal F \mapsto \mathbb R_+$ is given by $\Psi(f) = \frac{1}{2}\|f\|_2^2$, $\mathcal F = \{f : \|f\|_2 \le 1\}$ and $\mathcal X = \{x : \|x\|_2 \le 1\}$. Consider the constrained game where the move $x_t$ played by the adversary at time $t$ satisfies $\left\| x_t - \frac{1}{t-1} \sum_{\tau=1}^{t-1} x_\tau \right\|_2 \le \sigma_t$. In this case we can conclude that $\mathcal V_T(C_{1:T}) \le 2\sqrt{2 \sum_{t=1}^T \sigma_t^2}$. We can also derive a variance bound over the simplex. Let $\Psi(f) = \sum_{i=1}^d f_i \log(d f_i)$ be defined over the $d$-simplex $\mathcal F$, and $\mathcal X = \{x : \|x\|_\infty \le 1\}$. Consider the constrained game where the move $x_t$ played by the adversary at time $t$ satisfies $\max_{j \in [d]} \left| x_t[j] - \frac{1}{t-1} \sum_{\tau=1}^{t-1} x_\tau[j] \right| \le \sigma_t$. For any $f \in \mathcal F$, $\Psi(f) \le \log(d)$, and so we conclude that $\mathcal V_T(C_{1:T}) \le 2\sqrt{2 \log(d) \sum_{t=1}^T \sigma_t^2}$.
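The $\ell_2$ instantiation of Proposition 8 is simple to sanity-check numerically. The sketch below builds a toy sequence that provably satisfies the running-mean constraint (each move is a fixed center plus a perturbation of norm $\sigma/2$, so by the triangle inequality its distance to any average of previous moves is at most $\sigma$) and prints the corresponding bound $2\sqrt{2\sum_t \sigma_t^2}$. The construction of the sequence is our own example, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, sigma = 3, 200, 0.1

# Moves x_t = c + u_t with ||u_t||_2 = sigma/2 satisfy the constraint
# ||x_t - (1/(t-1)) sum_{tau<t} x_tau||_2 <= sigma and stay in the unit ball.
c = np.zeros(d); c[0] = 0.5
xs = []
for t in range(T):
    u = rng.normal(size=d)
    xs.append(c + (sigma / 2) * u / np.linalg.norm(u))

ok = all(np.linalg.norm(xs[t] - np.mean(xs[:t], axis=0)) <= sigma
         for t in range(1, T))
bound = 2 * np.sqrt(2 * sigma ** 2 * T)   # 2 sqrt(2 sum_t sigma_t^2)
print(ok, bound)                           # True 4.0
```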
The next proposition gives a bound whenever the adversary is constrained to choose his decision from a small ball around the previous decision.

Proposition 9 (Slowly-Changing Decisions). Consider the online linear optimization setting where the adversary's move at any time is close to the move during the previous time step. Let $\mathcal F = \{f : \Psi(f) \le R^2\}$, where $\Psi : \mathcal F \mapsto \mathbb R_+$ is a $\lambda$-strongly convex function on $\mathcal F$, and $\mathcal X = \{x : \|x\|_* \le B\}$. Let $f(x) = \langle f, x \rangle$ for any $f \in \mathcal F$ and $x \in \mathcal X$. Consider the sequence of constraints $\{C_t\}_{t=1}^T$ given by $C_t(x_1, \ldots, x_{t-1}, x) = 1$ if $\|x - x_{t-1}\|_* \le \delta$ and $0$ otherwise. Then
$$\mathcal V_T(C_{1:T}) \le 2R\delta\sqrt{2T/\lambda}.$$
In particular, consider the case of a Euclidean-norm restriction on the moves. Let $\Psi : \mathcal F \mapsto \mathbb R_+$ be given by $\Psi(f) = \frac{1}{2}\|f\|_2^2$, $\mathcal F = \{f : \|f\|_2 \le 1\}$ and $\mathcal X = \{x : \|x\|_2 \le 1\}$. Consider the constrained game where the move $x_t$ played by the adversary at time $t$ satisfies $\|x_t - x_{t-1}\|_2 \le \delta$. In this case we can conclude that $\mathcal V_T(C_{1:T}) \le 2\delta\sqrt{2T}$. For the case of decision-making on the simplex, we obtain the following result. Let $\Psi(f) = \sum_{i=1}^d f_i \log(d f_i)$ be defined over the $d$-simplex $\mathcal F$, and $\mathcal X = \{x : \|x\|_\infty \le 1\}$. Consider the constrained game where the move $x_t$ played by the adversary at time $t$ satisfies $\|x_t - x_{t-1}\|_\infty \le \delta$. In this case, note that for any $f \in \mathcal F$, $\Psi(f) \le \log(d)$, and so we can conclude that $\mathcal V_T(C_{1:T}) \le 2\delta\sqrt{2T\log(d)}$.

5 Application: Smoothed Adversaries

The development of smoothed analysis over the past decade is arguably one of the landmarks in the study of the complexity of algorithms. In contrast to the overly optimistic average-case complexity and the overly pessimistic worst-case complexity, smoothed complexity can be seen as a more realistic measure of an algorithm's performance. In their groundbreaking work, Spielman and Teng [13] showed that the smoothed running-time complexity of the simplex method is polynomial. This result explains the good performance of the method in practice despite its exponential-time worst-case complexity. In this section, we consider the effect of smoothing on learnability.

It is well known that there is a gap between the i.i.d. and the worst-case scenarios. In fact, we do not need to go far for an example: a simple class of threshold functions on a unit interval is learnable in the i.i.d. supervised learning scenario, yet difficult in the online worst-case model [8, 2, 9]. This fact is reflected in the corresponding combinatorial dimensions: the Vapnik-Chervonenkis dimension is one, whereas the Littlestone dimension is infinite. The proof of the latter fact, however, reveals that the infinite number of mistakes on the part of the player is due to the infinite resolution of the carefully chosen adversarial sequence. We can argue that this infinite precision is an unreasonable assumption on the power of a real-world opponent. The idea of limiting the power of the malicious adversary through perturbing the sequence can be traced back to Posner and Kulkarni [9]. The authors considered online learning of functions of bounded variation, but in the so-called realizable setting (that is, when labels are given by some function in the given class).

We define the smoothed online learning model as the following $T$-round interaction between the learner and the adversary. On round $t$, the learner chooses $f_t \in \mathcal F$; the adversary simultaneously chooses $x_t \in \mathcal X$, which is then perturbed by some noise $s_t \sim \sigma$, yielding a value $\tilde x_t = \omega(x_t, s_t)$; and the player suffers $f_t(\tilde x_t)$. Regret is defined with respect to the perturbed sequence. Here $\omega : \mathcal X \times \mathcal S \mapsto \mathcal X$ is some measurable mapping; for instance, additive disturbances can be written as $\tilde x = \omega(x, s) = x + s$. If $\omega$ keeps $x$ unchanged, that is, $\omega(x_t, s_t) = x_t$, the setting is precisely the standard online learning model. In the full-information version, we assume that the choice $\tilde x_t$ is revealed to the player at the end of round $t$. We now recognize that the setting is nothing but a particular way to restrict the adversary. That is, the choice $x_t \in \mathcal X$ defines a parameter of a mixed strategy from which an actual move $\omega(x_t, s_t)$ is drawn; for instance, for additive zero-mean Gaussian noise, $x_t$ defines the center of the distribution from which $x_t + s_t$ is drawn. In other words, noise does not allow the adversary to play any desired mixed strategy.
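In code, the smoothing map $\omega$ is one line; the following sketch plays a single round of the smoothed protocol with additive uniform noise. The particular center value and noise level are arbitrary placeholders of our own.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma = 1e-3  # noise level; gamma -> 0 recovers the worst-case model

def omega(x, s):
    # Additive disturbance: the perturbed move revealed to the player.
    return x + s

x_t = 0.7301                              # adversary's chosen center
s_t = rng.uniform(-gamma / 2, gamma / 2)  # nature's noise s_t ~ sigma
x_tilde = omega(x_t, s_t)                 # the move the learner suffers on
print(x_t, x_tilde)
```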
The value of the smoothed online learning game (as defined in (1)) can be equivalently written as
$$\mathcal V_T = \inf_{q_1} \sup_{x_1} \mathbb E_{\substack{f_1 \sim q_1 \\ s_1 \sim \sigma}} \; \inf_{q_2} \sup_{x_2} \mathbb E_{\substack{f_2 \sim q_2 \\ s_2 \sim \sigma}} \cdots \inf_{q_T} \sup_{x_T} \mathbb E_{\substack{f_T \sim q_T \\ s_T \sim \sigma}} \left[ \sum_{t=1}^T f_t(\omega(x_t, s_t)) - \inf_{f \in \mathcal F} \sum_{t=1}^T f(\omega(x_t, s_t)) \right]$$
where the infima are over $q_t \in \mathcal Q$ and the suprema are over $x_t \in \mathcal X$. Using sequential symmetrization, we deduce the following upper bound on the value of the smoothed online learning game.

Theorem 10. The value of the smoothed online learning game is bounded above as
$$\mathcal V_T \le 2 \sup_{x_1 \in \mathcal X} \mathbb E_{s_1 \sim \sigma} \mathbb E_{\epsilon_1} \cdots \sup_{x_T \in \mathcal X} \mathbb E_{s_T \sim \sigma} \mathbb E_{\epsilon_T} \left[ \sup_{f \in \mathcal F} \sum_{t=1}^T \epsilon_t f(\omega(x_t, s_t)) \right].$$

We now demonstrate how Theorem 10 can be used to show learnability for smoothed learning of threshold functions. First, consider the supervised game with threshold functions on a unit interval (that is, non-homogeneous hyperplanes). The moves of the adversary are pairs $x = (z, y)$ with $z \in [0, 1]$ and $y \in \{0, 1\}$, and the binary-valued function class $\mathcal F$ is defined by
$$\mathcal F = \{f_\theta(z, y) = |y - \mathbf 1\{z < \theta\}| : \theta \in [0, 1]\}, \quad (9)$$
that is, every function is associated with a threshold $\theta \in [0, 1]$. The class $\mathcal F$ has infinite Littlestone dimension and is not learnable in the worst-case online framework. Consider a smoothed scenario, with the $z$-variable of the adversarial move $(z, y)$ perturbed by additive uniform noise $\sigma = \mathrm{Unif}[-\gamma/2, \gamma/2]$ for some $\gamma \ge 0$. That is, the actual move revealed to the player at time $t$ is $(z_t + s_t, y_t)$, with $s_t \sim \sigma$. Any non-trivial upper bound on regret has to depend on particular noise assumptions, as $\gamma = 0$ corresponds to the case with infinite Littlestone dimension. For the uniform disturbance, intuition tells us that noise implies a margin, and we should expect a $1/\gamma$ complexity parameter to appear in the bounds. The next lemma quantifies the intuition that additive noise limits the precision of the adversary.

Lemma 11. Let $\theta_1, \ldots, \theta_N$ be obtained by discretizing the interval $[0, 1]$ into $N = T^a$ bins $[\theta_i, \theta_{i+1})$ of length $T^{-a}$, for some $a \ge 3$. Then, for any sequence $z_1, \ldots, z_T \in [0, 1]$, with probability at least $1 - \frac{1}{\gamma T^{a-2}}$, no two elements of the sequence $z_1 + s_1, \ldots, z_T + s_T$ belong to the same interval $[\theta_i, \theta_{i+1})$, where $s_1, \ldots, s_T$ are i.i.d. $\mathrm{Unif}[-\gamma/2, \gamma/2]$.

We now observe that, conditioned on the event in Lemma 11, the upper bound on the value in Theorem 10 is a supremum of $N$ martingale difference sequences! We then arrive at:

Proposition 12. For the problem of smoothed online learning of thresholds in 1-D, the value is
$$\mathcal V_T \le 2 + \sqrt{2T(4\log T + \log(1/\gamma))}.$$

What we found is somewhat surprising: for a problem which is not learnable in the online worst-case scenario, an exponentially small amount of noise added to the moves of the adversary yields a learnable problem. This shows, at least in the given example, that worst-case analysis and Littlestone's dimension are brittle notions which might be too restrictive in the real world, where some noise is unavoidable. It is comforting that small additive noise makes the problem learnable!

The proof for smoothed learning of half-spaces in higher dimensions follows the same route as the one-dimensional exposition. For simplicity, assume the hyperplanes are homogeneous and $\mathcal Z = S_{d-1} \subset \mathbb R^d$, $\mathcal Y = \{-1, 1\}$, $\mathcal X = \mathcal Z \times \mathcal Y$. Define $\mathcal F = \{f_\theta(z, y) = \mathbf 1\{y \langle z, \theta \rangle > 0\} : \theta \in S_{d-1}\}$, and assume that the noise is distributed uniformly on a square patch with side length $\gamma$ on the surface of the sphere $S_{d-1}$. We can also consider other distributions, possibly with support on a $d$-dimensional ball instead.

Proposition 13. For the problem of smoothed online learning of half-spaces,
$$\mathcal V_T = O\left( \sqrt{dT\left( \log\left(\frac{1}{\gamma}\right) + \frac{3}{d-1}\log T \right)} + v_{d-2} \cdot \left(\frac{1}{\gamma}\right)^{d-1} \right)$$
where $v_{d-2}$ is a constant depending only on the dimension $d$.

We conclude that half-spaces are online learnable in the smoothed model, since the upper bound of Proposition 13 guarantees the existence of an algorithm which achieves this regret. In fact, for the two examples considered in this section, the Exponential Weights Algorithm on the discretization given by Lemma 11 is a (computationally infeasible) algorithm achieving the bound; a toy sketch of this reduction follows below.
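As a proof of concept, the sketch below runs Exponential Weights over a grid of thresholds against a smoothed adversary in the 1-D setting of (9). Everything about the instantiation is our own choice rather than the paper's: the grid is far coarser than the $T^a$ bins of Lemma 11 so the demo runs quickly, the adversary's centers are drawn at random rather than adversarially, and the learning rate $\eta = \sqrt{8\ln N / T}$ is the standard experts tuning.

```python
import numpy as np

rng = np.random.default_rng(5)
T, gamma = 500, 0.05
N = 200                                  # coarse stand-in for the T^a bins
thetas = np.linspace(0.0, 1.0, N)
eta = np.sqrt(8 * np.log(N) / T)
log_w = np.zeros(N)                      # log-weights of the N threshold experts

learner_loss, expert_loss = 0.0, np.zeros(N)
for t in range(T):
    z, y = rng.random(), rng.integers(0, 2)           # adversary's center (z, y)
    z_tilde = z + rng.uniform(-gamma / 2, gamma / 2)  # smoothing noise
    losses = np.abs(y - (z_tilde < thetas).astype(float))  # |y - 1{z~ < theta}|
    p = np.exp(log_w - log_w.max()); p /= p.sum()
    learner_loss += p @ losses           # expected loss of the randomized learner
    expert_loss += losses
    log_w -= eta * losses                # multiplicative-weights update

print(learner_loss - expert_loss.min())  # regret vs. the best fixed threshold
```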
References

[1] J. Abernethy, A. Agarwal, P. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In COLT, 2009.
[2] S. Ben-David, D. Pal, and S. Shalev-Shwartz. Agnostic online learning. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[3] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, 1985.
[4] E. Hazan and S. Kale. Better algorithms for benign bandits. In SODA, 2009.
[5] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. NIPS, 22, 2008.
[6] A. Lazaric and R. Munos. Hybrid stochastic-adversarial on-line learning. In COLT, 2009.
[7] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer-Verlag, New York, 1991.
[8] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285-318, 1988.
[9] S. Posner and S. Kulkarni. On-line learning of functions of bounded variation under various sampling schemes. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 439-445. ACM, 1993.
[10] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, 2010. Full version available at arXiv:1006.1138.
[11] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Beyond regret. In COLT, 2011. Full version available at arXiv:1011.3168.
[12] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. JMLR, 11:2635-2670, Oct 2010.
[13] D. A. Spielman and S. H. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3):385-463, 2004.
[14] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series, March 1996.

A Proofs

Proof of Theorem 1. The proof is identical to that in [10]. For simplicity, denote $\psi(x_{1:T}) = \inf_{f \in \mathcal F} \sum_{t=1}^T f(x_t)$. The first step in the proof is to appeal to the minimax theorem for every couple of inf and sup:
$$\inf_{q_1 \in \mathcal Q} \sup_{p_1 \in \mathcal P_1} \mathbb E_{\substack{f_1 \sim q_1 \\ x_1 \sim p_1}} \cdots \inf_{q_T \in \mathcal Q} \sup_{p_T \in \mathcal P_T} \mathbb E_{\substack{f_T \sim q_T \\ x_T \sim p_T}} \left[ \sum_{t=1}^T f_t(x_t) - \psi(x_{1:T}) \right]$$
$$= \sup_{p_1 \in \mathcal P_1} \inf_{q_1 \in \mathcal Q} \mathbb E_{\substack{f_1 \sim q_1 \\ x_1 \sim p_1}} \cdots \sup_{p_T \in \mathcal P_T} \inf_{q_T \in \mathcal Q} \mathbb E_{\substack{f_T \sim q_T \\ x_T \sim p_T}} \left[ \sum_{t=1}^T f_t(x_t) - \psi(x_{1:T}) \right]$$
$$= \sup_{p_1 \in \mathcal P_1} \inf_{f_1 \in \mathcal F} \mathbb E_{x_1 \sim p_1} \cdots \sup_{p_T \in \mathcal P_T} \inf_{f_T \in \mathcal F} \mathbb E_{x_T \sim p_T} \left[ \sum_{t=1}^T f_t(x_t) - \psi(x_{1:T}) \right].$$
From now on, it will be understood that $x_t$ has distribution $p_t$ and that the suprema over $p_t$ are in fact over $p_t \in \mathcal P_t(x_{1:t-1})$. By moving the expectation with respect to $x_T$ and then the infimum with respect to $f_T$ inside the expression, we arrive at
$$\sup_{p_1} \inf_{f_1} \mathbb E_{x_1} \cdots \sup_{p_{T-1}} \inf_{f_{T-1}} \mathbb E_{x_{T-1}} \sup_{p_T} \left[ \sum_{t=1}^{T-1} f_t(x_t) + \inf_{f_T} \mathbb E_{x_T} f_T(x_T) - \mathbb E_{x_T} \psi(x_{1:T}) \right]$$
$$= \sup_{p_1} \inf_{f_1} \mathbb E_{x_1} \cdots \sup_{p_{T-1}} \inf_{f_{T-1}} \mathbb E_{x_{T-1}} \sup_{p_T} \mathbb E_{x_T} \left[ \sum_{t=1}^{T-1} f_t(x_t) + \inf_{f_T} \mathbb E_{x_T} f_T(x_T) - \psi(x_{1:T}) \right].$$
Let us now repeat the procedure for step $T-1$. The above expression is equal to
$$\sup_{p_1} \inf_{f_1} \mathbb E_{x_1} \cdots \sup_{p_{T-1}} \inf_{f_{T-1}} \mathbb E_{x_{T-1}} \left[ \sum_{t=1}^{T-1} f_t(x_t) + \sup_{p_T} \mathbb E_{x_T} \left[ \inf_{f_T} \mathbb E_{x_T} f_T(x_T) - \psi(x_{1:T}) \right] \right]$$
$$= \sup_{p_1} \inf_{f_1} \mathbb E_{x_1} \cdots \sup_{p_{T-1}} \left[ \sum_{t=1}^{T-2} f_t(x_t) + \inf_{f_{T-1}} \mathbb E_{x_{T-1}} f_{T-1}(x_{T-1}) + \mathbb E_{x_{T-1}} \sup_{p_T} \mathbb E_{x_T} \left[ \inf_{f_T} \mathbb E_{x_T} f_T(x_T) - \psi(x_{1:T}) \right] \right]$$
$$= \sup_{p_1} \inf_{f_1} \mathbb E_{x_1} \cdots \sup_{p_{T-1}} \mathbb E_{x_{T-1}} \sup_{p_T} \mathbb E_{x_T} \left[ \sum_{t=1}^{T-2} f_t(x_t) + \inf_{f_{T-1}} \mathbb E_{x_{T-1}} f_{T-1}(x_{T-1}) + \inf_{f_T} \mathbb E_{x_T} f_T(x_T) - \psi(x_{1:T}) \right].$$
Continuing in this fashion for $T-2$ and all the way down to $t = 1$ proves the theorem.

Proof of Proposition 2. Even though Theorem 1 shows equality to some quantity with a supremum over oblivious strategies $p$, it is not immediate that there exists an oblivious minimax strategy for the adversary, and a proof is required. To this end, for any oblivious strategy $p$, define the regret the player would get playing optimally against $p$:
$$\mathcal V_T^p \triangleq \inf_{f_1 \in \mathcal F} \mathbb E_{x_1 \sim p_1} \inf_{f_2 \in \mathcal F} \mathbb E_{x_2 \sim p_2(\cdot|x_1)} \cdots \inf_{f_T \in \mathcal F} \mathbb E_{x_T \sim p_T(\cdot|x_{1:T-1})} \left[ \sum_{t=1}^T f_t(x_t) - \inf_{f \in \mathcal F} \sum_{t=1}^T f(x_t) \right]. \quad (10)$$
We will prove that for any oblivious strategy $p$,
$$\mathcal V_T(\mathcal P_{1:T}) \ \ge\ \mathcal V_T^p \ =\ \inf_{\pi} \mathbb E \left[ \sum_{t=1}^T \mathbb E_{f_t \sim \pi_t(\cdot|x_{1:t-1})} \mathbb E_{x_t \sim p_t} f_t(x_t) - \inf_{f \in \mathcal F} \sum_{t=1}^T f(x_t) \right] \quad (11)$$
with equality holding for the $p^*$ which achieves the supremum in (3). Importantly, the infimum is over strategies $\pi = \{\pi_t\}_{t=1}^T$ of the player that do not depend on the player's previous moves, that is, $\pi_t : \mathcal X^{t-1} \mapsto \mathcal Q$.

Fix an oblivious strategy $p$ and note that $\mathcal V_T(\mathcal P_{1:T}) \ge \mathcal V_T^p$. From now on, it will be understood that $x_t$ has distribution $p_t(\cdot|x_{1:t-1})$. Let $\pi = \{\pi_t\}_{t=1}^T$ be a strategy of the player, that is, a sequence of mappings $\pi_t : (\mathcal F \times \mathcal X)^{t-1} \mapsto \mathcal Q$.