Online Nonparametric Regression with General Loss Functions

Alexander Rakhlin (University of Pennsylvania), Karthik Sridharan (Cornell University)

January 28, 2015

Abstract

This paper establishes minimax rates for online regression with arbitrary classes of functions and general losses.¹ We show that below a certain threshold for the complexity of the function class, the minimax rates depend on both the curvature of the loss function and the sequential complexities of the class. Above this threshold, the curvature of the loss does not affect the rates. Furthermore, for the case of square loss, our results point to an interesting phenomenon: whenever sequential and i.i.d. empirical entropies match, the rates for statistical and online learning are the same.

In addition to the study of minimax regret, we derive a generic forecaster that enjoys the established optimal rates. We also provide a recipe for designing online prediction algorithms that can be computationally efficient for certain problems. We illustrate the techniques by deriving existing and new forecasters for the case of finite experts and for online linear regression.

¹ This paper builds upon the study of online regression with square loss, presented by the authors at the COLT 2014 conference.

1 Introduction

We study the problem of predicting a real-valued sequence $y_1,\ldots,y_n$ in an on-line manner. At time $t=1,\ldots,n$, the forecaster receives side information in the form of an element $x_t$ of an abstract set $\mathcal{X}$. The forecaster then makes a prediction $\hat{y}_t$ on the basis of the current observation $x_t$ and the data $\{(x_i,y_i)\}_{i=1}^{t-1}$ encountered thus far, and then observes the response $y_t$.

Such a problem of sequence prediction is studied in the literature under two distinct settings: probabilistic and deterministic [18]. In the former setting, which falls within the purview of time series analysis, one posits a parametric form for the data-generating mechanism and estimates the model parameters based on past instances and input information in order to make the next prediction. In contrast, in the deterministic setting one assumes no such probabilistic mechanism. Instead, the goal is phrased as that of predicting as well as the best forecaster from a benchmark set of strategies. This latter setting, often termed prediction of individual sequences or online learning, is the focus of the present paper.

We let the outcome $y_t$ and the prediction $\hat{y}_t$ take values in $\mathcal{Y}\subseteq\mathbb{R}$ and $\hat{\mathcal{Y}}\subseteq\mathbb{R}$, respectively. Formally, a deterministic prediction strategy is a mapping $(\mathcal{X}\times\mathcal{Y})^{t-1}\times\mathcal{X}\to\hat{\mathcal{Y}}$. We let the loss function $(\hat{y}_t,y_t)\mapsto\ell(\hat{y}_t,y_t)$ score the quality of the prediction on a single round.

Assume that the time horizon $n\in\mathbb{Z}_+$ is known to the forecaster. The overall quality of the forecaster is then evaluated against the benchmark set of predictors, denoted as a class $\mathcal{F}$ of functions $\mathcal{X}\to\hat{\mathcal{Y}}$. The cumulative regret of the forecaster on the sequence $(x_1,y_1),\ldots,(x_n,y_n)$ is defined as

\[
\sum_{t=1}^{n}\ell(\hat{y}_t,y_t)-\inf_{f\in\mathcal{F}}\sum_{t=1}^{n}\ell(f(x_t),y_t). \tag{1}
\]

The forecaster aims to keep the difference in (1) small for all sequences $(x_1,y_1),\ldots,(x_n,y_n)$.
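To make the protocol and the regret (1) concrete, the following is a minimal simulation sketch. The square loss, the three-element comparator class, and the naive follow-the-leader forecaster are all illustrative assumptions of ours, not constructions from the paper (optimal forecasters are derived in Section 6).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative benchmark class F and loss function (both are assumptions).
F = [lambda x: 0.0, lambda x: x, lambda x: -x]
loss = lambda yhat, y: (yhat - y) ** 2

def cumulative_loss(f, history):
    return sum(loss(f(x), y) for x, y in history)

n, forecaster_loss, history = 100, 0.0, []
for t in range(n):
    x_t = rng.uniform(-1, 1)                       # side information x_t
    # naive strategy: follow the empirically best f in F on the data so far
    f_hat = min(F, key=lambda f: cumulative_loss(f, history))
    yhat_t = f_hat(x_t)                            # prediction precedes y_t
    y_t = 0.8 * x_t + 0.1 * rng.standard_normal()  # an arbitrary sequence
    forecaster_loss += loss(yhat_t, y_t)
    history.append((x_t, y_t))

best_in_class = min(cumulative_loss(f, history) for f in F)
print("cumulative regret (1):", forecaster_loss - best_in_class)
```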
The comparison class $\mathcal{F}$ encodes the prior belief about the family of predictors one expects to perform well. If a forecasting strategy guarantees small regret for all sequences, and if $\mathcal{F}$ is a good model for the sequences observed in reality, then the forecasting strategy will also perform well in terms of its cumulative error. In fact, we can take $\mathcal{F}$ to be a class of solutions (that is, forecasting strategies) to a set of probabilistic sources one would obtain by positing a generative model of data. By doing so, we are modeling solutions to the prediction problem rather than modeling the data-generating mechanism. We refer to [18, 21] for further discussions on this "duality" between the probabilistic and deterministic approaches.

To ensure that $\mathcal{F}$ captures the phenomenon of interest, we would like $\mathcal{F}$ to be large. However, increasing the "size" of $\mathcal{F}$ likely leads to larger regret, as the comparison term in (1) becomes smaller. On the other hand, decreasing the "size" of $\mathcal{F}$ makes the regret minimization task easier, yet the prediction method is less likely to be successful in practice. This dichotomy is an analogue of the bias-variance tradeoff commonly studied in statistics. A contribution of this paper is an analysis of the growth of regret (with $n$) in terms of various notions of complexity of $\mathcal{F}$. The task was already accomplished in [24] for the case of absolute loss $\ell(a,b)=|a-b|$. In the present paper we obtain optimal guarantees for convex Lipschitz losses under very general assumptions.

To give the reader a sense of the results of this paper, we state the following informal corollary. Let the complexity of $\mathcal{F}$ be measured via sequential entropy at scale $\beta$, to be defined below. (For the reader familiar with covering numbers, this is a sequential analogue, introduced in [24], of the classical Koltchinskii-Pollard entropy.)

Corollary 1 (Informal). Suppose sequential entropy at scale $\beta$ behaves as $O(\beta^{-p})$, $p>0$. Then optimal regret

• for prediction with absolute loss grows as $n^{1/2}$ if $p\in(0,2)$, and as $n^{1-1/p}$ for $p>2$;

• for prediction with square loss grows as $n^{1-2/(2+p)}$ if $p\in(0,2)$, and as $n^{1-1/p}$ for $p>2$.

Moreover, these rates have matching (sometimes modulo a logarithmic factor) lower bounds.

The first part of this corollary is established in [24]. The second part requires new techniques that take advantage of the curvature of the loss function.

In an attempt to entice the reader, let us discuss two conclusions that can be drawn from Corollary 1. First, the rates of convergence match optimal rates for excess square loss in the realm of distribution-free Statistical Learning Theory with i.i.d. data, under the assumption on the behavior of empirical covering numbers [27]. Hence, in the absence of a gap between classical and sequential complexities (introduced later), the regression problems in the two seemingly different frameworks enjoy the same rates of convergence. A deeper understanding of this phenomenon is of great interest.

The second conclusion concerns the same optimal rate $n^{-1/p}$ for both square and absolute loss for "rich" classes ($p>2$). Informally, strong convexity of the loss does not affect the rate of convergence for such massive classes. A geometric explanation of this interesting phenomenon requires further investigation.
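As a quick arithmetic illustration of Corollary 1, the snippet below (ours; exponents only, constants and logarithmic factors dropped) tabulates the regret growth exponents for the two losses; the two regimes coincide for $p>2$ and meet at $p=2$.

```python
# Regret grows as n^alpha per Corollary 1 (informal).
def alpha_absolute(p):
    return 0.5 if p < 2 else 1 - 1 / p

def alpha_square(p):
    return 1 - 2 / (2 + p) if p < 2 else 1 - 1 / p

for p in (0.5, 1.0, 1.5, 2.5, 4.0):
    print(f"p={p}: absolute n^{alpha_absolute(p):.3f}, "
          f"square n^{alpha_square(p):.3f}")
```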
We finish this introduction with a note about the generality of the setting proposed so far. Suppose $\mathcal{X}=\bigcup_{t\le n}\mathcal{Y}^{t}$, the space of all histories of $\mathcal{Y}$-valued outcomes. Denoting $x_t=(y_1,\ldots,y_{t-1})=y^{t-1}$, we may view each $f\in\mathcal{F}$ itself as a strategy that maps the history $y^{t-1}$ to a prediction. Ensuring that $x_t$ is not arbitrary but consistent with the history only makes the task of regret minimization easier; the analysis of this paper for this case follows along the same lines, but we omit the extra overhead of restrictions on the $x_t$'s and instead refer the reader to [14, 21].

The paper is organized as follows. Section 2 introduces the notation and then presents a brief overview of sequential complexities. Upper and lower bounds on minimax regret are established in Sections 3 and 4. We calculate minimax rates for various examples in Section 5. We then turn to the question of developing algorithms in Section 6. We first show that an algorithm based on the Rademacher relaxation is admissible (see [19]) and yields the rates derived in a non-constructive manner in the first part of the paper. We show that further relaxations in finite-dimensional spaces lead to the famous Vovk-Azoury-Warmuth forecaster. We also derive a prediction method for a finite class $\mathcal{F}$.

2 Preliminaries

2.1 Assumptions and Definitions

We assume that the set of outcomes $\mathcal{Y}$ is a bounded set, a restriction that can be removed by standard truncation arguments (see e.g. [12]). Let $\mathcal{X}$ be some set of covariates, and let $\mathcal{F}$ be a class of functions $\mathcal{X}\to\hat{\mathcal{Y}}$ for some $\hat{\mathcal{Y}}\subseteq\mathbb{R}$. Recall the protocol of the online prediction problem: on each round $t\in\{1,\ldots,n\}$, $x_t\in\mathcal{X}$ is revealed to the learner, who subsequently makes a prediction $\hat{y}_t\in\hat{\mathcal{Y}}$. The response $y_t\in\mathcal{Y}$ is revealed after the prediction is made.

The loss function $\ell(\cdot,y)$ is assumed to be convex. Let $\partial_a\ell(a,y)$ denote any element of the subdifferential set (with respect to the first argument), and assume that

\[
\sup_{a\in\hat{\mathcal{Y}},\,y\in\mathcal{Y}}|\partial_a\ell(a,y)|\le G<\infty.
\]

We assume that for any distribution of $y$ supported on $\mathcal{Y}$, there is a minimizer of expected loss that is finite and belongs to $\hat{\mathcal{Y}}$:

\[
\hat{\mathcal{Y}}\cap\operatorname*{argmin}_{\hat{y}\in\mathbb{R}}\mathbb{E}\,\ell(\hat{y},y)\neq\emptyset.
\]

Given a $y\in\mathcal{Y}$, the error of a linear expansion at $a$ used to approximate the function value at $b$ is denoted by

\[
\Delta^{y}_{a,b}\triangleq\ell(b,y)-\big[\ell(a,y)+\partial_a\ell(a,y)\cdot(b-a)\big].
\]

Let $\Delta:(\hat{\mathcal{Y}}-\hat{\mathcal{Y}})\to\mathbb{R}_{\ge0}$ be a function defined pointwise as

\[
\Delta(x)=\inf_{a,b\in\hat{\mathcal{Y}},\,y\in\mathcal{Y}\ \text{s.t.}\ b-a=x}\Delta^{y}_{a,b}, \tag{2}
\]

a lower bound on the residual for any two values separated by $x$. For instance, an easy calculation shows that $\Delta(x)=x^2$ for $\ell(\hat{y},y)=(\hat{y}-y)^2$.
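The last claim follows from the identity $\Delta^{y}_{a,b}=(b-y)^2-(a-y)^2-2(a-y)(b-a)=(b-a)^2$, independent of $y$. A short numeric sanity check of this identity (our sketch; the helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def residual(a, b, y):
    """Delta^y_{a,b} for square loss: l(b,y) - [l(a,y) + l'(a,y)(b-a)]."""
    loss = lambda u: (u - y) ** 2
    grad = 2 * (a - y)                  # derivative of the loss at a
    return loss(b) - (loss(a) + grad * (b - a))

for _ in range(3):
    a, b, y = rng.uniform(-1, 1, size=3)
    print(f"Delta^y_ab = {residual(a, b, y):.6f}   (b-a)^2 = {(b - a) ** 2:.6f}")
```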
2.2 Minimax Formulation

Unlike most previous approaches to the study of online regression, we do not start from an algorithm, but instead work directly with the minimax regret. We will be able to extract a (not necessarily efficient) algorithm after obtaining upper bounds on the minimax value. Let us introduce notation that makes the minimax regret definition more concise. We use $\langle\!\langle\cdots\rangle\!\rangle_{t=1}^{n}$ to denote an interleaved application of the operators, repeated over $t=1,\ldots,n$ rounds. With this notation, the minimax regret of the online regression problem described earlier can be written as

\[
\mathcal{V}_n=\Big\langle\!\Big\langle\sup_{x_t}\inf_{\hat{y}_t}\sup_{y_t}\Big\rangle\!\Big\rangle_{t=1}^{n}\left\{\sum_{t=1}^{n}\ell(\hat{y}_t,y_t)-\inf_{f\in\mathcal{F}}\sum_{t=1}^{n}\ell(f(x_t),y_t)\right\} \tag{3}
\]

where each $x_t$ ranges over $\mathcal{X}$, $\hat{y}_t$ ranges over $\hat{\mathcal{Y}}$, and $y_t$ ranges over $\mathcal{Y}$. An upper bound on $\mathcal{V}_n$ guarantees the existence of an algorithm (that is, a way to choose the $\hat{y}_t$'s) with at most that much regret against any sequence. A lower bound on $\mathcal{V}_n$, in turn, guarantees the existence of a sequence on which no method can perform better than the given lower bound.

2.3 Sequential Complexities

One of the key tools in the study of estimators based on i.i.d. data is the symmetrization technique [13]. By introducing Rademacher random variables, one can study the supremum of an empirical process conditionally on the data. Conditioning facilitates the introduction of sample-based complexities of a function class, such as an empirical covering number. For a class of bounded functions, the covering number with respect to the empirical metric is necessarily finite and leads to a correct control of the empirical process even if discretization of the function class in a data-independent manner is impossible. We will return to this point when comparing our approach with discretization-based methods.

In the online prediction scenario, symmetrization is more subtle and involves the notion of a binary tree. The binary tree is, in some sense, the smallest entity that captures the sequential nature of the problem. More precisely, a $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $n$ is a complete rooted binary tree with nodes labeled by elements of a set $\mathcal{Z}$. Equivalently, we think of $\mathbf{z}$ as $n$ labeling functions, where $\mathbf{z}_1$ is a constant label for the root, $\mathbf{z}_2(-1),\mathbf{z}_2(+1)\in\mathcal{Z}$ are the labels for the left and right children of the root, and so forth. Hence, for $\epsilon=(\epsilon_1,\ldots,\epsilon_n)\in\{\pm1\}^n$, $\mathbf{z}_t(\epsilon)=\mathbf{z}_t(\epsilon_1,\ldots,\epsilon_{t-1})\in\mathcal{Z}$ is the label of the node on the $t$-th level of the tree obtained by following the path $\epsilon$. For a function $g:\mathcal{Z}\to\mathbb{R}$, $g(\mathbf{z})$ is an $\mathbb{R}$-valued tree with labeling functions $g\circ\mathbf{z}_t$ for level $t$ (in plain words, the evaluation of $g$ on $\mathbf{z}$).

We now define two tree-based complexity notions of a class of functions.

Definition 1 ([24]). The sequential Rademacher complexity of a class $\mathcal{F}\subseteq\mathbb{R}^{\mathcal{X}}$ on a given $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$, as well as its supremum, are defined as

\[
\mathfrak{R}_n(\mathcal{F};\mathbf{x})\triangleq\mathbb{E}\sup_{f\in\mathcal{F}}\left[\sum_{t=1}^{n}\epsilon_t f(\mathbf{x}_t(\epsilon))\right],\qquad \mathfrak{R}_n(\mathcal{F})\triangleq\sup_{\mathbf{x}}\mathfrak{R}_n(\mathcal{F};\mathbf{x}) \tag{4}
\]

where the expectation is over a sequence of independent Rademacher random variables $\epsilon=(\epsilon_1,\ldots,\epsilon_n)$.

One may think of the functions $\mathbf{x}_1,\ldots,\mathbf{x}_n$ as a predictable process with respect to the dyadic filtration $\{\sigma(\epsilon_1,\ldots,\epsilon_t)\}_{t\ge1}$. The following notion of a $\beta$-cover quantifies the complexity of the class $\mathcal{F}$ evaluated on the predictable process.

Definition 2 ([24]). A set $V$ of $\mathbb{R}$-valued trees of depth $n$ forms a $\beta$-cover (with respect to the $\ell_q$ norm) of a function class $\mathcal{F}\subseteq\mathbb{R}^{\mathcal{X}}$ on a given $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$ if

\[
\forall f\in\mathcal{F},\ \forall\epsilon\in\{\pm1\}^n,\ \exists\mathbf{v}\in V\ \text{s.t.}\ \frac{1}{n}\sum_{t=1}^{n}|f(\mathbf{x}_t(\epsilon))-\mathbf{v}_t(\epsilon)|^{q}\le\beta^{q}.
\]

A $\beta$-cover in the $\ell_\infty$ sense requires that $|f(\mathbf{x}_t(\epsilon))-\mathbf{v}_t(\epsilon)|\le\beta$ for all $t\in[n]$. The size of the smallest $\beta$-cover is denoted by $\mathcal{N}_q(\beta,\mathcal{F},\mathbf{x})$, and $\mathcal{N}_q(\beta,\mathcal{F},n)\triangleq\sup_{\mathbf{x}}\mathcal{N}_q(\beta,\mathcal{F},\mathbf{x})$.

We will refer to $\log\mathcal{N}_q(\beta,\mathcal{F},n)$ as the sequential entropy of $\mathcal{F}$. In particular, we will study the behavior of $\mathcal{V}_n$ when the sequential entropy grows polynomially² as the scale $\beta$ decreases:

\[
\log\mathcal{N}_2(\beta,\mathcal{F},n)\sim\beta^{-p},\quad p>0. \tag{5}
\]

We also consider the parametric "$p=0$" case when the sequential covering number itself behaves as

\[
\mathcal{N}_2(\beta,\mathcal{F},n)\sim\beta^{-d} \tag{6}
\]

(e.g. linear regression in a bounded set in $\mathbb{R}^d$). We remark that the $\ell_\infty$ cover is necessarily $n$-dependent, so the forms we assume for the nonparametric and parametric cases, respectively, are

\[
\log\mathcal{N}_\infty(\beta,\mathcal{F},n)\sim\beta^{-p}\log(n/\beta)\quad\text{or}\quad\mathcal{N}_\infty(\beta,\mathcal{F},n)\sim(n/\beta)^{d}. \tag{7}
\]

² It is straightforward to allow constants in this definition, and we leave these details out for the sake of simplicity.
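The tree definitions translate directly into code. Below is a small sketch of ours (with arbitrary random labels) that represents a $\mathcal{Z}$-valued tree of depth $n$ as $n$ labeling functions and computes the sequential Rademacher complexity (4) of the one-parameter class $\{f_\theta(z)=\theta z:|\theta|\le1\}$ by enumerating all $2^n$ paths, using $\sup_{|\theta|\le1}\theta\sum_t\epsilon_t\mathbf{z}_t(\epsilon)=|\sum_t\epsilon_t\mathbf{z}_t(\epsilon)|$.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n = 4

# tree[t] maps the path prefix (eps_1, ..., eps_t) to the label z_{t+1}(eps);
# the root label tree[0][()] is a constant, as in the definition above.
tree = [{prefix: rng.uniform(-1, 1) for prefix in product((-1, 1), repeat=t)}
        for t in range(n)]

def path_labels(eps):
    """Labels z_1(eps), ..., z_n(eps) along the path eps in {-1,+1}^n."""
    return np.array([tree[t][tuple(eps[:t])] for t in range(n)])

# Exact expectation over the 2^n equally likely sign sequences.
value = np.mean([abs(np.dot(eps, path_labels(eps)))
                 for eps in product((-1, 1), repeat=n)])
print("sequential Rademacher complexity on this tree:", value)
```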
3 Upper Bounds

The following theorem from [24] shows the importance of sequential Rademacher complexity for prediction with absolute loss.

Theorem 2 ([24]). Let $\mathcal{Y}=[-1,1]$, $\mathcal{F}\subseteq[-1,1]^{\mathcal{X}}$, and $\ell(\hat{y},y)=|\hat{y}-y|$. It then holds that

\[
\mathfrak{R}_n(\mathcal{F})\le\mathcal{V}_n\le 2\,\mathfrak{R}_n(\mathcal{F}).
\]

Furthermore, an upper bound of $2G\,\mathfrak{R}_n(\mathcal{F})$ holds for any $G$-Lipschitz loss. We observe, however, that as soon as $\mathcal{F}$ contains two distinct functions, the sequential Rademacher complexity of $\mathcal{F}$ scales as $\Omega(n^{1/2})$. Yet, it is known that the minimax regret for prediction with square loss grows slower than this rate. Therefore, the direct analysis based on sequential Rademacher complexity (and a contraction lemma) gives loose upper bounds on minimax regret. The key contribution of this paper is the introduction of an offset Rademacher complexity that captures the correct behavior.

In the next lemma, we show that the minimax value of the sequential prediction problem with any convex Lipschitz loss function can be controlled via an offset sequential Rademacher complexity. As before, let $\epsilon=(\epsilon_1,\ldots,\epsilon_n)$ where each $\epsilon_i$ is an independent Rademacher random variable.

Lemma 3. Under the assumptions and definitions in Section 2.1, the minimax rate is bounded by

\[
\mathcal{V}_n\le\sup_{\mathbf{x},\boldsymbol{\mu}}\mathbb{E}\sup_{f\in\mathcal{F}}\left[\sum_{t=1}^{n}2G\epsilon_t\big(f(\mathbf{x}_t(\epsilon))-\boldsymbol{\mu}_t(\epsilon)\big)-\Delta\big(f(\mathbf{x}_t(\epsilon))-\boldsymbol{\mu}_t(\epsilon)\big)\right] \tag{8}
\]

where $\mathbf{x}$ and $\boldsymbol{\mu}$ range over all $\mathcal{X}$-valued and $\hat{\mathcal{Y}}$-valued trees of depth $n$, respectively.

The right-hand side of (8) will be termed the offset Rademacher complexity of a function class $\mathcal{F}\subseteq\mathbb{R}^{\mathcal{X}}$ with respect to a convex even offset function $\Delta:\mathbb{R}\to\mathbb{R}_{\ge0}$ and a mean $\hat{\mathcal{Y}}$-valued tree $\boldsymbol{\mu}$. If $\Delta\equiv0$, we recover the notion of sequential Rademacher complexity since $\mathbb{E}[\epsilon_t\boldsymbol{\mu}_t(\epsilon)]=0$.

A matching lower bound on the minimax value will be presented in Section 4, and the two results warrant a further study of offset Rademacher complexity. To this end, a natural next question is whether the chaining technique can be employed to control the supremum of this modified stochastic process. As a point of comparison, we first recall that the sequential Rademacher complexity of a class $\mathcal{G}$ of $[-1,1]$-valued functions on $\mathcal{Z}$ can be upper bounded via the Dudley integral-type bound

\[
\mathbb{E}\sup_{g\in\mathcal{G}}\left[\sum_{t=1}^{n}\epsilon_t g(\mathbf{z}_t(\epsilon))\right]\le\inf_{\rho\in(0,1]}\left\{4\rho n+12\sqrt{n}\int_{\rho}^{1}\sqrt{\log\mathcal{N}_2(\delta,\mathcal{G},\mathbf{z})}\,d\delta\right\}, \tag{9}
\]

for any $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $n$, as shown in [26]. We aim to obtain tighter upper bounds on the offset Rademacher complexity by taking advantage of the negative offset term.

To initiate the study of offset Rademacher complexity with functions $\Delta$ other than quadratic, we recall the notion of a convex conjugate.

Definition 3. For a convex function $\psi:D\to\mathbb{R}$ with domain $D\subseteq\mathbb{R}$, the convex conjugate $\psi^*:\mathbb{R}\to\mathbb{R}\cup\{+\infty\}$ is defined as

\[
\psi^*(a)=\sup_{d\in D}\{ad-\psi(d)\}.
\]

The chaining technique for controlling a supremum of a stochastic process requires a statement about the behavior of the process over a finite collection. The next lemma provides such a statement for the offset Rademacher process.

Lemma 4. Let $\Delta$ be a convex, nonnegative, even function on $\mathbb{R}$ and let $\Gamma^*$ denote the convex conjugate of the function $x\mapsto\Delta(\sqrt{|x|})$. Assume $\Gamma^*$ is nondecreasing. For any finite set $W$ of $\mathbb{R}$-valued trees of depth $n$ and any constant $C>0$,

\[
\mathbb{E}\max_{\mathbf{w}\in W}\left\{\sum_{t=1}^{n}2C\epsilon_t\mathbf{w}_t(\epsilon)-\Delta(\mathbf{w}_t(\epsilon))\right\}\le\inf_{\lambda>0}\left\{\frac{1}{\lambda}\log|W|+n\,\Gamma^*\big(2C^2\lambda\big)\right\}. \tag{10}
\]

Further, for any $[-G,G]$-valued tree $\boldsymbol{\eta}$,

\[
\mathbb{E}\max_{\mathbf{w}\in W}\left[\sum_{t=1}^{n}\epsilon_t\boldsymbol{\eta}_t(\epsilon)\mathbf{w}_t(\epsilon)\right]\le G\sqrt{2\log|W|\cdot\max_{\mathbf{w}\in W,\,\epsilon_{1:n}}\sum_{t=1}^{n}\mathbf{w}_t(\epsilon)^2}. \tag{11}
\]

As an example, if $\Delta(x)=x^2$, an easy calculation shows that $\Gamma^*(1)=0$ and $\Gamma^*(y)=+\infty$ for any $y\neq1$. Hence, the infimum in (10) is achieved at $\lambda=1/(2C^2)$, and the upper bound becomes $2C^2\log|W|$.
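Before extending this control beyond a finite collection via chaining, it may help to see numerically why the negative offset in (8) changes the picture relative to Theorem 2. In the toy case (our construction, not an example from the paper) of constant predictors $f_c(x)=c$ with $c\in[-1,1]$, constant trees, $\boldsymbol{\mu}\equiv0$, $\Delta(x)=x^2$, and $G=1$, the supremum in (8) has the closed form $\sup_c(2cS-nc^2)=S^2/n$ where $S=\sum_t\epsilon_t$ (the maximizer $c=S/n$ always lies in $[-1,1]$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 2000

plain, offset = [], []
for _ in range(trials):
    S = rng.choice((-1, 1), size=n).sum()
    plain.append(abs(S))       # sup_c c*S: plain sequential Rademacher
    offset.append(S * S / n)   # sup_c (2cS - n c^2): offset version
print(f"plain:  {np.mean(plain):7.2f}  ~ sqrt(n) = {np.sqrt(n):.2f}")
print(f"offset: {np.mean(offset):7.2f}  ~ constant: curvature removes sqrt(n)")
```

The quadratic offset collapses the $\Theta(n^{1/2})$ complexity to $O(1)$, which is exactly the gap between absolute-loss and square-loss regret for small classes.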
We can now employ the chaining technique to extend the control of the stochastic process beyond the finite collection.

Lemma 5. Let $\Delta$ and $\Gamma^*$ be as in Lemma 4. For any $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $n$, any class $\mathcal{G}$ of functions $\mathcal{Z}\to\mathbb{R}$, and any constant $C>0$,

\[
\mathbb{E}\sup_{g\in\mathcal{G}}\left[\sum_{t=1}^{n}2C\epsilon_t g(\mathbf{z}_t(\epsilon))-\Delta\big(g(\mathbf{z}_t(\epsilon))\big)\right]\le\inf_{\gamma>0}\left\{C\inf_{\rho\in(0,\gamma)}\left\{4\rho n+12\sqrt{n}\int_{\rho}^{\gamma}\sqrt{\log\mathcal{N}_\infty(\delta,\mathcal{G},\mathbf{z})}\,d\delta\right\}+\inf_{\lambda>0}\left\{\frac{1}{\lambda}\log\mathcal{N}_\infty\big(\tfrac{\gamma}{2},\mathcal{G},\mathbf{z}\big)+n\,\Gamma^*\big(2C^2\lambda\big)\right\}\right\}.
\]

Remark 1. For the case of $\Delta(x)=x^2$, it is possible to prove the upper bound of Lemma 5 in terms of $\ell_2$ sequential covering numbers rather than $\ell_\infty$ (see [22]).

Lemma 5, together with Lemma 3, yields upper bounds on minimax regret under assumptions on the growth of sequential entropy. Before detailing the rates, we present lower bounds on the minimax value in terms of the offset Rademacher complexity and combinatorial dimensions.

4 Lower Bounds

The function $\Delta$, arising from uniform (or strong) convexity of the loss function, enters the upper bounds on minimax regret. For proving lower bounds, we consider the dual property, that of (restricted) smoothness. To this end, let $S\subseteq\hat{\mathcal{Y}}$ be a subset satisfying the following condition:

\[
\forall s\in S,\ \exists y_1(s),y_2(s)\in\mathcal{Y}\ \text{s.t.}\ s\in\operatorname*{argmin}_{\hat{y}\in\hat{\mathcal{Y}}}\frac{1}{2}\big(\ell(\hat{y},y_1(s))+\ell(\hat{y},y_2(s))\big). \tag{12}
\]

For any such subset $S$, let $\Delta_S:(\hat{\mathcal{Y}}-S)\to\mathbb{R}_{\ge0}$ be defined as

\[
\Delta_S(x)=\sup_{s\in S,\,b\in\hat{\mathcal{Y}}\ \text{s.t.}\ b-s=x}\max\left\{\Delta^{y_1(s)}_{s,b},\,\Delta^{y_2(s)}_{s,b}\right\}. \tag{13}
\]

We write $\Delta_\kappa$ for the singleton set $S=\{\kappa\}$.

The lower bounds in this section will be constructed from symmetric distributions supported on two carefully chosen points. Crucially, we do not require a uniform notion of smoothness, but rather a condition on the loss that holds for a restricted subset $S$ and a two-point distribution.

As an example, consider square loss and $\mathcal{Y}=\hat{\mathcal{Y}}=(-B,B)$. For any $s\in\hat{\mathcal{Y}}$, we may choose the two points as $s\pm\delta\in\mathcal{Y}$, for small enough $\delta$, with the desired property. Then $S=\hat{\mathcal{Y}}$ and $\Delta_S(x)=x^2$.

Lemma 6. Fix $R>0$. Suppose $S\neq\emptyset$ satisfies condition (12), and suppose that for any $s\in S$,

\[
\partial\ell(s,y_1(s))=+R,\qquad \partial\ell(s,y_2(s))=-R.
\]

Then for any $S$-valued tree $\boldsymbol{\mu}$ of depth $n$,

\[
\mathcal{V}_n\ge\sup_{\mathbf{x}}\mathbb{E}\sup_{f\in\mathcal{F}}\left[\sum_{t=1}^{n}\epsilon_t R\big(f(\mathbf{x}_t(\epsilon))-\boldsymbol{\mu}_t(\epsilon)\big)-\Delta_{\boldsymbol{\mu}_t(\epsilon)}\big(f(\mathbf{x}_t(\epsilon))-\boldsymbol{\mu}_t(\epsilon)\big)\right]. \tag{14}
\]

The lower bound in (14) is an offset Rademacher complexity that matches the upper bound of Lemma 3 up to constants, as long as the functions $\Delta$ and $\Delta_S$ exhibit the same behavior. In particular, the upper and lower bounds match up to a constant for the case of square loss.

Our next step is to quantify the lower bound in terms of $n$ according to the "size" of $\mathcal{F}$. In contrast to the more common statistical approaches based on covering numbers and the Fano inequality, we turn to a notion of a combinatorial dimension as the main tool.

Definition 4. An $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $d$ is said to be $\beta$-shattered by $\mathcal{F}$ if there exists an $\mathbb{R}$-valued tree $\mathbf{s}$ of depth $d$ such that

\[
\forall\epsilon\in\{\pm1\}^{d},\ \exists f^{\epsilon}\in\mathcal{F}\ \text{s.t.}\ \epsilon_t\big(f^{\epsilon}(\mathbf{x}_t(\epsilon))-\mathbf{s}_t(\epsilon)\big)\ge\beta/2
\]

for all $t\in\{1,\ldots,d\}$. The tree $\mathbf{s}$ is called a witness. The largest $d$ for which there exists a $\beta$-shattered $\mathcal{X}$-valued tree is called the (sequential) fat-shattering dimension, denoted by $\operatorname{fat}_\beta(\mathcal{F})$.

The reader will notice that the upper bound of Lemma 5 is in terms of sequential entropies rather than combinatorial dimensions. The two notions, however, are closely related.

Theorem 7 ([26]). Let $\mathcal{F}$ be a class of functions $\mathcal{X}\to[-1,1]$. For any $\beta>0$,

\[
\mathcal{N}_2(\beta,\mathcal{F},n)\le\mathcal{N}_\infty(\beta,\mathcal{F},n)\le\left(\frac{2en}{\beta}\right)^{\operatorname{fat}_\beta(\mathcal{F})}.
\]

As a consequence of the above theorem, if $\log\mathcal{N}_2(\beta,\mathcal{F},n)\ge(c/\beta)^p$ and $\beta\ge1/n$, then $\operatorname{fat}_\beta(\mathcal{F})\ge(c'/\beta)^p/\log(n)$, where $c,c'$ may depend on the range of the functions in $\mathcal{F}$.

The lower bounds will now be obtained assuming $\operatorname{fat}_\beta(\mathcal{F})\ge\beta^{-p}$ behavior of the fat-shattering dimension, and the corresponding statements in terms of sequential entropy growth will involve extra logarithmic factors, hidden in the $\tilde\Omega(\cdot)$ notation.

Lemma 8. Suppose the statement of Lemma 6 holds for some $R>0$, and suppose

\[
\Delta_{\boldsymbol{\mu}_t(\epsilon)}\big(\boldsymbol{\mu}_t(\epsilon)-f(\mathbf{x}_t(\epsilon))\big)\le\frac{R}{2}\,\big|\boldsymbol{\mu}_t(\epsilon)-f(\mathbf{x}_t(\epsilon))\big| \tag{15}
\]

for any $f\in\mathcal{F}$ and $\boldsymbol{\mu},\mathbf{x}$ in the statement of Lemma 6. Then it holds that for any $\beta>0$ and $n=\operatorname{fat}_\beta(\mathcal{F})$,

\[
\mathcal{V}_n\ge(R/2)\,n\beta.
\]

In particular, if $\operatorname{fat}_\beta(\mathcal{F})\ge\beta^{-p}$ for $p>0$, we have

\[
\frac{1}{n}\mathcal{V}_n\ge(R/2)\,n^{-1/p}.
\]

As an example, consider the case of square loss with $\mathcal{Y}=[-B,B]$. Then we may take $S=\{0\}$, $y_1=B$, $y_2=-B$, and hence $R=2B$. We verify that (15) holds for $\hat{\mathcal{Y}}=[-B/2,B/2]$.

Lemma 9. Suppose the statement of Lemma 6 holds for some $R>0$. For any class $\mathcal{F}$ and $\beta>0$, there exists a modified class $\mathcal{F}'$ such that for all $\beta'<\beta$, $\operatorname{fat}_{\beta'}(\mathcal{F}')\le\operatorname{fat}_{\beta'}(\mathcal{F})\le2\operatorname{fat}_{\beta'}(\mathcal{F}')+4$, and for $n>\operatorname{fat}_\beta(\mathcal{F})$,

\[
\frac{1}{n}\mathcal{V}_n\ge\sup_{\beta,\kappa}\left\{\frac{R\beta}{2}\sqrt{\frac{\operatorname{fat}_\beta(\mathcal{F})}{2n}}-\Delta_\kappa\!\left(\frac{\beta}{2}\right)\right\}.
\]

Armed with the upper bounds of Section 3 and the lower bounds of Section 4, we are ready to detail specific minimax rates of convergence for various classes of regression functions $\mathcal{F}$ and a range of loss functions $\ell$.

5 Minimax Rates

Combining Lemma 3 and Lemma 5, we can detail the behavior of minimax regret under an assumption about the growth rate of the sequential entropy.

Theorem 10. Let $r\ge2$, $p>0$ and suppose the loss function and the function class are such that

\[
\Delta(t)\ge K t^{r},\qquad \log\mathcal{N}_\infty(\beta,\mathcal{F},n)\le\beta^{-p}\log(n/\beta).
\]

Then for $p\in(0,2)$,

\[
\frac{1}{n}\mathcal{V}_n\le\min\left\{c_{r,p}\,n^{-\frac{r}{2(r-1)+p}}\,G^{\frac{2r}{2(r-1)+p}}\,K^{-\frac{2-p}{2(r-1)+p}}\log(n),\ \ c_{\mathcal{F}}\,G\log^{1/2}(n)\,n^{-1/2}\right\} \tag{16}
\]

and for $p>2$,

\[
\frac{1}{n}\mathcal{V}_n\le c_{p}\,G\log^{1/2}(n)\,n^{-1/p}. \tag{17}
\]

Here, $c_{\mathcal{F}}$ depends on $\sup_{f\in\mathcal{F}}|f|_\infty$. At $p=2$, the bound (17) gains an extra $\log(n)$ factor.

We match the above upper bounds with lower bounds under the assumption on the growth of the combinatorial dimension.

Theorem 11. Suppose the statement of Lemma 6 holds for some $R>0$ and $\kappa\in S\subseteq\hat{\mathcal{Y}}$. Let $r\ge2$, $p\in(0,2)$, and assume

\[
\Delta_\kappa\big(\beta/2\big)\le K\beta^{r},\qquad \operatorname{fat}_\beta\ge\beta^{-p}.
\]

Then there exists a function class such that for some constant $c_{p,r}>0$,

\[
\frac{1}{n}\mathcal{V}_n\ge c_{p,r}\min\left\{n^{-\frac{r}{2(r-1)+p}}\,R^{\frac{2r}{2(r-1)+p}}\,K^{-\frac{2-p}{2(r-1)+p}},\ \ R\,n^{-1/2}\right\}
\]

for $p\in(0,2)$. Furthermore, for $p>2$, for any $\mathcal{F}$ with $\operatorname{fat}_\beta\ge\beta^{-p}$,

\[
\frac{1}{n}\mathcal{V}_n\ge(R/2)\,n^{-1/p}
\]

under the assumption (15).

The lower bound of Theorem 11 matches (up to polylogarithmic in $n$ factors) the upper bound of Theorem 10 in its dependence on $n$, in the dependence on the constant $K$, and in the dependence on the size of the gradients $G$ (respectively, $R$).
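The exponent $\frac{r}{2(r-1)+p}$ in Theorems 10 and 11 interpolates between the square-loss rate $\frac{2}{2+p}$ at $r=2$ and the curvature-free rate $\frac{1}{2}$ as $r\to\infty$ (cf. Section 5.1 below). A small tabulation sketch of ours (exponents only; constants, logarithms, and the $G$, $K$ dependence dropped):

```python
# Per-round rate (1/n) V_n ~ n^{-r/(2(r-1)+p)} from Theorem 10, for p in (0,2).
def exponent(r, p):
    return r / (2 * (r - 1) + p)

for p in (0.5, 1.0, 1.5):
    rates = {r: round(exponent(r, p), 3) for r in (2, 3, 10, 1000)}
    print(f"p={p}: {rates}  # r=2 gives 2/(2+p); r -> inf tends to 1/2")
```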
The rest of this section is devoted to a discussion of the derived upper and lower bounds for particular loss functions or particular classes of functions.

5.1 Absolute loss

We verify that the general statements recover the correct rates for the case of $\ell(\hat{y},y)=|\hat{y}-y|$. Since the absolute loss is not strongly convex, we take $K=0$ (and $\Delta\equiv0$). Theorem 10 then yields the $O(n^{-1/2})$ rate for $p\in(0,2)$ and $\tilde{O}(n^{-1/p})$ for $p>2$, up to logarithmic factors. These rates are matched, again up to logarithmic factors, in Theorem 11. Of course, the result already follows from Theorem 2.

It is also instructive to check the case of $r\to\infty$. In this case, if $K$ is scaled properly by the range of function values, the function $\Delta$ approaches the zero function, indicating the absence of strong convexity of the loss. Examining the power $\frac{r}{2(r-1)+p}$ in Theorem 10, we see that it approaches $1/2$, matching the discussion of the preceding paragraph.

5.2 Square loss

The case of square loss $\ell(\hat{y},y)=(\hat{y}-y)^2$ has been studied in [22]. In view of Remark 1, we state the corollary below in terms of $\ell_2$ covering numbers, thus removing some logarithmic terms of Theorem 10.

Corollary 12. For a class $\mathcal{F}$ with sequential entropy growth $\log\mathcal{N}_2(\beta,\mathcal{F},n)\le\beta^{-p}$,

• For $p>2$, the minimax regret³ is bounded as $\frac{1}{n}\mathcal{V}_n\le Cn^{-1/p}$.

• For $p\in(0,2)$, the minimax regret is bounded as $\frac{1}{n}\mathcal{V}_n\le Cn^{-2/(2+p)}$.

• For the parametric case (6), $\frac{1}{n}\mathcal{V}_n\le Cdn^{-1}\log(n)$.

• For a finite set $\mathcal{F}$, $\frac{1}{n}\mathcal{V}_n\le Cn^{-1}\log|\mathcal{F}|$.

Corollary 13. The upper bounds of Corollary 12 are tight⁴:

• For $p>2$, for any class $\mathcal{F}$ of uniformly bounded functions with a lower bound of $\beta^{-p}$ on the sequential entropy growth, $\frac{1}{n}\mathcal{V}_n\ge\tilde\Omega(n^{-1/p})$.

• For $p\in(0,2]$, for any class $\mathcal{F}$ of uniformly bounded functions, there exists a slightly modified class $\mathcal{F}'$ with the same sequential entropy growth such that $\frac{1}{n}\mathcal{V}_n\ge\tilde\Omega(n^{-2/(2+p)})$.

• There exists a class $\mathcal{F}$ with the covering number as in (6), such that $\frac{1}{n}\mathcal{V}_n\ge\Omega(dn^{-1}\log(n))$.

³ For $p=2$, $\frac{1}{n}\mathcal{V}_n\le C\log(n)\,n^{-1/2}$.
⁴ The $\tilde\Omega(\cdot)$ notation suppresses logarithmic factors.

5.3 $q$-loss for $q\in(1,2)$

Consider the case of $\ell(\hat{y},y)=|y-\hat{y}|^{q}$, for $q\in(1,2)$, which interpolates between the absolute value and square losses.

Corollary 14. Suppose $\mathcal{Y}=\hat{\mathcal{Y}}=[-1,1]$ and $\ell(\hat{y},y)=|y-\hat{y}|^{q}$ for $q\in(1,2)$. Assume the complexity of $\mathcal{F}$ is as in Theorems 10 and 11 for some $p>0$. Then

\[
\frac{1}{n}\mathcal{V}_n=\tilde\Theta\left(\min\left\{(q-1)^{-\frac{2-p}{2+p}}\,n^{-\frac{2}{2+p}},\ n^{-1/2}\right\}\right).
\]

5.4 $q$-loss for $q\ge2$

It is easy to check that for $q>2$, $\ell(\cdot,y)=|\cdot-y|^{q}$ is $q$-uniformly convex, and thus

\[
\Delta(t)\ge C t^{q}.
\]

The upper bound

\[
\frac{1}{n}\mathcal{V}_n\le C n^{-\frac{q}{2(q-1)+p}}
\]

then follows from Theorem 10.

5.5 Logistic loss

The loss function $\ell(\hat{y},y)=\log(1+\exp\{-\hat{y}y\})$ is strongly convex and smooth if the sets $\mathcal{Y},\hat{\mathcal{Y}}$ are bounded. This can be seen by computing the second derivative with respect to the first argument:

\[
\ell''(\hat{y},y)=y^2\,\frac{\exp\{\hat{y}y\}}{(1+\exp\{\hat{y}y\})^2}.
\]

We conclude that

\[
\frac{1}{n}\mathcal{V}_n=\tilde\Theta\left(\min\left\{n^{-\frac{2}{2+p}},\ n^{-1/2}\right\}\right).
\]

Logistic loss is an example of a function with its third derivative bounded by a multiple of the second derivative. Control of the remainder term in the Taylor approximation for such functions is given in [5, Lemma 1]. Other examples of strongly convex and smooth losses are the exponential loss and the truncated quadratic loss. These enjoy the same minimax rate as given above.
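The second-derivative formula above is easy to check numerically. The sketch below (ours, assuming labels $y\in\{-1,+1\}$ and predictions in $[-1,1]$) compares the formula against a finite difference and confirms that $\ell''$ stays within fixed positive bounds on this bounded domain, which is the strong convexity and smoothness claim.

```python
import numpy as np

def loss(yhat, y):
    return np.log1p(np.exp(-yhat * y))

def second_derivative(yhat, y):
    e = np.exp(yhat * y)
    return y ** 2 * e / (1 + e) ** 2

# finite-difference check of the formula at an arbitrary point
yhat, y, h = 0.3, -1.0, 1e-4
fd = (loss(yhat + h, y) - 2 * loss(yhat, y) + loss(yhat - h, y)) / h ** 2
print(f"finite difference: {fd:.6f}   formula: {second_derivative(yhat, y):.6f}")

# l'' bounded above and away from zero on [-1,1] x {-1,+1}
grid = np.linspace(-1, 1, 201)
vals = [second_derivative(a, y) for a in grid for y in (-1.0, 1.0)]
print(f"min l'' = {min(vals):.4f}, max l'' = {max(vals):.4f}")
```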
DefineF 1 M i = 7→ − tobetheconvexcombinationofatmostsoutoftheseMfunctions.Thatis s s F α g :σ [M], j,α 0, α 1 =(j 1 j σj 1:s⊂ ∀ j ≥ j 1 j = ) X= X= 9 Forthisexamplenotethatthesequentialcoveringnumbercanbeeasilyupperbounded:wecanchoosesoutofM functionsin M waysandobservethatpointwisemetricentropyforconvexcombinationofsboundedfunctions s atscaleβisboundedasβ s.Weconcludethat ¡ ¢ − eM s N2(β,F,n) β−s ≤ s µ ¶ Fromthemaintheorem,forthecaseofsquareloss,theupperboundis slog(M/s) 1V O . n n≤ n µ ¶ Theextensiontootherlossfunctionsfollowsimmediatelyfromthegeneralstatements. 5.8 Besovspacesandsquareloss LetX beacompactsubsetofRd.LetF beaballinBesovspaceBs (X).Whens d/p,pointwisemetricentropy p,q > boundsatscaleβscalesasΩ(β d/s)[31,p. 20]. Ontheotherhand,whens d/p,andp 2,onecanshowthat − < > thespaceisap-uniformlyconvexBanachspace. From[26],itcanbeshownthatsequentialRademachercanbe upperboundedbyO(n1−1/p),yieldingaboundonminimaxrate. Thesetwocontrolstogethergivetheboundon theminimaxrate.ThegenericforecasterwithRademachercomplexityasrelaxation(seeSection6),enjoysthebest ofbothoftheserates.Morespecifically,wemayidentifythefollowingregimes: • Ifs≥d/2,theminimaxrateis n1Vn≤O n−2s2+sd . ³ ´ • Ifs d/2,theminimaxratedependsontheinteractionofpandd,s: < – ifp>ds,theminimaxrate n1Vn≤O n−ds , otherwise,therateis n1Vn≤O n−p1 ³ ´ ³ ´ 5.9 Remarks:Experts,Mixability,andDiscretization Theproblemofpredictionwithexpertadvicehasbeencentralintheonlinelearningliterature[9].Onecanphrase theexpertsprobleminoursettingbytakingafiniteclassF {f1,...,fN}offunctions.Itispossibletoensuresub- = linearregretbyfollowingthe“advice” fIt(xt)ofarandomlychosen“expert”It fromanappropriatedistribution overexperts. Therandomizedapproach,however,effectivelylinearizestheproblemanddoesnottakeadvantage ofthecurvatureoftheloss.Theprecisewayinwhichthelossentersthepicturehasbeeninvestigatedthoroughly byVovk[28](seealso[15]). Vovkdefinesamixabilitycurvethatparametrizesachievableregretofaformslightly differentthan(1).Specifically,Vovkallowsaconstantotherthan1infrontoftheinfimumintheregretdefinition. Suchregretboundsarecalled“inexactoracleinequalities”instatistics.Audibert[2]showsthatthemixabilitycon- ditiononthelossfunctionleadstoavariance-typeboundinhisgeneralPAC-basedformulation,yettheanalysis isrestrictedtothecaseoffiniteexperts.Whileitispossibletorepeattheanalysisinthepresentpaperwithacon- stantotherthan1infrontofthecomparator,thisgoesbeyondthescopeofthepaper.Importantly,ourtechniques gobeyondthefinitecaseandcangivecorrectregretboundsevenifdiscretizationtoafinitesetofexpertsyields vacuousbounds. Let us emphasize the abovepoint again by comparing the upper bound of Lemma 5 to the bound wemay obtainviaametricentropyapproach,asintheworkof[31].AssumethatF isacompactsubsetofC(X)equipped withsupremumnorm.Themetricentropy,denotedbyH(ǫ,F),isthelogarithmofthesmallestǫ-netwithrespect tothesupnormonX. Anaggregatingprocedureovertheelementsofthenetgivesanupper bound(omitting constantsandlogarithmicfactors) nǫ H(ǫ,F) (18) + onregret(1). Here,nǫistheamountwelosefromrestrictingtheattentiontotheǫ-net,andthesecondtermap- pearsfromaggregationoverafiniteset. Thebalance(18)failstocapturetheoptimalbehaviorforlargenonpara- p metricsetsoffunctions.Indeed,foranO(ǫ−p)behaviorofmetricentropy,VovkconcludestherateofO np+1 .For ³ ´ 10
