THEORY OF COMPUTING, Volume 8 (2012), pp. 121–164
www.theoryofcomputing.org

RESEARCH SURVEY

The Multiplicative Weights Update Method: A Meta-Algorithm and Applications

Sanjeev Arora*    Elad Hazan    Satyen Kale

Received: July 22, 2008; revised: July 2, 2011; published: May 1, 2012.

Abstract: Algorithms in varied fields use the idea of maintaining a distribution over a certain set and use the multiplicative update rule to iteratively change these weights. Their analyses are usually very similar and rely on an exponential potential function.

In this survey we present a simple meta-algorithm that unifies many of these disparate algorithms and derives them as simple instantiations of the meta-algorithm. We feel that since this meta-algorithm and its analysis are so simple, and its applications so broad, it should be a standard part of algorithms courses, like "divide and conquer."

ACM Classification: G.1.6
AMS Classification: 68Q25
Key words and phrases: algorithms, game theory, machine learning

* This project was supported by a David and Lucile Packard Fellowship and NSF grants MSPA-MCS 0528414 and CCR-0205594.

© 2012 Sanjeev Arora, Elad Hazan and Satyen Kale. Licensed under a Creative Commons Attribution License. DOI: 10.4086/toc.2012.v008a006

1 Introduction

The Multiplicative Weights (MW) method is a simple idea which has been repeatedly discovered in fields as diverse as Machine Learning, Optimization, and Game Theory. The setting for this algorithm is the following. A decision maker has a choice of n decisions, and needs to repeatedly make a decision and obtain an associated payoff. The decision maker's goal, in the long run, is to achieve a total payoff which is comparable to the payoff of that fixed decision that maximizes the total payoff with the benefit of hindsight. While this best decision may not be known a priori, it is still possible to achieve this goal by maintaining weights on the decisions, and choosing the decisions randomly with probability proportional to the weights. In each successive round, the weights are updated by multiplying them with factors which depend on the payoff of the associated decision in that round. Intuitively, this scheme works because it tends to focus higher weight on higher payoff decisions in the long run.

This idea lies at the core of a variety of algorithms. Some examples include: the AdaBoost algorithm in machine learning [26]; algorithms for game playing studied in economics (see references later); the Plotkin-Shmoys-Tardos algorithm for packing and covering LPs [56], and its improvements in the case of flow problems by Young [65], Garg-Könemann [29, 30], Fleischer [24] and others; methods for convex optimization like exponentiated gradient (mirror descent), Lagrangian multipliers, and subgradient methods; Impagliazzo's proof of the Yao XOR lemma [40], etc. The analysis of the running time uses a potential function argument and the final running time is proportional to $1/\varepsilon^2$.

It has been clear to several researchers that these results are very similar. For example, Khandekar's Ph.D. thesis [46] makes this point about the varied applications of this idea to convex optimization. The purpose of this survey is to clarify that many of these applications are instances of the same, more general algorithm (although several specialized applications, such as [53], require additional technical work). This meta-algorithm is very similar to the "Hedge" algorithm from learning theory [26]. Similar algorithms have been independently rediscovered in many other fields; see below. The advantage of deriving the above algorithms from the same meta-algorithm is that this highlights their commonalities as well as their differences. To give an example, the algorithms of Garg-Könemann [29, 30] were felt to be quite different from those of Plotkin-Shmoys-Tardos [56].
In our framework, they can be seen as a clever trick for "width reduction" for the Plotkin-Shmoys-Tardos algorithms (see Section 3.4).

We feel that this meta-algorithm and its analysis are simple and useful enough that they should be viewed as a basic tool taught to all algorithms students together with divide-and-conquer, dynamic programming, random sampling, and the like. Note that the multiplicative weights update rule may be seen as a "constructive" version of LP duality (equivalently, von Neumann's minimax theorem in game theory), and it gives a fairly concrete method for competing players to arrive at a solution/equilibrium (see Section 3.2). This may be an appealing feature in introductory algorithms courses, since the standard algorithms for LP such as simplex, ellipsoid, or interior point lack such a game-theoretic interpretation. Furthermore, it is a convenient stepping point to many other topics that rarely get mentioned in algorithms courses, including online algorithms (see the basic scenario in Section 1.1) and machine learning. Furthermore, our proofs seem easier and cleaner than the entropy-based proofs for the same results in machine learning (although the proof technique we use here has been used before; see for example Blum's survey [10]).

The current paper is chiefly a survey. It introduces the main algorithm, gives a few variants (mostly having to do with the range in which the payoffs lie), and surveys the most important applications, often with complete proofs. Note however that this survey does not cover all applications of the technique, as several of these require considerable additional technical work which is beyond the scope of this paper. We have provided pointers to some such applications which use the multiplicative weights technique at their core without going into more details. There are also a few small results that appear to be new, such as the variant of the Garg-Könemann algorithm in Section 3.4 and the lower bound in Section 4.

Related work. An algorithm similar in flavor to the Multiplicative Weights algorithm was proposed in game theory in the early 1950's [13, 12, 59]. Following Brown [12], this algorithm was called "Fictitious Play": at each step each player observes actions taken by his opponent in previous stages, updates his beliefs about his opponents' strategies, and chooses the pure best response against these beliefs. In the simplest case, the player simply assumes that the opponent is playing from a stationary distribution and sets his current belief of the opponent's distribution to be the empirical frequency of the strategies played by the opponent. This simple idea (which was shown to lead to optimal solutions in the limit in various cases) led to numerous developments in economics, including Arrow-Debreu General Equilibrium theory and, more recently, evolutionary game theory. Grigoriadis and Khachiyan [33] showed how a randomized variant of "Fictitious Play" can solve two-player zero-sum games efficiently. This algorithm is precisely the multiplicative weights algorithm. It can be viewed as a soft version of fictitious play, in which the player gives higher weight to the strategies which pay off better, and chooses her strategy using these weights rather than choosing the best response strategy.

In Machine Learning, the earliest form of the multiplicative weights update rule was used by Littlestone in his well-known Winnow algorithm [50, 51]. It is somewhat reminiscent of the older perceptron learning algorithm of Minsky and Papert [55]. The Winnow algorithm was generalized by Littlestone and Warmuth [52] in the form of the Weighted Majority algorithm, and later by Freund and Schapire in the form of the Hedge algorithm [26]. We note that most relevant papers in learning theory use an analysis that relies on entropy (or its cousin, Kullback-Leibler divergence) calculations.
This analysis is closely related to our analysis, but we use exponential functions instead of the logarithm, or entropy, used in those papers. The underlying calculation is the same: whereas we repeatedly use the fact that $e^x \approx 1 + x$ when $|x|$ is small, they use the fact that $\ln(1+x) \approx x$. We feel that our approach is cleaner (although the entropy-based approach yields somewhat tighter bounds that are useful in some applications; see Section 2.2).

Other applications of the multiplicative weights algorithm in computational geometry include Clarkson's algorithm for linear programming with a bounded number of variables in linear time [20, 21]. Following Clarkson, Brönnimann and Goodrich use similar methods to find Set Covers for hypergraphs with small VC dimension [11].

The weighted majority algorithm as well as more sophisticated versions have been independently discovered in operations research and statistical decision making in the context of the On-line decision problem; see the surveys of Cover [22], Foster and Vohra [25], and also Blum [10], who includes applications of weighted majority to machine learning. A notable algorithm, which is different from but related to our framework, was developed by Hannan in the 1950's [34]. Kalai and Vempala showed how to derive efficient algorithms via methods similar to Hannan's [43]. We show how Hannan's algorithm with the appropriate choice of parameters yields the multiplicative update decision rule in Section 3.8.

Within computer science, several researchers have previously noted the close relationships between multiplicative update algorithms used in different contexts. Young [65] notes the connection between fast LP algorithms and Raghavan's method of pessimistic estimators for derandomization of randomized rounding algorithms; see our Section 3.5. Klivans and Servedio [49] relate boosting algorithms in learning theory to proofs of Yao's XOR Lemma; see our Section 3.6. Garg and Khandekar [28] describe a common framework for convex optimization problems that contains Garg-Könemann and Plotkin-Shmoys-Tardos as subcases.

To the best of our knowledge our framework is the most general and, arguably, the simplest. We readily acknowledge the influence of all previous papers (especially Young [65] and Freund-Schapire [27]) on the development of our framework. We emphasize again that we do not claim that every algorithm designed using the multiplicative update idea fits in our framework, just that most do.

Paper organization. We proceed to define the illustrative weighted majority algorithm in this section. In Section 2 we describe the general MW meta-algorithm, followed by numerous and varied applications in Section 3. In Section 4 we give lower bounds, followed by the more general matrix MW algorithm in Section 5.

1.1 The weighted majority algorithm

Now we briefly illustrate the weighted majority algorithm in a simple and concrete setting, which will naturally lead to our generalized meta-algorithm. This is known as the Prediction from Expert Advice problem. Imagine the process of picking good times to invest in a stock. For simplicity, assume that there is a single stock of interest, and its daily price movement is modeled as a sequence of binary events: up/down. (Below, this will be generalized to allow non-binary events.) Each morning we try to predict whether the price will go up or down that day; if our prediction happens to be wrong we lose a dollar that day, and if it is correct, we lose nothing.

The stock movements can be arbitrary and even adversarial. To balance out this pessimistic assumption, we assume that while making our predictions, we are allowed to watch the predictions of n "experts." These experts could be arbitrarily correlated, and they may or may not know what they are talking about.
The algorithm's goal is to limit its cumulative losses (i.e., bad predictions) to roughly the same as the best of these experts. At first sight this seems an impossible goal, since it is not known until the end of the sequence who the best expert was, whereas the algorithm is required to make predictions all along. Indeed, the first algorithm one thinks of is to compute each day's up/down prediction by going with the majority opinion among the experts that day. But this algorithm doesn't work because a majority of experts may be consistently wrong on every single day.

The weighted majority algorithm corrects the trivial algorithm. It maintains a weighting of the experts. Initially all have equal weight. As time goes on, some experts are seen as making better predictions than others, and the algorithm increases their weight proportionately. The algorithm's prediction of up/down for each day is computed by going with the opinion of the weighted majority of the experts for that day.

Theorem 1.1. After T steps, let $m_i^{(T)}$ be the number of mistakes of expert i and $M^{(T)}$ be the number of mistakes our algorithm has made. Then we have the following bound for every i:
$$M^{(T)} \;\le\; 2(1+\eta)\, m_i^{(T)} + \frac{2\ln n}{\eta}.$$
In particular, this holds for the i which is the best expert, i.e., the one having the least $m_i^{(T)}$.

Weighted majority algorithm

Initialization: Fix an $\eta \le \frac{1}{2}$. With each expert i, associate the weight $w_i^{(1)} := 1$.
For t = 1, 2, ..., T:

1. Make the prediction that is the weighted majority of the experts' predictions based on the weights $w_1^{(t)}, \dots, w_n^{(t)}$. That is, predict "up" or "down" depending on which prediction has a higher total weight of experts advising it (breaking ties arbitrarily).

2. For every expert i who predicts wrongly, decrease his weight for the next round by multiplying it by a factor of $(1-\eta)$:
$$w_i^{(t+1)} \;=\; (1-\eta)\, w_i^{(t)} \qquad \text{(update rule).} \tag{1.1}$$

Remark. When $m_i^{(T)} \gg (2/\eta)\ln n$ we see that the number of mistakes made by the algorithm is bounded from above by roughly $2(1+\eta)\, m_i^{(T)}$, i.e., approximately twice the number of mistakes made by the best expert. This is tight for any deterministic algorithm. However, the factor of 2 can be removed by substituting the above deterministic algorithm by a randomized algorithm that predicts according to the majority opinion with probability proportional to its weight. (In other words, if the total weight of the experts saying "up" is 3/4 then the algorithm predicts "up" with probability 3/4 and "down" with probability 1/4.) Then the number of mistakes after T steps is a random variable and the claimed upper bound holds for its expectation (see Section 2 for more details).

Proof. A simple induction shows that $w_i^{(t+1)} = (1-\eta)^{m_i^{(t)}}$. Let $\Phi^{(t)} = \sum_i w_i^{(t)}$ ("the potential function"). Thus $\Phi^{(1)} = n$. Each time we make a mistake, the weighted majority of experts also made a mistake, so at least half the total weight decreases by a factor $1-\eta$. Thus, the potential function decreases by a factor of at least $(1-\eta/2)$:
$$\Phi^{(t+1)} \;\le\; \Phi^{(t)}\left(\frac{1}{2} + \frac{1}{2}(1-\eta)\right) \;=\; \Phi^{(t)}(1-\eta/2).$$
Thus simple induction gives $\Phi^{(T+1)} \le n\,(1-\eta/2)^{M^{(T)}}$. Finally, since $\Phi^{(T+1)} \ge w_i^{(T+1)}$ for all i, the claimed bound follows by comparing the above two expressions and using the fact that $-\ln(1-\eta) \le \eta + \eta^2$ since $\eta < 1/2$.

The beauty of this analysis is that it makes no assumption about the sequence of events: they could be arbitrarily correlated and could even depend upon our current weighting of the experts. In this sense, this algorithm delivers more than initially promised, and this lies at the root of why (after obvious generalization) it can give rise to the diverse algorithms mentioned earlier. In particular, the scenario where the events are chosen adversarially resembles a zero-sum game, which we consider later in Section 3.2.
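For concreteness, here is a minimal Python sketch of the deterministic weighted majority algorithm above (the randomized variant described in the remark differs only in how the prediction is chosen). The array-based interface and the names expert_predictions and outcomes are our own illustrative choices, not part of the paper.

```python
import numpy as np

def weighted_majority(expert_predictions, outcomes, eta=0.3):
    """Minimal sketch of the deterministic weighted majority algorithm.

    expert_predictions: (T, n) array of 0/1 predictions (1 = "up").
    outcomes:           length-T array of realized 0/1 events.
    Returns (algorithm_mistakes, per_expert_mistakes)."""
    expert_predictions = np.asarray(expert_predictions)
    outcomes = np.asarray(outcomes)
    T, n = expert_predictions.shape
    w = np.ones(n)                          # w_i^(1) = 1 for every expert
    mistakes = 0
    expert_mistakes = np.zeros(n)
    for t in range(T):
        preds = expert_predictions[t]
        up_weight = w[preds == 1].sum()
        down_weight = w[preds == 0].sum()
        prediction = 1 if up_weight >= down_weight else 0   # weighted majority vote, ties broken arbitrarily
        mistakes += int(prediction != outcomes[t])
        wrong = preds != outcomes[t]
        expert_mistakes += wrong
        w[wrong] *= (1.0 - eta)             # update rule (1.1)
    return mistakes, expert_mistakes
```

By Theorem 1.1, on any input sequence the returned mistake count is at most roughly $2(1+\eta)\min_i m_i^{(T)} + (2\ln n)/\eta$, where $m_i^{(T)}$ are the per-expert mistake counts.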
2 The Multiplicative Weights algorithm

In the general setting, we have a set of n decisions and in each round we are required to select one decision from the set. In each round, each decision incurs a certain cost, determined by nature or an adversary. All the costs are revealed after we choose our decision, and we incur the cost of the decision we chose. For example, in the prediction from expert advice problem, each decision corresponds to a choice of an expert, and the cost of an expert is 1 if the expert makes a mistake, and 0 otherwise.

To motivate the Multiplicative Weights (MW) algorithm, consider the naïve strategy that, in each iteration, simply picks a decision at random. The expected penalty will be that of the "average" decision. Suppose now that a few decisions are clearly better in the long run. This is easy to spot as the costs are revealed over time, and so it is sensible to reward them by increasing their probability of being picked in the next round (hence the multiplicative weight update rule).

Intuitively, being in complete ignorance about the decisions at the outset, we select them uniformly at random. This maximum entropy starting rule reflects our ignorance. As we learn which ones are the good decisions and which ones are bad, we lower the entropy to reflect our increased knowledge. The multiplicative weight update is our means of skewing the distribution.

We now set up some notation. Let t = 1, 2, ..., T denote the current round, and let i be a generic decision. In each round t, we select a distribution $p^{(t)}$ over the set of decisions, and select a decision i randomly from it. At this point, the costs of all the decisions are revealed by nature in the form of the vector $m^{(t)}$ such that decision i incurs cost $m_i^{(t)}$. We assume that the costs lie in the range $[-1, 1]$. This is the only assumption we make on the costs; nature is completely free to choose the cost vector as long as these bounds are respected, even with full knowledge of the distribution that we choose our decision from.

The expected cost to the algorithm for sampling a decision i from the distribution $p^{(t)}$ is
$$\mathbf{E}_{i \in p^{(t)}}\bigl[m_i^{(t)}\bigr] \;=\; m^{(t)} \cdot p^{(t)}.$$
The total expected cost over all rounds is therefore $\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)}$. Just as before, our goal is to design an algorithm which achieves a total expected cost not too much more than the cost of the best decision in hindsight, viz. $\min_i \sum_{t=1}^{T} m_i^{(t)}$. Consider the following algorithm, which we call the Multiplicative Weights Algorithm. This algorithm has been studied before as the prod algorithm of Cesa-Bianchi, Mansour, and Stoltz [17], and Theorem 2.1 can be seen to follow from Lemma 2 in [17].

The following theorem, completely analogous to Theorem 1.1, bounds the total expected cost of the Multiplicative Weights algorithm (given in Figure 1) in terms of the total cost of the best decision:

Theorem 2.1. Assume that all costs $m_i^{(t)} \in [-1,1]$ and $\eta \le 1/2$. Then the Multiplicative Weights algorithm guarantees that after T rounds, for any decision i, we have
$$\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \;\le\; \sum_{t=1}^{T} m_i^{(t)} + \eta \sum_{t=1}^{T} \bigl|m_i^{(t)}\bigr| + \frac{\ln n}{\eta}.$$

Multiplicative Weights algorithm

Initialization: Fix an $\eta \le \frac{1}{2}$. With each decision i, associate the weight $w_i^{(1)} := 1$.
For t = 1, 2, ..., T:

1. Choose decision i with probability proportional to its weight $w_i^{(t)}$. I.e., use the distribution over decisions $p^{(t)} = \{w_1^{(t)}/\Phi^{(t)}, \dots, w_n^{(t)}/\Phi^{(t)}\}$ where $\Phi^{(t)} = \sum_i w_i^{(t)}$.

2. Observe the costs of the decisions $m^{(t)}$.

3. Penalize the costly decisions by updating their weights as follows: for every decision i, set
$$w_i^{(t+1)} \;=\; w_i^{(t)}\bigl(1 - \eta\, m_i^{(t)}\bigr). \tag{2.1}$$

Figure 1: The Multiplicative Weights algorithm.
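To make Figure 1 concrete, here is a minimal Python sketch of the meta-algorithm. The callback interface (a cost_fn that may inspect the current distribution, since nature is allowed to be adversarial) is our own illustrative choice, not part of the paper; an actual decision would additionally be sampled from $p^{(t)}$ each round.

```python
import numpy as np

def multiplicative_weights(cost_fn, n, T, eta=0.1):
    """Minimal sketch of the MW algorithm of Figure 1.

    cost_fn(t, p) must return the cost vector m^(t) in [-1, 1]^n; it may
    depend on the current distribution p, since nature may be adversarial.
    Returns the total expected cost sum_t m^(t) . p^(t), the quantity
    bounded by Theorem 2.1."""
    w = np.ones(n)                              # w_i^(1) = 1
    total_expected_cost = 0.0
    for t in range(1, T + 1):
        p = w / w.sum()                         # p^(t) = w^(t) / Phi^(t)
        m = np.clip(cost_fn(t, p), -1.0, 1.0)   # costs revealed after committing to p^(t)
        total_expected_cost += float(m @ p)     # expected cost of sampling i ~ p^(t)
        w = w * (1.0 - eta * m)                 # update rule (2.1)
    return total_expected_cost
```

With 0/1 costs given by expert mistakes, this specializes to the randomized prediction-from-expert-advice setting of Section 1.1.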
Proof. The proof is along the lines of the earlier one, using the potential function $\Phi^{(t)} = \sum_i w_i^{(t)}$:
$$\begin{aligned}
\Phi^{(t+1)} &= \sum_i w_i^{(t+1)} \;=\; \sum_i w_i^{(t)}\bigl(1 - \eta\, m_i^{(t)}\bigr) \\
&= \Phi^{(t)} - \eta\,\Phi^{(t)} \sum_i m_i^{(t)} p_i^{(t)} \\
&= \Phi^{(t)}\bigl(1 - \eta\, m^{(t)} \cdot p^{(t)}\bigr) \\
&\le \Phi^{(t)} \exp\bigl(-\eta\, m^{(t)} \cdot p^{(t)}\bigr).
\end{aligned}$$
Here, we used the fact that $p_i^{(t)} = w_i^{(t)}/\Phi^{(t)}$. Thus, by induction, after T rounds, we have
$$\Phi^{(T+1)} \;\le\; \Phi^{(1)} \exp\left(-\eta \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)}\right) \;=\; n \cdot \exp\left(-\eta \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)}\right). \tag{2.2}$$
Next we use the following facts, which follow immediately from the convexity of the exponential function:
$$(1-\eta)^x \le (1 - \eta x) \;\; \text{if } x \in [0,1], \qquad\qquad (1+\eta)^{-x} \le (1 - \eta x) \;\; \text{if } x \in [-1,0]. \tag{2.3}$$
Since $m_i^{(t)} \in [-1,1]$, we have for every decision i,
$$\Phi^{(T+1)} \;\ge\; w_i^{(T+1)} \;=\; \prod_{t \le T}\bigl(1 - \eta\, m_i^{(t)}\bigr) \;\ge\; (1-\eta)^{\sum_{\ge 0} m_i^{(t)}} \cdot (1+\eta)^{-\sum_{<0} m_i^{(t)}}, \tag{2.4}$$
where the subscripts "$\ge 0$" and "$< 0$" in the summations refer to the rounds t where $m_i^{(t)}$ is $\ge 0$ and $< 0$ respectively. Taking logarithms in equations (2.2) and (2.4) we get:
$$\ln n - \eta \sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \;\ge\; \sum_{\ge 0} m_i^{(t)} \ln(1-\eta) \,-\, \sum_{<0} m_i^{(t)} \ln(1+\eta).$$
Negating, rearranging, and scaling by $1/\eta$:
$$\begin{aligned}
\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} &\le \frac{\ln n}{\eta} + \frac{1}{\eta} \sum_{\ge 0} m_i^{(t)} \ln\frac{1}{1-\eta} + \frac{1}{\eta} \sum_{<0} m_i^{(t)} \ln(1+\eta) \\
&\le \frac{\ln n}{\eta} + \frac{1}{\eta} \sum_{\ge 0} m_i^{(t)}(\eta + \eta^2) + \frac{1}{\eta} \sum_{<0} m_i^{(t)}(\eta - \eta^2) \\
&= \frac{\ln n}{\eta} + \sum_{t=1}^{T} m_i^{(t)} + \eta \sum_{\ge 0} m_i^{(t)} - \eta \sum_{<0} m_i^{(t)} \\
&= \frac{\ln n}{\eta} + \sum_{t=1}^{T} m_i^{(t)} + \eta \sum_{t=1}^{T} \bigl|m_i^{(t)}\bigr|.
\end{aligned}$$
In the second inequality we used the facts that
$$\ln\left(\frac{1}{1-\eta}\right) \le \eta + \eta^2 \qquad \text{and} \qquad \ln(1+\eta) \ge \eta - \eta^2 \tag{2.5}$$
for $\eta \le 1/2$.

Corollary 2.2. The Multiplicative Weights algorithm also guarantees that after T rounds, for any distribution p on the decisions,
$$\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \;\le\; \sum_{t=1}^{T} \bigl(m^{(t)} + \eta\,|m^{(t)}|\bigr) \cdot p + \frac{\ln n}{\eta},$$
where $|m^{(t)}|$ is the vector obtained by taking the coordinate-wise absolute value of $m^{(t)}$.

Proof. This corollary follows immediately from Theorem 2.1, by taking a convex combination of the inequalities for all decisions i with the distribution p.

2.1 Updating with exponential factors: the Hedge algorithm

In our description of the MW algorithm, the update rule uses multiplication by a linear function of the cost (specifically, $(1 - \eta\, m_i^{(t)})$ for expert i). In several other incarnations of the MW algorithm, notably the Hedge algorithm of Freund and Schapire [26], an exponential factor is used instead. This update rule is the following:
$$w_i^{(t+1)} \;=\; w_i^{(t)} \cdot \exp\bigl(-\eta\, m_i^{(t)}\bigr). \tag{2.6}$$
As can be seen from the analysis of the MW algorithm, Hedge is not very different. The bound we obtain for Hedge is slightly different, however. While most of the applications we present in the rest of the paper can be derived using Hedge as well with a little extra calculation, some applications, such as the ones in Sections 3.3 and 3.5, explicitly need the MW algorithm rather than Hedge to obtain better bounds. Here, we state the bound obtained for Hedge without proof; the analysis is on the same lines as before. The only difference is that instead of the inequalities (2.3), we use the inequality
$$\exp(-\eta x) \;\le\; 1 - \eta x + \eta^2 x^2 \quad \text{if } |\eta x| \le 1.$$

Theorem 2.3. Assume that all costs $m_i^{(t)} \in [-1,1]$ and $\eta \le 1$. Then the Hedge algorithm guarantees that after T rounds, for any decision i, we have
$$\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \;\le\; \sum_{t=1}^{T} m_i^{(t)} + \eta \sum_{t=1}^{T} \bigl(m^{(t)}\bigr)^2 \cdot p^{(t)} + \frac{\ln n}{\eta}.$$
Here, $(m^{(t)})^2$ is the vector obtained by taking the coordinate-wise square of $m^{(t)}$.

This guarantee is very similar to the one in Theorem 2.1, with one important difference: the term multiplying $\eta$ is a loss which depends on the algorithm's distribution. In Theorem 2.1, this additional term depends on the loss of the best decision in hindsight. For some applications the latter guarantee is stronger (see Section 3.3).
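As a small illustration of how little changes, the following sketch (same hypothetical cost_fn interface as the MW sketch above) swaps the linear update (2.1) for the exponential update (2.6).

```python
import numpy as np

def hedge(cost_fn, n, T, eta=0.1):
    """Sketch of Hedge: identical to the MW sketch above except that the
    linear update (2.1) is replaced by the exponential update (2.6)."""
    w = np.ones(n)
    total_expected_cost = 0.0
    for t in range(1, T + 1):
        p = w / w.sum()
        m = np.clip(cost_fn(t, p), -1.0, 1.0)
        total_expected_cost += float(m @ p)
        w = w * np.exp(-eta * m)                # update rule (2.6)
    return total_expected_cost
```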
2.2 Proof via KL-divergence

In this section, we give an alternative proof of Theorem 2.1 based on the Kullback-Leibler (KL) divergence, or relative entropy. While this proof is somewhat more complicated, it gives a good insight into why the MW algorithm works: the reason is that it tends to reduce the KL-divergence to the optimal solution. Another reason for giving this proof is that it yields a more nuanced form of the MW algorithm that is useful in some applications (such as the construction of hard-core sets, see Section 3.7). Readers may skip this section without loss of continuity.

For two distributions p and q on the decision set, the relative entropy between them is
$$\mathrm{RE}(p \,\|\, q) \;=\; \sum_i p_i \ln\frac{p_i}{q_i},$$
where the term $p_i \ln(p_i/q_i)$ is defined to be zero if $p_i = 0$ and infinite if $p_i \ne 0$, $q_i = 0$.

Consider the following twist on the basic decision-making problem from Section 2. Fix a convex subset P of distributions over decisions (note: the basic setting is recovered when P is the set of all distributions). In each round t, the decision-maker is required to produce a distribution $p^{(t)} \in P$. At that point, the cost vector $m^{(t)}$ is revealed and the decision-maker suffers cost $m^{(t)} \cdot p^{(t)}$. Since we make the restriction that $p^{(t)} \in P$, we now want to compare the total cost of the decision-maker to the cost of the best fixed distribution in P. Consider the algorithm in Figure 2.

Multiplicative Weights Update algorithm with Restricted Distributions

Initialization: Fix an $\eta \le \frac{1}{2}$. Set $p^{(1)}$ to be an arbitrary distribution in P.
For t = 1, 2, ..., T:

1. Choose decision i by sampling from $p^{(t)}$.

2. Observe the costs of the decisions $m^{(t)}$.

3. Compute the probability vector $\hat{p}^{(t+1)}$ using the usual multiplicative update rule: for every decision i,
$$\hat{p}_i^{(t+1)} \;=\; p_i^{(t)}\bigl(1 - \eta\, m_i^{(t)}\bigr)/\Phi^{(t)}, \tag{2.7}$$
where $\Phi^{(t)}$ is the normalization factor that makes $\hat{p}^{(t+1)}$ a distribution.

4. Set $p^{(t+1)}$ to be the relative entropy projection of $\hat{p}^{(t+1)}$ on the set P, i.e.,
$$p^{(t+1)} \;=\; \arg\min_{p \in P} \mathrm{RE}(p \,\|\, \hat{p}^{(t+1)}).$$

Figure 2: The Multiplicative Weights algorithm with Restricted Distributions.

Note that in the special case when P is the set of all distributions on the decisions, this algorithm is exactly the basic MW algorithm presented in Figure 1. The relative entropy projection step ensures that we always choose a distribution in P. This projection is a convex program since relative entropy is convex and P is a convex set, and hence can be computed using standard convex programming techniques.

We now prove a bound on the total cost of the algorithm (compare to Corollary 2.2). Note that in the basic setting when P is the set of all distributions, the bound given below is tighter than the one in Theorem 2.1.

Theorem 2.4. Assume that all costs $m_i^{(t)} \in [-1,1]$ and $\eta \le 1/2$. Then the Multiplicative Weights algorithm with Restricted Distributions guarantees that after T rounds, for any $p \in P$, we have
$$\sum_{t=1}^{T} m^{(t)} \cdot p^{(t)} \;\le\; \sum_{t=1}^{T} \bigl(m^{(t)} + \eta\,|m^{(t)}|\bigr) \cdot p + \frac{\mathrm{RE}(p \,\|\, p^{(1)})}{\eta},$$
where $|m^{(t)}|$ is the vector obtained by taking the coordinate-wise absolute value of $m^{(t)}$.

Proof. We use the relative entropy between p and $p^{(t)}$, $\mathrm{RE}(p \,\|\, p^{(t)}) := \sum_i p_i \ln(p_i/p_i^{(t)})$, as a "potential" function. We have
$$\begin{aligned}
\mathrm{RE}(p \,\|\, \hat{p}^{(t+1)}) - \mathrm{RE}(p \,\|\, p^{(t)}) &= \sum_i p_i \ln\frac{p_i^{(t)}}{\hat{p}_i^{(t+1)}} \;=\; \sum_i p_i \ln\frac{\Phi^{(t)}}{1 - \eta\, m_i^{(t)}} \\
&\le \ln\frac{1}{1-\eta} \sum_{\ge 0} p_i\, m_i^{(t)} + \ln(1+\eta) \sum_{<0} p_i\, m_i^{(t)} + \ln\Phi^{(t)} \\
&\le \eta\bigl(m^{(t)} + \eta\,|m^{(t)}|\bigr) \cdot p + \ln\Phi^{(t)}.
\end{aligned}$$
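Figure 2 leaves the projection step abstract, since it depends on the chosen set P. As one hedged illustration, here is a Python sketch for the particular case where P consists of distributions capped coordinate-wise by some $c \ge 1/n$; the closed form $p_i = \min(c, q_i/Z)$ for the relative-entropy projection onto this set, and the bisection used to find $Z$, are our own illustrative derivation rather than anything stated in the paper.

```python
import numpy as np

def project_capped(q, c, iters=100):
    """Relative-entropy projection of a strictly positive distribution q onto
    P = { p : sum_i p_i = 1, 0 <= p_i <= c }, which requires c >= 1/n.
    KKT conditions suggest p_i = min(c, q_i / Z) for a scalar Z > 0, found here
    by bisection (this closed form is our own derivation, not from the paper)."""
    lo, hi = 1e-12, 1.0
    for _ in range(iters):                    # sum_i min(c, q_i/Z) is decreasing in Z
        Z = 0.5 * (lo + hi)
        if np.minimum(c, q / Z).sum() > 1.0:
            lo = Z                            # Z too small: capped sum still exceeds 1
        else:
            hi = Z
    p = np.minimum(c, q / hi)
    return p / p.sum()                        # renormalize away residual bisection error

def mw_restricted(cost_fn, n, T, c, eta=0.1):
    """Sketch of Figure 2 for the capped set P above; cost_fn is the same
    hypothetical callback as in the earlier sketches."""
    assert c >= 1.0 / n
    p = np.full(n, 1.0 / n)                   # p^(1): uniform, which lies in P
    total_expected_cost = 0.0
    for t in range(1, T + 1):
        m = np.clip(cost_fn(t, p), -1.0, 1.0)
        total_expected_cost += float(m @ p)
        q = p * (1.0 - eta * m)               # update rule (2.7) ...
        q = q / q.sum()                       # ... normalized by Phi^(t)
        p = project_capped(q, c)              # projection step of Figure 2
    return total_expected_cost
```

Since $\eta \le 1/2$ and costs lie in $[-1,1]$, the pre-projection vector stays strictly positive, so the projection is well defined in this sketch.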
