Anytime many-armed bandits

Olivier Teytaud, Sylvain Gelly and Michele Sebag
Equipe TAO (Inria), LRI, UMR 8623 (CNRS - Université Paris-Sud),
bât. 490, Université Paris-Sud, 91405 Orsay Cedex, France
[email protected], [email protected], [email protected]

Abstract: This paper introduces the many-armed bandit problem (ManAB), where the number of arms is large compared to the relevant number of time steps. While the ManAB framework is relevant to many real-world applications, the state of the art does not offer anytime algorithms handling ManAB problems. Both theory and practice suggest that two problem categories must be distinguished; the easy category includes those problems where good arms have a reward probability close to 1; the difficult category includes all other problems. Two algorithms termed FAILURE and MUCBT are proposed for the ManAB framework. FAILURE and its variants extend the non-anytime approach proposed for the denumerable-armed bandit, and non-asymptotic bounds are shown; they work very efficiently on easy ManAB problems. Meanwhile, MUCBT efficiently deals with difficult ManAB problems.

Keywords: bandits, exploration versus exploitation.

1 Introduction

One mainstream paradigm for online learning is the multi-armed bandit, formulated by Lai & Robbins (1985): given n bandit arms with (unknown) reward probabilities p_i, in each time step t the player selects an arm j and receives a reward r_t, where r_t = 1 with probability p_j and r_t = 0 otherwise. The goal is to maximize the cumulated reward gathered over all time steps, or equivalently to minimize the loss incurred compared to the best strategy (playing the arm with maximal reward probability in each time step), referred to as the regret.

Such optimization problems can be solved exactly using dynamic programming when the number N of time steps is known in advance, as shown by Bellman (1957) and Bertsekas (1995). Currently, the multi-armed bandit literature focuses on anytime algorithms (N is not known beforehand), with good asymptotic bounds on the regret and a computational cost far below that of the prohibitive dynamic programming approach.

Devised by Auer et al. (2001), the so-called Upper Confidence Bound (UCB) algorithms enforce an optimal asymptotic bound on the regret (in O(log(N))) in the stationary case. The non-stationary case has also been studied by Kocsis & Szepesvari (2005) and Hussain et al. (2006), respectively considering the adversarial case and abruptly changing environments. Also, Kocsis & Szepesvari (2006) have extended UCB to the case of tree-structured arms, defining the UCT algorithm.

This paper focuses on the case of many-armed bandits (ManAB), where the number of arms is large relatively to the relevant number N of time steps, or relevant horizon (Banks & Sundaram (1992); Agrawal (1995); Dani & Hayes (2006)). Specifically, we assume in the rest of the paper that N is not known in advance and that N is at most of the order of n². It is claimed that the ManAB setting is relevant to many potential applications of online learning. For instance, when Wang & Gelly (2007) adapt UCT to build an automatic Go player, the deep regions of the UCT tree can only be frugally explored, while they involve an exponential number of moves. Applications such as labor markets, votes, consumer choice, dating, resource mining, drug testing (Berry et al. (1997)), feature selection and active learning (Cesa-Bianchi & Lugosi (2006)) also involve a number of options which is large compared to the relevant horizon.

The state of the art does not address the anytime ManAB problem (more on this in Section 2). On the one hand, UCB algorithms boil down to uniform sampling with no replacement when the number of arms is large compared to the number of time steps. On the other hand, the failure-based approaches dealing with the denumerable-armed bandit with good convergence rates (Berry et al. (1997), Section 2.2) require both (i) prior knowledge on the reward distribution and (ii) the number N of time steps to be known in advance. They provide highly suboptimal strategies when N is badly guessed or when the p_i are far from 1.
As a first contribution, this paper extends the failure-based algorithms devised for the denumerable-armed bandit by Berry et al. (1997) to the anytime framework, preserving their good convergence properties. The resulting algorithm, termed FAILURE, however suffers from the same limitations as all failure-based algorithms when the highest reward probabilities are well below 1. Therefore, two settings are distinguished: the Easy ManAB (EManAB), where the reward probabilities p_i are uniformly and independently distributed in [0,1], and the Difficult ManAB (DManAB), where the p_i are uniformly distributed in [0,ε] with ε < 1. It must be emphasized that the DManAB setting is relevant to real-world applications; e.g. in the News Recommendation application (Hussain et al. (2006)), the optimal reward probabilities might be significantly less than one.

We thus propose a second algorithm, referred to as MUCBT and inspired from the Meta-Bandit approach first described by Hartland et al. (2006). While MUCBT robustly and efficiently deals with all ManAB problems, including the difficult ones, it is outperformed by FAILURE on Easy ManAB problems.

This paper is organized as follows. Section 2 briefly introduces the multi-armed bandit background, presenting the UCB algorithms (Auer et al. (2001)) and the failure-based algorithms devised for the denumerable-armed bandit in the non-anytime case (Berry et al. (1997)). Section 3 extends the failure-based algorithms to the anytime framework, considering both the easy and the difficult ManAB settings. Section 4 presents the MUCBT algorithms specifically devised for the Difficult ManAB problems. Section 5 reports on the comparative validation of the presented algorithms against the state of the art, and discusses which algorithms are best suited to which settings. The paper concludes with some perspectives for further research.

2 State of the art

This section briefly introduces the notations and the UCB algorithms, referring the reader to Auer et al. (2002) for a comprehensive presentation. The state of the art related to infinitely many-armed bandits is then presented.

2.1 Anytime MAB

A multi-armed bandit involves n arms, where the i-th arm is characterized by its reward probability p_i. In each time step t, the player or the algorithm selects some arm j = a_t; with probability p_j it gets reward r_t = 1, otherwise r_t = 0. The loss after N time steps, or regret, is defined as N p^* − \sum_{t=1}^{N} r_t, where p^* is the maximal reward probability among p_1, ..., p_n.

Two indicators are maintained for each arm i: the number of times it has been played up to time t, noted n_{i,t}, and the corresponding average reward, noted \hat{p}_{i,t}. Subscript t is omitted when clear from the context.

The so-called UCB1 algorithm selects in each time step t the arm maximizing an exploration versus exploitation tradeoff:

    \hat{p}_{j,t} + \sqrt{ 2 \log(\sum_k n_{k,t}) / n_{j,t} }

The first term \hat{p}_{j,t} clearly reflects the exploitation bias (select the arm with optimal average reward); the second term (select arms which have been played exponentially rarely) corresponds to the exploration bias. The asymptotic bound on the regret of UCB1 is O(log(N)), where N is the number of time steps, which is known to be optimal after Lai & Robbins (1985).

The key point is to adjust the exploration strength. Several variants have been proposed, summarized in Table 1.

                 Exploitation        Exploration
    UCB1         \hat{p}_{j,t}       \sqrt{ 2 \log(\sum_k n_{k,t}) / n_{j,t} }
    UCB-Tuned    \hat{p}_{j,t}       \log(\sum_k n_{k,t}) / n_{j,t} + \sqrt{ v_{j,t} \log(\sum_k n_{k,t}) / n_{j,t} }
    KUCBT        \hat{p}_{j,t}       \sqrt{ v_{j,t} \log(\sum_k n_{k,t}) / n_{j,t} }
    cUCB         \hat{p}_{j,t}       \sqrt{ c \log(\sum_k n_{k,t}) / n_{j,t} }

TAB. 1 – UCB algorithms: the score of arm j at time t is the sum of the exploitation and exploration terms. Here v_{j,t} denotes the maximum between 0.01 and the empirical variance of the reward of the j-th arm; the max with 0.01 avoids a null estimated variance, which could lead to definitively rejecting an arm.
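To make the selection rules of Table 1 concrete, the following minimal Python sketch (not part of the original paper) computes the four scores; the function names, the argument layout and the convention of returning +∞ for unplayed arms are our own assumptions.

```python
import math

def ucb_score(j, n, w, v, variant="UCB1", c=1.0):
    """Score of arm j under the variants of Table 1.
    n[j]: number of plays of arm j; w[j]: cumulated reward of arm j;
    v[j]: max(0.01, empirical reward variance of arm j)."""
    if n[j] == 0:
        return float("inf")              # assumed convention: unplayed arms are tried first
    total = sum(n)                       # overall number of plays, sum_k n_{k,t}
    p_hat = w[j] / n[j]                  # exploitation term \hat{p}_{j,t}
    log_term = math.log(total)
    if variant == "UCB1":
        explo = math.sqrt(2.0 * log_term / n[j])
    elif variant == "UCB-Tuned":
        explo = log_term / n[j] + math.sqrt(v[j] * log_term / n[j])
    elif variant == "KUCBT":
        explo = math.sqrt(v[j] * log_term / n[j])
    elif variant == "cUCB":
        explo = math.sqrt(c * log_term / n[j])
    else:
        raise ValueError(variant)
    return p_hat + explo

def select_arm(n, w, v, variant="UCB1", c=1.0):
    """Play the arm maximizing exploitation + exploration."""
    return max(range(len(n)), key=lambda j: ucb_score(j, n, w, v, variant, c))
```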
Based on intuitively satisfactory ideas and taking into account the empirical variance of the reward associated to each arm, the UCB-Tuned (UCBT) algorithm proposed by Auer et al. (2002) often outperforms UCB1, though with no formal proof of improvement. KUCBT is similar to UCBT but without the non-variance-based term \log(\sum_k n_{k,t}) / n_{j,t}. We additionally consider the cUCB variant, using an explicit constant c to bias the selection toward exploitation, which is more appropriate when the number of arms increases and the time horizon decreases.

2.2 Denumerable-Armed Bandits

The case of denumerable-armed bandits (DAB) has been studied by Berry et al. (1997), establishing an upper bound 2\sqrt{N} on the regret when the number N of time steps is known (non-anytime setting).

Berry et al. (1997) introduce several algorithms:
– The k-failure strategy. When the current arm fails for k successive time steps, this arm is never tested again and a new arm is selected. While this strategy converges toward the optimal success rate, the regret is O(N/\log N), i.e. the average regret per step decreases only slowly with N.
– The α-rate strategy. When the average reward of the current arm falls below some threshold α < 1, a new arm is selected. This strategy does not necessarily converge toward the optimal success rate.
– The m-run strategy. This strategy first runs the 1-failure strategy until either the m-th arm is selected or a sequence of m wins occurs; at this point, the m-run strategy plays the arm with best average reward until the end of the N steps. When m is of the order of \sqrt{N}, the m-run strategy reaches the optimal success rate with a regret of order 2\sqrt{N}; otherwise, the m-run strategy does not necessarily converge to the optimal success rate as N increases, as it almost surely stops exploration after having tested finitely many arms.
– The non-recalling m-run strategy. This strategy likewise runs the 1-failure strategy until a sequence of m wins occurs, and it thereafter plays the current arm until the end. Like the m-run strategy, the non-recalling m-run strategy reaches the optimal success rate with regret 2\sqrt{N} for m ≈ \sqrt{N}.
– The m-learning strategy. This strategy uses 1-failure during the first m steps, and then plays the empirically best arm during the remaining N − m steps.

3 Anytime Denumerable-Armed Bandits

This section extends the approaches presented by Berry et al. (1997) to the anytime setting, where the algorithm does not depend on the number N of time steps.

3.1 The Easy ManAB setting

Let us first consider the easy setting, where the reward probabilities p_i are independently and uniformly distributed in [0,1]. Then we show:

Theorem 1 (EManAB setting). There exists an anytime strategy for the denumerable-armed bandit with expected failure rate bounded by O(1/\sqrt{N}).

Proof: Let α > 1 be a parameter and let us define the family of time intervals I_i = [i^α, (i+1)^α[. Let us consider the strategy defined by playing the m-run strategy with m = \sqrt{i^α} on interval I_i (independently from what has been done during the previous time intervals).

By construction this strategy is anytime; it does not depend on N. For a given N, let k be such that N ∈ I_k. On each interval I_i, the expected number of failures incurred by the \sqrt{i^α}-run strategy is O(\sqrt{i^α}) after Berry et al. (1997), Thm 4. Therefore, the expected number of failures until the end of I_k is at most O(\sum_{i=1}^{k} \sqrt{i^α}).

It comes that the number of failures of the considered strategy up to time step N is upper bounded by k × \sqrt{k^α} = k^{1+α/2}. The failure rate is thus upper bounded by k^{1+α/2}/N ≤ k^{1+α/2}/k^α = O(1/\sqrt{N}). □
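As an illustration of the strategy used in this proof, here is a minimal Python sketch (not from the paper) of an m-run subroutine restarted on the intervals I_i; the environment interface pull(j), the budget argument that merely bounds the simulation, and the tie-breaking details are hypothetical choices.

```python
import math

def m_run(pull, m, horizon):
    """One run of Berry et al.'s m-run strategy over `horizon` steps:
    1-failure exploration until the m-th arm is reached or m consecutive wins
    occur, then play the empirically best arm for the remaining steps.
    `pull(j)` is a hypothetical environment call returning a 0/1 reward."""
    wins, plays = {}, {}
    arm, streak, best = 0, 0, 0
    exploring = True
    for _ in range(horizon):
        j = arm if exploring else best
        r = pull(j)
        wins[j] = wins.get(j, 0) + r
        plays[j] = plays.get(j, 0) + 1
        if exploring:
            if r == 1:
                streak += 1
            else:
                streak = 0
                arm += 1                     # 1-failure: drop the arm on its first loss
            if streak >= m or arm >= m:      # stop exploring, recall the empirically best arm
                exploring = False
                best = max(plays, key=lambda a: wins[a] / plays[a])

def anytime_emanab(pull, budget, alpha=2.0):
    """Anytime strategy of Theorem 1: on each interval I_i = [i^alpha, (i+1)^alpha),
    restart an m-run strategy with m ≈ sqrt(i^alpha).
    `budget` only bounds the simulation; the decisions never depend on it."""
    t, i = 0, 1
    while t < budget:
        length = int((i + 1) ** alpha) - int(i ** alpha)
        length = min(length, budget - t)
        m = max(1, int(math.sqrt(i ** alpha)))
        m_run(pull, m, length)
        t += length
        i += 1
```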
We point out that another algorithm can be used to obtain the same result. With the same proof as above, using the properties of m-learning strategies instead of those of the m-run strategy (Berry et al. (1997)), we show the same result for the following algorithm at step t:
– if ⌊\sqrt{t}/\log(t)⌋ > ⌊\sqrt{t−1}/\log(t−1)⌋, choose the arm with lowest index which has never failed (this is the FAILURE algorithm; the chosen arm is then a non-visited one if all visited arms have failed at least once);
– otherwise, use the arm which has the best empirical success rate among all arms that have been rejected by FAILURE.
This algorithm, termed MLEARN in the rest of this paper, is nicer as it has no free parameter; it will be used in the experiments.

3.2 The Difficult ManAB setting

Let us now consider the difficult ManAB setting, where the reward probabilities p_i are uniformly distributed in [0,ε] for ε < 1. As shown by Berry et al. (1997), for some given m depending on ε and N, the m-run strategy reaches an expected failure rate O(\sqrt{ε/N}).

In this section, the above result is extended to the case where N and ε are unknown.

Theorem 2 (DManAB setting). Let us assume that the reward distribution is such that there exists a constant C > 0 which satisfies

    ∀ε ∈ ]0,1[,  P(p_1 > sup_i p_i − ε) ≥ min(1, Cε)                      (1)

Then there exists an anytime strategy for the denumerable-armed bandit with expected failure rate bounded by Õ(N^{−1/4}/C) (where a = Õ(b) is the notation for ∃k > 0, a = O(b (\log b)^k)).

Note that the bound is uniform in the distribution (i.e. all constants hidden in the O(·) are independent of the distribution) under assumption (1). Assumption (1) typically holds when the reward probabilities are uniformly distributed in [0,ε].

Proof: The proof is constructive, based on the algorithm described in Table 2. Indices n_{i,t} and w_{i,t} respectively stand for the number of times the i-th arm has been played (resp. has won) up to time t. Two sequences (s_n)_{n∈N} and (k_n)_{n∈N}, with s increasing and k non-decreasing, are used.

    1. Init: n_{i,0} = w_{i,0} = 0 for all i.
    2. Loop: For t = 1, 2, ...
         If t = s_i for some i, Exploration(t).
         Else (t ∈ ]s_i, s_{i+1}[), Exploitation(t).

    Exploration(t)                                            [t = s_i]
         Select j = argmin { n_{ℓ,t} ; ℓ ∈ [1, k_i[ }         (in case of ties, prefer the smallest j)
         Receive r_t
         n_{j,t+1} = n_{j,t} + 1 ; w_{j,t+1} = w_{j,t} + r_t
         n_{i,t+1} = n_{i,t} ; w_{i,t+1} = w_{i,t} for all i ≠ j

    Exploitation(t)                                           [t ∈ ]s_i, s_{i+1}[]
         Select j = argmax { w_{ℓ,t}/n_{ℓ,t} ; ℓ ∈ [1, k_i[ } (in case of ties, prefer the smallest j)
         Receive r_t
         n_{j,t+1} = n_{j,t} + 1 ; w_{j,t+1} = w_{j,t} + r_t
         n_{i,t+1} = n_{i,t} ; w_{i,t+1} = w_{i,t} for all i ≠ j

TAB. 2 – The DManAB algorithm.

Let us define ε_i as the maximal reward estimation error after i exploration steps:

    ε_i = max { |w_{j,t}/n_{j,t} − p_j| ; j ∈ [1, k_i[, t = s_i + 1 }

Let t be a time step in the i-th epoch (t ∈ [s_i, s_{i+1}[), and let ε_t denote the maximal ε_i such that t < s_{i+1}. Up to time t: i) the number of exploration steps so far is i; ii) the arms which have been played are included in [1, k_i[; iii) the maximal estimation error so far is ε_i.

For the two particular sequences below, we shall show that the algorithm is efficient, i.e. that ε_t goes to 0:

    k_n = ⌊n^α⌋,  α = 1/3                                                  (2)
    s_n = \sum_{i=1}^{n} ⌊1 + i^γ⌋,  γ = 1/3
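For illustration, a minimal Python sketch (not from the paper) of the procedure of Table 2 with the schedules of Eq. (2) could look as follows; the pull(j) interface, the budget argument and the handling of the very first steps (before s_1) are our own assumptions.

```python
import math

def dmanab(pull, budget, alpha=1/3, gamma=1/3):
    """Sketch of the algorithm of Table 2 with the schedules of Eq. (2).
    Arms are indexed 0, 1, 2, ... (denumerable pool); `pull(j)` is a hypothetical
    call returning a 0/1 reward for arm j; `budget` only bounds the simulation."""
    plays, wins = [], []

    def ensure(k):                                  # allocate counters for arms 0..k-1
        while len(plays) < k:
            plays.append(0)
            wins.append(0)

    i = 1                                           # index of the next exploration step
    next_expl = math.floor(1 + 1 ** gamma)          # s_1
    for t in range(1, budget + 1):
        k = max(1, math.floor(i ** alpha))          # k_i arms are eligible in this epoch
        ensure(k)
        if t == next_expl:                          # Exploration(t): least-played eligible arm
            j = min(range(k), key=lambda a: (plays[a], a))
            i += 1
            next_expl += math.floor(1 + i ** gamma)
        else:                                       # Exploitation(t): best empirical success rate
            j = max(range(k),
                    key=lambda a: (wins[a] / plays[a] if plays[a] else -1.0, -a))
        r = pull(j)
        plays[j] += 1
        wins[j] += r
```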
Step 1: fast convergence of ε_t to 0.
Let t be the current time step, belonging to the i-th epoch (t ∈ [s_i, s_{i+1}[), and let j be an arm belonging to the set of arms explored up to now (j ∈ [1, k_i[). Then, as all arms have been played an equal number of times during the i exploration steps:

    n_{j,t} ≥ ⌊i/k_i⌋                                                      (3)

By the Hoeffding bound, for all arms j ∈ [1, k_i[ and t ≥ s_i, it comes:

    P( |w_{j,t}/n_{j,t} − p_j| > ε ) ≤ exp(−2 ⌊i/k_i⌋ ε²)                  (4)

    P( sup_{j<k_i} |w_{j,t}/n_{j,t} − p_j| > ε ) ≤ k_i exp(−2 ⌊i/k_i⌋ ε²)

and therefore

    E sup_{j<k_i} |w_{j,t}/n_{j,t} − p_j| = O( \sqrt{ k_i \log(k_i) / i } )  (5)

    E sup_{j<k_i} |w_{j,t}/n_{j,t} − p_j| = Õ( i^{(α−1)/2} )                 (6)

Eq. (5) follows from the lemma below (Devroye et al., 1997, chap. 12, p. 208):

    P(Z < 0) = 0  ∧  P(Z ≥ ε) ≤ c exp(−2mε²)  ⇒  EZ ≤ \sqrt{ \log(ce)/(2m) }

Eq. (6) states that ε_t converges to 0 like O(i^{−1/3} \log(i)), i.e. like Õ(i^{−1/3}).

Step 2: exploitation is efficient.
Let R_N denote the sum of all rewards up to time step N, and let us consider the expectation of R_N. Let us assume further that N belongs to the n-th epoch (N ∈ [s_n, s_{n+1}[). It comes:

    E R_N ≥ E \sum_{i=1}^{n−1} [ r_{s_i} (exploration) + \sum_{t=s_i+1}^{s_{i+1}−1} r_t (exploitation) ]    (7)

Let p*_i denote the reward probability of the arm selected during the i-th exploitation epoch (recall that a single arm is played during ]s_i, s_{i+1}[), and let p**_i denote the maximal reward probability p_j for j ∈ [1, k_i[. Let E_n denote the expectation operator conditionally to the (p_i)_{i∈N} and to the exploration (formally, conditionally to all the p_i for i ∈ N and to all the r_t for t = s_1, ..., s_n; recall that the p_i are i.i.d. random variables). Then:

    (1/N) E_n R_N ≥ (1/N) \sum_{i=1}^{n−1} i^γ p*_i
                  ≥ (1/N) \sum_{i=1}^{n−1} i^γ (p**_i − 2ε_i)

by definition of ε_i. Let us note S_n = (1/N) \sum_{i=1}^{n−1} (1 + i^γ) p**_i. Then:

    E_n S_n − (1/N) E_n R_N = O( (1/N) \sum_{i=1}^{n−1} (1 + i^γ ε_i) )
                            = O( n/N + (1/N) \sum_{i=1}^{n−1} i^γ · i^{(α−1)/2} · \log(i) )    (8)
                            = Õ(n/N)                                                           (9)

almost surely, thanks to Step 1. Let E_p denote the conditional expectation with respect to the p_i, i ∈ {1,2,3,...}; as the constants in the O(·) notation are universal, we can take the expectation of eq. (9) with respect to the exploration (keeping the conditioning with respect to the p_i), and get

    E_p[ E_n S_n − (1/N) E_n R_N ] = E_explor[ E_n S_n − (1/N) E_n R_N ] = E_explor[ Õ(n/N) ] = Õ(n/N)

hence

    (1/N) E_p R_N ≥ p**_n − Õ(n/N)                                          (10)

since E_p S_n ≤ (1/N) \sum_{i=1}^{n−1} (1 + i^γ) p**_i ≤ p**_n by construction.

Step 3: exploration is sufficient.
It remains to lower-bound S_n, which depends on the expectation of the maximum of the p_j for j ∈ [1, k_i[, where the p_j are i.i.d. random variables such that eq. (1) holds. Noting as above p**_i = max { p_j ; j ∈ [1, k_i[ } and letting p_* = sup_i p_i, after eq. (1) it comes:

    E[p_* − p**_i] = ∫ P(p_* − p**_i > t) dt
                   = ∫ \prod_{j=1}^{k_i−1} P(p_* − p_j > t) dt
                   = ∫ P(p_* − p_1 > t)^{k_i−1} dt
                   = ∫ (1 − P(p_* − p_1 < t))^{k_i−1} dt                    (11)
                   < ∫_0^{1/C} (1 − Ct)^{k_i−1} dt

hence

    E p**_i ≥ p_* − O(1/(C k_i))                                            (12)

Summing eq. (12) for i ∈ [1, n] leads to

    S_n ≥ p_* − O( (1/(NC)) \sum_{i=1}^{n} i^γ/k_i ) ≥ p_* − O(n/(NC))      (13)

Eqs. (13) and (10) together lead to

    (1/N) E_p R_N ≥ p_* − O(n^{(α−1)/2}) − O(n/(CN)) ≥ p_* − O(N^{−1/4}/C)

which concludes the proof. □

4 Algorithms for Many-Armed Bandits

The theoretical analysis in the previous section suggests that the easy and difficult ManAB settings should be handled through different algorithms. Accordingly, this section presents two FAILURE variants adapted from the failure algorithms introduced in Section 2.2. The FPU algorithm inspired by Wang & Gelly (2007) and the MUCBT algorithm inspired by Hartland et al. (2006) are presented last.

4.1 The FAILURE and FAILUCB Algorithms

The 1-failure algorithm previously defined for the denumerable-armed bandit is adapted to the ManAB setting in two ways, respectively referred to as FAILURE and FAILUCB. In both cases, the algorithm plays the current arm until it fails; on failure, it selects the first arm which has never been tested, if such an arm exists. After all arms have been played at least once, FAILURE greedily selects the arm with best estimated reward, while FAILUCB uses KUCBT (Table 1). Indeed, FAILURE offers no guarantee to converge to the best success rate as N goes to infinity; however, such a poor asymptotic behaviour is irrelevant in the considered framework, since N remains comparable to n or n².
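A minimal Python sketch of FAILURE and FAILUCB (not from the paper) may help fix ideas; pull(j), the budget argument and the Bernoulli variance estimate used in the KUCBT score are assumptions on our side.

```python
import math

def failure(pull, n_arms, budget, variant="FAILURE"):
    """Sketch of FAILURE (greedy exploitation) and FAILUCB (KUCBT exploitation).
    Plays the current arm until it fails; on a failure, moves to the first arm that
    has never been tested; once all arms have been tested, selects the arm by its
    empirical mean (FAILURE) or by the KUCBT score of Table 1 (FAILUCB).
    `pull(j)` is a hypothetical call returning a 0/1 reward for arm j."""
    plays = [0] * n_arms
    wins = [0] * n_arms
    current = 0                                   # arm played during the failure phase
    for _ in range(budget):
        if current is not None:                   # failure phase
            j = current
        elif variant == "FAILURE":                # greedy on empirical success rates
            j = max(range(n_arms), key=lambda a: wins[a] / plays[a])
        else:                                     # FAILUCB: KUCBT score of Table 1
            total = sum(plays)
            def kucbt(a):
                p = wins[a] / plays[a]
                v = max(0.01, p * (1 - p))        # empirical Bernoulli variance, floored at 0.01
                return p + math.sqrt(v * math.log(total) / plays[a])
            j = max(range(n_arms), key=kucbt)
        r = pull(j)
        plays[j] += 1
        wins[j] += r
        if current is not None and r == 0:        # first failure of the current arm
            current = current + 1 if current + 1 < n_arms else None
    return wins, plays
```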
4.2 The First Play Urgency Algorithm

The First Play Urgency (FPU) algorithm was first defined in MoGo by Wang & Gelly (2007), in order to handle a large number of tree-structured arms. Formally, the selection criterion used in UCT (Table 1) is replaced by:

    V_j = \hat{p}_{j,t} + \sqrt{ 2 f_FPU \log(\sum_k n_{k,t}) / n_{j,t} }   if n_{j,t} > 0
    V_j = c_FPU                                                             otherwise

(Other formulas, taking variance terms into account, are proposed in Wang & Gelly (2007).) It is worth noting that for f_FPU = 0 and c_FPU = 1, the FPU algorithm coincides with the FAILURE one.

4.3 The Meta-UCBT Algorithm

We last define the meta-bandit algorithm MUCBT to deal with the ManAB setting. MUCBT is inspired from the meta-bandit algorithm devised by Hartland et al. (2006), which won the Exploration vs Exploitation Challenge defined by Hussain et al. (2006). However, the EE Challenge focuses on the extension of the multi-armed bandit to non-stationary environments, where the meta-bandit was in charge of handling the change-point detection epochs.

Quite the opposite, MUCBT is a recursive meta-bandit, where the first meta-bandit decides between the best empirical arm and all other arms, the second meta-bandit decides between the second best arm and all other arms, and so forth (Fig. 1, left).

A variant of the MUCBT algorithm, referred to as MUCBT-k, uses the first meta-bandit to decide between the first k−1 best arms and the others, the second meta-bandit to decide between the next k−1 best arms and the remaining arms, and so forth (Fig. 1, right).

Formally, w_i (respectively ℓ_i) denotes the number of wins (resp. losses) with the i-th arm up to the current time step. Algorithms MUCBT and MUCBT-k are specified in Algs. 1 & 2; each algorithm chooses arm a_t at time step t, and t_i is the number of time steps (previous to t) where the chosen arm is at least i (t_i = |{t' ≤ t ; a_{t'} ≥ i}|).

Algorithm 1: MUCBT
    Input: a (possibly infinite) number of arms.
    Initialize w_j = 0, ℓ_j = 0 and t_j = 0 for all j.
    for t = 1; true; t ← t+1 do
        Sort the arms by decreasing w_j/(w_j + ℓ_j) (with 0/0 = −∞ by convention).
        for i = 1; true; i ← i+1 do
            Compute w'_i = \sum_{j>i} w_j and ℓ'_i = \sum_{j>i} ℓ_j.
            V_i  = w_i/(w_i + ℓ_i)    + \sqrt{ 2 \log(t_i)/(w_i + ℓ_i) }
            V'_i = w'_i/(w'_i + ℓ'_i) + \sqrt{ 2 \log(t_i)/(w'_i + ℓ'_i) }
            if V_i > V'_i then
                break
            end if
        end for
        Play arm i (a_t = i).
        If win, w_i ← w_i + 1; else ℓ_i ← ℓ_i + 1.
        ∀ j ≤ i, t_j ← t_j + 1.
    end for

FIG. 1 – MUCBT algorithm (left) and MUCBT-3 as an example of the MUCBT-k algorithm (right).
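The following Python sketch mirrors Algorithm 1 for a finite pool of arms. It is not the authors' implementation: the pull(j) interface and the conventions for empty counts (the paper only fixes 0/0 = −∞ for the sorting step) are our own assumptions.

```python
import math

def mucbt(pull, n_arms, budget):
    """Sketch of MUCBT (Algorithm 1): at level i, a UCB test decides between the
    i-th best empirical arm and the pool of all lower-ranked arms.
    `pull(j)` is a hypothetical call returning a 0/1 reward for arm j."""
    w = [0] * n_arms                    # wins per arm
    l = [0] * n_arms                    # losses per arm
    t_lvl = [0] * n_arms                # t_i: number of steps whose chosen rank was >= i

    def ucb(wins, losses, t_i):
        n = wins + losses
        if n == 0:
            return float("inf")         # unexplored arm/pool: optimistic by convention
        return wins / n + math.sqrt(2.0 * math.log(max(t_i, 1)) / n)

    for _ in range(budget):
        # Rank arms by decreasing empirical success rate (unplayed arms last, 0/0 = -inf).
        order = sorted(range(n_arms),
                       key=lambda j: w[j] / (w[j] + l[j]) if w[j] + l[j] else -math.inf,
                       reverse=True)
        rank = n_arms - 1               # default: deepest level
        for i in range(n_arms - 1):
            j = order[i]
            rest = order[i + 1:]
            v_arm = ucb(w[j], l[j], t_lvl[i])
            v_rest = ucb(sum(w[a] for a in rest), sum(l[a] for a in rest), t_lvl[i])
            if v_arm > v_rest:          # the i-th best arm beats the rest of the pool
                rank = i
                break
        j = order[rank]
        if pull(j):
            w[j] += 1
        else:
            l[j] += 1
        for i in range(rank + 1):       # update t_i for all levels up to the chosen one
            t_lvl[i] += 1
```

With an all-unexplored start, the strict test V_i > V'_i never fires, so this sketch falls through to the deepest level on the very first step; any other tie-breaking convention would serve equally well.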