Introduction to Multi-Armed Bandits
(preliminary and incomplete draft)

Aleksandrs Slivkins
Microsoft Research NYC
https://www.microsoft.com/en-us/research/people/slivkins

© 2017-2019: Aleksandrs Slivkins
First draft: January 2017
This version: March 2019

Preface

Multi-armed bandits is a rich, multi-disciplinary area studied since (Thompson, 1933), with a big surge of activity in the past 10-15 years. An enormous body of work has accumulated over the years. While various subsets of this work have been covered in depth in several books and surveys (Berry and Fristedt, 1985; Cesa-Bianchi and Lugosi, 2006; Bergemann and Välimäki, 2006; Gittins et al., 2011; Bubeck and Cesa-Bianchi, 2012), this book provides a more textbook-like treatment of the subject.

The organizing principles for this book can be summarized as follows. The work on multi-armed bandits can be partitioned into a dozen or so lines of work. Each chapter tackles one line of work, providing a self-contained introduction and pointers for further reading. We favor fundamental ideas and elementary, teachable proofs over the strongest possible results. We emphasize accessibility of the material: while exposure to machine learning and probability/statistics would certainly help, a standard undergraduate course on algorithms, e.g., one based on (Kleinberg and Tardos, 2005), should suffice for background.

With the above principles in mind, the choice of specific topics and results is based on the author's subjective understanding of what is important and "teachable" (i.e., presentable in a relatively simple manner). Many important results have been deemed too technical or advanced to be presented in detail.

This book (except Chapter 10) is based on a graduate course at the University of Maryland, College Park, taught by the author in Fall 2016. Each book chapter corresponds to a week of the course. The first draft of the book evolved from the course's lecture notes. Five of the book chapters were used in a similar manner in a graduate course at Columbia University, co-taught by the author in Fall 2017.

To keep the book manageable, and also more accessible, we chose not to dwell on the deep connections to online convex optimization. A modern treatment of this fascinating subject can be found, e.g., in the recent textbook (Hazan, 2015). Likewise, we chose not to venture into the much more general problem space of reinforcement learning, a subject of many graduate courses and textbooks such as Sutton and Barto (1998) and Szepesvári (2010). A course based on this book would be complementary to graduate-level courses on online convex optimization and reinforcement learning.

Status of the manuscript. The present draft needs some polishing and, at places, a more detailed discussion of related work. (However, our goal is to provide pointers for further reading rather than a comprehensive discussion.) The author plans to add more material, in addition to the ten chapters already in the manuscript: an introductory chapter on the scope and motivations, and a chapter on connections to incentives and mechanism design. In the meantime, the author would be grateful for feedback and is open to suggestions.

Acknowledgements. The author is indebted to the students who scribed the initial versions of the lecture notes. Presentation of some of the fundamental results is heavily influenced by the online lecture notes from (Kleinberg, 2007). The author is grateful to Alekh Agarwal, Bobby Kleinberg, Yishay Mansour, and Rob Schapire for discussions and advice. Chapters 9 and 10 have benefited tremendously from numerous conversations with Karthik Abinav Sankararaman.

Contents

1 Bandits with IID Rewards (rev. Jul'18)
    1.1 Model and examples
    1.2 Simple algorithms: uniform exploration
    1.3 Advanced algorithms: adaptive exploration
    1.4 Bibliographic remarks and further directions
    1.5 Exercises and Hints
2 Lower Bounds (rev. Jul'18)
    2.1 Background on KL-divergence
    2.2 A simple example: flipping one coin
    2.3 Flipping several coins: "bandits with prediction"
    2.4 Proof of Lemma 2.10 for K ≥ 24 arms
    2.5 Instance-dependent lower bounds (without proofs)
    2.6 Bibliographic remarks and further directions
    2.7 Exercises and Hints
Interlude A: Bandits with Initial Information (rev. Jan'17)
3 Thompson Sampling (rev. Jan'17)
    3.1 Bayesian bandits: preliminaries and notation
    3.2 Thompson Sampling: definition and characterizations
    3.3 Computational aspects
    3.4 Example: 0-1 rewards and Beta priors
    3.5 Example: Gaussian rewards and Gaussian priors
    3.6 Bayesian regret
    3.7 Thompson Sampling with no prior (and no proofs)
    3.8 Bibliographic remarks and further directions
4 Lipschitz Bandits (rev. Jul'18)
    4.1 Continuum-armed bandits
    4.2 Lipschitz MAB
    4.3 Adaptive discretization: the Zooming Algorithm
    4.4 Bibliographic remarks and further directions
    4.5 Exercises and Hints
5 Full Feedback and Adversarial Costs (rev. Sep'17)
    5.1 Adversaries and regret
    5.2 Initial results: binary prediction with experts advice
    5.3 Hedge Algorithm
    5.4 Bibliographic remarks and further directions
    5.5 Exercises and Hints
6 Adversarial Bandits (rev. Jun'18)
    6.1 Reduction from bandit feedback to full feedback
    6.2 Adversarial bandits with expert advice
    6.3 Preliminary analysis: unbiased estimates
    6.4 Algorithm Exp4 and crude analysis
    6.5 Improved analysis of Exp4
    6.6 Bibliographic remarks and further directions
    6.7 Exercises and Hints
7 Linear Costs and Combinatorial Actions (rev. Jun'18)
    7.1 Bandits-to-experts reduction, revisited
    7.2 Online routing problem
    7.3 Combinatorial semi-bandits
    7.4 Follow the Perturbed Leader
8 Contextual Bandits (rev. Sep'18)
    8.1 Warm-up: small number of contexts
    8.2 Lipschitz contextual bandits
    8.3 Linear contextual bandits: LinUCB algorithm (no proofs)
    8.4 Contextual bandits with a policy class
    8.5 Policy evaluation and training
    8.6 Contextual bandits in practice: challenges and a system design
    8.7 Bibliographic remarks and further directions
    8.8 Exercises and Hints
9 Bandits and Zero-Sum Games (rev. Jan'18)
    9.1 Basics: guaranteed minimax value
    9.2 The minimax theorem
    9.3 Regret-minimizing adversary
    9.4 Beyond zero-sum games: coarse correlated equilibrium
    9.5 Bibliographic remarks and further directions
    9.6 Exercises and Hints
10 Bandits with Knapsacks (rev. March'19)
    10.1 Definitions, examples, and discussion
    10.2 LagrangeBwK: a game-theoretic algorithm for BwK
    10.3 Optimal algorithms and regret bounds (no proofs)
    10.4 Bibliographic remarks and further directions
    10.5 Exercises and Hints
Bibliography

Chapter 1
Bandits with IID Rewards (rev. Jul'18)

[TODO: more citations, probably a few paragraphs on practical aspects.]

This chapter covers bandits with i.i.d. rewards, the basic model of multi-armed bandits. We present several algorithms, and analyze their performance in terms of regret. The ideas introduced in this chapter extend far beyond the basic model, and will resurface throughout the book.

1.1 Model and examples

Problem formulation (Bandits with i.i.d. rewards). There is a fixed and finite set of actions, a.k.a. arms, denoted $A$. Learning proceeds in rounds, indexed by $t = 1, 2, \ldots$. The number of rounds $T$, a.k.a. the time horizon, is fixed and known in advance. The protocol is as follows:

Problem protocol: Multi-armed bandits

In each round $t \in [T]$:
1. Algorithm picks arm $a_t \in A$.
2. Algorithm observes reward $r_t \in [0,1]$ for the chosen arm.
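To make the protocol concrete, here is a minimal simulation sketch in Python. It is not part of the book: the Bernoulli (0/1) reward distributions, the specific means, and the uniformly random placeholder policy are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical problem instance: mu[a] is the mean reward of arm a (unknown to the algorithm),
# with Bernoulli (0/1) rewards as a simple concrete choice of reward distribution.
mu = np.array([0.3, 0.5, 0.7])
K, T = len(mu), 10_000

total_reward = 0.0
for t in range(T):
    a = rng.integers(K)           # placeholder policy: pick an arm uniformly at random
    r = rng.binomial(1, mu[a])    # bandit feedback: only the chosen arm's reward is observed
    total_reward += r

print(f"average reward per round: {total_reward / T:.3f}")
```

Any of the bandit algorithms discussed below can be dropped in place of the uniformly random choice of the arm.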
The algorithm observes only the reward for the selected action, and nothing else. In particular, it does not observe rewards for other actions that could have been selected. Such a feedback model is called bandit feedback.

Per-round rewards are bounded; the restriction to the interval [0,1] is for simplicity. The algorithm's goal is to maximize the total reward over all $T$ rounds.

We make the i.i.d. assumption: the reward for each action is i.i.d. (independent and identically distributed). More precisely, for each action $a$, there is a distribution $D_a$ over reals, called the reward distribution. Every time this action is chosen, the reward is sampled independently from this distribution. $D_a$ is initially unknown to the algorithm.

Perhaps the simplest reward distribution is the Bernoulli distribution, when the reward of each arm $a$ can be either 1 or 0 ("success or failure", "heads or tails"). This reward distribution is fully specified by the mean reward, which in this case is simply the probability of the successful outcome. The problem instance is then fully specified by the time horizon $T$ and the mean rewards.

Our model is a simple abstraction for an essential feature of reality that is present in many application scenarios. We proceed with three motivating examples:

1. News: in a very stylized news application, a user visits a news site, the site presents the user with a news header, and the user either clicks on this header or not. The goal of the website is to maximize the number of clicks. So each possible header is an arm in a bandit problem, and clicks are the rewards. Note that rewards are 0-1.
A typical modeling assumption is that each user is drawn independently from a fixed distribution over users, so that in each round the click happens independently with a probability that depends only on the chosen header.

2. Ad selection: in website advertising, a user visits a webpage, and a learning algorithm selects one of many possible ads to display. If ad $a$ is displayed, the website observes whether the user clicks on the ad, in which case the advertiser pays some amount $v_a \in [0,1]$. So each ad is an arm, and the paid amount is the reward.
A typical modeling assumption is that the paid amount $v_a$ depends only on the displayed ad, but does not change over time. The click probability for a given ad does not change over time, either.

3. Medical trials: a patient visits a doctor, the doctor can prescribe one of several possible treatments, and observes the treatment effectiveness. Then the next patient arrives, and so forth. For simplicity of this example, the effectiveness of a treatment is quantified as a number in [0,1]. So here each treatment can be considered as an arm, and the reward is defined as the treatment effectiveness. As an idealized assumption, each patient is drawn independently from a fixed distribution over patients, so the effectiveness of a given treatment is i.i.d.

Note that the reward of a given arm can only take two possible values in the first two examples, but could, in principle, take arbitrary values in the third example.

Notation. We use the following conventions in this chapter and (usually) throughout the book. Actions are denoted with $a$, rounds with $t$. The number of arms is $K$, the number of rounds is $T$. The mean reward of arm $a$ is $\mu(a) := \mathbb{E}[D_a]$. The best mean reward is denoted $\mu^* := \max_{a \in A} \mu(a)$. The difference $\Delta(a) := \mu^* - \mu(a)$ describes how bad arm $a$ is compared to $\mu^*$; we call it the badness of arm $a$. An optimal arm is an arm $a$ with $\mu(a) = \mu^*$; note that it is not necessarily unique. We take $a^*$ to denote an optimal arm.

Regret. How do we argue whether an algorithm is doing a good job across different problem instances, when some instances inherently allow higher rewards than others? One standard approach is to compare the algorithm to the best one could possibly achieve on a given problem instance, if one knew the mean rewards. More formally, we consider the first $t$ rounds, and compare the cumulative mean reward of the algorithm against $\mu^* \cdot t$, the expected reward of always playing an optimal arm:

$$R(t) = \mu^* \cdot t - \sum_{s=1}^{t} \mu(a_s). \qquad (1.1)$$

This quantity is called the regret at round $t$.¹ The quantity $\mu^* \cdot t$ is sometimes called the best-arm benchmark.

¹ It is called "regret" because this is how much the algorithm "regrets" not knowing what is the best arm.

Note that $a_t$ (the arm chosen at round $t$) is a random quantity, as it may depend on randomness in rewards and/or the algorithm. So, regret $R(t)$ is also a random variable. Hence we will typically talk about expected regret $\mathbb{E}[R(T)]$.

We mainly care about the dependence of regret on the round $t$ and the time horizon $T$. We also consider the dependence on the number of arms $K$ and the mean rewards $\mu$. We are less interested in the fine-grained dependence on the reward distributions (beyond the mean rewards). We will usually use big-O notation to focus on the asymptotic dependence on the parameters of interest, rather than keep track of the constants.

Remark 1.1 (Terminology). Since our definition of regret sums over all rounds, we sometimes call it cumulative regret. When/if we need to highlight the distinction between $R(T)$ and $\mathbb{E}[R(T)]$, we say realized regret and expected regret; but most of the time we just say "regret" and the meaning is clear from the context. The quantity $\mathbb{E}[R(T)]$ is sometimes called pseudo-regret in the literature.
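As a small illustration of definition (1.1), the following sketch (again not from the book; the mean rewards and the uniformly random baseline policy are assumptions) evaluates the regret of one simulated run. Note that (1.1) is computed from the mean rewards $\mu(a_s)$ of the chosen arms, not from the realized rewards.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
mu = np.array([0.3, 0.5, 0.7])      # hypothetical mean rewards, unknown to the algorithm
K, T = len(mu), 10_000
mu_star = mu.max()                  # mu*, the best mean reward

# Arms a_1, ..., a_T chosen by a uniformly random baseline policy.
chosen = rng.integers(K, size=T)

# R(T) = mu* * T - sum_s mu(a_s), as in (1.1).
regret = mu_star * T - mu[chosen].sum()
print(f"regret of the uniform baseline: {regret:.1f}")   # roughly (mu* - mean of mu) * T
```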
1.2 Simple algorithms: uniform exploration

We start with a simple idea: explore arms uniformly (at the same rate), regardless of what has been observed previously, and pick an empirically best arm for exploitation. A natural incarnation of this idea, known as the Explore-first algorithm, is to dedicate an initial segment of rounds to exploration, and the remaining rounds to exploitation.

1. Exploration phase: try each arm $N$ times;
2. Select the arm $\hat{a}$ with the highest average reward (break ties arbitrarily);
3. Exploitation phase: play arm $\hat{a}$ in all remaining rounds.

Algorithm 1.1: Explore-First with parameter $N$.

The parameter $N$ is fixed in advance; it will be chosen later as a function of the time horizon $T$ and the number of arms $K$, so as to minimize regret.
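Algorithm 1.1 is short enough to sketch directly in Python. The sketch below is illustrative rather than prescriptive: the Bernoulli environment, the specific means, and the particular value of $N$ (which anticipates the analysis below) are assumptions, and the regret bookkeeping uses the true means purely for evaluation; the algorithm itself never sees them.

```python
import numpy as np

def explore_first(mu, T, N, rng):
    """Explore-First: try each arm N times, then play the empirically best arm for the rest."""
    K = len(mu)
    mu_star = mu.max()
    sums = np.zeros(K)      # total observed reward per arm during exploration
    regret = 0.0            # tracked with the true means, for evaluation only

    # Exploration phase: K * N rounds in total.
    for a in range(K):
        for _ in range(N):
            sums[a] += rng.binomial(1, mu[a])   # Bernoulli reward, observed by the algorithm
            regret += mu_star - mu[a]

    # Exploitation phase: commit to the arm with the highest average reward.
    a_hat = int(np.argmax(sums / N))
    regret += (T - K * N) * (mu_star - mu[a_hat])
    return regret

rng = np.random.default_rng(seed=0)
mu = np.array([0.3, 0.5, 0.7])     # hypothetical mean rewards
T = 10_000
N = int((T / len(mu)) ** (2 / 3) * np.log(T) ** (1 / 3))   # choice suggested by the analysis below
print(f"N = {N}, regret of Explore-First ≈ {explore_first(mu, T, N, rng):.1f}")
```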
Let us analyze the regret of this algorithm. Let the average reward for each action $a$ after the exploration phase be denoted $\bar{\mu}(a)$. We want the average reward to be a good estimate of the true expected reward, i.e., the following quantity should be small: $|\bar{\mu}(a) - \mu(a)|$. We can use the Hoeffding inequality to quantify the deviation of the average from the true expectation. Defining the confidence radius $r(a) = \sqrt{\tfrac{2\log T}{N}}$ and using the Hoeffding inequality, we get:

$$\Pr\left\{ |\bar{\mu}(a) - \mu(a)| \le r(a) \right\} \ge 1 - \frac{1}{T^4} \qquad (1.2)$$

So, the probability that the average will deviate from the true expectation is very small.

We define the clean event to be the event that (1.2) holds for both arms simultaneously. We will argue separately about the clean event and the "bad event", i.e., the complement of the clean event.

Remark 1.2. With this approach, one does not need to worry about probability in the rest of the proof. Indeed, the probability has been taken care of by defining the clean event and observing that (1.2) holds! And we do not need to worry about the bad event either, essentially because its probability is so tiny. We will use this "clean event" approach in many other proofs, to help simplify the technical details. The downside is that it usually leads to worse constants than can be obtained by a proof that argues about probabilities more carefully.

For simplicity, let us start with the case of $K = 2$ arms. Consider the clean event. We will show that if we chose the worse arm, it is not so bad, because the expected rewards of the two arms would be close.

Let the best arm be $a^*$, and suppose the algorithm chooses the other arm $a \ne a^*$. This must have been because its average reward was better than that of $a^*$; in other words, $\bar{\mu}(a) > \bar{\mu}(a^*)$. Since this is a clean event, we have:

$$\mu(a) + r(a) \ge \bar{\mu}(a) > \bar{\mu}(a^*) \ge \mu(a^*) - r(a^*)$$

Re-arranging the terms, it follows that

$$\mu(a^*) - \mu(a) \le r(a) + r(a^*) = O\left(\sqrt{\tfrac{\log T}{N}}\right).$$

Thus, each round in the exploitation phase contributes at most $O\left(\sqrt{\tfrac{\log T}{N}}\right)$ to regret. Each round in the exploration phase trivially contributes at most 1; in fact, only the $N$ exploration rounds spent on the suboptimal arm contribute to regret. We derive an upper bound on the regret, which consists of two parts: at most $N$ for the $2N$ rounds of exploration, and then for the remaining $T - 2N$ rounds of exploitation:

$$R(T) \le N + O\left(\sqrt{\tfrac{\log T}{N}} \cdot (T - 2N)\right) \le N + O\left(\sqrt{\tfrac{\log T}{N}} \cdot T\right).$$

Recall that we can select any value for $N$, as long as it is known to the algorithm before the first round. So, we can choose $N$ so as to (approximately) minimize the right-hand side. Noting that the two summands are, respectively, monotonically increasing and monotonically decreasing in $N$, we set $N$ so that they are (approximately) equal. For $N = T^{2/3} (\log T)^{1/3}$, we obtain:

$$R(T) \le O\left(T^{2/3} (\log T)^{1/3}\right).$$

To complete the proof, we have to analyze the case of the "bad event". Since regret can be at most $T$ (because each round contributes at most 1), and the bad event happens with a very small probability ($1/T^4$), the (expected) regret from this case can be neglected. Formally,

$$\mathbb{E}[R(T)] = \mathbb{E}[R(T) \mid \text{clean event}] \cdot \Pr[\text{clean event}] + \mathbb{E}[R(T) \mid \text{bad event}] \cdot \Pr[\text{bad event}]$$
$$\le \mathbb{E}[R(T) \mid \text{clean event}] + T \cdot O(T^{-4})$$
$$\le O\left(\sqrt{\log T} \cdot T^{2/3}\right). \qquad (1.3)$$

This completes the proof for $K = 2$ arms.

For $K > 2$ arms, we have to apply the union bound for (1.2) over the $K$ arms, and then follow the same argument as above. Note that the value of $T$ is greater than $K$, since we need to explore each arm at least once. For the final regret computation, we will need to take into account the dependence on $K$: specifically, the regret accumulated in the exploration phase is now upper-bounded by $KN$. Working through the proof, we obtain

$$R(T) \le NK + O\left(\sqrt{\tfrac{\log T}{N}} \cdot T\right).$$

As before, we approximately minimize it by approximately minimizing the two summands. Specifically, we plug in $N = (T/K)^{2/3} \cdot O(\log T)^{1/3}$. Completing the proof the same way as in (1.3), we obtain:

Theorem 1.3. Explore-first achieves regret $\mathbb{E}[R(T)] \le T^{2/3} \times O(K \log T)^{1/3}$, where $K$ is the number of arms.
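As a rough numerical illustration of how the bound in Theorem 1.3 scales (treating the constants hidden in the $O(\cdot)$ as 1, purely for illustration), the bound grows sublinearly in $T$, so the average per-round regret vanishes as $T$ grows:

```python
import numpy as np

# Illustrative only: evaluate the bound T^(2/3) (K log T)^(1/3) from Theorem 1.3 and the
# corresponding exploration budget K * N with N = (T/K)^(2/3) (log T)^(1/3),
# pretending the constants hidden in the O(.) notation equal 1.
K = 10
for T in (10**4, 10**5, 10**6):
    N = (T / K) ** (2 / 3) * np.log(T) ** (1 / 3)
    bound = T ** (2 / 3) * (K * np.log(T)) ** (1 / 3)
    print(f"T = {T:>7}: exploration rounds K*N ≈ {K * N:>9,.0f}, regret bound ≈ {bound:>9,.0f}")
```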
