
Multi-Armed Bandit Allocation Indices, 2nd Edition PDF

301 Pages·2011·2.09 MB·English

Preview Multi-Armed Bandit Allocation Indices, 2nd Edition

Multi-armed Bandit Allocation Indices, Second Edition
John Gittins, Department of Statistics, University of Oxford, UK
Kevin Glazebrook, Department of Management Science, Lancaster University, UK
Richard Weber, Statistical Laboratory, University of Cambridge, UK

© 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-67002-6
A John Wiley & Sons, Ltd., Publication

This edition first published 2011
© 2011 John Wiley & Sons, Ltd

Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

Library of Congress Cataloging-in-Publication Data
Gittins, John C., 1938-
Multi-armed bandit allocation indices / John Gittins, Richard Weber, Kevin Glazebrook. – 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-67002-6 (cloth)
1. Resource allocation – Mathematical models. 2. Mathematical optimization. 3. Programming (Mathematics)
I. Weber, Richard, 1953- II. Glazebrook, Kevin D., 1950- III. Title.
QA279.G55 2011
519.5–dc22 2010044409

A catalogue record for this book is available from the British Library.
Print ISBN: 978-0-470-67002-6
ePDF ISBN: 978-0-470-98004-0
oBook ISBN: 978-0-470-98003-3
ePub ISBN: 978-1-119-99021-5
Set in 10/12pt Times by Laserwords Private Limited, Chennai, India

Contents

Foreword
Foreword to the first edition
Preface
Preface to the first edition

1 Introduction or exploration
    Exercises

2 Main ideas: Gittins index
    2.1 Introduction
    2.2 Decision processes
    2.3 Simple families of alternative bandit processes
    2.4 Dynamic programming
    2.5 Gittins index theorem
    2.6 Gittins index
        2.6.1 Gittins index and the multi-armed bandit
        2.6.2 Coins problem
        2.6.3 Characterization of the optimal stopping time
        2.6.4 The restart-in-state formulation
        2.6.5 Dependence on discount factor
        2.6.6 Myopic and forwards induction policies
    2.7 Proof of the index theorem by interchanging bandit portions
    2.8 Continuous-time bandit processes
    2.9 Proof of the index theorem by induction and interchange argument
    2.10 Calculation of Gittins indices
    2.11 Monotonicity conditions
        2.11.1 Monotone indices
        2.11.2 Monotone jobs
    2.12 History of the index theorem
    2.13 Some decision process theory
    Exercises

3 Necessary assumptions for indices
    3.1 Introduction
    3.2 Jobs
    3.3 Continuous-time jobs
        3.3.1 Definition
        3.3.2 Policies for continuous-time jobs
        3.3.3 The continuous-time index theorem for a SFABP of jobs
    3.4 Necessary assumptions
        3.4.1 Necessity of an infinite time horizon
        3.4.2 Necessity of constant exponential discounting
        3.4.3 Necessity of a single processor
    3.5 Beyond the necessary assumptions
        3.5.1 Bandit-dependent discount factors
        3.5.2 Stochastic discounting
        3.5.3 Undiscounted rewards
        3.5.4 A discrete search problem
        3.5.5 Multiple processors
    Exercises

4 Superprocesses, precedence constraints and arrivals
    4.1 Introduction
    4.2 Bandit superprocesses
    4.3 The index theorem for superprocesses
    4.4 Stoppable bandit processes
    4.5 Proof of the index theorem by freezing and promotion rules
        4.5.1 Freezing rules
        4.5.2 Promotion rules
    4.6 The index theorem for jobs with precedence constraints
    4.7 Precedence constraints forming an out-forest
    4.8 Bandit processes with arrivals
    4.9 Tax problems
        4.9.1 Ongoing bandits and tax problems
        4.9.2 Klimov’s model
        4.9.3 Minimum EWFT for the M/G/1 queue
    4.10 Near optimality of nearly index policies
    Exercises

5 The achievable region methodology
    5.1 Introduction
    5.2 A simple example
    5.3 Proof of the index theorem by greedy algorithm
    5.4 Generalized conservation laws and indexable systems
    5.5 Performance bounds for policies for branching bandits
    5.6 Job selection and scheduling problems
    5.7 Multi-armed bandits on parallel machines
    Exercises

6 Restless bandits and Lagrangian relaxation
    6.1 Introduction
    6.2 Restless bandits
    6.3 Whittle indices for restless bandits
    6.4 Asymptotic optimality
    6.5 Monotone policies and simple proofs of indexability
    6.6 Applications to multi-class queueing systems
    6.7 Performance bounds for the Whittle index policy
    6.8 Indices for more general resource configurations
    Exercises

7 Multi-population random sampling (theory)
    7.1 Introduction
    7.2 Jobs and targets
    7.3 Use of monotonicity properties
    7.4 General methods of calculation: use of invariance properties
    7.5 Random sampling times
    7.6 Brownian reward processes
    7.7 Asymptotically normal reward processes
    7.8 Diffusion bandits
    Exercises

8 Multi-population random sampling (calculations)
    8.1 Introduction
    8.2 Normal reward processes (known variance)
    8.3 Normal reward processes (mean and variance both unknown)
    8.4 Bernoulli reward processes
    8.5 Exponential reward processes
    8.6 Exponential target process
    8.7 Bernoulli/exponential target process
    Exercises

9 Further exploitation
    9.1 Introduction
    9.2 Website morphing
    9.3 Economics
    9.4 Value of information
    9.5 More on job-scheduling problems
    9.6 Military applications

References
Tables
Index

Foreword

John Gittins’ first edition of this book marked the end of an era, the era in which a succession of investigators struggled for an understanding and ‘solution’ of the multi-armed bandit problem. My foreword to that edition celebrated the gaining of this understanding, and so it seems fitting that this should be retained.

The opening of a new era was like the stirring of an ant-heap, with the sudden emergence of an avid multitude and a rush of scurrying activity. The first phase was one of exploitation, in which each worker tried to apply his special expertise in this new context. This yielded, among other things, a remarkable array of proofs of optimality of the Gittins index policy.
The most elegant and insightful was certainly Richard Weber’s ‘prevailing charge’ proof (see section 2.5), expressible in a single paragraph of verbal reasoning. I must confess, however, to a lingering attachment to the dynamic programming proof (see section 4.3) which provided also the value function of the Gittins policy and a treatment immediately generalizable to the case of superprocesses.

The phase of real interest was the subsequent one, of exploration. To what range of models can the Gittins technique be extended? Here the simile latent in the terms ‘bandit’ and ‘arm’ (with their gambling machine origins) begins to become quite strained. I myself spoke rather of a population of ‘projects’, only one of which could be engaged at any given time. One wishes to operate only the high-value projects, but can determine which these are only by trying them all – it is to this that the phrase ‘exploitation and exploration’ first referred. The process is then one of allocation and learning. Any project does not itself change, but its ‘informational state’ does – one’s knowledge of it as expressed by a Bayesian updating.

However, situations occur for which the projects do also have a physical state, which may change by stochastic rules, or for which the population of projects changes by immigration or emigration. These are cases one would wish to cover. It is then more natural to think of the projects as ‘activities’, having their own dynamic structure, and whose performance one would wish to optimize by the appropriate allocation of given resources over activities. This is the classical economic activity analysis in a dynamic setting. However, we are now effectively adding to this the feature that the current physical states of the activities are incompletely known, and must be inferred from observation. ‘Observation time’ is then an additional resource to be allocated. Section 6.8 gives the clearest indication of such an ambition.
Extension to such cases requires new tools, and Chapters 5 and 6 consider two such: study of the achievable region and Lagrangian relaxation of the optimization. These central chapters present very clearly the current state of theory in these areas, a considerable enhancement of understanding and many examples of cases of practical interest for which an optimal policy – of index character – is determined. It must be confessed, however, that the general problems of indexability and optimality remain beached on the research frontier, although one senses a rising tide which will lift them. Explicitness is also a feature of Chapters 7 and 8, which go into hard detail on the determination of indices under different statistical assumptions.

This scholarly and modern work gives a satisfyingly complete, rigorous and illuminating account of the current state of the subject. It also conveys a historical perspective, from the work of Klimov and von Olivier (which appeared in print before appreciation of John Gittins’ advance had become general) to the present day. All three authors of the text have made key and recent contributions to the subject, and are in the best position to lay its immediate course.

Peter Whittle

Foreword to the first edition

The term ‘Gittins index’ now has firm currency in the literature, denoting the concept which first proved so crucial in the solution of the long-standing multi-armed bandit problem and since then has provided a guide for the deeper understanding of all such problems. The author is, nevertheless, too modest to use the term so I regard it as my sole role to reassure the potential reader that the author is indeed the Gittins of the index, and that this book sets forth his pioneering work on the solution of the multi-armed bandit problem and his subsequent investigation of a wide class of sequential allocation problems.

Such allocation problems are concerned with the optimal division of resources between projects which need resources in order to develop and which yield benefit at a rate depending upon their degree of development. They embody in essential form a conflict evident in all human action. This is the conflict between taking those actions which yield immediate reward and those (such as acquiring information or skill, or preparing the ground) whose benefit will come only later.

The multi-armed bandit is a prototype of this class of problems, propounded during the Second World War, and soon recognized as so difficult that it quickly became a classic, and a byword for intransigence. In fact, John Gittins had solved the problem by the late sixties, although the fact that he had done so was not generally recognized until the early eighties. I can illustrate the mode of propagation of this news, when it began to propagate, by telling of an American friend of mine, a colleague of high repute, who asked an equally well-known colleague ‘What would you say if you were told that the multi-armed bandit problem had been solved?’ The reply was somewhat in the Johnsonian form: ‘Sir, the multi-armed bandit problem is not of such a nature that it can be solved’. My friend then undertook to convince the doubter in a quarter of an hour. This is indeed a feature of John’s solution: that, once explained, it carries conviction even before it is proved.

John Gittins gives here an account which unifies his original pioneering contributions with the considerable development achieved both by himself and by other workers subsequently. I recommend the book as the authentic and authoritative source-work.

Peter Whittle

Preface

The first edition of this book was published in 1989. A shortened version of the preface to that edition follows this preface. The uninitiated reader might find it helpful to read it at this point.

Since 1989 the field has developed apace.
There has been a remarkable flowering of different proofs of the main index theorem, each giving its own particular insight as to why the result is true. Major bodies of related theory have also emerged, notably the achievable region and restless bandit methodologies, and the discussion in economics of the appropriate balance between exploration and exploitation. These have led, for example, to improved algorithms for calculating indices and to improved bounds on the performance of index strategies when they are not necessarily optimal. These developments form the case for a new edition, plus the fact that there is an ongoing interest in the book, which is now difficult to buy.

There are now three authors rather than just John Gittins. Kevin Glazebrook and Richard Weber bring familiarity with more recent developments, to which they have made important contributions. Their expertise has allowed us to include new chapters on achievable regions and on restless bandits. We have also taken the opportunity to substantially revise the core earlier chapters. Our aim has been to provide an accessible introduction to the main ideas, taking advantage of more recent work, before proceeding to the more challenging material in Chapters 4, 5 and 6. Overall we have tried to produce an expository account rather than just a research monograph. The exercises are designed as an aid to a deeper understanding of the material.

The Gittins index, as it is commonly known, for a project competing for investment with other projects allows both for the immediate expected reward, and for the value of the information gained from the initial investment, which may be useful in securing later rewards. These factors are often called exploitation and exploration, respectively. In this spirit Chapter 1, which is a taster for the rest of the book, is subtitled ‘Exploration’. The mainstream of the book then continues through Chapters 2 to 4.
It breaks into independent substreams represented by Chapter 5, Chapter 6, Chapters 7 and 8, and Chapter 9, which looks briefly at five further application areas.

Description:
In 1989 the first edition of this book set out Gittins' pioneering index solution to the multi-armed bandit problem and his subsequent investigation of a wide class of sequential resource allocation and stochastic scheduling problems. Since then there has been a remarkable flowering of new insights, gener