Praise for Foundations of Deep Reinforcement Learning

“This book provides an accessible introduction to deep reinforcement learning covering the mathematical concepts behind popular algorithms as well as their practical implementation. I think the book will be a valuable resource for anyone looking to apply deep reinforcement learning in practice.”
—Volodymyr Mnih, lead developer of DQN

“An excellent book to quickly develop expertise in the theory, language, and practical implementation of deep reinforcement learning algorithms. A limpid exposition which uses familiar notation; all the most recent techniques explained with concise, readable code, and not a page wasted in irrelevant detours: it is the perfect way to develop a solid foundation on the topic.”
—Vincent Vanhoucke, principal scientist, Google

“As someone who spends their days trying to make deep reinforcement learning methods more useful for the general public, I can say that Laura and Keng’s book is a welcome addition to the literature. It provides both a readable introduction to the fundamental concepts in reinforcement learning as well as intuitive explanations and code for many of the major algorithms in the field. I imagine this will become an invaluable resource for individuals interested in learning about deep reinforcement learning for years to come.”
—Arthur Juliani, senior machine learning engineer, Unity Technologies

“Until now, the only way to get to grips with deep reinforcement learning was to slowly accumulate knowledge from dozens of different sources. Finally, we have a book bringing everything together in one place.”
—Matthew Rahtz, ML researcher, ETH Zürich

The Pearson Addison-Wesley Data & Analytics Series

Visit informit.com/awdataseries for a complete list of available publications.

The Pearson Addison-Wesley Data & Analytics Series provides readers with practical knowledge for solving problems and answering questions with data. Titles in this series primarily focus on three areas:

1. Infrastructure: how to store, move, and manage data
2. Algorithms: how to mine intelligence or make predictions based on data
3. Visualizations: how to represent data and insights in a meaningful and compelling way

The series aims to tie all three of these areas together to help the reader build end-to-end systems for fighting spam; making recommendations; building personalization; detecting trends, patterns, or problems; and gaining insight from the data exhaust of systems and user interactions.

Make sure to connect with us! informit.com/socialconnect

Foundations of Deep Reinforcement Learning
Theory and Practice in Python

Laura Graesser
Wah Loon Keng

Boston • Columbus • New York • San Francisco • Amsterdam • Cape Town • Dubai • London • Madrid • Milan • Munich • Paris • Montreal • Toronto • Delhi • Mexico City • São Paulo • Sydney • Hong Kong • Seoul • Singapore • Taipei • Tokyo

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at [email protected] or (800) 382-3419.

For government sales inquiries, please contact [email protected].

For questions about sales outside the U.S., please contact [email protected].

Visit us on the Web: informit.com/aw

Library of Congress Control Number: 2019948417

Copyright © 2020 Pearson Education, Inc.

Cover illustration by Wacomka/Shutterstock

SLM Lab is an MIT-licensed open source project.
All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions Department, please visit www.pearson.com/permissions.

ISBN-13: 978-0-13-517238-4
ISBN-10: 0-13-517238-1

For those people who make me feel that anything is possible
—Laura

For my wife Daniela
—Keng

Contents

Foreword
Preface
Acknowledgments
About the Authors

1 Introduction to Reinforcement Learning
  1.1 Reinforcement Learning
  1.2 Reinforcement Learning as MDP
  1.3 Learnable Functions in Reinforcement Learning
  1.4 Deep Reinforcement Learning Algorithms
    1.4.1 Policy-Based Algorithms
    1.4.2 Value-Based Algorithms
    1.4.3 Model-Based Algorithms
    1.4.4 Combined Methods
    1.4.5 Algorithms Covered in This Book
    1.4.6 On-Policy and Off-Policy Algorithms
    1.4.7 Summary
  1.5 Deep Learning for Reinforcement Learning
  1.6 Reinforcement Learning and Supervised Learning
    1.6.1 Lack of an Oracle
    1.6.2 Sparsity of Feedback
    1.6.3 Data Generation
  1.7 Summary

I Policy-Based and Value-Based Algorithms

2 REINFORCE
  2.1 Policy
  2.2 The Objective Function
  2.3 The Policy Gradient
    2.3.1 Policy Gradient Derivation
  2.4 Monte Carlo Sampling
  2.5 REINFORCE Algorithm
    2.5.1 Improving REINFORCE
  2.6 Implementing REINFORCE
    2.6.1 A Minimal REINFORCE Implementation
    2.6.2 Constructing Policies with PyTorch
    2.6.3 Sampling Actions
    2.6.4 Calculating Policy Loss
    2.6.5 REINFORCE Training Loop
    2.6.6 On-Policy Replay Memory
  2.7 Training a REINFORCE Agent
  2.8 Experimental Results
    2.8.1 Experiment: The Effect of Discount Factor γ
    2.8.2 Experiment: The Effect of Baseline
  2.9 Summary
  2.10 Further Reading
  2.11 History

3 SARSA
  3.1 The Q- and V-Functions
  3.2 Temporal Difference Learning
    3.2.1 Intuition for Temporal Difference Learning