Table of Contents

Reinforcement Learning

Adaptive Computation and Machine Learning
Thomas Dietterich, series editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, associate editors

Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto

Richard S. Sutton and Andrew G. Barto
Reinforcement Learning
An Introduction

A Bradford Book
The MIT Press
Cambridge, Massachusetts
London, England

© 1998 Richard S. Sutton and Andrew G. Barto

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

This book was set in Times Roman by Windfall Software using ZzTeX and was printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Sutton, Richard S.
Reinforcement learning: an introduction / Richard S. Sutton and Andrew G. Barto.
p. cm. -- (Adaptive computation and machine learning)
"A Bradford book."
Includes bibliographical references and index.
ISBN 0-262-19398-1 (alk. paper)
1. Reinforcement learning (Machine learning) I. Barto, Andrew G. II. Title. III. Series.
Q325.6.S88 1998
006.3′1--dc21 97-26416 CIP

In memory of A. Harry Klopf

Contents

Series Foreword
Preface

I The Problem

1 Introduction
  1.1 Reinforcement Learning
  1.2 Examples
  1.3 Elements of Reinforcement Learning
  1.4 An Extended Example: Tic-Tac-Toe
  1.5 Summary
  1.6 History of Reinforcement Learning
  1.7 Bibliographical Remarks

2 Evaluative Feedback
  2.1 An n-Armed Bandit Problem
  2.2 Action-Value Methods
  2.3 Softmax Action Selection
  *2.4 Evaluation Versus Instruction
  2.5 Incremental Implementation
  2.6 Tracking a Nonstationary Problem
  2.7 Optimistic Initial Values
  *2.8 Reinforcement Comparison
  *2.9 Pursuit Methods
  *2.10 Associative Search
  2.11 Conclusions
  2.12 Bibliographical and Historical Remarks

3 The Reinforcement Learning Problem
  3.1 The Agent–Environment Interface
  3.2 Goals and Rewards
  3.3 Returns
  3.4 Unified Notation for Episodic and Continuing Tasks
  *3.5 The Markov Property
  3.6 Markov Decision Processes
  3.7 Value Functions
  3.8 Optimal Value Functions
  3.9 Optimality and Approximation
  3.10 Summary
  3.11 Bibliographical and Historical Remarks

II Elementary Solution Methods

4 Dynamic Programming
  4.1 Policy Evaluation
  4.2 Policy Improvement
  4.3 Policy Iteration
  4.4 Value Iteration
  4.5 Asynchronous Dynamic Programming
  4.6 Generalized Policy Iteration
  4.7 Efficiency of Dynamic Programming
  4.8 Summary
  4.9 Bibliographical and Historical Remarks

5 Monte Carlo Methods
  5.1 Monte Carlo Policy Evaluation
  5.2 Monte Carlo Estimation of Action Values
  5.3 Monte Carlo Control
  5.4 On-Policy Monte Carlo Control
  5.5 Evaluating One Policy While Following Another
  5.6 Off-Policy Monte Carlo Control
  5.7 Incremental Implementation
  5.8 Summary
  5.9 Bibliographical and Historical Remarks

6 Temporal-Difference Learning
  6.1 TD Prediction
  6.2 Advantages of TD Prediction Methods
  6.3 Optimality of TD(0)
  6.4 Sarsa: On-Policy TD Control
  6.5 Q-Learning: Off-Policy TD Control
  *6.6 Actor–Critic Methods
  *6.7 R-Learning for Undiscounted Continuing Tasks
  6.8 Games, Afterstates, and Other Special Cases
  6.9 Summary
  6.10 Bibliographical and Historical Remarks

III A Unified View

7 Eligibility Traces
  7.1 n-Step TD Prediction
  7.2 The Forward View of TD(λ)
  7.3 The Backward View of TD(λ)
  7.4 Equivalence of Forward and Backward Views
  7.5 Sarsa(λ)
  7.6 Q(λ)
  *7.7 Eligibility Traces for Actor–Critic Methods
  7.8 Replacing Traces
  7.9 Implementation Issues
  *7.10 Variable λ
  7.11 Conclusions
  7.12 Bibliographical and Historical Remarks

8 Generalization and Function Approximation
  8.1 Value Prediction with Function Approximation
  8.2 Gradient-Descent Methods
  8.3 Linear Methods
  8.4 Control with Function Approximation
  8.5 Off-Policy Bootstrapping
  8.6 Should We Bootstrap?
  8.7 Summary
  8.8 Bibliographical and Historical Remarks

9 Planning and Learning
  9.1 Models and Planning
  9.2 Integrating Planning, Acting, and Learning
  9.3 When the Model Is Wrong
  9.4 Prioritized Sweeping
  9.5 Full vs. Sample Backups
  9.6 Trajectory Sampling
  9.7 Heuristic Search
  9.8 Summary
  9.9 Bibliographical and Historical Remarks

10 Dimensions of Reinforcement Learning
  10.1 The Unified View
  10.2 Other Frontier Dimensions

11 Case Studies
  11.1 TD-Gammon
  11.2 Samuel’s Checkers Player
  11.3 The Acrobot
  11.4 Elevator Dispatching
  11.5 Dynamic Channel Allocation
  11.6 Job-Shop Scheduling

References
Summary of Notation
Index