Approximate Solutions to Markov Decision Processes

Geoffrey J. Gordon

June 1999
CMU-CS-99-143

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Thesis Committee:
Tom Mitchell, Chair
Andrew Moore
John Lafferty
Satinder Singh Baveja, AT&T Labs Research

Copyright © Geoffrey J. Gordon, 1999

This research is sponsored in part by the Defense Advanced Research Projects Agency (DARPA) under Contract Nos. F30602-97-1-0215 and F33615-93-1-1330, by the National Science Foundation (NSF) under Grant No. BES-9402439, and by an NSF Graduate Research Fellowship. The views and conclusions expressed in this publication are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of DARPA, NSF, or the U.S. government.

Keywords: machine learning, reinforcement learning, dynamic programming, Markov decision processes (MDPs), linear programming, convex programming, function approximation, worst-case learning, regret bounds, statistics, fitted value iteration, convergence of numerical methods

Abstract

One of the basic problems of machine learning is deciding how to act in an uncertain world. For example, if I want my robot to bring me a cup of coffee, it must be able to compute the correct sequence of electrical impulses to send to its motors to navigate from the coffee pot to my office. In fact, since the results of its actions are not completely predictable, it is not enough just to compute the correct sequence; instead the robot must sense and correct for deviations from its intended path.

In order for any machine learner to act reasonably in an uncertain environment, it must solve problems like the above one quickly and reliably. Unfortunately, the world is often so complicated that it is difficult or impossible to find the optimal sequence of actions to achieve a given goal.
So, in order to scale our learners up to real-world problems, we usually must settle for approximate solutions.

One representation for a learner's environment and goals is a Markov decision process or MDP. MDPs allow us to represent actions that have probabilistic outcomes, and to plan for complicated, temporally-extended goals. An MDP consists of a set of states that the environment can be in, together with rules for how the environment can change state and for what the learner is supposed to do.

One way to approach a large MDP is to try to compute an approximation to its optimal state evaluation function, the function which tells us how much reward the learner can be expected to achieve if the world is in a particular state. If the approximation is good enough, we can use a shallow search to find a good action from most states. Researchers have tried many different ways to approximate evaluation functions. This thesis aims for a middle ground, between algorithms that don't scale well because they use an impoverished representation for the evaluation function and algorithms that we can't analyze because they use too complicated a representation.

Acknowledgements

This work would not have been possible without the support of my thesis committee and the rest of the faculty at Carnegie Mellon. Thanks in particular to my advisor Tom Mitchell and to Andrew Moore for helping me to see both the forest and the trees, and to Tom Mitchell for finding the funding to let me work on the interesting problems rather than the lucrative ones. Thanks also to John Lafferty, Avrim Blum, Steven Rudich, and Satinder Singh Baveja, both for the advice they gave me and the knowledge they taught me.

Finally, and most of all, thanks to my wife Paula for supporting me, encouraging me, listening to my complaints, making me laugh, laughing at my jokes, taking the time just to sit and talk with me, and putting up with the time I couldn't spend with her.

Contents

1 INTRODUCTION
    1.1 Markov decision processes
    1.2 MDP examples
    1.3 Value iteration

2 FITTED VALUE ITERATION
    2.1 Discounted processes
        2.1.1 Approximators as mappings
        2.1.2 Averagers
    2.2 Nondiscounted processes
    2.3 Converging to what?
    2.4 In practice
    2.5 Experiments
        2.5.1 Puddle world
        2.5.2 Hill-car
        2.5.3 Hill-car the hard way
    2.6 Summary
    2.7 Proofs
        2.7.1 Can expansive approximators work?
        2.7.2 Nondiscounted case
        2.7.3 Error bounds
        2.7.4 The embedded process for Q-learning

3 CONVEX ANALYSIS AND INFERENCE
    3.1 The inference problem
    3.2 Convex duality
    3.3 Proof strategy
        3.3.1 Existence
        3.3.2 One-step regret
        3.3.3 Amortized analysis
        3.3.4 Specific bounds
    3.4 Weighted Majority
    3.5 Log loss
    3.6 Generalized gradient descent
    3.7 General regret bounds
        3.7.1 Preliminaries
        3.7.2 Examples
        3.7.3 The bound
    3.8 GGD examples
    3.9 Inference in exponential families
        3.9.1 Regret bounds
        3.9.2 A Bayesian interpretation
    3.10 Regression problems
        3.10.1 Matching loss functions
        3.10.2 Regret bounds
        3.10.3 Multidimensional outputs
    3.11 Linear regression algorithms
    3.12 Discussion

4 CONVEX ANALYSIS AND MDPS
    4.1 The Bellman equations
    4.2 The dual of the Bellman equations
        4.2.1 Linear programming duality
        4.2.2 LPs and convex duality
        4.2.3 The dual Bellman equations
    4.3 Incremental computation
    4.4 Soft constraints
    4.5 A statistical interpretation
        4.5.1 Maximum Likelihood in Exponential Families
        4.5.2 Maximum Entropy and Duality
        4.5.3 Relationship to linear programming and MDPs
    4.6 Introducing approximation
        4.6.1 A first try
        4.6.2 Approximating flows as well as values
        4.6.3 An analogy
        4.6.4 Open problems
    4.7 Implementation
        4.7.1 Overview
        4.7.2 Details
    4.8 Experiments
        4.8.1 Tiny MDP
        4.8.2 Tetris
        4.8.3 Hill-car
    4.9 Discussion

5 RELATED WORK
    5.1 Discrete problems
    5.2 Continuous problems
        5.2.1 Linear-Quadratic-Gaussian MDPs
        5.2.2 Continuous time
        5.2.3 Linearity in controls
    5.3 Approximation
        5.3.1 State aggregation
        5.3.2 Interpolated value iteration
        5.3.3 Linear programming
        5.3.4 Least squares
        5.3.5 Collocation and Galerkin methods
        5.3.6 Squared Bellman error
        5.3.7 Multi-step methods
        5.3.8 Stopping problems
        5.3.9 Approximate policy iteration
        5.3.10 Policies without values
        5.3.11 Linear-quadratic-Gaussian approximations
    5.4 Incremental algorithms
        5.4.1 TD(λ)
        5.4.2 Q-learning
    5.5 Other methods
    5.6 Summary

6 SUMMARY OF CONTRIBUTIONS