Sparse Value Function Approximation for Reinforcement Learning

by

Christopher Painter-Wakefield

Department of Computer Science
Duke University

Approved:
Ronald Parr, Supervisor
Vincent Conitzer
Kamesh Munagala
Lawrence Carin

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University
2013

Copyright © 2013 by Christopher Painter-Wakefield
All rights reserved

Abstract

A key component of many reinforcement learning (RL) algorithms is the approximation of the value function. The design and selection of features for approximation in RL is crucial, and an ongoing area of research. One approach to the problem of feature selection is to apply sparsity-inducing techniques in learning the value function approximation; such sparse methods tend to select relevant features and ignore irrelevant features, thus automating the feature selection process. This dissertation describes three contributions in the area of sparse value function approximation for reinforcement learning.

One method for obtaining sparse linear approximations is the inclusion in the objective function of a penalty on the sum of the absolute values of the approximation weights. This L1 regularization approach was first applied to temporal difference learning in the LARS-inspired, batch learning algorithm LARS-TD.
In our first contribution, we define an iterative update equation which has as its fixed point the L1 regularized linear fixed point of LARS-TD. The iterative update gives rise naturally to an online stochastic approximation algorithm. We prove convergence of the online algorithm and show that the L1 regularized linear fixed point is an equilibrium fixed point of the algorithm. We demonstrate the ability of the algorithm to converge to the fixed point, yielding a sparse solution with modestly better performance than unregularized linear temporal difference learning.

Our second contribution extends LARS-TD to integrate policy optimization with sparse value learning. We extend the L1 regularized linear fixed point to include a maximum over policies, defining a new, “greedy” fixed point. The greedy fixed point adds a new invariant to the set which LARS-TD maintains as it traverses its homotopy path, giving rise to a new algorithm integrating sparse value learning and optimization. The new algorithm is demonstrated to be similar in performance with policy iteration using LARS-TD.

Finally, we consider another approach to sparse learning, that of using a simple algorithm that greedily adds new features. Such algorithms have many of the good properties of the L1 regularization methods, while also being extremely efficient and, in some cases, allowing theoretical guarantees on recovery of the true form of a sparse target function from sampled data. We consider variants of orthogonal matching pursuit (OMP) applied to RL. The resulting algorithms are analyzed and compared experimentally with existing L1 regularized approaches. We demonstrate that perhaps the most natural scenario in which one might hope to achieve sparse recovery fails; however, one variant provides promising theoretical guarantees under certain assumptions on the feature dictionary while another variant empirically outperforms prior methods both in approximation accuracy and efficiency on several benchmark problems.

For Patty, Ian, and Timothy.

Contents

Abstract
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Document Organization and Contributions
2 Background
  2.1 Markov Decision Process
  2.2 Reinforcement Learning
    2.2.1 Monte Carlo Estimation
    2.2.2 Fitted Value Iteration
    2.2.3 Temporal Difference Learning
    2.2.4 LSTD
  2.3 Linear Approximation
    2.3.1 Regularization
  2.4 Linear Approximation in Reinforcement Learning
    2.4.1 Bellman Residual Minimization
    2.4.2 Linear Fixed Point
3 Related Work
4 L1 Regularized Linear Temporal Difference Learning
  4.1 L1-Regularized Linear Fixed Point
  4.2 L1-Regularized Linear TD
    4.2.1 Linear TD as Gradient Descent
    4.2.2 Linear TD with Soft-Thresholding
    4.2.3 A Modification
    4.2.4 Convergence
  4.3 Experiments
    4.3.1 Blackjack
    4.3.2 Mountain Car
    4.3.3 Pendulum
  4.4 Discussion
5 Simultaneous Feature Selection and Optimization via Least Angle Regression
  5.1 LARS-TD
  5.2 The LARQ Algorithm
  5.3 LARQ Theoretical Properties
  5.4 Experimental Results
    5.4.1 50-state Chain
    5.4.2 Inverted Pendulum
  5.5 Related Work
  5.6 Discussion
6 Greedy Algorithms for Sparse Reinforcement Learning
  6.1 Prior Art
    6.1.1 OMP for Regression
    6.1.2 L1 Regularization in RL
  6.2 OMP for RL
    6.2.1 Sparse Recovery in OMP-TD
    6.2.2 Sparse Recovery in OMP-BRM
    6.2.3 Sparse Recovery Behavior
  6.3 Proofs of Theorems
    6.3.1 Extension to Approximately Sparse Case
    6.3.2 Proof of Theorem 10
  6.4 Experiments
    6.4.1 Results
  6.5 Discussion
7 Summary and Conclusions
  7.1 L1 Regularized Linear Temporal Difference Learning
  7.2 Simultaneous Feature Selection and Optimization via Least Angle Regression
  7.3 Greedy Algorithms for Sparse Reinforcement Learning
  7.4 Conclusion
Bibliography
Biography

List of Tables

6.1 Benchmark experiment properties and experimental settings.
Description: