Transfer Learning for Actor-Critic methods in Lipschitz Markov Decision Processes PDF

135 Pages·2017·1.69 MB·English
by Daniel Felipe Vacca Manrique

Preview Transfer Learning for Actor-Critic methods in Lipschitz Markov Decision Processes

Politecnico di Milano
Facoltà di Ingegneria
Scuola di Ingegneria Industriale e dell'Informazione
Dipartimento di Elettronica, Informazione e Bioingegneria
Master of Science in Computer Science and Engineering

Transfer Learning for Actor-Critic methods in Lipschitz Markov Decision Processes

Supervisor: Prof. Marcello Restelli
Assistant Supervisors: Dott. Matteo Pirotta, Ing. Andrea Tirinzoni
Master Graduation Thesis by: Daniel Felipe Vacca Manrique, Student Id n. 852802
Academic Year 2016-2017

To Mis Increíbles

ACKNOWLEDGMENTS

I thank Prof. Marcello Restelli for giving me this first opportunity to come into contact with the world of research, and for his guidance and support, constant throughout this journey, without which I would not have achieved this result. I thank Dott. Matteo Pirotta who, with his valuable experience, also contributed significantly to the success of this work, despite the physical distance that separated us. I also want to thank Andrea Tirinzoni who, although present only in the final months, was always willing to lend me a hand when I needed it. Thanks again to all three of you for this experience of professional growth.

I also thank Mis Increíbles, whose support never wavered since I began this crazy dream of coming to Italy 5 years ago; more than the financial support, the moral encouragement they gave me all this time is what carried me to the end. Thanks to the rest of my family and my friends in Colombia, who always believed that I would reach this goal. Thanks, finally, to everyone who, during these two years in Milan, took part in what has been the best experience of my life.

CONTENTS

Abstract
Estratto
1 Introduction
  1.1 Motivation
  1.2 Goal
  1.3 Contribution
  1.4 Outline
2 Reinforcement Learning
  2.1 Theoretical framework: Markov Decision Processes
    2.1.1 The agent: Policies and Markov Reward Processes
    2.1.2 The goal: Cumulative rewards
    2.1.3 Value functions
    2.1.4 Bellman operators and Bellman equations
  2.2 Brief taxonomy of Reinforcement Learning algorithms
    2.2.1 Model requirements: Model-based vs. Model-free
    2.2.2 Policy-based sampling strategy: On-policy vs. Off-policy
    2.2.3 Solution strategy: Policy-based vs. Value-based
    2.2.4 Sample usage: Online vs. Offline
  2.3 Policy gradient
    2.3.1 Finite differences
    2.3.2 Trajectory-based policy gradient
    2.3.3 State-action-based policy gradient
    2.3.4 Natural gradient
  2.4 Policy evaluation
    2.4.1 Monte Carlo estimation
    2.4.2 Temporal Difference estimation
    2.4.3 Policy evaluation with function approximators
      2.4.3.1 The objective functions
      2.4.3.2 Optimization mechanisms
      2.4.3.3 Least Squares Temporal Difference
  2.5 The actor-critic approach
  2.6 Lipschitz Markov Decision Processes
3 Transfer Learning
  3.1 Transfer Learning concepts for Reinforcement Learning
    3.1.1 Transferable knowledge and a Transfer Learning-Reinforcement Learning taxonomy
    3.1.2 Performance measures for Transfer Learning-Reinforcement Learning algorithms
  3.2 Transfer Learning algorithms in Reinforcement Learning
4 Transfer Learning approaches for actor-critic algorithms
  4.1 The setting: Lipschitz continuous task environments
  4.2 The problem
  4.3 The actor-critic implementation
    4.3.1 The critic
    4.3.2 The actor
  4.4 Transfer with Importance Sampling
    4.4.1 The critic
    4.4.2 The actor
  4.5 Transfer with an optimistic approach
    4.5.1 The critic
    4.5.2 The actor
  4.6 Transfer with a pessimistic approach
    4.6.1 The critic
    4.6.2 The actor
5 Experiments
  5.1 Task environment: Mountain Car
  5.2 Experimental instances
  5.3 Analysis of the results
6 Conclusions and future work
Bibliography
A Importance Sampling
  A.1 Mathematical formulation and properties
  A.2 Importance Sampling in Reinforcement Learning
B Kantorovich distance and local information
C Lipschitz continuity
  C.1 Lipschitz continuity of the tuples distribution
  C.2 Lipschitz continuity of the matrices
  C.3 Lipschitz continuity of the policy performance
  C.4 Lipschitz continuity of the performance gradient
D Other derivations
  D.1 Local Lipschitz continuity and Kantorovich Lipschitz continuity
  D.2 On the objective functions
  D.3 On the proximity of the optimal parameters

LIST OF FIGURES

Figure 1.1 General agent-environment model
Figure 2.1 Agent-environment model in Reinforcement Learning
Figure 2.2 Geometrical relation between the MSBE and MSPBE
Figure 2.3 Actor-critic architecture
Figure 3.1 Transfer Learning framework
Figure 3.2 Transfer Learning metrics
Figure 3.3 Transfer Learning cost scenarios
Figure 5.1 The Mountain Car task
Figure 5.2 NoTransfer learning curve
Figure 5.3 Learning curves for the IS experiments
Figure 5.4 Learning curves for the Min experiments
Figure 5.5 Learning curves for the MinMax experiments
Figure 5.6 Effective sample size for transfer from the optimal policy
Figure 5.7 Effective sample size for transfer from the worst policy
Figure B.1 Kantorovich counterexample

LIST OF TABLES

Table 2.1 Model-free and Model-based algorithms
Table 2.2 On-policy and Off-policy algorithms
Table 2.3 Policy-based and Value-based algorithms
Table 2.4 Temporal difference algorithms
Table 5.1 List of experiments

LIST OF ALGORITHMS

Figure 4.1 Actor-critic algorithm in the no-transfer scenario
Figure 4.2 LSTD in the no-transfer scenario
Figure 4.3 Gradient estimation in the no-transfer scenario
Figure 4.4 LSTD in the Importance Sampling scenario
Figure 4.5 Gradient estimation in the Importance Sampling scenario
Figure 4.6 Actor-critic algorithm in the Importance Sampling scenario

ACRONYMS

RL      Reinforcement Learning
MDP     Markov Decision Process
POMDP   Partially Observable Markov Decision Process
MRP     Markov Reward Process
MC      Monte Carlo
FIM     Fisher Information Matrix
TD      Temporal Difference
MSE     Mean Squared Error
MSBE    Mean Squared Bellman Error
MSTDE   Mean Squared Temporal Difference Error
LSTD    Least Squares Temporal Difference
MSPBE   Mean Squared Projected Bellman Error
NEU     Norm of Expected TD Update
OPE     Operator Error
FPE     Fixed-Point Error
SGD     Stochastic Gradient Descent
SVD     Singular Value Decomposition
IS      Importance Sampling
ESS     Effective Sample Size
TL      Transfer Learning
PLC     Pointwise Lipschitz Continuous

Description:
Both techniques are compared with a transfer mechanism based on Importance Sampling (IS) estimators. The optimistic approach produces good results in most of the […] actor-critic techniques for the continuous scenario […]
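The IS-based transfer mechanism mentioned in the description reweights samples collected in a source task by the ratio between target-task and source-task densities, and the Effective Sample Size (ESS, one of the quantities tracked in the experiments) measures how much those reweighted samples are worth. The following Python sketch is only an illustration of that general idea, not the thesis's implementation: the Gaussian source and target distributions are placeholders standing in for the source-task and target-task sample distributions.

    import numpy as np
    from scipy.stats import norm

    # Placeholder distributions: stand-ins for the source-task and target-task
    # sample distributions (the thesis works with (s, a, r, s') tuples instead).
    source = norm(loc=0.0, scale=1.0)
    target = norm(loc=0.5, scale=1.0)

    x = source.rvs(size=1000, random_state=0)   # samples drawn in the source task
    w = target.pdf(x) / source.pdf(x)           # IS weights p_target(x) / p_source(x)

    # Self-normalized IS estimate of a target-task expectation (here, the mean).
    is_estimate = np.sum(w * x) / np.sum(w)

    # Effective Sample Size: roughly how many target-task samples the
    # reweighted source samples are equivalent to, ESS = (sum w)^2 / sum w^2.
    ess = np.sum(w) ** 2 / np.sum(w ** 2)

    print(f"IS estimate of target mean: {is_estimate:.3f}  (true mean 0.5)")
    print(f"ESS: {ess:.1f} out of {x.size} source samples")

The further apart the source and target distributions are, the more the weights concentrate on a few samples and the lower the ESS, which is why the ESS plots in Chapter 5 differ when transferring from the optimal and from the worst source policies.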
