Online Model Learning Algorithms for Actor-Critic Control
Ivo Grondman

Cover: Saturated policy for the pendulum swing-up problem as learned by the model learning actor-critic algorithm, approximated using a network of radial basis functions.

Dissertation for the purpose of obtaining the degree of doctor at the Technische Universiteit Delft, under the authority of the Rector Magnificus prof. ir. K.C.A.M. Luyben, chairman of the Board for Doctorates, to be defended in public on Wednesday 4 March 2015 at 12:30 by Ivo GRONDMAN, Master of Science, Imperial College London, United Kingdom, born in Losser.

This dissertation has been approved by the promotor:
Prof. dr. R. Babuška

Composition of the doctoral committee:
Rector Magnificus, chairman
Prof. dr. R. Babuška, Technische Universiteit Delft, promotor

Independent members:
Prof. dr. ir. B. De Schutter, Technische Universiteit Delft
Prof. dr. ir. P.P. Jonker, Technische Universiteit Delft
Prof. dr. A. Nowé, Vrije Universiteit Brussel
Prof. dr. S. Jagannathan, Missouri University of Science & Technology
Prof. dr. D. Ernst, Université de Liège
Dr. I.L. Buşoniu, Universitatea Tehnică din Cluj-Napoca

Dr. I.L. Buşoniu (Universitatea Tehnică din Cluj-Napoca) has, as daily supervisor, contributed significantly to the realisation of this dissertation.

This thesis has been completed in partial fulfilment of the requirements of the Dutch Institute for Systems and Control (DISC) for graduate studies.

Published and distributed by: Ivo Grondman
E-mail: [email protected]
Web: http://www.grondman.net/

ISBN 978-94-6186-432-1

Copyright © 2015 by Ivo Grondman

All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilised in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without written permission of the author.

Printed in the Netherlands

Acknowledgements

During the past years there were quite a few moments where I thought quitting my PhD project was perhaps the best solution to all the problems and stress it was causing. Now that the thesis is finally finished, there are a lot of people I want to thank for their help, support and encouragement, which kept me from actually quitting. With the risk of forgetting someone who I definitely should have mentioned, here goes...

First, I would like to thank my promotor and supervisor, prof. dr. Robert Babuška, for giving me the opportunity to embark on a PhD and for his efforts to keep me going even after leaving the university. Getting a chance to give several lectures on various control systems courses to both BSc and MSc students was also a great experience. Robert, díky za všechno! (Robert, thanks for everything!)

Despite the large distance between my workplace and his, my daily supervisor dr. Lucian Buşoniu has been of tremendous help. Whenever I got stuck he was always available for a discussion to get me back on track. His suggestions on and corrections to drafts of papers, which were always in abundance, were also greatly appreciated even though I might not have always shown it while working my way through those stacks of paper covered with red ink.

At the start of 2013, I had a very good time at the Missouri University of Science & Technology in Rolla, Missouri, for which I am grateful to prof. dr. Sarangapani Jagannathan and dr. Hao Xu.

Within the Delft Center for Systems and Control, I thank (former) colleagues Mernout, Edwin, Pieter, Gijs, Jan-Willem, Gabriel, Noortje, Kim, Jacopo, Andrea, Marco, Stefan, Subramanya, Sachin, Ilhan and Jan-Maarten for their enjoyable company.
Jeroen and Melody did a great job during their MSc projects and, although he left DCSC before I arrived, Maarten gave me an excellent starting point for my research.

Outside the academic environment, I want to thank my current colleagues, especially Rachel and Jo, for giving me the final push I needed to finish my PhD.

One of the best ways to relieve stress (and lose weight) during the past years turned out to be running, which I probably never would have discovered without my sisters Evelien and Judith.

A less healthy, but nevertheless very agreeable, way to get my mind off of things was provided in bars and clubs or during weekend outings with Herman, Edwin, Bram, Marinus, Wouter T., Wouter W., Achiel, Max, Bertjan, Joris, Chiel, Jochem and Jeroen.

Finally, I would like to thank my parents for their understanding and support during those many, many years I spent in university.

Ivo Grondman
Den Haag, February 2015

Contents

1 Introduction
  1.1 Model-Based Control Design
  1.2 Actor-Critic Reinforcement Learning
  1.3 Focus and Contributions
    1.3.1 Online Model Learning for RL
    1.3.2 Using Reward Function Knowledge
  1.4 Thesis Outline

2 Actor-Critic Reinforcement Learning
  2.1 Introduction
  2.2 Markov Decision Processes
    2.2.1 Discounted Reward
    2.2.2 Average Reward
  2.3 Actor-Critic in the Context of RL
    2.3.1 Critic-Only Methods
    2.3.2 Actor-Only Methods and the Policy Gradient
    2.3.3 Actor-Critic Algorithms
    2.3.4 Policy Gradient Theorem
  2.4 Standard Gradient Actor-Critic Algorithms
    2.4.1 Discounted Return Setting
    2.4.2 Average Reward Setting
  2.5 Natural Gradient Actor-Critic Algorithms
    2.5.1 Natural Gradient in Optimisation
    2.5.2 Natural Policy Gradient
    2.5.3 Natural Actor-Critic Algorithms
  2.6 Applications
  2.7 Discussion

3 Efficient Model Learning Actor-Critic Methods
  3.1 Introduction and Related Work
  3.2 Standard Actor-Critic
  3.3 Model Learning Actor-Critic
    3.3.1 The Process Model
    3.3.2 Model-Based Policy Gradient
  3.4 Reference Model Actor-Critic
  3.5 Function Approximators
    3.5.1 Radial Basis Functions
    3.5.2 Local Linear Regression
    3.5.3 Tile Coding
  3.6 Example: Pendulum Swing-Up
    3.6.1 Standard Actor-Critic
    3.6.2 Model Learning Actor-Critic
    3.6.3 Reference Model Actor-Critic
  3.7 Discussion

4 Solutions to Finite Horizon Cost Problems Using Actor-Critic RL
  4.1 Introduction
  4.2 Markov Decision Processes for the Finite Horizon Cost Setting
  4.3 Actor-Critic RL for Finite Horizon MDPs
    4.3.1 Parameterising a Time-Varying Actor and Critic
    4.3.2 Standard Actor-Critic
    4.3.3 Model Learning Actor-Critic
    4.3.4 Reference Model Actor-Critic
  4.4 Simulation Results
    4.4.1 Finite Horizon Standard Actor-Critic
    4.4.2 Finite Horizon Model Learning Actor-Critic
    4.4.3 Finite Horizon Reference Model Actor-Critic
  4.5 Discussion

5 Simulations with a Two-Link Manipulator
  5.1 Simulation Setup
  5.2 Consequences for Model Learning Methods
  5.3 Case I: Learn to Inject Proper Damping
    5.3.1 Standard Actor-Critic
    5.3.2 Model Learning Actor-Critic
  5.4 Case II: Learn to Find a Nontrivial Equilibrium
    5.4.1 Standard Actor-Critic
