Delft University of Technology

Online reinforcement learning control for aerospace systems

Zhou, Ye

DOI: 10.4233/uuid:5b875915-2518-4ec8-a1a0-07ad057edab4
Publication date: 2018
Document version: Final published version

Citation (APA): Zhou, Y. (2018). Online reinforcement learning control for aerospace systems. https://doi.org/10.4233/uuid:5b875915-2518-4ec8-a1a0-07ad057edab4
ONLINE REINFORCEMENT LEARNING CONTROL FOR AEROSPACE SYSTEMS

Dissertation

for the purpose of obtaining the degree of doctor at Delft University of Technology, by the authority of the Rector Magnificus Prof. dr. ir. T.H.J.J. van der Hagen, chair of the Board for Doctorates, to be defended publicly on Wednesday 11 April 2018 at 15:00

by

Ye ZHOU

aerospace engineer, born in Hefei, Anhui, China.

This dissertation has been approved by the promotors: prof. dr. ir. M. Mulder and dr. Q.P. Chu
Copromotor: dr. ir. E. van Kampen

Composition of the doctoral committee:
Rector Magnificus, chairperson
Prof. dr. ir. M. Mulder, Delft University of Technology, promotor
Dr. Q.P. Chu, Delft University of Technology, promotor
Dr. ir. E. van Kampen, Delft University of Technology, copromotor

Independent members:
Prof. dr. J. Si, Arizona State University
Prof. dr.-ing. F. Holzapfel, Technische Universität München
Prof. dr. D.G. Simons, Delft University of Technology
Prof. dr. R. Babuska, Delft University of Technology

Keywords: Reinforcement Learning; Aerospace Systems; Optimal Adaptive Control; Approximate Dynamic Programming; Adaptive Critic Designs; Incremental Model; Nonlinear Systems; Partial Observability; Hierarchical Reinforcement Learning; Hybrid Methods.

Printed by: Ipskamp Printing.
Front & Back: Designed by Ye Zhou.
ISBN 978-94-6366-021-1
An electronic version of this dissertation is available at http://repository.tudelft.nl/.

Copyright © 2018 by Ye ZHOU. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission in writing from the proprietor.

To my beloved parents, husband, and little daughter...

SUMMARY

ONLINE REINFORCEMENT LEARNING CONTROL FOR AEROSPACE SYSTEMS

Ye ZHOU

Recent technological improvements have spurred the development of innovative and more advanced aerospace systems.
The increased complexity of these systems has become one of the major challenges of aerospace control system design. The multi-objective tasks in various applications, ranging from the air domain to the space domain and from military use to commercial use, also increase the automatic control requirements and complexity. Furthermore, the uncertainties in aerospace systems, such as the changing shapes of morphing aircraft, and in the environment, such as sudden gusts, complex air traffic, and space debris impacts, have also heightened the need for online adaptability in control systems. To meet the growing complexity of the system dynamics, the increasing difficulty of control tasks, and the demanding requirement of adaptability, aerospace systems are in urgent need of higher levels of autonomy.

The complexity and diversity of aerospace systems and autonomous control tasks motivate researchers to explore intelligent methods. Intelligent autonomous aerospace systems, on the one hand, need to learn the current system dynamics and the environment online and control the system adaptively and accurately. On the other hand, these systems also need to be able to trade off among multiple objectives and retain safety. Therefore, a complete intelligent system often has a hierarchical control architecture. The low-level control ability is the foundation of the higher levels and limits the improvement of the whole autonomous control system. This limitation is one of the main reasons why many existing high-level autonomous algorithms cannot yet be successfully applied to real aerospace systems. In addition, the intelligence and autonomy of high-level decision-making systems also need improvement, to meet the new challenges in current and future aerospace systems, such as deep-space exploration, indoor guidance and navigation, and self-organized swarm formation.

Reinforcement Learning (RL) is a framework of intelligent, self-learning methods that can be applied to different levels of autonomous operations and applications. It links bio-inspired artificial intelligence techniques to the field of control and decision-making.
RL methods, in the low-level control field, can be used to improve control efficiency and adaptability when the dynamical models are unknown or uncertain. These control problems, such as stabilization and reference tracking, are often modeled in continuous state and action spaces. RL methods, in the high-level decision-making field, can be applied to enhance the intelligence of planning and to ensure coordination with the low-level control. In these problems, state and action spaces can be discrete, continuous, or even hybrid.

RL methods are relatively new in the field of aerospace guidance, navigation, and control. They have many benefits, but also some limitations, when applied to aerospace systems. This dissertation aims to address the following main research question:

How can aerospace systems exploit RL methods to improve their autonomy and online learning with respect to the a priori unknown system and environment, dynamical uncertainties, and partial observability?

This main research question is addressed in three parts, for three specific RL methods and applications: (i) Approximate Dynamic Programming (ADP) for control problems with an approximately convex true cost-to-go, (ii) Adaptive Critic Designs (ACDs) for general nonlinear control problems, and (iii) Hierarchical Reinforcement Learning (HRL) for high-level guidance and navigation. This leads to the following research questions:

1. How can Linear Approximate Dynamic Programming (LADP) be generalized to deal with nonlinear and/or time-varying systems, model mismatch, and partial observations, while retaining its efficiency and mathematical explicitness?

2. How can online adaptive critic designs be devised, and their online adaptability improved, to cope with internal uncertainties, external disturbances, and even sudden faults?

3. How can a systematic hierarchical reinforcement learning controller be established that deals with multiple objectives and partial observability, possesses transfer learning ability, and utilizes diverse RL methods?

To address the first question, this dissertation proposes incremental Approximate Dynamic Programming (iADP) methods.
Instead of using nonlinear function approximators to approximate the true cost-to-go, iADP methods use an (extended) incremental model to deal with the nonlinearity of unknown systems and the uncertainties of the environment. These methods can still apply a quadratic cost function to generate an efficient and mathematically explicit optimal control algorithm. These methods do not need any a priori knowledge of the system dynamics, online identification of a global model, or even an assumption of time-scale separation, but only an online identified (extended) incremental model.

The iADP method is first proposed to solve regulation problems for nonlinear systems. When a direct measurement of the full state is available, the incremental model can be identified to predict the next state. With this prediction and a quadratic cost function, the control increment can be calculated adhering to the optimality principle. When the only measurements are the input/output of the dynamical system, the optimal control increment is calculated with an output-feedback algorithm and an extended incremental model. This method is applied to an unknown nonlinear missile model, with both full-state and output measurements, to iteratively optimize the flight control policy. The simulation results demonstrate that the iADP method improves the closed-loop performance of the nonlinear system, while keeping the design process simple and systematic.

The concept of iADP is further extended to tracking problems for Multiple-Input Multiple-Output (MIMO) nonlinear systems and to partially observable control problems. Because iADP methods have a separate structure to represent the local system dynamics, the cost function can be less dependent on the system or the reference, and only needs to be a rough approximation of the cost-to-go. This approximation is a quadratic function only of the current tracking error, without expanding the dimension of the state space for the cost function to an augmented one.
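To make the incremental-model idea concrete: the local dynamics around the current operating point can be identified online by recursive least squares on state and input increments, Δx(t+1) ≈ F Δx(t) + G Δu(t). The sketch below is a minimal illustration under simplifying assumptions (full-state measurement, a fixed forgetting factor); the class and parameter names are ours, not the dissertation's.

```python
import numpy as np

class IncrementalModelRLS:
    """Recursive-least-squares identification of the incremental model
    x(t+1) - x(t) ~= F (x(t) - x(t-1)) + G (u(t) - u(t-1)).
    Illustrative sketch only; names and tuning values are assumptions."""

    def __init__(self, n_states, n_inputs, forgetting=0.99):
        n = n_states + n_inputs
        self.theta = np.zeros((n, n_states))  # stacked parameters, theta.T = [F | G]
        self.P = np.eye(n) * 1e3              # covariance: large initial uncertainty
        self.lam = forgetting
        self.n_states = n_states

    def update(self, dx_prev, du_prev, dx_next):
        """One RLS step from the measured increments."""
        phi = np.concatenate([dx_prev, du_prev])   # regressor
        err = dx_next - self.theta.T @ phi         # innovation (prediction error)
        Pphi = self.P @ phi
        k = Pphi / (self.lam + phi @ Pphi)         # RLS gain
        self.theta += np.outer(k, err)
        self.P = (self.P - np.outer(k, Pphi)) / self.lam

    @property
    def F(self):
        return self.theta[: self.n_states].T

    @property
    def G(self):
        return self.theta[self.n_states:].T
```

On a linear plant x(t+1) = A x(t) + B u(t), the increments satisfy the incremental model exactly, so F and G converge to A and B under sufficient excitation; on a smooth nonlinear plant they track the local Jacobians, which is what iADP exploits.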
Two observability conditions are considered in this tracking control problem. When a direct measurement of the full state is available, the incremental model can be identified online to design the optimal control increment. In addition, when the only measurement is the output tracking error, which involves tracking a stochastic dynamical reference, the system becomes partially observable. The observations are used to identify the extended incremental model and to predict the next output tracking error for the optimal tracking control. For each observability condition, an off-line learning algorithm is applied to improve the policy iteratively until it is accurate enough, and thereafter an online algorithm is applied to update the policy recursively at each time step. The recursive algorithms can also be used online in real systems, which may differ from the system model used in the iterative learning stage. These algorithms are applied to an attitude control problem of a simulated satellite disturbed by liquid sloshing. The results demonstrate that the proposed algorithms accurately and adaptively deal with time-varying internal dynamics, while retaining efficient control, especially for unknown nonlinear systems with only partial observability.

To answer the second research question, this dissertation develops online ACDs based on the incremental model. ACDs can generally be categorized into three groups: 1) Heuristic Dynamic Programming (HDP), 2) Dual Heuristic Programming (DHP), and 3) Globalized Dual Heuristic Programming (GDHP). In addition, action-dependent variations of these three original versions have been developed by directly connecting the output of the actor to the input of the critic. This dissertation focuses on action-independent ACDs, specifically HDP and DHP.

An Incremental model based Heuristic Dynamic Programming (IHDP) method is proposed to control unknown aerospace systems online and adaptively. This method replaces the global system model approximator with an incremental model. This approach, therefore, does not need off-line training stages and may accelerate online learning.
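The output-feedback algorithms above rely on an extended incremental model built from input/output increments alone, regressing the next output increment on a window of past increments. A minimal batch least-squares sketch of such a model (our illustration, assuming a scalar output; the function name and lag structure are ours, not the dissertation's algorithm):

```python
import numpy as np

def fit_extended_incremental_model(dy, du, n_lags=2):
    """Batch least-squares fit of an extended incremental model:
    dy[t+1] ~= sum_i a_i * dy[t-i] + sum_i b_i * du[t-i], i = 0..n_lags-1,
    using only input/output increments (no state measurement).
    Returns the stacked coefficients [a_0..a_{n-1}, b_0..b_{n-1}]."""
    rows, targets = [], []
    for t in range(n_lags - 1, len(dy) - 1):
        past_dy = dy[t - n_lags + 1 : t + 1][::-1]  # dy[t], dy[t-1], ...
        past_du = du[t - n_lags + 1 : t + 1][::-1]  # du[t], du[t-1], ...
        rows.append(np.concatenate([past_dy, past_du]))
        targets.append(dy[t + 1])
    theta, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(targets), rcond=None)
    return theta
```

In the online setting described above, the same regression would be carried out recursively (as in the RLS example for the full-state case) rather than in one batch; the batch form is shown only because it makes the regressor structure explicit.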
The IHDP method is compared with conventional HDP in an online tracking control task on the unknown nonlinear missile model. The results show that the presented IHDP method speeds up the online learning, has a higher tracking precision, and can deal with a wider range of initial states than the conventional HDP method. In addition, the IHDP method is also applied to the MIMO satellite attitude tracking control problem, disturbed by liquid sloshing and subject to sudden external disturbances. The simulation results also demonstrate that the IHDP method is adaptive and robust to internal uncertainties and external disturbances.
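The computational step that distinguishes IHDP from conventional HDP, propagating the critic's value gradient through the identified input matrix G of the incremental model rather than through a trained global plant model, can be sketched as a single actor update. This is a deliberately simplified illustration with a quadratic control-effort cost; the function and parameter names are ours:

```python
import numpy as np

def ihdp_actor_step(u, G, dV_dx_next, R, lr=0.05):
    """One gradient step on the action u. The one-step cost u^T R u plus the
    critic's cost-to-go V(x(t+1)) is differentiated w.r.t. u; since the
    incremental model gives x(t+1) ~= x(t) + F dx(t) + G du(t), the chain
    rule yields dV/du = G^T dV/dx(t+1), with no global plant model needed."""
    grad = 2.0 * R @ u + G.T @ dV_dx_next
    return u - lr * grad
```

Repeating this step drives u toward the minimizer of the local quadratic objective, which is the role the actor update plays inside the full actor-critic loop; here the critic gradient is held fixed purely for illustration.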