University of Massachusetts Amherst
ScholarWorks@UMass Amherst

Open Access Dissertations

9-2010

Optimization-based Approximate Dynamic Programming

Marek Petrik
University of Massachusetts Amherst

Recommended Citation
Petrik, Marek, "Optimization-based Approximate Dynamic Programming" (2010). Open Access Dissertations. 308.
https://doi.org/10.7275/1672083
https://scholarworks.umass.edu/open_access_dissertations/308

OPTIMIZATION-BASED APPROXIMATE DYNAMIC PROGRAMMING

A Dissertation Presented

by

MAREK PETRIK

Submitted to the Graduate School of the
University of Massachusetts Amherst in partial fulfillment
of the requirements for the degree of

DOCTOR OF PHILOSOPHY

September 2010

Department of Computer Science

© Copyright by Marek Petrik 2010
All Rights Reserved

OPTIMIZATION-BASED APPROXIMATE DYNAMIC PROGRAMMING

A Dissertation Presented

by

MAREK PETRIK

Approved as to style and content by:

Shlomo Zilberstein, Chair
Andrew Barto, Member
Sridhar Mahadevan, Member
Ana Muriel, Member
Ronald Parr, Member

Andrew Barto, Department Chair
Department of Computer Science

To my parents Fedor and Mariana

ACKNOWLEDGMENTS

I want to thank the people who made my stay at UMass not only productive, but also very enjoyable. I am grateful to my advisor, Shlomo Zilberstein, for guiding and supporting me throughout the completion of this work. Shlomo's thoughtful advice and probing questions greatly influenced both my thinking and research.
His advice was essential not only in forming and refining many of the ideas described in this work, but also in assuring that I become a productive member of the research community. I hope that, one day, I will be able to become an advisor who is just as helpful and dedicated as he is.

The members of my dissertation committee were indispensable in forming and steering the topic of this dissertation. The class I took with Andrew Barto motivated me to probe the foundations of reinforcement learning, which became one of the foundations of this thesis. Sridhar Mahadevan's exciting work on representation discovery led me to deepen my understanding of, and better appreciate, approximate dynamic programming. I really appreciate the detailed comments and encouragement that Ron Parr provided on my research and thesis drafts. Ana Muriel helped me to better understand the connections between my research and applications in operations research. Coauthoring papers with Jeff Johns, Bruno Scherrer, and Gavin Taylor was a very stimulating learning experience. My research was also influenced by interactions with many other researchers. The conversations with Raghav Aras, Warren Powell, Scott Sanner, and Csaba Szepesvari were especially illuminating. This work was also supported by generous funding from the Air Force Office of Scientific Research.

Conversations with my labmate Hala Mostafa made the long hours in the lab much more enjoyable. While our conversations often did not involve research, those that did motivated me to think deeper about the foundations of my work. I also found sharing ideas with my fellow grad students Martin Allen, Chris Amato, Alan Carlin, Phil Kirlin, Akshat Kumar, Sven Seuken, Siddharth Srivastava, and Feng Wu helpful in understanding the broader research topics. My free time at UMass kept me sane thanks to many great friends that I found here.

Finally, and most importantly, I want to thank my family. They were supportive and helpful throughout the long years of my education. My mom's loving kindness and my dad's intense fascination with the world were especially important in forming my interests and work habits.
My wife Jana has been an incredible source of support and motivation in both research and private life; her companionship made it all worthwhile. It was a great journey.

ABSTRACT

OPTIMIZATION-BASED APPROXIMATE DYNAMIC PROGRAMMING

SEPTEMBER 2010

MAREK PETRIK
Mgr., UNIVERZITA KOMENSKEHO, BRATISLAVA, SLOVAKIA
M.Sc., UNIVERSITY OF MASSACHUSETTS AMHERST
Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST

Directed by: Professor Shlomo Zilberstein

Reinforcement learning algorithms hold promise in many complex domains, such as resource management and planning under uncertainty. Most reinforcement learning algorithms are iterative — they successively approximate the solution based on a set of samples and features. Although these iterative algorithms can achieve impressive results in some domains, they are not sufficiently reliable for wide applicability; they often require extensive parameter tweaking to work well and provide only weak guarantees of solution quality.

Some of the most interesting reinforcement learning algorithms are based on approximate dynamic programming (ADP). ADP, also known as value function approximation, approximates the value of being in each state. This thesis presents new reliable algorithms for ADP that use optimization instead of iterative improvement. Because these optimization-based algorithms explicitly seek solutions with favorable properties, they are easy to analyze, offer much stronger guarantees than iterative algorithms, and have few or no parameters to tweak. In particular, we improve on approximate linear programming — an existing method — and derive approximate bilinear programming — a new robust approximate method.

The strong guarantees of optimization-based algorithms not only increase confidence in the solution quality, but also make it easier to combine the algorithms with other ADP components. The other components of ADP are the samples and features used to approximate the value function.
Relying on the simplified analysis of optimization-based methods, we derive new bounds on the error due to missing samples. These bounds are simpler, tighter, and more practical than the existing bounds for iterative algorithms and can be used to evaluate solution quality in practical settings. Finally, we propose homotopy methods that use the sampling bounds to automatically select good approximation features for optimization-based algorithms. Automatic feature selection significantly increases the flexibility and applicability of the proposed ADP methods.

The methods presented in this thesis can potentially be used in many practical applications in artificial intelligence, operations research, and engineering. Our experimental results show that optimization-based methods may perform well on resource-management problems and standard benchmark problems and therefore represent an attractive alternative to traditional iterative methods.

CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
LIST OF FIGURES

CHAPTER

1. INTRODUCTION
   1.1 Planning Models
   1.2 Challenges and Contributions
   1.3 Outline

PART I: FORMULATIONS

2. FRAMEWORK: APPROXIMATE DYNAMIC PROGRAMMING
   2.1 Framework and Notation
   2.2 Model: Markov Decision Process
   2.3 Value Functions and Policies
   2.4 Approximately Solving Markov Decision Processes
   2.5 Approximation Error: Online and Offline
   2.6 Contributions

3. ITERATIVE VALUE FUNCTION APPROXIMATION
   3.1 Basic Algorithms
   3.2 Bounds on Approximation Error