Robust Planning in Domains with Stochastic Outcomes, Adversaries, and Partial Observability

Hugh Brendan McMahan
CMU-CS-06-166
December 2006

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
Avrim Blum, Co-Chair
Geoffrey Gordon, Co-Chair
Jeff Schneider
Andrew Ng, Stanford University

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2006 Hugh Brendan McMahan

This research was sponsored in part by the Defense Advanced Research Projects Agency (DARPA) under contract nos. F30602-01-C-0219 and HR0011-06-1-0023, and by the National Science Foundation (NSF) under grants CCR-0105488, NSF-ITR CCR-0122581, and NSF-ITR IIS-0312814. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of DARPA, the NSF, the U.S. government, or any other entity.

Keywords: Planning, Game Theory, Markov Decision Processes, Extensive-form Games, Convex Games, Algorithms.

Abstract

Real-world planning problems often feature multiple sources of uncertainty, including randomness in outcomes, the presence of adversarial agents, and lack of complete knowledge of the world state. This thesis describes algorithms for four related formal models that can address multiple types of uncertainty: Markov decision processes, MDPs with adversarial costs, extensive-form games, and a new class of games that includes both extensive-form games and MDPs as special cases.

Markov decision processes can represent problems where actions have stochastic outcomes. We describe several new algorithms for MDPs, and then show how MDPs can be generalized to model the presence of an adversary that has some control over costs. Extensive-form games can model games with random events and partial observability. In the zero-sum perfect-recall case, a minimax solution can be found in time polynomial in the size of the game tree.
However, the game tree must "remember" all past actions and random outcomes, and so the size of the game tree grows exponentially in the length of the game. This thesis introduces a new generalization of extensive-form games that relaxes this need to remember all past actions exactly, producing exponentially smaller representations for interesting problems. Further, this formulation unifies extensive-form games with MDP planning.

We present a new class of fast anytime algorithms for the off-line computation of minimax equilibria in both traditional and generalized extensive-form games. Experimental results demonstrate their effectiveness on an adversarial MDP problem and on a large abstracted poker game. We also present a new algorithm for playing repeated extensive-form games that can be used when only the total payoff of the game is observed on each round.

Acknowledgments

A great many people have helped me along the path to this thesis. My parents inspired my creativity and provided the education that started the journey. Laura Schueller deserves special credit for introducing me to the world of scary (I mean, higher) mathematics and teaching me how to think rigorously. Andrzej Proskurowski helped me see how to bridge the gap between mathematics and computer science.

The support of my advisors, Geoff Gordon and Avrim Blum, has been invaluable. They have been diligent guides through the research process and reliable sources of enthusiasm and fresh ideas. Many other people have helped shape and inspire this work; I am particularly grateful for the support of my other committee members, Andrew Ng and Jeff Schneider.

My time in Pittsburgh has been blessed with wonderful new friendships and the opportunity to build on old ones. You know who you are. Thank you! And last but not least, I'm especially grateful for the love and support of my wife, Amy. She knows how to get me to relax and how to help me stay focused, and she can always tell which of those I need. I couldn't ask for more.
To everyone who has aided my efforts, both those named here and those not, let me simply say: thank you.

Contents

1 Introduction
2 Algorithms for Planning in Markov Decision Processes
  2.1 Introduction
  2.2 Markov Decision Processes
  2.3 Approaches based on Prioritization and Policy Evaluation
    2.3.1 Improved Prioritized Sweeping
    2.3.2 Prioritized Policy Iteration
    2.3.3 Gauss-Dijkstra Elimination
    2.3.4 Incremental Expansions
    2.3.5 Experimental Results
    2.3.6 Discussion
  2.4 Bounded Real-Time Dynamic Programming
    2.4.1 Basic Results
    2.4.2 Monotonic Upper Bounds in Linear Time
    2.4.3 Bounded RTDP
    2.4.4 Initialization Assumptions and Performance Guarantees
    2.4.5 Experimental Results
3 Bilinear-payoff Convex Games
  3.1 From Matrix Games to Convex Games
    3.1.1 Solution via Convex Optimization, and the Minimax Theorem
    3.1.2 Repeated Convex Games
  3.2 Extensive-form Games
  3.3 Optimal Oblivious Routing
  3.4 MDPs with Adversary-controlled Costs
    3.4.1 Introduction and Motivation
    3.4.2 Model Formulation
    3.4.3 Solving MDPs with Linear Programming
    3.4.4 Representation as a Convex Game
    3.4.5 Cost-paired MDP Games
  3.5 Convex Stochastic Games
4 Generalizing Extensive-form Games with Convex Action Sets
  4.1 CEFGs: Defining the Model
  4.2 Sufficient Recall and Implicit Behavior Reactive Policies
  4.3 Solving a CEFG by Transformation to a Convex Game
  4.4 Applications of CEFGs
    4.4.1 Stochastic Games and POSGs
    4.4.2 Extending Cost-paired MDP Games with Observations
    4.4.3 Perturbed Games and Games with Outcome Uncertainty
    4.4.4 Uncertain Multi-stage Path Planning
  4.5 Conclusions
5 Fast Algorithms for Convex Games
  5.1 Best Responses and Fictitious Play
  5.2 The Single-Oracle Algorithm for MDPs with Adversarial Costs
  5.3 A Bundle-based Double Oracle Algorithm
    5.3.1 The Basic Algorithm
    5.3.2 Aggregation
    5.3.3 Line Search
    5.3.4 Convergence Guarantees and Fictitious Play
  5.4 Good and Bad Best Responses for Extensive-form Games
  5.5 Experimental Results
    5.5.1 Adversarial-cost MDPs
    5.5.2 Extensive-form Game Experiments
6 Online Geometric Optimization in the Bandit Setting
  6.1 Introduction and Background
  6.2 Problem Formalization
  6.3 Algorithm
  6.4 Analysis
    6.4.1 Preliminaries
    6.4.2 High Probability Bounds on Estimates
    6.4.3 Relating the Loss of BGA and its GEX Subroutine
    6.4.4 A Bound on the Expected Regret of BGA
  6.5 Conclusions and Later Work
7 Conclusions
  7.1 Summary of Contributions
  7.2 Summary of Open Questions and Future Work
A The Transition Functions of a CEFG Interpreted as Probabilities
B The Cone Extension of a Polyhedron
C Specification of a Geometric Experts Algorithm
D Notions of Regret
Bibliography