Imperial College London
Department of Computing

Argumentation Accelerated Reinforcement Learning

Yang Gao

March 2015

Supervised by Professor Francesca Toni

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of Imperial College London and the Diploma of Imperial College London


Copyright Declaration

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.


Abstract

Reinforcement Learning (RL) is a popular statistical Artificial Intelligence (AI) technique for building autonomous agents, but it suffers from the curse of dimensionality: the computational requirement for obtaining the optimal policies grows exponentially with the size of the state space. Integrating heuristics into RL has proven to be an effective approach to combat this curse, but deriving high-quality heuristics from people's (typically conflicting) domain knowledge is challenging, yet has received little research attention. Argumentation theory is a logic-based AI technique well known for its conflict-resolution capability and intuitive appeal. In this thesis, we investigate the integration of argumentation frameworks into RL algorithms, so as to improve their convergence speed.

In particular, we propose a variant of the Value-based Argumentation Framework (VAF) to represent domain knowledge and to derive heuristics from this knowledge. We prove that the heuristics derived from this framework can effectively instruct individual learning agents as well as multiple cooperative learning agents. In addition, we propose the Argumentation Accelerated RL (AARL) framework to integrate these heuristics into different RL algorithms via Potential-Based Reward Shaping (PBRS) techniques: we use classical PBRS techniques for flat-RL-based AARL (e.g. based on SARSA(λ)), and propose a novel PBRS technique for MAXQ-0, a hierarchical RL (HRL) algorithm, so as to implement HRL-based AARL. We empirically test two AARL implementations, SARSA(λ)-based AARL and MAXQ-based AARL, in multiple application domains, including single-agent and multi-agent learning problems. Empirical results indicate that AARL can improve the convergence speed of RL, and can also be easily used by people who have little background in Argumentation and RL.


Acknowledgements

I consider myself very fortunate to have benefited from the unparalleled knowledge and guidance of my supervisor, Professor Francesca Toni. I am greatly indebted to her for her patience, continued support and encouragement throughout my PhD studies. Francesca allowed me to follow my own research agenda and to slowly plough my way through the research topics I am interested in, while, at the same time, inspiring me with new ideas and providing invaluable advice whenever I had problems. She spent late nights and early mornings reading and revising my papers, shared her enthusiasm for my work with her colleagues, and introduced me to the academic community. Without her help, this thesis would not have been possible.

Also, I must give my special thanks to my parents and my girlfriend, Kai Sun, for their consistent support, help and love. Kai cheerily endured my foulest moods and helped me in countless ways, from cooking meals for me during long nights spent in front of my computer, to proofreading drafts of my thesis.
The video call with my parents each weekend has been the most enjoyable part of my week.

Special thanks also to Xiuyi Fan, who has been an invaluable sounding board offering helpful feedback on the many naive ideas I shared with him, a trustworthy 'big brother' providing me with many insights into the problems I encountered, and a humorous friend cheering me up when my self-confidence waned. Without Fan, my life at Imperial would have been far less colourful.

In addition, I would like to thank my funding body, the China Scholarship Council, for its generous scholarship covering both my tuition and living expenses.

Thanks to everybody in the CLArg group and many others in the department for helping to create an enjoyable working environment. Special thanks to Krysia Broda and Murray Shanahan for their comments on my work, and to Amani El-Kholy, Ann Halford and Teresa Ng for looking after me so well in the department.

Finally, many thanks to Abbas Edalat, Sanjay Modgil and Marc Deisenroth, my examiners, for reading this thesis and giving some very useful comments that have greatly improved it.


Contents

1 Introduction
  1.1 Overview
  1.2 Proactive Decision Making and Argumentation
  1.3 Reactive Decision Making and RL
  1.4 Contribution
  1.5 Structure of This Dissertation
  1.6 Publications
  1.7 Statement of Originality

2 Background
  2.1 Argumentation Theory
    2.1.1 A Brief Overview of Argumentation Theory
    2.1.2 Abstract Argumentation Frameworks
    2.1.3 Value-Based Argumentation Framework
    2.1.4 Computation of Argumentation Semantics
  2.2 Reinforcement Learning (RL)
    2.2.1 Markov Decision Processes (MDPs) and Semi-MDPs
    2.2.2 SARSA Learning Algorithm
    2.2.3 Hierarchical RL (HRL) and MAXQ
    2.2.4 Eligibility Traces
    2.2.5 Potential-Based Reward Shaping (PBRS)
  2.3 Conclusion

3 Argumentation Frameworks for Reinforcement Learning
  3.1 Motivation
    3.1.1 The RoboCup Soccer Games
    3.1.2 Why Incorporate Argumentation into RL
  3.2 Argumentation Frameworks for Reinforcement Learning
    3.2.1 Problem Definition
    3.2.2 Argument for RL
    3.2.3 Argumentation Frameworks
    3.2.4 From Extensions to Heuristics
  3.3 Argumentation Accelerated RL (AARL)
  3.4 Related Work
  3.5 Conclusion

4 SARSA(λ)-based AARL
  4.1 SARSA(λ)-based AARL
  4.2 Experiments in RoboCup Games
    4.2.1 Performances of Both Sides Using Fixed Strategy
    4.2.2 Experiments in Keepaway games
    4.2.3 Experiments in Takeaway games
  4.3 Experiments in a Wumpus World Game
  4.4 Related Work
  4.5 Conclusion

5 Potential Based Reward Shaping for Hierarchical RL
  5.1 MAXQ with PBRS
    5.1.1 Integrating Potential Values into the MAXQ Decomposition
    5.1.2 The PBRS-MAXQ-0 Algorithm
  5.2 Experiments
    5.2.1 Taxi Problem
    5.2.2 Stochastic Wumpus World
  5.3 Related Work
  5.4 Conclusion

6 MAXQ-Based AARL: An Empirical Evaluation
  6.1 MAXQ-based AARL
  6.2 The Energy Usage Recommendation System
  6.3 Why Use MAXQ-Based AARL
  6.4 A Real-Data-Based Simulated User
  6.5 Model the Problem as an MDP
  6.6 Arguments for EURS
    6.6.1 Arguments Supporting Primitive Actions and Their Values
    6.6.2 Arguments Supporting Composite Sub-tasks and Their Values
  6.7 Experimental Settings and Results
    6.7.1 Experimental Settings
    6.7.2 Results in Learning Phases
    6.7.3 Results in the Evaluation Phases
  6.8 Related Works
  6.9 Conclusion

7 Conclusion


List of Tables

2.1 Some types of computation in abstract argumentation frameworks.
2.2 Complexity of some types of computation for the preferred and grounded semantics (adjusted from [DW09]).
4.1 The performances of 3 keepers playing against 2 takers, both using fixed strategies.
4.2 The performances of 4 keepers playing against 3 takers, both using fixed strategies.
4.3 State variables in an N-Keepaway game.
4.4 Performances of learning keepers playing against random takers after several hours of learning.
4.5 Performances of learning keepers playing against always-tackle takers after 40 hours of learning.
4.6 Performances of learning keepers playing against argument-based takers after 40 hours of learning.
4.7 State variables for learning taker T1 in an N-Takeaway game.
4.8 Performances of learning takers playing against random keepers after 40 hours of learning.
4.9 Performances of learning takers playing against random keepers after 40 hours of learning.
4.10 Some statistics of the experiments involving students' arguments.
5.1 Some statistics during learning in the stochastic Wumpus World.
6.1 The average rewards and their standard errors of different SARSA(0)-based algorithms in the evaluation phases.
6.2 The average rewards and their standard errors of different MAXQ-based algorithms in the evaluation phases.
6.3 Pairwise p-values of SARSA-based algorithms in EURS.
6.4 Pairwise p-values of MAXQ-based algorithms in EURS.


List of Figures

1.1 A scenario in a simple Wumpus World.
1.2 An AF for representing the domain knowledge given in Section 1.1.
1.3 The interaction between an RL agent and the environment.
2.1 A 2×2 Wumpus World.
2.2 The first episode of the SARSA-based learning in the 2×2 Wumpus World in Figure 2.1.
2.3 The second episode of the SARSA-based learning.
2.4 The third episode of the SARSA-based learning.
2.5 The task graph for the stochastic Wumpus World problem.
2.6 An example scenario in the extended Wumpus World.
2.7 The hierarchical execution process of the Wumpus World example shown in Figure 2.6.
3.1 An example scenario in a RoboCup Soccer game. The ball is the white circle next to keeper K1.
3.2 The derived AF of SCAFs for the Keepaway game (left) and the Takeaway game (right) in the scenario shown in Figure 3.1.
3.3 The simplified argumentation frameworks for the Keepaway game (left) and the Takeaway game (right) in the scenario shown in Figure 3.1.
3.4 The simplified argumentation framework for takers in the scenario shown in Figure 3.1, given the new value ranking QUICK CLOSE >_v QUICK TAC >_v QUICK MARK.
3.5 The architecture of AARL.
4.1 The performances of learning keepers against random takers.
4.2 The performances of learning keepers against always-tackle takers.
4.3 The performances of learning keepers against argument-based takers.
4.4 The performances of learning takers against random keepers.
4.5 The performances of learning takers against hand-coded keepers.
4.6 The performances of two Takeaway games using the potential values proposed in [DGK11].
4.7 The GUI-based system of the Wumpus World game.
4.8 Performances of SARSA(0)-based AARL and standard SARSA(0) in the Wumpus World game.
4.9 A histogram summarising the number of students' AARLs receiving different overall rewards.
5.1 Settings of the Taxi problem.
5.2 Performances in the Taxi problem.
5.3 Performances of R-MAXQ and R-MAXQ in the Taxi Problem.
5.4 A Wumpus World.
5.5 Performances of four RL algorithms in the stochastic Wumpus World game.
6.1 The probability distribution of the switching-on time of the washing machine.
6.2 The probability distribution of the switching-on time and working time of the TV set.
6.3 The task graph for the energy adviser system.
6.4 Settings we use to implement AARL for EURS.
6.5 Performances of SARSA(0)-based EURS in the learning phases.
6.6 Performances of MAXQ-based EURS in the learning phases. Lighter colour areas represent 95% confidence intervals. All results are averaged over 100 independent experiments, each consisting of 2000 episodes.
7.1 An AF including epistemic arguments.