
Argumentation Accelerated Reinforcement Learning PDF

197 Pages·2015·3.56 MB·English
by Yang Gao

Preview Argumentation Accelerated Reinforcement Learning

Imperial College London
Department of Computing

Argumentation Accelerated Reinforcement Learning

Yang Gao

March 2015

Supervised by Professor Francesca Toni

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of Imperial College London and the Diploma of Imperial College London

Copyright Declaration

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.

Abstract

Reinforcement Learning (RL) is a popular statistical Artificial Intelligence (AI) technique for building autonomous agents, but it suffers from the curse of dimensionality: the computational requirement for obtaining the optimal policies grows exponentially with the size of the state space. Integrating heuristics into RL has proven to be an effective approach to combat this curse, but deriving high-quality heuristics from people's (typically conflicting) domain knowledge is challenging and has received little research attention. Argumentation theory is a logic-based AI technique well known for its conflict resolution capability and intuitive appeal. In this thesis, we investigate the integration of argumentation frameworks into RL algorithms, so as to improve the convergence speed of RL algorithms.

In particular, we propose a variant of the Value-based Argumentation Framework (VAF) to represent domain knowledge and to derive heuristics from this knowledge. We prove that the heuristics derived from this framework can effectively instruct individual learning agents as well as multiple cooperative learning agents. In addition, we propose the Argumentation Accelerated RL (AARL) framework to integrate these heuristics into different RL algorithms via Potential Based Reward Shaping (PBRS) techniques: we use classical PBRS techniques for flat RL (e.g. SARSA(λ))-based AARL, and propose a novel PBRS technique for MAXQ-0, a hierarchical RL (HRL) algorithm, so as to implement HRL-based AARL. We empirically test two AARL implementations, SARSA(λ)-based AARL and MAXQ-based AARL, in multiple application domains, including single-agent and multi-agent learning problems. Empirical results indicate that AARL can improve the convergence speed of RL, and can also be easily used by people who have little background in Argumentation and RL.

Acknowledgements

I consider myself very fortunate to have benefited from the unparalleled knowledge and guidance of my supervisor, Professor Francesca Toni. I am greatly indebted to her for her patience, continued support and encouragement throughout my PhD studies. Francesca allowed me to follow my own research agenda and to slowly plow my way through research topics I am interested in, while, at the same time, inspiring me with new ideas and providing invaluable instructions whenever I had problems. She spent late nights and early mornings reading and revising my papers, shared her enthusiasm for my work with her colleagues, and introduced me to the academic community. Without her help, this thesis would not have been possible.

Also, I must give my special thanks to my parents and my girlfriend, Kai Sun, for their consistent support, help and love. Kai cheerily endured my foulest moods and helped me in countless ways, from cooking meals for me during long nights spent in front of my computer, to proofreading drafts of my thesis.
The weekend video calls with my parents have been the most enjoyable part of each week.

Special thanks also to Xiuyi Fan, who has been an invaluable sounding board giving helpful feedback on many naive thoughts I shared with him, a trustworthy 'big brother' providing me with a lot of insights into the problems I encountered, and a humorous friend cheering me up when my self-confidence waned. Without Fan, my life at Imperial would be far less colourful.

In addition, I would like to thank my funding body, the China Scholarship Council, for its generous scholarship covering both my tuition and living expenses.

Thanks to everybody in the CLArg group and many others in the department for helping to create an enjoyable working environment. Special thanks to Krysia Broda and Murray Shanahan for their comments on my work, and to Amani El-Kholy, Ann Halford and Teresa Ng for looking after me so well in the department.

Finally, many thanks to Abbas Edalat, Sanjay Modgil and Marc Deisenroth, my examiners, for reading this thesis and giving some very useful comments that have contributed greatly to improving it.

Contents

1 Introduction
  1.1 Overview
  1.2 Proactive Decision Making and Argumentation
  1.3 Reactive Decision Making and RL
  1.4 Contribution
  1.5 Structure of This Dissertation
  1.6 Publications
  1.7 Statement of Originality
2 Background
  2.1 Argumentation Theory
    2.1.1 A Brief Overview of Argumentation Theory
    2.1.2 Abstract Argumentation Frameworks
    2.1.3 Value-Based Argumentation Framework
    2.1.4 Computation of Argumentation Semantics
  2.2 Reinforcement Learning (RL)
    2.2.1 Markov Decision Processes (MDPs) and Semi-MDPs
    2.2.2 SARSA Learning Algorithm
    2.2.3 Hierarchical RL (HRL) and MAXQ
    2.2.4 Eligibility Traces
    2.2.5 Potential-Based Reward Shaping (PBRS)
  2.3 Conclusion
3 Argumentation Frameworks for Reinforcement Learning
  3.1 Motivation
    3.1.1 The RoboCup Soccer Games
    3.1.2 Why Incorporate Argumentation into RL
  3.2 Argumentation Frameworks for Reinforcement Learning
    3.2.1 Problem Definition
    3.2.2 Argument for RL
    3.2.3 Argumentation Frameworks
    3.2.4 From Extensions to Heuristics
  3.3 Argumentation Accelerated RL (AARL)
  3.4 Related Work
  3.5 Conclusion
4 SARSA(λ)-based AARL
  4.1 SARSA(λ)-based AARL
  4.2 Experiments in RoboCup Games
    4.2.1 Performances of Both Sides Using Fixed Strategy
    4.2.2 Experiments in Keepaway games
    4.2.3 Experiments in Takeaway games
  4.3 Experiments in a Wumpus World Game
  4.4 Related Work
  4.5 Conclusion
5 Potential Based Reward Shaping for Hierarchical RL
  5.1 MAXQ with PBRS
    5.1.1 Integrating Potential Values into the MAXQ Decomposition
    5.1.2 The PBRS-MAXQ-0 Algorithm
  5.2 Experiments
    5.2.1 Taxi Problem
    5.2.2 Stochastic Wumpus World
  5.3 Related Work
  5.4 Conclusion
6 MAXQ-Based AARL: An Empirical Evaluation
  6.1 MAXQ-based AARL
  6.2 The Energy Usage Recommendation System
  6.3 Why Use MAXQ-Based AARL
  6.4 A Real-Data-Based Simulated User
  6.5 Model the Problem as a MDP
  6.6 Arguments for EURS
    6.6.1 Arguments Supporting Primitive Actions and Their Values
    6.6.2 Arguments Supporting Composite Sub-tasks and Their Values
  6.7 Experimental Settings and Results
    6.7.1 Experimental Settings
    6.7.2 Results in Learning Phases
    6.7.3 Results in the Evaluation Phases
  6.8 Related Works
  6.9 Conclusion
7 Conclusion

List of Tables

2.1 Some types of computation in abstract argumentation frameworks.
2.2 Complexity of some types of computation for the preferred and grounded semantics (adjusted from [DW09]).
4.1 The performances of 3 keepers playing against 2 takers, both using fixed strategies.
4.2 The performances of 4 keepers playing against 3 takers, both using fixed strategies.
4.3 State variables in a N-Keepaway game.
4.4 Performances of learning keepers playing against random takers after several hours of learning.
4.5 Performances of learning keepers playing against always-tackle takers after 40 hours of learning.
4.6 Performances of learning keepers playing against argument-based takers after 40 hours of learning.
4.7 State variables for learning taker T_1 in a N-Takeaway game.
4.8 Performances of learning takers playing against random keepers after 40 hours of learning.
4.9 Performances of learning takers playing against random keepers after 40 hours of learning.
4.10 Some statistics of the experiments involving students' arguments.
5.1 Some statistics during the learning in the stochastic Wumpus World.
6.1 The average rewards and their standard errors of different SARSA(0)-based algorithms in the evaluation phases.
6.2 The average rewards and their standard errors of different MAXQ-based algorithms in the evaluation phases.
6.3 Pairwise p-values of SARSA-based algorithms in EURS.
6.4 Pairwise p-values of MAXQ-based algorithms in EURS.

List of Figures

1.1 A scenario in a simple Wumpus World.
1.2 An AF for representing the domain knowledge given in Section 1.1.
1.3 The interaction between a RL agent and the environment.
2.1 A 2×2 Wumpus World.
2.2 The first episode of the SARSA-based learning in the 2×2 Wumpus World in Figure 2.1.
2.3 The second episode of the SARSA-based learning.
2.4 The third episode of the SARSA-based learning.
2.5 The task graph for the stochastic Wumpus World problem.
2.6 An example scenario in the extended Wumpus World.
2.7 The hierarchical execution process of the Wumpus World example shown in Figure 2.6.
3.1 An example scenario in RoboCup Soccer Game. The ball is the white circle next to keeper K_1.
3.2 The derived AF of SCAFs for the Keepaway game (left) and the Takeaway game (right) in the scenario shown in Figure 3.1.
3.3 The simplified argumentation frameworks for the Keepaway game (left) and the Takeaway game (right) in the scenario shown in Figure 3.1.
3.4 The simplified argumentation framework for takers in the scenario shown in Figure 3.1, given the new value ranking QUICK_CLOSE >_v QUICK_TAC >_v QUICK_MARK.
3.5 The architecture of AARL.
4.1 The performances of learning keepers against random takers.
4.2 The performances of learning keepers against always-tackle takers.
4.3 The performances of learning keepers against argument-based takers.
4.4 The performances of learning takers against random keepers.
4.5 The performances of learning takers against hand-coded keepers.
4.6 The performances of two Takeaway games by using potential values proposed in [DGK11].
4.7 The GUI-based system of the Wumpus World game.
4.8 Performances of SARSA(0)-based AARL and standard SARSA(0) in the Wumpus World game.
4.9 A histogram summarising the number of students' AARLs receiving different overall rewards.
5.1 Settings of the Taxi problem.
5.2 Performances in the Taxi problem.
5.3 Performances of R-MAXQ and R-MAXQ in the Taxi Problem.
5.4 A Wumpus World.
5.5 Performances of four RL algorithms in the stochastic Wumpus World game.
6.1 The probability distribution of the switching on time of the washing machine.
6.2 The probability distribution of the switching on time and working time of the TV set.
6.3 The task graph for the energy adviser system.
6.4 Settings we use to implement AARL for EURS.
6.5 Performances of SARSA(0)-based EURS in the learning phases.
6.6 Performances of MAXQ-based EURS in the learning phases. Lighter colour areas represent 95% confidence intervals. All results are averaged over 100 independent experiments, each consisting of 2000 episodes.
7.1 An AF including epistemic arguments.
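The abstract in the preview above centres on one mechanism: heuristics derived from an argumentation framework are fed into an RL learner through Potential Based Reward Shaping (PBRS). As a rough reading aid only, and not the thesis's own algorithm, the sketch below shows one way such a heuristic could shape a SARSA(0) learner. Everything in it is an assumption made for illustration: the toy environment interface (reset/step), the recommended(state) oracle standing in for the actions backed by accepted arguments, and the look-ahead-advice form of the shaping term.

# A minimal, illustrative sketch only -- NOT the algorithm from the thesis.
# Assumed names: `recommended(state)` stands in for the set of actions backed by
# accepted arguments; `env` is any environment exposing reset() -> state and
# step(action) -> (next_state, reward, done).

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
PHI_BONUS = 1.0  # assumed magnitude of the potential for recommended actions


def potential(state, action, recommended):
    """State-action potential: positive when the heuristic recommends `action`."""
    return PHI_BONUS if action in recommended(state) else 0.0


def epsilon_greedy(Q, state, actions):
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def sarsa_with_pbrs(env, actions, recommended, episodes=500):
    """SARSA(0) whose reward is augmented with a potential-based shaping term
    F(s, a, s', a') = gamma * Phi(s', a') - Phi(s, a) (look-ahead-advice style),
    so that early exploration is nudged towards the heuristic's recommendations."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, actions)
            phi_next = 0.0 if done else potential(s2, a2, recommended)
            shaping = GAMMA * phi_next - potential(s, a, recommended)
            target = r + shaping + (0.0 if done else GAMMA * Q[(s2, a2)])
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

In the thesis itself the potentials come from extensions of a value-based argumentation framework, and the hierarchical case replaces classical PBRS with a scheme designed for MAXQ-0; the sketch only conveys the flat, single-agent shape of the idea.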

