Agent-Agnostic Human-in-the-Loop Reinforcement Learning

David Abel (Brown University, [email protected])
John Salvatier (AI Impacts, [email protected])
Andreas Stuhlmüller (Stanford University, [email protected])
Owain Evans (University of Oxford, [email protected])

arXiv:1701.04079v1 [cs.LG] 15 Jan 2017

Abstract

Providing Reinforcement Learning agents with expert advice can dramatically improve various aspects of learning. Prior work has developed teaching protocols that enable agents to learn efficiently in complex environments; many of these methods tailor the teacher’s guidance to agents with a particular representation or underlying learning scheme, offering effective but specialized teaching procedures. In this work, we explore protocol programs, an agent-agnostic schema for Human-in-the-Loop Reinforcement Learning. Our goal is to incorporate the beneficial properties of a human teacher into Reinforcement Learning without making strong assumptions about the inner workings of the agent. We show how to represent existing approaches such as action pruning, reward shaping, and training in simulation as special cases of our schema and conduct preliminary experiments on simple domains.

1 Introduction

A central goal of Reinforcement Learning (RL) is to design agents that learn in a fully autonomous way. An engineer designs a reward function, input/output channels, and a learning algorithm. Then, apart from debugging, the engineer need not intervene during the actual learning process. Yet fully autonomous learning is often infeasible due to the complexity of real-world problems, the difficulty of specifying reward functions, and the presence of potentially dangerous outcomes that constrain exploration.

Consider a robot learning to perform household chores. Human engineers create a curriculum, moving the agent between simulation, practice environments, and real house environments. Over time, they may tweak reward functions, heuristics, sensors, and state or action representations.
They may intervene directly in real-world training to prevent the robot damaging itself, destroying valuable goods, or harming people it interacts with.

In this example, humans do not just design the learning agent: they are also in the loop of the agent’s learning process, as is typical for many learning systems. Self-driving cars learn with humans ready to intervene in dangerous situations. Facebook’s algorithm for recommending trending news stories has humans filtering out inappropriate content [1]. In both examples, the agent’s environment is complex and non-stationary, and there is a wide range of damaging outcomes (like a traffic accident). As RL is applied to increasingly complex real-world problems, such interactive guidance will be critical to the success of these systems.

Presented at the 2016 NIPS Future of Interactive Learning Machines Workshop, Barcelona, Spain.

Prior literature has investigated how people can help RL agents learn more efficiently through different methods of interaction [32, 24, 28, 36, 42, 48, 21, 17, 25, 43, 23, 49, 46, 44, 30, 10]. Often, the human’s role is to pass along knowledge about relevant quantities of the RL problem, like Q-values, action optimality, or the true reward for a particular state-action pair. This way, the person can bias exploration, prevent catastrophic outcomes, and accelerate learning.

Most existing work develops agent-specific protocols for human interaction, that is, protocols for human interaction or advice that are designed for a specific RL algorithm (such as Q-learning). For instance, Griffith et al. [17] investigate the power of policy advice for a Bayesian Q-Learner. Other works assume that the states of the MDP take a particular representation, or that the action space is discrete or finite. Making explicit assumptions about the agent’s learning process can enable more powerful teaching protocols that leverage insights about the learning algorithm or representation.

1.1 Our contribution: agent-agnostic guidance of RL algorithms

Our goal is to develop a framework for human-agent interaction that (a) is agent-agnostic and (b) can capture a wide range of ways a human can help an RL agent.
Such a setting is informative of the structure of general teaching paradigms and of the relationship and interplay of pre-existing teaching methods, and is suggestive of new teaching methodologies, which we discuss in Section 6. Additionally, approaching human-in-the-loop RL from a maximally general standpoint can help illustrate the relationship between the requisite power of a teacher and the teacher’s effectiveness on learning. For instance, we demonstrate sufficient conditions on a teacher’s knowledge about an environment that enable effective¹ action pruning of an arbitrary agent. Results of this form can again be informative about the general structure of teaching RL agents.

We make two simplifying assumptions. First, we consider environments where the state is fully observed; that is, the learning agent interacts with a Markov Decision Process (MDP) [37, 22, 40]. Second, we note that conducting experiments with an actual human in the loop creates a huge amount of work for a human, and can slow down training to an unacceptable degree. For this reason, we focus on programmatic instantiations of humans-in-the-loop; a person informed about the task (MDP) in question will write a program to facilitate various teaching protocols.

There are obvious disadvantages to agent-agnostic protocols. The agent is not specialized to the protocol, so it is unable to ask the human informative questions as in [4], and it will not have an observation model that faithfully represents the process the human uses to generate advice, as in [17, 21]. Likewise, the human cannot provide optimally informative advice to the agent, as they don’t know the agent’s prior knowledge, exploration technique, representation, or learning method.

Conversely, agent-specific protocols may perform well for one type of algorithm or environment, but poorly on others. In many cases, without further hand-engineering, agent-specific protocols can’t be adapted to a variety of agent types. When researchers tackle challenging RL problems, they tend to explore a large space of algorithms with important structural differences: some are model-based and others model-free, some approximate the optimal policy, others a value function, and so on.
It takes substantial effort to adapt an advice protocol to each such algorithm. Moreover, as advice protocols and learning algorithms become more complex, greater modularity will help limit design complexity.

In our framework, the interaction between the person guiding the learning process, the agent, and the environment is formalized as a protocol program. This program controls the channels between the agent and the environment based on human input, pictured in Figure 1. This gives the teacher extensive control over the agent: in an extreme case, the agent can be prevented from interacting with the real environment entirely and only interact with a simulation. At the same time, we require that the human only interact with the agent during learning through the protocol program; both agent and environment are a black box to the human.

¹ By “effective” we mean: pruning bad actions while never pruning an optimal action. See Remark 3 (below).

[Figure 1 diagram: Environment M, Protocol Program P, Human H, and Agent L; arrows labeled a and s, r connect M and L through P.]
Figure 1: A general setup for RL with a human in the loop. By instantiating $P$ with different protocol programs, we can implement different mechanisms for human guidance of RL agents.

2 Framework

Any system for RL with a human in the loop has to coordinate three components:

1. The environment is an MDP and is specified by a tuple $M = (S, A, T, R, \gamma)$, where $S$ is the state space, $A$ is the action space, $T : S \times A \times S \to [0, 1]$ denotes the transition function, a probability distribution on states given a state and action, $R : S \times A \to \mathbb{R}$ is the reward function, and $\gamma$ is the discount factor.
2. The agent is a (stateful, potentially stochastic) function $L : S \times \mathbb{R} \to A$.
3. The human can receive and send advice information of flexible type, say $X_{in}$ and $X_{out}$, so we will treat the human as a (stateful, potentially stochastic) function $H : X_{in} \to X_{out}$. For example, $X_{in}$ might contain the history of actions, states, and rewards so far, and a new proposed action $a'$, and $X_{out}$ might be an action as well, either equivalent to $a'$ (if accepted) or different (if rejected). We assume that the human knows in general terms how their responses will be used and is making a good-faith effort to be helpful.
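The three components above can be sketched as plain Python types. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `MDP`, `Agent`, `Human`, `cautious_human`, and `always_right`, the two-state toy environment, and the choice of $X_{in}$ = (history, proposed action) and $X_{out}$ = action are all hypothetical, chosen to mirror the accept-or-reject example in item 3.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int
Action = str
History = List[Tuple[State, float, Action]]


@dataclass
class MDP:
    """Environment M = (S, A, T, R, gamma) with tabular transitions."""
    states: List[State]
    actions: List[Action]
    # transition[(s, a)] maps each next state s' to its probability T(s, a, s').
    transition: Dict[Tuple[State, Action], Dict[State, float]]
    reward: Callable[[State, Action], float]
    gamma: float


# Agent L: S x R -> A (it may keep internal state between calls).
Agent = Callable[[State, float], Action]
# Human H: X_in -> X_out. Assumed here: X_in = (history, proposed action),
# X_out = the action that should actually be used.
Human = Callable[[History, Action], Action]


def cautious_human(history: History, proposed: Action) -> Action:
    """Toy advisor: accept any proposal except 'RIGHT' (an arbitrary rule)."""
    return proposed if proposed != "RIGHT" else "STAY"


def always_right(state: State, reward: float) -> Action:
    """Toy agent that ignores its inputs and always proposes 'RIGHT'."""
    return "RIGHT"


# A hypothetical two-state MDP used only to exercise the types.
toy = MDP(
    states=[0, 1],
    actions=["STAY", "RIGHT"],
    transition={
        (0, "STAY"): {0: 1.0},
        (0, "RIGHT"): {1: 1.0},
        (1, "STAY"): {1: 1.0},
        (1, "RIGHT"): {1: 1.0},
    },
    reward=lambda s, a: 1.0 if (s, a) == (0, "RIGHT") else 0.0,
    gamma=0.95,
)

proposed = always_right(0, 0.0)        # the agent proposes an action
final = cautious_human([], proposed)   # the human accepts or replaces it
```

Keeping the agent and the human as opaque callables reflects the black-box requirement: nothing in the sketch inspects how either one chooses its output.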
The interaction between the environment, the agent, and a human advisor sets up a mechanism design problem: how can we design an interface that orchestrates the interaction between these components such that the combined system maximizes the expected sum of $\gamma$-discounted rewards from the environment? In other words, how can we write a protocol program $P : S \times \mathbb{R} \to A$ that can take the place of a given agent $L$, but that achieves higher rewards by making efficient use of information gained through sub-calls to $L$ and $H$?

By formalizing existing and new techniques as programs, we facilitate understanding and comparison of these techniques within a common framework. By abstracting from particular agents and environments, we may better understand the mechanisms underlying effective teaching for Reinforcement Learning by developing portable and modular teaching methods.

3 Capturing Existing Advice Schemes

Naturally, protocol programs cannot capture all advice protocols. Any protocol that depends on prior knowledge of the agent’s learning algorithm, representation, priors, or hyperparameters is ruled out. Despite this constraint, the framework can capture a range of existing protocols where a human-in-the-loop guides an agent.

Figure 1 shows that the human can manipulate the actions ($A$) sent to the environment, the agent’s observed states ($S$), and observed rewards ($R$). This points to the following combinatorial set of protocol families in which the human manipulates one or more of these components to influence learning:

$$\{S, A, R, (S, A), (S, R), (A, R), (S, A, R)\}$$

The first three elements of the set correspond to state manipulation, action pruning, and reward shaping protocol families.² The remaining elements represent families of teaching schemes that modify multiple elements of the agent’s learning; these protocols may introduce powerful interplay between the different components, which we hope future work will explore.

We now demonstrate simple ways in which protocol programs instantiate typical methods for intervening in an agent’s learning process.
Algorithm 1 Agent in control (standard)
    procedure AGENTCONTROL(s, r)
        return L(s, r)
    end procedure

Algorithm 2 Human in control
    procedure HUMANCONTROL(s, r)
        return H(s, r)
    end procedure

Algorithm 3 Action pruning
    Δ ← H.Δ                          ▷ To prune: S × A → {0, 1}
    procedure PRUNEACTIONS(s, r)
        a = L(s, r)
        while Δ(s, a) do             ▷ If needs pruning
            r̄ = H[(s, a)]
            a = L(s, r̄)
        end while
        return a
    end procedure

Algorithm 4 Reward manipulation
    procedure MANIPULATEREWARD(s, r)
        r̄ = H(s, r)
        return L(s, r̄)
    end procedure

Algorithm 5 Training in simulation
    M* = (S, A, T*, R*, γ)           ▷ Simulation
    η = []                           ▷ History: array of (S × R × A)
    procedure TRAININSIMULATION(s, r)
        s̄ = s
        r̄ = r
        while H(η) ≠ “agent is ready” do
            a = L(s̄, r̄)
            append (s̄, r̄, a) to η
            r̄ ∼ R*(s̄, a)
            s̄ ∼ T*(s̄, a)
        end while
        return L(s, r)
    end procedure

Figure 2: Many schemes for human guidance of RL algorithms can be expressed as protocol programs. These programs have the same interface as the agent $L$, but can be safer or more efficient learners by making use of human advice $H$.

3.1 Reward shaping

Section 2 defined the reward function $R$ as part of the MDP $M$. However, while humans generally don’t design the environment, we do design reward functions. Usually the reward function is hand-coded prior to learning and must accurately assign reward values to any state the agent might reach. An alternative is to have a human generate the rewards interactively: the human observes the state and action and returns a scalar to the agent. This setup has been explored in work on TAMER [23]. A similar setup (with an agent-specific protocol) was applied to robotics by Daniel et al. [6]. It is straightforward to represent rewards that are generated interactively (or online) using protocol programs.

We now turn to other protocols in which the human manipulates rewards. These protocols assume a fixed reward function $R$ that is part of the MDP $M$.

3.1.1 Reward shaping and Q-value initialization

In Reward Shaping protocols, the human engineer changes the rewards given by some fixed reward function in order to influence an agent’s learning.
Ng et al. [32] introduced potential-based shaping, which shapes rewards without changing an MDP’s optimal policy. In particular, each reward received from the environment is augmented by a shaping function:

$$F(s, a, s') = \gamma \phi(s') - \phi(s), \tag{1}$$

so the agent actually receives $\bar{r} = F(s, a, s') + R(s, a)$. Wiewiora et al. [48] showed potential shaping to be equivalent (for Q-learners) to a subset of Q-value initialization schemes under some assumptions.

² State manipulation can correspond to abstraction or training in simulation.

Further, Devlin and Kudenko [8] propose dynamic potential shaping functions that change over time. That is, the shaping function $F$ also takes as input two time parameters, $t$ and $t'$, such that:

$$F(s, t, s', t') = \gamma \phi(s', t') - \phi(s, t), \tag{2}$$

where $t' > t$. Their main result is that dynamic shaping functions of this form also guarantee optimal policy invariance. Similarly, Wiewiora et al. [48] extend potential shaping to potential-based advice functions, which identify a similar class of shaping functions on $(s, a)$ pairs.

In Section 4, we show that our framework captures reward shaping and, consequently, a limited notion of Q-value initialization.

3.2 Training in Simulation

It is common practice to train an agent in simulation and transfer it to the real world once it performs well enough. Algorithm 5 (Figure 2) shows how to represent the process of training in simulation as a protocol program. We let $M$ represent the real-world decision problem and let $M^*$ be a simulator for $M$ that is included in the protocol program. Initially the protocol program has the agent $L$ interact with $M^*$ while the human observes the interaction. When the human decides the agent is ready, the protocol program has $L$ interact with $M$ instead.

3.3 Action Pruning

Action pruning is a technique for dynamically removing actions from the MDP to reduce the branching factor of the search space. Such techniques have been shown to accelerate learning and planning time [39, 19, 38, 2]. In Section 5, we apply action pruning to prevent catastrophic outcomes during exploration, a problem explored by Lipton et al. [27], Garcia and Fernandez [14, 15], Hans et al. [18], and Moldovan and Abbeel [31].
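As a rough illustration of pruning through the agent's ordinary interface, the sketch below follows the loop of Algorithm 3 (Figure 2): a proposed action flagged by a to-prune predicate is never returned to the environment, and the agent is called again with the same state and a penalty reward. The names `make_pruning_protocol`, `to_prune`, `penalty`, and `max_retries`, and the toy `AvoidPenalizedAgent`, are assumptions made for this example. The constant penalty stands in for the human-supplied reward in Algorithm 3, and the retry cap is an added safeguard that is not part of the source.

```python
from typing import Callable, Hashable, List, Optional, Set, Tuple

State = Hashable
Action = Hashable


class AvoidPenalizedAgent:
    """Toy learner (hypothetical): tries actions in a fixed order and stops
    proposing any (state, action) pair that was followed by a negative reward."""

    def __init__(self, actions: List[Action]) -> None:
        self.actions = list(actions)
        self.banned: Set[Tuple[State, Action]] = set()
        self.last: Optional[Tuple[State, Action]] = None

    def __call__(self, state: State, reward: float) -> Action:
        # Blame the previous choice if the incoming reward is negative.
        if self.last is not None and reward < 0:
            self.banned.add(self.last)
        for action in self.actions:
            if (state, action) not in self.banned:
                self.last = (state, action)
                return action
        raise RuntimeError("every action in this state has been penalized")


def make_pruning_protocol(
    agent: Callable[[State, float], Action],
    to_prune: Callable[[State, Action], bool],
    penalty: float = -1000.0,   # assumed value, not taken from the source
    max_retries: int = 100,     # safeguard added for this sketch
) -> Callable[[State, float], Action]:
    """Wrap `agent` in a protocol with the same (state, reward) -> action interface."""

    def protocol(state: State, reward: float) -> Action:
        action = agent(state, reward)
        retries = 0
        while to_prune(state, action):
            if retries >= max_retries:
                raise RuntimeError("agent keeps proposing pruned actions")
            # The pruned action is withheld from the environment; the agent
            # instead observes the penalty and the same state (a self-loop).
            action = agent(state, penalty)
            retries += 1
        return action

    return protocol


agent = AvoidPenalizedAgent(["bad", "ok"])
protocol = make_pruning_protocol(agent, lambda s, a: (s, a) == ("s0", "bad"))
chosen = protocol("s0", 0.0)   # "bad" is proposed first, pruned, then "ok"
```

The wrapper never looks inside the agent; it only calls it, which is what keeps the scheme agent-agnostic.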
Protocol programs allow action pruning to be carried out interactively. Instead of having to decide which actions to prune prior to learning, the human can wait to observe the states that are actually encountered by the agent, which may be valuable in cases where the human has limited knowledge of the environment or the agent’s learning ability. In Section 4, we exhibit an agent-agnostic protocol for interactively pruning actions that preserves the optimal policy while removing some bad actions.

Our pruning protocol is illustrated in a gridworld with lava pits (Figure 3). The agent is represented by a gray circle, “G” is a goal state that provides reward +1, and the red cells are lava pits with reward −200. All white cells provide reward 0.

At each time step, the human checks whether the agent moves into a lava pit. If it does not (as in moving DOWN from state 34), the agent continues as normal. If it does (as in moving RIGHT from state 33), the human bypasses sending any action to the true MDP (preventing movement right) and sends the agent a next state of 33. The agent doesn’t actually fall in the lava, but the human sends it a reward $r \le -200$. After this negative reward, the agent is less likely to try the action again. For the protocol program, see Algorithm 3 in Figure 2.

Note that the agent receives no explicit signal that its attempted catastrophic action was blocked by the human. It observes a big negative reward and a self-loop but no information about whether the human or the environment generated the observation.

[Figure 3 image: gridworld with axes labeled 1 to 5 and goal cell G; annotated transitions (34, DOWN, 0, 33) and (33, RIGHT, -1000, 33).]
Figure 3: The human allows movement from state 34 to 33 but blocks agent from falling in lava (at 43).

3.4 Manipulating state representation

The agent’s state representation can have a significant influence on its learning. Suppose the states of MDP $M$ consist of a number of features, defining a state vector $s$. The human engineer can specify a mapping $\phi$ such that the agent always receives $\phi(s) = \bar{s}$ in place of this vector $s$. Such mappings are used to specify high-level features of state that are important for learning, or to dynamically ignore confusing features from the agent.

This transformation of the state vector is normally fixed before learning.
A protocol program can allow the human to provide processed states or high-level features interactively. By the time the human stops providing features, the agent might have learned to generate them on its own (as in Learning with Privileged Information [45, 35]).

Other methods have focused on state abstraction functions to decrease learning time and preserve the quality of learned behavior, as in [26, 33, 12, 20, 7, 3, 13]. Using a state abstraction function, agents compress representations of their environments, enabling deeper planning and lower sample complexity. Any state aggregation function can be implemented by a protocol program, perhaps dynamically induced through interaction with a teacher.

4 Theory

Here we illustrate some simple ways in which our proposed agent-agnostic interaction scheme captures other existing agent-agnostic protocols. The following results all concern tabular MDPs, but are intended to offer intuition for high-dimensional or continuous environments as well.

4.1 Reward Shaping

First we observe that protocol programs can precisely capture methods for shaping reward functions.

Remark 1: For any reward shaping function $F$, including potential-based shaping, potential-based advice, and dynamic potential-based advice, there is a protocol that produces the same rewards.

To construct such a protocol for a given $F$, simply let the reward output by the protocol, $\bar{r}$, take on the value $F(s) + r$ at each time step. That is, in Algorithm 4, simply define $H(s, r) = F(s) + r$.

4.2 Action Pruning

We now show that there is a simple class of protocol programs that carry out action pruning of a certain form.

Remark 2: There is a protocol for pruning actions in the following sense: for any set of state-action pairs $sa \subset S \times A$, the protocol ensures that, for each pair $(s_i, a_j) \in sa$, action $a_j$ is never executed in the MDP in state $s_i$.

The protocol is as described in Section 3.3 and shown in Algorithm 3. The premise is this: in all cases where the agent attempts an action that should be pruned, the protocol gives the agent low reward and forces the agent to self-loop.

Knowing which actions to prune is itself a challenging problem.
Often, it is natural to assume that the human guiding the learner knows something about the environment of interest (such as where high rewards or catastrophes lie), but may not know every detail of the problem. Thus, we consider a case in which the human has partial (but useful) knowledge about the problem of interest, represented as an approximate Q-function. The next remark shows there is a protocol based on approximate knowledge with two properties: (1) it never prunes an optimal action, and (2) it limits the magnitude of the agent’s worst mistake:

Remark 3: Assuming the protocol designer has a $\beta$-optimal Q function $Q_H$:

$$\|Q^*(s, a) - Q_H(s, a)\|_\infty \le \beta, \tag{3}$$

there exists a protocol that never prunes an optimal action, but prunes all actions so that the agent’s mistakes are never more than $4\beta$ below optimal. That is, for all times $t$:

$$V^{L_t}(s_t) \ge V^*(s_t) - 4\beta, \tag{4}$$

where $L_t$ is the agent’s policy after $t$ timesteps.

Proof of Remark 3. The protocol designer has a $\beta$-approximate Q function, denoted $Q_H$, defined as above. Consider the state-specific action pruning function $H(s)$:

$$H(s) = \left\{ a \in A \;\middle|\; Q_H(s, a) \ge \max_{a'} Q_H(s, a') - 2\beta \right\}. \tag{5}$$

The protocol prunes all actions not in $H(s)$ according to the self-loop method described above. This protocol induces a pruned Bellman Equation over available actions, $H(s)$, in each state:

$$V_H(s) = \max_{a \in H(s)} \left( R(s, a) + \gamma \sum_{s'} T(s, a, s') V_H(s') \right). \tag{6}$$

Let $a^*$ denote the true optimal action: $a^* = \arg\max_{a'} Q^*(s, a')$. To preserve the optimal policy, we need $a^* \in H(s)$ for each state. Note that $a^* \notin H(s)$ when:

$$Q_H(s, a^*) < \max_{a'} Q_H(s, a') - 2\beta. \tag{7}$$

But by definition of $Q_H(s, a)$:

$$\left| Q_H(s, a^*) - \max_{a} Q_H(s, a) \right| \le 2\beta, \tag{8}$$

since $Q_H(s, a^*) \ge Q^*(s, a^*) - \beta$ and, for every $a$, $Q_H(s, a) \le Q^*(s, a) + \beta \le Q^*(s, a^*) + \beta$. Thus, $a^* \notin H(s)$ can never occur. Furthermore, observe that $H(s)$ retains all actions $a$ for which:

$$Q_H(s, a) \ge \max_{a'} Q_H(s, a') - 2\beta \tag{9}$$

holds. Thus, in the worst case, the following two hold:

1. The optimal action estimate is $\beta$ too low: $Q_H(s, a^*) = Q^*(s, a^*) - \beta$
2. The action with the lowest value, $a_{bad}$, is $\beta$ too high: $Q_H(s, a_{bad}) = Q^*(s, a_{bad}) + \beta$

From Equation 9, observe that the minimal $Q^*(s, a_{bad})$ such that $a_{bad} \in H(s)$ is:

$$Q^*(s, a_{bad}) + \beta \ge Q^*(s, a^*) - \beta - 2\beta$$
$$\therefore\; Q^*(s, a_{bad}) \ge Q^*(s, a^*) - 4\beta$$

Thus, this pruning protocol never prunes an optimal action, but prunes all actions more than $4\beta$ below $a^*$ in value. We conclude that the agent never executes an action more than $4\beta$ below optimal.

5 Experiments

This section applies our action pruning protocols (Section 3.3 and Remarks 2 and 3 above) to concrete RL problems. In Experiment 1, action pruning is used to prevent the agent from trying catastrophic actions, i.e., to achieve safe exploration. In Experiment 2, action pruning is used to accelerate learning.

5.1 Protocol for Preventing Catastrophes

Human-in-the-loop RL can help prevent disastrous outcomes that result from ignorance of the environment’s dynamics or of the reward function. Our goal for this experiment is to prevent the agent from taking catastrophic actions. These are real-world actions so costly that we want the agent to never take them.³ This notion of catastrophic action is closely related to ideas in “Safe RL” [16, 31] and to work on “significant rare events” [34].

³ We allow an RL agent to take sub-optimal actions while learning. Catastrophic actions are not allowed because their cost is orders of magnitude worse than non-catastrophic actions.

[Figure 4 plot not recoverable from the source.]
Figure 4: Preventing Catastrophic Speeds

[Figure 5 plot: “Cumulative Reward: taxi_h-10_w-10”; x-axis: Episode Number (0 to 200); y-axis: Cumulative Reward; series: qlearner-uniform-prune, qlearner-uniform, rmax-h4-prune, rmax-h4.]
Figure 5: Pruning in Taxi.

Section 3.3 describes our protocol program for preventing catastrophes in finite MDPs using action pruning. There are two important elements of this program:

1. When the agent tries a catastrophic action $a$ in state $s$, the agent is blocked from executing the action in the real world, and the agent receives state and reward $(s, r_{bad})$, where $r_{bad}$ is an extreme negative reward.
2. This $(s, a)$ is stored so that the protocol program can automate the human’s intervention, which could allow the human to stop monitoring after all catastrophes have been stored.
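The two elements above can be sketched as a small wrapper class. This is a minimal illustration under stated assumptions rather than the code used in the experiments: `CatastropheBlocker`, `human_is_catastrophe`, `bad_reward`, `monitoring`, and the scripted toy agent are hypothetical names, the value of `bad_reward` is arbitrary, and the retry cap is a safeguard added for the sketch.

```python
from typing import Callable, Hashable, List, Set, Tuple

State = Hashable
Action = Hashable


class CatastropheBlocker:
    """Protocol with the agent's interface that blocks catastrophic actions
    and remembers them so that later interventions need no human."""

    def __init__(
        self,
        agent: Callable[[State, float], Action],
        human_is_catastrophe: Callable[[State, Action], bool],
        bad_reward: float = -1000.0,  # assumed stand-in for r_bad
        max_retries: int = 100,       # safeguard added for this sketch
    ) -> None:
        self.agent = agent
        self.human_is_catastrophe = human_is_catastrophe
        self.bad_reward = bad_reward
        self.max_retries = max_retries
        self.stored: Set[Tuple[State, Action]] = set()
        self.human_queries = 0
        self.monitoring = True  # set to False once the human stops watching

    def _blocked(self, state: State, action: Action) -> bool:
        if (state, action) in self.stored:
            return True            # element 2: automated intervention
        if not self.monitoring:
            return False           # nobody is available to judge new pairs
        self.human_queries += 1
        if self.human_is_catastrophe(state, action):
            self.stored.add((state, action))
            return True
        return False

    def __call__(self, state: State, reward: float) -> Action:
        action = self.agent(state, reward)
        for _ in range(self.max_retries):
            if not self._blocked(state, action):
                return action
            # Element 1: the action is not executed; the agent observes
            # the same state together with the extreme negative reward.
            action = self.agent(state, self.bad_reward)
        raise RuntimeError("agent keeps proposing catastrophic actions")


class ScriptedAgent:
    """Toy agent (hypothetical) that replays a fixed list of actions."""

    def __init__(self, script: List[Action]) -> None:
        self.script = list(script)
        self.rewards_seen: List[float] = []

    def __call__(self, state: State, reward: float) -> Action:
        self.rewards_seen.append(reward)
        return self.script.pop(0)


scripted = ScriptedAgent(["FAST", "SLOW", "FAST", "SLOW"])
blocker = CatastropheBlocker(scripted, lambda s, a: a == "FAST")
first = blocker("s0", 0.0)    # human judges FAST (blocked) and SLOW (allowed)
blocker.monitoring = False    # the human stops monitoring
second = blocker("s0", 0.0)   # FAST is still blocked, now from storage
```

With a finite set of catastrophic pairs, lookups in `stored` eventually cover every case, which is the condition under which the text suggests the human could stop monitoring.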
This protocol prevents catastrophic actions while preserving the optimal policy and having only minimal side-effects on the agent’s learning. We can extend this protocol to environments with high-dimensional state spaces. Element (1) above remains the same. But (2) must be modified: preventing future catastrophes requires generalization across catastrophic actions (as there will be infinitely many such actions). We discuss this setting in Appendix A.

5.2 Experiment 1: Preventing Catastrophes in a Pong-like Game

Our protocol for preventing catastrophes is intended for use in a real-world environment. Here we provide a preliminary test of our protocol in a simple video game.

Our protocol treats the RL agent as a black box. To this end, we applied our protocol to an open-source implementation of the state-of-the-art RL algorithm “Trust Region Policy Optimization” from Duan et al. [11]. The environment was Catcher, a simplified version of Pong with non-visual state representation. Since there are no catastrophic actions in Catcher, we modified the game to give a large negative reward when the paddle’s speed exceeds a speed limit. We compare the performance of an agent that is assisted by the protocol (“Pruned”), and so is blocked from the catastrophic actions⁴, to the performance of a normal RL agent (“Not Pruned”).

Figure 4 shows the agent’s mean performance (±1 SD over 16 trials) over the course of learning. We see that the agent with protocol support (“Pruned”) performed much better overall. This is unsurprising, as it was blocked from ever doing a catastrophic action. The gap in mean performance is large early on but diminishes as the “Not Pruned” agent learns to avoid high speeds. By the end (i.e., after 400,000 actions), “Not Pruned” is close to “Pruned” in mean performance, but its total returns over the whole period are around 5 times worse. While the “Pruned” agent observes incongruous state transitions due to being blocked by our protocol, Figure 4 suggests these observations do not have negative side effects on learning.

5.3 Protocol for Accelerating Learning

We also conducted a simple experiment in the Taxi domain from Dietterich [9].
The Taxi problem is a more complex version of gridworld: each problem instance consists of a taxi and some number of passengers. The agent directs the taxi to each passenger, picks the passenger up, brings them to their destination, and drops them off.

⁴ We did not use an actual human in the loop. Instead the agent was blocked by a protocol program that checked whether each action would exceed the speed limit. This is essentially the protocol outlined in Appendix A but with the classifier trained offline to recognize catastrophes. Future work will test similar protocols using actual humans. (In this experiment a human can easily recognize catastrophic actions by reading the agent’s speed directly from the game state.)

We use Taxi to evaluate the effect of our action pruning protocol for accelerating learning in discrete MDPs. There is a natural procedure for pruning suboptimal actions that dramatically reduces the size of the reachable state space: if the taxi is carrying a passenger but is not at the passenger’s destination, we prune the dropoff action by returning the agent back to its current state with −0.01 reward. This prevents the agent from exploring a large portion of the state space, thus accelerating learning.

5.4 Experiment 2: Accelerated Learning in Taxi

We evaluated Q-learning [47] and R-MAX [5] with and without action pruning in a simple 10×10 instance with one passenger. The taxi starts at (1, 1), the passenger at (4, 3) with destination (2, 2). We ran standard Q-learning with ε-greedy exploration (ε = 0.2) and R-MAX with a planning horizon of four. Results are displayed in Figure 5.

Our results suggest that the action pruning protocol simplifies the problem for a Q-learner and dramatically so for R-MAX. In the allotted number of episodes, we see that pruning substantially improves the overall cumulative reward achieved; in the case of R-MAX, the agent is able to effectively solve the problem after a small number of episodes. Further, the results suggest that the agent-agnostic method of pruning is effective without having any internal access to the agent’s code.

6 Conclusion

We presented an agent-agnostic method for giving guidance to Reinforcement Learning agents.
Protocol programs written in this framework apply to any possible RL agent, so sophisticated schemes for human-agent interaction can be designed in a modular fashion without the need for adaptation to different RL algorithms. We presented some simple theoretical results that relate our method to existing schemes for interactive RL and illustrated the power of action pruning in two toy domains.

A promising avenue for future work is dynamic state manipulation protocols, which can guide an agent’s learning process by incrementally obscuring confusing features, highlighting relevant features, or simply reducing the dimensionality of the representation. Additionally, future work might investigate whether certain types of value initialization protocols can be captured by protocol programs, such as the optimistic initialization for arbitrary domains developed by Machado et al. [29]. Moreover, the full combinatorial space of learning protocols is suggestive of teaching paradigms that have yet to be explored. We hypothesize that there are powerful teaching methods that take advantage of the interplay between state manipulation, action pruning, and reward shaping. A further challenge is to extend the formalism to account for the interplay between multiple agents, in both competitive and cooperative settings.

Additionally, in our experiments, all protocols are explicitly programmed in advance. In the future, we’d like to experiment with dynamic protocols with a human in the loop during the learning process.

Lastly, an alternate perspective on the framework is that of a centaur system: a joint Human-AI decision maker [41]. Under this view, the human trains and queries the AI dynamically in cases where the human needs help. In the future, we’d like to establish and investigate formalisms relevant to the centaur view of the framework.

Acknowledgments

This work was supported by Future of Life Institute grant 2015-144846 and by the Future of Humanity Institute (Oxford). We thank Shimon Whiteson, James MacGlashan, and D. Ellis Hershkowitz for helpful conversations.

References

[1] How does Facebook determine what topics are trending?
https://www.facebook.com/help/737806312958641. Accessed: 2016-10-12.
[2] David Abel, David Ellis Hershkowitz, Gabriel Barth-Maron, Stephen Brawner, Kevin O’Farrell, James MacGlashan, and Stefanie Tellex. Goal-based action priors. In ICAPS, pages 306–314, 2015.
[3] David Abel, D Ellis Hershkowitz, and Michael L. Littman. Near optimal behavior via approximate state abstraction. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
[4] Ofra Amir, Ece Kamar, Andrey Kolobov, and Barbara Grosz. Interactive teaching strategies for agent training. IJCAI, 2016.
[5] Ronen I Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213–231, 2003.
[6] Christian Daniel, Malte Viering, Jan Metz, Oliver Kroemer, and Jan Peters. Active reward learning. In Proceedings of Robotics Science & Systems, 2014.
[7] Thomas Dean, Robert Givan, and Sonia Leach. Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pages 124–131. Morgan Kaufmann Publishers Inc., 1997.
[8] Sam Devlin and Daniel Kudenko. Dynamic potential-based reward shaping. Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), (June):433–440, 2012.
[9] Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
[10] Kurt Driessens and Sašo Džeroski. Integrating guidance into relational reinforcement learning. Machine Learning, 57(3):271–304, 2004.
[11] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. arXiv preprint arXiv:1604.06778, 2016.
[12] Eyal Even-Dar and Yishay Mansour. Approximate equivalence of Markov decision processes. In Learning Theory and Kernel Machines, pages 581–594. Springer, 2003.
[13] Norman Ferns, Pablo Samuel Castro, Doina Precup, and Prakash Panangaden.
Methods for computing state similarity in Markov decision processes. Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 2006.
[14] Javier Garcia and Fernando Fernandez. Safe reinforcement learning in high-risk tasks through policy improvement. IEEE SSCI 2011: Symposium Series on Computational Intelligence - ADPRL 2011: 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 76–83, 2011.
[15] Javier Garcia and Fernando Fernandez. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45:515–564, 2012. ISSN 10769757.