Balancing Multiple Sources of Reward in Reinforcement Learning

Christian R. Shelton
Artificial Intelligence Lab
Massachusetts Institute of Technology
Cambridge, MA 02139
[email protected]

Abstract

For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.

1 Introduction

In the traditional reinforcement learning framework, the learning agent is given a single scalar value of reward at each time step. The goal is for the agent to optimize the sum of these rewards over time (the return). For many applications, there is more information available.

Consider the case of a home entertainment system designed to sense which residents are currently in the room and automatically select a television program to suit their tastes. We might construct the reward signal to be the total number of people paying attention to the system. However, a reward signal of 2 ignores important information about which two users are watching. The users of the system change as people leave and enter the room. We could, in theory, learn the relationship among the users present, who is watching, and the reward. In general, it is better to use the domain knowledge we have instead of requiring the system to learn it. We know which users are contributing to the reward and that only present users can contribute.

In other cases, the multiple sources aren't users, but goals. For elevator scheduling we might be trading off people serviced per minute against average waiting time. For financial portfolio managing, we might be weighing profit against risk. In these cases, we may wish to change the weighting over time. In order to keep from having to relearn the solution from scratch each time the weighting is changed, we need to keep track of which rewards to attribute to which goals.
There is a separate difficulty if the rewards are not designed functions of the state but rather are given by other agents or people in the environment. Consider the case of the entertainment system above, but where every resident has a dial by which they can give the system feedback or reward. The rewards are incomparable. One user may decide to reward the system with values twice as large as those of another, which should not result in that user having twice the control over the entertainment. This isn't limited to scalings but also includes any other monotonic transforms of the returns. If the users of the system know they are training it, they will employ all kinds of reward strategies to try to steer the system to the desired behavior [2]. By keeping track of the sources of the rewards, we will derive an algorithm to overcome these difficulties.

1.1 Related Work

The work presented here is related to recent work on multiagent reinforcement learning [1, 4, 5, 7] in that multiple reward signals are present and game theory provides a solution. This work is different in that it attacks a simpler problem where the computation is consolidated on a single agent. Work on multiple goals (see [3, 8] as examples) is also related but assumes either that the returns of the goals are to be linearly combined into an overall value function or that only one goal is to be solved at a time.

1.2 Problem Setup

We will be working with partially observable environments with discrete actions and discrete observations. We make no assumptions about the world model and thus do not use belief states. x(t) and a(t) are the observation and action, respectively, at time t. We consider only reactive policies (although the observations could be expanded to include history). π(x,a) is the policy, or the probability that the agent will take action a when observing x.

At each time step, the agent receives a set of rewards (one for each source in the environment); r_s(t) is the reward at time t from source s. We use the average reward formulation, and so R_s^π = lim_{n→∞} (1/n) E[r_s(1) + r_s(2) + ··· + r_s(n) | π] is the expected return from source s for following policy π. It is this return that we want to maximize for each source.

We will also assume that the algorithm knows the set of sources present at each time step. Sources which are not present provide a constant reward, regardless of the state or action, which we will assume to be zero. All sums over sources will be assumed to be taken over only the present sources.

The goal is to produce an algorithm that will produce a policy based on previous experience and the sources present. The agent's experience will take the form of prior interactions with the world. Each experience is a sequence of observation, action, and reward triplets for a particular run of a particular policy.

2 Balancing Multiple Rewards

2.1 Policy Votes

If rewards are not directly comparable, we need to find a property of the sources which is comparable and a metric to optimize. We begin by noting that we want to limit the amount of control any given source has over the behavior of the agent. To that end, we construct the policy as the average of a set of votes, one for each source present. The votes for a source must sum to 1 and must all be non-negative (thus giving each source an equal "say" in the agent's policy). We will first consider restricting the rewards from a given source to only affect the votes for that source. The form for the policy is therefore

\[
\pi(x,a) = \frac{\sum_s \alpha_s(x)\, v_s(x,a)}{\sum_s \alpha_s(x)} \tag{1}
\]

where for each present source s, Σ_x α_s(x) = 1, α_s(x) ≥ 0 for all x, Σ_a v_s(x,a) = 1 for all x, and v_s(x,a) ≥ 0 for all x and a. We have broken apart the vote from a source into two parts, α and v. α_s(x) is how much effort source s is putting into affecting the policy for observation x. v_s(x,a) is the vote by source s for the policy for observation x. Mathematically this is the same as constructing a single vote (v′_s(x,a) = α_s(x) v_s(x,a)), but we find α and v to be more interpretable.
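For concreteness, the mixing in equation 1 can be written as a short NumPy routine. This is a minimal sketch: the array layout and the uniform fallback for observations on which no present source spends any effort are illustrative assumptions, not details fixed by the text.

```python
import numpy as np

def mix_policy(alpha, v):
    """Combine per-source votes into the agent's policy (equation 1).

    alpha : (S, X) array; alpha[s, x] is the effort source s puts on observation x.
    v     : (S, X, A) array; v[s, x] is source s's vote over actions for observation x.
    Returns pi of shape (X, A), one action distribution per observation.
    """
    num = np.einsum('sx,sxa->xa', alpha, v)   # sum_s alpha_s(x) v_s(x, a)
    den = alpha.sum(axis=0)[:, None]          # sum_s alpha_s(x)
    uniform = np.full_like(num, 1.0 / v.shape[2])
    return np.where(den > 0, num / np.maximum(den, 1e-12), uniform)
```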
We have constrained the total effort and vote any one source can apply. Unfortunately, these votes are not quite the correct parameters for our policy. They are not invariant to the other sources present. To illustrate this, consider the example of a single state with two actions, two sources, and a learning agent with the voting method from above. If s_1 prefers only a_1 and s_2 likes an equal mix of a_1 and a_2, the agent will learn a vote of (1,0) for s_1, and s_2 can reward the agent to cause it to learn a vote of (0,1) for s_2, resulting in a policy of (0.5,0.5). Whether this is the correct final policy depends on the problem definition. However, the real problem arises when we consider what happens if s_1 is removed. The policy reverts to (0,1), which is far from s_2's (the only present source's) desired (0.5,0.5). Clearly, the learned votes for s_2 are meaningless when s_1 is not present.

Thus, while the voting scheme does limit the control each present source has over the agent, it does not provide a description of the source's preferences which would allow for the removal or addition (or reweighting) of sources.

2.2 Returns as Preferences

While rewards (or returns) are not comparable across sources, they are comparable within a source. In particular, we know that if R_s^{π_1} > R_s^{π_2}, then source s prefers policy π_1 to policy π_2. We do not know how to weigh that preference against a different source's preference, so an explicit tradeoff is still impossible, but we can limit (using the voting scheme of equation 1) how much one source's preference can override another source's preference.

We allow a source's preference for a change to prevail inasmuch as its votes are sufficient to effect the change in the presence of the other sources' votes. We have a type of general-sum game (letting the sources be the players, in game-theory jargon). The value to source s′ of the set of all sources' votes is R_{s′}^π, where π is the function of the votes defined in equation 1. Each source s′ would like to set its particular votes, α_{s′}(x) and v_{s′}(x,a), to maximize its value (or return). Our algorithm will set each source's vote in this way, thus ensuring that no source could do better by "lying" about its true reward function.

In game theory, a "solution" to such a game is called a Nash equilibrium [6], a point at which each player (source) is playing (voting) its best response to the other players. At a Nash equilibrium, no single player can change its play and achieve a gain. Because the votes are real-valued, we are looking for the equilibrium of a continuous game. We will derive a fictitious play algorithm to find an equilibrium for this game.

3 Multiple Reward Source Algorithm

3.1 Return Parameterization

In order to apply the ideas of the previous section, we must find a method for finding a Nash equilibrium. To do that, we will pick a parametric form for R̂_s^π (the estimate of the return): linear in the KL-divergence between a target vote and π. Letting a_s, b_s, β_s(x), and p_s(x,a) be the parameters of R̂_s^π,

\[
\hat{R}_s^\pi = -a_s \sum_x \beta_s(x) \sum_a p_s(x,a) \log\frac{p_s(x,a)}{\pi(x,a)} + b_s \tag{2}
\]

where a_s ≥ 0, β_s(x) ≥ 0, Σ_x β_s(x) = 1, p_s(x,a) ≥ 0, and Σ_a p_s(x,a) = 1. Just as α_s(x) was the amount of vote source s was putting towards the policy for observation x, β_s(x) is the importance for source s of the policy for observation x. And, while v_s(x,a) was the policy vote for observation x for source s, p_s(x,a) is the preferred policy for observation x for source s. The constants a_s and b_s allow for scaling and translation of the return.

If we let p′_s(x,a) = a_s β_s(x) p_s(x,a), then, given experiences of different policies and their empirical returns, we can estimate p′_s(x,a) using linear least-squares. Imposing the constraints just involves finding the normal least-squares fit with the constraint that all p′_s(x,a) be non-negative. From p′_s(x,a) we can calculate a_s = Σ_{x,a} p′_s(x,a), β_s(x) = (1/a_s) Σ_a p′_s(x,a), and p_s(x,a) = p′_s(x,a) / Σ_{a′} p′_s(x,a′). We now have a method for estimating R̂_s^π from experience; we still need a way to compute the agent's policy.
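Under equation 2, the return estimate is linear in the features log π(x,a) with non-negative coefficients p′_s(x,a) plus an intercept (the Σ_{x,a} p′_s(x,a) log p_s(x,a) term does not depend on π and folds into b_s). The sketch below is one way to carry out that constrained fit; the use of scipy.optimize.lsq_linear and the variable names are illustrative choices, not specifications from the text.

```python
import numpy as np
from scipy.optimize import lsq_linear

def fit_return_model(policies, returns, eps=1e-12):
    """Fit p'_s(x,a) = a_s beta_s(x) p_s(x,a) for one source (section 3.1).

    policies : (N, X, A) array of policies the agent has followed
    returns  : (N,) array of that source's empirical average rewards
    Solves a least-squares problem in the features log pi(x,a) with the
    coefficients p' constrained to be non-negative and a free intercept.
    """
    N, X, A = policies.shape
    features = np.log(np.maximum(policies, eps)).reshape(N, X * A)
    design = np.hstack([features, np.ones((N, 1))])        # last column = intercept
    lower = np.concatenate([np.zeros(X * A), [-np.inf]])
    upper = np.full(X * A + 1, np.inf)
    sol = lsq_linear(design, returns, bounds=(lower, upper))
    p_prime = sol.x[:-1].reshape(X, A)
    a = p_prime.sum()                                       # a_s
    beta = p_prime.sum(axis=1) / max(a, eps)                # beta_s(x)
    p = p_prime / np.maximum(p_prime.sum(axis=1, keepdims=True), eps)  # p_s(x,a)
    return a, beta, p
```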
3.2 Best Response Algorithm

To produce an algorithm for finding a Nash equilibrium, let us first start by deriving an algorithm for finding the best response for source s to a set of votes. We need to find the set of α_s(x) and v_s(x,a) that satisfy the constraints on the votes and maximize equation 2, which is the same as minimizing

\[
-\sum_x \beta_s(x) \sum_a p_s(x,a) \log \frac{\sum_{s'} \alpha_{s'}(x)\, v_{s'}(x,a)}{\sum_{s'} \alpha_{s'}(x)} \tag{3}
\]

over α_s(x) and v_s(x,a) for the given s, because the other terms depend on neither α_s(x) nor v_s(x,a).

To minimize equation 3, let us first fix the α-values and optimize v_s(x,a). We will ignore the non-negativity constraints on v_s(x,a) and just impose the constraint that Σ_a v_s(x,a) = 1. The solution, whose derivation is simple and omitted due to space, is

\[
v_s(x,a) = p_s(x,a) + \frac{\sum_{s' \neq s} \alpha_{s'}(x)\,\bigl(p_s(x,a) - v_{s'}(x,a)\bigr)}{\alpha_s(x)}. \tag{4}
\]

We impose the non-negativity constraints by setting to zero any v_s(x,a) which are negative and renormalizing.

Unfortunately, we have not been able to find such a nice solution for α_s(x). Instead, we use gradient descent to optimize equation 3, yielding

\[
\Delta\alpha_s(x) \;\propto\; \frac{\beta_s(x)}{\sum_{s'} \alpha_{s'}(x)} \sum_a \frac{p_s(x,a)\,\sum_{s' \neq s} \alpha_{s'}(x)\,\bigl(v_s(x,a) - v_{s'}(x,a)\bigr)}{\sum_{s'} \alpha_{s'}(x)\, v_{s'}(x,a)}. \tag{5}
\]

We constrain the gradient to fit the constraints.

We can find the best response for source s by iterating between the two steps above. First we initialize α_s(x) = β_s(x) for all x. We then solve for a new set of v_s(x,a) with equation 4. Using those v-values, we take a step in the direction of the gradient of α_s(x) with equation 5. We keep repeating until the solution converges (reducing the step size each iteration), which usually takes only a few tens of steps.
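A compact NumPy sketch of this inner loop follows equations 4 and 5. The step-size schedule, the clip-and-renormalize projections back onto the constraint sets, and the guards against zero denominators are illustrative choices left open by the text.

```python
import numpy as np

def best_response(s, alpha, v, beta, p, iters=50, step0=0.5, eps=1e-12):
    """Approximate best response of source s to the other sources' votes (section 3.2).

    alpha : (S, X), v : (S, X, A), beta : (S, X), p : (S, X, A), as in the text.
    Returns the updated (alpha_s, v_s) for source s.
    """
    S, X, A = v.shape
    al_s, v_s = beta[s].copy(), p[s].copy()        # initialization used in the text
    others = [t for t in range(S) if t != s]
    for it in range(iters):
        # Equation 4: closed-form vote update for fixed alpha, then clip and renormalize.
        resid = sum(alpha[t][:, None] * (p[s] - v[t]) for t in others)
        v_s = p[s] + resid / np.maximum(al_s[:, None], eps)
        v_s = np.maximum(v_s, 0.0)
        v_s /= np.maximum(v_s.sum(axis=1, keepdims=True), eps)
        # Equation 5: gradient step on alpha_s, then project back onto the simplex.
        total = al_s + sum(alpha[t] for t in others)
        mix = al_s[:, None] * v_s + sum(alpha[t][:, None] * v[t] for t in others)
        diff = sum(alpha[t][:, None] * (v_s - v[t]) for t in others)
        grad = beta[s] / np.maximum(total, eps) * \
               (p[s] * diff / np.maximum(mix, eps)).sum(axis=1)
        al_s = np.maximum(al_s + step0 / (1.0 + it) * grad, 0.0)
        al_s /= max(al_s.sum(), eps)
    return al_s, v_s
```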
Figure 1: Load-unload problem. The right is the state diagram. Cargo is loaded in state 1. Delivery to a boxed state results in reward from the source associated with that state. The left is the solution found. For state 5, from left to right are shown the p-values, the v-values, and the policy.

Figure 2: Transfer of the load-unload solution: plots of the same values as in figure 1, but with the left source absent. No additional learning was allowed (the left-side plots are the same). The votes, however, change, and thus so does the final policy.

3.3 Nash Equilibrium Algorithm

To find a Nash equilibrium, we start with α_s(x) = β_s(x) and v_s(x,a) = p_s(x,a) and iterate to an equilibrium by repeatedly finding the best response for each source and simultaneously replacing the old solution with the new best responses. To prevent oscillation, whenever the change in α_s(x) v_s(x,a) grows from one step to the next, we replace the old solution with one halfway between the old and new solutions and continue the iteration.
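Putting the pieces together, a sketch of the simultaneous fictitious-play loop, reusing mix_policy and best_response from the sketches above; the iteration cap and the scalar measure of the change in α_s(x)v_s(x,a) are illustrative choices.

```python
import numpy as np

def nash_equilibrium(beta, p, iters=100):
    """Simultaneous fictitious play over the present sources (section 3.3).

    beta : (S, X) and p : (S, X, A) are the fitted return parameters of section 3.1.
    Starts from alpha_s = beta_s and v_s = p_s, repeatedly replaces every source's
    vote with its best response, and backs off halfway whenever the change grows.
    """
    alpha, v = beta.copy(), p.copy()
    prev_change = np.inf
    for _ in range(iters):
        new_alpha, new_v = alpha.copy(), v.copy()
        for s in range(beta.shape[0]):
            new_alpha[s], new_v[s] = best_response(s, alpha, v, beta, p)
        change = np.abs(new_alpha[:, :, None] * new_v - alpha[:, :, None] * v).sum()
        if change > prev_change:
            # damp oscillation: move only halfway toward the new best responses
            new_alpha = 0.5 * (alpha + new_alpha)
            new_v = 0.5 * (v + new_v)
        alpha, v, prev_change = new_alpha, new_v, change
    return mix_policy(alpha, v)
```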
4 Example Results

In all of these examples we used the same learning scheme. We run the algorithm for a series of epochs. At each epoch, we calculate π using the Nash equilibrium algorithm. With probability ε, we replace π with one chosen uniformly over the simplex of conditional distributions; this ensures some exploration. We follow π for a fixed number of time steps and record the average reward for each source. We add these average rewards and the empirical estimate of the policy followed as data to the least-squares estimate of the returns. We then repeat for the next epoch.
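This outer loop can be sketched as below, reusing the earlier sketches. The env.rollout interface (returning the empirical policy followed and each source's average reward), the linear ε schedule, and the Dirichlet draw used for the uniformly random policy are all illustrative assumptions about details the text leaves open.

```python
import numpy as np

def run_epochs(env, n_epochs, steps_per_epoch, n_obs, n_act, n_src, eps0=0.5, eps1=0.1):
    """Outline of the learning scheme used in the experiments (section 4)."""
    experience = []                               # (empirical policy, avg rewards) pairs
    pi = np.full((n_obs, n_act), 1.0 / n_act)     # start from a uniform policy
    for epoch in range(n_epochs):
        if experience:
            policies = np.array([e[0] for e in experience])
            fits = [fit_return_model(policies, np.array([e[1][s] for e in experience]))
                    for s in range(n_src)]
            beta = np.stack([f[1] for f in fits])  # importance weights beta_s(x)
            p = np.stack([f[2] for f in fits])     # preferred policies p_s(x,a)
            pi = nash_equilibrium(beta, p)
        # exploration: occasionally follow a policy drawn uniformly from the simplex
        eps = eps0 + (eps1 - eps0) * epoch / max(n_epochs - 1, 1)
        if np.random.rand() < eps:
            pi = np.random.dirichlet(np.ones(n_act), size=n_obs)
        empirical_pi, avg_rewards = env.rollout(pi, steps_per_epoch)
        experience.append((empirical_pi, avg_rewards))
    return pi
```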
4.1 Multiple Delivery Load-Unload Problem

We extend the classic load-unload problem to multiple receivers. The observation state is shown in figure 1. The hidden state is whether the agent is currently carrying cargo. Whenever the agent enters the top state (state 1), cargo is placed on the agent. Whenever the agent arrives in any of the boxed states while carrying cargo, the cargo is removed and the agent receives reward. For each boxed state, there is one reward source which only rewards deliveries to that state (a reward of 1 for a delivery and 0 for all other time steps). In state 5, the agent has the choice of four actions, each of which moves the agent to the corresponding state without error. Since the agent can observe neither whether it has cargo nor its history, the optimal policy for state 5 is stochastic.

The algorithm set all α- and β-values to 0 for states other than state 5. We started ε at 0.5 and reduced it to 0.1 by the end of the run. We ran for 300 epochs of 200 iterations, by which point the algorithm consistently settled on the solution shown in figure 1. For each source, the algorithm found the best solution of randomly picking between the load state and the source's delivery state (as shown by the p-values). The votes are heavily weighted towards the delivery actions to overcome the other sources' preferences, resulting in an approximately uniform policy. The important point is that, without additional learning, the policy can be changed if the left source leaves. The learned α- and p-values are kept the same, but the Nash equilibrium is different, resulting in the policy in figure 2.

Figure 3: One-way door state diagram. At every state there are two actions (right and left) available to the agent. In states 1, 9, 10, and 15, where there is only a single outgoing edge, both actions follow the same edge. With probability 0.1, an action will actually follow the other edge. Source 1 rewards entering state 1, whereas source 2 rewards entering state 9.

Figure 4: One-way door solution. From left to right: the sources' ideal policies, the votes, and the final agent's policy. Light bars are for states for which both actions lead to the same state.

4.2 One-way Door Problem

In this case we consider the environment shown in figure 3. From each state the agent can move to the left or right except in states 1, 9, 10, and 15, where there is only one possible action. We can think of states 1 and 9 as one-way doors. Once the agent enters state 1 or 9, it may not pass back through except by going around through state 5. Source 1 gives reward when the agent passes through state 1. Source 2 gives reward when the agent passes through state 9. Actions fail (move in the opposite direction than intended) 0.1 of the time.

We ran the learning scheme for 1000 epochs of 100 iterations, starting ε at 0.5 and reducing it to 0.015 by the last epoch. The algorithm consistently converged to the solution shown in figure 4. Source 1 considers the left-side states (2–5 and 11–12) the most important, while source 2 considers the right-side states (5–8 and 13–14) the most important. The ideal policies captured by the p-values show that source 1 wants the agent to move left and source 2 wants the agent to move right in the upper states (2–8), while the sources agree that in the lower states (11–14) the agent should move towards state 5. The votes reflect this preference and agreement. Both sources spend most of their vote on state 5, the state they both feel is important and on which they disagree. On the other states (states for which only one source has a strong opinion or on which they agree), they do not need to spend much of their vote. The resulting policy is the natural one: in state 5, the agent randomly picks a direction, after which the agent moves around the chosen loop quickly to return to state 5. Just as in the load-unload problem, if we remove one source, the agent automatically adapts to the ideal policy for the remaining source (with only one source, s_0, present, π(x,a) = p_{s_0}(x,a)).

Estimating the optimal policies separately and then taking the mixture of these two policies would produce a far worse result. For states 2–8, the two sources would have differing opinions and the mixture model would produce a uniform policy in those states; the agent would spend most of its time near state 5. Constructing a reward signal that is the sum of the sources' rewards does not lead to a good solution either. The agent would find that circling either the left or right loop is optimal and would have no incentive to ever travel along the other loop.

5 Conclusions

It is difficult to conceive of a method for providing a single reward signal that would result in the solution shown in figure 4 and still automatically change when one of the reward sources was removed. The biggest improvement in the algorithm will come from changing the form of the R̂_s^π estimator. For problems in which there is a single best solution, the KL-divergence measure seems to work well. However, we would like to be able to extend the load-unload result to the situation where the agent has a memory bit. In this case, the returns as a function of π are bimodal (due to the symmetry in the interpretation of the bit). In general, allowing each source's preference to be modelled in a more complex manner could help extend these results.

Acknowledgments

We would like to thank Charles Isbell, Tommi Jaakkola, Leslie Kaelbling, Michael Kearns, Satinder Singh, and Peter Stone for their discussions and comments.

This report describes research done within CBCL in the Department of Brain and Cognitive Sciences and in the AI Lab at MIT. This research is sponsored by grants from ONR contracts Nos. N00014-93-1-3085 & N00014-95-1-0600, and NSF contracts Nos. IIS-9800032 & DMS-9872936. Additional support was provided by: AT&T, Central Research Institute of Electric Power Industry, Eastman Kodak Company, Daimler-Chrysler, Digital Equipment Corporation, Honda R&D Co., Ltd., NEC Fund, Nippon Telegraph & Telephone, and Siemens Corporate Research, Inc.

References

[1] J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proc. of the 15th International Conf. on Machine Learning, pages 242–250, 1998.

[2] C. L. Isbell, C. R. Shelton, M. Kearns, S. Singh, and P. Stone. A social reinforcement learning agent. 2000. Submitted to Autonomous Agents 2001.

[3] J. Karlsson. Learning to Solve Multiple Goals. PhD thesis, University of Rochester, 1997.

[4] M. Kearns, Y. Mansour, and S. Singh. Fast planning in stochastic games. In Proc. of the 16th Conference on Uncertainty in Artificial Intelligence, 2000.

[5] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proc. of the 11th International Conference on Machine Learning, pages 157–163, 1994.

[6] G. Owen. Game Theory. Academic Press, UK, 1995.

[7] S. Singh, M. Kearns, and Y. Mansour. Nash convergence of gradient dynamics in general-sum games. In Proc. of the 16th Conference on Uncertainty in Artificial Intelligence, 2000.

[8] S. P. Singh. The efficient learning of multiple task sequences. In NIPS, volume 4, 1992.