Angrier Birds: Bayesian Reinforcement Learning
Artificial Intelligence CS221 Final Project
Imanol Arrieta Ibarra¹, Bernardo Ramos¹, Lars Roemheld¹
¹Department of Management Science and Engineering, Stanford University, California, United States

Abstract
We train a reinforcement learner to play a simplified version of the game Angry Birds. The learner is provided with a game state in a manner similar to the output that could be produced by computer vision algorithms. We improve on the efficiency of regular ε-greedy Q-Learning with linear function approximation through more systematic exploration in Randomized Least Squares Value Iteration (RLSVI), an algorithm that samples its policy from a posterior distribution on optimal policies. With larger state-action spaces, efficient exploration becomes increasingly important, as evidenced by the faster learning in RLSVI.

Keywords
Reinforcement Learning — Q-Learning — RLSVI — Exploration

Contents
Introduction
1 Model
2 Algorithms
2.1 Q-Learning (linear function approximation)
2.2 Randomized Least Squares Value Iteration
2.3 Feature extractors (Pig Position Values • Pig Position Indicator (PP) • Nested Pig Position Counter (NPP) • Nested Pig Positions: Shifted Counters (NPPS) • Nested Pig Positions Counters with Obstacles (NPPO))
3 Results and Discussion
3.1 Comparison of feature extractors
3.2 RLSVI vs. regular Q-Learning
3.3 Learning success
4 Conclusions and Future Work
References

Introduction
Angry Birds has been a wildly successful internet franchise centered around an original mobile game in which players shoot birds out of a slingshot to hit targets for points. Angry Birds is largely a deterministic game¹: a complex physics engine governs flight, collision, and impact to appear almost realistic. As such, optimal gameplay could be achieved by optimizing a highly complex function that describes trajectory planning – instead, we train several reinforcement learning algorithms to play the game, and compare their final performance and learning speeds with those of a human-play oracle.

To simulate gameplay, we adapted an open-source project that recreates Angry Birds in Python [1] using the Pygame framework. After adapting the base code to our needs, we designed an API which exposed the game as a Markov Decision Process (MDP), largely following the conventions of our class framework.

The Python port we used is a simplified version of the original Angry Birds game: it includes 11 levels, only one (standard) type of bird, only one type of target, and only one type of beam. A level ends whenever there are either no targets or no more birds on the screen. If all targets were destroyed, the agent advances to the next level; otherwise, the agent loses and goes back to the first level. Each level's score is calculated from three quantities: the number of birds left, the number of shattered beams and columns, and the number of destroyed targets. If the agent loses, it incurs an arbitrary penalty.

This somewhat reduced state-action space allowed a relatively straightforward proof of concept – we believe that very similar algorithms and models would also work for more complex versions of the game.

¹ In our simulation, the physics engine actually seemed to produce semi-random results: the same sequence of moves was not guaranteed to yield the exact same outcomes.

1. Model
The game state exposed by our API is composed of the following information. This state representation fully describes the relevant game situation from the player's perspective and therefore suffices to interface between the learner and the game. We note that all parts of the game state could be obtained from computer vision algorithms, so that in principle no direct interaction with the game mechanics would be necessary.

• Number of birds left ("ammunition")
• Number of targets ("pigs"), and their absolute positions on the game map
• Number of obstacles ("beams"), and their absolute positions on the game map
• The game score and level achieved by the player so far

Figure 1. Angry Birds gameplay: the stationary slingshot on the left is used to hit the targets on the right. The beam structures can be destroyed, allowing for more complex consequences and interactions of chosen actions.

We further simplified game mechanics by making the game fully turn-based: the learner always waits until all objects are motionless before taking the next action. This assumption allowed us to work on absolute positions only.

The set of possible actions is a pair (angle, distance), describing the extension and aiming of the slingshot and thereby the bird's launching momentum and direction. To somewhat accelerate learning, we allowed only shooting forwards. Within that constraint, we allowed all possible extensions and angles – we compare discretizations of the action space in varying levels of granularity. Our base discretization used 32 different (evenly spaced) actions.
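To make the MDP interface concrete, the sketch below shows one plausible way to represent the exposed game state and the discretized (angle, distance) action space. The class and function names, the 8 × 4 discretization, and the value ranges are our own illustration under the assumptions above, not the project's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple
import math

# Hypothetical state container mirroring the information listed above;
# names are illustrative, not taken from the project's code.
@dataclass(frozen=True)
class GameState:
    birds_left: int                                   # remaining "ammunition"
    pig_positions: Tuple[Tuple[float, float], ...]    # absolute (x, y) of each target
    beam_positions: Tuple[Tuple[float, float], ...]   # absolute (x, y) of each obstacle
    score: int                                        # cumulative game score
    level: int                                        # current level index

def discretized_actions(n_angles: int = 8, n_distances: int = 4,
                        max_distance: float = 1.0) -> List[Tuple[float, float]]:
    """Enumerate (angle, distance) pairs for forward shots only.

    With the default 8 x 4 grid this yields 32 evenly spaced actions,
    matching the base discretization described in the text.
    """
    angles = [i * (math.pi / 2) / (n_angles - 1) for i in range(n_angles)]      # 0..90 degrees
    distances = [(j + 1) * max_distance / n_distances for j in range(n_distances)]
    return [(a, d) for a in angles for d in distances]
```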
2. Algorithms
Reinforcement learning is a sub-field of artificial intelligence that describes ways for a learning algorithm to act in unknown and unspecified decision processes, guided merely by rewards and punishments for desirable and unwanted outcomes, respectively. As such, reinforcement learning closely mirrors human intuition about learning through conditioning, and allows algorithms to master systems that would otherwise be hard or impossible to specify.

A standard algorithm in reinforcement learning is ε-greedy Q-learning. While very effective, this algorithm relies on an erratic exploration of possible actions. We implemented a variation which updates a belief distribution on optimal policies, and then samples policies from this distribution [2]. This implies more systematic exploration, since policies that seem unattractive with high confidence will not be re-tried.

After introducing the two algorithms, we discuss various iterations to find optimal feature extractors.

2.1 Q-Learning (linear function approximation)
As exposed in [3], Q-Learning is a model-free algorithm that approximates a state-action value function Q_opt(s,a) in order to derive an optimal policy for an unknown decision process. In order to learn optimal behavior in an unknown game, a Q-Learner needs to both explore the system by choosing previously unseen actions, and to exploit the knowledge thus gathered by choosing the action that promises best outcomes, given previous experience.

Balancing efficient exploration and exploitation is a main challenge in reinforcement learning: the most popular method in the literature is the use of ε-greedy methods: with probability ε, the algorithm would ignore its previous experience and pick an action at random. This method tends to work well in practice whenever the actions needed to increase a payoff are not too complicated. However, when complex sets of actions need to be taken in order to get a higher reward, ε-greedy tends to take exponential time to find these rewards (see [2], [4], [5]).

To aid generalization over the Angry Birds state space, we used linear function approximation: we defined Q_w(s,a) = w^T φ(s,a), where w ∈ R^f, and φ: S × A → R^f is a feature extractor that calculates f features from the given state-action pair. This allows the exploitation step to be an update only on the weights vector w, which we performed in a fashion similar to stochastic gradient descent (following the standard algorithm from [3]).
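As a rough illustration of the kind of update loop this describes, here is a minimal sketch of one ε-greedy Q-learning step with linear function approximation. The step size η, the values of ε and γ, and the reward/successor hooks are placeholders for whatever the game API provides, not the project's actual parameters.

```python
import random
import numpy as np

def epsilon_greedy_q_learning_step(w, phi, state, actions, reward_fn, successor_fn,
                                   epsilon=0.1, eta=0.01, gamma=0.95):
    """One interaction step of ε-greedy Q-learning with Q_w(s, a) = w · φ(s, a).

    w            -- weight vector (numpy array of length f)
    phi          -- feature extractor, phi(state, action) -> numpy array of length f
    actions      -- list of discretized (angle, distance) actions
    reward_fn    -- reward_fn(state, action) -> float (assumed provided by the game API)
    successor_fn -- successor_fn(state, action) -> next state (assumed provided by the game API)
    """
    # Exploration: with probability ε pick a random action, otherwise exploit.
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: w @ phi(state, a))

    reward = reward_fn(state, action)
    next_state = successor_fn(state, action)

    # Target value: one-step lookahead under the current weights.
    target = reward + gamma * max(w @ phi(next_state, a) for a in actions)
    prediction = w @ phi(state, action)

    # Stochastic-gradient-style update on the weights only.
    w = w - eta * (prediction - target) * phi(state, action)
    return w, next_state
```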
2.2 Randomized Least Squares Value Iteration
Osband et al. propose Randomized Least Squares Value Iteration (RLSVI) [2], an algorithm that differs from our implementation of Q-Learning in two points:

1. Instead of using gradient descent to learn from experience in an online fashion, RLSVI stores a full history of (state, action, reward, new state) tuples to derive an exact "optimal" policy, given the data.

2. Given hyperparameters about the quality of the linear approximation, Bayesian least squares is employed to estimate a distribution around the "optimal" policy derived in 1. The learner then samples a policy from this distribution – this replaces ε-greedy exploration.

Specifically, RLSVI models the state-action value function as follows:

Q_{w_t}(s_t, a_t) = Reward(s_t, a_t) + γ max_{a_{t+1}} Q_{w_t}(Successor(s_t, a_t), a_{t+1})
                  ≈ w_t^T φ(s_t, a_t) + ν_t

where γ is an arbitrary discount factor, ν_t ∼ N(0, σ²), and σ is a hyperparameter.

Given a memory of (state, action, reward, new state) tuples, we can use Bayesian least squares to derive w̄_t, the expected value of the optimal policy, and Σ_t = Cov(w_t), the covariance of the optimal policy (the details of which are fairly straightforward and can be found in [2]). Note that given the assumption of linear approximation, a weights vector w_t fully specifies a Q-function, and thereby a policy.

The learner then finally picks a policy by sampling

ŵ_t ∼ N(w̄_t, Σ_t)

and taking the action such that Q_{ŵ_t} predicts the highest reward. Instead of taking entirely random actions with probability ε, the algorithm thus always samples random policies, but converges to optimal means as more data is accumulated and variances shrink. Therefore, less time may be expected to be "wasted" on policies that are highly unlikely to be successful.

To keep this algorithm from erratic policy changes when only few observations have been made yet, we initialize with an uninformed prior. Furthermore, inspection showed that Σ_t was almost diagonal. To save computation, we therefore decided to sample using only the variances of the individual weights (ignoring covariance between weights).

Osband et al. propose RLSVI in the context of episodic learning, i.e. they suggest updating the policy only at the end of episodes of a fixed number of actions. This helps steady the algorithm – unfortunately, our game simulation was computationally expensive, and the relatively small number of actions that could be simulated forced us to learn in a continuous context.
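The sketch below shows the core RLSVI step as we read it from the description above and from [2]: a Bayesian least-squares fit over the stored memory of transitions, followed by sampling a weight vector from the resulting posterior, with the diagonal-covariance simplification mentioned in the text. The prior scale λ, the noise scale σ, and all function names are illustrative assumptions, not the project's actual code.

```python
import numpy as np

def rlsvi_sample_weights(memory, phi, actions, w_prev, gamma=0.95,
                         sigma=1.0, lam=1.0, diagonal_only=True):
    """Fit Bayesian least squares on the replay memory and sample a policy.

    memory  -- list of (state, action, reward, next_state) tuples
    phi     -- feature extractor, phi(state, action) -> numpy array of length f
    w_prev  -- weights from the previous iteration, used for the bootstrapped targets
    Returns a sampled weight vector, which defines the next greedy policy.
    """
    X = np.array([phi(s, a) for (s, a, r, s_next) in memory])
    # One-step bootstrapped targets computed with the previous weight vector.
    y = np.array([r + gamma * max(w_prev @ phi(s_next, a2) for a2 in actions)
                  for (s, a, r, s_next) in memory])

    f = X.shape[1]
    # Bayesian least squares: N(0, (1/lam) I) prior on w, noise variance sigma^2.
    precision = X.T @ X / sigma**2 + lam * np.eye(f)
    covariance = np.linalg.inv(precision)
    w_bar = covariance @ (X.T @ y) / sigma**2

    if diagonal_only:
        # Simplification used in the text: sample each weight independently.
        w_sample = np.random.normal(w_bar, np.sqrt(np.diag(covariance)))
    else:
        w_sample = np.random.multivariate_normal(w_bar, covariance)
    return w_sample
```

Acting is then simply greedy under the sampled weights, e.g. `max(actions, key=lambda a: w_sample @ phi(state, a))`, with no ε-randomization of individual actions.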
2.3 Feature extractors
In designing our feature extractors φ(s,a), we followed two premises. First, given that our value function approximation was linear, it was clear that interaction between features would have to be linear; in particular, separating the x and y components of locations would not work. Second, we wanted to give away as little domain knowledge about the game as possible – the task for the algorithm was to understand how to play Angry Birds without previously having a concept of targets or obstacles.

These two premises effectively impose a constraint on strategies, as there is no reliable way for the algorithm to detect, for example, whether an obstacle is just in front of or just behind a target. Regardless, our learner developed impressive performance, as will be seen later. One reason for this may be that the levels in our simulator were relatively simple.

We iterated over the following five ideas to form meaningful features. Due to the relatively complex state-action space, our feature space was fairly big, quickly spanning several thousand features. This turned out to be a problem for RLSVI, as memory and computational requirements grew to be intractable. Rewriting the algorithm to run on sparse matrices offered some help, but ultimately we found reasonable performance with only the best-functioning features.

2.3.1 Pig Position Values
In a first attempt, we used rounded target positions as feature values: for every pig i in the given state s, we would have feature values p_i = (x, y, xy)^T. The features were repeated for every action a, and would be non-zero only for the action that was taken. This allows different weights for different position-action combinations, which ultimately implies learning the best position-action combinations.

By including the interaction term xy, we had hoped to capture the relative distance of the target to the slingshot (which x and y couldn't establish by themselves in a linear function): this would have allowed fast generalization of target positions. Unfortunately, if in hindsight not surprisingly, this approach failed very quickly and produced practically no learning.

2.3.2 Pig Position Indicator (PP)
The next approach was an indicator variable for a fine grid of pig positions: we created a separate feature for every possible pig position and again included the action taken. This created an impractically large feature space, yet worked relatively well. It did, however, clearly "over-fit" – once a successful series of actions was found, the algorithm would reliably complete a level over and over again. If a target had moved only a few pixels, however, the algorithm would have to be retrained completely.

2.3.3 Nested Pig Position Counter (NPP)
In order to solve the over-fitting problem, we developed a nested mechanism that would generalize to unobserved states. To achieve this, we defined 3 nested grids over the game screen, as exemplified in Figure 2. The three grids are progressively smaller, and each square of each grid counts the number of targets contained in it.

Figure 2. NPP and NPPO grid structure: progressively smaller grids create squares within squares, extracting both specific location information and allowing for generalization.

This solved the generalization issue while maintaining some of the nice properties of the previous feature extractor. While the larger grids helped to generalize, the finer ones provided a way to remember when a level had already been observed.
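To illustrate the NPP idea, here is a minimal sketch of a nested-grid target counter. The grid resolutions, screen dimensions, and the way counts are keyed to the chosen action are our own assumptions for illustration; in practice each (grid level, cell, action) key would be mapped to an index of the weight vector w.

```python
from collections import defaultdict

def npp_features(state, action, screen_w=1200, screen_h=650,
                 grid_sizes=((2, 2), (4, 4), (8, 8))):
    """Nested Pig Position counter features (illustrative sketch).

    For each nested grid, count how many targets fall into each cell, and key
    every count to the action taken, so that a separate weight is learned for
    each (grid level, cell, action) combination. Returns a sparse dict
    {feature_key: value}.
    """
    features = defaultdict(float)
    for level, (nx, ny) in enumerate(grid_sizes):
        for (x, y) in state.pig_positions:
            cell = (int(x * nx / screen_w), int(y * ny / screen_h))
            # The feature is non-zero only for the action that was taken.
            features[("npp", level, cell, action)] += 1.0
    return dict(features)
```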
2.3.4 Nested Pig Positions: Shifted Counters (NPPS)
One issue with NPP is that, especially for the larger squares, targets close to a square boundary are not captured adequately. To improve on this, we created a copy of each grid, shifted diagonally by half a square size. Now every target lay in exactly two squares, allowing the learner to judge whether a target is further to the left or the right within a square. NPPS was therefore a feature set of two sets of three overlapping square grids each.

To our surprise, NPPS performed worse than the simpler NPP. We assume that this is due to the much bigger feature space, and the fact that we could not train the algorithms until full convergence due to computational limits.

2.3.5 Nested Pig Positions Counters with Obstacles (NPPO)
Finally, we tried to address the issue of obstacles: as described, we did not want to give away gameplay-related information by representing a fixed relationship between targets and obstacles. We therefore added counters for obstacles in the same way that we used them for targets, hoping that the algorithm would learn to prefer areas with a high targets-to-obstacles ratio.

Just as in the case of NPPS, we were surprised that adding information about obstacles was detrimental to learning success – indeed, adding obstacle information stopped learning altogether. As with NPPS, we suspect that NPPO may work well if given more time to converge in the bloated feature space.

3. Results and Discussion
The somewhat crude feature extractor NPP provided the best results, and RLSVI did indeed outperform the regular ε-greedy Q-Learner. RLSVI was particularly impressive for its ability to clear levels at almost the same speed as an (untrained) human player.

We could not simulate the gameplay until convergence as a result of computational limitations: the game engine we used was not intended for big simulations, and did not even speed up significantly when graphics were disabled.

3.1 Comparison of feature extractors
We compared our five different feature extractors on the Q-Learning baseline algorithm, as displayed in Figure 3.

Figure 3. Moving average for the different feature extractors. The dysfunctional "position values" feature extractor has been excluded. We display a moving average game score over the next 10 attempts after a number of attempts has been made (where an attempt is defined as any number of shots until a level is failed). We successively used one attempt to explore, and then recorded the following attempt without exploration, in order to not distort results with exploration behavior. The y-axis is in units of 10^4 points.

It is noteworthy that PP starts on a high baseline score, but does not improve much from there: this is due to the fact that the learner very quickly masters a level, but then does not generalize to following levels. As a consequence, PP tends to get stuck and achieve relatively constant game scores in the first few levels.

In contrast, NPP learns to master easy levels equally fast, but then carries over to harder levels better, yielding higher total scores per attempt. NPPS shows a very similar slope to NPP, supporting the explanation that the sheer number of features slows the learning process down. However, NPPS is clearly outperformed by NPP within our simulation timeframe. We are somewhat surprised at the complete failure of NPPO to improve over time.

3.2 RLSVI vs. regular Q-Learning
Exploring the state-action space according to posterior beliefs did indeed produce better results than the ε-greedy Q-Learner. We trained both algorithms on the same feature extractor (NPP) and compared a moving average game score per attempt. RLSVI achieved higher scores overall in the simulated timeframe, and learned more quickly. A comparison of the two algorithms is given in Figure 4.

Figure 4. RLSVI is faster and achieves greater rewards than Q-Learning. We display a moving average game score over the next 10 attempts after a number of attempts has been made (where an attempt is defined as any number of shots until a level is failed). We successively used one attempt to explore, and then recorded the following attempt without exploration, in order to not distort results with exploration behavior. The y-axis is in units of 10^4 points.

It should be noted that in our implementation, RLSVI required significantly more memory and computation than the regular Q-Learner, since it required keeping a complete history of observed features, and matrix operations on large data blocks.

3.3 Learning success
Neither a human player – our oracle for score comparison – nor any of the algorithms managed to go through all 11 levels provided by the game engine we used in one attempt. Both the regular Q-Learner and RLSVI, however, outperformed the human player (who shall remain unnamed, given that he was shamefully beaten by crude algorithms) in terms of maximum points achieved in a single attempt, and in terms of the highest level reached. Table 1 summarizes the maximum score attained and the highest level reached for different players.

Table 1. Maximum level and scores achieved. Both the regular Q-Learner and RLSVI outperform our untrained human player.

    Algorithm     Features    Score      Level
    Human         —           210000     7
    Q-Learning    PP          170000     8
    Q-Learning    NPP         495000     9
    Q-Learning    NPPO        120000     3
    RLSVI         NPP         550000     8

Comparing the number of attempts required to pass a given level provides interesting insights: especially in later levels, RLSVI's exploration proves very efficient, as it passes the levels in almost the same number of attempts as required by the human player. With the same features, regular Q-Learning required 3 times as many attempts, as can be seen in Table 2.

Table 2. Average number of trials needed to finish a level. RLSVI learns very quickly in later levels, and passes levels after impressively short exploration periods.

    Algorithm     Features    Level 0    Level 5    Level 7
    Human         —           1          2          4
    Q-Learning    PP          2          5          10
    Q-Learning    NPP         1          4          15
    Q-Learning    NPPO        4          20         40
    RLSVI         NPP         1          2          5
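The curves in Figures 3 and 4 come from the alternating explore/record protocol described in their captions: one attempt with exploration enabled, then one greedy attempt whose score is recorded, smoothed by a moving average over the next 10 recorded attempts. A minimal sketch of that bookkeeping follows; the `set_exploration` and `run_attempt` hooks are hypothetical interfaces, not the project's actual code.

```python
def evaluate(learner, game, n_attempts=200, window=10):
    """Alternate one exploration attempt with one recorded greedy attempt,
    then smooth recorded scores with a forward-looking moving average.

    run_attempt() is assumed to play shots until the level is failed and
    return that attempt's total game score (hypothetical interface).
    """
    recorded = []
    for i in range(n_attempts):
        learner.set_exploration(enabled=(i % 2 == 0))   # explore on even attempts
        score = game.run_attempt(learner)
        if i % 2 == 1:                                  # record only greedy attempts
            recorded.append(score)

    # Moving average over the next `window` recorded attempts at each point.
    return [sum(recorded[i:i + window]) / window
            for i in range(len(recorded) - window + 1)]
```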
4. Conclusions and Future Work
RLSVI's Bayesian approach to state-action space exploration seems like a promising and rather intuitive way of navigating unknown systems. Previous work has shown Bayesian belief updating and sampling for exploration to be highly efficient [4], which our findings confirm. RLSVI beat both the baseline Q-Learning algorithm and, somewhat embarrassingly, our human oracle.

A main constraint on our work was simulation speed, limiting the algorithm convergence we could achieve. We would have liked to further explore possible features, especially with more computation.

In the same vein, it would be interesting to train a deep neural network for the Q(s,a) function, to allow learning more complex interaction effects, in particular between targets and obstacles. Promising results combining Bayesian exploration with deep reinforcement learning have been shown in [5].

Finally, it may be interesting to explore different distribution assumptions within RLSVI; assuming least-squares noise to be normally distributed, and therefore sampling from a normal distribution around expected optimal policies, may be particularly prone to getting stuck in local minima. Using bootstrapping methods in lieu of closed-form Bayesian updates may prove to be a powerful improvement on the RLSVI algorithm explored here.

Appendix
Our work can be found at github.com/imanolarrieta/angrybirds. The original code for Angry Birds in Python can be found at https://github.com/estevaofon/angry-birds-python.

Acknowledgments
We would like to thank the CS221 teaching team for an inspiring quarter.

References
[1] estevaofon. Angry Birds in Python. GitHub repository at https://github.com/estevaofon/angry-birds-python, January 2015. Forked to https://github.com/imanolarrieta/angrybirds.
[2] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
[3] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[4] Ian Osband, Dan Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
[5] Ian Osband and Benjamin Van Roy. Bootstrapped Thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300, 2015.
[6] Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian Q-learning. In AAAI/IAAI, pages 761–768, 1998.
[7] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
[8] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, pages 237–285, 1996.
[9] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, Springer, Berlin, 2001.
[11] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[12] Itamar Arel, Cong Liu, T. Urbanik, and A. G. Kohls. Reinforcement learning-based multi-agent system for network traffic signal control. Intelligent Transport Systems, IET, 4(2):128–135, 2010.
