A K-fold Method for Baseline Estimation in Policy Gradient Algorithms

Nithyanand Kota†, Abhishek Mishra†, Sunil Srinivasa†
Samsung SDS Research America
†These authors contributed equally.
{n.kota, a2.mishra, s.srinivasa}@samsung.com

Xi (Peter) Chen, Pieter Abbeel
OpenAI; UC Berkeley, Department of EECS
{peter, pieter}@openai.com

arXiv:1701.00867v1 [cs.AI] 3 Jan 2017
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

The high variance issue in unbiased policy-gradient methods such as VPG and REINFORCE is typically mitigated by adding a baseline. However, the baseline fitting itself suffers from either an underfitting or an overfitting problem. In this paper, we develop a K-fold method for baseline estimation in policy gradient algorithms. The parameter K is a baseline estimation hyperparameter that adjusts the bias-variance trade-off in the baseline estimates. We demonstrate the usefulness of our approach via two state-of-the-art policy gradient algorithms on three MuJoCo locomotion control tasks.

1 Introduction

Policy gradient algorithms have garnered a lot of popularity in the area of reinforcement learning, since they directly optimize the cumulative reward function and can be used in a straightforward manner with non-linear function approximators such as deep neural networks [1, 2]. They have been successfully applied to solving continuous control tasks [3, 1] and to learning to play Atari from raw pixels [1, 4]. Traditional policy gradient algorithms such as REINFORCE [5] and vanilla policy gradient [6] operate by repeatedly computing estimates of the gradient of the expected reward of a policy, and then updating the policy in the direction of the gradient. While these algorithms provide an unbiased policy gradient estimate, the variance of the estimates is often quite high, which can severely degrade the algorithm's performance. Several classes of techniques have been proposed for reducing the variance of the policy gradient, namely actor-critic methods [7], using the discount factor [6], and generalized advantage estimation [8].

The variance of the policy gradient can be further reduced (whilst adding no bias) by adding a baseline [9], which is commonly implemented as an estimator of the state-value function. Typically, policy gradient algorithms operate in an iterative fashion: in each iteration, they use predictions from the baseline that was fitted in the previous iteration. If the policy changes drastically between iterations, the baseline becomes a poor estimate of the state-value function, resulting in underfitting. Alternatively, we could fit the baseline and use it to predict the value function in the same iteration. This approach suffers from the overfitting problem. In the extreme case, if we only have one trajectory starting from each state and the baseline could fit the data perfectly, then the baseline would perfectly predict the returns, giving us no gradient signal at all.

In this paper, we propose a new method (which we name the K-fold method) for baseline estimation in policy gradient algorithms. The parameter K is a baseline estimation hyperparameter that adjusts the bias-variance trade-off to provide a good balance between overfitting and underfitting. We apply our baseline prediction method in conjunction with two policy gradient algorithms: Trust Region Policy Optimization (TRPO) [1] and Truncated Natural Policy Gradient (TNPG) [10]. We analyze the effect of different values of K on performance for three MuJoCo locomotion control tasks: Walker, Hopper and Half-Cheetah.
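To make the degenerate case described above concrete, here is a tiny numerical illustration (not from the paper; the arrays are made up): if every state has a single sampled return and the baseline reproduces those returns exactly, every advantage term vanishes and the gradient estimate carries no learning signal.

```python
import numpy as np

# One Monte-Carlo return per start state, and a baseline that has overfit
# to exactly those returns. The resulting advantages are all zero, so the
# policy gradient estimate is a zero vector.
returns = np.array([12.3, -4.1, 7.8, 0.5])
baseline = returns.copy()                      # perfectly overfit baseline
advantages = returns - baseline                # all zeros
grad_log_pi = np.random.randn(4, 3)            # stand-in for grad_theta log pi(a|s)
policy_grad = (grad_log_pi * advantages[:, None]).mean(axis=0)
print(policy_grad)                             # a zero vector: no gradient signal
```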
2 Policy Gradient Preliminaries

The agent's interaction with the environment is broken down into a series of $N$ episodes or trajectories. The $j$th trajectory, $j \in \{1, \ldots, N\}$, is denoted $\tau_j = (s_0^j, a_0^j, r_0^j, s_1^j, a_1^j, r_1^j, \ldots, s_{T-1}^j, a_{T-1}^j, r_{T-1}^j, s_T^j)$, where $s_t$, $a_t$ and $r_t$ are the state, action and (instantaneous) reward, respectively, as seen by the system at time $t$, and $T$ is the time horizon. The actions are chosen by the agent in each time step according to a policy $\pi$: $a_t \sim \pi(a_t \mid s_t)$, and the next state and reward are sampled according to the transition probability distribution: $s_{t+1}, r_t \sim P(s_{t+1}, r_t \mid s_t, a_t)$.

A well-known expression [5] for the policy gradient is

$$ g := \nabla_\theta \, \mathbb{E}_\tau[R(\tau)] = \mathbb{E}_\tau\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \, R(\tau) \right], \qquad (1) $$

where the policy $\pi$ is parameterized by $\theta$ and $R$ refers to the return of a trajectory under this policy. An estimate of (1) that provides better (lower-variance) policy gradients is

$$ \hat{g} = \frac{1}{NT} \sum_{j=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi(a_t^j \mid s_t^j, \theta) \left( \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}^j - b(s_t^j) \right), \qquad (2) $$

where $N$ is the batch size and $b(s_t^j)$ refers to the baseline, which is an approximation of the state-value function: $b(s_t^j) \approx \mathbb{E}\big[\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}^j \mid s_t^j\big]$. Classically, policy gradient algorithms work in an iterative fashion as follows:

Algorithm 1: Classical Policy Gradient Algorithm
Initialize: For iteration $i = 0$, initialize the policy parameters $\theta_0$ randomly, and the baseline at iteration 0, $b^{(0)}(\cdot)$, to 0.
Iterate: Repeat for each iteration $i$, $i \in 1, 2, \ldots$, until convergence:
1: Sample $N$ trajectories for policy $\pi(\cdot \mid \cdot, \theta_{i-1})$: $\tau_j$, $j = 1, \ldots, N$.
2: Evaluate $\hat{g}$ as given by (2), using predictions from the baseline fitted at iteration $i-1$, $b^{(i-1)}(\cdot)$.
3: Update the policy parameters using the policy gradient: $\theta_{i-1} \to \theta_i$.
4: For each state $s_t$, compute the discounted returns $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$.
5: Train (fit) a new baseline regression model $b^{(i)}(\cdot)$ with inputs $s_t$ and outputs $R_t$.

3 K-Fold Method for Baseline Estimation

The policy update $\theta_{i-1} \to \theta_i$ above is performed in iteration $i$ using predictions from the baseline $b^{(i-1)}(\cdot)$ that was fit in the previous iteration $i-1$. An issue with this approach is underfitting: when the policy changes a lot between iterations, the baseline predictions will be quite noisy. To circumvent this issue, we could use the data samples collected during iteration $i$ to fit the baseline $b^{(i)}(\cdot)$ first, and then use that same baseline's predictions for computing the gradient and updating the policy. As noted in [8], this can cause overfitting, i.e., it reduces the variance in $\hat{g}$ at the expense of additional bias. Moreover, in the extreme case, if we only have one trajectory starting from each state and the baseline fits the data perfectly, then the gradient estimate is always zero.

In this section, we introduce the K-fold variant of baseline prediction for policy gradient optimization (Algorithm 1), which is more sample-efficient than prior techniques and helps counter both the underfitting and overfitting issues described above. At a high level, the algorithm operates by breaking the data samples into $K$ partitions. For each partition, a baseline is trained using data from all the other partitions, and that baseline is used for predicting the value function on the partition itself. Since the baseline fitting is performed using samples from the current policy, we mitigate the problem of underfitting. Since we do not directly fit on the current partition's data samples, we also mitigate the overfitting issue. Our algorithm is general enough to be applicable to most policy optimization algorithms, including TRPO and TNPG.
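To make the partitioning idea concrete, the following is a minimal sketch of out-of-fold baseline prediction, assuming per-timestep arrays of states, empirical discounted returns and trajectory indices. It is only an illustration under these assumptions: the ridge regressor stands in for the Gaussian MLP baseline used later in the paper, and the function and argument names are ours, not taken from any released implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge  # stand-in for the paper's Gaussian MLP baseline

def kfold_baseline_predictions(states, returns, traj_ids, K=2):
    """For each of K disjoint trajectory partitions, fit the baseline on the
    other K-1 partitions and predict the value function on the held-out one."""
    folds = np.array_split(np.unique(traj_ids), K)   # partitions P_1, ..., P_K
    preds = np.empty_like(returns, dtype=float)
    for fold in folds:
        in_fold = np.isin(traj_ids, fold)            # timesteps belonging to P_k
        model = Ridge().fit(states[~in_fold], returns[~in_fold])   # fit on the rest
        preds[in_fold] = model.predict(states[in_fold])            # predict out-of-fold
    return preds

# Advantage terms for the estimator in (2):
# advantages = returns - kfold_baseline_predictions(states, returns, traj_ids, K=2)
```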
When we divide the data into multiple partitions, it is possible to perform policy optimization via two different methods. In the first method, which we call parameter-based K-fold baseline estimation, we compute the gradient for each partition, use these gradients to compute $K$ different sets of policy parameters, and finally average the parameters across the partitions to obtain the policy's new parameters. In the second method, which we call gradient-based K-fold baseline estimation, we compute only the gradient for each partition and use the averaged gradient to determine the policy's new parameters. The pseudo-code for the parameter-based and gradient-based approaches is presented in Algorithm 2 and Algorithm 3, respectively.

The hyperparameter $K$ essentially controls the bias-variance trade-off in the baseline estimates. On the one hand, when $K$ is small, we have less data to fit each baseline, and consequently the baseline estimates have high variance. On the other hand, when $K$ is large, we get to fit each baseline with a lot of data, causing the baseline estimates to have low variance, although at an increased computational cost. The optimal value may be found using standard hyperparameter search techniques. In this paper, we tabulate the performance on three MuJoCo tasks for three values of $K$ ($K = 1, 2$ and $4$) in Section 6.

Algorithm 2: Parameter-based K-Fold Baseline Estimation for Policy Optimization
Initialize: For iteration $i = 0$, initialize the policy parameters $\theta_0$ randomly.
Iterate: Repeat for each iteration $i$, $i \in 1, 2, \ldots$, until convergence:
1: Sample $N$ trajectories from policy $\pi(\cdot \mid \cdot, \theta_{i-1})$: $\tau_j$, $j = 1, \ldots, N$.
2: For each state $s_t$ in the $N$ trajectories, compute the discounted returns $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$.
3: Partition the $N$ trajectories into $K < N$ disjoint partitions $P_1, P_2, \ldots, P_K$, with $\cup_{k=1}^{K} P_k = \cup_{j=1}^{N} \tau_j$.
4: for each partition $P_k$, $k \in \{1, \ldots, K\}$ do
5:   Train (fit) a baseline regression model $b_k^{(i)}(\cdot)$ for partition $P_k$ using the states and returns from all the remaining partitions ($P_l$, $l = 1, \ldots, K$, $l \neq k$) as inputs and outputs, respectively.
6:   Initialize the policy with parameters $\theta_{i-1}$ and use any policy optimization algorithm on partition $P_k$ to find optimized policy parameters $\theta_i^k$: $\theta_{i-1} \to \theta_i^k$.
7: end for
8: Update the policy parameters $\theta_{i-1} \to \theta_i$ as the average of all the optimized policy parameters obtained, i.e., $\theta_i = \frac{1}{K} \sum_{k=1}^{K} \theta_i^k$.
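As a rough illustration of the update in Algorithm 2 (step 8), the sketch below assumes a hypothetical optimize_policy(theta, partition) helper that runs one policy-optimization step (for example, a TRPO step) on a single partition using that partition's out-of-fold baseline; it is not the authors' implementation. Algorithm 3, shown next, averages per-fold gradients instead of per-fold parameters.

```python
import numpy as np

def parameter_based_update(theta_prev, partitions, optimize_policy):
    # Optimize the policy separately on each partition, restarting from
    # theta_prev each time (Algorithm 2, steps 4-7) ...
    per_fold_params = [optimize_policy(theta_prev, P_k) for P_k in partitions]
    # ... then average the K optimized parameter vectors (Algorithm 2, step 8).
    return np.mean(per_fold_params, axis=0)
```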
Algorithm 3: Gradient-based K-Fold Baseline Estimation for Policy Optimization
Initialize: For iteration $i = 0$, initialize the policy parameters $\theta_0$ randomly.
Iterate: Repeat for each iteration $i$, $i \in 1, 2, \ldots$, until convergence:
1: Sample $N$ trajectories from policy $\pi(\cdot \mid \cdot, \theta_{i-1})$: $\tau_j$, $j = 1, \ldots, N$.
2: For each state $s_t$ in the $N$ trajectories, compute the discounted returns $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$.
3: Partition the $N$ trajectories into $K < N$ disjoint partitions $P_1, P_2, \ldots, P_K$, with $\cup_{k=1}^{K} P_k = \cup_{j=1}^{N} \tau_j$.
4: for each partition $P_k$, $k \in \{1, \ldots, K\}$ do
5:   Train (fit) a baseline regression model $b_k^{(i)}(\cdot)$ for partition $P_k$ using the states and returns from all the remaining partitions ($P_l$, $l = 1, \ldots, K$, $l \neq k$) as inputs and outputs, respectively.
6:   Evaluate the gradient $g_k$ for partition $P_k$ using predictions from the baseline $b_k^{(i)}(\cdot)$.
7: end for
8: Compute the average gradient $g = \frac{1}{K} \sum_{k=1}^{K} g_k$.
9: Use the average gradient $g$ in any policy optimization algorithm to update the policy parameters: $\theta_{i-1} \to \theta_i$.

4 Tasks

We evaluate our algorithm on three MuJoCo locomotion tasks (Hopper, Walker and Half-Cheetah) using the rllab [10] platform. These tasks are chosen since they are complex, yet allow for fast experimentation.

Hopper is a planar monopod robot with 4 rigid links and 3 actuated joints. The aim is to learn a stable hopping gait without falling, and to avoid local-minimum gaits like diving forward. The observation space is 20-dimensional, including joint angles, joint velocities, center-of-mass coordinates and the constraint forces. The reward function is given by $r(s, a) := v_x - 0.005\,\|a\|_2^2 + 1$. The episode is terminated when $z_{\mathrm{body}} < 0.7$ or $|\theta_y| < 0.2$, where $z_{\mathrm{body}}$ is the $z$-coordinate of the center of mass and $\theta_y$ is the forward pitch of the body.

Walker is a planar bipedal robot with 7 rigid links and 6 actuated joints. The observation space is 21-dimensional, including joint angles, joint velocities and center-of-mass coordinates. The reward function is given by $r(s, a) := v_x - 0.005\,\|a\|_2^2$. The episode is terminated when $z_{\mathrm{body}} < 0.8$ or $z_{\mathrm{body}} > 2$, or when $|\theta_y| > 1$, where $z_{\mathrm{body}}$ is the $z$-coordinate of the center of mass and $\theta_y$ is the forward pitch of the body.

Cheetah is a planar bipedal robot with 9 links and 6 actuated joints. The observation space is 20-dimensional, including joint angles, joint velocities and center-of-mass coordinates. The reward function is given by $r(s, a) := v_x - 0.05\,\|a\|_2^2$.

5 Experimental Setup

The experimental setup used to generate the results is described below.

Performance Metrics: For each experiment (running a specific algorithm on a specific task), we define its performance as $\frac{1}{I} \sum_{i=1}^{I} R_i$, where $I$ is the number of training iterations and $R_i$ is the undiscounted average return for the $i$th iteration. This is essentially the area under the average-return curve. We use 5 random starting seeds and report the performance averaged over all the seeds.

Policy Network: We employ a feed-forward Multi-Layer Perceptron (MLP) with 3 hidden layers of sizes 100, 50 and 25, with tanh nonlinearities after the first two hidden layers, that maps states to the mean of a Gaussian distribution. We use the conjugate gradient method for policy optimization.

Baseline: We use a Gaussian MLP for the baseline representation as well, with 2 hidden layers of size 32 each (and a tanh nonlinearity after the first hidden layer). The baseline is fitted using the ADAM first-order optimization method [11]. In each iteration, we use 10 ADAM steps, each with a batch size of 50.

Hyperparameters: For all the experiments, we use the same experimental setup described in [10], as listed in Table 1.

Table 1: Basic experimental setup parameters.
Discount factor: 0.99
Horizon: 500
Number of iterations: 500
Step size for both TRPO and TNPG: 0.08

6 Results

We evaluate both the parameter-based and the gradient-based K-fold algorithms on the three locomotion tasks listed above and with two policy gradient algorithms, TRPO and TNPG. In the tables below, we provide the mean and standard deviation of the returns. We also highlight (boldface) the cases where the mean minus std number was the best.

6.1 Results with TRPO, Data Size = 50,000

First, we evaluate the K-fold methods (parameter-based and gradient-based) with the Trust Region Policy Optimization (TRPO) algorithm, using a data (sample) size of 50,000. The results are tabulated in Table 2.

Table 2: Results with TRPO and a data size of 50,000 (parameter-based).
Task      K = 1            K = 2            K = 4
Walker    911.0 ± 681.0    1015.7 ± 327.3   938.7 ± 462.1
Hopper    727.7 ± 242.6    723.7 ± 190.5    721.4 ± 149.5
Cheetah   1595.1 ± 404.4   1528.5 ± 406.6   1383.8 ± 356.1

Under the parameter-based approach, it is observed that for $K > 1$ the mean KL divergence between consecutive policies is much lower than the prescribed constraint value of 0.08. In fact, the KL divergence values are seen to scale down linearly with increasing values of $K$ (see Figure 1 (Left)). A lower KL divergence between consecutive policies implies reduced exploration for larger $K$. To overcome this shortcoming, we perform an additional experiment for the parameter-based method where the step size is scaled up by a factor of $K$. After scaling up the step sizes, we observe that the mean KL divergence values are comparable across all values of $K$ (see Figure 1 (Center)). Table 3 provides the return numbers for the parameter-based approach with scaled step sizes.

Figure 1: Mean ± std KL divergence numbers for K = 1, 2 and 4 for the Cheetah task. The solid red line depicts the mean KL divergence constraint of 0.08. (Left): the KL divergence numbers for the parameter-based method; they are seen to scale down linearly with increasing K. (Center): the KL divergence numbers for the parameter-based method with scaled step sizes. (Right): the KL divergence numbers for the gradient-based method. For the latter two methods, the KL divergence numbers are very comparable across the three values of K.
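Since Figure 1 (Right) and the discussion above contrast the two averaging schemes, here is a minimal sketch of the gradient-based update of Algorithm 3, complementing the parameter-based sketch given earlier. The policy_gradient and apply_update callables are hypothetical placeholders for evaluating $g_k$ on a partition with its out-of-fold baseline and for taking a single constrained TRPO/TNPG step; under this scheme the constrained update is applied once to the averaged gradient rather than separately to per-fold updates.

```python
import numpy as np

def gradient_based_update(theta_prev, partitions, policy_gradient, apply_update):
    # Evaluate g_k on each partition P_k using its out-of-fold baseline
    # (Algorithm 3, steps 4-7) ...
    grads = [policy_gradient(theta_prev, P_k) for P_k in partitions]
    # ... average the per-fold gradients (step 8) ...
    g_bar = np.mean(grads, axis=0)
    # ... and take a single constrained policy step with the averaged gradient (step 9).
    return apply_update(theta_prev, g_bar)
```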
Table 3: Results with TRPO, scaled step sizes, and a data size of 50,000 (parameter-based with scaled step sizes).
Task      K = 1            K = 2            K = 4
Walker    911.0 ± 681.0    1053.7 ± 427.1   1188.8 ± 251.8
Hopper    727.7 ± 242.6    811.3 ± 112.7    745.1 ± 114.1
Cheetah   1595.1 ± 404.4   1496.3 ± 401.4   1600.0 ± 237.9

The scaling idea above was an ad-hoc approach for improving exploration in the cases with $K > 1$. Alternatively, we can use the gradient-based method, wherein the averaging is performed across the $K$ parameter gradients themselves, naturally ensuring that the mean KL divergence values remain similar and below the constraint (see Figure 1 (Right)). Table 4 lists the return numbers for the gradient-based K-fold variant. Figure 2 (Left) plots the mean and std of the average return for the Hopper task.

Table 4: Results with TRPO, the gradient-based approach, and a data size of 50,000.
Task      K = 1            K = 2            K = 4
Walker    911.0 ± 681.0    1035.0 ± 491.1   1092.8 ± 401.2
Hopper    727.7 ± 242.6    786.0 ± 171.1    847.7 ± 274.0
Cheetah   1595.1 ± 404.4   1664.1 ± 337.1   1676.1 ± 333.4

6.2 Results with TNPG, Data Size = 50,000

Next, we evaluate the K-fold gradient-based method on another policy gradient algorithm, the Truncated Natural Policy Gradient (TNPG). The results are tabulated in Table 5.

Table 5: Results with TNPG, the gradient-based approach, and a data size of 50,000.
Task      K = 1            K = 2            K = 4
Walker    863.1 ± 714.7    930.3 ± 597.2    984.2 ± 674.7
Hopper    895.4 ± 173.4    858.3 ± 93.7     775.5 ± 246.0
Cheetah   1635.6 ± 352.9   1602.3 ± 392.3   1627.8 ± 314.5

6.3 Results with TRPO and TNPG, Data Size = 5,000

Finally, we evaluate the K-fold gradient-based method with a smaller data size. The results are presented in Table 6. Figure 2 (Right) plots the mean and std of the average return for the Walker task.

Table 6: Results with TRPO and TNPG, the gradient-based approach, and a data size of 5,000.
TRPO
Task      K = 1            K = 2            K = 4
Walker    375.1 ± 170.2    292.8 ± 167.9    389.6 ± 225.1
Hopper    346.9 ± 36.2     348.3 ± 41.5     319.5 ± 13.4
Cheetah   666.5 ± 364.0    727.0 ± 313.2    591.5 ± 184.4
TNPG
Task      K = 1            K = 2            K = 4
Walker    299.4 ± 154.0    316.6 ± 164.6    336.7 ± 91.9
Hopper    331.4 ± 42.6     317.3 ± 29.5     344.7 ± 31.9
Cheetah   609.5 ± 215.3    445.9 ± 228.8    445.9 ± 181.9

Figure 2: Average mean ± standard deviation trajectories. (Left): Hopper task with TRPO and a data size of 50,000. (Right): Walker2D task with TNPG and a data size of 5,000.

7 Summary

In this paper, we proposed a new method for baseline estimation in policy gradient algorithms. We tested our method across three environments (Walker, Hopper and Half-Cheetah), two policy gradient algorithms (TRPO and TNPG), two data sizes (50,000 and 5,000), and three values of K (1, 2 and 4). We find that K is a useful hyperparameter for adjusting the bias-variance trade-off, in order to achieve a good balance between overfitting and underfitting. While no single value of K was found to give the best average return across all the cases, values other than K = 1 provided improved returns in certain cases, highlighting the need for a hyperparameter search across the candidate K values. As part of future work, it will be interesting to study the benefit that the K-fold method provides for other environments, policy gradient algorithms, step sizes and batch sizes.

Acknowledgements

The authors would like to thank Girish Kathalagiri, Aleksander Beloi, Luis Carlos Quintela and Atul Varshneya for their invaluable help and suggestions during various stages of this project.

References

[1] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
[2] Sham Kakade. A natural policy gradient. In Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NIPS 2001), pages 1531–1538. MIT Press, 2001.
[3] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, May 2008.
[4] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.
[5] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
[6] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. A. Solla, T. K. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000.
[7] Vijay R. Konda and John N. Tsitsiklis. On actor-critic algorithms. SIAM J. Control Optim., 42(4):1143–1166, April 2003.
[8] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. CoRR, abs/1506.02438, 2015.
[9] Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. J. Mach. Learn. Res., 5:1471–1530, December 2004.
[10] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. arXiv preprint arXiv:1604.06778, 2016.
[11] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
