A K-fold Method for Baseline Estimation in Policy Gradient Algorithms

Nithyanand Kota†, Abhishek Mishra†, Sunil Srinivasa†
Samsung SDS Research America
†These authors contributed equally.
{n.kota, a2.mishra, s.srinivasa}@samsung.com

Xi (Peter) Chen, Pieter Abbeel
OpenAI
UC Berkeley, Department of EECS
{peter, pieter}@openai.com
Abstract

The high variance of unbiased policy-gradient methods such as VPG and REINFORCE is typically mitigated by adding a baseline. However, the baseline fitting itself suffers from either an underfitting or an overfitting problem. In this paper, we develop a K-fold method for baseline estimation in policy gradient algorithms. The parameter K is a baseline-estimation hyperparameter that adjusts the bias-variance trade-off in the baseline estimates. We demonstrate the usefulness of our approach via two state-of-the-art policy gradient algorithms on three MuJoCo locomotion control tasks.
1 Introduction
Policy gradient algorithms have garnered a lot of popularity in the area of reinforcement learning, since they directly optimize the cumulative reward function, and can be used in a straightforward manner with non-linear function approximators such as deep neural networks [1, 2]. They have been successfully applied towards solving continuous control tasks [3, 1] and learning to play Atari from raw pixels [1, 4]. Traditional policy gradient algorithms such as REINFORCE [5] and vanilla policy gradient [6] operate by repeatedly computing estimates of the gradient of the expected reward of a policy, and then updating the policy in the direction of the gradient. While these algorithms provide an unbiased policy gradient estimate, the variance of the estimates is often quite high, which can severely degrade the algorithm's performance. Several classes of algorithms have been proposed towards reducing the variance of the policy gradient, namely actor-critic methods [7], using the discount factor [6], and generalized advantage estimation [8].
The variance of the policy gradient can be further reduced (whilst adding no bias) by adding a baseline [9], which is commonly implemented as an estimator of the state-value function. Typically, policy gradient algorithms operate in an iterative fashion: in each iteration, they use predictions from the baseline that was fitted in the previous iteration. If the policy changes drastically between iterations, the baseline becomes a poor estimate of the state-value function, resulting in underfitting. Alternatively, we could fit the baseline and use it to predict the value function in the same iteration. This approach suffers from the overfitting problem. In the extreme case, if we only have one trajectory starting from each state and the baseline can fit the data perfectly, then the baseline would perfectly predict the returns, giving us no gradient signal at all.
In this paper, we propose a new method (which we name the K-fold method) for baseline estimation in policy gradient algorithms. The parameter K is a baseline-estimation hyperparameter that adjusts the bias-variance trade-off to provide a good balance between overfitting and underfitting. We apply our baseline prediction method in conjunction with two policy gradient algorithms, Trust Region Policy Optimization (TRPO) [1] and Truncated Natural Policy Gradient (TNPG) [10]. We analyze the effect of different K values on performance for three MuJoCo locomotion control tasks: Walker, Hopper and Half-Cheetah.
2 Policy Gradient Preliminaries
The agent's interaction with the environment is broken down into a series of N episodes or trajectories. The jth trajectory, $j \in \{1, \ldots, N\}$, is notated $\tau_j = (s_0^j, a_0^j, r_0^j, s_1^j, a_1^j, r_1^j, \ldots, s_{T-1}^j, a_{T-1}^j, r_{T-1}^j, s_T^j)$, where $s_t$, $a_t$ and $r_t$ are the state, action and (instantaneous) reward respectively, as seen by the system at time $t$, and $T$ is the time horizon. The actions are chosen by the agent in each time step according to a policy $\pi$: $a_t \sim \pi(a_t \mid s_t)$, and the next state and reward are sampled according to the transition probability distribution $s_{t+1}, r_t \sim P(s_{t+1}, r_t \mid s_t, a_t)$.
A well-known expression [5] for the policy gradient is
$$
g := \nabla_\theta \, \mathbb{E}_\tau[R(\tau)]
   = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, R(\tau)\right], \tag{1}
$$
where the policy $\pi$ is parameterized by $\theta$ and $R$ refers to the return of a trajectory under this policy. An estimate of (1) that provides better (lower-variance) policy gradients is
$$
\hat{g} = \frac{1}{NT} \sum_{j=1}^{N} \left[\sum_{t=0}^{T} \nabla_\theta \log \pi(a_t^j \mid s_t^j, \theta) \left(\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}^j - b(s_t^j)\right)\right], \tag{2}
$$
where $N$ is the batch size, and $b(s_t^j)$ refers to the baseline, which is an approximation of the state-value function: $b(s_t^j) \approx \mathbb{E}\big[\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}^j \,\big|\, s_t^j\big]$. Classically, policy gradient algorithms work in an iterative fashion as follows:
iterativefashionasfollows:
Algorithm1ClassicalPolicyGradientAlgorithm
Initialize:Foriterationi=0,initializethepolicyparameterθ randomly,andthebaselineatiteration
0
0,b(0)(·)to0.
Iterate: Repeatforeachiterationi,i∈1,2,...untilconvergence:
1: SampleN trajectoriesforpolicyπ(·|·,θi−1): τj:1,...,N.
2: Evaluategˆasgivenby(2)usingpredictionsonthebaselineatiterationi−1,b(i−1)(·).
3: Updatethepolicyparametersusingthepolicygradients: θi−1 →θi.
4: Foreachstatest,computethediscountedreturnsRt =(cid:80)Tt(cid:48)=tγt(cid:48)−trt(cid:48).
5: Train(fit)anewbaselineregressionmodelb(i)(·)withinputsstandoutputsRt.
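For concreteness, the following is a minimal Python sketch of the loop in Algorithm 1, not the implementation used in our experiments; `sample_trajectory` and the `policy`/`baseline` interfaces (`grad_log_prob`, `update`, `fit`, `predict`) are hypothetical placeholders.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """R_t = sum_{t' >= t} gamma^(t'-t) * r_{t'} for a single trajectory."""
    R = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

def classical_policy_gradient(env, policy, baseline, n_iters, N, T, gamma):
    """Sketch of Algorithm 1: the baseline used in iteration i was fit in iteration i-1."""
    for i in range(n_iters):
        trajs = [sample_trajectory(env, policy, T) for _ in range(N)]      # step 1
        grad = 0.0
        for states, actions, rewards in trajs:
            returns = discounted_returns(rewards, gamma)                   # step 4
            adv = returns - baseline.predict(states)                       # uses b^(i-1)
            # step 2: accumulate the estimator of eq. (2)
            grad = grad + np.sum(policy.grad_log_prob(states, actions) * adv[:, None], axis=0)
        grad = grad / (N * T)
        policy.update(grad)                                                # step 3
        # step 5: refit the baseline on this iteration's data (used in the next iteration)
        all_states = np.concatenate([s for s, _, _ in trajs])
        all_returns = np.concatenate([discounted_returns(r, gamma) for _, _, r in trajs])
        baseline.fit(all_states, all_returns)
```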
3 K-Fold Method for Baseline Estimation
The policy update $\theta_{i-1} \to \theta_i$ above is performed in iteration $i$ using predictions from the baseline $b^{(i-1)}(\cdot)$ that was fit in the previous iteration $i-1$. An issue with this approach is underfitting: when the policy changes a lot between iterations, the baseline predictions will be quite noisy. To circumvent this issue, we could use the data samples collected during iteration $i$ for fitting the baseline $b^{(i)}(\cdot)$ first, and later use that same baseline's predictions for computing the gradient and updating the policy. As noted in [8], this can cause overfitting, i.e., it reduces the variance in $\hat{g}$ at the expense of additional bias. Moreover, in the extreme case, if we only have one trajectory starting from each state and the baseline fits the data perfectly, then the gradient estimate is always zero.
In this section, we introduce the K-fold variant of baseline prediction for policy gradient optimization (Algorithm 1) that is more sample-efficient than prior techniques, and helps counter both the underfitting and overfitting issues described above. At a high level, the algorithm operates by breaking the data samples into K partitions. For each partition, a baseline is trained using data from all the other partitions, and that baseline is used for predicting the value function. Since the baseline fitting is performed using samples from the current policy, we mitigate the problem of underfitting. Since we do not directly fit on the current partition's data samples, we also mitigate the overfitting issue. Our algorithm is general enough to be applicable to most policy optimization algorithms, including TRPO and TNPG.
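A minimal sketch of the K-fold baseline fit described above, assuming the trajectories and per-trajectory returns for the current iteration have already been collected; `make_baseline` and the regressor's `fit`/`predict` interface are hypothetical placeholders.

```python
import numpy as np

def kfold_baseline_predictions(trajectories, returns, K, make_baseline):
    """For each of K partitions of the trajectories, fit a baseline on the other
    K-1 partitions and predict state values for the held-out partition.
    Assumes K >= 2 (for K = 1 the classical previous-iteration baseline is used)."""
    N = len(trajectories)
    folds = np.array_split(np.arange(N), K)              # K disjoint partitions P_1..P_K
    predictions = [None] * N
    for k, fold in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        X = np.concatenate([trajectories[j]["states"] for j in train_idx])
        y = np.concatenate([returns[j] for j in train_idx])
        b_k = make_baseline()                             # fresh regressor for partition P_k
        b_k.fit(X, y)                                     # fit only on the other partitions
        for j in fold:                                    # out-of-partition value predictions
            predictions[j] = b_k.predict(trajectories[j]["states"])
    return predictions
```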
When we divide the data into multiple partitions, it is possible to perform policy optimization via two different methods. In the first method, which we call the parameter-based K-fold baseline estimation, we compute the gradients for each partition, use them to compute K different parameter vectors, and finally average all the parameters across the partitions to obtain the policy's new parameters. In the second method, which we call the gradient-based K-fold baseline estimation, we compute only the gradient for each partition and use the averaged gradient to determine the policy's new parameters. The pseudo-code for the parameter-based and gradient-based approaches is presented in Algorithm 2 and Algorithm 3, respectively.
The hyperparameter K essentially controls the bias-variance trade-off in the baseline estimates. On the one hand, when K is small, we have less data to fit each baseline, and consequently the baseline estimates have high variance. On the other hand, when K is high, we get to fit each baseline with more data, causing the baseline estimates to have low variance, albeit at an increased computational cost. The optimal value may be found using standard search techniques for hyperparameters. In this paper, we tabulate the performance on three MuJoCo tasks for three values of K (K = 1, 2 and 4) in Section 6.
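Since K is an ordinary hyperparameter, a plain grid search over a few candidate values suffices; the sketch below assumes a hypothetical `run_training(task, K, seed)` helper that trains with the given K and returns the average undiscounted return.

```python
import numpy as np

def select_K(task, candidates=(1, 2, 4), seeds=(0, 1, 2, 3, 4)):
    """Pick the K with the best mean performance, averaged over random seeds."""
    scores = {K: np.mean([run_training(task, K, seed) for seed in seeds])
              for K in candidates}
    best_K = max(scores, key=scores.get)
    return best_K, scores
```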
Algorithm 2 Parameter-based K-Fold Baseline Estimation for Policy Optimization

Initialize: For iteration $i = 0$, initialize the policy parameter $\theta_0$ randomly.
Iterate: Repeat for each iteration $i$, $i \in 1, 2, \ldots$ until convergence:
1: Sample $N$ trajectories from policy $\pi(\cdot \mid \cdot, \theta_{i-1})$: $\tau_j$, $j = 1, \ldots, N$.
2: For each state $s_t$ in the $N$ trajectories, compute the discounted returns $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$.
3: Partition the $N$ trajectories into $K < N$ disjoint partitions $P_1, P_2, \ldots, P_K$, with $\cup_{k=1}^{K} P_k = \cup_{j=1}^{N} \tau_j$.
4: for each partition $P_k$, $k \in \{1, \ldots, K\}$ do
5:   Train (fit) a baseline regression model $b_k^{(i)}(\cdot)$ for partition $P_k$ using the states and returns from all the remaining partitions ($P_l$, $l = 1, \ldots, K$, $l \neq k$) as inputs and outputs respectively.
6:   For partition $P_k$, initialize the policy with parameters $\theta_{i-1}$ and use any policy optimization algorithm to find optimized policy parameters $\theta_i^k$: $\theta_{i-1} \to \theta_i^k$.
7: end for
8: Update the policy parameters $\theta_{i-1} \to \theta_i$ as the average of all the optimized policy parameters obtained, i.e., $\theta_i = \frac{1}{K} \sum_{k=1}^{K} \theta_i^k$.
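The parameter-averaging step of Algorithm 2 can be summarized as in the sketch below, where `optimize_policy(theta, partition, baseline)` is a hypothetical helper that runs one policy-optimization update (e.g. a TRPO step) on a single partition and returns the updated parameter vector.

```python
import numpy as np

def parameter_based_update(theta_prev, partitions, baselines, optimize_policy):
    """Algorithm 2, steps 4-8: starting from theta_{i-1}, run one policy-optimization
    update per partition, then average the K resulting parameter vectors."""
    thetas = [optimize_policy(theta_prev, P_k, b_k)        # theta_{i-1} -> theta_i^k
              for P_k, b_k in zip(partitions, baselines)]
    return np.mean(thetas, axis=0)                         # theta_i = (1/K) * sum_k theta_i^k
```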
4 Tasks

We evaluate our algorithm on three MuJoCo locomotion tasks, Hopper, Walker and Half-Cheetah, using the rllab [10] platform. These tasks are chosen since they are complex, yet allow for fast experimentation.
Hopper is a planar monopod robot with 4 rigid links and 3 actuated joints. The aim is to learn a stable hopping gait without falling, and to avoid local-minimum gaits like diving forward. The observation space is 20-dimensional, including joint angles, joint velocities, center-of-mass coordinates and the constraint forces. The reward function is given by $r(s,a) := v_x - 0.005\,\|a\|_2^2 + 1$. The episode is terminated when $z_{\mathrm{body}} < 0.7$ or $|\theta_y| \geq 0.2$, where $z_{\mathrm{body}}$ is the z-coordinate of the center of mass, and $\theta_y$ is the forward pitch of the body.
Walker is a planar bipedal robot with 7 rigid links and 6 actuated joints. The observation space is 21-dimensional, including joint angles, joint velocities and center-of-mass coordinates. The reward function is given by $r(s,a) := v_x - 0.005\,\|a\|_2^2$. The episode is terminated when $z_{\mathrm{body}} < 0.8$ or $z_{\mathrm{body}} > 2$, or when $|\theta_y| > 1$, where $z_{\mathrm{body}}$ is the z-coordinate of the center of mass, and $\theta_y$ is the forward pitch of the body.
Algorithm 3 Gradient-based K-Fold Baseline Estimation for Policy Optimization

Initialize: For iteration $i = 0$, initialize the policy parameter $\theta_0$ randomly.
Iterate: Repeat for each iteration $i$, $i \in 1, 2, \ldots$ until convergence:
1: Sample $N$ trajectories from policy $\pi(\cdot \mid \cdot, \theta_{i-1})$: $\tau_j$, $j = 1, \ldots, N$.
2: For each state $s_t$ in the $N$ trajectories, compute the discounted returns $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$.
3: Partition the $N$ trajectories into $K < N$ disjoint partitions $P_1, P_2, \ldots, P_K$, with $\cup_{k=1}^{K} P_k = \cup_{j=1}^{N} \tau_j$.
4: for each partition $P_k$, $k \in \{1, \ldots, K\}$ do
5:   Train (fit) a baseline regression model $b_k^{(i)}(\cdot)$ for partition $P_k$ using the states and returns from all the remaining partitions ($P_l$, $l = 1, \ldots, K$, $l \neq k$) as inputs and outputs respectively.
6:   Evaluate the gradient $g_k$ for partition $P_k$ using predictions from the baseline $b_k^{(i)}(\cdot)$.
7: end for
8: Compute the average gradient $g = \frac{1}{K} \sum_{k=1}^{K} g_k$.
9: Use the average gradient $g$ in any policy optimization algorithm to update the policy parameters: $\theta_{i-1} \to \theta_i$.
Cheetah is a planar bipedal robot with 9 links and 6 actuated joints. The observation space is 20-dimensional, including joint angles, joint velocities and center-of-mass coordinates. The reward function is given by $r(s,a) := v_x - 0.05\,\|a\|_2^2$.
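All three reward functions above share the form $r(s,a) = v_x - c\,\|a\|_2^2$, plus an alive bonus for Hopper. A small illustrative sketch with the coefficients taken from the task descriptions above (the function and dictionary names are ours):

```python
import numpy as np

# Control-cost coefficients and alive bonuses as stated in the task descriptions above.
TASKS = {
    "hopper":  {"ctrl_cost": 0.005, "alive_bonus": 1.0},
    "walker":  {"ctrl_cost": 0.005, "alive_bonus": 0.0},
    "cheetah": {"ctrl_cost": 0.05,  "alive_bonus": 0.0},
}

def locomotion_reward(task, forward_velocity, action):
    """r(s, a) = v_x - c * ||a||_2^2  (+ alive bonus for Hopper)."""
    cfg = TASKS[task]
    return forward_velocity - cfg["ctrl_cost"] * np.sum(np.square(action)) + cfg["alive_bonus"]
```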
5 Experimental Setup

The experimental setup used to generate the results is described below.
Performance Metrics: For each experiment (running a specific algorithm on a specific task), we define its performance as $\frac{1}{I}\sum_{i=1}^{I} R_i$, where $I$ is the number of training iterations and $R_i$ is the undiscounted average return for the ith iteration. This is essentially the area under the average-return curve. We use 5 random starting seeds and report the performance averaged over all the seeds.
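In code, this metric is simply the mean of the per-iteration average returns, aggregated across seeds; a minimal sketch (the function name and input layout are ours):

```python
import numpy as np

def performance(avg_returns_per_seed):
    """avg_returns_per_seed: one length-I array per random seed, holding the
    undiscounted average return R_i at each training iteration i."""
    per_seed = [np.mean(R) for R in avg_returns_per_seed]   # (1/I) * sum_i R_i for each seed
    return np.mean(per_seed), np.std(per_seed)              # mean and spread across seeds
```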
Policy Network: We employ a feed-forward Multi-Layer Perceptron (MLP) with 3 hidden layers of sizes 100, 50 and 25, with tanh nonlinearities after the first two hidden layers, that maps states to the mean of a Gaussian distribution. We use the conjugate gradient method for policy optimization.
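A sketch of this policy network in PyTorch (the experiments themselves use rllab's implementation); the state-independent log-standard-deviation parameter and the output-layer arrangement are assumptions where the text is not explicit:

```python
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    """100-50-25 MLP mapping a state to the mean of a Gaussian action distribution.
    tanh follows only the first two hidden layers, per the description above.
    The state-independent log-std parameter is an assumption of this sketch."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 100), nn.Tanh(),
            nn.Linear(100, 50), nn.Tanh(),
            nn.Linear(50, 25),
            nn.Linear(25, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())
```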
Baseline: We use a Gaussian MLP for the baseline representation as well, with 2 hidden layers of size 32 each (and a tanh nonlinearity after the first hidden layer). The baseline is fitted using the ADAM first-order optimization method [11]. In each iteration, we use 10 ADAM steps, each with a batch size of 50.
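Likewise, a sketch of the baseline network and its per-iteration fit, again in PyTorch rather than the rllab implementation; hyperparameters not stated in the text (e.g. the learning rate) are assumptions:

```python
import torch
import torch.nn as nn

class MLPBaseline(nn.Module):
    """32-32 MLP value baseline; tanh only after the first hidden layer, per the text."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 32), nn.Tanh(),
            nn.Linear(32, 32),
            nn.Linear(32, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def fit_baseline(baseline, states, returns, n_steps=10, batch_size=50, lr=1e-3):
    """Per-iteration baseline fit: 10 Adam steps, each on a minibatch of 50 samples."""
    opt = torch.optim.Adam(baseline.parameters(), lr=lr)    # learning rate is an assumption
    for _ in range(n_steps):
        idx = torch.randint(0, states.shape[0], (batch_size,))
        loss = ((baseline(states[idx]) - returns[idx]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```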
Hyperparameters: For all the experiments, we use the same experimental setup described in [10], as listed in Table 1.

Table 1: Basic experimental setup parameters.

Discount Factor                     0.99
Horizon                             500
Number of Iterations                500
Step size for both TRPO and TNPG    0.08
6 Results
We evaluate both the parameter-based and the gradient-based K-fold algorithms on the three locomotion tasks listed above, with two policy gradient algorithms, TRPO and TNPG. In the tables below, we report the mean and standard deviation of the returns. We also highlight (boldface) the cases where the mean − std value was the best.
6.1 Results with TRPO, Data size = 50,000

First, we evaluate the K-fold methods (parameter-based and gradient-based) with the Trust Region Policy Optimization (TRPO) algorithm. We used a data (sample) size of 50,000. The results are tabulated in Table 2.
Table 2: Results with TRPO and data size of 50,000.

Parameter-based
Task      K = 1            K = 2            K = 4
Walker    911.0 ± 681.0    1015.7 ± 327.3   938.7 ± 462.1
Hopper    727.7 ± 242.6    723.7 ± 190.5    721.4 ± 149.5
Cheetah   1595.1 ± 404.4   1528.5 ± 406.6   1383.8 ± 356.1
Under the parameter-based approach, it is observed that for K > 1, the mean KL divergence between consecutive policies is much lower than the prescribed constraint value of 0.08. In fact, the KL divergence values are seen to scale down linearly with increasing values of K (see Figure 1 (Left)). A lower KL divergence between consecutive policies implies reduced exploration for larger K. To overcome this shortcoming, we perform an additional experiment for the parameter-based method where the step size is scaled up by a factor of K. After scaling up the step sizes, we observe that the mean KL divergence values are comparable across all values of K (see Figure 1 (Center)). Table 3 provides the return numbers for the parameter-based approach with scaled step sizes.
Figure 1: Mean ± std KL divergence numbers for K = 1, 2 and 4 for the Cheetah task. The solid red line depicts the mean KL divergence constraint of 0.08. (Left): the KL divergence numbers for the parameter-based method; they are seen to scale down linearly with increasing K. (Center): the KL divergence numbers for the parameter-based method with scaled step sizes, and (Right): the KL divergence numbers for the gradient-based method. For the latter two methods, the KL divergence numbers are very comparable across the three values of K.
Table 3: Results with TRPO with scaled step sizes and data size of 50,000.

Parameter-based (with scaled step sizes)
Task      K = 1            K = 2            K = 4
Walker    911.0 ± 681.0    1053.7 ± 427.1   1188.8 ± 251.8
Hopper    727.7 ± 242.6    811.3 ± 112.7    745.1 ± 114.1
Cheetah   1595.1 ± 404.4   1496.3 ± 401.4   1600.0 ± 237.9
The scaling idea above was an ad-hoc approach for improving the exploration for cases with K > 1. Alternatively, we can use the gradient-based method, wherein averaging is performed across the K parameter gradients themselves, naturally ensuring that the mean KL divergence values remain similar and below the constraint (see Figure 1 (Right)). Table 4 lists the return numbers for the gradient-based K-fold variant. Figure 2 (Left) plots the average return mean and std numbers for the Hopper task.
6.2 Results with TNPG, Data size = 50,000

Next, we evaluate the K-fold gradient-based method on another policy gradient algorithm, the Truncated Natural Policy Gradient (TNPG). The results are tabulated in Table 5.
Table 4: Results with TRPO with the gradient-based approach and a data size of 50,000.

Gradient-based
Task      K = 1            K = 2            K = 4
Walker    911.0 ± 681.0    1035.0 ± 491.1   1092.8 ± 401.2
Hopper    727.7 ± 242.6    786.0 ± 171.1    847.7 ± 274.0
Cheetah   1595.1 ± 404.4   1664.1 ± 337.1   1676.1 ± 333.4
Table 5: Results with TNPG with the gradient-based approach and data size of 50,000.

Gradient-based
Task      K = 1            K = 2            K = 4
Walker    863.1 ± 714.7    930.3 ± 597.2    984.2 ± 674.7
Hopper    895.4 ± 173.4    858.3 ± 93.7     775.5 ± 246.0
Cheetah   1635.6 ± 352.9   1602.3 ± 392.3   1627.8 ± 314.5
6.3 Results with TRPO and TNPG, Data size = 5,000

Finally, we evaluate the K-fold gradient-based method with a lower data size of 5,000. The results are presented in Table 6. Figure 2 (Right) plots the average return mean and std numbers for the Walker task.
Table 6: Results with TRPO and TNPG with the gradient-based approach and data size of 5,000.

TRPO
Task      K = 1            K = 2            K = 4
Walker    375.1 ± 170.2    292.8 ± 167.9    389.6 ± 225.1
Hopper    346.9 ± 36.2     348.3 ± 41.5     319.5 ± 13.4
Cheetah   666.5 ± 364.0    727.0 ± 313.2    591.5 ± 184.4

TNPG
Task      K = 1            K = 2            K = 4
Walker    299.4 ± 154.0    316.6 ± 164.6    336.7 ± 91.9
Hopper    331.4 ± 42.6     317.3 ± 29.5     344.7 ± 31.9
Cheetah   609.5 ± 215.3    445.9 ± 228.8    445.9 ± 181.9

Figure 2: Average mean ± standard deviation trajectories. (Left): Hopper task with TRPO and a data size of 50,000. (Right): Walker2D task with TNPG and a data size of 5,000.
7 Summary

In this paper, we proposed a new method for baseline estimation in policy gradient algorithms. We tested our method across three environments (Walker, Hopper, and Half-Cheetah), two policy gradient algorithms (TRPO and TNPG), two data sizes (50,000 and 5,000), and three different values of K (1, 2, and 4). We find that the hyperparameter K is a useful parameter for adjusting the bias-variance trade-off, in order to achieve a good balance between overfitting and underfitting. While no single value of K was found to yield the best average return across all the cases, it was observed that values other than K = 1 can provide improved returns in certain cases, highlighting the need for a hyperparameter search across the candidate K values. As part of future work, it will be interesting to study the benefit that the K-fold method provides for other environments, policy gradient algorithms, step sizes and batch sizes.
Acknowledgement

The authors would like to thank Girish Kathalagiri, Aleksander Beloi, Luis Carlos Quintela and Atul Varshneya for their invaluable help and suggestions during various stages of this project.
References
[1] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
[2] Sham Kakade. A natural policy gradient. In Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NIPS 2001), pages 1531–1538. MIT Press, 2001.
[3] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, May 2008.
[4] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.
[5] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
[6] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. A. Solla, T. K. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000.
[7] Vijay R. Konda and John N. Tsitsiklis. On actor-critic algorithms. SIAM J. Control Optim., 42(4):1143–1166, April 2003.
[8] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. CoRR, abs/1506.02438, 2015.
[9] Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. J. Mach. Learn. Res., 5:1471–1530, December 2004.
[10] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. arXiv preprint arXiv:1604.06778, 2016.
[11] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.