Policy Evaluation with Variance Related Risk Criteria in Markov Decision Processes

Aviv Tamar [email protected]
Dotan Di Castro [email protected]
Shie Mannor [email protected]
Department of Electrical Engineering, The Technion - Israel Institute of Technology, Haifa, Israel 32000

arXiv:1301.0104v1 [cs.LG] 1 Jan 2013

Abstract

In this paper we extend temporal difference policy evaluation algorithms to performance criteria that include the variance of the cumulative reward. Such criteria are useful for risk management, and are important in domains such as finance and process control. We propose both TD(0) and LSTD(λ) variants with linear function approximation, prove their convergence, and demonstrate their utility in a 4-dimensional continuous state space problem.

1. Introduction

In both Reinforcement Learning (RL; Bertsekas & Tsitsiklis, 1996) and planning in Markov Decision Processes (MDPs; Puterman, 1994), the typical objective is to maximize the cumulative (possibly discounted) expected reward, denoted by J. In many applications, however, the decision maker is also interested in minimizing some form of risk of the policy. By risk, we mean reward criteria that take into account not only the expected reward, but also some additional statistics of the total reward such as its variance, its Value at Risk, etc. (Luenberger, 1998).

In this work we focus on risk measures that involve the variance of the cumulative reward, denoted by V. Typical performance criteria that fall under this definition include

(a) Maximize J s.t. V ≤ c

(b) Minimize V s.t. J ≥ c

(c) Maximize the Sharpe Ratio: J/√V

(d) Maximize J − c√V

The rationale behind our choice of risk measure is that these performance criteria, such as the Sharpe Ratio (Sharpe, 1966) mentioned above, are being used in practice. Moreover, it seems that human decision makers understand how to use variance well, in comparison to exponential utility functions (Howard & Matheson, 1972), which require determining a non-intuitive exponent coefficient.

A fundamental concept in RL is the value function - the expected reward to go from a given state. Estimates of the value function drive most RL algorithms, and efficient methods for obtaining these estimates have been a prominent area of research. In particular, Temporal Difference (TD; Sutton & Barto, 1998) based methods have been found suitable for problems where the state space is large, requiring some sort of function approximation. TD methods enjoy theoretical guarantees (Bertsekas, 2012; Lazaric et al., 2010) and empirical success (Tesauro, 1995), and are considered the state of the art in policy evaluation.

In this work we present a TD framework for estimating the variance of the reward to go. Our approach is based on the following key observation: the second moment of the reward to go, denoted by M, together with the value function J, obeys a linear equation - similar to the Bellman equation that drives regular TD algorithms. By extending TD methods to jointly estimate J and M, we obtain a solution for estimating the variance, using the relation V = M − J².

We propose both a variant of Least Squares Temporal Difference (LSTD) (Boyan, 2002) and of TD(0) (Sutton & Barto, 1998) for jointly estimating J and M with a linear function approximation. For these algorithms, we provide convergence guarantees and error bounds. In addition, we introduce a novel approach for enforcing the approximate variance to be positive, through a constrained TD equation.
Finally, an empirical evaluation on a challenging continuous maze domain highlights both the usefulness of our approach, and the importance of the variance function in understanding the risk of a policy.

This paper is organized as follows. In Section 2 we present our formal RL setup. In Section 3 we derive the fundamental equations for jointly approximating J and M, and discuss their properties. A solution to these equations may be obtained by simulation, through the use of TD algorithms, as presented in Section 4. In Section 5 we further extend the LSTD framework by forcing the approximated variance to be positive. Section 6 presents an empirical evaluation, and Section 7 concludes and discusses future directions.

2. Framework and Background

We consider a Stochastic Shortest Path (SSP) problem¹ (Bertsekas, 2012), where the environment is modeled by an MDP in discrete time with a finite state set X ≜ {1,...,n} and a terminal state x*. A fixed policy π determines, for each x ∈ X, a stochastic transition to a subsequent state y ∈ {X ∪ x*} with probability P(y|x). We consider a deterministic and bounded reward function r : X → R. We denote by x_k the state at time k, where k = 0,1,2,....

¹ This is also known as an episodic setup.

A policy is said to be proper (Bertsekas, 2012) if there is a positive probability that the terminal state x* will be reached after at most n transitions, from any initial state. In this paper we make the following assumption.

Assumption 1. The policy π is proper.

Let τ ≜ min{k > 0 | x_k = x*} denote the first visit time to the terminal state, and let the random variable B denote the accumulated reward along the trajectory until that time²

    B ≜ Σ_{k=0}^{τ−1} r(x_k).

² We do not define the reward at the terminal state as it is not relevant to our performance criteria. However, the customary zero terminal reward may be assumed throughout the paper.

In this work, we are interested in the mean-variance tradeoff in B, represented by the value function

    J(x) ≜ E[B | x_0 = x],  x ∈ X,

and the variance of the reward to go

    V(x) ≜ Var[B | x_0 = x],  x ∈ X.

We will find it convenient to define also the second moment of the reward to go

    M(x) ≜ E[B² | x_0 = x],  x ∈ X.

Our goal is to estimate J(x) and V(x) from trajectories obtained by simulating the MDP with policy π.

3. Approximation of the Variance of the Reward To Go

In this section we derive a projected equation method for approximating J(x) and M(x) using linear function approximation. The estimation of V(x) will then follow from the relation V(x) = M(x) − J(x)².

Our starting point is a system of equations for J(x) and M(x), first derived by Sobel (1982) for a discounted infinite horizon case, and extended here to the SSP case. Note that the equation for J is the well known Bellman equation for a fixed policy, and independent of the equation for M.

Proposition 2. The following equations hold for x ∈ X:

    J(x) = r(x) + Σ_{y∈X} P(y|x) J(y),
    M(x) = r(x)² + 2r(x) Σ_{y∈X} P(y|x) J(y) + Σ_{y∈X} P(y|x) M(y).     (1)

Furthermore, under Assumption 1 a unique solution to (1) exists.

The proof is straightforward, and given in Appendix A.

At this point the reader may wonder why an equation for V is not presented. While such an equation may be derived, as was done in (Tamar et al., 2012), it is not linear. The linearity of (1) is the key to our approach. As we show in the next subsection, the solution to (1) may be expressed as the fixed point of a linear mapping in the joint space of J and M. We will then show that a projection of this mapping onto a linear feature space is contracting, thus allowing us to use existing TD theory to derive estimation algorithms for J and M.
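Before developing the projected approximation, Proposition 2 itself is easy to check numerically. The sketch below solves the two linear systems in (1), written in vector form as (I − P)J = r and (I − P)M = r² + 2r ⊙ PJ, for a small made-up SSP instance; the three-state chain, its transition probabilities, and its rewards are illustrative assumptions, not taken from the paper. It then compares J, M, and V = M − J² against Monte Carlo estimates.

```python
import numpy as np

# A small SSP instance (illustrative, not from the paper): 3 transient states.
# P holds only the transient-to-transient probabilities; the missing mass in
# each row is the probability of moving to the terminal state x*.
P = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.3, 0.7],
              [0.1, 0.0, 0.4]])
r = np.array([1.0, -2.0, 0.5])      # deterministic reward per state
n = len(r)
I = np.eye(n)

# Proposition 2 in vector form: (I - P) J = r and (I - P) M = r^2 + 2 r * (P J).
J = np.linalg.solve(I - P, r)
M = np.linalg.solve(I - P, r**2 + 2.0 * r * (P @ J))
V = M - J**2                         # variance of the reward to go

# Monte Carlo check from trajectories started at state 0.
rng = np.random.default_rng(0)
returns = []
for _ in range(50_000):
    x, B = 0, 0.0
    while x is not None:
        B += r[x]
        probs = np.append(P[x], 1.0 - P[x].sum())   # last entry: terminal state
        nxt = rng.choice(n + 1, p=probs)
        x = nxt if nxt < n else None
    returns.append(B)
returns = np.array(returns)
print("J(0):", J[0], "MC:", returns.mean())
print("M(0):", M[0], "MC:", (returns**2).mean())
print("V(0):", V[0], "MC:", returns.var())
```

The two solve calls are exactly the systems of Proposition 2, so the sampled mean, second moment, and variance of B should agree with them up to Monte Carlo error.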
3.1. A Projected Fixed Point Equation on the Joint Space of J and M

For the sequel we introduce the following vector notations. We denote by P ∈ R^{n×n} and r ∈ R^n the SSP transition matrix and reward vector, i.e., P_{x,y} = P(y|x) and r_x = r(x), where x, y ∈ X. Also, we define R ≜ diag(r).

For a vector z ∈ R^{2n} we let z_J ∈ R^n and z_M ∈ R^n denote its leading and ending n components, respectively. Thus, such a vector belongs to the joint space of J and M.

We define the mapping T : R^{2n} → R^{2n} by

    [Tz]_J = r + P z_J,
    [Tz]_M = Rr + 2RP z_J + P z_M.

It may easily be verified that a fixed point of T is a solution to (1), and by Proposition 2 such a fixed point exists and is unique.

When the state space X is large, a direct solution of (1) is not feasible, even if P may be accurately obtained. A popular approach in this case is to approximate J(x) by restricting it to a lower dimensional subspace, and use simulation based TD algorithms to adjust the approximation parameters (Bertsekas, 2012). In this paper we extend this approach to the approximation of M(x) as well.

We consider a linear approximation architecture of the form

    J̃(x) = φ_J(x)^T w_J,
    M̃(x) = φ_M(x)^T w_M,     (2)

where w_J ∈ R^{l_J} and w_M ∈ R^{l_M} are the approximation parameter vectors, φ_J(x) ∈ R^{l_J} and φ_M(x) ∈ R^{l_M} are state dependent features, and (·)^T denotes the transpose of a vector. The low dimensional subspaces are therefore

    S_J = {Φ_J w | w ∈ R^{l_J}},
    S_M = {Φ_M w | w ∈ R^{l_M}},

where Φ_J and Φ_M are matrices whose rows are φ_J(x)^T and φ_M(x)^T, respectively. We make the following standard independence assumption on the features.

Assumption 3. The matrix Φ_J has rank l_J and the matrix Φ_M has rank l_M.

As outlined earlier, our goal is to estimate w_J and w_M from simulated trajectories of the MDP. Thus, it is constructive to consider projections onto S_J and S_M with respect to a norm that is weighted according to the state occupancy in these trajectories.

For a trajectory x_0,...,x_{τ−1}, where x_0 is drawn from a fixed distribution ζ_0(x), and the states evolve according to the MDP with policy π, define the state occupancy probabilities

    q_t(x) = P(x_t = x),  x ∈ X,  t = 0,1,...

and let

    q(x) = Σ_{t=0}^∞ q_t(x),  x ∈ X,
    Q ≜ diag(q).

We make the following assumption on the policy π and initial distribution ζ_0.

Assumption 4. Each state has a positive probability of being visited, namely, q(x) > 0 for all x ∈ X.

For vectors in R^n, we introduce the weighted Euclidean norm

    ‖y‖_q = ( Σ_{i=1}^n q(i) (y(i))² )^{1/2},  y ∈ R^n,

and we denote by Π_J and Π_M the projections from R^n onto the subspaces S_J and S_M, respectively, with respect to this norm. For z ∈ R^{2n} we denote by Π the projection of z_J onto S_J and of z_M onto S_M, namely³

    Π = ( Π_J   0
          0     Π_M ).     (3)

³ The projection operators Π_J and Π_M are linear, and may be written explicitly as Π_J = Φ_J (Φ_J^T Q Φ_J)^{−1} Φ_J^T Q, and similarly for Π_M.

We are now ready to fully describe our approximation scheme. We consider the projected fixed point equation

    z = ΠTz,     (4)

and, letting z* denote its solution, propose the approximate value function J̃ = z*_J ∈ S_J and second moment function M̃ = z*_M ∈ S_M.

We proceed to derive some properties of the projected fixed point equation (4). We begin by stating a well known result regarding the contraction properties of the projected Bellman operator Π_J T_J, where T_J y = r + Py. A proof can be found in (Bertsekas, 2012), Proposition 7.1.1.

Lemma 5. Let Assumptions 1, 3, and 4 hold. Then, there exists some norm ‖·‖_J and some β_J < 1 such that

    ‖Π_J P y‖_J ≤ β_J ‖y‖_J,  ∀y ∈ R^n.

Similarly, there exists some norm ‖·‖_M and some β_M < 1 such that

    ‖Π_M P y‖_M ≤ β_M ‖y‖_M,  ∀y ∈ R^n.
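Although the estimation algorithms of Section 4 are simulation based, for a small problem where P is known the projected fixed point (4) can be computed directly from its definition, which is convenient for sanity-checking an implementation. The sketch below does this using the explicit projection expression from footnote 3; the random SSP, the random features, and the uniform initial distribution used to form the occupancy weights q are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8                                              # illustrative small state space

# Random proper SSP (an assumption): each row of P is substochastic, and the
# missing mass is the probability of moving to the terminal state x*.
P = rng.dirichlet(np.ones(n + 1), size=n)[:, :n]
r = rng.normal(size=n)
R = np.diag(r)

# Occupancy weights q(x) = sum_t P(x_t = x) for an assumed uniform zeta_0,
# and the explicit projection of footnote 3: Pi_S = Phi (Phi^T Q Phi)^{-1} Phi^T Q.
zeta0 = np.ones(n) / n
q = np.linalg.solve((np.eye(n) - P).T, zeta0)      # q^T = zeta0^T (I - P)^{-1}
Q = np.diag(q)
def proj(Phi):
    return Phi @ np.linalg.solve(Phi.T @ Q @ Phi, Phi.T @ Q)

# Arbitrary full-rank features for J and M (l_J = 3, l_M = 4 here).
Phi_J = rng.normal(size=(n, 3))
Phi_M = rng.normal(size=(n, 4))
Pi = np.block([[proj(Phi_J), np.zeros((n, n))],
               [np.zeros((n, n)), proj(Phi_M)]])

# T is affine: T z = c + Pbar z with c = (r, Rr) and Pbar = [[P, 0], [2RP, P]],
# so the projected fixed point z* = Pi T z* solves (I - Pi Pbar) z* = Pi c.
Pbar = np.block([[P, np.zeros((n, n))], [2 * R @ P, P]])
c = np.concatenate([r, r * r])
z_star = np.linalg.solve(np.eye(2 * n) - Pi @ Pbar, Pi @ c)
J_tilde, M_tilde = z_star[:n], z_star[n:]

# Compare with the exact J and M of Proposition 2.
J = np.linalg.solve(np.eye(n) - P, r)
M = np.linalg.solve(np.eye(n) - P, r * r + 2 * r * (P @ J))
print("max |J - J~|:", np.abs(J - J_tilde).max())
print("max |M - M~|:", np.abs(M - M_tilde).max())
print("approximate variance:", M_tilde - J_tilde**2)
```

Since ΠT is affine with linear part ΠP̄, its fixed point solves the linear system above, and the contraction result that follows (Lemma 7) guarantees that this system is nonsingular.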
Next, we define a weighted norm on R^{2n}.

Definition 6. For a vector z ∈ R^{2n} and a scalar 0 < α < 1, the α-weighted norm is

    ‖z‖_α = α‖z_J‖_J + (1−α)‖z_M‖_M,     (5)

where the norms ‖·‖_J and ‖·‖_M are defined in Lemma 5.

Our main result of this section is given in the following lemma, where we show that the projected operator ΠT is a contraction with respect to the α-weighted norm.

Lemma 7. Let Assumptions 1, 3, and 4 hold. Then, there exists some 0 < α < 1 and some β < 1 such that ΠT is a β-contraction with respect to the α-weighted norm, i.e.,

    ‖ΠTz‖_α ≤ β‖z‖_α,  ∀z ∈ R^{2n}.

Proof. Let P̄ denote the following matrix in R^{2n×2n}

    P̄ = ( P     0
          2RP   P ),

and let z ∈ R^{2n}. We need to show that

    ‖ΠP̄z‖_α ≤ β‖z‖_α.

From (3) we have

    ΠP̄ = ( Π_J P      0
           2Π_M RP    Π_M P ).

Therefore, we have

    ‖ΠP̄z‖_α = α‖Π_J P z_J‖_J + (1−α)‖2Π_M RP z_J + Π_M P z_M‖_M
            ≤ α‖Π_J P z_J‖_J + (1−α)‖Π_M P z_M‖_M + (1−α)‖2Π_M RP z_J‖_M     (6)
            ≤ αβ_J‖z_J‖_J + (1−α)β_M‖z_M‖_M + (1−α)‖2Π_M RP z_J‖_M,

where the equality is by definition of the α-weighted norm (5), the first inequality is from the triangle inequality, and the second inequality is by Lemma 5.

Now, we claim that there exists some finite C such that

    ‖2Π_M RP y‖_M ≤ C‖y‖_J,  ∀y ∈ R^n.     (7)

To see this, note that since R^n is a finite dimensional real vector space, all vector norms are equivalent (Horn & Johnson, 1985), therefore there exist finite C_1 and C_2 such that for all y ∈ R^n

    C_1‖2Π_M RP y‖_2 ≤ ‖2Π_M RP y‖_M ≤ C_2‖2Π_M RP y‖_2,

where ‖·‖_2 denotes the Euclidean norm. Let λ denote the spectral norm of the matrix 2Π_M RP, which is finite since all the matrix elements are finite. We have

    ‖2Π_M RP y‖_2 ≤ λ‖y‖_2,  ∀y ∈ R^n.

Using again the fact that all vector norms are equivalent, there exists a finite C_3 such that

    ‖y‖_2 ≤ C_3‖y‖_J,  ∀y ∈ R^n.

Setting C = C_2 λ C_3 we get the desired bound. Let β̃ = max{β_J, β_M} < 1, and choose ε > 0 such that

    β̃ + ε < 1.

Now, choose α such that

    α = C / (ε + C).

We have that

    (1−α)C = αε,

and plugging in (7)

    (1−α)‖2Π_M RP y‖_M ≤ αε‖y‖_J.

Plugging in (6) we have

    αβ_J‖z_J‖_J + (1−α)β_M‖z_M‖_M + (1−α)‖2Π_M RP z_J‖_M
        ≤ αβ_J‖z_J‖_J + (1−α)β_M‖z_M‖_M + αε‖z_J‖_J
        ≤ (β̃ + ε)(α‖z_J‖_J + (1−α)‖z_M‖_M),

and therefore

    ‖ΠP̄z‖_α ≤ (β̃ + ε)‖z‖_α.

Finally, choose β = β̃ + ε.

Lemma 7 guarantees that the projected operator ΠT has a unique fixed point. Let us denote this fixed point by z*, and let w*_J, w*_M denote the corresponding weights, which are unique due to Assumption 3:

    ΠTz* = z*,
    z*_J = Φ_J w*_J,     (8)
    z*_M = Φ_M w*_M.

In the next lemma we provide a bound on the approximation error. The proof is in Appendix B.

Lemma 8. Let Assumptions 1, 3, and 4 hold. Denote by z_true ∈ R^{2n} the true value and second moment functions, i.e., z_true satisfies z_true = T z_true.
Then,

    ‖z_true − z*‖_α ≤ (1/(1−β)) ‖z_true − Πz_true‖_α,

with α and β defined in Lemma 7.

4. Simulation Based Estimation Algorithms

We now use the theoretical results of the previous subsection to derive simulation based algorithms for jointly estimating the value function and second moment. The projected equation (8) is linear, and can be written in matrix form as follows. First let us write the equation explicitly as

    Π_J(r + PΦ_J w*_J) = Φ_J w*_J,     (9)
    Π_M(Rr + 2RPΦ_J w*_J + PΦ_M w*_M) = Φ_M w*_M.

Projecting a vector y onto Φw satisfies the following orthogonality condition

    Φ^T Q(y − Φw) = 0,

therefore we have

    Φ_J^T Q(Φ_J w*_J − (r + PΦ_J w*_J)) = 0,
    Φ_M^T Q(Φ_M w*_M − (Rr + 2RPΦ_J w*_J + PΦ_M w*_M)) = 0,

which can be written as

    A w*_J = b,     (10)
    C w*_M = d,

with

    A = Φ_J^T Q(I − P)Φ_J,   b = Φ_J^T Q r,
    C = Φ_M^T Q(I − P)Φ_M,   d = Φ_M^T Q R (r + 2PΦ_J A^{−1} b),     (11)

and the matrices A and C are invertible since Lemma 7 guarantees a unique solution to (8) and Assumption 3 guarantees the unique weights of its projection.

4.1. A Least Squares TD Algorithm

Our first simulation based algorithm is an extension of the Least Squares Temporal Difference (LSTD) algorithm (Boyan, 2002). We simulate N trajectories of the MDP with the policy π and initial state distribution ζ_0. Let x_0^k, x_1^k, ..., x_{τ^k−1}^k and τ^k, where k = 1,...,N, denote the state sequence and visit times to the terminal state within these trajectories, respectively. We now use these trajectories to form the following estimates of the terms in (11):

    A_N = E_N[ Σ_{t=0}^{τ−1} φ_J(x_t)(φ_J(x_t) − φ_J(x_{t+1}))^T ],
    b_N = E_N[ Σ_{t=0}^{τ−1} φ_J(x_t) r(x_t) ],
    C_N = E_N[ Σ_{t=0}^{τ−1} φ_M(x_t)(φ_M(x_t) − φ_M(x_{t+1}))^T ],
    d_N = E_N[ Σ_{t=0}^{τ−1} φ_M(x_t) r(x_t)(r(x_t) + 2φ_J(x_{t+1})^T A_N^{−1} b_N) ],     (12)

where E_N denotes an empirical average over trajectories, i.e., E_N[f(x,τ)] = (1/N) Σ_{k=1}^N f(x^k, τ^k). The LSTD approximation is given by

    ŵ*_J = A_N^{−1} b_N,
    ŵ*_M = C_N^{−1} d_N.

The next theorem shows that the LSTD approximation converges.

Theorem 9. Let Assumptions 1, 3, and 4 hold. Then ŵ*_J → w*_J and ŵ*_M → w*_M as N → ∞ with probability 1.

The proof involves a straightforward application of the law of large numbers and is described in Appendix C.

4.2. An Online TD(0) Algorithm

Our second estimation algorithm is an extension of the well known TD(0) algorithm (Sutton & Barto, 1998). Again, we simulate trajectories of the MDP corresponding to the policy π and initial state distribution ζ_0, and we iteratively update our estimates at every visit to the terminal state⁴. For some 0 ≤ t < τ^k and weights w_J, w_M, we introduce the TD terms

    δ_J^k(t, w_J, w_M) = r(x_t^k) + (φ_J(x_{t+1}^k)^T − φ_J(x_t^k)^T) w_J,
    δ_M^k(t, w_J, w_M) = r²(x_t^k) + 2r(x_t^k) φ_J(x_{t+1}^k)^T w_J + (φ_M(x_{t+1}^k)^T − φ_M(x_t^k)^T) w_M.

Note that δ_J^k is the standard TD error (Sutton & Barto, 1998). The TD(0) update is given by

    ŵ_{J;k+1} = ŵ_{J;k} + ξ_k Σ_{t=0}^{τ^k−1} φ_J(x_t^k) δ_J^k(t, ŵ_{J;k}, ŵ_{M;k}),
    ŵ_{M;k+1} = ŵ_{M;k} + ξ_k Σ_{t=0}^{τ^k−1} φ_M(x_t^k) δ_M^k(t, ŵ_{J;k}, ŵ_{M;k}),

where {ξ_k} are positive step sizes.

⁴ An extension to an algorithm that updates at every state transition is also possible, but we do not pursue this here.
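As a concrete illustration of both estimators, the following sketch forms the LSTD estimates of (12) and runs the episodic TD(0) update above on simulated trajectories. The MDP, the random features, the uniform initial distribution, and the step-size schedule ξ_k = 1/(10 + k) are illustrative assumptions; the features of the terminal state are taken to be zero, which is the convention implicit in the matrix form (11).

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative SSP, features and initial distribution (assumptions, not from the paper).
n = 8
P = rng.dirichlet(np.ones(n + 1), size=n)[:, :n]   # substochastic rows; missing mass -> terminal
r = rng.normal(size=n)
Phi_J = rng.normal(size=(n, 3))                    # l_J = 3
Phi_M = rng.normal(size=(n, 4))                    # l_M = 4
zeta0 = np.ones(n) / n

def phi(Phi, x):
    # Features of a state; the terminal state (None) gets zero features.
    return np.zeros(Phi.shape[1]) if x is None else Phi[x]

def simulate():
    """One trajectory x_0, ..., x_{tau-1} under the fixed policy."""
    traj, x = [], rng.choice(n, p=zeta0)
    while x is not None:
        traj.append(x)
        nxt = rng.choice(n + 1, p=np.append(P[x], 1.0 - P[x].sum()))
        x = nxt if nxt < n else None
    return traj

# ----- LSTD: empirical averages of (12), then w_J = A^-1 b and w_M = C^-1 d -----
N = 5000
A = np.zeros((3, 3)); b = np.zeros(3)
C = np.zeros((4, 4)); d_r = np.zeros(4); d_cross = np.zeros((4, 3))
for _ in range(N):
    traj = simulate()
    for t, x in enumerate(traj):
        x_next = traj[t + 1] if t + 1 < len(traj) else None
        A += np.outer(Phi_J[x], Phi_J[x] - phi(Phi_J, x_next))
        b += Phi_J[x] * r[x]
        C += np.outer(Phi_M[x], Phi_M[x] - phi(Phi_M, x_next))
        d_r += Phi_M[x] * r[x] ** 2
        d_cross += 2 * r[x] * np.outer(Phi_M[x], phi(Phi_J, x_next))
A /= N; b /= N; C /= N; d_r /= N; d_cross /= N
w_J = np.linalg.solve(A, b)
w_M = np.linalg.solve(C, d_r + d_cross @ w_J)      # d_N of (12) with A_N^-1 b_N plugged in

# ----- TD(0): episodic updates with step sizes xi_k = 1/(10 + k) (an assumed schedule) -----
wJ_td, wM_td = np.zeros(3), np.zeros(4)
for k in range(5000):
    traj, xi = simulate(), 1.0 / (10 + k)
    gJ, gM = np.zeros(3), np.zeros(4)
    for t, x in enumerate(traj):
        x_next = traj[t + 1] if t + 1 < len(traj) else None
        delta_J = r[x] + (phi(Phi_J, x_next) - Phi_J[x]) @ wJ_td
        delta_M = (r[x] ** 2 + 2 * r[x] * phi(Phi_J, x_next) @ wJ_td
                   + (phi(Phi_M, x_next) - Phi_M[x]) @ wM_td)
        gJ += Phi_J[x] * delta_J
        gM += Phi_M[x] * delta_M
    wJ_td += xi * gJ
    wM_td += xi * gM

print("LSTD  w_J:", w_J, "\nTD(0) w_J:", wJ_td)
print("LSTD  w_M:", w_M, "\nTD(0) w_M:", wM_td)
```

With enough trajectories the two sets of weights should roughly agree, since both converge to the solution of (10) (Theorems 9 and 10 below).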
The next theorem shows that the TD(0) algorithm converges.

Theorem 10. Let Assumptions 1, 3, and 4 hold, and let the step sizes satisfy

    Σ_{k=0}^∞ ξ_k = ∞,   Σ_{k=0}^∞ ξ_k² < ∞.

Then ŵ_{J;k} → w*_J and ŵ_{M;k} → w*_M as k → ∞ with probability 1.

The proof, provided in Appendix D, is based on representing the TD(0) algorithm as a stochastic approximation and using contraction properties similar to the ones of the previous section to prove convergence.

4.3. Multistep Algorithms

A common method in value function approximation is to replace the single step mapping T_J with a multistep version of the form

    T_J^{(λ)} = (1−λ) Σ_{l=0}^∞ λ^l T_J^{l+1}

with 0 < λ < 1. The projected equation (9) then becomes

    Π_J T_J^{(λ)} (Φ_J w_J^{*(λ)}) = Φ_J w_J^{*(λ)}.

Similarly, we may write a multistep equation for M

    Π_M T_{M*}^{(λ)} (Φ_M w_M^{*(λ)}) = Φ_M w_M^{*(λ)},     (13)

where

    T_{M*}^{(λ)} = (1−λ) Σ_{l=0}^∞ λ^l T_{M*}^{l+1},

and

    T_{M*}(y) = Rr + 2RPΦ_J w_J^{*(λ)} + Py.

Note the difference between T_{M*} and T_M defined earlier; we are no longer working on the joint space of J and M but instead we have an independent equation for approximating J, and its solution w_J^{*(λ)} is part of equation (13) for approximating M. By Proposition 7.1.1 of (Bertsekas, 2012) both Π_J T_J^{(λ)} and Π_M T_{M*}^{(λ)} are contractions with respect to the weighted norm ‖·‖_q, therefore both multistep projected equations admit a unique solution. In a similar manner to the single step version, the projected equations may be written in matrix form

    A^{(λ)} w_J^{*(λ)} = b^{(λ)},     (14)
    C^{(λ)} w_M^{*(λ)} = d^{(λ)},

where

    A^{(λ)} = Φ_J^T Q (I − P^{(λ)}) Φ_J,   b^{(λ)} = Φ_J^T Q (I − λP)^{−1} r,
    C^{(λ)} = Φ_M^T Q (I − P^{(λ)}) Φ_M,
    d^{(λ)} = Φ_M^T Q (I − λP)^{−1} R (r + 2PΦ_J w_J^{*(λ)}),

and

    P^{(λ)} = (1−λ) Σ_{l=0}^∞ λ^l P^{l+1}.

Simulation based estimates A_N^{(λ)} and b_N^{(λ)} of the expressions above may be obtained by the use of eligibility traces, as described in (Bertsekas, 2012), and the LSTD(λ) approximation is then given by ŵ_J^{*(λ)} = (A_N^{(λ)})^{−1} b_N^{(λ)}. By substituting w_J^{*(λ)} with ŵ_J^{*(λ)} in the expression for d^{(λ)}, a similar procedure may be used to derive estimates C_N^{(λ)} and d_N^{(λ)}, and to obtain the LSTD(λ) approximation ŵ_M^{*(λ)} = (C_N^{(λ)})^{−1} d_N^{(λ)}. Due to the similarity to the LSTD procedure in (12), the exact details are omitted.
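For completeness, one way to realize these trace-based estimates (our reading of the eligibility-trace construction, which the text leaves to Bertsekas, 2012) is to maintain, within each trajectory, traces z_t^J = Σ_{s≤t} λ^{t−s} φ_J(x_s) and z_t^M defined analogously, and accumulate A_N^{(λ)} ≈ E_N[Σ_t z_t^J (φ_J(x_t) − φ_J(x_{t+1}))^T], b_N^{(λ)} ≈ E_N[Σ_t z_t^J r(x_t)], and the corresponding φ_M-trace sums for C_N^{(λ)} and d_N^{(λ)}. The sketch below follows this reading; the MDP and features are again illustrative assumptions, and terminal-state features are taken as zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 8, 0.9                                     # illustrative problem and lambda
P = rng.dirichlet(np.ones(n + 1), size=n)[:, :n]    # substochastic; rest -> terminal
r = rng.normal(size=n)
Phi_J = rng.normal(size=(n, 3))
Phi_M = rng.normal(size=(n, 4))
zeta0 = np.ones(n) / n

def phi(Phi, x):                                    # terminal state -> zero features
    return np.zeros(Phi.shape[1]) if x is None else Phi[x]

N = 5000
A = np.zeros((3, 3)); b = np.zeros(3)
C = np.zeros((4, 4)); d_r = np.zeros(4); d_cross = np.zeros((4, 3))
for _ in range(N):
    # simulate one trajectory
    traj, x = [], rng.choice(n, p=zeta0)
    while x is not None:
        traj.append(x)
        nxt = rng.choice(n + 1, p=np.append(P[x], 1.0 - P[x].sum()))
        x = nxt if nxt < n else None
    # accumulate eligibility-trace statistics along the trajectory
    zJ, zM = np.zeros(3), np.zeros(4)
    for t, x in enumerate(traj):
        x_next = traj[t + 1] if t + 1 < len(traj) else None
        zJ = lam * zJ + Phi_J[x]                    # z_t = lambda * z_{t-1} + phi(x_t)
        zM = lam * zM + Phi_M[x]
        A += np.outer(zJ, Phi_J[x] - phi(Phi_J, x_next))
        b += zJ * r[x]
        C += np.outer(zM, Phi_M[x] - phi(Phi_M, x_next))
        d_r += zM * r[x] ** 2
        d_cross += 2 * r[x] * np.outer(zM, phi(Phi_J, x_next))
A /= N; b /= N; C /= N; d_r /= N; d_cross /= N

w_J_lam = np.linalg.solve(A, b)                     # LSTD(lambda) estimate of w_J^{*(lambda)}
w_M_lam = np.linalg.solve(C, d_r + d_cross @ w_J_lam)

# Model-based check against (14), using P_lam = (1 - lam) P (I - lam P)^{-1}.
Inv = np.linalg.inv(np.eye(n) - lam * P)
Q = np.diag(np.linalg.solve((np.eye(n) - P).T, zeta0))
A_exact = Phi_J.T @ Q @ (np.eye(n) - (1 - lam) * P @ Inv) @ Phi_J
b_exact = Phi_J.T @ Q @ Inv @ r
print("w_J^(lam) sampled:", w_J_lam, " exact:", np.linalg.solve(A_exact, b_exact))
```

The model-based check uses the identity P^{(λ)} = (1−λ)P(I−λP)^{−1}, which follows directly from the series definition of P^{(λ)}.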
5. Positive Variance as a Constraint in LSTD

The TD algorithms of the preceding section approximated J and M by the solution to the fixed point equation (8). While Lemma 8 provides us a bound on the approximation error of J̃ and M̃ measured in the α-weighted norm, it does not guarantee that the approximated variance Ṽ, given by M̃ − J̃², is positive for all states. If we are estimating M as a means to infer V, it may be useful to include our prior knowledge that V ≥ 0 in the estimation process. In this section we propose to enforce this knowledge as a constraint in the projected fixed point equation.

The multistep equation for the second moment weights (13) may be written with the projection operator as an explicit minimization

    w_M^{*(λ)} = argmin_w ‖Φ_M w − (r̃ + Φ̃ w_M^{*(λ)})‖_q,

with

    Φ̃ = P^{(λ)} Φ_M,

and

    r̃ = (I − λP)^{−1} (Rr + 2RPΦ_J w_J^{*(λ)}).

Requiring non-negative variance in some state x may be written as a linear constraint in w_M^{*(λ)}:

    φ_M(x)^T w_M^{*(λ)} − (φ_J(x)^T w_J^{*(λ)})² ≥ 0.

Let {x_1,...,x_l} denote a set of states in which we demand that the variance be non-negative. Let H ∈ R^{l×l_M} denote a matrix with the features −φ_M(x_i)^T as its rows, and let g ∈ R^l denote a vector with elements −(φ_J(x_i)^T w_J^{*(λ)})². We can write the variance-constrained projected equation for the second moment as

    w_M^{vc} = argmin_w ‖Φ_M w − (r̃ + Φ̃ w_M^{vc})‖_q
               s.t. Hw ≤ g.     (15)

The following assumption guarantees that the constraints in (15) admit a feasible solution.

Assumption 11. There exists w such that Hw < g.

Note that a simple way to satisfy Assumption 11 is to have some feature vector that is positive for all states.

Equation (15) is a form of projected equation studied in (Bertsekas, 2011), the solution of which may be obtained by the following iterative procedure

    w_{k+1} = Π_{Ξ,Ŵ_M}[w_k − γΞ^{−1}(C^{(λ)} w_k − d^{(λ)})],     (16)

where Ξ is some positive definite matrix, and Π_{Ξ,Ŵ_M} denotes a projection onto the convex set Ŵ_M = {w | Hw ≤ g} with respect to the Ξ-weighted Euclidean norm. The following lemma, which is based on a convergence result of (Bertsekas, 2011), guarantees that algorithm (16) converges.

Lemma 12. Assume λ > 0. Then there exists γ̄ > 0 such that ∀γ ∈ (0, γ̄) the algorithm (16) converges at a linear rate to w_M^{vc}.

Proof. This is a direct application of the convergence result in (Bertsekas, 2011). The only nontrivial assumption that needs to be verified is that T_{M*}^{(λ)} is a contraction in the ‖·‖_q norm (Proposition 1 in Bertsekas, 2011). For λ > 0, Proposition 7.1.1 of (Bertsekas, 2012) guarantees that T_{M*}^{(λ)} is indeed contracting in the ‖·‖_q norm.

We illustrate the effect of the positive variance constraint in a simple example. Consider the Markov chain depicted in Figure 1, which consists of N states with reward −1 and a terminal state x* with zero reward.
The transitions from each state are either to a subsequent state (with probability p) or to a preceding state (with probability 1−p), with the exception of the first state, which transitions to itself instead. We chose to approximate J and M with polynomials of degree 1 and 2, respectively. For such a small problem the fixed point equation (14) may be solved exactly, yielding the approximation depicted in Figure 2 (dotted line), for p = 0.7, N = 30, and λ = 0.95. Note that the variance is negative for the last two states. Using algorithm (16) we obtained a positive variance constrained approximation, which is depicted in Figure 2 (dashed line). Note that the variance is now positive for all states (as was required by the constraints).

Figure 1. A Markov chain

Figure 2. Value, second moment and variance approximation. (Three panels over states x = 0,...,30: the value function approximation J, the second moment approximation comparing the unconstrained and constrained M approximations, and the variance approximation comparing the unconstrained and constrained V approximations.)
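The chain example above is small enough to reproduce directly. The sketch below builds the N = 30 chain with p = 0.7, forms the multistep matrices of (14) for λ = 0.95 with degree-1 and degree-2 polynomial features (in a state variable normalized to (0, 1], which spans the same polynomial space but is better conditioned), and then computes a variance-constrained solution. Two caveats: the initial distribution defining the weights q is not stated in the text, so a uniform ζ_0 is assumed here and the numbers will not match Figure 2 exactly; and instead of the scaled iteration (16), the constrained solution of (15) is obtained by repeatedly applying the q-weighted constrained projection, each step being a small quadratic program solved with SciPy's SLSQP. This swapped-in iteration has the same fixed point, and it converges because the constrained projection is nonexpansive in ‖·‖_q while T_{M*}^{(λ)} is a ‖·‖_q-contraction.

```python
import numpy as np
from scipy.optimize import minimize

# The chain of Figure 1: states 0..N-1, reward -1, success probability p toward
# the terminal state; the first state moves to itself on failure.
N, p, lam = 30, 0.7, 0.95
P = np.zeros((N, N))
for x in range(N):
    P[x, max(x - 1, 0)] += 1 - p          # preceding state (state 0: itself)
    if x + 1 < N:
        P[x, x + 1] = p                   # subsequent state; from state N-1 -> terminal
r = -np.ones(N)
R = np.diag(r)

# Degree-1 / degree-2 polynomial features in the normalized state u in (0, 1].
u = (np.arange(N) + 1.0) / N
Phi_J = np.stack([np.ones(N), u], axis=1)
Phi_M = np.stack([np.ones(N), u, u**2], axis=1)

# Occupancy weights; the initial distribution is not given in the text, so a
# uniform zeta_0 is assumed here.
zeta0 = np.ones(N) / N
q = np.linalg.solve((np.eye(N) - P).T, zeta0)
Q = np.diag(q)

# Multistep matrices of (14), using P_lam = (1 - lam) P (I - lam P)^{-1}.
Inv = np.linalg.inv(np.eye(N) - lam * P)
P_lam = (1 - lam) * P @ Inv
A = Phi_J.T @ Q @ (np.eye(N) - P_lam) @ Phi_J
b = Phi_J.T @ Q @ Inv @ r
w_J = np.linalg.solve(A, b)
C = Phi_M.T @ Q @ (np.eye(N) - P_lam) @ Phi_M
d = Phi_M.T @ Q @ Inv @ R @ (r + 2 * P @ Phi_J @ w_J)
w_M = np.linalg.solve(C, d)                       # unconstrained solution of (14)
J_hat = Phi_J @ w_J
print("min unconstrained variance:", (Phi_M @ w_M - J_hat**2).min())
# (the text reports that this approximate variance is negative for the last two states)

# Constrained solution of (15): iterate the q-weighted projection onto
# {Phi_M w : Hw <= g}, i.e. onto approximations with non-negative variance
# at every state; each projection is a small QP, solved here with SLSQP.
H, g = -Phi_M, -J_hat**2
r_tilde = Inv @ (R @ r + 2 * R @ P @ Phi_J @ w_J)
Phi_tilde = P_lam @ Phi_M
def project_q(y, w0):
    obj = lambda w: (Phi_M @ w - y) @ (q * (Phi_M @ w - y))
    cons = [{"type": "ineq", "fun": lambda w: g - H @ w}]
    return minimize(obj, w0, method="SLSQP", constraints=cons).x
w_vc = project_q(r_tilde + Phi_tilde @ w_M, w_M)
for _ in range(500):
    w_vc = project_q(r_tilde + Phi_tilde @ w_vc, w_vc)
print("min constrained variance:  ", (Phi_M @ w_vc - J_hat**2).min())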
6. Experiments

In this section we present numerical simulations of policy evaluation on a challenging continuous maze domain. The goal of this presentation is twofold: first, we show that the variance function may be estimated successfully on a large domain using a reasonable amount of samples. Second, the intuitive maze domain highlights the information that may be gleaned from the variance function. We begin by describing the domain and then present our policy evaluation results.

The Pinball Domain (Konidaris & Barto, 2009) is a continuous 2-dimensional maze where a small ball needs to be maneuvered between obstacles to reach some target area, as depicted in Figure 3 (left). The ball is controlled by applying a constant force in one of the 4 directions at each time step, which causes acceleration in the respective direction. In addition, the ball's velocity is susceptible to additive Gaussian noise (zero mean, standard deviation 0.03) and friction (drag coefficient 0.995). The state of the ball is thus 4-dimensional (x, y, ẋ, ẏ), and the action set is discrete, with 4 available controls. The obstacles are sharply shaped and fully elastic, and collisions cause the ball to bounce. As noted in (Konidaris & Barto, 2009), the sharp obstacles and continuous dynamics make the pinball domain more challenging for RL than simple navigation tasks or typical benchmarks like Acrobot.

A Java implementation of the pinball domain used in (Konidaris & Barto, 2009) is available on-line⁵ and was used for our simulations as well, with the addition of noise to the velocity.

⁵ http://people.csail.mit.edu/gdk/software.html

Figure 3. The pinball domain

We obtained a near-optimal policy using SARSA (Sutton & Barto, 1998) with radial basis function features and a reward of −1 for all states until reaching the target. The value function for this policy is plotted in Figure 3, for states with zero velocity. As should be expected, the value is approximately a linear function of the distance to the target.

Using 3000 trajectories (starting from uniformly distributed random states in the maze) we estimated the value and second moment functions by the LSTD(λ) algorithm described above. We used uniform tile coding as features (50×50 non-overlapping tiles in x and y, no dependence on velocity) and set λ = 0.9. The resulting estimated standard deviation function is shown in Figure 4 (left). In comparison, the standard deviation function shown in Figure 4 (right) was estimated by the naive sample variance, and required 500 trajectories from each point - a total of 1,250,000 trajectories.

Figure 4. Standard Deviation of Reward To Go

Note that the variance function is clearly not a linear function of the distance to the target, and in some places not even monotone. Furthermore, we see that an area in the top part of the maze before the first turn is very risky, even more than the farthest point from the target. We stress that this information cannot be gleaned from inspecting the value function alone.

7. Conclusion

This work presented a novel framework for policy evaluation in RL with variance related performance criteria. We presented both formal guarantees and empirical evidence that this approach is useful in problems with a large state space.

A few issues are in need of further investigation. First, we note a possible extension to other risk measures such as the percentile criterion (Delage & Mannor, 2010). In a recent work, Morimura et al. (2012) derived Bellman equations for the distribution of the total return, and appropriate TD learning rules were proposed, albeit without function approximation and formal guarantees.

More importantly, at the moment it remains unclear how the variance function may be used for policy optimization. While a naive policy improvement step may be performed, its usefulness should be questioned, as it was shown to be problematic for the standard deviation adjusted reward (Sobel, 1982) and the variance constrained reward (Mannor & Tsitsiklis, 2011). In (Tamar et al., 2012), a policy gradient approach was proposed for handling variance related criteria, which may be extended to an actor-critic method by using the variance function presented here.

References

Bertsekas, D. P. Dynamic Programming and Optimal Control, Vol II. Athena Scientific, fourth edition, 2012.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, 1996.

Bertsekas, D. P. Temporal difference methods for general projected equations. IEEE Trans. Auto. Control, 56(9):2128-2139, 2011.

Borkar, V. S. Stochastic approximation: a dynamical systems viewpoint. Cambridge Univ Press, 2008.

Boyan, J. A. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2):233-246, 2002.

Delage, E. and Mannor, S. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203-213, 2010.

Horn, R. A. and Johnson, C. R. Matrix Analysis. Cambridge University Press, 1985.

Howard, R. A. and Matheson, J. E. Risk-sensitive Markov decision processes. Management Science, 18(7):356-369, 1972.

Konidaris, G. D. and Barto, A. G. Skill discovery in continuous reinforcement learning domains using skill chaining. In NIPS, 2009.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of LSTD. In ICML, 2010.

Luenberger, D. Investment Science. Oxford University Press, 1998.

Mannor, S. and Tsitsiklis, J. N. Mean-variance optimization in Markov decision processes. In ICML, 2011.

Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. Parametric return density estimation for reinforcement learning. arXiv preprint arXiv:1203.3497, 2012.

Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc., 1994.

Sharpe, W. F. Mutual fund performance. The Journal of Business, 39(1):119-138, 1966.

Sobel, M. J. The variance of discounted Markov decision processes. J. Applied Probability, pp. 794-802, 1982.

Sutton, R. S. and Barto, A. G. Reinforcement Learning. MIT Press, 1998.

Tamar, A., Di Castro, D., and Mannor, S. Policy gradients with variance related risk criteria. In ICML, 2012.

Tesauro, G. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.

Supplementary Material

A. Proof of Proposition 2

Proof. The equation for J(x) is well-known, and its proof is given here only for completeness. Choose x ∈ X. Then,

    J(x) = E[B | x_0 = x]
         = E[ Σ_{k=0}^{τ−1} r(x_k) | x_0 = x ]
         = r(x) + E[ Σ_{k=1}^{τ−1} r(x_k) | x_0 = x ]
         = r(x) + E[ E[ Σ_{k=1}^{τ−1} r(x_k) | x_0 = x, x_1 = y ] ]
         = r(x) + Σ_{y∈X} P(y|x) J(y),

where we excluded the terminal state from the sum since reaching it ends the trajectory. Similarly,

    M(x) = E[B² | x_0 = x]
         = E[ ( Σ_{k=0}^{τ−1} r(x_k) )² | x_0 = x ]
         = E[ ( r(x_0) + Σ_{k=1}^{τ−1} r(x_k) )² | x_0 = x ]
         = r(x)² + 2r(x) E[ Σ_{k=1}^{τ−1} r(x_k) | x_0 = x ] + E[ ( Σ_{k=1}^{τ−1} r(x_k) )² | x_0 = x ]
         = r(x)² + 2r(x) Σ_{y∈X} P(y|x) J(y) + Σ_{y∈X} P(y|x) M(y).

The uniqueness of the value function J for a proper policy is well known, c.f. Proposition 3.2.1 in (Bertsekas, 2012). The uniqueness of M follows by observing that in the equation for M, M may be seen as the value function of an MDP with the same transitions but with reward r(x)² + 2r(x) Σ_{y∈X} P(y|x) J(y). Since only the rewards change, the policy remains proper and Proposition 3.2.1 in (Bertsekas, 2012) applies.

B. Proof of Lemma 8
Proof. We have

    ‖z_true − z*‖_α ≤ ‖z_true − Πz_true‖_α + ‖Πz_true − z*‖_α
                    = ‖z_true − Πz_true‖_α + ‖ΠTz_true − ΠTz*‖_α
                    ≤ ‖z_true − Πz_true‖_α + β‖z_true − z*‖_α,

where the equality uses z_true = Tz_true and z* = ΠTz*, and the last inequality uses the contraction property of Lemma 7. Rearranging gives the stated result.
