Controlled Markov Processes with AVaR Criteria for Unbounded Costs

Kerem Uğurlu

Wednesday 18th November, 2015

Department of Mathematics, University of Southern California, Los Angeles, CA 90089
e-mail: [email protected]

Abstract

In this paper, we consider the infinite-horizon control problem on a Markov Decision Process (MDP) under the Average-Value-at-Risk (AVaR) criterion for possibly unbounded $L^1$ costs. With a suitable state aggregation and by choosing a priori a global variable $s$ heuristically, we show that there exist optimal policies for the infinite horizon problem.

Mathematics Subject Classification: 90C39, 93E20

Keywords: Markov Decision Problem, Average-Value-at-Risk, Optimal Control

1 Introduction

In classical models, the optimization problem has been solved with respect to expected performance criteria. Beginning with Bellman [6], risk-neutral performance evaluation has been carried out via dynamic programming techniques, and this methodology has seen an enormous development in both theory and practice since then. In practice, however, expected values are not always appropriate performance criteria. For this reason, risk-averse approaches, in particular those based on utility functions, have been used to assess the corresponding problems and their outcomes (see e.g. [8, 10]). With the seminal paper of Artzner et al. [2], which put risk-averse preferences into an axiomatic framework, the risk assessment of random outcomes gained new aspects: in [2], the concept of a risk measure is defined and its theoretical framework is established. We use this framework to model risk aversion. We replace the risk-neutral expectation operator with a risk-averse operator, study the optimal control of an infinite sum of cost functions, and characterize the optimal policy as stationary, as in the risk-neutral case, but in a state-aggregated setting.

The rest of the paper is organized as follows. In the remainder of this section, we give the preliminary theoretical framework. In Section 2, we derive the dynamic programming equations for the MDP under the AVaR criterion in the infinite time horizon. We conclude the paper with an application of our results to the classical LQ problem.

1.1 Controlled Markov Processes

We take the control model $M = \{M_n, n \in \mathbb{N}_0\}$, where for each $n \in \mathbb{N}_0$,

$M_n := (X_n, A_n, K_n, F_n, c_n)$   (1.1)

with the following components:

• $X_n$ and $A_n$ denote the state and action (or control) spaces, where $X_n$ takes values in a Borel set $X$ whereas $A_n$ takes values in a Borel set $A$.

• For each $x \in X_n$, let $A_n(x) \subset A$ be the set of all admissible controls in the state $x_n = x$. Then

$K_n := \{(x,a) : x \in X_n,\ a \in A_n(x)\}$   (1.2)

stands for the set of feasible state-action pairs at time $n$, where we assume that $K_n$ is a Borel subset of $X_n \times A_n$.

• We let $x_{n+1} = F_n(x_n, a_n, \xi_n)$ for all $n = 0,1,\ldots$, with $x_n \in X_n$ and $a_n \in A_n$ as described above, and with independent random disturbances $\xi_n \in S_n$ having probability distributions $\mu_n$, where the $S_n$ are Borel spaces.

• $c_n(x,a) : K_n \to \mathbb{R}$ stands for the deterministic cost function at stage $n \in \mathbb{N}_0$ with $(x,a) \in K_n$.
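As a minimal illustration of this control model, the following Python sketch simulates a single trajectory $x_{n+1} = F_n(x_n, a_n, \xi_n)$ under a (possibly history-dependent) policy and accumulates the stage costs. The linear dynamics, quadratic cost, Gaussian disturbance, and proportional decision rule used at the end are hypothetical placeholders chosen only for the example; they are not taken from the paper.

```python
import numpy as np

def simulate_trajectory(x0, policy, F, c, sample_xi, N, rng):
    """Simulate x_{n+1} = F_n(x_n, a_n, xi_n) under a (possibly history-dependent)
    policy and return the history h_N and the accumulated cost C^N = sum_n c_n(x_n, a_n)."""
    history = [x0]                       # h_n = (x_0, a_0, x_1, ..., a_{n-1}, x_n)
    x, total_cost = x0, 0.0
    for n in range(N + 1):
        a = policy(n, history)           # the action may use the whole history h_n
        total_cost += c(n, x, a)         # stage cost c_n(x_n, a_n)
        if n == N:
            break
        xi = sample_xi(n, rng)           # disturbance xi_n ~ mu_n
        x = F(n, x, a, xi)               # system function F_n
        history += [a, x]
    return history, total_cost

# Hypothetical ingredients: scalar linear dynamics, quadratic cost, a proportional rule.
rng = np.random.default_rng(0)
F = lambda n, x, a, xi: x + a + xi
c = lambda n, x, a: x**2 + a**2
sample_xi = lambda n, rng: rng.normal(0.0, 0.1)
policy = lambda n, h: -0.5 * h[-1]       # here the rule only looks at the current state
history, CN = simulate_trajectory(1.0, policy, F, c, sample_xi, N=20, rng=rng)
```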
The random variables $\{\xi_n\}_{n \ge 0}$ are defined on a common probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_n\}_{n \ge 0}, \mathbb{P})$, where $\mathbb{P}$ is the reference probability measure, each $\xi_n$ is measurable with respect to the sigma algebra $\mathcal{F}_n$, and $\mathcal{F} = \sigma(\cup_{n=0}^{\infty} \mathcal{F}_n)$. Based on the action $a_n \in A_n(x)$ chosen at time $n$, we assume that $A_n$ is $\mathcal{F}_n = \sigma(X_0, A_0, \ldots, X_n)$-measurable, i.e. the decision at time $n$ may depend on the entire history $h_n = (x_0, a_0, x_1, \ldots, a_{n-1}, x_n) \in H_n$ of the process up to time $n$, where the history spaces are defined recursively by

$H_0 := X, \qquad H_{n+1} := H_n \times A \times X.$   (1.3)

For each $n \in \mathbb{N}_0$, let $F_n$ be the family of measurable functions $f_n : H_n \to A$ such that

$f_n(x) \in A_n(x)$   (1.4)

for all $x \in X_n$. A sequence $\pi = \{f_n\}$ of functions $f_n \in F_n$, $n \in \mathbb{N}_0$, is called a policy. We denote by $\Pi$ the set of all policies. Then for each policy $\pi \in \Pi$ and initial state $x \in X$, a stochastic process $\{(x_n, a_n)\}$ and a probability measure $\mathbb{P}^{\pi}_x$ are defined on $(\Omega, \mathcal{F})$ in a canonical way, where $x_n$ and $a_n$ represent the state and the control at time $n \in \mathbb{N}_0$. The expectation operator with respect to $\mathbb{P}^{\pi}_x$ is denoted by $\mathbb{E}^{\pi}_x$. The distribution of $X_{n+1}$ is given by the transition kernel $Q$ from $X \times A$ to $X$ as follows:

$\mathbb{P}^{\pi}_x(X_{n+1} \in B \mid X_0, f_0(X_0), \ldots, X_n, f_n(X_0, A_0, \ldots, X_n)) = \mathbb{P}^{\pi}_x(X_{n+1} \in B \mid X_n, f_n(X_0, A_0, \ldots, X_n)) = Q(B \mid X_n, f_n(X_0, A_0, \ldots, X_n))$

for Borel measurable sets $B \subset X$. A Markov policy is of the form

$\mathbb{P}^{\pi}_x(X_{n+1} \in B \mid X_n, f_n(X_0, A_0, \ldots, X_n)) = Q(B \mid X_n, f_n(X_n)).$   (1.5)

That is to say, a Markov policy $\pi = \{f_n\}_{n \ge 0}$ depends only on the current state $X_n$. We denote the set of all Markovian policies by $\Pi^M$. Similarly, a stationary policy is of the form $\pi = \{f, f, \ldots\}$ with

$\mathbb{P}^{\pi}_x(X_{n+1} \in B \mid X_n, f_n(X_0, A_0, \ldots, X_n)) = Q(B \mid X_n, f(X_n)),$   (1.6)

i.e. we apply the same decision rule $f$ at each time episode $n$. Suppose we are given a policy $\sigma = \{f_n\}_{n=0}^{\infty}$; then by the Ionescu Tulcea theorem [7] there exists a unique probability measure $\mathbb{P}^{\sigma}$ on $(\Omega, \mathcal{F})$, which ensures the consistency of the infinite horizon problem considered. Hence, for every measurable set $B \subset X$ and all $h_n \in H_n$, $n \in \mathbb{N}_0$, we have

$\mathbb{P}^{\sigma}(x_1 \in B) = \mathbb{P}(B),$
$\mathbb{P}^{\sigma}(x_{n+1} \in B \mid h_n) = Q(B \mid x_n, \sigma_n(h_n)).$

We consider the cost function

$C^{\infty} := \sum_{n=0}^{\infty} c_n(x_n, a_n)$   (1.7)

for the infinite planning horizon and

$C^{N} := \sum_{n=0}^{N} c_n(x_n, a_n)$   (1.8)

for the finite planning horizon with terminal time $N \in \mathbb{N}_0$. We assume that the cost functions $\{c_n(x_n, a_n)\}_{n \ge 0}$ are non-negative and that $C^N$ and $C^{\infty}$ belong to the space $L^1(\Omega, \mathcal{F}, \mathbb{P})$.

We start from the following two well-studied optimization problems for controlled Markov processes. The first one is the finite horizon expected value problem, where we want to find a policy $\pi = \{f_n\}_{n=0}^{N}$ minimizing the expected cost

$\min_{\pi \in \Pi} \mathbb{E}^{\pi}_x\Big[\sum_{n=0}^{N} c_n(x_n, a_n)\Big],$

where $a_n = \pi_n(x_0, x_1, \ldots, x_n)$ and $c_n(x_n, a_n)$ is measurable for each $n = 0, \ldots, N$. The second problem is the infinite horizon expected value problem. The objective is to find a policy $\pi = \{f_n\}_{n=0}^{\infty}$ minimizing the expected cost

$\min_{\pi \in \Pi} \mathbb{E}^{\pi}_x\Big[\sum_{n=0}^{\infty} c_n(x_n, a_n)\Big].$

Under some assumptions, the first optimization problem admits a solution in the form of a Markov policy, whereas in the infinite horizon case the optimal policy is stationary. In both cases, the optimal policies can be found by solving the corresponding dynamic programming equations.
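For the risk-neutral finite horizon problem, these dynamic programming equations reduce to a backward induction. The sketch below carries this out for a hypothetical finite, time-homogeneous model (3 states, 2 actions, randomly generated kernel and costs); the restriction to finite spaces is purely for illustration, since the paper works on Borel state and action spaces.

```python
import numpy as np

def finite_horizon_dp(Q, c, N):
    """Backward induction for min_pi E[ sum_{n=0}^N c(x_n, a_n) ].
    Q: array (A, S, S), Q[a, x, y] = P(x_{n+1} = y | x_n = x, a_n = a).
    c: array (S, A), stage cost c(x, a) (time-homogeneous here).
    Returns the value functions V[n] and a Markov policy f[n]."""
    A, S, _ = Q.shape
    V = np.zeros((N + 2, S))            # V[N+1] = 0 is the terminal value
    f = np.zeros((N + 1, S), dtype=int)
    for n in range(N, -1, -1):
        # Qv[a, x] = c(x, a) + sum_y Q(y | x, a) * V[n+1](y)
        Qv = c.T + Q @ V[n + 1]
        f[n] = np.argmin(Qv, axis=0)    # optimal (Markov) decision rule at time n
        V[n] = np.min(Qv, axis=0)
    return V, f

# Hypothetical 3-state, 2-action example.
rng = np.random.default_rng(1)
Q = rng.dirichlet(np.ones(3), size=(2, 3))   # shape (2, 3, 3), rows sum to 1
c = rng.uniform(0.0, 1.0, size=(3, 2))
V, f = finite_horizon_dp(Q, c, N=10)
```

The recursion computes $V_n(x) = \min_a \{ c(x,a) + \sum_y Q(y \mid x, a) V_{n+1}(y) \}$ with $V_{N+1} \equiv 0$, and the minimizing actions at each stage form a Markov policy.

Our goal is to study the infinite horizon problem, where we use a risk-averse operator $\rho$ instead of the expectation operator and look for a stationary optimal policy under some conditions.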
1.2 Coherent risk measures on $L^1$

We introduce the risk-averse operators that we will be working with throughout the rest of the paper.

Definition 1.1. A function $\rho : L^1 \to \mathbb{R}$ is said to be a coherent risk measure if it satisfies the following axioms [2]:

• $\rho(\lambda X + (1-\lambda) Y) \le \lambda \rho(X) + (1-\lambda) \rho(Y)$ for all $\lambda \in (0,1)$, $X, Y \in L^1$ (convexity);

• if $X \le Y$ $\mathbb{P}$-a.s., then $\rho(X) \le \rho(Y)$, for all $X, Y \in L^1$ (monotonicity);

• $\rho(c + X) = c + \rho(X)$, for all $c \in \mathbb{R}$, $X \in L^1$ (translation invariance);

• $\rho(\beta X) = \beta \rho(X)$, for all $X \in L^1$, $\beta \ge 0$ (positive homogeneity).

The particular risk-averse operator that we will be working with is $\mathrm{AVaR}_{\alpha}(X)$.

Definition 1.2. Let $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$ be a real-valued random variable and let $\alpha \in (0,1)$.

• We define the Value-at-Risk of $X$ at level $\alpha$, $\mathrm{VaR}_{\alpha}(X)$, by

$\mathrm{VaR}_{\alpha}(X) = \inf\{x \in \mathbb{R} : \mathbb{P}(X \le x) \ge \alpha\}.$   (1.9)

• We define the coherent risk measure, the Average-Value-at-Risk of $X$ at level $\alpha$, denoted by $\mathrm{AVaR}_{\alpha}(X)$, as

$\mathrm{AVaR}_{\alpha}(X) = \frac{1}{1-\alpha} \int_{\alpha}^{1} \mathrm{VaR}_t(X)\, dt.$   (1.10)

We will also need the following alternative representation of $\mathrm{AVaR}_{\alpha}(X)$, as shown in [15].

Lemma 1.1. Let $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$ be a real-valued random variable and let $\alpha \in (0,1)$. Then it holds that

$\mathrm{AVaR}_{\alpha}(X) = \min_{s \in \mathbb{R}} \Big\{ s + \frac{1}{1-\alpha} \mathbb{E}[(X - s)^+] \Big\},$   (1.11)

where the minimum is attained at $s = \mathrm{VaR}_{\alpha}(X)$.

Remark 1.2. We note from the representation above that $\mathrm{AVaR}_{\alpha}(X)$ is real-valued for any $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$.

1.3 Time Consistency

Definition 1.3. Let $L^0(\mathcal{F}_n)$ be the vector space of all real-valued, $\mathcal{F}_n$-measurable random variables on the space $(\Omega, \mathcal{F}, \mathcal{F}_n, \mathbb{P})$ defined above. A one-step coherent dynamic risk measure on $L^0(\mathcal{F}_{n+1})$ is a sequence of mappings

$\rho_n : L^0(\mathcal{F}_{n+1}) \to L^0(\mathcal{F}_n), \qquad n = 0, \ldots, N-1,$   (1.12)

that satisfy the following:

• $\rho_n(\lambda X + (1-\lambda) Y) \le \lambda \rho_n(X) + (1-\lambda) \rho_n(Y)$ for all $\lambda \in (0,1)$, $X, Y \in L^0(\mathcal{F}_{n+1})$ (convexity);

• if $X \le Y$ $\mathbb{P}$-a.s., then $\rho_n(X) \le \rho_n(Y)$, for all $X, Y \in L^0(\mathcal{F}_{n+1})$;

• $\rho_n(c + X) = c + \rho_n(X)$, for all $c \in L^0(\mathcal{F}_n)$, $X \in L^0(\mathcal{F}_{n+1})$;

• $\rho_n(\beta X) = \beta \rho_n(X)$, for all $X \in L^0(\mathcal{F}_{n+1})$, $\beta \ge 0$.

Definition 1.4. A dynamic risk measure $(\rho_n)_{n=0}^{N-1}$ on $L^0(\mathcal{F}_N)$ is called time-consistent if for all $X, Y \in L^0(\mathcal{F}_N)$ and $n = 0, \ldots, N-1$, $\rho_{n+1}(X) \ge \rho_{n+1}(Y)$ implies $\rho_n(X) \ge \rho_n(Y)$.

Another way to define time consistency is from the point of view of optimal policies (see also [20]). Intuitively, the sequence of optimization problems is said to be dynamically consistent if the optimal strategies obtained when solving the original problem at time $t$ remain optimal for all subsequent problems. More precisely, if a policy $\pi$ is optimal on the time interval $[s,T]$, then it is also optimal on the sub-interval $[t,T]$ for every $t$ with $s \le t \le T$.

Remark 1.3. Given that the probability space is atomless, it is shown in [20] and [14] that the only law invariant coherent risk measures $\rho$ on $(\Omega, \mathcal{F}, \{\mathcal{F}_n\}_{n=0}^{N}, \mathbb{P})$, i.e. operators satisfying

$X \stackrel{d}{=} Y \Rightarrow \rho(X) = \rho(Y)$   (1.13)

and

$\rho(Z) = \rho\big(\rho|_{\mathcal{F}_1}(\cdots \rho|_{\mathcal{F}_{N-1}}(Z))\big)$   (1.14)

for all random variables $Z$, are $\operatorname{ess\,sup}(Z)$ and the expectation $\mathbb{E}(Z)$. This suggests that optimization problems with most of the coherent risk measures are not time consistent.
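The representation in Lemma 1.1 is what allows the control problem of the next section to be split into an outer scalar minimization over $s$ and an inner expected-value problem. As a quick sanity check (not part of the paper), the following Monte Carlo sketch compares the tail-average definition (1.10) with the minimization formula (1.11) on a hypothetical lognormal cost; the distribution, sample size, and optimizer used here are arbitrary choices made only for the illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def avar_tail_average(sample, alpha):
    """Empirical AVaR_alpha from definition (1.10): average of the
    worst (1 - alpha) fraction of the outcomes."""
    srt = np.sort(sample)
    k = int(np.ceil(alpha * len(srt)))
    return srt[k:].mean()

def avar_min_formula(sample, alpha):
    """Empirical AVaR_alpha via min_s { s + E[(X - s)^+] / (1 - alpha) },
    the representation of Lemma 1.1; the objective is convex in s."""
    obj = lambda s: s + np.mean(np.maximum(sample - s, 0.0)) / (1.0 - alpha)
    res = minimize_scalar(obj, bounds=(sample.min(), sample.max()), method="bounded")
    return res.fun

rng = np.random.default_rng(2)
X = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)   # hypothetical nonnegative cost
for alpha in (0.90, 0.95):
    print(alpha, avar_tail_average(X, alpha), avar_min_formula(X, alpha))
# The two estimates agree up to Monte Carlo and discretization error.
```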
2 Infinite Horizon Problem

We are interested in solving the following optimization problem over the infinite horizon:

$\min_{\pi \in \Pi} \mathrm{AVaR}^{\pi}_{\alpha}\Big(\sum_{n=0}^{\infty} c(x_n, a_n)\Big).$   (2.15)

First, we put the following assumptions on the problem.

Assumption 2.1. For every $n \in \mathbb{N}_0$, we impose the following assumptions on the problem:

1. The cost function $c_n : K_n \to \mathbb{R}$ is nonnegative and lower semicontinuous (l.s.c.), that is, if $(x_k, a_k) \to (x, a)$, then

$\liminf_{k \to \infty} c_n(x_k, a_k) \ge c_n(x, a),$   (2.16)

and inf-compact on $K_n$, i.e., for every $x \in X_n$ and $r \in \mathbb{R}$, the set

$\{a \in A_n(x) \mid c_n(x,a) \le r\}$   (2.17)

is compact.

2. The function $(x,a) \mapsto \int v(x', s - c)\, Q(dx' \mid x, a)$ is l.s.c. for all l.s.c. functions $v \ge 0$.

3. The set $A_n(x)$ is compact for every $x \in X_n$ and every $n \in \mathbb{N}_0$.

4. The system function $x_{n+1} = F_n(x_n, a_n, \xi_n)$ is measurable as a mapping $F_n : X_n \times A_n \times S_n \to X_{n+1}$, and $(x,a) \mapsto F_n(x, a, s)$ is continuous on $K_n$ for every $s \in S_n$.

5. The multifunction (also called point-to-set function) $x \mapsto A_n(x)$, from $X$ to $A$, is upper semicontinuous (u.s.c.), that is, if $\{x_k\} \subset X$ and $\{a_k\} \subset A$ are sequences such that

$x_k \to x^*, \quad a_k \in A_n(x_k) \text{ for all } k, \quad a_k \to a^*,$   (2.18)

then $a^*$ is in $A_n(x^*)$.

6. There exists a policy $\pi \in \Pi$ such that $V_0(x, \pi) < M$ for all $x \in X_0$.

To solve (2.15), we first rewrite the infinite horizon problem as follows:

$\inf_{\pi \in \Pi} \mathrm{AVaR}^{\pi}_{\alpha}(C^{\infty} \mid X_0 = x) = \inf_{\pi \in \Pi} \inf_{s \in \mathbb{R}} \Big\{ s + \frac{1}{1-\alpha} \mathbb{E}^{\pi}_x[(C^{\infty} - s)^+] \Big\}$
$= \inf_{s \in \mathbb{R}} \inf_{\pi \in \Pi} \Big\{ s + \frac{1}{1-\alpha} \mathbb{E}^{\pi}_x[(C^{\infty} - s)^+] \Big\}$
$= \inf_{s \in \mathbb{R}} \Big\{ s + \frac{1}{1-\alpha} \inf_{\pi \in \Pi} \mathbb{E}^{\pi}_x[(C^{\infty} - s)^+] \Big\}.$

Based on this representation, we investigate the inner optimization problem for a finite horizon $N$ as in [4]. Let $n = 0, 1, 2, \ldots, N$. We define

$w_{N\pi}(x, s) := \mathbb{E}^{\pi}_x[(C^N - s)^+], \qquad x \in X,\ s \in \mathbb{R},\ \pi \in \Pi,$   (2.19)
$w_N(x, s) := \inf_{\pi \in \Pi} w_{N\pi}(x, s), \qquad x \in X,\ s \in \mathbb{R}.$   (2.20)

We work with a Markov Decision Model with the 2-dimensional state space $\widetilde{X} := X \times \mathbb{R}$. The second component of the state $(x_n, s_n) \in \widetilde{X}$ carries the relevant information about the history of the process. We take that there is no running cost and assume that the terminal cost function is given by $V_{-1\pi}(x,s) := V_{-1}(x,s) := s^-$. We take decision rules $f_n : \widetilde{X} \to A$ such that $f_n(x,s) \in A_n(x)$ and denote by $\Pi^M$ the set of Markov policies $\pi = (f_0, f_1, \ldots)$, where the $f_n$ are decision rules. Here, by a Markov policy we mean that the decision at time $n$ depends only on the current state $x$ and on the global variable $s$, as will be seen in the proof below. For

$v \in \mathbb{M}(\widetilde{X})_+ := \{v : \widetilde{X} \to \mathbb{R}_+ : \text{measurable}\}$   (2.21)

we denote the operators

$Lv(x, s, a) := \int v(x', s - c)\, Q(dx' \mid x, a), \qquad (x,s) \in \widetilde{X},\ a \in A_n(x),$   (2.22)

and

$T_f v(x, s) := \int v(x', s - c)\, Q(dx' \mid x, f(x,s)), \qquad (x,s) \in \widetilde{X}.$

The minimal cost operator of the Markov Decision Model is given by

$Tv(x, s) = \inf_{a \in A_n(x)} Lv(x, s, a).$   (2.23)

For a policy $\pi = (f_0, f_1, f_2, \ldots) \in \Pi^M$, we denote by $\vec{\pi} = (f_1, f_2, \ldots)$ the shifted policy. We define for $\pi \in \Pi^M$ and $n = -1, 0, 1, \ldots, N$:

$V_{n+1,\pi} := T_{f_0} V_{n\pi},$
$V_{n+1} := \inf_{\pi} V_{n+1,\pi} = T V_n.$

A decision rule $f_n^*$ with the property that $V_n = T_{f_n^*} V_{n-1}$ is called the minimizer of $V_n$.
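Once the state space is finite and the augmented component $s$ is restricted to a grid, the backward recursion $V_{n+1} = T V_n$ with terminal value $V_{-1}(x,s) = s^-$ can be carried out numerically. The sketch below is an illustration under hypothetical model data, not an algorithm from the paper: it applies the operator $T$ on an $s$-grid, using linear interpolation (with clamping at the grid boundary) for the shift $s \mapsto s - c(x,a)$, and then performs the outer minimization over $s$ from the AVaR decomposition above.

```python
import numpy as np

def avar_value_iteration(Q, c, s_grid, N):
    """Backward recursion V_{n+1} = T V_n on the augmented space X x R,
    with V_{-1}(x, s) = max(-s, 0) = s^- and s discretized on s_grid.
    Q: (A, S, S) transition kernel, c: (S, A) nonnegative stage costs.
    Returns V_N on the grid, so that V_N[x, j] approximates w_N(x, s_grid[j])."""
    A, S, _ = Q.shape
    V = np.maximum(-s_grid, 0.0)[None, :] * np.ones((S, 1))    # V_{-1}(x, s) = s^-
    for _ in range(N + 1):                                      # N + 1 applications of T
        V_new = np.full((S, len(s_grid)), np.inf)
        for x in range(S):
            for a in range(A):
                # v(x', s - c(x, a)) by interpolation; values off the grid are clamped
                shifted = np.array([np.interp(s_grid - c[x, a], s_grid, V[y])
                                    for y in range(S)])         # shape (S, len(s_grid))
                LV = Q[a, x] @ shifted                          # integrate over x'
                V_new[x] = np.minimum(V_new[x], LV)             # min over actions: operator T
        V = V_new
    return V

# Hypothetical 3-state, 2-action model, followed by the outer minimization over s.
rng = np.random.default_rng(3)
Q = rng.dirichlet(np.ones(3), size=(2, 3))
c = rng.uniform(0.0, 1.0, size=(3, 2))
s_grid = np.linspace(-5.0, 50.0, 551)
alpha, x0, N = 0.95, 0, 25
V_N = avar_value_iteration(Q, c, s_grid, N)
outer = s_grid + V_N[x0] / (1.0 - alpha)        # s + w_N(x0, s) / (1 - alpha)
print("approximate optimal AVaR value:", outer.min(), "attained at s =", s_grid[outer.argmin()])
```

The grid resolution, the interpolation, and the particular kernel and costs are assumptions made for the example only; in the paper the recursion is carried out on general Borel spaces.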
We have Markovian policies $\Pi^M \subset \Pi$ in the following sense: given the global variable $s$, for every $\sigma = (f_0, f_1, \ldots) \in \Pi^M$ we find a policy $\pi = (g_0, g_1, \ldots) \in \Pi$ such that

$g_0(x_0) := f_0(x_0, s),$
$g_1(x_0, a_0, x_1) := f_1(x_1, s - c_0),$
$\vdots$

We remark here that a Markovian policy $\sigma = (f_0, f_1, \ldots) \in \Pi^M$ also depends on the history of the process, but not on the whole information. The necessary information at time $n$ from the history $h_n = (x_0, a_0, x_1, \ldots, a_{n-1}, x_n)$ consists of the state $x_n$ and the quantity $s_n := s_0 - c_0 - c_1 - \ldots - c_{n-1}$. This dependence on the past and the optimality of the Markovian policy are shown in the following theorem.

Theorem 2.2. [4] For a given policy $\sigma$, the only necessary information at time $n$ from the history $h_n = (x_0, a_0, x_1, \ldots, a_{n-1}, x_n)$ is the following:

• the state $x_n$;

• the value $s_n = s - c_0 - c_1 - \ldots - c_{n-1}$ for $n = 1, 2, \ldots, N$.

Moreover, it holds for $n = 0, 1, \ldots, N$ that

• $w_{n\sigma} = V_{n\sigma}$ for $\sigma \in \Pi^M$;

• $w_n = V_n$.

If there exist minimizers $f_n^*$ of $V_n$ at all stages, then the Markov policy $\sigma^* = (f_0^*, \ldots, f_N^*)$ is optimal for the problem

$\inf_{\pi \in \Pi} \mathbb{E}^{\pi}_x[(C^N - s)^+].$   (2.24)

Proof. For $n = 0$, we obtain

$V_{0\sigma}(x, s) = T_{f_0} V_{-1}(x, s)$
$= \int V_{-1}(x', s - c)\, Q(dx' \mid x, f_0(x, s))$
$= \int (s - c)^-\, Q(dx' \mid x, f_0(x, s))$
$= \int (c - s)^+\, Q(dx' \mid x, f_0(x, s))$
$= \mathbb{E}^{\sigma}_x[(C^0 - s)^+] = w_{0\sigma}(x, s).$

Next, by an induction argument,

$V_{n+1,\sigma}(x, s) = T_{f_0} V_{n\vec{\sigma}}(x, s)$
$= \int V_{n\vec{\sigma}}(x', s - c)\, Q(dx' \mid x, f_0(x, s))$
$= \int \mathbb{E}^{\vec{\sigma}}_{x'}[(C^n - (s - c))^+]\, Q(dx' \mid x, f_0(x, s))$
$= \int \mathbb{E}^{\vec{\sigma}}_{x'}[(c + C^n - s)^+]\, Q(dx' \mid x, f_0(x, s))$
$= \mathbb{E}^{\sigma}_x[(C^{n+1} - s)^+] = w_{n+1,\sigma}(x, s).$

We note that the history of the augmented Markov Decision Process, $\widetilde{h}_n = (x_0, s_0, a_0, x_1, s_1, a_1, \ldots, x_n, s_n)$, contains the history $h_n = (x_0, a_0, x_1, a_1, \ldots, x_n)$. We denote by $\widetilde{\Pi}$ the history dependent policies of the augmented Markov Decision Process. By ([5], Theorem 2.2.3), we get

$\inf_{\sigma \in \Pi^M} V_{n\sigma}(x, s) = \inf_{\widetilde{\pi} \in \widetilde{\Pi}} V_{n\widetilde{\pi}}(x, s).$

Hence, we obtain

$\inf_{\sigma \in \Pi^M} w_{n\sigma} \ge \inf_{\pi \in \Pi} w_{n\pi} \ge \inf_{\widetilde{\pi} \in \widetilde{\Pi}} V_{n\widetilde{\pi}} = \inf_{\sigma \in \Pi^M} V_{n\sigma} = \inf_{\sigma \in \Pi^M} w_{n\sigma}.$

This concludes the proof. □

Theorem 2.3. [4] Under the conditions of Assumption 2.1, there exists an optimal Markov policy, in the sense introduced above, $\sigma^* \in \Pi$ for any finite horizon $N \in \mathbb{N}_0$ with

$\inf_{\pi \in \Pi} \mathbb{E}^{\pi}_x[(C^N - s)^+] = \mathbb{E}^{\sigma^*}_x[(C^N - s)^+].$   (2.25)

Now we are ready to state our main result.

Theorem 2.4. Under Assumption 2.1, there exists an optimal Markov policy $\pi^*$ for the infinite horizon problem (2.15).

Proof. For the policy $\pi \in \Pi$ stated in Assumption 2.1, we have

$w_{\infty,\pi} = \mathbb{E}^{\pi}_x[(C^{\infty} - s)^+]$
$= \mathbb{E}^{\pi}_x\Big[\Big(C^n + \sum_{k=n+1}^{\infty} c_k - s\Big)^+\Big]$
$\le \mathbb{E}^{\pi}_x[(C^n - s)^+] + \mathbb{E}^{\pi}_x\Big[\sum_{k=n+1}^{\infty} c_k\Big]$
$\le \mathbb{E}^{\pi}_x[(C^n - s)^+] + M(n),$   (2.26)

where $M(n) \to 0$ as $n \to \infty$ due to Assumption 2.1. Taking the infimum over all $\pi \in \Pi$, we get

$w_{\infty}(x, s) \le w_n + M(n).$   (2.27)

Hence we get

$w_n \le w_{\infty}(x, s) \le w_n + M(n).$   (2.28)

Letting $n \to \infty$, we get

$\lim_{n \to \infty} w_n = w_{\infty}.$   (2.29)

Moreover, by Theorem 2.3, there exists $\pi^* = \{f_n\}_{n=0}^{N} \in \Pi$ such that $V^N_{\pi^*}(x) = V^*_{0,N}(x)$, and we also have by the assumption that $V^N_{\pi}(x)$ is l.s.c. By the nonnegativity of the cost functions $c_n \ge 0$, we have that $N \mapsto V^*_{0,N}(x)$ is nondecreasing and $V^*_{0,N}(x) \le V^*_{0,\infty}(x)$ for all $x \in X$. Denote

$u(x) := \sup_{N > 0} V^*_{0,N}(x).$   (2.30)