TECHNICAL RESEARCH REPORT

Convergence of Sample Path Optimal Policies for Stochastic Dynamic Programming

by Michael C. Fu, Xing Jin

TR 2005-84

INSTITUTE FOR SYSTEMS RESEARCH

ISR develops, applies and teaches advanced methodologies of design and analysis to solve complex, hierarchical, heterogeneous and dynamic problems of engineering technology and systems for industry and government. ISR is a permanent institute of the University of Maryland, within the Glenn L. Martin Institute of Technology/A. James Clark School of Engineering. It is a National Science Foundation Engineering Research Center. Web site: http://www.isr.umd.edu

Report Documentation Page (Standard Form 298, Rev. 8-98; OMB No. 0704-0188)
  Report date: 2005
  Title and subtitle: Convergence of Sample Path Optimal Policies for Stochastic Dynamic Programming
  Performing organization name and address: Air Force Office of Scientific Research, 875 North Randolph Street, Arlington, VA 22203-1768
  Distribution/availability statement: Approved for public release; distribution unlimited
  Abstract: see report
  Security classification (report, abstract, this page): unclassified
  Number of pages: 16

CONVERGENCE OF SAMPLE PATH OPTIMAL POLICIES FOR STOCHASTIC DYNAMIC PROGRAMMING

MICHAEL C. FU,* University of Maryland
XING JIN,** National University of Singapore

Abstract

We consider the solution of stochastic dynamic programs using sample path estimates. Applying the theory of large deviations, we derive probability error bounds associated with the convergence of the estimated optimal policy to the true optimal policy, for finite horizon problems. These bounds decay at an exponential rate, in contrast with the usual canonical (inverse) square root rate associated with estimation of the value (cost-to-go) function itself. These results have practical implications for Monte Carlo simulation-based solution approaches to stochastic dynamic programming problems where it is impractical to extract the explicit transition probabilities of the underlying system model.

Keywords: stochastic dynamic programming; Markov decision processes; sample path optimization; large deviations; simulation optimization.
2000 Mathematics Subject Classification: Primary 49L20; Secondary 65C50, 65C05

* Postal address: Robert H. Smith School of Business, University of Maryland, College Park, MD 20742-1871
** Postal address: Department of Mathematics, National University of Singapore, Singapore 117543

1. Introduction

Consider a stochastic dynamic programming model (also known as a Markov decision process (MDP); see Arapostathis et al. 1993, Bertsekas 1995, Puterman 1994), in the setting where only sample paths of state transition sequences are available, e.g., when it is impractical to explicitly specify the transition probabilities, but the underlying system can be readily simulated. This is often the case when the system of interest is large and complex, and must therefore be modeled by a stochastic simulation model. One drawback of using sample path estimation is the relatively slow convergence rate for estimation of performance measures (e.g., the value, or cost-to-go, function), which is generally on the order of O(N^{-0.5}), where N is the number of sample paths. The focus of this paper is the problem of finding an optimal policy, and we exploit the fact that the policy search involves ordinal comparisons, rather than absolute estimation. In practice, the main idea of this approach is to compare relative orders of performance measures in finding the best action as quickly as possible rather than wasting effort in getting a more precise absolute estimate of the value function associated with each possible action. Under appropriate conditions, we show that the probability of selecting suboptimal actions is bounded by a quantity that decays to zero at an exponential rate.

The overriding purpose of our work is to provide a rigorous theoretical foundation for the sample path approach in finding good policies in stochastic dynamic programming problems. The convergence results obtained here are completely new to this setting. To put our results in some perspective, we touch on the most closely related work. A type of exponential (geometric) convergence rate is well known in the traditional MDP framework (e.g., Puterman 1994), where the convergence is with respect to the horizon length for the value iteration procedure in infinite horizon problems with explicitly known transition probabilities. Our finite action setting is included in the book of Bertsekas and Tsitsiklis (1996), where the solution approach goes under the name of neuro-dynamic programming, but the focus there is on approximating the value function, and sample path optimal policies are not analyzed. Our results buttress the literature on ordinal optimization (see Ho et al. 1992, 2000), which focuses on the efficiency of ordinal comparisons rather than absolute estimation. In particular, the exponential convergence rate for static stochastic optimization problems is established in Dai (1996) and Dai and Chen (1997). Also, somewhat in the same spirit as our approach is the work of Robinson (1996) and Gürkan, Özge, and Robinson (1999), who consider sample path solution to stochastic variational inequalities, and establish conditions under which the sample path solution converges to the true solution; however, their setting is quite different from ours, in that we consider a dynamic model involving sequential decision making under uncertainty, and we focus on actually quantifying the error incurred in utilizing sample path estimates, going beyond just establishing convergence.

The rest of the paper is organized as follows. Section 2 defines the problem setting.
Section 3 establishes the theoretical results on the exponentially decaying probability error bounds for the basic finite horizon discounted cost problems. Section 4 briefly discusses some easy extensions, and the Appendix contains the detailed proof of one of the more technical lemmas.

2. Problem Setting

In this section, we formulate the basic problem of minimizing total expected discounted cost in a setting where the state space and action space are finite, albeit possibly non-stationary. Let {X_k, k = 1, 2, ...} denote a Markov decision process with finite state space S (|S| > 1), where X_1 is the starting state. Let T ≥ 2 be the time horizon, or number of periods (also known as stages), S_k ⊆ S be the state space for the kth period, and A_k(x), x ∈ S_k, be the (finite) set of feasible actions in state x and period k. At stage k in state x, the decision maker chooses an action a ∈ A_k(x); as a result the following occur: (i) an immediate (deterministic) cost c_k(x, a) ≥ 0 is accrued, and (ii) the process moves to a state x′ ∈ S_{k+1} with transition probability p_k(x′|x, a), where p_k(x′|x, a) ≥ 0 and Σ_{x′ ∈ S_{k+1}} p_k(x′|x, a) = 1.

The objective is to find a sequence of decision rules {µ_k(·)} comprising a policy µ = {µ_k} that minimizes total expected discounted cost given by

    E[ Σ_{k=1}^{T} α^{k-1} c_k(X_k, A_k) ],    (1)

where A_k is the action taken in period k — which would be µ_k(X_k) under policy µ — and α ∈ (0,1) is the (constant) discount factor. Here X_{k+1} depends on both X_k and A_k, i.e., given X_k = x and A_k = a, we have

    X_{k+1}(x, a) ~ {p_k(·|x, a)},    (2)

but such dependence will generally be suppressed for the sake of simplicity. Throughout, we assume a fixed initial state X_1 = x_1, but this can easily be generalized to the setting where the initial state is a random variable with an associated probability distribution.

Define the optimal cost-to-go (or value) function from stage k by

    J_k(x) = min_{µ ∈ U} E[ Σ_{i=k}^{T} α^{i-k} c_i(X_i, µ_i(X_i)) | X_k = x ],  ∀ x ∈ S_k,  k = 1, ..., T,    (3)

where U denotes the set of all policies. The value of the MDP is given by J_1(x_1), and an optimal policy µ* — defined as any policy that minimizes (1) — satisfies the following set of equations:

    µ*_k(x) ∈ arg min_{a ∈ A_k(x)} { c_k(x, a) + α E[J_{k+1}(X_{k+1}(x, a))] },  k = 1, 2, ..., T,    (4)

where the expectation is taken with respect to the next state X_{k+1}, which is a function of the current state x and action a, and we follow the convention that J_{T+1}(·) = 0. It will be convenient to introduce the Q-factors defined by the expectation on the right-hand side (e.g., Bertsekas 1995):

    Q_k(x, a) = c_k(x, a) + α E[J_{k+1}(X_{k+1}(x, a))],  k = 1, 2, ..., T,    (5)

representing the expected cost of taking action a from state X_k = x in period k, and then following the optimal policy thereafter. In particular,

    Q_T(x, a) = c_T(x, a).    (6)

Thus, we have

    J_k(x) = min_{a ∈ A_k(x)} Q_k(x, a).

For finite horizon dynamic programming with finite space, backward induction can be used via Equation (3) to obtain the optimal value functions {J_k(x), x ∈ S_k, k = 1, 2, ..., T} and a corresponding optimal policy satisfying (4). For the infinite horizon case, value iteration, policy iteration, or variants on these are used to solve the stationary version of (3) when applicable. When the transition probabilities are explicitly known, these procedures can sometimes be carried out in closed form or by using straightforward numerical procedures to calculate the necessary expectations.
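To make the preceding description concrete, here is a minimal Python sketch of classical backward induction via the Q-factors (3)-(6), for the case where the transition probabilities p_k(·|x, a) are explicitly available; the function signature and the dictionary-based representation of S_k, A_k, c_k, and p_k are our own illustrative choices, not part of the report.

```python
def backward_induction(T, states, actions, cost, trans, alpha):
    """Classical finite-horizon backward induction via Q-factors, Eqs. (3)-(6).

    states[k]         : iterable of states S_k, k = 1..T
    actions(k, x)     : feasible action set A_k(x)
    cost(k, x, a)     : immediate cost c_k(x, a) >= 0
    trans(k, x, a)    : dict {x_next: p_k(x_next | x, a)}
    alpha             : discount factor in (0, 1)
    """
    J = {T + 1: {}}                      # convention J_{T+1}(.) = 0
    policy = {}
    for k in range(T, 0, -1):            # k = T, T-1, ..., 1
        J[k], policy[k] = {}, {}
        for x in states[k]:
            # Q_k(x, a) = c_k(x, a) + alpha * E[J_{k+1}(X_{k+1}(x, a))]
            Q = {a: cost(k, x, a)
                    + alpha * sum(p * J[k + 1].get(y, 0.0)
                                  for y, p in trans(k, x, a).items())
                 for a in actions(k, x)}
            policy[k][x] = min(Q, key=Q.get)   # mu*_k(x), Eq. (4)
            J[k][x] = Q[policy[k][x]]          # J_k(x) = min_a Q_k(x, a)
    return J, policy
```

In the setting of this paper such a routine cannot be applied directly, since the expectations inside the Q-factors are not available in closed form; the sample path construction described next replaces them with sample means.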
In our setting, based on sample paths of the MDP sequence X_1, X_2, ... for a given policy µ, the expectations in (1), (3), or (4) are estimated by taking sample means. By a sample path optimal policy, we mean a policy (possibly only partially specified, if not all states are visited in the sample paths) that optimizes the sample mean of the objective function given in (1). (This is not to be confused with using a single "long" sample path to estimate a stationary optimal policy for infinite horizon problems.) This will be a function of both the sample path length and the number of sample paths. For the finite horizon setting, the sample path length will be equal to the number of periods T, whereas in the infinite horizon case, the optimal policy is approximated by a finite horizon sample path optimal policy. A direct implementation for using sample paths would be to take a "large" number of samples for each value that must be estimated, thus in essence reducing the problem to the traditional setting. In practice, taking a large number of samples may be unnecessarily wasteful, especially when the ultimate objective is to find the optimal policy, not necessarily to precisely estimate the optimal value functions for all states. The underlying philosophy is that one may obtain good policies through ordinal comparison even while the estimate of the value function itself is not that accurate.

3. Sample Path Probability Error Bounds

We now derive probability error bounds for the convergence of sample path optimal policies to a true optimal policy. We focus on searching for the optimal action in the first period, since optimal actions for subsequent periods can be obtained in the same manner. Write the feasible action set for the initial period starting in state x_1 as A_1(x_1) = {a_1, a_2, ..., a_m}. The Q-factor of interest for the first period, as defined by (5), is

    Q_1(x, a) = c_1(x, a) + α E[J_2(X_2(x, a))],

where J_2 is the cost-to-go function defined by (3) with horizon T−1. Since throughout we are focusing on the first period with initial state X_1 = x_1, we will simplify notation by dropping explicit display of the dependence on the period and initial state by defining the unsubscripted Q-factor: Q(a) = Q_1(x_1, a). Without loss of generality, we assume

    Q(a_1) < Q(a_2) ≤ ... ≤ Q(a_m),

i.e., µ*_1(x_1) = a_1. The case with ties for the best can also be handled in exactly the same way; see Remark 3.2 following Theorem 3.1.

The procedure to estimate the optimal action from a given state in the setting of this section uses sample trees. Specifically, for state x_1, for each action a_l ∈ A_1(x_1), n independent sample trees are generated. Each tree begins by taking action a_l in period 1, and then sampling all possible actions in subsequent states visited. Since the state space is finite, there may be common states visited between trees and also within trees. We keep the tree structure by sampling from each node separately and independently according to (2), so there will be no "recombining" branches, even if the same state were reached at different nodes of the tree. To be more specific, a sample tree is generated as follows for initial period action a_l (a code sketch of this procedure follows the list):

(i) In period 1, generate one next (period 2) state sample (node) according to p_1(·|x_1, a_l).

(ii) In period 2, generate a next (period 3) state sample (node) according to p_2(·|x, a), for each feasible action a ∈ A_2(x), where x is the state generated in step (i).

(iii) Starting from each state x visited in period k of the tree (k = 3, ..., T−1), generate a next (period k+1) state sample (node) according to p_k(·|x, a) for each feasible action a ∈ A_k(x).
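As a rough illustration of steps (i)-(iii), the sketch below generates a single sample tree by forward simulation; the node representation and the sampler interface sample_next(k, x, a), assumed to draw one state from p_k(·|x, a), are hypothetical conveniences for exposition rather than anything specified in the report.

```python
def sample_tree(T, a_l, x1, actions, sample_next):
    """Generate one sample tree for initial action a_l, per steps (i)-(iii).

    actions(k, x)        : feasible action set A_k(x)
    sample_next(k, x, a) : draws one next state according to p_k(.|x, a)

    Each node is a dict {'state': x, 'period': k, 'children': {a: child}}.
    Nodes are never recombined, even if the same state is reached twice.
    """
    root = {'state': x1, 'period': 1, 'children': {}}
    # Step (i): in period 1 only the chosen action a_l is sampled.
    x2 = sample_next(1, x1, a_l)
    node2 = {'state': x2, 'period': 2, 'children': {}}
    root['children'][a_l] = node2
    frontier = [node2]
    # Steps (ii)-(iii): from period 2 on, sample one next state per feasible action.
    for k in range(2, T):
        new_frontier = []
        for node in frontier:
            x = node['state']
            for a in actions(k, x):
                child = {'state': sample_next(k, x, a),
                         'period': k + 1, 'children': {}}
                node['children'][a] = child
                new_frontier.append(child)
        frontier = new_frontier
    return root
```

Repeating this n times with independent draws, for each a_l ∈ A_1(x_1), produces the n independent sample trees per initial action described above.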
As mentioned earlier, all sampling is done independently of other trees and other nodes in the same tree; however, correlation between sampling of different actions from the same node in a tree is allowed. Let S_k^(l) ⊆ S_k, k = 2, ..., T, denote the set of states actually visited in period k over all n sample trees initiated with action a_l. An example for n = 3 is shown in Figure 1. In this example, even if x_5 = x_6, i.e., the state reached is the same, the nodes themselves remain distinct, in that separate independent samples would be generated from each for each possible action in A_3(x_5) = A_3(x_6).

Sample path estimates for the Q-factors and cost-to-go functions are obtained via backward induction as follows:

    Q̂_T^(l)(x, a) = c_T(x, a),  x ∈ S_T^(l),    (7)

    µ̂_k^(l)(x) ∈ arg min_{a ∈ A_k(x)} Q̂_k^(l)(x, a),  x ∈ S_k^(l),  k = 2, ..., T,    (8)

    Ĵ_k^(l)(x) = min_{a ∈ A_k(x)} Q̂_k^(l)(x, a) = Q̂_k^(l)(x, µ̂_k^(l)(x)),  x ∈ S_k^(l),  k = 2, ..., T,    (9)

    Q̂_k^(l)(x, a) = c_k(x, a) + |N_{k+1}^(l)(x, a)|^{-1} α Σ_{y ∈ N_{k+1}^(l)(x, a)} Ĵ_{k+1}^(l)(y),  x ∈ S_k^(l),  k = 2, ..., T−1,    (10)

where N_{k+1}^(l)(x, a) is the multi-set (i.e., includes states repeated if sampled more than once) of states reached in period k+1 from state x with action a in period k (k = 1, ..., T−1).

[Figure 1: Example of simulated trees for n = 3. Three trees are rooted at x_1 with initial action a_l: the first reaches x_2 in period 2 and x_5, x_6 in period 3 (via a′_1, a′_2); the second reaches x_3 and then x_7, x_8, x_9 (via a′_3, a′_4, a′_5); the third reaches x_4 and then x_10, x_11 (via a′_6, a′_7). Here A_2(x_2) = {a′_1, a′_2}, A_2(x_3) = {a′_3, a′_4, a′_5}, A_2(x_4) = {a′_6, a′_7}, S_2^(l) = {x_2, x_3, x_4}, S_3^(l) = {x_5, x_6, ..., x_11}.]

Similar to the unsubscripted initial period, initial state Q-factors defined earlier, we define the following corresponding tree-based estimator:

    Q̂(a_l) = c_1(x_1, a_l) + (α/n) Σ_{y ∈ N(a_l)} Ĵ_2^(l)(y),

where we have defined N(a_l) = N_2^(l)(x_1, a_l) and |N(a_l)| = n. We then estimate the optimal first-period action in the natural way:

    â_1(n) = arg min_{a_l ∈ A_1(x_1)} { Q̂(a_l) }.    (11)

Averaging over N_{k+1}^(l)(x, a) in (10) is needed to ensure consistency in defining decision rules via (8), since the same state can be reached more than once in sampling, on the same tree or on different trees. If all n trees for a given a_l are distinct with no common states in any period beyond the initial state, so that |N_{k+1}^(l)(x, a)| = 1 for k > 1, then the DP algorithm simply corresponds to performing (deterministic) backward induction individually on each tree.
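The following is a minimal sketch of the backward recursion (7)-(10) together with the averaging behind (11), assuming the sampled transitions from all n trees for a given a_l have already been pooled into the multi-sets N_{k+1}^(l)(x, a); the container layout (visited[k] for S_k^(l), samples[k][(x, a)] for N_{k+1}^(l)(x, a)) is an assumption made here for illustration.

```python
def estimate_Q(a_l, x1, cost, actions, alpha, T, visited, samples):
    """Sample-path backward induction, Eqs. (7)-(11), for initial action a_l.

    visited[k]         : states S_k^(l) reached in period k over all n trees
    samples[k][(x, a)] : multi-set (list) N_{k+1}^(l)(x, a) of sampled next
                         states from x under action a in period k (k = 1..T-1);
                         for k = 1 only (x1, a_l) is present and |N(a_l)| = n
    """
    # Eq. (7): terminal Q-factors, hence J_hat_T(x) = min_a c_T(x, a)
    J = {x: min(cost(T, x, a) for a in actions(T, x)) for x in visited[T]}
    # Eqs. (8)-(10): backward recursion with multi-set averaging, k = T-1, ..., 2
    for k in range(T - 1, 1, -1):
        J_k = {}
        for x in visited[k]:
            Q = {}
            for a in actions(k, x):
                N = samples[k][(x, a)]      # repeats kept if sampled twice
                Q[a] = cost(k, x, a) + alpha * sum(J[y] for y in N) / len(N)
            J_k[x] = min(Q.values())        # J_hat_k(x) = min_a Q_hat_k(x, a)
        J = J_k
    # Q_hat(a_l) = c_1(x1, a_l) + (alpha / n) * sum of J_hat_2 over N(a_l)
    N1 = samples[1][(x1, a_l)]
    return cost(1, x1, a_l) + alpha * sum(J[y] for y in N1) / len(N1)
```

Evaluating this quantity for every a_l ∈ A_1(x_1) and taking the minimizer yields the estimated first-period action â_1(n) of (11).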
Figure 2 shows a simple example, which we use to illustrate how Equations (7)-(11) are applied and why the averaging is necessary. There are two trees (n = 2), and both reach the same state x_2 in period 2, hence the multi-set N(a_l) = {x_2, x_2}.

[Figure 2: Example of two simulated trees with common states (costs shown on branches). Here N_2(x_1, a_l) = {x_2, x_2}, N_3(x_2, a′_1) = {x_3, x_4}, N_3(x_2, a′_2) = {x_4, x_5}, S_2^(l) = {x_2}, S_3^(l) = {x_3, x_4, x_5}. Dynamic programming performed separately on each tree would lead to a different action from state x_2 in period 2 on the two trees (a′_1 in the upper tree, a′_2 in the lower tree); averaging appropriately over the corresponding nodes in the two trees leads to µ̂_2(x_2) = a′_1.]

Assume for simplicity that the discount factor is one (α = 1). Applying the DP algorithm separately to each tree, we obtain (suppressing the superscripted (l) for notational convenience) Ĵ_3(x_3) = 1, Ĵ_3(x_4) = 3, Ĵ_3(x_5) = 1; for the upper tree, Ĵ_2(x_2) = 2 and µ̂_2(x_2) = a′_1, whereas for the lower tree, Ĵ_2(x_2) = 3 and µ̂_2(x_2) = a′_2, leading to a conflict in specifying the decision rule (action for state x_2). On the other hand, with the averaging (over just the two trees, i.e., n = 2), Q̂_2(x_2, a′_1) = 1 + (1+3)/2 = 3 and Q̂_2(x_2, a′_2) = 2 + (3+1)/2 = 4, which gives Ĵ_2(x_2) = min{Q̂_2(x_2, a′_1), Q̂_2(x_2, a′_2)} = 3 and µ̂_2(x_2) = arg min_{a′_i} {Q̂_2(x_2, a′_i)} = a′_1, hence Q̂(a_l) = c_1(x_1, a_l) + 3. This would be repeated for all other actions in A_1(x_1), and then (one of) the action(s) with the lowest value of Q̂(·) would be selected to be the estimated optimal action in state x_1.

Our results use the large deviations principle (cf. Dembo and Zeitouni 1998), which yields exponentially decaying probability error bounds under appropriate conditions.

Lemma 3.1: Consider a sequence of i.i.d. random variables {Y_n, n ≥ 1} with moment generating function M(λ) = E[exp(λ Y_1)]. Let S_n = Σ_{i=1}^{n} Y_i. If M(λ) exists in a neighborhood (−ε, ε) of λ = 0 for some ε > 0, then

    P(S_n/n ≥ x) ≤ exp(−n Λ*_+(x)),  ∀ x,   and   P(S_n/n ≤ x) ≤ exp(−n Λ*_−(x)),  ∀ x,

where

    Λ*_+(x) = sup_{0 ≤ λ < ε} (λ x − log M(λ))   and   Λ*_−(x) = sup_{−ε < λ ≤ 0} (λ x − log M(λ)).

Furthermore, if |Y_1| ≤ M for some constant M < ∞, then Λ*_+(x) > 0 for x > E[Y_1], and Λ*_−(x) > 0 for x < E[Y_1].

Proof. The first part follows directly from Xie (1997). For the second part, we show only the x > E[Y_1] case, since the x < E[Y_1] case is similar. Using a Taylor series expansion of Λ(λ) = log E[exp(λ Y_1)] around 0, for λ ≥ 0 there exists ξ ∈ [0, λ) such that

    Λ(λ) = log E[exp(λ Y_1)] = Λ(0) + Λ′(0) λ + (1/2) Λ″(ξ) λ² = λ E[Y_1] + (1/2) Λ″(ξ) λ²,

the last equality following from Λ(0) = 0 and Λ′(0) = E[Y_1]. We now turn to evaluating Λ″(ξ). Since |Y_1| ≤ M,

    Λ″(ξ) = ( E[Y_1² exp(ξ Y_1)] E[exp(ξ Y_1)] − (E[Y_1 exp(ξ Y_1)])² ) / (E[exp(ξ Y_1)])²
          ≤ E[Y_1² exp(ξ Y_1)] / E[exp(ξ Y_1)] ≤ M².

Consequently, for x > E[Y_1],

    Λ*_+(x) = sup_{λ ≥ 0} { λ x − log E[exp(λ Y_1)] } ≥ sup_{λ ≥ 0} { λ (x − E[Y_1]) − M² λ²/2 } > 0,    (12)

completing the proof.
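As a numerical sanity check on Lemma 3.1 (our own illustration, not part of the report), one can compare the empirical upper-tail frequency of a bounded i.i.d. sample mean against the Chernoff bound exp(−n Λ*_+(x)). For Bernoulli(p) variables and x ∈ (p, 1), the supremum defining Λ*_+(x) has the standard closed form used below; the parameter values are arbitrary choices.

```python
import math
import random

def bernoulli_rate(x, p):
    """Closed-form rate function Lambda*_+(x) for Bernoulli(p), with x in (p, 1)."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

def check_chernoff(p=0.3, x=0.5, n=50, trials=200_000, seed=1):
    random.seed(seed)
    # count how often the sample mean S_n / n reaches the threshold x
    hits = sum(
        sum(random.random() < p for _ in range(n)) / n >= x
        for _ in range(trials)
    )
    bound = math.exp(-n * bernoulli_rate(x, p))
    print(f"empirical P(S_n/n >= {x}) = {hits / trials:.4f}  <=  bound = {bound:.4f}")

check_chernoff()   # empirical frequency should come out well below the bound (~0.013 here)
```

With the parameters shown, the empirical tail frequency falls comfortably below the exponential bound, consistent with the lemma.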
Remark 3.1: For a finite-horizon MDP with finite action and state spaces, the total discounted cost Σ_{k=1}^{T} α^{k-1} c_k(X_k, µ_k(X_k)) has finite moment generating function on (−∞, ∞) for any policy µ ∈ U. Define

    c̄_0 = max_{k ∈ {1, 2, ..., T}}  max_{x ∈ S_k, a ∈ A_k(x)} c_k(x, a)

and

    J̄_0 = Σ_{k=1}^{T} α^{k-1} c̄_0.

From the backward induction DP algorithm, it is easy to show that for any l, k, and x, Ĵ_k^(l)(x) ≤ J̄_0, and, from the definition of J_k(x), it is easy to see J_k(x) ≤ J̄_0.
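For reference, the bound J̄_0 in Remark 3.1 is a finite geometric sum, so (an elementary step not spelled out in the report) it admits the closed form

```latex
\bar{J}_0 \;=\; \sum_{k=1}^{T} \alpha^{k-1}\,\bar{c}_0
         \;=\; \bar{c}_0\,\frac{1-\alpha^{T}}{1-\alpha}
         \;\le\; \frac{\bar{c}_0}{1-\alpha}, \qquad \alpha \in (0,1),
```

so the estimates Ĵ_k^(l)(x) and the true values J_k(x) are bounded uniformly in k and x.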
