From Infinite to Finite Programs: Explicit Error Bounds with Applications to Approximate Dynamic Programming

PEYMAN MOHAJERIN ESFAHANI, TOBIAS SUTTER, DANIEL KUHN, AND JOHN LYGEROS

Abstract. We consider linear programming (LP) problems in infinite dimensional spaces that are in general computationally intractable. Under suitable assumptions, we develop an approximation bridge from the infinite-dimensional LP to tractable finite convex programs in which the performance of the approximation is quantified explicitly. To this end, we adopt the recent developments in two areas of randomized optimization and first order methods, leading to a priori as well as a posteriori performance guarantees. We illustrate the generality and implications of our theoretical results in the special case of the long-run average cost and discounted cost optimal control problems for Markov decision processes on Borel spaces. The applicability of the theoretical results is demonstrated through a constrained linear quadratic optimal control problem and a fisheries management problem.

Keywords. infinite-dimensional linear programming, Markov decision processes, approximate dynamic programming, randomized and convex optimization

1. Introduction

Linear programming (LP) problems in infinite dimensional spaces appear in, among other areas, engineering, economics, operations research and probability theory [33, 1, 32]. Infinite LPs offer remarkable modeling power, subsuming general finite dimensional optimization problems and the generalized moment problem as special cases. They are, however, often computationally formidable, motivating the study of approximation schemes.

A particularly rich class of problems that can be modeled as infinite LPs involves Markov decision processes (MDP) and their optimal control. More often than not, it is impossible to obtain explicit solutions to MDP problems, making it necessary to resort to approximation techniques.
Such approximations are the core of a methodology known as approximate dynamic programming [8, 6]. Interestingly, a wide range of optimal control problems involving MDP can be equivalently expressed as static optimization problems over a closed convex set of measures, more specifically, as infinite LPs [24, 26, 27]. This LP reformulation is particularly appealing for dealing with unconventional settings involving additional constraints [23, 3], secondary costs [18], information-theoretic considerations [43], and reachability problems [28, 34]. In addition, the infinite LP reformulation allows one to leverage the developments in the optimization literature, in particular convex approximation techniques, to develop approximation schemes for MDP problems. This will also be the perspective adopted in the present article.

Approximation schemes to tackle infinite LPs have historically been developed for special classes of problems, e.g., the general capacity problem [31], or the generalized moment problem [32]. The literature on control of MDP with infinite state or action spaces mostly concentrates on approximation schemes with asymptotic performance guarantees [26, 25]; see also the comprehensive book [30] for controlled stochastic differential equations and [41, 35] for reachability problems in a similar setting. From a practical viewpoint, a challenge using these schemes is that the convergence analysis is not constructive and does not lead to explicit error bounds.

Date: February 22, 2017. The authors are with the Delft Center for Systems and Control, TU Delft, The Netherlands ([email protected]), the Automatic Control Laboratory, ETH Zurich, Switzerland ({sutter,lygeros}@control.ee.ethz.ch), and the Risk Analytics and Optimization Chair, EPFL, Switzerland ([email protected]).
A wealth of approximation schemes have been proposed in the literature under the names of approximate dynamic programming [5], neuro-dynamic programming [8], reinforcement learning [29, 46], and value and/or policy iteration [6]. Most, however, deal with discrete (finite or at most countable) state and action spaces, while approximation over uncountable spaces remains largely unexplored.

The MDP literature on explicit approximation errors in uncountable settings can, roughly speaking, be divided into two groups in terms of the performance criteria considered: discounted cost and average cost. Of the two, the discounted cost setting has received more attention, as the corresponding dynamic programming operator is a contraction, a useful property for obtaining a convergence rate for the approximation error. Examples include the linear programming approach [13, 14], and also a recent series of works [17, 18, 11] on approximating a probability measure that underlies the random transitions of the dynamics of the system using different discretization procedures. Long-run average cost problems introduce new challenges due to losing the contraction property. The authors in [19] develop approximation schemes leading to finite but non-convex optimization problems, while [42] investigates the convergence rate of the finite-state approximation to the original (uncountable) MDP problem.

The approach presented in this article tackles a class of general infinite LPs that, as a special case, covers both long-run discounted and average cost performance criteria in the optimal control of MDP. The resulting approximation is based on finite convex programs that are different from the existing schemes. Closest in spirit to our proposed approximation is the linear programming approach based on constraint sampling in [13, 14, 45]. Unlike these works, however, we introduce an additional norm constraint that effectively acts as a regularizer.
We study in detail the conditions under which this regularizer can be exploited to bound the optimizers of the primal and dual programs, and hence provide an explicit approximation error for the proposed solution.

The proposed approximation scheme involves a restriction of the decision variables from an infinite dimensional space to a finite dimensional subspace, followed by the approximation of the infinite number of constraints by a finite subset; we develop two complementary methods for performing the latter step. The structure of the article is illustrated in Figure 1, and the contributions are summarized as follows:

• We introduce a subclass of infinite LPs whose regularized semi-infinite restriction enjoys analytical bounds for both primal and dual optimizers (Proposition 3.2). The implications for MDP with average cost (Lemma 3.7) and with discounted cost (Lemma A.2) are also investigated.
• We derive an explicit error bound between the original infinite LP and the regularized semi-infinite counterpart, providing insights on the impact of the underlying norm structure as well as on how the choice of basis functions contributes to the approximation error (Theorem 3.3, Corollary 3.5). In the MDP setting, we recover an existing result as a special case (Corollary 3.9).
• We adopt the recent developments from the randomized optimization literature to propose a finite convex program whose solution enjoys a priori probabilistic performance bounds (Theorem 4.4). We extend the existing results to offer also an a posteriori bound under a generic underlying norm structure. The required conditions and theoretical assertions are validated in the MDP setting (Corollary 4.12).
• In parallel to the randomized approach, we also utilize the recent developments in the structural convex optimization literature to propose an iterative algorithm for approximating the semi-infinite program. For this purpose, we extend the setting to incorporate unbounded prox-terms with a certain growth rate (Theorem 5.3).
We illustrate how this extension allows us to deploy the entropy prox-term in the MDP setting (Lemma 5.10, Corollary 5.8).

Section 2 introduces the main motivation for the work, namely the control of discrete-time MDP and their LP characterization. Using standard results in the literature we embed these MDP in the more general framework of infinite LPs. Section 3 studies the link from infinite LPs to semi-infinite programs. Section 4 presents the approximation of semi-infinite programs based on randomization, while Section 5 approaches the same objective using first-order convex optimization methods. Section 6 summarizes the results in the preceding sections, establishing the approximation error from the original infinite LP to the finite convex counterparts. Section 7 illustrates the theoretical results through a truncated LQG example and a fisheries management problem.

[Figure 1. Graphical representation of the article structure and its contributions: the infinite LP $(\mathcal{P})$ with value $J$, equivalent to the discrete-time MDP, is restricted to the semi-infinite program $(\mathcal{P}_n)$ with value $J_n$ (Theorem 3.3), which by strong duality (Proposition 3.2) matches its dual $(\mathcal{D}_n)$; Theorem 4.4 leads to the finite scenario program $(\mathcal{P}_{n,N})$ with value $J_{n,N}$, Theorem 5.3 to the regularized program $(\mathcal{D}_{n,\eta})$ with value $\tilde J_{n,\eta}$, and Theorem 6.1 provides prior and posterior bounds on the error $\tilde J_{n,\eta} - J_{n,N}$.]

Notation. The set $\mathbb{R}_+$ denotes the non-negative reals, and for $p \in [1,\infty]$, $\|\cdot\|_{\ell_p}$ is the standard $p$-norm on $\mathbb{R}^n$. Given a function $u\colon S \to \mathbb{R}$, we denote the infinity norm of the function by $\|u\|_\infty := \sup_{s \in S} |u(s)|$, and the Lipschitz norm by $\|u\|_L := \sup\big\{ |u(s)|,\ \tfrac{|u(s)-u(s')|}{\|s-s'\|_{\ell_\infty}} : s, s' \in S \big\}$. The space of Lipschitz functions on a set $S$ is denoted by $L(S)$; define the function $\mathbf{1}(s) \equiv 1$ for all $s \in S$. We denote the Borel $\sigma$-algebra on the (topological) space $S$ by $\mathcal{B}(S)$. Measurability is always understood in the sense of Borel. Products of topological spaces are assumed to be endowed with the product topology and the corresponding product $\sigma$-algebra. The space of finite signed measures (resp.
probability measures) on $S$ is denoted by $\mathcal{M}(S)$ (resp. $\mathcal{P}(S)$). The Wasserstein norm on the space of signed measures $\mathcal{M}(S)$ is defined by $\|\mu\|_W := \sup_{\|u\|_L \le 1} \int_S u(s)\,\mu(ds)$ and can be shown to be the dual of the Lipschitz norm. The set of extreme points of a set $A$ is denoted by $\mathcal{E}\{A\}$. Given a bilinear form $\langle \cdot,\cdot\rangle$, the support function of $A$ is defined by $\sigma_A(y) = \sup_{x \in A} \langle y, x\rangle$. The standard bilinear form in $\mathbb{R}^n$ (i.e., the inner product) is denoted by $y \cdot x$.

2. Motivation: Control of MDP and LP Characterization

2.1. MDP setting. We briefly recall some standard definitions and refer interested readers to [24, 23, 2] for further details. Consider a Markov control model $\big(S, A, \{A(s) : s \in S\}, Q, \psi\big)$, where $S$ (resp. $A$) is a metric space called the state space (resp. action space), and for each $s \in S$ the measurable set $A(s) \subseteq A$ denotes the set of feasible actions when the system is in state $s \in S$. The transition law is a stochastic kernel $Q$ on $S$ given the feasible state-action pairs in $K := \{(s,a) : s \in S,\ a \in A(s)\}$. A stochastic kernel acts on real valued measurable functions $u$ from the left as
\[ Qu(s,a) := \int_S u(s')\, Q(ds' \mid s,a), \qquad \forall (s,a) \in K, \]
and on probability measures $\mu$ on $K$ from the right as
\[ \mu Q(B) := \int_K Q(B \mid s,a)\, \mu\big(d(s,a)\big), \qquad \forall B \in \mathcal{B}(S). \]
Finally, $\psi\colon K \to \mathbb{R}_+$ denotes a measurable function called the one-stage cost function. The admissible history spaces are defined recursively as $H_0 := S$ and $H_t := H_{t-1} \times K$ for $t \in \mathbb{N}$, and the canonical sample space is defined as $\Omega := (S \times A)^\infty$. All random variables will be defined on the measurable space $(\Omega, \mathcal{G})$, where $\mathcal{G}$ denotes the corresponding product $\sigma$-algebra. A generic element $\omega \in \Omega$ is of the form $\omega = (s_0, a_0, s_1, a_1, \ldots)$, where $s_i \in S$ are the states and $a_i \in A$ the action variables. An admissible policy is a sequence $\pi = (\pi_t)_{t \in \mathbb{N}_0}$ of stochastic kernels $\pi_t$ on $A$ given $h_t \in H_t$, satisfying the constraints $\pi_t(A(s_t) \mid h_t) = 1$. The set of admissible policies will be denoted by $\Pi$.
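The left and right actions of a stochastic kernel can be pictured concretely on a finite discretization, where $Q$ becomes a row-stochastic matrix indexed by state-action pairs. The following minimal sketch uses invented numbers purely for illustration:

```python
import numpy as np

# Hypothetical discretization: 2 states, 2 actions, so |K| = 4 state-action pairs.
# Row k = (s, a) of Q holds the distribution Q(. | s, a) over next states.
Q = np.array([
    [0.9, 0.1],   # (s=0, a=0)
    [0.1, 0.9],   # (s=0, a=1)
    [0.9, 0.1],   # (s=1, a=0)
    [0.1, 0.9],   # (s=1, a=1)
])
assert np.allclose(Q.sum(axis=1), 1.0)  # stochastic kernel: each row is a distribution

u = np.array([2.0, 5.0])              # a function u: S -> R
Qu = Q @ u                            # left action: (Qu)(s,a) = sum_{s'} u(s') Q(s'|s,a)

mu = np.array([0.4, 0.1, 0.3, 0.2])   # a probability measure on K
muQ = mu @ Q                          # right action: (muQ)(B) = sum_{(s,a)} Q(B|s,a) mu(s,a)
```

Note that the left action maps functions on $S$ to functions on $K$ while preserving constants, and the right action maps probability measures on $K$ to probability measures on $S$ while preserving total mass.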
Given a probability measure $\nu \in \mathcal{P}(S)$ and a policy $\pi \in \Pi$, by the Ionescu Tulcea theorem [7, p. 140-141] there exists a unique probability measure $\mathbb{P}^\pi_\nu$ on $(\Omega, \mathcal{G})$ such that for all measurable sets $B \subset S$, $C \subset A$, $h_t \in H_t$, and $t \in \mathbb{N}_0$,
\[ \mathbb{P}^\pi_\nu\big( s_0 \in B \big) = \nu(B), \qquad \mathbb{P}^\pi_\nu\big( a_t \in C \mid h_t \big) = \pi_t(C \mid h_t), \qquad \mathbb{P}^\pi_\nu\big( s_{t+1} \in B \mid h_t, a_t \big) = Q(B \mid s_t, a_t). \]
The expectation operator with respect to $\mathbb{P}^\pi_\nu$ is denoted by $\mathbb{E}^\pi_\nu$. The stochastic process $\big(\Omega, \mathcal{G}, \mathbb{P}^\pi_\nu, (s_t)_{t \in \mathbb{N}_0}\big)$ is called a discrete-time MDP. For most of the article we consider optimal control problems where the aim is to minimize a long term average cost (AC) over the set of admissible policies and initial state measures. We define the optimal value of the optimal control problem by
\[ J_{\mathrm{AC}} := \inf_{(\pi,\nu) \in \Pi \times \mathcal{P}(S)}\ \limsup_{T \to \infty}\ \frac{1}{T}\, \mathbb{E}^\pi_\nu \Big[ \sum_{t=0}^{T-1} \psi(s_t, a_t) \Big]. \tag{1} \]
We emphasize, however, that the results also apply to other performance objectives, including the long-run discounted cost problem as shown in Appendix A.

2.2. Infinite LP characterization. The problem in (1) admits an alternative LP characterization under some mild assumptions.

Assumption 2.1 (Control model). We stipulate that
(i) the set of feasible state-action pairs is the unit hypercube $K = [0,1]^{\dim(S \times A)}$;
(ii) the transition law $Q$ is Lipschitz continuous, i.e., there exists $L_Q > 0$ such that for all $k, k' \in K$ and all continuous functions $u$,
\[ |Qu(k) - Qu(k')| \le L_Q \|u\|_\infty \|k - k'\|_{\ell_\infty}; \]
(iii) the cost function $\psi$ is non-negative and Lipschitz continuous on $K$ with respect to the $\ell_\infty$-norm.

Assumption 2.1(i) may seem restrictive; however, essentially it simply requires that the state-action set $K$ is compact. We refer the reader to Example 7.2, where a non-rectangular $K$ is transferred to a hypercube, and to [26, Chapter 12.3] for further information about the LP characterization in more general settings.

Theorem 2.2 (LP characterization [19, Proposition 2.4]). Under Assumption 2.1,
\[ -J_{\mathrm{AC}} = \begin{cases} \inf\limits_{\rho, u} & -\rho \\ \text{s.t.} & \rho + u(s) - Qu(s,a) \le \psi(s,a), \quad \forall (s,a) \in K \\ & \rho \in \mathbb{R},\ u \in L(S). \end{cases} \tag{2} \]
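For intuition, when the state and action spaces are finite, (2) reduces to an ordinary finite LP over $(\rho, u)$. The following sketch solves it for a hypothetical two-state, two-action unichain MDP (all numbers invented for illustration) with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action unichain MDP (finite analogue of the LP in (2)).
# Q[s, a] is the next-state distribution, psi[s, a] the one-stage cost.
Q = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.9, 0.1], [0.1, 0.9]]])
psi = np.array([[1.0, 0.0],
                [1.0, 2.0]])
nS, nA = psi.shape

# Decision vector x = (rho, u(0), ..., u(nS-1)); objective: minimize -rho.
# One constraint per (s, a): rho + u(s) - sum_{s'} Q(s'|s,a) u(s') <= psi(s, a).
A_ub, b_ub = [], []
for s in range(nS):
    for a in range(nA):
        row = np.zeros(1 + nS)
        row[0] = 1.0          # coefficient of rho
        row[1 + s] += 1.0     # + u(s)
        row[1:] -= Q[s, a]    # - Qu(s, a)
        A_ub.append(row)
        b_ub.append(psi[s, a])

res = linprog(c=[-1.0] + [0.0] * nS, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * (1 + nS))
print(-res.fun)  # optimal average cost; 0.5 for this instance
```

Here the best stationary policy (use action 1 in state 0, action 0 in state 1) attains average cost 0.5, which the LP recovers as the optimal $\rho$.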
The LP (2) can be expressed in the standard conic form $\inf_{x \in \mathbb{X}} \big\{ \langle x, c\rangle : \mathcal{A}x - b \in \mathcal{K} \big\}$ by introducing
\[ \begin{cases} \mathbb{X} = \mathbb{R} \times L(S), \quad x = (\rho, u) \in \mathbb{X}, \\ c = (c_1, c_2) = (-1, 0) \in \mathbb{R} \times \mathcal{M}(S), \quad \langle x, c\rangle = c_1 \rho + \int_S u(s)\, c_2(ds), \\ b(s,a) = -\psi(s,a), \quad \mathcal{A}x(s,a) = -\rho - u(s) + Qu(s,a), \quad \mathcal{K} = L_+(K), \end{cases} \tag{3} \]
where $\mathcal{M}(S)$ is the set of finite signed measures supported on $S$, and $L_+(K)$ is the cone of Lipschitz functions taking non-negative values. It should be noted that the choice of the positive cone $\mathcal{K} = L_+(K)$ is justified since, thanks to Assumption 2.1(ii), the linear operator $\mathcal{A}$ maps the elements of $\mathbb{X}$ into $L(K)$.

Remark 2.3 (Constrained MDP). The LP characterization of MDP naturally allows us to incorporate constraints in the form of
\[ \limsup_{T \to \infty}\ \frac{1}{T}\, \mathbb{E}^\pi_\nu \Big[ \sum_{t=0}^{T-1} d_i(s_t, a_t) \Big] \le \ell_i, \qquad \forall i \in \{1, \cdots, I\}, \]
where the functions $d_i\colon K \to \mathbb{R}$ and constants $\ell_i$ reflect our desired specifications. To this end, it suffices to introduce auxiliary decision variables $\beta_i \in \mathbb{R}_+$, and in (2) replace $\rho$ in the objective with $\rho - \sum_{i=1}^I \beta_i \ell_i$ and in the constraint with $\rho - \sum_{i=1}^I \beta_i d_i$; see [23, Theorem 5.2].

Our aim is to derive an approximation scheme for a class of such infinite dimensional LPs, including problems of the form (2), that comes with an explicit bound on the approximation error.

3. Infinite to Semi-infinite Programs

3.1. Dual pairs of normed vector spaces. The triple $(\mathbb{X}, \mathbb{C}, \|\cdot\|)$ is called a dual pair of normed vector spaces if
• $\mathbb{X}$ and $\mathbb{C}$ are vector spaces;
• $\langle \cdot, \cdot\rangle$ is a bilinear form on $\mathbb{X} \times \mathbb{C}$ that "separates points", i.e.,
  – for each nonzero $x \in \mathbb{X}$ there is some $c \in \mathbb{C}$ such that $\langle x, c\rangle \ne 0$,
  – for each nonzero $c \in \mathbb{C}$ there is some $x \in \mathbb{X}$ such that $\langle x, c\rangle \ne 0$;
• $\mathbb{X}$ is equipped with the norm $\|\cdot\|$, which together with the bilinear form induces a dual norm in $\mathbb{C}$ defined through $\|c\|_* := \sup_{\|x\| \le 1} \langle x, c\rangle$.

The norm in the vector spaces is used as a means to quantify the performance of the approximation schemes.
In particular, we emphasize that the vector spaces are not necessarily complete with respect to these norms. Let $(\mathbb{B}, \mathbb{Y}, \|\cdot\|)$ be another dual pair of normed vector spaces. As there is no danger of confusion, we use the same notation for the potentially different norm and bilinear form of each pair. Let $\mathcal{A}\colon \mathbb{X} \to \mathbb{B}$ be a linear operator, and $\mathcal{K}$ be a convex cone in $\mathbb{B}$. Given the fixed elements $c \in \mathbb{C}$ and $b \in \mathbb{B}$, we define a linear program, hereafter called the primal program $\mathcal{P}$, as
\[ J := \begin{cases} \inf\limits_{x \in \mathbb{X}} & \langle x, c\rangle \\ \text{s.t.} & \mathcal{A}x \succeq_{\mathcal{K}} b \end{cases} \tag{$\mathcal{P}$} \]
where the conic inequality $\mathcal{A}x \succeq_{\mathcal{K}} b$ is understood in the sense of $\mathcal{A}x - b \in \mathcal{K}$. Throughout this study we assume that the program $\mathcal{P}$ has an optimizer (i.e., the infimum is indeed a minimum), the cone $\mathcal{K}$ is closed, and the operator $\mathcal{A}$ is continuous, where the corresponding topology is the weakest in which the topological duals of $\mathbb{X}$ and $\mathbb{B}$ are $\mathbb{C}$ and $\mathbb{Y}$, respectively. Let $\mathcal{A}^*\colon \mathbb{Y} \to \mathbb{C}$ be the adjoint operator of $\mathcal{A}$ defined by
\[ \langle \mathcal{A}x, y\rangle = \langle x, \mathcal{A}^* y\rangle, \qquad \forall x \in \mathbb{X},\ \forall y \in \mathbb{Y}. \]
The programs and are assumed to be infinite, in P D thesensethatthedimensionsofthedecisionspaces(Xin ,andYin )aswellasthenumberofconstraints P D are both infinite. 3.2. Semi-infinite approximation Consider a family of linearly independent elements xn n∈N X, and let Xn be the finite dimensional { } ⊂ subspace generated by the first n elements x . Without loss of generality, we assume that x are i i≤n i { } normalized, i.e., x = 1. Restricting the decision space X of to X , along with an additional norm i n k k P constraint, yields the program inf n α x ,c α∈Rn i=1 i i Jn := s.t. Pni=1αiA(cid:10) xi (cid:23)(cid:11)K b (4) Pα R θP k k ≤ where R is a givennormon Rn and θPdetermines the size of the feasible set. In the spirit of dual-paired k·k normed vector spaces,one can approximate(X,C, ) by the finite dimensionalcounterpart(Rn,Rn, R) k·k k·k where the bilinear form is the standard inner product. In this view, the linear operator :X B may also A → be approximated by the linear operator :Rn B with the respective adjoint ∗ :Y Rn defined as An → An → n α:= α x , ∗y := x ,y , , x ,y . (5) An iA i An A 1 ··· A n i=1 X (cid:2)(cid:10) (cid:11) (cid:10) (cid:11)(cid:3) Itisstraightforwardtoverifythedefinitions(5)bynotingthat α,y =α ∗y forallα Rn andy Y. An ·An ∈ ∈ Defining the vector c:=[ x ,c , , x ,c ], we can rewrite the program(4) as 1 ··· n (cid:10) (cid:11) (cid:10) (cid:11) (cid:10) (cid:11) inf α c α∈Rn · Jn := s.t. AαnαR(cid:23)KθbP. (Pn) k k ≤ We call Pn a semi-infinite program, as the decision variable is a finite dimensional vector α ∈ Rn, but the number of constraints is still in general infinite due to the conic inequality. The additional constraint on the norm of α in acts as a regularizer and is a key difference between the proposed approximation schemes n P and existing schemes in the literature. Methods for choosing the parameter θ will be discussed later. 
Dualizing the conic inequality constraint in $\mathcal{P}_n$ and using the dual norm definition leads to a dual counterpart
\[ \tilde J_n := \begin{cases} \sup\limits_{y \in \mathbb{Y}} & \langle b, y\rangle - \theta_{\mathcal{P}} \|\mathcal{A}_n^* y - \mathbf{c}\|_{R^*} \\ \text{s.t.} & y \in \mathcal{K}^*, \end{cases} \tag{$\mathcal{D}_n$} \]
where $\|\cdot\|_{R^*}$ denotes the dual norm of $\|\cdot\|_R$. Note that setting $\theta_{\mathcal{P}} = \infty$ effectively implies that the second term of the objective in $\mathcal{D}_n$ introduces the hard constraints $\mathcal{A}_n^* y = \mathbf{c}$ (cf. (5)). We study further the connection between $\mathcal{P}_n$ and $\mathcal{D}_n$ under the following regularity assumption:

Assumption 3.1 (Semi-infinite regularity). We stipulate that
(i) the program $\mathcal{P}_n$ is feasible;
(ii) there exists a positive constant $\gamma$ such that $\|\mathcal{A}_n^* y\|_{R^*} \ge \gamma \|y\|_*$ for every $y \in \mathcal{K}^*$, and $\theta_{\mathcal{P}}$ is large enough so that $\gamma \theta_{\mathcal{P}} > \|b\|$.

Assumption 3.1(ii) is closely related to the condition
\[ \inf_{y \in \mathcal{K}^*} \sup_{x \in \mathbb{X}_n} \frac{\langle \mathcal{A}x, y\rangle}{\|x\| \|y\|_*} \ge \gamma, \]
which in the literature on numerical algorithms in infinite dimensional spaces, in particular the Galerkin discretization methods for partial differential equations, is often referred to as the "inf-sup" condition; see [20] for a comprehensive survey. To see this, note that for every $x \in \mathbb{X}_n$ the definitions in (5) imply that
\[ \langle \mathcal{A}x, y\rangle = \langle \mathcal{A}_n \alpha, y\rangle = \alpha \cdot \mathcal{A}_n^* y, \qquad x = \sum_{i=1}^n \alpha_i x_i. \]
These conditions are in fact equivalent if the norm $\|\cdot\|_R$ is induced by the original norm on $\mathbb{X}$, i.e., $\|\alpha\|_R := \big\| \sum_{i=1}^n \alpha_i x_i \big\|$. We note that $\mathcal{A}_n^*$ maps an infinite dimensional space to a finite dimensional one, and as such Assumption 3.1(ii) effectively necessitates that the null-space of $\mathcal{A}_n^*$ intersects the positive cone $\mathcal{K}^*$ only at $0$. In the following we show that this regularity condition leads to a zero duality gap between $\mathcal{P}_n$ and $\mathcal{D}_n$, as well as an upper bound for the dual optimizers. The latter turns out to be a critical quantity for the performance bounds of this study.

Proposition 3.2 (Duality gap & bounded dual optimizers). Under Assumption 3.1(i), the duality gap between the programs $\mathcal{P}_n$ and $\mathcal{D}_n$ is zero, i.e., $J_n = \tilde J_n$.
If in addition Assumption 3.1(ii) holds, then for any optimizer $y_n^\star$ of the program $\mathcal{D}_n$ and any lower bound $J_n^{\mathrm{LB}} \le J_n$ we have
\[ \|y_n^\star\|_* \ \le\ \theta_{\mathcal{D}} := \frac{\theta_{\mathcal{P}} \|\mathbf{c}\|_{R^*} - J_n^{\mathrm{LB}}}{\gamma \theta_{\mathcal{P}} - \|b\|} \ \le\ \frac{2 \theta_{\mathcal{P}} \|\mathbf{c}\|_{R^*}}{\gamma \theta_{\mathcal{P}} - \|b\|}. \tag{6} \]

Proof. Since the elements $\{x_i\}_{i \le n}$ are linearly independent, the feasible set of the decision variable $\alpha$ in program $\mathcal{P}_n$ is a bounded closed subset of a finite dimensional space, and hence compact. Thus, thanks to the feasibility Assumption 3.1(i) and compactness of the feasible set, the zero duality gap follows because
\[ J_n = \inf_{\|\alpha\|_R \le \theta_{\mathcal{P}}} \Big\{ \alpha \cdot \mathbf{c} + \sup_{y \in \mathcal{K}^*} \langle b - \mathcal{A}_n \alpha, y\rangle \Big\} = \sup_{y \in \mathcal{K}^*} \inf_{\|\alpha\|_R \le \theta_{\mathcal{P}}} \Big\{ \langle b, y\rangle - \alpha \cdot \big( \mathcal{A}_n^* y - \mathbf{c} \big) \Big\} = \tilde J_n, \]
where the first equality holds by the definition of the dual cone $\mathcal{K}^*$, and the second equality follows from Sion's minimax theorem [44, Theorem 4.2]. Thanks to the zero duality gap above, we have
\[ J_n^{\mathrm{LB}} \le J_n = \tilde J_n = \langle b, y_n^\star\rangle - \theta_{\mathcal{P}} \|\mathcal{A}_n^* y_n^\star - \mathbf{c}\|_{R^*} \le \langle b, y_n^\star\rangle - \theta_{\mathcal{P}} \|\mathcal{A}_n^* y_n^\star\|_{R^*} + \theta_{\mathcal{P}} \|\mathbf{c}\|_{R^*}. \]
By Assumption 3.1(ii), we then have
\[ J_n^{\mathrm{LB}} \le \|b\| \|y_n^\star\|_* - \gamma \theta_{\mathcal{P}} \|y_n^\star\|_* + \theta_{\mathcal{P}} \|\mathbf{c}\|_{R^*} = \theta_{\mathcal{P}} \|\mathbf{c}\|_{R^*} - \big( \gamma \theta_{\mathcal{P}} - \|b\| \big) \|y_n^\star\|_*, \]
which together with the simple lower bound $J_n^{\mathrm{LB}} := -\theta_{\mathcal{P}} \|\mathbf{c}\|_{R^*} \le J_n$ concludes the proof. □

Proposition 3.2 effectively implies that in the program $\mathcal{D}_n$ one can add a norm constraint $\|y\|_* \le \theta_{\mathcal{D}}$ without changing the optimal value. The parameter $\theta_{\mathcal{D}}$ depends on $J_n^{\mathrm{LB}}$, a lower bound for the optimal value $J_n$. A simple choice for such a lower bound is $-\theta_{\mathcal{P}} \|\mathbf{c}\|_{R^*}$, but in particular problem instances one may be able to obtain a less conservative bound. We validate the assertions of Proposition 3.2 for long-run average cost problems in the next section and for long-run discounted cost problems in Appendix A.

Program $\mathcal{P}_n$ is a restricted version of the original program $\mathcal{P}$ (also called an inner approximation [26, Definition 12.2.13]), and thus $J \le J_n$. However, under Assumption 3.1, we show that the gap $J_n - J$ can be quantified explicitly.
To this end, we consider the projection mapping $\Pi_A(x) := \arg\min_{x' \in A} \|x' - x\|$ and the operator norm $\|\mathcal{A}\| := \sup_{\|x\| \le 1} \|\mathcal{A}x\|$, and define the set
\[ B_n := \Big\{ \sum_{i=1}^n \alpha_i x_i \in \mathbb{X}_n : \|\alpha\|_R \le \theta_{\mathcal{P}} \Big\}. \tag{7} \]

Theorem 3.3 (Semi-infinite approximation). Let $x^\star$ and $y_n^\star$ be optimizers for the programs $\mathcal{P}$ and $\mathcal{D}_n$, respectively, and let $r_n := x^\star - \Pi_{B_n}(x^\star)$ be the projection residual of the optimizer $x^\star$ onto the set $B_n$ as defined in (7). Under Assumption 3.1(i), we have $0 \le J_n - J \le \langle r_n, \mathcal{A}^* y_n^\star - c\rangle$, where $J$ and $J_n$ are the optimal values of the programs $\mathcal{P}$ and $\mathcal{P}_n$. In addition, if Assumption 3.1(ii) holds, then
\[ 0 \le J_n - J \le \big( \|c\|_* + \theta_{\mathcal{D}} \|\mathcal{A}\| \big) \|r_n\|, \tag{8} \]
where $\theta_{\mathcal{D}}$ is the dual optimizer bound introduced in (6).

Proof. The lower bound $0 \le J_n - J$ is trivial, and we only need to prove the upper bound. Note that since the optimizer $x^\star \in \mathbb{X}$ is a feasible solution of $\mathcal{P}$, then $\mathcal{A}x^\star - b \in \mathcal{K}$. By the definition of the dual cone $\mathcal{K}^*$, this implies that $\langle \mathcal{A}x^\star - b, y\rangle \ge 0$ for all $y \in \mathcal{K}^*$. Since the dual optimizer $y_n^\star$ belongs to the dual cone $\mathcal{K}^*$, then
\[ \begin{aligned} J_n - J &\le J_n - J + \langle \mathcal{A}x^\star - b, y_n^\star\rangle = J_n - \langle x^\star, c\rangle + \langle \mathcal{A}x^\star, y_n^\star\rangle - \langle b, y_n^\star\rangle \\ &= J_n + \langle x^\star, \mathcal{A}^* y_n^\star - c\rangle - \langle b, y_n^\star\rangle \\ &= J_n + \langle r_n, \mathcal{A}^* y_n^\star - c\rangle + \langle \Pi_{B_n}(x^\star), \mathcal{A}^* y_n^\star - c\rangle - \langle b, y_n^\star\rangle \\ &= J_n + \langle r_n, \mathcal{A}^* y_n^\star - c\rangle + \tilde\alpha \cdot \big( \mathcal{A}_n^* y_n^\star - \mathbf{c} \big) - \langle b, y_n^\star\rangle, \end{aligned} \]
for some $\tilde\alpha \in \mathbb{R}^n$ with norm $\|\tilde\alpha\|_R \le \theta_{\mathcal{P}}$; for the last line, see the definition of the operator $\mathcal{A}_n$ in (5) as well as the vector $\mathbf{c}$ in the program $\mathcal{P}_n$. Using the definition of the dual norm and the operators (5), one can deduce from the above that
\[ J_n - J \le J_n + \langle r_n, \mathcal{A}^* y_n^\star - c\rangle + \theta_{\mathcal{P}} \|\mathcal{A}_n^* y_n^\star - \mathbf{c}\|_{R^*} - \langle b, y_n^\star\rangle = J_n + \langle r_n, \mathcal{A}^* y_n^\star - c\rangle - \tilde J_n, \]
which in conjunction with the zero duality gap ($J_n = \tilde J_n$) establishes the first assertion of the theorem.
n n e The second assertion is simply the consequence of the first part and the norm definitions, i.e., e r , ∗y⋆ c = r , c + r ,y⋆ r c + r y⋆ r c + y⋆ . n A n− n − A n n ≤k nkk k∗ kA nkk nk∗ ≤k nk k k∗ kAkk nk∗ Invoking(cid:10)the bound o(cid:11)n th(cid:10)e dual(cid:11)opti(cid:10)mizer y⋆(cid:11)from Proposition 3.2 completes the(cid:16)proof. (cid:17) (cid:3) n Remark 3.4(Impactofnormsonsemi-infiniteapproximation). Wenotethefollowing concerningtheimpact of the choice of norms on the approximation error: (i) The only norm that influences the semi-infinite program n is R on Rn. When it comes to the P k·k approximation error (8), the norm R may have an impact on the residual rn only if the set Bn in k·k (7) does not contain ΠXn(x⋆), the projection x⋆ on the subspace Xn, where x⋆ is an optimizer of the infinite program . P (ii) The norms of the dual pairs of vector spaces only appear in Theorem 3.3 to quantify the approxima- tion error. Note that in (8) the stronger the norm on X, the higher r , and the lower c and n ∗ k k k k . On the other hand, the stronger the norm on B, the higher b and and the lower γ (cf. kAk k k kAk Assumption 3.1(ii)). The error bound (8) can be further improved when X is a Hilbert space. In this case, let X denote the n orthogonal complement of X . We define the restricted norms by n x,c x c := sup , := sup kA k. (9) ∗n n k k x kAk x x∈Xn (cid:10)k k(cid:11) x∈Xn k k It is straightforwardto see that by definition c c and . ∗n ∗ n k k ≤k k kAk ≤kAk 9 Corollary 3.5 (Hilbert structure). Suppose that X is a Hilbert space and is the norm induced by the k·k corresponding inner product. Let {xi}i∈N be an orthonormal dense family and k·kR = k·kℓ2. Let x⋆ be an optimal solution for and chose θ x⋆ . Under the assymptions of Theorem 3.3, we have P P ≥k k 0 J J c +θ Π (x⋆) . n n D n X ≤ − ≤ k k kAk n Proof. 
We first note that the $\ell_2$-norm on $\mathbb{R}^n$ is indeed the norm induced by $\|\cdot\|$, since due to the orthonormality of $\{x_i\}_{i \in \mathbb{N}}$ we have
\[ \|\alpha\|_R := \Big\| \sum_{i=1}^n \alpha_i x_i \Big\| = \sqrt{ \sum_{i=1}^n \alpha_i^2 \|x_i\|^2 } = \|\alpha\|_{\ell_2}. \]
If $\theta_{\mathcal{P}} \ge \|x^\star\|$, then $\Pi_{B_n}(x^\star) = \Pi_{\mathbb{X}_n}(x^\star)$, i.e., the projection of the optimizer $x^\star$ onto the ball $B_n$ is in fact the projection onto the subspace $\mathbb{X}_n$. Therefore, thanks to the orthonormality, the projection residual $r_n = x^\star - \Pi_{\mathbb{X}_n}(x^\star)$ belongs to the orthogonal complement $\mathbb{X}_n^\perp$. Thus, following the same reasoning as in the proof of Theorem 3.3, one arrives at a bound similar to (8) but using the restricted norms (9); recall that the norm in a Hilbert space is self-dual. □

3.3. Semi-infinite results in the MDP setting. We now return to the MDP setting in Section 2, and in particular the AC problem (2), to investigate the application of the proposed approximation scheme. Recall that the AC problem (1) can be recast in an LP framework in the form of $\mathcal{P}$, see (3). To complete this transition to the dual pairs, we introduce the spaces
\[ \begin{aligned} &\mathbb{X} = \mathbb{R} \times L(S), \qquad && \mathbb{C} = \mathbb{R} \times \mathcal{M}(S), \\ &\mathbb{B} = L(K), && \mathbb{Y} = \mathcal{M}(K), \\ &\mathcal{K} = L_+(K), && \mathcal{K}^* = \mathcal{M}_+(K). \end{aligned} \tag{10} \]
The bilinear form between each pair $(\mathbb{X}, \mathbb{C})$ and $(\mathbb{B}, \mathbb{Y})$ is defined in an obvious way (cf. (3)). The linear operator $\mathcal{A}\colon \mathbb{X} \to \mathbb{B}$ is defined as $\mathcal{A}(\rho, u)(s,a) := -\rho - u(s) + Qu(s,a)$, and it can be shown to be weakly continuous [26, p. 220]. On the pair $(\mathbb{X}, \mathbb{C})$ we consider the norms
\[ \begin{aligned} \|x\| &= \|(\rho, u)\| = \max\big\{ |\rho|, \|u\|_L \big\} = \max\Big\{ |\rho|, \|u\|_\infty, \sup_{s \ne s'} \frac{|u(s) - u(s')|}{\|s - s'\|_{\ell_\infty}} \Big\}, \\ \|c\|_* &:= \sup_{\|x\| \le 1} \langle x, c\rangle = |c_1| + \sup_{\|u\|_L \le 1} \int_S u(s)\, c_2(ds) = |c_1| + \|c_2\|_W. \end{aligned} \tag{11a} \]
Recall that $\|\cdot\|_L$ is the Lipschitz norm on $L(S)$, whose dual norm $\|\cdot\|_W$ in $\mathcal{M}(S)$ is known as the Wasserstein norm [47, p. 105]. The adjoint operator $\mathcal{A}^*\colon \mathbb{Y} \to \mathbb{C}$ is given by $\mathcal{A}^* y(\cdot) := \big( -\langle \mathbf{1}, y\rangle,\ -y(\cdot \times A) + yQ(\cdot) \big)$, where $\mathbf{1}$ is the constant function in $L(S)$ with value 1.
In the second pair $(\mathbb{B}, \mathbb{Y})$, we consider the norms
\[ \begin{aligned} \|b\| &= \|b\|_L := \max\Big\{ \|b\|_\infty, \sup_{k \ne k'} \frac{|b(k) - b(k')|}{\|k - k'\|_{\ell_\infty}} \Big\}, \\ \|y\|_* &:= \sup_{\|b\|_L \le 1} \langle b, y\rangle = \|y\|_W. \end{aligned} \tag{11b} \]
A commonly used norm on the set of measures is the total variation, whose dual (variational) characterization is associated with $\|\cdot\|_\infty$ in the space of continuous functions [26, p. 2]. We note that on the positive cone $\mathcal{K}^* = \mathcal{M}_+(K)$ the total variation and Wasserstein norms indeed coincide.

Following the construction in $\mathcal{P}_n$, we consider a collection of $n$ linearly independent, normalized functions $\{u_i\}_{i \le n}$, $\|u_i\|_L = 1$, and define the semi-infinite approximation of the AC problem (2) by
\[ -J_n^{\mathrm{AC}} = \begin{cases} \inf\limits_{(\rho, \alpha) \in \mathbb{R} \times \mathbb{R}^n} & -\rho \\ \text{s.t.} & \rho + \sum_{i=1}^n \alpha_i \big( u_i(s) - Qu_i(s,a) \big) \le \psi(s,a), \quad \forall (s,a) \in K \\ & \|\alpha\|_R \le \theta_{\mathcal{P}}. \end{cases} \tag{12} \]
Comparing with the program $\mathcal{P}_n$, we note that the finite dimensional subspace $\mathbb{X}_n \subset \mathbb{R} \times L(S)$ is the subspace spanned by the basis elements $x_0 = (1, 0)$ and $x_i = (0, u_i)$ for all $i \in \{1, \cdots, n\}$, i.e., the subspace $\mathbb{X}_n$ is in fact $n+1$ dimensional. Moreover, the norm constraint in (12) is only imposed on the second coordinate of the decision variables $(\rho, \alpha)$ (i.e., $\|\alpha\|_R \le \theta_{\mathcal{P}}$). The following lemmas address the operator norm and the respective regularity requirements of Assumption 3.1 for the program (12).

Lemma 3.6 (MDP operator norm). In the AC problem (2) under Assumption 2.1(ii) with the specific norms defined in (11), the linear operator norm satisfies
\[ \|I - Q\| := \sup_{\|u\|_L \le 1} \|u - Qu\|_L \le 1 + \max\{L_Q, 1\}. \]

Proof. Using the triangle inequality it is straightforward to see that
\[ \|I - Q\| = \sup_{u \in L(S)} \frac{\|u - Qu\|_L}{\|u\|_L} \le 1 + \sup_{u \in L(S)} \frac{\|Qu\|_L}{\|u\|_L} \le 1 + \sup_{u \in L(S)} \frac{\|Qu\|_L}{\|u\|_\infty} \le 1 + \max\Big\{ L_Q,\ \sup_{u \in L(S)} \frac{\|Qu\|_\infty}{\|u\|_\infty} \Big\} \le 1 + \max\{L_Q, 1\}, \]
where the penultimate inequality is an immediate consequence of Assumption 2.1(ii), and the last follows from the fact that the operator $Q$ is a stochastic kernel; hence $|Qu(s,a)| = \big| \int_S u(y)\, Q(dy \mid s,a) \big| \le \|u\|_\infty \int_S Q(dy \mid s,a) = \|u\|_\infty$. □

Lemma 3.7 (MDP semi-infinite regularity).
Consider the AC program (2) under Assumption 2.1. Then Assumption 3.1 holds for the semi-infinite counterpart (12) for any positive $\theta_{\mathcal{P}}$ and all sufficiently large $\gamma$. In particular, the dual optimizer bound in Proposition 3.2 simplifies to $\|y_n^\star\|_W \le \theta_{\mathcal{D}} = 1$.

Proof. Since $K$ is compact, for any nonnegative $\theta_{\mathcal{P}}$ the program (12) is feasible and the optimal value is bounded; recall that $\|(Q - I)u_i\|_L \le 1 + \max\{L_Q, 1\}$ from Lemma 3.6 and $\|\psi\|_\infty < \infty$ thanks to Assumption 2.1(iii). Hence, the optimal value of (12) is bounded and, without loss of generality, one can add a redundant constraint $|\rho| \le \omega^{-1} \theta_{\mathcal{P}}$, where $\omega$ is a sufficiently small positive constant. In this view, the last constraint $\|\alpha\|_R \le \theta_{\mathcal{P}}$ may be replaced with
\[ \|(\rho, \alpha)\|_\omega := \max\big\{ \omega |\rho|, \|\alpha\|_R \big\} \le \theta_{\mathcal{P}}, \tag{13} \]
where $\|\cdot\|_\omega$ can be cast as the norm on the pair $(\rho, \alpha) \in \mathbb{R} \times \mathbb{R}^n$. Using the $\omega$-norm as defined in (13), we can now directly translate the program (12) into the semi-infinite framework of $\mathcal{P}_n$. As mentioned above, the feasibility requirement in Assumption 3.1(i) immediately holds. In addition, observe that for every $y \in \mathcal{K}^*$ we have
\[ \begin{aligned} \|\mathcal{A}_n^* y\|_{\omega^*} &= \sup_{\|(\rho,\alpha)\|_\omega \le 1} (\rho, \alpha) \cdot \big[ -\langle \mathbf{1}, y\rangle,\ \langle Qu_1 - u_1, y\rangle,\ \cdots,\ \langle Qu_n - u_n, y\rangle \big] \\ &= \sup_{\omega |\rho| \le 1} -\rho \langle \mathbf{1}, y\rangle + \sup_{\|\alpha\|_R \le 1} \alpha \cdot \big[ \langle Qu_1 - u_1, y\rangle, \cdots, \langle Qu_n - u_n, y\rangle \big] \\ &\ge \omega^{-1} \|y\|_W, \end{aligned} \]
where the last line follows from the equality $\langle \mathbf{1}, y\rangle = \|y\|_W$ for every $y$ in the positive cone $\mathcal{K}^*$, and the fact that the second term in the second line is nonnegative. Since $\omega$ can be arbitrarily close to $0$, the inf-sup requirement of Assumption 3.1(ii) holds for all sufficiently large $\gamma = \omega^{-1}$. The second assertion of the lemma follows from the bound (6) in Proposition 3.2. To show this, recall that in the MDP setting $c = (-1, 0) \in \mathbb{R} \times \mathcal{M}(S)$ (cf. (3)) with the respective vector $\mathbf{c} = [-1, 0, \cdots, 0] \in \mathbb{R} \times \mathbb{R}^n$ (cf. $\mathcal{P}_n$).
Thus, $\|\mathbf{c}\|_{\omega^*} = \sup_{\|(\rho,\alpha)\|_\omega \le 1} (\rho, \alpha) \cdot [-1, 0, \cdots, 0] = \omega^{-1}$, which helps simplify the bound (6) to
\[ \|y_n^\star\|_W \le \theta_{\mathcal{D}} := \frac{\theta_{\mathcal{P}} \|\mathbf{c}\|_{R^*} - J_n^{\mathrm{LB}}}{\gamma \theta_{\mathcal{P}} - \|b\|} = \frac{\theta_{\mathcal{P}} \omega^{-1} + \|\psi\|_\infty}{\omega^{-1} \theta_{\mathcal{P}} - \|\psi\|_L}, \]
which delivers the desired assertion when $\omega$ tends to $0$. □

Remark 3.8 (AC dual optimizer bound). As opposed to the general LP setting in Proposition 3.2, Lemma 3.7 implies that the dual optimizer bound for the AC problem is not influenced by the primal norm bound $\theta_{\mathcal{P}}$ and is uniformly bounded by 1. In fact, this result can be strengthened to $\|y_n^\star\|_W = 1$ due to the special minimax structure of the AC program (12). This refinement is not needed at this stage, and we postpone the discussion to