ebook img

Optimality of Myopic Policy for Restless Multiarmed Bandit with Imperfect Observation PDF

0.2 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Optimality of Myopic Policy for Restless Multiarmed Bandit with Imperfect Observation

1 Optimality of Myopic Policy for Restless Multiarmed Bandit with Imperfect Observation Kehao Wang Wuhan University of Technology Hubei, P.R.C. Email: [email protected] 6 1 0 Abstract—We consider the scheduling problem concerning N study some instances of the generic RMAB in which the 2 projects. Each project evolves as a multi-state Markov process. optimal policy has a simple structure. Specially, we develop Ateachtimeinstant,oneprojectisscheduledtowork,andsome n some sufficient conditions to guarantee the optimality of the reward dependingon thestateof thechosen projectisobtained. a myopicpolicy;thatis, the optimalpolicyis to access the best J Theobjectiveistodesignaschedulingpolicythatmaximizesthe expected accumulateddiscountedrewardover afiniteorinfinite channelseach time in the sense of monotoniclikelihood ratio 1 horizon. The considered problem can be cast into a restless order. 3 multi-armed bandit (RMAB) problem that is of fundamental IntheclassicRMABproblem,aplayerchoosesM outofN ] importance in decision theory. It is well-known that solving the arms, each evolvingas a Markovchain, to activate each time, C RMABproblemisPSPACE-hard,withtheoptimalpolicyusually andreceivesarewarddeterminedbythestatesoftheactivated intractable due to the exponential computation complexity. A O naturalalternativeistoconsidertheeasilyimplementablemyopic arms. The objective is to maximize the long-run reward over . policy that maximizes the immediate reward. In this paper, we aninfinitehorizonbychoosingwhichN armstoactivateeach h perform an analytical study on the considered RMAB problem, time.Ifonlytheactivatedarmschangetheirstates,theproblem t a and establish a set of closed-form conditions to guarantee the is degeneratedto the multi-armedbandit (MAB) problem[5]. m optimality of the myopic policy. The MAB problem is solved by Gittins by showing that the [ Index Terms—Restless bandit, myopic policy, optimality, optimal policy has an index structure [5, 6]. 1 stochastic order, scheduling There exist two major thrusts in the research of the RMAB v problem. Since the optimality of myopic policy is not gen- 5 I. INTRODUCTION erally guaranteed, the first research thrust is to analyze the 9 Consider a scheduling system composed of N independent performance difference between optimal policy and approx- 1 projects each of which is models as a X-state Markov chain imation policy [7–9]. Specifically, a simple myopic policy, 0 0 with known matrix of transition probabilities. At each time also called greedy policy, is developed in [7] which yields . period one project is scheduled to work and a reward de- a factor 2 approximation of the optimal policy for a subclass 2 pending on the states of the worked project is obtained. The of scenariosreferred to as Monotone MAB. The second thrust 0 6 objective is to design a scheduling policy that maximizing istoestablish sufficientconditionsto guaranteethe optimality 1 theexpectedaccumulateddiscountedreward(respectively,the of the myopic policy in some specific instances of restless : expected accumulated reward) collected over a finite (respec- bandit scenarios, particularly in the context of opportunistic v i tively, infinite) time horizon. Mathematically, the considered communications [10–14, 17–19]. X channel access problem can be cast into the restless multi- For the case of two-state, Zhao et al. [10] established r armedbandit(RMAB) problemof fundamentalimportancein the structure of the myopic policy, and partly obtained the a decisiontheory[1].RMABproblemsariseinmanyareas,such optimality for the case of i.i.d. channels. Then Ahmad and as wired and wireless communicationsystems, manufacturing Liu et al. [15] derived the optimality of the myopic sensing systems,economicsystems,statistics,biomedicalengineering, policyforthepositivelycorrelatedi.i.d.channelsforaccessing and information systems etc. [1, 2]. However, the RMAB one channel (i.e., k = 1) each time, and further extended problem is proved to be PSPACE-Hard [3]. the optimality to access multiple i.i.d. channels (k > 1) [12]. The considered problem can also be formulated as a From another point, in [14], we extended i.i.d. channels [15] multi-state Partially Observed Markov Decision Process to non i.i.d. ones, and focused on a class of so-called regu- (POMDP) [4]. The challenges of multistate POMDPs are lar functions, and derived closed-form sufficient conditions twofold: First, the probability vector is not completely or- to guarantee the optimality of myopic sensing policy. The dered in the probability space, making the structural analysis authors[17]studiedthemyopicchannelprobingpolicyforthe substantially more difficult; Second, multistate POMDPs tend similar scenario proposed, but only established its optimality to encounter the “curse of dimensionality”, which is further in the particular case of probing one channel (M = 1) complicated by the uncountably infinite probability space. each time. In our previous work [18], we established the Hence, numerical methods are adopted popularly. However, optimality of myopic policy for the case of probing N 1 − the numerical approach does not provide any meaningful of N channels each time and analyzed the performance of insightintooptimalpolicy.Moreover,thisnumericalapproach the myopic probing policy by domination theory, and further has huge computational complexity. For the two reasons, we in [19] studied the generic case of arbitrary M and derived 2 more strong conditions on the optimality by dropping one of II. PROBLEM FORMULATION the non-trivial conditions of [17]. Consider N independent projects n = 1, ,N. Assume For the complicated case of multi-state, the authors in [16] ··· each project n has a finite number, X, of states, denoted established the sufficient conditions for the optimality of as . Let s(n) denote the state of project n at discrete myopic sensing policy in multi-state homogeneous channels X t time t = 1,2, . At each time instant t, only one of with a set of non-trivial assumptions. ··· these projects can be worked on. If project n is worked on at time t, an instantaneous reward βtR(s(n),n) is accrued t A. Contribution of the Paper (R(s(n),n) is assumed finite). Here, 0 β 1 denotes the discotuntfactor;the state s(n) evolvesac≤cordi≤ngto an X-state The main results of this paperare the optimalityconditions t homogeneousMarkovchainwithtransitionprobabilitymatrix forexpectedaccumulateddiscountedrewardinTheorem1and A=(a ) , where, Theorem2forimperfectobservation,whichmakesitdifferent ij i,j ∈X from the most relevant paper [16] with perfect observation. a =P(s(n) =j s(n) =i) if project n is worked on at t. The major difficulties encountered in optimizing the rewards ij t+1 | t in multi-state channel are: 1) how to obtain a non-trivial All projects are initialized with s(n) x(n), where x(n) are 0 ∼ 0 0 upper bound for multiple different stochastic matrices under specified initial distributions for n=1, ,N. ··· multivariatereward(correspondingtomulti-state)case;2)how The state of the active project n is indirectly observed via to determine the stochastic order of belief vectors; 3) identify noisy measurements (observations) y(n) of the active project t+1 the number of branches in the decision tree determined by a state s(n). Assume that these observations y(n) belong to a t+1 t+1 specific policy corresponding to the auxiliary value function finiteset indexedbym=1, , . LetB =(b ) im i ,j definedinthispaper.Theseissuesareresolvedby1)assuming Y ··· Y ∈X ∈Y denotethe observationprobabilitymatrixofthe HMM,where thateachtransmissionmatrixhasanon-trivialeigenvaluewith each element b ,P(y(n) =my(n) =i,u =n). X 1times,underwhichthefirst-orderstochasticdominance im t+1 | t t − Let ut 1, ,N denote which project is worked on is preserved and meanwhile, the upper bound of each matrix at time t.∈Co{nse·q·u·ently}, s(ut) denotes the state of the active is characterized by the eigenvalue; 2) assuming that there t+1 project at time t+1. Denote the observation history at time existsadeterminedstochasticdominanceorderoftransmission t as Y = (y(u0), ,y(ut−1)) and let U = (u , ,u ). matrices at any time instance; 3) consideringthe performance t 1 ··· t t 0 ··· t Then the project at time t+1 is chosen according to u = difference of two specific policies which differ in only one t+1 µ(Y ,U ), where the policy denoted as µ belongs to the element of belief vectors; that is, the two policies have the t+1 t class of stationary policies . The total expected discounted form of difference, mathematically. Further, we obtain the U reward over an infinite-time horizon is given by number of branches needed to be fix their bounds. In this paper, we considered the problem of indirect obser- J =E ∞ βtR(s(ut),u ) , u =µ(Y ,U ), (1) vation of project states which makes our scheduling problem µ t t t t t 1 − is different from [16] to a large extent. In particular, the hXt=0 i contributions of this paper include: where E denotes mathematical expectation. The aim is to determine the optimal stationary policy µ =argmax J , The structure of the myopic policy is shown to be a ∗ µ µ • which yields the maximum rewards in (1). ∈U simple queue determined by the information states of projects provided that certain conditions are satisfied for the transition matrix of multi-state projects. A. Information state We establish a set of conditionsunderwhich the myopic The above partially observed multiarmed bandit problem • policy is proved to be optimal. can be re-expressed as a fully observed multiarmed bandit in Ourderivationdemonstratestheadvantageofbranch-and- terms of the information state. For each project n, denoted • bound and the directed comparison based optimization by x(n) the information state at time t (Bayesian posterior t approach. The results of this paper are a generic contri- distributionofs(n))asx(n) =(x(n)(i)) i=1, ,X,where butiontothestateoftheartofthetheoryofrestlessbandit x(n)(i),P(s(nt) =iY ,tU ).TtheHMMmul·ti·a·rmedbandit problems, althoughthe structure of the optimal policy of t t | t t−1 problem can be viewed as the following scheduling problem: generic restless bandit is not known. ConsiderN parallelHMMstateestimationfilters,oneforeach project.Theprojectnisactive,anobservationy(n) isobtained t+1 B. Organization and the informationstate x(n) is computedrecursivelybythe t+1 The rest of the paper is organizedas follows. In Section II, HMM state filter according to we present the system model and the formulation of the x(n) =T(x(n),y(n)), if project n is worked on at time t, optimization problem. In Section III, we construct a set of t+1 t t+1 conditions to guarantee the optimality of myopic policy by where deriving some properties of transmission matrix and some B(y(n))Ax(n) bounds of serval pairs of policies. In Section IV, the opti- T(x(n),y(n)), ′ , (2) d(x(n),y(n)) mality results are extended to two different cases. Finally, we d(x(n),y(n)),1 B(y(n))Ax(n). conclude in Section V. ′X ′ 3 In (2), if y(n) =m, then B(m)=diag[b , ,b ] is the Definition 1 (MLR ordering, [20]). Let x , x Π(X) be 1m Xm 1 2 ··· ∈ diagonalmatrixformedbythe mthcolumnoftheobservation anytwobeliefvectors.Thenx isgreaterthanx withrespect 1 2 matrix B, A is the xth row of the matrix A, and 1 is an to the MLR ordering—denoted as x x , if x X 1 r 2 ≥ X-dimensional column vector of ones. x (i)x (j) x (i)x (j), i>j, i,j 1,2, ,X . ThestateestimationoftheotherN 1projectsisaccording 1 2 ≤ 2 1 ∈{ ··· } − to Definition 2 (first order stochastic dominance, [20]). Let x(t+n)1 =A′x(tn), (3) x1, x2 ∈ Π(X), then x1 first order stochastically domi- nates x —denoted as x x , if the following exists for 2 1 s 2 ≥ ifprojectlisnotworkedonattimet,l 1, ,N , l =n. j =1,2, ,X, ∈{ ··· } 6 ··· Let Π(X) denote the state space of information states X X x(n), n 1,2, ,N , which is a X 1-dimensional x (i) x (i). ∈ { ··· } − 1 2 simplex: ≥ i=j i=j X X Π(X)= x RX :1 x=1,0 x(i) 1 for all i . Some useful results [20] are stated here: ∈ ′X ≤ ≤ ∈X n o Proposition 1 ([20]). Let w1, w2 Π(X), the following The process x(n), n = 1, ,N, qualifies as an informa- holds ∈ t ··· tion state since choosing ut+1 = µ(Yt+1,Ut) is equivalent 1) w1 rw2 implies w1 sw2. to choosing ut+1 =µ(x(t+1)1,··· ,x(t+N1)). Using the smoothing 2) Let≥V denote the set o≥f all X dimensionalvectors v with property of conditional expectations, the reward function (1) nondecreasing components, i.e., v v v . 1 2 X can be rewritten in terms of the information state as Then w w iff for all v , v w≤ v≤w·.·· ≤ 1 s 2 ′ 1 ′ 2 ≥ ∈V ≥ Jµ =E ∞ βtR′(ut)xt(ut) , ut =µ(x(t1),··· ,x(tN)), (Duˆe0fi,nuˆi1t,ion ,3uˆT()Myisopthice Ppoolliiccyy).thTathesemleycotspitchepobleisctypuˆroje:=ct hXt=0 i (in the···sense of MLR) at each time. That is, if where R(u ) denotes the X dimensional reward column x(σ1) x(σN), then the myopic policy at t is vector[R(′s(nt) = 1,u ), ,R(s(n) = X,u )]. The aim is to t ≥s···≥s t compute thte optimaltpo·li·c·y argmtaxµ Jµ.t uˆt =µt(x(t1),··· ,x(tN))=σ1. To get more insight on the struct∈urUe of the optimization III. OPTIMALITY problem formulated in (4), we derive its dynamic program- ming formulation as follows: To analyze the performance of the myopic policy, we first introducean auxiliaryvalue functionand then provea critical VT(x(T1:N))=maxuT E R′(uT)x(TuT) , featureoftheauxiliaryvaluefunction.Next,wegivea simple Vt(x(t1:N))=maxutE R(cid:2) ′(ut)xt(ut) (cid:3) assumption about transmission matrix, and show its special  +β m d(xt(ut),mh )Vt+1(x(t+1:1ut−1),xt(+ut1),m,x(t+ut1+1:N)) ,sptoolcichiaesst,icwoerdgeert.sFoimnaelliym, pboyrtdaenrtivbionugntdhse, wbohuicnhdsseorfvedsifafesrtehnet ∈Y  P (4)i basis to prove the optimality of the myopic policy. where, x(i:j) , x(i),x(i+1), ,x(j) , and t t t ··· t A. Value Function and its Properties (cid:0) x(ut) =T(x(ut),m(cid:1)) First, we define the auxiliary value function (AVF) as t+1,m t (5) follows: tBh.eTMahbeyooovrpeeitcdicyPanollalyim,c(ytihcxep(t+nroo)p1gti=rmamaAlm′pxio(tnnlgi)c.,yItcniasn6=inbfueetao.sbibtalein,ehdowbyevseorl,vdinuge WWTτ+uuˆˆ((βxxm|(T(τX11∈::NNY))d))(==x(τRRuˆτ′′)((,uuˆmTτ)))xxWT((TuˆuτuˆτT+)̥)1,((xx(ττ(1{1:+N:zuˆ1)τ,u−ˆτ1)),x(τuˆ+τ1),m,x(τuˆ+τ1+1:N)}), t+1≤τ ≤T nmtironeacitzftuuhairrncesagtilviomteahblpetteeaaqiricmnnutiaamnottigifevodetnthhisaeeitsiecosturpoecrtwriomsemaenaretpdlkuaswtcoaathltiuiosioltineimnoaionpglnllnyedotihrprmeierncyogtfohluypittbhuifcierrteioivpmmeroe.plwitacHhcayeetrdnaom,cbfeaaot,xnhvidaee- Wt+u(βx|m(tX1∈:NY)d)(=xt(Rut′)(,umt))xWt(utuˆt+)̥1((xx(tt(1{1+:Nz:1u)t,−ut1)),xt(+ut1),m,x(tu+t1+1:N)}), (7) currentaction on the future reward, which is easy to compute and implement, formally defined as follows: Remark. AVF is the reward under the policy: at slot t, u is t uˆ(t)=argnmaxR′x(tn). (6) aaddoopptteedd,. while after t, myopic policy uˆτ (t+1 ≤ τ ≤ T) is For the purpose of tractable analysis, we introduce some Let e be an X-dimensional column vector with 1 in the i partial orders used in the following sections. i-thelementand0inothers,andE betheX X unitmatrix. × 4 Lemma 1. Wu(x(1:N)) is decomposable for all t = Proposition 3. Let x ,x Π(X) and t t 1 2 ∈ 0,1, ,T, i.e., (A ) x x (A ), then T(x ,K) T(x ,K). 1 ′ r 1 r 2 r X ′ 1 r 2 ··· ≤ ≤ ≤ ≤ Wtu(x(t1:n−1),x(tn),x(tn+1:N)) Proposition3statestheincreasingmonotonicityofupdating X rule with information state for scheduled project. = xt(n)(i)Wtu(x(t1:n−1),ei,x(tn+1:N)) Proposition 4. Let x Π(X) and A x A , then 1 r r X i=1 ∈ ≤ ≤ X T(x,k) T(x,m) for any 1 k m Y. X ≤r ≤ ≤ ≤ = e′ix(tn)Wtu(x(t1:n−1),ei,x(tn+1:N)) Proposition4statestheincreasingmonotonicityofupdating i=1 rule with the increasing number of observation state for X Proof: Please refer to Appendix A. scheduled project. Proposition 5. Under Assumption 1, we have either B. Assumptions x(l) x(n) or x(n) x(l) for all l,n 1,2, ,N t ≤s t t ≤s t ∈ { ··· } We make the following assumptions/conditions. for all t. Assumption 1. Assume that Proposition 5 states that under Assumption 1, the informa- 1) A A A . tion states of all projects can be ordered stochastically at all 1 r 2 r r X ≤ ≤ ··· ≤ 2) B(1) B(2) B(Y). times. r r r ≤ ≤ ··· ≤ 3) There exists some K (2 K Y) such that Now we give an important structural property on transition ≤ ≤ matrix in the following proposition. T(Ae ,K) (A)2e , ′ 1 r ′ 1 ≥ T(Ae ,K 1) (A)2e . Proposition 6. Suppose that transition matrix A has X ′ X r ′ 1 − ≤ eigenvalues λ λ λ and the corresponding or- 1 2 X 54)) RA1′(e≤i+r1x(01e)i)≤r xR(0′2Q) ′≤(eri+·1·· ≤eir)(x1(0N)i≤r XAX.1),where tthheongownealheaivgeenv≥ectors≥ar·e··V≥1,V2,··· ,VX. If x1,x2 ∈ Π(X), − ≥ − ≤ ≤ − A=VΛV 1, Q=VΥV 1, − − λ =1 and V = 1 1 ; • 1 1 √X X 1 0 ... 0 for any λ, 0 λ ... 0 • 2 Λ= ... ... ... ... , Λ1V1′(x1−x2)=Λ2V1′(x1−x2), (8)    0 0 ... λ   X  where  1 0 ... 0 0 βλ2 ... 0 λ1 0 ... 0 Υ= ... 1−β...λ2 ... ... . Λ1 = 0.. λ..2 ..... 0.. ,   . . . .  0 0 ... βλX     1−βλX   0 0 ... λ4  Remark. Assumption 1.1 ensures that the higher the quality  λ 0 ... 0  of the channel’s current state the higher is the likelihood that 0 λ ... 0 2 tahloenngexwticthha1n.n1e-l1s.2tateenwsuirllebtehaotfthhieghinqfuoarmlitayt.ioAnsssutmatepstioonf1a.l3l Λ2 = ... ... ... ... .   projects can be ordered at all times in the sense of stochastic  0 0 ... λ4  order. Assumption 1.4 states that initially the channels can be   Proposition 6 states that 1) for any transition matrix, the ordered in terms of their quality. Assumption 1.5 states that largest eigenvalue is 1, named as trivial eigenvalue, and the instantaneous rewards obtained at different states of the its corresponding eigenvector is 1 1 , named as trivial channel are sufficiently separated. √X X eigenvector;2)foranytwoinformationstates,x ,x Π(X), 1 2 ∈ one special equationholds where the largesteigenvalue1 can C. Properties be replacing by any value. UnderAssumption1.1-1.5,we havesomeimportantpropo- Proposition 7. Given x ,x Π(X), we have sitions concerning the structure of information state in the 1 2 ∈ following, which are proved in Appendix B. ∞ R (βA)i(x x )=RQ(x x )=R(VΥV 1)(x x ). Proposition 2. Let x ,x Π(X) and x x , then ′ ′ 1 2 ′ ′ 1 2 ′ − ′ 1 2 1 2 1 r 2 − − − (A1)′ r A′x1 r A′x2 r∈(AX)′. ≤ Xi=1 ≤ ≤ ≤ Proposition 2 states that if at any time t the information Proposition 7 states that the accumulated reward difference states of two channels are stochastically ordered and none of betweentwo differentstate informationvectorscan be simply these channels is chosen at t, then the same stochastic order written as a matrix form. between the information states at time t+1 is maintained. Proposition8. R(e e ) RQ(e e )(1 j <i X). ′ i j ′ ′ i j − ≥ − ≤ ≤ 5 X D. Analysis of Optimality (l) (l) (l) (xˇ (i) x (i))(e e )+x (j)(e e ) We first give some bounds of performance difference on t − t j − j−1 t j − 1 serval pairs of policies, and then derive the main theorem on hXi=j i X X the optimality of myopic policy. = (xˇ(l)(i) x(l)(i))R(E Q)(e e ) t − t ′ − ′ j − j−1 Lemma 2. Under Assumption 1, xlt = (xt(−l),x(tl)), xˇlt = Xj=2hXi=j (xt(−l),xˇt(l)), xt(l) ≤r xˇ(tl), we have for 1≤t≤T +x(tl)(j)R′(E−Q′)(ej −e1) (C1) if u′t =ut =l, X X i R′(xˇt(l)−x(tl))≤WT tut′(xˇlt)−Wtu(xlt) =Xj=2hXi=j(xˇ(tl)(i)−x(tl)(i))[R′(ej −ej−1)−R′Q′(ej −ej−1)] ≤ − βiR′(A′)i(xˇ(tl)−x(tl)); +x(tl)(j)[R′(ej −e1)−R′Q′(ej −e1)] Xi=0 (b) i 0, (C2) if u′t 6=l, ut 6=l, and u′t =ut, ≥ T t where,theequality(a)isfromProposition7,andtheinequality 0≤Wtu′(xˇlt)−Wtu(xlt)≤ − βiR′(A′)i(xˇ(tl)−x(tl)); (b) is from Proposition 8, and Xi=j(xˇ(tl)(i)−xt(l)(i))≥0 is Xi=1 due to xˇ(l) x(l) from Proposition 1. (C3) if u =l and u =l, t ≥s t P ′t t 6 Remark. Lemma 3 states that scheduling the project with T t 0 Wu′(xˇl) Wu(xl) − βiR(A)i(xˇ(l) x(l)). better information state would bring more reward. ≤ t t − t t ≤ ′ ′ t − t i=0 Based on Lemma 3, we have the following theorem which X Proof: Please refer to Appendix C. states the optimal condition of the myopic policy. Remark. We would like to emphasize on what conditions Theorem 1. Under Assumption 1, the myopic policy is opti- the bounds of Lemma 2 are achieved. For (C1), the lower mal. bound is achieved when project l is scheduled at slot t but Proof:WhenT 9 ,weprovethetheorembybackward neverscheduledaftert;theupperboundisachievedwhenl is ∞ induction. The theorem holds trivially for T. Assume that it scheduled from t to T. For (C2), the lower boundis achieved holds for T 1, ,t+1, i.e., the optimal accessing policy when project l is never scheduled from t; the upper bound − ··· is to access the best channels (in the sense of stochastic is achieved when l is scheduled from t+1 to T. For (C3), dominance in terms of ) from time slot t+1 to T. We now thelowerboundisachievedwhenprojectl isneverscheduled show that it holds for t. Suppose, by contradiction,that given fromt;the upperboundisachievedwhenl is scheduledfrom x, x(i1), ,x(iN) and x(1) >s >s x(N), the optimal t to T. { ··· } ··· policyistochoosethebestfromtimeslott+1toT,andthus, Lemma 3. UnderAssumption1, we havethen Wtl(x(t1:N))> atslott, to chooseµt =i1 6=1=µˆt, giventhatthe latter,µˆt, Wn(x(1:N)) if x(l) > x(n). istochoosethebestprojectin thesense ofstochasticorderat t t t r t slott. Theremustexistin atslott suchthatx(in) >s x(i1). It Proof: By Lemma 2, we have thenfollowsfromLemma3thatWin(x(1:N))>Wi1(x(1:N)), t t t t Wl(x(1:N)) Wn(x(1:N)) which contradicts with the assumption that the latter is the t t − t t optimal policy. This contradiction completes our proof for T. =[Wtl(xt(−l),xt(l))−Wtl(xt(−l),x(tn))] When T , the proof is finished. →∞ −[Wtl(xt(−l),x(tn))−Wtn(xt(−n),x(tn))] =[Wtl(xt(−l),xt(l))−Wtl(xt(−l),x(tn))] E. Discussion −[Wtn(xt(−l),x(tn))−Wtn(xt(−n),x(tn))] 1) Comparison: In [16], the authors considered the prob- T t lem of scheduling multiple channels with direct or perfect ≥R′(xˇt(l)−xt(l))− − βiR′(A′)i(xˇ(tl)−x(tl)) observation, and then the method is based on the information i=1 states of all channels in the sense of first order stochastic X T t dominance order; that is, the critical property is to keep =R′ E− − (βA′)i (xˇ(tl)−x(tl)) the information states completely ordered or separated in the (cid:16) Xi=1 (cid:17) sense of first order stochastic dominance order. However, in R E ∞ (βA)i (xˇ(l) x(l)) the case of indirect or imperfect observation, an observation ≥ ′ − ′ t − t matrix is introduced to replace the unit matrix E for the (cid:16) Xi=1 (cid:17) direct observation considered in [16]. Hence, the stochastic (=a)R′ E−VΥV−1 (xˇ(tl)−x(tl)) dominance order is not sufficient to characterize the order of (cid:16) X (cid:17) information states, and then the monotonic likelihood ratio =R(E Q) order, a kind of more stronger stochastic order, is used to ′ ′ − describe the order structure of information states. j=2 X 6 The Assumption 1.5 is differentfrom the Assumption (A4) Lemma 4. Under Assumption 2, xlt = (xt(−l),xt(l)), xˇlt = of [16]. (xt(−l),xˇ(tl)), x(tl) ≤r xˇ(tl), we have for 1≤t≤T 2) Bounds: The bounds in (C1)-(C3) are not enough tight (D1) if u =u =l, todropthenon-trivialAssumption1.5.Actually,weconjecture ′t t the optimality of myopic policy is kept even without the T−t Aadsosputmedptiionnt1h.i5s.Hpaopweerv,ewr,educeantonothteocbotnasintrabinetttoefrthbeoumnedtshotod R′ E−⌈ 2 ⌉(βA′)2i−1 (xˇ(tl)−x(tl)) dropthe non-trivialAssumption1.5.Therefore,oneoffurther (cid:16)Wu′(xˇXil=)1 Wu(xl) (cid:17) directionsistoobtaintheoptimalityofmyopicpolicywithout ≤ t t − t t T−t Assumption 1.5 by some new methods. ⌊ 2 ⌋ R E+ (βA)2i (xˇ(l) x(l)); ≤ ′ ′ t − t IV. OPTIMALITY EXTENSION (cid:16) Xi=1 (cid:17) In this section, we first extend the obtained optimality (D2) if u =l, u =l, and u =u , ′t 6 t 6 ′t t results to the case in which the transition matrix is totally T−t negativeorder,asacomplementarytothetotallypositiveorder ⌈ 2 ⌉ discussed in the previous section, which means that those −R′ (βA′)2i−1(xˇ(tl)−x(tl)) relative propositions are stated here by replacing increasing i=1 X monotonicitywithdeceasingmonotonicity.Second,weextend Wu′(xˇl) Wu(xl) ≤ t t − t t the optimality to the case of scheduling multiple projects T−t simultaneously. ≤R′⌊ 2 ⌋(βA′)2i(xˇ(tl)−x(tl)); i=1 X A. Assumptions (D3) if u =l and u =l, Some important assumptions are stated in the following. ′t t 6 T−t Assumption 2. Assume that ⌈ 2 ⌉ R (βA)2i 1(xˇ(l) x(l)) 1) A1 r A2 r r AX. − ′ ′ − t − t ≥ ≥ ··· ≥ i=1 2) B(1) ≤r B(2)≤r ··· ≤r B(Y). WuX′(xˇl) Wu(xl) 3) There exists some K (2 K Y) such that ≤ t t − t t ≤ ≤ T−t T(A′eX,K) r (A′)2eX, R E+⌊ 2 ⌋(βA)2i (xˇ(l) x(l)). T(A′e1,K−≤1) ≥r (A′)2eX. ≤ ′(cid:16) Xi=1 ′ (cid:17) t − t 4) A x(1) x(2) x(N) A . Remark. (D1) achieves its lower bound when l is chosen at 1 ≥r 0 ≥r 0 ≥r ··· ≥r 0 ≥r X 5) R(e e ) RQ(e e )(1 i X 1),where slott,t+1,t+3, ,andachievestheupperboundwhenl is ′ i+1− i ≥ ′ ′ i+1− i ≤ ≤ − ··· A=VΛV 1, Q=VΥV 1. chosen from t,t+2,t+4, . (D2) achievesits lower bound − − ··· when l is chosen at slot t+1,t+3, , and upper bounds Remark. Assumption 2 differs from Assumption 1 in three ··· when l is chosen at t+2,t+4, . (D3) achieves its lower aspects, i.e., 2.1, 2.3, 2.4, which reflects the inverse TP2 ··· bound when l is chosen at slot t+1,t+3, , and upper order [20] in matrix A. ··· bounds when l is chosen from t,t+2,t+4, . ··· Based on Lemma 3 and 4, we have the following theorem. B. Optimality Under Assumption 2, we have the following propositions Theorem 2. Under Assumption 2, the myopic policy is opti- similar to Proposition 2—Proposition 5. mal. Proposition 9. Let x ,x Π(X) and x x , then 1 2 1 r 2 ∈ ≤ (A1)′ r A′x1 r A′x2 r (AX)′. C. Extension of Scheduling Multiple Projects Simultaneously ≥ ≥ ≥ Proposition 10. Let x1,x2 Π(X) and It is necessary to point out that the method adopted and ∈ (A1)′ r x1 r x2 r (AX)′, then T(x1,K) r T(x2,K). the bounds obtained in this paper can be trivially extended ≥ ≥ ≥ ≤ to the case of scheduling multiple projects simultaneously. Proposition 11. Let x Π(X) and (A ) x (A ), 1 ′ r r X ′ ∈ ≥ ≥ In this case, the bounds in Lemmas 2 and 4 still hold with- then T(x,k) T(x,m) for any 1 k m Y. r ≥ ≤ ≤ ≤ out modifying any assumptions. This is because scheduling Proposition 12. Under Assumption 2, we have either multiple projects simultaneously can be easily regarded as x(tl) ≤s x(tn) or xt(n) ≤s x(tl) for all l,n ∈ {1,2,··· ,N} scheduling multiple projects one by one at each slot, while for all t. those non-scheduled projects remain their states. Therefore, theoptimalityofschedulingoneprojectateachslotguarantees Following the similar derivation of Lemma 2, we have the the optimality of scheduling multiple projects simultaneously following important bounds. under Assumption 1 or 2. 7 V. CONCLUSION Now, we have RHS and LHS of (11) as follows In this paper, we have investigated the problem of schedul- X ing multi-state projects. In general, the problem can be for- d(x(n),m) e x(n) t ′j t+1,m mulated as a partially observable Markov decision process or m j=1 restlessmulti-armedbandit,whichisprovedtobePspace-hard. X∈Y X X B(m)Ax(n) Iton tghuiasrpanapteeer,twheeohpatviemdaelirtiyveodfathseetmoyfocpliocsepdolfiocrym(sccohneddiutiloinngs = d(xt(n),m) e′j d(x(n),′mt) m j=1 t the best project) in the sense of monotonic likelihood ratio X∈Y X X order. Due to the generic RMAB formulationof the problem, = e B(m)Ax(n). (12) ′j ′ t the derived results and the analysis methodology proposed in m j=1 X∈YX this paper can be applicable in a wide range of domains. X X APPENDIX A x(n)(i) d(e ,m) e T(e ,m) t i ′j i PROOF OF LEMMA1 i=1 m j=1 X X∈Y X For Slot T, it trivially holds. Suppose it holds for T X X 1, ,t+2,t+1, we prove it holds for slot t. − = x(n)(i) d(e ,m) e B(m)A′ei ··· t i ′j d(e ,m) At slot t, we prove it by two cases in the following. i=1 m j=1 i X X∈Y X Case 1: ut =n, X X = x(n)(i) e B(m)Ae Wtu(x(t1:n−1),xt(n),x(tn+1:N)) i=1 t m j=1 ′j ′ i =R′(n)x(tn)+βmX∈Yd(x(tn),m)Wtuˆ+1(x(t+1:1n−1),x(t+n)1,m,x(t+n+11:N)) =X X e′jBX(∈mY)XA′ X xt(n)(i)ei (=a)R′(n)x(tn) mX∈YXjX=1 Xi=1 X = e B(m)Ax(n). (13) +β d(x(tn),m) e′jx(t+n)1,mWtuˆ+1(x(t+1:1n−1),ej,x(t+n+11:N)), m j=1 ′j ′ t m j=1 X∈YX X∈Y X (9) Combing (12) and (13), we have (11), and further, prove the lemma. where the equality (a) is due to the induction hypothesis. Case 2: u = n, without loss of generality, assuming u t t 6 ≥ n+1, X x(tn)(i)Wtu(x(t1:n−1),ei,x(tn+1:N)) Wtu(x(t1:n−1),xt(n),x(tn+1:N)) i=1 XX =R′(ut)xt(ut)+β d(xt(ut),m) = x(tn)(i) R′(n)x(tn) mX∈Y Xi=1 h Wtuˆ+1(x(t+1:1ut−1),xt(+ut1),m,x(t+ut1+1:N)) +βmX∈Yd(ei,m)Wtuˆ+1(x(t+1:1n−1),T(ei,m),x(t+n+11:N))i (=a)R′(ut)xt(ut)+β d(xt(ut),m) X xt(+n)1(i) X m i=1 (=b)R′(n)x(tn)+β x(tn)(i) d(ei,m) Wtuˆ+1(x(t+1:1n−1),eiX,∈xY(t+n+11:ut−1),xt(X+ut1),m,x(t+ut1+1:N)), (14) i=1 m X X∈Y ×Wtuˆ+1(x(t+1:1n−1),T(ei,m),x(t+n+11:N)) where, the equality (a) is due to the induction hypothesis. X X (=c)R′(n)x(tn)+β x(tn)(i) d(ei,m) e′jT(ei,m) X xt(n)(i)Wtu(x(t1:n−1),ei,x(tn+1:N)) i=1 m j=1 X X∈Y X Xi=1 ×Wtuˆ+1(x(t+1:1n−1),ej,x(t+n+11:N)), (10) = X x(n)(i) R(u )x(ut)+β d(x(ut),m) t ′ t t t where,theequality(b)isfrom X x(n)(i)=1,andequality i=1 t Xi=1 h mX∈Y (c)Tios dpureovteo itnhdeuctthieonlhemypmoath,ePsitis.is sufficient to prove the Wtuˆ+1(x(t+1:1n−1),ei,x(t+n+11:ut−1),xt(+ut1),m,x(t+ut1+1:N)) following equation X i (=b)R(u )x(ut)+β x(n)(i) d(x(ut),m) X ′ t t t t m d(x(tn),m)j=1e′jx(t+n)1,m Wtuˆ+1(x(t+1:1n−1),eXii=,x1(t+n+11:ut−mX1∈),Yxt(+ut1),m,x(t+ut1+1:N)), (15) X∈Y X X X = xt(n)(i) d(ei,m) e′jT(ei,m). (11) where, the equality (b) is from Xi=1xt(n)(i)=1. Combining (14) and (15), we prove the lemma. i=1 m j=1 X X∈Y X P 8 APPENDIX B D. Proof of Proposition 5 PROOF OF PROPOSITIONS 2–8 Let φ(z) = B(K)z where z Π(X) and 1′XB(K)z ∈ (A ) z (A ). We first show that φ(z ) A. Proof of Proposition 2 1 ′ ≤r ≤r X ′ 1 − z φ(z ) z for z z . Suppose i>j, we have 1 r 2 2 2 r 1 ≤ − ≥ Suppose i>j, we have (φ(z ) z ) (φ(z ) z ) (φ(z ) z ) (φ(z ) z ) 1 1 i 2 2 j 1 1 j 2 2 i − · − − − · − b z (i) b z (j) (e′iAX′x2)·(e′jA′xX1)−(e′jA′x2)·X(e′iA′x1) X =(cid:16) Xl=i1Kbl1Kz1(l) −z1(i)(cid:17)(cid:16) Xl=j1Kbl2Kz2(l) −z2(j)(cid:17) = a x (k) a x (l) a x (k) a x (l) b z (j) b z (i) ki 2 lj 1 − kj 2 li 1 P jK 1 z (j) P iK 2 z (i) Xk=1X X Xl=1 X X Xk=1 Xl=1 −(cid:16) Xl=1blKz1(l) − 1 (cid:17)(cid:16) Xl=1blKz2(l) − 2 (cid:17) =(z (i)z (j) z (j)z (i)) = akialj akjali x2(k)x1(l) P1 2 − 1 2 P − b b (cid:16)kX=1Xl=1 kX=1Xl=1 (cid:17) iK 1 jK 1 0, X X X X × X b z (l) − X b z (l) − ≤ = (a a a a ) (a a a a ) (cid:16) l=1 lK 1 (cid:17)(cid:16) l=1 lK 2 (cid:17) ki lj li kj li kj ki lj (cid:16)Xl=1Xk=l − −kX=1Xl=k − (cid:17) wwehehreaPv,ez1φ((iz)z2)(j)z−z1(jφ)(zz2P()i) ≤z0fiosrfzrom z2z≥.r z1. Thus, x2(k)x1(l) 1 − 1 ≤r 2 − 2 2 ≥r 1 × According to Assumption 1.3, we have φ(z) z = X X B(K)z z 0 for any z (A)2e ; that is, T(−x,K) = (akialj −aliakj)(x2(k)x1(l)−x2(l)x1(k))≥0, A1′XxB(K)z0−for≥arny x Ae ≥. Cromb′inin1g Proposition 4, w−e Xl=1Xk=l T(′x,≥k)r Ax 0 f≥orrk ′ 1K and any x Ae . ′ r r ′ 1 − ≥ ≥ ≥ where, the last inequality is due to A A (k l) and According to Assumption 1.3 and Proposition 4, we k r l ≥ ≥ x x . T(x,k) Ae for k K 1 and any x Ae . 2 r 1 s ′ 1 r ′ 1 ≥ ≤ ≤ − ≥ Thenwehave(A ) =Ae Ax Ax Ae = Thus, we have the proposition. 1 ′ ′ 1 r ′ 1 r ′ 2 r ′ X ≤ ≤ ≤ (A ) considering e x x e . X ′ 1 r 1 r 2 r X ≤ ≤ ≤ E. Proof of Proposition 6 (1)Forthepropertyofλ =1andV = 1 1 ,itiseasily B. Proof of Proposition 3 1 1 √X X verified, i.e., z2.AScucpoprdoisnegito>Pjr,opwoesihtiaovne2,wehavez1 =A′x1 ≤r A′x2 = 1 1 AA21·11XX 1 11 1 √XA·1X = √X  ·... = √X  ... = √X1X. (T(x ,K)) (T(x ,K)) (T(x ,K)) (T(x ,K)) 2 i· 1 j − 2 j · 1 i  A 1   1  biKz2(i) bjKz1(j)  X · X    =     Xx=1bxKz2(x) · Xx=1bxKz1(x) (2) For the property of replacing λ1 with any value λ, we have the LHS of (8) b z (j) b z (i) P jK 2 P iK 1 − Xx=1bxKz2(x) · Xx=1bxKz1(x) Λ1V′(x1−x2) = biKbPjK(z2(i)z1(j)−z2P(j)z1(i)) 0, =[λ1V1(x1−x2), λ2V2(x1−x2), ··· , λXVX(x1−x2)]′ Xx=1bxKz2(x) Xx=1bxKz1(x) ≥ =[λ1 1 1X(x1 x2), λ2V2(x1 x2), , λXVX(x1 x2)]′ √X − − ··· − where,Pz2(i)z1(j)−z2(Pj)z1(i)≥0 is from z1 ≤r z2. =[0 λ2V2(x1−x2), ··· , λXVX(x1−x2)]′. (16) For the RHS of (8), we have C. Proof of Proposition 4 Λ V (x x ) 2 ′ 1 2 − Let z =A′x. Suppose i>j, we have =[λV1(x1−x2), λ2V2(x1−x2), ··· , λXVX(x1−x2)]′ 1 =[λ 1 (x x ), λ V (x x ), , λ V (x x )] (T(x,m))i (T(x,k))j (T(x,m))j (T(x,k))i 1√X X 1− 2 2 2 1− 2 ··· X X 1− 2 ′ · − · = bimz(i) bjkz(j) =[0 λ2V2(x1−x2), ··· , λXVX(x1−x2)]′. (17) X b z(l) · X b z(l) l=1 lm l=1 lk By (16) and (17), we prove the equation (8). b z(j) b z(i) P jm P ik − X b z(l) · X b z(l) l=1 lm l=1 lk F. Proof of Proposition 7 (b b b b )z(i)z(j) =Pim jk− jm iPk 0, X b z(l) X b z(l) ≥ l=1 lm l=1 lk ∞ ∞ R (βA)i(x x )=R (β(V 1)ΛV )i(x x ) where, b Pb b b P 0 is from B(m) B(k). ′ ′ 1− 2 ′ − ′ ′ 1− 2 im jk− jm ik ≥ ≥r Xi=1 Xi=1 9 (=a)R′ ∞ (β(V−1)′Λ2V′)i(x1 x2) −e′jB(m)A′x(tl)Wtuˆ+1(xt(+−1l),ej) − Xi=1 (=a) e B(m)A(xˇ(l) x(l)) i ∞ ′j ′ t − t =R′(V−1)′ (βΛ2)iV′(x1−x2) mX∈Yj∈XX−{1}h =R(V 1)ΥXi=V1 (x x ) × Wtuˆ+′1(xt(+−1l),ej)−Wtuˆ+1(xt(+−1l),e1) , ′ − ′ ′ 1− 2 (cid:16) (cid:17)i(20) =R′(VΥV−1)′(x1 x2) − where, the equality (a) is due to x(l)(1) = 1 =R′Q′(x1−x2), x(l)(j). t − j (l) 1 t where, the equality (a) is due to Proposition 6. PN∈eXxt,w−{ea}nalyzetheterminthebracket,Wtuˆ+′1(xt(+−1l),ej)− Wtuˆ+1(xt(+−1l),e1), of RHS of (20) through three cases: G. Proof of Proposition 8 Case1:ifuˆ =landuˆ =l,accordingtotheinduction ′t+1 t+1 According to Assumption 1.5, we have R(e e ) hypothesis, we have ′ j+1 j − ≥ Rpr′oQve′(eRj+′(1ei−eejj))(1 R≤′Qj′≤(eiX −ej)1)f.orThanuys,iw>ejon+ly1.need to 0≤Wtuˆ+′1(xt(+−1l),ej)−Wtuˆ+1(xt(+−1l),e1) − ≥ − T t 1 − − R′(ei ej) R′Q′(ei ej) R′(βA′)i(ej e1). − − − ≤ − i−1 i−1 Xi=0 =R′ (ek+1−ek)−R′Q′ (ek+1−ek) Case 2: if uˆ′t+1 6=l, uˆt+1 6=l, and uˆ′t+1 =uˆt+1, according k=j k=j to the induction hypothesis, we have X X = i−1 R′(ek+1−ek)−R′Q′(ek+1−ek) ≥0, 0≤Wtuˆ+′1(xt(+−1l),ej)−Wtuˆ+1(xt(+−1l),e1) Xk=jh i T−t−1R(βA)i(e e ). where, the last inequality is from Assumption 1.5. ≤ ′ ′ j − 1 i=1 X Case3:ifuˆ =landuˆ =l,accordingtotheinduction APPENDIX C ′t+1 t+1 6 hypothesis, we have PROOF OF LEMMA2 We prove the lemma by backward induction. 0≤Wtuˆ+′1(xt(+−1l),ej)−Wtuˆ+1(xt(+−1l),e1) For slot T, we have T t 1 − − 1) For u = u = l, it holds that Wu′(xˇl ) Wu(xl ) = R′(βA′)i(ej e1). 2) FRo′(rxˇuT(l′T)−=xT(Tll),);u = l and u T= uT ,−it hToldsTthat Combining≤ CXi=as0e 1–3, we −obtain the bounds of WTu′(xˇ′TlT)6 −WTu(TxlT6)=0; ′T T Wtuˆ+′1(xt(+−1l),ej)−Wtuˆ+1(xt(+−1l),e1) as follows: 3) Fsuocrhut′Tha=t ul an=d unTa6=ndlxˇi(tl)exisxts(na)t lexa(slt).onItetchheannnheolldns 0≤Wtuˆ+′1(xt(+−1l),ej)−Wtuˆ+1(xt(+−1l),e1) that 0≤WTu′T′(xˇlT)−WTu(TxlT≥)s≤TR≥′(xsˇ(TlT)−x(Tn)). T−t−1R′(βA′)i(ej e1). Therefore, Lemma 2 holds for slot T. ≤ − i=0 X Assume that Lemma 2 holds for T 1, ,t+1, then we Therefore, we have − ··· prove the lemma for slot t. Wefirstprovethefirstcase:u′t =l,ut =l.Bydeveloping Wtu′(xˇlt)−Wtu(xlt) xˇlt and xˇlt according to Lemma 1, we have: =R′(xˇ(tl)−x(tl))+β̥(xˇlt,u′t)−̥(xlt,ut) ̥(xˇlt,u′t)= d(xˇ(tl),m) e′jT(xˇ(tl),m)Wtuˆ+′1(xt(+−1l),ej), =R′(xˇ(tl)−x(tl))+β mX∈Y jX∈X mX∈Yj∈XX−{1} = e′jB(m)A′xˇ(tl)Wtuˆ+′1(xt(+−1l),ej) (18) e′jB(m)A′(xˇ(tl)−x(tl)) Wtuˆ+′1(xt(+−1l),ej)−Wtuˆ+1(xt(+−1l),e1) m j ̥(xlt,ut)= X∈YdX∈(xX(tl),m) e′jT(x(tl),m)Wtuˆ+1(xt(+−1l),ej) ≤Rh′(xˇ(tl)−x(tl))+β (cid:16) (cid:17)i m j mX∈Yj∈XX−{1} X∈Y X∈X T t 1 = e′jB(m)A′x(tl)Wtuˆ+1(xt(+−1l),ej). (19) e′jB(m)A′(xˇ(tl)−x(tl)) − − R′(βA′)i(ej −e1) mX∈YjX∈X h (cid:16) Xi=0 (cid:17)i Furthermore, we have T t = − R(βA)i(xˇ(l) x(l)). ̥(xˇl,u ) ̥(xl,u ) ′ ′ t − t t ′t − t t Xi=0 = e′jB(m)A′xˇ(tl)Wtuˆ+′1(xt(+−1l),ej) To the end, we complete the proof of the first part, u′t = l and u =l, of Lemma 2. mX∈YjX∈Xh t 10 T t 1 u′tS=ecuotn,dwlyh,iwcheipmrpolvieesththeatseincotnhdisccaassee,uu′t′t6==l,utu.tA6=sslu,mainndg ≤ − − R′(βA′)iA′(xˇ(tl)−x(tl)). (24) i=0 u =u =k, we have: X ′t t Combing (24) and (23), we have ̥(xˇl,u ) = td(x′tt(k),m) e′jT(x(tk),m)Wtuˆ+′1(xt(+−1k,−l),ej,A′xˇ(tl))W=tuβ′((xˇ̥lt()xˇ−l,Wut)u(xlt̥)(xl,u )) mX∈Y jX∈X t ′t − t t = e′jB(m)A′x(tk)Wtuˆ+′1(xt(+−1k,−l),ej,A′xˇ(tl)) (21) =β e′jB(m)A′x(tk) ̥mX(∈xYlt,jX∈uXt) × WmXtu∈ˆ+′Y1(jXx∈t(X+−1k,−l),ej,A′xˇ(tl))−Wtuˆ+1(xt(+−1k,−l),ej,A′xt(l)) = d(xt(k),m) e′jT(x(tk),m)Wtuˆ+1(xt(+−1k,−l),ej,A′x(tl)) hβ e B(m)Ax(k)T−t−1R(βA)iA(xˇ(l) x(l))i mX∈Y jX∈X ≤ ′j ′ t ′ ′ ′ t − t = e′jB(m)A′x(tk)Wtuˆ+1(xt(+−1k,−l),ej,A′x(tl)). (22) mX∈YjX∈X Xi=0 T t 1 Thusm,X∈YjX∈X = e′j B(m) A′x(tk) − − R′(βA′)i+1(xˇt(l)−xt(l)) jX∈X hmX∈Y i Xi=0 ̥(xˇlt,u′t)−̥(xlt,ut) = e EAx(k)T−t−1R(βA)i+1(xˇ(l) x(l)) = e B(m)Ax(k) ′j ′ t ′ ′ t − t ′j ′ t j i=0 X∈X X mXW∈tYuˆ+′jX1∈(Xxt(+−1k,−l),ej,A′xˇ(tl))−Wtuˆ+1(xt(+−1k,−l),ej,A′x(tl)) . =1′XA′x(tk)T−t−1R′(βA′)i+1(xˇ(tl)−x(tl)) h (2i3) Xi=0 T t 1 l Fiosr ntehveertecrmhoseinn tfhoer bWratucˆ+′k1e(txt(+−of1k,−Rl)H,eSj,Aof′xˇ(tl()2)3),anidf =1′Xx(tk) X−i=−0 R′(βA′)i+1(xˇ(tl)−x(tl)) Woftuˆ+ti1m(xet(+−h1ko,r−izl)o,nej,oAf ′xin(ttle))resftroTm. tThehastloits tto+s1ay,touˆt′τhe6=endl =T−t−1R′(βA′)i+1(xˇ(tl)−x(tl)) and uˆτ = l for t + 1 τ T, and further, we have Xi=0 oWthtuˆ+e′r1w(xist(e+−6,1ki,t−el)x,iestjs,Ato′xˇ((ttl)+)≤−1 Wttu≤ˆ+o1(xTt(+−)1k,s−ulc)h,etjh,aAt′oxn(tel))of=th0e; =T−tR′(βA′)i(xˇ(tl)−x(tl)), ≤ ≤ i=1 following three cases holds. X andCause 1=:ul;′τ 6=l anduτ 6=l fort≤τ ≤t0−1whileu′t0 =l lw/hich. completes the proof of Lemma 2 when l ∈/ A′ and t0 ∈A Ru′t′0C[A6=a′s]etl0−a2n:txdˇut(lu′τ)t06=≥=lRla′[(nANd′o]tut0eτ−tth6=xa(ttl)lthfaioscrccotarsde≤indgτoetos≤nthot0et e−fixris1sttwosirhndiceleer ditenLeoxatisesttd,swaasetxlpet(nrao)s,tvesountchehepthrtaohtcirexˇds(tsl)c≥uatssex=t(un′t)n≥=,saxnl(tdl)a.niWtdseubehtlai6=veeflv,etchteonr stochastic dominance of transition matrix A); Wu′(xˇl) Wu(xl) Case 3:u =l andu =l fort τ t0 1whileu =l t t − t t and ut0 6=l.′τ 6 τ 6 ≤ ≤ − ′t0 =Wtl(x(t1),··· ,xt(l−1),xˇ(tl),x(tl+1),··· ,x(tN)) l),Fwoer Chaavsee 1, accordingto the hypothesis(u′t0 =l andut0 = −Wtn(x(t1),··· ,xt(l−1),x(tl),x(tl+1),··· ,xt(N)) βt0−t(WtuTˆ0′(xˇtolt0)−Wtuˆ0(xlt0)) =[−WtWl(xtn(t(1x),(t1·)·,··,·x·t(,l−xt(1l)−,1xˇ)(t,l)x,t(xn(t)l,+x1(t)l,+·1·)·,·,·x·(tN,x))t(N))] ≤βt0−t − (βλ)iR′(xˇ(t0l)−x(t0l)) +[Wtn(x(t1),··· ,xt(l−1),xt(n),x(tl+1),··· ,xt(N)) TXi=0to −Wtn(x(t1),··· ,xt(l−1),x(tl),x(tl+1),··· ,xt(N))] =βt0−t − R′(βA′)i[A′]t0−t(xˇ(tl)−x(tl)) =[Wtl(x(t1),··· ,xt(l−1),xˇ(tl),x(tl+1),··· ,x(tN)) (≤b)βT−t−Xi=10R′(βA′)iA′(xˇ(tl)−x(tl)), +[−WtWn(txl((tx1)(t1,)·,····,·x,t(xl−t(l1−)1,)x,t(xnt()n,)x,(txl+(tl1+)1,)·,····,·x,t(xNt(N))))] Xi=0 −Wtn(x(t1),··· ,xt(l−1),x(tl),x(tl+1),··· ,xt(N))]. (25) where, the inequality (b) is from t0 t+1. ForCase3,bytheinductionhypot≥hesis,wehavethesimilar According to the induction hypothesis (l ∈ A′ and l ∈A), the first term of the RHS of (25) can be bounded as follows: results with Case 1. Combing the results of the three cases, we obtain Wtu′(x(t1),··· ,xt(l−1),xˇ(tl),x(tl+1),··· ,xt(N)) Wtuˆ+′1(xt(+−1k,−l),ej,A′xˇ(tl))−Wtuˆ+1(xt(+−1k,−l),ej,A′x(tl)) −Wtu(x(t1),··· ,xt(l−1),x(tm),x(tl+1),··· ,xt(N))

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.