Index Policies for Optimal Mean-Variance Trade-Off of Inter-delivery Times in Real-Time Sensor Networks Rahul Singh∗, Xueying Guo† and P.R. Kumar∗ ∗Department of Electrical and Computer Engineering, Texas A&M University. Email: {rsingh1,prk}@tamu.edu †Department of Electronic Engineering, Tsinghua University. Email: [email protected] 5 1 0 2 Abstract—A problem of much current practical interest highserviceregularity.Wecanassociatearewardfunction n is the replacement of the wiring infrastructure connecting θi −var(D ) withclient i,where θ is theparameterthat a approximately200sensorandactuatornodesinautomobiles D¯i i i client i uses to tune its trade-off between its throughput J by an access point. This is motivated by the considerable 1 (where D¯ is the mean inter-delivery time between 6 savingsinautomobileweight,simplificationofmanufactura- D¯i i 2 bility, and future upgradability. packetsofclienti)andtheserviceregularityvar(Di),the A key issue is how to schedule the nodes on the shared varianceoftheinter-deliverytimesforclienti.Byvarying ] access point so as to provide regular packet delivery. In θ one can explore the full range of design freedom I i this and other similar applications, the mean of the inter- N along the Pareto frontier of all mean-variance tradeoffs. delivery times of packets, i.e., throughput, is not sufficient s. to guarantee service-regularity. The time-averaged variance Insummary,thenetfunctionwhichcapturesthetrade-off c of the inter-delivery times of packets is also an important is, [ metric. 1 So motivated, we consider a wireless network where an N R θi −var(D ) , AccessPointschedulesreal-timegeneratedpacketstonodes i D¯ i v over afadingwireless channel.Weareinterested indesign- i=1 (cid:18) i (cid:19) 6 X ing simple policies which achieve optimal mean-variance 4 tradeoff ininterdelivery timesof packetsby minimizingthe where Ri > 0 is the weight attached to client i, and θi 4 is a tunable parameter permitting full exploration of the sum of time-averaged means and variances over all clients. 6 Our goal is to explore the full range of the Pareto frontier Pareto frontier. 0 of all weighted linear combinations of mean and variance Our contributions can be summarized as follows. We . 1 so that one can fully exploit the design possibilities. show how one may obtain tractable decoupled solutions 0 WetransformthisproblemintoaMarkovdecisionprocess for the problem of scheduling the clients by addressing it 5 andshowthattheproblemofchoosingwhichnode’spacket asaRestlessMulti-ArmedBandit Problem[9].Inparticu- 1 to transmit in each slot can be formulated as a bandit : problem. We establish that this problem is indexable and lar we obtain the Whittle indices in a closed form, which v explicitly derive the Whittle indices. The resulting Index yieldsaveryelegant solutionbased merelyoncomparing Xi policy is optimal in certain cases. We also provide upper the indices of the clients. We also derive upper bounds and lower bounds on the cost for any policy. Extensive on the achievable performance of any policy. Simulation r simulations show that Index policies perform better than a results show that the performance of the obtained Index previously proposed policies. policy is very close to optimal. I. INTRODUCTION Traditionally, throughput and delay have been used as II. RELATED WORKS performance metrics to judge quality of service (QoS) The steady-state variance of the inter-delivery times of [1]–[7]. The steady-state variance of inter-delivery times packets of clients as a measure of service regularity has ofpacketsisconsideredasameasureofserviceregularity beenconsidered in [8]. References[8] and [10] consider in [8]. Motivated by cyber-physical systems applications the scenario where multiple queues are sharing a server serving sensors, we address the problem of achieving an anddealwiththeproblemofstabilizingthequeueswhile optimal “mean-variance trade-off” in the inter-delivery ensuring an optimal delay and service regularity. [11], times of packets of N clients sharing K channels. [12] perform an analysis of the pathwise starvations in We consider an access point with K channels shared service for the case of a single-hop multi-user wireless by N clients. The clients desire a high throughput with network. A detailed intoduction to Restless Multi-Armed Bandit This paperis partiallybasedon work supportedby NSF underCon- Problems (RMBP) can be found in [13]. RMBP and its tractNos.CNS-1302182andCCF-0939370,andAFOSRunderContract No.FA-9550-13-1-0008. relaxation were first introduced in [9]. The RMBP model 2 hasbeenusedearlierinworkssuchas[14],whichconsid- solely a function of the system state s. With this set-up, ered the problem of choosing an appropriate channel for the process s(t) becomes a controlled Markov process. up and downlink transmissions in multichannel access. For a positive discount factor β < 1, the β-discounted Reference [15] is another notable work which uses the optimization problem is to design control policy u(t) so RMBP model and derives index policies for optimizing as to maximize the expected infinite horizon discounted convex holding costs in a multiclass queue. reward, We also note that optimality of Index policies has been T N establishedincertaincasesasthepopulationofarmsgoes liminfE βt R (θ (s =0)−s ) . (2) i i i i to infinity [16] and extensive simulations have shown T→∞ 1 ! t=0 i=1 X X that Index policies have “good” performance even in the Similarly the average reward problem is to maximize the finite population regime [15], [17]. References [18]– expected infinite horizon time-average reward, [21] consider minimization of variance as an objective in Markov Decision Process. 1 T N liminfE R (θ (s =0)−s ) . (3) i i i i III. SYSTEM MODEL T→∞ T t=0 i=1 1 ! X X We consider the situation where time has been dis- Itiseasilyverifiedthattheaboverewardfunctionreduces cretized into slots, and the duration of a slot corresponds to, to the time taken to attempt a packet transmission. Each N θ D (D +1) client is assumed to have one packet at the beginning R i −E i i , (4) i E(D ) 2 of each slot. In each slot, a scheduler chooses K out i=1 (cid:18) i (cid:18) (cid:19)(cid:19) X of the N clients, and attempts to deliver their packets. and thus differs slightly from the original reward func- Channel unreliability is modeled by supposing that if tion (1). client i is served in slot t, then the packet is delivered with probability pi, independent of the past attempts. V. WHITTLE INDEX Moreovertheservicetimesareindependentacrossclients. We will pose the MDP of the previous section as a The scheduler has to choose the K clients transmitted in Restless Multiarmed Bandit Problem (RMBP). First we each slot so as to maximize the reward function, briefly describe the RMBP. A detailed discussion can be N found in [9], [13], [22]. θ R i −var(D ) , (1) Consider a bandit which has N arms modeled as i D¯ i i=1 (cid:18) i (cid:19) Markov processes. At each time a player can choose to X where D¯ and var(D ) are the mean and variance of the play any K < N arms and collect a reward from each i i arm, where the reward is a function of the current state inter-delivery times of packets for client i in the steady ofthe arm thatis played.The timeevolution ofeach arm state distribution. depends on whether it was chosen to play or not; thus IV. MARKOV DECISION PROCESS FORMULATION the bandits (arms) are “restless” and evolve even if they are not played. The player has to choose the K arms to The system state at time t is given by the vector playateachtime,soastomaximizetheexpectedreward. s(t) := (s (t),...,s (t)), with s (t) denoting the time 1 N i A “Whittle” policy, or “Index-based” policy, for the slots elapsed between the latest delivery of a packet of RMBP, calibrates each of the N arms by deriving N clienti,andt.Becausetimeisdiscretized,thestatevector positive functions (called “index functions”) W (·),i = s(t)isupdatedonlyatthebeginningofslott,andremains i 1,...,N, which are defined for each possible value that unchanged within the slot. The state thus evolves as, the state of arm i can assume. At time t the policy s (t)+1 if no packet of client i is simply chooses to play the K arms having the K largest i valuesofW (s (t)).Afterare-labelingsothatW (s (t))≥ delivered in slot t, i i 1 1 s (t+1)= W (s (t))≥W (s (t)), the choices at time t are i 2 2 N N 0 if a packet of client i is delivered in slot t. 1 for i=1,2,...,K, u (t)= i TheAccessPoint(AP)takesadecisionatthebeginningof (0 otherwise. theslotttograntchannelaccesstoK clientsbychoosing The derivation of the functions W (·) follows the fol- i a control u(t) ∈ {0,1}K, Nu (t) = K, where u (t) = 1 lowing procedure. Each arm is considered in isolation i i i implies that client i will be granted channel accessin slot from the rest of the arms, and the reward function is P t. The decision can be based on the entire past history of now modified so that the player receives, in addition to the system up to time t. the original reward of the arm, a “subsidy” each time The “reward earned” at time t when the system is in that he chooses not to play the arm (chooses “passive state s is given by N R (θ (s =0)−s ), and thus is action”), and the goal once again is to maximize the i=1 i i1 i i P 3 average reward. After having solved this problem, let us Expressions involving reward-functions belonging to a denote by Π(w) the set of states that an optimal policy single value of threshold are at times not mentioned as chooses to not play arm (stay passive). Then the arm is a function of threshold. X is a random variable that p said to be indexable if for any two values of subsidies is geometrically distributed with parameter p. Also, we w1,w2, we have w1 > w2 =⇒ Π(w2) ⊆ Π(w1), and the define X :=EβXp and Y :=EXpβXp. original MDP is said to be indexable if all the N arms are Lemma1:Consider thesingle client β discounted MDP. indexable.IncasetheMDPisindexable,theWhittleIndex 1) c (i+1)−c (i) is a linear increasing function of the i i asafunctionofthestatevaluesisdefinedasthesmallest subsidy w for all i ≥ 0 It is strictly negative when value of subsidy that makes an optimal policy choose the w =0. passive action when the client is in state n, i.e., 2) For each n ≥ 0, there exists a unique value of the W(n)=inf{w:n∈Π(w)}. (5) subsidy, denoted W(n), such that cn(n+1)=cn(n). 3) W(n) ≥ W(n − 1); thus W(n) form an increasing Thus, the Whittle index measures, in a sense, the “value” sequence. of an arm as a function of the present state, and the 4) For all values of thresholds T, if j > i ≥ T, then WhittleorIndexpolicychoosesthoseK armswhichhave c (T)>c (T). i j the highest value amongst the N arms. Proof: For T ≥ 0, the infinite horizon discounted VI. THE CLIENT SCHEDULING PROBLEM IS INDEXABLE reward earned starting in state i and following a policy We will consider the β-discounted MDP, show that it with threshold T +i is, is indexable and derive the corresponding Whittle index. T−1 T−1 The results for the average reward MDP will be obtained c (i+T)=w βj − R(i+j)βj i by letting β →1. We begin with a brief description of the j=0 j=0 X X single-arm β discounted reward problem. Xp−1 Considerthefollowingsingleclientβ discountedbandit +RβT E − (i+T +j)βj problem parametrized by w and β. The subscripts are j=0 suppressed for convenience since the discussion below X i applies to each of the N clients. Thus s(t),p are used +βT EβXp Rθ+ (w−Rj)βj in place of si(t),pi. Thereisasingleclient,whosestateattimet,s(t),isthe (cid:0) (cid:1) Xj=0 time-elapsed-since-last-packet-delivery. At each time-slot, +βT+i EβXpci(i+T). we can choose from the following two control actions: Thus c (i+T) depend(cid:0)s on w(cid:1)as, eitherattemptthetransmissionofapacketforit(active), i or stay idle (passive). The reward earned at time t is T−1 i−1 = −Rs(t) + w + Rθ {s(t) = 0} if the client chooses w βj+wβT EβXp βj 1−βT+i EβXp the passive action of n1ot transmitting, while a reward of " Xj=0 (cid:16) (cid:17)Xj=0 #(cid:14)h (cid:16) (cid:17)i −Rs(t)+Rθ {s(t) = 0} is earned if client chooses the =w 1−βT +βT pβ · 1−βi 1−βT+i pβ active action1of transmitting. If the action at time t is 1−β pβ+1−β 1−β pβ+1−β (cid:20) (cid:21) (cid:18) (cid:19) active, then s(t+1), the state at time t+1, becomes 0 w 1−β+pβ−βT(1−β+pβi+1) (cid:14) = . (6) with probability p, and s(t)+1 with probability 1−p. If (1−β)(1−β+pβ−βT+i+1p) (cid:2) (cid:3) the action at time t is passive, then s(t+1) = s(t)+1. Thus c (i + 1) − c (i) depends on w as, The costs are additive over time after discounting by a i i w(1−β)(1−β+pβ) , which is linear and factorβt.Apolicywhetherto beactive or remainpassive (1−β+pβ−pβi+1)(1−β+pβ−pβi+2) increasing in w. at time t when the system state at time t is s(t)=s. Now we consider the case when w = 0. If C is the We will prove that there is an optimal policy which is 1 cost of cycle i → i → 0 → i, then it follows via a simple of threshold type, i.e. there is a threshold “elapsed time coupling argument that the cost of cycle i→i+1→0→ since last delivery” T (which depends on β,w,p), such i+1, denoted C , is given by, that the policy which keeps the client passive in slot t if 2 s(t)<T, and active if s(t)≥T, is optimal. Xp−1 By ci(T) we will denote the β-discounted reward C =−Ri+βC −RβE βj, 2 1 earned by a policy when the system starts with an initial j=0 X state value of i at time 0, and the policy with threshold at T is used. Let τ be the first time that state i is hit, and thustoprovethesecondresult ofthe firststatement, i i.e. τ = min{t ≥ 1 : s(t) = i}. By “reward earned in the we only have to show that i cycle i→j →0→i” we will mean the reward earned by the system starting in state i in the time slots 0,...,τi−1, C1 − −Ri+βC1−RβE Xj=p0−1βj >0. while operating under the policy with threshold at j. 1−βiX 1−βiβX P 4 This is equivalent to showing that, Lemma 2: Let the subsidy be w = W(n). Then for the single client β discounted MDP, 1−βiX Xp−1 1−βiX C1 >−Ri· 1−β −RβE βj· 1−β 1) ci(n)=ci(n+1),∀i≥0. Xj=0 2) ci−1(n)≥ci(n),∀i≥1. Proof: Firstly recall that for subsidy =W(n), c (n)− We observe that −Ri· 1−βiX −Rβ E Xp−1βj ·1−βiX n 1−β j=0 1−β cn(n+1)=0. Thus for i=0,1,...,n−1, is the reward earned over the cyc(cid:16)le i → i →(cid:17)0 → i if P oneweretomodifytheoriginalcost functionand instead c (n)−c (n+1)=βn−i(c (n)−c (n+1))=0. (10) i i n n charge a penalty of −Ri for value of states s(t) ≤ i For i≥n+1, and a penalty of −Rs(t) if s(t) > i. However since the original reward function is = −Rs(t)+ Rθ1{s(t) = 0} c (n+1)−c (n)=βX(c (n+1)−c (n))=0, (notethatw =0),asimplecouplingargumentshowsthat i i 0 0 the reward earned is lower with the modified function. wherethelastequalityfollowsfrom(10).Thisprovesthe This completes the proof of first statement. first statement. Notethatfromthefirststatementitfollowsthatcn(n+ Toprovethesecondresult,considerthefollowingcases: 1)−c (n) is a linear increasing function of w which is n i) For i > n, Lemma 1 implies that the inequality is less than0at w =0.Hence thereexistsavalue ofw such true. thatthefunctioncn(n+1)−cn(n)vanishes,andmoreover ii) For 2 ≤ i ≤ n, denote d as the cost incurred in the i vanishesat aunique pointsince theslope ofthisfunction cycle n → 0 → i−1. Then both c (n), and c (n+1) i i is strictly positive. This value of w, where the function can be derived in terms of d . When subsidy is equal i cn(n+1)−cn(n) vanishes, is W(n). to W(n), we have ci(n)=ci(n+1), i.e., Let C ,C be the costs of cycles n → n → 0 → n and n→n+11→2 0→n. It is seen that, −βn−i(1−β)di = (11) W(n) βnX−βn−i +Rβn−in−βnXi (12) C C c (n)= 1 , c (n+1)= 2 . (7) n 1−βnX n 1−βnβX +Rβ(cid:16)n−i β(1−X(cid:17))−βi+1X+βn+1X2 . (13) 1−β Using a coupling argument we obtain, (cid:16) (cid:17) where the first equality follows from statement 1. Xp−1 C =(W(n)−Rn)+βC −RβE βj. (8) Similarly, ci−1(n)−ci(n)≥0 is equivalent to 2 1 j=0 n−i−1 Combining (7),(8) and the fact that for Xw = W(n) we (W(n)−Ri−Rj)βj+βn−idi ≥ j=0 have c (n)=c (n+1), X n n n−i C1 = (W(n)−Rn)+βC1−RβE Xj=p0−1βj,or , j=0(W(n)−Ri+R−Rj)βj+βn−i+1di 1−βnX 1−βn+1X P X−βnX(W(n)−Ri+R), Xp−1 C (1−β)= W(n)−Rn−RβE βj (1−βnX). i.e., 1 Xj=0 (9) −βn−i(1−β)di+R1−1−βnβ−i +(W(n)−nR+R)βn−i NowletuscheckifunderthevalueofsubsidysettoW(n), −βnX(W(n)−Ri+R)≥0 or we have c (n) > c (n−1). If this is the case, then from the finr−s1t statemne−nt1 of this lemma, we will deduce (1−βnX)(βi−βn+1X) ≥0, that W(n − 1) < W(n). Now, cn−1(n) > cn−1(n−1) is wheretβhi(e1−seβc)ond-lastequivalencefollowsfrom(11). equivalent to showing We note that the last inequality holds trivially for all W(n)−R(n−1)+βC1−βnX(W(n)−R(n−1)) > β ∈ (0,1) and hence the statement 2 holds for i = 1−βnX 2,...,n. C1+RE jX=p0−1βj−βn−1X(W(n)−R(n−1)). iii) i = 1. We compare the cost incurred by the system 1−βn−1X starting in state 0 over the cycle 0→n→0 (say C ) P 0 After some algebraic manipulations and using (9) it can with the cost incurred over the cycle j →n→0→j be shown that proving the above inequality is equivalent when starting in state j ( denoted Cj) via cou- to proving X > 0, which indeed is true. This completes pling the processes associated with the two systems the proof of third statement. constructed on the same probability space. Clearly For the fourth statement, using a coupling argument, C0 > Cj. Thus c0(T) > cj(T) for any value of we obtain, c (T)=c (T)−R(j−i) Xp−1βj, and hence threshold T. j i j=0 c (T)<c (T). j i P 5 Lemma 3: The function w+pβ(c (T)−c (T)) ( which with strict inequality holding if w ∈ (W(n),W(n+1)), i 0 depends on w,i,T) is linear, increasing in w. Also, and equality holding for i = n,w = W(n). Similarly for i=n+1,n+2,... we have to verify the inequality W(n)+pβ(c (n)−c (n))=0 for n=0,1,.... (14) n+1 0 w+βp(c −c )≤0. (17) i+1 0 Proof: We consider the following cases: Wewillfirstprove(16).Weusesuperscriptstodistinguish i) For i ≤ T, it follows from (6) that the function w+ between costs c calculated under different values of pβ(c (T)−c (T)) depends on w as i i 0 subsidy. We have, 1−β−pβ+pβT−i+1 w. (15) w+βp cw −cw ≥W(n)+βp cW(n)−cW(n) 1−β+pβ−pβT+1 i+1 0 i+1 0 We have 1−β +pβ −pβT+1 > 0,∀β < 1. Also, 1− =pβ c(cid:0)W(n)−cW(cid:1)(n) +pβ cW((cid:16)n)−cW(n) (cid:17) 0 n+1 i+1 0 β −pβ +pβT−i+1 ≥ 1−2β +βT−i+1 > 0 since the =pβ(cid:16)cW(n)−cW(n)(cid:17) (cid:16) (cid:17) function i+1 n+1 ≥0, (cid:16) (cid:17) 1−2β+βk ≥0,∀k>1,β ∈(0,1). where the first inequality and equality follow from Thus, in the expression (15) the coefficient of w is Lemma 3, and the last inequality follows from Lemma 2. positive. To prove (17) we have, ii) For i≥T +1, we have, w+βp cw −cw ≤W(n+1)+βp cW(n+1)−cW(n+1) Xp−1 i+1 0 i+1 0 ci(T)=E βj(−i−j)+Xc0(T). (cid:0) (cid:1)=pβ cW(n+1)−c(cid:16)W(n+1) (cid:17) 0 n+2 j=0 X The dependence of c0(T) on w can be obtained +p(cid:16)β ciW+(1n+1)−c0W(n+(cid:17)1) from(6).Combining,w+pβ(ci(T)−c0(T))depends =pβ c(cid:16)W(n+1)−cW(n+1) (cid:17) on w as, i+1 n+2 ≤0, (cid:16) (cid:17) 1−β w, 1−β+pβ−pβT+1 where first two steps follow from Lemma 3, and the last inequality follows from Lemma 2. This completes which has a positive slope with respect to w. the optimality of the policy with threshold at W(n). This completes the proof of first statement. Note that for Following5,theWhittleindexforthestatenisthusgiven w =W(n), we have by c (n+1)=c (n). n n inf{w :n∈Π(w)}=inf{w:w ≥W(n)}=W(n), This implies where the first equality follows from the first statement of Theorem. −Rn+W(n)+βc (n+1)=−Rn n+1 We now proceed to explicitly derive the values of the +β(pc (n)+(1−p)c (n)) i.e. 0 n+1 indices W(n). W(n)+βcn+1 =β(p(c0+(1−p)cn+1) and so Theorem 2: W(n)+pβ(c −c )=0. pβ(f −f −f +f ) n+1 0 W(n)= 1 2 3 4 , where , f Above, in the second implication, we have used the first 5 1−βn statement of Lemma 2 to remove the dependence of ci(·) f1 = (1−β)2 ·((1−X)[n(1−β)+β]−Y(1−β)), on the threshold values. Theorem 1: For the β-discounted MDP with subsidy β(1−βn)−βnn(1−β) f = ·(1−X), w ∈ [W(n),W(n+1)), the policy with threshold at n 2 (1−β)2 is optimal. Thus the MDP is indexable and W(n) is the 1−X f = (1−βnX), Whittle index when the state is n. 3 1−β Proof:Fixaw ∈[W(n),W(n+1)).Ifthepolicyisin- f =θ(1−X), 4 deed optimal, then the Dynamic Programming optimality 1−βn equationwouldbesatisfied.Henceweonlyneedtoverify f5 =1−βnX −pβ 1−β (1−X) the inequality (cid:18) (cid:19) 1−β = . −Ri+w+βci+1 ≥−Ri+β[(1−p)ci+1+pc0], 1−β+pβ for i=0,1,...,n, Proof: From (14) we have, or, equivalently, w+βp(c −c )≥0, (16) i+1 0 W(n)=pβ(c −c ) 0 n+1 6 Xp 1 1−βn β(1−βn)−nβn+1(1−β) =pβ(c0−cn−E βj). (18) 1−βnX ·(cid:18)w 1−β + (1−β)2 − Xj=0 Xp−1 Rθ Now, βn (n+j)βj+ 1−βnX, and so j=1 C −C X c −c = 0 n , (19) lim(1−β)cβ(n)= 0 n 1−βnEβXp β↑1 0 1−βn β(1−βn)−nβn+1(1−β) nwh→erneC→0,0C→n anr.eWtheeccaonstcsoomvepruttheeCc0y−cleCsn0a→s,n→0and lβim↑1(cid:18)w1−βnX + (1−β)(1−βnX) Xp−1 Rθ(1−β) Xp−1 −(1−β)βn (n+j)βj+ C0−Cn =E (n+j)βj(1−βn) (20) Xj=1 1−βnX j=0 np Rp(n2+n) Rpθ X =w + + (24) n−1 np+1 2(np+1) np+1 + (W(n)−j)βj (1−EβXp)+θ 1−βXp . <∞. Xj=0 (cid:0) (cid:1) Since for each m, Wβ(m) → WAvg(m), it follows (21) from Theorem 2 that there exists a β⋆(w) such that the Combining(18,19,20)andsetting∆=E Xp−1(n+j)βj, policy with the threshold at n is optimal for the single we have, j=0 client β-discounted MDP for all β ∈ (β⋆(w),1). However P since lim (1 − β)cβ(n) exists, the policy with thresh- ∆(1−βn)+ nj=−01(W(n)−j)βj 1−EβXp old at nβi↑s1also opti0mal for the average cost problem. W(n)=pβ (cid:16)P 1−βnEβXp (cid:17)(cid:0) (cid:1) However since w can assume any value in the interval Avg Avg W (n),W (n+1) , the policy with threshold at −EXp−1βj+ θ 1−EβXp , n(cid:16) is optimal for the aver(cid:17)age cost MDP for each value of j=0 1(cid:0)−βnEβXp(cid:1) subsidy w ∈ WAvg(n),WAvg(n+1) . Thus, X or, (cid:16) (cid:17) Avg inf{w:optimal policy chooses active at n}≤W (n). n−1βj 1−EβXp (25) j=0 W(n) 1−pβ· = (cid:16)P 1−β(cid:17)n(cid:0)EβXp (cid:1) Similarly, picking subsidy w < WAvg(n) shows that the ∆(1−βn)+ n−1−jβj 1−EβXp active action is not optimal for any value of subsidy w< pβ j=0 WAvg(n). Hence, (cid:16)1P−βnEβXp(cid:17)(cid:0) (cid:1) Avg inf{w:optimal policy chooses active at n}=W (n), Xp−1 θ 1−EβXp (26) −E βj+ , Xj=0 1(cid:0)−βnEβXp(cid:1) and we obtain that WAvg(n) are indeed the Whittle indices for the average cost problem. We note that the expression (24) is the average reward which simplifies to, earned under the subsidy w and threshold at n. We will W(n)·f =pβ(f −f −f +f ). (22) denote this quantity as CAvg(W,n). 5 1 2 3 4 VII. BOUNDS ON OPTIMAL REWARD. Theorem 3: The Whittle indices for the average cost Lemma 4: For the average cost MDP, the reward ob- MDP are given by, tained underany policyis upper-boundedbythevalue of the following optimization problem: WAvg(n)= lim Wβ(n)=nRp· n + 1−p + 1 +Rpθ. β→1 2 1+p 2 N 1 (cid:18) (cid:19) (23) max Ri D¯i2+θiD¯ i=1 (cid:20) i(cid:21) X Proof: The expression (23) is easily derived N 1 from (22). It remains to show that the quantities such that ≤1,D¯ ≥0,i=1,...,N. (27) WAvg(n) are indeed Whittle indices for the average-cost D¯ipi i i=1 problem. Fix the subsidy to be w, and without loss of X Proof: The random reward earned in time steps Avg Avg generality let w ∈ W (n),W (n+1) . Below we 1,2,...,t is given by, use superscripts to (cid:16)exhibit the dependence o(cid:17)f the cost on β. Now, N R Ni(t) C(t):= i − D (l)2+θ N (t) , i i i cβ(n)= t 0 i=1 l=1 X X 7 whereN (t) isthe numberof packetsof client i delivered Theorem 4: Consider the average cost problem for the i bytimetandD (l)istheinterdeliverytimeofl-thpacket case where all the clients are identical, i.e., R ≡ 1 and i i of client i. Let us assume that the average interdelivery- p ≡ p for all the clients. The index policy is optimal in i time for client i under a policy is equal to D¯ . Thus, this case. i Proof: Firstly we note that in this symmetric case, liminfEC(t)≤limsupEC(t) t→∞ t→∞ theIndexpolicyservestheclientwiththelargestvalue of ≤ElimsupC(t) thestate,i.e.thepolicyis,“largesttime-since-last-service- t→∞ first”. We will prove the result only for the case of two N Ni(t)D (l)2 θ N (t) clients, each having channel reliability p. The case where =Elimsup R l=1 i + i i i therearemultiplesuchclientsfollowsinastraighforward t t t→∞ Xi=1 "P # manner. N ≤ R D¯2+θ 1 , Consider the time-horizon at t. If (s1,s2) is the initial i i iD¯ value of the state vector, and R (s) is the maximum i=1 (cid:20) i(cid:21) t X reward that can be earned when there are t time-slots to where the second inequality follows from Fatou’s lemma go, then the Dynamic Programming optimality equation and the last is Jensen’s inequality. Thus solving the op- becomes, timization problem (27) gives a lower bound on the performance of any policy. We note that the constraint R [(s ,s )]=−(s +s )+(1−p)R [(s +1,s +1)] t 1 2 1 2 t−1 1 2 N 1 ≤1,D¯ ≥0 is simply the capacity of the wireless +pmax{R [(0,s +1)],R [(s +1,0)]}, i=1D¯ipi i t−1 2 t−1 1 cPhannel. where the optimal action corresponds to the one max- imizing the expression on the right hand side. Let us Next we consider the Lagrangian relaxation of assume without loss of generality that s < s . Then the RMBP [9]. For this, we relax the constraint 1 2 R [(0,s +1)] ≤ R [(s +1,0)], which implies that of choosing K arms at each time, to the t−1 2 t−1 1 the optimal action is to serve client 2. constraint that one plays K arms on average, i.e., lim Total numbers of arms played by time t = K. t→∞ t IX. SIMULATIONS Clearly the maximum possible reward in the relaxed problem is greater than or equal to the reward earned We have carried out simulations to compare the per- by any policy for the original RMBP. Also since the Index formance of the optimal policy which was obtained via policy is the optimal solution to this relaxed problem ( thePolicyIterationtool-boxinMatlabvs.theIndexpolicy [9]), its value function serves as an upper-bound for the whichwasobtainedinTheorem3.Wepresentthreeplots value function of the RMBP. in Figures 1-3. In all the cases considered 2 clients share Lemma 5: Let CAvg,i be the average reward earned by a single channel. To obtain Figure 1, we fix client 1’s the policy maximizing the single-client average reward parameterasp =.8,θ =3,R =1,whileforclient2we 1 1 1 under the subsidy W (24). Then the reward for the fix θ = 3,R = 1 and vary p from 0 to 1. For Figure 2, 2 2 2 average cost MDP obtained by any policy is less than or we fix Client 1 parameters to be p = .8,θ = 3,R = 1 1 1 1 equal to, while for Client 2 we fix p = .6,R = 1 and vary the 2 2 N value of θ from 1 to 10.Toobtain Figure 3, we fix Client 2 inf CAvg,i(W)−W(N −K) 1’s parameters as p = .8,θ = 5,R = 5, and for Client 1 1 1 W>0 Xi=1 2 we fix the parameters p2 =.6,θ2 =5 while varying the = inf N W nipi + Ripi(n2i +ni) value of R2. W>0 nipi+1 2(nipi+1) WeobservethatIndexpolicygivesnear-optimalperfor- Xi=1 mance in all the cases. R p θ + i i i −W (N −K) , n p +1 i i (cid:19) X. CONCLUDING REMARKS N n p R p (n2+n ) = inf W i i +K−N + i i i i W>0" nipi+1 ! 2(nipi+1) We have proposed an analytical framework for ex- i=1 X ploring the full range of mean vs. variance tradeoffs R p θ + i i i , in inter-delivery times in wireless sensor nerworks, i.e. n p +1 i i (cid:21) Throughput vs. Service Regularity trade-off.The problem where n is such that W ∈(W(n ),W(n +1)). canbeformulatedasRestlessMultiarmedBanditProblem i i i and indices can be obtained in closed form. Simulations VIII. OPTIMALITY OF INDEX POLICY indicate near-optimal performance of the resulting Index Now we consider several special cases of interest. policy. 8 2 in multihop wireless networks,” in in Proceedings of IEEE Confer- 0 enceonDecisionandControl,2004,pp.1484–1489. [5] A.L.Stolyar,“Maxweightschedulinginageneralizedswitch:State −2 space collapseand workload minimization in heavy traffic,” The Optimal Policy AnnalsofAppliedProbability. −4 d Index Policy [6] M.J. Neely, E. Modiano and Chih-ping Li, “Fairness and optimal war −6 stochasticcontrolforheterogeneousnetworks,”IEEE/ACMTrans- Re actionsonNetworking,vol.16,no.2,pp.396–409,April2008. −8 [7] XueyingGuo,ShengZhou,ZhishengNiuandP.R.Kumar,“Optimal −10 wake-up mechanismfor singlebasestationwith sleepmode,” in InternationalTeletrafficCongress(ITC),Sept2013. −12 [8] R. Li, A. Eryilmaz, and B. Li, “Throughput-optimal wireless schedulingwithregulatedinter-servicetimes,”inINFOCOM,2013 −140 2 4 6 8 10 ProceedingsIEEE,April2013,pp.2616–2624. Fig. 1: Reward Optimal Policy vs. Index Policy for p1 = [9] P. Whittle, “Multi-armed bandits and the Gittins index,” J. R. .8,θ =3,R =1,θ =3,R =1, p varying from .1 to 1. Statist.Soc.B,vol.42,pp.143–149,1980. 1 1 2 2 2 [10] B. Li, R. Li, and A. Eryilmaz, “Heavy-traffic-optimal scheduling withregularserviceguaranteesinwirelessnetworks,”inProceed- ingsoftheFourteenthACMInternationalSymposiumonMobileAd 2.5 HocNetworkingandComputing,ser.MobiHoc’13. NewYork,NY, USA:ACM,2013,pp.79–88. 2 [11] Rahul Singh, I-Hong Hou and P.R. Kumar, “Fluctuation analysis 1.5 of debt based policies for wireless networks with hard delay d 1 constraints,”inINFOCOM,2014ProceedingsIEEE,April2014,pp. Rewar 0.05 OInpdteimx aPlo Plioclyicy [12] 2—n4e—t0w0,o–r“2kP4sa0t8hw.witihsehpaerdrfodremlaayncceonosftrdaeibnttsb,”asiendDpeocliiscioiensafonrdwCiornelterossl (CDC),2013IEEE52ndAnnualConferenceon,Dec2013,pp.7838– −0.5 7843. −1 [13] K. G. J.C. Gittins and R. Weber, Multi-armed Bandit Allocation Indices. JohnWiley&Sons,2011. −1.51 2 3 4 5 θ 6 7 8 9 10 [14] K.LiuandQ.Zhao,“Indexabilityofrestlessbanditproblemsand 2 optimalityofwhittleindexfordynamicmultichannelaccess,”IEEE Fig. 2: Reward Optimal Policy vs.Index Policy for p = 1 Transactions on Information Theory, vol. 56, no. 11, pp. 5547– .8,θ1 = 3,R1 = 1,p2 = .6,R2 = 1 while θ2 varies from 1 5567,Nov2010. to 10. [15] P. S. Ansell, K. D. Glazebrook, J. Nio-Mora, and M. O’Keeffe, “Whittle’s index policy for a multi-class queueing system with convex holding costs,” Mathematical Methods of Operations Research, vol. 57, no. 1, pp. 21–39, 2003. [Online]. Available: http://dx.doi.org/10.1007/s001860200257 [16] R.R.WeberandG.Weiss,“Onanindexpolicyforrestlessbandits,” −1 JournalofAppliedProbability,vol.27,Sep1990. −1.2 [17] K.D. Glazebrook, D. Ruiz-Hernandez and C. Kirkbride, “Some ard−1.4 Optimal iPnrdobexaabbillietyf,avmoill.ie3s8o,fnroe.st3le,spspb.a6n4d3it–p6r7o2b,le2m00s,6”.AdvancesinApplied w Index Policy Re−1.6 [18] J.A. Filar, L.C.M. Kallenberg and H.M. Lee, “Variance-penalized markov decisionprocesses,”Math. Oper. Res., vol. 14, no. 1, pp. −1.8 147–161,Mar.1989. −2 [19] HajimeKawai,“Avarianceminimizationproblemforamarkovde- cisionprocess,”EuropeanJournalofOperationalResearch,vol.31, 4 6 8 10 12 R 14 16 18 20 22 no.1,pp.140–145,1987. 2 [20] Masami Kurano, “Markov decision processes with a minimum- variance criterion,” Journal of Mathematical Analysis and Appli- cations,vol.123,no.2,pp.572–583,1987. [21] EitanAltmanandAdamShwartz,“Markovdecisionproblemsand Fig. 3: Reward Optimal Policy vs. Index Policy for p = 1 state-actionfrequencies,”SIAMJ.CONTROLANDOPTIMIZATION, .8,θ1 =5,R1 =5,p2 =.6,θ2 =5 while R2 is varied. vol.29,no.4,pp.786–809. 2 [22] P. Whittle, “Restless bandits: activity allocation in a changing world,”pp.287–298,1988. REFERENCES [1] L. Bui, R. Srikant, and A. Stolyar, “Novel architectures and al- gorithms for delay reduction in back-pressure scheduling and routing,”inINFOCOM2009,IEEE,April2009,pp.2936–2940. [2] Haozhi Xiong, Ruogu Li, A. Eryilmaz and E. Ekici, “Delay-aware cross-layer design for network utility maximization in multi-hop networks,” Selected Areas in Communications, IEEE Journal on, vol.29,no.5,pp.951–959,May2011. [3] A. Eryilmaz and R. Srikant, “Asymptotically tight steady-state queuelengthboundsimpliedbydriftconditions,”QueueingSyst. TheoryAppl.,vol.72,no.3-4,pp.311–359,Dec.2012. [4] XiaojunLinandNessB.Shroff,“Jointratecontrolandscheduling