Prior Ordering and Monotonicity in Dirichlet Bandits

Yaming Yu
Department of Statistics
University of California
Irvine, CA 92697, USA
[email protected]

arXiv:1101.4903v1 [math.ST] 25 Jan 2011

Abstract

One of two independent stochastic processes (arms) is to be selected at each of $n$ stages. The selection is sequential and depends on past observations as well as the prior information. Observations from arm $i$ are independent given a distribution $P_i$, and, following Clayton and Berry (1985), the $P_i$'s have independent Dirichlet process priors. The objective is to maximize the expected future-discounted sum of the $n$ observations. We study structural properties of the bandit, in particular how the maximum expected payoff and the optimal strategy vary with the Dirichlet process priors. The main results are (i) for a particular arm and a fixed prior weight, the maximum expected payoff increases as the mean of the Dirichlet process prior becomes larger in the increasing convex order; (ii) for a fixed prior mean, the maximum expected payoff decreases as the prior weight increases. Specializing to the one-armed bandit, the second result captures the intuition that, given the same immediate payoff, the more is known about an arm, the less desirable it becomes because there is less to learn when selecting that arm. This extends some results of Gittins and Wang (1992) on Bernoulli bandits and settles a conjecture of Clayton and Berry (1985).

Keywords: convex order; Dirichlet bandits; sequential decision; two-armed bandits.
MSC 2010: Primary 62L05, 62C10; Secondary 62L15, 60E15.

1 Introduction

Bandit problems are classical problems in statistical decision theory and have received considerable attention; see Berry and Fristedt (1985) for an overview. We consider discrete-time, finite-horizon, two-armed bandits from a Bayesian perspective. At each of $n$ stages, an observation is taken from one of two stochastic processes (arms).
A strategy specifies which process to select based on past observations. The objective is to maximize the expected payoff, $\sum_{i=1}^n a_i Z_i$, where $Z_i$ is the observation at stage $i$ and $A_n \equiv (a_1, a_2, \ldots, a_n)$ is a discount sequence satisfying $a_i \geq 0$ and $\sum_{i=1}^n a_i > 0$. A strategy is optimal if it achieves the maximum expected payoff. An arm is optimal initially if there exists an optimal strategy that selects that arm at the first stage.

The most widely studied bandit problem is the Bernoulli bandit, where each arm generates a sequence of exchangeable Bernoulli random variables. Bernoulli bandits are important as a model for clinical trials. Others, such as normal bandits, have also been extensively studied (Chernoff 1968). Extending the Bernoulli bandit, Clayton and Berry (1985) have introduced a one-armed Bayesian nonparametric bandit using Dirichlet process priors (Ferguson 1973). Chattopadhyay (1994) extends this and studies the two-armed Dirichlet bandit, which is also the setting of this work. Associated with arms 1 and 2 are probability measures $P_i$, $i = 1, 2$, respectively. Observations from arm $i$ are independent samples given $P_i$; observations from different arms are independent. The $P_i$'s themselves are treated as random, with independent Dirichlet process priors. Specifically, $P_i \sim \mathrm{DP}(\alpha_i)$, where $\alpha_i$ is a finite nonnull measure with a finite first moment. It is often helpful to write $\alpha_i = M_i F_i$, where $M_i = \alpha_i(\mathbb{R})$, so that $F_i$ is a probability distribution. We refer to $F_i$ and $M_i$ as the prior mean distribution and prior weight of the Dirichlet process, respectively. We use $(\alpha_1, \alpha_2; A_n)$ to denote such a Dirichlet bandit with discount sequence $A_n$.

For such problems one must balance the desire to maximize the immediate payoff and the need to explore a less known arm in the hope of higher payoff later on (the exploitation versus exploration dilemma).
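A key property used throughout is Dirichlet process conjugacy: after selecting arm $i$ and observing $x$, the prior $\mathrm{DP}(\alpha_i)$ updates to $\mathrm{DP}(\alpha_i + \delta_x)$, and the predictive distribution of the next observation is $\alpha_i/M_i$. A minimal sketch of this update for a finite-support base measure (the function names are ours, not code from the paper):

```python
# Hedged sketch: represent a finite-support base measure alpha = M * F as a
# dict mapping atoms to weights; M is the total weight, F the normalized dict.

def dp_update(alpha, x):
    """Posterior base measure alpha + delta_x after observing x."""
    post = dict(alpha)
    post[x] = post.get(x, 0.0) + 1.0
    return post

def predictive(alpha):
    """Predictive distribution of the next observation: alpha / alpha(R)."""
    total = sum(alpha.values())
    return {x: w / total for x, w in alpha.items()}

# Example: prior weight M = 2, prior mean F uniform on {0, 1} (equivalent to
# a Beta(1, 1) Bernoulli arm).  After observing x = 1, the predictive
# probability of drawing 1 next is 2/3.
alpha = {0.0: 1.0, 1.0: 1.0}
post = dp_update(alpha, 1.0)
pred = predictive(post)
```

Repeated observations simply accumulate point masses, which is the mechanism behind the posterior terms $\alpha + \delta_X$ appearing in the recursions of Section 2.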
Optimal strategies are usually specified through backward induction and are nontrivial to compute. Nevertheless, certain structural properties, such as the stay-on-a-winner rule (Bradt, Johnson and Karlin 1956; Berry 1972), often hold under suitable conditions. For Dirichlet bandits with known arm 2, Clayton and Berry (1985) obtain several structural results. In particular, the maximum expected payoff increases as $F_1$, the mean of the Dirichlet process prior for arm 1, increases in the usual stochastic order. Also, a version of the stay-on-a-winner rule holds: if arm 1 is optimal initially then it is optimal at the next stage provided that the initial observation from arm 1 is sufficiently large. Such results have been extended to the general two-armed Dirichlet bandits (Chattopadhyay 1994).

This paper studies further structural properties of Dirichlet bandits, in particular how the value of the bandit (i.e., the maximum expected payoff) varies with the Dirichlet process priors. The main results are (i) the value increases as the mean of the Dirichlet process for any arm becomes larger in the increasing convex order (defined below); (ii) the value decreases as the prior weight of the Dirichlet process of an arm increases. The second result agrees with the intuition that, given the same immediate payoff, an arm is less appealing when more is known about it, because there remains less to be explored. Though easy to state and intuitively appealing, such results are often difficult to prove. We mention a long-standing conjecture of Berry (1972), which states that for a finite-horizon Bernoulli two-armed bandit with uniform discounting and independent $\mathrm{Beta}(u_i, v_i)$ priors, $i = 1, 2$, for arms 1 and 2 respectively, if $u_1/v_1 = u_2/v_2$ and $u_1 + v_1 < u_2 + v_2$, then arm 1 is preferred to arm 2 at the initial pull.
If, instead of finite-horizon uniform discounting, we assume infinite-horizon geometric discounting, then the corresponding conjecture is true, as shown by Gittins and Wang (1992), who also prove analogous results for some other parametric bandits. Geometric discounting is special in that the optimal strategy for a multi-armed bandit is characterized by a "dynamic allocation index," or Gittins index (Gittins and Jones 1974; Gittins 1979; Whittle 1980), which reduces the problem to several one-armed bandits.

As the Bernoulli bandit is a special case of the Dirichlet bandit, our results may be regarded as a generalization of Gittins and Wang (1992), although our method of proof, based on convexity and stochastic orders, is different. Our main result (Corollary 2) confirms a conjecture of Clayton and Berry (1985) concerning the break-even value in the one-armed Dirichlet bandit. We also prove another conjecture of Clayton and Berry (1985) concerning the break-even observation when both arms are optimal initially (Proposition 1). These results will hopefully shed some light on the conjecture of Berry (1972). See Herschkorn (1997) for related results and conjectures on the Bernoulli bandit.

We find the usual stochastic order, the convex order and the increasing convex order particularly helpful in formulating and deriving the main results. For random variables $Z_1$ and $Z_2$ taking values on $\mathbb{R}$, we write $Z_1 \leq_{\mathrm{st}} Z_2$ (respectively, $Z_1 \leq_{\mathrm{cx}} Z_2$) if

    $E\phi(Z_1) \leq E\phi(Z_2)$    (1)

for every increasing (respectively, convex) function $\phi$ such that the expectations exist. If $Z_1 \leq_{\mathrm{st}} Z_2$ then we also say $Z_2$ is to the right of $Z_1$. We say $Z_1$ is smaller than $Z_2$ in the increasing convex order, written as $Z_1 \leq_{\mathrm{icx}} Z_2$, if (1) holds for every increasing and convex function $\phi$ such that the expectations exist. Hence $\leq_{\mathrm{icx}}$ is implied by either $\leq_{\mathrm{st}}$ or $\leq_{\mathrm{cx}}$. The convex order is concerned with variability. For example, if $Z_1 \leq_{\mathrm{cx}} Z_2$, both with finite second moments, then $EZ_1 = EZ_2$ and $\mathrm{Var}(Z_1) \leq \mathrm{Var}(Z_2)$.
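For finite-support distributions these orders can be checked numerically via stop-loss transforms: $Z_1 \leq_{\mathrm{icx}} Z_2$ if and only if $E(Z_1 - t)^+ \leq E(Z_2 - t)^+$ for all $t$ (see Shaked and Shanthikumar 2007). A small sketch with our own function names; since each stop-loss transform is piecewise linear with kinks at the support points, it suffices to compare at the union of support points plus one point below the minimum:

```python
def stop_loss(points, probs, t):
    """E[(Z - t)^+] for a finite-support distribution."""
    return sum(p * max(x - t, 0.0) for x, p in zip(points, probs))

def leq_icx(pts1, pr1, pts2, pr2, tol=1e-12):
    """Check Z1 <=_icx Z2 by comparing stop-loss transforms.

    Both transforms are piecewise linear with kinks at the support points
    (and slope -1 below the smallest one), so these finitely many t suffice.
    """
    ts = sorted(set(pts1) | set(pts2))
    ts = [min(ts) - 1.0] + ts
    return all(stop_loss(pts1, pr1, t) <= stop_loss(pts2, pr2, t) + tol
               for t in ts)

# A point mass at the mean is below a fair coin in the convex (hence
# increasing convex) order, but not conversely; <=_st also implies <=_icx.
assert leq_icx([0.5], [1.0], [0.0, 1.0], [0.5, 0.5])
assert not leq_icx([0.0, 1.0], [0.5, 0.5], [0.5], [1.0])
assert leq_icx([0.0], [1.0], [1.0], [1.0])
```

The two directions of the first pair illustrate that the convex order captures a mean-preserving spread, the situation exploited repeatedly in the proofs below.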
Another basic property is closure under mixtures: if distributions $F_i$, $G_i$, $i = 1, 2$, satisfy $F_1 \leq_{\mathrm{cx}} F_2$ and $G_1 \leq_{\mathrm{cx}} G_2$, then $\rho F_1 + (1-\rho) G_1 \leq_{\mathrm{cx}} \rho F_2 + (1-\rho) G_2$ for $\rho \in [0, 1]$; closure under mixtures also holds for $\leq_{\mathrm{st}}$ and $\leq_{\mathrm{icx}}$. (We use the notation $\leq_{\mathrm{st}}$, $\leq_{\mathrm{cx}}$, $\leq_{\mathrm{icx}}$ with distribution functions as well as random variables.) For further properties and applications of various stochastic orders, see Müller and Stoyan (2002) and Shaked and Shanthikumar (2007).

2 Prior mean monotonicity

Let us denote the maximum expected payoff of a two-armed Dirichlet bandit $(\alpha_1, \alpha_2; A_n)$ by $W(\alpha_1, \alpha_2; A_n)$. Let $W^i(\alpha_1, \alpha_2; A_n)$ be the expected payoff when selecting arm $i$ initially and using an optimal strategy thereafter. Then

    $W(\alpha_1, \alpha_2; A_n) = \max\{W^1(\alpha_1, \alpha_2; A_n),\ W^2(\alpha_1, \alpha_2; A_n)\}.$    (2)

Suppose arm 1 is selected initially, resulting in an observation $X$. Because the prior on $P_1$ is a Dirichlet process, the posterior is again a Dirichlet process $\mathrm{DP}(\alpha_1 + \delta_X)$, where $\delta_x$ denotes a point mass at $x$. Thus we have

    $W^1(\alpha_1, \alpha_2; A_n) = a_1 \mu_1 + E\left[W(\alpha_1 + \delta_X, \alpha_2; A_n^1) \mid \alpha_1\right],$    (3)
    $W^2(\alpha_1, \alpha_2; A_n) = a_1 \mu_2 + E\left[W(\alpha_1, \alpha_2 + \delta_Y; A_n^1) \mid \alpha_2\right],$    (4)

where $A_n^1 = (a_2, a_3, \ldots, a_n)$ and $\mu_i$ denotes the first moment of $\alpha_i$, which is also the expected value of an observation from arm $i$. In $E[g(X) \mid \alpha]$, the distribution of $X$ is $\alpha/M$ with $M = \alpha(\mathbb{R})$. The quantities $W$, $W^1$ and $W^2$ are well defined and finite as long as $\alpha_i$, $i = 1, 2$, have finite first moments, which we assume throughout.

Lemma 1 reveals a convexity property of $W$ which we shall use repeatedly.

Lemma 1. Let $\alpha$ be a finite measure on $\mathbb{R}$ with a finite mean. Then, for $u, v \in \mathbb{R}$ and $r > 0$, the function $W(\alpha + \rho\delta_u + (r-\rho)\delta_v, \alpha_2; A_n)$ is convex in $\rho \in [0, r]$.

Proof. Let us use induction on $n$. It is easy to check that the claim holds for $n = 1$. For $n \geq 2$, we note that by (2) it suffices to show that each of $W^i(\alpha + \rho\delta_u + (r-\rho)\delta_v, \alpha_2; A_n)$, $i = 1, 2$, is convex in $\rho \in [0, r]$.
Since the mean of $\alpha + \rho\delta_u + (r-\rho)\delta_v$ is linear in $\rho$, by (3) and (4), we only need to show that both

    $E\left[W(\alpha + \rho\delta_u + (r-\rho)\delta_v + \delta_X, \alpha_2; A_n^1) \mid \alpha + \rho\delta_u + (r-\rho)\delta_v\right]$    (5)

and

    $E\left[W(\alpha + \rho\delta_u + (r-\rho)\delta_v, \alpha_2 + \delta_Y; A_n^1) \mid \alpha_2\right]$    (6)

are convex in $\rho$. Convexity of (6) follows from the induction hypothesis. To deal with (5), we directly compute

    $E\left[W(\alpha + \rho\delta_u + (r-\rho)\delta_v + \delta_X, \alpha_2; A_n^1) \mid \alpha + \rho\delta_u + (r-\rho)\delta_v\right]$
    $= \frac{M}{M+r}\, E\left[W(\alpha + \rho\delta_u + (r-\rho)\delta_v + \delta_X, \alpha_2; A_n^1) \mid \alpha\right]$    (7)
    $\quad + \frac{\rho\phi(\rho+1) + (r-\rho)\phi(\rho)}{M+r},$    (8)

where $M = \alpha(\mathbb{R})$ and $\phi(\rho) = W(\alpha + \rho\delta_u + (r+1-\rho)\delta_v, \alpha_2; A_n^1)$.

By the induction hypothesis, $\phi(\rho)$ is convex in $\rho \in [0, r+1]$. We claim that this implies that $\psi(\rho) \equiv \rho\phi(\rho+1) + (r-\rho)\phi(\rho)$ is convex in $\rho \in [0, r]$. In fact, if $\phi(\rho)$ is twice differentiable, then we have

    $\psi''(\rho) = 2(\phi'(\rho+1) - \phi'(\rho)) + \rho\phi''(\rho+1) + (r-\rho)\phi''(\rho) \geq 0, \quad \rho \in [0, r],$

by the convexity of $\phi$. A standard limiting argument shows that $\psi(\rho)$ is convex in $\rho \in [0, r]$ as long as $\phi(\rho)$ is convex in $\rho \in [0, r+1]$, without assuming differentiability. Hence the second term (8) is convex. The first term (7) is convex in $\rho \in [0, r]$ by the induction hypothesis, since in this expectation $X$ is distributed according to $\alpha/M$, independently of $\rho$. Thus the convexity of (5) is established.

Theorem 1 says that the value of the bandit increases as the mean of the Dirichlet process prior for any arm becomes stochastically larger and more dispersed. This strengthens Proposition 2.2 of Clayton and Berry (1985), who consider the usual stochastic order rather than the increasing convex order.

Theorem 1. If $M > 0$ and $F \leq_{\mathrm{icx}} \tilde{F}$, both with finite means, then

    $W(MF, \alpha_2; A_n) \leq W(M\tilde{F}, \alpha_2; A_n).$

Proof. Let us use induction. The claim obviously holds for $n = 1$. For $n \geq 2$ we have $W^2(MF, \alpha_2; A_n) \leq W^2(M\tilde{F}, \alpha_2; A_n)$ by (4) and the induction hypothesis.
Moreover,

    $W^1(MF, \alpha_2; A_n) = a_1 E(X \mid F) + E\left[W(MF + \delta_X, \alpha_2; A_n^1) \mid F\right]$
    $\leq a_1 E(X \mid \tilde{F}) + E\left[W(M\tilde{F} + \delta_X, \alpha_2; A_n^1) \mid F\right]$
    $\leq a_1 E(X \mid \tilde{F}) + E\left[W(M\tilde{F} + \delta_X, \alpha_2; A_n^1) \mid \tilde{F}\right]$
    $= W^1(M\tilde{F}, \alpha_2; A_n),$

where the first inequality follows from $F \leq_{\mathrm{icx}} \tilde{F}$ and the induction hypothesis, noting that $(MF + \delta_x)/(M+1) \leq_{\mathrm{icx}} (M\tilde{F} + \delta_x)/(M+1)$ for any $x$; the second inequality holds by the definition of $\leq_{\mathrm{icx}}$, because $W(M\tilde{F} + \delta_x, \alpha_2; A_n^1)$ is an increasing, convex function of $x$. To show this, fix $-\infty < u < v < \infty$. It is easy to show $(M\tilde{F} + \delta_u)/(M+1) \leq_{\mathrm{icx}} (M\tilde{F} + \delta_v)/(M+1)$, which, by the induction hypothesis, implies $W(M\tilde{F} + \delta_u, \alpha_2; A_n^1) \leq W(M\tilde{F} + \delta_v, \alpha_2; A_n^1)$. Moreover,

    $W(M\tilde{F} + \delta_u, \alpha_2; A_n^1) + W(M\tilde{F} + \delta_v, \alpha_2; A_n^1)$
    $\geq 2\, W(M\tilde{F} + (\delta_u + \delta_v)/2, \alpha_2; A_n^1)$
    $\geq 2\, W(M\tilde{F} + \delta_{(u+v)/2}, \alpha_2; A_n^1),$

where the first inequality follows from Lemma 1, and the second inequality holds by the induction hypothesis, noting that

    $\frac{M\tilde{F} + \delta_{(u+v)/2}}{M+1} \leq_{\mathrm{icx}} \frac{M\tilde{F} + (\delta_u + \delta_v)/2}{M+1}.$

Hence $W(M\tilde{F} + \delta_x, \alpha_2; A_n^1)$ is convex in $x$, as needed.

Remark 1. Theorem 1 extends to bandits with more than two arms. That is, the maximum expected payoff increases when the mean of the Dirichlet process prior for any arm becomes larger in the increasing convex order. We present the two-armed version for notational convenience. The discount sequence in Theorem 1 is very general, i.e., we only assume $A_n$ is nonnegative. By approximation, this can be further extended to the infinite-horizon case assuming $\sum_{i=1}^\infty a_i < \infty$. Similar comments apply to Theorem 2 in Section 3.

When arm 2 has a known distribution $P_2$ with mean $\lambda$, the problem reduces to a one-armed bandit. Without loss of generality we may assume the known arm yields a constant payoff $\lambda$ at each stage, i.e., we consider the $(\alpha, \delta_\lambda; A_n)$ bandit (the subscript on $\alpha_1$ is dropped for convenience).
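When both base measures have finite support, the recursion (2)-(4) can be evaluated directly: every observation lands on an existing atom, so the posterior update $\alpha + \delta_x$ just increments one weight. The following sketch (our own function names; exponential in $n$, so practical only for small horizons) is one way to compute $W$:

```python
def value(alpha1, alpha2, disc):
    """W(alpha1, alpha2; A_n) by the backward induction (2)-(4).

    alpha1, alpha2: base measures as tuples of (atom, weight) pairs;
    disc: the discount sequence (a_1, ..., a_n).  A sketch only:
    exponential in the horizon n.
    """
    if not disc:
        return 0.0

    def mean(a):
        return sum(x * w for x, w in a) / sum(w for _, w in a)

    def pull_first(mine, other, mine_is_arm1):
        # eq. (3)/(4): immediate payoff a_1 * mu plus expected continuation
        m = sum(w for _, w in mine)
        cont = 0.0
        for i, (x, w) in enumerate(mine):
            post = tuple((y, wy + 1.0) if j == i else (y, wy)
                         for j, (y, wy) in enumerate(mine))
            if mine_is_arm1:
                cont += (w / m) * value(post, other, disc[1:])
            else:
                cont += (w / m) * value(other, post, disc[1:])
        return disc[0] * mean(mine) + cont

    # eq. (2): choose the better initial arm
    return max(pull_first(alpha1, alpha2, True),
               pull_first(alpha2, alpha1, False))
```

For example, with $\alpha_1 = 2F$, $F$ uniform on $\{0, 1\}$, arm 2 known to pay $0.5$, and $A_2 = (1, 1)$, the recursion gives $W = 13/12$: pull arm 1 first, then stay with it only after observing 1.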
It is well known that, assuming the discount sequence is regular in the sense that $\left(\sum_{i \geq j+1} a_i\right)^2 \geq \left(\sum_{i \geq j} a_i\right)\left(\sum_{i \geq j+2} a_i\right)$ for all $j \geq 1$, this one-armed bandit is an optimal stopping problem, i.e., if at any stage it is optimal to pull arm 2 then arm 2 should be used in all subsequent stages; see Berry and Fristedt (1979). If $A_n$ is regular, then there exists a break-even value $\Lambda(\alpha; A_n)$ for the $(\alpha, \delta_\lambda; A_n)$ bandit, such that arm 1 is optimal initially if and only if $\lambda \leq \Lambda(\alpha; A_n)$, and arm 2 is optimal initially if and only if $\lambda \geq \Lambda(\alpha; A_n)$. For infinite-horizon geometric discounting, this break-even value is also known as the dynamic allocation index or Gittins index (Gittins and Jones 1974). The following result holds by the optimal stopping characterization and is stated for uniform discounting as Lemma 2.1 in Clayton and Berry (1985).

Lemma 2. If $A_n$ is regular, then $\Lambda(\alpha; A_n)$ is the smallest $\lambda$ such that $W(\alpha, \delta_\lambda; A_n) \leq \lambda \sum_{i=1}^n a_i$.

Lemma 2 and Theorem 1 yield the following result comparing $\Lambda(\alpha; A_n)$.

Corollary 1. For $M > 0$ and $F \leq_{\mathrm{icx}} \tilde{F}$, both with finite means, we have $\Lambda(MF; A_n) \leq \Lambda(M\tilde{F}; A_n)$, assuming $A_n$ is a regular discount sequence.

Suppose $A_n$ is regular. Monotonicity and continuity considerations (see Clayton and Berry 1985) show that, for the $(\alpha, \delta_\lambda; A_n)$ bandit, there exists a break-even observation $b(\alpha; A_n)$ such that if both arms are optimal initially, and an observation $x$ is taken from arm 1, then arm 1 remains optimal if $x \geq b(\alpha; A_n)$ and arm 2 becomes optimal if $x \leq b(\alpha; A_n)$. That is,

    $\Lambda(\alpha; A_n) \geq \Lambda(\alpha + \delta_x; A_n^1)$, if $x \leq b(\alpha; A_n)$;
    $\Lambda(\alpha; A_n) \leq \Lambda(\alpha + \delta_x; A_n^1)$, if $x \geq b(\alpha; A_n)$.

Calculating this break-even observation is nontrivial. In the case of uniform discounting, Clayton and Berry (1985) prove an upper bound for $b(\alpha; A_n)$ and conjecture that $b(\alpha; A_n) \geq \Lambda(\alpha; A_n)$, based on numerical evidence. We confirm this in Proposition 1.

Proposition 1. Suppose $n \geq 2$ and $A_n$ is regular and all positive. Then $b(\alpha; A_n) \geq \Lambda(\alpha; A_n)$.
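Lemma 2 also suggests a simple numerical recipe for the break-even value: bisect on $\lambda$. The sketch below (our own function names, not code from the paper) uses the optimal stopping form valid for a regular discount sequence, under which the one-armed value is the larger of "retire on arm 2 now" and "pull arm 1 once and continue optimally"; the closing example also previews the prior weight monotonicity of Section 3.

```python
def one_armed_value(alpha, lam, disc):
    """W(alpha, delta_lam; A_n) for a regular discount sequence, via the
    optimal stopping form: retire on the known arm (lam per remaining
    stage) or pull the unknown arm once and continue optimally.
    alpha: tuple of (atom, weight) pairs; exponential in n, a sketch only."""
    if not disc:
        return 0.0
    m = sum(w for _, w in alpha)
    cont = disc[0] * sum(x * w for x, w in alpha) / m
    for i, (x, w) in enumerate(alpha):
        post = tuple((y, wy + 1.0) if j == i else (y, wy)
                     for j, (y, wy) in enumerate(alpha))
        cont += (w / m) * one_armed_value(post, lam, disc[1:])
    return max(lam * sum(disc), cont)

def break_even(alpha, disc, lo=0.0, hi=1.0, iters=50):
    """Smallest lam with W(alpha, delta_lam; A_n) <= lam * sum(A_n)
    (Lemma 2), found by bisection; assumes Lambda lies in [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if one_armed_value(alpha, mid, disc) <= mid * sum(disc) + 1e-12:
            hi = mid
        else:
            lo = mid
    return hi

# F uniform on {0, 1}, A_2 = (1, 1): prior weights M = 2 versus M = 8.
lam2 = break_even(((0.0, 1.0), (1.0, 1.0)), (1.0, 1.0))
lam8 = break_even(((0.0, 4.0), (1.0, 4.0)), (1.0, 1.0))
```

Here $\Lambda(2F; A_2) = 5/9$ while $\Lambda(8F; A_2) = 14/27 < 5/9$: with the same prior mean, the heavier prior weight lowers the break-even value, a small instance of the monotonicity proved in Section 3.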
As noted by Berry and Fristedt (1985, p. 131), Proposition 1 has an intuitive interpretation. Suppose both arms are optimal initially, and arm 1 is selected. If the initial pull on arm 1 yields no more than $\Lambda(\alpha; A_n)$, which is the yield of arm 2 per pull, the hope of getting a higher payoff fades. Not surprisingly, arm 2 becomes optimal afterwards. This suggests that the break-even observation is at least $\Lambda(\alpha; A_n)$.

To prove Proposition 1 we need a lemma.

Lemma 3. For $c > 0$, $\lambda \in \mathbb{R}$ and an arbitrary discount sequence $A_n$, we have

    $W(\alpha + c\delta_\lambda, \delta_\lambda; A_n) \leq W(\alpha, \delta_\lambda; A_n).$

Proof. We use induction on $n$. The $n = 1$ case is easy. Suppose $n \geq 2$. Let us write $M = \alpha(\mathbb{R})$ and let $\mu$ be the first moment of $\alpha$. Direct calculation using (2)-(4) yields

    $W(\alpha + c\delta_\lambda, \delta_\lambda; A_n) = \max\left\{ \frac{M\phi_0 + c\phi_1}{M + c},\ \phi_2 \right\},$    (9)

where

    $\phi_0 = a_1\mu + E\left[W(\alpha + c\delta_\lambda + \delta_X, \delta_\lambda; A_n^1) \mid \alpha\right];$
    $\phi_1 = a_1\lambda + W(\alpha + (c+1)\delta_\lambda, \delta_\lambda; A_n^1);$
    $\phi_2 = a_1\lambda + W(\alpha + c\delta_\lambda, \delta_\lambda; A_n^1).$

Applying the induction hypothesis, and then (2) and (3), we get

    $\phi_0 \leq a_1\mu + E\left[W(\alpha + \delta_X, \delta_\lambda; A_n^1) \mid \alpha\right] \leq W(\alpha, \delta_\lambda; A_n).$

Applying the induction hypothesis, and then (2) and (4), we get

    $\phi_1 \leq \phi_2 \leq a_1\lambda + W(\alpha, \delta_\lambda; A_n^1) \leq W(\alpha, \delta_\lambda; A_n).$

That is, $\phi_i \leq W(\alpha, \delta_\lambda; A_n)$ for $i = 0, 1, 2$. Hence the claim holds by (9).

Proof of Proposition 1. Suppose $\lambda = \Lambda(\alpha; A_n)$. By the optimal stopping characterization, we have $W(\alpha, \delta_\lambda; A_n^1) = \lambda \sum_{i=2}^n a_i$. Lemma 3 yields $W(\alpha + \delta_\lambda, \delta_\lambda; A_n^1) \leq \lambda \sum_{i=2}^n a_i$. It follows from Lemma 2 that $\lambda \geq \Lambda(\alpha + \delta_\lambda; A_n^1)$. That is, $\Lambda(\alpha; A_n) \geq \Lambda(\alpha + \delta_\lambda; A_n^1)$, which implies $\lambda \leq b(\alpha; A_n)$ (under the assumption that $b(\alpha; A_n)$ is unique).

3 Prior weight monotonicity

The main result of this section (Theorem 2) shows that the maximum expected payoff of a bandit decreases as the prior weight of the Dirichlet process prior of an arm increases. When arm 2 is known and the discount sequence is regular, this shows that the break-even value $\Lambda(M_1 F_1; A_n)$ decreases as $M_1$ (the prior weight associated with arm 1) increases.
That is, given the same immediate payoff, arm 1 becomes less desirable as the amount of information about it increases.

Theorem 2. Let $F$ be a probability distribution on $\mathbb{R}$ with a finite mean. If $0 < M < \tilde{M}$ then

    $W(MF, \alpha_2; A_n) \geq W(\tilde{M}F, \alpha_2; A_n).$    (10)

Lemma 2 and Theorem 2 yield the following result concerning the break-even value $\Lambda(\alpha; A_n)$ for the one-armed bandit $(\alpha, \delta_\lambda; A_n)$, as conjectured by Clayton and Berry (1985) in the case of uniform discounting.

Corollary 2. For $0 < M < \tilde{M}$ we have $\Lambda(MF; A_n) \geq \Lambda(\tilde{M}F; A_n)$, assuming $A_n$ is a regular discount sequence.

When $F$ has only two support points, Corollary 2 says that for a Bernoulli one-armed bandit with a $\mathrm{Beta}(Mu, Mv)$ prior, $u, v > 0$, for the unknown arm, the break-even value decreases in $M$. This Bernoulli case was proved by Gittins and Wang (1992) for infinite-horizon geometric discounting.

The rest of this section gives a proof of Theorem 2. We assume $F$ has finite, and then bounded, and finally arbitrary, support. The key step is summarized as Lemma 4.

Lemma 4. Assume $n \geq 2$ and $L > 0$. Assume $\alpha$ is a finite measure on $\mathbb{R}$ with a finite mean and $F$ is a probability distribution on $\mathbb{R}$ with $s < \infty$ support points. Then $E[W(\alpha + \theta F + (L - \theta)\delta_X, \alpha_2; A_n^1) \mid F]$ decreases in $\theta \in [0, L]$.

Proof. We use induction on $s$. Although the induction may start at the trivial case $s = 1$, we present the $s = 2$ case to illustrate the convexity arguments. Write $F = p\delta_1 + (1-p)\delta_0$, where $p \in (0, 1)$ and $\{0, 1\}$ are the support points without loss of generality. For fixed $0 \leq \theta_1 < \theta_2 \leq L$, let $Z \sim \mathrm{Bernoulli}(p)$ and define

    $Z_i = \theta_i p + (L - \theta_i) Z, \quad i = 1, 2.$

Then $EZ_1 = EZ_2 = pL$, and it is easy to verify $Z_2 \leq_{\mathrm{cx}} Z_1$ as $\theta_1 < \theta_2$ (see, e.g., Shaked and Shanthikumar 2007, Theorem 3.A.18). Let us define $\phi(u) = W(\alpha + u\delta_1 + (L - u)\delta_0, \alpha_2; A_n^1)$.
By direct calculation,

    $E\left[W(\alpha + \theta_1 F + (L - \theta_1)\delta_X, \alpha_2; A_n^1) \mid F\right] = p\phi(\theta_1 p + L - \theta_1) + (1-p)\phi(\theta_1 p)$
    $= E\phi(Z_1)$
    $\geq E\phi(Z_2)$
    $= E\left[W(\alpha + \theta_2 F + (L - \theta_2)\delta_X, \alpha_2; A_n^1) \mid F\right],$

where the inequality holds because $Z_2 \leq_{\mathrm{cx}} Z_1$ and, by Lemma 1, $\phi(u)$ is convex in $u \in [0, L]$.

For $s \geq 3$, write $F = \sum_{j=1}^s p_j \delta_{x_j}$, where $\{x_j,\ j = 1, \ldots, s\}$ are the support points, $p_j > 0$ and $\sum_{j=1}^s p_j = 1$. Consider the leave-one-out distributions

    $F^k = \sum_{j \neq k} \frac{p_j}{1 - p_k}\, \delta_{x_j}, \quad k = 1, \ldots, s.$

Denote $W(\gamma) = W(\gamma, \alpha_2; A_n^1)$ for convenience. For fixed $0 \leq \theta_1 < \theta_2 \leq L$, we have

    $(s-1)\, E\left[W(\alpha + \theta_1 F + (L - \theta_1)\delta_X) \mid F\right]$
    $= \sum_{k=1}^s (1 - p_k)\, E\left[W(\alpha + \theta_1 F + (L - \theta_1)\delta_X) \mid F^k\right]$
    $= \sum_{k=1}^s (1 - p_k)\, E\left[W\big(\alpha + \theta_1 p_k \delta_{x_k} + \theta_1 (1 - p_k) F^k + (L - \theta_1)\delta_X\big) \mid F^k\right]$
    $\geq \sum_{k=1}^s (1 - p_k)\, E\left[W\big(\alpha + \theta_1 p_k \delta_{x_k} + \theta_2 (1 - p_k) F^k + (L - \theta_2(1 - p_k) - \theta_1 p_k)\delta_X\big) \mid F^k\right]$    (11)
    $= \sum_{k=1}^s \sum_{j \neq k} p_j V_{jk},$

where

    $V_{jk} = W\big(\alpha + \theta_2 \gamma^{jk} + \theta_1 p_k \delta_{x_k} + (L - \theta_2(1 - p_k - p_j) - \theta_1 p_k)\delta_{x_j}\big),$
    $\gamma^{jk} = \sum_{l \neq j,k} p_l \delta_{x_l}, \quad j \neq k.$

The inequality (11) follows from the induction hypothesis; the other steps are algebraic manipulations. For fixed $j \neq k$, let $Z \sim \mathrm{Bernoulli}(p_k/(p_j + p_k))$ and define

    $Z_1 = \theta_1 p_k + Z(L - \theta_2 + (\theta_2 - \theta_1)(p_j + p_k));$
    $Z_2 = \theta_2 p_k + Z(L - \theta_2).$

It is easy to verify that $EZ_1 = EZ_2$ and $Z_2 \leq_{\mathrm{cx}} Z_1$. We have

    $p_j V_{jk} + p_k V_{kj} = (p_j + p_k)\, E\, W\big(\alpha + \theta_2 \gamma^{jk} + Z_1 \delta_{x_k} + (L - \theta_2(1 - p_k - p_j) - Z_1)\delta_{x_j}\big)$
    $\geq (p_j + p_k)\, E\, W\big(\alpha + \theta_2 \gamma^{jk} + Z_2 \delta_{x_k} + (L - \theta_2(1 - p_k - p_j) - Z_2)\delta_{x_j}\big)$
    $= p_j W\big(\alpha + \theta_2 F + (L - \theta_2)\delta_{x_j}\big) + p_k W\big(\alpha + \theta_2 F + (L - \theta_2)\delta_{x_k}\big),$

where the inequality holds by Lemma 1 as $Z_2 \leq_{\mathrm{cx}} Z_1$.
Hence,

    $\sum_{k=1}^s \sum_{j \neq k} p_j V_{jk} = \sum_{1 \leq j < k \leq s} (p_j V_{jk} + p_k V_{kj})$
    $\geq \sum_{1 \leq j < k \leq s} \left[ p_j W\big(\alpha + \theta_2 F + (L - \theta_2)\delta_{x_j}\big) + p_k W\big(\alpha + \theta_2 F + (L - \theta_2)\delta_{x_k}\big) \right]$
    $= (s-1) \sum_{j=1}^s p_j W\big(\alpha + \theta_2 F + (L - \theta_2)\delta_{x_j}\big)$
    $= (s-1)\, E\left[W(\alpha + \theta_2 F + (L - \theta_2)\delta_X) \mid F\right].$

Thus we have shown that $E[W(\alpha + \theta F + (L - \theta)\delta_X) \mid F]$ decreases in $\theta \in [0, L]$.

Proof of Theorem 2. (i) Assume $F$ has finite support. The claim obviously holds for $n = 1$. For $n \geq 2$ we use induction. In view of (2)-(4), we only need to show

    $E\left[W(MF + \delta_X, \alpha_2; A_n^1) \mid F\right] \geq E\left[W(\tilde{M}F + \delta_X, \alpha_2; A_n^1) \mid F\right]$    (12)

and

    $E\left[W(MF, \alpha_2 + \delta_Y; A_n^1) \mid \alpha_2\right] \geq E\left[W(\tilde{M}F, \alpha_2 + \delta_Y; A_n^1) \mid \alpha_2\right].$    (13)

By the induction hypothesis, (13) holds. Define $\eta = (\tilde{M} + 1)/(M + 1)$ and $\theta = \tilde{M}/\eta$. Noting $M < \theta < M + 1$, we may apply Lemma 4 and get

    $E\left[W(MF + \delta_X, \alpha_2; A_n^1) \mid F\right] \geq E\left[W(\theta F + (M + 1 - \theta)\delta_X, \alpha_2; A_n^1) \mid F\right]$
    $\geq E\left[W(\eta(\theta F + (M + 1 - \theta)\delta_X), \alpha_2; A_n^1) \mid F\right]$    (14)
    $= E\left[W(\tilde{M}F + \delta_X, \alpha_2; A_n^1) \mid F\right],$

where (14) holds by the induction hypothesis, as $\eta > 1$. Thus (12) holds as required.

(ii) Assume $F$ has bounded support. Then for arbitrary $\epsilon > 0$ we can construct two distributions $F^*$ and $F_*$, supported on $\{x_1, \ldots, x_s\}$ and $\{x_0, \ldots, x_{s-1}\}$ respectively, where $x_j = x_0 + j\epsilon$, such that $F(x_0) = 0$, $F(x_s) = 1$ and $F_*(x_{j-1}) = F^*(x_j) = F(x_j)$, $j = 1, \ldots, s$. By construction, $F_* \leq_{\mathrm{st}} F \leq_{\mathrm{st}} F^*$. Theorem 1 yields

    $W(MF_*, \alpha_2; A_n) \leq W(MF, \alpha_2; A_n) \leq W(MF^*, \alpha_2; A_n).$

Note that if $X \sim F^*$ then $X - \epsilon \sim F_*$. Therefore the bandits $(MF^*, \alpha_2; A_n)$ and $(MF_*, \alpha_2; A_n)$ can be coupled in an obvious way such that, for every strategy of $(MF^*, \alpha_2; A_n)$, there exists a strategy of $(MF_*, \alpha_2; A_n)$ under which the payoff at each stage is either the same (when arm 2 is selected) or exactly $\epsilon$ less (when arm 1 is selected). Thus we have shown

    $W(MF^*, \alpha_2; A_n) - W(MF_*, \alpha_2; A_n) \leq \epsilon \sum_{i=1}^n a_i.$
Hence $W(MF^*, \alpha_2; A_n) \to W(MF, \alpha_2; A_n)$ as $\epsilon \to 0$, and the monotonicity of $W(MF^*, \alpha_2; A_n)$ with respect to $M$ implies the corresponding monotonicity of $W(MF, \alpha_2; A_n)$.

(iii) Finally, assume $F$ is an arbitrary distribution with a finite mean. Suppose $X \sim F$. For $L > 0$ let $F^*$ be the distribution of $X^*$, defined as $X$ if $|X| \leq L$ and $0$ otherwise. We construct a coupling between $(MF, \alpha_2; A_n)$ and $(MF^*, \alpha_2; A_n)$. Let $X_k$ be the resulting observation when arm 1 of $(MF, \alpha_2; A_n)$ is pulled for the $k$th time. If $|X_1| \leq L$ then let $X_1^* = X_1$; otherwise $X_1^* = 0$, yielding $X_1^* \sim F^*$. For general $k \geq 1$, if $|X_i| \leq L$, $i = 1, \ldots, k$, then let $X_{k+1}^* = X_{k+1}$ if $|X_{k+1}| \leq L$ and $X_{k+1}^* = 0$ otherwise. In this case the conditional distribution of $X_{k+1}$ given $X_i$, $i = 1, \ldots, k$, is $(MF + \sum_{i=1}^k \delta_{X_i})/(M + k)$. Since $|X_i| \leq L$, $i = 1, \ldots, k$, we have $X_i^* = X_i$, $i = 1, \ldots, k$, and the conditional distribution of $X_{k+1}^*$ given $X_i^*$, $i = 1, \ldots, k$, is precisely $(MF^* + \sum_{i=1}^k \delta_{X_i^*})/(M + k)$. That is, $X_i^*$, $i = 1, \ldots, k+1$, can be regarded as successive pulls from arm 1 of $(MF^*, \alpha_2; A_n)$ as long as $|X_i| \leq L$, $i = 1, \ldots, k$. Let the $k$th pull from arm 2 be $Y_k$ for both bandits. In the event that all $|X_i| \leq L$, $i = 1, \ldots, n$, the optimal strategy for $(MF, \alpha_2; A_n)$ can be adopted for $(MF^*, \alpha_2; A_n)$ throughout, yielding identical pulls (not all $X_i$, $i = 1, \ldots, n$, are realized). By considering a trivial upper (respectively, lower) bound for the payoff of $(MF, \alpha_2; A_n)$ (respectively, $(MF^*, \alpha_2; A_n)$) when at least one $|X_i| > L$, we have

    $W(MF, \alpha_2; A_n) - W(MF^*, \alpha_2; A_n) \leq E\left[ 1_{\cup_{i=1}^n \{|X_i| > L\}} \sum_{i=1}^n \big(a_i(|Y_i| + |X_i|) - a_i(-|Y_i| - L)\big) \right]$
    $\leq E\left[ 1_{\cup_{i=1}^n \{|X_i| > L\}} \sum_{i=1}^n a^*(2|Y_i| + |X_i| + L) \right]$
    $\leq E\left[ \Big(\sum_{i=1}^n 1_{\{|X_i| > L\}}\Big) \sum_{i=1}^n a^*(2|Y_i| + |X_i| + L) \right]$
    $\equiv a^* h(L),$

where $a^* \equiv \max_{i=1}^n a_i$. Direct calculation using exchangeability yields

    $h(L) = n^2 \Pr(|X_1| > L)(2E|Y_1| + L) + n\, E\left[1_{|X_1| > L} |X_1|\right] + n(n-1)\, E\left[1_{|X_1| > L} |X_2|\right].$

The first two terms tend to zero as $L \to \infty$ by dominated convergence, since $E|X_1| < \infty$.
For the last term, by conditioning on $X_1$ we have

    $E\left[1_{|X_1| > L} |X_2|\right] = E\left[ 1_{|X_1| > L} \left( \frac{M}{M+1}\, E|X| + \frac{1}{M+1}\, |X_1| \right) \right],$

which also vanishes as $L \to \infty$. Thus

    $\limsup_{L \to \infty} \left[W(MF, \alpha_2; A_n) - W(MF^*, \alpha_2; A_n)\right] \leq 0.$

By a parallel argument, we get $\liminf_{L \to \infty} [W(MF, \alpha_2; A_n) - W(MF^*, \alpha_2; A_n)] \geq 0$. Thus $W(MF^*, \alpha_2; A_n)$ tends to $W(MF, \alpha_2; A_n)$ as $L \to \infty$, and the monotonicity of $W(MF, \alpha_2; A_n)$ with respect to $M$ is proved as before.

Remark 2. Clayton and Berry (1985) also conjecture that the monotonicity in Corollary 2 is strict if $n \geq 2$, $A_n = (1, 1, \ldots, 1)$, and $F$ is nondegenerate. This can be confirmed by a careful analysis of the above results. Some modifications are needed. Using arguments similar to steps