Demystifying the bias from selective inference: a revisit to Dawid's treatment selection problem

Jiannan Lu* and Alex Deng

Analysis and Experimentation, Microsoft Corporation

*Address for correspondence: Jiannan Lu, Microsoft Corporation, One Microsoft Way, Redmond, Washington 98052, USA. Email: [email protected]

Abstract

We extend the heuristic discussion in Senn (2008) on the bias from selective inference for the treatment selection problem (Dawid 1994) by deriving the closed-form expression for the selection bias. We illustrate the advantages of our theoretical results through numerical and simulated examples.

Keywords: Bayesian inference; posterior mean; selection paradox; multivariate truncated normal.

1. INTRODUCTION

Selective inference has gained popularity in recent years (e.g., Lockhart et al. 2014; G'Sell et al. 2016; Reid and Tibshirani 2016). To quote Dawid (1994), "... a great deal of statistical practice involves, explicitly or implicitly, a two stage analysis of the data. At the first stage, the data are used to identify a particular parameter on which attention is to focus; the second stage then attempts to make inferences about the selected parameter." Consequently, the results (e.g., point estimates, $p$-values) produced by selective inference are generally "cherry-picked" (Taylor and Tibshirani 2015), and it is therefore of great importance for practitioners to conduct "exact post-selection inference" (e.g., Tibshirani et al. 2014; Lee et al. 2015).

To demonstrate the importance of "exact post-selection inference," in this paper we focus on the "bias" of the posterior mean associated with the most extreme observation (formally defined later, and henceforth referred to as the "selection bias") in the treatment selection problem (Dawid 1994), which is not only fundamental in theory but also of great practical importance in, e.g., agricultural studies, clinical trials, and large-scale online experiments (Kohavi et al. 2013). In an illuminating paper, Senn (2008) provided a heuristic explanation that the existence of the selection bias depends on the prior distribution, and upheld Dawid's claim that the absence of selection bias in some standard cases is a consequence of using a particular conjugate prior. In this paper, we relax the modeling assumptions in Senn (2008) and derive the closed-form expression for the selection bias. Consequently, our work serves as a complement to the heuristic explanation provided by Senn (2008), and is useful from both theoretical and practical perspectives.

The paper proceeds as follows. Section 2 reviews the treatment selection problem, defines the selection bias, and describes the Bayesian inference framework on which the remainder of the paper is based. Section 3 derives the closed-form expression for the selection bias. Section 4 highlights numerical and simulated examples that illustrate the advantages of our theoretical results. Section 5 concludes and discusses future directions.

2. BAYESIAN INFERENCE FOR THE TREATMENT SELECTION PROBLEM

2.1. Treatment Selection Problem and Selection Bias

Consider an experiment with $p \ge 2$ treatment arms. For $i = 1, \ldots, p$, let $\mu_i$ denote the mean yield of the $i$th treatment arm. After running the experiment, we observe the sample mean yield of the $i$th treatment arm, denoted by $X_i$. Let

$$i^* = \arg\max_{1 \le i \le p} X_i$$

denote the index of the largest observation. The focus of selective inference is on $\mu_{i^*}$, which depends on $X_1, \ldots, X_p$.
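To fix ideas, the following minimal Monte Carlo sketch in Python (an illustration of ours, using an independent standard normal prior and arbitrary parameter values rather than the model introduced in Section 2.2) simulates the two-stage procedure and shows that the naive estimate $X_{i^*}$ tends to overshoot the true mean $\mu_{i^*}$ of the selected arm:

```python
import numpy as np

rng = np.random.default_rng(0)
p, sigma, reps = 10, 1.0, 200_000            # illustrative choices only

mu = rng.normal(size=(reps, p))              # true arm means, drawn from a toy prior
X = mu + sigma * rng.normal(size=(reps, p))  # observed sample mean yields
i_star = X.argmax(axis=1)                    # stage 1: select the best-looking arm
rows = np.arange(reps)

naive = X[rows, i_star]                      # plug-in estimate for the selected arm
truth = mu[rows, i_star]                     # the selected arm's true mean
print("average of X_{i*} - mu_{i*}:", round((naive - truth).mean(), 3))  # clearly positive
```

The systematic overshoot displayed above is what motivates comparing the naive and selection-adjusted posterior means of $\mu_{i^*}$, which we formalize next.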
We let $E(\mu_{i^*} \mid X_{i^*})$ denote the posterior mean of $\mu_{i^*}$ as if it had been selected before the experiment, and

$$E(\mu_{i^*} \mid X_{i^*}, X_{i^*} = \max_{1 \le i \le p} X_i)$$

denote the "exact post-selection" posterior mean of $\mu_{i^*}$, which takes the selection into account. Following Senn (2008), we define the selection bias as

$$\Delta = E(\mu_{i^*} \mid X_{i^*}) - E(\mu_{i^*} \mid X_{i^*}, X_{i^*} = \max_{1 \le i \le p} X_i). \qquad (1)$$

Having defined the selection bias, we briefly discuss the "selection paradox" in Dawid (1994), i.e., "since Bayesian posterior distributions are already fully conditioned on the data, the posterior distribution of any quantity is the same, whether it was chosen in advance or selected in the light of the data." In other words, if we instead define the selection bias as

$$\tilde{\Delta} = E(\mu_{i^*} \mid X_1, \ldots, X_p) - E(\mu_{i^*} \mid X_1, \ldots, X_p, X_{i^*} = \max_{1 \le i \le p} X_i),$$

then indeed $\tilde{\Delta} = 0$.

2.2. The Normal-Normal Model

Let $\mu = (\mu_1, \ldots, \mu_p)'$ and $X = (X_1, \ldots, X_p)'$. Following Dawid (1994), we treat them as random vectors. We generalize Senn (2008) and assume that

$$\mu \sim N(0, \Sigma_0), \qquad X \mid \mu \sim N(\mu, \Sigma), \qquad (2)$$

where

$$\Sigma_0 = \gamma^2 I_p + (1 - \gamma^2) 1_p 1_p', \qquad \Sigma = \sigma^2 \{\eta^2 I_p + (1 - \eta^2) 1_p 1_p'\}, \qquad 0 \le \gamma, \eta \le 1. \qquad (3)$$

To interpret (3), we let $X_i = \mu_i + \epsilon_i$, where $\mu_i$ is generated by

$$\phi \sim N(0, 1 - \gamma^2), \qquad \mu_i \mid \phi \sim N(\phi, \gamma^2),$$

and $\epsilon_i$ is generated by

$$\xi \sim N\{0, (1 - \eta^2)\sigma^2\}, \qquad \epsilon_i \mid \xi \sim N(\xi, \eta^2 \sigma^2).$$

Note that $\eta = 1$ in Senn (2008); we relax this assumption by allowing correlated errors.

2.3. Posterior Mean

To derive the posterior mean of $\mu_p$ given $X_1, \ldots, X_p$, we rely on the following classic result.

Lemma 1 (Normal Shrinkage). Let

$$\mu \sim N(\mu_0, \nu^2), \qquad Z_i \mid \mu \overset{\mathrm{iid}}{\sim} N(\mu, \tau^2) \quad (i = 1, \ldots, n).$$

Then the posterior mean of $\mu$ is

$$E(\mu \mid Z_1, \ldots, Z_n) = \frac{\tau^2 \mu_0 + \nu^2 \sum_{i=1}^{n} Z_i}{\tau^2 + n \nu^2}.$$

Proposition 1. The posterior mean of $\mu_p$ given $X_p$ is

$$E(\mu_p \mid X_p) = \frac{1}{1 + \sigma^2} X_p. \qquad (4)$$

Furthermore, let $a = \gamma^2 + \sigma^2 \eta^2$, $b = 1 - \gamma^2 + \sigma^2 (1 - \eta^2)$ and

$$r_1 = \cdots = r_{p-1} = \frac{\sigma^2 (\eta^2 - \gamma^2)}{a(a + pb)}, \qquad r_p = \frac{a + (p-1) b \gamma^2}{a(a + pb)}.$$

The posterior mean of $\mu_p$ given $X_1, \ldots, X_p$ is

$$E(\mu_p \mid X_1, \ldots, X_p) = \sum_{i=1}^{p} r_i X_i. \qquad (5)$$

Proof of Proposition 1. To prove the first half, notice that

$$\mu_p \sim N(0, 1), \qquad X_p \mid \mu_p \sim N(\mu_p, \sigma^2),$$

and apply Lemma 1. To prove the second half, note that $\mu_i = \phi + \mu_i'$, where

$$\phi \sim N(0, 1 - \gamma^2), \qquad \mu_i' \overset{\mathrm{iid}}{\sim} N(0, \gamma^2);$$

and $\epsilon_i = \xi + \epsilon_i'$, where

$$\xi \sim N\{0, (1 - \eta^2)\sigma^2\}, \qquad \epsilon_i' \overset{\mathrm{iid}}{\sim} N(0, \eta^2 \sigma^2).$$

Consequently we have

$$\phi + \xi \sim N(0, b), \qquad X_i \mid \phi + \xi \overset{\mathrm{iid}}{\sim} N(\phi + \xi, a).$$

On the one hand, by Lemma 1

$$E(\phi + \xi \mid X_1, \ldots, X_p) = \frac{b}{a + pb} \sum_{i=1}^{p} X_i,$$

and

$$E(\phi \mid \phi + \xi, X_1, \ldots, X_p) = \frac{1 - \gamma^2}{b} E(\phi + \xi \mid X_1, \ldots, X_p).$$

Consequently,

$$E(\phi \mid X_1, \ldots, X_p) = E\{E(\phi \mid \phi + \xi, X_1, \ldots, X_p) \mid X_1, \ldots, X_p\} = \frac{1 - \gamma^2}{b} E(\phi + \xi \mid X_1, \ldots, X_p) = \frac{1 - \gamma^2}{a + pb} \sum_{i=1}^{p} X_i. \qquad (6)$$

On the other hand, similarly we have

$$E(\mu_p' \mid X_1, \ldots, X_p) = \frac{\gamma^2}{a} E(\mu_p' + \epsilon_p' \mid X_1, \ldots, X_p) = \frac{\gamma^2}{a} \left\{ X_p - \frac{b}{a + pb} \sum_{i=1}^{p} X_i \right\}. \qquad (7)$$

Since $\mu_p = \phi + \mu_p'$, combining (6) and (7) completes the proof.

It is worth noting that when $\gamma = \eta$, (5) reduces to (4).
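The weights in Proposition 1 can also be obtained from the joint normality of $(\mu, X)$ under (2)-(3), for which $E(\mu \mid X) = \Sigma_0 (\Sigma_0 + \Sigma)^{-1} X$, so that the last row of $\Sigma_0 (\Sigma_0 + \Sigma)^{-1}$ is exactly $(r_1, \ldots, r_p)$. The short Python sketch below (an illustration of ours; the function names and parameter values are arbitrary) compares the two computations:

```python
import numpy as np

def weights_closed_form(p, sigma, gamma, eta):
    """Weights (r_1, ..., r_p) from Proposition 1."""
    a = gamma**2 + sigma**2 * eta**2
    b = 1 - gamma**2 + sigma**2 * (1 - eta**2)
    r = np.full(p, sigma**2 * (eta**2 - gamma**2) / (a * (a + p * b)))
    r[-1] = (a + (p - 1) * b * gamma**2) / (a * (a + p * b))
    return r

def weights_linear_algebra(p, sigma, gamma, eta):
    """Last row of Sigma_0 (Sigma_0 + Sigma)^{-1}, i.e. the weight vector in E(mu_p | X_1, ..., X_p)."""
    J = np.ones((p, p))
    Sigma0 = gamma**2 * np.eye(p) + (1 - gamma**2) * J
    Sigma = sigma**2 * (eta**2 * np.eye(p) + (1 - eta**2) * J)
    return (Sigma0 @ np.linalg.inv(Sigma0 + Sigma))[-1]

p, sigma, gamma, eta = 5, 0.8, 0.6, 0.9    # arbitrary test values
print(np.allclose(weights_closed_form(p, sigma, gamma, eta),
                  weights_linear_algebra(p, sigma, gamma, eta)))   # True
```

In particular, setting gamma equal to eta in this sketch makes all but the last weight vanish, matching the remark above that (5) then reduces to (4).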
3. CLOSED-FORM EXPRESSION FOR THE SELECTION BIAS

To simplify the notation, we assume that $X_p$ is the largest observation, i.e., $X_p = \max_{1 \le i \le p} X_i$. Consequently, the selection bias defined in (1) becomes

$$\Delta = E(\mu_p \mid X_p) - E(\mu_p \mid X_p, X_p = \max_{1 \le i \le p} X_i). \qquad (8)$$

To derive its closed-form expression, we rely on the following lemmas.

Lemma 2. Let $X_{-p} = (X_1, \ldots, X_{p-1})'$. Its distribution conditional on $X_p$ is

$$N\left( \frac{b}{a + b} 1_{p-1} X_p, \; a I_{p-1} + \frac{ab}{a + b} 1_{p-1} 1_{p-1}' \right). \qquad (9)$$

Proof of Lemma 2. By (2) we have $X \sim N(0, \Psi)$, where

$$\Psi = (\psi_{jk})_{1 \le j, k \le p} = a I_p + b 1_p 1_p'.$$

Furthermore, let

$$\Psi_{11} = (\psi_{jk})_{1 \le j, k \le p-1} = a I_{p-1} + b 1_{p-1} 1_{p-1}', \qquad \Psi_{22} = \psi_{pp} = a + b,$$

and

$$\Psi_{12} = (\psi_{1p}, \ldots, \psi_{p-1,p})' = b 1_{p-1}, \qquad \Psi_{21} = (\psi_{p1}, \ldots, \psi_{p,p-1}) = b 1_{p-1}'.$$

A standard multivariate normal argument gives

$$X_{-p} \mid X_p \sim N\left( \Psi_{12} \Psi_{22}^{-1} X_p, \; \Psi_{11} - \Psi_{12} \Psi_{22}^{-1} \Psi_{21} \right),$$

where

$$\Psi_{12} \Psi_{22}^{-1} X_p = \frac{b}{a + b} 1_{p-1} X_p$$

and

$$\Psi_{11} - \Psi_{12} \Psi_{22}^{-1} \Psi_{21} = a I_{p-1} + b 1_{p-1} 1_{p-1}' - \frac{b^2}{a + b} 1_{p-1} 1_{p-1}' = a I_{p-1} + \frac{ab}{a + b} 1_{p-1} 1_{p-1}'.$$

The proof is complete.

To state the next lemma, we introduce some notation. First, for $\theta = (\theta_1, \ldots, \theta_n)'$ and a positive semi-definite matrix $\Omega = (\omega_{jk})_{1 \le j, k \le n}$, let

$$Y = (Y_1, \ldots, Y_n)' \sim N(\theta, \Omega).$$

Second, let $V_i = Y_i - \theta_i$ for $i = 1, \ldots, n$. Consequently,

$$V = (V_1, \ldots, V_n)' \sim N(0, \Omega),$$

whose probability density function is

$$f(v) = \frac{1}{(2\pi)^{n/2} |\Omega|^{1/2}} e^{-\frac{1}{2} v' \Omega^{-1} v}, \qquad v = (v_1, \ldots, v_n)'.$$

Third, for constants $b_1, \ldots, b_n$, we let

$$\alpha = \Pr(V_1 \le b_1 - \theta_1, \ldots, V_n \le b_n - \theta_n) = \int_{v_1 \le b_1 - \theta_1, \ldots, v_n \le b_n - \theta_n} f(v) \, \mathrm{d}v,$$

and let $W = (W_1, \ldots, W_n)'$ be the truncated version of $V$ from above at $(b_1 - \theta_1, \ldots, b_n - \theta_n)'$. Consequently, its probability density function is

$$g(w) = \frac{1}{\alpha (2\pi)^{n/2} |\Omega|^{1/2}} e^{-\frac{1}{2} w' \Omega^{-1} w} \cdot 1_{\{w_1 \le b_1 - \theta_1, \ldots, w_n \le b_n - \theta_n\}}, \qquad w = (w_1, \ldots, w_n)'.$$

For all $k = 1, \ldots, n$, let the $k$th marginal density function of $W$ be

$$g_k(w) = \int_{-\infty}^{b_1 - \theta_1} \cdots \int_{-\infty}^{b_{k-1} - \theta_{k-1}} \int_{-\infty}^{b_{k+1} - \theta_{k+1}} \cdots \int_{-\infty}^{b_n - \theta_n} g(w_1, \ldots, w_{k-1}, w, w_{k+1}, \ldots, w_n) \prod_{l \ne k} \mathrm{d}w_l. \qquad (10)$$

For efficient analytical and numerical evaluations of (10), see Cartinhour (1990) and Wilhelm and Manjunath (2010), respectively.

Lemma 3. For all $i = 1, \ldots, n$,

$$E(Y_i \mid Y_1 \le b_1, \ldots, Y_n \le b_n) = \theta_i - \sum_{k=1}^{n} \omega_{ki} \, g_k(b_k - \theta_k).$$

Proof of Lemma 3. The proof follows Manjunath and Wilhelm (2012). First,

$$E(Y_i \mid Y_1 \le b_1, \ldots, Y_n \le b_n) = \theta_i + E(V_i \mid V_1 \le b_1 - \theta_1, \ldots, V_n \le b_n - \theta_n) = \theta_i + E(W_i). \qquad (11)$$

Next, the moment generating function of $W$ at $t = (t_1, \ldots, t_n)'$ is

$$m(t) = \int e^{t'w} g(w) \, \mathrm{d}w = \frac{1}{\alpha (2\pi)^{n/2} |\Omega|^{1/2}} \int_{w_1 \le b_1 - \theta_1, \ldots, w_n \le b_n - \theta_n} e^{-\frac{1}{2}(w' \Omega^{-1} w - 2 t' w)} \, \mathrm{d}w = \underbrace{e^{\frac{1}{2} t' \Omega t}}_{m_1(t)} \underbrace{\frac{1}{\alpha (2\pi)^{n/2} |\Omega|^{1/2}} \int_{w_1 \le b_1 - \theta_1, \ldots, w_n \le b_n - \theta_n} e^{-\frac{1}{2}(w - \Omega t)' \Omega^{-1} (w - \Omega t)} \, \mathrm{d}w}_{m_2(t)}.$$

On the one hand, by definition

$$E(W_i) = \left. \frac{\partial m(t)}{\partial t_i} \right|_{t=0} = m_1(0) \left. \frac{\partial m_2(t)}{\partial t_i} \right|_{t=0} + m_2(0) \left. \frac{\partial m_1(t)}{\partial t_i} \right|_{t=0} = \left. \frac{\partial m_2(t)}{\partial t_i} \right|_{t=0}. \qquad (12)$$

On the other hand, let

$$b_i^* = b_i - \theta_i - \sum_{k=1}^{n} \omega_{ik} t_k, \qquad i = 1, \ldots, n,$$

and we can rewrite $m_2(t)$ as

$$m_2(t) = \int_{-\infty}^{b_1^*} \cdots \int_{-\infty}^{b_n^*} g(w) \, \mathrm{d}w_1 \cdots \mathrm{d}w_n.$$

Therefore, by the chain rule and the Leibniz integral rule,

$$\frac{\partial m_2(t)}{\partial t_i} = \sum_{k=1}^{n} \frac{\partial b_k^*}{\partial t_i} \frac{\partial m_2(t)}{\partial b_k^*} = -\sum_{k=1}^{n} \omega_{ki} \int_{-\infty}^{b_1^*} \cdots \int_{-\infty}^{b_{k-1}^*} \int_{-\infty}^{b_{k+1}^*} \cdots \int_{-\infty}^{b_n^*} g(w_1, \ldots, w_{k-1}, b_k^*, w_{k+1}, \ldots, w_n) \prod_{l \ne k} \mathrm{d}w_l,$$

and consequently

$$\left. \frac{\partial m_2(t)}{\partial t_i} \right|_{t=0} = -\sum_{k=1}^{n} \omega_{ki} \, g_k(b_k - \theta_k). \qquad (13)$$

Combining (11), (12) and (13) completes the proof.
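Lemma 3 is easy to verify numerically on a small example. In the Python sketch below (an illustration of ours, with arbitrary $\theta$, $\Omega$ and $b$, and with scipy assumed available), the helper g_k evaluates the marginal density in (10) by factoring it into a univariate normal density times a conditional multivariate normal probability, and the resulting formula is compared with a brute-force Monte Carlo estimate of $E(Y \mid Y_1 \le b_1, \ldots, Y_n \le b_n)$:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal as mvn

rng = np.random.default_rng(1)

# A small example: Y ~ N(theta, Omega), truncated from above at b (arbitrary values).
theta = np.array([0.3, -0.2, 0.1])
Omega = np.array([[1.0, 0.4, 0.2],
                  [0.4, 1.5, 0.3],
                  [0.2, 0.3, 0.8]])
b = np.array([0.8, 0.5, 1.0])
n = len(theta)

alpha = mvn(mean=np.zeros(n), cov=Omega).cdf(b - theta)   # normalizing constant of the truncation

def g_k(k, x):
    """k-th marginal density of W (V truncated from above at b - theta), as in (10), evaluated at x."""
    idx = [j for j in range(n) if j != k]
    cond_mean = Omega[idx, k] * x / Omega[k, k]
    cond_cov = Omega[np.ix_(idx, idx)] - np.outer(Omega[idx, k], Omega[k, idx]) / Omega[k, k]
    tail = mvn(mean=cond_mean, cov=cond_cov).cdf((b - theta)[idx])
    return norm.pdf(x, scale=np.sqrt(Omega[k, k])) * tail / alpha

# Lemma 3: E(Y_i | Y <= b) = theta_i - sum_k omega_{ki} g_k(b_k - theta_k), for every i.
lemma3 = theta - Omega.T @ np.array([g_k(k, (b - theta)[k]) for k in range(n)])

# Brute-force Monte Carlo estimate of the same conditional mean.
Y = rng.multivariate_normal(theta, Omega, size=400_000)
kept = Y[(Y <= b).all(axis=1)]
print(np.round(lemma3, 3))             # closed form from Lemma 3
print(np.round(kept.mean(axis=0), 3))  # should agree up to Monte Carlo error
```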
Proposition 2. For $i = 1, \ldots, p-1$, let $h_i$ denote the $i$th marginal probability density function, as defined in (10), associated with the random vector in (9) truncated from above at $1_{p-1} X_p$. Then the closed-form expression for (8) is

$$\Delta = \frac{\sigma^2 (\eta^2 - \gamma^2)}{1 + \sigma^2} \sum_{i=1}^{p-1} h_i\!\left( \frac{\gamma^2 + \sigma^2 \eta^2}{1 + \sigma^2} X_p \right). \qquad (14)$$

Proof of Proposition 2. Applying Lemmas 2 and 3 to (9), for $i = 1, \ldots, p-1$ we obtain

$$E(X_i \mid X_p, X_p = \max_{1 \le j \le p} X_j) = \frac{b}{a + b} X_p - \underbrace{\left\{ \frac{ab}{a + b} \sum_{j=1}^{p-1} h_j\!\left( \frac{a}{a + b} X_p \right) + a \, h_i\!\left( \frac{a}{a + b} X_p \right) \right\}}_{\delta_i}.$$

Consequently, by (5) we have

$$E(\mu_p \mid X_p, X_p = \max_{1 \le j \le p} X_j) = r_p X_p + \sum_{i=1}^{p-1} r_i E(X_i \mid X_p, X_p = \max_{1 \le j \le p} X_j)$$
$$= \left( r_p + \frac{b}{a + b} \sum_{i=1}^{p-1} r_i \right) X_p - \sum_{i=1}^{p-1} r_i \delta_i$$
$$= \frac{X_p}{a + b} - \left\{ \frac{(p-1)ab}{a + b} + a \right\} \sum_{i=1}^{p-1} r_i h_i\!\left( \frac{a}{a + b} X_p \right)$$
$$= E(\mu_p \mid X_p) - \frac{\sigma^2 (\eta^2 - \gamma^2)}{1 + \sigma^2} \sum_{i=1}^{p-1} h_i\!\left( \frac{\gamma^2 + \sigma^2 \eta^2}{1 + \sigma^2} X_p \right).$$

The proof is complete.

Proposition 2 confirms the existence of the selection bias in general. Furthermore, it provides the following interesting insights:

1. For fixed $\sigma$, $p$ and $X_p$, the sign of the selection bias is the same as the sign of $\eta^2 - \gamma^2$; i.e., depending on the correlation structures in (3), neglecting the fact that $X_p = \max_{1 \le i \le p} X_i$ can lead to either over-estimating or under-estimating $\mu_{i^*}$. In particular, the selection bias is zero when $\gamma = \eta$. This is a generalization of the first main result in Senn (2008), which assumes that $\gamma = \eta = 1$.

2. For fixed $\gamma$, $\eta$, $p$ and $X_p$, the selection bias goes to zero as $\sigma$ goes to zero. This is intuitive, because $X_p$ approaches $\mu_p$ as $\sigma$ goes to zero, and therefore the fact that $X_p = \max_{1 \le i \le p} X_i$ becomes irrelevant.

3. For fixed $\sigma$, $\gamma$, $\eta$ and $p$, the selection bias disappears for sufficiently large $X_p$. This is because when $X_p$ goes to infinity,

$$h_i\!\left( \frac{\gamma^2 + \sigma^2 \eta^2}{1 + \sigma^2} X_p \right) \to 0, \qquad i = 1, \ldots, p-1.$$

This result is in connection with Dawid (1973).
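These observations are straightforward to reproduce numerically. For $p = 2$ the sum in (14) has a single term and $h_1$ is a univariate truncated normal density, so $\Delta$ has a fully explicit form. The Python sketch below (an illustration of ours; the function names and parameter values are arbitrary, and scipy is assumed available) compares this explicit form with a Monte Carlo evaluation of (8) built from Lemma 2 and (5), and reproduces the sign pattern described in the first insight above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def delta_closed_form(sigma, gamma, eta, x_p):
    """Selection bias (14) for p = 2, where h_1 is a univariate truncated normal density."""
    a = gamma**2 + sigma**2 * eta**2
    b = 1 - gamma**2 + sigma**2 * (1 - eta**2)        # note a + b = 1 + sigma^2
    cut = a / (a + b) * x_p                           # truncation point b_1 - theta_1
    omega = a + a * b / (a + b)                       # variance in (9) for p = 2
    h1 = norm.pdf(cut, scale=np.sqrt(omega)) / norm.cdf(cut / np.sqrt(omega))
    return sigma**2 * (eta**2 - gamma**2) / (1 + sigma**2) * h1

def delta_monte_carlo(sigma, gamma, eta, x_p, reps=2_000_000):
    """Monte Carlo version of (8) for p = 2, using Lemma 2 for X_1 | X_2 and (5) for the posterior mean."""
    a = gamma**2 + sigma**2 * eta**2
    b = 1 - gamma**2 + sigma**2 * (1 - eta**2)
    r1 = sigma**2 * (eta**2 - gamma**2) / (a * (a + 2 * b))
    r2 = (a + b * gamma**2) / (a * (a + 2 * b))
    x1 = rng.normal(loc=b / (a + b) * x_p, scale=np.sqrt(a + a * b / (a + b)), size=reps)
    x1 = x1[x1 <= x_p]                                # keep draws consistent with X_2 being the maximum
    return x_p / (1 + sigma**2) - (r1 * x1.mean() + r2 * x_p)

for gamma, eta in [(0.5, 0.9), (0.9, 0.5), (0.7, 0.7)]:
    print(gamma, eta,
          round(delta_closed_form(1.0, gamma, eta, 1.0), 4),
          round(delta_monte_carlo(1.0, gamma, eta, 1.0), 4))
```

Up to Monte Carlo and floating-point error, the two evaluations agree: the bias is positive when $\eta > \gamma$, negative when $\eta < \gamma$, and zero when $\gamma = \eta$, in line with Proposition 2.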
