Asymptotics of selective inference Xiaoying Tian, Jonathan Taylor∗ 6 Department of Statistics 1 Stanford University 0 Sequoia Hall 2 Stanford, CA 94305, USA g e-mail: [email protected] [email protected] u A Abstract: In this paper, we seek to establish asymptotic results for se- lective inference procedures removing the assumption of Gaussianity. The 4 classofselection procedures weconsider aredeterminedbyaffineinequal- ities, which we refer to as affine selection procedures. Examples of affine ] selection procedures include selective inference along the solution path of T theLASSO,aswellasselectiveinferenceafterfittingtheLASSOatafixed S valueoftheregularizationparameter.Wealsoconsidersometestsinpenal- . ized generalized linear models. Our result proves asymptotic convergence h inthehighdimensionalsettingwheren<p,andncanbeofalogarithmic t a factorofthedimensionpforsomeprocedures.Ourmethodofproofadapts m amethodofChatterjee(2005). [ AMS 2000 subject classifications:Primary62M40;secondary62J05. Keywords and phrases: selective inference, non-gaussian error, high- 2 dimensionalinference,LASSO. v 8 8 1. Introduction 5 3 0 Selective inference is a recent research topic that studies valid inference after a . statisticalmodelissuggestedbythedataFithian et al.(2014),Lee et al.(2013), 1 0 Taylor et al. (2014, 2013). Classical inference tools break down at this point as 5 the data used for the hypothesis test is allowed to be the data used to suggest 1 the hypothesis. Specifically, instead of being given a priori, the hypothesis to : v testisdependentonthedata,thusrandom.Formally,denotedby ∗ = ∗(y,X) E E i is the model selection procedure, which generates a set of hypotheses to test, X or perhaps parameters for which to form intervals. It is useful to think of ∗ r E a as a point process with values in , where is some collection of questions of S S possible interest. Consider the following example, Suppose y X G with y Rn,X Rn×p, X fixed. For any E 1,...,p | ∼ ∈ ∈ ⊂ { } define the functionals β (G)=eT argminE ( y X β 2 X) j E, j,E j G k − E Ek | ∈ β where e is the unit vector with only the j-th entry being 1. Such functionals j β isessentiallythebestlinearcoefficientswithinthemodelconsistingofonly j,E ∗SupportedinpartbyNSFgrantDMS1208857andAFOSRgrant113039. 1 Tian and Taylor/Asymptotics of selective inference 2 variables in E. Then the collection of possily interesting questions are = β , j E :E 1,...,p . j,E S { ∈ } ⊂{ } (cid:26) (cid:27) The data (y,X) will then suggest a subset of interesting variables E, and ∗(y,X)designatesthe targetfor inference to be β , j E ,the best linear j,E E { ∈ } coefficient within a model consisting of only the variables in E. Previous literature has studied inference after different model selection pro- cedures ∗. Notably Lee et al. (2013) proposed an exact test within the model E suggested by LASSO, that is ∗ = β , j E , where E is the active set of j,E E { ∈ } the LASSO solution. The test is based upon a pivotal quantity which the au- thorsproveto be distributed as Unif(0,1)if the hypothesis to be tested is true. Thus such quantity P (y) can be used to test the hypothesis H : β = 0, j 0j j,E and control the “Type-I error” at level α, P(P (y) α H is true) α. (1) j 0j ≤ | ≤ By inverting such tests, Lee et al. (2013) can also construct valid confidence intervals for β . j,E ItisofcourseworthnoticingthateitherthehypothesisH ortheparameters 0j β are randomas E is suggestedby the data.So the “Type-Ierror”(1) is not j,E the classical Type-I error definition where the hypotheses are given a priori. Such inference framework is first considered in Berk et al. (2013), and we leave the philosophical discussions of such approachto Fithian et al. (2014). The means by which Lee et al. (2013) controlsthe “Type-I error”is through constructing the p-value functions P . Such construction is highly dependent j on the assumption of normality of the error distribution. Other works like Lockhart et al. (2013), Taylor et al. (2014) used similar approaches.Compared tothesepreviouswork,weseektoremovetheGaussianassumptionontheerrors and establish asymptotic distributions of P in this work. We state the condi- j tions under which P will be asymptotically distributed as Unif(0,1), and thus j P canbeusedasp-valuestotestthehypothesesandasymptoticallycontrolthe j “Type-I error” in (1). This allows asymptotically valid inference in the linear regression setting without normality assumptions. It also allows application of covariance test (Lockhart et al. 2013) in generalized linear models. 1.1. Related works Tibshirani et al.(2015)alsoconsidersuniformconvergenceofthestatisticspro- posed by Taylor et al. (2014), but focuses mainly on the low dimensional case. In the high dimensional case, they have a negative result on the uniform con- vergence of the pivot. In this paper, we instead focus on the high dimensional caseandstatetheconditionsinwhichthepivotwillconverge.Morespecifically, n is allowed to be of a logarithmic factor of the dimension p for two common procedures introduced in Section 4. Tian and Taylor/Asymptotics of selective inference 3 IntheworksofBelloni et al.(2012),Meinshausen et al.(2012),Zhang & Zhang (2014), Javanmard& Montanari (2015), the authors proposed various ways of constructingconfidenceregionsfortheunderlyingparametersinthehigh-dimensional setting. One major difference between these works and our framework is that they try to achieve full model inference without using the data to choose a hypothesis. The advantage of such approach is robustness. But in the high- dimensional setting, with tens of thousands of potential variables, it is natu- ral to use the data to select hypotheses of interest and perform valid infer- ence only for those hypotheses. In addition, some of the full model inference works require conditions of linear underlying model Meinshausen et al. (2012), Javanmard& Montanari(2015)whichtheframeworkofselectiveinferencedoes not require. For more philosophical discussions on the comparisons of the two approaches,see Fithian et al. (2014). 1.2. Organization of the paper In Section 2, we formally introduce the methods for selective inference with certain model selection procedures, which we call affine selection procedures. In Section 3, we state the main theorem that will allow asymptotically valid inference. In Section 4, we will illustrate the applications of our results to two selective inference problems, selective inference after solving the LASSO at a fixed λ, and the covariance test for testing the global null in generalized linear models.We collectalltheproofsinSection5anddicussthe directionsoffuture researchin Section 6. 2. Selective inference with affine selection procedures Suppose we have a design matrix X Rn×p, considered fixed, and ∈ y x indG(µ(x ),σ2(x )) (2) i i i i | ∼ wherex isthei-throwofthematrixX andG(µ,σ2)denotesanyone-dimensional i distributionwithmeanµandvarianceσ2.Wealsodenoteµ(X)=(µ(x ),...,µ(x )) 1 n and Σ(X)=diag(σ2(x ),...,σ2(x )), a diagonal matrix with σ2(x ) as the di- i n i agonal entries. Some feature selection procedure is then applied on the data to select a subet E 1,2,...,p and the target of inference will be ∗(y,X) = ⊆ { } E β , j E . In general, we consider certain selection procedures called the j,E { ∈ } affine selection procedures, Definition 1 (Affine selectionprocedure). Suppose a model selection procedure ∗ :Rn Rn×p , where is a finite set of models, E × →S S = ,..., . 1 |S| S {E E } We call ∗ an affine selection procedure, if the selection event can be written E as an affine set in the first argument of ∗. Formally, ∗ is an affine selection E E Tian and Taylor/Asymptotics of selective inference 4 procedure if for each potential model to be selected , E ∈S ∗(z,X)= = A( ,X)z b( ,X) , (z,X) Rn Rn×p. (3) {E E} { E ≤ E } ∈ × where A Rk×n, b Rk and k N are dependent only on and X. Moreover, ∈ ∈ ∈ E the sets A( ,X)z b( ,X) Rn, i=1,..., i i { E ≤ E }⊆ |S| are disjoint or their intersections have measure 0 under the Lebesgue measure on Rn. Examples of affine selection proceduresinclude selection proceduresthat are based on E, the set of variables chosen by the data and usually some other information1. Various algorithms can be used to select E, e.g. E as the active set of the LASSO solution at a fixed λ (Lee et al. 2013), E as the first variable to enter the LASSO or LARS path (Lockhart et al. 2013), (more generally any ℓ penalized generalized linear models) or E as the k variables included at the 1 k-th step of forwardstepwise selection (Taylor et al. 2014). The works of Lee et al. (2013), Lockhart et al. (2013), Taylor et al. (2014, 2013)haveconstructedvalidp-valueswhenthefamilyGistheGaussianfamily. Formally, the pivotal function depends on the following quantities, 2.1. Notations The pivotal function is determined by the following functions. For any A Rk×n, b Rk, Σ Rn×n and η Rn, we define ∈ ∈ ∈ ∈ AΣη α=α(A,b,Σ,η)= , (4) ηTΣη b (Az) +α ηTz j j j L(z;A,b,Σ,η)= max − , (5) αj<0 αj b (Az) +α ηTz j j j U(z;A,b,Σ,η)= min − . (6) αj>0 αj Furthermore, we define Φ((x m)/σ) Φ((a m)/σ) F(x;σ2,m,a,b)= − − − Φ((b m)/σ) Φ((a m)/σ) − − − which is the CDF of the univariate Gaussian law N(m,σ2) truncated to the interval [a,b]. 2.2. A pivotal quantity with Gaussian errors Theorem 1 provides the construction of a pivotal function when the data is normally distributed and ∗ is an affine selection procedure. We denote the E 1SeeSection4fordetails Tian and Taylor/Asymptotics of selective inference 5 responsevariablestobe whenGisthe Gaussianfamily to distinguishitfrom Y y where G is a more general location-scale family. Note all distributions in this paper are conditional on X, that is the law we consider are either ( X) or L Y| (y X). All random variables have access to X as if it were a constant. L | Theorem 1 (Lee et al. (2013)). Suppose X Rn×p and N(µ(X),Σ(X)), µ(X) Rn, Σ(X) Rn×n and ∗ isanaffine∈selectionproYced∼ureonRn Rn×p. Then ∈any for any η∈:Rn RnEmeasurable with respect to σ( ∗) we hav×e → E F(η( ∗)T ;η( ∗)TΣη( ∗),η( ∗)Tµ,LE∗( ),UE∗( )) ∗( ,X)= E Y E E E Y Y E Y E Unif(0,1), (7) (cid:12) (cid:12)∼ where ∗(z,X)= A( ,X)z b( ,X) and E E ⇐⇒ E ≤ E L (z)=L(z,A( ,X),b( ,X),Σ,η( )) (8) E E E E U (z)=U(z,A( ,X),b( ,X),Σ,η( )). (9) E E E E Moreover, marginalizing over the selection procedure ∗, we have the following E F(η( ∗)T ;η( ∗)TΣη( ∗),η( ∗)Tµ,LE∗( ),UE∗( )) Unif(0,1). (10) E Y E E E Y Y ∼ The significance of Theorem 1 is that assuming the diagonal matrix Σ is known,theonlyunknownparameterforthepivotalquantity(10)isηTµ.Totest thehypothesisH :ηTµ=0,wejustneedtopluginthevalueandthencompute 0 (10), which then can be used as a p-value to accept/reject the hypothesis. For example, if we take η =X (XTX )−1e , (11) E E E j where e is the unit vector with only the j-th entry being 1, ηTµ = β . The j j,E quantityin(10)ispivotalandcanbeusedtotestthehypothesisH :β =0, 0j j,E and control the “Type-I errpr” (1). Since X is fixed, we use the shorthand ∗(z)= ∗(z,X), A( )=A( ,X), b( )=b( ,X). E E E E E E 3. Asymptotics with non-Gaussian error Now if we remove the assumption that the error ( X) = N(µ(X),Σ), the L Y| conclusionof Theorem1 does not holdany more.The best we canhope for is a weakconvergenceresultthatthesamepivotalquantities(10)wouldconvergeto Unif(0,1) (as n ). This requires some conditions on both the distribution → ∞ (y X) and the selection procedure ∗. Our main contribution in this work, L | E Theorem 3 establishes conditions on (y X) and ∗ under which the pivotal L | E quantity (10) is asymptotically distributed as Unif(0,1). The main approach is to compare the distribution of the pivots (10) under the distribution (y X) with that under Gaussian distribution ( X). In the L | L Y| latter case, the exact distribution is derived in Theorem 1. In the following, we establish the conditions where the above two distributions are comparable. Tian and Taylor/Asymptotics of selective inference 6 3.1. Bounding the influence function Note the pivotal quantity in (10) depends on y either through the linear func- tions ηTy or the maximum/minimum of linear functions LE∗(y), UE∗(y). In approximating the exact Gaussian theory with asymptotic results a quantity analogoustoaLipschitzconstant(iny)willbenecessary,expressingthechanges inηTy aswellasthe upperandlowerboundsLE∗ andUE∗.This,insomesense, describes the influence each y can have on the pivotal quantity (10). i For an affine selectionprocedure ∗ :Rn Rn×p , without loss of gener- E × →S ality suppose ∗ is surjective. Since ∗ is affine, for any model , there are E E E ∈S the associated A( ) and b( ) as defined in (3). We define E E A( ) ij M( ,η)= max E + η( ) , ∞ E 1≤i≤1n≤rjo≤w(nA(E))(cid:12)(cid:12)(A(E)Ση(E))i(cid:12)(cid:12) k E k (12) M( ∗,(cid:12)(cid:12)η)=maxM( ,(cid:12)(cid:12)η). E E∈S E We also define r( )=nrow(A( )), r( ∗)=maxnrow(A( )). (13) E E E E∈S E The quantity M( ∗,η) measures the maximal influence any y has on a i E smoothedversionofthetriple(η( ∗)Ty,LE∗(y),UE∗(y)).AsM( ∗,η)andr( ∗) E E E arecriticalinboundingthedifferencebetween (y X)and ( X),itisimpor- L | L Y| tant to get a sense of their size. Typically r( ∗) is less than p, and we discuss E the typical size of M( ∗,η) through the following simple example: E Example 3.1. Suppose the design matrix X Rn×p is generated in the fol- lowing way: we first generate each row independ∈ently from a distribution on Rp, and then normalize the column of X to have length 1. Suppose instead of using data to select a model, we just arbitrarily choose a subset E. This is equal to no selection at all, thus M( ,η) = η( ) . If we want to perform inference for ∞ E k E k β , we take j,E η =X (XTX )−1e , E E E j wheree istheunitvectorwithonlythej-thcoordinatebeing1.Sincewenormal- j ize the columns, it is not hard to verify (XTX )−1 =O (1), and max (X )= E E p ij ij O (n−1/2), thus if the selected variables set always satisfies E n, η = p | | ≪ O (n−1/2). Therefore M( ∗,η)=O (n−1/2). p p E This is a very simple example which does not involve selection. In reality we will some meaningful selection procedure that uses the data so M( ∗,η) E would involve A( ∗) and b( ∗) as well. However, we will see through examples E E in Section 4 that it is still reasonable to assume M( ∗,η)=O(n−1/2). E Thefollowingtheoremcomparesthedistributionof(η( ∗)Ty,LE∗(y),UE∗(y)) E under (y X) and its Gaussian counterpart. L | Theorem 2. Fix X Rn×p. Suppose (y, ) are defined conditionally indepen- ∈ Y dent given X on a common probability space such that Tian and Taylor/Asymptotics of selective inference 7 (y X)has independent entrieswithmean vector µandcovariance matrix • L | variance Σ and finite third moments bounded by γ; ( X)=N(µ,Σ). • L Y| Suppose we are given η σ( ∗), then given any bounded function W C3(R3;R) with bounded deriv∈ativeEs satisfying ∈ 0 if v u w W(u,v,w)= ≥ ≤ ≤ (=0 else there exists N = N(M( ∗,η), ,r( ∗),W), such that the following holds for E |S| E n,p N, ≥ EW η( ∗)Ty,LE∗(y),UE∗(y) EW η( ∗)T ,LE∗( ),UE∗( ) E − E Y Y Y (cid:12)(cid:12) (cid:2) (cid:3) (cid:2) 4 15 (cid:3)(cid:12)(cid:12) (cid:12) C(W,γ) log r( ∗) nM( ∗,η)3 (cid:12) (14) ≤ E |S| E (cid:20)(cid:18) (cid:19) (cid:21) (cid:0) (cid:1) where C(W,γ) is a constant depending only on the derivatives of W and γ, and η( ∗) is η( ∗(y)) or η( ∗( )) depending on the context. E E E Y AsitisreasonabletoassumeM( ∗,η)=O(n−1/2),itisreasonabletoassume E theRHSof (14)goestozero.Thusthedistributionof(η( ∗)Ty,LE∗(y),UE∗(y)) E is close to that of (η( ∗)T ,LE∗( ),UE∗( )). In the following, we discuss the E Y Y Y conditions under which the pivotal quantity (10) converges. 3.2. Smoothness of the pivot Note the bound in (14) also depends on C(W,γ), the derivatives of W. Thus besidestheinfluenceofeachy on(10),itisalsonecessarytocontrolthesmooth- i ness of the (10). In particular, the pivot in (10) takes the form of a truncated Gaussiancdf.Moreover,thesmoothness(derivatives)ofthetruncatedGaussian cdf F(x;σ2,m,a,b) can depend heavily on the truncation interval [a,b]. More specifically, a lower bound on the denominator of F(x;σ2,m,a,b) puts some constraintsonthewidthoftheinterval[a,b]aswellasitsdistancetotheorigin. In our context, a,b corresponds to the upper and lower bounds appearing in (10). Formally, we assume the following assumption: Assumption 1. Suppose we have Xn Rn×pn and yn Rn is generated ∈ ∈ according to (2), and is generated independently (conditional on X ) from n n Y N(µ(X ),Σ(X )) a Gaussian distribution with the same means and variances. n n Wealsohaveaffineselectionprocedures ∗ = ∗.Weassumethereexistsδ 0 E En n → such that P(UE∗(yn) LE∗(yn)<δn) 0, − → P(UE∗( n) LE∗( n)<δn) 0, Y − Y → (15) P(min(UE∗(yn), LE∗(yn))>1/δn) 0, | | | | → P(min(UE∗( n), LE∗( n))>1/δn) 0. | Y | | Y | → Tian and Taylor/Asymptotics of selective inference 8 The firsttwoconditionsin(15)puts alowerboundonthe widthofthe trun- cationinterval(LE∗(yn),UE∗(yn)).Thelasttwoconditionsmakessurethetrun- cation will not appear too far from the origin and thus we will have reasonable behavior in the tail. δ is the rate at which the truncation interval will shrink n (orthedistanceofthetruncationintervaltotheorigin).Thisratewillappearin the RHS of (14) and thus we impose a condition on (δ ,M( ∗,η ),r( ∗), ) n En n En |Sn| to ensure the convergence of the pivot (10). 3.3. Main result Suppose we have Xn Rn×pn and yn Rn is generated according to (2). We ∈ ∈ denoteits distributionas (y X ).Theconvergencementionedbelow isunder n n L | this sequence of distributions (y X ) ∞ . {L n| n }n=1 Theorem 3 (Convergence of the pivot). Suppose we have a sequence of y n generated as above with means µ = µ(X ), and variances Σ = Σ(X ) and n n n n have finite third moments. We also assume Assumption 1 is satisfied with a sequence of δ . Furthermore, let ∗ be a sequence of affine selection procedures, n En η =η( ∗), and the corresponding M( ∗,η ), r( ∗) and properly defined as n En En n En Sn in Section 3.1. Then if 4 1/δ6 M( ∗,η )3 n log r( ∗) +log 0, as n , n· En n · En |Sn| → →∞ (cid:20) (cid:21) (cid:0) (cid:1) (cid:0) (cid:1) we have P(ηnTyn;ηnTΣnηn,ηnTµn,LEn∗,UEn∗)→d Unif(0,1), n→∞, (16) where P(x;σ2,m,a,b)=2min(F(x;σ2,m,a,b),1 F(x;σ2,m,a,b)) is the two- − sided pivot. Inthefollowingsection,weapplyTheorem3todifferentselectionprocedures. 4. Examples WegivetwoexamplesinthissectionastheapplicationsofTheorem3.Thefirst exampleistoperformselectiveinferenceaftersolvingtheLASSOandthesecond isto testthe globalnullingeneralizedlinearmodels.Inthese twoexamples,we will explain why the selection procedure is affine, what is the data distribution (y X ) and the quantities (δ ,M( ∗,η ),r( ∗), ). To ease the notations, L n| n n En n En |Sn| we suppress the dependencies on n whenever possible. It is helpful to keep in mind that yn Rn, Xn Rn×pn. ∈ ∈ 4.1. Inference for LASSO with non-Gaussian errors Consider the linear model y =Xβ0+ǫ, X Rn×p, ǫ iidG(0,σ2), (17) i ∈ ∼ Tian and Taylor/Asymptotics of selective inference 9 where σ is known and the distribution G has finite third moments, but is not necessarily Gaussian. Tibshirani(1996)proposedthenowfamousLASSO.Wegetasparsesolution βˆ by solving 1 βˆ=minimize y Xβ 2+λ β , (18) β∈Rp 2k − k2 k k1 whereλ>0isthefixedregularizationparameter.WechooseλasinNegahban et al. (2012).IfwenormalizethecolumnsofX tohavenorm1,Negahban et al.(2012) chooses λ to be O(√logp). 4.1.1. Affine selection procedure As inLee et al. (2013), we solve(18)andgetasolutionβˆ.Now we considerthe selection procedure based on (E,z ), where E E =supp(βˆ), z =sign(βˆ ), E E where βˆ is βˆ restricted to the active set E. Note this is different from the E selectionprocedurebasedonlyonE butiscloselyrelated,fordetaileddiscussion see Lee et al. (2013). The authors in Lee et al. (2013) proved such selection procedure is equivalent to the affine constraints A(E,z )y b(E,z ), where E E ≤ A(E,z )= diag(z )(XTX )−1XT, E − E E E E (19) b(E,z )= λdiag(z )(XTX )−1z . E − E E E E To test the hypothesis H : β = 0 for any j E, we choose η to be as in 0j j,E ∈ (11). In this case, a simple calculation will put the number of possible states at =2p,whichwillcausetheboundin(14)toblowupwhenp>n.However,the |S| choice of λ =O(√logp) (Negahban et al. 2012) together with other conditions will ensure is polynomial in p with high probability. |S| 4.1.2. Number of states for λ=O(√logp) |S| Suppose X is column standardized to be mean zero and norm 1, we first intro- duce the restricted strong convexity condition for matrix X. Definition 2 (Restricted strong convexity Negahban et al. (2012)). We say X Rn×p satisfies the restricted strongconvexity condition for index set A with ∈ constant m>0 if Xv 2 m v 2, k k2 ≥ k k2 for all v ∆ Rp : ∆ 3 ∆ . Ac 1 A 1 ∈{ ∈ k k ≤ k k } Now we define the assumptions needed to ensure is polynomial in p with |S| high probability. Tian and Taylor/Asymptotics of selective inference 10 Assumption 2. X satisfies the restriced convexity condition for A=supp(β0) with constant m, and φmax, the biggest eigenvalue of XTX is bounded by a constant Q. Assumption 3. ǫ are sub-Gaussian errors with known variance σ2. i Assumption 4. The signal is sparse. More specifically, k = supp(β0) is | | bounded by a constant K. Following Negahban et al. (2012), Lemma 1 shows with the above assump- tions, the effective size of is polynomial in p with high probability. |S| Lemma 1. With Assumptions 2-4, if we solve (18) with λ 4σ√logp and get ≥ active set E, then with probability at least 1 c exp( c λ2), 1 1 − − 16Q2 E K | |≤ m2 · where c is some constant that depends on m and the subgaussian constant of 1 the error ǫ. Thus, with probability 1 c exp( c λ2), 1 1 − − 16Q2 pcK, c= . |S|≤ m2 The proof of Lemma 1 is deferred to the appendix. Having controlled , |S| now we need to get a bound for the influences. 4.1.3. Bounding the influence M( ∗,η) E Assume we have normalized the design matrix X columnwise so that each col- umn has norm 1. We further assume the following assumption on X, Assumption 5. Suppose we solve problem (18) with X and get the active set E. Let φmin be the smallest eigenvalue for submatrices of size less than n E , ×| | more specifically, Xv 2 φmin = min k k2. v∈Rp,kvk0≤|E| kvk22 We assume φmin ν >0. ≥ Lemma 2. Suppose X satisfies Assumption 5, then E max (XTX )−1XT | | max X . i,j E E E ij ≤ ν2 · i,j | ij| (cid:12) (cid:12) (cid:12)(cid:0) (cid:1) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 4.1.4. Choice of δ in Assumption 1 n If we normalize the columns of X to have norm 1 and choose λ = O(√logp) in (18) as in Negahban et al. (2012). Then we assume Assumption 1 is satisfied with δ =O((√logp )−1−κ), for any small κ>0. n n