The Linearity Condition and Adaptive Estimation in Single-index Regressions

Yongwu Shao, Dennis Cook*, and Sanford Weisberg
School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA.

Sep 15, 2006

arXiv:1001.4802v1 [math.ST] 26 Jan 2010

Abstract

We show that under a linearity condition on the distribution of the predictors, the coefficient vector in a single-index regression can be estimated with the same efficiency as in the case when the link function is known. Thus, the linearity condition seems to substitute for knowing the exact conditional distribution of the response given the linear combinations of the predictors.

* Corresponding author. E-mail: [email protected]
† R. D. Cook and Y. Shao were supported by Grant DMS-0405360 from the National Science Foundation.

1 Introduction

1.1 Single-index Regressions

Consider a continuous univariate response $Y$ and a vector of continuous predictors $X \in \mathbb{R}^p$. The most general goal of a regression is to infer about the conditional distribution of $Y \mid X$. In this paper we consider single-index regressions, in which $Y \mid X$ depends on $X$ through at most one linear combination $\beta_0^T X$ of the predictors.

Focusing on the mean function $E(Y \mid X)$, Härdle and Stoker (1989) developed a nonparametric method called average derivative estimation for estimating $\beta_0$ in the single-index conditional mean $E(Y \mid X) = g(\beta_0^T X)$, where the mean function $g$ is unknown. Weisberg and Welsh (1994) considered the case in which $Y \mid X$ follows a generalized linear model, where the linear coefficient $\beta_0$ and the link function are unknown. Both pairs of authors gave estimates for $\beta_0$ that are $\sqrt{n}$-consistent.

Yin and Cook (2005) proposed the problem of single-index regressions, in which the conditional distribution of $Y \mid X$ is completely characterized by a linear combination $\beta_0^T X$, so there is no loss of information about $Y$ if we replace $X$ with $\beta_0^T X$. More specifically, we assume that

$$Y \perp\!\!\!\perp X \mid \beta_0^T X, \qquad (1)$$

where for identifiability purposes we require that $\|\beta_0\| = 1$. Condition (1) is equivalent to the statement that $Y \mid X$ has a conditional density $\eta_0(y \mid \beta_0^T x)$, where $\eta_0$ is unknown. Single-index regression is a special case of sufficient dimension reduction when the dimension of the central subspace (Cook, 1996) is one. It does not require a pre-specified single-index model.

We show that under the linearity condition (Li and Duan, 1989), for single-index regressions there exists an adaptive estimate of $\beta_0$ that achieves the same efficiency as the maximum likelihood estimate computed when the conditional density $\eta_0$ is completely specified. For example, if the true model is $Y = g(\beta_0^T X) + \epsilon$, where the link function $g$ and the density of the error $\epsilon$ are unknown, then $\beta_0$ can be estimated with the same efficiency as in the case when $g$ and the error distribution are known.

1.2 Linearity Condition

Many sufficient dimension reduction methods require the linearity condition: $E(X \mid \beta_0^T X)$ is a linear function of $\beta_0^T X$ (Li and Duan, 1989). It is used in popular methods like sliced inverse regression (Li, 1991), sliced average variance estimation (Cook and Weisberg, 1991) and principal Hessian directions (Li, 1992; Cook, 1998).

The linearity condition holds if the predictor has an elliptical distribution (Eaton, 1986), so it holds when $X$ has a multivariate normal distribution. Moreover, Diaconis and Freedman (1984) showed that most low-dimensional projections of a high-dimensional data cloud are close to being normal. Hall and Li (1993) argued that the linearity condition holds approximately when $p$ is large.
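To make the linearity condition concrete: when $X \sim N(0, I_p)$ we have $E(X \mid \beta_0^T X) = \beta_0 (\beta_0^T X)$, which is linear in $\beta_0^T X$. The following minimal simulation sketch (ours, not from the paper; the dimension and sample size are arbitrary illustrative choices) checks this numerically by regressing each coordinate of $X$ on the index without an intercept.

```python
# Numerical sketch of the linearity condition for normal predictors:
# if X ~ N(0, I_p), then E(X | b0'X) = b0 (b0'X), so the no-intercept
# least squares slope of each coordinate X_j on T = b0'X recovers b0_j.
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 200_000

b0 = rng.standard_normal(p)
b0 /= np.linalg.norm(b0)         # identifiability: ||b0|| = 1

X = rng.standard_normal((n, p))  # X ~ N(0, I_p)
T = X @ b0                       # the single index b0'X

slopes = (T @ X) / (T @ T)       # slope of X_j on T = E[T X_j] / E[T^2]
print(np.round(slopes, 3))       # approximately b0
print(np.round(b0, 3))
```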
The linearity condition applies only to the marginal distribution of the predictors and not to the conditional distribution of $Y \mid X$, as is common in regression modeling. Consequently, at the stage of data collection we might design the experiment so that the distribution of $X$ will not blatantly violate elliptic symmetry. We can also transform the predictors to normality, or we can re-weight the data (Cook and Nachtsheim, 1994) to approximate an elliptical distribution.

1.3 Adaptive Estimation

The problem of adaptive estimation was introduced by Stein (1956). One wishes to estimate a Euclidean parameter $\theta$ in the presence of an infinite-dimensional shape parameter $G$, usually a density. An adaptive estimate performs asymptotically as well with $G$ unknown as the maximum likelihood estimate does when $G$ is known (Bickel, 1982). A general method of constructing adaptive estimates was given by Bickel (1982). Schick (1986, 1993) generalized and improved Bickel's method.

It has been shown that adaptive estimation is possible in the symmetric location problem, in which we need to estimate the center of symmetry of an unknown distribution (Stone, 1975). It is also possible in linear regressions where the error density is symmetric and unknown and we need to estimate the linear coefficient (Bickel, 1982). When the observations are not independent, Koul and Pflug (1990), Schick (1993) and Koul and Schick (1996) showed that adaptive estimation is possible in certain autoregressive models. Early literature on adaptive estimation generally focused on these models and their generalizations. In this paper we show that under the linearity condition, adaptive estimation is also possible for single-index regressions.

2 Main Results

Without loss of generality we assume that $X$ has mean zero and covariance $I_p$. We also assume that $\beta_0 \in \Theta$, where
$$\Theta = \{\beta \in \mathbb{R}^p : \|\beta\| = 1\}.$$
Let $l(t, y) = [(\partial/\partial t)\,\eta_0(y \mid t)]/\eta_0(y \mid t)$ be the derivative of the log density or, equivalently, of the log likelihood. By using a Lagrange multiplier, the score equation for $\beta_0$ is
$$Q_{\beta_0} E[X\, l(\beta_0^T X, Y)] = 0, \qquad (2)$$
where $Q_\zeta = I_p - P_\zeta$ and $P_\zeta$ is the orthogonal projection onto the subspace spanned by the columns of the matrix $\zeta$.

It can be shown that (2) holds not only for $l$, but for any real-valued $f(\cdot, \cdot)$ for which the expectations below exist.

LEMMA 1. Assume that the linearity condition holds and that $f(\cdot, \cdot)$ is real-valued. Then $\beta_0$ is a solution of the equation
$$Q_\beta E[X f(\beta^T X, Y)] = 0. \qquad (3)$$

Proof. Since $X$ has covariance matrix $I_p$, according to Cook (1998, p. 57) we have $E[Q_{\beta_0} X \mid \beta_0^T X] = 0$. Therefore
$$Q_{\beta_0} E[X f(\beta_0^T X, Y)] = E[Q_{\beta_0} X f(\beta_0^T X, Y)] = E\{ E[Q_{\beta_0} X f(\beta_0^T X, Y) \mid \beta_0^T X] \} = E\{ E[Q_{\beta_0} X \mid \beta_0^T X]\, E[f(\beta_0^T X, Y) \mid \beta_0^T X] \} = 0.$$

The above lemma shows that a misspecified $l$ still produces a Fisher consistent estimate of $\beta_0$. According to van der Vaart (1998, Theorem 25.27), Lemma 1 together with some regularity conditions enables us to construct an adaptive estimate for $\beta_0$. The regularity conditions are typically satisfied in practice. A proof of the following theorem is given in the Appendix.

THEOREM 1. Assume that the Fisher information $\mathcal{I}(\beta) = E_\beta[X X^T l^2(\beta^T X, Y)]$ is finite, nonsingular and differentiable with respect to $\beta$ in a neighborhood of $\beta_0$. Let $\hat{l}_n(t, y)$ be an estimate of $l(t, y)$ that satisfies
$$E_{\beta_0}[\, \|X\|^2 (\hat{l}_n(\beta_0^T X, Y) - l(\beta_0^T X, Y))^2 ] = o_p(1). \qquad (4)$$
Then under the linearity condition we can construct an adaptive estimate of $\beta_0$ in (1) based on $\hat{l}_n(t, y)$.
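Lemma 1 lends itself to a direct Monte Carlo check. The sketch below is illustrative only: the single-index model and the choice of $f$ are arbitrary and are not taken from the paper. With normal predictors, the projected sample moment approximating $Q_{\beta_0} E[X f(\beta_0^T X, Y)]$ should vanish up to simulation error.

```python
# Monte Carlo check of Lemma 1 under normal predictors: the projected
# moment Q_{b0} E[X f(b0'X, Y)] should be ~ 0 for an arbitrary f(t, y).
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 500_000

b0 = np.ones(p) / np.sqrt(p)                  # a unit-norm beta_0
X = rng.standard_normal((n, p))
T = X @ b0
Y = np.sin(T) + 0.5 * rng.standard_normal(n)  # a single-index model

f = np.tanh(T) * Y                            # an arbitrary f(b0'X, Y)
m = (X * f[:, None]).mean(axis=0)             # estimate of E[X f(b0'X, Y)]

Q = np.eye(p) - np.outer(b0, b0)              # projection off span(b0)
print(np.round(Q @ m, 3))                     # ~ 0, as Lemma 1 predicts
```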
Following van der Vaart (1998, p. 393), an adaptive estimate can be constructed in the following way. Suppose $\beta_n$ is a $\sqrt{n}$-consistent estimate of $\beta_0$. For instance, under the linearity condition $\beta_n$ can be chosen as the ordinary least squares estimator (Li and Duan, 1989). Let $\Gamma_n$ be a $p \times (p-1)$ matrix such that $(\Gamma_n, \beta_n)$ is an orthogonal matrix. Let
$$\tilde{\mathcal{I}}_n = \sum_{i=1}^n X_i X_i^T\, \hat{l}_n^2(\beta_n^T X_i, Y_i)$$
be an estimator of the information matrix for $\beta$. Let $\hat\beta_n$ be a one-step iteration of the Newton-Raphson algorithm for solving the equation
$$\sum_{i=1}^n Q_\beta [X_i\, \hat{l}_n(\beta^T X_i, Y_i)] = 0$$
with respect to $\beta$ on the manifold $\Theta$, starting at the initial guess $\beta_n$. We can write $\hat\beta_n$ as
$$\hat\beta_n = \beta_n + \Gamma_n [\Gamma_n^T \tilde{\mathcal{I}}_n \Gamma_n]^{-1} \Gamma_n^T \sum_{i=1}^n X_i\, \hat{l}_n(\beta_n^T X_i, Y_i). \qquad (5)$$
Van der Vaart (1998, Theorem 25.27) showed that, by using discretization and sample-splitting devices, $\hat\beta_n$ is an adaptive estimate of $\beta_0$ if $\hat{l}_n$ satisfies (4). One such $\hat{l}_n$, based on the kernel density estimation in Härdle and Stoker (1989), is constructed in the Appendix.

Since $\hat\beta_n$ is an adaptive estimator, it has the same asymptotic distribution as the maximum likelihood estimator. Next we derive the asymptotic distribution of the maximum likelihood estimator. Let $\hat\beta_{mle}$ be the maximum likelihood estimator of $\beta_0$. It is shown in the appendix that under mild regularity conditions $\hat\beta_{mle}$ has the following asymptotic distribution.

THEOREM 2. Assume that the regularity conditions for the asymptotic normality of the maximum likelihood estimate hold. Then
$$\hat\beta_{mle} = \beta_0 + \frac{1}{n}\, \Gamma_0 [\Gamma_0^T \mathcal{I}(\beta_0) \Gamma_0]^{-1} \Gamma_0^T \sum_{i=1}^n X_i\, l(\beta_0^T X_i, Y_i) + o_p(n^{-1/2}), \qquad (6)$$
where $\Gamma_0$ is a $p \times (p-1)$ matrix such that $(\Gamma_0, \beta_0)$ is an orthogonal matrix.

Since $\hat\beta_n$ is an adaptive estimator, it has the same asymptotic distribution as $\hat\beta_{mle}$. We conclude that $\sqrt{n}(\hat\beta_n - \beta_0)$ converges to a normal distribution with zero mean and covariance matrix equal to the covariance matrix of $\Gamma_0 [\Gamma_0^T \mathcal{I}(\beta_0) \Gamma_0]^{-1} \Gamma_0^T X\, l(\beta_0^T X, Y)$.
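In code, the one-step update (5) reduces to a single linear solve in the $(p-1)$-dimensional tangent coordinates. The sketch below is a minimal NumPy rendering under stated assumptions: `lhat` stands for some estimate of the score $l(t, y)$, for example the kernel construction of the Appendix, and `beta_n` is any $\sqrt{n}$-consistent initial value such as the normalized ordinary least squares direction. The discretization and sample-splitting devices of van der Vaart (1998) are omitted, and the final renormalization onto $\Theta$ is our own convenience, since the raw one-step value need not lie exactly on the unit sphere.

```python
# Sketch of the one-step Newton-Raphson update (5).
import numpy as np

def one_step_update(X, Y, beta_n, lhat):
    """One Newton-Raphson step from a root-n-consistent beta_n,
    using an estimated score function lhat(t, y)."""
    # Gamma_n: p x (p-1) orthonormal basis orthogonal to beta_n,
    # taken from the SVD of beta_n viewed as a 1 x p matrix.
    _, _, Vt = np.linalg.svd(beta_n.reshape(1, -1))
    Gamma = Vt[1:].T

    s = lhat(X @ beta_n, Y)                # lhat(beta_n' X_i, Y_i)
    I_tilde = (X * s[:, None] ** 2).T @ X  # sum_i X_i X_i' lhat^2, as in (5)
    score = X.T @ s                        # sum_i X_i lhat

    beta = beta_n + Gamma @ np.linalg.solve(
        Gamma.T @ I_tilde @ Gamma, Gamma.T @ score)
    return beta / np.linalg.norm(beta)     # project back onto Theta
```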
3 Discussion

In this article we showed that under the linearity condition, there exists an adaptive estimate of the coefficient vector in a single-index regression. From this result we can see the important role of the linearity condition in single-index regression and, more generally, in sufficient dimension reduction. The linearity condition is unusual, as it does not occur commonly outside of sufficient dimension reduction. We have shown that the linearity condition asymptotically takes the place of a known density. We conjecture that if the linearity condition fails, then an adaptive estimate does not exist; as a consequence, the coefficient vector could not be estimated as well as it can be with the maximum likelihood estimator.

Appendix

Proof of Theorem 1. Since Lemma 1 holds, according to van der Vaart (1998, Theorem 25.27) we only need to prove the following two statements.

1. The conditional density $\eta_0(y \mid \beta_0^T x)$ is differentiable in quadratic mean with respect to $\beta_0$.

2. Let $h(\beta^T x, y)$ be the joint density of $\beta^T X$ and $Y$. Then, for any sequence $\beta_n \to \beta_0$,
$$\int \|x\|^2 \left[ l(\beta_n^T x, y)\sqrt{h(\beta_n^T x, y)} - l(\beta_0^T x, y)\sqrt{h(\beta_0^T x, y)} \right]^2 dx\,dy \to_p 0.$$

The first statement is true by van der Vaart (1998, Theorem 7.2), so we only need to prove the second statement. We have
$$\mathcal{I}(\beta_0) = \int x x^T \left[ l(\beta_0^T x, y)\sqrt{h(\beta_0^T x, y)} \right]^2 dx\,dy
\quad\text{and}\quad
\mathcal{I}(\beta_n) = \int x x^T \left[ l(\beta_n^T x, y)\sqrt{h(\beta_n^T x, y)} \right]^2 dx\,dy.$$
By the assumptions, $\mathcal{I}(\beta)$ is continuous on a neighborhood of $\beta_0$ and $\mathcal{I}(\beta_0)$ is finite, so $\mathcal{I}(\beta_n)$ is also finite; hence $\mathrm{tr}[\mathcal{I}(\beta_0) + \mathcal{I}(\beta_n)] < \infty$. By the triangle inequality,
$$\int \|x\|^2 \left[ |l(\beta_n^T x, y)|\sqrt{h(\beta_n^T x, y)} + |l(\beta_0^T x, y)|\sqrt{h(\beta_0^T x, y)} \right]^2 dx\,dy \le 2\,\mathrm{tr}[\mathcal{I}(\beta_0) + \mathcal{I}(\beta_n)] < \infty.$$
Then, by the dominated convergence theorem,
$$\int \|x\|^2 \left[ l(\beta_n^T x, y)\sqrt{h(\beta_n^T x, y)} - l(\beta_0^T x, y)\sqrt{h(\beta_0^T x, y)} \right]^2 dx\,dy \to_p 0.$$
Therefore the second statement is also true.

Proof of Theorem 2. We first transform the manifold $\Theta$ to $\mathbb{R}^{p-1}$ by the following transformation. For any $\beta \in \Theta$, let $\alpha = \varphi(\beta) = \Gamma_0^T \beta$. Then $\beta = \varphi^{-1}(\alpha) = \Gamma_0 \alpha + \sqrt{1 - \|\alpha\|^2}\,\beta_0$, and $\eta_0(y \mid \beta^T x) = \eta_0(y \mid \varphi^{-1}(\alpha)^T x)$. By taking the derivative of $\eta_0(y \mid \varphi^{-1}(\alpha)^T x)$ with respect to $\alpha$, we can derive the asymptotic distribution of the maximum likelihood estimate of $\alpha$ as follows:
$$\hat\alpha_{mle} = \frac{1}{n}\, [\Gamma_0^T \mathcal{I}(\beta_0) \Gamma_0]^{-1} \Gamma_0^T \sum_{i=1}^n X_i\, l(\beta_0^T X_i, Y_i) + o_p(n^{-1/2}).$$
By the delta method, we have
$$\hat\beta_{mle} = \beta_0 + \frac{1}{n}\, \Gamma_0 [\Gamma_0^T \mathcal{I}(\beta_0) \Gamma_0]^{-1} \Gamma_0^T \sum_{i=1}^n X_i\, l(\beta_0^T X_i, Y_i) + o_p(n^{-1/2}).$$

Construction of $\hat{l}_n$ satisfying (4). Let $h(\beta_0^T x, y)$ be the joint density of $(\beta_0^T X, Y)$ and let $g(\beta_0^T x)$ be the density of $\beta_0^T X$. Then
$$\eta_0(y \mid \beta_0^T x) = h(\beta_0^T x, y)/g(\beta_0^T x)$$
and
$$l = h'/h - g'/g,$$
where $h'$ and $g'$ are the derivatives of $h$ and $g$ with respect to the first argument. To estimate $l$, we only need to estimate $h'/h$ and $g'/g$.

We consider only the estimation of $g'/g$ in detail here, because $h'/h$ can be estimated in the same way, except that the dimension of the density estimation is different. Let $d$ be the dimension of the density estimation: $d = 1$ for $g$ and $d = 2$ for $h$.

Let $T_i = \beta_0^T X_i$. For a fixed twice continuously differentiable probability density $w$ with compact support, a bandwidth parameter $\sigma$ and a cut-off tuning parameter $\delta$, set
$$\hat g_n(s) = \frac{1}{n\sigma^d} \sum_{i=1}^n w\!\left(\frac{s - T_i}{\sigma}\right),
\qquad
\hat\xi_n(s) = \frac{\hat g_n'(s)}{\hat g_n(s)}\, 1_{\hat g_n(s) > \delta}, \qquad (7)$$
where $\hat\xi_n(s)$ is our estimator of $g'(s)/g(s)$. Then $E[(\hat\xi_n(\beta_0^T X) - g'(\beta_0^T X)/g(\beta_0^T X))^2 \|X\|^2]$ converges to zero in probability, provided $\delta \downarrow 0$ and $\sigma \downarrow 0$ at appropriate speeds as $n \uparrow \infty$.

Härdle and Stoker (1989, page 992) showed that under some regularity conditions we have, for any $\epsilon > 0$,
$$\sup_{g(s) > \delta/2} |\hat g(s) - g(s)| = O_p[(n^{1 - \epsilon/2}\sigma^d)^{-1/2}]$$
and
$$\sup_{g(s) > \delta/2} |\hat g'(s) - g'(s)| = O_p[(n^{1 - \epsilon/2}\sigma^{d+2})^{-1/2}].$$
Therefore
$$\sup_{g > \delta/2} |(\hat g'/\hat g) - (g'/g)| = O_p[\delta^{-2}(n^{1 - \epsilon/2}\sigma^{d+2})^{-1/2}].$$
Hence for large $n$ we have
$$E[(\hat\xi_n - g'/g)^2 \|X\|^2] = E[(g'/g)^2 \|X\|^2 1_{\hat g < \delta}] + E[((\hat g'/\hat g) - (g'/g))^2 \|X\|^2 1_{\hat g > \delta}]$$
$$\le E[(g'/g)^2 \|X\|^2 1_{g < 2\delta}] + E[((\hat g'/\hat g) - (g'/g))^2 \|X\|^2 1_{g > \delta/2}]$$
$$\le E[(g'/g)^2 \|X\|^2 1_{g < 2\delta}] + O_p[\delta^{-2}(n^{1 - \epsilon/2}\sigma^{d+2})^{-1/2}] \cdot E[\|X\|^2].$$
Assume that $E[\|X\|^2] < \infty$. Since $(g'/g)^2 \|X\|^2 1_{g < 2\delta}$ is dominated by $(g'/g)^2 \|X\|^2$, and $E[(g'/g)^2 \|X\|^2]$ is finite by assumption, $E[(g'/g)^2 \|X\|^2 1_{g < 2\delta}]$ converges to zero when $\delta$ goes to zero. By the assumptions, $\delta^{-2}(n^{1 - \epsilon/2}\sigma^{d+2})^{-1/2}$ also converges to zero; therefore
$$E[(\hat\xi_n - (g'/g))^2 \|X\|^2] = o_p(1).$$

In the same fashion we can construct $\hat\zeta_n(s, y)$ to estimate $h'/h$, except that we use $(T_i, Y_i)$ as the observations. Then an estimator of $l$ can be defined as
$$\hat l_n = \hat\zeta_n - \hat\xi_n. \qquad (8)$$
Since $E[(\hat\xi_n - g'/g)^2 \|X\|^2]$ and $E[(\hat\zeta_n - h'/h)^2 \|X\|^2]$ converge to zero in probability, $E[(\hat l_n(\beta_0^T X, Y) - l(\beta_0^T X, Y))^2 \|X\|^2]$ converges to zero in probability, and (4) is satisfied.
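The cut-off estimator (7) is simple to implement in the $d = 1$ case. The sketch below is our illustration, not the authors' code: it uses the triweight kernel $w(u) = (35/32)(1 - u^2)^3$, a twice continuously differentiable density with compact support as required, and fixed illustrative values of $\sigma$ and $\delta$ rather than theoretically tuned rates.

```python
# Sketch of the cut-off kernel estimator (7) of g'/g for d = 1.
import numpy as np

def w(u):        # triweight kernel: a C^2 density supported on [-1, 1]
    return np.where(np.abs(u) < 1, (35 / 32) * (1 - u**2) ** 3, 0.0)

def w_prime(u):  # its derivative
    return np.where(np.abs(u) < 1, -(105 / 16) * u * (1 - u**2) ** 2, 0.0)

def xi_hat(s, T, sigma, delta):
    """Estimate g'(s)/g(s) from T_i = b0'X_i, cut off where the
    density estimate g-hat falls below delta."""
    U = (s[:, None] - T[None, :]) / sigma
    g_hat = w(U).mean(axis=1) / sigma            # kernel density estimate
    g1_hat = w_prime(U).mean(axis=1) / sigma**2  # its derivative in s
    return np.divide(g1_hat, g_hat,
                     out=np.zeros_like(g_hat), where=g_hat > delta)

# Example: for T ~ N(0, 1), g'(s)/g(s) = -s, so xi_hat should be ~ -s.
rng = np.random.default_rng(2)
T = rng.standard_normal(5_000)
s = np.linspace(-2.0, 2.0, 9)
print(np.round(xi_hat(s, T, sigma=0.4, delta=0.02), 2))
```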
References

Bickel, P. J. (1982) On adaptive estimation. Ann. Statist., 10, 647-671.

Cook, R. D. (1996) Graphics for regressions with a binary response. J. Amer. Statist. Assoc., 91, 983-992.

Cook, R. D. (1998) Principal Hessian directions revisited (with discussion). J. Amer. Statist. Assoc., 93, 84-100.

Cook, R. D. and Nachtsheim, C. J. (1994) Re-weighting to achieve elliptically contoured covariates in regression. J. Amer. Statist. Assoc., 89, 592-600.

Cook, R. D. and Weisberg, S. (1991) Discussion of "Sliced inverse regression for dimension reduction". J. Amer. Statist. Assoc., 86, 328-332.

Diaconis, P. and Freedman, D. (1984) Asymptotics of graphical projection pursuit. Ann. Statist., 12, 793-815.

Eaton, M. L. (1986) A characterization of spherical distributions. J. Mult. Anal., 20, 272-276.

Hall, P. and Li, K. C. (1993) On almost linearity of low dimensional projections from high dimensional data. Ann. Statist., 21, 867-889.

Härdle, W. and Stoker, T. M. (1989) Investigating smooth multiple regression by the method of average derivatives. J. Amer. Statist. Assoc., 84, 986-995.

Koul, H. L. and Pflug, G. (1990) Weakly adaptive estimators in explosive autoregression. Ann. Statist., 18, 939-960.

Koul, H. L. and Schick, A. (1996) Adaptive estimation in a random coefficient autoregressive model. Ann. Statist., 24, 1025-1052.

Li, K. C. (1991) Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc., 86, 316-342.

Li, K. C. (1992) On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. J. Amer. Statist. Assoc., 87, 1025-1039.

Li, K. C. and Duan, N. (1989) Regression analysis under link violation. Ann. Statist., 17, 1009-1052.

Schick, A. (1986) On asymptotically efficient estimation in semiparametric models. Ann. Statist., 14, 1139-1151.

Schick, A. (1993) On efficient estimation in regression models. Ann. Statist., 21, 1481-1521.

Stein, C. (1956) Efficient nonparametric testing and estimation. Proc. Third Berkeley Symp. Math. Statist. Prob., 1, 187-196. University of California Press.

Stone, C. J. (1975) Adaptive maximum likelihood estimators of a location parameter. Ann. Statist., 3, 276-284.

van der Vaart, A. W. (1998) Asymptotic Statistics. Cambridge: Cambridge University Press.

Weisberg, S. and Welsh, A. H. (1994) Adapting for the missing link. Ann. Statist., 22, 1674-1700.

Yin, X. and Cook, R. D. (2005) Direction estimation in single-index regressions. Biometrika, 92, 371-384.
