A Constructive Approach to the Estimation of Dimension Reduction Directions

Yingcun Xia
National University of Singapore

arXiv:math/0701761v1 [math.ST] 26 Jan 2007

Abstract

In this paper, we propose two new methods to estimate the dimension-reduction directions of the central subspace (CS) by constructing a regression model such that the directions are all captured in the regression mean. Compared with the inverse regression estimation methods (e.g. Li, 1991, 1992; Cook and Weisberg, 1991), the new methods require no strong assumptions on the design of covariates or the functional relation between regressors and the response variable, and have better performance than the inverse regression estimation methods for finite samples. Compared with the direct regression estimation methods (e.g. Härdle and Stoker, 1989; Hristache, Juditski, Polzehl and Spokoiny, 2001; Xia, Tong, Li and Zhu, 2002), which can only estimate the directions of CS in the regression mean, the new methods can detect the directions of CS exhaustively. Consistency of the estimators and the convergence of the corresponding algorithms are proved.

Key words: Conditional density function; Convergence of algorithm; Double-kernel smoothing; Efficient dimension reduction; Root-n consistency.

Short title: Constructive Dimension Reduction

AMS 2000 Subject Classifications: Primary 62G08; Secondary 62G09, 62H05.

1 Introduction

Suppose $X$ is a random vector in $R^p$ and $Y$ is a univariate random variable. Let $B_0 = (\beta_{01}, \ldots, \beta_{0q})$ denote a $p \times q$ orthogonal matrix with $q \le p$, i.e. $B_0^\top B_0 = I_q$, where $I_q$ is a $q \times q$ identity matrix. If $Y$ and $X$ are independent given $B_0^\top X$, i.e. $Y \perp\!\!\!\perp X \mid B_0^\top X$, then the space spanned by the column vectors $\beta_{01}, \beta_{02}, \ldots, \beta_{0q}$, denoted $\mathcal{S}(B_0)$, is called the dimension reduction space. If all other dimension reduction spaces include $\mathcal{S}(B_0)$ as their subspace, then $\mathcal{S}(B_0)$ is the so-called central dimension reduction subspace (CS); see Cook (1998). The column vectors $\beta_{01}, \beta_{02}, \ldots, \beta_{0q}$ are called the CS directions. Dimension reduction is a fundamental statistical problem both in theory and in practice. See Li (1991, 1992) and Cook (1998) for more discussion. If the conditional density function of $Y$ given $X$ exists, then the definition is equivalent to the conditional density function of $Y \mid X$ being the same as that of $Y \mid B_0^\top X$ for all possible values of $X$ and $Y$, i.e.

    $f_{Y|X}(y|x) = f_{Y|B_0^\top X}(y|B_0^\top x)$.    (1.1)

Other alternative definitions of the dimension reduction space can be found in the literature.

In the last decade or so, a series of papers (e.g. Härdle and Stoker, 1989; Li, 1991; Cook and Weisberg, 1991; Samarov, 1993; Hristache, Juditski, Polzehl and Spokoiny, 2001; Yin and Cook, 2002; Xia, Tong, Li and Zhu, 2002; Cook and Li, 2002; Li, Zha and Chiaromonte, 2004; Lue, 2004) have considered issues related to the dimension reduction problem, with the aim of estimating the dimension reduction space and relevant functions. The estimation methods in the literature can be classified into two groups: inverse regression estimation methods (e.g. SIR, Li, 1991, and SAVE, Cook and Weisberg, 1991) and direct regression estimation methods (e.g. ADE of Härdle and Stoker, 1989, and MAVE of Xia, Tong, Li and Zhu, 2002). The inverse regression estimation methods are computationally easy and are widely used as an initial step in data mining, especially for large data sets. However, these methods have poor performance in finite samples and need strong assumptions on the design of covariates.
The direct regression estimation methods have much better performance for finite samples than the inverse regression estimations, and they need no strong requirements on the design of covariates or on the response variable. However, the direct regression estimation methods cannot find the directions of CS exhaustively; for example, they miss directions that appear only in the conditional variance.

None of the methods mentioned above uses the definition directly in searching for the central space. As a consequence, they fail in one way or another to estimate CS efficiently. A straightforward approach using definition (1.1) is to look for $B_0$ that minimizes the difference between the two conditional density functions, which can be estimated by nonparametric smoothers. Obviously, this approach is not efficient in theory because of the "curse of dimensionality" in nonparametric smoothing, and in calculations the minimization problem is difficult to implement. It has been observed that the CS in the regression mean function, i.e. the central mean space (CMS), can be estimated much more efficiently than the general CS; see, for example, Yin and Cook (2002), Cook and Li (2002) and Xia, Tong, Li and Zhu (2002). Motivated by this observation, one can construct a regression model such that the CS coincides with the CMS in order to reduce the difficulty of estimation. In this paper, we first construct a regression model in which the conditional density function $f_{Y|X}(y|x)$ is asymptotically equal to the conditional mean function. Then we apply the methods of searching for the CMS to the constructed model. Based on the discussion above, this constructive approach is expected to be more efficient than the inverse regression estimation methods for finite samples, and to detect the CS directions exhaustively.

In the estimation of the dimension reduction space, most methods need in one way or another to deal with nonparametric estimation. In terms of nonparametric estimation, the inverse regression estimation methods employ a nonparametric regression of $X$ on $Y$, while the direct regression estimation methods employ a nonparametric regression of $Y$ on $X$. In contrast to existing methods, the methods in this paper search for CS from both sides by investigating conditional density functions. A similar idea appeared in Yin and Cook (2005) for a general single-index model. To overcome the computational difficulties, we propose two algorithms using an idea similar to that of Xia, Tong, Li and Zhu (2002). Each algorithm solves the minimization problem by treating it as two separate quadratic programming problems, which have simple analytic solutions and can be calculated quite efficiently. The convergence of the algorithms can be proved. Our constructive approach overcomes the disadvantage of the inverse regression estimations, which require a symmetric design for the explanatory variables, as well as the disadvantage of the direct regression estimations, which do not find the CS directions exhaustively. Simulations suggest that the proposed methods have very good performance for finite samples and are able to estimate the CS directions in very complicated structures. Applying the proposed methods to two real data sets, some useful patterns have been observed based on the estimations.

To estimate the CS, we need to estimate the directions $B_0$ as well as the dimension $q$ of the space. In this paper, however, we focus on the estimation of the directions by assuming that $q$ is known.
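To make the setting concrete, here is a minimal simulation sketch (the model, dimensions and variable names are illustrative choices of ours, not taken from the paper). It generates data whose central subspace has $q = 2$ directions, one appearing in the conditional mean and one only in the conditional variance, exactly the kind of structure that mean-based direct regression methods cannot recover exhaustively.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 6

# Two orthonormal CS directions: beta1 enters the conditional mean,
# beta2 enters only the conditional variance.
beta1 = np.zeros(p); beta1[0] = 1.0
beta2 = np.zeros(p); beta2[1] = 1.0
B0 = np.column_stack([beta1, beta2])           # p x q matrix with q = 2

X = rng.standard_normal((n, p))
mean_part = np.sin(X @ beta1)                  # visible in E(Y | X)
sd_part = np.exp(0.5 * (X @ beta2))            # visible only in Var(Y | X)
Y = mean_part + sd_part * rng.standard_normal(n)

# Y depends on X only through B0^T X, so span(B0) is the central subspace here;
# a mean-only method would recover beta1 but miss beta2.
print(B0.shape, X.shape, Y.shape)
```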
2 Estimation methods

As discussed above, the direct regression estimations have good performance for finite samples. However, they cannot detect exhaustively the CS directions in complicated structures. Motivated by these facts, our strategy is to construct a semiparametric regression model such that all the CS directions are captured in the regression mean function. As we can see from (1.1), all the directions are captured in the conditional density function. Thus, we will construct a regression model such that the conditional density function is asymptotically equal to the regression mean function.

The primary step is thus to construct an estimate of the conditional density function. Here, we use the idea of the "double-kernel" local linear smoothing method studied in Fan et al. (1996) for the estimation. Consider $H_b(Y - y)$ with $y$ running through all possible values, where $H(v)$ is a symmetric density function, $b > 0$ is a bandwidth and $H_b(v) = b^{-1} H(v/b)$. If $b \to 0$ as $n \to \infty$, then from (1.1) we have

    $m_b(x, y) \stackrel{\mathrm{def}}{=} E\{H_b(Y - y) \mid X = x\} = E\{H_b(Y - y) \mid B_0^\top X = B_0^\top x\} \to f_{Y|B_0^\top X}(y|B_0^\top x)$.

See Fan et al. (1996). The above equation indicates that all the directions can be captured by the conditional mean function $m_b(x, y)$ of $H_b(Y - y)$ given $X = x$, with $x$ and $y$ running through all possible values. Now, consider a regression model nominally for $H_b(Y - y)$,

    $H_b(Y - y) = m_b(X, y) + \varepsilon_b(y|X)$,

where $\varepsilon_b(y|X) = H_b(Y - y) - E\{H_b(Y - y) \mid X\}$ with $E\varepsilon_b(y|X) = 0$. Let $g_b(B_0^\top x, y) = E\{H_b(Y - y) \mid B_0^\top X = B_0^\top x\}$. If (1.1) holds, then $m_b(x, y) = g_b(B_0^\top x, y)$, and the model can be written as

    $H_b(Y - y) = g_b(B_0^\top X, y) + \varepsilon_b(y|X)$.    (2.1)

As $b \to 0$, we have $g_b(B_0^\top x, y) \to f_{Y|B_0^\top X}(y|B_0^\top x)$. Thus, the directions $B_0$ defined in (1.1) are all captured in the regression mean function of model (2.1) if $y$ runs through all possible values.

Based on model (2.1), we propose two methods to estimate the directions. One is a combination of the outer product of gradients (OPG) estimation method (Härdle, 1991; Samarov, 1993; Xia, Tong, Li and Zhu, 2002) with the "double-kernel" local linear smoothing method (Fan et al., 1996). The other is a combination of the minimum average (conditional) variance estimation (MAVE) method (Xia, Tong, Li and Zhu, 2002) with the "double-kernel" local linear smoothing method. The structure-adaptive weights of Hristache, Juditski and Spokoiny (2001) and Hristache, Juditski, Polzehl and Spokoiny (2001) are used in the estimations.

2.1 Estimation based on outer products of gradients

Consider the gradient of the conditional mean function $m_b(x, y)$ with respect to $x$. If (1.1) holds, then

    $\partial m_b(x, y)/\partial x = \partial g_b(B_0^\top x, y)/\partial x = B_0 \nabla g_b(B_0^\top x, y)$,    (2.2)

where $\nabla g_b(v_1, \ldots, v_q, y) = (\nabla_1 g_b(v_1, \ldots, v_q, y), \ldots, \nabla_q g_b(v_1, \ldots, v_q, y))^\top$ with

    $\nabla_k g_b(v_1, \ldots, v_q, y) = \partial g_b(v_1, \ldots, v_q, y)/\partial v_k, \quad k = 1, 2, \ldots, q$.

Thus, the directions $B_0$ are contained in the gradients of the regression mean function of model (2.1). One way to estimate $B_0$ is to consider the expectation of the outer product of the gradients,

    $E\Big\{\Big(\frac{\partial m_b(X, Y)}{\partial x}\Big)\Big(\frac{\partial m_b(X, Y)}{\partial x}\Big)^\top\Big\} = B_0 E\{\nabla g_b(B_0^\top X, Y) \nabla^\top g_b(B_0^\top X, Y)\} B_0^\top$.

It is easy to see that $B_0$ is in the space spanned by the first $q$ eigenvectors of this expectation of outer products.

Suppose that $\{(X_i, Y_i),\ i = 1, 2, \ldots, n\}$ is a random sample from $(X, Y)$. To estimate the gradient $\partial m_b(x, y)/\partial x$, we can use nonparametric kernel smoothing methods.
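As a minimal illustration of the double-kernel idea behind (2.1), the synthetic response $H_b(Y_i - y)$ can simply be smoothed against $X$ to estimate $m_b(x, y)$, and hence the conditional density. The sketch below is illustrative only: it uses a Nadaraya-Watson weight rather than the local linear fit introduced next, Gaussian kernels, and arbitrary bandwidth values; the function names are ours.

```python
import numpy as np

def gauss(u):
    """Standard normal density; plays the role of both H and K_0 in this sketch."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def m_b_hat(x, y, X, Y, h=0.5, b=0.3):
    """Double-kernel estimate of m_b(x, y) = E{H_b(Y - y) | X = x}, which
    approximates the conditional density f_{Y|X}(y | x) as b -> 0."""
    Hb = gauss((Y - y) / b) / b                    # synthetic response H_b(Y_i - y)
    w = gauss(np.linalg.norm(X - x, axis=1) / h)   # kernel weights in the covariates
    return np.sum(w * Hb) / np.sum(w)              # Nadaraya-Watson smoother

# usage on a toy model with Y | X ~ N(sin(X_1), 0.5^2)
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4))
Y = np.sin(X[:, 0]) + 0.5 * rng.standard_normal(500)
# rough (smoothed, hence biased) estimate of f_{Y|X}(0 | x = 0); the true value is about 0.80
print(m_b_hat(np.zeros(4), 0.0, X, Y))
```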
For simplicity, we adopt the following notation scheme. Let $K_0(v^2)$ be a univariate symmetric density function and define $K(v_1, \ldots, v_d) = K_0(v_1^2 + \cdots + v_d^2)$ for any integer $d$, and $K_h(u) = h^{-d} K(u/h)$, where $d$ is the dimension of $u$ and $h > 0$ is a bandwidth. Let $H_{b,i}(y) = H_b(Y_i - y)$, where $H(\cdot)$ and $b$ are defined above. For any $(x, y)$, the principle of the local linear smoother suggests minimizing

    $n^{-1} \sum_{i=1}^n \{H_{b,i}(y) - a - b^\top (X_i - x)\}^2 K_h(X_{ix})$    (2.3)

with respect to $a$ and $b$ to estimate $m_b(x, y)$ and $\partial m_b(x, y)/\partial x$ respectively, where $X_{ix} = X_i - x$. See Fan and Gijbels (1996) for more details. For each pair $(X_j, Y_k)$, we consider the following minimization problem:

    $(\hat a_{jk}, \hat b_{jk}) = \arg\min_{a_{jk}, b_{jk}} \sum_{i=1}^n \{H_{b,i}(Y_k) - a_{jk} - b_{jk}^\top X_{ij}\}^2 w_{ij}$,    (2.4)

where $X_{ij} = X_i - X_j$ and $w_{ij} = K_h(X_{ij})$. We consider an average of their outer products,

    $\hat\Sigma = n^{-2} \sum_{k=1}^n \sum_{j=1}^n \hat\rho_{jk} \hat b_{jk} \hat b_{jk}^\top$,

where $\hat\rho_{jk}$ is a trimming function introduced for technical purposes to handle the notorious boundary points. In this paper, we adopt the following trimming scheme. For any given point $(x, y)$, we use all observations to estimate its function value and its gradient as in (2.3), and we then consider the estimates in a compact region of $(x, y)$. Moreover, for points with too few observations around them, the estimates might be unreliable; they should not be used in the estimation of the CS directions and should be trimmed off. Let $\rho(\cdot)$ be any bounded function with bounded second order derivatives on $R$ such that $\rho(v) > 0$ if $v > \omega_0$ and $\rho(v) = 0$ if $v \le \omega_0$ for some small $\omega_0 > 0$. We take $\hat\rho_{jk} = \rho(\hat f(X_j)) \rho(\hat f_Y(Y_k))$, where $\hat f(x)$ and $\hat f_Y(y)$ are estimators of the density functions of $X$ and $Y$ respectively. The CS directions can then be estimated by the first $q$ eigenvectors of $\hat\Sigma$.

To allow the estimation to adapt to the structure of the dependence of $Y$ on $X$, we may follow the idea of Hristache et al. (2001) and replace $w_{ij}$ in (2.4) by

    $w_{ij} = K_h(\hat\Sigma^{1/2} X_{ij})$,

where $\hat\Sigma^{1/2}$ is a symmetric matrix with $(\hat\Sigma^{1/2})^2 = \hat\Sigma$, and repeat the above procedure until convergence. We call this procedure the method of outer product of gradients based on the conditional density functions (dOPG). To implement the estimation procedure, we suggest the following dOPG algorithm.

Step 0: Set $\hat\Sigma_{(0)} = I_p$ and $t = 0$.

Step 1: With $w_{ij} = K_h(\hat\Sigma_{(t)}^{1/2} X_{ij})$, calculate the solution to (2.4),

    $\begin{pmatrix} a_{jk}^{(t)} \\ b_{jk}^{(t)} \end{pmatrix} = \Big\{ \sum_{i=1}^n K_{h_t}(\hat\Sigma_{(t)}^{1/2} X_{ij}) \begin{pmatrix} 1 \\ X_{ij} \end{pmatrix} \begin{pmatrix} 1 \\ X_{ij} \end{pmatrix}^\top \Big\}^{-1} \times \sum_{i=1}^n K_{h_t}(\hat\Sigma_{(t)}^{1/2} X_{ij}) \begin{pmatrix} 1 \\ X_{ij} \end{pmatrix} H_{b_t, i}(Y_k)$,

where $h_t$ and $b_t$ are bandwidths (details are given in (2.6) and (2.7) below).

Step 2: Define $\rho_{jk}^{(t)} = \rho(\tilde f^{(t)}(X_j)) \rho(\tilde f_Y^{(t)}(Y_k))$ with

    $\tilde f_Y^{(t)}(y) = n^{-1} \sum_{i=1}^n H_{b_t, i}(y), \qquad \tilde f^{(t)}(x) = (n\tilde\mu)^{-1} h_t^p \prod_{\lambda_k^{(t)} > h_t} \frac{\lambda_k^{(t)}}{h_t} \sum_{i=1}^n K_{h_t}(\hat\Sigma_{(t)}^{1/2} X_{ix})$,

where $\lambda_k^{(t)}$, $k = 1, \ldots, p$, are the eigenvalues of $\hat\Sigma_{(t)}^{1/2}$ and $\tilde\mu = \int K_0\big(\sum_{\lambda_k^{(t)} > h_t} v_k^2\big) \prod_{\lambda_k^{(t)} > h_t} dv_k$. Calculate the average of outer products

    $\hat\Sigma_{(t+1)} = n^{-2} \sum_{j,k=1}^n \rho_{jk}^{(t)} b_{jk}^{(t)} (b_{jk}^{(t)})^\top$.

Step 3: Set $t := t + 1$. Repeat Steps 1 and 2 until convergence. Denote the final value of $\hat\Sigma_{(t)}$ by $\Sigma_{(\infty)}$. Suppose the eigenvalue decomposition of $\Sigma_{(\infty)}$ is $\Gamma \mathrm{diag}(\lambda_1, \ldots, \lambda_p) \Gamma^\top$, where $\lambda_1 \ge \cdots \ge \lambda_p$. Then the estimated directions are the first $q$ columns of $\Gamma$, denoted by $\hat B_{dOPG}$.

In the dOPG algorithm, $\tilde f_Y^{(t)}(y)$ and $\tilde f^{(t)}(x)$, $t > 0$, are the estimators of the density functions of $Y$ and $B_0^\top X$ respectively; a justification is given in the proof of Theorem 3.1 in Section 6.2.
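The dOPG loop is compact enough to sketch in code. The following is an illustrative implementation only, not the author's: it uses Gaussian kernels, a single fixed bandwidth pair $(h, b)$ instead of the schedule in (2.6) and (2.7), and omits the trimming weights $\rho_{jk}^{(t)}$ and the standardization step; all function and variable names are ours.

```python
import numpy as np

def gauss(u):
    """Gaussian kernel, used here for both the response kernel H and the covariate kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def sym_sqrt(S):
    """Symmetric square root of a positive semi-definite matrix."""
    lam, U = np.linalg.eigh(S)
    return (U * np.sqrt(np.clip(lam, 0.0, None))) @ U.T

def dopg(X, Y, q, h=0.8, b=0.4, n_iter=10):
    """Simplified dOPG sketch: local linear fits of H_b(Y_i - Y_k) on X_i - X_j give
    gradient estimates b_jk; their averaged outer products form Sigma, whose square
    root re-weights the kernel distances in the next iteration.  Trimming and the
    bandwidth schedule of the paper are omitted; O(n^2) loops, so small n only."""
    n, p = X.shape
    Hb = gauss((Y[:, None] - Y[None, :]) / b) / b        # Hb[i, k] = H_b(Y_i - Y_k)
    Sigma = np.eye(p)                                    # Step 0
    for _ in range(n_iter):
        S_half = sym_sqrt(Sigma)
        M = np.zeros((p, p))
        for j in range(n):                               # Step 1: local linear fit at each X_j
            Xij = X - X[j]                               # rows are X_i - X_j
            w = gauss(np.linalg.norm(Xij @ S_half, axis=1) / h)
            Z = np.hstack([np.ones((n, 1)), Xij])        # design (1, X_ij^T)
            WZ = Z * w[:, None]
            A = Z.T @ WZ + 1e-8 * np.eye(p + 1)          # small ridge for numerical stability
            coefs = np.linalg.solve(A, WZ.T @ Hb)        # column k holds (a_jk, b_jk)
            G = coefs[1:, :]                             # p x n matrix of gradients b_jk
            M += G @ G.T                                 # accumulate outer products over k
        Sigma = M / n**2                                 # Step 2, without trimming
    lam, Gamma = np.linalg.eigh(Sigma)
    return Gamma[:, np.argsort(lam)[::-1][:q]]           # first q eigenvectors: B_hat
```

With data like the toy example in Section 1, `dopg(X, Y, q=2)` returns a $p \times 2$ matrix with orthonormal columns whose span should be close to that of $B_0$, although a fixed bandwidth is cruder than the decreasing schedule used in the paper.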
In calculations, the usual stopping criterion can be used. For example, if the largest singular value of $\hat\Sigma_{(t)} - \hat\Sigma_{(t+1)}$ is smaller than $10^{-6}$, then we stop the iteration and take $\hat\Sigma_{(t+1)}$ as the final estimator. The eigenvalues of $\Sigma_{(\infty)}$ can be used to determine the dimension of the CS; however, we will not go into the details of this issue in this paper. In practice, we may need to standardize $X_i = (X_{i1}, \ldots, X_{ip})^\top$ by setting $X_i := S_X^{-1/2}(X_i - \bar X)$ and standardize $Y$ by setting $Y_i := (Y_i - \bar Y)/\sqrt{s_Y}$, where $\bar X = n^{-1}\sum_{i=1}^n X_i$, $S_X = n^{-1}\sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)^\top$, $\bar Y = n^{-1}\sum_{i=1}^n Y_i$ and $s_Y = n^{-1}\sum_{i=1}^n (Y_i - \bar Y)^2$. Then the estimated CS directions are the first $q$ columns of $S_X^{-1/2}\Gamma$.

2.2 MAVE based on conditional density function

Note that if (1.1) holds, then the gradients $\partial m_b(x, y)/\partial x$ at all $(x, y)$ lie in a common $q$-dimensional subspace, as shown in equation (2.2). To use this observation, we can replace $b$ in (2.3), which is an estimate of the gradient, by $B d(x, y)$ and obtain the local linear approximation error

    $n^{-1} \sum_{i=1}^n \{H_{b,i}(y) - a - d^\top B^\top (X_i - x)\}^2 K_h(X_{ix})$,

where $d = d(x, y)$ is introduced to take the role of $\nabla g_b(B_0^\top x, y)$ in (2.2). Note that the above weighted mean of squares is the local approximation error of $H_{b,i}(y)$ by a hyperplane with normal vectors in a common space spanned by $B$. Since $B$ is common to all $x$ and $y$, it should be estimated so as to minimize the approximation errors for all possible $X_j$ and $Y_k$. As a consequence, we propose to estimate $B_0$ by minimizing

    $n^{-3} \sum_{k=1}^n \sum_{j=1}^n \sum_{i=1}^n \hat\rho_{jk} \{H_{b,i}(Y_k) - a_{jk} - d_{jk}^\top B^\top X_{ij}\}^2 w_{ij}$    (2.5)

with respect to $a_{jk}$, $d_{jk} = (d_{jk1}, \ldots, d_{jkq})^\top$, $j, k = 1, \ldots, n$, and $B$ with $B^\top B = I_q$, where $\hat\rho_{jk}$ is defined above. This estimation procedure is similar to the minimum average (conditional) variance estimation method (Xia, Tong, Li and Zhu, 2002). Because the method is based on the conditional density functions, we call it the minimum average (conditional) variance estimation based on the conditional density functions (dMAVE).

The minimization problem in (2.5) can be solved by fixing $(a_{jk}, d_{jk})$, $j, k = 1, \ldots, n$, and $B$ alternately. As a consequence, we need to solve two quadratic programming problems, each of which has a simple analytic solution. For any matrix $B = (\beta_1, \ldots, \beta_d)$, we define operators $\ell(\cdot)$ and $\mathcal{M}(\cdot)$ respectively by

    $\ell(B) = (\beta_1^\top, \ldots, \beta_d^\top)^\top$ and $\mathcal{M}(\ell(B)) = B$.

We propose the following dMAVE algorithm to implement the estimation.

Step 0: Let $B_{(1)}$ be an initial estimator of the CS directions. Set $t = 1$.

Step 1: Let $B = B_{(t)}$ and calculate the solutions of $(a_{jk}, d_{jk})$, $j, k = 1, \ldots, n$, to the minimization problem in (2.5),

    $\begin{pmatrix} a_{jk}^{(t)} \\ d_{jk}^{(t)} \end{pmatrix} = \Big\{ \sum_{i=1}^n K_{h_t}(B_{(t)}^\top X_{ij}) \begin{pmatrix} 1 \\ B_{(t)}^\top X_{ij} \end{pmatrix} \begin{pmatrix} 1 \\ B_{(t)}^\top X_{ij} \end{pmatrix}^\top \Big\}^{-1} \times \sum_{i=1}^n K_{h_t}(B_{(t)}^\top X_{ij}) \begin{pmatrix} 1 \\ B_{(t)}^\top X_{ij} \end{pmatrix} H_{b_t, i}(Y_k)$,

where $h_t$ and $b_t$ are two bandwidths (details are discussed below).

Step 2: Let $\rho_{jk}^{(t)} = \rho(\hat f_{B_{(t)}}(X_j)) \rho(\hat f_Y^{(t)}(Y_k))$ with $\hat f_Y^{(t)}(y) = n^{-1} \sum_{i=1}^n H_{b_t, i}(y)$ and $\hat f_{B_{(t)}}(x) = n^{-1} \sum_{i=1}^n K_{h_t}(B_{(t)}^\top X_{ix})$. Fixing $a_{jk} = a_{jk}^{(t)}$ and $d_{jk} = d_{jk}^{(t)}$, calculate the solution of $B$, or rather of $\ell(B)$, to (2.5),

    $b^{(t+1)} = \Big\{ \sum_{k,j,i=1}^n \rho_{jk}^{(t)} K_{h_t}(B_{(t)}^\top X_{ij}) X_{ijk}^{(t)} (X_{ijk}^{(t)})^\top \Big\}^{-1} \times \sum_{k,j,i=1}^n \rho_{jk}^{(t)} K_{h_t}(B_{(t)}^\top X_{ij}) X_{ijk}^{(t)} \{H_{b_t, i}(Y_k) - a_{jk}^{(t)}\}$,

where $X_{ijk}^{(t)} = d_{jk}^{(t)} \otimes X_{ij}$.

Step 3: Calculate $\Lambda_{(t+1)} = \{\mathcal{M}(b^{(t+1)})\}^\top \mathcal{M}(b^{(t+1)})$ and $B_{(t+1)} = \mathcal{M}(b^{(t+1)}) \Lambda_{(t+1)}^{-1/2}$. Set $t := t + 1$ and go to Step 1.

Step 4: Repeat Steps 1-3 until convergence. Let $B_{(\infty)}$ be the final value of $B_{(t)}$. Then our estimators of the directions are the columns of $B_{(\infty)}$, denoted by $\hat B_{dMAVE}$.
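A matching sketch of the dMAVE alternation is given below, again as an illustration only and with the same simplifications as the dOPG sketch (Gaussian kernels, fixed bandwidths, no trimming or standardization); the initial value `B_init` would in practice be the first dOPG iterate, as explained in the next paragraph, and all names are ours.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def dmave(X, Y, B_init, h=0.5, b=0.3, n_iter=10):
    """Simplified dMAVE sketch: alternate between the local coefficients (a_jk, d_jk)
    for fixed B (Step 1) and the stacked direction vector l(B) for fixed coefficients
    (Step 2), then re-orthonormalize B (Step 3).  O(n^2) loops, so small n only."""
    n, p = X.shape
    q = B_init.shape[1]
    B = B_init
    Hb = gauss((Y[:, None] - Y[None, :]) / b) / b            # Hb[i, k] = H_b(Y_i - Y_k)
    for _ in range(n_iter):
        lhs = np.zeros((p * q, p * q))
        rhs = np.zeros(p * q)
        for j in range(n):
            Xij = X - X[j]                                   # rows are X_i - X_j
            V = Xij @ B                                      # B^T X_ij
            w = gauss(np.linalg.norm(V, axis=1) / h)         # kernel weights K_h(B^T X_ij)
            Z = np.hstack([np.ones((n, 1)), V])              # design (1, (B^T X_ij)^T)
            WZ = Z * w[:, None]
            A = Z.T @ WZ + 1e-8 * np.eye(q + 1)
            coefs = np.linalg.solve(A, WZ.T @ Hb)            # Step 1: column k = (a_jk, d_jk)
            a, D = coefs[0, :], coefs[1:, :]                 # a_jk and d_jk (q x n)
            for k in range(n):                               # Step 2: normal equations for l(B)
                Xk = np.kron(D[:, k], Xij)                   # rows are X_ijk = d_jk (x) X_ij
                resid = Hb[:, k] - a[k]
                lhs += Xk.T @ (Xk * w[:, None])
                rhs += Xk.T @ (w * resid)
        ell = np.linalg.solve(lhs + 1e-8 * np.eye(p * q), rhs)
        Bnew = ell.reshape(q, p).T                           # M(l(B)): back to a p x q matrix
        lam, U = np.linalg.eigh(Bnew.T @ Bnew)               # Step 3: B <- Bnew Lambda^{-1/2}
        B = Bnew @ (U / np.sqrt(np.clip(lam, 1e-12, None))) @ U.T
    return B
```

Both updates are weighted least-squares solves once the other block of variables is held fixed, which is the two-quadratic-programs structure referred to above.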
The dMAVE algorithm needs a consistent initial estimator in Step 0 to guarantee its theoretical justification. In the following, we use the first iteration estimator of dOPG, namely the first $q$ eigenvectors of $\hat\Sigma_{(1)}$, as the initial value. Actually, any initial estimator that satisfies (6.6) can be used, and Theorem 3.2 will still hold. As with dOPG, the standardization procedure can be carried out for dMAVE in practice, and the stopping criterion for dOPG can also be used here.

Note that the estimation procedure involves nonparametric estimation of conditional density functions. Several bandwidth selection methods are available for this estimation; see, e.g., Silverman (1986), Scott (1992) and Fan et al. (1996). Our theoretical verification of the convergence of the algorithms requires some constraints on the bandwidths, although we believe these constraints can be removed with more complicated technical proofs. To ensure that the requirements on the bandwidths are satisfied, after standardizing the variables we use the following bandwidths in our calculations. In the first iteration, we use slightly larger bandwidths than the optimal ones in terms of MISE,

    $h_0 = c_0 n^{-1/(p_0+6)}, \qquad b_0 = c_0 n^{-1/(p_0+5)}$,    (2.6)

where $p_0 = \max(p, 3)$. Then we reduce the bandwidths in each iteration as

    $h_{t+1} = \max\{r_n h_t,\ c_0 n^{-1/(q+4)}\}, \qquad b_{t+1} = \max\{r_n b_t,\ c_0 n^{-1/(q+3)},\ c_0 n^{-1/5}\}$    (2.7)

for $t \ge 0$, where $r_n = n^{-1/(2(p_0+6))}$ and $c_0 = 2.34$, as suggested by Silverman (1986) if the Epanechnikov kernel is used. Here, the bandwidth $b$ is selected smaller than $h$ based on simulation comparisons; a small numerical sketch of this schedule is given below.

Fan and Yao (2003, p. 337) proposed a method, called the profile least-squares estimation, for the single-index model and its variants by solving a similar mini-
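The bandwidth schedule in (2.6) and (2.7) translates directly into code; the small sketch below (the helper name and example arguments are ours) simply evaluates those formulas.

```python
def bandwidth_schedule(n, p, q, t_max, c0=2.34):
    """Bandwidths following (2.6) and (2.7): start slightly above the MISE-optimal
    rate, then shrink geometrically by r_n with floors that depend on q."""
    p0 = max(p, 3)
    rn = n ** (-1.0 / (2.0 * (p0 + 6)))
    h = [c0 * n ** (-1.0 / (p0 + 6))]                    # h_0 from (2.6)
    b = [c0 * n ** (-1.0 / (p0 + 5))]                    # b_0 from (2.6)
    for _ in range(t_max):                               # recursion (2.7)
        h.append(max(rn * h[-1], c0 * n ** (-1.0 / (q + 4))))
        b.append(max(rn * b[-1], c0 * n ** (-1.0 / (q + 3)), c0 * n ** (-0.2)))
    return h, b

h_seq, b_seq = bandwidth_schedule(n=400, p=6, q=2, t_max=5)
print([round(v, 3) for v in h_seq])
print([round(v, 3) for v in b_seq])
```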
