ebook img

High-Dimensional AICs for Selection of Redundancy Models in PDF

28 Pages·2012·0.14 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview High-Dimensional AICs for Selection of Redundancy Models in

High-Dimensional AICs for Selection of Redundancy Models in Discriminant Analysis (cid:3) (cid:3)(cid:3) (cid:3) Tetsuro Sakurai , Takeshi Nakada and Yasunori Fujikoshi (cid:3)Faculty of Science and Engineering, Chuo University, Kasuga, Bunkyo-ku, 112-8551, Japan (cid:3)(cid:3)Financial Systems Development Dept. No. 4, ITOCHU Techno-Solutions Corporation, 1-2-2, Osaki, Shinagawa-ku, 141-8522, Japan Abstract This paper is concerned with high-dimensional modi(cid:12)cations of Akaike information criterion (AIC) for redundancy (no additional in- formation) models in discriminant analysis. The AIC has been pro- posed as an asymptotically unbiased estimator of the risk function when the dimension is (cid:12)xed and the sample size tends to in(cid:12)nity. On the other hand, Fujikoshi (2002) attempted to modify the AIC in two- groups discriminant analysis when the dimension and the sample size tend to in(cid:12)nity. However, its modi(cid:12)cation was obtained under a re- strictive assumption, and furthermore, it was difficult to extend the method to multiple-groups case. In this paper, by a new approach we propose HAIC which is an asymptotically unbiased estimator of the risk function in multiple-groups discriminant analysis when both the dimension and the sample size tend to in(cid:12)nity, for a general class of candidated models. By simulation experiments it is shown that HAIC is more useful than the usual AIC and CAIC. AMS 2000 subject classi(cid:12)cation: primary 62H12; secondary 62H30 Key Words and Phrases: AIC, Bias correction, Discriminant analysis, HAIC, High dimensional case, Multiple-groups case, Selection of variables. . 1 1. Introduction This paper is concerned with selection of variables in discriminant anal- ysis. One way of selecting variables is to formulate as a problem of selecting redundancy models based on no additional information hypothesis due to Rao (1948, 1973). Then, we apply the idea of a model selection criterion AIC to the models. The selection criterion AIC is proposed as an approximately unbiased estimator(AIC)oftheAICtypeofriskde(cid:12)nedbytheexpectedlog-predictive likelihood. The AIC for redundancy models has been proposed under large- sample framework, i.e., the dimension is (cid:12)xed and the sample size tends to in(cid:12)nity. Ontheotherhand,indiscriminantanalysisweencounterahigh-dimensional case, i.e., the case when the dimension is relatively large. There are some works on asymptotic approximations for the expected probabilities of mis- classi(cid:12)cation in the two-groups discriminant analysis under a high dimen- sional framework. For these works, see, e.g., Raudys (1972), Wyman et al. (1990), and Fujikoshi and Seo (1998), in which they point a goodness of such approximations. In this paper, we consider the problem of estimating the AIC type of risk when both the dimension and sample size are large, in multiple-groups discriminant analysis. More precisely, we attempt to reduce for the bias term when we estimate the AIC type of risk by (cid:0)2log likelihood, in a high dimensional case. An attempt has been done by Fujikoshi (2002) in two- groups discriminant analysis. However, a modi(cid:12)cation was obtained under a restrictive assumption (see Section 3.1), and furthermore, it was difficult to extend the method to multiple-groups case. In this paper, by a new approachweobtain an asymptoticallyunbiasedestimator ofthe riskfunction in multiple-groups discriminant analysis when both the dimension and the sample size tend to in(cid:12)nity, for a general class of candidated models. which 2 is not necessary to include the true model. Such an estimator is called high- dimensional AIC, which is denoted by HAIC. Furthermore, it is pointed that HAIC has smaller biases than the AIC in a large-sample framework in a wide range of dimensions and sample sizes, through simulation experiments. It is also shown that HIC provides better model selections than does AIC or CAIC. 2. Preliminaries Let x = (x ;:::;x )′ be a p-dimensional random vector measurable on 1 p the individuals of each of q + 1 populations (cid:5) ;:::;(cid:5) . Let x(i);:::;x(i) 1 q+1 1 Ni be a sample from (cid:5) , and denote all the observations by the N (cid:2)p matrix, i ( ) X = x(1);:::;x(1);:::;x(q+1);:::;x(q+1) ; (2.1) 1 N1 1 Nq+1 where N = N +(cid:1)(cid:1)(cid:1)+N . It is assumed that 1 q+1 E(xj(cid:5) ) = (cid:22)(i); Var(xj(cid:5) ) = (cid:6): (2.2) i i WeconsiderAICanditshigh-dimensionalmodi(cid:12)cationsforaredundancy model of a given subset of fx ;:::;x g. Without loss of generality we treat a 1 p candidate model M , which means that the (cid:12)rst k variate x = (x ;:::;x )′ k 1 1 k is sufficient, or the remainder variate x = (x ;:::;x ) is redundant, i.e., 2 k+1 p the remainder p (cid:0) k variate x has no additional information in canonical 2 discriminant analysis, in presence of x . In order to write the model M in 1 k terms of unknown parameters, let us consider the partitions ( ) ( ) (cid:22)(i) (cid:6) (cid:6) (cid:22)(i) = 1 ; (cid:6) = 11 12 ; (2.3) (cid:22)(i) (cid:6) (cid:6) 21 22 2 and let (cid:22)(i) = (cid:22)(i) (cid:0)(cid:0)(cid:22)(i); i = 1;:::;q +1; (cid:0) = (cid:6) (cid:6)(cid:0)1: 2(cid:1)1 2 1 21 11 Then, M is de(cid:12)ned by k ( ) M : (cid:22)(1) = (cid:1)(cid:1)(cid:1) = (cid:22)(q+1) = (cid:22)(i) : (2.4) k 2(cid:1)1 2(cid:1)1 2(cid:1)1 3 Let f(X;(cid:2)) be the density function of X under M with (cid:2) = f(cid:22)(1);:::; k (cid:22)(q+1);(cid:6)g. Further, let g(X) be the density function of X under the true model M(cid:3). Then, we can write the AIC type of risk de(cid:12)ned by the expected log-predictive likelihood for a model M as k R = E(cid:3) E(cid:3)[(cid:0)2logf(Z;(cid:2)^ )]; (2.5) k X Z k where Z is an N (cid:2) p random matrix that has the same distribution as X ^ and is independent of X, and (cid:2) is the maximum likelihood estimator of (cid:2) k (cid:3) (cid:3) under M . Here E and Var denote the expectation under the true model. k Note that the risk R can be expressed as k R = E(cid:3) [(cid:0)2logf(X;(cid:2)^ )]+b ; (2.6) k X k k where b = E(cid:3) E(cid:3)[(cid:0)2logf(Z;(cid:2)^ )](cid:0)E [(cid:0)2logf(X;(cid:2)^ )]: (2.7) k X Z k X k This means that a naive estimator of R is (cid:0)2logf(X;(cid:2)^ ), and b is its bias k k k term in the estimation of R . In this paper we assume that k g(X) = f(X;(cid:2) ); for some (cid:2) : (2.8) 0 0 In the following we write (cid:2) as (cid:2) simply. 0 Let x(cid:22)(i) and x(cid:22) be the sample mean vectors of the observations of the ith groups and all the groups, respectively, i.e., 1 ∑Ni 1 ∑q+1 ∑Ni x(cid:22)(i) = x(i); i = 1;:::;q +1; x(cid:22) = x(i): N j N j i j=1 i=1 j=1 Further, let W and B be the matrices of sums of squares and products due to within-groups and between-groups, respectively, i.e., ∑q+1 ∑Ni ∑q+1 W = (x(i) (cid:0)x(cid:22)(i))(x(i) (cid:0)x(cid:22)(i))′; B = N (x(cid:22)(i) (cid:0)x(cid:22))(x(cid:22)(i) (cid:0)x(cid:22))′: j j i i=1 j=1 i=1 Put T = W + B and partition W, B and T in the same way as in (2.3). ^ Then we can write the MLE (cid:2) of (cid:2) under M (see, e.g., Fujikoshi (1985)) k k 4 as (cid:22)^(i) = x(cid:22)(i); i = 1;:::;q; (cid:22)^ = x(cid:22) (cid:0)(cid:0)^x(cid:22) ; 1 1 2 2 1 (cid:0)^ = T T(cid:0)1; N(cid:6)^ = W ; (2.9) 21 11 11 11 N(cid:6)^ = T = T (cid:0)T T(cid:0)1T ; 22(cid:1)1 22(cid:1)1 22 21 11 12 and hence, putting ℓ (W;T) = (cid:0)2logf(X;(cid:2)^ ) we have k k ℓ (W;T) = N logjN(cid:0)1W j+N logjN(cid:0)1T j+Np(1+log2(cid:25)) k 11 22(cid:1)1 jW j = (cid:0)N log 22(cid:1)1 +N logjN(cid:0)1Wj+Np(1+log2(cid:25)): (2.10) jT j 22(cid:1)1 Our main concern is to evaluate the bias term b under the assumption k of normality. Note that in the evaluation, it is not assumed that M includes k the true model. Put ∑q+1 ∑q+1 N N (cid:22)(cid:22) = i(cid:22)(i); (cid:4) = i((cid:22)(i) (cid:0)(cid:22)(cid:22))((cid:22)(i) (cid:0)(cid:22)(cid:22))′: N N i=1 i=1 Corresponding to a partition of (cid:6), we partition (cid:4) as ( ) (cid:4) (cid:4) (cid:4) = 11 12 : (cid:4) (cid:4) 21 22 Then, we can write b (see, e.g., Fujikoshi (2002)) as k Nk(N +q +1) b = (cid:0)Np+ +N2~b ; (2.11) k N (cid:0)k (cid:0)q (cid:0)2 k where [ { } { }] ~b = E trT(cid:0)1 (1+N(cid:0)1)(cid:6)+(cid:4) (cid:0)trT(cid:0)1 (1+N(cid:0)1)(cid:6) +(cid:4) : (2.12) k 11 11 11 Here the expectation E means the one under general normal populations. Note that T and T are distributed as noncentral Wishart distributions 11 W (n;(cid:6);N(cid:4)) and W (n;(cid:6) ;N(cid:4) ), respectively, where n = N (cid:0)1. p k 11 11 Our interest is to examine the problem of evaluating the bias term b and k estimating it. In general, the bias correction problem has been studied under a usual large sample framework, LS : p;q;k; (cid:12)x; N ! 1: 5 Fujikoshi (1985) derived an asymptotic unbiased estimator for b under LS k without assuming that M includes the true model, which is given as k b = b(0) +b(1); (2.13) LS LS LS where { } 1 b(0) = 2 k(q +1)+p(cid:0)k + p(p+1) ; LS 2 ( ) ( ) 1 1 1 1 b(1) = (cid:0)2 trC + trC2 + (trC)2 +2 trC + trC2 + (trC )2 : LS 2 2 k 2 k 2 k Here C = (cid:6)(cid:0)1(cid:4)(I +(cid:6)(cid:0)1(cid:4))(cid:0)1; C = (cid:6)(cid:0)1(cid:4) (I +(cid:6)(cid:0)1(cid:4) )(cid:0)1; p k k k k k k and (cid:6) and (cid:4) are the (cid:12)rst k (cid:2) k submatrices of (cid:6) and (cid:4), respectively. k k When M includes the true model, b(1) = 0, and hence b = b(0), which k LS LS LS corresponds to 2(cid:2)(the number of independent parameters under M ). The k usual AIC and its corrected version have been proposed as AIC = ℓ (W;T)+b(0); (2.14) k LS and ^ CAIC = ℓ (W;T)+b ; (2.15) k LS ^ respectively. Here, b is the one obtained from b by substituting the LS LS sample quantities to C and C . k However, the result will not work well as the dimension p increases. In order to overcome this weakness, we study asymptotic unbiased estimator for b under two high-dimensional frameworks such that k HD1 : q;k;(cid:12)x; N ! 1; p ! 1; N (cid:0)p ! 1; c = p=N ! c 2 (0;1): 0 HD2 : q;(cid:12)x; N ! 1; p ! 1; N (cid:0)p ! 1; c = p=N ! c 2 (0;1); 0 k ! 1; N (cid:0)k ! 1; d = k=N ! d 2 (0;1): 0 ^ Our aim is to construct b such that HD E[^b ] = b +O (N(cid:0)1;p(cid:0)1;k(cid:0)1); HD HD 1=2 6 where O (N(cid:0)1;p(cid:0)1;k(cid:0)1) denotes the term of the jth order with respect to j (N(cid:0)1;p(cid:0)1;k(cid:0)1). The AIC with such a bias term is denoted as ^ HAIC = ℓ (W;T)+b : (2.16) k HD 3. Asymptotic Evaluation of the Bias Term In this section we obtain asymptotic expressions for b in (2.12) under k high-dimensional frameworks. In the derivation we are not necessary to as- sume that M includes the true model M(cid:3), but we assume only that the true k model M(cid:3) satis(cid:12)es (2.8), i.e., that (cid:5) ’s are normal populations. It is easy to i ~ see that the b in (2.12) can be expressed as k [ { } { }] ~b = E trT(cid:0)1 (1+N(cid:0)1)I +Ω (cid:0)trT(cid:0)1 (1+N(cid:0)1)I +(cid:9) : (3.1) k p p k k k Here Ω = diag(! ::::;! ;0::::;0); (cid:9) = diag( ::::; ;0::::;0); (3.2) p 1 q k 1 q where ! (cid:21) (cid:1)(cid:1)(cid:1) (cid:21) ! (cid:21) 0 and (cid:21) (cid:1)(cid:1)(cid:1) (cid:21) (cid:21) 0 are possibly non-zero 1 q 1 q roots of (cid:6)(cid:0)1(cid:4) and (cid:6)(cid:0)1(cid:4) , respectively. Further, T and T are distributed 11 11 k as noncentral Wishart distributions W (n;I ;NΩ ) and W (n;I ;N(cid:9) ), re- p p p k k k spectively. Such reductions are obtained by considering appropriate trans- formations. For example, let H be an orthogonal matrix H such that (cid:4)~ = H′(cid:6)(cid:0)1=2(cid:4)(cid:6)(cid:0)1=2H = Ω : p Then { } { } trT(cid:0)1 (1+N(cid:0)1)(cid:6)+(cid:4) = tr(H′(cid:6)(cid:0)1=2T(cid:6)(cid:0)1=2H)(cid:0)1 (1+N(cid:0)1)I +Ω : p p The reduction of the (cid:12)rst expectation is obtained by noting that H′(cid:6)(cid:0)1=2T(cid:6)(cid:0)1=2H (cid:24) W (n;I ;NΩ ): p p p Similarly the expectation of the second term is obtained. 7 3.1 Two-Groups Case For two-groups case, Fujikoshi (2002) used the following expression for the bias term b : k Nk(N +2) b = (cid:0)Np+ k N (cid:0)k (cid:0)3 +N2fQ(N;p;(cid:28)2)(cid:0)Q(N;k;(cid:28)2)g; (3.3) 1 where Q(N;p;(cid:28)2) = E[tr(S +uu′)(cid:0)1f(1+N(cid:0)1)I +(cid:23)(cid:23)′g]; p Q(N;k;(cid:28)2) = E[tr(S +u u′)(cid:0)1f(1+N(cid:0)1)I +(cid:23) (cid:23)′g]; 1 11 1 1 k 1 1 p (cid:23) = ( N N =N)(cid:6)(cid:0)1=2((cid:22)(1) (cid:0)(cid:22)(2)); (cid:28)2 = (cid:23)′(cid:23) = f(N N )=N2g∆2 and 1 2 1 2 ∆2 = ((cid:22)(1) (cid:0)(cid:22)(2))′(cid:6)(cid:0)1((cid:22)(1) (cid:0)(cid:22)(2)): Here u and S are independently distributed as a normal distribution p N ( N(cid:23);I ) and a Wishart distribution W (N (cid:0)2;I ), respectively. Simi- p p p p larly S ;u and (cid:23) denote the (cid:12)rst k parts of S;u and (cid:23), respectively, and 11 1 1 (cid:28)2 = (cid:23)′(cid:23) . Using that 1 1 1 (S +uu′)(cid:0)1 = S(cid:0)1 (cid:0)(1+u′S(cid:0)1u)(cid:0)1S(cid:0)1uu′S(cid:0)1; Fujikoshi (2002) derived an asymptotic expansion of Q(N;p;(cid:28)2) such that Q(N;p;(cid:28)2) = Q (N;p;(cid:28)2)+O (N(cid:0)1;p(cid:0)1); (3.4) 1 5=2 p assuming that u (cid:24) N ((cid:23);I ). However, the mean of u is not (cid:23), but N(cid:23). p p For a general case, the Q in Fujikoshi (2002) should be corrected as follows: 1 N(Np+p(cid:0)4) N2Q (N;p;(cid:28)2) = 1 N (cid:0)p(cid:0)2 ( ) N (cid:0)3 { } +2 3(1+(cid:28)2)(cid:0)1 (cid:0)(1+(cid:28)2)(cid:0)2 : (3.5) N (cid:0)p(cid:0)2 These imply that the bias term can be expressed under HD2 as Nk(N +2) b = (cid:0)Np+ k N (cid:0)k (cid:0)3 +N2fQ (N;p;(cid:28)2)(cid:0)Q (N;k;(cid:28)2)g+O (N(cid:0)1;p(cid:0)1;k(cid:0)1): (3.6) 1 1 1 1=2 8 We note that the result (3.6) is the same as a special case of the result (see Theorem 3.3) in the present paper. It is difficult to extend the above method to the case q (cid:21) 2. For the case q (cid:21) 2, we give an alternative method which is more powerful. 3.2 Some Basic Results ~ For our evaluation of the b in (3.1), we use the following theorem. k Theorem 3.1. Let T (cid:24) W (N (cid:0)1;I ;NΩ), and partition T and Ω as p p ( ) ( ) T T Ω Ω T = 11 12 ; Ω = 11 12 ; T T Ω Ω 21 22 21 22 where T and Ω are p (cid:2)p . If Ω , Ω and Ω are zero matrices, then ij ij i j 12 21 22 [ {( ) }] ( ) 1 1 p(cid:0)p E trT(cid:0)1 1+ I +Ω = 1+ 1 N p N N (cid:0)p(cid:0)2 ( ) [ {( ) }] N (cid:0)p (cid:0)2 1 + 1 E trT(cid:0)1 1+ I +Ω : N (cid:0)p(cid:0)2 11 N p1 11 The proof of Theorem 3.1 is given in Appendix. Moreover, we use the following lemma which is obtained by modifying the derivation in Fujikoshi (1985). Lemma 3.1. Suppose that T (cid:24) W (N (cid:0) 1;(cid:6);mΩ), and let L;p (cid:2) p be a p constant matrix. Then, if p is (cid:12)xed, m = O(N) and Ω = O(1), then E[N2trT(cid:0)1L] = NtrM∆+(p+3)trM∆2 +tr∆trM∆ (cid:0)trM∆3 (cid:0)tr∆trM∆2 +O(N(cid:0)1=2); (3.7) where M = (cid:6)(cid:0)1L and ∆ = (I +(m=N)(cid:6)(cid:0)1Ω)(cid:0)1. In particular, if m = N, p [ {( ) }] 1 E N2trT(cid:0)1 1+ I +Ω = Np+2(p+2)(cid:17) (cid:0)(cid:17) (cid:0)(cid:17)2 +O(N(cid:0)1=2); N p 1 2 1 where (cid:17) = tr(I +Ω)(cid:0)i. i p 9 CombiningTheorem3.1andLemma3.1,weobtainthefollowingtheorem. Theorem 3.2. Suppose that T and Ω are the same ones as in Theorem 3.1. Then, if p is (cid:12)xed, 1 [ {( ) }] 1 NfNp+p(cid:0)p2 (cid:0)3p g E N2trT(cid:0)1 1+ I +Ω = 1 1 N p N (cid:0)p(cid:0)2 ( ) N (cid:0)p (cid:0)2 { } + 1 K +O(N(cid:0)1=2) ; (3.8) N (cid:0)p(cid:0)2 where K = 2(p +2)(cid:17) (cid:0)(cid:17) (cid:0)(cid:17)2; (cid:17) = tr(I +Ω )(cid:0)i: 1 1 2 1 i p1 11 It may be noted that the O(N(cid:0)1=2) in (3.8) is not the same one as in the usual large-sample case where p is (cid:12)xed. The result (3.8) holds also under the assumption that p is not (cid:12)xed, for example, under HD1. 3.3 Evaluations of (3.1) In this section we evaluate each of the expectations in (3.1). Using Theorem 3.2 the expectation of the (cid:12)rst term in (3.1) is expressed as [ {( ) }] 1 E N2trT(cid:0)1 1+ I +Ω p p N ( ) N(Np+p(cid:0)q2 (cid:0)3q) N (cid:0)q (cid:0)2 = + fK +O(N(cid:0)1=2)g; (3.9) N (cid:0)p(cid:0)2 N (cid:0)p(cid:0)2 ! where K = 2(q +2)(cid:17) (cid:0)(cid:17) (cid:0)(cid:17)2 ; (cid:17) = tr(I +Ω )(cid:0)i; (3.10) ! 1! 2! 1! i! q q and Ω = diag(! ;:::;! ). We note that the result holds under the asymp- q 1 q totic framework HD1, and also HD2 since the expectation does not depend on k. Next we show that [ {( ) }] 1 E N2trT(cid:0)1 1+ I +(cid:9) k N k k ( ) N(Nk +k (cid:0)q2 (cid:0)3q) N (cid:0)q (cid:0)2 { } = + K +O(N(cid:0)1=2) ;(3.11) N (cid:0)k (cid:0)2 N (cid:0)k (cid:0)2 10

Description:
dimension and the sample size tend to infinity, for a general class of candidated candidate model Mk, which means that the first k variate x1 = (x1,,xk). ′.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.