High-Dimensional AICs for Selection of Redundancy Models in Discriminant Analysis (cid:3) (cid:3)(cid:3) (cid:3) Tetsuro Sakurai , Takeshi Nakada and Yasunori Fujikoshi (cid:3)Faculty of Science and Engineering, Chuo University, Kasuga, Bunkyo-ku, 112-8551, Japan (cid:3)(cid:3)Financial Systems Development Dept. No. 4, ITOCHU Techno-Solutions Corporation, 1-2-2, Osaki, Shinagawa-ku, 141-8522, Japan Abstract This paper is concerned with high-dimensional modi(cid:12)cations of Akaike information criterion (AIC) for redundancy (no additional in- formation) models in discriminant analysis. The AIC has been pro- posed as an asymptotically unbiased estimator of the risk function when the dimension is (cid:12)xed and the sample size tends to in(cid:12)nity. On the other hand, Fujikoshi (2002) attempted to modify the AIC in two- groups discriminant analysis when the dimension and the sample size tend to in(cid:12)nity. However, its modi(cid:12)cation was obtained under a re- strictive assumption, and furthermore, it was difficult to extend the method to multiple-groups case. In this paper, by a new approach we propose HAIC which is an asymptotically unbiased estimator of the risk function in multiple-groups discriminant analysis when both the dimension and the sample size tend to in(cid:12)nity, for a general class of candidated models. By simulation experiments it is shown that HAIC is more useful than the usual AIC and CAIC. AMS 2000 subject classi(cid:12)cation: primary 62H12; secondary 62H30 Key Words and Phrases: AIC, Bias correction, Discriminant analysis, HAIC, High dimensional case, Multiple-groups case, Selection of variables. . 1 1. Introduction This paper is concerned with selection of variables in discriminant anal- ysis. One way of selecting variables is to formulate as a problem of selecting redundancy models based on no additional information hypothesis due to Rao (1948, 1973). Then, we apply the idea of a model selection criterion AIC to the models. The selection criterion AIC is proposed as an approximately unbiased estimator(AIC)oftheAICtypeofriskde(cid:12)nedbytheexpectedlog-predictive likelihood. The AIC for redundancy models has been proposed under large- sample framework, i.e., the dimension is (cid:12)xed and the sample size tends to in(cid:12)nity. Ontheotherhand,indiscriminantanalysisweencounterahigh-dimensional case, i.e., the case when the dimension is relatively large. There are some works on asymptotic approximations for the expected probabilities of mis- classi(cid:12)cation in the two-groups discriminant analysis under a high dimen- sional framework. For these works, see, e.g., Raudys (1972), Wyman et al. (1990), and Fujikoshi and Seo (1998), in which they point a goodness of such approximations. In this paper, we consider the problem of estimating the AIC type of risk when both the dimension and sample size are large, in multiple-groups discriminant analysis. More precisely, we attempt to reduce for the bias term when we estimate the AIC type of risk by (cid:0)2log likelihood, in a high dimensional case. An attempt has been done by Fujikoshi (2002) in two- groups discriminant analysis. However, a modi(cid:12)cation was obtained under a restrictive assumption (see Section 3.1), and furthermore, it was difficult to extend the method to multiple-groups case. In this paper, by a new approachweobtain an asymptoticallyunbiasedestimator ofthe riskfunction in multiple-groups discriminant analysis when both the dimension and the sample size tend to in(cid:12)nity, for a general class of candidated models. which 2 is not necessary to include the true model. Such an estimator is called high- dimensional AIC, which is denoted by HAIC. Furthermore, it is pointed that HAIC has smaller biases than the AIC in a large-sample framework in a wide range of dimensions and sample sizes, through simulation experiments. It is also shown that HIC provides better model selections than does AIC or CAIC. 2. Preliminaries Let x = (x ;:::;x )′ be a p-dimensional random vector measurable on 1 p the individuals of each of q + 1 populations (cid:5) ;:::;(cid:5) . Let x(i);:::;x(i) 1 q+1 1 Ni be a sample from (cid:5) , and denote all the observations by the N (cid:2)p matrix, i ( ) X = x(1);:::;x(1);:::;x(q+1);:::;x(q+1) ; (2.1) 1 N1 1 Nq+1 where N = N +(cid:1)(cid:1)(cid:1)+N . It is assumed that 1 q+1 E(xj(cid:5) ) = (cid:22)(i); Var(xj(cid:5) ) = (cid:6): (2.2) i i WeconsiderAICanditshigh-dimensionalmodi(cid:12)cationsforaredundancy model of a given subset of fx ;:::;x g. Without loss of generality we treat a 1 p candidate model M , which means that the (cid:12)rst k variate x = (x ;:::;x )′ k 1 1 k is sufficient, or the remainder variate x = (x ;:::;x ) is redundant, i.e., 2 k+1 p the remainder p (cid:0) k variate x has no additional information in canonical 2 discriminant analysis, in presence of x . In order to write the model M in 1 k terms of unknown parameters, let us consider the partitions ( ) ( ) (cid:22)(i) (cid:6) (cid:6) (cid:22)(i) = 1 ; (cid:6) = 11 12 ; (2.3) (cid:22)(i) (cid:6) (cid:6) 21 22 2 and let (cid:22)(i) = (cid:22)(i) (cid:0)(cid:0)(cid:22)(i); i = 1;:::;q +1; (cid:0) = (cid:6) (cid:6)(cid:0)1: 2(cid:1)1 2 1 21 11 Then, M is de(cid:12)ned by k ( ) M : (cid:22)(1) = (cid:1)(cid:1)(cid:1) = (cid:22)(q+1) = (cid:22)(i) : (2.4) k 2(cid:1)1 2(cid:1)1 2(cid:1)1 3 Let f(X;(cid:2)) be the density function of X under M with (cid:2) = f(cid:22)(1);:::; k (cid:22)(q+1);(cid:6)g. Further, let g(X) be the density function of X under the true model M(cid:3). Then, we can write the AIC type of risk de(cid:12)ned by the expected log-predictive likelihood for a model M as k R = E(cid:3) E(cid:3)[(cid:0)2logf(Z;(cid:2)^ )]; (2.5) k X Z k where Z is an N (cid:2) p random matrix that has the same distribution as X ^ and is independent of X, and (cid:2) is the maximum likelihood estimator of (cid:2) k (cid:3) (cid:3) under M . Here E and Var denote the expectation under the true model. k Note that the risk R can be expressed as k R = E(cid:3) [(cid:0)2logf(X;(cid:2)^ )]+b ; (2.6) k X k k where b = E(cid:3) E(cid:3)[(cid:0)2logf(Z;(cid:2)^ )](cid:0)E [(cid:0)2logf(X;(cid:2)^ )]: (2.7) k X Z k X k This means that a naive estimator of R is (cid:0)2logf(X;(cid:2)^ ), and b is its bias k k k term in the estimation of R . In this paper we assume that k g(X) = f(X;(cid:2) ); for some (cid:2) : (2.8) 0 0 In the following we write (cid:2) as (cid:2) simply. 0 Let x(cid:22)(i) and x(cid:22) be the sample mean vectors of the observations of the ith groups and all the groups, respectively, i.e., 1 ∑Ni 1 ∑q+1 ∑Ni x(cid:22)(i) = x(i); i = 1;:::;q +1; x(cid:22) = x(i): N j N j i j=1 i=1 j=1 Further, let W and B be the matrices of sums of squares and products due to within-groups and between-groups, respectively, i.e., ∑q+1 ∑Ni ∑q+1 W = (x(i) (cid:0)x(cid:22)(i))(x(i) (cid:0)x(cid:22)(i))′; B = N (x(cid:22)(i) (cid:0)x(cid:22))(x(cid:22)(i) (cid:0)x(cid:22))′: j j i i=1 j=1 i=1 Put T = W + B and partition W, B and T in the same way as in (2.3). ^ Then we can write the MLE (cid:2) of (cid:2) under M (see, e.g., Fujikoshi (1985)) k k 4 as (cid:22)^(i) = x(cid:22)(i); i = 1;:::;q; (cid:22)^ = x(cid:22) (cid:0)(cid:0)^x(cid:22) ; 1 1 2 2 1 (cid:0)^ = T T(cid:0)1; N(cid:6)^ = W ; (2.9) 21 11 11 11 N(cid:6)^ = T = T (cid:0)T T(cid:0)1T ; 22(cid:1)1 22(cid:1)1 22 21 11 12 and hence, putting ℓ (W;T) = (cid:0)2logf(X;(cid:2)^ ) we have k k ℓ (W;T) = N logjN(cid:0)1W j+N logjN(cid:0)1T j+Np(1+log2(cid:25)) k 11 22(cid:1)1 jW j = (cid:0)N log 22(cid:1)1 +N logjN(cid:0)1Wj+Np(1+log2(cid:25)): (2.10) jT j 22(cid:1)1 Our main concern is to evaluate the bias term b under the assumption k of normality. Note that in the evaluation, it is not assumed that M includes k the true model. Put ∑q+1 ∑q+1 N N (cid:22)(cid:22) = i(cid:22)(i); (cid:4) = i((cid:22)(i) (cid:0)(cid:22)(cid:22))((cid:22)(i) (cid:0)(cid:22)(cid:22))′: N N i=1 i=1 Corresponding to a partition of (cid:6), we partition (cid:4) as ( ) (cid:4) (cid:4) (cid:4) = 11 12 : (cid:4) (cid:4) 21 22 Then, we can write b (see, e.g., Fujikoshi (2002)) as k Nk(N +q +1) b = (cid:0)Np+ +N2~b ; (2.11) k N (cid:0)k (cid:0)q (cid:0)2 k where [ { } { }] ~b = E trT(cid:0)1 (1+N(cid:0)1)(cid:6)+(cid:4) (cid:0)trT(cid:0)1 (1+N(cid:0)1)(cid:6) +(cid:4) : (2.12) k 11 11 11 Here the expectation E means the one under general normal populations. Note that T and T are distributed as noncentral Wishart distributions 11 W (n;(cid:6);N(cid:4)) and W (n;(cid:6) ;N(cid:4) ), respectively, where n = N (cid:0)1. p k 11 11 Our interest is to examine the problem of evaluating the bias term b and k estimating it. In general, the bias correction problem has been studied under a usual large sample framework, LS : p;q;k; (cid:12)x; N ! 1: 5 Fujikoshi (1985) derived an asymptotic unbiased estimator for b under LS k without assuming that M includes the true model, which is given as k b = b(0) +b(1); (2.13) LS LS LS where { } 1 b(0) = 2 k(q +1)+p(cid:0)k + p(p+1) ; LS 2 ( ) ( ) 1 1 1 1 b(1) = (cid:0)2 trC + trC2 + (trC)2 +2 trC + trC2 + (trC )2 : LS 2 2 k 2 k 2 k Here C = (cid:6)(cid:0)1(cid:4)(I +(cid:6)(cid:0)1(cid:4))(cid:0)1; C = (cid:6)(cid:0)1(cid:4) (I +(cid:6)(cid:0)1(cid:4) )(cid:0)1; p k k k k k k and (cid:6) and (cid:4) are the (cid:12)rst k (cid:2) k submatrices of (cid:6) and (cid:4), respectively. k k When M includes the true model, b(1) = 0, and hence b = b(0), which k LS LS LS corresponds to 2(cid:2)(the number of independent parameters under M ). The k usual AIC and its corrected version have been proposed as AIC = ℓ (W;T)+b(0); (2.14) k LS and ^ CAIC = ℓ (W;T)+b ; (2.15) k LS ^ respectively. Here, b is the one obtained from b by substituting the LS LS sample quantities to C and C . k However, the result will not work well as the dimension p increases. In order to overcome this weakness, we study asymptotic unbiased estimator for b under two high-dimensional frameworks such that k HD1 : q;k;(cid:12)x; N ! 1; p ! 1; N (cid:0)p ! 1; c = p=N ! c 2 (0;1): 0 HD2 : q;(cid:12)x; N ! 1; p ! 1; N (cid:0)p ! 1; c = p=N ! c 2 (0;1); 0 k ! 1; N (cid:0)k ! 1; d = k=N ! d 2 (0;1): 0 ^ Our aim is to construct b such that HD E[^b ] = b +O (N(cid:0)1;p(cid:0)1;k(cid:0)1); HD HD 1=2 6 where O (N(cid:0)1;p(cid:0)1;k(cid:0)1) denotes the term of the jth order with respect to j (N(cid:0)1;p(cid:0)1;k(cid:0)1). The AIC with such a bias term is denoted as ^ HAIC = ℓ (W;T)+b : (2.16) k HD 3. Asymptotic Evaluation of the Bias Term In this section we obtain asymptotic expressions for b in (2.12) under k high-dimensional frameworks. In the derivation we are not necessary to as- sume that M includes the true model M(cid:3), but we assume only that the true k model M(cid:3) satis(cid:12)es (2.8), i.e., that (cid:5) ’s are normal populations. It is easy to i ~ see that the b in (2.12) can be expressed as k [ { } { }] ~b = E trT(cid:0)1 (1+N(cid:0)1)I +Ω (cid:0)trT(cid:0)1 (1+N(cid:0)1)I +(cid:9) : (3.1) k p p k k k Here Ω = diag(! ::::;! ;0::::;0); (cid:9) = diag( ::::; ;0::::;0); (3.2) p 1 q k 1 q where ! (cid:21) (cid:1)(cid:1)(cid:1) (cid:21) ! (cid:21) 0 and (cid:21) (cid:1)(cid:1)(cid:1) (cid:21) (cid:21) 0 are possibly non-zero 1 q 1 q roots of (cid:6)(cid:0)1(cid:4) and (cid:6)(cid:0)1(cid:4) , respectively. Further, T and T are distributed 11 11 k as noncentral Wishart distributions W (n;I ;NΩ ) and W (n;I ;N(cid:9) ), re- p p p k k k spectively. Such reductions are obtained by considering appropriate trans- formations. For example, let H be an orthogonal matrix H such that (cid:4)~ = H′(cid:6)(cid:0)1=2(cid:4)(cid:6)(cid:0)1=2H = Ω : p Then { } { } trT(cid:0)1 (1+N(cid:0)1)(cid:6)+(cid:4) = tr(H′(cid:6)(cid:0)1=2T(cid:6)(cid:0)1=2H)(cid:0)1 (1+N(cid:0)1)I +Ω : p p The reduction of the (cid:12)rst expectation is obtained by noting that H′(cid:6)(cid:0)1=2T(cid:6)(cid:0)1=2H (cid:24) W (n;I ;NΩ ): p p p Similarly the expectation of the second term is obtained. 7 3.1 Two-Groups Case For two-groups case, Fujikoshi (2002) used the following expression for the bias term b : k Nk(N +2) b = (cid:0)Np+ k N (cid:0)k (cid:0)3 +N2fQ(N;p;(cid:28)2)(cid:0)Q(N;k;(cid:28)2)g; (3.3) 1 where Q(N;p;(cid:28)2) = E[tr(S +uu′)(cid:0)1f(1+N(cid:0)1)I +(cid:23)(cid:23)′g]; p Q(N;k;(cid:28)2) = E[tr(S +u u′)(cid:0)1f(1+N(cid:0)1)I +(cid:23) (cid:23)′g]; 1 11 1 1 k 1 1 p (cid:23) = ( N N =N)(cid:6)(cid:0)1=2((cid:22)(1) (cid:0)(cid:22)(2)); (cid:28)2 = (cid:23)′(cid:23) = f(N N )=N2g∆2 and 1 2 1 2 ∆2 = ((cid:22)(1) (cid:0)(cid:22)(2))′(cid:6)(cid:0)1((cid:22)(1) (cid:0)(cid:22)(2)): Here u and S are independently distributed as a normal distribution p N ( N(cid:23);I ) and a Wishart distribution W (N (cid:0)2;I ), respectively. Simi- p p p p larly S ;u and (cid:23) denote the (cid:12)rst k parts of S;u and (cid:23), respectively, and 11 1 1 (cid:28)2 = (cid:23)′(cid:23) . Using that 1 1 1 (S +uu′)(cid:0)1 = S(cid:0)1 (cid:0)(1+u′S(cid:0)1u)(cid:0)1S(cid:0)1uu′S(cid:0)1; Fujikoshi (2002) derived an asymptotic expansion of Q(N;p;(cid:28)2) such that Q(N;p;(cid:28)2) = Q (N;p;(cid:28)2)+O (N(cid:0)1;p(cid:0)1); (3.4) 1 5=2 p assuming that u (cid:24) N ((cid:23);I ). However, the mean of u is not (cid:23), but N(cid:23). p p For a general case, the Q in Fujikoshi (2002) should be corrected as follows: 1 N(Np+p(cid:0)4) N2Q (N;p;(cid:28)2) = 1 N (cid:0)p(cid:0)2 ( ) N (cid:0)3 { } +2 3(1+(cid:28)2)(cid:0)1 (cid:0)(1+(cid:28)2)(cid:0)2 : (3.5) N (cid:0)p(cid:0)2 These imply that the bias term can be expressed under HD2 as Nk(N +2) b = (cid:0)Np+ k N (cid:0)k (cid:0)3 +N2fQ (N;p;(cid:28)2)(cid:0)Q (N;k;(cid:28)2)g+O (N(cid:0)1;p(cid:0)1;k(cid:0)1): (3.6) 1 1 1 1=2 8 We note that the result (3.6) is the same as a special case of the result (see Theorem 3.3) in the present paper. It is difficult to extend the above method to the case q (cid:21) 2. For the case q (cid:21) 2, we give an alternative method which is more powerful. 3.2 Some Basic Results ~ For our evaluation of the b in (3.1), we use the following theorem. k Theorem 3.1. Let T (cid:24) W (N (cid:0)1;I ;NΩ), and partition T and Ω as p p ( ) ( ) T T Ω Ω T = 11 12 ; Ω = 11 12 ; T T Ω Ω 21 22 21 22 where T and Ω are p (cid:2)p . If Ω , Ω and Ω are zero matrices, then ij ij i j 12 21 22 [ {( ) }] ( ) 1 1 p(cid:0)p E trT(cid:0)1 1+ I +Ω = 1+ 1 N p N N (cid:0)p(cid:0)2 ( ) [ {( ) }] N (cid:0)p (cid:0)2 1 + 1 E trT(cid:0)1 1+ I +Ω : N (cid:0)p(cid:0)2 11 N p1 11 The proof of Theorem 3.1 is given in Appendix. Moreover, we use the following lemma which is obtained by modifying the derivation in Fujikoshi (1985). Lemma 3.1. Suppose that T (cid:24) W (N (cid:0) 1;(cid:6);mΩ), and let L;p (cid:2) p be a p constant matrix. Then, if p is (cid:12)xed, m = O(N) and Ω = O(1), then E[N2trT(cid:0)1L] = NtrM∆+(p+3)trM∆2 +tr∆trM∆ (cid:0)trM∆3 (cid:0)tr∆trM∆2 +O(N(cid:0)1=2); (3.7) where M = (cid:6)(cid:0)1L and ∆ = (I +(m=N)(cid:6)(cid:0)1Ω)(cid:0)1. In particular, if m = N, p [ {( ) }] 1 E N2trT(cid:0)1 1+ I +Ω = Np+2(p+2)(cid:17) (cid:0)(cid:17) (cid:0)(cid:17)2 +O(N(cid:0)1=2); N p 1 2 1 where (cid:17) = tr(I +Ω)(cid:0)i. i p 9 CombiningTheorem3.1andLemma3.1,weobtainthefollowingtheorem. Theorem 3.2. Suppose that T and Ω are the same ones as in Theorem 3.1. Then, if p is (cid:12)xed, 1 [ {( ) }] 1 NfNp+p(cid:0)p2 (cid:0)3p g E N2trT(cid:0)1 1+ I +Ω = 1 1 N p N (cid:0)p(cid:0)2 ( ) N (cid:0)p (cid:0)2 { } + 1 K +O(N(cid:0)1=2) ; (3.8) N (cid:0)p(cid:0)2 where K = 2(p +2)(cid:17) (cid:0)(cid:17) (cid:0)(cid:17)2; (cid:17) = tr(I +Ω )(cid:0)i: 1 1 2 1 i p1 11 It may be noted that the O(N(cid:0)1=2) in (3.8) is not the same one as in the usual large-sample case where p is (cid:12)xed. The result (3.8) holds also under the assumption that p is not (cid:12)xed, for example, under HD1. 3.3 Evaluations of (3.1) In this section we evaluate each of the expectations in (3.1). Using Theorem 3.2 the expectation of the (cid:12)rst term in (3.1) is expressed as [ {( ) }] 1 E N2trT(cid:0)1 1+ I +Ω p p N ( ) N(Np+p(cid:0)q2 (cid:0)3q) N (cid:0)q (cid:0)2 = + fK +O(N(cid:0)1=2)g; (3.9) N (cid:0)p(cid:0)2 N (cid:0)p(cid:0)2 ! where K = 2(q +2)(cid:17) (cid:0)(cid:17) (cid:0)(cid:17)2 ; (cid:17) = tr(I +Ω )(cid:0)i; (3.10) ! 1! 2! 1! i! q q and Ω = diag(! ;:::;! ). We note that the result holds under the asymp- q 1 q totic framework HD1, and also HD2 since the expectation does not depend on k. Next we show that [ {( ) }] 1 E N2trT(cid:0)1 1+ I +(cid:9) k N k k ( ) N(Nk +k (cid:0)q2 (cid:0)3q) N (cid:0)q (cid:0)2 { } = + K +O(N(cid:0)1=2) ;(3.11) N (cid:0)k (cid:0)2 N (cid:0)k (cid:0)2 10
Description: