Conditional Akaike information under covariate shift with application to small area prediction

Yuki Kawakubo*, Shonosuke Sugasawa†, and Tatsuya Kubokawa‡

*Faculty of Law, Politics and Economics, Chiba University (E-mail: [email protected])
†Institute of Statistical Mathematics
‡Faculty of Economics, University of Tokyo

Abstract

In this study, we consider the problem of selecting the explanatory variables of the fixed effects in linear mixed models under covariate shift, which is the situation in which the values of the covariates in the predictive model differ from those in the observed model. We construct a variable selection criterion based on the conditional Akaike information introduced by Vaida and Blanchard (2005). We focus especially on covariate shift in small area prediction and demonstrate the usefulness of the proposed criterion. In addition, numerical performance is investigated through simulations, one of which is a design-based simulation using a real dataset of land prices.

Keywords: Akaike information criterion; Conditional AIC; Covariate shift; Linear mixed model; Small area estimation; Variable selection.

1 Introduction

Linear mixed models have long been studied theoretically and have many applications, for example, longitudinal data analysis in biostatistics, panel data analysis in econometrics, and small area estimation in official statistics. The problem of selecting explanatory variables in linear mixed models is important and has been investigated in the literature; Müller et al. (2013) give a good survey of model selection in linear mixed models.

When the purpose of variable selection is to find a set of significant variables for good prediction, Akaike-type information criteria (Akaike, 1973, 1974) are well-known methods. However, the Akaike information criterion (AIC) based on the marginal likelihood, which integrates the likelihood with respect to the random effects, is not appropriate when the prediction focuses on the random effects. For this reason, Vaida and Blanchard (2005) proposed considering Akaike-type information based on the conditional density given the random effects and proposed the conditional AIC (cAIC). Here, we provide a brief explanation of the cAIC concept. Let $f(y \mid b, \eta)$ be the conditional density function of $y$ given $b$, where $y$ is an observable random vector of response variables, $\eta$ is a vector of unknown parameters, and $b$ is a random vector of random effects. Let $\pi(b \mid \eta)$ be the density function of $b$. Then, Vaida and Blanchard (2005) proposed measuring the prediction risk of the plug-in predictive density $f(\tilde{y} \mid \hat{b}, \hat{\eta})$ by the Kullback–Leibler divergence:

$$\iint \left[ \int \log\left\{ \frac{f(\tilde{y} \mid b, \eta)}{f(\tilde{y} \mid \hat{b}, \hat{\eta})} \right\} f(\tilde{y} \mid b, \eta)\, d\tilde{y} \right] f(y \mid b, \eta)\, \pi(b \mid \eta)\, dy\, db, \tag{1}$$

where $\tilde{y}$ is an independent replication of $y$ given $b$, and $\hat{b}$ and $\hat{\eta}$ are some predictor of $b$ and estimator of $\eta$, respectively. The cAIC is an (asymptotically) unbiased estimator of the part of the risk in (1) called the conditional Akaike information (cAI), given by

$$\mathrm{cAI} = \iiint -2\log\left\{ f(\tilde{y} \mid \hat{b}, \hat{\eta}) \right\} f(\tilde{y} \mid b, \eta)\, f(y \mid b, \eta)\, \pi(b \mid \eta)\, d\tilde{y}\, dy\, db.$$

The cAIC as a variable selection criterion in linear mixed models has been studied by Liang et al. (2008), Greven and Kneib (2010), Srivastava and Kubokawa (2010), Kubokawa (2011), Kawakubo and Kubokawa (2014), and others. Furthermore, the cAIC has been constructed as a variable selection criterion in generalized linear mixed models by Donohue et al. (2011), Yu and Yau (2012), Yu et al. (2013), Saefken et al. (2014), and others.
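The definition above can be checked numerically. The following is a minimal Monte Carlo sketch of the cAI for a toy Gaussian linear mixed model with a known variance ratio; all dimensions, designs, and parameter values are illustrative and not taken from the paper.

```python
import numpy as np

# Minimal sketch: cAI as the average of -2 log f(ytilde | bhat, etahat),
# where (b, y) is drawn from the model, the plug-in predictor is fitted,
# and ytilde is an independent replication given the same b.
rng = np.random.default_rng(0)
n, q = 40, 4
Z = np.kron(np.eye(q), np.ones((n // q, 1)))          # random-intercept design
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta, psi = np.array([1.0, 2.0]), 0.8                 # psi = tau^2 / sigma^2
Sigma = psi * Z @ Z.T + np.eye(n)                     # Sigma = Z G Z' + R
Sinv = np.linalg.inv(Sigma)

cai, B = 0.0, 2000
for _ in range(B):
    b = rng.normal(0.0, np.sqrt(psi), q)              # sigma^2 = 1
    y = X @ beta + Z @ b + rng.normal(size=n)
    bh = np.linalg.solve(X.T @ Sinv @ X, X.T @ Sinv @ y)   # ML beta-hat
    s2 = (y - X @ bh) @ Sinv @ (y - X @ bh) / n            # ML sigma-hat^2
    bp = psi * Z.T @ Sinv @ (y - X @ bh)                   # empirical Bayes b-hat
    yt = X @ beta + Z @ b + rng.normal(size=n)             # replication given b
    r = yt - X @ bh - Z @ bp
    cai += (n * np.log(2 * np.pi * s2) + r @ r / s2) / B
print("Monte Carlo cAI:", cai)
```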
Considering the prediction problem, it is often the case that the values of the covariates in the predictive model differ from those in the observed model, a situation we call covariate shift. Here, we call the model in which $y$ is the vector of response variables the "observed model," and the model in which $\tilde{y}$ is the vector of response variables the "predictive model." Note that the term "covariate shift" was first used by Shimodaira (2000), who defined it as the situation in which the distribution of the covariates in the predictive model differs from that in the observed model. In this study, although we treat the covariates as non-random, we use the same term "covariate shift" as Shimodaira (2000). Even when information about the covariates in the predictive model is available, most Akaike-type information criteria do not use it, because they are derived under the assumption that the predictive model is the same as the observed model. In the cAIC described above, the conditional density of $y$ given $b$ and that of $\tilde{y}$ given $b$ are the same, both denoted by $f(\cdot \mid b, \eta)$. Under covariate shift, by contrast, the conditional density of $\tilde{y}$ given $b$ differs from that of $y$ given $b$ and is denoted by $g(\tilde{y} \mid b, \eta)$. When the aim of the variable selection is to choose the best predictive model, it is not appropriate to use only the covariates in the observed model. We therefore redefine the cAI under covariate shift as

$$\mathrm{cAI} = \iiint -2\log\left\{ g(\tilde{y} \mid \hat{b}, \hat{\eta}) \right\} g(\tilde{y} \mid b, \eta)\, f(y \mid b, \eta)\, \pi(b \mid \eta)\, d\tilde{y}\, dy\, db,$$

and construct an information criterion as an asymptotically unbiased estimator of this cAI. Satoh (1997, 2000) considered a similar problem in the multivariate linear regression model and proposed a variable selection criterion. It is important to note that we do not assume that the candidate model is overspecified, in other words, that the candidate model includes the true model. Although most Akaike-type information criteria make the overspecified assumption, it is not appropriate for estimating the cAI under covariate shift. We discuss this point in Section 3.2.
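A minimal sketch of the redefined cAI, again for a toy Gaussian model with a known variance ratio: compared with the sketch in the Introduction, the only substantive change is that the replication $\tilde{y}$ is generated, and the plug-in density evaluated, with predictive covariates $\tilde{X}$ (and $\tilde{Z}$) that differ from the observed ones. All values are illustrative.

```python
import numpy as np

# Monte Carlo sketch of the cAI under covariate shift: ytilde is drawn from
# g (with shifted covariates Xt, Zt), while the fit uses the observed X, Z.
rng = np.random.default_rng(0)
n, m, q = 40, 20, 4
Z  = np.kron(np.eye(q), np.ones((n // q, 1)))
Zt = np.kron(np.eye(q), np.ones((m // q, 1)))
X  = np.column_stack([np.ones(n), rng.normal(0.0, 1.0, n)])  # observed covariates
Xt = np.column_stack([np.ones(m), rng.normal(2.0, 1.0, m)])  # shifted covariates
beta, psi = np.array([1.0, 2.0]), 0.8                        # psi = tau^2/sigma^2
Sigma = psi * Z @ Z.T + np.eye(n)                            # Sigma = Z G Z' + R
Sinv = np.linalg.inv(Sigma)

cai, B = 0.0, 2000
for _ in range(B):
    b = rng.normal(0.0, np.sqrt(psi), q)                     # sigma^2 = 1
    y = X @ beta + Z @ b + rng.normal(size=n)
    bh = np.linalg.solve(X.T @ Sinv @ X, X.T @ Sinv @ y)     # beta-hat
    s2 = (y - X @ bh) @ Sinv @ (y - X @ bh) / n              # sigma-hat^2
    bp = psi * Z.T @ Sinv @ (y - X @ bh)                     # b-hat
    yt = Xt @ beta + Zt @ b + rng.normal(size=m)             # ytilde | b under g
    r = yt - Xt @ bh - Zt @ bp
    cai += (m * np.log(2 * np.pi * s2) + r @ r / s2) / B
print("Monte Carlo cAI under covariate shift:", cai)
```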
As an important applicable example of covariate shift, we focus on small area prediction, which is based on a finite super-population model. We consider the situation in which we are interested in the finite subpopulation (area) mean of some characteristic and in which some values in each subpopulation are observed through some sampling procedure. When the sample size in each area is small, the problem is called small area estimation. For details about small area estimation, see Rao and Molina (2015), Datta and Ghosh (2012), Pfeffermann (2013), and others. The model-based approach in small area estimation often assumes that the finite population follows a super-population model with random effects and borrows strength from other areas to estimate (predict) the small area (finite subpopulation) mean. The well-known unit-level model is the nested error regression model (NERM), a kind of linear mixed model introduced by Battese et al. (1988). The NERM can be used when the values of the auxiliary variables for the units with the characteristic of interest (the response variables in the model) are observed through survey sampling. This is the observed model in the framework of our variable selection procedure. On the other hand, two types of predictive model can be considered. One is the unit-level model, which can be used when the values of the auxiliary variables are known for all units. The other is the area-level model, which can be used when the mean of the auxiliary variables is known for each small area. The latter is often the case in official statistics, and the model introduced by Fay and Herriot (1979) is often used in this case.

The rest of this paper is organized as follows. In Section 2, we explain the setup of the variable selection problem. In Section 3, we define the cAI under covariate shift in linear mixed models and obtain an asymptotically unbiased estimator of the cAI. In Section 4, we provide an example of covariate shift, focusing on small area prediction. In Section 5, we investigate the numerical performance of the proposed criteria by simulations, one of which is a design-based simulation using a real dataset of land prices. All the proofs are given in the Appendix.

2 Setup of variable selection problem

2.1 Class of candidate models

We focus on the variable selection of the fixed effects. First, we consider the collection of candidate models as follows. Let the $n \times p_\omega$ matrix $X(\omega)$ consist of all the explanatory variables and assume that $\mathrm{rank}(X(\omega)) = p_\omega$. In order to define candidate models by an index set $j$, suppose that $j$ denotes a subset of $\omega = \{1, \ldots, p_\omega\}$ containing $p_j$ elements, i.e., $p_j = \#(j)$, and that $X(j)$ consists of the $p_j$ columns of $X(\omega)$ indexed by the elements of $j$. We define the class of candidate models by $\mathcal{J} = \mathcal{P}(\omega)$, namely, the power set of $\omega$, and we call $\omega$ the full model. We assume that the true model exists in the class of candidate models $\mathcal{J}$ and denote it by $j_*$. It is noteworthy that the dimension of the true model is $p_{j_*}$, which we abbreviate to $p_*$.

We next introduce the terms "overspecified" and "underspecified" models. Candidate model $j$ is overspecified if $X(\omega)\beta_* \in \mathcal{R}[X(j)]$, which means that $X(\omega)\beta_*$ is in the column space of $X(j)$, following Fujikoshi and Satoh (1997) and Kawakubo and Kubokawa (2014). The set of overspecified models is denoted by $\mathcal{J}_+ = \{j \in \mathcal{J} \mid j_* \subseteq j\}$. On the other hand, candidate model $j$ is underspecified when $X(\omega)\beta_* \notin \mathcal{R}[X(j)]$. The set of underspecified models is denoted by $\mathcal{J}_- = \mathcal{J} \setminus \mathcal{J}_+$. It is important to note that most Akaike-type information criteria are derived under the assumption that the candidate model is overspecified. However, this assumption is not appropriate when considering covariate shift, as explained in Section 3.2. Thus, we derive the criterion without the overspecified assumption. In the following two subsections, we clarify the observed model and the predictive model used to derive the criteria.

2.2 Observed model

The candidate observed model $j$ is the linear mixed model

$$y = X(j)\beta_j + Z b_j + \varepsilon_j, \tag{2}$$

where $y$ is an $n \times 1$ observation vector of response variables, $X(j)$ and $Z$ are $n \times p_j$ and $n \times q$ matrices of covariates, respectively, $\beta_j$ is a $p_j \times 1$ vector of regression coefficients, $b_j$ is a $q \times 1$ vector of random effects, and $\varepsilon_j$ is an $n \times 1$ vector of random errors. Let $b_j$ and $\varepsilon_j$ be mutually independent with $b_j \sim \mathcal{N}_q(0, \sigma_j^2 G)$ and $\varepsilon_j \sim \mathcal{N}_n(0, \sigma_j^2 R)$, where $G$ and $R$ are $q \times q$ and $n \times n$ positive definite matrices and $\sigma_j^2$ is a scalar. We assume that $G$ and $R$ are known and $\sigma_j^2$ is unknown.

The true observed model $j_*$ is

$$y = X(\omega)\beta_* + Z b_* + \varepsilon_*,$$

where $b_* \sim \mathcal{N}_q(0, \sigma_*^2 G)$, $\varepsilon_* \sim \mathcal{N}_n(0, \sigma_*^2 R)$, and $\beta_*$ is a $p_\omega \times 1$ vector of regression coefficients, $p_\omega - p_*$ of whose components are exactly 0 while the rest are not 0. The fact that we can write the true model in this form implies that the true model exists in the class of candidate models $\mathcal{J}$; recall that $X(\omega)$ is the $n \times p_\omega$ matrix of covariates for the full model $\omega$. Then, the marginal distribution of $y$ is

$$y \sim \mathcal{N}_n(X(\omega)\beta_*,\, \sigma_*^2 \Sigma),$$

where $\Sigma = Z G Z^\top + R$. For the true model, the conditional density function of $y$ given $b_*$ and the density function of $b_*$ are denoted by $f(y \mid b_*, \beta_*, \sigma_*^2)$ and $\pi(b_* \mid \sigma_*^2)$, respectively.
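To illustrate the classification of Section 2.1, the following sketch enumerates all candidate subsets of a small full model and tests whether $X(\omega)\beta_*$ lies in the column space of $X(j)$; the design matrix and true coefficients are hypothetical placeholders.

```python
import numpy as np
from itertools import combinations

# Classify candidate models: j is overspecified iff X(w) beta_* is in the
# column space of X(j), i.e. projecting X(w) beta_* onto R[X(j)] leaves no
# residual. Design and true coefficients are hypothetical.
rng = np.random.default_rng(2)
n, p_w = 30, 3
Xw = np.column_stack([np.ones(n), rng.normal(size=(n, p_w - 1))])
beta_s = np.array([1.0, 2.0, 0.0])            # nonzero at columns {0, 1}
mu = Xw @ beta_s                              # X(w) beta_*

for size in range(p_w + 1):
    for j in combinations(range(p_w), size):
        if j:
            Xj = Xw[:, list(j)]
            res = mu - Xj @ np.linalg.lstsq(Xj, mu, rcond=None)[0]
        else:
            res = mu                          # empty model
        label = "overspecified" if np.allclose(res, 0) else "underspecified"
        print(set(j) if j else "{}", label)
```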
2.3 Predictive model

The candidate predictive model $j$ is the linear mixed model that has the same regression coefficients $\beta_j$ and random effects $b_j$ as the candidate observed model $j$, but different covariates, namely,

$$\tilde{y} = \tilde{X}(j)\beta_j + \tilde{Z} b_j + \tilde{\varepsilon}_j, \tag{3}$$

where $\tilde{y}$ is an $m \times 1$ random vector of the target of prediction, $\tilde{X}(j)$ and $\tilde{Z}$ are $m \times p_j$ and $m \times q$ matrices of covariates whose columns correspond to those of $X(j)$ and $Z$, respectively, and $\tilde{\varepsilon}_j$ is an $m \times 1$ vector of random errors, which is independent of $b_j$ and $\varepsilon_j$ and is distributed as $\tilde{\varepsilon}_j \sim \mathcal{N}_m(0, \sigma_j^2 \tilde{R})$, where $\tilde{R}$ is a known $m \times m$ positive definite matrix. We assume that we know the values of $\tilde{X}(j)$ and $\tilde{Z}$ in the predictive model and that they are not necessarily the same as those of $X(j)$ and $Z$ in the observed model. We call this situation covariate shift. The conditional density function of $\tilde{y}$ given $b_j$ for the model $j$ is denoted by $g_j(\tilde{y} \mid b_j, \beta_j, \sigma_j^2)$.

The true predictive model $j_*$ is

$$\tilde{y} = \tilde{X}(\omega)\beta_* + \tilde{Z} b_* + \tilde{\varepsilon}_*,$$

where $\tilde{X}(\omega)$ is an $m \times p_\omega$ matrix of covariates and $\tilde{\varepsilon}_* \sim \mathcal{N}_m(0, \sigma_*^2 \tilde{R})$. Then, the marginal distribution of $\tilde{y}$ is

$$\tilde{y} \sim \mathcal{N}_m(\tilde{X}(\omega)\beta_*,\, \sigma_*^2 \tilde{\Sigma}),$$

where $\tilde{\Sigma} = \tilde{Z} G \tilde{Z}^\top + \tilde{R}$. For the true model, the conditional density function of $\tilde{y}$ given $b_*$ is denoted by $g(\tilde{y} \mid b_*, \beta_*, \sigma_*^2)$.

3 Proposed criteria

3.1 Conditional Akaike information under covariate shift

The cAI introduced by Vaida and Blanchard (2005) measures the prediction risk of the plug-in predictive density $g_j(\tilde{y} \mid \hat{b}_j, \hat{\beta}_j, \hat{\sigma}_j^2)$, where $\hat{\beta}_j$ and $\hat{\sigma}_j^2$ are the maximum likelihood estimators of $\beta_j$ and $\sigma_j^2$, respectively, given by

$$\hat{\beta}_j = (X(j)^\top \Sigma^{-1} X(j))^{-1} X(j)^\top \Sigma^{-1} y, \qquad \hat{\sigma}_j^2 = (y - X(j)\hat{\beta}_j)^\top \Sigma^{-1} (y - X(j)\hat{\beta}_j)/n,$$

and $\hat{b}_j$ is the empirical Bayes estimator of $b_j$ under quadratic loss, given by

$$\hat{b}_j = G Z^\top \Sigma^{-1} (y - X(j)\hat{\beta}_j).$$

Then, the cAI under covariate shift is

$$\begin{aligned} \mathrm{cAI} &= E_{(y, b_*)} E_{\tilde{y} \mid b_*}\left[ -2\log\{ g_j(\tilde{y} \mid \hat{b}_j, \hat{\beta}_j, \hat{\sigma}_j^2) \} \right] \\ &= E_{(y, b_*)} E_{\tilde{y} \mid b_*}\left[ m \log(2\pi\hat{\sigma}_j^2) + \log|\tilde{R}| + (\tilde{y} - \tilde{X}(j)\hat{\beta}_j - \tilde{Z}\hat{b}_j)^\top \tilde{R}^{-1} (\tilde{y} - \tilde{X}(j)\hat{\beta}_j - \tilde{Z}\hat{b}_j)/\hat{\sigma}_j^2 \right], \end{aligned}$$

where $E_{(y, b_*)}$ and $E_{\tilde{y} \mid b_*}$ denote expectation with respect to the joint distribution of $(y, b_*) \sim f(y \mid b_*, \beta_*, \sigma_*^2)\pi(b_* \mid \sigma_*^2)$ and the conditional distribution of $\tilde{y}$ given $b_*$, namely $\tilde{y} \mid b_* \sim g(\tilde{y} \mid b_*, \beta_*, \sigma_*^2)$, respectively. Taking expectation with respect to $\tilde{y} \mid b_* \sim g(\tilde{y} \mid b_*, \beta_*, \sigma_*^2)$ and $b_* \mid y \sim \mathcal{N}_q(\tilde{b}_*, \sigma_*^2 (G - G Z^\top \Sigma^{-1} Z G))$ for $\tilde{b}_* = G Z^\top \Sigma^{-1} (y - X(\omega)\beta_*)$, we obtain

$$\mathrm{cAI} = E\left[ m \log(2\pi\hat{\sigma}_j^2) + \log|\tilde{R}| + \mathrm{tr}(\tilde{R}^{-1}\Lambda)\cdot\sigma_*^2/\hat{\sigma}_j^2 + a^\top \tilde{R}^{-1} a / \hat{\sigma}_j^2 \right], \tag{4}$$

where

$$\begin{aligned} \Lambda &= \tilde{\Sigma} - \tilde{Z} G Z^\top \Sigma^{-1} Z G \tilde{Z}^\top, \\ a &= (\tilde{X}(j)\hat{\beta}_j - \tilde{X}(\omega)\beta_*) - \tilde{Z} G Z^\top \Sigma^{-1} (X(j)\hat{\beta}_j - X(\omega)\beta_*). \end{aligned} \tag{5}$$
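The quantities in (4) and (5) can be transcribed directly. The sketch below computes $\hat{\beta}_j$, $\hat{\sigma}_j^2$, $\hat{b}_j$, $\Lambda$, and $a$ for a single candidate model, assuming known $G$, $R$, and $\tilde{R}$; the data-generating values are illustrative, and $a$ is computable here only because the true $\beta_*$ is known in a simulation.

```python
import numpy as np

# Sketch of the ingredients of (4)-(5) for one candidate model j under
# covariate shift; all data and dimensions are illustrative placeholders.
rng = np.random.default_rng(1)
n, m, q = 40, 20, 4
Z  = np.kron(np.eye(q), np.ones((n // q, 1)))    # observed random-effect design
Zt = np.kron(np.eye(q), np.ones((m // q, 1)))    # Z-tilde in the predictive model
G, R, Rt = 0.8 * np.eye(q), np.eye(n), np.eye(m)
Sigma = Z @ G @ Z.T + R
Sinv  = np.linalg.inv(Sigma)
X  = np.column_stack([np.ones(n), rng.normal(size=n)])   # X(j), observed
Xt = np.column_stack([np.ones(m), rng.normal(size=m)])   # X-tilde(j), shifted
beta_s = np.array([1.0, 2.0])                            # true beta_*
y = X @ beta_s + Z @ rng.multivariate_normal(np.zeros(q), G) \
    + rng.normal(size=n)

# ML estimators under model j and the empirical Bayes predictor:
beta_j = np.linalg.solve(X.T @ Sinv @ X, X.T @ Sinv @ y)  # beta-hat_j
res    = y - X @ beta_j
sig2_j = res @ Sinv @ res / n                             # sigma-hat_j^2
b_j    = G @ Z.T @ Sinv @ res                             # b-hat_j

# Lambda and a from (5):
Sigt = Zt @ G @ Zt.T + Rt                                 # Sigma-tilde
Lam  = Sigt - Zt @ G @ Z.T @ Sinv @ Z @ G @ Zt.T
a = (Xt @ beta_j - Xt @ beta_s) \
    - Zt @ G @ Z.T @ Sinv @ (X @ beta_j - X @ beta_s)
print(np.trace(np.linalg.solve(Rt, Lam)), a @ np.linalg.solve(Rt, a))
```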
3.2 Drawback of the overspecified model assumption

Most Akaike-type information criteria are derived under the assumption that the candidate model includes the true model, namely, the overspecified assumption. Although the assumption may seem too strong, its influence is limited in practice, because the likelihood part of the criterion is a naive estimator of the risk function, namely, of the cAI in the context of the cAIC. Under covariate shift, however, we cannot construct the likelihood part as a good estimator of the cAI. That is, the drawback of the overspecified assumption is more serious under covariate shift than in the usual situation. In Section 5.1, we show that an unbiased estimator of the cAI under the overspecified assumption, $\widehat{\mathrm{cAI}}_u$ in (22), has large bias for estimating the cAI of underspecified models. Thus, in the following subsection we evaluate and estimate the cAI directly, both for overspecified and for underspecified models, which is essential for variable selection under covariate shift.

3.3 Evaluation and estimation of cAI

We evaluate the cAI in (4) both for the overspecified case and for the underspecified case. We assume that the full model $\omega$ is overspecified, that is, the collection of overspecified models $\mathcal{J}_+$ is not an empty set. In addition, we assume that the size $m$ of the response vector in the predictive model is of order $O(n)$.

When the candidate model $j$ is overspecified, $n\hat{\sigma}_j^2/\sigma_*^2$ follows a chi-squared distribution, so we can evaluate the expectation in (4) exactly. For an underspecified model, this is not true; in that case, we approximate the cAI asymptotically, as in the following theorem.

Theorem 1 For the overspecified case, it follows that $\mathrm{cAI} = E[m\log(2\pi\hat{\sigma}_j^2)] + \log|\tilde{R}| + R^*$, where

$$R^* = \frac{n\gamma}{n - p_j - 2},$$

for $\gamma = \mathrm{tr}(\tilde{R}^{-1}\Lambda) + \mathrm{tr}[\tilde{R}^{-1}\tilde{A}(X(j)^\top \Sigma^{-1} X(j))^{-1}\tilde{A}^\top]$ and $\tilde{A} = \tilde{X}(j) - \tilde{Z} G Z^\top \Sigma^{-1} X(j)$. For the underspecified case, the cAI is approximated as

$$\mathrm{cAI} = E[m\log(2\pi\hat{\sigma}_j^2)] + \log|\tilde{R}| + R^* + R_1 + R_2 + R_3 + R_4 + O(n^{-1}), \tag{6}$$

where

$$\begin{aligned} R_1 &= \gamma(\lambda - 1), \\ R_2 &= \gamma \cdot n^{-1}\{-2\lambda^3 + (p_j + 4)\lambda^2 - (p_j + 2)\}, \\ R_3 &= \lambda \cdot \beta_*^\top B^\top \tilde{R}^{-1} B \beta_* / \sigma_*^2, \\ R_4 &= n^{-1}\{-2\lambda^3 + (p_j + 4)\lambda^2\} \times \beta_*^\top B^\top \tilde{R}^{-1} B \beta_* / \sigma_*^2, \end{aligned}$$

for $\lambda = 1/(1 + \delta)$,

$$\begin{aligned} \delta &= \beta_*^\top X(\omega)^\top (P_\omega - P_j) X(\omega) \beta_* / (n\sigma_*^2), \\ B &= \tilde{P}_j X(\omega) - \tilde{X}(\omega) + \tilde{Z} G Z^\top (P_\omega - P_j) X(\omega), \\ P_j &= \Sigma^{-1} X(j) (X(j)^\top \Sigma^{-1} X(j))^{-1} X(j)^\top \Sigma^{-1}, \\ P_\omega &= \Sigma^{-1} X(\omega) (X(\omega)^\top \Sigma^{-1} X(\omega))^{-1} X(\omega)^\top \Sigma^{-1}, \end{aligned}$$

and $\tilde{P}_j = \tilde{X}(j) (X(j)^\top \Sigma^{-1} X(j))^{-1} X(j)^\top \Sigma^{-1}$. When the candidate model $j$ is overspecified, $R_1$, $R_2$, $R_3$, and $R_4$ are exactly 0.

Because the approximation of the cAI in (6) includes unknown parameters, we have to provide an estimator of the cAI for practical use. First, we obtain estimators of $R_1$ and $R_2$, which are polynomials in $\lambda$. We define $\hat{\lambda}$, $\widehat{\lambda^2}$, and $\widehat{\lambda^3}$ as

$$\hat{\lambda} = \frac{n - p_j}{n - p_\omega}\,\frac{\hat{\sigma}_\omega^2}{\hat{\sigma}_j^2}, \qquad \widehat{\lambda^2} = \frac{(n - p_j)(n - p_j + 2)}{(n - p_\omega)(n - p_\omega + 2)}\left(\frac{\hat{\sigma}_\omega^2}{\hat{\sigma}_j^2}\right)^2,$$

and

$$\widehat{\lambda^3} = \frac{(n - p_j)(n - p_j + 2)(n - p_j + 4)}{(n - p_\omega)(n - p_\omega + 2)(n - p_\omega + 4)}\left(\frac{\hat{\sigma}_\omega^2}{\hat{\sigma}_j^2}\right)^3.$$

When $j \in \mathcal{J}_+$, it follows that

$$\frac{\hat{\sigma}_\omega^2}{\hat{\sigma}_j^2} \sim \mathrm{Be}\left(\frac{n - p_\omega}{2},\, \frac{p_\omega - p_j}{2}\right),$$

where $\mathrm{Be}(\cdot, \cdot)$ denotes the beta distribution. This implies that $E(\hat{\lambda}) = E(\widehat{\lambda^2}) = E(\widehat{\lambda^3}) = 1$ for the overspecified case. For the underspecified case, on the other hand, it follows that $E[(\hat{\sigma}_\omega^2/\hat{\sigma}_j^2)^k] = \lambda^k + O(n^{-1})$ as $n \to \infty$ for $k = 1, 2, 3$. Then, we obtain an estimator of $R_2$ in the approximation of the cAI given by (6) as follows:

$$\widehat{R}_2 = \gamma \cdot \frac{-2\widehat{\lambda^3} + (p_j + 4)\widehat{\lambda^2} - (p_j + 2)}{n}. \tag{7}$$

Because $R_1$ is of order $O(n)$, we have to estimate $\lambda$ with higher-order accuracy in order to obtain an estimator of $R_1$ whose bias is of order $O(n^{-1})$ for the underspecified case. To this end, we expand $E(\hat{\lambda})$ up to $O(n^{-1})$ as

$$E(\hat{\lambda}) = \frac{n - p_j}{n - p_\omega}\cdot E\left(\frac{K_0}{K_0 + K_1}\right) = \lambda + \frac{-2\lambda^3 + (p_j + 2)\lambda^2 - p_j\lambda}{n} + O(n^{-2}).$$

Then, we obtain an estimator of $R_1$ given as

$$\widehat{R}_1 = \gamma \cdot \left\{ \hat{\lambda} - \frac{-2\widehat{\lambda^3} + (p_j + 2)\widehat{\lambda^2} - p_j\hat{\lambda}}{n} - 1 \right\}. \tag{8}$$
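A short sketch of $R^*$, $\widehat{R}_1$ in (8), and $\widehat{R}_2$ in (7) as functions of $\hat{\sigma}_j^2$, $\hat{\sigma}_\omega^2$, $p_j$, $p_\omega$, $n$, and $\gamma$; here $\gamma$ would be computed from $\Lambda$ and $\tilde{A}$ as in Theorem 1, and the numbers in the example call are illustrative placeholders.

```python
import numpy as np  # imported for consistency with the other sketches

# lambda-hat and its second and third "power" estimators, as defined above.
def lam_powers(sig2_j, sig2_w, p_j, p_w, n):
    r = sig2_w / sig2_j
    l1 = (n - p_j) / (n - p_w) * r
    l2 = (n - p_j) * (n - p_j + 2) / ((n - p_w) * (n - p_w + 2)) * r**2
    l3 = (n - p_j) * (n - p_j + 2) * (n - p_j + 4) \
         / ((n - p_w) * (n - p_w + 2) * (n - p_w + 4)) * r**3
    return l1, l2, l3

# R* from Theorem 1 and the estimators R1-hat (8) and R2-hat (7).
def R1_R2_Rstar(gamma, sig2_j, sig2_w, p_j, p_w, n):
    l1, l2, l3 = lam_powers(sig2_j, sig2_w, p_j, p_w, n)
    Rstar = n * gamma / (n - p_j - 2)
    R2 = gamma * (-2 * l3 + (p_j + 4) * l2 - (p_j + 2)) / n
    R1 = gamma * (l1 - (-2 * l3 + (p_j + 2) * l2 - p_j * l1) / n - 1)
    return R1, R2, Rstar

# Illustrative values only:
print(R1_R2_Rstar(gamma=18.0, sig2_j=1.3, sig2_w=1.0, p_j=2, p_w=4, n=40))
```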
We now have the following lemma, which can be proved using Appendix C and Appendix D of Kawakubo and Kubokawa (2014).

Lemma 1 When the candidate model $j$ is underspecified, $\widehat{R}_1$ in (8) and $\widehat{R}_2$ in (7) are asymptotically unbiased estimators of $R_1$ and $R_2$, respectively, whose bias is of order $O(n^{-1})$, that is,

$$E(\widehat{R}_1) = R_1 + O(n^{-1}) \quad \text{and} \quad E(\widehat{R}_2) = R_2 + O(n^{-1}).$$

When the candidate model $j$ is overspecified, it follows that $E(\widehat{R}_1) = E(\widehat{R}_2) = 0$.

We next consider estimation procedures for $R_3$ and $R_4$, which are complex functions of unknown parameters. We regard $R_3$ and $R_4$ as functions of $\eta_* = (\beta_*^\top, \sigma_*^2)^\top$, that is, $R_3 = R_3(\eta_*)$ and $R_4 = R_4(\eta_*)$, and substitute the unbiased estimator $\tilde{\eta} = (\tilde{\beta}^\top, \tilde{\sigma}^2)^\top$, whose components are given by

$$\tilde{\beta} = \hat{\beta}_\omega = (X(\omega)^\top \Sigma^{-1} X(\omega))^{-1} X(\omega)^\top \Sigma^{-1} y, \qquad \tilde{\sigma}^2 = (y - X(\omega)\tilde{\beta})^\top \Sigma^{-1} (y - X(\omega)\tilde{\beta})/(n - p_\omega).$$

Then, the plug-in estimators of $R_3$ and $R_4$ are

$$\begin{aligned} R_3(\tilde{\eta}) &= \tilde{\lambda}\cdot\tilde{\beta}^\top B^\top \tilde{R}^{-1} B \tilde{\beta} / \tilde{\sigma}^2, \\ \widetilde{R}_4 = R_4(\tilde{\eta}) &= n^{-1}\{-2\tilde{\lambda}^3 + (p_j + 4)\tilde{\lambda}^2\} \times \tilde{\beta}^\top B^\top \tilde{R}^{-1} B \tilde{\beta} / \tilde{\sigma}^2, \end{aligned} \tag{9}$$

where $\tilde{\lambda} = 1/(1 + \tilde{\delta})$ for

$$\tilde{\delta} = \tilde{\beta}^\top X(\omega)^\top (P_\omega - P_j) X(\omega) \tilde{\beta} / (n\tilde{\sigma}^2).$$

Because $R_3$ is of order $O(n)$, the plug-in estimator $R_3(\tilde{\eta})$ has second-order bias. We therefore correct the bias by an analytical method based on Taylor series expansions. The expectation of the plug-in estimator $R_3(\tilde{\eta})$ is expanded as

$$E[R_3(\tilde{\eta})] = R_3(\eta_*) + B_1(\eta_*) + B_2(\eta_*) + O(n^{-2}), \tag{10}$$

where $B_1(\eta_*)$ is the second-order bias and $B_2(\eta_*)$ is the third-order bias of $R_3(\tilde{\eta})$, that is, $B_1(\eta_*) = O(1)$ and $B_2(\eta_*) = O(n^{-1})$, respectively. Because $\tilde{\beta}$ and $\tilde{\sigma}^2$ are independent, it follows that

$$B_1(\eta_*) = \frac{1}{2}\,\mathrm{tr}\left[\frac{\partial^2 R_3(\eta_*)}{\partial \beta_* \partial \beta_*^\top}\, E\{(\tilde{\beta} - \beta_*)(\tilde{\beta} - \beta_*)^\top\}\right] + \frac{1}{2}\,\frac{\partial^2 R_3(\eta_*)}{(\partial \sigma_*^2)^2}\, E\{(\tilde{\sigma}^2 - \sigma_*^2)^2\}, \tag{11}$$

where $E[(\tilde{\beta} - \beta_*)(\tilde{\beta} - \beta_*)^\top] = \sigma_*^2 (X(\omega)^\top \Sigma^{-1} X(\omega))^{-1}$ and $E[(\tilde{\sigma}^2 - \sigma_*^2)^2] = 2(\sigma_*^2)^2/(n - p_\omega)$. The second-order partial derivatives of $R_3$ are given by the following lemma.

Lemma 2 The second-order partial derivative of $R_3(\eta_*)$ with respect to $\beta_*$ is

$$\begin{aligned} \frac{\partial^2 R_3(\eta_*)}{\partial \beta_* \partial \beta_*^\top} = {}& \frac{\beta_*^\top B^\top \tilde{R}^{-1} B \beta_*}{\sigma_*^2} \times \left\{ -\frac{2C}{n\sigma_*^2(1 + \delta)^2} + \frac{8\, C \beta_* \beta_*^\top C}{n^2(\sigma_*^2)^2(1 + \delta)^3} \right\} \\ & - \frac{4\, B^\top \tilde{R}^{-1} B \beta_* \beta_*^\top C + 4\, C \beta_* \beta_*^\top B^\top \tilde{R}^{-1} B}{n(\sigma_*^2)^2(1 + \delta)^2} + 2\lambda\cdot B^\top \tilde{R}^{-1} B / \sigma_*^2, \end{aligned}$$

where $C = X(\omega)^\top (P_\omega - P_j) X(\omega)$. The second-order partial derivative of $R_3(\eta_*)$ with respect to $\sigma_*^2$ is

$$\begin{aligned} \frac{\partial^2 R_3(\eta_*)}{(\partial \sigma_*^2)^2} = {}& \frac{\beta_*^\top B^\top \tilde{R}^{-1} B \beta_*}{\sigma_*^2} \times \left\{ -\frac{2\,\beta_*^\top C \beta_*}{n(\sigma_*^2)^3(1 + \delta)^2} + \frac{2\,(\beta_*^\top C \beta_*)^2}{n^2(\sigma_*^2)^4(1 + \delta)^3} \right\} \\ & - \frac{2\,\beta_*^\top B^\top \tilde{R}^{-1} B \beta_* \cdot \beta_*^\top C \beta_*}{n(\sigma_*^2)^4(1 + \delta)^2} + 2\lambda\cdot\beta_*^\top B^\top \tilde{R}^{-1} B \beta_* / (\sigma_*^2)^3. \end{aligned}$$

When the candidate model $j$ is overspecified, the second-order bias $B_1(\eta_*)$ simplifies to

$$B_1(\eta_*) = \mathrm{tr}[B^\top \tilde{R}^{-1} B\, (X(\omega)^\top \Sigma^{-1} X(\omega))^{-1}],$$

because $C\beta_* = B\beta_* = 0$ and $\lambda = 1$, which implies that $\partial^2 R_3(\eta_*)/(\partial \beta_* \partial \beta_*^\top) = 2 B^\top \tilde{R}^{-1} B / \sigma_*^2$ and that $\partial^2 R_3(\eta_*)/(\partial \sigma_*^2)^2 = 0$. However, we cannot know which candidate models are overspecified. We therefore propose the following bias-corrected estimator of $R_3$:

$$\widetilde{R}_3 = R_3(\tilde{\eta}) - B_1(\tilde{\eta}). \tag{12}$$

Lemma 3 Both when the candidate model $j$ is overspecified and when it is underspecified, $\widetilde{R}_3$ and $\widetilde{R}_4$ in (12) and (9) are asymptotically unbiased estimators of $R_3$ and $R_4$, whose bias is of order $O(n^{-1})$, that is,

$$E(\widetilde{R}_3) = R_3 + O(n^{-1}) \quad \text{and} \quad E(\widetilde{R}_4) = R_4 + O(n^{-1}).$$

Using $\widehat{R}_1$, $\widehat{R}_2$, $\widetilde{R}_3$, and $\widetilde{R}_4$ given by (8), (7), (12), and (9), respectively, we construct an estimator of the cAI as follows:

$$\widehat{\mathrm{cAI}} = m\log(2\pi\hat{\sigma}_j^2) + \log|\tilde{R}| + R^* + \widehat{R}_1 + \widehat{R}_2 + \widetilde{R}_3 + \widetilde{R}_4. \tag{13}$$

Then, we obtain the following theorem, which follows from Theorem 1 and Lemmas 1–3.

Theorem 2 Both when the candidate model $j$ is overspecified and when it is underspecified, $\widehat{\mathrm{cAI}}$ in (13) is a second-order asymptotically unbiased estimator of the cAI, that is,

$$E(\widehat{\mathrm{cAI}}) = \mathrm{cAI} + O(n^{-1}).$$
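To make the assembly of (13) concrete, the following sketch transcribes the plug-in estimators in (9), the second-order bias $B_1(\tilde{\eta})$ from (11) with the derivatives of Lemma 2, and the bias-corrected $\widetilde{R}_3$ of (12). The inputs are assumed to be precomputed according to the definitions above; this is an illustrative transcription, not the authors' implementation.

```python
import numpy as np

def plugin_R3_R4(beta_t, s2_t, B, C, Rt_inv, XwSXw_inv, n, p_j, p_w):
    """Plug-in estimators (9) and the bias-corrected R3-tilde of (12).

    beta_t, s2_t : beta-tilde and sigma-tilde^2 (full-model GLS estimators)
    B, C         : matrices B and C = X(w)'(P_w - P_j)X(w) from Theorem 1 / Lemma 2
    Rt_inv       : inverse of R-tilde
    XwSXw_inv    : (X(w)' Sigma^{-1} X(w))^{-1}
    """
    M = B.T @ Rt_inv @ B                        # B' R~^{-1} B
    delt = beta_t @ C @ beta_t / (n * s2_t)     # delta-tilde
    lam = 1.0 / (1.0 + delt)                    # lambda-tilde
    quad = beta_t @ M @ beta_t / s2_t
    R3_plug = lam * quad                        # first line of (9)
    R4 = (-2.0 * lam**3 + (p_j + 4) * lam**2) * quad / n   # R4-tilde in (9)
    # Second-order bias B1 of (11), using Lemma 2's derivatives at eta-tilde:
    Cb, Mb = C @ beta_t, M @ beta_t
    d2b = quad * (-2.0 * C / (n * s2_t * (1 + delt) ** 2)
                  + 8.0 * np.outer(Cb, Cb) / (n**2 * s2_t**2 * (1 + delt) ** 3)) \
          - 4.0 * (np.outer(Mb, Cb) + np.outer(Cb, Mb)) \
            / (n * s2_t**2 * (1 + delt) ** 2) \
          + 2.0 * lam * M / s2_t
    bCb, bMb = beta_t @ Cb, beta_t @ Mb
    d2s = quad * (-2.0 * bCb / (n * s2_t**3 * (1 + delt) ** 2)
                  + 2.0 * bCb**2 / (n**2 * s2_t**4 * (1 + delt) ** 3)) \
          - 2.0 * bMb * bCb / (n * s2_t**4 * (1 + delt) ** 2) \
          + 2.0 * lam * bMb / s2_t**3
    B1 = 0.5 * np.trace(d2b @ (s2_t * XwSXw_inv)) \
         + 0.5 * d2s * 2.0 * s2_t**2 / (n - p_w)
    R3 = R3_plug - B1                           # R3-tilde in (12)
    return R3, R4
```

The criterion (13) is then assembled as $m\log(2\pi\hat{\sigma}_j^2) + \log|\tilde{R}| + R^* + \widehat{R}_1 + \widehat{R}_2 + \widetilde{R}_3 + \widetilde{R}_4$, with $\widehat{R}_1$ and $\widehat{R}_2$ computed as in the sketch after (8).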
When the sample size $n$ is small, second-order accuracy may not be sufficient for the overspecified case. Indeed, as the simulation study shows, the estimate of the cAI of the true model has relatively large bias, even though estimation for the true model is important. Moreover, some other information criteria, including the cAIC of Vaida and Blanchard (2005), are exact unbiased estimators of the information for an overspecified candidate model. Thus, we should improve the estimators of $R_3$ and $R_4$ to remove the bias of order $O(n^{-1})$. To this end, we adopt a parametric bootstrap method. The bootstrap sample $y^\dagger$ is generated by

$$y^\dagger = X(\omega)\tilde{\beta} + Z b^\dagger + \varepsilon^\dagger,$$

where $b^\dagger$ and $\varepsilon^\dagger$ are generated from the distributions

$$b^\dagger \sim \mathcal{N}(0, \tilde{\sigma}^2 G) \quad \text{and} \quad \varepsilon^\dagger \sim \mathcal{N}(0, \tilde{\sigma}^2 I_n).$$

Then, we use the following estimator of $R_4$:

$$\widehat{R}_4 = 2 R_4(\tilde{\eta}) - E_{\tilde{\eta}}[R_4(\tilde{\eta}^\dagger)], \tag{14}$$

where $E_{\tilde{\eta}}$ denotes expectation with respect to the bootstrap distribution and $\tilde{\eta}^\dagger = ((\tilde{\beta}^\dagger)^\top, \tilde{\sigma}^{2\dagger})^\top$ for

$$\tilde{\beta}^\dagger = (X(\omega)^\top \Sigma^{-1} X(\omega))^{-1} X(\omega)^\top \Sigma^{-1} y^\dagger, \qquad \tilde{\sigma}^{2\dagger} = (y^\dagger - X(\omega)\tilde{\beta}^\dagger)^\top \Sigma^{-1} (y^\dagger - X(\omega)\tilde{\beta}^\dagger)/(n - p_\omega).$$

As for $R_3$, it follows from (10) that

$$E_{\tilde{\eta}}[R_3(\tilde{\eta}^\dagger)] = R_3(\tilde{\eta}) + B_1(\tilde{\eta}) + B_2(\tilde{\eta}) + O_p(n^{-2}).$$

However, $B_1(\tilde{\eta})$ has a bias of order $O(n^{-1})$, that is,

$$E[B_1(\tilde{\eta})] = B_1(\eta_*) + B_{11}(\eta_*) + O(n^{-2}),$$

where $B_{11}(\eta_*) = O(n^{-1})$. Because this bias is not negligible when we want to estimate $R_3$ with third-order accuracy, we estimate it by the bootstrap method as follows:

$$\widehat{B}_{11} = E_{\tilde{\eta}}[B_1(\tilde{\eta}^\dagger)] - B_1(\tilde{\eta}).$$

Then, we obtain an estimator of $R_3$ given as

$$\widehat{R}_3 = 2 R_3(\tilde{\eta}) - E_{\tilde{\eta}}[R_3(\tilde{\eta}^\dagger)] + \widehat{B}_{11} = 2 R_3(\tilde{\eta}) - E_{\tilde{\eta}}[R_3(\tilde{\eta}^\dagger)] + E_{\tilde{\eta}}[B_1(\tilde{\eta}^\dagger)] - B_1(\tilde{\eta}). \tag{15}$$

Lemma 4 Both when the candidate model $j$ is overspecified and when it is underspecified, $\widehat{R}_3$ and $\widehat{R}_4$ in (15) and (14) are asymptotically unbiased estimators of $R_3$ and $R_4$, whose bias is of order $O(n^{-2})$, that is,

$$E(\widehat{R}_3) = R_3 + O(n^{-2}) \quad \text{and} \quad E(\widehat{R}_4) = R_4 + O(n^{-2}).$$

Using $\widehat{R}_1$, $\widehat{R}_2$, $\widehat{R}_3$, and $\widehat{R}_4$ given by (8), (7), (15), and (14), we obtain an estimator of the cAI as follows:

$$\widehat{\mathrm{cAI}}^\dagger = m\log(2\pi\hat{\sigma}_j^2) + \log|\tilde{R}| + R^* + \widehat{R}_1 + \widehat{R}_2 + \widehat{R}_3 + \widehat{R}_4, \tag{16}$$

which improves on $\widehat{\mathrm{cAI}}$ in terms of unbiasedness. Then, we obtain the following theorem, which is proved by Theorem 2 and Lemma 4.

Theorem 3 When the candidate model $j$ is overspecified, $\widehat{\mathrm{cAI}}^\dagger$ in (16) is a third-order asymptotically unbiased estimator of the cAI, that is,

$$E(\widehat{\mathrm{cAI}}^\dagger) = \mathrm{cAI} + O(n^{-2}).$$

When the candidate model $j$ is underspecified, $\widehat{\mathrm{cAI}}^\dagger$ is a second-order asymptotically unbiased estimator of the cAI, that is,

$$E(\widehat{\mathrm{cAI}}^\dagger) = \mathrm{cAI} + O(n^{-1}).$$
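A minimal sketch of the bootstrap corrections (14) and (15), assuming helper functions `R3_fn`, `R4_fn`, and `B1_fn` that evaluate $R_3(\eta)$, $R_4(\eta)$, and $B_1(\eta)$ at a given $(\beta, \sigma^2)$ (for instance, as in the previous sketch); the helper names and the bootstrap size are hypothetical.

```python
import numpy as np

def bootstrap_R3_R4(beta_t, s2_t, Xw, Z, G, Sinv, n, p_w,
                    R3_fn, R4_fn, B1_fn, n_boot=200, seed=0):
    """Bootstrap estimators R3-hat (15) and R4-hat (14).

    beta_t, s2_t : full-model GLS estimators beta-tilde, sigma-tilde^2
    R3_fn, R4_fn, B1_fn : hypothetical helpers evaluating R3, R4, B1 at
                          a parameter value (beta, sigma^2)
    """
    rng = np.random.default_rng(seed)
    q = G.shape[0]
    Gchol = np.linalg.cholesky(G)
    XtSX_inv = np.linalg.inv(Xw.T @ Sinv @ Xw)
    r3_b = r4_b = b1_b = 0.0
    for _ in range(n_boot):
        b_d = np.sqrt(s2_t) * (Gchol @ rng.normal(size=q))   # b-dagger
        e_d = np.sqrt(s2_t) * rng.normal(size=n)             # eps-dagger
        y_d = Xw @ beta_t + Z @ b_d + e_d                    # y-dagger
        beta_d = XtSX_inv @ (Xw.T @ Sinv @ y_d)              # beta-tilde-dagger
        res = y_d - Xw @ beta_d
        s2_d = res @ Sinv @ res / (n - p_w)                  # sigma2-tilde-dagger
        r3_b += R3_fn(beta_d, s2_d) / n_boot
        r4_b += R4_fn(beta_d, s2_d) / n_boot
        b1_b += B1_fn(beta_d, s2_d) / n_boot
    R4_hat = 2.0 * R4_fn(beta_t, s2_t) - r4_b                          # (14)
    R3_hat = 2.0 * R3_fn(beta_t, s2_t) - r3_b \
             + b1_b - B1_fn(beta_t, s2_t)                              # (15)
    return R3_hat, R4_hat
```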
We consider the situation in which only $n$ values of the $Y_{ik}$'s are observed through some sampling procedure. We define the number of unobserved variables in the $i$th area by $N_i - n_i = r_i$ and let $n = n_1 + \cdots + n_q$ and $r = r_1 + \cdots + r_q$. Suppose, without loss of generality, that the first $n_i$ elements of $\{Y_{i1}, \ldots, Y_{i,N_i}\}$ are observed, denoted by $y_{i1}, \ldots, y_{i,n_i}$, while $Y_{i,n_i+1}, \ldots, Y_{i,N_i}$ are unobserved. Then, the observed model is

$$y_{ik} = x_{ik}^\top \beta + b_i + \varepsilon_{ik} \quad (i = 1, \ldots, q; \; k = 1, \ldots, n_i), \tag{18}$$

which corresponds to (2) with $y = (y_1^\top, \ldots, y_q^\top)^\top$ for $y_i = (y_{i1}, \ldots, y_{i,n_i})^\top$, $X = (X_1^\top, \ldots, X_q^\top)^\top$ for $X_i = (x_{i1}, \ldots, x_{i,n_i})^\top$, $Z = \mathrm{diag}(Z_1, \ldots, Z_q)$ for $Z_i = \mathbf{1}_{n_i}$, $G = \psi I_q$, and $R = I_n$, where $\mathbf{1}_{n_i}$ denotes an $n_i \times 1$ vector of 1s and $\psi = \tau^2/\sigma^2$. In the derivation of our proposed criteria, we assume that the covariance matrix of $b$ is $\sigma^2 G$ for a known matrix $G$. In the NERM, however, $G$ includes the parameter $\psi$, which is usually unknown and has to be estimated. In this case, we propose that $G$ in the bias correction be replaced with its plug-in estimator $G(\hat{\psi})$. The influence of this replacement may be limited because $\psi$ is a nuisance parameter when we are interested in selecting only the explanatory variables; Kawakubo and Kubokawa (2014) discussed this problem in their Remark 3.1.
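The correspondence between (18) and the general model (2) is mechanical, as the following sketch illustrates; the unit-level data are simulated placeholders, and the area-level predictive design shown at the end (area means of the auxiliary variables, $\tilde{Z} = I_q$) is one simple possibility in the spirit of the Fay–Herriot setting rather than the exact construction of this section.

```python
import numpy as np

# Mapping the NERM observed model (18) onto the linear mixed model (2),
# with a simple area-level predictive design. All data are placeholders.
rng = np.random.default_rng(3)
q, p = 5, 2
n_i = np.array([4, 6, 3, 5, 4])                  # sampled units per area
n = n_i.sum()

X = np.column_stack([np.ones(n), rng.normal(size=n)])       # unit-level x_ik
Z = np.zeros((n, q))                                         # Z_i = 1_{n_i}
ends = n_i.cumsum()
for i, (s, e) in enumerate(zip(np.r_[0, ends[:-1]], ends)):
    Z[s:e, i] = 1.0
psi = 0.7                                                    # tau^2 / sigma^2
G, R = psi * np.eye(q), np.eye(n)                            # G = psi I_q, R = I_n
Sigma = Z @ G @ Z.T + R

# Area-level predictive model: one row per area; here the covariates are the
# area means of the sampled auxiliary variables (as a stand-in for the known
# population means), with Z~ = I_q and R~ = I_q as a simplification.
Xt = np.vstack([X[Z[:, i] == 1].mean(axis=0) for i in range(q)])
Zt, Rt = np.eye(q), np.eye(q)
print(X.shape, Z.shape, Xt.shape, Zt.shape)      # (22, 2) (22, 5) (5, 2) (5, 5)
```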