Longitudinal Latent Variable Models Given Incompletely Observed Biomarkers and Covariates

Chunfeng Ren∗ and Yongyun Shin
Department of Biostatistics, Virginia Commonwealth University, Richmond, Virginia 23219
∗email: [email protected]
Statistics in Medicine, 35(26), 4729-4745, 2016

SUMMARY. In this paper, we analyze a two-level latent variable model for longitudinal data from the National Growth and Health Study where surrogate outcomes or biomarkers and covariates are subject to missingness at any of the levels. A conventional method for efficient handling of missing data is to re-express the desired model as a joint distribution of the variables, including the biomarkers, that are subject to missingness conditional on all of the covariates that are completely observed, estimate the joint model by maximum likelihood, and then transform it to the desired model. The joint model, however, identifies more parameters than desired, in general. We show that the over-identified joint model produces biased estimation of the latent variable model, and describe how to impose constraints on the joint model so that it has a one-to-one correspondence with the desired model for unbiased estimation. The constrained joint model handles missing data efficiently under the assumption of ignorable missing data and is estimated by a modified application of the expectation-maximization (EM) algorithm.

KEYWORDS. Longitudinal data analysis; Multivariate outcomes; Random effects; Missing data; Latent variable; the EM algorithm

1. Introduction

The National Heart, Lung, and Blood Institute initiated the Growth and Health Study (NGHS) in 1985 to investigate ethnic disparities in dietary, family, psychosocial and physical activity factors of obesity among 2,379 girls. It collected data on the development of obesity and factors associated with that development from 1,213 African-American and 1,166 white girls. The study followed the subjects from 1987-1988, when they were 9 to 10 years old, until 1996-1997, when they were 18 to 19 years old. The subjects were assessed annually on the development of obesity and related factors [1]. We consider multiple biomarkers of obesity: body mass index (BMI), sum of skinfolds at triceps, subscapular, and suprailiac sites (Skinfold), maximum below-waist circumference (Waist), and percent body fat by bioelectrical impedance analysis (PercentFat). Many investigators have identified the risk factors of child obesity using one of these biomarkers as an outcome variable [2-6]. Although useful, some of these biomarkers do not differentiate fat mass from body mass, while others are measured with error. For example, BMI, the ratio of body weight in kilograms to height in meters squared, is widely used to define obesity (BMI ≥ 30) for men and women. Consequently, it is a broadly analyzed outcome variable serving as a surrogate for body fat. However, it cannot distinguish muscle mass from body adiposity, in particular for children and adolescents [7-9].

Our analysis aims to quantify child obesity via multiple biomarkers and to study its risk factors simultaneously. Specifically, we want to control for ethnic and social disparities in the growth of obesity, and ask how environmental factors such as TV watching and mother's BMI influence the development of child obesity. Because obesity is not directly observable, NGHS collected the four biomarkers of obesity. We formulate a latent-variable model (LVM) of simultaneous equations where the biomarkers, given the latent obesity, are independent in a measurement model, and the obesity is regressed on covariates in a structural model [10-19].
Given completely observed covariates and biomarkers having ignorable missing data [20], the LVM may be estimated by maximum likelihood (ML) via standard LVM software such as Amos [21], EQS [22], and Mplus [23].

This paper focuses on a longitudinal multilevel model where occasions at level 1 are nested within individuals at level 2 and where missing data are present at both levels under the assumption of ignorable missing data [20, 24]. Roy and Lin [25] estimated, by ML, a longitudinal LVM given nonignorable dropouts and level-1 covariates missing not at random. Das et al. [26] estimated a structural equation model by a Markov chain Monte Carlo method where continuous responses and covariates at level 1 may be missing at random in the measurement model. Both approaches handle level-1 outcomes and covariates subject to missingness.

Recent advances enable efficient handling of missing data in a hierarchical linear model (HLM) by ML [27-30] or by Bayesian approaches [27, 31-33]. Shin and Raudenbush [28] formulated a univariate HLM as a joint normal distribution of the variables, including the outcome, that are subject to missingness conditional on completely observed covariates. The authors estimated the joint model by ML via the EM algorithm [34], and then transformed the estimated joint model to the HLM. They showed that the unconstrained joint model, in general, over-identifies the HLM and that the over-identified HLM leads to biased inferences. Therefore, the authors estimated a constrained joint model to just identify the HLM for unbiased estimation. The method, however, cannot be used for the more complicated LVM of simultaneous equations. In this paper, we extend the method to the LVM where multiple biomarkers and covariates are subject to missingness at any of the levels.

We analyze the LVM given biomarkers and covariates that are subject to missingness with a general missing pattern at any of the levels. A conventional method for efficient handling of the missing data is to re-express the LVM as a joint distribution of the variables, including the biomarkers, that are subject to missingness conditional on all of the covariates that are completely observed, and to estimate the joint model, which is then transformed to the LVM. The unconstrained joint model, however, identifies more parameters than desired in the LVM. Furthermore, the LVM is not nested within the joint model, in general. The consequence is that the over-identified joint model leads to biased estimation of the LVM. This paper explains how to characterize the joint model so that it is a one-to-one transformation of the LVM for unbiased estimation. To yield unbiased estimation of the LVM while handling missing data efficiently, we estimate the constrained joint model according to the LVM within each iteration of the EM algorithm.

The next section introduces the LVM of our interest given incomplete data. Section 3 explains a joint model for efficient handling of missing data and shows how to impose proper constraints on the joint model for unbiased estimation of the LVM. Section 4 describes the EM algorithm for efficient estimation of the constrained joint model. Section 5 simulates an LVM to show that the conventional method produces biased estimation of the LVM and that our approach corrects the bias. Section 6 illustrates unbiased and efficient analysis of the desired LVM given the NGHS data. Section 7 discusses the limitations and future extensions of our method.

2. Latent Variable Model

This section introduces the LVM of interest [15].
The structural model is

$$U_{ik} = A_{ik}^T\alpha + B_{ik}^T b_i + \epsilon_{ik}, \qquad b_i \stackrel{\text{iid}}{\sim} N(0, D), \qquad \epsilon_{ik} \stackrel{\text{iid}}{\sim} N(0, 1), \qquad (1)$$

where $U_{ik}$ is a univariate latent obesity score, $A_{ik}$ is a vector of covariates having fixed effects $\alpha$, $B_{ik}$ is a vector of known covariates having level-2 unit-specific random effects $b_i$ independent of a level-1 unit-specific random error $\epsilon_{ik}$, level-1 unit or occasion $k$ is nested within level-2 unit or subject $i$ for $k = 1, \cdots, k_i$ and $i = 1, \cdots, n$, and $D$ is a positive definite matrix. This model cannot be directly estimated because $U_{ik}$ is unobservable. However, $U_{ik}$ is related to the biomarkers by a measurement model

$$R_{ik} = \gamma_0 + \gamma_1 U_{ik} + a_i + e_{ik}, \qquad (2)$$

where $R_{ik}$ is a vector of $J$ biomarkers, $\gamma_0 = [\gamma_{01}\,\gamma_{02}\cdots\gamma_{0J}]^T$ is a vector of $J$ intercepts, $\gamma_1 = [\gamma_{11}\,\gamma_{12}\cdots\gamma_{1J}]^T$ is a vector of the $J$ effects or factor loadings of $U_{ik}$, and subject-specific random effects $a_i \stackrel{\text{iid}}{\sim} N(0, \oplus_{j=1}^J \xi_j)$ are independent of level-1 random errors $e_{ik} \stackrel{\text{iid}}{\sim} N(0, \oplus_{j=1}^J \tau_j)$, where $\oplus_{\ell=1}^J \psi_\ell = \mathrm{diag}(\psi_1, \psi_2, \cdots, \psi_J)$ denotes a diagonal matrix with diagonal elements or submatrices $(\psi_1, \psi_2, \cdots, \psi_J)$ and all other elements equal to zero. To make the parameters identifiable in model (1), we assume that $\mathrm{var}(\epsilon_{ik}) = 1$ and that $A_{ik}$ does not contain an intercept. Note that the $j$th and $j'$th biomarkers of subject $i$ at occasion $k$ are correlated, with covariance equal to $\gamma_{1j}\gamma_{1j'}\mathrm{var}(U_{ik})$.

Our goal is to identify the obesity factors $A_{ik}$ and $B_{ik}$ and to explain their associations with the obesity $U_{ik}$ by efficient analysis of the LVM, that is, by analyzing all available sample data without dropping any observations. The challenge is to handle missing data in $(R_{ik}, A_{ik})$ efficiently, which is explained in the next section. In this paper, we refer to the associations as the "effects" of the factors, but do not mean causality. Such use of the term "effects" is pervasive in the literature.
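To make the two-level structure concrete, the sketch below simulates data from a random-intercept version of the structural model (1) together with the measurement model (2) for $J = 4$ biomarkers, and checks the stated covariance $\gamma_{1j}\gamma_{1j'}\mathrm{var}(U_{ik})$ between two biomarkers at an occasion. All dimensions and parameter values are illustrative assumptions, not NGHS estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k_i, J = 500, 10, 4                    # subjects, occasions per subject, biomarkers

# Illustrative (assumed) parameter values
alpha = np.array([0.5, 0.3])              # fixed effects of two covariates in A_ik
D = 0.8                                   # variance of the subject random intercept b_i
gamma0 = np.array([0.0, 0.1, -0.1, 0.2])  # biomarker intercepts gamma_0
gamma1 = np.array([1.0, 0.9, 1.1, 0.8])   # factor loadings gamma_1
xi = np.array([0.3, 0.25, 0.35, 0.2])     # subject-level measurement variances xi_j
tau = np.array([0.5, 0.6, 0.4, 0.7])      # occasion-level measurement variances tau_j

U = np.empty((n, k_i))
R = np.empty((n, k_i, J))
for i in range(n):
    A = rng.normal(size=(k_i, 2))         # level-1 covariates (centered, no intercept)
    b_i = rng.normal(scale=np.sqrt(D))    # random intercept, B_ik = 1
    eps = rng.normal(size=k_i)            # structural error, variance 1 for identifiability
    U[i] = A @ alpha + b_i + eps          # structural model (1)
    a_i = rng.normal(scale=np.sqrt(xi))   # biomarker-specific subject effects a_i
    e = rng.normal(scale=np.sqrt(tau), size=(k_i, J))
    R[i] = gamma0 + np.outer(U[i], gamma1) + a_i + e   # measurement model (2)

# Empirical vs implied covariance of biomarkers 1 and 2 at occasion 0:
# gamma_11 * gamma_12 * var(U_ik)
print(np.cov(R[:, 0, 0], R[:, 0, 1])[0, 1], gamma1[0] * gamma1[1] * U[:, 0].var())
```

Because $\mathrm{var}(\epsilon_{ik})$ is fixed at 1 and $A_{ik}$ carries no intercept, the loadings $\gamma_1$ and intercepts $\gamma_0$ absorb the scale and location of the biomarkers, which is the identification device stated above.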
3. Missing Data

To handle missing data in $R_{ik}$ and $A_{ik}$ efficiently, we reparameterize the LVM in terms of a joint distribution of the response variables $R_{ik}$ and all covariates subject to missingness in $A_{ik}$, conditional on all completely observed covariates. Because $A_{ik}$ may have covariates subject to missingness as well as covariates completely observed, we decompose $A_{ik} = [S_{ik}^T\ Y_{2i}^T\ W_{1ik}^T\ W_{2i}^T]^T$, where the $p_1$-vector $S_{ik}$ and the $p_2$-vector $Y_{2i}$ are level-1 and level-2 covariates subject to missingness, respectively, and the $p_3$-vector $W_{1ik}$ and the $p_4$-vector $W_{2i}$ are level-1 and level-2 covariates completely observed, respectively. Then the joint model is a multivariate distribution of level-1 $Y_{1ik} = [R_{ik}^T\ S_{ik}^T]^T$ and level-2 $Y_{2i}$, which are subject to missingness, conditional on $W_{1ik}$, $W_{2i}$ and $B_{ik}$, which are completely observed. In this section, we explain that this joint model over-identifies the LVM, in general. The consequence is biased estimation of the LVM, as will be illustrated in Section 5. For a positive integer $m$, let $I_m$ and $1_m$ denote an $m$-by-$m$ identity matrix and a vector of $m$ unities, respectively.

3.1. Over-identification Problem

If we were able to observe $U_{ik}$, we would directly analyze the structural model (1) without involving the measurement model (2). To analyze all observed data in model (1), we would estimate the multivariate distribution of $(U_{ik}, S_{ik}, Y_{2i})$ given completely observed $(W_{1ik}, W_{2i}, B_{ik})$. In this simple case, we are able not only to reveal the over-identification problem explicitly, but also to explain clearly how to correct it. In the following subsection, we extend the multivariate distribution to efficient handling of missing data in $(Y_{1ik}, Y_{2i})$ conditional on $(W_{1ik}, W_{2i}, B_{ik})$ for the general LVM.

If $U_{ik}$ were observed, efficient handling of the missing data in the desired model (1) might be achieved, without the measurement model (2), by

$$\begin{bmatrix} U_{ik} \\ S_{ik} \\ Y_{2i} \end{bmatrix} = \begin{bmatrix} \beta_{u1}^T & \beta_{u2}^T \\ \beta_{s1} & \beta_{s2} \\ 0 & \beta_{22} \end{bmatrix}\begin{bmatrix} W_{1ik} \\ W_{2i} \end{bmatrix} + \begin{bmatrix} B_{ik}^T & 0 & 0 \\ 0 & I_{p_1} & 0 \\ 0 & 0 & I_{p_2} \end{bmatrix}\begin{bmatrix} b_{ui} \\ b_{si} \\ b_{2i} \end{bmatrix} + \begin{bmatrix} \epsilon_{uik} \\ \epsilon_{sik} \\ 0 \end{bmatrix}, \qquad (3)$$

where $\beta_{u1}^T$ and $\beta_{s1}$ are $1$-by-$p_3$ and $p_1$-by-$p_3$ matrices of the fixed effects of $W_{1ik}$ on $U_{ik}$ and $S_{ik}$, respectively, $\beta_{u2}^T$, $\beta_{s2}$ and $\beta_{22}$ are $1$-by-$p_4$, $p_1$-by-$p_4$ and $p_2$-by-$p_4$ matrices of the fixed effects of $W_{2i}$ on $U_{ik}$, $S_{ik}$ and $Y_{2i}$, respectively, and

$$\begin{bmatrix} b_{ui} \\ b_{si} \\ b_{2i} \end{bmatrix} \stackrel{\text{iid}}{\sim} N\left(0, \begin{bmatrix} T_{uu} & T_{us} & T_{u2} \\ T_{su} & T_{ss} & T_{s2} \\ T_{2u} & T_{2s} & T_{22} \end{bmatrix}\right) \quad\text{is independent of}\quad \begin{bmatrix} \epsilon_{uik} \\ \epsilon_{sik} \end{bmatrix} \stackrel{\text{iid}}{\sim} N\left(0, \begin{bmatrix} \Sigma_{uu} & \Sigma_{us} \\ \Sigma_{su} & \Sigma_{ss} \end{bmatrix}\right).$$

We center level-1 $S_{ik}$ and $W_{1ik}$ around their respective sample means and level-2 $Y_{2i}$ and $W_{2i}$ around their respective weighted sample means $\sum_i k_i Y_{2i}/\sum_i k_i$ and $\sum_i k_i W_{2i}/\sum_i k_i$ in Equation (3), except for $B_{ik}$, which is centered around its group mean for precise estimation of the variance matrix [35]. The centering ensures that we identify model (1) with no intercept and model (2). Shin and Raudenbush [28] expressed $[\beta_{u1}^T\ \beta_{u2}^T]\begin{bmatrix} W_{1ik} \\ W_{2i} \end{bmatrix} = \beta_u^T W_{uik}$, $\beta_{s1}W_{1ik} + \beta_{s2}W_{2i} = (I_{p_1}\otimes W_{uik}^T)\beta_s$ and $\beta_{22}W_{2i} = (I_{p_2}\otimes W_{2i}^T)\beta_2$, and efficiently estimated model (3) by ML via the EM algorithm when $U_{ik}$ was observable.

Although the conditional model (1) expresses a single effect of each covariate in $S_{ik}$ on $U_{ik}$, the multivariate model (3) expresses a distinct covariance at each level between the covariate and obesity, and thus identifies $p_1$ more parameters than desired in model (1). The two distinct covariances identify the within-child association between the time-varying covariate and the outcome, which may be different from the between-child association, the association between the child-mean covariate and the outcome. These associations identify a contextual effect of the covariate, defined as the difference between the between- and within-child associations [35, 36]. Controlling for the within-child association, the contextual effect explains the expected difference in obesity between two children who have the same value of the covariate at an occasion, but who differ by one unit in their child-mean covariates. Consequently, the multivariate model identifies a contextual-effects model where each covariate in $S_{ik}$ has a contextual effect, controlling for the within-child effect of the covariate [29]. Because model (1) expresses no contextual effect of the covariate, implying identical between- and within-child associations between the covariate and the outcome [36], the multivariate model (3) over-identifies model (1) and expresses the single effect of each covariate in $S_{ik}$ as a weighted average of the two associations [30, 35]. The weighted average is different from the single effect obtained when model (1) is directly estimated [35, 36]. The consequence is that the desired model (1) is not nested within, or congenial to, the multivariate model (3) [37]. The over-identified model (3) yields biased estimation of the desired model (1) unless constraints are imposed on model (3) [28]. We illustrate the over-identification problem causing biased estimation in a simulation study in Section 5.
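A small numerical illustration of the point above: when the within-child and between-child associations between a time-varying covariate and the outcome differ, any single slope is some weighted average of the two, so the unconstrained joint model and direct estimation of model (1) need not agree. The generating values below (within slope 0.4, between slope 1.0) are arbitrary assumptions chosen only to make the gap visible, and the pooled ordinary least-squares slope stands in loosely for a single-slope fit of the kind model (1) imposes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1000, 8

# Assumed generating values: within-child slope 0.4, between-child slope 1.0,
# so the contextual effect (between minus within) is 0.6.
s_mean = rng.normal(size=n)                        # child-mean covariate
s = s_mean[:, None] + rng.normal(size=(n, k))      # time-varying covariate S_ik
u = 0.4 * (s - s_mean[:, None]) + 1.0 * s_mean[:, None] \
    + rng.normal(scale=0.5, size=(n, 1)) + rng.normal(size=(n, k))

# Within-child slope: outcome and covariate centered around child means
within = np.polyfit((s - s.mean(axis=1, keepdims=True)).ravel(),
                    (u - u.mean(axis=1, keepdims=True)).ravel(), 1)[0]
# Between-child slope: child means regressed on child means
between = np.polyfit(s.mean(axis=1), u.mean(axis=1), 1)[0]
# Pooled slope: a single weighted average of the two
pooled = np.polyfit(s.ravel(), u.ravel(), 1)[0]
print(f"within {within:.2f}  between {between:.2f}  pooled {pooled:.2f}")
```

Only when the two slopes coincide, as model (1) assumes, do the three estimates agree; the constraints introduced next enforce exactly that equality on the joint model.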
In order to correct the bias, we impose $p_1$ constraints on model (3) so that it represents a one-to-one transformation of the LVM. For clarity, we describe the constraints for a random-intercept model (1) having $B_{ik} = 1$. Appendix A explains the constraints for a random-coefficient model (1). To simplify the notation, let

$$\mathrm{cov}(b_{ui}, b_{si} \mid b_{2i}) = \begin{bmatrix} T_{uu|2} & T_{us|2} \\ T_{su|2} & T_{ss|2} \end{bmatrix}.$$

Given $Y_{2i}$, we constrain the covariances between $U_{ik}$ and each covariate in $S_{ik}$ to be equal across levels, that is,

$$\alpha_1^T = T_{us|2}T_{ss|2}^{-1} = \Sigma_{us}\Sigma_{ss}^{-1}, \qquad (4)$$

which says that the association between $U_{ik}$ and each of the level-1 covariates is the same at each level given $Y_{2i}$. The constraints imply $\mathrm{cov}(U_{ik}, S_{ik} \mid Y_{2i})[\mathrm{var}(S_{ik} \mid Y_{2i})]^{-1} = (T_{us|2} + \Sigma_{us})(T_{ss|2} + \Sigma_{ss})^{-1} = \alpha_1^T$ for $T_{us|2} = \alpha_1^T T_{ss|2}$ and $\Sigma_{us} = \alpha_1^T\Sigma_{ss}$, and the one-to-one transformations between the LVM and the multivariate model (3)

$$\begin{aligned} \alpha_1 &= \Sigma_{ss}^{-1}\Sigma_{su}, & \alpha_2 &= T_{22}^{-1}(T_{2u} - T_{2s}\alpha_1), \\ \alpha_3 &= \beta_{u1} - \beta_{s1}^T\alpha_1, & & \\ \alpha_4 &= \beta_{u2} - \beta_{s2}^T\alpha_1 - \beta_{22}^T\alpha_2, & 1 &= \Sigma_{uu} - \alpha_1^T\Sigma_{ss}\alpha_1, \\ D &= T_{uu} - \alpha_2^T T_{22}\alpha_2 - 2\alpha_1^T T_{s2}\alpha_2 - \alpha_1^T T_{ss}\alpha_1. & & \end{aligned} \qquad (5)$$

3.2. Efficient Handling of Missing Data

Because $U_{ik}$ is unobservable, we need to estimate the measurement model (2) in addition to the desired model (1). Because the observed biomarkers are also subject to missingness, the multivariate model (3) cannot handle the missing data in both $A_{ik}$ and $R_{ik}$. Instead, we formulate the joint distribution of $(R_i, S_i, Y_{2i})$ subject to missingness given the completely observed covariates, for $R_i = [R_{i1}^T\ R_{i2}^T\cdots R_{ik_i}^T]^T$ and $S_i = [S_{i1}^T\ S_{i2}^T\cdots S_{ik_i}^T]^T$, based on the aggregated models (2) and (3):

$$\begin{bmatrix} R_i \\ S_i \\ Y_{2i} \end{bmatrix} = \begin{bmatrix} 1_{k_i}\otimes\gamma_0 + (W_{ui}\beta_u + B_i b_{ui} + \epsilon_{ui})\otimes\gamma_1 \\ W_{si}\beta_s + (1_{k_i}\otimes I_{p_1})b_{si} + \epsilon_{si} \\ X_{2i}\beta_2 + b_{2i} \end{bmatrix} + \begin{bmatrix} 1_{k_i}\otimes a_i \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} e_i \\ 0 \\ 0 \end{bmatrix}, \qquad (6)$$

for $W_{ui} = [W_{ui1}\ W_{ui2}\cdots W_{uik_i}]^T$, $B_i = [B_{i1}\ B_{i2}\cdots B_{ik_i}]^T$, $\epsilon_{ui} = [\epsilon_{ui1}\ \epsilon_{ui2}\cdots\epsilon_{uik_i}]^T$, $e_i = [e_{i1}^T\ e_{i2}^T\cdots e_{ik_i}^T]^T$, $W_{si} = [I_{p_1}\otimes W_{ui1}\ \ I_{p_1}\otimes W_{ui2}\cdots I_{p_1}\otimes W_{uik_i}]^T$, $\epsilon_{si} = [\epsilon_{si1}^T\ \epsilon_{si2}^T\cdots\epsilon_{sik_i}^T]^T$, and $X_{2i} = I_{p_2}\otimes W_{2i}^T$. To derive estimators, we re-express model (6) parsimoniously as

$$\begin{bmatrix} Y_{1i} \\ Y_{2i} \end{bmatrix} = \begin{bmatrix} X_{1i} & 0 \\ 0 & X_{2i} \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} Z_{1i} & 0 \\ 0 & I_{p_2} \end{bmatrix}\begin{bmatrix} b_{1i} \\ b_{2i} \end{bmatrix} + \begin{bmatrix} \epsilon_{1i} \\ 0 \end{bmatrix} + \begin{bmatrix} a_{1i} + c_{1i} \\ 0 \end{bmatrix}, \qquad (7)$$

for $Y_{1i} = \begin{bmatrix} R_i \\ S_i \end{bmatrix}$, $X_{1i} = \begin{bmatrix} 1_{k_i}\otimes I_J & W_{ui}\otimes I_J & 0 \\ 0 & 0 & W_{si} \end{bmatrix}$, $\beta_1 = \begin{bmatrix} \gamma_0 \\ \beta_u\otimes\gamma_1 \\ \beta_s \end{bmatrix}$, $Z_{1i} = \begin{bmatrix} B_i\otimes I_J & 0 \\ 0 & 1_{k_i}\otimes I_{p_1} \end{bmatrix}$, $b_{1i} = \begin{bmatrix} b_{ui}\otimes\gamma_1 \\ b_{si} \end{bmatrix}$, $\epsilon_{1i} = \begin{bmatrix} \epsilon_{ui}\otimes\gamma_1 \\ \epsilon_{si} \end{bmatrix}$, $a_{1i} = \begin{bmatrix} 1_{k_i}\otimes a_i \\ 0 \end{bmatrix}$, and $c_{1i} = \begin{bmatrix} e_i \\ 0 \end{bmatrix}$, where

$$\mathrm{var}(b_{1i}, b_{2i}) = \begin{bmatrix} \tau_{11} & \tau_{12} \\ \tau_{12}^T & \tau_{22} \end{bmatrix}, \quad \mathrm{var}(\epsilon_{1i}) = \begin{bmatrix} I_{k_i}\otimes(\Sigma_{uu}\gamma_1\gamma_1^T) & I_{k_i}\otimes(\gamma_1\Sigma_{us}) \\ I_{k_i}\otimes(\Sigma_{su}\gamma_1^T) & I_{k_i}\otimes\Sigma_{ss} \end{bmatrix}, \quad \mathrm{var}(a_{1i}) = \begin{bmatrix} (1_{k_i}1_{k_i}^T)\otimes(\oplus_{j=1}^J\xi_j) & 0 \\ 0 & 0 \end{bmatrix},$$

and

$$\mathrm{var}(c_{1i}) = \begin{bmatrix} I_{k_i}\otimes(\oplus_{j=1}^J\tau_j) & 0 \\ 0 & 0 \end{bmatrix} \quad\text{for}\quad \tau_{11} = \begin{bmatrix} T_{uu}\otimes(\gamma_1\gamma_1^T) & T_{us}\otimes\gamma_1 \\ T_{su}\otimes\gamma_1^T & T_{ss} \end{bmatrix}, \quad \tau_{12} = \begin{bmatrix} T_{u2}\otimes\gamma_1 \\ T_{s2} \end{bmatrix}, \quad \tau_{22} = T_{22}.$$

Note that the joint model (7) enables us to analyze a subject who has at least a single value observed in $(Y_{1i}, Y_{2i})$ for efficient analysis of the LVM.

To handle the missing data efficiently, let $O_{1i}$ and $O_{2i}$ be matrices of observed-value indicators (1 if observed, 0 otherwise) for $Y_{1i}$ and $Y_{2i}$, respectively, such that they extract all observed data $Y_{1i}^\circ = O_{1i}Y_{1i}$ and $Y_{2i}^\circ = O_{2i}Y_{2i}$ from $Y_{1i}$ and $Y_{2i}$, respectively [28]. The model (7) for the observed data is

$$\begin{bmatrix} Y_{1i}^\circ \\ Y_{2i}^\circ \end{bmatrix} = \begin{bmatrix} X_{1i}^\circ & 0 \\ 0 & X_{2i}^\circ \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} Z_{1i}^\circ & 0 \\ 0 & O_{2i} \end{bmatrix}\begin{bmatrix} b_{1i} \\ b_{2i} \end{bmatrix} + \begin{bmatrix} a_{1i}^\circ + \epsilon_{1i}^\circ + c_{1i}^\circ \\ 0 \end{bmatrix}, \qquad (8)$$

for $X_{1i}^\circ = O_{1i}X_{1i}$, $X_{2i}^\circ = O_{2i}X_{2i}$, $Z_{1i}^\circ = O_{1i}Z_{1i}$, $a_{1i}^\circ = O_{1i}a_{1i}$, $\epsilon_{1i}^\circ = O_{1i}\epsilon_{1i}$, and $c_{1i}^\circ = O_{1i}c_{1i}$. We re-express model (8) parsimoniously as $Y_i^\circ \sim N(\mu_i^\circ, V_i^\circ)$ for $Y_i^\circ = [Y_{1i}^{\circ T}\ Y_{2i}^{\circ T}]^T$,

$$\mu_i^\circ = \begin{bmatrix} X_{1i}^\circ\beta_1 \\ X_{2i}^\circ\beta_2 \end{bmatrix}, \qquad V_i^\circ = \begin{bmatrix} Z_{1i}^\circ\tau_{11}Z_{1i}^{\circ T} + O_{1i}[\mathrm{var}(\epsilon_{1i}) + \mathrm{var}(a_{1i}) + \mathrm{var}(c_{1i})]O_{1i}^T & Z_{1i}^\circ\tau_{12}O_{2i}^T \\ O_{2i}\tau_{12}^T Z_{1i}^{\circ T} & O_{2i}\tau_{22}O_{2i}^T \end{bmatrix}. \qquad (9)$$
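The selection matrices $O_{1i}$ and $O_{2i}$ are simply the rows of an identity matrix corresponding to the observed entries, so the observed-data moments in (9) follow by pre- and post-multiplying complete-data moments. The sketch below builds such a matrix from a missingness indicator and evaluates one subject's observed-data log-likelihood contribution for generic assumed values of $\mu_i$ and $V_i$; it stands in for, and does not reproduce, the full block structure of (7).

```python
import numpy as np
from scipy.stats import multivariate_normal

def selection_matrix(observed):
    """Rows of the identity for the observed entries: O @ y extracts the observed values."""
    observed = np.asarray(observed, dtype=bool)
    return np.eye(observed.size)[observed]

# Complete-data moments for one subject's stacked vector Y_i = (Y_1i, Y_2i):
# generic assumed values standing in for X_i beta and V_i of equations (7) and (9).
rng = np.random.default_rng(2)
mu = rng.normal(size=6)
A = rng.normal(size=(6, 6))
V = A @ A.T + np.eye(6)

y = rng.multivariate_normal(mu, V)
y[[1, 4]] = np.nan                       # two entries missing for this subject

O = selection_matrix(~np.isnan(y))       # observed-value indicator matrix O_i
mu_o = O @ mu                            # mu_i^o
V_o = O @ V @ O.T                        # V_i^o
loglik_i = multivariate_normal(mu_o, V_o).logpdf(y[~np.isnan(y)])
print(O.shape, round(loglik_i, 3))
```

Summing such contributions over subjects gives the observed-data log-likelihood that the EM algorithm of the next section maximizes.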
4. Estimation via the EM Algorithm

This section sketches efficient estimation of the joint model (7) by a modified application of the EM algorithm [34]; see Appendices B, C and D for details. The modification is due to the fact that we efficiently estimate the LVM to find the constraints (4) that are imposed on the joint model (7) within each iteration of the EM algorithm. We view $(Y_{1i}, Y_{2i}, U_i, b_{ui}, b_{si}, a_i)$ as the complete data and $Y_i^\circ$ as the data observed within unit $i$, for $U_i = [U_{i1}\ U_{i2}\cdots U_{ik_i}]$. The constraints (4) require that the parameters $\alpha$ of the LVM be estimated. Within each iteration of the EM algorithm, we estimate the parameters $\alpha$ and translate them into the parameters of the joint model (7) according to the transformations (5).

To estimate $\alpha$, let $A_i = [A_{i1}\ A_{i2}\cdots A_{ik_i}]^T$, $\epsilon_i = [\epsilon_{i1}\ \epsilon_{i2}\cdots\epsilon_{ik_i}]^T$, $\gamma_j = [\gamma_{0j}\ \gamma_{1j}]^T$, $U_{ik}^* = [1\ U_{ik}]^T$, $\epsilon_{1ik} = [\epsilon_{uik}\ \epsilon_{sik}]^T$ and $\epsilon_{1i}^* = [\epsilon_{ui}^T\ \epsilon_{si}^T]^T$ for the LVM, and let $b_{1i}^* = [b_{ui}\ b_{si}]^T$, $b_i^* = [b_{1i}^{*T}\ b_{2i}^T]^T$, $\beta_1^* = [\beta_u\ \beta_s]^T$,

$$T_{11} = \begin{bmatrix} T_{uu} & T_{us} \\ T_{su} & T_{ss} \end{bmatrix}, \quad T_{12} = \begin{bmatrix} T_{u2} \\ T_{s2} \end{bmatrix}, \quad T = \begin{bmatrix} T_{11} & T_{12} \\ T_{12}^T & T_{22} \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} \Sigma_{uu} & \Sigma_{us} \\ \Sigma_{su} & \Sigma_{ss} \end{bmatrix}, \quad W_{usi} = \begin{bmatrix} W_{ui} & 0 \\ 0 & W_{si} \end{bmatrix},$$

and $T_{2|1} = T_{22} - T_{21}T_{11}^{-1}T_{12}$, where $T_{21} = T_{12}^T$, for the joint model. The complete-data ML estimators in iteration $k$ are

$$\hat\alpha^{(k)} = \hat\alpha^{(k-1)} + \left(\sum_{i=1}^n\sum_{k=1}^{k_i} A_{ik}A_{ik}^T\right)^{-1}\sum_{i=1}^n\sum_{k=1}^{k_i} A_{ik}\epsilon_{ik} \quad\text{and}\quad \hat D = \sum_i b_i b_i^T/n$$

for the structural model (1), and

$$\begin{aligned} \hat\gamma_j^{(k)} &= \hat\gamma_j^{(k-1)} + \left(\sum_{i=1}^n\sum_{k=1}^{k_i} U_{ik}^*U_{ik}^{*T}\right)^{-1}\sum_{i=1}^n\sum_{k=1}^{k_i} U_{ik}^*e_{ikj}, \\ \hat\xi_j &= \frac{1}{n}\sum_{i=1}^n a_{ij}^2, \\ \hat\tau_j &= \frac{1}{\sum_{i=1}^n k_i}\sum_{i=1}^n\sum_{k=1}^{k_i} e_{ikj}^2, \\ \hat\Sigma_1 &= \frac{1}{\sum_{i=1}^n k_i}\sum_{i=1}^n\sum_{k=1}^{k_i}\epsilon_{1ik}\epsilon_{1ik}^T, \qquad (10) \\ \hat T &= \frac{1}{n}\sum_{i=1}^n b_i^*b_i^{*T}, \\ \hat\beta_1^{*(k)} &= \hat\beta_1^{*(k-1)} + \left(\sum_{i=1}^n\Sigma_1^{-1}\otimes(W_{usi}^TW_{usi})\right)^{-1}\sum_{i=1}^n\Sigma_1^{-1}\otimes(W_{usi}^T\epsilon_{1i}^*), \\ \hat\beta_2^{(k)} &= \hat\beta_2^{(k-1)} + \left(\sum_{i=1}^n T_{2|1}^{-1}\otimes(W_{2i}W_{2i}^T)\right)^{-1}\sum_{i=1}^n T_{2|1}^{-1}\otimes W_{2i}\left(b_{2i} - T_{21}T_{11}^{-1}b_{1i}^*\right) \end{aligned}$$

for the joint model (7). At the E step, we obtain the conditional expectations $E(A_{ik}A_{ik}^T\mid Y_i^\circ)$, $E(A_{ik}\epsilon_{ik}\mid Y_i^\circ)$, $E(b_ib_i^T\mid Y_i^\circ)$, $E(U_{ik}\mid Y_i^\circ)$, $E(U_{ik}^2\mid Y_i^\circ)$, $E(U_{ik}e_{ikj}\mid Y_i^\circ)$, $E(e_{ikj}\mid Y_i^\circ)$, $E(e_{ikj}^2\mid Y_i^\circ)$, $E(a_{ij}^2\mid Y_i^\circ)$, $E(\epsilon_{1ik}\epsilon_{1ik}^T\mid Y_i^\circ)$, $E(b_i^*\mid Y_i^\circ)$, $E(b_i^*b_i^{*T}\mid Y_i^\circ)$, and $E(\epsilon_{1i}^*\mid Y_i^\circ)$ from the distribution of $(Y_{1i}, Y_{2i}, U_i, e_i, \epsilon_{1i}^*, b_i^*, a_i)\mid Y_i^\circ$. Let $V(A)$ denote a vector of the distinct elements of a variance-covariance matrix $A$. At convergence, the Fisher information matrix is obtained from the observed log-likelihood of the parameters $(\gamma_0, \gamma_1, \beta_1^*, \beta_2, \tau, T_{uu}, V(T_{ss}), V(T_{2s}), V(T_{22}), \xi, V(\Sigma_{ss}), \alpha_1, \alpha_2)$. The variance matrix associated with the parameter estimates in the constrained joint model (7) is produced by inverting the Fisher information matrix. We obtain the point estimates and the standard errors associated with the parameters of the LVM by the invariance property of MLEs and the multivariate delta method.
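As a concrete aid, the helper below applies the transformations (5) to map constrained joint-model blocks back to the LVM parameters for the random-intercept case; the argument names mirror the blocks of $\Sigma_1$, $T$ and the fixed effects, and any values passed in are the caller's assumptions.

```python
import numpy as np

def joint_to_lvm(Sig_uu, Sig_su, Sig_ss, T_uu, T_2u, T_2s, T_s2, T_ss, T_22,
                 beta_u1, beta_u2, beta_s1, beta_s2, beta_22):
    """Transformations (5): map constrained joint-model parameters to the LVM."""
    alpha1 = np.linalg.solve(Sig_ss, Sig_su)              # Sigma_ss^{-1} Sigma_su
    alpha2 = np.linalg.solve(T_22, T_2u - T_2s @ alpha1)  # T_22^{-1}(T_2u - T_2s alpha1)
    alpha3 = beta_u1 - beta_s1.T @ alpha1
    alpha4 = beta_u2 - beta_s2.T @ alpha1 - beta_22.T @ alpha2
    resid_var = Sig_uu - alpha1.T @ Sig_ss @ alpha1       # equals 1 under constraint (4)
    D = (T_uu - alpha2.T @ T_22 @ alpha2
         - 2 * alpha1.T @ T_s2 @ alpha2 - alpha1.T @ T_ss @ alpha1)
    return alpha1, alpha2, alpha3, alpha4, resid_var, D
```

Using `np.linalg.solve` instead of explicit inverses is only a numerical-stability choice; under the constraint (4) the recovered residual variance should equal 1 up to rounding, which is a convenient check on an implementation.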
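Finally, the overall control flow of the modified EM algorithm can be summarized as a driver that alternates the E step with the two-part M step and re-imposes the constraints every iteration. The step functions are injected by the caller because they depend on the full conditional distributions of Appendices B-D; everything here, including the function names and the tuple-of-arrays convention for the parameters, is an illustrative assumption rather than the paper's code.

```python
import numpy as np

def em_constrained_joint_model(data, theta, e_step, m_step_lvm,
                               m_step_joint, impose_constraints,
                               n_iter=500, tol=1e-6):
    """Schematic driver for the modified EM algorithm of Section 4.

    theta               tuple of arrays holding the joint-model parameters of (7)
    e_step              returns expected complete-data sufficient statistics given Y_i^o
    m_step_lvm          returns the LVM estimates (alpha, D, gamma, xi, tau) via (10)
    m_step_joint        returns the unconstrained joint-model updates via (10)
    impose_constraints  maps the LVM estimates onto (Sigma, T, beta) through (5),
                        keeping the joint model just-identified under (4)
    """
    lvm = None
    for _ in range(n_iter):
        stats = e_step(data, theta)                                # E step
        lvm = m_step_lvm(stats)                                    # M step for the LVM
        theta_new = impose_constraints(m_step_joint(stats), lvm)   # constrained joint model
        change = max(float(np.max(np.abs(np.asarray(a) - np.asarray(b))))
                     for a, b in zip(theta_new, theta))
        theta = theta_new
        if change < tol:
            break
    return theta, lvm
```

The point the skeleton emphasizes is the modification described above: the LVM parameters $\alpha$ are estimated inside every iteration and immediately translated back through (5), so the joint model never drifts into its over-identified, unconstrained form.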