AR(1) Latent Class Models for Longitudinal Count Data

Nicholas C. Henderson∗
Department of Statistics
University of Wisconsin-Madison

Paul J. Rathouz
Department of Biostatistics & Medical Informatics
University of Wisconsin-Madison

Abstract

In a variety of applications involving longitudinal or repeated-measurements data, it is desired to uncover natural groupings or clusters which exist among study subjects. Motivated by the need to recover longitudinal trajectories of conduct problems in the field of developmental psychopathology, we propose a method to address this goal when the data in question are counts. We assume that the subject-specific observations are generated from a first-order autoregressive process which is appropriate for counts. A key advantage of our approach is that the marginal distribution of the response can be expressed in closed form, circumventing common computational issues associated with random effects models. To further improve computational efficiency, we propose a quasi-EM procedure for estimating the model parameters where, within each EM iteration, the maximization step is approximated by solving an appropriately chosen set of estimating equations. Additionally, we introduce a novel method to express the degree to which both the underlying data and the fitted model are able to correctly assign subjects to their (unknown) latent classes. We explore the effectiveness of our procedures through simulations based on a four-class model, placing a special emphasis on posterior classification. Finally, we analyze data and recover trajectories of conduct problems in an important nationally representative sample.

Keywords: Discrete AR(1) processes; Class discrimination; Finite mixture model; Longitudinal data; Negative binomial model.

∗ Nicholas C. Henderson is PhD Candidate, Department of Statistics, University of Wisconsin, Madison, WI 53792 (e-mail: [email protected]). Paul J. Rathouz is Professor, Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53792 (e-mail: [email protected]). Henderson's contributions to this work were supported by 5T32HL083806 from the National Heart, Lung, and Blood Institute. Rathouz' contributions were supported by grant 5R01HD061384 from the National Institute of Child Health and Human Development.

1 Introduction

Many longitudinal studies have the aim of tracking the change in some outcome or response over time. This is an important and common goal in the field of developmental psychopathology, which aims to study the natural history of common childhood psychiatric diseases such as conduct disorder and delinquency. Often, there exists substantial variability in the response-paths of observed trajectories across subjects, and grouping subjects with similar trajectories may reveal certain sub-populations that exhibit interesting developmental patterns. In conduct disorder research, for example, such distinct "developmental sub-types" of trajectories are of great interest because the classification carries important information about level of impairment, future life outcomes, and possible etiologic origin (see, e.g., Odgers et al. (2007) or Moffitt (1993)). Furthermore, it is of interest to robustly estimate such trajectory sub-types using large and representative samples, to do so in a computationally efficient way, and to use those estimates to recover class membership at the subject level.
The problem of identifying a finite number of sub-populations is frequently formulated as a latent class or finite mixture model where the distribution governing the observations on a given subject is determined by an unobserved class label.

In an alternative approach to the analysis of longitudinal data, random effects are introduced to account for the heterogeneity across subjects and the correlation among observations on the same subject. If the conditional distribution of the response given the values of the random effects is not Gaussian, however, the marginal distribution of the response will typically not have a closed form. In these cases, evaluation of the likelihood requires numerical integration over the distribution of the random effects. Direct maximization of the likelihood then involves numerical integration for every evaluation of the likelihood. A number of other estimation approaches for models of this type, commonly referred to as generalized linear mixed models (GLMMs), have been proposed, including approximate methods such as penalized quasi-likelihood (Schall (1991) or Breslow and Clayton (1993)), Monte Carlo methods (Zeger and Karim (1991)), and marginalized random effects models (Heagerty (1999)).

More recently, some authors have combined these two approaches. They have observed that with longitudinal data, a small number of latent classes is not sufficient to account for all of the realized between-subject heterogeneity and correlation among observations within subjects. They have therefore developed models with latent random effects in addition to the latent class variable already present (see, e.g., Muthén and Shedden (1999), or Qu et al. (1996)). Whereas this approach has gained traction in the applied literature, it poses two potential methodological problems. First, the addition of within-class random effects to the latent class model may complicate computation considerably. For example, if using an EM algorithm for estimation, not only does one confront within each iteration the difficulties associated with GLMMs, but one must also use numerical integration to update the class-membership probabilities within each iteration. A within-class closed form likelihood would be much more computationally tractable. The second problem involves the distinction between global and local correlation among observations on the same subject. To articulate this concern, we first posit that one key role of the latent class structure is to account for the global correlation, i.e., the correlation that exists between all observations on a subject, whether they are separated by a short or by a long time lag. In classic latent class analysis, given class membership, all observations on a given subject are independent, and this assumption drives the identification of the model. This assumption is, however, somewhat restrictive, and there are concerns that it could lead to models with too many classes that are too small if there remains residual correlation among some of the observations within subject. An obvious solution is to allow, and to model, in a restricted way some correlation among some observations. The introduction of random effects into growth mixture models attempts to address this need. A concern, however, is that random effects also account for a more global type of correlation, confounding the role of the latent classes and that of the within-class correlation structure in model identifiability, fitting, and testing when classes are not crisply separated.
In contrast to random effects models, auto-regressive processes represent an alternative and more local source of within-subject correlation, allowing observations close together in time to be more strongly correlated than those further apart. Local correlation is not at all accounted for by the latent class structure. With a within-class local correlation model, the observations far apart in time will be nearly independent, strengthening model identifiability.

To address these issues, we propose a longitudinal latent class model for count data which yields a closed form likelihood, accounts for local correlation among the repeated measures on a given subject, and allows for global association to be accounted for by the latent class structure. With our approach, correlations between observations far apart in time will be especially informative about class membership because the subject-specific correlation in these cases will be negligible.

Our contributions in this paper are as follows. In the next section, we provide a technical description of our discrete data AR(1) process latent class trajectory model, followed in Section 3 with our approach to estimating and making inferences on model parameters. There, we rely on a variation of the EM algorithm that exploits a general estimating function rather than the true likelihood score function. In Section 4, we propose a novel measure of the inherent ability of latent class data to discriminate among classes for subjects in the population. To our knowledge, this is a previously unexplored, but important, construct in latent class analysis, especially when such analysis is used to assign subjects to their classes based on their manifest data. Because it is based on the true data generating mechanism, our measure represents an upper bound on the ability of any fitted statistical model to perform class assignment. We also extend this measure to such fitted models and modeling procedures for finite data. Section 5 presents a simulation study with the aims of quantifying the statistical operating characteristics of our proposed model in terms of parameter estimation, bias, and confidence interval coverage. We also quantify accuracy of class assignment. Additionally, we examine the ability of our approach to support class assignment when the data generating mechanism is different from the one specified by our model. Finally, we examine a longitudinal study of conduct problems, and illustrate the use of our model for class definition and assignment in that study.

2 Model Description

2.1 Data Structure and Trajectory Model

Let $y_i = (y_{i1},\dots,y_{in_i})$ be observed longitudinal counts associated with the $i$th subject. In total, we have measurements on $m$ subjects, $\mathbf{Y} = (y_1,\dots,y_m)$. We have observations on subject $i$ at each of the time points $(t_{i1},\dots,t_{in_i})$, and we let $y_{ij}$ denote the observation on subject $i$ at time $t_{ij}$. For each subject and each observed time point, we also observe a $p \times 1$ covariate vector $x_{ij}$, with $X_i = (x_{i1},\dots,x_{in_i})^T$ denoting the $n_i \times p$ design matrix for subject $i$. In addition, each subject has an unobserved "latent class" $Z_i$, with $Z_i \in \{1,\dots,C\}$ indicating membership in one of $C$ latent classes. The distribution of $(y_i \mid X_i, Z_i = c)$ is governed by a vector of class-specific parameters $\theta_c$, with $p(y_i \mid X_i, Z_i = c) = p_i(y_i;\theta_c)$ denoting the distribution of $y_i$ given covariates $X_i$ and class label $c$.
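To fix ideas, the following minimal sketch (our own illustration, not code from the paper; all names and values are hypothetical) lays out the data structure just described: $m$ subjects, each contributing an $n_i$-vector of counts, an $n_i \times p$ design matrix $X_i$, and an unobserved class label $Z_i$ drawn with probabilities $\pi$.

```python
# Illustrative data layout for the latent class model (hypothetical values).
import numpy as np

rng = np.random.default_rng(1)

m, p, C = 200, 2, 4                  # subjects, covariates, latent classes
pi = np.array([0.4, 0.3, 0.2, 0.1])  # mixture proportions pi_1, ..., pi_C

Z = rng.choice(C, size=m, p=pi)      # latent class labels Z_i (unobserved)
n = rng.integers(4, 9, size=m)       # n_i: repeated measures per subject
X = [np.column_stack([np.ones(n_i),  # X_i: intercept plus observation time
                      np.arange(n_i)]) for n_i in n]
```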
Observations made on different subjects are assumed to be independent, and the class labels $(Z_1,\dots,Z_m)$ are assumed to be i.i.d. random variables with the vector of mixture proportions $\pi = (\pi_1,\dots,\pi_C)$ determining the class-membership proportions (i.e., $P(Z_i = c) = \pi_c$) in the population.

Conditional on a subject's class, the mean-response curve or latent trajectory is

$$E(y_i \mid Z_i = c, X_i) = E_{\theta_c}(y_i) = \mu_i^c = (\mu_{i1}^c,\dots,\mu_{in_i}^c), \qquad (1)$$

where $E_{\theta_c}(\cdot)$ denotes taking expectation conditional on subject $i$ belonging to class $c$. We relate the mean curve $\mu_i^c$ to the covariates through $\log(\mu_{ij}^c) = x_{ij}^T\beta_c$, where $\beta_c = (\beta_1^c,\dots,\beta_p^c)$ are the class-$c$ regression coefficients. To allow for overdispersion, the variance function has the form $\mathrm{Var}_{\theta_c}(y_{ij}) = \phi_c\,\mu_{ij}^c$, with scale parameter $\phi_c$ allowed to vary across classes. Due to our data-generating model (Section 2.2), we must have $\phi_c > 1$. For this reason, we often write the scale parameter as $\phi_c = 1 + \gamma_c$, where $\gamma_c > 0$.

2.2 The INAR(1) Negative Binomial Process

Conditional on class membership, the observations from subject $i$ comprise a multivariate outcome with distribution $p_i(y_i;\theta_c)$, governed by the $(p+2)\times 1$ class-specific parameter vector $\theta_c = (\beta_c^T,\alpha_c,\phi_c)^T$. The distribution of $y_i = (y_{i1},\dots,y_{in_i})$ is modeled by assuming that the components $y_{ij}$ of $y_i$ arise from a first-order Markov process governed by $\theta_c$. The joint distribution of $y_i$ is built up directly through the transition function, $p(y_{ij} \mid y_{i,j-1}, X_i;\theta_c)$, associated with the underlying process: $p_i(y_i;\theta_c) = p(y_{i1} \mid X_i;\theta_c)\prod_{j=2}^{n_i} p(y_{ij} \mid y_{i,j-1}, X_i;\theta_c)$. The correlation structure of $y_i$ then arises from the various dependencies introduced by the Markov process.

A stochastic process tailored specifically for count data is the integer-valued autoregressive (INAR(1)-NB) process with negative binomial marginals described in McKenzie (1986) and Bockenholt (1999). For a subject in class $c$, observations from the INAR(1)-NB process arise as follows: the first observation $y_{i1}$ follows a negative binomial distribution with $E_{\theta_c}(y_{i1}) = \mu_{i1}^c$ and $\mathrm{Var}_{\theta_c}(y_{i1}) = \mu_{i1}^c(1+\gamma_c)$. We denote this by $y_{i1} \sim \mathrm{NB}\{\mu_{i1}^c,\ \mu_{i1}^c(1+\gamma_c)\}$, meaning that $y_{i1}$ has probability mass function

$$P(y_{i1} = k) = \binom{k + \mu_{i1}^c/\gamma_c - 1}{k}\left(\frac{1}{1+\gamma_c}\right)^{\mu_{i1}^c/\gamma_c}\left(\frac{\gamma_c}{1+\gamma_c}\right)^{k}; \qquad k \ge 0. \qquad (2)$$

The subsequent values $(y_{i2},\dots,y_{in_i})$ are determined through

$$y_{ij} = H_{ij} + I_{ij}, \qquad j = 2,\dots,n_i, \qquad (3)$$

where, conditional on $(y_{i1},\dots,y_{i,j-1})$ and latent probabilities $(q_{i2},\dots,q_{in_i})$, $H_{ij} \sim \mathrm{Binomial}(y_{i,j-1},\,q_{ij})$, with the understanding that $H_{ij} = 0$ whenever $y_{i,j-1} = 0$. The success probabilities $(q_{i2},\dots,q_{in_i})$ are themselves independent random variables with $q_{ij} \sim \mathrm{Beta}\{\alpha_c\mu_{i,j-1}^c/\gamma_c,\ (1-\alpha_c)\mu_{i,j-1}^c/\gamma_c\}$, where $\mathrm{Beta}(\alpha,\beta)$ represents a Beta distribution with shape parameters $\alpha$ and $\beta$. Because this implies that $E[H_{ij} \mid y_{i,j-1}] = \alpha_c y_{i,j-1}$ marginally over $q_{ij}$, we must have $\alpha_c \in [0,1]$ for each class. One may also note that, given $y_{i,j-1} \ge 1$ and class membership $c$, $H_{ij}$ follows a beta-binomial distribution with parameters $(y_{i,j-1},\ \alpha_c\mu_{i,j-1}^c/\gamma_c,\ (1-\alpha_c)\mu_{i,j-1}^c/\gamma_c)$. The innovation component $I_{ij}$ is assumed to follow a $\mathrm{NB}\{\mu_{ij}^c(1-\alpha_c),\ \mu_{ij}^c(1-\alpha_c)(1+\gamma_c)\}$ distribution, where $(I_{i2},\dots,I_{in_i})$ are mutually independent and where each $I_{ij}$ is independent of the history $(y_{i1},\dots,y_{i,j-1})$.
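The generative recursion in (2)-(3) is straightforward to simulate directly. The sketch below is our own illustration (not the authors' code) and assumes $0 < \alpha_c < 1$; it uses the fact that the $\mathrm{NB}\{\mu,\ \mu(1+\gamma)\}$ parametrization in (2) corresponds to a size parameter $\mu/\gamma$ and success probability $1/(1+\gamma)$.

```python
# Simulate one subject's counts from the INAR(1)-NB process (a sketch).
import numpy as np

def rnb(rng, mu, gamma):
    # NB draw with mean mu and variance mu*(1+gamma), as in (2)
    return rng.negative_binomial(mu / gamma, 1.0 / (1.0 + gamma))

def simulate_inar1_nb(rng, X_i, beta_c, alpha_c, gamma_c):
    mu = np.exp(X_i @ beta_c)            # log link: mu_ij^c = exp(x_ij' beta_c)
    y = np.zeros(len(mu), dtype=int)
    y[0] = rnb(rng, mu[0], gamma_c)      # y_i1 ~ NB{mu_i1, mu_i1(1+gamma)}
    for j in range(1, len(mu)):
        # q_ij ~ Beta{alpha*mu_{i,j-1}/gamma, (1-alpha)*mu_{i,j-1}/gamma}
        q = rng.beta(alpha_c * mu[j - 1] / gamma_c,
                     (1.0 - alpha_c) * mu[j - 1] / gamma_c)
        H = rng.binomial(y[j - 1], q) if y[j - 1] > 0 else 0  # binomial thinning
        I = rnb(rng, mu[j] * (1.0 - alpha_c), gamma_c)        # innovation term
        y[j] = H + I                                          # eq. (3)
    return y
```

Sample moments from many simulated paths can be checked against the marginal mean and variance stated above, which is a useful sanity test for any implementation.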
Although the transition function $p(y_{ij} \mid y_{i,j-1}, X_i;\theta_c)$ associated with the INAR(1)-NB process does not have a simple closed form (see Bockenholt (1999)), it may be directly computed by using the fact that it is the convolution of a beta-binomial distribution and a negative binomial distribution. In addition, under our description of the INAR(1)-NB process, the marginal distribution (conditional on class membership) of $y_{ij}$ is negative binomial with $E_{\theta_c}(y_{ij}) = \mu_{ij}^c$ and $\mathrm{Var}_{\theta_c}(y_{ij}) = \mu_{ij}^c(1+\gamma_c)$, and the within-class correlation structure of $y_i$ is first-order autoregressive. That is, for two observations $y_{ik}$ and $y_{ij}$ on the same subject, the within-class correlation is $\mathrm{Corr}_{\theta_c}(y_{ik},y_{ij}) = \alpha_c^{|k-j|}$. The conditional expectation of $y_{ij}$ given $y_{i,j-1}$ is a linear function of $y_{i,j-1}$,

$$E_{\theta_c}(y_{ij} \mid y_{i,j-1}) = \mu_{ij}^c\left(1-\lambda_{ij}^c\right) + \lambda_{ij}^c\,y_{i,j-1}, \qquad (4)$$

and the conditional variance of $y_{ij}$ given $y_{i,j-1}$ is given by

$$\mathrm{Var}_{\theta_c}(y_{ij} \mid y_{i,j-1}) = \mu_{ij}^c\left(1-\lambda_{ij}^c\right)\phi_c + y_{i,j-1}\,\lambda_{ij}^c\left(1-\lambda_{ij}^c\right)\frac{\mu_{i,j-1}^c/\gamma_c + y_{i,j-1}}{1 + \mu_{i,j-1}^c/\gamma_c}, \qquad (5)$$

where $\lambda_{ij}^c = \alpha_c\sqrt{\mu_{ij}^c/\mu_{i,j-1}^c}$.

It is also worth mentioning that our specification of the INAR(1)-NB process places the following restriction on the relation between the autocorrelation parameters and the latent trajectories: $\alpha_c^2 < \min_{i,j}\{\mu_{i,j-1}^c/\mu_{ij}^c,\ \mu_{ij}^c/\mu_{i,j-1}^c\}$ for each $c$. However, when all of the latent trajectories are reasonably smooth, this constraint is not especially restrictive, as the values of $\{\mu_{i,j-1}^c/\mu_{ij}^c\}$ will be close to one.

3 Estimation

Because a finite mixture model with a fixed number of components can easily be formulated as a "missing-data" problem, the EM algorithm provides an attractive estimation approach. In our model, if the individual class-membership labels $Z = (Z_1,\dots,Z_m)$ were observed, the "complete-data" log-likelihood would be

$$\log L(\Theta,\pi;\mathbf{Y},Z) = \sum_{i=1}^m\sum_{c=1}^C 1\{Z_i = c\}\Big(\log(\pi_c) + \log\{p_i(y_i;\theta_c)\}\Big). \qquad (6)$$

Above, $\Theta = (\theta_1,\dots,\theta_C)$, where $\theta_c = (\beta_c,\alpha_c,\gamma_c)$ are the parameters associated with class $c$, and $\pi = (\pi_1,\dots,\pi_C)$ is the vector of mixture proportions.

Given current, iteration-$k$, estimates of the parameters $(\Theta^{(k)},\pi^{(k)})$, each EM iteration obtains new parameter estimates by maximizing the current expectation of the complete-data log-likelihood, with the expectation being taken over the unobserved class labels, viz.

$$(\Theta^{(k+1)},\pi^{(k+1)}) = \arg\max_{\Theta,\pi}\,E\Big\{\log L(\Theta,\pi;\mathbf{Y},Z)\ \Big|\ \mathbf{Y},\Theta^{(k)},\pi^{(k)}\Big\} = \arg\max_{\Theta,\pi}\sum_{c=1}^C\sum_{i=1}^m W_{ic}(\Theta^{(k)},\pi^{(k)})\Big(\log(\pi_c) + \log\{p_i(y_i;\theta_c)\}\Big). \qquad (7)$$

Here, $W_{ic}(\Theta^{(k)},\pi^{(k)})$ is the current estimated posterior probability that subject $i$ belongs to class $c$, namely

$$W_{ic}(\Theta^{(k)},\pi^{(k)}) = P(Z_i = c \mid y_i,\Theta^{(k)},\pi^{(k)}) = \frac{\pi_c\,p_i(y_i;\theta_c^{(k)})}{\sum_s \pi_s\,p_i(y_i;\theta_s^{(k)})}. \qquad (8)$$

3.1 Estimating Equation Approach for Computation

In each step of the EM algorithm, updating the class probabilities is straightforward because the update is simply the average of the current posterior probabilities, $\pi_c^{(k+1)} = \frac{1}{m}\sum_{i=1}^m W_{ic}(\Theta^{(k)},\pi^{(k)})$. However, to update the remaining parameters, we must maximize $C$ separate weighted log-likelihood functions,

$$\theta_c^{(k+1)} = \arg\max_{\theta_c}\sum_{i=1}^m W_{ic}(\Theta^{(k)},\pi^{(k)})\log\{p_i(y_i;\theta_c)\}, \qquad c = 1,\dots,C. \qquad (9)$$

Because each such log-likelihood function is a sum over many complicated transition probabilities, implementing the maximization in (9) may be challenging.
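The E-step itself poses no such difficulty. As a concrete illustration of (7)-(8) (a sketch of ours, not the authors' implementation), suppose the per-class log-likelihoods $\log p_i(y_i;\theta_c)$ have been computed from the INAR(1)-NB transition probabilities and stored in a hypothetical $m \times C$ array `loglik`; the posterior weights and the mixture-proportion update are then a few stable log-scale operations:

```python
# E-step weights (8) and mixture update, computed on the log scale to
# avoid underflow. loglik[i, c] = log p_i(y_i; theta_c) is assumed given.
import numpy as np
from scipy.special import logsumexp

def posterior_weights(loglik, pi):
    # W_ic is proportional to pi_c * p_i(y_i; theta_c), normalized over c
    logw = np.log(pi)[None, :] + loglik
    return np.exp(logw - logsumexp(logw, axis=1, keepdims=True))

# Mixture-proportion update of Section 3.1: average the posterior weights.
# pi_new = posterior_weights(loglik, pi).mean(axis=0)
```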
Instead of updating the parameters by maximizing each of the weighted log-likelihood functions directly, we found that replacing the score function with a more manageable estimating function provides a less cumbersome approach. That is, letting $s_i(\theta_c) = \partial\log p_i(y_i;\theta_c)/\partial\theta_c$ denote the score function, we suggest that, rather than solving

$$\sum_{i=1}^m W_{ic}(\Theta^{(k)},\pi^{(k)})\,s_i(\theta_c) = 0 \qquad (10)$$

for each class, one instead solve

$$\sum_{i=1}^m W_{ic}(\Theta^{(k)},\pi^{(k)})\,U_i(\theta_c) = 0; \qquad c = 1,\dots,C, \qquad (11)$$

where $U_i(\theta_c)$ forms an unbiased estimating function (i.e., $E_{\theta_c}[U_i(\theta_c)] = 0$) for each class. Such an approach, where within each EM iteration the maximization step is approximated by solving an estimating equation, is similar to the estimation strategy described in Elashoff and Ryan (2004).

Our choice of $U_i(\theta_c)$ relies on the extended quasi-likelihood procedure for constructing estimating equations that was proposed in Hall and Severini (1998). These estimating functions considerably simplify computation (i.e., by solving (11)) when compared to using the score functions $s_i(\theta_c)$. Tailoring Hall and Severini (1998) to the case of a log-link function and an AR(1) correlation structure yields the $(p+2)\times 1$ estimating function

$$U_i(\theta_c) = \begin{pmatrix} U_i^{[1]}(\theta_c) \\[4pt] U_i^{[2]}(\theta_c) \\[4pt] U_i^{[3]}(\theta_c) \end{pmatrix} = \begin{pmatrix} \dfrac{1}{\phi_c}\,X_i^T A_i^{1/2}(\mu_i^c)R_i^{-1}(\alpha_c)A_i^{-1/2}(\mu_i^c)(y_i-\mu_i^c) \\[8pt] \dfrac{1}{\phi_c}\left\{\dfrac{2\phi_c\alpha_c(n_i-1)}{1-\alpha_c^2} - (y_i-\mu_i^c)^T\dfrac{dR_i^{-1}(\alpha_c)}{d\alpha_c}(y_i-\mu_i^c)\right\} \\[8pt] \dfrac{1}{\phi_c}\,(y_i-\mu_i^c)^T A_i^{-1/2}(\mu_i^c)R_i^{-1}(\alpha_c)A_i^{-1/2}(\mu_i^c)(y_i-\mu_i^c) - n_i \end{pmatrix}. \qquad (12)$$

In (12), $A_i(\mu_i^c)$ is the $n_i\times n_i$ matrix defined by $A_i(\mu_i^c) = \mathrm{diag}\{\mu_{i1}^c,\dots,\mu_{in_i}^c\}$, and $R_i(\alpha_c)$ is the $n_i\times n_i$ correlation matrix whose $(k,j)$ entry is $R_i(\alpha_c)[k,j] = \alpha_c^{|k-j|}$.

The equation determined by setting the $p$-component vector $\sum_{i=1}^m U_i^{[1]}(\theta_c)$ to zero,

$$\frac{1}{\phi_c}\sum_{i=1}^m X_i^T A_i^{1/2}(\mu_i^c)R_i^{-1}(\alpha_c)A_i^{-1/2}(\mu_i^c)(y_i-\mu_i^c) = 0, \qquad (13)$$

corresponds to the generalized estimating equation (GEE) described in Zeger and Liang (1986). In GEE, the autocorrelation parameter $\alpha_c$ is first estimated separately and then plugged into (13) in order to solve for the regression coefficients $\beta_c$. In contrast, solving (11) requires that $\alpha_c$ be estimated simultaneously with $\beta_c$. To solve (11) for a fixed $c$, we first update $\beta_c$ by solving $\sum_{i=1}^m W_{ic}(\Theta^{(k)},\pi^{(k)})U_i^{[1]}(\theta_c) = 0$, using initial values for $(\alpha_c,\phi_c)$. Using this value of $\beta_c$ and the initial overdispersion $\phi_c$, $\alpha_c$ is updated by solving $\sum_{i=1}^m W_{ic}(\Theta^{(k)},\pi^{(k)})U_i^{[2]}(\theta_c) = 0$. The value of $\phi_c$ can then be updated non-iteratively because, given values of $(\beta_c,\alpha_c)$, solving $\sum_{i=1}^m W_{ic}(\Theta^{(k)},\pi^{(k)})U_i^{[3]}(\theta_c) = 0$ for $\phi_c$ has a closed form. This procedure is repeated until convergence.

3.2 Quasi-EM Procedure and Standard Errors

Our altered EM algorithm, where the score function is replaced with another, more manageable estimating function, may be summarized as follows:

1. Find initial estimates $\Theta^{(0)}$, $\pi^{(0)}$. (Our initialization procedure is described in the appendix.)

2. Compute current estimated posterior probabilities $W_{ic}(\Theta^{(k)},\pi^{(k)})$ for each class and subject.

3. Update the mixture proportions through $\pi_c^{(k+1)} = \frac{1}{m}\sum_{i=1}^m W_{ic}(\Theta^{(k)},\pi^{(k)})$. Update the other parameters $(\theta_1,\dots,\theta_C)$ by solving $\sum_{i=1}^m W_{ic}(\Theta^{(k)},\pi^{(k)})U_i(\theta_c) = 0$, for $c = 1,\dots,C$.

4. Repeat steps (2) and (3) until convergence.
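To make the within-class cycle of Section 3.1 concrete, here is a sketch (our own, under stated assumptions; all helper names are ours) of the pieces needed for one class $c$: the weighted GEE-type component from (13) and the closed-form $\phi_c$ update implied by $U^{[3]}$. Here `X` and `Y` are lists of per-subject design matrices and count vectors, and `w` holds the current weights $W_{ic}$.

```python
# Building blocks for one class-c M-step approximation (Section 3.1).
import numpy as np

def R_inv(alpha, n):
    # Inverse of the AR(1) correlation matrix R[k, j] = alpha^|k - j|
    idx = np.arange(n)
    return np.linalg.inv(alpha ** np.abs(idx[:, None] - idx[None, :]))

def U1_weighted(beta, alpha, X, Y, w):
    # sum_i w_i X_i' A^{1/2} R^{-1} A^{-1/2} (y_i - mu_i), as in (13)
    out = np.zeros_like(beta)
    for X_i, y_i, w_i in zip(X, Y, w):
        mu = np.exp(X_i @ beta)
        s = np.sqrt(mu)  # diagonal of A^{1/2}
        out += w_i * X_i.T @ (s * (R_inv(alpha, len(y_i)) @ ((y_i - mu) / s)))
    return out

def phi_closed_form(beta, alpha, X, Y, w):
    # Solve sum_i w_i U_i^[3] = 0: a weighted generalized Pearson statistic
    num = den = 0.0
    for X_i, y_i, w_i in zip(X, Y, w):
        mu = np.exp(X_i @ beta)
        r = (y_i - mu) / np.sqrt(mu)
        num += w_i * (r @ R_inv(alpha, len(y_i)) @ r)
        den += w_i * len(y_i)
    return num / den

# Cycle: solve U1_weighted(beta, alpha, X, Y, w) = 0 for beta_c (Newton or
# scipy.optimize.root); solve the scalar U^[2] equation for alpha_c (e.g.,
# scipy.optimize.brentq); then set phi_c = phi_closed_form(...). Repeat
# until convergence.
```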
Parameter estimates $(\hat\Theta,\hat\pi)$ produced from the above iterative procedure may be viewed as a solution to the estimating equation $G(\Theta,\pi) = \sum_{i=1}^m G_i(\Theta,\pi) = 0$, where $G_i(\Theta,\pi)$ is the $\{(p+3)C-1\}\times 1$ vector defined as $G_i(\Theta,\pi) = [\mathrm{vec}(V_i),\,b_i]^T$, where $V_i$ is the $(p+2)\times C$ matrix $V_i = \big[W_{i1}(\Theta,\pi)U_i(\theta_1),\ \dots,\ W_{iC}(\Theta,\pi)U_i(\theta_C)\big]$ and $b_i$ is the $(C-1)\times 1$ vector $b_i = \big[W_{i1}(\Theta,\pi)-\pi_1,\ \dots,\ W_{i,C-1}(\Theta,\pi)-\pi_{C-1}\big]^T$.

If we let $U_i^k(\theta_c;y_i)$ be the $k$th component of $U_i(\theta_c)$, we can see that $G(\Theta,\pi) = 0$ forms an unbiased estimating equation for $(\Theta,\pi)$ because the expectation of the $(k,c)$ component of $V_i$ is given by

$$E\big\{W_{ic}(\Theta,\pi)U_i^k(\theta_c;y_i)\big\} = \pi_c\,E\left\{\frac{p_{\theta_c}(y_i)}{p(y_i)}\,U_i^k(\theta_c;y_i)\right\} = \pi_c\,E_{\theta_c}\big\{U_i^k(\theta_c;y_i)\big\} = 0, \qquad (14)$$

and the expectation of the $c$th element of $b_i$ is given by

$$E\big\{W_{ic}(\Theta,\pi)-\pi_c\big\} = \pi_c\,E\left\{\frac{p_{\theta_c}(y_i)}{p(y_i)}\right\} - \pi_c = 0. \qquad (15)$$

Conditions under which solutions of unbiased estimating equations are asymptotically normal (after normalizing appropriately) are discussed in a number of sources (see, e.g., Section 12.4 of Heyde (1997), or Crowder (1986)). By applying the typical results regarding asymptotic normality to our setting, we have that $\hat\Sigma^{-1/2}\big\{(\hat\Theta,\hat\pi)^T - (\Theta,\pi)^T\big\} \longrightarrow_d N\big(0,\,I_{(p+3)C-1}\big)$ as $m \longrightarrow \infty$, where $\hat\Sigma$ is given by

$$\hat\Sigma = \left(\frac{1}{m}\sum_{i=1}^m E\big\{DG_i(\hat\Theta,\hat\pi)\big\}\right)^{-1}\left(\frac{1}{m}\sum_{i=1}^m G_i(\hat\Theta,\hat\pi)G_i(\hat\Theta,\hat\pi)^T\right)\left(\frac{1}{m}\sum_{i=1}^m E\big\{DG_i(\hat\Theta,\hat\pi)\big\}\right)^{-1\,T}. \qquad (16)$$

In (16), $DG_i(\Theta,\pi)$ is the $\{(p+3)C-1\}\times\{(p+3)C-1\}$ matrix of partial derivatives $DG_i(\Theta,\pi) = \partial G_i(\Theta,\pi)/\partial(\Theta,\pi)$. We compute standard errors using the diagonal elements of $\hat\Sigma$.

4 Class Assignment and Discrimination Measures

4.1 Class Separation Index and Expected Empirical Discrimination

After estimating latent trajectories for each class, one often wishes to go back and assign or classify subjects based on their empirical resemblance to one of the estimated trajectories. Even though all the model parameters may be estimated accurately, however, the associated classification procedure may not have a high rate of correct assignment. Poor