An efficient Fisher-scoring algorithm for fitting latent class models with individual covariates

A. Forcina, Dipartimento di Economia, Finanza e Statistica, University of Perugia, Italy

Abstract

For latent class models where the class weights depend on individual covariates, we derive a simple expression for computing the score vector and a convenient hybrid between the observed and the expected information matrices which is always positive definite. These ingredients, combined with a maximization algorithm based on line search, provide an efficient tool for maximum likelihood estimation. In particular, the proposed algorithm is such that the log-likelihood never decreases from one step to the next and the choice of starting values is not crucial for reaching a local maximum. We show how the same algorithm may be used for numerical investigation of the effect of model misspecifications. An application to education transmission is used as an illustration.

Keywords: Latent class models, individual covariates, Fisher-scoring algorithm, line search.

1. Introduction

The latent class models considered in this paper are those where subjects belong to one among a finite set of disjoint latent classes with probabilities which may depend on individual covariates. Observations are based on a collection of discrete responses whose distribution depends on the latent type but not on covariates. The literature on latent class models is very extensive; see Vermunt (2010) for a convenient selection of some of the most relevant contributions. A slightly more extended framework, dealing with missing data and known groups of distinct respondents, is presented by Chung (2003).

The EM (expectation-maximization) algorithm is generally used to compute the maximum likelihood estimates, though, for instance, the LatentGOLD software combines EM and Newton-Raphson. As regards the EM algorithm, its numerical stability and the fact that the likelihood always increases from one step to the next are mentioned as its main advantages. The Newton-Raphson algorithm, though faster, is known for being likely to diverge unless the starting values are close to a local maximum; in particular, in the context of latent class models, the algorithm cannot be used safely on its own. The performance of the Newton-Raphson or Fisher-scoring algorithms may be greatly improved by performing a line search to optimize the step length (see, for example, Potra and Shi, 1995; Turner, 2008) and by adopting suitable strategies that prevent the likelihood from decreasing. Bolck et al. (2004) have proposed a three-step algorithm which, in the first step, estimates a latent class model without covariates, then assigns subjects to latent classes and estimates the latent regression model with weights derived from the estimated matrix of classification errors. An extension of this approach is proposed by Vermunt (2010).

In this paper we propose a convenient matrix formulation that allows us to derive simple expressions for computing the score and the observed or the expected information matrix. The expected information matrix has the advantage of being always positive definite whenever the model is identifiable; on the other hand, it has been argued (see Efron and Hinkley, 1978) that the observed information matrix is preferable for the asymptotic distribution of the maximum likelihood estimates because it is data dependent.
We show that there is a component of the observed information which is easier to compute, is always positive definite and such that its expectation is still equal to the expected information; we suggest using this hybrid information matrix in the maximization algorithm. In addition we describe the main features of an intelligent software implementation which combines line search and strategies to prevent the likelihood from decreasing. With a minor modification, the same algorithm may be used to maximize the expected log-likelihood; this could be used as a numerical tool to assess consistency of estimates under possible misspecifications of the model, when theoretical results are not easily available.

In Section 2, after introducing the notation, we derive an expression for the score and an approximation of the information matrix which, we show, is positive definite under suitable conditions. In Section 3 we discuss the computation of these quantities and describe a suitable line-search algorithm; in addition we show how the same algorithm may be used for numerical assessment of the effect of model misspecifications. In Section 4 we present an application from the field of education transmission.

2. Notations and main results

Suppose there are c disjoint latent classes and let $\pi_i$, $i = 1, \dots, n$, be the vector of prior probabilities for the ith subject to belong to one of the c latent classes; let $X_i$ be a $c \times k$ matrix depending on individual covariates and assume that

$$\pi_i = \frac{\exp(X_i \beta)}{1_c' \exp(X_i \beta)}.$$

Let r be the number of possible configurations of the response variables; their joint distribution conditional on $U = j$, $j = 0, \dots, c-1$, may be represented by the $r \times 1$ vector of probabilities

$$q_j = \frac{\exp(G \theta_j)}{1_r' \exp(G \theta_j)},$$

where G is an $r \times g$ full-rank design matrix and $\theta_j$ a suitable vector of log-linear parameters. The dimension r of $q_j$ is equal to the product of the numbers of categories of the response variables, with entries in a given lexicographic order. This formulation does not necessarily assume conditional independence among the responses; the conditional dependence structure is determined by the G matrix, and we assume that this is such that the model is identifiable. Finally, let Q be the matrix whose jth column is $q_j$, so that $p_i = Q\pi_i$ is the marginal distribution of the responses. If we stack the vectors $\theta_j$ one below the other into the vector $\theta$, the contribution of the ith subject to the log-likelihood may be written as

$$\ell(\beta, \theta; y_i, X_i) = y_i' \log(p_i).$$

2.1. Score vector and information matrix

Under the assumption that observations from different subjects are independent, the score and the information may be written as sums across subjects. By application of the chain rule, the score relative to $\beta$ is

$$s_\beta = \sum_i X_i' \Omega_{\pi_i} Q' \mathrm{diag}(p_i)^{-1} y_i,$$

where $\Omega_{\pi_i} = \mathrm{diag}(\pi_i) - \pi_i \pi_i'$. Noting that $p_i = \sum_j \pi_{ij} q_j$, by the chain rule the score relative to $\theta_j$ is

$$s_j = \sum_i \pi_{ij} G' \Omega_j \mathrm{diag}(p_i)^{-1} y_i,$$

where $\Omega_j = \mathrm{diag}(q_j) - q_j q_j'$.

It is convenient to think of the observed information matrix as made of two components: call F the matrix which we obtain by treating the score vector as a function of $\mathrm{diag}(p_i)$ while all the rest is held constant, and call D the matrix which we obtain by differentiating the score vector while $\mathrm{diag}(p_i)$ is held constant. Let

$$A_i = \begin{pmatrix} Q\Omega_{\pi_i}X_i & \pi_{i1}\Omega_1 G & \dots & \pi_{ic}\Omega_c G \end{pmatrix}$$

and

$$\mathrm{diag}(d_i) = \mathrm{diag}(p_i)^{-2}\,\mathrm{diag}(y_i).$$

Lemma 1. The matrix F is equal to $\sum_i A_i' \mathrm{diag}(d_i) A_i$ and $E(D) = 0$.

Proof. See the Appendix.
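To make these formulas concrete, the following is a minimal NumPy sketch of the main ingredients: the class weights $\pi_i$, the conditional response probabilities $q_j$, the score $s = \sum_i A_i' \mathrm{diag}(p_i)^{-1} y_i$ and the hybrid information $F = \sum_i A_i' \mathrm{diag}(d_i) A_i$. The function and variable names (class_weights, response_probs, score_and_hybrid_info, X_list, y_list) are illustrative choices, not taken from the paper, and the sketch assumes that each $X_i$ is a $c \times k$ array, each $y_i$ an r-vector of counts (or a one-hot indicator), and theta a $c \times g$ array with one row per class.

```python
import numpy as np

def class_weights(X_i, beta):
    """Prior class probabilities: pi_i = exp(X_i beta) / (1_c' exp(X_i beta))."""
    eta = X_i @ beta                       # c linear predictors, one per latent class
    w = np.exp(eta - eta.max())            # subtract the max for numerical stability
    return w / w.sum()

def response_probs(G, theta):
    """Columns q_j = exp(G theta_j) / (1_r' exp(G theta_j)); theta has one row per class."""
    Q = np.empty((G.shape[0], theta.shape[0]))
    for j, theta_j in enumerate(theta):
        eta = G @ theta_j
        e = np.exp(eta - eta.max())
        Q[:, j] = e / e.sum()
    return Q                               # r x c matrix with q_j as its columns

def score_and_hybrid_info(X_list, y_list, beta, theta, G):
    """Score s = sum_i A_i' diag(p_i)^{-1} y_i and hybrid F = sum_i A_i' diag(d_i) A_i."""
    c, g = theta.shape
    k = beta.shape[0]
    s = np.zeros(k + c * g)
    F = np.zeros((k + c * g, k + c * g))
    Q = response_probs(G, theta)
    for X_i, y_i in zip(X_list, y_list):
        pi_i = class_weights(X_i, beta)
        p_i = Q @ pi_i                                  # marginal distribution of the responses
        Omega_pi = np.diag(pi_i) - np.outer(pi_i, pi_i)
        blocks = [Q @ Omega_pi @ X_i]                   # first block of A_i
        for j in range(c):
            Omega_j = np.diag(Q[:, j]) - np.outer(Q[:, j], Q[:, j])
            blocks.append(pi_i[j] * (Omega_j @ G))      # block of A_i for theta_j
        A_i = np.hstack(blocks)                         # r x (k + c g)
        d_i = y_i / p_i ** 2                            # diag(d_i) = diag(p_i)^{-2} diag(y_i)
        s += A_i.T @ (y_i / p_i)
        F += A_i.T @ (d_i[:, None] * A_i)
    return s, F
```

Since F is accumulated as a sum of Gram-type terms, it remains positive definite under the rank condition of Lemma 2 below, which is what makes it safe to invert in the scoring step.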
When individual observations are available, $y_i = e_{u(i)}$, a vector of 0's except for the u(i)th entry which is 1; let $t_i = A_i' e_{u(i)}$, let T be the $n \times (k + cg)$ matrix with rows $t_i'$, let $\tilde p_i$ be the u(i)th element of $p_i$ and $\tilde p$ the vector with elements $\tilde p_i$.

Lemma 2. The hybrid information matrix F is positive definite if and only if T is of full rank $k + cg$.

Proof. The result follows because, by simple algebra, $F = \sum_i t_i t_i' / \tilde p_i^2 = T' \mathrm{diag}(\tilde p)^{-2} T$.

In practice, F is positive definite whenever, within the n observations, there are at least $k + cg$ distinct patterns of covariate configurations; in that case the model is identifiable.

The result of Lemma 2 suggests that F may be used as an approximation for both the observed and the expected information matrix. Relative to the observed information, it has the advantage of being positive definite like the expected information. Relative to the expected information, as we show below, it is more easily computed and, in addition, it is partly data dependent.

3. Computational aspects

First we note that the whole score vector may be computed as

$$s = \begin{pmatrix} s_\beta \\ s_1 \\ \vdots \\ s_c \end{pmatrix} = \sum_i A_i' \mathrm{diag}(p_i)^{-1} y_i = \sum_i t_i / \tilde p_i.$$

Though the $A_i$ matrices apparently involve several matrix multiplications, as we show below they need not be computed explicitly. Each $t_i$ vector may be constructed by stacking one below the other the following components, where $\tilde q_i = Q' e_{u(i)}$ is the u(i)th column of $Q'$:

$$X_i' \Omega_{\pi_i} Q' e_{u(i)} = X_i' \mathrm{diag}(\pi_i) \tilde q_i - X_i' \pi_i \pi_i' \tilde q_i$$

and, for $j = 0, \dots, c-1$,

$$\pi_{ij}\,(g_i \tilde q_{ij} - G' q_j \tilde q_{ij}),$$

where $g_i$ is the u(i)th column of $G'$ and $\tilde q_{ij}$ the jth element of $\tilde q_i$.

3.1. Line search

Let $\psi$ be the vector obtained by stacking $\beta$ and $\theta$ one below the other; after $h-1$ steps, the basic updating equation takes the form

$$\hat\psi^{(h)} = \hat\psi^{(h-1)} + a_{h-1} \left(F^{(h-1)}\right)^{-1} s^{(h-1)},$$

where $a_{h-1}$ is the step length. When the log-likelihood is not concave and may have two or more local maxima, an algorithm with $a_h = 1$ is almost certain to diverge unless the starting value is very close to a local maximum; one possibility would be to set $a_0$ very small and let it increase with h. In a related context, Turner (2008) suggests using the Levenberg-Marquardt algorithm, which combines Newton-Raphson and steepest-ascent steps; this would be less efficient in our context, where the information matrix is positive definite. Our algorithm uses a proper line search where the log-likelihood is never allowed to decrease. Its main features are given below:

1. set $a_0$ to some value possibly smaller than 1, say 0.5;
2. at the (h-1)th step, first use the updating equation to compute the first guess, say $\hat\psi^{(h,a)}$;
3. compute the log-likelihood and the score at the first guess;
4. with these elements find the step length that maximizes a cubic approximation to the log-likelihood, and let $\hat\psi^{(h,b)}$ be the second guess;
5. compute the log-likelihood at $\hat\psi^{(h,b)}$ and select the best guess;
6. in case of no improvement, first shorten the step and, if even this does not work, perform a steepest-ascent step.

A few other adjustments are made to check whether the log-likelihood is locally concave, or whether the second derivative is negative along the given direction, in order to make conditional adjustments to the step length. In any case, a starting point is never updated unless a better one has been found.

In order to increase the probability of reaching a global maximum, after convergence a random perturbation is applied to the estimates and the algorithm is restarted a few times.
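The core of such a scheme can be prototyped in a few lines. The sketch below implements the updating equation $\hat\psi^{(h)} = \hat\psi^{(h-1)} + a_{h-1}(F^{(h-1)})^{-1}s^{(h-1)}$ with a simple step-halving rule that never accepts a decrease of the log-likelihood; the cubic approximation, the steepest-ascent fallback and the random restarts of the full algorithm are deliberately omitted. The names fisher_scoring, loglik and score_info are illustrative placeholders: loglik and score_info stand for user-supplied functions (for instance wrappers around score_and_hybrid_info above), not for code from the paper.

```python
import numpy as np

def fisher_scoring(loglik, score_info, psi0, max_iter=200, tol=1e-8, a0=0.5):
    """Fisher scoring with a simple backtracking line search.

    loglik(psi)     -> scalar log-likelihood
    score_info(psi) -> (score vector s, hybrid information matrix F)
    psi0            -> starting value for the stacked parameter (beta, theta)
    """
    psi = psi0.copy()
    ll = loglik(psi)
    a = a0                                  # initial step length, smaller than 1 as suggested
    for _ in range(max_iter):
        s, F = score_info(psi)
        direction = np.linalg.solve(F, s)   # F is positive definite when T has full rank
        step = a
        new_psi, new_ll = psi, ll
        # shorten the step until the log-likelihood improves; never accept a decrease
        for _ in range(30):
            cand = psi + step * direction
            cand_ll = loglik(cand)
            if cand_ll > ll:
                new_psi, new_ll = cand, cand_ll
                break
            step /= 2
        if new_ll - ll < tol:               # no (sufficient) improvement: stop
            break
        psi, ll = new_psi, new_ll
        a = min(1.0, 2 * step)              # cautiously let the step length grow again
    return psi, ll
```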
3.2. Numerical assessment of the effect of misspecifications

We now show how the expressions for the score and the information matrix may be used to assess the effect of model misspecifications. Suppose that M is the true model and $\tilde M$ a misspecified model; misspecifications may concern the number of latent classes, the regression model determined by $X_i$, or the dependence structure of the responses encoded in the G matrix.

Suppose, for simplicity, that the n covariate configurations are kept fixed while the number m of replicates $y_{il}$ corresponding to each configuration increases. Then the law of large numbers may be used to show that the average log-likelihood function converges to its expected value under the true model. Thus, if we want to assess the effect that a given misspecification of the model has on the estimates of the parameters of the misspecified model when no theory is easily available, we simply maximize the appropriate expected log-likelihood.

This may be easily performed by the same adjusted Fisher-scoring algorithm described above: we may use the expressions for the score vector and information matrix and simply replace the observations $y_i$ with their expected value under the true model. The only difference is that the simplified expressions described above can no longer be used. On the other hand, based on our experience, the expected log-likelihood seems to be very well behaved, so that convergence is usually reached in very few steps.
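A minimal way to carry out this numerical check, reusing the hypothetical helpers from the earlier sketches (class_weights, response_probs, fisher_scoring), is to feed the fitting routine the expected responses $p_i$ computed under the assumed true model in place of the observed $y_i$; the snippet below is only meant to indicate where the substitution happens, not to reproduce the author's code.

```python
def expected_responses(X_list, beta_true, theta_true, G_true):
    """E(y_i) under an assumed true model: p_i = Q pi_i for each covariate pattern."""
    Q_true = response_probs(G_true, theta_true)
    return [Q_true @ class_weights(X_i, beta_true) for X_i in X_list]

# To study a misspecification numerically, fit the misspecified model (its own design
# matrices and G matrix) by the same Fisher-scoring routine, but pass these expected
# responses instead of the data; the maximizer of the resulting expected log-likelihood
# shows where the estimates drift under the misspecification.
```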
4. Application

4.1. The data

We use data from the National Child Development Survey (NCDS), a UK cohort study targeting all the population born in the UK between the 3rd and the 9th of March 1958. Information on family background and on schooling and social achievements for the subjects in the sample was collected at different stages of their lives. In the application below we use, as covariates, the number of years of education and the amount of concern for the child's education (as graded by the teachers), separately for father and mother. As response variables we consider the performance in mathematics and reading test scores taken when the child was 7, 11 and 16 years old, an overall measure of non-cognitive attitudes (as reported by teachers) and the academic qualification achieved (none, O-level, A-level, university degree). Overall we use 8 responses; all, except for academic qualification, were recoded into three categories based on quantiles. A complete description of the original data is available at http://www.esds.ac.uk/longitudinal/access/ncds.

4.2. The model

The vector of prior weights $\pi_i$ was assumed to depend on the four covariates (education and interest for each parent) as in a multinomial logistic regression; this requires 4(c-1) regression parameters and c-1 logit intercepts. The response variables were assumed conditionally independent, except for a first-order autoregressive model within the Math and Read test scores taken at adjacent dates; because each of these variables has three ordered categories, to obtain a parsimonious model, in place of the 4 interactions we used a vector of scores with values 1, 0.5 and 0 according to whether the categories of the two response variables (say Math at 16 and Math at 11) were equal, differed by 1 or by 2.

For simplicity, in this application we restrict attention to the 2568 females with no missing data for the selected variables. Because the relative size of the selected sub-sample is slightly less than 30%, results should be interpreted with care.

Similar models with a number of latent classes ranging from 2 to 5 were fitted, and the Bayesian information criterion was used to determine that the model with four latent classes was the most adequate.

Table 1: Bayesian information criterion

  c      2       3       4       5
  BIC    37165   36577   36507   36592

4.3. Main results

The estimated regression coefficients and z ratios for the logits of belonging to the different latent classes, relative to the first, are displayed in Table 2. All regression coefficients are positive and most are also significant; this seems to suggest that the first latent class contains subjects with the lowest cognitive abilities and that the parents' concern, probably associated with their pressure, is important in pushing children up. The education of the father seems to have a positive significant effect most of the time, not so the education of the mother; however, the education of the father might be a proxy for family income.

Table 2: Estimated regression coefficients for the latent weights

            U=1/U=0          U=2/U=0          U=3/U=0
  Model     β̂       z        β̂       z        β̂       z
  Int.      0.962   6.40     0.518   5.16     1.317    8.34
  F.Ed.     1.239   8.87     0.135   1.34     2.001   10.02
  M.Ed.     0.028   0.16     0.145   1.53     0.388    2.44
  F.In.     0.296   2.85     0.385   3.90     0.831    5.70
  M.In.     0.334   3.52     0.607   3.72     1.239    6.53

To characterize the nature of the latent classes better, we display the conditional distributions of the academic qualification and of the non-cognitive test scores in Table 3. It emerges that academic qualifications and latent classes are stochastically ordered; thus, relative to this response, classes are ordered from worst to best. Instead, relative to the non-cognitive tests, latent classes are in reverse order, and this seems to indicate that non-cognitive scores are probably a measure of problematic behaviour.

Table 3: Conditional distributions of academic qualification and non-cognitive scores

             Academic qual.                     Non cognitive
  U      None    O-lev   A-lev   Univ        0        1        2
  0     0.9584  0.0328  0.0079  0.0009     0.0534   0.2027   0.7439
  1     0.6198  0.3027  0.0038  0.0736     0.2443   0.3814   0.3743
  2     0.1908  0.5579  0.0830  0.1683     0.4344   0.3845   0.1811
  3     0.0611  0.2243  0.2225  0.4920     0.5644   0.3131   0.1225

A similar picture emerges from Table 4: essentially, subjects in latent class 3 are the most talented in both Math and Reading.

Table 4: Estimated conditional distributions of Math and Reading scores at the age of 16

              Math                         Reading
  U       0       1       2            0       1       2
  0    0.8720  0.1280  0.0000       0.8982  0.1018  0.0000
  1    0.5703  0.4029  0.0268       0.5023  0.4519  0.0459
  2    0.1285  0.5475  0.3239       0.0736  0.5547  0.3717
  3    0.0040  0.0653  0.9307       0.0014  0.1403  0.8583

Appendix

Proof of Lemma 1

Information relative to β: to differentiate $s_\beta$ with $\Omega_{\pi_i}$ held constant, use the properties of the inverse and diagonal operators to differentiate with respect to the jth element of $\pi_i$, which gives

$$-X_i' \Omega_{\pi_i} Q' \mathrm{diag}(p_i)^{-2} \mathrm{diag}(y_i) q_j;$$

the result follows by stacking these vectors one to the side of the other and then applying the chain rule. When $\mathrm{diag}(p_i)^{-1}$ is held fixed, let $v_i = Q' \mathrm{diag}(p_i)^{-1} y_i$, then compute the derivative with respect to $\pi_i'$ and apply the chain rule to obtain

$$X_i'\,[\mathrm{diag}(v_i) - (\pi_i' v_i) I - \pi_i v_i']\,\Omega_{\pi_i} X_i.$$

To show that the expected value of this expression is 0, note that $E(v_i) = Q' 1_r = 1_c$, so that $\mathrm{diag}[E(v_i)] = I$, the identity matrix, and that $1_c' \Omega_{\pi_i} = 0'$.

Information relative to θ: the derivative of $s_h$ with $\Omega_j$ held constant may be computed as above, giving terms of the form

$$-\pi_{ij}\pi_{ih}\, G' \Omega_j \mathrm{diag}(p_i)^{-2} \mathrm{diag}(y_i) \Omega_h G.$$

Let $v_i = \mathrm{diag}(p_i)^{-1} y_i$ and $g_h$ the hth column of $G'$; to compute the derivative with $v_i$ held fixed, first differentiate with respect to the elements of $q_j$ and then use the chain rule to get

$$\pi_{ij}\, G'\,(\mathrm{diag}(v_i) - (q_j' v_i) I - q_j v_i')\,\Omega_j G.$$
Because $E(v_i) = 1_r$, this expression has zero expectation.

The mixed information: in practice it is convenient to differentiate each $s_j$ with respect to $\beta'$. With techniques similar to those used above, the component where the initial $\pi_{ij}$ is held fixed is

$$-\pi_{ij}\, G' \Omega_j \mathrm{diag}(p_i)^{-2} \mathrm{diag}(y_i)\, Q \Omega_{\pi_i} X_i$$

and the other component is simply

$$s_j\,(x_j' - \pi_i' X_i),$$

where $x_j'$ denotes the jth row of $X_i$; this has zero expectation because $E(s_j) = 0$.

References

A. Bolck, M. Croon, and J. Hagenaars. Estimating latent structure models with categorical variables: One-step versus three-step estimators. Political Analysis, 12:3–27, 2004.

H. Chung. Latent class models with covariates. PhD thesis, Pennsylvania State University, 2003.

B. Efron and D.V. Hinkley. Assessing the accuracy of maximum likelihood estimates: observed versus expected information matrix. Biometrika, 65:457–487, 1978.

F.A. Potra and Y. Shi. An efficient line search algorithm for unconstrained optimization. Journal of Optimization Theory and Applications, 85:677–704, 1995.

R. Turner. Direct maximization of the likelihood of a hidden Markov model. Computational Statistics and Data Analysis, 52:4147–4160, 2008.

J.K. Vermunt. Latent class modeling with covariates: Two improved three-step approaches. Political Analysis, 18:450–469, 2010.
