LMest: an R package for latent Markov models for categorical longitudinal data

Francesco Bartolucci, University of Perugia
Alessio Farcomeni, Sapienza - University of Rome
Silvia Pandolfi, University of Perugia
Fulvia Pennoni, University of Milano-Bicocca

January 20, 2015

Abstract

Latent Markov (LM) models represent an important class of models for the analysis of longitudinal data (Bartolucci et al., 2013), especially when the response variables are categorical. These models have great potential for application to the analysis of social, medical, and behavioral data, as well as in other disciplines. We propose the R package LMest, which is tailored to deal with these types of models. In particular, we consider a general framework for extended LM models, obtained by including individual covariates and by formulating a mixed approach to take into account additional dependence structures in the data. Such extensions lead to a very flexible class of models, which allows us to fit different types of longitudinal data. Model parameters are estimated through the expectation-maximization algorithm, based on the forward-backward recursions, which is implemented in the main functions of the package. The package also allows us to perform local and global decoding and to obtain standard errors for the parameter estimates. We illustrate its use and its most important features on the basis of examples involving applications in health and criminology.

Keywords: expectation-maximization algorithm, forward-backward recursions, hidden Markov model
1 Introduction
In this paper we illustrate the R package LMest V2.0, which replaces version V1.0, available from http://CRAN.R-project.org/package=LMest. The new version extends V1.0 in several ways and provides a collection of tools for estimating the class of Latent Markov (LM) models for longitudinal categorical data. The package can be seen as a companion to the book by Bartolucci et al. (2013), which is focused on these models; we closely refer to this book for the theoretical and some practical details. Additional insights are given in the discussion paper by Bartolucci et al. (2014b).
LM models are designed to deal with univariate and multivariate longitudinal data
based on the repeated observation of a panel of subjects across time. More specifically, LM models are tailored to study the evolution of an individual characteristic of interest that is not directly observable. This characteristic is represented by a latent process
following a Markov chain as in a Hidden Markov (HM) model (Zucchini and MacDonald,
2009). These models also allow us to account for time-varying unobserved heterogeneity
in addition to the effect of observable covariates on the response variables.
The initial LM formulation introduced by Wiggins (1973), see also Wiggins (1955), has been developed in several directions and in connection with applications in many fields, such as psychology, sociology, economics, and medicine. In particular, the basic LM model, which relies on a homogeneous first-order Markov chain, has several extensions based on parameterizations that allow us to include hypotheses and constraints of interest. The most important extension concerns the inclusion of individual covariates that may affect the distribution of the latent process (Vermunt et al., 1999; Bartolucci et al., 2007) or the conditional distribution of the response variables given this process (Bartolucci and Farcomeni, 2009). LM models may also be formulated to take into account additional dependence structures in the data. In particular, the basic LM model has been generalized to the mixed LM approach by van de Pol and Langeheine (1990) and to the LM model with random effects (Altman, 2007; Maruotti, 2011; Bartolucci et al., 2011). These formulations are related to extended models in which the parameters are allowed to vary across different latent subpopulations; for a complete review see Bartolucci et al. (2013) and Bartolucci et al. (2014b).
As already mentioned, LM models are closely related to HM models (Zucchini and MacDonald, 2009), which are specific to time-series data. LM models are formulated quite similarly, but they are tailored to longitudinal data in which many subjects are observed at only a few occasions, typically no more than ten. Most importantly, differently from LM models, extensions of HM models that include covariates or handle complex data structures are rarely applied, because such structures are typical of longitudinal studies.
In R there are some packages that can handle models that are related to LM models.
In particular, HM models can be fit through packages HMM (Himmelmann, 2010), Hid-
denMarkov (Harte, 2012), and depmixS4 (Visser and Speekenbrink, 2010). The last one
is the most closely related to LMest as it is tailored to deal with HM models based on a
generalized linear formulation, which can be specified conditionally on individual covari-
ates. depmixS4 is nonetheless designed for repeated measurements on a single unit, as
in a time-series context. Packages mhsmm (O’Connell and Hojsgaard, 2011) and hsmm (Bulla and Bulla, 2013) deal with hidden semi-Markov models, which are HM models in which a separate assumption is specified for the persistence in a latent state. Finally, package msm (Jackson, 2011) can deal with HM and related models evolving in continuous time. A commercial software package that can be used to fit certain LM models is Latent Gold
(Vermunt and Magidson, 2003). Additionally, certain Matlab toolboxes are also available
for this aim; see for example the HMM toolbox proposed by Murphy (1998).
The distinguishing features of the LMest package with respect to the ones above are
the following:
• it is primarily designed to work with longitudinal data, that is, with (even many)
i.i.d. replicates of (usually short) sequences;
• it can deal with univariate and multivariate categorical outcomes;
• standard errors for the parameter estimates are obtained by exact computation or
through reliable approximations of the observed information matrix;
• additional random effects can be used to build mixed LM models;
• individual covariates are included through efficient parameterizations of either the
measurement or the latent model;
• computationally efficient algorithms are implemented for predictions;
• the package is flexible and it minimizes computing time by relying on certain Fortran routines.
The paper is organized as follows. Section 2 briefly outlines the general formulation
of LM models and their maximum likelihood estimation, together with a brief description
of the basic LM model. Section 3 describes the use of the LMest package to estimate
LM models with individual covariates included in the measurement model, while in Sec-
tion 4 we deal with individual covariates affecting the distribution of the latent process.
Both formulations are illustrated through examples based on real data about self-reported
health status. In Section 5 we introduce the mixed LM models and we describe the use
of the R function aimed at estimating this model formulation. Here, the illustrative ex-
ample is based on data in the field of criminology. Finally, Section 6 contains the main
conclusions.
2 Latent Markov models for longitudinal data
In the following we provide a brief review of the statistical methodology on which the
LM models are based, trying to reduce the mathematical details as much as possible.
The illustration closely follows the recent paper by Bartolucci et al. (2014b). We focus
in particular on the maximum likelihood estimation of these models on the basis of the
Expectation-Maximization (EM) algorithm (Dempster et al., 1977). We also deal with
more advanced topics which are important for the correct use of the LMest package, such
as the selection of the number of latent states and local and global decoding (Viterbi,
1967; Juang and Rabiner, 1991).
2.1 The general latent Markov model formulation
Consider the multivariate case where we observe a vector $Y^{(t)}$ of $r$ categorical response variables, $Y^{(t)}_j$, with $c_j$ categories, labeled from $0$ to $c_j - 1$, $j = 1,\ldots,r$, which are available at the $t$-th time occasion, $t = 1,\ldots,T$. Also let $\tilde{Y}$ be the vector obtained by stacking $Y^{(t)}$ for $t = 1,\ldots,T$; this vector then has $rT$ elements. Obviously, in the univariate case we have a single response variable $Y^{(t)}$ for each time occasion, and $\tilde{Y}$ is made of $T$ elements. When available, we also denote by $X^{(t)}$ the vector of individual covariates at the $t$-th time occasion and by $\tilde{X}$ the vector of all the individual covariates, obtained by stacking the vectors $X^{(t)}$, $t = 1,\ldots,T$.
The general formulation of the model assumes the existence of a latent process, denoted by $U = (U^{(1)},\ldots,U^{(T)})$, which affects the distribution of the response variables. Such a process is assumed to follow a first-order Markov chain with state space $\{1,\ldots,k\}$, where $k$ is the number of latent states. Under the local independence assumption, the response vectors $Y^{(1)},\ldots,Y^{(T)}$ are assumed to be conditionally independent given the latent process $U$. Moreover, the elements $Y^{(t)}_j$ within $Y^{(t)}$, $t = 1,\ldots,T$, are conditionally independent given $U^{(t)}$. This assumption leads to a strong simplification of the model, but it can be relaxed by allowing serial dependence through the inclusion of the lagged response variables among the covariates (see Bartolucci and Farcomeni, 2009, among others).
Parameters of the measurement model are the conditional response probabilities
$$\phi^{(t)}_{jy|ux} = p(Y^{(t)}_j = y \mid U^{(t)} = u, X^{(t)} = x), \quad j = 1,\ldots,r,\; y = 0,\ldots,c_j - 1,$$
which reduce to
$$\phi^{(t)}_{y|ux} = p(Y^{(t)} = y \mid U^{(t)} = u, X^{(t)} = x), \quad y = 0,\ldots,c - 1,$$
in the univariate case, where $t = 1,\ldots,T$ and $u = 1,\ldots,k$.

Parameters of the latent process are the initial probabilities
$$\pi_{u|x} = p(U^{(1)} = u \mid X^{(1)} = x), \quad u = 1,\ldots,k,$$
and the transition probabilities
$$\pi^{(t)}_{u|\bar{u}x} = p(U^{(t)} = u \mid U^{(t-1)} = \bar{u}, X^{(t)} = x), \quad t = 2,\ldots,T,\; \bar{u},u = 1,\ldots,k.$$

In the expressions above, $x$ denotes a realization of $X^{(t)}$, $y$ denotes a realization of $Y^{(t)}_j$ (or of $Y^{(t)}$ in the univariate case), $u$ denotes a realization of $U^{(t)}$, and $\bar{u}$ denotes a realization of $U^{(t-1)}$.
On the basis of the above parameters, the conditional distribution of $U$ given $\tilde{X}$ may be expressed as
$$p(U = u \mid \tilde{X} = \tilde{x}) = \pi_{u^{(1)}|x^{(1)}} \prod_{t>1} \pi^{(t)}_{u^{(t)}|u^{(t-1)}x^{(t)}},$$
where $u = (u^{(1)},\ldots,u^{(T)})$ and $\tilde{x}$ denotes a realization of the vector of all the covariates $\tilde{X}$.
Moreover, the conditional distribution of $\tilde{Y}$ given $U$ and $\tilde{X}$ may be expressed as
$$p(\tilde{Y} = \tilde{y} \mid U = u, \tilde{X} = \tilde{x}) = \prod_t \phi^{(t)}_{y^{(t)}|u^{(t)}x^{(t)}},$$
where, in general, we define $\phi^{(t)}_{y|ux} = p(Y^{(t)} = y \mid U^{(t)} = u, X^{(t)} = x)$ and, due to the assumption of local independence, we have
$$\phi^{(t)}_{y|ux} = \prod_j \phi^{(t)}_{jy_j|ux}.$$
Note that in the above expressions $\tilde{y}$ is a realization of $\tilde{Y}$ made of the subvectors $y^{(t)} = (y^{(t)}_1,\ldots,y^{(t)}_r)$, whereas $y$ is a realization of $Y^{(t)}$ with elements $y_j$, $j = 1,\ldots,r$.
When we deal with the inclusion of individual covariates, the manifest distribution of the response variables corresponds to the conditional distribution of $\tilde{Y}$ given $\tilde{X}$, defined as
$$p(\tilde{y}|\tilde{x}) = p(\tilde{Y} = \tilde{y} \mid \tilde{X} = \tilde{x}) = \sum_u \pi_{u^{(1)}|x^{(1)}} \pi^{(2)}_{u^{(2)}|u^{(1)}x^{(2)}} \cdots \pi^{(T)}_{u^{(T)}|u^{(T-1)}x^{(T)}} \, \phi^{(1)}_{y^{(1)}|u^{(1)}x^{(1)}} \cdots \phi^{(T)}_{y^{(T)}|u^{(T)}x^{(T)}}. \tag{1}$$
In the basic version of the model, no assumption is formulated conditionally on individual covariates. Moreover, when such covariates are available, we suggest avoiding specifications in which they simultaneously affect both the distribution of the latent process and the conditional distribution of the response variables given this process. In fact, the two formulations have different interpretations, as explained in more detail in the following, and the resulting model would be difficult to interpret and estimate.

Finally, it is important to note that computing $p(\tilde{y}|\tilde{x})$, or $p(\tilde{y})$ in the basic version, involves a sum extended over all the possible $k^T$ configurations of the vector $u$; this typically requires a considerable computational effort. In order to compute such a probability efficiently we can use a forward recursion (Baum et al., 1970); see Bartolucci et al. (2013) for details.
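To fix ideas, the following is a minimal R sketch of such a forward recursion for a single subject under the basic model without covariates; the names forward_manifest, piv, Pi, and Phi are hypothetical and do not refer to LMest internals.

    # Minimal sketch, assuming: piv is a k-vector of initial probabilities,
    # Pi is a k x k transition matrix with Pi[ubar, u] = p(u | ubar), and
    # Phi is a k x TT matrix with Phi[u, t] = p(y^(t) | U^(t) = u).
    forward_manifest <- function(piv, Pi, Phi) {
      TT <- ncol(Phi)
      q <- piv * Phi[, 1]                        # q[u] = p(y^(1), U^(1) = u)
      for (t in 2:TT) {
        q <- as.vector(t(Pi) %*% q) * Phi[, t]   # propagate, then weight by p(y^(t) | u)
      }
      sum(q)                                     # manifest probability of the sequence
    }

In practice, q is rescaled at each step to prevent numerical underflow for long sequences; we omit this here for brevity.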
2.2 Maximum likelihood estimation
Given a sample of $n$ independent units that provide the response vectors $\tilde{y}_1,\ldots,\tilde{y}_n$, and given the corresponding vectors of covariates $\tilde{x}_1,\ldots,\tilde{x}_n$, the model log-likelihood has the following expression:
$$\ell(\theta) = \sum_{i=1}^n \log p(\tilde{y}_i|\tilde{x}_i).$$
Note that in this case each vector $\tilde{y}_i$ is a realization of $\tilde{Y}$ which, in the multivariate case, is made up of the subvectors $y^{(t)}_i$, $t = 1,\ldots,T$, that, in turn, have elements $y^{(t)}_{ij}$, $j = 1,\ldots,r$. Moreover, $\tilde{x}_i$ may be decomposed into the time-specific subvectors of covariates $x^{(1)}_i,\ldots,x^{(T)}_i$. Also note that $p(\tilde{y}_i|\tilde{x}_i)$ corresponds to the manifest probability of the responses provided by subject $i$, given the covariates; see equation (1). Finally, $\theta$ is the vector of all the free parameters affecting $p(\tilde{y}_i|\tilde{x}_i)$.

The above log-likelihood function can be maximized by the EM algorithm (Baum et al., 1970; Dempster et al., 1977), as described in the following section.
2.2.1 Expectation-Maximization algorithm
The EM algorithm is based on the complete data log-likelihood that, with multivariate categorical data, has the following expression:
$$\ell^*(\theta) = \sum_{j=1}^r \sum_{t=1}^T \sum_{u=1}^k \sum_x \sum_{y=0}^{c_j-1} a^{(t)}_{juxy} \log\phi^{(t)}_{jy|ux} + \sum_{u=1}^k \sum_x b^{(1)}_{ux} \log\pi_{u|x} + \sum_{t=2}^T \sum_{\bar{u}=1}^k \sum_{u=1}^k \sum_x b^{(t)}_{\bar{u}ux} \log\pi^{(t)}_{u|\bar{u}x}, \tag{2}$$
where, with reference to occasion $t$ and covariate configuration $x$, $a^{(t)}_{juxy}$ is the number of subjects that are in latent state $u$ and provide response $y$ to response variable $j$, $b^{(1)}_{ux}$ is the frequency of latent state $u$ at the first occasion, and $b^{(t)}_{\bar{u}ux}$ is the number of transitions from state $\bar{u}$ to state $u$.
The EM algorithm alternates the following two steps until convergence:

• E-step: it consists of computing the posterior (given the observed data) expected value of each indicator variable involved in (2) by suitable forward-backward recursions (Baum et al., 1970); these expected values are denoted by $\hat{a}^{(t)}_{juxy}$, $\hat{b}^{(1)}_{ux}$, and $\hat{b}^{(t)}_{\bar{u}ux}$ (a sketch of these recursions is given below);

• M-step: it consists of maximizing the complete data log-likelihood expressed as in (2), with each indicator variable substituted by the corresponding expected value. How to maximize this function depends on the specific formulation of the model and, in particular, on whether the covariates are included in the measurement model or in the latent model.
We refer the reader to Bartolucci et al. (2013) for a detailed description of these steps,
which are implemented in suitable functions of the LMest package.
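As a rough illustration of the E-step recursions, the following sketch computes, for one subject, the posterior probabilities of the latent states under the basic model; the inputs are the same hypothetical piv, Pi, and Phi used above, and the scaling needed for numerical stability is again omitted for brevity.

    # Forward-backward sketch: returns a k x TT matrix whose (u, t) entry is
    # the posterior probability p(U^(t) = u | observed responses).
    posterior_states <- function(piv, Pi, Phi) {
      k <- length(piv); TT <- ncol(Phi)
      alpha <- matrix(0, k, TT); beta <- matrix(1, k, TT)
      alpha[, 1] <- piv * Phi[, 1]
      for (t in 2:TT)
        alpha[, t] <- as.vector(t(Pi) %*% alpha[, t - 1]) * Phi[, t]
      for (t in (TT - 1):1)
        beta[, t] <- as.vector(Pi %*% (Phi[, t + 1] * beta[, t + 1]))
      post <- alpha * beta
      sweep(post, 2, colSums(post), "/")   # normalize columns to sum to one
    }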
The EM algorithm could converge to a mode of the log-likelihood which does not correspond to the global maximum. In this regard, we suggest combining a deterministic and a random starting rule for the EM algorithm, so as to take as final estimate the one corresponding to the highest log-likelihood obtained at convergence of the algorithm; the corresponding estimate is denoted by $\hat\theta$. This may prevent problems due to the multimodality of this function. In particular, the random initialization is based on suitably normalized random numbers drawn from a uniform distribution between 0 and 1 for the initial and transition probabilities of the Markov chain and for the conditional response probabilities. The convergence of the EM algorithm is checked on the basis of the relative log-likelihood difference, that is,
$$\big(\ell(\theta^{(s)}) - \ell(\theta^{(s-1)})\big) / |\ell(\theta^{(s)})| < \epsilon, \tag{3}$$
where $\theta^{(s)}$ is the parameter estimate obtained at the end of the $s$-th M-step and $\epsilon$ is a predefined tolerance level which can be set by the user.
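For concreteness, the random starting rule and the stopping criterion in (3) may be sketched as follows; all names here are illustrative and not part of the package interface.

    # Random initialization: uniform draws normalized to proper probabilities.
    set.seed(1)
    k <- 3; cc <- 2                             # latent states, response categories
    piv <- runif(k); piv <- piv / sum(piv)      # initial probabilities
    Pi  <- matrix(runif(k^2), k, k)
    Pi  <- Pi / rowSums(Pi)                     # transition rows sum to 1
    Psi <- matrix(runif(cc * k), cc, k)
    Psi <- sweep(Psi, 2, colSums(Psi), "/")     # conditional response probabilities
    # Relative log-likelihood difference used as stopping rule, as in (3).
    converged <- function(lk_new, lk_old, eps = 1e-8)
      (lk_new - lk_old) / abs(lk_new) < eps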
2.2.2 Standard errors
After the model is estimated, standard errors may be obtained on the basis of the observed information matrix, denoted by $J(\hat\theta)$, by standard rules. In particular, they are computed as the square roots of the diagonal elements of the inverse of this matrix. The LMest package computes this matrix, and then provides the standard errors, by using either the numerical method proposed by Bartolucci and Farcomeni (2009) or the exact computation method proposed by Bartolucci and Farcomeni (2014b), according to the model of interest.
According to the approximated method, $J(\hat\theta)$ is obtained as minus the numerical derivative of the score vector $s(\hat\theta)$ at convergence. The score vector, in turn, is obtained as the first derivative of the expected value of the complete data log-likelihood, which is based on the expected frequencies $\hat{a}^{(t)}_{juxy}$, $\hat{b}^{(1)}_{ux}$, and $\hat{b}^{(t)}_{\bar{u}ux}$ computed with the final parameter estimates $\hat\theta$; for details see Bartolucci et al. (2013) and Pennoni (2014).
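Numerically, this is equivalent to taking minus the Hessian of the log-likelihood at the estimate, which can be sketched as follows; loglik and theta_hat are hypothetical placeholders (LMest performs this computation internally, and numDeriv is used here only for illustration).

    library(numDeriv)                      # generic numerical derivatives
    J  <- -hessian(loglik, theta_hat)      # observed information at the estimate
    se <- sqrt(diag(solve(J)))             # square roots of diag of J^{-1}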
The exact computation of $J(\hat\theta)$ is instead based on Oakes' identity (Oakes, 1999).
This method uses the complete data information matrix, produced by the EM algorithm,
and a correction matrix computed on the basis of the first derivative of the posterior
probabilities obtained from the backward-forward recursions.
For the basic LM model and for the model with individual covariates affecting the
distribution of the latent process, the LMest package also provides functions aimed at
performing a parametric bootstrap (Davison and Hinkley, 1997) in order to obtain stan-
dard errors.
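A generic parametric bootstrap for standard errors can be sketched as below; simulate and refit are hypothetical user-supplied functions, whereas the package provides ready-made bootstrap functions for the supported models.

    # Hedged sketch: simulate(theta) draws a data set from the fitted model,
    # refit(data) returns the re-estimated parameter vector.
    boot_se <- function(theta_hat, simulate, refit, B = 200) {
      est <- replicate(B, refit(simulate(theta_hat)))  # B bootstrap re-estimates
      apply(est, 1, sd)                                # componentwise standard errors
    }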
2.3 Criteria for selecting the number of latent states
In certain applications, the number of latent states, $k$, can be defined a priori, as in the univariate case in which the number of latent states may be fixed equal to the number of response categories. Otherwise, the following criteria are typically used to select the number of latent states: the Akaike information criterion (AIC) of Akaike (1973) and the Bayesian information criterion (BIC) of Schwarz (1978). They are based on the indices
$$\mathrm{AIC} = -2\hat\ell + 2\,np, \qquad \mathrm{BIC} = -2\hat\ell + \log(n)\,np,$$
where $\hat\ell$ denotes the maximum of the log-likelihood of the model of interest and $np$ denotes the number of its free parameters.
According to the above criteria, the optimal number of latent states is the one cor-
responding to the minimum value of the above indices; this model represents the best
compromise between goodness-of-fit and complexity. However, other criteria may be
used, such as those taking into account the quality of the classification; for reviews see
Bacci et al. (2014) and Bartolucci et al. (2014a).
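In practice, the selection of k can be automated along the following lines; fit_model is a hypothetical wrapper returning the maximized log-likelihood lk and the number of free parameters np of the fitted model.

    # Compute AIC and BIC over a range of k and pick the minimizing value.
    select_k <- function(fit_model, k_range, n) {
      crit <- sapply(k_range, function(k) {
        fit <- fit_model(k)
        c(AIC = -2 * fit$lk + 2 * fit$np,
          BIC = -2 * fit$lk + log(n) * fit$np)
      })
      colnames(crit) <- paste0("k=", k_range)
      crit   # the optimal k minimizes the chosen criterion
    }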
2.4 Local and global decoding
The LMest package allows us to perform decoding, that is, prediction of the sequence of
the latent states for a certain sample unit on the basis of the data observed for this unit.
In particular, the EM algorithm directly provides the estimated posterior probabilities of $U^{(t)}$, that is, $p(U^{(t)} = u \mid \tilde{X} = \tilde{x}, \tilde{Y} = \tilde{y})$, for $t = 1,\ldots,T$, $u = 1,\ldots,k$, and for every covariate and response configuration $(\tilde{x},\tilde{y})$ observed at least once. These probabilities can be directly maximized to obtain a prediction of the latent state of each subject at each time occasion $t$; this is the so-called local decoding. It should be noted that local decoding minimizes the classification error at each time occasion, but may yield sub-optimal or even inconsistent predicted sequences of $U^{(1)},\ldots,U^{(T)}$.
In order to track the latent state of a subject over time, the a posteriori most likely sequence of states must be obtained through the so-called global decoding, which is based on an adaptation of the Viterbi (1967) algorithm; see also Juang and Rabiner (1991).
The algorithm proceeds through a forward-backward recursion of a complexity similar to
the recursions adopted for maximum likelihood estimation within the EM algorithm, so
that global decoding may be applied even for long sequences of data.
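The following is a minimal log-scale Viterbi sketch for global decoding of one subject under the basic model, with the same hypothetical inputs as above (piv, Pi, Phi) and TT >= 2 assumed.

    # Global decoding: returns the a posteriori most likely state sequence.
    viterbi_decode <- function(piv, Pi, Phi) {
      k <- length(piv); TT <- ncol(Phi)
      V <- matrix(-Inf, k, TT); back <- matrix(0L, k, TT)
      V[, 1] <- log(piv) + log(Phi[, 1])
      for (t in 2:TT) {
        for (u in 1:k) {
          scores <- V[, t - 1] + log(Pi[, u])   # best predecessor for state u
          back[u, t] <- which.max(scores)
          V[u, t] <- max(scores) + log(Phi[u, t])
        }
      }
      path <- integer(TT)
      path[TT] <- which.max(V[, TT])
      for (t in (TT - 1):1) path[t] <- back[path[t + 1], t + 1]  # backtrack
      path
    }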
2.5 Basic latent Markov model
Following Bartolucci et al. (2013), and as already mentioned above, the basic LM model rules out individual covariates and assumes that the conditional response probabilities are time homogeneous. In symbols, we have $\phi^{(t)}_{y|ux} = \phi_{y|u}$ in the univariate case and $\phi^{(t)}_{jy|ux} = \phi_{jy|u}$ in the multivariate case; we also have $\pi_{u|x} = \pi_u$ and $\pi^{(t)}_{u|\bar{u}x} = \pi_{u|\bar{u}}$.
In order to fit this model, we can use the function est_lm_basic, which is one of the main functions in the LMest package. Here, we skip a detailed illustration of this function, as it is already provided in the Appendix of the book by Bartolucci et al. (2013), and we prefer to focus on the functions used to estimate more sophisticated LM versions that have only recently become available in the package. These versions include individual covariates and further latent