BAYESIAN LATENT CLASS MODELS FOR THE MULTIPLE IMPUTATION OF CROSS-SECTIONAL, MULTILEVEL AND LONGITUDINAL CATEGORICAL DATA

Davide Vidotto
Tilburg University

© 2017 Davide Vidotto. All Rights Reserved.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without written permission of the author.

This research is funded by The Netherlands Organization for Scientific Research (NWO [grant project number 406-13-048]). Printing was financially supported by Tilburg University.

ISBN: 978-94-6295-808-1
Printed by: Proefschriftmaken || www.proefschriftmaken.nl
Cover design: Fabooshdesign&art

DISSERTATION

to obtain the degree of doctor at Tilburg University, on the authority of the rector magnificus, prof. dr. E.H.L. Aarts, to be defended in public before a committee appointed by the doctorate board, in the auditorium of the University on Friday 2 March 2018 at 14.00 by Davide Vidotto, born on 21 April 1988 in Treviso, Italy.

Promotor: Prof. dr. J.K. Vermunt
Copromotor: Dr. K. Van Deun
Other members of the doctoral committee:
Prof. dr. S. van Buuren
Prof. dr. F. Bassi
Dr. A.O.J. Cramer
Prof. dr. L.A. van der Ark

Alla mia famiglia
To my family
Voor mijn familie

TABLE OF CONTENTS

1 Introduction
2 Multiple Imputation of Missing Categorical Data using Latent Class Models: State of the Art
  2.1 Introduction
  2.2 Latent Class models and Multiple Imputation
  2.3 Four Different Implementations of Latent Class Multiple Imputation
  2.4 Real-data Example
  2.5 Discussion
  Appendix a Bayesian Tools
  Appendix b Bayesian Multiple Imputation via Mixture Modeling
  Appendix c Generating the Extra Missingness for the Real-data Example
3 Bayesian Latent Class Models for the Multiple Imputation of Categorical Data
  3.1 Introduction
  3.2 Bayesian Latent Class Imputation
  3.3 Simulation Studies
  3.4 Real-data Study
  3.5 Discussion
4 Bayesian Multilevel Latent Class Models for the Multiple Imputation of Nested Categorical Data
  4.1 Introduction
  4.2 The Bayesian Multilevel Latent Class Model for Multiple Imputation
  4.3 Study 1: Simulation Study
  4.4 Study 2: Real-data case
  4.5 Discussion
5 Multiple Imputation of longitudinal categorical data through Bayesian mixture latent Markov models
  5.1 Introduction
  5.2 The Bayesian mixture Latent Markov Model for Multiple Imputation
  5.3 Simulation Studies
  5.4 Real-data Study
  5.5 Discussion
  Appendix a Setting the prior distribution
  Appendix b BMLM model estimation
6 Discussion
Bibliography
Summary
Acknowledgments

1 INTRODUCTION

This dissertation deals with the multiple imputation (MI; Rubin, 1987) of categorical data coming from different types of data collection and data analysis designs. In particular, the use of latent class (LC) models (Lazarsfeld, 1950) for the MI of data coming from cross-sectional study designs, as first proposed by Vermunt, Van Ginkel, Van der Ark and Sijtsma (2008), will serve as a starting point to obtain imputation models that can deal with more complex designs, such as multilevel designs (i.e., when multiple individuals are nested within a group) and longitudinal designs (i.e., when multiple observations for each individual are collected across time).

Latent Class models for Multiple Imputation

LC models are known among analysts and methodologists for their substantive use, in which the estimates provided by the model are used to define latent types (or profiles, clusters) of units. These profiles differ from each other in some characteristics, identified by the distribution of the scores on the indicator variables (usually categorical variables). Within each LC, the local independence assumption allows the joint distribution of these features to be described by a product of independent categorical (e.g., Multinomial) distributions. Local independence makes the model easily interpretable, and allows a large number of indicators to be taken into account for a specific theoretical construct.
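As a hedged illustration of the local independence assumption just described, the density implied by an LC model for a single unit can be sketched as follows. Part of the notation is an assumption made for this sketch and does not come from the text: K denotes the number of latent classes, \pi_k the class proportion Pr(X = k), and y_{ij} the score of unit i on indicator j. X and J are used as in the text.

```latex
\Pr(\mathbf{y}_i) \;=\; \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} \Pr(y_{ij} \mid X = k),
\qquad \pi_k = \Pr(X = k), \quad \sum_{k=1}^{K} \pi_k = 1 .
```

The product over j is where local independence enters: within a class, each indicator only needs its own categorical distribution, so adding an indicator adds one set of conditional probabilities per class rather than a full cross-classification of all indicators.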
A graphical representation of the LC model is given in Figure 1.1, in which X represents the LC variable and the Y's represent the J indicators.

[Figure 1.1: LC model, graphical representation. X: latent class variable; Y's: indicators y1, y2, ..., yJ (J in total).]

However, LC models, which are members of the family of mixture models, can be used in contexts other than the identification of latent groups. Since mixture models can correctly pick up unobserved heterogeneity and relevant relationships in the data if the number of specified LCs is large enough (McLachlan & Peel, 2000), LC models can be used as a density estimation tool. In density estimation, the goal is to estimate and describe the joint distribution of the variables present in a dataset, retrieving all possible associations that tie the variables to each other. Under this framework, interpretation of the model parameters is of little interest, and the main focus is on the predictions the model provides by means of these parameters. Furthermore, models used for density estimation are likely to require the estimation of a very large number of parameters, which would make them very hard to interpret. Thus, the model parameters are merely a device used to obtain predictions and/or an overall description of the joint distribution of the data.

Vermunt et al. (2008) exploited this feature of LC models, and proposed them for application in MI. In MI, the missing data of a dataset are replaced (or imputed, predicted) M > 1 times by different sets of values, the distribution of which is estimated with the imputation model. In particular, the task of the imputation model is to provide values sampled from Pr(D_mis | D_obs), that is, the distribution of the missing data given the observed data. When the missing data mechanism is ignorable¹, MI can retrieve the correct distribution of the data (for some analysis model of interest), leading to proper substantive inferences. Furthermore, by obtaining M different imputations it becomes possible to quantify the uncertainty about the imputed missing values at the analysis stage. More specifically, substantive analyses are performed on each of the M imputed datasets, and for correct statistical inferences the results are pooled using Rubin's (1987) rules.

¹ That is, the missing data generating mechanism is independent of the unobserved data and its parameter is distinct from those of the assumed data generating model (Rubin, 1976).

LC models require a very simple model specification (only the number of LCs), which makes them flexible and automatic, since relevant associations in the data need not be specified a priori. Concerning the model selection issue, in MI selecting a model that overfits the data (i.e., a model that captures sample-specific features) is