Bayesian Gaussian Copula Factor Models for Mixed Data Jared S. Murray David B. Dunson Lawrence Carin Joseph E. Lucas 1 1 0 Abstract 2 v Gaussian factor models have proven widely useful for parsimoniously char- o N acterizing dependence in multivariate data. There is a rich literature on their 1 extension to mixed categorical and continuous variables, using latent Gaussian ] E variables or through generalized latent trait models acommodating measure- M . ments in the exponential family. However, when generalizing to non-Gaussian t a t s measured variables the latent variables typically influence both the dependence [ structure and the form of the marginal distributions, complicating interpretation 1 v 7 and introducing artifacts. To address this problem we propose a novel class of 1 3 Bayesian Gaussian copula factor models which decouple the latent factors from 0 1. the marginal distributions. A semiparametric specification for the marginals 1 1 based on the extended rank likelihood yields straightforward implementation 1 v: and substantial computational gains, critical for scaling to high-dimensional ap- i X plications. We provide new theoretical and empirical justifications for using this r a likelihood in Bayesian inference. We propose new default priors for the factor JaredS.MurrayisPhDStudent,Dept. ofStatisticalScience,DukeUniversity,Durham,NC27708 (E-mail: [email protected]). David B. Dunson is Professor, Dept. of Statistical Science, Duke University Durham, NC 27708 (E-mail: [email protected]). Lawrence Carin is William H. Younger Professor, Dept. of Electrical & Computer Engineering, Pratt School of Engineering, Duke University,Durham, NC 27708 (E-mail: [email protected]) Joseph E. Lucas is Assistant Research Professor, Institute for Genome Sciences and Policy, Duke University, Durham, NC 27710 (E-mail: [email protected]) We would like to acknowledge the support of the Measurement to Understand Re-Classifcation of Disease of Cabarrus and Kannapolis (MURDOCK) Study and the NIH CTSA (Clinical and Translational Science Award) 1UL1RR024128-01, without whom this research would not have been possible. The first author would also like to thank Richard Hahn, Jerry Reiter and Scott Schmidler for helpful discussion. loadings and develop efficient parameter-expanded Gibbs sampling for posterior computation. The methods are evaluated through simulations and applied to a dataset in political science. We also provide bfa, an easy-to-use R package leveraging compiled code to fit the models in this paper.1 Keywords: Factor analysis; Latent variables; Semiparametric; Extended rank likelihood; Parameter expansion; High dimensional 1 Introduction Factor analysis and its generalizations are powerful tools for analyzing and exploring multivariate data, routinely used in applications as diverse as social science, genomics and finance. The typical Gaussian factor model is given by y = Λη +(cid:15) (1.1) i i i where y is a p×1 vector of observed variables, Λ is a p×k matrix of factor loadings i (k < p), η ∼ N(0,I) is a k × 1 vector of latent variables or factor scores, and i (cid:15) ∼ N(0,Σ) are idiosyncratic noise with Σ = diag(σ2,...,σ2). Marginalizing out i 1 p the latent variables, y ∼ N(0,ΛΛ(cid:48) + Σ), so that the covariance in y is explained i i by the (lower-dimensional) latent factors. The model in (1.1) may be generalized to incorporate covariates at the level of the observed or latent variables, or to allow dependence between the latent factors. For exposition we focus on this simple case. This model has been extended to data with mixed measurement scales, often by linking observed categorical variables to underlying Gaussian variables which follow a latent factor model. Muth´en (1983) used this idea to formulate a linear structural components (LISCOMP) modeling framework. An alternative is to include shared 1Available from http://stat.duke.edu/~jsm38/software/bfa 1 latent factors in separate generalized linear models for each observed variable (Dunson, 2000, 2003; Moustaki and Knott, 2000; Sammel et al., 1997). Unlike in the Gaussian factor model the latent variables typically impact both dependence and the form of the marginal distributions. For example, suppose y = (y ,y )(cid:48) are bivariate counts i i1 i2 assigned Poisson log-linear models: logE(y |η ) = µ +λη . Then λ governs both the ij i j i dependence between y , y and the overdispersion in each marginal distribution. This i1 i2 confounding can lead to substantial artifacts and misleading inferences. Additionally, computation in such models is difficult and requiring marginal distributions in the exponential family can be restrictive. There is a growing literature on semiparametric latent factor models to address the latter problem. A number of authors have proposed mixtures of factor models (Chen et al., 2010; Ghahramani and Beal, 2000). Song et al. (2010) instead allow flexible error distributions in Eq. (1.1). Yang and Dunson (2010) proposed a broad class of semiparametric structural equation models which allow an unknown distribution for η . When building such flexible mixture models there is a sacrifice to be made in i terms of interpretation, parsimony and computation, and subtle confounding effects remain. It would be appealing to retain the simplicity, interpretability and computa- tional scalability of Gaussian factor models while allowing the marginal distributions to be unknown and free of the dependence structure. To accomplish these ambitious goals we propose a semiparametric Bayesian Gaus- sian copula model utilizing the extended rank likelihood of Hoff (2007) for the marginal distributions. This approximation avoids a full model specification and is in some sense not fully Bayesian, but in practice we expect that this rank-based likelihood discards only a mild amount of information while providing robust inference. An additional contribution of this paper is to provide new theoretical and empirical justification for this approach. 2 We proceed as follows: In Section 2, we propose the Gaussian copula factor model for mixed scale data and discuss its relationship to existing models. In Section 3 we develop a Bayesian approach to inference, specifying prior distributions and outlining a straightforward and efficient Gibbs sampler for posterior computation. Section 4 contains a simulation study, and Section 5 illustrates the utility of our method in a political science application. Section 6 concludes with a discussion. 2 The Gaussian copula factor model A p-dimensional copula C is a distribution function on [0,1]p where each univariate marginal distribution is uniform on [0,1]. Any joint distribution F can be completely specified by its marginal distributions and a copula; that is, there exists a copula C such that F(y ,...,y ) = C(F (y ),...,F (y )) (2.1) 1 p 1 1 p p where F are the univariate marginal distributions of F (Sklar, 1959). If all F are j j continuous then C is unique, otherwise it is uniquely determined on Ran(F )×···× 1 Ran(F ) where Ran(F ) is the range of F . The copula of a multivariate distribution p j j encodesitsdependencestructure, andisinvarianttostrictlyincreasingtransformations of its univariate margins. Here we consider the Gaussian copula: C(u ,...u ) = Φ (Φ−1(u ),...,Φ−1(u ) | C), (u ,...,u ) ∈ [0,1]p (2.2) 1 p p 1 p 1 p where Φ (·|C) is the p-dimensional Gaussian cdf with correlation matrix C and Φ is p the univariate standard normal cdf. Combining (2.1) and (2.2) we have F(y ,...,y ) = Φ (Φ−1(F (y )),...,Φ−1(F (y )) | C) (2.3) 1 p p 1 1 p p 3 From (2.3) a number of properties are clear: On marginalizing out a subset of y the joint distribution of the remaining variables has a Gaussian copula with correlation matrix given by the appropriate submatrix of C, and y ⊥⊥ y if and only if c = j j(cid:48) jj(cid:48) 0. When F , F are continuous, c = Corr(Φ−1(F (y ),Φ−1(F (y )) which is an j j(cid:48) jj(cid:48) j j j(cid:48) j(cid:48) upper bound on Corr(y ,y ) (attained when the margins are Gaussian) (Klaassen and j j(cid:48) Wellner, 1997). The rank correlation coefficients Kendall’s tau and Spearman’s rho are monotonic functions of c (Hult and Lindskog, 2002). For variables taking finitely jj(cid:48) many values c gives the polychoric correlation coefficient (Muth´en, 1983). jj(cid:48) If the margins are all continuous then zeros in R = C−1 imply conditional indepen- dence, as in the multivariate Gaussian distribution. However this is generally not the case when some variables are discrete. Indeed, in the simple case where p = 3, Y is 3 discrete and c c (cid:54)= 0, if r = 0 then Y and Y are in fact dependent conditional on 13 23 12 1 2 Y (a similar result holds when conditioning on several continuous variables and a sin- 3 gle discrete variable as well - see Appendix A). Results like these suggest that sparsity priors for R in Gaussian copula models (e.g. Dobra and Lenkoski (2011); Pitt et al. (2006)) are perhaps not always well-motivated when discrete variables are present, and should be interpreted only with great care. From (2.3) we see that a Gaussian copula model can be expressed in terms of latent Gaussian variables z which define y through a transformation: Let F−1(t) = inf{t : j F (y) ≥ t,y ∈ R} be the usual pseudo-inverse of F and suppose Ω is a covariance j j √ matrix with C as its correlation matrix. If z ∼ N(0,Ω) and y = F−1(Φ(z / ω )) for j j j jj 1 ≤ j ≤ p then F(y) has a Gaussian copula with correlation matrix C and univariate marginals F . We utilize this representation to generalize the Gaussian factor model j to Gaussian copula factor models by assigning z a factor model: η ∼ N(0,I), z |η ∼ N(Λη , I) i i i i 4 z yij = Fj−1Φ(cid:113) ij (2.4) 1+(cid:80)K λ2 j=1 jh Inference takes place on the scaled loadings λ ˜ jh λ = (2.5) jh (cid:113) 1+(cid:80)k λ2 h=1 jh so that c = (cid:80)k λ˜ λ˜ . Rescaling is important as the factor loadings are not jj(cid:48) h=1 jh j(cid:48)h otherwise comparable across the different variables, even though (2.4) is technically ˜ identified. Hence we will be concerned in the sequel with priors on λ induced by jh various priors on λ . Another quantity of interest is the uniqueness of variable j, jh given by k (cid:88) 1 u = 1− λ˜2 = (2.6) j jh 1+(cid:80)k λ2 h=1 h=1 jh In the Gaussian factor model the analogous quantity is σ2/(σ2 + (cid:80)k λ2 ) and rep- j j h=1 jh resents the proportion of variability in the manifest variable which is unexplained by commonfactors. IntheGaussiancopulafactormodelthispreciseinterpretationdoesn’t hold, but u still represents a measure of dependence on common factors. In a Bayesian j specification of the Gaussian factor model a prior is induced on u through the priors j on Σ and Λ. In our model (and other factor models where the scale is constrained) the prior on u is induced solely by the prior on Λ. In Section 3 we show that this has j meaningful implications for prior specification. 2.1 Relationship to existing factor models TheGaussiancopulafactormodelsubsumesmanyexistingmodels, includingtheGaus- sian factor model (where F are all univariate Gaussian) and probit factor models. j Probit factor models for binary or ordered categorical data parameterize each mar- 5 gin by a collection of “cutpoints” γ ,...γ (with γ − ∞ and γ = ∞) so that j0 jcj j0 jcj (cid:16) (cid:17) F (c) = Φ γ (1+(cid:80)K λ2 )−1/2 . Then F has the pseudoinverse j jc j=1 jh j (cid:88)cj γ γ F−1(uij) = c1Φ(cid:113) jc−1 < uij ≤ Φ(cid:113) jc 1+(cid:80)K λ2 1+(cid:80)K λ2 c=1 j=1 jh j=1 jh Plugging this into (2.4) and simplifying gives y = (cid:80)Cj c1(γ < z ≤ γ ) where ij c=1 jc−1 ij jc z ∼ N(0,ΛΛ(cid:48) + I), the data augmented representation of an ordinal probit factor i model. Naturally the connection extends to mixed Gaussian/probit margins as well. Our model generalizes these to any collection of ordered variables while preserving the Gaussian factor model’s fundamental dependence structure, encoded in its (Gaus- sian)copula. Thisisadistinctionbetweenourmodelandsemiparametricfactormodels which assume non-Gaussian latent variables η or errors (cid:15) . These instead retain the i i linear model formulation (1.1), so that marginally Cov(y ) = ΛCov(η )Λ(cid:48)+Σ (with Σ i i diagonal and σ2 = Var((cid:15) )). But F(y ) no longer has a Gaussian copula, and since the j ij i joint distribution is no longer Gaussian or even elliptically symmetric the covariance matrix is unlikely to be an adequate measure of dependence. Further, the dependence and marginal distributions are typically confounded since the implied correlation ma- trix will most likeliy depend on the parameters of the marginal distributions. However, in the Gaussian copula factor model Λ˜ governs the dependence separately from the marginal distributions, representing a factor-analytic decomposition for the scale-free copula parameter C rather than Cov(y ). The Gaussian copula model is i also invariant to strictly monotone transformations of observed variables and therefore consistentwiththecommonassumptionthatthereexistmonotonicfunctionsh ,...,h 1 p such that (h (y ),...h (y ))(cid:48) follows a Gaussian factor model. Existing semiparametric 1 1 p p approaches are not. 6 2.2 Marginal Distributions There are a number of ways to treat the marginal distributions in a copula model. An obviouschoiceistospecifyaparametricfamilyforeachmarginandinfertheparameters simultaneously with C (see e.g. Pitt et al. (2006) for a Bayesian approach). This is computationally expensive for even moderate p, and there is often no obvious choice of parametric family for every margin. Since our goal is not to learn the whole joint distributionbutrathertocharacterizeitsdependencestructurewewouldprefertotreat the marginal distributions as nuisance parameters. When the data are all continuous ˆ a popular semiparametric method is a two-stage approach in which an estimator F is j usedtocompute“pseudodata”zˆ = Φ−1(Fˆ (y )),whicharetreatedasfixedtoinferthe ij j ij copula parameters. A natural candidate is Fˆ (t) = n (cid:80)n 11(y ≤ t), the (scaled) j n+1 i=1 n ij empirical marginal cdf. Klaassen and Wellner (1997) considered such estimators in the Gaussian copula and Genest et al. (1995) developed this method in the general case. Inference for the copula based on marginal ranks is appealing because of their invariance properties when the data are continuous. However, this method cannot handle discrete margins (Hoff, 2007). To accommodate mixed discrete and continuous data Hoff (2007) proposed an approximation to the full likelihood called the extended rank likelihood. Since the transformation y = F−1(Φ(z )) is nondecreasing, when we ij j ij observe y = (y ,...,y ) we also observe a partial ordering on z = (z ,...z ). To j 1j nj j 1j nj be precise we have that z ∈ D(y ) ≡ {z ∈ Rn : y < y ⇒ z < z } (2.7) j j j ij i(cid:48)j ij i(cid:48)j The set D(y ) is just the set of possible z = (z ,...,z ) which are consistent with j j 1j nj the ordering of the observed data on the jth variable. Let D(Y ) = {Z ∈ Rp×n : z ∈ j 7 D(y ) ∀ 1 ≤ j ≤ p}. Then we have j P(Y |C,F ,...F ) = P(Y ,Z ∈ D(Y )|C,F ,...F ) 1 p 1 p = P(Z ∈ D(Y )|C)×P(Y |Z ∈ D(Y ),C,F ,...F ) (2.8) 1 p where (2.8) holds because given C the event Z ∈ D(Y ) does not depend on the marginal distributions. Hoff (2007) proposes dropping the second term in (2.8) and using P(Z ∈ D(Y )|C) as the likelihood. Intuitively we would expect the first term to include most of the information about C, and simulations in Section 4 provide further supporting evidence. Hoff (2007) shows that when the margins are all continuous the marginal ranks satisfy certain relaxed notions of sufficiency for C (although these fail when some margins are discrete). Very recently Hoff et al. (2011) further studied the informationlossundertheranklikelihoodincontinuousdata, withencouragingresults. But most interesting applications involve mixed data, where existing theoretical results do not hold. We give a new of proof of strong posterior consistency for C under the extended rank likelihood with nearly any mixture of discrete and continuous margins (barring pathological cases which preclude identification of C). Note that posterior consistency will generally fail under Gaussian/probit models when any margin is mis- specified. A similar result for continuous data and a univariate rank likelihood was obtained by Gu and Ghosal (2009). We replace Y with Y (n) for notational clarity. Theorem 1. Let Π be a prior distribution on the space of all positive semidefinite correlation matrices C with corresponding density π(C) with respect to a measure ν. Suppose π(C) > 0 a.e. [ν] and that F ,...,F , are the true marginal cdfs. Then for 1 p C a.e. [ν] and any neighborhood A of C we have that 0 0 lim Π(C ∈ A | Z(n) ∈ D(Y (n))) = 1 a.s. [G∞ ] (2.9) n→∞ C0,F1,...,Fp 8 where G∞ is the distribution of {y }∞ under C ,F ,...F . C0,F1,...,Fp i i=1 0 1 p The proof is in Appendix B. We assumed a prior π(C) having full support on C. Under factor-analytic priors fixing k < p restricts the support of π, and posterior consistencywillonlyholdifC hasafactoranalyticdecompositionink orfewerfactors. 0 But by setting k large (or inferring it) it is straightforward to define factor-analytic priors which have full-support on C (further discussion in Section 6). In practice, many correlation matrices which do not exactly have a k-factor decomposition are still well- approximated by a k-factor model. Finally, the argument also applies to posterior consistency for Λ˜ if k is chosen correctly, given compatible identifying restrictions. 3 Prior Specification and Posterior Inference 3.1 Prior Specification Choosing a good prior on Λ is paramount, especially since we are interested in cases where n is not large relative to p and in simultaneously accommodating uncertainty in marginal distributions. First we recall that the factor model is invariant under rotation or scaling of the loadings and scores. We assume that either sufficient identifying conditions are imposed (such as restricting Λ to have zeros above the diagonal and positive entries along the diagonal (Geweke and Zhou, 1996)) or that inference is on C, which does not suffer from this indeterminacy. A natural choice of prior for the unrestricted factor loadings is λ ∼ N(0,1/b) independently. We can derive the jh implied distribution on u analytically: j (b/2)−k/2 (cid:18) 1 (cid:19)(cid:18)1−u (cid:19)k/2−1 (cid:20) b (cid:18)1−u (cid:19)(cid:21) j j π(u ) = exp − (3.1) j Γ(k/2) u2 u 2 u j j j Figure 1 shows this density for b = 0.5,1,4 and K = 1,3,5. Even the “less- 9