Logit stick-breaking priors for Bayesian density regression

Tommaso Rigon
Department of Decision Sciences, Bocconi University

Daniele Durante
Department of Statistical Sciences, University of Padova

March 27, 2017

arXiv:1701.02969v2 [stat.CO] 24 Mar 2017

Abstract

There is an increasing focus in several fields on learning how the distribution of a response variable changes with a set of predictors. Bayesian nonparametric dependent mixture models provide a useful approach to flexibly address this goal; however, many representations are characterized by difficult interpretation and intractable computational methods. Motivated by these issues, we describe a flexible class of predictor–dependent infinite Gaussian mixture models, which relies on a formal characterization of the stick-breaking construction via a continuation–ratio logistic regression, within an exponential family representation. We study the theoretical properties, and leverage this result to derive analytically three computational methods of routine use in Bayesian inference, covering simple Markov Chain Monte Carlo via Gibbs sampling, the Expectation Maximization algorithm, and a variational Bayes procedure for scalable inference. The algorithms associated with these methods are made available online at https://github.com/tommasorigon/DLSBP. We additionally compare the three computational strategies in an application to the Old Faithful Geyser dataset.

Keywords: Bayesian density regression; Continuation–ratio logistic regression; EM algorithm; Gibbs sampling; Variational Bayes.

1 Introduction

There is a growing interest in density regression methods which allow the entire distribution of a univariate response variable y ∈ Y to be unknown, and changing with a vector of predictors x ∈ X.
The increased flexibility provided by these procedures allows relevant improvements in inference and prediction compared to classical regression frameworks, as seen in applications to epidemiology (e.g. Dunson & Park 2008), meteorology (e.g. Gutiérrez et al. 2016), neuroscience (e.g. Wade et al. 2014), image analysis (e.g. Ren et al. 2011) and finance (e.g. Griffin & Steel 2011) — among others.

There is a wide set of alternative methodologies to provide flexible inference for conditional distributions within a Bayesian nonparametric framework. Most of these methods represent generalizations of the marginal density estimation problem for f(y), which is commonly addressed via Bayesian nonparametric mixture models of the form f(y) = ∫_Θ K(y; θ) dP(θ), where K(y; θ) is a known parametric kernel indexed by θ ∈ Θ, and P(θ) denotes an unknown mixing measure which is assigned a flexible prior Π. Popular choices for Π are the Dirichlet process (Ferguson 1973, 1974, Sethuraman 1994), the two-parameter Poisson–Dirichlet process (Pitman & Yor 1997), and other almost surely discrete random measures having a stick-breaking representation (Ishwaran & James 2001). This choice leads to an infinite mixture model representation for f(y) of the form

\[
f(y) = \int_{\Theta} K(y;\theta)\,\mathrm{d}P(\theta) = \sum_{h=1}^{+\infty} \pi_h K(y;\theta_h), \qquad \pi_h = \nu_h \prod_{l=1}^{h-1}(1-\nu_l), \qquad h = 1,\ldots,+\infty, \tag{1}
\]

with θ_h ∼ P_0, independently for h = 1, ..., +∞, and the stick-breaking weights ν_h, h = 1, ..., +∞, having independent Beta(a_h, b_h) priors, so that Σ_{h=1}^{+∞} π_h = 1 almost surely. Fixing a_h = 1 and b_h = α leads to a Dirichlet process mixture model, whereas the two-parameter Poisson–Dirichlet process mixture can be obtained by letting a_h = 1 − a and b_h = b + ha, with 0 ≤ a < 1 and b > −a. Model (1) has computational benefits in allowing the implementation of simple Markov Chain Monte Carlo methods for inference (e.g.
Escobar & West 1995, Neal 2000), and has been shown to provide a consistent procedure for density estimation (e.g. Ghosal et al. 1999, Tokdar 2006, Ghosal & Van Der Vaart 2007).

These results have motivated different generalizations of (1) to incorporate the conditional density inference problem for f(y | x). In addressing this goal, a class of procedures focuses on modeling the joint density f(y, x) via Bayesian nonparametric mixtures of multivariate kernels, to induce a flexible posterior distribution for the conditional density of y given x (Müller et al. 1996, Müller & Quintana 2010, Hannah et al. 2011). As discussed in Wade et al. (2014), these contributions may face computational and practical issues when the predictor space X is large and complex, due to the need to model the marginal density f(x), which is effectively a nuisance quantity when the focus is on conditional inference. This has motivated alternative methodologies explicitly focused on modeling f(y | x) via a generalization of (1) which allows the unknown random mixing measure P_x(θ) to change with x ∈ X, under a dependent stick-breaking characterization (MacEachern 1999, 2000). Popular representations consider predictor–independent mixing weights π_h, h = 1, ..., +∞, and incorporate changes with x ∈ X in the atoms θ_h(x), for h = 1, ..., +∞ (e.g. De Iorio et al. 2004, Gelfand et al. 2005, Caron et al. 2006, De la Cruz-Mesía et al. 2007).

Although the above models have been successfully applied in different contexts, covering ANOVA-type formulations, spatial statistics, time-series analysis and classification, as noted in MacEachern (2000) and Griffin & Steel (2006), the predictor–independent assumption for the mixing weights can have limited flexibility in modeling f(y | x). This has motivated more general formulations allowing also π_h(x), h = 1, ..., +∞, to change with the predictors.
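Before moving to the predictor-dependent case, the stick-breaking construction in (1) is easy to simulate. The following Python sketch is our own illustration, not code from the paper; the truncation level and concentration parameter are arbitrary choices. It draws Dirichlet-process weights, i.e. the case a_h = 1, b_h = α, and verifies that they sum to one up to a negligible truncation error:

```python
import numpy as np

rng = np.random.default_rng(1)
H, conc = 1000, 2.0            # truncation level and DP concentration (illustrative values)

# Dirichlet-process case of (1): a_h = 1, b_h = alpha, so nu_h ~ Beta(1, alpha)
nu = rng.beta(1.0, conc, size=H)

# pi_h = nu_h * prod_{l < h} (1 - nu_l): break a stick of length 1 piece by piece
remaining = np.concatenate([[1.0], np.cumprod(1.0 - nu)[:-1]])
pi = nu * remaining

# sum_h pi_h = 1 almost surely in the infinite case; here the truncated
# tail mass prod_h (1 - nu_h) is numerically zero at H = 1000
print(pi.sum())
```

The same recursion applies verbatim to any Beta(a_h, b_h) choice; only the line drawing the ν_h changes.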
Relevant examples include the order-based dependent Dirichlet process (Griffin & Steel 2006), the kernel stick-breaking process (Dunson & Park 2008), the infinite mixture models with predictor-dependent normalized weights (Antoniano-Villalobos et al. 2014), and recent representations for dynamic density estimation (Gutiérrez et al. 2016). These formulations represent a broader class of priors for density regression and have appealing theoretical properties (Pati et al. 2013); however, their flexibility comes at a cost in terms of interpretation and computational tractability.

Motivated by the above discussion, we propose an alternative formulation to characterize changes in each mixing weight π_h(x) with the covariates x ∈ X. This representation is carefully defined to provide simpler interpretation and improved computational tractability, while maintaining flexibility and theoretical support. We accomplish these goals via a simple logit stick-breaking construction which relates each ν_h(x) ∈ (0, 1) to a function of the covariates η_h(x) ∈ ℜ, using the logistic link. Our contribution is closely related to the probit stick-breaking prior of Rodriguez & Dunson (2011), which leverages the probit link function instead of the logistic one. However, as we will outline in the subsequent sections, our mapping can be formally interpreted as the canonical link for the continuation–ratio representation (Tutz 1991) of the hierarchical mechanism assigning units to mixture components, under the stick-breaking construction of the mixing weights. Our logistic mapping is also intimately related to the hierarchical mixtures of experts (Jordan & Jacobs 1994, Bishop & Svensén 2003), providing building-block results to implement scalable algorithms for estimation and approximate inference via the Expectation Maximization algorithm and variational Bayes, which are not provided in Rodriguez & Dunson (2011). Ren et al.
(2011) noticed a similar connection in their logistic stick-breaking process; however, their focus is exclusively on nonparametric clustering of spatial and temporal data. Although we rely on a similar representation, our contribution is designed for more general density regression settings and provides additional results in terms of interpretation, computational implementation and theoretical support. For example, we show that our logit stick-breaking construction can be interpreted as a continuation–ratio logistic regression for the assignment of the units to the mixture components. Leveraging the Pólya-Gamma data augmentation for Bayesian logistic regression (Polson et al. 2013), this result facilitates the implementation of a simple Gibbs sampler which converges to the exact posterior and avoids the approximations required in Ren et al. (2011).

The remainder of the paper is organized as follows. In Section 2 we describe the logit stick-breaking prior, along with its properties and the formal interpretation via continuation–ratio logistic regression. Section 3 provides detailed derivations of three algorithms of routine use in Bayesian density regression, covering Gibbs sampling, the Expectation Maximization algorithm, and a variational Bayes approach for scalable inference. The performance of these methods is assessed in Section 4 with an application to the Old Faithful Geyser dataset. Concluding remarks are given in Section 5.

2 The logit stick-breaking prior

This section presents a formal construction of the logit stick-breaking prior via a continuation–ratio parameterization of the hierarchical mechanism assigning the units to mixture components. Although our logit stick-breaking representation for the predictor–dependent mixing weights and the associated computational procedures apply to a wide set of dependent mixture models and kernels, we will mainly focus — for the sake of clarity — on a general class of predictor–dependent infinite Gaussian mixture models of the form

\[
f(y \mid x) = \int \frac{1}{\sigma}\,\phi\left\{\frac{y-\lambda(x)^{\intercal}\beta}{\sigma}\right\}\mathrm{d}P_x(\beta,\sigma) = \sum_{h=1}^{+\infty} \pi_h(x)\,\frac{1}{\sigma_h}\,\phi\left\{\frac{y-\lambda(x)^{\intercal}\beta_h}{\sigma_h}\right\}, \tag{2}
\]

with π_h(x) = ν_h(x) ∏_{l=1}^{h−1}{1 − ν_l(x)}, and β_h = (β_{1h}, ..., β_{Ph})ᵀ a vector of coefficients linearly related to selected functions of the predictors λ_1(x), ..., λ_P(x), comprising the vector λ(x). Formulation (2) is arguably the most widely used in Bayesian nonparametric density regression and has been shown to provide consistent estimates of f(y | x) in asymptotic settings (Pati et al. 2013), thereby motivating an in-depth study of the associated properties and computational methods. Generalizations to other kernels will also be discussed.

2.1 Logit stick-breaking for infinite Gaussian mixture models

To provide a constructive representation of the logit stick-breaking prior, let us consider an equivalent formulation of the predictor–dependent mixture model in (2). In particular — following standard hierarchical representations of mixture models — independent samples y_1, ..., y_n from the variable with density function factorized in (2) can be obtained from

\[
(y_i \mid G_i = h, \theta_h) \sim \mathrm{N}\{\lambda(x_i)^{\intercal}\beta_h, \sigma_h^2\}, \qquad \mathrm{pr}(G_i = h) = \pi_h(x_i) = \nu_h(x_i)\prod_{l=1}^{h-1}\{1-\nu_l(x_i)\}, \tag{3}
\]

for every unit i = 1, ..., n, with θ_h = (β_h, σ_h²) ∼ P_0 and G_i ∈ {1, 2, ..., +∞} an indicator variable denoting the mixture component associated with unit i.
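As a small numerical illustration of the mixture in (2), the Python sketch below (our own illustration, not the paper's code; the function name and toy parameter values are ours) evaluates a truncated predictor-dependent Gaussian mixture density at one point, given weights and component parameters:

```python
import numpy as np

def mixture_density(y, lam_x, betas, sigmas, pi_x):
    """Truncated version of (2): sum_h pi_h(x) * (1/sigma_h) * phi((y - lam(x)'beta_h)/sigma_h)."""
    means = betas @ lam_x                          # component means lam(x)' beta_h
    kernels = np.exp(-0.5 * ((y - means) / sigmas) ** 2) / (np.sqrt(2.0 * np.pi) * sigmas)
    return float(pi_x @ kernels)

# toy example: two components whose means depend on one covariate through lam(x) = (1, x)'
lam_x = np.array([1.0, 0.5])
betas = np.array([[0.0, 1.0], [3.0, -1.0]])        # rows are beta_h
sigmas = np.array([1.0, 0.5])
pi_x = np.array([0.7, 0.3])
print(mixture_density(0.5, lam_x, betas, sigmas, pi_x))
```

In the full model the weights pi_x would themselves be computed from the covariates through the logit stick-breaking construction developed below.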
According to (3), each G_i has probability mass function p(G_i) = ∏_{h=1}^{+∞} π_h(x_i)^{1(G_i = h)}, which can be rewritten — under the stick-breaking factorization for π_h(x_i) in (3) — as

\[
\nu_1(x_i)^{1(G_i=1)}\{1-\nu_1(x_i)\}^{1-1(G_i=1)}\cdots\nu_h(x_i)^{1(G_i=h)}\{1-\nu_h(x_i)\}^{1(G_i>h-1)-1(G_i=h)}\cdots
= \prod_{h=1}^{+\infty}\exp\big[1(G_i=h)\log[\nu_h(x_i)/\{1-\nu_h(x_i)\}]+1(G_i>h-1)\log\{1-\nu_h(x_i)\}\big]. \tag{4}
\]

Hence, the distribution of each component membership indicator G_i can be factorized as the product of conditionally independent Bernoulli probability mass functions for the binary variables {1(G_i = h) | 1(G_i > h − 1) = 1} ∼ Bern{ν_h(x_i) = pr(G_i = h | G_i > h − 1)}, for h = 1, ..., +∞, having natural parameters η_h(x_i) = log[ν_h(x_i)/{1 − ν_h(x_i)}] ∈ ℜ and logistic canonical link. This result provides support for our logit stick-breaking factorization

\[
\pi_h(x_i) = \nu_h(x_i)\prod_{l=1}^{h-1}\{1-\nu_l(x_i)\} = \frac{1}{1+\exp\{-\eta_h(x_i)\}}\prod_{l=1}^{h-1}\left[1-\frac{1}{1+\exp\{-\eta_l(x_i)\}}\right], \tag{5}
\]

for every h = 1, ..., +∞, while allowing a simple interpretation of the stick-breaking construction via a continuation–ratio logistic regression (Tutz 1991), described in Figure 1.

[Figure 1: Representation of the sequential mechanism to sample G_i. The binary tree starts from pr(G_i = 1) = ν_1(x_i) versus pr(G_i > 1) = 1 − ν_1(x_i), and branches with pr(G_i = h | G_i > h − 1) = ν_h(x_i) versus pr(G_i > h | G_i > h − 1) = 1 − ν_h(x_i) at each subsequent node.]

In particular, in the first step of this continuation–ratio generative mechanism, unit i is either assigned to the first component, with probability ν_1(x_i) = [1 + exp{−η_1(x_i)}]⁻¹, or to one of the others with the complement probability. If G_i = 1 the process stops; otherwise it continues, considering the reduced space {2, ..., +∞}.
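The sequential mechanism of Figure 1 translates directly into a sampler for G_i. A minimal Python sketch (our own illustration, with arbitrary probabilities and a finite truncation whose last conditional probability is set to 1):

```python
import numpy as np

def sample_component(nu_x, rng):
    """Continuation-ratio draw of G_i: at step h, conditionally on G_i > h - 1,
    stop at component h with probability nu_h(x_i)."""
    for h, nu_h in enumerate(nu_x, start=1):
        if rng.random() < nu_h:
            return h
    return len(nu_x)  # never reached here, since the last entry of nu_x is 1

rng = np.random.default_rng(2)
nu_x = [0.5, 0.5, 1.0]                 # implies pi(x) = (0.5, 0.25, 0.25)
draws = np.array([sample_component(nu_x, rng) for _ in range(40000)])
freqs = [np.mean(draws == h) for h in (1, 2, 3)]
print(freqs)
```

The empirical frequencies match the stick-breaking weights π_h(x_i) = ν_h(x_i) ∏_{l<h}{1 − ν_l(x_i)}, confirming that the tree of conditional Bernoulli trials and the weight factorization in (5) describe the same distribution.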
A generic step h is reached if i has not been assigned to components 1, ..., h − 1, and the decision at the hth step is to either allocate i to component h, with probability ν_h(x_i) = [1 + exp{−η_h(x_i)}]⁻¹, or to one of the subsequent components h + 1, ..., +∞, with probability 1 − ν_h(x_i), conditioned on G_i ∈ {h, ..., +∞}. This generative mechanism plays a key role in developing simple computational procedures.

To conclude our Bayesian representation we require priors for the parameters η_h(x_i) ∈ ℜ characterizing the log-odds of each conditional probability ν_h(x_i) ∈ (0, 1), h = 1, ..., +∞, in the continuation–ratio logistic regressions. A natural choice — consistent with classical generalized linear model representations (e.g. McCullagh & Nelder 1989) — is to define the log-odds as a linear combination of selected functions ψ(x_i) = {ψ_1(x_i), ..., ψ_R(x_i)}ᵀ of the covariates and consider Gaussian priors for the coefficients, obtaining

\[
\mathrm{logit}\{\nu_h(x_i)\} = \eta_h(x_i) = \psi(x_i)^{\intercal}\alpha_h = \sum_{r=1}^{R}\alpha_{rh}\,\psi_r(x_i), \qquad \alpha_h \sim \mathrm{N}_R(\mu_\alpha, \Sigma_\alpha), \qquad h = 1,\ldots,+\infty. \tag{6}
\]

Although the linearity assumption in equation (6) may seem restrictive, it is worth noticing that flexible formulations for η_h(x_i), including regression via splines and Gaussian processes, induce relations that are linear in the coefficients. Moreover, as we will outline in Section 3, the linearity assumption greatly simplifies computations, while inducing a logistic-normal prior for each ν_h(x_i), h = 1, ..., +∞, with well-defined moments (Aitchison & Shen 1980). Hence, the logit stick-breaking prior does not induce Beta distributed stick-breaking weights, and therefore cannot be included in the general class of stick-breaking priors discussed in Ishwaran & James (2001). However, as outlined in Section 2.2, many relevant properties characterizing the priors discussed in Ishwaran & James (2001) are met also in our case.
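Putting (5) and (6) together, the mixing weights at a fixed covariate value follow from a few vectorized operations. A minimal Python sketch (the function name and coefficient values are our own illustrative choices), using a finite truncation in which the last stick-breaking weight is set to 1 so the weights sum to one:

```python
import numpy as np

def logit_sb_weights(psi_x, alphas):
    """pi_h(x) from (5)-(6): nu_h(x) = logistic(psi(x)' alpha_h),
    pi_h(x) = nu_h(x) * prod_{l < h} (1 - nu_l(x)), with the last nu set to 1."""
    eta = alphas @ psi_x                       # linear predictors eta_h(x)
    nu = 1.0 / (1.0 + np.exp(-eta))
    nu[-1] = 1.0                               # truncation: last weight absorbs the residual stick
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - nu)[:-1]])
    return nu * remaining

psi_x = np.array([1.0, 0.3])                   # e.g. psi(x) = (1, x)' with x = 0.3
alphas = np.array([[0.5, -1.0], [0.0, 2.0], [0.0, 0.0]])   # rows are alpha_h (illustrative)
pi_x = logit_sb_weights(psi_x, alphas)
print(pi_x, pi_x.sum())
```

Changing psi_x moves probability mass across components through the logistic link, which is exactly how the prior lets the mixture weights vary with the predictors.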
2.2 Properties of the logit stick-breaking prior

Let Θ be a complete and separable metric space endowed with the Borel σ-algebra B(Θ), and let {P_x : x ∈ X} denote the class of predictor–dependent random probability measures on Θ induced by our logit stick-breaking prior via

\[
P_x(\cdot) = \sum_{h=1}^{+\infty}\pi_h(x)\,\delta_{\theta_h}(\cdot), \qquad \pi_h(x) = \nu_h(x)\prod_{l=1}^{h-1}\{1-\nu_l(x)\}, \qquad \mathrm{logit}\{\nu_h(x)\} = \psi(x)^{\intercal}\alpha_h, \tag{7}
\]

with independent and identically distributed atoms θ_h ∼ P_0, h = 1, ..., +∞, from the space {Θ, B(Θ)}, and α_h ∼ N_R(μ_α, Σ_α) for every h = 1, ..., +∞. As discussed in Section 2.1, representation (7) does not provide Beta distributed priors for the stick-breaking weights ν_h(x), h = 1, ..., +∞. However, in line with the random measures outlined in Ishwaran & James (2001), our logit stick-breaking prior also provides a well-defined predictor–dependent random probability measure P_x at every x ∈ X, as discussed in Proposition 1.

Proposition 1. For any x ∈ X, Σ_{h=1}^{+∞} π_h(x) = 1 almost surely, with π_h(x) factorized as in (7) and α_h ∼ N_R(μ_α, Σ_α) for every h = 1, ..., +∞.

Proof: Following Lemma 1 in Ishwaran & James (2001), we have that Σ_{h=1}^{+∞} π_h(x) = 1 almost surely if and only if Σ_{h=1}^{+∞} E[log{1 − ν_h(x)}] = −∞. Since log{1 − ν_h(x)} is concave in ν_h(x) for every x ∈ X and h = 1, ..., +∞, by Jensen's inequality we have E[log{1 − ν_h(x)}] ≤ log[1 − E{ν_h(x)}]. Hence, since ν_h(x) ∈ (0, 1), from the usual properties of the expectation we have that 0 < E{ν_h(x)} = μ_{1ν}(x) < 1, thereby providing log{1 − μ_{1ν}(x)} < 0. Therefore, Σ_{h=1}^{+∞} E[log{1 − ν_h(x)}] ≤ Σ_{h=1}^{+∞} log{1 − μ_{1ν}(x)} = −∞, proving Proposition 1.

Although we focus on the infinite case, our logit stick-breaking prior is well defined also in truncated models considering a finite number of mixture components H. Consistent with Ishwaran & James (2001), in this case it suffices to model the first H − 1 weights ν_1(x), ..., ν_{H−1}(x) and let ν_H(x) = 1 for any x ∈ X, to ensure the condition Σ_{h=1}^{H} π_h(x) = 1.

The results in Proposition 1 motivate further analyses of the logit stick-breaking prior. In particular, consistent with theoretical studies on other stick-breaking priors not belonging to the class discussed in Ishwaran & James (2001) — e.g. Dunson & Park (2008), Rodriguez & Dunson (2011) — Proposition 2 provides additional insights on the moments of the predictor–dependent random probability measure induced by our logit stick-breaking prior.

Proposition 2. For every x ∈ X and B ∈ B(Θ), the expectation of P_x(B) is E{P_x(B)} = P_0(B), whereas the variance of P_x(B) for any truncated version of P_x(·) in (7) with H > 1 components — including the infinite case — is

\[
\mathrm{var}\{P_x(B)\} = P_0(B)\{1-P_0(B)\}\,\frac{\mu_{2\nu}(x)\big[1-\{1-2\mu_{1\nu}(x)+\mu_{2\nu}(x)\}^{H}\big]}{2\mu_{1\nu}(x)-\mu_{2\nu}(x)},
\]

where μ_{1ν}(x) = E{ν_h(x)} and μ_{2ν}(x) = E{ν_h(x)²} for every h = 1, ..., +∞. The covariance at two different predictor values x ∈ X and x′ ∈ X, x ≠ x′, is instead

\[
\mathrm{cov}\{P_x(B),P_{x'}(B)\} = P_0(B)\{1-P_0(B)\}\,\frac{\mu_{2\nu}(x,x')\big[1-\{1-\mu_{1\nu}(x)-\mu_{1\nu}(x')+\mu_{2\nu}(x,x')\}^{H}\big]}{\mu_{1\nu}(x)+\mu_{1\nu}(x')-\mu_{2\nu}(x,x')},
\]

with μ_{2ν}(x, x′) = E{ν_h(x) ν_h(x′)}.

Proof: The results are a direct consequence of the calculations in Appendix 2 and Appendix 6 in Rodriguez & Dunson (2011), after replacing the probit link with the logistic one.

[Figure 2, left panel: contour plot of a Monte Carlo estimate of the correlation between P_x(B) and P_{x′}(B) in the infinite case, with 0 < P_0(B) < 1, η_h(x) = α_{1h} + α_{2h} x and α_h = (α_{1h}, α_{2h})ᵀ ∼ N_2(0, 10⁴ I_2), for values x, x′ in (−4, 4).]

According to Proposition 2, the expectation of P_x(·) coincides with the base measure P_0(·), which can therefore be interpreted as the prior guess for the mixing measure at any x ∈ X. This quantity is predictor–independent, meaning that a priori we are not forcing
particular dependence structure between the atoms θ_h and the predictors.

[Figure 2, right panel: the same correlation, computed through the analytical approximation of the logistic link.]

The variance changes instead with the predictors, via a function of the first two moments of the logistic-normal stick-breaking weights. Note that, since each ν_h(x) is bounded between 0 and 1, we have ν_h(x) ≥ ν_h(x)² for every h = 1, ..., +∞ and x ∈ X, implying 0 < μ_{2ν}(x) ≤ μ_{1ν}(x) < 1. These results provide the bound 1 − 2μ_{1ν}(x) + μ_{2ν}(x) < 1, which leads to a well-defined limiting variance for the infinite case H → +∞, equal to P_0(B){1 − P_0(B)} μ_{2ν}(x){2μ_{1ν}(x) − μ_{2ν}(x)}⁻¹. The limiting covariance is instead P_0(B){1 − P_0(B)} μ_{2ν}(x, x′){μ_{1ν}(x) + μ_{1ν}(x′) − μ_{2ν}(x, x′)}⁻¹, after noticing that μ_{1ν}(x) ≥ μ_{2ν}(x, x′), μ_{1ν}(x′) ≥ μ_{2ν}(x, x′) and 1 − μ_{1ν}(x) − μ_{1ν}(x′) + μ_{2ν}(x, x′) < 1. Hence the association is always positive, and increases the closer x is to x′. This behavior is illustrated in Figure 2.

Although the results in Proposition 2 provide simple expressions for E{P_x(B)}, var{P_x(B)} and cov{P_x(B), P_{x′}(B)}, their computation requires the moments of the logistic-normal priors for the stick-breaking weights induced by representation (6). Unfortunately, these quantities are not available in explicit form (e.g. Aitchison & Shen 1980); however, Proposition 3 provides a simple procedure to accurately approximate the moments of the logit stick-breaking weights, leveraging a connection with the probit stick-breaking priors.

Proposition 3. The logit stick-breaking prior described in representation (6) can be accurately approximated by a probit stick-breaking process ν_h(x) ≈ Φ{ψ(x)ᵀᾱ_h}, with ᾱ_h = α_h √(π/8) ∼ N_R{√(π/8) μ_α, (π/8) Σ_α}, for every x ∈ X and h = 1, ..., +∞.
Proof: Consistent with the results in Amemiya (1981), the logistic link {1 + exp(−ψ(x)ᵀα_h)}⁻¹ can be accurately approximated by Φ{ψ(x)ᵀα_h √(π/8)}. Therefore

\[
\nu_h(x) = \{1+\exp(-\psi(x)^{\intercal}\alpha_h)\}^{-1} \approx \Phi\{\psi(x)^{\intercal}\alpha_h\sqrt{\pi/8}\} = \Phi\{\psi(x)^{\intercal}\bar{\alpha}_h\}, \qquad x \in X,
\]

with ᾱ_h ∼ N_R{√(π/8) μ_α, (π/8) Σ_α} for every h = 1, ..., +∞, concluding the proof.

According to Proposition 3, the logit stick-breaking prior can be approximated by a probit stick-breaking process, up to a scale transformation of the prior for the coefficients α_h, h = 1, ..., +∞. This result allows a simple approximation of the moments of the logistic-normal priors on the stick-breaking weights, by rescaling those provided in Rodriguez & Dunson (2011) for the probit stick-breaking process. As shown in Figure 2, this analytical approximation provides indistinguishable results when compared to a Monte Carlo estimate, motivating the use of our computational algorithms also under the probit link. In fact, a researcher considering a probit stick-breaking process could easily perform inference leveraging our algorithms, after rescaling the prior for the coefficients in the linear predictor by √(8/π).

We conclude the analysis of the logit stick-breaking properties by studying how a truncated version of (7) approximates the infinite process. Although there are some computational methods for the infinite representation, these algorithms are not necessarily more tractable than those relying on a finite truncation, and still require approximations. In line with Rodriguez & Dunson (2011) and Ren et al. (2011), we develop detailed computational methods based on a finite representation, and discuss generalizations to the infinite case. This choice allows a more direct comparison between the proposed algorithms, and — according to Theorem 1 — provides an accurate approximation of the infinite representation.

Theorem 1.
For a sample y = (y_1, ..., y_n)ᵀ with covariates X = {x_1, ..., x_n}ᵀ, let

\[
f_H(y \mid X) = \mathbb{E}\left\{\prod_{i=1}^{n} f_{P^{H}_{x_i}}(y_i \mid x_i)\right\} = \mathbb{E}\left(\prod_{i=1}^{n}\left[\int \frac{1}{\sigma}\,\phi\left\{\frac{y_i-\lambda(x_i)^{\intercal}\beta}{\sigma}\right\}\mathrm{d}P^{H}_{x_i}(\beta,\sigma)\right]\right),
\]
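Returning for a moment to Proposition 3, the quality of the √(π/8) rescaling of the logistic link is easy to check numerically with the standard library alone. In the sketch below (our own check; the evaluation grid is arbitrary), the logistic function and Φ{√(π/8) η} agree to within roughly 0.02 uniformly in η:

```python
import numpy as np
from math import erf, sqrt, pi

def std_normal_cdf(z):
    """Phi(z) via the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

eta = np.linspace(-6.0, 6.0, 1201)
logistic = 1.0 / (1.0 + np.exp(-eta))
probit = np.array([std_normal_cdf(sqrt(pi / 8.0) * e) for e in eta])
max_gap = float(np.abs(logistic - probit).max())
print(max_gap)   # the two links agree to within about 0.02 over the whole grid
```

This is the approximation underlying the comparison in the right panel of Figure 2, and the reason a probit stick-breaking user can reuse the algorithms of Section 3 after a simple prior rescaling.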
