ebook img

Piecewise Deterministic Markov Processes for Scalable Monte Carlo on Restricted Domains PDF

0.59 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Piecewise Deterministic Markov Processes for Scalable Monte Carlo on Restricted Domains

Piecewise Deterministic Markov Processes for Scalable Monte Carlo on Restricted Domains Joris Bierkens 1, Alexandre Bouchard-Cˆot´e2, Arnaud Doucet3, Andrew B. Duncan4, Paul Fearnhead5, Gareth Roberts6, and Sebastian J. Vollmer6 1EEMCS, TU Delft, Netherlands. 2Department of Statistics, University of British Columbia, Canada. 7 3Department of Statistics, University of Oxford, U. K. 1 4School of Mathematical and Physical Sciences, University of Sussex, U. K. 0 2 5Department of Mathematics and Statistics, Lancaster University, U. K. n 6Department of Statistics, University of Warwick, U. K. a J 6 January 17, 2017 1 ] E Abstract variants of MCMC that require accessing only small M subsets of the data at each iteration [26]. Piecewise deterministic Monte Carlo methods One promising set of algorithms are based around . at (PDMC)consistofaclassofcontinuous-timeMarkov simulating a piecewise deterministic Markov process t chain Monte Carlo methods (MCMC) which have [23, 7, 6]. Such processes explore the state space s [ recently been shown to hold considerable promise. according to a constant velocity, where the velocity Beingnon-reversible, themixingpropertiesofPDMC changes at random event times. The rate of these 1 methods often significantly outperform classical event times, and the change in velocity at each event, v 4 reversible MCMC competitors. Moreover, in a arechosensothattheresultingprocesshastheposte- 4 Bayesian context they can use sub-sampling ideas, rior distribution as its invariant distribution. We will 2 so that they need only access one data point per refer to this family of sampling methods as Piecewise 4 iteration, whilst still maintaining the true posterior Deterministic Monte Carlo methods (PDMC). 0 . distribution as their invariant distribution. However, ThePDMCmethodsproposedtodateexhibitsome 1 current methods are limited to parameter spaces remarkable properties of practical interest: 0 of the form Rd. We show how these algorithms 7 can be extended to applications involving restricted 1. there is no rejection, i.e. each piece of computa- 1 : parameter spaces. In simulations we observe that the tionimpliesexplorationofnewconfigurationsfor v resultingalgorithmismoreefficientthanHamiltonian all variables of the model; i X Monte Carlo for sampling from truncated logistic r regression models. The theoretical framework used 2. thereisnoneedforsymplecticintegrators(which a to justify this extension lays the foundation for the are required for Hamiltonian Monte Carlo [10]) development of other novel PDMC algorithms. because trajectories consist simply of straight lines; 1 Introduction 3. the underlying Markov process is non-reversible; Markov chain Monte Carlo (MCMC) methods have 4. factor graph decompositions of the target distri- beencentraltothewide-spreaduseofBayesianmeth- bution can be leveraged to perform sparse up- ods. However their applicability to some modern ap- dates of the variables [7, 19, 23]; plicationshasbeenlimitedduetotheirhighcomputa- tional cost, particularly in big-data, high-dimensional 5. and it is often possible to use only a small sub- settings. ThishasledtointerestinnewMCMCmeth- sample of the data at each iteration, whilst still ods, particularly non-reversible methods which can beingguaranteedtotargetthetrueposteriordis- mix better than standard reversible MCMC [9], and tribution [5, 7, 13]. 1 Existing PDMC algorithms can only be used to times which are simulated according to N inhomo- samplefromposteriorswheretheparameterscantake geneous Poisson processes, with respective intensity anyvalueinRd. Inthispaperweshowhowtoextend functions λ (x ,v ), i = 1,...,N, depending on the i t t PDMC methodology to deal with constraints on the current state of the system. At each switching time parameters. Such models are ubiquitous in machine the position stays the same, but the velocity is up- learning and statistics. For example, many popular datedaccordingtoaspecifiedtransitionkernel. More models used for binary, ordinal and polychotomous specifically, suppose the next switching event occurs responsedataaremultivariatereal-valuedlatentvari- from ith Poisson process, then the velocity immedi- able models where the response is given by a deter- ately after the switch is sampled randomly from the ministic function of the latent variables [1, 11, 24]; probability distribution Q (x,v,·), i.e. for any mea- i e.g. a binary response is set equal to 1 if a linear surable subset A ⊂ V, the probability that the new combinationof the latentvariablesispositiveand −1 velocityisinAgiventhecurrentpositionxandveloc- otherwise. Under the posterior distribution, the do- ity v before the switch is Q (x,v,A). The switching i main of the latent variables is then constrained based times are random, and designed in conjunction with on the values of the responses. Additional examples the kernels {Q } so that the invariant distri- i i=1,...,N ariseinregressionwherepriorknowledgerestrictsthe bution of the process coincides with the target distri- signsofmarginaleffectsofexplanatoryvariablessuch bution π. An example of a realisation of this process asineconometrics[14],imageprocessingandspectral is shown in Figure 1(f). analysis[3],[15]andnon-negativematrixfactorization The resulting stochastic process is a Piecewise De- [17]. When performing Bayesian inference in these terministic Markov Process (PDMP, [8]). For it to be contexts, posterior simulation requires sampling from usefulasthebasisofaPiecewiseDeterministicMonte constrained distributions. A few methods for dealing Carlo(PDMC)algorithmweneedto(i)beabletoeas- with restricted domains are available but these either ily simulate this process; and (ii) have simple recipes target an approximation of the correct distribution for choosing the intensities, {λ } , and transi- i i=1,...,N [22] or are limited in scope [21]. tionkernels,{Q } ,suchthattheresultingpro- i i=1,...,N Thegeneralframeworkpresentedhereissufficiently cess has π(x) as its marginal stationary distribution. rich to include both the bouncy particle sampler We will tackle each of these problems in turn. [7,23],thezigzagsampler [5,6]aswellaseventchain sampling methods of Werner Krauth and collabora- 2.1 Simulation tors [4, 19], as we will illustrate with an example. We presentsimulationresultsofPDMCforisotoniclogis- We shall first assume that O = Rd. The key chal- ticregression. InthiscaseweseethePDMCmethods lengeinsimulatingourPDMPissimulatingtheevent can be substantially more efficient than Hamiltonian times. The intensity of events is a function of the Monte Carlo. state of the process. But as the dynamics between Additionally, we also give a general presentation of event times are deterministic, we can easily represent PDMC methods, which shows how the choices of the theintensityforthenexteventasadeterministicfunc- dynamics of the process affect their stationary distri- tion of time. Suppose that the PDMP is driven by a bution. These results can aid with the development single inhomogeneous Poisson process with intensity of novel PDMC algorithms. function λ(cid:101)(u;Xt,Vt)=λ(Xt+uVt,Vt), u≥0. 2 Piecewise Deterministic We can simulate the first event time directly if we Monte Carlo have an explicit expression for the inverse function of the monotonically increasing function Ourobjectiveistocomputeexpectationswithrespect (cid:90) u to a probability distribution π on O ⊆ Rd which is u(cid:55)→ λ(cid:101)(s;Xt,Vt)ds. (1) assumed to have smooth density, also denoted π(x), 0 with respect to the Lebesgue measure on O. In this case the time until the next event is obtained Wedenotethestateofourcontinuous-timeMarkov by (i) simulating a realization, y say, of an exponen- processattimetasZ =(X ,V ),takingvaluesinthe tial random variable with rate 1; and (ii) setting the t t t domainE =O×V,whereO andV aresubsetsofRd, time until the next event as the value τ that solves (cid:82)τ such that O is open. The dynamics of Zt are easy to 0 λ(cid:101)(s;Xt,Vt)ds=y. describeifoneviewsX aspositionandV asvelocity. In practice inverting Equation 1 is often not prac- t t ThepositionprocessX movesdeterministically,with tical. In such cases simulation can be carried via t constantvelocityV betweenadiscretesetofswitching thinning [18]. This requires finding a tractable upper t 2 bound on the rate, λ(u) ≥ λ(cid:101)(u;Xt,Vt) for all u > 0. Such an upper bound will typically take the form of (a) (b) 5 1.2 a piecewise linear function or a step function. Note that the upper bound λ is only required to be valid 4 1.0 along the trajectory u (cid:55)→ (X + uV ,V ) in O × V. t t t 3 0.8 Therefore the upper bound will depend on the start- Rate 0.6 ingpoint(X ,V )ofthelinesegmentwearecurrently 2 t t 0.4 simulating. 1 l 0.2 We then propose potential events by simulating events from an inhomogenous Poisson process with 0 0.0 X X 0 1 2 3 4 5 0 1 2 3 4 rate λ(u), and accept an event at time u with prob- (c) T(idm)e 5 5 ability λ(cid:101)(u;Xt,Vt)/λ(u). The time of the first ac- cepted event will be the time until the next event in X 4 4 our PDMP. 3 3 For a PDMP driven by N inhomogeneous Poisson X processes with intensities {λ }N the previous steps i i=1 2 2 leadtothefollowingalgorithmforsimulatingthenext 1 l 1 l eventofourPDMP.Thisalgorithmcanbeiteratedto simulate the PDMP for a chosen number of events or 0 0 a pre-specified time-interval. 0 1 2 3 4 5 0 1 2 3 4 5 (e) (f) 5 5 (0) Initialize: Setttothecurrenttimeand(X ,V ) t t l to the current position and velocity. 4 4 3 3 (1) Determine bound: For each i ∈ 1,...,N, find a convenient function λ satisfying λ (u) ≥ i i 2 2 λ(cid:101)i(u;Xt,Vt) for all u ≥ 0, depending on the ini- 1 l 1 tial point (X ,V ) from which we are departing. t t 0 0 (2) Propose event: For i = 1,...,N, simulate the 0 1 2 3 4 5 0 1 2 3 4 5 firsteventtimesτ(cid:48) ofaPoissonprocesswithrate i function λ . i Figure 1: Schematics of a PDMC algorithm for a (3) Let imin =argminj=1,...,Nτj(cid:48) and τ(cid:48) =τi(cid:48)min. PDMPdrivenbyasingleinhomogeneousprocesswith (4) Accept/Reject event: With probability intensity λ(cid:101), along with output for a two-dimensional target. (a)Initialposition(shownbydot)andvelocity λ(cid:101)i(τ(cid:48);Xt,Vt) (shown by arrow). (b) We first calculate λ(cid:101) (red lower λ (τ(cid:48)) line), by evaluating the event rate along the current i path of the process; and we bound this by a simple accept this event. function λ (black upper line). We then propose event timesfromaPoissonprocesswithrateλ,thefirsttwo (4.1) Upon acceptance: set Xt+τ(cid:48) = Xt + eventtimesareshownbycrosses. Inpracticetheseare τ(cid:48)Vt; simulate a new velocity V(cid:48) from simulated sequentially. (c) Given the first event time, Qimin(Xt+τ(cid:48),Vt,·) and set Vt+τ(cid:48) =V(cid:48). u say, we simulate the deterministic process forward (4.2) Upon rejection: set X = X + τ(cid:48)V t+τ(cid:48) t t foraperiodoflengthu. Wethenacceptorrejectthis and set V =V . t+τ(cid:48) t event, with acceptance probability λ(cid:101)(u)/λ(u). Note that rejection in this context means the position is (5) Update: Record the time t + τ(cid:48) and state updated to move up to the rejected point, but the (Xt+τ(cid:48),Vt+τ(cid:48)). velocityisnotchanged. Herewereject,so(d)wethen A schematic of this algorithm is shown in Figure 1. simulatethenextproposedevent-time,andrepeatthe accept-reject step. In this case we accept so (e) we 2.2 Output of PDMC Algorithms then simulate a new velocity, and repeat the above iterations. (f) Example output of the process over a The output of these algorithms will be a sequence series of iterations. of event times t ,t ,t ,...,t and associated states 1 2 3 K (X ,V ), (X ,V ),...,(X ,V ). To obtain the value 1 1 2 2 K K 3 of the process at times t ∈ [t ,t ), we can lin- We impose the following condition: k k+1 early interpolate the continuous path of the process ibnettewgereanlse(cid:82)vetnft(Xtim)edss,i.oef.aXtfu=ncXtitokn+fVko(ft−thtek)p.rTocimeses (cid:90) (cid:90) (cid:88)N λi(x,v)Qi(x,v,du)ρ(dv) (3) 0 s V U i=1 X can be approximated by numerically integrating t (cid:90) (cid:90) N the one dimensional integral along the piecewise lin- (cid:88) = λ (x,v)Q (x,u,dv)ρ(du), i i ear trajectory of the PDMP. However, in many in- U V i=1 stances, such integrals can be computed analytically fromtheoutputoftheabovealgorithm. Forexample, for all x∈O and U ⊂V measurable. This is implied for f(x)=x(r), the rth component of x we obtain, by the easier but stronger condition that each Qi is reversible with respect to ρ, i.e. for i=1,...,N, (cid:90) tKX(r)ds=(cid:88)K (cid:18)X(r) τ +V(r) τk2(cid:19), Qi(x,u,dv)ρ(du)=Qi(x,v,du)ρ(dv). (4) s tk−1 i tk−1 2 0 k=1 Moreover, we shall require the following condition which relates the probability flow with the switching whereτ =t −t . Alternativelywecansamplethe k k k−1 intensities λ : PDMPatasetofevenlyspacedtimepointsalongthe i trajectory and use this collection as an approximate N (cid:90) N (cid:88) (cid:88) sample from our target distribution. Note that using λ (x,v)Q (x,u,dv)− λ (x,u) i i i the values of the process at event times will not pro- i=1 V i=1 (5) duce samples drawn from the stationary distribution =−u·∇U(x), of the PDMP [8]. It can be shown from the above construction that for all (x,u) ∈ E. The following result is proved in Z is a strong Markov process. Additional conditions the supplementary material. t are required to ensure that Z is ‘non-explosive’, in t Proposition 2. Suppose that conditions (3) and (5) the sense that, almost surely, there are only finitely hold. Thentheprobabilitydistributionπ(x)dx⊗ρ(dv) manyswitchesineveryfinitetimeinterval. Theresult is an invariant distribution of the process Z . below provides on such condition, valid for compact t V. Under the assumption that the resulting PDMP is ergodic (for sufficient conditions see e.g. [5, 7]), we Proposition 1. Suppose that V is a compact subset havethefollowingversionofthelawoflargenumbers of Rd and that the switching intensities {λi}Ni=1 are [16] for the PDMP Zt =(Xt,Vt): For all square inte- piecewise continuous, then the process Zt = (Xt,Vt) grable functions f :Rd →R (i.e. (cid:82)Rdf2(x)π(x) dx< defined above is a non-explosive process. ∞), we have that, with probability one, The proof can be found in the supplementary ma- (cid:90) 1 (cid:90) T f(x)π(x)dx= lim f(X )ds. terial. Rd T→∞T 0 s It is this formula which allows us to use PDMPs for Monte Carlo purposes. 2.3 Choosing the Intensity and Tran- sition Kernel 2.4 Examples Most existing PDMC methods [5, 7, 23, 7] assume Current PDMC algorithms differ in terms of how the that the target density, π(x), is defined on Rd. They Q and λ are chosen such that the above equation furtherrequirethatlogπ(x)bedifferentiable. Assum- i i holds for some simple distribution for the velocity. ing these two conditions hold, we can provide criteria In the following examples δ denotes the Dirac- whichmustholdforagivenprobabilitydistributionto x measure centered in x. be a stationary distribution of Z . We shall consider t The Zig-Zag sampler [5] can be recovered by stationary distributions for which x and v are inde- choosing N = d and picking as velocity space pendent, i.e. distributions of the form π(x)dx⊗ρ(dv) V = {−1,+1}d equipped with discrete uniform on E. Furthermore we assume that distribution ρ, defining switching rates λ (x,v) = i max(v ∂ U(x),0). The corresponding switching ker- π(x)∝exp(−U(x)) (2) i xi nels over new directions are given by where U is continuously differentiable. Q (x,v,dv(cid:48))=δ (dv(cid:48)), i Fiv 4 whereF :V →V denotestheoperationofflippingthe processmovesbackintoO. Weshallrefertosuchup- i i-th component, i.e. (F v)(i)=−v(i), and (F v)(j)= dates as boundary reflections even though they need i i v(j) for j (cid:54)=i. not be specular reflections. The Bouncy Particle Sampler introduced in [23] Tohandleboundaryreflections,ateverygiventime and explored in [7] is obtained setting N = 1 and t, we keep track of the next reflection event in the ρ = N (0,I) on Rd or ρ = U(Sd−1) the uniform dis- absence of a switching event, i.e. we compute tribution on the unit sphere. The single switching rate is chosen to be λBPS(x,v) = max(v·∇U(x),0), τb =inf{u>0 : Xt+uVt (cid:54)∈O}. with corresponding switching kernel Q which reflects If the boundary ∂O can be represented as a finite set v with respect to the orthogonal complement of ∇U of M hyper-planes in Rd, then the cost of computing with probability 1: τ is O(Md). When generating the switching event b Q(x,v,dv(cid:48))=δ (dv(cid:48)), times and positions for Zt we determine whether a (I−2P∇U)v boundary reflection will occur before the next poten- where P : z (cid:55)→ z·y y denotes orthogonal projection tial switching event. If so, then we induce a switch- y (cid:107)y(cid:107)2 along the one dimensional subspace spanned by y. ing event at time t+τb where Xt+τb ∈ ∂O and sam- Asnotedin[7]thisalgorithmsuffersfromreducibil- ple a new velocity from the transition kernel Qb, i.e. ity issues. These can be overcome by refreshing the Vt+τb ∼Qb(Xt+τb,Vt,·). velocitybydrawinganewvelocityindependentlyfrom Although theoretically we may choose a new ve- ρ(dv). In the simplest case the refreshment times locity pointing outwards and have an immediate sec- come from an independent Poisson process with con- ond jump, we will for algorithmic purposes assume stant rate λref. This also fits in the framework above that the probability measure Qb(x,u,·) for (x,u) ∈ by choosing λ(cid:101)=λBPS+λref and ∂O×V isconcentratedonthosedirectionsv forwhich (v·n(x)) ≤ 0, where n(x) is the outward normal at Q(x,u,dv)= λBPS δ (dv) x∈∂O. λ +λ (I−2P∇U)u Based on this, we thus modify steps (2)–(4) of the BPS ref λ PDMP scheme as follows. + ref ρ(dv). λ +λ BPS ref (2’) Propose event: For i = 1,...,N simulate the As a generalization of the BPS, one can consider firsteventtimesτi(cid:48) ofaPoissonprocesswithrate a preconditioned version, by introducing a constant function λi. Compute the next boundary reflec- positive definite symmetric matrix M to rescale the tion time τb. velocityprocess,allowingcertaindirectionstobepre- (3’) Let i =argmin τ(cid:48) and τ(cid:48) =τ(cid:48) . ferred. Morespecifically,wesetV =Rd,ρ≡N(0,M) min j=1,...,N j imin and (4’) Accept/Reject event: Q(x,v,dv(cid:48))=δ (dv(cid:48)), (I−2P∇MU)v (4.1’) If τ < τ(cid:48) then set τ = τ ; set X = b b t+τ where PM : z (cid:55)→ z·y My. It is straightforward to X + τV ; sample a new velocity V ∼ y y·My t t t+τ checkthatconditions(3)and(5)hold,andsoπ⊗ρis Qb(Xt+τ,Vt,·). an invariant measure. The choice of M plays a very (4.2’) Otherwise with probability similar role to the mass matrix in HMC, and careful tuning can give rise to dramatic increases in perfor- λ(cid:101)imin(τ(cid:48);Xt,Vt) mance [13]. For Gaussian targets, the natural choice λ (τ(cid:48)) of M is the covariance of the target distribution. See imin [20] for a related approach. accept the event at time τ =τ(cid:48). (4.1.1’) Upon acceptance: set X = X + t+τ t 3 Sampling over Restricted Do- τVt; sample a new velocity Vt+τ ∼ Q (X ,V ,·). mains imin t+τ t (4.2.2’) Upon rejection: set X =X +τV t+τ t t and set V =V . t+τ t Suppose now that O is an open pathwise-connected subset of Rd with Lipschitz boundary ∂O. We can Toensurethattheprocessπ(x)dx⊗ρ(dv)isanin- adaptthePDMPschemedescribedpreviouslytosam- variant distribution of Z we must impose conditions t ple from a distribution π(x) supported on O. To en- on the boundary transition kernel Q . The follow- b surethatx remainsconfinedwithinO thevelocityof ing result provides one such sufficient condition. The t theprocessisupdatedwheneverx hits∂Osothatthe proof is deferred to the supplementary material. t 5 Proposition 3. Consider the process Z on O ×V 3.2 Example: Event Chains of Krauth t where O is an open, pathwise connected subset of Rd et al. with Lipschitz boundary. For x∈∂O, denote by n(x) the outward unit normal of ∂O. Suppose that condi- Krauthandcollaborators[4,19]introducedthenotion tions (3) and (5) hold on O, and for x∈∂O, of event chains for simulations in statistical mechan- ics, with superior performance over other methods. Qb(x,u,dv)ρ(du)=Qb(x,v,du)ρ(dv) (6) We show in a simple example how this can be incor- porated into the current framework. and moreover Consider the simulation of two one-dimensional (cid:90) ‘hard disks’ of radius R on T1, the one dimensional (n(x)·u)Q (x,v,du)=−v·n(x) (7) b torus, i.e. the interval [0,1] with the sides identified. V The locations of the centers are denoted by x and 1 for all (x,v) ∈ ∂O×V. Then π(x)dx⊗ρ(dv) is an x . We have that x = (x ,x ) ∈ O := {(x ,x ) ∈ 2 1 2 1 2 invariant distribution for the process Z . T2 :d(x ,x )>2R}, where d represents the distance t 1 2 functiononT1. Thetargetdistributionistheuniform In practice we only have to consider the exit region distribution on O. Γ ⊂ O × V. For example if O = (a,b) ⊂ R1 and The Straight Event Chain (SEC) algorithm [4] V ={−1,+1}, then Γ={b,+1}∪{a,−1}. The spec- initially samples a fixed direction vector v ∈ ification of Q on (∂O×V)\Γ is irrelevant as these b {e ,−e ,e ,−e } (with e ,e denoting canonical ba- pointsareneverreachedbyZ . Onthisirrelevantset, 1 1 2 2 1 2 t sis vectors) and then moves x along the direction v. wemaychooseQ asdesiredtosatisfy(7). Inpractice b Thiscorrespondstoonemovingdiskandonestation- it therefore suffices to check (4) and (7) on Γ. ary disk. When the moving disk hits the stationary Another possible behavior at the boundary is to disk, its motion is transferred to the other disk in the generate the new reflected direction independently of same direction. the angle of incidence. This will also preserve the Since there are no changes of direction unless the invariant distribution provided that ρ is isotropic. disks hit each other, we have λ≡0. Proposition 4. Consider the process Z as in the t previous proposition, such that conditions (5) and (3) (0,1) (1,1) hold and the distribution ρ has mean zero. Then π(x)dx⊗ρ(dv) will be an invariant distribution for the process Z if Q (x,v,du) is independent of v for n(x) t b all x∈∂O. O 3.1 Example: Bouncy Particle Sam- x pler For the Bouncy Particle Sampler [23, 7], it is natural O to choose Q (s,v,du)=δ (du), b (I−2Pn(s))v (0,0) (1,0) for s∈∂O, so that the process X reflects specularly t at the boundary (i.e. angle of incidence equals angle Figure 2: Illustration of O, the point x at the bound- of reflection of process with respect to the boundary ary and the outward normal n(x) for the event chain normal). Itisstraightforwardtocheckthatcondition discussedinSection 3.2. Theblue areaistheexterior (3) holds at the boundary and that (7) is satisfied. of O as a subset of T2. The width of the blue strip Similarly, for the preconditioned BPS with mass ma- (in horizontal or vertical direction) is equal to 2R. trix, the natural choice of reflection at the boundary is given by Qb(s,v,du)=δ(I−2PM )v(du), Consider for simplicity a single event where we hit n(s) the boundary ∂O in such a way that x = x +2R 1 2 forwhichitisstraightforwardtoconfirmthatthecon- mod 1 (the other possibility being x = x + 2R 2 1 ditions of Proposition 3 are satisfied. mod 1). See Figure 2 for an illustration. At this 6 point x, the dynamics described above result in for all realizations of ∇(cid:100)U. A more comprehensive ex- planationofthisargumentcanbefoundin[12]inthe (cid:40) δ (dv) ifu=−e , context of the Zig-Zag sampler, and in [20] for the Q (x,u,dv)= −e2 1 . b δ (dv) ifu=e bouncy particle sampler. e1 2 It is straightforward to show that the result- Notethatthe‘exitset’Γ(seetheRemarkafterPropo- ing BPS algorithm uses an event rate that is (cid:104) (cid:16) (cid:17)(cid:105) sition 3) does not include (x,e1) or (x,−e2), because E max 0,(∇(cid:100)U(x)·v) , and that this rate and the these correspond to entering O from its exterior. It resulting transition probability Q at events satisfies maybecheckedthatQ satisfies(7)onΓ. Inorderto b Proposition 2 and 7. Hence this algorithm still tar- make Q satisfy (7) and (4) on ∂O×V, we have to set gets π(x), but only requires access to one data point (cid:40) at each accept-reject decision. δ (dv) ifu=−e , Q (x,u,dv)= −e1 2 . Notethatthisgainincomputationalefficiencydoes b δe2(dv) ifu=e1 not come for free, as it follows from Jensen’s inequal- ity that the overall rate of events will be higher. This Conditions(3)and(5)aretriviallysatisfiedonOsince makes mixing of the PDMC process slower. It is also λ≡0. immediate that the bound, λ, we will have to use will The above is only a simple example of event chains be higher. However [5] show that if our estimator of in statistical mechanics rephrased as PDMPs; a sys- ∇(cid:100)U(x) has sufficiently small variance, then we can tematic study is beyond the scope of this paper. still gain substantially in terms of efficiency. In par- ticular they give an example where the CPU cost ef- 4 Subsampling fective sample size does not grow with N – by com- parison all standard MCMC algorithms would have a cost that is at least linear in N. When using PDMC to sample from a posterior, we To obtain such a low-variance estimator requires a can use sub-samples of data at each iteration of the good choice of xˆ, so that with high probability x will algorithm,asdescribedin[5]. Inthefollowingwewill be closer to xˆ. This involves a preprocessing step to assume that we can write the posterior as find a value xˆ close to the posterior mode, a prepro- N cessing step to then calculate ∇U(xˆ) is also needed. (cid:89) π(x)∝ f(y ;x), We now illustrate how to find an upper bound on i i=1 the event rate. Following [5], if we assume L is a uniform (in space and i) upper bound on the largest for some function f. For example this would be the eigenvalue of the Hessian of Ui, and if (cid:107)v(cid:107)=1 likelihoodforasingleIIDdatapointtimesthe1/Nth (cid:16) (cid:16) (cid:17) (cid:17) power of the prior. max 0, ∇U(xˆ)+N(∇Ui(x )−∇Ui(xˆ)) ·v t We first present a way for estimating this quantity (cid:13) (cid:13) in an unbiased manner, which uses control variates ≤max(0,∇U(xˆ)·v)+N(cid:13)∇Ui(x)−∇Ui(xˆ)(cid:13) (9) (cid:13) (cid:13) [2, 5]. For any xˆ∈O we note that (cid:13) (cid:13) +N(cid:13)∇Ui(x)−∇iU(x )(cid:13) (cid:13) t (cid:13) ∇U(x)=∇U(xˆ)+[∇U(x)−∇U(xˆ)]. ≤max(0,∇U(xˆ)·v)+NL(cid:107)x−xˆ(cid:107)+NLt (10) We can then introduce the estimator ∇(cid:100)U(x) of ∇U Thustheupperboundontheintensityisoftheform by λ¯(τ) = a+b·τ with a,b ≥ 0. In this case the first arrival time can be simulated as follows ∇(cid:100)U(x)=∇U(xˆ)+N[∇logf(yI;x)−∇logf(yI;xˆ)], (cid:114) (8) (cid:16)a(cid:17)2 R τ(cid:48) =−a/b+ +2· with R∼Exp(1). where I is drawn uniformly from {1,...,N}. b b The idea of using sub-sampling, within say the BouncyParticleSampler(BPS),isthatateachitera- tionofourPDMCalgorithmwecanreplace∇U(x)by anunbiasedestimatorinstep(3). Weneedtousethe 5 Experiments sameestimatebothwhencalculatingtheactualevent rate in the accept/reject step and, if we accept, when WeuseBayesianbinarylogisticregressionasatestbed simulating the new velocity. Whilst for each imple- for our newly proposed methodology and perform a mentation of step (3) we need to use an independent simulation study. The data y ∈ {−1,1} is modelled i estimate. The only further alteration we need to the by algorithm is to choose an upper bound λ that holds p(y |ξ ,β)=f(y βTξ ) (11) i i i i 7 where ξ ∈ Rp×n are fixed covariates and f(z) = 1 ∈ [0,1]. We will assume that we wish to 1+exp(−z) fit this model under some monotonicity constraints – so that the probability of y = 1 is known to either increase or decrease with certain covariates. This is modeled through the constraint β >0 and β <0 re- i i spectively. This can be motivated for general logistic regressionforquestionnaires,see[25]. Infollowingwe consider the case β ≥0 for j =1,...,p. j For simplicity we use a flat prior over the space of parametersvaluesconsistentwithourconstraints. By Bayes’ rule the posterior π satisfies N (cid:89) π(β)∝ f(y βTξ ) for β ∈O, i i i=1 where O is the space of parameter values consis- tent with our constraints. We implement the BPS with subsampling, see Section 4 and use reflection at the boundary i.e. Q (s,v,du) = δ (du) for Figure 3: ESS against gradient evaluations for BPS b (I−2Pn(s))v s ∈ ∂O. We can bound the switching intensity, even and HMC (acceptance rate is hand tuned to 0.6) ap- when we use the sub-sampling estimator of Equation pliedtologisticregressionwithp=20andn=10000 (8), by a linear function of time. This follows be- and parameter β constrained to be positive. The cause we can bound the second derivative of the log- graphicisbasedon40independentrunsforeachHMC likelihood of any observation. The details for deriv- and BPS for each choice of number of gradient evalu- ing an upper bound on the intensity can be found in ations. Section 7.2 of the supplementary material. We use n = 10000 and p = 20 and generate artificial data based on β(cid:63) and ξ whose components are a realiza- counterparts. Other benefits, such as their amenabil- tion i.i.d. uniformly distributed random variables on ity to unbiased subsampling techniques and their [0,1]. In Figure 3 we compare standard HMC with ability to leverage structural properties of the tar- BPSintermsofeffectivesamplesize(ESS)pergradi- get distribution (such as any factor graph represen- entevaluation. ForHMCweuse5leap-frogstepsand tation) makes them an extremely powerful tool in tune the acceptance rateto 0.6. This requires asmall the Bayesian context, particularly in the large data stepsizebecauseifHMCproposesoutsidethebound- regime. ary the move is rejected. (There exists a version of In this work we have shown that PDMC methods HMCwhichcansamplefromtruncatedGaussiandis- also provide a very effective tool for sampling from tributions[21]buttoourknowledgenoefficientHMC probabilitydistributionsonrestricteddomains,which scheme is available for general restricted domains.) arise naturally in the Bayesian context where prior We note that due to increasingly many rejections at knowledge imposes hard constraints on inferred pa- the boundary, the effective sample size of the HMC rameters,examplesofwhichariseineconometricsand scheme is relatively unchanged as the number of gra- image processing. dient evaluations is increased. On the other hand, we Whilethisworkprovidesaframeworkfordescribing observe that the BPS shows a clear advantage over a general class of PDMC methods which are ergodic the HMC scheme in this case, with the effective sam- withrespecttoagiventargetprobabilitydistribution, ple size growing commensurately with the number of itleavesopenthequestionofhowthechoiceofinten- gradient evaluations. sityfunction,velocitytransitionkernelaswellasother parameters of the system influence the overall perfor- mance of the scheme. The problem of understanding 6 Discussion the true computational cost of such PDMC schemes is more subtle than for classical discrete time MCMC PDMCmethodsareanovelandpromisingalternative schemes. Indeed, a PDMC scheme which maximizes to existing classical discrete time MCMC methods. rate of convergence to equilibrium or the speed of Due to their inherent non-reversibility, PDMC meth- mixing need not be an optimal choice if the average ods enjoy superior performance to their reversible switching rate (which is proportional to the number 8 of gradient evaluations per unit time), is very high. [9] P. Diaconis, S. Holmes, and R. Neal. Analysis of Further investigation is required to understand this anonreversibleMarkovchainsampler. Annals of delicate balance. Applied Probability, 10(3):726–752, 2000. [10] S. Duane, A.D. Kennedy, B. J. Pendleton, and Acknowlegements D.Roweth. HybridMonteCarlo. Physics Letters B, 195(2):216–222, 1987. The authors wish to thank Thibaut Lienart for read- ing an early version of this paper, and for useful [11] L. Fahrmeir and G. Tutz. Multivariate statisti- disussion. All authors thank the Alan Turing In- cal modelling based on generalized linear models. stitute and Lloyds registry foundation for support. 2001. Springer, New York. S.J.V. gratefully acknowledges funding through EP- [12] P.Fearnhead,J.Bierkens,M.Pollock,andG.O. SRC EP/N000188/1. J.B., P.F. and G.R. grate- Roberts. Piecewise deterministic markov pro- fully acknowledge EPSRC ilike grant EP/K014463/1. cesses for continuous-time monte carlo. arXiv A.B.D. acknowledges grant EP/L020564/1. preprint arXiv:1611.07873, 2016. References [13] N. Galbraith. On Event-Chain Monte Carlo Methods. Master’s thesis, Department of Statis- [1] J. H. Albert and S. Chib. Bayesian analy- tics, Oxford University, 2016. sis of binary and polychotomous response data. [14] J.Geweke. Exactinferenceintheinequalitycon- Journal of the American statistical Association, strained normal linear regression model. Journal 88(422):669–679, 1993. of Applied Econometrics, 1(2):127–141, 1986. [2] R. Bardenet, A. Doucet, and C. Holmes. On [15] Y. Guo and M. Berman. A comparison between markov chain monte carlo methods for tall data. subsetselectionandl1regularisationwithanap- arXiv preprint arXiv:1505.02827, 2015. plication in spectroscopy. Chemometrics and In- [3] S.Bellavia,M.Macconi,andB.Morini. Aninte- telligent Laboratory Systems, 118:127–138, 2012. rior point newton-like method for non-negative [16] J. Jacod and A. Shiryaev. Limit theorems for least-squares problems with degenerate solu- stochastic processes, volume 288. Springer Sci- tion. Numerical Linear Algebra with Applica- ence & Business Media, 2013. tions, 13(10):825–846, 2006. [4] E. P. Bernard, W. Krauth, and D. B. Wilson. [17] H. Kim and H. Park. Sparse non-negative ma- Event-chain Monte Carlo algorithms for hard- trixfactorizationsviaalternatingnon-negativity- sphere systems. Physical Review E, 80(5):56704, constrained least squares for microarray data nov 2009. analysis. Bioinformatics, 23(12):1495–1502, 2007. [5] J. Bierkens, P. Fearnhead, and G. Roberts. The Zig-ZagProcessandSuper-EfficientSamplingfor [18] P. A. Lewis and G. S. Shedler. Simulation of Bayesian Analysis of Big Data, 2016. nonhomogeneous poisson processes by thinning. Naval Research Logistics Quarterly, 26(3):403– [6] J. Bierkens and G. Roberts. A piecewise de- 413, 1979. terministic scaling limit of lifted metropolis- hastingsinthecurie-weissmodel. arXiv preprint [19] Y. Nishikawa, M. Michel, W. Krauth, and arXiv:1509.00302, 2015. K. Hukushima. Event-chain algorithm for the Heisenberg model. Physical Review E, [7] A.Bouchard-Cˆot´e,S.J.Vollmer,andA.Doucet. 92(6):63306, 2015. TheBouncyParticleSampler: ANon-Reversible Rejection-Free Markov Chain Monte Carlo [20] A.Pakman,D.Gilboa,D.Carlson,andL.Panin- Method. arXiv preprint arXiv:1510.02451, 2015. ski. Stochastic bouncy particle sampler. arXiv preprint arXiv:1609.00770, 2016. [8] M. H. A. Davis. Piecewise-deterministic markov processes: A general class of non-diffusion [21] A. Pakman and L. Paninski. Exact hamiltonian stochastic models. Journal of the Royal Statis- montecarlofortruncatedmultivariategaussians. tical Society. Series B (Methodological), pages Journal of Computational and Graphical Statis- 353–388, 1984. tics, 23(2):518–542, 2014. 9 [22] S. Patterson and Y. W. Teh. Stochastic gradient riemannian langevin dynamics on the probabil- ity simplex. In Advances in Neural Information Processing Systems, pages 3102–3110, 2013. [23] E. A. J. F. Peters and G. De With. Rejection- freeMonteCarlosamplingforgeneralpotentials. Physical Review E, 85(2):1–5, 2012. [24] L. E. Train. Discrete choice methods with simu- lation. Cambridge university press, 2009. [25] Gerhard Tutz and Jan Gertheiss. Rating scales as predictors—the old question of scale level and some answers. Psychometrika, 79(3):357–376, 2014. [26] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference onMachineLearning(ICML-11),pages681–688, 2011. 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.