ebook img

Dynamic temperature selection for parallel-tempering in Markov chain Monte Carlo simulations PDF

1.9 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Dynamic temperature selection for parallel-tempering in Markov chain Monte Carlo simulations

MNRAS000,1–21(2002) Preprint16March2016 CompiledusingMNRASLATEXstylefilev3.0 Dynamic temperature selection for parallel-tempering in Markov chain Monte Carlo simulations W. D. Vousden(cid:63), W. M. Farr(cid:63) and I. Mandel(cid:63) University of Birmingham, Edgbaston, Birmingham, B15 2TT, United Kingdom 16March2016 6 1 0 ABSTRACT 2 ModernproblemsinastronomicalBayesianinferencerequireefficientmethodsforsam- r pling from complex, high-dimensional, often multi-modal probability distributions. a Most popular methods, such as Markov chain Monte Carlo sampling, perform poorly M on strongly multi-modal probability distributions, rarely jumping between modes or settling on just one mode without finding others. Parallel tempering addresses this 5 problembysamplingsimultaneouslywithseparateMarkovchainsfromtemperedver- 1 sions of the target distribution with reduced contrast levels. Gaps between modes can betraversedathighertemperatures,whileindividualmodescanbeefficientlyexplored ] M at lower temperatures. In this paper, we investigate how one might choose the ladder oftemperaturestoachievemoreefficientsampling,asmeasuredbytheautocorrelation I . time of the sampler. In particular, we present a simple, easily-implemented algorithm h for dynamically adapting the temperature configuration of a sampler while sampling. p Thisalgorithmdynamicallyadjuststhetemperaturespacingtoachieveauniformrate - o ofexchangesbetweenchainsatneighbouringtemperatures.Wecomparethealgorithm r to conventional geometric temperature configurations on a number of test distribu- t s tions and on an astrophysical inference problem, reporting efficiency gains by a factor a of 1.2–2.5 over a well-chosen geometric temperature configuration and by a factor of [ 1.5–5 over a poorly chosen configuration. On all of these problems a sampler using 3 thedynamicaladaptationstoachieveuniformacceptanceratiosbetweenneighbouring v chains outperforms one that does not. 3 2 8 5 0 1 INTRODUCTION space(Geyer1991).ParalleltemperedMCMCsamplersare . thus particularly well-suited to sampling posterior distribu- 1 Many problems in astronomical data analysis and Bayesian tions with well-separated modes, where a regular MCMC 0 statistical inference demand the characterisation of high- 5 samplerwouldtakemanyiterationstofinditswaybetween dimensional probability distributions with complicated 1 modes. structures.Lackinganalyticforms,thesedistributionsmust : v be explored numerically, usually via Monte Carlo methods. Anopenproblemintheapplicationofparalleltemper- i ParalleltemperedMarkovchainMonteCarlo(MCMC), X ing is selecting a specification, or ladder, of temperatures a development on standard MCMC, uses several Markov thatminimisestheautocorrelationtime(ACT)ofthechain r chains in parallel to explore a target distribution at differ- a samplingtheposteriordistributionofinterest.Theefficiency ent“temperatures”(Earl&Deem2005;Swendsen&Wang ofagivenladderhingescriticallyontherateatwhichitcan 1986;Geyer1991).Asthetemperatureincreases,theposte- transferthepositionsinparameterspaceofsamplesbetween riordistributionasymptotestotheprior,allowingachainto high and low temperatures. efficientlyexplorethewholepriorvolumewithoutbecoming stuckinregionsoftheparameterspacewithhighprobability Inthispaperwepresentasimplealgorithmthatadapts density.Atlowertemperatures,achaincanmoreefficiently the temperature ladder of an ensemble-based parallel tem- sample from such a high-probability region. Meanwhile, ex- peredMCMCsampler(Goodman&Weare2010)suchthat change of positions between chains allows colder chains to the rate of exchange between chains is uniform over the migrate between widely separated modes in the parameter entire ladder. The algorithm is easy to implement in ex- isting code, and we provide an example implementation for the emcee sampler of Foreman-Mackey et al. (2013). (cid:63) E-mail: [email protected] (WDV); [email protected] We also present an implementation for traditional, non- (WMF);[email protected](IM) ensemble MCMC samplers where a single walker explores c 2002TheAuthors (cid:13) 2 W. D. Vousden, W. M. Farr and I. Mandel theparameterspace.Wereportfavourableresultsfromsuch their positions in the parameter space, so that chain i is at an implementation, along with a number of caveats. θ(cid:126) and chain j is at θ(cid:126). Since the hottest chains can access j i InSection2wedescribetheparalleltemperingformal- all of the modes of π (as long as T is chosen appropri- max ism and lay out the requirements for a good temperature ately),theirlocationspropagatetocolderchains,ultimately ladder. We discuss previous work on temperature selection allowingtheT =1(cold)chaintoefficientlyexploretheen- andsuggestadefinitionofladderoptimalitythat,forsimple tire target distribution. At the same time, the positions of cases,proposesageometricspacingoftemperatures.Foril- the colder chains propagate upward to higher temperature lustration,weapplytheseideasinSection2.2tothesimple chains, where they are free to explore the entire prior vol- example of an unbounded Gaussian posterior distribution. ume. InSection3wedescribethealgorithmmentionedabove Thegoalinchoosinganeffectiveladderoftemperatures and then apply it in Section 4 to a variety of test distribu- is to minimise the ACT of the cold chain (our measure of tions. We show that, while our temperature selection strat- theefficiencyofthesampler).Therequirementstothisend egyisnotnecessarilyoptimalintheACTofthesampler,it are two-fold: nonethelessimprovestheACTcomparedtothesimplegeo- (i) T must be large enough that isolated modes of L metricspacingthatisconventionalintheliterature(Earl& max broaden sufficiently that an individual MCMC chain can Deem 2005; Sugita & Okamoto 1999; Kofke 2002, 2004) by efficiently access all of these modes when sampling under factors of >1.2 for our test cases. the tempered posterior π in (1) at T = T . We denote InSection5weapplyourmethodtotheastrophysically T max this temperature T . motivated – and more challenging – problem of parameter prior (ii) Since A depends on β − β , the differences be- estimation in the setting of gravitational wave (GW) data i,j i j tweentemperaturesmustbesmallenoughthatneighbouring analysis, using a single-walker MCMC sampler. We demon- chains can communicate their positions efficiently with one strate a reduction in ACT by as much as a factor of 2 over another. a geometric ladder, despite the caveats mentioned in Sec- tion 6.1. Both requirements depend sensitively on the (unknown) WeconcludeinSection6withadiscussionofourresults shape of the target distribution, so it is difficult to select and suggestions for further research. temperatures appropriately in advance. In choosing T , one must know roughly the relative max sizeandseparationofthemodestobeexplored.Asanexam- 2 PARALLEL TEMPERING ple,consideraone-dimensionallikelihoodwithtwoGaussian modesofwidthσ=1andcentresµ=±10.Inordertopre- Parallel tempering (Earl & Deem 2005; Swendsen & Wang ventasamplerfromgettingstuckononeofthemodes,they 1986;Geyer1991)isadevelopmentonthestandardMCMC must be widened to roughly the separation between them2, formalismthatusesseveralMarkovchainsinparalleltosam- givingσ=O(10).ThewidthofaGaussianpeakscaleswith ple from tempered versions of the posterior distribution π, √ the temperature as T, so we might choose T = 100; max Figure 1 illustrates the resulting coalescence of the modes. π (θ(cid:126))∝L(θ(cid:126))1/Tp(θ(cid:126)), (1) A different configuration of modes will, of course, require a T different T . max whereLandparerespectivelythelikelihoodandpriordis- Ontheotherhand,theswapacceptanceprobabilityA i,j tributions. depends on the distribution of likelihood values at temper- For high T, individual peaks in L become flatter atures T and T . In the case of a likelihood distribution i j and broader, making the distribution easier to sample via comprising a single Gaussian mode, the time-averaged ac- MCMC. A set of N chains is assigned temperatures in a ceptance ratios between chains, E[A ], can be computed i,j ladderT <T <...<T ,withT =1(thetargettemper- 1 2 N 1 analytically (see Section 2.2). ature).Thetemperaturesaretypicallygeometricallyspaced In general, we don’t know in advance what the target from 1 up to some T , decided in advance (a convention max distribution looks like, and so choosing an effective ladder that we shall discuss in more detail in Section 2.2). becomes a heuristic exercise, relying largely on educated Each chain is allowed to explore its tempered distribu- guesswork.Wearethereforemotivatedtofindsomemethod tionπ underanMCMCalgorithm,whileatpre-determined T of empirically determining an effective ladder. intervals “swaps” are proposed between (usually adjacent1) pairs of chains and accepted with probability (cid:32)L(θ(cid:126))(cid:33)βj−βi  2.1 Ladder selection A =min i , 1 , (2) i,j  L(θ(cid:126)j)  For an n-dimensional problem, the conventional choice of temperaturesisageometricallyspacedladderconstructedso where θ(cid:126) is the current position in the parameter space of that approximately 23% of swaps proposed between chains i the ith chain and β ≡ 1/T is the inverse temperature of will be accepted when sampling from an n-dimensional, i i unbounded Gaussian distribution (Earl & Deem 2005; ter this chain. When a swap is accepted, the chains exchange 1 Inprinciple,swapscanbeproposedbetweenanypairofchains. 2 Ideally, the modes must also be widened enough that they However,sincetheswapacceptanceratio(2)decaysexponentially extendtotheedgesofthepriorvolume.Alikelihooddistribution with the separation of inverse temperatures, ∆β, it is generally withasinglemodethatoccupiesonlyasmallfractionoftheprior sufficientonlytoproposeswapsbetweenadjacentchains. volumewilltakealongtimetoburnin. MNRAS000,1–21(2002) Temperature dynamics for PTMCMC samplers 3 requirement (i), discussed above, that the temperature lad- 0.20 der should reach a T sufficient for all of the modes of L T =1 max to mix (specified by T ). T =3.16 prior Kofke(2002)discussestheselectionoftemperaturelad- 0.15 T =10 dersinthecontextofmolecularsimulations.Heshowsthat, T =31.6 in simulations of such thermodynamic systems, there is a T =100 close relation between the specific heat of the system, C , x) V ( 0.10 and the acceptance ratios between adjacent temperatures. T π Inparticular,whenC isconstantwithrespecttoT overa V giventemperatureinterval,thenageometricspacingoftem- 0.05 peratures on that interval yields uniform acceptance ratios between adjacent temperatures. In the language of thermodynamics, the energy of the system, U, is analogous to −logL, and an analogue to the 0.00 20 15 10 5 0 5 10 15 20 specific heat can therefore be defined as − − − − x C (T)=− d E[logL] , (3) V dT T where E[·] denotes the expectation operator over θ(cid:126)under T Figure1. Aone-dimensionaltargetdistributionwithtwoGaus- thedistributionπ (θ(cid:126)).E[logL] isthereforetheexpectation T T sian peaks of width σ =1 at µ= 10 normalised for a uniform of the untempered log likelihood collected when sampling ± priorover[−20,20].AtT =100,thepeaksbroadentoσ=10,al- from the posterior at temperature T. lowinganMCMCchainsamplingatthistemperaturetofindboth In the context of Bayesian inference, Kofke’s result modesquickly,startingfromanywherewithinthepriorvolume. thereforetellsusthatifthemeanloglikelihoodcollectedby asamplerrespondslinearlytochangesintemperature,then ageometricallyspacedtemperatureladderwillachieveuni- Braak & Vrugt 2008; Roberts & Rosenthal 1998). We shall formacceptanceratiosbetweenadjacentchains.Conversely, discuss this convention in more detail in Section 2.2. temperature intervals on which E[logL] is strongly non- A consequence of this strategy is that increasing the T linear in T represent a phase transition that will require number of chains N does not improve communication be- morecarefulplacementoftemperatures,asweshallshowin tweenexistingchains,whichisdeterminedbyE[A ] =0.23. i,j Section 4. Instead,addingnewchainsextendstheladdertohighertem- peratures. This may be appropriate for an unbounded pos- terior,butforarealisticproblemwithafinitepriorvolume, 2.2 The ideal Gaussian distribution: a simple the acceptance ratio between adjacent chains saturates to example ∼100% at some temperature T , at which the posterior prior πT begins to look like the prior p. In the simple case of a unimodal Gaussian likelihood under Forthisgeometricspacingscheme–whereTprior isun- a flat prior, the optimal temperature spacing at low tem- known – there is therefore an optimal number of chains, peratures–whereverylittlelikelihoodmassistruncatedby Nopt, such that Tprior ≈TNopt ≡Tmax. For N <Nopt none the prior – can be analysed by approximating the prior to ofthechainswillbesamplingfromtheprior(sothesampler beunbounded3.Weshowthat,forthistractableexample,a may not find all of the modes), while for N >Nopt we end geometric temperature spacing is consistent with both the upwithseveralchainssamplingredundantlyfromtheprior. uniform-A criterion and also with the alternative criterion SincewearegenerallyignorantofTpriorfortheproblem that the Kullback–Leibler (KL) divergence is uniform be- athand,wearemotivatedtofindanalternativetemperature tweenallpairsofadjacentchains.Weusetheexampletoil- selection strategy. lustratetherelationshipbetweentheanalyticaldistribution It has been suggested in the literature (Earl & Deem of logL, the acceptance ratio A , and the temperature T. i,j 2005; Sugita & Okamoto 1999; Kofke 2002, 2004) that one Weshallworkwithann-dimensionalunitGaussiancen- could select temperatures such that the acceptance ratios tred on the origin (the same result can be achieved for a Ai,j are uniform for all pairs (i,j) of adjacent chains, in general Gaussian through a simple change of coordinates). an attempt to ensure that each sample sequence θ(cid:126)(t) for Since the prior is uniform and unbounded, we can restrict t = 1,2,..., as it moves between chains, spends an equal attention to the likelihood distribution L. In this case, the amount of time at every temperature. Sugita & Okamoto probability density p˜for the values of logL(θ(cid:126)) collected by (1999)justifythisnotionexperimentally–inthecontextof the sampler is molecular dynamics – with test cases in which such a lad- der indeed performs well. They use an algorithm derived p˜(logL)= elogL(−logL)n2−1, (4) from that of Hukushima & Nemoto (1996), which selects Γ(n) 2 temperatures according to an iterative process for which a uniform-Aladderisafixedpoint.Earl&Deem(2005)pro- vide further references for similar methods of determining 3 Theapproximationbreaksdownathighertemperatures,where temperature ladders that yield a given a target acceptance boundaryeffectsbecomesignificant.Indeed,withnopriorbound- ratio (Rathore et al. 2005; Sanbonmatsu & Garc´ıa 2002; aries,thereisnoTprioratwhichthemodeisspreadovertheentire Schug et al. 2004). However, these methods do not address priorvolume. MNRAS000,1–21(2002) 4 W. D. Vousden, W. M. Farr and I. Mandel 0.5 1.0 T =1 n=1 T =2 n=2 0.4 0.8 T =4 n=3 T =8 n=4 ) 0.3 0.6 n=5 L g A] (lo E[ ˜pT 0.2 0.4 0.1 0.2 0.0 0.0 20 15 10 5 0 100 101 102 − − − − logL γ Figure 2. The distribution of logL under a three-dimensional, Figure 3. The time-averaged acceptance ratio, E[A], between unimodal Gaussian at various temperatures, where L is nor- twochainsofaPTMCMCsampleronaunimodal,n-dimensional malised so that logL((cid:126)0) = 0. As T , the variance of logL Gaussian likelihood distribution. The chains have temperatures → ∞ diverges.Thelegendisorderedtomatchtheverticalorderofthe T andγT.Thelinesareorderedverticallytomatchthelegend. lines’peaks. Note that the specific heat from (3) is a constant n/2, as where L is normalised so that logL((cid:126)0) = 0 and n is the expected. number of parameters. Since the acceptance ratio A depends on logL − At a temperature T, −logL simply follows a gamma i,j i logL , the more separate the distributions of logL and distributionΓ(α,β)withshapeparameterα=n/2andrate j i logL attheirrespectivetemperatures,T andT ,thelower parameter β = 1/T. Thus, for a chain sampling at tem- j i j the acceptance ratio between such chains will be. For two perature T, the log likelihood distribution is p˜ (logL) = T chains at temperatures T and γT, the separation of the Tp˜(logL/T). means of p˜ and p˜ , in units of the standard deviation Overlongtime-scales,theaverageacceptanceratiobe- T γT at T, will be tween chains i and j is (cid:114) (cid:90)(cid:90) E[logL]T −E[logL]γT =(γ−1) n. (7) E[Ai,j] = Ai,jp˜Ti(logLi)p˜Tj(logLj) dlogLi dlogLj (cid:112)Var[logL]T 2 (−∞,0]2 It follows that – for constant γ – as the dimension n (cid:32) 1 (cid:18)n+1(cid:19)(cid:33) increases, so the acceptance ratio between chains at tem- = √π 2n−1γi−,jn/2Γ 2 peratures T and γT falls. For a higher dimensional target distribution, therefore, a closer spacing of temperatures is (cid:32) (cid:18) (cid:19) · F˜ n,n;n +1;− 1 − required for a given acceptance ratio. 2 1 2 2 γ Formoregeneraldistributions,byconsideringtheover- i,j (cid:18) (cid:19)(cid:33) lapofp˜T(logL)atdifferenttemperatures,Falcioni&Deem n n γn F˜ ,n; +1;−γ +1, (1999) argue that the number of temperatures N required i,j2 1 2 2 i,j to efficiently sample the posterior distribution should scale √ (5) with ∆logL/ n, where ∆logL is the range of E[logL]T between T =1 and T =T . That is: where F˜ istheregularisedGausshypergeometricfunction prior 2 1 and γi,j = Tj/Ti is the ratio between the temperatures of N ∝ E[logL]1−√E[logL]Tprior. (8) twochains.SinceE[Ai,j]dependsonTiandTj onlythrough n theratioγ ,uniformacceptanceratiosbetweenalladjacent i,j Since the log likelihood range ∆logL itself depends on pairs of chains can be achieved with a geometric spacing of the dimension of the system n, it is difficult to apply this temperatures – where γ is constant – for a unimodal i,i+1 relationinpractice.However,fortheidealGaussian,wecan Gaussian likelihood. seefrom(6)that∆logLscaleswithn,andsoN scaleswith Thelogspacingrequiredforaparticularacceptancera- √ n, as we might expect. tio also depends on the dimension of the parameter space, with more parameters requiring a closer spacing of temper- atures, illustrated by Figure 3. This can be understood by 2.3 The Kullback–Leibler divergence lookingattheexpectationandvarianceoflogLatapartic- ular temperature (see Figure 2), Another measure of the optimal spacing of temperatures nT nT2 is the Kullback–Leibler (KL) divergence between adjacent E[logL]T =− 2 and Var[logL]T = 2 . (6) chains. The KL divergence from a hot distribution πTj to a MNRAS000,1–21(2002) Temperature dynamics for PTMCMC samplers 5 cold distribution π , Underthisscheme,finitechangestoS willalwayspreserve Ti i the correct ordering of temperatures (T <...<T ). DKL(πTi(cid:107)πTj)=(cid:90) πTi(θ(cid:126)) logππTi((θ(cid:126)θ(cid:126))) dθ(cid:126), (9) gap TSoaaccchoierdveintghetosatmheeaAcicefportaanlclechraaitn1ios,swbeetcwaeneNndrcivheaitnhei Tj i and those immediately above and below, to wit quantifies the information gained about the posterior with each step down the temperature ladder, from the prior p= ddSti =κ(t)(cid:2)Ai(t)−Ai+1(t)(cid:3), (12) π to the posterior π=π . It is reasonable to expect T= T=1 for 1 < i < N, where κ is a positive constant controlling ∞ that for an optimally-spaced ladder – that is, one with a the time-scale of the evolution of T . κ can be interpreted minimalACTonthecoldchainforagivennumberofchains i as the instantaneous exponential time-constant for temper- –theinformationgainshouldbeuniformforeverystepdown ature adjustments. The two extremal temperatures, T and the ladder. 1 T , are fixed (see below). For the example of the ideal Gaussian of Section 2.2, N Underthisscheme,chainiwillattempttoincreasethe the KL divergence is, straightforwardly, gap in temperature space between itself and chain (i+1) n(cid:18) 1 (cid:19) if swaps are accepted too often and close it when they are D (π (cid:107)π )= +logγ −1 . (10) KL Ti Tj 2 γ i,j accepted too seldom — and similarly for chain (i−1) — i,j equilibratingatA thatareuniformoveri.Therefore,foran i Liketheswapacceptanceratio,therefore,uniformKLdiver- appropriatechoiceofκ–discussedmomentarily–theserules gence over the entire ladder is also achieved by a geometric drive the chains i={2,...,N −1} toward even acceptance spacing of temperatures for the ideal Gaussian. spacing. Unfortunately, unlike the acceptance ratio, the KL di- However,inordertoefficientlysampleatargetdistribu- vergenceisdifficulttocomputenumericallywhilesampling, tionwithstronglyseparatedmodes(suchthatatraditional owing to the unknown – and temperature-dependent – evi- MCMC sampler would be unable to traverse the “valleys” dence (normalisation) values on π and π . Ti Tj betweenthem),TN mustbehighenoughthatthemodesare We henceforth assume that spacing temperatures for flattenedoutandthechaincanexploretheentireparameter uniform acceptance ratios is a reasonable approximation of spaceunhindered.Thisamountstothetopmostchainsam- a ladder that is optimal in the ACT of the cold chain. We plingfromthepriordistribution4,whichweachievetrivially makethisassumptiononfaithand,whilewebrieflyexamine by setting the inverse temperature of this chain as β =0. N its validity in Section 4.1 and Section 6.1, it invites a more This continuous system is discretised as careful study. (cid:2) (cid:3) S (t+1)−S (t)=κ(t) A (t)−A (t) , (13) i i i i+1 where A (t) are the acceptance ratios accumulated by the i sampler at the current iteration. 3 ADAPTIVE TEMPERATURE LADDERS The values of A are measured empirically at each it- i FromtheargumentsinSection2andthereferencestherein, eration as the fraction of swap proposals between chains we shall assume that uniformity of acceptance ratios pro- that were accepted. For a traditional sampler comprising videsagoodapproximationtotheoptimaltemperaturelad- one sample per chain, these will be either 0 or 1. For en- der for parallel tempering problems. In this section, we de- semble samplers, however, comprising n distinct walkers w scribeanalgorithmfordynamicallyadaptingchaintemper- per temperature, the measurements of A are less granular, i atures to achieve uniform acceptance ratios for inter-chain suchthatA ∈{x∈[0,1]|n x∈Z}.Ingeneral,fewerwalk- i w swaps. ers require a longer averaging time-scale – discussed below From (2), as 1/T −1/T → 0, A → 1, so in order – in order to smooth out this granularity. j i i,j toincreasetheexpectedacceptanceratiobetweenchains,it Importantly, the temperature adjustment scheme we suffices to move them closer together in temperature space; haveproposed–and,moregenerally,anyadaptivesampling conversely,toreduceE[A ],wecanpushthechainsapart. scheme – in fact violates the condition for detailed balance i,j We will henceforth adopt the notation that A ≡ A that ensures that an MCMC sampler will converge to the i i,i 1 − and that T < T , with T = 1 being the untempered or target distribution. Roberts & Rosenthal (2007) investigate i i+1 1 cold chain (which samples from the target distribution, π). theconditionsrequiredofsuchanadaptivesamplerforitto Here,A (t)aretheinstantaneousacceptanceratiosbetween be ergodic in the target distribution – that is, that it will i chains,butweshallshortlydescribethediscretecasewhere converge on long time-scales. They determine (from their empiricalmeasurementsofA arecollectedwitheachitera- Theorem1andCorollary4)thatdiminishingtheamplitude i tion of the sampler. of adaptations in the transition kernel with each iteration is sufficient for the sampler to be ergodic in the target dis- tribution. We therefore suppress temperature adjustments 3.1 Dynamics toensurethatthesamplerisMarkovianonsufficientlylong time-scales5. Our goal is to dynamically adjust the temperatures of the Therateofdiminutionoftemperatureadjustmentsisa chains to achieve uniform acceptance ratios as we sample thetargetdistribution.Wedefineourtemperaturedynamics in terms of the log of the temperature difference between 4 For analytic priors, this special case, where the likelihood is chains, ignored, can be treated separately by having the sampler draw independentsamplesdirectlyfromtheprior. Si ≡log(Ti−Ti 1). (11) 5 Inprinciple,ofcourse,wecouldstoptemperatureadjustments − MNRAS000,1–21(2002) 6 W. D. Vousden, W. M. Farr and I. Mandel trade-offbetweentherateofconvergenceofthetemperature Agoodchoiceofν dependsontheresponseofE[A ] to i ladder and that of the sampler itself toward its stationary changes in the relevant chains’ temperatures, and therefore distribution.Wemodulatethedynamicswithhyperbolicde- depends on the particular likelihood function that is being cay to suppress the dynamics on long time-scales, sampled. However, if E[A ] will eventually be of order, say, i 0.25,andwewantthemeasurementsofA tobebetween0.2 1 t i κ(t)= 0 , (14) and 0.3, then we should average A over at least 100 swap νt+t i 0 proposals, giving ν (cid:38)100/n . w where t is the time at which the temperature adjustments 0 Combiningthesecriteriaonν andt ,wethereforesug- 0 havebeenreducedtohalftheirinitialamplitude.Theinitial gest default parameter values of ν = 102/n and t = w 0 amplitudeofadjustmentsisinturnsetbyν,thetime-scale 103/n . w on which the temperatures evolve at early time. 3.2 Parameter choice 4 EXAMPLES Intheschemeof (13)and(14),therearetwoparametersto We have implemented the algorithm proposed above as a choose:t0 andν.Thedynamicaltimeparametertin(13)is modification to the ensemble sampler emcee of Foreman- measuredinunitsofintra-chainjumpsofthesampler,with Mackey et al. (2013). Our implementation can be found at temperature adjustments being made at every iteration. https://github.com/willvousden/ptemcee. The lag parameter t sets the time-scale for the atten- 0 In this section we apply our implementation to specific uation of temperature adjustments. This decay factor in κ examples in order to understand how and when the tradi- is included as a fail-safe mechanism to ensure that, even tional geometric spacing fails and how much the uniform-A fortargetdistributionsonwhichthetemperaturedynamics strategy might help us. We present the following test cases. fail to find an equilibrium set of temperatures, the ladder will always converge over long time-scales. This condition (i) InSection4.1wecomparetheuniform-Astrategyused guarantees that the sampler correctly explores the target by the temperature dynamics of Section 3 with the alter- distribution. native strategy of uniform KL divergence discussed in Sec- From (14), the time-scale of the dynamics at late time tion 2.2 on the example of a unimodal truncated Gaussian – when t(cid:29)t – is νt/t . To ensure that temperatures have likelihood. 0 0 time to find an equilibrium over the course of a run, we (ii) In Section 4.2 we test the dynamics on a more com- therefore require that t (cid:29) ν, so that the dynamics will plex, bimodal distribution for various choices of the num- 0 always be on a time-scale much shorter than the current ber of chains N. We compare the resulting ACTs of the run time. However, we should also ensure that, over the sampler with those of another sampler using a geomet- course of the run, the dynamical time-scale is longer than ric ladder whose maximum temperature is fixed such that the ACT of the sampler, so that the temperatures respond T ≈T . max prior tothecorrectposteriordistribution.Tothisend,werequire (iii) InSection4.3wetestthealgorithmagainstthemore that νN (cid:29) t , where N is the number of independent difficult egg-box distribution with 243 modes. For compari- τ 0 τ samples gathered over the course of the run. For example, son,wesamplefromthesamedistributionwithageometric if N =100, these two conditions are satisfied by t =10ν, ladderconstructedtoyield25%acceptanceratioswhenap- τ 0 and for our test cases, we have indeed found this choice to plied to the ideal Gaussian discussed in Section 2.2. work well. For all of these tests, ν = 102 and t = 103 are used Meanwhile,thetime-scaleofthedynamicsatearlytime 0 to control the dynamics in (14), while the sampler uses 100 – when t (cid:28) t – is ν. A good choice of ν should therefore 0 walkers. Note that these choices, while different from the ensurethatthesamplerisnotsusceptibletolargestatistical defaults proposed in Section 3.2, do satisfy the conditions errors on the measurements of the acceptance ratios A . i described in that section. Ingeneral,forn swapproposals,theacceptancecount s n A is a random variable that follows a binomial distribu- s i tion B(n ,E[A ]), so that A has variance s i i 4.1 Truncated Gaussian E[A ](1−E[A ]) Var[Ai] = i n i . Ourfirsttestcaseisann-dimensional,unimodal,unitGaus- s sian similar to that of Section 2.2 but with finite prior vol- Since the dynamical equations (13) are linear in A , they i ume.Thesimplicityofthiscaseadmitssomeexactanalysis will be driven by the means, E[A ], on long time-scales, as- i before recourse to numerics, which allows us to test the ap- suming that the noise in the system from counting errors – √ proximations made in Section 2.2. proportionalto1/ n –doesnotcauseshort-termchanges s At low temperatures, where the prior boundaries do in E[A ]. i not truncate much of the likelihood probability mass, the Given a sampler of n walkers, n swaps are proposed w w optimal temperature spacing should be similar to that of with each iteration, so that n = n ν. To ensure stable s w the ideal Gaussian. By imposing a step-like cut-off in the dynamics at early time, we should therefore choose n ν (cid:29) w prior at a radius of R, there will be some temperature at 1. which this approximation will fail and a geometric spacing becomes inappropriate. altogether once the temperatures have reached an equilibrium, For the likelihood we use the same distribution as in discardingtheprevioussamplesaspartoftheburn-in. Section2.2,whileforthepriorweuseauniformdistribution MNRAS000,1–21(2002) Temperature dynamics for PTMCMC samplers 7 overtheclosedn-ballofradiusR=30,centredontheorigin. 7 The likelihood and prior are defined by L(θ(cid:126))∝exp(cid:16)−1(cid:107)θ(cid:126)(cid:107)2(cid:17), (15) bits) 6 γγ==186.0.0 2 ( 5 γ=4.0 (cid:40) n p(θ(cid:126))∝ 1 if (cid:107)θ(cid:126)(cid:107)≤R, (16) gai 4 γ=2.0 0 otherwise, n o 3 ti where (cid:107)·(cid:107) is the Euclidean norm on Rn. Subsequently, the ma 2 r normalisedposteriorgeneratedby (15)and(16)attemper- o nf 1 ature T is I  0 πT(θ(cid:126))=(2πγ˜T(cid:18))n2−,n2R2TΓ2((cid:19)n2)exp(cid:16)−(cid:107)2θ(cid:126)T(cid:107)2(cid:17) if (cid:107)θ(cid:126)(cid:107)≤R, (17) 20 21 22 23 24 T2lo5w 26 27 28 29 210 0 otherwise, where γ˜(a,z) is the lower incomplete gamma function. Figure 4. The KL divergence, or information gain, from a hot In the low-temperature limit, this distribution con- chainattemperatureγTlowtoacolderchainattemperatureTlow, verges to the ideal Gaussian distribution. We should there- bothsamplingfrom(17)atn=5(solidlines).AsTlow→0,the foreexpecttheKLdivergenceforastepdownthetempera- informationgaintendstothatoftheidealGaussianofSection2.2 tureladdertoasymptoteto(10)asT →0,wheretheeffects (dashedlines).Thelinesareorderedverticallytomatchtheleg- ofthepriorboundaryarenegligible6.Indeed,theKLdiver- end. gence of (17) from T to T is available analytically as 2 1 (cid:16) (cid:17) (T −T )γ˜ 1+ n, R2 D =− 2 1 2 2T2 KL (cid:16) (cid:17) T γ˜ n, R2 2 2 2T1 transition) in order to achieve uniform A than would be (cid:18) (cid:19) γ˜(cid:16)n, R2(cid:17) (18) required for uniform DKL. There is therefore less risk of a + nlog T2 +log 2 2T2 . large gap in temperature across such a temperature range, 2 T1 γ˜(cid:16)n, R2(cid:17) atthecostof(potentially)slightlylessefficientcommunica- 2 2T1 tion across the rest of the ladder. The uniform-A criterion If we set T =γT (with γT (cid:28)1), then γ˜(a,z)→Γ(a) as for optimality is therefore conservative with respect to a 2 1 1 T →0, and the expression reduces to (10), as expected. uniform-D criterion. 1 KL Figure 4 illustrates this convergence for n = 5. The We can also visualise the ladder specification in terms point on this plot at which the solid line diverges from the of the density of chains over temperature. We define this dashed line, for each γ, predicts the temperature beyond density, in logT, as which a geometric spacing of temperatures is no longer op- timal (for optimality as defined by uniform KL divergence 1 1 η(logT)= dN = = , (19) between chains). This is caused by truncation of the tem- dlogT ∆logT logγ pered likelihood by the prior boundaries. Ofcourse,sincetheKLdivergencecannoteasilybeas- with γ =T /T , where T and T are the chain temper- i+1 i i+1 i sessed empirically by an MCMC sampler, and we must in- atures to either side of T. stead resort to using acceptance ratios, we would like to Figure 6 shows this density for a temperature ladder know how consistent these two schemes are outside the as- of 20 chains that is in equilibrium under the temperature sumptions of Section 2.2. dynamics of Section 3 (the N = 20 contour of Figure 5). Figure 5 shows contours of constant D , calculated KL The density exhibits the expected uniformity of γ for low from(18),andcontoursofconstantAi,illustratedbypoints temperatures but falls for T (cid:38)√80. The width σ of the unit representingtemperaturepairs(fromladdersselectedbythe GaussianattemperatureT is T,soatthistemperaturethe algorithm developed in Section 3). In the low temperature priorboundaryisat∼3σ.AtT =80,∼5%ofthelikelihood limit, as expected, both schemes select a geometric spac- mass is truncated – compared to < 0.1% for T = 40 and ing of temperatures consistent with the ideal Gaussian of ∼ 35% for T = 160 – indicating that the prior boundary Section 2.2 (i.e., the contours remain constant in γ). At becomes significant in this temperature regime. highertemperatures,bothschemesdepartfromthegeomet- Thisdropindensityreflectstheconvergenceofthetem- ric spacing, but they do so differently. The uniform accep- peredposteriordistribution,π ,towardthepriorasT →∞. T tance scheme displays a more gradual departure from a ge- Asπ becomesflatter,fewerchainsareneededperlogT to T ometric spacing than the contours of constant D . The KL maintain good communication. smaller γ selected by the uniform-A scheme outside the ge- Also shown on figure Figure 6 is the square root of the ometric regime, however, suggest that closer spacing is re- estimatedspecificheatC ofthesystemasdiscussedinSec- V quired in difficult temperature ranges (e.g., across a phase tion 2.1, which can be seen to track closely the logarithmic chain density η when appropriately normalised. While the 6 WhilewedonotconsiderT <1inoursimulations,thecaseof provenance of this relationship is unclear, it demonstrates T 0canequivalentlybethoughtofasR ,sincethewidth therelevanceofthespecificheatindetermininganeffective → →∞ oftheGaussianscaleswith√T. temperature ladder. MNRAS000,1–21(2002) 8 W. D. Vousden, W. M. Farr and I. Mandel 23 N =5 4.2 N =6 N =7 3.6 N =8 N =9 ) 22 N =10 3.0 ts bi N =12 ( n N =15 2.4 gai γ N =20 n o ti a 1.8 m r 21 nfo I 1.2 0.6 20 0.0 20 21 22 23 24 25 26 27 28 29 210 Tlow Figure 5. AcontourplotoftheKLdivergence,orinformationgain,fromahotchainattemperatureThigh=γTlow toacolderchain at temperature Tlow, both sampling from the Gaussian likelihood (17). The coloured lines show the equilibrium N-chain temperature laddersreachedbythetemperaturedynamicsalgorithmofSection3,wheretheacceptanceratioisthesamebetweenanypairofadjacent chains.Thepointsontheselinesrepresentpairsofadjacenttemperatures(T,γT)(excludingthetop-most,whereγ= ). ∞ an inefficient use of resources, it at least doesn’t drastically 3.5 inhibit communication between high temperatures and low temperatures. Instead, we now turn to a more complex, bi- 3.0 modallikelihooddistributionforwhichageometricspacing 2.5 mightcausebottlenecksinthecommunicationbetweenhigh and low temperatures. 2.0 We use a likelihood derived from the two-dimensional η Rosenbrock function f: 1.5 (cid:18) 1 1 (cid:19)1/Tp L(x,y)∝ + , (20) 1.0 c+f(x,y) c+f(−x,y) η (uniform-A) where 0.5 √CV (arbitraryscale) f(x,y)=(a−x)2+b(y−x2)2. (21) 0.0 100 101 102 Tp isapre-temperingfactorchosentoincreasethecon- T trastofthedistribution,makingithardertosample.When T (cid:28) 1, each mode is locally Gaussian, making the re- p sultscomparabletotheGaussianexampleconsideredinSec- tion 2.2. Figure 6. Orange: The density of chains per logT under the For the following tests, we use a = 4, b = 1, c = 0.1, truncated Gaussian distribution (17), where N = 20, n = 25, andT =10 3.Weuseaflatprioron[−10,10]×[−20,100]. and temperatures are chosen for uniform acceptance ratios be- p − tween chains. The chains have equilibrated to 77% acceptance. Figure 7 illustrates this likelihood over the prior volume. Blue:ThesquarerootofthespecificheatofthetruncatedGaus- siandistribution,normalisedtomatchthechaindensityη ofthe 4.2.1 Test: temperature evolution uniform-A ladder, between T1 and TN 1. The specific heat CV, from(3),isestimatedfromthesample−meansoflogLovermany As an illustrative example, we first tested the temperature runswithdifferenttemperatureladders. dynamics of Section 3 with the double Rosenbrock poste- rior distribution (20) using 13 chains. Figure 8 shows the evolution of the temperature ladder according to these dy- 4.2 Double Rosenbrock function namics, while Figure 9 shows the chain density η(logT) for The previous test demonstrated how a geometric ladder the equilibrated temperature ladder. spaces temperatures too closely at higher temperatures, as Whiletheequilibratedchainsaredistributeduniformly the prior boundary becomes significant. While this may be inlogT forT (cid:46)50,thereisadistinctpeakinη atT ≈800, MNRAS000,1–21(2002) Temperature dynamics for PTMCMC samplers 9 lower-temperaturechainscanstillcommunicatewithachain 100 0 sampling from the prior. Under this set-up, therefore, the ACT always decreases as N increases, per Figure 10. 80 −1500 To test the improvement in ACT, τ, conferred by our 60 temperature dynamics, we allowed emcee to explore the 3000 − target distribution (20) with different numbers of chains, L y 40 −4500 log Nsp,acuesdinlagddbeortsh.Tthheeruensuiflotrinmg-AAClTadsd,eτrs aannddτgeom,aerteripcalollty- geo acc 20 −6000 ted against N in Figure 10. In this example, an N-chain ladder dynamically 0 7500 − adapted for uniform acceptance ratios clearly outperforms 20 9000 a geometrically spaced ladder of the same size for all N. − − 10 5 0 5 10 The benefit of a uniform-A ladder is most pronounced − − x atlowN –i.e.,wheretherearefewchainsavailable.Inthis regime, the sampler will be more sensitive to phase transi- tions, since the bigger gaps in temperature could cause se- Figure 7. TheRosenbrockloglikelihood,from(20). vere bottlenecks in communication across the temperature ladder. When N is large, the differences in acceptance ratios where a simple geometric spacing of temperatures hinders between a geometric ladder and one chosen for uniform A communicationbetweenchains.Thispeakoccursataphase becomeslesssignificant.Inthiscase,thedifferencebetween transitionwherethetwomodesofthelikelihooddistribution thelimiting(minimum)acceptanceratioforaladderandthe begintomixandE[logL] changesrapidlywithT,indicated ladder’s average acceptance ratio is proportionally smaller. by the sharp change in specific heat in the bottom panel of In the case of the double Rosenbrock distribution (20), Figure 9. Since the shape of the likelihood distribution in we have found that, once the minimum acceptance ratio thisregimebecomesverysensitivetoT,ahigherdensityof for a geometric ladder (terminating at T =2×104) ex- max chains is needed to maintain a given acceptance ratio. We ceeds ∼ 10%, reallocating temperatures for uniform accep- also note that in the geometric regime (i.e., for low T) the tance ratios does not reduce the measured ACT by more specific heat is approximately n/2 = 1, with E[A] ≈ 57%, than25%.ThisoccurswhenN ≈7inthecurrentexample. consistent with the values derived for the ideal Gaussian Nonetheless, there remains an overall improvement in ACT from (6) and (5) respectively. regardless of N. Ultimately, however, the figure of merit for a tempera- Figure10alsoshows,inthemiddlepane,thetotalnum- turespecificationinaPTMCMCsimulationistheresulting ber of iterations per independent sample across all chains. ACTforthetargettemperature(T =1)ofthesampler.We This quantity, given by N ×τ, is proportional to the total must therefore test the performance of the sampler empiri- CPUtimeofthesimulation,whileτ itselfisproportionalto cally. the CPU time per chain, or wall time, of the simulation. In We use the term ACT to refer to the integrated auto- this instance, the CPU time of a run diminishes with N in correlation time discussed by Sokal (1997), which we esti- muchthesamefashionasthewalltimedoes.Thefractional mate according to the algorithm used in the acor package improvement in CPU time is of course the same as for wall (see Appendix A and http://www.math.nyu.edu/faculty/ time – τ /τ . geo acc goodman/software/acor/ for details). For the following tests,weusetheACTofthefirstparameter,x,asameasure of the efficiency of the sampler (since (20) is bimodal in x 4.2.3 Test: chain removal but unimodal in y). To determine whether a uniform-A temperature placement strategy is in fact close to optimal, we assess the contribu- 4.2.2 Test: improvement over a geometric ladder tion of each chain from such a temperature ladder to the efficiency of the sampler, as measured by its ACT. If this InSection2weclaimedthataimingforuniformacceptance contribution is equal for all chains, then we can conclude ratios between chains yields a good temperature ladder. thatitisindeedoptimaltohavethemallexchangingequally Specifically, we expect that a ladder selected for uniform – that is, with uniform acceptance ratios. acceptanceratiosshouldleadtoalowerACTfortheT =1 To this end, we conducted the following test: chain than that resulting from a plain geometric ladder. Thegeometricansatzthatweusehasafixedmaximum (i) Sample from (20) with N = 7 chains under the tem- temperature such that TN =2×104. As N increases, more peraturedynamicsofSection3untilthetemperatureshave chains are added between T1 and TN, maintaining the geo- equilibrated to (T1,...,T7) to give uniform acceptance ra- metricspacing.Underthisarrangement,theadditionofnew tios. temperaturesisnotredundantevenwhenTN isalreadyhigh (ii) Generate5newtestladders,eachof6chains,formed enough to sample from the prior; the additional chains in- byremovingtheith chainfromthatdeterminedabove–i.e., steadaidinter-chaincommunicationatlowertemperatures. (T ,...,T ,T ,...,T ) – for i=2,...,6. 1 i 1 i+1 7 Since TN is close to the temperature at which the posterior (iii) Sam−ple from (20) with each of these 5 test ladders becomes the prior, there is little CPU time wasted in sam- and calculate the ACTs on the cold chain, τ . test pling redundantly from the prior with several chains, while MNRAS000,1–21(2002) 10 W. D. Vousden, W. M. Farr and I. Mandel 105 104 e r u t 103 a r e mp 102 e T 101 100 100 101 102 103 104 105 106 Time 1.0 o 0.8 ti a r e 0.6 c n a t 0.4 p e c Ac 0.2 0.0 100 101 102 103 104 105 106 Time Figure 8. The evolution of ladder of 13 temperatures Ti and acceptance ratios Ai over an emcee run of 106 iterations under the Rosenbrocklikelihood(20).Chains1and13arenotshown,havingfixedtemperaturesT1=1andT13= . ∞ We can understand this behaviour by examining the 3.0 complete autocorrelation functions from which these ACTs η (uniform-A) areestimated.IllustratedinFigure12,theseautocorrelation 2.5 √CV (arbitraryscale) functionsexhibittwodistincttime-scales.Firstly,thereisa large autocorrelation for lags (cid:46) 100 for all i – particularly 2.0 i = 2 – corresponding to the ACT of the sampler within one of the two modes: that is, the time taken for the sam- η 1.5 pler to generate an independent sample without changing mode. Secondly, there is a visible hump in the autocorre- 1.0 lation function for 100 (cid:46) lag (cid:46) 2000, corresponding to the time taken for the sampler to migrate between modes. Re- moving the second chain from initial geometric ladder of 7 0.5 chainsincreasestheintra-modeACTinparticular,butdoes not affect the inter-mode ACT. Meanwhile, while removing 0.0 100 101 102 103 higher temperature chains pushes the secondary hump out- T wardtolargerlags,increasingtheinter-modeACTinstead. The overall autocorrelation time in which we are inter- ested, discussed by Sokal (1997) and in Appendix A, rep- resents the time between independent samples of the sys- Figure 9. Orange:TheequilibriumdensityofchainsperlogT tem. It is therefore set by the time-scale on which the sam- forthedoubleRosenbrockrunillustratedinFigure8,wherethe pler migrates to a new mode independently of the current arococetpotfanthcee srpateicoisfichahveeatsefottrletdhetodoAuibl≈e R0o.5s7en.bBrolucke:dTishtreibsuqtuiaorne, mode. Removing a chain at higher temperatures increases asdescribedinFigure6. the inter-modal ACT, and therefore damages the efficiency of the sampler. Nonetheless, all of the tested temperature ladders Figure 11 shows the ACTs for the cold chain result- yielded lower ACT than the default geometric ladder, de- ing from the test outlined above. While τ < τ < τ spite the geometric ladder being chosen with prior knowl- acc test geo forladdersofthesameN,τtest increaseswiththetempera- edge of Tprior. tureofthechainthatisremoved,suggestingthatadditional chains are more useful at higher temperatures. The sharp 4.3 Egg-box in five dimensions jumpinτ whenachainaboveT ≈200isremovedarises test fromthephasetransitionthatoccursasT approachesT , Totestthealgorithm’sperformanceonayetmorestrongly prior indicated by a peak in C (visible in Figure 9). multi-modaldistribution,weuseanegg-boxdistributionde- V MNRAS000,1–21(2002)

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.