Table Of ContentMNRAS000,1–21(2002) Preprint16March2016 CompiledusingMNRASLATEXstylefilev3.0
Dynamic temperature selection for parallel-tempering in
Markov chain Monte Carlo simulations
W. D. Vousden(cid:63), W. M. Farr(cid:63) and I. Mandel(cid:63)
University of Birmingham, Edgbaston, Birmingham, B15 2TT, United Kingdom
16March2016
6
1
0 ABSTRACT
2 ModernproblemsinastronomicalBayesianinferencerequireefficientmethodsforsam-
r pling from complex, high-dimensional, often multi-modal probability distributions.
a Most popular methods, such as Markov chain Monte Carlo sampling, perform poorly
M
on strongly multi-modal probability distributions, rarely jumping between modes or
settling on just one mode without finding others. Parallel tempering addresses this
5
problembysamplingsimultaneouslywithseparateMarkovchainsfromtemperedver-
1
sions of the target distribution with reduced contrast levels. Gaps between modes can
betraversedathighertemperatures,whileindividualmodescanbeefficientlyexplored
]
M at lower temperatures. In this paper, we investigate how one might choose the ladder
oftemperaturestoachievemoreefficientsampling,asmeasuredbytheautocorrelation
I
. time of the sampler. In particular, we present a simple, easily-implemented algorithm
h for dynamically adapting the temperature configuration of a sampler while sampling.
p
Thisalgorithmdynamicallyadjuststhetemperaturespacingtoachieveauniformrate
-
o ofexchangesbetweenchainsatneighbouringtemperatures.Wecomparethealgorithm
r to conventional geometric temperature configurations on a number of test distribu-
t
s tions and on an astrophysical inference problem, reporting efficiency gains by a factor
a of 1.2–2.5 over a well-chosen geometric temperature configuration and by a factor of
[
1.5–5 over a poorly chosen configuration. On all of these problems a sampler using
3 thedynamicaladaptationstoachieveuniformacceptanceratiosbetweenneighbouring
v chains outperforms one that does not.
3
2
8
5
0 1 INTRODUCTION space(Geyer1991).ParalleltemperedMCMCsamplersare
. thus particularly well-suited to sampling posterior distribu-
1 Many problems in astronomical data analysis and Bayesian
tions with well-separated modes, where a regular MCMC
0 statistical inference demand the characterisation of high-
5 samplerwouldtakemanyiterationstofinditswaybetween
dimensional probability distributions with complicated
1 modes.
structures.Lackinganalyticforms,thesedistributionsmust
:
v be explored numerically, usually via Monte Carlo methods.
Anopenproblemintheapplicationofparalleltemper-
i ParalleltemperedMarkovchainMonteCarlo(MCMC),
X ing is selecting a specification, or ladder, of temperatures
a development on standard MCMC, uses several Markov
thatminimisestheautocorrelationtime(ACT)ofthechain
r
chains in parallel to explore a target distribution at differ-
a samplingtheposteriordistributionofinterest.Theefficiency
ent“temperatures”(Earl&Deem2005;Swendsen&Wang
ofagivenladderhingescriticallyontherateatwhichitcan
1986;Geyer1991).Asthetemperatureincreases,theposte-
transferthepositionsinparameterspaceofsamplesbetween
riordistributionasymptotestotheprior,allowingachainto
high and low temperatures.
efficientlyexplorethewholepriorvolumewithoutbecoming
stuckinregionsoftheparameterspacewithhighprobability
Inthispaperwepresentasimplealgorithmthatadapts
density.Atlowertemperatures,achaincanmoreefficiently
the temperature ladder of an ensemble-based parallel tem-
sample from such a high-probability region. Meanwhile, ex-
peredMCMCsampler(Goodman&Weare2010)suchthat
change of positions between chains allows colder chains to
the rate of exchange between chains is uniform over the
migrate between widely separated modes in the parameter
entire ladder. The algorithm is easy to implement in ex-
isting code, and we provide an example implementation
for the emcee sampler of Foreman-Mackey et al. (2013).
(cid:63) E-mail: will@star.sr.bham.ac.uk (WDV); w.farr@bham.ac.uk We also present an implementation for traditional, non-
(WMF);imandel@star.sr.bham.ac.uk(IM) ensemble MCMC samplers where a single walker explores
c 2002TheAuthors
(cid:13)
2 W. D. Vousden, W. M. Farr and I. Mandel
theparameterspace.Wereportfavourableresultsfromsuch their positions in the parameter space, so that chain i is at
an implementation, along with a number of caveats. θ(cid:126) and chain j is at θ(cid:126). Since the hottest chains can access
j i
InSection2wedescribetheparalleltemperingformal- all of the modes of π (as long as T is chosen appropri-
max
ism and lay out the requirements for a good temperature ately),theirlocationspropagatetocolderchains,ultimately
ladder. We discuss previous work on temperature selection allowingtheT =1(cold)chaintoefficientlyexploretheen-
andsuggestadefinitionofladderoptimalitythat,forsimple tire target distribution. At the same time, the positions of
cases,proposesageometricspacingoftemperatures.Foril- the colder chains propagate upward to higher temperature
lustration,weapplytheseideasinSection2.2tothesimple chains, where they are free to explore the entire prior vol-
example of an unbounded Gaussian posterior distribution. ume.
InSection3wedescribethealgorithmmentionedabove Thegoalinchoosinganeffectiveladderoftemperatures
and then apply it in Section 4 to a variety of test distribu- is to minimise the ACT of the cold chain (our measure of
tions. We show that, while our temperature selection strat- theefficiencyofthesampler).Therequirementstothisend
egyisnotnecessarilyoptimalintheACTofthesampler,it are two-fold:
nonethelessimprovestheACTcomparedtothesimplegeo-
(i) T must be large enough that isolated modes of L
metricspacingthatisconventionalintheliterature(Earl& max
broaden sufficiently that an individual MCMC chain can
Deem 2005; Sugita & Okamoto 1999; Kofke 2002, 2004) by
efficiently access all of these modes when sampling under
factors of >1.2 for our test cases.
the tempered posterior π in (1) at T = T . We denote
InSection5weapplyourmethodtotheastrophysically T max
this temperature T .
motivated – and more challenging – problem of parameter prior
(ii) Since A depends on β − β , the differences be-
estimation in the setting of gravitational wave (GW) data i,j i j
tweentemperaturesmustbesmallenoughthatneighbouring
analysis, using a single-walker MCMC sampler. We demon-
chains can communicate their positions efficiently with one
strate a reduction in ACT by as much as a factor of 2 over
another.
a geometric ladder, despite the caveats mentioned in Sec-
tion 6.1. Both requirements depend sensitively on the (unknown)
WeconcludeinSection6withadiscussionofourresults shape of the target distribution, so it is difficult to select
and suggestions for further research. temperatures appropriately in advance.
In choosing T , one must know roughly the relative
max
sizeandseparationofthemodestobeexplored.Asanexam-
2 PARALLEL TEMPERING ple,consideraone-dimensionallikelihoodwithtwoGaussian
modesofwidthσ=1andcentresµ=±10.Inordertopre-
Parallel tempering (Earl & Deem 2005; Swendsen & Wang
ventasamplerfromgettingstuckononeofthemodes,they
1986;Geyer1991)isadevelopmentonthestandardMCMC
must be widened to roughly the separation between them2,
formalismthatusesseveralMarkovchainsinparalleltosam-
givingσ=O(10).ThewidthofaGaussianpeakscaleswith
ple from tempered versions of the posterior distribution π, √
the temperature as T, so we might choose T = 100;
max
Figure 1 illustrates the resulting coalescence of the modes.
π (θ(cid:126))∝L(θ(cid:126))1/Tp(θ(cid:126)), (1) A different configuration of modes will, of course, require a
T
different T .
max
whereLandparerespectivelythelikelihoodandpriordis-
Ontheotherhand,theswapacceptanceprobabilityA
i,j
tributions.
depends on the distribution of likelihood values at temper-
For high T, individual peaks in L become flatter
atures T and T . In the case of a likelihood distribution
i j
and broader, making the distribution easier to sample via
comprising a single Gaussian mode, the time-averaged ac-
MCMC. A set of N chains is assigned temperatures in a
ceptance ratios between chains, E[A ], can be computed
i,j
ladderT <T <...<T ,withT =1(thetargettemper-
1 2 N 1 analytically (see Section 2.2).
ature).Thetemperaturesaretypicallygeometricallyspaced
In general, we don’t know in advance what the target
from 1 up to some T , decided in advance (a convention
max distribution looks like, and so choosing an effective ladder
that we shall discuss in more detail in Section 2.2).
becomes a heuristic exercise, relying largely on educated
Each chain is allowed to explore its tempered distribu-
guesswork.Wearethereforemotivatedtofindsomemethod
tionπ underanMCMCalgorithm,whileatpre-determined
T of empirically determining an effective ladder.
intervals “swaps” are proposed between (usually adjacent1)
pairs of chains and accepted with probability
(cid:32)L(θ(cid:126))(cid:33)βj−βi 2.1 Ladder selection
A =min i , 1 , (2)
i,j L(θ(cid:126)j) For an n-dimensional problem, the conventional choice of
temperaturesisageometricallyspacedladderconstructedso
where θ(cid:126) is the current position in the parameter space of that approximately 23% of swaps proposed between chains
i
the ith chain and β ≡ 1/T is the inverse temperature of will be accepted when sampling from an n-dimensional,
i i
unbounded Gaussian distribution (Earl & Deem 2005; ter
this chain. When a swap is accepted, the chains exchange
1 Inprinciple,swapscanbeproposedbetweenanypairofchains. 2 Ideally, the modes must also be widened enough that they
However,sincetheswapacceptanceratio(2)decaysexponentially extendtotheedgesofthepriorvolume.Alikelihooddistribution
with the separation of inverse temperatures, ∆β, it is generally withasinglemodethatoccupiesonlyasmallfractionoftheprior
sufficientonlytoproposeswapsbetweenadjacentchains. volumewilltakealongtimetoburnin.
MNRAS000,1–21(2002)
Temperature dynamics for PTMCMC samplers 3
requirement (i), discussed above, that the temperature lad-
0.20
der should reach a T sufficient for all of the modes of L
T =1 max
to mix (specified by T ).
T =3.16 prior
Kofke(2002)discussestheselectionoftemperaturelad-
0.15 T =10 dersinthecontextofmolecularsimulations.Heshowsthat,
T =31.6
in simulations of such thermodynamic systems, there is a
T =100
close relation between the specific heat of the system, C ,
x) V
( 0.10 and the acceptance ratios between adjacent temperatures.
T
π Inparticular,whenC isconstantwithrespecttoT overa
V
giventemperatureinterval,thenageometricspacingoftem-
0.05 peratures on that interval yields uniform acceptance ratios
between adjacent temperatures.
In the language of thermodynamics, the energy of the
system, U, is analogous to −logL, and an analogue to the
0.00
20 15 10 5 0 5 10 15 20 specific heat can therefore be defined as
− − − −
x
C (T)=− d E[logL] , (3)
V dT T
where E[·] denotes the expectation operator over θ(cid:126)under
T
Figure1. Aone-dimensionaltargetdistributionwithtwoGaus- thedistributionπ (θ(cid:126)).E[logL] isthereforetheexpectation
T T
sian peaks of width σ =1 at µ= 10 normalised for a uniform of the untempered log likelihood collected when sampling
±
priorover[−20,20].AtT =100,thepeaksbroadentoσ=10,al- from the posterior at temperature T.
lowinganMCMCchainsamplingatthistemperaturetofindboth
In the context of Bayesian inference, Kofke’s result
modesquickly,startingfromanywherewithinthepriorvolume.
thereforetellsusthatifthemeanloglikelihoodcollectedby
asamplerrespondslinearlytochangesintemperature,then
ageometricallyspacedtemperatureladderwillachieveuni-
Braak & Vrugt 2008; Roberts & Rosenthal 1998). We shall
formacceptanceratiosbetweenadjacentchains.Conversely,
discuss this convention in more detail in Section 2.2.
temperature intervals on which E[logL] is strongly non-
A consequence of this strategy is that increasing the T
linear in T represent a phase transition that will require
number of chains N does not improve communication be-
morecarefulplacementoftemperatures,asweshallshowin
tweenexistingchains,whichisdeterminedbyE[A ] =0.23.
i,j Section 4.
Instead,addingnewchainsextendstheladdertohighertem-
peratures. This may be appropriate for an unbounded pos-
terior,butforarealisticproblemwithafinitepriorvolume,
2.2 The ideal Gaussian distribution: a simple
the acceptance ratio between adjacent chains saturates to
example
∼100% at some temperature T , at which the posterior
prior
πT begins to look like the prior p. In the simple case of a unimodal Gaussian likelihood under
Forthisgeometricspacingscheme–whereTprior isun- a flat prior, the optimal temperature spacing at low tem-
known – there is therefore an optimal number of chains, peratures–whereverylittlelikelihoodmassistruncatedby
Nopt, such that Tprior ≈TNopt ≡Tmax. For N <Nopt none the prior – can be analysed by approximating the prior to
ofthechainswillbesamplingfromtheprior(sothesampler beunbounded3.Weshowthat,forthistractableexample,a
may not find all of the modes), while for N >Nopt we end geometric temperature spacing is consistent with both the
upwithseveralchainssamplingredundantlyfromtheprior. uniform-A criterion and also with the alternative criterion
SincewearegenerallyignorantofTpriorfortheproblem that the Kullback–Leibler (KL) divergence is uniform be-
athand,wearemotivatedtofindanalternativetemperature tweenallpairsofadjacentchains.Weusetheexampletoil-
selection strategy. lustratetherelationshipbetweentheanalyticaldistribution
It has been suggested in the literature (Earl & Deem of logL, the acceptance ratio A , and the temperature T.
i,j
2005; Sugita & Okamoto 1999; Kofke 2002, 2004) that one Weshallworkwithann-dimensionalunitGaussiancen-
could select temperatures such that the acceptance ratios tred on the origin (the same result can be achieved for a
Ai,j are uniform for all pairs (i,j) of adjacent chains, in general Gaussian through a simple change of coordinates).
an attempt to ensure that each sample sequence θ(cid:126)(t) for Since the prior is uniform and unbounded, we can restrict
t = 1,2,..., as it moves between chains, spends an equal attention to the likelihood distribution L. In this case, the
amount of time at every temperature. Sugita & Okamoto probability density p˜for the values of logL(θ(cid:126)) collected by
(1999)justifythisnotionexperimentally–inthecontextof the sampler is
molecular dynamics – with test cases in which such a lad-
der indeed performs well. They use an algorithm derived p˜(logL)= elogL(−logL)n2−1, (4)
from that of Hukushima & Nemoto (1996), which selects Γ(n)
2
temperatures according to an iterative process for which a
uniform-Aladderisafixedpoint.Earl&Deem(2005)pro-
vide further references for similar methods of determining 3 Theapproximationbreaksdownathighertemperatures,where
temperature ladders that yield a given a target acceptance
boundaryeffectsbecomesignificant.Indeed,withnopriorbound-
ratio (Rathore et al. 2005; Sanbonmatsu & Garc´ıa 2002; aries,thereisnoTprioratwhichthemodeisspreadovertheentire
Schug et al. 2004). However, these methods do not address priorvolume.
MNRAS000,1–21(2002)
4 W. D. Vousden, W. M. Farr and I. Mandel
0.5 1.0
T =1 n=1
T =2 n=2
0.4 0.8
T =4 n=3
T =8 n=4
) 0.3 0.6 n=5
L
g A]
(lo E[
˜pT 0.2 0.4
0.1 0.2
0.0 0.0
20 15 10 5 0 100 101 102
− − − −
logL γ
Figure 2. The distribution of logL under a three-dimensional, Figure 3. The time-averaged acceptance ratio, E[A], between
unimodal Gaussian at various temperatures, where L is nor- twochainsofaPTMCMCsampleronaunimodal,n-dimensional
malised so that logL((cid:126)0) = 0. As T , the variance of logL Gaussian likelihood distribution. The chains have temperatures
→ ∞
diverges.Thelegendisorderedtomatchtheverticalorderofthe T andγT.Thelinesareorderedverticallytomatchthelegend.
lines’peaks.
Note that the specific heat from (3) is a constant n/2, as
where L is normalised so that logL((cid:126)0) = 0 and n is the
expected.
number of parameters.
Since the acceptance ratio A depends on logL −
At a temperature T, −logL simply follows a gamma i,j i
logL , the more separate the distributions of logL and
distributionΓ(α,β)withshapeparameterα=n/2andrate j i
logL attheirrespectivetemperatures,T andT ,thelower
parameter β = 1/T. Thus, for a chain sampling at tem- j i j
the acceptance ratio between such chains will be. For two
perature T, the log likelihood distribution is p˜ (logL) =
T chains at temperatures T and γT, the separation of the
Tp˜(logL/T).
means of p˜ and p˜ , in units of the standard deviation
Overlongtime-scales,theaverageacceptanceratiobe- T γT
at T, will be
tween chains i and j is
(cid:114)
(cid:90)(cid:90) E[logL]T −E[logL]γT =(γ−1) n. (7)
E[Ai,j] = Ai,jp˜Ti(logLi)p˜Tj(logLj) dlogLi dlogLj (cid:112)Var[logL]T 2
(−∞,0]2 It follows that – for constant γ – as the dimension n
(cid:32) 1 (cid:18)n+1(cid:19)(cid:33) increases, so the acceptance ratio between chains at tem-
= √π 2n−1γi−,jn/2Γ 2 peratures T and γT falls. For a higher dimensional target
distribution, therefore, a closer spacing of temperatures is
(cid:32) (cid:18) (cid:19)
· F˜ n,n;n +1;− 1 − required for a given acceptance ratio.
2 1 2 2 γ Formoregeneraldistributions,byconsideringtheover-
i,j
(cid:18) (cid:19)(cid:33) lapofp˜T(logL)atdifferenttemperatures,Falcioni&Deem
n n
γn F˜ ,n; +1;−γ +1, (1999) argue that the number of temperatures N required
i,j2 1 2 2 i,j to efficiently sample the posterior distribution should scale
√
(5) with ∆logL/ n, where ∆logL is the range of E[logL]T
between T =1 and T =T . That is:
where F˜ istheregularisedGausshypergeometricfunction prior
2 1
and γi,j = Tj/Ti is the ratio between the temperatures of N ∝ E[logL]1−√E[logL]Tprior. (8)
twochains.SinceE[Ai,j]dependsonTiandTj onlythrough n
theratioγ ,uniformacceptanceratiosbetweenalladjacent
i,j Since the log likelihood range ∆logL itself depends on
pairs of chains can be achieved with a geometric spacing of
the dimension of the system n, it is difficult to apply this
temperatures – where γ is constant – for a unimodal
i,i+1 relationinpractice.However,fortheidealGaussian,wecan
Gaussian likelihood.
seefrom(6)that∆logLscaleswithn,andsoN scaleswith
Thelogspacingrequiredforaparticularacceptancera- √
n, as we might expect.
tio also depends on the dimension of the parameter space,
with more parameters requiring a closer spacing of temper-
atures, illustrated by Figure 3. This can be understood by
2.3 The Kullback–Leibler divergence
lookingattheexpectationandvarianceoflogLatapartic-
ular temperature (see Figure 2), Another measure of the optimal spacing of temperatures
nT nT2 is the Kullback–Leibler (KL) divergence between adjacent
E[logL]T =− 2 and Var[logL]T = 2 . (6) chains. The KL divergence from a hot distribution πTj to a
MNRAS000,1–21(2002)
Temperature dynamics for PTMCMC samplers 5
cold distribution π , Underthisscheme,finitechangestoS willalwayspreserve
Ti i
the correct ordering of temperatures (T <...<T ).
DKL(πTi(cid:107)πTj)=(cid:90) πTi(θ(cid:126)) logππTi((θ(cid:126)θ(cid:126))) dθ(cid:126), (9) gap TSoaaccchoierdveintghetosatmheeaAcicefportaanlclechraaitn1ios,swbeetcwaeneNndrcivheaitnhei
Tj i
and those immediately above and below, to wit
quantifies the information gained about the posterior with
each step down the temperature ladder, from the prior p= ddSti =κ(t)(cid:2)Ai(t)−Ai+1(t)(cid:3), (12)
π to the posterior π=π . It is reasonable to expect
T= T=1 for 1 < i < N, where κ is a positive constant controlling
∞
that for an optimally-spaced ladder – that is, one with a
the time-scale of the evolution of T . κ can be interpreted
minimalACTonthecoldchainforagivennumberofchains i
as the instantaneous exponential time-constant for temper-
–theinformationgainshouldbeuniformforeverystepdown
ature adjustments. The two extremal temperatures, T and
the ladder. 1
T , are fixed (see below).
For the example of the ideal Gaussian of Section 2.2, N
Underthisscheme,chainiwillattempttoincreasethe
the KL divergence is, straightforwardly,
gap in temperature space between itself and chain (i+1)
n(cid:18) 1 (cid:19) if swaps are accepted too often and close it when they are
D (π (cid:107)π )= +logγ −1 . (10)
KL Ti Tj 2 γ i,j accepted too seldom — and similarly for chain (i−1) —
i,j
equilibratingatA thatareuniformoveri.Therefore,foran
i
Liketheswapacceptanceratio,therefore,uniformKLdiver-
appropriatechoiceofκ–discussedmomentarily–theserules
gence over the entire ladder is also achieved by a geometric
drive the chains i={2,...,N −1} toward even acceptance
spacing of temperatures for the ideal Gaussian.
spacing.
Unfortunately, unlike the acceptance ratio, the KL di-
However,inordertoefficientlysampleatargetdistribu-
vergenceisdifficulttocomputenumericallywhilesampling,
tionwithstronglyseparatedmodes(suchthatatraditional
owing to the unknown – and temperature-dependent – evi-
MCMC sampler would be unable to traverse the “valleys”
dence (normalisation) values on π and π .
Ti Tj betweenthem),TN mustbehighenoughthatthemodesare
We henceforth assume that spacing temperatures for flattenedoutandthechaincanexploretheentireparameter
uniform acceptance ratios is a reasonable approximation of spaceunhindered.Thisamountstothetopmostchainsam-
a ladder that is optimal in the ACT of the cold chain. We plingfromthepriordistribution4,whichweachievetrivially
makethisassumptiononfaithand,whilewebrieflyexamine by setting the inverse temperature of this chain as β =0.
N
its validity in Section 4.1 and Section 6.1, it invites a more This continuous system is discretised as
careful study.
(cid:2) (cid:3)
S (t+1)−S (t)=κ(t) A (t)−A (t) , (13)
i i i i+1
where A (t) are the acceptance ratios accumulated by the
i
sampler at the current iteration.
3 ADAPTIVE TEMPERATURE LADDERS
The values of A are measured empirically at each it-
i
FromtheargumentsinSection2andthereferencestherein, eration as the fraction of swap proposals between chains
we shall assume that uniformity of acceptance ratios pro- that were accepted. For a traditional sampler comprising
videsagoodapproximationtotheoptimaltemperaturelad- one sample per chain, these will be either 0 or 1. For en-
der for parallel tempering problems. In this section, we de- semble samplers, however, comprising n distinct walkers
w
scribeanalgorithmfordynamicallyadaptingchaintemper- per temperature, the measurements of A are less granular,
i
atures to achieve uniform acceptance ratios for inter-chain suchthatA ∈{x∈[0,1]|n x∈Z}.Ingeneral,fewerwalk-
i w
swaps. ers require a longer averaging time-scale – discussed below
From (2), as 1/T −1/T → 0, A → 1, so in order – in order to smooth out this granularity.
j i i,j
toincreasetheexpectedacceptanceratiobetweenchains,it Importantly, the temperature adjustment scheme we
suffices to move them closer together in temperature space; haveproposed–and,moregenerally,anyadaptivesampling
conversely,toreduceE[A ],wecanpushthechainsapart. scheme – in fact violates the condition for detailed balance
i,j
We will henceforth adopt the notation that A ≡ A that ensures that an MCMC sampler will converge to the
i i,i 1
−
and that T < T , with T = 1 being the untempered or target distribution. Roberts & Rosenthal (2007) investigate
i i+1 1
cold chain (which samples from the target distribution, π). theconditionsrequiredofsuchanadaptivesamplerforitto
Here,A (t)aretheinstantaneousacceptanceratiosbetween be ergodic in the target distribution – that is, that it will
i
chains,butweshallshortlydescribethediscretecasewhere converge on long time-scales. They determine (from their
empiricalmeasurementsofA arecollectedwitheachitera- Theorem1andCorollary4)thatdiminishingtheamplitude
i
tion of the sampler. of adaptations in the transition kernel with each iteration
is sufficient for the sampler to be ergodic in the target dis-
tribution. We therefore suppress temperature adjustments
3.1 Dynamics toensurethatthesamplerisMarkovianonsufficientlylong
time-scales5.
Our goal is to dynamically adjust the temperatures of the
Therateofdiminutionoftemperatureadjustmentsisa
chains to achieve uniform acceptance ratios as we sample
thetargetdistribution.Wedefineourtemperaturedynamics
in terms of the log of the temperature difference between 4 For analytic priors, this special case, where the likelihood is
chains, ignored, can be treated separately by having the sampler draw
independentsamplesdirectlyfromtheprior.
Si ≡log(Ti−Ti 1). (11) 5 Inprinciple,ofcourse,wecouldstoptemperatureadjustments
−
MNRAS000,1–21(2002)
6 W. D. Vousden, W. M. Farr and I. Mandel
trade-offbetweentherateofconvergenceofthetemperature Agoodchoiceofν dependsontheresponseofE[A ] to
i
ladder and that of the sampler itself toward its stationary changes in the relevant chains’ temperatures, and therefore
distribution.Wemodulatethedynamicswithhyperbolicde- depends on the particular likelihood function that is being
cay to suppress the dynamics on long time-scales, sampled. However, if E[A ] will eventually be of order, say,
i
0.25,andwewantthemeasurementsofA tobebetween0.2
1 t i
κ(t)= 0 , (14) and 0.3, then we should average A over at least 100 swap
νt+t i
0 proposals, giving ν (cid:38)100/n .
w
where t is the time at which the temperature adjustments
0 Combiningthesecriteriaonν andt ,wethereforesug-
0
havebeenreducedtohalftheirinitialamplitude.Theinitial gest default parameter values of ν = 102/n and t =
w 0
amplitudeofadjustmentsisinturnsetbyν,thetime-scale 103/n .
w
on which the temperatures evolve at early time.
3.2 Parameter choice
4 EXAMPLES
Intheschemeof (13)and(14),therearetwoparametersto
We have implemented the algorithm proposed above as a
choose:t0 andν.Thedynamicaltimeparametertin(13)is modification to the ensemble sampler emcee of Foreman-
measuredinunitsofintra-chainjumpsofthesampler,with
Mackey et al. (2013). Our implementation can be found at
temperature adjustments being made at every iteration.
https://github.com/willvousden/ptemcee.
The lag parameter t sets the time-scale for the atten-
0 In this section we apply our implementation to specific
uation of temperature adjustments. This decay factor in κ
examples in order to understand how and when the tradi-
is included as a fail-safe mechanism to ensure that, even
tional geometric spacing fails and how much the uniform-A
fortargetdistributionsonwhichthetemperaturedynamics
strategy might help us. We present the following test cases.
fail to find an equilibrium set of temperatures, the ladder
will always converge over long time-scales. This condition (i) InSection4.1wecomparetheuniform-Astrategyused
guarantees that the sampler correctly explores the target by the temperature dynamics of Section 3 with the alter-
distribution. native strategy of uniform KL divergence discussed in Sec-
From (14), the time-scale of the dynamics at late time tion 2.2 on the example of a unimodal truncated Gaussian
– when t(cid:29)t – is νt/t . To ensure that temperatures have likelihood.
0 0
time to find an equilibrium over the course of a run, we (ii) In Section 4.2 we test the dynamics on a more com-
therefore require that t (cid:29) ν, so that the dynamics will plex, bimodal distribution for various choices of the num-
0
always be on a time-scale much shorter than the current ber of chains N. We compare the resulting ACTs of the
run time. However, we should also ensure that, over the sampler with those of another sampler using a geomet-
course of the run, the dynamical time-scale is longer than ric ladder whose maximum temperature is fixed such that
the ACT of the sampler, so that the temperatures respond T ≈T .
max prior
tothecorrectposteriordistribution.Tothisend,werequire (iii) InSection4.3wetestthealgorithmagainstthemore
that νN (cid:29) t , where N is the number of independent difficult egg-box distribution with 243 modes. For compari-
τ 0 τ
samples gathered over the course of the run. For example, son,wesamplefromthesamedistributionwithageometric
if N =100, these two conditions are satisfied by t =10ν, ladderconstructedtoyield25%acceptanceratioswhenap-
τ 0
and for our test cases, we have indeed found this choice to plied to the ideal Gaussian discussed in Section 2.2.
work well.
For all of these tests, ν = 102 and t = 103 are used
Meanwhile,thetime-scaleofthedynamicsatearlytime 0
to control the dynamics in (14), while the sampler uses 100
– when t (cid:28) t – is ν. A good choice of ν should therefore
0 walkers. Note that these choices, while different from the
ensurethatthesamplerisnotsusceptibletolargestatistical
defaults proposed in Section 3.2, do satisfy the conditions
errors on the measurements of the acceptance ratios A .
i described in that section.
Ingeneral,forn swapproposals,theacceptancecount
s
n A is a random variable that follows a binomial distribu-
s i
tion B(n ,E[A ]), so that A has variance
s i i
4.1 Truncated Gaussian
E[A ](1−E[A ])
Var[Ai] = i n i . Ourfirsttestcaseisann-dimensional,unimodal,unitGaus-
s
sian similar to that of Section 2.2 but with finite prior vol-
Since the dynamical equations (13) are linear in A , they
i ume.Thesimplicityofthiscaseadmitssomeexactanalysis
will be driven by the means, E[A ], on long time-scales, as-
i before recourse to numerics, which allows us to test the ap-
suming that the noise in the system from counting errors –
√ proximations made in Section 2.2.
proportionalto1/ n –doesnotcauseshort-termchanges
s At low temperatures, where the prior boundaries do
in E[A ].
i not truncate much of the likelihood probability mass, the
Given a sampler of n walkers, n swaps are proposed
w w optimal temperature spacing should be similar to that of
with each iteration, so that n = n ν. To ensure stable
s w the ideal Gaussian. By imposing a step-like cut-off in the
dynamics at early time, we should therefore choose n ν (cid:29)
w prior at a radius of R, there will be some temperature at
1.
which this approximation will fail and a geometric spacing
becomes inappropriate.
altogether once the temperatures have reached an equilibrium, For the likelihood we use the same distribution as in
discardingtheprevioussamplesaspartoftheburn-in. Section2.2,whileforthepriorweuseauniformdistribution
MNRAS000,1–21(2002)
Temperature dynamics for PTMCMC samplers 7
overtheclosedn-ballofradiusR=30,centredontheorigin.
7
The likelihood and prior are defined by
L(θ(cid:126))∝exp(cid:16)−1(cid:107)θ(cid:126)(cid:107)2(cid:17), (15) bits) 6 γγ==186.0.0
2 ( 5 γ=4.0
(cid:40) n
p(θ(cid:126))∝ 1 if (cid:107)θ(cid:126)(cid:107)≤R, (16) gai 4 γ=2.0
0 otherwise, n
o 3
ti
where (cid:107)·(cid:107) is the Euclidean norm on Rn. Subsequently, the ma 2
r
normalisedposteriorgeneratedby (15)and(16)attemper- o
nf 1
ature T is I
0
πT(θ(cid:126))=(2πγ˜T(cid:18))n2−,n2R2TΓ2((cid:19)n2)exp(cid:16)−(cid:107)2θ(cid:126)T(cid:107)2(cid:17) if (cid:107)θ(cid:126)(cid:107)≤R, (17) 20 21 22 23 24 T2lo5w 26 27 28 29 210
0 otherwise,
where γ˜(a,z) is the lower incomplete gamma function. Figure 4. The KL divergence, or information gain, from a hot
In the low-temperature limit, this distribution con- chainattemperatureγTlowtoacolderchainattemperatureTlow,
verges to the ideal Gaussian distribution. We should there- bothsamplingfrom(17)atn=5(solidlines).AsTlow→0,the
foreexpecttheKLdivergenceforastepdownthetempera- informationgaintendstothatoftheidealGaussianofSection2.2
tureladdertoasymptoteto(10)asT →0,wheretheeffects (dashedlines).Thelinesareorderedverticallytomatchtheleg-
ofthepriorboundaryarenegligible6.Indeed,theKLdiver- end.
gence of (17) from T to T is available analytically as
2 1
(cid:16) (cid:17)
(T −T )γ˜ 1+ n, R2
D =− 2 1 2 2T2
KL (cid:16) (cid:17)
T γ˜ n, R2
2 2 2T1 transition) in order to achieve uniform A than would be
(cid:18) (cid:19) γ˜(cid:16)n, R2(cid:17) (18) required for uniform DKL. There is therefore less risk of a
+ nlog T2 +log 2 2T2 . large gap in temperature across such a temperature range,
2 T1 γ˜(cid:16)n, R2(cid:17) atthecostof(potentially)slightlylessefficientcommunica-
2 2T1 tion across the rest of the ladder. The uniform-A criterion
If we set T =γT (with γT (cid:28)1), then γ˜(a,z)→Γ(a) as for optimality is therefore conservative with respect to a
2 1 1
T →0, and the expression reduces to (10), as expected. uniform-D criterion.
1 KL
Figure 4 illustrates this convergence for n = 5. The We can also visualise the ladder specification in terms
point on this plot at which the solid line diverges from the of the density of chains over temperature. We define this
dashed line, for each γ, predicts the temperature beyond density, in logT, as
which a geometric spacing of temperatures is no longer op-
timal (for optimality as defined by uniform KL divergence 1 1
η(logT)= dN = = , (19)
between chains). This is caused by truncation of the tem- dlogT ∆logT logγ
pered likelihood by the prior boundaries.
Ofcourse,sincetheKLdivergencecannoteasilybeas- with γ =T /T , where T and T are the chain temper-
i+1 i i+1 i
sessed empirically by an MCMC sampler, and we must in- atures to either side of T.
stead resort to using acceptance ratios, we would like to
Figure 6 shows this density for a temperature ladder
know how consistent these two schemes are outside the as-
of 20 chains that is in equilibrium under the temperature
sumptions of Section 2.2.
dynamics of Section 3 (the N = 20 contour of Figure 5).
Figure 5 shows contours of constant D , calculated
KL The density exhibits the expected uniformity of γ for low
from(18),andcontoursofconstantAi,illustratedbypoints temperatures but falls for T (cid:38)√80. The width σ of the unit
representingtemperaturepairs(fromladdersselectedbythe
GaussianattemperatureT is T,soatthistemperaturethe
algorithm developed in Section 3). In the low temperature
priorboundaryisat∼3σ.AtT =80,∼5%ofthelikelihood
limit, as expected, both schemes select a geometric spac-
mass is truncated – compared to < 0.1% for T = 40 and
ing of temperatures consistent with the ideal Gaussian of
∼ 35% for T = 160 – indicating that the prior boundary
Section 2.2 (i.e., the contours remain constant in γ). At
becomes significant in this temperature regime.
highertemperatures,bothschemesdepartfromthegeomet-
Thisdropindensityreflectstheconvergenceofthetem-
ric spacing, but they do so differently. The uniform accep-
peredposteriordistribution,π ,towardthepriorasT →∞.
T
tance scheme displays a more gradual departure from a ge-
Asπ becomesflatter,fewerchainsareneededperlogT to
T
ometric spacing than the contours of constant D . The
KL maintain good communication.
smaller γ selected by the uniform-A scheme outside the ge-
Also shown on figure Figure 6 is the square root of the
ometric regime, however, suggest that closer spacing is re-
estimatedspecificheatC ofthesystemasdiscussedinSec-
V
quired in difficult temperature ranges (e.g., across a phase
tion 2.1, which can be seen to track closely the logarithmic
chain density η when appropriately normalised. While the
6 WhilewedonotconsiderT <1inoursimulations,thecaseof provenance of this relationship is unclear, it demonstrates
T 0canequivalentlybethoughtofasR ,sincethewidth therelevanceofthespecificheatindetermininganeffective
→ →∞
oftheGaussianscaleswith√T. temperature ladder.
MNRAS000,1–21(2002)
8 W. D. Vousden, W. M. Farr and I. Mandel
23
N =5
4.2
N =6
N =7
3.6
N =8
N =9
)
22 N =10 3.0 ts
bi
N =12 (
n
N =15 2.4 gai
γ N =20 n
o
ti
a
1.8 m
r
21 nfo
I
1.2
0.6
20 0.0
20 21 22 23 24 25 26 27 28 29 210
Tlow
Figure 5. AcontourplotoftheKLdivergence,orinformationgain,fromahotchainattemperatureThigh=γTlow toacolderchain
at temperature Tlow, both sampling from the Gaussian likelihood (17). The coloured lines show the equilibrium N-chain temperature
laddersreachedbythetemperaturedynamicsalgorithmofSection3,wheretheacceptanceratioisthesamebetweenanypairofadjacent
chains.Thepointsontheselinesrepresentpairsofadjacenttemperatures(T,γT)(excludingthetop-most,whereγ= ).
∞
an inefficient use of resources, it at least doesn’t drastically
3.5
inhibit communication between high temperatures and low
temperatures. Instead, we now turn to a more complex, bi-
3.0
modallikelihooddistributionforwhichageometricspacing
2.5 mightcausebottlenecksinthecommunicationbetweenhigh
and low temperatures.
2.0 We use a likelihood derived from the two-dimensional
η Rosenbrock function f:
1.5
(cid:18) 1 1 (cid:19)1/Tp
L(x,y)∝ + , (20)
1.0 c+f(x,y) c+f(−x,y)
η (uniform-A) where
0.5
√CV (arbitraryscale) f(x,y)=(a−x)2+b(y−x2)2. (21)
0.0
100 101 102 Tp isapre-temperingfactorchosentoincreasethecon-
T trastofthedistribution,makingithardertosample.When
T (cid:28) 1, each mode is locally Gaussian, making the re-
p
sultscomparabletotheGaussianexampleconsideredinSec-
tion 2.2.
Figure 6. Orange: The density of chains per logT under the
For the following tests, we use a = 4, b = 1, c = 0.1,
truncated Gaussian distribution (17), where N = 20, n = 25,
andT =10 3.Weuseaflatprioron[−10,10]×[−20,100].
and temperatures are chosen for uniform acceptance ratios be- p −
tween chains. The chains have equilibrated to 77% acceptance. Figure 7 illustrates this likelihood over the prior volume.
Blue:ThesquarerootofthespecificheatofthetruncatedGaus-
siandistribution,normalisedtomatchthechaindensityη ofthe
4.2.1 Test: temperature evolution
uniform-A ladder, between T1 and TN 1. The specific heat CV,
from(3),isestimatedfromthesample−meansoflogLovermany
As an illustrative example, we first tested the temperature
runswithdifferenttemperatureladders.
dynamics of Section 3 with the double Rosenbrock poste-
rior distribution (20) using 13 chains. Figure 8 shows the
evolution of the temperature ladder according to these dy-
4.2 Double Rosenbrock function
namics, while Figure 9 shows the chain density η(logT) for
The previous test demonstrated how a geometric ladder the equilibrated temperature ladder.
spaces temperatures too closely at higher temperatures, as Whiletheequilibratedchainsaredistributeduniformly
the prior boundary becomes significant. While this may be inlogT forT (cid:46)50,thereisadistinctpeakinη atT ≈800,
MNRAS000,1–21(2002)
Temperature dynamics for PTMCMC samplers 9
lower-temperaturechainscanstillcommunicatewithachain
100
0 sampling from the prior. Under this set-up, therefore, the
ACT always decreases as N increases, per Figure 10.
80
−1500 To test the improvement in ACT, τ, conferred by our
60 temperature dynamics, we allowed emcee to explore the
3000
− target distribution (20) with different numbers of chains,
L
y 40 −4500 log Nsp,acuesdinlagddbeortsh.Tthheeruensuiflotrinmg-AAClTadsd,eτrs aannddτgeom,aerteripcalollty-
geo acc
20 −6000 ted against N in Figure 10.
In this example, an N-chain ladder dynamically
0 7500
− adapted for uniform acceptance ratios clearly outperforms
20 9000 a geometrically spaced ladder of the same size for all N.
− −
10 5 0 5 10 The benefit of a uniform-A ladder is most pronounced
− −
x atlowN –i.e.,wheretherearefewchainsavailable.Inthis
regime, the sampler will be more sensitive to phase transi-
tions, since the bigger gaps in temperature could cause se-
Figure 7. TheRosenbrockloglikelihood,from(20). vere bottlenecks in communication across the temperature
ladder.
When N is large, the differences in acceptance ratios
where a simple geometric spacing of temperatures hinders between a geometric ladder and one chosen for uniform A
communicationbetweenchains.Thispeakoccursataphase becomeslesssignificant.Inthiscase,thedifferencebetween
transitionwherethetwomodesofthelikelihooddistribution thelimiting(minimum)acceptanceratioforaladderandthe
begintomixandE[logL] changesrapidlywithT,indicated ladder’s average acceptance ratio is proportionally smaller.
by the sharp change in specific heat in the bottom panel of In the case of the double Rosenbrock distribution (20),
Figure 9. Since the shape of the likelihood distribution in we have found that, once the minimum acceptance ratio
thisregimebecomesverysensitivetoT,ahigherdensityof for a geometric ladder (terminating at T =2×104) ex-
max
chains is needed to maintain a given acceptance ratio. We ceeds ∼ 10%, reallocating temperatures for uniform accep-
also note that in the geometric regime (i.e., for low T) the tance ratios does not reduce the measured ACT by more
specific heat is approximately n/2 = 1, with E[A] ≈ 57%, than25%.ThisoccurswhenN ≈7inthecurrentexample.
consistent with the values derived for the ideal Gaussian Nonetheless, there remains an overall improvement in ACT
from (6) and (5) respectively. regardless of N.
Ultimately, however, the figure of merit for a tempera- Figure10alsoshows,inthemiddlepane,thetotalnum-
turespecificationinaPTMCMCsimulationistheresulting ber of iterations per independent sample across all chains.
ACTforthetargettemperature(T =1)ofthesampler.We This quantity, given by N ×τ, is proportional to the total
must therefore test the performance of the sampler empiri- CPUtimeofthesimulation,whileτ itselfisproportionalto
cally. the CPU time per chain, or wall time, of the simulation. In
We use the term ACT to refer to the integrated auto- this instance, the CPU time of a run diminishes with N in
correlation time discussed by Sokal (1997), which we esti- muchthesamefashionasthewalltimedoes.Thefractional
mate according to the algorithm used in the acor package improvement in CPU time is of course the same as for wall
(see Appendix A and http://www.math.nyu.edu/faculty/ time – τ /τ .
geo acc
goodman/software/acor/ for details). For the following
tests,weusetheACTofthefirstparameter,x,asameasure
of the efficiency of the sampler (since (20) is bimodal in x
4.2.3 Test: chain removal
but unimodal in y).
To determine whether a uniform-A temperature placement
strategy is in fact close to optimal, we assess the contribu-
4.2.2 Test: improvement over a geometric ladder tion of each chain from such a temperature ladder to the
efficiency of the sampler, as measured by its ACT. If this
InSection2weclaimedthataimingforuniformacceptance
contribution is equal for all chains, then we can conclude
ratios between chains yields a good temperature ladder.
thatitisindeedoptimaltohavethemallexchangingequally
Specifically, we expect that a ladder selected for uniform
– that is, with uniform acceptance ratios.
acceptanceratiosshouldleadtoalowerACTfortheT =1
To this end, we conducted the following test:
chain than that resulting from a plain geometric ladder.
Thegeometricansatzthatweusehasafixedmaximum (i) Sample from (20) with N = 7 chains under the tem-
temperature such that TN =2×104. As N increases, more peraturedynamicsofSection3untilthetemperatureshave
chains are added between T1 and TN, maintaining the geo- equilibrated to (T1,...,T7) to give uniform acceptance ra-
metricspacing.Underthisarrangement,theadditionofnew tios.
temperaturesisnotredundantevenwhenTN isalreadyhigh (ii) Generate5newtestladders,eachof6chains,formed
enough to sample from the prior; the additional chains in- byremovingtheith chainfromthatdeterminedabove–i.e.,
steadaidinter-chaincommunicationatlowertemperatures. (T ,...,T ,T ,...,T ) – for i=2,...,6.
1 i 1 i+1 7
Since TN is close to the temperature at which the posterior (iii) Sam−ple from (20) with each of these 5 test ladders
becomes the prior, there is little CPU time wasted in sam- and calculate the ACTs on the cold chain, τ .
test
pling redundantly from the prior with several chains, while
MNRAS000,1–21(2002)
10 W. D. Vousden, W. M. Farr and I. Mandel
105
104
e
r
u
t 103
a
r
e
mp 102
e
T
101
100
100 101 102 103 104 105 106
Time
1.0
o 0.8
ti
a
r
e 0.6
c
n
a
t 0.4
p
e
c
Ac 0.2
0.0
100 101 102 103 104 105 106
Time
Figure 8. The evolution of ladder of 13 temperatures Ti and acceptance ratios Ai over an emcee run of 106 iterations under the
Rosenbrocklikelihood(20).Chains1and13arenotshown,havingfixedtemperaturesT1=1andT13= .
∞
We can understand this behaviour by examining the
3.0
complete autocorrelation functions from which these ACTs
η (uniform-A)
areestimated.IllustratedinFigure12,theseautocorrelation
2.5 √CV (arbitraryscale) functionsexhibittwodistincttime-scales.Firstly,thereisa
large autocorrelation for lags (cid:46) 100 for all i – particularly
2.0 i = 2 – corresponding to the ACT of the sampler within
one of the two modes: that is, the time taken for the sam-
η 1.5 pler to generate an independent sample without changing
mode. Secondly, there is a visible hump in the autocorre-
1.0 lation function for 100 (cid:46) lag (cid:46) 2000, corresponding to the
time taken for the sampler to migrate between modes. Re-
moving the second chain from initial geometric ladder of 7
0.5
chainsincreasestheintra-modeACTinparticular,butdoes
not affect the inter-mode ACT. Meanwhile, while removing
0.0
100 101 102 103 higher temperature chains pushes the secondary hump out-
T wardtolargerlags,increasingtheinter-modeACTinstead.
The overall autocorrelation time in which we are inter-
ested, discussed by Sokal (1997) and in Appendix A, rep-
resents the time between independent samples of the sys-
Figure 9. Orange:TheequilibriumdensityofchainsperlogT
tem. It is therefore set by the time-scale on which the sam-
forthedoubleRosenbrockrunillustratedinFigure8,wherethe
pler migrates to a new mode independently of the current
arococetpotfanthcee srpateicoisfichahveeatsefottrletdhetodoAuibl≈e R0o.5s7en.bBrolucke:dTishtreibsuqtuiaorne, mode. Removing a chain at higher temperatures increases
asdescribedinFigure6. the inter-modal ACT, and therefore damages the efficiency
of the sampler.
Nonetheless, all of the tested temperature ladders
Figure 11 shows the ACTs for the cold chain result- yielded lower ACT than the default geometric ladder, de-
ing from the test outlined above. While τ < τ < τ spite the geometric ladder being chosen with prior knowl-
acc test geo
forladdersofthesameN,τtest increaseswiththetempera- edge of Tprior.
tureofthechainthatisremoved,suggestingthatadditional
chains are more useful at higher temperatures. The sharp
4.3 Egg-box in five dimensions
jumpinτ whenachainaboveT ≈200isremovedarises
test
fromthephasetransitionthatoccursasT approachesT , Totestthealgorithm’sperformanceonayetmorestrongly
prior
indicated by a peak in C (visible in Figure 9). multi-modaldistribution,weuseanegg-boxdistributionde-
V
MNRAS000,1–21(2002)