Rethinking resampling in the particle filter on graphics processing units

Lawrence M. Murray, Anthony Lee and Pierre E. Jacob

L. M. Murray is with CSIRO Mathematics, Informatics & Statistics. A. Lee is with the Department of Statistics, University of Warwick. P. E. Jacob is with the Department of Statistics and Applied Probability, National University of Singapore.

Abstract—Modern parallel computing devices such as the graphics processing unit (GPU) have gained significant traction in scientific computing, and are particularly well-suited to data-parallel algorithms such as the particle filter. Of the components of the particle filter, the resampling step is the most difficult to implement well on such devices, as it often requires a collective operation, such as a sum, across weights. We present and compare a number of resampling algorithms in this work, including rarely-used alternatives based on Metropolis and rejection sampling. We find that these alternative approaches perform significantly faster on the GPU than more common approaches such as the multinomial, stratified and systematic resamplers, a speedup attributable to the absence of collective operations. Moreover, in single precision (particularly relevant on GPUs due to its faster performance), the common approaches are numerically unstable for plausibly large numbers of particles, while these alternative approaches are not. Finally, we provide a number of auxiliary functions of practical use in resampling, such as for the permutation of ancestry vectors to enable in-place propagation of particles.

Index Terms—Particle filter, sequential Monte Carlo, state-space models, resampling, graphics processing unit

I. INTRODUCTION

For some sequence of time points t = 1,...,T and observations at those times y_1,...,y_T, the particle filter [1], [2] uses N weighted samples, or particles, to recursively estimate the time marginals p(x_t | y_{1:t}) of the latent states x_1,...,x_T of the state-space model

    p(x_{0:T}, y_{1:T}) = p(x_0) \prod_{t=1}^{T} p(y_t | x_t) p(x_t | x_{t-1}),    (1)

depicted in Figure 1.

[Fig. 1. The state-space model: a hidden Markov chain X_0, X_1, X_2, ..., X_T, with an observation Y_t attached to each X_t.]

Pseudocode for the simplest bootstrap particle filter [1] is given in Code 1. The initialisation, propagation and weighting steps are readily parallelised, being independent operations on each particle x_t^i and its weight w_t^i. Resampling, on the other hand, is a collective operation across particles and weights, so that parallelisation is more difficult. It is here that the present work concentrates.

Code 1 Pseudocode for the bootstrap particle filter.

PARTICLE-FILTER(N ∈ ℕ⁺, T ∈ ℕ₀)
1   foreach i ∈ {1,...,N}
2       x_0^i ∼ p(x_0)                        // initialise particle i
3       w_0^i ← 1/N                           // initialise weight i
4   for t = 1,...,T
5       a_t ← ANCESTORS(w_{t-1})              // resample
6       foreach i ∈ {1,...,N}
7           x_t^i ∼ p(x_t | x_{t-1}^{a_t^i})  // propagate particle i
8           w_t^i ← p(y_t | x_t^i)            // weight particle i

The resampling step can be encoded by a randomised algorithm ANCESTORS that accepts a vector, w_{t-1} ∈ ℝ^N, of particle weights and returns a vector, a_t, of integers between 1 and N, where each a_t^i is the index of the particle at time t-1 which is to be the ancestor of the ith particle at time t. Alternatively, the resampling step may be encoded by a randomised algorithm OFFSPRING that also accepts a vector, w_{t-1} ∈ ℝ^N, of particle weights, but instead returns a vector, o_t, of integers between 0 and N, where each o_t^i is the number of offspring to be created from the ith particle at time t-1 for propagation to time t. As we shall see, each resampling algorithm more naturally takes one form or the other, and ancestry vectors are readily converted to offspring vectors and vice-versa (§III provides functions to achieve this).
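To make Code 1 concrete, the following is a minimal serial C++ sketch of the bootstrap filter for a toy model with x_t ∼ N(x_{t-1}, 1) and y_t ∼ N(x_t, 1); the model and the use of std::discrete_distribution as a stand-in for ANCESTORS are our own illustrative assumptions, not the paper's implementation, and any resampler from §II could be substituted.

#include <cmath>
#include <random>
#include <vector>

// Stand-in for ANCESTORS: multinomial draws from the categorical
// distribution with probabilities proportional to w (see Section II).
std::vector<int> ancestors(const std::vector<double>& w, std::mt19937& rng) {
    std::discrete_distribution<int> cat(w.begin(), w.end());
    std::vector<int> a(w.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = cat(rng);
    }
    return a;
}

// Code 1: bootstrap particle filter for x_t ~ N(x_{t-1}, 1),
// y_t ~ N(x_t, 1). Leaves the final particles and weights in x, w.
void particle_filter(int N, const std::vector<double>& y,
                     std::vector<double>& x, std::vector<double>& w,
                     std::mt19937& rng) {
    std::normal_distribution<double> gauss(0.0, 1.0);
    x.resize(N);
    w.assign(N, 1.0 / N);
    for (int i = 0; i < N; ++i) x[i] = gauss(rng);  // initialise: x_0^i ~ p(x_0)
    std::vector<double> xprev(N);
    for (std::size_t t = 0; t < y.size(); ++t) {
        std::vector<int> a = ancestors(w, rng);     // resample
        xprev = x;
        for (int i = 0; i < N; ++i) {               // foreach: parallelisable
            x[i] = xprev[a[i]] + gauss(rng);        // propagate particle i
            double d = y[t] - x[i];
            w[i] = std::exp(-0.5 * d * d);          // weight: p(y_t | x_t^i),
        }                                           // up to a constant factor
    }
}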
Following the resampling step is the propagation step. In the implementation of this there are two options in arranging memory. The first is to have an input buffer holding the particles at time t-1, and a separate output buffer into which to write the propagated particles at time t. The second is to work in-place, reading and writing particles from and to the same buffer. The second option is more memory efficient by a factor of two (for a fixed number of particles), but places more stringent conditions on the output of the resampling algorithm. It is sufficient that the ancestry vector, a_t, satisfies

    ∀i ∈ {1,...,N}:  o_t^i > 0 ⟹ a_t^i = i.    (2)

With this, it is possible to insert a copy step immediately before each propagation step, setting x_{t-1}^i ← x_{t-1}^{a_t^i} for all i = 1,...,N but a_t^i ≠ i. Each particle can then be propagated in-place by reading from and writing to the same buffer. Importantly, (2) ensures that the copies can be done concurrently without read and write conflicts, as each particle is either read from or written to, but not both. The focus of this work is on this in-place mode, such that algorithms are timed up to the delivery of an ancestry vector that satisfies (2). We also present auxiliary functions for the permutation of arbitrary ancestry vectors to achieve this.
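As a sketch of the copy step that (2) enables (our own illustration in C++, with an OpenMP pragma standing in for a GPU kernel launch): because a_t^i = i wherever o_t^i > 0, every element is either a source or a destination but never both, so the loop is free of read-write conflicts.

#include <vector>

// In-place copy step before propagation: for an ancestry vector a
// satisfying (2), a particle with offspring keeps its own slot
// (a[i] == i), so a slot that is written (a[i] != i) is never read.
void copy_step(std::vector<double>& x, const std::vector<int>& a) {
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(a.size()); ++i) {
        if (a[i] != i) {
            x[i] = x[a[i]];   // safe: x[a[i]] is never overwritten
        }
    }
}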
The challenge in programming a parallel device such as a graphics processing unit (GPU) is that, to occupy its full capacity, an algorithm must be executable with tens of thousands of threads. Furthermore, the threads with which it executes are not entirely independent, grouped into teams (of size 32 on current hardware [3]) called warps. The threads of a warp should ideally follow the same path through an algorithm. Where they diverge to different branches of a conditional, or different trip counts of a loop, their execution becomes serialised and parallelism is lost (referred to as warp divergence). A further consideration is that, for all the threads of a warp, memory reads and writes should be from or to neighbouring addresses for best performance (referred to as coalescing of reads and writes). Considerations such as these shape the development of the algorithms in this work. The reader is referred to the OpenCL standard [4] for general information, and the Compute Unified Device Architecture (CUDA) [5] for more detail on optimising for NVIDIA GPUs in particular.

Parallel implementation of resampling schemes has been considered before, largely in the context of distributed memory machines [6], [7]. Likewise, implementation of the particle filter on GPUs has been considered [8], [9], [10], although with an emphasis on low-level implementation issues rather than algorithmic adaptation as here.

Section II presents algorithms for resampling based on multinomial, stratified, systematic, Metropolis and rejection sampling. Section III presents auxiliary functions for converting between offspring and ancestry vectors, and the permutation of ancestry vectors to satisfy (2). Empirical results for execution time and variance in outcomes are obtained for all algorithms in Section IV, with concluding discussion in Section V.

II. RESAMPLING ALGORITHMS

We present parallel resampling algorithms suitable for both multi-core central processing units (CPUs) and many-core graphics processing units (GPUs). Where competitive serial algorithms exist, these are also presented. Resampling algorithms are nicely visualised by arranging particles by weight in a circle, as in Figure 2, which may help to elucidate the various approaches.

[Fig. 2. Visualisation of the resampling algorithms considered: (a) multinomial, (b) stratified, (c) systematic, (d) Metropolis, (e) rejection. Arcs along the perimeter of the circles represent particles by weight, arrows indicate selected particles and are positioned (a) uniformly randomly in the multinomial resampler, (b & c) by evenly slicing the circle into strata and randomly selecting an offset into each stratum (stratified resampler) or using the same offset for all strata (systematic resampler), (d) by initialising multiple Markov chains and simulating to convergence in the Metropolis resampler, or (e) by rejection sampling.]

The algorithms presented here are described using pseudocode with a number of conventions. In particular, we make use of primitive operations such as searches, transformations, reductions and prefix sums. Such operations will be familiar to programmers of, for example, the C++ standard template library (STL) or Thrust library [11], and their implementation on GPUs has been well-studied [12]. The advantage of describing algorithms in this way is that the efficient implementation of these primitives in both serial and parallel contexts is well understood, and a single pseudocode description in terms of primitives will often suffice for both. Code 2 defines the particular primitives used throughout this work.

Code 2 Pseudocode for various primitive functions.

INCLUSIVE-PREFIX-SUM(w ∈ ℝ^N) → ℝ^N
1   W_i ← Σ_{j=1}^{i} w_j
2   return W

EXCLUSIVE-PREFIX-SUM(w ∈ ℝ^N) → ℝ^N
1   W_i ← 0 if i = 1, otherwise Σ_{j=1}^{i-1} w_j
2   return W

ADJACENT-DIFFERENCE(W ∈ ℝ^N) → ℝ^N
1   w_i ← W_i if i = 1, otherwise W_i − W_{i-1}
2   return w

SUM(w ∈ ℝ^N) → ℝ
1   return Σ_{i=1}^{N} w_i

LOWER-BOUND(W ∈ ℝ^N, u ∈ ℝ) → ℕ⁺
1   requires: W is sorted in ascending order
2   returns: the lowest j such that u may be inserted into position j of W and maintain its sorting

We distinguish between the foreach and for constructs. The former is used where the body of the loop is to be executed for each element of a set, with the order unimportant. The latter is used where the body of the loop is to be executed for each element of a sequence, where the order must be preserved. The intended implication is that foreach loops may be parallelised, while for loops may not be.

In what follows, we omit the subscript t from weight, offspring and ancestry vectors, as all of the algorithms presented behave identically at each time in the particle filter.

A. Multinomial resampling

Multinomial resampling proceeds by drawing each a_i independently from the categorical distribution over C = {1,...,N}, where P(a_i = j) = w_j/SUM(w). Pseudocode is given in Code 3. The algorithm is dominated by the N calls of LOWER-BOUND which, if implemented with a binary search, give O(N log₂ N) complexity overall.

Code 3 Pseudocode for parallel multinomial resampling.

MULTINOMIAL-ANCESTORS(w ∈ ℝ^N) → ℕ^N
1   W ← INCLUSIVE-PREFIX-SUM(w)
2   foreach i ∈ {1,...,N}
3       u_i ∼ U[0, W_N)
4       a_i ← LOWER-BOUND(W, u_i)
5   return a

The INCLUSIVE-PREFIX-SUM operation on line 1 of Code 3 is not numerically stable, as large values may be added to relatively insignificant ones during the procedure (an issue intrinsic to any large summation). With large N, assigning the weights to the leaves of a binary tree and summing with a depth-first recursion over this will help. With large variance in weights, pre-sorting will also help. While log-weights are often used in the implementation of particle filters, these need to be exponentiated (perhaps after rescaling) for the INCLUSIVE-PREFIX-SUM operation, so this does not alleviate the issue.
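A serial C++ rendering of Code 3 might look as follows (our own sketch, using 0-based indices); a parallel version would replace std::partial_sum with a device scan, such as Thrust's inclusive_scan, and launch the loop as a kernel.

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Code 3: multinomial resampling by inclusive prefix sum over the
// weights followed by one binary search (LOWER-BOUND) per particle.
std::vector<int> multinomial_ancestors(const std::vector<double>& w,
                                       std::mt19937& rng) {
    const int N = static_cast<int>(w.size());
    std::vector<double> W(N);
    std::partial_sum(w.begin(), w.end(), W.begin());  // INCLUSIVE-PREFIX-SUM
    std::uniform_real_distribution<double> unif(0.0, W[N - 1]);
    std::vector<int> a(N);
    for (int i = 0; i < N; ++i) {                     // foreach: parallelisable
        double u = unif(rng);                         // u_i ~ U[0, W_N)
        a[i] = static_cast<int>(
            std::lower_bound(W.begin(), W.end(), u) - W.begin());
    }
    return a;
}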
Serially, the same approach may be used, although a single-pass approach of complexity O(N) is enabled by generating sorted uniform random variates [13]. Code 4 details this approach. There is scope for a small degree of parallelism here by dividing N amongst a handful of threads, but not enough to make this a useful algorithm on the GPU. A drawback is the use of relatively expensive logarithm functions, but we nevertheless find it superior to Code 3 on CPU.

Code 4 Pseudocode for serial, single-pass multinomial resampling.

MULTINOMIAL-ANCESTORS(w ∈ ℝ^N) → ℕ^N
1   W ← EXCLUSIVE-PREFIX-SUM(w)
2   W ← W_N + w_N                    // sum of weights
3   lnMax ← 0
4   j ← N
5   for i = N,...,1
6       u ∼ U[0,1)
7       lnMax ← lnMax + ln(u)/i
8       u ← W exp(lnMax)
9       while u < W_j
10          j ← j − 1
11      a_i ← j
12  return a

B. Stratified resampling

The variance in outcomes produced by the multinomial resampler may be reduced [14] by stratifying the cumulative probability function of the same categorical distribution, and randomly drawing one particle from each stratum. This stratified resampler [15] most naturally delivers not the ancestry or offspring vector, but the cumulative offspring vector, defined with respect to the offspring vector o as O = INCLUSIVE-PREFIX-SUM(o). Pseudocode is given in Code 5. The algorithm is of complexity O(N).

Code 5 Pseudocode for stratified resampling.

STRATIFIED-CUMULATIVE-OFFSPRING(w ∈ ℝ^N) → ℕ^N
1   u ∈ ℝ^N ∼ i.i.d. U[0,1)
2   W ← INCLUSIVE-PREFIX-SUM(w)
3   foreach i ∈ {1,...,N}
4       r_i ← N W_i / W_N
5       k_i ← min(N, ⌊r_i⌋ + 1)
6       O_i ← min(N, ⌊r_i + u_{k_i}⌋)
7   return O

As for multinomial resampling, the INCLUSIVE-PREFIX-SUM operation on line 2 of Code 5 is not numerically stable. The same strategies to ameliorate the problem apply.

Line 6 of Code 5 is more problematic. Consider that there may be a j such that, for i ≥ j, u_{k_i} is not significant against r_i under the floating-point model, so that the result of r_i + u_{k_i} is just r_i. For such i, no random sample is being made within the strata. Furthermore, rounding up on the same line might easily deliver O_N = N + 1, not O_N = N as required, if not for the quick-fix use of min! This is particularly relevant in the present context because contemporary GPUs have significantly faster single-precision than double-precision floating-point performance, so that the use of single precision is always tempting. In single precision, seven significant figures (in decimal) can typically be expected. Consider that, with N around one million, almost certainly no u_{k_i} ∈ [0,1) is significant against r_i at high i. One million particles is not an unrealistic number for some contemporary applications of the particle filter. In double precision, however, where 15 significant figures (in decimal) is expected, it is unlikely that N is sufficiently large for this to be a problem in current applications. Note that while pre-sorting weights and summing over a binary tree can help with the numerical stability of the INCLUSIVE-PREFIX-SUM operation, it does not help with this latter issue.
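The following is a C++ sketch of Code 5 under the same 0-based conventions (again our own illustration). Note that it is exactly the r + u[k] addition near the end that loses the uniform variate in single precision at large N, which is why double is used here.

#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Code 5: stratified resampling, returning the cumulative offspring
// vector. One uniform offset is drawn per stratum; the min clamps
// guard against the last entry rounding up to N + 1.
std::vector<int> stratified_cumulative_offspring(const std::vector<double>& w,
                                                 std::mt19937& rng) {
    const int N = static_cast<int>(w.size());
    std::vector<double> W(N);
    std::partial_sum(w.begin(), w.end(), W.begin());
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::vector<double> u(N);
    for (int i = 0; i < N; ++i) u[i] = unif(rng);  // offsets, one per stratum
    std::vector<int> O(N);
    for (int i = 0; i < N; ++i) {                  // foreach: parallelisable
        double r = N * W[i] / W[N - 1];
        int k = std::min(N, static_cast<int>(std::floor(r)) + 1);
        O[i] = std::min(N, static_cast<int>(std::floor(r + u[k - 1])));
    }
    return O;
}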
C. Systematic resampling

The variance in outcomes of the stratified resampler may often, but not always [14], be further reduced by using the same random offset when sampling from each stratum. This is the systematic resampler (equivalent to the deterministic method described in the appendix of [15]). Pseudocode is given in Code 6, which is a simple modification to Code 5. The same complexity and numerical caveats apply to the systematic resampler as for the stratified resampler.

Code 6 Pseudocode for systematic resampling.

SYSTEMATIC-CUMULATIVE-OFFSPRING(w ∈ ℝ^N) → ℕ^N
1   u ∼ U[0,1)
2   W ← INCLUSIVE-PREFIX-SUM(w)
3   foreach i ∈ {1,...,N}
4       r_i ← N W_i / W_N
5       O_i ← min(N, ⌊r_i + u⌋)
6   return O

D. Metropolis resampling

The preceding resamplers all suffer from two problems:
1) they exhibit numerical instability for large N or large weight variance, and
2) they require a collective operation over the weights, specifically INCLUSIVE-PREFIX-SUM, which is less readily parallelised than independent operations on weights.

We present two alternative approaches, not typically considered in the literature, which do not suffer from these problems. They are more numerically stable, more readily parallelised, and consequently better suited to the breadth of parallelism offered by GPU hardware.

The first produces a multinomial resample via the Metropolis algorithm [16] rather than by direct sampling. This was briefly studied by the first author in [17], but a more complete treatment is given here. Instead of the collective operation, only the ratio between any two weights is ever computed, and only when needed. Code 7 describes the approach. The complexity of the algorithm is discussed later.

Code 7 Pseudocode for Metropolis resampling.

METROPOLIS-ANCESTORS(w ∈ ℝ^N, B ∈ ℕ) → ℕ^N
1   foreach i ∈ {1,...,N}
2       k ← i
3       for n = 1,...,B
4           u ∼ U[0,1]
5           j ∼ U{1,...,N}
6           if u ≤ w_j/w_k
7               k ← j
8       a_i ← k
9   return a

The Metropolis resampler is parameterised by B, the number of iterations to be performed before convergence is assumed and each particle may settle on its chosen ancestor. As B must be finite, the algorithm produces a biased sample. Selecting B is a tradeoff between speed and bias, with smaller B giving faster execution time but larger bias. This bias may not be much of a problem for filtering applications, but does violate the assumptions that lead to unbiased marginal likelihood estimates in a particle Markov chain Monte Carlo (PMCMC) framework [18], so care should be taken.

We provide guidance as to the selection of B by bounding the bias ε for the maximum normalised weight p* [17]. One may have both an upper bound on unnormalised weights, sup w, and an expected weight, E(w), from which this might be computed as p* = sup w/(N E(w)) (see §IV for one such example, albeit contrived). Otherwise, this might be thought of as a tolerance threshold on weight degeneracy and a value specified accordingly. Convergence of a Metropolis chain depends on a sufficient number of steps being taken such that the probability of a particle of this maximum weight being selected is within ε of p*.

Following [19, §2.1], for N particles, construct a binary 0-1 process where state 1 represents being at a particle with the maximum weight, and state 0 represents being at any other particle. The transition matrix is

    T = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix},    (3)

where

    \alpha = \frac{1}{N} \cdot \frac{1-p^*}{p^*}    (4)

is the probability of moving from state 1 to 0, and

    \beta = \frac{1}{N}    (5)

is the probability of moving from state 0 to 1, with a uniform proposal over all N particles and the Metropolis acceptance rule. The B-step transition matrix is then

    T^B = \frac{1}{\alpha+\beta} \begin{pmatrix} \beta & \alpha \\ \beta & \alpha \end{pmatrix} + \frac{\lambda^B}{\alpha+\beta} \begin{pmatrix} \alpha & -\alpha \\ -\beta & \beta \end{pmatrix},    (6)

where λ = 1 − α − β. We require that the second term satisfy the bias bound

    \frac{\lambda^B}{\alpha+\beta} \begin{pmatrix} \alpha & -\alpha \\ -\beta & \beta \end{pmatrix} < \epsilon,    (7)

so

    \frac{\lambda^B}{\alpha+\beta} \max(\alpha,\beta) < \epsilon,    (8)

which is satisfied when

    B > \log_\lambda \frac{\epsilon(\alpha+\beta)}{\max(\alpha,\beta)}.    (9)

Thus, by specifying a p* and some small ε, one proceeds by calculating α and β, then λ, then finally B as a suggested number of steps for the Metropolis resampler. An appropriate value for ε might be relative to p*, such as p* × 10⁻². The experiments in §IV provide some empirical evidence to support this as a reasonable choice for ε, and ultimately B. The complexity of the Metropolis resampler is O(NB), but B may itself be a function of N and weight variance, as in the analysis above.
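In C++, Code 7 might be rendered as below (our own sketch, 0-based). On a GPU each iteration of the outer foreach loop would be one thread, and since all chains execute the same fixed trip count B, the loop avoids warp divergence.

#include <random>
#include <vector>

// Code 7: each ancestor is the end state of a B-step Metropolis chain
// over particle indices; only ratios of weights are ever computed,
// and no collective operation over w is needed.
std::vector<int> metropolis_ancestors(const std::vector<double>& w, int B,
                                      std::mt19937& rng) {
    const int N = static_cast<int>(w.size());
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::uniform_int_distribution<int> prop(0, N - 1);
    std::vector<int> a(N);
    for (int i = 0; i < N; ++i) {         // foreach: one chain per particle
        int k = i;
        for (int n = 0; n < B; ++n) {
            double u = unif(rng);
            int j = prop(rng);            // uniform proposal over particles
            if (u <= w[j] / w[k]) k = j;  // Metropolis acceptance
        }
        a[i] = k;
    }
    return a;
}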
E. Rejection resampling

When an upper bound on the weights is known a priori, rejection sampling is possible. Compared to the Metropolis resampler, the rejection resampler also avoids collective operations and their numerical instability, but offers several additional advantages:
1) it is unbiased,
2) it is simpler to configure, and
3) it permits a first deterministic proposal (see, e.g., [20]) that a_i = i, improving the variance in outcomes.

Pseudocode is given in Code 8. If the third item above is removed, rejection resampling is an alternative implementation of multinomial resampling. The complexity of the algorithm is O(SUM(w)/sup w).

Code 8 Pseudocode for rejection resampling.

REJECTION-ANCESTORS(w ∈ ℝ^N) → ℕ^N
1   foreach i ∈ {1,...,N}
2       j ← i
3       β ∼ U[0,1]
4       while β > w_j/sup w
5           j ∼ U{1,...,N}
6           β ∼ U[0,1]
7       a_i ← j
8       w_i ← 1
9   return a

The rejection resampler will perform poorly if sup w is not a tight upper bound, i.e. if max(w_1,...,w_N) ≪ sup w. While one can empirically set sup w = max(w_1,...,w_N), this would require a collective operation over weights that would defeat the purpose of the approach.

Performance can be tuned if willing to concede a weighted outcome from the resampling step, rather than the usual unweighted outcome. To do this, choose some sup v < sup w, then form a categorical importance proposal using the weights v_1,...,v_N, where v_i = min(w_i, sup v). Clearly sup v forms an upper bound on these new weights. Sample from this as a proposal distribution using REJECTION-ANCESTORS, then importance weight each particle i with w_i ← w_{a_i}/v_{a_i}. Note that each weight is 1 except where w_{a_i} > sup v. The procedure may also be used when no hard upper bound on weights exists (sup w), but where some reasonable substitute can be made (sup v).

An issue unique to the rejection resampler is that the computational effort required to draw each ancestor varies, depending on the number of rejected proposals before acceptance. This is an example of a variable task-length problem [21], particularly acute in the GPU context where it causes warp divergence, with threads of the same warp tripping the loop on line 4 of Code 8 different numbers of times. A persistent threads strategy [22], [21] might be used to mitigate the effects of this, although we have not been successful in finding an implementation that does not lose more than it gains through additional overhead in register use and branching.
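A C++ sketch of Code 8 (our own illustration, 0-based): the first, deterministic proposal is the particle's own index, and further uniform proposals are drawn only on rejection, so the trip count of the while loop varies per particle as discussed above.

#include <random>
#include <vector>

// Code 8: rejection resampling given an a priori upper bound supw
// on the weights. Unbiased, and requires no collective operation.
std::vector<int> rejection_ancestors(const std::vector<double>& w,
                                     double supw, std::mt19937& rng) {
    const int N = static_cast<int>(w.size());
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::uniform_int_distribution<int> prop(0, N - 1);
    std::vector<int> a(N);
    for (int i = 0; i < N; ++i) {     // foreach: parallelisable but divergent
        int j = i;                    // first, deterministic proposal: a_i = i
        double beta = unif(rng);
        while (beta > w[j] / supw) {  // accept j with probability w_j / sup w
            j = prop(rng);
            beta = unif(rng);
        }
        a[i] = j;
    }
    return a;
}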
F. Other algorithms

The resampling algorithms presented here do not constitute an exhaustive list of those in use, but are reasonably representative, and tend to form the building blocks of more elaborate schemes. A notable example is residual resampling [23], which deterministically draws each particle ⌊N w_i/SUM(w)⌋ times before making up the deficit in the number of particles by randomly drawing additional particles with probabilities proportional to the residuals N w_i/SUM(w) − ⌊N w_i/SUM(w)⌋. The first stage may be implemented similarly to the systematic resampler, and the second using either a multinomial, Metropolis or rejection approach.

III. AUXILIARY FUNCTIONS

The multinomial, Metropolis and rejection resamplers most naturally return the ancestry vector a, while the stratified and systematic resamplers return the cumulative offspring vector O. Conversion between these is reasonably straightforward. An offspring vector o may be converted to a cumulative offspring vector O via the INCLUSIVE-PREFIX-SUM primitive, and back again via ADJACENT-DIFFERENCE. A cumulative offspring vector may be converted to an ancestry vector via Code 9, and an ancestry vector to an offspring vector via Code 10. These functions perform well on both CPU and GPU. An alternative approach to CUMULATIVE-OFFSPRING-TO-ANCESTORS, using a binary search for each ancestor, was found to be slower.

Code 9 Pseudocode for conversion of a cumulative offspring vector O to an ancestry vector a.

CUMULATIVE-OFFSPRING-TO-ANCESTORS(O ∈ ℕ^N) → ℕ^N
1   foreach i ∈ {1,...,N}
2       if i = 1
3           start ← 0
4       else
5           start ← O_{i-1}
6       o_i ← O_i − start
7       for j = 1,...,o_i
8           a_{start+j} ← i
9   return a

Code 10 Pseudocode for conversion of an ancestry vector a to an offspring vector o.

ANCESTORS-TO-OFFSPRING(a ∈ ℕ^N) → ℕ^N
1   o ← 0
2   foreach i ∈ {1,...,N}
3       atomic o_{a_i} ← o_{a_i} + 1
4   return o

An ancestry vector may be permuted to satisfy (2). A serial algorithm to achieve this is straightforward and given in Code 11. This O(N) algorithm makes a single pass through the ancestry vector with pair-wise swaps to satisfy the condition.

Code 11 Serial algorithm for permuting an ancestry vector to ensure (2).

PERMUTE(a ∈ ℕ^N)
1   for i = 1,...,N
2       if a_i ≠ i and a_{a_i} ≠ a_i
3           swap(a_i, a_{a_i})
4           i ← i − 1                    // repeat for new value
5   ensures: ∀i ∈ {1,...,N}: o_i > 0 ⟹ a_i = i

The simple algorithm is complicated in a parallel context, as the pair-wise swaps are not readily serialised without heavyweight mutual exclusion. In parallel we propose Code 12. This algorithm does not perform the permutation in-place, but instead produces a new vector c ∈ ℕ^N that is the permutation of the input vector a. It introduces a new vector d ∈ ℕ^N, through which, ultimately, c_i = a_{d_i}. In the first stage of the algorithm, PREPERMUTE, the thread for element i attempts to claim position a_i in the output vector by setting d_{a_i} ← i. By virtue of the min function on line 3, the element of lowest index always succeeds in this claim while all others contesting the same place fail, and the outcome of the whole permutation procedure is deterministic. This is desirable so that the results of a particle filter are reproducible for the same pseudorandom number seed. For each element i that is not successful in its claim, the thread for i instead attempts to claim d_i, if unsuccessful again then d_{d_i}, then recursively d_{d_{d_i}}, etc., until an unclaimed place is found. Note that this procedure is guaranteed to terminate (a proof is given in the Appendix).

Code 12 Parallel algorithm for the permutation of an ancestry vector to ensure (2).

PREPERMUTE(a ∈ ℕ^N) → ℕ^N
1   let d ∈ ℕ^N and set d_i ← N + 1 for i = 1,...,N
2   foreach i ∈ {1,...,N}
3       atomic d_{a_i} ← min(d_{a_i}, i)     // claim places to satisfy (2)
4   ensures: ∀i ∈ {1,...,N}: o_i > 0 ⟹ d_i = min{ j : a_j = i }
5   return d

PERMUTE(a ∈ ℕ^N) → ℕ^N
1   d ← PREPERMUTE(a)
2   foreach i ∈ {1,...,N}
3       x ← d_{a_i}
4       if x ≠ i
5           // claim unsuccessful in PREPERMUTE
6           x ← i
7           while d_x ≤ N
8               x ← d_x
9           d_x ← i
10  foreach i ∈ {1,...,N}
11      c_i ← a_{d_i}
12  ensures:
13      ∀i ∈ {1,...,N}: o_i > 0 ⟹ c_i = i
14  return c
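C++ sketches of Codes 9-11 follow (our own, 0-based); the serial permute mirrors Code 11's backtracking by revisiting an index after a swap.

#include <utility>
#include <vector>

// Code 9: expand a cumulative offspring vector O into an ancestry
// vector a, writing o_i copies of ancestor i starting at offset O[i-1].
std::vector<int> cumulative_offspring_to_ancestors(const std::vector<int>& O) {
    const int N = static_cast<int>(O.size());
    std::vector<int> a(N);
    for (int i = 0; i < N; ++i) {              // foreach: parallelisable
        int start = (i == 0) ? 0 : O[i - 1];
        for (int j = start; j < O[i]; ++j) a[j] = i;
    }
    return a;
}

// Code 10: count offspring per ancestor; the increment would be
// atomic in a parallel implementation.
std::vector<int> ancestors_to_offspring(const std::vector<int>& a) {
    std::vector<int> o(a.size(), 0);
    for (int i = 0; i < static_cast<int>(a.size()); ++i) ++o[a[i]];
    return o;
}

// Code 11: serial single-pass permutation establishing condition (2).
void permute(std::vector<int>& a) {
    for (int i = 0; i < static_cast<int>(a.size()); ++i) {
        if (a[i] != i && a[a[i]] != a[i]) {
            std::swap(a[i], a[a[i]]);
            --i;                               // repeat for new value at i
        }
    }
}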
Two other procedures are worth mentioning, as they are often used in conjunction with resampling. The first, the sorting of weights for improved numerical stability, has already been mentioned. The second is the computation of effective sample size (ESS) [24], often used to decide whether or not to resample at any given time in the particle filter. The ESS is given by SUM(w)²/(wᵀw). Note that both sorting and ESS are necessarily collective operations. We include these two procedures in our timing results to lend additional perspective.

IV. EXPERIMENTS

Weight sets are simulated to assess the speed and accuracy of each resampling algorithm. For each number of particles N = 2⁴, 2⁵, ..., 2²⁰ and observation y = 0, ½, 1, 1½, ..., 4, a weight set is generated by sampling x_i ∼ N(0,1), for i = 1,...,N, and setting

    w_i = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x_i - y)^2\right).    (10)

This is repeated to produce 500 weight sets for each configuration. Note that the construction is analogous to having a prior distribution of x ∼ N(0,1) and likelihood function of y ∼ N(x,1). As y increases, the relative variance in weights does too.

For this set up, the maximum weight is sup w = 1/√(2π), and the expected weight is

    E(w) = N(y; 0, σ² = 2) = \frac{1}{2\sqrt{\pi}} \exp\left(-\frac{y^2}{4}\right).    (11)

The variance of the weights is

    V(w) = \frac{1}{\sqrt{12}\,\pi} \exp\left(-\frac{y^2}{3}\right) - [E(w)]^2,    (12)

and so their relative variance is

    V\left(\frac{w}{E(w)}\right) = \frac{2}{\sqrt{3}} \exp\left(\frac{y^2}{6}\right) - 1,    (13)

which is increasing with y.

These are used to set B for the Metropolis resampler according to the analysis in §II-D. This maximum weight is also used for the rejection resampler. Note that, while presented with weights here, our actual implementation works with log-weights, which we assume is fairly common practice for observation densities from the exponential family.
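As a worked example of this configuration (our own sketch, not the paper's code), the following computes p* = sup w/(N E(w)) for the simulated weight sets and then B via (4), (5) and (9), with ε = p* × 10⁻² as suggested in §II-D; it assumes p* < 1, which holds except at small N with large y.

#include <algorithm>
#include <cmath>

// Number of Metropolis steps B for the experiment's weight sets:
// supw = 1/sqrt(2*pi) and E(w) from (11) give the maximum normalised
// weight p*, from which alpha, beta, lambda and finally B follow.
int metropolis_steps(int N, double y) {
    const double pi = 3.14159265358979323846;
    double supw = 1.0 / std::sqrt(2.0 * pi);
    double Ew = std::exp(-0.25 * y * y) / (2.0 * std::sqrt(pi));  // (11)
    double pstar = supw / (N * Ew);      // maximum normalised weight
    double eps = 1e-2 * pstar;           // suggested bias tolerance
    double alpha = (1.0 - pstar) / (N * pstar);   // (4)
    double beta = 1.0 / N;                        // (5)
    double lambda = 1.0 - alpha - beta;
    double B = std::log(eps * (alpha + beta) / std::max(alpha, beta))
               / std::log(lambda);                // (9)
    return static_cast<int>(std::ceil(B));
}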
Experiments are conducted in single precision on two devices. The first device is an eight-core Intel Xeon E5-2650 CPU, compiling with the Intel C++ Compiler version 12.1.3, using OpenMP to parallelise over eight threads, and using the Mersenne Twister pseudorandom number generator (PRNG) [25] as implemented in the Boost.Random library [26]. The second device is an NVIDIA S2050 GPU hosted by the same CPU, compiling with CUDA 5.0 and the same version of the Intel compiler, and using the XORWOW PRNG [27] from the CURAND library [28]. All compiler optimisations are applied.

Figure 3 shows the surface contours of the mean execution time across N and y for each algorithm, as well as marking up the fastest algorithms on each device, and across both devices. Execution times are taken until the delivery of an ancestry vector satisfying (2), and so include any of the auxiliary functions in §III necessary to achieve this. Note that, as we would expect, execution times of the multinomial, stratified and systematic resamplers are not sensitive to y – or equivalently to the variance in weights – while the Metropolis and rejection resamplers are.

[Fig. 3. Mean execution times for algorithms on each device, (top row) CPU and (bottom row) GPU. Each surface shows the base-10 logarithm of mean execution time across log₂ N (where N is the number of particles) and observation y, with prior distribution over particles x ∼ N(0,1) and likelihood function y ∼ N(x,1). Dashed lines demarcate the region, if any, where an algorithm is the fastest in its row. Solid lines demarcate where each algorithm is the fastest overall.]

As the resampling step is performed numerous times throughout the particle filter, it is not unreasonable to compare resampling algorithms on mean execution time alone. Nonetheless, some variability in execution time is expected, especially for the rejection resampler. This is shown in Figure 4 by taking a horizontal transect across each surface in Figure 3 at y = 1, then plotting N versus execution time separately for each device. The execution time of the sorting and ESS procedures is added to these plots for context.

[Fig. 4. Execution times against log₂ N at y = 1 for (left) the CPU and (right) the GPU. All resampling algorithms are shown, along with the SORT and ESS procedures for context.]

To assess the variability in resampling draws for each algorithm we compute the mean-square error in the outcome for each weight set, based on the offspring vector o [15]:

    \frac{1}{N} \sum_{i=1}^{N} \left( \frac{o_i}{N} - \frac{w_i}{\text{SUM}(w)} \right)^2.    (14)

We then take the mean of this across all 500 weight sets, and the square root of that, to arrive at the root-mean-square error (RMSE), as in Figure 5, which plots the RMSE for both y = 1 and y = 3.

[Fig. 5. Root-mean-square error (RMSE) of the various resampling algorithms at (left) y = 1 and (right) y = 3.]

Nothing in this figure is surprising: it is well known that the stratified resampler reduces variance over the multinomial resampler, and that the systematic resampler can, but does not necessarily, reduce it further [14]. Notably, the RMSE of the Metropolis resampler appears to match that of the multinomial resampler, suggesting that the procedure in §II-D for setting the number of steps, B, is appropriate. Also of interest is that, as y increases, the probability of the initial proposal of the rejection resampler being accepted declines, and so its RMSE degrades toward that of the multinomial and Metropolis resamplers.
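For reference, a small C++ sketch of (14) as described above (our own rendering): the RMSE in Figure 5 is the square root of the mean of this quantity over the 500 weight sets.

#include <numeric>
#include <vector>

// Mean-square error (14) for one weight set: compares realised
// offspring proportions o_i/N with normalised weights w_i/SUM(w).
double resampling_mse(const std::vector<int>& o, const std::vector<double>& w) {
    const int N = static_cast<int>(o.size());
    double sumw = std::accumulate(w.begin(), w.end(), 0.0);  // SUM(w)
    double mse = 0.0;
    for (int i = 0; i < N; ++i) {
        double d = static_cast<double>(o[i]) / N - w[i] / sumw;
        mse += d * d;
    }
    return mse / N;
}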
V. DISCUSSION

On raw speed, the Metropolis and rejection resamplers are worth considering for faster execution times on GPU. This is largely due to their avoidance of collective operations across all weights, which better suits the GPU architecture. They do not scale as well in either N or y, however, so that the stratified and systematic resamplers are faster at larger values of these. The Metropolis resampler scales worst of all in this regard. If there exists a good upper bound on the weights, the rejection resampler is faster, easier to configure and unbiased with respect to the Metropolis resampler. The Metropolis resampler is more configurable, however, and allows the tuning of B to trade off between execution time and bias. This may be particularly advantageous for applications with hard execution time constraints, such as real-time object tracking. Recall that the rejection resampler is somewhat configurable by using an approximate maximum weight, if one is willing to accept a still-weighted output from the resampling step.

A further consideration is that the execution time of both the Metropolis and rejection resamplers depends on the PRNG used. This dependence is by a constant factor, but can be substantial. Here, robust PRNGs for Monte Carlo work have been used, but conceivably cheaper, if less robust, PRNGs might be considered as another trade-off between execution time and bias.

One should still be aware that the stratified resampler is variance-reducing with respect to the multinomial resampler, and that the systematic resampler can give further reductions [14]. This may be a more important consideration than raw speed, depending on the application. The rejection resampler, with its first deterministic proposal, has the nice property that it can also improve upon the variance of the multinomial resampler. The Metropolis resampler is expected to match the multinomial resampler on variance, and indeed any variation can only be attributed to the bias introduced by finite B.

A special consideration on GPUs is the use of single-precision floating-point operations to improve execution time. For a number of particles upwards of hundreds of thousands, plausible in modern applications, it is worth emphasising again that great care should be taken in the use of the stratified and systematic resamplers due to numerical instability. The Metropolis and rejection resamplers have better numerical properties, as they compute only ratios of weights. In double precision, the number of particles required to have the same issue far surpasses what is realistic at present, so numerical stability is unlikely to be a problem.

REFERENCES

[1] N. Gordon, D. Salmond, and A. Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings-F, vol. 140, pp. 107–113, 1993.
[2] A. Doucet, N. de Freitas, and N. Gordon, Eds., Sequential Monte Carlo Methods in Practice. Springer, 2001.
[3] CUDA C Programming Guide, NVIDIA Corporation, July 2012.
[4] OpenCL. Khronos Group. [Online]. Available: www.khronos.org/opencl/
[5] CUDA. NVIDIA Corporation. [Online]. Available: www.nvidia.com/cuda
[6] M. Bolić, P. M. Djurić, and S. Hong, "Resampling algorithms for particle filters: A computational complexity perspective," EURASIP Journal on Applied Signal Processing, pp. 2267–2277, 2004.
[7] ——, "Resampling algorithms and architectures for distributed particle filters," IEEE Transactions on Signal Processing, vol. 53, pp. 2442–2450, 2005.
[8] G. Hendeby, J. D. Hol, R. Karlsson, and F. Gustafsson, "A graphics processing unit implementation of the particle filter," in 15th European Signal Processing Conference, 2007.
[9] C. Lenz, G. Panin, and A. Knoll, "A GPU-accelerated particle filter with pixel-level likelihood," in Vision, Modeling, and Visualization: 13th International Fall Workshop, 2008.
[10] A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes, "On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods," Journal of Computational and Graphical Statistics, vol. 19, pp. 769–789, 2010.
[11] J. Hoberock and N. Bell, "Thrust: A parallel template library," 2010. [Online]. Available: thrust.github.com
[12] S. Sengupta, M. Harris, and M. Garland, "Efficient parallel scan algorithms for GPUs," NVIDIA, Tech. Rep. NVR-2008-003, 2008.
[13] J. L. Bentley and J. B. Saxe, "Generating sorted lists of random numbers," Carnegie Mellon University, Computer Science Department, Tech. Rep. 2450, 1979. [Online]. Available: http://repository.cmu.edu/compsci/2450
[14] R. Douc and O. Cappé, "Comparison of resampling schemes for particle filtering," in Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis (ISPA 2005), 2005, pp. 64–69.
[15] G. Kitagawa, "Monte Carlo filter and smoother for non-Gaussian nonlinear state space models," Journal of Computational and Graphical Statistics, vol. 5, pp. 1–25, 1996.
[16] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, "Equation of state calculations by fast computing machines," Journal of Chemical Physics, vol. 21, pp. 1087–1092, 1953.
[17] L. M. Murray, "GPU acceleration of the particle filter: The Metropolis resampler," in DMMD: Distributed Machine Learning and Sparse Representation with Massive Data Sets, 2011. [Online]. Available: arxiv.org/abs/1202.6163
[18] C. Andrieu, A. Doucet, and R. Holenstein, "Particle Markov chain Monte Carlo methods," Journal of the Royal Statistical Society Series B, vol. 72, pp. 269–302, 2010.
[19] A. E. Raftery and S. Lewis, "How many iterations in the Gibbs sampler?" in Bayesian Statistics 4. Oxford University Press, 1992, pp. 763–773.
[20] P. Del Moral, P. E. Jacob, A. Lee, L. M. Murray, and G. W. Peters, "Feynman-Kac particle integration with geometric interacting jumps." [Online]. Available: arxiv.org/abs/1211.7191
[21] L. M. Murray, "GPU acceleration of Runge-Kutta integrators," IEEE Transactions on Parallel and Distributed Systems, vol. 23, pp. 94–101, 2012.
[22] T. Aila and S. Laine, "Understanding the efficiency of ray traversal on GPUs," in Proc. High-Performance Graphics 2009, 2009, pp. 145–149.
[23] J. S. Liu and R. Chen, "Sequential Monte Carlo methods for dynamic systems," Journal of the American Statistical Association, vol. 93, pp. 1032–1044, 1998.
[24] ——, "Blind deconvolution via sequential imputations," Journal of the American Statistical Association, vol. 90, pp. 567–576, 1995.
[25] M. Matsumoto and T. Nishimura, "Mersenne twister: A 623-dimensionally equidistributed uniform pseudorandom number generator," ACM Transactions on Modeling and Computer Simulation, vol. 8, pp. 3–30, 1998.
[26] Boost C++ Libraries. [Online]. Available: www.boost.org
[27] G. Marsaglia, "Xorshift RNGs," Journal of Statistical Software, vol. 8, no. 14, pp. 1–6, 2003.
[28] CUDA Toolkit 5.0 CURAND Library, NVIDIA Corporation, July 2012.

APPENDIX

We offer a proof of the termination of Code 12. First note that PREPERMUTE leaves d in a state where, excluding all values of N+1, the remaining values are unique. Furthermore, in PERMUTE the conditional on line 4 means that the loop on line 7 is only entered for values of i that are not represented in d.

For each such i, the while loop traverses the sequence x_0 = i, x_n = d_{x_{n-1}}, until d_{x_n} = N + 1.
For the procedure to terminate, this sequence must be finite. Because each x_n is an element of the finite set {1,...,N}, to show that the sequence is finite it is sufficient to show that it never revisits the same value twice. The proof is by induction.

1) As no value of d is i, the sequence cannot revisit its initial value x_0 = i. The element x_0 is therefore unique.
2) For k ≥ 1, assume that the elements of x_{0:k-1} are unique.
3) Now, x_{0:k} is not unique if there exists some j ∈ {1,...,k-1} such that x_k = d_{x_{k-1}} = x_j = d_{x_{j-1}}, with x_{j-1} ≠ x_{k-1} by the uniqueness of x_{0:k-1}. But this contradicts the uniqueness of the (non-(N+1)) values of d.

Thus the elements of x_{0:k} are unique, the sequence is finite, and the program must terminate. ∎
