Sequential Kernel Herding: Frank-Wolfe Optimization for Particle Filtering

Simon Lacoste-Julien (INRIA - Sierra Project-Team, École Normale Supérieure, Paris, France)
Fredrik Lindsten (Department of Engineering, University of Cambridge)
Francis Bach (INRIA - Sierra Project-Team, École Normale Supérieure, Paris, France)

arXiv:1501.02056v2 [stat.ML], 10 Feb 2015

Abstract

Recently, the Frank-Wolfe optimization algorithm was suggested as a procedure to obtain adaptive quadrature rules for integrals of functions in a reproducing kernel Hilbert space (RKHS) with a potentially faster rate of convergence than Monte Carlo integration (and "kernel herding" was shown to be a special case of this procedure). In this paper, we propose to replace the random sampling step in a particle filter by Frank-Wolfe optimization. By optimizing the position of the particles, we can obtain better accuracy than random or quasi-Monte Carlo sampling. In applications where the evaluation of the emission probabilities is expensive (such as in robot localization), the additional computational cost to generate the particles through optimization can be justified. Experiments on standard synthetic examples as well as on a robot localization task indeed indicate an improvement in accuracy over random and quasi-Monte Carlo sampling.

1 Introduction

In this paper, we explore a way to combine ideas from optimization with sampling to get better approximations in probabilistic models. We consider state-space models (SSMs, also referred to as general state-space hidden Markov models), as they constitute an important class of models in engineering, econometrics and other areas involving time series and dynamical systems. A discrete-time, nonlinear SSM can be written as

    x_t | x_{1:(t-1)} ~ p(x_t | x_{t-1});    y_t | x_{1:t} ~ p(y_t | x_t),    (1)

where x_t ∈ X denotes the latent state variable and y_t ∈ Y the observation at time t. Exact state inference in SSMs is possible, essentially, only when the model is linear and Gaussian or when the state-space X is a finite set. For solving the inference problem beyond these restricted model classes, sequential Monte Carlo methods, i.e. particle filters (PFs), have emerged as a key tool; see e.g. Doucet and Johansen (2011); Cappé et al. (2005); Doucet et al. (2000). However, since these methods are based on Monte Carlo integration, they are inherently affected by sampling variance, which can degrade the performance of the estimators.

Particular challenges arise when the observation likelihood p(y_t | x_t) is computationally expensive to evaluate. For instance, this is common in robotics applications where the observation model relates the sensory input of the robot, which can comprise vision-based systems, laser rangefinders, synthetic aperture radars, etc. For such systems, simply evaluating the observation function for a fixed value of x_t can involve computationally expensive operations, such as image processing, point-set registration, and related tasks. This poses difficulties for particle-filtering-based solutions for two reasons: (1) the computational bottleneck arising from the likelihood evaluation implies that we cannot simply increase the number of particles to improve the accuracy, and (2) this type of "complicated" observation model will typically not allow for adaptation of the proposal distribution used within the filter, in the spirit of Pitt and Shephard (1999), leaving us with the standard (but inefficient) bootstrap proposal as the only viable option. On the contrary, for these systems the dynamical model p(x_t | x_{t-1}) is often comparatively simple, e.g. a linear and Gaussian "nearly constant acceleration" model (Ristic et al., 2004).

(Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.)
The method developed in this paper is geared toward this class of filtering problems. The basic idea is that, in scenarios where the likelihood evaluation is the computational bottleneck, we can afford to spend additional computations to improve upon the sampling of the particles. By doing so, we can avoid the excessive variance arising from simple Monte Carlo sampling from the bootstrap proposal.

Contributions. We build on the optimization view from Bach et al. (2012) of kernel herding (Chen et al., 2010) to approximate the integrals appearing in the Bayesian filtering recursions. We make use of Frank-Wolfe (FW) quadrature to approximate, in particular, the mixtures of Gaussians which often arise in a particle filtering context as the mixture over past particles in the distribution over the next state. We use this approach within a filtering framework and prove theoretical convergence results for the resulting method, denoted sequential kernel herding (SKH), giving one of the first explicit convergence rates that improve on those of a standard particle filter. Our preliminary experiments show that SKH can give better accuracy than a standard particle filter or a quasi-Monte Carlo particle filter.

2 Adaptive quadrature rules with Frank-Wolfe optimization

2.1 Approximating the mean element for integration in a RKHS

We consider the problem of approximating integrals of functions belonging to a reproducing kernel Hilbert space (RKHS) H with respect to a fixed distribution p over some set X. We can think of the elements of H as real-valued functions on X, with pointwise evaluation given by the reproducing property f(x) = ⟨f, Φ(x)⟩, where Φ : X → H is the feature map from the state-space X to the RKHS. Let κ : X² → R be the associated positive definite kernel. We briefly review here the setup from Bach et al. (2012), which generalized the one from Chen et al. (2010). We want to approximate integrals E_p[f] for f ∈ H using a set of n points x^(1), ..., x^(n) ∈ X associated with positive weights w^(1), ..., w^(n) which sum to 1:

    E_p[f] ≈ Σ_{i=1}^n w^(i) f(x^(i)) = E_p̂[f],    (2)

where p̂ := Σ_{i=1}^n w^(i) δ_{x^(i)} is the associated empirical distribution defined by these points and δ_x(·) is a point mass distribution at x. If the points x^(i) are independent samples from p, then this Monte Carlo estimate (using weights of 1/n) is unbiased with a variance of V_p[f]/n, where V_p[f] is the variance of f with respect to p. By using the fact that f belongs to the RKHS H, we can actually choose a better set of points with lower error. It turns out that the worst-case error of estimators of the form (2) can be analyzed in terms of their approximation distance to the mean element µ(p) := E_p[Φ] ∈ H (Smola et al., 2007; Sriperumbudur et al., 2010). Essentially, by using the Cauchy-Schwarz inequality and the linearity of the expectation operator, we obtain:

    sup_{f ∈ H, ||f||_H ≤ 1} |E_p[f] − E_p̂[f]| = ||µ(p) − µ(p̂)||_H =: MMD(p, p̂),    (3)

and so by bounding MMD(p, p̂), we can bound the error of approximating the expectation for all f ∈ H, with ||f||_H as a proportionality constant. MMD(p, p̂) is thus a central quantity for developing good quadrature rules given by (2). In the context of RKHSs, MMD(p, q) is called the maximum mean discrepancy (Gretton et al., 2012) between the distributions p and q, and acts as a pseudo-metric on the space of distributions on X. If κ is a characteristic kernel (such as the standard RBF kernel), then MMD is in fact a metric, i.e. MMD(p, q) = 0 implies p = q. We refer the reader to Sriperumbudur et al. (2010) for the regularity conditions needed for the existence of these objects and for more details.
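To make (2) and (3) concrete, the following is a minimal numerical sketch (our own illustration, not code from the paper; all function and variable names are ours) of how the MMD between a weighted point set p̂ and a target p, represented here by a large reference sample, can be estimated with a Gaussian kernel. It uses the standard expansion of the squared RKHS distance between two empirical mean embeddings.

```python
import numpy as np

def rbf_kernel(X, Y, sigma2):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def mmd(X, w, Y, v, sigma2):
    """MMD between two weighted empirical measures sum_i w_i delta_{X_i} and sum_j v_j delta_{Y_j}:
    MMD^2 = w^T K_XX w - 2 w^T K_XY v + v^T K_YY v (squared norm of the difference of mean maps)."""
    m2 = (w @ rbf_kernel(X, X, sigma2) @ w
          - 2.0 * w @ rbf_kernel(X, Y, sigma2) @ v
          + v @ rbf_kernel(Y, Y, sigma2) @ v)
    return np.sqrt(max(m2, 0.0))

# Example: a plain Monte Carlo quadrature rule (N uniform-weight points) versus a
# large i.i.d. sample standing in for the target distribution p.
rng = np.random.default_rng(0)
p_sample = rng.normal(size=(5000, 2))      # reference sample approximating p
x_quad = rng.normal(size=(50, 2))          # quadrature points
w_quad = np.full(50, 1.0 / 50)             # uniform weights
u = np.full(5000, 1.0 / 5000)
print("MMD(p_hat, p) ~", mmd(x_quad, w_quad, p_sample, u, sigma2=1.0))
```

Replacing the random points and uniform weights above by the output of the Frank-Wolfe quadrature of Section 2.2 is precisely what drives this quantity down faster.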
2.2 Frank-Wolfe optimization for adaptive quadrature

To get a good quadrature rule p̂, our goal is thus to minimize ||µ(p̂) − µ(p)||_H. We note that µ(p) lies in the marginal polytope M ⊂ H, defined as the closure of the convex hull of Φ(X). We suppose that Φ(x) is uniformly bounded in the feature space, that is, there is a finite R such that ||Φ(x)||_H ≤ R for all x ∈ X. This means that M is a closed bounded convex subset of H, and we could in theory optimize over it. This insight was used by Bach et al. (2012), who considered using the Frank-Wolfe optimization algorithm to optimize the convex function J(g) := (1/2) ||g − µ(p)||²_H over M to obtain adaptive quadrature rules. The Frank-Wolfe algorithm (also called conditional gradient) (Frank and Wolfe, 1956) is a simple first-order iterative constrained optimization algorithm for optimizing smooth functions over closed bounded convex sets like M (see Dunn (1980) for its convergence analysis on general infinite-dimensional Banach spaces). At every iteration, the algorithm finds a good feasible search vertex of M by minimizing the linearization of J at the current iterate g_k: ḡ_{k+1} = argmin_{g ∈ M} ⟨J'(g_k), g⟩. The next iterate is then obtained by a suitable convex combination of the search vertex ḡ_{k+1} and the previous iterate g_k: g_{k+1} = (1 − γ_k) g_k + γ_k ḡ_{k+1} for a suitable step-size γ_k, taken either from a fixed schedule (e.g. 1/(k+1)) or by line-search. A crucial property of this algorithm is that the iterate g_k is thus a convex combination of the vertices of M visited so far. This provides a sparse expansion for the iterate, and makes the algorithm suitable for high-dimensional (or even infinite-dimensional) optimization; this explains in part the renewed interest in machine learning over the last decade for this old optimization algorithm (see Jaggi (2013) for a recent survey). In our setup where M is the convex hull of Φ(X), the vertices of M are of the form ḡ_{k+1} = Φ(x^(k+1)) for some x^(k+1) ∈ X. Running Frank-Wolfe on M thus yields g_k = Σ_{i=1}^k w_k^(i) Φ(x^(i)) = E_p̂[Φ] for some weighted set of points {w_k^(i), x^(i)}_{i=1}^k. The iterate g_k therefore corresponds to a quadrature rule p̂ of the form (2) with g_k = E_p̂[Φ], and this is the relationship that was explored in Bach et al. (2012). Running Frank-Wolfe optimization with the step-size γ_k = 1/(k+1) reduces to the kernel herding algorithm proposed by Chen et al. (2010). See also Huszár and Duvenaud (2012) for an alternative approach with negative weights.
Algorithm 1 presents the Frank-Wolfe optimization algorithm for solving min_{g ∈ M} J(g) in the context of getting quadrature rules (we also introduce the shorthand notation µ_p := µ(p)). We note that to evaluate the quality MMD(p̂, p) of this adaptive quadrature rule, we need to be able to evaluate µ_p(x) = ∫_{x' ∈ X} p(x') κ(x', x) dx' efficiently. This is true only for specific pairs of kernels and distributions, but fortunately it is the case when p is a mixture of Gaussians and κ is a Gaussian kernel. This insight is central to this paper; we explore this case more specifically in Section 2.3. To find the next quadrature point, we also need to (approximately) optimize µ_p(x) over X (step 3 of Algorithm 1, called the FW vertex search). In general, this yields a non-convex optimization problem, and thus cannot be solved with guarantees, even with gradient descent. In our current implementation, we approach step 3 by an exhaustive search over M random samples from p, precomputed when FW-Quad is called. We thus follow the idea from the kernel herding paper (Chen et al., 2010) of choosing the best N "super-samples" out of a large set of M samples. Thanks to the fact that convergence guarantees for Frank-Wolfe optimization can still be given when using an approximate FW vertex search, we show in Appendix B of the supplementary material that this procedure adds either a O(1/M^{1/4}) term or a O(1/√M) term to the worst-case MMD(p̂, p) error.

In our description of Algorithm 1, a preset number N of particles (iterations) is used. Alternatively, we could use a variable number of iterations with the terminating criterion ||g_k − µ(p)||_H ≤ ε, which can be explicitly computed during the algorithm and provides the MMD error bound on the returned quadrature rule. Option (2) on line 5 chooses the step-size γ_k by analytic line-search (hereafter referred to as the FW-LS version), while option (1) chooses the kernel herding step-size γ_k = 1/(k+1) (hereafter referred to as the FW version), which always yields uniform weights: w_k^(i) = 1/k for all i ≤ k. A third alternative is to re-optimize J(g) over the convex hull of the previously visited vertices; this is called the fully corrective version (Jaggi, 2013) of the Frank-Wolfe algorithm (hereafter referred to as FCFW). In this case, (w_{k+1}^(1), ..., w_{k+1}^(k+1)) = argmin_{w ∈ ∆_{k+1}} w^T K_{k+1} w − 2 c_{k+1}^T w, where ∆_{k+1} is the (k+1)-dimensional probability simplex, K_{k+1} is the kernel matrix on the (k+1) vertices, (K_{k+1})_{ij} = κ(x^(i), x^(j)), and (c_{k+1})_i = µ_p(x^(i)) for i = 1, ..., (k+1). This is a convex quadratic problem over the simplex. A slightly modified version of FCFW is called the min-norm point algorithm and can be optimized more efficiently using special-purpose active-set algorithms; see Bach (2013, §9.2) for more details.

We refer the reader to Bach et al. (2012) for more details on the rate of convergence of Frank-Wolfe quadrature assuming that the FW vertex is found with guarantees. We summarize them as follows: if H is infinite-dimensional, then FW-Quad gives the same O(1/√N) rate for the MMD error as standard random sampling, for all FW methods. On the other hand, if a ball of non-zero radius centered at µ_p lies within M, then faster rates than random sampling are possible: FW gives a O(1/N) rate, whereas FW-LS and FCFW give exponential convergence rates (though in practice we often see differences between these methods that are not explained by the theory).

2.3 Example: mixture of Gaussians

We describe here in more detail the Frank-Wolfe quadrature when p is a mixture of Gaussians p(x) = Σ_{i=1}^K π_i N(x | µ_i, Σ_i) for X = R^d and κ is the Gaussian kernel κ_σ(x, x') := exp(−(1/(2σ²)) ||x − x'||²). In this case, µ_p(x) = Σ_{i=1}^K π_i (√(2π) σ)^d N(x | µ_i, Σ_i + σ² I_d). We thus need to optimize a difference of mixtures of Gaussian bumps in step 3 of Algorithm 1, a non-convex optimization problem that we approximately solve by exhaustive search over M random samples from p.
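For reference, here is a direct transcription of this closed form (a sketch with our own function name, assuming the isotropic Gaussian kernel above with bandwidth σ² passed as `sigma2`):

```python
import numpy as np

def mean_map_gaussian_mixture(x, weights, means, covs, sigma2):
    """mu_p(x) = sum_i pi_i (2*pi*sigma2)^(d/2) N(x | mu_i, Sigma_i + sigma2*I_d)
    for the Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2*sigma2))."""
    d = x.shape[-1]
    val = 0.0
    for pi_i, mu_i, Sigma_i in zip(weights, means, covs):
        S = Sigma_i + sigma2 * np.eye(d)
        diff = x - mu_i
        quad = diff @ np.linalg.solve(S, diff)
        dens = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
        val += pi_i * (2 * np.pi * sigma2) ** (d / 2) * dens
    return val
```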
Algorithm 1 FW-Quad(p, H, N): Frank-Wolfe adaptive quadrature
Input: distribution p, RKHS H which defines the kernel κ(·,·) and the state-space X, number of samples N.
1: Let g_0 = 0.
2: for k = 0, ..., N−1 do
3:   Solve x^(k+1) = argmin_{x ∈ X} ⟨g_k − µ_p, Φ(x)⟩, that is: x^(k+1) = argmin_{x ∈ X} ( Σ_{i=1}^k w_k^(i) κ(x^(i), x) − µ_p(x) ).
4:   Option (1): let γ_k = 1/(k+1).
5:   Option (2): let γ_k = ⟨g_k − µ_p, g_k − Φ(x^(k+1))⟩ / ||g_k − Φ(x^(k+1))||²  (LS).
6:   Update g_{k+1} = (1 − γ_k) g_k + γ_k Φ(x^(k+1)); i.e. w_{k+1}^(k+1) = γ_k and w_{k+1}^(i) = (1 − γ_k) w_k^(i) for i = 1, ..., k.
7: end for
8: Return: p̂ = Σ_{i=1}^N w_N^(i) δ_{x^(i)}.

Algorithm 2 Particle filter template (joint predictive distribution form); the SKH algorithm is obtained by changing step 3
Input: SSM p(x_t | x_{t−1}), o_t(x_t) := p(y_t | x_t) for t ∈ 1:T.
Maintain p̂_t(x_{1:t}) = Σ_{i=1}^N w_t^(i) δ_{x_{1:t}^(i)}(x_{1:t}) during the algorithm as an approximation of p(x_t, x_{1:(t−1)} | y_{1:(t−1)}).
1: Let p̃_1(x_1) := p(x_1).
2: for t = 1, ..., T do
3:   Sample: get p̂_t = SAMPLE(p̃_t, N). [For SKH, use p̂_t = FW-Quad(p̃_t, H_t, N).]
4:   Include observation and normalize: Ŵ_t = E_{p̂_t}[o_t]; r̂_t(x_{1:t}) := (1/Ŵ_t) o_t(x_t) p̂_t(x_{1:t}).
5:   Propagate approximation forward: p̃_{t+1}(x_{t+1}, x_{1:t}) := p(x_{t+1} | x_t) r̂_t(x_{1:t}).
6: end for
7: Return: filtering distribution r̂_T; predictive distribution p̂_{T+1}; normalization constants Ŵ_1, ..., Ŵ_T.

3 Sequential kernel herding

3.1 Sequential Monte Carlo

Consider again the SSM in (1). The joint probability density function for a sequence of latent states x_{1:T} := (x_1, ..., x_T) and observations y_{1:T} factorizes as p(x_{1:T}, y_{1:T}) = Π_{t=1}^T p(x_t | x_{t−1}) p(y_t | x_t), with p(x_1 | x_0) := p(x_1) denoting the prior density on the initial state. We would like to do approximate inference in this SSM. In particular, we could be interested in computing the joint filtering distribution r_t(x_{1:t}) := p(x_{1:t} | y_{1:t}) or the joint predictive distribution p_{t+1}(x_{t+1}, x_{1:t}) := p(x_{t+1}, x_{1:t} | y_{1:t}). In particle filtering methods, we approximate these distributions with empirical distributions from weighted particle sets {w_t^(i), x_{1:t}^(i)}_{i=1}^N as in (2). We note that it is easy to marginalize p̂ with a simple weight summation, and so we will present the algorithm as producing an approximation of the joint distributions r_t and p_t defined above, with the understanding that the marginal ones are easy to obtain afterwards. In the terminology of particle filtering, x_t^(i) is the particle at time t, whereas x_{1:t}^(i) is the particle trajectory. While in principle the PF provides an approximation of the full joint distribution r_t(x_{1:t}), it is well known that this approximation deteriorates for any marginal of x_s for s ≪ t (Doucet and Johansen, 2011). Hence, the PF is typically only used to approximate marginals of x_s for s ≲ t (fixed-lag smoothing) or s = t (filtering), or for prediction.
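The following is a compact Python sketch of FW-Quad as described above (our own illustration, not the authors' Matlab implementation): the FW vertex search of step 3 is an exhaustive search over M precomputed candidate points, the step size is option (1), and the candidate scores are updated online, which is what gives the O(NM) cost per call mentioned in the Runtimes paragraph of Section 4.3. It expects a vectorized mean map such as the `mean_map_gaussian_mixture` sketch of Section 2.3.

```python
import numpy as np

def fw_quad(candidates, mu_p, kernel, N):
    """Sketch of FW-Quad (Algorithm 1) with step size gamma_k = 1/(k+1) (the FW / kernel
    herding variant) and the FW vertex search done by exhaustive search over M candidates.

    candidates : (M, d) array of search points (e.g. i.i.d. samples from p)
    mu_p       : callable returning the mean map mu_p(x) for each row of an (M, d) array
    kernel     : callable kernel(X, Y) -> Gram matrix of shape (len(X), len(Y))
    N          : number of quadrature points to return
    Returns (points, weights) defining p_hat = sum_i w_i delta_{x_i}.
    """
    mu_vals = mu_p(candidates)                # mu_p evaluated once on all M candidates
    # score_j = <g_k - mu_p, Phi(c_j)> = sum_i w_i k(x_i, c_j) - mu_p(c_j); g_0 = 0
    score = -mu_vals.copy()
    idx, weights = [], np.zeros(0)
    for k in range(N):
        j = int(np.argmin(score))             # approximate FW vertex (step 3)
        gamma = 1.0 / (k + 1)                 # option (1): kernel herding step size
        weights = np.append((1 - gamma) * weights, gamma)
        idx.append(j)
        # online O(M) update: all old weights shrink by (1 - gamma), one kernel column is added
        k_new = kernel(candidates[j:j + 1], candidates)[0]
        score = (1 - gamma) * (score + mu_vals) + gamma * k_new - mu_vals
    return candidates[np.array(idx)], weights
```

The FCFW variant would instead re-solve the quadratic program over the simplex from Section 2.2 at each iteration, which changes all weights and prevents this online update (hence the O(N²M) scaling noted later).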
Algorithm 2 presents the bootstrap particle filtering algorithm (Gordon et al., 1993) from the point of view of propagating an approximate posterior distribution forward in time (see e.g. Fearnhead, 2005). We describe it as propagating an approximation p̂_t(x_{1:t}) of the joint predictive distribution one time step forward with the model dynamics to obtain p̃_{t+1}(x_{t+1}, x_{1:t}) (step 5), and then randomly sampling from it (step 3) to get the new predictive approximation p̂_{t+1}(x_{t+1}, x_{1:t}). As p̂_t is an empirical distribution, p̃_{t+1} is a mixture distribution (the mixture components come from the particles at time t):

    p̃_{t+1}(x_{t+1}, x_{1:t}) = (1/Ŵ_t) Σ_{i=1}^N [ p(y_t | x_t^(i)) w_t^(i) ] · [ p(x_{t+1} | x_t^(i)) ] · δ_{x_{1:t}^(i)}(x_{1:t}),    (4)

where the first bracketed factor is the mixture weight and the second is the mixture component. We denote the conditional normalization constant at time t by W_t := p(y_t | y_{1:(t−1)}) and the global normalization constant by Z_t := p(y_{1:t}) = Π_{u=1}^t W_u. Ŵ_t is the particle filter approximation of W_t and is obtained by summing the un-normalized mixture weights in (4); see step 4 in Algorithm 2. Randomly sampling from (4) is equivalent to first sampling a mixture component according to the mixture weights (i.e., choosing a past particle x_{1:t}^(i) to propagate), and then sampling its next extension state x_{t+1}^(i) with probability p(x_{t+1} | x_t^(i)). The standard bootstrap particle filter is thus obtained by maintaining uniform weights for the predictive distribution (w_t^(i) = 1/N) and randomly sampling from (4) to obtain the particles at time t+1. This gives an unbiased estimate of p̃_{t+1}: E_{p̃_{t+1}}[p̂_{t+1}] = p̃_{t+1}. Lower-variance estimators can be obtained by using a different resampling mechanism for the particles than this multinomial sampling scheme, such as stratified resampling (Carpenter et al., 1999), and these are usually used in practice instead.

One way to improve the particle filter is thus to replace the random sampling stage of step 3 with a different sampling mechanism that has lower variance or better approximation properties for the distribution p̃_{t+1} that we are trying to approximate. As we obtain the normalization constants W_t by integrating the observation probability, it seems natural to look for particle point sets with better integration properties. By replacing random sampling with a quasi-random number sequence, we obtain the already proposed sequential quasi-Monte Carlo scheme (Philomin et al., 2000; Ormoneit et al., 2001; Gerber and Chopin, 2014). The main contribution of our work is to instead propose to use Frank-Wolfe quadrature in step 3 of the particle filter to obtain better (adapted) point sets.

3.2 Sequential kernel herding

In the sequential kernel herding (SKH) algorithm, we simply replace step 3 of Algorithm 2 with p̂_t = FW-Quad(p̃_t, H_t, N). As mentioned in the introduction, many dynamical models used in practice assume Gaussian transitions. Therefore, we will put particular emphasis on the (more general) case where p(x_t | x_{1:(t−1)}, y_{1:(t−1)}) is a mixture of Gaussians, with parameters for the mixture components that can be arbitrary functions of the state history x_{1:(t−1)}, y_{1:(t−1)}; this is thus still fairly general. We then consider the Gaussian kernel for the FW-Quad procedure, as the required quantities can be computed analytically.
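The sketch below puts the pieces together as a marginalized-filtering version of Algorithm 2 with FW-Quad in step 3, specialized to a Gaussian transition so that p̃_{t+1} is a mixture of Gaussians. It reuses `rbf_kernel`, `mean_map_gaussian_mixture` and `fw_quad` from the earlier sketches; the interface and names are ours, and sampling the prior directly at t = 1 is a simplification of step 3.

```python
import numpy as np

def sample_mixture(mix_w, means, covs, M, rng):
    """Draw M points from a Gaussian mixture: pick a component by weight, then draw from it."""
    ks = rng.choice(len(mix_w), size=M, p=mix_w)
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in ks])

def skh_filter(prior_sample, transition, loglik, T, N, M, sigma2, rng):
    """Sketch of the SKH filter (Algorithm 2 with step 3 replaced by FW-Quad).

    prior_sample : callable n -> (n, d) samples from p(x_1)
    transition   : callable x -> (mean, cov) of the Gaussian p(x_{t+1} | x_t = x)
    loglik       : callable (t, X) -> log p(y_t | x) evaluated row-wise on X
    """
    ker = lambda A, B: rbf_kernel(A, B, sigma2)
    particles, weights = prior_sample(N), np.full(N, 1.0 / N)
    log_Z = 0.0
    for t in range(T):
        # step 4: include the observation y_t and normalize (this is r_hat_t)
        w_obs = weights * np.exp(loglik(t, particles))
        W_hat = w_obs.sum()                       # estimate of W_t = p(y_t | y_{1:t-1})
        log_Z += np.log(W_hat)
        r_w = w_obs / W_hat
        if t + 1 == T:
            break
        # step 5: propagate forward; p_tilde_{t+1} is the Gaussian mixture of eq. (4)
        moments = [transition(x) for x in particles]
        means = [m for m, _ in moments]
        covs = [C for _, C in moments]
        # step 3 (next time step): FW-Quad on p_tilde_{t+1} instead of random sampling
        cands = sample_mixture(r_w, means, covs, M, rng)
        mu_p = lambda X: np.array(
            [mean_map_gaussian_mixture(x, r_w, means, covs, sigma2) for x in X])
        particles, weights = fw_quad(cands, mu_p, ker, N)
    return particles, r_w, log_Z   # filtering approximation r_hat_T and log of the Z_T estimate
```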
An important subtle point is which Hilbert space H_t to consider. In this paper, we focus on the marginalized filtering case, i.e. we are interested in p(x_t | y_{1:t}) only. Thus we are only interested in functions of x_t, which is why we do not put the past histories in our kernel at time t. For simplicity, we also assume that H_t = H for all t (we use the same kernel for each time step). Even though the algorithm can maintain the distribution over the whole history p̂_t(x_{1:t}), the past histories x_{1:(t−1)} are marginalized out when computing the mean map, for example µ(p̃_t) = E_{p̃_t(x_{1:t})}[Φ(x_t)]. During the SKH algorithm, we can still track the particle histories by keeping track of which mixture component in (4) x_t came from, but the past history is not used in the computation of the kernel and thus does not appear as a repulsion term in step 3 of Algorithm 1. We leave it as future work to analyze what kind of high-dimensional kernel on past histories would make sense in this context, and to analyze its convergence properties. The particle histories are useful in the Rao-Blackwellized extension that we present in Appendix A and use in the robot localization experiment of Section 4.3.

3.3 Convergence theory

In this section, we give sufficient conditions to guarantee that SKH is consistent as N goes to infinity. Let p_t here denote the marginalized predictive distribution instead of the joint. Let F_t be the forward transformation operator on signed measures that takes the predictive distribution p_t on x_t and yields the un-normalized marginalized predictive distribution F_t p_t on x_{t+1} in the SSM. Thus for a measure ν, we get (F_t ν)(·) := ∫_{X_t} p(· | x_t) p(y_t | x_t) dν(x_t). We also have that p_{t+1} = (1/W_t) F_t p_t.

For the following theorem, F_t is a function space on X_{t+1}, defined (depending on H_{t+1}) as all functions for which the following semi-norm is finite:¹

    ||f||_{F_t} := sup_{||h||_{H_{t+1}} = 1} | ∫_{X_{t+1}} f(x_{t+1}) h(x_{t+1}) dx_{t+1} |.

Theorem 1 (Bounded growth of the mean map). Suppose that the function f_t : (x_{t+1}, x_t) ↦ p(y_t | x_t) p(x_{t+1} | x_t) is in the tensor product function space F_t ⊗ H_t with the nuclear norm ||f_t||_{F_t⊗H_t} := inf Σ_i ||α_i||_{F_t} ||β_i||_{H_t}, where the infimum is taken over all possible expansions such that f_t(x_{t+1}, x_t) = Σ_i α_i(x_{t+1}) β_i(x_t) for all x_t, x_{t+1}. Then for any finite signed Borel measure ν on X_t, we have

    ||µ(F_t ν)||_{H_{t+1}} ≤ ||f_t||_{F_t⊗H_t} ||µ(ν)||_{H_t}.

Theorem 2 (Consistency of SKH). Suppose that for all 1 ≤ t ≤ T, f_t is in F_t ⊗ H_t as defined in Theorem 1 and o_t is in H_t. Then we have:²

    ||µ(p̂_T) − µ(p_T)||_{H_T} ≤ ε̂_T + ( R ||o_{T−1}||_{H_{T−1}} / W_{T−1} + ρ_{T−1} ) Σ_{t=1}^{T−1} χ_t ε̂_t ( Π_{k=t}^{T−2} ρ_k ),

where ρ_t := ||f_t||_{F_t⊗H_t} / W_t, χ_t := Π_{k=1}^{t−1} W_k / Ŵ_k, and ε̂_t is the FW error reported at time t by the algorithm: ε̂_t := ||µ(p̂_t) − µ(p̃_t)||_{H_t}.

We note that χ_t ≈ 1, as we expect the errors on W_k to go in either direction and thus to cancel each other out over time (though in the worst case it could grow exponentially in t). If ε̂_t ≤ ε and ρ_t ≤ ρ, we basically have ||µ(p̂_T) − µ(p_T)|| = O(ρ^T ε) if ρ > 1; O(Tε) if ρ = 1; and O(ε) if ρ < 1 (a contraction). The exponential dependence in T is similar to that of a standard particle filter for general distributions; see Douc et al. (2014), though, for conditions yielding a contraction for the PF.

¹ In general, the integral over X_{t+1} should be taken with respect to the base measure for which the conditional density p(x_{t+1} | x_t) is defined. All proofs are in the supplementary material.
² We use the convention that the empty sum is 0 and the empty product is 1.
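To see where the O(ε) claim in the contractive case comes from, one can bound the sum in Theorem 2 by a geometric series (our own rearrangement, under the stated assumptions ε̂_t ≤ ε, ρ_t ≤ ρ < 1 and additionally χ_t ≤ χ uniformly in t):

    ||µ(p̂_T) − µ(p_T)||_{H_T} ≤ ε + ( R ||o_{T−1}||_{H_{T−1}} / W_{T−1} + ρ ) χ ε Σ_{t=1}^{T−1} ρ^{T−1−t}
                               ≤ ε ( 1 + (χ / (1 − ρ)) ( R ||o_{T−1}||_{H_{T−1}} / W_{T−1} + ρ ) ),

a bound that is uniform in T. The ρ = 1 and ρ > 1 regimes follow from the same sum evaluating to T − 1 and to a growing geometric series, respectively.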
Importantly, for a fixed T it follows that the rates of convergence for Frank-Wolfe in N translate into rates of error for integrals of functions in H with respect to the predictive distribution p_T. Thus if we suppose that H is finite-dimensional, that p_t has full support on X for all t, and that the kernel κ is continuous, then by Proposition 1 in Bach et al. (2012) the faster rates for Frank-Wolfe hold, and in particular we could obtain an error bound of O(1/N) with N particles. As far as we know, this is the first explicit convergence rate, as a function of the number of particles, that is faster than the standard O(1/√N) of Monte Carlo particle filters. In contrast, Gerber and Chopin (2014, Theorem 7) showed a o(1/√N) rate for the randomized version of their SQMC algorithm (note the little-o).³ Note that the theorem does not depend on how the error ε on the mean maps of the distributions is obtained; so if one could show that a QMC point set also achieves a faster rate for the error on the mean maps (rather than on the distributions themselves, as is usually given), then their rates would also translate to the global rate by Theorem 2.⁴

³ The rate holds for the approximation of integrals of continuous bounded functions.
⁴ We also note that a simple computation shows that, for a Monte Carlo sample of size N, E ||µ(p̂) − µ(p)||²_H ≤ (R² − ||µ(p)||²)/N.

4 Experiments

4.1 Sampling from a mixture of Gaussians

We start by investigating the merits of different sampling schemes for approximating mixtures of Gaussians, since this is an intrinsic step of the SKH algorithm. In Figure 1, we give the MMD error as well as the error on the mean function in terms of the number of particles N for the different sampling schemes, on a randomly chosen mixture of Gaussians with K = 100 components in d = 2 dimensions. Additional results, as well as the details of the model, are given in Appendix C.1 of the supplementary material. In our experiments, the number of FW search points is M = 50,000. We note that even though in theory all methods should have the same O(1/√N) rate of convergence for the MMD (as H is infinite-dimensional), FCFW empirically improves significantly over the other methods. As d increases, the difference between the methods tapers off for a fixed kernel bandwidth σ², but increasing σ² gives better results for FW and FCFW than for the other schemes.

Figure 1: Top: MMD error for the different sampling schemes (MC, QMC, FW, FCFW) where p is a mixture of 2-d Gaussians with K = 100 components. Bottom: error on the mean estimate for the same mixture. The dashed lines are linear fits with slopes reported next to the axes.

In the remaining sections, we evaluate empirically the application of kernel herding in a filtering context using the proposed SKH algorithm.
4.2 Particle filtering using SKH on synthetic examples

We first consider several synthetic data sets in order to assess the improvements offered by Frank-Wolfe quadrature over standard Monte Carlo and quasi-Monte Carlo techniques. We generate data from four different systems (further details on the experimental setup can be found in Appendix C.2):

- Two linear Gaussian state-space (LGSS) models, of dimensions d = 3 and d = 15, respectively.
- A jump Markov linear system (JMLS), consisting of 2 interacting LGSS models of dimension d = 2. The switching between the models is governed by a hidden 2-state Markov chain.
- A nonlinear benchmark time-series model used by, among others, Doucet et al. (2000) and Gordon et al. (1993). The model is of dimension d = 1 and is given by
    x_{t+1} = 0.5 x_t + 25 x_t / (1 + x_t²) + 8 cos(1.2 t) + v_t,
    y_t = 0.05 x_t² + e_t,
  with v_t and e_t mutually independent standard Gaussian (a simulation sketch for this model is given after this list).

These models are ordered by increasing level of difficulty for inference. For the LGSS models, the exact filtering distributions can be computed by a Kalman filter. For the JMLS, this is also possible by running a mixture of Kalman filters, albeit at a computational cost of 2^T (where T is the total number of time steps). For the nonlinear system, no closed-form expressions are available for the filtering densities; instead we run a PF with N = 100,000 particles as a reference.
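As referenced in the list above, here is a minimal data-generation sketch for the nonlinear benchmark model (our own code; the initialization and the 0-based time-index convention are our choices, the paper's exact setup is in its Appendix C.2):

```python
import numpy as np

def simulate_benchmark(T, rng):
    """x_{t+1} = 0.5 x_t + 25 x_t / (1 + x_t^2) + 8 cos(1.2 t) + v_t,  y_t = 0.05 x_t^2 + e_t,
    with v_t, e_t independent standard Gaussian noise."""
    x, y = np.zeros(T), np.zeros(T)
    x[0] = rng.normal()                      # initialization chosen for illustration
    for t in range(T):
        y[t] = 0.05 * x[t] ** 2 + rng.normal()
        if t + 1 < T:
            x[t + 1] = (0.5 * x[t] + 25 * x[t] / (1 + x[t] ** 2)
                        + 8 * np.cos(1.2 * t) + rng.normal())
    return x, y

x, y = simulate_benchmark(T=100, rng=np.random.default_rng(0))
```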
We generate 30 batches of observations for T = 100 time steps from all systems, except for the JMLS where we use T = 10 (to allow exact filtering). We run the proposed SKH filter, using both FW and FCFW optimization, and compare against a bootstrap PF (using stratified resampling (Carpenter et al., 1999)) and a quasi-Monte Carlo PF based on a Sobol-sequence point set. All methods are run with N varying from 20 to 200 particles. We deliberately use rather few particles since, as discussed above, we believe that this is the setting where the proposed method can be particularly useful.

To assess the performance of the different methods, we first compute the root-mean-squared errors (RMSE) of the filtered mean-state estimates over the T time steps, w.r.t. the reference filters. We report the median RMSEs over the 30 different data batches, along with the 25% and 75% quantiles and the minimum and maximum values, in Figure 2. The SKH algorithms were run for three different values of σ² ∈ {0.01, 0.1, 1}. Here, we report the results for σ² = 1 for the LGSS models and the JMLS, and for σ² = 0.1 for the nonlinear benchmark model. The results for the other values are given in Appendix C.2. The improvements are somewhat robust to the value of σ², but in some cases significant differences were observed. As can be seen, both SKH methods improve significantly upon both QMC and the bootstrap PF. For the two LGSS models, we also compute the MMD (reported in the rightmost column of Figure 2).

4.3 Vision-based UAV localization

In this section, we apply the proposed SKH algorithm to solve a filtering problem in field robotics. We use the data and the experimental setup described by Törnqvist et al. (2009). The problem consists of estimating the full six-dimensional pose of an unmanned aerial vehicle (UAV).

Törnqvist et al. (2009) proposed a vision-based solution, essentially tracking interest points in the camera images over consecutive frames to estimate the ego-motion. This information is then fused with the inertial and barometer sensors to estimate the pose of the UAV. The system is modelled in state-space form, with a state vector comprising the position, velocity and acceleration, as well as the orientation and the angular velocity of the UAV. The state is also augmented with sensor biases, resulting in a state dimension of 22. Furthermore, the state is augmented with the three-dimensional positions of the interest points that are currently tracked by the vision system; this is a varying number, but typically around ten.

To deal with the high-dimensional state vector, Törnqvist et al. (2009) used a Rao-Blackwellized PF (see Appendix A) to solve the filtering problem, marginalizing all but 6 state components (the pose, i.e., the position and orientation) using a combination of Kalman filters and extended Kalman filters. The remaining 6 state variables were tracked using a bootstrap particle filter with N = 200 particles; this strikingly small number of particles is owed to the computational cost of the likelihood evaluation.

For the current experiment, we obtained the code and the flight-test data from Törnqvist et al. (2009). The modularity of our approach allowed us to simply replace the Monte Carlo simulation step within their setup with FW-Quad. We ran SKH-FW with σ² = 10 and SKH-FCFW with σ² = 0.1, as well as the bootstrap PF used in Törnqvist et al. (2009) and a QMC-PF, all methods using N = 50, 100, and 200 particles. We ran all methods 10 times on the same data; the variation in SKH comes from the random search points for the FW procedure, and in QMC from starting the Sobol sequence at different points.
For comparison, we ran a reference PF with N = 100,000 particles 10 times and averaged the results. The median position errors for 100 seconds of robot time (there are 20 SSM time steps per second of robot time) are given in Figure 3. The UAV is assumed to start at a known location at time zero; hence, all the errors are zero initially. Note that all methods accumulate errors over time. This is natural, since there is no absolute position reference available (i.e., the filter is unstable) and the objective is basically to keep the error as small as possible for as long as possible. SKH-FW gives the overall best results here, with significant improvements over the bootstrap PF and the QMC methods for small numbers of particles. With only N = 200 particles, SKH-FW even gives errors at the last time step similar to those of one of the reference PFs (using N = 100,000 particles). See Appendix C.2.1 for a discussion of the role of σ² for FCFW.

Runtimes. In these experiments, we focused on investigating how optimization can improve the error per particle, as the gain in runtime depends on the exact implementation as well as on the likelihood evaluation cost. We note that the FW-Quad algorithm scales as O(NM) for N samples and M search points when using FW, by updating the objective on the M search points in an online fashion (we also empirically observed this linear scaling in N). On the other hand, FCFW scales as O(N²M), as the weights on the particles may change at each iteration, preventing the same online trick. SKH scales linearly with the number of time steps T (as does a standard PF). For the UAV application, the original Matlab code from Törnqvist et al. (2009) spent an average of 0.2 s per time step for N = 50 particles (linear in the number of particles, as the likelihood evaluation is the bottleneck) on a XEON E5-2620 2.10 GHz PC. The overhead of using our Matlab implementation of FW-Quad with N = 50 is about 0.15 s per time step for FW and 0.3 s for FCFW; and 0.3 s for FW and 1.0 s for FCFW for N = 100 (we used M = 10,000 search points in this experiment). In practice, this means that SKH-FW can be run here with 50 particles in the same time as the standard PF is run with about 90 particles. But as Figure 3 shows, the error for SKH-FW with 50 particles is still much lower than that of the PF with 200 particles.

Figure 2: RMSEs (left and middle columns) for the four considered models and MMDs (right column) for the two LGSS models.

Figure 3: Median position errors over 10 runs for each method. The errors are computed relative to the mean prediction over 10 runs of a PF with 100k particles (the variation of the reference PF is also shown for PF 100k). The error bars represent the [25%, 75%] quantile. The rightmost plot shows the error at the last time step as a function of N. 100 s of robot time represents 2,000 SSM time steps; it does not correspond to computation time.

5 Conclusion

We have developed a method for Bayesian filtering problems that combines optimization with particle filtering. The method has been demonstrated to provide improved performance over both random sampling and quasi-Monte Carlo methods. The proposed method is modular and can be used with different types of particle filtering techniques, such as the Rao-Blackwellized particle filter. Further investigating this possibility for other classes of particle filters is a topic for future work. Future work also includes a deeper analysis of the convergence theory for the method, in order to develop practical guidelines for the choice of the kernel bandwidth.

Acknowledgements

We thank Eric Moulines for useful discussions. This work was partially supported by the MSR-Inria Joint Centre, a grant from the European Research Council (SIERRA project 239993) and the Swedish Research Council (project Learning of complex dynamical systems, number 637-2014-466).

References
F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145-373, 2013.
F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1359-1366, 2012.
O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models. Springer, 2005.
J. Carpenter, P. Clifford, and P. Fearnhead. Improved particle filter for nonlinear problems. IEE Proceedings Radar, Sonar and Navigation, 146(1):2-7, 1999.
Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2010.
R. Douc, E. Moulines, and J. Olsson. Long-term stability of sequential Monte Carlo methods under verifiable conditions. Annals of Applied Probability, 24(5):1767-1802, 2014.
A. Doucet and A. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan and B. Rozovsky, editors, The Oxford Handbook of Nonlinear Filtering. Oxford University Press, 2011.
A. Doucet, S. J. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197-208, 2000.
J. C. Dunn. Convergence rates for conditional gradient sequences generated by implicit step length rules. SIAM Journal on Control and Optimization, 18:473-487, 1980.
P. Fearnhead. Using random quasi-Monte-Carlo within particle filters, with application to financial time series. Journal of Computational and Graphical Statistics, 14(4):751-769, 2005.
M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95-110, 1956.
M. Gerber and N. Chopin. Sequential quasi-Monte Carlo. arXiv preprint arXiv:1402.4039v5, 2014.
N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Processing, IEE Proceedings F, 140(2):107-113, Apr. 1993.
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13:723-773, 2012.
F. Huszár and D. Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI), pages 377-385, 2012.
M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.
D. Ormoneit, C. Lemieux, and D. J. Fleet. Lattice particle filters. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI), pages 395-402, 2001.
V. Philomin, R. Duraiswami, and L. Davis. Quasi-random sampling for condensation. In Proceedings of the 6th European Conference on Computer Vision (ECCV), 2000.
M. K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590-599, 1999.
B. Ristic, S. Arulampalam, and N. Gordon. Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House, London, UK, 2004.
A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pages 13-31. Springer, 2007.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 99:1517-1561, 2010.
D. Törnqvist, T. B. Schön, R. Karlsson, and F. Gustafsson. Particle filter SLAM with high dimensional vehicle model. Journal of Intelligent and Robotic Systems, 55(4):249-266, 2009.
Supplementary material

A Extension for Rao-Blackwellization

A common strategy for improving the efficiency of the PF is to make use of Rao-Blackwellization; this idea can also be used with SKH. Rao-Blackwellization, here, refers to analytically marginalizing some conditionally tractable component of the state vector and thereby reducing the dimensionality of the space on which the PF operates. Assume that the state of the system is comprised of two components x_t and z_t, where the filtering density for z_t is tractable conditionally on the history of x_{1:t}. The typical case is that of a conditionally linear Gaussian system, in which case the aforementioned conditional filtering density p(z_t | x_{1:t}, y_{1:t}) is Gaussian and computable using a Kalman filter (conditionally on x_{1:t}). The Rao-Blackwellized PF (RBPF) exploits this property by factorizing:

    p(z_t, x_{1:t} | y_{1:t}) = p(z_t | x_{1:t}, y_{1:t}) p(x_{1:t} | y_{1:t}) ≈ Σ_{i=1}^N w_t^(i) N(z_t | ẑ_t(x_{1:t}^(i)), Σ_t(x_{1:t}^(i))) δ_{x_{1:t}^(i)}(x_{1:t}),    (5)

where the conditional mean ẑ_t(x_{1:t}) := E[z_t | x_{1:t}, y_{1:t}] and covariance matrix Σ_t(x_{1:t}) := V(z_t | x_{1:t}, y_{1:t}) can be computed (for a fixed trajectory x_{1:t}) using a Kalman filter. The mixture approximation follows by plugging in a particle approximation of p(x_{1:t} | y_{1:t}) computed using a standard PF. Hence, for a conditionally linear Gaussian model, the RBPF takes the form of a Mixture Kalman filter; see Chen and Liu (2000). Analogously to a standard PF, the SKH procedure allows us to compute an empirical point-mass approximation of p(x_{1:t} | y_{1:t}) by keeping track of the complete history of the state x_{1:t}. Consequently, by (5), it is straightforward to employ Rao-Blackwellization also for SKH; we use this approach in the numerical example in Section 4.3.
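Equation (5) only requires running, for each retained particle trajectory, a Kalman filter for the conditionally Gaussian component. The sketch below shows one such conditional Kalman recursion for a generic conditionally linear Gaussian structure (an illustrative model structure of our own choosing, not the 22-dimensional UAV model of Section 4.3); each particle i carries its own moments (ẑ_t(x_{1:t}^(i)), Σ_t(x_{1:t}^(i))), which together give the mixture in (5).

```python
import numpy as np

def rb_kalman_step(m, P, A, Q, C, R, y):
    """One conditional Kalman measurement update and prediction for a single particle
    trajectory, assuming (our illustrative structure):
        z_{t+1} = A(x_t) z_t + w_t,  w_t ~ N(0, Q(x_t)),
        y_t     = C(x_t) z_t + e_t,  e_t ~ N(0, R(x_t)),
    where (m, P) are the mean and covariance of z_t given x_{1:t} and y_{1:t-1}."""
    # measurement update with y_t
    S = C @ P @ C.T + R                      # innovation covariance
    K = P @ C.T @ np.linalg.inv(S)           # Kalman gain
    m_upd = m + K @ (y - C @ m)              # filtered mean  z_hat_t(x_{1:t})
    P_upd = P - K @ S @ K.T                  # filtered covariance  Sigma_t(x_{1:t})
    # time update (prediction to t+1)
    m_pred = A @ m_upd
    P_pred = A @ P_upd @ A.T + Q
    return m_upd, P_upd, m_pred, P_pred
```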
We thus have: M (cid:104)g −µ ,Φ(x(k+1))(cid:105)= min (cid:104)g −µ ,Φ(x)(cid:105)= min (cid:104)g −µ(p )+µ(p )−µ(p),Φ(x)(cid:105) k p k p k M M x∈XM x∈XM ≤ min (cid:104)g −µ(p ),Φ(x)(cid:105)+δ R , k M M M x∈XM where R := max (cid:107)Φ(x)(cid:107) (R ≤ R). We can thus interpret step 3 as approximately solving (within M x∈XM M δ R ) the linear subproblem for the Frank-Wolfe optimization of J (g):= 1(cid:107)g−µ(p )(cid:107)2 over the marginal M M M 2 M H polytope of X . We thus get a rate of convergence to within δ R of min J (g)=0. Finally, we have M M M g M (cid:112) (cid:112) (cid:107)g −µ(p)(cid:107) ≤(cid:107)g −µ(p )(cid:107) +δ = 2J (g )+δ ≤ 2((cid:15) +R δ )+δ N H N M H M M N M N M M M
