Sequential Kernel Herding:
Frank-Wolfe Optimization for Particle Filtering

Simon Lacoste-Julien (INRIA - Sierra Project-Team, École Normale Supérieure, Paris, France)
Fredrik Lindsten (Department of Engineering, University of Cambridge)
Francis Bach (INRIA - Sierra Project-Team, École Normale Supérieure, Paris, France)

Abstract

Recently, the Frank-Wolfe optimization algorithm was suggested as a procedure to obtain adaptive quadrature rules for integrals of functions in a reproducing kernel Hilbert space (RKHS) with a potentially faster rate of convergence than Monte Carlo integration (and "kernel herding" was shown to be a special case of this procedure). In this paper, we propose to replace the random sampling step in a particle filter by Frank-Wolfe optimization. By optimizing the position of the particles, we can obtain better accuracy than random or quasi-Monte Carlo sampling. In applications where the evaluation of the emission probabilities is expensive (such as in robot localization), the additional computational cost to generate the particles through optimization can be justified. Experiments on standard synthetic examples as well as on a robot localization task indicate an improvement of accuracy over random and quasi-Monte Carlo sampling.

1 Introduction

In this paper, we explore a way to combine ideas from optimization with sampling to get better approximations in probabilistic models. We consider state-space models (SSMs, also referred to as general state-space hidden Markov models), as they constitute an important class of models in engineering, econometrics and other areas involving time series and dynamical systems. A discrete-time, nonlinear SSM can be written as

$$ x_t \mid x_{1:(t-1)} \sim p(x_t \mid x_{t-1}); \qquad y_t \mid x_{1:t} \sim p(y_t \mid x_t), \qquad (1) $$

where x_t ∈ X denotes the latent state variable and y_t ∈ Y the observation at time t. Exact state inference in SSMs is possible, essentially, only when the model is linear and Gaussian or when the state-space X is a finite set. For solving the inference problem beyond these restricted model classes, sequential Monte Carlo methods, i.e. particle filters (PFs), have emerged as a key tool; see e.g. Doucet and Johansen (2011); Cappé et al. (2005); Doucet et al. (2000). However, since these methods are based on Monte Carlo integration, they are inherently affected by sampling variance, which can degrade the performance of the estimators.

Particular challenges arise when the observation likelihood p(y_t | x_t) is computationally expensive to evaluate. For instance, this is common in robotics applications where the observation model relates the sensory input of the robot, which can comprise vision-based systems, laser rangefinders, synthetic aperture radars, etc. For such systems, simply evaluating the observation function for a fixed value of x_t can therefore involve computationally expensive operations, such as image processing, point-set registration, and related tasks. This poses difficulties for particle-filtering-based solutions for two reasons: (1) the computational bottleneck arising from the likelihood evaluation implies that we cannot simply increase the number of particles to improve the accuracy, and (2) this type of "complicated" observation model will typically not allow for adaptation of the proposal distribution used within the filter, in the spirit of Pitt and Shephard (1999), leaving us with the standard—but inefficient—bootstrap proposal as the only viable option. On the contrary, for these systems, the dynamical model p(x_t | x_{t-1}) is often comparatively simple, e.g. a linear and Gaussian "nearly constant acceleration" model (Ristic et al., 2004).

The method developed in this paper is geared toward this class of filtering problems. The basic idea is that, in scenarios where the likelihood evaluation is the computational bottleneck, we can afford to spend additional computations to improve upon the sampling of the particles.

Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.
By doing so, we can avoid excessive variance arising from simple Monte Carlo sampling from the bootstrap proposal.

Contributions. We build on the optimization view from Bach et al. (2012) of kernel herding (Chen et al., 2010) to approximate the integrals appearing in the Bayesian filtering recursions. We make use of the Frank-Wolfe (FW) quadrature to approximate, in particular, mixtures of Gaussians, which often arise in a particle filtering context as the mixture over past particles in the distribution over the next state. We use this approach within a filtering framework and prove theoretical convergence results for the resulting method, denoted as sequential kernel herding (SKH), giving one of the first explicit convergence rates that improve on those of a standard particle filter. Our preliminary experiments show that SKH can give better accuracy than a standard particle filter or a quasi-Monte Carlo particle filter.

2 Adaptive quadrature rules with Frank-Wolfe optimization

2.1 Approximating the mean element for integration in a RKHS

We consider the problem of approximating integrals of functions belonging to a reproducing kernel Hilbert space (RKHS) H with respect to a fixed distribution p over some set X. We can think of the elements of H as being real-valued functions on X, with pointwise evaluation given from the reproducing property by f(x) = ⟨f, Φ(x)⟩, where Φ : X → H is the feature map from the state-space X to the RKHS. Let κ : X × X → R be the associated positive definite kernel. We briefly review here the setup from Bach et al. (2012), which generalized the one from Chen et al. (2010). We want to approximate integrals E_p[f] for f ∈ H using a set of n points x^(1), ..., x^(n) ∈ X associated with positive weights w^(1), ..., w^(n) which sum to 1:

$$ E_p[f] \approx \sum_{i=1}^n w^{(i)} f(x^{(i)}) = E_{\hat p}[f], \qquad (2) $$

where p̂ := Σ_{i=1}^n w^(i) δ_{x^(i)} is the associated empirical distribution defined by these points and δ_x(·) is a point mass distribution at x. If the points x^(i) are independent samples from p, then this Monte Carlo estimate (using weights of 1/n) is unbiased with a variance of V_p[f]/n, where V_p[f] is the variance of f with respect to p. By using the fact that f belongs to the RKHS H, we can actually choose a better set of points with lower error. It turns out that the worst-case error of estimators of the form (2) can be analyzed in terms of their approximation distance to the mean element µ(p) := E_p[Φ] ∈ H (Smola et al., 2007; Sriperumbudur et al., 2010). Essentially, by using the Cauchy-Schwarz inequality and the linearity of the expectation operator, we can obtain:

$$ \sup_{\substack{f \in \mathcal{H} \\ \|f\|_{\mathcal{H}} \le 1}} \big| E_p[f] - E_{\hat p}[f] \big| = \|\mu(p) - \mu(\hat p)\|_{\mathcal{H}} =: \mathrm{MMD}(p, \hat p), \qquad (3) $$

and so by bounding MMD(p, p̂), we can bound the error of approximating the expectation for all f ∈ H, with ||f||_H as a proportionality constant. MMD(p, p̂) is thus a central quantity for developing good quadrature rules given by (2). In the context of RKHSs, MMD(p, q) can be called the maximum mean discrepancy (Gretton et al., 2012) between the distributions p and q, and acts as a pseudo-metric on the space of distributions on X. If κ is a characteristic kernel (such as the standard RBF kernel), then MMD is in fact a metric, i.e. MMD(p, q) = 0 implies p = q. We refer the reader to Sriperumbudur et al. (2010) for the regularity conditions needed for the existence of these objects and for more details.
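To make the quantity MMD(p, p̂) in (3) concrete, note that expanding the squared RKHS norm gives MMD²(p, p̂) = ||µ(p)||²_H − 2 Σ_i w^(i) µ_p(x^(i)) + Σ_{i,j} w^(i) w^(j) κ(x^(i), x^(j)). The short NumPy sketch below is ours (it is not part of the original paper); the function names are hypothetical and it assumes that µ_p(x) and ||µ(p)||²_H are available in closed form, which, as discussed in Section 2.3 below, is the case when p is a mixture of Gaussians and κ is a Gaussian kernel.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma2):
    """Pairwise Gaussian kernel kappa(x, y) = exp(-||x - y||^2 / (2 sigma2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / sigma2)

def mmd(points, weights, mean_map, mu_p_sq_norm, sigma2):
    """MMD(p, p_hat) for p_hat = sum_i w_i delta_{x_i} (points: (n, d), weights: (n,)).

    mean_map(X) must return mu_p(x) = E_{X'~p}[kappa(X', x)] for each row x of X,
    and mu_p_sq_norm must equal ||mu(p)||_H^2 = E_{X,X'~p}[kappa(X, X')]; both are
    assumed available in closed form for the chosen pair (p, kappa).
    """
    K = gaussian_kernel(points, points, sigma2)        # kappa(x_i, x_j)
    cross = weights @ mean_map(points)                 # sum_i w_i mu_p(x_i)
    mmd_sq = mu_p_sq_norm - 2.0 * cross + weights @ K @ weights
    return np.sqrt(max(mmd_sq, 0.0))                   # clip tiny negative round-off
```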
2.2 Frank-Wolfe optimization for adaptive quadrature

For getting a good quadrature rule p̂, our goal is thus to minimize ||µ(p̂) − µ(p)||_H. We note that µ(p) lies in the marginal polytope M ⊂ H, defined as the closure of the convex hull of Φ(X). We suppose that Φ(x) is uniformly bounded in the feature space, that is, there is a finite R such that ||Φ(x)||_H ≤ R for all x ∈ X. This means that M is a closed bounded convex subset of H, and we could in theory optimize over it. This insight was used by Bach et al. (2012), who considered using the Frank-Wolfe optimization algorithm to optimize the convex function J(g) := (1/2)||g − µ(p)||²_H over M to obtain adaptive quadrature rules. The Frank-Wolfe algorithm (also called conditional gradient) (Frank and Wolfe, 1956) is a simple first-order iterative constrained optimization algorithm for optimizing smooth functions over closed bounded convex sets like M (see Dunn (1980) for its convergence analysis on general infinite dimensional Banach spaces). At every iteration, the algorithm finds a good feasible search vertex of M by minimizing the linearization of J at the current iterate g_k: ḡ_{k+1} = argmin_{g ∈ M} ⟨J'(g_k), g⟩. The next iterate is then obtained by a suitable convex combination of the search vertex ḡ_{k+1} and the previous iterate g_k: g_{k+1} = (1 − γ_k) g_k + γ_k ḡ_{k+1}, for a suitable step-size γ_k taken from a fixed schedule (e.g. 1/(k+1)) or obtained by line-search. A crucial property of this algorithm is that the iterate g_k is thus a convex combination of the vertices of M visited so far. This provides a sparse expansion for the iterate, and makes the algorithm suitable for high-dimensional (or even infinite-dimensional) optimization; this explains in part the renewed interest in machine learning in the last decade for this old optimization algorithm (see Jaggi (2013) for a recent survey). In our setup where M is the convex hull of Φ(X), the vertices of M are of the form ḡ_{k+1} = Φ(x^(k+1)) for some x^(k+1) ∈ X. Running Frank-Wolfe on M thus yields g_k = Σ_{i=1}^k w_k^(i) Φ(x^(i)) = E_{p̂}[Φ] for some weighted set of points {w_k^(i), x^(i)}_{i=1}^k. The iterate g_k thus corresponds to a quadrature rule p̂ of the form of (2) with g_k = E_{p̂}[Φ], and this is the relationship that was explored in Bach et al. (2012). Running Frank-Wolfe optimization with the step-size γ_k = 1/(k+1) reduces to the kernel herding algorithm proposed by Chen et al. (2010). See also Huszár and Duvenaud (2012) for an alternative approach with negative weights.

Algorithm 1 presents the Frank-Wolfe optimization algorithm to solve min_{g ∈ M} J(g) in the context of getting quadrature rules (we also introduce the shorthand notation µ_p := µ(p)). We note that to evaluate the quality MMD(p̂, p) of this adaptive quadrature rule, we need to be able to evaluate µ_p(x) = ∫_X p(x') κ(x', x) dx' efficiently. This is true only for specific pairs of kernels and distributions, but fortunately this is the case when p is a mixture of Gaussians and κ is a Gaussian kernel. This insight is central to this paper; we explore this case more specifically in Section 2.3. To find the next quadrature point, we also need to (approximately) optimize µ_p(x) over X (step 3 of Algorithm 1, called the FW vertex search). In general, this yields a non-convex optimization problem, and thus cannot be solved with guarantees, even with gradient descent. In our current implementation, we approach step 3 by doing an exhaustive search over M random samples from p, precomputed when FW-Quad is called. We thus follow the idea from the kernel herding paper (Chen et al., 2010) to choose the best N "super-samples" out of a large set of M samples. Thanks to the fact that convergence guarantees for Frank-Wolfe optimization can still be given when using an approximate FW vertex search, we show in Appendix B of the supplementary material that this procedure adds either a O(1/M^{1/4}) term or a O(1/√M) term to the worst-case MMD(p̂, p) error.

In our description of Algorithm 1, a preset number N of particles (iterations) is used. Alternatively, we could use a variable number of iterations with the terminating criterion test ||g_k − µ(p)||_H ≤ ε, which can be explicitly computed during the algorithm and provides the MMD error bound on the returned quadrature rule. Option (2) on line 5 chooses the step-size γ_k by analytic line-search (hereafter referred to as the FW-LS version), while option (1) chooses the kernel herding step-size γ_k = 1/(k+1) (hereafter referred to as the FW version), which always yields uniform weights: w_k^(i) = 1/k for all i ≤ k. A third alternative is to re-optimize J(g) over the convex hull of the previously visited vertices; this is called the fully corrective version (Jaggi, 2013) of the Frank-Wolfe algorithm (hereafter referred to as FCFW). In this case:

$$ (w_{k+1}^{(1)}, \ldots, w_{k+1}^{(k+1)}) = \operatorname*{argmin}_{w \in \Delta_{k+1}} \; w^\top K_{k+1} w - 2\, c_{k+1}^\top w, $$

where ∆_{k+1} is the (k+1)-dimensional probability simplex, K_{k+1} is the kernel matrix on the (k+1) vertices, (K_{k+1})_{ij} = κ(x^(i), x^(j)), and (c_{k+1})_i = µ_p(x^(i)) for i = 1, ..., (k+1). This is a convex quadratic problem over the simplex. A slightly modified version of the FCFW is called the min-norm point algorithm and can be more efficiently optimized using specific purpose active-set algorithms; see Bach (2013, §9.2) for more details. We refer the reader to Bach et al. (2012) for more details on the rate of convergence of Frank-Wolfe quadrature assuming that the FW vertex is found with guarantees. We summarize them as follows: if H is infinite dimensional, then FW-Quad gives the same O(1/√N) rate for the MMD error as standard random sampling, for all FW methods. On the other hand, if a ball of non-zero radius centered at µ_p lies within M, then faster rates than random sampling are possible: FW gives a O(1/N) rate whereas FW-LS and FCFW give exponential convergence rates (though in practice, we often see differences between these methods that are not explained by the theory).

2.3 Example: mixture of Gaussians

We describe here in more detail the Frank-Wolfe quadrature when p is a mixture of Gaussians, p(x) = Σ_{i=1}^K π_i N(x | µ_i, Σ_i) for X = R^d, and κ is the Gaussian kernel κ_σ(x, x') := exp(−||x − x'||² / (2σ²)). In this case,

$$ \mu_p(x) = \sum_{i=1}^K \pi_i \, (\sqrt{2\pi}\,\sigma)^d \, \mathcal{N}(x \mid \mu_i, \Sigma_i + \sigma^2 I_d). $$

We thus need to optimize a difference of mixtures of Gaussian bumps in step 3 of Algorithm 1, a non-convex optimization problem that we approximately solve by exhaustive search over M random samples from p.
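As an illustration only (this is not the authors' implementation, which was in Matlab), here is a minimal NumPy/SciPy sketch of the Frank-Wolfe quadrature just described for the mixture-of-Gaussians case: it uses the closed form of µ_p above, the kernel-herding step size γ_k = 1/(k+1) of option (1), and the approximate FW vertex search by exhaustive search over M samples drawn from p. All function names and the argument layout are our own assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_kernel_vec(X, y, sigma2):
    """kappa_sigma(x, y) = exp(-||x - y||^2 / (2 sigma2)) for each row x of X."""
    return np.exp(-0.5 * ((X - y) ** 2).sum(axis=-1) / sigma2)

def mog_mean_map(X, pis, mus, Sigmas, sigma2):
    """Closed-form mean map mu_p(x) = sum_i pi_i (sqrt(2 pi) sigma)^d N(x | mu_i, Sigma_i + sigma^2 I)."""
    d = X.shape[1]
    out = np.zeros(len(X))
    for pi_i, mu_i, Sig_i in zip(pis, mus, Sigmas):
        out += pi_i * multivariate_normal.pdf(X, mean=mu_i, cov=Sig_i + sigma2 * np.eye(d))
    return (2.0 * np.pi * sigma2) ** (d / 2.0) * out

def fw_quad_mog(pis, mus, Sigmas, N, sigma2, M=10000, rng=None):
    """FW-Quad for a mixture of Gaussians p, with the herding step size 1/(k+1)
    and the FW vertex search done by exhaustive search over M samples from p."""
    rng = np.random.default_rng() if rng is None else rng
    # Precompute the M candidate search points and mu_p at those points.
    comp = rng.choice(len(pis), size=M, p=pis)
    cand = np.stack([rng.multivariate_normal(mus[c], Sigmas[c]) for c in comp])
    mu_p_cand = mog_mean_map(cand, pis, mus, Sigmas, sigma2)

    points = []
    weights = np.zeros(0)
    g_cand = np.zeros(M)    # <g_k, Phi(x_m)> = sum_i w_k^{(i)} kappa(x^{(i)}, x_m) on the candidates
    for k in range(N):
        j = np.argmin(g_cand - mu_p_cand)       # step 3: approximate FW vertex search
        x_new = cand[j]
        gamma = 1.0 / (k + 1.0)                 # step 4, option (1): herding step size
        weights = np.append((1.0 - gamma) * weights, gamma)
        points.append(x_new)
        # step 6, maintained on the candidate set in O(M) per iteration
        g_cand = (1.0 - gamma) * g_cand + gamma * gaussian_kernel_vec(cand, x_new, sigma2)
    return np.stack(points), weights
```

Note that only the values ⟨g_k, Φ(x_m)⟩ on the candidate set are needed and can be updated incrementally, which is what gives the O(NM) scaling of the FW variant mentioned in the runtime discussion of Section 4.3.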
Algorithm 1 FW-Quad(p, H, N): Frank-Wolfe adaptive quadrature

Input: distribution p, RKHS H which defines the kernel κ(·,·) and state-space X, number of samples N.
1: Let g_0 = 0.
2: for k = 0 ... N − 1 do
3:   Solve x^(k+1) = argmin_{x ∈ X} ⟨g_k − µ_p, Φ(x)⟩, that is:
       x^(k+1) = argmin_{x ∈ X} Σ_{i=1}^k w_k^(i) κ(x^(i), x) − µ_p(x).
4:   Option (1): Let γ_k = 1/(k+1).
5:   Option (2): Let γ_k = ⟨g_k − µ_p, g_k − Φ(x^(k+1))⟩ / ||g_k − Φ(x^(k+1))||² (LS).
6:   Update g_{k+1} = (1 − γ_k) g_k + γ_k Φ(x^(k+1)),
       i.e. w_{k+1}^(k+1) = γ_k and w_{k+1}^(i) = (1 − γ_k) w_k^(i) for i = 1 ... k.
7: end for
8: Return: p̂ = Σ_{i=1}^N w_N^(i) δ_{x^(i)}.

Algorithm 2 Particle filter template (joint predictive distribution form) — SKH algorithm obtained by changing step 3

Input: SSM p(x_t | x_{t−1}), o_t(x_t) := p(y_t | x_t) for t ∈ 1:T.
Maintain p̂_t(x_{1:t}) = Σ_{i=1}^N w_t^(i) δ_{x_{1:t}^(i)}(x_{1:t}) during the algorithm as an approximation of p(x_t, x_{1:(t−1)} | y_{1:(t−1)}).
1: Let p̃_1(x_1) := p(x_1).
2: for t = 1, ..., T do
3:   Sample: get p̂_t = SAMPLE(p̃_t, N).  [For SKH, use p̂_t = FW-Quad(p̃_t, H_t, N).]
4:   Include observation and normalize: Ŵ_t = E_{p̂_t}[o_t];  r̂_t(x_{1:t}) := (1/Ŵ_t) o_t(x_t) p̂_t(x_{1:t}).
5:   Propagate approximation forward: p̃_{t+1}(x_{t+1}, x_{1:t}) := p(x_{t+1} | x_t) r̂_t(x_{1:t}).
6: end for
7: Return: filtering distribution r̂_T; predictive distribution p̂_{T+1}; normalization constants Ŵ_1, ..., Ŵ_T.

3 Sequential kernel herding

3.1 Sequential Monte Carlo

Consider again the SSM in (1). The joint probability density function for a sequence of latent states x_{1:T} := (x_1, ..., x_T) and observations y_{1:T} factorizes as p(x_{1:T}, y_{1:T}) = Π_{t=1}^T p(x_t | x_{t−1}) p(y_t | x_t), with p(x_1 | x_0) := p(x_1) denoting the prior density on the initial state. We would like to do approximate inference in this SSM. In particular, we could be interested in computing the joint filtering distribution r_t(x_{1:t}) := p(x_{1:t} | y_{1:t}) or the joint predictive distribution p_{t+1}(x_{t+1}, x_{1:t}) := p(x_{t+1}, x_{1:t} | y_{1:t}). In particle filtering methods, we approximate these distributions with empirical distributions from weighted particle sets {w_t^(i), x_{1:t}^(i)}_{i=1}^N as in (2). We note that it is easy to marginalize p̂ with a simple weight summation, and so we will present the algorithm as producing an approximation of the joint distributions r_t and p_t defined above, with the understanding that the marginal ones are easy to obtain afterwards. In the terminology of particle filtering, x_t^(i) is the particle at time t, whereas x_{1:t}^(i) is the particle trajectory. While in principle the PF provides an approximation of the full joint distribution r_t(x_{1:t}), it is well known that this approximation deteriorates for any marginal of x_s for s ≪ t (Doucet and Johansen, 2011). Hence, the PF is typically only used to approximate marginals of x_s for s ≲ t (fixed-lag smoothing) or s = t (filtering), or for prediction.

Algorithm 2 presents the bootstrap particle filtering algorithm (Gordon et al., 1993) from the point of view of propagating an approximate posterior distribution forward in time (see e.g. Fearnhead, 2005). We describe it as propagating an approximation p̂_t(x_{1:t}) of the joint predictive distribution one time step forward with the model dynamics to obtain p̃_{t+1}(x_{t+1}, x_{1:t}) (step 5), and then randomly sampling from it (step 3) to get the new predictive approximation p̂_{t+1}(x_{t+1}, x_{1:t}). As p̂_t is an empirical distribution, p̃_{t+1} is a mixture distribution (the mixture components coming from the particles at time t):

$$ \tilde p_{t+1}(x_{t+1}, x_{1:t}) = \frac{1}{\hat W_t} \sum_{i=1}^N \underbrace{p(y_t \mid x_t^{(i)})\, w_t^{(i)}}_{\text{mixture weight}} \; \underbrace{p(x_{t+1} \mid x_t^{(i)})}_{\text{mixture component}} \; \delta_{x_{1:t}^{(i)}}(x_{1:t}). \qquad (4) $$

We denote the conditional normalization constant at time t by W_t := p(y_t | y_{1:(t−1)}) and the global normalization constant by Z_t := p(y_{1:t}) = Π_{u=1}^t W_u. Ŵ_t is the particle filter approximation to W_t and is obtained by summing the un-normalized mixture weights in (4); see step 4 in Algorithm 2. Randomly sampling from (4) is equivalent to first sampling a mixture component according to the mixture weights (i.e., choosing a past particle x_{1:t}^(i) to propagate), and then sampling its next extension state x_{t+1}^(i) with probability p(x_{t+1} | x_t^(i)). The standard bootstrap particle filter is thus obtained by maintaining uniform weights for the predictive distribution (w_t^(i) = 1/N) and randomly sampling from (4) to obtain the particles at time t+1. This gives an unbiased estimate of p̃_{t+1}: E_{p̃_{t+1}}[p̂_{t+1}] = p̃_{t+1}. Lower-variance estimators can be obtained by using a different resampling mechanism for the particles than this multinomial sampling scheme, such as stratified resampling (Carpenter et al., 1999), and these are usually used in practice instead.

One way to improve the particle filter is thus to replace the random sampling stage of step 3 with different sampling mechanisms that have lower variance or better approximation properties for the distribution p̃_{t+1} that we are trying to approximate. As we obtain the normalization constants W_t by integrating the observation probability, it seems natural to look for particle point sets with better integration properties. By replacing random sampling with a quasi-random number sequence, we obtain the already proposed sequential quasi-Monte Carlo scheme (Philomin et al., 2000; Ormoneit et al., 2001; Gerber and Chopin, 2014).
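For concreteness, the following sketch (ours, not the paper's code) implements the template of Algorithm 2 with the bootstrap choice for step 3, i.e. multinomial sampling from the mixture (4): first pick an ancestor particle in proportion to w_t^(i) o_t(x_t^(i)), then draw the new state from the transition. The model is passed in through three hypothetical callables (prior_sample, transition_sample, obs_lik); the sampling step is kept in one place precisely so that it can be swapped for a different point-set construction.

```python
import numpy as np

def particle_filter(prior_sample, transition_sample, obs_lik, T, N, rng=None):
    """Particle filter in the joint-predictive form of Algorithm 2, with multinomial
    (bootstrap) sampling for step 3.  prior_sample(N) draws N samples from p(x_1),
    transition_sample(X) draws x_{t+1} ~ p(. | x_t) for each row of X, and
    obs_lik(t, X) evaluates o_t(x) = p(y_t | x_t) row-wise; these are assumed given.
    Returns the filtering approximations and the normalization constants W_hat_t."""
    rng = np.random.default_rng() if rng is None else rng
    particles = prior_sample(N)            # approximation of p_tilde_1 = p(x_1)
    weights = np.full(N, 1.0 / N)
    W_hats, filt = [], []
    for t in range(T):
        # step 4: include the observation and normalize
        o = obs_lik(t, particles)                    # o_t(x_t^{(i)})
        W_hat = np.dot(weights, o)                   # estimate of W_t = p(y_t | y_{1:t-1})
        filt_weights = weights * o / W_hat           # weights of r_hat_t
        W_hats.append(W_hat)
        filt.append((particles, filt_weights))
        # steps 5 and 3: propagate with the dynamics and sample from the mixture (4)
        ancestors = rng.choice(N, size=N, p=filt_weights)
        particles = transition_sample(particles[ancestors])
        weights = np.full(N, 1.0 / N)                # bootstrap: uniform predictive weights
    return filt, W_hats
```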
The main contribution of our work is to instead propose to use Frank-Wolfe quadrature in step 3 of the particle filter to obtain better (adapted) point sets.

3.2 Sequential kernel herding

In the sequential kernel herding (SKH) algorithm, we simply replace step 3 of Algorithm 2 with p̂_t = FW-Quad(p̃_t, H_t, N). As mentioned in the introduction, many dynamical models used in practice assume Gaussian transitions. Therefore, we will put particular emphasis on the (more general) case when p(x_t | x_{1:(t−1)}, y_{1:(t−1)}) is a mixture of Gaussians, with parameters for the mixture components that can be arbitrary functions of the state history x_{1:(t−1)}, y_{1:(t−1)}; this is thus still fairly general. We thus consider the Gaussian kernel for the FW-Quad procedure, as then we can compute the required quantities analytically.

An important subtle point is which Hilbert space H_t to consider. In this paper, we focus on the marginalized filtering case, i.e. we are interested in p(x_t | y_{1:t}) only. Thus we are only interested in functions of x_t, which is why we define our kernel at time t to only depend on x_t and not on the past histories. For simplicity, we also assume that H_t = H for all t (we use the same kernel for each time step). Even though the algorithm can maintain the distribution on the whole history p̂_t(x_{1:t}), the past histories x_{1:(t−1)} are marginalized out when computing the mean map, for example µ(p̃_t) = E_{p̃_t(x_{1:t})}[Φ(x_t)]. During the SKH algorithm, we can still track the particle histories by keeping track of which mixture component in (4) x_t was coming from, but the past history is not used in the computation of the kernel and thus does not appear as a repulsion term in step 3 of Algorithm 1. We leave it as future work to analyze what kind of high-dimensional kernel on past histories would make sense in this context, and to analyze its convergence properties. The particle histories are useful in the Rao-Blackwellized extension that we present in Appendix A and use in the robot localization experiment of Section 4.3.
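As a small illustration of the Gaussian-transition case emphasized above, suppose the transition is linear Gaussian, x_{t+1} | x_t ~ N(A x_t, Q) (our example; the paper only assumes a mixture-of-Gaussians form for the predictive). Then the marginalized predictive p̃_{t+1}(x_{t+1}) in (4) is exactly a mixture of Gaussians with one component per current particle, and its parameters can be assembled as in the sketch below (names and layout are ours) and handed to a Frank-Wolfe quadrature routine, such as the fw_quad_mog sketch after Section 2.3, to realize step 3 of SKH.

```python
import numpy as np

def skh_predictive_mixture(particles, weights, obs_lik_vals, A, Q):
    """Parameters (pis, mus, Sigmas) of the marginalized predictive mixture
    p_tilde_{t+1}(x_{t+1}) when the transition is linear Gaussian,
    x_{t+1} | x_t ~ N(A x_t, Q): one component per current particle, with
    mixture weight proportional to w_t^{(i)} p(y_t | x_t^{(i)})."""
    pis = weights * obs_lik_vals
    pis = pis / pis.sum()                       # normalization by W_hat_t
    mus = particles @ A.T                       # component means A x_t^{(i)}
    Sigmas = np.repeat(Q[None, :, :], len(particles), axis=0)
    return pis, mus, Sigmas
```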
3.3 Convergence theory

In this section, we give sufficient conditions to guarantee that SKH is consistent as N goes to infinity. Let p_t here denote the marginalized predictive distribution instead of the joint one. Let F_t be the forward transformation operator on signed measures that takes the predictive distribution p_t on x_t and yields the un-normalized marginalized predictive distribution F_t p_t on x_{t+1} in the SSM. Thus for a measure ν, we get (F_t ν)(·) := ∫_{X_t} p(· | x_t) p(y_t | x_t) dν(x_t). We also have that p_{t+1} = (1/W_t) F_t p_t.

For the following theorem, F_t is a function space on X_{t+1} defined (depending on H_{t+1}) as all functions for which the following semi-norm is finite:¹

$$ \|f\|_{\mathcal{F}_t} := \sup_{\|h\|_{\mathcal{H}_{t+1}} = 1} \left| \int_{\mathcal{X}_{t+1}} f(x_{t+1})\, h(x_{t+1})\, dx_{t+1} \right|. $$

Theorem 1 (Bounded growth of the mean map). Suppose that the function f_t : (x_{t+1}, x_t) ↦ p(y_t | x_t) p(x_{t+1} | x_t) is in the tensor product function space F_t ⊗ H_t with the following defined nuclear norm: ||f_t||_{F_t ⊗ H_t} := inf Σ_i ||α_i||_{F_t} ||β_i||_{H_t}, where the infimum is taken over all possible expansions such that f_t(x_{t+1}, x_t) = Σ_i α_i(x_{t+1}) β_i(x_t) for all x_t, x_{t+1}. Then for any finite signed Borel measure ν on X_t, we have:

$$ \|\mu(F_t \nu)\|_{\mathcal{H}_{t+1}} \le \|f_t\|_{\mathcal{F}_t \otimes \mathcal{H}_t} \, \|\mu(\nu)\|_{\mathcal{H}_t}. $$

Theorem 2 (Consistency of SKH). Suppose that for all 1 ≤ t ≤ T, f_t is in F_t ⊗ H_t as defined in Theorem 1 and o_t is in H_t. Then we have:²

$$ \|\mu(\hat p_T) - \mu(p_T)\|_{\mathcal{H}_T} \le \hat\epsilon_T + \left( \frac{R \|o_{T-1}\|_{\mathcal{H}_{T-1}}}{W_{T-1}} + \rho_{T-1} \right) \sum_{t=1}^{T-1} \chi_t \, \hat\epsilon_t \left( \prod_{k=t}^{T-2} \rho_k \right), $$

where ρ_t := ||f_t||_{F_t ⊗ H_t} / W_t, χ_t := Π_{k=1}^{t−1} (W_k / Ŵ_k), and ε̂_t is the FW error reported at time t by the algorithm: ε̂_t := ||µ(p̂_t) − µ(p̃_t)||_{H_t}.

¹ In general, the integral on X_{t+1} should be with respect to the base measure for which the conditional density p(x_{t+1} | x_t) is defined. All proofs are in the supplementary material.
² We use the convention that the empty sum is 0 and the empty product is 1.

We note that χ_t ≈ 1, as we expect the errors on W_k to go in either direction and thus to cancel each other over time (though in the worst case χ_t could grow exponentially in t). If ε̂_t ≤ ε and ρ_t ≤ ρ, we basically have ||µ(p̂_T) − µ(p_T)|| = O(ρ^T ε) if ρ > 1, O(Tε) if ρ = 1, and O(ε) if ρ < 1 (a contraction). The exponential dependence in T is similar to that of a standard particle filter for general distributions; see Douc et al. (2014), though, for conditions yielding a contraction for the PF.

Importantly, for a fixed T it follows that the rates of convergence for Frank-Wolfe in N translate to rates of errors for integrals of functions in H with respect to the predictive distribution p_T. Thus if we suppose that H is finite dimensional, that p_t has full support on X for all t, and that the kernel κ is continuous, then by Proposition 1 in Bach et al. (2012), the faster rates for Frank-Wolfe hold, and in particular we could obtain an error bound of O(1/N) with N particles. As far as we know, this is the first explicit rate of convergence, as a function of the number of particles, that is faster than the standard O(1/√N) of Monte Carlo particle filters. In contrast, Gerber and Chopin (2014, Theorem 7) showed a o(1/√N) rate for the randomized version of their SQMC algorithm (note the little-o).³

Note that the theorem does not depend on how the error ε̂_t on the mean maps of the distributions is obtained; and so if one could show that a QMC point set could also achieve a faster rate for the error on the mean maps (rather than on the distributions themselves, as is usually given), then their rates would also translate to the global rate by Theorem 2.⁴

³ The rate holds on the approximation of integrals of continuous bounded functions.
⁴ We also note that a simple computation shows that for a Monte Carlo sample of size N, E||µ(p̂) − µ(p)||²_H ≤ (R² − ||µ(p)||²)/N.

4 Experiments

4.1 Sampling from a mixture of Gaussians

We start by investigating the merits of different sampling schemes for approximating mixtures of Gaussians, since this is an intrinsic step of the SKH algorithm. In Figure 1, we give the MMD error as well as the error on the mean function in terms of the number of particles N for the different sampling schemes, on a randomly chosen mixture of Gaussians with K = 100 components in d = 2 dimensions. Additional results as well as the details of the model are given in Appendix C.1 of the supplementary material. In our experiments, the number of FW search points is M = 50,000. We note that even though in theory all methods should have the same rate of convergence O(1/√N) for the MMD (as H is infinite dimensional), FCFW empirically improves significantly over the other methods. As d increases, the difference between the methods tapers off for a fixed kernel bandwidth σ², but increasing σ² gives better results for FW and FCFW than for the other schemes.

[Figure 1 (two panels, both with d = 2, K = 100, σ² = 1; x-axis: number of particles; methods: MC, QMC, FW, FCFW). Top: MMD error for the different sampling schemes where p is a mixture of 2-d Gaussians with K = 100 components. Bottom: error on the mean estimate for the same mixture. The dashed lines are linear fits with slopes reported next to the axes.]

In the remaining sections, we evaluate empirically the application of kernel herding in a filtering context using the proposed SKH algorithm.

4.2 Particle filtering using SKH on synthetic examples

We first consider several synthetic data sets in order to assess the improvements offered by Frank-Wolfe quadrature over standard Monte Carlo and quasi-Monte Carlo techniques. We generate data from four different systems (further details on the experimental setup can be found in Appendix C.2):

- Two linear Gaussian state-space (LGSS) models of dimensions d = 3 and d = 15, respectively.

- A jump Markov linear system (JMLS), consisting of 2 interacting LGSS models of dimension d = 2. The switching between the models is governed by a hidden 2-state Markov chain.

- A nonlinear benchmark time-series model used by, among others, Doucet et al. (2000); Gordon et al. (1993). The model is of dimension d = 1 and is given by:

  $$ x_{t+1} = 0.5\, x_t + 25\, \frac{x_t}{1 + x_t^2} + 8 \cos(1.2 t) + v_t, \qquad y_t = 0.05\, x_t^2 + e_t, $$

  with v_t and e_t mutually independent standard Gaussian (a small simulation sketch is given below).
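For convenience, here is a small simulation sketch of this benchmark model (ours; the initial state and the exact data-generation settings used in the paper are specified in Appendix C.2, so the default x1 = 0 below is only an assumption):

```python
import numpy as np

def simulate_benchmark(T, x1=0.0, rng=None):
    """Simulate the d = 1 nonlinear benchmark model above:
    x_{t+1} = 0.5 x_t + 25 x_t / (1 + x_t^2) + 8 cos(1.2 t) + v_t,
    y_t     = 0.05 x_t^2 + e_t,   with v_t, e_t iid standard Gaussian."""
    rng = np.random.default_rng() if rng is None else rng
    xs, ys = np.empty(T), np.empty(T)
    x = x1
    for t in range(1, T + 1):
        xs[t - 1] = x
        ys[t - 1] = 0.05 * x ** 2 + rng.standard_normal()
        x = 0.5 * x + 25.0 * x / (1.0 + x ** 2) + 8.0 * np.cos(1.2 * t) + rng.standard_normal()
    return xs, ys
```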
These models are ordered in increasing levels of difficulty for inference. For the LGSS models, the exact filtering distributions can be computed by a Kalman filter. For the JMLS, this is also possible by running a mixture of Kalman filters, albeit at a computational cost of 2^T (where T is the total number of time steps). For the nonlinear system, no closed-form expressions are available for the filtering densities; instead we run a PF with N = 100,000 particles as a reference.

We generate 30 batches of observations for T = 100 time steps from all systems, except for the JMLS where we use T = 10 (to allow exact filtering). We run the proposed SKH filter, using both FW and FCFW optimization, and compare against a bootstrap PF (using stratified resampling (Carpenter et al., 1999)) and a quasi-Monte Carlo PF based on a Sobol-sequence point set. All methods are run with N varying from 20 to 200 particles. We deliberately use rather few particles since, as discussed above, we believe that this is the setting in which the proposed method can be particularly useful.

To assess the performance of the different methods, we first compute the root-mean-squared errors (RMSE) for the filtered mean-state estimates over the T time steps, w.r.t. the reference filters. We report the median RMSEs over the 30 different data batches, along with the 25% and 75% quantiles and the minimum and maximum values, in Figure 2. The SKH algorithms were run for three different values of σ² ∈ {0.01, 0.1, 1}. Here, we report the results for σ² = 1 for the LGSS models and the JMLS, and for σ² = 0.1 for the nonlinear benchmark model. The results for the other values are given in Appendix C.2. The improvements are somewhat robust to the value of σ², but in some cases significant differences were observed. As can be seen, both SKH methods improve significantly upon both QMC and the bootstrap PF. For the two LGSS models, we also compute the MMD (reported in the rightmost column of Figure 2).

4.3 Vision-based UAV Localization

In this section, we apply the proposed SKH algorithm to solve a filtering problem in field robotics. We use the data and the experimental setup described by Törnqvist et al. (2009). The problem consists of estimating the full six-dimensional pose of an unmanned aerial vehicle (UAV).

Törnqvist et al. (2009) proposed a vision-based solution, essentially tracking interest points in the camera images over consecutive frames to estimate the ego-motion. This information is then fused with the inertial and barometer sensors to estimate the pose of the UAV. The system is modelled in state-space form, with a state vector comprising the position, velocity, acceleration, as well as the orientation and the angular velocity of the UAV. The state is also augmented with sensor biases, resulting in a state dimension of 22. Furthermore, the state is augmented with the three-dimensional positions of the interest points that are currently tracked by the vision system; this is a varying number, but typically around ten.

To deal with the high-dimensional state vector, Törnqvist et al. (2009) used a Rao-Blackwellized PF (see Appendix A) to solve the filtering problem, marginalizing all but 6 state components (being the pose, i.e., the position and orientation) using a combination of Kalman filters and extended Kalman filters. The remaining 6 state variables were tracked using a bootstrap particle filter with N = 200 particles; the strikingly small number of particles owing to the computational complexity of the likelihood evaluation.

For the current experiment, we obtained the code and the flight-test data from Törnqvist et al. (2009). The modularity of our approach allowed us to simply replace the Monte Carlo simulation step within their setup with FW-Quad. We ran SKH-FW with σ² = 10 and SKH-FCFW with σ² = 0.1, as well as the bootstrap PF used in Törnqvist et al. (2009) and a QMC-PF, all methods using N = 50, 100, and 200 particles. We ran all methods 10 times on the same data; the variation in SKH comes from the random search points for the FW procedure, and in QMC from starting the Sobol sequence at different points. For comparison, we ran 10 times a reference PF with N = 100,000 particles and averaged the results. The median position errors for 100 seconds of robot time (there are 20 SSM time steps per second of robot time) are given in Figure 3. The UAV is assumed to start at a known location at time zero; hence, all the errors are zero initially. Note that all methods accumulate errors over time. This is natural, since there is no absolute position reference available (i.e., the filter is unstable) and the objective is basically to keep the error as small as possible for as long a time as possible. SKH-FW gives the overall best results here, with significant improvements over the bootstrap PF and the QMC methods for small numbers of particles. SKH-FW even gives similar errors for the last time step with only N = 200 particles as one of the reference PFs (using N = 100,000 particles). See Appendix C.2.1 for a discussion of the role of σ² for FCFW.

Runtimes. In these experiments, we focused on investigating how optimization could improve the error per particle, as the gain in runtime depends on the exact implementation as well as the likelihood evaluation cost. We note that the FW-Quad algorithm scales as O(NM) for N samples and M search points when using FW, by updating the objective on the M search points in an online fashion (we also empirically observed this linear scaling in N). On the other hand, FCFW scales as O(N²M), as the weights on the particles possibly change at each iteration, preventing the same online trick.
[Figure 2 (panels: RMSE, LGSS, d = 3; RMSE, JMLS; MMD RMS (σ² = 1), LGSS, d = 3; RMSE, LGSS, d = 15; RMSE, Nonlinear benchmark; MMD RMS (σ² = 1), LGSS, d = 15; x-axis: number of particles; methods: PF, QMC, FW, FCFW with the indicated σ².)]
Figure 2: RMSEs (left and middle columns) for the four considered models and MMDs (right column) for the two LGSS models.

[Figure 3 (panels: N = 50; N = 100; UAV last-time-step error; x-axes: robot time (s) and number of particles; methods: PF, QMC, FW (σ² = 10), FCFW (σ² = 0.1), PF N = 100k.)]
Figure 3: Median of position errors over 10 runs for each method. The errors are computed relative to the mean prediction over 10 runs of a PF with 100k particles (the variation of the reference PF is also shown for PF 100k). The error bars represent the [25%, 75%] quantile. The rightmost plot shows the error at the last time step as a function of N. 100 s of robot time represents 2,000 SSM time steps; it does not correspond to computation time.
SKH scales linearly with the number of time steps T (as a standard PF). For the UAV application, the original Matlab code from Törnqvist et al. (2009) spent an average of 0.2 s per time step for N = 50 particles (linear in the number of particles, as the likelihood evaluation is the bottleneck) on a XEON E5-2620 2.10 GHz PC. The overhead of using our Matlab implementation of FW-Quad with N = 50 is about 0.15 s per time step for FW and 0.3 s for FCFW; and 0.3 s for FW and 1.0 s for FCFW for N = 100 (we used M = 10,000 search points in this experiment). In practice, this means that SKH-FW can be run here with 50 particles in the same time as the standard PF is run with about 90 particles. But as Figure 3 shows, the error for SKH-FW with 50 particles is still much lower than that of the PF with 200 particles.

5 Conclusion

We have developed a method for Bayesian filtering problems using a combination of optimization and particle filtering. The method has been demonstrated to provide improved performance over both random sampling and quasi-Monte Carlo methods. The proposed method is modular and can be used with different types of particle filtering techniques, such as the Rao-Blackwellized particle filter. Further investigating this possibility for other classes of particle filters is a topic for future work. Future work also includes a deeper analysis of the convergence theory for the method, in order to develop practical guidelines for the choice of the kernel bandwidth.
Acknowledgements

We thank Eric Moulines for useful discussions. This work was partially supported by the MSR-Inria Joint Centre, a grant by the European Research Council (SIERRA project 239993) and by the Swedish Research Council (project Learning of complex dynamical systems, number 637-2014-466).

References

F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.

F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1359–1366, 2012.

O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models. Springer, 2005.

J. Carpenter, P. Clifford, and P. Fearnhead. Improved particle filter for nonlinear problems. IEE Proceedings Radar, Sonar and Navigation, 146(1):2–7, 1999.

Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2010.

R. Douc, E. Moulines, and J. Olsson. Long-term stability of sequential Monte Carlo methods under verifiable conditions. Annals of Applied Probability, 24(5):1767–1802, 2014.

A. Doucet and A. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan and B. Rozovsky, editors, The Oxford Handbook of Nonlinear Filtering. Oxford University Press, 2011.

A. Doucet, S. J. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.

J. C. Dunn. Convergence rates for conditional gradient sequences generated by implicit step length rules. SIAM Journal on Control and Optimization, 18:473–487, 1980.

P. Fearnhead. Using random quasi-Monte-Carlo within particle filters, with application to financial time series. Journal of Computational and Graphical Statistics, 14(4):751–769, 2005.

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.

M. Gerber and N. Chopin. Sequential quasi-Monte Carlo. arXiv preprint arXiv:1402.4039v5, 2014.

N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Processing, IEE Proceedings F, 140(2):107–113, Apr. 1993.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13:723–773, 2012.

F. Huszár and D. Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI), pages 377–385, 2012.

M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

D. Ormoneit, C. Lemieux, and D. J. Fleet. Lattice particle filters. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI), pages 395–402, 2001.

V. Philomin, R. Duraiswami, and L. Davis. Quasi-random sampling for condensation. In Proceedings of the 6th European Conference on Computer Vision (ECCV), 2000.

M. K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999.

B. Ristic, S. Arulampalam, and N. Gordon. Beyond the Kalman filter: particle filters for tracking applications. Artech House, London, UK, 2004.

A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pages 13–31. Springer, 2007.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 99:1517–1561, 2010.

D. Törnqvist, T. B. Schön, R. Karlsson, and F. Gustafsson. Particle filter SLAM with high dimensional vehicle model. Journal of Intelligent and Robotic Systems, 55(4):249–266, 2009.
Supplementary material
A Extension for Rao-Blackwellization
A common strategy for improving the efficiency of the PF is to make use of Rao-Blackwellization—this idea can be used also with SKH. Rao-Blackwellization, here, refers to analytically marginalizing some conditionally tractable component of the state vector and thereby reducing the dimensionality of the space on which the PF operates. Assume that the state of the system is comprised of two components x_t and z_t, where the filtering density for z_t is tractable conditionally on the history of x_{1:t}. The typical case is that of a conditionally linear Gaussian system, in which case the aforementioned conditional filtering density p(z_t | x_{1:t}, y_{1:t}) is Gaussian and computable using a Kalman filter (conditionally on x_{1:t}). The Rao-Blackwellized PF (RBPF) exploits this property by factorizing:

$$ p(z_t, x_{1:t} \mid y_{1:t}) = p(z_t \mid x_{1:t}, y_{1:t})\, p(x_{1:t} \mid y_{1:t}) \approx \sum_{i=1}^N w_t^{(i)}\, \mathcal{N}\big(z_t \mid \hat z_t(x_{1:t}^{(i)}), \Sigma_t(x_{1:t}^{(i)})\big)\, \delta_{x_{1:t}^{(i)}}(x_{1:t}), \qquad (5) $$

where the conditional mean ẑ_t(x_{1:t}) := E[z_t | x_{1:t}, y_{1:t}] and covariance matrix Σ_t(x_{1:t}) := V(z_t | x_{1:t}, y_{1:t}) can be computed (for a fixed trajectory x_{1:t}) using a Kalman filter. The mixture approximation follows by plugging in a particle approximation of p(x_{1:t} | y_{1:t}) computed using a standard PF. Hence, for a conditionally linear Gaussian model, the RBPF takes the form of a Mixture Kalman filter; see Chen and Liu (2000). Analogously to a standard PF, the SKH procedure allows us to compute an empirical point-mass approximation of p(x_{1:t} | y_{1:t}) by keeping track of the complete history of the state x_{1:t}. Consequently, by (5) it is straightforward to employ Rao-Blackwellization also for SKH; we use this approach in the numerical example in Section 4.3.
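To make the conditional Kalman computations in (5) concrete, here is a generic per-particle recursion sketch (ours) for an assumed conditionally linear Gaussian model z_{t+1} | z_t ~ N(A z_t, Q), y_t | z_t ~ N(C z_t, R), where the matrices may depend on the sampled history x_{1:t}; the actual model used in the UAV experiment is the one of Törnqvist et al. (2009) and is considerably more involved.

```python
import numpy as np

def rbpf_kalman_step(z_mean, z_cov, y, A, Q, C, R):
    """One conditional Kalman recursion for the tractable substate z_t in (5).

    For a particle whose sampled trajectory x_{1:t} fixes A, Q, C, R, this performs
    the measurement update giving N(z_t | z_hat_t, Sigma_t), followed by the time
    update (prediction) to the next step.  Assumed model: z_{t+1}|z_t ~ N(A z_t, Q),
    y_t|z_t ~ N(C z_t, R).
    """
    # measurement update with the current observation y
    S = C @ z_cov @ C.T + R                      # innovation covariance
    K = np.linalg.solve(S.T, C @ z_cov.T).T      # Kalman gain  z_cov C^T S^{-1}
    z_filt_mean = z_mean + K @ (y - C @ z_mean)
    z_filt_cov = z_cov - K @ S @ K.T
    # time update (prediction) for the next time step
    z_pred_mean = A @ z_filt_mean
    z_pred_cov = A @ z_filt_cov @ A.T + Q
    return z_filt_mean, z_filt_cov, z_pred_mean, z_pred_cov
```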
B Rates for SKH when using random search points
In this section, we show that we can get guarantees on the MMD error of the FW-Quad procedure when
approximately finding the FW vertex in step 3 of Algorithm 1 using exhaustive search through M random
samples from p. This means that despite not solving step 3 exactly, the SKH procedure with M random search
points (under assumptions of Theorem 2) is still consistent as long as M grows to infinity.
The main idea is that the rates of convergence for the Frank-Wolfe optimization procedure still hold when the linear subproblem (step 3) is solved within accuracy δ. More specifically, if we guarantee that the FW vertex ḡ_{k+1} that we use satisfies ⟨J'(g_k), ḡ_{k+1}⟩ ≤ min_{g ∈ M} ⟨J'(g_k), g⟩ + δ during the algorithm, then the standard O(1/k) rate of convergence for FW carries through, but only to within δ of the optimal objective (i.e. up to J(g*) + δ). A simple modification of the argument by Jaggi (2013) (who used a shrinking δ during the FW algorithm) can show this for the step-size γ_k = 2/(k+2); we give the proofs for the step-size γ_k = 1/(k+1), as well as the potential faster rate O(1/k²) for the MMD objective, in Appendix G.

Let X_M ⊆ X be the set of M search points, and p_M be the empirical distribution of the M samples from p. Let δ_M := ||µ(p_M) − µ(p)||_H, which can be made small by increasing M. Consider iteration k in FW-Quad where we do exhaustive search on X_M in step 3. We thus have:

$$ \langle g_k - \mu_p, \Phi(x^{(k+1)}) \rangle = \min_{x \in \mathcal{X}_M} \langle g_k - \mu_p, \Phi(x) \rangle = \min_{x \in \mathcal{X}_M} \langle g_k - \mu(p_M) + \mu(p_M) - \mu(p), \Phi(x) \rangle \le \min_{x \in \mathcal{X}_M} \langle g_k - \mu(p_M), \Phi(x) \rangle + \delta_M R_M, $$

where R_M := max_{x ∈ X_M} ||Φ(x)|| (so R_M ≤ R). We can thus interpret step 3 as approximately solving (within δ_M R_M) the linear subproblem for the Frank-Wolfe optimization of J_M(g) := (1/2)||g − µ(p_M)||²_H over the marginal polytope of X_M. We thus get a rate of convergence to within δ_M R_M of min_g J_M(g) = 0. Finally, we have

$$ \|g_N - \mu(p)\|_{\mathcal{H}} \le \|g_N - \mu(p_M)\|_{\mathcal{H}} + \delta_M = \sqrt{2 J_M(g_N)} + \delta_M \le \sqrt{2(\epsilon_N + R_M \delta_M)} + \delta_M $$