A non-parametric ensemble transform method for Bayesian inference*

Sebastian Reich

January 1, 2013

*Universität Potsdam, Institut für Mathematik, Am Neuen Palais 10, D-14469 Potsdam, Germany

Abstract

Many applications, such as intermittent data assimilation, lead to a recursive application of Bayesian inference within a Monte Carlo context. Popular data assimilation algorithms include sequential Monte Carlo methods and ensemble Kalman filters (EnKFs). These methods differ in the way Bayesian inference is implemented. Sequential Monte Carlo methods rely on importance sampling combined with a resampling step, while EnKFs utilize a linear transformation of Monte Carlo samples based on the classic Kalman filter. While EnKFs have proven to be quite robust even for small ensemble sizes, they are not consistent since their derivation relies on a linear regression ansatz. In this paper, we propose another transform method, which does not rely on any a priori assumptions on the underlying prior and posterior distributions. The new method is based on solving an optimal transportation problem for discrete random variables.

1 Introduction

This paper is concerned with a particular implementation of Monte Carlo methods for Bayesian inference and its application to filtering and intermittent data assimilation (Jazwinski, 1970). More specifically, we consider the problem of estimating posterior expectation values under the assumption that a finite-size ensemble $\{x_i^f\}_{i=1}^M$ from the (generally unknown) prior distribution $\pi_{X^f}$ is available. A standard approach for obtaining such estimators relies on the idea of importance sampling based on the likelihood $\pi_Y(y_0|x_i^f)$ of the samples $x_i^f$ with regard to a given observation $y_0$ (Doucet et al., 2001; Arulampalam et al., 2002; Bain and Crisan, 2008). If applied recursively, it is necessary to combine importance sampling with a resampling step such as monomial or systematic resampling (Arulampalam et al., 2002; Künsch, 2005). More recently the ensemble Kalman filter (EnKF) has been introduced (Evensen, 2006), which transforms the prior ensemble $\{x_i^f\}_{i=1}^M$ into a uniformly weighted posterior ensemble $\{x_i^a\}_{i=1}^M$ using the classic Kalman update step of linear filtering (Jazwinski, 1970). The EnKF leads, however, to a biased estimator even in the limit $M \to \infty$ (Lei and Bickel, 2011). In this paper, we propose a non-random ensemble transform method which is based on finite-dimensional optimal transportation in the form of linear programming (Strang, 1986; Cotter and Reich, 2012). We provide numerical and theoretical evidence that the new ensemble transform method leads to consistent posterior estimators. The new transform method can be applied to intermittent data assimilation, leading to a novel implementation of particle filters. We demonstrate this possibility for the chaotic Lorenz-63 model (Lorenz, 1963).

An outline of the paper is as follows. In Section 2, importance sampling Monte Carlo is summarized in the context of Bayesian inference. Subsequently, importance sampling is put into the context of linear programming in Section 3. This leads to a novel resampling method which maximizes the correlation between the prior and posterior ensemble members. We propose a further modification which turns the resampling step into a deterministic and linear transformation. Convergence of the proposed transformation step is demonstrated numerically by means of two examples. A theoretical convergence result is formulated for univariate random variables with compactly supported probability distributions.
Finally, the application to sequential Monte Carlo methods is discussed in Section 4 and gives rise to a novel ensemble transform filter. Numerical results are presented for the Lorenz-63 model.

2 Bayesian inference and importance sampling

We summarize the importance sampling approach to Bayesian inference. Given a prior (or, in the context of dynamic models, forecasted) random variable $X^f : \Omega \to \mathbb{R}^N$, we denote its probability density function (PDF) by $\pi_{X^f}(x)$, $x \in \mathbb{R}^N$, and consider the assimilation of an observed $y_0 \in \mathbb{R}^K$ with likelihood function $\pi_Y(y|x)$. According to Bayes' theorem, the analyzed, posterior PDF is given by

$$\pi_{X^a}(x|y_0) = \frac{\pi_Y(y_0|x)\,\pi_{X^f}(x)}{\int_{\mathbb{R}^N} \pi_Y(y_0|x)\,\pi_{X^f}(x)\,dx}.$$

Typically, the forecasted random variable $X^f$ and its PDF are not available explicitly. Instead one assumes that an ensemble of forecasts $x_i^f \in \mathbb{R}^N$, $i = 1,\ldots,M$, is given, which mathematically are considered as realizations $X_i^f(\omega)$, $\omega \in \Omega$, of $M$ independent (or dependent) random variables $X_i^f : \Omega \to \mathbb{R}^N$ with law $\pi_{X^f}$. Then the expectation value $\bar g^f = E_{X^f}[g]$ of a function $g : \mathbb{R}^N \to \mathbb{R}$ with respect to the prior PDF $\pi_{X^f}(x)$ can be estimated according to

$$\bar G_M^f = \frac{1}{M} \sum_{i=1}^M g(X_i^f)$$

with realization

$$\bar g_M^f = \bar G_M^f(\omega) = \frac{1}{M} \sum_{i=1}^M g(X_i^f(\omega)) = \frac{1}{M} \sum_{i=1}^M g(x_i^f)$$

for the ensemble $\{x_i^f\}_{i=1}^M$. The estimator is unbiased for any $M > 0$ and its variance vanishes as $M \to \infty$.

Following the idea of importance sampling (Liu, 2001), one obtains the following estimator with respect to the posterior PDF $\pi_{X^a}(x|y_0)$ using the forecast ensemble:

$$\bar g_M^a = \sum_{i=1}^M w_i\, g(x_i^f),$$

with weights

$$w_i = \frac{\pi_Y(y_0|x_i^f)}{\sum_{i=1}^M \pi_Y(y_0|x_i^f)}. \qquad (1)$$

The estimator is no longer unbiased for finite $M$ but remains consistent. Here an estimator is called consistent if the root mean square error between the estimator $\bar g_M^a$ and the exact expectation value $\bar g^a$ vanishes as $M \to \infty$.
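For readers who prefer code, a minimal Python sketch of the importance sampling estimator (1) might look as follows. The Gaussian likelihood, its variance, and the function names are illustrative assumptions rather than part of the method; note that the normalization constant of the likelihood cancels in (1).

```python
import numpy as np

def importance_weights(x_f, y0, likelihood):
    """Normalized importance weights w_i of Eq. (1) for a forecast ensemble x_f."""
    lik = np.array([likelihood(y0, x) for x in x_f])
    return lik / lik.sum()

def posterior_estimate(g, x_f, w):
    """Weighted estimate of the posterior expectation E_{X^a}[g]."""
    return float(np.sum(w * np.array([g(x) for x in x_f])))

# Illustrative scalar Gaussian likelihood with observation-error variance 2 (an assumption);
# constants are omitted since they cancel after normalization.
likelihood = lambda y, x: np.exp(-0.25 * (y - x) ** 2)

rng = np.random.default_rng(1)
x_f = rng.normal(1.0, np.sqrt(2.0), size=5000)             # prior ensemble of size M
w = importance_weights(x_f, y0=0.1, likelihood=likelihood)
print(posterior_estimate(lambda x: x, x_f, w))             # approaches 0.55 as M grows
```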
3 An ensemble transform method based on linear programming

Alternatively to importance sampling, we may attempt to transform the samples $x_i^f = X_i^f(\omega)$ with $X_i^f \sim \pi_{X^f}$ into samples $\hat x_i^a$ which follow the posterior distribution $\pi_{X^a}(x|y_0)$. Then we are back to an estimator

$$\bar g_M^a = \frac{1}{M} \sum_{i=1}^M g(\hat x_i^a)$$

with equal weights for posterior expectation values. For univariate random variables $X^f$ and $X^a$ with PDFs $\pi_{X^f}$ and $\pi_{X^a}$, respectively, the transformation is characterized by

$$F_{X^a}(\hat x_i^a) = F_{X^f}(x_i^f), \qquad (2)$$

where $F_{X^f}$ and $F_{X^a}$ denote the cumulative distribution functions of $X^f$ and $X^a$, respectively, e.g.

$$F_{X^f}(x) = \int_{-\infty}^{x} \pi_{X^f}(x')\,dx'.$$

Eq. (2) requires knowledge of the associated PDFs, and its extension to multivariate random variables is non-trivial. In this section, we propose an alternative approach that does not require explicit knowledge of the underlying PDFs and that easily generalizes to multivariate random variables. To obtain the desired transformation we utilize the idea of optimal transportation (Villani, 2003) with respect to an appropriate distance $d(x,x')$ in $\mathbb{R}^N$. More precisely, we first seek a coupling between two discrete random variables $X^f : \Omega' \to \mathcal{X}$ and $X^a : \Omega' \to \mathcal{X}$ with realizations in $\mathcal{X} = \{x_1^f,\ldots,x_M^f\}$ and probability vector $p^f = (1/M,\ldots,1/M)^T$ for $X^f$ and $p^a = (w_1,\ldots,w_M)^T$ for $X^a$, respectively. A coupling between $X^f$ and $X^a$ is an $M \times M$ matrix $T$ with non-negative entries $t_{ij} = (T)_{ij} \ge 0$ such that

$$\sum_{i=1}^M t_{ij} = 1/M, \qquad \sum_{j=1}^M t_{ij} = w_i. \qquad (3)$$

We now seek the coupling $T^*$ that minimizes the expected distance

$$E_{X^f X^a}[d(x^f,x^a)] = \sum_{ij} t_{ij}\, d(x_i^f, x_j^f).$$

The desired coupling $T^*$ is characterized by a linear programming problem (Strang, 1986). Since (3) leads to $2M-1$ independent constraints, the matrix $T^*$ contains at most $2M-1$ non-zero entries. In this paper, we use the squared Euclidean distance, i.e.

$$d(x_i^f, x_j^f) = \|x_i^f - x_j^f\|^2. \qquad (4)$$

We recall that minimizing the expected distance with respect to the squared Euclidean distance is equivalent to maximizing $E_{X^f X^a}[(x^f)^T x^a]$ since

$$E_{X^f X^a}[\|x^f - x^a\|^2] = E_{X^f}[\|x^f\|^2] + E_{X^a}[\|x^a\|^2] - 2\,E_{X^f X^a}[(x^f)^T x^a].$$

Hence minimizing the expected distance is equivalent to maximizing correlation.

We next introduce the Markov chain $P \in \mathbb{R}^{M \times M}$ on $\mathcal{X}$ via $P = M T^*$ with the property that $p^a = P p^f$. Given realizations $x_j^f$, $j = 1,\ldots,M$, from the prior PDF, a Monte Carlo resampling step proceeds now as follows: Define discrete random variables

$$X_j^a \sim \begin{pmatrix} p_{1j} \\ \vdots \\ p_{Mj} \end{pmatrix} \qquad (5)$$

for $j = 1,\ldots,M$, i.e. $X_j^a$ takes the value $x_i^f$ with probability $p_{ij}$. Here $p_{ij}$ denotes the $(i,j)$th entry of $P$. Note that the random variables $X_j^a$, $j = 1,\ldots,M$, are neither independent nor identically distributed. A new ensemble of size $M$ is finally obtained via $x_j^a = X_j^a(\omega)$, $j = 1,\ldots,M$. This ensemble of equally weighted samples allows for the approximation of expectation values with respect to the posterior distribution $\pi_{X^a}(x|y_0)$.

The outlined procedure leads to a particular instance of resampling with replacement (Arulampalam et al., 2002; Künsch, 2005). The main difference to techniques such as monomial or systematic resampling is that the resampling is chosen such that an expected distance $d(x^f,x^a)$ between the prior and posterior samples is minimized.

We now propose a further modification which replaces the random resampling step and avoids obtaining multiple copies in the analyzed ensemble $\{x_i^a\}$. The modification is based on the observation that

$$\bar x_j^a = E_{X_j^a}[x] = \sum_{i=1}^M p_{ij}\, x_i^f.$$

We use this result to propose the deterministic transformation

$$x_j^a = \bar x_j^a, \qquad (6)$$

$j = 1,\ldots,M$. The idea is that

$$\bar g_M^a = \frac{1}{M} \sum_{j=1}^M g(\bar x_j^a)$$

still provides a consistent estimator for $E_{X^a}[g]$ as $M \to \infty$. For the special case $g(x) = x$ it is easy to verify that indeed

$$\frac{1}{M} \sum_{j=1}^M \bar x_j^a = \frac{1}{M} \sum_{j=1}^M \sum_{i=1}^M p_{ij}\, x_i^f = \sum_{i,j} t_{ij}^*\, x_i^f = \sum_{i=1}^M w_i\, x_i^f.$$
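The coupling defined by (3)-(4) and the deterministic transformation (6) can be computed with any linear programming solver. The following Python sketch uses scipy.optimize.linprog on the flattened matrix $T$; the function name ensemble_transform and the dense constraint matrices are our own illustrative choices, not the implementation used for the experiments reported below, and for larger $M$ a dedicated transport solver would be preferable.

```python
import numpy as np
from scipy.optimize import linprog

def ensemble_transform(x_f, w):
    """Deterministic ensemble transform, Eqs. (3)-(6).

    x_f : (M, N) array of prior samples; w : (M,) importance weights.
    Returns the transformed (analysis) ensemble of shape (M, N).
    """
    x_f = np.asarray(x_f, dtype=float)
    w = np.asarray(w, dtype=float)
    M = x_f.shape[0]

    # Cost d_ij = ||x_i^f - x_j^f||^2, Eq. (4); flattened with index k = i*M + j
    diff = x_f[:, None, :] - x_f[None, :, :]
    cost = np.sum(diff ** 2, axis=-1).ravel()

    # Equality constraints of Eq. (3): row sums equal w_i, column sums equal 1/M
    A_eq = np.zeros((2 * M, M * M))
    for i in range(M):
        A_eq[i, i * M:(i + 1) * M] = 1.0        # sum_j t_ij = w_i
    for j in range(M):
        A_eq[M + j, j::M] = 1.0                 # sum_i t_ij = 1/M
    b_eq = np.concatenate([w, np.full(M, 1.0 / M)])

    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    assert res.success, res.message
    T = res.x.reshape(M, M)                     # optimal coupling T*

    P = M * T                                   # induced Markov matrix, P = M T*
    return P.T @ x_f                            # x_a_j = sum_i p_ij x_i^f, Eq. (6)
```

Note that the two sets of equality constraints in (3) are linearly dependent (both sum to one); the solver tolerates this, but one redundant row could also be dropped explicitly.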
Before investigating the theoretical properties of the proposed transformation (6), we consider two examples which indicate that (6) leads to a consistent approximation to (2) in the limit $M \to \infty$.

Example. We take the univariate Gaussian with mean $\bar x = 1$ and variance $\sigma^2 = 2$ as prior random variable $X^f$. Realizations of $X^f$ are generated using

$$x_i^f = 1 + 2\,\mathrm{erf}^{-1}(2u_i - 1), \qquad u_i = \frac{1}{2M} + \frac{i-1}{M}$$

for $i = 1,\ldots,M$. The likelihood function is

$$\pi_Y(y|x) = \frac{1}{\sqrt{4\pi}} \exp\!\left(\frac{-(y-x)^2}{4}\right)$$

with assumed observed value $y_0 = 0.1$. Bayes' formula yields a posterior distribution which is Gaussian with mean $\bar x = 0.55$ and variance $\sigma^2 = 1$. The prior and posterior realizations from the transform method are shown for $M = 10$ in Figure 1. We also display the analytic transform, which is a straight line in the case of Gaussian distributions, and the approximate transform obtained from linear programming in Figure 2.

[Figure 1: Prior $x_i^f$ and posterior $x_i^a$ realizations from the transform method for $M = 10$.]

[Figure 2: Exact and numerical ensemble transform map for $M = 10$. The Gaussian case leads to the exact transformation being linear. The numerical approximation deviates from linearity mostly in both tails.]

The structure of non-zero entries of the Markov chain matrix $P$ for $M = 40$ is displayed in Figure 3, which shows a banded structure of local interactions. More generally, one obtains the posterior estimates for the first four moments displayed in Table 1, which indicate convergence as $M \to \infty$.

[Figure 3: Non-zero entries in the matrix $P$ for $M = 40$, i.e. the support of the coupling. There are a total of $2M - 1 = 79$ non-zero entries. The banded structure reveals the spatial locality and the cyclical monotonicity (Villani, 2003) of the resampling step.]

Table 1: Estimated posterior first- to fourth-order moments from the ensemble transform method applied to a Gaussian scalar Bayesian inference problem.

            x̄        σ²        E[(X−x̄)³]   E[(X−x̄)⁴]
  M = 10    0.5361   1.0898    -0.0137     2.3205
  M = 40    0.5473   1.0241     0.0058     2.7954
  M = 100   0.5493   1.0098    -0.0037     2.9167

Example. As a further (non-Gaussian) example we consider a uniform prior on the interval $[0,1]$ and use samples $x_i^f = u_i$ with the $u_i$'s as defined in the previous example. Given the observed value $y_0 = 0.1$, the posterior PDF is

$$\pi_{X^a}(x|0.1) = \begin{cases} \dfrac{1}{0.9427\ldots}\, e^{-(x-0.1)^2/4} & x \in [0,1], \\[4pt] 0 & \text{else.} \end{cases}$$

The resulting posterior mean is $\bar x \approx 0.4836$ and its variance $\sigma^2 \approx 0.0818$. The third and fourth central moments are 0.0016 and 0.0122, respectively. The transform method yields the posterior estimates for the first four moments displayed in Table 2, which again indicate convergence as $M \to \infty$.

Table 2: Estimated posterior first- to fourth-order moments from the ensemble transform method applied to a non-Gaussian scalar Bayesian inference problem.

            x̄        σ²        E[(X−x̄)³]   E[(X−x̄)⁴]
  M = 10    0.4838   0.0886     0.0014     0.0114
  M = 40    0.4836   0.0838     0.0016     0.0121
  M = 100   0.4836   0.0825     0.0016     0.0122
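As a rough numerical check of Table 1, the Gaussian example can be reproduced with a few lines of Python. The sketch below assumes the hypothetical ensemble_transform helper from the previous sketch is in scope; the printed moments will only approximate the tabulated values.

```python
import numpy as np
from scipy.special import erfinv

M = 40
u = (np.arange(M) + 0.5) / M                       # u_i = 1/(2M) + (i-1)/M
x_f = 1.0 + 2.0 * erfinv(2.0 * u - 1.0)            # deterministic samples of N(1, 2)

y0 = 0.1
lik = np.exp(-0.25 * (y0 - x_f) ** 2)              # likelihood up to a constant factor
w = lik / lik.sum()                                # importance weights, Eq. (1)

x_a = ensemble_transform(x_f[:, None], w).ravel()  # deterministic transform, Eq. (6)

mean = x_a.mean()
print(mean, ((x_a - mean) ** 2).mean())            # compare with 0.55 and 1 for large M
```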
Theorem. Assume that the ensemble $\{x_i^f\}_{i=1}^M$ is a realization of $M$ independent and identically distributed univariate random variables $X_i^f : \Omega \to [a,b] \subset \mathbb{R}$ with PDF $\pi_{X^f}$. Without restriction we may assume that the ensemble members $x_i^f \in [a,b]$ are mutually distinct and that they have been sorted such that

$$x_1^f < x_2^f < \cdots < x_{M-1}^f < x_M^f.$$

We finally assume that there is a constant $\delta > 0$ such that $\pi_{X^f}(x) \ge \delta$ for all $x \in [a,b]$. Then $x_i^a$, given by (6), converges to $\hat x_i^a$, defined by (2), in probability as $M \to \infty$.

Proof. Given an $\varepsilon_1 > 0$, we choose an integer $L = L(\varepsilon_1)$ sufficiently large such that truncated weights

$$\hat w_i \in \{0, 1/(LM), 2/(LM), \ldots, 1\}$$

satisfying

$$\frac{|w_i - \hat w_i|}{1/M} \le \varepsilon_1$$

can be defined for $i = 1,\ldots,M$. Here the weights $w_i$ are defined by (1).

It follows that the linear programming problem with weights $w_i$ replaced by $\hat w_i$ leads to a matrix $P$ with at most $L$ non-zero entries in each column, and those non-zero entries cluster into a single block due to the cyclical monotonicity of optimal transport plans (Villani, 2003). See also Figure 3.

Next we decompose the interval $[a,b]$ into $J$ disjoint intervals $I_j$ such that

$$\int_{I_j} \pi_{X^f}(x)\,dx = \frac{1}{J}.$$

Given an $\varepsilon_2 > 0$, we now select a $J = J(\varepsilon_2)$ such that

$$\int_{I_j} dx \le \varepsilon_2.$$

This is possible since $\pi_{X^f}$ is bounded from below.

Large deviation results for monomial resampling (see, for example, Lemma 1 in Devroye and Györfi (1985)) imply that the probability of having fewer than $L$ members of the ensemble $\{x_i^f\}_{i=1}^M$ in each of the subintervals $I_j$ decays exponentially fast as $M \to \infty$. Hence the standard deviation of $X_i^a$, as defined by (5), can be bounded by $\varepsilon_2$. Taking the limit $\varepsilon_i \to 0$, $i = 1,2$, we obtain the convergence in probability of $x_i^a = \bar x_i^a$ to $\hat x_i^a$, as defined by (2). The limit $\varepsilon_i \to 0$ can be achieved, for example, by choosing $J = O(M^{1/4})$ and $L = O(M^{1/2})$. □

Remark. Cyclical monotonicity for optimal couplings $\mu$ between two PDFs $\pi_{X^f}$ and $\pi_{X^a}$ states that any sequence of points $(x_i^f, x_i^a)$, $i = 1,\ldots,I$, from the support $\mathrm{spt}(\mu)$ of $\mu$, i.e. $(x_i^f, x_i^a) \in \mathrm{spt}(\mu)$, satisfies

$$\sum_{i=1}^I d(x_i^f, x_i^a) \le \sum_{i=1}^I d(x_i^f, x_{\sigma(i)}^a),$$

where $\sigma$ denotes any permutation of the indices $i \in \{1,2,\ldots,I\}$ (Villani, 2003). Cyclical monotonicity applies in particular to any sequence of pairs $(x_i^f, x_i^a)$, $i = 1,\ldots,M$, obtained from the resampling step (5). For the Euclidean distance (4), this property carries over to the sample means and, hence, to pairs $(x_i^f, x_i^a)$ obtained from the transformation (6). This observation, together with the results from McCann (1995), should allow for a more direct convergence proof. In particular, the techniques leading to Theorem 6 in McCann (1995) are directly applicable.

One may replace the uniform probabilities in $p^f$ by an appropriate probability vector $p^f = (w_1^f,\ldots,w_M^f)^T$, i.e. $w_i^f \ge 0$ and $\sum_{i=1}^M w_i^f = 1$. To clarify the notation, we write $p^a = (w_1^a,\ldots,w_M^a)^T$ for the posterior weights according to Bayes' formula. The linear programming problem is adjusted accordingly and one obtains an optimal coupling $T^*$ and an induced Markov chain $P$ with entries

$$p_{ij} = \frac{t_{ij}}{w_j^f}.$$

Hence the transform method is now defined by

$$\bar x_j^a = \sum_{i=1}^M p_{ij}\, x_i^f \qquad (7)$$

and the posterior ensemble mean satisfies

$$\bar x_M^a = \sum_{j=1}^M w_j^f\, \bar x_j^a = \sum_{j=1}^M \sum_{i=1}^M w_j^f\, \frac{t_{ij}}{w_j^f}\, x_i^f = \sum_{i=1}^M w_i^a\, x_i^f$$

as desired. More generally, posterior expectation values are given by

$$\bar g_M^a = \sum_{i=1}^M w_i^f\, g(\bar x_i^a).$$

4 Application to sequential data assimilation

We now apply the proposed ensemble transformation (ET) method to sequential state estimation for ordinary differential equation models

$$\dot x = f(x) \qquad (8)$$

with known PDF $\pi_0$ for the initial conditions $x(0) \in \mathbb{R}^N$ at time $t = 0$. Hence we treat solutions $x(t)$ as realizations of the random variables $X_t$, $t \ge 0$, determined by the flow of (8) and the initial PDF $\pi_0$.

We assume the availability of observations $y(t_k) \in \mathbb{R}^K$ at discrete times $t_k = k\,\Delta t_{\mathrm{obs}}$, $k > 0$, i.e. in intervals of $\Delta t_{\mathrm{obs}} > 0$. The observations satisfy the forward model

$$Y(t_k) = h(x_{\mathrm{ref}}(t_k)) + \Xi_k,$$

where $\Xi_k : \Omega \to \mathbb{R}^K$ represent independent and identically distributed centered Gaussian random variables with covariance matrix $R \in \mathbb{R}^{K \times K}$, $h : \mathbb{R}^N \to \mathbb{R}^K$ is the forward map, and $x_{\mathrm{ref}}(t) \in \mathbb{R}^N$ denotes the desired reference solution. The forward model gives rise to the likelihood

$$\pi_Y(y|x) = \frac{1}{(2\pi)^{K/2} |R|^{1/2}} \exp\!\left(-\frac{1}{2}(y - h(x))^T R^{-1} (y - h(x))\right).$$

A particle filter starts from an ensemble $\{x_i(0)\}_{i=1}^M$ of $M$ realizations from the initial PDF $\pi_0$. We evolve this ensemble of realizations under the model dynamics (8) until the first observation $y_{\mathrm{obs}}(\Delta t_{\mathrm{obs}})$ becomes available, at which point we apply the proposed ET method to the forecast ensemble members $x_i^f = x_i(\Delta t_{\mathrm{obs}})$. If one furthermore collects these prior realizations into an $N \times M$ matrix

$$X^f = [x_1^f \cdots x_M^f],$$

then, for given observation $y_0 = y(\Delta t_{\mathrm{obs}})$, the ET method (7) leads to the posterior realizations simply given by

$$X^a = X^f P, \qquad [x_1^a \cdots x_M^a] = X^a, \qquad (9)$$

where $P$ is the Markov chain induced by the associated linear programming problem. The analyzed ensemble members $x_i^a$, $i = 1,\ldots,M$, are now used as new initial conditions for the model (8), and the sequence of alternating between propagation under the model dynamics and assimilation of data is repeated for all $k > 1$.
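To make the matrix form (9) concrete, one analysis step of the resulting ET particle filter can be sketched as follows. The function name et_analysis is ours, the Gaussian likelihood follows the forward model above, and the hypothetical ensemble_transform helper from Section 3 is assumed to be available.

```python
import numpy as np

def et_analysis(X_f, y_obs, h, R_inv):
    """One ET analysis step, Eq. (9).

    X_f   : N x M forecast ensemble (columns are members).
    y_obs : observation in R^K; h maps a state vector to observation space.
    R_inv : inverse observation-error covariance.
    Returns the analysed N x M ensemble X_a = X_f P.
    """
    M = X_f.shape[1]
    # Importance weights from the Gaussian likelihood (normalization constants cancel)
    innov = np.stack([y_obs - h(X_f[:, i]) for i in range(M)], axis=1)   # K x M
    log_w = -0.5 * np.einsum("km,kl,lm->m", innov, R_inv, innov)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()

    # Optimal-transport coupling and induced Markov matrix P (sketch from Section 3)
    X_a = ensemble_transform(X_f.T, w).T     # equivalent to X_a = X_f P
    return X_a
```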
It should be noted that a transformation similar to (9) arises from the ensemble square root filter (ESRF) (Evensen, 2006). However, the transform matrix $P \in \mathbb{R}^{M \times M}$ used here is obtained in a completely different manner and does not rely on the assumption of the PDFs being Gaussian. We mention the work of Lei and Bickel (2011) for an alternative approach to modifying EnKFs in order to make them consistent with non-Gaussian distributions.

We now provide a numerical example and compare an ESRF implementation with a particle filter using the new ET method.

Example. We consider the Lorenz-63 model (Lorenz, 1963)

$$\dot x = \sigma(y - x), \qquad \dot y = x(\rho - z) - y, \qquad \dot z = xy - \beta z$$

in the parameter and data assimilation setting of Anderson (2010). In particular, the state vector is $\mathbf{x} = (x,y,z)^T \in \mathbb{R}^3$ and we observe all three variables every $\Delta t_{\mathrm{obs}} = 0.12$ time units with a measurement error variance $R = 8$ in each observed solution component. The equations are integrated in time by the implicit midpoint rule with step-size $\Delta t = 0.01$. We implement an ESRF (Evensen, 2006) and the new ET filter for ensemble sizes $M = 10, 20, 40, 60, 80, 100$. The results for both methods use an optimized form of ensemble inflation. The ET filter nevertheless leads to filter divergence for $M = 10$, while the ESRF is stable for all given choices of $M$. The time mean root mean square (RMS) errors over 2000 assimilation steps can be found in Figure 4. It is evident that the new ET filter leads to much lower RMS errors for all $M \ge 20$. The results also compare favorably to the ones displayed in Anderson (2010) for the rank histogram filter (RHF) and the EnKF with perturbed observations (Evensen, 2006).

[Figure 4: Time mean RMS errors for the Lorenz-63 model in the setting of Anderson (2010) for an ensemble square root filter (ESRF) and the new ensemble transform (ET) filter for increasing ensemble sizes $M$.]
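For completeness, a minimal sketch of the model propagation used in this example is given below. It assumes the classical Lorenz-63 parameters $\sigma = 10$, $\rho = 28$, $\beta = 8/3$ (the exact configuration of Anderson (2010) is not restated here) and solves the implicit midpoint step by a simple fixed-point iteration.

```python
import numpy as np

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0   # assumed classical parameter values

def lorenz63(v):
    x, y, z = v
    return np.array([SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z])

def implicit_midpoint_step(v, dt=0.01, iters=10):
    """One step of the implicit midpoint rule, solved by fixed-point iteration."""
    v_new = v + dt * lorenz63(v)                         # explicit Euler predictor
    for _ in range(iters):
        v_new = v + dt * lorenz63(0.5 * (v + v_new))
    return v_new

def propagate_ensemble(X, t_obs=0.12, dt=0.01):
    """Advance every ensemble member (columns of X) over one observation interval."""
    n_steps = int(round(t_obs / dt))
    for _ in range(n_steps):
        X = np.column_stack([implicit_midpoint_step(X[:, i], dt)
                             for i in range(X.shape[1])])
    return X
```

Between two such propagation calls, the et_analysis step sketched above would be applied whenever an observation becomes available.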
5 Conclusions

We have explored the application of linear programming and optimal transportation to Bayesian inference and particle filters. Numerical evidence indicates that the proposed ET method allows one to reproduce posterior expectation values in the limit $M \to \infty$. It would be of interest to prove the convergence of the ET method to solutions of the associated continuous optimal transportation problem (Villani, 2003, 2009) for general multivariate random variables using cyclical monotonicity arguments. The application of continuous optimal transportation to Bayesian inference has been discussed by Moselhy and Marzouk (2012), Reich (2011, 2012), and Cotter and Reich (2012). However, a direct application of optimal transportation to Bayesian inference in high-dimensional state spaces $\mathbb{R}^N$ seems currently out of reach. It remains to investigate what modifications are required (such as localization (Evensen, 2006)) in order to implement the proposed ET method even if the ensemble sizes $M$ are much smaller than the dimension of the state space $N$ (or the dimension of the attractor of (8) in the case of intermittent data assimilation).

Acknowledgment. I would like to thank Yann Brenier and Jacques Vanneste for inspiring discussions on the subject of this paper during an Oberwolfach workshop in 2010. The paper benefited furthermore from additional discussions with Wilhelm Stannat and Dan Crisan at another Oberwolfach workshop in 2012.

References

J.L. Anderson. A non-Gaussian ensemble filter update for data assimilation. Monthly Weather Review, 138:4186-4198, 2010.

M.S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Sign. Process., 50:174-188, 2002.

A. Bain and D. Crisan. Fundamentals of Stochastic Filtering, volume 60 of Stochastic Modelling and Applied Probability. Springer-Verlag, New York, 2008.

C.J. Cotter and S. Reich. Ensemble filter techniques for intermittent data assimilation - a survey. Submitted, 2012.

L. Devroye and L. Györfi. Nonparametric Density Estimation: The L1 View. John Wiley & Sons, New York, 1985.

A. Doucet, N. de Freitas, and N. Gordon (eds.). Sequential Monte Carlo Methods in Practice. Springer-Verlag, Berlin Heidelberg New York, 2001.

G. Evensen. Data Assimilation. The Ensemble Kalman Filter. Springer-Verlag, New York, 2006.

A.H. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.

H.R. Künsch. Recursive Monte Carlo filters: Algorithms and theoretical analysis. Ann. Statist., 33:1983-2021, 2005.

J. Lei and P. Bickel. A moment matching ensemble filter for nonlinear and non-Gaussian data assimilation. Mon. Weath. Rev., 139:3964-3973, 2011.

J.S. Liu. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York, 2001.

E.N. Lorenz. Deterministic non-periodic flows. J. Atmos. Sci., 20:130-141, 1963.

R.J. McCann. Existence and uniqueness of monotone measure-preserving maps. Duke Mathematical Journal, 80:309-323, 1995.

T.A. El Moselhy and Y.M. Marzouk. Bayesian inference with optimal maps. J. Comput. Phys., 231: in press, 2012.

S. Reich. A dynamical systems framework for intermittent data assimilation. BIT Numer. Math., 51:235-249, 2011.

S. Reich. A Gaussian mixture ensemble transform filter. Q. J. R. Meteorolog. Soc., 138:222-233, 2012.

G. Strang. Introduction to Applied Mathematics. Wellesley-Cambridge Press, 2nd edition, 1986.

C. Villani. Topics in Optimal Transportation. American Mathematical Society, Providence, Rhode Island, 2003.

C. Villani. Optimal Transportation: Old and New. Springer-Verlag, Berlin Heidelberg, 2009.
