ERROR BOUNDS FOR COMPUTING THE EXPECTATION BY MARKOV CHAIN MONTE CARLO

DANIEL RUDOLF

arXiv:0906.2359v1 [math.NA] 12 Jun 2009

Abstract. We study the error of reversible Markov chain Monte Carlo methods for approximating the expectation of a function. Explicit error bounds with respect to the $l_2$-, $l_4$- and $l_\infty$-norm of the function are proven. With this estimate the well-known asymptotic limit of the error is attained, i.e. there is no gap between the estimate and the asymptotic behavior. We discuss the dependence of the error on a burn-in of the Markov chain. Furthermore we suggest and justify a specific burn-in for optimizing the algorithm.

1. Introduction

We start with a probability distribution $\pi$ on a finite set $D$ and a function $f \colon D \to \mathbb{R}$. The goal is to compute the expectation denoted by

$S(f) = \sum_{x \in D} f(x)\pi(x)$.

Let the cardinality of $D$ be very large, such that an exact computation of the sum is practically impossible. Furthermore suppose that the desired distribution is not explicitly given, i.e. we have no random number generator for $\pi$ available. Such kinds of problems arise in statistical physics, in statistics, and in financial mathematics (see for instance [GRS96, Liu08]). The idea of approximating $S(f)$ via Markov chain Monte Carlo (MCMC) is the following: Run a Markov chain on $D$ to simulate the distribution $\pi$ and compute the time average over the last $n$ steps. Let $X_1, \dots, X_{n+n_0}$ be the chain; then we obtain as approximation

$S_{n,n_0}(f) = \frac{1}{n} \sum_{i=1}^{n} f(X_{i+n_0})$.

Date: Version: June 12, 2009.
Key words and phrases. Markov chain Monte Carlo methods, Markov chain Monte Carlo, error bounds, explicit error bounds, burn-in, mixing time, eigenvalue.

By $n_0$ the so-called burn-in is given; loosely speaking, this is the number of time steps taken to warm up. Afterwards the distribution of the generated Markov chain is (hopefully) close to the stationary one.
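The estimator $S_{n,n_0}(f)$ is straightforward to realize in code. The following sketch runs a small reversible chain on $D = \{0, 1, 2\}$; the transition matrix, the function $f$, and the parameters $n$, $n_0$ are illustrative choices, not taken from the paper.

```python
import random

# Toy reversible chain on D = {0, 1, 2} (a lazy birth-death walk);
# this concrete P and its stationary pi are illustrative assumptions.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]          # stationary distribution of P

def step(state, rng):
    """Draw the next state according to the row P[state]."""
    r, acc = rng.random(), 0.0
    for y, p in enumerate(P[state]):
        acc += p
        if r < acc:
            return y
    return len(P) - 1

def mcmc_mean(f, n, n0, start, seed=0):
    """S_{n,n0}(f): discard n0 burn-in steps, average f over the next n."""
    rng = random.Random(seed)
    x, total = start, 0.0
    for i in range(n + n0):
        x = step(x, rng)
        if i >= n0:
            total += f(x)
    return total / n

f = lambda x: float(x)          # here S(f) = 0*0.25 + 1*0.5 + 2*0.25 = 1
est = mcmc_mean(f, n=200_000, n0=100, start=0)
```

For this chain the estimate lands close to $S(f) = 1$; the rate at which the deviation shrinks with $n$ is exactly what the error bounds below quantify.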
A Markov chain is identified with its initial distribution $\nu$ and its transition matrix $P$. We restrict ourselves to ergodic chains, i.e. the second largest absolute value $\beta$ of the eigenvalues of $P$ is smaller than one. It is well known that the distribution of these chains reaches stationarity exponentially (see [Bré99, RR97, LPW09]). The error of $S_{n,n_0}$ for $f \in \mathbb{R}^D$ is measured by

$e_\nu(S_{n,n_0}, f) = \left( \mathbb{E}_{\nu,P}\, |S_{n,n_0}(f) - S(f)|^2 \right)^{1/2}$,

where $\mathbb{E}_{\nu,P}$ denotes the expectation of the Markov chain. The asymptotic behavior of the integration error can be written in terms of the eigenvalues and eigenfunctions of $P$. It holds true that

$\lim_{n \to \infty} n \cdot e_\nu(S_{n,n_0}, f)^2 \le \frac{1+\beta_1}{1-\beta_1} \|f\|_2^2$,

where $\beta_1$ is the second largest eigenvalue (see [Sok97, Mat99]). The constant $\frac{1+\beta_1}{1-\beta_1}$ is optimal, but this statement does not give an error bound for finite $n$ and also does not include anything concerning the choice of $n_0$. What does an explicit error bound of the MCMC method look like in which the asymptotic behavior is attained?

Let us give an outline of the structure and the main results. Section 2 contains the used notation and presents some relevant statements concerning Markov chains. Section 3 contains the new results. The explicit error bound is developed with respect to the $l_2$-, $l_4$- and $l_\infty$-norm of the function $f$. For $\|f\|_\infty \le 1$ and $C = 2\sqrt{\|\nu/\pi - 1\|_\infty}$ we obtain the following. The error obeys

$e_\nu(S_{n,n_0}, f)^2 \le \frac{2}{n(1-\beta_1)} + \frac{2C\beta^{n_0}}{n^2(1-\beta)^2}$.

For details and estimates concerning $l_2$ and $l_4$ we refer to Theorem 11 in Section 3.3. In Section 4 it turns out that

$n_0 = \max\left\{ \left\lceil \frac{\log C}{\log(\beta^{-1})} \right\rceil, 0 \right\}$

is a reasonable choice for the burn-in. Then the error bound simplifies to

$e_\nu(S_{n,n_0}, f)^2 \le \frac{2}{n(1-\beta_1)} + \frac{2}{n^2(1-\beta)^2}$.

In many examples a good estimate for $\beta$ can be achieved, see for instance [MR02, BD06, BL07]. Therefore it is straightforward to apply the explicit error bound.
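Taking the stated bound at face value, the burn-in recommendation becomes a small calculation. In the sketch below, the values $\beta = \beta_1 = 1/2$ and the three-point distributions $\nu$, $\pi$ are illustrative assumptions (they match the toy chain used in the examples of this text), and $C = 2\sqrt{\|\nu/\pi - 1\|_\infty}$ is used as in the bound above.

```python
import math

# Illustrative inputs: a three-state example with beta = beta_1 = 1/2,
# started deterministically at state 0 (nu = delta_0).
pi = [0.25, 0.5, 0.25]
nu = [1.0, 0.0, 0.0]
beta = beta1 = 0.5

# C = 2 * sqrt(||nu/pi - 1||_inf), the constant appearing in the bound.
C = 2.0 * math.sqrt(max(abs(a / b - 1.0) for a, b in zip(nu, pi)))

# Suggested burn-in: n0 = max(ceil(log C / log(1/beta)), 0).
n0 = max(math.ceil(math.log(C) / math.log(1.0 / beta)), 0)

def error_bound(n, n0):
    """e_nu(S_{n,n0}, f)^2 <= 2/(n(1-beta_1)) + 2 C beta^{n0} / (n^2 (1-beta)^2)."""
    return 2.0 / (n * (1.0 - beta1)) + \
           2.0 * C * beta ** n0 / (n ** 2 * (1.0 - beta) ** 2)
```

With this choice of $n_0$ one has $C\beta^{n_0} \le 1$, so the second term is dominated by $2/(n^2(1-\beta)^2)$, reproducing the simplified bound.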
2. Preliminaries

The Markov chain $X_1, X_2, \dots$ is a stochastic process with state space $D$. We identify it with initial distribution $\nu$ and transition matrix $P = (p(x,y))_{x,y \in D}$ and denote it by $(\nu, P)$. For $x, y \in D$ the entry $p(x,y)$ presents the probability of jumping from state $x$ to state $y$ in one step of the chain.

By $Pf(x) = \sum_{y \in D} p(x,y) f(y)$ we obtain the expectation of the value of $f \in \mathbb{R}^D$ after one step of the chain starting from $x \in D$. The expectation after $k$ steps of the Markov chain from $x$ is given by $P^k f(x) = \sum_{y \in D} p^k(x,y) f(y)$, where $P^k = (p^k(x,y))_{x,y \in D}$ denotes the $k$-th power of $P$. Similarly we consider the application of $P$ to a distribution $\nu$, i.e. $\nu P(x) = \sum_{y \in D} \nu(y) p(y,x)$. This is the distribution which arises after one step where the initial state was chosen by $\nu$. The distribution which arises after $k$ steps is given by $\nu P^k(x) = \sum_{y \in D} \nu(y) p^k(y,x)$.

The expectation $\mathbb{E}_{\nu,P}$ of the Markov chain $X_1, \dots, X_{n+n_0}$ is taken with respect to the probability measure

$W_{\nu,P}(x_1, \dots, x_{n+n_0}) = \nu(x_1)\, p(x_1,x_2) \cdots p(x_{n+n_0-1}, x_{n+n_0})$

on $D^{n+n_0}$. Using this, for $i \le j$ we obtain a characterization by the transition matrix

(1) $\mathbb{E}_{\nu,P}(f(X_i) f(X_j)) = \sum_{x \in D} P^i(f\, P^{j-i} f)(x)\, \nu(x)$.

2.1. Reversibility and spectral structure. We call the Markov chain with transition matrix $P$, or simply $P$, reversible with respect to a probability measure $\pi$ if the detailed balance condition

$\pi(x)\, p(x,y) = \pi(y)\, p(y,x)$

holds true for $x, y \in D$. If $P$ is reversible, then $\pi$ is a so-called stationary distribution of the Markov chain, i.e. $\pi P(x) = \pi(x)$. Note that, if $P$ is reversible, then $P^k$ is also reversible. Let us define the weighted scalar product

$\langle f, g \rangle_\pi = \sum_{x \in D} f(x) g(x) \pi(x)$

for functions $f, g \in \mathbb{R}^D$. Then let $\|f\|_2 = \langle f, f \rangle_\pi^{1/2}$. By considering the scalar product it is easy to show that reversibility is equivalent to $P$ being self-adjoint. Furthermore suppose that the underlying Markov chain is irreducible and aperiodic; this is also called ergodic. For details of these conditions we refer to the literature, for instance [Häg02, Bré99, LPW09].
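The action of $P$ on functions and on distributions, and the detailed balance condition, can be checked directly on a small example. The matrix below is an illustrative toy chain, not taken from the paper.

```python
# Toy reversible chain on D = {0, 1, 2} (illustrative assumption).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]
D = range(len(P))

def apply_P(f):
    """(Pf)(x) = sum_y p(x,y) f(y): expected value of f one step after x."""
    return [sum(P[x][y] * f[y] for y in D) for x in D]

def push(nu):
    """(nu P)(x) = sum_y nu(y) p(y,x): distribution after one step from nu."""
    return [sum(nu[y] * P[y][x] for y in D) for x in D]

# Detailed balance pi(x) p(x,y) = pi(y) p(y,x) holds for this P, so the
# chain is reversible and pi is stationary: pi P = pi.
reversible = all(abs(pi[x] * P[x][y] - pi[y] * P[y][x]) < 1e-12
                 for x in D for y in D)
```

Note the asymmetry of the two operations: $P$ acts on functions from the left via its rows and on distributions from the right via its columns.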
It is a well known fact that this implies the uniqueness of the stationary distribution. Applying the spectral theorem for self-adjoint stochastic matrices and ergodicity, we obtain that $P$ has real eigenvalues

$1 = \beta_0 > \beta_1 \ge \beta_2 \ge \dots \ge \beta_{|D|-1} > -1$

with a basis of orthogonal eigenfunctions $u_i$ for $i \in \{0, \dots, |D|-1\}$, i.e.

$P u_i = \beta_i u_i, \quad \langle u_i, u_j \rangle_\pi = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \ne j. \end{cases}$

Additionally one can see that $u_0(x) = 1$ and $S(u_i) = 0$ for $i > 0$.

2.2. Convergence of the chain. The speed of convergence of the Markov chain to stationarity is measured by the so-called $\chi^2$-contrast. Let $\nu$, $\mu$ be distributions on $D$; then

$\chi^2(\nu, \mu) = \sum_{x \in D} \frac{(\nu(x) - \mu(x))^2}{\mu(x)}$.

The $\chi^2$-contrast is not symmetric and therefore no distance. For arbitrary distributions it can be very large, i.e. $\chi^2(\nu, \mu) \le \left\| \frac{\nu}{\mu} - 1 \right\|_\infty$, where $\left\| \frac{\nu}{\mu} - 1 \right\|_\infty = \max_{x \in D} \left| \frac{\nu(x)}{\mu(x)} - 1 \right|$.

From [Bré99, Theorem 3.3 p. 209] we have

(2) $\chi^2(\nu P^k, \pi) \le \beta^{2k}\, \chi^2(\nu, \pi)$,

where $\beta = \max\{\beta_1, |\beta_{|D|-1}|\}$ denotes the second largest absolute value of the eigenvalues. Let us turn to another presentation of the convergence property. We have

$\nu P^k(x) - \pi(x) = \sum_{y \in D} \frac{\nu(y)}{\pi(y)}\, p^k(y,x)\, \pi(y) - \pi(x)$
$\overset{\text{rev.}}{=} \sum_{y \in D} \frac{\nu(y)}{\pi(y)}\, p^k(x,y)\, \pi(x) - \pi(x)$
$= \sum_{y \in D} \frac{\nu(y)}{\pi(y)}\, p^k(x,y)\, \pi(x) - \sum_{y \in D} \frac{\nu(y)}{\pi(y)}\, \pi(y)\, \pi(x)$
$= \sum_{y \in D} \frac{\nu(y)}{\pi(y)} \left( p^k(x,y) - \pi(y) \right) \pi(x)$.

The second equality follows by the reversibility of the Markov chain. For simplicity let

$d_k(x) := \sum_{y \in D} \frac{\nu(y)}{\pi(y)} \left( p^k(x,y) - \pi(y) \right)$,

such that altogether

(3) $\|d_k\|_2 = \sqrt{\chi^2(\nu P^k, \pi)} \overset{(2)}{\le} \beta^k \sqrt{\left\| \frac{\nu}{\pi} - 1 \right\|_\infty}$.

Since $\beta < 1$ we have an exponential decay of the norm as $k \to \infty$. We define the weighted sequence spaces for $1 \le p \le \infty$ by

$l_p = l_p(D, \pi) := \left\{ f \in \mathbb{R}^D : \|f\|_p^p = \sum_{x \in D} |f(x)|^p \pi(x) < \infty \right\}$.
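The decay statement (2) can be observed numerically. For the toy chain below the spectrum can be computed by hand: the eigenvalues are $1$, $1/2$ and $0$, so $\beta = 1/2$; the chain itself is an illustrative assumption, not from the paper.

```python
# Toy reversible chain; its eigenvalues are 1 = beta_0 > beta_1 = 1/2 >
# beta_2 = 0 (hand-computed), so beta = 1/2.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]
beta = 0.5
D = range(3)

def chi2(nu, mu):
    """chi^2-contrast: sum_x (nu(x) - mu(x))^2 / mu(x)."""
    return sum((nu[x] - mu[x]) ** 2 / mu[x] for x in D)

def push(nu):
    """One step of the chain applied to a distribution: nu -> nu P."""
    return [sum(nu[y] * P[y][x] for y in D) for x in D]

# Start at nu = delta_0 and check (2): chi2(nu P^k, pi) <= beta^{2k} chi2(nu, pi).
nu = [1.0, 0.0, 0.0]
chi0 = chi2(nu, pi)             # here chi0 = 3 = ||nu/pi - 1||_inf
dist, decay_ok = nu, True
for k in range(1, 10):
    dist = push(dist)
    decay_ok = decay_ok and chi2(dist, pi) <= beta ** (2 * k) * chi0 + 1e-12
```

For this start distribution the general bound $\chi^2(\nu, \pi) \le \|\nu/\pi - 1\|_\infty$ even holds with equality, which makes the example a convenient stress test.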
It is clear that $l_p = \mathbb{R}^D$, since the state space has finite cardinality.

Remark 1. As we have seen, the $\chi^2$-contrast corresponds to the $l_2$-norm of the function $d_k$. Other tools for measuring the speed of convergence induce similar relations. For instance,

$\|d_k\|_1 = 2\, \|\nu P^k - \pi\|_{\mathrm{tv}}$ and $\|d_k\|_\infty = \left\| \frac{\nu P^k}{\pi} - 1 \right\|_\infty$.

The total variation corresponds to the $l_1$-norm of $d_k$ and the $l_\infty$-norm to the supremum-distance.

Remark 2. The constant $\beta$ plays a crucial role in estimating the speed of convergence of the Markov chain to stationarity. In general it is not easy to handle $\beta_1$ or $\beta$, but there are different auxiliary tools, e.g. the canonical path technique, conductance (see [JS89] and [DS91]), log-Sobolev inequalities and path coupling. For a small survey see [Ran06].

2.3. Norm of the transition matrix. Let us consider $P$ and $S$ as operators acting on $l_p$. Then the functional $S$ maps arbitrary functions to constant functions. Let

$l_p^0 := l_p^0(D, \pi) = \{ g \in l_p : S(g) = 0 \}$ for $2 \le p \le \infty$.

The norm of $P$ as operator on $l_2^0$ and $l_4^0$ is essential in the analysis. We state and show some results which are implied by the Theorem of Riesz-Thorin. For a proof and an introduction we refer to [BS88].

Proposition 1 (Theorem of Riesz-Thorin). Let $1 \le p, q_1, q_2 \le \infty$. Further let $\theta \in (0,1)$ and

$\frac{1}{p} := \frac{1-\theta}{q_1} + \frac{\theta}{q_2}$

and

$T \colon l_{q_1} \to l_{q_1}$ with $\|T\|_{l_{q_1} \to l_{q_1}} \le M_1$,
$T \colon l_{q_2} \to l_{q_2}$ with $\|T\|_{l_{q_2} \to l_{q_2}} \le M_2$.

Then

$\|T\|_{l_p \to l_p} \le 2\, M_1^{1-\theta} M_2^{\theta}$.

Note that the factor two in the last inequality comes from the fact that we consider real valued functions $f$. In the following we show a relation between $P$, $P - S$ and $\beta$.

Lemma 2. Let $P$ be a reversible transition matrix with respect to $\pi$ and $n \in \mathbb{N}$. Then

(4) $\|P^n - S\|_{l_2 \to l_2} = \|P^n\|_{l_2^0 \to l_2^0} = \beta^n$.

Furthermore, if $2 \le p \le \infty$ then

(5) $\|P^n\|_{l_p^0 \to l_p^0} \le \|P^n - S\|_{l_p \to l_p} \le 2$.
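The identity (4) can be probed numerically before proving it: the norm $\beta^n$ is attained at the eigenfunction $u_1$ and is never exceeded. The toy chain and its hand-computed eigenfunction below are illustrative assumptions, not from the paper.

```python
import math, random

# Toy reversible chain with hand-computed spectrum: eigenvalues 1, 1/2, 0,
# so beta = 1/2; beta_1 = 1/2 has eigenfunction u_1 = sqrt(2) * (1, 0, -1).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]
beta = 0.5
D = range(3)

def S(f):                       # S(f) = sum_x f(x) pi(x)
    return sum(f[x] * pi[x] for x in D)

def apply_Pn_minus_S(f, n):
    """(P^n - S) f via n matrix-vector products."""
    g, s = list(f), S(f)
    for _ in range(n):
        g = [sum(P[x][y] * g[y] for y in D) for x in D]
    return [g[x] - s for x in D]

def norm2(f):                   # ||f||_2 = <f, f>_pi^{1/2}
    return math.sqrt(sum(f[x] ** 2 * pi[x] for x in D))

u1 = [math.sqrt(2.0), 0.0, -math.sqrt(2.0)]     # normalized: ||u1||_2 = 1
# (4): ||(P^n - S) u_1||_2 = beta^n, since (P^n - S) u_1 = beta_1^n u_1 ...
attained = {n: norm2(apply_Pn_minus_S(u1, n)) for n in (1, 2, 3)}
# ... and no (random) f exceeds the bound beta^n ||f||_2.
rng = random.Random(1)
never_exceeded = all(
    norm2(apply_Pn_minus_S(f, n)) <= beta ** n * norm2(f) + 1e-12
    for n in (1, 2, 3)
    for f in ([rng.uniform(-1, 1) for _ in D] for _ in range(50)))
```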
Proof. The self-adjointness of $P$ implies $\|P\|_{l_2^0 \to l_2^0} = \max\{\beta_1, |\beta_{|D|-1}|\} = \beta$, such that $\|P^n\|_{l_2^0 \to l_2^0} = \beta^n$. By

$\|P^n - S\|_{l_2 \to l_2} = \sup_{\|f\|_2 \le 1} \|(P^n - S) f\|_2 = \sup_{\|f\|_2 \le 1} \|P^n(f - S(f))\|_2 \le \sup_{\|f\|_2 \le 1}\, \sup_{\|g\|_2 \le 1,\, S(g) = 0} \|P^n g\|_2 = \|P^n\|_{l_2^0 \to l_2^0}$

and

$\|P^n\|_{l_p^0 \to l_p^0} = \sup_{\|g\|_p \le 1,\, S(g) = 0} \|P^n g\|_p = \sup_{\|g\|_p \le 1,\, S(g) = 0} \|P^n g - S(g)\|_p \le \sup_{\|f\|_p \le 1} \|(P^n - S) f\|_p = \|P^n - S\|_{l_p \to l_p}$,

claim (4) and the first part of (5) are shown. Finally, by applying the triangle inequality of the norm,

$\|P^n - S\|_{l_p \to l_p} = \sup_{\|f\|_p \le 1} \|P^n f - S f\|_p \le \|P^n\|_{l_p \to l_p} + \|S\|_{l_p \to l_p} = 2$. □

The next statement adds the result about the matrix norm which is used in the proof of the error bound.

Lemma 3. Let $P$ be a reversible transition matrix with respect to $\pi$ and $n \in \mathbb{N}$. Then

(6) $\|P^n\|_{l_4^0 \to l_4^0} \le 2\sqrt{2}\, \beta^{n/2}$.

Proof. By Lemma 2 we have $\|P^n - S\|_{l_2 \to l_2} = \beta^n$ and $\|P^n - S\|_{l_\infty \to l_\infty} \le 2$. Then the result is an application of Proposition 1, where $T = P^n - S$ and $q_1 = 2$, $q_2 = \infty$, $p = 4$, thus $\theta = \frac{1}{2}$. □

3. Error bounds

In this section we mainly follow two steps to develop the error bound. At first a special case of the method $S_{n,n_0}$ is considered. The initial distribution is the stationary one, thus it is not necessary to do a burn-in, i.e. $n_0 = 0$. Secondly we relate the result of the first step to the general case where the chain is initialized by a distribution $\nu$. The techniques which we will use are similar as in [Rud09].

3.1. Starting from stationarity. This is also called starting in equilibrium, i.e. the distribution of the Markov chain does not change, it is already balanced. In the following we will always denote $S_{n,0}$ as $S_n$. Let us start with stating and discussing a result from [BD06, Prop. 2.1 p.3].

Proposition 4. Let $f \in \mathbb{R}^D$. Let $X_1, \dots, X_n$ be a reversible Markov chain with respect to $\pi$, given by $(P, \pi)$.
Then

(7) $e_\pi(S_n, f)^2 = \frac{1}{n^2} \sum_{k=1}^{|D|-1} |a_k|^2\, W(n, \beta_k)$,

where

$a_k = \langle f, u_k \rangle_\pi$ and $W(n, \beta_k) := \frac{n(1-\beta_k^2) - 2\beta_k(1-\beta_k^n)}{(1-\beta_k)^2}$.

Proof. Let us consider $g := f - S(f) \in \mathbb{R}^D$. Because of the orthogonal basis the presentation $g(x) = \sum_{k=1}^{|D|-1} a_k u_k(x)$ is given. The error obeys

$e(S_n, f)^2 = \mathbb{E}_{\pi,P} \left| \frac{1}{n} \sum_{j=1}^n g(X_j) \right|^2 = \frac{1}{n^2}\, \mathbb{E}_{\pi,P} \left| \sum_{j=1}^n g(X_j) \right|^2 = \frac{1}{n^2} \sum_{j=1}^n \mathbb{E}_{\pi,P}\, g(X_j)^2 + \frac{2}{n^2} \sum_{j=1}^{n-1} \sum_{i=j+1}^{n} \mathbb{E}_{\pi,P}\, g(X_j) g(X_i)$.

For $i \le j$,

$\mathbb{E}_{\pi,P}\, g(X_i) g(X_j) = \sum_{k=1}^{|D|-1} \sum_{l=1}^{|D|-1} a_k a_l\, \mathbb{E}_{\pi,P}\, u_k(X_i) u_l(X_j) \overset{(1)}{=} \sum_{k=1}^{|D|-1} \sum_{l=1}^{|D|-1} a_k a_l \left\langle u_k, P^{j-i} u_l \right\rangle_\pi = \sum_{k=1}^{|D|-1} \sum_{l=1}^{|D|-1} a_k a_l\, \beta_l^{j-i} \langle u_k, u_l \rangle_\pi = \sum_{k=1}^{|D|-1} a_k^2\, \beta_k^{j-i}$,

where the equality of the second line is due to the fact that the initial step is chosen from the stationary distribution. The last two equalities follow from the orthonormality of the basis of the eigenvectors. Altogether we have

$e(S_n, f)^2 = \frac{1}{n^2} \sum_{k=1}^{|D|-1} a_k^2 \left[ n + 2 \sum_{j=1}^{n-1} \sum_{i=j+1}^{n} \beta_k^{i-j} \right] = \frac{1}{n^2} \sum_{k=1}^{|D|-1} a_k^2 \left[ n + 2\, \frac{(n-1)\beta_k - n\beta_k^2 + \beta_k^{n+1}}{(1-\beta_k)^2} \right] = \frac{1}{n^2} \sum_{k=1}^{|D|-1} |a_k|^2\, W(n, \beta_k)$. □

Let us consider $W(n, \beta_k)$ to simplify and interpret Proposition 4.

Lemma 5. For all $n \in \mathbb{N}$ and $k \in \{1, \dots, |D|-1\}$ we have

(8) $W(n, \beta_k) \le W(n, \beta_1) \le \frac{2n}{1-\beta_1}$.

Proof. Let $x \in [-1, 1)$; then we are going to show that $W(n, x)$ is monotone increasing, i.e. $W(n, \beta_k) \le W(n, \beta_1)$. For $i \in \{0, \dots, n-1\}$ it is true that

$x^{n-i} \le 1 \iff (1 - x^i)\, x^{n-i} \le 1 - x^i \iff x^{n-i} + x^i \le 1 + x^n$.

Therefore

$x^i + x^{i+1} + x^{n-i-1} + x^{n-i} \le 2(1 + x^n)$,

and

$(1+x) \sum_{i=0}^{n-1} x^i = \frac{1}{2} \sum_{i=0}^{n-1} \left[ x^i + x^{i+1} + x^{n-i-1} + x^{n-i} \right] \le n(1 + x^n)$.

Now

$\frac{dW}{dx}(n, x) = 2\, \frac{n(1+x^n) - (1+x) \sum_{i=0}^{n-1} x^i}{(1-x)^2} \ge 0$

and the first inequality is shown.
By

$W(n, x) \le \begin{cases} \dfrac{n(1+x) - 2x^n}{1-x} & x \in [-1, 0] \\[1ex] \dfrac{n(1+x)}{1-x} & x \in (0, 1) \end{cases} \;\le\; \frac{2n}{1-x}$

the claim is proven. □

An explicit formula of the error if the initial state is chosen by the stationary distribution is established. Let us discuss the worst case error of $S_n$.

Proposition 6. Let $X_1, \dots, X_n$ be a reversible Markov chain with respect to $\pi$, given by $(P, \pi)$. Then

(9) $\sup_{\|f\|_2 \le 1} e(S_n, f)^2 = \frac{1+\beta_1}{n(1-\beta_1)} - \frac{2\beta_1(1-\beta_1^n)}{n^2(1-\beta_1)^2} \le \frac{2}{n(1-\beta_1)}$.

Proof. The individual error of $f$ is

$e(S_n, f)^2 \overset{(7)}{=} \frac{1}{n^2} \sum_{k=1}^{|D|-1} |a_k|^2\, W(n, \beta_k) \le \frac{\|f\|_2^2}{n^2} \max_{k=1,\dots,|D|-1} W(n, \beta_k) \overset{(8)}{=} \frac{\|f\|_2^2}{n^2}\, W(n, \beta_1) = \frac{1+\beta_1}{n(1-\beta_1)} \|f\|_2^2 - \frac{2\beta_1(1-\beta_1^n)}{n^2(1-\beta_1)^2} \|f\|_2^2$,

where $a_k$ is chosen as in Proposition 4 and therefore $\sum_{k=1}^{|D|-1} |a_k|^2 \le \|f\|_2^2$. From the preceding analysis of the individual error we have an upper error bound. Now we consider $f = u_1$, where obviously $\|u_1\|_2 = 1$, and get by applying (7) that

$e(S_n, u_1)^2 = \frac{1+\beta_1}{n(1-\beta_1)} - \frac{2\beta_1(1-\beta_1^n)}{n^2(1-\beta_1)^2}$.

Thus the error bound is attained for $u_1$ and by (8) everything is shown. □

Finally an explicit presentation for the worst case error on the class of bounded functions with respect to $\|\cdot\|_2$ is shown. Notice that (9) is an equality, which means that the integration error is completely known if we start with the stationary distribution. In some artificial cases this method even beats direct simulation, e.g. if all eigenvalues are smaller than zero or if one specific $\beta_i < 0$ and the goal is to approximate $S(u_i)$. In [FHY92, Remark 3, p.617] the authors state a simple transition matrix where $\beta_i = -\frac{1}{|D|-1}$ for all $i$. Now one could think to construct a transition matrix where $\beta_1$ is close to $-1$ and therefore damp the integration error. But it is well known that this is not possible for large $|D|$, since $\beta_1 \ge -\frac{1}{|D|-1}$.
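Both the exact formula (7) and the worst-case identity (9) can be verified numerically on a chain whose spectrum is known. The toy chain and its hand-computed eigenpairs below are illustrative assumptions: $\beta_1 = 1/2$ with $u_1 = \sqrt{2}\,(1, 0, -1)$ and $\beta_2 = 0$ with $u_2 = (1, -1, 1)$.

```python
import math

# Toy reversible chain with hand-computed spectrum (illustrative example).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]
betas = [0.5, 0.0]
us = [[math.sqrt(2.0), 0.0, -math.sqrt(2.0)], [1.0, -1.0, 1.0]]
D = range(3)

def inner(f, g):                # <f, g>_pi
    return sum(f[x] * g[x] * pi[x] for x in D)

def W(n, b):
    return (n * (1 - b * b) - 2 * b * (1 - b ** n)) / (1 - b) ** 2

def error_spectral(f, n):
    """Right-hand side of (7): (1/n^2) sum_k |a_k|^2 W(n, beta_k)."""
    return sum(inner(f, u) ** 2 * W(n, b) for u, b in zip(us, betas)) / n ** 2

def error_direct(f, n):
    """e_pi(S_n, f)^2 from stationary covariances E_pi g(X_i)g(X_j) = <g, P^|i-j| g>_pi."""
    s = inner(f, [1.0, 1.0, 1.0])       # S(f)
    g = [f[x] - s for x in D]
    total, h = 0.0, list(g)             # h runs through P^m g
    for m in range(n):                  # m = |i - j| occurs n (m=0) or 2(n-m) times
        total += (n if m == 0 else 2 * (n - m)) * inner(g, h)
        h = [sum(P[x][y] * h[y] for y in D) for x in D]
    return total / n ** 2

def worst_case(n, beta1=0.5):
    """Right-hand side of (9): the exact worst case over ||f||_2 <= 1."""
    return (1 + beta1) / (n * (1 - beta1)) \
           - 2 * beta1 * (1 - beta1 ** n) / (n * (1 - beta1)) ** 2
```

For every $n$, `error_spectral` and `error_direct` agree to machine precision, and `error_spectral(us[0], n)` reproduces `worst_case(n)`, i.e. the supremum in (9) is attained at $u_1$.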
In the next subsection we link the results to a more general framework, where the unrealistic assumption that the initial distribution is the stationary one is abandoned.

3.2. Starting from somewhere else. In the next statement a relation between the error of starting by $\pi$ and the error of starting not by the invariant distribution is established.

Proposition 7. Let $f \in \mathbb{R}^D$ and $g := f - S(f)$. Let $X_1, \dots, X_{n+n_0}$ be a reversible Markov chain with respect to $\pi$, given by $(P, \nu)$. Then

(10) $e_\nu(S_{n,n_0}, f)^2 = e_\pi(S_n, f)^2 + \frac{1}{n^2} \sum_{j=1}^{n} L_{j+n_0}(g^2) + \frac{2}{n^2} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} L_{j+n_0}(g\, P^{k-j} g)$,

where

$L_i(h) = \sum_{x \in D} d_i(x)\, h(x)\, \pi(x) = \sum_{x \in D} \sum_{y \in D} \frac{\nu(y)}{\pi(y)} \left( p^i(x,y) - \pi(y) \right) h(x)\, \pi(x)$.

Remark 3. The proof of this identity is similar as in [Rud09], except for the fact that we study a discrete state space and therefore integrals become sums.
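The decomposition (10) can be checked numerically on a small chain. The sketch below uses the illustrative toy chain from the earlier examples; the indexing follows the convention of formula (1), under which the law of $X_i$ is $\nu P^i$, and $n$, $n_0$ and $f$ are arbitrary test choices.

```python
# Numerical check of (10) on a toy reversible chain (illustrative example).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = [0.25, 0.5, 0.25]
nu = [1.0, 0.0, 0.0]
D = range(3)

def Pk(h, k):                   # P^k h (action on functions)
    for _ in range(k):
        h = [sum(P[x][y] * h[y] for y in D) for x in D]
    return h

def pushk(mu, k):               # mu P^k (action on distributions)
    for _ in range(k):
        mu = [sum(mu[y] * P[y][x] for y in D) for x in D]
    return mu

def mean(mu, h):                # expectation of h under mu
    return sum(mu[x] * h[x] for x in D)

def matmul(A, B):
    return [[sum(A[x][z] * B[z][y] for z in D) for y in D] for x in D]

def L(i, h):
    """L_i(h) = sum_x d_i(x) h(x) pi(x), with d_i as defined after (3)."""
    Ppow = [[1.0 if x == y else 0.0 for y in D] for x in D]
    for _ in range(i):
        Ppow = matmul(Ppow, P)
    d_i = [sum(nu[y] / pi[y] * (Ppow[x][y] - pi[y]) for y in D) for x in D]
    return sum(d_i[x] * h[x] * pi[x] for x in D)

f = [0.3, -1.2, 2.0]            # arbitrary test function
s = mean(pi, f)
g = [f[x] - s for x in D]
n, n0 = 4, 2

def second_moment(mu0):
    """(1/n^2) sum_{j,k} E[g(X_{j+n0}) g(X_{k+n0})] for a chain started at mu0."""
    total = 0.0
    for j in range(1, n + 1):
        law_j = pushk(mu0, j + n0)      # law of X_{j+n0}
        for k in range(j, n + 1):
            Pg = Pk(g, k - j)
            total += (1 if j == k else 2) * mean(law_j, [g[x] * Pg[x] for x in D])
    return total / n ** 2

lhs = second_moment(nu)         # e_nu(S_{n,n0}, f)^2
rhs = second_moment(pi)         # e_pi(S_n, f)^2, plus the two correction sums:
for j in range(1, n + 1):
    rhs += L(j + n0, [v * v for v in g]) / n ** 2
for j in range(1, n):
    for k in range(j + 1, n + 1):
        Pg = Pk(g, k - j)
        rhs += 2.0 * L(j + n0, [g[x] * Pg[x] for x in D]) / n ** 2
```

Both sides agree to machine precision, which also illustrates why the correction terms vanish as $n_0$ grows: each $L_i(h)$ is an integral against $d_i$, whose norm decays like $\beta^i$ by (3).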
