SÉMINAIRE DE PROBABILITÉS (STRASBOURG)

OLIVIER CATONI
Simulated annealing algorithms and Markov chains with rare transitions
Séminaire de probabilités (Strasbourg), tome 33 (1999), p. 69-119
<http://www.numdam.org/item?id=SPS_1999__33__69_0>
© Springer-Verlag, Berlin Heidelberg New York, 1999, tous droits réservés.
Article numérisé dans le cadre du programme Numérisation de documents anciens mathématiques http://www.numdam.org/

SIMULATED ANNEALING ALGORITHMS AND MARKOV CHAINS WITH RARE TRANSITIONS

OLIVIER CATONI

ABSTRACT. In these notes, written for a D.E.A. course at University Paris XI during the first term of 1995, we prove the essentials about stochastic optimisation algorithms based on Markov chains with rare transitions, under the weak assumption that the transition matrix obeys a large deviation principle. We present a new simplified line of proofs based on the Freidlin and Wentzell graphical approach. The case of Markov chains with a periodic behaviour at null temperature is considered. We have also included some pages about the spectral gap approach, where we follow Diaconis and Stroock [13] and Ingrassia [23] in a more conventional way, except for the application to non reversible Metropolis algorithms (subsection 6.2.2), where we present an original result.

ALGORITHMES DE RECUIT SIMULÉ ET CHAÎNES DE MARKOV À TRANSITIONS RARES : Dans ces notes, tirées d'un cours de D.E.A.
donné au premier trimestre 1995, nous établissons les bases de la théorie des algorithmes d'optimisation stochastiques fondés sur des chaînes de Markov à transitions rares, sous l'hypothèse faible selon laquelle la matrice des transitions vérifie un principe de grandes déviations. Nous présentons un nouvel ensemble de preuves originales fondées sur l'approche graphique de Freidlin et Wentzell. Le cas des chaînes présentant un comportement périodique à température nulle est traité. De plus nous avons aussi inclus quelques pages sur les méthodes de trou spectral, dans lesquelles nous suivons Diaconis et Stroock [13] et Ingrassia [23] d'une façon plus conventionnelle, si ce n'est pour l'application aux algorithmes de Metropolis non réversibles de la section 6.2.2, qui est originale.

Date: May 1995, English translation January 1997, in revised form November 1998.

INTRODUCTION

These lecture notes were written on the occasion of a course of lectures which took place from January to April 1995. We seized the opportunity of the present English translation to add some proofs which were left to the reader and to correct some misprints and omissions. Sections 4.1, 4.2 and 4.3 contain standard material from [13] and [23]. The rest is more freely inspired by the existing literature. The presentation of the cycle decomposition is new, as well as lemma 1.

We chose to make weak large deviation assumptions on the transition matrix $p_\beta$ at inverse temperature $\beta$, and to give results which are accordingly concerned only with equivalents for the logarithm of the probability of some events of interest. In the study of simulated annealing, we considered piecewise constant temperature sequences, in order to avoid introducing specifically non-homogeneous techniques. Our aim was to give tools to study a wide variety of stochastic optimisation algorithms with discrete time and finite state space.
For related results directed towards applications to statistical mechanics, we refer to [8].

1. EXAMPLES OF HOMOGENEOUS MARKOV CHAINS

We are going to study in this section homogeneous Markov chains related to stochastic optimisation algorithms.

1.1. The Metropolis Algorithm. This algorithm can be applied to any finite state space $E$ on which an energy function $U : E \to \mathbb{R}$ is defined ($U$ can be any arbitrary real valued function). Its purpose can be either:
• to simulate the equilibrium distribution of a system from statistical mechanics with state space $E$ and energy $U$ interacting with a heat bath at temperature $T$,
• or to find a state $x \in E$ for which $U(x)$ is close to $\min_{y \in E} U(y)$.

We will mainly be interested in the second application in these notes.

Description of the algorithm: Let us consider a Markov matrix $q : E \times E \to [0,1]$ which is irreducible and reversible with respect to its invariant measure. In other words let us assume that
• $\displaystyle \sum_{y \in E} q(x, y) = 1$, $x \in E$,
• $\displaystyle \sup_m q^m(x, y) > 0$, $x, y \in E$ (this last equation means that there is a path $x_0 = x, x_1, \dots, x_m = y$ leading from $x$ to $y$ such that $q(x_i, x_{i+1}) > 0$, $i = 0, \dots, m - 1$),
• the invariant probability distribution $\mu$ of $q$ (which is unique under the preceding assumptions) is such that $\mu(x) q(x, y) = \mu(y) q(y, x)$, $x, y \in E$.

Let us consider also an inverse temperature $\beta \geq 0$, $\beta \in \mathbb{R}$. To this temperature corresponds the Gibbs distribution $G(E, U, \beta)$, defined by
\[ G(E, U, \beta)(x) = \frac{\exp(-\beta U(x))}{Z}, \]
where $Z$ (the "partition function") is
\[ Z = \sum_{x \in E} \exp(-\beta U(x)). \]
The distribution $G(E, U, \beta)$ describes the thermal equilibrium of the thermodynamic system $(E, U, \beta)$. We then define the transition matrix at inverse temperature $\beta$. This is the Markov matrix $p_\beta : E \times E \to [0,1]$ defined by
\[ p_\beta(x, y) = q(x, y) \exp(-\beta (U(y) - U(x))^+), \quad x \neq y \in E, \]
where $r^+ = \max\{0, r\}$ (the diagonal terms $p_\beta(x, x)$ being fixed by the condition that each row sums to one).

Proposition 1.1. The matrix $p_\beta$ is irreducible. It is aperiodic as soon as $U$ is not constant, and therefore
\[ \lim_{n \to +\infty} (\mu_\beta - \nu) \, p_\beta^n = 0, \quad \nu \in \mathcal{M}^1_+(E), \]
where $\mathcal{M}^1_+(E)$ is the set of probability measures on $E$.
Moreover $p_\beta$ is reversible with respect to $\mu_\beta = G(E, U, \beta)$.

Proof. It is irreducible because $p_\beta(x, y) > 0$ as soon as $q(x, y) > 0$. If $U$ is not constant there are $x, y \in E$ such that $q(x, y) > 0$ and $U(x) < U(y)$, which implies that $p_\beta(x, x) > 0$ and therefore that $p_\beta$ is aperiodic. Moreover
\[ \mu_\beta(x) p_\beta(x, y) = \frac{1}{Z} q(x, y) \exp(-\beta (U(x) \vee U(y))) = \mu_\beta(y) p_\beta(y, x), \quad x, y \in E, \; x \neq y. \qquad \Box \]

1.1.1. Construction of the Metropolis algorithm. On the canonical space $(E^{\mathbb{N}}, \mathcal{B})$, where $\mathcal{B}$ is the sigma field generated by the events depending on a finite number of coordinates, we consider the canonical process $(X_n)_{n \in \mathbb{N}}$ defined by $X_n(\omega) = \omega_n$, $\omega \in E^{\mathbb{N}}$, and the family of probability distributions $(P^x_\beta)_{x \in E}$ on $(E^{\mathbb{N}}, \mathcal{B})$ defined by
\[ P^x_\beta(X_0 = x) = 1, \qquad P^x_\beta(X_n = y \mid (X_0, \dots, X_{n-1}) = (x_0, \dots, x_{n-1})) = p_\beta(x_{n-1}, y). \]
The homogeneous Markov chain $(E^{\mathbb{N}}, (X_n)_{n \in \mathbb{N}}, \mathcal{B}, (P^x_\beta)_{x \in E})$ is the canonical realization of the Metropolis algorithm with state space $E$, Markov matrix $q$, energy function $U$ and inverse temperature $\beta$. We will use the notation $M(E, q, U, \beta)$.

1.1.2. Computer implementation. Assuming that $X_{n-1} = x \in E$,
• choose a state $y$ according to the distribution $q(x, \cdot)$,
• compute $U(y) - U(x)$,
• if $U(y) \leq U(x)$, put $X_n = y$,
• if $U(y) > U(x)$, put $X_n = y$ with probability $\exp(-\beta (U(y) - U(x)))$ and $X_n = x$ otherwise.

1.1.3. Behaviour at temperature zero ($\beta = +\infty$). Letting $\beta$ tend to $+\infty$ in the definition of $M(E, q, U, \beta)$, we define the infinite inverse temperature algorithm $M(E, q, U, +\infty)$ by
\[ P_{+\infty}(X_n = y \mid X_{n-1} = x) = q(x, y) \, \mathbb{1}(U(y) \leq U(x)), \quad x \neq y \in E. \]
This is a relaxation algorithm: $U(X_n)$ is almost surely non increasing. It is still homogeneous, but no more ergodic in general (if $U$ is not constant on $E$, $E$ has at least one transient component).
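The update rule of subsection 1.1.2 can be turned into a short program. The sketch below is not part of the original notes: the five-point state space, the energy table `U` and the nearest-neighbour proposal are toy assumptions (the proposal is symmetric, hence reversible with respect to the uniform measure).

```python
import math
import random

def metropolis_step(x, beta, energy, propose):
    """One transition of M(E, q, U, beta): propose y ~ q(x, .) and accept
    it with probability exp(-beta * (U(y) - U(x))^+)."""
    y = propose(x)
    delta = energy(y) - energy(x)
    if delta <= 0 or random.random() < math.exp(-beta * delta):
        return y  # accepted
    return x      # rejected: the chain stays at x

# Toy data: E = {0, ..., 4} arranged on a cycle, arbitrary energy values.
U = [3.0, 1.0, 4.0, 0.0, 2.0]

def propose(x):
    # symmetric nearest-neighbour proposal on the cycle
    return (x + random.choice([-1, 1])) % 5

random.seed(0)
x = 0
for _ in range(1000):
    x = metropolis_step(x, beta=5.0, energy=lambda z: U[z], propose=propose)
# at beta = 5 the chain spends most of its time near argmin U = {3}
```

At $\beta = 0$ every proposal is accepted and the chain is simply driven by $q$; as $\beta$ grows, uphill moves become exponentially rare, which is exactly the rare-transition behaviour quantified in section 2.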
When $\beta$ tends to infinity, $M(E, q, U, \beta)$ weakly tends to $M(E, q, U, +\infty)$, in the sense that for any function $f : E^{\mathbb{N}} \to \mathbb{R}$ depending on a finite number of coordinates we have
\[ \lim_{\beta \to +\infty} \int_{E^{\mathbb{N}}} f(y) \, P_\beta(dy) = \int_{E^{\mathbb{N}}} f(y) \, P_{+\infty}(dy). \]
(Note that it implies that the same holds for any continuous function $f$, $E^{\mathbb{N}}$ being equipped with the product topology, because any such function is a uniform limit of functions depending on a finite number of coordinates.) When it is observed during a fixed interval of time, $M(E, q, U, \beta)$ is a small perturbation of $M(E, q, U, +\infty)$ at low temperature.

We can see now that the Metropolis algorithm is suitable for the two purposes we announced at the beginning:
• Simulation of the thermal equilibrium distribution $G(E, U, \beta)$: as $p_\beta$ is irreducible and aperiodic and as $E$ is finite, $\mu^n_\beta = P_\beta \circ X_n^{-1}$ tends to $G(E, U, \beta)$ when $n$ tends to infinity (at exponential rate, as will be seen in the following).
• Minimisation of $U$: the Gibbs distributions $\mu_\beta = G(E, U, \beta)$ get concentrated around $\arg\min U$ when $\beta$ tends to $+\infty$. Indeed, for any $\eta > 0$,
\[ G(E, U, \beta)(U \geq \min U + \eta) \leq \frac{|E|}{Z} \exp(-\beta (\eta + \min U)), \]
and $Z \geq \exp(-\beta \min U)$, therefore we have the following rough estimate
\[ G(E, U, \beta)(U < \min U + \eta) \geq 1 - |E| \exp(-\beta \eta). \]
Taking $\eta = \min\{U(y), y \in E \setminus \arg\min U\} - \min_E U$, we see that, as a consequence,
\[ \lim_{\beta \to +\infty} G(E, U, \beta)(\arg\min U) = 1. \]
Thus

Proposition 1.2. For any $\varepsilon > 0$ there are $N \in \mathbb{N}$ and $\beta \in \mathbb{R}_+$ such that for any $n \geq N$,
\[ P_\beta(U(X_n) = \min U) \geq 1 - \varepsilon. \]

1.2. The Gibbs sampler. This algorithm is meant for a product state space $E = \prod_{i=1}^r F_i$, where the components $F_i$ are finite sets. The purpose is the same as for the Metropolis algorithm (simulate the Gibbs distribution or minimise the energy).

Description: Let us consider
• an energy function $U : E \to \mathbb{R}$, which can in fact be any real valued function,
• an "infinite temperature" probability distribution $\mu \in \mathcal{M}^1_+(E)$,
• an inverse temperature $\beta \in \mathbb{R}_+$,
• the Gibbs distribution
\[ G(E, \mu, U, \beta)(x) = \frac{\mu(x) \exp(-\beta U(x))}{Z}, \qquad Z = \sum_{y \in E} \mu(y) \exp(-\beta U(y)), \]
• a permutation $\sigma \in \mathfrak{S}_r$ of $\{1, \dots, r\}$.

Let us define
• for any $i \in \{1, \dots, r\}$, the transition matrix $p^i_\beta : E \times E \to [0,1]$ at site $i$ and inverse temperature $\beta$,
\[ p^i_\beta(x, y) = \mathbb{1}(y_j = x_j, \, j \neq i) \, \frac{\mu(y) \exp(-\beta U(y))}{\displaystyle \sum_{z \,:\, z_j = x_j, \, j \neq i} \mu(z) \exp(-\beta U(z))}, \]
where we have used the notations $x = (x_j)_{j=1}^r \in \prod_{j=1}^r F_j$ and $\overline{x}_i = (x_j)_{j \neq i}$;
• the global transition matrix at temperature $\beta$,
\[ p_\beta = \prod_{i=1}^r p^{\sigma(i)}_\beta = p^{\sigma(1)}_\beta \cdots p^{\sigma(r)}_\beta, \]
which corresponds to the scan of the sites defined by the permutation $\sigma$.

Properties of $p_\beta$:
• It is a full matrix ($p_\beta(x, y) > 0$, $x, y \in E$), thus it is irreducible and aperiodic.
• The Gibbs distribution $G(E, \mu, U, \beta)$ is $p^i_\beta$ invariant for any $i \in \{1, \dots, r\}$, therefore $G(E, \mu, U, \beta)$ is also the (unique) invariant probability measure of $p_\beta$.

We consider then the Markov chain with canonical realization $(E^{\mathbb{N}}, (X_n)_{n \in \mathbb{N}}, \mathcal{B}, P_\beta)$, where $P_\beta$ is the probability measure on $(E^{\mathbb{N}}, \mathcal{B})$ of the Markov chain defined by $P_\beta(X_0 = x) = \mu_0(x)$ and
\[ P_\beta(X_n = y \mid X_{n-1} = x) = p_\beta(x, y), \quad x, y \in E. \]
The homogeneous Markov chain $(X_n)_{n \in \mathbb{N}}$ is called a Gibbs sampler with state space $E$, energy function $U$, reference measure $\mu$, scan permutation $\sigma$, inverse temperature $\beta$ and initial distribution $\mu_0$. The notation $GS(E, \mu, \sigma, U, \beta, \mu_0)$ will denote this process in the following. Let us describe its computer implementation in more detail.

Computer implementation: Each step of the chain corresponds to one scan of all the sites, in the order defined by $\sigma$. It thus includes $r$ sub-steps. To perform the $i$th sub-step, $i = 1, \dots, r$, if $x$ is the starting configuration, we have to draw at random $f \in F_{\sigma(i)}$ according to the conditional thermal equilibrium distribution at site $\sigma(i)$, knowing that the configuration should coincide with $x$ on the other sites. This computation is easy if
• the number of elements of $F_{\sigma(i)}$ is small,
• the conditional distribution $G(E, \mu, U, \beta)(X^{\sigma(i)} = f \mid X^j = x^j, \, j \neq \sigma(i))$ depends on few coordinates, as is the case for a Markov random field.

The new state at the end of the $i$th sub-step is $y \in E$, given by $y^{\sigma(i)} = f$ and $y^j = x^j$, $j \neq \sigma(i)$.

Behaviour at "zero temperature": Here again $\lim_{\beta \to +\infty} p^i_\beta$ exists, therefore $\lim_{\beta \to +\infty} p_\beta$ exists and defines a Markov chain at temperature zero.
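The scan just described can be sketched in code. This is an illustration, not the author's implementation: the three binary sites and the energy (number of disagreeing neighbouring pairs) are toy assumptions, and the reference measure $\mu$ is taken uniform.

```python
import math
import random

def gibbs_scan(x, beta, energy, spaces, sigma):
    """One step (one full scan, r sub-steps) of a Gibbs sampler on
    E = F_1 x ... x F_r, with uniform reference measure.

    At sub-step i, the coordinate at site sigma[i] is redrawn from the
    conditional Gibbs distribution given all the other coordinates."""
    x = list(x)
    for site in sigma:
        # unnormalised conditional weights exp(-beta U(z)) over the
        # configurations z agreeing with x outside `site`
        weights = []
        for f in spaces[site]:
            z = list(x)
            z[site] = f
            weights.append(math.exp(-beta * energy(z)))
        total = sum(weights)
        x[site] = random.choices(spaces[site], [w / total for w in weights])[0]
    return tuple(x)

# Toy data: r = 3 binary sites, U = number of disagreeing neighbour pairs.
spaces = [[0, 1], [0, 1], [0, 1]]
def U(z):
    return sum(z[i] != z[i + 1] for i in range(len(z) - 1))

random.seed(0)
x = (0, 1, 0)
for _ in range(50):
    x = gibbs_scan(x, beta=4.0, energy=U, spaces=spaces, sigma=[0, 1, 2])
# at beta = 4 the chain is found in a constant configuration most of the time
```

Each sub-step only needs the conditional weights over one coordinate, which is why the method stays cheap when each $F_i$ is small, even though $E$ itself is exponentially large.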
This zero temperature dynamic is a relaxation algorithm: the energy is almost surely non-increasing. It is not in general an ergodic process, and $P_\beta$ converges weakly to $P_{+\infty}$ as in the case of the Metropolis dynamic. Moreover the purposes of simulation of the equilibrium distribution and of minimisation of the energy are fulfilled in the same way, and, as for the Metropolis algorithm, proposition 1.2 holds also for the Gibbs sampler.

2. MARKOV CHAINS WITH RARE TRANSITIONS

2.1. Construction. We are going to put the two previous examples into a more general framework. Let us consider
• an arbitrary finite state space $E$,
• a rate function $V : E \times E \to [0, +\infty]$. Assume that $V$ is irreducible in the sense that the matrix $\exp(-V(x, y))$ is irreducible.
• a family $(E^{\mathbb{N}}, (X_n)_{n \in \mathbb{N}}, \mathcal{B}, P_\beta)$ of homogeneous Markov chains indexed by a real positive parameter $\beta$.

Definition 2.1. The family of homogeneous Markov chains $(P_\beta)$ is said to have rare transitions with rate function $V$ if for any $x, y \in E$
\[ \lim_{\beta \to +\infty} -\frac{\log P_\beta(X_n = y \mid X_{n-1} = x)}{\beta} = V(x, y) \]
(with the convention that $\log 0 = -\infty$).

Remarks about this definition:
• This is a large deviation assumption with speed $\beta$ and rate function $V$ about the transition matrix. We will see that it implies large deviation estimates for the exit time and point from any subdomain of $E$.
• The two examples of algorithms given previously fit into this framework. Indeed the rate function of the Metropolis algorithm $M(E, q, U, \beta, \mu_0)$ is
\[ V(x, y) = \begin{cases} (U(y) - U(x))^+ & \text{if } p_\beta(x, y) > 0 \text{ for } \beta > 0, \\ +\infty & \text{otherwise.} \end{cases} \]
As for the Gibbs sampler $GS(E, \mu, \sigma, U, \beta, \mu_0)$ with $E = \prod_{i=1}^r F_i$, the rate function $V$ is built in the following way. For any $x, y \in E$, any $i \in \{1, \dots, r\}$, let us put
\[ V^i(x, y) = \begin{cases} U(y) - \min\{U(z) : z_j = x_j, \, j \neq i\} & \text{if } y_j = x_j, \, j \neq i, \\ +\infty & \text{otherwise,} \end{cases} \]
and let us consider the path $\gamma = (\gamma_k)_{k=0}^r$ defined by
\[ \gamma_k^{\sigma(i)} = \begin{cases} y^{\sigma(i)} & \text{if } i \leq k, \\ x^{\sigma(i)} & \text{otherwise.} \end{cases} \]
The rate function of the Gibbs sampler is
\[ V(x, y) = \sum_{k=1}^r V^{\sigma(k)}(\gamma_{k-1}, \gamma_k). \]

2.2. Rate function induced by a potential.

Definition 2.2.
We will say that the rate function $V : E \times E \to [0, +\infty]$ is induced by the potential $U : E \to \mathbb{R}$ if for all $x, y \in E$
\[ V(x, y) + U(x) = V(y, x) + U(y), \]
with the convention that $+\infty + r = +\infty$ for any $r \in \mathbb{R}$.

Proposition 2.1. The rate function of the Metropolis algorithm $M(E, q, U, \beta, \mu_0)$ is induced by $U$.

Proof. As $q$ is irreducible, $\mu(x) > 0$ for any $x \in E$. Indeed there is $x_0$ such that $\mu(x_0) > 0$ and there is $n$ such that $q^n(x_0, x) > 0$, therefore
\[ \mu(x) = \sum_{z \in E} \mu(z) q^n(z, x) \geq \mu(x_0) q^n(x_0, x) > 0. \]
Thus $q(x, y) > 0$ if and only if $q(y, x) > 0$, from the $\mu$ reversibility of $q$. Therefore $V(x, y) = +\infty$ if and only if $V(y, x) = +\infty$. In the case when $q(x, y) > 0$, $x \neq y$,
\[ V(x, y) - V(y, x) = (U(y) - U(x))^+ - (U(x) - U(y))^+ = U(y) - U(x). \qquad \Box \]

3. LEMMAS ON IRREDUCIBLE MARKOV CHAINS

Let $E$ be a finite state space, $p : E \times E \to [0, 1]$ an irreducible Markov matrix, $(E^{\mathbb{N}}, (X_n)_{n \in \mathbb{N}}, \mathcal{B}, P)$ an homogeneous Markov chain with transition matrix $p$, $W \subset E$ a given subset of $E$ and $\overline{W} = E \setminus W$ its complement. For any oriented graph $g \subset E \times E$ and any $x \in E$, we write $g(x) = \{y \in E \mid (x, y) \in g\}$ and more generally $g(A) = \bigcup_{y \in A} g(y)$.

Definition 3.1. We let $G(W)$ be the set of oriented graphs $g \subset E \times E$ satisfying
1. For any $x \in E$, $|g(x)| = \mathbb{1}_{\overline{W}}(x)$ (no arrow starts from $W$, exactly one arrow starts from each state outside $W$).
2. For any $x \in E$, $x \notin O_g(x)$, where $O_g(x) = \bigcup_{n=1}^{+\infty} g^n(x)$ is the orbit of $x$ under $g$ ($g$ is without loop).

Equivalently, the second condition can be replaced by: for any $x \in \overline{W}$, $O_g(x) \cap W \neq \emptyset$ (any point in $\overline{W}$ leads to $W$).

Definition 3.2. For any $x \in E$, $y \in W$, we will write
\[ G_{x,y}(W) = \begin{cases} \{g \in G(W) \mid y \in O_g(x)\} & \text{if } x \notin W, \\ G(W) & \text{if } x = y, \\ \emptyset & \text{if } x \in W \setminus \{y\}. \end{cases} \]
Thus $G_{x,y}(W)$ is the set of graphs $g \in G(W)$ linking $x$ to $y$. We will also write
\[ G_{A,B}(W) = \{g \mid \exists x \in A, \, \exists y \in B \text{ such that } g \in G_{x,y}(W)\}. \]

We will give three formulas which express the equilibrium distribution of $p$, the probability distribution of the hitting point of $W$, and the expectation of the corresponding hitting time, as the ratio of two finite sums of positive terms. They have been introduced in the large deviation theory of random dynamical systems by Freidlin and Wentzell [16].
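Definition 3.1 is concrete enough to enumerate by brute force on a small state space. The sketch below is an illustration only (the three-point state space and the choice of $W$ are arbitrary, not from the text); for $W$ a singleton it recovers the classical count of labelled trees oriented towards a single root.

```python
from itertools import product

def w_graphs(E, W):
    """Enumerate G(W): exactly one arrow out of each state outside W,
    no arrow out of W, and no loop (every orbit ends in W)."""
    outside = [x for x in E if x not in W]
    graphs = []
    for targets in product(*[[y for y in E if y != x] for x in outside]):
        g = dict(zip(outside, targets))
        ok = True
        for x in outside:
            seen, z = set(), x
            while z not in W:       # follow the orbit of x
                if z in seen:       # a cycle avoiding W: not allowed
                    ok = False
                    break
                seen.add(z)
                z = g[z]
            if not ok:
                break
        if ok:
            graphs.append(g)
    return graphs

print(len(w_graphs([0, 1, 2], {0})))  # 3: the trees on {0,1,2} oriented towards 0
```

The finiteness of $G(W)$ is what makes the graph formulas below effective: each one is a ratio of two finite sums of products of transition probabilities.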
The idea of using graphs to compute determinants has been known since the nineteenth century and presumably goes back to Kirchhoff [24]. The proofs which we propose are based on a preliminary lemma:

Lemma 3.1. For any $W \subset E$, $W \neq \emptyset$, let $p|_{\overline{W}\,\overline{W}}$ be the matrix $p$ restricted to $\overline{W} \times \overline{W}$:
\[ p|_{\overline{W}\,\overline{W}}(x, y) = \mathbb{1}(x \in \overline{W}) \, \mathbb{1}(y \in \overline{W}) \, p(x, y). \]
Let $T(W)$ be the first hitting time of $W$: $T(W) = \inf\{n \geq 0 \mid X_n \in W\}$. For any $x, y \in \overline{W}$ we have
\[ \sum_{n \in \mathbb{N}} p^n|_{\overline{W}\,\overline{W}}(x, y) = E\Big( \sum_{n=0}^{T(W)-1} \mathbb{1}(X_n = y) \,\Big|\, X_0 = x \Big) = \Big( \sum_{g \in G_{x,y}(W \cup \{y\})} p(g) \Big) \Big( \sum_{g \in G(W)} p(g) \Big)^{-1}, \]
where
\[ p(g) = \prod_{(z, t) \in g} p(z, t). \]

Remark: The fact that $\mathrm{id} - p|_{\overline{W}\,\overline{W}}$ is non singular is a consequence of the fact that $p$ is irreducible ($\lim_n p^n|_{\overline{W}\,\overline{W}} = 0$ and therefore all the eigenvalues of $p|_{\overline{W}\,\overline{W}}$ are of module lower than one).

Lemma 3.2. The (unique) invariant probability distribution of $p$ is given by
\[ \mu(x) = \Big( \sum_{g \in G(\{x\})} p(g) \Big) \Big( \sum_{y \in E} \sum_{g \in G(\{y\})} p(g) \Big)^{-1}, \quad x \in E. \]

Lemma 3.3. The distribution of the first hitting point can be expressed as
\[ P(X_{T(W)} = y \mid X_0 = x) = \Big( \sum_{g \in G_{x,y}(W)} p(g) \Big) \Big( \sum_{g \in G(W)} p(g) \Big)^{-1} \]
for any $W \neq \emptyset$, $x \in \overline{W}$, $y \in W$.

Lemma 3.4. For any $W \neq \emptyset$, any $x \in \overline{W}$,
\[ E(T(W) \mid X_0 = x) = \Big( \sum_{y \in \overline{W}} \sum_{g \in G_{x,y}(W \cup \{y\})} p(g) \Big) \Big( \sum_{g \in G(W)} p(g) \Big)^{-1}. \]

Proof of lemma 3.1. As $p$ is irreducible, for any $W \neq \emptyset$, there is $g \in G(W)$ such that $p(g) > 0$ (the proof of this is left to the reader). Let us write for any $x, y \in \overline{W}$
\[ m(x, y) = \Big( \sum_{g \in G_{x,y}(W \cup \{y\})} p(g) \Big) \Big( \sum_{g \in G(W)} p(g) \Big)^{-1}. \]
We want to check that for any $x, y \in \overline{W}$
\[ (1) \quad \sum_{z \in \overline{W}} (\mathrm{id}(x, z) - p(x, z)) \, m(z, y) = \mathrm{id}(x, y). \]
Using the equality $p(x, x) = 1 - \sum_{z \neq x} p(x, z)$, we can equivalently check that
\[ (2) \quad \sum_{z \neq x} p(x, z) \, m(x, y) = \mathrm{id}(x, y) + \sum_{z \in \overline{W} \setminus \{x\}} p(x, z) \, m(z, y). \]
The left hand side of this equation is equal to
\[ \Big( \sum_{(z, g) \in C_1} p(x, z) \, p(g) \Big) \Big( \sum_{g \in G(W)} p(g) \Big)^{-1}, \]
where $C_1 = \{(z, g) \in (E \setminus \{x\}) \times G(W \cup \{y\}) \mid g \in G_{x,y}(W \cup \{y\})\}$; the right hand side is equal to
\[ \mathrm{id}(x, y) + \Big( \sum_{(z, g) \in C_2} p(x, z) \, p(g) \Big) \Big( \sum_{g \in G(W)} p(g) \Big)^{-1}, \]
where $C_2 = \{(z, g) \in (\overline{W} \setminus \{x\}) \times G(W \cup \{y\}) \mid g \in G_{z,y}(W \cup \{y\})\}$. Let us consider first the case when $x \neq y$. Then we can define a one to one mapping $\varphi : C_1 \to C_2$ by
\[ \varphi(z, g) = \begin{cases} (z, g) & \text{if } g \in G_{z,y}(W \cup \{y\}), \\ (g(x), (g \cup \{(x, z)\}) \setminus \{(x, g(x))\}) & \text{otherwise.} \end{cases} \]
The easiest way to check that $\varphi$ is one to one is to check that
\[ \varphi^{-1}(z, g) = \begin{cases} (z, g) & \text{if } g \in G_{x,y}(W \cup \{y\}), \\ (g(x), (g \cup \{(x, z)\}) \setminus \{(x, g(x))\}) & \text{if } g \notin G_{x,y}(W \cup \{y\}). \end{cases} \]
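Lemma 3.2 lends itself to a direct numerical check. The sketch below is not from the paper: the $3 \times 3$ transition matrix is an arbitrary irreducible toy example, the sets $G(\{x\})$ are enumerated naively following Definition 3.1, and the graph-sum formula is compared with the invariant measure obtained by iterating the chain.

```python
from itertools import product

def w_graphs(E, W):
    """Enumerate G(W): one arrow out of each state outside W, no loop."""
    outside = [x for x in E if x not in W]
    found = []
    for targets in product(*[[y for y in E if y != x] for x in outside]):
        g = dict(zip(outside, targets))
        ok = True
        for x in outside:
            seen, z = set(), x
            while z not in W:      # the orbit of x must reach W
                if z in seen:
                    ok = False
                    break
                seen.add(z)
                z = g[z]
            if not ok:
                break
        if ok:
            found.append(g)
    return found

def weight(g, p):
    """p(g): product of p(z, t) over the arrows (z, t) of g."""
    w = 1.0
    for z, t in g.items():
        w *= p[z][t]
    return w

# Toy irreducible 3-state matrix (entries are arbitrary):
p = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
E = [0, 1, 2]

# Lemma 3.2: mu(x) is proportional to the sum of p(g) over g in G({x}).
num = [sum(weight(g, p) for g in w_graphs(E, {x})) for x in E]
mu = [w / sum(num) for w in num]

# Invariant measure computed by iterating nu p^n from the uniform law.
nu = [1.0 / 3] * 3
for _ in range(300):
    nu = [sum(nu[z] * p[z][y] for z in E) for y in E]

print(mu)  # agrees with nu up to numerical precision
```

The same enumeration gives the numerators and denominators of lemmas 3.3 and 3.4, which is how these formulas are used later: at low temperature each sum is dominated by the graphs of minimal "cost", yielding large deviation estimates for hitting points and hitting times.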
Description: