Simulated Annealing & the Metropolis Algorithm: A Parameter Search Method for Models of Arbitrary Complexity

Michael Herman
MATH 519: Inverse Theory

1 Introduction

In many situations, models designed to simulate complicated physical behavior reach a level of complexity such that many popular inverse methods cannot be used to determine a desirable set of model parameters. The Navier-Stokes equations, for example, are a set of three non-linear equations describing fluid dynamics. Coupled with expressions describing the continuity of mass and conservation of energy, the resulting system of equations is often used to model planetary atmospheres and in practice becomes an intricate set of iterated computer instructions. Such complicated models can appear like a black box to the user, and even to the model author, when the need for parameter tuning arises. Indeed, models are often used to study complex natural phenomena precisely because those phenomena are only partially understood. It is this incomplete knowledge that drives a researcher to investigate, and yet prevents her from determining a set of model parameters that gives the model realistic behavior.

Often the author of a model, or one implementing a known theoretical model with unknown coefficients, will use a heuristic method of determining parameters or coefficients that elicit realistic behavior in a complex model. This task is already very difficult when just a few parameters have wide ranges of uncertainty; for a large number of parameters, it becomes nearly impossible. Recently, a climate scientist likened the act of tuning a climate model's parameters to tuning a bicycle wheel by successively tightening and loosening spokes. Often this process (at the hands of an amateur, at least) will lead to a wheel that, while round or of equal radius, is hardly straight. Part of the problem is simply that there are a lot of spokes, and the net effect of turning one of them on the rest of the wheel is difficult to predict.

The Metropolis algorithm is herein presented as a tool apropos to this type of circumstance. It is a flexible, simple numerical method that can be used to locate an optimal parameter set for a model of arbitrary complexity, based on minimizing a cost function. An interesting augmentation to the Metropolis algorithm, first described by Kirkpatrick et al. (1983), related the minimization process to statistical mechanics. We illustrate the benefits of this slow annealing approach in terms of Kirkpatrick's analogy and give a brief illustration of how it can be applied to atmospheric modeling.

2 The cost function and parameter selection

In a simulated annealing procedure, it is desirable to define a simple object of optimization. This is often called the cost function, and it serves as a distilled form of the greater problem. For instance, in the case of fitting a model to data, the cost function can be the norm of the residual difference between the model and the data. Usually one wants to minimize the norm in order to fit the model as closely to the data as possible without overfitting. In this case, modifications to the parameters that describe the model will, in turn, modify the cost function.

Once the cost function has been determined and there is an efficient way to calculate it for each set of model parameters, the iteration process can begin. Initially, a set of parameters is chosen by some means, either at random or by a guess based on a priori knowledge. The function or forward model is then evaluated using this parameter set, and the cost function is determined from the result. In the iterative procedure, these steps are repeated a large number of times: at each successive step, the current parameter set is modified at random, the function is again evaluated, and the cost function is calculated.
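To make this concrete, the following is a minimal Python sketch of such a cost function and an initial random parameter draw; the forward_model callable and the parameter bounds are hypothetical placeholders, not the models discussed later in this paper.

```python
import numpy as np

def cost(params, data, forward_model):
    """Cost function: the 2-norm of the residual between model output and data."""
    prediction = forward_model(params)        # evaluate the forward model
    return np.linalg.norm(prediction - data)  # Euclidean (2-)norm of the misfit

# Initial parameter set: a random draw within a priori bounds (illustrative values).
rng = np.random.default_rng()
lower = np.array([0.0, 0.1, 1.0])
upper = np.array([1.0, 5.0, 20.0])
params = lower + rng.random(3) * (upper - lower)
```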
The cost function obtained at each step is then compared to the one from the previous step in order to determine whether the modified parameter set is worth keeping. This is the point in the simulated annealing procedure where the Metropolis algorithm appears, and it plays an important role. If the current cost function value is more optimal than the previous one, the latest parameter set is kept as a basis for the next iteration. However, if the cost function is less optimal, the choice of whether to keep or discard the parameter set is based on probability. In this way, the procedure does not seek an optimal state immediately, but rather lets one emerge over many successive trials and errors. Local minima are thus largely avoided in the interest of seeking the ultimate, or global, minimum. Since the current state of the system is compared only to the previous state, this is a Markov chain process, i.e., a sequence of trials in which the outcome of each trial depends only on the outcome of the previous trial [4].
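A minimal sketch of this accept-or-reject step follows, assuming energy (cost) values for the new and previous parameter sets and a temperature T are already in hand; the exponential form of the acceptance probability is made precise in Section 2.2.

```python
import math
import random

def metropolis_accept(e_new, e_old, temperature):
    """Metropolis criterion: always keep an improvement; keep a worse
    state only with probability exp(-dE / T)."""
    delta_e = e_new - e_old
    if delta_e <= 0:          # more optimal: keep unconditionally
        return True
    # Less optimal: keep with a temperature-dependent probability.
    return random.random() < math.exp(-delta_e / temperature)
```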
2.1 Annealing Metals and Statistical Mechanics Analogies

The nomenclature used in simulated annealing procedures often mimics the language used to describe statistical mechanics and the process of annealing a metal, which involves heating the metallic solid to a very high temperature and then slowly cooling it. In this way, the atoms have an opportunity to arrange themselves into the lowest possible energy state and thereby become tightly packed. In this condition, the metal is strong and relatively free of metastable regions. It is these metastable regions, manifested as local minima, that are avoided in the process of simulated annealing. The algorithm thus acts to minimize the energy of the system, which is given by the cost function described above. The system temperature, a tunable parameter, is modified throughout the optimization procedure in order to slowly cool the system and bring it toward the lowest minimum.

It has been shown that simulated annealing is guaranteed to reach a global minimum if certain conditions are met. In general, there are two ways to effect the annealing procedure. One is to keep the temperature constant for each Markov chain, letting the system attain thermal equilibrium, and to lower the temperature between chains. Several such chains are executed in a row, each with a distinct temperature value. This is a kind of homogeneous Markov chain. The other method is to drop the temperature as a continuous function of the number of iterations in a single, long chain. This is an inhomogeneous chain.

In the case of the homogeneous algorithm, among the necessary conditions for convergence to the global minimum is the requirement that, as the number of iterations k goes to infinity, the temperature goes to zero:

$$\lim_{k \to \infty} T_k = 0 \tag{1}$$

If an infinite number of iterations is then made, the algorithm will reach the global minimum. Similarly, if the inhomogeneous algorithm is used and the above limiting value of the temperature is reached, then the solution will converge to the global minimum if the temperature drops no faster than the inverse log of the number of iterations:

$$T_k \propto \frac{1}{\log k} \tag{2}$$

Since an infinite number of iterations is not possible in practice, a compromise must be made. In particular, if we mix the two algorithms, we can approach a minimum that is a good approximation of the global minimum. Although the global minimum is no longer a certain outcome, its approximation is likely, and thus the method has practical benefit [4].
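The two variants might be sketched as follows; `metropolis_step`, which proposes and accepts or rejects one random modification, is a placeholder, and the geometric cooling between homogeneous chains anticipates equation (6) below.

```python
import math

def homogeneous_anneal(state, T, chain_length, n_chains, alpha, metropolis_step):
    """Fixed-temperature Markov chains; the temperature drops only between chains."""
    for _ in range(n_chains):
        for _ in range(chain_length):   # let the chain near equilibrium at this T
            state = metropolis_step(state, T)
        T *= alpha                      # cool between chains (equation (6))
    return state

def inhomogeneous_anneal(state, c, n_steps, metropolis_step):
    """One long chain; T falls as the inverse log of the iteration count (equation (2))."""
    for k in range(1, n_steps + 1):
        T = c / math.log(k + 1)         # T_k proportional to 1 / log k
        state = metropolis_step(state, T)
    return state
```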
2.2 Probabilistic Decision Making

In statistical mechanics, a given energy state of a system in thermal equilibrium exists with the probability given by the normalized Boltzmann factor,

$$P(E_i) = \frac{1}{z(T)} \, e^{-E_i/k_B T} \tag{3}$$

where z(T) is the normalization constant, called the partition function, and k_B is the Boltzmann constant. This expression illustrates the behavior of the energy states of a system as a function of the temperature T. As the temperature drops, the probability of a given energy state drops, too; for higher energies, the probability is particularly low at any given temperature. At very high temperatures, molecules can access any energy state, i.e., some will be moving very fast while others can be nearly motionless. At lower temperatures, however, the highest energy states are no longer accessible, and the mean and most probable energy states are necessarily the lower energy values [7].

This behavior is analogously borne out in the implementation of the annealing algorithm. At each iteration, the current energy value is used to determine whether the current parameter set is retained for further iterations. For instance, if the energy is expressed by the 2-norm of the residuals between the current model and a data set, then when the model fit improves, the current 2-norm or energy is smaller than the previous value and the parameters are kept. A worse fit, however, gives a positive change in energy. In this case, the probability of keeping the parameter set is given by [3]:

$$P(\Delta E_i) = \frac{\pi(m_i)}{\pi(m_{i-1})} = \frac{e^{-E_i/k_B T}}{e^{-E_{i-1}/k_B T}} = e^{-\Delta E/k_B T} \tag{4}$$

If the change in energy is large, the probability is low that the parameters will be kept, while for a small change in energy they might be accepted. In this way, the algorithm is allowed to climb out of local minima, increasing the chance that the global minimum will be found. The actual probabilistic choice is made by comparing the value given by (4) to a random value from the uniform distribution: if the random value is smaller, the parameters are kept. As time goes on, however, and the system temperature drops, the probability of keeping the state approaches zero, even for small changes in energy.

The probability given by (4) is actually a ratio of the probabilities for the initial and final energy states. Its meaning is as follows: if P(ΔE) = 0.1, then 10% of the particles in the initial energy state will be in the final state [7]. In terms of parameter space, there is a 10% probability that this parameter set is accessible, given the range of parameters allowed at this temperature.

2.3 Selecting Annealing Parameters

In order to approach a suitable minimum, the algorithm needs some tuning to work properly. In particular, four parameters are needed to completely define the algorithm: 1) T_0, the initial value of the temperature; 2) T_f, the final value; 3) L_k, the length of the homogeneous Markov chains; and 4) a rule dictating how the temperature is lowered. The last two parameters are interdependent, since they determine the length of, and the temperature difference between, successive homogeneous chains. In order to abide by the requirement that the system be approximately in quasi-equilibrium by the end of a single Markov chain, increasing the magnitude of the temperature drop requires longer Markov chains. Laarhoven and Aarts give a few options for determining these parameters in their book on the subject, based on extensive experiments in the field of applied mathematics [4].

The initial temperature T_0 needs to be high enough that the system allows most possible energy states to occur. One way to find this temperature is to arbitrarily choose a value and then iterate over the algorithm, counting the number of accepted parameter sets. If the ratio of accepted to attempted parameter sets is less than, say, 0.8, the initial temperature is doubled until the ratio exceeds the set value. Another method involves solving the following equation for T_0:

$$\chi_0 = e^{-\overline{\Delta E}^{+}/T_0} \tag{5}$$

where χ_0 ≈ 0.9 and ΔE⁺ is the average positive change in energy for a series of random transitions. Using a value of χ_0 close to one ensures a high acceptance rate for the initial parameter sets.

The final value of the temperature T_f can be based on assuming a fixed number of different temperatures. For instance, one might choose to change the temperature only 60 times over the course of the routine. This value can be tied to the number of variables in the model, or some other measure of the size of the model. Alternatively, the algorithm can be halted at some T_f if the parameter configuration remains constant for several consecutive values of the temperature.

To determine the length of the Markov chains L_k, one can simply require a fixed minimum number of accepted configurations. This becomes problematic for small values of T_k, however, where this minimum value is not likely to be reached. In this case, some upper bound on chain length can be set, which simply stops the iteration when a large number of attempts have been made. Lastly, the change in temperature, the annealing schedule, can be determined using the following simple formula:

$$T_{k+1} = \alpha T_k \tag{6}$$

where α is some number close to one. The authors relate experiments wherein values in the range 0.5 to 0.99 were used. Values closer to one effect small changes in temperature and thereby help to attain quasi-equilibrium within each chain. Far more detailed and thoughtful methods have been proposed and used; however, the authors point out that a wide range of tuning parameters often leads to the same desirable minima [4].
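Both recipes for T_0 are easy to mechanize, as in the sketch below; `run_chain` is a hypothetical routine that runs a short chain at a trial temperature and reports its acceptance counts. Note that solving (5) with χ_0 = 0.9 and an average positive energy change of 1e-4 (the values used later in Section 3) gives T_0 = 1e-4 / ln(1/0.9) ≈ 0.001.

```python
import math

def t0_from_chi0(mean_positive_delta_e, chi0=0.9):
    """Solve equation (5), chi0 = exp(-dE+/T0), for T0."""
    return -mean_positive_delta_e / math.log(chi0)

def t0_by_doubling(run_chain, t0, target_ratio=0.8):
    """Double T0 until the accepted/attempted ratio exceeds the target."""
    while True:
        accepted, attempted = run_chain(t0)   # placeholder: returns the two counts
        if accepted / attempted > target_ratio:
            return t0
        t0 *= 2.0

print(t0_from_chi0(1e-4))   # ~0.00095, i.e., approximately 0.001
```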
3 A Research Application

One type of model that doesn't easily lend itself to a parameter search is the atmospheric model. A typical model integrates the non-linear Navier-Stokes equations while balancing conservation of energy and continuity of mass equations. Such models have lengthy integration times and often include parameterized effects to avoid the explicit modeling of computationally expensive processes such as the phase changes of water, moisture advection and other convective mechanisms. The parameterizations can be quite complicated and are highly sensitive to initialization parameters, many of which are not known empirically. It is therefore necessary to effect a parameter search for such models in order to derive realistic behavior from the model.

In this paper, such a parameter search is achieved using a simulated annealing algorithm. Initially, algorithm tuning parameters suggested by Kirkpatrick et al. were used [3]. Later, these were altered to suit the problem, based on the above-mentioned techniques of Laarhoven and Aarts. In this parameter fitting exercise, the primary interest is in determining a parameter set that drives the model's moisture profile to conform to another model of similar utility but of differing implementation. The relevant models are represented by their respective moisture profiles in Figure 1.

[Figure 1: Comparison of moisture profiles for the control and original trial models. The moisture quantity is a mass ratio of water vapor and droplets to dry air at every model level through the depth of the modeled atmosphere.]

3.1 Implementation

The trial model is implemented in the C language as two thousand lines of code. It takes 30 seconds to run the model to an equilibrium state whose moisture profile can be compared with that of the control model, so checking 10,000 parameter sets takes about a day. Running sixteen models in parallel, however, pushes this figure out to about 4 days, due to the extra burden on the primary CPU. A study of the code found that there are twenty parameters that enter into the calculation of the moisture profile. However, only eleven of those are external to the compiled program and can be changed for successive instances of the model; those eleven parameters are altered in this study. A more thorough parameter fitting could be done in the future by altering the code to make all twenty parameters external.

The two models have been matched in form, so that they each have exactly 32 data points in the vertical, at specific pressure (height) levels. The remaining parameters of the control model are fixed, and the accessible parameters of the trial model are varied in a random fashion. The cost function of the fitting procedure is taken as the 2-norm of the residuals between the two models:

$$E_i = \| r_0(z) - r_i(z) \|_2 = \sqrt{\sum_{k=1}^{32} \big( r_0(z_k) - r_i(z_k) \big)^2} \tag{7}$$

where r_0(z) is the value of the control profile at height z and r_i(z) is that of the trial profile.

A script written in the Python programming language executes the algorithm. It begins with assumed values of the initial parameter set, the ideal profile, the initial system temperature and the other algorithm parameters discussed above. On each iteration, the script makes a random change to a single model parameter and runs the model. The model output is processed into a simplified form, and the 2-norm of the residuals between the current model and the ideal is calculated using (7). This is the current energy of the system.

At this stage, the Metropolis criterion is used to determine whether the current parameter set is kept or discarded. If the current energy state is less than that of the last iteration, the model parameters are kept. If the energy has increased since the last iteration, however, the probability of reaching this energy state at the current temperature is determined using (4). Next, a random number is drawn from the uniform distribution on [0,1]. If this number is less than the probability of attaining the state, the parameters are kept. Otherwise, they are discarded, and the last accepted set of model parameters is used as a basis for the next iteration. The current energy, temperature and model parameters are saved to disk for later analysis, and a new iteration is begun.
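A sketch of how such a script might be organized is given below. The `run_trial_model` callable, which would invoke the compiled C model and return its 32-level moisture profile, and the step-size table are placeholders for the actual machinery, and the cooling rule, described next, is omitted here.

```python
import math
import random
import numpy as np

def energy(trial_profile, control_profile):
    """Equation (7): the 2-norm of the residuals over the 32 vertical levels."""
    return np.linalg.norm(control_profile - trial_profile)

def anneal(params, control_profile, run_trial_model, step_sizes, T, n_iter):
    """Metropolis search over a dict of model parameters at fixed temperature T."""
    e_old = energy(run_trial_model(params), control_profile)
    history = []
    for _ in range(n_iter):
        # Randomly change a single model parameter.
        trial = dict(params)
        name = random.choice(list(trial))
        trial[name] += random.uniform(-1.0, 1.0) * step_sizes[name]

        e_new = energy(run_trial_model(trial), control_profile)
        delta_e = e_new - e_old
        # Metropolis criterion, equation (4).
        if delta_e <= 0 or random.random() < math.exp(-delta_e / T):
            params, e_old = trial, e_new
        history.append((e_old, T, dict(params)))  # record for later analysis
    return params, history
```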
If one of two conditions is met, the temperature of the system drops; the effect on the algorithm is that positive energy changes become less likely to be accepted. The new temperature value is always 90% of the previous value. The first condition for cooling is that the algorithm has accepted 150 new parameter sets at the current temperature. These accepted values could be due to lower energy states or to probabilistically accepted higher states. If there are 300 iterations before the limit of accepted sets is achieved, the temperature drops by the second condition, which simply limits the total number of iterations at a given temperature in order to let the cooling process continue. Together, these values limit the lengths of the homogeneous Markov chains.
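A sketch of this bookkeeping, using the thresholds just described (150 acceptances or 300 attempts at a given temperature, then a drop to 90% of the previous value); the counter handling is illustrative.

```python
def maybe_cool(T, accepted_at_T, attempted_at_T,
               max_accepted=150, max_attempted=300, factor=0.9):
    """Drop the temperature when either chain-limiting condition is met;
    the counters reset when a drop occurs."""
    if accepted_at_T >= max_accepted or attempted_at_T >= max_attempted:
        return T * factor, 0, 0
    return T, accepted_at_T, attempted_at_T
```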
[Figure 2: The distribution of energy changes for a typical experiment. Positive changes in energy are shown at left and negative changes at right.]

[Figure 3: Two-dimensional energy plots used to estimate the step sizes for each parameter; the slices pair model parameters such as cvc, pbltop, cdrag, pscale, pstiff and wscale.]

The starting temperature was set according to the formula given by Laarhoven and Aarts (5). The average positive energy change was estimated by letting the model run for several thousand iterations and making a histogram of the positive energy changes; these are shown in Figure 2. Assuming an initial acceptance rate of 90% and an estimated value of the positive energy changes of 1e-4, the initial temperature was found to be 0.001.

Step sizes were determined for each of the parameters by estimating unique ranges of useful values for each parameter, based on plots of energy values for two-dimensional slices through the parameter space, centered on the initial guess of the model. Such plots are shown in Figure 3.

4 Results

The trials were run as an ensemble of 16 identical algorithms. The stochastic nature of the parameter selection procedure gave very different results. Three cases are shown in Figure 4, where time series of energy and temperature are plotted for each trial.

[Figure 4: Energy and temperature trends for three trials.]

In the first trial, the algorithm very quickly discovered significant minima and retained these gains over the course of the run. As the temperature approaches zero, positive energy changes are penalized more and more until the system is forced into a steady minimum state. The second trial has an even better outcome, in that the final steady state is lower in energy than that of the first; however, the journey to this state is quite different. Even though the algorithm finds deep minima early in the run, it soon gets stuck on a plateau, where it remains for several thousand iterations. Following this steady state, a monotonic drop to lower energy states leads to the final state.

In contrast to these relatively successful runs, the last trial illustrates a worst-case scenario: the final steady state is a metastable one in which the energy is far too high to give an acceptable solution. Strikingly, the energy hardly deviates from this value throughout the 20,000 iterations of the algorithm.

The differences between the ideal model and the annealing solutions for these three trials are shown in Figure 5. Since the mixing ratio ranges from 1 g/kg near 10 km up to 15 g/kg at the surface (see Figure 1), the final residuals for the first two trials are not bad. Residuals for the last trial are off the charts near the lowest levels of the modeled atmosphere, which is expected given the metastable steady state of the model. Final energy values for each trial are shown at the top of each plot; recall that these are the 2-norm values for the residuals at all model levels. The lowest energy achieved by these three trials is the one shown at center. Comparisons of these models with the ideal are shown in Figure 6.

[Figure 5: Differences between ideal model and iterative solution; final energies E = 0.000777, 0.000631 and 0.022931 for the three trials.]

[Figure 6: Comparison between ideal model and iterative solution.]

Estimated parameter distributions are shown in Figure 7 for two sets of model parameters. These distributions span the range of success in the estimation of parameter values for the ensemble of 16 inhomogeneous Markov chains. The plot on the left, showing distributions for the sea surface temperature and the surface drag coefficient, gives little hint of the shape of likely values for these two parameters: not only are the distributions irregularly shaped, but there is little pattern in the range of energy values (indicated by the shading). In the plot on the right, however, there is a concentrated region of lower-energy chains, as indicated by the lighter shading there. Also, the higher-energy chains to the left and right of the central region suggest gradients of similar character to those shown in the two-dimensional slices of Figure 3.
At this stage, it seems too early to attempt to estimate the parameter values and their confidence intervals. There is more work to do to explore and alter the optimization scheme used here to achieve more reliably convergent chains. To this end, I am currently performing a series of chains in which the step size is altered to effect a more thorough exploration of the parameter space.

5 Further Ideas, Limitations and Improvements

It should be pointed out that in these experiments the models with the lowest energy are not necessarily the best models. In fact, it is possible that none of them is desirable. This is because the cost function is dedicated solely to the task of matching the moisture profiles; other aspects of the model could thus be negatively altered in a non-trivial manner. One way to avoid this is to add another term to the cost function, such as the 2-norm of the model's temperature profile, again compared to the ideal model. This would constrain the thermodynamics of the model to realistic values at every level of the atmosphere.

Adding more terms to the cost function raises the question of the relative importance of the terms with respect to one another. If one term, such as the temperature 2-norm, is deemed more important, a coefficient can be included so that the cost function is largely determined by that term.
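Such a weighted, multi-term cost might be sketched as follows; the weight beta, expressing the relative importance of the temperature term, is a hypothetical tuning choice rather than a value from these experiments.

```python
import numpy as np

def combined_cost(trial_moisture, control_moisture,
                  trial_temperature, control_temperature, beta=1.0):
    """Weighted sum of 2-norm misfits for the moisture and temperature profiles."""
    moisture_term = np.linalg.norm(control_moisture - trial_moisture)
    temperature_term = np.linalg.norm(control_temperature - trial_temperature)
    return moisture_term + beta * temperature_term
```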
