Simulated annealing, weighted simulated annealing and genetic algorithm at work

François Bergeret¹ and Philippe Besse²

¹ Motorola Semiconducteurs S.A., avenue Eisenhower, 31023 Toulouse Cedex, France
² Laboratoire de Statistique et Probabilités, U.M.R. CNRS 5583, Université Paul Sabatier, 31062 Toulouse Cedex, France

Summary

Two well-known stochastic optimization algorithms, simulated annealing and the genetic algorithm, are compared when a sample is used to minimize an objective function which is the expectation of a random variable. Since they lead to a minimum that depends on the sample, a weighted version of simulated annealing is proposed in order to reduce this kind of over-fit bias. The algorithms are implemented on an optimization problem related to quality control. A design of experiments is used to obtain the best trade-off between optimization and execution time. Simulated annealing appears to be more efficient than the genetic algorithm. With regard to the bias problem, the randomly weighted version of simulated annealing yields a solution less dependent on the sample and thus less biased.

Keywords: stochastic optimization, quality control, design of experiments.

1 Introduction

Stochastic optimization aims at finding the global minimum of an objective function. Simulated annealing (Kirkpatrick et al., 1983) and the genetic algorithm (Holland, 1975) are two such stochastic algorithms. They require a long execution time in order to approach the global minimum. They are widely used in many applications (Davis, 1987), but few comparisons exist. We are interested in finding an optimal test sequence in quality control. It is a combinatorial optimization problem whose objective function can be modeled as the expectation of a random variable. The expectation depends on a parameter γ which belongs to a finite set Φ; X is a random variable that models the production. The expectation is taken over X and is unknown because the distribution of X is unknown. The problem is

$$\min_{\gamma \in \Phi} E[f(X, \gamma)].$$

Since the function H(γ) = E[f(X, γ)] is usually not "convex" in γ, global optimization algorithms are used in order not to get stuck in a local minimum. To estimate the expectation, a sample is used: (X_1, ..., X_n) are independent and identically distributed. This sample remains fixed throughout the study. The natural idea is to estimate H(γ) by the sample average:

$$H_n(\gamma) = \frac{1}{n} \sum_{i=1}^{n} f(X_i, \gamma).$$

In this paper, we aim at:

• adjusting the parameters of the algorithms,
• comparing the algorithms,
• reducing the over-fit bias that may occur because the same sample is used both to optimize and to estimate.

The adjustment of the parameters is done by design of experiments. The comparison of the algorithms shows that simulated annealing is more efficient than the genetic algorithm on the test optimization problem. A randomly weighted version of simulated annealing is then proposed to reduce the over-fit bias: the objective function is estimated on a weighted sample. Numerical experiments show that this version reduces the bias.

We introduce notations and the main results about simulated annealing in section 2. The weighted version of simulated annealing is presented in section 3. The implementation of the algorithms is detailed in section 4, and practical comparisons (simulated annealing versus genetic algorithm, and simulated annealing versus weighted simulated annealing) are presented in section 5. The main results are discussed in section 6.
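As a small illustration of the sample-average estimator H_n(γ) used throughout the paper, the following sketch (in Python, with hypothetical names `f`, `sample` and `gamma`; nothing here comes from the authors' implementation) evaluates the estimated objective for one candidate solution:

```python
import numpy as np

def sample_average_objective(f, sample, gamma):
    """Estimate H(gamma) = E[f(X, gamma)] by H_n(gamma) = (1/n) * sum_i f(X_i, gamma).

    `sample` holds the fixed observations X_1, ..., X_n and `gamma` is one
    candidate solution from the finite set Phi.
    """
    return np.mean([f(x_i, gamma) for x_i in sample])

# Toy usage: Phi = {0, 1}, f(x, gamma) = (x - gamma)**2, X ~ N(0, 1).
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(sample_average_objective(lambda x, g: (x - g) ** 2, sample, 0))
print(sample_average_objective(lambda x, g: (x - g) ** 2, sample, 1))
```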
2 Simulated Annealing

Simulated annealing (see Aarts and Korst, 1989) uses a direct analogy with the physical annealing of solids, which consists of two steps:

• Quickly increase the temperature to a maximum value at which the solid melts; the particles are then randomly arranged.
• Slowly decrease the temperature until the particles arrange themselves in the ground state of the solid; the energy is then minimal.

In optimization, a solution corresponds to a state of the physical system and the cost function to be minimized corresponds to the energy of a state. At the beginning of the algorithm, all solutions are accepted. As the algorithm evolves, the probability of accepting deteriorations decreases with the temperature. At the end of the algorithm, only improvements are allowed and the system is considered frozen.

2.1 The algorithm

The following notations are needed to define the algorithm:

• γ_0 is the initial (random) solution,
• γ is the current solution,
• γ' is a neighbour of γ, that is, a solution close to γ in some sense,
• k is the iteration number of the changes of temperature,
• c_k is the control parameter; it plays the role of the temperature,
• a proposed transition is a neighbour visited,
• an accepted transition is the transformation of the current solution into a new one,
• L_k is the number of proposed transitions for k fixed,
• L_k^a is the number of accepted transitions for k fixed.

The algorithm has two nested loops:

• An outer loop on k controls the temperature c_k. The algorithm exits this loop when a stopping criterion is satisfied.
• An inner loop is used to visit new solutions for k fixed. The algorithm exits this loop when one of the two following conditions is satisfied:
  1. the number of accepted transitions L_k^a reaches a given value L_max^a,
  2. the number of proposed transitions L_k reaches a given value L_max;
  L_max^a and L_max are two fixed parameters of the algorithm.

At each step of the inner loop, the algorithm generates a neighbour solution γ' of the current solution γ and compares the values of the objective function for the two solutions. Improvements of the objective function are always accepted in order to reach the minimum. Deteriorations in cost are accepted with the acceptance probability

$$A_{\gamma\gamma'}(c_k) = \exp\left(\frac{H(\gamma) - H(\gamma')}{c_k}\right).$$

Note that this probability depends on the difference H(γ) − H(γ'): too strong deteriorations are not allowed, so that the information already gained is not lost. It also depends on the control parameter c_k, which plays a central role: at the beginning c_k is large and the domain space is visited randomly; c_k is then slowly decreased in order to visit lower-energy regions and to escape from local minima. As c_k goes to 0, only improvements are allowed and the algorithm behaves like a local search algorithm.
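The nested-loop structure above translates directly into code. The sketch below is a minimal, illustrative implementation for a minimization problem; the function names (`H`, `neighbour`) and the default parameter values are assumptions made for the example, not values recommended by the paper, and a fixed number of outer iterations stands in for the stopping criterion discussed in section 2.2.

```python
import math
import random

def simulated_annealing(H, neighbour, gamma0, c0,
                        alpha=0.9, L_max=200, La_max=20, n_stages=100):
    """Minimal simulated annealing: outer loop on the temperature c_k,
    inner loop of at most L_max proposed / La_max accepted transitions."""
    gamma, h = gamma0, H(gamma0)
    best, best_h = gamma, h
    c = c0
    for _ in range(n_stages):                           # outer loop on k
        proposed = accepted = 0
        while proposed < L_max and accepted < La_max:   # inner loop, k fixed
            proposed += 1
            cand = neighbour(gamma)
            h_cand = H(cand)
            # improvements always accepted; deteriorations accepted
            # with probability exp((H(gamma) - H(gamma')) / c_k)
            if h_cand < h or random.random() < math.exp((h - h_cand) / c):
                gamma, h = cand, h_cand
                accepted += 1
                if h < best_h:
                    best, best_h = gamma, h
        c *= alpha                                      # cooling step (see section 2.2)
    return best, best_h
```

For a binary solution space such as the one of section 4, a natural `neighbour` function would flip one randomly chosen coordinate of γ, although the paper does not commit to a specific neighbourhood here.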
In a pioneering work, S. Geman and D. Geman (1984) proved that suitable inverse cooling schedules ensure the convergence of the chain to a global minimum. More precise studies are the papers of Hajek (1985, 1988) and Tsitsiklis (1989), which give simple sufficient and necessary conditions on the cooling schedule c_k for the algorithm to converge in probability to the global minima. Sharper estimates can be found in Catoni (1991) and Chiang and Chow (1988), and various extensions in Trouvé (1993), Hwang and Sheu (1992) and Bélisle (1992). From a practical point of view, the essential problem is to choose the cooling schedule so that convergence occurs as fast as possible. Moreover, Catoni (1991) studies optimal cooling schedules given a finite-time execution. A finite-time implementation of the algorithm is presented in the next section.

2.2 The cooling schedule

Some parameters of the algorithm were defined in the previous section. The evolution of these parameters is controlled by a cooling schedule, which specifies:

• an initial value of the control parameter c_0,
• a decremental function for decreasing the value of c_k,
• a stop criterion for the objective function,
• a finite number of transitions for each value of c_k.

Following this definition, several cooling schedules have been proposed in the literature. The most famous is the original schedule of Kirkpatrick et al. (1983):

• c_0 is initialized so that the acceptance ratio r = L_0^a / L_0 is close to one. In practice, before the algorithm starts, different values of c are tested in increasing order; at each attempt the corresponding value of the acceptance ratio r increases, because c increases. When the observed r is close to one (e.g. r > 0.99), the corresponding value of c is taken as the starting point c_0.
• The decremental function is c_k = α c_{k−1}, where α usually lies between 0.8 and 0.95.
• The algorithm is terminated when the value of the cost function remains unchanged for a number of consecutive iterations.
• As stated in section 2.1, the value of c_k is decreased once at least L_max^a transitions have been accepted for k fixed. However, this would require L_k → ∞ as c_k → 0, because when c_k is close to 0 almost only improvements are accepted. This is why, since there are few improvements once the current solution is nearly optimal, L_k is also bounded by some constant L_max.

With this cooling schedule the temperature decreases exponentially with k, that is (assuming ln α ≈ α − 1):

$$c_k = c_0 \exp[(\alpha - 1)k].$$

Another cooling schedule specifies a logarithmic decrement of the temperature:

$$c_k = d / \log(k).$$

From a practical point of view, the essential problem is to choose the cooling schedule so that convergence occurs as fast as possible. In a recent paper, Catoni (1992) studies optimal cooling schedules given a finite-time execution. He shows that the exponential schedule is better as long as the execution time is finite.
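The initialization of c_0 by trial acceptance ratios can be sketched as follows. This is only one plausible reading of the procedure described above: the chain length `trials`, the doubling of c and the target ratio 0.99 are illustrative choices, not values taken from the paper.

```python
import math
import random

def acceptance_ratio(H, neighbour, gamma0, c, trials=200):
    """Run a short chain at fixed temperature c and return the observed
    ratio r = accepted transitions / proposed transitions."""
    gamma, h = gamma0, H(gamma0)
    accepted = 0
    for _ in range(trials):
        cand = neighbour(gamma)
        h_cand = H(cand)
        if h_cand < h or random.random() < math.exp((h - h_cand) / c):
            gamma, h = cand, h_cand
            accepted += 1
    return accepted / trials

def initial_temperature(H, neighbour, gamma0, target=0.99, c=1e-3):
    """Test increasing values of c until the observed acceptance ratio
    exceeds the target; that value of c is taken as c_0."""
    while acceptance_ratio(H, neighbour, gamma0, c) <= target:
        c *= 2.0
    return c

def exponential_schedule(c0, alpha, k):
    """c_k = alpha * c_{k-1}, i.e. c_k = c0 * alpha**k, roughly c0 * exp((alpha - 1) * k)."""
    return c0 * alpha ** k
```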
2.3 Convergence

To provide conditions for the convergence of simulated annealing towards the set S* of global minima of H, the following specifications are needed. S_γ denotes the set of neighbours of γ. It is assumed that:

$$\gamma \notin S_\gamma; \qquad \gamma' \in S_\gamma \Leftrightarrow \gamma \in S_{\gamma'}. \tag{1}$$

Let us denote

$$\chi_{S_\gamma}(\gamma') = \begin{cases} 1 & \text{if } \gamma' \in S_\gamma, \\ 0 & \text{otherwise.} \end{cases}$$

G_{γγ'}(c) is the generation probability, that is, the probability of generating γ' from the neighbourhood S_γ of γ. It is given by:

$$G_{\gamma\gamma'}(c) = G_{\gamma\gamma'} = \frac{\chi_{S_\gamma}(\gamma')}{\Theta}, \tag{2}$$

where Θ = |S_γ| for all γ ∈ Φ.

A_{γγ'}(c) is the acceptance probability, that is, the probability of accepting γ' generated from S_γ. It is given by:

$$A_{\gamma\gamma'}(c) = \begin{cases} 1 & \text{if } H(\gamma') < H(\gamma), \\ \exp\left(\dfrac{H(\gamma) - H(\gamma')}{c}\right) & \text{if } H(\gamma') \ge H(\gamma). \end{cases} \tag{3}$$

P_{γγ'}(c) denotes the transition probability, that is, the probability of replacing the current solution γ by γ'. As a result of (2) and (3), it is given by:

$$P_{\gamma\gamma'}(c) = \begin{cases} G_{\gamma\gamma'}\, A_{\gamma\gamma'}(c) & \text{if } \gamma' \neq \gamma, \\ 1 - \sum_{l \neq \gamma} P_{\gamma l}(c) & \text{if } \gamma' = \gamma. \end{cases} \tag{4}$$

The algorithm is modeled as a Markov chain (Y_l)_l, where Y_l is the l-th trial and P(c) is the transition matrix associated with the algorithm. The stationary distribution is then defined by:

$$q_\gamma(c) = \lim_{l \to \infty} P(Y(l) = \gamma \mid Y(0) = \gamma'), \qquad \forall \gamma'.$$

The convergence proof works as follows (see for example Geman and Geman 1984, Gidas 1985, Hajek 1988):

1. The temperature c is first assumed to be constant. The Markov chain (Y_l)_l is then homogeneous. Under specifications (2), (3), (4), and if the following condition holds:

$$\forall \gamma, \gamma' \in \Phi,\ \exists p \ge 1,\ \exists l_0, \ldots, l_p \in \Phi,\ l_0 = \gamma,\ l_p = \gamma':\quad G_{l_k l_{k+1}} > 0,\ 0 \le k \le p-1, \tag{5}$$

   the Markov chain is irreducible and aperiodic and converges towards a unique stationary distribution. If condition (1) holds, then the stationary distribution satisfies the detailed balance equation:

$$q_\gamma(c)\, P_{\gamma\gamma'}(c) = q_{\gamma'}(c)\, P_{\gamma'\gamma}(c), \qquad \forall \gamma, \gamma' \in \Phi.$$

   In that case, it can be shown that the distribution

$$q_\gamma(c) = \frac{\exp(-H(\gamma)/c)}{N_0(c)}, \qquad \forall \gamma \in \Phi, \quad \text{with} \quad N_0(c) = \sum_{l \in \Phi} \exp\left[-\frac{H(l)}{c}\right],$$

   is the unique stationary distribution. Note that a weaker condition than condition (1) is given in Hwang and Sheu (1992); with this condition it is still possible to characterize the stationary distribution. In any case, the structure of the generation probability is the key point of the proof.

2. The temperature c decreases. The simulated annealing algorithm is described by combining the homogeneous Markov chains of finite length into one single inhomogeneous Markov chain. If the cooling is done sufficiently slowly, the inhomogeneous Markov chain converges, as c → 0, towards the uniform distribution on the set of global optima:

$$q_\gamma^* = \frac{1}{|S^*|}\, \chi_{S^*}(\gamma).$$

   Strong ergodicity of the Markov chain is used to show the convergence. Isaacson and Madsen (1976) provide conditions for the strong ergodicity of a Markov chain: the chain has to be weakly ergodic, and the stationary distribution for c fixed has to exist and satisfy some properties. For a detailed account and a discussion of the origins of the various approaches commonly used, see for instance Catoni (1991). Another study (Miclo, 1996), based on the relative entropy distance and log-Sobolev inequalities, leads to a simpler convergence proof and extends others that have been published.

Application to an expectation

Consider a simulated annealing algorithm with the objective function

$$H_n(\gamma) = \frac{1}{n} \sum_{i=1}^{n} f(X_i, \gamma).$$

The temperature c is fixed and specifications (1), (2), (3), (4) are used. The neighbourhood structure is defined so as to satisfy condition (5). Then the stationary distribution is

$$q_\gamma(c) = \frac{1}{N_0(c)} \exp\left[-\frac{1}{c}\,\frac{1}{n} \sum_{i=1}^{n} f(X_i, \gamma)\right], \quad \text{with} \quad N_0(c) = \sum_{l \in \Phi} \exp\left[-\frac{1}{c}\,\frac{1}{n} \sum_{i=1}^{n} f(X_i, l)\right].$$

The temperature c then decreases. If the cooling is done sufficiently slowly, simulated annealing converges towards the set of global optima of H_n.
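The fixed-temperature result can be checked numerically on a toy finite solution set: the Gibbs distribution q_γ(c) concentrates on the global minima as c decreases. The following sketch uses values chosen only for illustration.

```python
import numpy as np

def stationary_distribution(H_values, c):
    """q_gamma(c) = exp(-H(gamma)/c) / N0(c) over a finite set of solutions."""
    h = np.asarray(H_values, dtype=float)
    w = np.exp(-(h - h.min()) / c)      # shift by the minimum for numerical stability
    return w / w.sum()

# Toy set of 5 solutions with two global minima (objective value 1.0):
H_values = [3.0, 1.0, 2.0, 1.0, 5.0]
for c in (10.0, 1.0, 0.1, 0.01):
    print(c, np.round(stationary_distribution(H_values, c), 3))
# As c -> 0 the mass tends to 1/2 on each global minimum, i.e. the
# uniform distribution (1/|S*|) * chi_{S*}(gamma) on the set of optima.
```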
3 Weighted simulated annealing

As stated in the introduction, a weighted version of simulated annealing is proposed in this section to reduce the over-fit bias, since we are interested in finding a global minimum of H. The weighted algorithm is detailed and its convergence is studied.

3.1 The weighted algorithm

The aim of this adaptation is to avoid a minimum which would depend too much on the sample. This is done by introducing a small perturbation at each step of the simulated annealing: each new solution is evaluated on a sample that stays in a neighbourhood of the initial sample. More precisely, at each step of the algorithm, the cost function H(γ) is estimated by

$$\hat{H}(\gamma, \omega) = \frac{1}{n} \sum_{i=1}^{n} \omega_i f(X_i, \gamma),$$

where the ω_i are random weights such that Σ_{i=1}^{n} ω_i = n. The weights are usually taken as the average of B multinomial random vectors formed from n draws on n equally likely cells. The algorithm is the one presented in section 2; the difference is that each solution is now a couple (γ, ω). When a new solution is generated, new weights are generated. The acceptance probability and the transition probability for a neighbour (γ', ω') of (γ, ω) are those of the standard algorithm. The function Ĥ(γ, ω) is thus a weighted estimate of H(γ). There is a similarity with the Monte Carlo implementation of the bootstrap (Efron, 1979) because of the multinomial random vectors. The differences are:

• Each time the objective function is estimated in the algorithm, new weights are generated.
• The optimization is performed once. With a Monte Carlo method approximating the bootstrap, it would have been performed B times, leading each time to a different solution of the optimization problem.

The random generation of new weights at each step prevents the algorithm from optimizing only on the initial sample. This may lead to a less biased estimate of the objective function at the end of the algorithm. Moreover, the additional computational cost is small: it consists mainly in generating B vectors of random weights for each evaluation of the objective function.
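A sketch of the weight generation and of the weighted estimate Ĥ(γ, ω) is given below (Python, with illustrative names; B = 10 is an arbitrary choice, not a value recommended by the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_weights(n, B):
    """Weights omega with sum(omega) = n, obtained as the average of B
    multinomial vectors of n draws on n equally likely cells."""
    draws = rng.multinomial(n, np.full(n, 1.0 / n), size=B)   # shape (B, n), rows sum to n
    return draws.mean(axis=0)                                 # the average still sums to n

def weighted_objective(f, sample, gamma, B=10):
    """H_hat(gamma, omega) = (1/n) * sum_i omega_i * f(X_i, gamma); fresh weights
    are drawn at every evaluation, as in the weighted algorithm above."""
    omega = random_weights(len(sample), B)
    return float(np.mean([w * f(x, gamma) for w, x in zip(omega, sample)]))
```

In the weighted annealing loop, `weighted_objective` would simply replace the sample-average estimate H_n in the acceptance test.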
3.2 Convergence

In this section we prove that the weighted version of simulated annealing converges towards the set of global optima of the objective function, with respect to an enlarged set of solutions. The enlarged set of solutions is

$$S = \Phi \times \Omega,$$

where Ω is the set of weight vectors ω = (ω_i)_{1≤i≤n} with Σ_{i=1}^{n} ω_i = n. The number of solutions is still finite because the weights are multinomial. Let us specify the new generation probability. The probability of generating (γ', ω') from (γ, ω) is

$$G_{(\gamma,\omega),(\gamma',\omega')} = \frac{\chi_{S_{(\gamma,\omega)}}(\gamma',\omega')}{\Theta}\, Q(\omega'), \tag{6}$$

where Θ = |S_γ| for all γ ∈ Φ, and Q(ω') is the probability of obtaining the weight vector ω'. By construction, Q(ω') > 0 for all ω' ∈ Ω. The acceptance and transition probabilities are those of specifications (3) and (4), with

$$\hat{H}(\gamma, \omega) = \frac{1}{n} \sum_{i=1}^{n} \omega_i f(X_i, \gamma).$$

The results for the weighted simulated annealing are the following:

1. The temperature c is constant. If condition (5) is satisfied for the standard version, it is satisfied for the weighted version: if G_{l_k l_{k+1}} > 0 then G_{(l_k, ω),(l_{k+1}, ω')} > 0 for all ω, ω' ∈ Ω, because every weight vector can be generated at each step. Thus the Markov chain is irreducible and aperiodic and converges towards a unique stationary distribution. The symmetry condition

$$G_{(\gamma,\omega),(\gamma',\omega')} = G_{(\gamma',\omega'),(\gamma,\omega)}, \qquad \forall (\gamma,\omega), (\gamma',\omega') \in S,$$

   does not hold with the generation probability (6), because all weight vectors ω are not equally likely. For this reason it is not easy to make the stationary distribution explicit. However, the conditions of Hwang and Sheu (1992) can be used to prove the convergence: they introduce the generalized simulated annealing and propose Hajek's condition to prove convergence. It can be shown that Hajek's condition is satisfied if

$$\forall (\gamma,\omega), (\gamma',\omega') \in S, \quad P((\gamma,\omega),(\gamma',\omega')) > 0 \Leftrightarrow P((\gamma',\omega'),(\gamma,\omega)) > 0.$$

   This condition is satisfied with the generation probability (6).

2. The temperature c decreases. If the cooling is done sufficiently slowly, the inhomogeneous Markov chain converges, as c → 0, towards some distribution on the set of global optima over Φ × Ω.

The homogeneous Markov chain converges towards a stationary distribution because the optimization is performed on Φ × Ω. However, the bias reduction comes from the following point: each possible weight vector is in a neighbourhood of the current weight vector, so the optimization is not really performed in ω, since previous solutions in ω are forgotten. The perturbation only prevents a too strong dependence on the initial sample, in order to reduce the bias.

4 Application in Quality Control

4.1 The problem

After manufacturing, electronic devices are tested in order to make sure that the products meet the specifications. The control consists in a number of electrical measurements. Assume that the cost of each measurement (also called a test) is known. Assume also that the cost of an undetected bad part is known. Savings are possible by reducing the number of tests, but the proportion of undetected bad parts then increases. Thus there is a trade-off between the cost of the tests and the cost of undetected bad parts, in order to minimize the total manufacturing cost. The selection of the tests that optimize this objective function is a combinatorial optimization problem, solved here by simulated annealing and the genetic algorithm. Let us introduce some definitions to state the problem precisely:

• q is the original number of tests.
• A test combination γ is a vector of length q: γ_i = 1 means that test i is in the combination, γ_i = 0 means that it is not.
• The domain space for γ is Φ = {0, 1}^q.
• γ^p is the present test sequence, that is, γ_i^p = 1 for all i.
• TC(γ) is the test cost per lot generated by γ.
• UC(X, γ) is a random variable; it represents the cost of the undetected bad parts for the lot X.
• f(X, γ) represents the net savings per lot: f(X, γ) = TC(γ^p) − TC(γ) − UC(X, γ).
• The objective function is then H(γ) = E[f(X, γ)].

The function H(γ) is to be maximized (a toy sketch of this cost model is given at the end of this subsection). According to the definitions of sections 2 and 3, H(γ) is estimated by H_n(γ) or Ĥ(γ, ω) on a sample of n lots taken from the production. The number of test combinations is 2^q, and q ranges from 7 to 700 depending on the product. An exhaustive search is impossible for most products because the computational cost grows exponentially with q. For q < 100, a step-by-step technique can be used (Bergeret and Chandon, 1995), but for q ≥ 100 the results depend strongly on the initial solution.
Stochastic algorithms are therefore used in order to approach the global optimum of the objective function. In the next section, a genetic algorithm is presented in order to compare it with simulated annealing on this real-world problem.
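To make the cost structure of section 4.1 concrete, here is a toy version of the net-savings function f(X, γ). The representation of a lot as a boolean fail matrix and the single per-part cost of an escaping bad part are assumptions made only for this illustration; they are not the authors' cost model.

```python
import numpy as np

def net_savings(lot, gamma, test_costs, bad_part_cost):
    """Toy f(X, gamma) = TC(gamma^p) - TC(gamma) - UC(X, gamma).

    `lot` is a (parts x q) boolean array, True where a part fails a test;
    `gamma` is the 0/1 test-selection vector; `test_costs` gives the per-lot
    cost of applying each of the q tests."""
    lot = np.asarray(lot, dtype=bool)
    gamma = np.asarray(gamma, dtype=bool)
    test_costs = np.asarray(test_costs, dtype=float)
    tc_full = test_costs.sum()                 # TC(gamma^p): all q tests applied
    tc_selected = test_costs[gamma].sum()      # TC(gamma): selected tests only
    bad = lot.any(axis=1)                      # parts failing at least one of the q tests
    caught = lot[:, gamma].any(axis=1)         # parts caught by the selected tests
    escaped = np.count_nonzero(bad & ~caught)  # undetected bad parts
    return tc_full - tc_selected - escaped * bad_part_cost
```

Maximizing the sample average of these net savings over γ ∈ {0, 1}^q is then the combinatorial problem handed to the stochastic algorithms.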
