Simulation and analysis of in vitro DNA evolution Morten Kloster1,2 and Chao Tang2,3,∗ 1Department of Physics, Princeton University, Princeton, New Jersey 08544 2NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540 3Center for Theoretical Biology, Peking University, Beijing 100871, China We study theoretically the in vitro evolution of a DNA sequence by binding to a transcription factor. Using a simple model of protein-DNA binding and available binding constants for the Mnt protein, we perform large-scale, realistic simulations of evolution starting from a single DNA sequence. Weidentifydifferentparameterregimescharacterizedbydistinctevolutionarybehaviors. For each regime we find analytical estimates which agree well with simulation results. For small population sizes, the DNA evolutional path is a random walk on a smooth landscape. While for large population sizes, the evolution dynamics can be well described by a mean-field theory. We 3 also studyhow the details of theDNA-proteininteraction affect the evolution. 0 0 2 The concept of evolution has not only fundamentally PCR (Polymerase Chain Reaction) followed by selecting n shaped our view of biology, but also found rich and pro- a fixed fraction of the DNA sequences via washing off a found applications in bioengineering and biotechnology. relatively weak protein binders. Starting from a mix of J In particular, in vitro evolution has been widely used to random DNA sequences and by sequencing a number of 1 evolve DNA [1], RNA [2] and proteins [3]. In evolution, DNA sequences after each round, they were able to ob- 2 mutationsatthemolecularlevelareselectedatthefunc- servethe dynamics of DNA populationevolving towards 1 tional level. A quantitative theory of an evolutionary thewildtype(WT)sequence. Insuchamolecularbreed- v process wouldreqire a quantitative understanding of the ingexperiment,therelationbetweenthegenotype(DNA 2 selection process (e.g. fitness function, landscape, selec- sequence) and the phenotype (binding affinity to a pro- 7 tion pressure, etc.). While this is in general difficult to tein) is direct and simple. If the binding constants are 3 achievefornaturalorlaboratoryevolution,therearesim- known for various DNA sequences, the selection process 1 ple caseswhere sucha quantitativedescriptionis readily canbe modeled quantitativelyand itcanthen serveas a 0 3 available. Based on experiments of RNA virus evolu- model system for quantitative analysis of molecular evo- 0 tion [4], Levine and colleagues [5, 6, 7] studied a simple lution. / model of evolutionary process in a smooth landscape in Inthispaper,westudytheoreticallythein vitroevolu- t a which the fitness of an individual is given as the sum of tion of DNA sequences via binding to the mnt repressor m manyindividualcontributionsthatcanbemutatedinde- protein. The system of DNA-mnt is perhaps the best - pendently. Their studies found good agreement between experimentally characterized system of sequence-specific d theoryandsimulationsforsmallpopulationsizesandfor DNA-protein binding, [10, 11, 12] and is particularly n equilibrium mean field theory, but the evolution dynam- suited for a thorough quantitative study of molecular o icsturnedouttobepathologicalinlargepopulationsizes evolution. Specifically, the binding constants of the WT c : whereextremelyraremutationsareexponentiallyampli- DNA sequence and many other sequences, including all v fied, yielding infinite speed of evolution. Peng et al. [8] thesequencesonepoint-mutationawayfromtheWT,are i X proposed DNA binding to proteins as a model system measured experimentally. [10] Furthermore, it has been r forevolutioninasmoothlandscapeandstudiedamodel demonstrated that the binding energy of a sequence can a where a large population of long DNA molecules were be approximately decomposed as the sum of the contri- subjected to high mutation rates and selected by how butions from the individual bases. [10, 12] This additive stronglythey bindto aprotein(the histone-octamerwas form of binding energy greatly simplifies the analysis— mentioned as a possible example). Their model can be it enables us to perform realistic large scale simulations well described by a continuum equation and they have as well as to obtain analytic solutions and estimates in shownthattheaveragedistancetothehighestaffinityse- various cases. In our study, we explore various regimes quenceexponentiallyapproachesitsequilibriumvalue.[8] ofexperimentallyaccessibleparametersandwefindvery In a recent experiment, Dubertret et al. [9] studied distinct evolution dynamics in different regimes. the in vitro evolution of DNA sequences via binding to the lac repressor protein which is kept unchanged. In I. MODELS AND METHODS the experiment, a round of evolution consists of ampli- fication and mutation of the DNA pool by error-prone We assume that the binding energy between a DNA molecule and the mnt protein is given by the sum of the contributions from individual base pairs, ∗Corresponding author: Chao Tang, NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540. Tel: (609)-951-2644; E(S)= ǫi(Si), (1) Fax: (609)-951-2496. E-mail: [email protected] i X 2 where Si ∈ A,C,G,T is the base at the i-th position of Position 4 7 9 10 13 15 the DNA sequence and ǫi(Si) is the contribution to the WT base G C A C T G binding energy from the i-th position for which we use SS base T A C T C C the experimentally determined value in Ref. [10]. The ∆K 2.13 10.0 5.0 5.5 7.2 8.3 relative binding constants are then K(S)= K (S )= i i i ie−βǫi(Si).∗ WestartwithapopulationsizQeN ofDNA TABLE I: The starting sequence SS. ∆K = e−β∆E is the molecules of the same sequence that is significantly dif- ratio of the binding constants due to the mutation at the Q ferent from the WT, avoiding the potential problem of specific position. enrichment dominating the evolution [9] (i.e. sequences very close to the WT in the initial pool being amplified exponentially). An iteration of the evolutionary process 100 102 104 106 108 1010 1012 1014 1016 1018 1.0 1000 consists of an amplification with mutation followed by a selection. During amplification the population is dou- bled I times (correspondingto,e.g. I cyclesof PCR),so 0.8 100 that the populationsize is increasedto 2IN. We assume d tbtthihaoaesntemia(snsttetheepaertnchohteseudiabnupj.peplcEeitcnaadtctohiixoanDfosNterhlAeemcrmteoirooiensleoacpnunroletechreiirsssos)sr.evlrieTaacthtbeeeidnopdfwoirnpitguphlteaaor- Fraction00..46 RW MG Crs. MF 10 mber of paths use u N probability 1 , where the chemical potential µ 1+eβ(E(S)−µ) BBeesstt ppaatthh ((ffrraaccttiioonn)) 1 ischosensuchthattheexpectednumberofselectedDNA 0.2 MMiinniimmuumm ppaatthhss ((ffrraacc..)) molecules is N. We iterate the evolutionary process un- R## Wmm iinnpiirmmeduuimmct ioppnaatthhss RW prediction til at least 90% of the DNA molecules are WT, and we 0.0 0.1 denote the number of iterations required t . 100 102 104 106 108 1010 1012 M N Simulation. Thebindingsiteforthemntrepressorcon- FIG. 1: The average fraction of WT contributed by the best sists of 17 important base pairs, at positions 3 through path,thetotalfractionfromallminimumpaths,andthenum- 19. † For our starting sequence, we chose, by random berofdifferentminimumpathsused,fordifferentpopulation mutations fromWT, asequence thatdiffers fromWT at sizes N. The parameters r = 10−4 (solid lines), r = 10−7 m=6positions. WedenotedbySSthestartingsequence (dashedlines), andI =1areheldfixed. Thevariousregimes (Table I). A simulation starts with N copies of SS. We (random walk, middle ground, and crossover to mean field) call a specific sequence of mutations that take an SS to are indicated. the WT an evolution path. Each path contains the six required mutations in some order, and may contain ad- ditionalmutations.‡ Thereare6!=720minimumpaths from each different path. that only contain the six requiredmutations. We iterate the evolutionary process until at least 90% of the DNA molecules are WT. We denote the number of iterations required t and record the number of molecules coming M II. RESULTS Someofthekeyquantitiesfromasinglesimulationrun ∗ In addition to this sequence-specific binding, DNA molecules are the fraction of WT that was produced through min- may bind to the transcription factor in a non-specific man- ner [13, 14, 15], Ktot(S) = K(S)+KNSB, and this binding imum paths, fmWinT, the number of minimum paths used, dominates for DNA sequences that are very far from WT. In nmin, and the fraction of the WT produced through the thispaper,weonlyconsiderDNAsequencesthataresufficiently single best path in the simulation, fWT. Fig. 1 shows best closetoWTsothatthenon-specificbindingcanbeneglected— how these quantities depend on the population size (av- westillinclude itas a parameter inour simulations, but ignore eragedovermanysimulations). WeseethatfWTissmall it in the analytical estimates. See the appendix for more on min for very small N and is very close to 1 for large N, with non-specificbinding. † Themntrepressorisadimeroftwoidenticalhalves,whichcauses a fairly sharp transition, whereas fWT slowly decreases best the WT to bepalindromicaroundposition 11. The actual wild from 1 with increasing N. This indicates that we may typesequencediffersfromthesequencewiththehighestaffinity expect to find qualitatively different behavior for small atone position (19), we denote the highest affinity sequence by and large population sizes. WTforconvenience. [10] ‡ Ifamoleculeismutatedatmorethanonepositionduringasingle Small N: Random Walk. Let us first consider the duplication cycle, we randomly assign an order to these muta- case of N = 1 and I = 1, i.e. a single DNA molecule is tions. Asaprecaution,wehavealsoperformedsimulationswhere duplicated once and one of the two molecules is selected no more than one mutation per duplication cycle per molecule at each iteration. If there are no mutations during the was allowed, and verified that there were no qualitative differ- ences. duplication, the two molecules are identical and nothing 3 interesting can happen. If there is a mutation of type§ 8 i (the chance of multiple mutations in the same duplica- tion is negligible), the binding constant of the copy will be ∆Ki times that of the original, and the chance of se- 6 Simulation results lecting the mutant is ∆Ki , which is high for favorable Exponential fit 1+∆Ki mutations (∆K >1) and low for unfavorableones. The i DThNeAmmoloelceuculelewwillillmtahkuesapesrtfeoprmwhaenbeivaeserdarmanudtoamntwisaslke-: Errors4 lected, and the steps that improve the binding to the protein are favored over the steps that reduce the bind- 2 ing. Consideringthe probabilityofmakingeachstep, we find that we have exactly a random walk in the energy landscape given by the binding energy—in equilibrium, Prob(S)∝e−βE(S). 0 0 200 400 600 800 1000 Now consider a population size N >1, keeping I =1. Iteration As long as N is sufficiently small (N ≪1/r), the chance FIG.2: AveragenumberofdifferentbasesbetweenDNApool of a mutation happening in any single iteration is very and WT as function of time, and the exponential fit. r = low. When a mutation happens, it will almost cer- 10−4, N =256, I =1, averaged over 8192 runs. tainly either “die out” (disappear from the population) or spread through the whole population before the next mutationoccurs,sothatmostofthetimethepopulation The averagetime neededto improvethe DNA poolby consistsofN identicalDNAmolecules. Duringselection, one base relative to WT can now be estimated as the chance of choosing a particular combinationof DNA molecules is proportional to the product of their bind- −1 Nr 1 log(N) ing constants. Thus, if there are m mutants of type i in hTi≈ 1− + , (4) the population at iteration t, the chance that there will " i 3 (cid:18) ∆Ki(cid:19)# log(12+∆∆KK) X be m+j mutants in the population the next iteration is (∆Ki)2j times the chance that there will be m − j where ∆K is a typical value for ∆Ki. The first term mutants (as we select exactly half the DNA, the combi- is the time required to create a “seed” mutation: The natorics are identical). The probabilities pN(m) that m sum is over all possible correct mutations, Nr is the ± 3 identical mutants in a population of size N will spread chance of each mutation occurring in a given iteration, through the whole population (p ) or will die out (p ) and 1− 1 is the chance that the mutation survives. + − ∆Ki then satisfy Thesecondtermisthetimerequiredforthe mutationto spread through the population (for each mutant, 2∆Ki 1+∆Ki pN(N −m)=(∆K )2mpN(m). (2) is the average number of surviving children after one it- + i − eration), which is negligible for small N. Since the first term, which dominates, is inversely proportional to the Including the probability that a mutant will be created distancefromWT(i.e. thenumberoftermsinthesum), and selected in the first place, we find the rate R at i the average speed of the DNA pool will be proportional which a single mutation of type i will cause the whole to that distance. Fig. 2 shows the distance from WT as population to be replaced: afunctionoftime,andexceptinthebeginning,itcanbe Nr ∆K Nr (∆K )2N −(∆K )2N−1 almost perfectly fitted to an exponential—this is similar Ri = 3 · ∆K +i 1pN+(1)= 3 · i(∆K )2N −i1 to the result in [8], which is for a very different regime. i i The corrections for the beginning are precicely what we Nr 1− 1 ,∆K >1 wouldexpectfromthe secondterm: Itreducesthespeed ≈ 3 ∆Ki i (3) when the speed is large and causes a short delay. Our ( (cid:16) 0 (cid:17) ,∆K <1 i result for the evolution speed in the random walk (RW) regime is very similar to the one found in [6] for a birth- where Eq.(2)andthe conditionpN(m)+pN(m)=1are + − rate model: As long as the second term of equation 4 is used,andtheapproximationisvalidfor|2Nlog∆K |>> i negligible, the evolution speed of the DNA population is 1. The population again describes a random walk, but proportionalto the mutation rate and to the population this time the energylandscape is 2N−1 times the bind- size N. ing energy of a single DNA molecule—in equilibrium, Given a sequence S, the chance that mutation i will Prob(N copies of S)∝e−β(2N−1)E(S). be the next surviving favorablemutation (in the limit of small Nr and large N) is simply 1− 1 § Mvoulvteadti.on “type” specifies both the position and the bases in- PRWmut(S,i)= j1−∆K∆i1Kj , (5) P 4 0.0015 r Monte Carlo results I 10−4 10−5 10−6 10−7 Best fit 1 1.1·1019 2.4·1024 7.5·1029 2.9·1035 path 0.001 2 1.2·1019 2.9·1024 9.9·1029 4.2·1035 n of 4 1.6·1019 4.6·1024 1.8·1030 9.0·1035 actio 10 6.2·1019 3.8·1025 2.8·1031 2.3·1037 ed fr 15 1.4·1020 1.4·1026 1.4·1032 1.4·1038 bserv0.0005 20 1.5·1019 1.5·1025 1.5·1031 1.5·1037 O TABLE II:Lower boundsfor themean field regime. 0 0 0.0005 0.001 0.0015 Calculated probability of path From Fig. 1 we found that for large N, almost all the WTisproducedthroughminimumpaths. Lookingatthe FIG.3: Probabilitiesofindividualminimumpathsinrandom full parameter range we have explored, for r =10−4 and walk regime—observed vs. predicted. r = 10−7, N = 512, I =20 the contribution to WT through minimum paths I =1, averaged over 131,072 runs. isabout98%forthelargestN wecansimulate,andforall other parameters it’s well above 99% for large N—thus we expect any set that includes all the minimum paths where the sum is over all possible favorable mutations to give a reasonable result. To get even more accurate to S. The chance of a given path π will thus be results, we allow one erroneous mutation (this already P (π) = P (Sπ,π ). Fig. 3 compares RWpath n RWmut n n increases the number of paths from 720 to 156,960) and these predictedvalueswithsimulationresults(whichare verify that this is only a minor correction. Q Poisson distributed), and as we can see, they agree very Let nSS,π be the amount of WT produced from one well—thetotalobservednormalizedvarianceis745.2,vs. 0 moleculethroughπ,i.e.numberofWTsequencesattime expected720. TheapproachtothesmallNrlimitforthe t originating from a single molecule of sequence SS ex- fraction of minimum paths out of all the paths used, is M isting at the beginning of the experiment, through any shown in Fig. 1. specific path π. Once we know all the chemical poten- When I >1, the mutations withlarge∆K willbe sig- tials µ(t) used for the selections, we can use a set of re- nificantly more likely to happen first, as they are more cursionrelationstofindtheaveragehnSS,πiandthevari- likely to survive the now stronger selection. The prob- 0 ance Var(nSS,π), as well as the probability P (nSS,π) = ability of each path can still be calculated (in the same 0 + 0 limitasbefore),butthisisfarmorecomplicatedthanfor P(nS0S,π > 0) that a single SS molecule at time t = 0 the case of I =1. will yield some nonzero amount of WT at tM through π The randomwalk approachclearlydoesn’t workwhen (details in the appendix). the second term of equation 4 exceeds the first term: A Inthemeanfieldregime,the amountofWTproduced mutationwillnothavetime tospreadthroughthe whole through each minimum path should be relatively con- population before the next favorable mutation occurs. stant from one experiment to the next, which gives us a Large N: Mean Field. For sufficiently large N, we lower bound for the population size: expect mean field (MF) behaviour—there is no signifi- cant difference between one experiment/simulation and N >maxVar(nS0S,π). (6) another, and each path contributes a fixed fraction to π hnSS,πi2 0 the total WT DNA produced. In principle, in the limit of N → ∞ the fraction of population S at time t, TableIIshowstheseboundsforaselectionofparameters f(S,t) can be traced from iteration to iteration and the I andr. ThedependenceonrisroughlyN ∼r−m,where chemical potential determined from the selection crite- m = 6 is the number of different bases between SS and rion f(S,t+1/2)/[1+exp(E(S)−µ(t))]=2−I,where WT. S f(S,t+1/2) is the fraction of S just after the amplifica- Thespeedofevolutioninthemeanfieldregimecanbe P tion but before the selection. This would require tracing easily estimated in the case of I = 1. With I = 1, half all possible sequences—a tedious and often impractical thepopulationisremovedduringeachselection,thusthe task. Inpractice,itsufficestorestrictourselvestoalim- chemical potential for the selection will be close to the ited set of paths. With a fixed set of paths, we can find median binding energy in the population, and any DNA thefractionoftheDNAateachstepofeachpathateach with significantly higher binding energy than the major- iteration, as well as the chemical potential used for each ity will almost certainly survive. In the very first itera- selection. The calculations proceed the same way as for tion of the evolution process, a fraction r/3 of the DNA the finite N simulations, except there is no randomness will get each of the m “correct” mutations. In the fol- involved,andwe discardthe DNA thatdepartsfromthe lowing iterations, almost all of these improvedDNA will chosen paths. survivethe selections,andtheamountofimprovedDNA 5 π) 10-2 π) 10-2 (Wpath (Wpath 6 PR10-3 PR10-3 ECrornosrsta rnetl astpiveee dto 1 W/TT m Walk: 10-4 m Walk: 10-4 ors4 Rando 10-150-4 Mean Fie1ld0-:3 <n0SS,π> 10-2 Rando 10-5 10M-18iddle G10r-o16und: 1P0+-M14F(n0SS1,0π-)12 Err2 πSS,>Mean Field: <n0111000---423 10-18 10-16 10-14 10-12 Middle Ground Simulations1110100-01--0428 10-18 10-16 10-14 10-12 0 Middle Ground: PMF(nSS,π) Middle Ground: PMF(nSS,π) 0 20 40 60 80 100 + 0 + 0 Iteration FIG. 5: The predicted average contribution from each path FIG.4: AveragenumberofdifferentbasesbetweenDNApool for the various regimes, plotted against each other and, for and WT as function of time. r=10−4, N =239 ≈5.5·1011, the middle ground, compared to simulation results (average I =1. of 223 simulations with I =1, N =224 and r=10−7). will thus roughly be doubled in each iteration. After The Middle Ground. The argumentused for the evo- log( 3 ) lution speed in the mean field regime is qualitatively T ≈1+ mr (7) 0 log(2) valid for all N > 1 (the factor T in the seed fraction r does not fully apply until N > 1 ), thus we expect a iterations,theimprovedDNAwillhavereplacedtheorig- r2 smooth transition from the RW evolution behavior, ran- inalpopulation,i.e. mostoftheDNAwillonlyhavem−1 dom with on averageexponential approach,to the mean errors relative to WT. fieldbehavior,constantspeed. However,whileallthekey Once the whole population has been improved by one quantities shown in fig. 1 vary smoothly with the pop- base, the process is repeated. However, there have al- ulation size, there is a significant region where fWT ≈1 readybeenT iterationinwhichtheimprovedDNAcould min 0 and fWT > 0.5, i.e. typically a single minimum path improve further through mutations, and this even more best dominates. In this region the average contribution of improved DNA has been amplified at the same rate as the different paths vary far more than in the RW and the regular improved DNA, i.e. the “seed fraction” of MF regimes (Fig. 5), and the region can be considered improved DNA will now be 12T0(m3−1)r (the factor 21 is a third parameter regime. Fig. 6 shows the result of a because only the copy gets mutated in each amplifica- simulation in the middle ground regime,and while there tion). The time required to improve the DNA pool by inthis simulationare 11minimum paths that contribute one base is thus roughly WT, 70% of it comes from a single path. Note that all theminimumpathsusedhereare“probable”ones(fairly log( 3 ) T(m′)≈1+ m′Tr , (8) red). log(2) As a single minimum path dominates in each simu- where m′ is the number of errors left. Corrections in- lation, the average contribution of the different paths cludethatsomeoftheimprovedDNAwillbelostduring should depend strongly on how often that path domi- selection, reducing the effective amplification from 2 to nates. Given that the overall evolution behavior is sim- 2∆K , and higher order corrections to the factor T in ilat to that of the mean field regime, we can use as a t∆hKe+s1eed fraction. These and other corrections can be first guess the probabilities PMF(nSS,π) from the large + 0 addressed by considering an infinite length model. [18] N discussion (we here add the superscript MF to em- Fig.4showshowtheaveragenumberoferrorschanges phasize that these are mean field calculations). Fig. 5 with time, and we see that the evolutionspeed is almost shows the results from simulations plotted against this constant—using T(2) from equation 8 gives a very good estimate, and there is a very accurate relationship be- fit. The first improvement takes somewhat longer, as tween them (it is not linear), at least for the most prob- expected, and so does the last improvement: For most able paths—forthe less probablepaths, statisticalerrors of the DNA, the error at position 4 will be the last to are large. The middle ground region corresponds fairly be corrected, and the binding energy difference for that well to 1 < N < ( PMF(nSS,π))−1, i.e. the regime r π + 0 mutation is much smaller than for the others, thus the ends approximatelywhenthe populationis largeenough P effective amplification is significantly smaller. that, using the mean field chemical potentials, we would The effects of large I and various analytical estimates expect to find at least some WT in most simulations. for the mean field regime are discussed in the appendix. There is a very large crossover region between the MF 6 Variance within each simulation, averaged Variance of average for simulations Average t 1 a) max 0.8 0.6 0.4 0.2 0 0 200 400 600 800 1000 0.4 b) Variance000...231 0 0 50 100 150 200 0.4 c) 0.3 0.2 0.1 0 Mutation energy scale (edges): 0 20 40 60 80 100 ∆K Iteration 0 1 2 3 4 5 6 7 8 9 10 11 12 FIG. 7: The variance of thenumberof errors relative to WT Accumulated weight size scale (nodes & leaves): as a function of the time (iteration). r=10−4 and I =1 for LNeoadveess:: Σn( t πn,(tπ,)t) all graphs, while N =29 for a), N =218 for b) and N =232 0 1 2 3 4 5 6 7 8 9 M for c). Probability color scale (leaves): log (P MF(n SS,π)) −11 −10 −9 −8 −7 10 + 0 1e+13 t=29 FIG.6: TreeofevolutiontoWT(therootisSS)forasample t=30 simulation with N = 220, r = 10−4, and I = 1. Node size 8e+12 t=31 t=32 showsthesumoveralltheiterationsoftheamountofthese- t=33 t=34 quence (from that path) present during the simulation. Leaf t=35 size (square) shows the amount of WT produced (all nodes er6e+12 t=36 mb t=37 annotated with a square are WT). The color of the squares nu t=38 show theprobability of that pathusing themean field calcu- al t=39 lation. Tot4e+12 tt==4401 2e+12 regime and the middle ground, and a smaller region be- tween the middle ground and the RW regime (Fig. 1). Fig. 5 also shows the predictions (or “best guess” 00 2 4 6 for the middle ground) for the various regimes plotted Errors against each other, and it is clear that they are very dif- FIG.8: DistributionofdistancesfromWTasfuntionoftime. ferent, though they are somewhat correlated. r=10−4,I =1,N =239 ≈5.5·1011 Experimental Signatures. It is difficult to com- pletely and directly test the above theoretical analy- sis experimentally—one would need to sequence a large III. CONCLUSION number of DNA. A more practical choice is to consider onlythedistancefromasequencetoWT,i.e. thenumber of positions at which they differ, and study its variance Our simulation and analysis show that in the simple indifferentregimes. Intherandomwalkregime,atmost caseofadditive binding energythe evolutionbehaviorof times the DNA pool consists of only a single sequence, DNA-protein binding can be understood quantitatively thus the varianceofthis distance is,atany giventime in andrathercompletely. Dependingonthepopulationsize anygivenrun,almostzero,butthevariancefromonerun and the mutation rate, the evolutionary process exhibits tothenextcanbeverylarge. Asweincreasethepopula- distinct behaviors in three parameter regimes. Our re- tion size, the variance within a run increases somewhat, sults are fairly general as long as the potential is mainly but the variance between runs decreases drastically, and additive and can be used to make sense of experimental above the middle ground we have almost perfect coher- data. ence (Fig. 7). Itisnoteworthythatevolutionaryprocessesviamolec- Fig. 8 shows how the distribution of the distances ular breeding such as the one discussed here are funda- changesthrough(apartof)asimulation,anditconfirms mentallydifferentfromthosewherethefitnessofanindi- ourearlierassumptions: Onlyoncethewholepopulation vidualdepends solelyonitself (e.g. onits genotype)and has been “upgraded” does further improvement start to does not depend on others in the population. In the lat- become significant, i.e. there are at most two distances ter casethe fitness landscape is fixed[5,6, 7, 16, 17]and with significant population at any one time. the more fit the better, while in the former case the fit- 7 nesslandcapeisdynamic[8]andyoujusthavetobebet- terthantheaverage—tobeevenbetterdoesnotincrease 10 yourfitness. Thetwocasescanleadto,forexample,very KNSB=0.1 K =1 differentevolutiondynamicsforlargepopulationsizes[5]. 8 KNSB=10 The additivity of the binding energy gives rise to a T NSB W sTmhoeoitnhclluasnidonscaopfea, swmhaiclhl pgerretautrlbyastiimveplnifioens-atdhdeitainvealypsairst. ative to 6 tothepotentialwouldnotchangethepicture,butwould s rel nonetheless provide insights to the cases of more general Error 4 potentials and fitness functions. We thank Qi Ouyang and Terry Hwa for very helpful 2 discussions. 0 0 50 100 150 Iteration IV. APPENDIX FIG. 9: Average distance from WT as a funtion of time for A. Approximations various values of non-specific binding. r = 10−4,I = 1,N = 233 ≈8.6·109 Mutation rates. In the main paper we assumed that all mutation rates were the same regardless of which remain, the “one mutation at a time” assumption fails. base (A,C,G or T) was mutated and which base it mu- Notethataverylargepopulationsizeisrequiredinor- tated into, while in reality these rates can be very dif- dertobecertainthatasufficientlygoodsequencewillbe ferent [19, 20, 21]. While this assumption was done to produced (exponentially increasing the further into the simplify the analysis, the mutations rates will depend non-specificbindingregimeweare),andtherandomwalk on the specific experimental conditions under which the regime no longer exists - for small N, most experiments PCRsareperformed,anditmightthusnothavesufficed would never generate the WT sequence. to choose just one set of base dependent mutation rates. Thesimulations,themeanfieldrecursionrelationsand the analysis of the random walk regime can all easily be alteredtoincludedifferentmutationrates. Notethatthe B. More on Mean Field exactresultsforthe equilibriumdistributions inthe ran- dom walk regime only hold (in the correct limit) when Mean Field Recursion Relations. Here we derive a mutationratesarethesameforoppositemutations. The set of recursionequations in the MF regime. We assume evolution speed for the mean field regime is also still es- that the number of iterations t and the chemical po- M sentially correct,although the least likely mutations will tentials µ(t) are known from a mean field simulation, as typicallyoccurlastandwilltakelonger,astheseedfrac- discussed in the main paper. tionissmaller(thisisalsoaddressedintheinfinitelength Let nS,π be the number of WT sequences at time model [18]). t t originating from a single molecule of sequence S ex- Non-specific binding. In the main paper we always M isting a time t in the experiment, through a specific started out with a DNA sequence that binds strongly path π (which takes S to WT). Let P (nS,π) be the to the protein, such that favorable mutations can be se- + t chancethatthemoleculeproducessomenonzeroamount lectedquickly. Ifwestartoutwithaverypoorsequence, of WT through π, hnS,πi be the expected amount, and or, in the simulations, increase the non-specific binding t Var(nS,π)=h(nS,π)2i−hnS,πi2 be the variance. strengthoftheprotein,thenevenhighlyfavorablemuta- t t t tionswillbeonlyweaklyselected,asnon-specificbinding Giventhesequantitiesfortimet′rightafteraselection, will dominate. we can compute the values for time t right before the Figure 9 shows how the average number of errors selection: changes through the iterations when we start with a se- quence with 10 errors (the usual ones and 4 more) and various values for the non-specific binding strength. For P+(nSt,π) = pt,SP+(nSt′,π) (9) high non-specific binding strengths, none of the muta- hnS,πi = p hnS,πi (10) t t,S t′ tions areselected strongly,and “nothing happens” until, h(nS,π)2i = p h(nS,π)2i (11) bychance,severalverygoodmutationscombine,yielding t t,S t′ asequencethatbindssignificantlymorestronglythanby 1 p = (12) non-specificbinding alone-the DNA poolthenabruptly t,S 1+eE(S)−µ(t) improvesby allthose mutationsatonce,andthe process then proceeds as usual. Also, the original sequence now Similarly, given the quantities for t′ right after a cycle containsseveral(3)veryweakerrors,andonceonlythese ofPCR,wecancomputethevaluesfortimetrightbefore 8 the PCR: 1025 P (nS,π) = P (nS,π) 1−(1−r)LP (nS,π) 0 t 0 t′ + t′ −r(1−rh)L−1P (nS′,π′) (13) eld RAneaculyrtsicioanl erestlaimtioantes 3 + t′ n fi1020 a hnSt,πi = hnSt′,πi+(1−r)LhnSt′,πi i or me +3r(1−r)L−1hnSt′′,π′i, (14) bound f wer 1015 where S′ is the sequence reached by performing the first Lo mutation in path π on S, π′ is the remaining path, and P (x)=1−P (x). 0 + The first term of equation 13, P0(nSt′,π), is the chance 1010 0 50 100 150 200 250 300 thattheoriginal,unchangedDNAmoleculedoesnotpro- # PCRs per iteration (I) duce any WT at time t (through path π). The second M term is the corresponding chance for the possibly mu- FIG. 10: Lower bound for mean field regime calculated from tated copy: If it was not mutated (chance (1−r)L), the recursion relations vs. analytical estimate, as function of I. chance of producing WT through path π is P (nS,π). If r=10−4. + t′ it was mutated exactly the right way, i.e. according to the first step of path π (chance r(1−r)L−1), the chance 3 creasingI beyondthisonlyservestoincreasetheeffective of producing WT through the remaining part π′ of path rate of mutation. Table III shows how t depends on I, π is P+(nSt′,π′), while any other mutation would mean it and this matches our expectations well.M is off the right path. Equation14 is straightforward: hnSt′,πi is the contribu- I 1 2 4 6 10 15 20 30 50 100 tion from the unchanged molecule, and the other terms tM 151 80 46 36 28 25 22 22 21 20 are for the possibly mutated copy. The equation for h(nS,π)2ifollowsimmediately. Theseequationscoverthe TABLE III: Numberof iterations for mean field simulations. t casewhereweonlyallowsinglemutations;thereareaddi- r=10−7 for all values. tionaltermswhenweallowmultiplemutations(i.e. mul- tiple steps along the path). ForlargeI,theamplificationofagivenDNAsequence We trivially know that P (nWT,−) = hnWT,−i = in an iteration is approximately the ratio of its binding h(nWT,−)2i = 1, and the val+uestMare zero fortMall other constant to that of the typical DNA in the population, tM thus DNA that is much better than the average will be sequencesatthattime. Byapplyingtheaboveequations amplified far more strongly. In particular, DNA that for all selections and PCRs in the experiment, in reverse mutatestoWTduringtheveryfirstiterationwillreceive order, we find the desired quantities for time t=0. The results of this calculation for r = 10−4 and I = 1 are the maximum possible amplification, and this process is the one that contributes the most WT in a simulation. shown in Fig. 11. We can estimate this contribution and the corrections Large I. As discussed for I = 1, evolution works by due to DNA that does not quite reachWT the first iter- first producing a seed fraction of improved DNA which ation, as well as the variation from this process. These is then amplified through subsequent iterations. Unlike calculationsarecarriedoutbelowandgiveusanestimate the I = 1 case, we can no longer consider all improved for the lower bound of the mean field regime for large I: DNA (with a given number of errors) to be equivalent: Themaximumamplification(inoneiteration)forDNA Var(nπ (t )) 3 m that is better than the majority of the population is N >max WT M ∝ . (15) 2I. However, for this amplification to actually occur, π hnπWT(tM)i2 (cid:18)Ir(cid:19) the binding constant of the DNA that is to be amplified Figure 10 compares this estimate (equation 21) to the must be at least2I times that of the typical DNA in the (highly accurate) values found from the recursion rela- population. For 2I ≫∆K, the amplification will simply tions (eqs. 9-14). be ∆K per iteration. Mean field estimates: Large I. When I is suffi- Thehighestpossibleamplificationinourmodelisthus ciently large, the chemical potential for the selections is K(WT) ≈215. For very small I we expect the number of much lower than even the binding energy of WT, thus K(OS) iterations required to improve by one base (equations 7 the chance that a DNA molecule will survive a selection and8),andthus the totalnumberofiterationst , to be is proportional to the binding constant of the sequence. M approximately inversely proportionalto I (this holds for Thechancethateachinitialerrorwillbecorrectedduring all I for the infinite length model [18]). As we increase any given iteration (I cycles of PCR) is Ir, but for each 3 I,thevariousamplificationsaregraduallysaturated(ap- iteration that mutation i has not happened, the DNA proach ∆K), and for I > 15 they are all saturated. In- will be amplified by a factor ∆K less than those that i 9 have reached WT. The number of WT molecules cre- Mean field estimates: I = 1. A DNA molecule that ated (through all paths), as a function of time, is ap- hashigherbindingenergythantheaveragewilltypically proximately surviveaselection,andviceversa. Thesimplestestimate wecanuseforthebehavioroftheaveragebindingenergy m Ir 1 hn (t)i≈f (t)2It−m , is a linear change from the binding energy of SS to that WT I 3 1−(∆Ki)−1 of WT). (cid:18) (cid:19) i (cid:18) (cid:19) Y (16) Given that a DNA molecule mutates from SS to WT where i runs over the mutated positions, and fI(t) is a during the tM iterations of the simulation, we estimate functiondescribingtheratioofWTmoleculesthatwould that it will survive all the selections iff the number of have survived all the selections so far. The powers of 2 good mutations is always at least mt. The number of tM is the total number of PCRs minus the number of cycles waysthis canhappenisapproximatelym(tM)m−1: There in which the DNA mutated (ignoring simultaneous mu- m mustbeamutationatthefirstiteration,givingthefactor tations), and the last factor corrects for mutations that m and leaving approximately (t )m−1 combinations for M didn’t happen the first iteration. the remaining mutations. Given m points randomly dis- The main source of variance is whether or not a start- tributedonacircleoflengthm,thereiswithprobability ing molecule produces a WT molecule that survives the 1 exactly one point for which, when traveling clockwise first selection: fromthatpoint,youwillalwayshavevisitedmorepoints Var(nπ (t ))≈h(nπ (t ))2i than the length you have traveled (assuming tM ≫ m, WT M WT M thiscontinuumcaseshouldbegoodenough)—thus,the ≈ 2I(tM−1)ffI((t1M)) 2 2Im−!m I3r mfI(1) (17) c1h.ance that we started from the right point/mutation is (cid:18) I (cid:19) (cid:18) (cid:19) m 2m 3 m 1−(∆Ki)−1 2 Fromthis,thenumberofWTmoleculesproducedfrom ≈ (18) a single SS molecule in tM iterations is about m! Ir ∆Ki (cid:18) (cid:19) i (cid:0) (cid:1) r m Y hn (t )i≈2tM−m (t )m−1, (22) WT M M 3 ∆Ki wherewehaveusedfI(1)= i2I —thisassumesthat, where the powers of two are(cid:16)the(cid:17)number of iterations before the selection for theQfirst iteration, most of the wheretheDNAdidn’tchange(andthuswasduplicated). DNA sequences have not been mutated, which fixes the The simulation ends when there are about as many chemical potential µ . The first factor of eq. 17 is the 1 WT molecules as there were SS molecules at the begin- average number of WT molecules that a WT molecule ning: that survives the first selection would produce by time t , squared, and the remaining factors are the chance 1 ≈ hn (t )i M WT M that such a molecule is produced; the expected number ⇓ (23) of WT molecules produced through a single path during 3 theI cyclesofPCRinthefirstiterationtimesthechance t ≈ m+mlog −(m−1)log (t ). M 2 MR 2 M fI(1) that eachWT molecule survivesthe first selection. (cid:18) (cid:19) WeeliminatedfI(tM)fromequation17byusingequa- This is very similar to T + m−1 T(m′), using equa- tion 16, setting hn (t )i ≈ 1. From this we can get 0 m′=1 WT M tions 7 and 8 and log (t ) ≈ mlog (T) — most of the a rough estimate for the lower bound of the mean field 2 M P 2 discrepancy corresponds to the higher order corrections regimebyusinghnπWT(tM)i≈ m1!,asforI =1. However, −1 found for the infinite length model [18]. for large I we can do better by calculating the relative log(2) Regarding the variance Var(nπ (t )) of the number contributions of individual paths directly: WT M of WT molecules produced from a single SS molecule Ir m through a single path π, any duplication that occurs be- hnπ (t)i ≈ f (t)2It−m g (π)∆K(π) (19) WT I 3 m foretheWTisreachedwillonlyyieldafactorof2—-the (cid:18) (cid:19) twomoleculesmustindependently developtotheWT— 1 i−1 g (π ) 1 while duplication once the WT is reached gives a factor j j g (π) = + , (20) i i! (i−j)!∆K(π)−1 of 4. Because of this, the variance is dominated by the Xj=1 extremely rareevents where allthe goodmutations hap- pen very early (the first m+∆ iterations), and the WT where π means the last j steps of path π, and ∆K(π) j moleculesarethensimplyduplicateduntilthesimulation is the total change in binding constant, i.e. the prod- ends: uct of the ∆K for the individual mutations of path π. Knowledge of initial sequence is implicit. Var(nπ (t ))≈hnπ (t )i2 WT M WT M Usingequation19toeliminatef (t)inequation17,we I r m (m−2+∆)! find the bound for the MF regime directly: ≈ 22tM−2m−∆ (24) 3 ∆!(m−2)! 2m 3 m ∆K(π) ∆X≥0 r m(cid:16) (cid:17) N >mπax m! (cid:18)Ir(cid:19) (gm(π))2 (21) = 22tM−m−1 3 . (25) (cid:16) (cid:17) 10 If the DNA is mutated to WT in the first m+∆ iter- regime. ations (∆ is the number of“idle” iterations),it will then For r = 10−4 we estimate t ≈ 65 and thus M be duplicated tM −m−∆ times, giving a contribution Var(nπ (t ))≈5.8·1010. Theactualvaluesaret =83 22tM−2m−2∆. Also, the DNA will be duplicated during and VaWrT(nπM (t ))≈1.5·1012. The somewhat oMdd fact each idle iteration, but this merely increases the chance WT M that t is pretty far off, while the variance, which de- M that a molecule will reach WT, and thus gives a factor pends very strongly on t , is reasonably close, is not 2∆. M entirely unexpected: Towards the end of the simulation, The number of ways the mutations can happen is manyWT moleculeswillbe selectedaway(asthe chemi- (m−2+∆)!, assuming that there must be a mutation the ∆!(m−2)! calpotentialforthe selectionapproachesthe bindingen- firstiteration—wehaveignoredthepossibilityofmulti- ergy of WT), which will increase t . However, this will M ple mutations in one cycle of PCR, which is a significant also affect the variance correspondingly, such that using correction. Equation 25 follows by using the solution from equation 23 in equation 25 will still give the right value. Considering that we have used an n+1 extremely simple approximation for selection (and com- (n+∆)! x−∆ = x−n . (26) pletely ignored the ∆K for the various mutations), the ∆!n! ∆≥0 n≥0 estimate seems very reasonable. As a check, when using X X very large energies for the mutations (while keeping the We can then use the average hnπ (t )i = 1 to esti- ratios fixed), the simulations yield t =65 for the mean WT M π m! M mate (very roughly) the lower bound for the mean field field regime. [1] Bittker, J.A., Phillips, K.J. & Liu, D.R. (2002) Curr. 33. Opin. Chem. Biol. 6, 367-374. [21] Lin-Goerke, J., Robbins, D. & Burczak, J. (1997) [2] Landweber, L.F. (1999) Trends Ecol. Evol. 14, 353-358. Biotechniques 23, 409-412. [3] Arnold, F.H., ed. (2001) Evolutionary Protein Design (Academic Press, San Diego). [4] Novella, I.S., Duarte, E.A., Elena, S.F., Moya, A., Domingo, E. & Holland, J.J. (1995) Proc. Natl. Acad. Sci. USA92, 5841-5844. [5] Tsimring, L., Levine, H. & Kessler, D.A. (1996) Phys. Rev. Lett. 76, 4440-4443. [6] Kessler, D.A., Levine, H., Ridgway, D. & Tsimring, L. (1997) J. Stats. Phys. 87, 519-544. [7] Ridgway,D.,Levine,H.&Kessler, D.A.(1998) J. Stats. Phys. 90, 191-210. [8] Peng, W., Gerland, U., Hwa, T. & Levine, H. (2002) cond-mat/0204117. [9] Dubertret, B., Liu, S., Quyang, Q. & Libchaber, A. (2001) Phys. Rev. Lett. 86, 6022-6025. [10] Fields, D.S., He, Y., Al-Uzri, A. & Stormo, G.D. (1997) J. Mol. Biol. 271, 178-194. [11] Stormo, G.D. & Fields, D.S. (1998) Trends in Biochem- ical Sciences 23, 109-113. [12] Man, T.-K.& Stormo, G.D. (2001) Nucl. Acid. Res. 29, 2471-2478. [13] Record,M.T.Jr.,deHaseth,P.L.&Lohman,T.M.(1977) Biochemistry 16, 4791-4796. [14] Winter,R.B.&vonHippel,P.H.(1981)Biochemistry20, 6948-6960. [15] Frank, D.E., Saecker, R.M., Bond, J.P., Capp, M.W., Tsodikov, O.V., Melcher, S.E., Levandoski, M.M. & Record, M.T. Jr. (1997) J. Mol. Biol.267, 1186-1206. [16] Peliti, L. (1997) cond-mat/9712027. [17] Woodcock,G.&Higgs,P.G.(1996) J. Theor. Biol.179, 61-73. [18] Kloster, M. unpublished. [19] Shafikhani, S., Siegel, R., Ferrari, E. & Schellenberger, V.(1997) Biotechniques 23, 304-310. [20] Cadwell, R.& Joyce, G. (1992) PCR Meth. Appl. 2, 28-