Simple estimation of absolute free energies for biomolecules F. Marty Ytreberg∗ and Daniel M. Zuckerman† Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15261 (Dated: February 2, 2008) One reason that free energy difference calculations are notoriously difficult in molecular systems is due to insufficient conformational overlap, or similarity, between the two states or systems of interest. The degree of overlap is irrelevant, however, if the absolute free energy of each state 6 can be computed. We present a method for calculating the absolute free energy that employs a 0 simple construction of an exactly computable reference system which possesses high overlap with 0 thestateofinterest. Theapproachrequiresonlyaphysicalensembleofconformationsgeneratedvia 2 simulation,andanauxiliarycalculation ofapproximatelyequalcentral-processing-unit(CPU)cost. Moreover, the calculations can converge to the correct free energy value even when the physical n ensembleisincompleteorimproperlydistributed. Asa“proofofprinciple,”weusetheapproachto a J correctly predict free energies for test systems where theabsolute values can be calculated exactly, and also to predict the conformational equilibrium for leucine dipeptidein implicit solvent. 8 1 PACSnumbers: pacs Keywords: freeenergy,entropy ] h p - I. INTRODUCTION a harmonic solid in Cartesian coordinates as a reference o system.17 i b Knowledgeofthefreeenergyfortwodifferentstatesor Other approaches to calculate the absolute free ener- . s systemsofinterestallowsthecalculationofsolubilities,1,2 gies of molecules have been developed. Meirovitch and c determines binding affinities of ligands to proteins,3,4 collaborators calculated absolute free energies for pep- i s and determines conformational equilibria (e.g., Ref. 5). tides in vacuum, for liquid argon and water using the y Freeenergydifferences(∆F)thereforehavepotentialap- hypothetical scanning method.18,19 Computational cost h p plication in structure-based drug design where current has thus far limited the approach to peptides with sixty [ methods rely on ad hoc protocols to estimate binding degrees of freedom.20 The “mining minima” approach, affinities.6,7 developedbyGilsonandcollaborators,estimatesthe ab- 2 solute free energy of complex molecules by attempting v Poor “overlap,” the lack of configurational similarity to enumeratethe low-energyconformationsandestimat- 5 between the two states or systems of interest, is a key ing the contribution to the configurational integral for 9 cause of computational expense and error in ∆F calcu- 0 lations. The most common approach to improve overlap each.21,22 Anharmonic effects can be included.23 The 8 mining minima method can, in principle, include po- in free energy calculations (used in thermodynamic inte- 0 tential correlations between the torsions and bond an- gration, and free energy perturbation) is to simulate the 5 gles or lengths, and uses an approximate method to system at multiple hybrid, or intermediate stages (e.g., 0 compute local partition functions. Other investigators / Refs. 8,9,10,11,12). However, the simulation of interme- s haveestimatedabsolutefree energiesformolecules using diate stages greatly increases the computational cost of c harmonicorquasi-harmonicapproximations,23,24,25 how- i the ∆F calculation. s ever,asdiscussedinRefs.23and24localminimacanbe y Here, we address the overlap problem by calculating deviate substantially from a parabolic shape. h the absolute free energy for each of the end states, thus p avoiding the need for any configurational overlap. Our We introduce, apparently for the first time, a refer- : ence system which is constructed to have high overlap v method relies on the calculation of the free energy dif- withfairlygeneralmolecularsystems. Theapproachcan i ference between a referencesystem (where the exactfree X makeuseofeitherinternalorCartesiancoordinates. For energy can be calculated, either analytically or numeri- r cally) and the system of interest. biomolecules,usinginternalcoordinatesgreatlyenhances a theaccuracyofthemethodsinceinternalcoordinatesare Such use of a reference system with a computable tailoredto the descriptionof conformations. Further, all free energy has been used successfully in solids where degrees of freedom and their correlations are explicitly the reference system is generally a harmonic or Einstein included in the method. solid,13,14 andliquidsystems,wherethereferencesystem isusuallyanidealgas.15,16 Theschemehasalsobeenap- Our method differs in several ways from the impor- pliedto molecularsystemsby StoesselandNowak,using tant study of Stoessel and Nowak:17 (i) we use internal coordinates for molecules which are key for optimizing the overlap between the reference system and the sys- tem of interest; (ii) we may use a nearly arbitrary ref- ∗E-mail: [email protected] erence potential because only a numerical reference free †E-mail: [email protected] energy value is needed, not an analytic value; (iii) there 2 is no need, in cases we have studied, to use multi-stage Ignoring the constant is not a limitation since, for the methodology to find the desired free energy due to the conformational free energies studied here, the term can- overlap built into the reference system, cels for free energy differences. We consider this report a “proof of principle” for our referencesystemmethod. After introducingthe method, itistestedonsingleanddouble-welltwo-dimensionalsys- B. The reference energy and its normalization tems, and on a methane molecule where absolute free energy estimates can be compared to exact values. The The trivial identities of Eq. (3) suggest that arbitrary methodisthenusedtocompute the absolutefree energy reference systems can be used in our approach. To be of the alpha and beta conformations for leucine dipep- concrete and anticipate the procedure used, our discus- tide(ACE-(leu)2-NME)inimplicitsolvent,usingallone- sion below will assume that a finite-length simulation of hundred fifteen degrees of freedom, correctly calculating the system of interest has been performed—from which the free energy difference ∆Falpha→beta. Extensions of histograms of the coordinates have been generated. For the method to larger systems are then discussed. the molecular systems studied in this report, ordinary Langevindynamicssimulationsareperformedusingstan- dard forcefields. The reference potential energy can be II. REFERENCE SYSTEM METHOD constructed from a wide variety of histograms, as dis- cussed below. Denoting the computed histograms over A. The fundamental relations all coordinates as P(~x), we define U (~x)≡−k T lnP(~x), (4) The absolute free energy of the system of interest ref B (“phys” for physical) is defined using the partition func- where P(~x) is the normalized probability of a particular tion Z phys configuration(correspondingto a set ofhistogrambins); see Fig. 1. For example, if all coordinates are binned as F =−k T lnZ = phys B phys independent, then −k T ln d~xe−β Uphys(~x)+Kphys(~x) , (1) B Ncoords (cid:20)Z (cid:0) (cid:1)(cid:21) P(~x)= P (x ), (5) i i where T is the system temperature, β = 1/kBT, Uphys iY=1 andK are,respectively,thephysicalpotentialenergy phys where P (x ) is the binned probability distribution (his- (i.e., simulation forcefield) and the kinetic energy, and i i togram)for the ith coordinate, and there areN de- ~x represents the full set of configurational coordinates coords grees of freedom in the system. If all coordinates are (internal or Cartesian). The kinetic energy term can be binned as pairwise correlated, then integrated exactly to obtain26 P(~x)= P (x ,x ), (6) Zphys = h31N σ8Cπ2◦ N 2πkBTmi 3/2 d~xe−βUphys(~x),(2) {Yi,j} ij i j " # i=1 Z Y(cid:0) (cid:1) where {i,j} is a set of pairs in which each coordi- where mi is the mass of atom i, h is Planck’s constant, nate occurs exactly once, and Pij(xi,xj) is the proba- C◦ is the standard concentration, σ is the symmetry bility for two particular coordinate values from the two- number,22 N is the number of particles in the system, dimensional histogram for these coordinates. It is also andtheintegralisdefinedtobetheconfigurationalparti- possible to use an arbitrary combination of independent tionfunction. Formethodusedinthisstudytheabsolute and correlated coordinates—so long as each coordinate free energyof the system ofinterestis calculatedusing a occurs in only one P factor. reference system (“ref”), and the following relationships Weemphasizethatthefinalcomputedfreeenergyval- are used, uesincludeallcorrelationsembodiedinthetruepotential U . This is true regardless of whether or how coordi- phys Zphys nates are correlated in the reference potential. Z =Z , phys ref Z A schematic of how U is computed for a one- ref ref F =F +∆F , (3) coordinate system is shown in Fig. 1. The coordinate phys ref ref→phys histogram is first determined (solid bars) using a simu- where F is the trivially computable free energy of the lation trajectory; then Eq. (4) is used to calculate U ref ref reference system, and ∆F is the free energy dif- (dashed bars). A possible physical potential is also in- ref→phys ference between the referenceand physicalsystemwhich cluded(dottedline)forcomparisontoU . Forasystem ref can be calculated using standard techniques. containing many degrees of freedom, the process is car- For this report, we include estimates of the configu- riedoutforallcoordinates,basedonEq.(5),(6)orother rational integral only, i.e., the leading constant factor in correlationscheme. U isthesumofalltheappropriate ref square brackets in Eq. (2) is not included in our results. terms, consistent with Eq. (4) and the binning choice. 3 10 perturbation8 fromthe referenceto the physicalsystems 8 Histogram from simulation, P F =−k T ln e−β Uphys−Uref phys B 6 ref D (cid:0) (cid:1)E . 1 Nref 4 Physical potential energy Uphys =−kBT ln e−β Uphys−Uref (10) N ref ! i=1 (cid:0) (cid:1) X 2 where N is number of structures in the reference en- ref . 0 semble, the “=” symbol denotes a computational esti- mate, and h...i represents a canonical average using ref -2 structures from the reference ensemble only. It is impor- tanttonotethat,whileotherchoicesforcomputingF -4 Reference energy U phys ref are possible, such as Bennett’s method,5,27,28,29,30,31 Eq. 0 2 4 6 8 10 12 (10) is the only choice which relies solely on configura- Coordinate tions drawn from the reference ensemble which are, by construction, sampled canonically and without dynam- FIG. 1: Depiction of how the reference potential energy U ref ical trapping. We also note that “uni-directional” esti- is calculated for a one-coordinate system. First the coordi- nateisbinned,creatingahistogram P (solidbars)populated mates like that of Eq. (10) have been analyzed exten- according to the physical ensemble. Then Eq. (4) is used to sively (e.g., Refs. 32 and 33) and may be amenable to calculate reference energies for each coordinate bin (dashed error-reductiontechniques;34,35however,wehaveapplied bars). Ahypotheticalphysicalpotentialisshownasadotted the perturbation approach here to keep our initial anal- curve for comparison to Uref. For a multi-coordinate system ysis as straightforward as possible. Staged free energy Uref would be the sum of the single-coordinate reference po- methods like thermodynamic integration36 and adaptive tential energies. integration37 may also be used. The free energy of the reference system can now be D. The physical ensemble and construction of the calculated via the reference partition function reference system Z = d~xe−βUref(~x) = d~xP(~x). (7) The method used in this report relies on simple his- ref Z Z tograms for all degrees of freedom (in principle, with internal or Cartesian coordinates) based on a “physical In practice, we normalize the histogram for each coordi- ensemble” of conformations generated via molecular dy- natetooneindependentlybysummingoverallhistogram namics, Monte Carlo or other canonical simulation. The bins. So, for a particular bond length r , that is binned 1 histograms define a reference system with a free energy as independent, we account for the Jacobian factor (see that is trivially computable, as described in Sec. II. We Eq. (11)) by defining ξ =r3/3, and then 1 emphasize that an analytical solution need not be avail- able; a precise numerical evaluation is more than ade- Zξ = dξ P(ξ)= ∆ξ P(ξ)=1, (8) quate. Awell-sampledensembleofreferencesystemcon- Z NXbin figurationsisthenreadilygeneratedandusedtocompute the free energy difference via Eq. (10). where∆ξ isthehistogrambinsize,andN isthe num- bin The first step in our approach to constructing the ref- ber of bins in the r histogram. (Binning choices are 1 erence system is to generate a physical ensemble (i.e., discussed below.) Similar relationships are used for all a trajectory) by simulating the system of interest us- coordinates. Thus the reference free energy F =0 and ref ing standardmolecular dynamics, Monte Carlo, or other Eq. (3) becomes canonicalsampling techniques. The trajectoryproduced by the simulation is used to generate histograms for all F =∆F (F ≡0) (9) phys ref→phys ref coordinates as described below. In creating histograms, note that constrained coordinates, such as bond lengths involvinghydrogensconstrainedbyRATTLE,38neednot C. Using the physical and reference ensembles bebinnedsincethesecoordinatesdonotchangebetween configurations. Such coordinate constraints are not re- With the reference potential energy U defined in quired in the method, however. ref Eq. (4) and the physical potential energyU given by If internal coordinates are used (such as for the phys the forcefield, which may include implicit solvation en- molecules in this study), care must be taken to account ergies, Boltzmann-distributed snapshots from both the fortheJacobianfactors. Usinginternalcoordinateswith reference and physical systems can be utilized to calcu- bond lengths r, bond angles θ and dihedrals ω, the vol- late F =∆F . Here, we simply use free energy ume element in the configurational integral of Eq. (2) is phys ref→phys 4 given by23 System Exact Estimate two-dimensional single-well39 -1.1443 -1.1449 (0.0003) N−1 N−2 N−3 d~x= r2dr sinθ dθ dω = two-dimensional double-well39 5.4043 5.4058 (0.0003) i i i i i i=1 i=1 i=1 Methane molecule 10.932 10.934 (0.002) Y Y Y N−1 N−2 N−3 d(r3/3) d(−cosθ ) dω , (11) TABLEI: Absolutefreeenergyestimatesobtainedusingour i i i reference system approach for cases where the absolute free i=1 i=1 i=1 Y Y Y energycanbedeterminedexactly. Inallcases,theestimateis where N is the number of atoms in the system. Thus, inexcellentagreementwiththeexactfreeenergy. Theuncer- whenusinginternalcoordinates,the simpleststrategyto tainty,shown in parentheses (e.g., 3.14(0.05)=3.14±0.05), account for the Jacobian is to bin according to a set of is the standard deviation from five independent simulations. rules: bond lengths are binned according to r3/3, bond Theresultsforthetwo-dimensionalsystemsareinkBT units angles are binned according to cosθ, and dihedrals are andmethaneresultshaveunitsofkcal/mole. Thetableshows binned according to ω (i.e., the same as Cartesian coor- estimates of the configurational integral in Eq. (2), i.e., the constant term is not included in theestimate. dinates). methodtoestimatetheabsolutefreeenergiesofthealpha E. Generation of the reference ensemble andbeta conformationsofthe 50-atomleucine dipeptide (ACE-(leu) -NME),andcomparedthefreeenergydiffer- 2 Once the histograms are constructed and populated ence obtained via our method with an independent esti- using the physical ensemble, the reference ensemble is mate. Inallcases,the free energyestimate computedby generated. To generate a single reference structure, for ourapproachis in excellentagreementwithindependent each coordinate one chooses a histogram bin according results. to the probability associated with that bin. Then a co- ordinate value is chosen at random uniformly within the bin according the Jacobian factor in Eq. (11)—e.g., for A. Simple test systems a bond length r, one chooses uniformly in the variable (r3/3). The process is repeated for every degree of free- Wefirststudiedthetwo-dimensionalsingleanddouble- dom in the system. By repeating the entire procedure, well potentials from Ref. 39, one cangenerate as many referencestructures as desired (i.e., the reference ensemble). Usingle(x,y)=(x+2)2+y2, phys 1 Udouble(x,y)= ((x−1)2−y2)2+ F. Summary of the reference system method phys 10 n 10(x2−5)2+(x+y)4+(x−y)4 . (12) In summary, the method is implemented by first con- o structing properly normalizedhistograms for all internal TableIshowstheexcellentagreementbetweentheref- (or Cartesian) coordinates based on a physical ensem- erence system estimates and the exact free energies (ob- ble of structures. An ensemble of reference structures is tained analytically) for the two-dimensional potentials then chosen at random from the histograms. The refer- used in this study, Eq. (12). The “physical” simulations ence energy (Uref of Eq. (4)) and physical energy (Uphys used Metropolis Monte Carlo with kBT = 1.0 and one fromtheforcefield)mustbecalculatedforeachstructure millionsnapshotsinthephysicalandreferenceensembles. in the reference ensemble. Finally, Eq. (10) is used to For all two-dimensional simulations, both coordinates calculate the desired absolute free energy of the system weretreatedwithfullcorrelations—i.e.,two-dimensional of interest. histogramswereused—andthebinsizeswerechosensuch The CPU costof the method, above that of the initial thatthenumberofbinsrangedfrom100-1000. Theerror “physical” trajectory, is one physical energy evaluation shown in Table I in parentheses is the standard devia- for each of the Nref reference structures, plus the less tion from five independent estimates using five separate expensive cost of generating reference structures. physicalensembles—andthusfivedifferentreferencesys- tems. Good estimates were also obtained using fewer snapshots—e.g., we obtained F = −1.142 (0.003) for III. RESULTS the single-well potential and F = 5.408 (0.007) for the double-well potential using 10,000snapshots in both the Totesttheeffectivenessofthereferencesystemmethod physical and reference ensembles. we first estimated the absolute free energy for three test Table I also shows the excellent agreement between systems where the free energy is known exactly. We thereferencesystemestimatesandtheexactvalueofthe chose the two-dimensional potentials from Ref. 39, and freeenergyformethaneinvacuum. Methanetrajectories a methane molecule in vacuum. Finally, we used the weregeneratedusing TINKER4.240 with the OPLS-AA 5 12 forcefield; and (ii) five bond angles which are correlated ol) to one another but not to the bond lengths. Thus the m 11.8 al/ exact partition function Zmeth is a product of four bond c length partition functions Z and one angular partition k 11.6 r (ys function Zθ, h p F 11.4 e Z =Z4Z , at meth r θ m ∞ esti 11.2 Zr = dr e−βUphys(r), energy 11 Zθ = πdθ1dθ2dθ3dθ4dθ5 e−Zβ0Uphys(θ1,θ2,θ3,θ4,θ5). (13) ee 10.8 Z0 Fr U (r) is harmonic and thus Z was computed an- 10.6 phys r 1 10 102 103 104 105 alytically using parameters from the forcefield. For N U (θ ,θ ,θ ,θ ,θ ) the correlations between angles ref phys 1 2 3 4 5 must be taken into account, thus Z was estimated θ FIG. 2: Absolute free energy for methane estimated by the numerically using TINKER to evaluate U in the phys referencesystemmethodasafunctionofthenumberofrefer- five-dimensional integral. We found that F = meth encestructuresNref usedintheestimate. Thesolidhorizontal −kBT lnZmeth =10.932 kcal/mol as shown in Table I. line is the exact free energy obtained by numerical integra- Methane was also used to show that the method cor- tion. Five independent simulations are shown on a log scale rectly computes the free energy even when the physical to clearly show the convergence of the free energy estimate. ensemble is incorrect or incomplete. In our studies we ResultsshownwereobtainedusingEq.(10)withone-hundred found that the correct free energy is obtained using our binsforeachdegreeoffreedom,i.e.,theestimatesfortheab- methodevenwhenthehistogramforeachcoordinatewas solutefreeenergyofmethaneinTableIarethevaluesshown here for N =1,000,000. assumedto be flat, i.e., without the use of a physicalen- ref semble (data not shown). Choosing the size of the histogram bins is an impor- forcefield.41 The temperature was maintained at 300.0 tantconsideration. Figure3showsthelarge“sweetspot” K using Langevin dynamics with a friction coefficient of wherebinsarelargeenoughtobewellpopulated,andyet 91.0ps−1andatimestepof0.5fs. Thephysicalensemble small enough to reveal histogram features. The figure was created by generating five 10.0 ns trajectories with shows results for the absolute free energy for a methane snapshotssavedevery0.1ps. Usingthe100,000methane moleculeusingten-thousandstructuresinboththephys- structuresinthe physicalensemble,the referencesystem ical and reference ensembles, Nphys = Nref = 10,000, was generated by binning internal coordinates into his- (dashed curve) and Nphys = Nref = 100,000 (solid tograms. Theabsolutefreeenergywasthenestimatedby curve). The small vertical scale of two kcal/mol and generating 100,000 structures for the reference ensemble thelogarithmichorizontalscaleemphasizethatthereisa and using Eq. (10). All coordinates were binned as in- wide range of bin sizes that produce excellent results for dependent using one-hundred bins per coordinate, thus the reference system approach. Error bars are the stan- only one-dimensionalhistogramswererequired. The un- darddeviationoffiveindependentsimulations. Thesolid certainty shown in parenthesis in Table I is the stan- horizontallineshowstheexactfreeenergyandthecurves darddeviationfromfiveindependentestimatesusingthe are free energy estimates, using Eq.(10) as a function of five separate methane trajectories—and thus five differ- thenumberofbinsusedforthehistogramsforalldegrees ent reference systems. of freedom. From this plot it is clear that one should Figure 2 shows the convergence behavior of the ref- chooseatleastfiftybins,andthatthemaximumnumber erence system method for methane. Five independent of bins that should be used depends on the number of absolutefreeenergyestimatesareshownasafunctionof snapshots in the physical ensemble—more snapshots in the number of reference structures used in the estimate. the physical ensemble means one can use more bins for Eachofthe five simulationsuse the same protocolas de- the reference system. scribed above, i.e., the absolute free energy estimates in TableIarethevaluesshowninFig.2forN =100,000. ref Methane was chosen as a test system because intra- B. Leucine dipeptide molecular interactions are due only to bond lengths and angles. In the OPLS-AAforcefield no non-bonded terms Table II shows the agreement for leucine dipep- are present in the potential energy U , and thus the tide (ACE-(leu) -NME) between the free energy differ- phys 2 exact absolute free energy can be computed numerically ence ∆F as predicted by the reference sys- alpha→beta without great difficulty. For methane, a configuration is tem method, and as predicted via long simulation. The determined by: (i) four bond lengths, which are inde- leucine dipeptide physical ensembles were generated us- pendent of eachother and all of other coordinates in the ing TINKER 4.240 with the OPLS-AA forcefield.41 The 6 12 temperaturewasmaintainedat500.0K(toenableanin- ) ol dependent ∆F estimate via repeated crossingof the free m al/ N = N = 10,000 energy barrier between alpha and beta configurations), kc 11.5 ref phys usingLangevindynamicswithafrictioncoefficientof5.0 ( hys N = N = 100,000 ps−1. GBSA42implicitsolvationwasused,andRATTLE Fp ref phys wasutilizedtomaintainallbondsinvolvinghydrogensat e, theirideallengths38allowingtheuseofa2.0fstimestep. at 11 m We calculated reference systems and computed ab- esti solute free energies of the alpha and beta conforma- nergy 10.5 Exact F tlaiotniosnsb,abseadckobnonfievteor1s0io.0nsnwsetrreacjeocntsotrriaeisn.edFoursinagllasiflmaut-- e e Methane results bottomed harmonic restraint (zero force if the torsion e Fr angleswerewithin the allowedrange,andharmonic oth- 10 10 100 1000 erwise), namely, for alpha: −105<φ<−45and −70< Number of histogram bins, N ψ <−10; and for beta: −125<φ<−65and120<ψ < bin 180. The reference system was generated using 100,000 FIG. 3: Absolute free energy for methane estimated by the snapshots from the physical ensemble, then free energy reference system method as a function of the number of his- estimates were obtained by generating 1,000,000 struc- togram binsusedfor eachdegreeof freedom. Theplotshows tures for the reference ensemble for each estimate. All the “sweet spot” where histogram bins are small enough to one-hundred fifteen (excludes bond lengths constrained reveal histogram features, yet large enough to give sufficient by RATTLE38) internal coordinates were binned as in- populationineachbin. Theresultsareshownwithavertical dependent with fifty bins for each coordinate. The un- scaleoftwokcal/molandonalogscaletoemphasizethewide certainty shown in parenthesis is the standard deviation range of bin sizes that produce excellent results for the ref- from the five independent estimates using the five sep- erence system approach. Results shown were obtained using arate trajectories, i.e., five different physical ensembles Eq.(10)foramethanemoleculeusingN =N =10,000 phys ref and five different reference systems. (dashed curve) and N = N = 100,000 (solid curve). phys ref Thesolid horizontal lineshows theexact free energy and the Since independent estimates of the absolute free en- errorbars are thestandard deviationsof fiveindependenttri- ergies of the alpha and beta conformations of leucine als. The plot demonstrates at least fifty bins should be used dipeptide arenotavailable,wecalculatedthefreeenergy foreachindependentcoordinate,andthatthemaximumnum- difference ∆F = −0.85 (0.05) kcal/mol via a alpha→beta berofbinsdependsonthenumberofsnapshotsinthephysical 1.0 µs unconstrained simulation. The uncertainty of the ensemble. independentestimatewasobtainedusingblockaverages. Thetemperaturewaschosentobe500.0Kwhichallowed around1500 crossingsof the free energy barrier between thealphaandbetaconformations,providinganaccurate independentestimate. AscanbeseeninTableII,oures- System Estimate (kcal/mol) IndependentEstimate timated free energy difference is in good agreement with Falpha 87.3 (0.7) — the independent value obtained via long simulation. Fbeta 86.3 (0.7) — We emphasize that the nearly kcal/mol fluctuations ∆Falpha→beta -1.0 (0.9) -0.85 (0.05) observed in our leucine dipeptide estimates are com- pletely independent of the magnitude of the free energy TABLE II: Absolute free energy estimates of the alpha difference of the same order. That is, for a similar sized (Falpha) and beta (Fbeta) conformations obtained using the system and similar CPU investment, one would expect reference system method for leucine dipeptide with GBSA similar uncertainty, even for a very large free energy dif- solvation, in units of kcal/mol. The independent measure- ference. This, indeed, is the motivation for performing ment for the free energy difference was obtained via a 1.0 µs absolute free energy calculations. We believe, moreover, unconstrained simulation. The uncertainty for the absolute thatefficiencyimprovementswillbeachievedbeyondthe free energies, shown in parentheses, is the standard devia- data in this initial report. tion from five independent 10.0 ns leucine dipeptide simula- tions using one-million reference structures in the reference Figure 4 shows the convergence behavior of the ref- ensemble. The uncertainty for the free energy differences is erence system method for leucine dipeptide. Five free obtained by using every possible combination of Falpha and energy estimates are shown as a function of the number Fbeta, i.e., twenty-five independent estimates. The standard ofreferencestructuresusedintheestimatefor(a)theal- error associated with the∆Falpha→beta reference system esti- pha configuration, and (b) the beta configuration. Each mate is 0.18 kcal/mol, reflecting the twenty-fiveindependent ofthefivesimulationsusethesameprotocolasdescribed estimates. The table shows estimates of the configurational above. integral in Eq. (2), i.e., the constant term is not included in The leucine dipeptide calculations also demonstrate theestimate. two importantaspects of the particular referencesystem defined in this study: (i) the reference system has good 7 120 mol) 115 (a) al/ c 110 k (hys105 p F e at 100 m sti 95 e y g er 90 n e ee 85 Fr 80 10 2 3 4 5 6 10 10 10 10 10 N ref 120 mol) 115 (b) al/ c 110 k (hys105 p F e at 100 m sti 95 e y g er 90 n e ee 85 Fr 80 10 2 3 4 5 6 10 10 10 10 10 N ref FIG. 4: Free energy for leucine dipeptide estimated by the referencesystemmethodasafunctionofthenumberofrefer- ence structures N used in the estimate. Five independent ref simulations are shown on alog scale todemonstratethecon- FIG. 5: Scatter plots of the two χ2 torsions of each residue vergencebehaviorofthefreeenergyestimatefor(a)thealpha forleucinedipeptide. Resultsareshownforbothphysicaland configuration, and (b) thebeta configuration. Results shown reference ensembles containing 100,000 structures each. The were obtained using Eq. (10) with fifty bins for each degree figure shows that: (i) the reference system has good overlap of freedom. with the physical system, as can be seen by the similarity between the two plots; and (ii) the reference system is more broadly distributed than the physical system, as evidenced overlap with the physical system; and (ii) the reference by the data at (-60,-60) for the reference system that is not system is broader than the physical system. Figure 5 present for thephysical system. showsascatterplotofthe χ torsionsofeachresiduefor 2 both the physical and reference ensembles. Each ensem- ble contains100,000structures. The figure clearlyshows system. the excellent overlap between the reference and physical ensemble, as can be seen by the similarity between the two plots. In addition, the reference ensemble scatter plot has data in the region (-60,-60) which does not ex- IV. DISCUSSION ist in the physical ensemble, showing that the reference system is “broader” than the physical system. Figure 6 shows a histogram of the distance between Thepresentresultsraiseanumberofquestionsregard- the C atom of residue one and the C of residue two ingthereferencesystemapproachtocomputingabsolute δ α forthe sameensemblesasFig.5. Thefigureagainshows free energies—inparticular,regardingthe use of correla- howthereferencesystemhasbothexcellentoverlapwith tions, the importance of the physical ensemble, and the thephysicalsystemandisalsobroaderthanthephysical potential for application to larger systems. 8 1 Regardless of the degree of correlations included in Physical Ensemble Uref, we emphasize that final results fully include cor- Reference Ensemble relations in the physical potential U . 0.8 phys bility0.6 B. Quality of the physical ensemble a b o Since the reference ensemble is generated by drawing r0.4 P at random from histograms which, in turn, were gener- ated from the physical ensemble, a natural question to 0.2 ask is: how complete does the physicalensemble need to be? The surprising answer is that, for our reference sys- tem method, the physical ensemble does not need to be 0 2 4 6 8 complete, or even correct (properly distributed). Since Distance between C of residue 1 and C of residue 2 (Angstroms) δ α Eqs.(3)and(9)arevalidforarbitraryreferencesystems, the convergence of the free energy estimate to the cor- FIG. 6: Histogram of the distance between theCδ of residue rect value is guaranteed,in the limit of infinite sampling one and the Cα of residue two for leucine dipeptide. Results (N →∞), regardless of the quality of the physical en- ref areshownforbothreferenceandphysicalensemblescontain- semble. The“trick”isthattheensembleforthereference ing 100,000 structures each. The figure shows that: (i) the system must be converged, which can be achieved with reference system has good overlap with the physical system; much less expense since there is no dynamical trapping. and(ii)thereferencesystemisbroaderthanthephysicalsys- Unlike the typical case for molecular mechanics simu- tem. lation, we sample the reference ensemble “perfectly”— there is no possibility of being trapped in a local basin. Byconstruction,sinceallcoordinatevaluesaregenerated A. Correlation of Coordinates exactly according to the reference distributions, the ref- erence ensemble can only suffer from statistical (but not How can correlations among coordinates be used to systematic) error. For example, it was possible to ob- increase the method’s effectiveness? One may choose tain the correctfree energy for methane based on 10,000 to bin coordinates as independent (i.e., one-dimensional referencestructuresevenwhenthehistogramforeachco- histograms),or with correlations(i.e., multi-dimensional ordinate was assumed to be flat, i.e., without the use of histograms). For example, in peptides, one may choose a physical ensemble (data not shown). to bin all sets of backbone φ,ψ torsions as correlated, It is important to note that, while convergence to the and all other coordinates (bond lengths, bond angles, correct free energy is guaranteed for any choice of refer- other torsions) as independent. It might always seem ence system, the efficiency of the method could be dra- advantageoustobinsomecoordinates(atleastbackbone maticallyreducedifthereferencesystemdoesnotoverlap torsions) as correlated, since reference structures drawn well with the physical system. randomly from the histograms will be less likely to have Given the fact that the physical ensemble need not steric clashes. On the other hand, including correlations be correct, it is easy to imagine a modified method that withsmallbinsizesisimpractical. Asanexample,imag- does not require simulation, but instead populates the ine that for the leucine dipeptide molecule used in this histogram bins using the “bare” potential for each in- study,onebinnedthe fourφ,ψ backbonetorsionsascor- ternal coordinate (e.g., Gaussian histograms for bond related. Iffiftybinsforeachtorsionwereused(asshould lengths and angles). Of course, the conformationalstate be done according to the discussion below), then there must be defined explicitly, with upper and lower limits wouldbe504 =6,250,000multi-dimensionalbinstopop- for coordinates. Allowed ranges for the torsions (espe- ulate, which is simply not feasible. ciallyφ,ψ)arenaturallyobtainablevia,e.g.,Ramachan- dran propensities (e.g., Ref. 43), and reasonable ranges There does appear to be an important advantage to for bond lengths and angles could be chosen to be, e.g., eliminating at least some correlations from the original several standard deviations from the mean. “physical” ensemble: namely, a larger portion of confor- mationalspacebecomesavailabletothereferenceensem- ble;seeFigs.5and6. Sincecoordinatesforthereference structures are drawn randomly and independently, it is C. Extension to larger systems possible to generate reference structures that are in en- tirely different energy basins than those in the physical Whiletheinitialresultsofourreferencesystemmethod ensemble. Itisthuspossibletoovercometheinadequacies arepromising,anaiveimplementationofthemethodwill of the physical ensemble by binning internal coordinates find difficulty with large systems (as do all absolute and independently. The optimal (presumably) limited use of relative free energy methods). For our method, the dif- correlations will be considered in future work. ficulty with including a very large number of degrees of 9 freedom is due to the fact that, if one does not treat tem method to include explicitly solvated biomolecules. all correlations in the backbone, then steric clashes will However, as with all absolute free energy methods, the occurfrequentlywhengeneratingthereferenceensemble. additionofthe solventdegreesoffreedomcausesthefree However,it is possible to extend the method to larger energyestimatetoconvergemuchmoreslowlythanwith- peptides, still include all degrees of freedom, and bin out explicit solvent. Thus, we feel the method described all coordinates independently (important for broadening inthisstudywillfinduseprimarilyinimplicitly solvated configurational space, as discussed above), by using a biomolecules. “segmentation”techniquemotivatedbyearlierwork.44,45 Considergeneratingreferencestructuresforaten-residue peptideinthealphahelixconformation. Duetothelarge V. CONCLUSIONS numberofbackbonetorsions,mostofthereferencestruc- tures chosen at random will not be energetically favor- In conclusion, we have introduced and tested a simple able. However,if one breaks the peptide into two pieces, method for calculating absolute free energies in molec- thenonecangeneratemanystructuresforeachsegment, ular systems. The approach relies on the construction and only “keep” energetically likely segment structures. of an ensemble of reference structures (i.e., the reference The selectedstructures may be joined to form full struc- system) that is designed to have high overlap with the tureswhicharereasonablylikelytohavelowenergy. For physical system of interest. The method was first shown example, if one generates 105 structures for each of the to reproduce exactly computable absolute free energies two segments and keeps only 103 of those, then one only for simple systems, and then used to correctly predict need evaluate 103 ×103 = 106 full structures out of a the stability of leucine dipeptide conformations using all possible 105 ×105 = 1010. A statistically correct seg- one-hundred fifteen degrees of freedom. mentationstrategyiscurrentlybeinginvestigatedbythe Some strengths ofthe approacharethat: (i) the refer- authors for use in large peptides. encesystemisbuilttohavegoodoverlapwiththesystem Anotherstrategywhichmayproveusefulforlargersys- of interest by using internal coordinates and by using a tems is to use the reference system method with multi- single equilibrium ensemble from Monte Carlo or molec- stage simulation. Multi-stage simulation requires the in- ular dynamics; (ii) the absolute free energy estimate is troductionofahybridpotentialenergyparameterizedby guaranteed to converge to the correct value, whether or λ, e.g., not the physical ensemble is complete and, in fact, it is possibleto estimatethe absolutefreeenergywithoutthe U =λU +(1−λ)U . (14) λ phys ref use ofa physicalensemble; (iii)the methodexplicitly in- cludesalldegreesoffreedomemployedinthesimulation; Thus, U = U and U = U . Simulations are per- 0 ref 1 phys (iv) the reference system need only be numerically com- formedusing the hybridpotentialenergyU (and thus a λ putable, i.e., the exact analytic result is not needed; and hybrid forcefield, if using molecular dynamics) at inter- (v) the method can be trivially extended to include the mediate λ values between0 and1. Conventionalfree en- use of multi-stage simulation. The CPU cost of the ap- ergymethods suchas thermodynamic integrationor free proach,beyondthatforshorttrajectoriesofthe physical energy perturbation can then be used to obtain F . phys system of interest, is one energy call for each reference We also believe that including correlations, such as structure, plus the less expensive cost of generating the suggested by Eq. (6) and possibly other ways, may be reference ensemble. useful. The inclusion of correlations should improve the In the present “proofof principle” report, our method overlap between the reference and physical ensembles— was used to study conformationalequilibria; howeverwe therebyreducingthe amountofsamplingrequiredinthe feelthatthe simplicityandflexibilityofthe methodmay reference system, hence improving efficiency. This also findbroaduseincomputationalbiophysicsandbiochem- will be explored in future work. (We also remind the istryforawide varietyoffreeenergyproblems. We have reader that the final free energy value includes the full also described a segmentation strategy, currently being correlations in U , regardless of U .) phys ref pursued, to use the approach in much larger systems. Themethodcouldproveusefulinfutureprotein-ligand binding studies. In the simplest approach, one could freeze all degrees of freedom except for the ligand and side-chain degrees of freedom in the binding site. While Acknowledgments the absolute free energy would be unphysical, the ap- proach could permit comparison of ligands or protein The authors would like to thank Edward Ly- mutations with little or no conformational similarity. man, Ronald White, Srinath Cheluvarajah and Hagai In principle, it is possible to extend the reference sys- Meirovitch for many fruitful discussions. 1 A. Grossfield, P. Ren, and J. W. Ponder, J. Am. Chem. 2 J. W. Pitera and W. F.van Gunsteren,J. Phys.Chem. B Soc. 125, 15671 (2003). 10 105, 11264 (2001). (1981). 3 S. B. Singh, Ajay, D. E. Wemmer, and P. A. Kollman, 25 J. Carlsson and J. Aqvist, J. Phys. Chem. B 109, 6448 Proc. Nat. Acad. Sci. (USA)91, 7673 (1994). (2005). 4 B.C.Oostenbrink,J.W.Pitera,M.M.vanLipzip,J.H.N. 26 C. E. Chang, M. J. Potter, and M. K. Gilson, J. Phys. Meerman, and W. F. van Gunsteren, J. Med. Chem. 43, Chem. B 107, 1048 (2003). 4594 (2000). 27 C. H. Bennett, J. Comput. Phys. 22, 245 (1976). 5 F. M. Ytreberg and D. M. Zuckerman, J. Phys. Chem. B 28 M.R.ShirtsandV.S.Pande,J.Chem.Phys.122,144107 109, 9096 (2005). (2005). 6 B. K.Shoichet, Nature432, 862 (2004). 29 M. R. Shirts, E. Bair, G. Hooker, and V. S. Pande, Phys. 7 J. Y. Trosset and H. A. Scheraga, J. Comput. Chem. 20, Rev.Lett. 91, 140601 (2003). 412 (1999). 30 G. E. Crooks, Phys.Rev.E 61, 2361 (2000). 8 R. W. Zwanzig, J. Chem. Phys. 22, 1420 (1954). 31 N. Lu, D. A. Kofke, and T. B. Woolf, J. Comput. Chem. 9 D. Beveridge and F. DiCapua, Ann. Rev. Biophys. Bio- 25, 28 (2004). phys.Chem. 18, 431 (1989). 32 D. M. Zuckerman and T. B. Woolf, Phys. Rev. Lett. 89, 10 W. L. Jorgensen and C. Ravimohan, J. Chem. Phys. 83, 180602 (2002). 3050 (1985). 33 D. M. Zuckerman and T. B. Woolf, J. Stat. Phys. 114, 11 W. Yang, R. Bitetti-Putzer, and M. Karplus, J. Chem. 1303 (2004). Phys. 120, 2618 (2004). 34 D.M.ZuckermanandT.B.Woolf,Chem.Phys.Lett.351, 12 J. A.McCammon, Curr. Opin.Struc.Bio. 2, 96 (1991). 445 (2002). 13 W. G. Hoover, S. G. Gray, and K. W. Johnson, J. Chem. 35 F.M. YtrebergandD.M.Zuckerman,J.Comput. Chem. Phys. 55, 1128 (1971). 25, 1749 (2004). 14 D. Frenkel and A. J. C. Ladd, J. Chem. Phys. 81, 3188 36 T. P. Straatsma and J. A. McCammon, J. Chem. Phys. (1984). 95, 1175 (1991). 15 W. G. Hoover and F. H. Ree, J. Chem. Phys. 47, 4873 37 M.Fasnacht,R.H.Swendsen,andJ.M.Rosenberg,Phys. (1967). Rev.E 69, 056704 (2004). 16 L. M. Amon and W. P. Reinhardt, J. Chem. Phys. 113, 38 H. C. Andersen,J. Comput. Phys. 52, 24 (1983). 3573 (2000). 39 F. M. Ytreberg and D. M. Zuckerman, J. Chem. Phys. 17 J. P. Stoessel and P. Nowak, Macromolecules 23, 1961 120, 10876 (2004). (1990). 40 J. W. Ponder and F. M. Richard, J. Comput. Chem. 8, 18 S. Cheluvaraja and H. Meirovitch, Proc. Nat. Acad. Sci. 1016 (1987), http://dasher.wustl.edu/tinker. 101, 9241 (2004). 41 W. L. Jorgensen, D. S. Maxwell, and J. Tirado-Rives, J. 19 R.P.WhiteandH.Meirovitch,Proc.Nat.Acad.Sci.101, Am. Chem. Soc. 117, 11225 (1996). 9235 (2004). 42 W.C.Still,A.Tempczyk,andR.C.Hawley,J.Am.Chem. 20 S. Cheluvaraja and H. Meirovitch, J. Chem. Phys. 1022, Soc. 112, 6127 (1990). 054903 (2004). 43 S. C. Lovell, I. W. Davis, W. B. Arendall III, P. I. W. de 21 M. S. Head, J. A. Given, , and M. K. Gilson, J. Phys. Bakker,J. M.Word,M.G.Prisant, J. S.Richardson,and Chem. A 101, 1609 (1997). D. C. Richardson, Proteins 50, 437 (2003). 22 M.K.Gilson, J.A.Given,B.L.Bush,andJ.A.McCam- 44 K. D. Gibson and H. A. Scheraga, J. Comput. Chem. 8, mon, Biophys. J. 72, 1047 (1997). 826 (1987). 23 C. E. Chang and M. K. Gilson, J. Am. Chem. Soc. 126, 45 A.R.Leach,K.Prout,andD.P.Dolata,J.Comput.Aid. 13156 (2004). Mol. Des. 2, 107 (1988). 24 M. Karplus and J. N. Kushick, Macromolecules 14, 325