Article — pubs.acs.org/JCTC

Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 1. Generalized Born

Andreas W. Götz,† Mark J. Williamson,†,∥ Dong Xu,†,⊥ Duncan Poole,‡ Scott Le Grand,‡ and Ross C. Walker*,†,§

† San Diego Supercomputer Center, University of California San Diego, 9500 Gilman Drive MC0505, La Jolla, California 92093, United States
‡ NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara, California 95050, United States
§ Department of Chemistry and Biochemistry, University of California San Diego, 9500 Gilman Drive MC0505, La Jolla, California 92093, United States

* Supporting Information

ABSTRACT: We present an implementation of generalized Born implicit solvent all-atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA-enabled NVIDIA graphics processing units (GPUs). We discuss the algorithms that are used to exploit the processing power of the GPUs and show the performance that can be achieved in comparison to simulations on conventional CPU clusters. The implementation supports three different precision models in which the contributions to the forces are calculated in single precision floating point arithmetic but accumulated in double precision (SPDP), or everything is computed in single precision (SPSP) or double precision (DPDP). In addition to performance, we have focused on understanding the implications of the different precision models on the outcome of implicit solvent MD simulations. We show results for a range of tests including the accuracy of single point force evaluations and energy conservation as well as structural properties pertaining to protein dynamics. The numerical noise due to rounding errors within the SPSP precision model is sufficiently large to lead to an accumulation of errors which can result in unphysical trajectories for long time scale simulations. We recommend the use of the mixed-precision SPDP model since the numerical results obtained are comparable with those of the full double precision DPDP model and the reference double precision CPU implementation but at significantly reduced computational cost. Our implementation provides performance for GB simulations on a single desktop that is on par with, and in some cases exceeds, that of traditional supercomputers.

1. INTRODUCTION

Since the first simulation of an enzyme using molecular dynamics (MD) was reported by McCammon et al.1 in 1977, MD simulations have evolved to become important tools in rationalizing the behavior of biomolecules. The field has grown from that first 10-ps-long simulation of a mere 500 atoms to the point where small enzymes can be simulated on the microsecond time scale2−4 and simulations containing millions of atoms can be considered routine.5,6 However, such simulations are numerically very intensive, and using traditional CPU-centric hardware requires access to large-scale supercomputers or well-designed clusters with expensive interconnects that are beyond the reach of many research groups.

Numerous attempts have been made over the years to accelerate classical MD simulations by exploiting alternative hardware technologies. Some notable examples include ATOMS by AT&T Bell Laboratories,7 FASTRUN by Columbia University and Brookhaven National Laboratory,8 MDGRAPE by RIKEN,9 and most recently Anton by D. E. Shaw Research LLC.10 All of these approaches have, however, failed to make an impact on mainstream research because of their excessive cost. Additionally, these technologies have been based on custom hardware and do not form part of what would be considered a standard workstation specification. This has made it difficult to experiment with such technologies, leading to a lack of sustained development or innovation and ultimately their failure to mature into ubiquitous community-maintained research tools.

Graphics processing units (GPUs), on the other hand, have been an integral part of personal computers for decades, and strong demand from the consumer electronics industry has resulted in significant sustained industrial investment in the stable, long-term development of GPU technology. In addition to low prices for GPUs, this has led to a continuous increase in the computational power and memory bandwidth of GPUs, significantly outstripping the improvements in CPUs.
As a consequence, high-end GPUs can be considered standard equipment in scientific workstations, which means that they either already exist in many research laboratories or can be purchased easily with new equipment. This makes them readily available to researchers and thus attractive targets for the acceleration of many scientific applications, including MD simulations.

The nature of GPU hardware, however, has until recently made their use in general purpose computing challenging to all but those with extensive three-dimensional (3D) graphics programming experience. However, the development of application programming interfaces (APIs) targeted at general purpose scientific computing has reduced this complexity substantially, such that GPUs are now accepted as serious tools for the economically efficient acceleration of an extensive range of scientific problems.11,12

Received: December 20, 2011
Published: March 26, 2012

Figure 1. Peak floating-point operations per second (Flop/s; left) and memory bandwidth (right) for Intel CPUs26 and NVIDIA GPUs.27

The computational complexity and fine-grained parallelism of MD simulations of macromolecules makes them an ideal candidate for implementation on GPUs. Indeed, as we illustrate here for implicit solvent and in a subsequent paper13 for explicit solvent, the careful implementation of modern MD algorithms on GPUs can provide capability, in terms of performance, that exceeds that achievable with any current CPU-based supercomputer. Several previous studies have investigated the use of GPUs to accelerate MD simulations.14−20 For a detailed review of the use of GPUs for acceleration of condensed phase biomolecular MD simulations, we refer the reader to our recent review.12

In this manuscript, we present our high-performance GPU implementation of implicit solvent generalized Born (GB) MD for the AMBER21 and CHARMM22 pairwise additive force fields on CUDA-enabled NVIDIA GPUs. We have implemented this within the AMBER23,24 PMEMD dynamics engine in a manner that is designed to be as transparent to the user as possible, and we give an overview of what the code currently supports, as well as our plans for future developments. We discuss the specifics by which we exploit the processing power of GPUs, both in serial and using multiple GPUs, and show the performance that can be achieved in comparison to conventional CPU clusters. We also discuss our implementation and validation of three specific precision models that we developed and their impact on the numerical results of implicit solvent MD simulations.

2. GPU PROGRAMMING COMPLEXITIES

As illustrated by Figure 1, GPUs offer a tremendous amount of computing power in a compact package. This, however, comes at the cost of reduced flexibility and increased programming complexity as compared to CPUs. In order to develop software that runs efficiently on GPUs, it is necessary to have a thorough understanding of the characteristics of the GPU hardware architecture. A number of manuscripts have already discussed this in detail in the context of MD.11,12,15,17,25 For this reason, we provide only a brief overview of the complexities involved in programming GPUs as they relate to our implementation, focusing on NVIDIA hardware. For a more detailed description, the reader is referred to the publications cited above.

2.1. Vectorization. A GPU is an example of a massively parallel stream-processing architecture which uses the single-instruction multiple-data (SIMD) vector processing model. Unlike a regular CPU, which typically operates on one to four threads in parallel, GPUs typically process threads in blocks (termed warps within the CUDA programming language28) containing between 16 and 64 threads. These thread blocks logically map to the underlying hardware, which consists of streaming multiprocessors. At the time of writing, high-end GPUs typically have between 16 and 32 multiprocessors. For example, an NVIDIA M2090 GPU consists of 16 multiprocessors, each containing 32 cores, for a total of 512 cores. All threads in a single block must execute the same instruction on the same clock cycle. This necessarily implies that, for optimum performance, codes must be vectorized to match the size of a thread block. Branching must therefore be used with extreme care: if any two threads in the same warp have to follow different code paths of a branch, the threads in the warp will stall while each side of the branch is executed sequentially.
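As a minimal illustration of this divergence penalty (a sketch for exposition only, not taken from the AMBER source), consider a kernel in which threads of the same warp take different sides of a data-dependent branch; the two paths are then serialized while the inactive threads idle:

    // Sketch of warp divergence: all 32 threads of a warp share one
    // instruction stream, so if the branch below splits a warp, path A and
    // path B are executed one after the other.
    __global__ void divergent_example(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (in[i] > 0.0f)
            out[i] = sqrtf(in[i]);   // path A: runs while path-B threads stall
        else
            out[i] = 0.0f;           // path B: runs while path-A threads stall
    }

A branch on a quantity that is uniform across the warp (for example, a tile index) carries no such penalty, so branching on warp-uniform data is generally safe.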
2.2. Memory Model. The memory hierarchy of GPUs has its origins in their graphics lineage, and the high density of arithmetic units comes at the expense of cache memory and control units. All of the cores making up a multiprocessor have a small number of registers that they can access, a few kilobytes (64 kB on an M2090) of shared memory [this can be split into directly accessible memory and L1 cache; in the case of an M2090, it can be split 48/16 kB or 16/48 kB; in the case of AMBER, the configuration is switched at runtime for optimal performance of a given kernel], which is private to each multiprocessor, and a small amount (typically 48 kB) of high-speed but read-only texture memory. The majority of the memory (6 GB on an M2090), termed global device memory, is available to all multiprocessors. While fast compared to the main memory accessible by CPUs, access to the device memory by GPUs is still relatively slow compared to the local cache memory. The nature by which the multiprocessors are connected to this memory also means that there is a significant performance penalty for non-stride-1 access. Finally, it should be noted that currently the CPU and GPU memories are in different address spaces, and this requires careful consideration. The unique nature of this memory model leads to several considerations for optimizing GPU performance, including optimizing device memory access for contiguous data, utilizing the multiprocessor shared memory to store intermediate results or to reorganize data that would otherwise require non-stride-1 memory accesses, and using the texture memory to store read-only information, such as various force field parameters, in a fashion that allows very rapid access.
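On Fermi-class hardware, the CUDA runtime allows the 64 kB of on-chip memory to be partitioned per kernel, which is the mechanism behind the per-kernel switching mentioned above. A hedged sketch follows; the kernel names are placeholders, not the actual PMEMD kernels:

    #include <cuda_runtime.h>

    __global__ void nonbond_kernel(/* ... */) { /* stages data in shared memory */ }
    __global__ void bonded_kernel(/* ... */)  { /* mostly register/global traffic */ }

    void configure_caches()
    {
        // 48 kB shared / 16 kB L1 for kernels that rely heavily on shared memory
        cudaFuncSetCacheConfig(nonbond_kernel, cudaFuncCachePreferShared);
        // 16 kB shared / 48 kB L1 for kernels that benefit from a larger cache
        cudaFuncSetCacheConfig(bonded_kernel, cudaFuncCachePreferL1);
    }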
2.3. GPU to CPU Communication. As mentioned above, the CPU and GPU memories are, at the time of writing, in different address spaces. This means it is up to the programmer to ensure that the memories are synchronized as necessary to avoid race conditions. However, there is a large performance penalty for such synchronizations, which have to occur via the Peripheral Component Interconnect Express (PCIe) bus, and thus they should be avoided unless absolutely necessary.

2.4. GPU to GPU Communication. The traditional method for programming scientific algorithms in parallel uses the message passing interface (MPI),29 in which each thread runs in a separate address space. When running GPUs in parallel under an MPI paradigm, additional complexity is introduced since sending data between two GPUs involves copying the data from the memory of the sending GPU to the CPU memory of the corresponding MPI thread over the PCIe bus, an MPI send by this CPU thread and a corresponding MPI receive by the receiving CPU thread, which copies the data between the memories of the CPUs, and finally copying the data to the memory of the receiving GPU. Clearly, this introduces additional considerations for maximizing parallel performance as compared to traditional CPU programming.

At the time of writing, there are efforts to streamline GPU to GPU communication, particularly within a single node but also for Infiniband connections between nodes. One such approach under development by NVIDIA and Mellanox is termed GPUDirect,30 which ultimately seeks to unify address spaces between multiple CPUs and GPUs. Currently, the degree to which this can be utilized is heavily dependent on the underlying hardware design. Therefore, at present, the added complexity of using the advanced features of GPUDirect, beyond the pinned memory MPI optimizations offered by GPUDirect version 1, on the large number of possible different hardware combinations is not worth the effort for a widely used production code.
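The staged transfer described above can be sketched as follows (illustrative only; the buffer names and the use of a blocking MPI_Sendrecv are assumptions, not the PMEMD implementation):

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Exchange n doubles with a peer rank: device -> host, MPI, host -> device.
    void exchange_data(double *d_send, double *d_recv,   // device buffers
                       double *h_send, double *h_recv,   // pinned host buffers
                       int n, int peer, MPI_Comm comm)
    {
        // 1. copy outgoing data from GPU memory to host memory over PCIe
        cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);

        // 2. exchange the host buffers between the two MPI ranks
        MPI_Sendrecv(h_send, n, MPI_DOUBLE, peer, 0,
                     h_recv, n, MPI_DOUBLE, peer, 0,
                     comm, MPI_STATUS_IGNORE);

        // 3. copy incoming data from host memory back to GPU memory
        cudaMemcpy(d_recv, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);
    }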
2.5. Mathematical Precision. Early versions of GPUs in NVIDIA's lineup (prior to the GT200 model) only supported single precision (SP) floating point arithmetic. This was due to the fact that graphics rendering did not require double precision (DP). Scientific algorithms, however, typically require DP arithmetic (for a discussion in the context of quantum chemistry, see for example the work by Knizia et al.31). The generation of GPUs at the time of our initial implementation (2008) supported DP in hardware, but only at 1/8 the performance of SP. In the latest generation of cards at the time of writing, termed the Fermi lineup by NVIDIA, the DP to SP performance ratio is 1/2 and thus equivalent to that in CPUs. This, however, only holds for the professional (termed Tesla) series of cards. The significantly cheaper gaming cards (termed GeForce) still only support DP at a fraction of the speed of SP. It is therefore important to optimize the use of DP such that it is only used when necessary to maintain numerical accuracy.

2.6. Programming Model. Early use of GPUs for scientific computing was hampered by the lack of an application programming interface (API) for general purpose calculations. The problems to be solved had to be described in terms of a graphics pipeline employing either OpenGL or DirectX, which made the software development time-consuming and hardware-specific. The barrier to utilizing GPU hardware for general purpose computation has since been reduced by the introduction of GPU programming models such as the Brook stream programming language,32 OpenCL,33 and NVIDIA's Compute Unified Device Architecture (CUDA)28 and the availability of corresponding software development toolkits (SDKs). The AMBER implementation uses CUDA, which is a relatively simple extension of the standard C programming language that allows one to code in an inherently parallel fashion and perform all necessary operations to access and manipulate data on a GPU device. Realizing the full potential of GPUs, however, still requires considerable effort, as indicated above and outlined below, to take advantage of the particular GPU architecture, and not all algorithms are suitable to achieve good performance on these massively parallel processors.

3. OVERVIEW OF THE AMBER IMPLICIT SOLVENT GPU IMPLEMENTATION

The nature of MD simulations requires what in computer science is referred to as strong scaling, that is, reduction of the solution time with an increasing number of processors for a fixed total problem size. This enables access to simulations at longer time scales, which is required for a proper convergence of results. This becomes more important as one moves to larger system sizes since the number of degrees of freedom increases. Weak scaling, that is, the solution time with the number of processors for a fixed problem size per processor, is only of secondary importance, since this merely enables simulating larger molecules at currently attainable time scales. Our implementation therefore has focused on accelerating problem sizes that correspond to those typically studied by AMBER users. In the case of GB simulations, this is in the range of 300 to 30,000 atoms.

The initial driving force in accelerating AMBER implicit solvent GB calculations with GPUs was to provide the scientific community with a computational tool that would allow an individual researcher to obtain performance on a simple desktop workstation equivalent to that of a small CPU cluster. Such a tool alleviates the costs, both capital and recurring, involved in purchasing, maintaining, and using individual research compute clusters. To this end, our goal was that a single state-of-the-art GPU should provide performance equivalent to that of four to six high-end CPU cluster nodes. Such an approach also removes the need to purchase and maintain the expensive interconnects that are required to achieve scaling even on a modest number of nodes.

Beyond this initial serial development, which was first released as an update to AMBER 10 (ref 34) in August 2009, we have also developed a parallel implementation based on the MPI-2 (ref 29) message passing protocol, released as an update to AMBER 11 (ref 23) in October 2010, that allows a single job to span multiple GPUs. These can be within a single node or across multiple nodes. As shown below, it is possible with this implementation to achieve a performance improvement that goes beyond simply making a desktop workstation faster, ultimately providing a performance capability that surpasses what is achievable on all current conventional supercomputers. Achieving this level of performance required implementing the entire implicit solvent MD algorithm, including energy and force evaluations, restraints, constraints, thermostats, and time step integration, on the GPU. As described in section 3.2, CPU to GPU communication only occurs during I/O or, to some extent, when data is sent between GPUs during parallel runs.

While we have designed our GPU implementation to achieve substantial acceleration of implicit solvent MD simulations over that achievable with AMBER's CPU implementation, our overriding goal has always been to maintain the precision of the calculations. To this end, we have focused on ensuring that GPU simulations will match CPU simulations.
All approximations made in order to achieve performance on GPU hardware have been rigorously tested, as highlighted in the following sections.

An additional design goal has been to attempt to preserve forward compatibility of our implementation. Using the CUDA programming language provides this by abstracting the program from the underlying hardware. The GPU-accelerated version of AMBER can be used on all NVIDIA cards that support double precision in hardware, that is, those with hardware revision 1.3 or 2.0 or higher. Our choice of CUDA and NVIDIA graphics cards was largely guided by the fact that, at the time we began this work, OpenCL was not mature enough to offer the same performance and stability benefits that CUDA did. A port to OpenCL is certainly possible, and this would support AMD hardware. However, with the public release of the CUDA API35 by NVIDIA and the release of CUDA compilers for x86 platforms by PGI,36 it is possible that an AMBER implementation will soon be available on a variety of accelerator hardware.

3.1. Features of the Implementation. We have attempted to make our GPU implementation all-inclusive of the features available in the PMEMD program. At the time of writing, the majority of features applicable to implicit solvent simulations are available, as described below.

Supported Methods. Support is provided for all GB models currently implemented within AMBER37−41 as well as the analytical linearized Poisson−Boltzmann (ALPB)42 model. In addition to constant energy simulations, thermostats have been implemented to perform constant temperature simulations. This includes all three thermostats available in PMEMD, that is, the Berendsen weak coupling algorithm,43 the Andersen temperature coupling scheme,44 and the Langevin dynamics thermostat.45 Constraints for hydrogen bond distances use a GPU version of the standard SHAKE algorithm46,47 employed in PMEMD, and harmonic restraints to a reference structure are supported.

To the best of our knowledge, no GB formalism currently exists that corrects for the errors introduced by the use of cutoffs for long-range nonbonded interactions. The use of cutoffs in GB simulations as implemented in PMEMD does not conserve energy, and their use involves an approximation with an unknown effect on accuracy. For this reason, we chose not to implement van der Waals (vdW) and electrostatic cutoffs in the GPU version of this code. [Cutoffs for the nonbonded interactions are implemented for explicit solvent simulations with periodic boundary conditions using the particle mesh Ewald (PME) method, as described in a later paper.] However, cutoffs in calculating the effective Born radii are supported.

Reproducibility. A design feature of the GPU code that goes beyond the CPU implementation is the deterministic nature of the implementation on a given hardware configuration. Serial CPU calculations for a given set of input parameters on identical hardware are perfectly reproducible. This does not hold for the parallel CPU implementation, since the need to load balance aggressively to achieve good parallel scaling means that the order of numerical operations is not defined, and therefore two simulations started from identical conditions will always diverge due to rounding differences. This poses a problem when transitioning to microsecond or greater simulation time scales, since it can be advantageous to store trajectory information less frequently than what is optimal in order to conserve available storage space and produce data files of manageable size. It is thus not possible to go back to a given point of the simulation and analyze the trajectory in finer detail by restarting and sampling more frequently unless the implementation is deterministic. The deterministic nature of the GPU code, coupled with machine precision binary restart files (currently under development), makes this mode of simulation possible. This also makes debugging and validation easier.

Transparency. Another key feature and a primary design goal of our GPU-accelerated implementation is that its use is completely transparent to the user. As far as the user is concerned, our GPU implementation is indistinguishable from the CPU implementation, and using the GPU version of the code is simply a case of switching the executable name from pmemd to pmemd.cuda or from pmemd.MPI to pmemd.cuda.MPI for the MPI-parallelized implementation. All other items such as input and output files and regression tests within the code remain identical. The only difference to be noticed by the user is an increase in performance. This guarantees effective uptake of our GPU implementation by the scientific community because there is no learning curve for the use of the code, and all tools and scripts that have been developed for the CPU version of PMEMD can be utilized without modifications.

System Size. The maximum system size that can be treated with the GPU implementation is a function of both the GPU hardware and the MD simulation parameters. In particular, Langevin temperature regulation and the use of larger cutoffs for the effective Born radii calculations increase the memory requirements. The physical GPU hardware also affects the memory usage since the optimizations used are nonidentical for different GPU types.
Table 1 gives an overview of the approximate maximum atom counts that can be treated with the present version of the code. The dominant sources of GPU memory usage are the output buffers used for the nonbonded interactions, as described in section 3.2. The memory used by those buffers is proportional to the square of the number of atoms. Currently, the atom count limitations imposed by GPU memory usage are roughly identical in serial and parallel.

Table 1. Approximate Maximum Atom Counts That Can Be Treated with the GPU Implementation of GB Implicit Solvent Simulations in AMBER 11 Using the SPDP Precision Model^a

    GPU card      GPU memory   simulation type   max atoms
    GTX-295       895 MB       constant E        20,500
                               constant T        19,200
    Tesla C1060   4.0 GB       constant E        46,350
                               constant T        45,200
    Tesla C2050   3.0 GB       constant E        39,250
                               constant T        38,100
    Tesla C2070   6.0 GB       constant E        54,000
                               constant T        53,050

^a Test systems are droplets of TIP3P water molecules. All simulations use SHAKE (AMBER input ntf=2, ntc=2); a time step of 2 fs; the Hawkins, Cramer, Truhlar GB model37 (AMBER input igb=1); the default cutoff value of 25 Å for GB radii (AMBER input rgbmax=25); and temperature control with the Langevin thermostat (AMBER input ntt=3), if applicable. Error-correction code (ECC) was switched off on the Tesla cards.

3.2. Technical Details of the Implementation. In classical MD, the majority of the computational effort is spent evaluating the potential energy and gradients, which has to be repeated each time step. In the case of the AMBER pairwise additive force fields,21 the potential takes the form

\[
V_{\text{AMBER}} = \sum_i^{n_{\text{bonds}}} b_i (r_i - r_{i,\text{eq}})^2
+ \sum_i^{n_{\text{angles}}} a_i (\theta_i - \theta_{i,\text{eq}})^2
+ \sum_i^{n_{\text{dihedrals}}} \sum_n^{n_{i,\max}} \frac{V_{i,n}}{2}\,[1 + \cos(n\phi_i - \gamma_{i,n})]
+ \sum_{i<j}^{n_{\text{atoms}}}{}' \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right)
+ \sum_{i<j}^{n_{\text{atoms}}}{}' \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}}
\qquad (1)
\]

where the bond and angle terms are represented by a simple harmonic expression with force constants b_i and a_i and equilibrium bond distances/angles r_{i,eq} and θ_{i,eq}, respectively. Torsional potentials for the dihedral angles are represented using a truncated Fourier expansion in which the individual terms have a barrier height V_{i,n}, periodicity n, and phase γ_{i,n}. The van der Waals (vdW) interactions are described by a Lennard-Jones potential with pairwise parameters A_{ij} and B_{ij}, and the electrostatic interactions by a Coulomb potential between atom-centered point charges q_i and q_j separated by the distance r_{ij}. The prime on the summation of the nonbonded interactions indicates that vdW and electrostatic interactions are only calculated for atoms in different molecules or for atoms in the same molecule separated by at least three bonds. Those nonbonded interactions separated by exactly three bonds (1−4 interactions) are reduced by the application of independent vdW and electrostatic scale factors, termed SCNB and SCEE in AMBER, which are dependent on the specific version of the force field (2.0 and 1.2, respectively, for the ff99SB48 version of the AMBER force field).
Within the additive force fields,21 the potential takes the form AMBERforcefield,improperdihedralsaretreatedinthesame way as proper dihedrals. The third additional term is a cross VAMBER = nb∑ondsbi(ri − ri,eq)2 + na∑nglesai(θi − θi,eq)2 tϕe,rψm bteertmweeedn tCwMoAsePq.4u9entAiadldpitriootneainllyb,actkhbeonCeHdiAheRdMraMl anganleds i i AMBER force fields handle 1−4 nonbonded interactions in a ndihedralsni,max different manner. The single prime on the electrostatic + ∑ ∑ (Vi,n/2)[1 + cos(nϕi − γi,n)] summation has the same meaning as described above for the i n AMBERforcefieldwiththeexceptionthat1−4interactionsare + na∑toms′⎛⎜Aij − Bij⎞⎟ + na∑toms′ qiqj nthoetssacmaleede.xTclhuesiodnosubaslethperimsinegolenptrhimeevdbWutsthuemumsaetioofndiifmfeprleinest i<j ⎝⎜ri1j2 ri6j ⎠⎟ i<j 4πε0rij (1) values Rimjin and εij for 1−4 interactions. IntheGBimplicitsolventmodel,theeffectofasurrounding where the bond and angle terms are represented by a simple solvent is described via a continuum electrostatics model that harmonicexpressionwithforceconstantsb anda andequilibrium uses a pairwise descreening approximation and in general also i i bond distances/angles r and θ , respectively. Torsional includesaDebye−Hückeltermtoaccountforsalteffectsatlow i,eq i,eq potentialsforthedihedralanglesarerepresentedusingatruncated salt concentrations. The general form of the correction to the Fourier expansion in which the individual terms have a potential energy of the solute is given as itwVnhii,tentehrwvadicdtWthiiaotpnoienmrtiboeicerdatiwccptietiaoyernnanmraaeenttpoedrrmesps-ehcAnaestinjeetdesarhnbeidfdyt γaBpi,noij.LinTeanthnnedcahrladats-hrtJgeotenwseeoslqetcipetroramotnessdntaattiriaqeclj ΔGGB = −12 ∑i,j ⎛⎝⎜⎜⎜1 − e−κεfriGjB⎞⎠⎟⎟⎟4πεq0iqfjiGj B (3) separated by the distance r. The prime on the summation of the nonbonded interactionsijindicates that vdW and electrostatic where qi and qj are the atomic partial charges, εr is the relative interactionsareonlycalculatedforatomsindifferentmoleculesor permittivity of the solvent, and κ is the Debye−Hückel foratomsinthesamemoleculeseparatedbyatleastthreebonds. screening parameter.38 The function fiGjB interpolates between Those nonbonded interactions separated by exactly three bonds aneffectiveBornradiusRiwhenthedistancerijbetweenatoms (1−4interactions)arereducedbytheapplicationofindependent is short and rij at large distances according to vdW and electrostatic scale factors, termed SCNB and SCEE in GB 2 2 1/2 f = [r + RR exp(−r /4RR)] AMBER,whicharedependentonthespecificversionoftheforce ij ij i j ij i j (4) field (2.0 and 1.2, respectively, for the ff99SB48 version of the TheeffectiveBornradiusR reflectshowdeeplyburiedacharge AMBER force field). i The CHARMM force field22 takes a similar form but isinthelow-dielectricmedium(solute),anditdependsonthe intrinsic radius ρ of an atom and the relative positions and includes three additional bonded terms: i intrinsic radii of all other atoms in the solute. The models implemented in PMEMD differ in the intrinsic radii and the V CHARMM function that is used to determine the effective Born radii. nbonds nangles The remainder of this section discusses key aspects of the = ∑ b(r − r )2 + ∑ a(θ − θ )2 i i i,eq i i i,eq GPU implementation of these equations. This includes design i i features to maximize performance while preserving accuracy as nUrey−Bradley well as addressing issues of reproducibility which are critical + ∑ u(r̃ − r̃ )2 when moving to longer time scale simulations. 
i i i,eq IntegrationwiththeExistingCodeBase.Oneoftheinitial i ndihedrals ni,max ideas we considered when implementing GPU acceleration for + ∑ ∑ (V /2)[1 + cos(nϕ − γ )] AMBER was to write an entirely new code from scratch in a i,n i i,n combination of C++ and CUDA. However, this was quickly i n dismissed for a number of reasons. In particular, we wanted to nimpropers + ∑ k(ω − ω )2 + ∑ V 1. keep the maintenance of the AMBER code base simple i i i,eq CMAP 2. minimize the amount of coding required i Φ,Ψ 3. simplifythewayinwhichfeaturesareportedtotheGPU + na∑toms″εij⎡⎢⎢⎛⎜⎜Rimj in⎞⎟⎟12 − ⎛⎜⎜Rimj in⎞⎟⎟6⎤⎥⎥ 4. manadinrteaginrebssaicoknwtaersdtsc.ompatibility with existing input files i<j ⎣⎢⎝ rij ⎠ ⎝ rij ⎠ ⎦⎥ The approach weultimately chose wasto utilize the existing FortrancodebaseinPMEMDandextenditwithcallstospecific natoms qq CUDA kernels for the GPU acceleration with the CUDA code + ∑ ′ i j protected with #IFDEF CUDA preprocessor directives. Building i<j 4πε0rij (2) andtestingisautomatedwithintheexistinginstallationprocedure. AseriesofGPUsynchronizationroutinesprovideanabstract The two-body Urey−Bradley terms account for angle bending way to copy relevant data to and from the GPU memory, between atoms that are separated by two bonds using a for example, gpu_upload_vel() and gpu_download_vel() for harmonic potential with force constant u and equilibrium atomicvelocities.Acompletelistofsynchronizationroutinesis i distance r̃ . The improper dihedrals with force constant k provided in the Supporting Information. For performance, we i,eq i describe out-of-plane bending and are used to maintain haveimplementedtheentireMDalgorithmontheGPU,which 1546 dx.doi.org/10.1021/ct200909j|J.Chem.TheoryComput.2012,8,1542−1555 Journal of Chemical Theory and Computation Article means uploads (to GPU memory) are only needed at the adopters of GPU technology. For example, the OpenMM beginningofarunanddownloads(toCPU memory)are only libraryofPandeetal.15usessingleprecision(SP)floatingpoint needed when I/O is required, for example, downloading the numbers throughout the calculations with the exception of a coordinates to write to the trajectory file. It remains single double-precision accumulator in the reduction phase of nevertheless simple to add new features to the code in a way theforceaccumulation.UseofSPinallplaceswithinthecode, thatwillworkforboththeCPUandGPU(seetheSupporting however, can cause substantial instabilities in the MD simula- Informationforanexample).Whilethislimitstheperformance tions. For example, energy conservation in the NVE ensemble forthenewfeatureduetorepeatedup-anddownloads,itdoes can become problematic. While this error can be hidden using provideamechanismbywhichnewfeaturescanbeeasilyadded tightlycoupledthermostats,thetrueeffectsofsuchapproxima- and tested with the GPU code before writing an additional tions have not been well characterized. GPU kernel. We distinguish three different precision models in our GPU Reproducibility. In order to achieve performance in parallel implementation in which the contributions to the nonbonded on CPUs, it has always been necessary to include complex forces are calculated in single precision arithmetic, but bonded dynamic load balancing algorithms and extensive use of asyn- terms and force accumulation are in double precision (SPDP), chronous communication within the implementation. This or everything is computed and accumulated in single precision arises from the fact that the CPUs are expected to perform a (SPSP) or double precision (DPDP). 
Reproducibility. In order to achieve performance in parallel on CPUs, it has always been necessary to include complex dynamic load balancing algorithms and extensive use of asynchronous communication within the implementation. This arises from the fact that the CPUs are expected to perform a number of tasks above and beyond running the underlying simulation. For example, the operating systems run multiple daemons controlling I/O and various other OS-related tasks. In addition, the latency between any two CPUs in a large parallel run can span an order of magnitude or more. As a result, the various MPI threads become desynchronized. Load balancing and asynchronous communication are used to minimize the resulting effect. A downside of this is that the order of operations is not well-defined in the CPU implementation. This results in rounding differences between otherwise identical runs. Due to the chaotic nature of numerically integrating Newton's equations of motion, this means that two initially identical simulations will always decorrelate on time scales of a few nanoseconds or less. There is nothing inherently wrong with this; the two simulations are just exploring two different regions of phase space. As mentioned previously, this does, however, pose a problem when one starts to routinely run simulations on the microsecond time scale, since it can be necessary to return to a given point in a simulation and repeat it exactly while saving the output more frequently. This is something that Shaw et al. attempted to address with a bit-wise reversible integrator50 based on scaled integers. However, this did not solve the issue when using the SHAKE algorithm. The architectural design of the GPU, with many floating point units controlled by a single instruction unit, and the fact that the GPU is not required to time slice with OS-related tasks, means that work can be divided between GPU threads, and indeed entire GPUs, in a predetermined fashion without detrimentally impacting performance. This can be exploited by careful programming and bookkeeping to ensure that the order of floating point operations is always predefined at each given time step. The GPU implementation of PMEMD has been designed such that any two simulations with identical starting conditions run on the same GPU model will always be bit-wise identical. This is extremely useful for debugging purposes, for example, for detecting shared memory race conditions, and has also been used to determine that the use of ECC on GPUs has little to no impact on the reliability of AMBER simulations.51

For Andersen and Langevin thermostatted simulations, it is necessary for the random number generator (RNG) to also be perfectly deterministic in order for any two initially identical simulations to be reproducible. To this end, we use the parallelized RNG that is implemented in the CURAND library and available with the CUDA Toolkits.

Precision Model. In AMBER, all of the traditional CPU codes are written entirely using double precision (DP) floating point arithmetic. This is in contrast to developments by early adopters of GPU technology. For example, the OpenMM library of Pande et al.15 uses single precision (SP) floating point numbers throughout the calculations, with the exception of a single double-precision accumulator in the reduction phase of the force accumulation. Use of SP in all places within the code, however, can cause substantial instabilities in the MD simulations. For example, energy conservation in the NVE ensemble can become problematic. While this error can be hidden using tightly coupled thermostats, the true effects of such approximations have not been well characterized.

We distinguish three different precision models in our GPU implementation, in which the contributions to the nonbonded forces are calculated in single precision arithmetic but bonded terms and force accumulation are in double precision (SPDP), or everything is computed and accumulated in single precision (SPSP) or double precision (DPDP). The exception to this is the SHAKE algorithm (see below), which is implemented in DP for all precision models since it involves calculating relative differences in distance on the order of 10−6 Å. Attempting to use SP for the SHAKE algorithm as implemented leads to numerically unstable simulations. The aim in developing our SPDP precision model for the GPU implementation of PMEMD was to achieve numerical stability during MD simulations equivalent to that of traditional DP implementations but with performance as close to the SPSP model as possible. As highlighted in the subsequent sections, the SPDP model achieves these aims and for this reason is the default precision model used in the GPU implementation of PMEMD, although the other precision models can be chosen if desired.
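The essence of the SPDP model is that each pair interaction is evaluated in SP while the running accumulator is DP. A minimal sketch follows (illustrative only; the production kernels accumulate into the per-tile output buffers described below rather than into a local variable, and the pair term shown is a bare Coulomb example):

    __device__ inline void accumulate_pair_force(float dx, float dy, float dz,
                                                 float r2, float qq,
                                                 double acc[3])
    {
        float rinv  = rsqrtf(r2);          // SP arithmetic for the expensive part
        float r3inv = rinv * rinv * rinv;
        float fscal = qq * r3inv;          // Coulomb-only example pair term

        acc[0] += (double)(fscal * dx);    // DP accumulation limits round-off growth
        acc[1] += (double)(fscal * dy);
        acc[2] += (double)(fscal * dz);
    }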
Nonbonded Interactions. Our approach to the calculation of nonbonded interactions is similar to that described in Friedrichs et al.,15 having been developed at the same time and with overlapping authors, but with several differences and additional optimizations, including the way in which accumulations are handled.

The algorithm used for the calculation of the nonbond forces is identical for all three precision models. The pairwise interactions between atoms i and j, which can schematically be represented by a matrix as in Figure 2, are grouped together into tiles of dimension W × W. The evaluation of the nonbond forces for each tile is dynamically assigned to independent warps across all of the Streaming Multiprocessors (SMs) of a given GPU, which are each running a sufficient number of threads to bury operational latency. The reason for doing this is that the GPU schedules work in terms of warps, which effectively perform the same mathematical operation on W values at once. All current NVIDIA GPUs process warps as 32 threads wide, so optimum performance is achieved by making W = 32. Since the interaction between two atoms i and j is symmetric, only the blue diagonal tiles and the green off-diagonal tiles need to be considered for the calculation, as described below. Unlike the CPU code, which can simply loop over atoms i and j, it is crucial to divide up the calculation on the GPU into as many warp-size blocks as possible. However, this presents a problem when it comes to accumulating the forces for each atom, since the warps can be scheduled for computation in any order. A naive accumulation of the resulting forces on each atom can thus lead to race conditions and incorrect results.

Figure 2. Schematic representation of the work-load distribution for the calculation of nonbond forces with N atoms. Each square represents the interactions between two atoms i and j for which the resulting forces need to be evaluated. These are grouped together into tiles of size W × W that are each assigned to an independent warp. Due to symmetry, only the blue diagonal tiles and the green off-diagonal tiles need to be considered for the calculation. For details, see the text.
The use of atomic operations within the nonbond routine to avoid this problem, however, represents a serial bottleneck that would be unacceptable in terms of performance. The approach we use is thus to allocate (N/W) + 1 output buffers per atom, where N is the total number of atoms and W is the warp size. In the case of the nonbond force calculation, each output buffer is three double-precision values wide for the SPDP and DPDP precision models and three single-precision values wide for the SPSP precision model, corresponding to the force in the x, y, and z directions of Cartesian coordinate space. Forces calculated within a tile can then be summed for each atom without fear of overwriting memory, effectively providing a separation in space between tiles. At the conclusion of the force calculation, we assign one thread per atom in linear order and cycle through the output buffers, summing them to obtain the total force on each atom.

While this approach provides a separation in space between tiles, it is still necessary to deal with race conditions within a tile. We have to distinguish on-diagonal tiles (blue) and off-diagonal tiles (green); see Figure 2. The on-diagonal tiles include the interaction of atoms i to i + W − 1 with themselves and thus include the self-interaction from the GB term. In contrast, off-diagonal tiles include the interaction of atoms i to i + W − 1 with j to j + W − 1, where i ≠ j. To avoid race conditions, it is necessary to handle these two types of tile in different ways.

On-diagonal tiles store two copies of the coordinates and associated parameters for each atom, one set in shared memory and the second set in the registers (one atom per thread). Each thread in the warp then runs linearly through the shared memory copies, accumulating only the force acting on its own register atom in the corresponding force output buffer. As illustrated in Figure 2, on the first iteration, the force on atom i resulting from interacting with itself is obtained and stored in the accumulator for i, while simultaneously the force on atom i + 1 resulting from interacting with atom i is obtained and stored in the accumulator for atom i + 1. Similarly, the forces on atoms i + 2 to i + W − 1 resulting from interacting with atom i are obtained and stored in the corresponding accumulators. All threads then step to the next column to obtain the forces on atoms i to i + W − 1 due to interactions with atom i + 1, and so on. This approach does some excessive computational work since, for any given atom pair ij, the force on atom j is just the negative of the force on atom i, and thus only the upper or lower triangle of the tile would have to be considered. However, this is far outweighed by the advantage of avoiding the race conditions that would result from several threads updating their force contribution to the same atom.

In the case of off-diagonal tiles, a different approach is taken such that the symmetry in the interactions is exploited while avoiding race conditions. Again, the coordinates and associated parameters of atoms labeled j to j + W − 1 are placed in the registers, while the coordinates and associated parameters of atoms i to i + W − 1 are placed in shared memory. If the off-diagonal tiles were handled in the same fashion as the on-diagonal tiles, on the first iteration the force on atom j due to interacting with atom i would be calculated and from this the corresponding force on atom i obtained by negation. At the same time, the force on atom j + 1 due to interacting with atom i would be calculated and from this the corresponding force on atom i obtained by negation. Thus, the forces on atom i due to all atoms j of the tile would be calculated at the same time, which would require atomic operations to accumulate correctly. The solution to this problem that is implemented in AMBER is to partition the off-diagonal tiles in time, as illustrated in Figure 2. By starting each thread offset by its thread ID, we avoid race conditions in the accumulation and eliminate the need for performance-destroying atomic operations.
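A heavily simplified skeleton of this tile scheme is sketched below; the work-list handling, exclusions, the actual vdW/electrostatic/GB pair terms, and the accumulation of forces on the shared-memory atoms are all omitted, and the indexing is illustrative only:

    #define W 32  // warp size

    __global__ void nonbond_tiles(const float4 *crdq,   // x, y, z, charge per atom
                                  double3 *outbuf,      // natoms * nbuffers force buffers
                                  int natoms, int nbuffers)
    {
        __shared__ float4 sh_i[W];
        int lane = threadIdx.x % W;

        // In the production scheme the tile indices (ti, tj) come from a
        // per-warp work list; block indices are used here as placeholders.
        int ti = blockIdx.x;
        int tj = blockIdx.y;
        int i  = ti * W + lane;
        int j  = tj * W + lane;

        sh_i[lane] = crdq[min(i, natoms - 1)];   // i atoms in shared memory
        __syncthreads();
        float4 reg_j = crdq[min(j, natoms - 1)]; // j atom of this thread in registers
        double3 f_j = make_double3(0.0, 0.0, 0.0);

        for (int k = 0; k < W; ++k) {
            // off-diagonal tiles start each lane at a different column so no
            // two lanes ever touch the same i atom in the same iteration
            int col = (ti == tj) ? k : (k + lane) % W;
            float4 a_i = sh_i[col];
            float dx = reg_j.x - a_i.x, dy = reg_j.y - a_i.y, dz = reg_j.z - a_i.z;
            float r2 = dx * dx + dy * dy + dz * dz;
            (void)r2;   // pair term evaluated in SP and accumulated into f_j in DP
        }

        // each (atom, tile) pair owns a distinct buffer entry, so this store
        // cannot collide with another warp; a later pass sums the buffers per atom
        outbuf[j * nbuffers + ti] = f_j;
    }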
Generalized Born Terms. The GB terms are calculated in an identical fashion to the nonbond terms described above. The Born radii are first calculated within their own kernel and reduced to per-atom radii in an analogous fashion to how the nonbond forces are handled. The remainder of the nonbond calculation is then split over two additional kernels.

Bonded and 1−4 Interactions. The bonded and 1−4 terms represent a very small fraction of the computational workload (typically <1% of an iteration), and thus optimization of their calculation is not critical for performance compared with, for example, the nonbond interactions. However, efficient calculation on a GPU still requires some consideration of how these terms are calculated in order to exploit the massive parallelism within the GPU while avoiding race conditions and memory overwrites. Our implementation simply creates a list of the interactions, sorted by type, and then divides up the bonded and 1−4 terms across SMs on a per-interaction basis. Since the resulting forces need to be summed for each atom, there is the potential for a race condition to occur during this reduction if two or more GPU threads attempt to accumulate forces for the same atom at the same time in global memory. Our initial attempts to avoid this used atomic operations for the reduction of the resulting forces to individual atoms but showed very poor performance. Our solution is instead to make use of the tile reduction buffers from the nonbond calculation as described above and then sum these forces in the reduction step used in the nonbond calculation.

Harmonic Restraints. At the time of writing, the AMBER GPU implementation supports harmonic restraints to a reference structure. These are handled in an identical fashion to the bonded interactions, being calculated as a bond between an atom and a fixed virtual particle representing the reference structure.

SHAKE Algorithm. The SHAKE implementation is currently restricted to hydrogen atoms since this represents >99% of the types of simulation AMBER users run. This restriction has the benefit that each heavy atom and its attached hydrogen atoms can thus be handled independently. Our approach is to simply spawn a thread for each heavy atom during the SHAKE calculation. This provides sufficient parallelization to keep the GPU busy. The use of double precision entirely within the SHAKE routine provides sufficient precision to handle SHAKE tolerances on the order of 10−7 Å and thus avoids the complexities involved in attempting to implement SHAKE constraints in an internal coordinate representation, as would be required if SHAKE were to be carried out in single precision.

Coordinate Update. Integration is carried out on the GPU in a manner analogous to that on the CPU since it is intrinsically parallel, given that each atom can be handled independently. The updated coordinates are stored in global memory on the GPU. The integration is done on the GPU rather than on the CPU to negate the need for expensive copy operations of the coordinates back and forth between CPU and GPU.
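Because every atom is independent, the update step reduces to a trivially parallel kernel of roughly the following form (a simplified leapfrog-style sketch; the array layout and the coupling to the thermostats are simplified here):

    __global__ void update_coordinates(double *x, double *v, const double *f,
                                       const double *inv_mass, double dt, int n3)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per degree of freedom
        if (i >= n3) return;

        v[i] += dt * f[i] * inv_mass[i / 3];   // advance velocity by one step
        x[i] += dt * v[i];                     // advance coordinate with the new velocity
    }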
Thermostats. As mentioned above, three different thermostats are supported. The Berendsen and Langevin thermostats are applied in the same way during the coordinate update step, while the Andersen thermostat is implemented as a separate kernel that randomizes the velocities at regular intervals. The reason for including the Berendsen and Langevin thermostats within the integration kernel is that they operate on every time step, while the Andersen thermostat is typically called only every few hundred time steps. In the case of the Langevin thermostat, a total of 3N random numbers are required on each step; these are extracted from a pool of GPU-generated random numbers, stored in global memory, which is refilled as needed using the parallel random number generator implemented within the CURAND library from the CUDA toolkit. The same pool is used for the Andersen thermostat as needed.

Additional Optimizations. In the interest of performance and of conserving limited GPU resources, the energy is only calculated alongside the forces when needed. Calculation of the energies causes a small degree of register spillage to global memory as well as additional code execution, and since one only needs these values periodically, typically every 1 to 2 ps when writing to the output file, it is not necessary to calculate them on every iteration. Additionally, this is one aspect of the algorithm that is allowed to be nondeterministic at the level of double-precision round-off error. We allow this because the energy values have no effect on the trajectory of the simulation. However, it was disturbing to observe this in action in the SPSP mode, where nondeterminism was initially allowed at the level of single-precision round-off error. This was harmless, as the forces remained consistent, but in the interest of avoiding confusion among end users, with regression tests showing differences for repeat runs in SPSP mode, all energy summation was promoted to double precision, where differences in round-off of the energies are seen only once in approximately 10^6 printed interactions.

Parallelization. The parallel GPU implementation is currently written exclusively using MPI. The reasons for this were to maximize portability of the code and to avoid hardware-induced complexities such as the fact that inter-GPU communication as implemented in CUDA 4.0 requires all GPUs to be on the same PCIe controller. Support for GPU to GPU copies between different nodes is also not yet sufficiently mature to be exploited in a widely used software package. For these reasons, the current parallel GPU scaling should be considered a lower bound on performance that will likely improve considerably in subsequent versions of AMBER running on next generation hardware.

Unlike the CPU MPI implementation, the GPU MPI implementation is fully deterministic for a given number of nodes and GPUs. At present, our GPU implementation performs significant load balancing between the SMs within each GPU rather than between GPUs, which makes it possible to produce a deterministic implementation. Achieving good parallel scaling on CPU clusters has always required the use of extensive load balancing due to the noise introduced by the operating system sharing the CPU resources on a node. GPUs, on the other hand, provide significantly more stable performance, and thus load balancing between GPUs is not as critical.

In the current version of the software (AMBER v11), the GPU parallel support for GB calculations is implemented by dividing the force calculation evenly and linearly across all GPUs. For M GPUs running a simulation consisting of N atoms, GPU i calculates all forces for atoms (i − 1) × N/M + 1 to i × N/M. Since GB is a full O(N^2) calculation, this introduces a small degree of redundancy at the GPU level, with a worst-case outcome of doing twice the overall calculation, but usually much less than this as long as N ≫ M. As currently implemented, there are three stages to the calculation in parallel, each in need of synchronizing force data across all GPUs. The procedure used consists of three sequential MPI_allGather operations per iteration to merge Born radii, Born force, and general force data. As the hardware evolves and GPUs can reliably communicate directly with each other within a node, we intend to scrap the intranode MPI communication and instead replace it with direct peer-to-peer copies between GPUs.
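One of these per-step merge operations can be sketched with a single collective call (illustrative only; it assumes the atom count divides evenly among the ranks, and a real implementation would also handle the remainder):

    #include <mpi.h>

    // Each rank has filled its own contiguous block of h_frc; gather the full array on all ranks.
    void merge_forces(double *h_frc, int natoms, int nranks, MPI_Comm comm)
    {
        int chunk = 3 * (natoms / nranks);          // doubles contributed by each rank
        MPI_Allgather(MPI_IN_PLACE, chunk, MPI_DOUBLE,
                      h_frc, chunk, MPI_DOUBLE, comm);
    }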
4. PERFORMANCE

To assess both the serial and parallel performance that can be achieved with GPU-accelerated AMBER, we ran a series of MD simulations representative of typical research scenarios using the GPU and CPU implementations on a variety of hardware. The systems used consisted of partially folded TRPCage52 (304 atoms), ubiquitin53,54 (1231 atoms, PDB code 1UBQ), apo-myoglobin (2492 atoms), and nucleosome (25095 atoms, PDB code 1KX5).

In all simulations, the ff99SB48 version of the AMBER force field was used with a time step of 2 fs and bonds to hydrogen atoms constrained using the SHAKE algorithm. The Hawkins, Cramer, and Truhlar GB model37 (AMBER input igb=1) was used with no cutoff applied to the nonbonded interactions and a cutoff of 15 Å for the calculation of the effective GB radii. The output and trajectory files were written every 1000 steps (2 ps). Input files for these simulations are provided in the Supporting Information.

The software base used for all simulations was AMBER version 11, including patches 1−15 for AmberTools and patches 1−17 for AMBER, which were released on August 18, 2011.55 The executables were built under the RedHat Enterprise Linux 5 operating system with Intel compiler version 11.1.069, the Intel MKL library version 10.1.1.019, and the NVIDIA CUDA compiler version 4.0. MVAPICH2 version 1.5 was used for the parallel runs for both the CPU and GPU versions of the code. The default SPDP precision model was used in the GPU code. While the Fortran compiler and the use of MKL have little impact on the GPU code performance, this compiler/library combination gives the best performance for the CPU code and thus forms a fair basis when assessing the capabilities of the GPU implementation. Both the CPU and GPU simulations were run on machines equipped with a SuperMicro X8DTG-DF motherboard with dual hex-core Intel X5670 processors clocked at 2.93 GHz and Mellanox quadruple data rate (QDR) Infiniband interconnects. Each compute node was equipped with one GPU and one Infiniband card, which were each connected to a PCIe x16 slot. Error-correction code (ECC) was switched off on the Tesla cards, and the 64-bit NVIDIA Linux driver version 275.21 was used. Additionally, CPU simulations were run on the XSEDE Trestles supercomputer at the San Diego Supercomputer Center, whose nodes are equipped with four oct-core AMD Opteron 6136 processors clocked at 2.4 GHz and also interconnected with QDR Infiniband.
Serial GPU Performance. The results of the simulation timings for single GPU runs are summarized in Table 2. For small simulations such as TRPCage with 304 atoms, where the GPU's arithmetic hardware cannot be fully utilized, the serial GPU implementation is approximately twice as fast as a current state-of-the-art CPU node. The real advantage of the AMBER GPU implementation, however, becomes apparent for implicit solvent GB simulations of medium to large systems with 2500 to 25000 atoms. For apo-myoglobin (2492 atoms), the serial GPU version of the code is significantly faster on a single GPU than on a state-of-the-art CPU node. For nucleosome, this performance gap increases dramatically, such that a simulation that takes two weeks on a single desktop machine with one GPU would take nine months using the same desktop machine without GPU acceleration.

Table 2. Single GPU Throughput Timings (ns/day) for AMBER GB Simulations with a Time Step of 2 fs Using the Parallel CPU Version on One Node (12 Intel X5670 Cores or 32 AMD Opteron 6136 Cores) and the Serial GPU Version with the SPDP Precision Model on One Node (One Intel X5670 Core and One GPU)^a

    CPU/GPU                     TRPCage       ubiquitin      apo-myoglobin   nucleosome
                                (304 atoms)   (1231 atoms)   (2492 atoms)    (25095 atoms)
    GPU version
    M2090 (6 GB)                399.9         184.2          78.1            1.42
    C2070 (6 GB)                364.1         157.2          64.3            1.09
    C1060 (4 GB)                234.6         78.3           31.5            0.40
    GTX580 (1.5 GB, PNY XLR8)   471.1         215.9          88.7            −
    CPU version
    32 × Opteron 6136           225.0^b       29.9           10.3            0.08
    12 × X5670                  247.1         19.8           6.6             0.07

^a For details on the hardware and software stack, see the text. A dash indicates insufficient GPU memory for the simulation. ^b The CPU code requires >10 atoms per core, and thus the TRPCage simulation was run on 24 CPU cores.

Parallel GPU Performance. The parallel performance of the GPU and CPU implementations is compared in Table 3. A throughput of up to 135.1 ns/day for apo-myoglobin and up to 3.95 ns/day for nucleosome can be achieved with eight NVIDIA M2090 GPUs, which is a factor of 4 to 5 faster than the maximum throughput that can be achieved on a typical supercomputer, given that the CPU scaling plateaus long before it reaches the performance achieved by the GPU implementation. Even a single GPU is a factor of 2 to 3 faster than the CPU scaling limit.

Table 3. Multi-GPU Throughput Timings (ns/day) for AMBER GB Simulations with a Time Step of 2 fs Using the Parallel CPU Version (12 Intel X5670 Cores or 32 AMD Opteron 6136 Cores on Each Node) and the Parallel GPU Version with the SPDP Precision Model (One Intel X5670 Core and One GPU Per Node)^a

    CPU/GPU                apo-myoglobin (2,492 atoms)   nucleosome (25,095 atoms)
    GPU version
    8 × M2090              135.1                         3.95
    4 × M2090              115.0                         2.71
    2 × M2090              93.1                          1.80
    1 × M2090              78.1                          1.42
    CPU version
    2048 × Opteron 6136    −                             0.53
    1024 × Opteron 6136    −                             0.78
    512 × Opteron 6136     −                             0.65
    256 × Opteron 6136     −                             0.55
    128 × Opteron 6136     29.8                          0.31
    64 × Opteron 6136      18.3                          0.17
    32 × Opteron 6136      10.3                          0.08
    12 × X5670             6.6                           0.07

^a For details on the hardware and software stack, see the text. A dash indicates lower speed than with fewer nodes.

Precision Model Performance. As mentioned above, the AMBER GPU implementation supports three different precision models. The DPDP model is logically equivalent to the traditional DP approach used on the CPU. However, the use of DP on GPUs can be more detrimental to performance than its use on CPUs. Therefore, the desire to achieve high performance prompts the use of SP where possible. It is critical, though, to use SP carefully in order not to detrimentally affect the accuracy of the MD. For this reason, we designed our hybrid-precision SPDP model to achieve performance very close to that achievable with SP but in a manner that does not impact accuracy. While we recommend using the SPDP precision model, it is instructive to compare the performance of the different precision models available in the AMBER GPU implementation. As shown in Table 4, the single precision (SPSP) model achieves the highest performance, as expected, but, as highlighted in section 5, at the cost of accuracy.

Table 4. Throughput Timings (ns/day) for AMBER GB Simulations of Apo-Myoglobin (2,492 atoms) with a Time Step of 2 fs Using the Serial GPU Version with Different Precision Models^a

    precision model             SPSP    SPDP   DPDP
    M2090 (6 GB)                92.7    78.1   25.8
    C2070 (6 GB)                73.7    64.3   20.5
    C1060 (4 GB)                41.2    31.5   5.4
    GTX580 (1.5 GB, PNY XLR8)   111.4   88.7   16.0

^a For details on the hardware and software stack, see the text.
Our hybrid SPDP precision model, however, achieves a performance of greater than 75% of the SPSP precision model with accuracy comparable to that of the full double precision (DPDP) model, which is between 4 and 7 times slower.

5. VALIDATION

A critical and often overlooked aspect of any major change to a widely used scientific software package is a detailed validation of any and all new approximations made. In the case of the AMBER GPU implementation, the validation falls into two distinct categories. The first is a detailed testing of the code itself to ensure that it correctly simulates the existing systems and does not introduce any new bugs or contain critical logic errors. This is the key reason why we implemented a full double precision (DPDP) model within the GPU code. This precision model matches the CPU code to machine precision and thus can be used to check that the GPU code is indeed giving the correct answers. This allowed us to take the CPU-based regression tests that we have used for many years to validate an AMBER installation, run them using the GPU code, and obtain answers to the limits of machine precision. More complex, however, is the second category of validation, which is the careful testing of our hybrid (SPDP) precision model. This is significantly more complex since it requires a careful evaluation of any subtle differences in the dynamics as well as ensemble properties from converged simulations run in DPDP and SPDP precision. In this section, we attempt to extensively validate our SPDP model, comparing a number of key observables across all three (DPDP, SPDP, and SPSP) precision models.

5.1. Single Point Forces. The first metric tested was the effect of the numerical precision model on the force calculations as compared to a reference precision model, in this case, the result from the CPU implementation. We used the starting structures of the test systems described in section 4 above. The deviations in the forces are summarized in Table 5.

Table 5. Deviations of Forces (in kcal/(mol Å)) of the AMBER PMEMD GPU Implementation Using Different Precision Models As Compared to Reference Values Obtained with the CPU Implementation

    precision model   TRPCage       ubiquitin      apo-myoglobin   nucleosome
                      (304 atoms)   (1231 atoms)   (2492 atoms)    (25095 atoms)
    max deviation
    SPSP              3.0 × 10−3    4.8 × 10−3     4.2 × 10−3      2.7 × 10−2
    SPDP              5.6 × 10−5    3.7 × 10−4     1.6 × 10−4      1.1 × 10−3
    DPDP              1.1 × 10−8    7.3 × 10−8     3.4 × 10−8      8.0 × 10−8
    RMS deviation
    SPSP              5.0 × 10−4    6.1 × 10−4     4.1 × 10−4      1.5 × 10−3
    SPDP              7.0 × 10−6    1.5 × 10−5     8.1 × 10−6      3.0 × 10−5
    DPDP              1.5 × 10−9    3.6 × 10−9     2.6 × 10−9      3.2 × 10−9

The DPDP model matches the reference forces very closely, with maximum deviations not exceeding 10−7 kcal/(mol Å) and RMS deviations not exceeding 10−8 kcal/(mol Å), even for systems as large as nucleosome. These deviations are entirely due to the different order of execution of the floating point operations in the CPU and GPU implementations. The DPDP GPU implementation of PMEMD will thus generate trajectories of precision equivalent to the CPU implementation. The forces obtained from the SPSP model, however, show very large deviations from the reference values, on the order of up to 10−2 kcal/(mol Å) for the largest system studied. Such large deviations introduce significant numerical noise into the simulations, and it is reasonable to expect that this may render the results of MD simulations questionable. Calculating the force contributions in SP and accumulating them in DP, however, leads to an improvement of more than 1 order of magnitude in all cases for the SPDP model as compared to the SPSP model. In the following section, we will indeed show that the forces obtained from the SPSP model are not precise enough to conserve energy in biomolecular MD simulations and can lead to instabilities in long time scale MD simulations, while the SPDP model is sufficient for this purpose.
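The metric reported in Table 5 can be computed with a few lines of host code; the sketch below assumes the deviations are taken per Cartesian force component, which is an assumption rather than a statement of the exact convention used:

    #include <cmath>

    // Maximum and RMS deviation between a GPU force array and the CPU reference.
    void force_deviation(const double *f_gpu, const double *f_ref, int n3,
                         double *max_dev, double *rms_dev)
    {
        double maxd = 0.0, sum2 = 0.0;
        for (int i = 0; i < n3; ++i) {
            double d = std::fabs(f_gpu[i] - f_ref[i]);
            if (d > maxd) maxd = d;
            sum2 += d * d;
        }
        *max_dev = maxd;
        *rms_dev = std::sqrt(sum2 / n3);
    }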
5.2. Energy Conservation. One of the most important gauges for judging the precision of MD software is its numerical stability, which is reflected in its ability to conserve the constants of motion, in particular the energy. To this end, we have performed constant energy MD simulations of the first three test systems described in section 4 above. We collected data for 100 ns (TRPCage), 50 ns (ubiquitin), and 20 ns (apo-myoglobin) after an initial equilibration for 1 ns at 300 K using Langevin dynamics. Center-of-mass motion was removed before starting the constant energy runs. In addition to simulations using a time step of 2 fs with bonds to hydrogen atoms constrained using the SHAKE algorithm with a relative geometrical tolerance of 10−6, we have run simulations using time steps of 0.5 and 1.0 fs without constraints.

The energy drifts along the trajectories are summarized in Table 6, while Figure 3 shows a plot of the total energy along the trajectories of TRPCage and ubiquitin for the different precision models at a time step of 0.5 fs.

Table 6. Energy Drifts Per Degree of Freedom (kT/ns/dof) from Simulations of 100 ns (TRPCage), 50 ns (Ubiquitin), and 20 ns (Apo-Myoglobin)^a

    time step                  0.5 fs       1.0 fs      2.0 fs
    TRPCage (304 atoms)
    CPU                        0.000006     0.000066    0.000355
    GPU (DPDP)                 0.000012     0.000082    0.000382
    GPU (SPDP)                 0.000003     0.000070    0.000222
    GPU (SPSP)                 0.000184     0.000252    −
    ubiquitin (1231 atoms)
    CPU                        0.000004     0.000011    −0.000216
    GPU (DPDP)                 0.000001     0.000006    −0.000247
    GPU (SPDP)                 0.000003     0.000030    −0.000165
    GPU (SPSP)                 0.001065     0.000305    −
    apo-myoglobin (2492 atoms)
    CPU                        0.000012     0.000094    0.000416
    GPU (DPDP)                 −0.000004    0.000117    0.000290
    GPU (SPDP)                 0.000019     0.000185    0.000139
    GPU (SPSP)                 0.002230     0.000442*   −

^a The SHAKE algorithm to constrain bond lengths to hydrogen atoms was used for a time step of 2.0 fs; no constraints were used for smaller time steps. A dash indicates that the system heated up so much during the simulation that it is meaningless to report an energy drift. An asterisk indicates that the energy drift increases dramatically for longer time scales.

The corresponding plot for apo-myoglobin is very similar to that for ubiquitin and can be found in the Supporting Information, along with plots for the larger time steps. The plots underline that it is important to validate the precision models on trajectories that are long enough to uncover numerical instabilities. While the SPSP model seems to perform reasonably well for a trajectory of 1 ns length, in particular for smaller systems like TRPCage, it becomes apparent that the errors introduced lead to unacceptably large energy drifts in the long term. Apart from the magnitude in the