Computational Materials Science 38 (2007) 642–652
www.elsevier.com/locate/commatsci

A divide-and-conquer/cellular-decomposition framework for million-to-billion atom simulations of chemical reactions

Aiichiro Nakano a,*, Rajiv K. Kalia a, Ken-ichi Nomura a, Ashish Sharma a, Priya Vashishta a, Fuyuki Shimojo a,b, Adri C.T. van Duin c, William A. Goddard c, Rupak Biswas d, Deepak Srivastava d

a Collaboratory for Advanced Computing and Simulations, Department of Computer Science, Department of Physics & Astronomy, Department of Chemical Engineering & Materials Science, University of Southern California, Los Angeles, CA 90089-0242, USA
b Department of Physics, Kumamoto University, Kumamoto 860-8555, Japan
c Materials and Process Simulation Center, Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125, USA
d NASA Advanced Supercomputing (NAS) Division, NASA Ames Research Center, Moffett Field, CA 94035, USA

Received 22 February 2006; accepted 21 April 2006

Abstract

To enable large-scale atomistic simulations of material processes involving chemical reactions, we have designed linear-scaling molecular dynamics (MD) algorithms based on an embedded divide-and-conquer (EDC) framework: first principles-based fast reactive force-field (F-ReaxFF) MD; and quantum-mechanical MD in the framework of the density functional theory (DFT) on adaptive multigrids. To map these O(N) algorithms onto parallel computers with deep memory hierarchies, we have developed a tunable hierarchical cellular-decomposition (THCD) framework, which achieves performance tunability through a hierarchy of parameterized cell data/computation structures and adaptive load balancing through wavelet-based computational-space decomposition. Benchmark tests on 1920 Itanium2 processors of the NASA Columbia supercomputer have achieved unprecedented scales of quantum-mechanically accurate and well validated, chemically reactive atomistic simulations—0.56 billion-atom F-ReaxFF MD and 1.4 million-atom (0.12 trillion grid points) EDC–DFT MD—in addition to 18.9 billion-atom nonreactive space–time multiresolution MD. The EDC and THCD frameworks expose maximal data localities, and consequently the isogranular parallel efficiency on 1920 processors is as high as 0.953. Chemically reactive MD simulations have been applied to shock-initiated detonation of energetic materials and stress-induced bond breaking in ceramics in corrosive environments.
© 2006 Elsevier B.V. All rights reserved.

PACS: 02.70.-c; 02.70.Ns; 71.15.-m

Keywords: Molecular dynamics; Reactive force field; Quantum mechanics; Density functional theory; Parallel computing

* Corresponding author. Tel.: +1 213 821 2657; fax: +1 213 821 2664. E-mail address: [email protected] (A. Nakano).
0927-0256/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.commatsci.2006.04.012

1. Introduction

There is growing interest in large-scale molecular dynamics (MD) simulations [1–4] involving million-to-billion atoms, in which interatomic forces are computed quantum mechanically [5] to accurately describe chemical reactions. Such large reactive MD simulations would for the first time provide requisite coupling of chemical reactions, atomistic processes, and macroscopic materials phenomena, to solve a wide spectrum of problems of great societal impact. Examples of technological significance include: stress corrosion cracking (corrosion-related direct costs make up 3% of the gross domestic product in the US), where chemical reactions at the crack tip are inseparable from long-range stress fields [6]; energetic nanomaterials to boost the impulse of rocket fuels, in which chemical reactions sustain shock waves (see Fig. 1(a)) [7]; and micrometeorite impact damage to the thermal and radiation protection layers of aerospace vehicles, understanding of which is essential for safer space flights (Fig. 1(b)). Emerging petaflops computers could potentially extend the realm of quantum-mechanical (QM) simulation [8] to macroscopic scales, but only if scalable parallel simulation technologies were developed.
Fig. 1. (a) Reactive force-field molecular dynamics simulation of shock-initiated combustion of an energetic nanocomposite material (nitramine matrix [7] embedded with aluminum nanoparticles [9]). (b) Rendering of a molecular dynamics simulation on a tiled display at USC, showing hypervelocity impact damage of a ceramic plate with an impact velocity of 15 km/s, where one quarter of the system is cut away to show the internal pressure distribution (the projectile is shown in white). Such simulations help design thermal and radiation protection systems of aerospace vehicles that are tolerant to micrometeorite impacts (where impact speeds are as high as 40 km/s).

In the past few years, several promising approaches have emerged toward achieving million-to-billion atom simulations of chemical reactions. One computational approach toward QM-based million-atom MD simulations is to perform a number of small density functional theory (DFT) [10,11] calculations "on the fly" to compute interatomic forces quantum mechanically. We have recently designed a linear-scaling divide-and-conquer DFT algorithm on adaptive multigrids, which achieves robust convergence, controlled errors, and energy conservation during MD simulations [12]. Here we present the first million-atom DFT-based MD simulation, in which the electronic wave functions are represented on 10^11 grid points. An alternative to this concurrent DFT-based MD approach is a sequential DFT-informed strategy, which employs environment-dependent interatomic potentials based on: (1) variable atomic charges to describe charge transfer; and (2) reactive bond orders to describe chemical bond formation and breakage. In our first principles-based reactive force-field (ReaxFF) approach [7,13], the parameters in the interatomic potential are "trained" to best fit thousands of DFT calculations on small clusters (the number of atoms, N ≈ 10) of various atomic-species combinations. This paper presents a new O(N) parallel ReaxFF algorithm, which for the first time enables billion-atom MD simulations of chemical reactions.

The first contribution of this paper is a unified embedded divide-and-conquer (EDC) algorithmic framework for designing linear-scaling parallel algorithms for broad scientific and engineering problems, with specific applications to two reactive atomistic simulation methods: ReaxFF MD and DFT-based MD. Mapping these O(N) algorithms onto multi-teraflops to petaflops parallel computers, however, poses a number of challenges, e.g., achieving high scalability for irregularly distributed billions of atoms, and enabling performance portability [14] for a wide range of parallel architectures. To overcome these challenges, the second contribution of this paper is a tunable hierarchical cellular-decomposition (THCD) framework, which is aware of deep memory hierarchies, maximally exposing data locality and exploiting parallelism at each decomposition level. The framework features topology-preserving computational-space decomposition and wavelet-based adaptive load balancing. To ensure performance portability, a hierarchy of cell data/computation structures is parameterized and tuned on each platform. The major accomplishment of this paper is the unprecedented scale of quantum-mechanically accurate and well validated, chemically reactive atomistic simulations on the NASA Columbia supercomputer. Benchmark tests on 1920 Itanium2 processors have achieved 0.56 billion-atom ReaxFF and 1.4 million-atom (0.12 trillion degrees of freedom) DFT simulations, in addition to an 18.9 billion-atom MD simulation, with isogranular parallel efficiency as high as 0.953.

This paper is organized as follows. In the next section, we describe the EDC algorithmic framework. Section 3 discusses the THCD parallel computing framework. Results of benchmark tests are given in Section 4, and Section 5 describes applications of large-scale chemically reactive MD simulations to shock-initiated detonation of energetic materials and stress-induced bond breaking in ceramics in corrosive environments. Finally, Section 6 contains conclusions.
2. Linear-scaling embedded divide-and-conquer simulation algorithms

We have developed a unified algorithmic framework to design linear-scaling algorithms for broad scientific and engineering applications, based on data locality principles. In the embedded divide-and-conquer (EDC) algorithms, spatially localized subproblems are solved in a global embedding field, which is efficiently computed with tree-based algorithms (see Fig. 2). Examples of the embedding field are the electrostatic field in molecular dynamics (MD) simulations and the self-consistent Kohn–Sham potential in the density functional theory (DFT).

Fig. 2. Schematic of an embedded divide-and-conquer (EDC) algorithm. (Left) The physical space is subdivided into spatially localized cells, with local atoms constituting subproblems (bottom), which are embedded in a global field (shaded) solved with a tree-based algorithm. (Right) To solve the subproblem in domain Ω_α in the EDC–DFT algorithm, coarse multigrids (gray) are used to accelerate iterative solutions on the original real-space grid (corresponding to grid refinement level l = 3). The bottom panel shows fine grids adaptively generated near the atoms (spheres) to accurately operate the ionic pseudopotentials on the electronic wave functions.

Specifically, we have used the EDC framework to develop a suite of linear-scaling MD simulation algorithms for materials research, in which interatomic forces are computed with varying accuracy and complexity. The linear-scaling algorithms encompass a wide spectrum of physical reality: (1) classical MD based on a many-body interatomic potential model, which involves the formally O(N^2) N-body problem; (2) environment-dependent, reactive force-field (ReaxFF) MD, which involves the O(N^3) variable N-charge problem; and (3) quantum-mechanical (QM) calculations based on the DFT, which provide approximate solutions to the exponentially complex quantum N-body problem.

2.1. Space–time multiresolution molecular dynamics algorithm

We have designed chemically reactive O(N) MD simulation algorithms on the basis of their nonreactive predecessor, the space–time multiresolution molecular dynamics (MRMD) algorithm [3]. In the MD approach, one obtains the phase-space trajectories of the system (positions and velocities of all atoms at all times). The force law describing how atoms interact with each other is mathematically encoded in the interatomic potential energy, E_MD(r^N), which is a function of the positions of all N atoms, r^N = {r_i | i = 1,...,N}, in the system. In our many-body interatomic potential scheme, the energy, E_MD({r_ij},{r_ijk}), is an analytic function that depends on the relative positions of atomic pairs, r_ij, and triplets, r_ijk. The time evolution of r^N is governed by a set of coupled ordinary differential equations. For interatomic potentials with finite ranges, the computational cost is made O(N) using a linked-list cell approach [3]. For the long-range electrostatic interaction, we use the fast multipole method (FMM) [15–17] to reduce the O(N^2) computational complexity of the N-body problem to O(N). In the FMM, the physical system is recursively divided into subsystems to form an octree data structure, and the electrostatic field is computed recursively on the octree with O(N) operations, while maintaining the spatial locality at each recursion level. Our scalable parallel implementation of the FMM has a unique feature to compute atomistic stress-tensor components based on a complex charge method [17]. The MRMD algorithm also utilizes temporal locality through multiple time stepping (MTS), which uses different force-update schedules for different force components [16,18–20]. Specifically, forces from the nearest-neighbor atoms are computed at every MD step, whereas forces from farther atoms are updated less frequently.

For parallelization of MD simulations, we use spatial decomposition [3]. The total volume of the system is divided into P subsystems of equal volume, and each subsystem is assigned to a node in an array of P compute nodes. To calculate the force on an atom in a subsystem, the coordinates of the atoms in the boundaries of neighbor subsystems are "cached" from the corresponding nodes. After the atomic positions are updated by the time-stepping procedure, some atoms may have moved out of their subsystems; these atoms are "migrated" to the proper neighbor nodes. With the spatial decomposition, the computation scales as N/P, while communication scales in proportion to (N/P)^(2/3) for an N-atom system. Tree-based algorithms such as the FMM incur an O(log P) overhead, which is negligible for coarse-grained (N/P >> P) applications.
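As an editorial illustration of the linked-list cell approach referenced above, the short sketch below bins atoms into cells whose edge is at least the interaction cutoff, so that a neighbor search only needs to scan the 27 surrounding cells, giving O(N) cost for finite-ranged potentials. It is a minimal, self-contained example with hypothetical box dimensions and cutoff, not the MRMD implementation itself.

// Minimal linked-list cell sketch (illustrative; not the MRMD code).
#include <algorithm>
#include <array>
#include <cstdio>
#include <vector>

struct CellList {
  std::array<int, 3> ncell;   // cells per dimension
  std::vector<int> head;      // head[c] = first atom in cell c, or -1
  std::vector<int> next;      // next[i] = next atom in atom i's cell, or -1
};

CellList buildCellList(const std::vector<std::array<double, 3>>& pos,
                       const std::array<double, 3>& box, double cutoff) {
  CellList cl;
  for (int d = 0; d < 3; ++d) cl.ncell[d] = std::max(1, int(box[d] / cutoff));
  cl.head.assign(cl.ncell[0] * cl.ncell[1] * cl.ncell[2], -1);
  cl.next.assign(pos.size(), -1);
  for (int i = 0; i < int(pos.size()); ++i) {
    int ic[3];
    for (int d = 0; d < 3; ++d)
      ic[d] = std::min(cl.ncell[d] - 1, int(pos[i][d] / box[d] * cl.ncell[d]));
    int c = (ic[0] * cl.ncell[1] + ic[1]) * cl.ncell[2] + ic[2];
    cl.next[i] = cl.head[c];  // push atom i onto cell c's singly linked list
    cl.head[c] = i;
  }
  return cl;
}

int main() {
  std::array<double, 3> box = {10.0, 10.0, 10.0};  // hypothetical box (arbitrary units)
  std::vector<std::array<double, 3>> pos = {
      {1.0, 1.0, 1.0}, {1.5, 1.2, 0.9}, {8.0, 7.5, 3.0}, {9.9, 9.9, 9.9}};
  CellList cl = buildCellList(pos, box, 2.5);      // hypothetical cutoff
  int occupied = 0;
  for (int h : cl.head) occupied += (h != -1);
  std::printf("%d of %zu cells occupied\n", occupied, cl.head.size());
  return 0;
}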
2.2. Fast reactive force-field molecular dynamics algorithm

The density functional theory (DFT) [10,11] has reduced the exponentially complex quantum-mechanical problem to O(M^3), by solving M one-electron problems self-consistently instead of one M-electron problem. Unfortunately, DFT-based MD simulations [5] are rarely performed for more than N ≈ 10^2 atoms because of the O(N^3) computational complexity, which severely limits their scalability. The first principles-based ReaxFF approach [7,13] significantly reduces the computational cost of simulating chemical reactions. However, because of the O(N^3) complexity associated with the variable N-charge problem and the multitude of atomic n-tuple information (n = 2–6) required to compute interatomic forces, parallelization of the ReaxFF has only seen limited success, and the largest ReaxFF MD simulations to date have involved N < 10^4 atoms.

The O(N^3) complexity of the ReaxFF arises from the dense linear system of equations that determines the atomic charges, {q_i | i = 1,...,N}, at every MD step, i.e., the variable N-charge problem. We have developed a fast reactive force-field (F-ReaxFF) MD algorithm, which reduces the complexity to O(N) by combining the FMM [15–17], based on spatial locality, with an iterative minimization approach that utilizes the temporal locality of the solutions [21]. To further accelerate the convergence, we use a multilevel preconditioned conjugate-gradient (MPCG) method [9,21], splitting the Coulomb-interaction matrix into short- and long-range components and using the sparse short-range matrix as a preconditioner. The extensive use of the sparse preconditioner enhances the data locality, and thereby improves the parallel efficiency [21].
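To make the variable N-charge problem concrete, the toy sketch below solves a small charge-equilibration system H q = -chi with a preconditioned conjugate-gradient iteration. The hardness matrix H and electronegativities chi are hypothetical, the preconditioner is just diag(H) standing in for the sparse short-range Coulomb matrix, and the charge-neutrality constraint is omitted; it illustrates the linear-algebra pattern, not the MPCG solver of Ref. [21].

// Toy preconditioned CG solve of a charge-equilibration system (hypothetical data).
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

Vec matvec(const Mat& A, const Vec& x) {
  Vec y(x.size(), 0.0);
  for (size_t i = 0; i < x.size(); ++i)
    for (size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
  return y;
}

double dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

int main() {
  const int n = 4;
  // Hypothetical symmetric positive-definite hardness/Coulomb matrix.
  Mat H = {{10, 1, 0.5, 0.2}, {1, 12, 1, 0.5}, {0.5, 1, 11, 1}, {0.2, 0.5, 1, 9}};
  Vec chi = {4.0, 5.5, 6.0, 3.5};                  // hypothetical electronegativities
  Vec q(n, 0.0), r(n), z(n), p(n);
  for (int i = 0; i < n; ++i) r[i] = -chi[i];      // residual r = b - H q with q = 0
  for (int i = 0; i < n; ++i) z[i] = r[i] / H[i][i];  // apply diagonal preconditioner
  p = z;
  double rz = dot(r, z);
  for (int it = 0; it < 100 && std::sqrt(dot(r, r)) > 1e-10; ++it) {
    Vec Hp = matvec(H, p);
    double alpha = rz / dot(p, Hp);
    for (int i = 0; i < n; ++i) { q[i] += alpha * p[i]; r[i] -= alpha * Hp[i]; }
    for (int i = 0; i < n; ++i) z[i] = r[i] / H[i][i];
    double rz_new = dot(r, z);
    for (int i = 0; i < n; ++i) p[i] = z[i] + (rz_new / rz) * p[i];
    rz = rz_new;
  }
  for (int i = 0; i < n; ++i) std::printf("q[%d] = %.6f\n", i, q[i]);
  return 0;
}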
The chemical bond order, B_ij, is an attribute of an atomic pair, (i,j), and changes dynamically, adapting to the local environment. In the ReaxFF, the interatomic potential energy, E_ReaxFF({r_ij},{r_ijk},{r_ijkl},{q_i},{B_ij}), between atomic pairs, r_ij, triplets, r_ijk, and quadruplets, r_ijkl, depends on the bond orders of all constituent atomic pairs [13]. Force calculations in ReaxFF MD thus involve up to atomic 4-tuples explicitly, and require information on 6-tuples implicitly due to chain-rule differentiations through the bond orders. To efficiently handle the resulting multiple interaction ranges, our parallel F-ReaxFF algorithm employs a multilayer cellular-decomposition (MCD) scheme for caching atomic n-tuple (n = 2–6) information (see Section 3).

The F-ReaxFF calculation of RDX (1,3,5-trinitro-1,3,5-triazine) in this paper has an extensive validation database against DFT calculations, which includes not only 1600 equilibrated molecular fragments but also 40 key reaction energies [7].

2.3. Divide-and-conquer density functional theory algorithm on adaptive multigrids

The concurrent DFT-based MD approach is best implemented with a divide-and-conquer algorithm [22], which is based on a data locality principle called quantum nearsightedness [23], and naturally leads to O(N) DFT calculations [12,22,24–27]. However, it is only in the past several years that O(N) DFT algorithms, especially those with large basis sets (>10^4 unknowns per electron, necessary for the transferability of accuracy [12,25–27]), have attained the controlled error bounds, robust convergence properties, and energy conservation during MD simulations needed to make large DFT-based MD simulations practical [12,26]. For example, we have recently designed an embedded divide-and-conquer density functional theory (EDC–DFT) algorithm, in which a hierarchical grid technique combines multigrid preconditioning and adaptive fine mesh generation [12].

The DFT can be formulated as a minimization of the energy functional, E_QM(r^N, psi^M), with respect to the electronic wave functions (or Kohn–Sham orbitals), psi^M(r) = {psi_n(r) | n = 1,...,M}, subject to orthonormality constraints (M is the number of independent electronic wave functions and is on the order of N). The EDC–DFT algorithm represents the physical system as a union of overlapping spatial domains, Ω = ∪_α Ω_α (see Fig. 2), and physical properties are computed as linear combinations of domain properties. For example, the electronic density is expressed as

  ρ(r) = Σ_α p^α(r) Σ_n f_n^α |psi_n^α(r)|^2,

where p^α(r) is a support function that vanishes outside the αth domain Ω_α, and f_n^α and psi_n^α(r) are the occupation number and the wave function of the nth Kohn–Sham orbital in Ω_α. The domains are embedded in a global Kohn–Sham potential, which is a functional of ρ(r) and is determined self-consistently with {f_n^α, psi_n^α(r)}. We use the multigrid method to compute the global potential in O(N) time [12,28].
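The density formula above amounts to a weighted sum of domain-local contributions. The sketch below assembles ρ on a 1D grid from two overlapping domains whose support functions form a partition of unity in the overlap; the orbitals, occupations, and domain boundaries are hypothetical, and the code only illustrates the linear combination of domain properties, not the EDC–DFT implementation.

// Sketch of rho(x) = sum_a p_a(x) sum_n f_n |psi_n(x)|^2 on a 1D grid (hypothetical data).
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  const int ngrid = 100;
  const double L = 10.0, dx = L / ngrid;
  // Two overlapping domains, [0,6] and [4,10]; support functions ramp linearly
  // in the overlap [4,6] so that p0(x) + p1(x) = 1 everywhere.
  auto p0 = [](double x) { return x <= 4.0 ? 1.0 : (x >= 6.0 ? 0.0 : (6.0 - x) / 2.0); };
  auto p1 = [&](double x) { return 1.0 - p0(x); };
  // One hypothetical domain-local "orbital" per domain, each with occupation 2.
  auto psi0 = [](double x) { return std::exp(-(x - 3.0) * (x - 3.0)); };
  auto psi1 = [](double x) { return std::exp(-(x - 7.0) * (x - 7.0)); };
  std::vector<double> rho(ngrid, 0.0);
  double total = 0.0;
  for (int i = 0; i < ngrid; ++i) {
    double x = (i + 0.5) * dx;
    rho[i] = p0(x) * 2.0 * psi0(x) * psi0(x)   // domain 0 contribution
           + p1(x) * 2.0 * psi1(x) * psi1(x);  // domain 1 contribution
    total += rho[i] * dx;
  }
  std::printf("integrated density = %.4f\n", total);
  return 0;
}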
The DFT calculation in each domain is performed using a real-space approach [29], in which the electronic wave functions are numerically represented on grid points (see Fig. 2). The real-space grid is augmented with coarser multigrids to accelerate the convergence of the iterative solutions [12,28,30]. Furthermore, a finer grid is adaptively generated near every atom, in order to accurately operate the ionic pseudopotentials for calculating electron–ion interactions [12]. We include electron–ion interactions using norm-conserving pseudopotentials [31] and the exchange-correlation energy in a generalized gradient approximation [32].

The EDC–DFT algorithm on the hierarchical real-space grids is implemented on parallel computers based on spatial decomposition [12]. Each compute node contains one or more domains of the EDC algorithm. For each domain, its electronic structure is computed independently, with little information needed from other compute nodes (only the global density, but not individual wave functions, is communicated). The resulting large computation/communication ratio makes this approach highly scalable on parallel computers.

The convergence of the new algorithm has been verified for nontrivial problems such as amorphous CdSe and liquid Rb [12]. The EDC–DFT calculation for alumina in this paper (with a domain size of 9.0 × 7.8 × 8.2 a.u. and buffer lengths of 4.5, 3.9, and 4.1 a.u.) reproduces the conventional O(N^3) DFT energy within 0.001 a.u. per atom. The EDC–DFT MD algorithm has also overcome the energy drift problem, which plagues most O(N) DFT-based MD algorithms [12].

3. Tunable hierarchical cellular-decomposition parallelization framework

Data locality principles are key to developing a scalable parallel computing framework as well. We have developed a tunable hierarchical cellular-decomposition (THCD) framework to map the above O(N) algorithms onto massively parallel computers with deep memory hierarchies. The THCD maximally exposes data locality and exploits parallelism at multiple decomposition levels, while providing performance tunability through a hierarchy of parameterized cell data/computation structures (see Fig. 3).

At the finest level, the EDC algorithms consist of computational cells—linked-list cells (which are identical to the octree leaf cells in the FMM) [15–17] in MRMD and F-ReaxFF, or domains in EDC–DFT [12] (see Fig. 3). In the THCD framework, each compute node (often comprising multiple processors with shared memory) of a parallel computer is identified as a subsystem (P_p^c in Fig. 3) in spatial decomposition, which contains a large number of computational cells. Our EDC algorithms are implemented as hybrid message-passing and shared-memory (MPI + OpenMP) programs, in which inter-node communication for caching and migrating atoms between P_p^c's is handled with messages, whereas loops over cells within each P_p^c (or MPI process) are parallelized with threads (denoted as dots with arrows in Fig. 3). To avoid performance-degrading critical sections, the threads are ordered by blocking cells, so that the atomic n-tuples being processed by the threads share no common atom.

Fig. 3. In tunable hierarchical cellular decomposition (THCD), the physical volume is subdivided into process groups, PG^c, each of which is spatially decomposed into processes, P_p^c. Each process consists of a number of computational cells (e.g., linked-list cells in MD or domains in EDC–DFT) of size l_cell, which are traversed concurrently by threads (denoted by dots with arrows) to compute interatomic forces in blocks of cells. P_p^c is dynamically augmented with n_layer layers of cached cells from neighbor processes.

On top of the computational cells, cell blocks, and spatial-decomposition subsystems, the THCD framework introduces a coarser level of decomposition by defining process groups as MPI communicators (PG^c = ∪_p P_p^c in Fig. 3). This provides a mechanism to optimize EDC applications distributed over a loosely coupled collection of parallel computers, e.g., a Grid of globally distributed parallel computers [33,34]. Our programs are designed to minimize global operations across PG^c's and to overlap computations with inter-group communications [34].
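A minimal skeleton of the hybrid MPI + OpenMP organization described above is sketched below: one MPI rank owns one spatial subsystem, and OpenMP threads traverse its cell blocks. The computeCellBlock() kernel and the step count are hypothetical placeholders; this is a structural sketch under those assumptions, not the production code.

// Hybrid MPI + OpenMP skeleton: one MPI rank per spatial subsystem,
// OpenMP threads over blocks of linked-list cells within the rank.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

void computeCellBlock(int block) { (void)block; /* placeholder force kernel */ }

int main(int argc, char** argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int nblocks = 64;  // cell blocks owned by this subsystem (hypothetical)
  for (int step = 0; step < 10; ++step) {
    // 1. Inter-node: cache boundary-cell atoms from neighbor subsystems (MPI sends/receives).

    // 2. Intra-node: threads traverse cell blocks; blocks are ordered so that
    //    concurrently processed n-tuples share no common atom.
#pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nblocks; ++b) computeCellBlock(b);

    // 3. Inter-node: migrate atoms that have left this subsystem (MPI).
  }
  if (rank == 0)
    std::printf("ran %d ranks x %d threads\n", nprocs, omp_get_max_threads());
  MPI_Finalize();
  return 0;
}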
The cellular data structure offers an effective abstraction mechanism for performance optimization. We optimize both data layouts (atoms are sorted according to their cell indices and the linked lists) and computation layouts (force computations are re-ordered by traversing the cells according to a space-filling curve, a mapping from the 3D space to a 1D list) [35]. Cells are traversed along either a Morton curve (Fig. 3) or a Hilbert curve, instead of the traditional raster-scan order. In the multi-threading case, the Morton curve ensures maximal separation between the threads and thus eliminates critical sections. Furthermore, the cell size is made tunable [14] to optimize the performance. The computational cells are also used in the multilayer cellular-decomposition (MCD) scheme for inter-node caching of atomic n-tuple (n = 2–6) information (see Fig. 3), where n_layer changes dynamically in the MTS or MPCG algorithm (see Section 2). The Morton curve also facilitates a data compression algorithm based on data locality to reduce the I/O. The algorithm uses octree indexing and sorts atoms accordingly on the resulting Morton curve [36]. By storing differences between successive atomic coordinates, the I/O requirement for a given error tolerance is reduced from O(N log N) to O(N). An adaptive, variable-length encoding scheme makes the method tolerant to outliers and is optimized dynamically. An order-of-magnitude improvement in I/O performance was achieved for MD data with a user-controlled error bound [36].

The THCD framework includes a topology-preserving computational spatial-decomposition scheme that minimizes latency through structured message passing [37] and minimizes load-imbalance/communication costs through a novel wavelet-based load-balancing scheme [38]. The load-balancing problem can be stated as an optimization problem, i.e., one minimizes the load-imbalance cost as well as the size and the number of messages [39]. To minimize the number of messages, we preserve the 3D mesh topology, so that message passing is performed in a structured way in only six steps. To minimize the load-imbalance cost as well as the message size, we have developed a computational-space decomposition scheme [37]. The main idea of this scheme is that the computational space shrinks where the workload density is high, so that the workload is uniformly distributed in the computational space. To implement the curved computational space, we introduce a curvilinear coordinate transformation. The sum of load-imbalance and communication costs is then minimized as a functional of the coordinate transformation, using simulated annealing. We have found that a wavelet representation leads to a compact description of the curved partition boundaries, and accordingly speeds up the convergence of the minimization procedure [38].
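The Morton-curve cell traversal used for the data and computation re-ordering above can be illustrated by interleaving the bits of the three cell indices. The sketch below computes such a key (assuming 10-bit indices, an illustrative choice) and sorts a small grid of cells by it; it mirrors the idea of mapping 3D cells onto a 1D list, not the specific encoding of Ref. [35].

// Morton (Z-order) key for a 3D cell index: interleave the bits of (ix, iy, iz).
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Spread the lower 10 bits of x so that they occupy every third bit position.
static std::uint64_t spreadBits(std::uint32_t x) {
  std::uint64_t r = 0;
  for (int b = 0; b < 10; ++b)
    if (x & (1u << b)) r |= std::uint64_t(1) << (3 * b);
  return r;
}

static std::uint64_t mortonKey(std::uint32_t ix, std::uint32_t iy, std::uint32_t iz) {
  return spreadBits(ix) | (spreadBits(iy) << 1) | (spreadBits(iz) << 2);
}

int main() {
  struct Cell { std::uint32_t ix, iy, iz; };
  std::vector<Cell> cells;
  for (std::uint32_t x = 0; x < 4; ++x)
    for (std::uint32_t y = 0; y < 4; ++y)
      for (std::uint32_t z = 0; z < 4; ++z) cells.push_back({x, y, z});
  std::sort(cells.begin(), cells.end(), [](const Cell& a, const Cell& b) {
    return mortonKey(a.ix, a.iy, a.iz) < mortonKey(b.ix, b.iy, b.iz);
  });
  for (int i = 0; i < 8; ++i)   // first eight cells of the Z-order traversal
    std::printf("(%u,%u,%u)\n", cells[i].ix, cells[i].iy, cells[i].iz);
  return 0;
}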
4. Performance tests

The two new parallel reactive MD algorithms, F-ReaxFF and EDC–DFT, as well as their nonreactive predecessor, MRMD, are portable and have been run on various platforms, including Intel Itanium2, Intel Xeon, AMD Opteron and IBM Power4-based parallel computers. This section presents performance tests of the three algorithms on some of these architectures.

4.1. Performance tunability

We have tested the performance tunability of the THCD parallel implementations of the MRMD, F-ReaxFF, and EDC–DFT algorithms. For example, we typically observe a ~10% performance improvement from the data re-ordering and computation re-ordering based on the Morton curve for MRMD.

Fig. 4c shows the effect of the linked-list cell size (in units of the average volume per atom, Ω/N) on the CPU time of the MRMD algorithm for a 331,776-atom silica material on a 1.4 GHz Pentium III processor. The CPU time takes its minimal value at a cell size of 2.1, which can be understood as a trade-off between the increasing number of floating-point operations (Fig. 4a) and the decreasing number of L2 cache misses (Fig. 4b) as a function of the cell size. Here, we have used the Performance Application Programming Interface (PAPI) to measure the number of cache misses (see http://icl.cs.utk.edu/papi).

Fig. 4. (a) The number of floating-point instructions, (b) the number of L2 cache misses, and (c) the CPU time per MD step, as a function of the linked-list cell size, for the MRMD algorithm used to study a 331,776-atom silica material on a 1.4 GHz Pentium III processor.

There is also a trade-off between spatial-decomposition/message-passing (MPI) and thread (OpenMP) parallelisms [40–42] in our hybrid MPI + OpenMP programs. While spatial decomposition involves extra computation on cached cells from neighbor subsystems, its disjoint memory subspaces are free from shared-memory protocol overhead. A marked contrast in this trade-off is observed between the MRMD and F-ReaxFF algorithms. The MRMD is characterized by a low inter-node caching overhead in spatial decomposition and a small number of floating-point operations per memory access, since it mostly consists of look-ups of pre-computed interatomic force tables. In contrast, F-ReaxFF has a large inter-node caching overhead for 6-tuple information and a large computation/memory-access ratio.

Table 1 shows the execution time of the MRMD algorithm for an 8,232,000-atom silica material and that of the F-ReaxFF algorithm for a 290,304-atom RDX crystal on P = 8 processors in an 8-way 1.5 GHz Power4 node. (The test was performed on the Iceberg Power4 system at the Arctic Region Supercomputing Center.) We compare different combinations of the number of OpenMP threads per MPI process, n_td, and the number of MPI processes, n_p, while keeping P = n_td × n_p constant. The optimal combination (n_td, n_p) with the minimum execution time is (1,8) for the MRMD and (4,2) for the F-ReaxFF.

Table 1
Performance of hybrid MPI + OpenMP programs

  Number of OpenMP   Number of MPI    Execution time/MD time step (s)
  threads, n_td      processes, n_p   MRMD        F-ReaxFF
  1                  8                4.19        62.5
  2                  4                5.75        58.9
  4                  2                8.60        54.9
  8                  1                12.5        120

Execution time per MD time step on P = n_td × n_p = 8 processors in an 8-way 1.5 GHz Power4 node, with different combinations of the number of OpenMP threads per MPI process, n_td, and the number of MPI processes, n_p, for: (1) the MRMD algorithm for an 8,232,000-atom silica system; and (2) the F-ReaxFF algorithm for a 290,304-atom RDX system. The minimum execution time for each algorithm (4.19 s for MRMD, 54.9 s for F-ReaxFF) is shown in boldface in the original.
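The cell-size tunability discussed in Section 4.1 amounts in practice to a small parameter scan: time an MD step for several candidate cell sizes and keep the fastest. The sketch below shows such a scan with std::chrono around a hypothetical runMDStep(cellSize) kernel; the candidate sizes are illustrative, not the values used in the paper.

// Parameter scan for the tunable linked-list cell size (hypothetical kernel).
#include <chrono>
#include <cstdio>
#include <vector>

void runMDStep(double cellSize) { (void)cellSize; /* placeholder force/update kernel */ }

int main() {
  std::vector<double> candidates = {1.0, 1.5, 2.1, 3.0, 4.2};  // in units of Omega/N, illustrative
  double best = candidates.front(), bestTime = 1e30;
  for (double s : candidates) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 10; ++i) runMDStep(s);     // average over a few steps
    auto t1 = std::chrono::steady_clock::now();
    double dt = std::chrono::duration<double>(t1 - t0).count() / 10.0;
    std::printf("cell size %.1f : %.6e s/step\n", s, dt);
    if (dt < bestTime) { bestTime = dt; best = s; }
  }
  std::printf("selected cell size: %.1f\n", best);
  return 0;
}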
4.2. Scalability

Scalability tests of the two new parallel MD algorithms, F-ReaxFF and EDC–DFT, as well as MRMD, on which they are based, have been performed on the 10,240-processor Columbia supercomputer at the NASA Ames Research Center. The SGI Altix 3000 system uses the NUMAflex global shared-memory architecture, which packages processors, memory, I/O, interconnect, graphics, and storage into modular components called bricks. The computational building block of Altix is the C-Brick, which consists of four Intel Itanium2 processors (in two nodes), local memory, and a two-controller application-specific integrated circuit called the Scalable Hub (SHUB). Each SHUB interfaces to the two CPUs within one node, along with memory, I/O devices, and other SHUBs. The Altix cache-coherency protocol implemented in the SHUB integrates the snooping operations of the Itanium2 and the directory-based scheme used across the NUMAflex interconnection fabric. A load/store cache miss causes the data to be communicated via the SHUB at cache-line granularity and automatically replicated in the local cache.

The 64-bit Itanium2 architecture operates at 1.5 GHz and is capable of issuing two multiply–add operations per cycle for a peak performance of 6 Gflops. The memory hierarchy consists of 128 floating-point registers and three on-chip data caches (32 KB L1, 256 KB L2, and 6 MB L3). The Itanium2 cannot store floating-point data in L1, making register loads and spills a potential source of bottlenecks; however, a relatively large register set helps mitigate this issue. The processor implements the Explicitly Parallel Instruction Computing (EPIC) technology, where instructions are organized into 128-bit VLIW bundles. The Altix platform uses the NUMAlink3 interconnect, a high-performance custom network in a fat-tree topology, in which the bisection bandwidth scales linearly with the number of processors. Columbia runs 64-bit Linux version 2.4.21. Our experiments use a 6.4 TB parallel XFS file system with a 35-fiber optical channel connection to the CPUs.

Columbia is configured as a cluster of 20 Altix boxes, each with 512 processors and 1 TB of global shared-access memory. Of these 20 boxes, 12 are model 3700 and the remaining eight are BX2—a double-density version of the 3700. Four of the BX2 boxes are linked with NUMAlink4 technology to allow the global shared-memory constructs to significantly reduce inter-processor communication latency. This 2048-processor subsystem within Columbia provides a 13 Tflops peak capability platform, and was the basis of the computations reported here.
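As a quick arithmetic check on the quoted per-processor peak, the snippet below reproduces 6 Gflop/s (1.5 GHz × 2 multiply–add units × 2 flops per fused operation) and derives an aggregate figure for the 1920-processor benchmark partition; the aggregate number is a derived estimate, not a value quoted in the paper.

// Peak-rate arithmetic for the Itanium2 processors used in the benchmarks.
#include <cstdio>

int main() {
  const double clock_ghz = 1.5;
  const double madd_per_cycle = 2.0;   // two multiply-add operations per cycle
  const double flops_per_madd = 2.0;   // one multiply + one add
  double peak_per_cpu = clock_ghz * madd_per_cycle * flops_per_madd;  // Gflop/s
  std::printf("peak per processor: %.1f Gflop/s\n", peak_per_cpu);
  std::printf("derived peak for 1920 processors: %.2f Tflop/s\n",
              peak_per_cpu * 1920.0 / 1000.0);
  return 0;
}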
Fig. 5 shows the execution time of the F-ReaxFF MD algorithm for the RDX material as a function of the number of processors, P. In this and the following figures, we set n_td to 1. We scale the problem size linearly with the number of processors, so that the number of atoms is N = 36,288P (P = 1,...,1920). The computation time includes three conjugate-gradient (CG) iterations to solve the electronegativity equalization problem for determining the atomic charges at each MD time step. The execution time increases only slightly as a function of P, which signifies an excellent parallel efficiency. We define the speed of an MD program as the product of the total number of atoms and the number of time steps executed per second. The speedup is the ratio between the speed on P processors and that on one processor, and the parallel efficiency is the speedup divided by P. On 1920 processors, the isogranular parallel efficiency of the F-ReaxFF algorithm is 0.953. A better measure of the inter-box scaling efficiency based on NUMAlink4 is the speedup from 480 processors in one box to 1920 processors in four boxes, divided by the number of boxes; on 1920 processors, the measured inter-box scaling efficiency is 0.995. The algorithm also involves very little communication time (see Fig. 5).

Fig. 5. Total execution (circles) and communication (squares) times per MD time step as a function of the number of processors for the F-ReaxFF MD algorithm with scaled workloads—36,288P-atom RDX systems on P processors (P = 1,...,1920) of Columbia.

Fig. 6 shows the performance of the EDC–DFT-based MD algorithm with scaled workloads—720P-atom alumina systems on P processors (P = 1,...,1920). In the EDC–DFT calculations, each domain of size 6.66 × 5.76 × 6.06 Å^3 contains 40 electronic wave functions, where each wave function is represented on 28^3 = 21,952 grid points. The execution time includes three self-consistent (SC) iterations to determine the electronic wave functions and the Kohn–Sham potential, with three CG iterations per SC cycle to refine each wave function iteratively. The largest calculation on 1920 processors involves 1,382,400 atoms and 5,529,600 electronic wave functions on 121,385,779,200 grid points, for which the isogranular parallel efficiency is 0.907. The inter-box scaling efficiency between 480 and 1920 processors is 0.966.

Fig. 6. Total execution (circles) and communication (squares) times per MD time step as a function of the number of processors for the EDC–DFT MD algorithm with scaled workloads—720P-atom alumina systems on P processors (P = 1,...,1920) of Columbia.
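The speed, speedup, and parallel-efficiency definitions above reduce to simple arithmetic for a scaled-workload (isogranular) run, as sketched below. The per-step timings in the example are hypothetical placeholders, not measurements from the paper.

// Speed / speedup / efficiency arithmetic for an isogranular (scaled-workload) run.
#include <cstdio>

double speed(double atoms, double steps, double seconds) {
  return atoms * steps / seconds;   // atoms * steps executed per second
}

int main() {
  const double atomsPerProc = 36288.0, steps = 1.0, P = 1920.0;
  double t1 = 50.0;     // hypothetical time per MD step on 1 processor (s)
  double tP = 54.0;     // hypothetical time per MD step on P processors (s)
  double s1 = speed(atomsPerProc, steps, t1);
  double sP = speed(atomsPerProc * P, steps, tP);
  double speedup = sP / s1;
  double efficiency = speedup / P;
  std::printf("speedup = %.1f, isogranular parallel efficiency = %.3f\n",
              speedup, efficiency);
  return 0;
}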
Major design parameters for reactive and nonreactive MD simulations of materials include the number of atoms in the simulated material and the method used to compute interatomic forces (classical in MRMD, semi-empirical in F-ReaxFF MD, or quantum mechanical in EDC–DFT MD). Fig. 7 shows a design-space diagram for classical and quantum-mechanical MD simulations on 1920 Itanium2 processors of Columbia. The largest benchmark tests in this study include 18,925,056,000-atom MRMD, 557,383,680-atom F-ReaxFF, and 1,382,400-atom (121,385,779,200 electronic degrees-of-freedom) EDC–DFT calculations. The figure demonstrates perfect linear scaling for all three algorithms, with prefactors spanning five orders of magnitude. The only exception is the F-ReaxFF algorithm below 100 million atoms, where the execution time scales even sub-linearly.

Fig. 7. Design-space diagram for reactive and nonreactive MD simulations on 1920 Itanium2 processors of Columbia. The figure shows the total execution time per MD step as a function of the number of atoms for three linear-scaling algorithms: quantum-mechanical MD based on the embedded divide-and-conquer density functional theory (EDC–DFT, circles); fast reactive force-field MD (F-ReaxFF, squares); and space–time multiresolution MD (MRMD, triangles). Lines show ideal O(N) scaling.

5. Applications

The scalable parallel MD algorithms, F-ReaxFF and EDC–DFT, have been applied to the study of a number of chemically reactive material processes, and this section briefly describes some of the applications.

5.1. Shock-induced detonation of energetic materials

We have performed F-ReaxFF MD simulations to study the shock-initiated detonation of an RDX (1,3,5-trinitro-1,3,5-triazine, C3N6O6H6) matrix embedded with aluminum nanoparticles (Fig. 1a). Aluminum powders are widely used as propellants, because their combustion products, such as Al2O3, are accompanied by a large amount of heat release. Burn rates of propellants can be accelerated by reducing the size of the Al particles, thereby increasing the surface-to-volume ratio and the rate of chemical reactions. A major technical difficulty for such small reactant particles is the dead weight of the oxide layers: the thickness of the oxidized layer in an Al particle is known to be a few nanometers regardless of the particle size, so the ratio of the oxidized layer, which is not effective as a propellant, to the reactive portion increases for smaller Al particles. This dead-weight problem in nanoscale reactant particles may be overcome by encapsulating the particles within complementary reactive materials such as RDX.

A 1.01 million-atom F-ReaxFF MD simulation has been performed on 256 Xeon processors to study the shock-initiated detonation of an RDX crystal/oxidized Al nanoparticle composite. The simulations show a sudden increase in the number of molecular products such as HONO, N2, and OH above a shock velocity of ~9 km/s, which is consistent with an experimental detonation velocity [44].

Fig. 8. F-ReaxFF MD and experimental data on the planar shock velocity in RDX as a function of the particle velocity. An experimental detonation velocity from Ref. [44] is also shown.
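Product species such as HONO or N2 are commonly identified in reactive MD post-processing by grouping atoms into connected components of the instantaneous bond network (for example, pairs whose bond order exceeds a cutoff). The union-find sketch below shows that grouping step on a hypothetical bond list; it is a generic analysis idiom, not the specific fragment-analysis procedure used in the paper.

// Group atoms into molecular fragments as connected components of a bond graph.
#include <cstdio>
#include <numeric>
#include <utility>
#include <vector>

struct UnionFind {
  std::vector<int> parent;
  explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
  int find(int i) { return parent[i] == i ? i : parent[i] = find(parent[i]); }
  void unite(int a, int b) { parent[find(a)] = find(b); }
};

int main() {
  const int natoms = 7;
  // Hypothetical bonded pairs (bond order above cutoff): two separate fragments.
  std::vector<std::pair<int, int>> bonds = {{0, 1}, {1, 2}, {3, 4}, {4, 5}, {5, 6}};
  UnionFind uf(natoms);
  for (const auto& b : bonds) uf.unite(b.first, b.second);
  std::vector<int> count(natoms, 0);
  for (int i = 0; i < natoms; ++i) ++count[uf.find(i)];
  for (int i = 0; i < natoms; ++i)
    if (count[i] > 0) std::printf("fragment rooted at atom %d: %d atoms\n", i, count[i]);
  return 0;
}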
5.2. Stress-induced bond breaking in ceramics in corrosive environments

We have also started EDC–DFT MD simulations involving 10,368 atoms on 240 dual-processor/dual-core AMD Opteron nodes (960 cores in total), to study the effect of applied stress on bond breaking at an α-Al2O3 (0001) surface immersed in water. The alumina substrate is 57.0 × 67.2 × 12.2 Å^3 in the [2-1-10], [10-10] and [0001] directions, covered with a 13 Å layer of adsorbed water. The performance and lifetime of materials widely used in industrial applications are often severely limited by corrosion of these systems in environments containing oxygen and water. Most critical here is the premature and catastrophic failure of materials resulting from chemically influenced corrosion. The basic requirements for the operation of structural systems exposed to corroding conditions under stress loads are safety and reliability. Such safe and reliable operation is endangered by the uncertainties in stress corrosion cracking (SCC) [45,46]. Preventing SCC, and predicting the lifetime beyond which SCC may cause failure, requires that we understand the atomistic mechanisms underlying SCC, that is, the conditions influencing the initiation of SCC and its dynamics and growth rates. Fig. 9a shows a close-up of the simulation at zero stress, which shows the dissociation of water molecules, consistent with the mechanism found by Hass et al. [47]. Under a uniaxial strain of 0.15 in the [10-10] direction, the simulation exhibits significant relaxation of the alumina surface.

Fig. 9. (a) Close-up of an EDC–DFT MD simulation of an α-Al2O3 (0001) surface immersed in water. White, red, and green spheres represent hydrogen, oxygen, and aluminum atoms, respectively. (b) Indentation of amorphous silica covered with an adsorbed water layer. White and yellow spheres are hydrogen and silicon atoms, whereas blue and red spheres are oxygen atoms originating from the water and the silica substrate, respectively. The indenter is shown in green.

We are also studying the effect of adsorbed layers of water and hydrazine (N2H4) on the indentation behavior of amorphous silica using F-ReaxFF MD simulations (Fig. 9b). Indentation is a unique local probe to measure

Fig. 10. A hierarchy of MD simulation methods. (Top) Reactive MD methods and their largest implementations include DFT-based MD of metals involving 1000 atoms (or N_el = 10,000 electrons) [50], a fragment molecular orbital (FMO) method calculation of proteins involving 20,000 atoms [51], O(N) DFT (ONDFT, the present paper), and reactive force field (ReaxFF, the present paper), whereas the largest nonreactive MD simulations based on effective force fields (EFF) include a 160 billion-atom MD of metals [52]. (Bottom) The spatiotemporal scale, NT, of MD simulations. Our large MD simulations of ceramics [48,53,54] had NT = 0.03–0.04 atom s, whereas one of the longest biomolecular simulations, on the HP-36 peptide, had NT = 0.0096 [55].
