Table Of Content

Open-architecture Implementation of Fragment Molecular Orbital Method for Peta-scale Computing Toshiya Takami∗, Jun Maki∗, Jun-ichi Ooba∗, Yuichi Inadomi†, Hiroaki Honda†, Taizo Kobayashi∗, Rie Nogita∗, and Mutsumi Aoyagi∗† ∗Computing and Communications Center, Kyushu University, 6–10–1 Hakozaki, Higashi-ku, Fukuoka 812-8581, Japan 7 †PSI Project Laboratory, Kyushu University, 3–8–33–710 Momochihama, Sawara-ku, Fukuoka 814-0001, Japan 0 0 2 Abstract—We present our perspective and goals on high- n performance computing for nanoscience in accordance with the a J global trend toward “peta-scale computing.” After reviewing our results obtained through the grid-enabled version of the 1 fragment molecular orbital method (FMO) on the grid testbed 1 by the Japanese Grid Project, National Research Grid Initiative (NAREGI), we show that FMOis one of the best candidates for ] C peta-scale applications by predicting its effective performance in peta-scale computers. Finally, we introduce our new project D constructing a peta-scale application in an open-architecture . implementation of FMO in order to realize both goals of high- s performance in peta-scale computers and extendibility to multi- c [ physics simulations. 1 v I. INTRODUCTION 5 On account of the recent developments of the grid com- 7 0 puting environment, we can use many computer resources Fig. 1. In FMO, the target molecule is divided into fragments for which ab initio calculation is performed. Usually, carbon–carbon single bonds are 1 more than a thousand CPUs. However,those large distributed chosenasaboundarybetweenfragments. 0 resources are accessible only when we pass through some 7 gateway to the grid system after a tedious procedure for 0 grid-enabling of application programs. Thus, for scientists in / s nanosciencetousethegridsystemasoneofthedailytools,it shown in section III, and the introduction of the actual im- c plementationofFMOinordertorealizepeta-scalecomputing : is importantto make their applicationsgrid-awarein advance. v is given in section IV. Finally in section V, we discuss what On the other hand, the development on high performance i is necessary in the actual peta-scale computing for scientific X computing (HPC) environments shows no sign of slowing applications. r down, and, within several years, we might reach the scale of a peta (∼1015) in computer resources for scientific computing. II. GRID-ENABLED FMO It is expected that the “peta-scale computer” exhibits more In this section, we review the results on a grid-enabled than a peta flops performance in floating-point calculations, versionofFMO,andevaluateitseffectivityinthegridtestbed its available memory is more than a peta byte in total, or it of NAREGI [3]. Our main contribution to the project is to has external storages amount to more than a peta byte. Thus, develop large-scale grid-enabled applications in nanoscience. the global trend in HPC is now to study how to realize the One of these applications is the Grid-FMO which is based peta-scale computing [1]. on the famous ab initio molecular orbital (MO) package The purpose of this paper is twofold: one is to present program, GAMESS [6], and is constructed by dividing the our computational results in nanoscience supported by the package into several independent modules so that they are Japanese Grid project, National Research Grid Initiative executed on a grid environment. The algorithm of FMO and (NAREGI) [2], and the other is to show a perspective about the implementation of Grid-FMO is briefly reviewed in the the HPC applications for peta-scale computing. This paper is following subsections. coordinatedas follows. Those computationalresults using the A. Algorithm of FMO fragmentmolecular orbital method (FMO) on grid computing environments are shown in section II. The performance pre- TheFMOmethoddevelopedbyDr.Kitauraandco-workers diction of the FMO calculation in a peta-scale computer is [4]canexecuteallelectroncalculationinlargemoleculeswith Fig.3. FlowofLC-FMOrepresented inNAREGIWorkflowTool fragment calculation (monomer), run.dim for the fragment pair calculation (dimer), and run.tot for the total energy calculation,andare invokedfroma scriptprogramwhichalso managesthefile transferoverdistributedcomputers.Thetotal flow of the LC-FMO is shown in fig. 2. Another benefit of the loosely-coupled form is the extendibility of its functionality. In fact, after the first release Fig.2. Flow ofcalculation ofthegrid-enabled FMO of the LC-FMO, we have developed two other modules which realize a linkage to the initial density database and a visualization feature of the total molecular orbitals. Since morethan10thousandsatoms.Thebriefoverviewoftheflow the top-level program to invoke modules is lightweight and of the fragment MO method up to dimer correction is the can be written in script languages, further modification of the following: (1) the molecule to be calculated is divided into total flow is easily performed. fragments,(2)abinitioMOcalculation[5]isexecutedforeach C. Execution on NAREGI Grid fragment(monomer)undertheelectro-staticpotentialmadeby In order to execute LC-FMO on the NAREGI grid, we put all the other fragments, and this calculation is repeated until theflowoftheprogramintoNAREGIWorkflowTool[3].The the potential is self-consistently converged, (3) ab initio MO graphicalrepresentationof the LC-FMO programis shown in calculation is executed for pairs of fragments (dimer), and fig. 3, where modules and input/output files are represented (4) the totalenergyis obtainedby summingup allthe results. in icons connected each other. This procedure is simple and ThisalgorithmofFMOisimplementedinseveralfamousMO straightforward because we have already reconstructed the packages such as GAMESS [6], ABINIT-MP [7], etc. programinto a suitable form for distributed computing.Thus, Although the most time-consuming part of this calculation the importantis whether the basic structure of the program is isabinitioMOcalculationsforeachfragmentandeachpairof grid-aware or not. fragments, these calculations can be executed independently. Once a program is grid-enabled, we can execute it by Thus, FMO is easily executed in parallel computers with the use of the large-scale computing resources over one high efficiency. The FMO routines included in those famous thousand CPUs. From the programmer’s points of view, the packages are already parallelized for cluster machines, and most preferable thing is that we are relieved of arranging are coordinated so that they exhibit efficient performance. the resources to execute the program in high efficiency. In However, if the program is used in distributed computing en- NAREGI testbed system [3], NAREGI Super Scheduler can vironments,we should consider robustnessand controllability managetheresourcerearrangementwith thehelpofNAREGI in addition to the efficiency. Thus, grid-enablingis necessary. Information Service. Among many ways of grid-enabling, we choose a strategy to reconfigure the program into a loosely-coupled form in order D. Efficiency of LC-FMO to satisfy such properties. Robustness and efficiency are conflicting each other in general. Since our reconstruction of FMO to increase the B. Loosely-coupled FMO controllability might hurt the efficient execution of the pro- We have developed a grid-enabled version of FMO, called gram,itisnecessarytoevaluatethe efficiencyofLC-FMO by a Loosely-coupled (LC) FMO program [8], as part of the comparing with the original FMO program in GAMESS. NAREGI project. At first, we show the procedure to change In table I, we show elapsed times for chicken egg white the GAMESS-FMO programs into a “loosely-coupled form,” cystatin (1CEW, 1701 atoms, 106 fragments, shown in fig. 4) where the original FMO in the GAMESS package is divided by the LC-FMO and the originalGAMESS-FMO codes. This into several “single task” modules which are connected each is obtained on the NAREGI testbed system which consists of other through input/output file transfers. Those modules con- 16 CPUs of Xeon 3GHz. Generally speaking, an increase of sists of run.ini for the initialization, run.mon for the the total elapsed time is inevitable when we reconstructsome TABLEI TABLEII THEELAPSEDTIMEFORCHICKENEGGWHITECYSTATIN. TIMINGDATAOBTAINEDINASINGLENODEOFIBMP51.9GHZ. Input 1cew 1ao6half 1ao6 1ao6 dim LC-FMO GAMESS-FMO No.Atom 1701 9121 18242 36484 Initial Guess 37s 4s Nf 106 561 1122 2244 Monomer 1h11m 59m Nd(Nf) 690 4192 8416 16832 Dimer 2h16m 2h11m Nes(Nf) 4875 152888 620465 2499814 Energy 4s <1s Im 17 17 17 17 Total 3h28m 3h10m monomer 1356 13364 40005 140810 (Average) (0.752) (1.40) (2.10) (3.69) Time SCF-dimer 2037 20689 70465 186901 (sec) (Average) (2.95) (4.94) (8.37) (11.1) ES-dimer 398 13772 55955 208627 (Average) (0.0816) (0.0901) (0.0902) (0.0835) ElapsedTime(sec) 3799 47886 166601 536898 TABLEIII TIMINGDATAOBTAINEDIN16CPUSOFXEON3GHZ. Input 1cew 1ao6half 1ao6 No.Atoms 1701 9121 18242 Nf 106 561 1122 Nd(Nf) 690 4192 8416 Nes(Nf) 4875 152888 620465 Im 17 17 17 monomer 1030.4 10808.9 33989.2 (Average) (9.72) (19.3) (30.3) Fig.4. Themolecularstructureofchickeneggwhitecystatin(1CEW,1701 Time SCF-dimer 1677.4 17517.6 50819.2 atoms,106fragments). (sec) (Average) (41.3) (71.0) (102.7) ES-dimer 293.1 9594.8 39133.4 (Average) (1.02) (1.07) (1.07) ElapsedTime(sec) 3003.5 38065.9 126330.9 application into a loosely-coupled form, which is considered as a cost for grid-enabling.As shown in table I, however, the increase of the time is relatively small. Thus, it is evaluated B. Performance Prediction of FMO that the LC-FMO is effectively executed on grids. The procedure to predict the performance of FMO in the III. PERFORMANCE PREDICTION peta-scale computeris a somewhatphenomenologicalmethod In this section, we predict what performance would be based on actual measurements in current computer systems. expected if we executed our FMO program in a “peta-scale First, we assume an execution model of FMO which gives a computer.” Since the actual hardware is not available at theoretical timing function of the number of fragments N . f present,thepredictionisbasedonahypotheticalspecifications After fixing some parameters by several measurementsof the of the computer. program, we predict the elapsed time on the virtual computer A. Hypothetical Specification of the Peta-scale Computer with peta-scale specifications. It is said that, in a few years, such computer systems 1) Execution Model of GAMESS-FMO: Even if you could with peta-scale specifications will be designed [9]. Here, we analyze all the program codes of GAMESS, it is almost concentrate on the CPU resourcesof the peta-scale computer, impossible to determine the number of floating-point cal- andestimatehowmanyCPU resourcesarenecessaryto attain culations in a FMO routine precisely since there are many peta-scale performances. conditionalbrancheswhichdependonthe molecularstructure The current peak performance of a single-core CPU is givenasaninput.Ourstrategy,here,isapracticalapproachto the order of 10 giga flops. If a parallel computer with 10 obtainphenomenologicalvaluesoftheexecutiontimethrough peta-flops peak performance is configured by the CPU to experimental executions of FMO in the current machines. achieve 1 peta-flop effective performance, 1,000,000 CPU ThetotalexecutiontimeofFMOisdividedintothreeparts: cores are necessary in total. Since cluster system constructed amonomerpart,anSCF-dimerpart,andanES-dimerpart.The by1,000,000machinesconnectedeachotherisunimaginable, monomer and SCF-dimer parts perform SCF calculations for itisnecessarytorearrangethoseresourcesintomultiplelayers. eachfragmentsandeachpairoffragments,respectively,while We do not concern here how to realize computational nodes the ES-dimer part obtains dimer correction terms under an with multiple CPU cores. If we can configure a lowest layer electro-static(ES)approximationbetweenfragments.Usually, by a node with the performance of 100 ∼ 1,000 CPUs, the the ES approximation is applied to the fragment pair with totalcomputerwillbe aparallelsystem composedof1,000∼ fragments which are separated more than a certain threshold. 10,000 nodes. Westartfromanexpressionofthetotalamountofcomputation TABLEIV (a) PARAMETERVALUESTOREPRESENTTHEEXECUTIONMODELOFFMO 4.0 Parameter Value 3.5 ffm((00)) ffm((11)) 02..5893 00..00001349 ec) 23..50 d d s fe(s0) — 0.082 — me ( 2.0 Eibm Exeon 1.0 0.071 Ti 1.5 IBM p5 1.0 0.5 Xeon as a function of the number of fragments N , 0.0 f 0 500 1000 1500 2000 2500 Fragments Ftotal(Nf)=Fm(Nf)+Fd(Nf)+Fes(Nf), (1) (b) 12 where F (N ), F (N ), and F (N ) represent the number m f d f es f 10 of floating-point calculations for monomer, SCF-dimer, and ES-dimer parts, respectively. These are assumed as c) 8 e s Fm(Nf) = hfm(0)+fm(1)NfiNfIm (2) me ( 6 Ti 4 F (N ) = f(0)+f(1)N N (N ) (3) IBM p5 d f h d d fi d f 2 Xeon F (N ) = f(0)N (N ), (4) es f es es f 0 0 500 1000 1500 2000 2500 according to the algorithm of the FMO theory [4], where Fragments I is the number of monomer loops, N (N ) and N (N ) (c) m d f es f 0.1 represent the numbers of SCF-dimers and ES-dimers. SCF 0.09 calculationsformonomersand dimersin FMO dependon the 0.08 environmental potential which is made by all the fragments 0.07 other than the target monomer and dimer, while it can be ec)0.06 s ignoredforES-dimers.Thus,werepresenttheamountofeach e (0.05 m SCF calculation for a monomer and an SCF-dimer up to a Ti0.04 0.03 IBM p5 lineartermofN ,andthatofnon-SCFcalculationforan ES- f 0.02 dimer as a constant. The parameters, fm(0), fm(1), fd(0), fd(1), 0.01 Xeon f(0), should be determined from the actual timing data. If we 0 es define additional parameters, the number of parallel nodes K 0 500 1000 1500 2000 2500 Fragments and the effectivefloating-pointperformanceE (flops) of each node,we can comparethe actualtiming data with the amount Fig. 5. The fragment number dependency of the amount of computation of computation divided by KE. for(a) amonomer, (b) anSCF-dimer, and (c) anES-dimer areplotted with Table II and III show the timing data of each part in test theeffective performance ratios (E=1forIBMp5,E=0.071forXeon). executions of several inputs on the machines, 1 node of IBM Lines are functions (a)(fm(0)+fm(1)Nf)/KE,(b) (fd(0)+fd(1)Nf)/KE, p5 1.9GHz and a 16 CPU cluster of intel Xeon 3GHz. Since and(c)fe(s0)/KE obtained bytheleast-square method. the averaged size of fragments in these inputs are considered as almost the same, the difference of the elapsed time mainly 20000 depends on the number of the total fragments N . The least- f square fitting with the additional parameters representing the mer16000 effective performances of each computer, Eibm and Exeon, di determines all the parameter values shown in table IV after F-12000 C the normalization E = 1. In figures 5 (a), (b) and (c), we S plotthetimingdataaibnmdtheresultsofthefunctions,(a)(fm(0)+ m. of 8000 fm(1)Nf)/KE, (b) (fd(0)+fd(1)Nf)/KE, and (c) fe(s0)/KE. Nu4000 2) PerformanceinPeta-scaleComputer: Inordertopredict thetotalperformanceofFMO,itisconvenientifthefunctions 0 0 500 1000 1500 2000 2500 N (N ) and N (N ) are represented in simple functions of d f es f Fragments N . By the least-square fitting of the data in table II and III, f we obtain a function for the number of SCF-dimers Fig. 6. Thenumber ofSCF-dimers inthe test executions andthe function Nd(Nf)bytheleast-square method. N (N )=7.50 N . (5) d f f 30 25 ES dimer r) 20 u o h e ( 15 SCF dimer m Ti 10 5 Monomer 0 0 50000 100000 Fragments Fig.7. PredictedelapsedtimeoftheFMOcalculationbyGAMESS as a function of the number of fragments N . This is obtained on f Fig.8. Multi-physics/multi-scale simulator stackincluding OpenFMO. the assumption that 10,000 nodes are available, and each node is 5 times faster than a current machine. If we consider a constraint (N −1)N f f N (N )+N (N )= , (6) d f es f 2 the number of ES-dimers is represented in the form (N −1)N N (N )= f f −7.50 N . (7) es f f 2 If we substitute these functions and parameters into Eqs. (2), (3), and (4), the total computational amount of FMO is obtained as a function of N . As already seen in the f Fig. 9. Can multi-physics/multi-scale simulations reveal the true aspect of previoussection,theFMOalgorithmisappropriateforparallel thecomplexworldofmatter? executions when the number of fragments is large enough compared to the number of available nodes. Suppose that we have a peta-scale computer with K = 10000 and E = 5, i.e., the number of available nodes is 10,000 and each node One of the reason why the current program fails to run is has an effective speed 5 times faster than a node of IBM p5 memory consumption in each node, where the current FMO 1.9GHz which is used in this paper as a reference machine code in GAMESS tries to allocate two-dimensional arrays (E =1).Then,thepredictedelapsedtimeF (N )/KE ibm total f with respect to fragments, e.g., the distance between two is calculated as a quadratic function of N shown in fig. f fragments.Thisexceeds40gigabytesevenifwe considerthe 7. From the viewpoint of the elapsed time, we can perform symmetry. The limit of the number of fragmentis considered quantum calculations for molecules with more than 100,000 as the order of 10,000 in the current implementation. Thus, fragments if the peta-scale computer is realized. the reconstruction of the FMO code is necessary in order to According to the effective performance measurements on PCs, TheFMO calculationis executedin 0.5∼1.0gigaflops correspond to the peta-scale computing. per one CPU of Xeon or Pentium4, which means that our In this section, we introduceour new project to reconstruct referencemachine,anodeofIBMp51.9GHz,exhibitsalmost a FMO program from scratch. The name of this project, 10 giga flops for the program since it is E /E ≈ 14 OpenFMO [10],stands for the followingopenness:(1) names ibm xeon times faster than a Xeon. Then, the total performance of and argument lists of APIs constructing the FMO program FMO calculations for a peta computer with K = 10000 are opened publicly, (2) the program structure is also opened and E = 5 is considered as 0.5 peta flops. Thus, FMO is to combining other theories of physics and chemistry, i.e., consideredasapredominantcandidatewhichcanrecordpeta- multi-scale/multi-physics simulations, and (3) the skeleton scale performance. program is developed under some open-source licenses and its development process will also be opened to the public. In IV. OPENFMO PROJECT fig. 8, we show a stack structure of this program. Although In spite of the fact that the FMO is a suitable algorithm the main target is still a quantum chemical calculation of to achieve the peta-scale performance through large-scale molecules,itcanbecombinedwithothertheoriestoconstruct parallelexecutions,theactualGAMESS-FMOcodecannotbe multi-physics/multi-scalesimulations(fig.9)forcomplexphe- executedfor the molecule with more than 100,000fragments. nomena [11]. A. Open Architecture Implementation of FMO Generally speaking, it is very difficult for scientific applications to be executed in a peta-scale performance since a AsalreadyshowninsectionII-A,thefragmentMOmethod simple parallel scheme fails in case that the total number of consists of the standard ab initio MO calculations including CPUs is the order of 1,000,000. Thus, it will be necessary thegenerationofFockmatriceswithone/two-electronintegral that the application itself becomes multi-layered in order to calculations. Using this property, the usual FMO program is use the resources efficiently corresponding to the hardware divided into two layers, the skeleton program which control architecture of peta-scale computers. The FMO calculation the whole flow of the FMO algorithm, and the molecular which we have studied in this paper has a two-stage structure orbital (MO) APIs to provide the charge distribution of each in its original algorithm: ab initio calculations are performed fragmentthrough ab initio MO calculations. Since these spec for each fragment and fragment pair while the interactions of interfaces between the skeleton and the APIs are fixed and between these fragments are described by a classical electro- opened publicly, either of them can easily be substituted by static potentials.Thistheoreticalreconstructionoftheoriginal other programs. ab initio MO method into the layered form is considered as B. Open Interface to Multi-physics Simulations the key to the successful performance of the program. The multi-physics simulation is one of the predominant Scientific calculations with layered structures have been strategy to construct peta-performance applications for nano- carriedoutasamulti-physics/multi-scalesimulationinvarious scale materials. Our new implementationof FMO can also be fields. This type of calculations is important not only as a opened for such multi-physics simulations. Since the FMO techniqueofrealisticsimulationsbutalsoasanactualexample method is based on electro-static interaction between frag- for the peta-scale computing. However, we claim that the ments, we can extend each fragment to the general object deep understandingof the physical/chemicaltheories must be which can provide a static charge distribution. For examples, necessary for a proper construction of effective simulations we often use molecular mechanics representation of atomic which describes complex aspects of nature. Then, the the- clusters which should be given by quantum mechanical de- oretical and practical way of constructing such simulations scription,ortheenvironmentalchargedistributionsurrounding should be established before the time when the peta-scale a molecule which models a solvent can be included as a computers are available. We expect that our implementation fragment. Thus, the MO-APIs for a certain fragment can be of the simulator is one of the solutions for high-performance substituted to other programs based on the different approxi- scientific simulations in the next-generation. mation levels. ACKNOWLEDGMENT In addition to the description in molecular sciences, this ThisworkispartlysupportedbytheMinistryofEducation, is extended to larger-scale simulations through the “multi- Sports,Culture,ScienceandTechnology(MEXT)throughthe physics/multi-scale simulator” layer. This type of extension Science-grid NAREGI Program under the Development and is widely used for the large-scale computation representing Application of Advanced High-performance Supercomputer realisticmodelsofmoleculesincellsorotherlivingorganisms. Project. C. Open Source Development of the Programs REFERENCES The source code of the skeleton program of OpenFMO is publicly opened according to some open-source licenses. At [1] Therecentactivities forpeta-scale computing inJapanarereviewed in “The11th‘Science inJapan’Forum”(WashingtonDC,Jun16,2006), present, the OpenFMO project is managed by several core http://www.jspsusa.org/FORUM2006/Agenda06.html members including the authors of this paper. Once we show [2] NAREGIPoject, http://www.naregi.org/ the effectivenessof our approachto the openimplementation, [3] S. Matsuoka, S. Shimojo, M. Aoyagi, S. Sekiguchi, H. Usami, and K.Miura, “Japanese Computational Grid Research Project: NAREGI,” and the direction of the project is settled properly, we are inProc.IEEE,2005,vol.93,pp.522–533. willingtochangethemanagementoftheprojectintoso-called [4] K. Kitaura, E. Ikeo, T. Asada, T. Nakano, M. Uebayasi, “Fragment open-source-softwaredevelopments. Molecular OrbitalMethod:anApproximate Computational Methodfor LargeMolecules,” Chem.Phys.Lett.,1999,vol.313,pp.701–706. V. SUMMARY AND DISCUSSION [5] C. C. J.Roothaan, “New Developments in Molecular Orbital Theory,” Rev.Mod.Phys.,1951,vol.23,pp.69–89. Inthispaper,weshowedourresultsinnanoscienceexecuted [6] http://www.msg.ameslab.gov/GAMESS/GAMESS.html on the NAREGI grid system, and expressed our perspective [7] T.Nakano, T.Kaminuma, T.Sato, K.Fukuzawa, Y. Akiyama, M.Ue- bayasi,K.Kitaura,“Fragmentmolecularorbitalmethod:useofapproxi- toward the peta-scale computing. In section III, we showed mateelectrostaticpotential,”Chem.Phys.Lett.,2002,vol.351,pp.475. that FMO is one of the peta-scale applications. However, it [8] M.Aoyagi,“LooselyCoupledApproachonNano-ScienceSimulations,” has been also realized that the actual implementations of the an oral presentation given in “The 11th ‘Science in Japan’ Forum,” http://www.jspsusa.org/FORUM2006/aoyagi-slide.pdf FMOmethod,atpresent,doesnotcorrespondtotheexecution [9] RikenSupercomputer Project, http://www.nsc.riken.jp/ inpeta-scaleenvironments,i.e.,theydoesnotsolvemolecules [10] OpenFMOProject, http://www.openfmo.org/ with more than 10 thousand fragments even if peta-scale [11] T.Takami,J.Maki,J.Ooba,T.Kobayashi,R.Nogita,andM.Aoyagi, “Interaction and Localization of One-Electron Orbitals in an Organic computersare available. In order to improvethe situation, we Molecule: Fictitious ParameterAnalysisforMultiphysics Simulations,” started an open-source project for multi-physics simulations J.Phys.Soc.Jpn.76,013001(2007)[4pages];e-printnlin.CD/0611042. including new FMO codes from scratch.

Open-architecture Implementation of Fragment Molecular Orbital Method for Peta-scale Computing PDF

0.36 MB·English

by Toshiya Takami

#additional_collections #journals #arxiv

Checking for file health...

Save to my drive

Quick download

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Open-architecture Implementation of Fragment Molecular Orbital Method for Peta-scale Computing

See more

The list of books you might like

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.