FPGA ACCELERATION OF MOLECULAR DYNAMICS SIMULATIONS

by

YONGFENG GU
B.S., Fudan University, 2000
M.S., Fudan University, 2003

BOSTON UNIVERSITY
COLLEGE OF ENGINEERING

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

2008

Approved by

First Reader: Martin Herbordt, Ph.D., Professor of Electrical and Computer Engineering
Second Reader: Roscoe Giles, Ph.D., Professor of Electrical and Computer Engineering
Third Reader: Wei Qin, Ph.D., Professor of Electrical and Computer Engineering
Fourth Reader: Sandor Vajda, Ph.D., Professor of Biomedical Engineering

Major Professor: Martin Herbordt, Ph.D., Professor of Electrical and Computer Engineering

ABSTRACT

While molecular dynamics (MD) simulations are a fundamental method for understanding chemical and biological systems, their computational cost is extremely high: simulating macromolecules requires thousands of node hours, and cell-level systems remain altogether out of reach. We address this issue by using an emerging mode of high performance computing based on configurable logic in the form of Field Programmable Gate Arrays (FPGAs). The problem is that, while FPGAs have often delivered 100-fold speed-ups per node over microprocessor-based systems, the applications have generally been limited to those with small regular kernels operating on low-precision integer data types. MD possesses neither. We address this problem by creating an explicitly designed FPGA coprocessor that can be integrated into generic, commercially available systems.

MD is an iterative technique: the forces on each particle are computed, then applied using the equations of motion. We use standard partitioning, computing bonded forces, motion updates, and bookkeeping on the host, while computing the remaining forces (which dominate) on the FPGA accelerator. For the short-range forces we combine the following: cell lists, systematically determined interpolation and precision, handling of exclusion, and support for models with large numbers of particles. This has required new microarchitectures for the cell list processor and off-chip memory controller, and extensive experimentation to explore the design space to optimize precision, interpolation order, interpolation mode, table sizes, and simulation quality. To perform efficient and accurate numerical computation on the FPGA, we created a novel arithmetic mode that is tuned for computing high order polynomial interpolation. For the long-range forces we use the multigrid method: we show that this is an excellent match to FPGAs, with the primary operations having a favorable systolic structure and taking advantage of the large number of independently addressable RAMs.

The significance of this work lies at several levels: the 5× to 10× acceleration of MD production code while retaining simulation quality; the tuning of algorithms for the FPGA; the system integration; the new arithmetic mode; the numerous novel microarchitectures; and the methods for optimizing MD implementations on FPGAs.

Contents

1 Introduction
  1.1 The Problem
  1.2 Molecular Dynamics Computations
  1.3 High Performance Computing With Accelerators
  1.4 High End FPGAs as HPC Accelerators
    1.4.1 Hardware
    1.4.2 Programmability
    1.4.3 High Performance FPGA-Based Computing
    1.4.4 High Performance Reconfigurable Computing Platforms
  1.5 HPRC for MD
  1.6 Summary of Contributions
    1.6.1 Overview
    1.6.2 Acceleration of MD
    1.6.3 General Computation Models and Designs
  1.7 Organization of the Rest of the Thesis
2 Molecular Dynamics
  2.1 Overview
  2.2 Basic Methods
  2.3 Fast MD Algorithms
    2.3.1 Non-bonded Force Evaluation
    2.3.2 Long Time Step Integrator
  2.4 MD Software Packages
    2.4.1 NAMD2
    2.4.2 ProtoMol
    2.4.3 CHARMM and AMBER
    2.4.4 GROMACS
  2.5 Special Purpose Machines
    2.5.1 MD-GRAPE
    2.5.2 MD Engine and MODEL
3 FPGA Acceleration of MD
  3.1 FPGA Overview
  3.2 FPGA Design Flow
  3.3 High Performance Reconfigurable Computing
  3.4 Computation Model
  3.5 Reconfigurable Computer Systems
    3.5.1 SGI RASC
    3.5.2 Cray XD1 and XT4
    3.5.3 SRC MAP Station
    3.5.4 XtremeData XD1000
    3.5.5 Annapolis Microsystems Plug-In Boards
    3.5.6 Summary of Reconfigurable System Products
  3.6 MD Related Work
4 Algorithm Design Part 1: Short Range Forces
  4.1 Numerical Computation of Complex Expressions
    4.1.1 General Considerations
    4.1.2 Interpolation of r^{-x}
    4.1.3 Computing the Coefficients
    4.1.4 Comparing the Interpolation Methods
    4.1.5 Algorithm to Compute Interpolation Coefficients with Orthogonal Polynomials
  4.2 Semi Floating Point Numbering
  4.3 Simulation Quality - Precision vs. Accuracy
5 Algorithm Design Part 2: Long Range Forces
  5.1 Multigrid Method
  5.2 Multigrid Method for the Coulomb Force Computation
  5.3 Mapping Multigrid to FPGAs
    5.3.1 Overview
    5.3.2 Particle-Grid Converter
    5.3.3 Grid-Grid Convolver
    5.3.4 Interleaved Memory
6 System Design
  6.1 System Level Design and Operation
    6.1.1 Basic Issues
    6.1.2 Basic Operation
    6.1.3 Use of Reconfiguration
    6.1.4 Integration Into Production MD Systems
  6.2 Short Range Non-bonded Force Coprocessor
    6.2.1 Short Range Non-bonded Force Coprocessor Architecture
    6.2.2 Non-bonded Force Exclusion
    6.2.3 Short Range Non-bonded Force Pipeline
    6.2.4 Polynomial Interpolation Pipeline with Semi-FP
  6.3 Multigrid Coprocessor for Coulomb Force
    6.3.1 Multigrid Coprocessor Architecture
    6.3.2 Implementation Consideration
  6.4 Supporting Large Simulations with Explicitly Managed Cache
    6.4.1 Off-chip Memory Interface and Constraints
    6.4.2 Coprocessor and Cache Interface
7 Validation and Performance
  7.1 Experiment Platforms
  7.2 Simulation Quality Experiments
  7.3 Performance Experiments
  7.4 Detailed Analysis of Multigrid Coprocessor
8 Summary and Future Directions
  8.1 Summary
  8.2 Future Directions
    8.2.1 Node Level Optimization
    8.2.2 System Level Parallelization
References

List of Tables

4.1 Trade-off between Interval Size and Interpolation Order
4.2 Relative Root Mean Square Error of r^{-7} with Orthogonal Polynomial Interpolation
4.3 Relative Root Mean Square Error of r^{-7} with Taylor Polynomial Interpolation
4.4 Relative Root Mean Square Error of r^{-7} with Hermite Polynomial Interpolation
4.5 Relative Root Mean Square Error of r^{-7} with Linear Polynomial Interpolation
4.6 Resource Usage of Floating Point and LNS
4.7 Resource Usage of Different Components
4.8 Latency of Various Components
7.1 Profile of the 77K particle model simulation
7.2 Performance of Various Configurations
7.3 Clock Period of Various Configurations
7.4 Absolute Force Error
7.5 Average Force Error
7.6 Maximum Force Error
7.7 Potential Energy Error
7.8 Multigrid Coprocessor Profile
7.9 Detail Characteristic of Multigrid Computation