Performance Benchmarking of Fast Multipole Methods

Thesis by Noha Ahmed Al-Harthi

In Partial Fulfillment of the Requirements for the Degree of Master of Science

King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia

June, 2013

The thesis of Noha Ahmed Al-Harthi is approved by the examination committee.

Committee Chairperson: David Keyes
Committee Member: Hakan Bagci
Committee Member: Timothy Ravasi

Copyright © 2013 Noha Ahmed Al-Harthi
All Rights Reserved

ABSTRACT

Performance Benchmarking of Fast Multipole Methods
Noha Ahmed Al-Harthi

The current trends in computer architecture are shifting towards smaller byte/flop ratios, while available parallelism is increasing at all levels of granularity: vector length, core count, and MPI process count. Intel's Xeon Phi coprocessor, NVIDIA's Kepler GPU, and IBM's BlueGene/Q all have a byte/flop ratio close to 0.2, which makes it very difficult for most algorithms to extract a high percentage of the theoretical peak flop/s from these architectures. Popular algorithms in scientific computing, such as the FFT, are continuously evolving to keep up with this trend in hardware. In the meantime, it is also necessary to invest in novel algorithms that are better suited to the computer architectures of the future.

The fast multipole method (FMM) was originally developed as a fast algorithm for approximating the N-body interactions that appear in astrophysics, molecular dynamics, and vortex-based fluid dynamics simulations. The FMM possesses a unique combination of being an efficient O(N) algorithm while having an operational intensity higher than that of a matrix-matrix multiplication. In fact, the FMM can reduce the required byte/flop ratio to around 0.01, which means that it will remain compute-bound until 2020 even if the current trend in microprocessors continues. Despite these advantages, there have not been any benchmarks of FMM codes on modern architectures such as Xeon Phi, Kepler, and BlueGene/Q.

This study aims to provide a comprehensive benchmark of a state-of-the-art FMM code, exaFMM, on the latest architectures, in the hope of providing a useful reference for deciding when the FMM will become useful as the computational engine in a given application code. It may also serve as a warning about certain problem-size domains where the FMM will exhibit insignificant performance improvements. Such issues depend strongly on the asymptotic constants rather than on the asymptotics themselves, and are therefore strongly implementation- and hardware-dependent. The primary objective of this study is to provide these constants on various computer architectures.
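As a rough, roofline-style reading of the argument above (an illustrative sketch using only the nominal figures already quoted, not measured values), a kernel remains compute-bound as long as its byte/flop requirement stays below the machine balance:

\[
\underbrace{0.01~\tfrac{\text{byte}}{\text{flop}}}_{\text{approximate FMM requirement}}
\;<\;
\underbrace{0.2~\tfrac{\text{byte}}{\text{flop}}}_{\text{machine balance of Xeon Phi, Kepler, BlueGene/Q}}
\]

leaving roughly a factor-of-20 margin before memory bandwidth, rather than arithmetic throughput, becomes the limiting resource.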
ACKNOWLEDGEMENTS

First and foremost, I would like to thank my supervisor and committee chair Dr. David Keyes for his continuous support, guidance, enthusiasm, and optimism. I consider myself exceedingly fortunate to work under his supervision. Furthermore, I would like to express the deepest appreciation to Dr. Rio Yokota for introducing me to the topic, as well as for the unconditional support and day-to-day guidance he gave me. Without his guidance and persistent help this thesis would not have been possible. I would also like to extend my gratitude to my thesis committee members Dr. Hakan Bagci and Dr. Timothy Ravasi for their advice, support, and trust. My special thanks go to my friends and colleagues at King Abdullah University of Science and Technology, who made my stay here the best experience of my life, and I know these are friendships that will last a lifetime.

Finally, I would like to express my gratitude to my father, Ahmed Al-Harthi, whose encouragement, motivation, and support made it possible for me to always succeed in both my personal and professional life. I would also like to dedicate this thesis to the soul of my mother, Salha Al-Harthi, who provided me with prayers and endless love throughout my life, and who I know would have been very proud of me. I would also like to thank my great brothers, Fahad and Sultan, and my wonderful sisters, Najat, Nada, and Sara, for their constant support. I also do not have words to express my appreciation for my grandfather Fahad, my grandmother Najat, my uncles Mohammad, Zuhair, Abdulrahman, and Khalid, and my aunts Ibtisam and Suad. They have been a source of inspiration for me.

TABLE OF CONTENTS

Examination Committee Approval
Copyright
Abstract
Acknowledgements
List of Abbreviations
List of Symbols
List of Figures
List of Tables

1 Introduction
  1.1 Overview
  1.2 Motivation
  1.3 Objective
  1.4 Major Findings
  1.5 Structure of the Thesis

2 Fast N-body Methods
  2.1 N-body Methods
  2.2 Direct Methods
    2.2.1 Direct All-pairs Summation
    2.2.2 Cut-off Methods
  2.3 Ewald Summation
  2.4 Particle-Mesh Methods
  2.5 Particle-Particle/Particle-Mesh
  2.6 Treecode and FMM

3 Exascale and HPC
  3.1 Supercomputers
  3.2 Exascale Challenges

4 Fast Multipole Method
  4.1 FMM Overview
  4.2 FMM Stages
  4.3 FMM Formulations
    4.3.1 Multipole Expansion
    4.3.2 Local Expansion
  4.4 FMM Implementations

5 Platforms and Architectures
  5.1 Hardware Challenges
  5.2 Platforms
    5.2.1 Intel(R) Xeon Phi
    5.2.2 NVIDIA Kepler GPU
    5.2.3 Fujitsu K Computer
    5.2.4 BlueGene/Q

6 FMM Benchmarks
  6.1 Comparison Between FMM Codes
  6.2 Deeper Analysis of exaFMM
  6.3 exaFMM on the K Computer

7 Concluding Remarks
  7.1 Remarks on the Reported Results
  7.2 Future Work

References

LIST OF ABBREVIATIONS

2-D         Two-Dimensional
3-D         Three-Dimensional
AVX         Advanced Vector Extensions
CPU         Central Processing Unit
DDR3        Double Data Rate Type Three
DTT         Dual Tree Traversal
EC          Equivalent Charges
ECC         Error-Correcting Code
ExaFlop/s   Quintillion Floating-Point Operations per Second
FFT         Fast Fourier Transform
FLOP        Floating-Point Operations
FMM         Fast Multipole Method
GemsFMM     GPU Implementation of the Treecode/Fast Multipole Method
GigaFlop/s  Billion Floating-Point Operations per Second
GPU         Graphics Processing Unit
HPC         High Performance Computing
KIFMM       Kernel-Independent Fast Multipole Method
L2L         Local-to-Local
L2P         Local-to-Particle
LE          Local Expansion
M2L         Multipole-to-Local
M2M         Multipole-to-Multipole
M2P         Multipole-to-Particle
ME          Multipole Expansion
P2M         Particle-to-Multipole
P2P         Particle-to-Particle
P3M         Particle-Particle/Particle-Mesh
PDE         Partial Differential Equation
PetaFlop/s  Quadrillion Floating-Point Operations per Second
PetFMM      Portable Extensible Toolkit for FMM
PETSc       Portable Extensible Toolkit for Scientific Computation
PFLOPS      Thousand Trillion Floating-Point Operations per Second
PM          Particle-Mesh
QPE         Quad-Processing Extension
SOC         System-on-Chip
SSE         Streaming SIMD Extensions
STT         Single Tree Traversal
TBB         Threading Building Blocks
TeraFlop/s  Trillion Floating-Point Operations per Second
TFLOPS      Trillion Floating-Point Operations per Second