Technische Universität München
Fakultät für Informatik
Informatik 5 – Lehrstuhl für Wissenschaftliches Rechnen (Prof. Bungartz)

Scalable scientific computing applications for GPU-accelerated heterogeneous systems

Christoph Karl Riesinger

Complete reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.).

Chair: Prof. Dr.-Ing. Jörg Ott
Examiners of the dissertation:
1. Prof. Dr. rer. nat. Hans-Joachim Bungartz
2. Prof. Dr. Sci. Takayuki Aoki, Tokyo Institute of Technology, Japan

The dissertation was submitted to the Technische Universität München on 16.05.2017 and accepted by the Fakultät für Informatik on 12.07.2017.

Abstract

In the last decade, graphics processing units (GPUs) have become a major factor in increasing performance in the area of high performance computing. This is reflected by the numerous examples of the fastest supercomputers in the world being accelerated by GPUs. GPUs are one representative of many-core chips, the major technology development to boost hardware performance in recent years. Many-core chips were one factor, besides others such as progress in modeling, algorithmics, and data structures, that allowed scientific computing to advance to its current state of the art.

To exploit the full computational power of GPUs, numerous challenges have to be tackled in the area of parallel programming: Recent developments in handling the core characteristics of GPUs, such as two additional levels of parallelism, the memory system, and offloading, shift the focus to the usage of multiple GPUs in parallel and/or combining them with the performance of CPUs (heterogeneous computing). As a consequence, hybrid parallel programming concepts (e.g., MPI, OpenMP) are required, and load balancing as well as communication hiding become even more relevant to achieve good scalability.

In this work, we present approaches to benefit from GPUs for three different applications, each covering different algorithmic characteristics: First, a pipelined approach is used to determine the eigenvalues of a symmetric matrix, not only enabling very high FLOPS rates but also allowing for the handling of even large systems on one single GPU. Second, the solution of random ordinary differential equations (RODEs) offers multiple levels of parallelism, which is predestined for systems with multiple GPUs and leads to the first implementation of an RODE solver able to deal with problems of reasonable size. Finally, it is shown that a pioneering hybrid implementation of the lattice Boltzmann method, making use of all available compute resources in the system and allowing the CPU to process regions of arbitrary volume, can attain good scalability.

Acknowledgements

Even if there is only one author name written on the front page of this thesis, there are numerous other persons who contributed to this document in one way or the other. So I am taking the chance to express my acknowledgements and thanks to these people.

First of all, I would like to mention my PhD supervisors Prof. Hans-Joachim Bungartz and Prof. Takayuki Aoki. They offered me the opportunity to start and successfully work on my PhD in very comfortable, pleasant, productive, and competent environments, especially during my research stay abroad in Tokyo where the first actual results could be achieved. Before achieving actual results, much groundwork had to be finished, not always done by myself.
Here, I want to thank Tobias Neckel and Florian Rupp for their preliminary studies in the field of random ordinary differential equations, which form the theoretical basis of part III of this document. Special acknowledgements go to Tobias, who did not just contribute in a technical way as the advisor of my thesis but also became a close friend. The same gratitude goes to Martin Schreiber and Arash Bakhtiari for their practical effort in the area of the lattice Boltzmann method, continued by me in part IV. Martin, I am not sure if you reduced or actually extended the time to finish my PhD; anyway, you definitely made this time much more valuable.

In addition, I would like to thank those people who gave me access to the computing resources essential for my research work. Robert Speck paved the way to utilize the infrastructure in Jülich, Frank Jenko enabled the access to the Max Planck resources in Garching, Maria Grazia Giuffreda arranged the usage of several clusters in Lugano, and, again, Prof. Takayuki Aoki is mentioned for his support in Tokyo. If there was any on-site technical problem, Roland Wittmann was the guy you could count on.

Furthermore, I have to thank Alfredo Parra Hinojosa, again Tobias Neckel, Philipp Neumann, and Benjamin Uekermann, who significantly enhanced the quality and the language of this thesis by proofreading and reviewing. Alfredo also has to be mentioned for his pragmatic and effective approach to executing the duties of a coordinator of the Computational Science and Engineering (CSE) program and, hence, was the perfect colleague a CSE secretary can rely on. In the same way, I thank my former CSE and office colleague Marion Weinzierl.

Besides colleagues and people who supported me in a technical way (and sometimes became very good friends), there is also my family I always could count on. I want to deeply thank my girlfriend Barbara and my parents Elisabeth and Karl for doing the “cover my back” stuff and for always giving me the feeling, no, the certainty that nothing can really go wrong.

Hence, every time a “we”, “our”, or “us” is mentioned on the following pages, all these just listed people are also meant in one way or the other.

Contents

I. Introduction
1. Opening
1.1. Motivation
1.2. Contribution
1.3. Outline
2. Architecture of GPUs
2.1. Hardware structure of GPUs
2.2. Programming & execution model
2.3. Scheduling & GPU indicators
2.4. Heterogeneous computing & GPU-equipped HPC clusters
3. Relevance of GPUs in scientific computing
3.1. Acceleration of scientific computing software
3.2. Lighthouse projects

II. Pipelined approach to determine eigenvalues of symmetric matrices
4. The SBTH algorithm
4.1. Block decomposition of a banded matrix
4.2. Serial reduction
4.3. Parallel reduction
5. Implementation of the SBTH algorithm
5.1. Determination of Householder transformations
5.2. Transformation of block pairs
5.3. Pipelining
5.4. Matrix storage format
6. Results
6.1. Profiling
6.2. Scalability of the pipelined approach
6.3. Comparison with ELPA

III. Multiple levels of parallelism to solve random ordinary differential equations
7. Random ordinary differential equations
7.1. Random & stochastic ordinary differential equations
7.2. The Kanai-Tajimi earthquake model
7.3. Numerical schemes for RODEs
7.3.1. Averaged schemes
7.3.2. K-RODE-Taylor schemes
7.3.3. Remarks on numerical schemes
8. Building block 1: Pseudo random number generation
8.1. The Ziggurat method
8.1.1. Definition of the Ziggurat
8.1.2. Algorithmic description of the Ziggurat method
8.1.3. Setup of the Ziggurat
8.1.4. Memory/runtime trade-off for the Ziggurat method
8.2. Rational polynomials
8.3. The Wallace method
8.4. Results
8.4.1. Evaluation of particular pseudo random number generators
8.4.2. Performance comparison of pseudo random number generators
9. Building block 2: Ornstein-Uhlenbeck process
9.1. From the Ornstein-Uhlenbeck process to prefix sum
9.2. Parallel prefix sum
9.3. Results
10. Building block 3: Averaging
10.1. Single & double averaging
10.2. Tridiagonal averaging
10.3. Results
11. Building block 4: Coarse timestepping for the right-hand side
11.1. Averaged schemes
11.2. K-RODE-Taylor schemes
12. Results of the full random ordinary differential equations solver
12.1. Configurations of choice for the building blocks
12.2. Profiling of single path-wise solutions
12.3. Scalability of the multi-path solution
12.4. Statistical evaluation of the multi-path solution
IV. Scalability on heterogeneous systems of the lattice Boltzmann method
13. The lattice Boltzmann method and its serial implementation
13.1. Discretization schemes
13.2. Collision & propagation
13.3. Memory layout pattern
14. Parallelization of the lattice Boltzmann method
14.1. Domain decomposition
14.2. Computation of the GPU- & CPU-part of a subdomain
14.2.1. Lattice Boltzmann method kernels for the GPU
14.2.2. Lattice Boltzmann method kernels for the CPU
14.3. Communication scheme
15. Performance modeling of the lattice Boltzmann method on heterogeneous systems
16. Results
16.1. Characteristics of kernels
16.1.1. Results of the GPU kernels
16.1.2. Results of the CPU kernels
16.2. Benchmark results for heterogeneous systems
16.2.1. Single subdomain results
16.2.2. Preparations for multiple subdomains results
16.2.3. Weak scaling results of multiple subdomains
16.2.4. Strong scaling results of multiple subdomains
16.3. Validation of the performance model

V. Conclusion