ENERGY EFFICIENCY OF PARALLEL SCIENTIFIC KERNELS

A Thesis Presented to
the Faculty of the Department of Computer Science
University of Houston

In Partial Fulfillment
of the Requirements for the Degree
Master of Science

By
Sayan Ghosh
April 2012

APPROVED:
Dr. Barbara Chapman, Chairman, Dept. of Computer Science
Dr. Lennart Johnsson, Dept. of Computer Science
Dr. Eric Bittner, Dept. of Chemistry
Dean, College of Natural Sciences and Mathematics

Acknowledgements

I would like to thank the individuals who spent time improving my work and played a great role in bringing this thesis to fruition.

I would start by thanking my thesis advisor, Dr. Barbara Chapman. It has been an honor to be her Master's student. I really appreciate all her patience, time and ideas, which were key to making my Master's experience productive and exciting. Her trust and confidence in me have been the most important motivation for me to do research.

My mentor at the HPCTools lab, Sunita Chandrasekaran, helped me by reviewing my findings line by line, scrutinizing them and providing great feedback on the material, for which I am very grateful. Her ideas helped me refine my results.

The inception of this project happened during the summer of 2011, when I worked as an intern at PNNL, and I cannot thank Dr. Darren Kerbyson, Dr. Abhinav Vishnu and Dr. Kevin Barker enough for assisting me with my work there.

I had to depend greatly on the hardware infrastructure for performing the experiments, and I am grateful to Tony Curtis, Research Scientist at UofH, for being extremely forthcoming and patient in rectifying software and hardware malfunctions.

I would like to thank Dr. Lennart Johnsson and Dr. Edgar Gabriel of the Computer Science Dept. at UofH for helping me whenever I had trouble interpreting the results of an experiment.
Finally, my sincere thanks go to my colleagues at the HPCTools lab, all of whose help I have sought at one point or another.

Abstract

Scientific kernels such as FFT, BLAS and Stencils have been subjects of active research for many decades. The performance of numerical algorithms can be expressed by the GFlops count, which reflects the number of floating-point arithmetic operations performed per unit time on the hardware under consideration. Since these results expose the actual throughput of the underlying hardware, these numerical algorithms are often used as benchmarks to evaluate the performance of a machine. Apart from performance, a careful study of applications and their electrical energy consumption under varying input conditions is required to achieve power efficiency at the exascale level. Power-efficient processor chip designs and dynamic control of processor frequency have been employed to throttle machine energy. However, there is also a need to introduce electrical energy consumption as an evaluation metric, to have a clear idea of the performance of an application per watt or per dollar spent on energy. This is especially important given that supercomputing facilities are known to spend millions of dollars every year on electricity. This thesis provides insights into the energy characteristics of certain classes of applications running in a heterogeneous computing environment. It discusses the energy efficiency of kernels such as FFT, DGEMM, Stencils and Pseudo Random Number Generators, which are widely used in various disciplines of high performance computing.
A power analyzer was used to extract the electrical power usage of the multi-GPU node under inspection. An API was written to interface remotely with the analyzer and obtain the instantaneous power readings. The results show that the power/energy behavior of different application kernels reflects their computation-communication patterns. Conversely, it is possible to provide a reasonable estimate of the power/energy characteristics of a given application if its computation/communication overhead can be determined.

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Electrical Power and Energy
  1.3 Research Objectives
  1.4 Thesis Organization
  1.5 Contributions
2 Background
  2.1 Introduction
  2.2 Current State of the Art of Computation Systems
    2.2.1 Hardware Architectures: present and near future
    2.2.2 Challenges in Software Architecture
  2.3 Performance versus Efficiency
  2.4 Scientific kernels - Typical candidates for HPC
  2.5 Summary
3 Graphics Processing Units (GPU) in Scientific Computation
  3.1 Introduction
  3.2 Architectural differences - SIMT versus SIMD
  3.3 Design of GPUs
  3.4 Programming on Nvidia GPUs
    3.4.1 Native API - CUDA
    3.4.2 Pragma-based approaches for Hardware Accelerators
4 Energy Efficiency
  4.1 Introduction
  4.2 Measurement of Energy
  4.3 Existing strategies to control power/energy
  4.4 Factors affecting energy consumption
  4.5 Summary
5 Testbed
  5.1 Introduction
  5.2 Node hardware - CPUs and GPUs
  5.3 Power Analyzer
  5.4 Power API
  5.5 CPU and GPU Memory Bandwidth Tests
  5.6 Summary
6 Experiments
  6.1 Introduction
  6.2 Dense Linear Algebra: Double Precision Matrix-Matrix Multiply
    6.2.1 Operation
    6.2.2 Implementation
    6.2.3 Results
  6.3 Spectral Methods: Fast Fourier Transforms
    6.3.1 Operation
    6.3.2 Implementation
    6.3.3 Results
  6.4 Structured Grids: Finite Difference using Stencils
    6.4.1 Operation
    6.4.2 Implementation
    6.4.3 Results
  6.5 Monte Carlo Methods: Pseudo Random Number Generators (Mersenne Twister)
    6.5.1 Operation
    6.5.2 Implementation
    6.5.3 Results
  6.6 Analysis of GPU Hardware Counters
    6.6.1 Correlation Tests
  6.7 Summary
7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work
Nomenclature
Bibliography
A Experiments Revisited
  A.1 DGEMM using streams
  A.2 FFT on multicore CPU
  A.3 3D Finite Differences on Multicore CPUs
    A.3.1 Performance and power of directive based approaches
  A.4 Verification and validation of results
    A.4.1 DGEMM
    A.4.2 FFT
    A.4.3 Finite Differences
    A.4.4 Mersenne Twister
B UHPwrLib API to interface with the power analyzer
  B.1 Using the API
  B.2 Best practices
C Installing and working with the USBTMC Module
  C.1 Download
  C.2 Installation
D Explanation of abbreviations