INSIGHTFUL PERFORMANCE ANALYSIS OF MANY-TASK RUNTIMES
THROUGH TOOL-RUNTIME INTEGRATION

by

NICHOLAS A. CHAIMOV

A DISSERTATION

Presented to the Department of Computer and Information Science
and the Graduate School of the University of Oregon
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy

June 2017

DISSERTATION APPROVAL PAGE

Student: Nicholas A. Chaimov
Title: Insightful Performance Analysis of Many-Task Runtimes through Tool-Runtime Integration

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Computer and Information Science by:

Allen D. Malony     Chair
Boyana R. Norris    Core Member
Hank R. Childs      Core Member
Gregory Bothun      Institutional Representative

and

Scott L. Pratt      Dean of the Graduate School

Original approval signatures are on file with the University of Oregon Graduate School.

Degree awarded June 2017

© 2017 Nicholas A. Chaimov

DISSERTATION ABSTRACT

Nicholas A. Chaimov
Doctor of Philosophy
Department of Computer and Information Science
June 2017
Title: Insightful Performance Analysis of Many-Task Runtimes through Tool-Runtime Integration

Future supercomputers will require applications to expose far more parallelism than current applications do. To help developers structure their applications to make this possible, a new class of programming models and libraries, the many-task runtimes, is emerging to allow the expression of orders of magnitude more parallelism than existing models support. This dissertation describes the challenges that these emerging many-task runtimes pose for performance analysis, and proposes deep integration between runtimes and performance tools as a means of producing correct, insightful, and actionable performance results.
I show how tool-runtime integration can be used to aid programmer understanding of performance characteristics and to provide online performance feedback to the runtime for Unified Parallel C (UPC), High Performance ParalleX (HPX), Apache Spark, the Open Community Runtime, and the OpenMP runtime.

This dissertation includes previously published co-authored material.

CURRICULUM VITAE

NAME OF AUTHOR: Nicholas A. Chaimov

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
  University of Oregon, Eugene, OR
  Portland State University, Portland, OR
  Reed College, Portland, OR

DEGREES AWARDED:
  Doctor of Philosophy, Computer and Information Science, 2017, University of Oregon
  Master of Science, Computer and Information Science, 2012, University of Oregon
  Bachelor of Science, Computer and Information Science, 2010, University of Oregon
  Bachelor of Science, Biology, 2007, University of Oregon

AREAS OF SPECIAL INTEREST:
  High-Performance Computing
  Scientific Computing
  Performance Monitoring

PROFESSIONAL EXPERIENCE:
  Graduate Research Fellow, Computer and Information Science, University of Oregon, 2010-2017
  Software Engineering Intern, Intel Federal LLC, Summer 2016
  Research Assistant, Lawrence Berkeley National Lab, Summer 2015
  Research Assistant, Lawrence Berkeley National Lab, Summer 2014

GRANTS, AWARDS AND HONORS:
  Gurdeep Pall Graduate Student Fellowship, University of Oregon, 2016
  Student Travel Grant, High Performance Distributed Computing (HPDC), 2016
  Member of 1st Place Graduate Team, Eugene Luks Programming Competition, University of Oregon, 2015
  Member of 1st Place Graduate Team, Eugene Luks Programming Competition, University of Oregon, 2012
  Member, Upsilon Pi Epsilon International Honor Society for the Computing and Information Disciplines, 2010-Present

PUBLICATIONS:

Nicholas Chaimov, Allen Malony, Shane Canon, Costin Iancu, Khaled Z. Ibrahim, and Jay Srinivasan.
“Scaling Spark on HPC Systems.” International Symposium on High-Performance Parallel and Distributed Computing (HPDC). 2016.

Nicholas Chaimov, Allen Malony, Costin Iancu, and Khaled Ibrahim. “Scaling Spark on Lustre.” Workshop on Performance and Scalability of Storage Systems. 2016.

Nicholas Chaimov, Allen Malony, Khaled Ibrahim, Costin Iancu, Shane Canon, and Jay Srinivasan. “Performance Evaluation of Apache Spark on Cray XC Systems.” Cray Users Group. 2016.

Md Abdullah Shahneous Bari, Nicholas Chaimov, Abid M. Malik, Kevin A. Huck, Barbara Chapman, Allen D. Malony, and Osman Sarood. “ARCS: Adaptive Runtime Configuration Selection for Power-Constrained OpenMP Applications.” IEEE International Conference on Cluster Computing (CLUSTER). 2016.

Nicholas Chaimov, Khaled Ibrahim, Sam Williams, and Costin Iancu. “Reaching Bandwidth Saturation Using Transparent Injection Parallelization.” International Journal of High Performance Computing Applications. 2016.

Kevin Huck, Allan Porterfield, Nicholas Chaimov, Hartmut Kaiser, Allen D. Malony, Thomas Sterling, and Rob Fowler. “An Autonomic Performance Environment for Exascale.” Supercomputing Frontiers and Innovations 2, no. 3 (2015): 49-66.

Robert Lim, Allen Malony, Boyana Norris, and Nicholas Chaimov. “Identifying Optimization Opportunities Within Kernel Execution in GPU Codes.” Euro-Par 2015: Parallel Processing Workshops, pp. 185-196. 2015.

Nicholas Chaimov, Khaled Ibrahim, Sam Williams, and Costin Iancu. “Exploiting Communication Concurrency on High Performance Computing Systems.” International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM). 2015.

Nicholas Chaimov, Boyana Norris, and Allen D. Malony. “Toward Multi-target Autotuning for Accelerators.” International Conference on Parallel and Distributed Systems (ICPADS). 2014.

Nicholas Chaimov, Boyana Norris, and Allen D. Malony.
“Integration and Synthesis for Automated Performance Tuning: the SYNAPT Project.” International Workshop on Automatic Performance Tuning (iWAPT). 2014.

Nicholas Chaimov, Scott Biersdorff, and Allen D. Malony. “Tools for machine-learning-based empirical autotuning and specialization.” International Journal of High Performance Computing Applications 27.4 (2013): 403-411.

ACKNOWLEDGEMENTS

I thank my advisor, Prof. Allen Malony, for his help with research, with identifying research areas, with securing interesting and useful internships, and, most of all, for convincing me to pursue a PhD. I also thank Prof. Hank Childs and Prof. Boyana Norris for their help with my research, as well as Sameer Shende and Kevin Huck of the Performance Research Lab. I thank my collaborators at Lawrence Berkeley National Lab: Costin Iancu, Khaled Ibrahim, and Sam Williams; and my collaborators at Intel: Bala Seshasayee, Romain Cledat, Bryan Pawlowski, and Nick Pepperling.

Parts of this document were supported by the DOE Office of Advanced Scientific Computing Research (contract number DE-AC02-05CH11231). Parts of this document are based upon work supported by the Department of Energy Office of Science under Award Number DE-SC0008717. Support for this work was provided through the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (and Basic Energy Sciences/Biological and Environmental Research/High Energy Physics/Fusion Energy Sciences/Nuclear Physics) under award numbers DE-SC0008638, DE-SC0008704, DE-FG02-11ER26050, and DE-SC0006925. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This work was partially supported by the Intel Parallel Computing Center at Lawrence Berkeley National Laboratory (Lustre).

TABLE OF CONTENTS

I. INTRODUCTION
  1.1. Thesis Statement
  1.2. Dissertation Outline
  1.3. Coauthored Material

II. BACKGROUND AND RELATED WORK
  2.1. Current Programming Models
  2.2. Capturing Performance Data
  2.3. Autotuning
  2.4. Performance Modeling
  2.5. Performance Diagnosis
  2.6. Exascale Computing and Future Programming Models
  2.7. Conclusion

III. ONLINE COMMUNICATIONS ADAPTATION IN UPC
  3.1. Introduction
  3.2. Maximizing Message Concurrency
  3.3. Communication and Concurrency
  3.4. Runtime Design
  3.5. Network Performance and Saturation
  3.6. Parallelizing Injection in Applications
  3.7. Discussion
  3.8. Other Related Work
  3.9. Conclusion