SANDIA REPORT
SAND2015-8051
Unlimited Release
Printed September 2015

Towards Accurate Application Characterization for Exascale (APEX)

S.D. Hammond
Center for Computing Research
Sandia National Laboratories
Albuquerque, NM, 87185
[email protected]

Prepared by Sandia National Laboratories, Albuquerque, New Mexico 87185 and Livermore, California 94550.

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Approved for public release; further dissemination unlimited.

Issued by Sandia National Laboratories, operated for the United States Department of Energy by Sandia Corporation.

NOTICE: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government, nor any agency thereof, nor any of their employees, nor any of their contractors, subcontractors, or their employees, make any warranty, express or implied, or assume any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represent that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government, any agency thereof, or any of their contractors or subcontractors. The views and opinions expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or any of their contractors.

Printed in the United States of America. This report has been reproduced directly from the best available copy.

Abstract

Sandia National Laboratories has been engaged in hardware and software codesign activities for a number of years; indeed, it might be argued that prototyping of clusters as far back as the CPLANT machines, and many large capability resources including ASCI Red and Red Storm, were examples of codesigned solutions. As the research supporting our codesign activities has moved closer to investigating on-node runtime behavior, a natural hunger has grown for detailed analysis of both hardware and algorithm performance from the perspective of low-level operations. The Application Characterization for Exascale (APEX) LDRD was a project conceived to address some of these concerns. Primarily, the research was intended to focus on generating accurate and reproducible low-level performance metrics using tools that could scale to production-class code bases.
Alongside this research was an advocacy and analysis role associated with evaluating tools for production use, working with leading industry vendors to develop and refine solutions required by our code teams, and engaging directly with production code developers to form a context for the application analysis and a bridge to the research community within Sandia. On each of these counts significant progress has been made, particularly, as this report will cover, in the low-level analysis of operations for important classes of algorithms. This report summarizes the development of a collection of tools under the APEX research program and leaves to other SAND and L2 milestone reports the description of codesign progress with Sandia's production users/developers.

Acknowledgment

Developing new tools and approaches to understanding performance is a particularly challenging endeavor, not least because many developers have unique perspectives on how they want to understand 'performance' in their codes. Without the many inputs from a whole range of developers and colleagues it would not have been possible to make any progress on this task. I am particularly grateful to my architecture colleagues in 01422, including Arun Rodrigues, Scott Hemmert, Gwen Voskuilen, Dave Resnick, Jeanine Cook and Bob Benner, for the many conversations. From the testbed and advanced architecture perspective, inputs from Jim Laros, Sue Kelly, Jim Brandt, Ann Gentile, Victor Kuhns, Nathan Gauntt, Jason Repik, T.J. Lee, Mahesh Rajan and Dennis Dinge were immensely helpful. The support from Jim Ang has been invaluable, particularly the encouragement to reach out to vendors and our own internal development teams for input. Several other colleagues deserve much credit for their support - Christian Trott, Carter Edwards, Dan Sunderland, Mike Heroux, Doug Doerfler and Rob Hoekstra.

Our friends and colleagues in 1500 within Sandia have provided a great deal of direction and some great dialogue over vectorization, code structures and many issues present in a production environment. I am indebted to Mike Glass, Kendall Pierson, Nate Crane, Mike Tupek, Mark Mereweather, Travis Fisher and Micah Howard for all of their support. Working with these teams was, and continues to be, a great experience.

The error modeling discussion using the GLITCH tool came from collaboration with Rob Armstrong and Jackson Mayo of Sandia's Livermore campus. These discussions have provided great insight into mechanisms that may be used to detect and correct errors in complex scientific codes.

Over the last two years I have benefited immensely from discussions with Scott Pakin at Los Alamos National Laboratory. His deep knowledge of architectures and performance-analysis tools helped in understanding where this project could fit in with the community and add value.

Finally, my thanks go to the LDRD program office and support team at Sandia who make incredible research projects possible with financial and administrative support. We recognize how much this contributes to making our lives easier.

Contents

1 Introduction
2 APEX Toolkit
  2.1 Low-Level Instruction Analysis using TESSERAE
    2.1.1 Floating Point Instruction Analysis
    2.1.2 Masked Register Analysis
    2.1.3 Application Memory Access Traits
    2.1.4 Bytes to FLOP/s Ratios
    2.1.5 Instruction Memory Access
  2.2 Floating Point Error Modeling using GLITCH
  2.3 Early Instruction Level Analysis of Knights Landing
3 Ariel Front End to the Structural Simulation Toolkit
4 KokkosP Profiling Interface
  4.1 Profiling Interface for Kokkos Kernels
5 Conclusions
References
List of Figures

2.1 TESSERAE Instruction Blocking
2.2 Standard (Unmasked) Vector Operations on an 8-wide Vector Unit
2.3 Masked Vector Operations on an 8-wide Vector Unit
2.4 Floating Point Intensity and Vectorization of Total Instructions
2.5 Ratio of Bytes Read to Bytes Written
2.6 Average Memory Operation Request Size
2.7 Ratio of Bytes Requested by Instructions to Floating Point Operations Performed
2.8 Instructions (%) which Request Memory Operations
2.9 Breakdown of Instructions Executed on the Knights Landing Emulation Environment
2.10 Breakdown of Memory Access Operations on the Knights Landing Emulation Environment
2.11 Breakdown of Vector Operations on the Knights Landing Emulation Environment
4.1 KokkosP Infrastructure for a Parallel-For Dispatch
4.2 LULESH Kernel Timer Analysis using KokkosP on Intel Haswell
4.3 LULESH Kernel Timer Analysis using KokkosP on IBM POWER8

List of Tables

2.1 Maximum Operations Performed per Vector Instruction in Intel Processor Families

Chapter 1

Introduction

For over four decades Sandia National Laboratories has been a leading laboratory for the design of large-scale engineering and physics simulation applications. These demanding calculations have placed, and continue to place, great strain on the leading computing architectures of the day, not least because of the continued pressure for greater resolution of problems, enhanced models of physical phenomena and an ever broadening range of users. Given the continued pressure to deliver high performance solutions to analysts across the laboratories, it will come as no surprise to many readers that a broad range of research is undertaken to identify workflow bottlenecks and to analyze our existing use cases for opportunities to improve scientific delivery.
In recent years the term 'codesign' has been used to describe the bringing together of domain experts, application developers, users, machine architects and many others to collectively identify and analyze tradeoffs for novel solutions that optimize the scientific delivery of the laboratories. It might be argued that, even as far back as the early CPLANT clusters and many large capability resources including the NCUBE, ASCI Red [4] and most recently Red Storm [15, 16, 2], Sandia has been engaged in this activity.

The focus of many recent studies has, however, shifted from assessment of full-scale machine designs - which are often centered on interconnection networks - to assessment of in-node performance. This is partly driven by the broadening range of solutions available in contemporary supercomputer designs, but also by the growing difficulty application developers face in extracting performance from recent, more complex node architectures. We are finding that this continues to be particularly challenging in the area of instruction vectorization. A natural consequence of this refocusing is the need to efficiently identify and extract low-level performance information about the operations production-class applications and algorithms execute. The output has several uses including: (1) supporting algorithm and bottleneck analysis; (2) evaluation of libraries, compilers and system-software components which contribute to (or in some cases inhibit) performance and, finally, (3) as a design or feedback mechanism to hardware architects who rely on this kind of information to design the processors, accelerators and memory subsystems of the future.

The Application Characterization for Exascale (APEX) LDRD was a project conceived to address some of these concerns. Primarily, the research was intended to focus on generating accurate and reproducible low-level performance metrics using tools that could scale to production-class code bases, as well as multi-threaded execution (preemptively supporting programming models that will be offered on the ASC Trinity Phase I and Phase II deployments). Alongside this research was an advocacy and analysis role associated with evaluating tools for production use, working with leading industry vendors to develop and refine solutions required by our code teams, and engaging directly with production code developers to form a context for the application analysis and a bridge to the research community within Sandia. On each of these counts significant progress has been made, particularly, as this report will cover, in the low-level analysis of operations for important classes of algorithms.

This report summarizes the development of a collection of tools under the APEX research program and leaves to other SAND and L2 milestone reports the description of codesign progress with Sandia's production users/developers. The reader is referred to other SAND reports, papers and invited talks [14, 11, 10, 8, 9] for further information on APEX and the results obtained using the tools described herein.
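As a brief, hypothetical illustration of the vectorization challenge noted above (this example is not taken from the APEX tools or any Sandia code), consider a loop whose body is guarded by a data-dependent condition. On wide vector units, a compiler can typically only vectorize such a loop by emitting masked (predicated) vector instructions, so some lanes of each vector operation perform no useful work; this is the kind of behavior that low-level instruction analysis is intended to expose.

```cpp
// Hypothetical sketch (not from the APEX tools or any production code):
// a data-dependent branch inside a loop that a compiler can usually only
// vectorize with masked/predicated vector instructions, leaving some lanes
// of each wide vector operation idle.
#include <cstddef>
#include <cstdio>
#include <vector>

// Updates only the elements whose input exceeds the threshold. On an
// 8-wide double-precision vector unit, each vector iteration may contain a
// mix of active and inactive lanes depending on the data.
void threshold_update(std::vector<double>& y,
                      const std::vector<double>& x,
                      double threshold) {
    for (std::size_t i = 0; i < y.size(); ++i) {
        if (x[i] > threshold) {      // data-dependent condition -> masked lanes
            y[i] += 2.0 * x[i];      // two floating point operations per active lane
        }
    }
}

int main() {
    std::vector<double> x(1024), y(1024, 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] = (i % 3 == 0) ? 2.0 : 0.5;  // roughly one third of lanes active
    }
    threshold_update(y, x, 1.0);
    std::printf("y[0] = %.1f\n", y[0]);
    return 0;
}
```

Counting how many vector instructions are issued fully unmasked versus partially masked for loops such as this one is one way to quantify how effectively an application exploits the available vector hardware.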