(cid:13)c 2007 Abhinav Bhatele APPLICATION–SPECIFIC TOPOLOGY–AWARE MAPPING AND LOAD BALANCING FOR THREE–DIMENSIONAL TORUS TOPOLOGIES BY ABHINAV BHATELE B. Tech., Indian Institute of Technology, Kanpur, 2005 THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2007 Urbana, Illinois Adviser: Professor Laxmikant V. Kal´e Abstract The advent of supercomputers like Blue Gene/L, XT3, XT4 and Blue Gene/P has given rise to new challenges in parallel programming. Interconnect topology of ma- chines is one such factor which has become crucial in optimizing the performance of applications on clusters with thousands of processors. If the topology of an inter- connect is clearly defined, it is possible to minimize the communication volume and balance it evenly across processors. It can be minimized by co-locating communi- cating entities on nearby objects. This can be done at two stages in the program: 1. during program start-up where we define the initial mapping for migratable entities and static mapping for non-migratable entities, and 2. during load balancing when migratable entities are moved across processors to keep the load evenly balanced. In this thesis, we examine how topological considerations in mapping and load balancing algorithms can help communication and make applications faster. To present concrete results, several applications written in Charm++, 7-point 3- dimensional Stencil, NAMD and LeanCP are considered. Results on 7-point Sten- cilarepresentedasaproofofprinciplebecauseitisarelativelysimpletoanalyzeand map. NAMDandLeanCPontheotherhandareproductioncodeswiththousands of users. NAMD is a classical molecular dynamics application and heavily depends on load balancing for optimal performance. Load balancers deployed in NAMD have been modified to place communicating objects in proximity to one another by considering the communication in multicasts. LeanCP is a quantum chemistry application based on the Car-Parrinello ab-initio Molecular Dynamics (CPAIMD) method. In LeanCP, the default mapping of its several object arrays to processors ii by the Charm++ runtime has been modified to optimize communication. Results of improvements from these strategies are demonstrated for all these applications on IBM’s Blue Gene/L and Cray’s XT3. iii To my aunt, mother and girlfriend. Girija, Vimlesh and Shambhavi. iv Acknowledgements There are many people whom I need to thank for their help and support with this thesis. The order in which they are mentioned is immaterial and I apologize if I have missed out some names. I wish to thank my advisor Professor Laxmikant V. Kal´e for showing the torch and helping me move forward inch by inch. He has always given me an ear, no matter how busy he has been. He has been my inspiration to continue for a PhD after my Master’s degree. My colleagues in the Parallel Programming Laboratory (PPL) have made my stay in US easier and a lot of fun. I thank Eric Bohm for helping me set my foot in research and for the constant technical help with LeanCP. Thanks to Gengbin Zheng for his help with various Charm++, load balancing and NAMD problems. I also want to thank my colleagues Chee Wai Lee, Nilesh Choudhury and Sayantan Chakravorty for their help with various things. Special thanks to Glenn Martyna and Sameer Kumar at IBM for their mentoring and support at my summer internship and otherwise. Thanks to Jim Phillips for help with understanding NAMD code. Many thanks to Fred Mintzer and Glenn Martyna at IBM and Shawn Brown and Chad Vizino at PSC for runs on Blue Gene/L and BigBen respectively. Thanks to Chao Huang at PPL for the Blue Gene/L topology interface and to Deborah Weisser at Cray Inc. for help with the XT3 topology interface. Again, it would have been difficult without Shawn and Chad’s help to get the topology interface running on XT3. Thanks to my friends for their support. Heartfelt thanks to all the wonderful people in my family who have always shown confidence in me. My mother, father v and uncle for seeing a dream for me; My Maa, for her endless love and blessings which have brought me this far; and my brothers for their support. And finally, I thank my girlfriend who has been patient with me whenever I told her I was busy with my thesis. She has been a constant source of encouragement, motivation and inspiration. vi Table of Contents List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x List of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi List of Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Parallel Machines and Topologies . . . . . . . . . . . . . . . . . . . . 1 1.2 Effect of Topology on Communication . . . . . . . . . . . . . . . . . . 3 1.3 Charm++ and its Applications . . . . . . . . . . . . . . . . . . . . . 4 2 Processor and Communication Graphs . . . . . . . . . . . . . . . 6 2.1 Processor Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Communication Scenarios . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Regular and Irregular Communication . . . . . . . . . . . . . 8 2.2.2 Point-to-Point, Multicasts and Reductions . . . . . . . . . . . 9 2.3 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3 Objects in Charm++ . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.1 Chares and Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2 Dynamic Mapping Framework . . . . . . . . . . . . . . . . . . . . . . 13 3.3 Topology Interface in Charm++ . . . . . . . . . . . . . . . . . . . . 14 3.3.1 Cray XT3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.4 3D Jacobi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.4.1 Topology Aware Mapping . . . . . . . . . . . . . . . . . . . . 15 3.4.2 Evaluation of Mapping Strategies . . . . . . . . . . . . . . . . 17 4 LeanCP and Regular Communication . . . . . . . . . . . . . . . . 19 4.1 Parallel Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.2 Communication in LeanCP . . . . . . . . . . . . . . . . . . . . . . . 22 4.3 Topology Aware Mapping . . . . . . . . . . . . . . . . . . . . . . . . 23 4.3.1 Mapping GSpace and RealSpace Arrays . . . . . . . . . . . . 24 4.3.2 Mapping of Density Objects . . . . . . . . . . . . . . . . . . . 25 4.3.3 Mapping PairCalculator Arrays . . . . . . . . . . . . . . . . . 25 vii 5 NAMD and Multicasts . . . . . . . . . . . . . . . . . . . . . . . . . 27 5.1 Parallel Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 27 5.2 Communication in NAMD . . . . . . . . . . . . . . . . . . . . . . . . 28 5.2.1 Topological mapping of patches . . . . . . . . . . . . . . . . . 29 5.3 Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.3.1 Topology-aware decisions . . . . . . . . . . . . . . . . . . . . . 32 6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 6.1 3D Jacobi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 6.1.1 Performance on Blue Gene/L . . . . . . . . . . . . . . . . . . 35 6.1.2 Performance on Cray XT3 . . . . . . . . . . . . . . . . . . . . 38 6.2 LeanCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 6.3 NAMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 7 Future Work and Conclusion . . . . . . . . . . . . . . . . . . . . . 44 7.1 Generalized Topology Sensitive Mapping . . . . . . . . . . . . . . . . 44 7.2 Load Balancers for Section Multicasts . . . . . . . . . . . . . . . . . . 45 7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Appendix B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Appendix C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 viii List of Tables 6.1 Performance of 3D Jacobi on Blue Gene/L (CO mode) for different decompositions(ProblemSize: fixedat512×512×512). Thenumbers are time in seconds for 1000 iterations. . . . . . . . . . . . . . . . . . 35 6.2 Performance of 3D Jacobi on Blue Gene/L (CO mode) for different decompositions(ProblemSize: variablefrom512×512×512to1024× 1024×1024). The numbers are time in seconds for 1000 iterations. . 35 6.3 Performance and hop-counts for 3D Jacobi on Blue Gene/L (VN mode) for block size 32 × 32 × 32 (Problem Size: fixed at 512 × 512×1024). The numbers are time in seconds for 1000 iterations. . . 37 6.4 Performance and hop-counts for 3D Jacobi on Cray XT3 for different decompositions (Problem Size: fixed at 512×512×512). N stands for na¨ıve and T stands for topology-aware mapping. The numbers are time in seconds for 1000 iterations. . . . . . . . . . . . . . . . . . 38 6.5 Performance and hop-counts for 3D Jacobi on Cray XT3 for different decompositions(ProblemSize: variablefrom512×512×256to1024× 512 × 1024). N stands for na¨ıve and T stands for topology-aware mapping. The numbers are time in seconds for 1000 iterations. . . . . 38 6.6 Performance of LeanCP on Blue Gene/L (CO mode). The numbers represent time per step in seconds. . . . . . . . . . . . . . . . . . . . 40 6.7 Performance of LeanCP on Blue Gene/L (VN mode). The numbers represent time per step in seconds. . . . . . . . . . . . . . . . . . . . 40 6.8 Reduction in hop-bytes for 3D NAMD on Blue Gene/L (Bench- mark:ApoA1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 6.9 Reductioninhop-bytesfor3DNAMDonBlueGene/L(Benchmark:F - 1 ATPase) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 ix
Description: