AUTOMATIC PERFORMANCE PREDICTION OF PARALLEL PROGRAMS

Thomas Fahringer
Institute for Software Technology and Parallel Systems
University of Vienna
Vienna, Austria

KLUWER ACADEMIC PUBLISHERS
Boston/London/Dordrecht

Distributors for North America:
Kluwer Academic Publishers
101 Philip Drive
Assinippi Park
Norwell, Massachusetts 02061 USA

Distributors for all other countries:
Kluwer Academic Publishers Group
Distribution Centre
Post Office Box 322
3300 AH Dordrecht, THE NETHERLANDS

Library of Congress Cataloging-in-Publication Data
A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN-13: 978-1-4612-8592-2
e-ISBN-13: 978-1-4613-1371-7
DOI: 10.1007/978-1-4613-1371-7

Copyright © 1996 by Kluwer Academic Publishers
Softcover reprint of the hardcover 1st edition 1996

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061

Printed on acid-free paper.
Dedicated to Elisabeth, Sophia, and Anna

CONTENTS

LIST OF FIGURES xi
LIST OF TABLES xv
PREFACE xvii
Acknowledgments xix

1 INTRODUCTION 1
  1.1 Automatic Parallelization for Multiprocessor Systems 1
  1.2 Motivation for Performance Prediction 3
  1.3 p3T: Parameter based Performance Prediction Tool 7
  1.4 Overview 12

2 MODEL 15
  2.1 Introduction 15
  2.2 Sequential Programs 15
  2.3 Parallel Programs 21
  2.4 Basic Parallelization Strategy 31
  2.5 Optimizing Compiler Transformations 37
  2.6 Using p3T and WF under VFCS 41
  2.7 Summary 45

3 SEQUENTIAL PROGRAM PARAMETERS 47
  3.1 Introduction 47
  3.2 Sequential Program Parameters 49
  3.3 Instrumentation 50
  3.4 Optimization 53
  3.5 Adaptation of Profile Data 65
  3.6 Summary 71

4 PARALLEL PROGRAM PARAMETERS 73
  4.1 Introduction 73
  4.2 Work Distribution 74
  4.3 Number of Transfers 100
  4.4 Amount of Data Transferred 127
  4.5 Transfer Time 137
  4.6 Network Contention 149
  4.7 Number of Cache Misses 160
  4.8 Computation Time 179
  4.9 Summary 188

5 EXPERIMENTS 191
  5.1 Introduction 191
  5.2 Implementation Status 192
  5.3 Estimation Accuracy of p3T 192
  5.4 Usefulness of p3T 194
  5.5 Graphical User Interface of p3T 207
  5.6 Summary 214

6 RELATED WORK 215
  6.1 Performance Prediction Techniques 215
  6.2 Classification of Performance Estimators 221

7 CONCLUSIONS 227
  7.1 Contributions 228
  7.2 Future Research 232

A APPENDIX 235
  A.1 Intersection and Volume Computation of Convex n-dimensional Polytopes 235
  A.2 Notation 249

REFERENCES 253
INDEX 267

LIST OF FIGURES

Chapter 2
  2.1 Overlap area of array U in the JACOBI code 27
  2.2 The EXSR primitive 30
  2.3 Structure of Vienna Fortran Compilation System 32
  2.4 Subroutine JACOBI after initial parallelization 35
  2.5 Subroutine JACOBI after optimization and target code generation 37
  2.6 Structure of p3T and Weight Finder as part of VFCS. 42

Chapter 3
  3.1 Before Hoisting Instrumentation Code 58
  3.1 After Hoisting Instrumentation Code 58

Chapter 4
  4.1 Loop iteration space intersection for P(3) 81
  4.2 Loop iteration space intersection for P(3, 2, 2) 82
  4.3 Loop iteration space intersection for P(2, 2) 83
  4.4 Loop iteration space intersection for P(3) 84
  4.5 LFK-6 iteration space 90
  4.6 Estimated versus measured work distribution 93
  4.7 Measured runtimes for various LFK-9 versions 96
  4.8 Useful work distribution for various parallel LFK-9 versions 97
  4.9 Block-wise distribution of array VAL 104
  4.10 Block-wise distribution of array U 106
  4.11 Communication pattern for a parallel Gauss/Seidel relaxation 111
  4.12 Loop iteration space intersection for processor P(3,2) 112
  4.13 a. Intersection with a 2-dimensional iteration space; b. Intersection with a 3-dimensional iteration space 114
  4.14 Communication pattern for an inside communication 117
  4.15 Loop iteration space intersection based on array U 117
  4.16 Loop iteration space intersection for P(3,2) 119
  4.17 Number of transfers for various Gauss/Seidel versions 123
  4.18 Number of transfers for synthetic Gauss/Seidel versions 126
  4.19 Loop iterations accessing non-local data in C4 for P(3, 2) 129
  4.20 2Dblock (dotted lines) versus column-wise (solid lines) distribution in JACOBI 138
  4.21 4-dimensional hypercube topology 140
  4.22 Transfer time of various message lengths on the iPSC/860 hypercube 159
  4.23 JACOBI runtime for various data sizes and number of processors 172
  4.24 Number of cache misses in JACOBI for various data sizes and number of processors 173
  4.25 JACOBI runtime before and after loop interchange 174
  4.26 Cache misses in JACOBI before and after loop interchange 175
  4.27 LFK-8 runtime before and after loop distribution 176
  4.28 Cache misses in LFK-8 before and after loop distribution 177
  4.29 Irregular runtime behavior of benchmark kernels 182
  4.30 Measured versus predicted JACOBI runtimes 187
Chapter 5
  5.1 Sequential stencil kernel 195
  5.2 Measured versus predicted parameter values and measured runtimes 197
  5.3 Sequential EFLUX program 200
  5.4 Performance tuning of EFLUX 201
  5.5 Sequential SHALLOW program 204
  5.6 Search tree of different program transformations and data distribution strategies for SHALLOW 205
  5.7 VFCS main window with HPF JACOBI program. 208
  5.8 Select parallel program parameters. 209
  5.9 Change target architecture specific parameters. 210
  5.10 HPF JACOBI main program with p3T performance data. 211
  5.11 Performance visualization for a single statement. 212
  5.12 Sorted list of program units with respect to performance. 213

Appendix A
  A.1 Intersection of a polytope with a hyperplane. 243
  A.2 Triangularization of a 2-dimensional polytope. 247
  A.3 Triangularization of a 3-dimensional polytope. 248