
Introduction to Parallel Computing

W. P. Petersen
Seminar for Applied Mathematics, Department of Mathematics, ETHZ, Zurich
[email protected]

P. Arbenz
Institute for Scientific Computing, Department Informatik, ETHZ, Zurich
[email protected]

Great Clarendon Street, Oxford OX2 6DP
Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide in Oxford, New York, Auckland, Cape Town, Dar es Salaam, Hong Kong, Karachi, Kuala Lumpur, Madrid, Melbourne, Mexico City, Nairobi, New Delhi, Shanghai, Taipei, and Toronto, with offices in Argentina, Austria, Brazil, Chile, Czech Republic, France, Greece, Guatemala, Hungary, Italy, Japan, Poland, Portugal, Singapore, South Korea, Switzerland, Thailand, Turkey, Ukraine, and Vietnam.
Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries.
Published in the United States by Oxford University Press Inc., New York
© Oxford University Press 2004
The moral rights of the author have been asserted. Database right Oxford University Press (maker). First published 2004.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above. You must not circulate this book in any other binding or cover and you must impose the same condition on any acquirer.
A catalogue record for this title is available from the British Library
Library of Congress Cataloging in Publication Data (Data available)
Typeset by Newgen Imaging Systems (P) Ltd., Chennai, India
Printed in Great Britain on acid-free paper by Biddles Ltd., King's Lynn, Norfolk
ISBN 0198515766 (hbk), 0198515774 (pbk)

PREFACE

The contents of this book are a distillation of many projects which have subsequently become the material for a course on parallel computing given for several years at the Swiss Federal Institute of Technology in Zürich. Students in this course have typically been in their third or fourth year, or graduate students, and have come from computer science, physics, mathematics, chemistry, and programs for computational science and engineering. Student contributions, whether large or small, critical or encouraging, have helped crystallize our thinking in a quickly changing area. It is, alas, a subject which overlaps with all scientific and engineering disciplines. Hence, the problem is not a paucity of material but rather the distillation of an overflowing cornucopia. One of the students' most often voiced complaints has been organizational and of information overload. It is thus the point of this book to attempt some organization within a quickly changing interdisciplinary topic. In all cases, we will focus our energies on floating point calculations for science and engineering applications.

Our own thinking has evolved as well: a quarter of a century of experience in supercomputing has been sobering. One source of amusement as well as amazement to us has been that the power of 1980s supercomputers has been brought in abundance to PCs and Macs. Who would have guessed that vector processing computers can now be easily hauled about in students' backpacks? Furthermore, the early 1990s dismissive sobriquets about dinosaurs lead us to chuckle that the most elegant of creatures, birds, are those ancients' successors.
Likewise, those early 1990s contemptuous dismissals of magnetic storage media must now be held up against the fact that 2 GB disk drives are now 1 in. in diameter and mounted in PC-cards. Thus, we have to proceed with what exists now and hope that these ideas will have some relevance tomorrow.

Until the end of 2004, for the three previous years, the tip-top of the famous Top 500 supercomputers [143] was the Yokohama Earth Simulator. Currently, the top three entries in the list rely on large numbers of commodity processors: 65536 IBM PowerPC 440 processors at Livermore National Laboratory; 40960 IBM PowerPC processors at the IBM Research Laboratory in Yorktown Heights; and 10160 Intel Itanium II processors connected by an Infiniband network [75] and constructed by Silicon Graphics, Inc. at the NASA Ames Research Centre. The Earth Simulator is now number four and has 5120 SX-6 vector processors from NEC Corporation.

Here are some basic facts to consider for a truly high performance cluster:

1. Modern computer architectures run internal clocks with cycles of less than a nanosecond. This defines the time scale of floating point calculations.
2. For a processor to get a datum within a node, which sees a coherent memory image but on a different processor's memory, typically requires a delay of order 1 µs. Note that this is 1000 or more clock cycles.
3. For a node to get a datum which is on a different node by using message passing takes 100 or more µs.

Thus we have the following not particularly profound observations: if the data are local to a processor, they may be used very quickly; if the data are on a tightly coupled node of processors, there should be roughly a thousand or more data items to amortize the delay of fetching them from other processors' memories; and finally, if the data must be fetched from other nodes, there should be 100 times more than that if we expect to write off the delay in getting them. So it is that NEC and Cray have moved toward strong nodes, with even stronger processors on these nodes. They have to expect that programs will have blocked or segmented data structures. As we will clearly see, getting data from memory to the CPU is the problem of high speed computing, not only for NEC and Cray machines, but even more so for the modern machines with hierarchical memory. It is almost as if floating point operations take insignificant time, while data access is everything. This is hard to swallow: the classical books go on in depth about how to minimize floating point operations, but a floating point operation (flop) count is only an indirect measure of an algorithm's efficiency. A lower flop count only approximately reflects that fewer data are accessed. Therefore, the best algorithms are those which encourage data locality. One cannot expect a summation of elements in an array to be efficient when each element is on a separate node.
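To make the point concrete, consider a minimal sketch (the array size, the stride, and the use of clock() for timing are arbitrary illustrative choices): both loops below perform exactly the same number of floating point additions, yet the unit-stride loop typically runs several times faster simply because its data arrive from contiguous memory.

/* Minimal illustration: identical flop counts, very different data access. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (1L << 24)    /* 16M doubles, about 128 MB               */
#define STRIDE 1021L         /* large prime stride: scattered accesses  */

static double sum_unit_stride(const double *x, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)   /* contiguous, cache friendly */
        s += x[i];
    return s;
}

static double sum_strided(const double *x, long n, long stride)
{
    double s = 0.0;
    long j = 0;
    for (long i = 0; i < n; i++) { /* same number of additions, poor locality */
        s += x[j];
        j += stride;
        if (j >= n) j -= n;        /* wrap around inside the array */
    }
    return s;
}

int main(void)
{
    double *x = malloc(N * sizeof(double));
    if (x == NULL) return 1;
    for (long i = 0; i < N; i++) x[i] = 1.0;

    clock_t t0 = clock();
    double s1 = sum_unit_stride(x, N);
    clock_t t1 = clock();
    double s2 = sum_strided(x, N, STRIDE);
    clock_t t2 = clock();

    printf("unit stride: sum = %.0f  time = %.3f s\n",
           s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("stride %4ld: sum = %.0f  time = %.3f s\n",
           STRIDE, s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(x);
    return 0;
}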
This is why we have organized the book in the following manner. Basically, we start from the lowest level and work up.

1. Chapter 1 contains a discussion of memory and data dependencies. When one result is written into a memory location subsequently used/modified by an independent process, who updates what and when becomes a matter of considerable importance.
2. Chapter 2 provides some theoretical background for the applications and examples used in the remainder of the book.
3. Chapter 3 discusses instruction level parallelism, particularly vectorization. Processor architecture is important here, so the discussion is often close to the hardware. We take close looks at the Intel Pentium III, Pentium 4, and Apple/Motorola G-4 chips.
4. Chapter 4 concerns shared memory parallelism. This mode assumes that data are local to nodes or at least part of a coherent memory image shared by processors. OpenMP will be the model for handling this paradigm.
5. Chapter 5 is at the next higher level and considers message passing. Our model will be the message passing interface, MPI, and variants and tools built on this system.

Finally, a very important decision was made to use explicit examples to show how all these pieces work. We feel that one learns by examples and by proceeding from the specific to the general. Our choices of examples are mostly basic and familiar: linear algebra (direct solvers for dense matrices, iterative solvers for large sparse matrices), Fast Fourier Transform, and Monte Carlo simulations. We hope, however, that some less familiar topics we have included will be edifying. For example, how does one do large problems, or high dimensional ones? It is also not enough to show program snippets. How does one compile these things? How does one specify how many processors are to be used? Where are the libraries? Here, again, we rely on examples.
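As a minimal sketch of what is involved (the file names, the GNU-style -fopenmp flag, and the mpicc/mpirun commands shown in the comments are merely typical placeholders; every installation has its own compilers, flags, and launchers), the following OpenMP program prints one line per thread, with the thread count chosen at run time through an environment variable.

/* Typical build-and-run commands, which vary from system to system:
 *
 *   OpenMP:  cc -fopenmp hello_omp.c -o hello_omp
 *            OMP_NUM_THREADS=4 ./hello_omp
 *   MPI:     mpicc hello_mpi.c -o hello_mpi
 *            mpirun -np 4 ./hello_mpi
 */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel              /* one team of threads            */
    {
        printf("thread %d of %d\n",
               omp_get_thread_num(),   /* this thread's identity        */
               omp_get_num_threads()); /* count set via OMP_NUM_THREADS */
    }
    return 0;
}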
W. P. Petersen and P. Arbenz

Authors' comments on the corrected second printing

We are grateful to many students and colleagues who have found errata in the one and a half years since the first printing. In particular, we would like to thank Christian Balderer, Sven Knudsen, and Abraham Nieva, who took the time to carefully list errors they discovered. It is a difficult matter to keep up with such a quickly changing area as high performance computing, both regarding hardware developments and algorithms tuned to new machines. Thus we are indeed thankful to our colleagues for their helpful comments and criticisms.

July 1, 2005

ACKNOWLEDGMENTS

Our debt to our students, assistants, system administrators, and colleagues is awesome. Former assistants have made significant contributions and include Oscar Chinellato, Dr Roman Geus, and Dr Andrea Scascighini, particularly for their contributions to the exercises. The help of our system gurus cannot be overstated. George Sigut (our Beowulf machine), Bruno Loepfe (our Cray cluster), and Tonko Racic (our HP9000 cluster) have been cheerful, encouraging, and at every turn extremely competent. Other contributors who have read parts of an always changing manuscript and who tried to keep us on track have been Prof. Michael Mascagni and Dr Michael Vollmer. Intel Corporation's Dr Vollmer did so much to provide technical material, examples, and advice, as well as trying hard to keep us out of trouble by reading portions of an evolving text, that a "thank you" hardly seems enough. Other helpful contributors were Adrian Burri, Mario Rütti, Dr Olivier Byrde of Cray Research and ETH, and Dr Bruce Greer of Intel. Despite their valiant efforts, doubtless errors still remain, for which only the authors are to blame. We are also sincerely thankful for the support and encouragement of Professors Walter Gander, Gaston Gonnet, Martin Gutknecht, Rolf Jeltsch, and Christoph Schwab. Having colleagues like them helps make many things worthwhile. Finally, we would like to thank Alison Jones, Kate Pullen, Anita Petrie, and the staff of Oxford University Press for their patience and hard work.

CONTENTS

List of Figures xv
List of Tables xvii

1 BASIC ISSUES 1
  1.1 Memory 1
  1.2 Memory systems 5
    1.2.1 Cache designs 5
    1.2.2 Pipelines, instruction scheduling, and loop unrolling 8
  1.3 Multiple processors and processes 15
  1.4 Networks 15

2 APPLICATIONS 18
  2.1 Linear algebra 18
  2.2 LAPACK and the BLAS 21
    2.2.1 Typical performance numbers for the BLAS 22
    2.2.2 Solving systems of equations with LAPACK 23
  2.3 Linear algebra: sparse matrices, iterative methods 28
    2.3.1 Stationary iterations 29
    2.3.2 Jacobi iteration 30
    2.3.3 Gauss–Seidel (GS) iteration 31
    2.3.4 Successive and symmetric successive overrelaxation 31
    2.3.5 Krylov subspace methods 34
    2.3.6 The generalized minimal residual method (GMRES) 34
    2.3.7 The conjugate gradient (CG) method 36
    2.3.8 Parallelization 39
    2.3.9 The sparse matrix vector product 39
    2.3.10 Preconditioning and parallel preconditioning 42
  2.4 Fast Fourier Transform (FFT) 49
    2.4.1 Symmetries 55
  2.5 Monte Carlo (MC) methods 57
    2.5.1 Random numbers and independent streams 58
    2.5.2 Uniform distributions 60
    2.5.3 Non-uniform distributions 64

3 SIMD, SINGLE INSTRUCTION MULTIPLE DATA 85
  3.1 Introduction 85
  3.2 Data dependencies and loop unrolling 86
    3.2.1 Pipelining and segmentation 89
    3.2.2 More about dependencies, scatter/gather operations 91
    3.2.3 Cray SV-1 hardware 92
    3.2.4 Long memory latencies and short vector lengths 96
    3.2.5 Pentium 4 and Motorola G-4 architectures 97
    3.2.6 Pentium 4 architecture 97
    3.2.7 Motorola G-4 architecture 101
    3.2.8 Branching and conditional execution 102
  3.3 Reduction operations, searching 105
  3.4 Some basic linear algebra examples 106
    3.4.1 Matrix multiply 106
    3.4.2 SGEFA: The Linpack benchmark 107
  3.5 Recurrence formulae, polynomial evaluation 110
    3.5.1 Polynomial evaluation 110
    3.5.2 A single tridiagonal system 112
    3.5.3 Solving tridiagonal systems by cyclic reduction 114
    3.5.4 Another example of non-unit strides to achieve parallelism 117
    3.5.5 Some examples from Intel SSE and Motorola Altivec 122
    3.5.6 SDOT on G-4 123
    3.5.7 ISAMAX on Intel using SSE 124
  3.6 FFT on SSE and Altivec 126

4 SHARED MEMORY PARALLELISM 136
  4.1 Introduction 136
  4.2 HP9000 Superdome machine 136
  4.3 Cray X1 machine 137
  4.4 NEC SX-6 machine 139
  4.5 OpenMP standard 140
  4.6 Shared memory versions of the BLAS and LAPACK 141
  4.7 Basic operations with vectors 142
    4.7.1 Basic vector operations with OpenMP 143
  4.8 OpenMP matrix vector multiplication 146
    4.8.1 The matrix–vector multiplication with OpenMP 147
    4.8.2 Shared memory version of SGEFA 149
    4.8.3 Shared memory version of FFT 151
  4.9 Overview of OpenMP commands 152
  4.10 Using Libraries 153

5 MIMD, MULTIPLE INSTRUCTION, MULTIPLE DATA 156
  5.1 MPI commands and examples 158
  5.2 Matrix and vector operations with PBLAS and BLACS 161
  5.3 Distribution of vectors 165
    5.3.1 Cyclic vector distribution 165
    5.3.2 Block distribution of vectors 168
    5.3.3 Block–cyclic distribution of vectors 169
  5.4 Distribution of matrices 170
    5.4.1 Two-dimensional block–cyclic matrix distribution 170
  5.5 Basic operations with vectors 171
  5.6 Matrix–vector multiply revisited 172
    5.6.1 Matrix–vector multiplication with MPI 172
    5.6.2 Matrix–vector multiply with PBLAS 173
  5.7 ScaLAPACK 177
  5.8 MPI two-dimensional FFT example 180
  5.9 MPI three-dimensional FFT example 184
  5.10 MPI Monte Carlo (MC) integration example 187
  5.11 PETSc 190
    5.11.1 Matrices and vectors 191
    5.11.2 Krylov subspace methods and preconditioners 193
  5.12 Some numerical experiments with a PETSc code 194

APPENDIX A  SSE INTRINSICS FOR FLOATING POINT 201
  A.1 Conventions and notation 201
  A.2 Boolean and logical intrinsics 201
  A.3 Load/store operation intrinsics 202
  A.4 Vector comparisons 205
  A.5 Low order scalar in vector comparisons 206
  A.6 Integer valued low order scalar in vector comparisons 206
  A.7 Integer/floating point vector conversions 206
  A.8 Arithmetic function intrinsics 207

APPENDIX B  ALTIVEC INTRINSICS FOR FLOATING POINT 211
  B.1 Mask generating vector comparisons 211
  B.2 Conversion, utility, and approximation functions 212
  B.3 Vector logical operations and permutations 213
  B.4 Load and store operations 214
  B.5 Full precision arithmetic functions on vector operands 215
  B.6 Collective comparisons 216

APPENDIX C  OPENMP COMMANDS 218

APPENDIX D  SUMMARY OF MPI COMMANDS 220
  D.1 Point to point commands 220
  D.2 Collective communications 226
  D.3 Timers, initialization, and miscellaneous 234

APPENDIX E  FORTRAN AND C COMMUNICATION 235

APPENDIX F  GLOSSARY OF TERMS 240

APPENDIX G  NOTATIONS AND SYMBOLS 245

References 246
Index 255
