OXFORD TEXTS IN APPLIED AND ENGINEERING MATHEMATICS
* G. D. Smith: Numerical Solution of Partial Differential Equations, 3rd Edition
* R. Hill: A First Course in Coding Theory
* I. Anderson: A First Course in Combinatorial Mathematics, 2nd Edition
* D. J. Acheson: Elementary Fluid Dynamics
* S. Barnett: Matrices: Methods and Applications
* L. M. Hocking: Optimal Control: An Introduction to the Theory with Applications
* D. C. Ince: An Introduction to Discrete Mathematics, Formal System Specification, and Z, 2nd Edition
* O. Pretzel: Error-Correcting Codes and Finite Fields
* P. Grindrod: The Theory and Applications of Reaction–Diffusion Equations: Patterns and Waves, 2nd Edition
1. Alwyn Scott: Nonlinear Science: Emergence and Dynamics of Coherent Structures
2. D. W. Jordan and P. Smith: Nonlinear Ordinary Differential Equations: An Introduction to Dynamical Systems, 3rd Edition
3. I. J. Sobey: Introduction to Interactive Boundary Layer Theory
4. A. B. Tayler: Mathematical Models in Applied Mechanics (reissue)
5. L. Ramdas Ram-Mohan: Finite Element and Boundary Element Applications in Quantum Mechanics
6. Lapeyre et al.: Monte Carlo Methods for Transport and Diffusion Equations
7. I. Elishakoff and Y. Ren: Finite Element Methods for Structures with Large Stochastic Variations
8. Alwyn Scott: Nonlinear Science: Emergence and Dynamics of Coherent Structures, 2nd Edition
9. W. P. Petersen and P. Arbenz: Introduction to Parallel Computing
Titles marked with an asterisk (*) appeared in the Oxford Applied Mathematics
and Computing Science Series, which has been folded into, and is continued by,
the current series.
Introduction to Parallel
Computing
W. P. Petersen
Seminar for Applied Mathematics
Department of Mathematics, ETHZ, Zurich
wpp@math.ethz.ch
P. Arbenz
Institute for Scientific Computing
Department Informatik, ETHZ, Zurich
arbenz@inf.ethz.ch
Great Clarendon Street, Oxford OX2 6DP
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide in
Oxford  New York
Auckland  Cape Town  Dar es Salaam  Hong Kong  Karachi
Kuala Lumpur  Madrid  Melbourne  Mexico City  Nairobi
New Delhi  Shanghai  Taipei  Toronto
With offices in
Argentina  Austria  Brazil  Chile  Czech Republic  France  Greece
Guatemala  Hungary  Italy  Japan  Poland  Portugal  Singapore
South Korea  Switzerland  Thailand  Turkey  Ukraine  Vietnam
Oxford is a registered trademark of Oxford University Press
in the UK and in certain other countries
Published in the United States
by Oxford University Press Inc., New York
© Oxford University Press 2004
The moral rights of the author have been asserted
Database right Oxford University Press (maker)
First published 2004
All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
without the prior permission in writing of Oxford University Press,
or as expressly permitted by law, or under terms agreed with the appropriate
reprographics rights organization. Enquiries concerning reproduction
outside the scope of the above should be sent to the Rights Department,
Oxford University Press, at the address above
You must not circulate this book in any other binding or cover
and you must impose the same condition on any acquirer
A catalogue record for this title is available from the British Library
Library of Congress Cataloging in Publication Data
(Data available)
Typeset by Newgen Imaging Systems (P) Ltd., Chennai, India
Printed in Great Britain
on acid-free paper by
Biddles Ltd., King’s Lynn, Norfolk
ISBN 0 19 851576 6 (hbk)
ISBN 0 19 851577 4 (pbk)
10 9 8 7 6 5 4 3 2 1
PREFACE
The contents of this book are a distillation of many projects which have
subsequently become the material for a course on parallel computing given for
several years at the Swiss Federal Institute of Technology in Zürich. Students
in this course have typically been in their third or fourth year, or graduate
students, and have come from computer science, physics, mathematics, chemistry,
and programs for computational science and engineering. Student contributions,
whether large or small, critical or encouraging, have helped crystallize our
thinking in a quickly changing area. It is, alas, a subject which overlaps with
all scientific and engineering disciplines. Hence, the problem is not a paucity
of material but rather the distillation of an overflowing cornucopia. One of the
students’ most often voiced complaints has been about organization and
information overload. It is thus the point of this book to attempt some
organization within a quickly changing interdisciplinary topic. In all cases,
we will focus our energies on floating point calculations for science and
engineering applications.
Our own thinking has evolved as well: a quarter of a century of experience in
supercomputing has been sobering. One source of amusement as well as amazement
to us has been that the power of 1980s supercomputers has been brought in
abundance to PCs and Macs. Who would have guessed that vector processing
computers can now be easily hauled about in students’ backpacks? Furthermore,
the early 1990s dismissive sobriquets about dinosaurs led us to chuckle that
the most elegant of creatures, birds, are those ancients’ successors. Likewise,
the early 1990s contemptuous dismissals of magnetic storage media must now be
weighed against the fact that 2 GB disk drives are now 1 in. in diameter and
mounted in PC-cards. Thus, we have to proceed with what exists now and hope
that these ideas will have some relevance tomorrow.
For the three years up to the end of 2004, the top of the famous Top 500 list
of supercomputers [143] was the Yokohama Earth Simulator. Currently, the top
three entries in the list rely on large numbers of commodity processors:
65536 IBM PowerPC 440 processors at Livermore National Laboratory; 40960
IBM PowerPC processors at the IBM Research Laboratory in Yorktown Heights;
and 10160 Intel Itanium II processors connected by an Infiniband Network [75]
and constructed by Silicon Graphics, Inc. at the NASA Ames Research Center.
The Earth Simulator is now number four and has 5120 SX-6 vector processors
from NEC Corporation. Here are some basic facts to consider for a truly high
performance cluster (a short back-of-envelope sketch follows this list):
1. Modern computer architectures run internal clocks with cycles less than a
nanosecond. This defines the time scale of floating point calculations.
2. For a processor to get a datum that is within its node, which sees a coherent
memory image, but in a different processor’s memory typically requires a
delay of order 1 µs. Note that this is 1000 or more clock cycles.
3. For a node to get a datum which is on a different node by using message
passing takes 100 µs or more.
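To put these numbers side by side, the following short C sketch simply does the
arithmetic implied by the list above; the three time scales are the rough round
figures quoted there, taken as assumptions rather than measurements of any
particular machine.

/* Back-of-envelope latency arithmetic for the three time scales above.
   The constants are the rough figures quoted in the list, not measurements. */
#include <stdio.h>

int main(void)
{
    double cycle_s   = 1.0e-9;  /* clock cycle: about 1 ns                     */
    double remote_s  = 1.0e-6;  /* datum in another processor's memory,
                                   same node: about 1 microsecond              */
    double message_s = 1.0e-4;  /* datum on another node via message passing:
                                   about 100 microseconds                      */

    printf("intra-node remote access ~ %.0f clock cycles\n", remote_s / cycle_s);
    printf("inter-node message       ~ %.0f clock cycles\n", message_s / cycle_s);
    printf("message / remote ratio   ~ %.0f\n", message_s / remote_s);
    return 0;
}

The printed ratios, roughly 1000 cycles per intra-node remote access and
another factor of 100 for a message, are what drive the observations below.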
Thus we have the following not particularly profound observations: if the data
are local to a processor, they may be used very quickly; if the data are on a
tightly coupled node of processors, there should be roughly a thousand or more
data items to amortize the delay of fetching them from other processors’
memories; and finally, if the data must be fetched from other nodes, there
should be 100 times more than that if we expect to write off the delay in
getting them. So it is that NEC and Cray have moved toward strong nodes, with
even stronger processors on these nodes. They have to expect that programs
will have blocked or segmented data structures. As we will clearly see, getting
data from memory to the CPU is the problem of high speed computing, not only
for NEC and Cray machines, but even more so for the modern machines with
hierarchical memory. It is almost as if floating point operations take
insignificant time, while data access is everything. This is hard to swallow:
the classical books go on in depth about how to minimize floating point
operations, but a floating point operation (flop) count is only an indirect
measure of an algorithm’s efficiency. A lower flop count only approximately
reflects that fewer data are accessed. Therefore, the best algorithms are those
which encourage data locality. One cannot expect a summation of elements in an
array to be efficient when each element is on a separate node.
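As a small illustration of the locality point, the following C sketch sums the
same array twice: once with unit stride and once by striding through memory.
The flop count is identical in both passes; only the access pattern differs.
The array size and stride are arbitrary choices, and the actual timings depend
entirely on the cache hierarchy of the machine at hand.

/* Same flop count, different memory access pattern: unit stride vs. strided.
   Array size and stride are arbitrary; timings are machine dependent. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (1 << 24)      /* about 16 million doubles, larger than cache */
#define STRIDE 4096           /* an arbitrary large stride                   */

static double seconds(void) { return (double) clock() / CLOCKS_PER_SEC; }

int main(void)
{
    double *a = malloc((size_t) N * sizeof(double));
    double  s1 = 0.0, s2 = 0.0, t;
    long    i, j;

    if (a == NULL) return 1;
    for (i = 0; i < N; i++) a[i] = 1.0;

    t = seconds();                               /* cache friendly pass */
    for (i = 0; i < N; i++) s1 += a[i];
    printf("unit stride: sum = %.0f, %.3f s\n", s1, seconds() - t);

    t = seconds();                               /* same work, poor locality */
    for (j = 0; j < STRIDE; j++)
        for (i = j; i < N; i += STRIDE) s2 += a[i];
    printf("stride %d : sum = %.0f, %.3f s\n", STRIDE, s2, seconds() - t);

    free(a);
    return 0;
}

On machines with hierarchical memory the second pass is typically several
times slower, even though the arithmetic performed is exactly the same.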
This is why we have organized the book in the following manner. Basically,
we start from the lowest level and work up.
1. Chapter 1 contains a discussion of memory and data dependencies. When
one result is written into a memory location subsequently used/modified
by an independent process, who updates what and when becomes a matter
of considerable importance.
2. Chapter 2 provides some theoretical background for the applications and
examples used in the remainder of the book.
3. Chapter 3 discusses instruction level parallelism, particularly vectoriza-
tion. Processor architecture is important here, so the discussion is often
close to the hardware. We take close looks at the Intel Pentium III,
Pentium 4, and Apple/Motorola G-4 chips.
4. Chapter 4 concerns shared memory parallelism. This mode assumes that
data are local to nodes or at least part of a coherent memory image shared
by processors. OpenMP will be the model for handling this paradigm.
5. Chapter 5 is at the next higher level and considers message passing. Our
model will be the message passing interface, MPI, and variants and tools
built on this system (a brief sketch of both models follows this list).
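As a foretaste of the two programming models named in items 4 and 5, the
fragment below computes a global sum by letting OpenMP threads reduce the
locally held data and then combining the per-process partial sums with
MPI_Allreduce. This is a sketch only: the per-process array length is an
arbitrary choice, error handling is omitted, and a hybrid build with an MPI
compiler wrapper and OpenMP enabled (e.g. mpicc -fopenmp) is assumed.

/* A global summation in both models: OpenMP threads reduce the local data,
   then MPI combines the per-process partial sums. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const long n = 1000000;            /* elements held by each process */
    double *a, local = 0.0, global = 0.0;
    long    i;
    int     rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    a = malloc((size_t) n * sizeof(double));
    for (i = 0; i < n; i++) a[i] = 1.0;

    /* shared memory: threads on one node sum the locally held elements */
#pragma omp parallel for reduction(+:local)
    for (i = 0; i < n; i++) local += a[i];

    /* message passing: combine the partial sums across all processes */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %.0f\n", global);

    free(a);
    MPI_Finalize();
    return 0;
}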
Finally, a very important decision was made to use explicit examples to show
how all these pieces work. We feel that one learns by examples and by
proceeding from the specific to the general. Our choices of examples are mostly
basic and familiar: linear algebra (direct solvers for dense matrices,
iterative solvers for large sparse matrices), Fast Fourier Transform, and
Monte Carlo simulations. We hope, however, that some less familiar topics we
have included will be edifying. For example, how does one do large problems,
or high dimensional ones? It is also not enough to show program snippets. How
does one compile these things? How does one specify how many processors are to
be used? Where are the libraries? Here, again, we rely on examples.
W. P. Petersen and P. Arbenz
Authors’ comments on the corrected second printing
We are grateful to many students and colleagues who have found errata in the
one and a half years since the first printing. In particular, we would like to thank
Christian Balderer, Sven Knudsen, and Abraham Nieva, who took the time to
carefully list errors they discovered. It is a difficult matter to keep up with
such a quickly changing area as high performance computing, both regarding
hardware developments and algorithms tuned to new machines. Thus we are
indeed thankful to our colleagues for their helpful comments and criticisms.
July 1, 2005.
ACKNOWLEDGMENTS
Our debt to our students, assistants, system administrators, and colleagues
is awesome. Former assistants have made significant contributions and include
Oscar Chinellato, Dr Roman Geus, and Dr Andrea Scascighini—particularly for
their contributions to the exercises. The help of our system gurus cannot be over-
stated. George Sigut (our Beowulf machine), Bruno Loepfe (our Cray cluster),
and Tonko Racic (our HP9000 cluster) have been cheerful, encouraging, and
at every turn extremely competent. Other contributors who have read parts of
an always changing manuscript and who tried to keep us on track have been
Prof. Michael Mascagni and Dr Michael Vollmer. Intel Corporation’s Dr Vollmer
did so much to provide technical material, examples, advice, as well as trying
hard to keep us out of trouble by reading portions of an evolving text, that
a “thank you” hardly seems enough. Other helpful contributors were Adrian
Burri, Mario Rütti, Dr Olivier Byrde of Cray Research and ETH, and Dr Bruce
Greer of Intel. Despite their valiant efforts, doubtless errors still remain for which
only the authors are to blame. We are also sincerely thankful for the support and
encouragement of Professors Walter Gander, Gaston Gonnet, Martin Gutknecht,
Rolf Jeltsch, and Christoph Schwab. Having colleagues like them helps make
many things worthwhile. Finally, we would like to thank Alison Jones, Kate
Pullen, Anita Petrie, and the staff of Oxford University Press for their patience
and hard work.
CONTENTS
List of Figures xv
List of Tables xvii
1 BASIC ISSUES 1
1.1 Memory 1
1.2 Memory systems 5
1.2.1 Cache designs 5
1.2.2 Pipelines, instruction scheduling, and loop unrolling 8
1.3 Multiple processors and processes 15
1.4 Networks 15
2 APPLICATIONS 18
2.1 Linear algebra 18
2.2 LAPACK and the BLAS 21
2.2.1 Typical performance numbers for the BLAS 22
2.2.2 Solving systems of equations with LAPACK 23
2.3 Linear algebra: sparse matrices, iterative methods 28
2.3.1 Stationary iterations 29
2.3.2 Jacobi iteration 30
2.3.3 Gauss–Seidel (GS) iteration 31
2.3.4 Successive and symmetric successive overrelaxation 31
2.3.5 Krylov subspace methods 34
2.3.6 The generalized minimal residual method (GMRES) 34
2.3.7 The conjugate gradient (CG) method 36
2.3.8 Parallelization 39
2.3.9 The sparse matrix vector product 39
2.3.10 Preconditioning and parallel preconditioning 42
2.4 Fast Fourier Transform (FFT) 49
2.4.1 Symmetries 55
2.5 Monte Carlo (MC) methods 57
2.5.1 Random numbers and independent streams 58
2.5.2 Uniform distributions 60
2.5.3 Non-uniform distributions 64