PERFORMANCE ANALYSIS FORMULAS

Amdahl's Law
Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup ψ achievable by a parallel computer with p processors performing the computation is

    ψ ≤ 1 / (f + (1 - f)/p)

Gustafson-Barsis's Law
Given a parallel program solving a problem of size n using p processors, let s denote the fraction of total execution time spent in serial code. The maximum speedup ψ achievable by this program is

    ψ ≤ p + (1 - p)s

Karp-Flatt Metric
Given a parallel computation exhibiting speedup ψ on p processors, where p > 1, the experimentally determined serial fraction e is defined to be

    e = (1/ψ - 1/p) / (1 - 1/p)

Isoefficiency Relation
Suppose a parallel system exhibits efficiency ε(n, p), where n denotes problem size and p denotes number of processors. Define C = ε(n, p)/(1 - ε(n, p)). Let T(n, 1) denote sequential execution time, and let T₀(n, p) denote parallel overhead (the total amount of time spent by all processors performing communications and redundant computations). In order to maintain the same level of efficiency as the number of processors increases, problem size must be increased so that the following inequality is satisfied:

    T(n, 1) ≥ C · T₀(n, p)

Parallel Programming in C with MPI and OpenMP

Michael J. Quinn
Oregon State University

McGraw-Hill Higher Education

Boston  Burr Ridge, IL  Dubuque, IA  Madison, WI  New York  San Francisco  St. Louis
Caracas  Kuala Lumpur  Lisbon  London  Madrid  Mexico City
Milan  Montreal  New Delhi  Santiago  Seoul  Singapore  Sydney  Taipei  Toronto

The McGraw-Hill Companies
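As a quick illustration of how the first three formulas are applied (this example is not from the book, and the values of f, s, and the measured speedup in main are made-up numbers), a short C program can tabulate the Amdahl and Gustafson-Barsis bounds and recover the Karp-Flatt serial fraction from an observed speedup:

/* Sketch: applying the performance analysis formulas above.
   The inputs in main() are hypothetical example values. */
#include <stdio.h>

/* Amdahl's Law: psi <= 1 / (f + (1 - f)/p) */
double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

/* Gustafson-Barsis's Law: psi <= p + (1 - p)*s */
double gustafson_speedup(int p, double s)
{
    return p + (1 - p) * s;
}

/* Karp-Flatt metric: e = (1/psi - 1/p) / (1 - 1/p) */
double karp_flatt(double psi, int p)
{
    return (1.0 / psi - 1.0 / p) / (1.0 - 1.0 / p);
}

int main(void)
{
    double f   = 0.05;   /* hypothetical inherently sequential fraction */
    double s   = 0.05;   /* hypothetical serial fraction of a parallel run */
    double psi = 6.1;    /* hypothetical measured speedup on 8 processors */
    int    p;

    for (p = 2; p <= 16; p *= 2)
        printf("p = %2d  Amdahl bound = %5.2f  Gustafson-Barsis bound = %5.2f\n",
               p, amdahl_speedup(f, p), gustafson_speedup(p, s));

    printf("Measured speedup %.1f on 8 processors implies e = %.3f\n",
           psi, karp_flatt(psi, 8));
    return 0;
}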
With gratitude for their love, support, and guidance,
I dedicate this book to my parents,
Edward and Georgia Quinn.
PARALLEL PROGRAMMING IN C WITH MPI AND OPENMP
International Edition 2003

Exclusive rights by McGraw-Hill Education (Asia), for manufacture and export. This
book cannot be re-exported from the country to which it is sold by McGraw-Hill. The
International Edition is not available in North America.

Published by McGraw-Hill, a business unit of The McGraw-Hill Companies, Inc., 1221
Avenue of the Americas, New York, NY 10020. Copyright © 2004 by The McGraw-Hill
Companies, Inc. All rights reserved. No part of this publication may be reproduced or
distributed in any form or by any means, or stored in a database or retrieval system,
without the prior written consent of The McGraw-Hill Companies, Inc., including, but
not limited to, in any network or other electronic storage or transmission, or broadcast for
distance learning.

Some ancillaries, including electronic and print components, may not be available to
customers outside the United States.

10 09 08 07 06
20 09 08 07
CTF SLP

Library of Congress Cataloging-in-Publication Data

Quinn, Michael J. (Michael Jay)
  Parallel programming in C with MPI and OpenMP / Michael J. Quinn. - 1st ed.
    p. cm.
  ISBN 0-07-282256-2
  1. C (Computer program language). 2. Parallel programming (Computer science). I. Title.
  2004
  005.13'3-dc21    2003046371
                   CIP

When ordering this title, use ISBN 0-07-123265-6

Printed in Singapore

www.mhhe.com
BRIEF TABLE OF CONTENTS

Preface xiv

1 Motivation and History 1
2 Parallel Architectures 27
3 Parallel Algorithm Design 63
4 Message-Passing Programming 93
5 The Sieve of Eratosthenes 115
6 Floyd's Algorithm 137
7 Performance Analysis 159
8 Matrix-Vector Multiplication 178
9 Document Classification 216
10 Monte Carlo Methods 239
11 Matrix Multiplication 273
12 Solving Linear Systems 290
13 Finite Difference Methods 318
14 Sorting 338
15 The Fast Fourier Transform 353
16 Combinatorial Search 369
17 Shared-Memory Programming 404
18 Combining MPI and OpenMP 436

Appendix A MPI Functions 450
Appendix B Utility Functions 485
Appendix C Debugging MPI Programs 505
Appendix D Review of Complex Numbers 509
Appendix E OpenMP Functions 513

Bibliography 515
Author Index 520
Subject Index 522

CONTENTS

Preface xiv

CHAPTER 1
Motivation and History 1
1.1 Introduction 1
1.2 Modern Scientific Method 3
1.3 Evolution of Supercomputing 4
1.4 Modern Parallel Computers 5
1.4.1 The Cosmic Cube 6
1.4.2 Commercial Parallel Computers 6
1.4.3 Beowulf 7
1.4.4 Advanced Strategic Computing Initiative 8
1.5 Seeking Concurrency 9
1.5.1 Data Dependence Graphs 9
1.5.2 Data Parallelism 10
1.5.3 Functional Parallelism 10
1.5.4 Pipelining 12
1.5.5 Size Considerations 13
1.6 Data Clustering 14
1.7 Programming Parallel Computers 17
1.7.1 Extend a Compiler 17
1.7.2 Extend a Sequential Programming Language 18
1.7.3 Add a Parallel Programming Layer 19
1.7.4 Create a Parallel Language 19
1.7.5 Current Status 21
1.8 Summary 21
1.9 Key Terms 22
1.10 Bibliographic Notes 22
1.11 Exercises 23

CHAPTER 2
Parallel Architectures 27
2.1 Introduction 27
2.2 Interconnection Networks 28
2.2.1 Shared versus Switched Media 28
2.2.2 Switch Network Topologies 29
2.2.3 2-D Mesh Network 29
2.2.4 Binary Tree Network 30
2.2.5 Hypertree Network 31
2.2.6 Butterfly Network 32
2.2.7 Hypercube Network
2.2.8 Shuffle-Exchange Network 35
2.2.9 Summary 36
2.3 Processor Arrays 37
2.3.1 Architecture and Data-Parallel Operations 37
2.3.2 Processor Array Performance 39
2.3.3 Processor Interconnection Network 40
2.3.4 Enabling and Disabling Processors 40
2.3.5 Additional Architectural Features 42
2.3.6 Shortcomings of Processor Arrays 42
2.4 Multiprocessors 43
2.4.1 Centralized Multiprocessors 43
2.4.2 Distributed Multiprocessors 45
2.5 Multicomputers 49
2.5.1 Asymmetrical Multicomputers 49
2.5.2 Symmetrical Multicomputers 51
2.5.3 Which Model Is Best for a Commodity Cluster? 52
2.5.4 Differences between Clusters and Networks of Workstations 53
2.6 Flynn's Taxonomy 54
2.6.1 SISD 54
2.6.2 SIMD 55
2.6.3 MISD 55
2.6.4 MIMD 56
2.7 Summary 58
2.8 Key Terms 59
2.9 Bibliographic Notes 59
2.10 Exercises 60

CHAPTER 3
Parallel Algorithm Design 63
3.1 Introduction 63
3.2 The Task/Channel Model 63
3.3 Foster's Design Methodology 64
3.3.1 Partitioning 65
3.3.2 Communication 67
3.3.3 Agglomeration 68
3.3.4 Mapping 70
3.4 Boundary Value Problem 73
3.4.1 Introduction 73
3.4.2 Partitioning 75
3.4.3 Communication 75
3.4.4 Agglomeration and Mapping 76
3.4.5 Analysis 76
3.5 Finding the Maximum 77
3.5.1 Introduction 77
3.5.2 Partitioning 77
3.5.3 Communication 77
3.5.4 Agglomeration and Mapping 81
3.5.5 Analysis 82
3.6 The n-Body Problem 82
3.6.1 Introduction 82
3.6.2 Partitioning 83
3.6.3 Communication 83
3.6.4 Agglomeration and Mapping 85
3.6.5 Analysis 85
3.7 Adding Data Input 86
3.7.1 Introduction 86
3.7.2 Communication 87
3.7.3 Analysis 88
3.8 Summary 89
3.9 Key Terms 90
3.10 Bibliographic Notes 90
3.11 Exercises 90

CHAPTER 4
Message-Passing Programming 93
4.1 Introduction 93
4.2 The Message-Passing Model 94
4.3 The Message-Passing Interface 95
4.4 Circuit Satisfiability 96
4.4.1 Function MPI_Init 99
4.4.2 Functions MPI_Comm_rank and MPI_Comm_size 99
4.4.3 Function MPI_Finalize 101
4.4.4 Compiling MPI Programs 102
4.4.5 Running MPI Programs 102
4.5 Introducing Collective Communication 104
4.5.1 Function MPI_Reduce 105
4.6 Benchmarking Parallel Performance 108
4.6.1 Functions MPI_Wtime and MPI_Wtick 108
4.6.2 Function MPI_Barrier 108
4.7 Summary 110
4.8 Key Terms 110
4.9 Bibliographic Notes 110
4.10 Exercises

CHAPTER 5
The Sieve of Eratosthenes 115
5.1 Introduction 115
5.2 Sequential Algorithm 115
5.3 Sources of Parallelism 117
5.4 Data Decomposition Options 117
5.4.1 Interleaved Data Decomposition 118
5.4.2 Block Data Decomposition 118
5.4.3 Block Decomposition Macros 120
5.4.4 Local Index versus Global Index 120
5.4.5 Ramifications of Block Decomposition 121
5.5 Developing the Parallel Algorithm 121
5.5.1 Function MPI_Bcast 122
5.6 Analysis of Parallel Sieve Algorithm 122
5.7 Documenting the Parallel Program 123
5.8 Benchmarking 128
5.9 Improvements 129
5.9.1 Delete Even Integers 129
5.9.2 Eliminate Broadcast 130
5.9.3 Reorganize Loops 131
5.9.4 Benchmarking 131
5.10 Summary 133
5.11 Key Terms 134
5.12 Bibliographic Notes 134
5.13 Exercises 134

CHAPTER 6
Floyd's Algorithm 137
6.1 Introduction 137
6.2 The All-Pairs Shortest-Path Problem 137
6.3 Creating Arrays at Run Time 139
6.4 Designing the Parallel Algorithm 140
6.4.1 Partitioning 140
6.4.2 Communication 141
6.4.3 Agglomeration and Mapping 142
6.4.4 Matrix Input/Output 143
6.5 Point-to-Point Communication 145
6.5.1 Function MPI_Send 146
6.5.2 Function MPI_Recv 147
6.5.3 Deadlock 148
6.6 Documenting the Parallel Program
6.7 Analysis and Benchmarking 151
6.8 Summary 154
6.9 Key Terms 154
6.10 Bibliographic Notes 154
6.11 Exercises 154

CHAPTER 7
Performance Analysis 159
7.1 Introduction 159
7.2 Speedup and Efficiency 159
7.3 Amdahl's Law 161
7.3.1 Limitations of Amdahl's Law 164
7.3.2 The Amdahl Effect 164
7.4 Gustafson-Barsis's Law 164
7.5 The Karp-Flatt Metric 167
7.6 The Isoefficiency Metric 170
7.7 Summary 174
7.8 Key Terms 175
7.9 Bibliographic Notes 175
7.10 Exercises 176

CHAPTER 8
Matrix-Vector Multiplication 178
8.1 Introduction 178
8.2 Sequential Algorithm 179
8.3 Data Decomposition Options 180
8.4 Rowwise Block-Striped Decomposition 181
8.4.1 Design and Analysis 181
8.4.2 Replicating a Block-Mapped Vector 183
8.4.3 Function MPI_Allgatherv 184
8.4.4 Replicated Vector Input/Output 186
8.4.5 Documenting the Parallel Program 187
8.4.6 Benchmarking 187
8.5 Columnwise Block-Striped Decomposition 189
8.5.1 Design and Analysis 189
8.5.2 Reading a Columnwise Block-Striped Matrix 191
8.5.3 Function MPI_Scatterv 191
8.5.4 Printing a Columnwise Block-Striped Matrix 193
8.5.5 Function MPI_Gatherv 193
8.5.6 Distributing Partial Results 195
8.5.7 Function MPI_Alltoallv 195
8.5.8 Documenting the Parallel Program 196
8.5.9 Benchmarking 198
8.6 Checkerboard Block Decomposition 199
8.6.1 Design and Analysis 199
8.6.2 Creating a Communicator 202
8.6.3 Function MPI_Dims_create 203
8.6.4 Function MPI_Cart_create 204
8.6.5 Reading a Checkerboard Matrix 205
8.6.6 Function MPI_Cart_rank 205
8.6.7 Function MPI_Cart_coords 207
8.6.8 Function MPI_Comm_split 207
8.6.9 Benchmarking 208
8.7 Summary 210
8.8 Key Terms 211
8.9 Bibliographic Notes 211
8.10 Exercises 212

CHAPTER 9
Document Classification 216
9.1 Introduction 216
9.2 Parallel Algorithm Design 217
9.2.1 Partitioning and Communication 217
9.2.2 Agglomeration and Mapping 217
9.2.3 Manager/Worker Paradigm 218
9.2.4 Manager Process 219
9.2.5 Function MPI_Abort 220
9.2.6 Worker Process 221
9.2.7 Creating a Workers-only Communicator 223
9.3 Nonblocking Communications 223
9.3.1 Manager's Communication 224
9.3.2 Function MPI_Irecv 224
9.3.3 Function MPI_Wait 225
9.3.4 Workers' Communications 225
9.3.5 Function MPI_Isend 225
9.3.6 Function MPI_Probe 225
9.3.7 Function MPI_Get_count 226
9.4 Documenting the Parallel Program 226
9.5 Enhancements 232
9.5.1 Assigning Groups of Documents 232
9.5.2 Pipelining 232
9.5.3 Function MPI_Testsome 234
9.6 Summary 235
9.7 Key Terms 236
9.8 Bibliographic Notes 236
9.9 Exercises 236

CHAPTER 10
Monte Carlo Methods 239
10.1 Introduction 239
10.1.1 Why Monte Carlo Works 240
10.1.2 Monte Carlo and Parallel Computing 243
10.2 Sequential Random Number Generators 243
10.2.1 Linear Congruential 244
10.2.2 Lagged Fibonacci 245
10.3 Parallel Random Number Generators 245
10.3.1 Manager-Worker Method 246
10.3.2 Leapfrog Method 246
10.3.3 Sequence Splitting 247
10.3.4 Parameterization 248
10.4 Other Random Number Distributions 248
10.4.1 Inverse Cumulative Distribution Function Transformation 249
10.4.2 Box-Muller Transformation 250
10.4.3 The Rejection Method 251
10.5 Case Studies 253
10.5.1 Neutron Transport 253
10.5.2 Temperature at a Point Inside a 2-D Plate 255
10.5.3 Two-Dimensional Ising Model 257
10.5.4 Room Assignment Problem 259
10.5.5 Parking Garage 262
10.5.6 Traffic Circle 264
10.6 Summary 268
10.7 Key Terms 269
10.8 Bibliographic Notes 269
10.9 Exercises 270

CHAPTER 11
Matrix Multiplication 273
11.1 Introduction 273
11.2 Sequential Matrix Multiplication 274
11.2.1 Iterative, Row-Oriented Algorithm 274
11.2.2 Recursive, Block-Oriented Algorithm 275
11.3 Rowwise Block-Striped Parallel Algorithm 277
11.3.1 Identifying Primitive Tasks 277
11.3.2 Agglomeration 278
11.3.3 Communication and Further Agglomeration 279
11.3.4 Analysis 279
11.4 Cannon's Algorithm 281
11.4.1 Agglomeration 281
11.4.2 Communication 283
11.4.3 Analysis 284
11.5 Summary 286
11.6 Key Terms 287
11.7 Bibliographic Notes 287
11.8 Exercises 287

CHAPTER 12
Solving Linear Systems 290
12.1 Introduction 290
12.2 Terminology 291
12.3 Back Substitution 292
12.3.1 Sequential Algorithm 292
12.3.2 Row-Oriented Parallel Algorithm 293
12.3.3 Column-Oriented Parallel Algorithm 295
12.3.4 Comparison 295
12.4 Gaussian Elimination 296
12.4.1 Sequential Algorithm 296
12.4.2 Parallel Algorithms 298
12.4.3 Row-Oriented Algorithm 299
12.4.4 Column-Oriented Algorithm 303
12.4.5 Comparison 303
12.4.6 Pipelined, Row-Oriented Algorithm 304
12.5 Iterative Methods 306
12.6 The Conjugate Gradient Method 309
12.6.1 Sequential Algorithm 309
12.7 Summary
12.8 Key Terms 314
12.9 Bibliographic Notes 314
12.10 Exercises 314

CHAPTER 13
Finite Difference Methods 318
13.1 Introduction 318
13.2 Partial Differential Equations 320
13.2.1 Categorizing PDEs 320
13.2.2 Difference Quotients 321
13.3 Vibrating String 322
13.3.1 Deriving Equations 322
13.3.2 Deriving the Sequential Program 323
13.3.3 Parallel Program Design 324
13.3.4 Isoefficiency Analysis 327
13.3.5 Replicating Computations 327
13.4 Steady-State Heat Distribution 329
13.4.1 Deriving Equations 329
13.4.2 Deriving the Sequential Program 330
13.4.3 Parallel Program Design 332
13.4.4 Isoefficiency Analysis 332
13.4.5 Implementation Details 334
13.5 Summary 334
13.6 Key Terms 335
13.7 Bibliographic Notes 335
13.8 Exercises 335

CHAPTER 14
Sorting 338
14.1 Introduction 338
14.2 Quicksort 339
14.3 A Parallel Quicksort Algorithm 340
14.4 Hyperquicksort
14.4.1 Algorithm Description 343
14.4.2 Isoefficiency Analysis 345
14.5 Parallel Sorting by Regular Sampling 346
14.5.1 Algorithm Description 346
14.5.2 Isoefficiency Analysis 347
14.6 Summary 349
14.7 Key Terms 349
14.8 Bibliographic Notes 350
14.9 Exercises 350

CHAPTER 15
The Fast Fourier Transform 353
15.1 Introduction 353
15.2 Fourier Analysis 353
15.3 The Discrete Fourier Transform 355
15.3.1 Inverse Discrete Fourier Transform 357
15.4 The Fast Fourier Transform 360
15.5 Parallel Program Design 363
15.5.1 Partitioning and Communication 363
15.5.2 Agglomeration and Mapping 365
15.5.3 Isoefficiency Analysis 365
15.6 Summary 367
15.7 Key Terms 367
15.8 Bibliographic Notes 367
15.9 Exercises 367

CHAPTER 16
Combinatorial Search 369
16.1 Introduction 369
16.2 Divide and Conquer 370
16.3 Backtrack Search 371
16.3.1 Example 371
16.3.2 Time and Space Complexity 374
16.4 Parallel Backtrack Search 374
16.5 Distributed Termination Detection 377
16.6 Branch and Bound 380
16.6.1 Example 380
16.6.2 Sequential Algorithm 382
16.6.3 Analysis 385
16.7 Parallel Branch and Bound 385
16.7.1 Storing and Sharing Unexamined Subproblems
16.7.2 Efficiency 387
16.7.3 Halting Conditions 387
16.8 Searching Game Trees
16.8.2 Alpha-Beta Pruning 392
16.8.3 Enhancements to Alpha-Beta Pruning 395
16.9 Parallel Alpha-Beta Search 395
16.9.1 Parallel Aspiration Search 396
16.9.2 Parallel Subtree Evaluation 396
16.9.3 Distributed Tree Search 397
16.10 Summary
16.11 Key Terms
16.12 Bibliographic Notes 400
16.13 Exercises 401

CHAPTER 17
Shared-Memory Programming 404
17.1 Introduction 404
17.2 The Shared-Memory Model 405
17.3 Parallel for Loops 407
17.3.1 parallel for Pragma 408
17.3.2 Function omp_get_num_procs 410
17.3.3 Function omp_set_num_threads 410
17.4 Declaring Private Variables 410
17.4.1 private Clause 411
17.4.2 firstprivate Clause 412
17.4.3 lastprivate Clause 412
17.5 Critical Sections 413
17.5.1 critical Pragma 415
17.6 Reductions 415
17.7 Performance Improvements 417
17.7.1 Inverting Loops 417
17.7.2 Conditionally Executing Loops 418
17.7.3 Scheduling Loops 419
17.8 More General Data Parallelism 421
17.8.1 parallel Pragma 422
17.8.2 Function omp_get_thread_num 423
17.8.3 Function omp_get_num_threads 425
17.8.4 for Pragma 425
17.8.5 single Pragma 427
17.8.6 nowait Clause 427
17.9 Functional Parallelism 428
17.9.1 parallel sections Pragma 429
17.9.2 section Pragma 429
17.9.3 sections Pragma 429
17.10 Summary 430
17.11 Key Terms
17.12 Bibliographic Notes 432
17.13 Exercises 433

CHAPTER 18
Combining MPI and OpenMP 436
18.1 Introduction 436
18.2 Conjugate Gradient Method 438
18.2.1 MPI Program 438
18.2.2 Functional Profiling 442
18.2.3 Parallelizing Function matrix_vector_product 442
18.2.4 Benchmarking 443
18.3 Jacobi Method 444
18.3.1 Profiling MPI Program 444
18.3.2 Parallelizing Function find_steady_state 444
18.3.3 Benchmarking 446
18.4 Summary 448
18.5 Exercises 448

APPENDIX A
MPI Functions 450

APPENDIX B
Utility Functions 485
B.1 Header File MyMPI.h 485
B.2 Source File MyMPI.c 486

APPENDIX C
Debugging MPI Programs 505
C.1 Introduction 505
C.2 Typical Bugs in MPI Programs 505
C.2.1 Bugs Resulting in Deadlock 505
C.2.2 Bugs 506
C.2.3 Communications 507
C.3 Practical Debugging 507

APPENDIX D
Review of Complex Numbers 509

APPENDIX E
OpenMP Functions 513

Bibliography 515
Author Index 520
Subject Index 522
PREFACE

This book is a practical introduction to parallel programming in C using the MPI (Message Passing Interface) library and the OpenMP application programming interface. It is targeted to upper-division undergraduate students, beginning graduate students, and computer professionals studying this material on their own. It assumes the reader has a good background in C programming and has had an introductory class in the analysis of algorithms.

Fortran programmers interested in parallel programming can also benefit from this text. While the examples in the book are in C, the underlying concepts of programming with MPI and OpenMP are essentially the same for both C and Fortran programmers.

In the past twenty years I have taught parallel programming to hundreds of undergraduate and graduate students. In the process I have learned a great deal about the sorts of problems people encounter when they begin "thinking in parallel" and writing parallel programs. Students benefit from seeing programs implemented step by step. My philosophy is to introduce new functions "just in time." As much as possible, every new concept appears in the context of a design, implementation, or analysis problem. When you see a key symbol in a page margin, you'll know I'm presenting a key concept.

The first two chapters explain when and why parallel computing began and give a high-level overview of parallel architectures. Chapter 3 presents Foster's design methodology and shows how it is used through several case studies. Chapters 6, 8, and 9 demonstrate how to use the design methodology to develop MPI programs that solve a series of progressively more difficult programming problems. The 27 MPI functions presented in these chapters are a robust enough subset to implement parallel programs for a wide variety of applications. These chapters also introduce functions that simplify matrix and vector I/O. The source code for this I/O library appears in Appendix B.

The programs of Chapters 4, 5, 6, and 8 have been benchmarked on a commodity cluster of microprocessors, and these results appear in the text. Because new generations of microprocessors appear much faster than books can be produced, readers will observe that the processors are several generations old. The point of presenting the results is not to amaze the reader with the speed of the computations. Rather, the purpose of the benchmarking is to demonstrate that knowledge of the latency and bandwidth of the interconnection network, combined with information about the performance of a sequential program, are often sufficient to allow reasonably accurate predictions of the performance of a parallel program.

Chapter 7 focuses on four metrics for analyzing and predicting the performance of parallel systems: Amdahl's Law, Gustafson-Barsis's Law, the Karp-Flatt metric, and the isoefficiency metric.

Chapters 10-16 provide additional examples of how to analyze a problem and design a good parallel algorithm to solve it. At this point the development of MPI programs implementing the parallel algorithms is left to the reader. I introduce Monte Carlo methods and the challenges associated with parallel random number generation. Later chapters present a variety of key algorithms: matrix multiplication, Gaussian elimination, the conjugate gradient method, finite difference methods, sorting, the fast Fourier transform, backtrack search, branch-and-bound search, and alpha-beta search.

Chapters 17 and 18 are an introduction to the new shared-memory programming standard OpenMP. I present the features of OpenMP as needed to convert sequential code segments into parallel ones. I use two case studies to demonstrate the process of transforming MPI programs into hybrid MPI/OpenMP programs that can exhibit higher performance on multiprocessor clusters than programs based solely on MPI.

This book has more than enough material for a one-semester course in parallel programming. While parallel programming is more demanding than typical programming, it is also more rewarding. Even with a teacher's instruction and encouragement, most students are unnerved at the prospect of harnessing multiple processors to perform a single task. However, this fear is transformed into a feeling of genuine accomplishment when they see their debugged programs run much faster than "ordinary" C programs. For this reason, programming assignments should play a central role in the course.

Fortunately, parallel computers are more accessible than ever. If a commercial parallel computer is not available, it is a straightforward task to build a small cluster out of a few PCs, networking equipment, and free software.

Figure P.1 illustrates the precedence relations among the chapters. A solid arrow from A to B indicates chapter B depends heavily upon material presented in chapter A. A dashed arrow from A to B indicates a weak dependence. If you cover the chapters in numerical order, you will satisfy all of these precedences. However, if you would like your students to start programming in C with MPI as quickly as possible, you may wish to skip Chapter 2 or only cover one or two sections of it. If you wish to focus on numerical algorithms, you may wish to skip Chapter 5 and introduce students to the function MPI_Bcast in another way. If you would like to start by having your students program Monte Carlo algorithms, you can jump to Chapter 10 immediately after Chapter 4. If you want to cover OpenMP before MPI, you can jump to Chapter 17 after Chapter 3.

I thank everyone at McGraw-Hill who helped me create this book, especially Betsy Jones, Michelle Flomenhoft, and Kay Brimeyer. Thank you for your sponsorship, encouragement, and assistance. I also appreciate the help provided by Maggie Murphy and the rest of the compositors at Interactive Composition Corporation.
Figure P.1 Dependences among the chapters. A solid arrow indicates a strong dependence; a dashed arrow indicates a weak dependence.

I am indebted to the reviewers who carefully read the manuscript, correcting errors, pointing out weak spots, and suggesting additional topics. My thanks to: A. P. W. Bohm, Colorado State University; Thomas Cormen, Dartmouth College; Narsingh Deo, University of Central Florida; Philip J. Hatcher, University of New Hampshire; Nickolas S. Jovanovic, University of Arkansas at Little Rock; Dinesh Mehta, Colorado School of Mines; Zina Ben Miled, Indiana University-Purdue University, Indianapolis; Paul E. Plassman, Pennsylvania State University; Quinn O. Snell, Brigham Young University; Ashok Srinivasan, Florida State University; Xian-He Sun, Illinois Institute of Technology; Virgil Wallentine, Kansas State University; Bob Weems, University of Texas at Arlington; Kay Zemoudel, California State University-San Bernardino; and Jun Zhang, University of Kentucky.

Several people at Oregon State University also lent me a hand. Rubin Landau and Henri Jansen helped me understand Monte Carlo algorithms and the detailed balance condition, respectively. Students Charles Sauerbier and Bernd Michael Kelm suggested questions that made their way into the text. Tim Budd showed me how to incorporate PostScript figures into LaTeX documents. Jalal Haddad provided technical support. Thank you for your help!

I am grateful to my wife, Victoria, for encouraging me to get back into textbook writing. Thanks for the inspiring Christmas present: Chicken Soup for the Writer's Soul: Stories to Open the Heart and Rekindle the Spirit of Writers.

Michael J. Quinn
Oregon

CHAPTER 1

Motivation and History

Well done is quickly done.
Caesar Augustus

1.1 INTRODUCTION

Are you one of those people for whom "fast" isn't fast enough? Today's workstations are about a hundred times faster than those made just a decade ago, but some computational scientists and engineers need even more speed. They make great simplifications to the problems they are solving and still must wait days or even weeks for their programs to finish running.

Faster computers let you tackle larger computations. Suppose you can afford to wait overnight for your program to produce a result. If your program ran 10 times faster, previously out-of-reach computations would now be within your grasp. You could produce in 15 hours an answer that previously required nearly a week to generate.

Of course, you could simply wait for CPUs to get faster. In about five years CPUs will be 10 times faster than they are today (a consequence of Moore's Law). On the other hand, if you can afford to wait five years, you're not in much of a hurry! Parallel computing is a proven way to get more speed now.

What's parallel computing?

Parallel computing is the use of a parallel computer to reduce the time needed to solve a single computational problem. Parallel computing is now considered a standard way for computational scientists and engineers to solve problems in areas as diverse as galactic evolution, climate modeling, aircraft design, and molecular dynamics.
What's a parallel computer?

A parallel computer is a multiple-processor computer system supporting parallel programming. Two important categories of parallel computers are multicomputers and centralized multiprocessors.

As its name implies, a multicomputer is a parallel computer constructed out of multiple computers and an interconnection network. The processors on different computers interact by passing messages to each other.

In contrast, a centralized multiprocessor (also called a symmetrical multiprocessor or SMP) is a more highly integrated system in which all CPUs share access to a global memory. This shared memory supports communication and synchronization among processors.

We'll study centralized multiprocessors, multicomputers, and other parallel computer architectures in Chapter 2.

What's parallel programming?

Parallel programming is programming in a language that allows you to explicitly indicate how different portions of the computation may be executed concurrently by different processors. We'll discuss various kinds of parallel programming languages in more detail near the end of this chapter.

Is parallel programming really necessary?

A lot of research has been invested in the development of compiler technology that would allow ordinary Fortran 77 or C programs to be translated into codes that would execute with good efficiency on parallel computers with large numbers of processors. This is a very difficult problem, and while many experimental parallelizing¹ compilers have been developed, at the present time commercial systems are still in their infancy. The alternative is for you to write your own parallel programs.

¹Parallelize, verb: to make parallel.

Why should I program using MPI and OpenMP?

MPI (Message Passing Interface) is a standard specification for message-passing libraries. Libraries meeting the standard are available on virtually every computer system. Free libraries are also available in case you want to run MPI on a network of workstations or a parallel computer built out of commodity components (PCs and switches). If you develop programs using MPI, you will be able to reuse them when you get access to a newer, faster parallel computer.

Increasingly, parallel computers are constructed out of symmetrical multiprocessors. Within each SMP, the CPUs have a shared address space. While MPI is a perfectly satisfactory way for processors in different SMPs to communicate with each other, OpenMP is a better way for processors within a single SMP to interact. In Chapter 18 you'll see an example of how a hybrid MPI/OpenMP program can outperform an MPI-only program on a practical application.

By working through this book, you'll learn a little bit about parallel computer hardware and a lot about parallel program development strategies. That includes parallel algorithm design and analysis, program implementation and debugging, and ways to benchmark and optimize your programs.

1.2 MODERN SCIENTIFIC METHOD

Classical science is based on observation, theory, and physical experimentation. Observation of a phenomenon leads to a hypothesis. The scientist develops a theory to explain the phenomenon and designs an experiment to test that theory. Sometimes the results of the experiment require the scientist to refine the theory, if not completely reject it. Here, observation may again take center stage.

Classical science is characterized by physical experiments and models. For example, many physics students have explored the relationship between mass, force, and acceleration using paper tape, pucks, and air tables. Physical experiments allow scientists to test theories, such as Newton's first law of motion, against observations.

In contrast, contemporary science is characterized by observation, theory, experimentation, and numerical simulation (Figure 1.1). Numerical simulation is an increasingly important tool for scientists, who often cannot use physical experiments to test theories because they may be too expensive or time-consuming, because they may be unethical, or because they may be impossible to perform.

Figure 1.1 The introduction of numerical simulation distinguishes the contemporary scientific method from the classical scientific method.
The modern scientist compares the behavior of a numerical simulation, which implements the theory, to data collected from nature. The differences cause the scientist to revise the theory and/or make more observations.

Many important scientific problems are so complex that solving them via numerical simulation requires extraordinarily powerful computers. These complex problems, often called grand challenges for science, fall into several categories [73]:

1. Quantum chemistry, statistical mechanics, and relativistic physics
2. Cosmology and astrophysics
3. Computational fluid dynamics and turbulence
4. Materials design and superconductivity
5. Biology, pharmacology, genome sequencing, genetic engineering, enzyme activity, and cell modeling
6. Medicine, and modeling of human organs and bones
7. Global weather and environmental modeling

While grand challenge problems emerged in the late 1980s as a stimulus for further developments in high-performance computing, you can view the entire history of electronic computing as the quest for higher performance.

1.3 EVOLUTION OF SUPERCOMPUTING

The United States government has played a key role in the development and use of high-performance computers. During World War II the U.S. Army funded the construction of the ENIAC in order to speed the calculation of artillery tables. In the 30 years after World War II, the U.S. government used high-performance computers to design nuclear weapons, break codes, and perform other national security-related applications.

Supercomputers are the most powerful computers that can be built. (As computer speeds increase, the bar for "supercomputer" status rises, too.) The term supercomputer first came into widespread use with the introduction of the Cray-1 supercomputer in 1976. The Cray-1 was a pipelined vector processor, not a multiple-processor computer, but it was capable of more than 100 million floating point operations per second.

Supercomputers have typically cost $10 million or more. The high cost of supercomputers once meant they were found almost exclusively in government research facilities, such as Los Alamos National Laboratory.

Over time, however, supercomputers began to appear outside of government facilities. In the late 1970s supercomputers showed up in capital-intensive industries. Petroleum companies harnessed supercomputers to help them look for oil, and automobile manufacturers started using these systems to improve the fuel efficiency and safety of their vehicles.

Ten years later, hundreds of corporations around the globe were using supercomputers to support their business enterprises. The reason is simple: for many businesses, quicker computations lead to a competitive advantage. More rapid crash simulations can reduce the time an automaker needs to design a new car. Faster drug design can increase the number of patents held by a pharmaceutical firm. High-speed computers have even been used to design products as mundane as disposable diapers!

Computing speeds have risen dramatically in the past 50 years. The ENIAC could perform about 350 multiplications per second. Today's supercomputers are more than a billion times faster, able to perform trillions of floating point operations per second.

Single processors are about a million times faster than they were 50 years ago. Most of the speed increase is due to higher clock rates that enable a single operation to be performed more quickly. The remaining speed increase is due to greater system concurrency: allowing the system to work simultaneously on multiple operations. The history of computing has been marked by rapid progress on both these fronts, as exemplified by contemporary high-performance microprocessors. Intel's Pentium 4 CPU, for example, has clock speeds well in excess of 1 GHz, two arithmetic-logic units (ALUs) clocked at twice the core processor clock speed, and extensive hardware support for out-of-order speculative execution of instructions.

How can today's supercomputers be a billion times faster than the ENIAC, if individual processors are only about a million times faster? The answer is that the remaining thousand-fold speed increase is achieved by collecting thousands of processors into an integrated system capable of solving individual problems faster than a single CPU; i.e., a parallel computer.

The meaning of the word supercomputer, then, has changed over time. In 1976 supercomputer meant a Cray-1, a single-CPU computer with a high-performance pipelined vector processor connected to a high-performance memory system. Today, supercomputer means a parallel computer with thousands of CPUs.

The invention of the microprocessor is a watershed event that led to the demise of traditional minicomputers and mainframes and spurred the development of low-cost parallel computers. Since the mid-1980s, microprocessor manufacturers have improved the performance of their top-end processors at an annual rate of 50 percent while keeping prices more or less constant [90]. The rapid increase in microprocessor speeds has completely changed the face of computing. Microprocessor-based servers now fill the role formerly played by minicomputers constructed out of gate arrays or off-the-shelf logic. Even mainframe computers are being constructed out of microprocessors.

1.4 MODERN PARALLEL COMPUTERS

Parallel computers only became attractive to a wide range of customers with the advent of Very Large Scale Integration (VLSI) in the late 1970s. Supercomputers such as the Cray-1 were far too expensive for most organizations. Experimental parallel computers were less expensive than supercomputers, but they were relatively costly. They were unreliable, to boot. VLSI technology allowed