Introduction to High Performance Scientific Computing (PDF, 464 pages, 2011, English)
Introduction to High Performance Scientific Computing
Evolving Copy - open for comments
Victor Eijkhout
with Edmond Chow, Robert van de Geijn
1st edition 2011, revision 481

Introduction to High-Performance Scientific Computing (c) Victor Eijkhout, distributed under a Creative Commons Attribution 3.0 Unported (CC BY 3.0) license and made possible by funding from The Saylor Foundation, http://www.saylor.org.

Preface

The field of high performance scientific computing lies at the crossroads of a number of disciplines and skill sets, and correspondingly, being successful at using high performance computing in science requires at least elementary knowledge of, and skills in, all these areas. Computations stem from an application context, so some acquaintance with physics and engineering sciences is desirable. Then, problems in these application areas are typically translated into linear algebraic, and sometimes combinatorial, problems, so a computational scientist needs knowledge of several aspects of numerical analysis, linear algebra, and discrete mathematics. An efficient implementation of the practical formulations of the application problems requires some understanding of computer architecture, both on the CPU level and on the level of parallel computing. Finally, in addition to mastering all these sciences, a computational scientist needs some specific skills of software management.

While good texts exist on numerical modeling, numerical linear algebra, computer architecture, parallel computing, and performance optimization, no book brings together these strands in a unified manner. The need for a book such as the present one became apparent to the author while working at a computing center: users are domain experts who do not necessarily have mastery of all the background that would make them efficient computational scientists. This book, then, teaches those topics that seem indispensable for scientists engaging in large-scale computations.
The contents of this book are a combination of theoretical material and self-guided tutorials on various practical skills. The theory chapters have exercises that can be assigned in a classroom; however, their placement in the text is such that a reader not inclined to do the exercises can simply take them as statements of fact. The tutorials should be done while sitting at a computer. Given the practice of scientific computing, they have a clear Unix bias.

Public draft This book is open for comments. What is missing or incomplete or unclear? Is material presented in the wrong sequence? Kindly mail me with any comments you may have.

You may have found this book in any of a number of places; the authoritative download location is http://www.tacc.utexas.edu/~eijkhout/istc/istc.html. It is also possible to get a nicely printed copy from lulu.com:
http://www.lulu.com/shop/victor-eijkhout/introduction-to-high-performance-scientific-computing/paperback/product-20452679.html

[email protected]
Research Scientist
Texas Advanced Computing Center
The University of Texas at Austin

Acknowledgement Helpful discussions with Kazushige Goto and John McCalpin are gratefully acknowledged. Thanks to Dan Stanzione for his notes on cloud computing, Ernie Chan for his notes on scheduling of block algorithms, and John McCalpin for his analysis of the top500. Thanks to Elie de Brauwer, Susan Lindsey, and Lorenzo Pesce for proofreading and many comments.

Introduction

Scientific computing is the cross-disciplinary field at the intersection of modeling scientific processes, and the use of computers to produce quantitative results from these models. It is what takes a domain science and turns it into a computational activity. As a definition, we may posit

    The efficient computation of constructive methods in applied mathematics.

This clearly indicates the three branches of science that scientific computing touches on:
• Applied mathematics: the mathematical modeling of real-world phenomena. Such modeling often leads to implicit descriptions, for instance in the form of partial differential equations.
In order to obtain actual tangible results we need a constructive approach.
• Numerical analysis provides algorithmic thinking about scientific models. It offers a constructive approach to solving the implicit models, with an analysis of cost and stability.
• Computing takes numerical algorithms and analyzes the efficacy of implementing them on actually existing, rather than hypothetical, computing engines.

One might say that 'computing' became a scientific field in its own right when the mathematics of real-world phenomena was asked to be constructive, that is, to go from proving the existence of solutions to actually obtaining them. At this point, algorithms become an object of study themselves, rather than a mere tool.

The study of algorithms became especially important when computers were invented. Since mathematical operations now were endowed with a definable time cost, the complexity of algorithms became a field of study; since computing was no longer performed in 'real' numbers but in representations in finite bitstrings, the accuracy of algorithms needed to be studied. Some of these considerations in fact predate the existence of computers, having been inspired by computing with mechanical calculators.

A prime concern in scientific computing is efficiency. While to some scientists the abstract fact of the existence of a solution is enough, in computing we actually want that solution, and preferably yesterday. For this reason, in this book we will be quite specific about the efficiency of both algorithms and hardware. It is important not to limit the concept of efficiency to that of efficient use of hardware. While this is important, the difference between two algorithmic approaches can make optimization for specific hardware a secondary concern.

This book aims to cover the basics of this gamut of knowledge that a successful computational scientist needs to master. It is set up as a textbook for graduate students or advanced undergraduate students; others can use it as a reference text, reading the exercises for their information content.
Contents

1 Sequential Computing
  1.1 The Von Neumann architecture
  1.2 Modern floating point units
  1.3 Memory Hierarchies
  1.4 Multicore architectures
  1.5 Locality and data reuse
  1.6 Programming strategies for high performance
  1.7 Power consumption
2 Parallel Computing
  2.1 Introduction
  2.2 Parallel Computers Architectures
  2.3 Different types of memory access
  2.4 Granularity of parallelism
  2.5 Parallel programming
  2.6 Topologies
  2.7 Efficiency of parallel computing
  2.8 Multi-threaded architectures
  2.9 GPU computing
  2.10 Remaining topics
3 Computer Arithmetic
  3.1 Integers
  3.2 Real numbers
  3.3 Round-off error analysis
  3.4 More about floating point arithmetic
  3.5 Conclusions
4 Numerical treatment of differential equations
  4.1 Initial value problems
  4.2 Boundary value problems
  4.3 Initial boundary value problem
5 Numerical linear algebra
  5.1 Elimination of unknowns
  5.2 Linear algebra in computer arithmetic
  5.3 LU factorization
  5.4 Sparse matrices
  5.5 Iterative methods
  5.6 Further Reading
6 High performance linear algebra
  6.1 The sparse matrix-vector product
  6.2 Parallel dense matrix-vector product
  6.3 Scalability of LU factorization
  6.4 Parallel sparse matrix-vector product
  6.5 Computational aspects of iterative methods
  6.6 Parallel preconditioners
  6.7 Ordering strategies and parallelism
  6.8 Operator splitting
  6.9 Parallelism and implicit operations
  6.10 Block algorithms on multicore architectures
7 Molecular dynamics
  7.1 Force Computation
  7.2 Parallel Decompositions
  7.3 Parallel Fast Fourier Transform
  7.4 Integration for Molecular Dynamics
8 Combinatorics
  8.1 Sorting
  8.2 Graph problems
9 Discrete systems
  9.1 The Barnes-Hut algorithm
  9.2 The Fast Multipole Method
  9.3 Implementation
10 Computational biology
11 Monte Carlo Methods
  11.1 Parallel Random Number Generation
  11.2 Examples
Appendices
A Theoretical background
  A.1 Linear algebra
  A.2 Complexity
  A.3 Partial Differential Equations
  A.4 Taylor series
  A.5 Graph theory
  A.6 Fourier Transforms
  A.7 Automata theory
B Practical tutorials
  B.1 Unix intro
  B.2 Compilers and libraries
  B.3 Managing projects with Make
  B.4 Source code control
  B.5 Scientific Data Storage
  B.6 Scientific Libraries
  B.7 Plotting with GNUplot
  B.8 Good coding practices
  B.9 Debugging
  B.10 Performance measurement
  B.11 C/Fortran interoperability
  B.12 LaTeX for scientific documentation
C Class projects
  C.1 Cache simulation and analysis
  C.2 Evaluation of Bulk Synchronous Programming
  C.3 Heat equation
  C.4 The memory wall
D Codes
  D.1 Hardware event counting
  D.2 Test setup
  D.3 Cache size
  D.4 Cache lines
  D.5 Cache associativity
  D.6 TLB
  D.7 Unrepresentable numbers
E Index and list of acronyms

Chapter 1 Sequential Computing

In order to write efficient scientific codes, it is important to understand computer architecture. The difference in speed between two codes that compute the same result can range from a few percent to orders of magnitude, depending only on factors relating to how well the algorithms are coded for the processor architecture. Clearly, it is not enough to have an algorithm and 'put it on the computer': some knowledge of computer architecture is advisable, sometimes crucial.

Some problems can be solved on a single CPU; others need a parallel computer that comprises more than one processor. We will go into detail on parallel computers in the next chapter, but even for parallel processing, it is necessary to understand the individual CPUs.

In this chapter, we will focus on what goes on inside a CPU and its memory system.
We start with a brief general discussion of how instructions are handled, then we will look into the arithmetic processing in the processor core; last but not least, we will devote much attention to the movement of data between memory and the processor, and inside the processor. This latter point is, maybe unexpectedly, very important, since memory access is typically much slower than executing the processor's instructions, making it the determining factor in a program's performance; the days when 'flop [1] counting' was the key to predicting a code's performance are long gone. This discrepancy is in fact a growing trend, so the issue of dealing with memory traffic has been becoming more important over time, rather than going away.

This chapter will give you a basic understanding of the issues involved in CPU design, how it affects performance, and how you can code for optimal performance. For much more detail, see an online book about PC architecture [87], and the standard work about computer architecture, Hennessy and Patterson [77].

[1] Floating Point Operation.

1.1 The Von Neumann architecture

While computers, and most relevantly for this chapter, their processors, can differ in any number of details, they also have many aspects in common. On a very high level of abstraction, many architectures can be described as von Neumann architectures. This describes a design with an undivided memory that stores both program and data ('stored program'), and a processing unit that executes the instructions, operating on the data in a 'fetch, execute, store' cycle [2].

[2] This model with a prescribed sequence of instructions is also referred to as control flow. This is in contrast to dataflow, which we will see in section 6.10.

This setup distinguishes modern processors from the very earliest, and some special purpose contemporary, designs where the program was hard-wired. It also allows programs to modify themselves or generate other programs, since instructions and data are in the same storage. This allows us to have editors and compilers: the computer treats program code as data to operate on.
In this book we will not explicitly discuss compilers, the programs that translate high level languages to machine instructions. However, on occasion we will discuss how a program at high level can be written to ensure efficiency at the low level.

In scientific computing, however, we typically do not pay much attention to program code, focusing almost exclusively on data and how it is moved about during program execution. For most practical purposes it is as if program and data are stored separately. The little that is essential about instruction handling can be described as follows.

The machine instructions that a processor executes, as opposed to the higher level languages users write in, typically specify the name of an operation, as well as the locations of the operands and the result. These locations are not expressed as memory locations, but as registers: a small number of named memory locations that are part of the CPU [3]. As an example, here is a simple C routine

    void store(double *a, double *b, double *c) {
      *c = *a + *b;
    }

and its X86 assembler output, obtained by [4] gcc -O2 -S -o - store.c:

        .text
        .p2align 4,,15
    .globl store
        .type   store, @function
    store:
        movsd   (%rdi), %xmm0   # Load *a to %xmm0
        addsd   (%rsi), %xmm0   # Load *b and add to %xmm0
        movsd   %xmm0, (%rdx)   # Store to *c
        ret

The instructions here are:
• A load from memory to register;
• Another load, combined with an addition;
• Writing back the result to memory.

Each instruction is processed as follows:
• Instruction fetch: the next instruction according to the program counter is loaded into the processor. We will ignore the questions of how and from where this happens.

[3] Direct-to-memory architectures are rare, though they have existed. The Cyber 205 supercomputer in the 1980s could have 3 data streams, two from memory to the processor, and one back from the processor to memory, going on at the same time. Such an architecture is only feasible if memory can keep up with the processor speed, which is no longer the case these days.
[4] This is 64-bit output; add the option -m64 on 32-bit systems.
