Multicore and GPU Programming: An Integrated Approach
Second Edition

Gerassimos Barlas

Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

Copyright © 2023 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-814120-5

For information on all Morgan Kaufmann publications
visit our website at https://www.elsevier.com/books-and-journals

Publisher: Stephen Merken
Editorial Project Manager: Angie Breckon
Production Project Manager: Karthikeyan Murthy / Manikandan Chandrasekaran
Designer: Miles Hitchen

Typeset by VTeX
Printed in the United States of America
Last digit is the print number: 9 8 7 6 5 4 3 2 1

Dedicated to my late parents,
Dimitris and Maria,
for making it possible,
and
my loving wife Katerina and my two sons
Alexandros and Dimitris, for making it worthwhile.

Contents

List of tables ............................................... xv
Preface ................................................... xvii

PART 1 Introduction

CHAPTER 1 Introduction ..................................... 3
1.1 The era of multicore machines ............................. 3
1.2 A taxonomy of parallel machines ........................... 5
1.3 A glimpse of influential computing machines ................. 7
    1.3.1 The Cell BE processor .............................. 8
    1.3.2 NVidia's Ampere ................................... 9
    1.3.3 Multicore to many-core: TILERA's TILE-Gx8072 and
          Intel's Xeon Phi .................................. 12
    1.3.4 AMD's Epyc Rome: scaling up with smaller chips ...... 14
    1.3.5 Fujitsu A64FX: compute and memory integration ....... 16
1.4 Performance metrics ..................................... 17
1.5 Predicting and measuring parallel program performance ...... 21
    1.5.1 Amdahl's law ...................................... 25
    1.5.2 Gustafson–Barsis' rebuttal ......................... 27
    Exercises .............................................. 28

CHAPTER 2 Multicore and parallel program design ............. 31
2.1 Introduction ............................................ 31
2.2 The PCAM methodology .................................... 32
2.3 Decomposition patterns ................................... 36
    2.3.1 Task parallelism .................................. 37
    2.3.2 Divide-and-conquer decomposition ................... 37
    2.3.3 Geometric decomposition ............................ 40
    2.3.4 Recursive data decomposition ....................... 43
    2.3.5 Pipeline decomposition ............................. 49
    2.3.6 Event-based coordination decomposition ............. 53
2.4 Program structure patterns ............................... 54
    2.4.1 Single program, multiple data ...................... 54
    2.4.2 Multiple program, multiple data .................... 55
    2.4.3 Master–worker ..................................... 56
    2.4.4 Map-reduce ........................................ 57
    2.4.5 Fork/join ......................................... 58
    2.4.6 Loop parallelism .................................. 60
2.5 Matching decomposition patterns with program structure
    patterns ............................................... 60
    Exercises .............................................. 61

PART 2 Programming with threads and processes

CHAPTER 3 Threads and concurrency in standard C++ ........... 65
3.1 Introduction ............................................ 65
3.2 Threads ................................................. 68
    3.2.1 What is a thread? ................................. 68
    3.2.2 What are threads good for? ......................... 68
3.3 Thread creation and initialization ....................... 69
3.4 Sharing data between threads ............................. 77
3.5 Design concerns ......................................... 80
3.6 Semaphores .............................................. 82
3.7 Applying semaphores in classical problems ................. 87
    3.7.1 Producers–consumers ............................... 89
    3.7.2 Dealing with termination ........................... 93
    3.7.3 The barbershop problem – introducing fairness ...... 105
    3.7.4 Readers–writers ................................... 111
3.8 Atomic data types ....................................... 117
    3.8.1 Memory ordering ................................... 122
3.9 Monitors ................................................ 126
    3.9.1 Design approach #1: critical section inside the
          monitor .......................................... 131
    3.9.2 Design approach #2: monitor controls entry to
          critical section ................................. 132
    3.9.3 General semaphores revisited ....................... 136
3.10 Applying monitors in classical problems .................. 138
    3.10.1 Producers–consumers revisited ..................... 138
    3.10.2 Readers–writers .................................. 145
3.11 Asynchronous threads .................................... 152
3.12 Dynamic vs. static thread management ..................... 156
3.13 Threads and fibers ...................................... 165
3.14 Debugging multi-threaded applications .................... 172
    Exercises .............................................. 177

CHAPTER 4 Parallel data structures .......................... 181
4.1 Introduction ............................................ 181
4.2 Lock-based structures .................................... 185
    4.2.1 Queues ............................................ 185
    4.2.2 Lists ............................................. 189
4.3 Lock-free structures ..................................... 203
    4.3.1 Lock-free stacks .................................. 204
    4.3.2 A bounded lock-free queue: first attempt ........... 209
    4.3.3 The ABA problem ................................... 216
    4.3.4 A fixed bounded lock-free queue .................... 218
    4.3.5 An unbounded lock-free queue ....................... 222
4.4 Closing remarks ......................................... 227
    Exercises .............................................. 228

CHAPTER 5 Distributed memory programming .................... 231
5.1 Introduction ............................................ 231
5.2 MPI ..................................................... 232
5.3 Core concepts ........................................... 234
5.4 Your first MPI program ................................... 234
5.5 Program architecture ..................................... 238
    5.5.1 SPMD ............................................. 238
    5.5.2 MPMD ............................................. 240
5.6 Point-to-point communication ............................. 241
5.7 Alternative point-to-point communication modes ........... 245
    5.7.1 Buffered communications ........................... 246
5.8 Non-blocking communications .............................. 248
5.9 Point-to-point communications: summary ................... 252
5.10 Error reporting & handling ............................... 252
5.11 Collective communications ................................ 254
    5.11.1 Scattering ....................................... 259
    5.11.2 Gathering ........................................ 265
    5.11.3 Reduction ........................................ 267
    5.11.4 All-to-all gathering .............................. 271
    5.11.5 All-to-all scattering ............................. 276
    5.11.6 All-to-all reduction .............................. 282
    5.11.7 Global synchronization ............................ 282
5.12 Persistent communications ................................ 283
5.13 Big-count communications in MPI 4.0 ...................... 286
5.14 Partitioned communications ............................... 287
5.15 Communicating objects .................................... 289
    5.15.1 Derived data types ................................ 291
    5.15.2 Packing/unpacking ................................. 298
5.16 Node management: communicators and groups ................ 300
    5.16.1 Creating groups ................................... 301
    5.16.2 Creating intracommunicators ....................... 303
5.17 One-sided communication .................................. 306
    5.17.1 RMA communication functions ....................... 307
    5.17.2 RMA synchronization functions ..................... 309
5.18 I/O considerations ....................................... 318
5.19 Combining MPI processes with threads ..................... 326
5.20 Timing and performance measurements ...................... 328
5.21 Debugging, profiling, and tracing MPI programs ........... 329
    5.21.1 Brief introduction to Scalasca .................... 330
    5.21.2 Brief introduction to TAU ......................... 334
5.22 The Boost.MPI library .................................... 336
    5.22.1 Blocking and non-blocking communications .......... 337
    5.22.2 Data serialization ................................ 342
    5.22.3 Collective operations ............................. 345
5.23 A case study: diffusion-limited aggregation .............. 349
5.24 A case study: brute-force encryption cracking ............ 355
5.25 A case study: MPI implementation of the master–worker
    pattern ................................................ 361
    5.25.1 A simple master–worker setup ...................... 361
    5.25.2 A multi-threaded master–worker setup .............. 369
    Exercises .............................................. 384

CHAPTER 6 GPU programming: CUDA ............................. 389
6.1 Introduction ............................................ 389
6.2 CUDA's programming model: threads, blocks, and grids ..... 392
6.3 CUDA's execution model: streaming multiprocessors and
    warps .................................................. 398
6.4 CUDA compilation process ................................. 401
6.5 Putting together a CUDA project .......................... 406
6.6 Memory hierarchy ........................................ 409
    6.6.1 Local memory/registers ............................. 416
    6.6.2 Shared memory ..................................... 417
    6.6.3 Constant memory ................................... 426
    6.6.4 Texture and surface memory ......................... 433
6.7 Optimization techniques .................................. 433
    6.7.1 Block and grid design .............................. 433
    6.7.2 Kernel structure ................................... 445
    6.7.3 Shared memory access ............................... 453
    6.7.4 Global memory access ............................... 462
    6.7.5 Asynchronous execution and streams: overlapping GPU
          memory transfers and more ......................... 474
6.8 Graphs .................................................. 482
    6.8.1 Creating a graph using the CUDA graph API .......... 483
    6.8.2 Creating a graph by capturing a stream ............. 489
6.9 Warp functions .......................................... 492
6.10 Cooperative groups ....................................... 501
    6.10.1 Intrablock cooperative groups ..................... 501
    6.10.2 Interblock cooperative groups ..................... 514
    6.10.3 Grid-level reduction .............................. 519
6.11 Dynamic parallelism ...................................... 523
6.12 Debugging CUDA programs .................................. 527
6.13 Profiling CUDA programs .................................. 529
6.14 CUDA and MPI ............................................. 533
6.15 Case studies ............................................. 539
    6.15.1 Fractal set calculation ........................... 540