Parallel Programming: Concepts and Practice

Bertil Schmidt, Institut für Informatik, Staudingerweg 9, 55128 Mainz, Germany
Jorge González-Domínguez, Computer Architecture Group, University of A Coruña, Edificio área científica (Office 3.08), Campus de Elviña, 15071 A Coruña, Spain
Christian Hundt, Institut für Informatik, Staudingerweg 9, 55128 Mainz, Germany
Moritz Schlarb, Data Center, Johannes Gutenberg-University Mainz, Anselm-Franz-von-Bentzel-Weg 12, 55128 Mainz, Germany

Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

Copyright © 2018 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-849890-3

For information on all Morgan Kaufmann publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Katey Birtcher
Acquisition Editor: Steve Merken
Developmental Editor: Nate McFadden
Production Project Manager: Sreejith Viswanathan
Designer: Christian J. Bilbow
Typeset by VTeX

Preface

Parallelism abounds. Nowadays, any modern CPU contains at least two cores, whereas some CPUs feature more than 50 processing units. An even higher degree of parallelism is available on larger systems containing multiple CPUs such as server nodes, clusters, and supercomputers. Thus, the ability to program these types of systems efficiently and effectively is an essential aspiration for scientists, engineers, and programmers. The subject of this book is a comprehensive introduction to the area of parallel programming that addresses this need. Our book teaches practical parallel programming for shared memory and distributed memory architectures based on the C++11 threading API, Open Multiprocessing (OpenMP), Compute Unified Device Architecture (CUDA), Message Passing Interface (MPI), and Unified Parallel C++ (UPC++), as well as the necessary theoretical background. We have included a large number of programming examples based on the recent C++11 and C++14 dialects of the C++ programming language.
This book targets participants of "Parallel Programming" or "High Performance Computing" courses which are taught at most universities at senior undergraduate level or graduate level in computer science or computer engineering. Moreover, it serves as suitable literature for undergraduates in other disciplines with a computer science minor or professionals from related fields such as research scientists, data analysts, or R&D engineers. Prerequisites for being able to understand the contents of our book include some experience with writing sequential code in C/C++ and basic mathematical knowledge.

In good tradition with the historic symbiosis of High Performance Computing and natural science, we introduce parallel concepts based on real-life applications ranging from basic linear algebra routines over machine learning algorithms and physical simulations to traditional algorithms from computer science. The writing of correct yet efficient code is a key skill for every programmer. Hence, we focus on the actual implementation and performance evaluation of algorithms. Nevertheless, the theoretical properties of algorithms are discussed in depth, too. Each chapter features a collection of additional programming exercises that can be solved within a web framework that is distributed with this book. The System for Automated Code Evaluation (SAUCE) provides a web-based testing environment for the submission of solutions and their subsequent evaluation in a classroom setting: the only prerequisite is an HTML5 compatible web browser allowing for the embedding of interactive programming exercises in lectures. SAUCE is distributed as a Docker image and can be downloaded at https://parallelprogrammingbook.org

This website serves as a hub for related content such as installation instructions, a list of errata, and supplementary material (such as lecture slides and solutions to selected exercises for instructors).

If you are a student or professional who aims to learn a certain programming technique, we advise you to initially read the first three chapters on the fundamentals of parallel programming, theoretical models, and hardware architectures. Subsequently, you can dive into one of the introductory chapters on C++11 Multithreading, OpenMP, CUDA, or MPI, which are mostly self-contained. The chapters on Advanced C++11 Multithreading, Advanced CUDA, and UPC++ build upon the techniques of their preceding chapters and thus should not be read in isolation.

If you are a lecturer, we propose a curriculum consisting of 14 lectures mainly covering applications from the introductory chapters. You could start with a lecture discussing the fundamentals from the first chapter, including parallel summation using a hypercube and its analysis, the definition of basic measures such as speedup, parallelization efficiency, and cost, and a discussion of ranking metrics. The second lecture could cover an introduction to PRAM, network topologies, and weak and strong scaling. You can spend more time on PRAM if you aim to later discuss CUDA in more detail, or emphasize hardware architectures if you focus on CPUs. Two to three lectures could be spent on teaching the basics of the C++11 threading API, CUDA, and MPI, respectively. OpenMP can be discussed within a span of one to two lectures. The remaining lectures can be used to discuss the content of the advanced chapters on multithreading, CUDA, or the PGAS-based UPC++ language.

An alternative approach is splitting the content into two courses with a focus on pair-programming within the lecture. You could start with a course on CPU-based parallel programming covering selected topics from the first three chapters. Hence, C++11 threads, OpenMP, and MPI could be taught in full detail.
The second course would focus on advanced parallel approaches covering extensive CUDA programming in combination with (CUDA-aware) MPI and/or the PGAS-based UPC++.

We wish you a great time with the book. Be creative and investigate the code! Finally, we would be happy to hear any feedback from you so that we can improve any of our provided material.

Acknowledgments

This book would not have been possible without the contributions of many people.

Initially, we would like to thank the anonymous and few non-anonymous reviewers who commented on our book proposal and the final draft: Eduardo Cesar Galobardes, Ahmad Al-Khasawneh, and Mohammad Olaimat.

Moreover, we would like to thank our colleagues who thoroughly peer-reviewed the chapters and provided essential feedback: André Müller for his valuable advice on C++ programming, Robin Kobus for being a tough code reviewer, Felix Kallenborn for his steady proofreading sessions, Daniel Jünger for constantly complaining about the CUDA chapter, as well as Stefan Endler and Elmar Schömer for their suggestions.

Additionally, we would like to thank the staff of Morgan Kaufmann and Elsevier who coordinated the making of this book. In particular we would like to mention Nate McFadden.

Finally, we would like to thank our spouses and children for their ongoing support and patience during the countless hours we could not spend with them.

CHAPTER 1 INTRODUCTION

Abstract

In the recent past, teaching and learning of parallel programming have become increasingly important due to the ubiquity of parallel processors in portable devices, workstations, and compute clusters. Stagnating single-threaded performance of modern CPUs requires future computer scientists and engineers to write highly parallelized code in order to fully utilize the compute capabilities of current hardware architectures. The design of parallel algorithms, however, can be challenging, especially for inexperienced students, due to common pitfalls such as race conditions when concurrently accessing shared resources, defective communication patterns causing deadlocks, or the non-trivial task of efficiently scaling an application over the whole number of available compute units. Hence, acquiring parallel programming skills is nowadays an important part of many undergraduate and graduate curricula. More importantly, education in concurrent concepts is not limited to the field of High Performance Computing (HPC). The emergence of deep learning and big data lectures requires teachers and students to adopt HPC as an integral part of their knowledge domain. An understanding of basic concepts is indispensable for acquiring a deep understanding of fundamental parallelization techniques.

The goal of this chapter is to provide an overview of introductory concepts and terminologies in parallel computing. We start with learning about speedup, efficiency, cost, scalability, and the computation-to-communication ratio by analyzing a simple yet instructive example for summing up numbers using a varying number of processors. We get to know the two most important parallel architectures: distributed memory systems and shared memory systems. Designing efficient parallel programs requires a lot of experience, and we will study a number of typical considerations for this process such as problem partitioning strategies, communication patterns, synchronization, and load balancing. We end this chapter with learning about current and past supercomputers and their historical and upcoming architectural trends.
Keywords

Parallelism, Speedup, Parallelization, Efficiency, Scalability, Reduction, Computation-to-communication ratio, Distributed memory, Shared memory, Partitioning, Communication, Synchronization, Load balancing, Task parallelism, Prefix sum, Deep learning, Top500

CONTENTS

1.1 Motivational Example and Its Analysis
    The General Case and the Computation-to-Communication Ratio
1.2 Parallelism Basics
    Distributed Memory Systems
    Shared Memory Systems
    Considerations When Designing Parallel Programs
1.3 HPC Trends and Rankings
1.4 Additional Exercises

1.1 MOTIVATIONAL EXAMPLE AND ITS ANALYSIS

In this section we learn about some basic concepts and terminologies. They are important for analyzing parallel algorithms or programs in order to understand their behavior. We use a simple example for summing up numbers using an increasing number of processors in order to explain and apply the following concepts:

• Speedup. You have designed a parallel algorithm or written a parallel code. Now you want to know how much faster it is than your sequential approach; i.e., you want to know the speedup. The speedup (S) is usually measured or calculated for almost every parallel code or algorithm and is simply defined as the quotient of the time taken using a single processor (T(1)) over the time measured using p processors (T(p)) (see Eq. (1.1)).

  S = T(1) / T(p)    (1.1)

• Efficiency and cost. The best speedup you can usually expect is a linear speedup; i.e., the maximal speedup you can achieve with p processors or cores is p (although there are exceptions to this, which are referred to as super-linear speedups). Thus, you want to relate the speedup to the number of utilized processors or cores. The efficiency E measures exactly that by dividing S by p (see Eq. (1.2)); i.e., linear speedup would then be expressed by a value close to 100%. The cost C is similar but relates the runtime T(p) (instead of the speedup) to the number of utilized processors (or cores) by multiplying T(p) and p (see Eq. (1.3)). A small code sketch applying these measures to the summation example follows this list.

  E = S / p = T(1) / (T(p) × p)    (1.2)

  C = T(p) × p    (1.3)

• Scalability. Often we want to measure the efficiency not only for one particular number of processors or cores but for a varying number, e.g. p = 1, 2, 4, 8, 16, 32, 64, 128, etc. This is called scalability analysis and indicates the behavior of a parallel program when the number of processors increases. Besides varying the number of processors, the input data size is another parameter that you might want to vary when executing your code. Thus, there are two types of scalability: strong scalability and weak scalability. In the case of strong scalability we measure efficiencies for a varying number of processors and keep the input data size fixed. In contrast, weak scalability shows the behavior of our parallel code when varying both the number of processors and the input data size; i.e., when doubling the number of processors we also double the input data size.
• Computation-to-communication ratio. This is an important metric influencing the achievable scalability of a parallel implementation. It can be defined as the time spent calculating divided by the time spent communicating messages between processors. A higher ratio often leads to improved speedups and efficiencies.
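To make these definitions concrete, the following is a minimal sketch (not taken from the book's code material) that implements the summation example with the C++11 threading API, measures T(p) for a fixed input size, and evaluates speedup, efficiency, and cost according to Eqs. (1.1)-(1.3). The function name timed_sum, the chosen input size, and the tested thread counts are illustrative assumptions.

```cpp
// Strong-scaling sketch: sum a fixed-size array with p threads and
// report S = T(1)/T(p), E = S/p, and C = T(p)*p.
// Compile with, e.g.: g++ -std=c++11 -O2 -pthread sum_scaling.cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Sum the array with p threads and return the measured runtime T(p) in seconds.
double timed_sum(const std::vector<std::uint64_t>& data, unsigned p) {
    const std::size_t n = data.size();
    std::vector<std::uint64_t> partial(p, 0);   // one slot per thread: no shared accumulator, no race
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < p; ++t) {
        threads.emplace_back([&data, &partial, t, p, n] {
            // Each thread accumulates a contiguous chunk [lo, hi).
            const std::size_t lo = t * n / p;
            const std::size_t hi = (t + 1) * n / p;
            partial[t] = std::accumulate(data.begin() + lo, data.begin() + hi,
                                         std::uint64_t(0));
        });
    }
    for (auto& th : threads) th.join();

    // Final (sequential) reduction of the per-thread partial sums.
    const std::uint64_t total =
        std::accumulate(partial.begin(), partial.end(), std::uint64_t(0));
    if (total != n) std::cerr << "unexpected sum\n";   // data is all ones, so the sum must be n

    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main() {
    const std::size_t n = 1 << 26;                 // fixed input size (strong scaling)
    std::vector<std::uint64_t> data(n, 1);

    const double t1 = timed_sum(data, 1);          // sequential baseline T(1)
    for (unsigned p = 1; p <= 8; p *= 2) {
        const double tp = timed_sum(data, p);
        const double S  = t1 / tp;                 // speedup,    Eq. (1.1)
        const double E  = S / p;                   // efficiency, Eq. (1.2)
        const double C  = tp * p;                  // cost,       Eq. (1.3)
        std::cout << "p=" << p << "  T(p)=" << tp << " s"
                  << "  S=" << S << "  E=" << E << "  C=" << C << " s\n";
    }
    return 0;
}
```

Each thread writes only its own element of partial, which avoids a race condition on a shared accumulator; the final reduction of the p partial sums and the thread startup remain sequential overhead, so the measured efficiency will typically drop below 100% as p grows.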
