A Massive Data Parallel Computational Framework for Petascale/Exascale Hybrid Computer Systems

arXiv:1201.2118v1 [cs.DC] 10 Jan 2012

Marek BLAZEWICZ (a), Steven R. BRANDT (b,c), Peter DIENER (b,d),
David M. KOPPELMAN (e), Krzysztof KUROWSKI (a), Frank LÖFFLER (b),
Erik SCHNETTER (f,b,g), Jian TAO (b)

(a) Applications Department, Poznań Supercomputing and Networking Center, Poland
(b) Center for Computation & Technology, Louisiana State University, Baton Rouge, USA
(c) Department of Computer Science, Louisiana State University, Baton Rouge, USA
(d) Department of Physics & Astronomy, Louisiana State University, Baton Rouge, USA
(e) Department of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, USA
(f) Perimeter Institute for Theoretical Physics, Waterloo, Canada
(g) Department of Physics, University of Guelph, Guelph, Canada

Keywords. hybrid system, stencil computations, CFD, computational framework, large scale scientific application

Introduction

Heterogeneous systems are becoming more common on High Performance Computing (HPC) systems. Even using tools like CUDA [1] and OpenCL [2] it is a non-trivial task to obtain optimal performance on the GPU. Approaches to simplifying this task include Merge [3] (a library based framework for heterogeneous multi-core systems), Zippy [4] (a framework for parallel execution of codes on multiple GPUs), BSGP [5] (a new programming language for general purpose computation on the GPU), and CUDA-lite [6] (an enhancement to CUDA that transforms code based on annotations). In addition, efforts are underway to improve compiler tools for automatic parallelization and optimization of affine loop nests for GPUs [7] and for automatic translation of OpenMP parallelized codes to CUDA [8].

In this paper we present an alternative approach: a new computational framework for the development of massively data parallel scientific applications suitable for use on such petascale/exascale hybrid systems, built upon the highly scalable Cactus framework [9,10]. As the first non-trivial demonstration of its usefulness, we successfully developed a new 3D CFD code that achieves improved performance.

1. Cactus Computational Framework

The Cactus framework [9,10] was designed and developed to enhance programming productivity in large-scale science collaborations. The design of Cactus allows scientists and engineers to develop independent components for Cactus without worrying about portability issues on computing systems. The common infrastructure provided by Cactus also enables the development of scientific codes that work across different disciplines. This approach emphasizes code re-usability, and greatly simplifies constructing sound interfaces and well-tested and well-supported software. As the name Cactus suggests, the framework consists of a central core, called the flesh, which provides infrastructure and interfaces for modular components called thorns.

Building upon the flesh, thorns can provide implementations of computational concepts such as parallelization, mesh refinement, I/O, checkpointing, web servers, and so forth. The Cactus Computational Toolkit (CCTK) is a collection of thorns which provide basic computational capabilities. Application thorns make use of the CCTK via the Cactus flesh API. Cactus is well suited for domain discretizations via regular, block-structured grids, as are common e.g. for higher order finite differences. The Carpet AMR library [11,12] implements the recursive block-structured AMR algorithm of Berger and Oliger [13], and provides support for multi-block (or multi-patch) domain discretizations. A set of explicit time integration schemes such as Runge-Kutta methods is provided by a Method of Lines time integrator. Overall, the Cactus framework hides the detailed implementations of Carpet and other utility thorns from application developers.
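To make the flesh/thorn relationship concrete, the sketch below shows what an application thorn routine scheduled by the flesh typically looks like. It is a minimal illustration only: the thorn name, the grid functions phi and phi_rhs, and the update itself are hypothetical and not taken from this paper, while the macros used (CCTK_ARGUMENTS, DECLARE_CCTK_ARGUMENTS, CCTK_GFINDEX3D, CCTK_DELTA_SPACE) belong to the standard Cactus flesh API.

    #include "cctk.h"
    #include "cctk_Arguments.h"
    #include "cctk_Parameters.h"

    /* A routine registered with the flesh through schedule.ccl. The flesh
     * hands over grid sizes, grid variables, and parameters via macros;
     * the driver thorn has already partitioned the grid and filled the
     * ghost zones, so the loop below only touches local data. */
    void ExampleThorn_Update(CCTK_ARGUMENTS)
    {
      DECLARE_CCTK_ARGUMENTS;
      DECLARE_CCTK_PARAMETERS;

      const CCTK_REAL dx = CCTK_DELTA_SPACE(0);

      /* cctk_lsh holds the size of this process's local grid component. */
      for (int k = 1; k < cctk_lsh[2] - 1; k++)
        for (int j = 1; j < cctk_lsh[1] - 1; j++)
          for (int i = 1; i < cctk_lsh[0] - 1; i++)
          {
            const int c = CCTK_GFINDEX3D(cctkGH, i, j, k);
            /* Hypothetical second-order stencil in the x direction. */
            phi_rhs[c] = (phi[CCTK_GFINDEX3D(cctkGH, i+1, j, k)]
                        - 2.0 * phi[c]
                        + phi[CCTK_GFINDEX3D(cctkGH, i-1, j, k)]) / (dx * dx);
          }
    }

Because parallelization, I/O, and ghost-zone handling live in other thorns, a routine of this form is the same whether the run uses one process or thousands.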
1.1. MPI-Based Data Parallelism in Cactus

The Cactus framework adopts the idea of data parallelism in its design and implementation. In Cactus, the computational grid is decomposed into multiple components that are distributed among processes, and the same set of operations is applied to each. The communication component of Cactus uses the Message Passing Interface (MPI) to exchange data between processes. In Cactus, it is the task of a special driver component to set up storage for variables, partition the grid between MPI processes, and manage inter-process communication. Unlike physical boundaries, where the boundary data can be set or calculated from boundary conditions, data at inter-process boundaries need to be copied from the other processes where the neighboring grid components are located. This is implemented via a ghost region at the inter-process boundaries that is automatically set up by the driver. The necessary size of a ghost region depends on the numerical algorithms used and can be selected via parameters at run time.

1.2. Parallelization on CPU-GPU Hybrid Systems

The data parallelism in Cactus matches well with the features of CUDA [14] and OpenCL [15] in supporting programming on hybrid computer systems. At the computational framework level, there is not much difference between CUDA and OpenCL when targeting NVIDIA Fermi-class GPUs. In this work we focus only on parallelization for the CUDA architecture, and will present a computational framework based on OpenCL in a later publication.

Based upon the CUDA architecture, we built an MPI-CUDA based computational framework in Cactus. It enables a simple, semi-automatic, yet efficient implementation and execution of CUDA-enabled applications. Auto-tuning enables efficient data distribution between nodes, effectively hiding the additional cost introduced by GPU-host and host-host interconnections. The computational overhead of such a generic framework is greatly reduced by overlapping data transfers with computation, using the asynchronous data transfers and concurrent copy and execution supported in CUDA. With the help of such a computational framework, application developers can then spend more time optimizing the numerical kernel itself, implementing more efficient algorithms in these kernels, and (most importantly) advancing the science content in their code.

This system has been tested and benchmarked on a 3D CFD implementation (see section 4) based on a finite difference discretization of the Navier-Stokes equations.

Figure 1. The Cactus computational framework manages the domain decomposition and communication among the split domains via MPI. Computations are performed primarily on GPUs. The data transfers between CPUs, as well as to and from the GPUs, are concurrent with the computation.

2. GPGPU Programming in Cactus

Achieving efficient execution on a GPU often requires careful analysis of the application followed by extensive testing and tuning. For many important problems, such as linear algebra routines, this work has been done and packaged into libraries for convenient use by others [16].

Iterative grid techniques are widely used, and seem like a good fit for the high floating point density of GPUs. But because each investigator may run a different grid kernel, a simple library routine would not achieve wide use. GPU implementations of iterative grid algorithms must deal with the problem of ghost zone exchange, made more tedious by GPU memory access constraints, among other factors.
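The host-staged exchange pattern of figure 1 can be sketched in a few lines of MPI and CUDA. This is a simplified illustration of the general pattern, not code from the CaCUDA driver: in the actual framework these transfers are posted asynchronously so that they overlap with kernel execution, and the buffer layout assumed here is hypothetical.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Exchange one ghost-zone face with a neighboring rank, staging the
     * data through host memory (h_send/h_recv would typically be pinned).
     * send_off/recv_off are byte offsets of the outgoing boundary layer
     * and of the ghost region inside the device array d_field. */
    static void exchange_ghost_face(double *d_field, char *h_send, char *h_recv,
                                    int face_bytes, size_t send_off,
                                    size_t recv_off, int neighbor, MPI_Comm comm)
    {
      /* 1. Boundary layer: GPU -> host. */
      cudaMemcpy(h_send, (char *)d_field + send_off, face_bytes,
                 cudaMemcpyDeviceToHost);

      /* 2. Swap layers with the neighboring process (cf. the MPI_Send /
       *    MPI_Recv pairs in figure 1). */
      MPI_Sendrecv(h_send, face_bytes, MPI_BYTE, neighbor, 0,
                   h_recv, face_bytes, MPI_BYTE, neighbor, 0,
                   comm, MPI_STATUS_IGNORE);

      /* 3. Received layer: host -> ghost region on the GPU. */
      cudaMemcpy((char *)d_field + recv_off, h_recv, face_bytes,
                 cudaMemcpyHostToDevice);
    }

Within a single GPU, the analogous exchange between CUDA blocks has no such explicit mechanism; each block must reload the ghost points of its tile from global memory, which is what makes the problem more tedious than its cluster counterpart.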
On conventional cluster systems, iterative grid application programmers do not need to consider such issues when using a framework like Cactus. Cactus manages data communication between a cluster's nodes, including ghost zone exchange, so that application code need only operate on the data. The problem of ghost zone interchange between CUDA blocks is similar in many ways to ghost zone interchange between processors in a cluster CPU configuration.

In this work the Cactus framework has been extended to cover GPU execution via an architecture-neutral programming abstraction that supports highly optimized finite difference operations in a multithreaded computing environment (see listing 1).

Figure 2. The workflow chart of a CaCUDA-based application. The upper box shows the generation of the CaCUDA kernel headers at the code compilation stage. The lower box shows how the variables are evolved to the next time step.

3. CaCUDA Kernel Abstraction

The task of simplifying the generation of CUDA code for a finite differencing code is not a straightforward one. Shared arrays with appropriate stencil sizes have to be carefully managed, and data needed by the stencil has to be streamed in while calculations proceed. It is possible to abstract away much of the difficult work into boilerplate code, but doing so requires some extra machinery. We designed and implemented a programming abstraction in the Cactus framework that enables automatic code generation from a set of highly optimized templates, simplifying code construction. The workflow chart of a typical CaCUDA-based application can be found in figure 2.

There are three major components in our CaCUDA kernel abstraction.

1. The CaCUDA Kernel Descriptor is used to declare the variables that will be needed in the GPGPU computation, and to identify a few relevant properties.

2. The CaCUDA Templates are a set of templates which are highly optimized for particular types of computational tasks and optimization strategies.

3. The CaCUDA Code Generator is used to parse the descriptors and automatically generate CUDA-based macros. The code generator is based on Piraha [17], which implements a type of parsing expression grammar [18]. Due to the page limit, we do not list the templates and the sample code generated by CaCUDA; more about the project can be found at the CaCUDA project site [19].

Listing 1: A sample kernel definition in Cactus

    CCTK_CUDA_KERNEL UPDATE_VELOCITY TYPE=3DBLOCK
                     STENCIL="1,1,1,1,1,1"
                     TILE="16,16,16"
    {
        CCTK_CUDA_KERNEL_VARIABLE CACHED=YES INTENT=SEPARATEINOUT
        {
            vx, vy, vz
        } "VELOCITY"

        CCTK_CUDA_KERNEL_VARIABLE CACHED=YES INTENT=IN
        {
            p
        } "PRESSURE"

        CCTK_CUDA_KERNEL_PARAMETER
        {
            density
        } "DENSITY"
    }

The above kernel abstraction can be integrated in a straightforward manner as a thorn (module), CaCUDA, in Cactus without touching the flesh (core infrastructure). The kernel descriptor in this abstraction is similar in both format and functionality to the Cactus Configuration Language (CCL) files, which are used to declare global data structures, runtime parameters, and the way various C or Fortran subroutines interact through the schedule tree. The declaration files already used by Cactus are: param.ccl, configuration.ccl, schedule.ccl, and interface.ccl. To this set we add an additional declarative file called cacuda.ccl. The Cactus framework already has a mechanism, implemented through the configuration.ccl file, by which discovery of additional preprocessing code can be enabled prior to the compilation of C/Fortran code.
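The paper does not reproduce the generated code, but the pattern a 3DBLOCK template automates can be sketched as follows. This is an illustrative simplification (a 2D tile marched along z rather than a full 16x16x16 tile, with hypothetical variable names): each block stages a tile of the cached variable plus its one-point stencil halo in shared memory before computing.

    #define TILE_X 16
    #define TILE_Y 16

    /* Launch with blockDim = (TILE_X, TILE_Y, 1). Each block stages a
     * tile of the pressure p, padded by one ghost point per side
     * (STENCIL = 1), and updates vx from a central difference. */
    __global__ void update_velocity_sketch(const double *p, double *vx,
                                           int nx, int ny, int nz, double scale)
    {
      __shared__ double sp[TILE_Y + 2][TILE_X + 2];

      int i  = blockIdx.x * TILE_X + threadIdx.x;   /* global indices */
      int j  = blockIdx.y * TILE_Y + threadIdx.y;
      int li = threadIdx.x + 1;                     /* local tile indices */
      int lj = threadIdx.y + 1;

      /* March along z, streaming one plane at a time through shared memory. */
      for (int k = 0; k < nz; k++) {
        size_t idx = ((size_t)k * ny + j) * nx + i;
        if (i < nx && j < ny) {
          sp[lj][li] = p[idx];
          /* Threads on the tile edge also fetch the ghost points. */
          if (threadIdx.x == 0          && i > 0)      sp[lj][0]      = p[idx - 1];
          if (threadIdx.x == TILE_X - 1 && i < nx - 1) sp[lj][li + 1] = p[idx + 1];
          if (threadIdx.y == 0          && j > 0)      sp[0][li]      = p[idx - nx];
          if (threadIdx.y == TILE_Y - 1 && j < ny - 1) sp[lj + 1][li] = p[idx + nx];
        }
        __syncthreads();

        /* Interior update: central difference of pressure in x. */
        if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1)
          vx[idx] -= scale * (sp[lj][li + 1] - sp[lj][li - 1]);
        __syncthreads();
      }
    }

Writing, guarding, and tuning this staging logic by hand for every kernel is exactly the boilerplate that the descriptor of listing 1 and the code generator are meant to eliminate.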
4. CFD Implementation

Computational Fluid Dynamics (CFD) is a branch of fluid mechanics that uses numerical methods and algorithms to solve and analyze fluid flows. It is successfully used in various fields of science and engineering, such as weather forecasting, aerodynamic optimization of body shapes (e.g. planes, cars, ships), and gas reservoir uncertainty analysis. Unfortunately, accurate CFD simulations require great computational power, so it is very important to adapt existing algorithms to new hybrid architectures and execute them in a massively parallel manner.

4.1. Background and Governing Equations

The CFD numerical method is governed by the incompressible Navier-Stokes equations, which are derived from Newton's second law (conservation of momentum) and conservation of mass (incompressibility). The Navier-Stokes equations are

    ∂u/∂t + (u · ∇)u = −∇φ + ν∇²u + f    (1)

    ∇ · u = 0    (2)

where u is the velocity field, ν is the kinematic viscosity, f is the body force, and φ is the modified pressure (pressure divided by density). These equations need to be further discretized in order to perform a proper simulation; in this process we followed [20] and [21]. The equations are discretized using a finite-difference method. The computational domain is distributed onto a regular, rectangular, staggered grid. The computations are performed in a stencil pattern, meaning that the calculation at each grid cell uses only values from the close neighborhood of that cell.
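As an illustration of how equation (1) maps onto the staggered grid, the sketch below applies just the −∇φ term to the x velocity component, mirroring the variables declared in listing 1 (vx, p, density). It is a hedged example of the discretization pattern, not the paper's actual generated kernel, and it omits the advection, diffusion, and body-force terms.

    /* vx lives on the x faces of the staggered grid and p at the cell
     * centers, so (p[i+1] - p[i])/dx is a central difference located
     * exactly at the vx point being updated. phi = p / density is the
     * modified pressure of equation (1). Launch with a 3D grid of
     * 3D blocks covering (nx-1) x ny x nz face points. */
    __global__ void apply_pressure_gradient(double *vx, const double *p,
                                            int nx, int ny, int nz,
                                            double dt, double dx, double density)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int j = blockIdx.y * blockDim.y + threadIdx.y;
      int k = blockIdx.z * blockDim.z + threadIdx.z;
      if (i >= nx - 1 || j >= ny || k >= nz) return;

      size_t c = ((size_t)k * ny + j) * nx + i;   /* cell (i,j,k)   */
      size_t e = c + 1;                           /* cell (i+1,j,k) */

      vx[c] -= dt * (p[e] - p[c]) / (dx * density);
    }

Because the velocity components are face-centered, each such update needs only the two adjacent cell-centered pressures, which keeps the stencil compact (the "1,1,1,1,1,1" declaration in listing 1).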
4.2. Code Validation and Verification

A homogeneous distribution of computations for the lid-driven cavity problem with a Reynolds number of 100 was used to benchmark the overall performance of the framework and to verify the numerical implementation. We show the quantitative comparison of the midsection centerline velocity with the results of Ghia et al. [22] in figure 3.

Figure 3. The figure to the left shows the quantitative comparison of midsection centerline velocity with the results of Ghia et al. [22]. The one to the right shows the contours of the X component of the velocity field along the Y axis.

While these results come from a terascale machine, there is no logical barrier to continued scaling, and we plan to continue scaling studies as resources become available.

4.3. Performance and Scalability

We carried out performance and scaling tests on a 6-node GPU cluster at Cyfronet. Each node had 2 Tesla M2050 GPUs, two Intel Xeon X5670 processors running at 2.93 GHz, and an InfiniBand interconnect. Performance was measured for both the standalone CFD code and the CaCUDA framework-based code. The single-node performance results were 43.5 and 58 GFlop/s for the standalone simulation and the simulation implemented within CaCUDA, respectively. The scalability results of the 3D CFD code that makes use of the CaCUDA framework, as well as of the standalone version, are shown in figure 4.

Figure 4. This plot compares the speed-up of the CFD code built with CaCUDA to the standalone, handwritten implementation. Speed-ups are computed relative to the performance of the standalone code on a single node using a single GPU. (Plot: speed-up versus number of GPUs, 1-12; series: linear standalone scaling, standalone simulation speed-up, CaCUDA simulation speed-up.)

5. Conclusions

In this paper an implementation of a new generic capability for computing on hybrid CPU/GPU architectures in the Cactus computational framework has been presented. The capability to handle data exchange between GPU and CPU address spaces and to deploy computations in a hybrid environment was implemented as a new thorn, CaCUDA. Moreover, the framework considerably simplifies the implementation process by generating templates for all declared kernel functions. Due to the flexibility and extensibility of the Cactus framework, no changes to the Cactus flesh were necessary, guaranteeing that existing features and user-implemented thorns are not affected by this addition.

As a test case application of these new framework features, an incompressible CFD code has been implemented to test the overall performance and scalability. The results demonstrating its usability have been presented.

Our current effort is focused on minimizing the cost of the data exchange between GPU and CPU and on optimizing the boundary exchange. Further integration in this area may improve performance and scalability.

Acknowledgments

This work is supported by the CyberTools (http://cybertools.loni.org) project (NSF award 701491), the NG-CHC project (NSF award 1010640) through the Louisiana Board of Regents, and the NSF award PIF-0904015 (CIGR). This work used the computer resources provided by LSU/LONI. This research was supported in part by PL-Grid Infrastructure. This work is also supported by the UCoMS project under award number MNiSW (Polish Ministry of Science and Higher Education) Nr 4691N-USA/2009 in close collaboration with U.S. research institutions involved in the U.S. Department of Energy (DOE) funded grant under award number DE-FG02-04ER46136 and the Board of Regents, State of Louisiana, under contract no. DOE/LEQSF(2004-07). The authors want to thank our colleagues at both CCT and PSNC for great ideas and discussions. The authors also want to thank Soon-Heum Ko from the National Supercomputing Centre at Linköping in Sweden for helping to validate the CFD code.

References

[1] NVIDIA Corporation 2011 NVIDIA CUDA C Programming Guide NVIDIA Corporation
[2] Munshi A (ed) 2011 The OpenCL Specification Version: 1.1 (The Khronos Group) URL http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
[3] Linderman M D, Collins J D, Wang H and Meng T H 2008 SIGPLAN Not. 43(3) 287-296 ISSN 0362-1340
[4] Fan Z, Qiu F and Kaufman A E 2008 Computer Graphics Forum 27 341-350 ISSN 1467-8659
[5] Hou Q, Zhou K and Guo B 2008 ACM Trans. Graph. 27(3) 19:1-19:12 ISSN 0730-0301
[6] Ueng S Z, Lathara M, Baghsorkhi S and Hwu W m 2008 Languages and Compilers for Parallel Computing (Lecture Notes in Computer Science vol 5335) ed Amaral J (Springer Berlin/Heidelberg) pp 1-15 URL http://dx.doi.org/10.1007/978-3-540-89740-8_1
[7] Baskaran M M, Bondhugula U, Krishnamoorthy S, Ramanujam J, Rountev A and Sadayappan P 2008 Proceedings of the 22nd annual international conference on Supercomputing ICS '08 (New York, NY, USA: ACM) pp 225-234 ISBN 978-1-60558-158-3 URL http://doi.acm.org/10.1145/1375527.1375562
[8] Lee S, Min S J and Eigenmann R 2009 SIGPLAN Not. 44(4) 101-110 ISSN 0362-1340
[9] Goodale T, Allen G, Lanfermann G, Massó J, Radke T, Seidel E and Shalf J 2003 High Performance Computing for Computational Science - VECPAR 2002, 5th International Conference, Porto, Portugal, June 26-28, 2002 (Berlin: Springer) pp 197-227
[10] Cactus Framework URL http://www.cactuscode.org
[11] Schnetter E, Hawley S H and Hawke I 2004 Class. Quantum Grav. 21 1465-1488 gr-qc/0310042
[12] Adaptive Mesh Refinement with Carpet URL http://www.carpetcode.org/
[13] Berger M J and Oliger J 1984 J. Comput. Phys. 53 484-512
[14] NVIDIA CUDA (Compute Unified Device Architecture) URL http://www.nvidia.com/object/cuda_home_new.html
[15] OpenCL (Open Computing Language) URL http://www.khronos.org/opencl/
[16] Volkov V and Demmel J W 2008 Proceedings of the 2008 ACM/IEEE conference on Supercomputing SC '08 (Piscataway, NJ, USA: IEEE Press) pp 31:1-31:11 ISBN 978-1-4244-2835-9 URL http://portal.acm.org/citation.cfm?id=1413370.1413402
[17] Brandt S R and Allen G 2011 2010 11th ACM/IEEE International Conference on Grid Computing
[18] Ford B 2004 Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
[19] CaCUDA MPI+CUDA Programming Framework URL http://code.google.com/p/cacuda/
[20] Hirt C and Nichols B 1981 Journal of Computational Physics
[21] Torrey M D, Cloutman L D, Mjolsness R C and Hirt C W 1985 NASA-VOF2D: A Computer Program for Incompressible Flows with Free Surfaces Tech. rep. Los Alamos National Laboratory
[22] Ghia U, Ghia K N and Shin C T 1982 Journal of Computational Physics 48 387-411 ISSN 0021-9991
