
Beyond the Socket: NUMA-Aware GPUs

Ugljesa Milic†,∓  Oreste Villa‡  Evgeny Bolotin‡  Akhil Arunkumar∗  Eiman Ebrahimi‡  Aamer Jaleel‡  Alex Ramirez⋆  David Nellans‡
†Barcelona Supercomputing Center (BSC), ∓Universitat Politècnica de Catalunya (UPC), ‡NVIDIA, ∗Arizona State University, ⋆Google
[email protected], {ovilla,ebolotin,eebrahimi,ajaleel,dnellans}@nvidia.com, [email protected], [email protected]

ABSTRACT

GPUs achieve high throughput and power efficiency by employing many small single instruction multiple thread (SIMT) cores. To minimize scheduling logic and performance variance they utilize a uniform memory system and leverage strong data parallelism exposed via the programming model. With Moore's law slowing, for GPUs to continue scaling performance (which largely depends on SIMT core count) they are likely to embrace multi-socket designs where transistors are more readily available. However, when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited, allowing GPU sockets to dynamically optimize their individual interconnect and cache policies, minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2, 4, and 8 socket designs respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for scaling GPU performance beyond a single socket.

CCS CONCEPTS

• Computing methodologies → Graphics processors; • Computer systems organization → Single instruction, multiple data;

KEYWORDS

Graphics Processing Units, Multi-socket GPUs, NUMA Systems

ACM Reference format:
Ugljesa Milic, Oreste Villa, Evgeny Bolotin, Akhil Arunkumar, Eiman Ebrahimi, Aamer Jaleel, Alex Ramirez, David Nellans. 2017. Beyond the Socket: NUMA-Aware GPUs. In Proceedings of MICRO-50, Cambridge, MA, USA, October 14–18, 2017, 13 pages. https://doi.org/10.1145/3123939.3124534

1 INTRODUCTION

In the last decade GPU computing has transformed the high performance computing, machine learning, and data analytics fields that were previously dominated by CPU-based installations [27, 34, 53, 61]. Many systems now rely on a combination of GPUs and CPUs to leverage high throughput data parallel GPUs with latency critical execution occurring on the CPUs. In part, GPU-accelerated computing has been successful in these domains because of native support for data parallel programming languages [24, 40] that reduce programmer burden when trying to scale programs across ever growing datasets.

Nevertheless, with GPUs nearing the reticle limits for maximum die size and the transistor density growth rate slowing down [5], developers looking to scale the performance of their single GPU programs are in a precarious position. Multi-GPU programming models support explicit programming of two or more GPUs, but it is challenging to leverage mechanisms such as Peer-2-Peer access [36] or a combination of MPI and CUDA [42] to manage multiple GPUs. These programming extensions enable programmers to employ more than one GPU for high throughput computation, but require re-writing of traditional single GPU applications, slowing their adoption.
High port-count PCIe switches are now readily available and the PCI-SIG roadmap is projecting PCIe 5.0 bandwidth to reach 128 GB/s in 2019 [6]. At the same time, GPUs are starting to expand beyond the traditional PCIe peripheral interface to enable more efficient interconnection protocols between both GPUs and CPUs, such as AMD's Infinity Fabric or NVIDIA's Scalable Link Interface and NVLink [2, 29, 35, 39, 44]. Future high bandwidth GPU-to-GPU interconnects, possibly using improved communication protocols, may lead to system designs with closely coupled groups of GPUs that can efficiently share memory at fine granularity.

The onset of such multi-socket GPUs would provide a pivot point for GPU and system vendors. On one hand, vendors can continue to expose these GPUs as individual GPUs and force developers to use multiple programming paradigms to leverage these multiple GPUs. On the other, vendors could expose multi-socket designs as a single non-uniform memory access (NUMA) GPU resource as shown in Figure 1. By extending the single GPU programming model to multi-socket GPUs, applications can scale beyond the bounds of Moore's law, while simultaneously retaining the programming interface to which GPU developers have become accustomed.

Figure 1: The evolution of GPUs from traditional discrete PCIe devices to single logical, multi-socketed accelerators utilizing a switched interconnect.

Figure 2: Percentage of workloads that are able to fill future larger GPUs (average number of concurrent thread blocks exceeds the number of SMs in the system): 41/41 for the current GPU (56 SMs), 37/41 at 2× (112 SMs), 35/41 at 4× (224 SMs), 33/41 at 8× (448 SMs), and 31/41 at 16× (896 SMs).

Several groups have previously examined aggregating multiple GPUs together under a single programming model [7, 30]; however this work was done in an era where GPUs had limited memory addressability and relied on high latency, low bandwidth CPU-based PCIe interconnects. As a result, prior work focused primarily on improving the multi-GPU programming experience rather than achieving scalable performance. Building upon this work, we propose a multi-socket NUMA-aware GPU architecture and runtime that aggregates multiple GPUs into a single programmer transparent logical GPU. We show that in the era of unified virtual addressing [37], cache line addressable high bandwidth interconnects [39], and dedicated GPU and CPU socket PCB designs [29], scalable multi-GPU performance may be achievable using existing single GPU programming models. This work makes the following contributions:

• We show that traditional NUMA memory placement and scheduling policies are not sufficient for multi-socket GPUs to achieve performance scalability. We then demonstrate that inter-socket bandwidth will be the primary performance limiter in future NUMA GPUs.
• By exploiting program phase behavior we show that inter-socket links (and thus bandwidth) should be dynamically and adaptively reconfigured at runtime to maximize link utilization. Moreover, we show that link policy must be determined on a per-GPU basis, as global policies fail to capture per-GPU phase behavior.
• We show that both the GPU L1 and L2 caches should be made NUMA-aware and dynamically adapt their caching policy to minimize NUMA effects. We demonstrate that in NUMA GPUs, extending existing GPU cache coherence protocols across multiple sockets is a good design choice, despite the overheads.
• We show that multi-socket NUMA-aware GPUs can allow traditional GPU programs to scale efficiently to as many as 8 GPU sockets, providing significant headroom before developers must re-architect applications to obtain additional performance.

2 MOTIVATION AND BACKGROUND

Over the last decade single GPU performance has scaled thanks to a significant growth in per-GPU transistor count and DRAM bandwidth.
For example, in 2010 NVIDIA's Fermi GPU integrated 1.95B transistors on a 529 mm² die, with 180 GB/s of DRAM bandwidth [13]. In 2016 NVIDIA's Pascal GPU contained 12B transistors within a 610 mm² die, while relying on 720 GB/s of memory bandwidth [41]. Unfortunately, transistor density growth is slowing significantly and integrated circuit manufacturers are not providing roadmaps beyond 7 nm. Moreover, GPU die sizes, which have also been slowly but steadily growing generationally, are expected to slow down due to limitations in lithography and manufacturing cost.

Without either larger or denser dies, GPU architects must turn to alternative solutions to significantly increase GPU performance. Recently 3D die-stacking has seen significant interest due to its successes with high bandwidth DRAM [23]. Unfortunately, 3D die-stacking still has significant engineering challenges related to power delivery, energy density, and cooling [60] when employed in power hungry, maximal die-sized chips such as GPUs. Thus we propose that GPU manufacturers are likely to re-examine a tried and true solution from the CPU world, multi-socket GPUs, to scale GPU performance while maintaining the current ratio of floating point operations per second (FLOPS) and DRAM bandwidth.

Multi-socket GPUs are enabled by the evolution of GPUs from external peripherals to central computing components that are considered at system design time. GPU optimized systems now employ custom PCB designs that accommodate high pin count socketed GPU modules [35] with inter-GPU interconnects resembling QPI or HyperTransport [17, 21, 39]. Despite the rapid improvement in hardware capabilities, these systems have continued to expose the GPUs as individually addressable units. These multi-GPU systems can provide high aggregate throughput when running multiple concurrent GPU kernels, but to accelerate a single GPU workload they require layering additional software runtimes on top of native GPU programming interfaces such as CUDA or OpenCL [24, 38].
Unfortunately, by requiring application re-design, many workloads are never ported to take advantage of multiple GPUs.

Extending single GPU workload performance significantly is a laudable goal, but we must first understand if these applications will be able to leverage larger GPUs. Assuming that the biggest GPU in the market today amasses ≈50 SMs (i.e., NVIDIA's Pascal GPU contains 56), Figure 2 shows that across a benchmark set of 41 applications, described later in Section 3.2, most single GPU optimized workloads already contain sufficient data parallelism to fill GPUs that are 2–8× larger than today's biggest GPUs. For those that do not, we find that the absolute number of thread blocks (CTAs) is intentionally limited by the programmer, or that the problem dataset cannot be easily scaled up due to memory limitations. While most of the applications that do scale are unlikely to scale to thousands of GPUs across an entire datacenter without additional developer effort, moderate programmer transparent performance scalability will be attractive for applications that already contain sufficient algorithmic and data parallelism.

In this work, we examine the performance of a future 4-module NUMA GPU to understand the effects that NUMA will have when executing applications designed for UMA GPUs. We will show (not surprisingly) that when executing UMA-optimized GPU programs on a multi-socket NUMA GPU, performance does not scale. We draw on prior work and show that before optimizing GPU microarchitecture for NUMA-awareness, several software optimizations must be in place to preserve data locality. Alone, these SW improvements are not sufficient to achieve scalable performance, however, and interconnect bandwidth remains a significant hindrance on performance. To overcome this bottleneck we propose two classes of improvements to reduce the observed NUMA penalty.

First, in Section 4 we examine the ability of switch connected GPUs to dynamically repartition their ingress and egress links to provide asymmetric bandwidth provisioning when required. By using existing interconnects more efficiently, the effective NUMA bandwidth ratio of remote memory to local memory decreases, improving performance. Second, to minimize traffic on oversubscribed interconnect links, we propose in Section 5 that GPU caches need to become NUMA-aware. Traditional on-chip caches are optimized to maximize overall hit rate, thus minimizing off-chip bandwidth. However, in NUMA systems, not all cache misses have the same relative cost and thus should not be treated equally. Due to the NUMA penalty of accessing remote memory, we show that performance can be maximized by preferencing cache capacity (and thus improving hit rate) towards data that resides in slower remote NUMA zones, at the expense of data that resides in faster local NUMA zones. To this end, we propose a new NUMA-aware cache architecture that dynamically balances cache capacity based on memory system utilization. Before diving into microarchitectural details and results, we first describe the locality-optimized GPU software runtime that enables our proposed NUMA-aware architecture.

3 A NUMA-AWARE GPU RUNTIME

Current GPU software and hardware is co-designed to optimize the throughput of processors based on the assumption of uniform memory properties within the GPU. Fine grained interleaving of memory addresses across memory channels on the GPU provides implicit load balancing across memory but destroys memory locality. As a result, thread block scheduling policies need not be sophisticated to capture locality, which has already been destroyed by the memory system layout. For future NUMA GPUs to work well, both system software and hardware must be changed to achieve both functionality and performance. Before focusing on architectural changes to build a NUMA-aware GPU, we describe the GPU runtime system we employ to enable multi-socket GPU execution.

Prior work has demonstrated the feasibility of a runtime system that transparently decomposes GPU kernels into sub-kernels and executes them on multiple PCIe attached GPUs in parallel [7, 25, 30]. For example, on NVIDIA GPUs this can be implemented by intercepting and remapping each kernel call, GPU memory allocation, memory copy, and GPU-wide synchronization issued by the CUDA driver. Per-GPU memory fences must be promoted to system level and seen by all GPUs, and sub-kernel CTA identifiers must be properly managed to reflect those of the original kernel. Cabezas et al. solve these two problems by introducing code annotations and an additional source-to-source compiler which was also responsible for statically partitioning data placement and computation [7].

In our work, we follow a similar strategy but without using source-to-source translation. Unlike prior work, we are able to rely on NVIDIA's Unified Virtual Addressing [37] to allow dynamic placement of pages into memory at runtime. Similarly, technologies with cache line granularity interconnects like NVIDIA's NVLink [39] allow transparent access to remote memory without the need to modify application source code to access local or remote memory addresses. Due to these advancements, we assume that through dynamic compilation of PTX to SASS at execution time, the GPU runtime will be able to statically identify and promote system wide memory fences as well as manage sub-kernel CTA identifiers.
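To make the sub-kernel decomposition concrete, the sketch below shows one way a host-side runtime could split an intercepted kernel launch into N contiguous CTA ranges and record the offset needed to remap each sub-kernel's CTA identifiers back to the original kernel's coordinate space. This is a minimal C++ sketch of the idea described above, not the authors' implementation; the names (KernelLaunch, SubKernel, decompose_contiguous) are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical description of an intercepted single-GPU kernel launch.
struct KernelLaunch {
    uint64_t grid_ctas;   // total CTAs (thread blocks) in the original grid
    void*    args;        // opaque kernel argument buffer
};

// One sub-kernel destined for a single GPU socket. cta_base is the offset
// added to the hardware-provided CTA id so code observes its original id.
struct SubKernel {
    int      socket;      // GPU socket the sub-kernel is launched on
    uint64_t cta_base;    // first original CTA id covered by this sub-kernel
    uint64_t cta_count;   // number of contiguous CTAs assigned to it
};

// Contiguous block decomposition: each of the N sockets receives an equal
// (or nearly equal) range of consecutive CTAs, preserving the locality of
// applications where neighboring CTAs touch neighboring memory.
std::vector<SubKernel> decompose_contiguous(const KernelLaunch& k, int sockets) {
    std::vector<SubKernel> subs;
    uint64_t per_gpu = (k.grid_ctas + sockets - 1) / sockets;  // ceiling division
    for (int s = 0; s < sockets; ++s) {
        uint64_t base = static_cast<uint64_t>(s) * per_gpu;
        if (base >= k.grid_ctas) break;
        uint64_t count = std::min(per_gpu, k.grid_ctas - base);
        subs.push_back({s, base, count});
    }
    return subs;
}
```

When each sub-kernel executes, adding cta_base to the hardware block index recovers the CTA identifier the original single-GPU kernel would have seen.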
Current GPUs perform fine-grained memory interleaving at a sub-page granularity across memory channels. In a NUMA GPU this policy would destroy locality and result in 75% of all accesses being to remote memory in a 4 GPU system, an undesirable effect in NUMA systems. Similarly, round-robin page level interleaving could be utilized, similar to the Linux interleave page allocation strategy, but despite the inherent memory load balancing, this still results in 75% of memory accesses occurring over low bandwidth NUMA links. Instead we leverage UVM page migration functionality to migrate pages on-demand from system memory to local GPU memory as soon as the first access (also called first-touch allocation) is performed, as described by Arunkumar et al. [3].

On a single GPU, fine-grained dynamic assignment of CTAs to SMs is performed to achieve good load balancing. Extending this policy to a multi-socket GPU system is not possible due to the relatively high latency of passing sub-kernel launches from software to hardware. To overcome this penalty, the GPU runtime must launch a block of CTAs to each GPU socket at a coarse granularity. To encourage load balancing, each sub-kernel could be comprised of an interleaving of CTAs using modulo arithmetic. Alternatively, a single kernel can be decomposed into N sub-kernels, where N is the total number of GPU sockets in the system, assigning an equal number of contiguous CTAs to each GPU. This design choice potentially exposes workload imbalance across sub-kernels, but it also preserves the data locality present in applications where contiguous CTAs also access contiguous memory regions [3, 7].
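The first-touch policy described above can be viewed as a page-to-socket map that is populated on the first access to each virtual page, after which the page stays pinned to the socket that touched it first. The following is an illustrative C++ sketch under that reading; PageTable, PAGE_SIZE, and socket_of are our own names, not the runtime's real interface.

```cpp
#include <cstdint>
#include <unordered_map>

constexpr uint64_t PAGE_SIZE = 64 * 1024;  // assumed page size for illustration

// Maps a virtual page number to the GPU socket that owns its physical copy.
class PageTable {
public:
    // Returns the owning socket, placing the page in the requesting socket's
    // local DRAM on the first access (first-touch allocation).
    int socket_of(uint64_t vaddr, int requesting_socket) {
        uint64_t vpn = vaddr / PAGE_SIZE;
        auto it = owner_.find(vpn);
        if (it == owner_.end()) {
            owner_.emplace(vpn, requesting_socket);  // first touch: migrate here
            return requesting_socket;
        }
        return it->second;  // later accesses may be local or remote
    }
private:
    std::unordered_map<uint64_t, int> owner_;
};
```

Combined with contiguous CTA scheduling, most pages end up local to the socket whose CTAs reference them, which is what makes the software-only baseline evaluated next competitive.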
Figure 3: Performance of a 4-socket NUMA GPU relative to a single GPU and a hypothetical 4× larger (all resources scaled) single GPU, comparing traditional GPU scheduling and memory interleaving against locality-optimized GPU scheduling and memory placement (the baseline). Applications shown in grey achieve greater than 99% of performance scaling with SW-only locality optimization.

3.1 Performance Through Locality

Figure 3 shows the relative performance of a 4-socket NUMA GPU compared to a single GPU using the two possible CTA scheduling and memory placement strategies explained above. The green bars show the relative performance of traditional single-GPU scheduling and memory interleaving policies when adapted to a NUMA GPU. The light blue bars show the relative performance of using locality-optimized GPU scheduling and memory placement, consisting of contiguous block CTA scheduling and first-touch page migration, after which pages are not dynamically moved between GPUs. The Locality-Optimized solution almost always outperforms traditional GPU scheduling and memory interleaving. Without these runtime locality optimizations, a 4-socket NUMA GPU is not able to match the performance of a single GPU despite the large increase in hardware resources. Thus, using variants of prior proposals [3, 7], we consider only this locality optimized GPU runtime for the remainder of the paper.

Despite the performance improvements that can come via locality-optimized software runtimes, many applications do not scale well on our proposed NUMA GPU system. To illustrate this effect, Figure 3 shows the speedup achievable by a hypothetical (unbuildable) 4× larger GPU with a red dash. This red dash represents an approximation of the maximum theoretical performance we expect from a perfectly architected (both HW and SW) NUMA GPU system. Figure 3 sorts the applications by the gap between the relative performance of the Locality-Optimized NUMA GPU and the hypothetical 4× larger GPU. We observe that on the right side of the graph some workloads (shown in the grey box) can achieve or surpass the maximum theoretical performance. In particular, for the two right-most benchmarks, the locality optimized solution can outperform the hypothetical 4× larger GPU due to higher cache hit rates, because contiguous block scheduling is more cache friendly than traditional GPU scheduling.

However, the applications on the left side show a large gap between the Locality-Optimized NUMA design and theoretical performance. These are workloads in which either locality does not exist or the Locality-Optimized GPU runtime is not effective, resulting in a large number of remote data accesses. Because our goal is to provide scalable performance for single-GPU optimized applications, the rest of the paper describes how to close this performance gap through microarchitectural innovation. To simplify later discussion, we choose to exclude benchmarks that achieve ≥99% of the theoretical performance with software-only locality optimizations. Still, we include all benchmarks in our final results to show the overall scalability achievable with NUMA-aware multi-socket GPUs.
3.2 Simulation Methodology

To evaluate the performance of future NUMA-aware multi-socket GPUs we use a proprietary, cycle-level, trace-driven simulator for single and multi-GPU systems. Our baseline GPU, in both single GPU and multi-socket GPU configurations, approximates the latest NVIDIA Pascal architecture [43]. Each streaming multiprocessor (SM) is modeled as an in-order processor with multiple levels of cache hierarchy containing private, per-SM L1 caches and a multi-banked, shared L2 cache. Each GPU is backed by local on-package high bandwidth memory [23]. Our multi-socket GPU systems contain two to eight of these GPUs interconnected through a high bandwidth switch as shown in Figure 1. Table 1 provides a more detailed overview of the simulation parameters.

Parameter: Value(s)
  Num of GPU sockets: 4
  Total number of SMs: 64 per GPU socket
  GPU frequency: 1 GHz
  Max number of warps: 64 per SM
  Warp scheduler: Greedy then Round Robin
  L1 cache: Private, 128 KB per SM, 128 B lines, 4-way, write-through, GPU-side SW-based coherent
  L2 cache: Shared, banked, 4 MB per socket, 128 B lines, 16-way, write-back, mem-side non-coherent
  GPU–GPU interconnect: 128 GB/s per socket (64 GB/s each direction), 8 lanes 8 B wide per direction, 128-cycle latency
  DRAM bandwidth: 768 GB/s per GPU socket
  DRAM latency: 100 ns

Table 1: Simulation parameters for evaluation of single and multi-socket GPU systems.
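For readers who want to set up a comparable study in their own simulator, the Table 1 parameters can be captured in a small configuration record. The structure below is only a sketch of such a record; the field names are ours, not the proprietary simulator's.

```cpp
#include <cstdint>

// Baseline multi-socket GPU configuration mirroring Table 1.
struct NumaGpuConfig {
    int      gpu_sockets        = 4;
    int      sms_per_socket     = 64;
    int      gpu_freq_mhz       = 1000;   // 1 GHz
    int      max_warps_per_sm   = 64;     // greedy-then-round-robin scheduler
    // L1: private, write-through, GPU-side software coherent
    uint32_t l1_kb_per_sm       = 128;
    uint32_t l1_line_bytes      = 128;
    uint32_t l1_ways            = 4;
    // L2: shared, banked, write-back, memory-side non-coherent
    uint32_t l2_mb_per_socket   = 4;
    uint32_t l2_line_bytes      = 128;
    uint32_t l2_ways            = 16;
    // Inter-GPU link: 8 lanes x 8 GB/s per direction = 64 GB/s each way
    uint32_t link_lanes_per_dir = 8;
    uint32_t lane_gb_per_s      = 8;
    uint32_t link_latency_cyc   = 128;
    // Local DRAM
    uint32_t dram_gb_per_s      = 768;
    uint32_t dram_latency_ns    = 100;
};
```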
GPU coherence protocols are not one-size-fits-all [49, 54, 63]. This work examines clusters of large discrete GPUs, but smaller, more tightly integrated GPU–CPU designs exist today as systems on chip (SoC) [12, 33]. In these designs GPUs and CPUs can share a single memory space and last-level cache, necessitating a compatible GPU–CPU coherence protocol. However, closely coupled CPU–GPU solutions are not likely to be ideal candidates for GPU-centric HPC workloads. Discrete GPUs each dedicate tens of billions of transistors to throughput computing, while integrated solutions dedicate only a fraction of the chip area. While discrete GPUs are also starting to integrate more closely with some CPU coherence protocols [1, 63], PCIe attached discrete GPUs (where integrated coherence is not possible) are likely to continue dominating the market, thanks to broad compatibility between CPU and GPU vendors.

This work examines the scalability of one such cache coherence protocol used by PCIe attached discrete GPUs. The protocol is optimized for simplicity and does not require hardware coherence support at any level of the cache hierarchy. SM-side private L1 caches achieve coherence through compiler inserted cache control (flush) operations, while the memory-side L2 caches do not require coherence support. While software-based coherence may seem heavy handed compared to fine grained MOESI-style hardware coherence, many GPU programming models (in addition to C++ 2011) are moving towards scoped synchronization, where explicit software acquire and release operations must be used to enforce coherence. Without the use of these operations, coherence is not globally guaranteed, and thus maintaining fine grain CPU-style MOESI coherence (via either directories or broadcast) may be an unnecessary burden.

We study the scalability of multi-socket NUMA GPUs using 41 workloads taken from a range of production codes based on the HPC CORAL benchmarks [28], graph applications from Lonestar [45], HPC applications from Rodinia [9], and several other in-house CUDA benchmarks. This set of workloads covers a wide spectrum of GPU applications used in machine learning, fluid dynamics, image manipulation, graph traversal, and scientific computing. Each of our benchmarks is hand selected to identify the region of interest deemed representative for each workload, which may be as small as a single kernel containing a tight inner loop or as large as several thousand kernel invocations. We run each benchmark to completion for the determined region of interest. Table 2 provides both the memory footprint per application as well as the average number of active CTAs in the workload (weighted by the time spent in each kernel) to provide a representation of how many parallel thread blocks (CTAs) are generally available during workload execution.

Benchmark: Time-weighted Average CTAs, Memory Footprint (MB)
  ML-GoogLeNet-cudnn-Lev2: 6272, 1205
  ML-AlexNet-cudnn-Lev2: 1250, 832
  ML-OverFeat-cudnn-Lev3: 1800, 388
  ML-AlexNet-cudnn-Lev4: 1014, 32
  ML-AlexNet-ConvNet2: 6075, 97
  Rodinia-Backprop: 4096, 160
  Rodinia-Euler3D: 1008, 25
  Rodinia-BFS: 1954, 38
  Rodinia-Gaussian: 2599, 78
  Rodinia-Hotspot: 7396, 64
  Rodinia-Kmeans: 3249, 221
  Rodinia-Pathfinder: 4630, 1570
  Rodinia-Srad: 16384, 98
  HPC-SNAP: 200, 744
  HPC-Nekbone-Large: 5583, 294
  HPC-MiniAMR: 76033, 2752
  HPC-MiniContact-Mesh1: 250, 21
  HPC-MiniContact-Mesh2: 15423, 257
  HPC-Lulesh-Unstruct-Mesh1: 435, 19
  HPC-Lulesh-Unstruct-Mesh2: 4940, 208
  HPC-AMG: 241549, 3744
  HPC-RSBench: 7813, 19
  HPC-MCB: 5001, 162
  HPC-NAMD2.9: 3888, 88
  HPC-RabbitCT: 131072, 524
  HPC-Lulesh: 12202, 578
  HPC-CoMD: 3588, 319
  HPC-CoMD-Wa: 13691, 393
  HPC-CoMD-Ta: 5724, 394
  HPC-HPGMG-UVM: 10436, 1975
  HPC-HPGMG: 10506, 1571
  Lonestar-SP: 75, 8
  Lonestar-MST-Graph: 770, 86
  Lonestar-MST-Mesh: 895, 75
  Lonestar-SSSP-Wln: 60, 21
  Lonestar-DMR: 82, 248
  Lonestar-SSSP-Wlc: 163, 21
  Lonestar-SSSP: 1046, 38
  Other-Stream-Triad: 699051, 3146
  Other-Optix-Raytracing: 3072, 87
  Other-Bitcoin-Crypto: 60, 5898

Table 2: Time-weighted average number of thread blocks and application memory footprint.

4 ASYMMETRIC INTERCONNECTS

Figure 4(a) shows a switch connected GPU with symmetric and static link bandwidth assignment. Each link consists of an equal number of uni-directional high-speed lanes in both directions, collectively comprising a symmetric bi-directional link.
Figure 4: Example of dynamic link assignment to improve interconnect efficiency. (a) Symmetric bandwidth assignment [75% BW utilization]; (b) asymmetric bandwidth assignment [100% BW utilization].

Figure 5: Normalized link bandwidth profile for HPC-HPGMG-UVM showing asymmetric link utilization between GPUs and within a GPU. Vertical black dotted lines indicate kernel launch events.

Traditional static design-time link capacity assignment is very common and has several advantages. For example, only one type of I/O circuitry (egress drivers or ingress receivers) along with only one type of control logic is implemented at each on-chip link interface. Moreover, the multi-socket switches result in simpler designs that can easily support statically provisioned bandwidth requirements. On the other hand, multi-socket link bandwidth utilization can have a large influence on overall system performance. Static partitioning of bandwidth, when application needs are dynamic, can leave performance on the table. Because I/O bandwidth is a limited and expensive system resource, NUMA-aware interconnect designs must look for innovations that can keep wire and I/O utilization high.

In multi-socket NUMA GPU systems, we observe that many applications have different utilization of egress and ingress channels, both on a per GPU-socket basis and during different phases of execution. For example, Figure 5 shows a link utilization snapshot over time for the HPC-HPGMG-UVM benchmark running on a SW locality-optimized 4-socket NUMA GPU. Vertical dotted black lines represent kernel invocations that are split across the 4 GPU sockets. Several small kernels have negligible interconnect utilization. However, for the later larger kernels, GPU0 and GPU2 fully saturate their ingress links, while GPU1 and GPU3 fully saturate their egress links. At the same time, GPU0 and GPU2 are underutilizing their egress links, and GPU1 and GPU3 their ingress links.

In many workloads a common scenario has CTAs writing to the same memory range at the end of a kernel (i.e., parallel reductions, data gathering). For CTAs running on one of the sockets, GPU0 for example, these memory references are local and do not produce any traffic on the inter-socket interconnect. However, CTAs dispatched to other GPUs must issue remote memory writes, saturating their egress links while their ingress links remain underutilized, and causing ingress traffic on GPU0. Such communication patterns typically utilize only 50% of the available interconnect bandwidth. In these cases, dynamically increasing the number of ingress lanes for GPU0 (by reversing the direction of egress lanes) and switching the direction of ingress lanes for GPUs 1–3 can substantially improve the achievable interconnect bandwidth. Motivated by these findings, we propose to dynamically control multi-socket link bandwidth assignments on a per-GPU basis, resulting in dynamic asymmetric link capacity assignments as shown in Figure 4(b).

To evaluate this proposal, we model point-to-point links containing multiple lanes, similar to PCIe [47] or NVLink [43]. In these links, 8 lanes with 8 GB/s capacity per lane yield an aggregate bandwidth of 64 GB/s in each direction. We propose replacing uni-directional lanes with bi-directional lanes to which we apply an adaptive link bandwidth allocation mechanism that works as follows. For each link in the system, at kernel launch the links are always reconfigured to contain symmetric link bandwidth with 8 lanes per direction. During kernel execution the link load balancer periodically samples the saturation status of each link. If the lanes in one direction are not saturated, while the lanes in the opposite direction are 99% saturated, the link load balancer reconfigures and reverses the direction of one of the unsaturated lanes after quiescing all packets on that lane.

This sample and reconfigure process stops only when directional utilization is no longer oversubscribed or all but one lane is configured in a single direction. If both ingress and egress links are saturated while in an asymmetric configuration, links are reconfigured back toward a symmetric configuration to encourage global bandwidth equalization. While this process may sound complex, the circuitry for dynamically turning high speed single ended links around in just tens of cycles or less is already in use by modern high bandwidth memory interfaces, such as GDDR, where the same set of wires is used for both memory reads and writes [16]. In high speed signaling implementations, the necessary phase–delay lock loop resynchronization can occur while data is in flight, eliminating the need to idle the link during this long latency (microseconds) operation, provided upcoming link turn operations can be sufficiently projected ahead of time, such as on a fixed interval.
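A minimal control loop for the adaptive lane allocation could look like the sketch below. It is our own illustration of the policy in the text, not the evaluated hardware: it assumes per-direction utilization counters exposed by each link, starts symmetric at kernel launch, flips one lane from an unsaturated direction to a 99%-saturated one, and drifts back toward a symmetric split when both directions saturate while the link is asymmetric.

```cpp
struct LinkState {
    int    egress_lanes  = 8;    // lanes currently driving data out (8/8 at kernel launch)
    int    ingress_lanes = 8;    // lanes currently receiving data
    double egress_util   = 0.0;  // sampled utilization, 0.0 .. 1.0
    double ingress_util  = 0.0;
};

constexpr double kSaturated = 0.99;  // utilization treated as saturated

// Called once per sample period (e.g. every 5K cycles) for each GPU's link.
// Returns the number of lanes whose direction must be reversed; each turn
// requires quiescing the lane (roughly 100 cycles in the evaluation).
int rebalance(LinkState& link) {
    bool eg_sat = link.egress_util  >= kSaturated;
    bool in_sat = link.ingress_util >= kSaturated;

    if (eg_sat && !in_sat && link.ingress_lanes > 1) {
        link.ingress_lanes--; link.egress_lanes++;   // donate an ingress lane
        return 1;
    }
    if (in_sat && !eg_sat && link.egress_lanes > 1) {
        link.egress_lanes--; link.ingress_lanes++;   // donate an egress lane
        return 1;
    }
    if (eg_sat && in_sat && link.egress_lanes != link.ingress_lanes) {
        // Both directions saturated while asymmetric: step back toward 8/8.
        if (link.egress_lanes > link.ingress_lanes) { link.egress_lanes--; link.ingress_lanes++; }
        else                                        { link.ingress_lanes--; link.egress_lanes++; }
        return 1;
    }
    return 0;  // leave the current allocation in place
}
```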
Figure 6: Relative speedup of the dynamic link adaptivity compared to the baseline architecture when varying the sample time (1K, 5K, 10K, and 50K cycles) and assuming a switch time of 100 cycles. In red, we also provide the speedup achievable by doubling link bandwidth to a static 256 GB/s inter-GPU interconnect.

4.1 Results and Discussion

Two important parameters affect the performance of our proposed mechanism: (i) Sample Time, the frequency at which the scheme samples for a possible reconfiguration, and (ii) Switch Time, the cost of turning the direction of an individual lane.
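As a rough sanity check (our own back-of-envelope estimate, not a number reported by the evaluation, and assuming at most one lane turn per link per sample period), the fraction of a lane's time spent quiesced for reconfiguration is bounded by the switch time over the sample time:

$$\frac{T_{\mathrm{switch}}}{T_{\mathrm{sample}}} = \frac{100\ \text{cycles}}{5000\ \text{cycles}} = 2\%,$$

so at the design point used below, turning lanes consumes only a small slice of link time.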
Figure 6 shows the performance improvement, compared to our locality-optimized GPU, when exploring different values of the Sample Time (indicated by the green bars) and assuming a Switch Time of 100 cycles. The red bars in Figure 6 provide an upper bound on the performance speedup obtained by doubling the available interconnect bandwidth to 256 GB/s. For workloads on the right of the figure, doubling the link bandwidth has little effect, indicating that a dynamic link policy will also show little improvement due to low GPU–GPU interconnect bandwidth demand. On the left side of the figure, for some applications where improved interconnect bandwidth has a large effect, dynamic lane switching can improve application performance by as much as 80%. For some benchmarks like Rodinia-Euler3D, HPC-AMG, and HPC-Lulesh, doubling the link bandwidth provides a 2× speedup, while our proposed dynamic link assignment mechanism is not able to significantly improve performance. These workloads saturate both link directions, so there is no opportunity to provide additional bandwidth by turning links around.

Using a moderate 5K cycle sample time, the dynamic link policy can improve performance by 14% on average over static bandwidth partitioning. If the link load balancer samples too infrequently, application dynamics can be missed and the performance improvement is reduced. However, if the link is reconfigured too frequently, bandwidth is lost due to the overhead of turning the link. While we have assumed a pessimistic link turn time of 100 cycles, we performed sensitivity studies showing that even if the link turn time were increased to 500 cycles, our dynamic policy loses less than 2% in performance. At the same time, using a faster lane switch (10 cycles) does not significantly improve performance over a 100 cycle link turn time. The link turnaround times of modern high-speed on-board links such as GDDR5 [16] are about 8 ns including both link and internal DRAM turn-around latency, which is less than 10 cycles at 1 GHz.

Our results demonstrate that asymmetric link bandwidth allocation can be very attractive when inter-socket interconnect bandwidth is constrained by the number of on-PCB wires (and thus total link bandwidth). The primary drawback of this solution is that both types of interface circuitry (TX and RX) and logic must be implemented for each lane in both the GPU and switch interfaces. We conducted an analysis of the potential cost of doubling the amount of I/O circuitry and logic based on a proprietary state of the art GPU I/O implementation. Our results show that doubling this interface area increases total GPU area by less than 1% while yielding a 12% improvement in average interconnect bandwidth and a 14% application performance improvement. One additional caveat worth noting is that the proposed asymmetric link mechanism optimizes link bandwidth in a given direction for each individual link, while the total switch bandwidth remains constant.

5 NUMA-AWARE CACHE MANAGEMENT

Section 4 showed that inter-socket bandwidth is an important factor in achieving scalable NUMA GPU performance. Unfortunately, because either the outgoing or incoming links must be underutilized for us to reallocate that bandwidth to the saturated link, dynamic link rebalancing yields minimal gains when both incoming and outgoing links are saturated. To improve performance in situations where dynamic link balancing is ineffective, system designers can either increase link bandwidth, which is very expensive, or decrease the amount of traffic that crosses the low bandwidth communication channels. To decrease off-chip memory traffic, architects typically turn to caches to capture locality.
Figure 7: Potential L2 cache organizations to balance capacity between remote and local NUMA memory systems: (a) memory-side local-only L2, (b) static partitioning with a remote cache (R$), (c) shared coherent L1+L2, and (d) NUMA-aware L1+L2. The cache partitioning algorithm summarized in (d): 0) allocate half of the ways for local and half for remote data; 1) estimate incoming inter-GPU bandwidth and monitor local DRAM bandwidth; 2) if inter-GPU bandwidth is saturated and DRAM bandwidth is not, increment the remote ways and decrement the local ways; 3) if DRAM bandwidth is saturated and inter-GPU bandwidth is not, decrement the remote ways and increment the local ways; otherwise re-equalize the local and remote ways, and repeat after each sample period.

GPU cache hierarchies differ from traditional CPU hierarchies in that they typically do not implement strong hardware coherence protocols [55]. They also differ from CPU protocols in that caches may be either processor side (where some form of coherence is typically necessary) or memory side (where coherence is not necessary). As described in Table 1 and Figure 7(a), a GPU today is typically composed of relatively large SW managed coherent L1 caches located close to the SMs, while a relatively small, distributed, non-coherent memory-side L2 cache resides close to the memory controllers. This organization works well for GPUs because their SIMT processor designs often allow for significant coalescing of requests to the same cache line, so having large L1 caches reduces the need for global crossbar bandwidth. The memory-side L2 caches do not need to participate in the coherence protocol, which reduces system complexity.

5.1 Design Considerations

In NUMA designs, remote memory references occurring across low bandwidth NUMA interconnections result in poor performance, as shown in Figure 3. Similarly, in NUMA GPUs, utilizing traditional memory-side L2 caches (which depend on fine grained memory interleaving for load balancing) is a bad decision. Because memory-side caches only cache accesses that originate in their local memory, they cannot cache memory from other NUMA zones and thus cannot reduce NUMA interconnect traffic. Previous work has proposed that GPU L2 cache capacity should be split between memory-side caches and a new processor-side L1.5 cache that is an extension of the GPU L1 caches [3] to enable caching of remote data, shown in Figure 7(b). By balancing L2 capacity between memory-side and remote caches (R$), this design limits the need for extending expensive coherence operations (invalidations) into the entire L2 cache while still minimizing crossbar or interconnect bandwidth.

Flexibility: Designs that statically allocate cache capacity to local memory and remote memory, in any balance, may achieve reasonable performance in specific instances but they lack flexibility. Much like application phasing was shown to affect NUMA bandwidth consumption, the ability to dynamically share cache capacity between local and remote memory has the potential to improve performance in several situations. First, when application phasing results in some GPU sockets primarily accessing data locally while others are accessing data remotely, a fixed partitioning of cache capacity is guaranteed to be sub-optimal. Second, while we show that most applications will be able to completely fill large NUMA GPUs, this may not always be the case. GPUs within the datacenter are being virtualized and there is continuous work to improve concurrent execution of multiple kernels and processes within a single GPU [15, 32, 46, 50]. If a large NUMA GPU is sub-partitioned, it is intuitive that system software attempts to partition it along the NUMA boundaries (even within a single GPU socket) to improve the locality of small GPU kernels. To effectively capture locality in these situations, NUMA-aware GPUs need to be able to dynamically re-purpose cache capacity at runtime, rather than be statically partitioned at design time.

Coherence: To date, discrete GPUs have not moved their memory-side caches to the processor side because the overhead of cache invalidation (due to coherence) is an unnecessary performance penalty. For a single socket GPU with a uniform memory system, there is little performance advantage to implementing L2 caches as processor side caches. Still, in a multi-socket NUMA design, the performance tax of extending coherence into the L2 caches is offset by the fact that remote memory accesses can now be cached locally, and may therefore be justified. Figure 7(c) shows a configuration with a coherent L2 cache where remote and local data contend for L2 capacity as extensions of the L1 caches, implementing an identical coherence policy.
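As a concrete (and deliberately simplified) picture of the software coherence being extended into the L2 here, the sketch below models a cache that is kept coherent only by bulk invalidation at synchronization points, in the spirit of the compiler-inserted cache control (flush) operations described in Section 3.2. It is our own illustration, not the production protocol; class and method names are assumptions.

```cpp
#include <cstdint>
#include <unordered_set>

// Simplified software-coherent cache: lines are cached freely, and the only
// coherence action is a bulk invalidate at acquire/kernel-launch boundaries.
class SwCoherentCache {
public:
    bool lookup(uint64_t line_addr) const { return lines_.count(line_addr) != 0; }
    void fill(uint64_t line_addr)         { lines_.insert(line_addr); }
    // Compiler-inserted cache control (flush) operation: discard everything so
    // subsequent accesses observe other sockets' writes. This falsely evicts
    // data that was never written remotely, which is the price paid for
    // avoiding fine grain hardware coherence.
    void acquire_invalidate()             { lines_.clear(); }
private:
    std::unordered_set<uint64_t> lines_;
};
```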
Dynamic Partitioning: Building upon coherent GPU L2 caches, we posit that, while conceptually simple, allowing both remote and local memory accesses to contend for cache capacity (in both the L1 and L2 caches) in a NUMA system is flawed. In UMA systems it is well known that performance is maximized by optimizing for cache hit rate, thus minimizing off-chip memory system bandwidth. However, in NUMA systems, not all cache misses have the same relative performance impact. A cache miss to a local memory address has a smaller cost (in terms of both latency and bandwidth) than a cache miss to a remote memory address. Thus, it should be beneficial to dynamically skew cache allocation to preference caching remote memory over local data when it is determined that the system is bottlenecked on NUMA bandwidth.

To minimize inter-GPU bandwidth in multi-socket GPU systems we propose a NUMA-aware cache partitioning algorithm, with the cache organization and a brief summary shown in Figure 7(d). Similar to our interconnect balancing algorithm, at initial kernel launch (after GPU caches have been flushed for coherence purposes) we allocate one half of the cache ways for local memory and the remaining ways for remote data (Step 0). After executing for a 5K cycle period, we sample the average bandwidth utilization on local memory and estimate the GPU-socket's incoming read request rate by looking at the outgoing request rate multiplied by the response packet size. By using the outgoing request rate to estimate the incoming bandwidth, we avoid situations where incoming writes saturate our link bandwidth, falsely indicating that we should preference remote data caching. Projected link utilization above 99% is considered to be bandwidth saturated (Step 1). In cases where the interconnect bandwidth is saturated but local memory bandwidth is not, the partitioning algorithm attempts to reduce remote memory traffic by re-assigning one way from the group of local ways to the group of remote ways (Step 2). Similarly, if the local memory bandwidth is saturated and inter-GPU bandwidth is not, the policy re-allocates one way from the remote group to the group of local ways (Step 3). To minimize the impact on cache design, all ways are consulted on lookup, allowing lazy eviction of data when the way partitioning changes. In cases where both the interconnect and local memory bandwidth are saturated, or neither is, the allocation is re-balanced toward an equal number of local and remote ways (Step 4), and the sampling and adjustment process repeats at every sample period so that the partitioning tracks application phase behavior (Step 5).
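The way-partitioning policy above can be expressed as a small controller. The sketch below is only an illustration under the stated parameters (a 16-way L2 as in Table 1, 99% projected utilization treated as saturated, one way moved per 5K-cycle sample); the names and the bandwidth-estimation inputs are assumptions, not the hardware interface.

```cpp
struct WaySplit {
    int local_ways  = 8;   // Step 0: start with half of the 16 ways local
    int remote_ways = 8;   // and half remote (after the kernel-launch flush)
};

constexpr double kSaturated = 0.99;

// Called every sample period (5K cycles). interGpuUtil is the *projected*
// incoming link utilization, derived from the outgoing read-request rate
// multiplied by the response packet size (Step 1); dramUtil is the measured
// local DRAM bandwidth utilization.
void adjust_partition(WaySplit& w, double interGpuUtil, double dramUtil) {
    bool link_sat = interGpuUtil >= kSaturated;
    bool dram_sat = dramUtil     >= kSaturated;

    if (link_sat && !dram_sat && w.local_ways > 1) {
        w.local_ways--; w.remote_ways++;          // Step 2: favor remote data
    } else if (dram_sat && !link_sat && w.remote_ways > 1) {
        w.remote_ways--; w.local_ways++;          // Step 3: favor local data
    } else if (link_sat == dram_sat && w.local_ways != w.remote_ways) {
        // Step 4: both (or neither) bottlenecked -- drift back toward 8/8.
        if (w.local_ways > w.remote_ways) { w.local_ways--; w.remote_ways++; }
        else                              { w.remote_ways--; w.local_ways++; }
    }
    // Step 5: repeat at the next sample. Re-partitioning takes effect lazily,
    // since all ways are still consulted on lookup and data is evicted lazily.
}
```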
Figure 8: Performance of 4-socket NUMA-aware cache partitioning, compared to a memory-side local-only L2, static partitioning, and shared coherent L1+L2 caches.

5.2 Results

Figure 8 compares the performance of 4 different cache configurations in our 4-socket NUMA GPU. Our baseline is a traditional GPU with memory-side, local-only, write-back L2 caches. To compare against prior work [3] we provide a 50–50 static partitioning where the L2 cache budget is split between a GPU-side coherent remote cache, which contains only remote data, and the memory-side L2, which contains only local data. In our 4-socket NUMA GPU, static partitioning improves performance by 54% on average, although for some benchmarks it hurts performance by as much as 10%, namely for workloads that have negligible inter-socket memory traffic. We also show the results for GPU-side coherent L1 and L2 caches where both local and remote data contend for capacity. On average, this solution outperforms static cache partitioning significantly despite incurring additional flushing overhead due to cache coherence.
Finally, our proposed NUMA-aware cache partitioning policy is shown in dark grey. Due to its ability to dynamically adapt the capacity of both the L2 and L1 to optimize performance when backed by NUMA memory, it is the highest performing cache configuration. By examining simulation results we find that for workloads on the left side of Figure 8, which fully saturate the inter-GPU bandwidth, the NUMA-aware dynamic policy configures the L1 and L2 caches to be used primarily as remote caches. However, workloads on the right side of the figure tend to have good GPU-socket memory locality, and thus prefer that the L1 and L2 caches store primarily local data. NUMA-aware cache partitioning is able to flexibly adapt to varying memory access profiles and can improve average NUMA GPU performance by 76% compared to traditional memory-side L2 caches, and by 22% compared to previously proposed static cache partitioning, despite incurring additional coherence overhead.

When extending SW-controlled coherence into the GPU L2 caches, L1 coherence operations must also be applied to the GPU L2 caches. Using bulk software invalidation to maintain coherence is simple to implement, but it imposes a performance penalty when falsely evicting data that did not require invalidation. The overhead of this invalidation depends on both the frequency of the invalidations and the aggregate cache capacity invalidated. Extending the L1 invalidation protocol into the shared L2, and then across multiple GPUs, increases both the capacity affected and the frequency of the invalidation events.

Figure 9: Performance overhead of extending current GPU software-based coherence into the GPU L2 caches (NUMA-aware coherent L1+L2 with and without L2 invalidations).

To understand the impact of these invalidations, we evaluate hypothetical L2 caches which can ignore the cache invalidation events, thus representing the upper limit on performance (no coherence evictions ever occur) that could be achieved by using a finer granularity HW-coherence protocol. Figure 9 shows the impact these invalidation operations have on application performance. While significant for some applications, on average the SW-based cache invalidation overhead is only 10%, even when extended across all GPU-socket L2 caches. So while fine grain HW coherence protocols may improve performance, the magnitude of their improvement must be weighed against their hardware implementation complexity. While in the studies above we assumed a write-back policy in the L2 caches, as a sensitivity study we also evaluated the effect of using a write-through cache policy to mirror the write-through L1 cache policy. Our findings indicate that write-back L2 outperforms write-through L2 by 9% on average in our NUMA-GPU design due to the decrease in total inter-GPU write bandwidth.

Figure 10: Final NUMA-aware GPU performance compared to a single GPU and a hypothetical 4× larger single GPU with scaled resources, showing the baseline 4-socket NUMA GPU, the asymmetric inter-GPU interconnect alone, and the asymmetric interconnect combined with NUMA-aware cache partitioning.

6 DISCUSSION

Combined Improvement: Sections 4 and 5 provide two techniques aimed at more efficiently utilizing scarce NUMA bandwidth within future NUMA GPU systems. The proposed methods for dynamic interconnect balancing and NUMA-aware caching are orthogonal and can be applied in isolation or in combination. Dynamic interconnect balancing has an implementation simplicity advantage in that the system level changes to enable this feature are isolated from the larger GPU design. Conversely, enabling NUMA-aware GPU caching based on interconnect utilization requires changes to both the physical cache architecture and the GPU coherence protocol.
capacityinvalidated.ExtendingtheL1invalidationprotocolintothe Becausethesetwofeaturestargetthesameproblem,whenem- sharedL2,andthenacrossmultipleGPUs,increasesthecapacity ployedtogethertheireffectsarenotstrictlyadditive.Figure10shows affectedandfrequencyoftheinvalidationevents. theoverallimprovementNUMA-awareGPUscanachievewhenap- Tounderstandtheimpactoftheseinvalidations,weevaluatehy- plyingbothtechniquesinparallel.ForbenchmarkssuchasCoMD, potheticalL2cacheswhichcanignorethecacheinvalidationevents; thesefeaturescontributenearlyequallytotheoverallimprovement, thusrepresentingtheupperlimitonperformance(nocoherenceevic- butforotherssuchasML-AlexNet-cudnn-Lev2orHPC-MST-Mesh1, tionseveroccur)thatcouldbeachievedbyusingafinergranularity interconnectimprovementsorcachingaretheprimarycontributor HW-coherenceprotocol.Figure9showstheimpacttheseinvalida- respectively.Onaverage,weobservethatwhencombinedwesee tionoperationshaveonapplicationperformance.Whilesignificant 2.1×improvementoverasingleGPUand80%overthebaseline
