Beyond the Socket: NUMA-Aware GPUs

Ugljesa Milic†∓, Oreste Villa‡, Evgeny Bolotin‡, Akhil Arunkumar∗, Eiman Ebrahimi‡, Aamer Jaleel‡, Alex Ramirez⋆, David Nellans‡
†Barcelona Supercomputing Center (BSC), ∓Universitat Politècnica de Catalunya (UPC), ‡NVIDIA, ∗Arizona State University, ⋆Google
[email protected], {ovilla,ebolotin,eebrahimi,ajaleel,dnellans}@nvidia.com, [email protected], [email protected]

ABSTRACT
GPUs achieve high throughput and power efficiency by employing many small single instruction multiple thread (SIMT) cores. To minimize scheduling logic and performance variance they utilize a uniform memory system and leverage strong data parallelism exposed via the programming model. With Moore's law slowing, for GPUs to continue scaling performance (which largely depends on SIMT core count) they are likely to embrace multi-socket designs where transistors are more readily available. However, when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited, allowing GPU sockets to dynamically optimize their individual interconnect and cache policies and minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2-, 4-, and 8-socket designs respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for scaling GPU performance beyond a single socket.

CCS CONCEPTS
• Computing methodologies → Graphics processors; • Computer systems organization → Single instruction, multiple data;

KEYWORDS
Graphics Processing Units, Multi-socket GPUs, NUMA Systems

ACM Reference format:
Ugljesa Milic, Oreste Villa, Evgeny Bolotin, Akhil Arunkumar, Eiman Ebrahimi, Aamer Jaleel, Alex Ramirez, David Nellans. 2017. Beyond the Socket: NUMA-Aware GPUs. In Proceedings of MICRO-50, Cambridge, MA, USA, October 14–18, 2017, 13 pages. https://doi.org/10.1145/3123939.3124534

1 INTRODUCTION
In the last decade GPU computing has transformed the high performance computing, machine learning, and data analytics fields that were previously dominated by CPU-based installations [27, 34, 53, 61]. Many systems now rely on a combination of GPUs and CPUs to leverage high throughput data parallel GPUs with latency critical execution occurring on the CPUs. In part, GPU-accelerated computing has been successful in these domains because of native support for data parallel programming languages [24, 40] that reduce programmer burden when trying to scale programs across ever growing datasets.

Nevertheless, with GPUs nearing the reticle limits for maximum die size and the transistor density growth rate slowing down [5], developers looking to scale the performance of their single GPU programs are in a precarious position. Multi-GPU programming models support explicit programming of two or more GPUs, but it is challenging to leverage mechanisms such as Peer-2-Peer access [36] or a combination of MPI and CUDA [42] to manage multiple GPUs. These programming extensions enable programmers to employ more than one GPU for high throughput computation, but require re-writing of traditional single GPU applications, slowing their adoption.
High port-count PCIe switches are now readily available, and the PCI-SIG roadmap is projecting PCIe 5.0 bandwidth to reach 128 GB/s in 2019 [6]. At the same time, GPUs are starting to expand beyond the traditional PCIe peripheral interface to enable more efficient interconnection protocols between both GPUs and CPUs, such as AMD's Infinity Fabric or NVIDIA's Scalable Link Interface and NVLink [2, 29, 35, 39, 44]. Future high bandwidth GPU to GPU interconnects, possibly using improved communication protocols, may lead to system designs with closely coupled groups of GPUs that can efficiently share memory at fine granularity.

The onset of such multi-socket GPUs would provide a pivot point for GPU and system vendors. On one hand, vendors can continue to expose these GPUs as individual GPUs and force developers to use multiple programming paradigms to leverage these multiple GPUs. On the other, vendors could expose multi-socket designs as a single non-uniform memory access (NUMA) GPU resource, as shown in Figure 1. By extending the single GPU programming model to multi-socket GPUs, applications can scale beyond the bounds of Moore's law, while simultaneously retaining the programming interface to which GPU developers have become accustomed.

Figure 1: The evolution of GPUs from traditional discrete PCIe devices to single logical, multi-socketed accelerators utilizing a switched interconnect.

Figure 2: Percentage of workloads that are able to fill future larger GPUs (average number of concurrent thread blocks exceeds number of SMs in the system). Data points: 41/41 workloads at 56 SMs (current GPU), 37/41 at 112 SMs (2× GPU), 35/41 at 224 SMs (4× GPU), 33/41 at 448 SMs (8× GPU), and 31/41 at 896 SMs (16× GPU).
Several groups have previously examined aggregating multiple GPUs together under a single programming model [7, 30]; however, this work was done in an era where GPUs had limited memory addressability and relied on high latency, low bandwidth CPU-based PCIe interconnects. As a result, prior work focused primarily on improving the multi-GPU programming experience rather than achieving scalable performance. Building upon this work, we propose a multi-socket NUMA-aware GPU architecture and runtime that aggregates multiple GPUs into a single programmer transparent logical GPU. We show that in the era of unified virtual addressing [37], cache line addressable high bandwidth interconnects [39], and dedicated GPU and CPU socket PCB designs [29], scalable multi-GPU performance may be achievable using existing single GPU programming models. This work makes the following contributions:

• We show that traditional NUMA memory placement and scheduling policies are not sufficient for multi-socket GPUs to achieve performance scalability. We then demonstrate that inter-socket bandwidth will be the primary performance limiter in future NUMA GPUs.
• By exploiting program phase behavior we show that inter-socket links (and thus bandwidth) should be dynamically and adaptively reconfigured at runtime to maximize link utilization. Moreover, we show that link policy must be determined on a per-GPU basis, as global policies fail to capture per-GPU phase behavior.
• We show that both the GPU L1 and L2 caches should be made NUMA-aware and dynamically adapt their caching policy to minimize NUMA effects. We demonstrate that in NUMA GPUs, extending existing GPU cache coherence protocols across multiple sockets is a good design choice, despite the overheads.
• We show that multi-socket NUMA-aware GPUs can allow traditional GPU programs to scale efficiently to as many as 8 GPU sockets, providing significant headroom before developers must re-architect applications to obtain additional performance.

2 MOTIVATION AND BACKGROUND
Over the last decade single GPU performance has scaled thanks to significant growth in per-GPU transistor count and DRAM bandwidth. For example, in 2010 NVIDIA's Fermi GPU integrated 1.95B transistors on a 529 mm² die, with 180 GB/s of DRAM bandwidth [13]. In 2016 NVIDIA's Pascal GPU contained 12B transistors within a 610 mm² die, while relying on 720 GB/s of memory bandwidth [41]. Unfortunately, transistor density scaling is slowing significantly and integrated circuit manufacturers are not providing roadmaps beyond 7 nm. Moreover, GPU die sizes, which have also been slowly but steadily growing generationally, are expected to slow down due to limitations in lithography and manufacturing cost.

Without either larger or denser dies, GPU architects must turn to alternative solutions to significantly increase GPU performance. Recently 3D die-stacking has seen significant interest due to its successes with high bandwidth DRAM [23]. Unfortunately 3D die-stacking still has significant engineering challenges related to power delivery, energy density, and cooling [60] when employed in power hungry, maximal die-sized chips such as GPUs. Thus we propose that GPU manufacturers are likely to re-examine a tried and true solution from the CPU world, multi-socket GPUs, to scale GPU performance while maintaining the current ratio of floating point operations per second (FLOPS) to DRAM bandwidth.

Multi-socket GPUs are enabled by the evolution of GPUs from external peripherals to central computing components that are considered at system design time. GPU optimized systems now employ custom PCB designs that accommodate high pin count socketed GPU modules [35] with inter-GPU interconnects resembling QPI or HyperTransport [17, 21, 39]. Despite the rapid improvement in hardware capabilities, these systems have continued to expose the GPUs provided as individually addressable units. These multi-GPU systems can provide high aggregate throughput when running multiple concurrent GPU kernels, but to accelerate a single GPU workload they require layering additional software runtimes on top of native GPU programming interfaces such as CUDA or OpenCL [24, 38].
Unfortunately, by requiring application re-design, many workloads are never ported to take advantage of multiple GPUs.

Extending single GPU workload performance significantly is a laudable goal, but we must first understand if these applications will be able to leverage larger GPUs. Assuming that the biggest GPU on the market today amasses ≈50 SMs (i.e. NVIDIA's Pascal GPU contains 56), Figure 2 shows that, across a benchmark set of 41 applications described later in Section 3.2, most single GPU optimized workloads already contain sufficient data parallelism to fill GPUs that are 2–8× larger than today's biggest GPUs. For those that do not, we find that the absolute number of thread blocks (CTAs) is intentionally limited by the programmer, or that the problem dataset cannot be easily scaled up due to memory limitations. While most of the applications that do scale are unlikely to scale to thousands of GPUs across an entire data center without additional developer effort, moderate programmer transparent performance scalability will be attractive for applications that already contain sufficient algorithmic and data parallelism.

In this work, we examine the performance of a future 4-module NUMA GPU to understand the effects that NUMA will have when executing applications designed for UMA GPUs. We will show (not surprisingly) that when executing UMA-optimized GPU programs on a multi-socket NUMA GPU, performance does not scale. We draw on prior work and show that before optimizing the GPU microarchitecture for NUMA-awareness, several software optimizations must be in place to preserve data locality. Alone, these SW improvements are not sufficient to achieve scalable performance, however, and interconnect bandwidth remains a significant hindrance to performance. To overcome this bottleneck we propose two classes of improvements to reduce the observed NUMA penalty.

First, in Section 4 we examine the ability of switch connected GPUs to dynamically repartition their ingress and egress links to provide asymmetric bandwidth provisioning when required. By using existing interconnects more efficiently, the effective NUMA bandwidth ratio of remote memory to local memory decreases, improving performance. Second, to minimize traffic on oversubscribed interconnect links, we propose in Section 5 that GPU caches need to become NUMA-aware. Traditional on-chip caches are optimized to maximize overall hit rate, thus minimizing off-chip bandwidth. However, in NUMA systems, not all cache misses have the same relative cost and thus should not be treated equally. Due to the NUMA penalty of accessing remote memory, we show that performance can be maximized by preferencing cache capacity (and thus improving hit rate) towards data that resides in slower remote NUMA zones, at the expense of data that resides in faster local NUMA zones. To this end, we propose a new NUMA-aware cache architecture that dynamically balances cache capacity based on memory system utilization. Before diving into microarchitectural details and results, we first describe the locality-optimized GPU software runtime that enables our proposed NUMA-aware architecture.

3 A NUMA-AWARE GPU RUNTIME
Current GPU software and hardware are co-designed to optimize processor throughput based on the assumption of uniform memory properties within the GPU. Fine grained interleaving of memory addresses across memory channels on the GPU provides implicit load balancing across memory but destroys memory locality. As a result, thread block scheduling policies need not be sophisticated to capture locality, which has already been destroyed by the memory system layout. For future NUMA GPUs to work well, both system software and hardware must be changed to achieve both functionality and performance. Before focusing on architectural changes to build a NUMA-aware GPU, we describe the GPU runtime system we employ to enable multi-socket GPU execution.

Prior work has demonstrated the feasibility of a runtime system that transparently decomposes GPU kernels into sub-kernels and executes them on multiple PCIe attached GPUs in parallel [7, 25, 30]. For example, on NVIDIA GPUs this can be implemented by intercepting and remapping each kernel call, GPU memory allocation, memory copy, and GPU-wide synchronization issued by the CUDA driver. Per-GPU memory fences must be promoted to system level and seen by all GPUs, and sub-kernel CTA identifiers must be properly managed to reflect those of the original kernel. Cabezas et al. solve these two problems by introducing code annotations and an additional source-to-source compiler which was also responsible for statically partitioning data placement and computation [7].

In our work, we follow a similar strategy but without using source-to-source translation. Unlike prior work, we are able to rely on NVIDIA's Unified Virtual Addressing [37] to allow dynamic placement of pages into memory at runtime. Similarly, technologies with cache line granularity interconnects like NVIDIA's NVLink [39] allow transparent access to remote memory without the need to modify application source code to access local or remote memory addresses. Due to these advancements, we assume that through dynamic compilation of PTX to SASS at execution time, the GPU runtime will be able to statically identify and promote system wide memory fences as well as manage sub-kernel CTA identifiers.

Current GPUs perform fine-grained memory interleaving at a sub-page granularity across memory channels. In a NUMA GPU this policy would destroy locality and result in 75% of all accesses going to remote memory in a 4 GPU system, an undesirable effect in NUMA systems. Similarly, round-robin page level interleaving could be utilized, similar to the Linux interleave page allocation strategy, but despite the inherent memory load balancing, this still results in 75% of memory accesses occurring over low bandwidth NUMA links. Instead we leverage UVM page migration functionality to migrate pages on demand from system memory to local GPU memory as soon as the first access (also called first-touch allocation) is performed, as described by Arunkumar et al. [3].
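This first-touch behavior can be observed directly at the CUDA level with Unified Memory. The sketch below is a minimal illustration, not the paper's runtime (which performs the placement transparently per GPU socket); it assumes a Pascal-class GPU where managed pages are migrated on demand when a GPU first faults on them, and the names used are illustrative only.

```cuda
// Minimal sketch of on-demand (first-touch) page migration with CUDA Unified
// Memory. Pages of the managed allocation are first touched by the CPU and
// later migrate to the GPU that faults on them.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;              // first GPU access faults the page in locally
}

int main() {
    const size_t n = 1 << 24;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) data[i] = 0.0f;        // CPU first touch: pages in system memory

    touch<<<(unsigned)((n + 255) / 256), 256>>>(data, n); // pages migrate to this GPU on demand
    cudaDeviceSynchronize();
    printf("data[0] = %f\n", data[0]);                    // CPU access may migrate the page back
    cudaFree(data);
    return 0;
}
```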
On a single GPU, fine-grained dynamic assignment of CTAs to SMs is performed to achieve good load balancing. Extending this policy to a multi-socket GPU system is not possible due to the relatively high latency of passing sub-kernel launches from software to hardware. To overcome this penalty, the GPU runtime must launch a block of CTAs to each GPU-socket at a coarse granularity. To encourage load balancing, each sub-kernel could be comprised of an interleaving of CTAs using modulo arithmetic. Alternatively, a single kernel can be decomposed into N sub-kernels, where N is the total number of GPU sockets in the system, assigning an equal number of contiguous CTAs to each GPU. This design choice potentially exposes workload imbalance across sub-kernels, but it also preserves the data locality present in applications where contiguous CTAs also access contiguous memory regions [3, 7].
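As a concrete illustration of contiguous-block decomposition, the hypothetical sketch below splits one logical grid into N sub-kernels, one per GPU socket, each covering a contiguous CTA range; the kernel reconstructs its logical CTA identifier from a per-sub-kernel offset. This is a source-level approximation of what the proposed runtime does transparently below the CUDA driver, and the function and parameter names are assumptions for illustration.

```cuda
// Hypothetical source-level sketch of contiguous-block sub-kernel scheduling.
// 'out' is assumed to be visible to all GPUs (e.g., managed or peer-mapped memory).
#include <cuda_runtime.h>
#include <algorithm>

__global__ void myKernel(float* out, int blockOffset, int totalBlocks) {
    int logicalBlock = blockIdx.x + blockOffset;     // CTA id in the original (logical) grid
    if (logicalBlock >= totalBlocks) return;
    int tid = logicalBlock * blockDim.x + threadIdx.x;
    out[tid] = static_cast<float>(tid);
}

// Decompose one logical grid into numGPUs sub-kernels with contiguous CTA
// ranges, one per GPU socket, preserving CTA-to-memory locality.
void launchAcrossSockets(float* out, int totalBlocks, int threadsPerBlock, int numGPUs) {
    int blocksPerGPU = (totalBlocks + numGPUs - 1) / numGPUs;
    for (int g = 0; g < numGPUs; ++g) {
        int offset = g * blocksPerGPU;
        int blocks = std::min(blocksPerGPU, totalBlocks - offset);
        if (blocks <= 0) break;
        cudaSetDevice(g);
        myKernel<<<blocks, threadsPerBlock>>>(out, offset, totalBlocks);
    }
}
```

The trade-off noted above is visible in this sketch: contiguous ranges keep neighboring CTAs (and the pages they touch) on the same socket, at the cost of possible load imbalance when work per CTA varies across the grid.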
Figure 3: Performance of a 4-socket NUMA GPU relative to a single GPU and a hypothetical 4× larger (all resources scaled) single GPU. Applications shown in grey achieve greater than 99% of performance scaling with SW-only locality optimization. Configurations shown: single GPU; 4-socket NUMA GPU with traditional GPU scheduling and memory interleaving; 4-socket NUMA GPU with locality-optimized GPU scheduling and memory placement (baseline); hypothetical 4× larger single GPU.

3.1 Performance Through Locality
Figure 3 shows the relative performance of a 4-socket NUMA GPU compared to a single GPU using the two possible CTA scheduling and memory placement strategies explained above. The green bars show the relative performance of traditional single-GPU scheduling and memory interleaving policies when adapted to a NUMA GPU. The light blue bars show the relative performance of locality-optimized GPU scheduling and memory placement, consisting of contiguous block CTA scheduling and first touch page migration, after which pages are not dynamically moved between GPUs. The locality-optimized solution almost always outperforms traditional GPU scheduling and memory interleaving. Without these runtime locality optimizations, a 4-socket NUMA GPU is not able to match the performance of a single GPU despite the large increase in hardware resources. Thus, using variants of prior proposals [3, 7], we consider only this locality optimized GPU runtime for the remainder of the paper.

Despite the performance improvements that can come via locality-optimized software runtimes, many applications do not scale well on our proposed NUMA GPU system. To illustrate this effect, Figure 3 shows the speedup achievable by a hypothetical (unbuildable) 4× larger GPU with a red dash. This red dash represents an approximation of the maximum theoretical performance we expect from a perfectly architected (both HW and SW) NUMA GPU system. Figure 3 sorts the applications by the gap between the relative performance of the locality-optimized NUMA GPU and the hypothetical 4× larger GPU. We observe that on the right side of the graph some workloads (shown in the grey box) can achieve or surpass the maximum theoretical performance. In particular, for the two right-most benchmarks, the locality optimized solutions can outperform the hypothetical 4× larger GPU due to higher cache hit rates, because contiguous block scheduling is more cache friendly than traditional GPU scheduling.

However, the applications on the left side show a large gap between the locality-optimized NUMA design and theoretical performance. These are workloads in which either locality does not exist or the locality-optimized GPU runtime is not effective, resulting in a large number of remote data accesses. Because our goal is to provide scalable performance for single-GPU optimized applications, the rest of the paper describes how to close this performance gap through microarchitectural innovation. To simplify later discussion, we choose to exclude benchmarks that achieve ≥99% of the theoretical performance with software-only locality optimizations. Still, we include all benchmarks in our final results to show the overall scalability achievable with NUMA-aware multi-socket GPUs.
3.2 Simulation Methodology
To evaluate the performance of future NUMA-aware multi-socket GPUs we use a proprietary, cycle-level, trace-driven simulator for single and multi-GPU systems. Our baseline GPU, in both single GPU and multi-socket GPU configurations, approximates the latest NVIDIA Pascal architecture [43]. Each streaming multiprocessor (SM) is modeled as an in-order processor with multiple levels of cache hierarchy containing private, per-SM L1 caches and a multi-banked, shared L2 cache. Each GPU is backed by local on-package high bandwidth memory [23]. Our multi-socket GPU systems contain two to eight of these GPUs interconnected through a high bandwidth switch as shown in Figure 1. Table 1 provides a more detailed overview of the simulation parameters.

  Parameter               Value(s)
  Num of GPU sockets      4
  Total number of SMs     64 per GPU socket
  GPU frequency           1 GHz
  Max number of warps     64 per SM
  Warp scheduler          Greedy then Round-Robin
  L1 cache                Private, 128 KB per SM, 128 B lines, 4-way, write-through, GPU-side SW-based coherent
  L2 cache                Shared, banked, 4 MB per socket, 128 B lines, 16-way, write-back, mem-side non-coherent
  GPU–GPU interconnect    128 GB/s per socket (64 GB/s each direction); 8 lanes per direction, 8 B wide each; 128-cycle latency
  DRAM bandwidth          768 GB/s per GPU socket
  DRAM latency            100 ns

Table 1: Simulation parameters for evaluation of single and multi-socket GPU systems.

GPU coherence protocols are not one-size-fits-all [49, 54, 63]. This work examines clusters of large discrete GPUs, but smaller, more tightly integrated GPU–CPU designs exist today as systems on chips (SoCs) [12, 33]. In these designs GPUs and CPUs can share a single memory space and last-level cache, necessitating a compatible GPU–CPU coherence protocol. However, closely coupled CPU–GPU solutions are not likely to be ideal candidates for GPU-centric HPC workloads. Discrete GPUs each dedicate tens of billions of transistors to throughput computing, while integrated solutions dedicate only a fraction of the chip area. While discrete GPUs are also starting to integrate more closely with some CPU coherence protocols [1, 63], PCIe attached discrete GPUs (where integrated coherence is not possible) are likely to continue dominating the market, thanks to broad compatibility between CPU and GPU vendors.

This work examines the scalability of one such cache coherence protocol used by PCIe attached discrete GPUs. The protocol is optimized for simplicity, without need for hardware coherence support at any level of the cache hierarchy. SM-side private L1 caches achieve coherence through compiler-inserted cache control (flush) operations, and memory-side L2 caches do not require coherence support. While software-based coherence may seem heavy handed compared to fine grained MOESI-style hardware coherence, many GPU programming models (in addition to C++ 2011) are moving towards scoped synchronization, where explicit software acquire and release operations must be used to enforce coherence. Without the use of these operations, coherence is not globally guaranteed, and thus maintaining fine grain CPU-style MOESI coherence (via either directories or broadcast) may be an unnecessary burden.
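To make the scoped-synchronization point concrete, the sketch below shows an explicit release/acquire pair using libcu++ atomics; without such operations, a software-based protocol gives no global visibility guarantee. This is an illustration of the programming-model trend referred to above, not the coherence mechanism evaluated in the paper, and it assumes CUDA 11+ with libcu++ and concurrent execution of the two kernels (e.g., in different streams).

```cuda
// Illustrative sketch of explicit acquire/release synchronization (assumes
// CUDA 11+ with libcu++). Visibility of 'data' is only guaranteed through the
// release-store / acquire-load pair on 'flag'.
#include <cuda/atomic>

__device__ int data;
__device__ cuda::atomic<int, cuda::thread_scope_device> flag{0};

__global__ void producer() {
    data = 42;                                   // plain store
    flag.store(1, cuda::memory_order_release);   // release: publish 'data'
}

__global__ void consumer(int* out) {
    while (flag.load(cuda::memory_order_acquire) == 0) { /* spin */ }
    *out = data;                                 // guaranteed to observe 42
}
```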
We study the scalability of multi-socket NUMA GPUs using 41 workloads taken from a range of production codes based on the HPC CORAL benchmarks [28], graph applications from Lonestar [45], HPC applications from Rodinia [9], and several other in-house CUDA benchmarks. This set of workloads covers a wide spectrum of GPU applications used in machine learning, fluid dynamics, image manipulation, graph traversal, and scientific computing. Each of our benchmarks is hand selected to identify the region of interest deemed representative for each workload, which may be as small as a single kernel containing a tight inner loop or as large as several thousand kernel invocations. We run each benchmark to completion for the determined region of interest. Table 2 provides both the memory footprint per application as well as the average number of active CTAs in the workload (weighted by the time spent on each kernel) to provide a representation of how many parallel thread blocks (CTAs) are generally available during workload execution.

  Benchmark                      Time-weighted avg. CTAs    Memory footprint (MB)
  ML-GoogLeNet-cudnn-Lev2        6272                       1205
  ML-AlexNet-cudnn-Lev2          1250                       832
  ML-OverFeat-cudnn-Lev3         1800                       388
  ML-AlexNet-cudnn-Lev4          1014                       32
  ML-AlexNet-ConvNet2            6075                       97
  Rodinia-Backprop               4096                       160
  Rodinia-Euler3D                1008                       25
  Rodinia-BFS                    1954                       38
  Rodinia-Gaussian               2599                       78
  Rodinia-Hotspot                7396                       64
  Rodinia-Kmeans                 3249                       221
  Rodinia-Pathfinder             4630                       1570
  Rodinia-Srad                   16384                      98
  HPC-SNAP                       200                        744
  HPC-Nekbone-Large              5583                       294
  HPC-MiniAMR                    76033                      2752
  HPC-MiniContact-Mesh1          250                        21
  HPC-MiniContact-Mesh2          15423                      257
  HPC-Lulesh-Unstruct-Mesh1      435                        19
  HPC-Lulesh-Unstruct-Mesh2      4940                       208
  HPC-AMG                        241549                     3744
  HPC-RSBench                    7813                       19
  HPC-MCB                        5001                       162
  HPC-NAMD2.9                    3888                       88
  HPC-RabbitCT                   131072                     524
  HPC-Lulesh                     12202                      578
  HPC-CoMD                       3588                       319
  HPC-CoMD-Wa                    13691                      393
  HPC-CoMD-Ta                    5724                       394
  HPC-HPGMG-UVM                  10436                      1975
  HPC-HPGMG                      10506                      1571
  Lonestar-SP                    75                         8
  Lonestar-MST-Graph             770                        86
  Lonestar-MST-Mesh              895                        75
  Lonestar-SSSP-Wln              60                         21
  Lonestar-DMR                   82                         248
  Lonestar-SSSP-Wlc              163                        21
  Lonestar-SSSP                  1046                       38
  Other-Stream-Triad             699051                     3146
  Other-Optix-Raytracing         3072                       87
  Other-Bitcoin-Crypto           60                         5898

Table 2: Time-weighted average number of thread blocks and application footprint.

4 ASYMMETRIC INTERCONNECTS
Figure 4(a) shows a switch connected GPU with symmetric and static link bandwidth assignment. Each link consists of an equal number of uni-directional high-speed lanes in both directions, collectively comprising a symmetric bi-directional link.
Figure 4: Example of dynamic link assignment to improve interconnect efficiency. (a) Symmetric bandwidth assignment, achieving 75% link bandwidth utilization in the example; (b) asymmetric bandwidth assignment, achieving 100% utilization.

Figure 5: Normalized link bandwidth profile for HPC-HPGMG-UVM showing asymmetric link utilization between GPUs and within a GPU (egress and ingress bandwidth for GPU0 through GPU3 over time). Vertical black dotted lines indicate kernel launch events.

Traditional static, design-time link capacity assignment is very common and has several advantages. For example, only one type of I/O circuitry (egress drivers or ingress receivers), along with only one type of control logic, is implemented at each on-chip link interface. Moreover, the multi-socket switches result in simpler designs that can easily support statically provisioned bandwidth requirements. On the other hand, multi-socket link bandwidth utilization can have a large influence on overall system performance. Static partitioning of bandwidth, when application needs are dynamic, can leave performance on the table. Because I/O bandwidth is a limited and expensive system resource, NUMA-aware interconnect designs must look for innovations that can keep wire and I/O utilization high.

In multi-socket NUMA GPU systems, we observe that many applications have different utilization of egress and ingress channels, both on a per GPU-socket basis and during different phases of execution. For example, Figure 5 shows a link utilization snapshot over time for the HPC-HPGMG-UVM benchmark running on a SW locality-optimized 4-socket NUMA GPU. Vertical dotted black lines represent kernel invocations that are split across the 4 GPU sockets. Several small kernels have negligible interconnect utilization. However, for the later, larger kernels, GPU0 and GPU2 fully saturate their ingress links, while GPU1 and GPU3 fully saturate their egress links. At the same time GPU0 and GPU2, and GPU1 and GPU3, are underutilizing their egress and ingress links respectively.

In many workloads a common scenario has CTAs writing to the same memory range at the end of a kernel (i.e. parallel reductions, data gathering). For CTAs running on one of the sockets, GPU0 for example, these memory references are local and do not produce any traffic on the inter-socket interconnections. However, CTAs dispatched to other GPUs must issue remote memory writes, saturating their egress links while their ingress links remain underutilized, and causing ingress traffic on GPU0. Such communication patterns typically utilize only 50% of the available interconnect bandwidth. In these cases, dynamically increasing the number of ingress lanes for GPU0 (by reversing the direction of egress lanes) and switching the direction of ingress lanes for GPUs 1–3 can substantially improve the achievable interconnect bandwidth. Motivated by these findings, we propose to dynamically control multi-socket link bandwidth assignments on a per-GPU basis, resulting in dynamic asymmetric link capacity assignments as shown in Figure 4(b).

To evaluate this proposal, we model point-to-point links containing multiple lanes, similar to PCIe [47] or NVLink [43]. In these links, 8 lanes with 8 GB/s capacity per lane yield an aggregate bandwidth of 64 GB/s in each direction. We propose replacing uni-directional lanes with bi-directional lanes to which we apply an adaptive link bandwidth allocation mechanism that works as follows. For each link in the system, at kernel launch the links are always reconfigured to contain symmetric link bandwidth with 8 lanes per direction. During kernel execution, the link load balancer periodically samples the saturation status of each link. If the lanes in one direction are not saturated while the lanes in the opposite direction are 99% saturated, the link load balancer reconfigures and reverses the direction of one of the unsaturated lanes after quiescing all packets on that lane.

This sample-and-reconfigure process stops only when directional utilization is no longer oversubscribed or all but one lane is configured in a single direction. If both ingress and egress links are saturated in an asymmetric configuration, links are then reconfigured back toward a symmetric configuration to encourage global bandwidth equalization. While this process may sound complex, the circuitry for dynamically turning high speed single ended links around in just tens of cycles or less is already in use by modern high bandwidth memory interfaces, such as GDDR, where the same set of wires is used for both memory reads and writes [16]. In high speed signaling implementations, the necessary phase–delay lock loop resynchronization can occur while data is in flight, eliminating the need to idle the link during this long latency (microseconds) operation if upcoming link turn operations can be sufficiently projected ahead of time, such as on a fixed interval.
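A functional sketch of this per-link load-balancing policy is given below, written as a plain C++ policy model rather than RTL. The lane counts and the 99% saturation threshold follow the description above; the utilization inputs are assumed to be provided by per-direction hardware counters sampled every Sample Time interval.

```cpp
// Functional sketch of the per-link lane rebalancing policy (a policy model).
struct Link {
    int egressLanes  = 8;            // symmetric 8 + 8 configuration at kernel launch
    int ingressLanes = 8;
    double egressUtil  = 0.0;        // measured utilization, 0.0 .. 1.0
    double ingressUtil = 0.0;
};

void rebalance(Link& l) {
    const double kSaturated = 0.99;
    bool eSat = l.egressUtil  >= kSaturated;
    bool iSat = l.ingressUtil >= kSaturated;

    if (eSat && !iSat && l.ingressLanes > 1) {
        --l.ingressLanes; ++l.egressLanes;       // turn one underutilized ingress lane around
    } else if (iSat && !eSat && l.egressLanes > 1) {
        --l.egressLanes; ++l.ingressLanes;       // turn one underutilized egress lane around
    } else if (eSat && iSat && l.egressLanes != l.ingressLanes) {
        // both directions saturated in an asymmetric configuration:
        // drift back toward the symmetric split to equalize bandwidth
        if (l.egressLanes > l.ingressLanes) { --l.egressLanes; ++l.ingressLanes; }
        else                                { --l.ingressLanes; ++l.egressLanes; }
    }
    // otherwise leave the link unchanged; invoked again after the next sample interval
}
```

Adjusting one lane per sample keeps reconfiguration overhead bounded, since each lane turn requires quiescing in-flight packets on that lane.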
Figure 6: Relative speedup of the dynamic link adaptivity compared to the baseline architecture when varying sample time and assuming a switch time of 100 cycles. In red, we also provide the speedup achievable by doubling link bandwidth. Configurations shown: static 128 GB/s inter-GPU interconnect; dynamic 128 GB/s interconnect with 1K, 5K, 10K, and 50K cycle sample times; static 256 GB/s interconnect.

4.1 Results and Discussion
Two important parameters affect the performance of our proposed mechanism: (i) Sample Time, the frequency at which the scheme samples for a possible reconfiguration, and (ii) Switch Time, the cost of turning the direction of an individual lane. Figure 6 shows the performance improvement, compared to our locality-optimized GPU, of exploring different values of the Sample Time (green bars) while assuming a Switch Time of 100 cycles. The red bars in Figure 6 provide an upper bound on the performance speedups achievable when doubling the available interconnect bandwidth to 256 GB/s. For workloads on the right of the figure, doubling the link bandwidth has little effect, indicating that a dynamic link policy will also show little improvement due to low GPU–GPU interconnect bandwidth demand. On the left side of the figure, for some applications where improved interconnect bandwidth has a large effect, dynamic lane switching can improve application performance by as much as 80%. For some benchmarks like Rodinia-Euler3D, HPC-AMG, and HPC-Lulesh, doubling the link bandwidth provides a 2× speedup, while our proposed dynamic link assignment mechanism is not able to significantly improve performance. These workloads saturate both link directions, so there is no opportunity to provide additional bandwidth by turning links around.
Using a moderate 5K cycle sample time, the dynamic link policy can improve performance by 14% on average over static bandwidth partitioning. If the link load balancer samples too infrequently, application dynamics can be missed and the performance improvement is reduced. However, if the link is reconfigured too frequently, bandwidth is lost due to the overhead of turning the link. While we have assumed a pessimistic link turn time of 100 cycles, we performed sensitivity studies showing that even if the link turn time were increased to 500 cycles, our dynamic policy loses less than 2% in performance. At the same time, using a faster lane switch (10 cycles) does not significantly improve performance over a 100 cycle link turn time. The link turnaround times of modern high-speed on-board links such as GDDR5 [16] are about 8 ns including both link and internal DRAM turn-around latency, which is less than 10 cycles at 1 GHz.

Our results demonstrate that asymmetric link bandwidth allocation can be very attractive when inter-socket interconnect bandwidth is constrained by the number of on-PCB wires (and thus total link bandwidth). The primary drawback of this solution is that both types of interface circuitry (TX and RX) and logic must be implemented for each lane in both the GPU and switch interfaces. We conducted an analysis of the potential cost of doubling the amount of I/O circuitry and logic based on a proprietary state-of-the-art GPU I/O implementation. Our results show that doubling this interface area increases total GPU area by less than 1% while yielding a 12% improvement in average interconnect bandwidth and a 14% application performance improvement. One additional caveat worth noting is that the proposed asymmetric link mechanism optimizes link bandwidth in a given direction for each individual link, while the total switch bandwidth remains constant.

5 NUMA-AWARE CACHE MANAGEMENT
Section 4 showed that inter-socket bandwidth is an important factor in achieving scalable NUMA GPU performance. Unfortunately, because either the outgoing or incoming links must be underutilized for us to reallocate that bandwidth to the saturated link, dynamic link rebalancing yields minimal gains when both incoming and outgoing links are saturated. To improve performance in situations where dynamic link balancing is ineffective, system designers can either increase link bandwidth, which is very expensive, or decrease the amount of traffic that crosses the low bandwidth communication channels. To decrease off-chip memory traffic, architects typically turn to caches to capture locality.
Figure 7: Potential L2 cache organizations to balance capacity between remote and local NUMA memory systems: (a) mem-side local-only L2, (b) static R$, (c) shared coherent L1+L2, and (d) NUMA-aware L1+L2. The cache partitioning algorithm summarized in (d): 0) allocate ½ of the ways for local and ½ for remote data; 1) estimate incoming/outgoing inter-GPU BW and DRAM BW, treating utilization above 99% as saturated; 2) if inter-GPU BW is saturated and DRAM BW is not, RemoteWays++ and LocalWays--; 3) if DRAM BW is saturated and inter-GPU BW is not, RemoteWays-- and LocalWays++; 4) if both are saturated, equalize the local and remote ways; 5) go back to 1) after Sample Time cycles; 6) on each kernel launch, restart from 0).

GPU cache hierarchies differ from traditional CPU hierarchies in that they typically do not implement strong hardware coherence protocols [55]. They also differ from CPU protocols in that caches may be either processor side (where some form of coherence is typically necessary) or memory side (where coherence is not necessary). As described in Table 1 and shown in Figure 7(a), a GPU today is typically composed of relatively large SW managed coherent L1 caches located close to the SMs, while a relatively small, distributed, non-coherent memory side L2 cache resides close to the memory controllers. This organization works well for GPUs because their SIMT processor designs often allow for significant coalescing of requests to the same cache line, so having large L1 caches reduces the need for global crossbar bandwidth. The memory-side L2 caches do not need to participate in the coherence protocol, which reduces system complexity.

5.1 Design Considerations
In NUMA designs, remote memory references occurring across low bandwidth NUMA interconnections result in poor performance, as shown in Figure 3. Similarly, utilizing traditional memory-side L2 caches (which depend on fine grained memory interleaving for load balancing) in NUMA GPUs is a bad design decision. Because memory-side caches only cache accesses that originate in their local memory, they cannot cache memory from other NUMA zones and thus cannot reduce NUMA interconnect traffic. Previous work has proposed that GPU L2 cache capacity should be split between memory-side caches and a new processor-side L1.5 cache that is an extension of the GPU L1 caches [3] to enable caching of remote data, shown in Figure 7(b). By balancing L2 capacity between memory side and remote caches (R$), this design limits the need for extending expensive coherence operations (invalidations) into the entire L2 cache while still minimizing crossbar or interconnect bandwidth.

Flexibility: Designs that statically allocate cache capacity to local memory and remote memory, in any balance, may achieve reasonable performance in specific instances, but they lack flexibility. Much like application phasing was shown to affect NUMA bandwidth consumption, the ability to dynamically share cache capacity between local and remote memory has the potential to improve performance in several situations. First, when application phasing results in some GPU-sockets primarily accessing data locally while others access data remotely, a fixed partitioning of cache capacity is guaranteed to be sub-optimal. Second, while we show that most applications will be able to completely fill large NUMA GPUs, this may not always be the case. GPUs within the datacenter are being virtualized, and there is continuous work to improve the concurrent execution of multiple kernels and processes within a single GPU [15, 32, 46, 50]. If a large NUMA GPU is sub-partitioned, it is intuitive that system software attempt to partition it along NUMA boundaries (even within a single GPU-socket) to improve the locality of small GPU kernels. To effectively capture locality in these situations, NUMA-aware GPUs need to be able to dynamically re-purpose cache capacity at runtime, rather than be statically partitioned at design time.

Coherence: To date, discrete GPUs have not moved their memory-side caches to the processor side because the overhead of cache invalidation (due to coherence) is an unnecessary performance penalty. For a single socket GPU with a uniform memory system, there is little performance advantage to implementing L2 caches as processor side caches. Still, in a multi-socket NUMA design, the performance tax of extending coherence into the L2 caches is offset by the fact that remote memory accesses can now be cached locally, and so may be justified. Figure 7(c) shows a configuration with a coherent L2 cache where remote and local data contend for L2 capacity; the L2 caches act as extensions of the L1 caches and implement an identical coherence policy.
Dynamic Partitioning: Building upon coherent GPU L2 caches, we posit that, while conceptually simple, allowing both remote and local memory accesses to contend for cache capacity (in both the L1 and L2 caches) in a NUMA system is flawed. In UMA systems it is well known that performance is maximized by optimizing for cache hit rate, thus minimizing off-chip memory system bandwidth. However, in NUMA systems, not all cache misses have the same relative performance cost. A cache miss to a local memory address has a smaller cost (in terms of both latency and bandwidth) than a cache miss to a remote memory address. Thus, it should be beneficial to dynamically skew cache allocation to preference caching remote memory over local data when it is determined that the system is bottlenecked on NUMA bandwidth.

To minimize inter-GPU bandwidth in multi-socket GPU systems, we propose a NUMA-aware cache partitioning algorithm, with the cache organization and a brief summary shown in Figure 7(d). Similar to our interconnect balancing algorithm, at initial kernel launch (after GPU caches have been flushed for coherence purposes) we allocate one half of the cache ways for local memory and the remaining ways for remote data (Step 0). After executing for a 5K cycle period, we sample the average bandwidth utilization of local memory and estimate the GPU-socket's incoming read request rate by looking at the outgoing request rate multiplied by the response packet size. By using the outgoing request rate to estimate the incoming bandwidth, we avoid situations where incoming writes saturate the link bandwidth and falsely indicate that we should preference remote data caching. Projected link utilization above 99% is considered to be bandwidth saturated (Step 1). In cases where the interconnect bandwidth is saturated but local memory bandwidth is not, the partitioning algorithm attempts to reduce remote memory traffic by re-assigning one way from the group of local ways to the remote ways grouping (Step 2). Similarly, if the local memory bandwidth is saturated and inter-GPU bandwidth is not, the policy re-allocates one way from the remote group and allocates it to the group of local ways (Step 3). To minimize the impact on cache design, all ways are consulted on lookup, allowing lazy eviction of data when the way partitioning changes. In cases where both the interconnect and the local memory bandwidth are saturated, the policy equalizes the allocation by moving the partition back toward an even split of local and remote ways (Step 4). The policy then samples again after the next Sample Time interval, repeating this process for the duration of the kernel (Step 5).
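The way-partitioning policy of Figure 7(d) can be summarized by the following functional sketch, written as a plain C++ policy model. The 50–50 reset at kernel launch, the 99% saturation threshold, and the one-way-per-sample adjustment follow the description above; the bandwidth-utilization inputs are assumed to come from existing link and DRAM counters, with incoming read bandwidth estimated from the outgoing request rate as described.

```cpp
// Functional sketch of the NUMA-aware way-partitioning policy (a policy model).
struct SocketCachePartition {
    int totalWays  = 16;             // e.g., a 16-way L2 as in Table 1
    int localWays  = 8;
    int remoteWays = 8;

    void onKernelLaunch() {          // Step 0: caches flushed, reset to a 50-50 split
        localWays  = totalWays / 2;
        remoteWays = totalWays - localWays;
    }

    // Called every sample interval (~5K cycles).
    void onSample(double interGPUUtil, double dramUtil) {
        const double kSaturated = 0.99;                  // Step 1
        bool linkSat = interGPUUtil >= kSaturated;
        bool dramSat = dramUtil     >= kSaturated;

        if (linkSat && !dramSat && localWays > 1) {
            --localWays; ++remoteWays;                   // Step 2: favor caching remote data
        } else if (dramSat && !linkSat && remoteWays > 1) {
            --remoteWays; ++localWays;                   // Step 3: favor caching local data
        } else if (linkSat && dramSat && localWays != remoteWays) {
            // Step 4: both saturated, move back toward an equal split
            if (localWays > remoteWays) { --localWays; ++remoteWays; }
            else                        { --remoteWays; ++localWays; }
        }
        // Step 5: repeat after the next sample interval
    }
};
```

Because all ways are still consulted on lookup, a repartitioning only changes where new lines are allocated; displaced data is evicted lazily rather than flushed when the split moves.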
Figure 8: Performance of 4-socket NUMA-aware cache partitioning, compared to memory-side L2 and static partitioning. Configurations: mem-side local-only L2, static partitioning, shared coherent L1+L2, and NUMA-aware L1+L2.

5.2 Results
Figure 8 compares the performance of 4 different cache configurations in our 4-socket NUMA GPU. Our baseline is a traditional GPU with memory-side, local-only, write-back L2 caches. To compare against prior work [3] we provide a 50–50 static partitioning where the L2 cache budget is split between a GPU-side coherent remote cache, which contains only remote data, and the memory side L2, which contains only local data. In our 4-socket NUMA GPU, static partitioning improves performance by 54% on average, although for some benchmarks it hurts performance, by as much as 10% for workloads that have negligible inter-socket memory traffic. We also show the results for GPU-side coherent L1 and L2 caches where both local and remote data contend for capacity. On average, this solution outperforms static cache partitioning significantly despite incurring additional flushing overhead due to cache coherence.
Finally, our proposed NUMA-aware cache partitioning policy is shown in dark grey. Due to its ability to dynamically adapt the capacity of both the L2 and L1 to optimize performance when backed by NUMA memory, it is the highest performing cache configuration. By examining simulation results we find that for workloads on the left side of Figure 8, which fully saturate the inter-GPU bandwidth, the NUMA-aware dynamic policy configures the L1 and L2 caches to be used primarily as remote caches. However, workloads on the right side of the figure tend to have good GPU-socket memory locality, and thus prefer that the L1 and L2 caches store primarily local data. NUMA-aware cache partitioning is able to flexibly adapt to varying memory access profiles and can improve average NUMA GPU performance by 76% compared to traditional memory side L2 caches, and by 22% compared to previously proposed static cache partitioning, despite incurring additional coherence overhead.

When extending SW-controlled coherence into the GPU L2 caches, the L1 coherence operations must also be applied to the GPU L2 caches. Using bulk software invalidation to maintain coherence is simple to implement, but it incurs a performance penalty by falsely evicting data that did not need to be invalidated. The overhead of this invalidation depends on both the frequency of the invalidations and the aggregate cache capacity invalidated. Extending the L1 invalidation protocol into the shared L2, and then across multiple GPUs, increases both the capacity affected and the frequency of the invalidation events.

Figure 9: Performance overhead of extending current GPU software based coherence into the GPU L2 caches. Configurations: single GPU, and 4-socket GPU with NUMA-aware coherent L1+L2 with and without L2 invalidations.

To understand the impact of these invalidations, we evaluate hypothetical L2 caches which can ignore the cache invalidation events, thus representing the upper limit on performance (no coherence evictions ever occur) that could be achieved by using a finer granularity HW-coherence protocol. Figure 9 shows the impact these invalidation operations have on application performance. While significant for some applications, on average the SW-based cache invalidation overhead is only 10%, even when extended across all GPU-socket L2 caches. So while fine grain HW coherence protocols may improve performance, the magnitude of their improvement must be weighed against their hardware implementation complexity. While in the studies above we assumed a write-back policy in the L2 caches, as a sensitivity study we also evaluated the effect of using a write-through L2 policy to mirror the write-through L1 cache policy. Our findings indicate that a write-back L2 outperforms a write-through L2 by 9% on average in our NUMA-GPU design due to the decrease in total inter-GPU write bandwidth.

Figure 10: Final NUMA-aware GPU performance compared to a single GPU and a 4× larger single GPU with scaled resources. Configurations: single GPU; 4-socket NUMA GPU baseline; with asymmetric inter-GPU interconnect; with asymmetric inter-GPU interconnect and NUMA-aware cache partitioning; hypothetical 4× larger single GPU.
capacityinvalidated.ExtendingtheL1invalidationprotocolintothe Becausethesetwofeaturestargetthesameproblem,whenem- sharedL2,andthenacrossmultipleGPUs,increasesthecapacity ployedtogethertheireffectsarenotstrictlyadditive.Figure10shows affectedandfrequencyoftheinvalidationevents. theoverallimprovementNUMA-awareGPUscanachievewhenap- Tounderstandtheimpactoftheseinvalidations,weevaluatehy- plyingbothtechniquesinparallel.ForbenchmarkssuchasCoMD, potheticalL2cacheswhichcanignorethecacheinvalidationevents; thesefeaturescontributenearlyequallytotheoverallimprovement, thusrepresentingtheupperlimitonperformance(nocoherenceevic- butforotherssuchasML-AlexNet-cudnn-Lev2orHPC-MST-Mesh1, tionseveroccur)thatcouldbeachievedbyusingafinergranularity interconnectimprovementsorcachingaretheprimarycontributor HW-coherenceprotocol.Figure9showstheimpacttheseinvalida- respectively.Onaverage,weobservethatwhencombinedwesee tionoperationshaveonapplicationperformance.Whilesignificant 2.1×improvementoverasingleGPUand80%overthebaseline