ebook img

Gunrock: A High-Performance Graph Processing Library on the GPU PDF

0.81 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Gunrock: A High-Performance Graph Processing Library on the GPU

Artifact GunPrrooccke:ssAingHLigihb-rPaerryfoornmtahneceGGPUraph PoPPP***CeodsnusiesRteeontty*stCaoaEm*pluedteetln*eW*maeulcloDvAEEC* YangzihaoWang,AndrewDavidson∗,YuechaoPan,YuduoWu†,AndyRiffel,JohnD.Owens UniversityofCalifornia,Davis {yzhwang, aaldavidson, ychpan, yudwu, atriffel, jowens}@ucdavis.edu 6 1 Abstract implementedandevaluatedinGunrock,wefocusinthispaperon 0 breadth-firstsearch(BFS),single-sourceshortestpath(SSSP),be- 2 Forlarge-scalegraphanalyticsontheGPU,theirregularityofdata tweennesscentrality(BC),PageRank,andconnectedcomponents access/controlflowandthecomplexityofprogrammingGPUshave b been two significant challenges for developing a programmable (CC). Though the GPU’s excellent peak throughput and energy e high-performance graph library. “Gunrock,” our high-level bulk- efficiency [17] have been demonstrated across many application F domains, these applications often exploit regular, structured par- synchronous graph-processing system targeting the GPU, takes 2 a new approach to abstracting GPU graph analytics: rather than allelism. The inherent irregularity of graph data structures leads toirregularityindataaccessandcontrolflow,makinganefficient 2 designing an abstraction around computation, Gunrock instead implementationonGPUsasignificantchallenge. implementsanoveldata-centricabstractioncenteredonoperations ] onavertexoredgefrontier.Gunrockachievesabalancebetween Our goal with Gunrock is to deliver the performance of cus- tomized, complex GPU hardwired graph primitives with a high- C performance and expressiveness by coupling high-performance levelprogrammingmodelthatallowsprogrammerstoquicklyde- D GPUcomputingprimitivesandoptimizationstrategieswithahigh- velopnewgraphprimitives.Todoso,wemustaddressthechief level programming model that allows programmers to quickly s. developnewgraphprimitiveswithsmallcodesizeandminimal challengeinahighlyparallelgraphprocessingsystem:managing irregularityinworkdistribution.Gunrockintegratessophisticated c GPUprogrammingknowledge.WeevaluateGunrockonfivegraph [ primitives (BFS, BC, SSSP, CC, and PageRank) and show that load-balancingandwork-efficiencystrategiesintoitscore.These strategiesarehiddenfromtheprogrammer;theprogrammerinstead Gunrockhasonaverageatleastanorderofmagnitudespeedupover 6 expresseswhatoperationsshouldbeperformedonthefrontierrather BoostandPowerGraph,comparableperformancetothefastestGPU v thanhowthoseoperationsshouldbeperformed.Programmerscan hardwiredprimitives,andbetterperformancethananyotherGPU 7 assemblecomplexandhigh-performancegraphprimitivesfromop- high-levelgraphlibrary. 8 erationsthatmanipulatethefrontier(the“what”)withoutknowing 3 theinternalsoftheoperations(the“how”). 1. Introduction 5 Ourcontributionsareasfollows: 0 Graphsareubiquitousdatastructuresthatcanrepresentrelation- . shipsbetweenpeople(socialnetworks),computers(theInternet), 1. Wepresentanoveldata-centricabstractionforgraphoperations 1 biologicalandgeneticinteractions,andelementsinunstructured thatallowsprogrammerstodevelopgraphprimitivesatahigh 0 meshes,justtonameafew.Inthispaper,wedescribe“Gunrock,” levelofabstractionwhilesimultaneouslydeliveringhighper- 5 ourgraphicsprocessor(GPU)-basedsystemforgraphprocessing formance.Thisabstraction,unliketheabstractionsofprevious 1 thatdelivershighperformanceincomputinggraphanalyticswith GPUprogrammableframeworks,isabletoelegantlyincorpo- : v itshigh-level,data-centricparallelprogrammingmodel.Unlikepre- rateprofitableoptimizations—kernelfusion,push-pulltraversal, i viousGPUgraphprogrammingmodelsthatfocusonsequencing idempotenttraversal,andpriorityqueues—intothecoreofits X computationsteps,ourdata-centricmodel’skeyabstractionisthe implementation. r frontier,asubsetoftheedgesorverticeswithinthegraphthatis a currentlyofinterest.AllGunrockoperationsarebulk-synchronous 2. WedesignandimplementasetofsimpleandflexibleAPIsthat canexpressawiderangeofgraphprocessingprimitivesata andmanipulatethisfrontier,eitherbycomputingonvalueswithinit highlevelofabstraction(atleastassimple,ifnotmoreso,than orbycomputinganewfrontierfromit. otherprogrammableGPUframeworks). Atahighlevel,Gunrocktargetsgraphprimitivesthatareiter- ative,convergentprocesses.Amongthegraphprimitiveswehave 3. WedescribeseveralGPU-specificoptimizationstrategiesfor memoryefficiency,loadbalancing,andworkloadmanagement ∗CurrentlyanemployeeatGoogle. thattogetherachievehighperformance.Allofourgraphprimi- †CurrentlyanemployeeatIBM. tivesachievecomparableperformancetotheirhardwiredcoun- terpartsandsignificantlyoutperformpreviousprogrammable GPUabstractions. 4. We provide a detailed experimental evaluation of our graph Permissiontomakedigitalorhardcopiesofpartorallofthisworkforpersonalorclassroomuseisgrantedwithout primitiveswithperformancecomparisonstoseveralCPUand feeprovidedthatcopiesarenotmadeordistributedforprofitorcommercialadvantageandthatcopiesbearthisnotice GPUimplementations. andthefullcitationonthefirstpage.CopyrightsforcomponentsofthisworkownedbyothersthanACMmustbe honored.Abstractingwithcreditispermitted.Tocopyotherwise,torepublish,topostonservers,ortoredistributeto lists,contacttheOwner/Author.Requestpermissionsfrompermissions@acm.orgorPublicationsDept.,ACM,Inc.,fax +1(212)869-0481.Copyright2016heldbyOwner/Author.PublicationRightsLicensedtoACM. Gunrock is currently available in an open-source repository PPoPP’16 March12-16,2016,Barcelona,Spain at http://gunrock.github.io/ and is currently available for use by Copyright(cid:13)c 2016ACM978-1-4503-4092-2/16/03...$15.00 DOI:http://dx.doi.org/10.1145/2851141.2851145 externaldevelopers. 2. RelatedWork oftheloopfromtheuser.Usershavetogeneratetheactiveelements set directly for different graph algorithms. Help is a library that Thissectiondiscussestheresearchlandscapeoflarge-scalegraph provideshigh-levelprimitivesforlarge-scalegraphprocessing[29]. analyticsframeworksinfourfields: UsingtheprimitivesinHelpismoreintuitiveandmuchfasterthan 1. Single-nodeCPU-basedsystems,whichareincommonusefor usingtheAPIsofexistingdistributedsystems.Green-Marl[15]is graphanalyticstoday,butwhoseserialorcoarse-grained-parallel adomain-specificlanguageforwritinggraphanalysisalgorithms programmingmodelsarepoorlysuitedforamassivelyparallel on shared memory with built-in breadth-first search (BFS) and processorliketheGPU; depth-firstsearch(DFS)primitivesinitscompiler.Itslanguageap- proachprovidesgraph-specificoptimizationsandhidescomplexity. 2. DistributedCPU-basedsystems,whichofferscalabilityadvan- However,thelanguagedoesnotsupportoperationsonarbitrarysets tagesoversingle-nodesystemsbutincursubstantialcommunica- of vertices for each iteration, which makes it difficult to use for tioncost,andwhoseprogrammingmodelsarealsopoorlysuited traversalalgorithmsthatcannotbeexpressedusingaBFSorDFS. toGPUs; 3. GPU“hardwired,”low-levelimplementationsofspecificgraph 2.2 SpecializedParallelGraphAlgorithms primitives,whichprovideaproofofconceptthatGPU-based Recent work has developed numerous best-of-breed, hardwired graphanalyticscandeliverbest-in-classperformance.However, implementationsofmanygraphprimitives.Merrilletal.[24]’slinear best-of-classhardwiredprimitivesarechallengingtoeventhe parallelizationoftheBFSalgorithmontheGPUhadsignificant most skilled programmers, and their implementations do not influenceinthefield.Theyproposedanadaptivestrategyforload- generalizewelltoavarietyofgraphprimitives;and balancingparallelworkbyexpandingonenode’sneighborlisttoone 4. High-level GPU programming models for graph analytics, thread,onewarp,orawholeblockofthreads.Withthisstrategyand whichoftenrecapitulateCPUprogrammingmodels(e.g.,CuSha amemory-accessefficientdatarepresentation,theirimplementation andMapGraphusePowerGraph’sGASprogrammingmodel, achieveshighthroughputonlargescale-freegraphs.Beameretal.’s Medusa uses Pregel’s messaging model). The best of these recentworkonaveryfastBFSforsharedmemorymachines[1] systems incorporate generalized load balance strategies and uses a hybrid BFS that switches between top-down and bottom- optimizedGPUprimitives,buttheygenerallydonotcompare up neighbor-list-visiting algorithms according to the size of the favorablyinperformancewithhardwiredprimitivesduetothe frontiertosaveredundantedgevisits.Thecurrentfastestconnected- overheadsinherentinahigh-levelframeworkandthelackof component algorithm on the GPU is Soman et al.’s work [34] primitive-specificoptimizations. basedontwoPRAMconnected-componentalgorithms[14].There are several parallel Betweenness Centrality implementations on 2.1 Single-nodeandDistributedCPU-basedSystems the GPU [10, 22, 27, 31] based on the work from Brandes and Parallel graph analytics frameworks provide high-level, pro- Ulrik [2]. Davidson et al. [5] proposed a work-efficient Single- grammable,high-performanceabstractions.TheBoostGraphLi- SourceShortestPathalgorithmontheGPUthatexploresavariety brary(BGL)isamongthefirsteffortstowardsthisgoal,thoughits of parallel load-balanced graph traversal and work organization serialformulationandC++focustogethermakeitpoorlysuitedfor strategiestooutperformotherparallelmethods.Afterwediscussthe amassivelyparallelarchitecturelikeaGPU.Designedusingthe GunrockabstractioninSection4.1,wewilldiscusstheseexisting genericprogrammingparadigm,theparallelBGL[13]separates hardwiredGPUgraphalgorithmimplementationsusingGunrock theimplementationofparallelalgorithmsfromtheunderlyingdata terminology. structuresandcommunicationmechanisms.WhilemanyBGLim- 2.3 High-levelGPUProgrammingModels plementationsarespecializedperalgorithm,itsbreadth first visit pattern(forinstance)allowssharingcommonoperatorsbetweendif- InMedusa[37],ZhongandHepresentedtheirpioneeringworkon ferentgraphalgorithms.Pregel[20]isGoogle’seffortatlarge-scale ahigh-levelGPU-basedsystemforparallelgraphprocessing,using graphcomputing.ItfollowstheBulkSynchronousParallel(BSP) amessage-passingmodel.CuSha[18],targetingaGASabstraction, model.AtypicalapplicationinPregelisaniterativeconvergentpro- implementstheparallel-sliding-window(PSW)graphrepresenta- cessconsistingofglobalsynchronizationbarrierscalledsuper-steps. tionontheGPUtoavoidnon-coalescedmemoryaccess.CuSha ThecomputationinPregelisvertex-centricandbasedonmessage additionallyaddressesirregularmemoryaccessbypreprocessing passing.Itsprogrammingmodelisgoodforscalabilityandfault the graph data structure (“G-Shards”). Both frameworks offer a tolerance.However,instandardgraphalgorithmsinmostPregel-like smallsetofuser-definedAPIsbutarechallengedbyloadimbalance graphprocessingsystems,slowconvergencearisesfromgraphswith andthusfailtoachievethesamelevelofperformanceaslow-level structure. GraphLab [19] allows asynchronous computation and GPUgraphimplementations.MapGraph[8]alsoadoptstheGAS dynamicasynchronousscheduling.Byeliminatingmessage-passing, abstractionandachievessomeofthebestperformanceresultsfor itsprogrammingmodelisolatestheuser-definedalgorithmfrom programmablesingle-nodeGPUgraphcomputation. themovementofdata,andthereforeismoreconsistentlyexpres- sive.PowerGraph[12]usesthemoreflexibleGather-Apply-Scatter 3. Background&Preliminaries (GAS)abstractionforpower-lawgraphs.ItsupportsbothBSPand asynchronousexecution.Fortheloadimbalanceproblem,ituses AgraphisanorderedpairG=(V,E,w ,w )comprisedofaset e v vertex-cuttosplithigh-degreeverticesintoequaldegree-sizedre- ofverticesV togetherwithasetofedgesE,whereE ⊆ V ×V. dundantvertices.Thisexposesgreaterparallelisminnaturalgraphs. w andw aretwoweightfunctionsthatshowtheweightvalues e v Ligra[32]isaCPU-basedgraphprocessingframeworkforshared attachedtoedgesandverticesinthegraph.Agraphisundirected memory.Itusesasimilaroperatorabstractionfordoinggraphtraver- ifforallv,u∈V :(v,u)∈E ⇐⇒(u,v)∈E.Otherwise,itis sal.Itslightweightimplementationistargetedatsharedmemory directed.Ingraphprocessing,avertexfrontierrepresentsasubset architecturesandusesCilkPlusforitsmultithreadingimplementa- ofverticesU ⊆V andanedgefrontierrepresentsasubsetofedges tion.Galois[26,28]isagraphsystemforsharedmemorybasedon I ⊆E. adifferentoperatorabstractionthatsupportspriorityschedulingand ModernNVIDIAGPUsarethroughput-orientedmanycorepro- dynamicgraphsandprocessesonsubsetsofverticescalledactive cessorsthatusemassiveparallelismtogetveryhighpeakcompu- elements.However,theirmodeldoesnotabstracttheinternaldetails tationalthroughputandhidememorylatency.Kepler-basedGPUs canhaveupto15vectorprocessors,termedstreamingmultiproces- processedinparallel.Forinstance,acomputationoneachvertex sors(SMX),eachcontaining192parallelprocessingcores,termed withinthefrontiercanbeparallelizedacrossvertices,andupdating streamingprocessors(SP).NVIDIAGPUsusetheSingleInstruc- thefrontierbyidentifyingalltheverticesneighboringthecurrent tionMultipleThread(SIMT)programmingmodeltoachievedata frontiercanalsobeparallelizedacrossvertices.BSPoperationsare parallelism.GPUprogramscalledkernelsrunonalargenumber well-suitedforefficientimplementationontheGPUbecausethey ofparallelthreads.Eachsetof32threadsformsadivergent-free exhibitenoughparallelismtokeeptheGPUbusyanddonotrequire groupcalledawarptoexecuteinlockstepinaSingleInstruction expensivefine-grainedsynchronizationorlockingoperations. MultipleData(SIMD)fashion.Thesewarpsarethengroupedinto ThegraphprimitiveswedescribeinthispaperusethreeGunrock cooperativethreadarrayscalledblockswhosethreadscancommu- steps—advance,filter,andcompute—eachofwhichmanipulatethe nicatethroughapoolofon-chipsharedmemory.AllSMXsshare frontierinadifferentway(Figure1). anoff-chipglobalDRAM. Forproblemsthatrequireirregulardataaccessessuchasgraph Advance Anadvancestepgeneratesanewfrontierfromthecurrent problems,inadditiontoexposingenoughparallelism,asuccess- frontierbyvisitingtheneighborsofthecurrentfrontier.Afrontier fulGPUimplementationbenefitsfromthefollowingapplication canconsistofeitherverticesoredges,andanadvancestepcaninput characteristics:1)coalescedmemoryaccessandeffectiveuseofthe andoutputeitherkindoffrontier.Advanceisanirregularly-parallel memoryhierarchy,2)minimizingthreaddivergencewithinawarp, operation for two reasons: (1) different vertices in a graph have and3)reducingscatteredreadsandwrites. differentnumbersofneighborsand(2)verticesshareneighbors, To achieve these goals, Gunrock represents all per-node and soanefficientadvanceisthemostsignificantchallengeofaGPU per-edgedataasstructure-of-array(SOA)datastructuresthatallow implementation. coalescedmemoryaccesseswithminimalmemorydivergence.The The generality of Gunrock’s advance allows us to use the same datastructureforthegraphitselfisperhapsevenmoreimportant. advanceimplementationacrossawidevarietyofinterestinggraph InGunrock,weuseacompressedsparserow(CSR)sparsematrix operations.Forinstance,wecanutilizeGunrockadvanceoperators forvertex-centricoperationsbydefaultandallowuserstochoose to: 1) visit each element in the current frontier while updating anedge-list-onlyrepresentationforedge-centricoperations.CSR localvaluesand/oraccumulatingglobalvalues(e.g.,BFSdistance usesacolumn-indicesarray,C,tostorealistofneighborvertices updates);2)visitthevertexoredgeneighborsofalltheelementsin andarow-offsetsarray,R,tostoretheoffsetoftheneighborlist thecurrentfrontierwhileupdatingsourcevertex,destinationvertex, foreachvertex.Itprovidescompactandefficientmemoryaccess, and/or edge values (e.g., distance updates in SSSP); 3) generate andallowsustousescan,acommonandefficientparallelprimitive, edgefrontiersfromvertexfrontiersorviceversa(e.g.,BFS,SSSP, toreorganizesparseandunevenworkloadsintodenseanduniform depth-firstsearch,etc.);or4)pullvaluesfromallvertices2hops onesinallphasesofgraphprocessing[24]. away by starting from an edge frontier, visiting all the neighbor edges,andreturningthefar-endverticesoftheseneighboredges. Asaresult,wecanconcentrateoureffortonsolvingoneproblem 4. TheGunrockAbstractionandImplementation (implementinganefficientadvance)andseethateffortreflectedin 4.1 Gunrock’sAbstraction betterperformanceonothertraversal-basedgraphoperations. Gunrocktargetsgraphoperationsthatcanbeexpressedasiterative Filter Afilterstepgeneratesanewfrontierfromthecurrentfrontier convergentprocesses.By“iterative,”wemeanoperationsthatmay bychoosingasubsetofthecurrentfrontierbasedonprogrammer- requirerunningaseriesofstepsrepeatedly;by“convergent,”we specifiedcriteria.Thoughfilteringisanirregularoperation,using meanthattheseiterationsallowustoapproachthecorrectanswer parallel scan for efficient filtering is well-understood on GPUs. andterminatewhenthatanswerisreached.Thistargetissimilarto Gunrock’sfilterscaneither1)splitverticesoredgesbasedona mosthigh-levelgraphframeworks. filter(e.g.,SSSP’sdelta-stepping),or2)compactoutfiltereditems Where Gunrock differs from other frameworks, particularly tothrowthemaway(e.g.,duplicateverticesinBFS,SSSP,andBC). other GPU-based frameworks, is in our abstraction. Rather than Compute Aprogrammer-specifiedcomputestepdefinesanoper- focusing on sequencing steps of computation, we instead focus ation on all elements (vertices or edges) in the current frontier; onmanipulatingadatastructure,thefrontierofverticesoredges Gunrockthenperformsthatoperationinparallelacrossallelements. thatrepresentsthesubsetofthegraphthatisactivelyparticipating Becausethisparallelismisregular,computationisstraightforwardto in the computation. It is accurate to say that for many (but not parallelizeinaGPUimplementation.Manysimplegraphprimitives all)computations,thesequenceofoperationsthatresultfromour (e.g.,computingthedegreedistributionofagraph)canbeexpressed abstractionmaybesimilartowhatanotherabstractionmayproduce. asasinglecomputationstep. Nonetheless,wefeelthatthinkingaboutgraphprocessinginterms ofmanipulationsoffrontierdatastructuresistherightabstraction fortheGPU.Wesupportthisthesisqualitativelyinthissectionand quantitativelyinSection6. Functor Oneimportantconsequenceofdesigningourabstractionwitha data-centeredfocusisthatGunrock,fromitsverybeginning,has Advance Filter Compute supportedbothvertexandedgefrontiers,andcaneasilyswitchbe- tweenthemwithinthesamegraphprimitive.Wecan,forinstance, Figure 1: Three operators in Gunrock’s data-centric abstraction generateanewfrontierofneighboringedgesfromanexistingfron- convertacurrentfrontier(inblue)intoanewfrontier(ingreen). tierofvertices.Incontrast,gather-apply-scatter(PowerGraph/GAS) andmessage-passing(Pregel)abstractionsarefocusedonopera- Gunrockprimitivesareassembledfromasequenceofthesesteps, tionsonverticesandcannoteasilysupportedgefrontierswithin which are executed sequentially: one step completes all of its theirabstractions. operationsbeforethenextstepbegins.Typically,Gunrockgraph In our abstraction, we expose bulk-synchronous “steps” that primitivesruntoconvergence,whichonGunrockusuallyequatesto manipulatethefrontier,andprogrammersbuildgraphprimitives anemptyfrontier;asindividualelementsinthecurrentfrontierreach fromasequenceofsteps.Differentstepsmayhavedependencies convergence,theycanbefilteredoutofthefrontier.Programmers between them, but individual operations within a step can be canalsouseotherconvergencecriteriasuchasamaximumnumber ofiterationsorvolatileflagvaluesthatcanbesetinacomputation 4.2 AlternativeAbstractions step. Inthissectionwediscussseveralalternativeabstractionsdesigned ExpressingSSSPinprogrammableGPUframeworks SSSPisa forgraphprocessingonvariousarchitectures. reasonablycomplexgraphprimitivethatcomputestheshortestpath fromasinglenodeinagraphtoeveryothernodeinthegraph.We Gather-apply-scatter (GAS) abstraction The GAS abstraction assumeweightsbetweennodesareallnon-negative,whichpermits wasfirstappliedondistributedsystems[12].PowerGraph’svertex- theuseofDijkstra’salgorithmanditsparallelvariants.Efficiently cut splits large neighbor lists, duplicates node information, and implementingSSSPcontinuestobeaninterestingprobleminthe deployseachpartialneighborlisttodifferentmachines.Working GPUworld[3,5,6]. asaloadbalancingstrategy,itreplacesthelargesynchronization The iteration starts with an input frontier of active vertices cost in edge-cut into a single-node synchronization cost. This (or a single vertex) initialized to a distance of zero. First, SSSP is a productive strategy for multi-node implementations. GAS enumerates the sizes of the frontier’s neighbor list of edges and abstractionshavesuccessfullybeenmappedtotheGPU,firstwith computesthelengthoftheoutputfrontier.Becausetheneighbor VertexAPI2[7]andlaterwithMapGraph[8]andCuSha[18].GAS edgesareunequallydistributedamongthefrontier’svertices,SSSP offers the twin benefits of simplicity and familiarity, given its nextredistributestheworkloadacrossparallelthreads.Thiscanbe popularityintheCPUworld. expressedwithinanadvancefrontier.Inthefinalstepoftheadvance Recently,Wuetal.comparedGunrockvs.twoGPUGASframe- frontier,eachedgeaddsitsweighttothedistancevalueatitssource works,VertexAPI2andMapGraph[36],demonstratingthatGunrock valueand,ifappropriate,updatesthedistancevalueofitsdestination hadappreciableperformanceadvantagesovertheothertwoframe- vertex.Finally,SSSPremovesredundantvertexIDs(specificfilter), works.Oneoftheprincipalperformancedifferencestheyidentified decideswhichupdatedverticesarevalidinthenewfrontier,and comesfromthesignificantfragmentationofGASprogramsacross computesthenewfrontierforthenextiteration. manykernelsthatwediscussinmoredetailinSection4.3.Applying Algorithm1providesmoredetailofhowthisalgorithmmapsto automatickernelfusiontoGAS+GPUimplementationscouldpo- Gunrock’sabstraction. tentiallyhelpclosetheirperformancegap,butsuchanoptimization ishighlycomplexandhasnotyetappearedinanypublishedwork. Algorithm1Single-SourceShortestPath,expressedinGunrock’s abstraction At a more fundamental level, we found that a compute-focused programmingmodellikeGASwasnotflexibleenoughtomanipulate 1: procedureSET PROBLEM DATA(G,P,root) the core frontier data structures in a way that enabled powerful 2: P.labels[1..G.verts]←∞ 3: P.preds[1..G.verts]←−1 featuresandoptimizationssuchaspush-pullandtwo-levelpriority 4: P.labels[root]←0 queues;bothfitnaturallyintoGunrock’sabstraction.Webelieve 5: P.preds[root]←src bulk-synchronousoperationsonfrontiersareabetterfitthanGAS 6: P.frontier.Insert(root) forforward-lookingGPUgraphprogrammingframeworks. 7: endprocedure Message-passing Pregel [20] is a vertex-centric programming 8: 9: procedureUPDATELABEL(sid,did,eid,P) modelthatonlyprovidesdataparallelismonvertices.Forgraphs 10: newlabel←P.labels[sid]+P.weights[eid] withsignificantvarianceinvertexdegree(e.g.,power-lawgraphs), 11: returnnewlabel<atomicMin(P.labels[did],newlabel) thiswouldcausesevereloadimbalanceonGPUs.Thetraversalop- 12: endprocedure eratorinPregelisgeneralenoughtoapplytoawiderangeofgraph 13: primitives,butitsvertex-centricdesignonlyachievesgoodparal- 14: procedureSETPRED(sid,did,P) lelismwhennodesinthegraphhavesmallandevenly-distributed 15: P.preds[did]←sid neighborhoods.Forreal-worldgraphsthatoftenhaveunevendistri- 16: P.outputqueueids[did]←outputqueueid butionofnodedegrees,Pregelsuffersfromsevereloadimbalance. 17: endprocedure The Medusa GPU graph-processing framework [37] also imple- 18: 19: procedureREMOVEREDUNDANT(nodeid,P) ments a BSP model and allows computation on both edges and 20: returnP.outputqueueid[nodeid]==outputqueueid vertices.Medusa,unlikeGunrock,alsoallowsedgesandverticesto 21: endprocedure sendmessagestoneighboringvertices.TheMedusaauthorsnote 22: thecomplexityofmanagingthestorageandbufferingofthesemes- 23: procedureSSSP ENACTOR(G,P,root) sages,andthedifficultyofload-balancingwhenusingsegmented 24: SET PROBLEM DATA(G,P,root) reductionforper-edgecomputation.Thoughtheyaddressbothof 25: whileP.frontier.Size()>0do thesechallengesintheirwork,theoverheadofanymanagementof 26: ADVANCE(G,P,UpdateLabel,SetPred) messagesisasignificantcontributortoruntime.Gunrockprefersthe 27: FILTER(G,P,RemoveRedundant) lesscostlydirectcommunicationbetweenprimitivesandsupports 28: PRIORITYQUEUE(G,P) 29: endwhile bothpush-based(scatter)communicationandpull-based(gather) 30: endprocedure communicationduringtraversalsteps. CPUstrategies Ligra’spowerfulload-balancingstrategyisbased GunrockmapsoneSSSPiterationontothreeGunrocksteps:(1) onCilkPlus,afine-grainedtask-parallellibraryforCPUs.Despite advance,whichcomputesthelistofedgesconnectedtothecurrent promisingGPUresearcheffortsontaskparallelism[4,35],nosuch vertexfrontierand(transparently)load-balancestheirexecution;(2) equivalentisavailableonGPUs,thusweimplementourownload- compute,toupdateneighboringverticeswithnewdistances;and(3) balancingstrategieswithinGunrock.Galois,likeGunrock,cleanly filter,togeneratethefinaloutputfrontierbyremovingredundant separatesdatastructuresfromcomputation;theirkeyabstractions nodes,optionallyusinga2-levelpriorityqueue,whoseuseenables are ordered and unordered set iterators that can add elements to delta-stepping(abinningstrategytoreduceoverallworkload[5,25]). setsduringexecution(suchadynamicdatastructureisasignificant Withthismappinginplace,thetraversalandcomputationofpath researchchallengeonGPUs).Galoisalsobenefitsfromspeculative distancesissimpleandintuitivelydescribed,andGunrockisable parallelexecutionwhoseGPUimplementationwouldalsopresent tocreateanefficientimplementationthatfullyutilizestheGPU’s asignificantchallenge.BothLigraandGaloisscalewellwithina computingresourcesinaload-balancedway. nodethroughinter-CPUsharedmemory;inter-GPUscalability,both input output frontier Enumerate Compute Load Update Mark frontier Compact Neighbors New Frontier Balancing Label Values Valid Gunrock: Traversal:Advance Compute Traversal:Filter PowerGraph: Scatter Vertex-Cut Gather Apply Scatter GetValue Pregel: GetOutEdgeIterator MutableValue VoteToHalt SendMsgTo Ligra: EdgeMap(including Update) VertexMap(including Reset) Medusa: ELIST Combiner VERTEX Figure2:OperationsthatmakeuponeiterationofSSSPandtheirmappingtotheGunrock,PowerGraph(GAS)[12],Pregel[20],Ligra[32], andMedusa[37]abstractions. duetohigherlatencyandalackofhardwaresupport,isamuch codeandexposekernelfusionopportunitiesthatwediscussbelow; moremanual,complexprocess. andanenactor,whichservesastheentrypointofthegraphalgo- rithmandspecifiesthecomputationasaseriesofadvanceand/or Help’sPrimitives Help[29]characterizesgraphprimitivesasaset filterkernelcallswithuser-definedkernellaunchingsettings. offunctionsthatenablespecialoptimizationsfordifferentprimitives GivenGunrock’sabstraction,themostnaturalwaytospecify atthecostoflosinggenerality.ItsFilter,LocalUpdateofVertices Gunrockprogramswouldbeasasequenceofbulk-synchronous (LUV),UpdateVerticesUsingOneOtherVertex(UVUOV),and steps, specified within the enactor and implemented as kernels, AggregateGlobalValue(AGV)areallGunrockfilteroperations thatoperateonfrontiers.Suchanenactorisinfactthecoreofa withdifferentcomputations.AggregatingNeighborValues(ANV) Gunrockprogram,butanenactor-onlyprogramwouldsacrificea maps to the advance operator in Gunrock. We also successfully significantperformanceopportunity.Weanalyzedthetechniques implementedFSinGunrockusingtwofilterpasses,oneadvance that hardwired (primitive-specific) GPU graph primitives used pass,andseveralotherGPUcomputingprimitives(sort,reduce,and to achieve high performance. One of their principal advantages scan). is leveraging producer-consumer locality between operations by Asynchronousexecution ManyCPUframeworks(e.g.,Galoisand integratingmultipleoperationsintosingleGPUkernels.Because GraphLab)efficientlyincorporateasynchronousexecution,butthe adjacentkernelsinCUDAorOpenCLsharenostate,combining GPU’sexpensivesynchronizationorlockingoperationswouldmake multiple logical operations into a single kernel saves significant thisapoorchoiceforGunrock.Wedorecoversomeofthebenefits memorybandwidththatwouldotherwiseberequiredtowriteand ofprioritizingexecutionthroughourtwo-levelpriorityqueue. thenreadintermediatevaluestoandfrommemory.TheCUDAC++ programmingenvironmentweusehasnoabilitytoautomatically 4.3 Gunrock’sAPIanditsKernel-FusionOptimization fuseneighboringkernelstogethertoachievethisefficiency(and automatingthis“kernelfusion”problemisasignificantresearch __device__ bool CondEdge(VertexId s_id, VertexId d_id, DataSlice *problem, challenge). VertexId e_id = 0, VertexId e_id_in = 0) In particular, we noted that hardwired GPU implementations __device__ void ApplyEdge(VertexId s_id, VertexId d_id, DataSlice *problem, fuseregularcomputationstepstogetherwithmoreirregularsteps VertexId e_id = 0, VertexId e_id_in = 0) likeadvanceandfilterbyrunningacomputationstep(withregular __device__ bool CondVertex(VertexId node, DataSlice *p) parallelism)ontheinputoroutputoftheirregularly-parallelstep, __device__ void all within the same kernel. To enable similar behavior in a pro- ApplyVertex(VertexId node, DataSlice *p) grammableway,Gunrockexposesitscomputationstepsasfunctors gunrock::oprtr::advance::Kernel <AdvancePolicy, Problem, Functor> thatareintegratedintoadvanceandfilterkernelsatcompiletimeto <<<advance_grid_size, AdvancePolicy::THREADS>>>( queue_length, achievesimilarefficiency.Wesupportfunctorsthatapplyto{edges, graph_slice->ping_pong_working_queue[selector], vertices} and either return a boolean value (the “cond” functor), graph_slice->ping_pong_working_queue[selector^1], data_slice, usefulforfiltering,orperformacomputation(the“apply”functor). context, gunrock::oprtr::ADVANCETYPE) Thesefunctorswillthenbeintegratedinto“advance”and“filter” gunrock::oprtr::filter::Kernel kernelcalls,whichhideanycomplexitiesofhowthosestepsare <FilterPolicy, Problem, Functor> internallyimplemented.WesummarizetheAPIfortheseoperations <<<filter_grid_size, FilterPolicy::THREADS>>>( queue_length, inFigure3.OurfocusonkernelfusionenabledbyourAPIdesign graph_slice->ping_pong_working_queue[selector], graph_slice->ping_pong_working_queue[selector^1], isabsentfromotherprogrammableGPUgraphlibraries,butitis data_slice) crucialforperformance. Figure 3: Gunrock’s API set. Cond functors compute a boolean Intermsofdatastructures,Gunrockrepresentsallper-nodeand valueperelement,usefulforfiltering.Applyfunctorsimplement per-edgedataasstructure-of-array(SOA)datastructuresthatallow acomputeoperationoneachelement.Userspecificfunctorstruct coalescedmemoryaccesseswithminimalmemorydivergence.The thatcontainsitsownimplementationofthesefourfunctorsisinte- datastructureforthegraphitselfisperhapsevenmoreimportant. gratedatcompiletimeintoAdvanceorFilterkernels,providing InGunrock,weuseacompressedsparserow(CSR)sparsematrix automatickernelfusion. forvertex-centricoperationsbydefaultandallowuserstochoose anedge-list-onlyrepresentationforedge-centricoperations.CSR Gunrockprogramsspecifythreecomponents:theproblem,which usesacolumn-indicesarray,C,tostorealistofneighborvertices providesgraphtopologydataandanalgorithm-specificdatamanage- andarow-offsetsarray,R,tostoretheoffsetoftheneighborlist mentinterface;thefunctors,whichcontainuser-definedcomputation foreachvertex.Itprovidescompactandefficientmemoryaccess, andallowsustousescan,acommonandefficientparallelprimitive, entireblock.Allthethreadsintheblockthencooperativelyprocess toreorganizesparseandunevenworkloadsintodenseanduniform theneighborlistofthewinner’snode.Thisprocedurecontinuesuntil onesinallphasesofgraphprocessing[24]. allnodeswithlargelistshavebeenprocessed.Next,allthreadsin We next provide detail on Gunrock’s implementations of eachwarpbeginasimilarproceduretoprocessallthenodeswhose workload-mapping/load-balancing(Section4.4)andoptimizations neighborlistsaremedium-sizedlists.Finally,theremainingnodes (Section4.5) areprocessedusingourper-threadfine-grainedworkload-mapping strategy(Figure4). 4.4 WorkloadMappingandLoadBalancingDetails The specialization of this method allows higher throughput on Choosingtherightabstractionisonekeycomponentinachieving frontierswithahighvarianceindegreedistribution,butatthecost highperformancewithinagraphframework.Thesecondcomponent of higher overhead due to the sequential processing of the three isoptimizedimplementationsoftheprimitiveswithintheframework. differentsizes. OneofGunrock’smajorcontributionsisgeneralizingtwoworkload- Load-BalancedPartitioning Davidsonetal.andGunrockimprove distributionandload-balancestrategiesthateachpreviouslyapplied onthismethodbyfirstorganizinggroupsofedgesintoequal-length toasinglehardwiredGPUgraphprimitiveintoGunrock’sgeneral- chunksandassigningeachchunktoablock.Thisdivisionrequires purposeadvanceoperator. ustofindthestartingandendingindicesforalltheblockswithinthe Gunrock’sadvancestepgeneratesanirregularworkload.Con- frontier.Weuseanefficientsortedsearchtomapsuchindiceswith sideranadvancethatgeneratesanewvertexfrontierfromtheneigh- thescannededgeoffsetqueue.Whenwestarttoprocessaneighbor borsofallverticesinthecurrentfrontier.Ifweparallelizeoverinput listofanewnode,weusebinarysearchtofindthenodeIDforthe vertices,graphswithavariationinvertexdegree(withdifferent- edgesthataregoingtobeprocessed.Usingthismethod,weensure sizedneighborlists)willgenerateacorrespondingimbalancein load-balancebothwithinablockandbetweenblocks(Figure5). per-vertexwork.Thus,mappingtheworkloadofeachvertexonto Atthehighlevel,Gunrockmakesaload-balancingstrategydecision theGPUsothattheycanbeprocessedinaload-balancedwayis depending on topology. We note that our coarse-grained (load- essentialforefficiency. balancing)traversalmethodperformsbetteronsocialgraphswith Themostsignificantpreviousworkinthisareabalancesloadby irregular distributed degrees, while the fine-grained method is cooperating between threads. Targeting BFS, Merrill et al. [24] superiorongraphswheremostnodeshavesmalldegrees.Forthis map the workload of a single vertex to a thread, a warp, or a reason,inGunrockweimplementahybridofbothmethodsonboth cooperativethreadarray(CTA),accordingtothesizeofitsneighbor vertexandedgefrontiers,usingthefine-graineddynamicgrouping list. Targeting SSSP, Davidson et al. [5] use two load-balanced strategy for nodes with relatively smaller neighbor lists and the workloadmappingstrategies,onethatgroupsinputworkandthe coarse-grained load-balancing strategy for nodes with relatively otherthatgroupsoutputwork.Thefirstpartitionsthefrontierinto larger neighbor lists. Within the latter, we set a static threshold. equallysizedchunksandassignsallneighborlistsofonechunkto Whenthefrontiersizeissmallerthanthethreshold,weusecoarse- oneblock;thesecondpartitionstheneighborlistsetintoequally grained load-balance over nodes, otherwise coarse-grained load- sizedchunks(possiblysplittingtheneighborlistofonenodeinto balanceoveredges.Wehavefoundthatsettingthisthresholdto multiplechunks)andassignseachchunkofedgeliststooneblockof 4096yieldsconsistenthighperformancefortestsacrossallGunrock- threads.Merrilletal.(unlikeDavidsonetal.)alsosupportsthe(BFS- providedgraphprimitives.Userscanalsochangethisvalueeasily specific)abilitytoprocessfrontiersofedgesratherthanjustfrontiers intheEnactormodulefortheirowndatasetsorgraphprimitives. ofvertices.Weintegratebothtechniquestogether,generalizethem Superiorloadbalancingisoneofthemostsignificantreasonswhy intoagenericadvanceoperator,andextendthembysupportingan GunrockoutperformsotherGPUframeworks[36]. effectivepull-basedoptimizationstrategy(Section4.5).Theresult isthefollowingtwoload-balancingstrategieswithinGunrock. 4.5 Gunrock’sOptimizations Per-thread fine-grained One straightforward approach to load OneofourmaingoalsindesigningtheGunrockabstractionwas balancingistomaponefrontiervertex’sneighborlisttoonethread. toeasilyallowintegratingexistingandnewalternativesandopti- Eachthreadloadstheneighborlistoffsetforitsassignednode,then mizationsintoourprimitivestogivemoreoptionstoprogrammers. seriallyprocessesedgesinitsneighborlist.Wehaveimprovedthis Ingeneral,wehavefoundthatourdata-centricabstraction,andour methodinseveralways.First,weloadalltheneighborlistoffsets focusonmanipulatingthefrontier,hasbeenanexcellentfitforthese intosharedmemory,thenuseaCTAofthreadstocooperatively alternativesandoptimizations,comparedtoamoredifficultimple- processper-edgeoperationsontheneighborlist.Simultaneously, mentationpathforotherGPUcomputation-focusedabstractions. weusevertex-cuttosplittheneighborlistofanodesothatitcan Weofferthreeexamples. beprocessedbymultiplethreads.Wefoundoutthatthismethod performsbetterwhenusedforlarge-diametergraphswitharelatively Idempotentvs.non-idempotentoperations Becausemultipleel- even degree distribution since it balances thread work within a ementsinthefrontiermayshareacommonneighbor,anadvance CTA,butnotacrossCTAs.Forgraphswithamoreunevendegree stepmaygenerateanoutputfrontierthathasduplicatedelements. distribution (e.g., scale-free social graphs), we turn to a second Forsomegraphprimitives(e.g.,BFS)with“idempotent”operations, strategy. repeatingacomputationcausesnoharm,andGunrock’sfilterstep canperformaseriesofinexpensiveheuristicstoreduce,butnot Per-warpandper-CTAcoarse-grained Significantdifferencesin eliminate, redundant entries in the output frontier. Gunrock also neighborlistsizecausetheworstperformancewithourper-thread supportsanon-idempotentadvance,whichinternallyusesatomic fine-grainedstrategy.Wedirectlyaddressthevariationinsizeby operationstoguaranteeeachelementappearsonlyonceintheoutput grouping neighbor lists into three categories based on their size, frontier. thenindividuallyprocessingeachcategorywithastrategytargeted directlyatthatsize.Ourthreesizesare(1)listslargerthanaCTA; Push vs. pull traversal Other GPU programmable graph frame- (2)listslargerthanawarp(32threads)butsmallerthanaCTA;and worksalsosupportanadvancestep,ofcourse,butbecausetheyare (3)listssmallerthanawarp.Webeginbyassigningasubsetofthe centeredonvertexoperationsonanimplicitfrontier,theygenerally frontiertoablock.Withinthatblock,eachthreadownsonenode. supportonly“push”-styleadvance:thecurrentfrontierofactive Thethreadsthatownnodeswithlargelistsarbitrateforcontrolofthe vertices“pushes”activestatustoitsneighborstocreatethenew Frontier that contains neighbor lists with various sizes Frontier that contains neighbor lists with various sizes g Load balancing search n ip u o rg Large Medium Small neighbor lists neighbor lists neighbor lists (size > 256) (size > 32 and (size <= 32) <=256) Equally sized chunks, neighbor lists could get split into several chunks Sequentially process three groups block warp thread cooperative cooperative cooperative Block cooperative advance for chunks advance advance advance Figure4:LoadbalancingstrategyofMerrilletal.[24] Figure5:LoadbalancingstrategyofDavidsonetal.[5] frontier.Beameretal.[1]describeda“pull”-styleadvanceonCPUs: potentiallyincreasetheperformanceofvarioustypesofcommunity insteadofstartingwithafrontierofactivevertices,pullstartswitha detectionandlabelpropagationalgorithmsaswellasalgorithmson frontierofunvisitedvertices,generatingthenewfrontierbyfiltering graphswithsmall“longtail”frontiers. theunvisitedfrontierforverticesthathaveneighborsinthecurrent frontier. 5. Applications Beameretal.showedthisapproachisbeneficialwhenthenumber OneoftheprincipaladvantagesofGunrock’sabstractionisthat ofunvisitedverticesdropsbelowthesizeofthecurrentfrontier. ouradvance,filter,andcomputestepscanbecomposedtobuild Whilevertex-centeredGPUframeworkshavefounditchallenging newgraphprimitiveswithminimalextrawork.Foreachprimitive tointegratethisoptimizationintotheirabstraction,ourdata-centric below,wedescribethehardwiredGPUimplementationtowhich abstractionisamuchmorenaturalfitbecausewecaneasilyperform wecompare,followedbyhowweexpressthisprimitiveinGun- moreflexibleoperationsonfrontiers.Gunrockinternallyconverts rock.Section6comparestheperformancebetweenhardwiredand thecurrentfrontierintoabitmapofvertices,generatesanewfrontier Gunrockimplementations. of all unvisited nodes, then uses an advance step to “pull” the computation from these nodes’ predecessors if they are valid in thebitmap. Advance Filter Advance Filter BFS: Update Remove SSSP: Update Remove Near/Far Pile Withthisoptimization,weseeaspeeduponBFSof1.52xforscale- Label Value Redundant Label Value Redundant farbesetrgarcatpiohnslaiknedM1.e2d8uxsfao,rwsimthailtls-dfiexgerdeem-leatrhgoed-d(siaemgmeteenrtgedrarpehdsu.cItnioann) BC: SAiAcgcmduvama Vnucalaelutee RRedeFumilnteodrvaent BCACodmv Vapanulcuteee CoTmrapvuetrastaioln toconstructfrontiers,itwouldbeasignificantchallengetointegrate Filter Filter Advance Filter apull-basedadvance.CurrentlyinGunrock,thisoptimizationis CC: cF[vo1r ]e t=o( vc1[v,2v2].) R, aesmsoigvne c[v] tFo ocr[ cv[,v a]]s. sRigenmove PR: PDRi svtarilbuuet etoUpRdeamteo vPeR w vhaelune. appliedtoBFSonly,butinthefuture,moresophisticatedBCand e when c[v1]==c[v2] v when c[v]==c[c[v]] NeighborsPR value converge SSSPimplementationscouldbenefitfromitaswell. Figure6:OperationflowchartforselectedprimitivesinGunrock(a PriorityQueue AstraightforwardBSPimplementationofanop- blacklinewithanarrowatoneendindicatesawhileloopthatruns eration on a frontier treats each element in the frontier equally, untilthefrontierisempty). i.e., with the same priority. Many graph primitives benefit from prioritizingcertainelementsforcomputationwiththeexpectation that computing those elements first will save work overall (e.g., 5.1 Breadth-FirstSearch(BFS) delta-steppingforSSSP[25]).Gunrockgeneralizestheapproachof Davidsonetal.[5]byallowinguser-definedpriorityfunctionsto BFSinitializesitsvertexfrontierwithasinglesourcevertex.Oneach organizeanoutputfrontierinto“near”and“far”slices.Thisallows iteration,itgeneratesanewfrontierofverticeswithallunvisited theGPUtouseasimpleandhigh-performancesplitoperationto neighborverticesinthecurrentfrontier,settingtheirdepthsand createandmaintainthetwoslices.Gunrockthenconsidersonlythe repeatinguntilallverticeshavebeenvisited.BFSisoneofthemost nearsliceinthenextprocessingsteps,addinganynewelementsthat fundamentalgraphprimitivesandservesasthebasisofseveralother donotpassthenearcriterionintothefarslice,untilthenearsliceis graphprimitives. exhausted.Wethenupdatethepriorityfunctionandoperateonthe Hardwired GPU Implementation The well-known BFS imple- farslice. mentation of Merrill et al. [24] achieves its high performance Like other Gunrock steps, constructing a priority queue directly throughcarefulload-balancing,avoidanceofatomics,andheuristics manipulatesthefrontierdatastructure.Itisdifficulttoimplement foravoidingredundantvertexdiscovery.Itschiefoperationsareex- suchanoperationinaGAS-basedprogrammingmodelsincethat pand(togenerateanewfrontier)andcontract(toremoveredundant programmingmodelhasnoexplicitwaytoreorganizeafrontier. vertices)phases. Currently Gunrock uses this specific optimization only in SSSP, Gunrock Implementation Merrill et al.’s expand maps nicely but we believe a workload reorganization strategy based on a to Gunrock’s advance operator, and contract to Gunrock’s filter more general priority queue implementation will enable a semi- operator.Duringadvance,wesetalabelvalueforeachvertexto asynchronousexecutionmodelinGunrocksincedifferentpartsof showthedistancefromthesource,and/orsetapredecessorvaluefor frontiercanprocessanarbitrarynumberofBSPsteps.Thiswill eachvertexthatshowsthepredecessorvertex’sID.Weimplement efficient load-balancing (Section 4.4) and both push- and pull- 5.4 ConnectedComponentLabeling basedadvance(Section4.5)formoreefficienttraversal.Ourbase The connected component primitive labels the vertices in each implementationusesatomicsduringadvancetopreventconcurrent connectedcomponentinagraphwithauniquecomponentID. vertexdiscovery.Whenavertexisuniquelydiscovered,wesetits label(depth)and/orpredecessorID.Gunrock’sfastestBFSusesthe idempotentadvanceoperator(thusavoidingthecostofatomics)and Hardwired GPU Implementation Soman et al. [34] base their usesheuristicswithinitsfilterthatreducetheconcurrentdiscovery implementationontwoPRAMalgorithms:hookingandpointer- ofchildnodes(Section4.5). jumping.Hookingtakesanedgeastheinputandtriestosetthe componentIDsofthetwoendverticesofthatedgetothesame value.Inodd-numberediterations,thelowervertexwritesitsvalue 5.2 Single-SourceShortestPath tothehighervertex,andviceversaintheevennumberediteration. Single-sourceshortestpathfindspathsbetweenagivensourcevertex Thisstrategyincreasestherateofconvergence.Pointer-jumping andallotherverticesinthegraphsuchthattheweightsonthepath reducesamulti-leveltreeinthegraphtoaone-leveltree(star).By betweensourceanddestinationverticesareminimized.Whilethe repeatingthesetwooperatorsuntilnocomponentIDchangesfor advancemodeofSSSPisidenticaltoBFS,thecomputationmode anynodeinthegraph,thealgorithmwillcomputethenumberof differs. connectedcomponentsforthegraphandtheconnectedcomponent towhicheachnodebelongs. HardwiredGPUImplementation Currentlythehighestperform- Gunrock Implementation Gunrock uses a filter operator on an ingSSSPalgorithmimplementationontheGPUistheworkfrom edge frontier to implement hooking. The frontier starts with all Davidsonetal.[5].Theyprovidetwokeyoptimizationsintheir edgesandduringeachiteration,oneendvertexofeachedgeinthe SSSPimplementation:1)aloadbalancedgraphtraversalmethod frontiertriestoassignitscomponentIDtotheothervertex,andthe and2)apriorityqueueimplementationthatreorganizesthework- filterstepremovestheedgewhosetwoendverticeshavethesame load.Gunrockgeneralizesbothoptimizationstrategiesintoitsimple- componentID.Werepeathookinguntilnovertex’scomponentID mentation,allowingthemtoapplytoothergraphprimitivesaswell changesandthenproceedtopointer-jumping,whereafilteroperator asSSSP.WeimplementGunrock’spriorityqueueasanadditional onverticesassignsthecomponentIDofeachvertextoitsparent’s filterpassbetweentwoiterations. componentIDuntilitreachestheroot.Thenafilterstepremoves thenodewhosecomponentIDequalsitsownnodeID.Thepointer- GunrockImplementation Westartfromasinglesourcevertexin jumpingphasealsoendswhennovertex’scomponentIDchanges. thefrontier.Tocomputeadistancevaluefromthesourcevertex,we needoneadvanceandonefilteroperator.Oneachiteration,wevisit allassociatededgesinparallelforeachvertexinthefrontierand 5.5 PageRankandOtherNodeRankingAlgorithms relaxthedistance’svalue(ifnecessary)oftheverticesattachedto thoseedges.WeuseanAtomicMintoatomicallyfindtheminimal ThePageRanklinkanalysisalgorithmassignsanumericalweighting distancevaluewewanttokeepandabitmapflagarrayassociated toeachelementofahyperlinkedsetofdocuments,suchastheWorld withthefrontiertoremoveredundantvertices.Aftereachiteration, WideWeb,withthepurposeofquantifyingitsrelativeimportance weuseapriorityqueuetoreorganizetheverticesinthefrontier. withintheset.TheiterativemethodofcomputingPageRankgives eachvertexaninitialPageRankvalueandupdatesitbasedonthe PageRankofitsneighbors,untilthePageRankvalueforeachvertex 5.3 BetweennessCentrality converges. PageRank is one of the simplest graph algorithms to TheBCindexcanbeusedinsocialnetworkanalysisasanindicator implementonGPUsbecausethefrontieralwayscontainsallvertices, oftherelativeimportanceofverticesinagraph.Atahighlevel,the soitscomputationiscongruenttosparsematrix-vectormultiply; BCforavertexinagraphisthefractionofshortestpathsinagraph becauseitissimple,mostGPUframeworksimplementitinasimilar thatpassthroughthatvertex.Brandes’sBCformulation[2]ismost wayandattainsimilarperformance. commonlyusedforGPUimplementations. InGunrock,webeginwithafrontierthatcontainsallverticesin thegraphandendwhenallverticeshaveconverged.Eachiteration contains one advance operator to compute the PageRank value HardwiredGPUImplementation Brandes’sformulationhastwo on the frontier of vertices, and one filter operator to remove the passes:aforwardBFSpasstoaccumulatesigmavaluesforeach verticeswhosePageRankshavealreadyconverged.Weaccumulate node,andabackwardBFSpasstocomputecentralityvalues.Jiaet PageRankvalueswithAtomicAddoperations. al.[16]andSariyu¨ceetal.[31]bothuseanedge-parallelmethod toimplementtheabovetwopasses.WeachievethisinGunrock using two advance operators on an edge frontier with different Bipartitegraphs Geiletal.[9]usedGunrocktoimplementTwit- computations.Therecent(hardwired)multi-GPUBCalgorithmby ter’swho-to-followalgorithm(“Money”[11]),whichincorporated McLaughlinandBader[22]usestaskparallelism,dynamicload threenode-rankingalgorithmsbasedonbipartitegraphs(Personal- balancing,andsamplingtechniquestoperformBCcomputationin izedPageRank,StochasticApproachforLink-StructureAnalysis parallelfromdifferentsourcesondifferentGPUSMXs. (SALSA),andHyperlink-InducedTopicSearch(HITS)).Theirim- plementation,thefirsttouseaprogrammableframeworkforbipar- Gunrock Implementation Gunrock’s implementation also con- titegraphs,demonstratedthatGunrock’sadvanceoperatorisflexible tains two phases. The first phase has an advance step identical enoughtoencompassallthreenode-rankingalgorithms,includinga totheoriginalBFSandacomputationstepthatcomputesthenum- 2-hoptraversalinabipartitegraph. berofshortestpathsfromsourcetoeachvertex.Thesecondphase uses an advance step to iterate over the BFS frontier backwards with a computation step to compute the dependency scores. We Beyond the five graph primitives we evaluate here, we have achievecompetitiveperformanceonscale-freegraphswiththelat- developedorareactivelydevelopingseveralothergraphprimitives esthardwiredBCalgorithm[23].WithinGunrock,wehaven’tyet inGunrock,includingminimalspanningtree,maximalindependent consideredtaskparallelismsinceitappearstobelimitedtoBC,but set,graphcoloring,Louvain’smethodforcommunitydetection,and itisaninterestingareaforfuturework. graphmatching. Dataset Vertices Edges MaxDegree Diameter Type largraphs.Aswell,graphswithuniformlylowdegreeexposeless parallelismandwouldtendtoshowsmallergainsincomparisonto soc-orkut 3M 212.7M 27,466 9 rs CPU-basedmethods. hollywood-09 1.1M 112.8M 11,467 11 rs indochina-04 7.4M 302M 256,425 26 rs vs.CPUGraphLibraries WecompareGunrock’sperformance krong500-logn21 2.1M 182.1M 213,904 6 gs withfourCPUgraphlibraries:theBoostGraphLibrary(BGL)[33], rggn24 16.8M 265.1M 40 2622 gm one of the highest-performing CPU single-threaded graph li- roadnetCA 2M 5.5M 12 849 rm braries[21];PowerGraph,apopulardistributedgraphlibrary[12]; andLigra[32]andGalois[26,28],twoofthehighest-performing Table1:DatasetDescriptionTable.Graphtypesare:r:real-world, multi-coreshared-memorygraphlibraries.AgainstbothBGLand g:generated,s:scale-free,andm:mesh-like. PowerGraph,Gunrockachieves6x–337xspeeduponaverageon allprimitives.ComparedtoLigra,Gunrock’sperformanceisgener- allycomparableonmosttestedgraphprimitives;noteLigrauses Algorithm Galois BGL PowerGraph Medusa bothCPUseffectively.TheperformanceinconsistencyforSSSP vs. Ligra is due to comparing our Dijkstra-based method with BFS 2.811 — — 6.938 Ligra’sBellman-Fordalgorithm.OurSSSP’sedgethroughputis SSSP 0.725 52.04 6.207 11.88 BC 1.494 — — — smallerthanBFSbutsimilartoBCbecauseofsimilarcomputations PageRank 1.94 337.6 9.683 8.982 (atomicMinvs.atomicAdd)andalargernumberofiterationsfor CC 1.859 171.3 143.8 — convergence.TheperformanceinconsistencyforBCvs.Ligraon fourscale-freegraphsisbecausethatLigraappliespull-basedtraver- Table 2: Geometric-mean runtime speedups of Gunrock on the salonBCwhileGunrockhasnotyetdoneso.ComparedtoGalois, datasets from Table 1 over frameworks not in Table 3. Due to Gunrockshowsmorespeedupontraversal-basedgraphprimitives Medusa’smemorylimitations,itsSSSPandPageRankcomparisons (BFS,SSSP,andBC)andlessperformanceadvantageonPageRank weremeasuredonsmallerdatasets. andCCduetotheirdensecomputationandmoreregularfrontier structures. vs.HardwiredGPUImplementationsandGPULibraries Com- 6. Experiments&Results paredtohardwiredGPUimplementations,dependingonthedataset, WeranallexperimentsinthispaperonaLinuxworkstationwith Gunrock’sperformanceiscomparableorbetteronBFS,BC,and 2×3.50GHzIntel4-core,hyperthreadedE5-2637v2XeonCPUs, SSSP.ForCC,Gunrockis5xslower(geometricmean)thanthe 528GBofmainmemory,andanNVIDIAK40cGPUwith12GB hardwiredGPUimplementationduetoirregularcontrolflowbe- on-boardmemory.GPUprogramswerecompiledwithNVIDIA’s causeeachiterationstartswithfulledgelistsinbothhookingand nvcc compiler (version 7.0.27) with the -O3 flag. The BGL and pointer-jumpingphases.Thealternativeisextrastepstoperform PowerGraphcodewerecompiledusinggcc4.8.4withthe-O3flag. additionaldatareorganization.Thistradeoffisnottypicalofour Ligra was compiled using icpc 15.0.1 with CilkPlus. All results otherprimitives.Whilestillachievinghighperformance,Gunrock’s ignoretransfertime(bothdisk-to-memoryandCPU-to-GPU).All applicationcodeissmallerinsizeandclearerinlogiccomparedto testswererun10timeswiththeaverageruntimeusedforresults. otherGPUgraphlibraries1.Gunrock’sProblemclass(thatdefines ThedatasetsusedinourexperimentsareshowninTable1.We problemdatausedforthegraphalgorithm)andkernelenactorare convertedalldatasetstoundirectedgraphs.Thesixdatasetsinclude bothtemplate-basedC++code;Gunrock’sfunctorcodethatspeci- bothreal-worldandgeneratedgraphs;thetopologyofthesedatasets fiesper-nodeorper-edgecomputationisC-likedevicecodewithout spansfromregulartoscale-free. anyCUDA-specifickeywords.Foranewgraphprimitive,usersonly Soc-orkut(soc)andhollywood-09(h09)aretwosocialgraphs; needtowritefrom133(simpleprimitive,BFS)to261(complex indochina-04(i04)isacrawledhyperlinkgraphfromindochinaweb primitive,SALSA)linesofcode.WritingGunrockcodemayrequire domains;kron g500-logn21(kron)isageneratedR-MATgraph.All parallelprogrammingconcepts(e.g.,atomics)butneitherdetailsof fourdatasetsarescale-freegraphswithdiametersoflessthan20 low-levelGPUprogrammingnoroptimizationknowledge. andunevenlydistributednodedegrees(80%ofnodeshavedegree GunrockcomparesfavorablytoexistingGPUgraphlibraries.Map- lessthan64). GraphisfasterthanMedusaonallbutonetest[8]andGunrockis Both rgg n 24 (rgg) and roadnet CA (roadnet) datasets have fasterthanMapGraphonalltests:thegeometricmeanofGunrock’s large diameters with small and evenly distributed node degrees speedupsoverMapGraphonBFS,SSSP,andPageRankare4.3,3.7, (mostnodeshavedegreelessthan12). and2.1,respectively.GunrockalsooutperformsCuShaonBFSand socisfromNetworkRepository;i04,h09,andkronarefrom SSSP.ForPageRank,Gunrockachievescomparableperformance UFSparseMatrixCollection;rggisarandomgeometricgraphwe withouttheG-Sharddatapreprocessing,whichservesasthemain generated. load-balancingmoduleinCuSha.The1-GPUGunrockimplementa- The edge weight values (used in SSSP) for each dataset are tionhas1.83xmoreMTEPS(4731vs.2590)ondirection-optimized randomvaluesbetween1and64. BFSonthesoc-LiveJournaldataset(asmallerscale-freegraphin theirtestset)thanthe2-CPU,2-GPUconfigurationofTotem[30]. Performance Summary Tables 2 and 3, and Figure 7, compare AllthreeGPUBFS-basedhigh-level-programming-modelefforts Gunrock’sperformanceagainstseveralothergraphlibrariesand (Medusa,MapGraph,andGunrock)adoptload-balancingstrategies hardwired GPU implementations. In general, Gunrock’s perfor- fromMerrilletal.’sBFS[24].WhilewewouldthusexpectGun- manceonBFS-basedprimitives(BFS,BC,andSSSP)showscom- rocktoshowsimilarperformanceonBFS-basedgraphprimitivesas paratively better results when compared to other graph libraries on four scale-free graphs (soc, h09, i04, and kron), than on two small-degreelarge-diametergraphs,rggandroadnet.Theprimary 1WebelievethisassertionistruegivenourexperiencewithotherGPU reasonisourload-balancingstrategyduringtraversal(Table4shows librarieswhenpreparingthisevaluationsection,butfreelyacknowledgethis Gunrock’ssuperiorperformanceonwarpefficiency,ameasureof isnearlyimpossibletoquantify.Weinvitereaderstoperuseourannotated load-balancingquality,acrossGPUframeworksanddatasets),and codeforBFSandSALSAathttp://gunrock.github.io/gunrock/ particularlyouremphasisongoodperformanceforhighlyirregu- doc/annotated_primitives/annotated_primitives.html. Runtime(ms)[lowerisbetter] Edgethroughput(MTEPS)[higherisbetter] Hardwired Hardwired Alg. Dataset CuSha MapGraph GPU Ligra Gunrock CuSha MapGraph GPU Ligra Gunrock soc 251.8 OOM 45.43 27.2 47.23 844.7 — 4681 7819 4503 h09 244.1 62.9 22.43 13.9 20.01 461.5 1791 5116 8100 5627 S i04 1809 OOM 84.08 223 62.12 164.8 — 4681 1337 4799 BF kron 237.9 162.7 37.33 18.5 19.15 765.2 1119 4877 9844 9510 rgg 52522 OOM 202.5 1020 351.4 5.048 — 1309 260 754.6 roadnet 288.5 61.66 8.21 82.1 31 19.14 89.54 672.9 67.25 178.1 soc — OOM 1106.6* 950 1088 — — — — 195.5 h09 1043 OOM 308.5* 281 100.4 — — — — 1122 P i04 — OOM OOM 850 511.5 — — — — 582.9 S S kron 315.5 540.8 677.7* 416 222.7 — — — — 817.6 S rgg — OOM OOM 103000 117089 — — — — 2.264 roadnet 1185 1285 224.2 451 222.1 — — 24.63 — 24.86 soc — — 1044 223 721.2 — — 407.4 1907 589.8 h09 — — 479.5 78.6 132.3 — — 469.6 2867 1703 i04 — — 1389 557 164.3 — — 429.1 1071 3630 C B kron — — 488.3 184 716.1 — — 745.8 1979 508.5 rgg — — 25307 2720 1449 — — 20.94 195 366 roadnet — — 256.8 232 120.6 — — 42.99 47.6 91.57 soc 105.8 OOM — 721 176 k h09 43.27 94.35 — 107 27.31 n a i04 121.8 OOM — 273 74.28 R e kron 46.6 739.8 — 456 176.2 g Pa rgg 48.6 OOM — 307 80.42 roadnet 0.864 8.069 — 14.6 6.691 soc — — 91.58 313 252.9 h09 — — 37.05 129 202.8 i04 — — 120.8 535 2501 C C kron — — 142.7 311 428.9 rgg — — 109.6 3280 552.7 roadnet — — 6.78 776 25.52 Table 3: Gunrock’s performance comparison (runtime and edge throughput) with other graph libraries (CuSha, MapGraph, Ligra) and hardwiredGPUimplementations.SSSPMTEPSstatisticsareunavailableinmostframeworks.AllPageRanktimesarenormalizedtoone iteration.HardwiredGPUimplementationsforeachprimitiveareb40c(BFS)[24],delta-steppingSSSP[5](numberswith*areachieved withoutdelta-steppingoptimization,otherwisewillrunoutofmemory),gpu BC(BC)[31],andconn(CC)[34].OOMmeansout-of-memory. theseotherframeworks,weattributeourperformanceadvantageto Alg. Framework soc h09 i04 kron rgg roadnet tworeasons:(1)ourimprovementstoefficientandload-balanced Gunrock 97.39% 97.35% 97.97% 97.73% 96.72% 97.01% traversalthatareintegratedintotheGunrockcore,and(2)amore BFS MapGraph — 95.81% — 97.19% — 87.49% powerful,GPU-specificprogrammingmodelthatallowsmoreeffi- CuSha 77.12% 80.12% 72.40% 50.34% 85.32% 87.80% cienthigh-levelgraphimplementations.(1)isalsothereasonthat Gunrock 83.35% 82.56% 83.18% 85.15% 82.84% 83.47% Gunrockimplementationscancompetewithhardwiredimplementa- SSSP MapGraph — — — 95.62% — 91.51% CuSha 78.40% 80.17% 76.63% 52.72% 86.96% 85.28% tions;webelieveGunrock’sload-balancingandworkdistribution Gunrock 99.56% 99.42% 99.54% 99.43% 99.52% 99.49% strategiesareatleastasgoodasifnotbetterthanthehardwired PR MapGraph — 98.97% — 99.16% — 96.27% primitiveswecompareagainst.Gunrock’smemoryfootprintisat CuSha 82.29% 87.26% 85.10% 63.46% 91.04% 89.23% thesamelevelasMedusaandbetterthanMapGraph(notetheOOM testcasesforMapGraphinTable3).Thedatasizeisα|E|+β|V| Table 4: Average warp execution efficiency (fraction of threads forcurrentgraphprimitives,where|E|isthenumberofedges,|V| active during computation). This figure is a good metric for the isthenumberofnodes,andαandβarebothintegerswhereαis qualityofaframework’sload-balancingcapability.(—indicatesthe usually1andatmost3(forBC)andβisbetween2to8. graphframeworkranoutofmemory.) Figure8showshowdifferentoptimizationstrategiesimprove theperformanceofgraphtraversal;hereweuseBFSasanexample. distribution;manyapplicationscenariosmayallowprecomputation AsnotedinSection4.4,theload-balancingtraversalmethodworks ofthisdistributionandthuswecanchoosetheoptimalstrategies better on social graphs with irregular distributed degrees, while beforewebegincomputation. theThread-Warp-CTAmethodworksbetterongraphswheremost nodeshavesmalldegrees.Thedirection-optimaltraversalstrategy 7. FutureWork alsoworksbetteronsocialgraphs,whereasontheroad-networkand bitcoin-transactionsgraph,weseelessconcurrentdiscoveryandthe We believe Gunrock currently provides an excellent foundation performancebenefitsarenotassignificant.Ingeneral,wecanpredict for developing GPU-based graph primitives. We hope to extend whichstrategieswillbemostbeneficialbasedonlyonthedegree Gunrockwiththefollowingimprovements:

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.