Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search Scott Beamer Aydn Buluc Krste Asanovi David A. Patterson Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2013-2 http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-2.html January 3, 2013 Report Documentation Page Form Approved OMB No. 0704-0188 Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number. 1. REPORT DATE 3. DATES COVERED 03 JAN 2013 2. REPORT TYPE 00-00-2013 to 00-00-2013 4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER Distributed Memory Breadth-First Search Revisited: Enabling 5b. GRANT NUMBER Bottom-Up Search 5c. PROGRAM ELEMENT NUMBER 6. AUTHOR(S) 5d. PROJECT NUMBER 5e. TASK NUMBER 5f. WORK UNIT NUMBER 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION University of California at Berkeley,Electrical Engineering and REPORT NUMBER Computer Sciences,Berkeley,CA,94720 9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S) 11. SPONSOR/MONITOR’S REPORT NUMBER(S) 12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release; distribution unlimited 13. SUPPLEMENTARY NOTES 14. ABSTRACT Breadth-first search (BFS) is a fundamental graph primitive frequently used as a building block for many complex graph algorithms. In the worst case, the complexity of BFS is linear in the number of edges and vertices, and the conventional top-down approach always takes as much time as the worst case. A recently discovered bottom-up approach manages to cut down the complexity all the way to the number of vertices in the best case, which is typically at least an order of magnitude less than the number of edges. The bottom-up approach is not always advantageous, so it is combined with the top-down approach to make the direction-optimizing algorithm which adaptively switches from top-down to bottom-up as the frontier expands. We present a scalable distributed-memory parallelization of this challenging algorithm and show up to an order of magnitude speedups compared to an earlier purely top-down code. Our approach also uses a 2D decomposition of the graph that has previously been shown to be superior to a 1D decomposition. Using the default parameters of the Graph500 benchmark, our new algorithm achieves a performance rate of over 240 billion edges per second on 115 thousand cores of a Cray XE6, which makes it over 7 faster than a conventional top-down algorithm using the same set of optimizations and data distribution. 15. SUBJECT TERMS 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF 18. NUMBER 19a. NAME OF ABSTRACT OF PAGES RESPONSIBLE PERSON a. REPORT b. ABSTRACT c. 
Copyright © 2013, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Acknowledgement: This research used resources of the NERSC, which is supported by the Office of Science of the U.S. DOE under Contract No. DE-AC02-05CH11231. This research also used resources of the Oak Ridge Leadership Computing Facility located in the ORNL, which is supported by the Office of Science of the DOE under Contract DE-AC05-00OR22725. The second author was supported in part by the DARPA UHPC program under contract HR0011-10-9-0008, and in part by the Director, Office of Science, U.S. DOE under Contract No. DE-AC02-05CH11231. Research was also supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227). Additional support comes from Par Lab affiliates National Instruments, Nokia, NVIDIA, Oracle, and Samsung.

Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search
Scott Beamer (EECS Department, University of California, Berkeley), Aydın Buluç (Computational Research Division, Lawrence Berkeley National Laboratory), Krste Asanović and David Patterson (EECS Department, University of California, Berkeley)

Abstract—Breadth-first search (BFS) is a fundamental graph primitive frequently used as a building block for many complex graph algorithms. In the worst case, the complexity of BFS is linear in the number of edges and vertices, and the conventional top-down approach always takes as much time as the worst case. A recently discovered bottom-up approach manages to cut down the complexity all the way to the number of vertices in the best case, which is typically at least an order of magnitude less than the number of edges. The bottom-up approach is not always advantageous, so it is combined with the top-down approach to make the direction-optimizing algorithm, which adaptively switches from top-down to bottom-up as the frontier expands. We present a scalable distributed-memory parallelization of this challenging algorithm and show up to an order of magnitude speedup compared to an earlier purely top-down code. Our approach also uses a 2D decomposition of the graph that has previously been shown to be superior to a 1D decomposition. Using the default parameters of the Graph500 benchmark, our new algorithm achieves a performance rate of over 240 billion edges per second on 115 thousand cores of a Cray XE6, which makes it over 7× faster than a conventional top-down algorithm using the same set of optimizations and data distribution.

I. INTRODUCTION

Breadth-first search (BFS) is a fundamental graph traversal technique that serves as a building block for many graph algorithms. Parallel graph algorithms increasingly rely on BFS as the alternative graph traversal approach, since depth-first search is inherently sequential. The fastest parallel graph algorithms often use BFS even for cases when the optimal sequential algorithm for solving the same problem relies on depth-first search, such as identifying strongly connected components [1] [2].

Given a distinguished source vertex s, BFS systematically explores the graph G to discover every vertex that is reachable from s. In the worst case, BFS has to explore all of the edges in the connected component in order to reach every vertex in the connected component. A simple level-synchronous traversal that explores all of the outgoing edges of the current frontier (the set of vertices discovered in this level) is therefore considered optimal in the worst-case analysis. This level-synchronous algorithm exposes lots of parallelism for low-diameter small-world graphs [3]. Many real-world graphs, such as those representing social interactions and brain anatomy [4], are small-world.

This level-synchronous algorithm (henceforth called top-down) is overly pessimistic and can be wasteful in practice, because it always does as many operations as the worst case. Suppose that a vertex v is d hops away from the source and is reachable by x vertices, x′ ≤ x of which are d−1 hops away from the source. In other words, each one of those x′ vertices can potentially be the parent of v. In theory, only one of those x′ incoming edges of v needs to be explored, but the top-down algorithm is unable to exploit this and does x′−1 extra checks. By contrast, v would quickly find a parent by checking its incoming edges if a significant number of its neighbors are reachable within d−1 hops of the source. The direction-optimizing BFS algorithm [5] uses this intuition to significantly outperform the top-down algorithm because it reduces the number of edge examinations by integrating a bottom-up algorithm into its search.

Implementing this bottom-up search strategy on distributed memory poses multiple challenges. First, the bottom-up approach needs fast frontier membership tests to find a neighbor in the frontier, but the frontier is far too large to replicate in each processor's memory. Second, each vertex's search for a parent must be sequentialized in order to skip checking unnecessary edges once a parent is found. If a vertex's search for a parent is fully parallelized, there is potential the search will not terminate as soon as a parent is found, resulting in redundant work that could nullify any performance gains. We tackle the first challenge by adapting the two-dimensional graph partitioning approach that reduces the amount of the frontier that needs to be locally replicated for each processor. We tackle the second challenge by using systolic shifts that provide a good compromise between work and parallelism. In this paper we introduce a distributed memory parallel algorithm for bottom-up search.

The primary contributions of this article are:
• A novel distributed-memory parallel algorithm for the bottom-up BFS using a two-dimensional decomposition.
• Demonstration of excellent weak scaling on up to 115,000 cores of a Cray XE6, and a 6.5−7.9× performance increase over the top-down algorithm.
• Careful analysis of the communication costs in our new algorithm, which highlights the reduction in the amount of data communicated compared to the top-down algorithm.
The technical heart of our paper is Section IV, where we present the distributed memory parallelization of our 2D bottom-up algorithm, its parallel complexity analysis, and implementation detail. To yield a fast direction-optimizing BFS implementation, our bottom-up implementation is combined with an existing performant top-down implementation [6]. We provide a parallel complexity analysis of the new algorithm in terms of the bandwidth and synchronization (latency) costs in Section V. Section VI gives details about our direction-optimizing approach that combines top-down and bottom-up steps. Our extensive large-scale experiments on Cray XK6 and Cray XE6 machines are in Section VIII.

II. BREADTH-FIRST SEARCH

Before delving into the details of implementing our parallel algorithm, we review sequential versions of the top-down and bottom-up BFS algorithms. The level-synchronous top-down BFS can be implemented sequentially using a queue, as shown in Algorithm 1. The algorithm outputs an implicit "breadth-first spanning tree" rooted at s by maintaining parents for each vertex. The parent of a vertex v that is d hops away from the root can be any of the vertices that are both d−1 hops away from the root and have an outgoing edge to v. This algorithm is optimal in the worst case, running in time proportional to Θ(n + m), where n = |V| is the number of vertices and m = |E| is the number of edges of a graph G = (V,E). However, its best-case performance is no better than its worst-case performance, and it may do many edge examinations that are redundant because the vertex has already been examined by the breadth-first search.

Algorithm 1 Sequential top-down BFS algorithm
Input: G(V,E), source vertex s
Output: parent[1..n], where parent[v] gives the parent of v ∈ V in the BFS tree or −1 if it is unreachable from s
 1: parent[:] ← −1, parent[s] ← s
 2: frontier ← {s}, next ← ∅
 3: while frontier ≠ ∅ do
 4:   for each u in frontier do
 5:     for each neighbor v of u do
 6:       if parent[v] = −1 then
 7:         next ← next ∪ {v}
 8:         parent[v] ← u
 9:   frontier ← next, next ← ∅

Algorithm 2 Sequential bottom-up BFS algorithm
Input: G(V,E), source vertex s
Output: parent[1..n], where parent[v] gives the parent of v ∈ V in the BFS tree or −1 if it is unreachable from s
 1: parent[:] ← −1, parent[s] ← s
 2: frontier ← {s}, next ← ∅
 3: while frontier ≠ ∅ do
 4:   for each u in V do
 5:     if parent[u] = −1 then
 6:       for each neighbor v of u do
 7:         if v in frontier then
 8:           next ← next ∪ {u}
 9:           parent[u] ← v
10:           break
11:   frontier ← next, next ← ∅

The key insight the bottom-up approach leverages is that most edge examinations are unsuccessful because the endpoints have already been visited. In the conventional top-down approach, during each step, every vertex in the frontier examines all of its neighbors and claims the unvisited ones as children and adds them to the next frontier. On a low-diameter graph when the frontier is at its largest, most neighbors of the frontier have already been explored (many of which are within the frontier), but the top-down approach must check every edge in case the neighbor's only legal parent is in the frontier. The bottom-up approach passes this responsibility from the parents to the children (Algorithm 2).

During each step of the bottom-up approach, every unvisited vertex (parent[u] = −1) checks its neighbors to see if any of them are in the frontier. If one is, it is a valid parent and the neighbor examinations (line 6 – line 10) can end early. This early termination sequentializes the inner loop in order to get the savings from stopping as soon as a valid parent is found. In general, the bottom-up approach is only advantageous when the frontier constitutes a substantial fraction of the graph. Thus, a performant BFS will use the top-down approach for the beginning and end of the search and the bottom-up approach for the middle steps when the frontier is at its largest. Since the BFS for each step is done in whichever direction will require the least work, it is a direction-optimizing BFS.
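For concreteness, the two sequential algorithms can be expressed in a few lines of executable code. The following Python sketch mirrors Algorithms 1 and 2; the adjacency-list representation and function names are illustrative choices for this example, not the data structures used by the distributed implementation described later.

```python
def top_down_bfs(adj, s):
    """Algorithm 1: each frontier vertex claims its unvisited neighbors."""
    parent = [-1] * len(adj)
    parent[s] = s
    frontier = [s]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if parent[v] == -1:
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent

def bottom_up_bfs(adj, s):
    """Algorithm 2: each unvisited vertex scans its neighbors for a parent
    and stops at the first one it finds (the early-exit break)."""
    n = len(adj)
    parent = [-1] * n
    parent[s] = s
    frontier = {s}
    while frontier:
        next_frontier = set()
        for u in range(n):
            if parent[u] == -1:
                for v in adj[u]:
                    if v in frontier:
                        parent[u] = v
                        next_frontier.add(u)
                        break  # early termination saves edge examinations
        frontier = next_frontier
    return parent

if __name__ == "__main__":
    # Small undirected example graph given as adjacency lists.
    adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
    print(top_down_bfs(adj, 0))
    print(bottom_up_bfs(adj, 0))
```

Both functions return the same breadth-first spanning tree; the difference is only in which endpoint of each edge does the checking, which is exactly what the direction-optimizing approach exploits.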
III. PARALLEL TOP-DOWN BFS

Data distribution plays a critical role in parallelizing BFS on distributed-memory machines. The approach of partitioning vertices to individual processors (along with their outgoing edges) is the so-called 1D partitioning. By contrast, 2D partitioning assigns vertices to groups of processors (along with their outgoing edges), which are further assigned to members of the group. 2D checkerboard partitioning assumes the sparse adjacency matrix of the graph is partitioned as follows:

    A = ( A_{1,1}    ...  A_{1,p_c}
          ...        ...  ...
          A_{p_r,1}  ...  A_{p_r,p_c} )        (1)

Processors are logically organized in a square p = p_r × p_c mesh, indexed by their row and column indices. Submatrix A_{ij} is assigned to processor P(i,j). The nonzeros in the ith row of the sparse adjacency matrix A represent the outgoing edges of the ith vertex of G, and the nonzeros in the jth column of A represent the incoming edges of the jth vertex. Our top-down algorithm actually operates on the transpose of this matrix, but we will omit the transpose and assume that the input is pre-transposed for the rest of this section.

Algorithm 3 Parallel 2D top-down BFS algorithm (adapted from the linear algebraic algorithm [6])
Input: A: graph represented by a boolean sparse adjacency matrix, s: source vertex id
Output: π: dense vector, where π[v] is the predecessor vertex on the shortest path from s to v, or −1 if v is unreachable
 1: π(:) ← −1, π(s) ← s
 2: f(s) ← s                           ▷ f is the current frontier
 3: for all processors P(i,j) in parallel do
 4:   while f ≠ ∅ do
 5:     TRANSPOSEVECTOR(f_ij)
 6:     f_i ← ALLGATHERV(f_ij, P(:,j))
 7:     t_i ← ∅                        ▷ t is candidate parents
 8:     for each f_i(u) ≠ 0 do         ▷ u is in the frontier
 9:       adj(u) ← INDICES(A_ij(:,u))
10:       t_i ← t_i ∪ PAIR(adj(u), u)
11:     t_ij ← ALLTOALLV(t_i, P(i,:))
12:     for (v,u) in t_ij do
13:       if π_ij(v) = −1 then         ▷ set parent if newly discovered
14:         π_ij(v) ← u
15:         f_ij(v) ← v
16:       else                         ▷ remove if discovered before
17:         t_ij ← t_ij \ (u,v)

The pseudocode for the parallel top-down BFS algorithm with 2D partitioning is given in Algorithm 3 for completeness. Both f and t are implemented as sparse vectors. For distributed vectors, the syntax v_ij denotes the local n/p-sized piece of the vector owned by the P(i,j)th processor. The syntax v_i denotes the hypothetical n/p_r-sized piece of the vector collectively owned by all the processors along the ith processor row P(i,:). The algorithm has four major steps:
• Expand: Construct the current frontier of vertices on each processor by a collective allgather step along the processor column (line 6).
• Local discovery: Inspect adjacencies of vertices in the current frontier and locally merge them (line 8). The operation is actually a sparse matrix-sparse vector multiplication on a special semiring where each scalar multiply returns the second operand and each scalar addition returns the minimum.
• Fold: Exchange newly-discovered adjacencies using a collective alltoallv step along the processor row (line 11). This step optionally merges updates from multiple processors to the same vertex using the first pair entry (the discovered vertex id) as the key.
• Local update: Update distances/parents for unvisited vertices (line 12). The new frontier is composed of any entries that were not removed from the candidate parents.

In contrast to the 1D case, communication in the 2D algorithm happens only along one processor dimension. If Expand happens along one processor dimension, then Fold happens along the other processor dimension. Both 1D and 2D algorithms can be enhanced by in-node multithreading, resulting in one MPI process per chip instead of one MPI process per core, which will reduce the number of communicating parties. Large-scale experiments of 1D versus 2D show that the 2D approach's communication costs are lower than the respective 1D approach's, with or without in-node multithreading [6]. The study also shows that in-node multithreading gives a further performance boost by decreasing network contention.
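The checkerboard layout of Equation (1) determines which rows (outgoing edges) and columns (incoming edges) of A each processor touches. The short Python sketch below illustrates one plausible block assignment for a p_r × p_c mesh; the near-equal contiguous ranges and helper names are assumptions for illustration, since the paper does not prescribe a specific indexing scheme at this point.

```python
import math

def block_ranges(n, parts):
    """Split [0, n) into `parts` contiguous, nearly equal ranges."""
    size = math.ceil(n / parts)
    return [(k * size, min((k + 1) * size, n)) for k in range(parts)]

def describe_2d_partition(n, p_r, p_c):
    """For each processor P(i, j) of the p_r x p_c mesh, report which rows
    and columns of the adjacency matrix (outgoing/incoming edge endpoints)
    its submatrix A_{i,j} covers."""
    row_blocks = block_ranges(n, p_r)
    col_blocks = block_ranges(n, p_c)
    return {(i, j): {"rows": row_blocks[i], "cols": col_blocks[j]}
            for i in range(p_r) for j in range(p_c)}

if __name__ == "__main__":
    # Toy example: 16 vertices on a 2 x 2 processor mesh.
    for proc, blocks in describe_2d_partition(16, 2, 2).items():
        print(proc, blocks)
```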
IV. PARALLEL BOTTOM-UP BFS

Implementing a bottom-up BFS on a cluster with distributed memory introduces some challenges that are not present in the shared memory case. The speedup from the algorithm is dependent on fast membership tests for the frontier and sequentializing the inner loop. On a single compute node, the fast (constant-time) membership tests for the frontier can be efficiently implemented with a bitmap that often fits in the last level of cache. Sequentializing the inner loop is trivial since the outer loop can still provide sufficient parallelism to achieve good multicore performance.

A performant distributed implementation must have fast membership tests for the frontier, which necessitates being able to determine if a vertex is in the frontier without crossing the network. Holding the entire frontier in each processor's memory is clearly unscalable. Fortunately, the 2D decomposition [6] [7] greatly aids this, since for each processor, only a small subset of vertices can be the sources of a processor's incoming edges. This subset is small enough that it can fit in a processor's memory, and the frontier can be represented with a dense vector for constant-time access. The dense format does not necessarily consume more memory than a sparse vector, because it can be compressed by using a bitmap, and the frontier is typically a large fraction of the graph during the bottom-up steps.

Although the 2D decomposition helps with providing fast frontier checks, it complicates sequentializing the inner loop. Since all of the edges for a given vertex are spread across multiple processors, the examination of a vertex's neighbors will be done in parallel. If the inner loop is not sequentialized, the bottom-up approach's advantage of terminating the inner loop early once a parent is found will be hard to maintain. Unnecessary edges could be examined during the time it takes for the termination message to propagate across the network.

To sequentialize the inner loop of checking if neighbors are in the frontier, we propose partitioning the work temporally (Figure 1). We break down the search step into p_c sub-steps, and during each sub-step, a given vertex's edges will be examined by only one processor. During each sub-step, a processor processes (1/p_c)th of the vertices in that processor row. After each sub-step, it passes on the responsibility for those vertices to the processor to its right and accepts new vertices from the processor to its left. This pairwise communication sends which vertices have been completed (found parents), so the next processor knows to skip over them. This has the effect of the processor responsible for processing a vertex rotating right along the row each sub-step. When a vertex finds a valid parent and becomes visited, its index along with its discovered parent is queued up and sent to the processor responsible for the corresponding segment of the parent array to update it.

Fig. 1. Sub-step for processors p_{i,j} and p_{i,j+1}. They initially use their segment of completed (c) to filter which vertices to process from the shaded region and update completed for each discovery. At the end of the sub-step, the completed segments rotate to the right. The parent updates are also transmitted at the end of the sub-step (not shown).

Algorithm 4 Parallel 2D bottom-up BFS algorithm
Input: A: graph represented by a boolean sparse adjacency matrix, s: source vertex id
Output: π: dense vector, where π[v] is the parent vertex on the shortest path from s to v, or −1 if v is unreachable
 1: f(:) ← 0, f(s) ← 1                  ▷ bitmap for frontier
 2: c(:) ← 0, c(s) ← 1                  ▷ bitmap for completed
 3: π(:) ← −1, π(s) ← s
 4: while f(:) ≠ 0 do
 5:   for all processors P(i,j) in parallel do
 6:     TRANSPOSEVECTOR(f_ij)
 7:     f_i ← ALLGATHERV(f_ij, P(:,j))
 8:     for s in 0 ... p_c − 1 do        ▷ p_c sub-steps
 9:       t ← ∅                          ▷ t holds parent updates
10:       for u in V_{i,j+s} do
11:         if c_ij(u) = 0 then          ▷ u is unvisited
12:           for each neighbor v of u do
13:             if f_i(v) = 1 then
14:               t_ij ← t_ij ∪ {(u,v)}
15:               c_ij(u) ← 1
16:               break
17:       f_ij(:) ← 0
18:       w_ij ← SENDRECV(t_ij, P(i,j+s), P(i,j−s))
19:       for (u,v) in w_ij do
20:         π_ij(u) ← v
21:         f_ij(u) ← 1
22:       c_ij ← SENDRECV(c_ij, P(i,j+1), P(i,j−1))

The pseudocode for our parallel bottom-up BFS algorithm with 2D partitioning is given in Algorithm 4 for completeness. f (frontier) is implemented as a dense bitmap and π (parents) is implemented as a dense vector of integers. c (completed) is a dense bitmap and it represents which vertices have found parents and thus no longer need to search. The temporaries t and w are simply queues of updates represented as pairs of vertices of the form (child, parent). All processor column indices are modulo p_c (the number of processor columns). For distributed vectors, the syntax f_ij denotes the local n/p-sized piece of the frontier owned by the P(i,j)th processor. Likewise, the syntax V_{i,j} represents the vertices owned by the P(i,j)th processor. The syntax f_j denotes the hypothetical n/p_c-sized piece of the frontier collectively owned by all the processors along the jth processor column P(:,j). Each step of the algorithm has four major operations:
• Gather frontier (per step): Each processor is given the segment of the frontier corresponding to its incoming edges (lines 6 and 7).
• Local discovery (per sub-step): Search for parents with the information available locally (line 10 – line 16).
• Update parents (per sub-step): Send updates of children that found parents and process updates for its own segment of π (line 17 – line 21).
• Rotate along row (per sub-step): Send completed to the right neighbor and receive completed for the next sub-step from the left neighbor (line 22).
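The temporal partitioning can be illustrated with a serial simulation of one processor row. The sketch below is an assumption-laden toy model, not the MPI implementation: a Python loop plays the role of the p_c sub-steps, each "processor" only sees neighbors inside its own column block, and the completed flags conceptually rotate with the vertex segments.

```python
def bottom_up_step_row(adj, frontier, completed, parent, p_c, col_blocks, segs):
    """Serially simulate one bottom-up step for one processor row.

    adj[u]        - full neighbor list of vertex u (toy undirected graph)
    frontier      - set of frontier vertices (known row-wide after the gather)
    completed     - per-vertex flag: vertex already has a parent
    col_blocks[j] - neighbor-id range that "processor" j can see locally
    segs[s]       - the s-th segment of this row's vertices

    In sub-step s, processor j works on segment (j + s) % p_c, so each
    vertex's edges are examined by only one processor at a time; after the
    sub-step the responsibility (and completed flags) rotate right.
    """
    next_frontier = set()
    for s in range(p_c):                       # p_c sub-steps
        for j in range(p_c):                   # "processors" in this row
            lo, hi = col_blocks[j]
            for u in segs[(j + s) % p_c]:
                if completed[u]:
                    continue                   # skip: already found a parent
                for v in adj[u]:
                    if lo <= v < hi and v in frontier:
                        parent[u] = v          # queued parent update
                        completed[u] = True    # travels to the next processor
                        next_frontier.add(u)
                        break                  # early exit: one parent suffices
    return next_frontier

if __name__ == "__main__":
    # Toy graph: 8 vertices in a ring, 2 "processors" in the row.
    n, p_c = 8, 2
    adj = [[(u - 1) % n, (u + 1) % n] for u in range(n)]
    col_blocks = [(0, 4), (4, 8)]
    segs = [list(range(0, 4)), list(range(4, 8))]
    parent = [-1] * n
    completed = [False] * n
    parent[0], completed[0] = 0, True
    frontier = {0}
    while frontier:
        frontier = bottom_up_step_row(adj, frontier, completed, parent,
                                      p_c, col_blocks, segs)
    print(parent)
```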
V. ANALYTIC MODEL OF COMMUNICATION

In addition to the savings in computation, the bottom-up steps will also reduce the communication volume. We summarize the communication costs in Table I, where we assume that the bottom-up approach is used for only s_b steps of the search (out of d potential steps) on a graph with m edges and n vertices, distributed on a p_r × p_c processor grid.

TABLE I
ANALYTIC COMPARISON OF COMMUNICATION COSTS

Approach   | Words/Search                   | Latencies/Step
Top-Down   | 4m + n(p_r + 1)                | O(1)
Bottom-Up  | n(s_b(p_r + p_c + 1)/64 + 2)   | O(p_c)

We first present a simple model that counts the number of words sent and received during the entire search. To represent the compression provided by bitmaps, we divide the number of words by 64 since we use 64-bit identifiers. To further simplify the expressions, we assume (p_c − 1)/p_c ≈ 1 and ignore transfers that send only a word (communicating sizes). We calculate the data volume for the entire search, and assume that every vertex and every edge is part of the connected component.

The parallel 2D top-down approach transfers data for two operations: gathering the frontier (expand) and sending edges (fold). Every vertex is part of the frontier exactly once, so communicating the frontier sends n words for the transpose and n·p_r words for the allgather along the column. Every edge is examined once; however, sending it requires sending both endpoints (two words). Since the graph is undirected, each edge is examined from both sides, which results in sending 4m words. In total, the number of words a search with the top-down approach sends is approximately:

    w_t = 4m + n(p_r + 1)

Since the bottom-up approach is most useful when combined with the top-down approach, we assume the bottom-up approach is used for only s_b steps of the search, but it still processes the entire graph. There are three types of communication that make up the bottom-up approach: gathering the frontier, communicating completed vertices, and sending parent updates. Gathering the frontier is the same combination of a transpose and an allgather along a column as in the top-down approach, except a dense bitmap is used instead of a sparse vector. Since the bottom-up approach uses a dense data structure and it sends the bitmap every step it is run, it sends s_b·n(1 + p_r)/64 words to gather the frontier. To rotate the bitmaps for completed, it transfers the state of every vertex once per sub-step, and since there are p_c sub-steps, an entire search sends s_b·n·p_c/64 words. Each parent update consists of a pair of words (child, parent), so in total sending the parent updates requires 2n words. All combined, the number of words the bottom-up approach sends is approximately:

    w_b = n(s_b(p_r + p_c + 1)/64 + 2)

To see the reduction in data volume, we take the ratio of the number of words the top-down approach sends (w_t) to the number of words the bottom-up approach will send (w_b), as shown in Equation 2. We assume our 2D partitioning is square (p_r = p_c) since that will send the least amount of data for both approaches. Furthermore, we assume the degree of the target graph is k = m/n.

    w_t / w_b = (p_c + 4k + 1) / (s_b(2p_c + 1)/64 + 2)        (2)

For a typical value of s_b (3 or 4), by inspection the ratio will always be greater than 1, implying the bottom-up approach sends less data. Both approaches suffer when scaling up the number of processors, since it increases the communication volume. This is not unique to either approach presented, and it leads to sub-linear speedups for distributed BFS implementations. This ratio also demonstrates that the higher the degree is, the larger the gain for the bottom-up approach relative to the top-down approach. Substituting typical values (k = 16 and p_c = 128), the bottom-up approach would need to take s_b ≈ 47.6 steps before it sends as much data as the top-down approach. A typical s_b for the low-diameter graphs examined in this work is 3 or 4, so the bottom-up approach typically moves an order of magnitude less data. This is intuitive, since to first order, the amount of data the top-down approach sends is proportional to the number of edges, while for the bottom-up approach, it is proportional to the number of vertices.

The critical path of communication is also important to consider. The bottom-up approach sends less data, but it could potentially be bottlenecked by latency. Each step of the top-down algorithm has a constant number of communication rounds, but each step of the bottom-up approach has Θ(p_c) rounds, which could be significant depending on the network latencies.

The types of communication primitives used is another important factor, since primitives with more communicating parties may have higher synchronization penalties. The communication primitives used by top-down involve more participants, as it uses point-to-point (transpose to set up expand), allgather along columns (expand), and all-to-all along rows (fold). The bottom-up approach uses point-to-point for all communication except for the allgather along columns for gathering the frontier.
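The word counts above are straightforward to evaluate. The following Python sketch encodes w_t, w_b, and Equation 2 under the square-grid assumption (p_r = p_c); the function names are illustrative.

```python
def words_top_down(m, n, p_r):
    """w_t = 4m + n(p_r + 1): folded edges plus frontier transpose/allgather."""
    return 4 * m + n * (p_r + 1)

def words_bottom_up(n, s_b, p_r, p_c):
    """w_b = n(s_b(p_r + p_c + 1)/64 + 2): bitmapped frontier gathers,
    completed rotations, and (child, parent) updates."""
    return n * (s_b * (p_r + p_c + 1) / 64 + 2)

def ratio(k, p_c, s_b):
    """Equation 2, assuming a square grid (p_r = p_c) and degree k = m/n."""
    return (p_c + 4 * k + 1) / (s_b * (2 * p_c + 1) / 64 + 2)

if __name__ == "__main__":
    k, p_c = 16, 128
    for s_b in (3, 4, 47.6):
        print(f"s_b = {s_b}: w_t / w_b = {ratio(k, p_c, s_b):.2f}")
```

With the typical values from the text (k = 16, p_c = 128), the ratio is roughly an order of magnitude for s_b of 3 or 4 and only reaches 1 near s_b ≈ 47.6, matching the break-even point quoted above.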
VI. COMBINING BOTTOM-UP WITH TOP-DOWN

The bottom-up BFS has the potential to skip many edges to accelerate the search as a whole, but it will not always be more efficient than the top-down approach. Specifically, the bottom-up approach is typically only more efficient when the frontier is large, because a large frontier increases the probability of finding a valid parent. This leads to the direction-optimizing approach, a hybrid design with the top-down approach powering the search at the beginning and end, and the bottom-up approach processing the majority of the edges during only a few steps in the middle when the frontier is at or near its largest. We leverage the insight gained from prior work [5] [8] to choose when to switch between the two BFS techniques at a step (depth) granularity.

We use the number of edges in the frontier (m_f) to decide when to switch from top-down to bottom-up and the number of vertices in the frontier (n_f) to know when to switch from bottom-up back to top-down. Both the computation and the communication costs per step of the top-down approach are proportional to the number of edges in the frontier, hence the steps when the frontier is the largest consume the majority of the runtime. Conversely, the bottom-up approach is advantageous during these large steps, so using the number of edges in the frontier is appropriate to determine when the frontier is sufficiently large to switch to the bottom-up approach. Using this heuristic as well as the tuning results from prior work [5] [8], we switch from top-down to bottom-up when:

    m_f > m/10

This can be interpreted as: once the frontier encompasses at least one tenth of the edges, the bottom-up approach is likely to be advantageous. Even though the probability of finding a parent (and thus stopping early) may continue to be high as the size of the frontier ramps down in later steps, there is sufficient fixed overhead for a step of the bottom-up approach to make it worthwhile to switch back to the top-down approach. Using the results from prior work, where k is the degree, we switch back to top-down when:

    n_f < n/(14k)

The degree term in the denominator ensures that higher-degree graphs switch back later to top-down, since an abundance of edges will continue to help the bottom-up approach.

The switch to bottom-up uses the number of edges in the frontier while the switch back to top-down uses the number of vertices in the frontier, because the apex of the number of edges in the frontier is often a step or two before the apex of the number of vertices in the frontier. For scale-free graphs, the high-degree vertices tend to be reached in the early steps since their many edges make them close to much of the graph. In the steps that follow the apex of the number of edges in the frontier, the number of vertices in the frontier becomes its largest as it contains the high-degree vertices' many low-degree neighbors. Since edges are the critical performance predictor, the number of edges in the frontier is used to guide the important switch to bottom-up. Although the number of edges in the frontier could be used to detect when to switch back to top-down, it is unnecessary to compute since the number of vertices in the frontier will suffice. Although the control heuristic allows for arbitrary patterns of switches, for each search on all of the graphs studied, the frontier size has the same shape of continuously increasing and then continuously decreasing [5] [8] [9].

To compute the number of edges in the frontier, we sum the degrees of all the vertices in the frontier. An undirected edge with both endpoints in the frontier will be counted twice by this method, but this is appropriate since the top-down approach will check the edge from both sides too. When loading in the graph, we calculate the degree for each vertex and store that in a dense distributed vector. Thus, to calculate the number of edges in the frontier, we take the dot product of the degree vector with the frontier.

The transition between different BFS approaches is not only a change in control, it also requires some data structure conversion. The bottom-up approach makes use of two bitmaps (completed and frontier) that need to be generated and distributed. Generating completed can be simply accomplished by setting bits to one whenever the corresponding index has been visited (parent[i] ≠ −1). Converting the frontier is similar, as it involves setting a bit in the bitmap frontier for every index in the sparse frontier used by the top-down approach. For the switch back to top-down, the bitmap frontier is converted back to a sparse list. Another challenge of combining top-down and bottom-up search is that top-down requires fast access to outgoing edges while bottom-up requires fast access to incoming edges. We keep both A^T and A in memory to facilitate fast switching between search directions. The alternative of transposing the matrix during execution proved to be more expensive than the cost of BFS itself. A symmetrical data structure that allows fast access to both rows (outgoing edges) and columns (incoming edges) without increasing the storage costs would be beneficial and is considered in future work.
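The switching policy above amounts to a small per-step controller. The sketch below is an illustrative Python rendering of it: m_f is the dot product of the degree vector with the frontier, and the constants 10 and 14 are the thresholds given in the text; the function and variable names are assumptions for this example.

```python
def edges_in_frontier(degree, frontier):
    """m_f: dot product of the per-vertex degree vector with the frontier.
    Edges with both endpoints in the frontier are counted twice, mirroring
    the double examination a top-down step would perform."""
    return sum(degree[v] for v in frontier)

def choose_direction(current, m_f, n_f, m, n, k):
    """Switch top-down -> bottom-up when m_f > m/10,
    and bottom-up -> top-down when n_f < n/(14k), where k = m/n."""
    if current == "top-down" and m_f > m / 10:
        return "bottom-up"
    if current == "bottom-up" and n_f < n / (14 * k):
        return "top-down"
    return current

if __name__ == "__main__":
    degree = [2, 3, 1, 4, 2]              # toy degree vector
    frontier = {1, 3}
    m, n = sum(degree) // 2, len(degree)
    m_f = edges_in_frontier(degree, frontier)
    print(choose_direction("top-down", m_f, len(frontier), m, n, m / n))
```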
TABLE II
SYSTEM SPECIFICATIONS

                         Hopper              Jaguar
Operator                 NERSC               ORNL
Supercomputer Model      Cray XE6            Cray XK6
Interconnect             Cray Gemini         Cray Gemini
Processor Model          AMD Opteron 6172    AMD Opteron 6274
Processor Architecture   Magny-Cours         Interlagos
Processor Clock Rate     2.1 GHz             2.2 GHz
Sockets/node             2                   1
Cores/socket             12                  16
L1 Cache/socket          12 × 64 KB          16 × 16 KB
L2 Cache/socket          12 × 512 KB         8 × 2 MB
L3 Cache/socket          2 × 6 MB            2 × 8 MB
Memory/node              32 GB               32 GB

VII. EXPERIMENTAL SETTINGS

We run experiments on two major supercomputers: Hopper and Jaguar (Table II). We benchmark flat (1 thread per process) MPI versions of both the conventional top-down algorithm and the direction-optimizing algorithm for any given concurrency and setting. The additional benefit of using in-node multithreading has been demonstrated before [6], and its benefits are orthogonal.

We use synthetic graphs based on the R-MAT random graph model [10], as well as a large-scale real-world graph that represents the structure of the Twitter social network [11], which has 61.5 million vertices and 1.47 billion edges. The Twitter graph is anonymized to respect privacy. R-MAT is a recursive graph generator that creates networks with skewed degree distributions and a very low graph diameter. R-MAT graphs make for interesting test instances because traversal load-balancing is non-trivial due to the skewed degree distribution and the lack of good graph separators, and common vertex relabeling strategies are also expected to have a minimal effect on cache performance. We use undirected graphs for all our experiments.

We set the R-MAT parameters a, b, c, and d to 0.57, 0.19, 0.19, and 0.05, respectively, and set the degree to 16 unless otherwise stated. These parameters are identical to the ones used for generating synthetic instances in the Graph500 BFS benchmark [12]. Like Graph500, to compactly describe the size of a graph, we use the scale variable to indicate the graph has 2^scale vertices.

When reporting numbers, we use the performance rate TEPS, which stands for Traversed Edges Per Second. Since the bottom-up approach may skip many edges, we compute the TEPS performance measure consistently by dividing the number of input edges by the runtime. During preprocessing, we prune duplicate edges and vertices with no edges from the graph. For all of our timings, we do 16 to 64 BFS runs from randomly selected distinct starting vertices and report the harmonic mean.

Both of our implementations use the Combinatorial BLAS [13] infrastructure so that their input graph data structures, processor grid topology, etc. are the same. Our baseline comparisons are against a previously published top-down im-
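As a small illustration of the reporting methodology described above, the following sketch computes TEPS as input edges divided by runtime and aggregates multiple runs with a harmonic mean; the helper names and sample numbers are hypothetical.

```python
from statistics import harmonic_mean

def teps(num_input_edges, runtime_seconds):
    """TEPS = number of input edges / runtime, regardless of how many
    edges the direction-optimizing search actually examined."""
    return num_input_edges / runtime_seconds

def report_teps(num_input_edges, runtimes):
    """Aggregate several BFS runs (e.g., 16-64 random sources) into a
    single harmonic-mean TEPS figure."""
    return harmonic_mean([teps(num_input_edges, t) for t in runtimes])

if __name__ == "__main__":
    # Hypothetical example: a 2^30-edge graph searched a handful of times.
    print(f"{report_teps(2**30, [0.21, 0.25, 0.23, 0.22]):.3e} TEPS")
```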
