1 Distributed Graph Layout for Scalable Small-world Network Analysis George M. Slota, Member, IEEE, Sivasankaran Rajamanickam, Member, IEEE, and Kamesh Madduri (cid:70) Abstract—The in-memory graph layout or organization has a consid- and small-world irregular graphs in mind, this paper 7 erableimpactonthetimeandenergyefficiencyofdistributedmemory attempts to answer the following questions: 1 0 graphcomputations.Itaffectsmemorylocality,inter-taskloadbalance, 1) Will the layout of the graphs impact the performance communicationtime,andoverallmemoryutilization.Graphlayoutcould 2 ofirregular,data-analyticalgorithmsandframeworks refer to partitioning or replication of vertex and edge arrays, selective ? n replicationofdatastructuresthatholdmeta-data,andreorderingvertex a and edge identifiers. In this work, we present DGL, a fast, parallel, 2) Can such a layout be computed in a scalable and J andmemory-efficientdistributedgraphlayoutstrategythatisspecifically efficient fashion to be applicable in graph analytics 2 designed for small-world networks (low-diameter graphs with skewed ? vertexdegreedistributions).Labelpropagation-basedpartitioninganda 3) What kind of graph computations will be impacted ] scalableBFS-basedorderingarethemainstepsinthelayoutstrategy. C by the graph layouts and how ? We show that the DGL layout can significantly improve end-to-end As has been observed, the impact of partitioning and D performanceoffivechallenginggraphanalyticsworkloads:PageRank, orderingonirregulargraphcomputationscanbeconsid- . a parallel subgraph enumeration program, tuned implementations of s breadth-first search and single-source shortest paths, and RDF3X- erable [10], [21], [26]. However, using traditional layout c MPI, a distributed SPARQL query processing engine. 
Using these strategies based on graph/hypergraph partitioners and [ benchmarks, we additionally offer a comprehensive analysis on how orderingsfordatalayoutofhighlyirregularsmall-world 1 graph layout affects the performance of graph analytics with variable graphsmaynotbeappropriateforthefollowingreasons: v computationandcommunicationcharacteristics. 1) Traditional partitioners and even some ordering 3 0 methods, for example nested dissection, are heavy- 5 1 INTRODUCTION weighttoolsthatareexpensivebothintermsofmem- 0 ory usage and time. They are appropriate when fol- 0 Layouts of graphs and sparse matrices in distributed lowed by more expensive linear solvers or when the . memory and shared memory have been well-studied 1 partitioning results can be used for multiple solves. 0 for regular graphs that arise in the scientific computing In contrast, graph analytic workloads are constantly 7 domain. “Layout” in this instance refers to how ver- evolving and a typical analytic operation is typically 1 tices and edges are partitioned in distributed-memory cheaper than a linear solver. : v and how the vertex identifiers are ordered in shared- 2) Previousorderingalgorithmsaredesignedformetrics i memory. Recently, several new open-source distributed- X appropriate for linear solvers such as minimizing a memory graph processing frameworks have emerged bandwidth [22] or minimizing the fill-in in a LU r a into mainstream usage. These include GraphLab [35] factorization [2], [33]. In contrast, ordering methods anditsderivativesPowerGraph[27]andPowerLyra[17], that improve the layouts in a shared memory context Giraph [19], Trinity [47], and PEGASUS [31], among for small-world graphs are needed. others. The primary goal of these frameworks is to 3) The performance of distributed-memory graph algo- analyzereal-worldgraphssuchaswebcrawlsandsocial rithms can be dependent on both local and global networks, which tend to be low-diameter graphs with graph topology. 
Global topology affects the num- skewedvertexdegreedistributions.Mostoftheseframe- ber of parallel phases and synchronization overhead, works assume an initial topology-agnostic vertex and whilelocaltopologicalchangesimpactper-phaseload edge partitioning and ordering. With these frameworks balance. Optimizing for aggregate measures such as conductance oredge cutwould ignore localtopology • G.M.SlotaiswiththeComputerScienceDepartment,RensselaerPoly- changes and may not account for dynamic variations technicInstitute,317LallyBuilding,Troy,NY,12180,USA,USA.E-mail: [email protected] in per-phase execution. • S. Rajamanickam is with the Center for Computing Research at Sandia Graph computations on highly irregular graphs re- National Laboratories, P.O. Box 5800, MS 1320, Albuquerque, NM, quire a layout that depends on parallel partitioners and 87123,USA.E-mail:[email protected] • K.MadduriiswiththeDepartmentofComputerScienceandEngineering, ordering methods that are highly scalable for very large ThePennsylvaniaStateUniversity,343EISTBuilding,UniversityPark, graphs. Label propagation-based partitioners are shown PA,16802,USA.E-mail:[email protected] to be useful for partitioning small-world graphs [38]. 2 We utilize such partitioning algorithms (PULP [50]) to ensure consistency. We also analyze trade-offs between computethedistributedmemorylayout.Labelpropaga- partitioning quality on computational load balance and tion exploits the community structure inherent in many communication overhead for several large real-world real small-world graphs to quickly partition even multi- networks. The following is a summary of the key ob- billion edge networks. Label propagation also allows servations and findings from this workload analysis. 
for optimization of various objectives under multiple 1) Acomprehensivestudyoftheperformanceofthefive constraints, which enables us to explore the impact analyticswithseveralpartitioning-orderingcombina- of these objectives and constraints on total execution tions. and communication times for our various test analytics. 2) Our DGL ordering strategy is about 2× faster than In addition, we also introduce a breadth-first search- RCM, and our PULP partitioning strategy is about based ordering that is more scalable than other ordering 10× faster than METIS. schemes and suitable for small-world graphs in the 3) Weshowthat DGL layoutimprovessubgraphcount- shared-memorylayout.Incaseofdistributedgraphpro- ing performance by 1.28× in comparison to random cessing,weconsidervariouspartitioning-orderingpossi- partitioning. Partitioning with PULP would enable bilities, a simultaneous global partitioning and ordering end-to-end processing (partitioning & computation) of all vertices, and a local ordering of vertices after the of the counts of ten vertex subgraphs on the 2 bil- partitioning phase. lion edge Twitter graph to complete in under fifteen In short, we propose a “distributed-memory graph minutes on 16 nodes of Blue Waters. layout” based on vertex partitioning using label propa- 4) DGL layout improves the communication time of gation and a BFS-based parallel ordering strategy. The BFS and SSSP by 1.48× and 1.43× in comparison to proposed DGL (Distributed Graph Layout) is a fast, random partitioning. memory-efficient, and scalable graph layout strategy. 5) An informed topology-aware graph layout benefits We demonstrate the new DGL layout scheme is about external memory computations as well, improving 10-12× times faster to compute than METIS partition- the performance of RDF3X-MPI, our distributed- ing [33], and about 2.3× faster to compute than Reverse memory implementation of the popular RDF store Cuthill-McKee (RCM)-based orderings. RDF-3X [42]. 
We demonstrate the impact of DGL and present de- 6) The total computation time of PageRank can be tailedanalysisontheend-to-endperformanceofdistinct accelerated by about 5× with a locality-optimizing graph analytic workloads. The graph analysis routines ordering such as DGL. include subgraph counting, breadth-first search (BFS), 7) A cross-analytics comparison reveals new and inter- single-source shortest paths (SSSP), resource description esting trade-offs of communication time, load bal- framework (RDF) queries, and PageRank. The five algo- ance, and memory utilization for various graphs. rithms were chosen to be representative of the diversity We finally mention that DGL is not limited to the in modern graph analytics. We chose a recent algorithm MPIprocessingmodelsconsideredinthiswork,andcan for subgraph counting [49] which is a randomized par- therefore be utilized as a preprocessing step while run- allel algorithm to generate approximate counts of tree- ning under other graph engines and parallel execution structuredsubgraphs.Althoughrecentrelatedwork[15], environments. [16] primarily looks at strong scaling of BFS and related 2 DISTRIBUTED GRAPH LAYOUT computationsonmassivesyntheticGraph500networks, ourworkexaminesthesubgraphcountingalgorithm,an In this section, we discuss the distributed graph lay- analyticthatiscomputationallyverydifferentfromBFS. out using label propagation-based partitioning and BFS- However, we also do an in-depth evaluation of BFS and based ordering methods. We define a distributed graph SSSP performance. The fourth benchmark evaluates a layout as the pair of partitioning×ordering. The partition- distributed-memoryimplementationofthepopularRDF ing part of the layout affects the number of parallel store RDF-3X [42]. Our final included algorithm is a phases and synchronization overhead in a graph com- highly scalable implementation of PageRank [51], which putation. 
It is important to balance the computation in is a popular and more computationally-intensive imple- differentparallelphasesaswellasminimizethecommu- mentation than BFS for benchmarking performance of nication overhead. We explore trade-offs in work and frameworks and systems. memory balance and communication minimization be- We use the end-to-end graph analysis times for tween tasks with different partitioning strategies. Work partitioning-ordering-workload in both single-threaded performedandmemoryutilizationper-taskroughlycor- (MPI) and multi-threaded (MPI+OpenMP) distributed relates with the number of vertices and adjacent edges programming models. We also consider computation stored on each task. The communication requirements and communication times of the analytic separately, in roughly correlates with the number of inter-task edges, order to better isolate the effects of partitioning and or edge cut resulting from partitioning. The ordering ordering on performance. We primarily consider real- partofthelayoutaffectstheper-phasecomputationtime world rather than synthetic graphs in our study. We use ingraphcomputations.Weideallywanttoincreaseintra- tuned implementations, all developed by us, in order to nodememoryaccesslocalitytoreducecachemissesand 3 improve execution times. In order to be practical the (minimize total edge cut and maximum edge cut per partitioning×ordering pair must be computed in parallel, part)algorithmthatdemonstratesthesetwostages.After scalable fashion. initialization,wefirstutilizeweightedlabelpropagation in k alternating stages to balance the initial parts for 1 our vertex constraint and then refining to minimize the 2.1 Partitioning total edge cut. Next, we perform k alternating stages 2 We utilize three partitioners in this work. 
We use a ran- of balancing for our edge constraint while minimizing dom partitioning to establish a baseline for benchmark- the secondary objective of max per-part cut and then ing, which randomly assigns part assignments to each again refining to minimize the total edge cut. In prior vertex. We use the well-known METIS [33] partitioner work [50], we describe the algorithm in considerably as a representation of the state-of-the-art. greater detail and demonstrate the approach’s effective- We also utilize the PULP partitioner, which is specif- ness in terms of cut quality and runtime with respect ically optimized to partition the small-world graphs we to other traditional partitioners. However, it is critical are considering in this work. We consider both single to show that such label propagation-based partitionings constraint and multi-constraint partitioning scenarios, arenotonlyeasytocompute,butthattheyalsoimprove where we either balance partitions for vertices or for the end-to-end runtimes of graph analytic applications. both vertices and edges. We attempt to minimize to- With DGL, we are able to utilize such a partitioner in tal edge cut for both PULP and METIS. Additionally, the layout strategy and demonstrate its applicability for for PULP, we also attempt to balance communication the first time. among parts by minimizing the maximal number of cut edges coming out of any single part. 2.2 Ordering The PULP partitioner is based off of the commu- For a distributed graph computation, a good graph nity detection label propagation algorithm [45]. Label partitioning will reduce inter-node communication cost. propagation methods are attractive as they have low Thegoalofon-nodevertexorderingistoincreaselocality computational overhead, low memory utilization, are of intra-node memory references, and thereby reduce easy to parallelize, and demonstrate scaling to graphs intra-node computation time. This is done by relabeling with billions of vertices. 
vertex identifiers so that consecutive accesses of per- An overview of the basic label propagation algorithm vertexspecificinformationoccurwithgreaterspatialand isasfollows:Initially,eachvertexinagraphisinitialized temporal locality. As many graph computations access to having a unique label. Iteratively, each vertex then per-vertex data based on adjacencies, and per-vertex examines all of its neighbors’ labels then updates to its data is commonly stored in a flat array, minimizing label the label that appears most frequently among its the numeric difference between the vertex identifiers of neighbors, with ties broken randomly. The loop over adjacent vertex pairs can greatly improve access locality all vertices can be parallelized without any explicit syn- and therefore cache utilization. chronizations or locking with minimal effect on solution Reverse Cuthill-McKee (RCM) is a commonly-used quality [50]. We continue to loop over all vertices until vertex ordering strategy in sparse matrix and graph no labels are updated, or, more commonly, after some applications. We propose a BFS-based ordering (see Al- numberofiterationsoftheoutermostWhileloop(usually gorithm 2) which can be considered an approximation 10 or fewer iterations is sufficient). to RCM. It avoids the costly sorting step used in RCM where it tries to order the nodes with the same parent Algorithm 1 PULP Multi-ConstraintMulti-ObjectiveAl- in terms of the degree. Recently, a similar ordering was gorithm Overview proposed for improving the matrix-vector multiply time Initializepparts andbandwidthreduction[32].Theprimaryfocusofthat Executedegree-weightedlabelpropagation. approach was to arrive at parallel orderings to improve fork1 iterationsdo Balancepartsforvertexconstraint. thelinearsolvertime.Ourfocusistoimprovethegraph Refinepartstominimizeedgecut. computations’ end-to-end time. fork2 iterationsdo We demonstrate our approach in Algorithm 2. 
We Balancepartstosatisfyedgeconstraint andminimizemaxper-partcut. randomly choose a minimal-degree vertex as the root Refinepartstominimizeedgecut. and perform a standard BFS routine, tracking visitation status with visited and the current level with level. We PULP’s subroutines essentially use variants label add vertices to level sets L when they are visited, as propagation that limit the number of possible labels with RCM. We avoid explicit sorting by assuming that to the number of desired parts and impose addi- each L , where Max is the maximum BFS 0···Maxlevel level tional weighting criteria to create balanced partitions. level, is mostly sorted in the order of decreasing vertex This weighted form of label propagation is utilized in degree, as there is a higher likelihood of encountering two separate stages during execution of PULP. Algo- high-degree vertices sooner in any given level for most rithm 1 gives a very broad overview of the PULP multi- real world graphs. We assign new labels using an in- constraint (vertices and edge per part) multi-objective crementing value of n by starting with the vertices in 4 the highest level and working backwards to those in With the five partitioning methods (random, METIS the lowest level. As we will show in the next section, {single constraint, single objective and multiple con- this approach performs better than both random and straint, single objective} and PULP {multiple con- RCM orderings in applications that have a high number straint, single objective and multiple constraint, mul- of irregular memory accesses. As with RCM in [32], tiple objective}) and three ordering methods (random, Algorithm 2 can be straightforwardly parallelized. RCM, and DGL) we evaluate all the combinations of partitioning×ordering pairs and demonstrate that the Algorithm 2 DGL BFS-basedvertexorderingalgorithm. 
DGL layout with PULP partitioner and DGL-based or- Vid←DGL-order(G(V,E)) dering performs the best in irregular graph computa- forallv∈V do tions. V (v)←v id level←0 root←SelectRoot() Q←root 3 PARALLEL GRAPH COMPUTATIONS Visited(1···|V|)←false whileQ(cid:54)=∅do In this section, we will give an overview of the five dis- forallv∈Qdo tributed graph analytics used during our experimental Insertv intoLlevel forall(cid:104)v,u(cid:105)∈E do analysis of the impact of partitioning and ordering on ifVisited(u)=falsethen analytic performance. In an attempt to best understand Visited(u)←true InsertuintoQ the general effects of varying partitioning and ordering level←level+1 on the performance, the graph analytics were selected Max ←level n←l0evel astorepresentawiderangeofexecutioncharacteristics. fori=Maxlevel···1do The test suite includes an implementation which is rela- forVjid=(L1i·(·j·)|)L←i|dno tively computation-heavy, PageRank, algorithms which n←n+1 are relatively more communication-heavy, breadth-first search and single source shortest paths, an algorithm We compare the performance of our ordering method which is both very computation and communication and RCM to a random ordering, where vertices are intensive, color-coding subgraph counting, as well as an shuffled into a random order. Another common order- algorithm whose performance is dependent on the sizes ing strategy is using the “natural” ordering, which is ofthen-hopneighborhoodsofeachpartition,distributed using the original vertex assignments as given in the query processing of Resource Description Framework graph data file. We avoid comparison to the natural stores. ordering (and associated vertex block partitioning) for acoupleofreasons.Thequalityofthenaturallayoutfor 3.1 DistributedPageRank graphsretrievedfromadatabasecanbehighlyvariable. E.g. 
a natural layout for a web graph that’s based on Our distributed PageRank uses an MPI+OpenMP ap- crawling methodology might give considerably better proachandan |V| partitioning,witheachofpMPItasks p performance than the layout for a social graph that’s calculatingthecountsforanequivalentportionofthe|V| based off of user ID (we observed a spread of both verticesinthegraphG.WithoneMPItaskpernode,we 2× speedup and slowdown for a natural ordering vs. then use thread parallelism while updating the counts random ordering on our PageRank test; similarly, we of owned vertices. With the exception of the single MPI observed consistent variance in vertex block vs. random communication call on each iteration, all per-task work partitioning).Additionally,a“natural”layoutforcertain can be done in parallel. Updates are passed among graphs is not necessarily well-defined. For example, the neighbors using an MPI all-to-all exchange. In practice, user IDs for a social network might be considerably this specific implementation has been observed to be largerthanthenumberofverticesinthegraph.Tocreate very efficient and scalable, giving per-iteration costs of and efficiently store a graph in-memory, these values less than a few seconds for networks of over 100 billion need to be mapped to vertex identifiers 0···|V| − 1). edgeswhilerunningon256computenodes.Thespecific This mapping can be done by compressing the user IDs technical details of the implementation are omitted, but and retaining their original order, or through a more please see [51] for a more in-depth discussion. efficient first-come-first-served mapping where IDs are mapped to vertex identifiers as they are encountered in 3.2 SubgraphCounting the data file. The method used would be application- specific. 
Because the extensive number of tests we per- Subgraph counting is a computationally challenging formed limited the number of graphs we could use in task, with the na¨ıve approach scaling as O(nk), where n our experiments, we wished to eliminate any potential isthenumberofverticesinagraphandk thenumberof impacts due to the above graph creation methodologies vertices in the subgraph being counted. The best known for the sake of uniformity in our comparative results. exact algorithm [25] improves the exponent by a factor Hence why we view a random ordering to be the best of α, where α is the exponent for fast matrix multipli- 3 baseline for relative comparison. cation. Because of these extremely high execution time 5 bounds, recent work has focused on approximation al- all-to-all collective communication routine. At the end gorithms.Onesuchapproachforcountingtree-structured of each BFS iteration and ∆-stepping phase, each task subgraphs utilizes the color-coding technique of Alon et locally updates the distance of its own vertices using al. [1]. theexchangedinformation.TheupdateinBFSisonlyon In prior work, we developed a fast parallel im- unvisited vertices, while ∆-stepping updates all vertices plementations of color-coding subgraph counting in whose distances can be decreased. Thus, the ∆-stepping both shared-memory and distributed-memory environ- algorithm performs more computation and has a higher ments [49]. The distributed version of our approach communication complexity. uses several optimizations, including fully partitioning Since our goal is to analyze and evaluate the effect and compressing the memory-intensive dynamic pro- of graph partitioning and vertex reordering, we have grammingtabletodecreasememoryrequirementsacross not yet implemented all the optimizations in [15], [16]. 
all tasks, further compressing the table during com- However, our approach has three new optimizations: (i) munication to reduce the total transfer volume, and A semi-sort of vertex adjacencies based on weights is using all-to-all exchanges in lieu of broadcasts to reduce used prior to execution of the algorithm. (ii) Memory- communication times. These optimizations demonstrate optimized queues are used to represent the bucket data good scaling and enable us to count subgraphs of 10 structure. This decreases the algorithm memory require- and 11 vertices on billion-edge networks in minutes on ment,whileslightlyincreasingtherunningtime.(iii)An a modest number of 16 nodes. For space consideration, arrayofalllocaluniqueadjacenciesiscreatedandlocally we omit a detailed description of our implementation. usedtotracktentativedistanceofadjacencies.Thisarray Instead,pleasereferto[49]foranin-depthdiscussionof improvesefficiencybyfilteringoutunnecessaryrequests the stages and execution of the algorithm. to be added in the new frontiers. 3.3 SSSPandBFS 3.4 DistributedRDFStoresandSPARQLQueryPro- cessing We also assess the performance impact of layout on tuned implementations for parallel breadth-first search Resource Description Framework (RDF) [46] is a pop- (BFS) and single-source shortest paths (SSSP) computa- ular data format for storing web data sets. Informally, tion in this paper. Our parallel BFS approach can take the RDF format specifies typed relationships between advantage of both 1D and 2D graph distributions [10], entities, and the basic record in an RDF data set is a [11], [12]. We use a 1D distribution in this work, as triple. There are a growing number of publicly-available it is easier to correlate communication time with edge RDF data sets that contain billions of triples. Thus, cut after partitioning with a 1D distribution. 
Recent BFS database methodologies for storing these RDF data sets, and SSSP implementations use a 1D partitioning and also called triple stores, are becoming popular. We have direction-optimizing search [4] for work-efficient and developed a distributed MPI-based implementation of highly scalable execution on Graph 500 test instances. an open-source triple store called RDF-3X [42]. Our For an overview of the current state-of-the-art in per- distributed RDF store is called RDF3X-MPI [20]. formance optimizations for these routines, we refer the An alternate approach to viewing an RDF data set is reader to [15], [16]. as a directed graph with edge types. RDF data sets can We use an optimized parallel implementation [44] be queried using a language called SPARQL. We extend of the ∆-stepping algorithm [37] for parallel SSSP in thedistributedRDFstoremethodologyofRDF-3Xtothe this paper. Each BFS iteration and ∆-stepping phase is SPARQL querying phase as well. Thus our RDF3X-MPI comprised of three main steps: local discovery, all-to-all tool has two phases, a load phase and a query phase. exchange,andlocalupdate.Toaidadjacencyqueries,we In the load phase, the given triple data set is partitioned use a distributed compressed sparse row representation into several independent files, one per task, and each for a graph. The distance array is also partitioned and task then constructs an index for helping answering distributed along with the distributed vertices (for ∆- SPARQL queries. It is possible to parallelize some query stepping). In the local discovery step, both algorithms evaluationinapurelydata-parallelmanner(i.e.,withno expand their frontiers by listing all corresponding ad- communication between tasks), provided there is suffi- jacencies and their proposed distance based on vertices cient replication of triples among partitions. 
Formally, in a queue of recently-visited vertices (for BFS) or in a if the triple partitions satisfy an n-hop guarantee, then currentbucket(for∆-stepping).NotethatBFSvisitseach SPARQL queries in which all pairs of join variables are reachable vertex only once while ∆-stepping may visit at distance of less than n hops from each other can each reachable vertex multiple times before it is settled. be solved without any inter-task communication [30]. Once all vertices in the queue are processed or the So the role of graph partitioning in this application is current bucket is empty (with no more vertex reinser- to reorder vertices such that the number of triples that tions), all p tasks exchange vertices in these generated are replicated between tasks after applying an n-hop lists to make them local to the owner tasks. This step guarantee are minimized. If the number of triples that is the same for both BFS and ∆-stepping, and uses an are replicated is reduced, then the database indexes are 6 smaller, making them potentially faster to query. For systemwith64GBmainmemoryandAMD6276Interla- this application, we study the impact of partitioning gosprocessorsat2.3GHz.ThesystemusesaCrayGem- on the number of replicated triples. A smaller value of ini 3D torus interconnect. We built our programs with replication is desired, and further, smaller index sizes the GNU C++ compiler (version 4.8.2), using OpenMP should translate to faster query times. for multithreading and the -O3 optimization parameter during compilation. For the pre-processing phases of DGL (partitioning and reordering) and some scalability 4 EXPERIMENTAL SETUP runs, we utilized Compton, a testbed cluster. Compton has a dual socket setup with Intel Xeon E5-2670 (Sandy We evaluate performance of our new partitioning and Bridge) CPUs at 2.60 GHz and 64 GB main memory. 
ordering strategy DGL and the graph analytics work- Due to the large memory requirements of partitioning load on a collection of nine large-scale low diame- withMETIS,wealsohadtousethelargememorynodes ter graphs, listed in Table 1. LiveJournal, Orkut, and on Carver at NERSC for partitioning the larger networks Twitter (follower network) are crawls of online social (Twitter, uk-2005, Webbase, and sk-2005). Carver’s large networks obtained from the SNAP Database and the memory nodes have 1024 GB main memory and four Max Planck Institute for Software Systems [14], [52]. uk- IntelXeonX7550(”Nehalem-EX”)CPUsat2.00GHz.We 2005 and sk-2005 are crawls of the United Kingdom performedk-waypartitioningwithMETISusingversion (.uk) and Slovakian (.sk) domains performed in 2005 5.1.0. using UbiCrawler and downloaded from the Univer- sity of Florida Sparse Matrix Collection [6], [8], [23]. WebBase is similarly a crawl obtained in 2001 by the 5 RESULTS AND DISCUSSION Stanford WebBase crawler. We created the BSBM and 5.1 DGL PerformanceEvaluation LUBM graphs from RDF data sets generated using the We first evaluate our DGL label propagation-based par- Berlin SPARQL benchmark [5] and Lehigh University Benchmark [28] generators. DBpedia was created from titioning methodology, PULP, against METIS partition- ing by examining total running time for generating 16 RDF triples extracted from Wikipedia [40]. and 64 partitions. We consider two versions of both The Orkut graph is undirected and the remaining graphs are directed. For the web and social graphs, PULP and METIS. For PULP, we have an implemen- tation that has both maximal vertex and edge balance we preprocessed the graphs before executing PageRank, constraints and minimizes both total edge cut and max- BFS, SSSP, and subgraph counting. Specifically, we re- imal per-part edge cut. 
moved all degree-0 vertices, multi-edges, and extracted the largest (weakly) connected component. Further, edge directivity was ignored when partitioning the graphs using PULP and METIS and reordering with RCM and DGL. Table 1 lists the sizes of these nine graphs after preprocessing.

TABLE 1
Test graph characteristics after preprocessing. Graphs belong to three categories, OSN: Online social networks, WWW: Web crawl, RDF: graphs constructed from RDF data. # Vertices (n), # Edges (m), average (davg) and max (dmax) vertex degrees, and approximate diameter (D̃) are listed. B = ×10^9, M = ×10^6, K = ×10^3.

Network      Category  n      m      davg  dmax   D̃    Source
LiveJournal  OSN       4.8M   42M    18    39K    21   [34]
Orkut        OSN       3.1M   117M   76    33K    9    [57]
Twitter      OSN       44M    2.0B   37    750K   36   [14]
uk-2005      WWW       39M    781M   40    1.8M   21   [8]
WebBase      WWW       113M   844M   15    816K   376  [8]
sk-2005      WWW       44M    1.6B   73    15M    308  [8]
BSBM         RDF       16M    67M    8.6   3.6M   7    [5]
LUBM         RDF       33M    133M   8.1   11M    6    [28]
DBpedia      RDF       62M    190M   6.1   7.3M   7    [40]

The scalability studies for subgraph counting, BFS, SSSP, and RDF query processing were done primarily on Blue Waters, a large petascale supercomputer at the National Center for Supercomputing Applications (NCSA). Each XE compute node of Blue Waters is a dual-socket

We consider this our baseline implementation, and label it in figures as PULP-MM (PULP multi-objective multi-constraint). We also have a dual constraint version that only attempts to minimize the total edge cut, which we call PULP-M. Similarly for METIS, the dual constraint single objective version is termed METIS-M, while the single constraint (vertex balance) and single-objective version is termed simply METIS. METIS-M and PULP-M are solving the same problem. For our constraints, we fix the maximal vertex imbalance ratio at 1.10 and the edge imbalance ratio at 1.50. The results will show that the multi-constraint, multi-objective mode of PULP-MM can be important for irregular graph computations.

Table 2 shows the partitioning time of PULP-MM along with METIS-M running on Compton. Due to METIS's large memory requirements (close to 500GB for Twitter), only LiveJournal, Orkut, and the RDF graphs could be partitioned on Compton. The larger web graphs and Twitter were all partitioned on a large memory node of Carver. We also report the relative speedup of PULP to METIS. We omit time comparison to ParMETIS, as the only graphs it was able to successfully partition on any system were LiveJournal and Orkut. Further, ParMETIS's speedups relative to METIS for those two instances were minimal (less than 2× with 16-way parallelism). From Table 2 we observe considerable speedup for PULP, with a geometric mean speedup of 12.4× for 16 parts and 10.1× for 64 parts.

TABLE 2
PULP-MM and METIS-M partitioning time with 16-way and 64-way partitioning. PULP-MM uses multi-constraint multi-objective partitioning. METIS-M uses multi-constraint single-objective partitioning.

             16-way partitioning                  64-way partitioning
Network      METIS-M (s)  PULP-MM (s)  Speedup    METIS-M (s)  PULP-MM (s)  Speedup
LiveJournal  75           7.4          10×        74           7.3          10×
Orkut        156          10           16×        197          13           15×
Twitter      12348        530          23×        12484        565          22×
uk-2005      255          15           17×        353          80           4.4×
WebBase      539          39           14×        551          42           13×
sk-2005      465          39           12×        514          65           7.9×
BSBM         348          28           12×        395          32           12×
LUBM         707          88           8.0×       966          123          7.9×
DBpedia      898          133          6.8×       1001         133          7.5×

The partitioning quality in terms of both vertex and edge balance constraints and edge cut and maximal per-part edge cut objectives for the different partitioners is shown in Table 3 as geometric averages. We also note that aggregate measures don't fully capture the wide spread of results among different tests, so we include min and max improvements for edge cut and max per-part cut as well. For example, the improvement METIS has relative to random partitioning varies from 1.5× for 64-way partitioning of Twitter to 107× for 16-way partitioning of uk-2005.

TABLE 3
Average partitioning characteristics across all graphs. Geometric mean of vertex balance Vmax, edge balance Emax, improvement over random partitioning for edge cut ratio EC and max per-part edge cut ECmax, and the mean improvement (decrease) in the average total number of connected components for all parts (#CCs) are shown. The best values for each of the last three columns are in bold font.

                             EC (imp)            ECmax (imp)
Partitioner  Vmax   Emax     avg    min    max   avg    min    max
Random       1.15   1.70     1.00   1.00   1.00  1.00   1.00   1.00
METIS        1.10   3.88     7.71   1.5    107   2.39   0.25   63
METIS-M      1.10   1.50     4.40   1.02   41    2.16   0.77   22
PULP-M       1.10   1.50     5.50   1.17   64    2.10   0.54   23
PULP-MM      1.10   1.50     5.00   1.19   63    3.18   2.54   204

In terms of the total edge cut (EC), the single-constraint, single-objective METIS does the best, but it performs poorly in the maximum per-part edge cut (ECmax) and edge balance (Emax). PULP-MM also performs better than all the methods in the ECmax metric without sacrificing a lot in EC and still respecting the vertex balance and edge balance constraints. Also note the much larger Emax of single constraint METIS. As we will demonstrate, this can have a considerable impact on execution time for the applications in our benchmarks. We bold the best values for each column for edge cut and max per-part cut, and note that METIS performs consistently best overall in the edge cut metric and PULP-MM performs best overall in the max per-part metric by a wide margin.

We additionally compare our DGL vertex ordering strategy to RCM. Table 4 gives the average running times of both DGL and RCM in serial across all three partitioning strategies for reordering the vertices within each partition. DGL reordering results in a 2.3× average speedup compared to RCM for reordering both 16 and 64 parts. This reduction is due to the avoidance of explicit sorting required by RCM. There does not seem to be a large dependence of running times on the number of partitions, although with a greatly increased partition count for a fixed graph, it would be expected that running time decreases due to a lower diameter BFS search and overall increased cache utilization. Both these methods can be parallelized as DGL can use a parallel BFS and RCM can be implemented using the parallel version [32]. However, their timings are insignificant in the end-to-end performance of complex analytics such as our subgraph counting benchmark.

TABLE 4
DGL serial reordering time with 16-way and 64-way partitioning.

             16-way partitioning          64-way partitioning
Network      RCM (s)  DGL (s)  Speedup    RCM (s)  DGL (s)  Speedup
LiveJournal  2.3      1.0      2.3×       2.3      1.0      2.3×
Orkut        3.9      1.9      2.1×       3.9      1.9      2.1×
Twitter      50       24       2.1×       61       29       2.1×
uk-2005      16       8.4      1.9×       17       7.6      2.2×
WebBase      33       13       2.5×       35       17       2.1×
sk-2005      24       11       2.2×       23       11       2.1×
BSBM         5.1      2.3      2.2×       4.7      2.3      2.0×
LUBM         5.7      1.7      3.4×       5.7      1.7      3.4×
DBpedia      16       6.1      2.6×       17       6.9      2.5×

We demonstrate that the DGL and PULP strategies are also able to efficiently compute the layout for larger numbers of parts beyond 16 and 64. Table 5 gives the execution times of DGL ordering and PULP partitioning when computing the layout of the various test graphs with 256, 512, and 1024 parts. We observe flat scaling of DGL ordering with increasing part counts due to its intrinsic work efficiency and O(n + m) expected execution time. We observe an increase in PULP times for higher part counts. This is due to PULP having a per-iteration workload of O(np + m) [50], where p is the number of parts being computed. However, we still observe sub-linear scaling relative to p for almost all test cases. Overall, we observe no intrinsic scalability bottlenecks of these methods at this higher scale.

TABLE 5
DGL and PULP scaling to higher part counts. Execution times are in seconds.

             256-way         512-way         1024-way
Network      PULP-MM  DGL    PULP-MM  DGL    PULP-MM  DGL
LiveJournal  10       1.2    18       1.2    30       1.2
Orkut        23       1.6    37       1.6    65       1.5
Twitter      910      42     1340     42     1560     41
uk-2005      109      6.6    161      6.7    252      6.7
Webbase      53       11     119      12     190      12
sk-2005      165      12     285      12     587      12
BSBM         55       1.6    75       1.6    127      1.6
LUBM         164      1.5    194      1.5    355      1.5
DBpedia      219      8.0    325      8.1    502      7.9
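Two of the aggregate claims in this subsection can be recomputed directly from the tabulated timings: the geometric-mean speedup of PULP-MM over METIS-M (Table 2) and the growth of PULP-MM time as the part count p doubles (Table 5). The following is only an illustrative sketch. The helper names (`geometric_mean`, `mean_speedup`, `superlinear_doublings`) are hypothetical and not part of PULP or DGL, and the inputs are the rounded table entries, so the results should land close to, but need not exactly match, the figures quoted in the text.

```python
import math

# (METIS-M seconds, PULP-MM seconds) per graph, copied from Table 2.
TABLE2_16WAY = {
    "LiveJournal": (75, 7.4), "Orkut": (156, 10), "Twitter": (12348, 530),
    "uk-2005": (255, 15), "WebBase": (539, 39), "sk-2005": (465, 39),
    "BSBM": (348, 28), "LUBM": (707, 88), "DBpedia": (898, 133),
}
TABLE2_64WAY = {
    "LiveJournal": (74, 7.3), "Orkut": (197, 13), "Twitter": (12484, 565),
    "uk-2005": (353, 80), "WebBase": (551, 42), "sk-2005": (514, 65),
    "BSBM": (395, 32), "LUBM": (966, 123), "DBpedia": (1001, 133),
}

# PULP-MM seconds for 256, 512, and 1024 parts, copied from Table 5.
TABLE5_PULP = {
    "LiveJournal": (10, 18, 30), "Orkut": (23, 37, 65),
    "Twitter": (910, 1340, 1560), "uk-2005": (109, 161, 252),
    "Webbase": (53, 119, 190), "sk-2005": (165, 285, 587),
    "BSBM": (55, 75, 127), "LUBM": (164, 194, 355),
    "DBpedia": (219, 325, 502),
}


def geometric_mean(values):
    """Geometric mean, computed in log space."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def mean_speedup(times):
    """Geometric mean of baseline_time / new_time over all graphs."""
    return geometric_mean(base / new for base, new in times.values())


def superlinear_doublings(times_by_parts):
    """Return (graph, growth) pairs where doubling p more than doubled the time."""
    hits = []
    for graph, series in times_by_parts.items():
        for before, after in zip(series, series[1:]):
            growth = after / before
            if growth > 2.0:
                hits.append((graph, round(growth, 2)))
    return hits


g16 = mean_speedup(TABLE2_16WAY)
g64 = mean_speedup(TABLE2_64WAY)
slow_steps = superlinear_doublings(TABLE5_PULP)
print(f"geometric-mean speedup: {g16:.1f}x (16-way), {g64:.1f}x (64-way)")
print(f"doublings of p with more than 2x time growth: {slow_steps}")
```

Treating each doubling of p as one test case is an assumption made here for illustration; the text does not say how "almost all test cases" is counted. A geometric mean is used because the per-graph speedups are ratios that span more than an order of magnitude.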
We include one more table to demonstrate how our DGL ordering strategy might improve cache performance of executing codes. To improve the performance of linear solvers, a common ordering metric to optimize for is graph bandwidth, which is the maximum integer distance between vertex identifiers for vertices that share a single neighbor. RCM is an effective means of bandwidth reduction for regular matrices. However, for small-world graphs, the bandwidth is usually going to be large, on the order of dmax, where dmax is the maximal degree of any vertex in the graph. Comparing bandwidth measures between different orderings therefore won't show any global improvements in compaction for rows of much lesser degree vertices.

As such, we look at other metrics to give an indication of the possible cache efficiency in practice. Across the entire adjacency array, we measure how often edges listed in order also have identifiers within a single integer value of each other. This indicates that these edges would be neighboring nonzeros in the same row of an adjacency matrix. Co-located edges improve cache utilization of per-vertex information accesses, such as checking visitation status for BFS or PageRank value lookups. To quantify how many co-located vertex identifiers for the edges are in the adjacency list, we report two values. First, we report a ratio of how co-located all edges are, where a value of zero indicates that no edges are co-located and a value of one indicates that all edges are co-located. Second, we report a running "cost" as the sum of the distances, or gaps, between vertex identifiers in the adjacency list. We scale the distances by their log, as a distance of one or close to it indicates that vertices are closely co-located and would have minimal cache cost for their subsequent accesses, and the cost difference between large and very large distances is minimal, since it's likely a new cache line would need to be loaded in both instances. Finding an ordering that minimizes this sum is referred to in the literature as the Minimum Logarithmic Gap Arrangement problem, which is NP-hard [18]. In an attempt to give a graph-independent ratio, we further scale the log sum by a worst-case possible value of m log n. The true dependence of cache utilization on distance would be architecture-specific, but the approximation of this cost gives enough insight for comparative purposes when examining ordering quality.

Table 6 gives both the co-location ratio (Co-loc. Ratio) as well as the log sum of gap distances ratios (Gap Sum Ratio) for all ordering combinations across all graphs for 16 and 64 parts. We report the geometric mean values across all five partitioning strategies (Random, METIS, METIS-M, PULP-M, PULP-MM). For co-location ratio, higher indicates better locality, while for the log gap sum ratio, lower indicates better locality. We omit reporting the co-location ratio for Random ordering in Table 6, as all values are close to zero, a few orders of magnitude less than RCM and DGL. We observe that DGL ordering results in the best co-location ratio and lowest log gap sum ratio across almost all instances. The computational timing results we'll report next in our benchmarks will demonstrate that these measurements translate into real performance benefits across a wide range of graph analytics.

TABLE 6
Ordering performance for DGL, RCM, and Random in terms of co-location ratio (Co-loc. Ratio) and log sum of gap distances (Gap Sum Ratio) for 16-way and 64-way partitioning, averaged across the five different partitioning strategies.

             16-way partitioning                      64-way partitioning
             Co-loc. Ratio   Gap Sum Ratio            Co-loc. Ratio   Gap Sum Ratio
Network      DGL     RCM     DGL     RCM     Rand     DGL     RCM     DGL     RCM     Rand
LiveJournal  0.115   0.010   0.036   0.034   0.043    0.104   0.014   0.009   0.006   0.012
Orkut        0.028   0.001   0.046   0.057   0.054    0.021   0.001   0.010   0.020   0.011
Twitter      0.032   0.005   0.037   0.035   0.038    0.026   0.006   0.013   0.010   0.016
uk-2005      0.659   0.176   0.015   0.022   0.046    0.582   0.184   0.005   0.006   0.011
Webbase      0.562   0.162   0.020   0.039   0.050    0.519   0.172   0.005   0.006   0.011
sk-2005      0.613   0.149   0.018   0.026   0.050    0.689   0.167   0.002   0.004   0.013
BSBM         0.146   0.146   0.040   0.040   0.062    0.146   0.153   0.007   0.006   0.009
LUBM         0.105   0.105   0.026   0.026   0.045    0.094   0.105   0.006   0.006   0.009
DBpedia      0.442   0.257   0.028   0.036   0.046    0.398   0.267   0.006   0.008   0.010
Overall      0.298   0.112   0.030   0.034   0.048    0.274   0.119   0.007   0.008   0.011

5.2 PageRank Performance

For our first set of experimental benchmark results, we examine the effect of partitioning and ordering on a distributed PageRank implementation. We will first show the effect that different partitionings have on communication times, and then we will show the effect that orderings have on computation times. For these experiments we use the three social network graphs (LiveJournal, Orkut, and Twitter) as well as the three web crawls (uk-2005, WebBase, sk-2005). Figure 1 (top) gives the speedups relative to random partitioning for METIS single and multiple constraint partitionings and PULP multiple constraint with single and multiple objective partitionings. Figure 1 (bottom) gives the speedups relative to random ordering for DGL and RCM. Table 7 gives the explicit speedup values and overall geometric means across the six test graphs. These values are for 20 iterations of PageRank executing on 16 nodes of Blue Waters.

TABLE 7
Speedups of various partitioning and ordering strategies versus random partitioning and random ordering for the PageRank benchmark.

             Partitioning                            Ordering
Network      METIS   METIS-M  PULP-M  PULP-MM        RCM     DGL
LiveJournal  2.560   2.453    2.561   2.832          1.404   1.325
Orkut        2.519   1.689    1.784   2.068          1.214   1.205
Twitter      1.454   1.459    1.871   1.716          1.346   1.292
uk-2005      8.913   3.518    4.427   9.725          4.641   5.039
WebBase      13.99   10.87    11.92   11.93          3.776   4.870
sk-2005      6.170   8.293    7.287   6.797          1.100   1.155
Overall      3.621   3.525    3.395   4.465          1.881   1.970

We observe that all partitionings offer considerable speedups relative to random. In general, the web crawls show even greater speedups than the social networks. This is due to the web crawls being greater in diameter and more separable than social networks, resulting in a decrease in the number of cut edges and subsequently greater performance improvements relative to random. Averaged across all six test graphs, PULP multiple constraint and multiple objective partitioning offers the greatest speedup. The performance benefit is due to the implementation's use of an iterative bulk synchronous model and moderately low required communication, so the improved communication balance resulting from PULP-MM's decrease in max per-part cut becomes apparent in the timings.

Additionally, we note that both RCM and DGL offer considerable speedups for total computation times relative to random ordering. On the uk-2005 and WebBase graphs, the speedups for DGL are about 5×. Again we observe that the social networks generally show less performance benefit relative to random, and this is again due to their lower diameter and small-world characteristics, which make effective ordering more difficult. However, we still observe consistent 20%-40% speedups with the improved orderings. Overall, DGL gives a greater performance speedup over RCM by about 10%, a result we expected based on our measurement of potential locality and cache performance as demonstrated in Table 6.

[Figure 1: two rows of bar charts, one panel per graph (LiveJournal, Orkut, Twitter, uk-2005, WebBase, sk-2005). y-axis: Speedup vs. Random. x-axis: Partitioner (Random, PuLP-MM, PuLP-M, METIS-M, METIS) in the top row, Ordering (Random, DGL, RCM) in the bottom row.]
Fig. 1. Communication speedup of the PageRank implementation on 16 nodes with various partitioning options (top) and computation speedup of PageRank with various ordering strategies (bottom).

5.3 Subgraph Counting Performance

We next compare the impact of various partitioning and ordering strategies with regard to the running times of our subgraph counting implementation. We run on 16 nodes of Blue Waters. We compare communication times resulting from each of the 5 partitioners with a fixed random ordering. We also compare the computation times resulting from the 3 ordering strategies with fixed PULP-MM partitioning. The speedups for each strategy on the 6 test graphs are given in Figure 2 and Table 8. We also look at total end-to-end execution time for the five partitioning strategies with random ordering in terms of total time spent in the communication, computation, and partitioning steps. Note that the results with single constraint METIS for the uk-2005 and WebBase graphs are absent. This is due to execution times taking longer than 24 hours for these instances.

TABLE 8
Speedups of various partitioning and ordering strategies versus random partitioning and random ordering for the subgraph counting benchmark.

             Partitioning                            Ordering
Network      METIS   METIS-M  PULP-M  PULP-MM        RCM     DGL
LiveJournal  2.099   2.202    2.211   2.150          1.009   1.020
Orkut        2.307   2.400    2.411   2.350          1.014   1.015
Twitter      1.378   1.399    1.580   1.271          1.041   1.029
uk-2005      -       5.433    5.476   5.642          1.049   1.057
WebBase      -       3.412    3.375   3.311          1.125   1.148
sk-2005      5.568   5.675    5.772   5.621          1.072   1.091
Overall      -       3.033    3.106   2.961          1.051   1.059

Several trends can be observed in Figure 2. The top subfigure gives the speedup of the communication phase of subgraph counting for each of the partitioning strategies relative to random partitioning. We again note considerable speedup for all partitioners. We note that the PULP methods give the best improvement for five out of the six tested graphs. Overall, PULP-M gives the highest speedup. This implementation doesn't benefit as highly from the more communication-balanced PULP-MM partitioning due to the overall higher communication requirements (the Twitter graph requires compression and transfer of several terabytes of data in total for the Count table exchanges between tasks) and lower overall synchronization cost relative to PageRank, so total edge cut is observed to have a greater effect in practice. This emphasizes the fact that a one-size-fits-all solution is not optimal in practice, and implementation knowledge is required to extract the best performance for any given running application when utilizing a layout strategy.

The middle subfigure of Figure 2 plots the speedup relative to random ordering for the DGL and RCM reordering strategies with PULP-MM partitioning. Overall, we note about a 6% improvement for DGL and 5% improvement for RCM ordering relative to random. These improvements are much lower than PageRank's improvement due to considerably more information stored per-vertex in the stored counts table, so greater cache locality has less of an effect in preventing re-accesses to main memory; however, we note even a modest 5%-6% consistent improvement can be noteworthy in this instance. On processors with larger cache, this relative improvement would be expected to increase.

Finally, the bottom subfigure of Figure 2 shows the total end-to-end execution time in seconds for initial partitioning plus running of the subgraph counting application. We further split subgraph counting into the sum of time spent in each of its computation and communication phases. We observe that our partitioning and ordering strategies result in the fastest end-to-end running times for all test instances. The time spent for partitioning is considerable relative to execution time for METIS, as are the extra communication costs that result with random partitioning. The additional partitioning time cost for METIS might be amortized in practice by re-using the same partitions for subsequent analysis, but we note that PULP partitioning shows an immediate decrease in total end-to-end time after a single analytic run.

[Figure 2: three rows of bar charts, one panel per graph (LiveJournal, Orkut, Twitter, uk-2005, WebBase, sk-2005). Top: Speedup vs. Random by Partitioner. Middle: Speedup vs. Random by Ordering (Random, DGL, RCM). Bottom: End-to-end Execution Time by Partitioner, stacked by Step (Computation, Communication, Partitioning).]
Fig. 2. Speedups achieved with subgraph counting for total communication time of the various partitioning strategies relative to random partitioning, all with random ordering. Additionally, the speedups for the RCM and DGL orderings relative to random ordering with PULP multi objective partitioning. The bottom plot gives total end-to-end execution time in terms of the initial partitioning, total computation time, and total communication time, all in seconds.

5.4 Execution Timelines

To offer visual explanation of the performance of balanced constraint partitioning on total execution time, we