
Scalable Online Betweenness Centrality in Evolving Graphs

Nicolas Kourtellis          Gianmarco De Francisci Morales          Francesco Bonchi
Yahoo Labs, Barcelona, Spain
[email protected]          [email protected]          [email protected]

arXiv:1401.6981v2 [cs.DS] 28 Apr 2015

ABSTRACT

Betweenness centrality is a classic measure that quantifies the importance of a graph element (vertex or edge) according to the fraction of shortest paths passing through it. This measure is notoriously expensive to compute, and the best known algorithm runs in O(nm) time. The problems of efficiency and scalability are exacerbated in a dynamic setting, where the input is an evolving graph seen edge by edge, and the goal is to keep the betweenness centrality up to date. In this paper we propose the first truly scalable algorithm for online computation of betweenness centrality of both vertices and edges in an evolving graph where new edges are added and existing edges are removed. Our algorithm is carefully engineered with out-of-core techniques and tailored for modern parallel stream processing engines that run on clusters of shared-nothing commodity hardware. Hence, it is amenable to real-world deployment. We experiment on graphs that are two orders of magnitude larger than previous studies. Our method is able to keep the betweenness centrality measures up to date online, i.e., the time to update the measures is smaller than the inter-arrival time between two consecutive updates.

1. INTRODUCTION

Betweenness centrality measures the importance of an element of a graph, either a vertex or an edge, by the fraction of shortest paths that pass through it [3, 14, 15].

Intuitively, an edge that connects two vertices that have many common neighbors is to some extent redundant. It belongs to a dense area of the graph, and information would be able to propagate even without it. In other terms, not many shortest paths will need such an edge. This edge is what sociologists call a strong tie.

Conversely, a weak tie is an edge that connects two vertices with few common neighbors [16]. Such an edge is likely to bridge two distinct dense areas of the graph (also called communities), hence it participates in many shortest paths. Information has to go over this "bridge" to propagate from one community to another. The two vertices that own the bridge have a strategically favorable position because they can block information, or access it before the other individuals in their community: they span a "structural hole" [9].

Girvan and Newman [15] exploit this concept to define one of the first and most elegant algorithms for community detection. The algorithm iteratively removes the highest-betweenness edge and produces a hierarchical decomposition of the graph, where the remaining disconnected components are the communities discovered.

Betweenness centrality has been used to analyze social networks [20, 25], protein networks [19], wireless ad-hoc networks [27], mobile phone call networks [10], and multiplayer online gaming networks [2], and to inform the design of socially-aware P2P systems [22], just to name a few examples.

Measuring betweenness centrality requires computing the shortest paths between all pairs of vertices in a graph. This computation is possible in small graphs with a few tens of thousands of vertices and edges, but it quickly becomes prohibitively expensive as the graphs grow larger. Indeed, the best known algorithm for betweenness centrality, proposed by Brandes [6], runs in O(nm) time.

Due to its high cost, some works have proposed to parallelize its execution [4]. Others have proposed to approximate betweenness centrality through the use of randomized algorithms [8, 23, 31]. However, the accuracy of these randomized algorithms can decrease considerably with the increase in graph size [23]. Variants of betweenness centrality, such as flow betweenness [13] and random-walk betweenness [29], also run in O(nm) time. Finally, there are no cheaper measures that can be used as a proxy, as they do not correlate well with betweenness centrality [5] (differently from, e.g., degree centrality for PageRank). So there is no easy workaround to the complexity of computing betweenness centrality. State-of-the-art methods are too expensive for graphs with millions of vertices and edges, thus making this measure hard to use in practical application scenarios.

The picture gets even worse when considering that real graphs of interest, such as the Web, social networks, and information networks, are dynamic in nature and evolve continuously, with new edges and vertices arriving and old edges being removed. In such a scenario, the naïve approach of recomputing the measure from scratch is impractical even on moderately large graphs.

1.1 Related work

Recently there have been three main proposals, by Lee et al. [24], Green et al. [17], and Kas et al. [21], for the incremental computation of vertex betweenness centrality.

QUBE [24] relies on the decomposition of the graph into disjoint minimum union cycles (MUCs). The algorithm uses this decomposition to identify vertices whose centrality can potentially change. If the updates to the graph do not affect the decomposition, then the centrality needs to be recomputed (from scratch) only within the MUCs of the affected vertices. If the decomposition changes, a new one must be computed, and then new centralities must be computed for all the affected components. The performance of the algorithm is tightly connected to the size of the MUCs found, which in real (and especially social) graphs can be very large. The algorithm depends on a preprocessing step that computes a minimum cycle basis for the given graph. Then a MUC decomposition is computed by recursively taking the union of cycles in the minimum cycle basis that share a vertex. Therefore, it requires O(m) storage. QUBE leverages Brandes' [6] algorithm to compute the centrality scores inside a MUC. However, the algorithm also requires computing the (number of) shortest paths between each pair of vertices in the affected MUC, which could require O(n²) space in the worst case. The overall space complexity is thus O(n² + m).

Green et al. [17] propose to maintain the previously computed values of betweenness and the data structures needed by Brandes' algorithm [6], and to update the ones affected by a graph change. Their approach has a space complexity of O(n(n+m)), which becomes prohibitive for large graphs of millions of vertices.

Kas et al. [21] extend the work proposed by Ramalingam and Reps [30] to accommodate the computation of vertex betweenness centrality while adding edges or vertices. Differently from Brandes' algorithm, their technique does not use dependencies but the actual shortest distances. It keeps the data structures needed to update the betweenness centrality of vertices: distances, number of shortest paths, and predecessors lists. The computational complexity can be at most Brandes', i.e., O(nm). However, the space complexity is the same as Green et al. [17], i.e., O(n(n+m)).

Recently, and concurrently with our study, some works have also improved the space complexity of the incremental betweenness computation to O(n²) [26, 28]. Similarly to our approach, McLaughlin et al. [26] modify the technique by Green et al. [17] by dropping the predecessor lists, and port it to GPUs to run on larger graphs. However, it has only been tested on a small sample of source nodes (256) instead of the full graph. Thus, its scalability to large graphs with higher memory demands, as well as its capability to follow dynamic rates of graph updates, have not been demonstrated. Nasre et al. [28] present a tighter upper bound of O(nm*) on time complexity, where m* is the number of edges that lie on shortest paths. This is the first algorithm to have a lower time complexity than Brandes' [6]. However, while usually smaller, m* is still O(m) in the worst case (e.g., an unweighted clique). Most importantly, their algorithm is not straightforward to parallelize, as it requires access to the shortest-path DAG (SPdag) rooted in v from each SPdag rooted in s (for all s in V), where v is an endpoint of the updated edge. Therefore, it is not possible to parallelize the algorithm by distributing the set of SPdags on several machines, as in our approach. Moreover, neither of these approaches can handle removal of edges or provide updated edge betweenness scores.

Table 1: Comparison with previous studies: vertex (CV) and edge centrality (CE), edge addition (+) and removal (−), parallel and streaming computation (‖), size of the largest graph used in the experiments (|V| and |E|). Note that Nasre et al. [28] have smaller time complexity than Brandes' and the other algorithms.

  Method             Year   Space       CV   CE   +   −   ‖    |V|     |E|
  Lee et al. [24]    2012   O(n²+m)     ✓    ✗    ✓   ✓   ✗    12k     65k
  Green et al. [17]  2012   O(n²+nm)    ✓    ✗    ✓   ✗   ✗    23k     94k
  Kas et al. [21]    2013   O(n²+nm)    ✓    ✗    ✓   ✗   ✗    8k      19k
  Nasre et al. [28]  2014   O(n²)       ✓    ✗    ✓   ✗   ✗    -       -
  This work          2014   O(n²)       ✓    ✓    ✓   ✓   ✓    2.2M    5.7M

1.2 Contributions

The main contribution of this paper is to provide the first truly scalable and practical framework for computing vertex and edge betweenness centrality of large evolving graphs, incrementally and online. Our proposal represents an advancement over the state of the art in four main aspects, as summarized in Table 1.

First, our method maintains both vertex and edge betweenness centrality up to date for the same computational cost, while the previously proposed methods are tailored only for vertex betweenness.

Second, we handle both additions and removals of edges in a unified approach, while the previously proposed methods, besides QUBE, can handle only addition of edges. In fact, we show that incremental, up-to-date edge betweenness under continuous edge removals allows for faster execution of the Girvan-Newman [15] algorithm on larger graphs.

Third, our method has reduced space overhead. Similarly to previous work [17], our algorithm maintains the previously computed values of betweenness and other needed data structures, and updates the ones affected by graph changes. However, our method avoids maintaining the predecessors lists, thus reducing the space complexity to O(n²) [18]. This optimization requires a scan of all neighbors instead of predecessors: we show that this does not affect the time complexity, and makes the algorithm more scalable and faster in practice.

Fourth, our framework is truly scalable and amenable to real-world deployment. The framework is carefully engineered to use out-of-core techniques to store its data structures on disk in a compact binary format. Data structures are read sequentially by employing columnar storage, and memory structures are mapped directly on disk to minimise memory copies.

Finally, we show how our method can be parallelized and deployed on top of modern parallel data processing engines that run on clusters of commodity hardware, such as Storm¹, S4², Samza³, or Hadoop. Our experiments test our method on graphs with millions of vertices and edges, i.e., two orders of magnitude larger than previous studies. By experimenting with real-world evolving graphs, we also show that our algorithm is able to keep the betweenness centrality measure up to date online, i.e., the time to update the measure is always smaller than the inter-arrival time between two consecutive updates.

An open-source implementation of our method is available on GitHub.⁴ Inclusion in SAMOA,⁵ a platform for mining big data streams [11, 12], is also planned.

Roadmap. Section 2 introduces the basic concepts and recalls Brandes' algorithm. We overview our algorithm for online betweenness centrality in Section 3 and cover the details on insertions and removals in Section 4. Section 5 presents various optimizations that make our method scalable to large graphs. Section 6 reports our experimental results, while Section 7 concludes the paper.

¹ http://storm.apache.org
² http://incubator.apache.org/s4
³ http://samza.apache.org

2. PRELIMINARIES

Let G = (V,E) be a (directed or undirected) graph, with |V| = n and |E| = m.

Figure 1: The proposed algorithmic framework.
  Input: Graph G(V,E) and edge update stream ES
  Output: VBC′[V′] and EBC′[E′] for updated G′(V′,E′)
  Step 1: Execute Brandes' alg. on G to create & store data structures for incremental betweenness.
  Step 2: For each update e ∈ ES, execute Algorithm 1.
    Step 2.1: Update vertex and edge betweenness.
    Step 2.2: Update data structures in memory or disk for next edge addition or removal.
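The two-step framework of Figure 1 can be rendered as a short driver loop. The sketch below is our own illustrative stand-in, not the authors' code: `snapshot_vbc` computes Definition 2.1 by brute-force path enumeration (feasible only on toy graphs), and `run_stream` recomputes it from scratch after every update, which is exactly the naive baseline the paper's incremental Algorithm 1 replaces.

```python
from itertools import permutations

def all_shortest_paths(adj, s, t):
    # Enumerate all simple s-t paths by DFS and keep the shortest ones.
    # Exponential cost; only suitable for the tiny toy graphs used here.
    paths, best = [], None
    stack = [(s, [s])]
    while stack:
        v, path = stack.pop()
        if v == t:
            if best is None or len(path) < best:
                best, paths = len(path), [path]
            elif len(path) == best:
                paths.append(path)
            continue
        for w in adj[v]:
            if w not in path:
                stack.append((w, path + [w]))
    return paths

def snapshot_vbc(adj):
    # Vertex betweenness per Definition 2.1, summed over ordered pairs.
    vbc = {v: 0.0 for v in adj}
    for s, t in permutations(adj, 2):
        paths = all_shortest_paths(adj, s, t)
        if not paths:
            continue  # t unreachable from s
        for v in adj:
            if v in (s, t):
                continue
            through = sum(1 for p in paths if v in p)
            vbc[v] += through / len(paths)
    return vbc

def run_stream(adj, updates):
    # Step 1: initial computation. Step 2: refresh per update.
    # Stand-in: recompute from scratch; the paper's Algorithm 1
    # instead updates the stored BD[s] structures incrementally.
    vbc = snapshot_vbc(adj)
    for op, (u, v) in updates:
        if op == '+':
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        else:
            adj[u].discard(v)
            adj[v].discard(u)
        vbc = snapshot_vbc(adj)
    return vbc
```

On the path a-b-c, vertex b has betweenness 2 (pairs (a,c) and (c,a)); after streaming in the edge (a,c) the graph is a triangle and every betweenness drops to 0, illustrating why a single edge update can change scores globally.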
Let P_s(t) denote the set of predecessors of a vertex t on shortest paths from s to t in G. Let σ(s,t) denote the total number of shortest paths from s to t in G and, for any v ∈ V, let σ(s,t|v) denote the number of shortest paths from s to t in G that go through v. Note that σ(s,s) = 1, and σ(s,t|v) = 0 if v ∈ {s,t} or if v does not lie on any shortest path from s to t. Similarly, for any edge e ∈ E, let σ(s,t|e) denote the number of shortest paths from s to t in G that go through e. The betweenness centrality of a vertex v is the sum over all pairs of vertices of the fractional count of shortest paths going through v.

Definition 2.1 (Vertex Betweenness Centrality). For every vertex v ∈ V of a graph G(V,E), its betweenness centrality VBC(v) is defined as follows:

    VBC(v) = Σ_{s,t∈V, s≠t} σ(s,t|v) / σ(s,t).    (1)

Definition 2.2 (Edge Betweenness Centrality). For every edge e ∈ E of a graph G(V,E), its betweenness centrality EBC(e) is defined as follows:

    EBC(e) = Σ_{s,t∈V, s≠t} σ(s,t|e) / σ(s,t).    (2)

Brandes' algorithm [6] leverages the notion of dependency score of a source vertex s on another vertex v, defined as δ_s(v) = Σ_{t≠s,v} σ(s,t|v)/σ(s,t). The betweenness centrality VBC(v) of any vertex v can be expressed in terms of dependency scores as VBC(v) = Σ_{s≠v} δ_s(v). The following recursive relation on δ_s(v) is the key to Brandes' algorithm:

    δ_s(v) = Σ_{w : v ∈ P_s(w)} (σ(s,v) / σ(s,w)) (1 + δ_s(w))    (3)

The algorithm takes as input a graph G = (V,E) and outputs the betweenness centrality VBC(v) of every v ∈ V. It runs in two phases. During the first phase, it performs a search on the whole graph to discover shortest paths, starting from every source vertex s. When the search ends, it performs a dependency accumulation step by backtracking along the shortest paths discovered. During these two phases, the algorithm maintains four data structures for each vertex found on the way: a predecessors list P_s[v], the distance d_s[v] from the source, the number of shortest paths from the source σ_s[v], and the dependency δ_s[v] accumulated when backtracking at the end of the search.

On unweighted graphs, Brandes' algorithm uses a breadth-first search (BFS) to discover shortest paths, and its running time is O(nm). The space complexity of the algorithm is O(m+n). While this algorithm was initially defined only for vertex betweenness, it can be easily modified to produce edge betweenness centrality at the same time [7].

3. FRAMEWORK OVERVIEW

Our framework computes betweenness centrality in evolving unweighted graphs. We assume new edges are added to the graph or existing edges are removed from the graph, and these changes are seen as a stream of updates, i.e., one by one. Henceforth, for the sake of clarity, we assume an undirected graph. However, our framework can also work on directed graphs by following outlinks in the search phase and inlinks in the backtracking phase rather than generic neighbours.

The framework is composed of two basic steps shown in Figure 1. It accepts as input a graph G(V,E) and a stream of edges E_S to be added/removed, and outputs, for an updated graph G′(V′,E′), the new betweenness centrality of vertices (VBC′) and edges (EBC′) for each vertex v ∈ V′ and edge e ∈ E′.

The framework uses Brandes' algorithm as a building block in step 1: this is executed only once, offline, before any update. We modify the algorithm to (i) keep track of betweenness for vertices and edges at the same time, (ii) use additional data structures to allow for incremental computation, and (iii) remove the predecessors list to reduce the memory footprint and make out-of-core computation efficient.

Edge betweenness. By leveraging ideas from Brandes [7], we modify the algorithm to produce edge betweenness centrality scores. To compute simultaneously both edge and vertex betweenness, the algorithm stores the intermediate dependency values (Eq. 3) independently for each vertex.

Additional data structures. To allow for incremental computation we need to maintain some additional data. In particular, we need a compact representation of the directed acyclic graph of shortest paths rooted in the source vertex (which we refer to as SPdag), and the accumulated dependency values. Thus, for each source vertex s we maintain an additional data structure BD[s] that stores its betweenness data. BD[s] stores three pieces of information for each other vertex t:

• BD[s].d[t]: the distance of vertex t from source s;
• BD[s].σ[t]: the number of shortest paths starting from source s and ending at the given vertex t;
• BD[s].δ[t]: the dependency accumulated on the vertex t in the backtracking to source s.

The data structure is initialized in step 1, and populated at the end of the dependency accumulation phase. Then, it is used in step 2.1 and updated in step 2.2.

⁴ http://github.com/nicolas-kourtellis/StreamingBetweenness
⁵ http://samoa.incubator.apache.org

Memory optimisation. Brandes' algorithm builds a list of predecessors during the search phase to speed up the backtracking phase. Differently from the other data structures, the size of this list is variable and can grow considerably. For example, by assuming just 4 predecessors on average, the size of the list would be as large as BD[·] (assuming integer identifiers).

To reduce the space complexity of the algorithm, we remove the predecessors lists. When backtracking in the dependency accumulation phase, the algorithm checks all the neighbors of the current vertex, and uses the level of the vertex in the SPdag (i.e., the distance from the source) to pick the next vertices to visit.

For each source, we need to maintain an SPdag to all other vertices in the graph (O(n)), along with the edges to their predecessors (O(m)). Therefore, each SPdag takes O(n+m) space. In total, the space complexity of the original algorithm is O(n(n+m)) with the predecessors lists. By removing the predecessors lists, we reduce the space complexity to O(n²). Furthermore, the time complexity remains unchanged, as shown next.

To understand why removing the predecessors lists does not increase the time complexity, consider that the order in which vertices are traversed is unchanged. Assume there are k edges in the predecessors lists overall. Brandes' algorithm checks each edge once during the search phase to populate the predecessors lists. In the backtracking phase only the edges in the predecessors lists are checked. Thus, overall the complexity of the algorithm is O(m+k). In the worst case, k is of the same order as m; hence, the complexity of traversing all the predecessors lists is bounded by O(m). Scanning the neighbors of each vertex is also bounded by O(m). Thus, the algorithm's worst-case time complexity is unchanged.

An additional benefit of removing the predecessors lists is to also avoid the overhead of building them during the traversal. In practice, this optimization not only reduces the space complexity, but also decreases the average running time of the algorithm, as shown in Section 6 and in previous work [18, 1]. Moreover, by removing the predecessors lists, we do not need to maintain any variable-length data structure in our algorithm. This simplification allows us to use very efficient out-of-core techniques to manage BD[·] when its size outgrows the available main memory (see Section 5.1).

Algorithm 1: Incremental computation of vertex and edge betweenness when adding or removing an edge.
  Input: G(V,E), (u1,u2), VBC[V], EBC[E], BD[V]
  Output: VBC′[V′], EBC′[E′], BD′[V′]
  Initialization: EBC′[(u1,u2)] = 0.0
  Addition: E′ ← E ∪ (u1,u2); Deletion: E′ ← E \ (u1,u2)
   1  for s ∈ V′ do
   2    uL ← findLowest(u1, u2, BD[s])
   3    uH ← findHighest(u1, u2, BD[s])
   4    dd ← BD[s].d[uL] − BD[s].d[uH]
   5    if dd == 0 then
   6      continue  // same level addition/deletion
   7    if dd ≥ 1 then
   8      for r ∈ V do
   9        σ′[r] = BD[s].σ[r]; d′[r] = BD[s].d[r]; δ′[r] = 0
  10        t[r] ← NT  // not touched before
  11      LQ[|V|] ← empty queues; QBFS ← empty queue
  12      QBFS ← uL
  13      if addition then
  14        if dd == 1 then
  15          execute Alg. 2  // 0 level rise
  16        if dd > 1 then
  17          execute Alg. 4  // 1 or more level rise
  18      else if deletion then
  19        if uL has predecessors then
  20          execute Alg. 2  // 0 level drop
  21        else execute Alg. 6  // 1 or more level drop
      Data Structures Update:
  22    for r ∈ V′ do
  23      BD[s].σ[r] = σ′[r]; BD[s].d[r] = d′[r]
  24      if t[r] ≠ NT then BD[s].δ[r] = δ′[r]

3.1 Addition and removal of edges

The addition of a new edge may cause structural changes in a graph, i.e., changes in the distance between vertices. Depending on the previous distance between the two newly-connected endpoints, these changes may bring some vertices closer to the current source s. Similarly, when an existing edge is removed, structural changes may move the furthest endpoint away from the source.

Edge addition. In step 2 of the framework, Algorithm 1 is executed to update the betweenness centrality of the graph when a new edge (u1,u2) is added. The algorithm has access to the data structure BD[·] computed in step 1, and runs independently for each source. For a given source s, the algorithm uses BD[s] to determine which endpoint of the new edge is closest to the current source s (denoted uH) and which one is furthest (denoted uL). Let dd(uL,uH) denote the difference in distance from the source of the two endpoints before the addition, i.e., dd(uL,uH) = d(s,uL) − d(s,uH), where d represents shortest-path distance. Since only one edge is added at a time, we simply denote this difference as dd.

Depending on how large dd is, a different type of update is needed. In particular, three cases can arise:

• dd = 0 (Proposition 3.1);
• dd = 1 (0 level rise, Section 4.1);
• dd > 1 (1 or more levels rise, Section 4.2).

The first case involves two vertices that are at the same distance from the source vertex.

Proposition 3.1. Given two vertices u1 and u2 such that they have the same distance from a source vertex s, and an edge e = (u1,u2) that connects the two vertices, no shortest path from s to any other node in the graph passes through the edge e, i.e., d(s,u1) = d(s,u2) ⟹ ∀t ∈ V, σ(s,t|e) = 0.

Proof. We prove the proposition by contradiction. Assume there exists a shortest path from vertex s to p that goes through the edge (u1,u2), i.e., path s,...,u1,u2,...,p. However, since u1 and u2 are at the same distance from the source s, we can construct another path that is one hop shorter, that starts from s and ends in p but skips u1, i.e., s,...,u2,...,p, which contradicts the assumption that s,...,u1,u2,...,p is a shortest path.

Since no shortest path goes through the edge, no change occurs in the SPdag, so the current source can be ignored.

In the second case, the new edge connects two vertices whose distance from the source differs only by one (Fig. 2a). Thus, this addition does not cause any structural change in the SPdag, and all the distances remain the same. However, new shortest paths can be created due to the addition, and therefore the shortest paths and the dependencies of the graph must be updated.

In the third and most complex case, dd > 1, structural changes occur in the SPdag (Fig. 2b depicts this case after the rise of uL). In order to handle these changes properly, we introduce the concept of pivot.

Definition 3.2 (Pivot). Let s be the current source, and let d() and d′() be the distance before and after an update, respectively. We define as pivot a vertex p_V such that d(s,p_V) = d′(s,p_V) and there exists w ∈ neighbors(p_V) with d(s,w) ≠ d′(s,w).

Thus, a pivot is a vertex that, under an edge addition or removal, does not change its distance from the source s, but has neighbors that do so.

Figure 2: The red (light) edge is added/removed and either does not cause structural changes (a), or does so (b).

When dd > 1, we need to first compute the new distances by leveraging the pivots. Given that their distance has not changed, we can use them as starting points to correct the distances in the SPdag. In the case of addition, all the pivots are situated in the sub-dag rooted in uL, so we can combine the discovery of the pivots with the correction of the shortest paths. The different cases that can arise are discussed in detail in Section 4.2.

In the case of removal, by contrast, some pivots cannot be discovered while adjusting the shortest paths (e.g., if nodes uL and r in Fig. 2b were connected). Therefore, we need to first search for the pivots, and then start a second BFS from those pivots to correct the shortest paths. The details of this case are covered in Section 4.3.

There is also the case where the edge removed disconnects the sub-dag rooted in uL from the rest of the graph (or, similarly, turns uL into a singleton). In this case, the shortest paths coming from the source, as well as the dependencies going to the source from this component, must be removed and the betweenness adjusted. If uL is to be removed, all its edges are iteratively removed and the singleton is left with zero VBC′ (Section 4.5).

There exists also a fourth case: the new edge connects two previously disconnected components. This case degenerates into the case dd = 1. Indeed, no previous shortest path existed between the two disconnected components, so there is no structural change in the SPdag.

Finally, new vertices arriving in the graph are handled simply by adding them to the source set V′ with a zero VBC′. Then, for all sources, the new vertex is considered as uL with d[uL] = d[uH] + 1, where uH is the other endpoint of the incoming edge (therefore dd = 1).
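To make step 1 concrete, below is a minimal sketch (our own code, not the authors' implementation) of Brandes' algorithm on an unweighted, undirected graph with the three modifications described above: it accumulates vertex and edge betweenness together, it records BD[s] = (d, σ, δ) for each source, and it backtracks by scanning neighbors one level closer to the source instead of keeping predecessor lists. Betweenness is summed over ordered pairs, as in Definitions 2.1 and 2.2.

```python
from collections import deque, defaultdict

def brandes_bd(G):
    """G: dict mapping each vertex to its set of neighbors (undirected).
    Returns (VBC, EBC, BD), where BD[s] = (d, sigma, delta) per source."""
    VBC = {v: 0.0 for v in G}
    EBC = defaultdict(float)          # keyed by frozenset({v, w})
    BD = {}
    for s in G:
        # Phase 1: BFS to compute distances d and path counts sigma.
        d = {v: -1 for v in G}
        sigma = {v: 0 for v in G}
        d[s], sigma[s] = 0, 1
        order, Q = [], deque([s])
        while Q:
            v = Q.popleft()
            order.append(v)
            for w in G[v]:
                if d[w] < 0:
                    d[w] = d[v] + 1
                    Q.append(w)
                if d[w] == d[v] + 1:
                    sigma[w] += sigma[v]
        # Phase 2: dependency accumulation, visiting vertices in
        # reverse BFS order and scanning neighbors one level closer
        # (no predecessor lists are stored).
        delta = {v: 0.0 for v in G}
        for w in reversed(order):
            for v in G[w]:
                if d[v] == d[w] - 1:          # v precedes w in the SPdag
                    c = sigma[v] / sigma[w] * (1.0 + delta[w])
                    delta[v] += c
                    EBC[frozenset((v, w))] += c
            if w != s:
                VBC[w] += delta[w]
        BD[s] = (d, sigma, delta)             # stored for step 2
    return VBC, EBC, BD
```

On the path a-b-c this yields VBC(b) = 2 and EBC(a,b) = 4, matching a hand count over ordered pairs: the edge (a,b) lies on the shortest paths of (a,b), (b,a), (a,c), and (c,a).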
INCREMENTALADDITION&REMOVAL Edge removal. In the case of an edge (u1, u2) removed In this section we discuss the details of our framework in from the graph, dd is at most one, as the two endpoints the case of edge addition and removal. are connected before the removal. In this case, one of the 4.1 Nolevelchange two endpoints, u , is closest to the source, and clearly the H edge (uH,uL) belongs to at least one shortest path from Algorithm 2 handles the case when the edge added or re- thesourcestouL. Therefore,thealgorithmneedstocheck moveddoesnotcausestructuralchangesintheSPdag,i.e., whether uL has other shortest paths from s, not passing d(s,uL)isunchanged(Fig.2a). ThealgorithminitializesuL trough (uH,uL). Again, there are three cases: by adding or removing all shortest paths from its predeces- • dd=0 (Proposition 3.1); soruH,dependingonwhethertheedge(uH,uL)isaddedor removed. It also maintains a state flag t[v] for each vertex • dd = 1 and u has other predecessors (0 level drop, L v whichrepresentsthedirectioninwhichthealgorithmhas Section 4.1); encounteredthevertexv: D ifdescending(searching),U N P • dd = 1 and uL has no other predecessor (1 or more if ascending (backtracking), and NT if untouched. levels drop, Section 4.3). The vertices whose shortest paths from a source s have Inthefirstcasetherearenoshortestpathspassingthrough potentially been altered are situated in the sub-dag rooted the edge. Therefore no changes occur in the SPdag, so the in u . Thus, Algorithm 2 performs a BFS traversal of the L current source s can be skipped. SPdag starting from u by visiting neighbors whose dis- L Inthesecondcase,ifu isconnectedtoatleastonevertex tance from s is higher than the current vertex (line 7, or- L u(cid:48) such that dd(u ,u(cid:48) ) = 0, then u will remain at the ange single-dotted triangle in Fig. 2a). During the BFS, it H H H L same distance (Fig. 2a), and no structural change occurs. 
updates the number of shortest paths to each vertex found Thus distances remain the same. However, some shortest to take into account the paths created (lost) by the addi- pathscomingthrough(u ,u )arelost,sothebetweenness tion(removal)oftheedge(u ,u ). Toavoiddoublecount- H L H L centrality needs to be updated. ing,theoldnumberofshortestpathsfromeachpredecessor Inthethirdandmostcomplexcase,structuralchangesoc- BD[s].σ[v] is subtracted before adding the new number of curinthegraph(Fig.2bdepictsthiscasebeforeu drops). shortest paths σ(cid:48)[v] (line 10). The vertices encountered in L Also in this case we make use of pivots to correct the dis- the BFS are added to a level-specific queue LQ for later tancesintheSPdagfirst,andsubsequentlyadjusttheshort- use. AttheendoftheBFS,thenumberofshortestpathsis est paths and dependency values. However, not all pivots up-to-date. willbefoundinthesub-dagrootedinu aftertheremoval. Inthedependencyaccumulationphase,thealgorithmpolls L This difference makes this case more complicated than the the LQ queue for each level and visits the vertices in re- Algorithm 2: Betweenness update for addition or re- Algorithm 4: Betweenness update for addition of an moval of an edge where u remains at the same level edge where u rises one or more levels after addition. L L after the update. 
[Pseudocode of the combined addition/deletion update — the BFS traversal from uL and the dependency accumulation phase — is garbled in this scan and not reproduced.]

Algorithm 3: Initialization dependency module. [listing garbled in extraction; not reproduced]

Algorithm 5: EBC correction if endpoints were not at the same level before the change. [listing garbled in extraction; not reproduced]

…in reverse order of discovery by the BFS (δ arrows in Fig. 2a). In the case of edge removal (lines 11–13), uH is inserted in LQ before the dependency accumulation starts, to guarantee that the data structures of uH and its predecessors are updated even though the edge (uH, uL) no longer exists.

The vertices whose dependency needs an update are situated in two parts of the SPdag. Part of them is in the sub-dag rooted in uL, as discovered by the BFS. For these vertices, the number of shortest paths was updated during the BFS; they have been inserted in the appropriate queue LQ and will be examined during the dependency accumulation phase. The others are the predecessors of the sub-dag, which lie at the fringe of the first ones (green, double-dotted triangle in Fig. 2a). These vertices are found while backtracking in the dependency accumulation phase, by examining the neighbors of the vertices in LQ. Discovered predecessors are added to LQ and examined at the next level (line 2 in Alg. 3).

During the backtracking phase, the algorithm corrects the dependency of the affected vertices. If a vertex was not encountered before, the old dependency for the particular path is subtracted (line 23 for addition, line 27 for removal). The new dependency c is added in line 3 of Alg. 3 and then corrected by subtracting the old dependency α (line 24 for addition, line 26 for removal). Similarly, the VBC is updated in line 28.

4.2 Addition: 1 or more levels rise

In this case, the SPdag undergoes structural changes. To handle them, we need to find the pivots, which define the boundary of the structural change of the SPdag. Given that the addition of the edge can only bring vertices closer to the source if they are reachable from uL, all the pivots can be found while traversing the SPdag with a BFS starting from uL, similarly to the previous case (orange, single-dotted triangle in Fig. 2b).

Algorithm 4 shows the details. The algorithm begins by initializing the new distance of uL (line 1). Similarly to the previous case, it adds all the new shortest paths to vertices found during its BFS traversal. However, in this case there are structural changes in the SPdag, due to the connection of two endpoints previously far apart. The vertices reachable from uL might be pulled closer to the source s along with uL. As a result, new shortest paths may emerge, old shortest paths may become obsolete, and distances from the source may change. Therefore, the vertices do not inherit the shortest paths from their predecessors (line 3); rather, the shortest paths are computed during the modified BFS.

The structural changes that can happen in the SPdag are depicted in Figure 3. Let us examine these cases for a vertex x and its neighbor y. Let a sibling be a neighbor of a vertex that is at the same distance from the source.
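As a point of reference for the incremental updates above: the per-source state they maintain (distances d[·], path counts σ[·], dependencies δ[·]) is what one iteration of the static Brandes computation produces. Below is a minimal single-source sketch in Python; this is our illustration, not the paper's code, and unlike the out-of-core version described later it keeps explicit predecessor lists.

```python
from collections import deque, defaultdict

def brandes_single_source(adj, s):
    """One Brandes iteration from source s on an unweighted graph:
    distances d, shortest-path counts sigma, dependencies delta,
    and the partial vertex betweenness contributed by s."""
    d = {s: 0}
    sigma = defaultdict(int); sigma[s] = 1
    order = []                      # vertices in BFS discovery order
    preds = defaultdict(list)       # shortest-path predecessors
    q = deque([s])
    while q:
        v = q.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in d:          # first visit: one level further down
                d[w] = d[v] + 1
                q.append(w)
            if d[w] == d[v] + 1:    # w lies on a shortest path through v
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = defaultdict(float)
    vbc = defaultdict(float)
    for w in reversed(order):       # accumulate dependencies bottom-up
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
        if w != s:
            vbc[w] += delta[w]
    return d, dict(sigma), dict(delta), dict(vbc)
```

Summing the `vbc` maps over all sources yields the (unnormalized) vertex betweenness; the incremental algorithms in this section patch exactly these per-source arrays instead of recomputing them.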
Figure 3: Possible configurations of an edge before and after an update that causes structural changes. [figure not reproduced]

Before the addition, x and y could be either siblings (case 1, Fig. 3) or predecessor and successor (case 2). If y is now a predecessor of x (case 1a), the algorithm adjusts the shortest paths of x (line 5). If x was and still is a predecessor of y (line 6), the new edge has caused both x and y to move closer to s by the same amount (case 2a); in this case, we update the distance from s and insert y in the BFS queue for further exploration (line 7). If y is now on the same level as x, but was not before the addition (case 2b), y is added to the BFS for further exploration (line 10). If y moved two levels w.r.t. x (case 2c), it will be discovered first in the BFS after the update (line 6).

Clearly, there are no structural changes for the vertices at levels above uL (i.e., closer to the source). The sub-cases examined cover all possible scenarios of how a pair of connected vertices (and thus their edge) can be found after the addition of the new edge. The shortest paths and distances (σ′[·], d′[·]) are updated in the way the original Brandes algorithm proposes.

In the dependency accumulation phase, the dependency score of every vertex examined is updated with the new number of shortest paths computed in the BFS phase. This part of the algorithm is similar to the corresponding one in Algorithm 2. However, there are important differences in the correction of the dependency for the edge betweenness centrality (lines 20–25, Alg. 4). Assuming v is x and w is y: if both x and y remain at the same relative distance from the source, the dependency to be subtracted, α, is calculated in line 4 of Alg. 3 (case 2a). However, if y moves closer (case 2c), then y was a successor of x but is now a predecessor of x; therefore, we need to subtract the dependency on y. The subtracted value is adjusted by switching w with v in the dependency accumulation formula (lines 21–22, Alg. 4).

If the endpoints of the edge were at the same level before the addition (case 1), there is no need for correction, since no dependency was accumulated on the edge (line 20, Alg. 4). If the endpoints are now at the same level but were not before (case 2b), the old dependency needs to be subtracted from the betweenness of the edge; the edge is also marked not to be traversed again (Alg. 5). In Alg. 5, if w was a successor of v, the old dependency is calculated on line 4, whereas if w was a predecessor of v, it is calculated on line 6. The vertex betweenness centrality is updated on line 26 of Alg. 4, by adding the new dependency accumulated on the vertex w and subtracting the old one.

In summary, all possible cases of structural changes in the SPdag below uL are covered by Alg. 4, which correctly updates the betweenness scores and the accompanying data structures of all affected vertices and edges.

4.3 Removal: 1 or more levels drop

Algorithm 6: Pivot finder when an edge removal causes uL to drop one or more levels. [listing garbled in extraction; not reproduced]

Algorithm 6 shows how to handle an edge removal that causes uL to move one or more hops away from the source s. First, the algorithm needs to find the pivots in the sub-dag rooted in uL. Then, distances from the source are updated by exploring the graph starting from the discovered pivots. Finally, the usual dependency accumulation corrects the values of dependency and betweenness centrality.

The algorithm starts a BFS from uL to search for vertices that have predecessors still connected to the rest of the graph, i.e., outside the scope of the sub-dag rooted in uL (lines 6–9), and marks them as pivots (PV) (BFS1, orange, single-dotted triangle in Fig. 2b). The rest are non-pivots (NP). Figure 3 illustrates the possible cases that can be encountered by this BFS traversal in the case of edge removal: pivots are represented by vertex y if y remained in level i (case 1d), or if y remained in level i+1 (cases 2e and 2f).

Algorithm 7: Betweenness update for edge removal where uL drops one or more levels after removal. [listing garbled in extraction; not reproduced]

After the first BFS finishes, Algorithm 7 starts a new BFS (BFS2, blue, single-dashed triangle in Fig. 2b) from the pivots found. This new BFS first corrects the distances of the vertices discovered (line 9) and their shortest-path counts (line 12). Furthermore, in lines 23–31 it adjusts the dependency and betweenness values in a similar fashion as described in Section 4.2, by covering all possible cases (Fig. 3) and following the δ arrows in Fig. 2b.

If uL has at least one sibling before the removal of the edge, we can use the same general technique presented above. However, some optimizations to reduce the computation overhead are possible. Indeed, the first search from uL to find the pivots can be avoided: in this case, all possible scenarios shown in Figure 3 can be seamlessly found and resolved while adjusting the shortest paths (Algorithm 7), since the starting pivots are the siblings of uL, and the other pivots are all found during the BFS. These optimizations are explained in detail next.

4.4 Removal: 1 level drop (optimization)

Algorithm 8: Betweenness update for removal of an edge where uL drops one level (BFS part). [listing garbled in extraction; not reproduced]

Algorithm 9: Betweenness update for removal of an edge where uL drops one level (dependency accumulation part). [listing garbled in extraction; not reproduced]

Algorithms 8 and 9 show in detail the steps performed by the framework when an edge removal forces uL to drop exactly one level from the source, as an optimization of the previous, generic method that covers a drop of one or more levels. In this case, uL does not have any other predecessor in the previous level. Note that this one-level drop of uL can lead to subsequent changes to many of uL's successors with respect to their distance from the source s. Algorithm 8 runs a BFS from uL that first fixes the distances of the vertices found and then adjusts their shortest-path counts.

During the BFS traversal from uL, various sub-cases are encountered, depending on the neighbors of each vertex and where each was positioned with respect to the vertex under examination (lines 8, 11 and 20). Figure 3 illustrates these sub-cases. Note that in some cases (1e and 2d) there is no relative change: the vertices keep the same distance difference they had before the removal (regardless of whether they stayed in the same position or moved together downwards). However, there are also cases where one of the two vertices moves and the other does not, due to their predecessors (Figure 3, cases 1d and 2e). In such cases, not only must the distance of the moved vertex be fixed, but also its shortest-path count from the source. All these sub-cases are examined in Algorithm 8.

Observe that we start the BFS from uL to minimize the scope of the changes, i.e., to investigate only the sub-dag directly under uL. Therefore, a simple BFS is not enough to perform both adjustments (distance and shortest paths) in one pass as before. Instead, in some cases we need to check the neighbors z of the neighbor w under examination, if w is not properly adjusted yet (lines 25–27). If we had executed a BFS from the same-level neighbors of uL, we could potentially perform a one-pass BFS; however, this would cost more in terms of how many vertices are unnecessarily touched on the way down with the BFS, and then on the way up during the accumulation phase. Any vertices discovered that did not move but have neighbors that moved are pivoting points, and are placed into the same-level BFS queue to be examined in this BFS level (line 31). Otherwise, they are placed in the next-level BFS queue (lines 29 or 33); if they have moved, their distance is also corrected (line 33). Finally, Algorithm 9 is executed to adjust the dependencies of the affected vertices and edges.

4.5 Removal: Disconnected Component

Algorithm 10: Betweenness update for removal of an edge where uL's sub-dag forms a disconnected component. [listing garbled in extraction; not reproduced]

If no pivoting points are found by Algorithm 6, the sub-dag under uL is a disconnected component, and thus unreachable from the source s. Therefore, Algorithm 10 is executed to re-initialize the data structures and correct the betweenness values of vertices and edges. First, a BFS is started from uL to initialize the data structures of the vertices found, and to adjust the betweenness centrality values of vertices and edges. Second, the algorithm backtracks the LQ queues from uH upwards and adjusts all dependency and betweenness values of vertices and edges in the other disconnected component. If the removal indeed disconnects this sub-graph (either a portion of it, or just one vertex into a singleton), this part of the framework is executed for every source of the graph, and the data structures and betweenness scores are adjusted accordingly in the storage files. We note that this step is needed for every source and cannot be avoided, e.g., by treating it as a special case at the beginning of the framework.

5. SCALABILITY

The algorithm described previously is able to update the betweenness centrality of a graph incrementally. Nevertheless, to attain our goal of a useful and practical framework for real-world deployment, the algorithm alone is not enough. Indeed, a number of scalability issues need to be addressed in order to have a practical tool.

The space complexity of the algorithm is quadratic in the number of vertices. When dealing with large graphs, the space requirements can easily outgrow the available main memory. In this case, we can use the disk to store the required data structures, as our algorithm allows for an efficient out-of-core implementation.

Despite this feature, the space requirements for very large graphs can still outgrow the disk. Furthermore, disk access becomes a bottleneck due to reading and writing large amounts of data. A simple solution is to divide the execution across multiple machines with multiple disks. This solution not only allows the framework to scale to larger inputs, but also leads to improved speedup over the sequential version.

5.1 Out-of-core organization

The removal of the predecessor lists from the algorithm reduces its space complexity by O(nm). As an additional benefit, it leaves the algorithm with no variable-length data structure. Indeed, our algorithm stores only three fixed-size data structures per vertex: the distance from the source d[·] (1 byte), the number of shortest paths from the source σ[·] (2 bytes), and the accumulated dependency δ[·] (8 bytes). This organization allows an efficient data layout on disk. The graph itself is loaded and kept in memory, allowing fast random access to vertex neighborhoods.
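With the three fixed-size per-vertex fields above (1-byte distance, 2-byte path count, 8-byte dependency), every per-source record is exactly 11n bytes, which is what makes constant-offset skipping between sources possible. The sketch below shows one plausible columnar packing in Python's `struct` notation; it is illustrative only (the paper's implementation reads the files through file channels and byte buffers), and the signed distance byte allowing a −1 sentinel is our assumption.

```python
import struct

def record_size(n):
    # per-source record: n distances (1 B) + n path counts (2 B) + n dependencies (8 B)
    return n * (1 + 2 + 8)

def pack_source(d, sigma, delta):
    """Serialize one source's arrays in columnar order {d[.], sigma[.], delta[.]}."""
    n = len(d)
    return (struct.pack(f'<{n}b', *d) +       # signed byte leaves room for a -1 sentinel
            struct.pack(f'<{n}H', *sigma) +   # unsigned 16-bit shortest-path counts
            struct.pack(f'<{n}d', *delta))    # 64-bit float dependencies

def unpack_source(buf, n):
    d = list(struct.unpack(f'<{n}b', buf[:n]))
    sigma = list(struct.unpack(f'<{n}H', buf[n:3 * n]))
    delta = list(struct.unpack(f'<{n}d', buf[3 * n:11 * n]))
    return d, sigma, delta

def seek_offset(source_id, n):
    # fixed record size -> constant-time skip to any source's record on disk
    return source_id * record_size(n)
```

Because the distance column sits first in each record, a reader can fetch just the n distance bytes to decide whether a source can be skipped, before paying for the rest of the record.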
by storing the data structures sequentially on disk, and in- We encode BD[·] in binary format on disk. For each ferring the ID from the order. Overall, for each source the sources,westorethedataforeachothervertexinacolum- algorithm stores 3 arrays, with sizes 1×n, 2×n and 8×n nar fashion, i.e., we store on disk all the distances, then bytes,respectively. Becauseofthebinaryformat,eacharray all the numbers of shortest paths, and finally the depen- can be read by using file channels and byte buffers, loaded dency values, in order: {BD[s].d[·],BD[s].σ[·],BD[s].δ[·]}. directlyinmemoryandbereadyforuse. Thisoptimization avoids memory allocations and memory copies during the 1.#Send#to#distributed#cache:# execution of the algorithm. In practice, the computation ###a.#G(V,E):#edgeElist# G,ES# happensatthespeedofsequentialdiskaccess. Inthefuture ###b.#E={set#of#updates}# S weplantoexplorecompressionschemestoreducethespace 2.#Produce#ranges#Π#and#map#machines## and disk access overhead. Π:#{0,#(|V'|/p)E1}# Π:#{(pE1)*(|V'|/p),#(|V'|E1)}# 1 p As explained in the previous sections, the work done by Mapper#1# Mapper#p# thealgorithmdependsondd(u ,u ). Therefore,afterload- L H ienngdptohientdsisutaHncaensdfroumL.diIsfk,thweeycahreeckatthtehedissatamnecedifsotranthcee 12..##EStxoerceu#tpea#rBKraaln#BdDes[Π#in1]#Π#o1n##dGis,EkS##………####12..##EStxoerceu#tpea#rBKraaln#BdDes[Π#inp]#Π#opn##dGis,EkS## 3.#for(edge#in#E){# 3.#for(edge#in#E){# (dd = 0), we skip directly to the next source without load- S S ###a.#Read#BD[Π]#from#disk# ###a.#Read#BD[Π]#from#disk# ingtherestofthedatastructure. Thisoperationisefficient 1 p ###b.#Execute#Algorithm#1#in#Π# ###b.#Execute#Algorithm#1#in#Π# because the data structures have fixed size so the offset to 1 p ###c.#Update#BD'[Π]#on#disk# ###c.#Update#BD'[Π]#on#disk# skip to reach the beginning of the next source is constant. 1 p Otherwise,thearraysareloadedinmemoryandAlgorithm1 Emit#1:# Emit#p:# isexecuted. 
Whenasourceiscovered,thealgorithmwrites <i,#VBC'(Π1)># Reducer# <i,#VBC'(Πp)># the arrays back to disk, in place and sequentially. <(u,v),#EBC'(Π1)># 1.#Group#on#vertex/edge#id# <(u,v),#EBC'(Πp)># 2.#Sum#VBC'(Π)#&#EBC'(Π)# 3.#Store#VBC'(Π)#&#EBC'(Π)# 5.2 Parallelization Figure 4: MapReduce version of our framework. The out-of-core version presented in the previous section isslowerthanthein-memoryone,butenablestheframework to scale to large graphs. However, the space requirements forrealgraphscanbestaggering(foragraphof1Mvertices t = t ×n/p+t , where t is the average time needed U S M M we need ≈ 11TB of space). Although nowadays is fairly to merge the results. easy to have such an amount of disk storage available on Assuming that an evolving graph has an average rate of servers,readingandwritingthisamountofdatawouldincur updates per time unit F =1/t , the system can always (on I a significant overhead. average)produceupdatedbetweennessscoresbeforethenew To solve this issue, we take advantage of the parallel na- timeperiod,ift <t . However,ifthesystemmeasuresan U I tureofourframeworkandproposetodistributethecompu- increased rate of arrival F(cid:48) > F, it can adjust the number tationonaclusterofshared-nothingmachines. Letpbethe of machines to p(cid:48) > t ×n/(t(cid:48) −t ) to guarantee online S I M number of available machines in the system. We distribute updates,assumingthattheaveragetimepersourceremains thedatastructureBD[·]evenlyamongthepmachines,i.e., unchanged, and that (t(cid:48) > t +t ), i.e., the inter-arrival I S M each machine will be allocated ≈ n/p sources. This paral- timeislargerthantheinherentserialpartofthealgorithm. lelizationispossiblesinceeachsourcecanbeexaminedand updated independently, and the partial betweenness scores can be summed at the end. 5.4 AMapReduceembodiment Thedistributedversionoftheframeworkhasmultiplead- vantages. 
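The partition-and-merge scheme just described can be sketched as follows. The function names are our own, not the framework's API; the paper's partitions are simply (first, last) source-ID pairs, which the contiguous ranges below model.

```python
from collections import Counter

def partition_ranges(n, p):
    """Split source IDs 0..n-1 into p contiguous, near-equal ranges
    (one range of ~n/p sources per machine)."""
    base, extra = divmod(n, p)
    ranges, start = [], 0
    for i in range(p):
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

def merge_partials(partials):
    """Sum per-machine partial betweenness scores, keyed by vertex or edge ID.
    Valid because each source's contribution is independent and additive."""
    total = Counter()
    for part in partials:
        total.update(part)   # Counter.update adds the numeric values
    return dict(total)
```

Each machine runs the incremental update over its own range of sources against its local slice of BD[·]; `merge_partials` is the final summation step.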
The distributed version of the framework has multiple advantages. First, the space requirements per machine are reduced to O(n²/p). For example, if 100 machines are available, for a graph with 1M vertices the storage needed is just 150GB per machine, which is perfectly reasonable by today's standards. Second, the overall work of the algorithm is divided among p processors, leading to a theoretical speedup of p over single-machine execution. Furthermore, the disk-access workload is distributed in a balanced fashion across multiple disks; as a result, disk access in the framework is p times faster. As we show in the next section, the speedup of the framework is indeed almost ideal.

5.3 Online betweenness updates

Real graphs are constantly evolving, with addition of new vertices and edges, and removal of existing ones. Updating the betweenness centrality in real time is extremely challenging, given its computational cost. However, due to the inherent parallelism of our design, the framework can scale not only to large graphs but also to rapidly changing ones.

Each of the p available machines is responsible for updating the data structures and partial centrality scores for n/p sources. The system can monitor the average time tS needed for each machine to process a source, given the addition of a new edge or the removal of an existing one. Thus, the average time tU to produce updated betweenness scores for all vertices and edges upon arrival of an update is tU = tS × n/p + tM, where tM is the average time needed to merge the results.

Assuming that an evolving graph has an average rate of updates per time unit F = 1/tI, the system can always (on average) produce updated betweenness scores before the new time period if tU < tI. However, if the system measures an increased rate of arrival F′ > F, it can adjust the number of machines to p′ > tS × n/(t′I − tM) to guarantee online updates, assuming that the average time per source remains unchanged and that t′I > tS + tM, i.e., the inter-arrival time is larger than the inherent serial part of the algorithm.

5.4 A MapReduce embodiment

Various paradigms can be used to embody the proposed parallel framework, such as parallel stream processing engines (e.g., Storm, S4, and Samza). Due to its ease of use and popularity, we deploy and experiment with a MapReduce embodiment on a Hadoop cluster.

Figure 4 illustrates a MapReduce adaptation of our algorithmic framework. The graph G(V,E) and the set of updates ES (new edges to be added and existing edges to be removed) are replicated on all machines via the distributed cache, and loaded in memory. We generate an input for each mapper i that represents a partition Πi of the graph. Each partition is comprised of two integers that represent the first and last ID of the range of sources for which that mapper is responsible. The data structures BD[Πi] created during step 1 are stored locally on the disk of each machine.

The Map function processes all edges in ES in sequence and updates the betweenness centrality. For each update, it emits key-value pairs of vertex or edge IDs together with their partial betweenness centrality (PBC) by source s, i.e., ⟨id, vbc_s(id)|ebc_s(id)⟩, where id is either a vertex or an edge identifier. All the intermediate pairs are sent to the reducers, which are responsible for producing the final aggregated betweenness results for all vertices and edges. Each Reduce function aggregates the partial betweenness score of one element (vertex or edge) of the graph. The final value of the computation is the new betweenness score for each element of the graph after the set of updates ES is applied.
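The provisioning rule of Section 5.3 is simple enough to state as code. The sketch below is an illustrative transcription of the formulas tU = tS·n/p + tM and p′ > tS·n/(t′I − tM), not part of the framework itself; the example timings are hypothetical.

```python
import math

def update_time(t_s, n, p, t_m):
    """t_U = t_S * n/p + t_M: average time to refresh all betweenness
    scores after one edge update, with n sources split over p machines."""
    return t_s * n / p + t_m

def machines_needed(t_s, n, t_i, t_m):
    """Smallest integer p' with p' > t_S * n / (t_I' - t_M), i.e. the
    machine count guaranteeing t_U < t_I' for inter-arrival time t_I'."""
    # the rule only applies when the inter-arrival time exceeds the
    # inherent serial part of the algorithm: t_I' > t_S + t_M
    assert t_i > t_s + t_m, "inter-arrival time must exceed t_S + t_M"
    return math.floor(t_s * n / (t_i - t_m)) + 1
```

For instance, with a (hypothetical) per-source time tS = 0.5 ms, n = 1000 sources, merge time tM = 1 s, and updates arriving every t′I = 2 s, scaling to `machines_needed(0.0005, ...)` machines keeps `update_time` below the inter-arrival time.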
