MANAGING AND MINING GRAPH DATA

Chapter 10

A SURVEY OF ALGORITHMS FOR DENSE SUBGRAPH DISCOVERY

Victor E. Lee
Department of Computer Science
Kent State University
Kent, OH 44242
[email protected]

Ning Ruan
Department of Computer Science
Kent State University
Kent, OH 44242
[email protected]

Ruoming Jin
Department of Computer Science
Kent State University
Kent, OH 44242
[email protected]

Charu Aggarwal
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
[email protected]

Abstract

In this chapter, we present a survey of algorithms for dense subgraph discovery. The problem of dense subgraph discovery is closely related to clustering, though the two problems also have a number of differences. For example, the problem of clustering is largely concerned with finding a fixed partition in the data, whereas the problem of dense subgraph discovery defines these dense components in a much more flexible way. The problem of dense subgraph discovery may be defined over either a single graph or multiple graphs. We explore both cases. In the latter case, the problem is also closely related to the problem of frequent subgraph discovery. This chapter discusses and organizes the literature on this topic in order to make it more accessible to the reader.

Keywords: Dense subgraph discovery, graph clustering

1. Introduction

In almost any network, density is an indication of importance. Just as someone reading a road map is interested in knowing the location of the larger cities and towns, investigators who seek information from abstract graphs are often interested in the dense components of the graph. Depending on what properties are being modeled by the graph's vertices and edges, dense regions may indicate high degrees of interaction, mutual similarity and hence collective characteristics, attractive forces, favorable environments, or critical mass.

From a theoretical perspective, dense regions have many interesting properties. Dense components naturally have small diameters (the worst-case shortest path between any two members). Routing within these components is rapid. A simple strategy also exists for global routing.
If most vertices belong to a dense component, only a few selected inter-hub links are needed to have a short average distance between any two arbitrary vertices in the entire network. Commercial airlines employ this hub-based routing scheme. Dense regions are also robust, in the sense that many connections can be broken without splitting the component. A less well-known but equally important property of dense subgraphs comes from percolation theory. If a graph is sufficiently dense, or equivalently, if messages are forwarded from one node to its neighbors with higher than a certain probability, then there is very high probability of propagating a message across the diameter of the graph [20]. This fact is useful in everything from epidemiology to marketing.

Not all graphs have dense components, however. A sparse graph may have few or none. In order to understand this issue, we first need to define a formal notion of the words 'dense' and 'sparse'. We will address this issue shortly. A uniform graph is either entirely dense or not dense at all. Uniform graphs, however, are rare, usually limited to either small or artificially created ones. Due to the usefulness of dense components, it is generally accepted that their existence is the rule rather than the exception in nature and in human-planned networks [39].

Dense components have been identified in and have enhanced understanding of many types of networks; among the best-known are social networks [53, 44], the World Wide Web [30, 17, 11], financial markets [5], and biological systems [26]. Much of the early motivation, research, and nomenclature regarding dense components was in the field of social network analysis. Even before the advent of computers, sociologists turned to graph theory to formulate models for the concept of social cohesion. Clique, $K$-core, $K$-plex, and $K$-club are metrics originally devised to measure social cohesiveness [53]. It is not surprising that we also see dense components in the World Wide Web.
In many ways, the Web is simply a virtual implementation of traditional direct human-human social networks.

Today, the natural sciences, the social sciences, and technological fields are all using network and graph analysis methods to better understand complex systems. Dense component discovery and analysis is one important aspect of network analysis. Therefore, readers from many different backgrounds will benefit from understanding more about the characteristics of dense components and some of the methods used to uncover them.

In the next section, we outline the graph terminology and define the fundamental measures of density to be used in the rest of the chapter. Section 3 categorizes the algorithmic approaches and presents representative implementations in more detail. Section 4 expands the topic to consider frequently-occurring dense components in a set of graphs. Section 5 provides examples of how these techniques have been applied in various scientific fields. Section 6 concludes the chapter with a look to the future.

2. Types of Dense Components

Different applications find different definitions of dense component to be useful. In this section, we outline the many ways to define a dense component, categorizing them by their important features. Understanding these features of the various types of components is valuable for deciding which type of component to pursue.

2.1 Absolute vs. Relative Density

We can divide density definitions into two classes, absolute density and relative density. An absolute density measure establishes rules and parameter values for what constitutes a dense component, independent of what is outside the component. For example, we could say that we are only interested in cliques, fully-connected subgraphs of maximum density. Absolute density measures take the form of relaxations of the pure clique measure.

On the other hand, a relative density measure has no preset level for what is sufficiently dense. It compares the density of one region to another, with the goal of finding the densest regions.
To establish the boundaries of components, a metric typically looks to maximize the difference between intra-component connectedness and inter-component connectedness. Often but not necessarily, relative density techniques look for a user-defined number $k$ of densest regions. The alert reader may have noticed that relative density discovery is closely related to clustering and in fact shares many features with it.

Since this book contains another chapter dedicated to graph clustering, we will focus our attention on absolute density measures. However, we will have more to say about the relationship between clustering and density at the end of this section.

2.2 Graph Terminology

Let $G(V,E)$ be a graph with $|V|$ vertices and $|E|$ edges. If the edges are weighted, then $w(u)$ is the weight of edge $u$; for an edge with endpoints $u$ and $v$ we also write $w(u,v)$. We treat unweighted graphs as the special case where all weights are equal to 1. Let $S$ and $T$ be subsets of $V$. For an undirected graph, $E(S)$ is the set of induced edges on $S$: $E(S) = \{(u,v) \in E \mid u,v \in S\}$. Then, $H_S$ is the induced subgraph $(S, E(S))$. Similarly, $E(S,T)$ designates the set of edges from $S$ to $T$, and $H_{S,T}$ is the induced subgraph $(S, T, E(S,T))$. Note that $S$ and $T$ are not necessarily disjoint from each other. If $S \cap T = \emptyset$, $H_{S,T}$ is a bipartite graph. If $S$ and $T$ are not disjoint (possibly $S = T = V$), this notation can be used to represent a directed graph.

A dense component is a maximal induced subgraph which also satisfies some density constraint. A component $H_S$ is maximal if no other subgraph of $G$ which is a superset of $H_S$ would satisfy the density constraints. Table 10.1 defines some basic graph concepts and measures that we will use to define density metrics.

Table 10.1. Graph Terminology

| Symbol | Description |
|---|---|
| $G(V,E)$ | graph with vertex set $V$ and edge set $E$ |
| $H_S$ | subgraph with vertex set $S$ and edge set $E(S)$ |
| $H_{S,T}$ | subgraph with vertex set $S \cup T$ and edge set $E(S,T)$ |
| $w(u)$ | weight of edge $u$ |
| $N_G(u)$ | neighbor set of vertex $u$ in $G$: $\{v \mid (u,v) \in E\}$ |
| $N_S(u)$ | only those neighbors of vertex $u$ that are in $S$: $\{v \in S \mid (u,v) \in E\}$ |
| $\delta_G(u)$ | (weighted) degree of $u$ in $G$: $\sum_{v \in N_G(u)} w(u,v)$ |
| $\delta_S(u)$ | (weighted) degree of $u$ in $S$: $\sum_{v \in N_S(u)} w(u,v)$ |
| $d_G(u,v)$ | shortest (weighted) path from $u$ to $v$ traversing any edges in $G$ |
| $d_S(u,v)$ | shortest (weighted) path from $u$ to $v$ traversing only edges in $E(S)$ |

We now formally define the density of $S$, $den(S)$, as the ratio of the total weight of edges in $E(S)$ to the number of possible edges among $|S|$ vertices. If the graph is unweighted, then the numerator is simply the number of actual edges, and the maximum possible density is 1. If the graph is weighted, the maximum density is unbounded. The number of possible edges in an undirected graph of size $n$ is $\binom{n}{2} = n(n-1)/2$. We give the formulas for an undirected graph; the formulas for a directed graph lack the factor of 2.

$$den(S) = \frac{2\,|E(S)|}{|S|\,(|S|-1)}$$

$$den_W(S) = \frac{2 \sum_{u,v \in S} w(u,v)}{|S|\,(|S|-1)}$$

Some authors define density as the ratio of the number of edges to the number of vertices: $|E|/|V|$. We will refer to this as the average degree of $S$.

Another important metric is the diameter of $S$, $diam(S)$. Since we have given two different distance measures, $d_S$ and $d_G$, we accordingly offer two different diameter measures. The first is the standard one, in which we consider only paths within $S$. The second permits paths to stray outside $S$, if doing so offers a shorter path.

$$diam(S) = \max\{\, d_S(u,v) \mid u,v \in S \,\}$$

$$diam_G(S) = \max\{\, d_G(u,v) \mid u,v \in S \,\}$$

2.3 Definitions of Dense Components

We now present a collection of measures that have been used to define dense components in the literature (Table 10.2). To focus on the fundamentals, we assume unweighted graphs. In a sense, all dense components are either cliques, which represent the ideal, or some relaxation of the ideal.
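
Because every relaxation is expressed through the density, degree, and diameter measures of Section 2.2, it may help to see two of them computed. The following Python sketch assumes an unweighted, undirected graph given as a list of vertex pairs. The function names and the six-vertex example graph are illustrative assumptions; the graph is not the one in Figure 10.1.

```python
from collections import deque


def induced_edges(edges, S):
    """E(S): the edges whose two endpoints both lie in S."""
    S = set(S)
    return {frozenset(e) for e in edges if set(e) <= S}


def density(edges, S):
    """den(S) = 2|E(S)| / (|S| (|S| - 1)) for an unweighted, undirected graph."""
    n = len(set(S))
    if n < 2:
        return 0.0
    return 2 * len(induced_edges(edges, S)) / (n * (n - 1))


def _bfs_distances(adj, source, allowed):
    """Hop distances from source, walking only through vertices in `allowed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v in allowed and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(edges, S, within_S=True):
    """diam(S) if within_S is True (paths stay inside S), else diam_G(S)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    S = set(S)
    allowed = S if within_S else set(adj) | S
    worst = 0
    for u in S:
        dist = _bfs_distances(adj, u, allowed)
        for v in S:
            if v not in dist:
                return float("inf")  # some pair of S is not connected
            worst = max(worst, dist[v])
    return worst


# Hypothetical example graph (an assumption made for illustration only).
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (6, 1)]

print(density(EDGES, {1, 2, 3, 4}))                   # 5 of 6 possible edges
print(diameter(EDGES, {1, 4, 5, 6}))                  # paths restricted to S
print(diameter(EDGES, {1, 4, 5, 6}, within_S=False))  # paths may leave S
```

In this example the set {1, 4, 5, 6} has $diam(S) = 3$ but $diam_G(S) = 2$, because vertices 1 and 4 are joined by a two-hop path through vertex 2, which lies outside the set. This is exactly the distinction that separates a distance relaxation measured inside $S$ from one measured in $G$.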
These relaxations fall into three categories: density, degree, and distance. Each relaxation can be quantified as either a percentage factor or a subtractive amount. While most of these definitions are widely-recognized standards, the name quasi-clique has been applied to any relaxation, with different authors giving different formal definitions. Abello [1] defined the term in terms of overall edge density, without any constraint on individual vertices. This offers considerable flexibility in the component topology. Several other authors [36, 32, 33] have opted to define quasi-clique in terms of the minimum degree of each vertex. Li et al. [32] provide a brief overview and comparison of quasi-cliques. In our table, when the authorship of a specific metric can be traced, it is given. Our list is not exhaustive; however, the majority of definitions can be reduced to some combination of density, degree, and diameter.

Note that in unweighted graphs, cliques have a density of 1. Density-based quasi-cliques are only defined for unweighted graphs. We use the term Kd-clique instead of Mokken's original name K-clique, because $K$-clique is already defined in the mathematics and computer science communities to mean a clique with $k$ vertices.

Table 10.2. Types of Dense Components

| Component | Reference | Formal definition | Description |
|---|---|---|---|
| Clique | | $(i,j) \in E$ for all $i \neq j \in S$ | Every vertex connects to every other vertex in $S$. |
| Quasi-Clique (density-based) | [1] | $den(S) \ge \gamma$ | $S$ has at least $\gamma \lvert S\rvert(\lvert S\rvert - 1)/2$ edges. Density may be imbalanced within $S$. |
| Quasi-Clique (degree-based) | [36] | $\delta_S(u) \ge \gamma\,(k-1)$, where $k = \lvert S\rvert$ | Each vertex has at least a fraction $\gamma$ of its possible connections to other vertices. Local degree satisfies a minimum. Compare to $K$-core and $K$-plex. |
| K-core | [45] | $\delta_S(u) \ge k$ | Every vertex connects to at least $k$ other vertices in $S$. A clique of $k$ vertices is a $(k-1)$-core. |
| K-plex | [46] | $\delta_S(u) \ge \lvert S\rvert - k$ | Each vertex is missing no more than $k-1$ edges to its neighbors. A clique is a 1-plex. |
| Kd-clique | [34] | $diam_G(S) \le k$ | The shortest path from any vertex to any other vertex is not more than $k$. An ordinary clique is a 1d-clique. Paths may go outside $S$. |
| K-club | [37] | $diam(S) \le k$ | The shortest path from any vertex to any other vertex is not more than $k$. Paths may not go outside $S$. Therefore, every K-club is a Kd-clique. |

Figure 10.1, a superset of an illustration from Wasserman and Faust [53], demonstrates each of the dense components that we have defined above.

Cliques: {1,2,3} and {2,3,4}
0.8-Quasi-clique: {1,2,3,4} (includes 5/6 ≈ 0.83 of the possible edges, which exceeds 0.8)
2-Core: {1,2,3,4,5,6,7}
3-Core: none
2-Plex: {1,2,3,4} (vertices 1 and 4 are missing one edge each)
2d-Cliques: {1,2,3,4,5,6} and {2,3,4,5,6,7} (In the first component, 5 connects to 6 via 7, which need not be a member of the component)
2-Clubs: {1,2,3,4,5}, {1,2,3,4,6}, and {2,3,5,6,7}

[Figure 10.1. Example Graph to Illustrate Component Types]

2.4 Dense Component Selection

When mining for dense components in a graph, a few additional questions must be addressed:

1. Minimum size $\sigma$: What is the minimum number of vertices in a dense component $S$? I.e., $|S| \ge \sigma$.

2. All or top-$N$?: One of the following criteria should be applied.
   - Select all components which meet the size, density, degree, and distance constraints.
   - Select the $N$ highest ranking components that meet the minimum constraints. A ranking function must be established. This can be as simple as one of the same metrics used for minimum constraints (size, density, degree, distance, etc.) or a linear combination of them.
   - Select the $N$ highest ranking components, with no minimum constraints.

3. Overlap: May two components share vertices?

2.5 Relationship between Clusters and Dense Components

The measures described above set an absolute standard for what constitutes a dense component. Another approach is to find the most dense components on a relative basis. This is the domain of clustering. It may seem that clustering, a thoroughly-studied topic in data mining with many excellent methodologies, would provide a solution to dense component discovery.
However, clustering is a very broad term. Readers interested in a survey on clustering may wish to consult either Jain, Murty, and Flynn [24] or Berkhin [8]. In the data mining community, clustering refers to the task of assigning similar or nearby items to the same group while assigning dissimilar/distant items to different groups. In most clustering algorithms, similarity is a relative concept; therefore it is potentially suitable for relative density measures. However, not all clustering algorithms are based on density, and not all types of dense components can be discovered with clustering algorithms.

Partitioning refers to one class of clustering problem, where the objective is to assign every item to exactly one group. A $k$-partitioning requires the result to have $k$ groups. $K$-partitioning is not a good approach for identifying absolute dense components, because the objectives are at odds. Consider the well-known $k$-Means algorithm applied to a uniform graph. It will generate $k$ partitions, because it must. However, the partitioning is arbitrary, changing as the seed centroids change.

In hierarchical clustering, we construct a tree of clusters. Conceptually, as well as in actual implementation, this can be either agglomerative (bottom-up), where the closest clusters are merged together to form a parent cluster, or divisive (top-down), where a cluster is subdivided into relatively distant child clusters. In basic greedy agglomerative clustering, the process starts by grouping together the two closest items. The pair are now treated as a single item, and the process is repeated. Here, pairwise distance is the density measure, and the algorithm seeks to group together the densest pair. If we use divisive clustering, we can choose to stop subdividing after finding $k$ leaf clusters. A drawback of both hierarchical clustering and partitioning is that they do not allow for a separate "non-dense" partition.
Even sparse regions are forced to belong to some cluster, so they are lumped together with their closest denser cores.

Spectral clustering describes a graph as an adjacency matrix $W$, from which is derived the Laplacian matrix $L = D - W$ (unnormalized) or $L = I - D^{-1/2} W D^{-1/2}$ (normalized), where $D$ is the diagonal matrix featuring each vertex's degree. The eigenvectors of $L$ can be used as cluster centroids, with the corresponding eigenvalues giving an indication of the cut size between clusters. Since we want minimum cut size, the smallest eigenvalues are chosen first. This ranking of clusters is an appealing feature for dense component discovery.

None of these clustering methods, however, is suited for an absolute density criterion. Nor can they handle overlapping clusters. Therefore, some but not all clustering criteria are dense component criteria. Most clustering methods are suitable for relative dense component discovery, excluding $k$-partitioning methods.

3. Algorithms for Detecting Dense Components in a Single Graph

In this section, we explore algorithmic approaches for finding dense components. First we look at basic exact algorithms for finding cliques and quasi-cliques and comment on their time complexity. Because the clique problem is NP-hard, we then consider some more time-efficient solutions. The algorithms can be categorized as follows: Exact Enumeration (Section 3.1), Fast Heuristic Enumeration (Section 3.2), and Bounded Approximation Algorithms (Section 3.3). We review some recent works related to dense component discovery, concentrating on the details of several well-received algorithms.

The following table (Table 10.3) gives an overview of the major algorithmic approaches and lists the representative examples we consider in this chapter.

Table 10.3. Overview of Dense Component Algorithms

| Algorithm Type | Component Type | Example | Comments |
|---|---|---|---|
| Enumeration | Clique | [12] | |
| | Biclique | [35] | |
| | Quasi-clique | [33] | min. degree for each vertex |
| | Quasi-biclique | [47] | |
| | $k$-core | [7] | |
| Fast Heuristic Enumeration | Maximal biclique | [30] | nonoverlapping |
| | Quasi-clique/biclique | [13] | spectral analysis |
| | Relative density | [18] | shingling |
| | Maximal quasi-biclique | [32] | balanced noise tolerance |
| | Quasi-clique, $k$-core | [52] | pruned search; visual results with upper-bounded estimates |
| Bounded Approximation | Max. average degree | [14] | undirected graph: 2-approx.; directed graph: $(2+\epsilon)$-approx. |
| | Densest subgraph, $n \ge k$ | [4] | 1/3-approx. |
| | Subgraph of known density $\theta$ | [3] | finds subgraph with density $\Omega(\theta / \log \Delta)$ |

3.1 Exact Enumeration Approach

The most natural way to discover dense components in a graph is to enumerate all possible subsets of vertices and to check whether any of them satisfy the definition of a dense component. In the following, we investigate some algorithms for discovering dense components by explicit enumeration.

Enumeration Approach. Finding maximal cliques in a graph may be straightforward, but it is time-consuming. The clique decision problem, deciding whether a graph of size $n$ has a clique of size at least $k$, is one of Karp's 21 NP-Complete problems [28]. It is easy to show that the clique optimization problem, finding a largest clique in a graph, is NP-hard as well, because the optimization and decision problems can each be reduced in polynomial time to the other. Our goal is to enumerate all maximal cliques. Moon and Moser showed that a graph may contain up to $3^{n/3}$ maximal cliques [38]. Therefore, even for modest-sized graphs, it is important to find the most effective algorithm.

One well-known enumeration algorithm for generating cliques was proposed by Bron and Kerbosch [12]. This algorithm utilizes the branch-and-bound technique in order to prune branches which are unable to generate a clique. The basic idea is to extend a subset of vertices, until the clique is maximal, by adding a vertex that is in a candidate set but not in an exclusion set.
Let $C$ be the set of vertices which already form a clique, $Cand$ be the set of vertices which may potentially be used for extending $C$, and $NCand$ be the set of vertices which are not allowed to be candidates for $C$. $N(v)$ denotes the neighbors of vertex $v$. Initially, $C$ and $NCand$ are empty, and $Cand$ contains all vertices in the graph. Given $C$, $Cand$ and $NCand$, we describe the Bron-Kerbosch algorithm below. The authors experimentally observed $O(3.14^n)$, but did not prove their theoretical performance.

Algorithm 6 CliqueEnumeration(C, Cand, NCand)
  if Cand = ∅ and NCand = ∅ then
    output the clique induced by vertices C;
  else
    for all v_i ∈ Cand do
      Cand ← Cand ∖ {v_i};
      call CliqueEnumeration(C ∪ {v_i}, Cand ∩ N(v_i), NCand ∩ N(v_i));
      NCand ← NCand ∪ {v_i};
    end for
  end if

Makino et al. [35] proposed new algorithms making full use of efficient matrix multiplication to enumerate all maximal cliques in a general graph or bicliques in a bipartite graph. They developed different algorithms for different types of graphs (general graph, bipartite, dense, and sparse). In particular, for a sparse graph such that the degree of each vertex is bounded by $\Delta \ll |V|$, an algorithm with $O(|V||E|)$ preprocessing time, $O(\Delta^4)$ time delay (i.e., the bound of running time between two consecutive outputs) and $O(|V| + |E|)$ space is developed to enumerate all maximal cliques. Experimental results demonstrate good performance for sparse graphs.
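
To make the recursion of Algorithm 6 concrete, the following Python sketch transcribes it almost line for line. It is a minimal illustration under stated assumptions, not the tuned implementation of Bron and Kerbosch [12]: it uses no pivoting, the graph is an adjacency dictionary without self-loops, and the names `build_adjacency` and `enumerate_maximal_cliques` as well as the five-vertex example graph are hypothetical.

```python
def build_adjacency(edges):
    """Turn a list of undirected edges into {vertex: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def enumerate_maximal_cliques(adj):
    """Yield each maximal clique of the graph as a frozenset of vertices."""

    def expand(clique, cand, ncand):
        # clique = C, cand = Cand, ncand = NCand in Algorithm 6.
        if not cand and not ncand:
            # Nothing can extend C and no excluded vertex covers it,
            # so C is a maximal clique.
            yield frozenset(clique)
            return
        for v in list(cand):           # iterate over a snapshot of Cand
            cand = cand - {v}          # Cand <- Cand \ {v}
            yield from expand(clique | {v}, cand & adj[v], ncand & adj[v])
            ncand = ncand | {v}        # NCand <- NCand U {v}

    yield from expand(frozenset(), set(adj), set())


# Hypothetical example graph (an assumption made for illustration only).
EXAMPLE_EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)]
cliques = set(enumerate_maximal_cliques(build_adjacency(EXAMPLE_EDGES)))
for c in sorted(cliques, key=sorted):
    print(sorted(c))
```

The exclusion set is what prevents duplicates and non-maximal output: when the recursion reaches {2, 4}, for example, vertex 3 is still in $NCand$ and adjacent to both members, so the set is not reported because it is contained in the clique {2, 3, 4} found earlier.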