ebook img

Community detection in complex networks using Extremal Optimization PDF

0.18 MB·English
by  J. Duch
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Community detection in complex networks using Extremal Optimization

Community detection in complex networks using Extremal Optimization Jordi Duch1 and Alex Arenas1 1Departament d’Enginyeria Inform`atica i Matem`atiques, Universitat Rovira i Virgili, 43007 Tarragona, Spain (Dated: February 2, 2008) 5 We propose a novel method to find the community structure in complex networks based on an 0 extremaloptimization ofthevalueofmodularity. Themethodoutperformstheoptimalmodularity 0 found by the existing algorithms in the literature. We present the results of the algorithm for 2 computer simulated and real networks and compare them with other approaches. The efficiency n andaccuracyofthemethodmakeitfeasibletobeusedfortheaccurateidentificationofcommunity a structurein large complex networks. J PACSnumbers: 6 1 The description of the structure of complex networks certain mesoscopic description of the graph in terms of ] n has been one of the focus of attention of the physicist’s communitiesismoreorlessaccurate. Thelargerthe val- n community in the recent years. The levels of descrip- ues of Q the most accurate a partition into communities - tion range from the microscopic (degree, clustering co- is. s i efficient, centrality measures, etc., of individual nodes) The search for the optimal (largest) modularity value d . to the macroscopic description in terms of statistical is a NP-hard problem due to the fact that the space of t a properties of the whole network (degree distribution, to- possible partitions grows faster than any power of the m talclusteringcoefficient,degree-degreecorrelations,etc.) system size. For this reason, a heuristic search strategy [1,2,3,4]. Betweenthesetwoextremesthereisa”meso- is mandatory to restrict the search space while preserv- - d scopic” description of networks that tries to explain its ing the optimization goal [14]. Indeed, it is possible to n community structure. The general notion of community relate the current optimization problem for Q with clas- o structureincomplexnetworkswasfirstpointedoutinthe sical problems in statistical physics, e.g. the spin glass c physics literature by Girvan and Newman [5], and refers problem of finding the ground state energy [15], where [ to the fact that nodes in many real networks appear to algorithms inspired in natural optimization processes as 1 group in subgraphs in which the density of internal con- simulatedannealing[16]andgeneticalgorithms[17]have v nections is larger than the connections with the rest of been successfully used. 8 nodes in the network. In this Letter, we propose a new divisive algorithm 6 3 The community structure has been empirically found thatoptimizesthemodularityQusinganheuristicsearch 1 inmanyrealtechnological,biologicalandsocialnetworks basedontheExtremalOptimization(EO)algorithmpro- 0 [6,7,8,9,10]anditsemergenceseemstobeattheheart posed by Boettcher and Percus [18, 19]. This algorithm 5 of the network formation process [11]. isinspiredinturnintheevolutionmodelofBak-Sneppen 0 The existing methods intended to devise the commu- [20], and basically operates optimizing a global variable / t nity structure in complex networks have been recently by improving extremal local variables that involve co- a reviewed in [10]. All these methods require a definition evolutionary avalanches. The performance of EO algo- m ofcommunitythatimposesthelimituptowhichagroup rithms have been shown to overcome the efficiency of - d shouldbeconsideredacommunity. However,theconcept classicalsimulatedannealingandgeneticalgorithmspro- n of community itself is qualitative: nodes must be more viding competitive accuracy [21]. o connectedwithinitscommunitythanwiththerestofthe In our case, the global variable to optimize is Q as c network,anditsquantificationisstillasubjectofdebate. defined in eq.(1). Thus, the definition of the local vari- : v Some quantitative definitions that came from sociology ables used in the extremal optimization problem should i have been used in recent studies [12], but in general,the be related to the contribution of individual nodes i to X physics community has widely accepted a recent mea- the summation in eq.(1) given a certain partition into ar sure for the community structure based on the concept communities ofmodularityQintroducedbyNewmanandGirvan[13]: q =κ −k a (2) Q=X(err−a2r) (1) i r(i) i r(i) r where κ is the number of links that a node i be- r(i) wheree arethefractionoflinksthatconnecttwonodes longing to a community r has with nodes into the same rr insidethecommunityr,a thefractionoflinksthathave community, and k is the degree of node i. Note that r i one or both vertices inside of the community r, and the Q = 1 q where i refers to all nodes in the network 2LPi i sum extends to all communities r in a given network. given a certain partition into communities and L is the Note that this measure provides a way to determine if a total number of links in the network. Eq.(2) provides a 2 measure that depends on the node degree, and its nor- malizationinvolveallthe links inthe networkafter sum- mation. Re-scaling the local variable q by the degree of i node i we obtain a proper definition for the contribution of node i to the modularity, relative to its own degree and normalized in the interval [-1,1]. Pajek Pajek FIG. 1: Left: Random initialization of the Zachary network q κ λ = i = r(i) −a (3) intotwopartitions,redandgreen. Right: Fivedifferentcom- i ki ki r(i) munities identified as connected components in each parti- tions. Each color definesa different community. Keeping in mind this definition of λ we can compare i the relative contribution of individual nodes to the com- munity structure. We will consider λ as the local vari- i again repeat the process until no changes could improve able involved in the extremal optimization process that it (see Fig. 2). characterizesanindividualnode,fromnowonwewillre- fertoλ asthefitnessofnodeiusingthecommonjargon The application of the algorithm to the Zachary net- i in extremal optimization problems. work provides the optimal modularity value after three The heuristic search we propose to find the optimal recursive iterations. The network is decomposed in four modularity value evolves as follows: communities and the value for the modularity is 0.419, greater than the value 0.381 reported by Newman [14], • Initially, we split the nodes of the whole graph in the value 0.406reportedby Reichardtet al. [23]and the two random partitions having the same number of value0.412reportedbyDonettietal. [24]usingdifferent nodes each one. This splitting creates an initial optimization methods. communities division, where communities are un- The extremal optimization (EO) approach presented derstood as connected components in each parti- here has several technical implementation details, that tion. are relevant for our purposes. In the original EO algo- • At each time step, the system self-organizes by rithm,thenodeselectedisalwaysthenodewiththeworst moving the node with the lower fitness (extremal) λj value. This is a deterministic and fast way to solve from one partition to the other. In principle, each the problem, but it presents some drawbacks: the final movementimpliestherecalculationofthefitnessof result strongly depends on the initialization and there is manynodesbecausetherighthandsideofequation no possibility to escape from local maxima. Instead, we (3) involves the pseudo-global magnitude a . use a probabilistic selection called τ-EO [18], in which r(i) the nodes are ranked according to their fitness values, • The process is repeated until an ”optimal state” and then the node of rank r is selected according to the withamaximumvalue ofQis reached. After that, we delete all the links between both partitions and proceed recursively with every resultant connected component. The process finishes when the modu- larity Q could not be improved [32]. Note that this process is not a bipartitioning of the Pajek First Cut SecoPnajekd Cut Third CutPajek graph as known in computer science [19], because: the Q = 0.3718 Q = 0.4020 Q = 0.4188 number of nodes in each partition is dependent on the evolutionprocessandnotrestrictedtobethesameatthe 0.4 end of the process;andmore importantly, eachpartition arity could contain different connected components (commu- ul d0.3 nities) that when the partitions are disconnected result Mo in several subgraphs. on Let us illustrate the above mentioned heuristics in a urati0.2 g simple case. We will apply it to the well-know Zachary nfi o karate club network [22]. Initially we split the nodes in C two random partitions (see Fig.1 left). Note that the 0.1 numberofinitialcommunities(connectedcomponentsin 0 50 100 150 200 250 300 eachpartition)inthis caseis five(see Fig.1right). After Algorithm Steps that, the self-organization process starts: the node with the ”worst fitness” is selected and moved from its parti- FIG. 2: Top: Network after edge removal at each recursive tion to the other partition, this movement provokes an cut. Bottom: Evolution of the Q value in the at each step avalanche of changes in the fitness of the rest of nodes. of the adaptation process. Separation bars indicate recursive We calculate the new value for the modularity Q, and divisions of thegraph performed at maximum Q. 3 following probability distribution: 1 y Girvan-Newman ctl0.9 Present e P(r)∝r−τ (4) orr d c0.8 aτfonrThdrahaasinlslbdoswoeoemlsunttniotoeuntenwsiecosdarlpkeaessrsoofsrfueonsnmidszietltoiNhvceeatltohompadttaiixmffaipemarplearnov.taaTlciunhheietstihaeoelxibzpstacoataniilnoeinenndgst ertices classifie00..67 um modularity000...468 v m τ ∼1+1/ln(N) [18]. The use of this technique also im- n of 0.5 Maxi0.2 pstleiepsstαheNdenteeerdmedinatotiodnecoidftehtehnautmthbeermoafxsiemlfu-omrgvaanliuzeathioans Fractio0.4 0Av. nu2mber in4ter-com6munit8y edge1s0 little chance to be improved. In practice, we keep track 0.3 ateachstepofthelastmaximumvalueobatinedforQ,if 0 2 4 6 8 10 Average number of inter-community edges per vertex this maximum is not improved in αN steps we stop the search. Usually α is empirically determined balancing FIG.3: Fractionofnodescorrectlyclassifiedusingcomputer- accuracy and efficiency in the algorithm, we use α = 1 generated graphs described in text. Each point is an average allowing as many steps as nodes to improve the current over 100 different networks. Inset: Average of the maximum maximum value of Q. The computational cost involved modularity obtained in each case. inthewholeprocessisO(N2ln2N)whereafactorNlnN is the cost associate to the ranking process, however it can be substantially reduced using heap data structures university e-mail network [11], the C.elegans metabolic [25] for the ranking selection process up to O(N). The network[28],anetworkofusersofthePGPalgorithmfor total cost of the algorithm can then be improved up to secureinformationtransactions[29],andfinally therela- O(N2lnN). tions between authors that shared a paper in cond-mat To test the performance of the algorithm we use first [30]. computer-generated graphs with a known community structure [5]. These graphs have 128 vertices grouped in Network Size QN #comsN QEO #comsEO fourcommunitiesof32vertices. Eachvertexhasonaver- Zachary 34 0.3810 2 0.4188 4 agez edgestoverticesinthesamecommunityandz in out edges to vertices in other communities, keeping an aver- Jazz 198 0.4379 4 0.4452 4 age degree z +z = 16. We generate several graphs C. elegans 453 0.4001 10 0.4342 12 in out using zout values between 0 and 10, and compare the E-mail 1133 0.4796 13 0.5738 15 results of our algorithm with those obtained using the PGP 10680 0.7329 80 0.8459 365 heuristics proposed by Newman [14]. This shows the ca- Cond-Mat 27519 0.6683 302 0.6790 647 pabilities of each algorithm identifying the communities whenthesearemorefuzzyinsidethewholenetwork. Us- TABLEI:Maximummodularityobtainedusingthealgorithm ing the Girvan-Newman algorithm, wich is the reference [14] QN and the extremal optimization algorithm QEO for algorithmforcommunityidentification,the communities differentcomplexnetworks. Itisalso includedthenumberof are well detected until values of z = 6. In contrast, communitiesfoundattheconfigurationwithmaximummod- out our algorithm detects the communities up to z = 8, ularity. out where the community structure still persist but is much moredifficulttoreveal,seeFig.3. Inthisparticularcase In Table I we present the results for the maximum 50 percent of the links are within the community and 50 modularity achieved by our algorithm compared to the percentarelinkswithnodesoutsidethecommunity. This modularity obtained using [14]. The difference in maxi- resultthatcouldseemcontradictoryisnot. Notethatthe mummodularityis upto 15%depending onthe network 50percentoflinkswithnodesoutsidethecommunityare considered. These differences result in a best determina- equally distributed among the rest of communities, and tion of the unknown community structure of the whole then its contribution to the definition of community is network. Thepartitionintocommunitiesisclearlydiffer- deprivedbythenumberofcommunitiesintherestofthe ent as shows the different number of communities found network,in our case three. For this reasonit is expected using both algorithms. to find community structure even in these cases. Note that sincethe coreofthe algorithmis stochastic, For values higher than 8, the average maximum mod- differentrunscouldyieldinprincipledifferentpartitions. ularity rapidly approach the limit Q = 0.208 (see inset We have performed 100 runs of the algorithm for the e- Fig.3), the expected modularity for a random network mail network and for a random network with the same with the same number of links and nodes, as it has been number of links and nodes to check the consistency of shown in [26]. the proposed method. In Fig. 4 we present the results Wehavealsoanalyzedthecommunitystructureofsev- ofthe fractionoftimes acouple ofnodes areclassifiedin eral real networks: the jazz musicians network [27], an the same partition. The community structure is clearly 4 revealedforthee-mailnetworkwhilefortherandomnet- work this structure is inexistent. Recently, Guimer`a and Amaral have obtained similar results by applying simu- lated annealing to find the community structure in the context of metabolic networks [31]. Summarizing,wehavepresentedanextremaloptimiza- tion based algorithm that optimizes the modularity and allows an accurate identification of community structure incomplexnetworks. Theresultsoutperformallprevious algorithms existent in the literature. We thankM.Bogun˜a,L.Danon,A.Diaz-Guileraand R. Guimer`a for helpful comments and suggestions. We FIG.4: Fractionofnodesclassifiedinthesamepartitionover also thank M.E.J. Newman for providing us the cond- 100 realizations of the algorithm. The color of the position (i,j) corresponds to the fraction of times that nodes i and j mat network. This work has been supported by DGES belong to thesame partition. of the Spanish Government Grant No. BFM-2003-08258 and EC-FET Open Project No. IST-2001-33555. [1] S.Strogatz, Nature410, 268 (2001). (2001). [2] R. Albert and A.-L. Barab´asi, Rev. Mod. Phys. 74, 47 [19] S. Boettcher and A. G. Percus, Phys Rev E 64, 026114 (2002). (2001). [3] S.Dorogovtsev and J. Mendes, Adv.Phys. 51 (2002). [20] P.BakandK.Sneppen,Phys.Rev.Lett.71,4083(1993). [4] M. E. J. Newman, SIAMReview 45, 167 (2003). [21] S.BoettcherandA.G.Percus,ArtificialIntelligence119, [5] M. Girvan and M. Newman, Proc. Natl. Acad. Sci. 99, 275 (2000). 7821 (2002). [22] W.W.Zachary,JournalofAnthropologicalResearch 33 [6] J.-P. Eckmann and E. Moses, Proc. Natl. Acad.Sci. 99, (1977). 5825 (2002). [23] J. Reichardt and S. Bornholdt, Phys. Rev. Lett 93, [7] K. Eriksen, I. Simonsen, S. Maslov, and K. Sneppen, 218701 (2004). Phys.Rev.Lett. 90, 148701 (2003). [24] L. Donetti and M. Munoz, J. Stat. Mech.: Theor. Exp. [8] P.Holme,M.Huss,andH.Jeong,Bioinformatics19,532 p. 10012 (2004). (2003). [25] A.V.Aho,J.D.Ullman,andJ.E.Hopcroft,DataStruc- [9] A.Arenas,L.Danon,A.Diaz-Guilera,P.M.Gleiser,and tures and Algorithms (Addison-Wesley,1983). R.Guimer`a, Eur. Phys.J. B 38, 373 (2004). [26] R.Guimer´a,M.Sales-Pardo,andL.A.N.Amaral,Phys. [10] M. E. J. Newman, Eur. Phys.J. B 38, 321 (2004). Rev. E70, 025101 (2004). [11] R. Guimer`a, L. Danon, A. D´ıaz-Guilera, F. Giralt, and [27] P. Gleiser and L. Danon, Advancesin Complex Systems A.Arenas, Phys. Rev.E 68, 065103(R) (2003). 6, 565 (2003). [12] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and [28] H.Jeong, B.Tombor, R.Albert,Z.N.Oltvai,andA.-L. D.Parisi, Proc. Natl. Acad.Sci. 101, 2658 (2004). Barab´asi, Nature 407, 651 (2000). [13] M. E. J. Newman and M. Girvan, Phys. Rev. E 69, [29] X. Guardiola, R. Guimera, A. Arenas, A. Diaz-Guilera, 026113 (2004). and L. A.N. Amaral, cond-mat/0206240 (2002). [14] M. E. J. Newman, Phys.Rev.E 69, 066133 (2004). [30] M. E. J. Newman, Phys.Rev.E 64, 016131 (2001). [15] D. Sherrington and S. Kirkpatrick, Phys. Rev. Lett. 35, [31] R. Guimera and L. Amaral, Nature (2005), in press. 1792 (1975). [32] The value of Q always refers to the whole network i.e. [16] S. Kirkpatrick, C. Gilatt, and M. Vecchi, Science 220 is the sum over all the communities. At a certain mo- (1983). ment more subdivisions into communities will necessar- [17] D. E. Goldberg, Genetic Algorithms in Search, Opti- ily decrease Q because the limit of decomposition is a mization, and Machine Learning (Addison-Wesley Pro- community per node whose value of Q is negative. fessional, 1989). [18] S.BoettcherandA.G.Percus,Phys.Rev.Lett.86,5211

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.