Table Of ContentCommunity detection in complex networks using Extremal Optimization
Jordi Duch1 and Alex Arenas1
1Departament d’Enginyeria Inform`atica i Matem`atiques,
Universitat Rovira i Virgili, 43007 Tarragona, Spain
(Dated: February 2, 2008)
5 We propose a novel method to find the community structure in complex networks based on an
0 extremaloptimization ofthevalueofmodularity. Themethodoutperformstheoptimalmodularity
0 found by the existing algorithms in the literature. We present the results of the algorithm for
2
computer simulated and real networks and compare them with other approaches. The efficiency
n andaccuracyofthemethodmakeitfeasibletobeusedfortheaccurateidentificationofcommunity
a structurein large complex networks.
J
PACSnumbers:
6
1
The description of the structure of complex networks certain mesoscopic description of the graph in terms of
]
n has been one of the focus of attention of the physicist’s communitiesismoreorlessaccurate. Thelargerthe val-
n community in the recent years. The levels of descrip- ues of Q the most accurate a partition into communities
-
tion range from the microscopic (degree, clustering co- is.
s
i efficient, centrality measures, etc., of individual nodes) The search for the optimal (largest) modularity value
d
. to the macroscopic description in terms of statistical is a NP-hard problem due to the fact that the space of
t
a properties of the whole network (degree distribution, to- possible partitions grows faster than any power of the
m talclusteringcoefficient,degree-degreecorrelations,etc.) system size. For this reason, a heuristic search strategy
[1,2,3,4]. Betweenthesetwoextremesthereisa”meso- is mandatory to restrict the search space while preserv-
-
d scopic” description of networks that tries to explain its ing the optimization goal [14]. Indeed, it is possible to
n community structure. The general notion of community relate the current optimization problem for Q with clas-
o
structureincomplexnetworkswasfirstpointedoutinthe sical problems in statistical physics, e.g. the spin glass
c
physics literature by Girvan and Newman [5], and refers problem of finding the ground state energy [15], where
[
to the fact that nodes in many real networks appear to algorithms inspired in natural optimization processes as
1 group in subgraphs in which the density of internal con- simulatedannealing[16]andgeneticalgorithms[17]have
v
nections is larger than the connections with the rest of been successfully used.
8
nodes in the network. In this Letter, we propose a new divisive algorithm
6
3 The community structure has been empirically found thatoptimizesthemodularityQusinganheuristicsearch
1 inmanyrealtechnological,biologicalandsocialnetworks basedontheExtremalOptimization(EO)algorithmpro-
0 [6,7,8,9,10]anditsemergenceseemstobeattheheart posed by Boettcher and Percus [18, 19]. This algorithm
5 of the network formation process [11]. isinspiredinturnintheevolutionmodelofBak-Sneppen
0 The existing methods intended to devise the commu- [20], and basically operates optimizing a global variable
/
t nity structure in complex networks have been recently by improving extremal local variables that involve co-
a
reviewed in [10]. All these methods require a definition evolutionary avalanches. The performance of EO algo-
m
ofcommunitythatimposesthelimituptowhichagroup rithms have been shown to overcome the efficiency of
-
d shouldbeconsideredacommunity. However,theconcept classicalsimulatedannealingandgeneticalgorithmspro-
n of community itself is qualitative: nodes must be more viding competitive accuracy [21].
o connectedwithinitscommunitythanwiththerestofthe In our case, the global variable to optimize is Q as
c network,anditsquantificationisstillasubjectofdebate. defined in eq.(1). Thus, the definition of the local vari-
:
v Some quantitative definitions that came from sociology ables used in the extremal optimization problem should
i have been used in recent studies [12], but in general,the be related to the contribution of individual nodes i to
X
physics community has widely accepted a recent mea- the summation in eq.(1) given a certain partition into
ar sure for the community structure based on the concept communities
ofmodularityQintroducedbyNewmanandGirvan[13]:
q =κ −k a (2)
Q=X(err−a2r) (1) i r(i) i r(i)
r
where κ is the number of links that a node i be-
r(i)
wheree arethefractionoflinksthatconnecttwonodes longing to a community r has with nodes into the same
rr
insidethecommunityr,a thefractionoflinksthathave community, and k is the degree of node i. Note that
r i
one or both vertices inside of the community r, and the Q = 1 q where i refers to all nodes in the network
2LPi i
sum extends to all communities r in a given network. given a certain partition into communities and L is the
Note that this measure provides a way to determine if a total number of links in the network. Eq.(2) provides a
2
measure that depends on the node degree, and its nor-
malizationinvolveallthe links inthe networkafter sum-
mation. Re-scaling the local variable q by the degree of
i
node i we obtain a proper definition for the contribution
of node i to the modularity, relative to its own degree
and normalized in the interval [-1,1].
Pajek Pajek
FIG. 1: Left: Random initialization of the Zachary network
q κ
λ = i = r(i) −a (3) intotwopartitions,redandgreen. Right: Fivedifferentcom-
i ki ki r(i) munities identified as connected components in each parti-
tions. Each color definesa different community.
Keeping in mind this definition of λ we can compare
i
the relative contribution of individual nodes to the com-
munity structure. We will consider λ as the local vari-
i
again repeat the process until no changes could improve
able involved in the extremal optimization process that
it (see Fig. 2).
characterizesanindividualnode,fromnowonwewillre-
fertoλ asthefitnessofnodeiusingthecommonjargon The application of the algorithm to the Zachary net-
i
in extremal optimization problems. work provides the optimal modularity value after three
The heuristic search we propose to find the optimal recursive iterations. The network is decomposed in four
modularity value evolves as follows: communities and the value for the modularity is 0.419,
greater than the value 0.381 reported by Newman [14],
• Initially, we split the nodes of the whole graph in
the value 0.406reportedby Reichardtet al. [23]and the
two random partitions having the same number of
value0.412reportedbyDonettietal. [24]usingdifferent
nodes each one. This splitting creates an initial
optimization methods.
communities division, where communities are un-
The extremal optimization (EO) approach presented
derstood as connected components in each parti-
here has several technical implementation details, that
tion.
are relevant for our purposes. In the original EO algo-
• At each time step, the system self-organizes by rithm,thenodeselectedisalwaysthenodewiththeworst
moving the node with the lower fitness (extremal) λj value. This is a deterministic and fast way to solve
from one partition to the other. In principle, each the problem, but it presents some drawbacks: the final
movementimpliestherecalculationofthefitnessof result strongly depends on the initialization and there is
manynodesbecausetherighthandsideofequation no possibility to escape from local maxima. Instead, we
(3) involves the pseudo-global magnitude a . use a probabilistic selection called τ-EO [18], in which
r(i)
the nodes are ranked according to their fitness values,
• The process is repeated until an ”optimal state” and then the node of rank r is selected according to the
withamaximumvalue ofQis reached. After that,
we delete all the links between both partitions and
proceed recursively with every resultant connected
component. The process finishes when the modu-
larity Q could not be improved [32].
Note that this process is not a bipartitioning of the
Pajek First Cut SecoPnajekd Cut Third CutPajek
graph as known in computer science [19], because: the Q = 0.3718 Q = 0.4020 Q = 0.4188
number of nodes in each partition is dependent on the
evolutionprocessandnotrestrictedtobethesameatthe 0.4
end of the process;andmore importantly, eachpartition arity
could contain different connected components (commu- ul
d0.3
nities) that when the partitions are disconnected result Mo
in several subgraphs. on
Let us illustrate the above mentioned heuristics in a urati0.2
g
simple case. We will apply it to the well-know Zachary nfi
o
karate club network [22]. Initially we split the nodes in C
two random partitions (see Fig.1 left). Note that the 0.1
numberofinitialcommunities(connectedcomponentsin
0 50 100 150 200 250 300
eachpartition)inthis caseis five(see Fig.1right). After Algorithm Steps
that, the self-organization process starts: the node with
the ”worst fitness” is selected and moved from its parti- FIG. 2: Top: Network after edge removal at each recursive
tion to the other partition, this movement provokes an cut. Bottom: Evolution of the Q value in the at each step
avalanche of changes in the fitness of the rest of nodes. of the adaptation process. Separation bars indicate recursive
We calculate the new value for the modularity Q, and divisions of thegraph performed at maximum Q.
3
following probability distribution: 1
y Girvan-Newman
ctl0.9 Present
e
P(r)∝r−τ (4) orr
d c0.8
aτfonrThdrahaasinlslbdoswoeoemlsunttniotoeuntenwsiecosdarlpkeaessrsoofsrfueonsnmidszietltoiNhvceeatltohompadttaiixmffaipemarplearnov.taaTlciunhheietstihaeoelxibzpstacoataniilnoeinenndgst ertices classifie00..67 um modularity000...468
v m
τ ∼1+1/ln(N) [18]. The use of this technique also im- n of 0.5 Maxi0.2
pstleiepsstαheNdenteeerdmedinatotiodnecoidftehtehnautmthbeermoafxsiemlfu-omrgvaanliuzeathioans Fractio0.4 0Av. nu2mber in4ter-com6munit8y edge1s0
little chance to be improved. In practice, we keep track
0.3
ateachstepofthelastmaximumvalueobatinedforQ,if 0 2 4 6 8 10
Average number of inter-community edges per vertex
this maximum is not improved in αN steps we stop the
search. Usually α is empirically determined balancing
FIG.3: Fractionofnodescorrectlyclassifiedusingcomputer-
accuracy and efficiency in the algorithm, we use α = 1
generated graphs described in text. Each point is an average
allowing as many steps as nodes to improve the current
over 100 different networks. Inset: Average of the maximum
maximum value of Q. The computational cost involved modularity obtained in each case.
inthewholeprocessisO(N2ln2N)whereafactorNlnN
is the cost associate to the ranking process, however it
can be substantially reduced using heap data structures university e-mail network [11], the C.elegans metabolic
[25] for the ranking selection process up to O(N). The network[28],anetworkofusersofthePGPalgorithmfor
total cost of the algorithm can then be improved up to secureinformationtransactions[29],andfinally therela-
O(N2lnN). tions between authors that shared a paper in cond-mat
To test the performance of the algorithm we use first [30].
computer-generated graphs with a known community
structure [5]. These graphs have 128 vertices grouped in
Network Size QN #comsN QEO #comsEO
fourcommunitiesof32vertices. Eachvertexhasonaver-
Zachary 34 0.3810 2 0.4188 4
agez edgestoverticesinthesamecommunityandz
in out
edges to vertices in other communities, keeping an aver- Jazz 198 0.4379 4 0.4452 4
age degree z +z = 16. We generate several graphs C. elegans 453 0.4001 10 0.4342 12
in out
using zout values between 0 and 10, and compare the E-mail 1133 0.4796 13 0.5738 15
results of our algorithm with those obtained using the
PGP 10680 0.7329 80 0.8459 365
heuristics proposed by Newman [14]. This shows the ca-
Cond-Mat 27519 0.6683 302 0.6790 647
pabilities of each algorithm identifying the communities
whenthesearemorefuzzyinsidethewholenetwork. Us- TABLEI:Maximummodularityobtainedusingthealgorithm
ing the Girvan-Newman algorithm, wich is the reference [14] QN and the extremal optimization algorithm QEO for
algorithmforcommunityidentification,the communities differentcomplexnetworks. Itisalso includedthenumberof
are well detected until values of z = 6. In contrast, communitiesfoundattheconfigurationwithmaximummod-
out
our algorithm detects the communities up to z = 8, ularity.
out
where the community structure still persist but is much
moredifficulttoreveal,seeFig.3. Inthisparticularcase In Table I we present the results for the maximum
50 percent of the links are within the community and 50 modularity achieved by our algorithm compared to the
percentarelinkswithnodesoutsidethecommunity. This modularity obtained using [14]. The difference in maxi-
resultthatcouldseemcontradictoryisnot. Notethatthe mummodularityis upto 15%depending onthe network
50percentoflinkswithnodesoutsidethecommunityare considered. These differences result in a best determina-
equally distributed among the rest of communities, and tion of the unknown community structure of the whole
then its contribution to the definition of community is network. Thepartitionintocommunitiesisclearlydiffer-
deprivedbythenumberofcommunitiesintherestofthe ent as shows the different number of communities found
network,in our case three. For this reasonit is expected using both algorithms.
to find community structure even in these cases. Note that sincethe coreofthe algorithmis stochastic,
For values higher than 8, the average maximum mod- differentrunscouldyieldinprincipledifferentpartitions.
ularity rapidly approach the limit Q = 0.208 (see inset We have performed 100 runs of the algorithm for the e-
Fig.3), the expected modularity for a random network mail network and for a random network with the same
with the same number of links and nodes, as it has been number of links and nodes to check the consistency of
shown in [26]. the proposed method. In Fig. 4 we present the results
Wehavealsoanalyzedthecommunitystructureofsev- ofthe fractionoftimes acouple ofnodes areclassifiedin
eral real networks: the jazz musicians network [27], an the same partition. The community structure is clearly
4
revealedforthee-mailnetworkwhilefortherandomnet-
work this structure is inexistent. Recently, Guimer`a and
Amaral have obtained similar results by applying simu-
lated annealing to find the community structure in the
context of metabolic networks [31].
Summarizing,wehavepresentedanextremaloptimiza-
tion based algorithm that optimizes the modularity and
allows an accurate identification of community structure
incomplexnetworks. Theresultsoutperformallprevious
algorithms existent in the literature.
We thankM.Bogun˜a,L.Danon,A.Diaz-Guileraand
R. Guimer`a for helpful comments and suggestions. We FIG.4: Fractionofnodesclassifiedinthesamepartitionover
also thank M.E.J. Newman for providing us the cond- 100 realizations of the algorithm. The color of the position
(i,j) corresponds to the fraction of times that nodes i and j
mat network. This work has been supported by DGES
belong to thesame partition.
of the Spanish Government Grant No. BFM-2003-08258
and EC-FET Open Project No. IST-2001-33555.
[1] S.Strogatz, Nature410, 268 (2001). (2001).
[2] R. Albert and A.-L. Barab´asi, Rev. Mod. Phys. 74, 47 [19] S. Boettcher and A. G. Percus, Phys Rev E 64, 026114
(2002). (2001).
[3] S.Dorogovtsev and J. Mendes, Adv.Phys. 51 (2002). [20] P.BakandK.Sneppen,Phys.Rev.Lett.71,4083(1993).
[4] M. E. J. Newman, SIAMReview 45, 167 (2003). [21] S.BoettcherandA.G.Percus,ArtificialIntelligence119,
[5] M. Girvan and M. Newman, Proc. Natl. Acad. Sci. 99, 275 (2000).
7821 (2002). [22] W.W.Zachary,JournalofAnthropologicalResearch 33
[6] J.-P. Eckmann and E. Moses, Proc. Natl. Acad.Sci. 99, (1977).
5825 (2002). [23] J. Reichardt and S. Bornholdt, Phys. Rev. Lett 93,
[7] K. Eriksen, I. Simonsen, S. Maslov, and K. Sneppen, 218701 (2004).
Phys.Rev.Lett. 90, 148701 (2003). [24] L. Donetti and M. Munoz, J. Stat. Mech.: Theor. Exp.
[8] P.Holme,M.Huss,andH.Jeong,Bioinformatics19,532 p. 10012 (2004).
(2003). [25] A.V.Aho,J.D.Ullman,andJ.E.Hopcroft,DataStruc-
[9] A.Arenas,L.Danon,A.Diaz-Guilera,P.M.Gleiser,and tures and Algorithms (Addison-Wesley,1983).
R.Guimer`a, Eur. Phys.J. B 38, 373 (2004). [26] R.Guimer´a,M.Sales-Pardo,andL.A.N.Amaral,Phys.
[10] M. E. J. Newman, Eur. Phys.J. B 38, 321 (2004). Rev. E70, 025101 (2004).
[11] R. Guimer`a, L. Danon, A. D´ıaz-Guilera, F. Giralt, and [27] P. Gleiser and L. Danon, Advancesin Complex Systems
A.Arenas, Phys. Rev.E 68, 065103(R) (2003). 6, 565 (2003).
[12] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and [28] H.Jeong, B.Tombor, R.Albert,Z.N.Oltvai,andA.-L.
D.Parisi, Proc. Natl. Acad.Sci. 101, 2658 (2004). Barab´asi, Nature 407, 651 (2000).
[13] M. E. J. Newman and M. Girvan, Phys. Rev. E 69, [29] X. Guardiola, R. Guimera, A. Arenas, A. Diaz-Guilera,
026113 (2004). and L. A.N. Amaral, cond-mat/0206240 (2002).
[14] M. E. J. Newman, Phys.Rev.E 69, 066133 (2004). [30] M. E. J. Newman, Phys.Rev.E 64, 016131 (2001).
[15] D. Sherrington and S. Kirkpatrick, Phys. Rev. Lett. 35, [31] R. Guimera and L. Amaral, Nature (2005), in press.
1792 (1975). [32] The value of Q always refers to the whole network i.e.
[16] S. Kirkpatrick, C. Gilatt, and M. Vecchi, Science 220 is the sum over all the communities. At a certain mo-
(1983). ment more subdivisions into communities will necessar-
[17] D. E. Goldberg, Genetic Algorithms in Search, Opti- ily decrease Q because the limit of decomposition is a
mization, and Machine Learning (Addison-Wesley Pro- community per node whose value of Q is negative.
fessional, 1989).
[18] S.BoettcherandA.G.Percus,Phys.Rev.Lett.86,5211