Surprise maximization reveals the community structure of complex networks

Rodrigo Aldecoa & Ignacio Marín

Instituto de Biomedicina de Valencia, Consejo Superior de Investigaciones Científicas (IBV-CSIC), Calle Jaime Roig 11, 46010 Valencia, Spain.

Subject areas: Bioinformatics, Data integration, Applied mathematics, Statistics

Received 25 September 2012. Accepted 28 December 2012. Published 14 January 2013.

Correspondence and requests for materials should be addressed to I.M. ([email protected]).

How to determine the community structure of complex networks is an open question. It is critical to establish the best strategies for community detection in networks of unknown structure. Here, using standard synthetic benchmarks, we show that none of the algorithms hitherto developed for community structure characterization perform optimally. Significantly, evaluating the results according to their modularity, the most popular measure of the quality of a partition, systematically provides mistaken solutions. However, a novel quality function, called Surprise, can be used to elucidate which is the optimal division into communities. Consequently, we show that the best strategy to find the community structure of all the networks examined involves choosing among the solutions provided by multiple algorithms the one with the highest Surprise value. We conclude that Surprise maximization precisely reveals the community structure of complex networks.

The analysis of networks has profound implications in very different fields, from sociology to biology [1–5]. One of the most interesting features of a network is its community structure [6,7]. Communities are groups of nodes that are more strongly or frequently connected among themselves than with the other nodes of the network. The best way to establish the communities present in a network is an open problem. Two related questions are still unsolved. First, which is the best algorithm to characterize networks of known community structure. Second, how to evaluate algorithm performance when the community structure is unknown.
The first question requires testing the algorithms in benchmarks composed of complex networks where the community structure is established a priori. In these benchmarks, it has been found that algorithm performance depends on how much the density of intracommunity links differs from the average density of links in the network. In addition, it has been determined that most algorithms perform well when the networks are small and the communities have similar sizes, but many perform quite poorly in benchmarks composed of large networks with many communities of heterogeneous sizes [8–18]. Thus, benchmarks with the latter features have become crucial to rank algorithm performances. Among them, the Lancichinetti-Fortunato-Radicchi (LFR) benchmarks [11–18] and the Relaxed Caveman (RC) benchmarks [14,18–20] have proven particularly useful. Both benchmarks pose a stern test for algorithms that deal poorly with the presence of many communities, of small communities or of a mixture of communities of different sizes (see e.g. refs. 11, 13 and 14).

The second question, how to determine the best performance when the community structure is unknown, involves devising an independent measure of the quality of a partition into communities that can be reliably applied to any type of network. The first and still today most popular such measure was called modularity [21] (often abbreviated as Q). Modularity compares the number of links within each community with the expected number of links in a random graph of the same size and same distribution of node degrees and then adds the differences between expected and observed values for all the communities. It was proposed that the optimal partition of a network could be found by maximizing Q [21]. However, it was later determined that modularity-based evaluations are often incorrect when small communities are present in the network, i.e. Q has a resolution limit [22]. Several other works have found additional, subtle problems caused by using modularity maximization to determine network community structure [17,23–26]. All these results suggest that using Q provides incorrect answers in many cases.
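To make the verbal description of modularity concrete, here is a minimal Python sketch. It assumes the standard Newman-Girvan form Q = sum over communities of [l_c/m - (d_c/2m)^2], where m is the total number of links, l_c the number of links inside community c and d_c the summed degree of its nodes. The function name and the edge-list and dictionary inputs are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Newman-Girvan modularity Q of a partition (illustrative sketch).

    edges: iterable of (u, v) pairs, each undirected link listed once.
    membership: dict mapping every node to a community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        raise ValueError("the network has no links")
    intra = defaultdict(int)   # observed links inside each community
    degree = defaultdict(int)  # summed node degree of each community
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    # Observed fraction of intracommunity links minus the fraction expected
    # in a random graph with the same degrees, summed over communities.
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy network: two triangles joined by a single link.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(toy_edges, two_triangles))  # 5/14, about 0.357
```

Because each community contributes its own term independently, the score depends only on links and degrees, not on how many nodes are needed to reach a given link density.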
We recently suggested an alternative global measure of performance, which we called Surprise [14]. Surprise assumes as a null model that links between nodes emerge randomly. It then evaluates the departure of the observed partition from the expected distribution of nodes and links into communities given that null model. To do so, it uses the following cumulative hypergeometric distribution:

$$ S = -\log \sum_{j=p}^{\min(M,n)} \frac{\binom{M}{j}\binom{F-M}{n-j}}{\binom{F}{n}} \qquad (1) $$

where F is the maximum possible number of links in a network ([k^2 − k]/2, k being the number of units), n is the observed number of links, M is the maximum possible number of intracommunity links for a given partition, and p is the total number of intracommunity links observed in that partition [14]. Using a cumulative hypergeometric distribution allows calculating the exact probability of the distribution of links and nodes in the communities defined for the network by a given partition. Thus, S measures how unlikely (or "surprising", hence the name of the parameter) that distribution is. In previous studies, we showed that Surprise improved on modularity in standard benchmarks and that choosing algorithms with high S values leads to accurate community structure characterization [14,18]. Although these results were encouraging, whether S maximization could be used to obtain optimal partitions was not rigorously tested. This was due to the fact that Surprise values were estimated from the partitions provided by just a few algorithms. Given that other algorithms could provide even higher S values, it was unclear how optimal these results were.

Here, we test the best strategies currently available to characterize the structure of complex networks and we compare them with the results provided by Surprise maximization in both LFR and RC benchmarks. We first show that none among a large number of state-of-the-art algorithms works consistently well in all these complex benchmarks. Particularly, all modularity-based heuristics behave poorly. Also, we demonstrate that evaluating the performance of an algorithm using modularity is incorrect. We then show that a simple meta-algorithm, which consists in choosing in each network the algorithm that maximizes Surprise, very efficiently determines the community structure of all the networks tested. This method clearly performs better than any of the algorithms devised so far. We conclude that Surprise maximization is the strategy of choice for community characterization in complex networks.

SCIENTIFIC REPORTS | 3:1060 | DOI: 10.1038/srep01060 | www.nature.com/scientificreports

Results

In order to determine the performance of different algorithms for community structure characterization, we explored two standard benchmarks, an LFR benchmark with 5000 units and an RC benchmark with 512 units (see Methods). Variation of Information (VI) was used to determine the degree of congruence between the partitions into communities suggested by 18 different algorithms and the real community structure present in the networks. A perfect congruence corresponds to a value VI = 0. Figures 1a and 1d display the general results obtained in the two benchmarks. A sharp VI increase was found when the community structure was weakened by highly increasing the number of intercommunity links, as occurs when the mixing parameter (μ) of the LFR benchmark has values above 0.7 or the rewiring parameter R of the RC benchmarks is higher than 50% (see also Methods for the precise definitions of μ and R). These results mean that, above μ = 0.7 or R = 50%, the community structure originally present in the networks was substantially altered. In such cases, we could not determine whether the partitions suggested by the algorithms were correct or not: there would not be a known structure with which to compare. Thus, we decided to restrict our subsequent analyses to the LFR networks with 0.1 ≤ μ ≤ 0.7 (100 realizations per μ value, giving a total of 700 networks) and the RC networks with 10% ≤ R ≤ 50% (again, 100 realizations per R value, for a total of 500 different networks). These conditions generate some community structures that are very difficult to detect (Figure 1).

Figure 1 | Global performance of the algorithms. (a) Behavior of the algorithms in the LFR benchmark. To obtain this figure, the algorithms were first ordered according to the VI results obtained for each μ condition. Then, we plotted the results for the algorithm with the best VI value (black line, indicated with "1"), the average of the top five algorithms (red line), average of the top ten ones (blue line) or average for all the 18 algorithms (green line). The grey region corresponds to the values of μ (0.1–0.7) chosen to perform the main comparative analyses (see text). Beyond that region, even the best algorithms obtain VI values considerably higher than zero, meaning that the original structure of the network has been significantly modified by the increase in intercommunity links. (b) An example showing the five largest communities in an LFR network (5000 units) when μ = 0.1. Nodes are distributed into two dimensions with a spring-embedded algorithm [46] and drawn using Cytoscape [47]. Communities are well-isolated groups. (c) The five largest communities when μ = 0.7. They are barely distinguishable in this representation because the mixing of links was quite extreme. However, several algorithms were still often able to detect these fuzzy communities. (d)–(f) The same results for the RC benchmark (512 units). Panel e depicts the five largest communities when R = 10% and panel f the same communities when R = 50%. Again, notice in panel d the sharp increase in VI values when R > 50%. An extreme degree of superimposition among communities is observed already when R = 50% (f). In the LFR benchmark, the rapid increase in VI values when the mixing parameter goes from μ = 0.7 to 0.8 (panel a) is explained by all communities being of similar sizes. Therefore, they are destroyed at about the same time. On the contrary, the more progressive increase in VI when R grows, which we observed in panel d, is due to the heterogeneous sizes of the communities present in that benchmark, which break down at different times.

Figure 2 summarizes the individual performance of the algorithms according to three global measures of partition quality. The first one is VI, the gold standard for algorithm performance in these benchmarks. The other two, already mentioned above, are Surprise (S) and modularity (Q). The performance values measured according to the VI scores shown in Figure 2 indicate two very important facts. On one hand, none of the algorithms was the best in all LFR or in all RC networks. On the other hand, the best algorithms in LFR networks often performed poorly in RC networks, and vice versa (see e.g. the results of RB, LPA or RNSC in Figure 2). This can be rigorously shown by ordering within each benchmark the algorithms according to their performance, assigning a rank, from best to worst, and comparing the ranks in both benchmarks. We found that Kendall's non-parametric correlation coefficient for these ranks was very weak, just τ = 0.31 (p = 0.04, one-tailed comparison). We conclude that using single algorithms for community characterization is inadvisable, given that their performance is strongly dependent on the particular structure of the network.

Figure 2 | Performance of the algorithms according to Variation of Information (VI), Surprise (S) and Modularity (Q) in LFR and RC benchmarks. Average performance and standard errors of the mean are shown. Performance values were obtained by the following method: 1) the VI, S or Q values of the partitions provided by the 18 algorithms in each of the networks (i.e. 700 values for LFR benchmarks, 500 values in RC benchmarks) were established; 2) for each network, the algorithms were assigned a rank according to their performance (1 = optimal, 18 = worst); identical ranks were given to tied algorithms (i.e. the ranks that would correspond to each of them were summed up and then divided by the number of tied algorithms); and 3) performance was calculated as 18 – average rank, meaning that 17 is the maximum possible value, obtained by an algorithm that outperforms the rest in all networks, and 0 corresponds to being the worst in all networks.

If we focus now on the Surprise (S) and modularity (Q) results shown in Figure 2, another two striking facts become apparent. First, there was a very strong correlation between the performance of the algorithms according to VI and according to S. Kendall's correlation coefficient for the ranks of the performances of the algorithms ordered according to VI and to S values is τ = 0.91 in the LFR benchmarks (p = 4.9 × 10^-11, one-tailed comparison) and τ = 0.83 in the RC benchmark (p = 1.4 × 10^-8, one-tailed test). These results demonstrate that S is an excellent measure of the global quality of a division into communities, confirming and extending the conclusions of one of our previous works [14]. Second, the performance of the algorithms evaluated using Q only weakly correlated with their performance according to VI in the LFR benchmarks (Kendall's τ_LFR = 0.29, p = 0.048, one-tailed test) and these two measures did not significantly correlate in the RC benchmarks (τ_RC = 0.27, p = 0.066, again one-tailed test). These results indicate that evaluating the quality of a partition according to its modularity is inappropriate. It was therefore logical to find that both the algorithms devised to maximize Q (Blondel, EO, MLGC, MSG+VM and CNM [27–31]) and the algorithms that use Q to evaluate the quality of their partitions (Walktrap, DM [32,33]) were poor performers (Figure 2).

Figure 3 | A simple meta-algorithm based on Surprise maximization improves over all known community detection algorithms. (a) Performances (calculated as in Figure 2) for all the algorithms are compared in both the LFR and the RC benchmarks with the performance of a strategy that consists in picking the algorithm that provides the highest S value (S_max). (b) For the S_max strategy, the average VI values for the 1200 networks analyzed are very close to zero, i.e. an almost optimal performance.

If indeed maximization of Surprise is an optimal strategy for community characterization, as its strong correlation with VI suggests, then it should be possible to improve on the results of any single algorithm by simply picking among many algorithms the one that generates the highest S value (S_max) in each particular network. Also, this S-maximization strategy should provide VI values very close to zero in our benchmarks. These two expectations are fulfilled, as shown in Figure 3. The top panel (Figure 3a) demonstrates that choosing in each particular case the algorithm with the highest S value is better than selecting any of the state-of-the-art algorithms tested. It is remarkable that the S_max values in Figure 3a derived from the combined results of as many as 7 algorithms (CPM, Infomap, RB, RN, RNSC, SCluster and UVCluster [16,20,34–38]). In addition, Figure 3b indicates that the sum of the average VI values obtained using S_max in the 1200 networks analyzed (with μ = 0.1–0.7 and R = 10–50%) were just slightly above zero, i.e. almost optimal. The average values were 0.002 ± 0.000 in the LFR benchmarks and 0.100 ± 0.007 in the RC benchmarks. We may ask why these VI values are not exactly zero, given that VI = 0 would be expected for a perfect global measure. We detected two reasons for this minor discrepancy. The first reason was that, in some cases (mainly in the RC benchmark with R = 50%), the available algorithms failed to obtain the highest possible S values.
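As a rough illustration of Formula (1) and of the S_max selection strategy, the following Python sketch computes Surprise with exact integer binomials and then picks, among several candidate partitions, the one with the highest S. The function names, the input format and the use of the natural logarithm are assumptions made for this example; the text only writes "log", and this is not the authors' implementation.

```python
import math
from collections import Counter


def surprise(edges, membership):
    """Surprise S of a partition, following Formula (1) (illustrative sketch).

    edges: iterable of (u, v) pairs, each undirected link listed once.
    membership: dict mapping every node (isolated ones too) to a community.
    Natural logarithm is an assumption; the base is not stated in the text.
    """
    edges = list(edges)
    k = len(membership)                      # number of units
    F = k * (k - 1) // 2                     # maximum possible links
    n = len(edges)                           # observed links
    sizes = Counter(membership.values())
    M = sum(s * (s - 1) // 2 for s in sizes.values())  # max intracommunity links
    p = sum(1 for u, v in edges if membership[u] == membership[v])
    # Upper tail of the hypergeometric distribution, kept as exact integers.
    tail = sum(math.comb(M, j) * math.comb(F - M, n - j)
               for j in range(p, min(M, n) + 1))
    return math.log(math.comb(F, n)) - math.log(tail)


def best_by_surprise(edges, candidates):
    """Return (name, S) of the candidate partition with the highest S.

    candidates: dict mapping a label (e.g. an algorithm name) to a membership
    dict. The labels are hypothetical placeholders.
    """
    scores = {name: surprise(edges, part) for name, part in candidates.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Toy network: two triangles joined by a single link.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
candidates = {
    "two_triangles": {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"},
    "single_block": {node: "a" for node in range(6)},
}
print(best_by_surprise(toy_edges, candidates))
```

Exact integer arithmetic keeps the sketch simple but is slow for large networks; a practical implementation would likely work with log-probabilities instead.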
We found that the S values expected assuming that the original community structure of the network was intact (S_orig) were often higher than S_max (Table 1). This obviously means that these algorithms did not find the community structure that maximizes S. That structure could still be the original one (which indeed has the highest S value observed so far in our analyses) or some alternate structure, but clearly not any of those found by the algorithms, which had lower S values. The second reason observed was the presence of minor changes in community structure that occurred in some networks when intercommunity links increased. Thus, the exact original structure of the network was not present anymore. This was deduced from the fact that S_max values were sometimes slightly higher than S_orig both in the LFR benchmarks with μ = 0.6–0.7 and the RC benchmarks with R = 10–40% (Table 1). These results suggested that the algorithms obtained optimal partitions, but they were a bit different from the original ones. To establish that fact, we examined the 23 cases where S_max > S_orig in the RC benchmarks with R = 10%. We found that the partitions with S_max values generally differed from the original structures in one of the smallest communities having lost single units (Supplementary Table 1; see example in Figure 4). Significantly, in those 23 networks we always found just one partition with S_max > S_orig and several algorithms often recovered exactly that same partition (Supplementary Table 1). All these results indicate that real, small changes in community structure occurred in those networks, suggesting that the partitions with S_max > S_orig values were indeed optimal. From the data in Table 1, we also obtain an indirect validation of our decision of using the LFR benchmarks with μ ≤ 0.7 and the RC benchmarks with R ≤ 50% to evaluate algorithm performance. As shown in that Table, up to those limits, the S_max and S_orig values are not significantly different, while, beyond those limits, very significant differences are found. This means that the original structures, or structures almost identical to them, were indeed present in the networks examined to generate the results summarized in Figures 2 and 3, which precisely was the only condition required for a reliable measure of algorithm performance.

Figure 4 | When VI and Surprise maximum values do not coincide, the difference is often due to minimal changes in the community structure of the network. This is an example from the RC benchmark where S_max > S_orig (see text). (a) Original structure. (b) After R = 10% has been applied. S_max is obtained when a single unit (square) is classified as being isolated from its original 4-node community (highlighted). As shown in panel b, the critical unit has become almost fully separated from the rest of the nodes in its original community (only one link remains), while it has been connected to many nodes in other communities.

Table 1 | Average S_orig and S_max values in the LFR and RC benchmarks. Statistical significance (p) was estimated using a two-tailed Student t test. ns: non-significant differences. Rows marked with an asterisk (in italics in the original) are the benchmarks containing quasi-random networks, discarded for the main analyses (summarized in Figures 2 and 3), but included in the analyses shown in Figure 5, below.

LFR benchmark
μ       S_orig                S_max                 p
0.1     99065.69 ± 111.50     99065.69 ± 111.50     ns
0.2     82631.18 ± 93.92      82631.18 ± 93.92      ns
0.3     67847.35 ± 90.78      67847.35 ± 90.78      ns
0.4     54354.47 ± 76.71      54354.47 ± 76.71      ns
0.5     41991.16 ± 48.70      41991.16 ± 48.70      ns
0.6     30807.18 ± 40.09      30807.38 ± 40.09      ns
0.7     20563.37 ± 26.92      20570.70 ± 26.78      ns
0.8*    11598.83 ± 17.91      10168.11 ± 28.15      < 0.0001
0.9*    4204.50 ± 7.62        8368.94 ± 4.21        < 0.0001

RC benchmark
R (%)   S_orig                S_max                 p
10      19012.72 ± 67.33      19012.94 ± 67.32      ns
20      13505.84 ± 34.14      13506.72 ± 34.11      ns
30      9298.98 ± 11.88       9301.12 ± 11.88       ns
40      6013.69 ± 3.92        6017.58 ± 4.09        ns
50      3487.65 ± 11.54       3479.92 ± 12.99       ns
60*     1647.42 ± 13.82       1540.76 ± 16.79       < 0.0001
70*     475.35 ± 10.42        899.96 ± 7.98         < 0.0001
80*     11.84 ± 1.52          963.73 ± 9.42         < 0.0001
90*     0.00 ± 0.00           1003.21 ± 9.95        < 0.0001

Figure 5 | The performance of the algorithms in the limit cases (μ = 0.7, R = 50%) and beyond those limits (μ = 0.8–0.9, R = 60–90%) are correlated. A statistically significant correlation was found, despite the fact that some algorithms, such as Infomap or LPA, totally collapsed. These algorithms established partitions consisting in a single community, which led to VI = 0 when compared with the original distribution.

The important results described in Figures 2 and 3 indicate that S maximization should allow determining with a very high precision the community structure of any network. We have explored whether this may be the case even when the community structure is very poorly defined by analyzing the results of our 600 additional networks, corresponding to the LFR benchmark with mixing parameter μ = 0.8 and μ = 0.9 (i.e. 200 networks) and the RC benchmark with R = 60% to R = 90% (400 networks). As indicated above, in these networks, the VI-based optimality criterion (i.e. VI = 0 means finding the original community structure) cannot be confidently used (Figure 1; Table 1). However, alternative, unknown structures may be present that the algorithms should be able to detect. If this is the case, a reasonable prediction is that the algorithms that are providing the maximum S values in the conditions that are closest to those extreme ones (i.e. when μ = 0.7 in the LFR benchmarks and R = 50% in the RC benchmarks) should also provide the best S values in the most extreme networks. Figure 5 shows that there is indeed a good correlation between the results obtained in the limit cases and in the most extreme cases. Kendall's non-parametric correlation coefficients for the ranks of the algorithms in the limit networks and in the most extreme networks are significant in both the LFR and RC benchmarks (τ_LFR = 0.42, p = 0.007 and τ_RC = 0.49, p = 0.020, one-tailed tests).
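The rank comparisons reported in this section rely on Kendall's correlation coefficient. A minimal Python sketch of that statistic follows, assuming the tau-b variant (which corrects for ties); the text does not say which variant or software was used, and the function name is illustrative.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long score lists (sketch).

    Assumption: tau-b is used because tied ranks can occur; the paper does
    not specify the variant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                 # tied in both lists: ignored by tau-b
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1          # pair ordered the same way in both lists
        else:
            discordant += 1          # pair ordered in opposite ways
    denom = math.sqrt((concordant + discordant + only_x_tied)
                      * (concordant + discordant + only_y_tied))
    if denom == 0:
        raise ValueError("tau-b is undefined when one list is constant")
    return (concordant - discordant) / denom


# Hypothetical ranks of four algorithms under two conditions.
print(kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4]))  # 4/6, about 0.667
```

Here one of the six algorithm pairs swaps order between the two conditions, so five pairs are concordant and one is discordant.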
The correlation between limit and extreme cases holds even though some algorithms, such as Infomap or LPA [34,39], totally fail in these quasi-random networks (Figure 5). UVCluster, RB, CPM and SCluster [16,20,35,38] emerge as the best algorithms to characterize the structure in networks with poorly defined communities, in good agreement with previous results [14,16].

We decided to perform some final tests to determine whether the limitations that affect Q when communities are very small may also affect S. For this purpose, we used two extreme networks of known structure suggested before [17,22,40]. The first one includes just three communities, one of them very large (400 nodes with average degree = 100) and the other two much smaller (cliques of 13 nodes). These three communities are connected by single links (Figure 6a). We found in this network that the maximum value of Q (EO algorithm, Q = 0.0836) did not correspond to a partition into the three natural communities. On the contrary, and as already noted by other authors in similar cases [17,22], Q indicates a mistaken solution, in this case with five communities. On the other hand, the three communities were correctly found by multiple algorithms (CPM, Infomap, LPA, RNSC, SCluster and UVCluster), and this partition indeed corresponded to the maximum value of S (1230.73). The second extreme type of network was precisely the ring of cliques in which the resolution limit of Q was first described [22], which is schematized in Figure 6b. Here, a variable number of cliques, each one composed of five units, were connected to each other by single links to form a ring. We were interested in determining whether, even if we increase the number of cliques, a solution in which all cliques are separately detected always has a better S value than one in which pairs of cliques are put together. We tested networks of sizes up to 1 million nodes, finding that the best partition was always the one in which the cliques are considered independent communities. On the contrary, when Q is used, the cliques are considered independent units only if the network size is smaller than 150 units.

Figure 6 | Two extreme networks designed to test the behavior of Surprise when small communities are present. (a) A network with three communities (sizes 400, 13, 13). The nodes of the largest community have an average degree of 100, while the nodes in the two smallest communities form cliques. The three communities are interconnected by single links, as shown. (b) Cliques, each one with five nodes, which are connected also by single links in a way that can be depicted as a ring. The figure shows an example with eight cliques, but that number was progressively increased to determine whether the partition with highest S still corresponded to the one in which each clique was an independent community.

Discussion

Our results lead to two main conclusions. The first one is that none of the algorithms currently available generates optimal solutions in all networks (Figures 2, 3). In fact, there is just a weak correlation of the algorithm performances in the two standard benchmarks used in this study. More precisely, we can say that there are some algorithms that clearly fail in both benchmarks and the rest tend to perform much better in one of the benchmarks than in the other (see Figure 2). Most of the best overall performers were already found to be outstanding in other studies [12–14,16,18,36]. The exception is RNSC [37], which had not been tested in depth before. Among the ones that always perform poorly are all the algorithms that use modularity as either a global parameter to maximize or as a way to evaluate partitions. This fact, together with the demonstration that Q does not correlate with VI in networks of known structure (Figure 2), and also the good performance of S, including its ability to cope with extreme networks in which Q traditionally fails due to its resolution limit (Figure 6), should definitely deter researchers from using modularity. A strong corollary is that the hundreds of papers based on modularity analyses, in fields as varied as sociology, ecology, molecular biology or medicine, should be reevaluated.

The second, and most important, conclusion is that the community structure of a network can be determined by maximizing S, for example by simply taking the results of as many algorithms as possible and choosing the one that provides partitions with the highest Surprise value. In a previous paper, we showed that Surprise can be used to efficiently evaluate the quality of a partition, behaving much better than modularity [14], but the precise performance of the S-maximization strategy was not determined. Here, we extend those results to show that using S maximization leads to an almost perfect performance. We were very close to solving the correct community structure of all the networks of these two benchmarks, as is strikingly demonstrated by the S_max results shown in Figure 3b. It is significant that they were obtained by combining results of the 7 algorithms with the best average performances, as detailed in Figure 3a: RN, SCluster, Infomap, CPM, RNSC, UVCluster and RB. Another important result is summarized in Figure 5, which indicates that Surprise can also be used in cases in which the community structure is so blurred as to become almost random. Given these results, we conclude that Surprise is the parameter of choice to characterize the community structure of complex networks. Future works should use Surprise maximization, instead of modularity maximization or other methods, to establish that structure.

It is significant that only two algorithms (SCluster and UVCluster) use the maximization of Surprise to choose among partitions generated by consensus hierarchical clustering [20,38]. This may explain their good average results (Figures 2, 3, 5). However, no available algorithm performs searches to directly determine the maximal Surprise values. That type of algorithm could overcome the limitations detected in all the currently available ones, potentially allowing the characterization of optimal partitions even in the most difficult networks.

We may ask why Surprise is able to evaluate with such efficiency the quality of a given partition, while modularity cannot. In our opinion, the difference rests on the fact that modularity is based on an inappropriate definition of community. Newman and Girvan [21] verbally defined a community as a region of a network in which the density of links is higher than expected by chance. However, the precise mathematical model used to deduce the modularity formula implies a definition of community that does not take into account the number of nodes required to achieve such a high density [21]. By not evaluating the number of nodes, modularity falls prey to a resolution limit: small communities cannot be detected [17,22]. On the other hand, Surprise analyses often choose as best a solution where some communities are just isolated units (see examples in Figure 4 and Ref. 14). This happens because the Surprise formula precisely evaluates not only the number of links, but also the number of nodes within each community. For instance, incorporating a single poorly connected unit into a community is often forbidden by the fact that such incorporation sharply increases the number of potential intracommunity links (all those that might connect the units already present in the community with the new unit) while barely increasing the number of real intracommunity links. This leads to an S value much smaller than if the unit is kept separated. It is also significant that a general problem of modularity maximization and other related algorithms, such as those based on Potts models with multiresolution parameters, is that they cannot find a perfect equilibrium between merging and splitting communities [17,25,26]. In these methods, each community is evaluated independently, one at a time. The global value to be maximized is the sum of the qualities of the individual communities. However, in complex networks with communities of very different sizes, it may often be impossible to find a single rule (even using a tunable parameter, as in these multiresolution methods) to split some communities while keeping the rest intact [17]. Surprise analyses are not affected by this problem, because communities are not defined independently, one by one, but emerge as regions of nodes statistically enriched in links, according to the general features (i.e. the total number of nodes and links) of the whole network.

Table 2 | Details of the algorithms used in this study. A summary of the strategies implemented by the algorithms and the corresponding references are indicated.

Name of the algorithm   Strategy used by the algorithm                               References
Blondel                 Multilevel modularity maximization                           [27]
CNM                     Greedy modularity maximization                               [31]
CPM                     Multiresolution Potts model                                  [16]
DM                      Spectral analysis + modularity maximization                  [33]
EO                      Modularity maximization                                      [28]
HAC                     Maximum Likelihood                                           [43]
Infomap                 Information compression                                      [34]
LPA                     Label propagation                                            [39]
MCL                     Simulated flow                                               [44]
MLGC                    Multilevel modularity maximization                           [29]
MSG+VM                  Greedy modularity maximization + refinement                  [30]
RB                      Multiresolution Potts model                                  [35]
RN                      Multiresolution Potts model                                  [36]
RNSC                    Neighborhood tabu search                                     [37]
SAVI                    Optimal prediction for random walks                          [45]
SCluster                Consensus Hierarchical Clustering + Surprise maximization    [20]
UVCluster               Consensus Hierarchical Clustering + Surprise maximization    [20,38]
Walktrap                Random walks + modularity maximization                       [32]

Methods

We searched the literature to select the best community detection algorithms available to analyze networks with unweighted, undirected links. Our final results are based on 18 of them (summarized in Table 2). Algorithms known to behave poorly in similar benchmarks or specifically designed to characterize communities with overlapping nodes were discarded. Some other algorithms that seemed interesting but that we were unable to test for diverse reasons (e.g. they were not provided by the authors, did not complete the benchmarks, etc.) are detailed in Supplementary Table 2. We performed extensive tests with these selected algorithms, using their default parameters, in two very different benchmarks. They were chosen to be both difficult and very dissimilar, with the idea that the results could be general enough to be extrapolated to networks of unknown structure. The first was a standard LFR benchmark already used in other studies that compared algorithms [12–14,17]. It is composed of networks with 5000 nodes, structured in small communities with 10–50 nodes. The distributions of node degrees and community sizes were generated according to power laws with exponents −2 and −1, respectively. The sizes of the communities in the networks of this benchmark have average Pielou's indexes [41] with a value of 0.98. This index is equal to 1 when all communities are of the same size. The chief difficulty of this benchmark thus lies in the presence of many small communities. The second benchmark was one of the Relaxed Caveman (RC) type, very similar to the ones used in our previous works [14,18]. The networks in this RC benchmark have 512 units and 16 communities, with sizes defined according to a broken-stick model to obtain an average Pielou's index = 0.75. This makes this benchmark very difficult, given that it consists of networks with communities of very different sizes, some of them very small (see e.g. Figure 4). It was not convenient for our purposes to use larger RC benchmarks given that the total number of links in these networks quickly grows when the number of nodes is increased and many algorithms become too slow.

These two benchmarks are "open", meaning that they have a tunable parameter that, when increased, makes the network community structure become less and less obvious until it shifts towards a totally unknown structure, potentially very different from the original one and close to random [11,13,14,18]. Increasing this parameter lowers the number of intracommunity links and raises the number of intercommunity links. In the case of the LFR benchmarks, the "mixing parameter", μ, indicates the fraction of links connecting each node of a community with nodes outside of the community [11]. For the RC benchmarks, we defined Rewiring (R) as the percentage of links that is randomly shuffled among units. Thus, R = 10% means that 10 percent of the links were first randomly removed and then added again, to link randomly chosen nodes.

Variation of information (VI) [42] was used to measure the agreement between the original community structure present in the network and the structure deduced by each algorithm. The advantages of using VI have been discussed in our previous works [14,18]. A perfect agreement with a known structure will provide a value of VI = 0. In addition, two global quality functions, Newman and Girvan's modularity (Q) [8] and Surprise (S) [14,18] (see Formula [1]), were also used to evaluate the results. The values of S and Q for the partitions proposed by each algorithm were calculated and then all the values were used to determine the correlations of S and Q with VI and to establish the maximum values of S.

1. Wasserman, S. & Faust, K. Social Network Analysis: Methods and Applications. (Cambridge University Press, 1994).
2. Strogatz, S. H. Exploring complex networks. Nature 410, 268–276 (2001).
3. Barabási, A.-L. & Oltvai, Z. N. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5, 101–113 (2004).
4. Costa, L. D. F., Rodrigues, F. A., Travieso, G. & Boas, P. R. V. Characterization of complex networks: A survey of measurements. Adv. Phys. 56, 167 (2007).
5. Newman, M. E. J. Networks: An Introduction (Oxford University Press, 2010).
6. Fortunato, S. Community detection in graphs. Phys. Rep. 486, 75–174 (2010).
7. Labatut, V. & Balasque, J.-M. Detection and interpretation of communities in complex networks: practical methods and application. In: Abraham, A. & Hassanien, A.-E. (eds.) Computational Social Networks: Tools, Perspectives and Applications pp 79–110. (Springer, 2012).
8. Girvan, M. & Newman, M. E. J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 7821–7826 (2002).
9. Danon, L., Duch, J., Diaz-Guilera, A. & Arenas, A. Comparing community structure identification. J. Stat. Mech. P09008 (2005).
10. Danon, L., Diaz-Guilera, A. & Arenas, A. The effect of size heterogeneity on community identification in complex networks. J. Stat. Mech. P11010 (2006).
11. Lancichinetti, A., Fortunato, S. & Radicchi, F. Benchmark graphs for testing community detection algorithms. Phys. Rev. E 78, 046110 (2008).
12. Lancichinetti, A. & Fortunato, S. Community detection algorithms: a comparative analysis. Phys. Rev. E 80, 056117 (2009).
13. Orman, G. K., Labatut, V. & Cherifi, H. Qualitative comparison of community detection algorithms. In: Cherifi, H., Zain, J. M. & El-Qawasmeh, E. (eds.) DICTAP (2), vol. 167 of Communications in Computer and Information Science 265–279 (Springer, 2011).
14. Aldecoa, R. & Marín, I. Deciphering network community structure by Surprise. PLoS ONE 6, e24195 (2011).
15. Ronhovde, P. & Nussinov, Z. Multiresolution community detection for megascale networks by information-based replica correlations. Phys. Rev. E 80, 016109 (2009).
16. Traag, V. A., Van Dooren, P. & Nesterov, Y. Narrow scope for resolution-free community detection. Phys. Rev. E 84, 016114 (2011).
17. Lancichinetti, A. & Fortunato, S. Limits of modularity maximization in community detection. Phys. Rev. E 84, 066122 (2011).
18. Aldecoa, R. & Marín, I. Closed benchmarks for network community structure characterization. Phys. Rev. E 85, 026109 (2012).
19. Watts, D. J. Small Worlds: The Dynamics Of Networks Between Order And Randomness. (Princeton University Press, 1999).
20. Aldecoa, R. & Marín, I. Jerarca: Efficient analysis of complex networks using hierarchical clustering. PLoS ONE 5, e11585 (2010).
21. Newman, M. E. J. & Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004).
22. Fortunato, S. & Barthélemy, M. Resolution limit in community detection. Proc. Natl. Acad. Sci. USA 104, 36–41 (2007).
23. Good, B. H., de Montjoye, Y.-A. & Clauset, A. Performance of modularity maximization in practical contexts. Phys. Rev. E 81, 046106 (2010).
24. Bagrow, J. P. Communities and bottlenecks: Trees and treelike networks have high modularity. Phys. Rev. E 85, 066118 (2012).
25. Xiang, J. & Hu, K. Limitation of multi-resolution methods in community detection. Physica A 391, 4995–5003 (2012).
26. Xiang, J. et al. Multi-resolution modularity methods and their limitations in community detection. Eur. Phys. J. B 85, 352 (2012).
27. Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. P10008 (2008).
28. Duch, J. & Arenas, A. Community detection in complex networks using extremal optimization. Phys. Rev. E 72, 027104 (2005).
29. Noack, A. & Rotta, R. Multi-level algorithms for modularity clustering. Lect. Notes Comp. Sci. 5526, 257–268 (2009).
30. Schuetz, P. & Caflisch, A. Efficient modularity optimization by multistep greedy algorithm and vertex mover refinement. Phys. Rev. E 77, 046112 (2008).
31. Clauset, A., Newman, M. E. J. & Moore, C. Finding community structure in very large networks. Phys. Rev. E 70, 066111 (2004).
32. Pons, P. & Latapy, M. Computing communities in large networks using random walks. J. Graph Algorithms Appl. 10, 191–218 (2006).
33. Donetti, L. & Muñoz, M. A. Detecting Network Communities: a new systematic and efficient algorithm. J. Stat. Mech. P10012 (2004).
34. Rosvall, M. & Bergstrom, C. T. Maps of random walks on complex networks reveal community structure. Proc. Natl. Acad. Sci. USA 105, 1118–1123 (2008).
35. Reichardt, J. & Bornholdt, S. Statistical mechanics of community detection. Phys. Rev. E 74, 016110 (2006).
36. Ronhovde, P. & Nussinov, Z. Local resolution-limit-free Potts model for community detection. Phys. Rev. E 81, 046114 (2010).
37. King, A. D., Przulj, N. & Jurisica, I. Protein complex prediction via cost-based clustering. Bioinformatics 20, 3013–3020 (2004).
38. Arnau, V., Mars, S. & Marín, I. Iterative cluster analysis of protein interaction data. Bioinformatics 21, 364–378 (2005).
39. Raghavan, U. N., Albert, R. & Kumara, S. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76, 036106 (2007).
40. Granell, C., Gómez, S. & Arenas, A. Hierarchical multiresolution method to overcome the resolution limit in complex networks. Int. J. Bifurcat. Chaos 22, 1250171 (2012).
41. Pielou, E. C. The measurement of diversity in different types of biological collections. J. Theor. Biol. 13, 131–144 (1966).
42. Meila, M. Comparing clusterings – an information based distance. J. Multivar. Anal. 98, 873–895 (2007).
43. Park, Y. & Bader, J. S. Resolving the structure of interactomes with hierarchical agglomerative clustering. BMC Bioinformatics 12, S44 (2011).
44. Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for large-scale detection of protein families. Nucl. Acids Res. 30, 1575–1584 (2002).
45. E, W., Li, T. & Vanden-Eijnden, E. Optimal partition and effective dynamics of complex networks. Proc. Natl. Acad. Sci. USA 105, 7907–7912 (2008).

Additional information
Supplementaryinformationaccompaniesthispaperathttp://www.nature.com/ 46.Kamada,T.&Kawai,S.Analgorithmfordrawinggeneralundirectedgraphs.Inf. scientificreports Proc.Lett.31,7–15(1989). 47.Smoot,M.E.,Ono,K.,Ruscheinski,J.,Wang,P.-L.&Ideker,T.Cytoscape2.8:new Competingfinancialinterests:Theauthorsdeclarenocompetingfinancialinterests. featuresfordataintegrationandnetworkvisualization.Bioinformatics27, License:ThisworkislicensedunderaCreativeCommons 431–432(2011). Attribution-NonCommercial-NoDerivs3.0UnportedLicense.Toviewacopyofthis license,visithttp://creativecommons.org/licenses/by-nc-nd/3.0/ Howtocitethisarticle:Aldecoa,R.&Mar´ın,I.Surprisemaximizationrevealsthe Acknowledgements communitystructureofcomplexnetworks.Sci.Rep.3,1060;DOI:10.1038/srep01060 (2013). ThisstudywassupportedbygrantBFU2011-30063(Spanishgovernment). Authorcontributions RAandIMconceivedanddesignedtheexperiments.RAperformedtheexperimentsand analyzedthedata.IMwrotethemainmanuscripttext.Allauthorsreviewedthemanuscript. SCIENTIFICREPORTS |3:1060|DOI:10.1038/srep01060 9
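Illustrative note on the Methods. Two of the measures described there can be sketched in a few lines of Python: Pielou's index of community-size evenness and the variation of information (VI) between two partitions. This is a minimal sketch under stated assumptions, not the authors' code. The function names `pielou_index` and `variation_of_information` are hypothetical, natural logarithms are assumed, and the paper may use a different logarithm base or normalization for VI.

```python
from collections import Counter
from math import log


def pielou_index(community_sizes):
    """Pielou's evenness J = H / ln(k) for a list of community sizes.

    H is the Shannon entropy of the size proportions and k is the number
    of communities. J equals 1 when all communities have the same size.
    """
    sizes = [s for s in community_sizes if s > 0]
    k = len(sizes)
    if k < 2:
        return 1.0  # assumed convention: a single community is perfectly even
    total = sum(sizes)
    entropy = -sum((s / total) * log(s / total) for s in sizes)
    return entropy / log(k)


def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A|B) + H(B|A) for two partitions of the same nodes.

    labels_a[i] and labels_b[i] are the community labels of node i in
    each partition. Identical partitions give VI = 0, whatever the names
    of the labels.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same nodes")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # Each overlapping pair of communities contributes to both
        # conditional entropies.
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi
```

For example, `variation_of_information([0, 0, 1, 1], [5, 5, 7, 7])` compares two partitions that differ only in label names and so returns zero, matching the statement that perfect agreement gives VI = 0. `pielou_index([32] * 16)` returns a value of 1 (up to floating-point rounding) for 16 equal-sized communities, while strongly unequal sizes give lower values.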
