The Unreasonable Effectiveness of Tree-Based Theory for Networks with Clustering

Sergey Melnik,1 Adam Hackett,1 Mason A. Porter,2,3 Peter J. Mucha,4,5 and James P. Gleeson1

1 Department of Mathematics & Statistics, University of Limerick, Ireland
2 Oxford Centre for Industrial and Applied Mathematics, Mathematical Institute, University of Oxford, OX1 3LB, UK
3 CABDyN Complexity Centre, University of Oxford, Oxford OX1 1HP, UK
4 Carolina Center for Interdisciplinary Applied Mathematics, Department of Mathematics, University of North Carolina, Chapel Hill, NC 27599-3250, USA
5 Institute for Advanced Materials, Nanoscience & Technology, University of North Carolina, Chapel Hill, NC 27599-3216, USA

arXiv:1001.1439v1 [cond-mat.dis-nn] 9 Jan 2010

We demonstrate that a tree-based theory for various dynamical processes yields extremely accurate results for several networks with high levels of clustering. We find that such a theory works well as long as the mean intervertex distance $\ell$ is sufficiently small, i.e., as long as it is close to the value of $\ell$ in a random network with negligible clustering and the same degree-degree correlations. We confirm this hypothesis numerically using real-world networks from various domains and on several classes of synthetic clustered networks. We present analytical calculations that further support our claim that tree-based theories can be accurate for clustered networks provided that the networks are “sufficiently small” worlds.

PACS numbers: 89.75.Hc, 89.75.Fb, 64.60.aq, 87.23.Ge

I. INTRODUCTION

One of the most important areas of network science is the study of dynamical processes on networks [1–4]. On one hand, research on this topic has provided interesting theoretical challenges for physicists, mathematicians, and computer scientists. On the other hand, there is an increasing recognition of the need to improve the understanding of dynamical systems on networks to achieve advances in epidemic dynamics [5–7], traffic flow in both online and offline systems [8], oscillator synchronization [9], and more [3].

Analytical results for complex networks are rather rare, especially if one wants to study a dynamical system on a network topology that attempts to incorporate even minimal features of real-world networks. If one considers a dynamical system on a real-world network rather than on a grossly simplified caricature of it, then theoretical results become almost barren. Furthermore, most analyses assume that the network under study has a locally tree-like structure, so that it can possess only very few small cycles, whereas most real networks have significant clustering (and, in particular, possess numerous small cycles). This has motivated a wealth of recent research concerning analytical results on networks with clustering [7, 10–22].

Most existing theoretical results for (unweighted) networks are derived for an ensemble of networks using (i) only their degree distribution $p_k$, which gives the probability that a random node has degree $k$ (i.e., has exactly $k$ neighbors) or using (ii) their degree distribution and their degree-degree correlations, which are defined by the joint degree distribution $P(k,k')$ describing the probability that a random edge joins nodes of degree $k$ and $k'$. In the rest of this paper, we will refer to case (i) as “$p_k$-theory” (the associated random graph ensemble is known as the “configuration model” [23]) and to case (ii) as “$P(k,k')$-theory”. The clustering in sample networks is low in both situations; it typically decreases as $N^{-1}$ as the number of nodes $N \to \infty$ [51].

We concentrate in this paper on undirected, unweighted real-world networks, which can be described completely using adjacency matrices. It is straightforward to calculate the empirical distributions $p_k$ and $P(k,k')$, which can then be used as inputs to analytical theory for various well-studied processes. The results can subsequently be compared with large-scale numerical simulations using the original networks.

In the present paper, we demonstrate that analytical results derived using tree-based theory can be applied with high accuracy to certain networks despite their high levels of clustering. Examples of such networks include university social networks constructed using Facebook data [24] and the Autonomous Systems (AS) Internet graph [25]. Specifically, the analytical results for bond percolation, $k$-core sizes, and other processes accurately match simulations on a given (clustered) network provided that the mean intervertex distance in the network is sufficiently small, i.e., that it is close to its value in a randomly rewired version of the graph. Recalling that a clustered network with a low mean intervertex distance is said to have the small-world property, we find that tree-based analytical results are accurate for networks that are “sufficiently small” small worlds. In discussing this result, we focus considerable attention on quantifying what it means to be “sufficiently small”.

The remainder of this paper is organized as follows. In Sec. II, we consider several dynamical processes on highly clustered networks and show that tree-based theory adequately describes them on certain networks but not on others. In order to explain our observations, we introduce in Sec.
III a measure of prediction quality $E$ and develop a hypothesis, inspired by the well-known Watts-Strogatz example of small-world networks, regarding its dependence on the mean intervertex distance $\ell$. We provide support for our hypothesis by numerical examination of a large range of networks in Appendix B and by analytical calculations in Appendix A. We discuss our conclusions in Sec. IV.

II. DYNAMICAL PROCESSES ON NETWORKS

A. Bond Percolation

We begin by considering bond percolation, which has been studied extensively on networks. In bond percolation, network edges are deleted (or labeled as unoccupied) with probability $1-p$, where $p$ is called the bond occupation probability. One can measure the effect of such deletions on the aggregate graph connectivity in the limit of infinitely many nodes using $S(p)$, the fractional size of the giant connected component (GCC) at a given value of $p$. (In this paper, we will use the terminology GCC for finite graphs as well.) Bond percolation has been used in simple models for epidemiology. In such a context, $p$ is related to the average transmissibility of a disease, so that the GCC is used to represent the size of an epidemic outbreak (and to give the steady-state infected fraction in a susceptible-infected-recovered model) [23].

Analytical results for GCC sizes for $p_k$-theory [31] can be found in Eq. (8.11) of Ref. [23] and analytical results for $P(k,k')$-theory are available in Eq. (12) of Ref. [32]. We plot these theoretical predictions in Fig. 1 as dashed red and solid blue curves, respectively. In this figure, we use the following data sets as examples: (a) the September 2005 Facebook network for University of Oklahoma [24], where nodes are people and links are friendships; (b) the Internet at the Autonomous Systems (AS) level [25], where nodes represent ASs and links indicate the presence of a relationship; (c) the network of users of the Pretty-Good-Privacy (PGP) algorithm for secure information interchange [26–28]; and (d) the network representing the topology of the power grid of the western United States [29, 30]. We treat all data sets as undirected, unweighted networks.

[Figure 1. Panels: (a) Facebook Oklahoma, (b) AS Internet, (c) PGP Network, (d) Power Grid. Axes: $S$ versus $p$. Legend: $p_k$-theory; $P(k,k')$-theory; numerical simulation on original network; numerical simulation on rewired network.]
FIG. 1: (Color online) Bond percolation. Plots of GCC size $S$ versus bond occupation probability $p$ for various real-world networks. These networks, which we also use as examples in other figures, are (a) the Facebook network for University of Oklahoma [24], (b) the Internet at the AS level [25], (c) the PGP network [26–28], and (d) the power grid for the western United States [29, 30].

We performed numerical calculations of the GCC size using the algorithm in Ref. [33] and plotted the results as black disks in Fig. 1. It is apparent from Fig. 1(a,b) that $P(k,k')$-theory matches numerical simulations very accurately for the AS Internet and Oklahoma Facebook networks, and we found similar accuracy for all 100 single-university Facebook data sets available to us. However, as shown in Fig. 1(c,d), the match between theory and numerics is much poorer on the PGP and Power Grid networks. The usual explanation for this lack of accuracy is that it is caused by clustering in the real-world network that is not captured by $P(k,k')$-theory. Note, however, that the Oklahoma Facebook network has one of the highest clustering coefficients of the four cases in Fig. 1 even though it is accurately described by its $P(k,k')$-theory.

Indeed, the global clustering coefficients (defined as the mean of the local clustering coefficient over all nodes [29]) for the Oklahoma Facebook, AS Internet, PGP, and Power Grid networks are 0.23, 0.21, 0.27, and 0.08, respectively. (See Table I for basic summary statistics for these networks.) The clustering coefficients for all 100 Facebook networks range from 0.19 to 0.41, and the mean value of these coefficients is 0.24. These observations suggest that one ought to consider other explanatory mechanisms for the discrepancy between theory and simulations in Fig. 1(c,d).

In considering other explanations, note that the discrepancy between theory and numerics in Fig. 1(c,d) does not arise from finite-size effects. To demonstrate this, we rewired the networks using an algorithm that preserves the $P(k,k')$ distribution but otherwise randomizes connections between the $N$ nodes [52]. Because this scheme preserves the degree correlation matrix $P(k,k')$, we call this the P-rewiring algorithm. Note that the ensemble of fully P-rewired networks is in fact the ensemble of random networks defined by the $P(k,k')$ matrix of the original (unrewired) network.

We show numerical calculations of the GCC sizes for these rewired networks with blue squares in Fig. 1(c,d) and observe that they agree very well with the curves produced from $P(k,k')$-theory. We conclude that the structural characteristics of the original networks (rather than simply their sizes) must underlie the observed differences between simulations and analytics.

Also note that the agreement between $P(k,k')$- and $p_k$-theories in Fig. 1 is better in panels (a) and (d) than in panels (b) and (c). This is because the Pearson correlation coefficient $r$ of the end-vertex degrees of a random edge [23] has smaller absolute values for the networks shown in panels (a) and (d) (0.074, with the mean 0.063 over 100 Facebook networks, and 0.0035, respectively) than it does for the networks in (b) and (c) ($-0.2$ and 0.24, respectively).

B. k-Cores

Figures 2, 3, and 4 show similar comparisons of analytical results versus numerical simulations for other well-studied processes on networks. In Fig. 2, we plot the $k$-core sizes of the networks. The $k$-core is the largest subgraph whose nodes all have degree at least $k$. The $p_k$-theory for $k$-core sizes is given in Ref. [34] and the $P(k,k')$-theory is given by Eq. (32) of Ref. [35]. As shown in Fig. 2(a,b), we again find very good agreement of $P(k,k')$-theory with numerical calculations on the AS Internet and Facebook networks and less accurate results for the other example networks. This can be quantified by comparing the actual (numerical) result for the highest value of $k$ for which the $k$-core size is nonzero to the value that is predicted by $P(k,k')$-theory. (We use $K$ to denote this maximal value of $k$.) For Fig. 2(a) and (b), we obtain $K_{P(k,k')}/K_{\mathrm{num}} \approx 0.916$ and $K_{P(k,k')}/K_{\mathrm{num}} \approx 0.826$, respectively. The corresponding values for Fig. 2(c) and (d) are $K_{P(k,k')}/K_{\mathrm{num}} \approx 0.516$ and $K_{P(k,k')}/K_{\mathrm{num}} \approx 0.368$.

[Figure 2. Panels: (a) Facebook Oklahoma, (b) AS Internet, (c) PGP Network, (d) Power Grid. Axes: $k$-core size versus $k$. Legend: $p_k$-theory; $P(k,k')$-theory; numerical result for original network.]
FIG. 2: (Color online) Plots of $k$-core sizes versus $k$ for the real-world networks from Fig. 1. The highest nonzero $k$-cores are (a) $K_{p_k} = 91$, $K_{P(k,k')} = 98$, $K_{\mathrm{num}} = 107$; (b) $K_{p_k} = 132$, $K_{P(k,k')} = 19$, $K_{\mathrm{num}} = 23$; (c) $K_{p_k} = 7$, $K_{P(k,k')} = 16$, $K_{\mathrm{num}} = 31$; and (d) $K_{p_k} = 6$, $K_{P(k,k')} = 7$, $K_{\mathrm{num}} = 19$.

C. Watts' Threshold Model

Watts [36] introduced a simple model for the spread of cultural fads. It allows one to examine how a small initial fraction of early adopters can lead to a global cascade of adoption via a social network. The $p_k$-theory and $P(k,k')$-theory for the average cascade size are given, respectively, in Ref. [37] and Ref. [35]. In Fig. 3, we compare these theories with numerical simulations on populations with Gaussian threshold distributions of mean $\mu$ and variance $\sigma^2 = 0.04$. The cascade size shows a sharp transition as $\mu$ is increased. As with the other processes that we discussed above, the position of this transition is accurately captured by the theory for the Facebook and AS Internet networks but not for the other examples.

[Figure 3. Panels: (a) Facebook Oklahoma, (b) AS Internet, (c) PGP Network, (d) Power Grid. Axes: $\rho$ versus $\mu$. Legend: $p_k$-theory; $P(k,k')$-theory; numerical simulation on original network.]
FIG. 3: (Color online) Watts' threshold model, with threshold mean $\mu$ and variance $\sigma^2 = 0.04$ for the networks from Fig. 1. We use the seed fraction $\rho_0 = 0$ because the nodes with negative thresholds immediately turn on and act as initial seeds. In other words, the effective seed fraction is given by the cumulative distribution of thresholds at zero: $\left[1 + \mathrm{erf}\left(-\mu/(\sigma\sqrt{2})\right)\right]/2$.

D. Susceptible-Infected-Susceptible Model

In Fig. 4, we show a comparison between theory and numerical simulation results for the time evolution of a susceptible-infected-susceptible (SIS) epidemic model on various networks. Unlike the other processes that we have discussed, the theory for this case (as given, for example, by Eq. (17) of Ref. [6]) is expected to apply accurately only to the early-time development of the infection. In view of this restriction, the results of Fig. 4 are consistent with those of Figs. 1–3. That is, the $P(k,k')$-theory once again provides accurate results for certain networks for a variety of processes of interest but is rather inaccurate for other networks.

[Figure 4. Panels: (a) Facebook Oklahoma, (b) AS Internet, (c) PGP Network, (d) Power Grid. Axes: $I(t)$ versus $t$. Legend: $p_k$-theory; $P(k,k')$-theory; numerical simulation on original network.]
FIG. 4: (Color online) SIS dynamics, which we display as plots of infected fraction $I(t)$ versus time $t$ for the networks from Fig. 1. The parameters in Eq. (17) of Ref. [6] are the recovery rate $\mu$ and the spreading rate $\lambda$. We use the value $\mu = 1$ in all figure panels; we use $I(0) = 10^{-3}$ in panels (a)–(c) and $I(0) = 0.002$ in panel (d); and we use $\lambda = 0.02$ in panel (a); $\lambda = 0.2$ in (b) and (c); and $\lambda = 0.8$ in panel (d).

III. MEASURE OF PREDICTION QUALITY

We now aim to characterize the types of networks for which $P(k,k')$-theory can be expected to give good results. Because Figs. 1–4 demonstrate that this characterization holds for several processes, we will concentrate hereafter primarily on the bond percolation case.

A. Watts-Strogatz Networks

Using the small-world networks introduced by Watts and Strogatz [29], one can conduct a systematic study of the effects of clustering $C$ and the mean intervertex distance $\ell$. We start with a ring of $N = 10000$ nodes and connect each node to $z = 10$ nearest neighbors. We then randomly rewire a fraction $f$ of the links in the network [53]. When $f = 0$, the values of $C$ and $\ell$ are both high. When $f = 1$, the rewired network is connected completely at random, which gives it low $C$ and $\ell$ values. For each value of $f$ between 0 and 1, we numerically calculate the clustering coefficient $C_f$, the mean intervertex distance $\ell_f$, and the GCC size $S_f(p)$ for all values of the bond occupation probability $p$ between 0 and 1. The difference between $S_f(p)$ and the $P(k,k')$-theory curve, which we denote by $S_{\mathrm{th}}(p)$, gives a quantitative measure for the inaccuracy of the theory for this particular value of the rewiring parameter $f$. We define the error measure

$$E_f = \frac{1}{M} \sum_{i=1}^{M} \left| S_{\mathrm{th}}(p_i) - S_f(p_i) \right| , \tag{1}$$
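As a rough illustration of the error measure in Eq. (1), the following Python sketch computes the mean absolute difference between two GCC-size curves sampled on a uniform grid of bond occupation probabilities $p_i = i/M$. It is only a sketch under stated assumptions: the function names (`ring_lattice`, `gcc_fraction`, `error_measure`) are hypothetical, the GCC size is estimated from a single percolation realization per value of $p$ with a simple union-find pass (this is not the algorithm of Ref. [33]), and the theoretical curve $S_{\mathrm{th}}(p)$ is not computed here and would have to be supplied separately.

```python
import random


def ring_lattice(n, z):
    """Edge list of a ring in which each node links to its z nearest neighbours (z even)."""
    return [(i, (i + d) % n) for i in range(n) for d in range(1, z // 2 + 1)]


def gcc_fraction(n, edges, p, rng):
    """Fractional size of the largest component after keeping each edge with probability p."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        if rng.random() < p:  # the bond is occupied
            ru, rv = find(u), find(v)
            if ru != rv:
                if size[ru] < size[rv]:
                    ru, rv = rv, ru
                parent[rv] = ru  # attach the smaller tree to the larger one
                size[ru] += size[rv]
    return max(size[find(x)] for x in range(n)) / n


def error_measure(s_th, s_num):
    """Mean absolute difference of two curves sampled at the same points p_i = i/M."""
    if len(s_th) != len(s_num):
        raise ValueError("both curves must be sampled on the same grid")
    return sum(abs(a - b) for a, b in zip(s_th, s_num)) / len(s_th)
```

For example, with `M = 1000` one could build `s_f = [gcc_fraction(n, edges, i / M, rng) for i in range(1, M + 1)]` for a given edge list and pass it to `error_measure` together with theoretical values evaluated on the same grid. Averaging `gcc_fraction` over several realizations per value of $p$ would reduce the sampling noise of this simple estimate.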
In Eq. (1), $p_i = i/M$ for $i = 1, 2, \ldots, M$ are uniformly spaced values in the interval $[0,1]$. Taking the spacing $1/M$ to be sufficiently fine (we use $1/M = 10^{-3}$) implies that the error measure $E_f$ approaches the average vertical distance between the $S_{\mathrm{th}}(p)$ and $S_f(p)$ curves for $p \in [0,1]$.

In Fig. 5, we plot the values of $\ell_f - \ell_1$, $C_f$ (scaled by a factor of 10 for ease of visualization), and $E_f$ (scaled by a factor of 100) as functions of the rewiring parameter $f$. For values of $f$ greater than $10^{-2}$, the quantities $\ell_f$ and $E_f$ exhibit similar behavior, whereas $C_f$ remains near its $f = 0$ value of 2/3 until $f$ is much larger [54]. We highlight the similar scaling of $\ell_f$ and $E_f$ in the inset of Fig. 5, in which we plot $\ell_f - \ell_1$ directly as a function of $E_f$ for $f \geq 10^{-2}$. The approximately linear dependence that we observe contrasts to the clearly nonlinear relation between $E_f$ and the clustering $C_f$ that we show in the same inset. This strongly suggests that differences between theory and numerics are related more directly to the mean intervertex distance than to the clustering coefficient.

[Figure 5. Main panel: $\ell_f - \ell_1$, $10 \times C_f$, and $100 \times E_f$ versus $f$ (logarithmic horizontal axis). Inset: $\ell_f - \ell_1$ and $10 \times C_f$ versus $E_f$.]
FIG. 5: (Color online) Watts-Strogatz small-world network: $\ell_f - \ell_1$ (red circles), $10 \times C_f$ (open squares), and $100 \times E_f$ (blue triangles) as functions of rewiring fraction $f$. The inset shows $\ell_f - \ell_1$ and $C_f$ as functions of $E_f$ for $f \geq 10^{-2}$. Observe the linear relation between $E_f$ and $\ell_f - \ell_1$, which suggests that $\ell_f - \ell_1$ might be a good indicator of how well the bond-percolation process on a network can be approximated by tree-based theory.

B. Real-World Networks and Additional Examples

The above results for Watts-Strogatz small-world networks motivate the examination of a range of real-world networks in order to seek a clear relationship between an error measure similar to (1) and some other characteristic of the network, such as clustering or mean intervertex distance. For each network, we calculate the inaccuracy of $P(k,k')$-theory in terms of the error $E$, which measures the distance between the actual (numerically calculated) GCC size curve $S_{\mathrm{num}}(p)$ and the theoretical prediction $S_{\mathrm{th}}(p)$:

$$E = \frac{1}{M} \sum_{i=1}^{M} \left| S_{\mathrm{th}}(p_i) - S_{\mathrm{num}}(p_i) \right| . \tag{2}$$

Essentially, $E$ gives the average distance between the numerics (black disks) and theory (solid blue curve) in Fig. 1. In Fig. 6(a), we show a scatter plot of $\log_{10} E$ versus $\log_{10} C$, where $C$ is the clustering coefficient of each network. We use logarithmic coordinates in Fig. 6 in order to fully resolve the range of values for both variables, as they vary by one or more orders of magnitude.

We also include synthetic examples, such as Watts-Strogatz small-world networks and clustered random networks generated using the recent models described in Refs. [12, 13], which we now briefly recall. The fundamental quantity defining the γ-theory networks of Ref. [13] is the joint probability distribution $\gamma(k,c)$, which gives the probability that a randomly chosen node has degree $k$ and is a member of a $c$-clique (an all-to-all connected subgraph of $c$ nodes). With $\gamma(3,3) = 1$ (and zero for other values of $k$ and $c$), each node in such a network has degree 3 and is part of exactly one triangle. This is equivalent to the $p_{1,1} = 1$ case in the clustered random graph model of Ref. [12], where $p_{s,t}$ is the probability that a randomly chosen node is part of $t$ different triangles and in addition has $s$ single edges (which don't belong to the triangles). In each synthetic network, we P-rewire a fraction $f$ of links and show our results for $f \in \{10^{-3}, 4 \times 10^{-3}, 0.04, 0.1, 0.4\}$.

In order to assess the strength of a relation between the theory error $E$ and some characteristic of the network, we calculate the coefficient of determination $R^2$ using a linear regression. For the data in Fig. 6(a), we calculate $R^2 \approx 0.087$ (using the points only and ignoring the connecting curves which help identify families of points). This relatively small value indicates that $C$ is not a good predictor of the theory error across the set of networks that we tested (see Table I). After examining a wide range of possibilities (see the scatter plots in Appendix B), we found that the network measure that best correlates with the error $E$ (on logarithmic scales) is $(\ell - \ell_1)/z$ (which gives $R^2 \approx 0.94$), where $z$ is the mean degree and $\ell_1$ is the mean intervertex distance in the version of the network that has been fully rewired while preserving the joint degree distribution $P(k,k')$ [see Fig. 6(b)]. Recall that one can think of such fully P-rewired versions of a network as random networks with the same degree correlation $P(k,k')$ and size as the original network.

[Figure 6. Panels: (a) $\log_{10} E$ versus $\log_{10} C$; (b) $\log_{10} E$ versus $\log_{10}[(\ell - \ell_1)/z]$. Legend: Power Grid; PGP Network; AS Internet; RL Internet; Coauthorships; Airports500; Interacting Proteins; C. Elegans Metabolic; C. Elegans Neural; Facebook Caltech; Facebook Georgetown; Facebook Oklahoma; Facebook UNC; WS: $z=10$, $N=1000$; WS: $z=10$, $N=10000$; $\gamma(3,3)=1$, $N=1002$; $\gamma(3,3)=1$, $N=10002$.]
FIG. 6: (Color online) Scatter plots of $\log_{10} E$ versus (a) $\log_{10} C$ (with $R^2 \approx 0.087$) and (b) $\log_{10}[(\ell - \ell_1)/z]$ (with $R^2 \approx 0.94$).

We can summarize our observations as follows. Given a network, we compare its mean intervertex distance $\ell$ with the value $\ell_1$ in a random network of equal size and degree correlation $P(k,k')$. If the difference $\ell - \ell_1$ is sufficiently small (e.g., if it is less than $z/10$, as was the case in Fig. 1(a,b)), then the $P(k,k')$-theory can be expected to accurately give the GCC size, $k$-core sizes, and results for several dynamical processes (see Figs. 1–4). For example, the AS Internet graph has $(\ell - \ell_1)/z \approx 3.3 \times 10^{-2}$ and all 100 Facebook networks have values much smaller than this. However, the theory is not accurate for larger values of $\ell - \ell_1$. (For example, the PGP and Power Grid networks have $(\ell - \ell_1)/z$ values of approximately 0.45 and 3.9, respectively.)

Because the tree-based theory systematically gives accurate results for dynamical processes on networks that are not locally tree-like when the intervertex distance is small, it seems that there must be a deeper argument than is currently known for the validity of such theories. We show in Appendix A that the error measure $E$ depends linearly on $\ell - \ell_1$ in a certain class of networks with zero clustering. Although this theoretical result is restricted in its applicability, it lends weight to our claim that $E$ depends primarily on $\ell - \ell_1$ rather than on the clustering $C$.

IV. CONCLUSIONS

At the beginning of this paper, we posed the following question: “How small must small-world networks be in order for $P(k,k')$-theory to give accurate results?” Our heuristic answer is that they must have a value for the mean intervertex distance $\ell$ that differs from the mean intervertex distance in a random network with the same $P(k,k')$ and number of nodes by no more than about 10% of the mean degree $z$. Surprisingly, the level of clustering has much less of an impact on the accuracy of $P(k,k')$-theory, which is why we found excellent matches between theory and numerical simulations even in highly clustered graphs such as Facebook social networks and the AS Internet network.

Although our presentation used bond percolation as our primary example, we demonstrated in Figs. 1–4 that if $P(k,k')$-theory is accurate for percolation, then it also works well for other processes. However, an absolute measure of accuracy must, of course, depend on the process under scrutiny. For example, Fig. 7 shows a comparison between theory and simulation results for Watts' threshold model in which $\sigma = 0$, which implies that all nodes have identical thresholds equal to $\mu$ (in contrast to Fig. 3). This example now exhibits different results for theory and numerics even in the Facebook networks. This suggests that the $\sigma = 0$ case of Watts' model is particularly sensitive to deviations of the network from randomness and suggests that this case provides a suitable testing ground for new analytically solvable models of networks that include clustering [12, 13].

[Figure 7. Panels: (a) Facebook Oklahoma, (b) AS Internet, (c) PGP Network, (d) Power Grid. Axes: $\rho$ versus $\mu$. Legend: $p_k$-theory; $P(k,k')$-theory; numerical simulation on original network.]
FIG. 7: (Color online) Watts' threshold model, with threshold mean $\mu$ and variance $\sigma^2 = 0$ (i.e., with uniform thresholds) for the networks from Fig. 1. We use a seed fraction of $\rho_0 = 10^{-2}$.

In summary, we have shown that for a variety of processes (including bond percolation and $k$-core size calculations) tree-based analytical theory yields highly accurate results for networks in which $\ell \approx \ell_1$ even in the presence of significant clustering. Such graphs, which include the AS Internet network and Facebook social networks, are definitively not locally tree-like, so that the theory is working very well even in situations where the theory's fundamental hypothesis is known to fail utterly. The fact that analytical results for several dynamical processes can be expected to apply on “sufficiently small” small-world networks increases the value of existing theoretical work and highlights the types of process for which improved analytical modelling of clustering effects should most profitably be targeted. We hope that the results of the present paper will motivate further research on the underlying causes of this “unreasonable” effectiveness of tree-based theory for clustered networks.

Acknowledgements

SM, AH, and JPG acknowledge funding provided by Science Foundation Ireland under programmes 06/IN.1/I366 and MACSI 06/MI/005. MAP acknowledges a research award (#220020177) from the James S. McDonnell Foundation. PJM was funded by the NSF (DMS-0645369). We thank Adam D'Angelo and Facebook for providing the Facebook data used in this study. We also thank Alex Arenas, Mark Newman, CAIDA, and Cx-Nets collaboratory for making publicly available other data sets used in this paper.

Appendix A: Scaling of Prediction Error with Mean Intervertex Distance

We consider the class of networks for which one can define a branching matrix [38]. A branching matrix describes the connection probabilities in tree-like networks with non-trivial structure, e.g., modular networks [39]. In this appendix, we derive how the error measure $E$ defined in Eq. (2) depends on $\ell - \ell_1$ for a network with a branching matrix when the network is close to fully P-rewired (i.e., when it is close to a random network with the same degree correlation). We give the final formula in Eq. (A6) below. Because clustering is negligible in these infinite networks, $E$ cannot depend on the clustering coefficient $C$. In Fig. 6, we illustrate both of these characteristics for real-world networks.

The branching matrix characterizes the average intervertex distance $\ell$ in a network, and it also determines the bond percolation behavior. The largest eigenvalue of the branching matrix, which we denote by $\lambda$, determines the percolation threshold

$$p_{\mathrm{th}} = \frac{1}{\lambda} . \tag{A1}$$

Additionally, an estimate of the mean intervertex distance can be written in terms of $\lambda$ as [38]

$$\ell \approx \frac{\ln N}{\ln \lambda} , \tag{A2}$$

where we recall that $N$ denotes the number of nodes in the network.

We suppose now that the network is almost fully P-rewired, and we consider how values of $\lambda$ that differ from the fully P-rewired value (which we denote by $\lambda_1$) affect the values of $\ell$ and $p_{\mathrm{th}}$. Note that it is easy to calculate $\lambda_1$, as the branching matrix of a fully P-rewired network is given in terms of the degree correlation matrix $P(k,k')$ by [38]

$$B_1(k,k') \equiv (k'-1) \frac{P(k,k')}{\sum_j P(k,j)} = (k'-1) \frac{P(k,k')}{k p_k / z} , \tag{A3}$$

and $\lambda_1$ is the largest eigenvalue of $B_1$. Moreover, for uncorrelated networks produced using the configuration model, $\lambda_1$ is simply $\sum_k k(k-1) p_k / z$. This implies in particular that $\lambda_1 = z - 1$ for graphs in which all nodes have the same degree (such as P-rewired Watts-Strogatz networks and the special cases of γ-theory networks used in Sec. III).

Considering only small deviations from fully P-rewired values, we write $\lambda = \lambda_1 + \Delta\lambda$, and $\ell = \ell_1 + \Delta\ell$. Expanding to linear terms, we find from (A2) that the excess length is

$$\Delta\ell = -\frac{\Delta\lambda}{\lambda_1} \frac{\ln N}{(\ln \lambda_1)^2} . \tag{A4}$$

Similarly, we find from (A1) that the change in percolation threshold is

$$\Delta p_{\mathrm{th}} = -\frac{\Delta\lambda}{\lambda_1^2} . \tag{A5}$$

If we now make the further assumption that $\Delta p_{\mathrm{th}}$ is approximately equal to the error $E$ for the bond percolation process [this approximation is exact if the effect of the perturbation is to shift the entire bond percolation curve $S(p)$ to $S(p + \Delta p_{\mathrm{th}})$], we obtain the relation

$$E \approx \frac{(\ln \lambda_1)^2}{\lambda_1 \ln N} (\ell - \ell_1) . \tag{A6}$$

Although the scope of our analysis is obviously limited by our assumptions, Eq. (A6) nevertheless supports our main claim that $E$ depends primarily on the excess length $\ell - \ell_1$. Note $C = 0$ for branching-matrix networks, so $E$ is (trivially) independent of $C$; compare this to the results for the real-world networks that are shown in Fig. 6(a). Moreover, the scatter plot of $\log_{10} E$ versus $\log_{10}[(\ln \lambda_1)^2 (\ell - \ell_1)/(\lambda_1 \ln N)]$ in Fig. 8 indicates that Eq. (A6) gives a good fit ($R^2 \approx 0.87$) even for real-world networks.

[Figure 8. Axes: $\log_{10} E$ versus $\log_{10}\left[\frac{(\ln \lambda_1)^2}{\lambda_1 \ln N}(\ell - \ell_1)\right]$. Legend: Power Grid; PGP Network; AS Internet; RL Internet; Coauthorships; Airports500; Interacting Proteins; C. Elegans Metabolic; C. Elegans Neural; Facebook Caltech; Facebook Georgetown; Facebook Oklahoma; Facebook UNC.]
FIG. 8: (Color online) Log-log scatter plot of actual (numerical) values of $E$ for real-world networks versus the values predicted by Eq. (A6), for which we numerically calculate $\ell$ and $\ell_1$. We find that $R^2 \approx 0.87$; the slope of the fitted line is 1.09.

Appendix B: Scatter Plots

In this appendix, we show scatter plots of $\log_{10} E$ versus a variety of possible predictors. Recall that $E$, which we defined in Eq. (2), gives an error measure for bond percolation. We test for the dependence of $E$ on various combinations of the mean degree $z$, mean intervertex distance $\ell$, and clustering coefficients [55]. Recall again that $\ell_1$ denotes the value taken by $\ell$ in a fully P-rewired version of a network (i.e., in a random network with the same degree correlation and size).

The scatter plots show data points for real-world networks, and for synthetic Watts-Strogatz small-world networks and γ-theory networks, which are described in Sec. III B. The dependence of $E$ on $\ell - \ell_1$ is clearly strong (see the top row of scatter plots, which all have $R^2 > 0.9$), whereas the dependence on clustering is weak (see the bottom row of scatter plots, which all have $R^2 < 0.3$). Given the relatively small number of available data sets, we cannot definitively select the best scaling function $F(z, \ell, \ldots)$ for the relation $E \approx F(z, \ell, \ldots)(\ell - \ell_1)$, but the simple choice $F = 1/z$ used in Fig. 6(b) and the scaling function $F = \ln^2 \lambda_1 / (\lambda_1 \ln N)$ indicated by Eq. (A6) both give satisfactory fits.

Type        Network                 N        z        ℓ        ℓ_1     ℓ_1^B   C      C_e      r          Ref(s).
Real world  Power Grid              4941     2.67     18.99    8.61    7.85    0.08   0.10     0.0035     [29, 30]
Real world  PGP Network             10680    4.55     7.49     5.40    2.66    0.27   0.38     0.23       [26–28]
Real world  AS Internet             28311    4.00     3.88     3.67    2.56    0.21   0.0071   -0.20      [25]
Real world  RL Internet             190914   6.34     6.98     5.25    3.17    0.16   0.061    0.025      [40]
Real world  Coauthorships           39577    8.88     5.50     4.45    2.93    0.65   0.25     0.19       [41, 42]
Real world  Airports500             500      11.92    2.99     2.76    1.62    0.62   0.35     -0.278     [43, 44]
Real world  Interacting Proteins    4713     6.30     4.22     4.05    2.96    0.09   0.062    -0.136     [45–47]
Real world  C. Elegans Metabolic    453      8.94     2.66     2.55    1.93    0.65   0.12     -0.226     [48, 49]
Real world  C. Elegans Neural       297      14.46    2.46     2.33    1.84    0.29   0.18     -0.163     [29, 50]
Real world  Facebook Caltech        762      43.70    2.34     2.26    1.55    0.41   0.29     -0.066     [24]
Real world  Facebook Georgetown     9388     90.67    2.76     2.55    1.79    0.22   0.15     0.075      [24]
Real world  Facebook Oklahoma       17420    102.47   2.77     2.66    1.79    0.23   0.16     0.074      [24]
Real world  Facebook UNC            18158    84.46    2.80     2.68    1.87    0.20   0.12     7 × 10^-5  [24]
Synthetic   γ-theory [γ(3,3)=1]     1002     3        13.15    8.06    9.97    1/3    1/3      N/A        [13]
Synthetic   γ-theory [γ(3,3)=1]     10002    3        19.81    11.37   13.29   1/3    1/3      N/A        [13]
Synthetic   Watts-Strogatz (WS)     1000     10       50.45    3.29    3.14    2/3    2/3      N/A        [29]
Synthetic   Watts-Strogatz (WS)     10000    10       500.45   4.34    4.19    2/3    2/3      N/A        [29]

TABLE I: Basic summary statistics for the networks that we used in this paper. We have treated all real-world data sets as undirected, unweighted networks and have computed the following properties: total number of nodes $N$; mean degree $z$; mean intervertex distance $\ell$ in original network; mean intervertex distance $\ell_1$ in the corresponding fully P-rewired version of the network (i.e., in a random network with the original degree correlation); the mean intervertex distance $\ell_1^B$ predicted by Eq. (A2) using the branching matrix corresponding to a random network with the original degree correlation; clustering coefficients $C$ and $C_e$ (whose respective definitions are given by Eqs. (3.6) and (3.4) of [23]); and the Pearson degree correlation coefficient $r$. The last column in the table gives the citation number(s) for the data in the bibliography.

[1] S. H. Strogatz, Nature (London) 410, 268 (2001).
[2] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D. U. Hwang, Phys. Rep. 424, 175 (2006).
[3] A. Barrat, A. Vespignani, and M. Barthelemy, Dynamical Processes on Complex Networks (Cambridge University Press, Cambridge, UK, 2008).
[4] S. N. Dorogovtsev, A. V. Goltsev, and J. F. F. Mendes, Rev. Mod. Phys. 80, 1275 (2008).
[5] R. Pastor-Satorras and A. Vespignani, Phys. Rev. Lett. 86, 3200 (2001).
[6] M. Barthélemy, A. Barrat, R. Pastor-Satorras, and A. Vespignani, J. Theor. Biol. 235, 275 (2005).
[7] K. T. D. Eames, Theor. Pop. Biol. 73, 104 (2008).
[8] M. Garavello and B. Piccoli, Traffic Flow on Networks (American Institute of Mathematical Sciences, Springfield, MO, 2006).
[9] A. Arenas, A. Diaz-Guilera, J. Kurths, Y. Moreno, and C. Zhou, Phys. Rep. 469, 93 (2008).
[10] M. E. J. Newman, Phys. Rev. E 68, 026121 (2003).
[11] M. A. Porter, J.-P. Onnela, and P. J. Mucha, Not. Amer. Math. Soc. 9, 1082 (2009).
[12] M. E. J. Newman, Phys. Rev. Lett. 103, 058701 (2009).
[13] J. P. Gleeson, Phys. Rev. E 80, 036107 (2009).
[14] J. P. Gleeson and S. Melnik, Phys. Rev. E 80, 046121 (2009).
[15] M. Ostilli and J. F. F. Mendes, Phys. Rev. E 80, 011142 (2009).
[16] M. Á. Serrano and M. Boguñá, Phys. Rev. E 74, 056114 (2006).
[17] M. Á. Serrano and M. Boguñá, Phys. Rev. E 74, 056115 (2006).
[18] M. Á. Serrano and M. Boguñá, Phys. Rev. Lett. 97, 088701 (2006).
[19] P. Trapman, Theor. Pop. Biol. 71, 160 (2007).
[20] J. C. Miller, J. Roy. Soc. Interface 6, 1121 (2009).
[21] J. C. Miller, Phys. Rev. E 80, 020901 (2009).
[22] T. Britton, M. Deijfen, A. N. Lagerås, and M. Lindholm, J. Appl. Probab. 45, 743 (2008).
[23] M. E. J. Newman, SIAM Rev. 45, 167 (2003).
[24] A. L. Traud, E. D. Kelsic, P. J. Mucha, and M. A. Porter, arXiv (2008), 0809.0690.
[25] The CAIDA Autonomous System Relationships Dataset, 30-Jun-2008, URL http://www.caida.org/data/active/as-relationships;http://as-
[26] X. Guardiola, R. Guimera, A. Arenas, A. Diaz-Guilera, D. Streib, and L. A. N. Amaral, arXiv (2002), 0206240.
[27] M. Boguñá, R. Pastor-Satorras, A. Diaz-Guilera, and A. Arenas, Phys. Rev. E 70, 056122 (2004).
[28] Giant component of the network of users of the Pretty-Good-Privacy algorithm for secure information interchange, URL http://deim.urv.cat/~aarenas/data/xarxes/PGP.zip.
[29] D. J. Watts and S. H. Strogatz, Nature (London) 393, 440 (1998).