Maximum Entropy Models of Shortest Path and Outbreak Distributions in Networks Christian Bauckhage Kristian Kersting Fabian Hadiji Fraunhofer IAIS TU Dortmund University TU Dortmund University 53754 St.Augustin, Germany 44227 Dortmund, Germany 44227 Dortmund, Germany Abstract—Properties of networks are often characterized in terms of features such as node degree distributions, average S S S S S S 5 pathlengths,diameters,orclusteringcoefficients.Here,westudy S S S S S I I S 1 shortest path length distributions. On the one hand, average S S d=1 S S 0 as well as maximum distances can be determined therefrom; S S d=I0 S S S I d=R0 I S 2 on the other hand, they are closely related to the dynamics S S S S S S I S of network spreading processes. Because of the combinatorial n nature of networks, we apply maximum entropy arguments to (a)t=0 (b)t=1 a derive a general, physically plausible model. In particular, we J 7 ecshtaarbalcistheritzhaetiognenoefraslhizoerdtesGtapmatmhalednigstthribhuisttioognraamssaocfonnettiwnuoorukss S I I I S I R R R I 1 orarbitrarytopology.Experimentalevaluationscorroborateour d=2 d=1R R I S d=3 d=2 d=1R R R I ] theoretical results. I R d=R0 R I R R d=R0 R R I S I R I I R R R S I. INTRODUCTION . (c)t=2 (d)t=3 s As networks are combinatorial structures, network analysis c Fig.1: AhighlyinfectiousSIRcascadeonanetworkrealizesa [ typically relies on statistics such as node degree distributions, nodediscoveryprocess.(a)atonsettimet=0,mostnodesare averagepathlengths,clusteringcoefficients,ormeasuresofas- 1 susceptibleandonlyasinglenodeisinfected.(b)–(d)infected sortativity [1]. Histograms of shortest path lengths are consid- v nodes immediately recover but always infect their susceptible 2 ered less frequently, but they, too, provide useful information. neighbors. This way, the number of newly infected nodes n 3 First of all, features such as average path lengths or network t at time t corresponds to the number n of nodes that are at 2 diameters can be determined therefrom. Second of all, path d 4 distance d=t from the source node of the epidemic. length statistics are closely related to velocities or durations 0 of network spreading processes. Analytically tractable models . 1 of shortest path distributions would thus allow for inference 0 of network properties as well as for reasoning about diffusion Each of these examples concerns an instance of a rather 5 1 dynamics. general phenomenon: An agent (a virus, a rumor, an urge : Yet, although shortest path length distributions are an easy to buy a product, etc.) spreads in form of a contact process v i enough concept, analytic models prove difficult to obtain and and thus cascades through a network of interlinked entities X related work is scarce [2]–[6]. Here, we contribute to these (people, computers, blogs, etc.). At the onset of the agent’s r efforts and derivea new general model of shortestpath length activity, many networked entities are susceptible to its effects a distributions in strongly connected networks. Considering a but only a few are actually infected (see Fig. 1(a)). As time dualitybetweennetworkspreadingprocessesandshortestpath progresses,susceptibleentitiesrelatedtoinfectedonesmaybe- lengths, we show that maximum entropy arguments as to come infected whereas infected entities may remain infected, the dynamics of diffusion processes lead to the generalized recover, become susceptible again, or even be removed from Gamma distribution. the population (see Fig. 1(b)–(d)). Crucial properties of such a process are its infection rate, A. Motivation its recoveryrate, or the numberof infected entities perunit of Models of network spreading processes play an important time.Ifan(information)epidemicisobservedtoragethrough role in various disciplines. They characterize the dynamics of a community, these features help assessing its progression or epidemicdiseases[7],[8],explainthediffusionofinnovations finaloutbreaksizeandthusinformcontagionordissemination or viral messages [9], [10], and, in the wake of social media, strategies.Forinstance,whilepublichealthauthoritiesneedto a quickly growing body of research develops graph diffusion devise immunization protocols to curtail epidemic diseases, modelstostudypatternsofinformationdisseminationinWeb- viral marketers typically aim at maximizing the effects of based social networks [11]–[16]. their campaigns. In any case, knowledge as to the structure Gamma distribution indeed provides highly accurate fits to 8.0 GenGamma nd empiricaldata shortest path histograms. In addition, we simulate a large s6.0 numberofepidemicspreadingprocessesondifferentsynthetic e d4.0 networks and find the generalized Gamma distribution to ac- o n count well for the resulting outbreak data. Finally, we present #2.0 first empirical evidence that different networks topologies 0.0 1 2 3 4 lead to considerably different parameters of fitted generalized distance d Gamma distributions which hints at new approaches towards Fig. 2: Shortest path histogram resulting from the network network inference. spreading process in Fig. 1 and a corresponding maximum Finally, in section IV, we summarize our contributions and likelihood fit of the generalized Gamma distribution. findings and discuss how they may inform future work on network analysis and network spreading processes. of a community through which an epidemic spreads would be II. THEORETICALMODEL beneficial. Alas, in practice, community structures are hardly Analytically tractable models of shortest path length dis- everknownbutavailableinformationonlyconsistsofoutbreak tributions and outbreak data are of considerable interest in data as shown in Fig. 2. the study of epidemic processes on networks. However, as In this paper, we address the problem of relating outbreak networks are combinatorial structures, corresponding results data to network structures. Our approach is orthogonal to are necessarily statistical [2], [6], [14], [20]–[22]. In this relatedcontributionsin[11],[16]–[19]whichassumeoutbreak section, we briefly review existing models and discuss the data and information about network members to be available basic ideas and properties of our new model. In doing so, and apply machine learning to identify hubs, link structures, we ignore technical details but focus on overall results so or infection routes. Instead, we follow the paradigm in [2], that this paper is more easily accessible to a wider audience. [6], [14], [20]–[22] which asks for physical explanations of The mathematical derivation of our main result as well as its the noticeably skewed appearance of outbreak histograms and characteristics are presented in the appendix. attempts to relate corresponding physical models to network properties. A. Known Results and Observations B. Main Results and Overview In a landmark paper [6], Vazquez studied the dynamics of We establish the generalized Gamma distribution as a epidemic processes in power law networks. Arguing based on continuous, analytically tractable model of discrete shortest branchingprocessmodels,heshowedthatfornetworkswhose pathhistogramsofnetworksofarbitrarytopology.Considering node degree distribution has a power law exponent 2 < γ < the fact that temporal distributions of infection counts in 3, the number of infected nodes at time t follows a Gamma highly infectious spreading processes and distributions of distribution. shortest path lengths are dual phenomena allows us to invoke 1 1 maximum entropy arguments from which we derive physical fGA(t|θ,η)= θη Γ(η)tη−1e−t/θ (1) characterizations of path lengths- and outbreak statistics. In other words, our model results from first principles where Γ(·) is the gamma function and θ > 0 and η > 0 are rather than from data mining. It also generalizes previous scaleandshapeparameters,respectively.Curiously,forpower theoretical results [2], [6] and provides an explanation for a law exponents γ ≥3, this result does not apply. recent empirical observation as to path length distributions in ConcernedwithErdo˝s-Re´nyigraphs,theworkin[2]derived social networks [23]. a different result. Based on models of the expected number of In section II, we briefly review existing analytical models pathsoflengthtbetweenarbitrarynodes[3],[4],itconsidered of shortest path- and outbreak distributions, show that the extreme value theory [25] to show that infection counts can generalized Gamma distribution naturally generalizes these, be characterized by the Weibull distribution. and discuss its properties and characteristics. κ Note that, for the sake of readability, we defer technical f (t|λ,κ)= tκ−1e−(t/λ)κ (2) WB λκ detailsonhowthegeneralizedGammaemergesasamaximum entropymodelofpathlength-andoutbreakdatatoappendices where λ>0 and κ>0 are scale and shape parameters. A, B, and C. Note that neither model was obtained from mere empiri- InsectionIII,weempiricallytestourtheoreticalpredictions cal observations. Both follow from basic principles, provide and compare the predictive power of our new model to those physically plausible and comprehensible characterizations of of related previous models. In extensive experiments with shortest path length distributions and diffusion dynamics, and synthesizednetworksofdifferenttopologiesaswellaswiththe were verified empirically. real world networks (social, bipartite, and natural) contained Finally, Bild et al. [23] recently investigated information in the KONECT collection [24], we find that the generalized cascades and retweet networks on twitter and found that they whereσ >0determinesscaleandα>0andβ >0areshape uency001...680 WGLGdoaaeegtmnaiNbGmuoalralmmmala frequency1110001−−−03210 dfitata uency001...680 WGLGdoaaeegtmnaiNbGmuoalralmmmala frequency1110001−−−03210 dfitata pboaethrTaeshkmreedewptiesertrodrsbi.btaoubtiitlohitneysldeaefsntssoiptryetcofiuantlhccetaiorsinegshint[.2(I74t]),ails[s2ou9n]ci.moMnotdaoiasntlsbnusoettvamebrlaayyl, q q fre0.4 10−4100 10n1odedegre1e02 103 fre0.4 10−4100 10n1odedegre1e02 103 we point out that 0.2 0.2 • setting β =1 yields the Gamma distribution in (1) 0.00 2 4 6 8 0.00 2 4 6 • equating α=β yields the Weibull distribution in (2) pathlength pathlength • letting α/β → ∞, the generalized Gamma distribution (a)p=0.005 (b)p=0.0075 approaches the LogNormal distribution in (3) Fig. 3: Qualitative examples of shortest path distributions in These properties immediately suggest that the above results Erdo˝s-Re´nyi graphs. Confirming [2], the Weibull distribution might be specific instances of a more general model. Indeed, fits well. Yet, the generalized Gamma distribution fits better. our theoretical discussion in the appendix shows that this is not coincidental but that the generalized Gamma distribution actually provides a physically plausible model of empirical 1.0 WGLoaegmiNbmuolralmal 101−010 dfitata 1.0 WGLoaegmiNbmuolralmal 101−010 dfitata shoInrtecsotnptarathstdtiostrtihbeutiwoonrskanind o[6u]tbarneadk[d2a],tao.ur derivation in uency00..68 GdaetnaGamma frequency1100−−32 uency00..68 GdaetnaGamma frequency1100−−32 athpepennedtwixorAkdthoreosugnhotwnheiecdhaassvuimrapltaiognenstaisstsoprtehaedtionpgo.lRoagtyheorf, q q fre0.4 10−4100 10n1odedegre1e02 103 fre0.4 10−4100 10n1odedegre1e02 103 it considers general properties of highly infectious processes as discussed in the introduction and resorts to the maximum 0.2 0.2 entropy principle together with likelihood maximization tech- 0.00 2 4 6 8 10 12 0.00 2 4 6 8 niques. Any aspects related to topological properties (e.g. de- pathlength pathlength (a)µ=1,ξ=0.75 (b)µ=2,ξ=0.75 greedistributionorclusteringpatterns)ofindividualnetworks are absorbed into the parameters of the model in (4) which, Fig. 4: Qualitative examples of shortest path distributions in as we shall see next, is flexible enough to represent a wide LogNormalgraphs.Fordifferentnodedegreedistributions,the range of different path length and spreading statistics. modelsconsideredhereprovidemoreorlessaccuratefits.The generalized Gamma, however, always fits best (cf. Tab. I). III. EMPIRICALEVALUATION In this section, we empirically evaluate the merits of the theoretical models discussed above. First, we present results couldaccuratelyfittheirdatausingtheLogNormaldistribution on shortest path length distributions in synthesized networks. Then,wereportonexperimentswithalargenumberofspread- fLN(t|µ,ξ)= √ 1 e−(log2tξ−2µ)2 (3) ingprocessesonsyntheticgraphsand,finally,wediscussresult 2πξt obtained for large, real-world networks. Throughout, we fit continuous distributions to discrete his- where µ and ξ are location and scale parameters. tograms. That is, we apply functions f(t;θ) to represent Regardingourapproachinthispaper,weemphasizethat(3) counts n ,...,n which are grouped into K distinct inter- was not found theoretically but emerged empirically. At the 0 K vals (t ,t ], (t ,t ], ..., (t ,t ). For this setting, it is sametime,wenotethattheGamma,Weibull,andLogNormal 0 1 1 2 K−1 ∞ advantageous to use multinomial likelihood estimation based distribution may easily be confused for one another [26]. Our on reweighted least squares in order to determine optimal discussionsofarthusbegsthequestioniftheaboveresultsare model parameters θ∗ [30]. For a recent detailed exposition contradictory or if they may be unified within a more general of this robust technique, we refer to [31]. framework? Next, we show that the latter is indeed the case. For fitted models, we report quantitative goodness-of-fit B. New Result results in terms of the Hellinger distance In this subsection, we establish the generalized Gamma (cid:0) (cid:1) 1(cid:115)(cid:88)(cid:16)(cid:112) (cid:112) (cid:17)2 distribution as a comprehensive model of the shortest path H h[t],f[t] = h[t]− f[t] (5) 2 distributions and outbreak data in connected networks of t arbitrary topology. between discrete empirical data h[t] and a discretized model Different versions of the generalized Gamma distribution f[t] where can be traced back to 1920s [27]. Here, we are concerned F(t+ 1) if t=0 with thethree-parameter versionthat wasintroduced byStacy 2 f[t]= F(t+ 1)−F(t− 1) if 0<t<K (6) [28]. Its probability density function is defined for t∈[0,∞) 2 2 and given by 1−F(t+ 1) if t=K 2 f (t|σ,α,β)= β 1 tα−1e−(t/σ)β (4) andF(·)isthecorrespondingcumulativedensityfunction.We GG σαΓ(α/β) note that the Hellinger distance is bound as 0≤H ≤1. TABLE I: Goodness of Fit (avg. Hellinger distances) for shortest path histograms of synthetic networks 1.0 WGLoaegmiNbmuolralmal 101−010 dfitata 1.0 WGLoaegmiNbmuolralmal 101−010 dfitata nEeRtwork pπa=ram0e.0te0r5s0 0f.W05B8 0f.1G7A0 0f.2L2N7 0f.0G5G1 uency00..68 GdaetnaGamma frequency1100−−32 uency00..68 GdaetnaGamma frequency1100−−32 q q π=0.0075 0.046 0.164 0.205 0.041 fre0.4 10−4 101nodedegr1e0e2 103 fre0.4 10−4 101 nodede1g0r2ee 103 BA m=1 0.039 0.066 0.135 0.011 0.2 0.2 m=2 0.017 0.125 0.170 0.015 m=3 0.015 0.128 0.171 0.015 0.00 2 4 6 8 10 12 14 16 18 0.00 2 4 6 pathlength pathlength PL γ =2.1 0.154 0.042 0.040 0.030 (a)m=1 (b)m=3 γ =2.3 0.132 0.032 0.046 0.024 Fig. 6: Qualitative examples of shortest path distributions γ =2.5 0.123 0.037 0.064 0.028 γ =2.7 0.101 0.044 0.089 0.028 in Baraba´si-Albert graphs. Independent of the attachment γ =2.9 0.081 0.050 0.108 0.026 parameter m, node degrees are always power law distributed γ =3.1 0.070 0.093 0.137 0.051 with γ = 3. The Weibull distribution therefore fits well, but LN µ=1,ξ=0.75 0.075 0.093 0.149 0.037 again, the generalized Gamma distribution provides the best µ=1,ξ=1.25 0.082 0.067 0.113 0.029 fit. µ=2,ξ=0.75 0.073 0.092 0.145 0.036 µ=2,ξ=1.25 0.079 0.062 0.102 0.026 model fits obtained in these experiments are shown in Figs. 3 trough 6. frequency0001....4680 WGLGdoaaeegtmnaiNbGmuoalralmmmala frequency111100001−−−−043210 101nodedegr1e0e2 dfitata103 frequency0001....4680 WGLGdoaaeegtmnaiNbGmuoalralmmmala frequency111100001−−−−043210 101nodedegr1e0e2 dfitata103 gltaaigrocarTnkpeasehbomsclfeoeonsnfpIstiadnwcseueir,tm=ehdwm[1e2ian0r]oi,,zom0teuh0sire0t earWnexvsoepeudreialbertgiussmellfaeongndrdotissos)toddr.iminbfWefueesetrsioeoofnnobttfsphetreofirovptvpeoialrdroteeahgssmauitelaetss(triw)i(zffeooialn-rrl 0.2 0.2 fitting model for the distribution of shortest path lengths in ER graphs; it outperforms the Gamma and the LogNormal 0.00 2 4 6 8 10 12 0.00 2 4 6 8 10 12 14 16 18 20 pathlength pathlength distribution;(ii)inagreementwith[6],theGammadistribution (a)γ=2.3 (b)γ=3.1 provides a well fitting model for PL graphs where 2<γ <3; Fig. 5: Qualitative examples of shortest path distributions in (iii) for PL graphs where γ (cid:46) 2.2, the LogNormal fits well, power law graphs. Confirming [6], for power law coefficients too; (iv) for PL graphs where γ ≥ 3, the Weibull fits better 2 < γ < 3, the Gamma distribution provides good fits; for than the Gamma or the LogNormal; in this context, we note γ ≥ 3, the Weibull distribution provides better fits. In any that BA graphs are power law graphs for which γ = 3 [33]; case, the generalized Gamma distribution fits best. (v) in any case, the generalized Gamma distribution provides the best fits. The latter is hardly surprising as the densities in (1), (2), and (3) are functions of two parameters whereas the model in A. Synthetic Networks (4) depends on three parameters and therefore offers greater We created different Erdo˝s-Re´nyi (ER), Baraba´si-Albert flexibility in statistical model fitting. Nevertheless, these re- (BA), power law (PL), and LogNormal (LN) graphs of n ∈ sultsagreewiththetheoreticalconsiderationsabove.Whilethe {5,000,10,000} nodes. Gamma and the Weibull distribution fit path length distribu- ER graphs are a staple of graph theory and merit in- tions in particular types of networks, the generalized Gamma vestigation. To create ER graphs, we used edge probability distribution provides accurate fits across a wide variety of parameters π ∈ {0.005,0.0075,0.01}. BA and PL graphs underlying network topologies. represent networks that result from preferential attachment processes and are frequently observed in biological, social, B. Spreading Processes and technical contexts. To create BA graphs, we considered Given the synthetic networks created above, we simulated attachment parameters m ∈ {1,2,3} and exponents of the SIR spreading processes where the infection rate varied in vertex degree distributions of the PL graphs were drawn from i∈{0.5,0.6,...,0.9} and the recovery rate was chosen from γ ∈ {2.1,2.2,...,3.2}. LN graphs show log-normally dis- r ∈{0.5,0.6,...,0.9}. For each network and each choice of tributed vertex degrees and reportedly characterize link struc- iandr,wecreated10epidemicsstartingatrandomlyselected tures within sub-communities on the web [32]. To synthesize sourcenodesv andfittedthegeneralizedGammadistribution s LN graphs, parameters were chosen from µ∈{1,1.5,...,3} to the resulting outbreak data. and ξ ∈{0.25,0.5,0.75,1}. For each parametrization of each Table II shows exemplary results obtained for PL graphs of model, we created 100 instances. 10,000 nodes; rows correspond to different power law expo- Examples of shortest path histograms and corresponding nents γ and columns indicate different choices of pairs (i,r). TABLE II: outbreak data obtained from spreading processes in power law networks of 10,000 nodes i=0.5,r=0.5 i=0.5,r=0.7 i=0.5,r=0.9 i=0.9,r=0.5 i=0.9,r=0.7 i=0.9,r=0.9 γ =2.2 γ =2.6 γ =3.0 Each panel plots the outbreak data of all the corresponding epidemics (grey dots), the corresponding empirical average 18 bipartitenets 18 bipartitenets naturalnets naturalnets taken over the individual outbreak distributions (black dots), 16 socialnets 16 socialnets as well as a generalized Gamma fit to these averages (blue 14 14 curves). 12 12 Visual inspection of these results suggests that the gen- 10 10 eralized Gamma distribution accounts very well for average 8 8 outbreak dynamics in power law graphs. Though not shown 6 6 in the table, this behavior was also observed for individual 4 4 epidemics as well as for epidemics on other networks. In 22 4 6 8 10 12 14 16 18 22 4 6 8 10 12 14 16 18 appendix B, we present a theoretical justification for this (a)fWB (b)fGA empirically observed behavior. C. Real-World Networks 18 bipartitenets 18 bipartitenets naturalnets naturalnets The KONECT network collection [24] provides a com- 16 socialnets 16 socialnets prehensive set of large scale, real-world network data freely 14 14 available for research. Networks contained in this collection 12 12 comprise (online) social networks where edges indicate social 10 10 contacts or friendship relations, natural networks such as 8 8 powergridsorconnectionsbetweenairports,andbipartitenet- 6 6 works such as typically found in the context of recommender 4 4 systems.ThesizesofthesenetworksvarybetweenO(10,000) 22 4 6 8 10 12 14 16 18 22 4 6 8 10 12 14 16 18 to O(1,000,000) nodes and they show different node degree (c)fLN (d)fGG distributionsandclusteringcoefficients.Forfurtherdetails,we Fig. 8: Empirical vs. predicted average shortest path lengths. refer to http://konect.uni-koblenz.de/. Predictions correspond to the means of Weibull, Gamma, Tables III through V summarize goodness-of-fit results LogNormal, and generalized Gamma distributions that were obtained from fitting the models in (1), (2), (3), and (4) fitted to shortest path lengths histograms of the real world to hop count distributions of social, natural, and bipartite networks in our test set. networksrespectively.Againinagreementwiththetheoretical prediction in [6], we observe that the Gamma distribution provides accurate fits to path length distributions in social networks which are often reported to be power law networks theoretically.Inthissense,theworkpresentedinthispapercan with power law exponents 2<γ <3. beseenasthefirstsuchjustificationbecausetheLogNormalis For the case of natural and bipartite networks, we find the obtainedalimitingcaseofthegeneralizedGammadistribution LogNormal distribution to provide better fits than the Gamma which,asshowninappendixA,providesaphysicallyplausible ortheWeibulldistribution.Weemphasizethatthisisanempir- model of distance distributions in networks. icalfindingwhich,toourknowledge,hasnotyetbeenjustified In fact, just as in the previous subsections, the generalized 1.0 Weibull 1.0 Weibull 1.0 Weibull cy0.8 Gamma cy0.8 Gamma cy0.8 Gamma uen0.6 LGoegnNGoarmmmala uen0.6 LGoegnNGoarmmmala uen0.6 LGoegnNGoarmmmala eq0.4 empiricaldata eq0.4 empiricaldata eq0.4 empiricaldata fr0.2 fr0.2 fr0.2 0.0 2 4 6 8 10 12 14 16 18 0.0 2 4 6 8 10 12 14 0.0 2 4 6 8 10 pathlength pathlength pathlength (a)facebookwallpostnetwork (b)openflights (c)movielensusrmov Fig.7: Qualitativeexamplesofshortestpathdistributionsinreal-worldnetworks.Pathlengthdistributionsofsocialandnatural networks are well accounted for by the Gamma distribution whereas the Weibull distribution provides better fits for bipartite networks that occur in recommender settings. In each case, however, the generalized Gamma distribution accounts best for the empirical data. TABLEIII: GoodnessofFit(Hellingerdistances)forshortest TABLEIV: GoodnessofFit(Hellingerdistances)forshortest path histograms of social networks path histograms of natural networks network f f f f network f f f f WB GA LN GG WB GA LN GG advogato 0.110 0.014 0.056 0.012 arenas meta 0.090 0.036 0.025 0.029 arenas email 0.044 0.036 0.074 0.008 as caida20071105 0.277 0.160 0.050 0.113 arenas pgp 0.100 0.022 0.057 0.013 as20000102 0.065 0.016 0.054 0.005 ca AstroPh 0.157 0.021 0.047 0.021 dbpedia similar 0.685 0.686 0.082 0.018 ca cit HepPh 0.133 0.023 0.065 0.015 eat 0.012 0.075 0.115 0.003 ca cit HepTh 0.089 0.016 0.026 0.010 lasagne frenchbook 0.017 0.001 0.058 0.003 catster 0.101 0.023 0.032 0.016 openflights 0.139 0.025 0.040 0.022 cfinder google 0.076 0.021 0.023 0.032 powergrid 0.745 0.745 0.150 0.016 cit HepPh 0.178 0.037 0.065 0.032 usairport 0.053 0.010 0.040 0.007 cit HepTh 0.135 0.018 0.051 0.016 sociopatterns infectious 0.030 0.026 0.053 0.012 dblp cite 0.078 0.042 0.077 0.012 topology 0.130 0.021 0.055 0.019 dogster 0.215 0.036 0.048 0.026 wordnet words 0.192 0.031 0.054 0.022 elec 0.046 0.057 0.096 0.020 wordnet 0.133 0.188 0.214 0.131 email EuAll 0.363 0.331 0.116 0.232 avg. 0.198 0.155 0.076 0.031 enron 0.122 0.046 0.075 0.023 facebook wosn links 0.186 0.022 0.037 0.018 facebook wosn wall 0.172 0.025 0.048 0.021 TABLE V: Goodness of Fit (Hellinger distances) for shortest filmtipset friend 0.140 0.024 0.052 0.017 path histograms of bipartite networks gottron net all 0.075 0.038 0.070 0.029 gottron net core 0.049 0.024 0.060 0.006 hep th citations 0.122 0.011 0.036 0.009 network fWB fGA fLN fGG loc brightkite edges 0.193 0.027 0.035 0.059 adjnoun 0.028 0.037 0.065 0.012 munmun digg reply 0.170 0.028 0.054 0.023 bx 0.264 0.096 0.090 0.130 munmun twitter social 0.132 0.089 0.079 0.071 dbpedia occupation 0.165 0.103 0.113 0.101 collaboration 0.137 0.097 0.074 0.045 dbpedia producer 0.734 0.734 0.152 0.156 ucsocial 0.084 0.012 0.056 0.008 dbpedia starring 0.721 0.721 0.034 0.032 petster carnivore 0.120 0.116 0.119 0.117 dbpedia writer 0.754 0.754 0.096 0.072 petster friendships cat 0.109 0.023 0.032 0.027 epinions 0.209 0.036 0.041 0.026 petster friendships dog 0.206 0.036 0.048 0.026 escorts 0.097 0.048 0.076 0.031 petster friendships hamster 0.167 0.083 0.049 0.060 filmtipset comment 0.060 0.049 0.089 0.007 petster hamster 0.081 0.011 0.033 0.007 github 0.186 0.078 0.082 0.078 sap 0.142 0.063 0.090 0.052 gottron reuters 0.128 0.112 0.137 0.106 slashdot threads 0.267 0.082 0.055 0.102 movielens 10m ti 0.187 0.082 0.101 0.074 slashdot zoo 0.192 0.026 0.052 0.032 movielens 10m ui 0.036 0.083 0.116 0.031 wikiconflict 0.134 0.014 0.055 0.014 movielens 10m ut 0.199 0.146 0.159 0.144 wikisigned k2 0.268 0.166 0.070 0.136 movielens 1m 0.043 0.005 0.040 0.007 wikisigned nontext 0.291 0.086 0.058 0.066 ucforum 0.047 0.050 0.080 0.030 avg. 0.146 0.050 0.059 0.039 pics ut 0.268 0.164 0.166 0.183 prosper support 0.230 0.259 0.232 0.208 youtube groupmemberships 0.192 0.115 0.119 0.115 avg. 0.239 0.193 0.105 0.081 Gamma distribution is again found to provide the best overall fits to the data considered here. Qualitative examples of this behavior are shown in Fig. 7. We also evaluated how the different distributions perform in predicting the average shortest path length and compared TABLE VI: Mean squared errors for Fig. 8 ERgraphs PLgraphs fWB fGA fLN fGG BAgraphs LNgraphs bipartite 1.665 1.382 0.731 0.414 natural 0.216 0.479 0.862 0.093 social 0.474 0.219 0.648 0.220 overall 1.745 1.479 1.303 0.478 80 70 0 60 empiricalmeanstothemeansoffittedmodels.FortheWeibull, 50 1 α 40 2 the mean is given by λΓ(1+1/κ), the mean of the Gamma 30 3 20 is ηθ, that of the LogNormal is exp(µ−ξ2/2), and for the 10 4 β 0 5 generalized Gamma we have 0 5 6 10 Γ((α+1)/β) σ 15 7 E{t}=σ . (7) 20 25 8 Γ(α/β) (a)syntheticnetworks Figure 8 shows scatter plots of predicted average shortest path lengths versus empirically determined ones for the net- socialnetworks works in the KONECT collection. Visual inspection suggests naturalnetworks bipartitenetworks that w.r.t. predicting average path lengths, the Weibull per- forms worse than the Gamma which performs worse than the LogNormalwhichisoutperformedbythegeneralizedGamma distribution. This is backed by Tab. VI which lists mean squarederrorsforthedatainthefigure.Again,thegeneralized 80 Gamma distribution performs best. 70 0 60 50 1 D. Embedding Path Length Histograms in 3D α 40 2 30 3 An interesting consequence of using the three-parameter 2100 4 β generalized Gamma distribution to characterize shortest path 0 5 0 histograms is that it provides a non-linear mapping of path 5 10 6 length data into three dimensions. This allows for visual ana- σ 15 20 87 25 lytics of the behavior of different graph topologies w.r.t. path (b)real-worldnetworks length distributions. Figure 9 shows exemplary shortest path Fig. 9: 3D embedding of shortest path histograms from distributions in terms 3D coordinates (σ,α,β) that result from fitting generalized Gamma distributions. Looking at the different networks. Each point (σ,α,β) represents a path length distribution in terms of the parameters of the best figure,itappearsthatshortestpathdistributionsobtainedfrom fitting generalized Gamma model. Path length distributions different network topologies cluster together or are confined of different types of graphs appear to be confined to specific to certain regions in this parameter space. These preliminary regions. observations are arguably the most interesting finding in this paper as they suggest that the idea of characterizing networks in terms of continuous models of shortest path distributions can inform approaches to the problem of network inference Empirical tests confirmed our theoretical prediction and from outbreak data. For now, we leave the study of the revealed that the generalized Gamma distribution accounts characteristics of these representations and their possible in- well for shortest path histograms of synthesized Erdo˝s-Re´nyi, terpretations to future work. Baraba´si-Albert, power law, and LogNormal graphs as well as for the empirical path length distributions of the real-world IV. SUMMARYANDOUTLOOK networks in the KONECT collection [24]. In this paper, we considered the problem of representing As an application in the context of visual analytics, we discrete path length histograms and epidemic outbreak data introduced a non-linear embedding of path length histograms by means of continuous, analytically tractable models. into 3D parameter spaces and observed striking structural Invoking the maximum entropy principle, we showed that regularities in the resulting representations. generalized Gamma distribution provides a physically plausi- Given our results, there are several directions for future ble model for these data, independent of the topology of the research.Firstofall,itappearsworthwhiletoattempttorelate underlying networks. This result generalizes earlier models the shape and scale parameters of the generalized Gamma [2], [6] and explains recent empirical observations made distribution to physical properties or well established network w.r.t. twitter retweet networks [23]. features; we expect to be able to establish a connection to, say, the expected value or the variance of the generalized Gamma. Second of all, our theoretical results in this paper can be used to devised informed sampling schemes for the problem of computing shortest path histograms for large real- world networks. Efforts in this direction are underway and will hopefully be reported soon. Third of all, we are currently exploring ways of how to apply the results in Fig. 9 to the problem of network inference from outbreak data. A simple (a)constantqk (b)decreasingqk (c)increasingqk idea in this context is to create a large data base of 3D rep- Fig. 10: The probability of fining a path of length k may be resentations of outbreak data so that likely network structures constant or decrease or increase with k. for newly observed epidemic processes can be inferred from, say, nearest neighbor searches over the available examples. APPENDIXA andweemphasizethatp andq willgenerallydiffer(seethe k k MODELDERIVATION didactic examples in Fig. 10). In this section, we show that the generalized Gamma With these definitions, the joint probability of observing distribution provides a principled model of shortest path- and counts n of infected nodes corresponds to the multinomial k outbreak statistics in networks. We argue based on Jaynes’ maximum entropy principle [34] which states that, subject (cid:89)K qnk to observations and contextual knowledge, the probability P(n1,...,nK)=n! k (11) n ! k distributionthatbestrepresentstheavailableinformationisthe k=1 one of highest entropy. In particular, we resort to an approach whose log-likelihood is given by byWallis(cf.[35,chapter11])whichdoesnotassumeentropy as an a priori measure of uncertainty but uncovers it in the (cid:88)K L=logn!+ n logq −logn ! (12) course of the argument. k k k Toderive ouranalyticalmodel ofpathlength histogramsof k=0 arbitrarynetworks,weconsiderthenetworkspreadingprocess Assuming that n (cid:29) 1, we may apply Stirling’s formula k inFig.1.Borrowingterminologyfromepidemiology,wenote logn !≈n logn −n tosimplify(12)whichthenbecomes k k k k that, at onset time t = 0, most nodes in the network are K susceptible and one node is infected. At time t + 1, nodes (cid:88) (cid:0) (cid:1) L≈n logn−n+ n logq −logn +1 that were infected at t have recovered yet did infect their k k k susceptibleneighbors.HighlyinfectiousSIRcascadeslikethis k=0 K realize a node discovery process: the number n of newly (cid:88) (cid:0) (cid:1) t =n logn+n p logq −(logp +logn) k k k infected nodes at time t corresponds to the number of nodes k=0 that are at topological distance d=t from the source. K Given these considerations, we observe that the discrete =−n(cid:88)p logpk. (13) k shortest path distribution of the entire network is nothing but qk k=0 the sum of all node count histograms h [t] taken over all s possible source nodes. As the expression in (13) is a Kullback-Leibler divergence, In what follows, we let K denote the length of the longest our considerations so far led indeed to an entropy that needs shortest path starting at a source v and use n to indicate the to be maximized in order to determine the most likely values s k number of nodes that are k steps away from vs. Then, n∗k of the nk. However, up until now, the quantities q are not defined K k (cid:88) precisely enough to allow for a solution. Also, (13) accounts n= n (8) k only indirectly for temporal aspects of network spreading k=0 processesasanydependencyontimeishiddenintheindexof where n denotes the total number of nodes. The probability summation k. We address both these issues by choosing the of observing a node at distance k can thus be expressed as ansatz p =n /n and we have k k K q =Atα−1 (14) (cid:88) k k 1= p . (9) k k=0 where A is a normalization constant and α > 0. This choice Moreover, we let q denote the probability that there exists a and its significance will be justified in detail in appendix C. k shortest path of length k so that For now, we continue with our main argument. Another underspecified quantity so far is the length K of K 1=(cid:88)q (10) the supposed longest shortest path. However, dealing with k networks of finite size, we can bypass the need of having to k=0 specify K by means of introducing the following constraint In order to establish our final result, we now let ∆t → 0 ∞ which leads to (cid:88)n =n (15) n∗ (cid:90) k k ≈ f(τ)dτ ≈∆tf(t) (22) k=0 n ∆t into the problem of maximizing (13). Finally, to prevent where t − ∆t ≤ t ≤ t + ∆t. Direct comparison of (21) degeneratesolutions(e.g.instantaneousorinfinitespread),we and (22)ktoge2ther with thke sub2stitution σ = (cid:0)βc(cid:1)1/β finally imposeaconstraintonthetimest andrequiretheirmoments α k establishes that β 1 (cid:88)∞ nk tβ =c (16) f(t)= σαΓ(α/β)tα−1e−(t/σ)β (23) n k k=0 which is indeed the generalized Gamma distribution that was to be finite for some β >0. introduced in equation (4). Usingdifferentialforms,wenowexpressthelog-likelihood APPENDIXB in (13) and the two constraints in (15) and (16) as DIFFERENTTIMESCALES ∞ ∞ dL=(cid:88) ∂L dn =(cid:88)(cid:0)logAtα−1−logn (cid:1)dn Our derivation above was based on properties of networked ∂nk k k k k SIR cascades for which the infection rate i as well as the k=0 k=0 ∞ recovery rate r were both assumed to be 100%. Yet, the (cid:88) dn= dn empirical result in section III indicate that the generalized k Gamma distribution also account for the dynamics of less k=0 ∞ infectious spreading processes where i and r are smaller. (cid:88) dc= tβkdnk Such epidemics usually last longer as it takes more time k=0 to reach all susceptible nodes and infected nodes may remain and consider the Lagrangian with multipliers ρ and γ so over extended periods. In other words, such processes can (cid:88)∞ (cid:20) Atα−1 (cid:21) be thought of as taking place on a different time scale. Here, L= log k −ρ−γtβ dn =0 (17) we briefly show that the generalized Gamma distribution can n k k k also explain spreading processes on linearly or polynomially k=0 in order to determine most likely infection counts n∗ for the transformed time scales. k spreading process described above. Since at the solution the Recall that if a random variable X is distributed according bracketed terms [·] must vanish identically for every dn , we to f(x), the monotonously transformed random variable Y = k immediately obtain h(X) has a probability function that is given by (cid:12) (cid:12) n∗k =Ae−ρtαk−1e−γtβk. (18) g(y)=f(cid:0)h−1(y)(cid:1)·(cid:12)(cid:12)(cid:12)ddyh−1(y)(cid:12)(cid:12)(cid:12). (24) Plugging this result back into p = n /n yields a discrete k k Now, if t is generalized Gamma distributed and τ =ct is a probability mass function linearly transformed version of t, then n∗k = tαk−1e−γtβk (19) t= τ and dt = 1 (25) n (cid:80)∞ tα−1e−γtβj c dτ c j so that τ is distributed according to j=0 which explicitly relates infection counts nk to time steps tk. g(τ)= β 1 (cid:16)τ(cid:17)α−1e−(τ/cσ)β · 1 However, (19) still depends on the Lagrangian multiplier γ σαΓ(α/β) c c which is not immediately related to any available data. = β 1 τα−1e−(τ/σ(cid:48))β Assuming the duration ∆t=tk+1−tk of time steps to be σ(cid:48)αΓ(α/β) small, permits the approximation which is a generalized Gamma distribution with scale param- (cid:88)∞ tαk−1e−γtβk∆t≈(cid:90) ∞tα−1e−γtβdt=γ−αβ Γ(αβ/β) eteBryσ(cid:48)th=ecsσam. e token, if t is generalized Gamma distributed k=0 0 and τ =tc is a polynomially transformed version of t, then so that dt 1 n∗k =∆t γαββ tα−1e−γtβk. (20) t=τ1c and dτ = c τ1c−1 (26) n Γ(α/β) k so that τ is distributed according to If this is plugged into (16), tedious but straightforward β 1 (cid:16) (cid:17)α−1 1 algebra yields γ = α which is to say that g(τ)= σαΓ(α/β) τ1c e−(τ1/c/σ)β · c τ1c−1 βc nn∗k =∆t(cid:20)Γ(αβ/β)(cid:16)βαc(cid:17)αβ(cid:21)tαk−1e−βαctβk. (21) = σβ(cid:48)(cid:48)αc(cid:48) Γ(α1(cid:48)/β(cid:48))τα(cid:48)−1e−(τβ(cid:48)/σ(cid:48)β(cid:48)) 7 11 REFERENCES 15 8 17 [1] R. Cohen and S. Havlin, Complex Networks. Cambridge University 2 5 Press,2010. [2] C. Bauckhage, K. Kersting, and B. Rastegarpanah, “The Weibull as 12 18 a Model of Shortest Path Distributions in Random Networks,” in 9 3 1 Proc.Int.WorkshoponMiningandLearningwithGraphs,2013. 6 13 [3] V.Blondel,J.Guillaume,J.Hendrickx,andR.Jungers,“DistanceDis- tributioninRandomGraphsandApplicationtoNetworkExploration,” 10 4 PhysicalReviewE,vol.76,no.6,p.066101,2007. 16 14 [4] A. Fronczak, P. Fronczak, and J. Holyst, “Average Path Length in RandomNetworks,”PhysicalReviewE,vol.70,no.5,p.056110,2004. Fig. 11: Causal graph of the process in Fig. 1. The adjacency [5] A. Ukkonen, “Indirect Estimation of Shortest Path Distributions with matrix of a directed acyclic graph with a single source node Small-WorldExperiments,”inProc.AdvancesinIntelligentDataAnal- ysis,2014. can be brought into strictly upper triangular form and is thus [6] A.Vazquez,“PolynomialGrowthinBranchingProcesseswithDiverging nilpotent. Reproduction Number,” Physical Review Letters, vol. 96, no. 3, p. 038702,2006. [7] M.KeelingandK.Eames,“NetworksandEpidemicModels,”J.Royal SocietyInterface,vol.2,no.4,pp.295–307,2005. which is a generalized Gamma distribution with parameters [8] A. Lloyd and R. May, “How Viruses Spread Among Computers and σ(cid:48) =σc, α(cid:48) = α, and β(cid:48) = β. People,”Science,vol.292,no.5520,pp.1316–1317,2001. c c [9] D.Acemoglu,A.Ozdaglar,andM.Yildiz,“DiffusionofInnovationsin APPENDIXC SocialNetworks,”inProc.Int.Conf.onDecisionandControl. IEEE, 2011. FURTHERDETAILS [10] J. Leskovec, L. Adamic, and B. Huberman, “The Dynamics of Viral Whiletheconstraintsin(15)and(16)arearguablyintuitive, Marketing,”ACMTans.Web,vol.1,no.1,p.5,2007. [11] E. Adar and A. Adamic, “Tracking Information Epidemics in the ansatz in (14) merits further elaboration. Blogspace,”inProc.Int.Conf.onWebIntelligence. IEEE/WIC/ACM, Note that causal graphs as in Fig. 11 indicate possible 2005. transmission pathways of an epidemic. In general they may [12] C.Bauckhage,“InsightsintoInternetMemes,”inProc.ICWSM. AAAI, 2011. be arbitrarily complex but, for the type of spreading process [13] C. Budak, D. Agrawal, and A. E. Abbadi, “Limiting the Spread of studied here, they are necessarily directed and acyclic. MisinformationinSocialNetworks,”inProc.WWW. ACM,2010. Consider a network of n nodes. If v is the source node of [14] J. Iribarren and E. Moro, “Impact of Human Activity Patterns on the s an epidemic on this network and A is the adjacency matrix DynamicsofInformationDiffusion,”PhysicalReviewLetters,vol.103, no.3,p.038702,2009. of the corresponding causal graph where [15] J. Leskovec, L. Backstrom, and J. Kleinberg, “Meme-tracking and the (cid:40) DynamicsoftheNewsCycle,”inProc.KDD. ACM,2009. 1 if nodes vi infects vj [16] J. Yang and J. Leskovec, “Patterns of Temporal Variation in Online A = ij 0 otherwise Media,”inProc.WSDM. ACM,2011. [17] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the Spread of then (cid:0)Ak(cid:1) denotes number of ways the epidemic agent can InfluencethroughASocialNetwork,”inProc.KDD. ACM,2003. sj [18] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst, reach v from v in k steps and “Patterns of Cascading Behavior in Large Blog Graphs,” in Proc. Int. j s ConfonDataMining. SIAM,2007. (cid:88) q ∝ (Ak) . [19] M. Gomez-Rodriguez, J. Leskovec, and B. Scho¨lkopf, “Structure and k sj DynamicsofInformationPathwaysinOnlineMedia,”inProc.WSDM. j ACM,2013. For causal graphs, A is strictly upper triangular and there- [20] M. Barthelemy, A. Barrat, R. Pastor-Satorras, and A. Vespignani, fore nilpotent. That is, ∃K ≤n:Ak =0∀k >K. “VelocityandHierarchicalSpreadofEpidemicOutbreaksinScale-free Networks,”PhysicalReviewLetters,vol.92,no.17,p.178701,2004. Hence,ρ(A)<1whichistosay∃u>0:(cid:107)Ak(cid:107)F <u∀k [21] M.Newman,“TheSpreadofEpidemicDiseasesonNetworks,”Physical and therefore ReviewE,vol.66,no.1,p.016128,2002. [22] R. Pastor-Satorras and A. Vespignani, “Epidemic Spreading in Scale- 0≤(Ak) ≤(cid:107)Ak(cid:107) <u. FreeNetworks,”PhysicalReviewLetters,vol.86,no.14,pp.3200–3203, sj F 2001. Accordingly, (cid:80) (Ak) ∈O(kα−1) for some α≥1. Setting [23] D. Bild, Y. Liu, R. Dick, Z. M. Mao, and D. Wallach, “Aggregate j sj CharacterizationofUserBehaviorinTwitterandAnalysisoftheRetweet tk = (cid:15)k where (cid:15) ≤ 1 yields qk ∈ O(tαk−1) which justifies Graph,”arXiv:1402.2671[cs.SI],2014. (14). [24] J. Kunegis, “KONECT: the Koblenz Network Collection,” in Proc. WWW. ACM,2013. Put in simple terms, this is to say that it is sufficient to [25] L.deHaanandA.Ferreira,ExtremeValueTheory. Springer,2006. model the probability qk of observing an epidemic path of [26] H.Rinne,TheWeibullDistribution. Chapman&Hall/CRC,2008. length k as a polynomial in t rather than, say, an exponential [27] G. Crooks, “The Amoroso Distribution,” arXiv:1005.3274 [math.ST], function. 2010. [28] E.Stacy,“AGeneralizationoftheGammaDistribution,”TheAnnalsof The fact that the parameter α reflects aspects of network MathematicalStatistics,vol.33,no.3,pp.1187–1192,1962. topology is easily seen from the examples in Fig. 10. Assum- [29] C. Bauckhage, “Computing the Kullback-Leibler Divergence between ing the central node to be the source node of an epidemic twoGeneralizedGammaDistributions,”arXiv:1401.6853[cs.IT],2014. [30] R.JennrichandR.Moore,“MaximumLikelihoodEstimationbyMeans the figure shows that q may be constant (α=1), decreasing k of Nonlinear Least Squares,” in Proc. of the Statistical Computing (α<1), or increasing (α>1) with k. Section. AmericanStatisticalAssociation,1975.