Infinite Edge Partition Models for Overlapping Community Detection and Link Prediction Mingyuan Zhou McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, USA 5 1 Abstract beenwelladdressedbythem. Inthispaper,wewillfit 0 2 unweighted undirected relational networks using non- parametric Bayesian generative models, which can be c A hierarchical gamma process infinite edge e used to simulate random networks, detect latent over- partition model is proposed to factorize D lapping communities and community-community in- the binary adjacency matrix of an un- teractions,andpredictmissingedges,withthenumber 0 weighted undirected relational network un- of communities automatically inferred from the data. 3 der a Bernoulli-Poisson link. The model de- scribesbothhomophilyandstochasticequiv- For a relational network, a community can be consid- ] L alence, andisscalabletobigsparsenetworks ered as a subset of nodes (vertices) that are densely M byfocusingitscomputationonpairsoflinked connected to each other but sparsely to the others, nodes. It can not only discover overlap- such as those in a social network, or it can be con- . t pingcommunitiesandinter-communityinter- sidered as a subset of nodes that are sparsely con- a t actions, but also predict missing edges. A nected to each other but densely connected to the s simplified version omitting inter-community nodes belonging to another community, such as those [ interactions is also provided and we reveal in a network consisting of carnivores and herbivores: 2 itsinterestingconnectionstoexistingmodels. tigers and bears both hunt deers but rarely prey on v The number of communities is automatically each other. The former phenomenon is usually de- 8 inferred in a nonparametric Bayesian man- scribed as assortativity or homophily, while the lat- 1 2 ner, and efficient inference via Gibbs sam- terisknownasdissortativityorstochasticequivalence 6 pling is derived using novel data augmenta- (Hoff,2008). Asarelationalnetworkmayexhibitboth 0 tiontechniques. Experimentalresultsonfour homophily and stochastic equivalence, an algorithm . 1 real networks demonstrate the models’ scal- capable of modeling both phenomena would usually 0 ability and state-of-the-art performance. be preferred if no prior information on assortativity is 5 available. Ifanalyzingassortativenetworkswithdense 1 intra-community connections is the main goal, then : v 1 INTRODUCTION onemayconsideranassortativealgorithmthatmodels i X homophily but not necessarily stochastic equivalence. r Community detection and link prediction are two im- The stochastic blockmodel (SBM) is a popular la- a portant problems in network analysis. A vast num- tent class model to detect latent communities (Hol- ber of community detection algorithms based on vari- land et al., 1983; Nowicki and Snijders, 2001). It par- oususefulheuristics,suchasmodularitymaximization titions the nodes into disjoint communities, and mod- (Newman and Girvan, 2004) and clique percolation els the probability for an edge to exist between two (Pallaetal.,2005),havebeenproposed. SeeFortunato nodes solely based on which two communities that (2010) for a comprehensive review. These algorithms, they belong to. It is simple and scalable, and mod- however,arenotbasedongenerativemodelsandhence els both homophily and stochastic equivalence. In ad- usually cannot be used to generate networks and pre- dition, the infinite relational model, a nonparametric dict missing edges (links). Moreover, how to set the Bayesian extension of the SBM based on the Chinese number of communities is a critical issue that has not restaurant process (Aldous, 1985), allows the number of communities to be automatically inferred from the Appearing in Proceedings of the 18th International Con- data (Kemp et al., 2006). Despite these attractive ference on Artificial Intelligence and Statistics (AISTATS) properties, theSBMisrestrictiveinthatcommunities 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors. are not allowed to overlap. In practice, however, it is Infinite Edge Partition Models commonforanodetobelongtomultiplecommunities, are related to the models in Miller et al. (2009) and motivating the development of more advanced latent Morup et al. (2011) that use the Indian buffet pro- class models, such as the mixed-membership stochas- cessofGriffithsandGhahramani(2005)tosupportan ticblockmodel(MMSB)ofAiroldietal.(2008)andits infinite binary feature matrix. The proposed models various extensions (Gopalan et al., 2012; Kim et al., depart from existing ones with several distinctions: 1) 2013). The MMSB generalizes the SBM to allow a a Bernoulli-Poisson link connects each edge to a la- node to participate in multiple communities, yet since tentcountthatisfurtherpartitioned;2)ahierarchical it has to infer two community indicators for each pair gamma process is constructed to support an infinite of nodes, regardless of whether an edge exists in that number of communities and an infinite-dimensional pair,itscomputationgrowsquadraticallyasafunction square matrix to describe community-community in- of the number of nodes N. Moreover, the number of teractions; 3) two nonparametric Bayesian EPMs are communities in the MMSB is a model parameter that constructed to factorize the N ×N binary adjacency needs to be carefully selected. matrix under the Bernoulli-Poisson link, supporting a nonnegative feature matrix with an unbounded num- In this paper, instead of clustering nodes, as in the berofcolumns,andatthesametimeassigneachedge SBM,orclusteringallpossibleedges,asintheMMSB, andhenceeachnodetooneormultiplelatentcommu- we propose the edge partition model (EPM) to parti- nities; and 4) efficient and scalable Bayesian inference tiononlytheobservededges,whichreadilyleadstothe via Gibbs sampling is provided. partition of nodes: if the edges linked to a node are partitioned into multiple communities, then the node 2 FACTOR ANALYSIS AND is naturally affiliated with all these communities, and BERNOULLI-POISSON LINK couldbehardassignedtoasinglecommunitythathas the strongest presence in its edges. In contrast to the OurbasicideaistofactorizetheBINARYnetworkad- SBM, the EPM allows communities to overlap; and jacencymatrixusingtoolsdevelopedforCOUNTdata in contrast to the MMSB that spends O(N2) compu- analysis,andtodiscoveroverlappingcommunitiesand tation clustering all possible edges, the EPM spends their interactions by examining how the latent count O(d¯N) computation partitioning only observed edges, foreachedgeispartitioned. Thissectionwillprimarily where d¯is the average degree (number of edges) per discussindividualmodelcomponentsandtheirproper- node, leading to notable computational savings as d¯ ties,withhierarchicalBayesianmodelspresentedlater. is often much smaller than N in a big sparse network commonly observed in practice. 2.1 Poisson Factor Analysis To support a potentially infinite number of commu- We propose a Poisson factor model for a weighted nities and to model both homophily and stochastic undirected N ×N relational network as equivalence in an unweighted undirected relational network, we propose a hierarchical gamma process m ∼Po(cid:16)(cid:80)K (cid:80)K φ λ φ (cid:17), (1) (HGP) EPM, which links each observed edge to a la- ij k1=1 k2=1 ik1 k1k2 jk2 tentcountusingaBernoulli-Poissonlink,andthenfac- where m ≡ m is the integer-valued weight ij ji torizes the latent N ×N random count matrix. The (observed or latent) that links nodes i and j, HGP supports the EPM to have an infinite dimen- (φ ,...,φ ) is the positive feature vector for node i1 iK sionalfeaturevectorforeachnodetodescribeitsaffil- i, λ ≡ λ is a positive rate, and the symbol iations with communities, and an infinite dimensional k1k2 k2k1 ≡ denotes “equal by definition.” This model is con- squareratematrix,whosediagonalandoff-diagonalel- ceptually simple: with φ measuring how strongly ementsdescribetheintra-andinter-communityinter- ik1 node i is affiliated with community k and λ mea- actions, respectively. We also propose a gamma pro- 1 k1k2 suring how strongly communities k and k inter- 1 2 cess EPM as a simplified version of the HGP-EPM, act with each other, the product φ λ φ mea- whichomitsinter-communityinteractionstogainsim- ik1 k1k2 jk2 sures how strongly nodes i and j are connected due pler inference and faster computation at the expense to their affiliations with communities k and k , re- 1 2 of reduced ability to model stochastic equivalence. spectively, and a weighted combination of all intra- Conceptually, our idea of directly partitioning edges communityweights{λkk}1≤k≤K andinter-community and implicitly partitioning nodes into communities is ones {λk1k2}1≤k1(cid:54)=k2≤K is the expected value of mij. related to the one in Ahn et al. (2010) and Evans and The factor model in (1) makes intuitive sense. For Lambiotte(2009). Intermsofconstruction,ourEPMs example,supposepersonsiandj arebothresidentsof are related to the Poisson factor models of Ball et al. CityAvatarandactivemembersoftheAvataranglers (2011)andtheEigenmodelofHoff(2008). Intermsof Meetupgroupthatorganizesfishingtripsregularly. In supporting an infinite number of features, our EPMs addition, persons i is an active member of the Avatar Mingyuan Zhou artificial intelligence (AI) Meetup group while person where x ∼ Po (λ) follows a truncated Poisson dis- + j is an active member of the Avatar statistics Meetup tribution, with P(x = k) = (1−e−λ)−1λke−λ/k! for group. Denotingm asthenumberoftimesthatiand k ∈{1,2,...}. Thusifb=0,thenm=0almostsurely ij j attend the same group meeting in 2015, then due to (a.s.), and if b = 1, then m ∼ Po (λ), which can be + theirstrongaffiliationswiththeanglersMeetupgroup, simulated with rejection sampling: if λ ≥ 1, we draw m would have a large expected value, which is likely m ∼ Po(λ) till m ≥ 1; and if λ < 1, we draw both ij tobefurtherincreasediftheAIandstatisticsMeetup n ∼ Po(λ) and u ∼ Unif(0,1) till u < 1/(n+1), and groups hold joint events regularly. then let m = n+1. The acceptance rate is 1−e−λ if λ ≥ 1 and λ−1(1−e−λ) if λ < 1, and reaches its To model an assortative relational network exhibiting minimum, 63.2%, when λ=1. homophily but not necessarily stochastic equivalence, we may omit the inter-community interactions by let- The BerPo link shares some similarities with the pro- ting λ ≡0 for k (cid:54)=k and simplify (1) as bit link that thresholds a normal random variable at k1k2 1 2 zero, and the logit link that lets b∼Ber[ex/(1+ex)]. (cid:16) (cid:17) m ∼Po (cid:80)K r φ φ , (2) We advocate the BerPo link as an alternative to the ij k=1 k ik jk probit and logit links since if b = 0, then m = 0 a.s., which could lead to significant computational savings where r ≡ λ indicates the prevalence of commu- k kk if a considerable proportion of the data are equal to nity k, and two nodes with similar latent features are zero. Inaddition, theadditivepropertyofthePoisson encouragedtobelinkedbyanedgewithalargeweight. allows us to model the link strength between any two We note Ball et al. (2011) had examined a model re- nodes by aggregating the contributions of all possible lated to (2) and briefly mentioned a model related intra-andinter-communityinteractions,andthecon- to (1). However, they used a heuristic approach to jugacy between the Poisson and gamma distributions model binary data under the Poisson distribution, did makesitconvenienttoconstructhierarchicalBayesian not provide a principled way to set the number of models amenable to posterior simulation. communities K, and had to create possibly nonexis- tentself-edgesinordertoderivetractableexpectation- 2.3 Overlapping Community Structures maximization(EM)inference. Thispaperwilladdress alltheseissuesrigorously,inanonparametricBayesian Note that (1) can be augmented as manner, and carefully examine the models in both (2) (cid:88)(cid:88) and (1) and provide efficient Bayesian inference. m = m , m ∼Po(φ λ φ ). (4) ij ik1k2j ik1k2j ik1 k1k2 jk2 k1 k2 2.2 Bernoulli-Poisson Link where m represents how often nodes i and j in- ik1k2j teractduetotheiraffiliationswithcommunitiesk and To use the Poisson factor models in (1) and (2) for an 1 k , respectively. We may consider that the model is unweighted network with a binary adjacency matrix, 2 partitioning the count m into {m } , weintroduceaBernoulli-Poisson(BerPo)linkfunction ij ik1k2j 1≤k1,k2≤K and hence we call the Poisson factor model in (1) to- that thresholds a random count at one to obtain a gether with the BerPo link in (3) as an edge partition random variable in {0,1} as model (EPM), in which each edge is partitioned ac- cording to all possible K2 community-community in- b=1(m≥1), m∼Po(λ), (3) teractions, and how strongly node i is affiliated with where b = 1 if m ≥ 1 and b = 0 if m = 0. The intu- community k can be measured with φikωik, where ition is that two nodes are connected if they interact (cid:80) (cid:80) ω := φ λ (5) at least once. The mathematical motivation is after ik j(cid:54)=i k(cid:48) jk(cid:48) kk(cid:48) transformingabinary-modelingproblemintoacount- represents how strongly node i would interact with all modeling one, one is readily equipped with a rich set the other nodes through its affiliation with commu- of statistical tools developed for count data analysis nity k. We further introduce the latent count usingthePoissonandnegativebinomialdistributions. (cid:80) (cid:80) (cid:80) (cid:80) m := m + m , (6) If m is marginalized out from (3), then given λ, one ik·· j>i k2 ikk2j j<i k1 jk1ki obtains a Bernoulli random variable as torepresenthowoftennodeiisconnectedtotheother nodes due to its affiliation with community k. We b∼Ber(cid:0)1−e−λ(cid:1). can then assign node i to multiple communities in {k : m ≥ 1}, or (hard) assign it to a single com- ik·· The conditional posterior of m can be expressed as munityusingeitherargmax(φ ω )orargmax(m ). ik ik ik·· k k (m|b,λ)∼b·Po (λ), SimilaranalysisappliestoasimplerEPMbuilton(2). + Infinite Edge Partition Models By hard assigning each node to a single community set A ⊂ Ω (Ferguson, 1973; Kingman, 1993). The and ordering the nodes from the same community to L´evy measure of the gamma process can be expressed be adjacent to each other, we expect the ordered ad- asν(drdφ)=r−1e−c0rdrG0(dφ),andadrawfromthe jacency matrix to exhibit a block structure, where the gammaprocess,consistingofcountablyinfiniteatoms, blocks along and off the diagonal represent the intra- can be expressed as G=(cid:80)∞ r δ , where φ i∼idg , and inter- community connections, respectively. G = γ g , g (dφ) = G (dk=φ1)/γk φiks the basekdistr0i- 0 0 0 0 0 0 bution, and γ = G (Ω) is the mass parameter. A 0 0 2.4 Scalability for Big Sparse Networks gammaprocessbasedmodelhasaninherentshrinkage WearemotivatedtoconstructtheEPMsbecausethey mechanism, as in the prior the number of atoms with not only allow each edge and hence each node to par- rk greater than ε ∈ R+ follows Po(γ0(cid:82)ε∞r−1e−crdr), ticipateinmultiplecommunities,butalsoreadilyscale whose Poisson rate decreases as ε increases. toabigsparsenetworkwhoseaveragedegreepernode Given G=(cid:80)∞ r δ , we further define a relational is much smaller than N. A key observation for scal- k=1 k φk gamma process (rΓP) as able computation is that (1) can be augmented as (4), where m = 0 a.s. for any k and k if no Λ|G∼rΓP(G,ξ,1/β), (8) ik1k2j 1 2 edge exists between nodes i and j (i.e., m = 0). ij On a sparse network, where the edges constitute only a draw from which is defined as a small portion of all possible N2 edges, this prop- Λ=(cid:80)∞ (cid:80)∞ λ δ , erty makes the EPMs computationally appealing. By k1=1 k2=1 k1k2 (φk1,φk2) contrast, conceptually related models, including the where ξ and β are both in R+, λ ≡λ , and MMSB of Airoldi et al. (2008), Eigenmodel of Hoff k2k1 k1k2 (2008) and latent feature relational model of Miller (cid:40) Gam(ξr ,1/β) , if k =k , et al. (2009), spend computation indiscriminately on λ ∼ k1 2 1 k1k2 all pairs of nodes (i,j) no matter whether an edge ex- Gam(rk1rk2,1/β), if k2 >k1. istsbetweennodesiandj,andhencetheyhaveO(N2) GivenarelationalgammaprocessdrawΛ,wegenerate computation and do not scale well as N increases. a binary adjacency matrix B∈{0,1}N×N as 3 EDGE PARTITION MODELS (cid:34) ∞ ∞ (cid:35) (cid:89) (cid:89) (cid:16) (cid:17) B|Λ∼Ber 1− exp −φ λ φT . (9) k1 k1k2 k2 3.1 Hierarchical Gamma Process k1=1k2=1 The EPM takes a weighted combination of all pos- Equations (9), (8) and (7) constitute an HGP-EPM sible intra- and inter- community weights to explain that supports countably infinite atoms and a count- each pair of node, however, the number of commu- ably infinite square matrix, the total sum of whose nities K is still a model parameter that needs to be elements has a finite expectation, as shown in the fol- set appropriately. To allow K to be inferred from lowing Lemma, with proof provided in the Appendix. the data and potentially grow to infinity, we need Lemma 1. The expectation of (cid:80)∞ (cid:80)∞ λ is to introduce a stochastic process that can generate a k1=1 k2=1 k1k2 finite and can be expressed as countably infinite number of atoms {φ } , where k 1,∞ φ = (φ ,...,φ )T measures how strongly the N (cid:20)(cid:88)(cid:88) (cid:21) ξ 1 k 1k Nk E λ = γ + γ2. nodes are affiliated with community k, and an infinite k1k2 c β 0 c2β 0 dimensional square matrix {λ } , where k1 k2 0 0 λ = λ measures how stkro1kn2gl1y≤cko1,mk2m≤∞unities k The usual scenario to consider an HGP construction ank2dk1k intke1rka2ct with each other. Moreover, we need1 is when one models grouped data and wishes to share 2 to ensure (cid:80)∞ (cid:80)∞ λ to be finite a.s. and we statistical strengths across groups. For example, the may wish toki1m=p1osek2s=om1 eks1tkr2uctural regularization on gamma-negative binomial process of Zhou and Carin the infinite square matrix. (2012), related to the hierarchical Dirichlet process of Teh et al. (2006), is considered for topic modeling, To satisfy all these needs, we first define whereeachdocumentisassociatedwithagammapro- cess, and these gamma processes are coupled by shar- G∼ΓP(G ,1/c ) (7) 0 0 ing a lower-level (i.e., further from the data) gamma as a gamma process on a product space R+ × Ω, process as their atomic base measure. The proposed where R+ = {x : x > 0}, Ω is a complete sepa- HGP is distinct in that the product of the weights rable metric space, 1/c is a positive scale parame- of any two atoms of the lower-level gamma process is 0 ter, and G is a finite and continuous base measure, usedtoparameterizetheshapeparameterofagamma 0 such that G(A) ∼ Gam(G (A),1/c ) for each Borel random variable higher in the hierarchy. 0 0 Mingyuan Zhou The proposed HGP also helps express our prior be- 3.3 Gamma Process EPM lief that an atom with a small weight tends to repre- If we omit inter-community interactions by letting sent a small community, which also tends to interact λ ≡ 0 for k (cid:54)= k and λ ≡ r , then the HGP- with the others less frequently. Note that if we let k1k2 1 2 kk k EPM reduces to a gamma process EPM (GP-EPM), λ ∼ Gam(r2,1/β), then the expectation of the ma- kk k which is likely to well fit assortative networks but not trix{λ } given{r } hasarankofone. k1k2 1≤k1,k2≤∞ k 1,∞ necessarily disassortative ones. We notice an inter- Weuseξr insteadofr2 astheshapeparameterofλ k k kk esting connection to the community-affiliation graph to allow r to be inferred with Gibbs sampling and to k model(AGM)ofYangandLeskovec(2012,2014): the prevent overly shrinking λ for small communities. kk GP-EPM generates an edge with probability Note that Palla et al. (2014) proposed a reversible in- finite hidden Markov model using a related HGP in- P(b =1)=1−(cid:81) {1−[1−exp(−r φ φ )]}; (12) ij k k ik jk finite square rate matrix, the normalization of whose each row represents a state transition probability vec- if we define pk = 1 − e−rk and further impose the restriction that φ ∈{0,1}, then (12) reduces to tor. Our HGP serves a distinct modeling purpose; no ik normalization is required for the infinite square rate P(b =1)=1−(cid:81) (1−p ), (13) matrix, and our model allows exploiting unique data ij k∈Cij k augmentation techniques to infer both λk1k2 and rk where Cij = {k : φik = 1 and φjk = 1} ⊂ {1,...,K} withclosed-formGibbssamplingupdateequations,as is a set of communities that nodes i and j share; note discussed in Section 3.4 and the Appendix. that (13) is almost the same as the AGM of Yang and Leskovec (2012, 2014). In fact, one may consider the 3.2 Hierarchical Gamma Process EPM GP-EPM with bij ∼ Ber[1−e−(cid:15)(cid:81)kexp(−rkφikφjk)], where (cid:15) ∈ R+ and φ ∈ {0,1}, as a nonparametric ik We choose the base distribution of the gamma pro- Bayesian AGM. Similarly, we also notice that (11) of cessG∼ΓP(G0,1/c0)asg0(φ)=(cid:81)Ni=1Gam(ai,1/ci). the HGP-EPM is related to the model of Morup et al. For implementation convenience, we consider a dis- (2011) if we restrict φ ∈{0,1}. crete base measure as G = (cid:80)K γ0δ , where K is ik 0 k=1 K φk Yang and Leskovec (2012, 2014) argue that all pre- a truncation level that is set large enough to ensure a vious community detection methods, including clique good approximation to the truly infinite model. We percolationandMMSB,wouldfailtodetectcommuni- express the (truncated) HGP-EPM as tieswithdenseoverlaps,becausetheyallhadahidden assumption that a community’s overlapping parts are K K (cid:88) (cid:88) b =1(m ≥1), m = m , less densely connected than its non-overlapping ones. ij ij ij ik1k2j The same as the AGM, both the GP-EPM and HGP- k1=1k2=1 EPM do not make such a restrictive assumption, and m ∼Po(φ λ φ ), ik1k2j ik1 k1k2 jk2 they both allow overlaps of communities to be denser φ ∼Gam(a ,1/c ), a ∼Gam(e ,1/f ), ik i i i 0 0 than communities themselves; Beyond the AGM, we (cid:40) Gam(ξr ,1/β), if k =k , do not restrict φ to be either zero or one, and our λk1k2 ∼ k1 2 1 generative modelsikare built under a rigorous nonpara- Gam(r r ,1/β), if k >k , k1 k2 2 1 metric Bayesian framework with efficient Bayesian in- r ∼Gam(γ /K,1/c ), (10) ference, as presented below. k 0 0 3.4 MCMC Inference where λ ≡ λ and conjugate gamma priors are imposedko2kn1γ ,ξ,k1ck2,c andβ. Notethatmarginalizing In this paper, we consider an unweighted undirected 0 0 i out both mij and mik1k2j from (10) leads to nfientewdo.rTk,huwshwereeobnjily≡cobnijsiadnedr bselff-olirnjks>biiiianr(e1n0o).tLdeet- ij m be defined as in (6) and m as (cid:20) (cid:89)K (cid:89)K (cid:21) ik·· ·k1k2· bij ∼Ber 1− exp(−φik1λk1k2φjk2) . (11) m·k1k2· :=2−δk1k2 (cid:80)i(cid:80)j>i(mik1k2j +mik2k1j), k1=1k2=1 where δ = 1 if k = k and δ = 0 otherwise. k1k2 1 2 k1k2 A noticeable advantage of the augmented representa- Using (5) and the Poisson additive property, we have tion in (10) over (11) is that (10) is amenable to pos- m ∼Po(φ ω ), (14) terior simulation, as discussed in Section 3.4. ik·· ik ik m ∼Po(λ θ ), (15) Note that similar to Hoff (2008) and Lloyd et al. ·k1k2· k1k2 k1k2 (2012), we assume that the nodes are exchangeable where θk1k2 := 2−δk1k2 (cid:80)i(cid:80)j(cid:54)=iφik1φjk2 represents andhencethediscussionsofHoover(1982)andAldous how strongly the nodes interact through communi- (1985) on exchangeability also apply to our EPMs. ties k and k . Marginalizing out φ from (14) 1 2 ik Infinite Edge Partition Models and λ from (15), with p(cid:48) := ω /(c +ω ) and twosmall-scalebenchmarknetworks,forwhichwetest k1k2 ik ik i ik p˜ :=θ /(β+θ ), we have all algorithms and set the truncation level as K = k1k2 k1k2 k1k2 max 100forouralgorithms,andanothertwonetworkswith mik·· ∼NB(ai,p(cid:48)ik), (16) more than 2000 nodes, for which we set Kmax =256. m·k1k2· ∼NB(cid:2)rk1ξδk1k2(rk2)1−δk1k2, p˜k1k2(cid:3). (17) To test a model’s ability to predict missing edges of an unweighted undirected relational network, we ran- Using the BerPo link, the gamma-Poisson conjugacy, domly3 hold out 20% pairs of nodes and use the the and the augment-and-conquer techniques to infer the remaining 80% to predict the probability for an edge negative binomial dispersion parameters (Zhou and toexistineachoftheseheld-outpairs. Lettingo =0 ij Carin, 2012, 2015), we exploit (14)-(17) to derive if b is held out and o = 1 otherwise, we only need ij ij closed-form Gibbs sampling update equations for all to slightly modify the inference by only considering model parameters except γ , and construct an excel- {b : o = 1} in the likelihood. For example, ω in 0 ij ij ik (cid:80) (cid:80) lent proposal distribution to sample γ using an in- (5)wouldberedefinedasω = λ φ . 0 ik j:oij=1 k(cid:48) kk(cid:48) jk(cid:48) dependence chain Metropolis-Hastings algorithm. We We consider exactly the same five random training- present in the Appendix the details of MCMC infer- testing partitions for all algorithms and report the ence for the HGP-EPM, and the hierarchical model average area under the curve (AUC) of both the re- and closed-form Gibbs sampling update equations for ceiver operating characteristic (ROC) and precision- the GP-EPM. The inference of the nonparametric recall (PR) curves (Davis and Goadrich, 2006). For Bayesian AGM would be almost the same as that of link prediction, the AUC-PR is more sensitive to the theGP-EPM,withtheonlydifferencethatthe(φ |−) percentage of true edges among the top ranked ones. ik would be sampled from Bernoulli distributions. Note that in addition to link prediction, the HGP- EPM, GP-EPM, AGM and IRM all have easily in- 4 EXPERIMENTAL RESULTS terpretable latent representations that will be used to detect overlapping/disjoint communities. For comparison, we consider the infinite relational model(IRM)ofKempetal.(2006),theEigenmodelof 4.1 Protein230 Network Hoff (2008), the infinite latent attribute (ILA) model of Palla et al. (2012), the AGM of Yang and Leskovec We first consider the Protein230 dataset of Butland (2012, 2014), and our GP- and HGP-EPMs. We use et al. (2005) that describes the interactions between the R package provided for the Eigenmodel. We use 230 proteins, with 595 edges. This is a small-scale theILAcode1providedforPallaetal.(2012),inwhich benchmarknetworkthatexhibitsbothhomophilyand it is shown that the ILA outperforms the related non- stochastic equivalence, as shown in Hoff (2008) and parametric latent feature relational model of Miller also tested in Lloyd et al. (2012). We are able to etal.(2009). WeimplementanonparametricBayesian run 3000 MCMC iterations quickly enough for all al- version of the AGM as a special case of the GP-EPM, gorithms except for the ILA on this network. asdiscussedinSection3.3. MatlabcodefortheEPMs AsshowninTab. 1,theHGP-EPMhasthebestover- is available on the author’s website. all performance. The Eigenmodel is the second best For the Eigenmodel, we find the best K in withK =10andtheIRMisthethirdbest. TheAGM {5,10,25,50}. For the ILA, we use its default param- is not competitive as it restricts its features to be bi- eter setting2. For the IRM, we choose Beta(0.1,1) as nary. Inthisandallfuturetables,wehighlightinbold the prior for each latent block and Gam(0.01,1/0.01) boththebestresultandtheonesthatarelessthanone as the prior for the Chinese restaurant process con- standard error away from the best. Below we analyze centration parameter; for the nonparametric Bayesian why the HGP-EPM performs the best while the sim- AGM,weletφik ∼Ber(πi), πi ∼Beta(0.01,0.01)and pler GP-EPM is not that competitive on this dataset. (cid:15) ∼ Gam(0.01,1/0.01); these parameters are found to As shown in Figs. 1 (b)-(d), the HGP-EPM captures consistently provide good performance. For our mod- both homophily and stochastic equivalence by accu- els’ hyper-parameters, we choose e = f = 0.01 and 0 0 rately modeling both diagonal and off-diagonal dense let γ , c , c and β be all drawn from Gam(1,1). 0 i 0 regionsoftheadjacencymatrix;theGP-EPMcaptures Weconsider3000MCMCiterationsandcollectthelast homophily by accurately modeling diagonal dense re- 1500 samples, unless otherwise stated. We consider gionsthatrepresentintra-communityinteractions,but at the expense of creating nonexistent blocks in order 1 http://mlg.eng.cam.ac.uk/konstantina/ILA/ILAcode(v1).tar.gz to fit dense off-diagonal regions that represent strong 2Thedefaulttraining/testingpartitionoftheILAcode sendsself-edgesintothetestingset;whereasinthispaper, wedonotintendtopredictself-edgesandhencewedonot 3If removing an edge disconnects a node to all the oth- allow them to appear in the testing set. ers, then the edge will be kept in the training set. Mingyuan Zhou (a) Protein interaction network (b) HGP−EPM 50 50 100 100 150 150 200 200 50 100 150 200 50 100 150 200 (c) GP−EPM (d) IRM Figure 2: The inferred latent feature matrix {φk} and community-communityinteractionratematrix{λ }for k1k2 theHGP-EPMonProtein230. Thenodesarereorderedto 50 50 make a node with a larger index belong to the same or a 100 100 smaller-size community, and the communities are ordered 150 150 tomakeacommunitywithalargerindextohaveasmaller size. The pixel values are displayed on the log-10 scale. 200 200 50 100 150 200 50 100 150 200 4.2 NIPS234 Coauthor Network Figure 1: Comparisonofthreemodelsonestimatingthe Weconsiderthesmall-scaleNIPS234networkconsists link probabilities for the Protein230 network using 80% of of the top 234 authors in NIPS 1-17 conferences4 in its node pairs. The nodes are reordered to make a node with a larger index belong to the same or a smaller-size terms of the number of publications, as studied in community,wherethedisjointcommunityassignmentsare Miller et al. (2009). There are 598 edges. As shown obtained by analyzing the results of the HGP-EPM. (a) inTab. 2, theGP-EPMandHGP-EPMhavethebest Thebinaryadjacencymatrix. (b)-(c)Estimatedlinkprob- overall performance, followed by the IRM. Compar- abilitiesdisplayedonthelog-10scalefrom−2to1,witha ing with the simpler GP-EPM, the extra flexibility to brighter color representing a higher link probability. modelstochasticequivalencedoesnotbringtheHGP- Table1: Comparisonofsixalgorithmsonpredictingmiss- EPM additional advantages on this dataset, which is ing edges of the Protein230 network. The Eigenmodel achieves its best performance at K =10. not surprising as Fig. 3 suggests that this coauthor network mainly exhibits homophily. Note that the Model AUC-ROC AUC-PR IRM performs well measured by the AUC-ROC, but IRM 0.9338 ± 0.0128 0.5026 ± 0.0676 Eigenmodel 0.9314 ± 0.0188 0.5468 ± 0.0500 its AUC-PR is clearly worse than those of the EPMs. ILA 0.8971 ± 0.0297 0.3693 ± 0.0234 This may again be explained by its overly smoothed AGM 0.9145 ± 0.0160 0.3339 ± 0.0359 cartoonish estimation that overlooks small communi- GP-EPM 0.9335 ± 0.0110 0.4011 ± 0.0452 ties, as clearly shown in Fig. 3 (d). HGP-EPM 0.9519 ± 0.0100 0.5655 ± 0.0505 Table2: Comparisonofsixalgorithmsonpredictingmiss- inter-community interactions; and the IRM captures ing edges of the NIPS234 coauthor network. The Eigen- these large dense blocks, but produces a cartoonish model achieves its best performance at K =10. estimation, which overlooks small communities that Model AUC-ROC AUC-PR represent fine details along the diagonal. IRM 0.9476 ± 0.0114 0.6677 ± 0.0201 Eigenmodel 0.9269 ± 0.0177 0.6784 ± 0.0364 Fig. 2 shows how the HGP-EPM works. First, ILA 0.9171 ± 0.0222 0.6793 ± 0.0295 each feature vector φ shown in Fig. 2 (a) clearly k AGM 0.8906 ± 0.0164 0.5842 ± 0.0357 describes how strongly the nodes are affiliated with GP-EPM 0.9501 ± 0.0123 0.7415 ± 0.0319 the community it represents, and each node may HGP-EPM 0.9469 ± 0.0163 0.7289 ± 0.0540 have large weights on multiple community. Sec- ond, about 30 latent feature vectors are inferred and 4.3 Yeast and NIPS12 Networks the remaining ones are essentially drawn from the prior (cid:81) Gam(a ,1/c ). Third, the inter- and intra- We also consider the Yeast5 protein interaction net- i i i community interaction strengths in Fig. 2 (b) can be work of Bu et al. (2003), with 2361 nodes and 6646 matchedtothecorrespondingcommunities(subsetsof non-self edges, and the NIPS12 coauthor network6 nodes) in Figs. 1 (a) and (b). For example, Fig. 2 (a) thatincludesallthe2037authorsinNIPSpapersvols suggeststhatthefirstandsecondlargestcommunities 0-12, with 3134 edges. These two median-size net- have 24 and 22 nodes, respectively, and Fig. 2 (b) works are already too large for the Eigenmodel and suggests that the first and second communities have ILA to produce reasonable results given our computa- sparseanddenseintra-communityconnections,respec- tional resources. The results in Tabs. 3 and 4 show tively, and have denser connections between them, as 4http://chechiklab.biu.ac.il/∼gal/data.html confirmed by examining the block structures within 5 http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm thetop-left 46×46corner ofbothFigs. 1(a) and(b). 6http://www.cs.nyu.edu/∼roweis/data.html Infinite Edge Partition Models (a) NIPS234 coauthor network (b) HGP−EPM (a) Protein230 (b) NIPS234 102 102 HGP−EPM 11055000 11055000 Size of a community 101 GAIRGPM−MEPM 101 200 200 10100 0 101 102 10100 0 101 102 50 100 150 200 50 100 150 200 (c) Yeast (d) NIPS12 103 103 (c) GP−EPM (d) IRM munity 102 102 10500 10500 Size of a com 101 101 150 150 10100 0 102 10100 0 102 200 200 Rank of a community Rank of a community Figure 4: A community’ size and its rank. 50 100 150 200 50 100 150 200 Figure 3: Comparisonofthreemodelsonestimatingthe link probabilities for the NIPS234 coauthor network. (a)- model and ILA have at least O(N2 + NK) compu- (d) Analogous plots to Figures 1 (a)-(d). tation, where K is the number of latent features. With unoptimized Matlab on a 2.7 GHz CPU, for Table 3: Comparison of four algorithms on predicting 1000 MCMC iterations, the HGP-EPM (GP-EPM) missing edges of the Yeast protein interaction network. takes about 80 (20) seconds on Protein230, about 85 Model AUC-ROC AUC-PR (28) seconds on NIPS234, about 50 (18) minutes on IRM 0.9093 ± 0.0059 0.1878 ± 0.0142 Yeast, and about 32 (12) minutes on NIPS12. The AGM 0.9009 ± 0.0025 0.1225 ± 0.0129 GP-EPM 0.9331 ± 0.0014 0.2486 ± 0.0149 Eigenmodel with K = 25 takes about 200 seconds HGP-EPM 0.9367 ± 0.0012 0.2628 ± 0.0184 on NIPS234 to run 1000 MCMC iterations. For the ILA on NIPS234, we considered 1000 MCMC itera- Table 4: Comparison of four algorithms on predicting tions that took over 18 hours to run; for Protein230, missing edges of the NIPS12 coauthor network. the ILA inferred about two times more features as it Model AUC-ROC AUC-PR IRM 0.9427 ± 0.0121 0.2066 ± 0.0331 did on NIPS234, and we considered 500 MCMC itera- AGM 0.9328 ± 0.0049 0.2350 ± 0.0177 tions that took over 21 hours to run. GP-EPM 0.9768 ± 0.0079 0.4705 ± 0.0362 HGP-EPM 0.9762 ± 0.0081 0.4493 ± 0.0229 5 CONCLUSIONS that the HGP-EPM performs the best on the Yeast To model unweighted undirected relational networks protein-protein interaction network, which is found to characterizedbybothhomophilyandstochasticequiv- clearlyexhibitstochasticequivalencebyexaminingthe alence, we propose a hierarchical gamma process edge plots corresponding to the ones in Figs. 1 and 3 (not partitionmodel(EPM)thatsupportsaninfinitenum- shown for brevity), and the HGP-EPM and GP-EPM ber of communities and an infinite square rate matrix both perform well on the NIPS12 coauthor network, to describe community-community interactions. The which is found to mainly exhibit homophily by exam- EPM exploits a Bernoulli-Poisson link to assign a la- ining related plots (not shown for brevity). tent count to each binary edge, and further partitions that count according to the edge’s affiliations with all As discussed before, the HGP-EPM, GP-EPM, AGM pairsofcommunities,whichnaturallyleadstothepar- and IRM can all be used to assign nodes to disjoint tition of each node into overlapping communities. We communities. In Fig. 4 we plot the size of an inferred alsoprovideasimplergammaprocessEPMthatomits latent community as a function of its rank (smaller inter-community interactions, which is found to per- ranks indicate larger sizes) on the log-10 scale, for the form well on assortative networks. Efficient MCMC four scalable algorithms on the four tested real net- inference with closed-form update equations is pro- works. It is clear that in contrast to the other three vided. Experimental results on four real networks il- latent factor models, the IRM, a latent class model, lustrate the EPMs’ working mechanisms and proper- infers a smaller number of communities, with more ties, as well as their state-of-the-art performance and larger-sizeandfewersmaller-sizeones. Examiningthe interpretable latent representations. While previous details we find that the IRM tends to place all the latentfeaturerelationalmodelsandtheirnonparamet- low-degree nodes into one or several large-size com- ricBayesianversionsareoftennotscalable,ourinfinite munities, whereas the other models are able to better EPMsarereadilyscalabletonetworkswiththousands preserve fine details involving small-size communities. of nodes. It would be interesting to investigate strate- We mention that the HGP-EPM, GP-EPM and AGM giestomakethemscalabletorelationalnetworkswith have O(Nd¯+NK) computation, whereas the Eigen- millions of nodes and edges. Mingyuan Zhou References C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Ya- mada, and N. Ueda. Learning systems of concepts Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann. Link com- with an infinite relational model. In AAAI, 2006. munities reveal multiscale complexity in networks. Nature, pages 761–764, 2010. D. I. Kim, P. Gopalan, D. M. Blei, and E. Sudderth. Efficient online inference for Bayesian nonparamet- E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. ricrelationalmodels. InNIPS,pages962–970,2013. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research,pages1981– J. F. C. Kingman. Poisson Processes. Oxford Univer- 2014, 2008. sity Press, 1993. D. Aldous. Exchangeability and related topics. E´cole J.Lloyd,P.Orbanz,Z.Ghahramani,andD.Roy.Ran- d’etedeprobabilit´esdeSaint-FlourXIII-1983,pages domfunctionpriorsforexchangeablearrayswithap- 1–198, 1985. plications to graphs and relational data. In NIPS, 2012. B. Ball, B. Karrer, and M. E. J. Newman. Efficient andprincipledmethodfordetectingcommunitiesin K. Miller, M. I. Jordan, and T. L. Griffiths. Nonpara- networks. Physical Review E, 2011. metric latent feature models for link prediction. In NIPS, 2009. D. Bu, Y. Zhao, L. Cai, H. Xue, X. Zhu, H. Lu, J. Zhang, S. Sun, L. Ling, N. Zhang, G. Li, M. Morup, M. N. Schmidt, and L. K. Hansen. In- and R. Chen. Topological structure analysis of finite multiple membership relational modeling for the protein–protein interaction network in budding complex networks. In IEEE MLSP, 2011. yeast. Nucleic acids research, pages 2443–2450, M. E. J. Newman and M. Girvan. Finding and eval- 2003. uating community structure in networks. Physical G. Butland, J. M. Peregr´ın-Alvarez, J. Li, W. Yang, review E, page 026113, 2004. X. Yang, V. Canadien, A. Starostine, D. Richards, K.NowickiandT.A.B.Snijders. Estimationandpre- B. Beattie, N. Krogan, M. Davey, J. Parkinson, diction for stochastic blockstructures. JASA, pages J. Greenblatt, and A. Emili. Interaction network 1077–1087, 2001. containing conserved and essential protein com- G. Palla, I. Der´enyi, I. Farkas, and T. Vicsek. Uncov- plexes in Escherichia coli. Nature, pages 531–537, ering the overlapping community structure of com- 2005. plex networks in nature and society. Nature, pages J. Davis and M. Goadrich. The relationship between 814–818, 2005. Precision-Recall and ROC curves. In ICML, 2006. K.Palla,D.Knowles,andZ.Ghahramani.Areversible T. S. Evans and R. Lambiotte. Line graphs, link par- infinite HMM using normalised random measures. titions, and overlapping communities. Physical Re- In ICML, 2014. view E, page 016105, 2009. K. Palla, D. A. Knowles, and Z. Ghahramani. An T. S. Ferguson. A Bayesian analysis of some nonpara- infinite latent attribute model for network data. In metric problems. Ann. Statist., 1973. ICML. 2012. S.Fortunato. Communitydetectioningraphs. Physics Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Reports, pages 75–174, 2010. Hierarchical Dirichlet processes. JASA, 2006. P. Gopalan, S. Gerrish, M. Freedman, D. M. Blei, J.YangandJ.Leskovec. Community-affiliationgraph and D. M. Mimno. Scalable inference of overlap- model for overlapping network community detec- pingcommunities. InNIPS,pages2249–2257,2012. tion. In ICDM, 2012. T.L.GriffithsandZ.Ghahramani. Infinitelatentfea- J. Yang and J. Leskovec. Structure and overlaps of turemodelsandtheIndianbuffetprocess. InNIPS, ground-truthcommunitiesinnetworks.ACMTrans. 2005. Intell. Syst. Technol., pages 26:1–26:35, 2014. P. Hoff. Modeling homophily and stochastic equiva- M.ZhouandL.Carin.Augment-and-conquernegative lence in symmetric relational data. In NIPS, 2008. binomial processes. In NIPS, 2012. P. W. Holland, K. B. Laskey, and S. Leinhardt. M. Zhou and L. Carin. Negative binomial process Stochastic blockmodels: First steps. Social net- count and mixture modeling. IEEE Trans. Pattern works, pages 109–137, 1983. Analysis and Machine Intelligence, 2015. D.N.Hoover. Row-columnexchangeabilityandagen- M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta- eral model for exchangeability. In G. Koch and negative binomial process and Poisson factor analy- F. Spizzichino, editors, Exchangeability in Probabil- sis. In AISTATS, 2012. ity and Statistics. 1982. Infinite Edge Partition Models Infinite Edge Partition Models for Overlapping Sample r . Similar to the inference of a , using (17), k i Community Detection and Link Prediction: we sample r as k Appendix (l |−)∼m(cid:88)·kk2·Ber(cid:18) rkξδkk2(rk2)1−δkk2 (cid:19), A Proof for Lemma 1 kk2 t=1 rkξδkk2(rk2)1−δkk2 +t−1 (cid:20) (r |−)∼Gam γ0 +(cid:88)l , Using the law of total expectation, we have k K kk2 k2 (cid:34) (cid:35) (cid:34) (cid:35) (cid:21) (cid:88)(cid:88) 1 (cid:88) 1 E k1 k2 λk1k2 = βE ξG(Ω)+[G(Ω)]2− k rk2 . c0−(cid:80)k2ξδkk2(rk2)1−δkk2 ln(1−p˜kk2) (.22) Using Campbell’s theorem (Kingman, 1993), we have Sample ξ. We resample the auxiliary variables l kk (cid:34) (cid:35) using the updated r and then sample ξ as E (cid:88)rk2 =(cid:90) (cid:90) ∞r2r−1e−c0rdrG0(dω)= γc20. (cid:20) k(cid:88) 1 (cid:21) k Ω 0 0 (ξ|−)∼Gam e0+ lkk,f −(cid:80) r ln(1−p˜ ) . The proof is completed by further using E[G(Ω)] = k 0 k k kk(23) γ0/c0 and E[G2(Ω)]=γ02/c20+γ0/c20. Sample λk1k2. Using (15) and the gamma-Poisson conjugacy, we have B MCMC Inference for HGP-EPM (cid:104) (λk1k2|−)∼ Gam rk1ξδk1k2(rk2)1−δk1k2 +m·k1k2·, Sample m . As in Section 2.2, we sample a latent (cid:105) ij 1/(β+θ ) . count for each b as k1k2 ij (24) (cid:32) K K (cid:33) (mij|−)∼bijPo+ (cid:88) (cid:88) φik1λk1k2φjk2 . (18) Sgaammmpaledβis,trciibuatniodnsc0u.sinTghetyhecacnonbjeugsaacmyplbedetwfreoemn k1=1k2=1 gamma distributions, omitted here for brevity. Sample mik1k2j. Usingtherelationshipsbetweenthe Sample γ0. As show in Lemma 1, the mass parame- Poisson and multinomial distributions, similar to the terγ playsanimportantroleindeterminingthetotal 0 derivationinZhouetal.(2012),wepartitionthelatent sum of the infinite rate matrix {λ }. Our experi- k1k2 count mij as mentsshowthatitcouldbeusedasatuningparameter to impose one’s prior preference on the number of ac- (cid:18) (cid:19) {φ λ φ } ({m }|−)∼Mult m ; ik1 k1k2 jk2 . tive communities to be discovered. In this paper, we ik1k2j ij (cid:80) (cid:80) φ λ φ k1 k2 ik1 k1k2 jk2 impose a gamma prior as γ0 ∼ Gam(1,1) to let the (19) data infer the posterior of γ . We employ an indepen- 0 Note that in each MCMC iteration we store mik·· and dence chain Metropolis-Hastings algorithm to sample m·k1k2· but not necessarily mik1k2j in the memory. γ0, with the proposal distribution constructed as Sample a . Using (16) and the data augmentation i (cid:18) (cid:19) technique developed in Zhou and Carin (2012, 2015) Q(γ∗)=Gam 1+(cid:88)˜l , 1 , forthenegativebinomialdistribution,wesamplea as 0 k 1− 1 (cid:80) ln(1−p˜˜ ) i k K k k (25) ((cid:96)ik|−)∼m(cid:88)ik··Ber(cid:18)a +ati−1(cid:19), where (˜lk|−)∼CRT(cid:0)(cid:80)k2lkk2,γ0/K(cid:1) and (ai|−)∼Gta=m1(cid:18)e0+(cid:88)i (cid:96)ik,f −(cid:80) l1n(1−p(cid:48) )(cid:19), p˜˜k := c0−−(cid:80)(cid:80)k2k2ξδξkδkk2k(2r(kr2k)21)−1−δkδkk2k2lnln(1(1−−p˜kp˜kk2k)2). k 0 k ik We accept γ∗ with probability min{1,π}, where π is (20) 0 (cid:81)K Gam(r ;γ∗/K,1/c )Gam(γ∗;1,1)Q(γ ) where with a slight abuse of notation, but for added k=1 k 0 0 0 0 , conciseness, we use x ∼ (cid:80)m Ber[a/(a+t)] to repre- (cid:81)K Gam(r ;γ /K,1/c )Gam(γ ;1,1)Q(γ∗) sent x=(cid:80)m u , u ∼Bert=[a1/(a+t)]. k=1 k 0 0 0 0 t=1 t t which is usually greater than 50% for the networks Sample φ . Using(14)andthegamma-Poissoncon- ik considered in this paper. jugacy, we have Each iteration of the MCMC for the HGP-EPM pro- (cid:2) (cid:3) (φ |−)∼Gam a +m ,1/(c +ω ) . (21) ik i ik·· i ik ceeds from (18) to (25).