Bipartite Yule Processes in Collections of Journal Papers

Steven A. Morris∗
Oklahoma State University, Electrical and Computer Engineering, Stillwater, OK 74078, USA
(Dated: February 2, 2008)

arXiv:cond-mat/0501386v1 [cond-mat.other] 17 Jan 2005

Collections of journal papers, often referred to as 'citation networks', can be modeled as a collection of coupled bipartite networks which tend to exhibit linear growth and preferential attachment as papers are added to the collection. Assuming primary nodes in the first partition and secondary nodes in the second partition, the basic bipartite Yule process assumes that as each primary node is added to the network, it links to multiple secondary nodes, and with probability α, each new link may connect to a newly appearing secondary node. The number of links from a new primary node follows some distribution that is a characteristic of the specific network. Links to existing secondary nodes follow a preferential attachment rule. With modifications to adapt to specific networks, bipartite Yule processes simulate networks that can be validated against actual networks using a wide variety of network metrics. The application of bipartite Yule processes to the simulation of paper-reference networks and paper-author networks is demonstrated, and simulation results are shown to mimic networks from actual collections of papers across several network metrics.

Keywords: bipartite networks, citation networks, Yule process, Simon-Yule process, network growth model, preferential attachment

I. COLLECTIONS OF PAPERS AS COUPLED BIPARTITE NETWORKS

As shown in Figure 1, a collection of journal papers constitutes a series of coupled bipartite networks [8]. The collection contains 6 direct bipartite networks: 1) papers to paper authors, 2) papers to references, 3) papers to paper journals, 4) papers to terms, 5) references to reference authors, and 6) references to reference journals. Additionally, there are 15 indirect bipartite networks in collections of papers as defined by the diagram. Examples of interesting indirect networks are paper author to reference author networks and paper journal to reference journal networks, which can be used for author co-citation analysis [13] and journal co-citation analysis [5], respectively.

FIG. 1: Diagram showing a collection of papers as a series of coupled bipartite networks. [Entities in the diagram: paper authors, reference authors, references, papers, terms, paper journals, reference journals.]

Modeling the growth of these bipartite networks helps characterize the underlying processes driving a research specialty, such as knowledge accretion, researcher productivity, or collaboration processes. Bipartite growth models produce many network metrics, allowing comprehensive validation of models against real collections of papers.

II. BASIC BIPARTITE YULE PROCESSES

As originally proposed, Yule processes do not model networks, but simply model the formation of power laws in the frequencies of items [1][10][12]. For a bipartite Yule process, assume a bipartite network where nodes fall into two partitions: 1) primary nodes and 2) secondary nodes. Typically, primary nodes are papers while secondary nodes are entities that are associated with papers, such as authors, references, journals, or terms.

Figure 2 shows a diagram of a bipartite paper-reference network, where the primary nodes are papers and the secondary nodes are references, and papers are linked to references by citations.

Figure 3 shows a diagram of a basic bipartite Yule process:

• The network grows by adding primary nodes one at a time.

• When a new primary node is added, it links to N secondary nodes.
  N is a random deviate drawn from a discrete probability distribution that is characteristic of the type of network being modeled. For paper-reference networks N is lognormally distributed [6], while for paper-author networks N is 1-shifted Poisson distributed [2][7]. For paper-journal networks, N is unity, since a paper is linked to only one journal, the one in which it was published. As defined here, a primary node does not link to any specific secondary node more than once.

• For each of the N links, there is a probability, α, that it will link to a newly appearing secondary node.

• If a link is to an existing secondary node, the linked node is selected using preferential attachment, that is, the probability of linking to a secondary node is proportional to the number of links that the node possesses.

FIG. 2: Diagram showing a bipartite network of papers and the references that they cite.

FIG. 3: Diagram of a basic bipartite Yule process. [Flowchart: for each new primary node, N links are made; each link goes to a new secondary node with probability α, or with probability 1 − α uses preferential connection to link to an existing secondary node not previously linked from this primary node.]

The stationary distribution of the link degree of the secondary nodes is a Yule distribution [3][12], a power law whose exponent is 1 + 1/(1 − α). The stationary distribution is independent of the distribution of N, but for finite collections of papers the distribution of N profoundly affects the tail of the distribution [6].

∗ [email protected]; http://samorris.ceat.okstate.edu

III. PRACTICAL BIPARTITE YULE PROCESSES

In practice, the basic bipartite Yule process outlined in the preceding section must be modified to account for the characteristics of the specific type of bipartite network being studied.

A. Paper-reference Yule process

Figure 4 shows a diagram of a bipartite Yule process modified for the characteristics of paper-reference networks. The details of this model, its scope, and a discussion of evidence of its validity appear in [6]. Paper-reference networks in collections of papers covering scientific specialties are characterized by the accretion of highly cited exemplar references, which are cited at rates far higher than would be predicted by simple preferential attachment. These exemplar references tend to appear during the initial growth of the network, and their rate of appearance decreases exponentially as papers are added to the collection.

FIG. 4: Diagram showing a bipartite Yule process for paper-reference networks.

As each paper is added to the collection, it links to a lognormally distributed number of references, as discussed in [6]. For each reference cited by a paper, there is a probability α that the citation is to a newly appearing reference. When a new reference appears, there is a small probability that the reference will be a highly attractive exemplar reference. If so, the reference receives a large initial attraction, A0. Newly created non-exemplar references receive no initial attraction. If a citation is to an existing reference, the probability that any particular existing reference will be cited is proportional to the sum of its initial attraction and the number of times it has been cited. A specific reference cannot be cited more than once by a paper.

B. Paper-author Yule process

Figure 5 shows a diagram of the basic bipartite Yule process modified for the characteristics of paper-author networks. The details of this model, its scope, and a discussion of evidence of its validity appear in [2] and [7]. In this case the Yule process is applied to teams of researchers rather than individual researchers. As each paper is added, there is a probability that the paper will be authored by a new research team. If so, a team of N_G authors is added to the network, but only N(λ) appear as authors of the team's first paper, where N(λ) is a random deviate drawn from a 1-shifted Poisson distribution whose parameter is λ. If the paper is instead authored by an existing team, the team is chosen using preferential attachment, that is, the probability that a team will author the new paper is proportional to the number of papers that the team has previously published.

When selecting authors for an existing team's paper, N(λ) authors are chosen and the authors are selected using preferential attachment, specifically, the probability of selecting an author is proportional to 1 plus the number of papers that the author has published. Inter-team collaborations (weak ties) are modeled as random events; when an existing author is to be selected there is a probability β that the author will be drawn randomly from some other team.

FIG. 5: Diagram showing a bipartite Yule process for paper-author networks.

IV. NETWORK METRICS

Simulation using a bipartite Yule process fully preserves the topology of the network phenomenon being studied. The adjacency matrix for a bipartite network is a roughly lower triangular rectangular matrix. Figure 6 shows the adjacency matrices of the paper-reference network, paper-author network, and paper-journal network in an actual collection of papers.

From each bipartite network, two co-occurrence networks can be derived, each with its own characteristic topology. For example, a paper-reference network yields two unipartite networks: a bibliographic coupling network of papers linked by common references and a co-citation network of references linked by their common papers. A paper-author network yields a collaboration network of authors connected by common papers and also a network of papers connected by common authors.

Network metrics that characterize a bipartite network can be derived from link degree distributions in the bipartite network and link degree distributions in the associated unipartite co-occurrence networks. Many of these metrics can be tied to indicators of the underlying research process generating the collection of papers.

A set of useful metrics for paper-reference networks includes:

• reference per paper distribution - This tends to be a lognormal distribution whose mean, m, is from 15 to 30 references per paper [6].

• paper per reference distribution - This tends to be a power-law distribution with a characteristic exponent that ranges from 2 to 4 [9][11].
FIG. 6: Diagrams of adjacency matrices of bipartite networks in a collection of 902 papers on the topic of complex networks. [Panels: REFERENCES, AUTHORS, JOURNALS. Vertical axis: papers in order of appearance. Horizontal axes: references, authors, and journals in order of appearance.]

• bibliographic coupling strength per paper pair distribution - This is the link weight distribution of the bibliographic coupling network.

• co-citation coupling strength per reference pair distribution - This is the link weight distribution of the co-citation network.

• bibliographic coupling clustering coefficient distribution - This is the distribution of the clustering coefficients for the bibliographic coupling network.

In paper-reference networks, the mean references per paper is typically about 30, while the mean papers per reference is typically about 1.4, the mean of a zeta (pure power-law) distribution with an exponent of 3. This constrains the ratio of references to papers in the collection to be about 20, that is, a collection of papers typically has about 20 times more references than papers.

A set of useful metrics for paper-author networks includes:

• authors per paper distribution - This tends to be a 1-shifted Poisson distribution whose mean varies from 2 for fields such as mathematics to more than 10 for biomedical fields [7].

• paper per author distribution - This tends to be a power law (Lotka's Law), whose exponent ranges from 2 to 4 [4].

• collaborating author distribution - This is the distribution of the number of unique co-authors per author in the collection, and is the link degree distribution of the unweighted co-authorship network.

• co-authorship per author pair distribution - This is the link weight distribution of the weighted co-authorship network.

• co-authorship clustering coefficient distribution - This is the distribution of the clustering coefficients of the unweighted co-authorship network.

• minimum co-authorship path length distribution - This is the distribution of minimum path lengths between author pairs in the unweighted co-authorship network.

V. EXAMPLES

A. Example simulation of paper-reference network

The Yule model for paper-reference networks was tested on a collection of papers that cover the topic of complex networks. This collection was gathered on September 8th, 2003 from ISI's Web of Science product using a series of queries to find all papers that cite key references and authors in the specialty. The collection contains 902 papers with 31355 citations to 19185 references. The Yule parameter, α, estimated by dividing the number of references by the number of citations to references, is 0.61. The mean references per paper is 34.8. The parameters used for the bipartite Yule simulation of this collection can be found in [6].

Figure 7 shows plots comparing network metrics from the actual data to a Yule simulation of network growth. The upper left plot is of papers per reference frequencies. Power-law exponents estimated by maximum likelihood estimation (MLE) are 3.0 for the actual frequencies and 2.85 for the simulation. The paper-reference Yule process mimics the phenomenon of exceptionally highly cited exemplar references in the extreme lower right of the plot. The upper right plot is of frequency of bibliographic coupling strength per paper pair. The Yule process-based simulation frequencies match the actual frequencies well. The series of high bibliographic coupling strength pairs in the lower right from actual data corresponds to pairs of review papers with long lists of almost identical references, a phenomenon not modeled by the Yule process. The lower left plot of Figure 7 is of frequency of co-citation strength per reference pair. The simulated frequencies match the actual frequencies well across the whole plot. The lower right plot is of bibliographic coupling clustering coefficient distribution. The simulated distribution matches the shape and scale of the actual data.

FIG. 7: Comparison plots of paper per reference frequency (upper left), bibliographic coupling strength frequency (upper right), co-citation strength frequency (lower left), and bibliographic coupling clustering coefficient distribution (lower right), from a collection of 902 papers on the topic of complex networks. [Axes: number of references receiving k citations vs. k; number of paper pairs with k common references vs. k (percent zeros: simulation 0.539, actual 0.476); number of reference pairs with k common papers vs. k (percent zeros: simulation 0.995, actual 0.994); fraction of papers vs. bibliographic coupling clustering coefficient.]

B. Example simulation of a paper-author network

The Yule model for paper-author networks was tested on three collections of papers representing specialties with a wide range of collaboration intensities. A collection of 1391 papers on the topic of distance learning with 51% single-authored papers represents a specialty with little collaboration. A collection of 900 papers on the topic of complex networks with 21% single-authored papers represents a specialty with a typical amount of collaboration. Finally, a collection of 3095 papers on the topic of atrial ablation with 7% single-authored papers represents a specialty with heavy collaboration [7]. The parameters used for bipartite Yule simulation of these paper-author networks can be found in [7].

Figures 8, 9 and 10 show the comparison of Yule model simulations to actual data for these three collections using two metrics: 1) paper per author frequency (Lotka's Law), and 2) collaborating author frequency.

The left plots in Figures 8, 9 and 10 are paper per author frequency plots. The bipartite Yule process produces excellent matches to actual data. The inset plots show Yule model predicted paper per author distributions derived by gathering statistics from 1000 simulations for each collection. A line representing an MLE fitted zeta (pure power-law) distribution is shown in each inset. The Yule model produces excellent fits to the zeta distribution for all three collections, confirming the Yule model's usefulness as a predictor of Lotka's Law. Note that the deviation of the distributions from the zeta distribution in the tail is due to truncating the simulations at the number of papers in each collection. The plots on the right side of Figures 8, 9 and 10 show that the bipartite Yule model produces good matches of collaborating author frequencies to actual data across the wide range of collaboration intensities represented by the three collections.

FIG. 8: Comparison of bipartite Yule simulation against actual data for plots of paper per author frequencies and collaborating author frequencies for the distance education paper collection. [Axes: number of authors with x papers vs. x; number of authors with x co-authors vs. x.]

FIG. 9: Comparison of bipartite Yule simulation against actual data for plots of paper per author frequencies and collaborating author frequencies for the complex networks paper collection. [Axes: number of authors with x papers vs. x; number of authors with x unique co-authors vs. x.]

FIG. 10: Comparison of bipartite Yule simulation against actual data for plots of paper per author frequencies and collaborating author frequencies for the atrial ablation paper collection. [Axes: number of authors with x papers vs. x; number of authors with x co-authors vs. x.]

VI. FUTURE WORK

The research on bipartite Yule processes discussed here will be extended to modeling of coupled bipartite networks. Figure 11 shows an example of coupled bipartite networks, where a paper-author network is coupled to a paper-reference network through common papers. The challenge is to invent a model that reproduces the correlation of groups of authors to groups of references, a phenomenon that cannot be modeled using two separate bipartite processes.

FIG. 11: Example of coupled bipartite networks. The paper-author network is coupled to the paper-reference network through common papers.

[1] R. Albert and A. Barabasi. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47–97, 2002.
[2] M. L. Goldstein, S. A. Morris, and G. G. Yen. A group-based model for bipartite author-paper networks. Physical Review E (cond-mat/0409205), 2004.
[3] Norman Lloyd Johnson, Samuel Kotz, and Adrienne W. Kemp. Univariate discrete distributions. Wiley series in probability and mathematical statistics. Applied probability and statistics. John Wiley & Sons, New York, 2nd edition, 1992.
[4] A. J. Lotka. The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences, 16:317–323, 1926.
[5] K. W. McCain. Mapping economics through the journal literature: an experiment in journal cocitation analysis. Journal of the American Society for Information Science, 42(4):290–296, 1991.
[6] S. A. Morris. Manifestation of emerging specialties in journal literature: a growth model of papers, references, exemplars, bibliographic coupling, co-citation, and clustering coefficient distribution. Journal of the American Society for Information Science and Technology, in print, 2004.
[7] S. A. Morris, M. L. Goldstein, and C. F. Deyong. Manifestation of research teams in journal literature: A growth model of papers, authors, collaboration, coauthorship, weak ties, and Lotka's law. (submitted), 2004.
[8] Steven A. Morris. Unified mathematical treatment of complex cascaded bipartite networks: the case of collections of journal papers. Dissertation, Oklahoma State University, 2005.
[9] S. Naranan. Power law relations in science bibliography - a self-consistent interpretation. Journal of Documentation, 27(2):83–97, 1971.
[10] D. Price. A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27(5-6):292–306, 1976.
[11] S. Redner. How popular is your paper? An empirical study of the citation distribution. European Physical Journal B, 4(2):131–134, 1998.
[12] H. A. Simon. On a class of skew distribution functions. Biometrika, 42:425–440, 1955.
[13] H. D. White and B. C. Griffith. Author cocitation: a literature measure of intellectual structure. Journal of the American Society for Information Science, 32(3):163–172, 1981.
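
As an illustration of the basic bipartite Yule process of Section II, the growth steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the simulation code used in [6] or [7]. The function names, the redraw loop used to implement preferential attachment, the fallback that creates a new secondary node when a primary node has already linked to every existing one, and the lognormal parameters mu = 3.3 and sigma = 0.6 are all illustrative choices; only α = 0.61 and the count of 902 papers are taken from Section V.

```python
import random
from collections import Counter


def simulate_bipartite_yule(num_primary, alpha, draw_n, seed=0):
    """Grow a bipartite network by the basic process of Section II.

    Returns a list whose p-th entry is the set of secondary-node ids
    linked from primary node p. All names here are illustrative.
    """
    rng = random.Random(seed)
    links = []          # links[p]: secondary nodes linked from primary p
    endpoints = []      # one entry per existing link (its secondary node)
    num_secondary = 0

    for _ in range(num_primary):
        chosen = set()
        for _ in range(draw_n(rng)):
            # Create a new secondary node with probability alpha, or when
            # every existing secondary node is already linked from this
            # primary node (a fallback assumed here, not from the paper).
            if rng.random() < alpha or len(chosen) == num_secondary:
                node = num_secondary
                num_secondary += 1
            else:
                # A uniform draw over link endpoints selects a node with
                # probability proportional to its link degree; redraw
                # until the node is not yet linked from this primary.
                node = rng.choice(endpoints)
                while node in chosen:
                    node = rng.choice(endpoints)
            chosen.add(node)
        endpoints.extend(chosen)
        links.append(chosen)
    return links


def secondary_degrees(links):
    """Count the links received by each secondary node."""
    return Counter(node for chosen in links for node in chosen)


def lognormal_n(rng, mu=3.3, sigma=0.6):
    """Illustrative link-count sampler; mu and sigma are assumed values."""
    return max(1, round(rng.lognormvariate(mu, sigma)))


if __name__ == "__main__":
    net = simulate_bipartite_yule(num_primary=902, alpha=0.61, draw_n=lognormal_n)
    degrees = secondary_degrees(net)
    print(len(net), "primary nodes,", len(degrees), "secondary nodes")
    print("largest link degree:", max(degrees.values()))
```

Only the draw_n argument encodes the distribution of N, so a sampler that always returns 1 would correspond to the paper-journal case described in Section II. The exemplar references of Section III A and the team structure of Section III B are deliberately left out of this sketch.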