
Promise and Pitfalls of Extending Google's PageRank Algorithm to Citation Networks


S. Maslov (1) and S. Redner (2)

(1) Department of Condensed Matter Physics and Materials Science, Brookhaven National Laboratory, Upton, NY, 11973
(2) Center for Polymer Studies and Department of Physics, Boston University, Boston, MA, 02215

arXiv:0901.2640v1 [physics.soc-ph] 17 Jan 2009

The number of citations is the most commonly used metric for quantifying the importance of scientific publications. However, we all have anecdotal experiences that citations alone do not characterize the importance of a publication. Some of the shortcomings of using citations as a universal measure of importance include:

• It ignores the importance of citing papers: a citation from an obscure paper is given the same weight as a citation from a ground-breaking and highly cited work.

• The number of citations is ill-suited to compare the impact of papers from different scientific fields. Due to factors such as the size of a field and disparate citation practices, the average number of citations per paper varies widely between disciplines. An average paper is cited about 6 times in life sciences, 3 times in physics, and less than once in mathematics.

• Many ground-breaking older articles are modestly cited due to a smaller scientific community when they were published. Furthermore, publications on significant discoveries often stop accruing citations once their results are incorporated into textbooks. Thus citations consistently underestimate the importance of influential old papers.

These and related shortcomings of citation numbers are partially obviated by Google's PageRank algorithm [1]. As we shall discuss, PageRank gives higher weight to publications that are cited by important papers and also weights citations more highly from papers with few references. Because of these attributes, PageRank readily identifies a large number of scientific "gems": modestly cited articles that contain ground-breaking results.

In a recent study [2], we applied Google's PageRank to the citation network of the premier American Physical Society (APS) family of physics journals (Physical Review A–E, Physical Review Letters, Reviews of Modern Physics, and Physical Review Special Topics). Our study was based on all 353,268 articles published in APS journals since their inception in 1893 until June 2003 that have at least one citation from within this dataset. This set of articles has been cited a total of 3,110,839 times by other APS publications. Our study is restricted to internal citations, that is, citations to APS articles from other APS articles. Other studies [3, 4] use the PageRank algorithm on a coarse-grained scale of individual scientific journals to formulate an alternative to the Impact Factor.

We can think of the set of all APS articles and their citations as a network, with nodes representing articles and a directed link between two nodes representing a citation from a citing article to a cited article. In Google's PageRank algorithm [1], a random surfer is initially placed at each node of this network and its position is updated as follows: (i) with probability 1 − d, a surfer hops to a neighboring node by following a randomly selected outgoing link from the current node; (ii) with probability d, a surfer "gets bored" and starts a new search from a randomly selected node in the entire network. This update is repeated until the number of surfers at each node reaches a steady value, the Google number. These nodal Google numbers are then sorted to determine the Google rank of each node.

The original Brin-Page PageRank algorithm [1] used the parameter value d = 0.15, based on the observation that a typical web surfer follows on the order of 6 hyperlinks, corresponding to a boredom attrition factor d = 1/6 ≃ 0.15, before aborting and beginning a new search. The number of citation links followed by researchers is generally considerably less than 6. In [2, 5] we argued that in the context of citations the appropriate choice is d = 1/2, corresponding to a citation chain of two links.

The Google number G and the number of citations k to a paper are roughly proportional to each other for k ≳ 50 [2] so that, on average, they represent similar measures of importance [6, 7]. The outliers with respect to this proportionality are of particular interest (Fig. 1). While the top APS papers by PageRank are typically very well cited, some are modestly cited and quite old. The disparity between PageRank and citation rank arises because the former involves both the number of citations and the quality of the citing articles. The PageRank of a paper is enhanced when it is cited by its scientific "children" that themselves have high PageRank and when these children have short reference lists. That is, children should be influential and the initial paper should loom as an important "father figure" to its children.

The top articles according to Google's PageRank would be recognizable to almost all physicists, independent of their specialty (Table I). For example, Onsager's 1944 exact solution of the two-dimensional Ising model (the point labeled with O in Fig. 1) was both a calculational tour de force and a central development in critical phenomena. The paper by Feynman and Gell-Mann (FG) introduced the V − A theory of weak interactions that incorporated parity non-conservation and became the "standard model" of the field. Anderson's paper (A), "Absence of Diffusion in Certain Random Lattices", gave birth to the field of localization and was cited for the 1977 Nobel prize in physics. Particularly intriguing are the articles "The Theory of Complex Spectra" by J. C. Slater (S) and "On the Constitution of Metallic Sodium" by E. Wigner and F. Seitz (WS), which have relatively few APS citations (114 and 184 respectively as of June 2003). Slater's paper introduced the determinant form of the many-body wavefunction that is so ubiquitous that the original work is no longer cited. Similarly, the latter paper introduced the Wigner-Seitz construction that constitutes an essential curriculum component in any solid-state physics course.

[Figure 1: scatter plot. Vertical axis: PageRank value at d = 0.5. Horizontal axis: number of citations. Labeled points: C, BCS, KS, S, WS, FG, O, A, AA, W, HK.]

FIG. 1: Scatter plot of the Google number versus number of citations for APS publications with more than 100 citations. The top Google-ranked papers (filled) are identified by author(s) initials and their details are given in Table I. The solid curve is the average Google number of papers versus number of citations when binned logarithmically.

TABLE I: Top non-review-article APS publications ranked according to PageRank (PR, first column), the citation number (CNR, second column), and CiteRank [5] (CR, third column). The full database of PageRank and CiteRank values of APS publications is accessible at [8].

PR   CNR   CR    Publication             Title                                  Author(s)
 1     54    42  PRL 10, 531 (1963)      Unitary Symmetry & Leptonic Decays     Cabibbo (C)
 2      5    10  PR 108, 1175 (1957)     Theory of Superconductivity            Bardeen, Cooper & Schrieffer (BCS)
 3      1     1  PR 140, A1133 (1965)    Self-Consistent Equations ...          Kohn & Sham (KS)
 4      2     2  PR 136, B864 (1964)     Inhomogeneous Electron Gas             Hohenberg & Kohn (HK)
 5      6    58  PRL 19, 1264 (1967)     A Model of Leptons                     Weinberg (W)
 6     55    37  PR 65, 117 (1944)       Crystal Statistics ...                 Onsager (O)
 7     95   293  PR 109, 193 (1958)      Theory of the Fermi Interaction        Feynman & Gell-Mann (FG)
 8     17    13  PR 109, 1492 (1958)     Absence of Diffusion in ...            Anderson (A)
 9   1853   133  PR 34, 1293 (1929)      The Theory of Complex Spectra          Slater (S)
10     12    11  PRL 42, 673 (1979)      Scaling Theory of Localization         Abrahams, Anderson, et al. (AA)
11    712   106  PR 43, 804 (1933)       ...Constitution of Metallic Sodium     Wigner & Seitz (WS)

The PageRank algorithm was originally developed to rank web pages that are connected by hyperlinks and not papers connected by citations. The most important difference between these two networks is that, unlike hyperlinks, citations cannot be updated after publication. The constraint that a paper may only cite earlier works introduces a time ordering to the citation network topology. This ordering makes aging effects much more important in citation networks than in the world-wide web. The habits of scientists looking for relevant scientific literature are also different from those of web surfers. Apart from the already-mentioned shorter search depth (1/d = 2), scientists often start literature searches with a paper of interest that they saw in a recent issue of a journal or heard presented at a conference. From this starting point a researcher typically follows chains of citations that lead to progressively older publications.

These observations led us to modify the PageRank algorithm by initially distributing random surfers exponentially with age, in favor of more recent publications [5]. This algorithm, CiteRank, is characterized by just two parameters: d, the inverse of the average citation depth, and τ, the time constant of the bias toward more recent publications at the start of searches. The optimal values of these parameters [5] that best predict the rate at which publications acquire recent citations are d = 0.5 and τ = 2.6 years for APS publications. With these values the output of the CiteRank algorithm quantifies the relevance of a given publication in the context of currently popular research directions, while that of PageRank corresponds to its "lifetime achievement award". The database with the results of applying both algorithms to the citation network of APS publications can be accessed online at [8].

Google's PageRank algorithm and its modifications hold great promise for quantifying the impact of scientific publications. They provide a meaningful extension to traditionally used importance measures, such as the number of citations to articles and the impact factor for journals. PageRank implements, in a simple way, the logical notion that citations from more important publications should contribute more strongly to the rank of a cited paper. PageRank also effectively normalizes the impact of papers in different scientific communities [9]. Other ways of attributing a quality to citations would require detailed contextual information about the citations themselves, features that are presently unavailable. PageRank implicitly includes context by incorporating the importance of citing publications. Thus PageRank represents a computationally simple and effective way to evaluate the relative importance of publications beyond simply counting citations.

We conclude with some caveats. It is very tempting to use citations and their refinements, such as PageRank, to quantify the importance of publications and scientists [10], especially as citation data becomes increasingly convenient to obtain electronically. In fact, the h-index of any scientist, a purported single-number measure of the impact of an individual scientific career, is now easily available from the Web of Science [11]. However, we must be vigilant for the overuse and misuse of such indices. All the citation measures devised thus far pertain to popularity rather than to the not-necessarily-coincident attribute of intrinsic intellectual value. Even if a way is devised to attach a high-fidelity quality measure to a citation, there is no substitute for scientific judgment to assess publications. We need to avoid falling into the trap of relying on automatically generated citation statistics for assessing the performance of individual researchers, departments, and scientific disciplines, and especially of allowing the evaluation task to be entrusted to administrators and bureaucrats [12].

Acknowledgments

Work by SM at Brookhaven National Laboratory was carried out under Contract No. DE-AC02-98CH10886, Division of Material Science, U.S. Department of Energy. SR gratefully acknowledges financial support from the US National Science Foundation grant DMR0535503.

[1] S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems 30, 107 (1998).
[2] P. Chen, H. Xie, S. Maslov, and S. Redner, Finding Scientific Gems with Google, J. of Informetrics 1, 8 (2007).
[3] J. Bollen, M. A. Rodriguez, and H. Van de Sompel, Journal Status, Scientometrics 69, 669 (2006).
[4] C. Bergstrom, Eigenfactor: measuring the value and prestige of scholarly journals, College and Research Libraries News 68, 5 (2007).
[5] D. Walker, H. Xie, K.-K. Yan, and S. Maslov, Ranking scientific publications using a model of network traffic, J. Stat. Mech. 6, P06010 (2007).
[6] S. Fortunato, M. Boguna, A. Flammini, and F. Menczer, How to make the top ten: Approximating PageRank from in-degree, in Proceedings of WAW2006 (2006). arXiv:cs/0511016.
[7] S. Fortunato, A. Flammini, and F. Menczer, Scale-free network growth by ranking, Phys. Rev. Lett. 96, 218701 (2006).
[8] http://www.cmth.bnl.gov/~maslov/citerank
[9] H. Xie, K.-K. Yan, and S. Maslov, Optimal ranking in networks with community structure, Physica A 373, 831 (2007). arXiv:physics/0510107.
[10] J. Hirsch, An index to quantify an individual's scientific research output, Proc. Natl. Acad. Sci. (USA) 102, 16569 (2005).
[11] http://apps.isiknowledge.com/
[12] See also R. Adler, J. Ewing, P. Taylor, Citation Statistics, Report from the International Mathematical Union (IMU) (2008), http://www.mathunion.org/fileadmin/IMU/Report/CitationStatistics.pdf
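
Illustrative sketch (not part of the original article). The random-surfer update described in the text, steps (i) and (ii), and the recency-biased start that distinguishes CiteRank can be tried on a toy network with the short Python sketch below. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the four-paper network, and the paper ages are all hypothetical, and surfers on a paper with no outgoing citations are assumed to always start a new search. A single restart-based routine is used for both variants purely for convenience; the CiteRank paper [5] should be consulted for its own definition of traffic.

```python
import math


def surfer_density(cites, start, d=0.5, tol=1e-12, max_iter=10_000):
    """Steady-state surfer density for a random walk with restarts.

    cites: dict mapping every paper to the list of papers it cites
           (every cited paper must also appear as a key).
    start: dict giving the probability that a new search begins at each paper.
    d:     probability that a surfer abandons the current chain at each step.
    """
    density = dict(start)
    for _ in range(max_iter):
        # Step (ii): surfers who get bored (probability d) start a new search.
        # Assumption: surfers on a paper with no outgoing citations always
        # start a new search, so that no surfers are lost.
        restarting = sum(density[p] * (d if cites[p] else 1.0) for p in cites)
        updated = {p: restarting * start[p] for p in cites}
        for p, refs in cites.items():
            if refs:
                # Step (i): the remaining surfers follow one outgoing
                # citation chosen uniformly at random.
                share = (1.0 - d) * density[p] / len(refs)
                for q in refs:
                    updated[q] += share
        change = sum(abs(updated[p] - density[p]) for p in cites)
        density = updated
        if change < tol:
            break
    return density


def uniform_start(cites):
    """PageRank-style start: every paper is equally likely."""
    n = len(cites)
    return {p: 1.0 / n for p in cites}


def recency_start(ages, tau=2.6):
    """CiteRank-style start: probability decays as exp(-age / tau)."""
    weights = {p: math.exp(-age / tau) for p, age in ages.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


# Hypothetical toy network: each paper cites only older papers.
toy_cites = {
    "P1990": [],
    "P2000": ["P1990"],
    "P2005": ["P1990", "P2000"],
    "P2008": ["P2005"],
}
toy_ages = {"P1990": 18.0, "P2000": 8.0, "P2005": 3.0, "P2008": 0.0}  # years

pagerank_like = surfer_density(toy_cites, uniform_start(toy_cites), d=0.5)
citerank_like = surfer_density(toy_cites, recency_start(toy_ages, tau=2.6), d=0.5)

for name, ranks in [("uniform start", pagerank_like), ("recency start", citerank_like)]:
    ordering = sorted(ranks, key=ranks.get, reverse=True)
    print(name, ordering)
```

In this toy example the uniform start puts the oldest paper first, since every citation chain eventually reaches it, while the recency-biased start shifts weight toward the newest paper. That mirrors the "lifetime achievement" versus "current relevance" distinction drawn in the text, though the numbers themselves carry no meaning beyond the toy network.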
