Table Of Content

C : A new citation index to measure the ITEX relative importance of authors and papers in scientific publications Arindam Pal † and Sushmita Ruj ‡ † TCS Innovation Labs, Kolkata, India. Email address: [email protected] ‡ Indian Statistical Institute, Kolkata, India. Email address: [email protected] 5 1 0 2 Abstract—Evaluating the performance of researchers and original research papers. Self citations also increase the measuring the impact of papers written by scientists is the number of citations of a paper. n main objective of citation analysis. Various indices and metrics a Thisiswhyweproposeanewmetricforevaluatingpapers have been proposed for this. In this paper, we propose a new J for the first time. This new citation index CITEX, gives citationindex CITEX,whichgivesnormalizedscorestoauthors 0 and papers to determine their rankings. To the best of our normalized scores to authors and papers to determine their 2 knowledge, this is the first citation index which simultaneously rankings.Usingthesescores,wecangetanobjectivemeasure assigns scores to both authors and papers. Using these scores, of the reputation of an author and the impact of a paper. ] wecangetanobjectivemeasureofthereputationofanauthor L Paperscoresarecalculatednotsolelybasedonthenumberof and the impact of a paper. D citations. Apart from giving scores to papers, we give scores We model this problem as an iterative computation on a toauthors.Theauthorscoresandpaperscoresreinforceeach . publication graph, whose vertices are authors and papers, and s c whose edges indicate which author has written which paper. other.Thus,aninfluentialauthorwillincreasethescoreofthe [ Weprovethatthisiterativecomputationconvergesinthelimit, paper(s)hewrites.Anauthor’sscoreincreasesif(s)hewrites by using a powerful theorem from linear algebra. We run a good paper. Since the author score increases the score of 1 this algorithm on several examples, and find that the author a paper written by the author, it will be a general tendency v and paper scores match closely with what is suggested by our 4 intuition. The algorithm is theoretically sound and runs very to write a paper with an influential author. To prevent this, 9 fast in practice. We compare this index with several existing we assign scores to authors in such a way that the score of a 8 metrics and find that CITEX gives far more accurate scores paper gets uniformly divided by the number of people who 4 compared to the traditional metrics. have co-authored the paper. This basic model can be very 0 Keywords: CITATIONANALYSIS,GRAPHALGORITHMS, easily extended to weighted distribution of scores, where a . 1 MATRIX COMPUTATIONS, EIGENVALUES AND EIGENVEC- first author who has the highest contribution receives more 0 TORS,INFORMATIONRETRIEVAL. weightage than an author who has less contribution. 5 Existing literature rate an author based on the number of 1 : I. INTRODUCTION papers that he/she has written, the total number of citations v received, average number of citations per paper etc. In i X In today’s world, numerous papers are written by authors Section II, we discuss each of these metrics and state their r in many journals and conferences. It is difficult for people advantages and disadvantages. None of the existing tech- a to judge the quality and impact of an author or a paper, niques take into account how the paper scores of an author even if they are experts, just by reading a few papers. Thus, influence the author’s standing in the academic community, measuringtherelativeimportanceofauthorsandpaperspub- because the paper score is calculated solely based on the lishedinscientificconferencesandjournalsisveryimportant. number of citations. Our CITEX index is inspired by the Moreimportantly,thereisaneedforanindexgivingaccurate ideas of PageRank [16], [6], [5] and HITS [15] algorithms results, which can be computed easily for a large collection for ranking web pages. In this scheme, paper scores are of authors and papers. updated depending on author scores and author scores are Many metrics are available to evaluate the importance of updated based on paper scores. This is often referred to as journals like Impact factor [12], Immediacy index, citation thePrincipleofRepeatedImprovement[9].Weprovethatthe page rank [18], and Y-factor [4]. There are many metrics scoresasymptoticallyconvergeinthelimitwhenthenumber which give scores to authors. Most famous of these are of iterations is large. In practice, the scores converge within Hirsch’s h-index [13], Individual h-index [1], Egghe’s g- a few iterations. index [10], and Zhang’s e-index [21]. However, till now, The main idea is to consider the authors and papers as a the only way to evaluate the impact of a paper, is to count network with a disjoint set of nodes, with the set of authors the number of citations. It has been observed that surveys and the set of papers as vertices in the two sets. An edge and review articles receive more citations than high quality existsbetweenanauthorandapaper,iftheauthorhaswritten thecorrespondingpaper.Apartfromthese,thereisacitation II. RELATEDWORK subgraph, consisting of papers as the vertices. A directed edge exists from node i to node j, if paper i cites paper j. A. Different metrics used in citation analysis At each step, a paper score gets uniformly divided amongst all its authors. The paper score is the sum of the scores of Previously, there have been several attempts to measure the authors who have written the paper and the scores due the impact of authors and papers. We list here some of them to citation. along with their definition. 1) Number of papers (N ): Total number of papers p Thecurrentlyavailablescoreslikeh-indexcountthenum- written by an author. ber of citations, but not the impact of these papers which 2) Number of citations (Nc): Total number of citations have cited this paper. This can be easily manipulated by for all papers written by an author. self-citations.Moreover,sincemanyindiceslikeh-indexand 3) Averagenumberofcitationsperpaper:Ratiooftotal number of citations are integers, it is difficult to distinguish number of citations and total number of papers, i.e., between two authors or papers with the same value of the Nc. This is sometimes also called the impact factor. Np index. Our index has a higher discriminatory power, since it 4) Average number of citations per author: For each is a real number between 0 and 1. It is highly unlikely that paper, its citation count is divided by the number of two authors or papers will have the same value for CITEX. authors for that paper to give the normalized citation There is also a non-uniformity across various disciplines. In count for the paper. The normalized citation counts sciences,thereisageneraltendencytoworkinlargegroups. are then summed across all papers to give the average Thesepapersalsoreceivelargenumberofcitationscompared number of citations per author. to papers in computer science. Since the score is divided 5) Average number of papers per author: For each uniformly among multiple authors, each author score will be paper, the inverse of the number of authors gives the reduced, thus affecting paper scores as well. Our index also normalized author count for the paper. The normalized gives credit to authors who write single-author papers. This authorcountsarethensummedacrossallpaperstogive should not deter people to do collaborative research. the average number of papers per author. 6) Average number of authors per paper: The sum of theauthorcountsacrossallpapers,dividedbythetotal The problem can be fine-tuned in a number of ways. We number of papers. cantakeintoconsiderationtherecommendationofauthorsby 7) h-index [13]: An author has index h, if h of his N p other authors and consider weighted distribution. The prob- papers have at least h citations each, and the rest of lem also has a number of ramifications. A similar index can the N −h papers have no more than h citations each. p be designed for product-customer recommendation system, 8) g-index[10]:Givenasetofarticlesrankedindecreas- wherecustomerscanrecommendeachotherdependingupon ing order of the number of citations that they received, thereputations(similartocitationofpapers),andacustomer theg-indexisthe(unique)largestnumbersuchthatthe can recommend a product (similar to writing papers). The top g articles together received at least g2 citations. differencehereisthat,thescoreofaproductisnotuniformly 9) e-index[21]:Itisthesquarerootofsurpluscitationsin divided amongst customers, and information cascades have theh-setbeyondthetheoreticalminimum(h2)required tobetakenintoaccountwhencalculatingtheproductscores. to obtain a h-index of h. It is useful for highly cited Other interdependent networks, which reinforce each other, scientists and for comparing those with the same h- can be treated similarly. index but different citation patterns. 10) Numberofsignificantpapers:Totalnumberofpapers The rest of the paper is organized as follows. In section with more than c citations for some integer c. II, we discuss the related work on citation analysis, define 11) Number of citations to the most cited papers: Total the metrics and compare them. In section III, we define the numberofcitationstothekmostcitedpapersforsome problem and propose the model to analyze it. We give an integer k. informal description of the algorithm and present the rules 12) Eigenfactor: The Eigenfactor score [3] is a rating of to iteratively compute the author and paper scores in section thetotalimportanceofascientificjournal.Journalsare IV. In section V, we mathematically analyze the iterative rated according to the number of incoming citations, procedure and prove that it converges to the eigenvectors of with citations from highly ranked journals weighted certain matrices. In section VI, we execute the algorithm on to make a larger contribution to the Eigenfactor than some illustrative examples, and show that the author scores thosefrompoorlyrankedjournals[2].Asameasureof andpaperscoresgivegoodindicationoftheirimportance,as importance, the Eigenfactor score scales with the total can be seen from the underlying graph structure. In section impact of a journal. Journals generating higher impact VII, we discuss some extensions of the basic algorithm and to the field have larger Eigenfactor scores. However, it futuredirectiontoworkon.Weconcludethepaperinsection is not clear whether Eigenfactor gives better estimate VIII with some future directions to work on. than raw citation count. [8] TABLEI COMPARISONBETWEENDIFFERENTCITATIONMETRICS CitationMetric Advantage Disadvantage Numberofpapers Measureofproductivity. Importanceofpapersnotconsidered. Numberofcitations Measuresimpactofanauthor. Afewhighlycitedpapersincreasethetotal. Survey and review articles are cited more than original research papers. Favorsestablishedauthors. Average number of citations per Allowscomparisonofscientistsofdifferent Rewardslowproductivity. paper ages. Average number of citations per Measuresimpactofanauthor. Difficult to distinguish between authors whose average is same, author butcitationpatternsaredifferent. Averagenumberofpapersperau- Measuresaverageproductivity Doesnotmeasureimpactofpapers. thor Averagenumberofauthorsperpa- Measurescollaborationbetweenauthors. Doesnotconsiderimportanceofauthorsandpapers. per h-index Measures both the quality and quantity of Doesnotaccountforthenumberofauthorsofapaper. scientificoutput. Different fields with different number of citations will have differenth-index. Canbemanipulatedthroughself-citations. g-index Givesmoreweighttohighly-citedarticles. Unlike the h-index, the g-index saturates whenever the average number of citations for all published papers exceeds the total numberofpublishedpapers. e-index Differentiatesbetweenscientistswithiden- Can’t be used independently. Must be used together with the h- ticalh-indicesbutdifferentcitations. index. Number of papers with at least c Measuresthebroadandsustainedimpactof Difficulttofindtherightvalueofc.Differentvaluesofcfavors citations anauthor. differentauthors. Number of citations to the k most Identifiesthemostinfluentialauthors. Notasinglenumber,sodifficulttocompare.Differentvaluesof citedpapers k favorsdifferentauthors. Eigenfactor Takesintoaccountimpactofthecitingpa- Does not give author scores. Importance of authors have to persinadditiontothenumberofcitations. inferredindirectlyfromthepapers(s)hehaswritten. B. Comparison between different citation metrics exponentially with the age of the paper. Chen et. al. [7] In this section, we compare the different citation indices uses a PAGERANK based algorithm to assess the relative importance of all publications. Their goal is to find some andstatetheiradvantagesanddisadvantages.Thecomparison exceptional papers or “gems” that are universally familiar to is presented in Table I. physicists.SunandGiles[19]proposeapopularityweighted C. Comparison with similar works rankingalgorithmforacademicdigitallibraries.Theyusethe There are a number of previous attempts to rank authors popularity of a publication venue and compare their method and papers based on importance. The SIMRANK algorithm withthePAGERANKalgorithm,citationcountsandtheHITS by Jeh and Widom [14] gives a measure of the similarity algorithm. between two objects based on their relationships with other D. Some structures in citation analysis objects.Theirbasicideaisthattwoobjectsaresimilarifthey are related to similar objects. Note that this only measures • Collaboration graph: This is a graph associated with the similarity of two objects, not their relative ranking, so the authors. The nodes of the graph are the authors. this is different from what we are trying to do. Zhou et. There is an undirected edge between two nodes, if the al. [22] proposed a method for co-ranking authors and their corresponding authors have written a paper together. publications using several networks associated with authors • Citation graph: This is a graph associated with the and papers. Although there is some similarity between our papers. The nodes of the graph are the papers. There algorithm and their approach, there are fundamental dif- is a directed edge from a paper to another paper, if the ferences between the two. Their co-ranking framework is first paper has cited the second paper. based on coupling two random walks that separately rank • Publication graph: This is a graph relating the authors authorsanddocumentsusingthe PAGERANK algorithm.Our with the papers. The nodes of the graph are the authors algorithm is designed from scratch and does not use the andthepapers.Thereisanundirectededgebetweentwo PAGERANK algorithm. Moreover, our algorithm is much nodes, if the author has written the paper. simpler and the computations required is also far lesser than III. PROBLEMDEFINITIONANDMODEL what is required in their method. Walker et. al. [20] gave a new algorithm called CITERANK. The ranking of papers We have a set of m authors A = {a1,...,am} and a is based on a network traffic model, which uses a variation set of n papers P = {p ,...,p }. We represent this by a 1 n of the PAGERANK algorithm. A paper is selected randomly publication graph GP = (VP,EP), whose vertices are the from the set of all papers with a probability that decays set of authors and papers, i.e., V = A∪P. There is an P undirectededgebetweenauthora andpaperp ,ifauthora the structure of the publication and citation graphs, so that i j i haswrittenpaperp .Notethatthisisasymmetricrelation,so important authors and papers get higher scores. j theedgesareundirected.Sincethereareonlyedgesbetween authorsandpapers,thepublicationgraphisabipartitegraph. IV. DESCRIPTIONOFTHECITEXINDEX Associated with this, there is an m×n publication matrix A. Informal description of the algorithm M, whose rows and columns are a ,...,a and p ,...,p 1 m 1 n In this section, we give an overview of our proposed respectively, and whose (i,j)th entry m = 1, if and only ij algorithm.Thealgorithmmaintainsasetofauthorscoresand if author a has written paper p . i j paper scores, which are initially set to 1. This initial choice Moreover, there is a citation graph G = (V ,E ) C C C ofscoresisarbitrary,andthescorescanbesettoanynonzero associated with the papers, whose vertices are the set of value.Thenweupdatethescoresconsideringtherelationship papers, i.e., V =P. There is a directed edge from paper p C j between authors and papers (who has authored which paper) to paper p , if paper p has cited paper p . Note that this is k j k andrelationshipbetweenpapers(whichpaperhascitedwhich anasymmetricrelation,sotheedgesaredirected.Associated paper). This critically uses the publication graph and the with this, there is an n×n citation matrix C, whose both citationgraph.WeusethePrincipleofRepeatedImprovement rows and columns are p ,...,p , and whose (j,k)th entry 1 n [9] to iteratively compute the new scores based on the c = 1, if and only if paper p has cited paper p . Note jk j k previous scores. More specifically, the author scores for the that the citation graph can’t have any directed cycle. This is next iteration is computed from the paper scores for the because a paper can only cite a previously published paper, current iteration. The paper scores for the next iteration is so they are totally ordered in time. This also means that if computedfromtheauthorscoresandthepaperscoresforthe the papers are numbered in decreasing order of time (newer current iteration. In every iteration, we normalize the scores first), the resulting citation matrix will be upper-triangular. bydividingthembythesumoftheindividualscores,sothat An example of a publication graph and a citation graph is each of them lies between 0 and 1, and they add up to 1. given in Figure 1. We continue to do this till the author scores and the paper scoresconvergeoraspecifiednumberofiterationshavebeen completed. The Principle of Repeated Improvement states p1 thateachimprovementofauthorscoreswillleadtoafurther improvementofpaperscores,andviceversa.Thefinalauthor a 1 scoresandpaperscoresarethemeasureofimportanceofthe p2 authors and the papers. The higher the score is, the higher is the impact of an author and a paper. a 2 B. Computing author and paper scores p 3 For each author a , we have an author score (a-score) x , a i i 3 and for each paper p , we have a paper score (p-score) y . j j p4 We represent the set of author scores as a column vector x=(x ,...,x )T and the set of paper scores as a column a 1 m 4 vector y=(y ,...,y )T. We initialize all author and paper 1 n p5 scores to one, i.e., x = y = 1. Then, we iteratively update the a-scores and p-scores using the following rules. Fig.1. ThepublicationandcitationgraphsforExample1showingauthors, papersandcitations. 1) For each paper pj, its adjusted p-score y¯j is given by the p-score y divided by the number of authors who j have written the paper. In other words, y¯ = yj, for The following sets are important for further development. j k j = 1,...,n, where k = |AUTHORS(p )| is the j 1) For an author a, PAPERS(a) is defined as the number of co-authors of the paper p . j set of papers written by author a. In other words, 2) For each author a , set his a-score x to be the sum i i PAPERS(a)={p∈P :(a,p)∈EP}. of the adjusted p-scores of all the papers that he has 2) For a paper p, AUTHORS(p) is defined as the set authored. In other words, x =(cid:80) y¯ , for i j∈PAPERS(i) j of authors who have written paper p. In other words, i=1,...,m. AUTHORS(p)={a∈A:(a,p)∈EP}. 3) Foreachpaperpj,setitsp-scoreyj tobethesumofthe 3) Forapaperp,CITE(p)isdefinedasthesetofpapers a-scores of all the authors who have co-authored the who have cited paper p. In other words, CITE(p) = paper p . In other words, y = (cid:80) x , j j i∈AUTHORS(j) i {q ∈P :(q,p)∈EC}. for j =1,...,n. 4) For a paper p, REF(p) is defined as the set of papers 4) For each paper p , add to its p-score y , the sum of j j which have been given as reference (cited) by paper p. the p-scores of all the papers who have cited the paper In other words, REF(p)={q ∈P :(p,q)∈EC}. pj. In other words, yj = yj + (cid:80)k∈CITE(j)yk, for Our goal is to assign scores to authors and papers using j =1,...,n. We normalize the scores by dividing the author (paper) Further, both x(cid:63) and y(cid:63) are non-negative and non-zero scores by the sum of the author (paper) scores, so that each vectors. score lies between 0 and 1, and the sum of the scores is 1. Proof: From the above discussion we have, x(cid:104)k(cid:105) = V. MATHEMATICALANALYSIS Pkx(cid:104)0(cid:105) and y(cid:104)k(cid:105) = Qky(cid:104)0(cid:105), where P = W(I +CT)MT and Q=(I+CT)MTW. Note that P is an m×m square A. Analysis of author scores and paper scores matrix, whereas Q is an n × n square matrix. Moreover, (cid:80) We observe that the rule yj = i∈AUTHORS(j)xi can x(cid:104)k+1(cid:105) = Pk+1x(cid:104)0(cid:105) = P ·Pkx(cid:104)0(cid:105) = Px(cid:104)k(cid:105). If the author be rewritten as yj = (cid:80)mi=1mijxi, since mij = 1 if score x(cid:104)k(cid:105) converges to the vector x(cid:63) in the limit when and only if i ∈ AUTHORS(j). Consider the matrix- k → ∞, then this vector should satisfy Px(cid:63) = x(cid:63). This vector equation y ← MTx. The jth row of this equation means that x(cid:63) is an eigenvector of P, with the correspond- is yj = (cid:80)mi=1mijxi. Hence, this matrix-vector equation ing eigenvalue being 1. Similarly, if the paper score y(cid:104)k(cid:105) succinctly encodes all n scalar equations for j =1,...,n. converges to the vector y(cid:63) in the limit when k → ∞, then The corresponding equation for x is similar, but a little y(cid:63) must be an eigenvector of Q, with the corresponding (cid:80) more involved. The equation xi = j∈PAPERS(i)y¯j can eigenvalue being 1. be written as xi = (cid:80)nj=1mijy¯j, since mij = 1 if and only Toprovethatanon-negativeeigenvalueandanon-negative if j ∈ PAPERS(i). Let y¯ = (y¯1,...,y¯n)T. Consider the eigenvector exists, we use the following theorem from linear matrix-vector equation x ← My¯. The ith row of this equa- algebra. tionisx =(cid:80)n m y¯ .Hence,thismatrix-vectorequation i j=1 ij j succinctly encodes all m scalar equations for i = 1,...,m. Theorem 2 (PERRON-FROBENIUS THEOREM). [17], [11], Fxuirmt=hiejr(cid:80)nonji=tse1tht(cid:16)he(cid:80)awt,mime=y¯i1igjjmh=tija(cid:17)|sAsyUojcTiH=atyOejdR(cid:80)Swnj(=pitjh1)|wth=iejyp(cid:80)ja,mip=wye1jrhmpeirje..NHwoeiwjn,c=ex, [tsh9taa]tteLmaetiejnAts≥=ho(0la,di∀.ji),bje:an1n≤×ni,njon≤-nenga.tiTvheemnatthreix,fomlleoawniinngg c(cid:80)anmi=1bemiwjritten as x ← Wy, where W is the m×jn weight 1) A has a real eigenvalue c ≥ 0 such that c > |c(cid:48)| for all other eigenvalues c(cid:48). matrix whose (i,j)th entry is w . ij 2) There is an eigenvector v with non-negative real (cid:80) The equation y = y + y can be rewritten j j k∈CITE(j) k components corresponding to the largest eigenvalue as y = y +(cid:80)n c y , since c = 1 if and only if k ∈ j j k=1 kj k kj c : Av = cv,vi ≥ 0,1 ≤ i ≤ n, and v is unique CITE(j). This can be written as the matrix-vector equation up to multiplication by a constant. y←(I+CT)y, where I is the n×n identity matrix. 3) If the largest eigenvalue c is equal to 1, then for any Let the initial author vector and paper vector be x(cid:104)0(cid:105) and startingvectorx(cid:104)0(cid:105) (cid:54)=0withnon-negativecomponents, y(cid:104)0(cid:105) respectively.Ifwestartwiththeequationx←Wy,the thesequenceofvectorsAkx(cid:104)0(cid:105) convergetoavectorin successive iterations proceed as below. the direction of v as k →∞. x(cid:104)1(cid:105) =Wy(cid:104)0(cid:105), (1) Thus, by the Perron-Frobenius theorem, the author and y(cid:104)1(cid:105) =MTx(cid:104)1(cid:105) =MTWy(cid:104)0(cid:105), (2) paper scores both converge to unique non-negative vectors y(cid:104)1(cid:105) =(I+CT)y(cid:104)1(cid:105) =(I+CT)MTWy(cid:104)0(cid:105). (3) x(cid:63) and y(cid:63), after repeated applications of the update rules. These two vectors are the limiting values of the author and Similarly, if we start with the equation y ← MTx, the paper scores. Moreover, none of the vectors x(cid:63) and y(cid:63) can successive iterations proceed as below. be the zero vector. At least one of their components must be non-zero, because the initial vectors x(cid:104)0(cid:105) and y(cid:104)0(cid:105) are the y(cid:104)1(cid:105) =MTx(cid:104)0(cid:105), (4) all-1vectors,andateachiterationthevectorsarenormalized. y(cid:104)1(cid:105) =(I+CT)y(cid:104)1(cid:105) =(I+CT)MTx(cid:104)0(cid:105), (5) So the sum of their components add up to 1. x(cid:104)1(cid:105) =Wy(cid:104)1(cid:105) =W(I+CT)MTx(cid:104)0(cid:105). (6) C. Time-complexity of each iteration Proceeding similarly, at the k-th iteration the author and paper vectors are given by, We have to multiply some matrices as can be seen from equations (7) and (8). Computing the product of a m×n x(cid:104)k(cid:105) =[W(I+CT)MT]kx(cid:104)0(cid:105), (7) matrixandan×pmatrixrequiresO(mnp)time.Computing y(cid:104)k(cid:105) =[(I+CT)MTW]ky(cid:104)0(cid:105). (8) the matrices W(I + CT)MT and (I + CT)MTW takes O(mn(m+n))timeeach.Hence,eachiterationcanbedone B. Proof of convergence of author scores and paper scores in O(mn(m+n)) time. In this section, we will prove the following theorem. Theorem 1. The sequences x(cid:104)k(cid:105) and y(cid:104)k(cid:105), k = 0,1,2,... VI. EXPERIMENTALANALYSIS converge to the limits x(cid:63) and y(cid:63) respectively. Moreover, x(cid:63) istheprincipaleigenvectorofthematrixW(I+CT)MT and We compute the author and paper scores for some graphs y(cid:63) istheprincipaleigenvectorofthematrix(I+CT)MTW. and show that they match with our intuition.   A. Example 1 1 0 0 0 0 1 1 0 1 0 1 0 For the graph in Figure 1, the author and paper vectors,   0 0 1 0 M =0 1 0 0,C = . publication matrix and citation matrix are given below. Note   0 0 0 0 0 1 0 0 that for this example, m=4,n=5. 1 0 1 0 0 0 0 1 x=[1,1,1,1],y=[1,1,1,1,1], Theauthorscoresafter10iterationsaregivenbelow.These scoresconverge(upto3decimalplaces)asthereisnochange     0 1 1 0 1 between two successive iterations. 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0   x=[0.106,0.590,0.152,0.152,0.000], M = ,C =0 0 0 0 1. 0 1 1 0 1   0 0 0 0 1 y=[0.212,0.304,0.484,0.000]. 1 0 1 1 1 0 0 0 0 0 Intuitively it is clear that author a and paper p should 2 3 Theauthorscoresafter10iterationsaregivenbelow.These get the highest scores. Author a has written two papers p 2 1 scoresconverge(upto3decimalplaces)asthereisnochange and p . In turn, p has been cited by papers p ,p and p . 3 3 1 2 4 between two successive iterations. On the other hand, p gets the lowest paper score 0, since 4 it is not cited by any paper. a gets the lowest author score 5 x=[0.259,0.132,0.289,0.320], sinceithasonlywrittenpaperp ,whichhasascore0.These 4 y=[0.082,0.141,0.264,0.123,0.390]. scores matches with our intuition. Intuitively it is clear that author a4 and paper p5 should VII. EXTENSIONS get the highest scores. Author a has written 4 papers 4 A. Weights on edges of the publication graph p ,p ,p ,p and some of them have high paper scores. 1 3 4 5 In Section V we assumed that all authors contribute Similarly, p has been written by 3 authors a ,a ,a and 5 1 3 4 equally to a paper. However, this is not true in practice. citedby3papersp ,p ,p ,someofwhichhavehighscores. 1 3 4 Different co-authors have different contribution to a paper. a getsthelowestscoreasithaswrittenonly2papersp ,p , 2 2 4 We can easily incorporate this feature in our CITEX index, none of which have high paper scores. Similarly, p gets the 1 by slightly modifying the publication matrix M. In Section lowest score since it has no citations, although it has two V, M was a 0-1 matrix. For the weighted version, we authors a ,a . 1 4 construct a matrix N, where edge weight n denotes the ij contributionofauthoriinwritingpaperj.Anauthorhaving highercontributionhashigherweight,comparedtoanauthor having lesser contribution. The new matrix N might not be a 0-1 matrix. W(cid:48) is the matrix corresponding to W. aa1 p1 HNwieo(cid:48)jnwc=,ex,(cid:80)cxamiin=ni1=jbneij(cid:80)wirsnjit=tteh1ne(cid:16)aw(cid:80)semixn=igi1←jhntijWa(cid:17)ssy(cid:48)oyjc,iaw=tehde(cid:80)rwenjiW=th1(cid:48)wtihsi(cid:48)ejtyhpjea,pwweerhigephrjet. 2 matrix whose (i,j)th entry is w(cid:48) . ij p2 Now the author and paper vectors at the k-th iteration can be written as, a 3 x(cid:104)k(cid:105) =[W(cid:48)(I+CT)NT]kx(cid:104)0(cid:105), p 3 y(cid:104)k(cid:105) =[(I+CT)NTW(cid:48)]ky(cid:104)0(cid:105). a 4 B. Reputation of authors p 4 OurCITEXindexcanbemodifiedtoincludethereputation a of authors. Each author can rank other authors who he/she 5 believes has done original, ground breaking work. We define Fig.2. ThepublicationandcitationgraphsforExample2showingauthors, the author reputation graph G = (V ,E ), similar to the R R R papersandcitations. citation graph. The vertices are the set of authors, i.e., V = R A.Thus,thenumberofverticesinthisgraphism.Adirected edge from node i to node j of weight r exists, if author ij B. Example 2 i has rated author j with a score r . The author reputation ij For the graph in Figure 2, the author and paper vectors, matrix R is defined similarly. For an author a, REP(a) is publication matrix and citation matrix are given below. Note defined as the set of authors who have ranked author a. In that for this example, m=5,n=4. other words, REP(a)={b∈A:(b,a)∈E }. R On incorporating the reputation matrix, another rule is x=[1,1,1,1,1],y=[1,1,1,1], added to the list. For each author a , add to its a-score x , i i the sum of the a-scores of all the authors who have rated • The analytical power of eigenvector-based methods is (cid:80) author a . In other words, x = x + x = not yet fully understood. It would be interesting to i i i k∈REP(i) k x +(cid:80)m r x , for i=1,...,m. pursue this question in the context of the algorithm i k=1 ki k Let the initial author vector and paper vector be x(cid:104)0(cid:105) and presented here. Considering random graph models that y(cid:104)0(cid:105) respectively.Ifwestartwiththeequationx←Wy,the contain enough structure to capture certain global prop- successive iterations proceed as below. erties of the model is a promising direction. x(cid:104)1(cid:105) =Wy(cid:104)0(cid:105), REFERENCES x(cid:104)1(cid:105) =(I+RT)x(cid:104)1(cid:105) =(I+RT)Wy(cid:104)0(cid:105) [1] Pablo D. Batista, Moˆnica G. Campiteli, and Osame Kinouchi. Is it possible to compare researchers with different scientific interests? y(cid:104)1(cid:105) =MTx(cid:104)1(cid:105) =MT(I+RT)Wy(cid:104)0(cid:105), Scientometrics,68(1):179–189,2006. [2] CarlBergstrom.Measuringthevalueandprestigeofscholarlyjournals. y(cid:104)1(cid:105) =(I+CT)y(cid:104)1(cid:105) =(I+CT)MT(I+RT)Wy(cid:104)0(cid:105). College&ResearchLibrariesNews,68(5):3146,2007. [3] Carl T Bergstrom, Jevin D West, and Marc A Wiseman. The Proceeding similarly, at the k-th iteration the author and EigenfactorTM metrics. The Journal of Neuroscience, 28(45):11433– paper vectors are given by, 11434,2008. [4] Johan Bollen, Herbert Van de Sompel, Joan A. Smith, and Richard x(cid:104)k(cid:105) =[(I+RT)W(I+CT)MT]kx(cid:104)0(cid:105), Luce. Towardalternativemetricsofjournalimpact:Acomparisonof downloadandcitationdata. Inf.Process.Manage.,41(6):1419–1440, y(cid:104)k(cid:105) =[(I+CT)MT(I+RT)W]ky(cid:104)0(cid:105). 2005. [5] Johan Bollen, Marko A Rodriquez, and Herbert Van de Sompel. Journalstatus. Scientometrics,69(3):669–687,2006. C. Evaluation of journals and conferences [6] Sergey Brin and Lawrence Page. The anatomy of a large-scale hy- One important question is how to measure the quality of pertextualwebsearchengine. ComputernetworksandISDNsystems, 30(1):107–117,1998. scientific journals and conferences. If we have the author [7] PengChen,HuafengXie,SergeiMaslov,andSidneyRedner. Finding scores and paper scores of all authors and papers published scientific gems with google’s pagerank algorithm. Journal of Infor- in a journal/conference, we can use the average author score metrics,1(1):8–15,2007. [8] PhilipMDavis. Eigenfactor:Doestheprincipleofrepeatedimprove- and the average paper score as a metric for determining the ment result in better estimates than raw citation counts? Journal quality of the journal/conference. of the American Society for Information Science and Technology, 59(13):2186–2188,2008. VIII. CONCLUSIONANDFUTUREDIRECTIONS [9] David Easley and Jon Kleinberg. Networks, crowds, and markets: Reasoning about a highly connected world. Cambridge University In this paper, we have proposed a new citation metric Press,2010. CITEXtojudgethequalityofauthorsandpapersinscientific [10] Leo Egghe. Theory and practise of the g-index. Scientometrics, 69(1):131–152,2006. publications. CITEX assigns scores to authors and papers, [11] Georg Ferdinand Frobenius. U¨ber Matrizen aus nicht negativen with higher scores indicating more importance, by analyzing Elementen. Ko¨niglicheAkademiederWissenschaften,1912. the link structures in the publication graph and the citation [12] EugeneGarfield.Citationindexesforscience.Science,122(3159):108– 111,1955. graph. We also considered some extensions to the basic [13] JorgeEHirsch.Anindextoquantifyanindividual’sscientificresearch scheme. Here are some future directions to work on. output.ProceedingsoftheNationalacademyofSciencesoftheUnited StatesofAmerica,102(46):16569–16572,2005. • In a real-world scenario, authors and papers will be [14] GlenJehandJenniferWidom.Simrank:ameasureofstructural-context added over time. Dynamically modifying the scores similarity. In Proceedings of the eighth ACM SIGKDD international from the current scores in an incremental fashion is a conferenceonKnowledgediscoveryanddatamining,pages538–543. ACM,2002. challenging problem. [15] JonM.Kleinberg.Authoritativesourcesinahyperlinkedenvironment. • Applying this framework to the setting of customer J.ACM,46(5):604–632,1999. recommendationofproductswillbeaninterestingidea. [16] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. ThePageRankcitationranking:Bringingordertotheweb. Technical Herethenodesofthegrapharecustomersandproducts. report,StanfordInfoLab,1999. A customer can give a score to a product, which is like [17] Oskar Perron. Zur theorie der matrices. Mathematische Annalen, the weighted version of the publication graph. 64(2):248–263,1907. [18] Gabriel Pinski and Francis Narin. Citation influence for journal • Thisisanexampleofaninterdependentnetwork,where aggregates of scientific publications: Theory, with application to the therearetwographs–thecollaboration/reputationgraph literatureofphysics. Inf.Process.Manage.,12(5):297–312,1976. andthecitationgraph.Inaddition,thereisapublication [19] YangSunandCLeeGiles. Popularityweightedrankingforacademic digital libraries. In Proceedings of the 29th European conference on graph which records the cross-edges between the nodes IRresearch,pages605–612.Springer-Verlag,2007. inthetwographs.ExtendingCITEXtootherinterdepen- [20] DylanWalker,HuafengXie,Koon-KiuYan,andSergeiMaslov.Rank- dent networks will be an interesting direction to think ing scientific publications using a model of network traffic. Journal of Statistical Mechanics: Theory and Experiment, 2007(06):P06010, about. 2007. • Using further parameters such as time of publication [21] Chun-TingZhang.Thee-index,complementingtheh-indexforexcess and age of authors, in addition to the link structure will citations. PLoSONE,5(5):1–22,2009. [22] Ding Zhou, Sergey A Orshanskiy, Hongyuan Zha, and C Lee Giles. require further thoughts. Co-ranking authors and documents in a heterogeneous network. In • Canthistechniquebeextendedto directlyassignscores DataMining,2007.ICDM2007.SeventhIEEEInternationalConfer- to journals and conferences, rather than doing it indi- enceon,pages739–744.IEEE,2007. rectly, as in section VII-C?

CITEX: A new citation index to measure the relative importance of authors and papers in scientific publications PDF

0.27 MB·English

by Arindam Pal

#journals #arxiv

Checking for file health...

Save to my drive

Quick download

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview CITEX: A new citation index to measure the relative importance of authors and papers in scientific publications

See more

The list of books you might like

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.