Aalborg Universitet

On the Assessment of Expertise Profiles
Berends, Richard; De Rijke, Maarten; Balog, Krisztian; Bogers, Toine; Van den Bosch, Antal

Published in: Journal of American Society for Information Science
DOI (link to publication from Publisher): 10.1002/asi.22908
Publication date: 2013

Downloaded from vbn.aau.dk on: February 15, 2023

On the Assessment of Expertise Profiles

Richard Berendsen and Maarten de Rijke
ISLA, University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands. E-mail: {r.w.berendsen; derijke}@uva.nl

Krisztian Balog
Department of Electrical Engineering and Computer Science, University of Stavanger, NO-4036 Stavanger, Norway. E-mail: [email protected]

Toine Bogers
Royal School of Library Information Science, Birketinget 6, DK-2300, Copenhagen, Denmark. E-mail: [email protected]

Antal van den Bosch
Faculty of Arts, CIW-Bedrijfscommunicatie, Radboud University Nijmegen, P.O. Box 9103, NL-6500 HD Nijmegen, The Netherlands. E-mail: [email protected]

Expertise retrieval has attracted significant interest in the field of information retrieval. Expert finding has been studied extensively, with less attention going to the complementary task of expert profiling, that is, automatically identifying topics about which a person is knowledgeable. We describe a test collection for expert profiling in which expert users have self-selected their knowledge areas. Motivated by the sparseness of this set of knowledge areas, we report on an assessment experiment in which academic experts judge a profile that has been automatically generated by state-of-the-art expert-profiling algorithms; optionally, experts can indicate a level of expertise for relevant areas. Experts may also give feedback on the quality of the system-generated knowledge areas. We report on a content analysis of these comments and gain insights into what aspects of profiles matter to experts. We provide an error analysis of the system-generated profiles, identifying factors that help explain why certain experts may be harder to profile than others. We also analyze the impact on evaluating expert-profiling systems of using self-selected versus judged system-generated knowledge areas as ground truth; they rank systems somewhat differently but detect about the same amount of pairwise significant differences despite the fact that the judged system-generated assessments are more sparse.

Introduction

An organization’s intranet provides a means for exchanging information and facilitating collaborations among employees. To efficiently and effectively achieve collaboration, it is necessary to provide search facilities that enable employees not only to access documents but also to identify expert colleagues (Hertzum & Pejtersen, 2006). At the Text REtrieval Conference Enterprise Track (Bailey, Craswell, de Vries, & Soboroff, 2008; Balog, Soboroff, et al., 2009; Craswell, de Vries, & Soboroff, 2006; Soboroff, de Vries, & Craswell, 2007), the need to study and understand expertise retrieval has been recognized through the introduction of the expert-finding task. The goal of expert finding is to identify a list of people who are knowledgeable about a given topic: Who are the experts on topic X? This task is usually addressed by uncovering associations between people and topics (Balog, Fang, de Rijke, Serdyukov, & Si, 2012); commonly, a co-occurrence of the name of a person with topics in the same context is assumed to be evidence of expertise. An alternative task, building on the same underlying principle of computing people–topic associations, is expert profiling, in which systems have to return a list of topics that a person is knowledgeable about (Balog, Bogers, Azzopardi, de Rijke, & van den Bosch, 2007; Balog & de Rijke, 2007).
Essentially, (topical) expert profiling turns the expert-finding task around and asks the following: What topic(s) does a person know about?

Received April 23, 2012; revised December 4, 2012; accepted December 5, 2012. © 2013 ASIS&T. Published online 28 June 2013 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.22908. Journal of the American Society for Information Science and Technology, 64(10):2024–2044, 2013.

Expert profiling is useful in its own right for users who want to profile experts they already know. It is also a key task to address in any expert-finding system; such systems rank experts, and users will want to navigate to profiles of these experts. Complete and accurate expert profiles enable people and search engines to effectively and efficiently locate the most appropriate experts for an information need. In addition to a topical profile, it is recognized that social factors play a large role in decisions about which experts to approach (Balog & de Rijke, 2007; Cross, Parker, & Borgatti, 2002; Hofmann, Balog, Bogers, & de Rijke, 2010; Smirnova & Balog, 2011).

We focus on the topical expert-profiling task in a knowledge-intensive organization, that is, a university, and release an updated version of the Universiteit van Tilburg (UvT; Tilburg University [TU]) expert collection (Bogers & Balog, 2006), which was created with data from the UvT. Because the university no longer uses the acronym UvT and has switched to TU instead, we call the updated collection the TU expert collection.¹ The TU expert collection is based on the Webwijs (“Webwise”) system² developed at TU. Webwijs is a publicly accessible database of TU employees who are involved in research or teaching, where each expert can indicate his or her skills by selecting expertise areas from a list of knowledge areas. Prior work has used these self-selected areas as ground truth for both expert-finding and expert-profiling tasks (Balog, 2008; Balog et al., 2007). With the TU expert collection come updated profiles consisting of these self-selected knowledge areas; we refer to this set of areas as self-selected knowledge areas.

One problem with self-selected knowledge areas is that they may be sparse. There is a large number of possible expertise areas to choose from (more than 2,000). When choosing their knowledge areas, experts may not necessarily browse the set of knowledge areas very thoroughly, especially because the interface in which they select the areas lists them in alphabetical order without providing links between related areas. This might result in sparse data with a limited number of knowledge areas assigned to each expert. Using these self-selected knowledge areas as ground truth for assessing automatic profiling systems may therefore not reflect the true predictive power of these systems. To find out more about how well these systems perform under real-world circumstances, we have asked TU employees to judge and comment on the profiles that have been automatically generated for them. Specifically, we have used state-of-the-art expertise retrieval methods to construct topical expertise profiles. TU employees were then asked to reassess their self-selected knowledge areas based on our recommendations; in addition, they were given the option to indicate the level of expertise for each selected area. Moreover, they could give free text comments on the quality of the expert profiles. We refer to this whole process as the assessment experiment in this article.

We group the research questions in this article in two parts. In the first part, we perform a detailed analysis of the outcomes of the assessment experiment. One important outcome is a new set of graded relevance assessments, which we call the judged system-generated knowledge areas. We examine the completeness of these new assessments. The knowledge areas experts selected and the textual feedback they gave provide us with a unique opportunity to answer the following question: “How well are we doing at the expert-profiling task?” We perform a detailed error analysis of the generated profiles and a content analysis of experts’ feedback, leading to new insights on what aspects make expertise retrieval difficult for current systems.

In the second part, we take a step back and ask: “Does benchmarking a set of expertise retrieval systems with the judged system-generated profiles lead to different conclusions compared with benchmarking with the self-selected profiles?” We benchmark eight state-of-the-art expertise retrieval systems with both sets of ground truth and investigate changes in absolute system scores, system ranking, and the number of significant differences detected between systems. We find that there are differences in evaluation outcomes, and we are able to isolate factors that contribute to these differences. Based on our findings, we provide recommendations for researchers and practitioners who want to evaluate their own systems.

The main contributions of this article are as follows:

• The release of a test collection for assessing expert profiling (the TU expert collection), plus a critical assessment and analysis of this test collection. Test collections support the continuous evaluation and improvement of retrieval models by researchers and practitioners, in this case, in the field of expertise retrieval.
• Insights into the performance of current expert-profiling systems through an extensive error analysis, plus a content analysis of feedback of experts on the generated profiles. These insights lead to recommendations for improving expertise profiling systems.
• Insights into the differences in evaluation outcomes between evaluating the two sets of ground truth released with this article. This will allow researchers and practitioners in the field of expertise retrieval to understand the performance of their own systems better.

Before we delve in, we give a small recap of some terminology. Expert profiles, or topical profiles, in this article consist of a set of knowledge areas from a thesaurus. Throughout the article, we focus on two kinds of expert profiles that we use as ground truth.

Self-selected: These profiles consist of knowledge areas that experts originally selected from an alphabetical list of knowledge areas.

Judged system generated: These profiles consist of those knowledge areas that experts judged relevant from system-generated profiles: a ranked list of (up to) 100 knowledge areas that we generated for them.

¹ The TU expert collection is publicly available at http://ilps.science.uva.nl/tu-expert-collection. For a description of the items contained in the collection, please see the Appendix to this article.
² http://www.tilburguniversity.edu/webwijs/

The rest of this article is structured as follows: We start by reviewing related work on test collection–based evaluation methodology in the Related Work section. In the Topical Profiling Task section, we define the topical profiling task. Next, we describe the assessment experiment: the profiling models used to generate the profiles and the assessment interface experts used to judge these profiles. In the Research Questions and Methodology section, we state our research questions and the methods used to answer them. We present and analyze the results of our assessment experiment in the Results and Analysis of the Assessment Experiment section, followed by an analysis of benchmarking differences between two sets of relevance assessments in the Self-Selected Versus Judged System-Generated Knowledge Areas: Impact on Evaluation Outcomes section. In the Discussion and Conclusion section, we wrap up with a discussion, conclusion, and look ahead.

Related Work

We start with a brief discussion on benchmarking and on how it has been analyzed in the literature. Then, we zoom in on the ingredients that constitute a test collection. Next, we consider related work on error analysis. We end with an overview of other test collections for expert profiling and expert finding.

Benchmarking

A recent overview on test collection–based evaluation of information retrieval systems can be found in Sanderson (2010). Today’s dominant way of performing test collection–based evaluation in information retrieval was first carried out in the Cranfield experiments and later in numerous Text REtrieval Conference (TREC) campaigns. Our work falls into this tradition. To be able to create a yardstick for benchmarking, simplified assumptions have to be made about users, their tasks, and their notion of relevance. For example, in the TREC ad hoc collections, a user is assumed to have an information need that is informational (Broder, 2002), and a document is relevant if it contains a relevant piece of information, even if it is duplicate information. By framing the topical profiling task as a ranking task, we also make some simplifying assumptions. For example, users are satisfied with a ranking of expertise areas for an expert, and an area is relevant if experts have judged it so themselves.

Typically, evaluation methodologies are assessed by comparing them with each other, performing detailed analyses in terms of sensitivity, stability, and robustness. Sensitivity of an evaluation methodology has been tested by comparing it with a hypothesized correct ranking of systems (Hofmann, Whiteson, & de Rijke, 2011; Radlinski & Craswell, 2010). Stability and robustness are closely related concepts. An evaluation methodology can be said to be stable with respect to some changing variable, or robust to changes in that variable. For example, Radlinski and Craswell (2010) examine how evaluation changes when queries are subsampled. Buckley and Voorhees (2004) examine changes when relevance assessments are subsampled. In this study, we are interested in comparing and analyzing the outcomes of evaluating with two sets of relevance assessments (self-selected versus judged system-generated knowledge areas), and we consider two criteria: stability and sensitivity. To analyze stability, we identify four differences between our two sets of ground truth and ask how evaluation outcomes vary with respect to these differences. For analyzing sensitivity, we do not have a hypothesized correct or preferred ranking. Instead, we investigate how many significant differences can be detected with each set of ground truth.

When two evaluation approaches generate quantitative output for multiple systems, they can be correlated to each other. Often this is done by comparing the ordering of all pairs of systems in one ranking with the ordering of the corresponding pair in the other ranking. One often used measure is accuracy: the ratio of pairs for which both rankings agree (see, e.g., Hofmann, Whiteson, & de Rijke, 2011; Radlinski & Craswell, 2010; Sanderson & Zobel, 2005; Voorhees & Buckley, 2002). Another commonly used measure (Buckley & Voorhees, 2004; Voorhees, 2000) that we use in this article is Kendall tau (Kendall, 1938), a rank correlation coefficient that can be used to establish whether there is a monotonic relationship between two variables (Sheskin, 2011).

There are several ways to assess (relative) system performance besides benchmarking. Su (1992, 1994) directly interviews end users. Allan, Carterette, and Lewis (2005), Turpin and Scholer (2006), and Smith and Kantor (2008) give users a task and measure variables such as task accuracy, task completion time, or number of relevant documents retrieved in a fixed amount of time. Sometimes there are strong hypotheses about relative quality of systems by construction. Usage data such as clicks may also be used to estimate user preferences, for example, by interleaving the ranked lists of two rankers and recording clicks on the interleaved list (Hofmann et al., 2011; Joachims, 2002; Radlinski et al., 2008). In our study, we offer experts the opportunity to comment on the quality of system-generated profiles, which we analyze through a content analysis, as in Lazar, Feng, and Hochheiser (2010).

Ingredients of a Test Collection

A test collection–based evaluation methodology consists of a document collection, a set of test queries, a set of relevance assessments, an evaluation metric, and possibly a significance test to be able to claim that the differences found would generalize to a larger population of test queries.

Test queries. Test query creation in the TREC ad hoc tracks is typically done by assessors and, hence, test queries reflect assessors’ interests. When test queries were created for the web track, they were retrofitted around queries sampled from web query logs so as to more closely reflect end-user interests (Voorhees & Harman, 2000). In our study on expert profiling, test queries (i.e., “information needs”) are readily available: They are potentially all experts from the knowledge-intensive organization being considered.

At the TREC ad hoc tracks, test queries with too few or too many relevant documents are sometimes rejected (Harman, 1995; Voorhees & Harman, 2000). Harman (1995) reports that for all created queries, a trial run on a sample of documents from the complete collection yielded between 25 (narrow query) and 100 (broad query) relevant documents. Zobel (1998) notes that selecting queries based on the number of relevant documents may introduce a bias. In our experiments, we retain all test queries (i.e., all experts) that have at least one relevant knowledge area.

Relevance assessments. Relevance assessments are typically created by assessors. For the UvT collection used by Balog et al. (2007) and for the two new sets of relevance assessments we release with the TU collection, they are created by the experts themselves. Other test collections for expert-finding assessments have been provided by an external person (for the W3C collection [W3C, 2005]), or by colleagues (for the CERC [Bailey, Craswell, Soboroff, & de Vries, 2007] and UvT [Liebregts & Bogers, 2009] collections).

Voorhees (2000) studies the impact of obtaining relevance assessments from different assessors on system ranking using Kendall tau. Although different assessors may judge different documents relevant, system ranking is highly stable no matter from which assessor the assessments are taken. In our situation, the different sets of relevance assessments for each expert are created by the same expert but through different means: In one case, the assessor had to manually go through a list; in the other, the assessor was offered suggestions. We find that system ranking may be affected by these differences.

An important aspect is the completeness of relevance assessments. When test collections were still small, all items in it were judged for every test query (Sanderson, 2010). The experts who participated in our experiments have little time, however, and were simply not available to do this. A well-known method to evaluate systems without complete assessments is pooling. It was proposed by Jones and van Rijsbergen (1975) as a method for building larger test collections. The idea is to pool independent searches using any available information and device (Sanderson, 2010). In our study, we also perform a specific kind of pooling. We use eight systems to generate expertise profiles, that is, lists of knowledge areas characterizing the expertise of a person. The eight systems are not independent, but are all possible combinations of one of two retrieval models, one of two languages, and one of two strategies concerning the utilization of a thesaurus of knowledge areas. Unlike the methodology used at TREC (Voorhees & Harman, 2000), we do not take a fixed pooling depth for each run, perform a merge, and order results randomly for the assessor. Instead, to minimize the time required for experts to find relevant knowledge areas, we aim to produce the best possible ranking by using a combination algorithm. We also have something akin to a manual run: It consists of the knowledge areas selected by experts in the TU-Webwijs system. We confirm in this study that this manual run contributes many unique relevant results; that is, the automatic systems fail to find a significant amount of these knowledge areas.

Zobel (1998) performs a study on pooling bias. The concern here is that assessments are biased toward contributing runs and very different systems would receive a score that is too low. In particular, systems that are good at finding difficult documents would be penalized. For several TREC collections, Zobel found that when relevant documents contributed by any particular run were taken out, performance of that run would only slightly decrease. In our study, experts only judged those knowledge areas that the automatic systems found. We study the effect of regarding unpooled areas as nonrelevant on system ranking and find that it has hardly any impact. For this, like Buckley and Voorhees (2004), we use Kendall tau.

Significance tests. Significance tests are mostly used to estimate which findings about average system performance on a set of test queries will generalize to other queries from the same assumed underlying population of queries. A simple rule of thumb is that an absolute performance difference of less than 5% is not notable (Spärck Jones, 1974). Pairwise significance tests are common in cases when different systems can be evaluated on the same set of test queries. Voorhees and Buckley (2002) test the 5% rule of thumb with a fixed set of systems and a fixed document collection. Sanderson and Zobel (2005) extend this research and consider relative rather than absolute performance differences; they prefer a pairwise t test over the sign test and the Wilcoxon test. Smucker, Allan, and Carterette (2007) compared p values (for the null hypothesis that pairs of TREC runs do not differ) computed with five significance tests. They find that the Fisher pairwise randomization test, matched pairs Student’s t test, and bootstrap test all agree with each other, whereas the Wilcoxon and sign tests disagree with these three and with each other. They recommend Fisher’s pairwise randomization test, which is what we use.

In our work, we vary query sets and sets of relevance assessments. Then, keeping the significance test used fixed, we measure the average number of systems each system differs significantly from. We view this as a rough indication of the ability of each set of assessments to distinguish between systems. The difference in this number between sets of relevance assessments is a rough heuristic for the difference in sensitivity of the sets. Cohen (1995), who was interested in repeating a benchmarking experiment using a more stringent alpha value in the significance test, computed the average number of systems each system differs from for both values of alpha. He called the difference between these numbers the criterion differential, saying it is a rough heuristic for the difference in sensitivity of both alpha values.

Error Analysis

Although test collections enable us to discriminate systems in their average performance over a set of queries with a certain reliability and sensitivity, Harman and Buckley (2009) stress that it is important to understand variance in performance over queries. Often, performance of single systems varies more over queries than performance on one query varies over systems. Variation in performance over queries does not simply correlate with the number of relevant documents; there is an interaction between query, system, and document collection (Voorhees & Harman, 1996). In the error analysis of our best performing system (the combining algorithm that was used to arrive at the set of judged system-generated knowledge areas) on the expert level, we find the same lack of correlation between number of relevant knowledge areas and system score. We are able to explain some of the performance differences between systems based on other properties of experts, however, such as their profession and the kinds of documents they are associated with in the collection. In addition to providing an analysis at the expert level, we provide one at the level of knowledge areas. We distinguish two categories: knowledge areas that are difficult to find and knowledge areas that are too often retrieved at high ranks (“false positives”). Related work in this area was done by Azzopardi and Vinay (2008), who define evaluation metrics that capture how well systems make individual documents accessible and point to interesting evaluation scenarios in which these metrics may be applied.

Other Test Collections for Expert Profiling and Expert Finding

The TU expert collection that we release is an update and extension of the collection released by Balog et al. (2007), which has previously been used for expert profiling and expert finding. To the best of our knowledge, no other test collections have been used for the expert-profiling task. Other test collections for expert finding include the W3C collection (W3C, 2005) and the CERC collection (Bailey et al., 2007). For these collections, relevance assessments were obtained manually, in different ways (cf. Ingredients of a Test Collection section). Automatic generation of test collections has also been done. Seo and Croft (2009) use Apple Discussions³ forums as expertise areas and use the top 10 rated answerers for each forum as experts in the ground truth. Jurczyk and Agichtein (2007) consider the author ranking task, in which authors have to be ranked according to the quality of their contributions. This task is related to expert finding except that authors are not ranked by the quality of contributions on a specific query. They use Yahoo! Answers’⁴ thumbs-up/thumbs-down votes and average number of stars received for best answers as ground truth for the author-ranking task.

³ http://discussions.apple.com
⁴ http://answers.yahoo.com

Topical Profiling Task

The TU expert collection is meant to help assess topical profiling systems in the setting of a multilingual intranet of a knowledge intensive organization. One can answer the question “What topics does an expert know about?” by returning a topical profile of that expert: a record of the types of areas of skills and knowledge of that individual and a level of proficiency in each (Balog et al., 2012). The task consists of the following two steps: (a) identifying possible knowledge areas and (b) assigning a score to each knowledge area (Balog & de Rijke, 2007). In an enterprise search environment, there often exists a list of knowledge areas in which an organization has expertise. In our test collection, this is indeed the case; therefore, we focus on the second step. We assume that a list of knowledge areas $\{a_1, \ldots, a_n\}$ is given and state the problem of assigning a score to each knowledge area (given an expert) as follows: What is the probability of a knowledge area ($a$) being part of the expert’s ($e$) topical profile? We approach this task as one where we have to rank knowledge areas by this probability $P(a \mid e)$.

In the TU expert collection, for this task, systems receive the following ingredients as input:

• A query consisting of an expert ID (i.e., an organization-wide unique identifier for the person)
• A collection consisting of publications, supervised student theses, course descriptions, and research descriptions crawled from the Webwijs system of TU (All documents are either Dutch or English. The language is known for research and course descriptions, and is unknown for publications and student theses.)
• Explicit associations between the expert ID and documents in the corpus
• A thesaurus of knowledge areas (Knowledge areas are available in two languages: Dutch and English. All areas have a Dutch representation; for most of them an English translation is available as well.)

Given this input, the requested system output is a ranked list of knowledge areas from the thesaurus.

We note a small subtlety concerning the language of documents in the collection. In previous work (Balog, 2008), systems were evaluated on the subset of knowledge areas for which both a Dutch and an English translation were available; if an expert had selected a knowledge area without an English translation, for evaluation purposes, this knowledge area would be considered as nonrelevant. In this work, if an expert selects a knowledge area, we consider it as relevant, regardless of whether it has an English translation.

Assessment Experiment

We first describe the models we used to produce the system-generated profiles. Then we describe the assessment interface that experts used to judge these profiles.
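Before the models themselves are introduced, the input/output contract stated in the Topical Profiling Task section (scores per knowledge area in, a ranked list of knowledge areas out, ordered by $P(a \mid e)$) can be illustrated with a short Python sketch. This is only a reading aid under stated assumptions, not the authors' implementation: the function name `rank_knowledge_areas`, the dictionary-based score table, the tie-breaking rule, and the example identifiers are all hypothetical. The cutoff of 100 and the exclusion of zero-probability areas follow the settings the article reports for its systems.

```python
from typing import Dict, List


def rank_knowledge_areas(scores: Dict[str, float], k: int = 100) -> List[str]:
    """Rank knowledge areas for one expert by an estimate of P(a|e).

    `scores` maps a (hypothetical) knowledge-area identifier to the estimated
    probability for a single expert. Areas with a non-positive score are
    dropped, so the result may hold fewer than `k` items.
    """
    positive = [(area, p) for area, p in scores.items() if p > 0]
    # Sort by decreasing score; ties are broken by identifier so the output
    # is deterministic (this tie-break is an assumption of the sketch).
    positive.sort(key=lambda pair: (-pair[1], pair[0]))
    return [area for area, _ in positive[:k]]


# Illustrative scores for one expert; the identifiers are made up.
example_scores = {"information retrieval": 0.5, "statistics": 0.2, "law": 0.0}
profile = rank_knowledge_areas(example_scores)
```

With these made-up scores, `profile` is `['information retrieval', 'statistics']`: the zero-scored area is left out, which is why a returned profile can be shorter than the cutoff.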
Automatically Generating Profiles

The system-generated profiles are created by combining the results of eight state-of-the-art expert-profiling systems in a straightforward way. In this subsection, we describe the eight systems and the combination method, and we list the parameter settings we use in this article. The eight expertise profiling systems that we use differ in three dimensions: First, two different retrieval models are used. Second, systems use either the Dutch or the English translations of the knowledge areas. Third, half of the systems treat knowledge areas as independent of each other, whereas the other half use a thesaurus of knowledge areas to capture the similarity between them. We briefly describe the models here.

The two retrieval models considered below take a generative probabilistic approach and rank knowledge areas $a$ by the probability that they are generated by expert $e$: $P(a \mid e)$. In the first model, called Model 1 in Balog, Azzopardi, and de Rijke (2009), we construct a multinomial language model $\theta_e$ for each expert $e$ over the vocabulary of terms from the documents associated with the expert. We model knowledge areas as bags of words, created from their textual labels (either Dutch or English). It is assumed that knowledge area terms $t$ are sampled independently from this multinomial distribution, with replacement. Then, for Model 1, we have:

$$P(a \mid e) = P(a \mid \theta_e) = \prod_{t \in a} P(t \mid \theta_e)^{n(t,a)} \qquad (1)$$

where $n(t,a)$ is the number of times term $t$ occurs in $a$. In estimating $P(t \mid \theta_e)$, we apply smoothing using collection term probabilities, with unsupervised estimation of smoothing parameters. Specifically, we use Dirichlet smoothing and use the average representation length (i.e., the average number of terms associated with experts) as the smoothing parameter.

In the second model, called Model 2 in Balog, Azzopardi, and de Rijke (2009), we estimate a language model $\theta_d$ for each document associated with an expert. Let this set of documents be $D_e$. We sum the probabilities of each of these documents generating the knowledge area. The terms in $a$ are sampled independently from each document. Then, for Model 2, we have:

$$P(a \mid e) = \sum_{d \in D_e} P(a \mid \theta_d) = \sum_{d \in D_e} \prod_{t \in a} P(t \mid \theta_d)^{n(t,a)} \qquad (2)$$

To estimate $P(t \mid \theta_d)$, we smooth using collection term probabilities as before, estimating smoothing parameters in an unsupervised way. As before, we use Dirichlet smoothing, but here we set the smoothing parameter to the average document length in the collection.

As for the language dimension, recall that knowledge areas come in two languages: $a = \{a_{\text{Dutch}}, a_{\text{English}}\}$. The Dutch retrieval models estimate $P(a_{\text{Dutch}} \mid e)$; the English systems estimate $P(a_{\text{English}} \mid e)$.

Systems that use the thesaurus rely on a similarity metric between a pair of knowledge areas, $\mathrm{sim}(a, a')$. This similarity is taken to be the reciprocal of the length of the shortest path $SP(a, a')$ between $a$ and $a'$ in the thesaurus. If two knowledge areas are not connected, their similarity is set to zero. In addition, we use a parameter $m$ for the maximal length of the shortest path for which we allow knowledge areas to have a nonzero probability. Formally,

$$\mathrm{sim}(a, a') = \begin{cases} 1 / SP(a, a'), & 0 < SP(a, a') \le m \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

We describe the thesaurus graph in detail in the A Thesaurus of Expertise Areas section. Note that we do not distinguish between different types of relation in the graph and use all of them when searching for the shortest path. Next, we use $\mathrm{sim}(a, a')$ to measure the likelihood of seeing knowledge area $a$ given the presence of another knowledge area $a'$:

$$P(a \mid a') = \frac{\mathrm{sim}(a, a')}{\sum_{a''} \mathrm{sim}(a'', a')} \qquad (4)$$

The idea is that a knowledge area is more likely to be included in a person’s expertise profile if the person is knowledgeable on related knowledge areas. This support from other knowledge areas is linearly interpolated with $P(a \mid e)$ using a parameter $\lambda$ to obtain an updated probability estimate $P'(a \mid e)$:

$$P'(a \mid e) = \lambda P(a \mid e) + (1 - \lambda) \left( \sum_{a'} P(a \mid a') P(a' \mid e) \right) \qquad (5)$$

For all systems, once $P(a \mid e)$ has been estimated, we rank knowledge areas according to this probability and return the top 100 knowledge areas for a given user; because we only retrieve knowledge areas $a$ where $P(a \mid e) > 0$, the result list may contain fewer than 100 items.

Merging systems’ outputs. To arrive at the set of judged system-generated knowledge areas, we proceed as follows. We use the eight profiling systems just described (i.e., {Model 1, Model 2} × {Dutch, English} × {with thesaurus, without thesaurus}) to estimate the probabilities of knowledge areas for each expert. Let us denote this as $P_i(a \mid e)$ ($i = \{1, \ldots, 8\}$). These probabilities are then combined linearly to obtain a combined score $P(a \mid e)$:

$$P(a \mid e) = \sum_i \alpha_i P_i(a \mid e) \qquad (6)$$

In addition, the top three knowledge areas retrieved by each profiling system receive an extra boost to ensure that they get judged. This is done by adding a sufficiently large constant $C$ to $P(a \mid e)$.

Parameter settings. For the systems that use the thesaurus, we let $m = 3$ (Equation 3) and $\lambda = .6$ (Equation 5). For the combination algorithm (Equation 6), we let $\alpha_i = 1/8$ for all $i$. Furthermore, we set $C = 10$ and, again, we only retrieve knowledge areas for which $P(a \mid e) > 0$.

Judging the Generated Profiles

The assessment interface used in the assessment experiment is shown in Figure 1. At the top of the page, instructions for the expert are given. In the middle, the expert can indicate the knowledge areas (called “Expertise areas” in the interface) regarded as relevant by ticking them. Immediately below the top 20 knowledge areas listed by default, the expert has the option to view and assess additional knowledge areas. The expert may or may not have examined all (up to 100) retrieved knowledge areas in the generated profile; this information was not recorded. System-generated knowledge areas that were in the original (self-selected) profile of the expert are pushed to the top of the list and are ticked by default in the interface, but the expert may deselect them, thereby judging them as nonrelevant. For the ticked knowledge areas, experts have the option to indicate a level of expertise. If they do not do this, we still include these knowledge areas in the graded self-assessments, with a level of expertise of three (“somewhere in the middle”). At the bottom of the interface, experts can leave any comments they might have on the generated profile.

Research Questions and Methodology

We organize our research questions into two subsections. The first subsection is concerned with the results of the assessment experiment. We study the completeness of the judgments gathered and the quality of the generated profiles; we answer these questions in the Results and Analysis of the Assessment Experiment section. The second subsection deals with the impact of using two sets of ground truth on evaluation outcomes; we answer these research questions in the Self-Selected Versus Judged System-Generated Knowledge Areas: Impact on Evaluation Outcomes section. Next, we briefly motivate each research question and outline the methods used to answer them.

Results and Analysis of the Assessment Experiment

The TU expert collection includes two sets of assessments: self-selected knowledge areas and judged system-generated knowledge areas. Our first research question concerns these two test sets of relevance assessments:

RQ1. Which of the two sets of ground truth is more complete?

[…] mean performance over different groups of experts, for example, PhD students versus professors?
Methods used: We look for correlations by visual inspection. We group experts by their position (job title) and look for significant performance differences using the Welch two-sample t test (Welch, 1947). Because we perform a number of comparisons we use an α value of α = .01 to keep the overall Type I error under control.

RQ3. What are the characteristics of “difficult” knowledge areas?
Methods used: We identify knowledge areas that are often included in experts’ self-selected profiles but are rarely retrieved in the system-generated profiles. In addition, we identify knowledge areas that are often retrieved in the top 10 ranks of system-generated profiles but never judged relevant by experts.

RQ4. What are important aspects in the feedback that experts gave on their system-generated profiles?
Methods used: In a content analysis, performed by two researchers, aspects are identified in a first pass over the data. In a second pass over the data, occurrences of these aspects are counted.

Self-Selected Versus Judged System-Generated Knowledge Areas: Impact on Evaluation Outcomes

Next, we analyze the differences in evaluation outcomes that arise when our two sets of relevance assessments are applied to assess expert-profiling systems. Our main research question is the following:

RQ5. Does using the set of judged system-generated knowledge areas lead to differences in system evaluation outcomes compared with using the self-selected knowledge areas?
When answering these questions, we consider four differences between the two sets of relevance assessments: (a) only a subset of experts has judged the system-generated knowledge areas, (b) self-selected knowledge areas that were not in the set of system-generated knowledge areas are considered nonrelevant in the judged system-generated profiles, (c) experts selected new knowledge areas from the system-generated profile, and (d) experts provided a level of expertise for most judged system-generated knowledge areas. We isolate the effect of each difference by constructing five sets of ground truth (self-selected profiles, judged system-generated profiles, and three intermediate ones), which we will detail later. We consider the effect of each difference on three dimensions; these are handled as separate subquestions.

RQ5a. How do the differences between the set of self-selected knowledge areas and the set of judged system-generated knowledge areas affect absolute system scores?
Methods used: We analyze nDCG@100 performance for each of
thefivesetsofgroundtruth.nDCG@100isametricthatrewards Methodsused:Weconstructthesetofallknowledgeareasthatan both high precision, high recall, and—in the case of graded rel- expertjudgedrelevantatsomepointintime,eitherbyincludingit evanceassessments—correctorderingofrelevantknowledgeareas. in the self-selected profile or by judging it relevant in the self- assessmentinterface.Wethenlookatwhichofthesetsofground RQ5b. How do the differences between the set of self-selected truthcontainsmoreoftheseknowledgeareas. knowledge areas and the set of judged system-generated knowl- Rememberthatthejudgedprofilesweregeneratedbyacombina- edgeareasaffectsystemranking? tionofstate-of-the-artsystems.Ournextthreeresearchquestions Methodsused:Weanalyzedifferencesinrankingwiththefivesets answerthefollowinginformalquestion:“Howwellarewedoing?” ofgroundtruth.FollowingVoorhees(2000),weuseKendalltau. LikeSandersonandSoboroff(2007),weusethefollowingformula: RQ2. Whatarethecharacteristicsof“difficult”experts? Forexample,doesthenumberofrelevantknowledgeareascorre- P−Q latewithperformance?Doesthenumberofdocumentsassociated τ= (7) (P+Q+T)(P+Q+U) with an expert matter? Is there a significant difference between 2030 JOURNALOFTHEAMERICANSOCIETYFORINFORMATIONSCIENCEANDTECHNOLOGY—October2013 DOI:10.1002/asi FIG.1. Screenshotoftheinterfaceforjudgingsystem-generatedknowledgeareas.Atthetop,instructionsfortheexpertaregiven.Inthemiddle,theexpert canselectknowledgeareas.Forselectedknowledgeareas,alevelofexpertisemaybeindicated.Atthebottom,thereisatextfieldforanycommentsthe expertmighthave.[Colorfigurecanbeviewedintheonlineissue,whichisavailableatwileyonlinelibrary.com.] JOURNALOFTHEAMERICANSOCIETYFORINFORMATIONSCIENCEANDTECHNOLOGY—October2013 2031 DOI:10.1002/asi wherePisthenumberofconcordantpairs,Qisthenumber statistics. 
Our first set of ground truth contains 761 of discordant pairs, T is the number of ties in the first list, self-selected profiles of experts who are associated with at andUisthenumberoftiesinthesecondlist.Ifthereisatie leastonedocumentinthecollection.Together,theseexperts in at least one of the lists for a pair, the pair is neither selected a total of 1,662 unique knowledge areas. On correctly nor incorrectly ordered. When there are no ties, average, a self-selected profile contains 6.4 knowledge this formula is equivalent to the original formula as pro- areas. The second set of ground truth contains 239 judged posed by Kendall (1938). We compute Kendall tau for 28 system-generatedprofiles.Theseexpertstogetherselecteda pairsofsystemrankings.WeacceptaprobabilityofTypeI totalof1,266uniqueknowledgeareas.Onaverage,ajudged error a=.01 for each comparison. Then the probability of system-generatedprofilecontains8.6knowledgeareas. at least one Type I error in all comparisons if they would InFigure2,thelefttwohistogramsshowthedistribution be independent equals 1-(1-0.01)28=0.25. For eight ofexpertsovertheirnumberofrelevantknowledgeareasfor systems,Kendalltauhastobegreaterthanorequalto.79to the self-selected profiles (top) and for the judged system- rejectthenullhypothesis.Wedothisanalysisforfourstan- generatedprofiles(bottom).Thelatterdistributionisshifted dardinformationretrievalevaluationmetrics:meanaverage to the right.The histograms on the right show the distribu- precision(MAP),meanreciprocalrank(MRR),normalized tionofknowledgeareasovertheprofilesthatincludethem; discountedcumulativegaincalculatedatdepth10(nDCG@ the top right represents the self-selected profiles and the 10),[email protected],trec_eval bottomrightthejudgedsystem-generatedprofiles.Thelatter was used for evaluation; for implementing nDCG, we fol- histogramismoreskewedtotheleft;halfoftheknowledge lowedClarkeetal.(2008).Wetookallexpertsforthegiven areashavebeenjudgedrelevantbyasingleexpertonly. 
test set into account during evaluation, even if systems did As an aside, we now check for how many of the graded notretrieveanyknowledgeareasforthem(theseexpertsget judged system-generated knowledge areas we assigned our zeroscoreonallevaluationmetrics). “somewhere in the middle” value of three, because the expert judged the knowledge area relevant without indicat- RQ5c. How do the differences between the set of self-selected ingalevelofexpertise.Onaverage,thisoccurredfor0.6of knowledge areas and the set of judged system-generated knowl- the 8.8 knowledge areas in each expert’s profile. We con- edgeareasaffecttheaveragenumberofsystemsasystemdiffers significantlyfrom? cludethattheeffectofthisisnegligible. Methods used: We compare the five sets of ground truth on the Now,toquantifythecompletenessofeachsetofground basis of the number of significant differences in MAP, nDCG@ truthinasinglenumber,weproceedasfollows.Lettheset 100, MRR, and nDCG@10 that they detect between pairs of ofallrelevantknowledgeareasassociatedwithanexpertbe systems.Apairofsystemsdifferssignificantlyiftheirdifferenceis theunionoftheself-selectedprofileandthejudgedsystem- expectedtogeneralizetounseenqueries.WeuseFisherpairwise generatedprofile.Thensubtracttheknowledgeareasthatthe randomization test, following Smucker etal. (2007), and set expert deselected during the assessment interface (on a=.001. We repeat this test for five sets of ground truth, four average,expertsremoved2%oftheknowledgeareasorigi- evaluation metrics (except that we have no MAPor MRR scores nallyincludedintheirself-selectedprofiles).Wedividethe for the graded relevance assessments), and all possible resultinglistofknowledgeareasintothreecategories: 1 ( ⋅8[8−1]=28) pairs of systems: a total of 504 comparisons. 
Only found by systems: These knowledge areas were 2 Assuming that all of these comparisons are independent, this notintheself-selectedprofile,buttheywereinthesystem- meansacceptingaTypeIerrorof1-(1-0.001)504=0.40.Itisno generatedprofileandwerejudgedrelevantbytheexperts. problem for the interpretation of our results if there are a few Onlyfoundbyexperts:Theseknowledgeareaswerein spurious rejections of the null hypothesis; we mean to give an the self-selected profile, but not in the system-generated indicationofthesensitivityofeachsetofgroundtruth,thatis,the profile. averagenumberofsystemsthatasystemdifferssignificantlyfrom. Foundbyboth:Theseknowledgeareaswereinboththe self-selectedandsystem-generatedprofiles,andtheexperts Results andAnalysis of the didnotdeselectthemduringtheassessmentexperiment. Assessment Experiment Table1 lists the percentage of relevant knowledge areas Inthissection,wereportontheresultsoftheassessment that fall into each category, per profile, averaged over pro- experimentdefinedinTheAssessmentExperimentsection. files. To answer RQ1, we find that the judged system- We start with an examination of the completeness of the generatedprofilesaremorecomplete.Onaverage,ajudged main tangible outcome of this experiment, the so-called system-generated profile contains 81% (46%+35%; see judgedsystem-generatedknowledgeareas.Thenweanalyze Table1), whereas a self-selected profile contains only 65% thequalityofthegeneratedprofiles. (46%+19%; see Table1) of all relevant knowledge areas. Thisleadstothefollowingrecommendation:Becausethe judged system-generated profiles are more complete, we CompletenessoftheTwoSetsofGroundTruthfor expect this set of ground truth to give a more accurate ExpertProfiling pictureofsystemperformance,eveniffewerassessedexpert Toanswerthequestionhowcompleteeachsetofground profilesareavailable.Weelaborateonthiswhenweanswer truth is (RQ1), we start out with some basic descriptive RQ5laterinthisarticle. 
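As an illustration of the completeness computation for RQ1, the following is a minimal sketch for a single expert. It assumes that each profile is available as a set of knowledge-area labels; the function name, argument names, and the toy labels are hypothetical and not taken from the TU expert collection. Averaging the returned fractions over all experts would correspond to the per-profile percentages reported in Table 1.

```python
def completeness_categories(self_selected, system_generated, judged_relevant):
    """Split one expert's relevant knowledge areas into three categories.

    self_selected:    areas in the expert's original self-selected profile
    system_generated: areas shown to the expert in the generated profile
    judged_relevant:  shown areas the expert left ticked (judged relevant)
    All three arguments are hypothetical inputs, given as iterables of labels.
    """
    self_selected = set(self_selected)
    system_generated = set(system_generated)
    # Only areas that were actually shown can have been judged.
    judged_relevant = set(judged_relevant) & system_generated

    # Self-selected areas that were shown but unticked count as deselected.
    deselected = (self_selected & system_generated) - judged_relevant
    # Union of both profiles, minus the deselected areas.
    relevant = (self_selected | judged_relevant) - deselected

    categories = {
        "only_systems": judged_relevant - self_selected,
        "only_experts": self_selected - system_generated,
        "both": self_selected & judged_relevant,
    }
    total = len(relevant)
    # Fraction of all relevant areas per category (zeros for an empty profile).
    return {name: (len(areas) / total if total else 0.0)
            for name, areas in categories.items()}


fractions = completeness_categories(
    self_selected={"information retrieval", "machine learning",
                   "statistics", "linguistics"},
    system_generated={"information retrieval", "machine learning",
                      "statistics", "text mining", "data mining", "databases"},
    judged_relevant={"information retrieval", "machine learning",
                     "text mining", "data mining"},
)
# Completeness of each ground truth for this expert.
judged_completeness = fractions["both"] + fractions["only_systems"]
self_selected_completeness = fractions["both"] + fractions["only_experts"]
```

In this toy case the expert deselects "statistics", so five relevant areas remain; the judged system-generated profile covers four of them and the self-selected profile covers three.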

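The Kendall tau variant of Equation 7 and the Type I error arithmetic used for RQ5b and RQ5c can be sketched as follows. This is an illustrative sketch, not the evaluation code used in the study. It assumes that each system ranking is given as a mapping from a system name to its score under one set of ground truth, and that a pair tied in both lists contributes to neither T nor U (the usual tau-b convention); the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau(scores_a, scores_b):
    """Kendall tau with ties: (P - Q) / sqrt((P + Q + T) * (P + Q + U)).

    scores_a, scores_b: dicts mapping the same system names to scores under
    two sets of ground truth (hypothetical input format).
    """
    p = q = t = u = 0
    for s1, s2 in combinations(sorted(scores_a), 2):
        diff_a = scores_a[s1] - scores_a[s2]
        diff_b = scores_b[s1] - scores_b[s2]
        if diff_a == 0 and diff_b == 0:
            continue  # assumption: tied in both lists, counted nowhere
        if diff_a == 0:
            t += 1    # tie in the first list only
        elif diff_b == 0:
            u += 1    # tie in the second list only
        elif (diff_a > 0) == (diff_b > 0):
            p += 1    # concordant pair
        else:
            q += 1    # discordant pair
    denominator = sqrt((p + q + t) * (p + q + u))
    return (p - q) / denominator if denominator else float("nan")


def familywise_error(alpha, n_comparisons):
    """Probability of at least one Type I error over independent comparisons."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


# 8 systems give 28 pairs; 5 ground-truth sets x 4 metrics, minus MAP and MRR
# for the graded assessments, leave 18 combinations of ground truth and metric.
n_pairs = 8 * (8 - 1) // 2
n_comparisons = (5 * 4 - 2) * n_pairs
error_rankings = familywise_error(0.01, 28)
error_significance = familywise_error(0.001, n_comparisons)
```

With no ties, T and U are zero and the expression reduces to (P - Q) / (P + Q), which is why the formula agrees with the original definition by Kendall in that case.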

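The merging step of Equation 6, together with the boost for each system's top three knowledge areas, might be sketched as below. The input format (one dictionary of scores per profiling system), the function name, and the choice to add the constant C once per knowledge area, however many systems rank it in their top three, are assumptions made for illustration; the text does not specify these details.

```python
def merge_profiles(system_scores, boost=10.0, top_k=3, depth=100):
    """Combine per-system scores P_i(a|e) into one ranked profile for an expert.

    system_scores: list of dicts, one per profiling system, mapping a
    knowledge area to its estimated probability (hypothetical format).
    """
    alpha = 1.0 / len(system_scores)  # uniform weights, 1/8 for eight systems
    combined = {}
    for scores in system_scores:
        for area, prob in scores.items():
            combined[area] = combined.get(area, 0.0) + alpha * prob

    # Collect the top-k retrieved areas of every system so they get judged.
    boosted = set()
    for scores in system_scores:
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        boosted.update(area for area in top if scores[area] > 0)
    for area in boosted:
        combined[area] += boost  # assumption: boost applied once per area

    # Keep only areas with a positive score, best first, at most `depth` items.
    retrieved = [area for area, score in combined.items() if score > 0]
    retrieved.sort(key=lambda area: (-combined[area], area))
    return retrieved[:depth]


merged = merge_profiles([
    {"a": 0.5, "b": 0.3, "c": 0.1, "f": 0.04},
    {"d": 0.9, "a": 0.1, "e": 0.0},
])
```

In this toy run, areas "d", "a", "b", and "c" are boosted because some system ranks them in its top three, "f" is kept without a boost, and "e" is dropped because its combined score is zero.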