Learning to Effectively Select Topics For Information Retrieval Test Collections

Mucahid Kutlu¹, Tamer Elsayed¹, and Matthew Lease²

¹Dept. of Computer Science and Engineering, Qatar University, Qatar
²School of Information, University of Texas at Austin, USA

March 1, 2017

Abstract

While test collections are commonly employed to evaluate the effectiveness of information retrieval (IR) systems, constructing these collections has become increasingly expensive in recent years as collection sizes have grown ever-larger. To address this, we propose a new learning-to-rank topic selection method which reduces the number of search topics needed for reliable evaluation of IR systems. As part of this work, we also revisit the deep vs. shallow judging debate: whether it is better to collect many relevance judgments for a few topics or a few judgments for many topics. We consider a number of factors impacting this trade-off: how topics are selected, topic familiarity to judges, and how topic generation cost may impact both budget utilization and the resultant quality of judgments. Experiments on NIST TREC Robust 2003 and Robust 2004 test collections show not only our method's ability to reliably evaluate IR systems using fewer topics, but also that when topics are intelligently selected, deep judging is often more cost-effective than shallow judging in achieving the same level of evaluation reliability. Topic familiarity and construction costs are also seen to notably impact the evaluation cost vs. reliability tradeoff and provide further evidence supporting deep judging in practice.

1 Introduction

Test collections provide the cornerstone for system-based evaluation of information retrieval (IR) algorithms in the Cranfield paradigm [17]. Such test collections are vital to ensuring continuing field progress by enabling A/B testing of new search algorithms and community benchmarking. However, realistic evaluation requires testing IR systems at the scale of information that is to be searched in practice, and this has become increasingly problematic as today's collection sizes have grown ever-larger. In particular, because larger collections tend to contain more relevant documents (which retrieval systems might return in response to a given search query), human assessors must judge the relevance of ever-more documents for each search topic. Moreover, if insufficient documents are judged, evaluation findings could be compromised [10]. Consequently, a key challenge in IR today is to devise new evaluation methods to reduce the cost of constructing test collections while continuing to ensure the validity of findings.

One direction of research to lower costs has sought to reduce the number of topics for which human relevance judging is needed at all. NIST TREC test collections have traditionally used 50 search topics, with Buckley and Voorhees [11] reporting that at least 25 topics are needed for stable evaluation, with 50 being better. However, Zobel [60] showed that one set of 25 topics predicted relative performance of systems fairly well on a different set of 25 topics. Guiver, Mizzaro, and Robertson [22] conducted a systematic study showing that evaluating IR systems using the "right" subset of topics yields very similar results vs. evaluating systems over all topics. However, they did not propose a method to find such an effective topic subset in practice. NIST employs a simple and effective but costly topic selection process which includes initial judgments for each topic and manual selection from candidate topics [54]. Hosseini et al. [27] recently proposed an iterative method to automatically find effective topic subsets to lower evaluation costs while preserving evaluation reliability.
For a given evaluation budget, another related question is whether it is better to collect many relevance judgments for a few topics, i.e., Narrow and Deep (NaD) judging, or a few relevance judgments for many topics, i.e., Wide and Shallow (WaS) judging. Intuitively, since people search for a wide variety of topics with a wide variety of queries, it seems that realistic IR evaluation ought to evaluate systems across a similarly wide variety of search topics and queries. Empirically, large variance in search accuracy is often observed for the same system across different topics [7], suggesting merit in sampling many diverse topics for evaluation in order to achieve stable evaluation among systems being compared. Studies on this issue have led to a fairly consistent finding that WaS judging tends to provide more stable evaluation for the same human effort vs. NaD judging [48, 16, 9]. However, this finding does not hold in all cases. For example, Carterette et al. [14] achieve the same reliability using 250 topics with 20 judgments per topic (5000 judgments in total) as 600 topics with 10 judgments per topic (6000 judgments in total). Moreover, if we account for the time required to create each topic, the relative expense of using more topics is further increased.

In this work, we investigate how to automatically select informative search topics for IR evaluation [22, 27]. We begin with our first research question:

• RQ-1: Intelligent Topic Selection. How can we minimize the number of topics (hence the cost) required for reliable evaluation of IR systems? Toward this end, we propose a novel learning-to-rank (L2R) approach in which topics are selected iteratively via a greedy method which optimizes accurate ranking of systems (Section 4.3). Our L2R model utilizes 63 features representing the interaction between topics and ranking of systems (Section 4.3.1). In order to train our model, we describe a method to automatically generate useful training data from existing test collections (Section 4.3.2). We evaluate our approach on NIST TREC Robust 2003 [53] and Robust 2004 [52] test collections, comparing our approach to recent prior work [27] and random topic selection (Section 5). Results show consistent improvement over baselines, with greater relative improvement as fewer topics are used.

As mentioned above, we also revisit the deep vs. shallow judging debate, investigating how intelligent selection of topics impacts optimization of evaluation budget. In relation to past studies, our work is distinguished by investigating a wider range of factors impacting this tradeoff than previously considered:

• RQ-2: NaD vs. WaS. Prior studies of NaD vs. WaS judging have assumed random selection of topics. How does intelligent selection of topics (by itself) impact evaluation budget optimization in NaD vs. WaS judging? We find that NaD judging often achieves greater evaluation reliability for the same budget than WaS judging when topics are selected intelligently.

• RQ-3: Judging Speed. Past comparisons between NaD vs. WaS judging have typically assumed constant judging speed [16, 48]. However, data reported by Carterette et al. [15] suggests that assessors may judge documents faster as they judge more documents for the same topic (likely due to increased topic familiarity). Assuming WaS judging leads to slower judging speed than NaD judging, how does this impact optimization of evaluation budget? Because we can collect more judgments in the same amount of time with NaD vs. WaS judging, NaD judging achieves greater relative evaluation reliability than shown in prior studies which did not consider this judging speed benefit of deep judging.

• RQ-4: Topic Generation Time. Prior NaD vs. WaS studies have typically ignored non-judging costs involved in test collection construction.
While Carterette et al. [14] consider topic generation time, the 5-minute time they assumed is roughly two orders of magnitude less than the 4 hours NIST has traditionally taken to construct each topic [1]. How do historically large topic generation costs impact optimization of evaluation budget in the NaD vs. WaS tradeoff? We find that WaS judging is preferable to NaD judging for short topic generation times (specifically ≤ 5 minutes in our experiments). However, as the topic generation cost further increases, NaD judging becomes increasingly preferable over WaS judging in optimizing evaluation budget.

• RQ-5: Judging Error. Several studies have reported calibration effects impacting the decisions and consistency of relevance assessors [47, 49]. While NIST has traditionally included an initial "burn-in" judging period as part of topic generation and formulation [1], we posit that drastically reducing topic generation time (e.g., from 4 hours [1] to 2 minutes [15]) could negatively impact topic quality, leading to less well-defined topics and/or less calibrated judges, and thereby less reliable judgments. As suggestive evidence, McDonnell et al. [33] report high judging agreement in reproducing a "standard" NIST track, but high and inexplicable judging disagreement on TREC's Million Query track [15], which lacked any burn-in period for judges and had far shorter topic generation times. Assuming short topic generation times reduce the consistency of relevance judgments, how does the resultant judging error impact evaluation budget optimization in balancing deep vs. shallow judging? To investigate this, we simulate increased judging error as a function of lower topic generation times. We find that it is better to invest a portion of our evaluation budget to increase the quality of topics, instead of collecting more judgments for low-quality topics. This also makes NaD judging preferable over WaS judging in many cases, due to increased topic generation cost.

The remainder of this article is organized as follows. Section 2 reviews related work on topic selection and topic set design. Section 3 formally defines the topic selection problem. In Section 4, we explain our proposed L2R-based approach in detail. Section 5 presents our experimental evaluation. Finally, Section 6 summarizes the contributions of our work and suggests potential future directions.

2 Related Work

Constructing test collections is expensive in the human effort required. Therefore, researchers have proposed a variety of methods to reduce the cost of creating test collections. Proposed methods include: developing new evaluation measures and statistical methods for the case of incomplete judgments [5, 12, 41, 59, 58], finding the best sample of documents to be judged for each topic [18, 13, 29, 36, 39], inferring relevance judgments [6], topic selection [27, 28, 35, 22], evaluation with no human judgments [38, 50], crowdsourcing [4, 21], and others. The reader is referred to [37] and [46] for a more detailed review of prior work on methods for low-cost IR evaluation.

2.1 Topic Selection

To the best of our knowledge, Mizzaro and Robertson [35]'s study was the first seeking to select the best subset of topics for evaluation. They first build a system-topic graph representing the relationship between topics and IR systems, then run the HITS algorithm on it. They hypothesize that topics with higher hubness scores would be better at distinguishing between systems. However, Robertson [40] experimentally shows that their hypothesis is not true.

Guiver, Mizzaro, and Robertson [22] experimentally show that if we choose the right subset of topics, we can achieve a ranking of systems that is very similar to the ranking when we employ all topics. However, they do not provide a solution to find the right subset of topics.
This study has motivated other researchers to investigate this problem. Berto, Mizzaro, and Robertson [8] stress generality and show that a carefully selected good subset of topics used to evaluate one set of systems can also be adequate to evaluate a different set of systems. Hauff et al. [23] report that using the easiest topics based on a Jensen-Shannon Divergence approach does not work well to reduce the number of topics. Hosseini et al. [28] focus on selecting the subset of topics to extend an existing collection in order to increase its re-usability. Kazai and Sung [30] reduce the cost of preference-based IR evaluation.

The closest study to our own is [27], which employs an adaptive algorithm for topic selection. It selects the first topic randomly. Once a topic is selected, the relevance judgments are acquired and used to assist with the selection of subsequent topics. Specifically, in the following iterations, the topic that is predicted to maximize the current Pearson correlation is selected. In order to do that, they predict relevance probabilities of qrels for the remaining topics using a Support Vector Machine (SVM) model trained on the judgments from the topics selected thus far. Training data is extended at each iteration by adding the relevance judgments from each topic as it is selected in order to better select the next topic.

Further studies investigate topic selection for other purposes, such as creating low-cost datasets for training learning-to-rank algorithms [34], system rank estimation [24], and selecting training data to improve supervised data fusion algorithms [32]. These studies do not consider topic selection for low-cost evaluation of IR systems.

2.2 How Many Topics Are Needed?

Past work has investigated the ideal size of test collections and how many topics are needed for a reliable evaluation. While traditional TREC test collections employ 50 topics, a number of researchers claim 50 topics are not sufficient for a reliable evaluation [29, 55, 51]. Many researchers report that wide and shallow judging is preferable to narrow and deep judging [48, 16, 9], without specifying the right balance between pool depth and the number of topics. Carterette et al. [14] experimentally compare deep vs. shallow judging in terms of budget utilization. They find that 20 judgments with 250 topics is the most cost-effective setting in their experiments. Urbano, Marrero, and Martín [51] measure the reliability of TREC test collections with regard to generalization and conclude that the number of topics needed for reliable evaluation varies across different tasks.

In order to calculate the number of topics required, Webber, Moffat, and Zobel [57] propose adding topics iteratively until the desired statistical power is reached, while Sakai proposes methods based on two-way ANOVA [45], confidence intervals [43], and the t-test and one-way ANOVA [42]. In another work, Sakai also compares his previous methods and provides guidelines for test collection design with a given fixed budget [44]. While these works focus on calculating the number of topics required, our work focuses on how to select the best topic set of a given size in order to maximize the reliability of evaluation. We also introduce further considerations impacting the debate over shallow vs. deep judging: familiarization of assessors with topics, and the effect of topic generation costs on budget utilization and the quality of judgments for each topic.

2.3 Topic Familiarity vs. Judging Speed

Carterette et al. [15] report that as the number of judgments per topic increases (when collecting 8, 16, 32, 64 or 128 judgments per topic), the median time to judge each document decreases respectively: 15, 13, 15, 11 and 9 seconds.
This suggests that assessors become more familiar with the topic as they judge more documents for it, and this greater familiarity yields greater judging speed. However, prior work comparing deep vs. shallow judging did not consider this, instead assuming that judging speed is constant regardless of judging depth. Consequently, our experiments in Section 5.4 revisit this question, considering how faster judging with greater judging depth per topic may impact the tradeoff between deep vs. shallow judging in maximizing evaluation reliability for a given budget in human assessor time.

2.4 Topic Generation Cost vs. Judging Consistency

Past work has utilized a variety of different processes to generate search topics when constructing test collections. These different processes explicitly or implicitly enact potentially important tradeoffs between human effort (i.e., cost) vs. quality of the resultant topics generated by each process. For example, NIST has employed a relatively costly process in order to ensure creation of very high quality topics [1]:

    For the traditional ad hoc tasks, assessors generally came to NIST with some rough ideas for topics having been told the target document collection. For each idea, they would create a query and judge about 100 documents (unless at least 20 of the first 25 were relevant, in which case they would stop at 25 and discard the idea). From the set of candidate topics across all assessors, NIST would select the final test set of 50 based on load-balancing across assessors, number of relevant found, eliminating duplication of subject matter or topic types, etc. The judging was an intrinsic part of the topic development routine because we needed to know that the topic had sufficiently many (but not too many) relevant in the target document set. (These judgments made during the topic development phase were then discarded. Qrels were created based only on the judgments made during the official judgment phase on pooled participant results.) We used a heuristic that expected one out of three original ideas would eventually make it as a test set topic. Creating a set of 50 topics for a newswire ad hoc collection was budgeted at about 175-225 assessor hours, which works out to about 4 hours per final topic.

In contrast, the TREC Million Query (MQ) Track used a rather different procedure to generate topics. In the 2007 MQ Track [3], 10000 queries were sampled from a large search engine query log. The assessment system showed 10 randomly selected queries to each assessor, who then selected one and converted it into a standard TREC topic by back-fitting a topic description and narrative to the selected query. Carterette et al. [14] report that the median time for generating a topic was roughly 5 minutes. In the 2008 MQ Track [2], assessors could refresh the list of 10 candidate queries if they did not want to judge any of the candidates listed. Carterette et al. [15] report that the median time for viewing a list of queries is 22 seconds and for back-fitting a topic description is 76 seconds. On average, each assessor viewed 2.4 lists to generate each topic. Therefore, the cost of generating a topic is roughly 2.4 × 22 + 76 ≈ 129 seconds, or 2.1 minutes.

The examples above show a vast range of topic creation times: from 4 hours to 2 minutes per topic. Therefore, in Section 5.4, we investigate deep vs. shallow judging when the cost of generating topics is also considered.

In addition to considering topic construction time, we might also consider whether aggressive reduction in topic creation time might have other unintended, negative impacts on topic quality. For example, Scholer et al. [49] report that calibration effects change judging decisions as assessors familiarize themselves with a topic. Presumably NIST's 4-hour topic creation process provided judges ample time to familiarize themselves with a topic, and as noted above, judgments made during the topic development phase were then discarded.
In contrast, it seems MQ track assessors began judging almost immediately after selecting a query for which to back-fit a topic, with no initial topic formation period for establishing the topic and no discarding of initial judgments made during this time. Further empirical evidence suggesting quality concerns with MQ track judgments was also recently reported by McDonnell et al. [33], who describe a detailed judging process they employed to reproduce NIST judgments. While the authors report high agreement between their own judging and crowd judging vs. NIST on the 2009 Web Track, for NIST judgments from the 2009 MQ track, the authors and crowd judges were both consistent while disagreeing often with NIST judges. The authors also reported that even after detailed analysis of the cases of disagreement, they could not find a rationale for the observed MQ track judgments. Taken in sum, these findings suggest that aggressively reducing topic creation time may negatively impact the quality of judgments collected for that topic. For example, while an assessor is still formulating and clarifying a topic for himself/herself, any judgments made at this early stage of topic evolution may not be self-consistent with judgments made once the topic is further crystallized. Consequently, in Section 5.4 we revisit the question of deep judging of few topics vs. shallow judging of many topics, assuming that low topic creation times may also mean less consistent judging.

3 Problem Definition

In this section, we define our topic selection problem. We assume that we have a TREC-like setup: a data collection has already been acquired, and a large pool of topics and ranked lists of IR systems for each topic are also available. Our goal is to select a certain number of topics from the topic pool such that evaluation with those selected topics yields the ranking of the IR systems most similar to the "ground-truth" ranking. We assume that the ground-truth ranking of the IR systems is the one obtained when we use all topics in the pool for evaluation.

We can formulate this problem as follows. Let T = {t_1, t_2, ..., t_N} denote the pool of N topics, S = {s_1, s_2, ..., s_K} denote the set of K IR systems to be evaluated, and R_<S,T,e> denote the ranking of systems in S when they are evaluated based on evaluation measure e over the topic set T (notation used in equations and algorithms is shown in Table 1). We aim to select a subset P ⊂ T of M topics that maximizes the correlation (as a measure of similarity between two ranked lists) between the ranking of systems over P (i.e., considering only M topics and their corresponding relevance judgments) and the ground-truth ranking of systems (over T). The mathematical definition of our goal is as follows:

    $\max_{P \subset T,\, |P| = M} \mathrm{corr}\big(R_{\langle S,P,e \rangle},\, R_{\langle S,T,e \rangle}\big)$    (1)

where corr is a ranking similarity function (e.g., Kendall-τ [31]).

4 Proposed Approach

The problem we are tackling is challenging since we do not know the actual performance of systems (i.e., their performance when all topics are employed for evaluation) and we would like to find a subset of topics that achieves a similar ranking to the unknown ground-truth.

Table 1: Notation used in equations and algorithms

    Symbol        Description
    T             Topic pool
    S             IR systems that participated in the pool of the corresponding test collection
    R_<S,T,e>     Ranking of systems in S when they are evaluated based on evaluation measure e over the topic set T
    N             Size of the topic pool
    M             Number of topics to be selected
    D_{t_c}       The document pool for topic t_c
    L_{s_j}(t_c)  The ranked list resulting from system s_j for the topic t_c

To demonstrate the complexity of the problem, let us assume that we obtain the judgments for all topic-document pairs (i.e., we know the ground-truth ranking). In this case, we have to check $\binom{N}{M}$ possible subsets in order to find the optimal one (i.e., the one that produces a ranking that has the maximum correlation with the ground-truth ranking).
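To make the objective in Eq. (1) concrete before discussing its cost, here is a minimal illustrative sketch (not the authors' implementation). It assumes a precomputed matrix of per-topic effectiveness scores (e.g., average precision) for every system, and uses SciPy's Kendall's τ; since τ depends only on ordering, correlating mean scores is equivalent to correlating the two system rankings.

```python
import numpy as np
from scipy.stats import kendalltau


def subset_objective(scores, subset):
    """Kendall's tau between the system ranking induced by a topic subset
    and the ranking induced by the full topic pool (the quantity in Eq. 1).

    scores: (num_systems x num_topics) matrix of per-topic effectiveness
            values, e.g. average precision.
    subset: list of topic indices (the candidate subset P).
    """
    subset_means = scores[:, subset].mean(axis=1)   # evaluation over P
    full_means = scores.mean(axis=1)                # ground-truth over T
    tau, _ = kendalltau(subset_means, full_means)
    return tau
```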
For example, if N = 100 and M = 50, we would need to check around 10^29 subsets of topics. Since this is computationally intractable, we need an approximation algorithm to solve this problem. Therefore, we first introduce a greedy approach to select the best subset of topics when we have the judgments for all query-document pairs (Section 4.1). Subsequently, we discuss how we can employ this greedy approach in the absence of relevance judgments (Section 4.2) and then introduce our L2R-based topic selection approach (Section 4.3).

4.1 Greedy Approach

We first explore an oracle greedy approach that selects topics in an iterative way when relevance judgments are already obtained. Instead of examining all possibilities, at each iteration we select the "best" topic (among the currently non-selected ones) that, when added to the currently-selected subset of topics, will produce the ranking that has the maximum correlation with the ground-truth ranking of systems.

Algorithm 1 illustrates this oracle greedy approach. First, we initialize the set of selected topics (P) and the set of candidate topics to be selected (P̄) (Line 1). For each candidate topic t in P̄, we rank the systems over the selected topics P in addition to t (R_<S,P∪{t},e>), and calculate the Kendall's τ achieved with this ranking (Lines 3-4). We then pick the topic achieving the highest Kendall-τ score among the candidates (Line 5) and update P and P̄ accordingly (Lines 6-7). We repeat this process until we reach the targeted subset size M (Lines 2-7).

Algorithm 1: An Oracle Greedy Approach for Topic Selection
    1: P ← ∅; P̄ ← T                                   ▷ selected set P is initially empty
    2: while |P| < M do
    3:     for each topic t ∈ P̄ do
    4:         τ_t ← corr(R_<S,P∪{t},e>, R_<S,T,e>)
    5:     t* ← argmax_{t ∈ P̄} τ_t                     ▷ choose t* yielding the best correlation
    6:     P ← P ∪ {t*}                                ▷ add t* to the selected set P
    7:     P̄ ← P̄ − {t*}

While this approach has O(M × N) complexity (which is clearly much more efficient than selecting the optimal subset exhaustively), it is also impractical because it leverages the real judgments (which we typically do not have in advance) in order to calculate the ground-truth ranking and thereby the Kendall-τ scores.

4.2 Performance Prediction Approach

One possible way to avoid the need for the actual relevance judgments is to predict the performance of IR systems using automatic evaluation methods (e.g., [50, 38]) and then rank the systems based on their predicted performance. However, relying on predicted performance to represent the ground-truth is problematic simply because inaccurate predictions result in a noisy predicted ground-truth, which causes the selection of topics that do not actually produce maximum correlation with the real ground-truth ranking.

One noteworthy study using this approach is Hosseini et al. [27]'s work, which predicts the relevance probability of document-topic pairs by employing an SVM classifier and selects topics in a greedy way similar to the approach in Algorithm 1. We use their selection approach as a baseline in our experiments presented in Section 5.

4.3 Proposed Learning-to-Rank Approach

In this work, we formulate the topic selection problem as a learning-to-rank (L2R) problem. In a typical L2R problem, we are given a query q and a set of documents D, and a model is learned to rank those documents in terms of relevance with respect to q. The model is trained using a set of queries and their corresponding labeled documents. In our context, we are given the set of currently-selected topics P (analogous to the query q) and the set of candidate topics P̄ to be selected from (analogous to the documents D), and we aim to train a model to rank the topics in P̄ based on the expected effectiveness of adding each to P.
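Both the oracle procedure of Algorithm 1 and the L2R-based procedure described next instantiate the same greedy loop; they differ only in how candidate topics are scored at each iteration. The following is an illustrative sketch of that shared skeleton (not the authors' code); the scoring callback is a placeholder standing in for the oracle Kendall-τ computation of Algorithm 1 or for the learned ranker introduced below.

```python
def greedy_topic_selection(topic_pool, M, score_candidates):
    """Greedy skeleton shared by Algorithm 1 and the L2R variant.

    score_candidates(P, candidates) -> {topic: score}. In the oracle
    setting the score is corr(R_<S,P∪{t},e>, R_<S,T,e>); in the L2R
    setting it is the trained model's score for t given P.
    """
    P, candidates = [], list(topic_pool)
    while len(P) < M and candidates:
        scores = score_candidates(P, candidates)
        best = max(candidates, key=scores.get)  # topic at the first rank
        P.append(best)
        candidates.remove(best)
    return P


# Oracle instantiation (Algorithm 1), reusing subset_objective from the
# earlier sketch; this variant requires full relevance judgments:
# oracle = lambda P, cands: {t: subset_objective(scores, P + [t]) for t in cands}
```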
The training samples used to train the model are tuples of the form (P, t, corr(R_<S,P∪{t},e>, R_<S,T,e>)), where the measured correlation is considered the label of the topic t with respect to P. Notice that, in the training data, the correlation is computed using the true relevance judgments. This enables us to use the wealth of existing test collections to acquire data for training our model, as explained in Section 4.3.2.

We apply this L2R problem formulation to the topic selection problem using our greedy approach. Instead of picking the topic that leads to the best correlation with the ground-truth among all candidate topics, we use the trained L2R model to rank the candidate topics and then pick the one at the first rank. The algorithm is illustrated in Algorithm 2. At each iteration, a feature vector v_t is computed for each candidate topic t in P̄ using a feature extraction function f (Lines 3-4), detailed in Section 4.3.1. The candidate topics are then ranked using our learned model (Line 5) and the topic at the first rank is picked (Line 6). Finally, the topic sets P and P̄ are updated (Lines 7-8) and a new iteration is started, if necessary.

Algorithm 2: L2R-based Topic Selection
    1: P ← ∅; P̄ ← T                                   ▷ selected set P is initially empty
    2: while |P| < M do
    3:     for each topic t ∈ P̄ do
    4:         v_t ← f(t, P, P̄)                        ▷ compute feature vector for t
    5:     l ← L2R({(t, v_t) | ∀t ∈ P̄})                ▷ apply L2R model on feature vectors
    6:     t* ← first(l)                               ▷ choose t* at the first rank
    7:     P ← P ∪ {t*}                                ▷ add t* to the selected set P
    8:     P̄ ← P̄ − {t*}                                ▷ remove t* from the candidate set P̄

4.3.1 Features

In this section, we describe the features we extract in our L2R approach for each candidate topic. Hosseini et al. [27] mathematically show that, in the greedy approach, the topic selected at each iteration should be different from the already-selected ones (i.e., topics in P) while being representative of the non-selected ones (i.e., topics in P̄). Therefore, the extracted set of features should cover the candidate topic as well as the two sets P and P̄. Our main goal is to capture the interaction between the topics and the IR systems, in addition to the diversity between the IR systems in terms of their retrieval results.

We define two types of feature sets: the first (called topic-based) is extracted from an individual topic, and the second (called set-based) is extracted from a set of topics by aggregating the topic-based features extracted from each of those topics.

The topic-based features include 7 features that are extracted for a given candidate topic t_c and listed in Table 2. For a given set of topics (e.g., the currently-selected topics P), we extract the set-based features by computing both the average and the standard deviation of each of the 7 topic-based features extracted from all topics in the set. This gives us 14 set-based features that can be extracted for a set of topics. We compute those 14 features for each of the following sets of topics:

• currently-selected topics (P)
• not-yet-selected topics (P̄)
• selected topics with the candidate topic (P ∪ {t_c})
• not-selected topics excluding the candidate topic (P̄ − {t_c})

P̄ − {t_c} and P ∪ {t_c} are used to represent, more clearly, how much the selected topics and not-selected topics will be affected in case we select the candidate topic.

Eventually, we have 63 features for each data record representing a candidate topic: 14 × 4 = 56 features for the above groups + 7 topic-based features. Below we describe the seven topic-based features that are the core of the entire feature set.
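Before detailing the individual topic-based features (Table 2 below), the assembly of the full 63-dimensional vector can be sketched as follows. This is illustrative only: `topic_features(t)` is a hypothetical helper that returns the seven topic-based values described next, and handling of the empty set P in the first iteration is left out for brevity.

```python
import numpy as np


def candidate_feature_vector(t_c, P, P_bar, topic_features):
    """63-dimensional feature vector for candidate topic t_c.

    topic_features(t) is assumed to return the 7 topic-based features
    (Table 2) as a numpy array of length 7.
    """
    def set_based(topics):
        # 14 set-based features: mean and standard deviation of the 7
        # topic-based features over the given set of topics (an empty set,
        # e.g. P in the first iteration, would need special handling).
        vals = np.array([topic_features(t) for t in topics])
        return np.concatenate([vals.mean(axis=0), vals.std(axis=0)])

    groups = [
        P,                               # currently-selected topics
        P_bar,                           # not-yet-selected topics
        list(P) + [t_c],                 # P ∪ {t_c}
        [t for t in P_bar if t != t_c],  # P̄ − {t_c}
    ]
    parts = [topic_features(t_c)] + [set_based(g) for g in groups]
    return np.concatenate(parts)         # 7 + 4 × 14 = 63 features
```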
Table 2: Topic-based Features

    Feature   Description
    f_w̄       Average sampling weight of documents
    f_σw      Standard deviation of weights of documents
    f_τ̄       Average τ score for ranked-list pairs
    f_στ      Standard deviation of τ scores for ranked-list pairs
    f_$       Judgment cost of the topic
    f_σ$      Standard deviation of judgment costs of system pairs
    f_NSD     Standard deviation of estimated performance of systems

• Average sampling weight of documents (f_w̄): In the statAP sampling method [39], a weight is computed for each document based on where it appears in the ranked lists of all IR systems. Simply, the documents at higher ranks get higher weights. The weights are then used in a non-uniform sampling strategy to sample more documents relevant to the corresponding topic. To compute this feature, we leverage that idea and compute the average sampling weight of all documents that appear in the pool of the candidate topic t_c as follows:

    $f_{\bar{w}}(t_c) = \frac{1}{|D_{t_c}|} \sum_{d \in D_{t_c}} w(d, S)$    (2)

where D_{t_c} is the document pool for topic t_c and w(d, S) is the weight of document d over the IR systems S. High f_w̄ values mean that the systems have common documents at higher ranks for the corresponding topic, whereas lower f_w̄ values indicate that the systems return significantly different ranked lists or have only the documents at lower ranks in common.

• Standard deviation of weights of documents (f_σw): Similar to f_w̄, we also compute the standard deviation of the sampling weights of documents for the candidate topic as follows:

    $f_{\sigma w}(t_c) = \sigma\{\, w(d, S) \mid \forall d \in D_{t_c} \,\}$    (3)

• Average τ score for ranked-list pairs (f_τ̄): This feature computes the Kendall's τ correlation between the ranked lists of each pair of IR systems and then takes the average (as shown in Equation 4) in order to capture the diversity of the results of the IR systems. The depth of the ranked lists is set to 100. In order to calculate the Kendall's τ score, the documents that appear in one list but not in the other are concatenated to the other list so that both ranked lists contain the same documents. If there are multiple documents to be concatenated, the order of the documents in the ranked list is preserved during concatenation. For instance, if system A returns documents {a,b,c,d} and system B returns {e,a,f,c} for a topic, then the concatenated ranked lists of A and B are {a,b,c,d,e,f} and {e,a,f,c,b,d}, respectively.

    $f_{\bar{\tau}}(t_c) = \frac{2}{|S|(|S|-1)} \sum_{i=1}^{|S|-1} \sum_{j=i+1}^{|S|} \mathrm{corr}\big(L_{s_i}(t_c),\, L_{s_j}(t_c)\big)$    (4)

where L_{s_j}(t_c) represents the ranked list resulting from system s_j for the topic t_c.
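To make the first three features concrete, here is a minimal illustrative sketch (not the authors' implementation). It assumes the statAP document weights w(d, S) and the systems' ranked lists for the candidate topic are available as plain Python structures, and it follows the concatenation scheme described above before computing Kendall's τ.

```python
import numpy as np
from scipy.stats import kendalltau


def doc_weight_features(weights):
    """f_wbar and f_sigma_w (Eqs. 2-3): mean and standard deviation of the
    statAP sampling weights of the documents in the candidate topic's pool.
    weights: dict mapping each pooled document id to its weight w(d, S)."""
    vals = np.array(list(weights.values()))
    return vals.mean(), vals.std()


def avg_pairwise_tau(ranked_lists, depth=100):
    """f_taubar (Eq. 4): average Kendall's tau over all pairs of systems'
    ranked lists, after appending to each list the documents that appear
    only in the other list (order preserved), as described above."""
    taus = []
    for i in range(len(ranked_lists) - 1):
        for j in range(i + 1, len(ranked_lists)):
            a, b = ranked_lists[i][:depth], ranked_lists[j][:depth]
            a_full = a + [d for d in b if d not in a]
            b_full = b + [d for d in a if d not in b]
            # express both orderings as ranks over the same document set
            pos_b = {d: r for r, d in enumerate(b_full)}
            tau, _ = kendalltau(list(range(len(a_full))),
                                [pos_b[d] for d in a_full])
            taus.append(tau)
    return float(np.mean(taus))
```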
