Augmented Test Collections: A Step in the Right Direction

Laura Hasler, Information Retrieval Group, Information School, University of Sheffield (l.j.hasler@sheffield.ac.uk)
Martin Halvey, ITT, Engineering and Built Environment, Glasgow Caledonian University ([email protected])
Robert Villa, Information Retrieval Group, Information School, University of Sheffield (r.villa@sheffield.ac.uk)

Copyright is held by the author/owner(s). SIGIR'14 Workshop on Gathering Efficient Assessments of Relevance (GEAR'14), July 11, 2014, Gold Coast, Queensland, Australia.

ABSTRACT
In this position paper we argue that certain aspects of relevance assessment in the evaluation of IR systems are oversimplified and that human assessments represented by qrels should be augmented to take account of contextual factors and the subjectivity of the task at hand. We propose enhancing test collections used in evaluation with information related to human assessors and their interpretation of the task. Such augmented collections would provide a more realistic and user-focused evaluation, enabling us to better understand the evaluation process, the performance of systems and user interactions. A first step is to conduct user studies to examine in more detail what people actually do when we ask them to judge the relevance of a document.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords
Relevance Assessment, Test Collections, User Studies, Evaluation.

1. INTRODUCTION
Training and test collections are a fundamental part of the process of developing effective retrieval systems, and are used extensively in evaluation forums such as TREC and CLEF. These collections involve human input to assess the relevance of documents to a given topic, and the assessments are considered a benchmark by which we train and evaluate our systems. However, in the resulting collections, the human process of judging relevance is oversimplified. It may take substantial effort for humans to judge the relevance of documents, where numerous factors come into play and context is crucial. The level of subjectivity of the task can be high and agreement on "relevance" low, yet the information used to evaluate our retrieval systems is encapsulated in a single number in a qrel and is taken as a "gold standard". In this paper, we argue that to move forward and advance our field, we need to acknowledge that we oversimplify things and that the standard collections with which we evaluate our systems are fairly synthetic. We need to be prepared to take on the challenge of considering the less well-defined and perhaps messy aspects of information retrieval evaluation.
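To make concrete how little of the human judging process a conventional collection retains, the following minimal sketch (in Python, with hypothetical topic and document identifiers of our own invention) parses TREC-style qrel lines: each judgement survives only as a single relevance grade attached to a (topic, document) pair.

    # Minimal sketch: reading TREC-style qrel lines, in which each human
    # judgement is reduced to a single relevance grade per (topic, document)
    # pair. The topic and document identifiers below are hypothetical.

    def parse_qrels(lines):
        """Return a {(topic_id, doc_id): relevance} mapping from qrel lines."""
        judgements = {}
        for line in lines:
            topic_id, _iteration, doc_id, relevance = line.split()
            judgements[(topic_id, doc_id)] = int(relevance)
        return judgements

    example_qrels = [
        "301 0 DOC-000123 1",  # judged relevant
        "301 0 DOC-000456 0",  # judged not relevant
    ]

    print(parse_qrels(example_qrels))
    # {('301', 'DOC-000123'): 1, ('301', 'DOC-000456'): 0}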
Given that humans are necessarily involved in the process of developing test collections, and that the process of relevance assessment upon which these collections depend is not as simple as the collections suggest, a good starting point is to examine in more detail what people actually do when we ask them to judge the relevance of documents to a particular topic or information need. We need to do this across different types of document, different types of task, and different types of assessor. By exploring what it is that people really do and by considering the assessors themselves, we can augment the final relevance judgements with additional contextual information about the assessor and their interpretation of the task. This additional information could help to explain subjective and complex relevance judgements in their particular contexts and reflect how users of systems might also consider results to be relevant, capturing a range of perspectives on the relevance of documents. Augmented test collections containing this type of information would therefore allow a more realistic evaluation of retrieval engines from a user perspective, enabling us to better understand both their performance and user interactions.

2. AUGMENTED TEST COLLECTIONS
The complexity of the relevance assessment task has long been recognised in the literature, with a focus on relevance criteria [2, 3, 7, 10]. More recently, a number of studies have explored factors affecting the process of performing relevance assessments and search tasks, considering aspects such as level of expertise or types of judges [6, 5, 9], task complexity [4], certainty [1] and effort [8] for different types of documents. However, the valuable information resulting from such studies has not found its way into the practical side of system evaluation: it is not present in test collections themselves.

To enable a more realistic and user-oriented evaluation in IR, we propose the development of augmented test collections. In these collections, anonymised information is added for each relevance assessment, including standard demographic information such as age, gender and level of education, along with task-specific information relating to assessor expertise, interest, motivation, confidence or certainty, the degree of relevance of a document and the reasons that a document is considered relevant. To achieve this, we first need to investigate the factors which affect the judging process, taking into account different judges, document types and tasks. We need to be able to ask the judges about their decisions to ensure that we fully and accurately understand the process and their interpretation of it.
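As an illustration of what such an augmented judgement might look like in practice, the sketch below (Python) extends a standard qrel entry with anonymised assessor and task context. The field names and values are hypothetical assumptions of ours for illustration only, not a proposed standard schema.

    # Illustrative sketch only: one possible shape for an augmented qrel record
    # that attaches anonymised assessor and task context to a judgement.
    # All field names and values here are hypothetical assumptions.
    from dataclasses import asdict, dataclass
    import json

    @dataclass
    class AugmentedJudgement:
        # Standard qrel fields.
        topic_id: str
        doc_id: str
        relevance: int            # graded relevance, as in a conventional qrel
        # Anonymised demographic information about the assessor.
        assessor_id: str          # anonymised identifier
        age_band: str
        gender: str
        education_level: str
        # Task-specific information about the assessor and their interpretation.
        topic_expertise: int      # e.g. self-reported on a 1-5 scale
        topic_interest: int
        motivation: int
        confidence: int           # certainty in this particular judgement
        perceived_difficulty: int
        effort: int
        reasons: str              # free-text reasons the document was judged relevant

    record = AugmentedJudgement(
        topic_id="301", doc_id="DOC-000123", relevance=1,
        assessor_id="A07", age_band="25-34", gender="F", education_level="MSc",
        topic_expertise=2, topic_interest=4, motivation=3, confidence=5,
        perceived_difficulty=2, effort=3,
        reasons="Directly addresses the information need; recent and detailed.",
    )

    print(json.dumps(asdict(record), indent=2))

One attraction of keeping the standard qrel fields unchanged in such a record is that existing evaluation tooling could simply ignore the additional context, while studies of assessor behaviour could consume the full record.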
We advocate small-scale, qualitative lab-based studies of assessors as the best method to begin our exploration. Our studies will take into account much more than "is this document relevant to this topic?", and as well as performing assessments on standard collections, participants will be asked to specify their own information needs and assess documents as relevant to those. This is an attempt to create a more realistic approximation of real-life relevance assessment, in contrast to the secondary assessments made on synthetic benchmarks, and will be informed by user-centred library and information science methods (e.g. put back in reference). We intend to use a range of document types (text, image and video) and a range of assessors to obtain more comprehensive insights. We think that some of the most valuable insights from our studies will be gained through playing back assessors' judgements to them, asking them to talk through the process and asking why they made certain decisions, then analysing this discussion to establish themes linked to relevance judgements. Assessors will also answer predefined questions regarding their levels of expertise, interest and motivation for a topic, and their perception of the difficulty of the task and the effort expended. Whilst this is a time- and labour-intensive method, it provides rich insights that would be difficult to obtain through other methods. The information collected would provide the basis for the augmented test collections we propose, where it would be attached to its corresponding qrel. This additional information will help to make the hidden elements of subjectivity and complexity in relevance assessments within test collections more transparent, as well as to help explain the more measurable differences in assessor judgements (i.e. as captured by inter-assessor agreement metrics).

A next step would be to investigate whether we can gather a substantial amount of comparable assessments and corresponding additional information effectively on a larger scale, using, for example, crowdsourcing as a method of data collection. Whilst crowdsourcing experiments cannot generate the same degree of qualitative data as the lab setting, the number of participants in a single experiment will be much greater, as will the resulting quantity of assessments generated, enabling us to explicitly test the factors and themes identified in the lab on a larger scale. These studies will also allow us to investigate whether it is feasible to create our proposed type of augmented test collection using a less qualitative methodology. Considering the amount of time, effort and money it takes to produce test collections, the fact that large numbers of assessments are often needed for evaluation, and the problem of subjectivity, it is sensible to explore alternative means of gathering judgements.

3. CONCLUSIONS
In this position paper, we acknowledged the oversimplification of the human process of relevance assessment as represented in IR evaluation, which contributes to the opaque nature of the benchmarks we use. We proposed the development of augmented test collections, where qrels are enhanced with additional information regarding the assessor and their interpretation of the task, as a way forward. The contextual information used to augment the qrels would help to make the subjectivity and complexity associated with relevance assessments more explicit within test collections, leading to a potentially more natural and user-focused approach to the evaluation of systems which are, after all, used by real people who make these judgements during their everyday interactions. We outlined a possible way of exploring the factors necessary for the development of such collections through qualitative lab-based studies which can then be scaled up via crowdsourcing. We believe that our proposal of augmented test collections is a step in the right direction towards more realistic, clear and useful IR evaluation.

4. ACKNOWLEDGEMENTS
This work was funded by the UK Arts and Humanities Research Council (grant AH/L010364/1).
5. REFERENCES
[1] A. L. Al-Harbi and M. D. Smucker. User expressions of relevance judgment certainty.
[2] C. L. Barry. User-defined relevance criteria: an exploratory study. JASIS, 45(3):149–159, 1994.
[3] C. L. Barry and L. Schamber. Users' criteria for relevance evaluation: a cross-situational comparison. Information Processing & Management, 34(2):219–236, 1998.
[4] D. J. Bell and I. Ruthven. Searcher's assessments of task complexity for web searching. In Advances in Information Retrieval, pages 57–71. Springer, 2004.
[5] P. Clough, M. Sanderson, J. Tang, T. Gollins, and A. Warner. Examining the limits of crowdsourcing for relevance assessment. IEEE Internet Computing, 17(4):32–38, 2013.
[6] K. A. Kinney, S. B. Huffman, and J. Zhai. How evaluator domain expertise affects search result relevance judgments. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 591–598. ACM, 2008.
[7] T. Saracevic. The concept of relevance in information science: A historical review. Introduction to Information Science, pages 111–151, 1970.
[8] R. Villa and M. Halvey. Is relevance hard work?: evaluating the effort of making relevant assessments. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 765–768. ACM, 2013.
[9] W. Webber and J. Pickens. Assessor disagreement and text classifier accuracy. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 929–932. ACM, 2013.
[10] Y. C. Xu and Z. Chen. Relevance judgment: What do information users consider beyond topicality? Journal of the American Society for Information Science and Technology, 57(7):961–973, 2006.
