Table Of Content

Lecture Notes in Computer Science 6107 CommencedPublicationin1973 FoundingandFormerSeriesEditors: GerhardGoos,JurisHartmanis,andJanvanLeeuwen EditorialBoard DavidHutchison LancasterUniversity,UK TakeoKanade CarnegieMellonUniversity,Pittsburgh,PA,USA JosefKittler UniversityofSurrey,Guildford,UK JonM.Kleinberg CornellUniversity,Ithaca,NY,USA AlfredKobsa UniversityofCalifornia,Irvine,CA,USA FriedemannMattern ETHZurich,Switzerland JohnC.Mitchell StanfordUniversity,CA,USA MoniNaor WeizmannInstituteofScience,Rehovot,Israel OscarNierstrasz UniversityofBern,Switzerland C.PanduRangan IndianInstituteofTechnology,Madras,India BernhardSteffen TUDortmundUniversity,Germany MadhuSudan MicrosoftResearch,Cambridge,MA,USA DemetriTerzopoulos UniversityofCalifornia,LosAngeles,CA,USA DougTygar UniversityofCalifornia,Berkeley,CA,USA GerhardWeikum Max-PlanckInstituteofComputerScience,Saarbruecken,Germany Hamish Cunningham Allan Hanbury Stefan Rüger (Eds.) Advances in Multidisciplinary Retrieval First Information Retrieval Facility Conference, IRFC 2010 Vienna,Austria, May 31, 2010 Proceedings 1 3 VolumeEditors HamishCunningham UniversityofSheffield Dept.ofComputerScience SheffieldS14DP,UK E-mail:[email protected] AllanHanbury InformationRetrievalFacility 1040Vienna,Austria E-mail:[email protected] StefanRüger KnowledgeMediaInstitute TheOpenUniversity MiltonKeynes,MK76AA,UK E-mail:[email protected] LibraryofCongressControlNumber:2010926587 CRSubjectClassification(1998):H.3,I.2.4,H.5,H.4,I.2,C.2 LNCSSublibrary:SL3–InformationSystemsandApplication,incl.Internet/Web andHCI ISSN 0302-9743 ISBN-10 3-642-13083-6SpringerBerlinHeidelbergNewYork ISBN-13 978-3-642-13083-0SpringerBerlinHeidelbergNewYork Thisworkissubjecttocopyright.Allrightsarereserved,whetherthewholeorpartofthematerialis concerned,specificallytherightsoftranslation,reprinting,re-useofillustrations,recitation,broadcasting, reproductiononmicrofilmsorinanyotherway,andstorageindatabanks.Duplicationofthispublication orpartsthereofispermittedonlyundertheprovisionsoftheGermanCopyrightLawofSeptember9,1965, initscurrentversion,andpermissionforusemustalwaysbeobtainedfromSpringer.Violationsareliable toprosecutionundertheGermanCopyrightLaw. springer.com ©Springer-VerlagBerlinHeidelberg2010 PrintedinGermany Typesetting:Camera-readybyauthor,dataconversionbyScientificPublishingServices,Chennai,India Printedonacid-freepaper 06/3180 Preface These proceedingscontainthe refereedpapersandposterspresentedatthe first InformationRetrieval Facility Conference (IRFC), whichwas held in Vienna on 31 May 2010.The conference providesa multi-disciplinary, scientific forum that aims to bring young researchers into contact with industry at an early stage. IRFC 2010received20 high-qualitysubmissions,of which11 were acceptedand appear here. The decision whether a paper was presented orally or as poster wassolelybasedonwhatwe thoughtwasthe mostsuitable formofcommunica- tion, considering we had only a single day for the event. In particular, the form of presentation bears no relation to the quality of the accepted papers, all of which were thoroughly peer reviewed and had to be endorsed by at least three independent reviewers. The Information Retrieval Facility (IRF) is an open IR research institution, managedbyascientificboarddrawnfromapanelofinternationalexpertsinthe fieldwhoseroleistopromotethehighestqualityintheresearchsupportedbythe facility.As a non-profitresearchinstitution, the IRF providesservicesto IRsci- enceintheformofareferencelaboratory,hardwareandsoftwareinfrastructure. Committed to Open Science concepts, the IRF promotes publication of recent scientific results and newly developed methods, both in traditional paper form and as data sets freely available to IRF members. Such transparency ensures objectiveevaluationandcomparabilityofresultsandconsequentlydiversityand sustainability of their further development. The IRF is unique in providing a powerful supercomputing infrastructure that is exclusively dedicated to semantic processing of text. It has at its hearta hugecollectionofpatentdocuments representingthe globalarchiveofideasand inventions. This data is housed in an environment that allows large-scale scien- tificexperimentsonwaystomanageandretrievethisknowledge.Thiscollection of real data allows IR researchers from all over the world to experiment for the firsttimeonarealisticdatacorpus.Thequalityofsearchresultsisreviewedand evaluated in many fields but especially those specialising in patent search. IRFconferenceswishtoresonateinparticularwithyoungresearchers,whoare interested in discussing results obtained using the IRF infrastructure and data resources; learning about complementary technologies; applying their research efforts to real business needs; and joining the international research network of the IRF.The firstIRFC aimedto tacklefour complementaryresearchareas: – information retrieval – semantic web technologies for IR – natural language processing for IR – large-scale or distributed computing for the above areas We believe that this first conference has achievedmost of these aims and we look forward to many more instances of the IRFC. VI Preface Acknowledgements.Itisnevereasytomakeaconferencehappen.Oursincere thanks go out to: – the IRF executive board: Francisco Eduardo De Sousa Webber, Daniel Schreiber and Sylvia Thal, for their inspiration, for getting the ball rolling and for exceptional organisationaltalent – the professionalteamatthe IRFandMatrixwarefor theirhelpinpreparing the conference and this volume: Marie-Pierre Garnier; Katja Mayer; Mihai Lupu; Giovanna Roda; Helmut Berger – the IRF scientific board and John Tait (IRF CSO) for their guidance – Niraj Aswani for his help in preparing the proceedings – the conference programme committee for their hard work reviewing and commenting on the papers: • Yannis Avrithis, CERTH, Greece • Leif Azzopardi, University of Glasgow, UK • Ricardo Baeza-Yates,Yahoo! Research, Spain • Jamie Callan, Carnegie Mellon University, USA • Paul Clough, University of Sheffield, UK • W. Bruce Croft, University of Massachusetts, USA • Norbert Fuhr, University Duisburg-Essen, Germany • Wilfried Gansterer, University of Vienna, Austria • Charles J. Gillan, Queen’s University Belfast, UK • Gregory Grefenstette, Exalead, France • Preben Hansen, Swedish Institute of Computer Science, Sweden • David Hawking, Funnelback Internet and Enterprise Search, Australia • Joemon Jose, University of Glasgow, UK • Noriko Kando, National Institute of Informatics (NII), Japan • Philipp Koehn, University of Edinburgh, UK • Wessel Kraaij, TNO, The Netherlands • Udo Kruschwitz, University of Essex, UK • Dominique Maret, Matrixware Information Services, Austria • Marie-Francine Moens, Catholic University of Leuven, Belgium • Henning Mu¨ller, University of Applied Sciences Western Switzerland, Switzerland • Walid Najjar, University of California Riverside, USA • ArcotDesaiNarasimhalu,SingaporeManagementUniversity,Singapore • Fredrik Olsson, Swedish Institute of Computer Science, Sweden • Miles Osborne, University of Edinburgh, UK • Andreas Rauber, Vienna University of Technology, Austria • Magnus Sahlgren, Swedish Institute of Computer Science, Sweden • Mark Sanderson, University of Sheffield, UK • Frank J. Seinstra, Vrije Universiteit, The Netherlands • John Tait, IRF, Austria • Benjamin T’sou, City University of Hong Kong, China • Christa Womser-Hacker,University of Hildesheim, Germany Preface VII – thekeynotespeakersMarkSanderson,UniversityofSheffield,UK,andDavid Hawking, Funnelback Internet and Enterprise Search, Australia, for providing excellent and inspiring talks – the sponsors: • Matrixware Information Services • ESTeam • The University of Sheffield • STI International • Yahoo! Labs – the AustrianFederalMinistry ofScience andResearchandWien Kultur for their support – BCS — The Chartered Institute for IT for endorsing the conference – Matt Petrillo for understanding what Hamish was talking about (at least some of the time) We hope you enjoy the results! May 2010 Hamish Cunningham, University of Sheffield http://www.dcs.shef.ac.uk/˜hamish General Chair Allan Hanbury, Information Retrieval Facility http://www.ir-facility.org/about/people/staff Publications Chair Stefan Ru¨ger, The Open University http://people.kmi.open.ac.uk/stefan Programme Chair VIII Preface Supported by Endorsed by Table of Contents Scaling Up High-Value Retrieval to Medium-Volume Data............. 1 Hamish Cunningham, Allan Hanbury, and Stefan Ru¨ger Sentence-level Attachment Prediction............................... 6 M-Dyaa Albakour, Udo Kruschwitz, and Simon Lucas Rank by Readability: Document Weighting for Information Retrieval ... 20 Neil Newbold, Harry McLaughlin, and Lee Gillam Knowledge Modeling in Prior Art Search............................ 31 Erik Graf, Ingo Frommholz, Mounia Lalmas, and Keith van Rijsbergen Combining Wikipedia-Based Concept Models for Cross-Language Retrieval........................................................ 47 Benjamin Roth and Dietrich Klakow Exploring Contextual Models in Chemical Patent Search.............. 60 Jay Urbain and Ophir Frieder Measuring the Variability in Effectiveness of a Retrieval System........ 70 Mehdi Hosseini, Ingemar J. Cox, Natasa Millic-Frayling, and Vishwa Vinay An Information Retrieval Model Based on Discrete Fourier Transform ...................................................... 84 Alberto Costa and Massimo Melucci Logic-BasedRetrieval:Technologyfor Content-Orientedand Analytical Querying of Patent Data.......................................... 100 Iraklis Angelos Klampanos, Hengzhi Wu, Thomas Roelleke, and Hany Azzam Automatic Extraction and Resolution of Bibliographical References in Patent Documents ............................................... 120 Patrice Lopez An Investigation of Quantum Interference in Information Retrieval ..... 136 Massimo Melucci Abstracts versus Full Texts and Patents: A Quantitative Analysis of Biomedical Entities .............................................. 152 Bernd Mu¨ller, Roman Klinger, Harsha Gurulingappa, Heinz-Theodor Mevissen, Martin Hofmann-Apitius, Juliane Fluck, and Christoph M. Friedrich Author Index.................................................. 167 Scaling Up High-Value Retrieval to Medium-Volume Data Hamish Cunningham1, Allan Hanbury2, and Stefan Ru¨ger3 1 Department of Computer Science, University of Sheffield Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK [email protected] 2 Information Retrieval Facility Operngasse 20b, 1040 Vienna, Austria [email protected] 3 Knowledge Media Institute,The Open University Walton Hall, Milton KeynesMK7 6AA, UK [email protected] Abstract. We summarise the scientific work presented at the first In- formation Retrieval Facility Conference [3] and argue that high-value retrieval with medium-volume data, exemplified by patent search, is a thriving topic in a multidisciplinary area that sits between Information Retrieval, Natural Language Processing and Semantic Web Technolo- gies. Weanalyse theparameters thatcondition choices ofretrievaltech- nology for different sizes and values of document space, and we present the patent document space and some of its characteristics for retrieval work. 1 How Much Search Can You Afford? Informationretrieval(IR)technologyhasproliferatedinroughproportiontothe expansionofknowledgeandinformationas acentralfactorin economicsuccess. How should end-users choose between the many available options? Three main dimensions condition the choice: – Volume.The GYMbigthreewebsearchengines(Google,Yahoo,Microsoft) deliver sub-second responses to hundreds of millions of queries daily over hundreds of terabytes of data. At the other end of the scale, desktop search systemscanrelyonsubstantialcomputing resourcesrelativetoasmalldata set. – Value. The retrieval of high-value content (typically within corporate in- tranetsorbehindpay-for-useturnstiles)isoftenmission-criticalforthebusi- ness that owns the content. For example, the BBC allocates a skilled staff member for eight hours per broadcast hour to index their most important content. – Cost. Semantic indexing, conceptual search, linked data and so on share with controlled-vocabulary and meta-data systems a higher cost of H.Cunningham,A.Hanbury,andS.Ru¨ger(Eds.):IRFC2010,LNCS6107,pp.1–5,2010. (cid:2)c Springer-VerlagBerlinHeidelberg2010 2 H. Cunningham, A.Hanbury,and S. Ru¨ger implementation andmaintenance than systems based on counting keywords and hyperlinks. To process web-scale volumes, GYM use a combination of one of the oldest and simplest retrievaldata structures (an inverted file that relates searchterms to documents) and a ranking algorithm that utilises the link structure of the web. These techniques work much better than was initially expected, profiting from the vast number of human relevance decisions that are encapsulated in hyperlinks. Problems remain of course: first, there is still much data in which links are not present, and second the familiar problems of ambiguity (index term synonymy and query term polysemy) can lead to retrieval of irrelevant information and/or failure to retrieve relevant information. High-value (or low-volume) content retrieval systems address these problems withavarietyofsemantics-basedapproachesthatattempttoperformconceptual indexingandlogicalquerying.Forexample,theBBCsystemcitedaboveindexes usingathesaurusof100,000termsthatgeneraliseoveranticipatedsearchterms. Similarly in the Life Sciences, publicationdatabasesincreasingly use richtermi- nologicalresourcestosupportconceptualnavigation(MeSH,theGeneOntology, Snomed, the unified UMLS system etc). An important research theme in recent years has been to ask to what de- gree can we have our cake and eat it? In other words, how far can the low- volume/high-value methods be extended? One area of new high-value methods is being driven from Natural Language Processing. In the proceedings of this IRFC 2010 conference, Roth and Klakow [12]combineWikipedia-basedconceptmodelsforcross-languageretrieval.They employ Wikipedia as a low-cost resource that is up-to-date and find that stan- dard Latent Dirichlet Allocation can extract cross-language information that is valuable for IR by simply normalising the training data. Albakour et al [1] look at sentence-level analysis of e-mails with a view to predict whether the e-mail should contain an attachment. This is an example for a problem where a finer- grained sentence-level approach is preferable to the common IR approach of a bag-of-wordsof the entire document. Evaluationmeasuresin IRcertainly haveto adaptto new models andmodus operandi.OneexampleistheNewboldetal[11]novelranking-by-readabilityap- proach,whichconsidersthereadingability(andmotivation)oftheuser.Hosseini et al [6] critique the common approach of comparing retrieval systems by their mean performance. They argue that the averaging process of the typical mean averageprecisionmeasuredoesnotcapturealltheimportantaspectsofeffective- ness.Insteadthey explorehowthe varianceofametric canbe usedto providea moreinformativemeasureofsystemeffectivenessthanmeanaverageprecision. Other contributions to potential future high-value methods come from the forefront of theory: Melucci [9] looks at van Rijsbergen’s quantum mechanics framework as a new model for IR. He first discusses a situation in which that framework, and then quantum probability, can be necessary in IR and then describes the experiments designed to this end. Costa and Melucci [2] present a new framework based on the Discrete Fourier Transform (DFT) for IR. Their

Advances in Multidisciplinary Retrieval: First Information Retrieval Facility Conference, IRFC 2010, Vienna, Austria, May 31, 2010. Proceedings PDF

174 Pages·2010·3.548 MB·English

by Hamish Cunningham, Allan Hanbury, Stefan Rüger (auth.), Hamish Cunningham, Allan Hanbury, Stefan Rüger (eds.)

Checking for file health...

Save to my drive

Quick download

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Advances in Multidisciplinary Retrieval: First Information Retrieval Facility Conference, IRFC 2010, Vienna, Austria, May 31, 2010. Proceedings

See more

The list of books you might like

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.