P1:KRU/IRP irbook CUUS232/Manning 9780521865715 June26,2008 21:26

Introduction to Information Retrieval

Introduction to Information Retrieval is the first textbook with a coherent treatment of classical and web information retrieval, including web search and the related areas of text classification and text clustering. Written from a computer science perspective, it gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents and of methods for evaluating systems, along with an introduction to the use of machine learning methods on text collections.

Designed as the primary text for a graduate or advanced undergraduate course in information retrieval, the book will also interest researchers and professionals. A complete set of lecture slides and exercises that accompany the book are available on the web.

Christopher D. Manning is Associate Professor of Computer Science and Linguistics at Stanford University.

Prabhakar Raghavan is Head of Yahoo! Research and a Consulting Professor of Computer Science at Stanford University.

Hinrich Schütze is Chair of Theoretical Computational Linguistics at the Institute for Natural Language Processing, University of Stuttgart.

Introduction to Information Retrieval

Christopher D. Manning
Stanford University

Prabhakar Raghavan
Yahoo! Research

Hinrich Schütze
University of Stuttgart

CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo

Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK

Published in the United States of America by Cambridge University Press, New York

www.cambridge.org
Information on this title: www.cambridge.org/9780521865715

© Cambridge University Press 2008

This publication is in copyright.
Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published in print format 2008

ISBN-13 978-0-511-41405-3 eBook (EBL)
ISBN-13 978-0-521-86571-5 hardback

Cambridge University Press has no responsibility for the persistence or accuracy of urls for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

Contents

Table of Notation page xi
Preface xv

1 Boolean retrieval 1
    1.1 An example information retrieval problem 3
    1.2 A first take at building an inverted index 6
    1.3 Processing Boolean queries 9
    1.4 The extended Boolean model versus ranked retrieval 13
    1.5 References and further reading 16

2 The term vocabulary and postings lists 18
    2.1 Document delineation and character sequence decoding 18
    2.2 Determining the vocabulary of terms 21
    2.3 Faster postings list intersection via skip pointers 33
    2.4 Positional postings and phrase queries 36
    2.5 References and further reading 43

3 Dictionaries and tolerant retrieval 45
    3.1 Search structures for dictionaries 45
    3.2 Wildcard queries 48
    3.3 Spelling correction 52
    3.4 Phonetic correction 58
    3.5 References and further reading 59

4 Index construction 61
    4.1 Hardware basics 62
    4.2 Blocked sort-based indexing 63
    4.3 Single-pass in-memory indexing 66
    4.4 Distributed indexing 68
    4.5 Dynamic indexing 71
    4.6 Other types of indexes 73
    4.7 References and further reading 76

5 Index compression 78
    5.1 Statistical properties of terms in information retrieval 79
    5.2 Dictionary compression 82
    5.3 Postings file compression 87
    5.4 References and further reading 97

6 Scoring, term weighting, and the vector space model 100
    6.1 Parametric and zone indexes 101
    6.2 Term frequency and weighting 107
    6.3 The vector space model for scoring 110
    6.4 Variant tf–idf functions 116
    6.5 References and further reading 122

7 Computing scores in a complete search system 124
    7.1 Efficient scoring and ranking 124
    7.2 Components of an information retrieval system 132
    7.3 Vector space scoring and query operator interaction 136
    7.4 References and further reading 137

8 Evaluation in information retrieval 139
    8.1 Information retrieval system evaluation 140
    8.2 Standard test collections 141
    8.3 Evaluation of unranked retrieval sets 142
    8.4 Evaluation of ranked retrieval results 145
    8.5 Assessing relevance 151
    8.6 A broader perspective: System quality and user utility 154
    8.7 Results snippets 157
    8.8 References and further reading 159

9 Relevance feedback and query expansion 162
    9.1 Relevance feedback and pseudo relevance feedback 163
    9.2 Global methods for query reformulation 173
    9.3 References and further reading 177

10 XML retrieval 178
    10.1 Basic XML concepts 180
    10.2 Challenges in XML retrieval 183
    10.3 A vector space model for XML retrieval 188
    10.4 Evaluation of XML retrieval 192
    10.5 Text-centric versus data-centric XML retrieval 196
    10.6 References and further reading 198

11 Probabilistic information retrieval 201
    11.1 Review of basic probability theory 202
    11.2 The probability ranking principle 203
    11.3 The binary independence model 204
    11.4 An appraisal and some extensions 212
    11.5 References and further reading 216

12 Language models for information retrieval 218
    12.1 Language models 218
    12.2 The query likelihood model 223
    12.3 Language modeling versus other approaches in information retrieval 229
    12.4 Extended language modeling approaches 230
    12.5 References and further reading 232

13 Text classification and Naive Bayes 234
    13.1 The text classification problem 237
    13.2 Naive Bayes text classification 238
    13.3 The Bernoulli model 243
    13.4 Properties of Naive Bayes 245
    13.5 Feature selection 251
    13.6 Evaluation of text classification 258
    13.7 References and further reading 264

14 Vector space classification 266
    14.1 Document representations and measures of relatedness in vector spaces 267
    14.2 Rocchio classification 269
    14.3 k nearest neighbor 273
    14.4 Linear versus nonlinear classifiers 277
    14.5 Classification with more than two classes 281
    14.6 The bias–variance tradeoff 284
    14.7 References and further reading 291

15 Support vector machines and machine learning on documents 293
    15.1 Support vector machines: The linearly separable case 294
    15.2 Extensions to the support vector machine model 300
    15.3 Issues in the classification of text documents 307
    15.4 Machine-learning methods in ad hoc information retrieval 314
    15.5 References and further reading 318

16 Flat clustering 321
    16.1 Clustering in information retrieval 322
    16.2 Problem statement 326
    16.3 Evaluation of clustering 327
    16.4 K-means 331
    16.5 Model-based clustering 338
    16.6 References and further reading 343

17 Hierarchical clustering 346
    17.1 Hierarchical agglomerative clustering 347
    17.2 Single-link and complete-link clustering 350
    17.3 Group-average agglomerative clustering 356
    17.4 Centroid clustering 358
    17.5 Optimality of hierarchical agglomerative clustering 360
    17.6 Divisive clustering 362
    17.7 Cluster labeling 363
    17.8 Implementation notes 365
    17.9 References and further reading 367

18 Matrix decompositions and latent semantic indexing 369
    18.1 Linear algebra review 369
    18.2 Term–document matrices and singular value decompositions 373
    18.3 Low-rank approximations 376
    18.4 Latent semantic indexing 378
    18.5 References and further reading 383

19 Web search basics 385
    19.1 Background and history 385
    19.2 Web characteristics 387
    19.3 Advertising as the economic model 392
    19.4 The search user experience 395
    19.5 Index size and estimation 396
    19.6 Near-duplicates and shingling 400
    19.7 References and further reading 404

20 Web crawling and indexes 405
    20.1 Overview 405
    20.2 Crawling 406
    20.3 Distributing indexes 415
    20.4 Connectivity servers 416
    20.5 References and further reading 419
Description: