Lightening the Load of Document Smoothing for Better Language Modeling Retrieval

Mark D. Smucker and James Allan
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts Amherst

ABSTRACT

We hypothesized that language modeling retrieval would improve if we reduced the need for document smoothing to provide an inverse document frequency (IDF) like effect. We created inverse collection frequency (ICF) weighted query models as a tool to partially separate the IDF-like role from document smoothing. Compared to maximum likelihood estimated (MLE) queries, the ICF weighted queries achieved a 6.4% improvement in mean average precision on description queries. The ICF weighted queries performed better with less document smoothing than that required by MLE queries. Language modeling retrieval may benefit from a means to separately incorporate an IDF-like behavior outside of document smoothing.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms, Experimentation

Keywords: Inverse Document Frequency, IDF, Document Smoothing, Language Modeling

1. INTRODUCTION

The language modeling approach to information retrieval represents queries and documents as probabilistic models [1]. While inverse document frequency (IDF) is common to many retrieval methods, it is not explicitly a part of the language modeling approach to retrieval. Zhai and Lafferty showed that language modeling contains an IDF-like component when documents are smoothed with the collection model [2]. The IDF-like behavior is provided in the form of the inverse collection frequency (ICF). Zhai and Lafferty developed two-stage smoothing to separately control and leverage smoothing's estimation and query modeling capabilities [3]. Two-stage smoothing also eliminates the need for parameter tuning with training data. The query modeling role of smoothing describes the need for more smoothing to increase the IDF-like effect for verbose queries. Two-stage smoothing is competitive with, and sometimes outperforms, the best performance obtainable from Dirichlet prior and Jelinek-Mercer smoothing [3].

We hypothesize that providing the inverse collection frequency outside of document smoothing will result in improved retrieval. In a sense, removing a role from document smoothing will allow smoothing to perform better in its remaining roles.

To test our hypothesis, we weight our query models in proportion to the ICF. While this is an ad-hoc method to determine the probabilities of a query model, it does allow for the ICF to be partially separated from document smoothing.

2. METHODS

Maximum likelihood estimation (MLE) is a common technique for estimating query models. MLE estimates the probability of a word given a text as the count of that word divided by the total number of words in the text. As such, the MLE probability of a word w given a query Q is:

    P(w|M_Q) = Q(w) / |Q|

where Q(w) is the count of word w in the query Q and |Q| = Σ_w Q(w) is the query's length.

We created our inverse collection frequency weighted query models as follows:

    P(w|M_Q) = -(1/Z) Q(w) log P(w|C)    (1)

where P(w|C) is the MLE model of the collection, and Z is the normalization factor so that the query model sums to 1. Taking the negative of the log, -log P(w|C), gives an amount proportional to the inverse collection frequency.

This ICF weighting approach is analogous to the inverse document frequency and may seem ad-hoc in construction. Our goal with the ICF weighted query is to provide a means to test our hypothesis. Our goal is not to propose a new way to estimate a query model. We hope that further work might result in a compelling formal integration of the idea into language modeling retrieval.
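The two estimates above are straightforward to compute. The sketch below, a minimal illustration of the formulas rather than the authors' implementation, builds both the MLE query model and the ICF weighted query model of equation 1 from token counts; the function names (mle_query_model, icf_query_model) and the assumption that tokens are already stemmed and stopped are illustrative, not from the paper.

    from collections import Counter
    from math import log

    def mle_query_model(query_tokens):
        """MLE query model: P(w|M_Q) = Q(w) / |Q|."""
        counts = Counter(query_tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def icf_query_model(query_tokens, collection_counts, collection_size):
        """ICF weighted query model of equation 1:
        P(w|M_Q) = -(1/Z) Q(w) log P(w|C), with Z chosen so the model sums to 1."""
        counts = Counter(query_tokens)
        weights = {}
        for w, c in counts.items():
            p_w_c = collection_counts.get(w, 0) / collection_size  # MLE collection model P(w|C)
            if p_w_c > 0:
                weights[w] = -c * log(p_w_c)  # query count times an ICF-like weight
        z = sum(weights.values())  # normalization factor Z
        return {w: weight / z for w, weight in weights.items()}

Because -log P(w|C) grows as a word becomes rarer in the collection, the ICF weighted model shifts probability mass toward rare query words, which is exactly the IDF-like behavior the paper aims to supply outside of document smoothing.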
3. EXPERIMENTS

We compared the inverse collection frequency query models created via equation 1 to the MLE query models. For our queries, we used the title and description fields of the TREC topics 301-450, which are the ad-hoc topics for TREC 6, 7, and 8. The collection was TREC volumes 4 and 5 minus the Congressional Record. We used the Lemur retrieval toolkit for our experiments. We stemmed all words with the Krovetz stemmer and used an in-house stopword list of 418 words. We ranked documents by their cross entropy with the query.

We used three-fold cross validation to set the document smoothing parameters. Each set of TREC topics was a fold. For example, the TREC 7 and 8 topics were the training set for the TREC 6 topics. Thus our results more closely represent expected performance than optimal performance. We did parameter sweeps of Dirichlet prior and Jelinek-Mercer smoothing. In all cases Dirichlet prior performed better. Dirichlet prior smoothing mixes the MLE document model with the MLE collection model:

    P(w|M_D) = (1 - λ)P(w|D) + λP(w|C)

where λ = m/(|D| + m) and m is the Dirichlet prior parameter.
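To make the retrieval setup concrete, the sketch below scores a single document for a query by combining Dirichlet prior smoothing with the cross-entropy ranking described above. It is a minimal illustration under the paper's definitions, not the Lemur toolkit code used in the experiments; the names dirichlet_smoothed_prob and document_score, and the dictionary-of-counts representation, are assumptions.

    from math import log

    def dirichlet_smoothed_prob(w, doc_counts, doc_length, collection_counts, collection_size, m):
        """Dirichlet prior smoothing:
        P(w|M_D) = (1 - lam) P(w|D) + lam P(w|C), with lam = m / (|D| + m)."""
        lam = m / (doc_length + m)
        p_w_d = doc_counts.get(w, 0) / doc_length               # MLE document model P(w|D)
        p_w_c = collection_counts.get(w, 0) / collection_size   # MLE collection model P(w|C)
        return (1 - lam) * p_w_d + lam * p_w_c

    def document_score(query_model, doc_counts, collection_counts, collection_size, m):
        """Negative cross entropy of the query model with the smoothed document model,
        sum_w P(w|M_Q) log P(w|M_D). Ranking documents by decreasing score is the
        same as ranking them by increasing cross entropy with the query."""
        doc_length = sum(doc_counts.values())
        score = 0.0
        for w, p_q in query_model.items():
            p_d = dirichlet_smoothed_prob(w, doc_counts, doc_length,
                                          collection_counts, collection_size, m)
            if p_d > 0:
                score += p_q * log(p_d)
        return score

Here m controls how much collection probability mass is mixed into the document model; Table 2 below lists the swept values of m that worked best on the training folds.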
We measured statistical significance with a paired, two-sided randomization test with 100,000 samples.
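A paired, two-sided randomization test of this kind can be sketched as follows: for each sample, the per-topic scores of the two systems are randomly swapped, and the test counts how often the resulting difference in means is at least as large as the observed one. This is a generic illustration operating on per-topic average precision scores, not the authors' evaluation script, and the name randomization_test is an assumption.

    import random

    def randomization_test(scores_a, scores_b, samples=100000, seed=0):
        """Paired, two-sided randomization test on per-topic scores.
        Returns the estimated probability of a difference in means at least as
        large as the observed one under random swapping of the paired scores."""
        assert len(scores_a) == len(scores_b)
        rng = random.Random(seed)
        n = len(scores_a)
        observed = abs(sum(scores_a) - sum(scores_b)) / n
        extreme = 0
        for _ in range(samples):
            diff = 0.0
            for a, b in zip(scores_a, scores_b):
                if rng.random() < 0.5:  # randomly reassign this topic's pair of scores
                    a, b = b, a
                diff += a - b
            if abs(diff) / n >= observed:
                extreme += 1
        return extreme / samples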
Table 1: Mean average precision results for title and description queries. MLE is the maximum likelihood query. ICF is the query model constructed using the inverse collection frequency as per equation 1.

    Title Queries
    Test Set     MLE     ICF     % Imp.   Sig. Level
    TREC 6       0.228   0.234    2.6%    0.06
    TREC 7       0.182   0.189    3.9%    0.12
    TREC 8       0.247   0.246   -0.3%    0.81
    All: 6,7,8   0.219   0.223    1.9%    0.04

    Description Queries
    Test Set     MLE     ICF     % Imp.   Sig. Level
    TREC 6       0.195   0.212    8.9%    <0.01
    TREC 7       0.183   0.194    6.4%    0.03
    TREC 8       0.228   0.238    4.4%    0.05
    All: 6,7,8   0.202   0.215    6.4%    <0.01

Table 2: The best parameter setting (Dirichlet m) for Dirichlet prior smoothing on the training sets. Except for the TREC 6, 8 titles, the inverse collection frequency (ICF) query models performed better with less smoothing than the maximum likelihood estimated (MLE) models.

    Training Set       MLE     ICF
    Titles TREC 6, 7    800     500
    Titles TREC 6, 8    350     400
    Titles TREC 7, 8    800     500
    Desc. TREC 6, 7    2500    1250
    Desc. TREC 6, 8    2500    1000
    Desc. TREC 7, 8    3000    1250

4. RESULTS AND DISCUSSION

Table 1 shows the mean average precision of MLE and ICF query models for title and description queries. ICF obtained a 1.9% improvement over MLE on title queries and a 6.4% improvement on description queries. For title queries, none of the individual TREC topic sets showed statistically significant improvements. For description queries, statistically significant improvements occurred for TREC 6 and 7. ICF's performance gain comes from improving precision at recall points of 0.3 and higher.

Table 2 shows that less document smoothing is required if the retrieval method has available another way to utilize the inverse collection frequency in an IDF-like manner.

Our results show that improved retrieval performance is possible if the IDF-like role played by document smoothing is separately handled. The performance improvement may be the result of either the better use of the ICF, or better document smoothing, or both.

Two-stage smoothing performs well, but in the configuration tested by Zhai and Lafferty, it does not result in statistically significant performance improvements on title or description queries for TREC 7 and 8 as compared to the best performance of Dirichlet prior or Jelinek-Mercer smoothing [3]. Two-stage smoothing separately models the estimation and query-modeling (IDF-like) roles of smoothing, but it still ties the IDF-like role to the amount of smoothing required. In contrast, when we separated the IDF-like role from smoothing, we obtained a modest, but statistically significant, performance improvement for description queries on TREC 7. While two-stage smoothing requires no parameter tuning, one could tune its parameters and possibly achieve improvements similar to those reported here.

5. CONCLUSION

Document smoothing plays several roles in language modeling retrieval [2]. One of these roles is to produce an IDF-like effect using the inverse collection frequency (ICF). We hypothesized that lightening the IDF-like responsibilities of smoothing could improve retrieval. We created ICF weighted query models and compared their performance to maximum likelihood estimated models. The ICF query models achieved a 6.4% improvement in mean average precision while requiring less document smoothing. Language modeling retrieval may benefit from a means to incorporate an IDF-like behavior outside of document smoothing and thereby allow document smoothing to focus on its other roles.

6. ACKNOWLEDGMENTS

This work was supported in part by the Center for Intelligent Information Retrieval and in part by the Defense Advanced Research Projects Agency (DARPA) under contract number HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

7. REFERENCES

[1] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR, pages 275–281, 1998.
[2] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In SIGIR, pages 334–342, 2001.
[3] C. Zhai and J. Lafferty. Two-stage language models for information retrieval. In SIGIR, pages 49–56, 2002.

Copyright is held by the author/owner(s). SIGIR'06, August 6–11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008.
