Prasenjit Majumder, Mandar Mitra, Pushpak Bhattacharyya, L. Venkata Subramaniam, Danish Contractor, Paolo Rosso (Eds.)

Multilingual Information Access in South Asian Languages

Second International Workshop, FIRE 2010, Gandhinagar, India, February 2010
and Third International Workshop, FIRE 2011, Bombay, India, December 2011
Revised Selected Papers

Lecture Notes in Computer Science 7536
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany

Prasenjit Majumder, Mandar Mitra, Pushpak Bhattacharyya, L. Venkata Subramaniam, Danish Contractor, Paolo Rosso (Eds.)
Multilingual Information Access in South Asian Languages

Second International Workshop, FIRE 2010, Gandhinagar, India, February 19-21, 2010
and Third International Workshop, FIRE 2011, Bombay, India, December 2-4, 2011
Revised Selected Papers

Volume Editors

Prasenjit Majumder
Dhirubhai Ambani Institute of Information and Communication Technology, Gujarat, India
E-mail: [email protected]

Mandar Mitra
Indian Statistical Institute, Kolkata, India
E-mail: [email protected]

Pushpak Bhattacharyya
Indian Institute of Technology, Bombay, India
E-mail: [email protected]

L. Venkata Subramaniam
Danish Contractor
IBM Research, New Delhi, India
E-mail: {lvsubram,dcontrac}@in.ibm.com

Paolo Rosso
NLE Lab - ELiRF, Universitat Politècnica de València, Spain
E-mail: [email protected]

ISSN 0302-9743
e-ISSN 1611-3349
ISBN 978-3-642-40086-5
e-ISBN 978-3-642-40087-2
DOI 10.1007/978-3-642-40087-2
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2013944224
CR Subject Classification (1998): H.3, I.2.7, I.7
LNCS Sublibrary: SL 3 – Information Systems and Application, incl. Internet/Web and HCI

© Springer-Verlag Berlin Heidelberg 2013

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use
may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

The Forum for Information Retrieval Evaluation (FIRE, http://www.isical.ac.in/fire) aims to provide a common platform for evaluating information access technologies with a focus on South Asian languages, by creating Cranfield-style test collections in the same spirit as TREC, CLEF, NTCIR, etc. The first evaluation exercise conducted by FIRE was held in 2008. This volume brings together revised and expanded versions of 29 papers that were presented at the second and third FIRE workshops held in 2010 and 2011.

For FIRE 2010 (held during February 19–21, 2010), four tasks were planned, including two pilot tracks. Eventually, however, submissions were received for only the ad-hoc monolingual and cross-lingual document retrieval task.

A total of seven tasks were planned for FIRE 2011 (held during December 2–4, 2011). Two tasks were dropped later. This volume includes papers from the following tasks:

1. Ad-hoc. Its objective was to evaluate the effectiveness of retrieval systems in retrieving accurate and complete ranked lists of documents in response to 50 one-time information needs.
2. CLITR (Cross-Language Indian Text Reuse). This task dealt with the identification of highly similar journalistic articles and news stories in a cross-language setting.
3. SMS-based FAQ retrieval. The goal of this task was to find a question from a collection of FAQs (frequently asked questions) that best answers/matches a query received via SMS.
4. RISOT (Retrieval from Indic script OCR'd text). This task looked at retrieval from a (noisy) document collection created using OCR.
5. Personalized IR. The primary objective of the task was to retrieve more relevant information for a particular user by making use of the logged actions of other users who had entered similar queries to the system.

FIRE is coordinated by the Information Retrieval Society of India (www.irsi.res.in) and supported by the Department of Information Technology, Government of India, and has also received funding from Google, HP, IBM Research, Microsoft Research, Society for Natural Language Technology Research, and Yahoo! India Research and Development. We should like to thank the members of the FIRE Steering Committee for their advice and encouragement. The invaluable assistance provided by Sauparna Palchowdhury and Rashmi Sankepally in preparing this volume is gratefully acknowledged. Finally, we are thankful to all our participants, and particularly to the contributors of this volume. Our apologies to them for the inordinate delay in publishing this collection of papers.

April 2013

Prasenjit Majumder
Mandar Mitra
Pushpak Bhattacharyya
L. Venkata Subramaniam
Danish Contractor
Paolo Rosso

Table of Contents

FIRE 2011

Overview of FIRE 2011
    Sauparna Palchowdhury, Prasenjit Majumder, Dipasree Pal, Ayan Bandyopadhyay, and Mandar Mitra

Adhoc Track

Query Expansion Based on Equi-Width and Equi-Frequency Partition
    Rekha Vaidyanathan, Sujoy Das, and Namita Srivastava

Ad Hoc Retrieval with Marathi Language
    Mitra Akasereh and Jacques Savoy

Frequent Case Generation in Ad Hoc Retrieval of Three Indian Languages – Bengali, Gujarati and Marathi
    Jiaul H. Paik, Kimmo Kettunen, Dipasree Pal, and Kalervo Järvelin

ISM@FIRE-2011 Bengali Monolingual Task: A Frequency-Based Stemmer
    Raktim Banerjee and Sukomal Pal

CLiTR - Cross Lingual Text Reuse Track

PAN@FIRE: Overview of the Cross-Language !ndian Text Re-Use Detection Competition
    Alberto Barrón-Cedeño, Paolo Rosso, Sobha Lalitha Devi, Paul Clough, and Mark Stevenson

Cross Lingual Text Reuse Detection Based on Keyphrase Extraction and Similarity Measures
    Rambhoopal Kothwal and Vasudeva Varma

Mapping Hindi-English Text Re-use Document Pairs
    Parth Gupta and Khushboo Singhal

SMS Based FAQ Retrieval Track

Text Retrieval Using SMS Queries: Datasets and Overview of FIRE 2011 Track on SMS-Based FAQ Retrieval
    Danish Contractor, L. Venkata Subramaniam, Deepak P., and Ankush Mittal

SMS Based FAQ Retrieval Using Latent Semantic Indexing
    Arijit De

Data-Driven Methods for SMS-Based FAQ Retrieval
    Sanmitra Bhattacharya, Hung Tran, and Padmini Srinivasan

Language Modeling Approach to Retrieval for SMS and FAQ Matching
    Aditya Mogadala, Rambhoopal Kothwal, and Vasudeva Varma

SMS Based FAQ Retrieval
    Nishit Shivhre

Improving Accuracy of SMS Based FAQ Retrieval System
    Anwar D. Shaikh, Mukul Jain, Mukul Rawat, Rajiv Ratn Shah, and Manoj Kumar

Mapping SMSes to Plain Text FAQs
    Arpit Gupta

SMS Normalization for FAQ Retrieval
    Khushboo Singhal, Gaurav Arora, Smita Kumariv, and Prasenjit Majumder

Two Models for the SMS-Based FAQ Retrieval Task of FIRE 2011
    Darnes Vilariño, David Pinto, Saul León, Esteban Castillo, and Mireya Tovar

SMS Normalisation, Retrieval and Out-of-Domain Detection Approaches for SMS-Based FAQ Retrieval
    Deirdre Hogan, Johannes Leveling, Hongyi Wang, Paul Ferguson, and Cathal Gurrin

RISOT - Retrieval from Indic Script OCR'd Text Track

Overview of the FIRE 2011 RISOT Task
    Utpal Garain, Jiaul H. Paik, Tamaltaru Pal, Prasenjit Majumder, David S. Doermann, and Douglas W. Oard

Maryland at FIRE 2011: Retrieval of OCR'd Bengali
    Utpal Garain, David S. Doermann, and Douglas W. Oard

Retrieval from OCR Text: RISOT Track
    Kripabandhu Ghosh and Swapan Kumar Parui

RISOT - Retrieval from Indic Script OCR'd Text Track

Overview of the Personalized and Collaborative Information Retrieval (PIR) Track at FIRE-2011
    Debasis Ganguly, Johannes Leveling, and Gareth J.F. Jones

Simple Transliteration for CLIR

Simple Transliteration for CLIR
    Sauparna Palchowdhury and Prasenjit Majumder

FIRE 2010

Overview of FIRE 2010
    Prasenjit Majumder, Dipasree Pal, Ayan Bandyopadhyay, and Mandar Mitra

UTA Stemming and Lemmatization Experiments in the FIRE Bengali Ad Hoc Task
    Aki Loponen, Jiaul H. Paik, and Kalervo Järvelin

Tamil English Cross Lingual Information Retrieval
    T. Pattabhi R.K. Rao and Sobha Lalitha Devi

Test Collections and Evaluation Metrics Based on Graded Relevance
    Kalervo Järvelin

Term Conflation and Blind Relevance Feedback for Information Retrieval on Indian Languages
    Johannes Leveling, Debasis Ganguly, and Gareth J.F. Jones

Improving Cross-Language Information Retrieval by Transliteration Mining and Generation
    K. Saravanan, Raghavendra Udupa, and A. Kumaran

Information Retrieval with Hindi, Bengali, and Marathi Languages: Evaluation and Analysis
    Jacques Savoy, Ljiljana Dolamic, and Mitra Akasereh

Author Index

Overview of FIRE 2011

Sauparna Palchowdhury¹, Prasenjit Majumder², Dipasree Pal¹, Ayan Bandyopadhyay¹, and Mandar Mitra¹

¹ Indian Statistical Institute, Kolkata, India
² DA-IICT, Gujarat, India
{sauparna.palc,prasenjit.majumder,bandyopadhyay.ayan,mandar.mitra}@gmail.com, dipasree [email protected]

Abstract. We provide an overview of FIRE 2011, the third evaluation exercise conducted by the Forum for Information Retrieval Evaluation (FIRE). Our main focus is on the Adhoc task. We describe how the FIRE 2011 test collections were constructed. We also provide a brief overview of the approaches adopted by the Adhoc task participants.

1 Introduction

The third FIRE workshop was held at IIT Bombay from 2 to 4 December 2011, bringing together a growing IR community in India. The large number of downloads of the FIRE test collections over the web, more than in any previous iteration of the FIRE campaign, is an indication of the community's interest.
The following seven tracks were offered this time:

◦ Adhoc monolingual / cross-lingual retrieval
  · documents in Bengali, Gujarati, Hindi, Marathi, Tamil and English
  · queries in Bengali, Gujarati, Hindi, Marathi, Tamil, Telugu and English
◦ SMS-based FAQ Retrieval
◦ Cross-Language Indian Text Reuse (CL!TR)
◦ Personalised IR (PIR)
◦ Retrieval from Indic Script OCR'd Text (RISOT)
◦ WSD for IR
◦ Adhoc Retrieval from Mailing Lists and Forums (MLAF)

The last two tracks were eventually discontinued because of a lack of manpower/participation.

P. Majumder et al. (Eds.): FIRE 2010 and 2011, LNCS 7536, pp. 1–12, 2013. © Springer-Verlag Berlin Heidelberg 2013

FIRE is funded primarily by the TDIL (Technology Development in Indian Languages) group, housed within the Department of Information Technology, Government of India, with a mandate of creating, in phases, test collections for the 23 “official” languages used in India. Corpus creation in Indian languages requires some amount of familiarity with these languages. Multiple Indian institutes, representing different language verticals, constitute a consortium that