DTIC ADA459769: Improved Cross-Language Retrieval using Backoff Translation PDF

Report Documentation Page (Standard Form 298, Rev. 8-98; Form Approved OMB No. 0704-0188)

1. Report date: 2001
3. Dates covered: 00-00-2001 to 00-00-2001
4. Title and subtitle: Improved Cross-Language Retrieval using Backoff Translation
7. Performing organization name(s) and address(es): Institute of Advanced Computer Studies (UMIACS), Department of Computer Science, University of Maryland, College Park, MD, 20742
12. Distribution/availability statement: Approved for public release; distribution unlimited
16. Security classification of: a. report: unclassified; b. abstract: unclassified; c. this page: unclassified
18. Number of pages: 3

Improved Cross-Language Retrieval using Backoff Translation

Philip Resnik (1,2), Douglas Oard (2,3), and Gina Levow (2)
(1) Department of Linguistics, (2) Institute for Advanced Computer Studies, (3) College of Information Studies
University of Maryland, College Park, MD 20742
{resnik,gina}@umiacs.umd.edu, [email protected]

ABSTRACT

The limited coverage of available translation lexicons can pose a serious challenge in some cross-language information retrieval applications. We present two techniques for combining evidence from dictionary-based and corpus-based translation lexicons, and show that backoff translation outperforms a technique based on merging lexicons.

1. INTRODUCTION

The effectiveness of a broad class of cross-language information retrieval (CLIR) techniques that are based on term-by-term translation depends on the coverage and accuracy of the available translation lexicon(s). Two types of translation lexicons are commonly used, one based on translation knowledge extracted from bilingual dictionaries [1] and the other based on translation knowledge extracted from bilingual corpora [8]. Dictionaries provide reliable evidence, but often lack translation preference information. Corpora, by contrast, are often a better source for translations of slang or newly coined terms, but the statistical analysis through which the translations are extracted sometimes produces erroneous results. In this paper we explore the question of how best to combine evidence from these two sources.

2. TRANSLATION LEXICONS

Our term-by-term translation technique (described below) requires a translation lexicon (henceforth tralex) in which each word $f$ is associated with a ranked set $\{e_1, e_2, \ldots, e_n\}$ of translations. We used two translation lexicons in our experiments.

2.1 WebDict Tralex

We downloaded a freely available, manually constructed English-French term list from the Web (footnote 1) and inverted it to French-English format. Since the WebDict translations appear in no particular order, we ranked the $e_i$ based on target language unigram statistics calculated over a large comparable corpus, the English portion of the Cross-Language Evaluation Forum (CLEF) collection, smoothed with statistics from the Brown corpus, a balanced corpus covering many genres of English. All single-word translations are ordered by decreasing unigram frequency, followed by all multi-word translations, and finally by any single-word entries not found in either corpus. This ordering minimizes the effect of infrequent words in non-standard usages or of misspellings that sometimes appear in bilingual term lists.

2.2 STRAND Tralex

Our second lexical resource is a translation lexicon obtained fully automatically via analysis of parallel French-English documents from the Web. A collection of 3,378 document pairs was obtained using STRAND, our technique for mining the Web for bilingual text [7]. These document pairs were aligned internally, using their HTML markup, to produce 63,094 aligned text "chunks" ranging in length from 2 to 30 words, about 8 words on average per chunk, for a total of about 500K words per side. Viterbi word-alignments for these paired chunks were obtained using the GIZA implementation of the IBM statistical translation models (footnote 2). An ordered set of translation pairs was obtained by treating each alignment link between words as a co-occurrence and scoring each word pair according to the likelihood ratio [2]. We then rank the translation alternatives in order of decreasing likelihood ratio score.

Footnote 1: http://www.freedict.com
Footnote 2: http://www.clsp.jhu.edu/ws99/projects/mt/

3. CLIR EXPERIMENTS

Ranked tralexes are particularly well suited to a simple ranked term-by-term translation approach. In our experiments, we use top-2 balanced document translation, in which we produce exactly two English terms for each French term. For terms with no known translation, the untranslated French term is generated twice (often appropriate for proper names). For French terms with one translation, that translation is generated twice. For French terms with two or more translations, we generate the first two translations in the tralex. Thus balanced translation has the effect of introducing a uniform weighting over the top $n$ translations for each term (here $n = 2$).

Benefits of the approach include simplicity and modularity: notice that a lexicon containing ranked translations is the only requirement, and in particular that there is no need for access to the internals of the IR system or to the document collection in order to perform computations on term frequencies or weights. In addition, the approach is an effective one: in previous experiments we have found that this balanced translation strategy significantly outperforms the usual (unbalanced) technique of including all known translations [3]. We have also investigated the relationship between balanced translation and Pirkola's structured query formulation method [6].

For our experiments we used the CLEF-2000 French document collection (approximately 21 million words from articles in Le Monde). Differences in use of diacritics, case, and punctuation can inhibit matching between tralex entries and document terms, so we normalize the tralex and the documents by converting characters to lowercase and removing all diacritic marks and punctuation. We then translate the documents using the process described above, index the translated documents with the Inquery information retrieval system, and perform retrieval using "long" queries formulated by grouping all terms in the title, narrative, and description fields of each English topic description using Inquery's #sum operator. We report mean average precision on the 34 topics for which relevant French documents exist, based on the relevance judgments provided by CLEF. We evaluated several strategies for using the WebDict and STRAND tralexes.

3.1 WebDict Tralex

Since a tralex may contain an eclectic mix of root forms and morphological variants, we use a four-stage backoff strategy to maximize coverage while limiting spurious translations:

1. Match the surface form of a document term to surface forms of French terms in the tralex.
2. Match the stem of a document term to surface forms of French terms in the tralex.
3. Match the surface form of a document term to stems of French terms in the tralex.
4. Match the stem of a document term to stems of French terms in the tralex.

We used unsupervised induction of stemming rules based on the French collection to build the stemmer [5]. The process terminates as soon as a match is found at any stage, and the known translations for that match are generated. The process may produce an inappropriate morphological variant for a correct English translation, so we used Inquery's English kstem stemmer at indexing time to minimize the effect of that factor on retrieval effectiveness.

3.2 STRAND Tralex

One limitation of a statistically derived tralex is that any term has some probability of aligning with any other term. Merely sorting translation alternatives in order of decreasing likelihood ratio will thus find some translation alternatives for every French term that appeared at least once in the set of parallel Web pages. In order to limit the introduction of spurious translations, we included only translation pairs with at least $N$ co-occurrences in the set used to build the tralex. We performed runs with $N = 1, 2, 3$, using the four-stage backoff strategy described above.

3.3 WebDict Merging using STRAND

When two sources of evidence with different characteristics are available, a combination-of-evidence strategy can sometimes outperform either source alone. Our initial experiments indicated that the WebDict tralex was the better of the two (see below), so we adopted a reranking strategy in which the WebDict tralex was refined according to a voting strategy to which both the original WebDict and STRAND tralex rankings contributed.

For each French term that appeared in both tralexes, we gave the top-ranked translation in each tralex a score of 100, the next a score of 99, and so on. We then summed the WebDict and STRAND scores for each translation, reranked the WebDict translations based on that sum, and then appended any STRAND-only translations for that French term. Thus, although both sources of evidence were weighted equally in the voting, STRAND-only evidence received lower precedence in the merged ranking. For French terms that appeared in only one tralex, we included those entries unchanged in the merged tralex. In this experiment run we used a threshold of $N = 1$, and applied the four-stage backoff strategy described above to the merged resource.

3.4 WebDict Backoff to STRAND

A possible weakness of our merging strategy is that inflected forms are more common in our STRAND tralex, while root forms are more common in our WebDict tralex. STRAND tralex entries that were copied unchanged into the merged tralex thus often matched in step 1 of the four-stage backoff strategy, preventing WebDict contributions from being used. With the WebDict tralex outperforming the STRAND tralex, this factor could hurt our results. As an alternative to merging, therefore, we also tried a simple backoff strategy in which we used the original WebDict tralex with the four-stage backoff strategy described above, to which we added a fifth stage in the event that fewer than two WebDict tralex matches were found:

5. Match the surface form of a document term to surface forms of French terms in the STRAND tralex.

We used a threshold of $N = 2$ for this experiment run.

4. RESULTS

Table 1 summarizes our results. Increasing thresholds seem to be helpful with the STRAND tralex, although the differences were not found to be statistically significant by a paired two-tailed $t$-test with $p < 0.05$. Merging the tralexes provided no improvement over using the WebDict tralex alone, but our backoff strategy produced a statistically significant 12% improvement in mean average precision (at $p < 0.01$) over the next best tralex (WebDict alone). As Figure 1 shows, the improvement is remarkably consistent, with only four of the 34 topics adversely affected and only one topic showing a substantial negative impact.

Condition          MAP
STRAND (N = 1)     0.2320
STRAND (N = 2)     0.2440
STRAND (N = 3)     0.2499
Merging            0.2892
WebDict            0.2919
Backoff            0.3282

Table 1: Mean Average Precision (MAP), averaged over 34 topics

[Figure 1: WebDict-to-tralex backoff vs. WebDict alone, by query]

Breaking down the backoff results by stage (Table 2), we find that the majority of query-to-document hits are obtained in the first stage, i.e. matches of the term's surface form in the document to a translation of the surface form in the dictionary. However, the backoff process improves by-token coverage of terms in documents by 8%, and gives a 3% relative improvement in retrieval results; it also contributed additional translations to the top-2 set in approximately 30% of the cases, leading to the statistically significant 12% relative improvement in mean average precision as compared to the baseline using WebDict alone with 4-stage backoff.

Stage (forms)          Lexicon matches
1 (surface-surface)    70.38%
2 (stem-surface)       3.18%
3 (surface-stem)       0.46%
4 (stem-stem)          0.98%
5 (STRAND)             8.34%
No match found         16.66%

Table 2: Term matches in 5-stage backoff

5. CONCLUSIONS

There are many ways of combining evidence from multiple translation lexicons. We use tralexes similar to those used by Nie et al. [4], but our work differs in our use of balanced translation and a backoff translation strategy (which produces a stronger baseline for our WebDict tralex), and in our comparison of merging and backoff translation strategies for combining resources. In future work we plan to explore other combinations of merging and backoff and other merging strategies, including post-retrieval merging of the ranked lists.

In addition, parallel corpora can be exploited for more than just the extraction of a non-contextualized translation lexicon. We are currently engaged in work on lexical selection methods that take advantage of contextual information, in the context of our research on machine translation, and we expect that CLIR results will be improved by contextually-informed scoring of term translations.

6. ACKNOWLEDGMENTS

This research was supported in part by Department of Defense contract MDA90496C1250 and TIDES DARPA/ITO Cooperative Agreement N660010028910.

7. REFERENCES

[1] L. Ballesteros and W. B. Croft. Resolving ambiguity for cross-language retrieval. In W. B. Croft, A. Moffat, and C. V. Rijsbergen, editors, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 64–71. ACM Press, Aug. 1998.
[2] T. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74, March 1993.
[3] G.-A. Levow and D. W. Oard. Translingual topic tracking with PRISE. In Working Notes of the Third Topic Detection and Tracking Workshop, Feb. 2000.
[4] J.-Y. Nie, M. Simard, P. Isabelle, and R. Durand. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web. In M. Hearst, F. Gey, and R. Tong, editors, Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 74–81, Aug. 1999.
[5] D. W. Oard, G.-A. Levow, and C. I. Cabezas. CLEF experiments at Maryland: Statistical stemming and backoff translation. In C. Peters, editor, Proceedings of the First Cross-Language Evaluation Forum. 2001. To appear. http://www.glue.umd.edu/~oard/research.html.
[6] D. W. Oard and J. Wang. NTCIR-2 ECIR experiments at Maryland: Comparing structured queries and balanced translation. In Second National Institute of Informatics (NII) Test Collection Information Retrieval (NTCIR) workshop. Forthcoming.
[7] P. Resnik. Mining the Web for bilingual text. In 37th Annual Meeting of the Association for Computational Linguistics (ACL'99), College Park, Maryland, June 1999.
[8] P. Sheridan and J. P. Ballerini. Experiments in multilingual information retrieval using the SPIDER system. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Aug. 1996.
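The normalization step of Section 3, the five-stage backoff of Sections 3.1 and 3.4, and top-2 balanced translation can be sketched together in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, `toy_stem` is a crude suffix stripper standing in for the statistically induced French stemmer [5], and filling the remaining top-2 slots from STRAND is one plausible reading of "fewer than two WebDict tralex matches".

```python
import string
import unicodedata


def normalize(term):
    """Lowercase, then remove diacritic marks and punctuation."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in no_marks if c not in string.punctuation)


def toy_stem(term):
    """Hypothetical stand-in for the induced French stemmer."""
    for suffix in ("es", "s", "e"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def four_stage_lookup(term, tralex, stem=toy_stem):
    """Return the ranked translations from the first stage that matches."""
    stemmed_keys = {}
    for key in tralex:  # map each stem to the first tralex entry having it
        stemmed_keys.setdefault(stem(key), key)
    stages = (
        term,                          # 1. surface form vs. surface forms
        stem(term),                    # 2. stem vs. surface forms
        stemmed_keys.get(term),        # 3. surface form vs. stems
        stemmed_keys.get(stem(term)),  # 4. stem vs. stems
    )
    for key in stages:
        if key in tralex:
            return tralex[key]  # stop at the first stage with a match
    return []


def translate_term(term, webdict, strand, n=2):
    """Top-n balanced translation with a fifth-stage backoff to STRAND."""
    term = normalize(term)
    translations = list(four_stage_lookup(term, webdict)[:n])
    if len(translations) < n:
        # Stage 5 (assumed reading): surface-form match in STRAND fills
        # whatever slots WebDict left empty.
        for candidate in strand.get(term, []):
            if candidate not in translations:
                translations.append(candidate)
            if len(translations) == n:
                break
    if not translations:
        translations = [term]  # no known translation: keep the French term
    while len(translations) < n:
        translations.append(translations[-1])  # balance to exactly n terms
    return translations
```

With toy lexicons `webdict = {"chat": ["cat"]}` and `strand = {"chats": ["cats", "cat"]}`, the term "chats" matches WebDict at stage 2 through its stem, and STRAND then supplies the second slot. A term found in neither lexicon is passed through untranslated and repeated, which matches the treatment of proper names described in Section 3.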

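The voting-based merge of Section 3.3 can be sketched in the same way. The function names and the tie-breaking rule (ties keep the original WebDict order) are assumptions made for illustration; the paper does not say how ties were resolved, and each ranked list is assumed to contain no duplicate translations.

```python
def merge_entry(webdict_ranked, strand_ranked, top_score=100):
    """Rerank WebDict translations by summed votes, then append STRAND-only ones."""

    def votes(ranked):
        # Top-ranked translation scores 100, the next 99, and so on.
        return {t: top_score - i for i, t in enumerate(ranked)}

    wd, st = votes(webdict_ranked), votes(strand_ranked)
    # Sort by descending summed score; Python's sort is stable, so
    # translations with equal sums keep their WebDict order (an assumption).
    reranked = sorted(webdict_ranked, key=lambda t: -(wd[t] + st.get(t, 0)))
    strand_only = [t for t in strand_ranked if t not in wd]
    return reranked + strand_only


def merge_tralexes(webdict, strand):
    """Merge two tralexes; terms found in only one are copied unchanged."""
    merged = {}
    for term in webdict.keys() | strand.keys():
        if term in webdict and term in strand:
            merged[term] = merge_entry(webdict[term], strand[term])
        else:
            source = webdict if term in webdict else strand
            merged[term] = list(source[term])
    return merged
```

For a term with WebDict ranking `["bench", "bank", "banking"]` and STRAND ranking `["bank", "banks", "bench"]`, the summed votes are 199 for "bank", 198 for "bench", and 98 for "banking", so the merged entry is `["bank", "bench", "banking", "banks"]`. The STRAND-only "banks" lands last even though its STRAND vote (99) exceeds the total for "banking", which is the lower precedence for STRAND-only evidence that Section 3.3 describes.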