Statistical Phrase-Based Translation

Philipp Koehn, Franz Josef Och, Daniel Marcu
Information Sciences Institute, Department of Computer Science, University of Southern California
[email protected], [email protected], [email protected]

Abstract

We propose a new phrase-based translation model and decoding algorithm that enables us to evaluate and compare several previously proposed phrase-based translation models. Within our framework, we carry out a large number of experiments to better understand and explain why phrase-based models outperform word-based models. Our empirical results, which hold for all examined language pairs, suggest that the highest levels of performance can be obtained through relatively simple means: heuristic learning of phrase translations from word-based alignments and lexical weighting of phrase translations. Surprisingly, learning phrases longer than three words and learning phrases from high-accuracy word-level alignment models does not have a strong impact on performance. Learning only syntactically motivated phrases degrades the performance of our systems.

1 Introduction

Various researchers have improved the quality of statistical machine translation systems with the use of phrase translation. Och et al. [1999]'s alignment template model can be reframed as a phrase translation system; Yamada and Knight [2001] use phrase translation in a syntax-based translation system; Marcu and Wong [2002] introduced a joint-probability model for phrase translation; and the CMU and IBM word-based statistical machine translation systems (presented at the DARPA IAO Machine Translation Workshop, July 22-23, 2002, Santa Monica, CA) are augmented with phrase translation capability.

Phrase translation clearly helps, as we will also show with the experiments in this paper. But what is the best method to extract phrase translation pairs? In order to investigate this question, we created a uniform evaluation framework that enables the comparison of different ways to build a phrase translation table.

Our experiments show that high levels of performance can be achieved with fairly simple means. In fact, for most of the steps necessary to build a phrase-based system, tools and resources are freely available to researchers in the field. More sophisticated approaches that make use of syntax do not lead to better performance. In fact, imposing syntactic restrictions on phrases, as used in recently proposed syntax-based translation models [Yamada and Knight, 2001], proves to be harmful. Our experiments also show that small phrases of up to three words are sufficient for obtaining high levels of accuracy.

Performance differs widely depending on the methods used to build the phrase translation table. We found extraction heuristics based on word alignments to be better than a more principled phrase-based alignment method. However, what constitutes the best heuristic differs from language pair to language pair and varies with the size of the training corpus.

2 Evaluation Framework

In order to compare different phrase extraction methods, we designed a uniform framework. We present a phrase translation model and decoder that works with any phrase translation table.

2.1 Model

The phrase translation model is based on the noisy channel model. We use Bayes rule to reformulate the translation probability for translating a foreign sentence $f$ into English $e$ as

$$\arg\max_e p(e|f) = \arg\max_e p(f|e)\, p(e)$$

This allows for a language model $p(e)$ and a separate translation model $p(f|e)$.
During decoding, the foreign input sentence $f$ is segmented into a sequence of $I$ phrases $\bar{f}_1^I$. We assume a uniform probability distribution over all possible segmentations.

Each foreign phrase $\bar{f}_i$ in $\bar{f}_1^I$ is translated into an English phrase $\bar{e}_i$. The English phrases may be reordered. Phrase translation is modeled by a probability distribution $\phi(\bar{f}_i|\bar{e}_i)$. Recall that due to Bayes rule, the translation direction is inverted from a modeling standpoint.

Reordering of the English output phrases is modeled by a relative distortion probability distribution $d(a_i - b_{i-1})$, where $a_i$ denotes the start position of the foreign phrase that was translated into the $i$th English phrase, and $b_{i-1}$ denotes the end position of the foreign phrase translated into the $(i-1)$th English phrase.

In all our experiments, the distortion probability distribution $d(\cdot)$ is trained using a joint probability model (see Section 3.3). Alternatively, we could also use a simpler distortion model $d(a_i - b_{i-1}) = \alpha^{|a_i - b_{i-1} - 1|}$ with an appropriate value for the parameter $\alpha$.

In order to calibrate the output length, we introduce a factor $\omega$ for each generated English word in addition to the trigram language model $p_{LM}$. This is a simple means to optimize performance. Usually, this factor is larger than 1, biasing toward longer output.

In summary, the best English output sentence $e_{best}$ given a foreign input sentence $f$ according to our model is

$$e_{best} = \arg\max_e p(e|f) = \arg\max_e p(f|e)\, p_{LM}(e)\, \omega^{length(e)}$$

where $p(f|e)$ is decomposed into

$$p(\bar{f}_1^I|\bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i|\bar{e}_i)\, d(a_i - b_{i-1})$$

For all our experiments we use the same training data, trigram language model [Seymore and Rosenfeld, 1997], and a specialized decoder.
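To make the decomposition concrete, here is a minimal sketch that scores one candidate segmentation and phrase translation under this model, using the simpler $\alpha$-distortion variant. It is an illustration under stated assumptions, not the authors' decoder: `phrase_table`, `alpha`, and `omega` are hypothetical inputs, and the language model score is taken as given.

```python
import math

def score_hypothesis(phrase_pairs, lm_logprob, phrase_table, alpha, omega):
    """Log score of one segmentation/translation candidate.

    phrase_pairs: list of (a_i, b_i, foreign_phrase, english_phrase) in
                  English output order; a_i/b_i are 1-based start/end
                  positions of the foreign phrase covered.
    lm_logprob:   log p_LM(e) of the complete English output.
    phrase_table: dict mapping (foreign_phrase, english_phrase) to phi(f|e).
    alpha:        base of the simple distortion model, 0 < alpha < 1.
    omega:        word cost factor; omega > 1 biases toward longer output.
    """
    log_score = lm_logprob
    prev_end = 0  # b_{i-1}, the end of the previously translated foreign phrase
    for start, end, f_phrase, e_phrase in phrase_pairs:
        # phrase translation probability phi(f_i | e_i)
        log_score += math.log(phrase_table[(f_phrase, e_phrase)])
        # distortion d(a_i - b_{i-1}) = alpha ** abs(a_i - b_{i-1} - 1);
        # monotone translation (start == prev_end + 1) incurs no penalty
        log_score += abs(start - prev_end - 1) * math.log(alpha)
        # word cost omega per generated English word
        log_score += len(e_phrase.split()) * math.log(omega)
        prev_end = end
    return log_score
```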
2.2 Decoder

The phrase-based decoder we developed for the purpose of comparing different phrase-based translation models employs a beam search algorithm, similar to the one by Jelinek [1998]. The English output sentence is generated left to right in the form of partial translations (or hypotheses).

We start with an initial empty hypothesis. A new hypothesis is expanded from an existing hypothesis by the translation of a phrase as follows: a sequence of untranslated foreign words and a possible English phrase translation for them is selected. The English phrase is attached to the existing English output sequence. The foreign words are marked as translated, and the probability cost of the hypothesis is updated.

The cheapest (highest probability) final hypothesis with no untranslated foreign words is the output of the search.

The hypotheses are stored in stacks. The stack $s_m$ contains all hypotheses in which $m$ foreign words have been translated. We recombine search hypotheses as done by Och et al. [2001]. While this somewhat reduces the number of hypotheses stored in each stack, stack size is exponential with respect to input sentence length. This makes an exhaustive search impractical.

Thus, we prune out weak hypotheses based on the cost they have incurred so far and a future cost estimate. For each stack, we only keep a beam of the best $n$ hypotheses. Since the future cost estimate is not perfect, this leads to search errors. Our future cost estimate takes into account the estimated phrase translation cost, but not the expected distortion cost.

We compute this estimate as follows: for each possible phrase translation anywhere in the sentence (we call it a translation option), we multiply its phrase translation probability with the language model probability for the generated English phrase. As language model probability we use the unigram probability for the first word, the bigram probability for the second, and the trigram probability for all following words.

Given the costs for the translation options, we can compute the estimated future cost for any sequence of consecutive foreign words by dynamic programming. Note that this is only possible because we ignore distortion costs. Since there are only $n(n+1)/2$ such sequences for a foreign input sentence of length $n$, we can precompute these cost estimates beforehand and store them in a table. During translation, future costs for uncovered foreign words can be quickly computed by consulting this table. If a hypothesis has broken sequences of untranslated foreign words, we look up the cost for each sequence and take the product of their costs.

The beam size, i.e., the maximum number of hypotheses in each stack, is fixed to a certain number. The number of translation options is linear in the sentence length. Hence, the time complexity of the beam search is quadratic in sentence length and linear in the beam size.

Since the beam size limits the search space and therefore search quality, we have to find the proper trade-off between speed (low beam size) and performance (high beam size). For our experiments, a beam size of only 100 proved to be sufficient. With larger beam sizes, only a few sentences are translated differently. With our decoder, translating 1755 sentences of length 5-15 words takes about 10 minutes on a 2 GHz Linux system. In other words, we achieved fast decoding while ensuring high quality.
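The future-cost table described above reduces to a short dynamic program over spans. The sketch below is our illustrative reading, with `option_costs` as an assumed precomputed map from each span to the best log cost (phrase translation probability times the unigram/bigram/trigram language model estimate) of any translation option covering exactly that span.

```python
def precompute_future_costs(n, option_costs):
    """Best future (outside) log cost for every span of the foreign sentence.

    n:            length of the foreign input sentence.
    option_costs: dict mapping a span (i, j) with 0 <= i < j <= n to the
                  best log cost of any single translation option covering
                  exactly words i..j-1; spans without options are absent.
    Distortion is ignored, which is what makes this dynamic program valid.
    """
    NEG_INF = float("-inf")
    cost = [[NEG_INF] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):            # spans by increasing length
        for i in range(n - length + 1):
            j = i + length
            best = option_costs.get((i, j), NEG_INF)
            for k in range(i + 1, j):         # best split into two sub-spans
                if cost[i][k] > NEG_INF and cost[k][j] > NEG_INF:
                    best = max(best, cost[i][k] + cost[k][j])
            cost[i][j] = best
    return cost
```

During search, the future cost of a hypothesis with several uncovered gaps is then the sum of the table entries for the gaps, i.e., the product of their probabilities in log space.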
3 Methods for Learning Phrase Translation

We carried out experiments to compare the performance of three different methods to build phrase translation probability tables. We also investigate a number of variations. We report most experimental results on a German-English translation task, since we had sufficient resources available for this language pair. We confirm the major points in experiments on additional language pairs.

As the first method, we learn phrase alignments from a corpus that has been word-aligned by a training toolkit for a word-based translation model: the Giza++ [Och and Ney, 2000] toolkit for the IBM models [Brown et al., 1993]. The extraction heuristic is similar to the one used in the alignment template work by Och et al. [1999].

A number of researchers have proposed to focus on the translation of phrases that have a linguistic motivation [Yamada and Knight, 2001; Imamura, 2002]. They only consider word sequences as phrases if they are constituents, i.e., subtrees in a syntax tree (such as a noun phrase). To identify these, we use a word-aligned corpus annotated with parse trees generated by statistical syntactic parsers [Collins, 1997; Schmidt and Schulte im Walde, 2000].

The third method for comparison is the joint phrase model proposed by Marcu and Wong [2002]. This model directly learns a phrase-level alignment of the parallel corpus.

3.1 Phrases from Word-Based Alignments

The Giza++ toolkit was developed to train word-based translation models from parallel corpora. As a by-product, it generates word alignments for this data. We improve this alignment with a number of heuristics, which are described in more detail in Section 4.5.

We collect all aligned phrase pairs that are consistent with the word alignment: the words in a legal phrase pair are only aligned to each other, and not to words outside [Och et al., 1999].

Given the collected phrase pairs, we estimate the phrase translation probability distribution by relative frequency:

$$\phi(\bar{f}|\bar{e}) = \frac{count(\bar{f}, \bar{e})}{\sum_{\bar{f}'} count(\bar{f}', \bar{e})}$$

No smoothing is performed.
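The consistency criterion and the relative-frequency estimate are compact enough to sketch directly. This is our reading of the extraction rule as stated, not the exact code used in the paper:

```python
from collections import defaultdict

def extract_phrase_pairs(f_words, e_words, alignment, max_len=4):
    """All phrase pairs consistent with a word alignment.

    alignment: set of (f_index, e_index) links, 0-based.
    A pair is legal if every link touching either side of it falls
    inside it on both sides, and it contains at least one link.
    """
    pairs = []
    n, m = len(f_words), len(e_words)
    for i1 in range(n):
        for i2 in range(i1 + 1, min(i1 + max_len, n) + 1):
            for j1 in range(m):
                for j2 in range(j1 + 1, min(j1 + max_len, m) + 1):
                    links = [(fi, ej) for fi, ej in alignment
                             if i1 <= fi < i2 or j1 <= ej < j2]
                    if links and all(i1 <= fi < i2 and j1 <= ej < j2
                                     for fi, ej in links):
                        pairs.append((" ".join(f_words[i1:i2]),
                                      " ".join(e_words[j1:j2])))
    return pairs

def estimate_phi(all_pairs):
    """phi(f|e) = count(f, e) / sum over f' of count(f', e); no smoothing."""
    count, e_total = defaultdict(int), defaultdict(int)
    for f_phrase, e_phrase in all_pairs:
        count[(f_phrase, e_phrase)] += 1
        e_total[e_phrase] += 1
    return {(f, e): c / e_total[e] for (f, e), c in count.items()}
```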
3.2 Syntactic Phrases

If we collect all phrase pairs that are consistent with word alignments, this includes many non-intuitive phrases. For instance, translations for phrases such as "house the" may be learned. Intuitively we would be inclined to believe that such phrases do not help: restricting possible phrases to syntactically motivated phrases could filter out such non-intuitive pairs.

Another motivation to evaluate the performance of a phrase translation model that contains only syntactic phrases comes from recent efforts to build syntactic translation models [Yamada and Knight, 2001; Wu, 1997]. In these models, reordering of words is restricted to reordering of constituents in well-formed syntactic parse trees. When augmenting such models with phrase translations, typically only translation of phrases that span entire syntactic subtrees is possible. It is important to know whether this is a helpful or harmful restriction.

Consistent with Imamura [2002], we define a syntactic phrase as a word sequence that is covered by a single subtree in a syntactic parse tree.

We collect syntactic phrase pairs as follows: we word-align a parallel corpus, as described in Section 3.1. We then parse both sides of the corpus with syntactic parsers [Collins, 1997; Schmidt and Schulte im Walde, 2000]. For all phrase pairs that are consistent with the word alignment, we additionally check whether both phrases are subtrees in the parse trees. Only these phrases are included in the model. Hence, the syntactically motivated phrase pairs learned are a subset of the phrase pairs learned without knowledge of syntax (Section 3.1).

As in Section 3.1, the phrase translation probability distribution is estimated by relative frequency.

3.3 Phrases from Phrase Alignments

Marcu and Wong [2002] proposed a translation model that assumes that lexical correspondences can be established not only at the word level, but at the phrase level as well. To learn such correspondences, they introduced a phrase-based joint probability model that simultaneously generates both the source and target sentences in a parallel corpus. Expectation Maximization learning in Marcu and Wong's framework yields both (i) a joint probability distribution $t(\bar{e}, \bar{f})$, which reflects the probability that the phrases $\bar{e}$ and $\bar{f}$ are translation equivalents, and (ii) a joint distribution $d(i, j)$, which reflects the probability that a phrase at position $i$ is translated into a phrase at position $j$.

To use this model in the context of our framework, we simply marginalize the joint probabilities estimated by Marcu and Wong [2002] to conditional probabilities. Note that this approach is consistent with the approach taken by Marcu and Wong themselves, who use conditional models during decoding.
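Marginalizing the joint table to the conditional distribution our framework needs is a single normalization pass; a small sketch, with `t_joint` as an assumed dictionary of joint probabilities $t(\bar{e}, \bar{f})$:

```python
from collections import defaultdict

def conditionalize(t_joint):
    """phi(f|e) = t(e, f) / sum over f' of t(e, f')."""
    e_marginal = defaultdict(float)
    for (e_phrase, f_phrase), p in t_joint.items():
        e_marginal[e_phrase] += p
    return {(f_phrase, e_phrase): p / e_marginal[e_phrase]
            for (e_phrase, f_phrase), p in t_joint.items()}
```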
4 Experiments

We used the freely available Europarl corpus (http://www.isi.edu/~koehn/europarl/) to carry out experiments. This corpus contains over 20 million words in each of the eleven official languages of the European Union, covering the proceedings of the European Parliament 1996-2001. 1755 sentences of length 5-15 were reserved for testing.

In all experiments in Sections 4.1-4.6 we translate from German to English. We measure performance using the BLEU score [Papineni et al., 2001], which estimates the accuracy of translation output with respect to a reference translation.

4.1 Comparison of Core Methods

First, we compared the performance of the three methods for phrase extraction head-on, using the same decoder (Section 2) and the same trigram language model. Figure 1 displays the results.

[Figure 1: Comparison of the core methods (BLEU vs. training corpus size): all phrase pairs consistent with a word alignment (AP), phrase pairs from the joint model (Joint), IBM Model 4 (M4), and only syntactic phrases (Syn)]

In direct comparison, learning all phrases consistent with the word alignment (AP) is superior to the joint model (Joint), although not by much. The restriction to only syntactic phrases (Syn) is harmful. We also included in the figure the performance of an IBM Model 4 word-based translation system (M4), which uses a greedy decoder [Germann et al., 2001]. Its performance is worse than both AP and Joint. These results are consistent over training corpus sizes from 10,000 sentence pairs to 320,000 sentence pairs. All systems improve with more data.

Table 1 lists the number of distinct phrase translation pairs learned by each method on each corpus. The number grows almost linearly with the training corpus size, due to the large number of singletons. The syntactic restriction eliminates over 80% of all phrase pairs.

          Training corpus size
Method    10k     20k     40k     80k     160k     320k
AP        84k     176k    370k    736k    1536k    3152k
Joint     125k    220k    400k    707k    1254k    2214k
Syn       19k     24k     67k     105k    217k     373k

Table 1: Size of the phrase translation table in terms of distinct phrase pairs (maximum phrase length 4)

Note that the millions of phrase pairs learned fit easily into the working memory of modern computers. Even the largest models take up only a few hundred megabytes of RAM.

4.2 Weighting Syntactic Phrases

The restriction to syntactic phrases is harmful because too many phrases are eliminated. But we might still suspect that the remaining phrases are more reliable. One way to check this is to use all phrase pairs and give more weight to syntactic phrase translations. This can be done either during data collection (say, by counting syntactic phrase pairs twice) or during translation (each time the decoder uses a syntactic phrase pair, it credits a bonus factor to the hypothesis score).

We found that neither of these methods results in significant improvement of translation performance. Even penalizing the use of syntactic phrase pairs does not harm performance significantly. These results suggest that requiring phrases to be syntactically motivated does not lead to better phrase pairs, but only to fewer phrase pairs, with the loss of a good amount of valuable knowledge.

One illustration of this is the common German "es gibt", which literally translates as "it gives" but really means "there is". "Es gibt" and "there is" are not syntactic constituents. Note that constructions such as "with regard to" and "note that" also have fairly complex syntactic representations, but often simple one-word translations. Allowing the model to learn phrase translations over such sentence fragments is important for achieving high performance.

4.3 Maximum Phrase Length

How long do phrases have to be to achieve high performance? Figure 2 displays results from experiments with different maximum phrase lengths. All phrases consistent with the word alignment (AP) are used.
[Figure 2: Different limits for the maximum phrase length (max2, max3, max4, max5, max7) show that length 3 is enough]

Surprisingly, limiting the length to a maximum of only three words per phrase already achieves top performance. Learning longer phrases does not yield much improvement and occasionally leads to worse results. Reducing the limit to only two, however, is clearly detrimental.

Allowing longer phrases increases the phrase translation table size (see Table 2). The increase is almost linear in the maximum length limit. Still, none of these model sizes cause memory problems.

Max.      Training corpus size
Length    10k     20k     40k     80k      160k     320k
2         37k     70k     135k    250k     474k     882k
3         63k     128k    261k    509k     1028k    1996k
4         84k     176k    370k    736k     1536k    3152k
5         101k    215k    459k    925k     1968k    4119k
7         130k    278k    605k    1217k    2657k    5663k

Table 2: Size of the phrase translation table with varying maximum phrase length limits

4.4 Lexical Weighting

One way to validate the quality of a phrase translation pair is to check how well its words translate to each other. For this, we need a lexical translation probability distribution $w(f|e)$. We estimate it by relative frequency from the same word alignments as the phrase model:

$$w(f|e) = \frac{count(f, e)}{\sum_{f'} count(f', e)}$$

A special English NULL token is added to each English sentence and aligned to each unaligned foreign word.

Given a phrase pair $\bar{f}, \bar{e}$ and a word alignment $a$ between the foreign word positions $i = 1, ..., n$ and the English word positions $j = 0, 1, ..., m$, we compute the lexical weight $p_w$ by

$$p_w(\bar{f}|\bar{e}, a) = \prod_{i=1}^{n} \frac{1}{|\{j \mid (i,j) \in a\}|} \sum_{(i,j) \in a} w(f_i|e_j)$$

See Figure 3 for an example.

           f1   f2   f3
    NULL   --   --   ##
    e1     ##   --   --
    e2     --   ##   --
    e3     --   ##   --

$$p_w(\bar{f}|\bar{e}, a) = w(f_1|e_1) \times \frac{1}{2}\big(w(f_2|e_2) + w(f_2|e_3)\big) \times w(f_3|\mathrm{NULL})$$

Figure 3: Lexical weight $p_w$ of a phrase pair $(\bar{f}, \bar{e})$ given an alignment $a$ and a lexical translation probability distribution $w$

If there are multiple alignments $a$ for a phrase pair $(\bar{f}, \bar{e})$, we use the one with the highest lexical weight:

$$p_w(\bar{f}|\bar{e}) = \max_a p_w(\bar{f}|\bar{e}, a)$$

We use the lexical weight $p_w$ during translation as an additional factor. This means that the model $p(f|e)$ is extended to

$$p(\bar{f}_1^I|\bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i|\bar{e}_i)\, d(a_i - b_{i-1})\, p_w(\bar{f}_i|\bar{e}_i, a)^{\lambda}$$

The parameter $\lambda$ defines the strength of the lexical weight $p_w$. Good values for this parameter are around 0.25.

Figure 4 shows the impact of lexical weighting on machine translation performance. In our experiments, we achieved improvements of up to 0.01 on the BLEU score scale. Again, all phrases consistent with the word alignment are used (Section 3.1).

[Figure 4: Lexical weighting (lex) improves performance over the unweighted model (no-lex)]

Note that phrase translation with a lexical weight is a special case of the alignment template model [Och et al., 1999] with one word class for each word. Our simplification has the advantage that the lexical weights can be factored into the phrase translation table beforehand, speeding up decoding. In contrast to the beam search decoder for the alignment template model, our decoder is able to search all possible phrase segmentations of the input sentence, instead of choosing one segmentation before decoding.
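The lexical weight itself is a direct transcription of the formula above; in this sketch `w` is a dictionary estimated as described, and unaligned foreign words are scored against the NULL token as in Figure 3.

```python
def lexical_weight(f_words, e_words, alignment, w):
    """p_w(f|e, a): for each foreign word, average w(f_i|e_j) over the
    English words it aligns to, then multiply the averages.

    alignment: set of (i, j) links into f_words/e_words, 0-based.
    w:         dict mapping (f_word, e_word) to w(f|e).
    """
    weight = 1.0
    for i, f_word in enumerate(f_words):
        links = [j for (fi, j) in alignment if fi == i]
        if links:   # average over all English words aligned to f_i
            weight *= sum(w.get((f_word, e_words[j]), 0.0)
                          for j in links) / len(links)
        else:       # unaligned foreign word: align to NULL
            weight *= w.get((f_word, "NULL"), 0.0)
    return weight
```

For the alignment of Figure 3, this computes $w(f_1|e_1) \cdot \frac{1}{2}(w(f_2|e_2) + w(f_2|e_3)) \cdot w(f_3|\mathrm{NULL})$.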
4.5 Phrase Extraction Heuristic

Recall from Section 3.1 that we learn phrase pairs from word alignments generated by Giza++. The IBM models that this toolkit implements allow at most one English word to be aligned with each foreign word. We remedy this problem with a heuristic approach.

First, we align a parallel corpus bidirectionally: foreign to English and English to foreign. This gives us two word alignments that we try to reconcile. If we intersect the two alignments, we get a high-precision alignment of high-confidence alignment points. If we take the union of the two alignments, we get a high-recall alignment with additional alignment points.

We explore the space between intersection and union with expansion heuristics that start with the intersection and add additional alignment points. The decision which points to add may depend on a number of criteria:

- In which alignment does the potential alignment point exist? Foreign-English or English-foreign?
- Does the potential point neighbor already established points?
- Does "neighboring" mean directly adjacent (block distance), or also diagonally adjacent?
- Is the English or the foreign word that the potential point connects unaligned so far? Are both unaligned?
- What is the lexical probability for the potential point?

The base heuristic [Och et al., 1999] proceeds as follows (a code sketch follows below): we start with the intersection of the two word alignments. We only add new alignment points that exist in the union of the two word alignments. We also always require that a new alignment point connects at least one previously unaligned word.

First, we expand to only directly adjacent alignment points. We check for potential points starting from the top right corner of the alignment matrix, checking for alignment points for the first English word, then continue with alignment points for the second English word, and so on. This is done iteratively until no alignment point can be added anymore. In a final step, we add non-adjacent alignment points, with otherwise the same requirements.
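The sketch below condenses this expansion idea: the intersection as seed, the union as candidate pool, neighboring points only, and the requirement that each new point connect at least one previously unaligned word. It folds the iteration order together and omits the final non-adjacent step, so it approximates rather than reproduces the published procedure; the `diagonal` flag corresponds to the diag variant.

```python
def symmetrize(f2e, e2f, diagonal=False):
    """Grow a symmetric word alignment between intersection and union.

    f2e, e2f:  sets of (f_pos, e_pos) links from the two Giza++ directions.
    diagonal:  if True, also allow diagonally adjacent neighbors
               (the diag variant); False means block distance only.
    """
    union = f2e | e2f
    alignment = set(f2e & e2f)              # high-precision seed
    neighbors = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if diagonal:
        neighbors += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:                            # expand until nothing can be added
        added = False
        for f, e in sorted(alignment):
            for df, de in neighbors:
                cand = (f + df, e + de)
                if cand in union and cand not in alignment:
                    f_unaligned = all(a[0] != cand[0] for a in alignment)
                    e_unaligned = all(a[1] != cand[1] for a in alignment)
                    if f_unaligned or e_unaligned:
                        alignment.add(cand)
                        added = True
    return alignment
```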
[Figure 5: Different heuristics to symmetrize word alignments from bidirectional Giza++ alignments (e2f, f2e, union, base, diag, diag-and)]

Figure 5 shows the performance of this heuristic (base) compared against the two mono-directional alignments (e2f, f2e) and their union (union). The figure also contains two modifications of the base heuristic: in the first (diag) we also permit diagonal neighborhood in the iterative expansion stage. In a variation of this (diag-and), we require in the final step that both words be unaligned.

The ranking of these different methods varies for different training corpus sizes. For instance, the alignment f2e starts out second to worst for the 10,000 sentence pair corpus, but is ultimately competitive with the best method at 320,000 sentence pairs. The base heuristic is initially the best, but then drops off.

The discrepancy between the best and the worst method is quite large, about 0.02 BLEU. For almost all training corpus sizes, the heuristic diag-and performs best, albeit not always significantly.

4.6 Simpler Underlying Word-Based Models

The initial word alignment for collecting phrase pairs is generated by symmetrizing IBM Model 4 alignments. Model 4 is computationally expensive, and only approximate solutions exist to estimate its parameters. The IBM Models 1-3 are faster and easier to implement. For IBM Models 1 and 2, word alignments can be computed efficiently without relying on approximations. For more information on these models, please refer to Brown et al. [1993]. Again, we use the heuristics from Section 4.5 to reconcile the mono-directional alignments obtained by training parameters with models of increasing complexity.

How much is performance affected if we base word alignments on these simpler methods? As Figure 6 indicates, not much. While Model 1 clearly results in worse performance, the difference is less striking for Models 2 and 3. Using different expansion heuristics during symmetrization of the word alignments has a bigger effect.

[Figure 6: Using simpler IBM models (m1, m2, m3, m4) for word alignment does not reduce performance much]

We can conclude from this that high-quality phrase alignments can be learned with fairly simple means. The simpler and faster Model 2 provides performance similar to that of the complex Model 4.
4.7 Other Language Pairs

We validated our findings for additional language pairs. Table 3 displays some of the results. For all language pairs the phrase model (based on word alignments, Section 3.1) outperforms IBM Model 4. Lexicalization (Lex) always helps as well.

Language Pair      Model 4   Phrase   Lex
English-German     0.2040    0.2361   0.2449
French-English     0.2787    0.3294   0.3389
English-French     0.2555    0.3145   0.3247
Finnish-English    0.2178    0.2742   0.2806
Swedish-English    0.3137    0.3459   0.3554
Chinese-English    0.1190    0.1395   0.1418

Table 3: Confirmation of our findings for additional language pairs (measured with BLEU)

5 Conclusion

We created a framework (translation model and decoder) that enables us to evaluate and compare various phrase translation methods. Our results show that phrase translation gives better performance than traditional word-based methods. We obtain the best results even with small phrases of up to three words. Lexical weighting of phrase translations helps.

Straightforward syntactic models that map constituents into constituents fail to account for important phrase alignments. As a consequence, straightforward syntax-based mappings do not lead to better translations than unmotivated phrase mappings. This is a challenge for syntactic translation models.

It matters how phrases are extracted. The results suggest that choosing the right alignment heuristic is more important than which model is used to create the initial word alignments.

References

Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., and Mercer, R. L. (1993). The mathematics of statistical machine translation. Computational Linguistics, 19(2):263-313.

Collins, M. (1997). Three generative, lexicalized models for statistical parsing. In Proceedings of ACL 35.

Germann, U., Jahr, M., Knight, K., Marcu, D., and Yamada, K. (2001). Fast decoding and optimal decoding for machine translation. In Proceedings of ACL 39.

Imamura, K. (2002). Application of translation knowledge acquired by hierarchical phrase alignment for pattern-based MT. In Proceedings of TMI.

Jelinek, F. (1998). Statistical Methods for Speech Recognition. The MIT Press.

Marcu, D. and Wong, W. (2002). A phrase-based, joint probability model for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP.

Och, F. J. and Ney, H. (2000). Improved statistical alignment models. In Proceedings of ACL 38.

Och, F. J., Tillmann, C., and Ney, H. (1999). Improved alignment models for statistical machine translation. In Proc. of the Joint Conf. of Empirical Methods in Natural Language Processing and Very Large Corpora, pages 20-28.

Och, F. J., Ueffing, N., and Ney, H. (2001). An efficient A* search algorithm for statistical machine translation. In Data-Driven MT Workshop.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2001). BLEU: a method for automatic evaluation of machine translation. Technical Report RC22176 (W0109-022), IBM Research Report.

Schmidt, H. and Schulte im Walde, S. (2000). Robust German noun chunking with a probabilistic context-free grammar. In Proceedings of COLING.

Seymore, K. and Rosenfeld, R. (1997). Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings of Eurospeech.

Wu, D. (1997). Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3).

Yamada, K. and Knight, K. (2001). A syntax-based statistical translation model. In Proceedings of ACL 39.
