A Multifaceted Evaluation of Neural versus Phrase-Based Machine Translation for 9 Language Directions

Antonio Toral* (University of Groningen, The Netherlands, [email protected])
Víctor M. Sánchez-Cartagena (Prompsit Language Engineering, Av. Universitat s/n. Edifici Quorum III, E-03202 Elx, Spain, [email protected])

* Work partly done at his previous position in Dublin City University, Ireland.

Abstract

We aim to shed light on the strengths and weaknesses of the newly introduced neural machine translation paradigm. To that end, we conduct a multifaceted evaluation in which we compare outputs produced by state-of-the-art neural machine translation and phrase-based machine translation systems for 9 language directions across a number of dimensions. Specifically, we measure the similarity of the outputs, their fluency and amount of reordering, the effect of sentence length and performance across different error categories. We find out that translations produced by neural machine translation systems are considerably different, more fluent and more accurate in terms of word order compared to those produced by phrase-based systems. Neural machine translation systems are also more accurate at producing inflected forms, but they perform poorly when translating very long sentences.

1 Introduction

A new paradigm to statistical machine translation, neural MT (NMT), has emerged very recently and has already surpassed the performance of the mainstream approach in the field, phrase-based MT (PBMT), for a number of language pairs, e.g. (Sennrich et al., 2015; Luong et al., 2015; Costa-Jussà and Fonollosa, 2016; Chung et al., 2016).

In PBMT (Koehn, 2010) different models (translation, reordering, target language, etc.) are trained independently and combined in a log-linear scheme in which each model is assigned a different weight by a tuning algorithm. On the contrary, in NMT all the components are jointly trained to maximise translation quality. NMT systems have a strong generalisation power because they encode translation units as numeric vectors that represent concepts, whereas in PBMT translation units are encoded as strings. Moreover, NMT systems are able to model long-distance phenomena thanks to the use of recurrent neural networks, e.g. long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) or gated recurrent units (Chung et al., 2014).

The translations produced by NMT systems have been evaluated thus far mostly in terms of overall performance scores, be it by means of automatic or human evaluations. This has been the case of last year's news translation shared task at the First Conference on Machine Translation (WMT16).[1] In this translation task, outputs produced by participant MT systems, the vast majority of which fall under either the phrase-based or neural approaches, were evaluated (i) automatically with the BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) metrics, and (ii) manually by means of ranking translations (Federmann, 2012) and monolingual semantic similarity (Graham et al., 2016). In all these evaluations, the performance of each system is measured by means of an overall score, which, while giving an indication of the general performance of a given system, does not provide any additional information.

In order to understand better the new NMT paradigm and in what respects it provides better (or worse) translation quality than state-of-the-art PBMT, Bentivogli et al. (2016) conducted a detailed analysis for the English-to-German language direction.

[1] http://www.statmt.org/wmt16/translation-task.html
In a nutshell, they found out that NMT (i) decreases post-editing effort, (ii) degrades faster than PBMT with sentence length and (iii) results in a notable improvement regarding reordering.

In this paper we delve further in this direction by conducting a multilingual and multifaceted evaluation in order to find answers to the following research questions. Whether, in comparison to PBMT, NMT systems result in:

• considerably different output and higher degree of variability;
• more or less fluent output;
• more or less monotone translations;
• translations with better or worse word order;
• better or worse translations depending on sentence length;
• fewer or more errors for different error categories: inflectional, reordering and lexical.

Hereunder we specify the main differences and similarities between this work and that of Bentivogli et al. (2016):

• Language directions. They considered 1 while our study comprises 9.
• Content. They dealt with transcribed speeches while we work with news stories. Previous research has shown that these two types of content pose different challenges for MT (Ruiz and Federico, 2014).
• Size of evaluation data. Their test set had 600 sentences while our test sets span from 1999 to 3000 sentences depending on the language direction.
• Reference type. Their references were both independent from the MT output and post-edited, while we have access only to single independent references.
• Analyses. While some analyses overlap, some are novel in our experiments, namely: output similarity, fluency and degree of reordering performed.

Our analyses are conducted on the best PBMT and NMT systems submitted to the WMT16 translation task for each language direction. This (i) guarantees the reproducibility of our results, as all the MT outputs are publicly available, (ii) ensures that the systems evaluated are state-of-the-art, as they are the result of the latest developments at top MT research groups worldwide, and (iii) allows the conclusions that will be drawn to be rather general, as 6 languages from 4 different families (Germanic, Slavic, Romance and Finno-Ugric) are covered in the experiments.

The rest of the paper is organised as follows. Section 2 describes the experimental setup. Subsequent sections cover the experiments carried out in which we measured different aspects of NMT, namely: output similarity (Section 3), fluency (Section 4), degree of reordering and quality of word order (Section 5), sentence length (Section 6), and amount of errors for different error categories (Section 7). Finally, Section 8 holds the conclusions and proposals for future work.

2 Experimental Setup

The experiments are run on the best[2] PBMT[3] and NMT constrained systems submitted to the news translation task of WMT16. Out of the 12 language directions at the translation task, we conduct experiments on 9.[4] These are the language pairs between English (EN) and Czech (CS), German (DE), Finnish (FI), Romanian (RO) and Russian (RU) in both directions (except for Finnish, where only the EN→FI direction is covered, as no NMT system was submitted for the opposite direction, FI→EN). Finally, there was an additional language at the shared task, Turkish, that is not considered here, as either none of the systems submitted was neural (Turkish→EN), or there was one such system but its performance was extremely low (EN→Turkish) and hence most probably not representative of the state of the art in NMT.

[2] According to the human evaluation (Bojar et al., 2016, Sec. 3.4). When there are no statistically significant differences between two or more NMT or PBMT systems (i.e. they belong to the same equivalence class), we pick the one with the highest BLEU score. If two NMT or PBMT systems were the best according to BLEU (draw), we pick the one with the best TER score.
[3] Many of the PBMT systems contain neural features, mainly in the form of language models. If the best PBMT submission contains any neural features, we use it as the PBMT system in our analyses as long as none of these features is a fully-fledged NMT system. This was the case of the best submission in terms of BLEU for RU→EN (Junczys-Dowmunt et al., 2016).
[4] Some experiments are run on a subset of these languages due to the lack of required tools for some of the languages involved.
Table 1 shows the main characteristics of the best PBMT and NMT systems submitted to the WMT16 news translation task. It should be noted that all the NMT systems listed in the table fall under the encoder-decoder architecture with attention (Bahdanau et al., 2015) and operate on subword units. Word segmentation is carried out with the help of a lexicon in the EN→FI direction (Sánchez-Cartagena and Toral, 2016) and in an unsupervised way in the remaining directions (Sennrich et al., 2016).

Table 1: Details of the best systems pertaining to the PBMT and NMT paradigms submitted to the WMT16 news translation task for each language direction.

EN→CS
  PBMT: Phrase-based, word clusters (Ding et al., 2016)
  NMT:  Unsupervised word segmentation and backtranslated monolingual corpora (Sennrich et al., 2016)
EN→DE
  PBMT: Hierarchical string-to-tree, neural and dependency language models (Williams et al., 2016)
  NMT:  Same as for EN→CS
EN→FI
  PBMT: Phrase-based, rule-based and unsupervised word segmentation, operation sequence model (Durrani et al., 2011), bilingual neural language model (Devlin et al., 2014), re-ranked with a recurrent neural language model (Sánchez-Cartagena and Toral, 2016)
  NMT:  Rule-based word segmentation, backtranslated monolingual corpora (Sánchez-Cartagena and Toral, 2016)
EN→RO
  PBMT: Phrase-based, operation sequence model, monolingual and bilingual neural language models (Williams et al., 2016)
  NMT:  Same as for EN→CS
EN→RU
  PBMT: Phrase-based, word clusters, bilingual neural language model (Ding et al., 2016)
  NMT:  Same as for EN→CS
CS→EN
  PBMT: Same as for EN→CS
  NMT:  Same as for EN→CS
DE→EN
  PBMT: Phrase-based, pre-reordering, compound splitting (Williams et al., 2016)
  NMT:  Same as for EN→CS, plus reranking with a right-to-left model
RO→EN
  PBMT: Phrase-based, operation sequence model, monolingual neural language model (Williams et al., 2016)
  NMT:  Same as for EN→CS
RU→EN
  PBMT: Phrase-based, lemmas in word alignment, sparse features, bilingual neural language model and transliteration (Lo et al., 2016)
  NMT:  Same as for EN→CS

2.1 Overall Evaluation

First, and in order to contextualise our analyses below, we report the BLEU scores achieved by the best NMT and PBMT systems for each language direction at WMT16's news translation task in Table 2.[5] The best NMT system clearly outperforms the best PBMT system for all language directions out of English (relative improvements range from 5.5% for EN→RO to 17.6% for EN→FI) and the human evaluation (Bojar et al., 2016, Sec. 3.4) confirms these results. In the opposite direction, the human evaluation shows that the best NMT system outperforms the best PBMT system for all language directions except when the source language is Russian. This slightly differs from the automatic evaluation, according to which NMT outperforms PBMT for translations from Czech (3.3% relative improvement) and German (9.9%) but underperforms PBMT for translations from Romanian (-3.7%) and Russian (-3.8%).

Table 2: BLEU scores of the best NMT and PBMT systems for each language pair at WMT16's news translation task. If the difference between them is statistically significant according to paired bootstrap resampling (Koehn, 2004) with p = 0.05 and 1000 iterations, the highest score is shown in bold.

          System   CS     DE     FI     RO     RU
From EN   PBMT     23.7   30.6   15.3   27.4   24.3
          NMT      25.9   34.2   18.0   28.9   26.0
Into EN   PBMT     30.4   35.2   23.7   35.4   29.3
          NMT      31.4   38.7   -      34.1   28.2

[5] We report the official results from http://matrix.statmt.org/matrix for the test set newstest2016 using normalised BLEU (column BLEU-cased-norm).
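The significance test referenced in the caption of Table 2 (and later in Table 5) is straightforward to reproduce. Below is a minimal sketch in Python, not the authors' actual tooling: `metric` is a placeholder for any corpus-level scorer (BLEU, chrF1, ...), and all names are ours.

```python
import random

def paired_bootstrap(hyps_a, hyps_b, refs, metric, n_iter=1000, seed=7):
    """Paired bootstrap resampling (Koehn, 2004), sketched.

    `metric(hyps, refs) -> float` is any corpus-level evaluation
    function. Returns the fraction of resampled test sets on which
    system A scores higher than system B.
    """
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_iter):
        # Resample sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_a = [hyps_a[i] for i in idx]
        sample_b = [hyps_b[i] for i in idx]
        sample_r = [refs[i] for i in idx]
        if metric(sample_a, sample_r) > metric(sample_b, sample_r):
            wins_a += 1
    return wins_a / n_iter
```

With 1000 iterations and p = 0.05, system A would be declared significantly better than system B if it wins on at least 95% of the resampled test sets.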
3 Output Similarity

The aim of this analysis is to assess to which extent translations produced by NMT systems are different from those produced by PBMT systems. We measure this by taking the outputs of the top n[6] NMT and PBMT systems submitted to each language direction[7] and checking their pairwise overlap in terms of the chrF1 (Popović, 2015) automatic evaluation metric.[8]

We would consider NMT outputs considerably different (with respect to PBMT) if they resemble each other (i.e. high pairwise overlap between NMT outputs) more than they do PBMT systems (i.e. low overlap between an output by NMT and another by PBMT). This analysis is carried out only for language directions out of English, as for all the language directions into English there was, at most, 1 NMT submission.

[6] The number of systems considered is different for each language direction as it depends on the number of systems submitted. Namely, we have considered 2 NMT and 2 PBMT into Czech, 3 NMT and 5 PBMT into German, 2 NMT and 4 PBMT into Finnish, 2 NMT and 4 PBMT into Romanian and 2 NMT and 3 PBMT into Russian.
[7] In order to make sure that all systems considered are truly different (rather than different runs of the same system), we consider only 1 system per paradigm (NMT and PBMT) submitted by each team for each language direction.
[8] Throughout our analyses we use this metric as it has been shown to correlate better with human judgements than the de facto standard automatic metric, BLEU, when the target language is a morphologically rich language such as Finnish, while its correlation is on par with BLEU for languages with simpler morphology such as English (Popović, 2015).

Table 3: Average of the overlaps between pairs of outputs produced by the top n NMT and PBMT systems for each language direction from English to the target language (TL). The higher the value, the larger the overlap.

TL   2 NMT   2 PBMT   NMT & PBMT
CS   68.66   77.63    64.34
DE   72.10   72.97    66.80
FI   56.03   57.42    55.55
RO   69.47   75.96    68.77
RU   35.52   43.35    29.87

Table 3 shows the results. We can observe the same trends for all the language directions, namely: (i) the highest overlaps are between pairs of PBMT systems; (ii) next, we have overlaps between NMT systems; (iii) finally, overlaps between PBMT and NMT are the lowest.

We can conclude then that NMT systems lead to considerably different outputs compared to PBMT. The fact that there is higher inter-system variability in NMT than in PBMT (i.e. overlaps between pairs of NMT systems are lower than between pairs of PBMT systems) may surprise the reader, considering the fact that all NMT systems belong to the same paradigm (encoder-decoder with attention) while for some language directions (EN→DE, EN→FI and EN→RO) there are PBMT systems belonging to two different paradigms (pure phrase-based and hierarchical). However, the higher variability among NMT translations can be attributed, we believe, to the fact that NMT systems use numeric vectors that represent concepts instead of strings as translation units.
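To make the overlap computation concrete, here is a minimal chrF1 sketch. It follows the definition in Popović (2015): character n-gram precision and recall, uniformly averaged over n = 1..6, combined with beta = 1. Real implementations (the original chrF script, sacreBLEU's chrF) differ in details such as whitespace handling and smoothing, so treat it as illustrative; `avg_pairwise_overlap` and the other helper names are ours.

```python
from collections import Counter
from itertools import combinations

def char_ngrams(text, n):
    # chrF operates on character n-grams; whitespace is stripped here.
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf1(hyp, ref, max_n=6):
    """Character n-gram F1, uniformly averaged over n = 1..max_n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return 2 * p * r / (p + r) if p + r else 0.0  # F-score with beta = 1

def avg_pairwise_overlap(outputs):
    """Average chrF1 over all unordered pairs of system outputs, each
    output given as a single string (e.g. the whole translated test
    set). Scoring one system's output against another's, as if it were
    a reference, is what "overlap" means in Table 3."""
    pairs = list(combinations(outputs, 2))
    return sum(chrf1(a, b) for a, b in pairs) / len(pairs)
```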
4 Fluency

In this experiment we aim to find out whether the outputs produced by NMT systems are more or less fluent than those produced by PBMT systems. To that end, we take perplexity of the MT outputs on neural language models (LMs) as a proxy for fluency. The LMs are built using TheanoLM (Enarvi and Kurimo, 2016). They contain 100 units in the projection layer, 300 units in the LSTM layer, and 300 units in the tanh layer, following the setup described by Enarvi and Kurimo (2016, Sec. 3.2). The training algorithm is Adagrad (Duchi et al., 2011) and we used 1000 word classes obtained with mkcls from the training corpus. Vocabulary is limited to the most frequent 50000 tokens.

LMs are trained on a random sample of 4 million sentences selected from the News Crawl 2015 monolingual corpora, available for all the languages considered.[9]

[9] http://data.statmt.org/wmt16/translation-task/training-monolingual-news-crawl.tgz

Table 4: Perplexity scores for the outputs of the best NMT and PBMT systems on language models built on 4 million sentences randomly selected from the News Crawl 2015 corpora.

Language direction   PBMT     NMT      Rel. diff.
EN→CS                202.91   173.33   -14.58%
EN→DE                131.54   107.08   -18.60%
EN→FI                214.10   222.40   3.88%
EN→RO                124.66   116.33   -6.68%
EN→RU                158.18   127.83   -19.19%
CS→EN                110.08   102.36   -7.01%
DE→EN                122.26   104.72   -14.35%
RO→EN                106.08   102.18   -3.68%
RU→EN                123.86   106.75   -13.81%
Average              143.74   129.22   -10.45%

Table 4 shows the results. For all the language directions considered but one, perplexity is higher on the PBMT output compared to the NMT output. The only exception is translation into Finnish, in which perplexity on the PBMT output is slightly lower, probably because its fluency was improved by reranking it with a neural LM similar to the one we use in this experiment (Sánchez-Cartagena and Toral, 2016). The average relative difference, i.e. considering all language directions, is notable at -10.45%. Thus, our experiment shows that the outputs produced by NMT systems are, in general, more fluent than those produced by PBMT systems.

One may argue that the perplexity obtained for NMT outputs is lower than that for PBMT outputs because the LMs we used to measure perplexity follow the same model as the decoder of the NMT architecture (Bahdanau et al., 2015) and hence perplexity on a neural LM is not a valid proxy for fluency. However, the following facts support our strategy:

• The manual evaluation of fluency carried out at the WMT16 shared translation task (Bojar et al., 2016, Sec. 3.5) already confirmed that NMT systems consistently produce more fluent translations than PBMT systems. That manual evaluation only covered language directions into English. In this experiment, we extend that conclusion to language directions out of English.

• Neural LMs consistently outperform n-gram based LMs when assessing the fluency of real text (Kim et al., 2016; Enarvi and Kurimo, 2016). Thus, we have used the most accurate automatic tool available to measure fluency.
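For readers who want to reproduce the fluency proxy, the computation reduces to corpus-level perplexity under a language model. A minimal sketch, assuming a hypothetical `token_log_prob` callback supplied by whatever LM toolkit is at hand (the paper uses TheanoLM; the callback name is ours):

```python
import math

def corpus_perplexity(sentences, token_log_prob):
    """Corpus-level perplexity of MT outputs under a language model.

    `token_log_prob` stands in for the LM toolkit: given a token list,
    it must return one natural-log probability per token,
    log p(w_i | history).
    """
    total_logp = 0.0
    total_tokens = 0
    for sentence in sentences:
        logps = token_log_prob(sentence.split())
        total_logp += sum(logps)
        total_tokens += len(logps)
    # Lower perplexity = the LM finds the text more predictable,
    # which Section 4 uses as a proxy for fluency.
    return math.exp(-total_logp / total_tokens)
```

In practice, details such as vocabulary cut-offs, word classes and end-of-sentence tokens affect the absolute numbers, so only within-setup comparisons (as in Table 4) are meaningful.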
5 Reordering

In this section we measure the amount of reordering performed by PBMT and NMT systems. Our objective is to empirically determine whether: (i) the recurrent neural networks in NMT systems produce more changes in the word order of a sentence than a PBMT decoder; and whether (ii) these neural networks make the word order of the translations closer to that of the reference.

In order to measure the amount of reordering, we used the Kendall's tau distance between word alignments obtained from pairs of sentences (Birch, 2011, Sec. 5.3.2). As the distance needs to be computed from permutations,[10] we turned word alignments into permutations by means of the algorithm defined by Birch (2011, Sec. 5.2).

[10] A permutation between a source-language sentence and a target-language sentence is defined as the set of operations that need to be carried out over the words in the source-language sentence to reflect the order of the words in the target-language sentence (Birch, 2011, Sec. 5.2).

For each language direction, we computed word alignments between the source-language side of the test set and the target-language reference, the PBMT output and the NMT output by means of MGIZA++ (Gao and Vogel, 2008). As the test sets are rather small for word alignment (1999 to 3000 sentence pairs depending on the language pair), we append bigger parallel corpora to help ensure accurate word alignments and avoid data sparseness. For languages for which in-domain (news) parallel training data is available (German and Russian), we append that dataset (News Commentary). For the remaining languages (Finnish and Romanian) we use the whole Europarl corpus.

The amount of reordering performed by each system can be estimated as the distance between the word alignments produced by that system and a monotone word alignment. The similarity between the reorderings produced by each MT system and the reorderings in the reference translation can also be estimated as the distance between the corresponding word alignments. Table 5 shows the value of these distances for the language pairs included in our evaluation. The average over all the sentences in the test set of the distance proposed by Birch (2011) is depicted.

Table 5: Average Kendall's tau distance between the word alignments obtained after translating the test set with each MT system being evaluated and a monotone alignment (left); and average Kendall's tau distance between the word alignments obtained for each MT system's translation and the word alignments of the reference translation (right). Larger values represent more similar alignments. If the difference between the distances depicted in the two last columns is statistically significant according to paired bootstrap resampling (Koehn, 2004) with p = 0.05 and 1000 iterations, the largest distance is shown in bold.

Language    Monotone vs.               PBMT       NMT
direction   PBMT     NMT      Ref.     vs. Ref.   vs. Ref.
EN→CS       0.9273   0.9029   0.8295   0.8008     0.7964
EN→DE       0.8006   0.8229   0.7740   0.7560     0.7791
EN→FI       0.8912   0.9172   0.8367   0.7611     0.7819
EN→RO       0.8389   0.8378   0.7937   0.8312     0.8282
EN→RU       0.9342   0.9114   0.8364   0.8249     0.8240
CS→EN       0.7694   0.7589   0.7128   0.8000     0.8015
DE→EN       0.8036   0.7830   0.7409   0.7728     0.7943
RO→EN       0.8693   0.8427   0.8013   0.8187     0.8245
RU→EN       0.8170   0.7891   0.7247   0.7958     0.8069

It can be observed that the amount of reordering introduced by both types of MT systems is lower than the quantity of reordering in the reference translation. NMT generally produces more changes in the structure of the sentence than PBMT. This is the case for all language pairs but two (EN→DE and EN→FI). A possible explanation for these two exceptions is the following: in the former language pair, the PBMT system is hierarchical (Williams et al., 2016), while in the latter, the output was reranked with neural LMs.

Concerning the similarity between the reorderings produced by both MT systems and those in the reference translation, out of 9 directions, in 5 directions the NMT system performs a reordering closer to the reference, in 1 direction the PBMT system performs a reordering closer to the reference, and in the remaining 3 directions the differences are not statistically significant. That is, NMT generally produces reorderings which are closer to the reference translation. The exceptions to this trend, however, do not exactly correspond to the language pairs for which NMT underperformed PBMT.

In summary, NMT systems achieve, in general, a higher degree of reordering than pure phrase-based PBMT systems, and, overall, this reordering results in translations whose word order is closer to that of the reference translation.
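The quantities in Table 5 boil down to Kendall's tau over permutations. A minimal sketch, assuming the alignment-to-permutation conversion of Birch (2011, Sec. 5.2) has already been applied, so that `perm[i]` gives the target-side position of source word i; the pair-counting normalisation shown is the standard one and may differ in detail from Birch's exact formulation:

```python
from itertools import combinations

def kendall_tau_similarity(p, q=None):
    """Kendall's tau-based reordering score, sketched.

    `p` is the permutation derived from a word alignment. If `q` is
    given, the two permutations are compared with each other (as in
    the two rightmost columns of Table 5); otherwise `p` is compared
    against the monotone (identity) order. Returns a value in [0, 1];
    1 means the two orders are identical.
    """
    n = len(p)
    if n < 2:
        return 1.0
    if q is None:
        q = list(range(n))  # monotone order
    discordant = sum(
        1 for i, j in combinations(range(n), 2)
        if (p[i] - p[j]) * (q[i] - q[j]) < 0
    )
    return 1.0 - discordant / (n * (n - 1) / 2)

# A monotone permutation scores 1.0, a fully reversed one scores 0.0.
assert kendall_tau_similarity([0, 1, 2, 3]) == 1.0
assert kendall_tau_similarity([3, 2, 1, 0]) == 0.0
```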
6 Sentence Length

In this experiment we aim to find out whether the performances of NMT and PBMT are somehow sensitive to sentence length. In this regard, Bentivogli et al. (2016) found that, for transcribed speeches, NMT outperformed PBMT regardless of sentence length, while also noting that NMT's performance degraded faster than PBMT's as sentence length increases. It should be noted, however, that sentences in our content type, news, are considerably longer than sentences in transcribed speeches.[11] Hence, the current experiment will determine to what extent the findings on transcribed speeches stand also for texts made of longer sentences.

[11] According to Ruiz et al. (2014), sentences of transcribed speeches in English average to 19 words while sentences in news average to 24 words.

We split the source side of the test set in subsets of different lengths: 1 to 5 words (1-5), 6 to 10 and so forth up to 46 to 50, and finally longer than 50 words (>50). We then evaluate the outputs of the top PBMT and NMT submissions for those subsets with the chrF1 evaluation metric. Figure 1 presents the results for the language direction EN→FI. We can observe that NMT outperforms PBMT up to sentences of length 36-40, while for longer sentences PBMT outperforms NMT, with PBMT's performance remaining fairly stable while NMT's clearly decreases with sentence length. The results for the other language directions exhibit similar trends.

[Figure 1: NMT and PBMT chrF1 scores on subsets of different sentence length for the language direction EN→FI. y-axis: chrF1; x-axis: sentence length (range).]

Figure 2 shows the relative improvements of NMT over PBMT for each sentence length subset, averaged over all the 9 language directions considered. We observe a clear trend of this value decreasing with sentence length, and in fact we found a strong negative Pearson correlation (-0.79) between sentence length and the relative improvement (chrF1) of the best NMT over the best PBMT system.

[Figure 2: Relative improvement of the best NMT versus the best PBMT submission on chrF1 for different sentence lengths, averaged over all the language pairs considered. y-axis: relative improvement; x-axis: sentence length (range).]

The correlations for each language direction are shown in Table 6. We observe negative correlations for all the language directions except for DE→EN.

Table 6: Pearson correlations between sentence length and relative improvement (chrF1) of the best NMT over the best PBMT system for each language pair.

Direction   CS      DE      FI      RO      RU
From EN     -0.72   -0.26   -0.89   -0.01   -0.74
Into EN     -0.19   0.10    -       -0.36   -0.70
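The bucketing and correlation behind Figure 2 and Table 6 can be sketched as follows. This is a hedged reconstruction, reusing the `chrf1` sketch from Section 3; the bucket labels and helper names are ours.

```python
from collections import defaultdict

def length_bucket(n_words):
    """Map a source length to the buckets of Section 6: 1-5, ..., 46-50, >50."""
    if n_words > 50:
        return ">50"
    lo = 5 * ((n_words - 1) // 5) + 1
    return f"{lo}-{lo + 4}"

def scores_by_bucket(sources, hyps, refs):
    """chrF1 per length bucket, concatenating the segments in each bucket."""
    groups = defaultdict(lambda: ([], []))
    for src, hyp, ref in zip(sources, hyps, refs):
        h, r = groups[length_bucket(len(src.split()))]
        h.append(hyp)
        r.append(ref)
    return {b: chrf1(" ".join(h), " ".join(r)) for b, (h, r) in groups.items()}

def pearson(xs, ys):
    # Plain Pearson correlation coefficient, no SciPy dependency.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Per bucket: rel = (nmt_score - pbmt_score) / pbmt_score, then correlate
# the rel values with a numeric proxy for length (e.g. bucket midpoints).
```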
7 Error Categories

In this experiment we assess the performance of NMT versus PBMT systems on a set of error categories that correspond to five word-level error classes: inflection errors, reordering errors, missing words, extra words and incorrect lexical choices. These errors are detected automatically using the edit distance, word error rate (WER), and precision-based and recall-based position-independent error rates (hPER and rPER, respectively), as implemented in Hjerson (Popović, 2011). These error classes are then defined as follows:

• Inflection error (hINFer). A word for which its full form is marked as a hPER error while its base form matches the base form in the reference.

• Reordering error (hRer). A word that matches the reference but is marked as a WER error.

• Missing word (MISer). A word that occurs as a deletion error in WER, is also a rPER error and does not share the base form with any hypothesis error.

• Extra word (EXTer). A word that occurs as an insertion error in WER, is also a hPER error and does not share the base form with any reference error.

• Lexical choice error (hLEXer). A word that belongs neither to inflectional errors nor to missing or extra words.

Due to the fact that it is difficult to disambiguate between three of these categories, namely missing words, extra words and lexical choice errors (Popović and Ney, 2011), we group them in a unique category, which we refer to as lexical errors.

As input, the tool requires the full forms and base forms of the reference translations and MT outputs. For base forms, we use stems for practical reasons. These are produced with the Snowball stemmer from NLTK[12] for all languages except for Czech, which is not supported. For this language we used the aggressive variant in czech_stemmer.[13]

[12] http://www.nltk.org
[13] http://research.variancia.com/czech_stemmer/

Tables 7 and 8 show the results for language directions out of English and into English, respectively. For all language directions, we observe that NMT results in a notable decrease of both inflection (-14.6% on average for language directions out of EN and -7.91% for language directions into EN) and reordering (-12.82% from EN and -11.94% into EN) errors. The reduction of reordering errors is compatible with the results of the experiment presented in Section 5.[14]

[14] Although the results depicted both in this section and in Section 5 show that NMT performs better reordering in general, results for particular language pairs are not exactly the same in both sections. This is due to the fact that the quality of the reordering is computed in different ways. In this section, only those words that match the reference are considered when identifying reordering errors, while in Section 5 all the words in the sentence are taken into account. That said, in Section 5 the precision of the results depends on the quality of word alignment.

Table 7: Relative improvement of NMT versus PBMT for 3 error categories, for language directions out of English.

Error type   EN→CS     EN→DE     EN→FI     EN→RO     EN→RU     Average
Inflection   -16.18%   -13.26%   -11.65%   -15.13%   -16.79%   -14.60%
Reordering   -7.97%    -21.92%   -12.12%   -15.91%   -6.18%    -12.82%
Lexical      -0.44%    -3.48%    -1.09%    2.17%     -0.09%    -0.59%

Table 8: Relative improvement of NMT versus PBMT for 3 error categories, for language directions into English.

Error type   CS→EN    DE→EN     RO→EN    RU→EN     Average
Inflection   -4.38%   -2.47%    -3.65%   -21.12%   -7.91%
Reordering   -8.68%   -21.09%   -9.48%   -8.50%    -11.94%
Lexical      -1.92%   -4.91%    -3.90%   5.32%     -1.35%

Differences in performance for the remaining error category, lexical errors, are much smaller. In addition, the results for that category show a mixed picture in terms of which paradigm is better, which makes it difficult to derive conclusions that apply regardless of the language pair. Out of English, NMT results in slightly fewer errors (0.59% decrease on average) for all target languages except for RO (2.17% increase). Similarly, in the opposite language direction, NMT also results in slightly better performance overall (1.35% error reduction on average), and looking at individual language directions NMT outperforms PBMT for all of them except RU→EN.
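Hjerson itself implements the full WER/hPER/rPER machinery; the following is a deliberately simplified sketch of the inflection-versus-lexical distinction only, under the same stem-as-base-form convention used above. Reordering errors require the WER alignment and are omitted; the function name and logic are ours, not Hjerson's.

```python
def classify_errors(hyp_tokens, ref_tokens, stem):
    """Simplified, Hjerson-inspired error classification (sketch).

    On the hypothesis side only, approximates two classes:
    - inflection error: the surface form is absent from the reference,
      but its stem (used as base form, as in Section 7) is present;
    - lexical error: neither the form nor the stem is found, roughly
      covering missing/extra words and lexical choice, which the paper
      also groups together.
    `stem` is any stemming function.
    """
    ref_forms = set(ref_tokens)
    ref_stems = {stem(t) for t in ref_tokens}
    inflection, lexical = [], []
    for tok in hyp_tokens:
        if tok in ref_forms:
            continue  # matched words may still be reordering errors
        if stem(tok) in ref_stems:
            inflection.append(tok)
        else:
            lexical.append(tok)
    return inflection, lexical
```

For example, `stem` could be NLTK's Snowball stemmer, as in the paper: `from nltk.stem.snowball import SnowballStemmer; stem = SnowballStemmer("german").stem`.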
8 Conclusions

We have conducted a multifaceted evaluation to compare NMT versus PBMT outputs across a number of dimensions for 9 language directions. Our aim has been to shed more light on the strengths and weaknesses of the newly introduced NMT paradigm, and to check whether, and to what extent, these generalise to different families of source and target languages. Hereunder we summarise our findings:

• The outputs of NMT systems are considerably different compared to those of PBMT systems. In addition, there is higher inter-system variability in NMT, i.e. outputs by pairs of NMT systems are more different from each other than outputs by pairs of PBMT systems.

• NMT outputs are more fluent. We have corroborated the results of the manual evaluation of fluency at WMT16, which was conducted only for language directions into English, and we have shown evidence that this finding is true also for directions out of English.
• NMT systems introduce more changes in word order than pure PBMT systems, but fewer than hierarchical PBMT systems.[15] Nevertheless, for most language pairs, including those for which the best PBMT system is hierarchical, NMT's reorderings are closer to the reorderings in the reference than those of PBMT. This corroborates the findings on reordering by Bentivogli et al. (2016).

• We have found negative correlations between sentence length and the improvement brought by NMT over PBMT for the majority of the languages examined. While for most sentence lengths NMT outperforms PBMT, for very long sentences PBMT outperforms NMT. The latter was not the case in the work by Bentivogli et al. (2016). We believe the reason behind this different finding is twofold. Firstly, the average sentence length in their evaluation dataset was considerably shorter; and secondly, the NMT systems included in our evaluation operate on subword units, which increases the effective sentence length they have to deal with.

• NMT performs better in terms of inflection and reordering consistently across all language directions. We thus confirm that the findings of Bentivogli et al. (2016) regarding these two error types apply to a wide range of language directions. Differences regarding lexical errors are much smaller and inconsistent across language directions; for 7 of them NMT outperforms PBMT while for the remaining 2 the opposite is true.

[15] The latter finding applies only to one language direction, as only for that one the best PBMT system is hierarchical.

The results for some of the evaluations, especially error categories (Section 7), have been analysed only superficially, looking at what conclusions can be derived that apply regardless of language direction. Nevertheless, all our data is publicly released,[16] so we encourage interested parties to use this resource to conduct deeper language-specific studies.

[16] https://github.com/antot/neural_vs_phrasebased_smt_eacl17

Acknowledgments

The research leading to these results is supported by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran) and by Science Foundation Ireland through the CNGL Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie) at Dublin City University.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.

Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: a case study. arXiv preprint arXiv:1608.04631.

Alexandra Birch. 2011. Reordering metrics for statistical machine translation. Ph.D. thesis, The University of Edinburgh.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pages 131–198, Berlin, Germany, August.

Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147.

Marta R. Costa-Jussà and José A. R. Fonollosa. 2016. Character-based neural machine translation. arXiv preprint arXiv:1603.00810.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370–1380, Baltimore, Maryland, June.

Shuoyang Ding, Kevin Duh, Huda Khayrallah, Philipp Koehn, and Matt Post. 2016. The JHU Machine Translation Systems for WMT 2016. In Proceedings of the First Conference on Machine Translation, pages 272–280, Berlin, Germany, August.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. A Joint Sequence Translation Model with Integrated Reordering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1045–1054, Portland, Oregon, USA, June.

Seppo Enarvi and Mikko Kurimo. 2016. TheanoLM – An Extensible Toolkit for Neural Network Language Modeling. In Proceedings of the 17th Annual Conference of the International Speech Communication Association.

Christian Federmann. 2012. Appraise: An open-source toolkit for manual evaluation of machine translation output. The Prague Bulletin of Mathematical Linguistics, 98:25–35, September.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, SETQA-NLP '08, pages 49–57, Columbus, Ohio.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2016. Can machine translation systems be evaluated by the crowd alone. Natural Language Engineering, FirstView:1–28, 1.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Marcin Junczys-Dowmunt, Tomasz Dwojak, and Rico Sennrich. 2016. The AMU-UEDIN Submission to the WMT16 News Translation Task: Attention-based NMT Models as Feature Functions in Phrase-based SMT. In Proceedings of the First Conference on Machine Translation, pages 319–325, Berlin, Germany, August.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2741–2749, Phoenix, Arizona, USA, February.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 4, pages 388–395, Barcelona, Spain.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition.

Chi-kiu Lo, Colin Cherry, George Foster, Darlene Stewart, Rabib Islam, Anna Kazantseva, and Roland Kuhn. 2016. NRC Russian-English Machine Translation System for WMT 2016. In Proceedings of the First Conference on Machine Translation, pages 326–332, Berlin, Germany, August.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal, September.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July.

Maja Popović and Hermann Ney. 2011. Towards automatic error analysis of machine translation output. Computational Linguistics, 37(4):657–688, December.

Maja Popović. 2011. Hjerson: An open source tool for automatic error classification of machine translation output. The Prague Bulletin of Mathematical Linguistics, 96:59–67.
