Mutual Information and Diverse Decoding Improve Neural Machine Translation

Jiwei Li and Dan Jurafsky
Computer Science Department
Stanford University, Stanford, CA, 94305, USA
jiweil,[email protected]

Abstract

Sequence-to-sequence neural translation models learn semantic and syntactic relations between sentence pairs by optimizing the likelihood of the target given the source, i.e., p(y|x), an objective that ignores other potentially useful sources of information. We introduce an alternative objective function for neural MT that maximizes the mutual information between the source and target sentences, modeling the bi-directional dependency of sources and targets. We implement the model with a simple re-ranking method, and also introduce a decoding algorithm that increases diversity in the N-best list produced by the first pass. Applied to the WMT German/English and French/English tasks, the proposed models offer a consistent performance boost on both standard LSTM and attention-based neural MT architectures.

1 Introduction

Sequence-to-sequence models for machine translation (SEQ2SEQ) (Sutskever et al., 2014; Bahdanau et al., 2014; Cho et al., 2014; Kalchbrenner and Blunsom, 2013; Sennrich et al., 2015a; Sennrich et al., 2015b; Gulcehre et al., 2015) are of growing interest for their capacity to learn semantic and syntactic relations between sequence pairs, capturing contextual dependencies in a more continuous way than phrase-based SMT approaches. SEQ2SEQ models require minimal domain knowledge, can be trained end-to-end, have a much smaller memory footprint than the large phrase tables needed for phrase-based SMT, and achieve state-of-the-art performance in large-scale tasks like English to French (Luong et al., 2015b) and English to German (Luong et al., 2015a; Jean et al., 2014) translation.

SEQ2SEQ models are implemented as an encoder-decoder network, in which a source sequence input x is mapped (encoded) to a continuous vector representation from which a target output y will be generated (decoded). The framework is optimized by maximizing the log-likelihood of observing the paired output y given x:

    Loss = −log p(y|x)    (1)

While standard SEQ2SEQ models thus capture the unidirectional dependency from source to target, i.e., p(y|x), they ignore p(x|y), the dependency from the target to the source, which has long been an important feature in phrase-based translation (Och and Ney, 2002; Shen et al., 2010). Phrase-based systems that combine p(x|y), p(y|x) and other features like sentence length yield a significant performance boost.

We propose to incorporate this bi-directional dependency and model the maximum mutual information (MMI) between source and target in SEQ2SEQ models. As Li et al. (2015) recently showed in the context of conversational response generation, the MMI-based objective function is equivalent to linearly combining p(x|y) and p(y|x). With a tuning weight λ, such a loss function can be written as:

    ŷ = argmax_y log [ p(x,y) / (p(x) p(y)^λ) ]
      = argmax_y (1−λ) log p(y|x) + λ log p(x|y)    (2)

Note that

    log [ p(x,y) / (p(x) p(y)^λ) ] = log p(y|x) − λ log p(y)    (3)

and Equation (2) can be immediately obtained from (3) by applying Bayes' rule:

    log p(y) = log p(y|x) + log p(x) − log p(x|y)
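For concreteness, the substitution can be written out step by step (a short worked derivation added here for clarity; the final term λ log p(x) does not depend on y and therefore does not affect the argmax in (2)):

    log [ p(x,y) / (p(x) p(y)^λ) ]
        = log p(y|x) − λ log p(y)
        = log p(y|x) − λ [ log p(y|x) + log p(x) − log p(x|y) ]
        = (1−λ) log p(y|x) + λ log p(x|y) − λ log p(x)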
But as also discussed in Li et al. (2015), direct decoding from (2) is infeasible because computing p(x|y) cannot be done until the target has been computed.¹

To avoid this enormous search space, we propose to use a reranking approach to approximate the mutual information between source and target in neural machine translation models. We separately trained two SEQ2SEQ models, one for p(y|x) and one for p(x|y). The p(y|x) model is used to generate N-best lists from the source sentence x. The lists are then reranked using the second term of the objective function, p(x|y).

Because reranking approaches depend on having a diverse N-best list to rerank, we also propose a diversity-promoting decoding model tailored to neural MT systems. We tested the mutual information objective function and the diversity-promoting decoding model on English→French, English→German and German→English translation tasks, using both standard LSTM settings and the more advanced attention-based settings that have recently been shown to yield higher performance.

The next section presents related work, followed by a background section (Section 3) introducing LSTM and attention-based machine translation models. Our proposed model is described in detail in Section 4, with datasets and experimental results in Section 5, followed by conclusions in Section 6.

¹ As demonstrated in (Li et al., 2015).

2 Related Work

This paper draws on three prior lines of research: SEQ2SEQ models, modeling mutual information, and promoting translation diversity.

SEQ2SEQ Models
SEQ2SEQ models map source sequences to vector space representations, from which a target sequence is then generated. They yield good performance in a variety of NLP generation tasks including conversational response generation (Vinyals and Le, 2015; Serban et al., 2015a; Li et al., 2015) and parsing (Vinyals et al., 2014).

A neural machine translation system uses distributed representations to model the conditional probability of targets given sources, using two components, an encoder and a decoder. Kalchbrenner and Blunsom (2013) used an encoding model akin to convolutional networks for encoding and standard hidden-unit recurrent nets for decoding. Similar convolutional networks are used in (Meng et al., 2015) for encoding. Sutskever et al. (2014) and Luong et al. (2015a) employed a stacking LSTM model for both encoding and decoding. Bahdanau et al. (2014) and Jean et al. (2014) adopted bi-directional recurrent nets for the encoder.

Maximum Mutual Information
Maximum Mutual Information (MMI) was introduced in speech recognition (Bahl et al., 1986) as a way of measuring the mutual dependence between inputs (acoustic feature vectors) and outputs (words) and improving discriminative training (Woodland and Povey, 2002). Li et al. (2015) show that MMI could solve an important problem in SEQ2SEQ conversational response generation: prior SEQ2SEQ models tended to generate highly generic, dull responses (e.g., I don't know) regardless of the inputs (Sordoni et al., 2015; Vinyals and Le, 2015; Serban et al., 2015b). Li et al. (2015) show that modeling the mutual dependency between messages and responses promotes the diversity of response outputs.

Our goal, distinct from these previous uses of MMI, is to see whether the mutual information objective improves translation by bidirectionally modeling source-target dependencies. In that sense, our work is designed to incorporate into SEQ2SEQ models features that have proved useful in phrase-based MT, like the reverse translation probability or sentence length (Och and Ney, 2002; Shen et al., 2010; Devlin et al., 2014).

Generating Diverse Translations
Various algorithms have been proposed for generating diverse translations in phrase-based MT, including compact representations like lattices and hypergraphs (Macherey et al., 2008; Tromble et al., 2008; Kumar and Byrne, 2004), "traits" like translation length (Devlin and Matsoukas, 2012), bagging/boosting (Xiao et al., 2013), or multiple systems (Cer et al., 2013). Gimpel et al. (2013) and Batra et al. (2012) produce diverse N-best lists by adding a dissimilarity function based on N-gram overlaps, distancing the current translation from already-generated ones by choosing translations that have high scores but are distinct from previous ones. While we draw on these intuitions, these existing diversity-promoting algorithms are tailored to phrase-based translation frameworks and are not easily transplanted to neural MT decoding, which requires batched computation.
3 Background: Neural Machine Translation

Neural machine translation models map a source x = {x_1, x_2, ..., x_{N_x}} to a continuous vector representation, from which a target output y = {y_1, y_2, ..., y_{N_y}} is to be generated.

3.1 LSTM Models

A long short-term memory model (Hochreiter and Schmidhuber, 1997) associates each time step with an input gate, a memory gate and an output gate, denoted respectively as i_t, f_t and o_t. Let e_t denote the vector for the current word w_t, h_t the vector computed by the LSTM model at time t by combining e_t and h_{t-1}, c_t the cell state vector at time t, and σ the sigmoid function. The vector representation h_t for each time step t is given by:

    i_t = σ(W_i · [h_{t-1}, e_t])    (4)
    f_t = σ(W_f · [h_{t-1}, e_t])    (5)
    o_t = σ(W_o · [h_{t-1}, e_t])    (6)
    l_t = tanh(W_l · [h_{t-1}, e_t])    (7)
    c_t = f_t · c_{t-1} + i_t · l_t    (8)
    h_t = o_t · tanh(c_t)    (9)

where W_i, W_f, W_o, W_l ∈ R^{K×2K}. The LSTM defines a distribution over outputs y and sequentially predicts tokens using a softmax function:

    p(y|x) = ∏_{t=1}^{n_T} exp(f(h_{t-1}, e_{y_t})) / Σ_{w'} exp(f(h_{t-1}, e_{w'}))

where f(h_{t-1}, e_{y_t}) denotes the activation function between h_{t-1} and e_{w_t}, and h_{t-1} is the representation output by the LSTM at time t−1. Each sentence concludes with a special end-of-sentence symbol EOS. Commonly, the input and output each use different LSTMs with separate sets of compositional parameters to capture different compositional patterns. During decoding, the algorithm terminates when an EOS token is predicted.
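As an illustration only, the following is a minimal NumPy sketch of one such step following Eqs. (4)-(9); the function and variable names, the toy dimension K = 4, and the random initialization are ours, and biases are omitted exactly as in the equations above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(W_i, W_f, W_o, W_l, h_prev, c_prev, e_t):
    """One LSTM step following Eqs. (4)-(9): all gates are computed from the
    concatenation [h_{t-1}, e_t]; no bias terms, as in the equations above."""
    x = np.concatenate([h_prev, e_t])   # [h_{t-1}, e_t], shape (2K,)
    i_t = sigmoid(W_i @ x)              # input gate,   Eq. (4)
    f_t = sigmoid(W_f @ x)              # forget gate,  Eq. (5)
    o_t = sigmoid(W_o @ x)              # output gate,  Eq. (6)
    l_t = np.tanh(W_l @ x)              # candidate,    Eq. (7)
    c_t = f_t * c_prev + i_t * l_t      # cell state,   Eq. (8)
    h_t = o_t * np.tanh(c_t)            # hidden state, Eq. (9)
    return h_t, c_t

# Toy usage with K = 4 hidden units; parameters drawn from U[-0.1, 0.1],
# mirroring the initialization described in Section 5.1.
K = 4
rng = np.random.default_rng(0)
W_i, W_f, W_o, W_l = (rng.uniform(-0.1, 0.1, size=(K, 2 * K)) for _ in range(4))
h, c = np.zeros(K), np.zeros(K)
h, c = lstm_step(W_i, W_f, W_o, W_l, h, c, rng.standard_normal(K))
```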
3.2 Attention Models

Attention models adopt a look-back strategy that links the current decoding stage with input time steps to represent which portions of the input are most responsible for the current decoding state (Xu et al., 2015; Luong et al., 2015b; Bahdanau et al., 2014).

Let H = {ĥ_1, ĥ_2, ..., ĥ_{N_x}} be the collection of hidden vectors output by the LSTM during encoding. Each element in H contains information about the input sequence, focusing on the parts surrounding each specific token. Let h_{t-1} be the LSTM output for decoding at time t−1. Attention models link the current-step decoding information, i.e., h_t, with each of the encoding representations ĥ_{t'} using a weight variable a_t. a_t can be constructed from different scoring functions, such as the dot product between the two vectors, i.e., h_{t-1}^T · ĥ_{t'}, a general model akin to a tensor operation, i.e., h_{t-1}^T · W · ĥ_{t'}, and the concatenation model, i.e., U^T · tanh(W · [h_{t-1}, ĥ_{t'}]). The behavior of different attention scoring functions has been extensively studied in Luong et al. (2015a). For all experiments in this paper, we adopt the general strategy, where the relevance score between the current decoding representation and an encoding representation is given by:

    v_{t'} = h_{t-1}^T · W · ĥ_{t'}
    a_{t'} = exp(v_{t'}) / Σ_{t*} exp(v_{t*})    (10)

The attention vector is created by averaging weights over all input time steps:

    m_t = Σ_{t' ∈ [1, N_S]} a_{t'} ĥ_{t'}    (11)

Attention models predict subsequent tokens based on the combination of the last-step LSTM output h_{t-1} and the attention vector m_t:

    h̃_{t-1} = tanh(W_c · [h_{t-1}, m_t])
    p(y_t | y_<, x) = softmax(W_s · h̃_{t-1})    (12)

where W_c ∈ R^{K×2K} and W_s ∈ R^{V×K}, with V denoting the vocabulary size. Luong et al. (2015a) reported a significant performance boost from integrating h̃_{t-1} into the next-step LSTM hidden state computation (referred to as the input-feeding model), making the LSTM compositions in decoding as follows:

    i_t = σ(W_i · [h_{t-1}, e_t, h̃_{t-1}])
    f_t = σ(W_f · [h_{t-1}, e_t, h̃_{t-1}])
    o_t = σ(W_o · [h_{t-1}, e_t, h̃_{t-1}])    (13)
    l_t = tanh(W_l · [h_{t-1}, e_t, h̃_{t-1}])

where W_i, W_f, W_o, W_l ∈ R^{K×3K}. For the attention models implemented in this work, we adopt the input-feeding strategy.

3.3 Unknown Word Replacements

One of the major issues in neural MT models is the computational complexity of the softmax function for target word prediction, which requires summing over all tokens in the vocabulary. Neural models tend to keep a shortlist of the 50,000-80,000 most frequent words and use an unknown (UNK) token to represent all infrequent tokens, which significantly impairs BLEU scores. Recent work has proposed ways to deal with this issue: (Luong et al., 2015b) adopt a post-processing strategy based on an aligner from IBM models, while (Jean et al., 2014) approximate the softmax function by selecting a small subset of the target vocabulary.

In this paper, we use a strategy similar to that of Jean et al. (2014), thus avoiding reliance on an external IBM-model word aligner. From the attention models, we obtain word alignments on the training dataset, from which a bilingual dictionary is extracted. At test time, we first generate target sequences. Once a translation is generated, we link the generated UNK tokens back to positions in the source inputs, and replace each UNK token with the translation of its corresponding source token using the pre-constructed dictionary.

Since the unknown word replacement mechanism relies on automatic word alignment extraction, which is not explicitly modeled in vanilla SEQ2SEQ models, it cannot be immediately applied to vanilla SEQ2SEQ models. However, since unknown word replacement can be viewed as a post-processing technique, we can apply a pre-trained attention model to any given translation: for SEQ2SEQ models, we first generate translations and then replace the UNK tokens within them using the pre-trained attention models to post-process the translations.

4 Mutual Information via Reranking

As discussed in Li et al. (2015), direct decoding from (2) is infeasible since the second part, p(x|y), requires completely generating the target before it can be computed. We therefore use an approximation approach:

1. Train p(y|x) and p(x|y) separately using vanilla SEQ2SEQ models or attention models.
2. Generate N-best lists from p(y|x).
3. Rerank the N-best list by linearly adding p(x|y).

4.1 Standard Beam Search for N-best Lists

N-best lists are generated from the p(y|x) models using a beam search decoder with beam size K = 200. As illustrated in Figure 1, at time step t−1 in decoding, we keep a record of K hypotheses based on the score S(Y_{t-1}|x) = log p(y_1, y_2, ..., y_{t-1}|x). As we move on to time step t, we expand each of the K hypotheses (denoted as Y^k_{t-1} = {y^k_1, y^k_2, ..., y^k_{t-1}}, k ∈ [1, K]) by selecting the top K next-token translations, denoted as y^{k,k'}_t, k' ∈ [1, K], leading to the construction of K × K new hypotheses:

    [Y^k_{t-1}, y^{k,k'}_t],  k ∈ [1, K], k' ∈ [1, K]

The score for each of the K × K hypotheses is computed as follows:

    S(Y^k_{t-1}, y^{k,k'}_t | x) = S(Y^k_{t-1} | x) + log p(y^{k,k'}_t | x, Y^k_{t-1})    (14)

In a standard beam search model, the top K hypotheses are selected (from the K × K hypotheses computed in the last step) based on the score S(Y^k_{t-1}, y^{k,k'}_t | x). The remaining hypotheses are ignored as we proceed to the next time step.

We set the minimum and maximum target lengths to 0.75 and 1.5 times the length of the source. To be specific, at each time step of decoding we are presented with K × K word candidates. We first add all hypotheses for which an EOS token is generated at the current time step to the N-best list. Next we preserve the top K unfinished hypotheses and move to the next time step. We therefore maintain a constant batch size of 200 as hypotheses are completed and taken down, by adding in more unfinished hypotheses. This leads the size of the final N-best list for each input to be much larger than the beam size.²

² For example, for the development set of the English-German WMT14 task, each input has an average of 2,500 candidates in the N-best list.
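The following schematic shows one expansion step of this beam search with the EOS handling and top-K pruning just described. It is an illustrative sketch, not the authors' implementation (which is batched on GPU): `topk_fn`, `eos_id` and `max_len` are placeholder names for the model's top-K next-token scorer under p(y|x), the end-of-sentence id, and the 1.5x-source-length cap.

```python
import heapq

def beam_search_step(hypotheses, topk_fn, K, nbest, eos_id, max_len):
    """One expansion step of the beam search in Section 4.1 (simplified sketch).

    hypotheses: list of (score, tokens) pairs -- the current K partial hypotheses.
    topk_fn:    callable(tokens) -> list of (token_id, log_prob), the K most
                probable next tokens under p(y|x) (assumed supplied by the model).
    nbest:      list collecting finished hypotheses (those that just generated EOS).
    """
    candidates = []
    for score, tokens in hypotheses:                  # K parents
        for tok, logp in topk_fn(tokens):             # K children each -> K*K candidates
            cand = (score + logp, tokens + [tok])     # Eq. (14): add log-probabilities
            if tok == eos_id:
                nbest.append(cand)                    # finished hypotheses go to the N-best list
            elif len(cand[1]) < max_len:
                candidates.append(cand)               # unfinished hypotheses stay in the beam
    # keep only the top K unfinished hypotheses for the next time step
    return heapq.nlargest(K, candidates, key=lambda c: c[0])
```

The diversity-promoting decoder of Section 4.2 keeps this loop intact and only changes how the K × K candidate scores are computed before pruning.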
Figure 1: Illustration of standard beam search and the proposed diversity-promoting beam search.

4.2 Generating a Diverse N-best List

Unfortunately, the N-best lists output by standard beam search are a poor surrogate for the entire search space (Finkel et al., 2006; Huang, 2008). The beam search algorithm can only keep a small proportion of candidates in the search space, and most of the generated translations in the N-best list are similar, differing only by punctuation or minor morphological variations, with most of the words overlapping. Because this lack of diversity in the N-best list will significantly decrease the impact of our reranking process, it is important to find a way to generate a more diverse N-best list.

We propose to change the way S(Y^k_{t-1}, y^{k,k'}_t | x) is computed in an attempt to promote diversity, as shown in Figure 1. For each of the hypotheses Y^k_{t-1} (he and it), we generate the top K translations y^{k,k'}_t, k' ∈ [1, K], as in the standard beam search model. Next we rank the K translated tokens generated from the same parental hypothesis based on p(y^{k,k'}_t | x, Y^k_{t-1}) in descending order: he is ranks first among he is and he has, and he has ranks second; similarly for it is and it has. Next we rewrite the score for [Y^k_{t-1}, y^{k,k'}_t] by adding an additional term γk', where k' denotes the rank of the current hypothesis among its siblings, which is first for he is and it is, and second for he has and it has:

    Ŝ(Y^k_{t-1}, y^{k,k'}_t | x) = S(Y^k_{t-1}, y^{k,k'}_t | x) − γk'    (15)

The top K hypotheses are selected based on Ŝ(Y^k_{t-1}, y^{k,k'}_t | x) as we move on to the next time step. By adding the additional term γk', the model punishes bottom-ranked hypotheses among siblings (hypotheses descended from the same parent). When we compare newly generated hypotheses descended from different ancestors, the model gives more credit to top hypotheses from each of the different ancestors. For instance, even though the original score for it is is lower than he has, the model favors the former, as the latter is more severely punished by the intra-sibling ranking term γk'. The model thus generally favors choosing hypotheses from diverse parents, leading to a more diverse N-best list.

The proposed model is straightforwardly implemented with a minor adjustment to the standard beam search model.³

We employ the diversity evaluation metrics of (Li et al., 2015) to evaluate the degree of diversity of the N-best lists: calculating the average number of distinct unigrams (distinct-1) and bigrams (distinct-2) in the N-best list for each source sentence, scaled by the total number of tokens. By employing the diversity-promoting model with γ tuned on the development set based on BLEU score, the value of distinct-1 increases from 0.54% to 0.95%, and distinct-2 increases from 1.55% to 2.84% for English-German translation. Similar phenomena are observed on the English-French translation tasks, and details are omitted for brevity.

³ Decoding for neural-based MT models using a large batch size can be expensive as a result of the softmax word-prediction function. The proposed model supports batched decoding using a GPU, significantly speeding up decoding relative to other diversity-fostering models tailored to phrase-based MT systems.
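A sketch of how Eq. (15) can be dropped into the candidate-generation loop of the beam search shown earlier (again illustrative; `topk_fn` and `gamma` are placeholder names, and the real system applies this inside batched GPU decoding):

```python
def rescore_with_sibling_rank(parents, topk_fn, gamma):
    """Score K*K candidates with the intra-sibling rank penalty of Eq. (15).

    For each parent hypothesis, its K candidate extensions are ranked by
    log p(y_t | x, Y_{t-1}) in descending order, and the k'-th ranked child
    (k' = 1 ... K) has gamma * k' subtracted from its standard score.
    """
    rescored = []
    for score, tokens in parents:
        # children of the same parent, ranked by their conditional log-probability
        children = sorted(topk_fn(tokens), key=lambda c: c[1], reverse=True)
        for rank, (tok, logp) in enumerate(children, start=1):   # rank = k'
            s_hat = score + logp - gamma * rank                  # Eq. (15)
            rescored.append((s_hat, tokens + [tok]))
    return rescored   # prune to the top K as in standard beam search
```

The top K of these rescored candidates are then kept exactly as before, so the only extra cost over standard beam search is the per-parent sort.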
4.3 Reranking

The generated N-best list is then reranked by linearly combining log p(y|x) with log p(x|y). The score of the source given each generated translation can be immediately computed from the previously trained p(x|y).

Other than log p(y|x), we also consider log p(y), which denotes the average language model probability trained from monolingual data. It is worth noting that integrating log p(y|x) and log p(y) into reranking is not a new idea and has long been employed in noisy-channel models in standard MT. In the neural MT literature, recent progress has demonstrated the effectiveness of reranking with a language model (Gulcehre et al., 2015).

We also consider an additional term that takes into account the length of the target (denoted as L_T) in decoding. We thus linearly combine the three parts, making the final ranking score for a given target candidate y:

    Score(y) = log p(y|x) + λ log p(x|y) + γ log p(y) + η L_T    (16)

We optimize η, λ and γ using MERT (Och, 2003) on BLEU score (Papineni et al., 2002) on the development set.
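A minimal sketch of Eq. (16) as a reranker follows. The three scoring callables stand in for the separately trained p(y|x), p(x|y) and p(y) models, which are not shown; `lam`, `gamma` and `eta` correspond to the weights λ, γ and η tuned with MERT.

```python
def rerank(x, nbest, log_p_y_given_x, log_p_x_given_y, log_p_y, lam, gamma, eta):
    """Sort an N-best list for source x by the linear score of Eq. (16)."""
    def score(y):
        return (log_p_y_given_x(y, x)          # log p(y|x), first-pass model score
                + lam * log_p_x_given_y(x, y)  # lambda * log p(x|y), reverse model
                + gamma * log_p_y(y)           # gamma * log p(y), monolingual LM
                + eta * len(y))                # eta * L_T, target length
    return sorted(nbest, key=score, reverse=True)
```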
5 Experiments

Our models are trained on the WMT'14 training dataset containing 4.5 million pairs for English-German and German-English translation, and 12 million pairs for English-French translation. For English-German translation, we limit our vocabularies to the top 50K most frequent words for both languages. For English-French translation, we keep the top 200K most frequent words for the source language and 80K for the target language. Words that are not in the vocabulary list are noted as the universal unknown token.

For English-German and German-English translation, we use newstest2013 (3,000 sentence pairs) as the development set, and translation performance is reported in BLEU (Papineni et al., 2002) on newstest2014 (2,737 sentences). For English-French translation, we concatenate news-test-2012 and news-test-2013 to make a development set (6,003 pairs in total) and evaluate the models on news-test-2014 with 3,003 pairs.⁴

⁴ As in (Luong et al., 2015a). All texts are tokenized with tokenizer.perl and BLEU scores are computed with multi-bleu.perl.

5.1 Training Details for p(x|y) and p(y|x)

We trained neural models using both standard SEQ2SEQ models and attention models. We trained p(y|x) following the standard training protocols described in (Sutskever et al., 2014). p(x|y) is trained identically but with sources and targets swapped.

We adopt a deep structure with four LSTM layers for encoding and four LSTM layers for decoding, each of which consists of a different set of parameters. We followed the detailed protocols from Luong et al. (2015a): each LSTM layer consists of 1,000 hidden neurons, and the dimensionality of word embeddings is set to 1,000. Other training details include: LSTM parameters and word embeddings are initialized from a uniform distribution between [-0.1, 0.1]; for English-German translation, we run 12 epochs in total and, after 8 epochs, start halving the learning rate after each epoch; for English-French translation, the total number of epochs is set to 8, and we start halving the learning rate after 5 iterations. Batch size is set to 128; gradient clipping is adopted by scaling gradients when the norm exceeds a threshold of 5. Inputs are reversed.

Our implementation on a single GPU⁵ processes approximately 800-1200 tokens per second. Training for the English-German dataset (4.5 million pairs) takes roughly 12-15 days. For the French-English dataset, comprised of 12 million pairs, training takes roughly 4-6 weeks.

⁵ Tesla K40m, 1 Kepler GK110B, 2880 CUDA cores.

5.2 Training p(y) from Monolingual Data

We respectively train single-layer LSTM recurrent models with 500 units for German and French using monolingual data. We use the News Crawl corpora from WMT13⁶ as additional training data to train the monolingual language models. We used a subset of the original dataset which roughly contains 50-60 million sentences. Following (Gulcehre et al., 2015; Sennrich et al., 2015a), we remove sentences with more than 10% unknown words based on the vocabulary constructed from the parallel datasets. We adopted protocols similar to those used to train the SEQ2SEQ models, such as gradient clipping and mini-batching.

⁶ http://www.statmt.org/wmt13/translation-task.html

5.3 English-German Results

We report progressive performance as we add in more features for reranking. Results for the different models on the WMT 2014 English-German translation task are shown in Table 1. Among all the features, the reverse probability from mutual information (i.e., p(x|y)) yields the most significant performance boost: +1.4 and +1.1 for standard SEQ2SEQ models without and with unknown word replacement, and +0.9 for attention models.⁷ In line with (Gulcehre et al., 2015; Sennrich et al., 2015a), we observe a consistent performance boost introduced by the language model.

⁷ Target length has long proved to be one of the most important features in phrase-based MT due to the BLEU score's significant sensitivity to target length. However, here we do not observe as large a performance boost as in phrase-based MT. This is because during decoding, target length has already been strictly constrained: as described in Section 4.1, we only consider candidates of lengths between 0.75 and 1.5 times that of the source.

Table 1: BLEU scores from different models on the WMT'14 English-German task. UnkRep denotes applying the unknown word replacement strategy; Diverse decoding indicates that the diversity-promoting decoding model is adopted. Baseline performances are reprinted from Jean et al. (2014) and Luong et al. (2015a).

Model | Features | BLEU
Standard | p(y|x) | 13.2
Standard | p(y|x)+Length | 13.6 (+0.4)
Standard | p(y|x)+p(x|y)+Length | 15.0 (+1.4)
Standard | p(y|x)+p(x|y)+p(y)+Length | 15.4 (+0.4)
Standard | p(y|x)+p(x|y)+p(y)+Length+Diverse decoding | 15.8 (+0.4); +2.6 in total
Standard+UnkRep | p(y|x) | 14.7
Standard+UnkRep | p(y|x)+Length | 15.2 (+0.7)
Standard+UnkRep | p(y|x)+p(x|y)+Length | 16.3 (+1.1)
Standard+UnkRep | p(y|x)+p(x|y)+p(y)+Length | 16.7 (+0.4)
Standard+UnkRep | p(y|x)+p(x|y)+p(y)+Length+Diverse decoding | 17.3 (+0.3); +2.6 in total
Attention+UnkRep | p(y|x) | 20.5
Attention+UnkRep | p(y|x)+Length | 20.9 (+0.4)
Attention+UnkRep | p(y|x)+p(x|y)+Length | 21.8 (+0.9)
Attention+UnkRep | p(y|x)+p(x|y)+p(y)+Length | 22.1 (+0.3)
Attention+UnkRep | p(y|x)+p(x|y)+p(y)+Length+Diverse decoding | 22.6 (+0.3); +2.1 in total
Jean et al. (2015) (without ensemble) | | 19.4
Jean et al. (2015) (with ensemble) | | 21.6
Luong et al. (2015a) (with UnkRep, without ensemble) | | 20.9
Luong et al. (2015a) (with UnkRep, with ensemble) | | 23.0

We see the benefit of our diverse N-best list by comparing the mutual+diversity models with the diversity models. On top of the improvements over standard beam search due to reranking, the diversity models introduce additional gains of +0.4, +0.3 and +0.3, bringing the total gains to roughly +2.6, +2.6 and +2.1 for the different models.
The unknown token replacement technique yields significant gains, in line with the observations of Jean et al. (2014) and Luong et al. (2015a).

We compare our English-German system with various others: (1) the end-to-end neural MT system from Jean et al. (2014) using a large vocabulary size; (2) models from Luong et al. (2015a) that combine different attention models. For the models described in (Jean et al., 2014) and (Luong et al., 2015a), we reprint their results for both the single-model setting and the ensemble setting, in which a set of (usually 8) neural models that differ in random initializations and in the order of minibatches are trained and jointly contribute to the decoding process. The ensemble procedure is known to result in improved performance (Luong et al., 2015a; Jean et al., 2014; Sutskever et al., 2014).

Note that the reported results from the standard SEQ2SEQ models and attention models in Table 1 (those without considering mutual information) are from models identical in structure to the corresponding models described in (Luong et al., 2015a), and achieve similar performance (13.2 vs 14.0 for standard SEQ2SEQ models and 20.5 vs 20.7 for attention models). Due to time and computational constraints, we did not implement an ensemble mechanism, making our results incomparable to the ensemble results in these papers.

5.4 French-English Results

Results on the WMT'14 French-English datasets are shown in Table 2, along with results reprinted from Sutskever et al. (2014) and Luong et al. (2015b). We again observe that applying mutual information yields better performance than the corresponding standard neural MT models.

Relative to the English-German dataset, the English-French translation task shows a larger gap between our new model and the vanilla models in which reranking information is not considered; our models respectively yield up to +3.2, +2.6 and +2.7 BLEU improvements over standard neural models without and with unknown word replacement, and attention models.

Table 2: BLEU scores from different models on the WMT'14 English-French task. Google is the LSTM-based model proposed in Sutskever et al. (2014). Luong et al. (2015b) is the extension of the Google model with unknown token replacement.

Model | Features | BLEU
Standard | p(y|x) | 29.0
Standard | p(y|x)+Length | 29.7 (+0.7)
Standard | p(y|x)+p(x|y)+Length | 31.2 (+1.5)
Standard | p(y|x)+p(x|y)+p(y)+Length | 31.7 (+0.5)
Standard | p(y|x)+p(x|y)+p(y)+Length+Diverse decoding | 32.2 (+0.5); +3.2 in total
Standard+UnkRep | p(y|x) | 31.0
Standard+UnkRep | p(y|x)+Length | 31.5 (+0.5)
Standard+UnkRep | p(y|x)+p(x|y)+Length | 32.9 (+1.4)
Standard+UnkRep | p(y|x)+p(x|y)+p(y)+Length | 33.3 (+0.4)
Standard+UnkRep | p(y|x)+p(x|y)+p(y)+Length+Diverse decoding | 33.6 (+0.3); +2.6 in total
Attention+UnkRep | p(y|x) | 33.4
Attention+UnkRep | p(y|x)+Length | 34.3 (+0.9)
Attention+UnkRep | p(y|x)+p(x|y)+Length | 35.2 (+0.9)
Attention+UnkRep | p(y|x)+p(x|y)+p(y)+Length | 35.7 (+0.5)
Attention+UnkRep | p(y|x)+p(x|y)+p(y)+Length+Diverse decoding | 36.3 (+0.4); +2.7 in total
LSTM (Google) (without ensemble) | | 30.6
LSTM (Google) (with ensemble) | | 33.0
Luong et al. (2015b), UnkRep (without ensemble) | | 32.7
Luong et al. (2015b), UnkRep (with ensemble) | | 37.5
6 Discussion

In this paper, we introduce a new objective for neural MT based on the mutual dependency between the source and target sentences, inspired by recent work in neural conversation generation (Li et al., 2015). We build an approximate implementation of our model using reranking, and then, to make reranking more powerful, we introduce a new decoding method that promotes diversity in the first-pass N-best list. On English→French and English→German translation tasks, we show that neural machine translation models trained using the proposed method perform better than the corresponding standard models, and that both the mutual information objective and the diversity-increasing decoding method contribute to the performance boost.

The new models come with the advantages of easy implementation, with sources and targets interchanged, and of offering a general solution that can be integrated into any neural generation model with minor adjustments. Indeed, our diversity-enhancing decoder can be applied to generate more diverse N-best lists for any NLP reranking task. Finding a way to introduce mutual information based decoding directly into a first-pass decoder without reranking naturally constitutes our future work.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

L. R. Bahl, Peter F. Brown, Peter V. De Souza, and Robert L. Mercer. 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In Proc. ICASSP, volume 86, pages 49-52.

Dhruv Batra, Payman Yadollahpour, Abner Guzman-Rivera, and Gregory Shakhnarovich. 2012. Diverse m-best solutions in Markov random fields. In Computer Vision - ECCV 2012, pages 1-16. Springer.

Daniel Cer, Christopher D. Manning, and Daniel Jurafsky. 2013. Positive diversity tuning for machine translation system combination. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 320-328.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Jacob Devlin and Spyros Matsoukas. 2012. Trait-based hypothesis selection for machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 528-532. Association for Computational Linguistics.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard M. Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In ACL (1), pages 1370-1380. Citeseer.

Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. 2006. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 618-626. Association for Computational Linguistics.

Kevin Gimpel, Dhruv Batra, Chris Dyer, Gregory Shakhnarovich, and Virginia Tech. 2013. A systematic exploration of diversity in machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In ACL, pages 586-594.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP, pages 1700-1709.

Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. Technical report, DTIC Document.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In EMNLP.

Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Proceedings of ACL.

Wolfgang Macherey, Franz Josef Och, Ignacio Thayer, and Jakob Uszkoreit. 2008. Lattice-based minimum error rate training for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 725-734. Association for Computational Linguistics.

Fandong Meng, Zhengdong Lu, Mingxuan Wang, Hang Li, Wenbin Jiang, and Qun Liu. 2015. Encoding source language with convolutional neural network for machine translation. arXiv preprint arXiv:1503.01838.
Franz Josef Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of ACL 2002, pages 295-302.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 160-167. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015a. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015b. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015a. Building end-to-end dialogue systems using generative hierarchical neural network models. arXiv preprint arXiv:1507.04808.

Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2015b. A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742.

Libin Shen, Jinxi Xu, and Ralph Weischedel. 2010. String-to-dependency statistical machine translation. Computational Linguistics, 36(4):649-671.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Roy W. Tromble, Shankar Kumar, Franz Och, and Wolfgang Macherey. 2008. Lattice minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 620-629. Association for Computational Linguistics.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2014. Grammar as a foreign language. arXiv preprint arXiv:1412.7449.

P. C. Woodland and D. Povey. 2002. Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech and Language, 16:25-47.

Tong Xiao, Jingbo Zhu, and Tongran Liu. 2013. Bagging and boosting statistical machine translation systems. Artificial Intelligence, 195:496-527.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.