Mutual Information and Diverse Decoding Improve Neural Machine Translation

Jiwei Li and Dan Jurafsky
Computer Science Department
Stanford University, Stanford, CA, 94305, USA
jiweil,[email protected]

Abstract

Sequence-to-sequence neural translation models learn semantic and syntactic relations between sentence pairs by optimizing the likelihood of the target given the source, i.e., p(y|x), an objective that ignores other potentially useful sources of information. We introduce an alternative objective function for neural MT that maximizes the mutual information between the source and target sentences, modeling the bi-directional dependency of sources and targets. We implement the model with a simple re-ranking method, and also introduce a decoding algorithm that increases diversity in the N-best list produced by the first pass. Applied to the WMT German/English and French/English tasks, the proposed models offer a consistent performance boost on both standard LSTM and attention-based neural MT architectures.

1 Introduction

Sequence-to-sequence models for machine translation (SEQ2SEQ) (Sutskever et al., 2014; Bahdanau et al., 2014; Cho et al., 2014; Kalchbrenner and Blunsom, 2013; Sennrich et al., 2015a; Sennrich et al., 2015b; Gulcehre et al., 2015) are of growing interest for their capacity to learn semantic and syntactic relations between sequence pairs, capturing contextual dependencies in a more continuous way than phrase-based SMT approaches. SEQ2SEQ models require minimal domain knowledge, can be trained end-to-end, have a much smaller memory footprint than the large phrase tables needed for phrase-based SMT, and achieve state-of-the-art performance in large-scale tasks like English-to-French (Luong et al., 2015b) and English-to-German (Luong et al., 2015a; Jean et al., 2014) translation.

SEQ2SEQ models are implemented as an encoder-decoder network, in which a source sequence input x is mapped (encoded) to a continuous vector representation from which a target output y will be generated (decoded). The framework is optimized by maximizing the log-likelihood of observing the paired output y given x:

    \text{Loss} = -\log p(y|x)    (1)

While standard SEQ2SEQ models thus capture the unidirectional dependency from source to target, i.e., p(y|x), they ignore p(x|y), the dependency from the target to the source, which has long been an important feature in phrase-based translation (Och and Ney, 2002; Shen et al., 2010). Phrase-based systems that combine p(x|y), p(y|x) and other features like sentence length yield a significant performance boost.

We propose to incorporate this bi-directional dependency and model the maximum mutual information (MMI) between source and target in SEQ2SEQ models. As Li et al. (2015) recently showed in the context of conversational response generation, the MMI-based objective function is equivalent to linearly combining p(x|y) and p(y|x). With a tuning weight \lambda, such a loss function can be written as:

    \hat{y} = \arg\max_y \log \frac{p(x,y)}{p(x)\,p(y)^{\lambda}}
            = \arg\max_y \; (1-\lambda)\log p(y|x) + \lambda\log p(x|y)    (2)

The same objective can also be written as

    \log \frac{p(x,y)}{p(x)\,p(y)^{\lambda}} = \log p(y|x) - \lambda\log p(y)    (3)

and Equation (2) is immediately obtained from (3) by applying Bayes' rule, \log p(y) = \log p(y|x) + \log p(x) - \log p(x|y), and dropping \lambda\log p(x), which is constant with respect to y.

But as also discussed in Li et al. (2015), direct decoding from (2) is infeasible because computing p(x|y) cannot be done until the target has been computed.

To avoid this enormous search space, we propose to use a reranking approach to approximate the mutual information between source and target in neural machine translation models. We separately train two SEQ2SEQ models, one for p(y|x) and one for p(x|y). The p(y|x) model is used to generate N-best lists from the source sentence x. The lists are then reranked using the second term of the objective function, p(x|y).

Because reranking approaches depend on having a diverse N-best list to rerank, we also propose a diversity-promoting decoding model tailored to neural MT systems. We tested the mutual information objective function and the diversity-promoting decoding model on English-French, English-German and German-English translation tasks, using both standard LSTM settings and the more advanced attention-based settings that have recently been shown to yield higher performance.

The next section presents related work, followed by a background section (Section 3) introducing LSTM and attention-based machine translation models. Our proposed model is described in detail in Section 4, with datasets and experimental results in Section 5, followed by conclusions.
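As a quick sanity check on the equivalence between (2) and (3), the following minimal Python sketch (not part of the paper; the toy joint distribution and the value of \lambda are invented purely for illustration) verifies that the two scoring criteria differ only by a term that does not depend on y, so they induce the same argmax:

```python
import math

# Toy joint distribution p(x, y) over two sources and three targets
# (invented numbers that sum to 1, used only to illustrate the algebra).
joint = {
    ("x1", "y1"): 0.20, ("x1", "y2"): 0.15, ("x1", "y3"): 0.05,
    ("x2", "y1"): 0.10, ("x2", "y2"): 0.20, ("x2", "y3"): 0.30,
}
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in ("x1", "x2")}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in ("y1", "y2", "y3")}

lam, x = 0.4, "x1"
for y in ("y1", "y2", "y3"):
    p_y_given_x = joint[(x, y)] / px[x]
    p_x_given_y = joint[(x, y)] / py[y]
    # Equation (2): linear combination of forward and backward probabilities.
    score_2 = (1 - lam) * math.log(p_y_given_x) + lam * math.log(p_x_given_y)
    # Equation (3): forward probability penalized by the target marginal p(y).
    score_3 = math.log(p_y_given_x) - lam * math.log(py[y])
    # The difference is the constant -lam * log p(x), identical for every y,
    # so ranking candidates by (2) or by (3) gives the same best hypothesis.
    print(y, round(score_2, 4), round(score_3, 4), round(score_3 - score_2, 4))
```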
2 Related Work

This paper draws on three prior lines of research: SEQ2SEQ models, modeling mutual information, and promoting translation diversity.

SEQ2SEQ Models. SEQ2SEQ models map source sequences to vector space representations, from which a target sequence is then generated. They yield good performance in a variety of NLP generation tasks including conversational response generation (Vinyals and Le, 2015; Serban et al., 2015a; Li et al., 2015) and parsing (Vinyals et al., 2014).

A neural machine translation system uses distributed representations to model the conditional probability of targets given sources, using two components, an encoder and a decoder. Kalchbrenner and Blunsom (2013) used an encoding model akin to convolutional networks for encoding and standard hidden-unit recurrent nets for decoding. Similar convolutional networks are used in (Meng et al., 2015) for encoding. Sutskever et al. (2014) and Luong et al. (2015a) employed a stacking LSTM model for both encoding and decoding. Bahdanau et al. (2014) and Jean et al. (2014) adopted bi-directional recurrent nets for the encoder.

Maximum Mutual Information. Maximum Mutual Information (MMI) was introduced in speech recognition (Bahl et al., 1986) as a way of measuring the mutual dependence between inputs (acoustic feature vectors) and outputs (words) and improving discriminative training (Woodland and Povey, 2002). Li et al. (2015) show that MMI can address an important problem in SEQ2SEQ conversational response generation: prior SEQ2SEQ models tended to generate highly generic, dull responses (e.g., "I don't know") regardless of the inputs (Sordoni et al., 2015; Vinyals and Le, 2015; Serban et al., 2015b). Li et al. (2015) show that modeling the mutual dependency between messages and responses promotes the diversity of response outputs.

Our goal, distinct from these previous uses of MMI, is to see whether the mutual information objective improves translation by bidirectionally modeling source-target dependencies. In that sense, our work is designed to incorporate into SEQ2SEQ models features that have proved useful in phrase-based MT, like the reverse translation probability or sentence length (Och and Ney, 2002; Shen et al., 2010; Devlin et al., 2014).

Generating Diverse Translations. Various algorithms have been proposed for generating diverse translations in phrase-based MT, including compact representations like lattices and hypergraphs (Macherey et al., 2008; Tromble et al., 2008; Kumar and Byrne, 2004), "traits" like translation length (Devlin and Matsoukas, 2012), bagging/boosting (Xiao et al., 2013), or multiple systems (Cer et al., 2013). Gimpel et al. (2013) and Batra et al. (2012) produce diverse N-best lists by adding a dissimilarity function based on N-gram overlaps, distancing the current translation from already-generated ones by choosing translations that have higher scores but are distinct from previous ones. While we draw on these intuitions, these existing diversity-promoting algorithms are tailored to phrase-based translation frameworks and are not easily transplanted to neural MT decoding, which requires batched computation.

3 Background: Neural Machine Translation

Neural machine translation models map a source sequence x = {x_1, x_2, ..., x_{N_x}} to a continuous vector representation, from which a target output y = {y_1, y_2, ..., y_{N_y}} is to be generated.

3.1 LSTM Models

A long short-term memory model (Hochreiter and Schmidhuber, 1997) associates each time step with an input gate, a memory gate and an output gate, denoted respectively as i_t, f_t and o_t. Let e_t denote the vector for the current word w_t, h_t the vector computed by the LSTM model at time t by combining e_t and h_{t-1}, c_t the cell state vector at time t, and \sigma the sigmoid function. The vector representation h_t for each time step t is given by:

    i_t = \sigma(W_i \cdot [h_{t-1}, e_t])    (4)
    f_t = \sigma(W_f \cdot [h_{t-1}, e_t])    (5)
    o_t = \sigma(W_o \cdot [h_{t-1}, e_t])    (6)
    l_t = \tanh(W_l \cdot [h_{t-1}, e_t])    (7)
    c_t = f_t \cdot c_{t-1} + i_t \cdot l_t    (8)
    h_t = o_t \cdot \tanh(c_t)    (9)

where W_i, W_f, W_o, W_l \in \mathbb{R}^{K \times 2K}. The LSTM defines a distribution over outputs y and sequentially predicts tokens using a softmax function:

    p(y|x) = \prod_{t=1}^{n_T} \frac{\exp(f(h_{t-1}, e_{y_t}))}{\sum_{w'} \exp(f(h_{t-1}, e_{w'}))}

where f(h_{t-1}, e_{y_t}) denotes the activation function between h_{t-1} and e_{y_t}, with h_{t-1} being the representation output by the LSTM at time t-1. Each sentence concludes with a special end-of-sentence symbol EOS. Commonly, the input and output each use different LSTMs with separate sets of compositional parameters to capture different compositional patterns. During decoding, the algorithm terminates when an EOS token is predicted.
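A minimal NumPy sketch of one LSTM step as defined by equations (4)-(9); the dimensions and the random parameter initialization here are illustrative assumptions, not the paper's actual training configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, e_t, W_i, W_f, W_o, W_l):
    """One step of equations (4)-(9): gates computed from [h_{t-1}, e_t]."""
    z = np.concatenate([h_prev, e_t])   # [h_{t-1}, e_t], shape (2K,)
    i_t = sigmoid(W_i @ z)              # input gate, eq. (4)
    f_t = sigmoid(W_f @ z)              # memory (forget) gate, eq. (5)
    o_t = sigmoid(W_o @ z)              # output gate, eq. (6)
    l_t = np.tanh(W_l @ z)              # candidate update, eq. (7)
    c_t = f_t * c_prev + i_t * l_t      # new cell state, eq. (8)
    h_t = o_t * np.tanh(c_t)            # new hidden state, eq. (9)
    return h_t, c_t

# Illustrative sizes only; the paper uses K = 1,000 hidden units per layer.
K = 8
rng = np.random.default_rng(0)
W_i, W_f, W_o, W_l = (rng.normal(scale=0.1, size=(K, 2 * K)) for _ in range(4))
h, c = np.zeros(K), np.zeros(K)
e = rng.normal(size=K)                  # embedding of the current word
h, c = lstm_step(h, c, e, W_i, W_f, W_o, W_l)
```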
3.2 Attention Models

Attention models adopt a look-back strategy that links the current decoding stage with input time steps to represent which portions of the input are most responsible for the current decoding state (Xu et al., 2015; Luong et al., 2015b; Bahdanau et al., 2014).

Let H = {\hat{h}_1, \hat{h}_2, ..., \hat{h}_{N_x}} be the collection of hidden vectors output by the LSTM during encoding. Each element in H contains information about the input sequence, focusing on the parts surrounding each specific token. Let h_{t-1} be the LSTM output for decoding at time t-1. Attention models link the current-step decoding information, i.e., h_t, with each of the encoding representations \hat{h}_{t'} using a weight variable a_t. a_t can be constructed from different scoring functions, such as the dot product between the two vectors, i.e., h_{t-1}^T \cdot \hat{h}_{t'}, a general model akin to a tensor operation, i.e., h_{t-1}^T \cdot W \cdot \hat{h}_{t'}, and the concatenation model, i.e., U^T \cdot \tanh(W \cdot [h_{t-1}, \hat{h}_{t'}]). The behavior of the different attention scoring functions has been extensively studied in Luong et al. (2015a). For all experiments in this paper, we adopt the general strategy, where the relevance score between the current decoding representation and an encoding representation is given by:

    v_{t'} = h_{t-1}^T \cdot W \cdot \hat{h}_{t'}
    a_{t'} = \frac{\exp(v_{t'})}{\sum_{t^*} \exp(v_{t^*})}    (10)

The attention vector is created by averaging weights over all input time steps:

    m_t = \sum_{t' \in [1, N_S]} a_{t'}\,\hat{h}_{t'}    (11)

Attention models predict subsequent tokens based on the combination of the last-step LSTM output h_{t-1} and the attention vector m_t:

    \vec{h}_{t-1} = \tanh(W_c \cdot [h_{t-1}, m_t])    (12)
    p(y_t | y_{<t}, x) = \mathrm{softmax}(W_s \cdot \vec{h}_{t-1})

where W_c \in \mathbb{R}^{K \times 2K} and W_s \in \mathbb{R}^{V \times K}, with V denoting the vocabulary size. Luong et al. (2015a) reported a significant performance boost by integrating \vec{h}_{t-1} into the next-step LSTM hidden state computation (referred to as the input-feeding model), making the LSTM compositions in decoding as follows:

    i_t = \sigma(W_i \cdot [h_{t-1}, e_t, \vec{h}_{t-1}])
    f_t = \sigma(W_f \cdot [h_{t-1}, e_t, \vec{h}_{t-1}])
    o_t = \sigma(W_o \cdot [h_{t-1}, e_t, \vec{h}_{t-1}])    (13)
    l_t = \tanh(W_l \cdot [h_{t-1}, e_t, \vec{h}_{t-1}])

where W_i, W_f, W_o, W_l \in \mathbb{R}^{K \times 3K}. For the attention models implemented in this work, we adopt the input-feeding strategy.
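A small NumPy sketch of the "general" attention score in (10) and the context vector in (11)-(12); the shapes and parameter values are illustrative assumptions rather than the paper's trained weights:

```python
import numpy as np

def general_attention(h_prev, H_enc, W, W_c):
    """Equations (10)-(12): score encoder states, normalize, average, combine.

    h_prev : (K,)    decoder state h_{t-1}
    H_enc  : (N, K)  encoder states \hat{h}_{1..N}
    W      : (K, K)  bilinear matrix for the "general" score
    W_c    : (K, 2K) combination matrix producing \vec{h}_{t-1}
    """
    v = H_enc @ (W.T @ h_prev)             # v_{t'} = h_{t-1}^T W \hat{h}_{t'}, eq. (10)
    a = np.exp(v - v.max()); a /= a.sum()  # softmax over input positions
    m_t = a @ H_enc                        # attention (context) vector, eq. (11)
    h_tilde = np.tanh(W_c @ np.concatenate([h_prev, m_t]))  # eq. (12)
    return a, h_tilde

# Toy dimensions for illustration only.
K, N = 8, 5
rng = np.random.default_rng(1)
a, h_tilde = general_attention(rng.normal(size=K), rng.normal(size=(N, K)),
                               rng.normal(scale=0.1, size=(K, K)),
                               rng.normal(scale=0.1, size=(K, 2 * K)))
```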
3.3 Unknown Word Replacements

One of the major issues in neural MT models is the computational complexity of the softmax function for target word prediction, which requires summing over all tokens in the vocabulary. Neural models therefore tend to keep a shortlist of the 50,000-80,000 most frequent words and use an unknown (UNK) token to represent all infrequent tokens, which significantly impairs BLEU scores. Recent work has proposed ways to deal with this issue: Luong et al. (2015b) adopt a post-processing strategy based on an aligner from IBM models, while Jean et al. (2014) approximate the softmax function by selecting a small subset of the target vocabulary.

In this paper, we use a strategy similar to that of Jean et al. (2014), thus avoiding reliance on an external IBM-model word aligner. From the attention models, we obtain word alignments on the training dataset, from which a bilingual dictionary is extracted. At test time, we first generate target sequences. Once a translation is generated, we link the generated UNK tokens back to positions in the source input, and replace each UNK token with the translation of its corresponding source token using the pre-constructed dictionary.

As the unknown word replacement mechanism relies on automatic word alignment extraction, which is not explicitly modeled in vanilla SEQ2SEQ models, it cannot be applied to vanilla SEQ2SEQ models directly. However, since unknown word replacement can be viewed as a post-processing technique, we can apply a pre-trained attention model to any given translation: for SEQ2SEQ models, we first generate translations and then replace the UNK tokens within them using the pre-trained attention models as a post-processing step.
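The following Python sketch illustrates this post-processing step under simplifying assumptions: it takes attention weights from a pre-trained attention model, aligns each target position to its highest-weight source position, and substitutes UNK tokens via a pre-extracted bilingual dictionary. The data structures, the toy example, and the fallback of copying the source word are illustrative choices, not details specified in the paper.

```python
from typing import Dict, List, Sequence

def replace_unk(target: Sequence[str],
                source: Sequence[str],
                attention: Sequence[Sequence[float]],
                dictionary: Dict[str, str],
                unk: str = "UNK") -> List[str]:
    """Post-process a translation: for each UNK, follow the argmax attention
    weight back to a source position and look that word up in the bilingual
    dictionary (falling back to copying the source word itself)."""
    output = []
    for t, word in enumerate(target):
        if word != unk:
            output.append(word)
            continue
        # Source position the decoder attended to most at target step t.
        src_pos = max(range(len(source)), key=lambda j: attention[t][j])
        src_word = source[src_pos]
        output.append(dictionary.get(src_word, src_word))
    return output

# Toy example with invented attention weights and dictionary entries.
source = ["das", "Abkommen", "wurde", "unterzeichnet"]
target = ["the", "UNK", "was", "signed"]
attention = [[0.70, 0.10, 0.10, 0.10],
             [0.10, 0.80, 0.05, 0.05],
             [0.10, 0.10, 0.70, 0.10],
             [0.05, 0.05, 0.10, 0.80]]
print(replace_unk(target, source, attention, {"Abkommen": "agreement"}))
# ['the', 'agreement', 'was', 'signed']
```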
4 Mutual Information via Reranking

As discussed in Li et al. (2015), direct decoding from (2) is infeasible since the second part, p(x|y), requires completely generating the target before it can be computed. We therefore use an approximation approach:

1. Train p(y|x) and p(x|y) separately using vanilla SEQ2SEQ models or attention models.
2. Generate N-best lists from p(y|x).
3. Rerank the N-best list by linearly adding p(x|y).

4.1 Standard Beam Search for N-best Lists

N-best lists are generated using a beam search decoder with beam size K = 200 from the p(y|x) models. As illustrated in Figure 1, at time step t-1 in decoding, we keep a record of K hypotheses based on the score S(Y_{t-1}|x) = \log p(y_1, y_2, ..., y_{t-1}|x). As we move on to time step t, we expand each of the K hypotheses (denoted as Y^k_{t-1} = {y^k_1, y^k_2, ..., y^k_{t-1}}, k \in [1, K]) by selecting its top K translations, denoted as y^{k,k'}_t, k' \in [1, K], leading to the construction of K \times K new hypotheses:

    [Y^k_{t-1}, y^{k,k'}_t], \quad k \in [1, K], \; k' \in [1, K]

The score for each of the K \times K hypotheses is computed as follows:

    S(Y^k_{t-1}, y^{k,k'}_t | x) = S(Y^k_{t-1}|x) + \log p(y^{k,k'}_t | x, Y^k_{t-1})    (14)

In a standard beam search model, the top K hypotheses are selected from the K \times K hypotheses computed in the last step, based on the score S(Y^k_{t-1}, y^{k,k'}_t | x). The remaining hypotheses are ignored as we proceed to the next time step.

We set the minimum length and maximum length to 0.75 and 1.5 times the length of the source, and the beam size to 200. To be specific, at each time step of decoding we are presented with K \times K word candidates. We first add all hypotheses for which an EOS token is generated at the current time step to the N-best list. Next we preserve the top K unfinished hypotheses and move to the next time step. We therefore keep the beam of 200 live hypotheses constant when some hypotheses are completed and taken down, by adding in more unfinished hypotheses. This leads to a final N-best list for each input that is much larger than the beam size; for example, for the development set of the English-German WMT'14 task, each input has an average of 2,500 candidates in the N-best list.

[Figure 1: Illustration of standard beam search and the proposed diversity-promoting beam search.]
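A simplified Python sketch of the standard beam search update in (14). It assumes a hypothetical callable `topk_next(prefix, src, k)` that returns the k best next tokens with their log-probabilities under p(y|x); the bookkeeping follows the description above, but details such as the length constraints are omitted for brevity.

```python
import heapq

def beam_search(src, topk_next, K=200, max_len=50, eos="</s>"):
    """Standard beam search: expand K live hypotheses into K*K candidates,
    score them with S(Y_{t-1}, y_t | x) = S(Y_{t-1}|x) + log p(y_t | x, Y_{t-1})
    (equation 14), keep the top K unfinished ones, and collect hypotheses that
    emit EOS into the N-best list."""
    live = [(0.0, [])]            # (cumulative log-probability, token prefix)
    nbest = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in live:
            for token, logp in topk_next(prefix, src, K):  # top K continuations
                candidates.append((score + logp, prefix + [token]))
        # Completed hypotheses go to the N-best list; the rest compete for the beam.
        nbest.extend(c for c in candidates if c[1][-1] == eos)
        unfinished = [c for c in candidates if c[1][-1] != eos]
        live = heapq.nlargest(K, unfinished, key=lambda c: c[0])
        if not live:
            break
    return sorted(nbest, key=lambda c: c[0], reverse=True)
```

Because finished hypotheses are moved aside while the live beam is refilled, the returned N-best list can grow well beyond K, matching the behavior described above.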
4.2 Generating a Diverse N-best List

Unfortunately, the N-best lists output by standard beam search are a poor surrogate for the entire search space (Finkel et al., 2006; Huang, 2008). The beam search algorithm can keep only a small proportion of candidates in the search space, and most of the generated translations in the N-best list are similar, differing only by punctuation or minor morphological variations, with most of the words overlapping. Because this lack of diversity in the N-best list will significantly decrease the impact of our reranking process, it is important to find a way to generate a more diverse N-best list.

We propose to change the way S(Y^k_{t-1}, y^{k,k'}_t | x) is computed in an attempt to promote diversity, as shown in Figure 1. For each of the hypotheses Y^k_{t-1} ("he" and "it"), we generate the top K translations y^{k,k'}_t, k' \in [1, K], as in the standard beam search model. Next we rank the K translated tokens generated from the same parental hypothesis based on p(y^{k,k'}_t | x, Y^k_{t-1}) in descending order: "he is" ranks first among "he is" and "he has", and "he has" ranks second; similarly for "it is" and "it has".

Next we rewrite the score for [Y^k_{t-1}, y^{k,k'}_t] by adding an additional term \gamma k', where k' denotes the ranking of the current hypothesis among its siblings (first for "he is" and "it is", second for "he has" and "it has"):

    \hat{S}(Y^k_{t-1}, y^{k,k'}_t | x) = S(Y^k_{t-1}, y^{k,k'}_t | x) - \gamma k'    (15)

The top K hypotheses are selected based on \hat{S}(Y^k_{t-1}, y^{k,k'}_t | x) as we move on to the next time step. By adding the additional term \gamma k', the model punishes lower-ranked hypotheses among siblings (hypotheses descended from the same parent). When we compare newly generated hypotheses descended from different ancestors, the model gives more credit to the top hypotheses from each of the different ancestors. For instance, even though the original score for "it is" is lower than that of "he has", the model favors the former, as the latter is more severely punished by the intra-sibling ranking term \gamma k'. The model thus generally favors choosing hypotheses from diverse parents, leading to a more diverse N-best list.

The proposed model is straightforwardly implemented with a minor adjustment to the standard beam search model (a short sketch follows at the end of this subsection). Note that decoding for neural MT models with a large batch size can be expensive because of the softmax word-prediction function; the proposed model supports batched decoding on a GPU, which significantly speeds up decoding relative to other diversity-fostering models tailored to phrase-based MT systems.

We employ the diversity evaluation metrics of Li et al. (2015) to evaluate the degree of diversity of the N-best lists: we calculate the average number of distinct unigrams (distinct-1) and bigrams (distinct-2) in the N-best list for each source sentence, scaled by the total number of tokens. By employing the diversity-promoting model with \gamma tuned on the development set based on BLEU score, the value of distinct-1 increases from 0.54% to 0.95%, and distinct-2 increases from 1.55% to 2.84% for English-German translation. A similar pattern is observed on the English-French translation task; details are omitted for brevity.
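A minimal sketch of the re-scoring rule in (15): candidates expanded from the same parent hypothesis are ranked by their conditional log-probability, and the k'-th sibling is penalized by \gamma k' before the global top-K selection. The data layout (a list of per-parent candidate lists) and the toy scores are illustrative assumptions.

```python
def diverse_rescore(expansions, gamma):
    """expansions: list over parents; each entry is a list of
    (cumulative_score, hypothesis) pairs for that parent's extensions.
    Applies the intra-sibling rank penalty of equation (15):
    S_hat = S - gamma * rank, where rank = 1 for a parent's best child."""
    rescored = []
    for siblings in expansions:
        ranked = sorted(siblings, key=lambda c: c[0], reverse=True)
        for rank, (score, hyp) in enumerate(ranked, start=1):
            rescored.append((score - gamma * rank, score, hyp))
    return rescored

# Toy example mirroring the paper's illustration: parents "he" and "it".
expansions = [
    [(-1.0, ["he", "is"]), (-1.2, ["he", "has"])],
    [(-1.3, ["it", "is"]), (-1.6, ["it", "has"])],
]
top2 = sorted(diverse_rescore(expansions, gamma=0.5), reverse=True)[:2]
print([hyp for _, _, hyp in top2])
# With the penalty, ["he", "is"] and ["it", "is"] survive, whereas plain beam
# search (gamma = 0) would keep both children of "he".
```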
4.3 Reranking

The generated N-best list is then reranked by linearly combining \log p(y|x) with \log p(x|y). The score of the source given each generated translation can be immediately computed from the previously trained p(x|y) model.

Besides the forward and reverse translation probabilities, we also consider \log p(y), the average language model probability trained from monolingual data. It is worth noting that integrating the reverse probability and the language model into reranking is not new, and has long been employed in noisy-channel models in standard MT. In the neural MT literature, recent work has demonstrated the effectiveness of reranking with a language model (Gulcehre et al., 2015).

We also consider an additional term that takes into account the length of the target (denoted as L_T) in decoding. We linearly combine these terms, making the final ranking score for a given target candidate y:

    \text{Score}(y) = \log p(y|x) + \lambda\log p(x|y) + \gamma\log p(y) + \eta L_T    (16)

We optimize \eta, \lambda and \gamma using MERT (Och, 2003) on the BLEU score (Papineni et al., 2002) of the development set.
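A short sketch of the final reranking step in (16), assuming hypothetical scoring callables for the forward model, the backward model, and the monolingual language model; in the paper the weights \lambda, \gamma and \eta are tuned with MERT against development-set BLEU, whereas here they are plain parameters.

```python
from typing import Callable, List, Sequence

def rerank_nbest(src: Sequence[str],
                 nbest: List[Sequence[str]],
                 log_p_forward: Callable,    # log p(y|x) from the forward SEQ2SEQ model
                 log_p_backward: Callable,   # log p(x|y) from the swapped model
                 log_p_lm: Callable,         # log p(y) from the monolingual LM
                 lam: float, gamma: float, eta: float) -> List[Sequence[str]]:
    """Sort N-best candidates by equation (16):
    Score(y) = log p(y|x) + lam*log p(x|y) + gamma*log p(y) + eta*len(y)."""
    def score(y):
        return (log_p_forward(y, src)
                + lam * log_p_backward(src, y)
                + gamma * log_p_lm(y)
                + eta * len(y))
    return sorted(nbest, key=score, reverse=True)
```

In practice the candidate list would come from the diverse beam search of Section 4.2, and the three feature weights would be fit on the development set as described above.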
5 Experiments

Our models are trained on the WMT'14 training dataset containing 4.5 million pairs for English-German and German-English translation, and 12 million pairs for English-French translation. For English-German translation, we limit our vocabularies to the top 50K most frequent words for both languages. For English-French translation, we keep the top 200K most frequent words for the source language and 80K for the target language. Words that are not in the vocabulary list are mapped to the universal unknown token.

For English-German and German-English translation, we use newstest2013 (3,000 sentence pairs) as the development set, and translation performance is reported in BLEU (Papineni et al., 2002) on newstest2014 (2,737 sentences). For English-French translation, we concatenate news-test-2012 and news-test-2013 to make a development set (6,003 pairs in total) and evaluate the models on news-test-2014 (3,003 pairs), as in Luong et al. (2015a). All texts are tokenized with tokenizer.perl and BLEU scores are computed with multi-bleu.perl.

5.1 Training Details for p(x|y) and p(y|x)

We trained neural models with both standard SEQ2SEQ models and attention models. We trained p(y|x) following the standard training protocols described in Sutskever et al. (2014). p(x|y) is trained identically but with sources and targets swapped.

We adopt a deep structure with four LSTM layers for encoding and four LSTM layers for decoding, each of which consists of a different set of parameters. We followed the detailed protocols from Luong et al. (2015a): each LSTM layer consists of 1,000 hidden neurons, and the dimensionality of word embeddings is set to 1,000. Other training details include: LSTM parameters and word embeddings are initialized from a uniform distribution between [-0.1, 0.1]; for English-German translation, we run 12 epochs in total and start halving the learning rate after each epoch once 8 epochs are completed; for English-French translation, the total number of epochs is set to 8, and we start halving the learning rate after 5 iterations. Batch size is set to 128; gradient clipping is adopted by scaling gradients when the norm exceeds a threshold of 5. Inputs are reversed.

Our implementation on a single GPU (Tesla K40m, 1 Kepler GK110B, 2,880 CUDA cores) processes approximately 800-1,200 tokens per second. Training on the English-German dataset (4.5 million pairs) takes roughly 12-15 days. For the French-English dataset, comprising 12 million pairs, training takes roughly 4-6 weeks.

5.2 Training p(y) from Monolingual Data

We train single-layer LSTM recurrent models with 500 units for German and French, respectively, using monolingual data. We use the News Crawl corpora from WMT13 (http://www.statmt.org/wmt13/translation-task.html) as additional training data for the monolingual language models, taking a subset of the original dataset that contains roughly 50-60 million sentences. Following (Gulcehre et al., 2015; Sennrich et al., 2015a), we remove sentences with more than 10% unknown words based on the vocabulary constructed from the parallel datasets. We adopted protocols similar to those used to train the SEQ2SEQ models, such as gradient clipping and mini-batching.

5.3 English-German Results

We report progressive performance as we add in more features for reranking. Results for the different models on the WMT'14 English-German translation task are shown in Table 1. Among all the features, the reverse probability from mutual information (i.e., p(x|y)) yields the most significant performance boost: +1.4 and +1.1 for standard SEQ2SEQ models without and with unknown word replacement, and +0.9 for attention models. (Target length has long proved to be one of the most important features in phrase-based MT due to the BLEU score's significant sensitivity to target length. We do not observe as large a performance boost here because during decoding target length is already strictly constrained: as described in Section 4.1, we only consider candidates of lengths between 0.75 and 1.5 times that of the source.) In line with (Gulcehre et al., 2015; Sennrich et al., 2015a), we observe a consistent performance boost introduced by the language model.

We see the benefit of our diverse N-best list by comparing the models that use diversity-promoting decoding with those that do not. On top of the improvements over standard beam search due to reranking, the diversity models introduce additional gains of +0.4, +0.3 and +0.3, bringing the total gains roughly up to +2.6, +2.6 and +2.1 for the different models. The unknown token replacement technique yields significant gains, in line with observations from Jean et al. (2014) and Luong et al. (2015a).

Table 1: BLEU scores from different models on the WMT'14 English-German task. UnkRep denotes applying the unknown word replacement strategy; DiverDecoding indicates that the diversity-promoting decoding model is adopted. Baseline performances are reprinted from Jean et al. (2014) and Luong et al. (2015a).

Model | Features | BLEU
Standard | p(y|x) | 13.2
Standard | p(y|x) + Length | 13.6 (+0.4)
Standard | p(y|x) + p(x|y) + Length | 15.0 (+1.4)
Standard | p(y|x) + p(x|y) + p(y) + Length | 15.4 (+0.4)
Standard | p(y|x) + p(x|y) + p(y) + Length + DiverDecoding | 15.8 (+0.4), +2.6 in total
Standard+UnkRep | p(y|x) | 14.7
Standard+UnkRep | p(y|x) + Length | 15.2 (+0.7)
Standard+UnkRep | p(y|x) + p(x|y) + Length | 16.3 (+1.1)
Standard+UnkRep | p(y|x) + p(x|y) + p(y) + Length | 16.7 (+0.4)
Standard+UnkRep | p(y|x) + p(x|y) + p(y) + Length + DiverDecoding | 17.3 (+0.3), +2.6 in total
Attention+UnkRep | p(y|x) | 20.5
Attention+UnkRep | p(y|x) + Length | 20.9 (+0.4)
Attention+UnkRep | p(y|x) + p(x|y) + Length | 21.8 (+0.9)
Attention+UnkRep | p(y|x) + p(x|y) + p(y) + Length | 22.1 (+0.3)
Attention+UnkRep | p(y|x) + p(x|y) + p(y) + Length + DiverDecoding | 22.6 (+0.3), +2.1 in total
Jean et al. (2015) (without ensemble) | | 19.4
Jean et al. (2015) (with ensemble) | | 21.6
Luong et al. (2015a) (with UnkRep, without ensemble) | | 20.9
Luong et al. (2015a) (with UnkRep, with ensemble) | | 23.0

We compare our English-German system with various others: (1) the end-to-end neural MT system from Jean et al. (2014) using a large vocabulary size; (2) models from Luong et al. (2015a) that combine different attention models. For the models described in Jean et al. (2014) and Luong et al. (2015a), we reprint their results from both the single-model setting and the ensemble setting, in which a set of (usually 8) neural models that differ in random initializations and the order of minibatches are trained and jointly contribute to the decoding process. The ensemble procedure is known to result in improved performance (Luong et al., 2015a; Jean et al., 2014; Sutskever et al., 2014).

Note that the reported results from the standard SEQ2SEQ models and attention models in Table 1 (those without considering mutual information) are from models identical in structure to the corresponding models described in Luong et al. (2015a), and achieve similar performance (13.2 vs. 14.0 for standard SEQ2SEQ models and 20.5 vs. 20.7 for attention models). Due to time and computational constraints, we did not implement an ensemble mechanism, making our results incomparable to the ensemble results in these papers.

5.4 French-English Results

Results on the WMT'14 French-English datasets are shown in Table 2, along with results reprinted from Sutskever et al. (2014) and Luong et al. (2015b). We again observe that applying mutual information yields better performance than the corresponding standard neural MT models.

Relative to the English-German dataset, the English-French translation task shows a larger gap between our new model and the vanilla models in which reranking information is not considered; our models respectively yield up to +3.2, +2.6 and +2.7 BLEU improvements over standard neural models without and with unknown word replacement, and over attention models.

Table 2: BLEU scores from different models on the WMT'14 English-French task. Google is the LSTM-based model proposed in Sutskever et al. (2014); Luong et al. (2015b) is the extension of the Google model with unknown token replacement.

Model | Features | BLEU
Standard | p(y|x) | 29.0
Standard | p(y|x) + Length | 29.7 (+0.7)
Standard | p(y|x) + p(x|y) + Length | 31.2 (+1.5)
Standard | p(y|x) + p(x|y) + p(y) + Length | 31.7 (+0.5)
Standard | p(y|x) + p(x|y) + p(y) + Length + DiverDecoding | 32.2 (+0.5), +3.2 in total
Standard+UnkRep | p(y|x) | 31.0
Standard+UnkRep | p(y|x) + Length | 31.5 (+0.5)
Standard+UnkRep | p(y|x) + p(x|y) + Length | 32.9 (+1.4)
Standard+UnkRep | p(y|x) + p(x|y) + p(y) + Length | 33.3 (+0.4)
Standard+UnkRep | p(y|x) + p(x|y) + p(y) + Length + DiverDecoding | 33.6 (+0.3), +2.6 in total
Attention+UnkRep | p(y|x) | 33.4
Attention+UnkRep | p(y|x) + Length | 34.3 (+0.9)
Attention+UnkRep | p(y|x) + p(x|y) + Length | 35.2 (+0.9)
Attention+UnkRep | p(y|x) + p(x|y) + p(y) + Length | 35.7 (+0.5)
Attention+UnkRep | p(y|x) + p(x|y) + p(y) + Length + DiverDecoding | 36.3 (+0.4), +2.7 in total
LSTM (Google) (without ensemble) | | 30.6
LSTM (Google) (with ensemble) | | 33.0
Luong et al. (2015b), UnkRep (without ensemble) | | 32.7
Luong et al. (2015b), UnkRep (with ensemble) | | 37.5
6 Discussion

In this paper, we introduce a new objective for neural MT based on the mutual dependency between the source and target sentences, inspired by recent work in neural conversation generation (Li et al., 2015). We build an approximate implementation of our model using reranking, and then, to make reranking more powerful, we introduce a new decoding method that promotes diversity in the first-pass N-best list. On English-French and English-German translation tasks, we show that the neural machine translation models trained using the proposed method perform better than the corresponding standard models, and that both the mutual information objective and the diversity-increasing decoding method contribute to the performance boost.

The new models come with the advantages of easy implementation, with sources and targets interchanged, and of offering a general solution that can be integrated into any neural generation model with minor adjustments. Indeed, our diversity-enhancing decoder can be applied to generate more diverse N-best lists for any NLP reranking task. Finding a way to introduce mutual information based decoding directly into a first-pass decoder without reranking naturally constitutes our future work.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

L. R. Bahl, Peter F. Brown, Peter V. De Souza, and Robert L. Mercer. 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In Proc. ICASSP, volume 86, pages 49-52.

Dhruv Batra, Payman Yadollahpour, Abner Guzman-Rivera, and Gregory Shakhnarovich. 2012. Diverse M-best solutions in Markov random fields. In Computer Vision - ECCV 2012, pages 1-16. Springer.

Daniel Cer, Christopher D. Manning, and Daniel Jurafsky. 2013. Positive diversity tuning for machine translation system combination. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 320-328.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Jacob Devlin and Spyros Matsoukas. 2012. Trait-based hypothesis selection for machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 528-532. Association for Computational Linguistics.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard M. Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In ACL (1), pages 1370-1380. Citeseer.

Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. 2006. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 618-626. Association for Computational Linguistics.

Kevin Gimpel, Dhruv Batra, Chris Dyer, Gregory Shakhnarovich, and Virginia Tech. 2013. A systematic exploration of diversity in machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, October.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In ACL, pages 586-594.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP, pages 1700-1709.

Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. Technical report, DTIC Document.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In EMNLP.

Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Proceedings of ACL.

Wolfgang Macherey, Franz Josef Och, Ignacio Thayer, and Jakob Uszkoreit. 2008. Lattice-based minimum error rate training for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 725-734. Association for Computational Linguistics.

Fandong Meng, Zhengdong Lu, Mingxuan Wang, Hang Li, Wenbin Jiang, and Qun Liu. 2015. Encoding source language with convolutional neural network for machine translation. arXiv preprint arXiv:1503.01838.

Franz Josef Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of ACL 2002, pages 295-302.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 160-167. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015a. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015b. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015a. Building end-to-end dialogue systems using generative hierarchical neural network models. arXiv preprint arXiv:1507.04808.

Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2015b. A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742.

Libin Shen, Jinxi Xu, and Ralph Weischedel. 2010. String-to-dependency statistical machine translation. Computational Linguistics, 36(4):649-671.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Roy W. Tromble, Shankar Kumar, Franz Och, and Wolfgang Macherey. 2008. Lattice minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 620-629. Association for Computational Linguistics.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2014. Grammar as a foreign language. arXiv preprint arXiv:1412.7449.

P. C. Woodland and D. Povey. 2002. Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech and Language, 16:25-47.

Tong Xiao, Jingbo Zhu, and Tongran Liu. 2013. Bagging and boosting statistical machine translation systems. Artificial Intelligence, 195:496-527.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.
