Modeling Coverage for Neural Machine Translation

Zhaopeng Tu†  Zhengdong Lu†  Yang Liu‡  Xiaohua Liu†  Hang Li†
†Noah's Ark Lab, Huawei Technologies, Hong Kong
{tu.zhaopeng,lu.zhengdong,liuxiaohua3,hangli.hl}@huawei.com
‡Department of Computer Science and Technology, Tsinghua University, Beijing
[email protected]

Abstract

The attention mechanism has enhanced state-of-the-art Neural Machine Translation (NMT) by jointly learning to align and translate. However, it tends to ignore past alignment information, which often leads to over-translation and under-translation. To address this problem, we propose coverage-based NMT in this paper. We maintain a coverage vector to keep track of the attention history. The coverage vector is fed to the attention model to help adjust future attention, which lets the NMT system consider more of the untranslated source words. Experiments show that the proposed approach significantly improves both translation quality and alignment quality over standard attention-based NMT.[1]

[1] Our code is publicly available at https://github.com/tuzhaopeng/NMT-Coverage.

1 Introduction

The past several years have witnessed the rapid progress of end-to-end Neural Machine Translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015). Unlike conventional Statistical Machine Translation (SMT) (Koehn et al., 2003; Chiang, 2007), NMT uses a single, large neural network to model the entire translation process. It enjoys the following advantages. First, the use of distributed representations of words can alleviate the curse of dimensionality (Bengio et al., 2003). Second, there is no need to explicitly design features to capture translation regularities, which is quite difficult in SMT; instead, NMT is capable of learning representations directly from the training data. Third, Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) enables NMT to capture long-distance reordering, which is a significant challenge in SMT.

NMT has a serious problem, however, namely the lack of coverage. In phrase-based SMT (Koehn et al., 2003), a decoder maintains a coverage vector to indicate whether a source word has been translated or not. This is important for ensuring that each source word is translated during decoding. The decoding process is completed when all source words are "covered", or translated. In NMT there is no such coverage vector, and the decoding process ends only when the end-of-sentence mark is produced. We believe that the lack of coverage might result in the following problems in conventional NMT:

1. Over-translation: some words are unnecessarily translated multiple times;

2. Under-translation: some words are mistakenly left untranslated.

Specifically, in the state-of-the-art attention-based NMT model (Bahdanau et al., 2015), generating a target word heavily depends on the relevant parts of the source sentence, and a source word is involved in the generation of all target words. As a result, over-translation and under-translation inevitably happen because the "coverage" of source words (i.e., the number of times a source word is translated into a target word) is ignored. Figure 1(a) shows an example: the Chinese word "guānbì" is over-translated to "close(d)" twice, while "bèipò" (meaning "be forced to") is mistakenly left untranslated.

In this work, we propose a coverage mechanism for NMT (NMT-COVERAGE) to alleviate the over-translation and under-translation problems. Basically, we append a coverage vector to the intermediate representations of an NMT model, which is sequentially updated after each attentive read during the decoding process, to keep track of the attention history.
Figure 1: Example translations of (a) NMT without coverage and (b) NMT with coverage. In conventional NMT without coverage, the Chinese word "guānbì" is over-translated to "close(d)" twice, while "bèipò" (meaning "be forced to") is mistakenly left untranslated. The coverage model alleviates these problems by tracking the "coverage" of source words.

The coverage vector, when entering into the attention model, can help adjust future attention and significantly improve the overall alignment between the source and target sentences. This design admits many particular cases of coverage modeling with contrasting characteristics, which all share a clear linguistic intuition and yet can be trained in a data-driven fashion. Notably, we achieve a significant improvement even by simply using the sum of previous alignment probabilities as the coverage for each word, a successful example of incorporating linguistic knowledge into neural network based NLP models.

Experiments show that NMT-COVERAGE significantly outperforms conventional attention-based NMT on both translation and alignment tasks. Figure 1(b) shows an example, in which NMT-COVERAGE alleviates the over-translation and under-translation problems that NMT without coverage suffers from.

2 Background

Our work is built on attention-based NMT (Bahdanau et al., 2015), which simultaneously conducts dynamic alignment and generation of the target sentence, as illustrated in Figure 2. It produces the translation by generating one target word y_i at each time step. Given an input sentence x = {x_1, ..., x_J} and the previously generated words {y_1, ..., y_{i−1}}, the probability of generating the next word y_i is

    P(y_i | y_{<i}, x) = softmax(g(y_{i−1}, t_i, s_i))    (1)

where g is a non-linear function and t_i is the decoding state for time step i, computed by

    t_i = f(t_{i−1}, y_{i−1}, s_i)    (2)

Here the activation function f(·) is a Gated Recurrent Unit (GRU) (Cho et al., 2014b), and s_i is a distinct source representation for time i, calculated as a weighted sum of the source annotations:

    s_i = Σ_{j=1}^{J} α_{i,j} · h_j    (3)

where h_j = [→h_j^⊤; ←h_j^⊤]^⊤ is the annotation of x_j from a bi-directional Recurrent Neural Network (RNN) (Schuster and Paliwal, 1997), and its weight α_{i,j} is computed by

    α_{i,j} = exp(e_{i,j}) / Σ_{k=1}^{J} exp(e_{i,k})    (4)

and

    e_{i,j} = a(t_{i−1}, h_j) = v_a^⊤ tanh(W_a t_{i−1} + U_a h_j)    (5)

is an attention model that scores how well y_i and h_j match.

Figure 2: Architecture of attention-based NMT. Whenever possible, we omit the source index j to make the illustration less cluttered.

With the attention model, there is no need to represent the entire source sentence with a single vector. Instead, the decoder selects parts of the source sentence to pay attention to, and thus exploits an expected annotation s_i over the possible alignments α_{i,j} for each time step i.
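For concreteness, the sketch below spells out Equations (3)-(5) in plain numpy. The function name, the toy shapes and the random initialization are ours rather than the paper's released code; it is only a minimal illustration of how the alignment weights α_{i,j} and the context vector s_i are computed from the previous decoder state and the source annotations.

```python
import numpy as np

def attention_step(t_prev, H, W_a, U_a, v_a):
    """One attentive read, following Eqs. (3)-(5).

    t_prev : (n,)    previous decoder state t_{i-1}
    H      : (J, 2n) source annotations h_1 .. h_J
    Returns the attention weights alpha_i (J,) and the context s_i (2n,).
    """
    # e_{i,j} = v_a^T tanh(W_a t_{i-1} + U_a h_j)                  (Eq. 5)
    e = np.tanh(t_prev @ W_a.T + H @ U_a.T) @ v_a
    # alpha_{i,j}: softmax over source positions                   (Eq. 4)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # s_i = sum_j alpha_{i,j} h_j                                  (Eq. 3)
    return alpha, alpha @ H

# Toy example: J = 6 source words, n = 4 decoder units, annotations of size 2n.
rng = np.random.default_rng(0)
J, n = 6, 4
H = rng.normal(size=(J, 2 * n))
W_a = rng.normal(size=(n, n))
U_a = rng.normal(size=(n, 2 * n))
v_a = rng.normal(size=n)
alpha, s = attention_step(rng.normal(size=n), H, W_a, U_a, v_a)
print(alpha.sum())  # ~1.0 -- the weights are normalized over the source positions
```

Because the weights are normalized over the source positions, a signal that discourages attention to some words (the coverage introduced next) implicitly redistributes attention toward the less attended ones.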
However, the attention model fails to take advantage of past alignment information, which has been found useful for avoiding over-translation and under-translation problems in conventional SMT (Koehn et al., 2003). For example, if a source word was translated in the past, it is less likely to be translated again and should be assigned a lower alignment probability.

3 Coverage Model for NMT

In SMT, a coverage set is maintained to keep track of which source words have been translated ("covered") in the past. Let us take x = {x_1, x_2, x_3, x_4} as an example input sentence. The initial coverage set is C = {0, 0, 0, 0}, which denotes that no source word has been translated yet. When a translation rule bp = (x_2 x_3, y_m y_{m+1}) is applied, we produce one hypothesis labelled with coverage C = {0, 1, 1, 0}, meaning that the second and third source words have been translated. The goal is to generate a translation with full coverage C = {1, 1, 1, 1}. A source word is translated when it is covered by one translation rule, and it is not allowed to be translated again in the future (i.e., hard coverage). In this way, each source word is guaranteed to be translated, and to be translated only once. As shown, coverage is essential for SMT, since it avoids gaps and overlaps in the translation of source words.

Modeling coverage is also important for attention-based NMT models, since they generally lack a mechanism to indicate whether a certain source word has been translated, and are therefore prone to "coverage" mistakes: some parts of the source sentence are translated more than once or not translated at all. For NMT models, directly modeling coverage is less straightforward, but the problem can be significantly alleviated by keeping track of the attention signal during the decoding process. The most natural way of doing so is to append a coverage vector to the annotation of each source word (i.e., h_j), which is initialized as a zero vector but updated after every attentive read of the corresponding annotation. The coverage vector is fed to the attention model to help adjust future attention, which lets the NMT system consider more of the untranslated source words, as illustrated in Figure 3.

Figure 3: Architecture of the coverage-based attention model. A coverage vector C_{i−1} is maintained to keep track of which source words have been translated before time i. Alignment decisions α_i are made jointly, taking into account the past alignment information embedded in C_{i−1}, which lets the attention model consider more of the untranslated source words.

3.1 Coverage Model

Since the coverage vector summarizes the attention record for h_j (and therefore for a small neighborhood centered at the j-th source word), it will discourage further attention to that word if it has already been heavily attended, and implicitly push the attention toward the less attended segments of the source sentence, since the attention weights are normalized to one. This can potentially solve both coverage mistakes mentioned above, when modeled and learned properly. Formally, the coverage model is given by

    C_{i,j} = g_update(C_{i−1,j}, α_{i,j}, Φ(h_j), Ψ)    (6)

where

• g_update(·) is the function that updates C_{i,j} after the new attention α_{i,j} at time step i in the decoding process;
• C_{i,j} is a d-dimensional coverage vector summarizing the history of attention up to time step i on h_j;
• Φ(h_j) is a word-specific feature with its own parameters;
• Ψ are auxiliary inputs exploited in different sorts of coverage models.

Equation 6 gives a rather general model, which can take different functional forms for g_update(·) and Φ(·), and different auxiliary inputs Ψ (e.g., the previous decoding state t_{i−1}). In the rest of this section, we give a number of representative implementations of the coverage model, which either leverage more linguistic information (Section 3.1.1) or resort to the flexibility of neural network approximation (Section 3.1.2).

3.1.1 Linguistic Coverage Model

We first consider a linguistically inspired model which has a small number of parameters as well as a clear interpretation. While the linguistically inspired coverage in NMT is similar to that in SMT, there is one key difference: it indicates to what extent each source word has been translated (i.e., soft coverage). In NMT, each target word y_i is generated from all source words, with probability α_{i,j} for source word x_j. In other words, the source word x_j is involved in the generation of all target words, and the probability of it generating target word y_i at time step i is α_{i,j}. Note that unlike in SMT, where each source word is fully translated at one decoding step, the source word x_j is partially translated at each decoding step in NMT. Therefore, the coverage at time step i denotes the ratio to which each source word has been translated so far.

We use a scalar (d = 1) to represent the linguistic coverage of each source word and employ an accumulation operation for g_update. The initial value of linguistic coverage is zero, which denotes that the corresponding source word has not been translated yet. We iteratively construct linguistic coverages through accumulation of the alignment probabilities generated by the attention model, each of which is normalized by a distinct context-dependent weight. The coverage of source word x_j at time step i is computed by

    C_{i,j} = C_{i−1,j} + (1/Φ_j) α_{i,j} = (1/Φ_j) Σ_{k=1}^{i} α_{k,j}    (7)

where Φ_j is a pre-defined weight indicating the number of target words x_j is expected to generate.
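A minimal numpy sketch of the accumulation in Equation (7) follows; the normalizers Φ_j are taken as given here (how to set them is discussed next), and the function name and toy values are ours.

```python
import numpy as np

def update_linguistic_coverage(C_prev, alpha, phi):
    """Eq. (7): C_{i,j} = C_{i-1,j} + alpha_{i,j} / Phi_j.

    C_prev : (J,) coverage after step i-1 (all zeros before decoding starts)
    alpha  : (J,) attention weights produced at step i
    phi    : (J,) expected number of target words each source word generates
    """
    return C_prev + alpha / phi

# With Phi_j = 1 the coverage is simply the running sum of attention weights.
J = 4
coverage = np.zeros(J)
phi = np.ones(J)
for alpha in (np.array([0.70, 0.20, 0.05, 0.05]),
              np.array([0.60, 0.30, 0.05, 0.05])):
    coverage = update_linguistic_coverage(coverage, alpha, phi)
print(coverage)  # ~[1.3 0.5 0.1 0.1] -- the first word is already heavily covered
```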
The simplest way is to follow Xu et al. (2015) in image-to-caption translation and fix Φ_j = 1 for all source words, which means that we directly use the sum of previous alignment probabilities, without normalization, as the coverage of each word, as done in (Cohn et al., 2016).

However, in machine translation, different types of source words may contribute differently to the generation of the target sentence. Let us take the sentence pair in Figure 1 as an example. The noun in the source sentence, "jīchǎng", is translated into one target word ("airports"), while the adjective "bèipò" is translated into three words ("were forced to"). Therefore, we need to assign a distinct Φ_j to each source word. Ideally, we expect Φ_j = Σ_{i=1}^{I} α_{i,j}, with I being the total number of time steps in decoding. However, this desired value is not available before decoding and is thus not suitable in this scenario.

Fertility.  To predict Φ_j, we introduce the concept of fertility, which was first proposed in word-level SMT (Brown et al., 1993). The fertility of source word x_j tells how many target words x_j produces. In SMT, the fertility is a random variable Φ_j, whose distribution p(Φ_j = φ) is determined by the parameters of word alignment models (e.g., the IBM models). In this work, we simplify and adapt fertility from the original model and compute the fertility Φ_j by[2]

    Φ_j = N(x_j | x) = N · σ(U_f h_j)    (8)

where N ∈ ℝ is a predefined constant denoting the maximum number of target words one source word can produce, σ(·) is a logistic sigmoid function, and U_f ∈ ℝ^{1×2n} is the weight matrix. Here we use h_j to denote (x_j | x), since h_j contains information about the whole input sentence with a strong focus on the parts surrounding x_j (Bahdanau et al., 2015). Since Φ_j does not depend on i, we can pre-compute it before decoding to minimize the computational cost.

[2] Fertility in SMT is a random variable with a set of fertility probabilities n(Φ_j | x_j) = p(Φ_j | Φ_{<j}, x), which depend on the fertilities of the previous source words. To simplify the calculation and adapt it to the attention model in NMT, we define the fertility in NMT as a constant number, which is independent of previous fertilities.
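The fertility predictor of Equation (8) is computed once per sentence from the annotations; a minimal sketch with hypothetical names and toy shapes is given below.

```python
import numpy as np

def predict_fertility(H, U_f, N=2.0):
    """Eq. (8): Phi_j = N * sigmoid(U_f h_j), pre-computed before decoding.

    H   : (J, 2n) source annotations
    U_f : (2n,)   fertility weights (the 1 x 2n matrix U_f of the paper)
    N   : maximum number of target words one source word can produce
    """
    return N / (1.0 + np.exp(-(H @ U_f)))  # each Phi_j lies in (0, N)

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 8))              # J = 5 source words, 2n = 8
phi = predict_fertility(H, rng.normal(size=8), N=2.0)
print(phi)                               # one expected fertility per source word
```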
3.1.2 Neural Network Based Coverage Model

We next consider a Neural Network (NN) based coverage model. When C_{i,j} is a vector (d > 1) and g_update(·) is a neural network, we in fact have an RNN model for coverage, as illustrated in Figure 4. In this work, we take the following form:

    C_{i,j} = f(C_{i−1,j}, α_{i,j}, h_j, t_{i−1})

where f(·) is a nonlinear activation function and t_{i−1} is an auxiliary input that encodes past translation information. Note that we leave out the word-specific feature function Φ(·) and only take the input annotation h_j as the input to the coverage RNN. It is important to emphasize that the NN-based coverage model can be fed with arbitrary inputs, such as the previous attentional context s_{i−1}. Here we only employ C_{i−1,j} for past alignment information, t_{i−1} for past translation information, and h_j for the word-specific bias.[3]

[3] In our preliminary experiments, considering more inputs (e.g., current and previous attentional contexts, unnormalized attention weights e_{i,j}) does not always lead to better translation quality. Possible reasons include: 1) the inputs contain duplicate information, and 2) more inputs introduce more back-propagation paths and therefore make the model more difficult to train. In our experience, one principle is to feed the coverage model only inputs that contain distinct information and are complementary to each other.

Figure 4: NN-based coverage model.

Gating.  The neural function f(·) can be either a simple activation function such as tanh or a gating function, which proves useful for capturing long-distance dependencies. In this work, we adopt a GRU for the gating activation, since it is simple yet powerful (Chung et al., 2014). Please refer to (Cho et al., 2014b) for more details about the GRU.

Discussion.  Intuitively, the two types of models summarize coverage information in "different languages". Linguistic models summarize coverage information in human language, which has a clear interpretation to humans. Neural models encode coverage information in a "neural language", which can be "understood" by neural networks, letting them decide how to make use of the encoded coverage information.
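The paper specifies that f(·) is a GRU over C_{i−1,j}, α_{i,j}, h_j and t_{i−1}, but does not spell out the gate parameterization here, so the sketch below is only one plausible instantiation: a standard GRU whose input is the concatenation of the attention weight, the annotation and the previous decoder state. All parameter names and shapes are our own assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_coverage_update(C_prev, alpha, h, t_prev, P):
    """One GRU-style update of the d-dimensional coverage of a single source word.

    C_prev : (d,)  coverage C_{i-1,j}
    alpha  : float attention weight alpha_{i,j}
    h      : (2n,) annotation h_j
    t_prev : (n,)  previous decoder state t_{i-1}
    P      : dict of weight matrices (an assumed parameterization; the paper only
             states that f(.) is a GRU over these inputs)
    """
    x = np.concatenate(([alpha], h, t_prev))           # gate input
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ C_prev)        # update gate
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ C_prev)        # reset gate
    c_new = np.tanh(P["W"] @ x + P["U"] @ (r * C_prev))
    return (1.0 - z) * C_prev + z * c_new

# Toy instantiation: d = 10 coverage units, n = 4 decoder units.
rng = np.random.default_rng(2)
d, n = 10, 4
x_dim = 1 + 2 * n + n
P = {k: rng.normal(scale=0.1, size=(d, x_dim)) for k in ("Wz", "Wr", "W")}
P.update({k: rng.normal(scale=0.1, size=(d, d)) for k in ("Uz", "Ur", "U")})
C = gru_coverage_update(np.zeros(d), 0.8, rng.normal(size=2 * n),
                        rng.normal(size=n), P)
```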
3.2 Integrating Coverage into NMT

Although the attention-based model has the capability of jointly making alignment and translation, it does not take translation history into consideration. Specifically, a source word that has significantly contributed to the generation of target words in the past should be assigned lower alignment probabilities, which may not be the case in attention-based NMT. To address this problem, we propose to calculate the alignment probabilities by incorporating the past alignment information embedded in the coverage model.

Intuitively, at each time step i in the decoding phase, the coverage from time step (i − 1) serves as an additional input to the attention model, which provides complementary information about how likely the source words are to have been translated in the past. We expect the coverage information to guide the attention model to focus more on untranslated source words (i.e., assign them higher alignment probabilities). In practice, we find that the coverage model does fulfill this expectation (see Section 5): the translated ratios of source words from linguistic coverages negatively correlate with the corresponding alignment probabilities.

More formally, we rewrite the attention model in Equation 5 as

    e_{i,j} = a(t_{i−1}, h_j, C_{i−1,j}) = v_a^⊤ tanh(W_a t_{i−1} + U_a h_j + V_a C_{i−1,j})

where C_{i−1,j} is the coverage of source word x_j before time i, and V_a ∈ ℝ^{n×d} is the weight matrix for coverage, with n and d being the numbers of hidden units and coverage units, respectively.
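The coverage-aware score differs from Equation (5) only by the extra term V_a C_{i−1,j}. A minimal numpy sketch with hypothetical toy shapes follows; with randomly initialized weights the coverage term has no particular direction, and it is the learned V_a that ends up pushing attention away from already covered words.

```python
import numpy as np

def coverage_attention(t_prev, H, C_prev, W_a, U_a, V_a, v_a):
    """e_{i,j} = v_a^T tanh(W_a t_{i-1} + U_a h_j + V_a C_{i-1,j}), then softmax.

    t_prev : (n,)    previous decoder state
    H      : (J, 2n) source annotations
    C_prev : (J, d)  coverage of every source word before time i
    V_a    : (n, d)  coverage weight matrix
    """
    e = np.tanh(t_prev @ W_a.T + H @ U_a.T + C_prev @ V_a.T) @ v_a
    alpha = np.exp(e - e.max())
    return alpha / alpha.sum()

rng = np.random.default_rng(3)
J, n, d = 6, 4, 10
alpha = coverage_attention(rng.normal(size=n),
                           rng.normal(size=(J, 2 * n)),
                           np.zeros((J, d)),           # no word attended yet
                           rng.normal(size=(n, n)),
                           rng.normal(size=(n, 2 * n)),
                           rng.normal(size=(n, d)),
                           rng.normal(size=n))
```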
4 Training

We take end-to-end learning for the NMT-COVERAGE model, which learns not only the parameters of the "original" NMT (i.e., θ for the encoding RNN, the decoding RNN, and the attention model) but also the parameters for coverage modeling (i.e., η for the annotation and guidance of attention). More specifically, we choose to maximize the likelihood of the reference sentences, as in most other NMT models (see, however, (Shen et al., 2016)):

    (θ*, η*) = argmax_{θ,η} Σ_{n=1}^{N} log P(y_n | x_n; θ, η)    (9)

No auxiliary objective.  For the coverage model with a clearer linguistic interpretation (Section 3.1.1), it is possible to inject an auxiliary objective function on some intermediate representation. More specifically, we may have the following objective:

    (θ*, η*) = argmax_{θ,η} Σ_{n=1}^{N} { log P(y_n | x_n; θ, η) − λ·{ Σ_{j=1}^{J} (Φ_j − Σ_{i=1}^{I} α_{i,j})² ; η } }

where the term { Σ_{j=1}^{J} (Φ_j − Σ_{i=1}^{I} α_{i,j})² ; η } penalizes the discrepancy between the sum of alignment probabilities and the expected fertility for linguistic coverage. This is similar to the more explicit training for fertility in Xu et al. (2015), which encourages the model to pay equal attention to every part of the image (i.e., Φ_j = 1). However, our empirical study shows that the combined objective consistently worsens translation quality while only slightly improving alignment quality.

Our training strategy poses fewer constraints on the dependency between Φ_j and the attention than the more explicit strategy taken in (Xu et al., 2015). We let the objective associated with translation quality (i.e., the likelihood) drive the training, as in Equation 9. This strategy is arguably advantageous, since the attention weight on a hidden state h_j cannot be interpreted as the proportion of the corresponding word being translated in the target sentence. For one thing, the hidden state h_j, after the transformation by the encoding RNN, bears contextual information from other parts of the source sentence, and thus loses the rigid correspondence with the corresponding word. Therefore, penalizing the discrepancy between the sum of alignment probabilities and the expected fertility does not hold in this scenario.

5 Experiments

5.1 Setup

We carry out experiments on a Chinese-English translation task. Our training data for the translation task consists of 1.25M sentence pairs extracted from LDC corpora[4], with 27.9M Chinese words and 34.5M English words respectively. We choose the NIST 2002 dataset as our development set, and the NIST 2005, 2006 and 2008 datasets as our test sets. We carry out experiments on the alignment task with the evaluation dataset from (Liu and Sun, 2015), which contains 900 manually aligned Chinese-English sentence pairs.

[4] The corpora include LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06.

We use the case-insensitive 4-gram NIST BLEU score (Papineni et al., 2002) for the translation task, and the alignment error rate (AER) (Och and Ney, 2003) for the alignment task. To better estimate the quality of the soft alignment probabilities generated by NMT, we propose a variant of AER, named SAER:

    SAER = 1 − (|M_A × M_S| + |M_A × M_P|) / (|M_A| + |M_S|)

where A is a candidate alignment, and S and P are the sets of sure and possible links in the reference alignment, respectively (S ⊆ P). M denotes an alignment matrix, and for both M_S and M_P we assign the elements that correspond to the existing links in S and P probability 1, and the other elements probability 0. In this way, we are able to better evaluate the quality of the soft alignments produced by attention-based NMT. We use the sign-test (Collins et al., 2005) for statistical significance testing.
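Assuming the products in the SAER definition are taken element-wise and |·| sums the matrix entries (the soft analogue of the set intersections in standard AER), the metric can be computed as in the sketch below; the function name and the toy matrices are ours.

```python
import numpy as np

def saer(M_A, M_S, M_P):
    """Soft alignment error rate over I x J alignment matrices.

    M_A : soft alignment probabilities produced by the model
    M_S : 1.0 where a sure reference link exists, else 0.0
    M_P : 1.0 where a possible reference link exists (S is a subset of P), else 0.0
    """
    hits = (M_A * M_S).sum() + (M_A * M_P).sum()
    return 1.0 - hits / (M_A.sum() + M_S.sum())

# Toy 2 x 2 example: the model puts most of its mass on the sure diagonal links.
M_S = np.eye(2)
M_P = np.ones((2, 2))          # every link is at least "possible"
M_A = np.array([[0.9, 0.1],
                [0.2, 0.8]])
print(saer(M_A, M_S, M_P))     # ~0.075; lower is better
```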
For efficient training of the neural networks, we limit the source and target vocabularies to the most frequent 30K words in Chinese and English, covering approximately 97.7% and 99.3% of the two corpora respectively. All out-of-vocabulary words are mapped to a special token UNK. We set N = 2 for the fertility model in the linguistic coverages. We train each model on the sentences of length up to 80 words in the training data. The word embedding dimension is 620 and the size of a hidden layer is 1000. All other settings are the same as in (Bahdanau et al., 2015).

We compare our method with two state-of-the-art models of SMT and NMT[5]:

• Moses (Koehn et al., 2007): an open source phrase-based translation system with default configuration and a 4-gram language model trained on the target portion of the training data.

• GroundHog (Bahdanau et al., 2015): an attention-based NMT system.

[5] There has been recent progress on aggregating multiple models or enlarging the vocabulary (e.g., in (Jean et al., 2015)), but here we focus on the generic models.

5.2 Translation Quality

Table 1 shows the translation performance measured in BLEU score. Clearly the proposed NMT-COVERAGE significantly improves the translation quality in all cases, although there are still considerable differences among the variants.

Table 1: Evaluation of translation quality. d denotes the dimension of the NN-based coverages, and † and ‡ indicate statistically significant differences (p < 0.01) from GroundHog and Moses, respectively. "+" is on top of the baseline system GroundHog.

#  System                                  #Params  MT05     MT06     MT08     Avg.
1  Moses                                   –        31.37    30.85    23.01    28.41
2  GroundHog                               84.3M    30.61    31.12    23.23    28.32
3  + Linguistic coverage w/o fertility     +1K      31.26†   32.16†‡  24.84†‡  29.42
4  + Linguistic coverage w/ fertility      +3K      32.36†‡  32.31†‡  24.91†‡  29.86
5  + NN-based coverage w/o gating (d=1)    +4K      31.94†‡  32.11†‡  23.31    29.12
6  + NN-based coverage w/ gating (d=1)     +10K     31.94†‡  32.16†‡  24.67†‡  29.59
7  + NN-based coverage w/ gating (d=10)    +100K    32.73†‡  32.47†‡  25.23†‡  30.14

Parameters.  The coverage model introduces few parameters. The baseline model (i.e., GroundHog) has 84.3M parameters. The linguistic coverage using fertility introduces 3K parameters (2K for the fertility model), and the NN-based coverage with gating introduces 10K×d parameters (6K×d for gating), where d is the dimension of the coverage vector. In this work, the most complex coverage model introduces only 0.1M additional parameters, which is quite small compared to the number of parameters in the existing model (i.e., 84.3M).

Speed.  Introducing the coverage model slows down the training speed, but not significantly. When running on a single Tesla K80 GPU device, the speed of the baseline model is 960 target words per second. System 4 ("+ Linguistic coverage with fertility") has a speed of 870 words per second, while System 7 ("+ NN-based coverage (d=10)") achieves a speed of 800 words per second.

Linguistic Coverages (Rows 3 and 4).  Two observations can be made. First, the simplest linguistic coverage (Row 3) already significantly improves the translation performance by 1.1 BLEU points, indicating that coverage information is very important to the attention model. Second, incorporating the fertility model boosts the performance further by better estimating the covered ratios of the source words.

NN-based Coverages (Rows 5-7).  (1) Gating (Rows 5 and 6): both variants of NN-based coverage outperform GroundHog, with average gains of 0.8 and 1.3 BLEU points, respectively. Introducing the gating activation function improves the performance of the coverage models, which is consistent with the results on other tasks (Chung et al., 2014). (2) Coverage dimensions (Rows 6 and 7): increasing the dimension of the coverage model further improves the translation performance by 0.6 BLEU points, at the cost of introducing more parameters (e.g., from 10K to 100K).[6]

[6] In a pilot study, further increasing the coverage dimension only slightly improved the translation performance. One possible reason is that encoding the relatively simple coverage information does not require many dimensions.

Subjective Evaluation.  We also conduct a subjective evaluation to validate the benefit of incorporating coverage. Two human evaluators are asked to evaluate the translations of 200 source sentences randomly sampled from the test sets, without knowing which system each translation comes from. Table 2 shows the results of the subjective evaluation on translation adequacy and fluency.[7] GroundHog has low adequacy, since 25.0% of the source words are under-translated. This is mainly due to serious under-translation problems on long sentences that consist of several sub-sentences, some of which are completely ignored. Incorporating coverage significantly alleviates these problems, reducing under-translation and over-translation errors by 33.2% and 40.0% respectively. Benefiting from this, the coverage model improves both translation adequacy and fluency by around 0.2 points.

[7] Fluency measures whether the translation is fluent, while adequacy measures whether the translation is faithful to the original sentence (Snover et al., 2009).

Table 2: Subjective evaluation of translation adequacy and fluency. The numbers in the last two columns denote the percentages of source words that are under-translated and over-translated, respectively.

Model                        Adequacy  Fluency  Under-Translation  Over-Translation
GroundHog                    3.06      3.54     25.0%              4.5%
+ NN cov. w/ gating (d=10)   3.28      3.73     16.7%              2.7%
5.3 Alignment Quality

Table 3 lists the alignment performance. We find that coverage information improves the attention model as expected, by maintaining an annotation summarizing the attention history on each source word. More specifically, linguistic coverage with fertility significantly reduces alignment errors under both metrics, and fertility plays an important role in this. NN-based coverages, however, do not significantly reduce alignment errors until the coverage dimension is increased from 1 to 10, indicating that NN-based models need slightly more dimensions to encode the coverage information.

Table 3: Evaluation of alignment quality. The lower the score, the better the alignment quality.

System                        SAER    AER
GroundHog                     67.00   54.67
+ Ling. cov. w/o fertility    66.75   53.55
+ Ling. cov. w/ fertility     64.85   52.13
+ NN cov. w/o gating (d=1)    67.10   54.46
+ NN cov. w/ gating (d=1)     66.30   53.51
+ NN cov. w/ gating (d=10)    64.25   50.50

Figure 5 shows an example. The coverage mechanism does meet the expectation: the alignments are more concentrated and, most importantly, translated source words are less likely to contribute to the generation of subsequent target words. For example, the first four Chinese words are assigned lower alignment probabilities (i.e., darker color) after the corresponding translation "romania reinforces old buildings" is produced.

Figure 5: Example alignments of (a) GroundHog and (b) + NN cov. w/ gating (d=10). With the coverage mechanism, translated source words are less likely to contribute to the generation of subsequent target words (e.g., the top-right corner for the first four Chinese words).

5.4 Effects on Long Sentences

Following Bahdanau et al. (2015), we group sentences of similar lengths together and compute the BLEU score and the averaged translation length for each group, as shown in Figure 6. Cho et al. (2014a) show that the performance of GroundHog drops rapidly as the length of the input sentence increases. Our results confirm these findings. One main reason is that GroundHog produces much shorter translations on longer sentences (e.g., > 40 words, see the right panel in Figure 6), and thus faces a serious under-translation problem. NMT-COVERAGE alleviates this problem by incorporating coverage information into the attention model, which in general pushes the attention to untranslated parts of the source sentence and implicitly discourages early stopping of decoding. It is worth emphasizing that both NN-based coverage (with gating, d = 10) and linguistic coverage (with fertility) achieve similar performance on long sentences, reconfirming our claim that the two variants improve the attention model in their own ways.

Figure 6: Performance of the generated translations with respect to the lengths of the input sentences. Coverage models alleviate under-translation by producing longer translations on long sentences.

As an example, consider the following source sentence from the test set:

    qiáodān běn sàijì píngjūn défēn 24.3 fēn , tā zài sān zhōu qián jiēshòu shǒushù , qiúduì zài cǐ qījiān 4 shèng 8 fù .

GroundHog translates it into:

    jordan achieved an average score of eight weeks ahead with a surgical operation three weeks ago .

in which the sub-sentence ", qiúduì zài cǐ qījiān 4 shèng 8 fù" is under-translated. With the (NN-based) coverage mechanism, NMT-COVERAGE translates it into:

    jordan 's average score points to UNK this year . he received surgery before three weeks , with a team in the period of 4 to 8 .

in which the under-translation is rectified.

The quantitative and qualitative results show that the coverage models indeed help to alleviate under-translation, especially for long sentences consisting of several sub-sentences.
6 Related Work

Our work is inspired by recent work on improving attention-based NMT with techniques that have been successfully applied to SMT. Following the success of Minimum Risk Training (MRT) in SMT (Och, 2003), Shen et al. (2016) proposed MRT for end-to-end NMT to optimize model parameters directly with respect to evaluation metrics. Based on the observation that attention-based NMT only captures partial aspects of attentional regularities, Cheng et al. (2016) proposed agreement-based learning (Liang et al., 2006) to encourage bidirectional attention models to agree on parameterized alignment matrices. Along the same direction, inspired by the coverage mechanism in SMT, we propose a coverage-based approach to NMT to alleviate the over-translation and under-translation problems.

Independently from our work, Cohn et al. (2016) and Feng et al. (2016) made use of the concept of "fertility" for the attention model, which is similar in spirit to our method of building the linguistically inspired coverage with fertility. Cohn et al. (2016) introduced a feature-based fertility that includes the total alignment scores for the surrounding source words. In contrast, we predict fertility before decoding, which works as a normalizer to better estimate the coverage ratio of each source word. Feng et al. (2016) used the previous attentional context to represent implicit fertility and passed it to the attention model, which is in essence similar to the input-feed method proposed in (Luong et al., 2015). Comparatively, we predict an explicit fertility for each source word based on its encoding annotation, and incorporate it into the linguistically inspired coverage for the attention model.

7 Conclusion

We have presented an approach for enhancing NMT, which maintains and utilizes a coverage vector to indicate whether each source word has been translated or not. By encouraging NMT to pay less attention to translated words and more attention to untranslated words, our approach alleviates the serious over-translation and under-translation problems that traditional attention-based NMT suffers from. We propose two variants of coverage models: linguistic coverage, which leverages more linguistic information, and NN-based coverage, which resorts to the flexibility of neural network approximation. Experimental results show that both variants achieve significant improvements in terms of translation quality and alignment quality over NMT without coverage.

Acknowledgement

This work is supported by China National 973 project 2014CB340301. Yang Liu is supported by the National Natural Science Foundation of China (No. 61522204) and the 863 Program (2015AA011808). We thank the anonymous reviewers for their insightful comments.
References [Koehnetal.2007] Philipp Koehn, Hieu Hoang, [Bahdanauetal.2015] Dzmitry Bahdanau, Kyunghyun Alexandra Birch, Chris Callison-Burch, Marcello Cho, and Yoshua Bengio. 2015. Neural machine Federico, Nicola Bertoldi, Brooke Cowan, Wade translationbyjointlylearningtoalignandtranslate. Shen, Christine Moran, Richard Zens, Chris Dyer, ICLR2015. Ondrej Bojar, Alexandra Constantin, and Evan [Bengioetal.2003] YoshuaBengio, Re´jeanDucharme, Herbst. 2007. Moses: open source toolkit for PascalVincent,andChristianJanvin. 2003. Aneu- statisticalmachinetranslation. InACL2007. ralprobabilisticlanguagemodel. JMLR.