Incorporating Global Visual Features into Attention-Based Neural Machine Translation

Iacer Calixto, ADAPT Centre, Dublin City University, Glasnevin, Dublin 9
Qun Liu, ADAPT Centre, Dublin City University, Glasnevin, Dublin 9
Nick Campbell, ADAPT Centre, Trinity College Dublin, College Green, Dublin 2
[email protected]

arXiv:1701.06521v1 [cs.CL] 23 Jan 2017

Abstract

We introduce multi-modal, attention-based neural machine translation (NMT) models which incorporate visual features into different parts of both the encoder and the decoder. We utilise global image features extracted using a pre-trained convolutional neural network and incorporate them (i) as words in the source sentence, (ii) to initialise the encoder hidden state, and (iii) as additional data to initialise the decoder hidden state. In our experiments, we evaluate how these different strategies to incorporate global image features compare and which ones perform best. We also study the impact that adding synthetic multi-modal, multilingual data brings and find that the additional data have a positive impact on multi-modal models. We report new state-of-the-art results and our best models also significantly improve on a comparable phrase-based Statistical MT (PBSMT) model trained on the Multi30k dataset according to all metrics evaluated. To the best of our knowledge, it is the first time a purely neural model significantly improves over a PBSMT model on all metrics evaluated on this dataset.

1 Introduction

Neural Machine Translation (NMT) has recently been proposed as an instantiation of the sequence to sequence (seq2seq) learning problem (Kalchbrenner and Blunsom, 2013; Cho et al., 2014b; Sutskever et al., 2014). In this problem, each training example consists of one source and one target variable-length sequence, with no prior information regarding the alignments between the two. A model is trained to translate sequences in the source language into corresponding sequences in the target. This framework has been successfully used in many different tasks, such as handwritten text generation (Graves, 2013), image description generation (Hodosh et al., 2013; Kiros et al., 2014; Mao et al., 2014; Elliott et al., 2015; Karpathy and Fei-Fei, 2015; Vinyals et al., 2015), machine translation (Cho et al., 2014b; Sutskever et al., 2014) and video description generation (Donahue et al., 2015; Venugopalan et al., 2015).

Recently, there has been an increase in the number of natural language generation models that explicitly use attention-based decoders, i.e. decoders that model an intra-sequential mapping between source and target representations. For instance, Xu et al. (2015) proposed an attention-based model for the task of image description generation where the model learns to attend to specific parts of an image (the source) as it generates its description (the target). In MT, one can intuitively interpret this attention mechanism as inducing an alignment between source and target sentences, as first proposed by Bahdanau et al. (2015). The common idea is to explicitly frame a learning task in which the decoder learns to attend to the relevant parts of the source sequence when generating each part of the target sequence.

We are inspired by recent successes in using attention-based models in both image description generation and NMT. Our main goal in this work is to propose end-to-end multi-modal NMT models which effectively incorporate visual features in different parts of the attention-based NMT framework. The main contributions of our work are:

• We propose novel attention-based multi-modal NMT models which incorporate visual features into the encoder and the decoder.

• We discuss the impact that adding synthetic multi-modal and multilingual data brings to multi-modal NMT.

• We show that images bring useful information to an NMT model and report state-of-the-art results.

One additional contribution of our work is that we corroborate previous findings by Vinyals et al. (2015) that suggested that using image features directly as additional context to update the hidden state of the decoder (at each time step) leads to overfitting, ultimately preventing learning.

The remainder of this paper is structured as follows. In §1.1 we briefly discuss relevant previous work. We then review the attention-based NMT framework and expand it into different multi-modal NMT models (§2). In §3 we introduce the datasets we use in our experiments. In §4 we detail the hyperparameters, parameter initialisation and other relevant details of our models. Finally, in §5 we draw conclusions and provide some avenues for future work.

1.1 Related work

Attention-based encoder-decoder models for MT have been actively investigated in recent years. Some researchers have studied how to improve attention mechanisms (Luong et al., 2015; Tu et al., 2016) and how to train attention-based models to translate between many languages (Dong et al., 2015; Firat et al., 2016).

There has been some previous related work on using images in tasks involving multilingual and multi-modal natural language generation. Calixto et al. (2012) studied how the visual context of a textual description can be helpful in the disambiguation of Statistical MT (SMT) systems. Hitschler et al. (2016) used image features for re-ranking translations of image descriptions generated by an SMT model and reported significant improvements. Elliott et al. (2015) generated multilingual descriptions of images by learning and transferring features between two independent, non-attentive neural image description models. Luong et al. (2016) proposed a multi-task learning approach and incorporated neural image description as an auxiliary task to sequence-to-sequence NMT and improved translations in the main translation task.

Multi-modal MT has recently been addressed by the MT community in the form of a shared task (Specia et al., 2016). We note that in the official results of this first shared task no submissions based on a purely neural architecture could improve on the phrase-based SMT (PBSMT) baseline. Nevertheless, researchers have proposed to include global visual features in re-ranking n-best lists generated by a PBSMT system or directly in a purely NMT framework with some success (Caglayan et al., 2016; Calixto et al., 2016; Libovický et al., 2016; Shah et al., 2016). The best results achieved by a purely NMT model in this shared task are those of Huang et al. (2016), who proposed to use global and regional image features extracted with the VGG19 network.

Similarly to one model we propose,¹ they extract global features for an image, project these features into the vector space of the source words and then add the result as a word in the input sequence. Their best model improves over a strong NMT baseline and is comparable to results obtained with a PBSMT model trained on the same data. For that reason, their models are used as baselines in our experiments. Next, we point out some key differences between their models and ours.

¹ This idea has been developed independently by both research groups.

Architecture. Their implementation is based on the attention-based model of Luong et al. (2015), which has some differences to that of Bahdanau et al. (2015), used in our work (§2.1). Their encoder is a single-layer unidirectional LSTM and they use the last hidden state of the encoder to initialise the decoder's hidden state, therefore indirectly using the image features to do so. We use a bi-directional recurrent neural network (RNN) with GRU (Cho et al., 2014a) as our encoder, better encoding the semantics of the source sentence.

Image features. We include image features separately either as a word in the source sentence (§2.2.1) or directly for encoder (§2.2.2) or decoder initialisation (§2.2.3), whereas Huang et al. (2016) only use them as a word. We also show it is better to include an image exclusively for the encoder or the decoder initialisation (Tables 1 and 2).

Data. Huang et al. (2016) use object detections obtained with the RCNN of Girshick et al. (2014) as additional data, whereas we study the impact that additional back-translated data brings.

Performance. All our models outperform those of Huang et al. (2016) according to all metrics evaluated, even when they use additional object detections. If we use additional back-translated data, the difference becomes even larger.

2 Attention-based NMT

In this section, we briefly review the attention-based NMT framework (§2.1) and expand it into a multi-modal NMT framework (§2.2).

2.1 Text-only attention-based NMT

We follow the notation of Bahdanau et al. (2015) and Firat et al. (2016) throughout this section. Given a source sequence $X = (x_1, x_2, \cdots, x_N)$ and its translation $Y = (y_1, y_2, \cdots, y_M)$, an NMT model aims at building a single neural network that translates $X$ into $Y$ by directly learning to model $p(Y \mid X)$. Each $x_i$ is a row index in a source lookup matrix $W_x \in \mathbb{R}^{|V_x| \times d_x}$ (the source word embeddings matrix) and each $y_j$ is an index in a target lookup matrix $W_y \in \mathbb{R}^{|V_y| \times d_y}$ (the target word embeddings matrix). $V_x$ and $V_y$ are source and target vocabularies and $d_x$ and $d_y$ are source and target word embeddings dimensionalities, respectively.

A bidirectional RNN with GRU is used as the encoder. A forward RNN $\overrightarrow{\Phi}_{enc}$ reads $X$ word by word, from left to right, and generates a sequence of forward annotation vectors $(\overrightarrow{h}_1, \overrightarrow{h}_2, \cdots, \overrightarrow{h}_N)$, one at each encoder time step $i \in [1, N]$. Similarly, a backward RNN $\overleftarrow{\Phi}_{enc}$ reads $X$ from right to left, word by word, and generates a sequence of backward annotation vectors $(\overleftarrow{h}_1, \overleftarrow{h}_2, \cdots, \overleftarrow{h}_N)$, as in (1):

$$\overrightarrow{h}_i = \overrightarrow{\Phi}_{enc}\big(W_x[x_i], \overrightarrow{h}_{i-1}\big), \qquad \overleftarrow{h}_i = \overleftarrow{\Phi}_{enc}\big(W_x[x_i], \overleftarrow{h}_{i+1}\big). \tag{1}$$

The final annotation vector for a given time step $i$ is the concatenation of forward and backward vectors $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.

In other words, each source sequence $X$ is encoded into a sequence of annotation vectors $h = (h_1, h_2, \cdots, h_N)$, which are in turn used by the decoder: essentially a neural language model (LM) (Bengio et al., 2003) conditioned on the previously emitted words and the source sentence via an attention mechanism.

At each time step $t$ of the decoder, we compute a time-dependent context vector $c_t$ based on the annotation vectors $h$, the decoder's previous hidden state $s_{t-1}$ and the target word $\tilde{y}_{t-1}$ emitted by the decoder in the previous time step.²

² At training time, the correct previous target word $y_{t-1}$ is known and therefore used instead of $\tilde{y}_{t-1}$. At test or inference time, $y_{t-1}$ is not known and $\tilde{y}_{t-1}$ is used instead. Bengio et al. (2015) discussed problems that may arise from this difference between training and inference distributions.

We follow Bahdanau et al. (2015) and use a single-layer feed-forward network to compute an expected alignment $e_{t,i}$ between each source annotation vector $h_i$ and the target word to be emitted at the current time step $t$, as in (2):

$$e_{t,i} = v_a^{\top} \tanh(U_a s_{t-1} + W_a h_i). \tag{2}$$

In Equation (3), these expected alignments are further normalised and converted into probabilities:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{N} \exp(e_{t,j})}, \tag{3}$$

where $\alpha_{t,i}$ are called the model's attention weights, which are in turn used in computing the time-dependent context vector $c_t = \sum_{i=1}^{N} \alpha_{t,i} h_i$. Finally, the context vector $c_t$ is used in computing the decoder's hidden state $s_t$ for the current time step $t$, as shown in Equation (4):

$$s_t = \Phi_{dec}\big(s_{t-1}, W_y[\tilde{y}_{t-1}], c_t\big), \tag{4}$$

where $s_{t-1}$ is the decoder's previous hidden state, $W_y[\tilde{y}_{t-1}]$ is the embedding of the word emitted in the previous time step, and $c_t$ is the updated time-dependent context vector. In Figure 1 we illustrate the computation of the decoder's hidden state $s_t$.

Figure 1: Computation of the decoder's hidden state $s_t$ using the attention mechanism.

We use a single-layer feed-forward neural network to initialise the decoder's hidden state $s_0$ at time step $t = 0$ and feed it the concatenation of the last hidden states of the encoder's forward RNN ($\overrightarrow{\Phi}_{enc}$) and backward RNN ($\overleftarrow{\Phi}_{enc}$), as in (5):

$$s_0 = \tanh\big(W_{di}[\overleftarrow{h}_1; \overrightarrow{h}_N] + b_{di}\big), \tag{5}$$

where $W_{di}$ and $b_{di}$ are model parameters. Since RNNs normally better store information about recent inputs in comparison to more distant ones (Hochreiter and Schmidhuber, 1997; Bahdanau et al., 2015), we expect to initialise the decoder's hidden state with a strong source sentence representation, i.e. a representation with a strong focus on both the first and the last tokens in the source sentence.

2.2 Multi-modal NMT (MNMT)

Our models can be seen as expansions of the attention-based NMT framework described in §2.1 with the addition of a visual component to incorporate image features.

Simonyan and Zisserman (2014) trained and evaluated an extensive set of deep convolutional neural network (CNN) models for classifying images into one out of the 1000 classes in ImageNet (Russakovsky et al., 2015). We use their 19-layer VGG network (VGG19) to extract image feature vectors for all images in our dataset. We feed an image to the pre-trained VGG19 network and use the 4096D activations of the penultimate fully-connected layer FC7³ as our image feature vector, henceforth referred to as $q$.

³ We use the activations of the FC7 layer, which encode information about the entire image, of the VGG19 network (configuration E) in Simonyan and Zisserman (2014)'s paper.

We propose three different methods to incorporate images into the attentive NMT framework: using an image as words in the source sentence (§2.2.1), using an image to initialise the source language encoder (§2.2.2) and the target language decoder (§2.2.3).

We also evaluated a fourth mechanism to incorporate images into NMT, namely to use an image as one of the different contexts available to the decoder at each time step of the decoding process. We add the image features directly as an additional context, in addition to $W_y[\tilde{y}_{t-1}]$, $s_{t-1}$ and $c_t$, to compute the hidden state $s_t$ of the decoder at a given time step $t$. We corroborate previous findings by Vinyals et al. (2015) in that adding the image features as such causes the model to overfit, ultimately preventing learning.⁴

⁴ For comparison, translations for the translated Multi30k test set (described in §3) achieve just 3.8 BLEU (Papineni et al., 2002), 15.5 METEOR (Denkowski and Lavie, 2014) and 93.0 TER (Snover et al., 2006).

2.2.1 Images as source words: IMG_W

One way we propose to incorporate images into the encoder is to project an image feature vector into the space of the words of the source sentence. We use the projected image as the first and/or last word of the source sentence and let the attention model learn when to attend to the image representation. Specifically, given the global image feature vector $q \in \mathbb{R}^{4096}$, we compute (6):

$$d = W_I^2 \cdot (W_I^1 \cdot q + b_I^1) + b_I^2, \tag{6}$$

where $W_I^1 \in \mathbb{R}^{4096 \times 4096}$ and $W_I^2 \in \mathbb{R}^{4096 \times d_x}$ are image transformation matrices, $b_I^1 \in \mathbb{R}^{4096}$ and $b_I^2 \in \mathbb{R}^{d_x}$ are bias vectors, and $d_x$ is the source words vector space dimensionality, all trained with the model. We then directly use $d$ as words in the source words vector space: as the first word only (model IMG_1W), and as the first and last words of the source sentence (model IMG_2W).

Figure 2: An encoder bidirectional RNN that uses image features as words in the source sequence.

An illustration of this idea is given in Figure 2, where a source sentence that originally contained $N$ tokens will contain $N + 1$ tokens (model IMG_1W) or $N + 2$ tokens (model IMG_2W) after including the image as source words. In model IMG_1W, the image is projected as the first source word only (solid line in Figure 2); in model IMG_2W, it is projected into the source words space as both first and last words (both solid and dashed lines in Figure 2).

Given a source sequence $X = (x_1, x_2, \cdots, x_N)$, we concatenate the transformed image vector $d$ to $W_x[X]$ and apply the forward and backward encoder RNN passes, generating hidden vectors as in Figure 2. When computing the context vector $c_t$ (Equations (2) and (3)), we effectively make use of the transformed image vector, i.e. the $\alpha_{t,i}$ attention weights will use this information to attend or not to the image features.
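The construction in §2.2.1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy dimensions (8 instead of 4096 for the image vector, 4 for $d_x$) and the random stand-in alignment scores are all hypothetical, and matrices are stored as (output × input) lists of rows so that the products in Equation (6) are well defined.

```python
import math
import random

random.seed(0)  # reproducible toy parameters


def gaussian_matrix(rows, cols, sigma=0.01):
    """Matrix stored as a list of rows, sampled from N(0, sigma^2)."""
    return [[random.gauss(0.0, sigma) for _ in range(cols)] for _ in range(rows)]


def affine(W, x, b):
    """Return W x + b, with W stored as (output_dim x input_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def project_image(q, W1, b1, W2, b2):
    """Equation (6): two stacked affine maps with no non-linearity in between."""
    return affine(W2, affine(W1, q, b1), b2)


def add_image_words(word_vectors, d, variant):
    """Place the projected image d first (IMG_1W) or first and last (IMG_2W)."""
    if variant == "IMG_1W":
        return [d] + word_vectors
    if variant == "IMG_2W":
        return [d] + word_vectors + [d]
    raise ValueError("unknown variant: " + variant)


def attention_weights(scores):
    """Equation (3): softmax over the expected alignments e_{t,i}."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy sizes; the paper uses a 4096D image vector.
IMG_DIM, EMB_DIM, N = 8, 4, 3

q = [random.random() for _ in range(IMG_DIM)]
W1, b1 = gaussian_matrix(IMG_DIM, IMG_DIM), [0.0] * IMG_DIM
W2, b2 = gaussian_matrix(EMB_DIM, IMG_DIM), [0.0] * EMB_DIM
d = project_image(q, W1, b1, W2, b2)

# Stand-in source word embeddings W_x[X] for a sentence of N tokens.
words = [[random.gauss(0.0, 0.01) for _ in range(EMB_DIM)] for _ in range(N)]
seq_1w = add_image_words(words, d, "IMG_1W")  # N + 1 encoder inputs
seq_2w = add_image_words(words, d, "IMG_2W")  # N + 2 encoder inputs

# Stand-in alignment scores; in the model they come from Equation (2)
# applied to the annotation vectors the encoder produces for seq_1w.
scores = [random.gauss(0.0, 1.0) for _ in seq_1w]
alphas = attention_weights(scores)
print(len(seq_1w), len(seq_2w), round(sum(alphas), 6))
```

Because the image occupies ordinary positions in the source sequence, the attention distribution simply gains one or two extra entries, which is what lets the decoder attend to the image or ignore it at each step.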
By including images into the encoder in models IMG_1W and IMG_2W, our intuition is that (i) by including the image as the first word, we propagate image features into the source sentence vector representations when applying the forward RNN $\overrightarrow{\Phi}_{enc}$ (vectors $\overrightarrow{h}_i$), and (ii) by including the image as the last word, we propagate image features into the source sentence vector representations when applying the backward RNN $\overleftarrow{\Phi}_{enc}$ (vectors $\overleftarrow{h}_i$).

2.2.2 Images for encoder initialisation: IMG_E

In the original attention-based NMT model described in §2, the hidden state of the encoder is initialised with the zero vector $\vec{0}$. Instead, we propose to use two new single-layer feed-forward neural networks to compute the initial states of the forward RNN $\overrightarrow{\Phi}_{enc}$ and the backward RNN $\overleftarrow{\Phi}_{enc}$, respectively, as illustrated in Figure 3.

Figure 3: Using an image to initialise the encoder hidden states.

Figure 4: Image as additional data to initialise the decoder hidden state $s_0$.

Similarly to §2.2.1, given a global image feature vector $q \in \mathbb{R}^{4096}$, we compute a vector $d$ using Equation (6), only this time the parameters $W_I^2$ and $b_I^2$ project the image features into the same dimensionality as the textual encoder hidden states.

The feed-forward networks used to initialise the encoder hidden state are computed as in (7):

$$\overrightarrow{h}_{init} = \tanh(W_f d + b_f), \qquad \overleftarrow{h}_{init} = \tanh(W_b d + b_b), \tag{7}$$

where $W_f$ and $W_b$ are multi-modal projection matrices that project the image features $d$ into the encoder forward and backward hidden states dimensionality, respectively, and $b_f$ and $b_b$ are bias vectors.
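As a rough illustration of Equation (7), the two initial encoder states can be computed with one tanh layer per direction. The sketch below is an assumption-laden toy in plain Python, not the authors' code: `init_encoder_states`, the helper functions and the 4-dimensional hidden size are hypothetical (the paper reports 1024D forward and backward encoder states), and matrices are stored as (output × input) lists of rows.

```python
import math
import random

random.seed(0)  # reproducible toy parameters


def gaussian_matrix(rows, cols, sigma=0.01):
    """Matrix stored as a list of rows, sampled from N(0, sigma^2)."""
    return [[random.gauss(0.0, sigma) for _ in range(cols)] for _ in range(rows)]


def affine(W, x, b):
    """Return W x + b, with W stored as (output_dim x input_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def init_encoder_states(d, W_f, b_f, W_b, b_b):
    """Equation (7): one single-layer tanh network per encoder direction."""
    h_fwd_init = [math.tanh(v) for v in affine(W_f, d, b_f)]
    h_bwd_init = [math.tanh(v) for v in affine(W_b, d, b_b)]
    return h_fwd_init, h_bwd_init


# Hypothetical toy size; d is assumed to already have the encoder
# hidden-state dimensionality, as described for this variant.
HID_DIM = 4

d = [random.gauss(0.0, 1.0) for _ in range(HID_DIM)]
W_f, b_f = gaussian_matrix(HID_DIM, HID_DIM), [0.0] * HID_DIM
W_b, b_b = gaussian_matrix(HID_DIM, HID_DIM), [0.0] * HID_DIM

h_fwd_init, h_bwd_init = init_encoder_states(d, W_f, b_f, W_b, b_b)

# With a zero image vector and zero biases, both states are the zero
# vector, i.e. the initialisation used by the text-only model.
zero_fwd, zero_bwd = init_encoder_states([0.0] * HID_DIM, W_f, b_f, W_b, b_b)
print(h_fwd_init, h_bwd_init)
```

The zero-input case is a convenient sanity check: it shows that this variant changes only the starting point of the two encoder RNNs and leaves the rest of the text-only architecture untouched.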
2.2.3 Images for decoder initialisation: IMG_D

To incorporate an image into the decoder, we introduce a new single-layer feed-forward neural network to be used instead of the one described in Equation (5). Originally, the decoder's initial hidden state was computed using the concatenation of the last hidden states of the encoder forward RNN ($\overrightarrow{\Phi}_{enc}$) and backward RNN ($\overleftarrow{\Phi}_{enc}$), respectively $\overrightarrow{h}_N$ and $\overleftarrow{h}_1$.

Our proposal is that we include the image features as additional input to initialise the decoder hidden state at time step $t = 0$, as in (8):

$$s_0 = \tanh\big(W_{di}[\overleftarrow{h}_1; \overrightarrow{h}_N] + W_m d + b_{di}\big), \tag{8}$$

where $W_m$ is a multi-modal projection matrix that projects the image features $d$ into the decoder hidden state dimensionality and $W_{di}$ and $b_{di}$ are the same as in Equation (5).

Once again we compute $d$ by applying Equation (6) onto a global image feature vector $q \in \mathbb{R}^{4096}$, only this time the parameters $W_I^2$ and $b_I^2$ project the image features into the same dimensionality as the decoder hidden states. We illustrate this idea in Figure 4.

3 Dataset

Our multi-modal NMT models need bilingual sentences accompanied by one or more images as training data. The original Flickr30k dataset contains 30k images and 5 English sentence descriptions for each image (Young et al., 2014). We use the translated and the comparable Multi30k datasets (Elliott et al., 2016), henceforth referred to as M30k_T and M30k_C, respectively, which are multilingual expansions of the original Flickr30k.

For each of the 30k images in the Flickr30k, the M30k_T has one of its English descriptions manually translated into German by a professional translator. Training, validation and test sets contain 29k, 1014 and 1k images, respectively, each accompanied by one sentence pair (the original English sentence and its German translation). For each of the 30k images in the Flickr30k, the M30k_C has five descriptions in German collected independently of the English descriptions. Training, validation and test sets contain 29k, 1014 and 1k images, respectively, each accompanied by five sentences in English and five sentences in German.

We use the scripts in the Moses SMT Toolkit (Koehn et al., 2007) to normalise, truecase and tokenize English and German descriptions and we also convert space-separated tokens into subwords (Sennrich et al., 2016b). All models use a common vocabulary of 83,093 English and 91,141 German subword tokens. If sentences in English or German are longer than 80 tokens, they are discarded.

We use the entire M30k_T training set for training, its validation set for model selection with BLEU, and its test set to evaluate our models. In order to study the impact that additional training data brings to the models, we use the baseline model described in §2 trained on the textual part of the M30k_T dataset (German→English) without the images to build a back-translation model (Sennrich et al., 2016a). We back-translate the 145k German descriptions in the M30k_C into English and include the triples (synthetic English description, German description, image) as additional training data.

We train models to translate from English into German and report evaluation of cased, tokenized sentences with punctuation.

4 Experimental setup

Our encoder is a bidirectional RNN with GRU (one 1024D single-layer forward RNN and one 1024D single-layer backward RNN). Source and target word embeddings are 620D each and both are trained jointly with our model. All non-recurrent matrices are initialised by sampling from a Gaussian distribution ($\mu = 0$, $\sigma = 0.01$), recurrent matrices are orthogonal and bias vectors are all initialised to zero. Our decoder RNN also uses GRU and is a neural LM (Bengio et al., 2003) conditioned on its previous emissions and the source sentence by means of the source attention mechanism.

Image features are obtained by feeding images to the pre-trained VGG19 network of Simonyan and Zisserman (2014) and using the activations of the penultimate fully-connected layer FC7. We apply dropout with a probability of 0.2 in both source and target word embeddings and with a probability of 0.5 in the image features (in all MNMT models), in the encoder and decoder RNNs inputs and recurrent connections, and before the readout operation in the decoder RNN. We follow Gal and Ghahramani (2016) and apply dropout to the encoder bidirectional RNN and decoder RNN using the same mask in all time steps.

Our models are trained using stochastic gradient descent with Adadelta (Zeiler, 2012) and minibatches of size 40, where each training instance consists of one English sentence, one German sentence and one image. We apply early stopping for model selection based on BLEU scores, so that if a model does not improve on BLEU in the validation set for more than 20 epochs, training is halted.

We evaluate our models' translation quality quantitatively in terms of BLEU4 (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), TER (Snover et al., 2006), and chrF3 scores⁵ (Popović, 2015) and we report statistical significance for the first three metrics using approximate randomisation computed with MultEval (Clark et al., 2011).

⁵ We specifically compute character 6-gram F3 scores.

As our main baseline we train an attention-based NMT model (§2) in which only the textual part of M30k_T is used for training. We also train a PBSMT model built with Moses on the same data. The LM is a 5-gram LM with modified Kneser-Ney smoothing (Kneser and Ney, 1995) trained on the German side of the M30k_T dataset. We use minimum error rate training (Och, 2003) for tuning the model parameters for BLEU scores. Our third baseline is the best comparable multi-modal model by Huang et al. (2016) and also their best model with additional object detections: respectively models m1 (image at head) and m3 in the authors' paper.

Model      BLEU4↑           METEOR↑          TER↓             chrF3↑
PBSMT      32.9             54.1             45.1             67.4
NMT        33.7             52.3             46.7             64.5
Huang      35.1             52.2             —                —
+RCNN      36.5             54.1             —                —
IMG_1W     37.1†‡ (↑3.4)    54.5†‡ (↑0.4)    42.7†‡ (↓2.4)    66.9 (↓0.5)
IMG_2W     36.9†‡ (↑3.2)    54.3†‡ (↑0.2)    41.9†‡ (↓3.2)    66.8 (↓0.6)
IMG_E      37.1†‡ (↑3.4)    55.0†‡ (↑0.9)    43.1†‡ (↓2.0)    67.6 (↑0.2)
IMG_D      37.3†‡ (↑3.6)    55.1†‡ (↑1.0)    42.8†‡ (↓2.3)    67.7 (↑0.3)
IMG_2W+D   35.7†‡ (↑2.0)    53.6†‡ (↓0.5)    43.3†‡ (↓1.8)    66.2 (↓1.2)
IMG_E+D    37.0†‡ (↑3.3)    54.7†‡ (↑0.6)    42.6†‡ (↓2.5)    67.2 (↓0.2)

Table 1: BLEU4, METEOR, chrF3 (higher is better) and TER scores (lower is better) on the M30k_T test set for the two text-only baselines PBSMT and NMT, the two multi-modal NMT models by Huang et al. (2016) and our MNMT models that: (i) use images as words in the source sentence (IMG_1W, IMG_2W), (ii) use images to initialise the encoder (IMG_E), and (iii) use images as additional data to initialise the decoder (IMG_D). Best text-only baselines are underscored and best overall results appear in bold. We highlight in parentheses the improvements brought by our models compared to the best corresponding text-only baseline score. Results differ significantly from PBSMT baseline (†) or NMT baseline (‡) with p = 0.05.

4.1 Results

The Multi30K dataset contains images and bilingual descriptions. Overall, it is a small dataset with a small vocabulary whose sentences have simple syntactic structures and not much ambiguity (Elliott et al., 2016). This is reflected in the fact that even the simplest baselines perform fairly well on it, i.e. the smallest BLEU score of 32.9 is that of the PBSMT model, which is still good for translating into German.

From Table 1 we see that our multi-modal models perform well, with models IMG_E and IMG_D improving on both baselines according to all metrics analysed. We also note that all models but IMG_2W+D perform consistently better than the strong multi-modal NMT baseline of Huang et al. (2016), even when this model has access to more data (+RCNN features).⁶ Combining image features in the encoder and the decoder at the same time (last two entries in Table 1) does not seem to improve results compared to using the image features in only the encoder or the decoder. To the best of our knowledge, it is the first time a purely neural model significantly improves over a PBSMT model in all metrics on this dataset.

⁶ In fact, model IMG_2W+D still improves on the multi-modal baseline of Huang et al. (2016) when trained on the same data.

Arguably, the main downside of applying multi-modal NMT in a real-world scenario is the small amount of publicly available training data (∼30k), which restricts its applicability. For that reason, we back-translated the German sentences in the M30k_C and created an additional 145k synthetic triples (synthetic English sentence, original German sentence and image).

In Table 2, we present results for some of the models evaluated in Table 1 but when also trained on the additional data. In order to add more data to the PBSMT baseline, we simply added the German sentences in the M30k_C as additional data to train the LM.⁷ Both our models IMG_E and IMG_D that use global image features to initialise the encoder and the decoder, respectively, improve significantly according to BLEU, METEOR and TER with the additional back-translated data, and also achieve better chrF3 scores. Model IMG_2W, that uses images as words in the source sentence, does not significantly differ in BLEU, METEOR or TER (p = 0.05), but achieves a lower chrF3 score than the comparable PBSMT model. Although model IMG_2W has the best TER score (41.9) among the models trained on only the original data, models IMG_E and IMG_D reach comparable TER scores with the additional back-translated data (41.4 and 41.6, respectively), though the difference between the latter and the former is still not statistically significant (p = 0.05).

⁷ Adding the synthetic sentence pairs to train the baseline PBSMT model, as we did with all neural MT models, deteriorated the results.

Model      BLEU4↑           METEOR↑          TER↓             chrF3↑
original training data
IMG_2W     36.9             54.3             41.9             66.8
IMG_E      37.1             55.0             43.1             67.6
IMG_D      37.3             55.1             42.8             67.7
+ back-translated training data
PBSMT      34.0             55.0             44.7             68.0
NMT        35.5             53.4             43.3             65.3
IMG_2W     36.7†‡ (↑1.2)    54.6†‡ (↓0.4)    42.0†‡ (↓1.3)    66.8 (↓1.2)
IMG_E      38.5†‡ (↑3.0)    55.7†‡ (↑0.9)    41.4†‡ (↓1.9)    68.3 (↑0.3)
IMG_D      38.5†‡ (↑3.0)    55.9†‡ (↑1.1)    41.6†‡ (↓1.7)    68.4 (↑0.4)
Improvements (original vs. + back-translated)
IMG_2W     ↓0.2             ↑0.1             ↑0.1             ↑0.0
IMG_E      ↑1.4             ↑0.7             ↓1.8             ↑0.7
IMG_D      ↑1.2             ↑0.8             ↓1.2             ↑0.7

Table 2: BLEU4, METEOR, TER and chrF3 scores on the M30k_T test set for models trained on original and additional back-translated data. Best text-only baselines are underscored and best overall results in bold. We highlight in parentheses the improvements brought by our models compared to the best baseline score. Results differ significantly from PBSMT baseline (†) or NMT baseline (‡) with p = 0.05. We also show the improvements each model yields in each metric when only trained on the original M30k_T training set vs. also including additional back-translated data.

We see in Tables 1 and 2 that our models that use images directly to initialise either the encoder or the decoder are the only ones to consistently outperform the PBSMT baseline according to the chrF3 metric, a character-based metric that includes both precision and recall, and has a recall bias. That is also a noteworthy finding, since chrF3 is the only character-level metric we use, and it has shown a high correlation with human judgements (Stanojević et al., 2015).

ref.     ein brauner und ein schwarzer Hund laufen auf einem Pfad im Wald.
SMT      ein braun und schwarzer Hund läuft auf einem Pfad im Wald.
NMT      ein brauner Hund steht an einem Sand Strand.
IMG_1W   ein braun-schwarzer Hund läuft auf einem Pfad im Wald.
IMG_2W   ein braun-schwarzer Hund läuft im Wald auf einem Pfad.
IMG_E    ein braun-schwarzer Hund läuft im Wald auf einem Pfad.
IMG_D    ein braun-schwarzer Hund läuft im Wald auf einem Pfad.

ref.     eine Frau mit langen Haaren bei einer Abschluss Feier.
SMT      eine Frau mit langen Haaren steht an einem Abschluss
NMT      eine Frau mit langen Haaren ist an einer Sta Zeremonie.
IMG_1W   eine Frau mit langen Haaren ist an einer warmen Zeremonie teil.
IMG_2W   eine Frau mit langen Haaren steht bei einer Hochzeit Feier.
IMG_E    eine langhaarige Frau bei einer olympischen Zeremonie.
IMG_D    eine langhaarige Frau bei einer olympischen Zeremonie.

Table 3: Some translations for the M30k test set.

In Table 3 we see translations for two entries in the M30k test set. In the first entry, although the reference translation is incorrect (there is just one dog in the image), the multi-modal models translated it correctly. In the second entry, the last three multi-modal models extrapolate the reference+image and describe "ceremony" as a "wedding ceremony" (IMG_2W) and as an "Olympics ceremony" (IMG_E and IMG_D). This could be due to the fact that the training set is small, depicts a small variation of different scenes and contains different forms of biases (van Miltenburg, 2015).

We note that the idea of using images as words in the source sentence, also entertained by Huang et al. (2016), does not perform as well as directly using the images in the encoder or decoder initialisation. The fact that multi-modal NMT models can benefit from back-translated data is also an interesting finding.

5 Conclusions

We have introduced different ideas to incorporate images into state-of-the-art attention-based NMT, by using images as words in the source sentence, to initialise the encoder's hidden state and as additional data in the initialisation of the decoder's hidden state. We corroborate previous findings in that using image features directly at each time step of the decoder causes the model to overfit and prevents learning. The intuition behind our effort is to use global image feature vectors to visually ground translations and consequently increase translation quality. Extensive experiments show that adding global image features into attention-based NMT is useful and improves over NMT and PBSMT as well as a strong multi-modal NMT baseline, according to all metrics evaluated.

In future work we will conduct a more systematic study on the impact that synthetic back-translated data can have on multi-modal NMT, and also investigate how to incorporate local, spatial-preserving image features.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations, ICLR 2015. San Diego, California.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam M. Shazeer. 2015. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. In Advances in Neural Information Processing Systems, NIPS. http://arxiv.org/abs/1506.03099.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 3:1137–1155. http://dl.acm.org/citation.cfm?id=944919.944966.

Ozan Caglayan, Walid Aransa, Yaxing Wang, Marc Masana, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, and Joost van de Weijer. 2016. Does Multimodality Help Human and Machine for Translation and Image Captioning? In Proceedings of the First Conference on Machine Translation. Berlin, Germany, pages 627–633. http://www.aclweb.org/anthology/W/W16/W16-2358.

Iacer Calixto, Teofilo de Campos, and Lucia Specia. 2012. Images as context in Statistical Machine Translation. In Proceedings of the Workshop on Vision and Language, VL 2012. Sheffield, England.

Iacer Calixto, Desmond Elliott, and Stella Frank. 2016. DCU-UvA Multimodal MT System Report. In Proceedings of the First Conference on Machine Translation. Berlin, Germany, pages 634–638. http://www.aclweb.org/anthology/W/W16/W16-2359.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder–decoder approaches. Syntax, Semantics and Structure in Statistical Translation page 103.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar, pages 1724–1734. http://www.aclweb.org/anthology/D14-1179.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2. Portland, Oregon, HLT '11, pages 176–181. http://dl.acm.org/citation.cfm?id=2002736.2002774.

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell, and Kate Saenko. 2015. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. Boston, US, pages 2625–2634.

Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-Task Learning for Multiple Language Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China, pages 1723–1732. http://www.aclweb.org/anthology/P15-1166.

Desmond Elliott, Stella Frank, and Eva Hasler. 2015. Multi-language image description with neural sequence models. CoRR abs/1510.04709. http://arxiv.org/abs/1510.04709.

Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German Image Descriptions. In Proceedings of the 5th Workshop on Vision and Language, VL@ACL 2016. Berlin, Germany. http://aclweb.org/anthology/W/W16/W16-3210.pdf.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California, pages 866–875. http://www.aclweb.org/anthology/N16-1101.

Yarin Gal and Zoubin Ghahramani. 2016. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Advances in Neural Information Processing Systems, NIPS, Barcelona, Spain, pages 1019–1027. http://papers.nips.cc/paper/6241-a-theoretically-grounded-application-of-dropout-in-recurrent-neural-networks.pdf.

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Washington, DC, USA, CVPR '14, pages 580–587. https://doi.org/10.1109/CVPR.2014.81.

Alex Graves. 2013. Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850. http://arxiv.org/abs/1308.0850.

Julian Hitschler, Shigehiko Schamoni, and Stefan Riezler. 2016. Multimodal Pivots for Image Caption Translation. In Proceedings of

Translation. Berlin, Germany, pages 646–654. http://www.aclweb.org/anthology/W/W16/W16-2361.
the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Minh-ThangLuong,QuocV.Le,IlyaSutskever,Oriol Papers). Berlin, Germany, pages 2399–2409. Vinyals, and Lukasz Kaiser. 2016. Multi-Task Se- http://www.aclweb.org/anthology/P16-1227. quencetoSequenceLearning. InProceedingsofthe International Conference on Learning Representa- SeppHochreiterandJu¨rgenSchmidhuber.1997. Long tions(ICLR),2016.SanJuan,PuertoRico. Short-Term Memory. Neural Comput. 9(8):1735– 1780. https://doi.org/10.1162/neco.1997.9.8.1735. Thang Luong, Hieu Pham, and Christopher D. Man- ning. 2015. Effective Approaches to Attention- Micah Hodosh, Peter Young, and Julia Hock- based Neural Machine Translation. In Proceed- enmaier. 2013. Framing Image Description ings of the 2015 Conference on Empirical Methods As a Ranking Task: Data, Models and Evalu- inNaturalLanguageProcessing(EMNLP).Lisbon, ation Metrics. J. Artif. Int. Res. 47(1):853–899. Portugal,pages1412–1421. http://dl.acm.org/citation.cfm?id=2566972.2566993. Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Po-Yao Huang, Frederick Liu, Sz-Rung Shiang, and Alan L. Yuille. 2014. Explain Images Jean Oh, and Chris Dyer. 2016. Attention-based with Multimodal Recurrent Neural Networks. Multimodal Neural Machine Translation. In http://arxiv.org/abs/1410.1090. Proceedings of the First Conference on Machine Translation. Berlin, Germany, pages 639–645. Franz Josef Och. 2003. Minimum Error Rate Train- http://www.aclweb.org/anthology/W/W16/W16- ing in Statistical Machine Translation. In Pro- 2360. ceedings of the 41st Annual Meeting on Asso- ciation for Computational Linguistics - Volume NalKalchbrennerandPhilBlunsom.2013. Recurrent 1. Sapporo, Japan, ACL ’03, pages 160–167. Continuous Translation Models. In Proceedings of https://doi.org/10.3115/1075096.1075117. the2013ConferenceonEmpiricalMethodsinNat- ural Language Processing, EMNLP 2013. Seattle, Kishore Papineni, Salim Roukos, Todd Ward, and pages1700–1709. Wei-Jing Zhu. 2002. 
BLEU: A Method for Au- tomatic Evaluation of Machine Translation. In Andrej Karpathy and Li Fei-Fei. 2015. Deep visual- Proceedings of the 40th Annual Meeting on As- semantic alignments for generating image descrip- sociation for Computational Linguistics. Philadel- tions. In Proceedings of the IEEE Conference on phia, Pennsylvania, ACL ’02, pages 311–318. Computer Vision and Pattern Recognition, CVPR https://doi.org/10.3115/1073083.1073135. 2015.Boston,Massachusetts,pages3128–3137. Maja Popovic´. 2015. chrf: character n-gram f- Ryan Kiros, Ruslan Salakhutdinov, and Richard S. score for automatic mt evaluation. In Proceed- Zemel.2014. Unifyingvisual-semanticembeddings ings of the Tenth Workshop on Statistical Ma- with multimodal neural language models. CoRR chineTranslation.Lisbon,Portugal,pages392–395. abs/1411.2539. http://arxiv.org/abs/1411.2539. http://aclweb.org/anthology/W15-3049. Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In In OlgaRussakovsky,JiaDeng,HaoSu,JonathanKrause, Proceedings of the IEEE International Conference Sanjeev Satheesh, Sean Ma, Zhiheng Huang, An- on Acoustics, Speech and Signal Processing. De- drej Karpathy, Aditya Khosla, Michael Bernstein, troit,Michigan,volumeI,pages181–184. Alexander C. Berg, and Li Fei-Fei. 2015. Ima- geNet Large Scale Visual Recognition Challenge. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris International Journal of Computer Vision (IJCV) Callison-Burch,MarcelloFederico,NicolaBertoldi, 115(3):211–252. https://doi.org/10.1007/s11263- Brooke Cowan, Wade Shen, Christine Moran, 015-0816-y. RichardZens,ChrisDyer,OndˇrejBojar,Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Rico Sennrich, Barry Haddow, and Alexandra Birch. Source Toolkit for Statistical Machine Translation. 2016a. Improving Neural Machine Translation In Proceedings of the 45th Annual Meeting of the Models with Monolingual Data. 
In Proceed- ACL on Interactive Poster and Demonstration Ses- ings of the 54th Annual Meeting of the Asso- sions. Association for Computational Linguistics, ciation for Computational Linguistics (Volume 1: Prague, Czech Republic, ACL ’07, pages 177–180. Long Papers). Berlin, Germany, pages 86–96. http://dl.acm.org/citation.cfm?id=1557769.1557821. http://www.aclweb.org/anthology/P16-1009. Jindˇrich Libovicky´, Jindˇrich Helcl, Marek Tlusty´, Rico Sennrich, Barry Haddow, and Alexandra Birch. Ondˇrej Bojar, and Pavel Pecina. 2016. CUNI 2016b. Neural Machine Translation of Rare System for WMT16 Automatic Post-Editing Words with Subword Units. In Proceedings and Multimodal Translation Tasks. In Pro- of the 54th Annual Meeting of the Associa- ceedings of the First Conference on Machine tion for Computational Linguistics (Volume 1: