Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism

Orhan Firat, Middle East Technical University ([email protected])
Kyunghyun Cho, New York University
Yoshua Bengio, University of Montreal, CIFAR Senior Fellow

Abstract

We propose multi-way, multilingual neural machine translation. The proposed approach enables a single neural translation model to translate between multiple languages, with a number of parameters that grows only linearly with the number of languages. This is made possible by having a single attention mechanism that is shared across all language pairs. We train the proposed multi-way, multilingual model on ten language pairs from WMT'15 simultaneously and observe clear performance improvements over models trained on only one language pair. In particular, we observe that the proposed model significantly improves the translation quality of low-resource language pairs.

1 Introduction

Neural Machine Translation

It has been shown that a deep (recurrent) neural network can successfully learn a complex mapping between variable-length input and output sequences on its own. Some of the earlier successes in this task have, for instance, been handwriting recognition (Bottou et al., 1997; Graves et al., 2009) and speech recognition (Graves et al., 2006; Chorowski et al., 2015). More recently, a general framework of encoder-decoder networks has been found to be effective at learning this kind of sequence-to-sequence mapping by using two recurrent neural networks (Cho et al., 2014b; Sutskever et al., 2014).

A basic encoder-decoder network consists of two recurrent networks. The first network, called an encoder, maps an input sequence of variable length into a point in a continuous vector space, resulting in a fixed-dimensional context vector. The other recurrent neural network, called a decoder, then generates a target sequence, again of variable length, starting from the context vector. This approach has, however, been found to be inefficient in (Cho et al., 2014a) when handling long sentences, due to the difficulty in learning a complex mapping between an arbitrarily long sentence and a single fixed-dimensional vector.

In (Bahdanau et al., 2014), a remedy to this issue was proposed by incorporating an attention mechanism into the basic encoder-decoder network. The attention mechanism frees the network from having to map a sequence of arbitrary length to a single, fixed-dimensional vector. Since this attention mechanism was introduced to the encoder-decoder network for machine translation, neural machine translation, which is purely based on neural networks to perform full end-to-end translation, has become competitive with the existing phrase-based statistical machine translation in many language pairs (Jean et al., 2015; Gulcehre et al., 2015; Luong et al., 2015b).

Multilingual Neural Machine Translation

Existing machine translation systems, mostly based on a phrase-based system or its variants, work by directly mapping a symbol or a subsequence of symbols in a source language to its corresponding symbol or subsequence in a target language. This kind of mapping is strictly specific to a given language pair, and it is not trivial to extend it to work on multiple pairs of languages.

A system based on neural machine translation, on the other hand, can be decomposed into two modules. The encoder maps a source sentence into a continuous representation, either a fixed-dimensional vector in the case of the basic encoder-decoder network or a set of vectors in the case of the attention-based encoder-decoder network. The decoder then generates a target translation based on this source representation. This makes it conceptually possible to build a system that maps a source sentence in any language to a common continuous representation space and decodes that representation into any of the target languages, allowing us to build a multilingual machine translation system.
This possibility is straightforward to implement and has been validated in the case of basic encoder-decoder networks (Luong et al., 2015a). It is, however, not so in the case of the attention-based encoder-decoder network, as the attention mechanism, originally called the alignment function in (Bahdanau et al., 2014), is conceptually language pair-specific. In (Dong et al., 2015), the authors cleverly avoided this issue of a language pair-specific attention mechanism by considering only one-to-many translation, where each target-language decoder embedded its own attention mechanism. Also, we notice that both of these works have only evaluated their models on relatively small-scale tasks, making it difficult to assess whether multilingual neural machine translation can scale beyond low-resource language translation.

Multi-Way, Multilingual Neural Machine Translation

In this paper, we first step back from the currently available multilingual neural translation systems proposed in (Luong et al., 2015a; Dong et al., 2015) and ask whether the attention mechanism can be shared across multiple language pairs. As an answer to this question, we propose an attention-based encoder-decoder network that admits a shared attention mechanism with multiple encoders and decoders. We use this model for all the experiments, which suggests that it is indeed possible to share an attention mechanism across multiple language pairs.

The next question we ask is the following: in which scenario would the proposed multi-way, multilingual neural translation have an advantage over the existing, single-pair model? Specifically, we consider the case of translating a low-resource language pair. The experiments show that the proposed multi-way, multilingual model generalizes better than the single-pair translation model when the amount of available parallel corpus is small. Furthermore, we validate that this is not only due to the increased amount of target-side, monolingual corpus.

Finally, we train a single model with the proposed architecture on all the language pairs from WMT'15: English, French, Czech, German, Russian and Finnish. The experiments show that it is indeed possible to train a single attention-based network to perform multi-way translation.

2 Background: Attention-based Neural Machine Translation

The attention-based neural machine translation was proposed in (Bahdanau et al., 2014). It was motivated by the observation in (Cho et al., 2014a) that the basic encoder-decoder translation model from (Cho et al., 2014b; Sutskever et al., 2014) has difficulty translating long source sentences. This is largely due to the fact that the encoder of this basic approach needs to compress a whole source sentence into a single vector. Here we describe the attention-based neural machine translation.

Neural machine translation aims at building a single neural network that takes as input a source sequence $X = (x_1, \ldots, x_{T_x})$ and generates a corresponding translation $Y = (y_1, \ldots, y_{T_y})$. Each symbol in both source and target sentences, $x_t$ or $y_t$, is an integer index of the symbol in a vocabulary.

The encoder of the attention-based model encodes a source sentence into a set of context vectors $C = \{h_1, h_2, \ldots, h_{T_x}\}$, whose size varies with the length of the source sentence. This context set is constructed by a bidirectional recurrent neural network (RNN) which consists of a forward RNN and a reverse RNN. The forward RNN reads the source sentence from the first token to the last, resulting in the forward context vectors $\{\overrightarrow{h}_1, \ldots, \overrightarrow{h}_{T_x}\}$, where

$\overrightarrow{h}_t = \overrightarrow{\Psi}_{\text{enc}}\left(\overrightarrow{h}_{t-1}, E_x[x_t]\right),$

and $E_x \in \mathbb{R}^{|V_x| \times d}$ is an embedding matrix containing row vectors of the source symbols. The reverse RNN reads the source sentence in the opposite direction, resulting in $\{\overleftarrow{h}_1, \ldots, \overleftarrow{h}_{T_x}\}$, where

$\overleftarrow{h}_t = \overleftarrow{\Psi}_{\text{enc}}\left(\overleftarrow{h}_{t+1}, E_x[x_t]\right).$

$\overrightarrow{\Psi}_{\text{enc}}$ and $\overleftarrow{\Psi}_{\text{enc}}$ are recurrent activation functions such as long short-term memory units (LSTM, (Hochreiter and Schmidhuber, 1997)) or gated recurrent units (GRU, (Cho et al., 2014b)). At each position in the source sentence, the forward and reverse context vectors are concatenated to form a full context vector, i.e.,

$h_t = \left[\overrightarrow{h}_t; \overleftarrow{h}_t\right].$   (1)
The decoder, which is implemented as an RNN as well, generates one symbol at a time, the translation of the source sentence, based on the context set returned by the encoder. At each time step $t$ in the decoder, a time-dependent context vector $c_t$ is computed based on the previous hidden state of the decoder $z_{t-1}$, the previously decoded symbol $\tilde{y}_{t-1}$ and the whole context set $C$.

This starts by computing the relevance score of each context vector as

$e_{t,i} = f_{\text{score}}\left(h_i, z_{t-1}, E_y[\tilde{y}_{t-1}]\right),$   (2)

for all $i = 1, \ldots, T_x$. $f_{\text{score}}$ can be implemented in various ways (Luong et al., 2015b), but in this work, we use a simple single-layer feedforward network. This relevance score measures how relevant the $i$-th context vector of the source sentence is in deciding the next symbol in the translation. These relevance scores are further normalized:

$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T_x} \exp(e_{t,j})},$   (3)

and we call $\alpha_{t,i}$ the attention weight.

The time-dependent context vector $c_t$ is then the weighted sum of the context vectors, with their weights being the attention weights from above:

$c_t = \sum_{i=1}^{T_x} \alpha_{t,i} h_i.$   (4)

With this time-dependent context vector $c_t$, the previous hidden state $z_{t-1}$ and the previously decoded symbol $\tilde{y}_{t-1}$, the decoder's hidden state is updated by

$z_t = \Psi_{\text{dec}}\left(z_{t-1}, E_y[\tilde{y}_{t-1}], c_t\right),$   (5)

where $\Psi_{\text{dec}}$ is a recurrent activation function.

The initial hidden state $z_0$ of the decoder is initialized based on the last hidden state of the reverse RNN:

$z_0 = f_{\text{init}}\left(\overleftarrow{h}_{T_x}\right),$   (6)

where $f_{\text{init}}$ is a feedforward network with one or two hidden layers.

The probability distribution for the next target symbol is computed by

$p(y_t = k \mid \tilde{y}_{<t}, X) \propto \exp\left(g_k(z_t, c_t, E_y[\tilde{y}_{t-1}])\right),$   (7)

where $g_k$ is a parametric function that returns the unnormalized probability for the next target symbol being $k$.

Training this attention-based model is done by maximizing the conditional log-likelihood

$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_y} \log p\left(y_t = y_t^{(n)} \mid y_{<t}^{(n)}, X^{(n)}\right),$

where the log-probability inside the inner summation is from Eq. (7). It is important to note that the ground-truth target symbols $y_t^{(n)}$ are used during training. The entire model is differentiable, and the gradient of the log-likelihood function with respect to all the parameters $\theta$ can be computed efficiently by backpropagation. This makes it straightforward to use stochastic gradient descent or its variants to train the whole model jointly to maximize the translation performance.
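The sketch below illustrates Eqs. (2)-(7) for a single decoding step in NumPy. It is only a sketch: the recurrent update of Eq. (5) is simplified to a plain tanh layer instead of a GRU/LSTM, and the parameter names (`W_att`, `v_att`, `W_z`, `W_y`, `W_c`, `W_out`) and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(C, z_prev, y_prev_emb, params):
    """One decoding step of the attention-based model (Eqs. (2)-(7)).

    C          : (T_x, d_ctx) context vectors h_1..h_Tx from the encoder
    z_prev     : (d_dec,)     previous decoder state z_{t-1}
    y_prev_emb : (d_emb,)     embedding E_y[y~_{t-1}] of the previous symbol
    params     : dict of weight matrices (illustrative shapes)
    """
    # Eq. (2): single-layer feedforward scoring network f_score
    T_x = C.shape[0]
    scores = np.empty(T_x)
    for i in range(T_x):
        pre = np.concatenate([C[i], z_prev, y_prev_emb])
        scores[i] = params['v_att'] @ np.tanh(params['W_att'] @ pre)

    # Eq. (3): normalize the scores into attention weights alpha_{t,i}
    alpha = softmax(scores)

    # Eq. (4): time-dependent context vector c_t as a weighted sum
    c_t = alpha @ C

    # Eq. (5): recurrent update (a GRU or LSTM in the paper; a plain
    # tanh recurrence here to keep the sketch short)
    z_t = np.tanh(params['W_z'] @ z_prev + params['W_y'] @ y_prev_emb
                  + params['W_c'] @ c_t)

    # Eq. (7): unnormalized scores g_k, then a softmax over the vocabulary
    logits = params['W_out'] @ np.concatenate([z_t, c_t, y_prev_emb])
    p_next = softmax(logits)
    return z_t, c_t, alpha, p_next
```

During training, the ground-truth previous symbol is fed in place of the model's own prediction, as noted below Eq. (7).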
3 Multi-Way, Multilingual Translation

In this section, we discuss issues and our solutions in extending the conventional single-pair attention-based neural machine translation into a multi-way, multilingual model.

Problem Definition

We assume $N > 1$ source languages $\{X^1, X^2, \ldots, X^N\}$ and $M > 1$ target languages $\{Y^1, Y^2, \ldots, Y^M\}$, and the availability of $L \leq M \times N$ bilingual parallel corpora $\{D_1, \ldots, D_L\}$, each of which is a set of sentence pairs of one source and one target language. We use $s(D_l)$ and $t(D_l)$ to indicate the source and target languages of the $l$-th parallel corpus.

For each parallel corpus $l$, we can directly use the log-likelihood function from Eq. (7) to define a pair-specific log-likelihood $\mathcal{L}^{s(D_l), t(D_l)}$. Then, the goal of multi-way, multilingual neural machine translation is to build a model that maximizes the joint log-likelihood function

$\mathcal{L}(\theta) = \frac{1}{L} \sum_{l=1}^{L} \mathcal{L}^{s(D_l), t(D_l)}(\theta).$

Once the training is over, the model can translate from any of the source languages to any of the target languages included in the parallel corpora.
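As a minimal illustration of this joint objective, the sketch below averages pair-specific log-likelihoods over the available corpora. The helper `pair_log_likelihood` stands in for the per-pair objective built from Eq. (7) and is assumed to exist; the corpus representation is likewise an assumption made only for the sketch.

```python
def joint_log_likelihood(corpora, model, pair_log_likelihood):
    """Joint objective L(theta) = (1/L) * sum_l L^{s(D_l), t(D_l)}(theta).

    corpora: list of (src_lang, tgt_lang, sentence_pairs) triples D_1..D_L
    model:   holds the shared attention plus per-language encoders/decoders
    pair_log_likelihood: callable returning the pair-specific log-likelihood
    """
    total = 0.0
    for src_lang, tgt_lang, data in corpora:
        # Each corpus only touches the encoder of src_lang, the decoder of
        # tgt_lang and the shared attention parameters.
        total += pair_log_likelihood(model, src_lang, tgt_lang, data)
    return total / len(corpora)
```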
3.1 Existing Approaches

Neural Machine Translation without Attention

In (Luong et al., 2015a), the authors extended the basic encoder-decoder network for multitask neural machine translation. As they extended the basic encoder-decoder network, their model effectively becomes a set of encoders and decoders, where each of the encoders projects a source sentence into a common vector space. The point in the common space is then decoded into different languages.

The major difference between (Luong et al., 2015a) and our work is that we extend the attention-based encoder-decoder instead of the basic model. This is an important contribution, as the attention-based approach has recently become the de facto standard in the neural machine translation literature (Jean et al., 2014; Jean et al., 2015; Luong et al., 2015b; Sennrich et al., 2015b; Sennrich et al., 2015a), as opposed to the basic encoder-decoder.

There are two minor differences as well. First, they do not consider multilinguality in depth. The authors of (Luong et al., 2015a) tried only a single language pair, English and German, in their model. Second, they only report translation perplexity, which is not a widely used metric for measuring translation quality. To compare more easily with other machine translation approaches, it is important to evaluate metrics such as BLEU, which counts the number of matched n-grams between the generated and reference translations.

One-to-Many Neural Machine Translation

The authors of (Dong et al., 2015) earlier proposed a multilingual translation model based on the attention-based neural machine translation. Unlike this paper, they only tried it on one-to-many translation, similarly to earlier work by (Collobert et al., 2011), where one-to-many natural language processing was done. In this setting, it is trivial to extend the single-pair attention-based model into multilingual translation by simply having a single encoder for a source language and a pair of a decoder and an attention mechanism (Eq. (2)) for each target language. We will shortly discuss more on why, with the attention mechanism, one-to-many translation is trivial, while multi-way translation is not.

3.2 Challenges

A quick look at neural machine translation seems to suggest a straightforward path toward incorporating multiple languages on both the source and target sides. As described earlier in the introduction, the basic idea is simple. We assign a separate encoder to each source language and a separate decoder to each target language. The encoder projects a source sentence in its own language into a common, language-agnostic space, from which the decoder generates a translation in its own language.

Unlike training multiple single-pair neural translation models, in this case the encoders and decoders are shared across multiple pairs. This is computationally beneficial, as the number of parameters grows only linearly with respect to the number of languages ($O(L)$), in contrast to training separate single-pair models, in which case the number of parameters grows quadratically ($O(L^2)$).

The attention mechanism, which was initially called a soft-alignment model in (Bahdanau et al., 2014), aligns a (potentially non-contiguous) source phrase to a target word. This alignment process is largely specific to a language pair, and it is not clear whether an alignment mechanism for one language pair can also work for another pair.

The most naive solution to this issue is to have $O(L^2)$ attention mechanisms that are not shared across multiple language pairs, each taking care of a single pair of source and target languages. This is the approach employed in (Dong et al., 2015), where each decoder had its own attention mechanism.

There are two issues with this naive approach. First, unlike what was initially hoped for with multilingual neural machine translation, the number of parameters again grows quadratically with the number of languages. Second, and more importantly, having separate attention mechanisms makes it less likely for the model to fully benefit from having multiple tasks (Caruana, 1997), especially for transfer learning towards resource-poor languages.

In short, the major challenge in building a multi-way, multilingual neural machine translation model is in avoiding independent (i.e., quadratically many) attention mechanisms. There are two questions behind this challenge. The first is whether it is even possible to share a single attention mechanism across multiple language pairs. The second question immediately follows: how can we build a neural translation model that shares a single attention mechanism for all the language pairs under consideration?
4 Multi-Way, Multilingual Model

We describe in this section the proposed multi-way, multilingual attention-based neural machine translation. The proposed model consists of $N$ encoders $\{\Psi^n_{\text{enc}}\}_{n=1}^{N}$ (see Eq. (1)), $M$ decoders $\{(\Psi^m_{\text{dec}}, g^m, f^m_{\text{init}})\}_{m=1}^{M}$ (see Eqs. (5)-(7)) and a shared attention mechanism $f_{\text{score}}$ (see Eq. (2) in the single language pair case).

Encoders

Similarly to (Luong et al., 2015b), we have one encoder per source language, meaning that a single encoder is shared for translating that language into multiple target languages. In order to handle different source languages better, we may use a different type of encoder for each source language, for instance of a different size (in terms of the number of recurrent units) or of a different architecture (convolutional instead of recurrent). This allows us to efficiently incorporate varying types of languages in the proposed multilingual translation model.

This, however, implies that the dimensionality of the context vectors in Eq. (1) may differ across source languages. Therefore, we add to the original bidirectional encoder from Sec. 2 a linear transformation layer consisting of a weight matrix $W^n_{\text{adp}}$ and a bias vector $b^n_{\text{adp}}$, which is used to project each context vector into a common dimensional space:

$h^n_t = W^n_{\text{adp}} \left[\overrightarrow{h}_t; \overleftarrow{h}_t\right] + b^n_{\text{adp}},$   (8)

where $W^n_{\text{adp}} \in \mathbb{R}^{d \times (\dim \overrightarrow{h}_t + \dim \overleftarrow{h}_t)}$ and $b^n_{\text{adp}} \in \mathbb{R}^d$.

In addition, each encoder exposes two transformation functions $\phi^n_{\text{att}}$ and $\phi^n_{\text{init}}$. The first transformer $\phi^n_{\text{att}}$ transforms a context vector to be compatible with the shared attention mechanism:

$\tilde{h}^n_t = \phi^n_{\text{att}}(h^n_t).$   (9)

This transformer can be implemented as any type of parametric function, and in this paper, we simply apply an element-wise $\tanh$ to $h^n_t$.

The second transformer $\phi^n_{\text{init}}$ transforms the first context vector $h^n_1$ to be compatible with the initializer of the decoder's hidden state (see Eq. (6)):

$\hat{h}^n_1 = \phi^n_{\text{init}}(h^n_1).$   (10)

Similarly to $\phi^n_{\text{att}}$, it can be implemented as any type of parametric function. In this paper, we use a feedforward network with a single hidden layer and share one network $\phi_{\text{init}}$ for all encoder-decoder pairs, as with the attention mechanism.
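A small sketch of the encoder-side projections of Eqs. (8)-(10) follows. The element-wise tanh plays the role of $\phi^n_{\text{att}}$ as stated above, while $\phi_{\text{init}}$ is sketched as one tanh layer; the class name, shapes and initialization scheme are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

class EncoderAdapter:
    """Per-encoder projection into the shared attention space (Eqs. (8)-(10))."""

    def __init__(self, dim_fwd, dim_bwd, dim_common, rng=np.random):
        # Eq. (8): W_adp^n and b_adp^n project the concatenated bidirectional
        # states into a common d-dimensional space.
        self.W_adp = rng.randn(dim_common, dim_fwd + dim_bwd) * 0.01
        self.b_adp = np.zeros(dim_common)
        # phi_init is a single-hidden-layer feedforward network shared across
        # encoder-decoder pairs (only its shapes are sketched here).
        self.W_init = rng.randn(dim_common, dim_common) * 0.01
        self.b_init = np.zeros(dim_common)

    def project(self, h_fwd, h_bwd):
        # Eq. (8): h_t^n = W_adp^n [h_fwd; h_bwd] + b_adp^n
        return self.W_adp @ np.concatenate([h_fwd, h_bwd]) + self.b_adp

    def for_attention(self, h_n):
        # Eq. (9): phi_att^n is an element-wise tanh in the paper
        return np.tanh(h_n)

    def for_initializer(self, h_n_first):
        # Eq. (10): phi_init^n(h_1^n), here a single tanh layer
        return np.tanh(self.W_init @ h_n_first + self.b_init)
```

Because the adapters bring every encoder's states to the same dimensionality, the attention mechanism downstream never needs to know which source language produced them.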
Decoders

We first start with the initialization of the decoder's hidden state. Each decoder has its own parametric function $\varphi^m_{\text{init}}$ that maps the transformed context vector $\hat{h}^n_1$ of the source encoder from Eq. (10) into the initial hidden state:

$z^m_0 = \varphi^m_{\text{init}}(\hat{h}^n_1) = \varphi^m_{\text{init}}\left(\phi^n_{\text{init}}(h^n_1)\right).$

$\varphi^m_{\text{init}}$ can be any parametric function, and in this paper, we used a feedforward network with a single $\tanh$ hidden layer.

Each decoder exposes a parametric function $\varphi^m_{\text{att}}$ that transforms its hidden state and the previously decoded symbol to be compatible with the shared attention mechanism. This transformer takes as input the previous hidden state $z^m_{t-1}$ and the previous symbol $\tilde{y}^m_{t-1}$ and returns a vector for the attention mechanism:

$\tilde{z}^m_{t-1} = \varphi^m_{\text{att}}\left(z^m_{t-1}, E^m_y[\tilde{y}^m_{t-1}]\right),$   (11)

which replaces $z_{t-1}$ in Eq. (2). In this paper, we use a feedforward network with a single $\tanh$ hidden layer for each $\varphi^m_{\text{att}}$.

Given the previous hidden state $z^m_{t-1}$, the previously decoded symbol $\tilde{y}^m_{t-1}$ and the time-dependent context vector $c^m_t$, which we will discuss shortly, the decoder updates its hidden state:

$z^m_t = \Psi_{\text{dec}}\left(z^m_{t-1}, E^m_y[\tilde{y}^m_{t-1}], f^m_{\text{adp}}(c^m_t)\right),$

where $f^m_{\text{adp}}$ affine-transforms the time-dependent context vector to be of the same dimensionality as the decoder. We share a single affine-transformation layer $f^m_{\text{adp}}$ for all the decoders in this paper.

Once the hidden state is updated, the probability distribution over the next symbol is computed exactly as for the pair-specific model (see Eq. (7)).

Attention Mechanism

Unlike the encoders and decoders, of which there is an instance for each language, there is only a single attention mechanism, shared across all the language pairs. This shared mechanism uses the attention-specific vectors $\tilde{h}^n_t$ and $\tilde{z}^m_{t-1}$ from the encoder and decoder, respectively. The relevance score of each context vector $h^n_i$ is computed based on the decoder's previous hidden state $z^m_{t-1}$ and previous symbol $\tilde{y}^m_{t-1}$:

$e^{m,n}_{t,i} = f_{\text{score}}\left(\tilde{h}^n_i, \tilde{z}^m_{t-1}, \tilde{y}^m_{t-1}\right).$

These scores are normalized according to Eq. (3) to become the attention weights $\alpha^{m,n}_{t,i}$. With these attention weights, the time-dependent context vector is computed as the weighted sum of the original context vectors:

$c^{m,n}_t = \sum_{i=1}^{T_x} \alpha^{m,n}_{t,i} h^n_i.$

See Fig. 1 for an illustration.

Figure 1: One step of the proposed multi-way, multilingual neural machine translation model, for the n-th encoder and the m-th decoder at time step t. See Sec. 4 for details.
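The sketch below wires an arbitrary encoder n and decoder m into one shared scoring network: the per-language transformers supply $\tilde{h}^n_i$ and $\tilde{z}^m_{t-1}$, and the same scoring parameters are reused for every (m, n) combination. The class and parameter names are illustrative, the previous target symbol is assumed to be folded into $\tilde{z}^m_{t-1}$ via Eq. (11), and the scoring network is the simple single-layer feedforward form used in Sec. 2.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class SharedAttention:
    """One f_score shared by all language pairs (Sec. 4, Attention Mechanism)."""

    def __init__(self, dim_ctx, dim_dec, dim_hidden, rng=np.random):
        # These two parameter tensors are the *only* attention parameters,
        # regardless of how many encoders and decoders are added.
        self.W = rng.randn(dim_hidden, dim_ctx + dim_dec) * 0.01
        self.v = rng.randn(dim_hidden) * 0.01

    def context(self, H_tilde, H_orig, z_tilde_prev):
        """H_tilde      : (T_x, dim_ctx) transformed context vectors h~_i^n (Eq. (9))
        H_orig       : (T_x, d)       original context vectors h_i^n (Eq. (8))
        z_tilde_prev : (dim_dec,)     transformed decoder state z~_{t-1}^m (Eq. (11))
        """
        # e_{t,i}^{m,n} = f_score(h~_i^n, z~_{t-1}^m): same W, v for all pairs
        scores = np.array([
            self.v @ np.tanh(self.W @ np.concatenate([h_i, z_tilde_prev]))
            for h_i in H_tilde])
        alpha = softmax(scores)                 # alpha_{t,i}^{m,n}, via Eq. (3)
        return alpha @ H_orig                   # c_t^{m,n}: weighted sum
```

Because `W` and `v` do not depend on m or n, the number of attention parameters stays constant as languages are added, which is exactly the property sought in Sec. 3.2.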
5 Experiment Settings

5.1 Datasets

We evaluate the proposed multi-way, multilingual translation model on all the pairs available from WMT'15, between English (En) and each of French (Fr), Czech (Cs), German (De), Russian (Ru) and Finnish (Fi), totalling ten directed pairs. For each pair, we concatenate all the available parallel corpora from WMT'15 and use the result as a training set. We use newstest-2013 as a development set and newstest-2015 as a test set for all the pairs other than Fi-En. In the case of Fi-En, we use newsdev-2015 and newstest-2015 as a development set and test set, respectively.

Data Preprocessing

Each training corpus is tokenized using the tokenizer script from the Moses decoder. The tokenized training corpus is cleaned following the procedure in (Jean et al., 2015). Instead of using space-separated tokens, or words, we use sub-word units extracted by byte pair encoding, as recently proposed in (Sennrich et al., 2015b). For each and every language, we include 30k sub-word symbols in the vocabulary. See Table 1 for the statistics of the final, preprocessed training corpora.

Table 1: Statistics of the parallel corpora from WMT'15. Symbols are BPE-based sub-words.

Pair  | # Symbols (En) | # Symbols (Other) | # Sentence Pairs
En-Fr | 1.022b         | 2.213b            | 38.85m
En-Cs | 186.57m        | 185.58m           | 12.12m
En-Ru | 50.62m         | 55.76m            | 2.32m
En-De | 111.77m        | 117.41m           | 4.15m
En-Fi | 52.76m         | 43.67m            | 2.03m

Evaluation Metric

We mainly use BLEU as an evaluation metric, computed with the multi-bleu script from Moses. BLEU is computed on the tokenized text after merging the BPE-based sub-word symbols. We further look at the average log-probability assigned to the reference translations by the trained model as an additional evaluation metric, as a way to measure the model's density estimation performance free from any error caused by approximate decoding.
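Since BLEU is computed only after merging the BPE sub-words, a helper along the following lines is typically applied before scoring with multi-bleu. The "@@ " continuation marker is the convention of the subword-nmt implementation accompanying (Sennrich et al., 2015b); it is assumed here rather than stated in the paper.

```python
def merge_bpe(line, marker="@@"):
    """Undo BPE segmentation on a tokenized line before BLEU scoring.

    Sub-words that continue into the next token are assumed to end with the
    marker, e.g. "multi@@ lingual trans@@ lation" -> "multilingual translation".
    """
    return line.replace(marker + " ", "")

# Example:
# merge_bpe("we train a multi@@ way , multi@@ lingual model")
#   -> "we train a multiway , multilingual model"
```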
5.2 Two Scenarios

Low-Resource Translation

First, we investigate the effect of the proposed multi-way, multilingual model on low-resource language-pair translation. Among the five languages from WMT'15, we choose En, De and Fi as source languages, and En and De as target languages. We constrain the amount of the parallel corpus of each of these pairs to 5%, 10%, 20% and 40% of the original size. In other words, we train four models with different sizes of parallel corpus for each language pair (En-De, De-En, Fi-En).

As a baseline, we train a single-pair model for each multi-way, multilingual model. We further finetune the single-pair model to incorporate a target-side monolingual corpus consisting of all the target-side text from the other language pairs (e.g., when a single-pair model was trained on Fi-En, the target-side monolingual corpus consists of the target sides from De-En). This is done with the recently proposed deep fusion (Gulcehre et al., 2015). The latter is included to tell whether any improvement from the multilingual model is simply due to the increased amount of target-side monolingual corpus.

Large-scale Translation

We train one multi-way, multilingual model that has five encoders and five decoders, corresponding to the five languages from WMT'15: En, Fr, De, Cs, Ru, Fi → En, Fr, De, Cs, Ru, Fi. We use the full corpora for all of them.

5.3 Model Architecture

Each symbol, either source or target, is projected onto a 620-dimensional space. The encoder is a bidirectional recurrent neural network with 1,000 gated recurrent units (GRU) in each direction, and the decoder is a recurrent neural network also with 1,000 GRUs. The decoder's output function $g_k$ from Eq. (7) is a feedforward network with 1,000 $\tanh$ hidden units. The dimensionalities of the context vector $h^n_t$ in Eq. (8), the attention-specific context vector $\tilde{h}^n_t$ in Eq. (9) and the attention-specific decoder hidden state $\tilde{z}^m_{t-1}$ in Eq. (11) are all set to 1,200.

We use the same type of encoder for every source language, and the same type of decoder for every target language. The only difference between the single-pair models and the proposed multilingual ones is the numbers of encoders N and decoders M. We leave the multilingual-translation-specific components, such as the ones in Eqs. (8)-(11), in the single-pair models in order to keep the number of shared parameters constant.

Table 2: BLEU scores where the target pair's parallel corpus is constrained to be 5%, 10%, 20% and 40% of the original size. We report the BLEU scores on the development and test sets (separated by /) for the single-pair model (Single), the single-pair model with a monolingual corpus (Single+DF) and the proposed multi-way, multilingual model (Multi).

Pair  | Size  | Single      | Single+DF   | Multi
En→Fi | 100k  | 5.06/3.96   | 4.98/3.99   | 6.2/5.17
En→Fi | 200k  | 7.1/6.16    | 7.21/6.17   | 8.84/7.53
En→Fi | 400k  | 9.11/7.85   | 9.31/8.18   | 11.09/9.98
En→Fi | 800k  | 11.08/9.96  | 11.59/10.15 | 12.73/11.28
De→En | 210k  | 14.27/13.2  | 14.65/13.88 | 16.96/16.26
De→En | 420k  | 18.32/17.32 | 18.51/17.62 | 19.81/19.63
De→En | 840k  | 21/19.93    | 21.69/20.75 | 22.17/21.93
De→En | 1.68m | 23.38/23.01 | 23.33/22.86 | 23.86/23.52
En→De | 210k  | 11.44/11.57 | 11.71/11.16 | 12.63/12.68
En→De | 420k  | 14.28/14.25 | 14.88/15.05 | 15.01/15.67
En→De | 840k  | 17.09/17.44 | 17.21/17.88 | 17.33/18.14
En→De | 1.68m | 19.09/19.6  | 19.36/20.13 | 19.23/20.59

5.4 Training

Basic Settings

We train each model using stochastic gradient descent (SGD) with Adam (Kingma and Ba, 2015) as an adaptive learning rate algorithm. We use an initial learning rate of $2 \cdot 10^{-4}$ and leave all the other hyperparameters as suggested in (Kingma and Ba, 2015). Each SGD update is computed using a minibatch of 80 examples, unless the model is parallelized over two GPUs, in which case we use a minibatch of 60 examples. We only use sentences of length up to 50 symbols. We clip the norm of the gradient to be no more than 1 (Pascanu et al., 2012). All training runs are early-stopped based on BLEU on the development set. In preliminary experiments we observed better development-set scores when finetuning the shared parameters and the output layers of the decoders of the multilingual models, so we do this for all the multilingual models. During finetuning, we clip the norm of the gradient to be no more than 5.

Schedule

As we have access only to bilingual corpora, each sentence pair updates only a subset of the parameters. Excessive updates based on a single language pair may bias the model away from the other pairs. To avoid this, we cycle through all the language pairs, one pair at a time, in the order Fi⇄En, De⇄En, Fr⇄En, Cs⇄En, Ru⇄En, where ⇄ indicates simultaneous updates on two GPUs. A sketch of this schedule is shown below.
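This is a minimal sketch of that schedule, assuming corpora are exposed as per-pair minibatch iterators and that a `train_on_minibatch` callable applies one Adam update for the given pair; minibatch construction and the two-GPU pairing are abstracted away, and all names are illustrative.

```python
import itertools

def cycling_schedule(corpora, n_updates, train_on_minibatch):
    """Cycle through the language pairs, one pair per update (Sec. 5.4).

    corpora: ordered list of (pair_name, batch_iterator) entries,
             e.g. [("Fi-En", it1), ("De-En", it2), ("Fr-En", it3), ...]
    train_on_minibatch: callable applying one SGD/Adam update for that pair.
    """
    pairs = itertools.cycle(corpora)        # Fi-En, De-En, Fr-En, Cs-En, Ru-En, ...
    for _ in range(n_updates):
        pair_name, batches = next(pairs)    # one pair at a time
        train_on_minibatch(pair_name, next(batches))
```

Round-robin cycling keeps the number of updates roughly balanced across pairs, so no single corpus dominates the shared attention parameters.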
We ac- Le´on Bottou, Michael Karlen, Koray Kavukcuoglu, knowledge the support of the following organiza- and Pavel Kuksa. 2011. Natural language process- tions for research funding and computing support: ing (almost) from scratch. The Journal of Machine LearningResearch,12:2493–2537. NSERC, Samsung, IBM, Calcul Que´bec, Compute [Dingetal.2014] Weiguang Ding, Ruoyan Wang, Fei Canada, the Canada Research Chairs, CIFAR and Mao, and Graham W. Taylor. 2014. Theano-based TUBITAK-2214a. large-scale visual recognition with multiple gpus. arXiv:1412.2302. [Dongetal.2015] DaxiangDong,HuaWu,WeiHe,Dian- References haiYu,andHaifengWang. 2015. Multi-tasklearning [Bahdanauetal.2014] Dzmitry Bahdanau, Kyunghyun formultiplelanguagetranslation. ACL. Cho, and Yoshua Bengio. 2014. Neural machine [Gravesetal.2006] Alex Graves, Santiago Ferna´ndez, translation by jointly learning to align and translate. Faustino Gomez, and Ju¨rgen Schmidhuber. 2006. InICLR2015. Connectionisttemporalclassification:labellingunseg- [Bastienetal.2012] Fre´de´ric Bastien, Pascal Lamblin, mentedsequencedatawithrecurrentneuralnetworks. Razvan Pascanu, James Bergstra, Ian J. Goodfellow, InProceedingsofthe23rdinternationalconferenceon ArnaudBergeron,NicolasBouchard,andYoshuaBen- Machinelearning,pages369–376.ACM. gio. 2012. Theano: new features and speed im- [Gravesetal.2009] Alex Graves, Marcus Liwicki, San- provements. Deep Learning and Unsupervised Fea- tiago Ferna´ndez, Roman Bertolami, Horst Bunke, tureLearningNIPS2012Workshop. and Ju¨rgen Schmidhuber. 2009. A novel connec- [Bergstraetal.2010] James Bergstra, Olivier Breuleux, tionistsystemforunconstrainedhandwritingrecogni- Fre´de´ric Bastien, Pascal Lamblin, Razvan Pascanu, tion. PatternAnalysisandMachineIntelligence,IEEE Guillaume Desjardins, Joseph Turian, David Warde- Transactionson,31(5):855–868. Farley,andYoshuaBengio. 2010. Theano:aCPUand [Gulcehreetal.2015] Caglar Gulcehre, Orhan Firat, GPUmathexpressioncompiler. InProceedingsofthe Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei- Python for Scientific Computing Conference (SciPy), Chi Lin, Fethi Bougares, Holger Schwenk, and June. Yoshua Bengio. 2015. On using monolingual cor- [Bottouetal.1997] Le´on Bottou, Yoshua Bengio, and pora in neural machine translation. arXiv preprint YannLeCun. 1997. Globaltrainingofdocumentpro- arXiv:1503.03535. cessingsystemsusinggraphtransformernetworks. In [HochreiterandSchmidhuber1997] Sepp Hochreiter and ComputerVisionandPatternRecognition,1997.Pro- Ju¨rgenSchmidhuber. 1997. Longshort-termmemory. ceedings., 1997 IEEE Computer Society Conference Neuralcomputation,9(8):1735–1780. on,pages489–494.IEEE. [Jeanetal.2014] Se´bastien Jean, Kyunghyun Cho, [Caruana1997] Rich Caruana. 1997. Multitask learning. Roland Memisevic, and Yoshua Bengio. 2014. On Machinelearning,28(1):41–75. usingverylargetargetvocabularyforneuralmachine [Choetal.2014a] KyunghyunCho,BartvanMerrie¨nboer, translation. InACL2015. Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On [Jeanetal.2015] Se´bastienJean,OrhanFirat,Kyunghyun thepropertiesofneuralmachinetranslation:Encoder– Cho, Roland Memisevic, and Yoshua Bengio. 2015. Decoder approaches. In Eighth Workshop on Syntax, Montreal neural machine translation systems for SemanticsandStructureinStatisticalTranslation,Oc- wmt’15. In Proceedings of the Tenth Workshop on tober. Statistical Machine Translation, pages 134–140, Lis- [Choetal.2014b] Kyunghyun Cho, Bart bon, Portugal, September. Association for Computa- Van Merrie¨nboer, Caglar Gulcehre, Dzmitry Bah- tionalLinguistics. 
References

[Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In ICLR 2015.

[Bastien et al.2012] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

[Bergstra et al.2010] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June.

[Bottou et al.1997] Léon Bottou, Yoshua Bengio, and Yann LeCun. 1997. Global training of document processing systems using graph transformer networks. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 489-494. IEEE.

[Caruana1997] Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41-75.

[Cho et al.2014a] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder-decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, October.

[Cho et al.2014b] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[Chorowski et al.2015] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577-585.

[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493-2537.

[Ding et al.2014] Weiguang Ding, Ruoyan Wang, Fei Mao, and Graham W. Taylor. 2014. Theano-based large-scale visual recognition with multiple GPUs. arXiv:1412.2302.

[Dong et al.2015] Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. ACL.

[Graves et al.2006] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369-376. ACM.

[Graves et al.2009] Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. 2009. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855-868.

[Gulcehre et al.2015] Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

[Jean et al.2014] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. In ACL 2015.

[Jean et al.2015] Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. Montreal neural machine translation systems for WMT'15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134-140, Lisbon, Portugal, September. Association for Computational Linguistics.

[Kingma and Ba2015] Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. The International Conference on Learning Representations (ICLR).

[Luong et al.2015a] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015a. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.

[Luong et al.2015b] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015b. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

[Pascanu et al.2012] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063.

[Sennrich et al.2015a] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015a. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

[Sennrich et al.2015b] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015b. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

[van Merriënboer et al.2015] Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. 2015. Blocks and Fuel: Frameworks for deep learning. CoRR, abs/1506.00619.