Table Of Content

OpenNMT: Open-Source Toolkit for Neural Machine Translation GuillaumeKlein†,YoonKim∗,YuntianDeng∗,JeanSenellart†,AlexanderM.Rush∗ HarvardUniversity∗,SYSTRAN† Abstract We describe an open-source toolkit for 7 neural machine translation (NMT). The 1 toolkit prioritizes efficiency, modularity, 0 andextensibilitywiththegoalofsupport- 2 ing NMT research into model architec- r a tures, feature representations, and source M modalities,whilemaintainingcompetitive performance and reasonable training re- 6 Figure 1: Schematic view of neural machine translation. quirements. The toolkit consists of mod- The red source words are first mapped to word vectors and ] eling and translation support, as well as thenfedintoarecurrentneuralnetwork(RNN).Uponseeing L the(cid:104)eos(cid:105)symbol,thefinaltimestepinitializesatargetblue detailedpedagogicaldocumentationabout C RNN.Ateachtargettimestep,attentionisappliedoverthe . theunderlyingtechniques. source RNN and combined with the current hidden state to s produceapredictionp(w |w ,x)ofthenextword. This c t 1:t−1 [ 1 Introduction predictionisthenfedbackintothetargetRNN. 2 Neural machine translation (NMT) is a new v 0 methodology for machine translation that has led framework for developing and comparing open- 1 toremarkableimprovements,particularlyinterms sourcesystems,whileatthesametimebeingeffi- 8 of human evaluation, compared to rule-based and cientandaccurateenoughtobeusedinproduction 2 0 statisticalmachinetranslation(SMT)systems(Wu contexts. . et al., 2016; Crego et al., 2016). Originally de- Currently there are several existing NMT im- 1 0 veloped using pure sequence-to-sequence models plementations. Many systems such as those de- 7 (Sutskever et al., 2014; Cho et al., 2014) and im- veloped in industry by Google, Microsoft, and 1 : proved upon using attention-based variants (Bah- Baidu, are closed source, and are unlikely to be v danau et al., 2014; Luong et al., 2015), NMT has released with unrestricted licenses. Many other i X now become a widely-applied technique for ma- systems such as GroundHog, Blocks, tensorflow- r chine translation, as well as an effective approach seq2seq, lamtram, and our own seq2seq-attn, ex- a forotherrelatedNLPtaskssuchasdialogue,pars- ist mostly as research code. These libraries pro- ing,andsummarization. vide important functionality but minimal support As NMT approaches are standardized, it be- toproductionusers. Perhapsmostpromisingisthe comesmoreimportantforthemachinetranslation University of Edinburgh’s Nematus system orig- and NLP community to develop open implemen- inally based on NYU’s NMT system. Nematus tationsforresearcherstobenchmarkagainst,learn provideshigh-accuracytranslation,manyoptions, from, and extend upon. Just as the SMT com- clear documentation, and has been used in sev- munity benefited greatly from toolkits like Moses eral successful research projects. In the devel- (Koehn et al., 2007) for phrase-based SMT and opment of this project, we aimed to build upon CDec (Dyer et al., 2010) for syntax-based SMT, thestrengthsofthissystem,whileprovidingaddi- NMT toolkits can provide a foundation to build tional documentation and functionality to provide upon. A toolkit should aim to provide a shared ausefulopen-sourceNMTframeworkfortheNLP communityinacademiaandindustry. With these goals in mind, we introduce OpenNMT (http://opennmt.net),anopen- source framework for neural machine translation. OpenNMT is a complete NMT implementation. In addition to providing code for the core translation tasks, OpenNMT was designed with three aims: (a)prioritizefirsttrainingandtestefficiency, Figure 2: Live demo of the OpenNMT system across (b)maintainmodelmodularityandreadability,(c) dozensoflanguagepairs. supportsignificantresearchextensibility. This engineering report describes how the sys- temtargetsthesecriteria. Webeginbybrieflysur- anLSTM(HochreiterandSchmidhuber,1997)or veying the background for NMT, describing the GRU (Chung et al., 2014) which help the model high-level implementation details, and then de- learn long-term features. (b) Translation requires scribingspecificcasestudiesforthethreecriteria. relatively large, stacked RNNs, which consist of We end by showing benchmarks of the system in severalverticallayers(2-16)ofRNNsateachtime terms of accuracy, speed, and memory usage for step (Sutskever et al., 2014). (c) Input feeding, severaltranslationandtranslation-liketasks. wherethepreviousattentionvectorisfedbackinto the input as well as the predicted word, has been 2 Background shown to be quite helpful for machine translation (Luong et al., 2015). (d) Test-time decoding is NMT has now been extensively described donethroughbeamsearchwheremultiplehypoth- in many excellent tutorials (see for instance esistarget predictions areconsideredat eachtime https://sites.google.com/site/ step. Implementing these correctly can be diffi- acl16nmt/home). We give only a condensed cult, which motivates their inclusion in an NMT overview. framework. NMT takes a conditional language modeling view of translation by modeling the probability 3 Implementation of a target sentence w given a source sentence 1:T x as p(w |x) = (cid:81)T p(w |w ,x;θ). This OpenNMT is a complete library for training and 1:S 1:T 1 t 1:t−1 distribution is estimated using an attention-based deployingneuralmachinetranslationmodels. The encoder-decoder architecture (Bahdanau et al., system is successor to seq2seq-attn developed at 2014). A source encoder recurrent neural net- Harvard, and has been completely rewritten for work (RNN) maps each source word to a word easeofefficiency,readability,andgeneralizability. vector, and processes these to a sequence of hid- It includes vanilla NMT models along with sup- den vectors h ,...,h . The target decoder com- port for attention, gating, stacking, input feeding, 1 S binesanRNNhiddenrepresentationofpreviously regularization, beam search and all other options generated words (w ,...w ) with source hidden necessaryforstate-of-the-artperformance. 1 t−1 vectors to predict scores for each possible next The main system is implemented in the word. A softmax layer is then used to produce Lua/Torch mathematical framework, and can be a next-word distribution p(w |w ,x;θ). The easily be extended using Torch’s internal stan- t 1:t−1 source hidden vectors influence the distribution dardneuralnetworkcomponents. Ithasalsobeen through an attention pooling layer that weights extended by Adam Lerer of Facebook Research eachsourcewordrelativetoitsexpectedcontribu- to support Python/PyTorch framework, with the tion to the target prediction. The complete model sameAPI. istrainedend-to-endtominimizethenegativelog- The system has been developed completely in likelihoodofthetrainingcorpus. Anunfoldednet- theopenonGitHubat(http://github.com/ workdiagramisshowninFigure1. opennmt/opennmt) and is MIT licensed. The In practice, there are also many other impor- first version has primarily (intercontinental) con- tant aspects that improve the effectiveness of the tributions from SYSTRAN Paris and the Harvard base model. Here we briefly mention four areas: NLPgroup. Sinceofficialbetarelease,theproject (a) It is important to use a gated RNN such as has been starred by over 1000 users, and there have been active development by those outside of Batch Beam GPU CPU CPU/C thesetwoorganizations. Theprojecthasanactive 1 5 209.0 24.1 62.2 forumforcommunityfeedbackwithoverfivehun- 1 1 166.9 23.3 84.9 dred posts in the last two months. There is also a 30 5 646.8 104.0 116.2 live demonstration available of the system in use 30 1 535.1 128.5 392.7 (Figure3). One nice aspect of NMT as a model is its rela- Table1: Performancenumbersinsourcetokenspersecond tive compactness. The Lua OpenNMT system in- fortheTorchCPU/GPUimplementationsandforthemulti- cludingpreprocessingisroughly4Klinesofcode, threaded CPU C implementation. (Run with Intel i7/GTX 1080) and the Python version is less than 1K lines (al- though slightly less-feature complete). For com- parisontheMosesSMTframeworkincludinglan- and process independent batches during training guage modeling is over 100K lines. This makes phase. Two modes are available: synchronous thesystemeasytocompletelyunderstandfornew- and asynchronous training. In synchronous train- comers. The project is fully self-contained de- ing, batches on parallel GPU are run simultane- pending on minimal number of external Lua li- ously and gradients aggregated to update master braries and including also a simple language in- parametersbeforeresynchronizationoneachGPU dependentreversibletokenizationanddetokeniza- forthefollowingbatch. Inasynchronoustraining, tiontools. batches are run independent on each GPU, and independent gradients accumulated to the master 4 DesignGoals copy of the parameters. Asynchronous SGD is knowntoprovidefasterconvergence(Deanetal., As the low-level details of NMT have been cov- 2012). Experiments with 8 GPUs show a 6× eredpreviously,wefocusthisreportonthedesign speedupinperepoch,butaslightlossintraining goalsofOpenNMT:systemefficiency,codemod- efficiency. Whentrainingtosimilarloss,itgivesa ularity,andmodelextensibility. 3.5×totalspeed-uptotraining. 4.1 SystemEfficiency C/Mobile/GPU Translation Training NMT As NMT systems can take from days to weeks to systems requires significant code complexity to train, training efficiency is a paramount concern. facilitate fast back-propagation-through-time. At Slightlyfastertrainingcanmakebethedifference deployment, the system is much less complex, betweenplausibleandimpossibleexperiments. and only requires (i) forwarding values through the network and (ii) running a beam search that Memory Sharing When training GPU-based is much simplified compared to SMT. OpenNMT NMT models, memory size restrictions are the includes several different translation deployments most common limiter of batch size, and thus di- specialized for different run-time environments: rectlyimpacttrainingtime. Neuralnetworktoolk- a batched CPU/GPU implementation for very its, such as Torch, are often designed to trade-off quickly translating a large set of sentences, a extra memory allocations for speed and declar- simple single-instance implementation for use ative simplicity. For OpenNMT, we wanted to on mobile devices, and a specialized C imple- have it both ways, and so we implemented an ex- mentation. The first implementation is suited ternal memory sharing system that exploits the for research use, for instance allowing the user known time-series control flow of NMT systems to easily include constraints on the feasible set and aggressively shares the internal buffers be- of sentences and ideas such as pointer networks tweenclones. Thepotentialsharedbuffersaredy- and copy mechanisms. The last implementation namicallycalculatedbyexplorationofthenetwork is particularly suited for industrial use as it can graphbeforestartingtraining. Inpracticaluse,ag- runonCPUinstandardproductionenvironments; gressive memory reuse provides a saving of 70% it reads the structure of the network and then ofGPUmemorywiththedefaultmodelsize. uses the Eigen package to implement the basic Multi-GPU OpenNMT additionally supports linear algebra necessary for decoding. Table 4.1 multi-GPU training using data parallelism. Each compares the performance of the different GPU has a replica of the master parameters implementationsbasedonbatchsize,beamsize. 4.2 ModularityforResearch Asecondarygoalwasadesireforcodereadability for non-experts. We targeted this goal by explic- itly separating out many optimizations from the core model, and by including tutorial documenta- tionwithinthecode. Totestwhetherthisapproach wouldallownovelfeaturedevelopmentweexper- imentedwithtwocasestudies. Figure 3: 3D Visualization of OpenNMT source embed- Case Study: Factored Neural Translation In dingfromtheTensorBoardvisualizationsystem. feature-based factored neural translation (Sen- nrich and Haddow, 2016), instead of generating a Multiple Modalities Recent work has shown word at each time step, the model generates both that NMT-like systems are effective for image- word and associated features. For instance, the to-text generation tasks (Xu et al., 2015). This systemmightincludewordsandseparatecasefea- task is quite different from standard machine tures. This extension requires modifying both the translation as the source sentence is now an inputs and the output of the decoder to generate image. However, the future of translation multiple symbols. In OpenNMT both of these as- may require this style of (multi-)modal inputs pectsareabstractedfromthecoretranslationcode, (e.g. http://www.statmt.org/wmt16/ andthereforefactoredtranslationsimplymodifies multimodal-task.html). the input network to instead process the feature- As a case study, we adapted two systems with basedrepresentation,andtheoutputgeneratornet- non-textual inputs to run in OpenNMT. The first worktoinsteadproducemultipleconditionallyin- is an image-to-text system developed for mathe- dependentpredictions. matical OCR (Deng et al., 2016). This model re- Case Study: Attention Networks The use of places the source RNN with a deep convolution attention over the encoder at each step of transla- over the source input. Excepting preprocessing, tion is crucial for the model to perform well. The the entire adaptation requires less than 500 lines default method is to utilize the global attention of additional code and is also open-sourced as mechanism. However there are many other types github.com/opennmt/im2text. The sec- of attention that have recently proposed including ond is a speech-to-text recognition system based localattention(Luongetal.,2015),sparse-maxat- on the work of Chan et al. (2015). This sys- tention (Martins and Astudillo, 2016), hierarchi- tem has been implemented directly in OpenNMT calattention(Yangetal.,2016)amongothers. As by replacing the source encoder with a Pyrimidal thisissimplyamoduleinOpenNMTitcaneasily sourcemodel. besubstituted. RecentlytheHarvardgroupdevel- 4.4 AdditionalTools oped a structured attention approach, that utilizes graphical model inference to compute this atten- Finally we briefly summarize some of the addition. The method is quite computationally com- tionaltoolsthatextendOpenNMTtomakeitmore plex; however as it is modularized by the Torch beneficialtotheresearchcommunity. interface,itcanbeusedinOpenNMTtosubstitute Tokenization We aimed for OpenNMT to be forstandardattention. a standalone project and not depend on com- monly used tools. For instance the Moses tok- 4.3 Extensibility enizer has language specific heuristics not neces- Deeplearningisaquicklyevolvingfield. Recently sary in NMT. We therefore include a simple re- work such as variational seq2seq auto-encoders versible tokenizer that (a) includes markers seen (Bowman et al., 2016) or memory networks (We- bythemodelthatallowsimpledeterministicdeto- ston et al., 2014), propose interesting extensions kenization, (b) has extremely simple, language- to basic seq2seq models. We next discuss a case independenttokenizationrules. Thetokenizercan studytodemonstratethatOpenNMTisextensible alsoperformBytePairEncoding(BPE)whichhas tofuturevariants. become a popular method for sub-word tokeniza- ES FR IT PT RO ES - 32.7(+5.4) 28.0(+4.6) 34.4(+6.1) 28.7(+6.4) FR 32.9(+3.3) - 26.3(+4.3) 30.9(+5.2) 26.0(+6.6) IT 31.6(+5.3) 31.0(+5.8) - 28.0(+5.0) 24.3(+5.9) PT 35.3(+10.4) 34.1(+4.7) 28.1(+5.6) - 28.7(+5.0) RO 35.0(+5.4) 31.9(+9.0) 26.4(+6.3) 31.6(+7.3) - Table2: 20languagepairsingletranslationmodel. TableshowsBLEU(∆)where∆comparestoonly usingthepairfortraining. Vocab System Speedtok/sec BLEU 20151 dataset. Here we compare, BLEU score, Train Trans as well as training and test speed to the publicly availableNematussystem. 2 V=50k Nematus 3393 284 17.28 We additionally trained a multilingual trans- ONMT 4185 380 17.60 lation model following Johnson (2016). The V=32k Nematus 3221 252 18.25 modeltranslatesfromandtoFrench,Spanish,Por- ONMT 5254 457 19.34 tuguese, Italian, and Romanian. Training data is 4Msentencesandwasselectedfromtheopenpar- Table 3: Performance Results for EN→DE on WMT15 allel corpus3, specifically from Europarl, Glob- tested on newstest2014. Both system 2x500 RNN, embed- alVoicesandTed. Corpuswasselectedtobemulti- ding size 300, 13 epochs, batch size 64, beam size 5. We compareona50kvocabularyanda32kBPEsetting. source, multi-target: each sentence has its translation in the 4 other languages. Corpus was tok- enized using shared Byte Pair Encoding of 32k. tioninNMTsystems(Sennrichetal.,2015). Comparative results between multi-way translation and each of the 20 independent training are Word Embeddings OpenNMT includes tools presentedinTable2. Thesystematicallylargeim- for simplifying the process of using pretrained provementshowsthatlanguagepairbenefitsfrom wordembeddings,evenallowingautomaticdown- trainingjointlywiththeotherlanguagepairs. load of embeddings for many languages. This al- Additionally we have found interest from the lows training in languages or domain with rela- community in using OpenNMT for non-standard tively little aligned data. Additionally OpenNMT MT tasks like sentence document summarization can export the word embeddings from trained dialogue response generation (chatbots), among models to standard formats. This allows analy- others. Using OpenNMT, we were able to repli- sis is external tools such as TensorBoard, shown catethesentencesummarizationresultsofChopra inFigure4.4. etal.(2016),reachingaROUGE-1scoreof33.13 on the Gigaword data. We have also trained a 5 Benchmarks model on 14 million sentences of the OpenSub- titles data set based on the work Vinyals and Le We now document some runs of the model. We (2015),achievingcomparableperplexity. expect performance and memory usage to improve with further development. Public bench- 6 Conclusion marks are available at http://opennmt. We introduce OpenNMT, a research toolkit for net/Models/, which also includes publicly NMT that prioritizes efficiency and modularity. available pre-trained models for all of these tasks WehopetofurtherdevelopOpenNMTtomaintain andtutorialinstructionsforallofthesetasks. The strong MT results at the research frontier, provid- benchmarks are run on a Intel(R) Core(TM) i7- ingastableandframeworkforproductionuse. 5930KCPU@3.50GHz,256GBMem,trainedon 1 GPU GeForce GTX 1080 (Pascal) with CUDA 1http://statmt.org/wmt15 v. 8.0(driver375.20)andcuDNN(v. 5005). 2https://github.com/rsennrich/nematus. Comparison with OpenNMT/Nematus github revisions The comparison, shown in Table 3, is on 907824/75c6ab1. English-to-German (EN→DE) using the WMT 3http://opus.lingfil.uu.se References MikeSchusterQuocV.LeMaximKrikunYonghuiWu Zhifeng Chen Nikhil Thorat Fernanda Vigas Mar- Dzmitry Bahdanau, Kyunghyun Cho, and tin Wattenberg Greg Corrado Macduff Hughes Jef- Yoshua Bengio. 2014. Neural Machine frey Dean Johnson. 2016. Google’s multilingual Translation By Jointly Learning To Align neural machine translation system: Enabling zero- and Translate. In ICLR. pages 1–15. shottranslation. https://doi.org/10.1146/annurev.neuro.26.041002.131047. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Callison-Burch,MarcelloFederico,NicolaBertoldi, Andrew M. Dai, Rafal Jo´zefowicz, and Samy Brooke Cowan, Wade Shen, Christine Moran, Bengio. 2016. Generating sentences from a Richard Zens, et al. 2007. Moses: Open source continuous space. In Proceedings of the 20th toolkit for statistical machine translation. In Proc SIGNLL Conference on Computational Natu- ACL. Association for Computational Linguistics, ral Language Learning, CoNLL 2016, Berlin, pages177–180. Germany, August 11-12, 2016. pages 10–21. http://aclweb.org/anthology/K/K16/K16-1002.pdf. Minh-Thang Luong, Hieu Pham, and Christopher D. Manning.2015. EffectiveApproachestoAttention- William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol based Neural Machine Translation. In Proc of Vinyals. 2015. Listen, attend and spell. CoRR EMNLP. abs/1508.01211. http://arxiv.org/abs/1508.01211. Andre´ FT Martins and Ramoń Fernandez Astudillo. Kyunghyun Cho, Bart van Merrienboer, Caglar Gul- 2016. Fromsoftmaxtosparsemax: Asparsemodel cehre, Dzmitry Bahdanau, Fethi Bougares, Hol- of attention and multi-label classification. arXiv ger Schwenk, and Yoshua Bengio. 2014. Learn- preprintarXiv:1602.02068. ing Phrase Representations using RNN Encoder- Rico Sennrich and Barry Haddow. 2016. Linguistic Decoder for Statistical Machine Translation. In input features improve neural machine translation. ProcofEMNLP. arXivpreprintarXiv:1606.02892. Sumit Chopra, Michael Auli, Alexander M Rush, and Rico Sennrich, Barry Haddow, and Alexandra Birch. SEAS Harvard. 2016. Abstractive sentence sum- 2015. Neural machine translation of rare words marizationwithattentiverecurrentneuralnetworks. with subword units. CoRR abs/1508.07909. ProceedingsofNAACL-HLT16pages93–98. http://arxiv.org/abs/1508.07909. Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Ilya Sutskever, Oriol Vinyals, and Quoc V. and Yoshua Bengio. 2014. Empirical evaluation of Le. 2014. Sequence to Sequence Learn- gatedrecurrentneuralnetworksonsequencemodel- ing with Neural Networks. In NIPS. page 9. ing. arXivpreprintarXiv:1412.3555. http://arxiv.org/abs/1409.3215. JosepCrego,JungiKim,andJeanSenellart.2016. Sys- OriolVinyalsandQuocLe.2015. Aneuralconversa- tran’spureneuralmachinetranslationsystem. arXiv tionalmodel. arXivpreprintarXiv:1506.05869. preprintarXiv:1602.06023. Jason Weston, Sumit Chopra, and Antoine Bordes. Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, 2014. Memory networks. CoRR abs/1410.3916. Matthieu Devin, Mark Mao, Andrew Senior, Paul http://arxiv.org/abs/1410.3916. Tucker, Ke Yang, Quoc V Le, et al. 2012. Large Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V scale distributed deep networks. In Advances in Le, Mohammad Norouzi, Wolfgang Macherey, neuralinformationprocessingsystems.pages1223– Maxim Krikun, Yuan Cao, Qin Gao, Klaus 1231. Macherey, et al. 2016. Google’s neural ma- chinetranslationsystem: Bridgingthegapbetween Yuntian Deng, Anssi Kanervisto, and Alexander M. human and machine translation. arXiv preprint Rush. 2016. What you get is what you see: A arXiv:1609.08144. visual markup decompiler. CoRR abs/1609.04938. http://arxiv.org/abs/1609.04938. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Chris Dyer, Jonathan Weese, Hendra Setiawan, Adam RichardS.Zemel,andYoshuaBengio.2015. Show, Lopez, Ferhan Ture, Vladimir Eidelman, Juri Gan- attend and tell: Neural image caption genera- itkevitch, Phil Blunsom, and Philip Resnik. 2010. tion with visual attention. CoRR abs/1502.03044. cdec:Adecoder,alignment,andlearningframework http://arxiv.org/abs/1502.03044. for finite-state and context-free translation models. In Proceedings of the ACL 2010 System Demon- Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, strations. Association for Computational Linguis- Alex Smola, and Eduard Hovy. 2016. Hierarchi- tics,pages7–12. cal attention networks for document classification. InProceedingsofthe2016ConferenceoftheNorth Sepp Hochreiter and Ju¨rgen Schmidhuber. 1997. American Chapter of the Association for Computa- Long short-term memory. Neural computation tionalLinguistics: HumanLanguageTechnologies. 9(8):1735–1780.