An Empirical Comparison of Simple Domain Adaptation Methods for Neural Machine Translation

Chenhui Chu1, Raj Dabre2, and Sadao Kurohashi2
1 Japan Science and Technology Agency
2 Graduate School of Informatics, Kyoto University
[email protected], [email protected], [email protected]

Abstract

In this paper, we propose a novel domain adaptation method named "mixed fine tuning" for neural machine translation (NMT). We combine two existing approaches, namely fine tuning and multi domain NMT. We first train an NMT model on an out-of-domain parallel corpus, and then fine tune it on a parallel corpus that is a mix of the in-domain and out-of-domain corpora. All corpora are augmented with artificial tags to indicate specific domains. We empirically compare our proposed method against the fine tuning and multi domain methods and discuss its benefits and shortcomings.

1 Introduction

One of the most attractive features of neural machine translation (NMT) (Bahdanau et al., 2015; Cho et al., 2014; Sutskever et al., 2014) is that it is possible to train an end-to-end system without the need to deal with word alignments, translation rules, and complicated decoding algorithms, which are characteristic of statistical machine translation (SMT) systems. However, it has been reported that NMT works better than SMT only when there is an abundance of parallel corpora. In the case of low resource domains, vanilla NMT is either worse than or comparable to SMT (Zoph et al., 2016).

Domain adaptation has been shown to be effective for low resource NMT. The conventional domain adaptation method is fine tuning, in which an out-of-domain model is further trained on in-domain data (Luong and Manning, 2015; Sennrich et al., 2016b; Servan et al., 2016; Freitag and Al-Onaizan, 2016). However, fine tuning tends to overfit quickly due to the small size of the in-domain data. On the other hand, multi domain NMT (Kobus et al., 2016) involves training a single NMT model for multiple domains. This method adds "<2domain>" tags to the parallel corpora to indicate the domain, and requires no modifications to the NMT system architecture. However, this method has not been studied for domain adaptation in particular.

Motivated by these two lines of work, we propose a new domain adaptation method called "mixed fine tuning," where we first train an NMT model on an out-of-domain parallel corpus, and then fine tune it on a parallel corpus that is a mix of the in-domain and out-of-domain corpora. Fine tuning on the mixed corpus instead of the in-domain corpus alone can address the overfitting problem. All corpora are augmented with artificial tags to indicate specific domains. We tried two different corpus settings:

• Manually created resource poor corpus: using the NTCIR data (patent domain; resource rich) (Goto et al., 2013) to improve the translation quality for the IWSLT data (TED talks; resource poor) (Cettolo et al., 2015).

• Automatically extracted resource poor corpus: using the ASPEC data (scientific domain; resource rich) (Nakazawa et al., 2016) to improve the translation quality for the Wiki data (resource poor). The parallel corpus of the latter domain was automatically extracted (Chu et al., 2016a).

We observed that "mixed fine tuning" works significantly better than methods that use fine tuning and domain tag based approaches separately. Our contributions are twofold:

• We propose a novel method that combines the best of existing approaches and show that it is effective.

• To the best of our knowledge, this is the first work on an empirical comparison of various domain adaptation methods.
[Figure 1: Fine tuning for domain adaptation. An NMT model trained on the source-target out-of-domain corpus initializes the NMT model that is then trained on the source-target in-domain corpus.]

2 Related Work

Besides fine tuning and multi domain NMT using tags, another direction for domain adaptation is using in-domain monolingual data. Either training an in-domain recurrent neural network (RNN) language model for the NMT decoder (Gülçehre et al., 2015) or generating synthetic data by back translating target in-domain monolingual data (Sennrich et al., 2016b) have been studied.

3 Methods for Comparison

All the methods that we compare are simple and do not need any modifications to the NMT system.

3.1 Fine Tuning

Fine tuning is the conventional way for domain adaptation, and thus serves as a baseline in this study. In this method, we first train an NMT system on a resource rich out-of-domain corpus till convergence, and then fine tune its parameters on a resource poor in-domain corpus (Figure 1).

3.2 Multi Domain

The multi domain method is originally motivated by (Sennrich et al., 2016a), which uses tags to control the politeness of NMT translations. The overview of this method is shown in the dotted section of Figure 2. In this method, we simply concatenate the corpora of multiple domains with two small modifications: (a) appending the domain tag "<2domain>" to the source sentences of the respective corpora, which primes the NMT decoder to generate sentences for the specific domain, and (b) oversampling the smaller corpus so that the training procedure pays equal attention to each domain. (We verified the effectiveness of the domain tags by comparing against a setting that does not use them; see the "w/o tags" settings in Tables 1 and 2.)

We can further fine tune the multi domain model on the in-domain data, which is named "multi domain + fine tuning."

3.3 Mixed Fine Tuning

The proposed mixed fine tuning method is a combination of the above methods (shown in Figure 2). The training procedure is as follows:

1. Train an NMT model on out-of-domain data till convergence.

2. Resume training the NMT model from step 1 on a mix of in-domain and out-of-domain data (by oversampling the in-domain data) till convergence.

By default, we utilize domain tags, but we also consider settings where we do not use them (i.e., "w/o tags"). We can further fine tune the model from step 2 on the in-domain data, which is named "mixed fine tuning + fine tuning."

Note that in the "fine tuning" method, the vocabulary obtained from the out-of-domain data is used for the in-domain data, while for the "multi domain" and "mixed fine tuning" methods, we use a vocabulary obtained from the mixed in-domain and out-of-domain data for all the training stages.
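To make the data preparation of Sections 3.2 and 3.3 concrete, the following is a minimal Python sketch of how the tagged and oversampled mixed corpus could be built. The tag strings follow Figure 2; the file names and helper functions are illustrative assumptions rather than the authors' released code.

    import random

    def read_parallel(src_path, tgt_path):
        """Read a parallel corpus as a list of (source, target) sentence pairs."""
        with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
            return [(s.strip(), t.strip()) for s, t in zip(fs, ft)]

    def tag_corpus(pairs, domain_tag):
        """Append the artificial domain tag to every source sentence."""
        return [(f"{domain_tag} {src}", tgt) for src, tgt in pairs]

    def oversample(pairs, target_size):
        """Repeat the smaller corpus (plus a random remainder) up to target_size."""
        repeats, remainder = divmod(target_size, len(pairs))
        return pairs * repeats + random.sample(pairs, remainder)

    # Illustrative file names; any tokenized parallel corpora would do.
    out_domain = tag_corpus(read_parallel("out.src", "out.tgt"), "<2out-of-domain>")
    in_domain = tag_corpus(read_parallel("in.src", "in.tgt"), "<2in-domain>")

    # Oversample the smaller (in-domain) corpus so both domains get equal attention,
    # then merge and shuffle to obtain the mixed training corpus of Section 3.3.
    in_domain = oversample(in_domain, len(out_domain))
    mixed = out_domain + in_domain
    random.shuffle(mixed)

    with open("mixed.src", "w", encoding="utf-8") as fs, open("mixed.tgt", "w", encoding="utf-8") as ft:
        for src, tgt in mixed:
            fs.write(src + "\n")
            ft.write(tgt + "\n")

For the "multi domain" method a model is trained directly on such a mixed corpus; for "mixed fine tuning" the same corpus is used to resume training from the converged out-of-domain model.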
4 Experimental Settings

We conducted NMT domain adaptation experiments in two different settings as follows:

4.1 High Quality In-domain Corpus Setting

Chinese-to-English translation was the focus of the high quality in-domain corpus setting. We utilized the resource rich patent out-of-domain data to augment the resource poor spoken language in-domain data. The patent domain MT was conducted on the Chinese-English subtask (NTCIR-CE) of the patent MT task at the NTCIR-10 workshop (http://ntcir.nii.ac.jp/PatentMT-2/) (Goto et al., 2013). The NTCIR-CE task uses 1,000,000, 2,000, and 2,000 sentences for training, development, and testing, respectively. The spoken domain MT was conducted on the Chinese-English subtask (IWSLT-CE) of the TED talk MT task at the IWSLT 2015 workshop (Cettolo et al., 2015). The IWSLT-CE task contains 209,491 sentences for training. We used the dev2010 set for development, containing 887 sentences. We evaluated all methods on the 2010, 2011, 2012, and 2013 test sets, containing 1,570, 1,245, 1,397, and 1,261 sentences, respectively.

[Figure 2: Mixed fine tuning with domain tags for domain adaptation (the section in the dotted rectangle denotes the multi domain method). The out-of-domain and in-domain corpora are tagged with "<2out-of-domain>" and "<2in-domain>", the smaller in-domain corpus is oversampled, the corpora are merged, and the mixed model is initialized from the out-of-domain model; the in-domain model can in turn be initialized from the mixed model.]

4.2 Low Quality In-domain Corpus Setting

Chinese-to-Japanese translation was the focus of the low quality in-domain corpus setting. We utilized the resource rich scientific out-of-domain data to augment the resource poor Wikipedia (essentially open domain) in-domain data. The scientific domain MT was conducted on the Chinese-Japanese paper excerpt corpus (ASPEC-CJ) (http://lotus.kuee.kyoto-u.ac.jp/ASPEC/) (Nakazawa et al., 2016), which is one subtask of the workshop on Asian translation (WAT) (http://orchid.kuee.kyoto-u.ac.jp/WAT/) (Nakazawa et al., 2015). The ASPEC-CJ task uses 672,315, 2,090, and 2,107 sentences for training, development, and testing, respectively. The Wikipedia domain task was conducted on a Chinese-Japanese corpus automatically extracted from Wikipedia (WIKI-CJ) (Chu et al., 2016a) using the ASPEC-CJ corpus as a seed. The WIKI-CJ task contains 136,013, 198, and 198 sentences for training, development, and testing, respectively.

4.3 MT Systems

For NMT, we used the KyotoNMT system (https://github.com/fabiencro/knmt) (Cromieres et al., 2016). The NMT training settings are the same as those of the best systems that participated in WAT 2016. The sizes of the source and target vocabularies, the source and target side embeddings, the hidden states, the attention mechanism hidden states, and the deep softmax output with a 2-maxout layer were set to 32,000, 620, 1,000, 1,000, and 500, respectively. We used 2-layer LSTMs for both the source and target sides. ADAM was used as the learning algorithm, with a dropout rate of 20% for the inter-layer dropout, and L2 regularization with a weight decay coefficient of 1e-6. The mini batch size was 64, and sentences longer than 80 tokens were discarded. We stopped training early when we observed that the BLEU score on the development set had converged. For testing, we self-ensembled three checkpoints: the one with the best development loss, the one with the best development BLEU, and the final parameters. The beam size was set to 100.

For performance comparison, we also conducted experiments on phrase based SMT (PBSMT). We used the Moses PBSMT system (Koehn et al., 2007) for all of our SMT experiments. For the respective tasks, we trained 5-gram language models on the target side of the training data using the KenLM toolkit (https://github.com/kpu/kenlm/) with interpolated Kneser-Ney discounting. In all of our experiments, we used the GIZA++ toolkit (http://code.google.com/p/giza-pp) for word alignment; tuning was performed by minimum error rate training (Och, 2003), and it was re-run for every experiment.

For both MT systems, we preprocessed the data as follows. For Chinese, we used KyotoMorph (https://bitbucket.org/msmoshen/kyotomorph-beta) for segmentation, which was trained on the CTB version 5 (CTB5) and SCTB (Chu et al., 2016b). For English, we tokenized and lowercased the sentences using the tokenizer.perl script in Moses. Japanese was segmented using JUMAN (http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN) (Kurohashi et al., 1994).

For NMT, we further split the words into sub-words using byte pair encoding (BPE) (Sennrich et al., 2016c), which has been shown to be effective for the rare word problem in NMT. Another motivation for using sub-words is making the different domains share more vocabulary, which is especially important for the resource poor domain. For the Chinese-to-English tasks, we trained two BPE models, on the Chinese and English vocabularies, respectively. For the Chinese-to-Japanese tasks, we trained a joint BPE model on both the Chinese and Japanese vocabularies, because Chinese and Japanese can share some Chinese characters. The number of merge operations was set to 30,000 for all the tasks.
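To illustrate what BPE learns from such data, here is a minimal, self-contained sketch of the merge-learning step in the style of Sennrich et al. (2016c). It is a toy illustration, not the subword-nmt implementation used in the experiments, and the toy corpus and number of merges are made up.

    import re
    from collections import Counter

    def get_pair_stats(vocab):
        """Count adjacent symbol pairs, weighted by word frequency."""
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_pair(pair, vocab):
        """Merge every occurrence of the given symbol pair in the vocabulary."""
        bigram = re.escape(" ".join(pair))
        pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    def learn_bpe(word_freqs, num_merges):
        """Learn BPE merge operations from a {word: frequency} dictionary."""
        # Represent each word as space-separated characters plus an end-of-word marker.
        vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            stats = get_pair_stats(vocab)
            if not stats:
                break
            best = max(stats, key=stats.get)
            vocab = merge_pair(best, vocab)
            merges.append(best)
        return merges

    # Toy corpus: repeated words are gradually merged into larger subword units.
    toy_freqs = Counter("the domain adaptation of machine translation for the patent domain".split())
    print(learn_bpe(toy_freqs, num_merges=10))

Training separate models corresponds to running such a procedure on the Chinese and English sides independently, while the joint Chinese-Japanese model corresponds to running it once over the concatenation of both sides so that the two languages share one set of merge operations.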
System                             NTCIR-CE   IWSLT-CE
                                              test2010   test2011   test2012   test2013   average
IWSLT-CE SMT                       -          12.73      16.27      14.01      14.67      14.31
IWSLT-CE NMT                       -          6.75       9.08       9.05       7.29       7.87
NTCIR-CE SMT                       29.54      3.57       4.70       4.21       4.74       4.33
NTCIR-CE NMT                       37.11      2.23       2.83       2.55       2.85       2.60
Fine tuning                        17.37      13.93      18.99      16.12      17.12      16.41
Multi domain                       36.40      13.42      19.07      16.56      17.54      16.34
Multi domain w/o tags              37.32      12.57      17.40      15.02      15.96      14.97
Multi domain + Fine tuning         14.47      13.18      18.03      16.41      16.80      15.82
Mixed fine tuning                  37.01      15.04      20.96      18.77      18.63      18.01
Mixed fine tuning w/o tags         39.67      14.47      20.53      18.10      17.97      17.43
Mixed fine tuning + Fine tuning    32.03      14.40      19.53      17.65      17.94      17.11

Table 1: Domain adaptation results (BLEU-4 scores) for IWSLT-CE using NTCIR-CE.

System                             ASPEC-CJ   WIKI-CJ
WIKI-CJ SMT                        -          36.83
WIKI-CJ NMT                        -          18.29
ASPEC-CJ SMT                       36.39      17.43
ASPEC-CJ NMT                       42.92      20.01
Fine tuning                        22.10      37.66
Multi domain                       42.52      35.79
Multi domain w/o tags              40.78      33.74
Multi domain + Fine tuning         22.78      34.61
Mixed fine tuning                  42.56      37.57
Mixed fine tuning w/o tags         41.86      37.23
Mixed fine tuning + Fine tuning    31.63      37.77

Table 2: Domain adaptation results (BLEU-4 scores) for WIKI-CJ using ASPEC-CJ.

5 Results

Tables 1 and 2 show the translation results on the Chinese-to-English and Chinese-to-Japanese tasks, respectively. The entries with SMT and NMT are the PBSMT and NMT systems, respectively; the others are the different methods described in Section 3. In both tables, the numbers in bold indicate the best system and all systems that were not significantly different from the best system. The significance tests were performed using the bootstrap resampling method (Koehn, 2004) at p < 0.05.
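For reference, the paired bootstrap test can be summarized with a short sketch. This is a generic illustration of the method of (Koehn, 2004), not the authors' evaluation script; the corpus_score argument and the 1,000-sample setting are assumptions for illustration.

    import random

    def paired_bootstrap(refs, hyps_a, hyps_b, corpus_score, num_samples=1000):
        """Estimate how often system A beats system B on resampled test sets.

        refs, hyps_a, hyps_b: lists of reference and hypothesis sentences (same length).
        corpus_score: function(hypotheses, references) -> float, e.g. corpus-level BLEU.
        """
        n = len(refs)
        wins_a = 0
        for _ in range(num_samples):
            # Resample sentence indices with replacement to build a pseudo test set.
            idx = [random.randrange(n) for _ in range(n)]
            sample_refs = [refs[i] for i in idx]
            score_a = corpus_score([hyps_a[i] for i in idx], sample_refs)
            score_b = corpus_score([hyps_b[i] for i in idx], sample_refs)
            if score_a > score_b:
                wins_a += 1
        # If A wins on at least 95% of the resampled sets, the difference is
        # taken to be significant at p < 0.05.
        return wins_a / num_samples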
We can see that without domain adaptation, the SMT systems perform significantly better than the NMT systems on the resource poor domains, i.e., IWSLT-CE and WIKI-CJ, while on the resource rich domains, i.e., NTCIR-CE and ASPEC-CJ, NMT outperforms SMT. Directly using the SMT/NMT models trained on the out-of-domain data to translate the in-domain data performs poorly. With our proposed "Mixed fine tuning" domain adaptation method, NMT significantly outperforms SMT on the in-domain tasks.

Comparing the different domain adaptation methods, "Mixed fine tuning" shows the best performance. We believe the reason for this is that "Mixed fine tuning" can address the over-fitting problem of "Fine tuning." We observed that while "Fine tuning" overfits quickly after only 1 epoch of training, "Mixed fine tuning" only slightly overfits until convergence. In addition, "Mixed fine tuning" does not worsen the quality of out-of-domain translations, while "Fine tuning" and "Multi domain" do. One shortcoming of "Mixed fine tuning" is that, compared to "Fine tuning," it takes a longer time for the fine tuning process, as the time until convergence is essentially proportional to the size of the data used for fine tuning.

"Multi domain" performs either as well as (IWSLT-CE) or worse than (WIKI-CJ) "Fine tuning," but "Mixed fine tuning" performs either significantly better than (IWSLT-CE) or comparably to (WIKI-CJ) "Fine tuning." We believe the performance difference between the two tasks is due to their unique characteristics. As the WIKI-CJ data is of relatively poorer quality, mixing it with out-of-domain data does not have the same level of positive effect as that obtained with the IWSLT-CE data.

The domain tags are helpful for both "Multi domain" and "Mixed fine tuning." Essentially, further fine tuning on in-domain data does not help for either "Multi domain" or "Mixed fine tuning." We believe the reason for this is that the "Multi domain" and "Mixed fine tuning" methods already utilize the in-domain data used for fine tuning.

6 Conclusion

In this paper, we proposed a novel domain adaptation method named "mixed fine tuning" for NMT. We empirically compared our proposed method against the fine tuning and multi domain methods, and have shown that it is effective but is sensitive to the quality of the in-domain data used.

In the future, we plan to incorporate an RNN language model into our current architecture to leverage abundant in-domain monolingual corpora. We also plan on exploring the effects of synthetic data generated by back translating large in-domain monolingual corpora.

References

[Bahdanau et al. 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, USA, May. International Conference on Learning Representations.

[Cettolo et al. 2015] M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, R. Cattoni, and M. Federico. 2015. The IWSLT 2015 evaluation campaign. In Proceedings of the Twelfth International Workshop on Spoken Language Translation (IWSLT).

[Cho et al. 2014] Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.

[Chu et al. 2016a] Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2016a. Parallel sentence extraction from comparable corpora with neural network features. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May. European Language Resources Association (ELRA).

[Chu et al. 2016b] Chenhui Chu, Toshiaki Nakazawa, Daisuke Kawahara, and Sadao Kurohashi. 2016b. SCTB: A Chinese treebank in scientific domain. In Proceedings of the 12th Workshop on Asian Language Resources (ALR12), pages 59–67, Osaka, Japan, December. The COLING 2016 Organizing Committee.

[Cromieres et al. 2016] Fabien Cromieres, Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2016. Kyoto University participation to WAT 2016. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), pages 166–174, Osaka, Japan, December. The COLING 2016 Organizing Committee.

[Freitag and Al-Onaizan 2016] Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation. arXiv preprint arXiv:1612.06897.

[Goto et al. 2013] Isao Goto, Ka-Po Chow, Bin Lu, Eiichiro Sumita, and Benjamin K. Tsou. 2013. Overview of the patent machine translation task at the NTCIR-10 workshop. In Proceedings of the 10th NTCIR Conference, pages 260–286, Tokyo, Japan, June. National Institute of Informatics (NII).

[Gülçehre et al. 2015] Çağlar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. CoRR, abs/1503.03535.

[Kobus et al. 2016] Catherine Kobus, Josep Crego, and Jean Senellart. 2016. Domain control for neural machine translation. arXiv preprint arXiv:1612.06140.

[Koehn et al. 2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June. Association for Computational Linguistics.

[Koehn 2004] Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP 2004, pages 388–395, Barcelona, Spain, July. Association for Computational Linguistics.

[Kurohashi et al. 1994] Sadao Kurohashi, Toshihisa Nakamura, Yuji Matsumoto, and Makoto Nagao. 1994. Improvements of Japanese morphological analyzer JUMAN. In Proceedings of the International Workshop on Sharable Natural Language Resources, pages 22–28.

[Luong and Manning 2015] Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the 12th International Workshop on Spoken Language Translation, pages 76–79, Da Nang, Vietnam, December.

[Nakazawa et al. 2015] Toshiaki Nakazawa, Hideya Mino, Isao Goto, Graham Neubig, Sadao Kurohashi, and Eiichiro Sumita. 2015. Overview of the 2nd Workshop on Asian Translation. In Proceedings of the 2nd Workshop on Asian Translation (WAT2015), pages 1–28, Kyoto, Japan, October.

[Nakazawa et al. 2016] Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian scientific paper excerpt corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May. European Language Resources Association (ELRA).

[Och 2003] Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, July. Association for Computational Linguistics.

[Sennrich et al. 2016a] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40, San Diego, California, June. Association for Computational Linguistics.

[Sennrich et al. 2016b] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany, August. Association for Computational Linguistics.

[Sennrich et al. 2016c] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016c. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August. Association for Computational Linguistics.

[Servan et al. 2016] Christophe Servan, Josep Crego, and Jean Senellart. 2016. Domain specialization: a post-training domain adaptation for neural machine translation. arXiv preprint arXiv:1612.06141.

[Sutskever et al. 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, pages 3104–3112, Cambridge, MA, USA. MIT Press.

[Zoph et al. 2016] Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, pages 1568–1575, Austin, Texas, USA, November 1-4, 2016.