QCRI Machine Translation Systems for IWSLT 16

Nadir Durrani, Fahim Dalvi, Hassan Sajjad, Stephan Vogel
Qatar Computing Research Institute, HBKU
{ndurrani,faimaduddin,hsajjad,svogel}@qf.org.qa

arXiv:1701.03924v1 [cs.CL] 14 Jan 2017

Abstract

This paper describes QCRI's machine translation systems for the IWSLT 2016 evaluation campaign. We participated in the Arabic→English and English→Arabic tracks. We built both phrase-based and neural machine translation models, in an effort to probe whether the newly emerged NMT framework surpasses traditional phrase-based systems for the Arabic-English language pair. We trained a very strong phrase-based system that includes a big language model, the Operation Sequence Model, the Neural Network Joint Model and class-based models, along with different domain adaptation techniques such as MML filtering, mixture modeling and fine-tuning of the NNJM model. However, a Neural MT system, trained by stacking data from different genres through fine-tuning and by applying an ensemble over 8 models, beat our very strong phrase-based system by a significant margin of 2 BLEU points in the Arabic→English direction. We did not obtain similar gains in the other direction but were still able to outperform the phrase-based system. We also applied system combination on the phrase-based and NMT outputs.

1. Introduction

We describe QCRI's phrase-based and Neural MT systems. We participated in the Arabic-to-English and English-to-Arabic MT tracks. Our translation engines have historically been based on the phrase-based system trained using the Moses toolkit [1], but during the course of this evaluation we made a transition towards the newly emerged Neural MT framework [2], using Nematus, a toolkit used by the top performing team [3] during the recent WMT campaign.

An interesting challenge associated with the IWSLT campaign is the problem of domain adaptation. The in-domain data based on TED talks is available in very small quantity compared to the out-domain UN corpus [4], which has previously been found to be harmful when simply concatenated to the training data [5]. In this year's IWSLT, two additional data resources, OPUS subtitles [6] and the QED corpus [7], were introduced. The latter was also used as an official test-set. Therefore, apart from exploring phrase-based versus Neural MT, we geared ourselves towards adapting our system to TED and QED talks in this multi-domain scenario. With these goals in mind we re-explored both model weighting and data filtering techniques in these new data settings. Below we itemize the most successful attributes of our phrase-based system:

• We applied MML-based data selection [8] to the UN and OpenSubtitles data, with the goal of filtering out harmful data.

• We trained OSM models [9] on separate corpora, and interpolated them [10] by optimizing perplexity on the tuning-set. We also tried this on the OSM models trained on the word classes [11].

• We tried the fine-tuning method of training the NNJM model on the out-domain data and fine-tuning it with the in-domain TED data [12].

• We trained big language models using all the English monolingual data available from the WMT campaign, and the GigaWord corpus for Arabic.

We trained our Neural MT system using the Nematus toolkit. We used bidirectional RNNs for the encoder, 1024 LSTM units, and a word embedding size of 500. Below we itemize what worked when training the neural MT system:

• We trained our baseline model on all of the UN corpus, then continued training with the OpenSubtitles corpus, and finally fine-tuned with the in-domain data.

• We fine-tuned all of our models without freezing any layers in the network, since we had a sufficient amount of data to train on.

• We used dropout when fine-tuning with in-domain data, since it is relatively small compared to the UN and OpenSubtitles data.

• We built our final system as an ensemble of the last eight models, where each model was fine-tuned with the in-domain data.

Finally, we applied system combination over the outputs of the best Neural MT and phrase-based systems using MEMT [13]. Our efforts were mainly focused on the AR→EN TED task. In the end we simply replicated our best system for the EN→AR direction and the QED task. For our best Neural MT system, we were unable to use an ensemble in the EN→AR direction, since we could not train several comparable models to combine.

2. Data Settings and Pre/Post Processing

We trained our systems using the data made available through the IWSLT 2016 campaign. This contained two in-domain data sets, TED talks and the QED corpus [14], and two out-domain data sets, the UN corpus [4] and OPUS data [6]. The statistics are shown in Table 1. For the language model, we trained on the target side of the parallel corpus and all the available English data from the recent WMT campaign [15], and on the GigaWord and OPUS monolingual corpora for Arabic.

         TED     QED     UN      OPUS
Stats    240K    153K    18.5M   40M

Table 1: Number of Sentences in Parallel Data

We segmented the Arabic data using both MADAMIRA and Farasa. We found that MADAMIRA [16] performed 0.1 BLEU points better than Farasa [17] (see Table 2) and decided to use it for the competition. We tokenized the English side using the standard Moses tokenizer. For English→Arabic, outputs were detokenized using the MADA detokenizer. Before scoring the output, we normalized it and the reference translations using the QCRI normalizer [5].

Segmentation   ted-11   ted-12   ted-13   ted-14   Avg
MADA           27.5     30.6     30.4     26.3     28.7
Farasa         27.4     30.3     30.2     26.4     28.6

Table 2: MADA versus Farasa Tokenization

3. Phrase-based System

3.1. Baseline Settings

We trained a phrase-based Moses system with the settings described in [18]: a maximum sentence length of 80, Fast-Aligner for word-alignments [19], an interpolated Kneser-Ney smoothed 5-gram language model [20], a lexicalized reordering model [21], a 5-gram operation sequence model [22], and a 14-gram NNJM model [23] with the baseline settings described in [24]. We used the default distortion limit, 100-best translation options, a phrase-length of 5, the monotone-over-punctuation heuristic, and cube-pruning with a limit of 1000 during tuning and 5000 during test. We used k-best batch MIRA [25] for tuning. We used cased BLEU [26] to measure progress.

3.2. Data Selection

From our experience in previous competitions, we were wary of the fact that simply adding the UN data is harmful for the AR→EN system. We therefore selected data through MML filtering [8]. We selected 2.5%, 3.75%, 5%, 10% and 30% of the UN data and trained the MT pipeline by concatenating the selected data with the in-domain data. We did not include the OPUS data (40 million sentences) or the NNJM in these experiments, in order to get the results quickly. Table 3 shows the results. We found 3.75% (≈685k sentences) to be the optimal threshold. As an alternative to data selection, we tried training in-domain and out-domain phrase-tables separately and using the out-domain phrase-table only as a back-off. The second-to-last row of Table 3 shows the results. While this gave an improvement on top of the baseline system, it was slightly behind MML filtering.

We then tried to find the optimal cut-off on the OPUS data, and selected 20 million sentences (half of the OPUS data). Our best systems used 3.75% of the UN data and half of the OPUS data. Adding the selected OPUS data gave an average improvement of +1.2 BLEU points.

Percentage         ted-11   ted-12   ted-13   ted-14   Avg
ID                 27.5     30.6     30.4     26.3     28.7
2.5%               27.3     30.6     31.5     27.0     29.1
3.75%              27.1     30.7     31.6     27.2     29.2
5%                 27.1     30.4     31.4     27.1     29.1
10%                27.0     30.2     31.5     27.2     29.0
30%                26.3     29.5     30.9     26.7     28.4
Full               25.3     28.7     29.4     25.5     27.3
Back-off PT        27.0     30.4     31.5     27.3     29.1
3.75% + 1/2 OPUS   28.2     32.4     32.3     28.6     30.4

Table 3: Data Selection using MML and Back-off PT

3.3. Language Model

We trained a bigger language model using all the available English data from the recent WMT campaign¹ and the target side of the parallel data. A Kneser-Ney smoothed 5-gram language model was trained on each sub-corpus individually, and the models were then interpolated to minimize perplexity on the target part of the monolingual data. We obtained a gain of +0.5 using the bigger language model. See Table 4.

¹ http://www.statmt.org/wmt16/translation-task.html

System     ted-11   ted-12   ted-13   ted-14   Avg
Baseline   28.2     32.4     32.3     28.6     30.4
+big LM    28.3     32.8     33.2     29.2     30.9

Table 4: Bigger Language Model

3.4. Interpolation of Operation Sequence Models

The OSM model has been a regular feature of the phrase-based pipeline [27] in competition-grade systems. It is a joint sequence translation model which integrates reordering. It was recently found [10] that an OSM model trained on a plain concatenation of data is sub-optimal, and that it can be improved by training OSM models on each domain individually and interpolating them by minimizing perplexity on the in-domain tune-set. Table 5 shows that using the interpolated OSM model (OSMi) instead of the one trained on plain concatenation (OSMc) gives an average improvement of +0.6 BLEU points.

System   ted-11   ted-12   ted-13   ted-14   Avg
OSMc     28.3     32.8     33.2     29.2     30.9
OSMi     29.0     33.5     33.8     29.7     31.5

Table 5: Interpolated Operation Sequence Model

3.5. NNJM Adaptation

We also explored the award-winning Neural Network Joint Model (NNJM) in our pipeline and tried to adapt it towards the in-domain data. We trained an NNJM model on the UN and OPUS data for 25 epochs and then fine-tuned it [12] by running for 25 more epochs on the in-domain data. Because the data is huge, the entire training took 1.5 months of wall-clock time. Table 6 shows the results.

System      ted-11   ted-12   ted-13   ted-14   Avg
Baseline    29.0     33.5     33.8     29.7     31.5
+NNJM       29.8     34.1     34.4     30.1     32.1
+FT(UN)     29.7     33.8     33.9     30.2     31.9
+FT(OPUS)   30.1     34.1     34.6     30.3     32.3

Table 6: Neural Network Joint Model + Different Adaptation Methods

System            ted-11   ted-12   ted-13   ted-14   Avg
Baseline          30.3     34.2     34.7     30.4     32.4
Drop-OOV          30.5     34.2     35.0     30.5     32.6
Transliteration   30.4     34.2     34.7     30.6     32.5

Table 8: Handling OOVs

System           ted-11   ted-12   ted-13   ted-14   Avg
Baseline         27.4     30.3     30.2     26.4     28.7
+Selected UN     27.1     30.7     31.6     27.2     29.2
+Selected OPUS   28.2     32.4     32.3     28.6     30.4
+big LM          28.3     32.8     33.2     29.2     30.9
+OSMi            29.0     33.5     33.8     29.7     31.5
+NNJM            29.8     34.1     34.4     30.1     32.1
+FT(OPUS)        30.1     34.1     34.6     30.3     32.3
Drop-OOV         30.5     34.2     35.0     30.5     32.6

Table 9: Incremental Progress Arabic-to-English System
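As a quick sanity check on Table 9, the gain contributed by each added component can be recomputed from the reported Avg column. The sketch below is only illustrative: the names `TABLE9_AVG`, `incremental_gains` and `total_gain` are our own and do not belong to any toolkit, and the scores are copied as reported in Table 9.

```python
# Average BLEU (Avg column) for each row of Table 9, copied as reported.
TABLE9_AVG = [
    ("Baseline", 28.7),
    ("+Selected UN", 29.2),
    ("+Selected OPUS", 30.4),
    ("+big LM", 30.9),
    ("+OSMi", 31.5),
    ("+NNJM", 32.1),
    ("+FT(OPUS)", 32.3),
    ("Drop-OOV", 32.6),
]


def incremental_gains(rows):
    """Return (system, gain over the previous row) for every row after the first."""
    gains = []
    for (_, prev_score), (name, score) in zip(rows, rows[1:]):
        # Round to one decimal to hide floating-point noise in the subtraction.
        gains.append((name, round(score - prev_score, 1)))
    return gains


def total_gain(rows):
    """Return the gain of the last row over the first row."""
    return round(rows[-1][1] - rows[0][1], 1)


if __name__ == "__main__":
    for name, gain in incremental_gains(TABLE9_AVG):
        print(f"{name:<15} {gain:+.1f}")
    print(f"{'Total':<15} {total_gain(TABLE9_AVG):+.1f}")
```

The per-step gains add up to 3.9 BLEU points over the baseline, which is in line with the gain of about 4 BLEU points reported in the Summary.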
The NNJM model gave a significant improvement (+0.6) on top of the baseline, which does not include it. We found the fine-tuning method to give slight gains (+0.2) when the baseline model was trained on the OPUS data. In contrast, fine-tuning did not help when the model was trained on the UN data.

3.6. Class-based Models

We explored the use of automatic word clusters in phrase-based models [11]. We used 50 classes, obtained by running mkcls. The cluster ids were included in the phrase-table. We additionally trained an in-domain language model using word classes and interpolated OSM models on word classes. However, we only saw very small improvements from using word classes.

System        ted-11   ted-12   ted-13   ted-14   Avg
Baseline      30.1     34.1     34.6     30.3     32.3
Class-based   30.3     34.2     34.7     30.4     32.4

Table 7: Using Word Classes

3.7. Handling Unknown Words

We tried to handle OOV words using drop-oov and through transliteration [28, 29]. The former worked slightly better and was used in the best system. Of course, the gains from the two methods are additive because they address different OOVs, but there is no good way to automatically find which word to drop and which one to transliterate.

3.8. Final System

Table 9 shows the incremental progress on the Arabic→English language pair. Our best system included the MML-selected UN and OPUS corpora, the big language model, the interpolated OSM and the fine-tuned NNJM models. We used the drop-oov option to handle unknown words.

3.9. English-to-Arabic Systems

We did not do detailed experiments for the English→Arabic direction because of computational limitations, but simply replicated what worked for the Arabic→English direction. Table 10 shows the progress on this language pair. The baseline system (ID) was trained on the TED data and the target side of all the permissible parallel data. In the second row, we added all the parallel data except for the UN. In the third row we additionally added the UN data that we selected in the Arabic→English direction. The additional parallel data gives an average improvement of +1.4 BLEU points. We then added an NNJM model trained on in-domain TED data on top of this system, which improved it by +0.8. Adding GigaWord and monolingual OPUS data (another 20M sentences other than the target side of the parallel data) gave an improvement of +0.3. Finally, we replaced the baseline NNJM with the one trained on OPUS data and fine-tuned with the in-domain data to get our best system.

System      ted-11   ted-12   ted-13   ted-14   Avg
ID          14.8     15.6     16.7     14.5     15.4
+Parallel   15.5     16.4     18.2     16.3     16.6
+MML(UN)    15.6     16.3     18.4     16.7     16.8
+NNJM       16.5     17.4     19.2     17.4     17.6
+big LM     16.6     17.6     20.0     17.4     17.9
+NNJM(FT)   16.7     17.9     20.2     17.7     18.1

Table 10: Incremental Progress English-to-Arabic System

3.10. QED Systems

We simply replicated our systems for QED by using the QED corpus as the in-domain data instead of the TED data. We used the same UN data that we selected for our Arabic→English system, so our phrase-tables remain the same. The main changes arise when training the adapted OSM and NNJM models. For the NNJM we simply fine-tuned with the QED corpus instead of the TED corpus. For the interpolated OSM, we concatenated the TED and QED corpora and built an OSM model on the result, which was then interpolated with the OSM models trained on the selected UN and OPUS data. We used the IWSLT tuning set to get the interpolation weights. This way the OSM sub-model created from the TED+QED corpus gets the best weights. We also retrained the language model in a similar fashion. We used the tuning weights obtained from our best TED systems and replaced the TED-adapted OSM, NNJM and language models with their QED-adapted variants.

4. Neural Machine Translation

4.1. Pre/Post-processing

We used a similar pre/post-processing pipeline for Neural MT as for our phrase-based systems (Section 2), and additionally applied BPE [30] before training. Our BPE models are trained separately for the Arabic and English datasets instead of jointly, since the character set differs between the languages. We limited the number of operations to 59,500, as suggested in [30]. We experimented with BPE models trained on the TED data, and on the concatenation of the TED and out-domain data. We did not see any considerable difference in performance between these models. We therefore used the BPE model trained on the TED data for the experiments reported in this paper.

4.2. Baseline

We used the default parameters in Nematus to train our systems: a batch size of 80, source and target vocabularies of 50K entries each, 1024 LSTM units, and an embedding layer size of 500. Baseline systems were trained using only the TED corpus.

4.3. Fine Tuning on Concatenation versus OD

The best phrase-based systems are usually trained by concatenating in-domain and out-domain data. Deep learning systems, on the other hand, are trained on the out-domain data first, and then fine-tuned with in-domain data. We experimented with both strategies. In the interest of time we selected 30% of the UN data using MML filtering (Table 3). We trained two systems, one on the concatenation of the in-domain data with the selected (30%) UN data and the other on the selected data only. We then fine-tuned both models with the in-domain TED data after running them for 3 epochs. Table 11 shows that fine-tuning a system trained on out-domain data only outperforms fine-tuning a system trained on the concatenation.

System             ted-11   ted-12   ted-13   ted-14   Avg
30%+FT(TED)        27.2     31.4     30.8     27.1     29.1
30%+TED+FT(TED)    27.2     30.8     30.1     25.8     28.5

Table 11: Fine Tuning on Out-domain versus Concatenation (models run for 3 epochs)

4.4. Fine-tuning Variants and Dropouts

The default version of Nematus applies fine-tuning by freezing the weights of the embedding layer. The intuition behind freezing a layer is to not allow the weights in that layer to change with additional data. This is sometimes useful when certain layers can be learned better from out-domain data. One such layer in our case is the word embedding layer. We tried a variation in which we do not freeze any layer. This latter variant was found to outperform the default setting (see Table 12).

System        ted-11   ted-12   ted-13   ted-14   Avg
5%            25.8     29.4     29.3     25.0     27.4
5% (Frozen)   24.7     27.7     27.4     23.9     25.9
30%           27.2     30.8     30.1     25.8     28.5
30% (Frozen)  26.5     30.4     28.9     25.0     27.7

Table 12: Fine-tuning with/without freezing the Embeddings

Dropout is known to be useful in NN training when the training data is small. We experimented with dropout, but did not find any significant difference. Hence we decided to use it only when fine-tuning with the in-domain data (TED/QED), since both of the other datasets (UN and OPUS) were big and did not pose any risk of overfitting.

4.5. Data Selection

Since we found data selection useful in the phrase-based system, we also trained our neural systems using 5%, 30% and 100% of the UN data. In these experiments, we concatenated the 5% and 30% selections of the UN data with the in-domain data. To evaluate the most promising models, we trained all of the models until the learning plateaued, and then fine-tuned these models with in-domain data.² The results are shown in Table 13. Using only 5% of the data proved harmful, and the system did not generalize as well as the other models. The model trained on 30% of the data performed better than the model trained on all the data, by 0.7 BLEU points.

² Because we were running experiments in parallel, we were not aware at this point that fine-tuning on out-domain is a better strategy.

In our subsequent experiments we tried to verify whether this finding holds when we add the OPUS data. We therefore trained two systems by fine-tuning the model trained on either the 30% selected UN data or the full UN data using OPUS. Here the results flipped, and we found that the model that used all of the UN data performed better (compare the last two rows in Table 13). We therefore decided to focus our efforts on the model trained on the entire UN data for all of the following experiments.

System          ted-11   ted-12   ted-13   ted-14   Avg
5%+FT(TED)      25.8     29.4     29.3     25.0     27.4
30%+FT(TED)     28.4     32.7     32.9     27.8     30.4
Full+FT(TED)    28.1     32.3     31.6     27.0     29.8
30%+FT(OPUS)    26.1     30.6     32.5     27.1     29.1
Full+FT(OPUS)   28.2     31.7     34.3     29.2     30.8

Table 13: Data selection

4.6. Ensemble

Ensembling models has been shown to give a consistent boost in performance in past best-performing systems [3]. We therefore experimented with several variations. We obtained the best-performing combination by fine-tuning the last eight models of the UN+OPUS system, and then ensembling these eight fine-tuned models. Performance improvements from the ensemble are shown in Table 14. The second row shows the system obtained by fine-tuning our best system in Table 13 with the in-domain TED data. The last row shows the ensemble.

System          ted-11   ted-12   ted-13   ted-14   Avg
Full+FT(OPUS)   28.2     31.7     34.3     29.2     30.8
+FT(TED)        31.8     36.2     36.1     30.8     33.7
Ensemble(8)     32.5     37.0     37.2     31.5     34.6

Table 14: Ensembling over 8 Fine-tuned Models

4.7. Final System

Our final system was trained by first using all of the UN data. We then continued training on the OPUS data. Once learning had plateaued on the OPUS data, we took the last eight models, which were very similar in performance, and fine-tuned each of them using the TED data. We then combined these eight fine-tuned models in an ensemble as our final system. The progress is shown in Table 15. We used the same strategy for the QED systems, fine-tuning the last eight OPUS models with QED data and combining them in an ensemble.

System      ted-11   ted-12   ted-13   ted-14   Avg
Baseline    24.0     26.4     25.2     22.4     24.5
UN          15.9     17.9     20.0     16.3     17.5
+OPUS       28.2     31.7     34.3     29.2     30.8
+TED        31.8     36.2     36.1     30.8     33.7
+Ensemble   32.5     37.0     37.2     31.5     34.6

Table 15: Arabic-to-English NMT System progress

4.8. English-to-Arabic Systems

We used the insights gained from our Arabic-to-English experiments to train our English→Arabic systems. Our final model for both TED and QED was first trained on all of the UN data, followed by the OPUS data, and finally fine-tuned with the in-domain data. The progress is shown in Table 16.

System   ted-11   ted-12   ted-13   ted-14   Avg
UN       9.1      9.3      11.2     9.4      9.8
+OPUS    10.8     11.2     13.4     10.9     11.6
+TED     17.1     18.9     20.1     17.7     18.5

Table 16: English-to-Arabic NMT System progress

5. System Combination

We combined the hypotheses produced by our best phrase-based and Neural MT systems. For this purpose we used the Multi-Engine MT system, or MEMT [13]. The results are shown in Table 17. We did not gain any substantial improvements from system combination. Small improvements were obtained in the Arabic→English direction, barring test-2012. In the English→Arabic direction, in contrast, a significant improvement was obtained only on test-2013. Table 18 shows the results on the official test-sets.

System          ted-11   ted-12   ted-13   ted-14   Avg
Arabic→English
Phrase-based    30.5     34.2     35.0     30.5     32.6
Neural MT       32.5     37.0     37.2     31.5     34.6
System Comb     32.8     36.5     37.4     31.7     34.6
English→Arabic
Phrase-based    16.7     17.9     20.2     17.7     18.1
Neural MT       17.1     18.9     20.1     17.7     18.5
System Comb     16.8     19.1     20.7     17.6     18.6

Table 17: Results for System Combination

System          ted-15   ted-16   qed-15
Arabic→English
Primary         34.1     31.8     28.1
Contrastive     33.7     31.5     28.1
English→Arabic
Primary         19.5     18.4     23.1
Contrastive     19.5     18.1     22.9

Table 18: Results on Official Test Sets

6. Summary

We trained a very strong phrase-based system with SOTA features such as OSM, NNJM and a big LM. The system improved greatly through domain adaptation. To this end we applied MML-based filtering, interpolated OSM and fine-tuning of NNJM models. Overall, our phrase-based system achieved a gain of 4 BLEU points on top of the baseline system. We also applied data selection for training our NMT systems. However, the NMT systems trained on selected data quickly overfit and did not perform well. Our experiments showed that the NMT system trained on the full UN data performed best, and the final NMT system made use of all the available out-of-domain data. The training was, however, performed incrementally, starting with the UN data for 50k iterations, then fine-tuning on OPUS for 25k more iterations, and finally fine-tuning the resulting model using TED talks for a few iterations. We simply replicated our settings to train the QED systems. Finally, we applied system combination of the two systems using MEMT.

While it is computationally expensive, we found training a neural MT system much simpler than training a competitive phrase-based system, where a lot of sub-components need to be optimized independently to reach the best configuration. An NMT system, in contrast, requires the least supervision. Secondly, once a neural system is trained, the effort can easily be reused to adapt the system towards another domain; in our case we simply fine-tuned our UN+OPUS system with the QED corpus. In contrast, almost all the sub-components of a phrase-based system had to be retrained to adapt the system towards the QED corpus.

7. References

[1] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, “Moses: Open source toolkit for statistical machine translation,” in Proceedings of the Association for Computational Linguistics (ACL'07), Prague, Czech Republic, 2007.

[2] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015. [Online]. Available: http://arxiv.org/pdf/1409.0473v6.pdf

[3] R. Sennrich, B. Haddow, and A. Birch, “Edinburgh neural machine translation systems for WMT 16,” in Proceedings of the First Conference on Machine Translation. Berlin, Germany: Association for Computational Linguistics, August 2016, pp. 371–376. [Online]. Available: http://www.aclweb.org/anthology/W16-2323

[4] M. Ziemski, M. Junczys-Dowmunt, and B. Pouliquen, “The United Nations parallel corpus v1.0,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, Portorož, Slovenia, May 23-28, 2016, 2016.

[5] H. Sajjad, F. Guzmán, P. Nakov, A. Abdelali, K. Murray, F. A. Obaidli, and S. Vogel, “QCRI at IWSLT 2013: Experiments in Arabic-English and English-Arabic spoken language translation,” in Proceedings of the 10th International Workshop on Spoken Language Technology (IWSLT-13), December 2013.

[6] P. Lison and J. Tiedemann, “Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), May 2016.

[7] A. Abdelali, F. Guzman, H. Sajjad, and S. Vogel, “The AMARA corpus: Building parallel language resources for the educational domain,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May 2014.

[8] A. Axelrod, X. He, and J. Gao, “Domain adaptation via pseudo in-domain data selection,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP '11, Edinburgh, United Kingdom, 2011.

[9] N. Durrani, H. Schmid, and A. Fraser, “A joint sequence translation model with integrated reordering,” in Proceedings of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT'11), Portland, OR, USA, 2011.

[10] N. Durrani, H. Sajjad, S. Joty, A. Abdelali, and S. Vogel, “Using joint models for domain adaptation in statistical machine translation,” in Proceedings of the Fifteenth Machine Translation Summit (MT Summit XV). Florida, USA: AMTA, November 2015.

[11] N. Durrani, P. Koehn, H. Schmid, and A. Fraser, “Investigating the usefulness of generalized word representations in smt,” in Proceedings of the 25th Annual Conference on Computational Linguistics, ser. COLING'14, Dublin, Ireland, 2014, pp. 421–432.

[12] M.-T. Luong and C. D. Manning, “Stanford neural machine translation systems for spoken language domain,” in International Workshop on Spoken Language Translation, Da Nang, Vietnam, 2015.

[13] K. Heafield and A. Lavie, “CMU system combination in WMT 2011,” in Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland, United Kingdom, July 2011, pp. 145–151. [Online]. Available: http://kheafield.com/professional/avenue/wmt2011.pdf

[14] F. Guzmán, H. Sajjad, S. Vogel, and A. Abdelali, “The AMARA corpus: Building resources for translating the web's educational content,” in Proceedings of the 10th International Workshop on Spoken Language Technology (IWSLT-13), December 2013.

[15] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno Yepes, P. Koehn, V. Logacheva, C. Monz, M. Negri, A. Neveol, M. Neves, M. Popel, M. Post, R. Rubino, C. Scarton, L. Specia, M. Turchi, K. Verspoor, and M. Zampieri, “Findings of the 2016 conference on machine translation,” in Proceedings of the First Conference on Machine Translation. Berlin, Germany: Association for Computational Linguistics, August 2016, pp. 131–198. [Online]. Available: http://www.aclweb.org/anthology/W/W16/W16-2301

[16] A. Pasha, M. Al-Badrashiny, M. Diab, A. El Kholy, R. Eskander, N. Habash, M. Pooleery, O. Rambow, and R. M. Roth, “MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic,” in Proceedings of the Language Resources and Evaluation Conference, ser. LREC '14, Reykjavik, Iceland, 2014, pp. 1094–1101.

[17] A. Abdelali, K. Darwish, N. Durrani, and H. Mubarak, “Farasa: A fast and furious segmenter for arabic,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. San Diego, California: Association for Computational Linguistics, June 2016, pp. 11–16. [Online]. Available: http://www.aclweb.org/anthology/N16-3003

[18] A. Birch, M. Huck, N. Durrani, N. Bogoychev, and P. Koehn, “Edinburgh SLT and MT system description for the IWSLT 2014 evaluation,” in Proceedings of the 11th International Workshop on Spoken Language Translation, ser. IWSLT '14, Lake Tahoe, CA, USA, 2014.

[19] C. Dyer, V. Chahuneau, and N. A. Smith, “A simple, fast, and effective reparameterization of ibm model 2,” in Proceedings of NAACL'13, 2013.

[20] K. Heafield, “KenLM: Faster and Smaller Language Model Queries,” in Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland, United Kingdom, July 2011, pp. 187–197. [Online]. Available: http://kheafield.com/professional/avenue/kenlm.pdf

[21] M. Galley and C. D. Manning, “A Simple and Effective Hierarchical Phrase Reordering Model,” in Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, Hawaii, October 2008, pp. 848–856. [Online]. Available: http://www.aclweb.org/anthology/D08-1089

[22] N. Durrani, H. Schmid, A. Fraser, P. Koehn, and H. Schütze, “The Operation Sequence Model – Combining N-Gram-based and Phrase-based Statistical Machine Translation,” Computational Linguistics, vol. 41, no. 2, pp. 157–186, 2015.

[23] J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul, “Fast and robust neural network joint models for statistical machine translation,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014.

[24] S. Joty, H. Sajjad, N. Durrani, K. Al-Mannai, A. Abdelali, and S. Vogel, “How to Avoid Unwanted Pregnancies: Domain Adaptation using Neural Network Models,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, September 2015.

[25] C. Cherry and G. Foster, “Batch tuning strategies for statistical machine translation,” in Proceedings of the 2012 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ser. NAACL-HLT '12, Montréal, Canada, 2012.

[26] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A Method for Automatic Evaluation of Machine Translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL '02, Morristown, NJ, USA, 2002, pp. 311–318.

[27] N. Durrani, A. Fraser, H. Schmid, H. Hoang, and P. Koehn, “Can markov models over minimal translation units help phrase-based smt?” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria: Association for Computational Linguistics, August 2013, pp. 399–405. [Online]. Available: http://www.aclweb.org/anthology/P13-2071

[28] H. Sajjad, A. Fraser, and H. Schmid, “A statistical model for unsupervised and semi-supervised transliteration mining,” in Proceedings of the Association for Computational Linguistics (ACL'12), Jeju, Korea, 2012.

[29] N. Durrani, H. Sajjad, H. Hoang, and P. Koehn, “Integrating an unsupervised transliteration model into statistical machine translation,” in Proceedings of the 15th Conference of the European Chapter of the ACL (EACL 2014), Gothenburg, Sweden, April 2014.

[30] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, August 2016, pp. 1715–1725. [Online]. Available: http://www.aclweb.org/anthology/P16-1162

