ONTS: “Optima” News Translation System Marco Turchi∗, MartinAtkinson∗, AlastairWilcox+, Brett Crawley, Stefano Bucci+, RalfSteinberger∗ and ErikVander Goot∗ European Commission-JointResearch Centre(JRC), IPSC -GlobeSec ViaFermi 2749,21020Ispra(VA)-Italy ∗[name].[surname]@jrc.ec.europa.eu +[name].[surname]@ext.jrc.ec.europa.eu 4 [email protected] 1 0 2 n a Abstract For these reasons, we have developed our J in-house machine translation system ONTS. Its 3 Weproposeareal-timemachinetranslation translation results will be publicly accessible as 1 system that allows users to select a news part of the Europe Media Monitor family of ap- ] category and to translate the related live plications,(Steinbergeretal.,2009),whichgather L news articles from Arabic, Czech, Danish, C Farsi,French,German,Italian,Polish,Por- and process about 100,000 news articles per day . tuguese,SpanishandTurkishintoEnglish. in about fifty languages. ONTS is based on s c TheMoses-basedsystemwasoptimisedfor the open source phrase-based statistical machine [ the news domain and differs from other translation toolkit Moses (Koehn et al., 2007), 1 available systems in four ways: (1) News trained mostly on freely available parallel cor- v items are automatically categorised on the poraandoptimisedforthenewsdomain,asstated 3 source side, before translation; (2) Named 4 above. The main objective of developing our in- entity translation is optimised by recog- 9 house system is thus not to improve translation nising and extracting them on the source 2 quality over the existing services (this would be . side andbyre-insertingtheirtranslationin 1 the target language, making use of a sep- beyond our possibilities), but to offer our users a 0 4 arate entity repository; (3) News titles are roughtranslation(a“gist”)thatallowsthemtoget 1 translated with a separate translation sys- an idea of the main contents of the article and to v: temwhichisoptimisedforthespecificstyle determine whether the news item at hand is rele- i of news titles; (4) The system was opti- vantfortheirfieldofinterestornot. X mised for speed in order to cope with the A similar news-focused translation service is r largevolumeofdailynewsarticles. a “Found in Translation” (Turchi et al., 2009), which gathers articles in 23 languages and trans- 1 Introduction latesthemintoEnglish. “FoundinTranslation”is also based on Moses, but it categorises the news Beingable toread newsfrom other countries and after translation and the translation process is not written in other languages allows readers to be optimisedforthenewsdomain. better informed. It allows them todetect national news bias and thus improves transparency and 2 Europe Media Monitor democracy. Existing online translation systems such as Google Translate and Bing Translator1 Europe Media Monitor (EMM)2 gathers a daily are thus a great service, but the number of docu- averageof100,000newsarticlesinapproximately ments that can be submitted is restricted (Google 50 languages, from about 3,400 hand-selected will even entirely stop their service in 2012) and webnewssources, fromacoupleofhundred spe- submittingdocumentsmeansdisclosingtheusers’ cialist and government websites, as well as from interests and their (possibly sensitive) data to the about twenty commercial news providers. It vis- service-providing company. its the news web sites up to every fiveminutes to 1http://translate.google.com/ and http: //www.microsofttranslator.com/ 2http://emm.newsbrief.eu/overview.html search forthe latest articles. When newssites of- regionssuchaseachcountryoftheworld,organi- fer RSS feeds, it makes use of these, otherwise zations, themes such as natural disasters or secu- it extracts the news text from the often complex rity,andmorespecificclassessuchasearthquake, HTML pages. All news items are converted to terrorism ortuberculosis, Unicode. They are processed in a pipeline struc- Articles fall into a given category if they sat- ture,whereeachmoduleaddsadditionalinforma- isfy the category definition, which consists of tion. Independently of how files are written, the Boolean operators with optional vicinity opera- systemusesUTF-8-encodedRSSformat. tors and wild cards. Alternatively, cumulative Insidethepipeline,differentalgorithmsareim- positive or negative weights and a threshold can plemented to produce monolingual and multilin- be used. Uppercase letters in the category defi- gual clusters and to extract various types of in- nition only match uppercase words, while lower- formationsuchasnamedentities, quotations, cat- casewordsinthedefinitionmatchbothuppercase egories and more. ONTS uses two modules of and lowercase words. Many categories are de- EMM:thenamedentity recognition andthecate- fined with input from the users themselves. This gorization parts. method to categorize the articles is rather sim- pleanduser-friendly, anditlendsitselftodealing 2.1 NamedEntityRecognition andVariant withmanylanguages, (Steinberger etal.,2009). Matching. 3 News TranslationSystem Named Entity Recognition (NER) is per- formed using manually constructed language- Inthissection,wedescribeourstatisticalmachine independent rules that make use of language- translation (SMT) service based on the open- specific lists of trigger words such as titles source toolkit Moses (Koehn et al., 2007) and its (president), professions or occupations (tennis adaptation totranslation ofnewsitems. player, playboy), references tocountries, regions, Which is the most suitable SMT system for ethnic or religious groups (French, Bavarian, ourrequirements? Themaingoalofoursystem Berber, Muslim), age expressions (57-year-old), istohelptheuserunderstandthecontentofanar- verbal phrases (deceased), modifiers (former) ticle. Thismeansthatatranslated articleisevalu- and more. These patterns can also occur in atedpositivelyevenifitisnotperfectinthetarget combinationandpatternscanbenestedtocapture language. Dealing with such a large number of more complex titles, (Steinberger and Pouliquen, source languages andarticles perday, oursystem 2007). Inordertobeabletocovermanydifferent shouldtakeintoaccountthetranslationspeed,and languages, nootherdictionaries andnoparsersor try to avoid using language-dependent tools such part-of-speech taggersareused. aspart-of-speech taggers. To identify which of the names newly found Inside the Moses toolkit, three different every day are new entities and which ones are statistical approaches have been implemented: merely variant spellings of entities already con- phrasebasedstatistical machinetranslation (PB- tained in the database, we apply a language- SMT) (Koehn et al., 2003), hierarchical phrase independent name similarity measure to decide based statistical machine translation (Chiang, which name variants should be automatically 2007)andsyntax-based statisticalmachinetrans- merged, for details see (Pouliquen and Stein- lation (Marcu et al., 2006). To identify the berger, 2009). This allows us to maintain a most suitable system for our requirements, we database containing over 1,15 million named en- run a set of experiments training the three mod- tities and 200,000 variants. The major part of els with Europarl V4 German-English (Koehn, this resource can be downloaded from http: 2005) and optimizing and testing on the News //langtech.jrc.it/JRC-Names.html corpus (Callison-Burch et al., 2009). For all of them,weusetheirdefaultconfigurationsandthey 2.2 CategoryClassification across arerununderthesameconditiononthesamema- Languages. chine to better evaluate translation time. For the All news items are categorized into hundreds of syntax model we use linguistic information only categories. Category definitions are multilingual, on thetarget side. According toour experiments, created by humans and they include geographic in terms of performance the hierarchical model performs better than PBSMT and syntax (18.31, of the name is a common word in the target lan- 18.09, 17.62Bleupoints), butintermsoftransla- guageanditiswronglytranslated,e.g. theFrench tion speed PBSMTisbetter than hierarchical and name “Bruno Le Maire” which risks to be trans- syntax (1.02, 4.5, 49 second per sentence). Al- latedintoEnglishas“BrunoMayor”. Tomitigate though, the hierarchical model has the best Bleu both the effects we use our multilingual named score, weprefer tousethePBSMTsystem inour entitydatabase. Inthesourcelanguage,eachnews translation service,because itisfourtimesfaster. item is analysed to identify possible entities; if Whichtrainingdatacanweuse? Itisknown anentityisrecognised, itscorrect translation into in statistical machine translation that more train- English is retrieved from the database, and sug- ing data implies better translation. Although, the gested to the SMT system enriching the source number of parallel corpora has been is growing sentenceusingthexmlmarkupoption6inMoses. in the last years, the amounts of training data Thisapproach allowsustocomplementthetrain- vary from language pair to language pair. To ing data increasing the translation capability of train our models we use the freely available cor- oursystem. pora (when possible): Europarl (Koehn, 2005), How to deal with different language styles JRC-Acquis (Steinberger et al., 2006), DGT- in the news? News title writing style contains TM3, Opus (Tiedemann, 2009), SE-Times (Ty- more gerund verbs, no or few linking verbs, ers and Alperen, 2010), Tehran English-Persian prepositions and adverbs than normal sentences, Parallel Corpus (Pilevar et al., 2011), News while content sentences include more preposi- Corpus (Callison-Burch et al., 2009), UN Cor- tion, adverbs and different verbal tenses. Starting pus(Rafalovitch andDale,2009),CzEng0.9(Bo- fromthis assumption, weinvestigated ifthisphe- jarandZˇabokrtsky´, 2009), English-Persian paral- nomenon can affect the translation performance lelcorpus distributed byELRA4 and twoArabic- ofoursystem. English datasets distributed by LDC5. This re- We trained two SMT systems, SMTcontent sults in some language pairs with a large cover- and SMTtitle, using the Europarl V4 German- age, (more than 4 million sentences), and other English data as training corpus, and two dif- with a very small coverage, (less than 1 million). ferent development sets: one made of content The language models are trained using 12 model sentences, News Commentaries (Callison-Burch sentences for the content model and 4.7 million et al., 2009), and the other made of news ti- for the title model. Both sets are extracted from tles in the source language which were trans- Englishnews. lated into English using a commercial transla- Forless resourced languages such as Farsi and tion system. With the same strategy we gener- Turkish, wetried to extend the available corpora. ated also a Title test set. The SMTtitle used a For Farsi, we applied the methodology proposed language model created using only English news by (Lambert et al., 2011), where we used a large titles. The News and Title test sets were trans- languagemodelandanEnglish-FarsiSMTmodel lated by both the systems. Although the perfor- to produce new sentence pairs. For Turkish we mance obtained translating the News and Title added the Movie Subtitles corpus (Tiedemann, corpora are not comparable, we were interested 2009), which allowed the SMT system to in- in analysing how the same test set is translated crease its translation capability, but included sev- by the two systems. We noticed that translat- eralslangwordsandspokenphrases. ing a test set with a system that was optimized HowtodealwithNamedEntitiesintransla- with the same type of data resulted in almost 2 tion? Newsarticlesarerelatedtothemostimpor- Blue score improvements: Title-TestSet: 0.3706 tant events. These names need to be efficiently (SMTtitle),0.3511(SMTcontent);News-TestSet: translated to correctly understand the content of 0.1768 (SMTtitle), 0.1945 (SMTcontent). This anarticle. FromanSMTpoint ofview, twomain behaviour was present also in different language issuesarerelatedtoNamedEntitytranslation: (1) pairs. According to these results we decided such a name is not in the training data or (2) part to use two different translation systems for each language pair, one optimized using title data 3http://langtech.jrc.it/DGT-TM.html 4http://catalog.elra.info/ 6http://www.statmt.org/moses/?n=Moses. 5http://www.ldc.upenn.edu/ AdvancedFeatures#ntoc4 and the other using normal content sentences. SourceL. ONTS GoogleT. Even though this implementation choice requires Arabic 0.318 0.255 morecomputational powertoruninmemorytwo Czech 0.218 0.226 Moses servers, it allows us to mitigate the work- Danish 0.324 0.296 load of each single instance reducing translation Farsi 0.245 0.197 timeofeachsinglearticleandtoimprovetransla- French 0.26 0.286 tionquality. German 0.205 0.25 Italian 0.236 0.31 Polish 0.568 0.511 Portuguese 0.579 0.424 3.1 TranslationQuality Spanish 0.283 0.334 Turkish 0.238 0.395 ToevaluatethetranslationperformanceofONTS, we run a set of experiments where we translate a Table1:Automaticevaluation. test set for each language pair using our system and Google Translate. Lack of human translated parallel titles obliges us to test only the content 4 Technical Implementation basedmodel. ForGerman,SpanishandCzechwe usethenewstestsetsproposedin(Callison-Burch et al., 2010), for French and Italian the news test The translation service is made of two compo- sets presented in (Callison-Burch et al., 2008), nents: the connection module and the Moses for Arabic, Farsi and Turkish, sets of 2,000 news server. The connection module is a servlet im- sentences extracted from the Arabic-English and plemented in Java. It receives the RSS files, English-Persian datasets and the SE-Times cor- isolates each single news article, identifies each pus. For the other languages we use 2,000 sen- source language and pre-processes it. Each news tences which are not news but a mixture of JRC- item is split into sentences, each sentence is to- Acquis, Europarl and DGT-TM data. It is not kenized, lowercased, passed through a statisti- guaranteethatourtestsetsarenotpartofthetrain- cal compound word splitter, (Koehn and Knight, ingdataofGoogleTranslate. 2003), and the named entity annotator module. For language modelling we use the KenLM im- Each test set is translated by Google Translate plementation, (Heafield,2011). - Translator Toolkit, and by our system. Bleu score is used to evaluate the performance of both According to the language, the correct Moses systems. Results, see Table 1, show that Google servers, title and content, are fed in a multi- Translateproducesbettertranslationforthoselan- thread manner. We use the multi-thread version guages forwhichlarge amounts ofdataareavail- ofMoses(Haddow,2010). Whenallthesentences ablesuchasFrench,German,ItalianandSpanish. of each article are translated, the inverse process Surprisingly, for Danish, Portuguese and Polish, isrun: theyaredetokenized,recased,anduntrans- ONTS has better performance, this depends on lated/unknownwordsarelisted. Thetranslatedti- the choice of the test sets which are not made of tle and content of each article are uploaded into news data but of data that is fairly homogeneous theRSSfileanditispassedtothenextmodules. intermsofstyleandgenrewiththetrainingsets. The full system including the translation mod- The impact of the named entity module is ev- ules is running in a 2xQuad-Core with In- ident for Arabic and Farsi, where each English tel Hyper-threading Technology processors with suggested entity results in a larger coverage of 48GB of memory. It is our intention to locate the source language and better translations. For the Moses servers on different machines. This is highly inflected and agglutinative languages such possible thanks to the high modularity and cus- asTurkish,theoutputproposedbyONTSispoor. tomization ofthe connection module. Atthe mo- We are working on gathering more training data ment, the translation models are available for the coming from the news domain and on the pos- followingsource languages: Arabic,Czech,Dan- sibility of applying a linguistic pre-processing of ish, Farsi, French, German, Italian, Polish, Por- thedocuments. tuguese, SpanishandTurkish. but not as good as the performance of web ser- vices such as Google Translate, mostly because we use less training data and we have reduced computational power. On the other hand, our in- house system can be fed with a large number of articlesperdayandsensitivedatawithoutinclud- ing third parties in the translation process. Per- formance and translation time vary according to the number and complexity of sentences and lan- guagepairs. The domain of news articles dynamically changesaccordingtothemaineventsintheworld, while existing parallel data is static and usually Figure1: DemoWebsite. associated to governmental domains. It is our in- tention toinvestigate howtoadaptourtranslation systemupdating thelanguage modelwiththeEn- 4.1 Demo glisharticlesoftheday. Our translation service is currently presented on a demo web site, see Figure 1, which is available Acknowledgments athttp://optima.jrc.it/Translate/. The authors thank the JRC’s OPTIMA team for Newsarticlescanberetrievedselectingoneofthe itssupportduringthedevelopment ofONTS. topics and the language. All the topics are as- signed to each article using the methodology de- scribedin2.2. Thesearticlesareshownintheleft References columnoftheinterface. Whenthebutton “Trans- O.BojarandZ.Zˇabokrtsky´. 2009. CzEng0.9: Large late” is pressed, the translation process starts and Parallel Treebank with Rich Annotation. Prague the translated articles appear in the right column BulletinofMathematicalLinguistics,92. ofthepage. C. Callison-Burch and C. Fordyce and P. Koehn and Thetranslationsystemcanbecustomizedfrom C. Monz and J. Schroeder. 2008. Further Meta- the interface enabling or disabling the named EvaluationofMachineTranslation. Proceedingsof entity, compound, recaser, detokenizer and un- theThirdWorkshoponStatisticalMachineTransla- tion,pages70–106.Columbus,US. known word modules. Each translated article is C. Callison-Burch, and P. Koehn and C. Monz and J. enriched showingthetranslation timeinmillisec- Schroeder. 2009. Findings of the 2009 Workshop onds per character and, if enabled, the list of un- onStatisticalMachineTranslation. Proceedingsof known words. The interface is linked to the con- theFourthWorkshoponStatisticalMachineTrans- nection moduleanddataistransferred usingRSS lation,pages1–28.Athens,Greece. structure. C.Callison-Burch,andP.KoehnandC.MonzandK. Peterson and M. Przybocki and O. Zaidan. 2009. 5 Discussion Findings of the 2010 Joint Workshop on Statisti- cal Machine Translation and Metrics for Machine In this paper wepresent the OptimaNewsTrans- Translation. Proceedings of the Joint Fifth Work- lation System and how it is connected to Eu- shop on Statistical Machine Translation and Met- ropeMediaMonitor application. Differentstrate- ricsMATR,pages17–53.Uppsala,Sweden. giesareapplied toincrease thetranslation perfor- D.Chiang. 2005. Hierarchicalphrase-basedtransla- mance taking advantage of the document struc- tion. ComputationalLinguistics,33(2):pages201– 228.MITPress. ture and other resources available in our research B.Haddow. 2010. Addingmulti-threadeddecodingto group. Webelievethattheexperiments described moses. The Prague Bulletin of Mathematical Lin- inthisworkcanresultveryusefulforthedevelop- guistics,93(1):pages57–66.Versita. ment of other similar systems. Translations pro- K. Heafield. 2011. KenLM: Faster and smaller lan- ducedbyoursystemwillsoonbeavailableaspart guage model queries. Proceedings of the Sixth ofthemainEMMapplications. Workshop on Statistical Machine Translation, Ed- Theperformanceofoursystemisencouraging, inburgh,UK. P. Koehn. 2005. Europarl: A Parallel Corpus for the 5th InternationalConference on Language Re- Statistical Machine Translation. Proceedings of sourcesandEvaluation,pages2142–2147.Genova, the Machine Translation Summit X, pages 79-86. Italy. Phuket,Thailand. R.SteinbergerandB.PouliquenandE.vanderGoot. P.KoehnandF.J.OchandD.Marcu. 2003. Statistical 2009. AnIntroductiontotheEuropeMediaMonitor phrase-basedtranslation. Proceedingsofthe2003 Family of Applications. Proceedings of the Infor- Conference of the North American Chapter of the mationAccessinaMultilingualWorld-Proceedings Association for Computational Linguistics on Hu- of the SIGIR 2009 Workshop, pages 1–8. Boston, man Language Technology, pages 48–54. Edmon- USA. ton,Canada. J. Tiedemann. 2009. News from OPUS-A Collection P. Koehn and K. Knight. 2003. Empirical methods of Multilingual Parallel Corpora with Tools and for compound splitting. Proceedings of the tenth Interfaces. Recent advances in natural language conferenceonEuropeanchapteroftheAssociation processing V: selected papers from RANLP 2007, forComputationalLinguistics,pages187–193.Bu- pages309:237. dapest,Hungary. M. Turchi and I. Flaounas and O. Ali and T. DeBie P.KoehnandH.HoangandA.BirchandC.Callison- andT.SnowsillandN.Cristianini. 2009. Foundin Burch and M. Federico and N. Bertoldi and B. translation. Proceedings of the European Confer- Cowan and W. Shen and C. Moran and R. Zens enceonMachineLearningandKnowledgeDiscov- andC.DyerandO.BojarandA.ConstantinandE. eryinDatabases,pages746–749.Bled,Slovenia. Herbst 2007. Moses: Opensource toolkitforsta- F. Tyers and M.S. Alperen. 2010. South-EastEuro- tisticalmachinetranslation. ProceedingsoftheAn- peanTimes:AparallelcorpusofBalkanlanguages. nualMeetingoftheAssociationforComputational Proceedings of the LREC workshop on Exploita- Linguistics,demonstrationsession,pages177–180. tionofmultilingualresourcesandtoolsforCentral Columbus,Oh,USA. and(South)EasternEuropeanLanguages,Valletta, P. Lambert and H. Schwenk and C. Servan and S. Malta. Abdul-Rauf. 2011. SPMT:InvestigationsonTrans- lationModelAdaptationUsingMonolingualData. Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 284–293. Edinburgh, Scotland. D. Marcu and W. Wang and A. Echihabi and K. Knight. 2006. SPMT: Statistical machine trans- lation with syntactified target language phrases. Proceedings of the 2006 Conference on Empiri- calMethodsinNaturalLanguageProcessing,pages 48–54.Edmonton,Canada. M. Pilevar and H. Faili and A. Pilevar. 2011. TEP: Tehran English-Persian Parallel Corpus. Compu- tationalLinguisticsandIntelligentTextProcessing, pages68–79.Springer. B. Pouliquen and R. Steinberger. 2009. Auto- matic constructionofmultilingualname dictionar- ies. Learning Machine Translation, pages 59–78. MIT Press - Advances in Neural Information Pro- cessingSystemsSeries(NIPS). A. Rafalovitch and R. Dale. 2009. United nations generalassembly resolutions: A six-languagepar- allelcorpus. ProceedingsoftheMTSummitXIII, pages292–299.Ottawa,Canada. R.SteinbergerandB.Pouliquen. 2007. Cross-lingual named entity recognition. Lingvisticæ Investiga- tiones,30(1)pages135–162.JohnBenjaminsPub- lishingCompany. R. Steinberger and B. Pouliquen and A. Widiger and C. IgnatandT.ErjavecandD.Tufis¸ andD. Varga. 2006. TheJRC-Acquis:Amultilingualalignedpar- allel corpus with 20+ languages. Proceedings of