This figure "ec_in_unempl.png" is available in "png"(cid:10) format from: http://arxiv.org/ps/1601.02502v1 Trans-gram, Fast Cross-lingual Word-embeddings rey − Mann = regina − femme es de it fr JocelynCoulmance Jean-MarcMarty ∗ GuillaumeWenzek∗ AmineBenhalloum 105rueLaFayette 105rueLaFayette 105rueLaFayette 105 rueLaFayette 75010Paris 75010Paris 75010Paris 75010Paris [email protected] [email protected] [email protected] [email protected] Abstract In this article, we introduce a new method called Trans-gram, which learns word embed- We introduce Trans-gram, a simple dings aligned across many languages, in a simple 6 and computationally-efficient method to and efficient fashion, using only sentence align- 1 simultaneously learn and align word- ments rather than word alignments. We compare 0 2 embeddingsforavarietyoflanguages, us- our method with previous approaches on a cross- ing only monolingual data and a smaller lingualdocumentclassificationtaskandonaword n a set of sentence-aligned data. We use our translation task and obtain state of the art results J new method to compute aligned word- onthesetasks. Additionally,word-embeddingsfor 1 embeddings for twenty-one languages us- twenty-one languages are learned simultaneously 1 ing English as apivot language. Weshow -toour knowledge -forthefirsttime,inlessthan ] that some linguistic features are aligned two and a half hours. Furthermore, we illustrate L C acrosslanguagesforwhichwedonothave some interesting properties that are captured such ~ ~ s. aligned data, eventhough those properties ascross-lingual analogies, e.g reyes− Mannde+ ~ ~ c donotexistinthepivotlanguage. Wealso femmefr ≈ reginait which can be used for dis- [ achieve state ofthe art results on standard ambiguation. 1 cross-lingual text classification and word v 2 ReviewofPrevious Work translation tasks. 2 0 A number of methods have been explored to 5 1 Introduction trainandalignbilingual word-embeddings. These 2 0 methods pursue two objectives: first, similar rep- Word-embeddings are a representation of words 1. with fixed-sized vectors. It is a distributed repre- resentations (i.e. spatially close)mustbeassigned 0 tosimilarwords(i.e. “semantically close”)within sentation (Hinton,1984) in the sense that there is 6 each language - this is the mono-lingual objec- 1 not necessarily a one-to-one correspondence be- : tive; second, similar representations must be as- v tweenvectordimensionsandlinguisticproperties. signed to similar words across languages - this is i The linguistic properties are distributed along the X thecross-lingual objective. dimensions ofthespace. r a A popular method to compute word- Thesimplestapproachconsistsinseparatingthe mono-lingual optimization task from the cross- embeddings is the Skip-gram model lingual optimization task. This is for example the (Mikolovetal.,2013a). This algorithm learns casein(Mikolovetal.,2013b). Theideaistosep- high-qualitywordvectorswithacomputationcost aratelytraintwosetsofword-embeddingsforeach much lower than previous methods. This allows language and then to do a parametric estimation the processing ofvery important amounts of data. of the mapping between word-embeddings across For instance, a 1.6 billion words dataset can be languages. This method was further extended by processed inlessthanoneday. (FaruquiandDyer,2014). 
3 From Skip-Gram to Trans-Gram

3.1 Skip-gram

We briefly introduce the Skip-gram algorithm, as we will need it for further explanations. Skip-gram trains word embeddings for a language using mono-lingual data only. The method uses a dual representation for words: each word $w$ has two embeddings, a target vector $\vec{w} \in \mathbb{R}^D$ and a context vector (written $\vec{c}$ for a context word $c$), also in $\mathbb{R}^D$. The algorithm estimates the probability of a word $w$ appearing in the context of a word $c$; more precisely, the embeddings $\vec{w}$, $\vec{c}$ are learned so that $\sigma(\vec{w} \cdot \vec{c}) = P(w|c)$, where $\sigma$ is the sigmoid function.

A simplified version of the loss function minimized by Skip-gram is the following:

$$J = \sum_{s \in C} \; \sum_{w \in s} \; \sum_{c \in s[w-l:w+l]} -\log \sigma(\vec{w} \cdot \vec{c}) \qquad (1)$$

where $C$ is the set of sentences constituting the training corpus, and $s[w-l:w+l]$ is a word window on the sentence $s$ centered around $w$. For the sake of simplicity this equation does not include the "negative-sampling" term; see (Mikolov et al., 2013a) for more details.

Skip-gram can be seen as a materialization of the distributional hypothesis (Harris, 1968): "Words used in similar contexts have similar meanings". We will now see how to extend this idea to cross-lingual contexts.

3.2 Trans-gram

In this section we introduce Trans-gram, a new method to compute aligned word embeddings for a variety of languages.

Our method minimizes a sum of mono-lingual losses and cross-lingual losses. Like BilBOWA (Gouws et al., 2015), we use Skip-gram as the mono-lingual loss. Assuming we are trying to learn aligned word vectors for languages $e$ (e.g. English) and $f$ (e.g. French), we note $J_e$ and $J_f$ the two mono-lingual losses.

In BilBOWA, the cross-lingual loss function is a distance between bag-of-words representations of two aligned sentences. But since (Levy and Goldberg, 2014) showed that the Skip-gram loss function extracts interesting linguistic features, we wanted a cross-lingual loss function closer to Skip-gram than BilBOWA's. We therefore introduce a new task, Trans-gram, similar to Skip-gram. Each English sentence $s_e$ in our aligned corpus $A_{e,f}$ is aligned with a French sentence $s_f$. In Skip-gram, the context picked for a target word $w_e$ in a sentence $s_e$ is the set of words $c_e$ appearing in the window centered around $w_e$: $s_e[w_e-l:w_e+l]$. In Trans-gram, the context picked for a target word $w_e$ in a sentence $s_e$ is instead all the words $c_f$ appearing in $s_f$. The loss can thus be written as:

$$\Omega_{e,f} = \sum_{(s_e,s_f) \in A_{e,f}} \; \sum_{w_e \in s_e} \; \sum_{c_f \in s_f} -\log \sigma(\vec{w}_e \cdot \vec{c}_f) \qquad (2)$$

This loss is not symmetric with respect to the languages. We therefore use two cross-lingual objectives: $\Omega_{e,f}$, aligning $e$'s target vectors with $f$'s context vectors, and $\Omega_{f,e}$, aligning $f$'s target vectors with $e$'s context vectors. By comparison, BilBOWA only aligns $e$'s target vectors with $f$'s target vectors. Figure 1 illustrates the four objectives.

[Figure 1: The four partial objectives contributing to the alignment of English and French: a Skip-gram objective per language ($J_e$ and $J_f$) over a window surrounding a target word (blue), and two Trans-gram objectives ($\Omega_{e,f}$ and $\Omega_{f,e}$) over the whole sentence aligned with the sentence from which the target word is extracted (red). Example sentences: "the cat sits on the mat" / "le chat est assis sur le tapis".]

Notice that we make the assumption that the meaning of a word is uniformly distributed across the whole aligned sentence. This assumption, although a naive one, gave us excellent results in practice. Also, our method uses only sentence-aligned corpora, not word-aligned corpora, which are rarer.

To add a third language $i$ (e.g. Italian), we just have to add 3 new objectives ($J_i$, $\Omega_{e,i}$ and $\Omega_{i,e}$) to the global loss. If available, we could also add $\Omega_{f,i}$ and $\Omega_{i,f}$, but in our case we only used corpora aligned with English.
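To make the two objectives concrete, here is a minimal, illustrative sketch of Equations (1) and (2) for a single sentence or sentence pair. It is our sketch, not the authors' code, and, like the simplified equations, it omits negative sampling; embeddings are held in plain dictionaries:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_loss(sentence, target, context, l=5):
    """Simplified Skip-gram loss (Eq. 1) for one sentence: each word
    is paired with the words of the window s[w-l : w+l] around it.
    `target` and `context` map a word to its vector."""
    loss = 0.0
    for i, w in enumerate(sentence):
        for c in sentence[max(0, i - l):i + l + 1]:
            loss += -np.log(sigmoid(target[w] @ context[c]))
    return loss

def transgram_loss(sent_e, sent_f, target_e, context_f):
    """Trans-gram loss (Eq. 2) for one aligned sentence pair: every
    word of the e-sentence is paired with every word of the aligned
    f-sentence, so no word alignment is needed."""
    loss = 0.0
    for w_e in sent_e:
        for c_f in sent_f:
            loss += -np.log(sigmoid(target_e[w_e] @ context_f[c_f]))
    return loss

# Toy usage with random 40-dimensional vectors (dimension from Section 4),
# on the example sentences of Figure 1:
rng = np.random.default_rng(0)
en = "the cat sits on the mat".split()
fr = "le chat est assis sur le tapis".split()
target_e = {w: rng.normal(size=40) for w in en}
context_f = {w: rng.normal(size=40) for w in fr}
print(transgram_loss(en, fr, target_e, context_f))
```

The inner double loop makes the uniform-meaning assumption of the paper explicit: every word of the aligned sentence is treated as context, with no preferred position.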
4 Implementation

In our experiments, we used the Europarl (Koehn, 2005) aligned corpora. Europarl-v7 has two peculiarities: firstly, the corpora are aligned at sentence level; secondly, each pair of languages contains English as one of its members - for instance, there is no French/Italian pair. In other words, English is used as a pivot language. No bilingual lexicons or other bilingual datasets aligned at the word level were used.

Using only the Europarl-v7 texts as both mono-lingual and bilingual data, it took 10 minutes to align 2 languages, and two and a half hours to align the 21 languages of the corpus, in a 40-dimensional space on a 6-core computer. We also computed 300-dimensional vectors using the Wikipedia extracts provided by (Al-Rfou et al., 2013) as mono-lingual data for each language; this training took 21 hours.

5 Experiments

5.1 Reuters Cross-lingual Document Classification

We used a subset of the English and German sections of the Reuters RCV1/RCV2 corpora (Lewis and Li, 2004) (10,000 documents each), as in (Klementiev et al., 2012), and we replicated their experimental setting. In the English dataset, there are four topics: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets). We used these topics as our labels and only selected documents labeled with a single topic. We trained our classifier on the articles of one language, where each document was represented as an IDF-weighted sum of the vectors of its words, and then tested it on the articles of the other language. The classifier was an averaged perceptron, and we used the implementation from (Klementiev et al., 2012)[1]. The word vectors were computed on the Europarl-v7 parallel corpus with size 40, like the other methods. For this task, only the target vectors were used.

We report the percentage precision obtained with our method, in comparison with other methods, in Table 1. The table also includes results obtained with 300-dimensional vectors trained by Trans-gram with Europarl-v7 as the parallel corpus and Wikipedia as the mono-lingual corpus.

Method                                   En→De   De→En   Speed-up in training time
Klementiev et al.                        77.6%   71.1%   ×1
Bilingual Auto-encoder                   91.8%   72.8%   ×3
BiCVM                                    83.7%   71.4%   ×320
BilBOWA                                  86.5%   75.0%   ×800
Trans-gram                               87.8%   78.7%   ×600
Trans-gram (size-300 vectors, EP+Wiki)   91.1%   78.4%   -

Table 1: Comparison of Trans-gram with various methods for Reuters English/German classification.

[1] Thanks to S. Gouws for providing this implementation.

The previous state-of-the-art results were held by (Gouws et al., 2015) with BilBOWA and by (Lauly et al., 2014) with their Bilingual Auto-encoder model, which learns word embeddings during a translation task using an encoder-decoder approach. We also report the scores from Klementiev et al., who introduced the task, and the BiCVM model scores from (Hermann and Blunsom, 2013).

The results show an overall significant improvement over the other methods, with the added advantage of being computationally efficient.
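For illustration, here is a minimal sketch of the document representation used above; the function names and the exact IDF variant (log of inverse document frequency) are our assumptions, not the paper's code:

```python
import numpy as np
from collections import Counter

def idf_weights(documents):
    """IDF of each word over a list of tokenized documents,
    using the common log(N / document-frequency) variant."""
    n = len(documents)
    df = Counter(w for doc in documents for w in set(doc))
    return {w: np.log(n / df[w]) for w in df}

def doc_vector(doc, embeddings, idf):
    """Represent a document as the IDF-weighted sum of the (aligned)
    target vectors of its words; words without an embedding are skipped."""
    dim = len(next(iter(embeddings.values())))
    v = np.zeros(dim)
    for w in doc:
        if w in embeddings:
            v += idf.get(w, 0.0) * embeddings[w]
    return v
```

Because English and German documents end up in the same vector space, a linear classifier (here, an averaged perceptron) trained on the document vectors of one language can be applied directly to those of the other.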
5.2 P@k Word Translation

Next, we evaluated our method on a word translation task, introduced in (Mikolov et al., 2013b) and used in (Gouws et al., 2015). The words were extracted from the publicly available WMT11[2] corpus. The experiments were done for two translation directions: English to Spanish and Spanish to English. (Mikolov et al., 2013b) extracted the top 6K most frequent words and translated them with Google Translate. They used the top 5K pairs to train a translation matrix and evaluated their method on the remaining 1K. As our English and Spanish vectors are already aligned, we do not need the 5K training pairs and use only the 1K test pairs.

[2] http://www.statmt.org/wmt11/

The reported score, the translation precision P@k, is the fraction of test pairs where the target translation (from Google Translate) is one of the k translations proposed by our model. For a given English word $w$, our model takes its target vector $\vec{w}$ and proposes the k closest Spanish words by the cosine similarity of their vectors to $\vec{w}$. We compare ourselves to the "translation matrix" method and to the BilBOWA aligned vectors. We also report the scores obtained by a trivial algorithm that uses edit distance to determine the closest translation, and by the Bing Translator service.

Method               En→Es P@1   En→Es P@5   Es→En P@1   Es→En P@5
Edit distance        13%         18%         24%         27%
Bing                 55%         -           71%         -
Translation Matrix   33%         35%         51%         52%
BilBOWA              39%         44%         51%         55%
Trans-gram           45%         61%         47%         62%

Table 2: Results on the translation task.
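A minimal sketch of this evaluation, assuming the word vectors already live in the shared space (the names and the data layout are our choices):

```python
import numpy as np

def precision_at_k(test_pairs, src_vecs, tgt_vecs, tgt_words, k=5):
    """P@k: fraction of (source word, reference translation) pairs for
    which the reference appears among the k most cosine-similar target
    words. `src_vecs` maps words to vectors; `tgt_vecs` is an (n, d)
    matrix whose rows correspond to the entries of `tgt_words`."""
    tgt_unit = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    hits = 0
    for src_word, reference in test_pairs:
        v = src_vecs[src_word]
        sims = tgt_unit @ (v / np.linalg.norm(v))
        top_k = {tgt_words[i] for i in np.argsort(-sims)[:k]}
        hits += reference in top_k
    return hits / len(test_pairs)
```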
6 Interesting Properties

6.1 Cross-lingual Disambiguation

We now present the task of cross-lingual disambiguation as an example of a possible use of aligned multilingual vectors. The goal of this task is to find a suitable representation for each sense of a given polysemous word. The idea of our method is to look for a language in which the undesired senses are represented by unambiguous words, and then to perform some arithmetic operation.

Let us illustrate the process with a concrete example: consider the French word "train", $train_{fr}$. The three closest Polish words to $\vec{train}_{fr}$ translate into English as "now", "a train" and "when". This seems a poor match. In fact, $train_{fr}$ is polysemous: it can name a line of railroad cars, but it is also used to form progressive tenses - "Il est en train de manger" translates into "he is eating", or in Italian "sta mangiando".

As the Italian word "sta" is used to form progressive tenses, it is a good candidate to disambiguate $train_{fr}$. Let us introduce the vector $\vec{v} = \vec{train}_{fr} - \vec{sta}_{it}$. Now the three Polish words closest to $\vec{v}$ translate into English as "a train", "a train" and "railroad". Therefore $\vec{v}$ is a better representation of the railroad sense of $train_{fr}$ (see Appendix Table 3).
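A sketch of this arithmetic, assuming hypothetical lookup tables vec_fr and vec_it for the aligned French and Italian vectors and a matrix of Polish vectors (all names are ours):

```python
import numpy as np

def closest_words(query, vectors, words, k=3):
    """Return the k nearest words to `query` by cosine similarity.
    `vectors` is an (n, d) matrix whose rows correspond to `words`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ (query / np.linalg.norm(query))
    return [words[i] for i in np.argsort(-sims)[:k]]

# Hypothetical usage, given aligned embeddings:
#   closest_words(vec_fr["train"], polish_vectors, polish_words)
#       -> mixes function words ("now", "when") with railroad words
#   v = vec_fr["train"] - vec_it["sta"]   # subtract the tense marker
#   closest_words(v, polish_vectors, polish_words)
#       -> railroad words only (the disambiguated sense)
```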
6.2 Transfer of Linguistic Features

Another interesting property of the vectors generated by Trans-gram is the transfer of linguistic features through a pivot language that does not possess these features.

Let us illustrate this by focusing on Romance languages, which possess some features that English does not, such as rich conjugations. For example, in French and Italian the infinitives of "eat" are $manger_{fr}$ and $mangiare_{it}$, and the first-person plurals are $mangeons_{fr}$ and $mangiamo_{it}$. In our models we indeed observe the following alignments: $\vec{manger}_{fr} \approx \vec{mangiare}_{it}$ and $\vec{mangeons}_{fr} \approx \vec{mangiamo}_{it}$. It is thus remarkable to see that features not present in English match in languages aligned only through English as a pivot. We also found similar transfers for the genders of adjectives (see Appendix Table 2), and we are currently studying other similar properties captured by Trans-gram.

7 Conclusion

In this paper we provided the following contributions: Trans-gram, a new method to compute cross-lingual word embeddings in a single word space; state-of-the-art results on cross-lingual NLP tasks; a sketch of a cross-lingual calculus to help disambiguate polysemous words; and the exhibition of linguistic-feature transfer through a pivot language not possessing those features. We are still exploring promising properties of the generated vectors and their applications in other NLP tasks (sentiment analysis, NER, ...).

References

[Al-Rfou et al. 2013] Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. arXiv preprint arXiv:1307.1662.
[Faruqui and Dyer 2014] Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. Association for Computational Linguistics.
[Gouws et al. 2015] Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast bilingual distributed representations without word alignments. In Proceedings of the 25th International Conference on Machine Learning, volume 15, pages 748–756.
[Harris 1968] Zellig Sabbettai Harris. 1968. Mathematical Structures of Language.
[Hermann and Blunsom 2013] Karl Moritz Hermann and Phil Blunsom. 2013. Multilingual distributed representations without word alignment. arXiv preprint arXiv:1312.6173.
[Hinton 1984] Geoffrey E. Hinton. 1984. Distributed representations.
[Klementiev et al. 2012] Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words.
[Koehn 2005] Philipp Koehn. 2005. A parallel corpus for statistical machine translation. In MT Summit.
[Lauly et al. 2014] Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems, pages 1853–1861.
[Levy and Goldberg 2014] Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.
[Lewis and Li 2004] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. In Journal of Machine Learning Research.
[Mikolov et al. 2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
[Mikolov et al. 2013b] Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Appendix

Appendix Table 1 – Comparative study of each method. Criteria: (a) no word alignments required; (b) no prior on the mapping between target vectors; (c) no explicit alignment of target vectors; (d) computationally efficient; (e) can leverage mono-lingual corpus.

Method                          (a)   (b)   (c)   (d)   (e)
Mikolov's Translation Matrix    no    no    yes   yes   yes
Faruqui's Translation Matrix    no    no    yes   yes   yes
Zou's Translation Matrix        no    no    yes   yes   yes
BiCVM                           yes   yes   no    yes   no
Bilingual Auto-Encoder          yes   yes   no    no    no
Klementiev                      yes   no    yes   no    yes
BilBOWA                         yes   yes   no    yes   yes
Trans-gram                      yes   yes   yes   yes   yes

Appendix Table 2 – Closest word in different languages to the given French word. To the masculine "beau" (pretty) the closest words are also masculine, while the closest words to the feminine "belle" (pretty) are all feminine.

       closest to beau_fr   closest to belle_fr
it     bel                  bella
pt     lindo                bonita
es     encantador           bella
ro     frumos               adormita

Appendix Table 3 – Top ten Polish words and their English translations, for the ambiguous French word "train" and its disambiguated version.

train_fr                    train_fr − sta_it
pl          translation     pl           translation
obecnie     now             pociąg       train
pociąg      train           pociągu      train
kiedy       when            kolejową     train
pociągu     train           perony       platforms
tym         this            torów        tracks
którym      where           wagonów      wagons
na          on              tory         tracks
aktualnie   currently       torze        track
trasie      from            perony       platforms
czym        what            prędkością   speed

Appendix Table 4 – For each of the 21 languages of the Europarl corpus, the top ten words closest to innovation_fr.

fr: innovation, innovations, créativité, entrepreneurial, innovante, entrepreneuriat, compétitivité, technologies, entreprenariat, innovant
it: innovazione, innovativa, innovazioni, imprenditoriale, imprenditorialità, pmi, innovativo, creatività, imprenditoria, tecnologie
pt: inovação, criatividade, inovações, inovadora, tecnologias, competitividade, tic, empreendedor, criadora, pci
es: innovación, innovaciones, innovadora, creatividad, tic, tecnologías, competitividad, estimulación, emprendedor, punteras
en: non-technological, entrepreneurial, knowledge-driven, innovations, information-based, innovating, eco-efficiency, research, policy-driven, internationalisation
de: innovation, innovationen, unternehmergeist, wissensgesellschaft, innovations-, innovationsfähigkeit, unternehmerischer, forschungs-, unternehmerisches, wettbewerbsfähigkeit
pl: innowacji, innowacyjności, innowacyjność, innowacje, innowacjom, innowacyjnej, innowacjami, innowacyjnego, tworzenia, kreatywność
el: καινοτομία, καινοτομίας, καινοτομίες, καινοτομιών, γνώσης, επιχειρηματικότητας, καινοτόμων, δημιουργικότητας, καινοτόμος, καινοτόμο
fi: kilpailukyvyn, teknologian, kilpailukykyä, kilpailukyky, tutkimukseen, pk-yritysten, työllisyyttä, kasvun, kasvua, työpaikkojen
bg: иновациите, иновации, иновация, иновацията, научноизследователската, предприемаческия, иновационния, иновационен, развойната, знанието
cs: inovaci, inovace, inovací, inovacím, inovační, internacionalizaci, znalostní, výzkumu, znalostního, inovacemi
hu: innováció, innovációnak, innovációhoz, innovációs, innovációt, kutatás, tudásba, innováción, kutatás-fejlesztés, újításokat
nl: innovatie, innovaties, innovatievermogen, innovatiecapaciteit, innoverende, ondernemersklimaat, innovatief, innovatiepotentieel, kennismaatschappij, ondernemerschap
da: innovation, innovationen, videnbaseret, nyskabelse, innovationer, innovativ, innovations-, innovationsevne, iværksætterånd, innovativt
et: innovatsiooni, innovatsioonile, teadmuspõhise, uuendustele, uuendusi, innovatsioonipotentsiaali, uuendustest, uuenduste, uuendusteks, innovatsiooniks
ro: inovare, inovaţiei, inovaţie, inovarea, inovării, inovaţia, cunoaşterii, cunoaştere, inovări, inovaţii
sl: inovacij, inovacije, inovativnost, inovativnosti, inovacijam, inovativnostjo, inovacijske, ustvarjanja, inovacijah, inovacijami
sk: inovácie, inovácií, inovácii, inováciám, inováciu, výskumu, inovácia, inovačnej, znalostnej, konkurencieschopnosť
lt: naujovių, inovacijų, naujoves, naujovėms, inovacijoms, inovacijomis, inovacijas, naujovėmis, inovacijos, naujovės
sv: innovation, nyskapande, innovationen, innovationer, innovationsförmåga, entreprenörsandan, kunskapsbaserad, kunskapsbaserat, innovationerna, innovationsområdet
lv: jauninājumiem, inovācijai, inovāciju, inovācijas, jauninājumu, inovācijās, inovācijām, jauninājumus, jaunrades, novatorisma

[Figure "spa-en.png" is available in PNG format from http://arxiv.org/ps/1601.02502v1]
