Annotating Cognates and Etymological Origin in Turkic Languages BenjaminS.Mericli∗,MichaelBloodgood† ∗UniversityofMaryland,CollegePark,[email protected] †UniversityofMaryland,CollegePark,[email protected] Abstract Turkic languages exhibit extensive and diverse etymological relationships among lexical items. These relationships make the Turkic languagespromisingforexploringautomatedtranslationlexiconinductionbyleveragingcognateandotheretymologicalinformation. However,duetotheextentanddiversityofthetypesofrelationshipsbetweenwords,itisnotclearhowtoannotatesuchinformation.In thispaper,wepresentamethodologyforannotatingcognatesandetymologicalorigininTurkiclanguages.Ourmethodstrivestobalance theamountofresearchefforttheannotatorexpendswiththeutilityoftheannotationsforsupportingresearchonimprovingautomated translationlexiconinduction. 5 1 1. Introduction discusseshowtodefineandannotatecognatesandsubsec- 0 tion2.2.discusseshowtodefineandannotateetymological 2 Automated translation lexicon induction has been investi- n gated in the literature and shown to be feasible for vari- information. a ouslanguagefamiliesandsubgroups,suchastheRomance 2.1. Cognates J languages and the Slavic languages (Mann and Yarowsky, According to the Oxford English Dictionary Online1 ac- 3 2001; Schafer and Yarowsky, 2002). Although there have 1 cessedonFebruary2,2012,‘cognate’isdefinedas: “...Of been some studies investigating using Swadesh lists of words: Comingnaturallyfromthesameroot,orrepresent- words to identify Turkic language groups and loanword ] ing the same original word, with differences due to sub- L candidates (van der Ark et al., 2007), we are not aware of sequentseparatephoneticdevelopment; thus, Englishfive, C anyworkyetonautomatedtranslationlexiconinductionfor Latinquinque,Greekπεντε,arecognatewords,represent- . theTurkiclanguages. s ing a primitive *penke.” As this definition shows, shared c However,theTurkiclanguagesarewellsuitedtoexploring geneticoriginiskeytothenotionofcognateness. Aword [ suchtechnologysincetheyexhibitmanydiverselexicalre- isonlyconsideredcognatewithanotherifbothwordspro- lationshipsbothwithinfamilyandtolanguagesoutsideof 1 ceedfromthesameancestor. Nonetheless,inlinewiththe v thefamilythroughloanwords. FortheTurkiclanguages,it conventionsofpreviousresearchincomputationallinguis- 1 is prudent to leverage both cognate information and other 9 etymologicalinformationwhenautomatingtranslationlex- tics, we set a broader definition. We use the word ‘cog- 1 nate’todenote,asin(Kondrak,2001): “...wordsindiffer- iconinduction. However,wearenotawareofanycorpora 3 entlanguagesthataresimilarinformandmeaning,without for the Turkic languages that have been annotated for this 0 makingadistinctionbetweenborrowedandgeneticallyre- . informationinasuitablewaytosupportautomatictransla- 1 latedwords;forexample,English‘sprint’andtheJapanese tion lexicon induction. Moreover, performing the annota- 0 borrowing‘supurinto’areconsideredcognate,eventhough 5 tionisnotstraightforwardbecauseoftherangeofrelation- thesetwolanguagesareunrelated.” Thesebroadercriteria 1 ships that exist. In this paper, we lay out a methodology are motivated by the ways scientists develop and use cog- : for performing this annotation that is intended to balance v nate identification algorithms in natural language process- i the amount of effort expended by the annotators with the X ing(NLP)systems. Forcross-lingualapplications, thead- utilityoftheannotationsforsupportingcomputationallin- vantageofsuchtechnologyistheabilitytoidentifywords r guisticsresearch. a forwhichsimilarityinmeaningcanbeaccuratelyinferred fromsimilarityinform; itdoesnotmatterifthesimilarity 2. MainAnnotationSystem in form is from strict genetic relationship or later borrow- We obtained the dictionary of the Turkic languages ing. (O¨ztopc¸uetal.,1996). Onesectionofthisdictionarycon- However, not every pair of apparently similar words will tains 1996 English glosses and for each English gloss a be annotated as cognate. For them to be considered cog- correspondingtranslationinthefollowingeightTurkiclan- nates, the differences in form between them must meet a guages:Azerbaijani,Kazakh,Kyrgyz,Tatar,Turkish,Turk- threshold of consistency within the data. We will explain men, Uyghur, and Uzbek. Table 1 shows an example for thedefinitionsandrulesfortheannotatorstofollowinor- the English gloss ‘alive.’ When a language has an official dertoestablishsuchathreshold. Latinscript,thatscriptisused. Otherwise,thedictionary’s First, we elaborate on how our notion of cognate differs transliterationisshowninparentheses. Ourannotationsys- from that of strict genetic relation. At a high level, there tem is to annotate each Turkic word with a two-character aretwocasestoconsider: A)wherethewordsinvolvedare code. Thefirstcharacterwillbeanumberindicatingwhich nativeTurkicwords, andB)wherethewordsinvolvedare words are cognate with each other and the second charac- terwillindicateetymologicalinformation. Subsection2.1. 1http://www.oed.com/view/Entry/35870?redirectedFrom=cognate ThispaperwaspublishedinProceedingsoftheFirstWorkshoponLanguageResourcesandTechnologiesforTurkicLanguagesat LREC2012.RequiredPublisherStatement:PublishedwiththepermissionofELRA.Thispaperwaspublishedwithintheproceedings oftheLREC2012Conference.(cid:13)c1998-2012ELRA-EuropeanLanguageResourcesAssociation.Allrightsreserved. sharedloanwordsfromnon-Turkiclanguages. Withincase 2.2. Etymology A,therearetwocasestoconsider: (A1)geneticcognates; The second character in a word’s annotation indicates a and(A2)intra-familyloans. Table2showsanexampleof conjecture about etymological origin, e.g., T for Turkic. case A1. This example shows the English gloss ‘one’ for The decision to annotate word origin is motivated by its all eight Turkic languages, descended from the same pos- value for facilitating the development of technology for tulatedform,*bir,inProto-Turkic(Ro´na-Tas,2006). Case cross-language lookup of unknown forms. We therefore A1 is the strict definition of ‘cognate,’ and these are to be take a practical approach, balancing the value of the an- annotatedascognate. notationforthispurposewiththeamountofeffortrequired CaseA2isforintra-familyloans,i.e.,awordofultimately to perform the annotation. We have created the following Turkic origin borrowed by one Turkic language from an- codeforannotatingetymology: other Turkic language. These cases, contrary to the strict T Turkic origin. This includes compound forms and af- definition, are to be marked as cognate in our system. An fixed forms whose constituents are all Turkic. For example is the modern Turkish neologism almas¸ ‘alterna- example, the Turkmen for ‘manager’, y´olbas¸c¸y, is tion, permutation’, incorporated from the Kyrgyz (almas¸) marked T because its compound base, y´ol with bas¸, ‘change’ (Tu¨rk Dil Kurumu, 1942). While rare, it is used andaffix-c¸yareallTurkicinorigin. todayinTurkishscholarlyliteraturetodescribeconceptsin areassuchasmathematicsandbotany. Processinggenetic A Arabic origin, to include words borrowed indirectly cognates(caseA1)andintra-familyloans(caseA2)differ- through another language such as Persian. For ex- ently would have little impact on the success of a cross- ample, the word in every Turkic language for ‘book’ dictionary lookup system. In fact, accounting for the dif- is marked A for all eight Turkic languages. Because ferencemightlimittheefficacyofsuchasystem. Also,the variations on the Arabic form /kita:b/ exist in every timedepthofintra-Turkicborrowingsmaybecenturiesor Turkiclanguage,inPersian,andinotherlanguagesof mere decades. The more distant the borrowing the more theIslamicworld,itisdifficulttoteaseouttheword’s difficult it will be for annotators to distinguish between trajectoryintoalanguagesuchasKyrgyz. Theburden cases A1 and A2. Hence, instances of case A2 are to be ofresearchingthesefine distinctionsisnotplacedon annotatedascognateinoursystem.2 theannotator,asexplainedbelow. Case B is for situations of shared loanwords, where the source of the words is ultimately non-Turkic. There are P Persianorigin,notincludingArabicwordspossiblybor- three subcases: (B1) loanwords borrowed from the same rowed through Persian. An example is the word for non-Turkic language; (B2) loanwords borrowed from dif- ‘color’ in many Turkic languages, from the Persian ferent non-Turkic languages, but of the same ultimate ori- /ræng/. gin;and(B3)loanwordsofnon-Turkicoriginborrowedvia R borrowed from Russian, including words that are ulti- anotherTurkiclanguage. matelyofFrenchorigin. Table3showsanexampleofcaseB1,theword‘book,’bor- rowed from Arabic in all eight Turkic languages. Table 4 F French origin, not including ultimately French words showsanexampleofcaseB2, theword‘ballet,’ borrowed borrowedfromRussian. DirectFrenchloansoccural- fromRussianinallcasesexceptTurkish,whereitwasbor- mostexclusivelyinTurkish. Anexampleistheword rowed directly from the French. Table 5 shows an exam- for‘station’inTurkish,istasyon. pleofcaseB3: theword‘benefit’inKyrgyzwasborrowed E Englishorigin. Forexamplethewordfor‘basketball’in mostlikelythroughUzbekorChaghatay(Kirchner,2006), everylanguage. but the Uzbek word was borrowed from Persian, and ul- timately from Arabic. It is difficult and time-consuming I Italianorigin. Usuallyofimportanceonlytospecificdo- forannotatorstomakethesefine-graineddistinctions. And mainsinTurkish. again, for computational processing, such distinctions are notexpectedtobehelpful. Hence,allofcasesB1,B2,and G Greek origin. For example, the word in Azerbai- B3aretobeannotatedascognateinoursystem. jani,Turkish,Turkmen,Uyghur,andUzbekfor‘box’ comesfromtheGreekκoυτı. Recallthatallourannotationsaretwo-charactercodes;the firstcharacterisanumberfromonetoeightindicatingwhat C Chineseorigin,usuallyMandarinandusuallyofimpor- wordsarecognatewitheachother. Table6showsthefirst tance only to Uyghur. An example is the word for characteroftheannotationsfortheexamplefromTable1. ‘mushroom’inUyghur,(mogu). Thewordsmarkedwith1arecognatewitheachotherand thewordsmarked2arecognatewitheachother. Q unknownorinconclusiveorigin. Thecarefulreaderwillhavenoticedthatthereisanincon- sistencyinthatwordsofultimatelyArabicoriginborrowed 2Forsimilarreasons,falsecognatesmaybeannotatedascog- through Persian are marked as A, but words of ultimately nate if the annotator does not have readily available knowledge FrenchoriginborrowedthroughRussianaremarkedasR. indicatingthattheyarefalsecognates.Althoughthisisapotential limitation of our system, it is not clear how to distinguish false There are two reasons for this. The first is annotator ef- cognatesfromtruecognateswithoutsignificantadditionalanno- ficiency. Making the judgment that a word is ultimately tationexpense. of Arabic origin is much easier than having to figure out whether it was borrowed from Arabic or indirectly from Table 8 shows the contingency matrix for annotating the Persian. FortheRussian/Frenchsituation,thedistinctionis 400entries.4 FromTable8itisimmediatethatagreement mucheasiertomake.Tobeginwith,theRussianloanwords is substantial, and when there is disagreement it is largely occur almost exclusively in former USSR languages and forthedifficultcasesofinconclusiveoriginandthemulti- theFrenchloanwordsoccuralmostexclusivelyinTurkish. language exceptions: Q, X, V, and N. We measured inter- Also,theorthographyoftengivesclearcuesformakingthis annotator agreement using Cohen’s Kappa (Cohen, 1960) distinction, as Russian loanwords consistently retain char- and found Kappa = 0.5927 (95% CI = 0.5192 to 0.6662). acteristicallyRussianletters. If we restrict attention to only the instances where neither of the annotators marked an inconclusive origin or multi- 2.2.1. Multi-LanguageExceptions language exception, then Kappa is 0.9216, generally con- Wealsodefineothercodesthatcategorizecertaincomplex sidered high agreement. This shows that our annotation words that do not fall into any of the categories described systemisfeasibleforuseandalsoshowsthattoimprovethe in subsection 2.2.. Other etymological annotation studies, systemwemightfocuseffortsonfindingwaystoincrease suchastheLoanwordTypologyprojectanditsWorldLoan- agreementontheannotationoftheexceptionalcases(Q,X, word Database (Haspelmath and Tadmor, 2009), have in- V,andN). structedlinguiststopassoversuchcomplexwordsandop- tionallyflagthemas“containsaborrowedbase,” etc. Our T A P R F Q X V N annotationsystemrequiresthatthesewords,whicharevery T 160 8 2 0 0 3 10 6 1 common in Turkic languages, be annotated according to A 0 56 2 6 0 1 0 1 0 morefinegrainedcategories. P 0 0 31 0 0 0 1 0 0 Thefollowingareourmulti-languageexceptioncodes: R 0 0 0 32 1 0 0 0 0 X Compoundwordswheretheconstituentsarefromdiffer- F 0 0 0 0 5 0 0 0 0 ent origins. For example, the Tatar word for ‘truck’, Q 12 5 0 2 0 0 2 3 0 (yo¨k mashinası), is to be marked X since it contains X 2 0 1 5 0 0 17 8 0 Russian-origin(mashina),’machine,vehicle’incom- V 0 1 0 0 0 0 0 0 0 pound with Tatar (yo¨k), ‘baggage,cargo.’ In con- N 0 0 1 0 0 0 6 0 1 trast, the Turkish compound word for thunder, go¨k gu¨rlemesi, will be marked T because all of its con- Table8: TableofCountsfortwoannotators’etymological stituentsareTurkish. conjectureson392words.Annotator1’sconjecturesfollow V A verb formed by combining a non-Turkic base with a thehorizontalaxis,andannotator2’sthevertical. Turkicauxiliaryverbordenominalaffix.Forexample, theverb‘torepeat’inAzerbaijani,Tatar,andTurkish, becauseitconsistsofanounborrowedfromtheArabic 4. ConclusionsandFutureWork /takra:r/plusaTurkicauxiliaryverbet-orit-. TheTurkic languagesare apromisingcandidate familyof N Anominalconsistingofanon-Turkicbasebearingone languagestobenefitfromautomatedtranslationlexiconin- or more Turkic affixes, in cases where removing the duction. A necessary step in that direction is the creation affixes results in a form that can plausibly be found of annotated data for cognates and etymology. However, elsewhere in the data or in a loan language dictio- this annotation is not straightforward, as the Turkic lan- nary.Forexample,theKazakhwordfor‘baker,’(naw- guagesexhibitextensiveanddiverseetymologicalrelation- bayshı), is composed of a Persian-origin base, from shipsamongwords. Somedistinctionsaredifficultforan- /na:nva:/, ‘baker’, and a suffix that indicates a per- notatorstomakeandsomeareeasier. Also, somedistinc- sonassociatedwithaprofession,(-shı). TheTurkmen tions are expected to be more useful than others for au- wordfor‘baker,’(c¸o¨rekc¸i),ontheotherhand,willbe tomating cross-lingual applications among the Turkic lan- markedT,becausebothitsbase(c¸o¨rek)andaffix(-c¸i) guages. Wepresentedanannotationmethodologythatbal- areTurkic. ancestheresearcheffortrequiredoftheannotatorwiththe Table 7 shows an example of an entry that has been fully expected value of the annotations. We surveyed and ex- annotatedforbothcognatesandetymology. plainedthewiderangeofthemostimportantrelationships observedintheTurkiclanguagesandhowtoannotatethem. 3. Inter-AnnotatorAgreement Whenwefinishtheannotations,wewouldliketomakethe Wepilot-testedourannotationsystemwithtwoannotators annotated data available as long as it is legal under copy- on400etymologyannotations.3 Bothannotatorshavestud- rightlawsforustodoso. Finally,wehopethatourannota- iedlinguistics. Also,botharenativeEnglishspeakerswith tionsystemandtheassociateddiscussioncanbeusefulfor experiencestudyingorspeakingmultipleTurkiclanguages, other teams that are annotating Turkic resources, and per- Persian,andArabic. Trainingconsistedofstudyingtheau- haps parts of it can be useful for annotating resources for thors’ annotation manual and asking any follow-up ques- otherlanguagefamiliesaswell. tions. Both annotators made approximately 240 annota- tionsperhour. 4WeleftoutcolumnsforEnglish,Greek,Italian,andChinese, 3Table8has392entriesbecausetheannotatorsclaimedeight which were not relevant for the 50 entries (according to unani- entrieshadmultipletranslationsforthesameEnglishgloss. mousagreementofourannotators). 5. References J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46. MartinHaspelmathandUriTadmor. 2009. TheLoanword TypologyprojectandtheWorldLoanwordDatabase. In MartinHaspelmathandUriTadmor,editors,Loanwords in the World’s Languages: A Comparative Handbook, pages1–34,Berlin.WalterdeGruyter. MarkKirchner. 2006. Kirghiz. InLarsJohansonandE´va A´.Csato´,editors,TheTurkicLanguages,pages344–356, NewYork.Routledge. Grzegorz Kondrak. 2001. Identifying cognates by pho- neticandsemanticsimilarity. InProceedingsofthesec- ond meeting of the North American Chapter of the As- sociation for Computational Linguistics on Language technologies,NAACL’01,pages1–8,Stroudsburg,PA, USA.AssociationforComputationalLinguistics. Gideon S. Mann and David Yarowsky. 2001. Multipath translation lexicon induction via bridge languages. In Proceedings of the second meeting of the North Amer- icanChapteroftheAssociationforComputationalLin- guistics on Language technologies, NAACL ’01, pages 1–8, Stroudsburg, PA, USA. Association for Computa- tionalLinguistics. Kurtulus¸ O¨ztopc¸u, Zhoumagaly Abuov, Nasir Kambarov, and Youssef Azemoun. 1996. Dictionary of the Turkic Languages. Routledge,NewYork. Andra´s Ro´na-Tas. 2006. The reconstruction of Proto- Turkic and the genetic question. In Lars Johanson and E´vaA´.Csato´,editors,TheTurkicLanguages,pages67– 80,NewYork.Routledge. Charles Schafer and David Yarowsky. 2002. Inducing translation lexicons via diverse similarity measures and bridge languages. In proceedings of the 6th conference on Natural language learning - Volume 20, COLING- 02, pages 1–7, Stroudsburg, PA, USA. Association for ComputationalLinguistics. Tu¨rk Dil Kurumu. 1942. Felsefe ve Gramer Terimleri. CumhuriyetBasımevi,Istanbul. Rene´ van der Ark, Philippe Mennecier, John Nerbonne, and Franz Manni. 2007. Preliminary identification of language groups and loan words in Central Asia. In ProceedingsoftheRANLPWorkshoponComputational Phonology,pages12–20,Borovetz,Bulgaria. Azerbaijani Kazakh Kyrgyz Tatar Turkish Turkmen Uyghur Uzbek canlı (tiri) (tu¨ru¨u¨) (janlı) canlı diri (tirik) tirik Table1: Exampleentryfromtheeight-waydictionaryfortheEnglishgloss‘alive.’ Azerbaijani Kazakh Kyrgyz Tatar Turkish Turkmen Uyghur Uzbek bir (bir) (bir) (ber) bir bir (bir) bir Table2: ExampleofcaseA1: geneticcognates. TheEnglishglossis‘one.’ Azerbaijani Kazakh Kyrgyz Tatar Turkish Turkmen Uyghur Uzbek kitab (kitap) (kitep) (kitap) kitap kitap (kitab) kitob Table3: ExampleofcaseB1: loanwordsborrowedfromthesamenon-Turkiclanguage. TheEnglishglossis‘book.’ Azerbaijani Kazakh Kyrgyz Tatar Turkish Turkmen Uyghur Uzbek balet (balet) (balet) (balet) bale balet (balet) balet Table4: ExampleofcaseB2: loanwordsborrowedfromdifferentnon-Turkiclanguages,butofthesameultimateorigin. TheEnglishglossis‘ballet.’ Azerbaijani Kazakh Kyrgyz Tatar Turkish Turkmen Uyghur Uzbek fayda (payda) (payda) (fayda) fayda pey´da (payda) foyda Table5: ExampleofcaseB3: loanwordsofnon-TurkicoriginborrowedviaanotherTurkiclanguage. TheEnglishglossis ‘benefit.’ Azerbaijani Kazakh Kyrgyz Tatar Turkish Turkmen Uyghur Uzbek canlı (tiri) (tu¨ru¨u¨) (janlı) canlı diri (tirik) tirik 1 2 2 1 1 2 2 2 Table6: Examplewithcognatesannotated. Azerbaijani Kazakh Kyrgyz Tatar Turkish Turkmen Uyghur Uzbek stul (orındıq) (orunduk) (urındık) sandalye stul (orunduq) kursi 1R 2T 2T 2T 3A 1R 2T 4A Table7: Examplewithcompleteannotationbothforcognatesandetymology. TheEnglishglosshereis‘chair.’