Exploiting Alignment in Multiparallel Corpora for Applications in Linguistics and Language Learning

Thesis presented to the Faculty of Arts and Social Sciences of the University of Zurich for the degree of Doctor of Philosophy by Johannes Graën

Accepted in the spring semester 2018 on the recommendation of the doctoral committee:
Prof. Dr. Martin Volk (main supervisor)
Prof. Dr. Marianne Hundt
Prof. Dr. Stefan Evert

Zurich, 2018

Abstract

This thesis exploits the automatic identification of semantically corresponding units in parallel and multiparallel corpora, which is referred to as alignment. Multiparallel corpora are text collections of more than two languages that comprise reciprocal translations. The contributions of this thesis are threefold:

• First, we prepare a large multiparallel corpus by adding several layers of annotation and alignment. Annotation is first performed on each language individually, while alignment is applied to two or more languages. For the latter case, we use the term multilingual alignment. We show that word alignment on parallel corpora can improve language-specific annotation by means of disambiguation.
• Our second contribution consists in the development and evaluation of prototypical algorithms for multilingual alignment on both the sentence and the word level. As languages vary considerably with regard to how content is realized in sentences and words, multilingual alignment needs to be represented by a hierarchical structure rather than by bidirectional links, which are the prevailing representation of bilingual alignment.
• Third, based on our corpus, we show how word alignment in combination with different types of annotation can be employed to benefit linguists and language learners, among others. All tools developed in the context of this thesis, in particular the publicly available web applications, are driven by efficient database queries on a complex data structure.

Zusammenfassung (Summary)

This dissertation is concerned with the evaluation of semantic correspondence relations in parallel and multiparallel corpora, which are referred to as alignments. Multiparallel corpora are text collections of reciprocal translations between more than two languages. This work comprises three contributions:

• First, the preparation of a large multiparallel corpus by adding several layers of annotation and alignment. While each language is first annotated separately, alignment extends over two or more languages. We refer to the latter case as multilingual alignment. We show that word alignment in parallel corpora can help to improve language-specific annotations by means of disambiguation.
• Second, the development and evaluation of prototypical algorithms for multilingual alignment on both the sentence and the word level. Because languages vary considerably in how content is realized in sentences and words, multilingual alignment requires a hierarchical structure to represent the correspondences, rather than the bidirectional links that are common in bilingual alignment.
• Furthermore, we show how word alignment in combination with different types of annotation can be employed for the benefit of linguists and language learners, among others. All tools developed in the context of this dissertation, in particular the publicly available web applications, are based on efficient database queries on a complex data structure.

Acknowledgments

First of all, I wish to thank my supervisor Martin Volk, who guided me through the initial troubles, gave me room to realize my own ideas and had the necessary confidence in me to finish this big project of mine.
I am likewise indebted to my colleagues in the Sparcling project, Marianne Hundt, Simon Clematide and Elena Callegaro, and our student collaborators Dolores Batinic, Christof Bless and Mathias Müller.¹ Other, smaller contributions are indicated at the beginning of each chapter. I am thankful to Stefan Evert for accepting our invitation to become part of my doctoral committee.

During the last years, I had time to become acquainted with the other members of the Institute of Computational Linguistics. I really appreciate the supportive atmosphere that gave rise to interesting and fruitful discussions, often accompanied by chocolate and delicious pastry. I would like to give special thanks to Noah Bubenhofer, Tilia Ellendorff, Anne Göhring, Samuel Läubli, Laura Mascarell, Jeannette Roth, Gerold Schneider and Don Tuggener. Aside from people in Zurich, I am particularly grateful to the Språkbanken group in Gothenburg for kindly receiving me in spring 2017.

The technical parts described in this thesis were quite demanding. My thanks therefore go to our institute for making it possible to acquire new servers and letting me implement my concept for a new computer cluster that allows for distributed computing. This would not have been possible without the comprehensive support of the technicians of the Department of Informatics: Hanspeter Kunz, Beat Rageth and Enrico Solcà.

The most important acknowledgments tend to come last. I would like to thank my friends in Barcelona, particularly Alicia Burga, Graham Coleman, Gabriela Ferraro and Simon Mille, without whom I probably would not have started a doctorate. I express my gratitude to my family and to my girlfriend Mónica for their unconditional support during all these years.

¹ The Sparcling project was kindly funded by the Swiss National Science Foundation under grant 105215_146781/1.

Contents

Abstract
Zusammenfassung
Acknowledgments
1 Introduction
  1.1 The Sparcling Project
  1.2 Research Questions
  1.3 Outline
2 Parallel Text Corpora
  2.1 Monolingual Corpora
  2.2 Parallel Corpora
  2.3 Multiparallel Corpora
    2.3.1 Our CoStEP Corpus
3 Corpus Annotation
  3.1 Tokenization
    3.1.1 Cutter: Our Flexible Tokenizer for Many Languages
  3.2 Part-of-speech Tagging and Lemmatization
    3.2.1 Interlingual Lemma Disambiguation
    3.2.2 Particle Verbs in German
  3.3 Dependency Parsing
  3.4 Database Corpus
4 Alignment Methods
  4.1 Text Alignment
  4.2 Sentence Alignment
    4.2.1 Approaches
    4.2.2 Evaluating Sentence Alignment
  4.3 Multilingual Sentence Alignment
    4.3.1 Our Approach to Multilingual Sentence Alignment
    4.3.2 Evaluation
  4.4 Word Alignment
    4.4.1 Approaches
    4.4.2 Evaluating Word Alignment
  4.5 Multilingual Word Alignment
    4.5.1 Our Approach to Multilingual Word Alignment
    4.5.2 Evaluation and Outlook
5 Linguistic Applications of Word Alignment
  5.1 Overlap of Lemma Alignment Distributions as Measure for Semantic Relatedness
  5.2 Multilingual Translation Spotting
  5.3 Phraseme Identification
  5.4 Backtranslating Prepositions for Prediction of Language Learners’ Transfer Errors
6 Conclusions
Appendices
Appendix A Linguistic Annotation
  A.1 Universal Dependency Labels
  A.2 Our Hierarchical Alignment Tool
Appendix B Alignment Quality
  B.1 Relation of Alignment Error Rate (AER) and F1-Score
Appendix C Data Sets from Joint Measures
  C.1 Semantic Relatedness of German Particle Verbs
  C.2 Generated Recommendations for Learners of English of Different L1 Backgrounds
    C.2.1 Verb Preposition Combinations
    C.2.2 Adjective Preposition Combinations

Chapter 1 Introduction

Collections of texts, known as text corpora, have long been of interest to linguists. They have served, for instance, lexicographers as a source for dictionary compilation, and linguists and historians for the investigation of language change over time. Parallel text corpora, parallel corpora for short, sometimes also referred to as bitexts, are text collections in two or more languages where textual units, such as articles, sentences or words, in one language correspond to textual units of the same kind in another language. If there is more than one other language, we will refer to these collections as multiparallel corpora.
The term parallel corpus covers only translated material and not collections of texts that are merely connected to each other in terms of content. The latter are called comparable corpora since they describe the same topic in a comparable way without the texts necessarily being translations of each other. Wikipedia articles, as an example of comparable corpora, deal with the same topic in several languages and can either be translations of one or more existing articles, or be written independently of corresponding articles in other languages (for an overview see Plamada and Volk 2013).

The size of typical corpora impedes manual examination and, hence, calls for automatic processing. Natural language processing (NLP) deals with the automated treatment of natural language, predominantly in written form. NLP methods subdivide into rule-based and statistical methods. Approaches combining both paradigms are referred to as hybrid methods. Both types of methods are capable of processing large amounts of textual data in a tiny fraction of the time a human would need to accomplish the same task. Automatic processing typically entails that some results are incorrect. While the main shortcoming of rule-based methods is coverage,¹ statistical methods bring about a task- and tool-specific error rate, that is, each partial result is expected to be incorrect with a known probability, but we do not know which parts are correct and which ones are not.

¹ When dealing with natural language, there will typically be cases that the authors of the rules have not considered or that do not conform to the authors’ intrinsic language model.

A principal motivation for developing new approaches is to achieve lower error rates. Some applications prefer to lower the error rate by excluding samples that are likely to include errors; other applications prefer large quantities of samples provided that the correct ones prevail. This trade-off between quality and quantity is captured by the measures precision (how many of the results that we get are good?) and recall (how many of the good results do we actually find?) in binary classification tasks. A parametrization of the classifier that leads to an improvement in one measure usually entails a decline in the other. The F-Score is a commonly used measure to account for both quality and quantity.

Applications that require high precision and attach less importance to recall are typically concerned with individual examples and less so with statistics.² In corpus linguistics, in particular, these individual examples play a role when it comes to demonstrating the usage of particular word senses or expressions in context. Comprehensive dictionaries typically incorporate sample sentences for different word senses, which are oftentimes selected from corpora. A method for selecting those sentences, called good dictionary examples (GDEX) (Kilgarriff, Husák et al. 2008), ranks matching sentences according to features such as sentence length and rareness of the comprised words. Nonetheless, manual intervention is still needed to assess each dictionary candidate and sort out unsuitable ones.

Good sample sentences also play a role in computer-assisted language learning (CALL) applications. Some of them assist their users, who are language learners, by providing usage examples for a particular word or expression (just like dictionaries do); such examples can also be used in automatically generated exercises. Although the completely automatic selection of sample sentences (see, for instance, Volodina et al. 2012; Pilán et al. 2016) always carries the risk of error (i.e., choosing a bad sentence, which confuses the learner instead of helping her), manual intervention is not feasible in such applications unless they rely on a list of precompiled exercises, which contradicts the principle of automating this task.
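
The precision and recall measures introduced above, and the F-Score that combines them, can be made concrete with a small sketch. The following Python snippet is an illustration only: the function name `precision_recall_f`, the `beta` parameter and the toy alignment links are assumptions introduced here and are not taken from the tools described in this thesis. It uses the standard definition of the F-Score as the weighted harmonic mean of precision and recall.

```python
def precision_recall_f(predicted, gold, beta=1.0):
    """Return (precision, recall, F_beta) for a set of predictions.

    predicted: items returned by the (hypothetical) classifier
    gold:      items that are actually correct
    beta:      weight of recall relative to precision (1.0 gives F1)
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    # Convention chosen for this sketch: an empty set yields 0.0.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0

    if precision + recall == 0.0:
        return precision, recall, 0.0

    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Hypothetical word alignment links (source token index, target token index).
gold = {(0, 0), (1, 2), (2, 1), (3, 3)}

# A cautious aligner returns few links, all of them correct.
cautious = {(0, 0), (3, 3)}

# A greedy aligner returns all correct links plus as many wrong ones.
greedy = {(0, 0), (1, 2), (2, 1), (3, 3), (1, 1), (2, 2), (3, 2), (0, 1)}

for name, links in (("cautious", cautious), ("greedy", greedy)):
    p, r, f = precision_recall_f(links, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy setting the cautious aligner reaches a precision of 1.0 at a recall of 0.5, while the greedy one reaches a recall of 1.0 at a precision of 0.5. Both obtain the same F1 value of about 0.67, which illustrates why the two measures need to be reported together when an application favors one over the other.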

² The contrary is the case for applications like machine translation, which learn generalized principles from large amounts of data. Errors, as long as they are not systematic, are simply smoothed out.

We address the inherent problem of errors in linguistic data that is processed with statistical methods by combining several layers of statistically generated data in parallel corpora, one of which typically is word alignment between the languages in question. Word alignment refers to the technique of automatically identifying corresponding words (i.e., the actual tokens) in corresponding sentences of different languages. Corresponding sentences, in turn, are automatically identified