Confidence Measures for Alignment and for Machine Translation Yong Xu To cite this version: Yong Xu. Confidence Measures for Alignment and for Machine Translation. Signal and Image Pro- cessing. Université Paris Saclay (COmUE), 2016. English. NNT: 2016SACLS270. tel-01399222 HAL Id: tel-01399222 https://theses.hal.science/tel-01399222 Submitted on 18 Nov 2016 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. NNT : 2016SACLS270 T HESE DE DOCTORAT DE L’UNIVERSITE PARIS-SACLAY PREPAREE A L’UNIVERSITE PARIS-SUD ECOLE DOCTORALE N° 580 Sciences et Technologies de l’Information et de la Communication (STIC) Spécialité de doctorat : Informatique Par M. Yong Xu Confidence Measures for Alignment and for Machine Translation Titre en français Mesures de Confiance pour l’Alignement et pour la Traduction Automatique Thèse présentée et soutenue à Orsay, le 26/09/2016 Composition du Jury : M. ZWEIGENBAUM Pierre Directeur de Recherche, LIMSI-CNRS Président du jury M. LANGLAIS Philippe Professeur, Université de Montréal Rapporteur M. KRAIF Olivier Maître de Conférences, Université Grenoble Alpes Rapporteur M. ESTÈVE Yannick Professeur, Université du Maine Examinateur M. HUET Stéphane Maître de Conférences, Université d’Avignon et des Pays de Vaucluse Examinateur M. YVON François Professeur, Université Paris-Sud Directeur de thèse 2 献给徐维平,熊华中,徐文文,以及周益娴。 i ii Acknowledgements The graduate school was for sure the most challenging life experience I have been through. I have lost plenty of hair during the last three years. Now that I look back, however, I am mostly amazed by the process, and realize how grateful I am towards a lot of people. IowealottoProfessorFrançoisYvon,mythesissupervisor. WhenIfirstcametoFrance as an undergraduate student, Prof. Yvon kindly offered me an internship opportunity to work on statistical machine translation. I was so shocked by the entire methodology that I spent two other Master internships in his group to grasp a better understanding, during which I realized the importance of alignment, and decided to work on it for a PhD thesis. I wasn’t so well prepared, though, on both technical and mental aspects, leading to frustrations of Prof. Yvon during the four years. However, he has always kept his cool, always been supportive and encouraging, and has taught me, with great patiences, literally everything about research: how to formulate scientific ideas, how to design sound experiments, how to write clear papers, how to prepare a presentation and make slides, even how to correctly pronounce some words. For all these reasons, I have always kept high respect to Prof. Yvon. He is certainly the best thesis supervisor I can imagine of. And I consider myself tremendously lucky to be his student. I sincerely thank the members of my thesis jury. In particular, I am very grateful to the two reviewers. Professor Phillippe Langlais is a recognized expert on the alignment problem, with a deep understanding on both theoretical and practical aspects. Doctor Olivier Kraif provided excellent analyses of my work from the point of view of corpus linguistics. Their reports were both very encouraging, both filled with insightful comments and suggestions. I thank Professor Yannick Estève and Doctor Stéphane Huet for having read carefully the manuscript, for the inspirational discussions during the defence, and for the warm encouragements. Doctor Pierre Zweigenbaum has honoured me by presiding the thesis jury. His questions and suggestions were very valuable. He has also made the effort of manually correcting things in the manuscript, which I appreciate fondly. We were three in the interns’ office in the summer of 2012: Nicolas, Benjamin and me. Khanh joined a bit later. And the four of us all began the graduate school together. I consider it a great privilege to pass the trajectory with such wonderful fellows, from whom I have learned a lot. The permanent members in the building, in particular Guillaume, Aurélien, Hélène, Marianna and Éric, are always ready to answer questions and provide iii iv suggestions. I thank particularly Alexandre for tolerating me being a very silent office mate and asking naive questions, Tomas for explaining to me twice the design choices of his excellent Wapiti implementation of the CRFs. Other PhD students, both "olders" such as Nadi, Thiago, Haison and "youngers" such as Elena, Julia, Mattieu, Laurianne and Rachel, have together kept the TLP group a pleasant work environment. I apologize for being a bit drunk after one of our night drinks. Special thanks to Li, to whom I have asked numerous questions and have learned a lot. I have stayed in France for seven years now. This is perhaps the period that I have learned the most. I thank France for having a beautiful territory and a nice people. The experiences in the French society have taught me to grow mature. Back in China, my father, mother and sister have always been warmly supportive, to which I can’t express enough my gratitude. Thanks to Yixian, the kind, beautiful, joyful and generous soul mate of mine. Abstract Incomputationallinguistics,therelationbetweendifferentlanguagesisoftenstudiedthrough automatic alignment techniques. Such alignments can be established at various structural levels. Inparticular,sententialandsub-sententialbitextalignmentsconstituteanimportant source of information in various modern Natural Language Processing (NLP) applications, a prominent one being Machine Translation (MT). Effectivelycomputingbitextalignments,however,canbeachallengingtask. Discrepan- cies between languages appear in various ways, from discourse structures to morphological constructions. Automatic alignments would, at least in most cases, contain noise harmful for the performance of application systems which use the alignments. To deal with this sit- uation, two research directions emerge: the first is to keep improving alignment techniques; the second is to develop reliable confidence measures which enable application systems to selectively employ the alignments according to their needs. Both alignment techniques and confidence estimation can benefit from manual align- ments. Manualalignmentscanbeusedasbothsupervisionexamplestotrainscoringmodels and as evaluation materials. The creation of such data is, however, an important question in itself, particularly at sub-sentential levels, where cross-lingual correspondences can be only implicit and difficult to capture. Thisthesisfocusesonmeanstoacquireusefulsententialandsub-sententialbitextalign- ments. Chapter 1 provides a non-technical description of the research motivation, scope, organization, and introduces terminologies and notation. State-of-the-art alignment tech- niques are reviewed in Part I. Chapter 2 and 3 describe state-of-the-art methods for respec- tively sentence and word alignment. Chapter 4 summarizes existing manual alignments, and discusses issues related to the creation of gold alignment data. The remainder of this thesis, Part II, presents our contributions to bitext alignment, which are concentrated on three sub-tasks. Chapter 5 presents our contribution to gold alignment data collection. For sentence- levelalignment,wecollectmanualannotationsforaninterestingtextgenre: literarybitexts, whichareveryusefulforevaluatingsentencealigners. Wealsoproposeaschemeforsentence alignment confidence annotation. For sub-sentential alignment, we annotate one-to-one word links with a novel 4-way labelling scheme, and design a new approach for facilitating the collection of many-to-many links. All the collected data is released on-line. v vi Improving alignment methods remains an important research subject. We pay special attention to sentence alignment, which often lies at the beginning of the bitext alignment pipeline. Chapter 6 presents our contributions to this task. Starting by evaluating state- of-the-art aligners and analyzing their models and results, we propose two new sentence alignment methods, which achieve state-of-the-art performance on a difficult dataset. The other important subject that we study is confidence estimation. In Chapter 7, we propose confidence measures for sentential and sub-sentential alignments. Experiments show that confidence estimation of alignment links is a challenging problem, and more works on enhancing the confidence measures will be useful. Finally, note that these contributions have been applied to a real world application: the development of a bilingual reading tool aimed at facilitating the reading in a foreign language. Key words: Confidence Measure, Confidence Estimation, Bitext Alignment, Sentence Alignment, Word Alignment, Annotation Scheme, Reference Corpus, Machine Translation Résumé Enlinguistiqueinformatique,larelationentrelanguesdifférentesestsouventétudiéeviades techniques d’alignement automatique. De tels alignements peuvent être établis à plusieurs niveaux structurels. En particulier, les alignements de bi-textes aux niveaux phrastiques et sous-phrastiques constituent des sources importantes d’information pour diverses applica- tions du Traitement Automatique du Language Naturel (TALN) moderne, la Traduction Automatique étant un exemple proéminent. Cependant, le calcul effectif des alignements de bi-textes peut être une tâche com- pliquée. Les divergences entre les langues sont multiples, de la structure de discours aux constructionsmorphologiques.Lesalignementsautomatiquescontiennent,majoritairement, des erreurs nuisant aux performances des applications. Dans cette situation, deux pistes de recherche émergent. La première est de continuer à améliorer les techniques d’alignement. La deuxième vise à développer des mesures de confiance fiables qui permettent aux appli- cations de sélectionner les alignements selon leurs besoins. Les techniques d’alignement et l’estimation de confiance peuvent toutes les deux bénéfi- cier d’alignements manuels. Des alignements manuels peuvent jouer un rôle de supervision pourentraînerdesmodèles,etceluidesdonnéesd’évaluation.Pourtant,lacréationdetelles données est elle-même une question importante, en particulier au niveau sous-phrastique, où les correspondances multilingues peuvent être implicites et difficiles à capturer. Cette thèse étudie des moyens pour acquérir des alignements de bi-textes utiles, aux niveaux phrastiques et sous-phrastiques. Le chapitre 1 fournit une description de nos moti- vations, la portée et l’organisation du travail, et introduit quelques repères terminologiques et les principales notations. L’état-de-l’art des techniques d’alignement est revu dans la Partie I. Les chapitres 2 et 3 décrivent les méthodes respectivement pour l’alignement des phrases et des mots. Le chapitre 4 présente les bases de données d’alignement manuel, et discute de la création d’alignements de référence. Le reste de la thèse, la Partie II, présente nos contributions à l’alignement de bi-textes, en étudiant trois aspects. Le chapitre 5 présente notre contribution à la collecte d’alignements de référence. Pour l’alignement des phrases, nous collectons les annotations d’un genre spécifique de textes : les bi-textes littéraires. Nous proposons aussi un schème d’annotation de confiance. Pour l’alignement sous-phrastique, nous annotons les liens entre mots isolés avec une nouvelle catégorisation,etconcevonsuneapprocheinnovantedesegmentationitérativepourfaciliter vii
Description: