Phrase Alignment Models for Statistical Machine Translation

by John Sturdy DeNero

B.S. (Stanford University) 2002
M.A. (Stanford University) 2002

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY OF CALIFORNIA, BERKELEY

Committee in charge:
Professor Dan Klein, Chair
Professor Stuart Russell
Professor Tom Griffiths
Professor David Chiang

Fall 2010

Copyright © 2010 by John Sturdy DeNero

Abstract

Phrase Alignment Models for Statistical Machine Translation

by John Sturdy DeNero

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor Dan Klein, Chair

The goal of a machine translation (MT) system is to automatically translate a document written in some human input language (e.g., Mandarin Chinese) into an equivalent document written in an output language (e.g., English). This task—so simple in its specification, and yet so rich in its complexities—has challenged computer science researchers for 60 years. While MT systems are in wide use today, the problem of producing human-quality translations remains unsolved.

Statistical approaches have substantially improved the quality of MT systems by effectively exploiting parallel corpora: large collections of documents that have been translated by people, and therefore naturally occur in both the input and output languages. Broadly characterized, statistical MT systems translate an input document by matching fragments of its contents to examples in a parallel corpus, and then stitching together the translations of those fragments into a coherent document in an output language.

The central challenge of this approach is to distill example translations into reusable parts: fragments of sentences that we know how to translate robustly and are likely to recur. Individual words are certainly common enough to recur, but they often cannot be translated correctly in isolation. At the other extreme, whole sentences can be translated without much context, but rarely repeat, and so cannot be recycled to build new translations.

This thesis focuses on acquiring translations of phrases: contiguous sequences of a few words that encapsulate enough context to be translatable, but recur frequently in large corpora. We automatically identify phrase-level translations that are contained within human-translated sentences by partitioning each sentence into phrases and aligning phrases across languages. This alignment-based approach to acquiring phrasal translations gives rise to statistical models of phrase alignment.

A statistical phrase alignment model assigns a score to each possible analysis of a sentence-level translation, where an analysis describes which phrases within that sentence can be translated and how to translate them. If the model assigns a high score to a particular phrasal translation, we should be willing to reuse that translation in new sentences that contain the same phrase. Chapter 1 provides a non-technical introduction to phrase alignment models and machine translation. Chapter 2 describes a complete state-of-the-art phrase-based translation system to clarify the role of phrase alignment models. The remainder of this thesis presents a series of novel models, analyses, and experimental results that together constitute a thorough investigation of phrase alignment models for statistical machine translation.
Chapter 3 presents the formal properties of the class of phrase alignment models, including inference algorithms and tractability results. We present two specific models, along with statistical learning techniques to fit their parameters to data. Our experimental evaluation identifies two primary challenges to training and employing phrase alignment models, and we address each of these in turn.

The first broad challenge is that generative phrase models are structured to prefer very long, rare phrases. These models require external pressure to explain observed translations using small, reusable phrases rather than large, unique ones. Chapter 4 describes three Bayesian models and a corresponding Gibbs sampler to address this challenge. These models outperform the word-level models that are widely employed in state-of-the-art research and production MT systems.

The second broad challenge is structural: there are many consistent and coherent ways of analyzing a translated sentence using phrases. Long phrases, short phrases, and overlapping phrases can all simultaneously express correct, translatable units. However, no previous phrase alignment models have leveraged this rich structure to predict alignments. We describe a discriminative model of multi-scale, overlapping phrases that outperforms all previously proposed models.

The cumulative result of this thesis is to establish model-based phrase alignment as the most effective approach to acquiring phrasal translations. This conclusion is surprising: it overturns a long-standing result that heuristic methods based on word alignment models provide the most effective approach. This result is also fundamental: the models proposed in this thesis address a general, language-independent alignment problem that arises in every state-of-the-art statistical machine translation system in use today.

Professor Dan Klein, Chair

Acknowledgements

This dissertation is a direct result of my having spent five years in a truly exceptional research group at Berkeley, working within an exciting global research community. Every person in our research group and many people from around the world have influenced my work. I have highlighted the largest contributions below, but I am grateful for them all.

I consider myself tremendously lucky to have been advised by Dan Klein. At Berkeley, Dan created a research environment that was at the same time thrilling, productive, light-hearted, motivating, cooperative, and fun. Dan's energetic commitment to research and mentorship has rubbed off on each one of us who has had the opportunity to work with him. He has taught me so many things: how to structure a research project, how to evaluate an idea, how to find the flaws in published work, how to find issues with my own ideas, how to teach, how to iterate, how to experiment, how to find a job, and even how to navigate the cocktail hour of an academic conference.

Dan is a brilliant problem solver with a unique grasp of the relationship between natural language and computation. These qualities certainly contributed to my successes as a graduate student and to the contents of this thesis. But I'll most fondly remember Dan's generosity—he has given me far more of his time and guidance than I could have expected or asked for. He created excellent opportunities for me, prepared me to take advantage of them, and coached me through the trickiest parts. He's also quite the matchmaker: at least between students and problems.
He introduced me to the problem of phrase alignment in my first six weeks of graduate school, and I've been smitten ever since. Graduate school has been such a highlight of my life; I can't thank Dan enough for making it so.

I have also worked with an amazing set of people at Berkeley. My first paper with Dan Gillick and James Zhang taught me how much more I could learn and accomplish working with other smart people than by myself. Alex Bouchard-Côté taught me virtually all that I know about sampling and Bayesian statistics. (Alex, please forgive me for the notational shortcuts I've taken in this thesis.) Aria Haghighi not only taught me how to write down a principled statistical model (as my first GSI), but also how to hack. The set of ICML and NIPS papers I read is entirely defined by searching the author list for the name Percy Liang. Adam Pauls dared to work with me on all manner of translation projects when everyone else abandoned me for other (less frustrating) problems. David Burkett has become my new role model for understated talent. John Blitzer taught me to talk about computer science just for the fun of it. David Hall taught me that CS research is so much better when properly engineered. Mohit Bansal taught me to ask more questions. All of you, along with the rest of the Berkeley community, have kept me smiling for many years.

I would also like to thank my many external collaborators. David Chiang has been an excellent technical mentor who demonstrated to me how to conduct a truly thorough experimental investigation. Kevin Knight and Daniel Marcu continue to ensure that machine translation is an exciting research area in which to work and constantly draw new talented people into the field. Chris Callison-Burch keeps identifying new translation-related problems for us to solve. I enjoyed working with my collaborators at Google Research so much that I've come back for more.

My wife, Jessica Wan, has been immensely patient and supportive through the many long nights that led up to this dissertation. I think she now knows more NLP buzzwords ("Dirichlet process") and names of researchers ("Sharon Goldwater") than most second-year graduate students in the field. The rest of my family has also earned my deepest gratitude for continually trying to understand what it is that I do, even if they can't quite fathom why I do it.

Five years is a long time. My five years at Berkeley changed my perspective, my interests, and my understanding of the world in profound and lasting ways. I'm already nostalgic for the great times and great company I found at Cal. Thank you all who contributed to my journey.

To my wife Jessica, for always complimenting and complementing me perfectly.

Contents

1 Introduction
  1.1 Statistical Machine Translation
  1.2 The Task of Translating Sentences
    1.2.1 Learning from Example Translations
    1.2.2 Translating New Sentences
    1.2.3 Evaluating System Output
  1.3 The Role of Alignment Models
  1.4 Word Alignment and Phrase Alignment
  1.5 Contributions of this Thesis

2 Phrase-Based Statistical Machine Translation
  2.1 Notation for Indexing Sentence Pairs
  2.2 Training Pipeline for Phrase-Based MT
    2.2.1 Phrasal Model Representation
    2.2.2 Training Data
    2.2.3 Phrase Pair Scoring
    2.2.4 Tuning Translation Models
    2.2.5 Selecting Translations
  2.3 Baseline Phrase Pair Extraction and Scoring