Training Machine Translation for Human Acceptability

Xingyi Song
Department of Computer Science
University of Sheffield

PhD Thesis
February 2015

Acknowledgements

Finally, I am in a position to submit the final version of my thesis. I would like to express my special gratitude to my supervisor, Lucia Specia, as well as my co-supervisor, Trevor Cohn. Without my supervisors I would never have finished this thesis. Secondly, I would like to thank my examiners for their helpful suggestions and their corrections of my poor English grammar. Also, I would like to thank all my family and friends, especially my wife, for their great patience during my thesis writing. God bless you all :)

Abstract

Discriminative training, a.k.a. tuning, is an important part of Statistical Machine Translation. This step optimises the weights of the several statistical models and heuristics used in a machine translation system, in order to balance their relative effect on the translation output. Different weights lead to significant changes in the quality of translation outputs, and thus selecting appropriate weights is of key importance.

This thesis addresses three major problems with current discriminative training methods in order to improve translation quality. First, we design more accurate automatic machine translation evaluation metrics that correlate better with human judgements. An automatic evaluation metric is used in the loss function in most discriminative training methods; however, which metric is best for this purpose is still an open question. In this thesis we propose two novel evaluation metrics that achieve better correlation with human judgements than the current de facto standard, the BLEU metric. We show that these metrics can improve translation quality when used in discriminative training.

Second, we design an algorithm to select sentence pairs for training the discriminative learner from large pools of freely available parallel sentences. These resources tend to be noisy and include translations of varying degrees of quality and suitability for the translation task at hand, especially if obtained using crowdsourcing methods. Nevertheless, they are crucial when professionally created training data is scarce or unavailable. There is very little previous research on data selection for discriminative training. Our novel data selection algorithm requires neither knowledge of the test set nor decoding outputs, and is thus more generally useful and efficient. Our experiments show that with this data selection algorithm, translation quality consistently improves over strong baselines.

Finally, the third component of the thesis is a novel weighted ranking-based optimisation algorithm for discriminative training. In contrast to previous approaches, this technique assigns a different weight to each training instance according to its reachability and its relationship to the test sentence being decoded, a form of transductive learning. Our experimental results show improvements over a modern state-of-the-art method across different language pairs.

Overall, the proposed approaches lead to better translation quality when compared to strong baselines in our experiments, both in isolation and when combined, and can be easily applied to most existing statistical machine translation approaches.
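Note on notation: the weights tuned by discriminative training are those of the standard log-linear model of phrase-based SMT. The following is a sketch in conventional notation, not equations quoted from this thesis. Given a source sentence $f$, decoding selects

\[
\hat{e} = \arg\max_{e} \sum_{k=1}^{K} w_k \, h_k(e, f)
\]

where the $h_k$ are feature functions such as language model and translation model scores, and the $w_k$ are the weights being tuned. Tuning searches for the weight vector that optimises an automatic evaluation metric, such as BLEU, over a development set; this is why the choice of metric (Chapter 3) and of development data (Chapter 4) directly affects translation quality.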
Contents

List of Figures
List of Tables

1 Introduction
  1.1 Objectives and Scope
  1.2 Research Contributions
  1.3 Thesis Outline

2 Review of SMT Discriminative Training
  2.1 SMT and Phrase-based Models
    2.1.1 Language Model
    2.1.2 Translation Model
    2.1.3 Decoding
  2.2 SMT Discriminative Training
    2.2.1 Maximum Likelihood Training
    2.2.2 Minimum Error Rate Training
    2.2.3 Perceptron and Margin-based Approaches
    2.2.4 Ranking-based Optimisation
  2.3 Oracle Selection and Related Training Algorithms
  2.4 SMT Evaluation Metrics
    2.4.1 Word Error Rate Metrics
    2.4.2 N-gram-based Metrics
    2.4.3 Metrics with Shallow Linguistic Information
    2.4.4 Trained Metrics
  2.5 Development Data Selection
    2.5.1 Development Data Selection with Test Set
    2.5.2 Development Data Selection without Test Set
  2.6 Summary

3 Automatic Evaluation Metrics with Better Human Correlation
  3.1 Regression and Ranking-based Evaluation
    3.1.1 Model
    3.1.2 ROSE Features
    3.1.3 Training
  3.2 BLEU Deconstructed
    3.2.1 Limitations of the BLEU Metric
    3.2.2 Simplified BLEU
  3.3 Experiments with ROSE and SIMPBLEU
    3.3.1 Document-level Evaluation
    3.3.2 Sentence-level Evaluation
  3.4 SIMPBLEU for Discriminative Training
  3.5 SIMPBLEU in WMT Evaluation
  3.6 Summary

4 Development Data Selection for Unseen Test Sets
  4.1 Introduction
  4.2 LA Selection Algorithm
  4.3 Experimental Settings
    4.3.1 French-English Data
    4.3.2 Chinese-English Data
  4.4 Results
    4.4.1 Selection by Sentence Length
    4.4.2 Selection by LA Features
    4.4.3 Selection by LA Algorithm
    4.4.4 Diversity Filter
    4.4.5 Machine Learned Approach
    4.4.6 Effect of Development Corpus Size
  4.5 Summary

5 Weighted Ranking Optimisation
  5.1 Weighted Ranking Optimisation – Global
  5.2 Weighted Ranking Optimisation – Local
  5.3 Experiments and Results
    5.3.1 Cross-domain Experiments
    5.3.2 WRO with LA Selection and SIMPBLEU
    5.3.3 Summary

6 Conclusions
  6.1 Future Work

References

List of Figures

2.1 Example of the decoding process
2.2 Example of WER
2.3 Example of TER
2.4 Example of n-gram precision
2.5 Example of METEOR alignment
3.1 Smoothed BLEU Kendall's τ with smoothing values from 0.001 to 100
4.1 Accuracy of development selection algorithms with increasing sizes of development corpora
4.2 Standard deviation of the accuracy for the development selection method with increasing sizes of development corpora
5.1 Example of PRO training samples, where the x and y axes represent the feature values of the two translations