The Regression Model of Machine Translation
by
Mehmet Ergun Biçici
Dissertation submitted to the
Department of Computer Engineering
for fulfillment of the requirements for the degree of
Doctor of Philosophy
at
KOÇ UNIVERSITY
August 2011
Copyright © Mehmet Ergun Biçici, 2011. All rights reserved.
Thesis Supervisor: Deniz Yuret
August 16, 2011
Koç University
Graduate School of Sciences and Engineering
This is to certify that I have examined this copy of the dissertation
for the degree of Doctor of Philosophy by
Mehmet Ergun Biçici
and have found that it is complete and satisfactory in all respects,
and that any and all revisions required by the final
examining committee have been made.
Committee Members:
Date: August 16, 2011
The Regression Model of Machine Translation
by
Mehmet Ergun Biçici
Dissertation submitted to the Department of Computer Engineering
for fulfillment of the requirements for the degree of
Doctor of Philosophy
Abstract
Machine translation is the task of automatically finding the translation of a
source sentence in the target language. Statistical machine translation (SMT)
uses parallel corpora, or bilingual paired corpora whose sides are known to be
translations of each other, to find a likely translation for a given source sentence
based on the observed translations. The task of machine translation can be seen
as an instance of estimating functions that map strings to strings.
The regression-based machine translation (RegMT) approach provides a learning
framework for machine translation that separates the learning models for training,
training instance selection, feature representation, and decoding. We use the
transductive learning framework to make the RegMT approach computationally
more scalable, building the model independently for each test sentence. We
develop training instance selection algorithms that not only make RegMT
computationally more scalable but also improve the performance of standard SMT
systems. Our training instance selection techniques improve on previous work,
achieving more accurate RegMT models from the given parallel training sentences
while using fewer training instances.
We introduce L1 regularized regression as a better model than L2 regularized
regression for statistical machine translation. Our results demonstrate that sparse
regression models are better than L2 regularized regression for statistical machine
translation in predicting target features, estimating word alignments, creating
phrase tables, and generating translation outputs. We develop effective evaluation
techniques for measuring the performance of the RegMT model and the quality
of the translations. We use the F1 measure, which performs well when evaluating
translations into English according to human judgments. F1 allows us to evaluate
the performance of the RegMT models using the target feature prediction vectors
or the learned coefficient matrices, or of a given SMT model using its phrase table,
without performing the decoding step, which can be computationally expensive.
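As an illustration of decoding-free evaluation, the following sketch computes an F1 score over the n-gram features shared between a candidate translation and a reference. The whitespace tokenization, the choice of 1-gram and 2-gram features, and the function names are illustrative assumptions, not the exact formulation used in this thesis.

```python
# Hedged sketch: F1 over 1-gram and 2-gram features of a candidate
# translation against a reference, with no decoding step involved.
def ngrams(tokens, n):
    """Set of n-grams (as tuples) occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def f1(candidate, reference, max_n=2):
    cand = candidate.split()
    ref = reference.split()
    c = set().union(*(ngrams(cand, n) for n in range(1, max_n + 1)))
    r = set().union(*(ngrams(ref, n) for n in range(1, max_n + 1)))
    tp = len(c & r)                      # features matched in both
    if tp == 0:
        return 0.0
    precision = tp / len(c)
    recall = tp / len(r)
    return 2 * precision * recall / (precision + recall)

print(f1("the cat sat on the mat", "the cat sat on a mat"))
```

The same precision/recall computation applies whether the candidate features come from a prediction vector, a learned coefficient matrix, or a phrase table, which is what makes the measure usable without decoding.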
Decoding depends on the representation of the training set and the features
used. We use graph decoding on the prediction vectors represented in the n-gram
or word sequence counts space found in the training set. We also decode using
Moses (Koehn et al., 2007) after transforming the learned weight matrix, which
represents the mappings between the source and target features, into a phrase
table that Moses can use during decoding. We demonstrate that sparse L1
regularized regression performs better than L2 regularized regression in the
German-English and Spanish-English translation tasks when using small training
sets. Graph-based decoding can provide an alternative to phrase-based decoding
in translation domains with a small vocabulary.
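To make the contrast between the two regularizers concrete, here is a minimal sketch on synthetic data: ridge (L2) regression via its closed form next to lasso (L1) regression solved with iterative soft-thresholding (ISTA), loosely analogous to the iter-ε approach covered later. The problem sizes, regularization weights, and the solver itself are illustrative assumptions, not the feature spaces or implementation used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse mapping: y = X @ w_true + noise, with only 3 of 20 weights nonzero.
n, d = 100, 20
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.01 * rng.standard_normal(n)

# L2 (ridge) regression has a closed form: w = (X^T X + lam*I)^{-1} X^T y.
lam = 1.0
w_l2 = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# L1 (lasso) regression via iterative soft-thresholding (ISTA).
def ista(X, y, lam, steps=2000):
    L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = X.T @ (X @ w - y)                # gradient of 0.5 * ||Xw - y||^2
        z = w - g / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return w

w_l1 = ista(X, y, lam=5.0)

# L1 zeroes out irrelevant features; L2 only shrinks them toward zero.
print("nonzero L1 weights:", int(np.sum(np.abs(w_l1) > 1e-6)))
print("nonzero L2 weights:", int(np.sum(np.abs(w_l2) > 1e-6)))
```

The exact zeros produced by the L1 penalty are what make the learned source-to-target mappings sparse, whereas the L2 solution keeps every feature with a small nonzero weight.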
Thesis Advisor: Deniz Yuret
Title: Assistant Professor
Date: August 16, 2011
The Regression Model of Machine Translation
Mehmet Ergun Biçici
Doctoral dissertation prepared in the Department of Computer Engineering
in fulfillment of the requirements for the
degree of Doctor of Philosophy
Özet (Abstract)
Machine translation is the task of automatically finding, in another language,
the translation of a sentence written in one language. Statistical machine
translation (SMT) uses parallel documents, or bilingual documents known to be
translations of each other, to compute the most suitable translation for a given
sentence based on previously observed translations. The task of machine
translation can be seen as an instance of estimating functions that map word
sequences to word sequences.
The regression-based machine translation (RegMT) approach provides a learning
framework that separates the learning models, training instance selection, feature
representation, and decoding for machine translation. The transductive learning
framework makes the RegMT approach computationally more scalable and builds
a model independently for each test sentence. The training instance selection
algorithms we develop not only make the RegMT approach computationally more
scalable but also improve the performance of standard SMT systems. By
developing better sentence selection methods than previous work over the
parallel training sentences, we obtain more accurate RegMT models using fewer
training sentences.
We present L1 regularized regression as a better model than L2 regularized
regression for machine translation. Our results show that sparse regression
models are better than the L2 regularized regression model at predicting target
features, finding word alignments, constructing phrase tables, and generating
translations. We developed effective evaluation techniques for measuring the
performance of the RegMT model and the quality of the translations. We use
the F1 measure, which is found to perform well according to human judgments
when evaluating translations into English. F1 allows us to measure the
performance of our RegMT models using the target feature prediction vectors
or the learned coefficient matrices, or of a given SMT model using its phrase
tables, without performing the decoding step, which can be computationally
expensive.
Decoding depends on the representation of the training set and the features
used. We use graph decoding on the prediction vectors represented in the n-gram
or word sequence counts space of the training set. We also decode using Moses
(Koehn et al., 2007) after transforming the learned weight matrix, which
represents the mappings between the source and target features, into a phrase
table that Moses can use during decoding. We show that sparse L1 regularized
regression performs better than L2 regularized regression in the German-English
translation task and, when using small training sets, in the Spanish-English
translation task. Graph-based decoding can be an alternative to phrase-based
decoding in translation tasks with small vocabularies.
Thesis Advisor: Deniz Yuret
Title: Assistant Professor
Date: August 16, 2011
Acknowledgments
I would like to thank my family for their support through all the hardships
of my PhD. They offered patience, love, and financial support throughout
my studies.
I would like to thank my advisor, Deniz Yuret, for helpful discussions
and for guidance and support during the term of this thesis work. I
am fortunate to share common interests with him. I am also thankful
to Kemal Oflazer for giving valuable feedback and advice about my
thesis and to Attila Gürsoy for being attentive and an excellent person
to work with.
I am grateful to Koç University for generously offering a full
scholarship during my PhD. This thesis was supported in part by the
Scientific and Technological Research Council of Turkey (TUBITAK)
through a short-term research project and through a PhD scholarship
award during 2006-2010.
During my graduate student work at Koç University, I served as
a teaching assistant for many courses and professors. I would like to
thank Deniz Yuret, Attila Gürsoy, Serdar Taşıran, Roy Küçükateş,
and İlker Oyman for accepting me as their teaching assistant for their
course work.
Contents
1 Introduction 1
1.1 RegMT Publications . . . . . . . . . . . . . . . . . . . 2
1.2 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Research Contributions . . . . . . . . . . . . . . . . . . 7
2 The Statistical Machine Translation Problem 9
2.1 Parallel Corpus and Parallel Sentences for Training . . . 11
2.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Computational Complexity of Statistical Machine
Translation . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Reranking . . . . . . . . . . . . . . . . . . . . . 17
2.4 Translation Performance Evaluation . . . . . . . . . . . 18
2.5 Phrase-based Statistical Machine Translation Work Flow 19
3 Regression Based Machine Translation 22
3.1 Regression problem . . . . . . . . . . . . . . . . . . . . 24
3.2 Pre-image problem . . . . . . . . . . . . . . . . . . . . 26
3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Practical Issues . . . . . . . . . . . . . . . . . . . . . . 28
3.4.1 Kernel Implementation . . . . . . . . . . . . . . 28
3.4.2 Feature Engineering for Machine Translation . . 30
3.4.3 Decoding with the RegMT Model . . . . . . . . 32
3.5 Computational Complexity . . . . . . . . . . . . . . . . 32
3.5.1 Transductive Learning vs. Inductive Learning . . 33
3.6 Training Instance Selection per Source Feature . . . . . 34
3.7 RegMT Work Flow . . . . . . . . . . . . . . . . . . . . 35
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Sparse Regression for Statistical Machine Translation 38
4.1 Sparsity in Translation Mappings . . . . . . . . . . . . . 39
4.2 L1 Norm Regularization . . . . . . . . . . . . . . . . . . 40
4.3 L1 Regularized Regression Techniques . . . . . . . . . . 41
4.3.1 lasso with Forward Stagewise Regression (FSR): 43
4.3.2 lasso with Quadratic Programming (QP): . . . . 44
4.3.3 L1 Minimization using Linear Programming (LP): 46
4.3.4 L1 Minimization using Iterative Thresholding (iter-ε): . . . 47
4.3.5 Regularization Parameter Optimization . . . . . 47
4.4 Example Word Alignment Scenario . . . . . . . . . . . . 49
4.4.1 Evaluating Phrase Alignment Performance . . . . 53
4.5 RegMT W Cost Functions . . . . . . . . . . . . . . . . 54
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5 Instance Selection for Machine Translation using Feature Decay Algorithms 55
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Instance Selection with Feature Decay . . . . . . . . . . 61
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4.1 The Effect of Coverage on Translation . . . . . . 65
5.4.2 Coverage Results . . . . . . . . . . . . . . . . . 67
5.4.3 Translation Results . . . . . . . . . . . . . . . . 69
5.4.4 Instance Selection for Alignment . . . . . . . . . 72
5.4.5 Out-of-domain Translation Results . . . . . . . . 73
5.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . 74
6 L1 Regularized Regression for Reranking and System Combination in Machine Translation 77
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 77
6.2 Translation Experiments . . . . . . . . . . . . . . . . . 79
6.2.1 Datasets and Baseline . . . . . . . . . . . . . . . 79
6.2.2 Instance Selection for Transductive Regression . 80
6.2.3 Addition of the Brevity Penalty . . . . . . . . . 80
6.2.4 Reranking Experiments . . . . . . . . . . . . . . 81
6.3 System Combination Experiments . . . . . . . . . . . . 83
6.3.1 WMT’10 Datasets . . . . . . . . . . . . . . . . . 84
6.3.2 WMT’10 Experiments . . . . . . . . . . . . . . . 84
6.3.3 WMT’11 Results . . . . . . . . . . . . . . . . . 86
6.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . 87
7 Adaptive Model Weighting and Transductive Regression for Reranking Machine Translation Outputs 89
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 90
7.2 Adaptive Model Weighting for Statistical Machine Translation . . . . . . . . . . . . . . . 91
7.2.1 Related Work . . . . . . . . . . . . . . . . . . . 92
7.2.2 Estimating the Best Performing Translation Model 92
7.2.3 Model selection . . . . . . . . . . . . . . . . . . 95
7.2.4 Variations . . . . . . . . . . . . . . . . . . . . . 95
7.3 Transductive Regression Based Translation Scoring . . . 96
7.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . 96
7.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . 97
7.4.2 Reranking Scores . . . . . . . . . . . . . . . . . 97
7.4.3 Adaptive Model Weighting for Predicting Best
Translations . . . . . . . . . . . . . . . . . . . . 98
7.5 Test Results . . . . . . . . . . . . . . . . . . . . . . . . 102
7.5.1 Simulated Online Learning Results . . . . . . . . 102
7.5.2 Online Learning Results . . . . . . . . . . . . . . 103
7.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . 104