Word Alignment and Smoothing Methods in Statistical Machine Translation: Noise, Prior Knowledge, and Overfitting

Tsuyoshi Okita

A Dissertation submitted in fulfilment of the requirements for the award of Doctor of Philosophy (Ph.D.) to Dublin City University, School of Computing

Supervisor: Prof. Andy Way

Declaration

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Doctor of Philosophy, is entirely my own work, that I have exercised reasonable care to ensure that the work is original, and does not to the best of my knowledge breach any law of copyright, and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.

Signed: (Candidate)    ID No.:    Date:

Contents

Abstract
List of Tables

1 Introduction
  1.1 The Structure of the Thesis
  1.2 Contributions of this Thesis
  1.3 Notation
  1.4 Related Publications

2 Background Knowledge
  2.1 Parameter Estimation in Statistical Machine Translation
    2.1.1 Translation Model
    2.1.2 Language Model
    2.1.3 Overfitting
  2.2 Graphical Models
    2.2.1 Factor Graph
    2.2.2 Sum-Product Algorithm
    2.2.3 Max-Product (Max-Sum) Algorithm
    2.2.4 Typical Inferences

3 Word Alignment: Noise
  3.1 Noise
    3.1.1 Noise in Audio
    3.1.2 Noise in Vision
    3.1.3 Noise in Word Alignment in Statistical Machine Translation
  3.2 Hand-Annotated Corpus-based Evaluation
  3.3 Evaluation
  3.4 Bayesian Learning
  3.5 Algorithmic Design (I): Learning Generative Models (Exact Inference)
    3.5.1 Standard EM (standard ML, GIZA++)
    3.5.2 Standard EM (standard ML, Vectorial Form)
    3.5.3 Local MAP Estimate-EM (standard ML)
    3.5.4 MAP Assignment-EM (standard ML)
    3.5.5 Variational Bayesian-EM (standard ML)
  3.6 Algorithmic Design (II): Learning HMM
    3.6.1 Standard HMM (standard ML, GIZA++): Baum-Welch Implementation
    3.6.2 Standard HMM (standard ML): Graphical Model Implementation
    3.6.3 Bayesian HMM (standard ML): Pseudo-count Implementation
    3.6.4 Variational Bayesian HMM (standard ML)
  3.7 Algorithmic Design: Inference
    3.7.1 Viterbi Decoding (standard ML, GIZA++)
    3.7.2 Viterbi Decoding (standard ML): Graphical Model Implementation
    3.7.3 Posterior Decoding (standard ML)
  3.8 Data Design: Training Data Manipulation
    3.8.1 Sentence-level Corpus Cleaning (SMT oriented, Moses)
    3.8.2 Sentence-level Outlier Detection (original)
    3.8.3 Word-level Noise-Sensitive MAP-based Word Aligner (original)
  3.9 Linguistic Domain Knowledge about Word Alignment Links
    3.9.1 NPs / MWEs / Idioms (Structured Relation: Lexicon)
    3.9.2 Translational Noise
    3.9.3 Lexical Semantics (Pair Relation)
    3.9.4 Numbers / Equations (Less Frequent Relation)
    3.9.5 Proper Nouns / Transliterations / Localization Terminology (Less Frequent Relation)
  3.10 Experiments
    3.10.1 Word Alignment with Sentence-level Noise Reduction
    3.10.2 MAP-based Word Aligner with Noun Phrase Detection
  3.11 Conclusions

4 Smoothing Methods: Overfitting
  4.1 Language Model Smoothing
    4.1.1 Language Model
    4.1.2 Hierarchical Pitman-Yor Process-based Language Model
    4.1.3 Good-Turing Pitman-Yor Language Model Smoothing
    4.1.4 Performance
  4.2 Translation Model Smoothing
    4.2.1 Hierarchical Pitman-Yor Translation Model Smoothing
    4.2.2 Performance
  4.3 Conclusions

5 Conclusions and Further Study

A Pseudocodes

Abstract

This thesis discusses how to incorporate linguistic knowledge into an SMT system. Although one important category of linguistic knowledge is that obtained by a constituent / dependency parser, a POS / super tagger, and a morphological analyser, linguistic knowledge here includes larger domains than this: Multi-Word Expressions, Out-Of-Vocabulary words, paraphrases, lexical semantics (or non-literal translations), named entities, coreferences, and transliterations. The first discussion is about word alignment, where we propose an MWE-sensitive word aligner. The second discussion is about the smoothing methods for a language model and a translation model, where we propose a hierarchical Pitman-Yor process-based smoothing method. The common ground for these discussions is the examination of three exceptional cases from real-world data: the presence of noise, the availability of prior knowledge, and the problem of overfitting. A notable characteristic of this design is the careful usage of (Bayesian) priors in order to capture both frequent and linguistically important phenomena.
This can be considered one way of addressing a weakness of statistical models, which often aim to learn from frequent examples only and overlook less frequent but linguistically important phenomena.

List of Tables

1.1  Four simple examples that may break the pair assumption.
1.2  Example of biterm correspondence which is given to the Local MAP Estimate-EM aligner.
2.1  Example of factor marginalization. Factor b is marginalized out by the sum-product algorithm.
2.2  Example of factor marginalization. Factor b is marginalized out by the max-sum algorithm.
3.1  The first three lines show the hand-annotated corpora which we used in the evaluation by AER, while the last four lines show the corpora which we used in this thesis for evaluation by BLEU.
3.2  IWSLT hand-annotated corpus. Note that although we use the GIZA++ format (which is called an A3 final file), this is not the result of word alignment, but the hand annotation itself. We employ this format to simplify the annotation of alignment links.
3.3  Benefit of prior knowledge about anchor words illustrated by toy data.
3.4  BLEU score after cleaning of sentences with length greater than X. The row shows X, while the column shows the language pair. The parallel corpus is the News Commentary parallel corpus (WMT07).
3.5  Translational noise: only removing five typical redundant phrases from the Japanese side of the training corpus.
3.6  An example of lexical mapping derived by WordNet.
3.7  Statistics related to lexical semantics.
3.8  An example of transliteration (Breen, 1999).
3.9  Statistics of less frequent substructure.
3.10 Results for Algorithm 3 (revised good point algorithm).
3.11 Redundancies in a parallel corpus and its BLEU score improvement. BT denotes the BTEC corpus while CH denotes the Challenge corpus. TR is an abbreviation of Turkish, while ZH is that of simplified Chinese.
3.12 Intrinsic evaluation results (JP–EN and EN–JP).
3.13 Statistics of our noun phrase extraction method. The numbers of noun phrases are from 0.08 to 0.6 NP / sentence pair in our statistical noun phrase extraction methods.
3.14 Results for EN–JP. Baseline is plain GIZA++ / Moses (without NP grouping / prior), baseline2 is with NP grouping, prior is with NP grouping and prior.
3.15 Results for FR–EN and ES–EN. Baseline is plain GIZA++ / Moses (without bilingual noun phrase grouping / prior), baseline2 is with bilingual noun phrase grouping, prior is with bilingual noun phrase grouping and prior.
4.1  Results for the language model. Baseline1 uses modified Kneser-Ney smoothing and baseline2 uses Good-Turing smoothing.
4.2  Statistics of prior knowledge.
4.3  Results for 200k JP–EN sentences.
     Heuristics in the last row show the result when prior knowledge 1 was added at the bottom of the translation model.

Chapter 1

Introduction

Machine Translation (MT) is the study of automatic translation that incorporates knowledge of both linguistics and statistics. As in other areas of Artificial Intelligence,[1] the influx of statistical approaches in the 1990s had a dramatic effect on MT, so much so that to this day the main models are statistical. The study of such statistical methods applied to MT is known as Statistical Machine Translation (SMT). A deeper examination of applying Machine Learning methods may lead to further improvements in the quality of MT output. The research presented in this thesis attempts this application, and we examine one particular module, word alignment, and one particular method, smoothing, which applies to both the language model and the translation model.

Despite being rarely discussed in the MT community, the effort required goes beyond the most complex state-of-the-art Machine Learning algorithms. MT requires the conversion of a sequence of words in the source language into another sequence of words in the target language where 1) the length of the source-language sequence and that of the target-language sequence may be different,[2] and 2) the correspondences between elements in the source-language sequence and those in the target-language sequence may be reordered. One fairly strong assumption made in SMT is that the translation model P(ē | f̄) always accommodates a pair of ē and f̄, which never lacks one side. This assumption radically reduces the complexity of the problem, although it may introduce other problems. This thesis examines methods within this assumption and leaves the more complex cases untouched, since handling them is really the ultimate, and most difficult, goal of SMT.

[1] Computer Vision is one such area where statistical approaches are most intensively used (Forsyth and Ponce, 2003); conversely, Computer Vision contributed to the progress of Machine Learning.
[2] Structured prediction algorithms handle input and output whose lengths are the same and whose correspondences are already assigned from the beginning. For example, in a DNA sequence in biology, the input sequence and the output sequence, which consist of A (Adenine), T (Thymine), G (Guanine), and C (Cytosine), are of the same length and their correspondences match their indices.

EN  on this particular building
FR  dans ce bâtiment
EN  the health and safety legislation that it actually passes
FR  la réglementation en matière de santé et de sécurité qu'il vote.
EN  why are there no fire instructions?
FR  comment se fait-il qu'il n'y ait pas de consignes en cas d'incendie ?
EN  there are two finnish channels and one portuguese one.
FR  il y a bien deux chaînes finnoises et une chaîne portugaise.

Table 1.1: Four simple examples that may break the pair assumption.

Note that Machine Learning perspectives are not identical with SMT perspectives, but the two may complement each other in the following sense. On the one hand, various particular MT technologies are developed for SMT, such as phrase extraction, stack-based decoding, Hierarchical Phrase-Based Machine Translation (HPB-SMT), Minimum Error Rate Training (MERT), and so forth. On the other hand, the Machine Learning point of view concerns noise, prior knowledge, overfitting, statistical assumptions, and so forth. This thesis focuses on noise, prior knowledge, and overfitting.

Among others, one idea that radically reduces the overall computational complexity in SMT is the assumption that an observed word / phrase always forms a pair. (We call this the pair assumption in this thesis.) A t-table and a translation model always accommodate source and target words / phrases, as the sketch below illustrates.
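To make the pair assumption concrete, the following is a minimal IBM Model 1 style EM sketch in Python on a hypothetical two-sentence toy corpus. The toy sentence pairs, variable names, and iteration count are illustrative choices of ours and not taken from the systems discussed later (which rely on GIZA++ and Moses); the sketch only shows that the t-table is indexed by (target word, source word) pairs and that the E-step computes a posterior over a latent alignment variable of the kind discussed next.

# A minimal IBM Model 1 style EM sketch on hypothetical toy data; an
# illustration of the pair assumption only, not the models used later in
# this thesis (which are trained with GIZA++).
from collections import defaultdict

# Toy parallel corpus (hypothetical): (source sentence f, target sentence e).
corpus = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]

# The t-table t(e | f) is keyed by (e, f) pairs only: a word never enters it
# without a partner on the other side -- this is the pair assumption.
target_vocab = {e for _, es in corpus for e in es}
t = defaultdict(float)
for fs, es in corpus:
    for f in fs:
        for e in es:
            t[(e, f)] = 1.0 / len(target_vocab)  # uniform initialisation

for _ in range(10):                    # a few EM iterations
    count = defaultdict(float)         # expected (e, f) pair counts
    total = defaultdict(float)         # expected counts per source word f
    for fs, es in corpus:
        for e in es:
            # E-step: posterior over the latent alignment of e to each f in
            # this sentence pair (the hidden variable discussed in the text).
            z = sum(t[(e, f)] for f in fs)
            for f in fs:
                count[(e, f)] += t[(e, f)] / z
                total[f] += t[(e, f)] / z
    # M-step: re-normalise the expected counts per source word.
    for (e, f) in count:
        t[(e, f)] = count[(e, f)] / total[f]

# The probability mass concentrates on ("the", "la"), ("house", "maison"),
# and ("flower", "fleur").
print(sorted(t.items(), key=lambda kv: -kv[1])[:3])

Every entry of t above is a complete pair; there is no representation at all for a source word with no target counterpart, which is exactly the simplification the following paragraph builds on.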
SMT is constructed based on this assumption. For example, no pair in the translation model ever lacks either side of the word / phrase. Under this assumption, we can write a pair of words / phrases e_i | f_i using just one variable x. In the EM algorithm (Dempster et al., 1977), the latent variable A_{i→j} can be written in terms of this x. In the HMM algorithm (Baum et al., 1970;