Joint Space Neural Probabilistic Language Model for Statistical Machine Translation

Anonymous Author(s)
Affiliation
Address
email

Abstract

A neural probabilistic language model (NPLM) provides a way to achieve better perplexity than n-gram language models and their smoothed variants. This paper investigates an application area in bilingual NLP, specifically Statistical Machine Translation (SMT). We focus on the perspective that an NPLM has the potential to complement the 'resource-constrained' bilingual resources with potentially 'huge' monolingual resources. We introduce an ngram-HMM language model as an NPLM using a non-parametric Bayesian construction. In order to facilitate its application to various tasks, we propose a joint space model of the ngram-HMM language model. We show an experiment on system combination in the area of SMT. One discovery was that our treatment of noise improved the results by 0.20 BLEU points when the NPLM is trained on a relatively small corpus, in our case 500,000 sentence pairs, which is often the case due to the long training time of NPLM.

1 Introduction

A neural probabilistic language model (NPLM) [3, 4] and distributed representations [25] provide a way to achieve better perplexity than n-gram language models [47] and their smoothed variants [26, 9, 48]. Recently, the latter, i.e. smoothed language models, have seen many developments in the line of nonparametric Bayesian methods, such as the hierarchical Pitman-Yor language model (HPYLM) [48] and the Sequence Memoizer (SM) [51, 20], including applications to SMT [36, 37, 38]. An NPLM considers the representation of the data in order to make the probability distribution of word sequences more compact, focusing on the similar semantic and syntactic roles of words. For example, given the two sentences "The cat is walking in the bedroom" and "A dog was running in a room", these sentences can be stored more compactly than in an n-gram language model if we exploit the similarity between the pairs (the, a), (bedroom, room), (is, was), and (running, walking). Thus, an NPLM provides the semantic and syntactic roles of words as a language model. The NPLM of [3] implemented this using a multi-layer neural network and yielded 20% to 35% better perplexity than a language model with modified Kneser-Ney smoothing [9].

There are several successful applications of NPLM [41, 11, 42, 10, 12, 14, 43]. First, one category of applications includes POS tagging, NER tagging, and parsing [12, 7]. This category uses the features provided by an NPLM within a limited window; it is often the case that there are no long-range effects, i.e. no decision requires looking carefully at elements far beyond the limited window. Second, another category of applications includes the Semantic Role Labeling (SRL) task [12, 14]. This category uses features within a sentence. A typical element is the predicate in an SRL task, which requires information that may be at a long distance but is still within the sentence. Neither of these approaches requires obtaining the best tag sequence; the tags are treated as independent. Third, the final category includes the MERT process [42] and possibly many others, most of which remain unexplored. The objective of learning in this category is not to search for the best tag for a word but for the best sequence for a sentence. Hence, we need to apply a sequential learning approach.
Although most of the applications described in [11, 10, 12, 14] are monolingual tasks, applying this approach to a bilingual task introduces genuinely astonishing aspects, which we may call "creative words" [50], automatically into the traditionally resource-constrained SMT components. For example, the training corpus of a word aligner is usually strictly restricted to the given parallel corpus, whereas an NPLM allows this training to draw on a huge monolingual corpus. Although most of this line of work has not even been tested, mostly due to the computational complexity of training an NPLM, [43] applied it to the MERT process, reranking the n-best lists using an NPLM. This paper aims at a different task, the task of system combination [1, 29, 49, 15, 13, 35]. This category of tasks employs a sequential method such as Maximum A Posteriori (MAP) inference (Viterbi decoding) [27, 44, 33] on Conditional Random Fields (CRFs) / Markov Random Fields (MRFs).

Although this paper discusses an ngram-HMM language model, which we introduce as one model of NPLM borrowing many of its mechanisms from the infinite HMM [19] and the hierarchical Pitman-Yor LM [48], one main contribution is to show a new application area of NPLM in SMT. Although several applications of NPLM have been presented, there has been, as far as we know, no application to the task of system combination.

The remainder of this paper is organized as follows. Section 2 describes the ngram-HMM language model, while Section 3 introduces a joint space model of the ngram-HMM language model. In Section 4 our intrinsic experimental results are presented, while in Section 5 our extrinsic experimental results are presented. We conclude in the final section.

2 Ngram-HMM Language Model

Generative model   Figure 1 depicts an example of the ngram-HMM language model, in this case a 4-gram-HMM language model (in blue, in the center). We consider a Hidden Markov Model (HMM) [40, 21, 2] of size $K$ which emits the n-gram word sequence $w_{i-n+1}, \ldots, w_i$, where $h_{i-n+1}, \ldots, h_i$ denote the corresponding hidden states. The arcs from $w_{i-3}$ to $w_i$, ..., $w_{i-1}$ to $w_i$ show the back-off relations that appear in language model smoothing, such as Kneser-Ney smoothing [26], Good-Turing smoothing [24], and hierarchical Pitman-Yor LM smoothing [48].

Figure 1: A graphical representation of the 4-gram HMM language model (hidden states $h_{i-3}, \ldots, h_i$ and words $w_{i-3}, \ldots, w_i$, with DP prior parameters $\gamma, \beta, \alpha, \pi_k, H, \phi$ on the left and Pitman-Yor base measures $G_0, G_u$ with discounts $d_0, d_u$ on the right).

On the left side of Figure 1, we place one Dirichlet Process prior $\mathrm{DP}(\alpha, H)$, with concentration parameter $\alpha$ and base measure $H$, on the transition probabilities going out of each hidden state. This construction is borrowed from the infinite HMM [2, 19]. The observation likelihood for the hidden word $h_t$ is parameterized as $w_t \mid h_t \sim F(\phi_{h_t})$, since the hidden variables of an HMM are limited in their representational power, where $\phi_{h_t}$ denotes the output parameters. Since the observations can be regarded as being generated from a dynamic mixture model [19], as in (1), the Dirichlet priors on the rows share a parameter:

$$p(w_i \mid h_{i-1} = k) \;=\; \sum_{h_i=1}^{K} p(h_i \mid h_{i-1} = k)\, p(w_i \mid h_i) \;=\; \sum_{h_i=1}^{K} \pi_{k, h_i}\, p(w_i \mid \phi_{h_i}) \qquad (1)$$

On the right side of Figure 1, we place a Pitman-Yor prior PY, whose power-law behavior is an advantage since our target is NLP, as in (2):

$$w_i \mid w_{1:i-1} \sim \mathrm{PY}(d_i, \theta_i, G_i) \qquad (2)$$

where $d_i$ is a discount parameter, $\theta_i$ is a strength parameter, and $G_i$ is a base measure. This construction is borrowed from the hierarchical Pitman-Yor language model [48].
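To make (1) concrete, the following is a minimal sketch (not the authors' implementation; the array names and shapes are assumptions) of the dynamic-mixture emission probability computed from a transition matrix $\pi$ and per-state output distributions $\phi$:

```python
import numpy as np

def emission_prob(w, k, pi, phi):
    """Eq. (1): p(w_i | h_{i-1}=k) = sum_h pi[k, h] * p(w_i | phi_h).

    pi  : (K, K) row-stochastic transition matrix
    phi : (K, V) per-state output distributions over a vocabulary of size V
    w   : index of the observed word; k : index of the previous hidden state
    """
    return float(np.dot(pi[k], phi[:, w]))

# toy example with K=2 hidden states and V=3 word types
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
phi = np.array([[0.5, 0.3, 0.2],
                [0.1, 0.1, 0.8]])
print(emission_prob(w=2, k=0, pi=pi, phi=phi))  # 0.7*0.2 + 0.3*0.8 = 0.38
```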
Inference   We compute the expected value of the posterior distribution of the hidden variables with a beam search [19]. This blocked Gibbs sampler alternately samples the parameters (transition matrix, output parameters), the state sequence, the hyper-parameters, and the parameters related to language model smoothing. As mentioned in [19], a characteristic of this sampler is that it adaptively truncates the state space and runs dynamic programming as in (3):

$$p(h_t \mid w_{1:t}, u_{1:t}) \;=\; p(w_t \mid h_t) \sum_{h_{t-1}:\, u_t < \pi(h_{t-1}, h_t)} p(h_{t-1} \mid w_{1:t-1}, u_{1:t-1}) \qquad (3)$$

where $u_t$ is only valid if it is smaller than the transition probabilities of the hidden word sequence $h_1, \ldots, h_K$. Note that we use an auxiliary variable $u_i$, sampled for each word in the sequence from the distribution $u_i \sim \mathrm{Uniform}(0, \pi(h_{i-1}, h_i))$. The implementation of the beam sampler consists of preprocessing the transition matrix $\pi$ and sorting its elements in descending order.

Initialization   First, we obtain the parameters for the hierarchical Pitman-Yor process-based language model [48, 23], which can be obtained using a blocked Gibbs sampler [32]. Second, in order to obtain a better initialization value of $h$ for the above inference, we run the following EM algorithm instead of initializing the distribution of $h$ randomly. This EM algorithm incorporates the truncation mentioned above [19]. In the E-step, we compute the expected value of the posterior distribution of the hidden variables. For every position $h_i$, we send a forward message $\alpha(h_{i-n+1:i-1})$ along a single path from the start to the end of the chain (this is the standard forward recursion in an HMM, hence the name $\alpha$). Here we normalize the sum of $\alpha$ considering the truncated variables $u_{i-n+1:i-1}$:

$$\alpha(h_{i-n+2:i}) \;=\; \frac{\sum \alpha(u_{i-n+1:i-1})}{\sum \alpha(h_{i-n+1:i-1})}\; P(w_i \mid h_i) \sum \alpha(u_{i-n+1:i-1})\, P(h_i \mid h_{i-n+1:i-1}) \qquad (4)$$

Then, for every position $h_j$, we send a message $\beta(h_{i-n+2:i}, h_j)$ along multiple paths from the start to the end of the chain, as in (5):

$$\beta(h_{i-n+2:i}, h_j) \;=\; \frac{\sum \alpha(u_{i-n+1:i-1})}{\sum \alpha(h_{i-n+1:i-1})}\; P(w_i \mid h_i) \sum \beta(h_{i-n+1:i-1}, h_j)\, P(h_i \mid h_{i-n+1:i-1}) \qquad (5)$$

This step aims at obtaining the expected value of the posterior distribution (a similar construction using expectations can be seen in the factorial HMM [22]). In the M-step, the expected value of the posterior distribution obtained in the E-step is used to evaluate the expectation of the logarithm of the complete-data likelihood.

3 Joint Space Model

In this paper, we mechanically introduce a joint space model. Besides the ngram-HMM language model obtained in the previous section, we will often encounter a situation where we have other hidden variables $h^1$ which are unrelated to $h^0$, as depicted in Figure 2. Suppose that the ngram-HMM language model yielded hidden variables suggesting the semantic and syntactic roles of words. In addition to this, we may have other hidden variables suggesting, say, a genre ID. This genre ID can be considered a second context which is often not closely related to the first context. This mechanical construction also has the advantage that the resulting language model often has a smaller perplexity than the original ngram-HMM language model. Note that we do not intend to learn this model jointly under a single universal criterion; we simply concatenate the labels produced by different tasks on the same sequence. By this formulation, we intend to facilitate the use of this language model.

Figure 2: The joint space 4-gram HMM language model (a second layer of hidden variables $h^1_{i-3}, \ldots, h^1_i$ is attached alongside $h^0_{i-3}, \ldots, h^0_i$).

It should be noted that these two contexts may not be derivable by a single learning algorithm. For example, a language model with a sentence context may not be derived in the same way as one with a word context: in the above example, the hidden semantics over a sentence is not a sequential object. Hence, it can only be obtained by treating all the sentences as independent, and we can then obtain it using, say, LDA.
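As a minimal sketch of this mechanical construction (an illustration under assumed data layouts, not the authors' code), the word-level hidden states from the ngram-HMM sampler and an independently obtained sentence-level label are simply zipped into joint labels per position:

```python
from typing import List, Tuple

def joint_space(h0: List[int], genre_id: int) -> List[Tuple[int, int]]:
    """Concatenate the word-level hidden states h0 with a sentence-level
    label h1 (e.g. a genre ID from LDA); nothing is learned jointly."""
    return [(h, genre_id) for h in h0]

# toy example: a sentence whose words received hidden states [3, 7, 7, 1]
# from the ngram-HMM sampler and whose genre ID is 2
print(joint_space([3, 7, 7, 1], genre_id=2))   # [(3, 2), (7, 2), (7, 2), (1, 2)]
```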
4 Intrinsic Evaluation

We compared the perplexity of the ngram-HMM LM (1 feature), the ngram-HMM LM (2 features, as in this paper, with a 4-class genre ID), modified Kneser-Ney smoothing (irstlm) [18], and the hierarchical Pitman-Yor LM [48]. We used the news2011 English test set and trained the LMs on Europarl.

                   ngram-HMM (1 feat)   ngram-HMM (2 feat)   modified Kneser-Ney   hierarchical PY
Europarl 1500k     114.014              113.450              118.890               118.884

Table 1: Perplexity of each language model.

5 Extrinsic Evaluation: Task of System Combination

We applied the ngram-HMM language model to the task of system combination. Given multiple Machine Translation (MT) outputs, this task essentially combines the best fragments among the given MT outputs to recreate a new MT output. The standard procedure consists of three steps: Minimum Bayes Risk decoding, monolingual word alignment, and monotonic consensus decoding. Although these procedures themselves need explanation in order to understand what follows, we keep the main text to a minimum and move some (not fully sufficient) explanations to the appendices. Note that although this experiment was done using the ngram-HMM language model, any NPLM would suffice for this purpose. In this sense, we use the term NPLM instead of ngram-HMM language model.

Features in Joint Space   The first feature of the NPLM is the semantically and syntactically similar roles of words, which can be derived from the original NPLM. In this paragraph we introduce the second feature, a genre ID. The motivation for this feature comes from the study of domain adaptation for SMT, where it has become popular to consider the effect of genre in the test set. This paper uses Latent Dirichlet Allocation (LDA) [5, 46, 6, 45, 33] to obtain the genre ID via (unsupervised) document classification, since our interest here is in the genre of the sentences in the test set. We then place these labels on a joint space.

LDA represents topics as multinomial distributions over the $W$ unique word-types in the corpus and represents documents as mixtures of topics. Let $C$ be the number of unique labels in the corpus. Each label $c$ is represented by a $W$-dimensional multinomial distribution $\phi_c$ over the vocabulary. For document $d$, we observe both the words in the document $w^{(d)}$ and the document labels $c^{(d)}$. Given the distribution over topics $\theta_d$, the generation of words in the document is captured by the generative model below; the parameters $\alpha$ and $\beta$ relate to the corpus level, the variables $\theta_d$ belong to the document level, and the variables $z_{dn}$ and $w_{dn}$ correspond to the word level and are sampled once for each word in each document.

Using topic modeling in the second step, we propose the following overall algorithm to obtain genre IDs for the test set (a sketch of the clustering step follows the list):

1. Fix the number of clusters $C$; we explore values from small to big, and the optimal value is searched for on the tuning set.
2. Do unsupervised document classification (or LDA) on the source side of the tuning and test sets:
   (a) For each label $c \in \{1, \ldots, C\}$, sample a distribution over word-types $\phi_c \sim \mathrm{Dirichlet}(\cdot \mid \beta)$.
   (b) For each document $d \in \{1, \ldots, D\}$:
       i. Sample a distribution over its observed labels $\theta_d \sim \mathrm{Dirichlet}(\cdot \mid \alpha)$.
       ii. For each word $i \in \{1, \ldots, N_d^W\}$:
          A. Sample a label $z_i^{(d)} \sim \mathrm{Multinomial}(\theta_d)$.
          B. Sample a word $w_i^{(d)} \sim \mathrm{Multinomial}(\phi_c)$ from the label $c = z_i^{(d)}$.
3. Separate each class of the tuning and test sets (keeping both the original index and the new index in the allocated separated dataset).
4. (Run system combination on each class.)
5. (Reconstruct the system-combined results of each class, preserving the original index.)
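A rough sketch of step 2, assuming an off-the-shelf LDA implementation (scikit-learn here; the paper does not specify a tool, so the library and parameter choices below are illustrative assumptions): cluster the source-side sentences and take the dominant topic as the genre ID.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def genre_ids(source_sentences, C, seed=0):
    """Assign each source-side sentence a genre ID in {0, ..., C-1} via LDA."""
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(source_sentences)     # document-term matrix
    lda = LatentDirichletAllocation(n_components=C,         # number of clusters C (tuned on the dev set)
                                    random_state=seed)
    theta = lda.fit_transform(counts)                       # per-document topic proportions theta_d
    return theta.argmax(axis=1)                             # dominant topic = genre ID

# usage on the source side of the tuning and test sets (toy sentences)
sentences = ["stocks fell sharply on friday",
             "the patient was given a new treatment",
             "the central bank cut interest rates"]
print(genre_ids(sentences, C=2))
```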
Modified Process in System Combination   Given the joint space NPLM, we need to specify in which of the three processes of the system combination task this NPLM is used. We only discuss here the standard confusion-network-based system combination. This strategy takes the following three steps (a very brief explanation of these three steps is available in the Appendix):

- Minimum Bayes Risk decoding [28] (with a Minimum Error Rate Training (MERT) process [34]):

$$\hat{E}^{\mathrm{MBR}}_{\mathrm{best}} \;=\; \arg\min_{E' \in \mathcal{E}} R(E') \;=\; \arg\min_{E' \in \mathcal{E}} \sum_{E \in \mathcal{E}} L(E, E')\, P(E \mid F) \;=\; \arg\min_{E' \in \mathcal{E}} \sum_{E \in \mathcal{E}} \big(1 - \mathrm{BLEU}_E(E')\big)\, P(E \mid F)$$

- Monolingual word alignment
- (Monotone) consensus decoding (with a MERT process):

$$E_{\mathrm{best}} \;=\; \arg\max_{e} \prod_{i=1}^{I} \phi(i \mid \bar{e}_i)\, p_{LM}(e)$$

Similar to the task of n-best reranking in the MERT process [43], we consider the reranking of n-best lists in the third step above, i.e. (monotone) consensus decoding (with a MERT process). We do not discuss the other two processes in this paper.

On the one hand, we intend to use the first feature of the NPLM, i.e. the semantically and syntactically similar roles of words, for paraphrases. The n-best reranking in the MERT process [43] alters the probability suggested by a word sense disambiguation task using the NPLM feature, whereas we intend to add a sentence whose words are replaced using the NPLM. On the other hand, we intend to use the second feature of the NPLM, i.e. the genre ID, to split a single system combination system into multiple system combination systems based on the genre ID clusters. In this perspective, the roles of these two features can be seen as independent. We conducted the four settings described below.

(A) First Feature: N-Best Reranking in Monotonic Consensus Decoding without Noise (NPLM plain)   In the first setting for the experiments, we used the first feature without considering noise. The original aim of an NPLM is to capture semantically and syntactically similar words in a way that a latent word depends on its context. We will be able to obtain a variety of words if we condition on a fixed context, which would form paraphrases in theory.

We introduce our algorithm via a word sense disambiguation (WSD) task, which selects the right disambiguated sense for the word in question. This task is necessary because a text is natively ambiguous, accommodating several different meanings. The task of WSD [14] can be written as in (6):

$$P(\mathrm{synset}_i \mid \mathrm{features}_i, \theta) \;=\; \frac{1}{Z(\mathrm{features})} \prod_{k=1}^{m} g(\mathrm{synset}_i, k)^{f(\mathrm{feature}^k_i)} \qquad (6)$$

where $k$ ranges over all possible features, $f(\mathrm{feature}^k_i)$ is an indicator function whose value is 1 if the feature exists and 0 otherwise, $g(\mathrm{synset}_i, k)$ is a parameter for a given synset and feature, $\theta$ is the collection of all these parameters $g(\mathrm{synset}_i, k)$, and $Z$ is a normalization constant. Note that we use the term "synset" by analogy with WordNet [30]: it is equivalent to "sense" or "meaning". Note also that the NPLM will be included as one of the features in this equation. If the features include sufficient statistics, the WSD task will succeed; otherwise, it will fail. We rerank the outcome of this WSD task.

On the one hand, the paraphrases obtained in this way have the attractive aspect that they can be called "creative words" [50]: the traditional resources that can be used when building a translation model in SMT are constrained to the parallel corpus, whereas an NPLM can be trained on a huge monolingual corpus. On the other hand, in practice the notorious training time of NPLM unfortunately only allows us to use a fairly small monolingual corpus, although many papers have made an effort to reduce it [31]. Due to this, we cannot ignore the fact that an NPLM not trained on a huge corpus may be affected by noise. Conversely, we have no guarantee that such noise will be reduced if we train the NPLM on a huge corpus. It is quite likely that an NPLM has a lot of noise for small corpora. Hence, this paper also needs to provide a way to overcome the difficulties of noisy data. In order to avoid this difficulty, we only adopt a paraphrase when the NPLM itself assigns it a high probability.
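A minimal sketch of this paraphrase proposal (the interface lm_prob and the threshold are assumptions for illustration, not the authors' API): condition on a fixed context and keep only substitutes to which the language model assigns high probability; each kept substitute yields an extra candidate sentence for n-best reranking.

```python
def propose_paraphrases(sentence, position, vocabulary, lm_prob, threshold=0.05):
    """Return alternative words for sentence[position] whose probability under
    the language model, given the fixed preceding context, is at least `threshold`."""
    context = tuple(sentence[max(0, position - 3):position])    # 4-gram context
    original = sentence[position]
    candidates = [(w, lm_prob(w, context)) for w in vocabulary if w != original]
    kept = [(w, p) for (w, p) in candidates if p >= threshold]  # keep only high-probability substitutes
    return sorted(kept, key=lambda wp: -wp[1])

# toy stand-in for the NPLM conditional probability (an assumption, not the real model)
toy_lm = {("walking", "in", "the"): {"bedroom": 0.45, "room": 0.30, "kitchen": 0.10, "idea": 0.001}}
def lm_prob(word, context):
    return toy_lm.get(context, {}).get(word, 0.0)

sentence = "the cat is walking in the bedroom".split()
print(propose_paraphrases(sentence, 6, ["room", "kitchen", "idea", "bedroom"], lm_prob))
# [('room', 0.3), ('kitchen', 0.1)] -> each yields one extra sentence for the n-best list
```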
(B) First Feature: N-Best Reranking in Monotonic Consensus Decoding with Noise (NPLM dep)   In the second setting for our experiment, we used the first feature while taking noise into account. Whereas in the algorithm above we adopted a suggested paraphrase without any intervention, it is also possible to examine whether such a suggestion should be adopted or not. If we add paraphrases and the resulting sentence obtains a higher score in terms of the modified dependency score [39] (see Figure 3), the addition of paraphrases is a good choice; if the resulting score decreases, we do not need to add them. One difficulty in this approach is that we do not have a reference that allows us to score the sentence in the usual manner. For this reason, we adopt a naive way of deploying the above, using pseudo-references. (This formulation is equivalent to decoding these inputs by MBR decoding.) First, if we add paraphrases and the resulting sentence does not obtain a very bad score, we add these paraphrases, since they are not very bad (the naive way). Second, we score the sentence in question against all the other candidates (pseudo-references) and take the average. Thus, our second algorithm selects a paraphrase that does not obtain a very bad score in terms of the modified dependency score using the NPLM. (A sketch of this pseudo-reference scoring follows setting (D) below.)

(C) Second Feature: Genre ID (DA, Domain Adaptation)   In the third setting of our experiment, we used only the second feature. As mentioned in the explanation of this feature, we intend to split a single module of system combination into multiple modules of system combination according to the genre ID. Hence, we use the module of system combination tuned for the specific genre ID (e.g., we translate newswire with a system combination module tuned on a newswire tuning set, while we translate medical text with a module tuned on a medical-text tuning set).

Figure 3: Under the modified dependency score [39], the scores of the two sentences "John resigned yesterday" and "Yesterday John resigned" are the same. The figure shows the c-structure and f-structure of the two sentences using Lexical Functional Grammar (LFG) [8]: the c-structures differ, but the f-structure representation is the same.

(D) First and Second Feature (COMBINED)   In the fourth setting we used both features. In this setting, (1) we used modules of system combination which are tuned for the specific genre ID, and (2) we prepared an NPLM whose context can be switched based on the specific genre of the sentence in the test set. The latter was straightforward since these two features are stored in the joint space in our case.
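Returning to setting (B), here is a minimal sketch of the pseudo-reference scoring. The modified dependency score of [39] requires an LFG parser, so sentence-level BLEU from NLTK is used below purely as a stand-in scorer, and the acceptance threshold is an assumption:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def keep_paraphrased(candidate, pseudo_references, min_score=0.15):
    """Score `candidate` against all other MT outputs (pseudo-references) and
    keep it unless the average score is very bad (the 'naive way' of setting B)."""
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu([ref.split()], candidate.split(), smoothing_function=smooth)
              for ref in pseudo_references]
    return sum(scores) / len(scores) >= min_score

# usage: the other systems' outputs serve as pseudo-references
others = ["john resigned yesterday", "yesterday john resigned", "john quit yesterday"]
print(keep_paraphrased("john stepped down yesterday", others))
```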
Experimental Results   ML4HMT-2012 provides four translation outputs (s1 to s4), which are the MT outputs of two RBMT systems, APERTIUM and LUCY, of PB-SMT (MOSES), and of HPB-SMT (MOSES), respectively. The tuning data consists of 20,000 sentence pairs, while the test data consists of 3,003 sentence pairs.

Our experimental setting is as follows. We use our own system combination module [16, 17, 35], which has its own language modeling tool, MERT process, and MBR decoding. We use the BLEU metric as the loss function in MBR decoding. We use TERp (http://www.cs.umd.edu/~snover/terp) as the alignment metric in monolingual word alignment. We trained the NPLM using 500,000 sentence pairs from the English side of the EN-ES corpus of EUROPARL (http://www.statmt.org/europarl).

The results show that the first setting of NPLM-based paraphrase augmentation, NPLM plain, achieved 25.61 BLEU points, losing 0.39 BLEU points absolute against the standard system combination. The second setting, NPLM dep, achieved the slightly better result of 25.81 BLEU points, losing 0.19 BLEU points absolute against the standard system combination. Note that the baseline achieved 26.00 BLEU points, the best single system in terms of BLEU was s4 with 25.31 BLEU points, and the best single system in terms of METEOR was s2 with 0.5853. The third setting achieved 26.33 BLEU points, which was the best among our four settings. The fourth setting achieved 25.95 BLEU points, again losing 0.05 BLEU points against the standard system combination.

Besides our four settings, which differ in which features are used, we ran several other configurations of system combination in order to put the four settings in context: standard system combination using the BLEU loss function (line 5 in Table 2), standard system combination using the TER loss function (line 6), system combination whose backbone is unanimously taken from the RBMT outputs (MT input s2 in this case; line 11), and system combination whose backbone is selected by the modified dependency score (with three variations: modDep precision, recall, and F-score; lines 12, 13, and 14). One interesting characteristic is that the s2 backbone (line 11) achieved the best score among all of these variations, followed by the modified dependency measure-selected backbones. From these runs, we cannot say that the runs related to the NPLM, i.e. (A), (B) and (D), were particularly successful. A possible reason is that our interface with the NPLM was limited to paraphrases, which were not very successfully chosen by reranking.

                                        NIST     BLEU    METEOR     WER      PER
MT input s1                             6.4996   0.2248  0.5458641  64.2452  49.9806
MT input s2                             6.9281   0.2500  0.5853446  62.9194  48.0065
MT input s3                             7.4022   0.2446  0.5544660  58.0752  44.0221
MT input s4                             7.2100   0.2531  0.5596933  59.3930  44.5230
standard system combination (BLEU)      7.6846   0.2600  0.5643944  56.2368  41.5399
standard system combination (TER)       7.6231   0.2638  0.5652795  56.3967  41.6092
(A) NPLM plain                          7.6041   0.2561  0.5593901  56.4620  41.8076
(B) NPLM dep                            7.6213   0.2581  0.5601121  56.1334  41.7820
(C) DA                                  7.7146   0.2633  0.5647685  55.8612  41.7264
(D) COMBINED                            7.6464   0.2595  0.5610121  56.0101  41.7702
s2 backbone                             7.6371   0.2648  0.5606801  56.0077  42.0075
modDep precision                        7.6670   0.2636  0.5659757  56.4393  41.4986
modDep recall                           7.6695   0.2642  0.5664320  56.5059  41.5013
modDep F-score                          7.6695   0.2642  0.5664320  56.5059  41.5013

Table 2: Single best performance, the performance of the standard system combination (BLEU and TER loss functions), the performance of the four settings in this paper ((A), ..., (D)), the performance of the s2-backboned system combination, and the performance of the selection of sentences by modified dependency score (precision, recall, and F-score).
Conclusion and Perspectives

This paper proposed a non-parametric Bayesian way to interpret NPLM, which we call the ngram-HMM language model. We then added a small extension by concatenating another context in the same model, which we call the joint space ngram-HMM language model. The main issue investigated in this paper was an application of NPLM in bilingual NLP, specifically Statistical Machine Translation (SMT). We focused on the perspective that an NPLM has the potential to complement the 'resource-constrained' bilingual resources with potentially 'huge' monolingual resources. We compared our proposed algorithms and others. One discovery was that when we use a fairly small NPLM, noise reduction may be one way to improve the quality; in our case, the noise-reduced version obtained 0.2 BLEU points more.

Further work would be to apply this NPLM to various other tasks in SMT, such as word alignment, hierarchical phrase-based decoding, and semantics-incorporated MT systems, in order to discover the merit of 'depth' of architecture in Machine Learning.

References

[1] Bangalore, S., Bordel, G., and Riccardi, G. Computing consensus translation from multiple machine translation systems. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2001), 350–354.
[2] Beal, M. J. Variational algorithms for approximate Bayesian inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London (2003).
[3] Bengio, Y., Ducharme, R., and Vincent, P. A neural probabilistic language model. In Proceedings of Neural Information Processing Systems (2000).
[4] Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., and Gauvain, J.-L. Neural probabilistic language models. In Innovations in Machine Learning: Theory and Applications, edited by D. Holmes and L. C. Jain (2005).
[5] Blei, D., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[6] Blei, D. M. Introduction to probabilistic topic models. Communications of the ACM (2011).
[7] Bordes, A., Glorot, X., Weston, J., and Bengio, Y. Towards open-text semantic parsing via multi-task learning of structured embeddings. CoRR abs/1107.3663 (2011).
[8] Bresnan, J. Lexical Functional Syntax. Blackwell (2001).
[9] Chen, S., and Goodman, J. An empirical study of smoothing techniques for language modeling. Technical report TR-10-98, Harvard University (1998).
[10] Collobert, R. Deep learning for efficient discriminative parsing. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) (2011).
[11] Collobert, R., and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning (ICML 2008) (2008).
[12] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (2011), 2493–2537.
[13] DeNero, J., Chiang, D., and Knight, K. Fast consensus decoding over translation forests. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (2009), 567–575.
[14] Deschacht, K., De Belder, J., and Moens, M.-F. The latent words language model. Computer Speech and Language 26 (2012), 384–409.
[15] Du, J., He, Y., Penkale, S., and Way, A. MaTrEx: the DCU MT system for WMT 2009. In Proceedings of the Third EACL Workshop on Statistical Machine Translation (2009), 95–99.
[16] Du, J., and Way, A. An incremental three-pass system combination framework by combining multiple hypothesis alignment methods. International Journal of Asian Language Processing 20, 1 (2010), 1–15.
[17] Du, J., and Way, A. Using TERp to augment the system combination for SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas (AMTA 2010) (2010).
[18] Federico, M., Bertoldi, N., and Cettolo, M. IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of Interspeech (2008).
[19] Van Gael, J., Vlachos, A., and Ghahramani, Z. The infinite HMM for unsupervised POS tagging. In The 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009) (2009).
[20] Gasthaus, J., Wood, F., and Teh, Y. W. Lossless compression based on the Sequence Memoizer. In DCC 2010 (2010).
[21] Ghahramani, Z. An introduction to hidden Markov models and Bayesian networks. International Journal of Pattern Recognition and Artificial Intelligence 15, 1 (2001), 9–42.
[22] Ghahramani, Z., Jordan, M. I., and Smyth, P. Factorial hidden Markov models. Machine Learning (1997).
[23] Goldwater, S., Griffiths, T. L., and Johnson, M. Contextual dependencies in unsupervised word segmentation. In Proceedings of COLING-ACL 2006 (2006), 673–680.
[24] Good, I. J. The population frequencies of species and the estimation of population parameters. Biometrika 40, 3-4 (1953), 237–264.
[25] Hinton, G. E., McClelland, J. L., and Rumelhart, D. Distributed representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition (edited by D. E. Rumelhart and J. L. McClelland), vol. 1, MIT Press (1986).
[26] Kneser, R., and Ney, H. Improved backing-off for n-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (1995), 181–184.
[27] Koller, D., and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press (2009).
[28] Kumar, S., and Byrne, W. Minimum Bayes-Risk word alignment of bilingual texts. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2002) (2002), 140–147.
[29] Matusov, E., Ueffing, N., and Ney, H. Computing consensus translation from multiple machine translation systems using enhanced hypotheses alignment. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2006), 33–40.
[30] Miller, G. A. WordNet: A lexical database for English. Communications of the ACM 38, 11 (1995), 39–41.
[31] Mnih, A., and Teh, Y. W. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the International Conference on Machine Learning (2012).
[32] Mochihashi, D., Yamada, T., and Ueda, N. Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In Proceedings of ACL-IJCNLP 2009 (2009), 100–108.
[33] Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press (2012).
[34] Och, F., and Ney, H. A systematic comparison of various statistical alignment models. Computational Linguistics 29, 1 (2003), 19–51.
[35] Okita, T., and van Genabith, J. Minimum Bayes risk decoding with enlarged hypothesis space in system combination. In Proceedings of the 13th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2012), LNCS 7182, Part II, A. Gelbukh (Ed.) (2012), 40–51.
[36] Okita, T., and Way, A. Hierarchical Pitman-Yor language model in machine translation. In Proceedings of the International Conference on Asian Language Processing (IALP 2010) (2010).
[37] Okita, T., and Way, A. Pitman-Yor process-based language model for Machine Translation. International Journal on Asian Language Processing 21, 2 (2010), 57–70.
[38] Okita, T., and Way, A. Given bilingual terminology in statistical machine translation: MWE-sensitive word alignment and hierarchical Pitman-Yor process-based translation model smoothing. In Proceedings of the 24th International Florida Artificial Intelligence Research Society Conference (FLAIRS-24) (2011), 269–274.
[39] Owczarzak, K., van Genabith, J., and Way, A. Evaluating machine translation with LFG dependencies. Machine Translation 21, 2 (2007), 95–119.
[40] Rabiner, L. R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 2 (1989), 257–286.
[41] Schwenk, H. Continuous space language models. Computer Speech and Language 21 (2007), 492–518.
[42] Schwenk, H. Continuous space language models for statistical machine translation. The Prague Bulletin of Mathematical Linguistics 83 (2010), 137–146.
[43] Schwenk, H., Rousseau, A., and Attik, M. Large, pruned or continuous space language models on a GPU for statistical machine translation. In Proceedings of the NAACL Workshop on the Future of Language Modeling (2012).
[44] Sontag, D. Approximate inference in graphical models using LP relaxations. PhD thesis, Massachusetts Institute of Technology (2010).
[45] Sontag, D., and Roy, D. M. The complexity of inference in latent Dirichlet allocation. In Advances in Neural Information Processing Systems 24 (NIPS) (2011).
[46] Steyvers, M., and Griffiths, T. Probabilistic topic models. In Handbook of Latent Semantic Analysis. Psychology Press (2007).
[47] Stolcke, A. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing (2002), 901–904.
[48] Teh, Y. W. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL-06) (2006), 985–992.
[49] Tromble, R., Kumar, S., Och, F., and Macherey, W. Lattice minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (2008), 620–629.
[50] Veale, T. Exploding the Creativity Myth: The Computational Foundations of Linguistic Creativity. London: Bloomsbury Academic (2012).
[51] Wood, F., Archambeau, C., Gasthaus, J., James, L., and Teh, Y. W. A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning (2009), 1129–1136.
