Phrase Based Language Model for Statistical Machine Translation

Jia Xu and Geliang Chen*
IIIS, Tsinghua University
[email protected], [email protected]

Working Paper, January 20, 2015
arXiv:1501.04324v1 [cs.CL] 18 Jan 2015

* This version of the paper was submitted for review to EMNLP 2013. The title, the idea, and the content of this paper were presented by the first author at the machine translation group meeting of the MSRA-NLC lab (Microsoft Research Asia, Natural Language Computing) on July 16, 2013.

Abstract

We consider phrase based language models (LMs), which generalize the commonly used word level models. A similar concept of phrase based LMs appears in speech recognition, but it is rather specialized and thus less suitable for machine translation (MT). In contrast to the dependency LM, we first introduce exhaustive phrase-based LMs tailored for MT use. Preliminary experimental results show that our approach outperforms word based LMs with respect to perplexity and translation quality.

1 Introduction

Statistical language models, which estimate the distribution of various natural language phenomena, are crucial for many applications. In machine translation, an LM measures the fluency and well-formedness of a translation, and is therefore important for translation quality, see (Och, 2002) and (Koehn, Och and Marcu, 2003).

Common applications of LMs include estimating the distribution based on n-gram coverage of words, to predict words and word orders, as in (Stolcke, 2002) and (Lafferty et al., 2001). The independence assumption for each word is one of the simplifying methods widely adopted. However, it does not hold in textual data, and the underlying content structures need to be investigated, as discussed in (Gao et al., 2004).

We model the prediction of phrases and phrase orders. By considering all word sequences as phrases, the dependency inside a phrase is preserved, and the phrase level structure of a sentence can be learned from observations. This can be considered as an n-gram model over n-grams of words; the word based LM is therefore a special case of the phrase based LM in which only single-word phrases are considered. Intuitively, our approach has the following advantages:

1) Long distance dependency: The phrase based LM can capture long distance relationships easily. To capture the sentence level dependency, e.g. between the first and last word of the sentence in Table 1, we need a 7-gram word based LM, but only a 3-gram phrase based LM, if we take "played basketball" and "the day before yesterday" as phrases.

2) Consistent translation unit with phrase based MT: Some words acquire meaning only in context, such as "day" or "the" in "the day before yesterday" in Table 1. Considering frequent phrases as single units reduces the entropy of the language model. More importantly, current MT is performed on phrases, which are taken as the translation units. The translation task is to predict the next phrase, which corresponds to the phrase based LM.

3) Fewer independence assumptions in statistical models: The sentence probability is computed as the product of the single word probabilities in the word based n-gram LM, and as the product of the phrase probabilities in the phrase based n-gram LM, given their histories. The fewer words/phrases in a sentence, the fewer mistakes the LM may make due to independence assumptions on words/phrases. Once the phrase segmentation is fixed, the number of elements in the phrase based LM is much smaller than that in the word based LM. Therefore, our approach is less likely to accumulate errors due to these assumptions.

4) Phrase boundaries as additional information: We consider the different phrase segmentations of a sentence as a hidden variable, which provides additional constraints to align phrases in translation. Therefore, the constrained alignment in blocks of words can provide more information than the word based LM.

Words               John   played  basketball  the   day   before  yesterday
w_1^I               w_1    w_2     w_3         w_4   w_5   w_6     w_7
Segmentation k_1^J  k_1 = 1        k_2 = 3                         k_3 = 7
Phrases p_1^J       p_1 = w_1      p_2 = w_2 w_3      p_3 = w_4 w_5 w_6 w_7
Re-ordered          John   the day before yesterday   play basketball
Translation         约翰   昨天   打篮球

Table 1: Phrase segmentation example.
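To make the boundary notation of Table 1 concrete, here is a minimal Python sketch (ours, not part of the paper; the names `words`, `boundaries`, and `phrases_from_boundaries` are illustrative) that derives the phrase sequence p_1^J from the boundary positions k_1^J = (1, 3, 7).

```python
# Illustrative sketch (not from the paper): derive phrases p_1..p_J
# from the word sequence w_1..w_I and boundary positions k_1..k_J,
# using the segmentation of Table 1.

words = ["John", "played", "basketball", "the", "day", "before", "yesterday"]
boundaries = [1, 3, 7]  # k_1 = 1, k_2 = 3, k_3 = 7 (k_J = I)

def phrases_from_boundaries(words, boundaries):
    """Each phrase p_j spans words w_{k_{j-1}+1} .. w_{k_j} (1-based indices)."""
    phrases, prev = [], 0  # k_0 = 0
    for k in boundaries:
        phrases.append(tuple(words[prev:k]))
        prev = k
    return phrases

print(phrases_from_boundaries(words, boundaries))
# [('John',), ('played', 'basketball'), ('the', 'day', 'before', 'yesterday')]
```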
Comparison to Previous Work  In the dependency or structured LM, phrases corresponding to the grammar are considered and dependencies are extracted, as in (Gao et al., 2004) and (Shen et al., 2008). However, in phrase based SMT, even phrases violating the grammar structure may help as translation units. For instance, the partial phrase "the day before" may appear both in "the day before yesterday" and in "the day before Spring". Most importantly, the phrase candidates in our phrase based LM are the same as those in phrase based translation, and are therefore more consistent across the whole translation process, as mentioned in item 2 of Section 1.

Some researchers have proposed phrase based LMs for speech recognition. In (Kuo and Reichl, 1999) and (Tang, 2002), new phrases are added to the lexicon with different measure functions. In (Heeman and Damnati, 1997), a different LM was proposed which derives the phrase probabilities from a language model built at the lexical level. Nonetheless, these methods do not consider the dependency between phrases or the re-ordering problem, and are therefore not suitable for the MT application.

2 Phrase Based LM

We are given a sentence as a sequence of words w_1^I = w_1 w_2 ... w_i ... w_I (i ∈ {1, 2, ..., I}), where I is the sentence length.

In the word based LM (Stolcke, 2002), the probability of a sentence Pr(w_1^I)[1] is defined as the product of the probabilities of each word given its previous n-1 words:

    P(w_1^I) = \prod_{i=1}^{I} P(w_i \mid w_{i-n+1}^{i-1})    (1)

[1] The notational convention is as follows: we use the symbol Pr to denote general probability distributions with (almost) no specific assumptions. In contrast, for model-based probability distributions, we use the generic symbol P(·).

The positions of phrase boundaries on a word sequence w_1^I are indicated by k_0 ≡ 0 and K = k_1^J = k_1 k_2 ... k_j ... k_J (j ∈ {1, 2, ..., J}), where k_j ∈ {1, 2, ..., I}, k_{j-1} < k_j, k_J ≡ I, and J is the number of phrases in the sentence. We use k_j to indicate that the j-th phrase segmentation boundary is placed after the word w_{k_j} and in front of the word w_{k_j + 1}, where 1 ≤ j ≤ J. k_0 is a boundary on the left side of the first word w_1 and is defined as 0, and k_J is always placed after the last word w_I and therefore equals I.

An example is illustrated in Table 1. The English sentence (w_1^I) contains seven words (I = 7), where w_1 denotes "John", etc. The first phrase segmentation boundary is placed after the first word (k_1 = 1), the second boundary after the third word (k_2 = 3), and so on. The phrase sequence p_1^J in this sentence has a different order, on the phrase level, than that in its translation. Hence, the phrase based LM goes beyond the word based LM in learning phrase re-ordering.
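Since the segmentation K is treated as a hidden variable in the model below, a small sketch may help. This is our own illustration (not the authors' code) that enumerates every admissible boundary set k_1^J with k_{j-1} < k_j and k_J = I; the function name `all_segmentations` is an assumption.

```python
# Illustrative sketch (not from the paper): enumerate all phrase
# segmentations K = k_1..k_J of a word sequence w_1..w_I. Every
# internal position 1..I-1 may or may not carry a boundary; the last
# boundary k_J = I is always present, so the number of segmentations
# grows exponentially with I.
from itertools import combinations

def all_segmentations(num_words):
    internal = range(1, num_words)          # candidate boundaries k_j < I
    for r in range(num_words):              # choose the internal boundaries
        for cut in combinations(internal, r):
            yield list(cut) + [num_words]   # append k_J = I

for k in all_segmentations(3):
    print(k)
# [3], [1, 3], [2, 3], [1, 2, 3]
```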
(1) Model description  Given a sequence of words w_1^I and its phrase segmentation boundaries k_1^J, a sentence can also be represented as a sequence of phrases p_1^J = p_1 p_2 ... p_j ... p_J (j ∈ {1, 2, ..., J}), where each individual phrase p_j is defined as

    p_j = w_{k_{j-1}+1} ... w_{k_j} = w_{k_{j-1}+1}^{k_j}

In the phrase based LM, we consider the phrase segmentation k_1^J as a hidden variable, and Equation 1 can be extended as follows:

    Pr(w_1^I) = \sum_{K} Pr(w_1^I, K) = \sum_{k_1^J, J} Pr(p_1^J \mid k_1^J) \cdot Pr(k_1^J)    (2)

(2) Sentence probability  For the segmentation prior probability, we assume a uniform distribution for simplicity, i.e. P(k_1^J) = 1/|K|, where |K| is the number of different segmentations K, i.e. |K| = 2^I if the maximum phrase length and phrase n-gram length are not constrained. To compute Pr(w_1^I), we consider two approaches:

• Sum model (Baum-Welch): We consider all 2^I segmentation candidates. Equation 2 becomes

    Pr_sum(w_1^I) ≈ \sum_{k_1^J, J} \prod_{j=1}^{J} P(p_j \mid p_{j-n+1}^{j-1}) \cdot P(k_1^J)

• Max model (Viterbi): The sentence probability of the second model is defined as

    P_max(w_1^I) ≈ \max_{k_1^J, J} \prod_{j=1}^{J} P(p_j \mid p_{j-n+1}^{j-1}) \cdot P(k_1^J)

In practice, we select the segmentation that minimizes the perplexity of the sentence rather than the one that maximizes the probability, in order to account for length normalization.

(3) Perplexity  Sentence perplexity and text perplexity in the sum model use the same definitions as in the word based LM. Sentence perplexity in the max model is defined as

    PPL(w_1^I) = \min_{k_1^J, J} [P(w_1^I, k_1^J)]^{-1/J}

(4) Parameter estimation  We apply maximum likelihood estimation to estimate the probabilities in both the sum model and the max model:

    P(p_i \mid p_{i-n+1}^{i-1}) = C(p_{i-n+1}^{i}) / C(p_{i-n+1}^{i-1})    (3)

where C(·) is the frequency of a phrase n-gram. The unigram phrase probability is P(p) = C(p)/C, where C is the total frequency of all single phrases in the training text. Since the number of generated phrases is exponential in the sentence length, the number of parameters is huge. Therefore, we set the maximum n-gram length on the phrase level (not the phrase length) to N = 3 in the experiments.

(5) Smoothing  For unseen events, we perform Good-Turing smoothing as commonly done in word based LMs. Moreover, we interpolate between the phrase probability and the product of the single word probabilities in a phrase using a convex combination:

    P*(p_j \mid p_{j-n+1}^{j-1}) = λ P(p_j \mid p_{j-n+1}^{j-1}) + (1-λ) \frac{\prod_{i=1}^{j'} P(w_i)}{\sum_{w} P(w)}

where the phrase p_j is made up of j' words w_1^{j'}. The idea of this interpolation is to smooth the probability of a phrase consisting of j' words with a j'-word unigram probability after normalization. In our experiments, we set λ = 0.4 for convenience.
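As a minimal sketch of how the sum and max models could be evaluated by brute force (our own illustration, not the authors' implementation), the code below enumerates all segmentations, applies a uniform prior, and scores each phrase sequence with a toy phrase bigram table. The helper names and the probability floor for unseen events are assumptions, standing in for the Good-Turing smoothing and interpolation described above.

```python
# Illustrative sketch (not the authors' code): sum model and max model of
# Section 2, with a uniform segmentation prior P(K) = 1/|K| and a
# caller-supplied phrase n-gram probability (here a toy bigram table).
from itertools import combinations

def segmentations(I):
    """All boundary sets k_1..k_J with k_{j-1} < k_j and k_J = I."""
    for r in range(I):
        for cut in combinations(range(1, I), r):
            yield list(cut) + [I]

def phrases(words, k):
    """Split words into phrases p_1..p_J according to boundaries k."""
    prev, out = 0, []
    for b in k:
        out.append(tuple(words[prev:b]))
        prev = b
    return out

def sentence_scores(words, phrase_prob):
    """Return (Pr_sum, P_max) over all segmentations with a uniform prior."""
    segs = list(segmentations(len(words)))
    prior = 1.0 / len(segs)            # P(k_1^J) = 1/|K|
    total, best = 0.0, 0.0
    for k in segs:
        p_seq = phrases(words, k)
        prob = prior
        for j, p in enumerate(p_seq):  # phrase bigram history (n = 2)
            hist = p_seq[j - 1] if j > 0 else ("<s>",)
            prob *= phrase_prob(p, hist)
        total += prob                  # sum model (Baum-Welch)
        best = max(best, prob)         # max model (Viterbi)
    return total, best

# Toy phrase bigram probabilities; unseen events get a small floor here,
# whereas the paper uses Good-Turing smoothing plus interpolation.
table = {(("John",), ("<s>",)): 0.5,
         (("played", "basketball"), ("John",)): 0.3}
phrase_prob = lambda p, h: table.get((p, h), 1e-4)

print(sentence_scores(["John", "played", "basketball"], phrase_prob))
```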
(6) Algorithm for calculating phrase n-gram counts  The training task is to calculate the n-gram counts on the phrase level in Equation 3. Given a training corpus W_1^S with S sentences W_s (s = 1, 2, ..., S), our goal is to compute C(·) for all phrase n-grams whose number of phrases is no greater than N. Therefore, for each sentence w_1^I, we have to find every phrase n-gram with 0 < n ≤ N.

We use dynamic programming to collect the phrase n-grams in one sentence w_1^I:

    Q(1, d; w_1^I) = { p = w_b^d, ∀ 1 ≤ b ≤ d ≤ I }
    Q(n, d; w_1^I) = \cup_b Q(n-1, b-1; w_1^I) ⊕ (p = w_b^d), ∀ n ≤ b ≤ d ≤ I,

where Q(·) is the auxiliary function denoting the multiset of all phrase n-grams or unigrams ending at word position d (1 < n ≤ N), and b denotes the starting word position of the last phrase. {·} is a multiset, ⊕ means appending the element to each element of the multiset, and ∪ denotes the union of multisets. When appending p, we consider all b that are no less than n and no greater than d.

The phrase count C(·) is the sum over all phrase n-grams from all sentences W_1^S, with each sentence W_s = w_1^I, and |·| denotes the number of elements in a multiset:

    C(p_1^n) = \sum_{s=1}^{S} | p_1^n ∈ \cup_{d=n}^{|W_s|} Q(n, d; W_s) |
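The dynamic program above can be read as filling a table Q[n][d] of phrase n-grams whose last phrase ends at word position d. Below is a rough Python rendering of that recursion (our interpretation, not the authors' code); the function and variable names are illustrative.

```python
# Illustrative sketch (our reading of the paper's dynamic program):
# Q[n][d] holds all phrase n-grams whose last phrase ends at position d;
# Q[n][d] extends every (n-1)-gram ending at b-1 with the phrase w_b..w_d.
from collections import Counter

def phrase_ngram_counts(sentences, N=3):
    """Count phrase n-grams (n <= N) over a corpus, following Q(n, d)."""
    counts = Counter()
    for words in sentences:
        I = len(words)
        # Q[n][d]: multiset (list) of phrase n-grams ending at word position d
        Q = [[[] for _ in range(I + 1)] for _ in range(N + 1)]
        for d in range(1, I + 1):                  # unigrams: Q(1, d)
            for b in range(1, d + 1):
                Q[1][d].append((tuple(words[b - 1:d]),))
        for n in range(2, N + 1):                  # Q(n, d) from Q(n-1, b-1)
            for d in range(1, I + 1):
                for b in range(n, d + 1):          # n <= b <= d
                    last = tuple(words[b - 1:d])
                    for prev in Q[n - 1][b - 1]:
                        Q[n][d].append(prev + (last,))
        for n in range(1, N + 1):                  # accumulate C(.)
            for d in range(1, I + 1):
                counts.update(Q[n][d])
    return counts

C = phrase_ngram_counts([["John", "played", "basketball"]], N=2)
print(C[(("played", "basketball"),)])            # phrase unigram count -> 1
print(C[(("John",), ("played", "basketball"))])  # phrase bigram count -> 1
```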
3 Experiments

This is ongoing work, and we performed preliminary experiments on the IWSLT (IWSLT, 2011) task, evaluating the LM performance by measuring the LM perplexity and the MT translation performance. Because of the computational requirements, we only employed sentences containing no more than 15 words in the training corpus and no more than 10 words in the test corpora (Dev2010, Tst2010 and Tst2011), as shown in Table 2.

Data       Sentences  Words    Vocabulary
Training   54887      576778   23350
Dev2010    202        1887     636
Tst2010    247        2170     617
Tst2011    334        2916     765

Table 2: Statistics of the corpora with sentence length no greater than 15 in training and 10 in test.

We took the word based LM in Equation 1 as the baseline method (Base). We calculated the perplexities of Tst2011 with different n-gram orders using both the sum model and the max model, with and without smoothing (S.), as described in Section 2. Table 3 shows that perplexities in our approaches are all lower than those in the baseline.

n  Base   Sum   Sum+S.  Max    Max+S.
1  676.1  85.5  112.5   625.7  1129.4
2  180.8  52.6  72.1    161.1  306.2
3  162.3  52.5  72.2    140.4  266.5
4  162.5  52.6  72.3    141.1  267.6

Table 3: Perplexities on Tst2011 calculated with various n-gram LMs, n = 1, 2, 3, 4.

For MT, we selected the single best translation output based on the LM perplexity of the 100-best translation candidates, using the different LMs, as shown in Table 4. The max model with smoothing outperforms the baseline method on all three test sets, with BLEU (Papineni et al., 2002) increases of 0.3% on Dev2010, 0.45% on Tst2010, and 0.22% on Tst2011, respectively.

Model   Dev2010  Tst2010  Tst2011
Base    11.26    13.10    15.05
Word    11.92    12.93    14.76
Sum     11.86    12.77    14.80
Sum+S.  12.02    12.54    14.76
Max     11.61    12.99    15.34
Max+S.  11.56    13.55    15.27

Table 4: Translation performance on the N-best list using different LMs, in BLEU [%].

Table 5 shows two examples from Tst2010, where we can see that our max model generates better selection results than the baseline method.

Base: but we need a success
Max:  but we need a way to success .
Ref:  we certainly need one to succeed .

Base: there is a specific steps that
Max:  there is a specific steps .
Ref:  there is step-by-step instructions on this .

Table 5: Examples of sentence outputs with the baseline method and with the max model.

4 Conclusion

We showed preliminary results indicating that a phrase based LM can improve the performance of MT systems as well as the LM perplexity. We presented two phrase based models which consider phrases as the basic components of a sentence and perform exhaustive search. Our future work will focus on efficiency for a larger data track as well as improvements of the smoothing methods.

References

Alfred V. Aho and Jeffrey D. Ullman. 1972. The Theory of Parsing, Translation and Compiling, volume 1. Prentice-Hall, Englewood Cliffs, NJ.

American Psychological Association. 1983. Publications Manual. American Psychological Association, Washington, DC.

Association for Computing Machinery. 1983. Computing Reviews, 24(11):503-512.

P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. 1981. Alternation. Journal of the Association for Computing Machinery, 28(1):114-133.

C. Chelba. 1997. A structured language model. In Proc. of ACL, pp. 498-500. ACL.

J. Gao, J. Y. Nie, G. Wu, and G. Cao. 2004. Dependence language model for information retrieval. In Proc. of ACM, pp. 170-177. ACM.

Dan Gusfield. 1997. Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge, UK.

Peter A. Heeman and Geraldine Damnati. 1997. Deriving phrase-based language models. In Proc. of the 1997 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 41-48. IEEE.

IWSLT. 2011. Homepage. http://iwslt2011.org/.

P. Koehn, F. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proc. of HLT-NAACL, pp. 48-54.

Hong-Kwang Jeff Kuo and Wolfgang Reichl. 1999. Phrase-based language models for speech recognition. In EUROSPEECH.

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Intl. Conf. on Machine Learning.

F. Och. 2002. Statistical Machine Translation: From Single Word Models to Alignment Templates. Ph.D. thesis, RWTH Aachen, Germany.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL, pp. 311-318.

B. Roark, M. Saraclar, M. Collins, and M. Johnson. 2004. Discriminative language modeling with conditional random fields and the perceptron algorithm. In Proc. of ACL, pp. 47. ACL.

R. Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270-1278.

L. Shen, J. Xu, and R. M. Weischedel. 2008. A new string-to-dependency machine translation algorithm with a target dependency language model. In ACL, pp. 577-585.

A. Stolcke. 2002. SRILM - an extensible language modeling toolkit. In INTERSPEECH.

Haijiang Tang. 2002. Building Phrase Based Language Model from Large Corpus. Master's thesis, The Hong Kong University of Science and Technology, Hong Kong.
