DIALOG CONTEXT LANGUAGE MODELING WITH RECURRENT NEURAL NETWORKS

Bing Liu (1), Ian Lane (1,2)
(1) Electrical and Computer Engineering, Carnegie Mellon University
(2) Language Technologies Institute, Carnegie Mellon University
[email protected], [email protected]

ABSTRACT

In this work, we propose contextual language models that incorporate dialog-level discourse information into language modeling. Previous works on contextual language modeling treat preceding utterances as a sequence of inputs, without considering dialog interactions. We design recurrent neural network (RNN) based contextual language models that specially track the interactions between speakers in a dialog. Experiment results on the Switchboard Dialog Act Corpus show that the proposed model outperforms a conventional single-turn RNN language model by 3.3% on perplexity. The proposed models also demonstrate advantageous performance over other competitive contextual language models.

Index Terms: RNNLM, contextual language model, dialog modeling, dialog act

1. INTRODUCTION

Language models play an important role in many natural language processing systems, such as automatic speech recognition [1, 2] and machine translation systems [3, 4]. Recurrent neural network (RNN) based models [5, 6] have recently shown success in language modeling, outperforming conventional n-gram based models. Long short-term memory [7, 8] is a widely used RNN variant for language modeling due to its superior performance in capturing longer-term dependencies.

A conventional RNN-based language model uses a hidden state to represent the summary of the preceding words in a sentence without considering context signals. Mikolov et al. proposed a context-dependent RNN language model [9] by connecting a contextual vector to the RNN hidden state. This contextual vector is produced by applying Latent Dirichlet Allocation [10] on preceding text. Several other contextual language models were later proposed, using bag-of-words [11] and RNN [12] methods to learn larger context representations that extend beyond the target sentence.

The previously proposed contextual language models treat preceding sentences as a sequence of inputs, and they are suitable for document-level context modeling. In dialog modeling, however, dialog interactions between speakers play an important role. Modeling utterances in a dialog as a sequence of inputs might not well capture the pauses, turn-taking, and grounding phenomena [13] in a dialog. In this work, we propose contextual RNN language models that specially track the interactions between speakers. We expect such models to generate better representations of the dialog context.

The remainder of the paper is organized as follows. In section 2, we introduce the background on contextual language modeling. In section 3, we describe the proposed dialog context language models. Section 4 discusses the evaluation procedures and results. Section 5 concludes the work.

2. BACKGROUND

2.1. RNN Language Model

A language model assigns a probability to a sequence of words w = (w_1, w_2, \ldots, w_T). Using the chain rule, the likelihood of the word sequence w can be factorized as:

    P(w) = P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t | w_{<t})    (1)

At time step t, the system input is the embedding of the word at index t, and the system output is the probability distribution of the word at index t+1. The RNN hidden state h_t encodes the information of the word sequence up to the current step:

    h_t = RNN(h_{t-1}, w_t)    (2)
    P(w_{t+1} | w_{<t+1}) = softmax(W_o h_t + b_o)    (3)

where W_o and b_o are the output layer weights and biases.
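To make Eqs. (1)-(3) concrete, the sketch below implements a single-turn RNN language model in PyTorch. This is a minimal illustration rather than the authors' code: the class name, layer sizes, and the toy usage at the end are assumptions, and the LSTM unit follows the choice described later in Section 4.3.

```python
# Minimal single-turn RNN language model sketch for Eqs. (1)-(3).
import torch
import torch.nn as nn


class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # The paper uses an LSTM cell as the basic RNN unit (Section 4.3).
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Output layer weights W_o and biases b_o of Eq. (3).
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, state=None):
        # word_ids: (batch, seq_len) indices of w_1 .. w_T
        embedded = self.embedding(word_ids)        # input w_t in Eq. (2)
        hidden, state = self.rnn(embedded, state)  # hidden state h_t in Eq. (2)
        logits = self.output(hidden)               # pre-softmax scores in Eq. (3)
        return logits, state


# Toy usage: next-word distributions for a batch of two 4-word prefixes.
model = RNNLanguageModel()
tokens = torch.randint(0, 10000, (2, 4))
logits, _ = model(tokens)
probs = torch.softmax(logits, dim=-1)              # P(w_{t+1} | w_{<t+1})
```

During training, the logits at step t are scored against the next word w_{t+1}, which realizes the factorization in Eq. (1).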
2.2. Contextual RNN Language Model

A number of methods have been proposed to introduce contextual information into the language model. Mikolov and Zweig [9] proposed a topic-conditioned RNNLM by introducing a contextual real-valued vector to the RNN hidden state. The contextual vector was created by performing LDA [10] on preceding text. Wang and Cho [11] studied introducing corpus-level discourse information into language modeling. A number of context representation methods were explored, including bag-of-words, sequence of bag-of-words, and sequence of bag-of-words with attention. Lin et al. [14] proposed using a hierarchical RNN for document modeling. Compared to using bag-of-words and sequence of bag-of-words for document context representation, a hierarchical RNN can better model the order of words in preceding text, at the cost of increased computational complexity. These contextual language models focused on contextual information at the document level. Tran et al. [15] further proposed a contextual language model that considers information at the inter-document level. They claimed that by utilizing the structural information from a tree-structured document set, language modeling performance was largely improved.

Fig. 1. Context dependent RNN language model. (Diagram: at each step, the inputs [c, w_t] feed the hidden state h_t, which produces the output prediction w_{t+1}.)

3. METHODS

The previously proposed contextual language models focus on applying context by encoding preceding text, without considering interactions in dialogs. These models may not be well suited for dialog language modeling, as they are not designed to capture dialog interactions, such as clarifications and confirmations. By specially designing the models to learn dialog interactions, we expect them to generate better representations of the dialog context, and thus lower perplexity on the target dialog turn or utterance.

In this section, we first explain the context-dependent RNN language model that operates on the utterance or turn level. Following that, we describe the two proposed contextual language models that utilize the dialog-level context.

3.1. Context Dependent RNN LM

Let D = (U_1, U_2, \ldots, U_K) be a dialog that has K turns and involves two speakers. Each turn may have one or more utterances. The kth turn U_k = (w_1, w_2, \ldots, w_{T_k}) is represented as a sequence of T_k words. Conditioning on the information of the preceding text in the dialog, the probability of the target turn U_k can be calculated as:

    P(U_k | U_{<k}) = \prod_{t=1}^{T_k} P(w_t^{U_k} | w_{<t}^{U_k}, U_{<k})    (4)

where U_{<k} denotes all previous turns before U_k, and w_{<t}^{U_k} denotes all previous words before the tth word in turn U_k.

In the context-dependent RNN language model, the context vector c is connected to the RNN hidden state together with the input word embedding at each time step (Figure 1). This is similar to the context-dependent RNN language model proposed in [9], other than that the context vector is not connected directly to the RNN output layer. With the additional context vector input c, the RNN state h_t is updated as:

    h_t = RNN(h_{t-1}, [w_t, c])    (5)
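The per-step concatenation [w_t, c] of Eq. (5) can be sketched as below, reusing the PyTorch setting of the previous sketch; the class name ContextDependentRNNLM and the dimensions are again illustrative assumptions. The same turn-level module is reused by the dialog-level sketches for Sections 3.3 and 3.4.

```python
# Sketch of the context-dependent update in Eq. (5): a fixed context vector c
# is concatenated with the word embedding at every time step of the turn.
import torch
import torch.nn as nn


class ContextDependentRNNLM(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, ctx_dim=512, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # The RNN input is [w_t, c], so its input size is embed_dim + ctx_dim.
        self.rnn = nn.LSTM(embed_dim + ctx_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, context, init_state=None):
        # word_ids: (batch, T_k); context: (batch, ctx_dim), constant within the turn.
        embedded = self.embedding(word_ids)
        # Repeat c along the time axis and concatenate with each w_t (Eq. 5).
        ctx = context.unsqueeze(1).expand(-1, word_ids.size(1), -1)
        rnn_in = torch.cat([embedded, ctx], dim=-1)
        hidden, state = self.rnn(rnn_in, init_state)
        return self.output(hidden), state
```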
3.2. Context Representations

In neural network based language models, the dialog context can be represented as a dense continuous vector. This context vector can be produced in a number of ways.

One simple approach is to use a bag of word embeddings. However, a bag-of-word-embeddings context representation does not take word order into consideration. An alternative approach is to use an RNN to read the preceding text. The last hidden state of the RNN encoder can be seen as the representation of the text and be used as the context vector for the next turn. To generate a document-level context representation, one may cascade all sentences in a document by removing the sentence boundaries. The last RNN hidden state of the previous utterance then serves as the initial RNN state of the next utterance. As in [12], we refer to this model as DRNNLM. Alternatively, in the CCDCLM model proposed in [12], the last RNN hidden state of the previous utterance is fed to the RNN hidden state of the target utterance at each time step.

3.3. Interactive Dialog Context LM

The previously proposed contextual language models, such as DRNNLM and CCDCLM, treat dialog history as a sequence of inputs, without modeling dialog interactions. A dialog turn from one speaker may not only be a direct response to the other speaker's query, but is also likely to be a continuation of the speaker's own previous statement. Thus, when modeling turn k in a dialog, we propose to connect the last RNN state of turn k-2 directly to the starting RNN state of turn k, instead of letting it propagate through the RNN for turn k-1. The last RNN state of turn k-1 serves as the context vector to turn k, which is fed to turn k's RNN hidden state at each time step together with the word input. The model architecture is as shown in Figure 2. The context vector c and the initial RNN hidden state for the kth turn, h_0^{U_k}, are defined as:

    c = h_{T_{k-1}}^{U_{k-1}},    h_0^{U_k} = h_{T_{k-2}}^{U_{k-2}}    (6)

where h_{T_{k-1}}^{U_{k-1}} represents the last RNN hidden state of turn k-1. This model also allows the context signal from previous turns to propagate through the network in fewer steps, which helps to reduce information loss along the propagation. We refer to this model as the Interactive Dialog Context Language Model (IDCLM).

Fig. 2. Interactive Dialog Context Language Model (IDCLM). (Diagram: speaker A's turns k-2 and k and speaker B's turn k-1, each producing word predictions.)
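A minimal sketch of the IDCLM state routing in Eq. (6) is shown below, assuming the ContextDependentRNNLM module sketched above. The helper name run_dialog_idclm, the processing of one dialog at a time, and the zero-initialized LSTM cell state are assumptions made for illustration; only the routing itself, where turn k starts from the last state of turn k-2 and reads the last state of turn k-1 as its per-step context, follows the paper.

```python
# Sketch of IDCLM turn wiring (Eq. 6) on top of the turn-level model above.
import torch


def run_dialog_idclm(model, turns, hidden_dim=512):
    """turns: list of (1, T_k) LongTensors, alternating between the two speakers."""
    zero = torch.zeros(1, hidden_dim)
    last_states, all_logits = [], []
    for k, turn in enumerate(turns):
        # Context c = last hidden state of turn k-1 (the other speaker), Eq. (6).
        context = last_states[k - 1] if k >= 1 else zero
        # Initial state h_0^{U_k} = last hidden state of turn k-2 (same speaker), Eq. (6).
        h0 = last_states[k - 2] if k >= 2 else zero
        init = (h0.unsqueeze(0), torch.zeros_like(h0).unsqueeze(0))  # (h, c) for the LSTM
        logits, (h_n, _) = model(turn, context, init)
        last_states.append(h_n.squeeze(0))
        all_logits.append(logits)
    return all_logits
```

In this sketch the stored turn states keep their computation history, so training on the final turn backpropagates through the preceding turns as well.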
3.4. External State Interactive Dialog Context LM

The propagation of dialog context can be seen as a series of updates of a hidden dialog context state along the growing dialog. IDCLM models these hidden dialog context state changes implicitly in the turn-level RNN state. Such dialog context state updates can also be modeled in a separate RNN. As shown in the architecture in Figure 3, we use an external RNN to model the context changes explicitly. The input to the external state RNN is the vector representation of the previous dialog turns. The external state RNN output serves as the dialog context for the next turn:

    s_{k-1} = RNN_{ES}(s_{k-2}, h_{T_{k-1}}^{U_{k-1}})    (7)

where s_{k-1} is the output of the external state RNN after the processing of turn k-1. The context vector c and the initial RNN hidden state for the kth turn, h_0^{U_k}, are then defined as:

    c = s_{k-1},    h_0^{U_k} = h_{T_{k-2}}^{U_{k-2}}    (8)

We refer to this model as the External State Interactive Dialog Context Language Model (ESIDCLM).

Fig. 3. External State Interactive Dialog Context Language Model (ESIDCLM). (Diagram: a hidden dialogue state s_{k-2}, s_{k-1} links speaker A's turns k-2 and k and speaker B's turn k-1.)

Compared to IDCLM, ESIDCLM relieves the turn-level RNN of some of its burden by using an external RNN to model dialog context state changes. One drawback of ESIDCLM is that there are additional RNN model parameters to be learned during model training, which may make the model more prone to overfitting when the training data size is limited.
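The external dialog-state update of Eqs. (7)-(8) differs from IDCLM only in where the per-step context vector comes from. Below is a sketch under the same assumptions as the IDCLM example; the choice of a GRUCell for RNN_ES is an assumption, since the paper does not specify the cell type of the external state RNN.

```python
# Sketch of ESIDCLM: an external recurrent cell consumes the last hidden
# state of each finished turn, and its output replaces the IDCLM context.
import torch
import torch.nn as nn


def run_dialog_esidclm(model, state_rnn, turns, hidden_dim=512):
    """turns: list of (1, T_k) LongTensors from one dialog, in order."""
    s = torch.zeros(1, hidden_dim)                    # external dialog state
    zero = torch.zeros(1, hidden_dim)
    last_states, all_logits = [], []
    for k, turn in enumerate(turns):
        context = s                                   # c = s_{k-1}, Eq. (8)
        h0 = last_states[k - 2] if k >= 2 else zero   # h_0^{U_k} = h_{T_{k-2}}^{U_{k-2}}
        init = (h0.unsqueeze(0), torch.zeros_like(h0).unsqueeze(0))
        logits, (h_n, _) = model(turn, context, init)
        last_turn_state = h_n.squeeze(0)              # h_{T_k}^{U_k}
        last_states.append(last_turn_state)
        s = state_rnn(last_turn_state, s)             # Eq. (7), applied after each turn
        all_logits.append(logits)
    return all_logits


# Toy usage: three alternating turns (speaker A, B, A) from one dialog.
turn_model = ContextDependentRNNLM()
external_rnn = nn.GRUCell(512, 512)                   # plays the role of RNN_ES
turns = [torch.randint(0, 10000, (1, n)) for n in (6, 4, 5)]
logits_per_turn = run_dialog_esidclm(turn_model, external_rnn, turns)
```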
This suggests DRNNLM - 60.1 58.6 59.1 that the proposed contextual RNN language models exploit CCDCLM - 63.9 61.4 62.2 the context to achieve superior prediction on certain but not IDCLM - - 58.8 58.6 all POS types. Further exploration on the model design is ESIDCLM - - 58.4 58.5 requiredifwewanttobettercapturewordsofaspecifictype. DACLM - 58.2 57.9 58.0 Table3. Perplexityrelativechange(%)perdialogacttag. As can be seen from the results, all RNN based mod- els outperform the n-gram model by large margin. The DATag IDCLM ESIDCLM DACLM BoW-Context-RNNLMandDRNNLMbeattheSingle-Turn- Statement-non-opinion -1.8 -0.5 -1.6 RNNLM consistently. Our implementation of the context Acknowledge -2.6 11.4 -16.3 dependent CCDCLM performs worse than Single-Turn- Statement-opinion 4.9 -0.9 -1.0 RNNLM. This might due to fact that the target turn word Agree/Accept 14.7 2.7 -15.1 prediction depends too much on the previous turn context Appreciation 0.7 -3.8 -6.5 vector,whichconnectsdirectlytothehiddenstateofcurrent turnRNNateachtimestep.Themodelperformanceontrain- For the dialog act tag based results in Table 3, the ing set might not generalize well during inference given the three contextual models show consistent performance gain limitedsizeofthetrainingset. on Statement-non-opinion type utterances. The perplexity TheproposedIDCLMandESIDCLMbeatthesingleturn changesforotherdialogacttagsvaryfordifferentmodels. RNNLMconsistentlyunderdifferentcontextturnsizes. ES- IDCLM shows the best language modeling performance un- 5. CONCLUSIONS der dialog turn size of 3 and 5, outperforming IDCLM by a smallmargin. IDCLMbeatsallbaselinemodelswhenusing In this work, we propose two dialog context language mod- dialog turn size of 5, and produces slightly worse perplexity elsthatwithspecialdesigntomodeldialoginteractions. Our thanDRNNLMwhenusingdialogturnsizeof3. evaluation results on Switchboard Dialog Act Corpus show Toanalyzethebestpotentialgainthatmaybeachievedby that the proposed model outperform conventional RNN lan- introducinglinguisticcontext,wecomparetheproposedcon- guage model by 3.3%. The proposed models also illustrate textualmodelstoDACLM,themodelthatusestruedialogact advantageous performance over several competitive contex- historyfordialogcontextmodeling. AsshowninTable1,the tuallanguagemodels. Perplexityoftheproposeddialogcon- gapbetweenourproposedmodelsandDACLMisnotwide. text language models is higher than that of the model using Thisgivesapositivehintthattheproposedcontextualmodels true dialog act tags as context by a small margin. This in- mayimplicitlycapturethedialogcontextstatechanges. dicates that the proposed model may implicitly capture the For fine-grained analyses of the model performance, we dialogcontextstateforlanguagemodeling. 6. REFERENCES [13] Herbert H Clark and Susan E Brennan, “Grounding in communication,” Perspectives on socially shared cog- [1] Lawrence Rabiner and Biing-Hwang Juang, “Funda- nition,vol.13,no.1991,pp.127–149,1991. mentalsofspeechrecognition,” 1993. [14] RuiLin, ShujieLiu, MuyunYang, MuLi, MingZhou, [2] William Chan, Navdeep Jaitly, Quoc Le, and Oriol and Sheng Li, “Hierarchical recurrent neural network Vinyals, “Listen, attend and spell: A neural network for document modeling,” in Proceedings of the 2015 forlargevocabularyconversationalspeechrecognition,” ConferenceonEmpiricalMethodsinNaturalLanguage in 2016 IEEE International Conference on Acoustics, Processing,2015,pp.899–907. Speech and Signal Processing (ICASSP). IEEE, 2016, pp.4960–4964. 
4.4. Results and Analysis

The experiment results on language modeling perplexity for models using different dialog turn sizes are shown in Table 1. The K value indicates the number of turns in the dialog. Perplexity is calculated on the last turn, with the preceding turns used as context to the model.

Table 1. Language modeling perplexities on the SwDA corpus with various dialog context turn sizes (K).

    Model               | K=1  | K=2  | K=3  | K=5
    5-gram KN           | 65.7 | -    | -    | -
    Single-Turn-RNNLM   | 60.4 | -    | -    | -
    BoW-Context-RNNLM   | -    | 59.6 | 59.2 | 58.9
    DRNNLM              | -    | 60.1 | 58.6 | 59.1
    CCDCLM              | -    | 63.9 | 61.4 | 62.2
    IDCLM               | -    | -    | 58.8 | 58.6
    ESIDCLM             | -    | -    | 58.4 | 58.5
    DACLM               | -    | 58.2 | 57.9 | 58.0

As can be seen from the results, all RNN based models outperform the n-gram model by a large margin. The BoW-Context-RNNLM and DRNNLM beat the Single-Turn-RNNLM consistently. Our implementation of the context dependent CCDCLM performs worse than the Single-Turn-RNNLM. This might be due to the fact that the target turn word prediction depends too much on the previous turn context vector, which connects directly to the hidden state of the current turn RNN at each time step. The model performance on the training set might not generalize well during inference, given the limited size of the training set.

The proposed IDCLM and ESIDCLM beat the single-turn RNNLM consistently under different context turn sizes. ESIDCLM shows the best language modeling performance under dialog turn sizes of 3 and 5, outperforming IDCLM by a small margin. IDCLM beats all baseline models when using a dialog turn size of 5, and produces slightly worse perplexity than DRNNLM when using a dialog turn size of 3.

To analyze the best potential gain that may be achieved by introducing linguistic context, we compare the proposed contextual models to DACLM, the model that uses the true dialog act history for dialog context modeling. As shown in Table 1, the gap between our proposed models and DACLM is not wide. This gives a positive hint that the proposed contextual models may implicitly capture the dialog context state changes.

For fine-grained analyses of the model performance, we further compute the test set perplexity per POS tag and per dialog act tag. We selected the most frequent POS tags and dialog act tags in the SwDA corpus, and report the tag-based perplexity relative changes (%) of the proposed models compared to the Single-Turn-RNNLM. A negative number indicates a performance gain.

Table 2. Perplexity relative change (%) per POS tag.

    POS Tag | IDCLM | ESIDCLM | DACLM
    PRP     | -16.8 | -5.8    | -10.1
    IN      | -2.0  | -5.5    | -1.8
    RB      | -4.1  | -8.9    | -4.3
    NN      | 13.4  | 8.1     | 2.3
    UH      | -0.4  | 7.7     | -9.7

Table 2 shows the model perplexity per POS tag. All three context dependent models produce consistent performance gains over the Single-Turn-RNNLM for pronouns, prepositions, and adverbs, with pronouns having the largest perplexity improvement. However, the proposed contextual models are less effective in capturing nouns. This suggests that the proposed contextual RNN language models exploit the context to achieve superior prediction on certain but not all POS types. Further exploration of the model design is required if we want to better capture words of a specific type.

Table 3. Perplexity relative change (%) per dialog act tag.

    DA Tag                | IDCLM | ESIDCLM | DACLM
    Statement-non-opinion | -1.8  | -0.5    | -1.6
    Acknowledge           | -2.6  | 11.4    | -16.3
    Statement-opinion     | 4.9   | -0.9    | -1.0
    Agree/Accept          | 14.7  | 2.7     | -15.1
    Appreciation          | 0.7   | -3.8    | -6.5

For the dialog act tag based results in Table 3, the three contextual models show consistent performance gains on Statement-non-opinion type utterances. The perplexity changes for the other dialog act tags vary for different models.

5. CONCLUSIONS

In this work, we propose two dialog context language models that are specially designed to model dialog interactions. Our evaluation results on the Switchboard Dialog Act Corpus show that the proposed models outperform a conventional RNN language model by 3.3% on perplexity. The proposed models also demonstrate advantageous performance over several competitive contextual language models. The perplexity of the proposed dialog context language models is higher than that of the model using true dialog act tags as context by only a small margin. This indicates that the proposed models may implicitly capture the dialog context state for language modeling.

6. REFERENCES

[1] Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition," 1993.
[2] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
[3] Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin, "A statistical approach to machine translation," Computational Linguistics, vol. 16, no. 2, pp. 79–85, 1990.
[4] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[5] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur, "Recurrent neural network based language model," in Interspeech, 2010, vol. 2, p. 3.
[6] Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur, "Extensions of recurrent neural network language model," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, pp. 5528–5531.
[7] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[8] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney, "LSTM neural networks for language modeling," in Interspeech, 2012, pp. 194–197.
[9] Tomas Mikolov and Geoffrey Zweig, "Context dependent recurrent neural network language model," in SLT, 2012, pp. 234–239.
[10] David M. Blei, Andrew Y. Ng, and Michael I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[11] Tian Wang and Kyunghyun Cho, "Larger-context language modelling," arXiv preprint arXiv:1511.03729, 2015.
[12] Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein, "Document context language models," arXiv preprint arXiv:1511.03962, 2015.
[13] Herbert H. Clark and Susan E. Brennan, "Grounding in communication," Perspectives on Socially Shared Cognition, vol. 13, no. 1991, pp. 127–149, 1991.
[14] Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li, "Hierarchical recurrent neural network for document modeling," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 899–907.
[15] Quan Hung Tran, Ingrid Zukerman, and Gholamreza Haffari, "Inter-document contextual language model," in Proceedings of NAACL-HLT, 2016, pp. 762–766.
[16] Stanley F. Chen and Joshua Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 1996, pp. 310–318.
[17] T. Mikolov and J. Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, 2013.
[18] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[19] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.
