Hierarchical Recurrent Attention Network for Response Generation

Chen Xing 1,2 *, Wei Wu 3, Yu Wu 4, Ming Zhou 3, Yalou Huang 1,2, Wei-Ying Ma 3
1 College of Computer and Control Engineering, Nankai University, Tianjin, China
2 College of Software, Nankai University, Tianjin, China
3 Microsoft Research, Beijing, China
4 State Key Lab of Software Development Environment, Beihang University, Beijing, China
{v-chxing,wuwei,v-wuyu,mingzhou,wyma}@microsoft.com, [email protected]
* The work was done when the first author was an intern in Microsoft Research Asia.

Abstract

We study multi-turn response generation in chatbots, where a response is generated according to a conversation context. Existing work has modeled the hierarchy of the context, but does not pay enough attention to the fact that words and utterances in the context are differentially important. As a result, such models may lose important information in the context and generate irrelevant responses. We propose a hierarchical recurrent attention network (HRAN) to model both aspects in a unified framework. In HRAN, a hierarchical attention mechanism attends to important parts within and among utterances with word level attention and utterance level attention respectively. With the word level attention, hidden vectors of a word level encoder are synthesized as utterance vectors and fed to an utterance level encoder to construct hidden representations of the context. The hidden vectors of the context are then processed by the utterance level attention and formed into context vectors for decoding the response. Empirical studies on both automatic evaluation and human judgment show that HRAN can significantly outperform state-of-the-art models for multi-turn response generation.

Figure 1: An example of multi-turn conversation

1 Introduction

Conversational agents include task-oriented dialog systems, which are built in vertical domains for specific tasks (Young et al., 2013; Boden, 2006; Wallace, 2009; Young et al., 2010), and non-task-oriented chatbots, which aim to realize natural and human-like conversations with people regarding a wide range of issues in open domains (Jafarpour et al., 2010). A common practice to build a chatbot is to learn a response generation model within an encoder-decoder framework from large scale message-response pairs (Shang et al., 2015; Vinyals and Le, 2015). Such models ignore conversation history when responding, which is contradictory to the nature of real conversation between humans. To resolve the problem, researchers have taken conversation history into consideration and proposed response generation for multi-turn conversation (Sordoni et al., 2015; Serban et al., 2015; Serban et al., 2016b; Serban et al., 2016c).

In this work, we study multi-turn response generation for open domain conversation in chatbots, in which we try to learn a response generation model from responses and their contexts. A context refers to a message and several utterances in its previous turns. In practice, when a message comes, the model takes the context as input and generates a response as the next turn. Multi-turn conversation requires a model to generate a response relevant to the whole context. The complexity of the task lies in two aspects: 1) a conversation context has a hierarchical structure (words form an utterance, and utterances form the context) and has two levels of sequential relationships
among both words and utterances within the structure; 2) not all parts of the context are equally important to response generation. Words are differentially informative and important, and so are the utterances. State-of-the-art methods such as HRED (Serban et al., 2016a) and VHRED (Serban et al., 2016c) focus on modeling the hierarchy of the context, whereas there is little exploration of how to select important parts from the context, although this is often a crucial step for generating a proper response. Without this step, existing models may lose important information in the context and generate irrelevant responses.[1] Figure 1 gives an example from our data to illustrate the problem. The context is a conversation between two speakers about height and boyfriends; therefore, to respond to the context, words like "girl" and "boyfriend" and numbers indicating height such as "160" and "175" are more important than "not good-looking". Moreover, u_1 and u_4 convey the main semantics of the context, and are therefore more important than the others for generating a proper response. Without modeling word and utterance importance, the state-of-the-art model VHRED (Serban et al., 2016c) misses these important points and gives the response "are you a man or a woman", which would be OK if only u_3 were left, but is nonsense given the whole context. After paying attention to the important words and utterances, we can have a reasonable response like "No, I don't care much about height" (the response is generated by our model, as will be seen in the experiments).

[1] Note that one can simply concatenate all utterances and employ the classic sequence-to-sequence model with attention to model word importance in generation. This method, however, loses utterance relationships and results in bad generation quality, as will be seen in the experiments.

We aim to model the hierarchy and the important parts of contexts in a unified framework. Inspired by the success of the attention mechanism in single-turn response generation (Shang et al., 2015), we propose a hierarchical recurrent attention network (HRAN) for multi-turn response generation, in which we introduce a hierarchical attention mechanism to dynamically highlight important parts of the word sequences and the utterance sequence when generating a response. Specifically, HRAN is built in a hierarchical structure. At the bottom of HRAN, a word level recurrent neural network (RNN) encodes each utterance into a sequence of hidden vectors. In the generation of each word in the response, a word level attention mechanism assigns a weight to each vector in the hidden sequence of an utterance and forms an utterance vector as a linear combination of the vectors. Important hidden vectors correspond to important parts of the utterance with respect to the generation of the word, and contribute more to the formation of the utterance vector. The utterance vectors are then fed to an utterance level RNN which constructs hidden representations of the context. Different from the classic attention mechanism, the word level attention mechanism in HRAN depends on both the decoder and the utterance level RNN. Thus, both the currently generated part of the response and the content of the context can help select important parts in utterances. At the third layer, an utterance attention mechanism attends to important utterances in the utterance sequence and summarizes the sequence as a context vector. Finally, at the top of HRAN, a decoder takes the context vector as input and generates the word in the response. HRAN mirrors the data structure in multi-turn response generation by growing from words to utterances and then from utterances to the output. It extends the architecture of current hierarchical response generation models with a hierarchical attention mechanism which not only results in better generation quality, but also provides insight into which parts of an utterance and which utterances in the context contribute to response generation.

We conduct an empirical study on large scale open domain conversation data and compare our model with state-of-the-art models using both automatic evaluation and side-by-side human comparison. The results show that on both metrics our model can significantly outperform existing models for multi-turn response generation.
We release our source code and data at https://github.com/LynetteXing1991/HRAN.

The contributions of the paper include (1) the proposal of attending to important parts in contexts in multi-turn response generation; (2) the proposal of a hierarchical recurrent attention network which models the hierarchy of contexts, word importance, and utterance importance in a unified framework; and (3) empirical verification of the effectiveness of the model by both automatic evaluation and human judgment.

2 Related Work

Most existing effort on response generation has been devoted to single-turn conversation. Starting from the basic sequence to sequence model (Sutskever et al., 2014), various models (Shang et al., 2015; Vinyals and Le, 2015; Li et al., 2015; Xing et al., 2016; Li et al., 2016) have been proposed under an encoder-decoder framework to improve generation quality from different perspectives such as relevance, diversity, and personality. Recently, multi-turn response generation has drawn attention from academia. For example, Sordoni et al. (2015) proposed DCGM, where context information is encoded with a multi-layer perceptron (MLP). Serban et al. (2016a) proposed HRED, which models contexts in a hierarchical encoder-decoder framework. Under the architecture of HRED, more variants, including VHRED (Serban et al., 2016c) and MrRNN (Serban et al., 2016b), have been proposed in order to introduce latent and explicit variables into the generation process. In this work, we also study multi-turn response generation. Different from the existing models, which do not model word and utterance importance in generation, our hierarchical recurrent attention network simultaneously models the hierarchy of contexts and the importance of words and utterances in a unified framework.

The attention mechanism was first proposed for machine translation (Bahdanau et al., 2014; Cho et al., 2015) and was quickly applied to single-turn response generation afterwards (Shang et al., 2015; Vinyals and Le, 2015). Recently, Yang et al. (2016) proposed a hierarchical attention network for document classification in which two levels of attention mechanisms are used to model the contributions of words and sentences to the classification decision. Seo et al. (2016) proposed a hierarchical attention network to precisely attend to objects of different scales and shapes in images. Inspired by these works, we extend the attention mechanism for single-turn response generation to a hierarchical attention mechanism for multi-turn response generation. To the best of our knowledge, we are the first to apply the hierarchical attention technique to response generation in chatbots.

3 Problem Formalization

Suppose that we have a data set D = {(U_i, Y_i)}_{i=1}^N. ∀i, (U_i, Y_i) consists of a response Y_i = (y_{i,1}, ..., y_{i,T_i}) and its context U_i = (u_{i,1}, ..., u_{i,m_i}), with y_{i,j} the j-th word, u_{i,m_i} the message, and (u_{i,1}, ..., u_{i,m_i-1}) the utterances in the previous turns. In this work, we require m_i ≥ 2, and thus each context has at least one utterance as conversation history. ∀j, u_{i,j} = (w_{i,j,1}, ..., w_{i,j,T_{i,j}}), where w_{i,j,k} is the k-th word. We aim to estimate a generation probability p(y_1, ..., y_T | U) from D, and thus, given a new conversation context U, we can generate a response Y = (y_1, ..., y_T) according to p(y_1, ..., y_T | U).

In the following, we will elaborate how to construct p(y_1, ..., y_T | U) and how to learn it.
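For concreteness, the data layout assumed by this formulation can be sketched as follows. The type names, the helper, and the toy dialogue are purely illustrative; they are not taken from the paper's released data or code.

```python
from typing import List, Tuple

# A word is a token string; an utterance u_{i,j} is a list of words.
Utterance = List[str]
# A context U_i = (u_{i,1}, ..., u_{i,m_i}); the last element u_{i,m_i} is the message.
Context = List[Utterance]
# One training example (U_i, Y_i): a context and its response.
Example = Tuple[Context, Utterance]


def has_history(example: Example) -> bool:
    """The paper requires m_i >= 2, i.e., at least one utterance before the message."""
    context, _ = example
    return len(context) >= 2


# Illustrative example loosely echoing Figure 1 (not taken from the data set).
example: Example = (
    [
        ["I", "am", "160", "cm", "tall"],               # u_1: earliest turn
        ["do", "you", "have", "a", "boyfriend", "?"],   # u_2: current message
    ],
    ["no", ",", "I", "do", "not"],                      # response Y
)
assert has_history(example)
```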
4 Hierarchical Recurrent Attention Network

We propose a hierarchical recurrent attention network (HRAN) to model the generation probability p(y_1, ..., y_T | U). Figure 2 gives the architecture of HRAN. Roughly speaking, before generation, HRAN employs a word level encoder to encode the information of every utterance in the context as hidden vectors. Then, when generating every word, a hierarchical attention mechanism attends to important parts within and among utterances with word level attention and utterance level attention respectively. With the two levels of attention, HRAN works in a bottom-up way: hidden vectors of utterances are processed by the word level attention and fed to an utterance level encoder to form hidden vectors of the context, and the hidden vectors of the context are further summarized by the utterance level attention into a context vector that is fed to the decoder to generate the word.

Figure 2: Hierarchical Recurrent Attention Network

In the following, we will describe the details and the learning objective of HRAN.

4.1 Word Level Encoder

Given U = (u_1, ..., u_m), we employ a bidirectional recurrent neural network with gated recurrent units (BiGRU) (Bahdanau et al., 2014) to encode each u_i, i ∈ {1, ..., m}, as hidden vectors (h_{i,1}, ..., h_{i,T_i}). Formally, suppose that u_i = (w_{i,1}, ..., w_{i,T_i}); then ∀k ∈ {1, ..., T_i}, h_{i,k} is given by

    h_{i,k} = concat(\overrightarrow{h}_{i,k}, \overleftarrow{h}_{i,k}),    (1)

where concat(·,·) is an operation defined as concatenating the two arguments together, \overrightarrow{h}_{i,k} is the k-th hidden state of a forward GRU (Cho et al., 2014), and \overleftarrow{h}_{i,k} is the k-th hidden state of a backward GRU. The forward GRU reads u_i in its order (i.e., from w_{i,1} to w_{i,T_i}) and calculates \overrightarrow{h}_{i,k} as

    z_k = \sigma(W_z e_{i,k} + V_z \overrightarrow{h}_{i,k-1}),
    r_k = \sigma(W_r e_{i,k} + V_r \overrightarrow{h}_{i,k-1}),
    s_k = \tanh(W_s e_{i,k} + V_s (\overrightarrow{h}_{i,k-1} \circ r_k)),
    \overrightarrow{h}_{i,k} = (1 - z_k) \circ s_k + z_k \circ \overrightarrow{h}_{i,k-1},    (2)

where \overrightarrow{h}_{i,0} is initialized with an isotropic Gaussian distribution, e_{i,k} is the embedding of w_{i,k}, z_k and r_k are an update gate and a reset gate respectively, \sigma(·) is a sigmoid function, and W_z, W_r, W_s, V_z, V_r, V_s are parameters. The backward GRU reads u_i in its reverse order (i.e., from w_{i,T_i} to w_{i,1}) and generates {\overleftarrow{h}_{i,k}}_{k=1}^{T_i} with a parameterization similar to that of the forward GRU.
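For readers who prefer code to notation, the following is a minimal NumPy sketch of Equations (1)-(2): a GRU cell with the gating of Equation (2) is run forward and backward over one utterance and the two states are concatenated as in Equation (1). All names and shapes are illustrative assumptions (the sizes in the usage lines follow Section 5.2); this is not the paper's released implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """One direction of the word level encoder; gating follows Equation (2)."""

    def __init__(self, input_dim, hidden_dim, rng):
        # W_* act on the input embedding e_{i,k}; V_* act on the previous hidden state.
        self.Wz, self.Wr, self.Ws = (rng.normal(0.0, 0.01, (hidden_dim, input_dim)) for _ in range(3))
        self.Vz, self.Vr, self.Vs = (rng.normal(0.0, 0.01, (hidden_dim, hidden_dim)) for _ in range(3))
        self.hidden_dim = hidden_dim

    def step(self, e, h_prev):
        z = sigmoid(self.Wz @ e + self.Vz @ h_prev)        # update gate
        r = sigmoid(self.Wr @ e + self.Vr @ h_prev)        # reset gate
        s = np.tanh(self.Ws @ e + self.Vs @ (h_prev * r))  # candidate state; "*" is the ∘ product
        return (1.0 - z) * s + z * h_prev


def encode_utterance(embeddings, forward, backward, rng):
    """Equation (1): h_{i,k} = concat(forward state, backward state) for each word of u_i."""
    h0 = lambda cell: rng.normal(0.0, 0.01, cell.hidden_dim)  # h_{i,0} from an isotropic Gaussian
    fwd_states, h = [], h0(forward)
    for e in embeddings:                        # read u_i from w_{i,1} to w_{i,T_i}
        h = forward.step(e, h)
        fwd_states.append(h)
    bwd_states, h = [], h0(backward)
    for e in reversed(embeddings):              # read u_i from w_{i,T_i} to w_{i,1}
        h = backward.step(e, h)
        bwd_states.append(h)
    bwd_states.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd_states, bwd_states)]


# Usage sketch: a 3-word utterance with 620-dimensional embeddings and
# 1000-dimensional hidden states, the sizes reported in Section 5.2.
rng = np.random.default_rng(0)
words = [rng.normal(0.0, 0.01, 620) for _ in range(3)]
encoder = (GRUCell(620, 1000, rng), GRUCell(620, 1000, rng))
hidden = encode_utterance(words, *encoder, rng)
assert hidden[0].shape == (2000,)
```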
4.2 Hierarchical Attention and Utterance Encoder

Suppose that the decoder has generated t-1 words. At step t, word level attention calculates a weight vector (α_{i,t,1}, ..., α_{i,t,T_i}) (details are described later) for {h_{i,j}}_{j=1}^{T_i} and represents utterance u_i as a vector r_{i,t}. ∀i ∈ {1, ..., m}, r_{i,t} is defined by

    r_{i,t} = \sum_{j=1}^{T_i} \alpha_{i,t,j} h_{i,j}.    (3)

{r_{i,t}}_{i=1}^m are then utilized as input of an utterance level encoder and transformed into (l_{1,t}, ..., l_{m,t}) as hidden vectors of the context. After that, utterance level attention assigns a weight β_{i,t} to l_{i,t} (details are described later) and forms a context vector c_t as

    c_t = \sum_{i=1}^{m} \beta_{i,t} l_{i,t}.    (4)

In both Equation (3) and Equation (4), the more important a hidden vector is, the larger weight it will have, and the more it will contribute to the higher level vector (i.e., the utterance vector and the context vector). This is how the two levels of attention attend to the important parts of utterances and the important utterances in generation.

More specifically, the utterance level encoder is a backward GRU which processes {r_{i,t}}_{i=1}^m from the message r_{m,t} to the earliest history r_{1,t}. Similar to Equation (2), ∀i ∈ {m, ..., 1}, l_{i,t} is calculated as

    z'_i = \sigma(W_{zl} r_{i,t} + V_{zl} l_{i+1,t}),
    r'_i = \sigma(W_{rl} r_{i,t} + V_{rl} l_{i+1,t}),
    s'_i = \tanh(W_{sl} r_{i,t} + V_{sl} (l_{i+1,t} \circ r'_i)),
    l_{i,t} = (1 - z'_i) \circ s'_i + z'_i \circ l_{i+1,t},    (5)

where l_{m+1,t} is initialized with an isotropic Gaussian distribution, z'_i and r'_i are the update gate and the reset gate of the utterance level GRU respectively, and W_{zl}, V_{zl}, W_{rl}, V_{rl}, W_{sl}, V_{sl} are parameters.

Different from the classic attention mechanism, word level attention in HRAN depends on both the hidden states of the decoder and the hidden states of the utterance level encoder. It works in reverse order, first weighting {h_{m,j}}_{j=1}^{T_m} and then moving towards {h_{1,j}}_{j=1}^{T_1} along the utterance sequence. ∀i ∈ {m, ..., 1}, j ∈ {1, ..., T_i}, the weight α_{i,t,j} is calculated as

    e_{i,t,j} = \eta(s_{t-1}, l_{i+1,t}, h_{i,j}),
    \alpha_{i,t,j} = \exp(e_{i,t,j}) / \sum_{k=1}^{T_i} \exp(e_{i,t,k}),    (6)

where l_{m+1,t} is initialized with an isotropic Gaussian distribution, s_{t-1} is the (t-1)-th hidden state of the decoder, and η(·) is a multi-layer perceptron (MLP) with tanh as an activation function.

Note that the word level attention and the utterance level encoding are dependent on each other and are conducted alternately (first attention, then encoding). The motivation for establishing the dependency between α_{i,t,j} and l_{i+1,t} is that content from the context (i.e., l_{i+1,t}) could help identify important information in utterances, especially when s_{t-1} is not informative enough (e.g., when the generated part of the response consists mostly of function words). We require the utterance encoder and the word level attention to work in reverse order because we think that, compared to history, conversation that happened after an utterance in the context is more likely to be capable of identifying important information in that utterance for generating a proper response to the context.

With {l_{i,t}}_{i=1}^m, the utterance level attention calculates a weight β_{i,t} for l_{i,t} as

    e'_{i,t} = \eta(s_{t-1}, l_{i,t}),
    \beta_{i,t} = \exp(e'_{i,t}) / \sum_{k=1}^{m} \exp(e'_{k,t}).    (7)
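The interplay of Equations (3)-(7) at a single decoding step t can be sketched as follows: word level attention on utterance u_i is scored with the decoder state s_{t-1} and with l_{i+1,t}, the utterance level encoder runs backward from the message, and utterance level attention then forms the context vector c_t. The GRU cell and the two η scorers are passed in as plain callables/objects and all names are illustrative; this is a sketch of the mechanism under our reading, not the paper's implementation.

```python
import numpy as np


def softmax(scores):
    scores = scores - np.max(scores)           # subtract the max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()


def hierarchical_attention(H, s_prev, utterance_gru, eta_word, eta_utt, rng, utt_dim):
    """
    One decoding step of Section 4.2.

    H[i]          : word level hidden vectors (h_{i,1}, ..., h_{i,T_i}) of utterance u_i
    s_prev        : decoder state s_{t-1}
    utterance_gru : GRU cell with the gating of Equation (5); step(r_{i,t}, l_{i+1,t}) -> l_{i,t}
    eta_word      : callable for η(s_{t-1}, l_{i+1,t}, h_{i,j}) in Equation (6)
    eta_utt       : callable for η(s_{t-1}, l_{i,t}) in Equation (7)

    Returns the context vector c_t of Equation (4) and the attention weights.
    """
    m = len(H)
    l_next = rng.normal(0.0, 0.01, utt_dim)    # l_{m+1,t}: isotropic Gaussian initialization
    L = [None] * m
    word_weights = [None] * m
    # Word level attention and utterance level encoding alternate, from u_m back to u_1.
    for i in range(m - 1, -1, -1):
        scores = np.array([eta_word(s_prev, l_next, h) for h in H[i]])
        alpha = softmax(scores)                                   # Equation (6)
        r_i = sum(a * h for a, h in zip(alpha, H[i]))             # Equation (3)
        l_next = utterance_gru.step(r_i, l_next)                  # Equation (5): l_{i,t}
        L[i] = l_next
        word_weights[i] = alpha
    # Utterance level attention over (l_{1,t}, ..., l_{m,t}).
    beta = softmax(np.array([eta_utt(s_prev, l) for l in L]))     # Equation (7)
    c_t = sum(b * l for b, l in zip(beta, L))                     # Equation (4)
    return c_t, word_weights, beta
```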
4.3 Decoding the Response

The decoder of HRAN is an RNN language model (Mikolov et al., 2010) conditioned on the context vectors {c_t}_{t=1}^T given by Equation (4). Formally, the probability distribution p(y_1, ..., y_T | U) is defined as

    p(y_1, \ldots, y_T \mid U) = p(y_1 \mid c_1) \prod_{t=2}^{T} p(y_t \mid c_t, y_1, \ldots, y_{t-1}),    (8)

where p(y_t | c_t, y_1, ..., y_{t-1}) is given by

    s_t = f(e_{y_{t-1}}, s_{t-1}, c_t),
    p(y_t \mid c_t, y_1, \ldots, y_{t-1}) = I_{y_t} \cdot \mathrm{softmax}(s_t, e_{y_{t-1}}),    (9)

where s_t is the hidden state of the decoder at step t, e_{y_{t-1}} is the embedding of y_{t-1}, f is a GRU, I_{y_t} is the one-hot vector for y_t, and softmax(s_t, e_{y_{t-1}}) is a V-dimensional vector with V the response vocabulary size and each element the generation probability of a word. In practice, we employ the beam search technique (Tillmann and Ney, 2003) to generate the n-best responses.

Let us denote Θ as the parameter set of HRAN; then we estimate Θ from D = {(U_i, Y_i)}_{i=1}^N by minimizing the following objective function:

    \hat{\Theta} = \arg\min_{\Theta} \, - \sum_{i=1}^{N} \log\big(p(y_{i,1}, \ldots, y_{i,T_i} \mid U_i)\big).    (10)
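A compact sketch of how Equations (8)-(10) are used in training follows: the decoder is stepped through the reference response, per-step log probabilities are accumulated, and the objective sums the negative log likelihood over the data set. The attend/decoder_step/embed interfaces, the model.log_prob method, and the start symbol used before the first word are assumptions made for illustration only.

```python
import numpy as np


def response_log_prob(response_ids, attend, decoder_step, embed, s0, sos_id):
    """
    log p(y_1, ..., y_T | U) under Equations (8)-(9).

    attend(s_prev)                  -> c_t, the context vector of Equation (4) for this step
    decoder_step(e_prev, s_prev, c) -> (s_t, probs): the GRU f of Equation (9) plus the softmax
    embed(word_id)                  -> word embedding
    `sos_id` is an assumed start symbol for the very first step.
    """
    s, e_prev, total = s0, embed(sos_id), 0.0
    for y_t in response_ids:
        c_t = attend(s)                      # hierarchical attention conditioned on s_{t-1}
        s, probs = decoder_step(e_prev, s, c_t)
        total += np.log(probs[y_t])          # I_{y_t} · softmax(...) picks out the entry of y_t
        e_prev = embed(y_t)
    return total


def objective(data, model):
    """Equation (10): the summed negative log likelihood minimized with respect to Θ."""
    return -sum(model.log_prob(U, Y) for U, Y in data)  # model.log_prob is an assumed interface
```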
5 Experiments

We compared HRAN with state-of-the-art methods by both automatic evaluation and side-by-side human judgment.

5.1 Data Set

We built a data set from Douban Group (https://www.douban.com/group/explore), a popular Chinese social networking service (SNS) that allows users to discuss a wide range of topics in groups through posting and commenting. In Douban Group, under a post on a specific topic, two persons can converse with each other: one posts a comment and the other quotes it and posts another comment. We crawled 20 million conversations between two persons, with an average number of turns of 6.32. The data covers many different topics and can be viewed as a simulation of open domain conversations in a chatbot. In each conversation, we treated the last turn as the response and the remaining turns as the context. As preprocessing, we first employed the Stanford Chinese word segmenter (http://nlp.stanford.edu/software/segmenter.shtml) to tokenize each utterance in the data. Then we removed the conversations whose response appears more than 50 times in the whole data, to prevent such responses from dominating learning. We also removed the conversations shorter than 3 turns and the conversations with an utterance longer than 50 words. After the preprocessing, 1,656,652 conversations are left. From them, we randomly sampled 1 million conversations as training data, 10,000 conversations as validation data, and 1,000 conversations as test data, and made sure that there is no overlap among them. In the test data, the contexts were used to generate responses, and their responses were used as ground truth to calculate the perplexity of the generation models. We kept the 40,000 most frequent words in the contexts of the training data to construct a context vocabulary. The vocabulary covers 98.8% of the words appearing in the contexts of the training data. Similarly, we constructed a response vocabulary that contains the 40,000 most frequent words in the responses of the training data, which covers 99.0% of the words appearing in the responses. Words outside the two vocabularies were treated as "UNK". The data will be publicly available.

5.2 Baselines

We compared HRAN with the following models:

S2SA: we concatenated all utterances in a context as one long sequence and treated the sequence and the response as a message-response pair. By this means, we transformed the problem of multi-turn response generation into a problem of single-turn response generation and employed the standard sequence to sequence model with attention (Shang et al., 2015) as a baseline.

HRED: the hierarchical encoder-decoder model proposed by Serban et al. (2016a).

VHRED: a modification of HRED (Serban et al., 2016c) in which latent variables are introduced into generation.

In all models, we set the dimensionality of the hidden states of encoders and decoders to 1000, and the dimensionality of word embedding to 620. All models were initialized with isotropic Gaussian distributions X ~ N(0, 0.01) and trained with the AdaDelta algorithm (Zeiler, 2012) on an NVIDIA Tesla K40 GPU. The batch size is 128. We set the initial learning rate to 1.0 and reduced it by half if the perplexity on validation began to increase. We implemented the models with the open source deep learning tool Blocks (https://github.com/mila-udem/blocks).

5.3 Evaluation Metrics

How to evaluate a response generation model is still an open problem, but it is not the focus of this paper. We followed the existing work and employed the following metrics:

Perplexity: following (Vinyals and Le, 2015), we employed perplexity as an evaluation metric:

    \mathrm{PPL} = \exp\Big\{ -\frac{1}{N} \sum_{i=1}^{N} \log\big(p(Y_i \mid U_i)\big) \Big\}.    (11)

Perplexity measures how well a model predicts human responses; lower perplexity generally indicates better generation performance. In our experiments, perplexity on validation was used to determine when to stop training: if the perplexity stops decreasing and the difference is smaller than 2.0 five times in validation, we consider that the algorithm has reached convergence and terminate training. We tested the generation ability of different models by perplexity on the test data.

Side-by-side human annotation: we also compared HRAN with every baseline model by side-by-side human comparison. Specifically, we recruited three native speakers with rich Douban Group experience as human annotators. To each annotator, we showed a context of a test example with two generated responses, one from HRAN and the other from a baseline model. Both responses were the top-one results of beam search. The two responses were presented in random order. We then asked the annotator to judge which one is better. The criterion is: response A is better than response B if (1) A is relevant, logically consistent with the context, and fluent, while B is either irrelevant or logically contradictory to the context, or is disfluent (e.g., with grammatical errors or UNKs); or (2) both A and B are relevant, consistent, and fluent, but A is more informative and interesting than B (e.g., B is a universal reply like "I see"). If the annotator could not tell which one is better, he/she was asked to label a "tie". Each annotator individually judged 1,000 test examples for each HRAN-baseline pair; in total, each one judged 3,000 examples (for three pairs). Agreement among the annotators was calculated using Fleiss' kappa (Fleiss and Cohen, 1973).

Note that we do not choose BLEU (Papineni et al., 2002) as an evaluation metric, because (1) Liu et al. (2016) have shown that BLEU is not a proper metric for evaluating conversation models, as there is weak correlation between BLEU and human judgment; and (2) different from the single-turn case, in multi-turn conversation one context usually has only one copy in the whole data. Thus, without any human effort like that of Sordoni et al. (2015), each context only has a single reference in the test set. This makes BLEU even more unreliable as a measurement of generation quality in open domain conversation, due to the diversity of responses.
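The perplexity of Equation (11) and the validation-based stopping rule described above can be computed roughly as follows. This is our reading of the rule (Equation (11) as written normalizes by the number of conversations), with names and bookkeeping chosen for illustration.

```python
import math


def perplexity(log_probs):
    """Equation (11): PPL = exp(-(1/N) * sum_i log p(Y_i | U_i)) over N conversations."""
    return math.exp(-sum(log_probs) / len(log_probs))


def should_stop(validation_ppl_history, tolerance=2.0, patience=5):
    """
    Stopping rule of Section 5.3 as we read it: terminate once validation perplexity has
    stopped decreasing, i.e., the per-validation improvement has stayed below `tolerance`
    for `patience` consecutive validations.
    """
    if len(validation_ppl_history) < patience + 1:
        return False
    recent = validation_ppl_history[-(patience + 1):]
    improvements = [prev - cur for prev, cur in zip(recent, recent[1:])]
    return all(d < tolerance for d in improvements)
```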
Table 1: Perplexity results

Model   Validation Perplexity   Test Perplexity
S2SA    43.679                  44.508
HRED    46.279                  47.467
VHRED   44.548                  45.484
HRAN    40.257                  41.138

Table 2: Human annotation results (in %)

Models           Win    Loss   Tie    Kappa
HRAN vs. S2SA    27.3   20.6   52.1   0.37
HRAN vs. HRED    27.2   21.2   51.6   0.35
HRAN vs. VHRED   25.2   20.4   54.4   0.34

5.4 Evaluation Results

Table 1 gives the results on perplexity. HRAN achieves the lowest perplexity on both validation and test. We conducted a t-test on test perplexity, and the result shows that the improvement of HRAN over all baseline models is statistically significant (p-value < 0.01).

Table 2 shows the human annotation results. The ratios were calculated by combining the annotations from the three judges together. We can see that HRAN outperforms all baseline models, and all comparisons have relatively high kappa scores, which indicates that the annotators reached relatively high agreement in judgment. Compared with S2SA, HRED, and VHRED, HRAN achieves preference gains (win - loss) of 6.7%, 6%, and 4.8% respectively. Sign test results show that the improvement is statistically significant (p-value < 0.01 for HRAN vs. S2SA and HRAN vs. HRED, and p-value < 0.05 for HRAN vs. VHRED). Among the three baseline models, S2SA is the worst one, because it loses the relationships among utterances in the context. VHRED is the best baseline model, which is consistent with the existing literature (Serban et al., 2016c). We checked the cases on which VHRED loses to HRAN and found that in 56% of them, VHRED generated irrelevant responses while the responses given by HRAN are relevant, logically consistent, and fluent.

5.5 Discussions

Case study: Figure 3 lists some cases from the test set to compare HRAN with the best baseline, VHRED.

Figure 3: Case study (utterances between two persons in contexts are split by "⇒")

We can see that HRAN not only can answer the last turn in the context (i.e., the message) properly by understanding the context (e.g., case 2), but is also capable of starting a new topic according to the conversation history to keep the conversation going (e.g., case 1). In case 2, HRAN understands that the message is actually asking "why can't you come to have dinner with me?" and generates a proper response that gives a plausible reason. In case 1, HRAN properly brings up a new topic by asking the "brand" of the user's "lotion" when the current topic "how to exfoliate my skin" has come to an end. The new topic is based on the content of the context and thus can naturally extend the conversation in the case.

Visualization of attention: to further illustrate why HRAN can generate high quality responses, we visualized the hierarchical attention for the cases of Figure 3 in Figure 4. In every sub-figure, each line is an utterance, with blue color indicating word importance. The leftmost column of each sub-figure uses red color to indicate utterance importance. Darker color means more important words or utterances. The importance of a word or an utterance was calculated as the average weight the word or the utterance was assigned by attention when generating the response given at the bottom of each sub-figure. It reflects an overall contribution of the word or the utterance to the generation of the response. Above each line, we give the translation of the utterance, and below it, we translate important words. Note that word-to-word translation may sometimes cause confusion; therefore, we left some words (most of them function words) untranslated.

Figure 4: Attention visualization; panels (a)-(d) visualize cases 1-4 (the importance of a word or an utterance is calculated as its average weight when generating the whole response)

We can see that the hierarchical attention mechanism in HRAN can attend to both important words and important utterances in contexts. For example, in Figure 4(c), words including "girl" and "boyfriend" and numbers including "160" and "175" are highlighted, and u_1 and u_4 are more important than the others. The result matches our intuition in the introduction. In Figure 4(b), HRAN assigned larger weights to u_1, u_4, and words like "dinner" and "why". This explains why the model can understand that the message is actually asking "why can't you come to have dinner with me?". The figures provide us with insights on how HRAN understands contexts in generation.
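The averaging used for these importance scores can be sketched as follows, assuming the per-step α and β weights of Equations (6) and (7) have been recorded during generation; the function name and the exact data layout are illustrative, not the paper's code.

```python
import numpy as np


def average_attention(word_weights_per_step, utt_weights_per_step):
    """
    Importance scores as described above: for every word and every utterance, average the
    attention weights it received over all decoding steps of the generated response.
    word_weights_per_step[t][i] is the weight vector (α_{i,t,1}, ..., α_{i,t,T_i});
    utt_weights_per_step[t] is (β_{1,t}, ..., β_{m,t}).
    """
    T = len(word_weights_per_step)
    m = len(utt_weights_per_step[0])
    word_importance = [
        np.mean([word_weights_per_step[t][i] for t in range(T)], axis=0) for i in range(m)
    ]
    utterance_importance = np.mean(utt_weights_per_step, axis=0)
    return word_importance, utterance_importance
```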
Model ablation: we then examine the effect of the different components of HRAN by removing them one by one. We first removed l_{i+1,t} from η(s_{t-1}, l_{i+1,t}, h_{i,j}) in Equation (6) (i.e., removing the utterance dependency from word level attention) and denoted the model as "No UD Att"; then we removed word level attention and utterance level attention separately, and denoted the models as "No Word Att" and "No Utterance Att" respectively. We conducted side-by-side human comparison of these models with the full HRAN on the test data and also calculated their test perplexity (PPL). Table 3 gives the results.

Table 3: Model ablation results

Model              Win     Loss    Tie     PPL
No UD Att          22.3%   24.8%   52.9%   41.54
No Word Att        20.4%   25.0%   50.6%   43.24
No Utterance Att   21.1%   23.7%   55.2%   47.35

We can see that all the components are useful, because removing any of them causes a performance drop. Among them, word level attention is the most important one, as HRAN achieved the largest preference gain (4.6%) over No Word Att in the human comparison.

Error analysis: we finally investigate how to improve HRAN in the future by analyzing the cases on which HRAN loses to VHRED. The errors can be summarized as: 51.81% logic contradiction, 26.95% universal reply, 7.77% irrelevant response, and 13.47% others. Most bad cases come from universal replies and responses that are logically contradictory to contexts. This is easy to understand, as HRAN does not explicitly model these two issues. The result also indicates that (1) although contexts provide more information than single messages, multi-turn response generation still has the "safe response" problem of the single-turn case; and (2) although attending to important words and utterances in generation can lead to informative and logically consistent responses for many cases like those in Figure 3, it is still not enough for fully understanding contexts due to the complex nature of conversations. The irrelevant responses might be caused by wrong attention in generation. Although the analysis might not cover all bad cases (e.g., HRAN and VHRED may both give bad responses), it sheds light on our future directions: (1) improving response diversity, e.g., by introducing extra content into generation as Xing et al. (2016) and Mou et al. (2016) did for single-turn conversation; (2) modeling logic in contexts; and (3) improving attention.

6 Conclusion

We propose a hierarchical recurrent attention network (HRAN) for multi-turn response generation in chatbots. Empirical studies on large scale conversation data show that HRAN can significantly outperform state-of-the-art models.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Margaret Ann Boden. 2006. Mind as machine: A history of cognitive science. Clarendon Press.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. Syntax, Semantics and Structure in Statistical Translation, page 103.

Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875–1886.

Joseph L. Fleiss and Jacob Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement.

Sina Jafarpour, Christopher J. C. Burges, and Alan Ritter. 2010. Filter, rank, and transfer the knowledge: Learning to chat. Advances in Ranking, 10.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.

Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048.

Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, and Zhi Jin. 2016. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. arXiv preprint arXiv:1607.00970.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
Paul Hongsuck Seo, Zhe Lin, Scott Cohen, Xiaohui Shen, and Bohyung Han. 2016. Hierarchical attention networks. arXiv preprint arXiv:1606.02393.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015. Building end-to-end dialogue systems using generative hierarchical neural network models. arXiv preprint arXiv:1507.04808.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016a. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16).

Iulian Vlad Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, and Aaron Courville. 2016b. Multiresolution recurrent neural networks: An application to dialogue response generation. arXiv preprint arXiv:1606.00776.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016c. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Christoph Tillmann and Hermann Ney. 2003. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational Linguistics, 29(1):97–133.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Richard S. Wallace. 2009. The anatomy of ALICE. Springer.

Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2016. Topic aware neural response generation. arXiv preprint arXiv:1606.08340.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Steve Young, Milica Gašić, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. 2010. The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language, 24(2):150–174.

Stephanie Young, Milica Gašić, Blaise Thomson, and John D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.