Adversarial Learning for Neural Dialogue Generation Jiwei Li1, WillMonroe1, Tianlin Shi 1, Se´bastien Jean2, AlanRitter3 andDanJurafsky1 1Stanford University,CA, USA 2New York University,NY, USA 3OhioState University,OH,USA jiweil,wmonroe4,tianlins,[email protected] [email protected] [email protected] Abstract Sordonietal., 2015) or end-to-end neural sys- tems (Shangetal., 2015; VinyalsandLe, 2015; 7 1 We apply adversarial training to open- Lietal., 2016a; Yaoetal., 2015; Luanetal., 0 domain dialogue generation, training a 2016) approximate such a goal by predicting the 2 system to produce sequences that are in- nextdialogue utterance giventhe dialogue history b distinguishable from human-generated di- using the maximum likelihood estimation (MLE) e F alogue utterances. We cast the task as objective function. Despite its success, many a reinforcement learning problem where issues emerge resulting from this over-simplified 2 2 we jointly train two systems: a genera- training objective: responses are highly dull and tivemodeltoproduceresponse sequences, generic(Sordonietal.,2015;Serbanetal.,2016a; ] L and a discriminator—analagous to the hu- Lietal., 2016a), repetitive and short-sighted C man evaluator in the Turing test— to dis- (Lietal.,2016d). . tinguish between the human-generated di- s Solutions to these problems require answering c alogues and the machine-generated ones. a few fundamental questions: what are the cru- [ In this generative adversarial network ap- cialaspects that defineanideal conversation, how 4 proach, theoutputs fromthediscriminator canwequantitatively measure them, and howcan v are used to encourage the system towards 7 we incorporate them into a machine learning sys- 4 morehuman-like dialogue. tem? For example, Lietal. (2016d) manually de- 5 Further, we investigate models for adver- fine three types of ideal dialogue properties such 6 0 sarial evaluation that uses success in fool- as ease of answering, informativeness and coher- 1. ing an adversary as a dialogue evaluation ence,anduseareinforcement-learning framework 0 metric, while avoiding a number of poten- to train the model to generate highly rewarded re- 7 tialpitfalls. sponses. Yuetal. (2016b) use keyword retrieval 1 : confidence as a reward. However, it is widely v Experimental results on several metrics, acknowledged that manually defined reward func- i including adversarial evaluation, demon- X tions can’t possibly cover all crucial aspects and stratethattheadversarially-trained system r canleadtosuboptimal generated utterances. a generates higher-quality responses than previousbaselines.1 A good dialogue model should generate ut- terances indistinguishable from human dialogues. 1 Introduction Such a goal suggests a training objective re- sembling the idea of the Turing test (Turing, Open domain dialogue generation (Ritteretal., 1950). We borrow the idea of adversarial train- 2011; Sordonietal., 2015; Xuetal., 2016; ing (Goodfellow etal., 2014; Dentonetal., 2015) Wenetal., 2016; Lietal., 2016b; Serbanetal., in computer vision, in which we jointly train two 2016c,e) aims at generating meaningful and models, a generator (which takes the form of the coherent dialogue responses given input dia- neural SEQ2SEQ model) that defines the prob- logue history. Prior systems, e.g., phrase-based ability of generating a dialogue sequence, and machine translation systems (Ritteretal., 2011; a discriminator that labels dialogues as human- generated or machine-generated. This discrimina- 1Codeavailableat https://github.com/jiweil/Neural-Dialogue-Getonrerisataniaolnogous totheevaluator intheTuringtest. Wecast the task as areinforcement learning prob- 2016). Forexample,Suetal.(2016)combinerein- lem, in which the quality of machine-generated forcementlearningwithneuralgenerationontasks utterances is measured by its ability to fool the with real users. Recent work from Asgharetal. discriminator into believing that it is a human- (2016)trainsanend-to-endRLdialoguemodelus- generated one. The output from the discriminator inghumanusers. is used as a reward to the generator, pushing it to Dialogue quality is traditionally evaluated generateutterancesindistinguishable fromhuman- (Sordonietal., 2015, e.g.) using word-overlap generated dialogues. metrics such as BLEU and METEOR scores The idea of a Turing test—employing an eval- used for machine translation. Some recent work uator todistinguish machine-generated texts from (Liuetal., 2016) has started to look at more flexi- human-generated ones—can be applied not only bleandreliableevaluation metricssuchashuman- to training but also testing, where it goes by the rating prediction (Loweetal., 2016a) and next ut- nameofadversarial evaluation. Adversarialevalu- teranceclassification (Loweetal.,2016b). ation was first employed in Bowmanetal. (2015) to evaluate sentence generation quality, and pre- Adversarial networks The idea of generative liminarilystudiedinthecontextofdialoguegener- adversarial networks has enjoyed great success in ationbyKannanandVinyals(2016). Inthispaper, computervision (Radfordetal.,2015;Chenetal., we discuss potential pitfalls of adversarial evalua- 2016a; Salimansetal., 2016). Training is formal- tions and necessary steps toavoid them and make ized as a game in which the generative model is evaluation reliable. trained to generate outputs to fool the discrimina- Experimental results demonstrate that our ap- tor;thetechniquehasbeensuccessfully appliedto proach produces moreinteractive, interesting, and imagegeneration. non-repetitive responses than standard SEQ2SEQ However, to the best of our knowledge, this modelstrainedusingtheMLEobjectivefunction. idea has not achieved comparable success in NLP. This is due to the fact that unlike in vision, text 2 Related Work generation is discrete, which makes the error out- Dialogue generation Response generation for putted from the discriminator hard to backpropa- dialoguecanbeviewedasasource-to-target trans- gatetothegenerator. Somerecentworkhasbegun duction problem. Ritteretal. (2011) frame the to address this issue: Lambetal. (2016) propose response generation problem as a statistical ma- providing the discriminator with the intermediate chine translation (SMT) problem. Sordonietal. hidden vectors of the generator rather than its se- (2015) improved Ritter et al.’s system by rescor- quenceoutputs. Suchastrategymakesthesystem ing the outputs of a phrasal SMT-based conversa- differentiable and achieves promising results in tion system with a neural model that incorporates tasks like character-level language modeling and prior context. Recent progress in SEQ2SEQ mod- handwriting generation. Yuetal. (2016a) use pol- els have inspired several efforts (VinyalsandLe, icy gradient reinforcement learning to backprop- 2015; Serbanetal., 2016a,d; Luanetal., 2016) to agate the error from the discriminator, showing build end-to-end conversational systems which improvement in multiple generation tasks such as first apply an encoder to map a message to a dis- poem generation, speech language generation and tributedvectorrepresenting itssemanticsandthen musicgeneration. Outsideofsequencegeneration, generate aresponse fromthemessagevector. Chenetal. (2016b) apply the idea of adversar- Our work adapts the encoder-decoder model to ial training to sentiment analysis and Zhangetal. RL training, and can thus be viewed as an ex- (2017)applytheideatodomainadaptation tasks. tension of Lietal. (2016d), but with more gen- Our work is distantly related to recent eral RL rewards. Lietal. (2016d) simulate dia- work that formalizes sequence generation as an logues between two virtual agents, using policy action-taking problem in reinforcement learning. gradientmethodstorewardsequencesthatdisplay Ranzatoetal. (2015) train RNN decoders in a three useful conversational properties: informativ- SEQ2SEQ model using policy gradient, and they ity, coherence, and ease of answering. Our work obtain competitive machine translation results. is also related to recent efforts to integrate the Bahdanauetal. (2016) take this a step further SEQ2SEQ and reinforcement learning paradigms, by training an actor-critic RL model for ma- drawing on the advantages of both (Wenetal., chine translation. Also related is recent work (Shenetal., 2015; WisemanandRush, 2016) to utterances that are indistinguishable from human address the issues of exposure bias and loss- generateddialogues. Weusepolicygradientmeth- evaluation mismatch in neural machine transla- ods to achieve such a goal, in which the score tion. ofcurrent utterances beinghuman-generated ones assigned by the discriminator (i.e., Q ({x,y})) + 3 Adversarial Training forDialogue is used as a reward for the generator, which is Generation trained to maximize the expected reward of gen- erated utterance(s) using the REINFORCE algo- In this section, we describe in detail the compo- rithm(Williams,1992): nents of the proposed adversarial reinforcement learning model. The problem can be framed as J(θ)= E (Q ({x,y})|θ) (1) follows: givenadialoguehistoryxconsisting ofa y∼p(y|x) + sequenceofdialogueutterances,2 themodelneeds Given the input dialogue history x, the bot gener- to generate a response y = {y1,y2,...,yT}. We ates a dialogue utterance y by sampling from the view the process of sentence generation as a se- policy. The concatenation of the generated utter- quenceofactionsthataretakenaccordingtoapol- ance y and the input x is fed to the discriminator. icydefinedbyanencoder-decoderrecurrentneural Thegradientof(1)isapproximated usingthelike- networks. lihood ratio trick (Williams, 1992; Glynn, 1990; Aleksandrov etal.,1968): 3.1 AdversarialREINFORCE The adversarial REINFORCE algorithm consists ∇J(θ)≈ [Q ({x,y})−b({x,y})] + of two components: a generative model G and a ∇logπ(y|x) discriminative modelD. = [Q ({x,y})−b({x,y})] + Generative model The generative model G de- fines the policy that generates a response y given ∇Xlogp(yt|x,y1:t−1) (2) t dialogue history x. It takes a form similar to SEQ2SEQ models, which first map the source in- where π denotes the probability of the generated put to a vector representation using a recurrent responses. b({x,y}) denotes thebaseline value to netandthencomputetheprobabilityofgenerating reduce the variance of the estimate while keeping eachtokeninthetargetusingasoftmaxfunction. it unbiased.4 The discriminator is simultaneously updated with the human generated dialogue that Discriminative model The discriminative contains dialogue history x as a positive example model D is a binary classifier that takes as input and the machine-generated dialogue as a negative a sequence of dialogue utterances {x,y} and example. outputs a label indicating whether the input is generated by humans or machines. The input 3.2 RewardforEveryGenerationStep dialogue is encoded into a vector representation (REGS) using a hierarchical encoder (Lietal., 2015; TheREINFORCEalgorithmdescribedhasthedis- Serbanetal., 2016b),3 which is then fed to a advantage thattheexpectation oftherewardisap- 2-classsoftmaxfunction, returningtheprobability proximated by only one sample, and the reward of the input dialogue episode being a machine- associated with this sample (i.e., [Q ({x,y}) − + generated dialogue (denoted Q ({x,y})) or a − b({x,y})]inEq(2))isusedforallactions(thegen- human-generated dialogue (denotedQ ({x,y})). + eration of each token) in the generated sequence. Policy Gradient Training The key idea of the Suppose, for example, the input history is what’s system is to encourage the generator to generate your name, the human-generated response is I am John, and the machine-generated response is 2Weapproximatethedialoguehistoryusingtheconcate- nationoftwoprecedingutterances.Wefoundthatusingmore 4 LikeRanzatoetal.(2015),wetrainanotherneuralnet- than 2 context utterances yields very tiny performance im- work model (the critic) to estimate the value (or future re- provementsforSEQ2SEQmodels. ward) of current state (i.e., the dialogue history) under the 3To be specific, each utterance p or q is mapped current policy π. The critic network takes as input the dia- to a vector representation h or h using LSTM loguehistory,transformsittoavectorrepresentationusinga p q (HochreiterandSchmidhuber,1997). AnotherLSTMisput hierarchicalnetworkandmapstherepresentationtoascalar. onsentencelevel,mappingtheentiredialoguesequencetoa Thenetworkisoptimizedbasedonthemeansquaredlossbe- singlerepresentation tweentheestimatedrewardandtherealreward. I don’t know. The vanilla REINFORCE model one example from {y− }NY−, which are treated 1:t t=1 assigns the same negative reward to all tokens as positive and negative examples to update the withinthehuman-generatedresponse(i.e.,I,don’t, discriminator. Compared with the Monte Carlo know), whereas proper credit assignment in train- search model, this strategy is significantly more ingwouldgiveseparaterewards,mostlikelyaneu- time-effective, but comes with the weakness that tralrewardforthetokenI,andnegativerewardsto the discriminator becomes less accurate after par- don’tandknow. Wecallthisrewardforeverygen- tially decoded sequences are added in as training erationstep,abbreviated REGS. examples. We find that the MC model performs Rewards for intermediate steps or partially de- betterwhentraining timeislessofanissue. coded sequences are thus necessary. Unfortu- For each partially-generated sequence Y = t nately,thediscriminatoristrainedtoassignscores y , the discriminator gives a classification score 1:t to fully generated sequences, but not partially de- Q (x,Y ). We compute the baseline b(x,Y ) us- + t t codedones. Weproposetwostrategiesforcomput- ing a similar model to the vanilla REINFORCE ing intermediate step rewards by (1) using Monte model. This yields the following gradient to up- Carlo(MC)searchand(2)trainingadiscriminator datethegenerator: that is able to assign rewards to partially decoded sequences. ∇J(θ)≈ X(Q+(x,Yt)−b(x,Yt)) In (1) Monte Carlo search, given a partially de- t coded sP, the model keeps sampling tokens from ∇logp(yt|x,Y1:t−1) (3) thedistributionuntilthedecodingfinishes. Sucha Comparing (3) with (2), we can see that the val- process is repeated N (set to 5) times and the N ues forrewards and baselines are different among generated sequences will share a common prefix generated tokensinthesameresponse. s . These N sequences are fed to the discrimi- P nator, the average score of which is used as a re- Teacher Forcing Practically, we find that up- ward for the sP. A similar strategy is adopted in dating the generative model only using Eq. 1 Yuetal. (2016a). The downside of MC is that it leads to unstable training for both vanilla Rein- requires repeating the sampling process for each force and REGS, with the perplexity value sky- prefix of each sequence and is thus significantly rocketing after training the model for a few hours time-consuming.5 (evenwhenthegenerator isinitialized usingapre- In (2), we directly train a discriminator that is trained SEQ2SEQ model). The reason this hap- able to assign rewards to both fully and partially pens is that the generative model can only be in- decoded sequences. We break the generated se- directly exposed to the gold-standard target se- quencesintopartialsequences, namely{y+ }NY+ quences through the reward passed back from the 1:t t=1 and {y− }NY− anduse allinstances in{y+ }NY+ discriminator, and this reward is used to promote 1:t t=1 1:t t=1 as positive examples and instances {y− }NY− as or discourage its (the generator’s) own generated 1:t t=1 sequences. Such a training strategy is fragile: negativeexamples. Theproblem withsuchastrat- once the generator (accidentally) deteriorates in egyisthatearlieractions inasequence areshared sometrainingbatchesandthediscriminatorconse- amongmultipletraining examplesforthediscrim- quentlydoesanextremelygoodjobinrecognizing inator (for example, token y+ is contained in all 1 sequencesfromthegenerator,thegeneratorimme- partially generated sequences, which results in diately gets lost. It knows that its generated se- overfitting. To mitigate this problem, we adopt a quences are bad based on the rewards outputted strategy similar to when training value networks from the discriminator, but itdoes not know what inAlphaGo(Silveretal.,2016),inwhichforeach sequences are good and how to push itself to gen- collection of subsequences of Y, we randomly erate these good sequences (the odds of generat- sample only one example from {y+ }NY+ and 1:t t=1 ing a good response from random sampling are minute, due to the vast size of the space of pos- 5Consideronetargetsequencewithlength20,weneedto sible sequences). Loss of the reward signal leads sample5*20=100 fullsequencestogetrewardsforallinter- mediatesteps.Trainingonebatchwith128examplesroughly toabreakdown inthetrainingprocess. takesroughly1minonasingleGPU,whichiscomputation- To alleviate this issue and give the generator allyintractableconsideringthesizeofthedialoguedatawe moredirectaccesstothegold-standardtargets,we have.Wethusparallelizethesamplingprocesses,distributing jobsacross8GPUs. propose also feeding human generated responses to the generator for model updates. The most Fornumberoftrainingiterationsdo . Fori=1,D-stepsdo straightforward strategy isforthediscriminator to . Sample(X,Y)fromrealdata automatically assign a reward of 1 (or other posi- . SampleYˆ ∼G(·|X) tivevalues)tothehumangenerated responses and . Update D using (X,Y) as positive examples and forthegeneratortousethisrewardtoupdateitself (X,Yˆ)asnegativeexamples. . End onhumangeneratedexamples. Thiscanbeseenas . havingateacherintervenewiththegeneratorsome . Fori=1,G-stepsdo fraction of the time and force it to generate the . Sample(X,Y)fromrealdata true responses, an approach that is similar to the . SampleYˆ ∼G(·|X) . ComputeRewardrfor(X,Yˆ)usingD. professor-forcing algorithm ofLambetal.(2016). . UpdateGon(X,Yˆ)usingrewardr A closer look reveals that this modification is . Teacher-Forcing:UpdateGon(X,Y) the same as the standard training of SEQ2SEQ . End End models, making the final training alternately up- datetheSEQ2SEQ modelusingtheadversarialob- jective and the MLE objective. One can think of Figure 1: A brief review of the proposed adversarial re- inforcement algorithm for training the generator G and dis- theprofessor-forcing modelasaregularizertoreg- criminator D. The reward r from the discriminator D can ulatethegeneratoronceitstartsdeviatingfromthe becomputedusingdifferentstrategiesaccordingtowhether training dataset. usingREINFORCEorREGS.TheupdateofthegeneratorG on(X,Yˆ)canbedonebyeitherusingEq.2orEq.3.D-steps Wealso propose another workaround, in which issetto5andG-stepsissetto1. thediscriminator firstassignsarewardtoahuman generated example using its own model, and the a few strategies that improve response quality, in- generator then updates itself using this reward on cluding: (2) Remove training examples with the thehumangeneratedexampleonlyiftherewardis length of responses shorter than a threshold (set larger thanthebaseline value. Suchastrategy has to 5). We find that this significantly improves the the advantage that different weights for model up- general response quality.6 (2) Instead of using datesareassignedtodifferenthumangeneratedex- the same learning rate for all examples, using a amples(intheformofdifferentrewardvaluespro- weighted learning rate that considers the average ducedbythegenerator) andthathumangenerated tf-idf score for tokens within the response. Such examplesarealwaysassociated withnon-negative a strategy decreases the influence from dull and weights. genericutterances (seeAppendix fordetails). AreviewoftheproposedmodelisshowninFig- (3)Penalizingintra-sibling rankingwhendoing ure1. beam search decoding to promote N-best list di- 3.3 TrainingDetails versity asdescribed inLietal.(2016c). (4)Penal- izing word types (stop words excluded) that have We first pre-train the generative model by already been generated. Such a strategy dramat- predicting target sequences given the dia- ically decreases the rate of repetitive responses logue history. We trained a SEQ2SEQ model such as no. no. no. no. no. or contradictory (Sutskeveretal., 2014) with an attention mecha- responses such as I don’t like oranges but i like nism (Bahdanau etal., 2015; Luongetal., 2015) oranges. on the OpenSubtitles dataset. We followed pro- tocols recommended by Sutskeveretal. (2014), 4 Adversarial Evaluation suchasgradient clipping, mini-batchandlearning ratedecay. Wealsopre-trainthediscriminator. To In this section, wediscuss details of strategies for generate negative examples, we decode part of successful adversarial evaluation. It is worth not- the training data. Half of the negative examples ing that the proposed adversarial training and ad- are generated using beam-search with mutual versarial evaluation are separate procedures.They information reranking as described in Lietal. are independent of each other and share no com- (2016a), and the other half is generated from monparameters. sampling. The idea of adversarial evaluation, first pro- For data processing, model training and decod- posed by Bowmanetal. (2015), is to train a dis- ing (both the proposed adversarial training model 6Tocompensate for the lossof short responses, one can and the standard SEQ2SEQ models), we employ trainaseparatemodelusingshortsequences. criminant function to separate generated and true opmentsetisnotsuitedtothiscase,sinceamodel sentences, in an attempt to evaluate the model’s that overfits the development set is necessarily su- sentence generation capability. The idea has perior. beenpreliminarilystudiedbyKannanandVinyals To deal with this issue, we propose setting up (2016) in the context of dialogue generation. Ad- afewmanually-invented situations totesttheabil- versarial evaluation also resembles the idea of the ity of the automatic evaluator. This is akin to set- Turing test, which requires a human evaluator to ting up examinations to test the ability of the hu- distinguish machine-generated texts from human- man evaluator in the Turing test. We report not generated ones. Since it is time-consuming and only the AdverSuc values, but also the scores that costly to ask a human to talk to a model and give theevaluator achievesinthesemanually-designed judgements, wetrainamachineevaluator inplace test cases, indicating how much we can trust of the human evaluator to distinguish the human the reported AdverSuc. We develop scenarios in dialogues and machine dialogues, and we use it whichweknowinadvancehowaperfectevaluator tomeasure thegeneral quality ofthegenerated re- should behave, and then compare AdverSuc from sponses. adiscriminative model withthegold-standard Ad- Adversarial evaluation involves both training verSuc. Scenarioswedesigninclude: and testing. At training time, the evaluator is • We use human-generated dialogues as both trained to label dialogues as machine-generated positive examples and negative examples. A (negative) or human-generated (positive). At test perfect evaluator should give an AdverSuc time, the trained evaluator is evaluated on a held- of 0.5 (accuracy 50%), which is the gold- outdataset. Ifthehuman-generated dialogues and standard result. machine-generated onesareindistinguishable, the • Weusemachine-generated dialogues asboth model will achieve 50 percent accuracy at test positive examples and negative examples. A time. perfect evaluator should giveanAdverSuc of 0.5(accuracy 50%). 4.1 AdversarialSuccess • We use original human-generated dialogues We define Adversarial Success (AdverSuc for as positive examples and dialogues consist- short) to be the fraction of instances in which a ing of random utterances as negative exam- model is capable of fooling the evaluator. Adver- ples. A perfect evaluator should give an Ad- Suc is the difference between 1 and the accuracy verSuc of0(accuracy 100%). achievedbytheevaluator. HighervaluesofAdver- • We use original human-generated dialogues Sucforadialoguegeneration modelarebetter. as positive examples and dialogues with re- sponses replaced by the utterances immedi- 4.2 TestingtheEvaluator’sAbility ately following the true responses. A perfect evaluation should give an AdverSuc of 0 (ac- One caveat with the adversarial evaluation meth- curacy100%). ods is that they are model-dependent. Weapprox- The evaluator reliability error (ERE) is the aver- imate the human evaluator in the Turing test with age deviation of an evaluator’s adversarial error an automatic evaluator and assume that the evalu- from the gold-standard error in the above tasks, ator is perfect: low accuracy of the discriminator with equal weight for each task. The smaller the shouldindicatehighqualityoftheresponses,since error,themorereliable theevaluatoris. we interpret this to mean the generated responses are indistinguishable from the human ones. Un- 4.3 Machine-vs-RandomAccuracy fortunately, there is another factor that can lead tolowdiscriminative accuracy: apoordiscrimina- Evaluator reliability error uses scenarios con- tive model. Consider a discriminator that always structed from human-generated dialogues to as- givesrandomlabelsoralwaysgivesthesamelabel. sessfeatureorhyper-parameterchoicefortheeval- Such an evaluator always yields a high AdverSuc uator. Unfortunately, no machine-generated re- value of 0.5. Bowmanetal. (2015) propose two sponses are involved in the ERE metric. The fol- differentdiscriminatormodelsseparatelyusingun- lowing example illustrates the serious weakness igramfeaturesandneuralfeatures. Itishardtotell resulting from this strategy: as will be shown in which feature set is more reliable. The standard the experiment section, when inputs are decoded strategy of testing the model on a held-out devel- usinggreedyorbeamsearchmodels,mostgenera- Input tellme...howlonghaveyouhadthisfallingsickness? Vanilla-MLE i’mnotadoctor. Vanilla-Sample welleverythingyoudidwastotallyuntrue. REINFORCE idon’tknowhowlongit’sbeen. REGSMonteCarlo Afewmonths,Iguess. Input soihadthedoctorstestsammy’sresponsetoconditioning. Vanilla-MLE sammywrotethetestsammywrotethetest. Vanilla-Sample objectspickobjectsupobjectsobjectsobjectsobjectsobjects REINFORCE i’mnotsurethat’srelevant. REGSMonteCarlo sohetookthepills. Input whatareyougoingtodo? Vanilla-MLE i’mgoingtothemovies. Vanilla-Sample gethimouttaherefirst! REINFORCE i’mgoingtogetyou. REGSMonteCarlo i’mgoingtogetyouanewcar. Input theyfearyourpoweryourintellect. Vanilla-MLE you’retheonlyonewhoknowswhat’sgoingon. Vanilla-Sample whentheyareconqueredandyousurrendertheywillcontrolallofus. REINFORCE i’mafraidi’mnotreadyyet. REGSMonteCarlo i’mnotafraidofyourpower. Table1: Sampledresponses fromdifferentmodels. MoreinAppendixTables5and6. tion systems to date yield an adversarial success Setting ERE SVM+Unigram 0.232 less than 10 percent (evaluator accuracy 90 per- ConcatNeural 0.209 cent). Butwhenusing sampling for decoding, the HierarchicalNeural 0.193 adversarial success skyrockets to around 40 per- SVM+Neural+multil-features 0.152 cent,7 only 10 percent less than what’s needed to Table2: EREscoresobtainedbydifferentmodels. pass the Turing test. A close look at the decoded sequences using sampling tells a different story: the responses from sampling are sometimes inco- SVM+Unigram: SVMusingunigramfeatures.8 A herent, irrelevantorevenungrammatical. multi-utterance dialogue (i.e., input messages and We thus propose an additional sanity check, in responses) is transformed to a unigram represen- whichwereporttheaccuracyofdistinguishing be- tation; (2) Concat Neural: a neural classification tweenmachine-generated responsesandrandomly model with a softmax function that takes as in- sampledresponses(machine-vs-random forshort). put the concatenation of representations of con- This resembles the N-choose-1 metric described stituentdialoguessentences;(3)HierarchicalNeu- in Shaoetal. (2016). Higher accuracy indicates ral: a hierarchical encoder with a structure simi- that the generated responses are distinguishable lar to the discriminator used in the reinforcement; fromrandomlysampledhumanresponses,indicat- and (4) SVM+Neural+multi-lex-features: a SVM ingthatthegenerativemodelisnotfoolingthegen- model that uses the following features: unigrams, erator simply by introducing randomness. As we neural representations of dialogues obtained by willshowinSec.5,usingsamplingresultsinhigh the neural model trained using strategy (3),9 the AdverSucvaluesbutlowmachine-vs-randomaccu- forward likelihood logp(t|s) and backward likeli- racy. hoodp(s|t). EREscoresobtainedbydifferentmodelsarere- 5 Experimental Results portedinTable2. Ascanbeseen,thehierarchical neural evaluator (model 3) is more reliable than In this section, we detail experimental results on simply concatenating the sentence-level represen- adversarial success andhumanevaluation. tations(model2). Usingthecombinationofneural 5.1 AdversarialEvaluation features and lexicalized features yields the most reliable evaluator. For the rest of this section, we ERE Wefirsttest adversarial evaluation models reportresultsobtainedbytheHierarchicalNeural withdifferent featuresetsandmodelarchitectures setting due to its end-to-end nature, despite its in- forreliability, asmeasured byevaluator reliability ferioritytoSVM+Neural+multil-features. error(ERE).Weexplorethefollowingmodels: (1) 7SimilarresultsarealsoreportedinKannanandVinyals 8TrainedusingtheSVM-Lightpackage(Joachims,2002). (2016). 9Therepresentationbeforethesoftmaxlayer. Model AdverSuc machine-vs-random MLE-BS 0.037 0.942 Setting adver-win adver-lose tie MLE-Greedy 0.049 0.945 single-turn 0.62 0.18 0.20 MMI+p(t|s) 0.073 0.953 multi-turn 0.72 0.10 0.18 MMI-p(t) 0.090 0.880 Sampling 0.372 0.679 Table 4: The gain from the proposed adversarial Adver-Reinforce 0.080 0.945 model over the mutual information system based Adver-REGS 0.098 0.952 onpairwisehumanjudgments. Table3: AdverSucandmachine-vs-random scores achievedbydifferenttraining/decoding strategies. 5.2 HumanEvaluation For human evaluation, we follow protocols de- Table 3 presents AdverSuc values for different fined in Lietal. (2016d), employing crowd- models, along with machine-vs-random accuracy sourced judges to evaluate a random sample of described in Section 4.3. Higher values of Adver- 200 items. Wepresent both an input message and Sucandmachine-vs-random arebetter. the generated outputs to 3 judges and ask them to Baselines we consider include standard decide which of the two outputs is better (single- SEQ2SEQ models using greedy decoding (MLE- turn general quality). Ties are permitted. Identi- greedy), beam-search (MLE+BS) and sampling, cal strings are assigned the same score. We also as well as the mutual information reranking present the judges with multi-turn conversations model of Lietal. (2016a) with two algorithmic simulated between the two agents. Each conver- variations: (1) MMI+p(t|s), in which a large sation consists of 3 turns. Results are presented in Table 4. We observe a significant quality im- N-best list is first generated using a pre-trained provement on both single-turn quality and multi- SEQ2SEQ model and then reranked by the back- ward probability p(s|t) and (2) MMI−p(t), in turn quality from the proposed adversarial model. It is worth noting that the reinforcement learn- which language model probability is penalized ing system described in Lietal. (2016d), which duringdecoding. simulates conversations between two bots and is Results are shown inTable 3. What first stands trained based on manually designed reward func- out is decoding using sampling (as discussed in tions, only improves multi-turn dialogue quality, Section 4.3), achieving a significantly higher Ad- while the model described in this paper improves verSuc number than all the rest models. How- both single-turn and multi-turn dialogue genera- ever, this does not indicate the superiority of the tionquality. Thisconfirmsthattherewardadopted sampling decoding model, since the machine-vs- inadversarialtrainingismoregeneral,naturaland random accuracy is at the same time significantly effectiveintrainingdialogue systems. lower. This means that sampled responses based onSEQ2SEQ modelsarenotonlyhardforaneval- 6 Conclusionand Future Work uator to distinguish from real human responses, but also from randomly sampled responses. A In this paper, drawing intuitions from the Turing similar, though much less extreme, effect is ob- test, we propose using an adversarial training ap- served for MMI−p(t), which has an AdverSuc proachforresponsegeneration. Wecastthemodel value slightly higher than Adver-Reinforce, but a in the framework of reinforcement learning and significantly lowermachine-vs-random score. train a generator based on the signal from a dis- By comparing different baselines, we find that criminatortogenerateresponsesequencesindistin- MMI+p(t|s) is better than MLE-greedy, which guishable from human-generated dialogues. We is in turn better than MLE+BS. This result is in observe clear performance improvements on mul- line with human-evaluation results from Lietal. tiplemetricsfromtheadversarial trainingstrategy. (2016a). Thetwoproposed adversarial algorithms Inadditiontoadversarialtraining,wedescribea achievebetterperformancethanthebaselines. We modelforadversarial evaluation thatusessuccess expect this to be the case, since the adversarial al- in fooling an adversary as a dialogue evaluation gorithmsaretrainedonanobjectivefunctionmore metric. We propose different methods for sanity similar to the the evaluation metric (i.e., adversar- checksuchastestingtheevaluator’sabilityandre- ial success). REGS performs slightly better than porting machine vs random accuracy to avoid po- thevanillaREINFORCEalgorithm. tentialpitfallsofadversarial evaluation. The adversarial training model should theoreti- laplacian pyramid of adversarial networks. In Ad- cally benefit a variety of generation tasks in NLP. vances in neural information processing systems. pages1486–1494. Unfortunately, in preliminary experiments apply- ing the same training paradigm to machine trans- PeterWGlynn.1990. Likelihoodratiogradientestima- lation and abstractive summarization (Rushetal., tionforstochasticsystems. Communicationsofthe 2015), we did not observe a clear performance ACM33(10):75–84. boost. Weconjecturethatthisisbecausetheadver- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, sarial training strategy is more beneficial to tasks BingXu,DavidWarde-Farley,SherjilOzair,Aaron in which there is a big discrepancy between the Courville,andYoshuaBengio.2014. Generativead- distributions of the generated sequences and the versarial nets. In Advances in Neural Information reference target sequences (inotherwords, thead- ProcessingSystems.pages2672–2680. versarial approach is more beneficial on tasks in SeppHochreiterandJu¨rgenSchmidhuber.1997. Long which entropy of the targets is high) or tasks in short-termmemory. Neuralcomputation9(8):1735– whichinputsequencesdonotbearalltheinforma- 1780. tion needed to generate the target (in other words, thereisnosinglecorrect targetsequence inthese- ThorstenJoachims.2002. Learningtoclassifytextus- mantic space). Exploring this relationship further ing support vector machines: Methods, theory and algorithms. KluwerAcademicPublishers. isafocusofourfuturework. Anjuli Kannan and Oriol Vinyals. 2016. Adversarial evaluationofdialoguemodels. InNIPS2016Work- References shoponAdversarialTraining. V.M.Aleksandrov,V.I.Sysoyev,andV.V.Shemeneva. Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng 1968. Stochastic optimization. EngineeringCyber- Zhang,AaronCourville, and YoshuaBengio.2016. netics5:11–16. Professor forcing: A new algorithmfor training re- current networks. In Advances In Neural Informa- NabihaAsghar,PascaPoupart,JiangXin,andHangLi. tionProcessingSystems.pages4601–4609. 2016. Online sequence-to-sequence reinforcement learning for open-domain conversational agents. JiweiLi,MichelGalley,ChrisBrockett,JianfengGao, arXivpreprintarXiv:1612.03929. and Bill Dolan. 2016a. A diversity-promoting ob- jective functionfor neuralconversationmodels. In Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Proc.ofNAACL-HLT. Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville,andYoshuaBengio.2016. Anactor-critic Jiwei Li, Michel Galley, Chris Brockett, GeorgiosSp- algorithm for sequence prediction. arXiv preprint ithourakis, Jianfeng Gao, and Bill Dolan. 2016b. arXiv:1607.07086. Apersona-basedneuralconversationmodel. InPro- ceedings of the 54th Annual Meeting of the Asso- DzmitryBahdanau,KyunghyunCho,andYoshuaBen- ciation for Computational Linguistics (Volume 1: gio. 2015. Neural machine translation by jointly Long Papers). Berlin, Germany, pages 994–1003. learningtoalignandtranslate. InProc.ofICLR. http://www.aclweb.org/anthology/P16-1094. Samuel R Bowman, Luke Vilnis, Oriol Vinyals, An- drew M Dai, Rafal Jozefowicz, and Samy Ben- JiweiLi,Minh-ThangLuong,andDanJurafsky.2015. gio.2015. Generatingsentencesfroma continuous A hierarchical neural autoencoder for paragraphs space. arXivpreprintarXiv:1511.06349. anddocuments. arXivpreprintarXiv:1506.01057. Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Jiwei Li, Will Monroe, and Dan Jurafsky. 2016c. A Ilya Sutskever, and Pieter Abbeel. 2016a. Info- simple, fast diverse decoding algorithm for neural gan: Interpretable representation learning by infor- generation. arXivpreprintarXiv:1611.08562. mation maximizing generative adversarial nets. In AdvancesInNeuralInformationProcessingSystems. Jiwei Li, Will Monroe, Alan Ritter, and Dan Jurafsky. pages2172–2180. 2016d. Deep reinforcement learning for dialogue generation. arXivpreprintarXiv:1606.01541. Xilun Chen, Ben Athiwaratkun,Yu Sun, Kilian Wein- berger,andClaire Cardie.2016b. Adversarialdeep Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael averagingnetworksforcross-lingualsentimentclas- Noseworthy, Laurent Charlin, and Joelle Pineau. sification. arXivpreprintarXiv:1606.01614. 2016. How NOT to evaluateyourdialoguesystem: Anempiricalstudyofunsupervisedevaluationmet- EmilyL Denton,SoumithChintala, Rob Fergus,et al. ricsfordialogueresponsegeneration. arXivpreprint 2015. Deep generative image models using a? arXiv:1603.08023. RyanLowe,MichaelNoseworthy,IulianSerban,Nico- IulianVlad Serban, RyanLowe, LaurentCharlin, and las Angelard-Gontier, Yoshua Bengio, and Joelle Joelle Pineau. 2016d. Generative deep neural net- Pineau. 2016a. Towards an automatic turing test: worksfordialogue:Ashortreview. Learning to evaluate dialogue responses. arXiv preprintarXiv:1605.05414. Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, Ryan Lowe, Iulian V Serban, Mike Noseworthy, Lau- and Yoshua Bengio. 2016e. A hierarchical latent rentCharlin,andJoellePineau.2016b. Ontheeval- variable encoder-decoder model for generating dia- uationofdialoguesystemswithnextutteranceclas- logues. arXivpreprintarXiv:1605.06069. sification. arXivpreprintarXiv:1605.05414. LifengShang,ZhengdongLu,andHangLi.2015. Neu- Yi Luan, Yangfeng Ji, and Mari Ostendorf. 2016. ral responding machine for short-text conversation. LSTM based conversation models. arXiv preprint InProceedingsofACL-IJCNLP.pages1577–1586. arXiv:1603.09457. Louis Shao, Stephan Gouws, Denny Britz, Anna Minh-Thang Luong, Hieu Pham, and Christopher D Goldie,BrianStrope,andRayKurzweil.2016. Gen- Manning. 2015. Effective approaches to attention- eratinglong and diverseresponseswith neuralcon- based neural machine translation. arXiv preprint versationalmodels. arXiv:1508.04025. ShiqiShen, YongCheng, ZhongjunHe, Wei He, Hua AlecRadford,LukeMetz,andSoumithChintala.2015. Wu,MaosongSun,andYangLiu.2015. Minimum Unsupervisedrepresentationlearningwithdeepcon- risk training for neural machine translation. arXiv volutional generative adversarial networks. arXiv preprintarXiv:1512.02433. preprintarXiv:1511.06434. David Silver, Aja Huang, Chris J Maddison, Arthur Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, Guez,LaurentSifre,GeorgeVanDenDriessche,Ju- andWojciechZaremba.2015. Sequenceleveltrain- lian Schrittwieser, Ioannis Antonoglou, Veda Pan- ing with recurrentneural networks. arXiv preprint neershelvam,MarcLanctot, etal. 2016. Mastering arXiv:1511.06732. thegameofGowithdeepneuralnetworksandtree search. Nature529(7587):484–489. AlanRitter,ColinCherry,andWilliamBDolan.2011. Data-drivenresponsegenerationinsocialmedia. In Alessandro Sordoni, Michel Galley, Michael Auli, ProceedingsofEMNLP2011.pages583–593. Chris Brockett, Yangfeng Ji, Meg Mitchell, Jian- Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A Alexander M Rush, Sumit Chopra, and Jason We- neuralnetworkapproachto context-sensitivegener- ston. 2015. A neural attention model for ab- ationofconversationalresponses. InProceedingsof stractive sentence summarization. arXiv preprint NAACL-HLT. arXiv:1509.00685. Pei-HaoSu, Milica Gasic, NikolaMrksic, LinaRojas- Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Barahona, Stefan Ultes, David Vandyke, Tsung- Vicki Cheung, Alec Radford, and Xi Chen. 2016. Hsien Wen, and Steve Young. 2016. Continuously Improvedtechniquesfortraininggans. InAdvances learningneuraldialoguemanagement. arxiv. in Neural Information Processing Systems. pages 2226–2234. Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequencetosequencelearningwithneuralnetworks. Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, In Advances in neural information processing sys- Aaron Courville, and Joelle Pineau. 2016a. Build- tems.pages3104–3112. ingend-to-enddialoguesystemsusinggenerativehi- erarchicalneuralnetworkmodels. InProceedingsof AlanMTuring.1950. Computingmachineryandintel- AAAI. ligence. Mind59(236):433–460. Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, OriolVinyalsand QuocLe. 2015. A neuralconversa- Aaron Courville, and Joelle Pineau. 2016b. Build- tionalmodel. InProceedingsofICMLDeepLearn- ingend-to-enddialoguesystemsusinggenerativehi- ingWorkshop. erarchicalneuralnetworkmodels. InProceedingsof the30thAAAIConferenceonArtificialIntelligence Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, (AAAI-16). Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, DavidVandyke,andSteveYoung.2016. Anetwork- Iulian Vlad Serban, Tim Klinger, Gerald Tesauro, based end-to-end trainable task-oriented dialogue Kartik Talamadupula, Bowen Zhou, Yoshua Ben- system. arXivpreprintarXiv:1604.04562. gio, and Aaron Courville. 2016c. Multiresolu- tion recurrent neural networks: An application Ronald J Williams. 1992. Simple statistical gradient- to dialogue response generation. arXiv preprint following algorithms for connectionist reinforce- arXiv:1606.00776. mentlearning. Machinelearning8(3-4):229–256.