Adversarial Evaluation of Dialogue Models

Anjuli Kannan
Google Brain
[email protected]

Oriol Vinyals
Google DeepMind
[email protected]

Workshop on Adversarial Training, NIPS 2016, Barcelona, Spain.

Abstract

The recent application of RNN encoder-decoder models has resulted in substantial progress in fully data-driven dialogue systems, but evaluation remains a challenge. An adversarial loss could be a way to directly evaluate the extent to which generated dialogue responses sound like they came from a human. This could reduce the need for human evaluation, while more directly evaluating on a generative task. In this work, we investigate this idea by training an RNN to discriminate a dialogue model's samples from human-generated samples. Although we find some evidence this setup could be viable, we also note that many issues remain in its practical application. We discuss both aspects and conclude that future work is warranted.

1 Introduction

Building machines capable of conversing naturally with humans is an open problem in language understanding. Recurrent neural networks (RNNs) have drawn particular interest for this problem, typically in the form of an encoder-decoder architecture: one network ingests an incoming message (a Tweet, a chat message, etc.), and a second network generates an outgoing response, conditional on the first network's final hidden state. This sort of approach has been shown to improve significantly over both a statistical machine translation baseline [9] and traditional rule-based chatbots [11].

However, evaluating dialogue models remains a significant challenge. While perplexity is a good measure of how well a model fits some data, it does not measure performance at a particular task. N-gram-based measures such as BLEU, while useful in translation, are a poor fit for dialogue models, because two replies may have no n-gram overlap and yet be equally good responses to a given message. Human evaluation may be ideal, but it does not scale well, and it can also be problematic in applications like SmartReply [3], where the data cannot be viewed by humans.

This work investigates the use of an adversarial evaluation method for dialogue models. Inspired by the success of generative adversarial networks (GANs) for image generation ([2], and others), we propose that one measure of a model's quality is how easily its output is distinguished from a human's output. As an initial exploration, we take a fully trained production-scale conversation model deployed as part of the SmartReply system (the "generator") and, keeping it fixed, we train a second RNN (the "discriminator") on the following task: given an incoming message and a response, it must predict whether the response was sampled from the generator or from a human. Our goal here is to understand whether an adversarial setup is viable for evaluation.

We find that a discriminator can in fact distinguish the model output from human output over 60% of the time. Furthermore, it seems to uncover the major weaknesses that humans have observed in the system: an incorrect length distribution and a reliance on familiar, simplistic replies such as "Thank you". Still, significant problems with the practical application of this method remain. We lack evidence that a model which achieves lower discriminator accuracy (i.e., one that fools the discriminator) would necessarily fare better in human evaluation as well.

We present here the details of our analysis, as well as further discussion of both the merits and the drawbacks of an adversarial setup. We conclude that additional investigation is warranted, and lay out several suggestions for that work.

1.1 Related work

Much recent work has employed RNN encoder-decoder models to translate from utterance to response ([9], [8], [11]). Work in [5] has used policy gradient, but the rewards are manually defined as useful conversational properties such as non-redundancy. Evaluation remains a significant challenge [6].

The adversarial setup we describe is inspired by work on GANs for image generation [2]; however, we apply the concept to dialogue modeling, which raises the challenges of sequential inputs/outputs and conditional generation. To support our aim of understanding the discriminator, we also do not train the generator and discriminator jointly.

An adversarial loss for language understanding is also used in [1] as a means of evaluation; however, the metric is not applied to any real-world task, nor are the properties of the discriminator itself explored and evaluated, as we will do in this work.

2 Model

Like a GAN, our architecture consists of a generator and a discriminator; however, these are two separate models which are not trained to a single objective.

The generator is a sequence-to-sequence model, consisting of an RNN encoder and an RNN decoder. Given a corpus of message pairs (o, r), where o, the original message, consists of tokens {o_1, ..., o_n} and r, the response message, consists of tokens {r_1, ..., r_m}, this model is trained to maximize the total log probability of the observed response messages, given their respective original messages:

    \sum_{(o,r)} \log P(r_1, \ldots, r_m \mid o_1, \ldots, o_n)

The discriminator is also an RNN, but it has only an encoder, followed by a binary classifier. Given a corpus of message pairs and scores (o, r, y), where y = 1 if r was sampled from the training data and y = 0 otherwise, this model is trained to maximize:

    \sum_{(o,r,y)} \log P(y \mid o_1, \ldots, o_n, r_1, \ldots, r_m)
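To make the two objectives concrete, the following is a minimal Python sketch of both log-likelihood terms, computed at the loss level for a single example. It is an illustration under stated assumptions, not the paper's implementation: generator_log_likelihood, discriminator_log_likelihood, and their inputs are hypothetical stand-ins, with the per-token probabilities and the classifier logit assumed to come from the RNN decoder and the discriminator's encoder, respectively.

    # Minimal sketch of the Section 2 objectives (illustrative only).
    import numpy as np

    def generator_log_likelihood(step_probs):
        # Sum of log P(r_t | o, r_1..r_{t-1}) over response tokens, where
        # step_probs[t] is the probability the decoder assigned to the
        # observed token r_t. This is the generator objective for one
        # (o, r) pair.
        return float(np.sum(np.log(step_probs)))

    def discriminator_log_likelihood(logit, y):
        # log P(y | o, r) for a binary classifier on the discriminator's
        # final encoder state; logit is its unnormalized score, and y is
        # 1 if r came from the training data, 0 if it was sampled.
        p_human = 1.0 / (1.0 + np.exp(-logit))  # sigmoid
        return float(np.log(p_human if y == 1 else 1.0 - p_human))

    # Toy usage: a three-token response with decoder probabilities
    # 0.5, 0.8, 0.9, and a discriminator logit of 1.2 on a human pair.
    print(generator_log_likelihood(np.array([0.5, 0.8, 0.9])))  # ~ -1.02
    print(discriminator_log_likelihood(1.2, y=1))               # ~ -0.26

Training maximizes the sum of each term over the corpus; in practice one would minimize the corresponding negative log likelihoods by stochastic gradient descent.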
3 Experiments

3.1 Data and training

We investigate the proposed adversarial loss using a corpus of email reply pairs (o, r). The generator is trained on the same data and in the same manner as the production-scale model that is deployed as part of the SmartReply feature in Inbox by Gmail [3].^1

^1 In particular, from [3]: "All email data (raw data, preprocessed data and training data) was encrypted. Engineers could only inspect aggregated statistics on anonymized sentences that occurred across many users and did not identify any user."

The discriminator is then trained on a held-out set of the email corpus. For half the pairs (o, r) in the held-out set, we leave the example unchanged and assign a score of 1. For the other half, we replace r with a message r′ that has been sampled from the generator, and assign the pair (o, r′) a score of 0. The discriminator is then trained as described in the previous section.
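Constructing the discriminator's training set amounts to a relabeling pass over the held-out corpus. The following is a minimal sketch under stated assumptions: held_out_pairs and sample_response are hypothetical stand-ins for the held-out (o, r) pairs and for drawing one response from the trained generator.

    # Minimal sketch of the Section 3.1 data construction (illustrative).
    import random

    def build_discriminator_examples(held_out_pairs, sample_response, seed=0):
        # Keep the human response (score 1) for half the pairs; replace it
        # with a generator sample r' (score 0) for the other half.
        pairs = list(held_out_pairs)
        random.Random(seed).shuffle(pairs)
        half = len(pairs) // 2
        examples = [(o, r, 1) for o, r in pairs[:half]]
        examples += [(o, sample_response(o), 0) for o, _ in pairs[half:]]
        random.Random(seed + 1).shuffle(examples)  # mix labels for training
        return examples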
3.2 Discriminator performance

We observe that the discriminator can distinguish between generator samples and human samples, conditional on an original message, 62.5% of the time. This in itself may be somewhat unexpected: one might expect that, since the discriminator is only as powerful as the generator, it would not be able to distinguish the generator's distribution from the training distribution. A full precision-recall curve is shown in Figure 1.

Figure 1: a. (left) Discriminator score vs. length. b. (right) Recall vs. precision.

    Original message: "Actually, yes, let's move it to Monday."  (length-3 responses)
      Top responses (discriminator): Oh alright. / That's fine / Good... .
      Top responses (generator):     Ok thank you / That's fine / thank you!

    Original message: "Are you going to Matt's party on Sunday?"  (length-1 responses)
      Top responses (discriminator): Ya / Maybe / Yeah
      Top responses (generator):     Yes / No / yes

Table 1: Comparison of rankings of responses of the same length by discriminator score and generator score. The discriminator favors less common language like "Ya" over "Yes" and consistently omits "Thank you" from its top responses.

3.3 Comparison with perplexity

A qualitative analysis shows that the discriminator objective favors different features than the generator objective. To demonstrate this, we sample 100 responses from the generator for each of 100 donated email messages. These are then ranked according to both the discriminator score and the generator's assigned log likelihood, and the two rankings are compared.

First, we see that the discriminator's preferences are strongly correlated with length (Figure 1). This is relevant because it has been previously documented that sequence-to-sequence models have a length bias [10]. The discriminator relies too heavily on this signal, favoring longer responses even when they are not internally coherent. Still, it is noteworthy that it identifies something humans have documented as a key weakness of the model [11].

The discriminator does not assign equal probability to all responses of the same length. When comparing responses of the same length, we find that it ranks them quite differently from the likelihood assigned by the generator, with an average Spearman's correlation of -0.02. Broadly speaking, we find that the discriminator has less preference for the most common responses produced by the generator, such as "Thank you!" and "Yes!" (Table 1). The lack of diverse generated language has been documented as a weakness of these dialogue models in [3] and [4], both of which incorporate significant post-processing and re-ranking to overcome it. As with length, the discriminator's preference for rarer language does not necessarily mean it is favoring better responses; it is noteworthy only in that it shows signs of detecting this known weakness of the generator. Future work might incorporate minibatch discrimination [7] to more explicitly address the diversity weakness.
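The length-controlled comparison above can be expressed compactly. Below is a minimal sketch under stated assumptions: the (length, discriminator score, generator log likelihood) tuples are a hypothetical stand-in for however the 100 scored samples per message are actually stored, and scipy's spearmanr supplies the rank correlation.

    # Minimal sketch of the Section 3.3 ranking comparison (illustrative).
    from collections import defaultdict
    from scipy.stats import spearmanr

    def length_controlled_spearman(candidates):
        # Group candidate responses by token length so that length itself
        # cannot drive the result, then average Spearman's rho between
        # discriminator scores and generator log likelihoods per group.
        by_length = defaultdict(list)
        for length, disc_score, gen_logprob in candidates:
            by_length[length].append((disc_score, gen_logprob))
        correlations = []
        for group in by_length.values():
            if len(group) < 2:
                continue  # rank correlation is undefined for one response
            disc, gen = zip(*group)
            rho, _ = spearmanr(disc, gen)
            correlations.append(rho)
        return sum(correlations) / len(correlations) if correlations else float("nan")

A value near zero, like the -0.02 reported above, indicates that the two models rank same-length responses almost independently of each other.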
4 Discussion

In this research note we investigated whether the discriminator of a GAN can be employed for automatic evaluation of dialogue systems. We see a natural progression towards using discriminators:

1. Ask humans to evaluate each single published system in a consistent manner. Though ideal, this would be time consuming and prohibitively expensive.

2. Annotate a large dataset of dialogues, learn a "critic" (e.g., a neural network), and use it to score any new system, so that extra human labour would not be required. However, this critic would likely not perform well when evaluated off-policy, and overfitting could occur, as researchers would naturally find its weaknesses.

3. Use the discriminator of a GAN as a proxy for feedback akin to a human's. Since training such a critic would be simple for any dialogue system, each research group could provide theirs, and any new system could be evaluated with a variety of discriminators.

The last item is the simplest, and it is what we have explored in this work. Our preliminary work suggests that the critic we trained on a production-quality dialogue system is able to automatically find some of the previously identified weaknesses of probabilistic models: sequence length and diversity. It also succeeds in identifying real vs. generated responses of a highly tuned system.

However, as with GANs, more needs to be understood, and using discriminators alone won't solve the evaluation challenges of dialogue systems. Despite the fact that GANs use no information beyond what is already present in the training dataset, some have argued that the adversarial loss is a better loss than likelihood [12]. Still, there remains a tension between what we train the discriminator on (samples) and what we typically use in practice (the maximally likely response, or some approximation of it). Discriminators have a harder time when responses are sampled than when they are produced by beam search, but this conflicts with the human observation that some amount of search is typically useful for getting the highest-quality responses. Further work is required to understand if and how discriminators can be applied in this domain.

References

[1] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of NIPS, 2014.

[3] A. Kannan, K. Kurach, S. Ravi, T. Kaufmann, B. Miklos, G. Corrado, et al. Smart Reply: Automated response suggestion for email. In Proceedings of KDD, 2016.

[4] J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of NAACL-HLT, 2016.

[5] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.

[6] C.-W. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Charlin, and J. Pineau. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of EMNLP, 2016.

[7] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.

[8] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. Hierarchical neural network generative models for movie dialogues. arXiv preprint arXiv:1507.04808, 2015.

[9] A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J.-Y. Nie, J. Gao, and B. Dolan. A neural network approach to context-sensitive generation of conversation responses. In Proceedings of NAACL-HLT, 2015.

[10] P. Sountsov and S. Sarawagi. Length bias in encoder decoder models and a case for global conditioning. In Proceedings of EMNLP, 2016.

[11] O. Vinyals and Q. V. Le. A neural conversation model. In ICML Deep Learning Workshop, 2015.

[12] L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473, 2016.