Adversarial Evaluation of Dialogue Models

Anjuli Kannan (Google Brain, [email protected])
Oriol Vinyals (Google DeepMind, [email protected])

Abstract

The recent application of RNN encoder-decoder models has resulted in substantial progress in fully data-driven dialogue systems, but evaluation remains a challenge. An adversarial loss could be a way to directly evaluate the extent to which generated dialogue responses sound like they came from a human. This could reduce the need for human evaluation, while more directly evaluating on a generative task. In this work, we investigate this idea by training an RNN to discriminate a dialogue model's samples from human-generated samples. Although we find some evidence this setup could be viable, we also note that many issues remain in its practical application. We discuss both aspects and conclude that future work is warranted.

1 Introduction

Building machines capable of conversing naturally with humans is an open problem in language understanding. Recurrent neural networks (RNNs) have drawn particular interest for this problem, typically in the form of an encoder-decoder architecture: one network ingests an incoming message (a Tweet, a chat message, etc.), and a second network generates an outgoing response, conditioned on the first network's final hidden state. This sort of approach has been shown to improve significantly over both a statistical machine translation baseline [9] and traditional rule-based chatbots [11].

However, evaluating dialogue models remains a significant challenge. While perplexity is a good measure of how well a model fits some data, it does not measure performance at a particular task. N-gram-based measures such as BLEU, while useful in translation, are a poor fit for dialogue models because two replies may share no n-grams yet be equally good responses to a given message. Human evaluation may be ideal, but it does not scale well, and it can also be problematic in applications like Smart Reply [3], where the data cannot be viewed by humans.

This work investigates the use of an adversarial evaluation method for dialogue models.
Inspired by the success of generative adversarial networks (GANs) for image generation ([2], and others), we propose that one measure of a model's quality is how easily its output can be distinguished from a human's output. As an initial exploration, we take a fully trained production-scale conversation model deployed as part of the Smart Reply system (the "generator"), and, keeping it fixed, we train a second RNN (the "discriminator") on the following task: given an incoming message and a response, it must predict whether the response was sampled from the generator or from a human. Our goal here is to understand whether such an adversarial setup is viable for evaluation.

We find that a discriminator can in fact distinguish the model output from human output over 60% of the time. Furthermore, it seems to uncover the major weaknesses that humans have observed in the system: an incorrect length distribution and a reliance on familiar, simplistic replies such as "Thank you". Still, significant problems with the practical application of this method remain. In particular, we lack evidence that a model with lower discriminator accuracy (i.e., one that fools the discriminator) would necessarily fare better in human evaluation as well.

We present here the details of our analysis, as well as further discussion of both the merits and drawbacks of an adversarial setup. We conclude that additional investigation is warranted, and lay out several suggestions for that work.

Workshop on Adversarial Training, NIPS 2016, Barcelona, Spain.

1.1 Related work

Much recent work has employed RNN encoder-decoder models to translate from utterance to response ([9], [8], [11]). The work in [5] uses policy gradients, but its rewards are manually defined as useful conversational properties such as non-redundancy. Evaluation remains a significant challenge [6].

The adversarial setup we describe is inspired by work on GANs for image generation [2]; however, we apply the concept to dialogue modeling, which raises the challenges of sequential inputs/outputs and conditional generation. To support our aim of understanding the discriminator, we also do not train the generator and discriminator jointly.
An adversarial loss for language understanding is also used in [1] as a means of evaluation; however, the metric is not applied to any real-world task, nor are the properties of the discriminator itself explored and evaluated, as we will do in this work.

2 Model

Like a GAN, our architecture consists of a generator and a discriminator; however, these are two separate models which are not trained toward a single objective.

The generator is a sequence-to-sequence model, consisting of an RNN encoder and an RNN decoder. Given a corpus of message pairs (o, r), where o, the original message, consists of tokens {o_1, ..., o_n} and r, the response message, consists of tokens {r_1, ..., r_m}, this model is trained to maximize the total log probability of the observed response messages, given their respective original messages:

    \sum_{(o,r)} \log P(r_1, \dots, r_m \mid o_1, \dots, o_n)

The discriminator is also an RNN, but has only an encoder followed by a binary classifier. Given a corpus of message pairs and scores (o, r, y), where y = 1 if r was sampled from the training data and 0 otherwise, this model is trained to maximize:

    \sum_{(o,r,y)} \log P(y \mid o_1, \dots, o_n, r_1, \dots, r_m)

3 Experiments

3.1 Data and training

We investigate the proposed adversarial loss using a corpus of email reply pairs (o, r). The generator is trained on the same data and in the same manner as the production-scale model that is deployed as part of the Smart Reply feature in Inbox by Gmail [3].¹

The discriminator is then trained on a held-out set of the email corpus. For half the pairs (o, r) in the held-out set, we leave the example unchanged and assign a score of 1. For the other half we replace r with a message r' that has been sampled from the generator, and assign the pair (o, r') a score of 0. Then the discriminator is trained as described in the previous section.

3.2 Discriminator performance

We observe that the discriminator can distinguish between generator samples and human samples, conditional on an original message, 62.5% of the time. This in itself may be somewhat unexpected:

¹ In particular, from [3]: "All email data (raw data, preprocessed data and training data) was encrypted. Engineers could only inspect aggregated statistics on anonymized sentences that occurred across many users and did not identify any user."
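The labeling scheme of Section 3.1 and the discriminator objective of Section 2 can be sketched as follows. This is a deliberately simplified stand-in, not the paper's implementation: a bag-of-words logistic classifier replaces the RNN encoder, a stock-phrase function replaces the Smart Reply generator, and the four message pairs are invented for illustration.

```python
import math

def build_discriminator_data(held_out_pairs, sample_from_generator):
    """Label half the held-out (o, r) pairs 1 (human response) and replace
    the response in the other half with a generator sample, labeled 0."""
    data = []
    for i, (o, r) in enumerate(held_out_pairs):
        if i % 2 == 0:
            data.append((o, r, 1))                          # human response
        else:
            data.append((o, sample_from_generator(o), 0))   # generator sample
    return data

# Toy stand-in for the generator: always emits a stock reply.
def toy_generator(message):
    return "thank you"

pairs = [("lunch tomorrow ?", "sounds great"),
         ("are you coming ?", "yes i am"),
         ("meeting moved", "ok noted"),
         ("party on sunday ?", "maybe")]
data = build_discriminator_data(pairs, toy_generator)

# Bag-of-words features over message + response (stand-in for the encoder).
vocab = sorted({w for o, r, _ in data for w in (o + " " + r).split()})
def features(o, r):
    words = set((o + " " + r).split())
    return [1.0 if w in words else 0.0 for w in vocab]

# Maximize sum of log P(y | o, r) for a logistic classifier, by gradient
# ascent; (y - p) is the gradient of the log-likelihood wrt the logit.
weights = [0.0] * len(vocab)
bias = 0.0
lr = 0.5
for _ in range(200):
    for o, r, y in data:
        x = features(o, r)
        logit = sum(wi * xi for wi, xi in zip(weights, x)) + bias
        p = 1.0 / (1.0 + math.exp(-logit))
        g = y - p
        weights = [wi + lr * g * xi for wi, xi in zip(weights, x)]
        bias += lr * g

def discriminator_score(o, r):
    """Estimated P(y = 1 | o, r): probability the response is human."""
    logit = sum(wi * xi for wi, xi in zip(weights, features(o, r))) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

After training, the toy discriminator assigns human pairs a score near 1 and generator-sampled pairs a score near 0, which is exactly the signal whose accuracy is reported below.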
one may expect that since the discriminator is only as powerful as the generator, it would not be able to distinguish the generator's distribution from the training distribution. A full precision-recall curve is shown in Figure 1.

[Figure 1: a. (left) Discriminator score vs. length. b. (right) Recall vs. precision.]

Original message                      Length  Top responses (discriminator)  Top responses (generator)
"Actually, yes, let's move it         3       Oh alright.                    Ok thank you
 to Monday."                                  That's fine                    That's fine
                                              Good...                        . thank you!
"Are you going to Matt's party        1       Ya                             Yes
 on Sunday?"                                  Maybe                          No
                                              Yeah                           yes

Table 1: Comparison of ranking of responses of the same length by discriminator score and generator score. The discriminator favors less common language like "Ya" over "Yes" and consistently omits "Thank you" from its top responses.

3.3 Comparison with perplexity

A qualitative analysis shows that the discriminator objective favors different features than the generator objective. To demonstrate this, we sample 100 responses from the generator for each of 100 donated email messages. These are then ranked according to both the discriminator score and the generator's assigned log likelihood, and the two rankings are compared.

First, we see that the discriminator's preferences are strongly correlated with length (Figure 1). This is relevant because it has been previously documented that sequence-to-sequence models have a length bias [10]. The discriminator relies too heavily on this signal, favoring longer responses even when they are not internally coherent. Still, it is noteworthy that it identifies something humans have documented as a key weakness of the model [11].

The discriminator does not assign equal probability to all responses of the same length. When comparing responses of the same length, we find that it produces a significantly different ranking than the likelihood assigned by the generator, with an average Spearman's correlation of -0.02. Broadly speaking, we find that the discriminator has less preference for the most common responses produced by the generator, things like "Thank you!" and "Yes!" (Table 1).
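The rank comparison above reduces to computing Spearman's correlation between two orderings of the same candidate set. A minimal implementation (without the tie correction that the full statistic applies) looks like this:

```python
def rank_positions(scores):
    """Map each item index to its rank (0 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks

def spearman(scores_a, scores_b):
    """Spearman's rho for two score lists over the same items (no ties)."""
    n = len(scores_a)
    ra, rb = rank_positions(scores_a), rank_positions(scores_b)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))
```

Identical rankings give 1.0 and exactly reversed rankings give -1.0, so a value near zero, such as the -0.02 reported above, means the discriminator's ordering of same-length responses is essentially unrelated to the generator's likelihood ordering.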
The lack of diverse generated language has been documented as a weakness of these dialogue models in [3] and [4], both of which incorporate significant post-processing and re-ranking to overcome this noted weakness. As with length, the discriminator's preference for rarer language does not necessarily mean it is favoring better responses; it is noteworthy only in that it shows signs of detecting the known weaknesses of the generator. Future work might incorporate minibatch discrimination [7] to more explicitly address the diversity weakness.

4 Discussion

In this research note we investigated whether the discriminator in GANs can be employed for automatic evaluation of dialogue systems. We see a natural progression towards using discriminators:

1. Ask humans to evaluate every published system in a consistent manner. Though ideal, this would be time-consuming and prohibitively expensive.

2. Annotate a large dataset of dialogues, learn a "critic" (e.g., a neural network), and use it to score any new system (so that extra human labour would not be required). However, this critic would likely not perform well when evaluated off-policy, and overfitting could occur, as researchers may naturally find its weaknesses.

3. Use the discriminator of a GAN as a proxy for giving feedback akin to a human's. Since training such a critic would be simple for any dialogue system, each research group could provide theirs, and any new system could be evaluated with a variety of discriminators.

The last item is the simplest, and it is what we have explored in this work. Our preliminary work suggests that the critic we trained on a production-quality dialogue system is able to automatically find some of the previously identified weaknesses of probabilistic models: sequence length and diversity. It also succeeds in identifying real vs. generated responses of a highly tuned system.

However, as with GANs, more needs to be understood, and using discriminators alone will not solve the evaluation challenges of dialogue systems. Despite the fact that GANs use no extra information beyond what is already present in the training dataset, some have argued that the adversarial loss is a better loss than likelihood [12].
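The third option above could, in principle, be wired up as follows. The sketch is purely illustrative: the two hand-written critics stand in for trained discriminators contributed by different research groups (the length bias and stock-reply aversion they encode are the behaviors observed in Section 3.3), and none of this is implemented in the paper.

```python
def evaluate_with_discriminators(message_response_pairs, discriminators):
    """Score a candidate dialogue system against a pool of trained critics.

    Each discriminator maps (message, response) to an estimate of P(human);
    a system that fools the whole pool gets a mean score near 1.0, while an
    easily detected one scores near 0.0.
    """
    scores = []
    for o, r in message_response_pairs:
        per_critic = [d(o, r) for d in discriminators]
        scores.append(sum(per_critic) / len(per_critic))
    return sum(scores) / len(scores)

# Hand-written stand-ins for trained critics: one trusts longer replies
# (the length bias of Section 3.3), one penalizes the stock reply.
length_critic = lambda o, r: min(1.0, len(r.split()) / 5.0)
diversity_critic = lambda o, r: 0.1 if r == "thank you" else 0.9

pairs = [("lunch tomorrow ?", "thank you"),
         ("are you coming ?", "yes , see you at noon")]
score = evaluate_with_discriminators(pairs, [length_critic, diversity_critic])
```

Averaging over a pool of critics, rather than relying on a single one, is what would make the protocol harder to game: a new system cannot simply overfit to the quirks of one group's discriminator.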
Still, there remains a tension between what we train the discriminator on (samples) and what we typically use in practice (the maximally likely response, or some approximation of it). Discriminators have a harder time distinguishing sampled responses than beam-search responses, but this conflicts with human observations that some amount of search is typically useful for obtaining the highest-quality responses. Further work is required to understand if and how discriminators can be applied in this domain.

References

[1] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of NIPS, 2014.
[3] A. Kannan, K. Kurach, S. Ravi, T. Kaufmann, B. Miklos, G. Corrado, et al. Smart Reply: Automated response suggestion for email. In Proceedings of KDD, 2016.
[4] J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of NAACL-HLT, 2016.
[5] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
[6] C.-W. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Charlin, and J. Pineau. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of EMNLP, 2016.
[7] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
[8] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. Hierarchical neural network generative models for movie dialogues. arXiv preprint arXiv:1507.04808, 2015.
[9] A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J.-Y. Nie, J. Gao, and B. Dolan. A neural network approach to context-sensitive generation of conversation responses. In Proceedings of NAACL-HLT, 2015.
[10] P. Sountsov and S. Sarawagi. Length bias in encoder decoder models and a case for global conditioning. In Proceedings of EMNLP, 2016.
[11] O. Vinyals and Q. V. Le. A neural conversation model. In ICML Deep Learning Workshop, 2015.
[12] L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473, 2016.
