Learning to Decode for Future Success

Jiwei Li, Will Monroe and Dan Jurafsky
Computer Science Department, Stanford University, Stanford, CA, USA
jiweil,wmonroe4,[email protected]

Abstract

We introduce a simple, general strategy to manipulate the behavior of a neural decoder that enables it to generate outputs that have specific properties of interest (e.g., sequences of a pre-specified length). The model can be thought of as a simple version of the actor-critic model that uses an interpolation of the actor (the MLE-based token generation policy) and the critic (a value function that estimates the future values of the desired property) for decision making. We demonstrate that the approach is able to incorporate a variety of properties that cannot be handled by standard neural sequence decoders, such as sequence length and backward probability (probability of sources given targets), in addition to yielding consistent improvements in abstractive summarization and machine translation when the property to be optimized is BLEU or ROUGE scores.

1 Introduction

Neural generation models (Sutskever et al., 2014; Bahdanau et al., 2015; Cho et al., 2014; Kalchbrenner and Blunsom, 2013) learn to map source to target sequences in applications such as machine translation (Sennrich et al., 2015; Gulcehre et al., 2015), conversational response generation (Vinyals and Le, 2015; Sordoni et al., 2015), and abstractive summarization (Nallapati et al., 2016; Rush et al., 2015).

Neural generation models are standardly trained by maximizing the likelihood of target sequences given source sequences in a training dataset. At test time, a decoder incrementally generates a sequence with the highest probability using search strategies such as beam search. This locally incremental nature of the decoding model leads to the following issue: decoders cannot be tailored to generate target sequences with specific properties of interest, such as pre-specified length constraints (Shao et al., 2017; Shi et al., 2016), which might be useful in tasks like conversational response generation or non-factoid question answering, and cannot deal with important objectives, such as the mutual information between sources and targets (Li et al., 2016a), that require knowing the full target sequence in advance.

To address this issue, we propose a general strategy that allows the decoder to incrementally generate output sequences that, when complete, will have specific properties of interest. Such properties can take various forms, such as length, diversity, mutual information between sources and targets, and BLEU/ROUGE scores. The proposed framework integrates two models: the standard SEQ2SEQ model, trained to incrementally predict the next token, and a future output estimation model, trained to estimate future properties solely from a prefix string (or the representation associated with this string), and incorporated into the decoder to encourage it to make decisions that lead to better long-term future outcomes.

Making decoding decisions based on future success resembles the central idea of reinforcement learning (RL), that of training a policy that leads to better long-term reward. Our work is thus related to a variety of recent work inspired by or using reinforcement learning (e.g., REINFORCE or actor-critic models) for sequence generation (Wiseman and Rush, 2016; Shen et al., 2015; Bahdanau et al., 2016; Ranzato et al., 2016). The proposed model can be viewed as a simpler but more effective version of the actor-critic RL model (Bahdanau et al., 2016) in sequence generation: it does not rely on the critic to update the policy (the actor), but rather uses a linear interpolation of the actor (the policy) and the critic (the value function) to make final decisions. Such a strategy comes with the following benefits: (1) it naturally avoids the known problems of large variance and instability with the use of reinforcement learning in tasks with enormous search spaces like sequence generation. As will be shown in the experiment sections, this simplified take on reinforcement learning without policy updates yields consistent improvements, not only outperforming standard SEQ2SEQ models, but also the RL models themselves, in a wide range of sequence generation tasks; (2) training RL-based generation models using specific features like sequence length as rewards not only increases the model's instability but may also lead to suboptimal generated utterances, for example, sequences that satisfy a length constraint but are irrelevant, incoherent or even ungrammatical.[1]

[1] A workaround is to use the linear interpolation of the MLE-based policy and the value function for a specific property as a reward for RL training. This strategy comes with the following disadvantages: it requires training different models for different interpolation weights, and again, suffers from large training variance.

We study how to incorporate into the decoder different properties of the future output sequence: (1) sequence length: the approach provides the flexibility of controlling the output length, which in turn addresses sequence models' bias towards generating short sequences (Sountsov and Sarawagi, 2016); (2) mutual information between sources and targets: the approach enables modeling the bidirectional dependency between sources and targets at each decoding time-step, significantly improving response quality on a task of conversational response generation; and (3) the properties can also take the form of BLEU and ROUGE scores, yielding consistent improvements in machine translation and summarization and achieving the state-of-the-art result on the IWSLT German-English translation task.

2 Model Overview

In this section, we first review the basics of training and decoding in standard neural generation models. Then we give a sketch of the proposed model.

2.1 Basics

Neural sequence-to-sequence (SEQ2SEQ) generation models aim to generate a sequence of tokens Y given input sequence X. Using recurrent nets, LSTMs (Hochreiter and Schmidhuber, 1997) or CNNs (Krizhevsky et al., 2012; Kim, 2014), X is first mapped to a vector representation, which is then used as the initial input to the decoder. A neural generation model defines a distribution over outputs by sequentially predicting tokens using a softmax function:

p(Y|X) = prod_{t=1}^{N} p(y_t | X, y_{1:t-1})

where N denotes the length of Y. Decoding typically seeks to find the maximum-probability sequence Y* given input X:

Y* = argmax_Y p(Y|X)   (1)

The softmax function that computes p(y_t | X, y_{1:t-1}) takes as input the hidden representation at time step t-1, denoted by h_{t-1}. The hidden representation h_{t-1} is computed using a recurrent net that combines the previously built representation h_{t-2} and the word representation e_{t-1} for word y_{t-1}. It is infeasible to enumerate the large space of possible sequence outputs, so beam search is normally employed to find an approximately optimal solution. Given a partially generated sequence y_{1:t-1}, the score for choosing token y_t (denoted by S(y_t)) is thus given by

S(y_t) = log p(y_t | h_{t-1})   (2)

2.2 The Value Function Q

The core of the proposed architecture is to train a future outcome prediction function (or value function) Q, which estimates the future outcome of taking an action (choosing a token) y_t in the present. The function Q is then incorporated into S(y_t) at each decoding step to push the model to generate outputs that lead to future success. This yields the following definition for the score S(y_t) of taking action y_t:

S(y_t) = log p(y_t | h_{t-1}) + gamma * Q(X, y_{1:t})   (3)

where gamma denotes the hyperparameter controlling the trade-off between the local probability prediction p(y_t | h_{t-1}) and the value function Q(X, y_{1:t}). The input to Q can take various forms, such as the vector representation of the decoding step after y_t has been considered (i.e., h_t) or the raw strings (X and y_{1:t}).[2]

[2] One can think of h_t as the output of a function that takes as input X and y_{1:t}.

Q can be trained either jointly with or independently of the SEQ2SEQ model. When training Q, we provide it with source-target pairs (X, Y), where Y is a full sequence. Y = {y_1, y_2, ..., y_N} can either be sampled or decoded using a trained model (making Q dependent on the pre-trained SEQ2SEQ model) or can be taken from the training set (making Q independent of the SEQ2SEQ model). However, Y must always be a full sequence. The future outcome of generating each of the tokens of Y (y_1, y_2, ..., y_N) is the feature score (BLEU, length, mutual information, etc.) associated with the full sequence Y, denoted by q(Y). The future outcome function Q is trained to predict q(Y) from (X, y_{1:t}), where 1 <= t <= N.

Q estimates the long-term outcome of taking an action y_t. It is thus similar to the value function in Q-learning, the role of the critic in actor-critic reinforcement learning (Sutton, 1988; Grondman et al., 2012), the value network for position evaluation in the Monte-Carlo tree search of AlphaGo (Silver et al., 2016), or the h* function in A* search (Och et al., 2001). In this paper, the terms value function, future outcome prediction function, and Q are used interchangeably.

In the sections below, we will describe how to adapt this general framework to various features with different properties and different kinds of input to the future outcome prediction function.

3 Q for Controlling Sequence Length

For tasks like machine translation, abstractive summarization and image caption generation, the information required to generate the target sequences is already embedded in the input. Usually we don't have to worry about the length of targets, since the model can figure it out itself; this is a known, desirable property of neural generation models (Shi et al., 2016).

However, for tasks like conversational response generation and non-factoid question answering, in which there is no single correct answer, it is useful to be able to control the length of the targets. Additionally, in tasks like conversational response generation, SEQ2SEQ models have a strong bias towards generating short sequences (Sountsov and Sarawagi, 2016). This is because the standard search algorithm at decoding time can only afford to explore a very small action space. As decoding proceeds, only a small number of hypotheses can be maintained. By Zipf's law, short sequences are significantly more frequent than longer ones. Therefore, the prefixes of shorter responses are usually assigned higher probability by the model. This makes prefixes of longer sequences fall off the beam after a few decoding steps, leaving only short sequences.

One can still force the model to keep generating tokens (simply by prohibiting the EOS token and forcing the decoding to proceed). However, since the previous decoded tokens were chosen with a shorter sentence in mind, artificially lengthening the response this way will result in low-quality responses. In particular, problems arise with repetition ("no, no, no, no, no, no") or incoherence ("i like fish, but i don't like fish but I do like fish").

3.1 Training Q for Sequence Length

Shao et al. (2017) give one efficient method of generating long sequences, consisting of a stochastic search algorithm and segment-by-segment reranking of hypotheses. The fundamental idea is to keep a diverse list of hypotheses on the beam and remove those that are similar to each other, so as to explore the space more adequately. While more adequately exploring the search space can increase the likelihood of generating long sequences, since the beam is more likely to include a prefix of a long sequence, this method doesn't offer direct control over sequence length. Length information seems to be embedded in the hidden representations of neural models in some implicit way (Shi et al., 2016). We therefore build another neural model to expose this length information and use it to estimate the number of words left to be generated.

Given a pre-trained sequence-to-sequence model, an input sequence X, and a target Y = {y_1, y_2, ..., y_N}, where N denotes the length of Y, we first run a forward pass to compute the hidden representation h_t associated with each time step on the target side (1 <= t <= N). Then we build a regression model Q(h_t), which takes as input h_t to predict the length of the remaining sequence, i.e., N - t. The model first passes h_t to two non-linear layers, on top of which is a linear regression model which outputs the predicted number of tokens left to decode.
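The training pairs for this regressor can be read directly off a gold target: each hidden state h_t is labeled with the number of tokens remaining, N - t. A minimal sketch of this pairing (the function name and the placeholder string states are illustrative, not from the paper; real inputs would be the decoder's hidden vectors):

```python
def length_regression_examples(hidden_states):
    """Pair each target-side hidden state h_t (t = 1..N) with the
    regression label N - t, the number of tokens left to decode."""
    n = len(hidden_states)
    return [(h, n - t) for t, h in enumerate(hidden_states, start=1)]

# A 4-token target yields the labels 3, 2, 1, 0 (0 at the final step,
# where only EOS remains to be emitted).
examples = length_regression_examples(["h1", "h2", "h3", "h4"])
labels = [label for _, label in examples]
```

The regressor is then fit to these (h_t, N - t) pairs with a mean squared error objective, as described next.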
The regression model is optimized by minimizing the mean squared loss between the predicted sequence length and the gold-standard length N - t on source-target pairs taken from the training set.

Figure 1: An illustration of the proposed future length prediction model. N denotes the number of words left to generate and L denotes the pre-specified sequence length.

3.2 Decoding

Given an input X, suppose that we wish to generate a target sequence of a pre-specified length N. At decoding time step t-1, we first obtain the vector representation h_{t-1} for the current time step. The score used to rank choices for the next token y_t is a linear combination of the log probability output by the sequence model and the mean squared loss between the number of words left to generate (N - t) and the output from Q(h_t):

y_t = argmax_y log p(y_{1:t} | X) - lambda * ||(N - t) - Q(h_t)||^2   (4)

where y_{1:t} is the concatenation of y_t and the previously decoded (given) sequence y_{1:t-1}, and h_t is obtained from the sequence model by combining h_{t-1} and the word representation of y_t as if it were the true next token. lambda is a hyperparameter controlling the influence of the future length estimator.

3.3 Experiments

We evaluate the proposed model on the task of open-domain conversation response generation (Vinyals and Le, 2015; Sordoni et al., 2015; Serban et al., 2015a; Mei et al., 2016; Serban et al., 2015b), in which a model must predict the next turn of a dialogue given the preceding ones. We use the OpenSubtitles (OSDb) dataset (Tiedemann, 2009).

We compare the proposed model with the standard SEQ2SEQ beam search (SBS) decoder. We first group test dialogue pairs by target length and decode each group. At decoding time, for an input with a gold target of length L, we force the model to generate an output of length L. This can be achieved by selecting a hypothesis that predicts an EOS token at time step L+1.[3] If no EOS is predicted at time step L+1, we continue decoding and stop once an EOS token is generated. We report BLEU scores on the concatenation of the outputs from each length cluster.[4]

[3] If multiple hypotheses satisfy this requirement, we pick the one with the largest likelihood.
[4] In this setup, both algorithms are allowed to know the length of the gold-standard targets. The results from different models are thus comparable. This is to remove the effect of target length on the evaluation metrics (all metrics employed are sensitive to target length).

We also report adversarial success (AdverSuc) and machine-vs-random accuracy, evaluation metrics proposed in Li et al. (2016c). Adversarial success refers to the percentage of machine-generated responses that are able to fool a trained evaluator model into believing that they are generated by a human; machine-vs-random accuracy denotes the accuracy of a (different) trained evaluator model at distinguishing between machine-generated responses and randomly-sampled responses.[5] Higher values of adversarial success and machine-vs-random accuracy indicate the superiority of a model. We refer readers to Li et al. (2016c) for more details. Table 2 presents the quantitative results: adding predictor rankers increases the general quality of generated responses.

[5] The estimators for AdverSuc and machine-vs-random accuracy are trained using a hierarchical network (Serban et al., 2016). See Li et al. (2016c) for details.

Model               SBS     Length prediction Q
BLEU                1.45    1.64
AdverSuc            0.034   0.040
machine-vs-random   0.923   0.939

Table 2: Comparison of the proposed algorithm with length prediction and the standard beam search algorithm.

Sampled responses (from a random batch, without cherry-picking) are shown in Table 1, with more examples shown in Table 7 in the Appendix. We force the decoding model to generate 20 tokens using the strategy described above. We can clearly identify problems with the standard beam search algorithm: the decoder produces tokens that are optimal for shorter sequences, eliminating candidates from the beam that would lead to longer possibilities. Once the length reaches the point where the intended shorter sequence would naturally conclude, it has no option but to fill space with repetitions of tokens (e.g., "go, go, go" or strings of punctuation) and phrases (e.g., i don't know who i am or where i am or where i am or where i am), or addenda that are sometimes contradictory (e.g., it was the only way to make it to the top of the island . . . . . but it wasn't). This issue is alleviated by the proposed length-prediction algorithm, which plans ahead and chooses tokens that lead to meaningful sequences with the desired length. More coherent responses are observed when the hyperparameter lambda is set to 1 than when it is set to 5, as expected, since the decoding algorithm deviates more from the pre-trained model when lambda takes larger values.

Input: its an 8 symbol address.
  SBS: i don't know where it came from but its not the same as it used to be.
  FP (lambda=1): you know i think ive seen it once and ive seen it in the past too!.
  FP (lambda=5): but we dont know who she was at the scene but thats a good idea!?!

Input: come on pollyanna let's go.
  SBS: come on let's go let's go go go go go go go go go go go go go go
  FP (lambda=1): you want me to take you to your house and come back and get me some ice cream?!
  FP (lambda=5): you want me to take the stairs and go with you ... and get the hell out of here!!

Input: that's hardly a winning attitude.
  SBS: the only way that we could find out is that we're in the middle of the game!!!
  FP (lambda=1): but i think the most important thing is that we're both in the same class together ...
  FP (lambda=5): but the other thing is i think you can use it as an excuse for your own personal gain.

Input: i see and you know why?
  SBS: because i don't know who i am or where i am or where i am or where i am
  FP (lambda=1): why would the people think i'm the only guy who could get a job in the middle east?
  FP (lambda=5): i know the guy i saw at the scene ... and he said it was my father's birthday!

Input: but venus was no easy target.
  SBS: it was the only way to make it to the top of the island ...... but it wasn't
  FP (lambda=1): i think we have the right to be in the middle of some sort of a trap ...
  FP (lambda=5): the only reason i left here to save you ... was because i didn't care who they are!

Input: i'm not afraid of her.
  SBS: i'm afraid she's afraid of the dark ...... but i'm afraid she's afraid of me.
  FP (lambda=1): you don't like to tell people you're just a child and you don't know her?!
  FP (lambda=5): i'll be in a very awkward moment of her time and i'm afraid she'll hurt us again

Table 1: Sample of responses generated by standard beam search (denoted by SBS) and the future length prediction (FP) algorithm with two different values of lambda. We force each decoding algorithm to generate responses with length 20. More examples are shown in Table 7 (Appendix).

4 Q for Mutual Information

4.1 Background

Maximum mutual information (MMI) has been shown to be better than maximum likelihood estimation (MLE) as a decoding objective for conversational response generation tasks (Li et al., 2016a). The mutual information between source X and target Y is given by log[p(X,Y)/p(X)p(Y)], which measures the bidirectional dependency between sources and targets, as opposed to the unidirectional dependency of targets on sources in the maximum likelihood objective. Modeling the bidirectional dependency between sources and targets reduces the prevalence of generic responses and leads to more diverse and interesting conversations.[6]

[6] This is because although it is easy to produce a sensible generic response Y regardless of the input sequence X, it is much harder to guess X given Y if Y is generic.

Maximizing a weighted generalization of mutual information between the source and target can be shown using Bayes' rule to be equivalent to maximizing a linear combination of the forward probability log p(Y|X) (the standard objective function for SEQ2SEQ models) and the backward probability log p(X|Y):[7]

Y* = argmax_Y log p(Y|X) + lambda * log p(X|Y)   (5)

[7] When using this objective, p(Y|X) and p(X|Y) are separately trained models with different sets of parameters.

Unfortunately, direct decoding using Eq. 5 is infeasible, since it requires completion of target generation before p(X|Y) can be effectively computed, and the enormous search space for target Y prevents exploring all possibilities. An approximation approach is commonly adopted, in which an N-best list is first generated based on p(Y|X) and then reranked by adding p(X|Y). The problem with this reranking strategy is that the beam search step gives higher priority to optimizing the forward probability, resulting in solutions that are not globally optimal. Since hypotheses in beam search are known to lack diversity (Li et al., 2016b; Vijayakumar et al., 2016; Gimpel et al., 2013), after decoding is finished, it is sometimes too late for the reranking model to have significant impact. Shao et al. (2017) confirm this problem and show that the reranking approach helps for short sequences but not longer ones.

(a) Full dataset
            Q for MMI   MMI     SBS
BLEU        1.87        1.72    1.45
AdverSuc    0.068       0.057   0.043
Distinct-1  0.019       0.010   0.005
Distinct-2  0.058       0.030   0.014

(b) Set with short targets
            Q for MMI   MMI     SBS
BLEU        2.13        2.10    1.58
AdverSuc    0.099       0.093   0.074
Distinct-1  0.024       0.014   0.007
Distinct-2  0.065       0.033   0.017

(c) Set with long targets
            Q for MMI   MMI     SBS
BLEU        1.52        1.34    1.58
AdverSuc    0.042       0.029   0.022
Distinct-1  0.017       0.008   0.004
Distinct-2  0.054       0.027   0.012

Table 3: Comparison of the proposed future prediction model with MMI-based reranking (MMI) and MLE-based standard beam search (SBS).

4.2 Training Q for Mutual Information

The first term of Eq. 5 is the same as standard SEQ2SEQ decoding. We thus focus our attention on the second term, log p(X|Y). To incorporate the backward probability into intermediate decoding steps, we use a model to estimate the future value of p(X|Y) when generating each token y_t. For example, suppose that we have a source-target pair with source X = "what's your name" and target Y = "my name is john".
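The training inputs that this estimator sees are the prefixes of the full target, each associated with a score computed from the complete sequence. A minimal sketch of that prefix construction for the example above (the function name and whitespace tokenization are illustrative only; the paper's models operate on learned representations):

```python
def prefix_training_pairs(source, target):
    """For each target prefix y_{1:t}, emit a pair (X, y_{1:t}).
    Each pair is an input from which Q learns to predict a score of the
    completed sequence, e.g., the backward probability log p(X|Y)."""
    tokens = target.split()
    prefixes = [" ".join(tokens[:t]) for t in range(1, len(tokens) + 1)]
    return [(source, prefix) for prefix in prefixes]

# The example pair X = "what's your name", Y = "my name is john" yields
# the prefixes "my", "my name", "my name is", "my name is john".
pairs = prefix_training_pairs("what's your name", "my name is john")
```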
The future backward probability of the partial sequences "my", "my name", "my name is" is thus p(X|Y). Again, we use Q(y_t) to denote the function that maps a partially generated sequence to its future backward probability, and we can factorize Eq. 5 as follows:

y_t = argmax_y log p(y_{1:t-1}, y | X) + lambda * Q(y)   (6)

We propose two ways to obtain the future backward-probability estimation function Q(y_t).

(1) As in the strategies described in Sections 5 and 3, we first pretrain a SEQ2SEQ model for both p(Y|X) and p(X|Y). The training of the latter is the same as a standard SEQ2SEQ model but with sources and targets swapped. Then we train an additional future backward-probability estimation function Q(X, y_{1:t}), which takes as input the hidden representation of intermediate decoding steps (i.e., h_t) from the forward probability model and predicts the backward probability for the entire target sequence Y using the pretrained backward SEQ2SEQ model (i.e., log p(X|Y) with Y being the full target).

(2) We can directly train models to calculate Q(y_t) = p(X|y_{1:t}), i.e., the probability of generating a full source given a partial target. To do this, we first break Y into a series of partial sequences, i.e., y_{1:1}, y_{1:2}, ..., y_{1:N}, which is {"my", "my name", "my name is", "my name is john"} in the example above. Then we pair each partial sequence y_{1:t} (1 <= t <= N) with the source and use each pair (y_{1:t}, X) as a training example to train a SEQ2SEQ model, with y_{1:t} as the source and X as the target. Since we are increasing the size of the training set by roughly a factor of 10 (the average target length is about 10), training is extremely computation-intensive. We reduce the training time by grouping y_{1:t} by length and training a separate SEQ2SEQ model for each length group. At decoding time, we use the score from the model corresponding to the length of the current partially decoded target to generate the next token. Since SEQ2SEQ models for different target lengths are independent, they can be trained in parallel.

We find that option (2) generally outperforms option (1), but option (2) requires training in parallel on a large number of machines.

4.3 Experimental Results

We compare the results for the approach with standard beam search using the MLE objective and the MMI reranking approach of Li et al. (2016a), which performs reranking only after decoding is finished. We report BLEU scores and AdverSuc scores[8] on the test set. We also report diversity scores (denoted by Distinct-1 and Distinct-2); these are defined as in Li et al. (2016a) to be the number of distinct unigrams and bigrams (respectively) in generated responses, divided by the total number of generated tokens (to avoid favoring long sentences). Additionally, we split the dev set into a subset containing longer targets (with length larger than 8) and a subset containing shorter ones (smaller than 8). During decoding, we force the model to generate targets of the same length as the gold standard targets using the strategy described in Section 3.3.

[8] The machine-vs-random scores for the three models are very similar, respectively 0.947, 0.939, 0.935.

Table 3 presents quantitative results for the different decoding strategies. On the full test set, the future backward-probability prediction model outperforms the approach of reranking when decoding is fully finished. Specifically, a larger performance improvement is observed on examples with longer targets than on those with shorter ones. This effect is consistent with the intuition that for short responses, due to the relatively smaller search space, doing reranking at the end of decoding is sufficient, whereas this is not the case with longer sequences: as beam search proceeds, a small number of prefixes gradually start to dominate, with hypotheses differing only in punctuation or minor morphological variations. Incorporating mutual information in the early stages of decoding maintains diverse hypotheses, leading to better final results.

Table 4 presents sampled outputs from each strategy, with more results shown in Table 8 (Appendix). As can be seen, the results from reranking are generally better than those from MLE, but sometimes both approaches still generate the same generic outputs. This is due to the fact that reranking is performed only after more interesting outputs have fallen off the beam. Using smaller values of lambda, the future backward-probability prediction approach generally yields better results than reranking. When using larger values of lambda, the algorithm tends to produce more diverse and interesting outputs but has a greater risk of generating irrelevant responses.

Input: its an 8 symbol address.
  SBS: i don't know where it came from.
  MMI: i dont know where it came from.
  Q(MMI) (lambda=1): i don't know the exact address.
  Q(MMI) (lambda=5): it's a very old house in paris

Input: that's hardly a winning attitude.
  SBS: that's the way it's always been.
  MMI: it's the only way to win the race.
  Q(MMI) (lambda=1): it's just a game.
  Q(MMI) (lambda=5): it's just a game of golf. take it easy.

Input: what is up with those two?
  SBS: i don't know what's going on.
  MMI: i don't know what the hell's going on here!
  Q(MMI) (lambda=1): why don't you go back and have a look?
  Q(MMI) (lambda=5): why don't we go up there now and check?

Input: i got to get them in.
  SBS: i don't have time for this.
  MMI: i don't have time for them.
  Q(MMI) (lambda=1): you don't have to worry about them.
  Q(MMI) (lambda=5): i gotta talk to them all.

Input: i see and you know why?
  SBS: i don't know what you talking about.
  MMI: because i'm in love with you.
  Q(MMI) (lambda=1): because i don't know what the hell i'm talking about!
  Q(MMI) (lambda=5): because it's because of you

Table 4: Sample of responses generated by (1) standard beam search (SBS); (2) the MMI reranking approach of Li et al. (2016a), which performs reranking only after decoding is complete (denoted by MMI); and (3) the future prediction model Q(MMI) with different values of future prediction weight lambda.

5 Q for BLEU/ROUGE

The future outcome function Q can be trained to predict arbitrary features. These features include BLEU (Papineni et al., 2002) or ROUGE (Lin, 2004) scores. We thus train Q to directly predict future BLEU or ROUGE values. In this situation, the future prediction function is able to reduce the discrepancy between training (using the maximum likelihood objective) and testing (using BLEU or ROUGE) (Wiseman and Rush, 2016; Shen et al., 2015; Ranzato et al., 2016).

5.1 Model

Given a pre-trained sequence generation model, an input sequence X, and a partially decoded sequence y_{1:t-1}, we want to estimate the future reward for taking the action of choosing word y_t for the current time-step. We denote this estimate Q({y_t, y_{1:t-1}, X}), abbreviated Q(y_t) where possible.

The future prediction network is trained as follows: we first sample y_t from the distribution p(y_t | X, y_{1:t-1}), then decode the remainder of the sequence Y using beam search. The future outcome for the action y_t is thus the score of the final decoded sequence, q(Y). Having obtained pairs (q(Y), {X, y_{1:t}}), we train a neural network model that takes as input X and y_{1:t} to predict q(Y). The network first maps the input sequence X and the partially decoded sequence y_{1:t} to vector representations using LSTMs, and then uses another network that takes the concatenation of the two vectors to output the final outcome q(Y). The future prediction network is optimized by minimizing the mean squared loss between the predicted value and the real q(Y) during training.

At decoding time, Q(y_t) is incorporated into the decoding model to push the model to take actions that lead to better future outcomes. An action y_t is thus evaluated by the following function:

y_t = argmax_y log p(y_{1:t-1}, y | X) + lambda * Q(y)   (7)

lambda is a hyperparameter that is tuned on the development set.

5.2 Experiments

We evaluate the decoding model on two sequence generation tasks, machine translation and abstractive summarization.

Machine Translation. We use the German-English machine translation track of IWSLT 2014 (Cettolo et al., 2014), which consists of sentence-aligned subtitles of TED and TEDx talks. For fair comparison, we followed exactly the data processing protocols defined in Ranzato et al. (2016), which have also been adopted by Bahdanau et al. (2016) and Wiseman and Rush (2016). The training data consists of roughly 150K sentence pairs, in which the average English sentence is 17.5 words long and the average German sentence is 18.5 words long. The test set is a concatenation of dev2010, dev2012, tst2010, tst2011 and tst2012, consisting of 6750 sentence pairs. The English dictionary has 22822 words, while the German has 32009 words.

We train two models, a vanilla LSTM (Sutskever et al., 2014) and an attention-based model (Bahdanau et al., 2015). For the attention model, we use the input-feeding model described in Luong et al. (2015) with one minor modification: the weighted attention vectors that are used in the softmax token predictions and those fed to the recurrent net at the next step use different sets of parameters. Their values can therefore be different, unlike in Luong et al. (2015). We find that this small modification significantly improves the capacity of attention models, yielding more than a +1.0 BLEU score improvement. We use structure similar to that of Wiseman and Rush (2016), a single-layer sequence-to-sequence model with 256 units for each layer. We use beam size 7 for both standard beam search (SBS) and future outcome prediction.

Results are shown in Table 5, with SBS standing for the standard beam search model and Q(BLEU) denoting the proposed future prediction model. Baselines employed include the REINFORCE model described in Ranzato et al. (2016), the actor-critic RL model described in Bahdanau et al. (2016) and the beam-search training scheme described in Wiseman and Rush (2016). Results are reprinted from the best setting in the corresponding paper.

System                                  BLEU
REINFORCE (Ranzato et al., 2016)        20.7
Actor-Critic (Bahdanau et al., 2016)    22.4
Wiseman and Rush (2016)                 26.3
vanilla LSTM + SBS                      18.9
vanilla LSTM + Q(BLEU)                  19.7 (+0.8)
attention + SBS                         27.9
attention + Q(BLEU)                     28.3 (+0.4)

Table 5: BLEU scores for different systems. Baseline scores are best scores reprinted from corresponding papers. SBS denotes standard beam search.

Our implementation of the attention model itself already achieves state-of-the-art performance on this benchmark. The proposed future outcome model adds +0.4 BLEU, pushing the state-of-the-art performance up to 28.3. Since the trained SEQ2SEQ model is already quite strong, there is less room for improvement. For the vanilla LSTM, however, due to its relative inferiority, we observe a more significant improvement from the future outcome prediction approach.

Abstractive Summarization. We follow the protocols described in Rush et al. (2015), in which the source input is the first sentence of a news article and the target output is the headline. Our training dataset consists of 2M pairs. We train a two-layer word-level attention model with 512 units for each layer. Experimental results are shown in Table 6. We observe a +1.0 ROUGE performance improvement from the proposed model over standard beam search.

System                  ROUGE-2
attention + SBS         12.2
attention + Q(ROUGE)    13.2 (+1.0)

Table 6: ROUGE-2 for abstractive summarization. SBS denotes standard beam search.

6 Conclusion

In this paper, we propose a general strategy that enables a neural decoder to generate outputs that have specific properties of interest. We show how to use a model Q to optimize three useful properties of the output (sequence length, mutual information, and BLEU/ROUGE scores) and investigate the effects of different designs for the predictor model and decoding algorithm. Our model provides a general and easy-to-implement way to control neural generation models to meet their specific needs, while improving results on a variety of generation tasks.

References

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models.

Jiwei Li, Will Monroe, and Dan Jurafsky.
2016b. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

Kyunghyun Cho. 2016. Noisy parallel approximate decoding for conditional recurrent language model. arXiv preprint arXiv:1605.03835.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In NAACL-HLT.

Kevin Gimpel, Dhruv Batra, Chris Dyer, and Gregory Shakhnarovich. 2013. A systematic exploration of diversity in machine translation. In EMNLP.

Ivo Grondman, Lucian Busoniu, Gabriel A. D. Lopes, and Robert Babuska. 2012. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(6):1291–1307.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP.

Jiwei Li, Will Monroe, Tianlin Shi, and Dan Jurafsky. 2016c. Adversarial reinforcement learning for neural dialogue generation.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop.

Yi Luan, Yangfeng Ji, and Mari Ostendorf. 2016. LSTM based conversation models. arXiv preprint arXiv:1603.09457.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.

Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. Coherent dialogue with attention-based language models. arXiv preprint arXiv:1611.06997.

Ramesh Nallapati, Bowen Zhou, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL.

Franz Josef Och, Nicola Ueffing, and Hermann Ney. 2001. An efficient A* search algorithm for statistical machine translation. In Proceedings of the Workshop on Data-Driven Methods in Machine Translation.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In ICLR.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015a. Building end-to-end dialogue systems using generative hierarchical neural network models. arXiv preprint arXiv:1507.04808.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015b. Hierarchical neural network generative models for movie dialogues. arXiv preprint arXiv:1507.04808.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16).

Louis Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray Kurzweil. 2017. Generating long and diverse responses with neural conversational models. arXiv preprint arXiv:1701.03185.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2015. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433.

Xing Shi, Kevin Knight, and Deniz Yuret. 2016. Why neural translations are the right length. In EMNLP.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.

Pavel Sountsov and Sunita Sarawagi. 2016. Length bias in encoder decoder models and a case for global conditioning. arXiv preprint arXiv:1606.03402.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Richard S. Sutton. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3(1):9–44.

Jörg Tiedemann. 2009. News from OPUS: A collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing, volume 5, pages 237–248.

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In Proceedings of the ICML Deep Learning Workshop.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960.

Appendix
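As a concrete illustration of the decoding strategy discussed in this paper (ranking beam candidates by a linear interpolation of the step-wise log-probability, the actor, with an estimate of a future property of the completed sequence, the critic), the following is a minimal runnable sketch. The toy vocabulary, the toy conditional distribution `log_p`, and the length-based future estimator `future_value` are illustrative assumptions for demonstration purposes, not the authors' trained models.

```python
import math

VOCAB = ["</s>", "a", "b", "c"]

def log_p(prefix, token):
    """Toy conditional log-probability log p(token | prefix).
    Stands in for the seq2seq model's softmax; the end-of-sequence
    token grows likelier as the prefix gets longer."""
    scores = {"</s>": 0.1 + 0.2 * len(prefix), "a": 1.0, "b": 0.5, "c": 0.2}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return scores[token] - log_z

def future_value(prefix, target_len=5):
    """Toy future-outcome estimate Q: higher when the sequence length
    is close to a pre-specified target length (one of the properties
    the paper optimizes). A trained predictor would go here."""
    return -abs(target_len - len(prefix)) / target_len

def beam_search(beam_size=3, lam=0.5, max_len=10):
    """Beam search ranking hypotheses by log p + lam * Q."""
    beams = [([], 0.0)]  # (prefix tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            for tok in VOCAB:
                new_prefix = prefix + [tok]
                new_lp = lp + log_p(prefix, tok)
                # Interpolate generation probability with future estimate.
                score = new_lp + lam * future_value(new_prefix)
                candidates.append((score, new_prefix, new_lp))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, prefix, lp in candidates[:beam_size]:
            if prefix[-1] == "</s>":
                finished.append((score, prefix))
            else:
                beams.append((prefix, lp))
        if not beams:
            break
    if not finished:  # fall back to unfinished hypotheses at max_len
        finished = [(lp, prefix) for prefix, lp in beams]
    return max(finished)[1]
```

With `lam = 0` this reduces to standard beam search (SBS); increasing `lam` biases decoding toward hypotheses whose estimated future outcome, here proximity of the final length to the target, is higher.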