Long Timescale Credit Assignment in Neural Networks with External Memory

Steven S. Hansen
Department of Psychology
Stanford University
Stanford, CA 94303
[email protected]

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1 Introduction

Neural networks with external memories (NNEM), such as the Neural Turing Machine [1] and Memory Networks [2], have often been compared to the human hippocampus in their ability to maintain episodic information over long timescales [3]. But so far, these networks have only been trained on tasks requiring memory storage comparable to a few minutes, whereas the hippocampus stores information on the order of years. It is well known that this sort of long-term episodic storage is vital for overcoming the problem of catastrophic interference [4], which is central to the challenge of continual learning.

While some of this discrepancy can be attributed to the short-term nature of current benchmark tasks, a major bottleneck comes from the reliance on back-propagation-through-time to learn the embedding function that maps high-dimensional observations onto lower-dimensional representations suitable for storage in the external memory. This is problematic because memories might be used an extremely long time after the embedding function was called, and back-propagation-through-time would require us to hold onto the forward activations of this function for the entire duration if we want to back-propagate through it.

Considering that in a continual learning setup an external memory store on the order of hundreds of thousands, or even millions, of separate embeddings could be needed, storing all of the intermediate embedding activations is woefully intractable. A naive solution would be to simply store all of the memories in observation space in addition to embedding space and recompute the embeddings as needed for credit assignment, but a significant computational burden would result from these redundant forward passes. More importantly, the high dimensionality of the observation space would make storing these additional memories take an unreasonable amount of memory. Indeed, Blundell et al. [5] noted that even in the relatively modest Atari domain, storing a few million observations in pixel space would require upwards of 300 gigabytes.¹

¹ While their work utilizes a long timescale external memory, it does not address the problems raised here, as the embedding function was not optimized for usage in an external memory. Rather, it was either a random projection or pre-trained on a separate objective. Indeed, the low performance of the latter suggests that end-to-end optimization might be necessary for adaptive embedding functions to be worthwhile in this context.

2 Methods

2.1 Long Timescale Credit Assignment in NNEMs

Credit assignment in traditional recurrent neural networks usually involves back-propagating through a long chain of tied weight matrices. The length of this chain scales linearly with the number of time-steps, as the same network is run at each time-step. This creates many problems, such as vanishing gradients, that have been well studied [7]. In contrast, a NNEM's recurrent activity doesn't involve a long chain of this sort (though some architectures, such as the NTM, do utilize a traditional recurrent architecture as a controller). Rather, the externally stored embedding vectors are used at each time-step, but no messages are passed from previous time-steps. This means that vanishing gradients aren't a problem, as all of the necessary gradient paths are short. However, these paths are extremely numerous (one per embedding vector in memory) and reused for a very long time (until the embedding leaves the memory). Thus, the forward-pass information of each memory must be stored for the entire duration of the memory. This is problematic, as the additional storage far surpasses that of the actual memories themselves, to the extent that large memories are infeasible to back-propagate through in high-dimensional settings; the sketch below illustrates where this cost comes from.
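The following is a minimal sketch of this bottleneck, written in PyTorch (not code from the paper; the layer sizes, the 64-dimensional code, and the data stream are illustrative assumptions). Each write appends an embedding to the external store, and a content-based read attends over every stored slot, so the loss gradient has one path back into the embedding network per slot, and every slot's forward graph must be kept alive until that slot is evicted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical embedding network: observation -> low-dimensional code.
embed = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))

memory = []  # external store of embedding vectors


def write(obs):
    """Embed an observation and append the code to the external store.

    Because the code tensor is kept with its autograd graph attached, the
    forward activations behind it stay in memory until the entry is
    discarded -- this is the extra storage discussed above.
    """
    memory.append(embed(obs))


def read(query_obs):
    """Content-based read: cosine attention over every stored code."""
    q = embed(query_obs)                                        # (64,)
    keys = torch.stack(memory)                                  # (N, 64)
    sims = F.cosine_similarity(keys, q.unsqueeze(0), dim=-1)    # (N,)
    weights = torch.softmax(sims, dim=0)   # one gradient path per stored slot
    return weights @ keys                  # similarity-weighted read vector

# A loss computed on read(...) and back-propagated with
# loss.backward(retain_graph=True) flows into `embed` through *all* N stored
# codes, so nothing written since the oldest surviving entry can be freed --
# intractable once N reaches hundreds of thousands of slots.
```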
Figure 1: The computational graphs of a NNEM with differing credit assignment mechanisms. Forward-pass information is shown in black, true gradients in red, and approximate gradients in blue. A) The standard method for back-propagating through an external memory. Note that the left half of this graph must be frozen until the memory leaves. B) The synthetic gradient model, M, assigns credit upon first entering the memory, and then itself learns from the older memories' gradients. C) Exact gradient reinstatement. The dashed memories are online recalculations. D) Approximate gradient reinstatement. The memories used for inference and credit assignment are different.

2.2 Synthetic Gradients

Synthetic gradients are a recent technique created to deal with situations where part of a network is 'blocked' from further usage [6], such as in the case previously discussed, whereby information must be stored until all gradient calculations are complete. Their way around this problem is to treat the gradient signal as just another function to be approximated: namely, a function mapping from an activation vector, and any contextual information, onto the error gradient w.r.t. that activation vector. This auxiliary network can be used to calculate an estimate of the gradient before the true gradient is available. The synthetic gradient producing network can itself be trained in a purely supervised manner by comparing its estimates to the true gradient.

As shown in Figure 1B, we can apply this approach to NNEMs by updating the embedding network with a synthetic gradient immediately after the embedding vector is calculated. True gradient signals are available for all of the previously embedded memories, and these can be used to train the synthetic gradient network.

2.3 Reinstatement for Credit Assignment

One way to get around the need to hold onto forward-pass information is to recalculate the forward pass whenever gradient information is available. However, as previously mentioned, even the observations are too large to store in our domain of interest, which prevents a direct reinstatement of a forward pass from occurring. Instead, we rely on a learned autoencoder to reinstate the observation, and then use the embedding network to recalculate the forward pass. Since the recalculated embedding vector is unlikely to perfectly match the one stored in memory, it is non-obvious how to utilize the error gradient w.r.t. the vector in memory.

There are two distinct possibilities: ignore the mismatch and directly use the memory's gradient, or use the fresh embedding instead of the memory during inference. These two methods are illustrated in Figure 1C and D, and while they aren't mutually exclusive, they are treated as such for the purposes of this paper, to aid in interpreting the initial results. The latter option yields a true gradient signal whereas the former is only an approximation, but the latter relies on having a good enough reconstruction to not harm the inference process. A sketch of the reinstatement procedure is given below.
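The next sketch illustrates the reinstatement idea, again in PyTorch with illustrative names and shapes (the `encoder`/`decoder` modules, the optimizer choice, and the function signature are assumptions, not details from the paper). A stored code is decoded back into observation space, re-encoded with the current parameters, and the gradient that arrived for the stored code is pushed through the fresh code: the approximate variant of Figure 1D. The exact variant of Figure 1C differs only in that the fresh code is substituted into the inference pass itself, so the arriving gradient already refers to it.

```python
import torch
import torch.nn as nn

# Hypothetical encoder (the embedding function) and decoder for reinstatement.
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)


def reinstate_and_assign_credit(stored_code, grad_wrt_stored_code):
    """Approximate-gradient reinstatement (Figure 1D).

    stored_code:          embedding written earlier, kept without its graph
    grad_wrt_stored_code: dLoss/d(stored_code), obtained when the memory was used
    """
    with torch.no_grad():
        recon_obs = decoder(stored_code)   # reinstate the observation
    fresh_code = encoder(recon_obs)        # recompute the forward pass now
    opt.zero_grad()
    # Ignore the mismatch between fresh_code and stored_code and push the
    # memory's gradient through the freshly recomputed graph.
    fresh_code.backward(grad_wrt_stored_code)
    opt.step()

# Exact-gradient reinstatement (Figure 1C) differs only in where fresh_code is
# used: it replaces stored_code during inference, so the gradient that comes
# back is exactly dLoss/d(fresh_code) rather than an approximation.
```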
3 Results

To test our model, we used the standard MNIST benchmark, the objective of which is to correctly classify images of numerals by the integer they represent. While this task is trivial for modern networks with the appropriate architecture, we use it here as it represents one of the simplest tasks that could, under certain conditions, still encounter the credit assignment problem previously described. Since our aim is to tractably train embedding functions within a network with an external memory, we constructed a simple NNEM consisting of an embedding network and an external store that holds the embeddings of the 5000 most recently encountered images along with their correct labels. Classifications are made by taking the cosine similarity between the embedding of the current image and all of the embeddings in memory, and using the normalized result to weight the class labels associated with the memorized embeddings. As our focus is on weight updates derived from the memorized embeddings, we stop the gradient coming from the current image from entering the embedding function, as the memory-derived gradients are sufficient for good performance in this simplified setting. In addition, we found that completely unsupervised learning of an autoencoder was sufficient to yield decent performance. Since we're primarily interested in getting the supervised signal to travel through the stored embeddings, we only allow the reconstruction error to modify the decoder's weights, as the decoder isn't used for inference the way the encoder/embedding function is.

The results are shown in Figure 2. All conditions utilize the same network architecture² and hyper-parameters, differing only in their mechanism for credit assignment. The reinstatement method with exact gradients performed the best, but requires the most additional computation. Synthetic gradient performance is perhaps lower than expected, but it is possible that this is due to the network architecture: the surrounding memories affect the gradient of any individual memory, but these can't be an input to the synthetic gradient network, as future memories aren't observable. The approximate gradient version of the reinstatement method was the most prone to divergence, which makes sense given the approximation involved.

² A single hidden layer for each piece of the network, with dimension 256. ADAM was used for optimization with a learning rate of 1e-4 elsewhere.

Figure 2: Performance results averaged across 10 runs. Exact gradient reinstatement is shown in blue, approximate gradient reinstatement is shown in green, and synthetic gradients are shown in red. All accuracy results are from a separate validation set.
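For concreteness, the synthetic-gradient condition compared above (Section 2.2) can be sketched as follows, again in PyTorch with illustrative module sizes and function names that are assumptions rather than the paper's code. A small network predicts dLoss/d(embedding) from the embedding itself at write time, that prediction updates the embedding network immediately (so the forward graph can be freed), and the predictor is later regressed onto the true gradients once older memories receive them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical embedding network and synthetic gradient model M.
embed = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
synth = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
opt_embed = torch.optim.Adam(embed.parameters(), lr=1e-4)
opt_synth = torch.optim.Adam(synth.parameters(), lr=1e-4)


def write_with_synthetic_credit(obs):
    """Assign credit to the embedding network as soon as the code is written."""
    code = embed(obs)
    predicted_grad = synth(code.detach())      # estimate of dLoss/d(code)
    opt_embed.zero_grad()
    code.backward(predicted_grad.detach())     # update embed now, free the graph
    opt_embed.step()
    return code.detach()                       # store without forward activations


def train_synthetic_model(old_code, true_grad):
    """When a stored memory receives its true gradient, regress M onto it."""
    opt_synth.zero_grad()
    loss = F.mse_loss(synth(old_code), true_grad)
    loss.backward()
    opt_synth.step()
```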
4 Future Directions

The absolute performance on this MNIST task isn't useful in itself, as the performance was purposely sabotaged to isolate the novel mechanisms under discussion. However, while the relative performance changes demonstrate the promise of reinstatement-based credit assignment, the contrived domain necessitates further work to confirm these results on large scale problems. One possibility would be to tackle reinforcement learning problems like Atari, using a variant of the model-free episodic control network [5] modified such that the embedding function is learned on-line. While the exact modifications needed to make that setting compatible with this approach to credit assignment are outside the scope of this paper, it is clear that this domain is one of the few present in the current literature that calls for a long-term episodic store.

In a more complex environment, additional nuances emerge which might improve credit assignment performance. Given the reliance on accurate decoding for proper reinstatement, pre-training the autoencoder offline might be crucial. Vanilla autoencoders also tend to overfit, which might hurt performance when used to assign credit to new examples. Variational autoencoders have been shown to be more robust [8] and also provide uncertainty estimates that could be useful when deciding which embeddings to discard. A hybrid approach utilizing synthetic gradients and both forms of AE-based credit assignment might also be worthwhile, as these approaches appear to have complementary weaknesses.

While the focus has been on credit assignment, adaptive long-term episodic memories raise other concerns, such as the fact that stored embeddings go 'stale' rapidly as the embedding function's parameters change. If the rate of change is slow enough, then operating the memories as a queue (first in, first out) should limit the impact this has on performance. But in many setups, embeddings are discarded as a function of their usage [5], such that frequently needed embeddings are kept around long enough for staleness to become problematic. While only speculative, an autoencoder could also be used to 'freshen' these embeddings by re-encoding them using the current parameters.

References

[1] Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing machines. arXiv preprint arXiv:1410.5401.

[2] Weston, J., Chopra, S., & Bordes, A. (2014). Memory networks. arXiv preprint arXiv:1410.3916.

[3] Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., ... & Badia, A. P. (2016). Hybrid computing using a neural network with dynamic external memory. Nature.

[4] McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3), 419.

[5] Blundell, C., et al. (2016). Model-free episodic control. arXiv preprint arXiv:1606.04460.

[6] Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., & Kavukcuoglu, K. (2016). Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343.

[7] Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02), 107-116.

[8] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.