Long Timescale Credit Assignment in Neural Networks with External Memory

Steven S. Hansen
Department of Psychology
Stanford University
Stanford, CA 94303
[email protected]

arXiv:1701.03866v1 [cs.AI] 14 Jan 2017

1 Introduction

Neural networks with external memories (NNEM), such as the Neural Turing Machine [1] and Memory Networks [2], have often been compared to the human hippocampus in their ability to maintain episodic information over long timescales [3]. But so far, these networks have only been trained on tasks requiring memory storage comparable to a few minutes, whereas the hippocampus stores information on the order of years. It is well known that this sort of long-term episodic storage is vital for overcoming the problem of catastrophic interference [4], which is central to the challenge of continual learning.

While some of this discrepancy can be attributed to the short-term nature of current benchmark tasks, a major bottleneck comes from the reliance on back-propagation-through-time to learn the embedding function that maps high-dimensional observations onto lower-dimensional representations suitable for storage in the external memory. This is problematic because memories might be used an extremely long time after the embedding function was called, and back-propagation-through-time would require us to hold onto the forward activations of this function for the entire duration if we want to back-propagate through it.

Considering that in a continual learning setup an external memory store on the order of hundreds of thousands, or even millions, of separate embeddings could be needed, storing all of the intermediate embedding activations is woefully intractable. A naive solution would be to simply store all of the memories in observation space in addition to embedding space and recompute the embeddings as needed for credit assignment, but a significant computational burden would result from these redundant forward passes. More importantly, the high dimensionality of the observation space would make storing these additional memories take an unreasonable amount of memory. Indeed, Blundell et al. [5] noted that even in the relatively modest Atari domain, storing a few million observations in pixel-space would require upwards of 300 gigabytes.¹

¹While their work utilizes a long timescale external memory, it does not address the problems raised here, as the embedding function was not optimized for usage in an external memory. Rather, it was either a random projection or pre-trained on a separate objective. Indeed, the low performance of the latter suggests that end-to-end optimization might be necessary for adaptive embedding functions to be worthwhile in this context.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2 Methods

2.1 Long Timescale Credit Assignment in NNEMs

Credit assignment in traditional recurrent neural networks usually involves back-propagating through a long chain of tied weight matrices. The length of this chain scales linearly with the number of time-steps, as the same network is run at each time-step. This creates many problems, such as vanishing gradients, that have been well studied [7]. In contrast, the recurrent activity in a NNEM's architecture doesn't involve a long chain of computation (though some architectures, such as the NTM, do utilize a traditional recurrent architecture as a controller). Rather, the externally stored embedding vectors are used at each time-step, but no messages are passed from previous time-steps. This means that vanishing gradients aren't a problem, as all of the necessary gradient paths are short. However, these paths are extremely numerous (one per embedding vector in memory) and reused for a very long time (until the embedding leaves the memory). Thus, the forward-pass information of each memory must be stored for the entire duration of the memory. This is problematic, as this additional storage far surpasses that of the actual memories, to the extent that large memories are infeasible to back-propagate through in high-dimensional settings; the sketch below illustrates the issue.
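To make the storage problem concrete, here is a minimal sketch, assuming PyTorch (the framework, layer sizes, and all names are illustrative and not from the paper). Keeping each stored embedding attached to its autograd graph retains that step's forward activations for as long as the memory persists:

```python
# Sketch of the retention problem (assumed PyTorch; names illustrative).
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))

memory = []  # external store of embedding vectors
for x in torch.randn(100, 784):  # a stream of observations
    z = embed(x)
    # To back-propagate through z when its error gradient arrives many
    # steps later, z's autograd graph -- and hence this step's forward
    # activations of `embed` -- must stay alive for z's whole lifetime:
    memory.append(z)             # retains one graph per stored memory
    # memory.append(z.detach())  # would free the graph, but then no
    #                            # credit could reach `embed` later
```

With hundreds of thousands of memories, the retained activations dwarf the stored embeddings themselves, which is exactly the intractability described above.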
Figure 1: The computational graphs of a NNEM with differing credit assignment mechanisms. Forward-pass information is shown in black, true gradients in red, and approximate gradients in blue. A) The standard method for back-propagating through an external memory. Note that the left half of this graph must be frozen until the memory leaves. B) The synthetic gradient model, M, assigns credit upon a memory first entering the store, and is itself trained from the older memories' gradients. C) Exact gradient reinstatement. The dashed memories are online recalculations. D) Approximate gradient reinstatement. The memories used for inference and credit assignment are different.

2.2 Synthetic Gradients

Synthetic gradients are a recent technique created to deal with situations where part of a network is 'blocked' from further usage [6], such as in the case previously discussed, whereby information must be stored until all gradient calculations are complete. The way around this problem is to treat the gradient signal as just another function to be approximated: namely, a function mapping from an activation vector, and any contextual information, onto the error gradient w.r.t. that activation vector. This auxiliary network can be used to calculate an estimate of the gradient before the true gradient is available. The synthetic-gradient-producing network can itself be trained in a purely supervised manner by comparing its estimates to the true gradients.

As shown in Figure 1B, we can apply this approach to NNEMs by updating the embedding network with a synthetic gradient immediately after the embedding vector is calculated. True gradient signals are available for all of the previously embedded memories, and these can be used to train the synthetic gradient network.

2.3 Reinstatement for Credit Assignment

One way to get around the need to hold onto forward-pass information is to recalculate the forward pass whenever gradient information is available. However, as previously mentioned, even the observations are too large to store in our domain of interest, which prevents a direct reinstatement of a forward pass from occurring. Instead, we rely on a learned autoencoder to reinstate the observation, and then use the embedding network to recalculate the forward pass. Since the recalculated embedding vector is unlikely to perfectly match the one stored in memory, it is non-obvious how to utilize the error gradient w.r.t. the vector in memory.

There are two distinct possibilities: ignore the mismatch and directly use the memory's gradient, or use the fresh embedding instead of the memory during inference. These two methods are illustrated in Figure 1C and D, and while they aren't mutually exclusive, they are treated as such for the purposes of this paper, to aid in interpreting the initial results. The latter option yields a true gradient signal, but relies on the reconstruction being good enough not to harm the inference process; the former is only an approximation. Sketches of both mechanisms follow.
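First, a minimal sketch of the synthetic gradient update of Section 2.2, assuming PyTorch; the linear gradient model `synth`, the optimizer choices, and all names are illustrative assumptions rather than the paper's implementation:

```python
# Sketch of synthetic gradients for a NNEM embedding network.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
synth = nn.Linear(64, 64)  # maps an embedding to an estimate of dL/dz
opt_embed = torch.optim.Adam(embed.parameters(), lr=1e-4)
opt_synth = torch.optim.Adam(synth.parameters(), lr=1e-4)

def write_memory(x):
    """Embed x, credit `embed` immediately with the estimated gradient,
    and store a detached embedding (no autograd graph retained)."""
    z = embed(x)
    g_hat = synth(z.detach()).detach()  # synthetic gradient w.r.t. z
    opt_embed.zero_grad()
    z.backward(g_hat)                   # inject the estimate as dL/dz
    opt_embed.step()
    return z.detach()

def fit_synth(z_stored, g_true):
    """Once a true gradient for an old memory is available, regress the
    synthetic gradient model toward it (purely supervised)."""
    loss = ((synth(z_stored) - g_true) ** 2).mean()
    opt_synth.zero_grad()
    loss.backward()
    opt_synth.step()
```

Second, a sketch of autoencoder-based reinstatement from Section 2.3 under the same assumptions; `decode` stands in for the learned decoder, and the two variants correspond to Figure 1C and 1D:

```python
# Sketch of gradient reinstatement via a learned autoencoder.
import torch
import torch.nn as nn

embed  = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
decode = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784))
opt_embed = torch.optim.Adam(embed.parameters(), lr=1e-4)

def reinstate(z_stored):
    """Decode the stored embedding back to an approximate observation,
    then recompute the forward pass with the current embedding network."""
    with torch.no_grad():
        x_hat = decode(z_stored)  # reinstated observation (approximate)
    return embed(x_hat)           # fresh embedding, autograd graph attached

def assign_credit_approx(z_stored, g_stored):
    """Approximate variant (Fig. 1D): ignore the mismatch and apply the
    gradient computed for the stored memory to the fresh embedding."""
    z_fresh = reinstate(z_stored)
    opt_embed.zero_grad()
    z_fresh.backward(g_stored)
    opt_embed.step()

# Exact variant (Fig. 1C): use reinstate(z_stored) in place of the stored
# memory during inference, so the task loss itself back-propagates
# through `embed` along the recomputed forward pass.
```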
3 Results

To test our model, we used the standard MNIST benchmark, the objective of which is to correctly classify images of numerals by the integer they represent. While this task is trivial for modern networks with the appropriate architecture, we use it here as it represents one of the simplest tasks that could, under certain conditions, still encounter the credit assignment problem previously described. Since our aim is to tractably train embedding functions within a network with an external memory, we constructed a simple NNEM consisting of an embedding network and an external store that holds the embeddings of the 5000 most recently encountered images along with their correct labels.

Classifications are made by taking the cosine similarity between the embedding of the current image and all of the embeddings in memory, and using the normalized result to weight the class labels associated with the memorized embeddings (a sketch of this classifier appears at the end of this section). As our focus is on weight updates derived from the memorized embeddings, we stop the gradient going into the embedding function from the current image, as the memory-derived updates are sufficient for good performance in this simplified setting. In addition, we found that completely unsupervised learning of an autoencoder was sufficient to yield decent performance. Since we're primarily interested in getting the supervised signal to travel through the stored embeddings, we only allow the reconstruction error to modify the decoder's weights, as the decoder isn't used for inference like the encoder/embedding function is.

The results are shown in Figure 2. All conditions utilize the same network architecture² and hyper-parameters, differing only in their mechanism for credit assignment. The reinstatement method with exact gradients performed the best, but requires the most additional computation. Synthetic gradient performance is perhaps lower than expected, but it is possible that this is due to the network architecture: the surrounding memories affect the gradient of any individual memory, but these can't be an input to the synthetic gradient network, as future memories aren't observable. The approximate gradient version of the reinstatement method was the most prone to divergence, which makes sense given the approximation involved.

²A single hidden layer for each piece of the network, with dimension 256. ADAM was used for optimization with a learning rate of 1e-4.

Figure 2: Performance results averaged across 10 runs. Exact gradient reinstatement is shown in blue, approximate gradient reinstatement is shown in green, and synthetic gradients are shown in red. All accuracy results are from a separate validation set.
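For concreteness, here is a minimal sketch of the similarity-weighted classifier described above, again assuming PyTorch; the softmax normalization is an illustrative guess at how the cosine similarities are normalized, which the paper does not specify:

```python
# Sketch of the similarity-weighted memory classifier (assumed PyTorch).
import torch
import torch.nn.functional as F

def classify(z, mem_z, mem_y, n_classes=10):
    """z: (64,) embedding of the current image (detached, since the
    gradient from the current image into the embedding net is stopped).
    mem_z: (N, 64) stored embeddings; mem_y: (N,) stored integer labels."""
    sims = F.cosine_similarity(z.unsqueeze(0), mem_z, dim=1)  # (N,)
    weights = F.softmax(sims, dim=0)                          # normalize
    labels = F.one_hot(mem_y, n_classes).float()              # (N, 10)
    return weights @ labels                                   # class scores
```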
4 Future Directions

The absolute performance on this MNIST task isn't useful in itself, as the performance was purposely sabotaged to isolate the novel mechanisms under discussion. However, while the relative performance changes demonstrate the promise of reinstatement-based credit assignment, the contrived domain necessitates further work to confirm these results on large-scale problems. One possibility would be to tackle reinforcement learning problems like Atari, using a variant of the model-free episodic control network [5] modified such that the embedding function is learned on-line. While the exact modifications needed to make that setting compatible with this approach to credit assignment are outside the scope of this paper, it is clear that this domain is one of the few present in the current literature that calls for a long-term episodic store.

In a more complex environment, additional nuances emerge which might improve credit assignment performance. Given the reliance on accurate decoding for proper reinstatement, pre-training the autoencoder offline might be crucial. Vanilla autoencoders also tend to overfit, which might slow performance when used to assign credit to new examples. Variational autoencoders have been shown to be more robust [8] and also provide uncertainty estimates that could be useful when deciding which embeddings to discard. A hybrid approach utilizing synthetic gradients and both forms of AE-based credit assignment might also be worthwhile, as these approaches appear to have complementary weaknesses.

While the focus has been on credit assignment, adaptive long-term episodic memories raise other concerns, such as the fact that stored embeddings go 'stale' rapidly as the embedding function's parameters change. If the rate of change is slow enough, then operating the memories as a queue (first in, first out) should limit the impact this has on performance. But in many setups, embeddings are discarded as a function of their usage [5], such that frequently needed embeddings are kept around long enough for staleness to become problematic. While only speculative, an autoencoder could also be used to 'freshen' these embeddings by re-encoding them using the current parameters.

References

[1] Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing machines. arXiv preprint arXiv:1410.5401.

[2] Weston, J., Chopra, S., & Bordes, A. (2014). Memory networks. arXiv preprint arXiv:1410.3916.

[3] Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., ... & Badia, A. P. (2016). Hybrid computing using a neural network with dynamic external memory. Nature.

[4] McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3), 419.

[5] Blundell, C., et al. (2016). Model-free episodic control. arXiv preprint arXiv:1606.04460.

[6] Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., & Kavukcuoglu, K. (2016). Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343.

[7] Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02), 107-116.

[8] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
