Long Timescale Credit Assignment in Neural Networks with External Memory

Steven S. Hansen
Department of Psychology
Stanford University
Stanford, CA 94303
sshansen@stanford.edu

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
1 Introduction

Neural networks with external memories (NNEMs), such as the Neural Turing Machine [1] and Memory Networks [2], have often been compared to the human hippocampus in their ability to maintain episodic information over long timescales [3]. But so far, these networks have only been trained on tasks requiring memory storage comparable to a few minutes, whereas the hippocampus stores information on the order of years. It is well known that this sort of long-term episodic storage is vital for overcoming the problem of catastrophic interference [4], which is central to the challenge of continual learning.
While some of this discrepancy can be attributed to the short-term nature of current benchmark tasks, a major bottleneck comes from the reliance on back-propagation-through-time to learn the embedding function that maps the high-dimensional observations onto lower-dimensional representations suitable for storage in the external memory. This is problematic because memories might be used an extremely long time after the embedding function was called, and back-propagation-through-time would require us to hold onto the forward activations of this function for the entire duration if we want to back-propagate through it.
Considering that in a continual learning setup, an external memory store on the order of hundreds of thousands, or even millions, of separate embeddings could be needed, storing all of the intermediate embedding activations is woefully intractable. A naive solution would be to simply store all of the memories in observation space in addition to embedding space and recompute the embeddings as needed for credit assignment, but a significant computational burden would result from these redundant forward passes. More importantly, the high dimensionality of the observation space would make storing these additional memories take an unreasonable amount of memory. Indeed, Blundell et al. [5] noted that even in the relatively modest Atari domain, storing a few million observations in pixel space would require upwards of 300 gigabytes.¹

¹ While their work utilizes a long timescale external memory, it does not address the problems raised here, as the embedding function was not optimized for usage in an external memory. Rather, it was either a random projection or pre-trained on a separate objective. Indeed, the low performance of the latter suggests that end-to-end optimization might be necessary for adaptive embedding functions to be worthwhile in this context.
2 Methods
2.1 Long Timescale Credit Assignment in NNEMs
Credit assignment in traditional recurrent neural networks usually involves back-propagating through a long chain of tied weight matrices. The length of this chain scales linearly with the number of time-steps, as the same network is run at each time-step. This creates many problems, such as vanishing gradients, that have been well studied [7]. In contrast, a NNEM's recurrent activity doesn't involve a long chain of computation (though some architectures, such as the NTM, do utilize a traditional recurrent architecture as a controller). Rather, the externally stored embedding vectors are used at each time-step, but no messages are passed from previous time-steps. This means that vanishing gradients aren't a problem, as all of the necessary gradient paths are short. However, these paths are extremely numerous (one per embedding vector in memory) and each is reused for a very long time (until its embedding leaves the memory). Thus, the forward-pass information of each memory must be stored for the entire duration of the memory. This is problematic because this additional storage far surpasses that of the actual memories, to the extent that large memories are infeasible to back-propagate through in high-dimensional settings.
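As a concrete illustration of this storage cost (an editorial sketch, not part of the original paper), the following shows the naive approach in which every stored embedding keeps its forward graph alive. It assumes a PyTorch-style autograd setup; the layer sizes, the 64-dimensional embedding, and names such as `embed_net` are illustrative placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Illustrative embedding network; the paper's architecture differs in detail.
embed_net = nn.Sequential(nn.Linear(784, 256), nn.Tanh(), nn.Linear(256, 64))

memory = []  # external store of embedding vectors

# Naive credit assignment: keep each embedding attached to its autograd graph,
# so the forward activations inside embed_net for every stored observation are
# retained until that embedding is evicted from memory.
for _ in range(5000):
    obs = torch.randn(1, 784)      # stand-in for an observation
    memory.append(embed_net(obs))  # graph (and activations) kept alive

# When a gradient w.r.t. some stored embedding finally becomes available,
# back-propagation reuses the retained activations -- but holding 5000 such
# graphs costs far more memory than the 64-dimensional embeddings themselves.
memory[0].backward(torch.randn_like(memory[0]))
```

The mechanisms below are alternatives that store only detached embedding vectors and deliver credit to the embedding network by other means.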
Figure 1: The computational graphs of a NNEM with differing credit assignment mechanisms. Forward-pass information is shown in black, true gradients in red, and approximate gradients in blue. A) The standard method for back-propagating through an external memory. Note that the left half of this graph must be kept frozen until the memory leaves. B) The synthetic gradient model, M, assigns credit when an embedding first enters the memory, and is itself trained from the older memories' gradients. C) Exact gradient reinstatement. The dashed memories are online recalculations. D) Approximate gradient reinstatement. The memories used for inference and credit assignment are different.
2.2 Synthetic Gradients

Synthetic gradients are a recent technique created to deal with situations where a part of a network is 'blocked' from further usage [6], such as in the case previously discussed, whereby information must be stored until all gradient calculations are complete. The way around this problem is to treat the gradient signal as just another function to be approximated: namely, a function mapping from an activation vector, and any contextual information, onto the error gradient w.r.t. that activation vector. This auxiliary network can be used to calculate an estimate of the gradient before the true gradient is available. The synthetic-gradient-producing network can itself be trained in a purely supervised manner by comparing its estimates to the true gradient.

As shown in Figure 1B, we can apply this approach to NNEMs by updating the embedding network with a synthetic gradient immediately after the embedding vector is calculated. True gradient signals are available for all of the previously embedded memories, and these can be used to train the synthetic gradient network.
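As a sketch only (assuming PyTorch-style autograd; the synthetic gradient network's architecture and its 10-dimensional context input, e.g. a one-hot label, are hypothetical choices not specified by the text), the scheme splits into two steps: an immediate update of the embedding network using the estimated gradient, and a later regression of the estimator onto the true gradient once it arrives.

```python
import torch
import torch.nn as nn

embed_net = nn.Sequential(nn.Linear(784, 256), nn.Tanh(), nn.Linear(256, 64))
synth_grad_net = nn.Sequential(nn.Linear(64 + 10, 256), nn.Tanh(), nn.Linear(256, 64))
opt_embed = torch.optim.Adam(embed_net.parameters(), lr=1e-4)
opt_synth = torch.optim.Adam(synth_grad_net.parameters(), lr=1e-4)

def embed_with_synthetic_credit(obs, context):
    """Compute an embedding and immediately credit embed_net with M's estimate."""
    key = embed_net(obs)
    g_hat = synth_grad_net(torch.cat([key.detach(), context], dim=-1))
    opt_embed.zero_grad()
    key.backward(g_hat.detach())   # inject the estimated gradient right away
    opt_embed.step()
    return key.detach()            # only the detached vector enters the memory

def train_synthetic_gradient(stored_key, context, true_grad):
    """Once the true gradient for an old memory is available, regress M onto it."""
    g_hat = synth_grad_net(torch.cat([stored_key, context], dim=-1))
    loss = ((g_hat - true_grad) ** 2).mean()
    opt_synth.zero_grad()
    loss.backward()
    opt_synth.step()
```

Here `true_grad` stands for whatever gradient the readout loss eventually delivers to that stored key.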
2.3 Reinstatement for Credit Assignment

One way to get around the need to hold onto forward-pass information is to recalculate the forward pass whenever gradient information becomes available. However, as previously mentioned, even the observations are too large to store in our domain of interest, which prevents a direct reinstatement of the forward pass. Instead, we rely on a learned autoencoder to reinstate the observation, and then use the embedding network to recalculate the forward pass. Since the recalculated embedding vector is unlikely to perfectly match the one stored in memory, it is non-obvious how to utilize the error gradient w.r.t. the vector in memory.

There are two distinct possibilities: ignore the mismatch and directly use the memory's gradient, or use the fresh embedding instead of the memory during inference. These two methods are illustrated in Figure 1C and D, and while they aren't mutually exclusive, they are treated as such for the purposes of this paper, to aid in interpreting the initial results. The latter option yields a true gradient signal whereas the former is only an approximation, but the latter relies on the reconstruction being good enough not to harm the inference process.
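The two options can be sketched as follows (assuming PyTorch-style autograd; `encoder` is the embedding function, `decoder` the autoencoder's decoder, and `opt` an optimizer over the encoder's parameters; the mapping onto the panels of Figure 1 is schematic, not a claim about the paper's exact implementation):

```python
import torch

def approximate_reinstatement(stored_key, grad_wrt_stored_key, encoder, decoder, opt):
    """Option 1: the stored key was used for inference, but its gradient is
    pushed through a freshly recomputed forward pass, ignoring the mismatch
    between the stored and recomputed embeddings."""
    obs_recon = decoder(stored_key).detach()   # reinstate the observation
    fresh_key = encoder(obs_recon)             # recompute the forward pass
    opt.zero_grad()
    fresh_key.backward(grad_wrt_stored_key)    # reuse the memory's gradient as-is
    opt.step()

def exact_reinstatement(stored_key, encoder, decoder):
    """Option 2: the recomputed embedding itself replaces the stored key during
    inference, so the gradient that later arrives from the readout loss is
    exact and flows through this freshly built graph."""
    obs_recon = decoder(stored_key).detach()
    return encoder(obs_recon)                  # keep the graph; use it for inference
```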
3 Results

To test our model, we used the standard MNIST benchmark, the objective of which is to correctly classify images of numerals by the integer they represent. While this task is trivial for modern networks with the appropriate architecture, we use it here as it represents one of the simplest tasks that could, under certain conditions, still encounter the credit assignment problem previously described.

Since our aim is to tractably train embedding functions within a network with an external memory, we constructed a simple NNEM consisting of an embedding network and an external store that holds the embeddings of the 5000 most recently encountered images along with their correct labels. Classifications are made by taking the cosine similarity between the embedding of the current image and all of the embeddings in memory, and using the normalized result to weight the class labels associated with the memorized embeddings. As our focus is on weight updates derived from the memorized embeddings, we stop the gradient going into the embedding function from the current image, as the memory-derived updates are sufficient for good performance in this simplified setting. In addition, we found that completely unsupervised learning of an autoencoder was sufficient to yield decent performance. Since we are primarily interested in getting the supervised signal to travel through the stored embeddings, we only allow the reconstruction error to modify the decoder's weights, as the decoder isn't used for inference the way the encoder/embedding function is.
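A minimal sketch of this readout (assuming PyTorch; the softmax normalization and one-hot labels are assumptions, since the text only says the similarities are normalized):

```python
import torch
import torch.nn.functional as F

def classify_from_memory(image, embed_net, mem_keys, mem_labels):
    """Cosine-similarity readout over the external store.

    mem_keys:   (N, d) embeddings of the N most recently seen images
    mem_labels: (N, 10) one-hot labels for those images
    """
    query = embed_net(image).detach()   # stop-gradient on the current image
    sims = F.cosine_similarity(query.unsqueeze(0), mem_keys, dim=-1)  # (N,)
    weights = F.softmax(sims, dim=0)    # normalized similarity weights
    return weights @ mem_labels         # weighted vote over the class labels
```

The resulting class distribution can be compared to the true label (e.g. with a cross-entropy loss), and the gradient of that loss w.r.t. each entry of `mem_keys` is the signal that the credit-assignment mechanisms of Section 2 must deliver back to the embedding network.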
The results are shown in Figure 2. All conditions utilize the same network architecture² and hyperparameters, differing only in their mechanism for credit assignment. The reinstatement method with exact gradients performed the best, but requires the most additional computation. Synthetic gradient performance is perhaps lower than expected, but it is possible that this is due to the network architecture: the surrounding memories affect the gradient of any individual memory, but these can't be an input to the synthetic gradient network, as future memories aren't observable. The approximate gradient version of the reinstatement method was the most prone to divergence, which makes sense given the approximation involved.

² A single hidden layer for each piece of the network, with dimension 256. ADAM was used for optimization with a learning rate of 1e-4 elsewhere.

Figure 2: Performance results averaged across 10 runs. Exact gradient reinstatement is shown in blue, approximate gradient reinstatement is shown in green, and synthetic gradients are shown in red. All accuracy results are from a separate validation set.
4 Future Directions

The absolute performance on this MNIST task isn't useful in itself, as the performance was purposely sabotaged to isolate the novel mechanisms under discussion. However, while the relative performance changes demonstrate the promise of reinstatement-based credit assignment, the contrived domain necessitates further work to confirm these results on large-scale problems. One possibility would be to tackle reinforcement learning problems like Atari, using a variant of the model-free episodic control network [5] modified such that the embedding function is learned online. While the exact modifications needed to make that setting compatible with this approach to credit assignment are outside the scope of this paper, it is clear that this domain is one of the few present in the current literature that calls for a long-term episodic store.
In a more complex environment, additional nuances emerge which might improve credit assignment performance. Given the reliance on accurate decoding for proper reinstatement, pre-training the autoencoder offline might be crucial. Vanilla autoencoders also tend to overfit, which might slow performance when used to assign credit to new examples. Variational autoencoders have been shown to be more robust [8] and also provide uncertainty estimates that could be useful when deciding which embeddings to discard. A hybrid approach utilizing synthetic gradients and both forms of AE-based credit assignment might also be worthwhile, as these approaches appear to have complementary weaknesses.
While the focus has been on credit assignment, adaptive long-term episodic memories raise other concerns, such as the fact that stored embeddings go 'stale' rapidly as the embedding function's parameters change. If the rate of change is slow enough, then operating the memories as a queue (first in, first out) should limit the impact this has on performance. But in many setups, embeddings are discarded as a function of their usage [5], such that frequently needed embeddings are kept around long enough for staleness to become problematic. While only speculative, an autoencoder could also be used to 'freshen' these embeddings by re-encoding them using the current parameters.
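As a purely speculative sketch of that freshening step (reusing the encoder/decoder pair from Section 3; whether to freshen all memories or only heavily reused ones is left open by the text):

```python
import torch

def freshen(stale_keys, encoder, decoder):
    """Re-encode stale memories: decode each stored embedding back into
    observation space and re-embed it with the current encoder parameters."""
    with torch.no_grad():   # a maintenance step, not a credit-assignment step
        return encoder(decoder(stale_keys))
```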
References

[1] Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing machines. arXiv preprint arXiv:1410.5401.
[2] Weston, J., Chopra, S., & Bordes, A. (2014). Memory networks. arXiv preprint arXiv:1410.3916.
[3] Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., ... & Badia, A. P. (2016). Hybrid computing using a neural network with dynamic external memory. Nature.
[4] McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3), 419.
[5] Blundell, C., et al. (2016). Model-free episodic control. arXiv preprint arXiv:1606.04460.
[6] Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., & Kavukcuoglu, K. (2016). Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343.
[7] Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02), 107-116.
[8] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.