
arXiv:q-bio/0601028v1 [q-bio.NC] 19 Jan 2006

Gradient learning in spiking neural networks by dynamic perturbation of conductances

Ila R. Fiete^1 and H. Sebastian Seung^2
^1 Kavli Institute for Theoretical Physics, University of California, Santa Barbara, CA 93106
^2 Howard Hughes Medical Institute and Department of Brain and Cognitive Sciences, M.I.T., Cambridge, MA 02139

We present a method of estimating the gradient of an objective function with respect to the synaptic weights of a spiking neural network. The method works by measuring the fluctuations in the objective function in response to dynamic perturbation of the membrane conductances of the neurons. It is compatible with recurrent networks of conductance-based model neurons with dynamic synapses. The method can be interpreted as a biologically plausible synaptic learning rule, if the dynamic perturbations are generated by a special class of "empiric" synapses driven by random spike trains from an external source.

PACS numbers: 84.35.+i, 87.19.La, 07.05.Mh, 87.18.Sn

Neural network learning is often formulated in terms of an objective function that quantifies performance at a desired computational task. The network is trained by estimating the gradient of the objective function with respect to the synaptic weights, and then changing the weights in the direction of the gradient.

If neural and network dynamics and the objective function are all exactly known functions of the weights, such learning can be accomplished by explicitly computing the relevant gradients. A famous example of this approach, used with wide success in non-spiking, deterministic artificial neural networks [1], is the backpropagation (BP) algorithm [2, 3].

However, the relevance of BP to neurobiological learning is limited. Biological neural activity can be noisy, and involves the highly nonlinear and often history-dependent dynamics of membrane voltages and conductances: neurons generate voltage spikes, and the efficacy of synaptic transmission varies dynamically, on a spike-by-spike basis [4, 5]. Further, the objective function in neurobiological learning may depend on the dynamics of muscles and on external variables of the world unknown to the brain. Similar complications are present in analog on-chip or robotic implementations of machine learning.
For learning in such systems, alternative strategies are necessary. The method of weight perturbation estimates the gradients by perturbing synaptic weights and observing the change in the objective function. Unlike BP, weight perturbation is completely "model-free" [6]: it does not depend on knowing anything about the functional dependence of the objective on the network weights, and it can be applied to stochastic spiking networks [7]. The disadvantage of a completely model-free approach is the tradeoff between generality and learning speed: weight perturbation is far more widely applicable than BP, but BP is much faster when it is applicable.

Here we propose a method that is intermediate between these two extremes, yet is applicable to arbitrary spiking neural networks. Instead of making perturbations to the synaptic weights, it estimates the N^2 weight gradients through dynamic perturbation of the conductances of the N network neurons. Our algorithm does this by exploiting a feature generic to many models of neural networks: inputs to a neuron combine additively before being subjected to further nonlinearities. Otherwise, the algorithm is model-free. Our approach generalizes the concept of node perturbation, which has been proposed for training feedforward networks of non-spiking neurons [2, 8] and can be much faster than weight perturbation [9]. We show how neural conductance perturbations can be used, in a biologically plausible way, to perform synaptic gradient learning in fully recurrent networks of realistic spiking neurons.

Spiking neural networks. We briefly discuss the mathematical conditions under which our assumption, that the synaptic inputs to a single neuron combine linearly, holds in spiking neural networks. If each neuron i is electrotonically compact, it can be described by a transmembrane voltage V_i obeying the current balance equation

    C_i \, dV_i/dt = -I_i^{int}(t) - I_i^{syn}(t).

The intrinsic current I_i^{int} is generally a nonlinear function of the voltage and of dynamical variables associated with the spike-generating conductances in the membrane. The dynamics of these variables may be arbitrarily complex (e.g., the Hodgkin-Huxley model) without affecting our derivations. A simple model for the synaptic current is I_i^{syn} = \sum_j W_{ij} s_{ij}(t) (V_i(t) - E_{ij}). The time-varying synaptic conductance from neuron j to neuron i is W_{ij} s_{ij}(t), with amplitude controlled by the parameter W_{ij}. Its time course is determined by s_{ij}(t), which could include complex forms of short-term depression and facilitation. If the reversal potentials E_{ij} of the synapses are all the same, then the synaptic current can be written as I_i^{syn} = g_i(t) (V_i(t) - E_{syn}), where

    g_i(t) = \sum_j W_{ij} s_{ij}(t)                                  (1)

is the sum of all postsynaptic conductances of the synapses onto neuron i. The linear dependence of g_i(t) on the synaptic weights W_{ij} will be critical below. However, this linear dependence may be embedded inside a nonlinear network, which may be arbitrarily complex without affecting the following derivations. In fact, all networks, whether neural and spiking or neither, that depend on a set of interaction variables s_{ij}(t) and parameters W_{ij} through Eq. (1) satisfy the necessary conditions for our derivation below.
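To make this model class concrete, the following Python fragment integrates the current balance equation for a small network in this family. It is a minimal sketch, not taken from the paper: the leaky intrinsic current, the exponentially decaying synaptic variables s_ij, and all parameter values are assumptions chosen purely for illustration; the derivation permits arbitrarily complex intrinsic and synaptic dynamics, as long as the conductance depends on the weights through Eq. (1).

```python
import numpy as np

# Minimal conductance-based network of the kind described in the text.
# Current balance: C dV_i/dt = -I_int_i(t) - I_syn_i(t), with
# I_syn_i = g_i(t) (V_i - E_syn) and g_i(t) = sum_j W_ij s_ij(t)  [Eq. (1)].

N = 10                   # number of neurons
dt = 0.1                 # Euler step (ms)
C = 1.0                  # membrane capacitance (assumed)
g_L, E_L = 0.1, -70.0    # assumed leaky intrinsic current parameters
E_syn = 0.0              # common synaptic reversal potential
tau_s = 5.0              # assumed synaptic time constant (ms)

rng = np.random.default_rng(0)
W = rng.uniform(0.0, 0.05, size=(N, N))  # static synaptic weights W_ij
s = np.zeros((N, N))                     # synaptic variables s_ij(t)
V = np.full(N, E_L)                      # membrane voltages V_i

def step(V, s, spikes):
    """One Euler step; `spikes` is a length-N 0/1 vector of presynaptic spikes."""
    s = s * np.exp(-dt / tau_s) + spikes[None, :]  # s_ij decays, jumps when j spikes
    g = (W * s).sum(axis=1)                        # g_i = sum_j W_ij s_ij  [Eq. (1)]
    I_int = g_L * (V - E_L)                        # assumed leaky intrinsic current
    I_syn = g * (V - E_syn)                        # synaptic current, common E_syn
    return V + dt * (-I_int - I_syn) / C, s
```

The only structural commitment is the line computing g; the leaky intrinsic current could be replaced by a full Hodgkin-Huxley model without affecting anything that follows.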
Gradient learning. We represent the state of the network by a vector Ω(t), which includes the synaptic variables s_{ij}(t) and all other dynamical variables (e.g., the voltages V_i(t) and all variables associated with the membrane conductances). Starting from an initial condition Ω(0), the network generates a trajectory from time t = 0 to t = T, and in response receives a scalar "reinforcement" signal R[Ω], which is an arbitrary functional of the trajectory. For now we assume that the network dynamics are deterministic, and present the fully stochastic case in the Appendix. Each trajectory along with its reinforcement is called a "trial," and the learning process is iterative, extending over a series of trials. The signal R depends implicitly on the synaptic weights W_{ij}, and is an objective function for learning. In other words, the goal of learning is to find synaptic weights that maximize R. A heuristic method for doing this is to follow the gradient of R with respect to W_{ij}. Next we derive our gradient learning rule.

FIG. 1: Neurons in a recurrent network ("actor"), connected by modifiable weights W. In addition, each neuron i receives an empiric synapse carrying perturbing input ξ_i(t) from an external "experimenter." A global reinforcement signal R is broadcast by a "critic" to all neurons in the network.

Sensitivity lemma. Suppose that W_{ij}(t) were a time-varying function. Then by Eq. (1) and the chain rule, it would follow that

    δR/δW_{ij}(t) = (δR/δg_i(t)) s_{ij}(t).                          (2)

But if W_{ij}(t) is constrained to take on the same value at every time, it follows that

    ∂R/∂W_{ij} = \int_0^T dt \, δR/δW_{ij}(t) = \int_0^T dt \, (δR/δg_i(t)) s_{ij}(t).   (3)

We call this the sensitivity lemma, because it relates the sensitivity of R to changes in W_{ij} to its sensitivity to changes in g_i(t). The implication of the lemma is that dynamic perturbations of the variables g_i(t) can be used to instruct modifications of the static parameters W_{ij}.

Gradient estimation. To estimate δR/δg_i(t), suppose that Eq. (1) is perturbed by fluctuating white noise,

    g_i(t) = \sum_j W_{ij} s_{ij}(t) + ξ_i(t).                       (4)

The white noise satisfies ⟨ξ_i(t)⟩ = 0 and ⟨ξ_i(t_1) ξ_j(t_2)⟩ = σ^2 δ_{ij} δ(t_1 - t_2), where the angle brackets denote a trial average. For now, we regard this perturbation as a mathematical device; its biological interpretation will be discussed later.

To show that δR/δg_i(t) can be estimated from the covariance of R and the perturbation ξ_i(t), use the linear approximation R - R_0 ≈ \int_0^T dt \sum_k (δR/δg_k(t)) ξ_k(t), which is accurate when the perturbations ξ_i(t) are small. Here R_0 is defined as R in the absence of any perturbations, ξ = 0. Since the perturbations are white noise, it follows that

    ⟨(R - R_0) ξ_i(t)⟩ ≈ σ^2 \, δR/δg_i(t).                          (5)

Because ⟨ξ⟩ = 0, the baseline R_0 may be replaced by any quantity that is uncorrelated with the perturbations of the current trial. For example, choosing R_0 = 0 leaves Eq. (5) valid. However, baseline subtraction can have a large effect on the variance of the estimate (5) when it is based on a finite number of trials [10]. Thus a good choice of baseline can decrease learning time, sometimes dramatically.

If the covariance relation of Eq. (5) is combined with the sensitivity lemma, Eq. (3), it follows that

    σ^2 \, ∂R/∂W_{ij} ≈ \int_0^T dt \, ⟨(R - R_0) ξ_i(t)⟩ s_{ij}(t).  (6)

Synaptic learning rule. Equation (6) suggests the following stochastic gradient learning procedure. At each synapse the purely local eligibility trace

    e_{ij} = \int_0^T dt \, ξ_i(t) s_{ij}(t)                         (7)

is accumulated over the trajectory. At the end of the trajectory, the synaptic weight is updated according to

    ΔW_{ij} = η (R - R_0) e_{ij}.                                    (8)

The update ΔW_{ij} fluctuates because of the randomness in the perturbations. On average, the update points in the direction of the gradient, because it satisfies ⟨ΔW_{ij}⟩ ∝ ∂R/∂W_{ij}, according to Eq. (6). This means that the learning rule of Eq. (8) is stochastic gradient following.
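The full trial-and-update loop of Eqs. (4), (7), and (8) can be sketched in a few lines of Python. This is an illustrative reading, not the authors' implementation: the trial simulator run_trial is a hypothetical function assumed to integrate the network with the perturbed conductances of Eq. (4) and to return R together with the recorded synaptic variables, and the running-average baseline is just one simple choice of R_0 [10].

```python
import numpy as np

# Trial-based learning loop implementing Eqs. (4), (7), (8).
# `run_trial(W, xi)` is a hypothetical simulator: it integrates the network
# with perturbed conductances g_i(t) = sum_j W_ij s_ij(t) + xi_i(t)  [Eq. (4)]
# and returns the scalar reinforcement R and the recorded s_ij(t).

def learn(W, run_trial, n_trials=1000, T_steps=500, dt=0.1,
          sigma=0.01, eta=1e-3, rng=np.random.default_rng(1)):
    R_baseline = 0.0                  # running-average estimate of the baseline R_0
    N = W.shape[0]
    for _ in range(n_trials):
        # Discretized white noise: variance sigma^2/dt per time bin approximates
        # <xi_i(t1) xi_j(t2)> = sigma^2 delta_ij delta(t1 - t2).
        xi = rng.normal(0.0, sigma / np.sqrt(dt), size=(T_steps, N))
        R, s_traj = run_trial(W, xi)  # s_traj has shape (T_steps, N, N)
        # Eligibility trace e_ij = int_0^T dt xi_i(t) s_ij(t)   [Eq. (7)]
        e = dt * np.einsum('ti,tij->ij', xi, s_traj)
        # Weight update Delta W_ij = eta (R - R_0) e_ij         [Eq. (8)]
        W = W + eta * (R - R_baseline) * e
        R_baseline += 0.1 * (R - R_baseline)
    return W
```

On average this update follows σ^2 ∂R/∂W_{ij}, per Eq. (6); the baseline affects only the variance of the estimate, so any trial-uncorrelated choice of R_0 is valid.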
We note one subtlety in the derivation: in Eq. (7) the synaptic variables s_{ij}(t) are defined in the presence of perturbations, while in the sensitivity lemma they are defined for ξ = 0. In the linear approximations above, this discrepancy leads to a higher-order correction that is negligible for small perturbations.

Biological interpretation. According to the above, synaptic weight gradients of R can be estimated using conductance perturbations ξ_i(t). Could this mathematical trick be used by the brain? In the actor-critic terminology of reinforcement learning [11], one can imagine that the neurons of one brain area (the "actor") drive actions that are assessed by another brain area (the "critic"), which in response issues a global, scalar reinforcement signal R to the actor (Fig. 1). A novel feature of our rule is that, in addition to its regular synapses W_{ij}, the actor would receive a special class of "empiric" synapses from another hypothesized part of the brain (the "experimenter"), which perturb the actor from trial to trial. Each plastic synapse locally computes and stores its scalar eligibility and multiplies this with R to undergo modification. This idea is developed in detail elsewhere in a model of birdsong learning [12, 13], resulting in concrete, nontrivial predictions for synaptic plasticity in the brain.

Note that if the perturbation ξ_i(t) is a synaptic conductance, its mean value ⟨ξ_i⟩ must be positive. Then the linear approximations above are expansions about the mean conductance ξ_i(t) = ⟨ξ_i⟩, rather than about ξ_i(t) = 0. As a result, ξ_i(t) must be replaced by the zero-mean fluctuation δξ_i(t) = ξ_i(t) - ⟨ξ_i⟩ in the eligibility trace. In addition, the fluctuations δξ_i(t) will not be truly white, but will have a correlation time set by the time constant of the synaptic currents. However, if this correlation time is short relative to the time scale of variation in δR/δg_i(t), then the gradient estimate of Eq. (5) should still be accurate.

Accurate gradient estimation requires that the eligibility trace filter out the mean conductance ⟨ξ_i⟩ of the empiric synapse. This operation is biologically plausible, and can be implemented by a simple time average at every "actor" neuron, if the empiric synapses are driven at a constant or very slowly varying rate.

By contrast, other proposals for stochastic gradient learning typically require individual neurons to keep track of and filter out a time-varying average vector of neural or synaptic activity within each trial, which seems rather complex. The added complexity arises because these proposals are based on fluctuations in network dynamics caused by stochasticity intrinsic to neurons [14, 15, 16] or synapses [7] in the actor network; thus the average perturbation is a function of the network trajectory and is time-varying. Our algorithm avoids this complexity, because the fluctuations are injected by an extrinsic source, and are therefore independent of the network trajectory. Our approach has the additional advantage that the degree of exploration in the actor can be modified independently of activity in the actor.

Generalization to excitatory and inhibitory synapses. Above we assumed that all synapses have the same reversal potential. But neurons may receive both excitatory and inhibitory synapses, which have different reversal potentials. The unmodified learning rule allows both synapse types to perform gradient following if there are two types of empiric synapses per neuron: an excitatory empiric synapse used to train the excitatory synapses, and an inhibitory empiric synapse used to train the inhibitory synapses. But if there is only one empiric synapse per neuron, then for both types of synapses to perform gradient following, the rule must be modified. Let E_{ij} and E_{ξ,i} be the reversal potentials of the regular i ← j synapse and of the empiric synapse onto the ith actor neuron, respectively. Then we obtain a generalized sensitivity lemma,

    ∂R/∂W_{ij} = \int dt \, a_{ij}(t) (δR/δg_i(t)) s_{ij}(t),        (9)

where

    a_{ij}(t) = (V_i(t) - E_{ij}) / (V_i(t) - E_{ξ,i})               (10)

is the ratio of the synaptic driving force at the i ← j synapse to the driving force of the empiric synapse at neuron i. The stochastic gradient learning rule remains ΔW_{ij} = η (R - R_0) e_{ij}, but with the modified eligibility trace

    e_{ij} = \int_0^T dt \, a_{ij}(t) ξ_i(t) s_{ij}(t).

For synapses with the same reversal potential as the empiric synapse, a_{ij}(t) = 1, returning the original learning rule. Even for synapses of the opposite variety, the sign of a_{ij} does not change with time, because the neural voltage is constrained to stay between the inhibitory and excitatory reversal potentials V_I and V_E (V_I ≤ V_i(t) ≤ V_E), and E_{ξ,i}, E_{ij} ∈ {V_I, V_E}. Nevertheless, for these synapses of the opposite variety, the term a_{ij}(t) adds complexity to the simple learning rule and reduces its biological plausibility.
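As a sketch of how this single-empiric-synapse correction might be computed from recorded trajectories (the recorded arrays, their shapes, and the helper name are illustrative assumptions, not part of the paper):

```python
import numpy as np

# Generalized eligibility trace for mixed synapse types with a single
# empiric synapse per neuron: e_ij = int dt a_ij(t) xi_i(t) s_ij(t), with
# a_ij(t) = (V_i(t) - E_ij) / (V_i(t) - E_xi_i)  [Eqs. (9)-(10)].

def eligibility_mixed(xi, s_traj, V_traj, E_rev, E_xi, dt):
    """xi: (T, N) perturbations; s_traj: (T, N, N) synaptic variables s_ij(t);
    V_traj: (T, N) voltages V_i(t); E_rev: (N, N) reversals E_ij;
    E_xi: (N,) empiric-synapse reversals E_xi,i."""
    # Driving-force ratio a_ij(t), Eq. (10); sign is constant in time since
    # V_I <= V_i(t) <= V_E and both reversals lie in {V_I, V_E}.
    a = ((V_traj[:, :, None] - E_rev[None, :, :])
         / (V_traj[:, :, None] - E_xi[None, :, None]))
    # Modified Eq. (7): accumulate a_ij(t) xi_i(t) s_ij(t) over the trial.
    return dt * np.einsum('tij,ti,tij->ij', a, xi, s_traj)
```

When E_{ij} equals E_{ξ,i}, the ratio is identically 1 and the original trace of Eq. (7) is recovered.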
Alter- cause these proposals are based on fluctuations in net- natively, a single empiric synapse could be used for the work dynamics caused by stochasticity intrinsic to neu- whole neuron, but with the introduction of complexities rons[14,15,16]orsynapses[7]intheactornetwork;thus, inthelearningrulesimilartothea (t)factorofEq.(10). ij theaverageperturbationisafunctionofthenetworktra- Technical issues Our synaptic learning rule performs jectory and is time-varying. Our algorithm avoids this stochastic gradient following, and therefore shares the complexity, because the fluctuations are injected by an virtues and deficiencies of all methods in this class [17]. extrinsicsource,andarethereforeindependentofthenet- For example, it is possible to become stuck at a local worktrajectory. Ourapproachhastheadditionaladvan- optimum of the objective function. The stochasticity of 4 thegradientestimationmayallowsomesmallprobability Ω(T) from a Markov process with transition probability of escape, but there is no guarantee of finding a global P (Ω(t)|Ω(t −1)). The assumption of Markov transi- W optimum. tionprobabilities is compatible with mostspiking neural Thederivationofourlearningruleinparticular,andof network models. The network receives reinforcement R gradientrules in general,depends onthe smoothness as- from the conditional density P(R|Ω). Since the network sumptionthatRisadifferentiablefunctionofthesynap- is parametrized by W, the expected reward tic weights. But R depends on W through the spiking ij activity of the actor network, and spiking neurons typ- hRi= RP(R|Ω)P (Ω)dRDΩ (11) Z W ically exhibit threshold spike- or no-spike behaviors, so one might worry that R is discontinuous. However, be- is a function of W. We assume that the transition prob- cause either the amplitude or the latency of neural spik- ability depends on the weights W through ingvariescontinuouslyasafunctionofinputnearthresh- old [18], there is typically no true discontinuity. PW(Ω(t)|Ω(t−1))=f(g1(t),...,gN(t)) (12) Comparison with previous work If the perturbation where as before ξ (t) is Gaussian white noise, then our synaptic learning i rule can be included as a member of the REINFORCE g (t)= W s (t−1) (13) i ij ij class of algorithms [15]. With Gaussian white noise we X j can use the REINFORCE formalism to prove that our learning rule performs stochastic gradient ascent on R The transition probability depends on all the dynamical without assuming that the perturbations are small, be- variablesinΩ(t),althoughtheyhavebeensuppressedfor cause linear approximations are not used. In contrast, notational simplicity in Eq. (12). As before, the impor- our present derivation does not require the perturba- tant mathematical property here is the linearity of Eq. tionstobeGaussian,butassumestheyareapproximately (13), which is embedded inside a nonlinear system. The white,andofsmallamplitude. TheREINFORCEtheory sensitivity lemma takes the form: toocouldbeusedfornon-Gaussianξ (t),ifξ (t)isdrawn i i T i.i.d. from a smooth probability density function (PDF). ∂hRi ∂ = hRs (t−1)i (14) However,theresultinglearningrulewillbedifferentthan ∂W ∂g (t) ij ij Xt=1 i ours. Further,the assumptionofsmoothnessofthePDF can seriously limit the applicability of the REINFORCE Thesensitivitylemmashowsthattheappropriatechange theory: for example, a ξ generatedby filtering a random in the weight of a synapse is not given by the covariance spike train cannot be treated by REINFORCE. 
[1] Y. LeCun et al., Proc. IEEE 86(11), 2278 (1998).
[2] B. Widrow and M. Lehr, Proc. IEEE 78(9), 1415 (1990).
[3] D. Rumelhart et al., in D. Rumelhart and J. McClelland, eds., Parallel Distributed Processing (MIT Press, 1986).
[4] H. Markram and M. Tsodyks, Nature 382(6594), 807 (1996).
[5] A. Thomson and J. Deuchars, Trends Neurosci. 17, 119 (1994).
[6] A. Dembo and T. Kailath, IEEE Trans. on Neural Networks 1(1), 58 (1990).
[7] H. Seung, Neuron 40(6), 1063 (2003).
[8] Y. LeCun et al., in D. Touretzky, ed., Adv. Neural Info. Proc. Sys. 1, 141 (1989).
[9] J. Werfel, X. Xie, and H. Seung, Neural Comp. 17(12), 2699 (2005).
[10] P. Dayan, in D. Touretzky et al., eds., Proc. Connectionist Models Summer School (Morgan Kaufmann, 1990).
[11] R. Sutton and A. Barto, Reinforcement Learning: An Introduction (MIT Press, 1998).
[12] I. Fiete, Ph.D. thesis, Harvard University (2003).
[13] I. Fiete and H. Seung, submitted (2005).
[14] A. G. Barto and P. Anandan, IEEE Trans. on Systems, Man, and Cybernetics 15(3), 360 (1985).
[15] R. Williams, Machine Learning 8, 229 (1992).
[16] X. Xie and H. Seung, Phys. Rev. E 69, 041909 (2004).
[17] B. Pearlmutter, IEEE Trans. on Neural Networks 6(5), 1212 (1995).
[18] J. Rinzel and B. Ermentrout, in C. Koch and I. Segev, eds., Methods in Neuronal Modeling: From Synapses to Networks (MIT Press, 1989).
[19] D. Andes et al., in IJCNN-90-WASHDC: International Joint Conference on Neural Networks (Lawrence Erlbaum Associates, 1990).
