Learning and inference in a nonequilibrium Ising model with hidden nodes

Benjamin Dunn*
The Kavli Institute for Systems Neuroscience, NTNU, 7030 Trondheim

Yasser Roudi†
The Kavli Institute for Systems Neuroscience, NTNU, 7030 Trondheim, and NORDITA, KTH Royal Institute of Technology and Stockholm University, 10691 Stockholm, Sweden

* [email protected]
† [email protected]

We study inference and reconstruction of couplings in a partially observed kinetic Ising model. With hidden spins, calculating the likelihood of a sequence of observed spin configurations requires performing a trace over the configurations of the hidden ones. This, as we show, can be represented as a path integral. Using this representation, we demonstrate that systematic approximate inference and learning rules can be derived using dynamical mean-field theory. Although naive mean-field theory leads to an unstable learning rule, taking into account Gaussian corrections allows learning the couplings involving hidden nodes. It also improves learning of the couplings between the observed nodes compared to when hidden nodes are ignored.

PACS numbers: 02.50.Tt, 05.10.-a, 89.75.Hc, 75.10.Nr

I. INTRODUCTION

Within the statistical mechanics community, a significant body of recent work deals with parameter learning in statistical models. This work typically focuses on the canonical setting of the inverse Ising problem: given a set of configurations from an equilibrium [1] or nonequilibrium [2] Ising model, how can we find the interactions between the spins? These inverse problems can be difficult to solve using brute-force approaches [3], so for practical purposes the results of such theoretical works offering approximate learning and inference methods are of great importance. The recent interest in using statistical models to study high-throughput data from neural [4-6], genetic [7] and financial [8] networks further emphasizes the need for efficient approximate inference methods.

Despite its practical relevance, however, little has been done on learning and inference in models with hidden spins using statistical physics. Learning the parameters of a statistical model in the presence of hidden nodes, and inferring the state of hidden variables, lie at the heart of difficult problems in statistics and machine learning. It is well known that for statistical modeling of data, hidden variables most often improve model quality [9, 10]. Furthermore, when used for network reconstruction and graph identification, ignoring hidden nodes may cause the model to identify couplings between observed nodes when no such couplings exist, or to remove couplings that actually do exist. For these reasons, it is important to study models with hidden nodes, both from a theoretical perspective and for applications to real data, where usually only a fraction of the system is observable [4, 5, 7].

In this paper, we study the reconstruction of the couplings in a nonequilibrium Ising model with hidden spins, and the inference of the state of these spins. In principle, the couplings can be reconstructed by calculating and maximizing the likelihood of the data, which involves performing a trace over the configurations of the hidden spins at all times: an impossible task even for moderately sized data sets and networks. This is, in fact, a general problem, not only for a nonequilibrium Ising model but for any kinetic model with hidden variables. It is, however, natural to use a path integral approach for doing this trace, and this is what we do here. We show that efficient approximate learning and inference can be developed by evaluating the log-likelihood in this way and employing mean-field-type approximations.

In what follows, after formulating the problem, we first evaluate the likelihood of the data using the saddle-point approximation to the path integral. We find that the result allows inference of the state of the hidden variables if the couplings are known, but cannot be used for learning the couplings. We then compute the Gaussian correction to this saddle-point approximation to derive Thouless-Anderson-Palmer (TAP) like equations [11] for our problem.
Our numerical results show that successful learning of the couplings, even those that involve hidden nodes, can be achieved using the corrected learning rules. We find that with enough data and using these equations even hidden-to-hidden connections can be reconstructed. Although in this paper our focus is on the paradigmatic case of the Ising model, the approach taken here is quite general and can be adapted to many other popular statistical models.

II. FORMULATION OF THE PROBLEM

Consider a network of binary spin variables subject to synchronous update dynamics. Suppose some of these nodes are observed for T time steps, while the other nodes are hidden. We denote the state of the observed spins at time t by s(t) = \{s_i(t)\} and the hidden ones by \sigma(t) = \{\sigma_a(t)\}. For clarity, we always use the subscripts i, j, ... for the observed spins and a, b, ... for the hidden ones. We consider the following transition probability

p[\{s,\sigma\}(t+1) \,|\, \{s,\sigma\}(t)] = Z(t)^{-1} \exp\Big[\sum_i s_i(t+1)\, g_i(t) + \sum_a \sigma_a(t+1)\, g_a(t)\Big]   (1a)

Z(t) = \prod_i 2\cosh[g_i(t)] \prod_a 2\cosh[g_a(t)]   (1b)

where g_i(t) and g_a(t) are the fields acting on observed spin i and hidden spin a at time t, respectively,

g_i(t) = \sum_j J_{ij} s_j(t) + \sum_b J_{ib} \sigma_b(t)   (2a)

g_a(t) = \sum_j J_{aj} s_j(t) + \sum_b J_{ab} \sigma_b(t).   (2b)

Here J_{ij}, J_{ai}, J_{ia} and J_{ab} are the observed-to-observed, observed-to-hidden, hidden-to-observed and hidden-to-hidden connections. These couplings need not be symmetric and, thus, the network may never reach equilibrium. Although we consider here the case of zero external field, our derivations carry over to nonzero fields.

Given this model, we would like to answer the following questions: What can we say about the state of the hidden spins from the observations? How can we learn the various couplings in the network? A direct simulation of the dynamics of Eqs. 1 and 2, as sketched below, makes the setting concrete and is how the synthetic data of Sec. V are generated.
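The following minimal sketch (ours, not from the paper; all function and variable names are our own) samples spin histories from the synchronous dynamics of Eqs. 1 and 2, with couplings drawn i.i.d. from a Gaussian as in the numerical experiments of Sec. V, and splits the result into observed and hidden parts:

```python
import numpy as np

def simulate_kinetic_ising(J, T, n_obs, rng=None):
    """Sample spin histories from the synchronous dynamics of Eq. 1.

    J is the full N x N coupling matrix, with J[k, l] the coupling from
    spin l onto spin k; the first n_obs indices are treated as observed,
    the rest as hidden.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = J.shape[0]
    spins = np.empty((T, N))
    spins[0] = rng.choice([-1.0, 1.0], size=N)
    for t in range(T - 1):
        g = J @ spins[t]                       # local fields, Eq. 2
        p_up = 0.5 * (1.0 + np.tanh(g))        # P(spin = +1) under Eq. 1
        spins[t + 1] = np.where(rng.random(N) < p_up, 1.0, -1.0)
    return spins[:, :n_obs], spins[:, n_obs:]  # observed s, hidden sigma

# Couplings i.i.d. Gaussian with variance J1^2 / N, as in Sec. V;
# here 20% of the spins are hidden, as in Fig. 1.
N, n_obs, J1, T = 100, 80, 1.0, 10_000
rng = np.random.default_rng(0)
J = rng.normal(0.0, J1 / np.sqrt(N), size=(N, N))
s, sigma = simulate_kinetic_ising(J, T, n_obs, rng)
```

The sampling line uses the fact that under Eq. 1 each spin is updated in parallel with probability p(s_k(t+1) = +1) = e^{g_k(t)} / 2\cosh[g_k(t)] = [1 + \tanh g_k(t)]/2.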
Answering these questions requires finding the likelihood of the data. Optimizing the likelihood with respect to the Js yields the maximum likelihood values of the couplings. The posterior over the state of the hidden nodes given the observed nodes is

p[\{\sigma(t)\}_{t=1}^T \,|\, \{s(t)\}_{t=1}^T] = \frac{p[\{\sigma(t)\}_{t=1}^T, \{s(t)\}_{t=1}^T]}{p[\{s(t)\}_{t=1}^T]}

which also requires calculating the likelihood.

The likelihood of the observed spin configurations under the model of Eq. 1 is

p[\{s(t)\}_{t=1}^T] = \mathrm{Tr}_\sigma \prod_t p[\{s,\sigma\}(t+1) \,|\, \{s,\sigma\}(t)]   (3)

and involves a trace over \{\sigma(1), ..., \sigma(T)\}, which is difficult to perform.

To perform this trace analytically, we consider the following functional

\mathcal{L}[\psi] \equiv \log \mathrm{Tr}_\sigma \prod_t e^{\sum_a \psi_a(t)\sigma_a(t)}\, p[\{s,\sigma\}(t+1) \,|\, \{s,\sigma\}(t)]   (4)

which, using the definition of the transition probability from Eq. 1, can be written as

\mathcal{L}[\psi] = \log \mathrm{Tr}_\sigma \exp\Big[Q[s,\sigma] + \sum_{a,t} \psi_a(t)\sigma_a(t)\Big]   (5)

Q = \sum_{i,t} s_i(t+1)\, g_i(t) + \sum_{a,t} \sigma_a(t+1)\, g_a(t) - \sum_{i,t} \log 2\cosh[g_i(t)] - \sum_{a,t} \log 2\cosh[g_a(t)].

The log-likelihood of the data is recovered at \psi \to 0. Furthermore, \mathcal{L}[\psi] acts as the generating functional for the posterior over the hidden spins: m_a(t), the mean magnetization of hidden spin a at time t given the data, can be found using

m_a(t) = \lim_{\psi \to 0} \mu_a(\psi, t), \qquad \mu_a(\psi, t) = \frac{\partial \mathcal{L}[\psi]}{\partial \psi_a(t)}.   (6)

Higher-order derivatives of \mathcal{L} with respect to \psi yield higher-order correlation functions of the hidden spins.

To develop mean-field approximations, we first perform the trace in Eq. 5 using the standard approach for dealing with a nonlinear action in the local fields [12, 13]: we write the term inside the log in Eq. 5 as a path integral over g_i(t) and g_a(t), enforcing their definitions in Eq. 2 by inserting \delta[g_i(t) - \sum_j J_{ij} s_j(t) - \sum_b J_{ib} \sigma_b(t)] and \delta[g_a(t) - \sum_j J_{aj} s_j(t) - \sum_b J_{ab} \sigma_b(t)] in the integral. Using the integral representation of the delta functions, \mathcal{L} can be written as

\mathcal{L}[\psi] = \log \int \mathcal{D}G\, \exp[\Phi]   (7a)

\Phi = \log \mathrm{Tr}_\sigma \exp\Big[Q + \Delta + \sum_{a,t} \psi_a(t)\sigma_a(t)\Big]   (7b)

where G = \{g_i, \hat{g}_i, g_a, \hat{g}_a\}. Here \Delta, as well as \hat{g}_i and \hat{g}_a, come from the integral representation of the \delta(\cdot) functions used to enforce the definitions in Eq. 2:

\Delta = \sum_{i,t} i\hat{g}_i(t)\Big[g_i(t) - \sum_j J_{ij} s_j(t) - \sum_b J_{ib} \sigma_b(t)\Big] + \sum_{a,t} i\hat{g}_a(t)\Big[g_a(t) - \sum_j J_{aj} s_j(t) - \sum_b J_{ab} \sigma_b(t)\Big].   (8)

The crucial point about this rewriting of \mathcal{L} is that the resulting action \Phi is now linear in \sigma, and we can thus easily perform the trace over \sigma. The ability to perform the trace here comes at the cost of the high-dimensional integral over G in Eq. 7a. The integral, however, can be approximated using the saddle-point approximation and the corrections to the saddle point.

III. SADDLE-POINT APPROXIMATION

The saddle-point values of g_a, g_i, \hat{g}_a and \hat{g}_i can be derived by setting the derivatives of \Phi with respect to these variables to zero. For \psi = 0, the saddle-point equations read

i\hat{g}_i(t) = \tanh[g_i(t)] - s_i(t+1)   (9a)

i\hat{g}_a(t) = \tanh[g_a(t)] - m_a(t+1)   (9b)

g_i(t) = \sum_j J_{ij} s_j(t) + \sum_b J_{ib} m_b(t)   (9c)

g_a(t) = \sum_j J_{aj} s_j(t) + \sum_b J_{ab} m_b(t)   (9d)

in which the m_a(t) satisfy the self-consistent equations

m_a(t) = \tanh\Big[g_a(t-1) - i\sum_j \hat{g}_j(t) J_{ja} - i\sum_b \hat{g}_b(t) J_{ba}\Big].   (10)

Let us take a moment to describe the physical meaning of these equations. The saddle-point equations for g_i and g_a, Eqs. 9c and 9d, are just the mean values of the local fields at the mean magnetizations m_a(t) of the hidden variables (see Eq. 2). Examining Eqs. 9a and 9b, one can identify -i\hat{g}_i(t) and -i\hat{g}_a(t) as errors back-propagated in time. The self-consistent equations for m, Eq. 10, take the form of the usual naive mean-field equations, except that the back-propagated errors are taken into account.

Given a sequence of configurations from the observed spins, solving Eq. 10 yields the mean magnetizations of the hidden ones given this data. The derivatives of \Phi with respect to the Js, evaluated at the saddle-point solutions, give the maximum likelihood learning rules for the couplings within this saddle-point approximation.
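In practice, Eqs. 9 and 10 can be solved by damped fixed-point iteration. The sketch below is our own illustration, assuming the coupling matrix is split into its four blocks; the variables e = i\hat{g} are the real-valued back-propagated errors of Eqs. 9a and 9b, and the boundary time steps (where no past field or no future observation exists) are handled crudely by dropping the missing terms:

```python
import numpy as np

def saddle_point_m(s, J_oo, J_oh, J_ho, J_hh, n_iter=200, damping=0.5):
    """Damped fixed-point iteration of the saddle-point equations.

    s: observed histories, shape (T, n_obs). J_oo = J_ij, J_oh = J_ib,
    J_ho = J_aj, J_hh = J_ab. Returns m, shape (T, n_hid): the inferred
    mean magnetizations of the hidden spins.
    """
    T, n_hid = s.shape[0], J_ho.shape[0]
    m = np.zeros((T, n_hid))
    for _ in range(n_iter):
        g_o = s @ J_oo.T + m @ J_oh.T          # g_i(t), Eq. 9c
        g_h = s @ J_ho.T + m @ J_hh.T          # g_a(t), Eq. 9d
        # real-valued errors e = i*ghat, Eqs. 9a-9b, defined for t < T-1
        e_o = np.tanh(g_o[:-1]) - s[1:]
        e_h = np.tanh(g_h[:-1]) - m[1:]
        m_new = np.empty_like(m)
        # Eq. 10 at interior times: g_a(t-1) plus errors at time t
        m_new[1:-1] = np.tanh(g_h[:-2] - e_o[1:] @ J_oh - e_h[1:] @ J_hh)
        # crude boundaries: no past field at t = 0, no future error at T-1
        m_new[0] = np.tanh(-e_o[0] @ J_oh - e_h[0] @ J_hh)
        m_new[-1] = np.tanh(g_h[-2])
        m = damping * m + (1.0 - damping) * m_new
    return m
```

With the block structure of the simulated J above, m = saddle_point_m(s, J[:n_obs, :n_obs], J[:n_obs, n_obs:], J[n_obs:, :n_obs], J[n_obs:, n_obs:]) produces magnetizations of the kind histogrammed in Fig. 1A. Convergence is not guaranteed for strong couplings; the damping helps in practice.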
IV. GAUSSIAN CORRECTIONS

The saddle-point equations and learning rules can be improved by Gaussian corrections, that is, by evaluating the contributions to the path integral from the Gaussian fluctuations around the saddle point [14, 15]. This leads to what may be called TAP equations for our problem. In what follows, we do this on the Legendre transform of \mathcal{L},

\Gamma[\mu] = \mathcal{L} - \sum_{a,t} \psi_a(t)\mu_a(t)   (11)

and use the equation of state,

\psi_a(t) = -\frac{\partial \Gamma[\mu]}{\partial \mu_a(t)}   (12)

to correct the naive mean-field equations for our problem with the hidden spins.

Denoting the saddle-point value of \mathcal{L} by \mathcal{L}_0 and performing a Legendre transform, we have (see Appendix A)

\Gamma_0[\mu] = \sum_{i,t} [s_i(t+1)\, g_i(t) - \log 2\cosh(g_i(t))] + \sum_{a,t} [\mu_a(t+1)\, g_a(t) - \log 2\cosh(g_a(t))] + \sum_{a,t} S[\mu_a(t)]   (13)

where S[\mu_a(t)] is the entropy of a free spin with magnetization \mu_a(t), and g_i and g_a are as in Eq. 2 with \sigma_b(t) replaced by \mu_b(t). Using Eq. 12 and putting \psi = 0, we recover Eq. 10, as we should.

The Gaussian corrections can be performed as follows; for details see Appendix B. We first perform the Gaussian integral around the saddle point of \mathcal{L}. This modifies \mathcal{L}_0 to

\mathcal{L}_1 = \mathcal{L}_0 - \frac{1}{2}\log\det[\partial^2\mathcal{L}]   (14)

where \partial^2\mathcal{L} is the Hessian, containing the second derivatives of \mathcal{L} with respect to g and \hat{g}, calculated at their saddle-point values. Keeping terms quadratic in J, we then derive the modified equations for m_a(t) by first calculating \partial\mathcal{L}_1/\partial\psi_a(t). Expressing \mathcal{L}_1 in terms of the new m_a(t), again keeping terms quadratic in J, and finally performing the Legendre transform, we get

\Gamma_1[m] = \Gamma_0[m] - \frac{1}{2}\sum_{i,b,t} [1 - \tanh^2 g_i(t)]\, J_{ib}^2\, [1 - m_b^2(t)] - \frac{1}{2}\sum_{a,b,t} [m_a^2(t+1) - \tanh^2 g_a(t)]\, J_{ab}^2\, [1 - m_b^2(t)]   (15)

where now g_i and g_a are as in Eq. 2 but with m_a instead of \sigma_a. Eq. 15, together with the equation of state at \psi \to 0,

\frac{\partial \Gamma_1[m]}{\partial m_a(t)} = 0,   (16)

yields the new self-consistent equations for m_a(t). The couplings can then be found by optimizing \Gamma_1 with respect to the couplings.

In sum, the corrected TAP learning is the following EM-like iterative algorithm: at each step, solve the equations for m_a derived from Eq. 16 given the current values of the couplings, and then adjust the couplings proportionally to \partial\Gamma_1/\partial J.
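A minimal skeleton of this loop might look as follows (our sketch, not the authors' code). For brevity only the \Gamma_0 part of the gradient \partial\Gamma_1/\partial J is written out, namely \partial\Gamma_0/\partial J_{kl} = \sum_t [x_k(t+1) - \tanh g_k(t)]\, x_l(t) with x = (s, m); a full implementation would add the derivative of the quadratic correction terms of Eq. 15, and solve_m stands for a user-supplied routine that iterates Eq. 16 to convergence:

```python
import numpy as np

def coupling_gradient(s, m, J):
    """Gradient of the Gamma_0 part of the objective with respect to the
    full coupling matrix J, at fixed hidden magnetizations m.

    With x(t) = [s(t), m(t)], entry (k, l) of the result is
    sum_t [x_k(t+1) - tanh g_k(t)] * x_l(t).
    """
    x = np.concatenate([s, m], axis=1)
    g = x @ J.T                                # mean fields, Eq. 2 with m
    err = x[1:] - np.tanh(g[:-1])
    return err.T @ x[:-1]

def tap_learn(s, n_hid, solve_m, n_sweeps=100, lr=0.1, seed=0):
    """EM-like loop: alternate solving Eq. 16 for m (via solve_m) with
    gradient ascent steps on the couplings."""
    T, n_obs = s.shape
    N = n_obs + n_hid
    J = 0.01 * np.random.default_rng(seed).normal(size=(N, N))
    for _ in range(n_sweeps):
        m = solve_m(s, J)                      # E-like step, Eq. 16
        J += (lr / T) * coupling_gradient(s, m, J)   # M-like step
    return J
```

The separation into an E-like and an M-like step mirrors the structure of the algorithm described above; the learning rate and initialization are arbitrary choices of this sketch.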
V. NUMERICAL RESULTS

We evaluated the saddle-point and TAP equations on data generated from a network of N = 100 spins with the dynamics in Eq. 1 and with the couplings drawn independently from a Gaussian distribution with variance J_1^2/N. We first studied how much the saddle-point equations, Eq. 10, and the TAP equations, Eq. 16, tell us about the state of the hidden spins if we know all the couplings. Figs. 1A and 1B show the distributions of m_a(t) when 20% of the network is hidden. The solutions to the TAP equations are pushed more towards +1 and -1 compared to the saddle point. Using the simple estimator \hat{\sigma}_a(t) = \mathrm{sgn}(m_a(t)) to infer the state of hidden spin a at time t, Fig. 1C shows that the two approximations perform equally well in inferring the state of the hidden spins.

FIG. 1. (Color online) Solutions for m_a. (A) and (B) show histograms of m_a(t) found from solving the saddle-point and TAP equations, Eqs. 10 and 16, given the correct couplings. Here we had J_1 = 0.7, 20% hidden spins and a data length of T = 10^5. The dark gray bars show the histogram of the values for which \sigma_a(t) was actually +1, while the transparent ones show when it was actually -1. (C) shows the percent correct if we use \hat{\sigma}_a(t) = \mathrm{sgn}(m_a(t)) to infer the state of \sigma_a(t) vs. the percentage of the visible part. TAP and saddle-point results were virtually indistinguishable.

The effect of the TAP corrections becomes very important in reconstructing the couplings. We found that the saddle-point learning was always unstable, leading to divergent couplings. Including the TAP corrections, however, dramatically changed this. Fig. 2 shows an example when 10% of the spins were hidden. In this case, good reconstruction of all couplings could be achieved.

FIG. 2. (Color online) Reconstructing the full network. Scatter plots showing the inferred couplings (blue dots) versus the correct ones in a network of N = 100 total spins, with 10 hidden, J_1 = 1 and a data length of T = 10^6. Red crosses show the initial values of the couplings used for learning. A, B, C and D show observed-to-observed, observed-to-hidden, hidden-to-observed and hidden-to-hidden couplings, respectively.

Although for a proportionally small number of hidden nodes we observed good reconstruction, when the hidden part was large, for a constant data length, we found that the algorithm starts to become unstable: the root mean squared error of the inferred hidden-to-hidden connections increases after a few iterations, and the others follow. This instability, however, seems to arise from our attempt to learn the hidden-to-hidden connections. For an architecture with no hidden-to-hidden couplings, that is, a dynamical semi-restricted Boltzmann machine [16], or when hidden-to-hidden couplings exist in the original network but are ignored during learning (i.e., set to zero), the algorithm allows recovering the other connections even for very large hidden sizes. This is shown in Fig. 3.

FIG. 3. (Color online) Root mean squared error of the couplings. (A)-(C) show the RMS error for the observed-to-observed, hidden-to-observed and observed-to-hidden couplings using TAP learning when the original model did not have hidden-to-hidden connections; the other couplings were drawn independently from a Gaussian. (D)-(F) show the same but when hidden-to-hidden connections were present in the original network but were ignored (i.e., set to zero) when learning the other couplings. In (A) and (D) the green dots show the couplings inferred by maximizing the likelihood with hidden nodes completely ignored. Here we used J_1 = 1, N = 100, T = 10^6, and five different random realizations of the connectivity.

In the examples above, we have used the right number of hidden nodes when learning the couplings. In practice, however, the number of hidden nodes may be unknown. We thus wondered whether the number of hidden nodes can be estimated from the data. In Fig. 4, we plot the objective function \Gamma_1/N for the data used in Fig. 2, where N is the total number of nodes, that is, the sum of the number of observed nodes and the assumed number M of hidden nodes. As can be seen in Fig. 4, as M is varied, a peak in \Gamma_1/N appears at the correct number of hidden units, and the reconstruction error of the visible-to-visible connectivity reaches a minimum there. This was the case even when we constrained the hidden-to-hidden connections to zero during learning. However, the peak in the normalized cost function was sharper, and the minimum achieved for the reconstruction lower, when the hidden-to-hidden connections were also learned.

FIG. 4. (Color online) Estimation of the number of hidden nodes. (A) The RMS error of the observed-to-observed couplings inferred using TAP learning for different values of the assumed number of hidden units, M. The couplings are inferred either assuming no hidden-to-hidden connectivity (green x) or by trying to learn all couplings (blue circles). (B) The value of the objective function per inferred node for different numbers of assumed hidden units. The data used here are the same as in Fig. 2, that is, in reality there were 10 hidden nodes, which is where the normalized objective function shows a maximum and the RMS error shows a minimum.
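The model-selection procedure behind Fig. 4 can be summarized by the following schematic sketch (ours): refit the model for a range of assumed hidden sizes M and keep the value that maximizes the objective per node. Here fit and objective are stand-ins for the TAP learning loop and for an evaluation of \Gamma_1 (Eq. 15) at the fitted couplings and magnetizations; neither name comes from the paper:

```python
def estimate_n_hidden(s, fit, objective, max_hidden=30):
    """Scan the assumed number of hidden units M and keep the value
    maximizing the objective per node, Gamma_1 / (n_obs + M)."""
    scores = {}
    for M in range(max_hidden + 1):
        J = fit(s, n_hid=M)                    # e.g. the TAP loop above
        scores[M] = objective(s, J, M) / (s.shape[1] + M)
    best_M = max(scores, key=scores.get)
    return best_M, scores
```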
VI. DISCUSSION

In this paper we described an approach for learning and inference in a nonequilibrium Ising model when it is partially observed, that is, when the history of spin configurations is known only for some spins and not for the others. Treating the log-likelihood of the data as a path integral over the fields acting on the hidden and observed spins, we approximated it first using the saddle-point approximation, and then by considering TAP corrections to it.

In the usual statistical mechanics settings, TAP equations can be derived in several ways: using the Plefka expansion [17], information-geometric arguments [18], or by calculating the Gaussian fluctuations around the saddle point [14, 15]. Here we took an approach similar to the latter, except that we did not use the Hubbard-Stratonovich transformation. It would be quite interesting to see what the other approaches yield as TAP corrections. Tyrcha and Hertz [19] have also studied the problem of hidden spins, using a different approach. Their equations agree with ours at the saddle-point level but differ at the TAP level.

In terms of performance, the saddle-point equations gave results similar to TAP for inferring the state of the hidden spins, but were very unstable for learning the couplings. Using the TAP equations, learning all the couplings, even the hidden-to-hidden ones, was possible when the hidden part was proportionally small. By ignoring the hidden-to-hidden couplings, even when they were present in the network, we could achieve good performance in recovering all the other couplings, even with half the network hidden. This also improved the recovery of the observed-to-observed couplings compared to when the hidden spins were ignored altogether. Our numerical results also indicate that even the number of hidden units can be estimated by optimizing the objective function. This is of particular practical relevance, as the true size of the network is often unknown. In Fig. 4, both the objective function and the reconstruction quality increase as we add hidden nodes, suggesting that even a wrong number of hidden nodes can improve the reconstruction of the observed-to-observed couplings. This can be explained by the fact that the added hidden variables, even if wrong in number, can to some degree explain the correlations induced by the hidden spins in the real network.

An important future direction, in our view, is to study how averaging the likelihood over some couplings, in particular the hard-to-infer hidden-to-hidden ones, given a prior over them, will influence the reconstruction of the other couplings. There is a large body of work in which the path integral approach is used to study the disorder-averaged behavior of disordered systems [12, 20]. The approach that we developed here can easily be combined with this work to average out some of the couplings.

VII. ACKNOWLEDGEMENTS

We are most grateful to Joanna Tyrcha and John Hertz for constructive comments and for discussing their work with us, and to Claudia Battistin for comments on the paper. Benjamin Dunn acknowledges the hospitality of NORDITA. We also acknowledge the computing resources provided by the University of Oslo and the Norwegian Metacenter for High Performance Computing (NOTUR), Abel cluster.

Appendix A: Legendre transform

Here we derive Eq. 13 of the main text. Starting from Eq. 7, we have
\mathcal{L}[\psi] = \log \int \mathcal{D}G\, \exp[\Phi],   (I-1a)

\Phi = \sum_{i,t} s_i(t+1)\, g_i(t) + i\sum_{i,t} \hat{g}_i(t)\Big[g_i(t) - \sum_j J_{ij} s_j(t)\Big] + i\sum_{a,t} \hat{g}_a(t)\Big[g_a(t) - \sum_i J_{ai} s_i(t)\Big] - \sum_{i,t} \log 2\cosh[g_i(t)] - \sum_{a,t} \log 2\cosh[g_a(t)] + \sum_{a,t} \log 2\cosh\Big[g_a(t-1) - i\sum_b J_{ba}\hat{g}_b(t) - i\sum_i J_{ia}\hat{g}_i(t) + \psi_a(t)\Big]   (I-1b)

\mu_a(t) = \frac{\partial \Phi}{\partial \psi_a(t)} = \tanh\Big[g_a(t-1) - i\sum_j \hat{g}_j(t) J_{ja} - i\sum_b \hat{g}_b(t) J_{ba} + \psi_a(t)\Big],   (I-1c)

with the saddle point being at Eq. 9, with \mu_a instead of m_a. Applying the identity

\log 2\cosh[x] = S[\tanh[x]] + x\tanh[x]   (I-2)

S[x] \equiv -\frac{1+x}{2}\log\frac{1+x}{2} - \frac{1-x}{2}\log\frac{1-x}{2}   (I-3)

to the last sum in Eq. I-1b, and using Eq. I-1c and the saddle-point values for g_i and g_a, we obtain

\Phi_0 = \sum_{i,t} [s_i(t+1)\, g_i(t) - \log 2\cosh[g_i(t)]] + \sum_{a,t} [\mu_a(t+1)\, g_a(t) - \log 2\cosh[g_a(t)]] + \sum_{a,t} \mu_a(t)\psi_a(t) + \sum_{a,t} S[\mu_a(t)].   (I-4)

At the saddle point we have \mathcal{L}_0 = \Phi_0 and, using \Gamma_0 = \mathcal{L}_0 - \sum_{a,t} \psi_a(t)\mu_a(t), we arrive at Eq. 13.

Appendix B: Corrections to the saddle-point likelihood

To include the Gaussian corrections in Eq. 15, we first calculate the Gaussian fluctuations around the saddle-point value of \mathcal{L}_0. Given Eq. I-1b, we have

A_{ij}^{tt'} \equiv \frac{\partial^2\Phi}{\partial g_i(t)\,\partial g_j(t')} = -\delta_{ij}\delta_{tt'}\,[1 - \tanh^2[g_i(t)]]

B_{ij}^{tt'} \equiv \frac{\partial^2\Phi}{\partial \hat{g}_i(t)\,\partial \hat{g}_j(t')} = -\sum_a J_{ia}J_{ja}\,[1 - \mu_a^2(t)]\,\delta_{tt'}

C_{ab}^{tt'} \equiv \frac{\partial^2\Phi}{\partial g_a(t)\,\partial g_b(t')} = -\delta_{ab}\delta_{tt'}\,[\mu_a^2(t+1) - \tanh^2[g_a(t)]]

D_{ab}^{tt'} \equiv \frac{\partial^2\Phi}{\partial \hat{g}_a(t)\,\partial \hat{g}_b(t')} = -\sum_c J_{ac}J_{bc}\,[1 - \mu_c^2(t)]\,\delta_{tt'}

E_{ib}^{tt'} \equiv \frac{\partial^2\Phi}{\partial \hat{g}_i(t)\,\partial \hat{g}_b(t')} = -\sum_a J_{ia}J_{ba}\,[1 - \mu_a^2(t)]\,\delta_{tt'}

F_{ib}^{tt'} = \frac{\partial^2\Phi}{\partial \hat{g}_i(t)\,\partial g_b(t')} = -iJ_{ib}\,[1 - \mu_b^2(t)]\,\delta_{t-1,t'}

G_{ab}^{tt'} \equiv \frac{\partial^2\Phi}{\partial g_a(t)\,\partial \hat{g}_b(t')} = i\delta_{ab}\delta_{tt'} - iJ_{ba}\,[1 - \mu_a^2(t+1)]\,\delta_{t',t+1}

and \partial^2\Phi/\partial g_i(t)\,\partial\hat{g}_j(t') = i\delta_{ij}\delta_{tt'}, leading to

( A^{tt}       i1            0            0        0          0              0            0        )
( i1           B^{tt}        0            E^{tt}   0          0              0            0        )
( 0            0             C^{tt}       i1       0          [F^{t't}]^T    0            G^{tt'}  )
( 0            [E^{tt}]^T    i1           D^{tt}   0          0              0            0        )
( 0            0             0            0        A^{t't'}   i1             0            0        )
( 0            0             F^{t't}      0        i1         B^{t't'}       0            E^{t't'} )
( 0            0             0            0        0          0              C^{t't'}     i1       )
( 0            [G^{tt'}]^T   0            0        0          [E^{t't'}]^T   i1           D^{t't'} )   (II-2)

as the part of the Hessian for t and t' = t+1; the same structure is repeated for other times. The first to fourth block rows/columns correspond to derivatives with respect to g_i, \hat{g}_i, g_a, \hat{g}_a, all at time t, and the fifth to eighth to the same derivatives at t' = t+1.

Denoting the matrix of the blocks on the diagonal part of the Hessian by \alpha and the rest by \beta, we write

\log\det(\alpha + \beta) = \log\det(\alpha) + \log\det[I + \alpha^{-1}\beta] = \log\det(\alpha) + \mathrm{Tr}\log[I + \alpha^{-1}\beta] \approx \log\det(\alpha) + \mathrm{Tr}[\alpha^{-1}\beta] + \frac{1}{2}\mathrm{Tr}\{[\alpha^{-1}\beta]^2\} + \cdots   (II-3)

Assuming that the couplings are random and independent, with mean of order 1/N and standard deviation J_1/\sqrt{N}, as typically assumed in mean-field models of spin glasses [21], \log\det(\alpha) will be of quadratic order in J_1. Since \alpha is block diagonal, so is \alpha^{-1}, and therefore \mathrm{Tr}[\alpha^{-1}\beta] = 0. Ignoring the second trace as contributing higher-order terms, we thus approximate the determinant of the Hessian matrix by considering only the block-diagonal terms,

\det(\alpha) = \prod_t \det\{A(t)B(t) + 1\}\,\det\{C(t)D(t) + 1\},   (II-4)

leading to the following estimate of the contribution of the Gaussian fluctuations

\delta\mathcal{L} \approx -\frac{1}{2}\sum_t \log\det\{A(t)B(t) + 1\} - \frac{1}{2}\sum_t \log\det\{C(t)D(t) + 1\}
\approx -\frac{1}{2}\sum_{t,i} [1 - \tanh^2[g_i(t)]]\Big\{\sum_b J_{ib}^2\,[1 - \mu_b^2(t)]\Big\} - \frac{1}{2}\sum_{t,a} [\mu_a^2(t+1) - \tanh^2(g_a(t))]\Big\{\sum_b J_{ab}^2\,[1 - \mu_b^2(t)]\Big\}   (II-5)

where the last line follows from \log\det[1 + M] = \mathrm{Tr}\log[1 + M] \approx \mathrm{Tr}\,M, valid since A(t)B(t) and C(t)D(t) are of quadratic order in J.

With these Gaussian fluctuations taken into account, the corrected value of the mean magnetization is

m_a(t) = \frac{\partial\mathcal{L}}{\partial\psi_a(t)} = \mu_a(t) + l_a(t)   (II-6)

l_a(t) = \frac{\partial\,\delta\mathcal{L}(t)}{\partial\psi_a(t)}.   (II-7)

We now express \Gamma_0 to quadratic order in J in terms of m_a instead of \mu_a. With some algebra, at \psi \to 0 we have

\Gamma_0[m] = \sum_{i,t} [s_i(t+1)\, g_i(t) - \log 2\cosh(g_i(t))] + \sum_{a,t} [m_a(t+1)\, g_a(t) - \log 2\cosh(g_a(t))] + \sum_{a,t} S[m_a(t)] - \sum_{i,t,a} [s_i(t+1) - \tanh(g_i(t))]\, J_{ia} l_a(t) + \sum_{a,t} l_a(t+1)\,[\tanh^{-1}[m_a(t+1)] - g_a(t)] - \sum_{a,t,b} [m_a(t+1) - \tanh[g_a(t)]]\, J_{ab} l_b(t)

where we have abused notation, using g_a and g_i to indicate the fields calculated at m_a(t) instead of \mu_a(t).
Noting that l_a is itself quadratic in J, the terms involving l_a above are all of higher order and can therefore be ignored. This, combined with the expression for \delta\mathcal{L} in Eq. II-5, yields Eq. 15 of the main text.

[1] H. J. Kappen and F. B. Rodriguez, Neural Comput. 10, 1137 (1998); T. Tanaka, Phys. Rev. E 58, 2302 (1998); Y. Roudi, E. Aurell, and J. Hertz, Front. Comput. Neurosci. 3, 22 (2009); S. Cocco and R. Monasson, Phys. Rev. Lett. 106, 090601 (2011); F. Ricci-Tersenghi, J. Stat. Mech.: Theory and Exp., P08015 (2012); H. C. Nguyen and J. Berg, J. Stat. Mech.: Theory and Exp., P03004 (2012); Phys. Rev. Lett. 109, 050602 (2012).
[2] Y. Roudi and J. Hertz, Phys. Rev. Lett. 106, 048702 (2011); M. Mezard and J. Sakellariou, J. Stat. Mech.: Theory and Exp. (2011); H.-L. Zeng, E. Aurell, M. Alava, and H. Mahmoudi, Phys. Rev. E 83, 041135 (2011); P. Zhang, J. Stat. Phys., 1 (2012).
[3] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, Cognitive Science 9, 147 (1985).
[4] E. Schneidman, M. Berry, R. Segev, and W. Bialek, Nature 440, 1007 (2006).
[5] J. Shlens, G. Field, J. Gauthier, M. Grivich, D. Petrusca, A. Sher, A. Litke, and E. Chichilnisky, J. Neurosci. 26, 8254 (2006).
[6] S. Cocco, S. Leibler, and R. Monasson, Proc. Natl. Acad. Sci. USA 106, 14058 (2009).
[7] T. R. Lezon, J. R. Banavar, M. Cieplak, A. Maritan, and N. Fedoroff, Proc. Natl. Acad. Sci. USA 103 (2006).
[8] T. Bury, Physica A 392, 1375-1385 (2012).
[9] C. Bishop, in Learning in Graphical Models, edited by M. I. Jordan (MIT Press, 1999).
[10] J. Pearl, Causality: Models, Reasoning, and Inference (Cambridge Univ. Press, 2000).
[11] D. J. Thouless, P. W. Anderson, and R. G. Palmer, Philos. Mag. 35, 593 (1977).
[12] A. Coolen, in Neuro-Informatics and Neural Modelling, Handbook of Biological Physics, Vol. 4, edited by F. Moss and S. Gielen (North-Holland, 2001), pp. 619-684.
[13] M. Opper and O. Winther, in Advanced Mean Field Methods: Theory and Practice, edited by M. Opper and D. Saad (MIT Press, Cambridge, MA, 2001), pp. 7-21.
[14] J. Negele and H. Orland, Quantum Many-Particle Systems (Perseus Books, 1998).
[15] A. Kholodenko, J. Stat. Phys. 58, 357 (1990).
[16] S. Osindero and G. Hinton, in Advances in Neural Information Processing Systems, Vol. 20 (2008); G. Taylor and G. Hinton, in Proc. of ICML (ACM, 2009), pp. 1025-1032.
[17] T. Plefka, J. Phys. A: Math. Gen. 15, 1971 (1982); G. Biroli, J. Phys. A: Math. Gen. 32, 8365 (1999); Y. Roudi and J. Hertz, J. Stat. Mech.: Theory and Exp. (2011).
[18] H. J. Kappen and J. J. Spanjers, Phys. Rev. E 61, 5658 (2000).
[19] J. Tyrcha and J. Hertz, to be published (2013).
[20] C. De Dominicis, Phys. Rev. B 18, 4913 (1978); H. Sompolinsky and A. Zippelius, Phys. Rev. Lett. 47, 359 (1981).
[21] K. Fischer and J. A. Hertz, Spin Glasses (Cambridge University Press, 1991).