Learning multilayer perceptrons efficiently

C. Bunzmann, M. Biehl, R. Urbanczik
Institut für theoretische Physik, Universität Würzburg, Am Hubland, D-97074 Würzburg, Germany

A learning algorithm for multilayer perceptrons is presented which is based on finding the principal components of a correlation matrix computed from the example inputs and their target outputs. For large networks our procedure needs far fewer examples to achieve good generalization than traditional on-line algorithms.

PACS: 84.35.+i, 89.80.+h, 64.60.Cn

Multilayer neural networks achieve the ability to approximate any reasonable function arbitrarily well [1] by interconnecting sufficiently many architecturally identical processing elements, the perceptrons. This replication of identical elements invariably leads to symmetries in the multilayer architecture which need to be broken during training to achieve good generalization [2]. In the dynamics of gradient based training procedures the symmetries give rise to suboptimal fixed points and slow convergence. In particular this holds for stochastic gradient descent methods, which to date have been the method of choice for training large networks because, at each time step, the update of the network parameters is based on the presentation of only a single training example and hence is computationally cheap. Such stochastic gradient descent methods have been intensively studied in the framework of on-line learning, where each of the training examples is used just once by the network [3]. This is attractive because there is no need to store the entire set of training examples, and the approach also works for nonstationary problems. However, in the on-line framework the number of required training examples is coupled to the slow temporal evolution caused by the symmetries, and thus unreasonably large training sets are necessary. In fact the ratio of the number of examples needed for good generalization to the number of free parameters in the network diverges with the network size [4]. While there have been investigations into optimizing the on-line dynamics [5-7], these have not led to practical algorithms, since the optimized procedures assume that the symmetry breaking provided by the initial conditions is macroscopic and known to the learning algorithm.

In this Letter we present a learning algorithm which has many of the attractive features of the traditional on-line procedures but yields good generalization using a much smaller number of training examples. Further, an exact analysis of the algorithm's performance shows that the ratio of required examples to the number of free parameters in the network does stay finite in the thermodynamic limit.

The multilayer architecture we analyze is a committee machine with K hidden units defined by

    τ(ξ) = g( K^{-1/2} Σ_{i=1}^{K} h(B_i^T ξ) ),                              (1)

where ξ ∈ ℝ^N is the input and the B_i ∈ ℝ^N are the unknown parameter vectors. The goal of learning is to estimate these parameter vectors, which we shall refer to as teacher vectors, from a training set of P examples (ξ^µ, τ(ξ^µ)) of the input/output relationship. We shall initially focus on regression problems and later sketch the modifications to our procedure needed for classification tasks. For regression, the output function g is usually assumed invertible, and this easily reduces to the linear case by applying the inverse function to the target outputs. So in this case we simply assume g(x) = x. For brevity we also assume that the B_i are orthonormal.

In its simplest form our procedure can be seen as generalizing Hebbian learning. There the parameter vector J̄_P of a simple perceptron approximating the target function τ(ξ) is obtained as the average

    J̄_P = P^{-1} Σ_{µ=1}^{P} τ(ξ^µ) ξ^µ.                                      (2)
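As an illustration that is not part of the original Letter, the following minimal NumPy sketch sets up such a committee-machine teacher with h = erf and g(x) = x, draws Gaussian examples, and computes the Hebbian vector of Eq. (2); all names and sizes below are our own illustrative choices.

```python
# Minimal sketch (not the authors' code): a committee-machine teacher, Eq. (1)
# with g(x) = x and h = erf, Gaussian inputs, and the Hebbian average of Eq. (2).
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
N, K, P = 200, 3, 20000                      # illustrative sizes, alpha = P/(K N) ~ 33

def make_teacher(N, K, rng):
    """Random orthonormal teacher vectors B_1, ..., B_K (columns of B)."""
    B, _ = np.linalg.qr(rng.standard_normal((N, K)))
    return B

def teacher_output(xi, B):
    """tau(xi) = K^{-1/2} sum_i erf(B_i^T xi) for each row of xi (regression, g(x) = x)."""
    return erf(xi @ B).sum(axis=1) / np.sqrt(B.shape[1])

B = make_teacher(N, K, rng)
xi = rng.standard_normal((P, N))             # independent zero mean, unit variance inputs
tau = teacher_output(xi, B)

# Hebbian vector of Eq. (2); for large P it aligns with B_bar = K^{-1/2} sum_i B_i.
J_hebb = xi.T @ tau / P
B_bar = B.sum(axis=1) / np.sqrt(K)
print(B_bar @ J_hebb / np.linalg.norm(J_hebb))   # overlap with B_bar, close to 1 here
```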
Our basic observation is that when τ(ξ) is given by a multilayer perceptron it is important to consider not only the empirical mean of the distribution of τ(ξ^µ)ξ^µ, as in plain Hebbian learning, but also its correlation matrix. In particular, some of the eigenvectors of the correlation matrix correspond to the parameter vectors B_i, and thus the supervised learning task is mapped onto the well understood problem of principal component analysis. While this mapping by itself does not solve the original problem, it does provide a crucial reduction in its dimensionality, such that the remaining problem becomes almost trivial in the thermodynamic limit N → ∞.

We shall consider the general correlation matrix

    C_P = P^{-1} Σ_{µ=1}^{P} F(τ(ξ^µ)) ξ^µ ξ^{µT},                             (3)

where the simple choice F(x) = x² for the weight function F yields the analogy to Hebbian learning. Assuming the components of the input vectors ξ^µ to be independent Gaussian random variables with zero mean and unit variance, it is straightforward to analyze the spectrum of C_P in the limit P → ∞. The spectrum then only has three eigenvalues λ_0, λ̄ and λ_∆. The degeneracy of λ_0 is N − K, and any vector orthogonal to all teacher vectors B_i is an eigenvector. The eigenvalue λ̄ has the single eigenvector B̄ = K^{-1/2} Σ_{i=1}^{K} B_i; this eigenvector is of little interest since for large P one also has B̄ ∝ J̄_P, and it is thus simpler to use Hebb's rule (2) to estimate B̄. The important eigenspace is the one of λ_∆, since it is spanned by the K − 1 vectors B_1 − B_j (j = 2, ..., K). The three eigenvalues can be written as averages over independent, zero mean, unit variance, Gaussian random variables y_i. For instance one obtains λ_∆ = (1/2) ⟨F(τ_y)(y_1 − y_2)²⟩_y, where τ_y = K^{-1/2} Σ_{i=1}^{K} h(y_i). The activation function h is sigmoidal (odd, monotonic and bounded), and when stating specific numerical values we will always assume h(y) = erf(y). Then for the choice F(x) = x² one finds the ordering λ̄ > λ_0 > λ_∆ and λ_0 − λ_∆ = (8/(3πK)) (1 − 1/√5).

For a finite number P of training examples the degeneracy in the spectrum is broken by random fluctuations. But a computation of the orthonormal eigenvectors ∆_j^P corresponding to the K − 1 smallest eigenvalues of C_P nevertheless yields an estimate of the space spanned by the difference vectors B_1 − B_j. To measure the success of approximating this space, we introduce the overlap

    ρ = (K − 1)^{-1/2} [Tr(∆^T B B^T ∆)]^{1/2},                                (4)

where B is the matrix (B_1, ..., B_K) of the teacher vectors and ∆ = (∆_1^P, ..., ∆_{K−1}^P). This is a sensible measure because ρ is invariant with respect to the choice of an orthonormal basis of the space spanned by the ∆_j^P and since it attains its maximal value of 1 iff all ∆_j^P lie in the space spanned by the B_i. Simulations showing the overlap ρ as a function of the number of examples per free parameter, α = P/(KN), are depicted in Fig. 1. They indicate a second order phase transition from zero to positive ρ at a critical value α_c. The evolution of ρ can be calculated exactly in the thermodynamic limit N → ∞. But since up to now we are only estimating the space spanned by the B_i, instead of the vectors themselves, we address this problem first and defer the calculation of ρ.
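Continuing the sketch above (again ours, not the authors'), the first stage amounts to a few lines: build C_P of Eq. (3) with the simple weight F(x) = x², take the K − 1 eigenvectors with the smallest eigenvalues, and evaluate the overlap ρ of Eq. (4).

```python
# First stage, continuing the previous sketch (reuses xi, tau, B, K from above).
import numpy as np

def correlation_matrix(xi, tau, F=lambda t: t ** 2):
    """C_P = P^{-1} sum_mu F(tau^mu) xi^mu xi^mu^T, Eq. (3)."""
    return (xi * F(tau)[:, None]).T @ xi / xi.shape[0]

def difference_space_estimate(C, K):
    """Orthonormal eigenvectors Delta_1, ..., Delta_{K-1} of the K-1 smallest eigenvalues."""
    _, V = np.linalg.eigh(C)              # eigenvalues come in ascending order
    return V[:, :K - 1]

def overlap_rho(Delta, B):
    """rho = (K-1)^{-1/2} [Tr(Delta^T B B^T Delta)]^{1/2}, Eq. (4)."""
    M = Delta.T @ B
    return np.sqrt(np.trace(M @ M.T) / (B.shape[1] - 1))

C = correlation_matrix(xi, tau)
Delta = difference_space_estimate(C, K)
print(overlap_rho(Delta, B))              # clearly positive here, since alpha = P/(K N) > alpha_c
```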
Using the above eigenspace procedure, the B_i can be approximated by linear combinations of the ∆_j^P and J̄_P. Thus the original KN dimensional problem is reduced to K² dimensions, and this is much easier for large N. To exploit the dimensionality reduction, we write the possible linear combinations as ∆̃Γ, where ∆̃ = (J̄_P, ∆_1^P, ..., ∆_{K−1}^P), Γ is a K by K matrix of tunable parameters, and we define a student network σ_Γ in the restricted space via σ_Γ(ξ) = g( K^{-1/2} Σ_{i=1}^{K} h( ((∆̃Γ)_i)^T ξ ) ).

We want to choose Γ to minimize the generalization error ǫ_g = (1/2) ⟨(τ(ξ) − σ_Γ(ξ))²⟩_ξ, and to this end P̂ additional examples (ξ̂^ν, τ(ξ̂^ν)), ν = 1, ..., P̂, are used. To simplify the theoretical analysis, the additional examples should be picked independently of the examples used to obtain ∆̃. We then apply the standard on-line gradient descent procedure Γ_{ν+1} = Γ_ν − η ∇_Γ (τ(ξ̂^ν) − σ_{Γ_ν}(ξ̂^ν))². However, by choosing a scaling of the learning rate η such that η ≪ 1 for large N, the stochasticity drops out of the procedure [8] and in the restricted space it performs gradient descent in ǫ_g. Further, we can scale P̂ such that ηP̂ ≫ 1, and then Γ_{P̂} will be a minimum of ǫ_g for large N. Finally, both scaling conditions can be satisfied while observing P̂ ≪ N, so that the required number of additional examples is negligible compared to the size of the first training set. Note that thanks to the reduced dimensionality the details of the scaling of η and P̂ only affect finite size effects and not the performance in the thermodynamic limit. Simulations combining the two stages of our algorithm (Fig. 1) show a steady decay in ǫ_g.
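A possible implementation of this second stage is sketched below (ours; it reuses J_hebb, Delta, B, K, N, rng and teacher_output from the previous sketches, and the learning rate, initialization and number of fresh examples are arbitrary illustrative choices rather than the asymptotic scalings of η and P̂ discussed above).

```python
# Second stage: gradient descent on the K x K matrix Gamma in the restricted
# space spanned by Delta_tilde = (J_hebb, Delta_1, ..., Delta_{K-1}).
import numpy as np
from scipy.special import erf

def erf_prime(u):
    return 2.0 / np.sqrt(np.pi) * np.exp(-u ** 2)

Delta_tilde = np.column_stack([J_hebb / np.linalg.norm(J_hebb), Delta])   # N x K

Gamma = 0.1 * rng.standard_normal((K, K))
eta, P_hat = 0.05, 2000                      # illustrative choices only
for _ in range(P_hat):
    x = rng.standard_normal(N)               # fresh example, independent of the first training set
    t = teacher_output(x[None, :], B)[0]
    u = Gamma.T @ (Delta_tilde.T @ x)        # K hidden fields of the student
    err = t - erf(u).sum() / np.sqrt(K)      # tau(x) - sigma_Gamma(x)
    # gradient of (tau - sigma_Gamma)^2 with respect to Gamma
    grad = -2.0 * err / np.sqrt(K) * np.outer(Delta_tilde.T @ x, erf_prime(u))
    Gamma -= eta * grad

# estimate of eps_g = 0.5 <(tau - sigma_Gamma)^2> on fresh test inputs
test = rng.standard_normal((5000, N))
pred = erf((test @ Delta_tilde) @ Gamma).sum(axis=1) / np.sqrt(K)
print(0.5 * np.mean((teacher_output(test, B) - pred) ** 2))
```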
We now turn to the theoretical calculation of ρ in the important first stage of the algorithm. The smallest eigenvalue of C_P can be found by minimizing J^T C_P J under the constraint |J| = 1. Hence we consider the partition function Z = ∫ dJ exp(−βP J^T C_P J), where the integration is over the N-dimensional unit sphere. For large N the typical properties of the minimization problem are found by calculating the training set average ⟨ln Z⟩ and taking the limit β → ∞. Using replicas, one introduces the K dimensional order parameter R, the typical overlap B^T J of the teacher vectors with a vector J drawn from the Gibbs distribution Z^{-1} exp(−βP J^T C_P J), and further the replica symmetric scalar order parameter q, which is the overlap J_1^T J_2 of two vectors drawn from this distribution. For P > N the correlation matrix C_P is nonsingular, q approaches 1 with increasing β, and the relevant scaling is χ = β(1 − q) = O(1). Then for the quenched average of ln Z one finds in the limit β → ∞

    ⟨ln Z⟩/(βN) = extr_{R,χ} [ R^T A(α,χ) R + a(α,χ) ].                        (5)

Here a(α,χ) = −αK ⟨G_χ(τ_y)⟩_y + 1/(2χ), the K by K matrix A(α,χ) has entries

    A_jk(α,χ) = −αK ⟨G_χ(τ_y)(y_j y_k − δ_jk)⟩_y − δ_jk/(2χ),                  (6)

and G_χ(τ_y) = F(τ_y)/(1 + 2χ F(τ_y)). Since Eq. (5) is quadratic in R, the extremal problem can only have a solution R ≠ 0 if the matrix A is singular. From the symmetries one easily obtains that A has just two eigenvalues. The first can be written as A_11 − A_12; it is the relevant eigenvalue in our case [9] and its eigenspace is spanned by the vectors e_1 − e_j (j = 2, ..., K), where e_1, ..., e_K denote the standard basis of ℝ^K. This degeneracy shows that the difference between the K − 1 smallest eigenvalues of the correlation matrix vanishes for large N. So the simple procedure of analyzing the properties of the single vector J minimizing J^T C_P J in fact yields the properties of the K − 1 eigenvectors of C_P with smallest eigenvalues in the thermodynamic limit. Due to the degeneracy, we can reparametrize (5) setting R = ρ(e_1 − e_j)/√2 and obtain an extremal problem with only two variables, ρ and χ. Note that ρ is indeed the parameter introduced in the analysis of the numerical simulations. Its evolution is now obtained by solving (5), and this confirms the continuous phase transition found in the simulations from ρ = 0 to positive ρ at a critical value α_c. For K = 2 one finds α_c = 4.49 and for K = 3 the result is α_c = 8.70. As shown in Fig. 1, beyond the phase transition there is excellent agreement between the theoretical prediction and the simulations.

To obtain generic results for large K, note that the contributions of y_1 and y_2 to the target output τ_y will be small in this limit. Decomposing τ_y as τ_y = τ*_y + δ_y/√K, where δ_y = h(y_1) + h(y_2), and expanding G_χ(τ_y) up to second order for large K simplifies Eq. (5) to

    ⟨ln Z⟩/(βN) = extr_{ρ,χ} [ −(αρ²/4) ⟨G″_χ(τ*_y)⟩_y ⟨δ_y²((y_1 − y_2)² − 2)⟩_y − αK ⟨G_χ(τ_y)⟩_y + (1 − ρ²)/(2χ) ].   (7)

On the one hand, applying the central limit theorem, the multiple integrals ⟨G″_χ(τ*_y)⟩_y and ⟨G_χ(τ_y)⟩_y can now be replaced by a single average; on the other hand, the structure of (7) shows that χ approaches zero with increasing K. These observations yield the very simple result that ρ² = 1 − α_c/α for large K and α > α_c. The value of α_c is obtained as

    α_c = 4K ⟨F²(µz)⟩_z / [ ⟨F″(µz)⟩_z² ( ⟨z²h²(z)⟩_z − µ² − 2⟨zh(z)⟩_z² )² ],   (8)

where µ² = ⟨h²(z)⟩_z and the distribution of z is Gaussian with zero mean and unit variance. It is now straightforward to derive the optimal choice of F from Eq. (8) by noting that in the denominator ⟨F″(µz)⟩_z = µ^{-2} ⟨(z² − 1)F(µz)⟩_z. Applying the Cauchy-Schwarz inequality to ⟨(z² − 1)F(µz)⟩_z then yields that the optimal choice is F(x) = x² − µ². For this choice the eigenvalue λ_0 of the correlation matrix C_P vanishes for large P. So the optimal F maximizes the signal to noise ratio between the eigenspace of difference vectors we want to estimate and the orthogonal space.

For the activation function h(x) = erf(x) one finds that α_c = 1.96K when F is optimal, whereas the simple choice F(x) = x² yields a critical α which is one and a half times higher. Simulation results for K = 7, plotted in Fig. 2, show that the large K theory provides a reasonable approximation already for networks with quite few hidden units.

To round off the analysis of the regression problem, we obtain a theoretical prediction for the generalization error achieved by combining the two stages of our procedure. A simple calculation shows that the overlap r of the Hebbian vector J̄_P with B̄, r = B̄^T J̄_P / |J̄_P|, for large N satisfies r = (1 + arcsin(2/3)/(2αK))^{-1/2}. Further, using the explicit expression for ǫ_g given in [3] and the values r and ρ, we can calculate the minimum of ǫ_g in the restricted space and find a theoretical prediction for the generalization error obtained by the second stage. This prediction is compared to the simulations in Figs. 1 and 2.
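As a numerical cross-check of Eq. (8) as reconstructed here (not code from the Letter), the Gaussian averages for h = erf can be evaluated by Gauss-Hermite quadrature; the sketch below reproduces α_c ≈ 1.96K for the optimal weight function and a value one and a half times larger for F(x) = x².

```python
# Evaluate Eq. (8) for h = erf by Gauss-Hermite quadrature (illustrative check).
import numpy as np
from scipy.special import erf

x, w = np.polynomial.hermite.hermgauss(80)
z, w = np.sqrt(2.0) * x, w / np.sqrt(np.pi)        # <f(z)>_z = sum_i w_i f(z_i) for z ~ N(0,1)
avg = lambda f: np.sum(w * f(z))

mu2 = avg(lambda s: erf(s) ** 2)                   # mu^2 = <h^2(z)>_z
core = avg(lambda s: s**2 * erf(s)**2) - mu2 - 2.0 * avg(lambda s: s * erf(s)) ** 2

def alpha_c_over_K(F, F_second):
    num = 4.0 * avg(lambda s: F(mu2 ** 0.5 * s) ** 2)
    den = avg(lambda s: F_second(mu2 ** 0.5 * s)) ** 2 * core ** 2
    return num / den

print(alpha_c_over_K(lambda t: t**2 - mu2, lambda t: 2.0 + 0.0 * t))   # ~ 1.96 (optimal F)
print(alpha_c_over_K(lambda t: t**2,       lambda t: 2.0 + 0.0 * t))   # ~ 2.94 (simple F)
```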
We next consider classification problems, that is, we assume that the output function of the network given by Eq. (1) is g(x) = sgn(x). Then, since the output is binary, λ_0 = λ̄ holds for any choice of the weight function F, and our procedure cannot immediately be applied. However, it is possible to gain information about the target rule by not just considering its output τ^µ but by comparing the output to the value B̄^T ξ^µ which would have been obtained by the linearized network, g(x) = h(x) = x. This is feasible since, even for classification, B̄ can be estimated by the Hebbian vector J̄_P defined by (2). We are thus led to consider more general correlation matrices of the form C_P = P^{-1} Σ_{µ=1}^{P} F( τ(ξ^µ), J̄_{µ−1}^T ξ^µ ) ξ^µ ξ^{µT}. A reasonable way of choosing F is to focus on inputs ξ^µ where the target output has a different sign than the output of the linearized network. So we consider F(x, y) = Θ(−xy) − µ, where Θ is the Heaviside step function.

In the large P limit, the matrix C_P has the same eigenspaces as in the case of regression, and the three eigenvalues will in general be different. For the activation function we chose h(x) = sgn(x) to compare our results with the findings for on-line Gibbs learning [10], to our knowledge the only algorithm which has been simulated in any detail for classification tasks with a connected committee machine. For this h one finds λ_∆ > λ_0 > λ̄, and for large K to leading order λ_∆ − λ_0 = √(2π − 4)/(π²K). Numerical simulations of the procedure are shown in Fig. 3, using µ = π^{-1} arccos√(2/π). Motivated by our findings in the case of regression, this choice of µ yields λ_0 = 0 for large K. While the training sets needed to achieve good generalization are much larger than for regression, they are, for exactly the same architecture, approximately 20 times smaller than for on-line Gibbs learning [10] already for N = 150.

Throughout this Letter we have assumed that the inputs are drawn from an isotropic distribution. Typically, in practical applications, the input data itself will have some structure, and in this case the inputs have to be whitened before applying our procedure. To this end one computes the correlations of the inputs, D_P = P^{-1} Σ_{µ=1}^{P} ξ^µ ξ^{µT}. Then, instead of just considering C_P, one finds the extremal eigenvectors ∆̃_j of the transformed matrix R^{-1T} C_P R^{-1}, where the square matrix R is the Cholesky factorization of D_P, R^T R = D_P. An estimate for the space spanned by the difference vectors is then obtained by transforming back, setting ∆_j = R^{-1} ∆̃_j. Numerical simulations for F(x) = x² show that whitening noticeably improves the performance of our algorithm even when the inputs are picked from an isotropic distribution. This is due to the fact that the spectrum of the empirical correlation matrix D_P has a finite width even when it converges to a delta peak with increasing P.
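The sketch below (ours, with hypothetical names and arbitrary sizes) illustrates the classification variant and the whitening transform just described: the weight function F(x, y) = Θ(−xy) − µ compares the binary output with the field of the running Hebbian estimate, and the difference space is then read off from the extremal eigenvectors of the whitened correlation matrix.

```python
# Classification variant with h(x) = g(x) = sgn(x), plus optional whitening.
import numpy as np

rng = np.random.default_rng(1)
N, K, P = 150, 7, 30000                      # alpha here is far below the values shown in Fig. 3
mu = np.arccos(np.sqrt(2.0 / np.pi)) / np.pi # makes lambda_0 vanish for large K

B, _ = np.linalg.qr(rng.standard_normal((N, K)))     # orthonormal teacher vectors
xi = rng.standard_normal((P, N))
tau = np.sign(np.sign(xi @ B).sum(axis=1))           # binary committee output

C = np.zeros((N, N))
J = np.zeros(N)                                      # running (unnormalized) Hebbian vector
for x, t in zip(xi, tau):
    F = float(t * (J @ x) < 0.0) - mu                # Theta(-tau * J^T xi) - mu; the norm of J does not matter here
    C += F * np.outer(x, x)
    J += t * x                                       # Hebbian update with the current example
C /= P

# Whitening: D_P = P^{-1} sum xi xi^T = L L^T; diagonalize L^{-1} C L^{-T} and map back.
D = xi.T @ xi / P
L = np.linalg.cholesky(D)
Ct = np.linalg.solve(L, np.linalg.solve(L, C).T).T   # L^{-1} C L^{-T}
_, V = np.linalg.eigh(Ct)
Delta = np.linalg.solve(L.T, V[:, -(K - 1):])        # for this F the difference space has the largest eigenvalues
Delta, _ = np.linalg.qr(Delta)                       # re-orthonormalize before evaluating rho

M = Delta.T @ B
print(np.sqrt(np.trace(M @ M.T) / (K - 1)))          # rho; approaches 1 only for much larger alpha (cf. Fig. 3)
```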
In summary, we have presented the first learning algorithm for a realistic multilayer network which can be exactly analyzed in the thermodynamic limit and yields good generalization already for a finite ratio of training examples to free parameters. This contrasts with the behavior of traditional on-line procedures [3,4] for target functions such as the ones considered in this Letter. As long as there are on the order of N on-line steps, the dynamics is exactly described by deterministic differential equations for a set of self-averaging order parameters [12] in the limit of large N. However, even when tuned by choosing a good learning rate, the dynamics is stuck in a badly generalizing plateau on this time scale unless some information about the target function is built into the initial conditions. For the realistic case of randomly chosen initial values, the symmetries are eventually broken due to small initial fluctuations and the accumulation of fluctuations during the dynamics, but only when the number of on-line steps is much larger than N. This symmetry breaking is not adequately described by the deterministic differential equations. When the training set is sampled without replacement, this divergence of the time scale means that large numbers of examples are needed. But even in the recently analyzed scenario of sampling with replacement [13], the above theoretical problems remain and are compounded, since the analysis requires involved techniques such as dynamical replica theory and quite a few approximations already on the time scale N.

Hence, we believe that the findings in this Letter open new avenues for research and that in particular the performance of the algorithm for classification tasks deserves a more detailed analysis. But the perhaps most intriguing aspect of our procedure is that it does not really assume that the number of hidden units in the architecture is known a priori. This is highlighted by the inset of Fig. 2 showing that the number of hidden units can be determined by inspecting the eigenvalue spectrum of the correlation matrix. So our findings also provide a novel perspective on the problem of model selection.

The work of two of us (C.B. and R.U.) was supported by the Deutsche Forschungsgemeinschaft.

[1] G. Cybenko. Math. Control Signals and Systems, 2:303, 1989.
[2] H. Schwarze and J. Hertz. Europhys. Lett., 21:785, 1993.
[3] D. Saad and S. Solla. Phys. Rev. Lett., 74:4337, 1995.
[4] M. Biehl, P. Riegler, and C. Wöhler. J. Phys. A, 29:4769, 1996.
[5] D. Saad and M. Rattray. Phys. Rev. Lett., 79:2578, 1997.
[6] R. Vicente and N. Caticha. J. Phys. A, 30:L599, 1997.
[7] M. Rattray, D. Saad, and S. Amari. Phys. Rev. Lett., 81:5461, 1998.
[8] S. Amari. IEEE Transactions on Electronic Computers, 16(3):299, 1967.
[9] The eigenvector of the second eigenvalue, A_11 + (K−1)A_12, has the form R_i = const, corresponding to the maximal eigenvalue of C_P.
[10] J. Kim and H. Sompolinsky. Phys. Rev. E, 58:2348, 1998.
[11] R. Urbanczik. Europhys. Lett., 35:553, 1996.
[12] G. Reents and R. Urbanczik. Phys. Rev. Lett., 80:5445, 1998.
[13] A.C.C. Coolen, D. Saad, and Y. Xiong. Europhys. Lett., 51:691, 2000.

FIG. 1. Results for K = 3. The evolution of ρ for N = 400 (○) and N = 1600 (●). The right axis relates to the values of ǫ_g found when the two stages of our procedure are combined, N = 400 (□) and N = 1600 (■). For ǫ_g the value of α refers to the number of examples in both training sets, α = (P + P̂)/(KN). The full lines show the theoretical prediction for the thermodynamic limit. Where not shown, error bars are smaller than the symbol size.

FIG. 2. Results for K = 7 using the weight function F(x) = x² − µ². The numerical simulations for N = 2000 are compared to the theoretical curves found in the large K limit. Where not shown, error bars are smaller than the symbol size. The inset shows a histogram of the 200 smallest eigenvalues of C_P for a single training set at α = 22. A gap separates the 6 smallest eigenvalues from the rest of the spectrum. The range of the shown eigenvalues is [−0.1, −0.07].

FIG. 3. Numerical simulations of the classification problem for K = 7, N = 150 showing the values of ρ (○) and ǫ_g (□). The error bars for the ρ values are smaller than the symbol size. Here ǫ_g is the probability of misclassifying a new input.
Training in the restricted space during the second stage of our procedure uses the variant of the perceptron learning rule described in [11] for tree committee machines.
