
arXiv:cond-mat/9710154v2 [cond-mat.dis-nn] 13 Jan 1998

A multifractal phase-space analysis of perceptrons with biased patterns

J. Berg* and A. Engel†
Institut für Theoretische Physik, Otto-von-Guericke-Universität Magdeburg, PSF 4120, 39016 Magdeburg, Germany
(13.1.98)

* [email protected]
† [email protected]

We calculate the multifractal spectrum of the partition of the coupling space of a perceptron induced by random input-output pairs with non-zero mean. From the results we infer the influence of the input and output bias respectively on both the storage and generalization properties of the network. It turns out that the value of the input bias is irrelevant as long as it is different from zero. The generalization problem with output bias is new and shows an interesting two-level scenario. To compare our analytical results with simulations we introduce a simple and efficient algorithm to implement Gibbs learning.

PACS numbers: 87.10.+e, 02.50.Cw

I. INTRODUCTION

The properties of simple networks of formal neurons can be quantitatively described by characterizing the partition of the coupling space induced by the required input-output mappings [1-4]. In some cases the geometrical properties of this partition can be concisely specified by the multifractal spectrum of the distribution of cells in coupling space corresponding to the different output sequences that can be generated by the system for given inputs [5,6]. This approach has been used for a number of investigations of both single-layer as well as multi-layer feed-forward neural nets [7-10] and revealed a number of interesting new aspects. The simplest case of the perceptron allows a rather detailed analysis also highlighting the problems and subtleties of the method [11]. All investigations done so far have, however, considered symmetric statistics of both inputs and outputs.

In the present paper we analyze the multifractal properties of the phase space of a single-layer perceptron induced by input patterns with biased statistics. A possible bias of the outputs is taken care of by considering a special subset of cells only.
The investigation is motivated by the fact that the storage properties of a perceptron are known to depend markedly on the statistics of the patterns. A similar influence can be expected for the generalization behaviour, which has to our knowledge not been studied for biased outputs before.

The analysis is performed using the methods that have already been employed for the case of unbiased patterns. With the help of the replica trick the multifractal spectrum f(α) averaged over the distribution of inputs is calculated analytically for different pattern set sizes γ. The storage and generalization properties are determined by the points with slope 0 and 1 respectively of these curves [3,6]. The results for the storage capacity are compared with previous findings [2], whereas those for the generalization behaviour are checked against numerical simulations. In order to efficiently explore the version space in these simulations, we introduce a randomized variant of the perceptron learning algorithm.

II. CALCULATION OF THE CELL SPECTRUM

We consider a spherical perceptron specified by a real coupling vector J obeying J² = N and a set of p = γN input patterns ξ^μ with components ξ_i^μ drawn independently of each other from the distribution

    P(ξ) = (1-m)/2 δ(ξ+1) + (1+m)/2 δ(ξ-1).    (1)

To every input ξ^μ the perceptron determines an output σ^μ according to

    σ^μ = sgn( J·ξ^μ/√N ) = sgn( (1/√N) Σ_{i=1}^N J_i ξ_i^μ ).    (2)

Every set of input patterns therefore divides the coupling space into 2^p cells

    C(σ) = { J | σ^μ = sgn(J·ξ^μ) ∀μ }    (3)

labelled by the sequence of outputs σ^μ. Note that some of the cells may be empty. The size P(σ) = V(σ)/Σ_τ V(τ) of a cell gives the fraction of the coupling space that will produce the outputs σ^μ given the inputs ξ_i^μ. It is convenient to characterise the cell sizes by a crowding index α(σ) defined by

    P(σ) = 2^{-Nα(σ)}.    (4)

The entropy of the distribution of cell sizes in the thermodynamic limit averaged over the input patterns is then given by

    f(α) = lim_{N→∞} (1/N) ⟨⟨ log₂ Σ_σ δ(α - α(σ)) ⟩⟩    (5)

where ⟨⟨...⟩⟩ denotes the average over the input pattern distribution (1). Involving a trace over all output sequences σ, this quantity cannot be used to elucidate the dependencies on the output bias. In fact an explicit calculation shows that it is also independent of the input pattern distribution, giving results for f(α) identical to those for m = 0. The intuitive reason for this fact is that a cell chosen at random from the above ensemble will with probability 1 lie in the (N-1)-dimensional subspace of the coupling space which is orthogonal to the direction of the bias. However, the projection of the cell structure onto this subspace, whose properties dominate the cell spectrum in the thermodynamic limit, carries no bias.

Hence, in order to study the influence of the input and output statistics on the geometry of the phase space, we have restricted the σ-trace to outputs with magnetization m′:

    f(α) = lim_{N→∞} (1/N) ⟨⟨ log₂ Σ′_σ δ(α - α(σ)) ⟩⟩
         = lim_{N→∞} (1/N) ⟨⟨ log₂ Σ_σ δ( (1/γN) Σ_{μ=1}^{γN} σ^μ - m′ ) δ(α - α(σ)) ⟩⟩.    (6)

In the literature on multifractals f(α) is called the multifractal spectrum. It can be calculated by using its analogy with the microcanonical entropy of a spin system σ with Hamiltonian α(σ) and the free energy

    τ(q) = -lim_{N→∞} (1/N) ⟨⟨ log₂ Z ⟩⟩ = -lim_{N→∞} (1/N) ⟨⟨ log₂ Σ′_σ P^q(σ) ⟩⟩ = -lim_{N→∞} (1/N) ⟨⟨ log₂ Σ′_σ 2^{-qNα(σ)} ⟩⟩.    (7)

The quenched average over input patterns with magnetisation m is performed using the pattern statistics (1). The entropy f(α) can now be obtained by a Legendre transformation with respect to the inverse temperature q:

    f(α) = min_q [ αq - τ(q) ].    (8)

For the perceptron we have

    P(σ) = ∫ dμ(J) Π_{μ=1}^p θ( σ^μ J·ξ^μ/√N )    (9)

with the integral measure

    dμ(J) = Π_i ( dJ_i/√(2πe) ) δ(N - J²)    (10)

ensuring the spherical constraint for the coupling vectors and the total normalisation over all cells, Σ_σ P(σ) = 1.
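As a toy numerical illustration of the cell decomposition (3) and the crowding index (4), the cell sizes P(σ) can be estimated for very small N by Monte Carlo sampling of coupling vectors. The following sketch is our own illustration, not part of the paper's method; all sizes and the sampling scheme are arbitrary choices:

```python
import math
import random
from collections import Counter

random.seed(0)

N, p = 10, 5   # toy sizes; the analytical results hold for N -> infinity
m = 0.5        # input bias m of eq. (1)

# Biased +/-1 inputs: P(xi = +1) = (1+m)/2, as in eq. (1)
xi = [[1 if random.random() < (1 + m) / 2 else -1 for _ in range(N)]
      for _ in range(p)]

def outputs(J):
    """Output sequence sigma^mu = sgn(J . xi^mu), eq. (2)."""
    return tuple(1 if sum(Ji * x for Ji, x in zip(J, row)) > 0 else -1
                 for row in xi)

# Sample coupling directions uniformly on the sphere J^2 = N; only the
# direction matters for the signs, so unnormalised Gaussian vectors suffice.
samples = 5000
counts = Counter(outputs([random.gauss(0.0, 1.0) for _ in range(N)])
                 for _ in range(samples))

P = {s: c / samples for s, c in counts.items()}        # cell sizes P(sigma)
alpha = {s: -math.log2(v) / N for s, v in P.items()}   # crowding index, eq. (4)

print(len(P), "non-empty cells observed out of", 2 ** p, "possible")
print("total probability:", sum(P.values()))
```

Only cells actually hit by the sampler appear in the histogram; empty cells, which the text notes may exist, are never observed this way.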
θ(x) is the Heaviside step function. In order to average log P^q over the input patterns we introduce two sets of replicas: one set labelled a = 1,...,n for the standard replica trick to replace the log, and one set labelled α = 1,...,q representing the q-th power of P in (7). Thus we arrive at the replicated partition function

    ⟨⟨ Z^n ⟩⟩ = Σ_{{σ_μ^a}} Π_a δ( Σ_μ σ_μ^a - Nγm′ ) ∫ Π_{a,α} dμ(J_a^α) ⟨⟨ Π_{μ,a,α} θ( σ_μ^a J_a^α·ξ^μ/√N ) ⟩⟩.    (11)

Averaging over the quenched disorder results in the order parameters

    Q_{a,b}^{α,β} = (1/N) J_a^α · J_b^β,    M_a^α = (1/√N) Σ_i J_i^{a,α}

as well as their conjugates Q̂_{a,b}^{α,β} and M̂_a^α. In the following M_a^α will be referred to as the weak magnetisation, since the mean value of the couplings, the strong magnetisation (1/N) Σ_i J_i^{a,α}, vanishes in the thermodynamic limit. The appearance of an additional order parameter describing a ferromagnetic bias of order √N was to be expected: to produce biased outputs the local fields J·ξ^μ/√N must have an average of order 1. Given ⟨⟨ξ_i^μ⟩⟩ = m this requires an average Σ_i J_i of order √N. The integral representation of the delta function restricting the set of outputs introduces a further set of order parameters R_a.

In the present paper we will only discuss the results obtained within the replica symmetric (RS) Ansatz [5,6]

    Q_{a,b}^{α,β} = 1 for (a,α) = (b,β),
                  = Q₁ for a = b, α ≠ β,
                  = Q₀ for a ≠ b,
    M_a^α = M,    R_a = R.    (12)

This Ansatz represents the two-replica structure of the problem: Q₁ is assigned to the overlap between coupling vectors belonging to identical cells labelled by σ_μ^a, Q₀ is assigned to the overlap between different cells. For the cell spectrum without bias it gives the correct result for 0 ≤ q ≤ 1, which is the interval of interest here. Nevertheless it remains plagued by divergences for q < 0 and continuous replica symmetry breaking for q > 1; for details see [11].

Eliminating the conjugate order parameters one finds that Q₀ = 0 is always an extremum of τ(q), since for Q₀ = 0 the saddle point equation in Q₀ coincides with that in M.
The interpretation of this result is that two randomly chosen coupling vectors with the same weak magnetisation do not have an overlap of order 1. We thus arrive at the free energy

    τ(q) = -(1/ln 2) extr_{Q₁,M,R} [ ((q-1)/2) ln(1-Q₁) + (1/2) ln(1-(1-q)Q₁) - γm′R
                                     + γ ln( e^R ∫Ds H₊^q + e^{-R} ∫Ds H₋^q ) ]    (13)

    H₊ = H( (√Q₁ s - mM)/√(1-Q₁) ),    H₋ = H( (-√Q₁ s + mM)/√(1-Q₁) )

where Ds = (ds/√(2π)) exp(-s²/2) is the Gaussian integral measure and H(x) = ∫_x^∞ Ds.

Extremising τ(q) with respect to Q₁, M, and R yields the saddle point equations

    γ ∫Ds ( e^R H₊^{q-2} + e^{-R} H₋^{q-2} ) G² / ∫Ds ( e^R H₊^q + e^{-R} H₋^q ) = Q₁ / (1-(1-q)Q₁)

    ∫Ds ( e^R H₊^{q-1} - e^{-R} H₋^{q-1} ) G / ∫Ds ( e^R H₊^q + e^{-R} H₋^q ) = 0    (14)

    ∫Ds ( e^R H₊^q - e^{-R} H₋^q ) / ∫Ds ( e^R H₊^q + e^{-R} H₋^q ) = m′

where G = (1/√(2π)) exp( -(√Q₁ s - mM)²/(2(1-Q₁)) ).

The cell spectrum f(α) can now be evaluated using (8).

III. DISCUSSION

For m′ = 0 the cell spectrum of unbiased patterns is recovered for any m. For m = 0 no coupling vector with a weak magnetisation which produces a sequence of outputs with m′ ≠ 0 exists. We thus turn to the case m ≠ 0, m′ ≠ 0. A transformation mM → M in the free energy (13) would remove the magnetisation of the inputs from the saddle point equations (14). Thus the properties of the cell spectrum for non-zero m and m′ do not depend on m but only on γ and m′, whereas the weak magnetisation of the couplings M scales with m⁻¹ for fixed m′ and γ. Hence we may restrict the discussion to the case m = m′ ≠ 0 without loss of generality.

FIG. 1. The multifractal spectrum f(α) for various values of the loading parameter γ = 0.2 (dotted), 0.5 (long dashed), 1 (full), and 2 (dashed) and values of the magnetisation m′ = 0, 0.5, 0.75 from top to bottom respectively. The parts with negative slope have been omitted since their interpretation is presently not clear (cf. [11]).

Figure 1 shows the cell spectrum at several loading capacities γ and magnetisations m′. For any given q ≡ df/dα the number of cells decreases exponentially with increasing m′.
This is in accordance with the fact that the maximal possible number of cells with output bias m′ scales as 2^{N f_tot} with f_tot = -γ[ (1-m′)/2 log₂((1-m′)/2) + (1+m′)/2 log₂((1+m′)/2) ]. The maximum f_max of f(α) exponentially dominates the total number of cells ∫dα 2^{N f(α)}, hence a randomly chosen output sequence will result in a cell of size α(q=0), which is termed a storage cell. For values of the loading parameter below the critical storage capacity γ_c typically all possible cells may be realised, so f_max = f_tot. However, above the critical storage capacity only an exponentially small fraction of all possible cells may be realised, hence f_max < f_tot. Note that although f_max decreases with increasing m′ as shown in figure 1, it does so slower than f_tot, so that the storage capacity increases. This can also be seen by comparing the curves of f(α) at γ = 2 for m′ = 0 and m′ = .75. The maximum of f(α) is attained for m′ = 0 as α → ∞, indicating that the cell volume shrinks to zero, which signals the critical storage capacity. However for m′ = .75 the maximum of f(α) is reached at finite values of α, indicating a finite size of a storage cell at γ = 2. In fact the limit q → 0 of (13) can be taken explicitly, yielding the saddle point equations for the storage problem for magnetised patterns [2].

On the other hand, the cells dominating the volume V = ∫dα 2^{N(f(α)-α)} of the coupling space are characterised by q ≡ df/dα = 1. In general such a cell is taken to describe the generalisation behaviour of the perceptron, since in the thermodynamic limit a randomly chosen teacher perceptron will lie within a cell of this size with probability one. The saddle point equations (14) at this point are

    Q₁ = γ ∫Ds ( H₊⁻¹ + H₋⁻¹ ) G²    (15)

    R = 0    (16)

    m′ = 1 - 2H(mM).    (17)

The interpretation of this saddle point may not be immediately obvious, since we have specified the magnetisation of the outputs, but no properties of a teacher that will produce such a set of outputs on a given input pattern. However, equation (17) provides an explicit relation between m, m′, and M.
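The relation (17) can be inverted numerically for the weak magnetisation M. A minimal sketch (function names and the bisection scheme are our own choices; it assumes m ≠ 0 and uses H(x) = erfc(x/√2)/2):

```python
import math

def H(x):
    """Gaussian tail function H(x) = int_x^infty Ds = erfc(x/sqrt(2))/2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def weak_magnetisation(m, m_prime, lo=-50.0, hi=50.0, iters=200):
    """Solve m' = 1 - 2 H(m M), eq. (17), for M by bisection (needs m != 0)."""
    f = lambda M: 1.0 - 2.0 * H(m * M) - m_prime
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:   # root lies in [lo, mid]
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

M = weak_magnetisation(m=0.5, m_prime=0.5)
print(M)                        # m*M is the standard normal quantile of (1+m')/2
print(1.0 - 2.0 * H(0.5 * M))   # recovers m' = 0.5
```

Note that the solution fixes only the product mM for given m′, consistent with the m⁻¹ scaling of M noted above.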
It gives the weak magnetisation M of the couplings of a teacher or student required to produce a magnetisation m′ of the outputs given a set of inputs with magnetisation m. In the context of a teacher with magnetisation M acting on a set of inputs with magnetisation m, (17) simply follows from the central limit theorem, and Q₁ describes the overlap between teacher and student after the student has learned to classify γN examples correctly. The subsequent generalisation error ε_g is given by

    ε_g = 1 - ( 2/√(1-Q₁²) ) ∫₀^∞ ∫₀^∞ (dχ dσ/2π) exp( -(σ² - 2σχQ₁ + χ² + 2m²M²(1-Q₁))/(2(1-Q₁²)) ) cosh( mM (χ+σ)/(1+Q₁) ).    (18)

FIG. 2. The teacher-student overlap Q₁ and the resulting generalisation error ε_g as a function of γ for m′ = 0, 0.5, 0.75 (from top to bottom). The uppermost curve corresponds to the results of Györgyi and Tishby [12]. The full lines are analytical curves whereas the symbols are the results of numerical simulations with N = 200 averaged over 200 patterns. The symbol size corresponds to 5 times the statistical error.

The full lines in fig. 2 show the overlap Q₁ between student and teacher and the corresponding generalisation error ε_g as a function of γ after the student has learned γN examples for different magnetisations m′ = m. The generalisation error is found to decrease for increasing m′ at fixed γ. The effect is most pronounced at low values of γ; in particular we find ε_g(γ=0, m′) = (1-m′²)/2. In fact the weak magnetisation of the couplings M is independent of the loading parameter and already takes on its non-zero value (for m ≠ 0) at γ = 0. However, two independent output strings σ^μ_T and σ^μ_J with the same average m′ differ on average in a fraction (1-m′²)/2 of their bits only. Hence ε_g < .5 even for zero teacher-student overlap Q₁. Qualitatively this means that the student learns the correct bias of the outputs after a non-extensive number of examples already. A plausibility argument underlining this effect is as follows: by the central limit theorem the sum (1/√N) Σ_i J_i ξ_i is a Gaussian variable with mean mM and variance 1.
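The value ε_g(γ=0, m′) = (1-m′²)/2 quoted above is elementary: two independent ±1 outputs, each equal to +1 with probability (1+m′)/2, disagree with exactly that probability. A quick check (our own illustration, not from the paper):

```python
def disagree_prob(m_prime):
    """Probability that two independent outputs with bias m' differ."""
    p_plus = (1.0 + m_prime) / 2.0
    p_minus = (1.0 - m_prime) / 2.0
    return 2.0 * p_plus * p_minus   # outcomes (+,-) or (-,+)

for mp in (0.0, 0.5, 0.75):
    print(mp, disagree_prob(mp), (1.0 - mp ** 2) / 2.0)
```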
Hence equation (17) needs to be fulfilled if an output of the same sign as m′ is to be produced with probability (1+m′)/2. If the number of examples p becomes infinite in the thermodynamic limit, the number of outputs of the same sign as m′ tends to p times this probability. This implies that it is sufficient for the number of examples to scale with N^δ, 0 < δ < 1, so that the number of examples is infinite in the thermodynamic limit, for M to take on its saddle point value.

IV. SIMULATION ALGORITHM AND NUMERICAL RESULTS

Gibbs learning at T = 0 is a convenient tool for the analytical study of generalisation problems, since it characterizes the typical performance of a compatible student. It is however not completely straightforward to implement in numerical simulations. The necessary average over the compatible students cannot be performed directly because the version space is only an exponential fraction of the high-dimensional coupling space of the perceptron. Several methods to circumvent this problem have been suggested, including a random walk in the version space of the student [13], a billiard in version space [14], or a variant of the Adatron algorithm, where in each realisation a few randomly chosen patterns are learnt in addition to the examples [15]. Here we propose the randomized perceptron algorithm as a new method to effectively simulate Gibbs learning and apply it to the specific problem of biased patterns.

Starting with all couplings equal to zero, the randomized perceptron algorithm runs over all examples, leaving the couplings unchanged if the pattern is classified correctly by their present values. If a pattern is not classified correctly, the algorithm adds the standard Hebbian term as well as a random vector with components chosen independently from a Gaussian distribution. Since for large N the random vectors are all orthogonal to each other, the coupling vector will end up in version space even though the amplitudes of the Hebbian and the random terms are of comparable magnitude.
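The update rule just described can be sketched as follows. This is a minimal illustration under our own choices of amplitudes, system size, and stopping rule, not the paper's production setup:

```python
import math
import random

random.seed(1)

def randomized_perceptron(patterns, labels, noise_std=1.0, max_sweeps=500):
    """Sketch of the randomized perceptron algorithm described above.

    Starting from J = 0, sweep over the examples; on every misclassified
    pattern add the Hebbian term plus an independent Gaussian vector of
    comparable magnitude (parameter values are our own choices).
    """
    N = len(patterns[0])
    J = [0.0] * N
    for _ in range(max_sweeps):
        stable = True
        for row, sigma in zip(patterns, labels):
            if sigma * sum(Ji * x for Ji, x in zip(J, row)) <= 0:
                stable = False
                for i in range(N):
                    # Hebbian term plus Gaussian noise of comparable size
                    J[i] += (sigma * row[i] + random.gauss(0.0, noise_std)) / math.sqrt(N)
        if stable:
            break
    norm = math.sqrt(sum(Ji * Ji for Ji in J))   # rescale onto the sphere J^2 = N
    return [Ji * math.sqrt(N) / norm for Ji in J]

# Toy run with labels produced by a random teacher (realizable problem).
N, p = 50, 25
teacher = [random.gauss(0.0, 1.0) for _ in range(N)]
patterns = [[random.choice((-1, 1)) for _ in range(N)] for _ in range(p)]
labels = [1 if sum(t * x for t, x in zip(teacher, row)) > 0 else -1
          for row in patterns]

J = randomized_perceptron(patterns, labels)
frac_correct = sum(1 for row, s in zip(patterns, labels)
                   if s * sum(Ji * x for Ji, x in zip(J, row)) > 0) / p
print("fraction of examples classified correctly:", frac_correct)
```

Because each run ends at a random point of the version space, averaging the teacher-student overlap over many independent runs approximates the Gibbs average.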
This procedure slows down the convergence of the perceptron algorithm but leads to more reliable results for the generalisation error. The standard deviation of the noise term was 4.8γ (compared to the magnitude of the Hebbian term of 2γ), but no strong dependence on the standard deviation was observed on the interval 2γ to 10γ.

The simulations whose results are shown in Fig. 2 were performed with a system size of N = 200. Sets of Gaussian distributed inputs with magnetisation m were generated, and the components of the couplings of potential teacher perceptrons were chosen from a Gaussian distribution with zero mean. In this way each teacher perceptron was given a weak magnetisation. Teachers were generated until one of them produced an output magnetisation m′ on the given inputs. This teacher was used to generate the outputs used in the subsequent step: the resulting patterns were taught to the student using the randomized perceptron algorithm, and its overlap with the teacher and the generalisation error were evaluated.

Except at large values of the magnetisation m′, where finite size effects are more noticeable, the numerical results are in very good agreement with the analytical curves.

V. SUMMARY

In the present paper we have investigated the influence of a bias in the distribution of inputs and outputs on the cell structure in the phase space of a perceptron. To this end the multifractal spectrum f(α) was calculated analytically for different values of the storage ratio γ with the help of the methods already used for the case of unbiased patterns. Both the storage and the generalization properties can be read off from the behaviour of f(α).

For the storage problem we showed that the input bias has no influence on the storage capacity provided the output bias m′ is equal to zero. If both input and output bias are non-zero, the storage capacity γ_c increases with increasing output bias m′ irrespectively of the value of the input bias. Biased patterns have been considered before in the context of the phase space volume of attractor neural networks [2].
In this case it is natural to assume m = m′. Our results show that for the perceptron this case is in fact generic, as for non-zero m the properties of the entire cell spectrum only depend on m′. The case m′ ≠ 0 but m = 0 cannot be realized by a perceptron with weak magnetisation of the couplings without thresholds [16] and thus was not treated here. The behaviour of the maximum f_max of f(α) as a function of m′ generalizes the results about the number of dichotomies [1] to m′ ≠ 0.

For the generalization problem we found an interesting two-level scenario of learning. The student first determines the weak magnetization of the teacher couplings necessary to produce outputs of the required bias. This is accomplished after a non-extensive number of training examples already. The curves for the generalization error therefore start off at γ = 0 with values smaller than .5. In the second step the student then reduces the generalization error further in the usual way. The asymptotic behaviour is not modified by the bias of the patterns. The analytical results are in excellent agreement with numerical simulations. We have found a randomized variant of the perceptron algorithm to be a simple device for reliable simulations of Gibbs learning.

Biased patterns can be viewed as the simplest example of a hierarchy of inputs. It would be interesting to see whether the generalization strategy observed can also be found in the general case of hierarchically correlated patterns, in the sense that the student first learns the classes of patterns and only then the individual representatives.

Acknowledgments: Many thanks to Martin Weigt for stimulating discussions and critical reading of the manuscript. JB gratefully acknowledges financial support by the Studienstiftung des Deutschen Volkes.

[1] T. M. Cover, IEEE Trans. Electron. Comput. EC-14, 326 (1965)
[2] E. Gardner, J. Phys. A21, 257 (1988)
[3] B. Derrida, R. B. Griffith, and A. Prügel-Bennett, J. Phys. A24, 4907 (1991)
[4] M. Opper and W. Kinzel, in Models of Neural Networks III, E. Domany, J. L. van Hemmen, and K. Schulten (eds.) (Springer, New York, 1996)
[5] R. Monasson and D. O'Kane, Europhys. Lett. 27, 85 (1994)
[6] A. Engel and M. Weigt, Phys. Rev. E53, R2064 (1996)
[7] R. Monasson and R. Zecchina, Phys. Rev. Lett. 75, 2432 (1995)
[8] M. Biehl and M. Opper, in Neural Networks: The Statistical Mechanics Perspective, Jong-Houn Oh, Chulan Kwon, and Sungzoon Cho (eds.) (World Scientific, Singapore, 1995)
[9] S. Cocco, R. Monasson and R. Zecchina, Phys. Rev. E54, 717 (1996)
[10] G. J. Bex and C. van den Broeck, Phys. Rev. E56, 870 (1997)
[11] M. Weigt and A. Engel, Phys. Rev. E55, 4552 (1997)
[12] G. Györgyi and N. Tishby, in Workshop on Neural Networks and Spin Glasses, K. Theumann and W. K. Koeberle (eds.) (World Scientific, Singapore, 1990)
[13] H. S. Seung, H. Sompolinsky, and N. Tishby, Phys. Rev. A45, 6056 (1992)
[14] P. Ruján, Neural Computation 9, 99 (1997)
[15] T. Watkin, Europhys. Lett. 21, 871 (1993)
[16] M. Biehl and M. Opper, Phys. Rev. A44, 6888 (1991)
