Learning and Generalization Theories of Large Committee–Machines

Rémi Monasson (*) and Riccardo Zecchina (†)
(*) Laboratoire de Physique Théorique de l'ENS, 24 rue Lhomond, 75231 Paris cedex 05, France
(†) INFN and Dip. di Fisica, Politecnico di Torino, C.so Duca degli Abruzzi 24, I-10129 Torino, Italy

arXiv:cond-mat/9601122 v1, 25 Jan 1996
(December 1995)

Abstract

The study of the distribution of volumes associated to the internal representations of the learning examples allows us to derive the critical learning capacity α_c = (16/π)√(ln K) of large committee machines, to verify the stability of the solution in the limit of a large number K of hidden units, and to find a Bayesian generalization cross-over at α = K.

PACS Numbers: 05.20 - 64.60 - 87.10

I. INTRODUCTION

Following the approach presented in refs. [1,2], we derive the learning behaviour of non-overlapping committee machines with a large number K of hidden units. The scope of the paper is to clarify some of the analytical aspects of a method which is based on the internal degrees of freedom of MultiLayer Networks (MLN) and which requires a double analytic continuation. Such an approach, besides allowing for the derivation of new results both for the learning and the generalization behaviour of MLN, makes a rigorous bridge between different fields in the theory of neural computation, such as Information Theory, VC-dimension and Bayesian rule extraction, and statistical mechanics [1,11,9,10,12]. Moreover, it sheds new light on the role of the internal representations of the learning examples by relating it to the distribution of domains of solutions and pure states in the weight space of the network.

The method consists in a generalization of the well known Gardner approach [4]. While the latter studies the typical volume of couplings associated to the overall input–output map implemented by the network, here we consider the decomposition of such a volume into a macroscopic number of single volumes associated to all possible internal representations compatible with the learned examples. In addition to the interaction weights, we take as dynamical variables also the internal state variables of the MLN characterizing the internal representations.

For the storage problem, we are therefore interested in counting the typical number exp(N 𝒩_D) of volumes giving the dominant contribution to Gardner's volume and in comparing it with the total number exp(N 𝒩_R) of non-empty volumes. At the learning transition, i.e. when Gardner's total volume shrinks to zero and no more patterns can be learned without errors, we expect both entropies 𝒩_D and 𝒩_R to vanish. The vanishing condition on 𝒩_D and 𝒩_R thus gives an alternative indication of the storage performance of the studied network.
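To make the decomposition concrete, the following toy numerical sketch (not part of the paper) samples couplings of a very small tree committee machine, keeps those that learn a random training set, and groups them by the internal representation they induce on the training patterns. The counts of non-empty cells and of the dominant cells are finite-size stand-ins for exp(N 𝒩_R) and exp(N 𝒩_D); the sizes K, N, P and the sample count are arbitrary illustrative choices.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)

    K, N, P = 3, 12, 6          # toy sizes (illustrative choices, not from the paper)
    Nk = N // K                 # inputs per hidden unit (tree / non-overlapping architecture)

    xi = rng.choice([-1, 1], size=(P, K, Nk))   # binary unbiased patterns
    sigma = rng.choice([-1, 1], size=P)         # random target outputs

    def committee(J, xi):
        """Output sign(sum_l sign(J_l . xi_l)) and the internal representation {tau_l}."""
        tau = np.sign(np.einsum('ki,pki->pk', J, xi))
        return np.sign(tau.sum(axis=1)), tau

    cells = Counter()
    n_samples = 200_000
    for _ in range(n_samples):
        J = rng.normal(size=(K, Nk))            # sample from the normalised coupling measure
        out, tau = committee(J, xi)
        if np.all(out == sigma):                # J learns the whole training set
            cells[tuple(map(tuple, tau))] += 1  # label the cell by its internal representation

    total = sum(cells.values())
    print(f"Gardner volume (MC estimate): {total / n_samples:.4f}")
    print(f"non-empty cells found: {len(cells)}")     # crude proxy for exp(N * N_R)
    for rep, hits in cells.most_common(3):
        print(f"cell weight {hits / total:.3f}")      # the largest cells dominate exp(N * N_D)

The point of the exercise is only to make tangible the distinction between counting non-empty cells (𝒩_R) and weighting them by their volume (𝒩_D).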
The generalization properties of the network, i.e. the rule inference capability from a given set of deterministic input–output examples, also depend on the geometrical structure of the weight space. As we shall discuss in the sequel, the internal representation approach can be straightforwardly extended to the study of the generalization error of MLN in the Bayesian framework [12,10], allowing for a geometrical interpretation of the generalization transition together with a clarification of the role of the VC-dimension [9].

The method discussed here for the committee machine can be straightforwardly extended to other non-overlapping MLN with arbitrary decoder functions. For instance, one may show that the stability analysis of the RS solution α_c = ln K / ln 2 for the parity machine is an exact result in the limit K ≫ 1. Such a result coincides with the one derived in [5] following the standard Gardner [4] approach with one step of Replica Symmetry Breaking. Moreover, one may also reproduce the known results [10] on the parity machine generalization transition [1] by means of a detailed geometrical interpretation.

The paper is organized as follows. In Sec. II we outline the basic points of our approach, both for the learning and for the Bayesian generalization problems. In Sec. III and Sec. IV we study the entropies 𝒩_R and 𝒩_D in the K ≫ 1 limit and compute the closed expression for the critical capacity. The detailed analysis of the stability of the solution is given in Sec. V. Finally, in Sec. VI, we derive the entropy ℳ_D (for K ≫ 1) of the internal representations contributing to the Bayesian entropy. This allows us to analyze the generalization transition of the committee machine and to explain why the VC-dimension is not relevant for its typical generalization properties.

II. THE INTERNAL REPRESENTATION VOLUMES APPROACH

We consider tree-like committee machines composed of K non-overlapping perceptrons with real-valued weights J_{ℓi}, connected to K sets of independent inputs ξ_{ℓi} (ℓ = 1,...,K, i = 1,...,N/K). Committee machines are characterized by an output σ which is a binary function f({τ_ℓ}) = sign(Σ_ℓ τ_ℓ) of the cells τ_ℓ = sign(Σ_i J_{ℓi} ξ_{ℓi}) in the hidden layer. We refer to the set {τ_ℓ} as the internal representation of the input pattern {ξ_{ℓi}}. Given a macroscopic set of P = αN binary unbiased patterns (the training set), the learning problem consists in finding a suitable set of internal representations T = {τ_ℓ^µ} with a corresponding non-zero volume

    V_T = \int \prod_{\ell,i} dJ_{\ell i} \; \prod_{\mu} \theta\bigl(\sigma^\mu f(\{\tau_\ell^\mu\})\bigr) \; \prod_{\mu,\ell} \theta\Bigl(\tau_\ell^\mu \sum_i J_{\ell i}\,\xi_{\ell i}^\mu\Bigr) ,
    \qquad \int \prod_{\ell,i} dJ_{\ell i} = 1 ,    (1)

where θ(...) is the Heaviside function.

The total volume of the weight space available for learning, i.e. Gardner's total volume, is given by V_G = Σ_T V_T. We are interested in discussing the limit \overline{\ln V_G} → −∞, which defines the maximal possible size of the training set, that is the critical capacity of the model. The bar denotes the average over the patterns and their corresponding outputs [4] which, as usual, are drawn according to the binary unbiased distribution law.

As discussed in [1], the partition of V_G into connected components may be naturally obtained using the volumes V_T associated to the internal representations. This allows us to give a geometrical interpretation of the learning and Bayesian generalization processes in terms of the characteristics of the volumes dominating the overall distribution [1]. Following the standard statistical mechanics approach, we first compute

    g(r) \equiv -\frac{1}{Nr}\, \overline{\ln\Bigl(\sum_{T} V_T^{\,r}\Bigr)}    (2)

and next derive the entropy 𝒩(w) of the volumes V_T whose sizes are equal to w = (1/N) ln V_T. This can be done using the Legendre relations w_r = −∂(r g(r))/∂r and 𝒩(w_r) = −∂g(r)/∂(1/r). Differently from standard replica calculations, here we deal with two analytic continuations: we have r blocks of n replicas and, once the average over the quenched patterns has been performed for integer r and n, we perform an analytic continuation to real values of r and n.
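As a purely illustrative check of the Legendre relations above (not taken from the paper), one can equip a toy spectrum of cell sizes with an assumed parabolic entropy 𝒩(w) = a − b(w − w₀)², build the corresponding g(r) in closed form, and recover w_r and 𝒩(w_r) by finite differences; all parameter values below are arbitrary.

    import numpy as np

    # Toy spectrum of cell sizes (an illustrative assumption, not the committee-machine
    # free energy): cells of log-volume w occur with entropy N(w) = a - b*(w - w0)^2.
    a, b, w0 = 0.4, 2.0, -1.0

    def g(r):
        # g(r) = -(1/r) * max_w [ N(w) + r*w ]  for the toy spectrum above
        return -(a + r * w0 + r**2 / (4 * b)) / r

    def legendre(r, eps=1e-5):
        # w_r = -d(r g)/dr  and  N(w_r) = -dg/d(1/r), both by centred finite differences
        d_rg = ((r + eps) * g(r + eps) - (r - eps) * g(r - eps)) / (2 * eps)
        d_g_dinv = (g(1 / (1 / r + eps)) - g(1 / (1 / r - eps))) / (2 * eps)
        return -d_rg, -d_g_dinv

    for r in (0.25, 0.5, 1.0, 2.0):
        w_r, entropy = legendre(r)
        exact = a - b * (w_r - w0)**2
        print(f"r={r:4.2f}  w_r={w_r:+.4f}  N(w_r)={entropy:.4f}  exact={exact:.4f}")

The same mechanics, applied to the free–energy g(r) of eq. (4) below, is what produces the entropies 𝒩_D (from r = 1) and 𝒩_R (from r → 0) used in the following sections.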
Labeling blocks and replicas by ρ, λ and a, b respectively, the spin glass order parameters read

    q_\ell^{a\rho,b\lambda} = \frac{K}{N} \sum_i J_{i\ell}^{a\rho} J_{i\ell}^{b\lambda} .    (3)

They represent the typical overlaps between weight vectors incoming onto the same hidden unit ℓ (ℓ = 1,...,K) and belonging to blocks ρ, λ and replicas a, b. Associated to the q_ℓ^{aρ,bλ} there are also the conjugate Lagrange multipliers q̂_ℓ^{aρ,bλ}. Since the hidden units are equivalent, we assume that at the saddle point q_ℓ^{aρ,bλ} = q^{aρ,bλ} and q̂_ℓ^{aρ,bλ} = q̂^{aρ,bλ}, independently of ℓ. Then, within the replica symmetric (RS) Ansatz [6], we find

    g(r) = \mathop{\rm Extr}_{q,q_*} \Bigl\{ \frac{1-r}{2r}\ln(1-q_*) - \frac{1}{2r}\ln\bigl(1-q_*+r(q_*-q)\bigr) - \frac{q}{2\bigl(1-q_*+r(q_*-q)\bigr)}
        - \frac{\alpha}{r}\int \prod_{\ell=1}^{K} Dx_\ell \, \ln \mathcal{H}(\{x_\ell\}) \Bigr\} ,    (4)

where

    \mathcal{H}(\{x_\ell\}) = \mathop{\rm Tr}_{\{\tau_\ell\}} \prod_{\ell=1}^{K} \int Dy_\ell \left[ H\!\left( \frac{y_\ell\sqrt{q_*-q} + \tau_\ell x_\ell \sqrt{q}}{\sqrt{1-q_*}} \right) \right]^{r} .    (5)

Here, q_*(r) = q^{aρ,aλ} and q(r) = q^{aρ,bλ} are the typical overlaps between two weight vectors corresponding to the same (a, ρ ≠ λ) and to different (a ≠ b) internal representations respectively [4,2]. The Gaussian measure is denoted by Dx = dx e^{−x²/2}/√(2π), whereas the function H is defined as H(y) = ∫_y^∞ Dx. Since with no loss of generality the outputs σ^µ can be set equal to 1, in eqn. (4) the sum Tr_{{τ_ℓ}} runs only over the internal representations {τ_ℓ} giving a positive output f({τ_ℓ}) = +1.

As discussed in [1], when N → ∞, (1/N) ln(V_G) = −g(r=1) is dominated by volumes of size w_{r=1}, whose corresponding entropy (i.e. the logarithm of their number divided by N) is 𝒩_D = 𝒩(w_{r=1}). At the same time, the most numerous volumes are those of smaller size w_{r=0}, since in the limit r → 0 all the T are counted irrespectively of their relative volumes. Their corresponding entropy 𝒩_R = 𝒩(w_{r=0}) is the (normalized) logarithm of the total number of implementable internal representations. The quantities 𝒩_D and 𝒩_R are easily obtained from the RS free–energy (4) using the Legendre identities. In particular, q(r=1) is the usual saddle point overlap of the Gardner volume g(1) [4,5]. The vanishing condition for the entropies is related to the zero volume condition for V_G and thus to the storage capacity of the models.
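For small K, the kernel ℋ({x_ℓ}) of eq. (5) can be evaluated directly, which is a convenient reference point for the asymptotic manipulations used later. The sketch below (not from the paper) uses Gauss-Hermite quadrature for ∫Dy, the complementary error function for H, and brute-force enumeration of the internal representations with positive committee output; the parameter values are arbitrary.

    import itertools
    import numpy as np
    from numpy.polynomial.hermite_e import hermegauss   # nodes/weights for \int dy e^{-y^2/2} f(y)
    from scipy.special import erfc

    def Hfun(y):
        """H(y) = \int_y^infty Dx, the Gaussian tail function."""
        return 0.5 * erfc(y / np.sqrt(2.0))

    def calH(x, q, q_star, r, n_nodes=60):
        """Kernel of eq. (5): trace over internal representations {tau} with f({tau}) = +1 of
        prod_l \int Dy [ H((y sqrt(q_*-q) + tau_l x_l sqrt(q)) / sqrt(1-q_*)) ]^r."""
        y, w = hermegauss(n_nodes)
        w = w / np.sqrt(2.0 * np.pi)        # normalise the weights so that sum(w) = \int Dy = 1
        K = len(x)
        total = 0.0
        for tau in itertools.product((-1, 1), repeat=K):
            if sum(tau) <= 0:               # keep only representations with positive output
                continue
            term = 1.0
            for l in range(K):
                arg = (y * np.sqrt(q_star - q) + tau[l] * x[l] * np.sqrt(q)) / np.sqrt(1.0 - q_star)
                term *= np.dot(w, Hfun(arg) ** r)
            total += term
        return total

    rng = np.random.default_rng(1)
    x = rng.normal(size=3)                  # one draw of the outer Gaussian variables x_l (K = 3)
    print(calH(x, q=0.2, q_star=0.6, r=1.0))

For r = 1 the y-integral can be done in closed form, ∫Dy H((y√(q_*−q)+h)/√(1−q_*)) = H(h/√(1−q)), so the kernel reduces to the trace of H(τ_ℓ x_ℓ √(q/(1−q))) factors appearing in eq. (19) below, which gives a direct check of such a routine.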
In the above discussion, we have focused on the storage problem. However, the generalization properties also depend on the internal structure of the coupling space. Let us for instance consider the case of a learnable rule, defined by a teacher network. When a student with the same architecture is given more and more examples of the rule to infer, its version space [12] shrinks. In the perceptron case, the version space is simply connected and the typical generalization error made by the student on a new example goes to zero as its overlap with the teacher increases. The situation is much more involved in multilayer neural networks, since the presence of separated components of the version space makes the alignment of the student along the teacher direction more difficult.

The approach we have exposed above for the learning problem may be extended to acquire a better understanding of the generalization process in multilayer networks. We shall restrict ourselves to the Bayesian framework, where all teachers are sorted according to their a priori probabilities. The generalization properties are derived through the Bayesian entropy

    S_G = -\frac{1}{N} \sum_{\{\sigma^\mu\}} V_G \ln V_G ,    (6)

where the sum runs over all 2^P sets of possible outputs. If we now intend to look at the distribution of the sizes of the internal representation volumes V_T, we have to consider the generating free–energy [8]

    s(r) \equiv -\frac{1}{Nr} \sum_{\{\sigma^\mu\}} V_G \ln\Bigl(\sum_{T} V_T^{\,r}\Bigr) .    (7)

To compute s(r) with the replica method, we have to introduce 1 + nr replicas and send n → 0 at the end of the computation. The order parameters entering the computation are the overlaps p^{aρ} between the teacher and the nr students and the overlaps q^{aρ,bλ} between two different students. Within the RS Ansatz, we assume that p^{aρ} = p and q^{aρ,bλ} = q_* if a = b, and q otherwise. The result we obtained is

    s(r) = \mathop{\rm Extr}_{p,q,q_*} \Bigl\{ \frac{1-r}{2r}\ln(1-q_*) - \frac{1}{2r}\ln\bigl(1-q_*+r(q_*-q)\bigr) - \frac{q-p^2}{2\bigl(1-q_*+r(q_*-q)\bigr)}
        - \frac{2\alpha}{r}\int \prod_{\ell=1}^{K} Dx_\ell \Bigl[ \mathop{\rm Tr}_{\{\tau_\ell\}} \prod_{\ell} H\!\Bigl(\frac{\tau_\ell x_\ell\, p}{\sqrt{q-p^2}}\Bigr) \Bigr] \ln \mathcal{H}(\{x_\ell\}) \Bigr\} .    (8)

In the following, we shall focus on the logarithm (divided by N) of the number of internal representations contributing to S_G = −s(1), that is

    ℳ_D = \frac{\partial s}{\partial r}(r=1) .    (9)

It can be easily verified that p = q is always a saddle–point when r = 1. In the Bayesian framework, the typical overlap between two students is equal to the scalar product between the teacher and any student.

III. ANALYSIS OF 𝒩_R IN THE K ≫ 1 LIMIT

We first focus on the r → 0 case to compute 𝒩_R, the typical logarithm of the total number of internal representations. One can check on the saddle–point equations for q, q_* that the correct scalings of the order parameters are q = O(1) and q_* = 1 − O(r). In the following, we shall call µ = lim_{r→0} [r/(1−q_*)]. Upon keeping the leading terms in K, the trace over T in equation (5) becomes [5]

    H\!\left(\frac{Q_1}{\sqrt{1-Q_2}}\right) \prod_{\ell=1}^{K} A(x_\ell) ,    (10)

where

    Q_1 = \frac{1}{\sqrt{K}} \sum_{\ell=1}^{K} \frac{B(x_\ell)-B(-x_\ell)}{A(x_\ell)} ,    (11)

and

    Q_2 = \frac{1}{K} \sum_{\ell=1}^{K} \left(\frac{B(x_\ell)-B(-x_\ell)}{A(x_\ell)}\right)^{2} .    (12)

In the above expressions we have adopted the definitions

    A(x) \equiv 1 + \frac{\exp\bigl(-\frac{x^2\mu q}{2(1+\mu(1-q))}\bigr)}{\sqrt{1+\mu(1-q)}}    (13)

and

    B(x) \equiv H\!\Bigl(x\sqrt{\tfrac{q}{1-q}}\Bigr) + \frac{\exp\bigl(-\frac{x^2\mu q}{2(1+\mu(1-q))}\bigr)}{\sqrt{1+\mu(1-q)}}\; H\!\left(\frac{-x\sqrt{q/(1-q)}}{\sqrt{1+\mu(1-q)}}\right) .    (14)

In the K → ∞ limit, Q_1 becomes a Gaussian variable with zero mean and variance Q_2 = ∫Dx (B(x) − B(−x))²/A(x)². The free–energy for r → 0 then reads −G(q,µ)/r + O(ln r), where

    G(q,\mu) = \frac{1}{2}\ln\bigl(1+\mu(1-q)\bigr) + \frac{\mu q}{2(1+\mu(1-q))}
        + \alpha \int Dx \, \ln H\!\left(\frac{x\sqrt{Q_2}}{\sqrt{1-Q_2}}\right)
        + \alpha K \int Dx \, \ln\!\left( 1 + \frac{\exp\bigl(-\frac{x^2\mu q}{2(1+\mu(1-q))}\bigr)}{\sqrt{1+\mu(1-q)}} \right) + O\!\left(\frac{1}{\sqrt{K}}\right) ,    (15)

and the typical logarithm 𝒩_R of the total number of internal representations is simply the maximum of G(q,µ) over q and µ. Taking the scaling relation µ = mK² (which can be inferred from the equation ∂G/∂µ = 0) and q = O(1), one finds

    Q_2 = \frac{2}{\pi}\arcsin(q) .    (16)

Finally, defining q = 1 − ε and taking the saddle point equations with respect to m (which implies m = α²) and ε, one finds the following result:

    𝒩_R = \ln(K) - \frac{\pi^2\alpha^2}{256} + O(\ln\alpha) ,    (17)

which vanishes at

    α_R = \frac{16}{\pi}\sqrt{\ln(K)} .    (18)
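The following Monte Carlo snippet (not from the paper) checks eq. (16) under the reading that, for µ = mK² → ∞, the exponential corrections in A(x) and B(x) of eqs. (13)-(14) become negligible, so that (B(x) − B(−x))/A(x) reduces to 1 − 2H(x√(q/(1−q))) = erf(x√(q/(2(1−q)))); both the sample size and the values of q below are arbitrary.

    import numpy as np
    from scipy.special import erf

    rng = np.random.default_rng(2)

    def Q2_monte_carlo(q, n=2_000_000):
        """Monte Carlo estimate of Q2 = \int Dx (B(x)-B(-x))^2 / A(x)^2 in the regime where
        the exponential corrections of eqs. (13)-(14) are negligible, so that
        (B(x)-B(-x))/A(x) -> 1 - 2 H(x sqrt(q/(1-q))) = erf(x sqrt(q/(2(1-q))))."""
        x = rng.normal(size=n)
        return np.mean(erf(x * np.sqrt(q / (2 * (1 - q))))**2)

    for q in (0.3, 0.6, 0.9):
        print(f"q={q:.1f}  MC={Q2_monte_carlo(q):.4f}  (2/pi)*arcsin(q)={2/np.pi*np.arcsin(q):.4f}")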
IV. ANALYSIS OF 𝒩_D IN THE K ≫ 1 LIMIT

We shall now concentrate on the r → 1 case, which corresponds to the internal representations giving the dominant contribution to the Gardner volume V_G. The typical logarithm of the number of such internal representations is 𝒩_D. Before taking the limit K → ∞, the Legendre transform of expression (4) gives

    𝒩_D = \frac{1}{N}\ln V_G + \frac{2qq_* - q - q_*^2}{2(1-q)^2} - \frac{1}{2}\ln(1-q_*)
        - \alpha K \int \prod_{\ell} Dx_\ell \,
          \frac{ \mathop{\rm Tr}_{\{\tau_\ell\}} \prod_{\ell=2}^{K} H\!\bigl(\tau_\ell x_\ell\sqrt{\tfrac{q}{1-q}}\bigr)
                 \int Dy \, H\!\Bigl(\frac{y\sqrt{q_*-q}+x_1\tau_1\sqrt q}{\sqrt{1-q_*}}\Bigr)
                 \ln H\!\Bigl(\frac{y\sqrt{q_*-q}+x_1\tau_1\sqrt q}{\sqrt{1-q_*}}\Bigr) }
               { \mathop{\rm Tr}_{\{\tau_\ell\}} \prod_{\ell=1}^{K} H\!\bigl(\tau_\ell x_\ell\sqrt{\tfrac{q}{1-q}}\bigr) } ,    (19)

where ln V_G is the replica symmetric expression of the Gardner volume. When K is large, the trace over all allowed internal representations may be evaluated as in the previous section [5]. We find the following scalings for the order parameters:

    q ≃ 1 - \frac{128}{\pi^2\alpha^2} , \qquad
    Q_2 ≃ 1 - \frac{32}{\pi^2\alpha} , \qquad
    q_* ≃ 1 - \frac{\Gamma^2}{2\pi^2 K^2 \alpha^2} ,    (20)

where

    \frac{1}{\Gamma} = -\sqrt{\pi} \int_{-\infty}^{+\infty} du \, H(u)\ln H(u) ,    (21)

for large K and α. Therefore, the asymptotic expression of the entropy of the contributing internal representations is

    𝒩_D = \ln(K) - \frac{\pi^2\alpha^2}{256} + O(\ln\alpha) ,    (22)

which vanishes at

    α_D = \frac{16}{\pi}\sqrt{\ln(K)} .    (23)

For large K, α_D coincides with α_R. Reasonably, we expect the critical capacity α_c to be equal to these critical numbers and to scale as (16/π)√(ln K) too.

V. STABILITY ANALYSIS

In order to show that our RS calculation of 𝒩_R is asymptotically correct when the number of hidden units K is large, we have checked its local stability with respect to fluctuations of the order parameter matrices. Although this would require a complete analysis of the eigenvalues of the Hessian matrix, we have focused only on the replicons (011) and (122) in the notations of [7], which are usually the most "dangerous" modes [6]. For a free–energy functional F depending only on one order parameter matrix q^{aρ,bλ}, the corresponding eigenvalues, given by formula (41) in ref. [7], are

    \Lambda_{011} = \frac{\partial^2 F}{\partial q^{a\rho,b\lambda}\partial q^{a\rho,b\lambda}}
        - 2r \frac{\partial^2 F}{\partial q^{a\rho,b\lambda}\partial q^{a\rho,c\mu}}
        + 2(r-1) \frac{\partial^2 F}{\partial q^{a\rho,b\lambda}\partial q^{a\rho,b\mu}}
        + (r-1)^2 \frac{\partial^2 F}{\partial q^{a\rho,b\lambda}\partial q^{a\mu,b\nu}}
        - 2r(r-1) \frac{\partial^2 F}{\partial q^{a\rho,b\lambda}\partial q^{a\mu,c\nu}}
        + r^2 \frac{\partial^2 F}{\partial q^{a\rho,b\lambda}\partial q^{c\mu,d\nu}}    (24)

and

    \Lambda_{122} = \frac{\partial^2 F}{\partial q^{a\rho,a\lambda}\partial q^{a\rho,a\lambda}}
        - 2 \frac{\partial^2 F}{\partial q^{a\rho,a\lambda}\partial q^{a\rho,a\mu}}
        + \frac{\partial^2 F}{\partial q^{a\rho,a\lambda}\partial q^{a\mu,b\nu}} .    (25)

In our case, however, the free-energy depends upon the 2K matrices {Q_ℓ, Q̂_ℓ}. According to [4,5], the stability condition for each mode reads

    \Delta(\alpha,K) = -\frac{1}{K^2}\,\hat\Lambda\,\bigl(\Lambda + (K-1)\tilde\Lambda\bigr) < 0 ,    (26)

where Λ̂, Λ, Λ̃ are the eigenvalues computed for the fluctuations with respect to Q̂_ℓ Q̂_ℓ, Q_ℓ Q_ℓ and Q_ℓ Q_m (ℓ ≠ m) respectively. Since we are interested in the stability of the saddle–point giving 𝒩_R, we focus on the limit r → 0. In this case, the correct scalings of the order parameters are q_* = 1 − r/µ + O(r²) and q = O(1). For the (011) mode, we find

    \hat\Lambda_{011} = \frac{(1-q)^2 r^2}{K} ,
    \qquad
    \Lambda_{011} = \frac{\alpha\mu^2}{r^2} \int \prod_{\ell=1}^{K} Dx_\ell \left[ \frac{N_2}{D} - \mu\left(\frac{N_1}{D}\right)^{2} \right] ,
    \qquad
    \tilde\Lambda_{011} = \frac{\alpha\mu^4}{r^2} \int \prod_{\ell=1}^{K} Dx_\ell \left[ \frac{N_3}{D} - \frac{N_4}{D^2} \right]    (27)

at leading order when r ≪ 1. The quantities defined in (27) are

    D = \mathop{\rm Tr}_{\{\tau_\ell\}} \prod_{\ell=1}^{K} B(\tau_\ell x_\ell) ,    (28)

    N_1 = \mathop{\rm Tr}_{\{\tau_\ell\}} \Bigl[\prod_{\ell=2}^{K} B(\tau_\ell x_\ell)\Bigr]
          \frac{\exp\bigl(-\frac{x_1^2\mu q}{2X}\bigr)}{X^{3/2}}
          \Bigl(1 - \frac{\mu q x_1^2}{X}\Bigr)
          \times \left[ H\!\Bigl(-\frac{\tau_1 x_1\sqrt{q/(1-q)}}{\sqrt X}\Bigr)
                 - \frac{\mu\sqrt{q(1-q)}}{\sqrt{2\pi}\,\sqrt X}\,
                   \exp\!\Bigl(-\frac{x_1^2\mu q}{2X(1-q)}\Bigr) \right] ,

    N_2 = \mathop{\rm Tr}_{\{\tau_\ell\}} \Bigl[\prod_{\ell=2}^{K} B(\tau_\ell x_\ell)\Bigr] \frac{\sqrt{1-q}}{X}\, Y_1 ,

    N_3 = \mathop{\rm Tr}_{\{\tau_\ell\}} \Bigl[\prod_{\ell=3}^{K} B(\tau_\ell x_\ell)\Bigr] \frac{1-q}{X^2}\, Y_1 Y_2 ,

    N_4 = \Bigl\{ \mathop{\rm Tr}_{\{\tau_\ell\}} \Bigl[\prod_{\ell\neq 1} B(\tau_\ell x_\ell)\Bigr] \frac{\sqrt{1-q}}{X}\, Y_1 \Bigr\}
          \times \Bigl\{ \mathop{\rm Tr}_{\{\tau_\ell\}} \Bigl[\prod_{\ell\neq 2} B(\tau_\ell x_\ell)\Bigr] \frac{\sqrt{1-q}}{X}\, Y_2 \Bigr\} ,    (29)
