Statistical Mechanics of Online Learning for Ensemble Teachers

Seiji Miyoshi* and Masato Okada†

February 2, 2008

Abstract

We analyze the generalization performance of a student in a model composed of linear perceptrons: a true teacher, ensemble teachers, and the student. Calculating the generalization error of the student analytically using statistical mechanics in the framework of on-line learning, we prove that when the learning rate satisfies η < 1, the larger the number K and the variety of the ensemble teachers are, the smaller the generalization error is. On the other hand, when η > 1, the properties are completely reversed. If the variety of the ensemble teachers is rich enough, the direction cosine between the true teacher and the student becomes unity in the limit of η → 0 and K → ∞.

keywords: ensemble teachers, on-line learning, generalization error, statistical mechanics, learning rate

1 Introduction

Learning means inferring the underlying rules that dominate data generation from observed data. Observed data are input-output pairs from a teacher and are called examples. Learning can be roughly classified into batch learning and on-line learning [1]. In batch learning, the given examples are used repeatedly. In this paradigm, a student comes to give correct answers after training if it has adequate degrees of freedom. However, a long time and a large memory for storing many examples are required. In on-line learning, by contrast, examples used once are then discarded. In this case, a student cannot give correct answers for all the examples used in training. However, there are merits: a large memory for storing many examples is unnecessary, and it is possible to follow a time-variant teacher.

Recently, we [5, 6] analyzed the generalization performance of ensemble learning [2, 3, 4] in a framework of on-line learning using a statistical-mechanical method [1, 8].
Using the same method, we also analyzed the generalization performance of a student supervised by a moving teacher that goes around a true teacher [7]. As a result, it was proven that the generalization error of a student can be smaller than that of the moving teacher, even if the student only uses examples from the moving teacher. In an actual human society, a teacher observed by a student does not always present the correct answer. In many cases, the teacher is learning and continues to change. Therefore, the analysis of such a model is interesting for considering the analogies between statistical learning theories and an actual human society.

On the other hand, in most cases in an actual human society a student can observe examples from two or more teachers who differ from each other. Therefore, in this paper we analyze the generalization performance of such a model and discuss the use of imperfect teachers. That is, we consider a true teacher and K teachers, called ensemble teachers, who exist around the true teacher. A student uses input-output pairs from the ensemble teachers in turn or randomly. In this paper, we treat a model in which the true teacher, the ensemble teachers and the student are all linear perceptrons [5] with noise. We obtain the order parameters and generalization errors analytically in the framework of on-line learning using a statistical-mechanical method. As a result, it is proven that when the student's learning rate satisfies η < 1, the larger the number K and the variety of the ensemble teachers are, the smaller the student's generalization error is.

* Department of Electronic Engineering, Kobe City College of Technology, 8-3 Gakuen-higashimachi, Nishi-ku, Kobe-shi, 651-2194. E-mail address: [email protected]
† Division of Transdisciplinary Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba, 277-8561; RIKEN Brain Science Institute, 2-1 Hirosawa, Wako-shi, Saitama, 351-0198; JST PRESTO, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba, 277-8561
On the other hand, when η > 1, the properties are completely reversed. If the variety of the ensemble teachers is rich enough, the direction cosine between the true teacher and the student becomes unity in the limit of η → 0 and K → ∞.

2 Model

In this paper, we consider a true teacher, K ensemble teachers and a student. They are all linear perceptrons with connection weights A, B_k and J, respectively, where k = 1, ..., K. For simplicity, the connection weights of the true teacher, the ensemble teachers and the student are simply called the true teacher, the ensemble teachers and the student, respectively. True teacher A = (A_1, ..., A_N), ensemble teachers B_k = (B_{k1}, ..., B_{kN}), student J = (J_1, ..., J_N) and input x = (x_1, ..., x_N) are N-dimensional vectors. Each component A_i of A is drawn from \mathcal{N}(0,1) independently and fixed, where \mathcal{N}(0,1) denotes a Gaussian distribution with mean zero and unit variance. Some components B_{ki} are equal to A_i multiplied by -1; the others are equal to A_i. Which components B_{ki} equal -A_i is independent of the value of A_i. Hence, B_{ki} also obeys \mathcal{N}(0,1) and is likewise fixed. The direction cosine between B_k and A is R_{B_k}, and that between B_k and B_{k'} is q_{kk'}. Each component J_i^0 of the initial value J^0 of J is drawn from \mathcal{N}(0,1) independently. The direction cosine between J and A is R_J, and that between J and B_k is R_{B_kJ}. Each component x_i of x is drawn from \mathcal{N}(0,1/N) independently. Thus,

    ⟨A_i⟩ = 0,  ⟨(A_i)^2⟩ = 1,                                    (1)
    ⟨B_{ki}⟩ = 0,  ⟨(B_{ki})^2⟩ = 1,                              (2)
    ⟨J_i^0⟩ = 0,  ⟨(J_i^0)^2⟩ = 1,                                (3)
    ⟨x_i⟩ = 0,  ⟨(x_i)^2⟩ = 1/N,                                  (4)
    R_{B_k} = (A · B_k)/(‖A‖‖B_k‖),  q_{kk'} = (B_k · B_{k'})/(‖B_k‖‖B_{k'}‖),   (5)
    R_J = (A · J)/(‖A‖‖J‖),  R_{B_kJ} = (B_k · J)/(‖B_k‖‖J‖),     (6)

where ⟨·⟩ denotes a mean.

Figure 1 illustrates the relationship among true teacher A, ensemble teachers B_k, student J and the direction cosines q_{kk'}, R_{B_k}, R_J and R_{B_kJ}.

In this paper, the thermodynamic limit N → ∞ is also treated. Therefore,

    ‖A‖ = √N,  ‖B_k‖ = √N,  ‖J^0‖ = √N,  ‖x‖ = 1.                 (7)
Generally, the norm ‖J^m‖ of the student changes as the time step proceeds. Therefore, the ratio l^m of the norm to √N is introduced and called the length of the student. That is, ‖J^m‖ = l^m √N, where m denotes the time step.

The outputs of the true teacher, the ensemble teachers and the student are y^m + n_A^m, v_k^m + n_{B_k}^m and u^m l^m + n_J^m, respectively. Here,

    y^m = A · x^m,                                                (8)
    v_k^m = B_k · x^m,                                            (9)
    u^m l^m = J^m · x^m,                                          (10)
    n_A^m ~ \mathcal{N}(0, σ_A^2),                                (11)
    n_{B_k}^m ~ \mathcal{N}(0, σ_{B_k}^2),                        (12)
    n_J^m ~ \mathcal{N}(0, σ_J^2).                                (13)

Figure 1: True teacher A, ensemble teachers B_k and student J. q_{kk'}, R_J, R_{B_k} and R_{B_kJ} are direction cosines.

That is, the outputs of the true teacher, the ensemble teachers and the student include independent Gaussian noises with variances σ_A^2, σ_{B_k}^2 and σ_J^2, respectively. Then y^m, v_k^m and u^m of Eqs. (8)-(10) obey Gaussian distributions with mean zero and unit variance.

Let us define the error ε_{B_k} between true teacher A and each member B_k of the ensemble teachers by the squared error of their outputs:

    ε_{B_k}^m ≡ (1/2)(y^m + n_A^m - v_k^m - n_{B_k}^m)^2.         (14)

In the same manner, let us define the error ε_{B_kJ} between each member B_k of the ensemble teachers and student J by the squared error of their outputs:

    ε_{B_kJ}^m ≡ (1/2)(v_k^m + n_{B_k}^m - u^m l^m - n_J^m)^2.    (15)

Student J adopts the gradient method as a learning rule and, for each update, uses input x and the output of one of the K ensemble teachers B_k, chosen in turn or randomly. That is,

    J^{m+1} = J^m - η ∂ε_{B_kJ}^m/∂J^m                            (16)
            = J^m + η(v_k^m + n_{B_k}^m - u^m l^m - n_J^m) x^m,   (17)

where η denotes the learning rate of the student and is a constant. When the student uses the K ensemble teachers in turn, k = mod(m, K) + 1, where mod(m, K) denotes the remainder of m divided by K. In the random case, k is a uniform random integer taking one of the values 1, 2, ..., K. Generalizing the learning rules, Eq.
(17) can be expressed as

    J^{m+1} = J^m + f_k x^m                                       (18)
            = J^m + f(v_k^m + n_{B_k}^m, u^m l^m + n_J^m) x^m,    (19)

where f denotes a function that represents the update amount and is determined by the learning rule.

In addition, let us define the error ε_J between true teacher A and student J by the squared error of their outputs:

    ε_J^m ≡ (1/2)(y^m + n_A^m - u^m l^m - n_J^m)^2.               (20)

3 Theory

3.1 Generalization error

One purpose of a statistical learning theory is to obtain generalization errors theoretically. Since the generalization error is the mean of the errors for the true teacher over the distribution of new inputs and noises, the generalization errors ε_{B_kg} of each member B_k of the ensemble teachers and ε_{Jg} of student J are calculated as follows. Superscripts m, which represent the time steps, are omitted for simplicity unless stated otherwise:

    ε_{B_kg} = ∫ dx dn_A dn_{B_k} P(x, n_A, n_{B_k}) ε_{B_k}                          (21)
             = ∫ dy dv_k dn_A dn_{B_k} P(y, v_k, n_A, n_{B_k}) (1/2)(y + n_A - v_k - n_{B_k})^2   (22)
             = (1/2)(-2R_{B_k} + 2 + σ_A^2 + σ_{B_k}^2),                              (23)

    ε_{Jg} = ∫ dx dn_A dn_J P(x, n_A, n_J) ε_J                                        (24)
           = ∫ dy du dn_A dn_J P(y, u, n_A, n_J) (1/2)(y + n_A - ul - n_J)^2          (25)
           = (1/2)(-2R_J l + l^2 + 1 + σ_A^2 + σ_J^2).                                (26)

Here, the integrations have been executed using the following facts: y, v_k and u obey \mathcal{N}(0,1); the covariance between y and v_k is R_{B_k}, that between v_k and u is R_{B_kJ}, and that between y and u is R_J; and n_A, n_{B_k} and n_J are all independent of the other probabilistic variables.

3.2 Differential equations for order parameters and their analytical solutions

To simplify the analysis, the following auxiliary order parameters are introduced:

    r_J ≡ R_J l,                                                  (27)
    r_{B_kJ} ≡ R_{B_kJ} l.                                        (28)

Simultaneous differential equations in deterministic form [8], which describe the dynamical behaviors of the order parameters, have been obtained based on self-averaging in the thermodynamic limit as follows:

    dr_{B_kJ}/dt = (1/K) Σ_{k'=1}^{K} ⟨f_{k'} v_k⟩,               (29)
    dr_J/dt = (1/K) Σ_{k=1}^{K} ⟨f_k y⟩,                          (30)
    dl/dt = (1/K) Σ_{k=1}^{K} ( ⟨f_k u⟩ + ⟨f_k^2⟩/(2l) ).
(31)

Here, dimension N has been treated as sufficiently greater than the number K of ensemble teachers. Time is t = m/N, that is, time step m normalized by dimension N. Note that the above differential equations are identical whether the K ensemble teachers are used in turn or randomly.

Since linear perceptrons are treated in this paper, the sample averages that appear in the above equations can be calculated easily as follows:

    ⟨f_k u⟩ = η(r_{B_kJ}/l - l),                                  (32)
    ⟨f_k^2⟩ = η^2(l^2 - 2r_{B_kJ} + 1 + σ_{B_k}^2 + σ_J^2),       (33)
    ⟨f_k y⟩ = η(R_{B_k} - r_J),                                   (34)
    (1/K) Σ_{k'=1}^{K} ⟨f_{k'} v_k⟩ = η( -r_{B_kJ} + (1/K) Σ_{k'=1}^{K} q_{kk'} ).   (35)

Since all components A_i and J_i^0 of true teacher A and the initial student J^0 are drawn from \mathcal{N}(0,1) independently, and because the thermodynamic limit N → ∞ is also treated, they are orthogonal to each other in the initial state. That is,

    R_J = 0 when t = 0.                                           (36)

In addition,

    l = 1 when t = 0.                                             (37)

Using Eqs. (32)-(37), the simultaneous differential equations Eqs. (29)-(31) can be solved analytically as follows:

    r_{B_kJ} = (1/K) Σ_{k'=1}^{K} q_{kk'} (1 - e^{-ηt}),          (38)
    r_J = (1/K) Σ_{k=1}^{K} R_{B_k} (1 - e^{-ηt}),                (39)
    l^2 = (1/(2-η)) [2(1-η)q̄ + η(1 + σ̄_B^2 + σ_J^2)]
          + [1 - (1/(2-η))(η(1 + σ̄_B^2 + σ_J^2) - 2q̄)] e^{η(η-2)t}
          - 2q̄ e^{-ηt},                                          (40)

where

    q̄ = (1/K^2) Σ_{k=1}^{K} Σ_{k'=1}^{K} q_{kk'},                (41)
    σ̄_B^2 = (1/K) Σ_{k=1}^{K} σ_{B_k}^2.                         (42)

4 Results and Discussion

In this section, we treat the case where the direction cosines R_{B_k} between the ensemble teachers and the true teacher, the direction cosines q_{kk'} among the ensemble teachers, and the variances σ_{B_k}^2 of the noises of the ensemble teachers are uniform. That is,

    R_{B_k} = R_B,  k = 1, ..., K,                                (43)
    q_{kk'} = q for k ≠ k', and q_{kk'} = 1 for k = k',           (44)
    σ_{B_k}^2 = σ_B^2.                                            (45)

In this case, Eqs. (41) and (42) become

    q̄ = q + (1-q)/K,                                             (46)
    σ̄_B^2 = σ_B^2.                                               (47)

The dynamical behaviors of the generalization errors ε_{Jg} have been obtained analytically by solving Eqs.
(26), (27) and (38)-(47). Figure 2 shows the analytical results and the corresponding simulation results, where N = 2000. In the computer simulations, the K ensemble teachers are used in turn. ε_{Jg} was obtained by averaging the squared errors for 10^4 random inputs at each time step. The generalization error ε_{Bg} of one of the ensemble teachers is also shown. The dynamical behaviors of R_J and l are shown in Fig. 3.

Figure 2: Dynamical behaviors of generalization errors ε_{Jg} (curves for q = 1.00, 0.80, 0.60, 0.49, together with ε_{Bg} of an ensemble teacher). Theory and computer simulations. Conditions other than q are η = 0.3, K = 3, R_B = 0.7, σ_A^2 = 0.0, σ_B^2 = 0.1 and σ_J^2 = 0.2.

In these figures, the curves represent theoretical results and the dots represent simulation results. Conditions other than q are common: η = 0.3, K = 3, R_B = 0.7, σ_A^2 = 0.0, σ_B^2 = 0.1 and σ_J^2 = 0.2. Figure 2 shows that the smaller q is, that is, the richer the variety of the ensemble teachers is, the smaller the generalization error ε_{Jg} of the student is. Especially in the cases of q = 0.6 and q = 0.49, the generalization error of the student becomes smaller than that of a member of the ensemble teachers after t ≈ 5. This means that the student in this model can become cleverer than each member of the ensemble teachers, even though the student only uses input-output pairs from members of the ensemble teachers. Figure 3 shows that the larger the variety of the ensemble teachers is, the larger the direction cosine R_J is and the smaller the length l of the student is. The reason the minimum value 0.49 of q is taken as the squared value of R_B = 0.7 in Figs. 2 and 3 is described later.

Figure 3: Dynamical behaviors of R_J and l (curves for q = 1.00, 0.80, 0.60, 0.49). Theory and computer simulations.
Conditions other than q are η = 0.3, K = 3, R_B = 0.7, σ_A^2 = 0.0, σ_B^2 = 0.1 and σ_J^2 = 0.2.

In Figs. 2 and 3, ε_{Jg}, R_J and l seem almost to reach a steady state by t = 20. The macroscopic behaviors at t → ∞ can be understood theoretically since the order parameters have been obtained analytically. Focusing on the signs of the exponents of the exponential functions in Eqs. (38)-(40), we can see that ε_{Jg} and l diverge if η < 0 or η > 2. The steady-state values of r_{B_kJ}, r_J and l^2 in the case of 0 < η < 2 can be obtained easily by substituting t → ∞ in Eqs. (38)-(40):

    r_{B_kJ} → q + (1-q)/K,                                       (48)
    r_J → R_B,                                                    (49)
    l^2 → (1/(2-η)) ( 2(1-η)(q + (1-q)/K) + η(1 + σ_B^2 + σ_J^2) )          (50)
        = ( q + (1-q)/K ) + (η/(2-η)) ( (1-q)(K-1)/K + σ_B^2 + σ_J^2 ).     (51)

Equations (26), (27) and (48)-(51) show the following. In the case of η = 1, the steady value of length l is independent of the number K of teachers and the direction cosine q among the ensemble teachers. Therefore, the steady values of generalization error ε_{Jg} and direction cosine R_J are also independent of K and q in this case. In the case of 0 < η < 1, the smaller q is or the larger K is, the smaller the steady values of l and ε_{Jg} are, and the larger the steady value of R_J is. In the case of 1 < η < 2, on the contrary, the smaller q is or the larger K is, the larger the steady values of l and ε_{Jg} are, and the smaller the steady value of R_J is. That is, in the case of η < 1, the more teachers there are and the richer their variety is, the cleverer the student can become. On the contrary, in the case of η > 1, the number of teachers should be small and the variety of the teachers should be low for the student to become clever.

On the right-hand side of Eq. (51), since the second and third terms are positive, the steady value of l is larger than √q. In addition, since l → √q in the limit of η → 0 and K → ∞, Eqs. (27) and (49) show that R_J → R_B/√q.
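The steady-state relations (48)-(51), together with Eqs. (26), (27) and (46), can be evaluated directly. Below is a minimal numerical sketch; the helper name steady_state is ours, not from the paper, and the default R_B = 0.7 matches the conditions used in the figures.

```python
def steady_state(eta, K, q, s2_B, s2_J, s2_A=0.0, R_B=0.7):
    """Steady-state l, R_J and eps_Jg for 0 < eta < 2 under the uniform
    conditions of Eqs. (43)-(45)."""
    q_bar = q + (1.0 - q) / K                                 # Eq. (46)
    r_J = R_B                                                 # Eq. (49)
    l2 = q_bar + eta / (2.0 - eta) * (
        (1.0 - q) * (K - 1) / K + s2_B + s2_J)                # Eq. (51)
    l = l2 ** 0.5
    R_J = r_J / l                                             # Eq. (27): r_J = R_J * l
    eps_Jg = 0.5 * (-2.0 * r_J + l2 + 1.0 + s2_A + s2_J)      # Eq. (26) with R_J * l = r_J
    return l, R_J, eps_Jg

# At eta = 1 the steady length is independent of K and q (l^2 = 1 + s2_B + s2_J):
print(steady_state(1.0, 3, 0.49, 0.1, 0.2)[0], steady_state(1.0, 100, 0.80, 0.1, 0.2)[0])

# For eta < 1 a smaller q (richer variety) lowers eps_Jg; for eta > 1 it raises it:
print(steady_state(0.3, 3, 0.49, 0.1, 0.2)[2], steady_state(0.3, 3, 1.0, 0.1, 0.2)[2])
print(steady_state(1.7, 3, 0.49, 0.1, 0.2)[2], steady_state(1.7, 3, 1.0, 0.1, 0.2)[2])
```

The first line confirms that at η = 1 the steady length, and hence ε_{Jg} and R_J, do not depend on K or q; the other two lines show the reversal of the K- and q-dependence across η = 1.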
On the other hand, when S and T are generated independently under the condition that the direction cosines between S and P and between T and P are both R_0, where S, T and P are high-dimensional vectors, the direction cosine between S and T is q_0 = R_0^2, as shown in the Appendix. Therefore, if the ensemble teachers have enough variety that they have been generated independently under the condition that all direction cosines between the ensemble teachers and the true teacher are R_B, then q = R_B^2 and R_B/√q = 1, so the direction cosine R_J between the student and the true teacher approaches unity, regardless of the variances of the noises, in the limit of η → 0 and K → ∞.

Figures 4-7 show the relationships between learning rate η and ε_{Jg}, R_J. In Figs. 4 and 5, K = 3 is fixed. In Figs. 6 and 7, q = 0.49 is fixed. Conditions other than K and q are σ_A^2 = σ_B^2 = σ_J^2 = 0.0 and R_B = 0.7. Computer simulations have been executed using η = 0.3, 0.6, 1.0, 1.4 and 1.7. The values at t = 20 are plotted for the simulations and are considered to have already reached a steady state.

These figures show the following: the smaller learning rate η is, the smaller generalization error ε_{Jg} is and the larger direction cosine R_J is. Needless to say, when η is small, learning is slow. Therefore, residual generalization error and learning speed are in a tradeoff relationship. A phase transition, in which ε_{Jg} diverges and R_J becomes zero at η = 2, is also shown. In the case of η < 1, the larger K is or the smaller q is, that is, the richer the variety of the ensemble teachers is, the smaller ε_{Jg} is and the larger R_J is. On the contrary, the properties are completely reversed in the case of η > 1.

As described above, the learning properties change dramatically with learning rate η. It is difficult to explain the reason qualitatively. Here, we try to explain the reason intuitively by showing the geometrical meaning of η. Figures 8(a)-(c) show the updates for η = 0.5, η = 1 and η = 2, respectively. Here, the noises are ignored for simplicity.
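The η-dependence discussed above can also be reproduced by simulating the microscopic update rule, Eq. (17), directly. The following is a minimal sketch, not the authors' original simulation code: the sign-flip construction of the ensemble teachers follows Section 2, while the flip fraction p = 0.15, which gives R_B ≈ 0.7 and q ≈ 0.49 on average, is our own choice, and with finite N the results only approximate the N → ∞ theory.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, eta = 1000, 3, 0.3
steps = 20 * N                                   # run to t = m/N = 20

A = rng.standard_normal(N)                       # true teacher
# Ensemble teachers: flip the sign of a random subset of components of A
# (Section 2); flipping a fraction p gives R_B = 1 - 2p on average, and two
# independently flipped teachers have q = (1 - 2p)^2 on average.
p = 0.15                                         # R_B ~ 0.7, q ~ 0.49
B = np.array([np.where(rng.random(N) < p, -A, A) for _ in range(K)])
J = rng.standard_normal(N)                       # initial student, so l = 1

sigma_B, sigma_J = np.sqrt(0.1), np.sqrt(0.2)
for m in range(steps):
    x = rng.standard_normal(N) / np.sqrt(N)      # <x_i^2> = 1/N, Eq. (4)
    k = m % K                                    # teachers used in turn
    v = B[k] @ x + sigma_B * rng.standard_normal()   # noisy teacher output
    out = J @ x + sigma_J * rng.standard_normal()    # u*l + n_J
    J += eta * (v - out) * x                     # Eq. (17)

l = np.linalg.norm(J) / np.sqrt(N)
R_J = (A @ J) / (np.linalg.norm(A) * np.linalg.norm(J))
print(l, R_J)
```

For η = 0.3 the printed values should lie near the steady-state theory of Section 4 (l ≈ 0.88, R_J ≈ 0.80 for these parameters); raising η beyond 2 makes the same loop diverge, consistent with the discussion of Fig. 8.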
Needless to say, teacher B_k itself cannot be observed directly; only its output v_k can be observed when student J is updated. In addition, since the projections onto x^m from J^{m+1} and from B_k are equal in the case of η = 1, as shown in Fig. 8(b), η = 1 is the special condition under which the student uses up the information obtained from input x^m. In the case of η < 1, the update falls short. Since in a sense this helps balance the information from the ensemble teachers, the generalization error of the student improves when the number K of teachers is large and their variety is rich. On the other hand, the update is excessive when η > 1. The student is therefore shaken or swung about, and its generalization performance worsens when K is large and the variety is rich. In addition, the reason that learning diverges if η < 0 or η > 2 can be understood intuitively from Fig. 8: the distance ‖(η-1)(v_k^m - u^m l^m) x^m‖ between student J^{m+1} after the update and teacher B_k, measured by the projections onto x^m, is larger than the distance ‖(v_k^m - u^m l^m) x^m‖ between student J^m before the update and teacher B_k when η < 0 or η > 2. Therefore, the learning diverges.

Figure 4: Steady value of generalization error ε_{Jg} in the case of K = 3 (curves for q = 1.00, 0.80, 0.60, 0.49). Theory and computer simulations. Conditions other than K and q are σ_A^2 = σ_B^2 = σ_J^2 = 0.0 and R_B = 0.7.

Figure 5: Steady value of direction cosine R_J in the case of K = 3 (curves for q = 1.00, 0.80, 0.60, 0.49). Theory and computer simulations. Conditions other than K and q are σ_A^2 = σ_B^2 = σ_J^2 = 0.0 and R_B = 0.7.

Figure 6: Steady value of generalization error ε_{Jg} in the case of q = 0.49 (curves for K = 1, 2, 3, 10, 100). Theory and computer simulations. Conditions other than K and q are σ_A^2 = σ_B^2 = σ_J^2 = 0.0 and R_B = 0.7.

Figure 7: Steady value of direction cosine R_J in the case of q = 0.49 (curves for K = 1, 2, 3, 10, 100). Theory and computer simulations. Conditions other than K and q are σ_A^2 = σ_B^2 = σ_J^2 = 0.0 and R_B = 0.7.

Figure 8: Geometric meaning of learning rate η.

5 Conclusion

We analyzed the generalization performance of a student in a model composed of linear perceptrons: a true teacher, ensemble teachers, and the student. The generalization error of the student was calculated analytically using statistical mechanics in the framework of on-line learning, proving that when learning rate η < 1, the larger the number K and the variety of the ensemble teachers are, the smaller the generalization error is. On the other hand, when η > 1, the properties are completely reversed. If the variety of the ensemble teachers is rich enough, the direction cosine between the true teacher and the student becomes unity in the limit of η → 0 and K → ∞.

Acknowledgments

This research was partially supported by the Ministry of Education, Culture, Sports, Science and Technology of Japan, with Grants-in-Aid for Scientific Research 14084212, 15500151 and 16500093.

A Direction cosine q_0 among ensemble teachers

Let us consider the case where S and T are generated independently, satisfying the condition that the direction cosines between S and P and between T and P are both R_0, as shown in Fig. 9, where S, T and P are N-dimensional vectors. In this figure, the inner product of s and t is

    s · t = ( S - ‖S‖ R_0 P/‖P‖ ) · ( T - ‖T‖ R_0 P/‖P‖ )         (52)
          = ‖S‖‖T‖ ( q_0 - R_0^2 ),                               (53)

where s and t are the projections of S and T onto the orthogonal complement C of P, and q_0 denotes the direction cosine between S and T.

Incidentally, if dimension N is large and S and T have been generated independently, s and t should be orthogonal to each other. Therefore, q_0 = R_0^2.
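The appendix argument is easy to check numerically. In the sketch below (the construction is ours), S and T are drawn as unit vectors whose direction cosine to a common random vector P is exactly R_0, with independent components in the orthogonal complement of P; their mutual direction cosine is then measured.

```python
import numpy as np

rng = np.random.default_rng(1)
N, R0 = 100000, 0.7
P = rng.standard_normal(N)
P_hat = P / np.linalg.norm(P)

def unit_vector_with_cosine(R0):
    """Unit vector whose direction cosine with P is exactly R0."""
    w = rng.standard_normal(N)
    w -= (w @ P_hat) * P_hat      # keep only the part orthogonal to P
    w /= np.linalg.norm(w)
    return R0 * P_hat + np.sqrt(1.0 - R0**2) * w

S = unit_vector_with_cosine(R0)
T = unit_vector_with_cosine(R0)
q0 = S @ T                        # direction cosine, since S and T are unit vectors
print(q0)                         # close to R0**2 = 0.49 for large N
```

For large N the measured q_0 concentrates around R_0^2 = 0.49, in agreement with Eq. (53) and the near-orthogonality of the independent components s and t.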