Journal of the Physical Society of Japan

Statistical-mechanical analysis of pre-training and fine tuning in deep learning

Masayuki Ohzeki∗

Department of Systems Science, Graduate School of Informatics, Kyoto University, 36-1 Yoshida Honmachi, Sakyo-ku, Kyoto 606-8501, Japan

In this paper, we present a statistical-mechanical analysis of deep learning. We elucidate some of the essential components of deep learning: pre-training by unsupervised learning and fine tuning by supervised learning. We formulate the extraction of features from the training data as a margin criterion in a high-dimensional feature-vector space. The self-organized classifier is then supplied with small amounts of labelled data, as in deep learning. Although we employ a simple single-layer perceptron model, rather than directly analyzing a multi-layer neural network, we find a nontrivial phase transition in the generalization error of the resultant classifier that depends on the number of unlabelled data. In this sense, we evaluate the efficacy of the unsupervised learning component of deep learning. The analysis is performed by the replica method, which is a sophisticated tool in statistical mechanics. We validate our result in the manner of deep learning, using a simple iterative algorithm to learn the weight vector on the basis of belief propagation.

∗[email protected]

1. Introduction

Deep learning is a promising technique in the field of machine learning, with its outstanding performance in pattern recognition applications, in particular, being extensively reported. The aim of deep learning is to efficiently extract important structural information directly from the training data to produce a high-precision classifier.1) The technique essentially consists of three parts. First, a large number of hidden units are introduced by constructing a multi-layer neural network, known as a deep neural network (DNN). This allows the implementation of an iterative coarse-grained procedure, whereby each high-level layer of the neural network extracts abstract information from the input data. In other words, we introduce some redundancy for feature extraction and dimensional reduction (a kind of sparse representation) of the given data. The second part is pre-training by unsupervised learning. This is a kind of self-organization.2) To accomplish self-organization in the DNN, we provide plenty of unlabelled data. The network learns the structure of the input data by tuning the weight vectors (often termed the network parameters) assigned to each layer of the neural network. The procedure of updating each weight vector on the basis of the gradient method, i.e., back propagation, takes a relatively long time,3) even when combined with regularization by the $L_1$ norm and greedy algorithms.4–6) This is because many local minima are encountered during the optimization of the DNN. Instead, techniques such as the auto-encoder have been proposed to make the pre-training more efficient and to steer the optimization toward basins of attraction of minima with better generalization from the training data.7–9) The third component of deep learning involves fine tuning the weight vectors using supervised learning to elaborate the DNN into a highly precise classifier. This combination of unsupervised and supervised learning enables the architecture of deep learning to obtain better generalization, effectively improving the classification under a semi-supervised learning approach.10,11)

In the present study, we focus on the latter two parts of deep learning. The first part is set aside because it mainly concerns how the deep learning algorithm is implemented.
A recent study has formulated a theoretical basis for the relationship between the recursive manipulation of variational renormalization groups and the multi-layer neural network in deep learning.12) Indeed, it has been confirmed that the renormalization group can mitigate the computational cost of learning without any significant degradation.13) Furthermore, the direct evaluation of multi-layer neural networks is too complex to be helpful at this early stage of our theoretical understanding of deep learning. Although most DNNs are constructed from Boltzmann machines with hidden units, we simplify the DNN to a basic perceptron. This simplification, which is made purely for our analysis, enables us to shed light on the fundamental origin of the outstanding performance of deep learning and the efficiency of pre-training by unsupervised learning.

The steady performance of the classifier constructed by the deep learning algorithm can be assessed in terms of the generalization error using a statistical-mechanical analysis based on the replica method.14) We consequently find nontrivial behaviour involving the emergence of a metastable state in the generalization error, a result of the combination of unsupervised and supervised learning. This is analogous to the metastable state in classical spin models, which leads to the hysteresis effect in magnetic fields. Following the actual process of deep learning, we numerically test our result by successively implementing the unsupervised learning of the pre-training procedure and the supervised learning for fine tuning. We then demonstrate the effect of being trapped in the metastable state, which worsens the generalization error. This justifies the need for fine tuning by several sets of labelled data after the pre-training stage of deep learning.

The remainder of this paper is organized as follows. In the next section, we formulate our simplified model to represent unsupervised and supervised learning with structured data, and analyze the Bayesian inference process for the weight vectors. In Section 3, we investigate the nontrivial behaviour of the generalization error in our model. We demonstrate that the generalization error can be significantly improved by the use of sufficient amounts of unlabelled data. Finally, in Section 4, we summarize the present work.

2. Analysis of combination of unsupervised and supervised learning

2.1 Problem setting

We deal with a simple two-class labelled-unlabelled classification problem. We assume that the N-dimensional feature vectors $x_\mu \in \mathbb{R}^N$ obey the following distribution function conditioned on the binary label $y_\mu = \pm 1$ for each datum µ and a predetermined weight vector $w_0$:

P_g(x_\mu | y_\mu, w_0) \propto \Theta\!\left( \frac{y_\mu}{\sqrt{N}} x_\mu^{\rm T} w_0 - g \right),   (1)

where g is a margin, which mimics the structure of the feature vectors of the given data, and

\Theta(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}.   (2)

The labelled data $(x_\mu, y_\mu)$ ($\mu = 1, 2, \cdots, L$) are generated from the joint probability $P_g(x_\mu | y_\mu, w_0) P(y_\mu)$, where L is the number of labelled data. The unlabelled data $(x_\mu)$ ($\mu = L+1, L+2, \cdots, L+U$), where U is the number of unlabelled data, follow the marginal probability $P_g(x_\mu | w_0) = \sum_{y_\mu} P_g(x_\mu | y_\mu, w_0) P(y_\mu)$. In the following, we assume the large-N limit and a huge number of data, $L, U \sim O(N)$, as well as a symmetric distribution for the label, $P(y_\mu) = 1/2$.
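As a concrete illustration of the data model in Eq. (1), the following Python sketch generates a synthetic dataset from a teacher vector on the sphere $|w_0|^2 = N$. The i.i.d. standard-normal base measure for the components of $x_\mu$ and the rejection-sampling strategy are assumed here for concreteness, since Eq. (1) specifies only the step-function factor; the variable names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def sample_feature(w0, y, g, rng):
    # Rejection sampling of x with P_g(x | y, w0) proportional to Theta((y/sqrt(N)) x.w0 - g),
    # assuming an i.i.d. standard-normal base measure for the components of x.
    N = w0.size
    while True:
        x = rng.standard_normal(N)
        if y * (x @ w0) / np.sqrt(N) > g:
            return x

N, L, U, g = 100, 50, 500, 0.1
w0 = rng.standard_normal(N)
w0 *= np.sqrt(N) / np.linalg.norm(w0)          # spherical constraint |w0|^2 = N

labels = rng.choice([-1, 1], size=L + U)        # symmetric label distribution, P(y) = 1/2
X = np.array([sample_feature(w0, y, g, rng) for y in labels])
y_labelled = labels[:L]                         # only the first L labels are kept;
                                                # the remaining U data are treated as unlabelled

For small margins the rejection rate stays moderate, since each draw is accepted with probability $H(g) \approx 1/2$ when $|w_0|^2 = N$.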
The likelihood function for the dataset is defined as

P_g(\mathcal{D} | w_0) = \prod_{\mu=1}^{L} P_g(x_\mu | y_\mu, w_0) P(y_\mu) \prod_{\mu=L+1}^{L+U} P_g(x_\mu | w_0),   (3)

where $\mathcal{D}$ denotes the dataset consisting of labelled data and unlabelled data. When the margin g is zero, unsupervised learning is no longer meaningful, because the marginal distribution becomes flat. However, nonzero values of the margin elucidate the structure of the feature vectors through the unsupervised learning. Actual data such as images and sounds have many inherent structures that must be represented by the high-dimensional weight vectors in the multi-layer neural networks of a DNN. In the present study, we simplify this aspect of the actual data to give an artificial model with a margin that follows the simple perceptron. This allows us to assess certain nontrivial aspects of deep learning.

2.2 Bayesian inference and replica method

For readers unfamiliar with deep learning, we sketch the deep learning procedure here. The first step of the deep learning algorithm is to conduct pre-training. Through this unsupervised learning, the weight vector learns the features of the training data without any labels. As a simple strategy, we often estimate the weight vector by maximizing the likelihood function of the unlabelled data only, as

w^{\rm PT} = \mathop{\rm argmax}_{w} \log \prod_{\mu=L+1}^{L+U} P_h(x_\mu | w).   (4)

We use a margin value h that differs from the one in Eq. (3) in order to evaluate a generic case below. When the structure of the data is known a priori, one may set g = h. We may utilize the hidden units to prepare some redundancy for representing the features of the given data; in the present study, we omit this aspect to simplify the following analysis. In other words, we coarse-grain the DNN into a single layer with a weight vector w, input $x_\mu$, and output $y_\mu$. In the second step, termed the fine tuning step, we estimate the weight vector so as to precisely classify the training data. For instance, maximum likelihood estimation is one candidate for estimating the weight vector, as

w^{\rm FT} = \mathop{\rm argmax}_{w} \log \left\{ \prod_{\mu=1}^{L} P_h(x_\mu | y_\mu, w) P(y_\mu) \prod_{\mu=L+1}^{L+U} P_h(x_\mu | w) \right\}.   (5)

We note an important feature of the deep learning architecture: in this procedure, we use the result of the pre-training, $w^{\rm PT}$, as the initial condition for the gradient method used to obtain $w^{\rm FT}$. The purpose of deep learning is simply to obtain a weight vector that classifies newly generated data with good performance, by some strategy such as Eq. (5). The computational cost of the commonly employed methods (e.g., back propagation3)) is, in general, extremely high. However, if we have an adequate initial condition for the estimation, we can mitigate the wasteful computation and reach a better estimate of the weight vector.8,9)

In order to evaluate the theoretical limitation of deep learning, instead of the maximum likelihood estimation, we employ an optimal procedure based on the framework of Bayesian inference. The posterior distribution is given by Bayes' formula as

P_h(w | \mathcal{D}) = \frac{P_h(\mathcal{D} | w) P(w)}{\int dw' \, P_h(\mathcal{D} | w') P(w')}.   (6)

We assume that the prior distribution for the weight vector is $P(w) \propto \delta(|w|^2 - N)$. The posterior mean given by this posterior distribution provides an estimator for any quantity related to the weight vector:

E_{w|\mathcal{D}}[f(w)] = \int dw \, f(w) \, \frac{P_h(\mathcal{D} | w) P(w)}{\int dw' \, P_h(\mathcal{D} | w') P(w')}.   (7)

The typical value is evaluated by averaging over the randomness of the dataset $\mathcal{D}$ as

E_{\mathcal{D}}[E_{w|\mathcal{D}}[g(w)]] = \left[ \int dw \, g(w) \, \frac{P_h(\mathcal{D} | w) P(w)}{\int dw' \, P_h(\mathcal{D} | w') P(w')} \right]_{\mathcal{D}},   (8)

where

[\cdots]_{\mathcal{D}} = \int d\mathcal{D} \, dw_0 \, P_g(\mathcal{D} | w_0) P(w_0) \times \cdots.   (9)
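To make the two-stage strategy of Eqs. (4) and (5) concrete, the following sketch continues the data-generation example above and carries out both maximizations numerically. Because the hard step function Θ makes the likelihood non-differentiable, a logistic surrogate with temperature tau is substituted for Θ; this surrogate, the unconstrained optimization followed by projection onto the sphere $|w|^2 = N$, and the convention that a label value of 0 marks an unlabelled datum are simplifications made only for this illustration, not part of the model above.

from scipy.optimize import minimize
from scipy.special import expit

def neg_log_lik(w, X, y, h, tau=0.1):
    # Negative log-likelihood with margin h; y[mu] = 0 marks an unlabelled datum.
    # expit(t/tau) is a smooth surrogate for the hard step Theta(t) of Eq. (2).
    u = X @ w / np.sqrt(X.shape[1])
    lab = y != 0
    ll = np.sum(np.log(expit((y[lab] * u[lab] - h) / tau) + 1e-12))
    ll += np.sum(np.log(0.5 * expit((u[~lab] - h) / tau)
                        + 0.5 * expit((-u[~lab] - h) / tau) + 1e-12))
    return -ll

def project_sphere(w):
    return w * np.sqrt(w.size) / np.linalg.norm(w)

h = g                                            # matched margins, g = h

# pre-training, Eq. (4): only the U unlabelled data enter the likelihood
w_pt = minimize(neg_log_lik, rng.standard_normal(N),
                args=(X[L:], np.zeros(U), h)).x
w_pt = project_sphere(w_pt)

# fine tuning, Eq. (5): the full likelihood, started from the pre-trained weights
y_all = np.concatenate([y_labelled, np.zeros(U)])
w_ft = minimize(neg_log_lik, w_pt, args=(X, y_all, h)).x
w_ft = project_sphere(w_ft)

The essential architectural point appears in the last step: w_pt serves only as the starting point of the fine-tuning optimization, which is precisely the role that pre-training plays in Eqs. (4) and (5).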
The average quantity is given by the derivative of the characteristic function, namely the free energy, which is defined as

-\mathcal{F} = \lim_{N \to \infty} \frac{1}{N} \left[ \log \int dw \, P_h(\mathcal{D} | w) P(w) \right]_{\mathcal{D}}.   (10)

In particular, as shown below, the derivatives of the free energy yield a set of self-consistent equations for the physically relevant quantities. In this problem, we compute the overlap between the estimated weight vector w and the original weight vector $w_0$, as well as the variance of the weight vector, which quantify the precision of the learning. Following spin glass theory,14) we apply the replica method to evaluate the free energy. We define the replicated partition function as

\Xi_n = \left( \int dw \, P_h(\mathcal{D} | w) P(w) \right)^n.   (11)

The (density of the) free energy can be calculated from the replicated partition function through the replica method as

-\mathcal{F} = \lim_{n \to 0} \frac{\partial}{\partial n} \lim_{N \to \infty} \frac{1}{N} \log [\Xi_n]_{\mathcal{D}}.   (12)

We exchange the order of the operations on n and the thermodynamic limit $N \to \infty$, and assume that the replica number n is temporarily a natural number in the evaluation of $[\Xi_n]_{\mathcal{D}}$. We introduce the following constraints, which depend on $w_a$, to simplify the calculation:

\int dQ \prod_{a \ge b} \delta\!\left( Q_{ab} - \frac{1}{N} w_a^{\rm T} w_b \right) \prod_{a=1}^{n} \delta\!\left( Q_{0a} - \frac{1}{N} w_0^{\rm T} w_a \right).   (13)

The free energy is then given by solving an extremization problem:

-\mathcal{F} = \sup_{Q} \left[ \mathcal{G}(Q) - \mathcal{I}(Q) \right],   (14)

where

\mathcal{G}(Q) = \alpha \log \left[ \Theta(u_0 - g) \prod_{a=1}^{n} \Theta(u_a - h) \right]_u + \beta \log \left[ \Phi(u_0, g) \prod_{a=1}^{n} \Phi(u_a, h) \right]_u,   (15)

\mathcal{I}(Q) = \sup_{\tilde{Q}} \left\{ \sum_{a \ge b} Q_{ab} \tilde{Q}_{ab} + \sum_{a=1}^{n} Q_{0a} \tilde{Q}_{0a} - \log \mathcal{M}(\tilde{Q}) \right\},   (16)

\mathcal{M}(\tilde{Q}) = E_w \left[ \exp\left( \sum_{a \ge b} \tilde{Q}_{ab} w_a^{\rm T} w_b + \sum_{a=1}^{n} \tilde{Q}_{0a} w_0^{\rm T} w_a \right) \right].   (17)

Here, $\alpha = L/N$, $\beta = U/N$, and

\Phi(u, h) = \frac{1}{2} \Theta(u - h) + \frac{1}{2} \Theta(-u - h).   (18)

The expectation in Eq. (17) is taken over the distribution $\prod_{a=0}^{n} P(w_a)$. We introduce the auxiliary parameters $\tilde{Q}_{ab}$ to give an integral representation of the delta functions. We use $[\cdots]_u$ to denote the average with respect to the (n+1)-dimensional multivariate Gaussian random variables $\{u_a\}$ with vanishing mean and covariance $[u_a u_b]_u = \delta_{ab} + Q_{ab}(1 - \delta_{ab})$.

2.3 Replica-symmetric solution

Let us evaluate the replica-symmetric (RS) solution by imposing invariance of $Q_{ab}$ and $\tilde{Q}_{ab}$ under permutation of the replica indices as

Q_{aa} = 1, \quad Q_{ab} = q, \quad Q_{0a} = m, \qquad \tilde{Q}_{aa} = \tilde{Q}, \quad \tilde{Q}_{ab} = \tilde{q}, \quad \tilde{Q}_{0a} = \tilde{m}.   (19)

Then, the Gaussian random variables can be written as $u_a = \sqrt{q}\, z + \sqrt{1-q}\, t_a$ for $a > 0$ and $u_0 = \sqrt{m^2/q}\, z + \sqrt{1 - m^2/q}\, t_0$, using the auxiliary normal Gaussian random variables $\{t_a\}$ and z with vanishing mean and unit variance. Under the RS assumption, we obtain an explicit form for the free energy by solving the saddle-point equations for $\tilde{Q}$, $\tilde{q}$, and $\tilde{m}$:

-\mathcal{F} = \alpha \int Dz \, H\!\left( \frac{mz + \sqrt{q} g}{\sqrt{q - m^2}} \right) \log H\!\left( \frac{\sqrt{q} z + h}{\sqrt{1 - q}} \right) + \beta \int Dz \, G_g(m, \sqrt{q}) \log G_h(\sqrt{q}, 1) + \frac{1}{2} \log(1 - q) + \frac{q - m^2}{2(1 - q)},   (20)

where $Dz = dz \exp(-z^2/2)/\sqrt{2\pi}$, $H(x) = \int_x^{\infty} Dt$, and

G_h(a, b) = \frac{1}{2} \left\{ H\!\left( \frac{az + bh}{\sqrt{b^2 - a^2}} \right) + H\!\left( \frac{bh - az}{\sqrt{b^2 - a^2}} \right) \right\}.   (21)

The partial derivatives of the free energy (20) with respect to m and q lead to the saddle-point equations for the physically relevant RS order parameters, namely the overlap m and the parameter q characterizing the variance of the weight vector:

\alpha \int Dz \, H'\!\left( \frac{mz + \sqrt{q} g}{\sqrt{q - m^2}} \right) \frac{H'\!\left( \frac{\sqrt{q} z + h}{\sqrt{1 - q}} \right)}{H\!\left( \frac{\sqrt{q} z + h}{\sqrt{1 - q}} \right)} + \beta \int Dz \, G'_g(m, \sqrt{q}) \frac{G'_h(\sqrt{q}, 1)}{G_h(\sqrt{q}, 1)} = \frac{m}{1 - q},   (22)

\alpha \int Dz \, H\!\left( \frac{mz + \sqrt{q} g}{\sqrt{q - m^2}} \right) \left\{ \frac{H'\!\left( \frac{\sqrt{q} z + h}{\sqrt{1 - q}} \right)}{H\!\left( \frac{\sqrt{q} z + h}{\sqrt{1 - q}} \right)} \right\}^2 + \beta \int Dz \, G_g(m, \sqrt{q}) \left\{ \frac{G'_h(\sqrt{q}, 1)}{G_h(\sqrt{q}, 1)} \right\}^2 = \frac{q - m^2}{(1 - q)^2},   (23)

where $H'(x) = -\exp(-x^2/2)/\sqrt{2\pi}$ and

G'_h(a, b) = \frac{1}{2} \left\{ H'\!\left( \frac{az + bh}{\sqrt{b^2 - a^2}} \right) - H'\!\left( \frac{bh - az}{\sqrt{b^2 - a^2}} \right) \right\}.   (24)

The RS solution always satisfies q = m under the condition g = h (the Bayes-optimal solution). The above saddle-point equations are then reduced to the following single equation for q:

\alpha \int Dz \, \frac{\left\{ H'\!\left( \frac{\sqrt{q} z + h}{\sqrt{1 - q}} \right) \right\}^2}{H\!\left( \frac{\sqrt{q} z + h}{\sqrt{1 - q}} \right)} + \beta \int Dz \, \frac{\left\{ G'_h(\sqrt{q}, 1) \right\}^2}{G_h(\sqrt{q}, 1)} = \frac{q}{1 - q}.   (25)
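Equation (25) can be solved for q by a damped fixed-point iteration, as in the following sketch; the Gauss-Hermite quadrature used for the Dz integrals, the quadrature order, the damping, and the numerical floor on the denominators are implementation choices made only for this illustration.

import numpy as np
from scipy.special import erfc

gh_x, gh_w = np.polynomial.hermite.hermgauss(80)
z_nodes = np.sqrt(2.0) * gh_x                  # nodes for Dz = dz exp(-z^2/2)/sqrt(2*pi)
z_weights = gh_w / np.sqrt(np.pi)

def H(x):
    return 0.5 * erfc(x / np.sqrt(2.0))

def Hp(x):                                     # H'(x) = -exp(-x^2/2)/sqrt(2*pi)
    return -np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def Gh(z, a, h):                               # G_h(a, 1) of Eq. (21), with b = 1
    c = np.sqrt(1.0 - a**2)
    return 0.5 * (H((a*z + h) / c) + H((h - a*z) / c))

def Ghp(z, a, h):                              # G'_h(a, 1) of Eq. (24), with b = 1
    c = np.sqrt(1.0 - a**2)
    return 0.5 * (Hp((a*z + h) / c) - Hp((h - a*z) / c))

def rhs(q, alpha, beta, h):
    # Eq. (25) rearranged as q = (1 - q) * [alpha*I_sup + beta*I_uns]
    u = (np.sqrt(q) * z_nodes + h) / np.sqrt(1.0 - q)
    I_sup = np.sum(z_weights * Hp(u)**2 / np.maximum(H(u), 1e-300))
    I_uns = np.sum(z_weights * Ghp(z_nodes, np.sqrt(q), h)**2
                   / np.maximum(Gh(z_nodes, np.sqrt(q), h), 1e-300))
    return (1.0 - q) * (alpha * I_sup + beta * I_uns)

def solve_q(alpha, beta, h, q0=0.5, damping=0.5, iters=2000):
    q = q0
    for _ in range(iters):
        q = (1.0 - damping) * q + damping * np.clip(rhs(q, alpha, beta, h), 1e-12, 1 - 1e-12)
    return q

When Eq. (25) admits several solutions for the same (α, β, h), different initial values q0 converge to different fixed points, which is how the coexisting branches discussed below can be located numerically.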
The order parameter q is closely related to the generalization error, which is defined as the probability of disagreement between the true label and the classifier output for a newly generated example after the classifier has been trained. In the case of an input-output relation given by a simple perceptron, the generalization error is expressed as14)

\epsilon = \frac{1}{\pi} \cos^{-1} q.   (26)

We will evaluate this quantity to validate the performance of the classifier generated from the combination of unsupervised and supervised learning.

Fig. 1. (color online) Generalization errors (log ε plotted against β) for h = 0.1, 0.05, 0.03, 0.02, and 0.01 (curves from left to right). The left panel shows the results for α = 1, and the right one represents α = 10. Both cases exhibit multiple solutions for the same value of β.

3. Saddle point and numerical verification

In Fig. 1, we plot the logarithm of the generalization error with respect to the number of unlabelled data (β) for several values of h. Each panel shows the results for a different value of α. Note that when there is no fine tuning through supervised learning (i.e., α = 0), the generalization error does not exhibit any nontrivial behaviour. However, for nonzero α, we find nontrivial curves, which give multiple solutions for the same β, in the β-ε plane. This is a remarkable result for the combination of unsupervised and supervised learning. The nontrivial curves imply the existence of a metastable state, similar to several classical spin models.15) As h decreases, the spinodal point $\beta_{\rm sp}$ (the point at which the multiple solutions coalesce) moves to larger values of β. This is because decreasing h makes the classification of the input data more difficult. In other words, we need a vast number of unlabelled data to attain the lower-error state for a fixed number of labelled data. Moreover, the metastable state persists up to a large value of β, so the computational cost becomes very expensive. We therefore need an extremely long computational time to reach the lower-error solution, or we must find good initial conditions near it. On the other hand, increasing α causes the spinodal points to move to lower values of β. Although this confirms an improvement in the generalization error for the higher-error state, there is no quantitative change in that for the lower-error state. In this sense, pre-training is an essential part of the architecture of deep learning if we wish to achieve the lower-error state; this is the origin of deep learning's remarkable performance. In contrast, the emergence of the metastable state causes the computational cost to increase drastically. Several special techniques could be incorporated into the architecture of deep learning to avoid this weak point, effectively preparing good initial conditions that enable the lower-error state to be reached.8,9)

The asymptotic form $H(x) \sim \Theta(x) \exp(-x^2/2)/|x|$ for $x \to \infty$ leads to the exponent of the learning curve,14) which characterizes the decrease in the generalization error for $\alpha \gg 1$ and $\beta \gg 1$ as $\epsilon_g \sim (c_\alpha \alpha^2 + c \alpha\beta + c_\beta \beta^2)^{-1}$. Here, $c_\alpha$, c, and $c_\beta$ are constants evaluated from Gaussian integrals. Thus, the exponent of the learning curve in this formulation is unchanged compared with that of the perceptron with ordinary supervised learning.
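Curves of the kind shown in Fig. 1 can be traced with the solve_q sketch given after Eq. (25) by sweeping β in both directions and warm-starting each solve from the previous solution; the parameter values and the sweep strategy below are illustrative choices.

import numpy as np

alpha, h = 1.0, 0.05
betas = np.linspace(1.0, 1000.0, 200)

q_up, q0 = [], 1e-3
for beta in betas:                       # upward sweep in beta, warm-started
    q0 = solve_q(alpha, beta, h, q0=q0)
    q_up.append(q0)

q_down, q0 = [], 1.0 - 1e-6
for beta in betas[::-1]:                 # downward sweep in beta, warm-started
    q0 = solve_q(alpha, beta, h, q0=q0)
    q_down.append(q0)
q_down = q_down[::-1]

eps_up = np.arccos(np.array(q_up)) / np.pi      # generalization error, Eq. (26)
eps_down = np.arccos(np.array(q_down)) / np.pi
# Where eps_up and eps_down differ, several solutions of Eq. (25) coexist,
# which is the metastability discussed above; the value of beta at which the
# two sweeps merge corresponds to the spinodal point beta_sp.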
Next, let us consider the effect of fine tuning in the context of deep learning. If we plot the saddle-point solutions in the α-ε plane, we find that multiple solutions appear in a certain region. Increasing the number of unlabelled data again leads to an improvement in the generalization error. A gradual increase in the number of labelled data allows us to escape from the metastable state. In this sense, fine tuning by supervised learning is necessary to achieve the lower-error state and to mitigate the difficulties in reaching the desired solution. We should emphasize that the emergence of the metastable state does not come from the multi-layer neural networks of the DNN, but from the combination of unsupervised and supervised learning. This observation was also noted in a previous study.16)

To verify our analysis, we conduct numerical experiments using the so-called approximate message passing (AMP) algorithm.17) Following its modern formulation,18) we can construct an iterative algorithm to infer the weight vector using both the unlabelled and labelled data. The update equations are

a_\mu^{t+1} = \sum_{k=1}^{N} x_{\mu k} w_k - \frac{1}{\kappa^t} C_\mu\!\left( a_\mu^t, \frac{1}{\kappa^t}, h \right),   (27)

\kappa^{t+1} = \frac{N}{L+U} \left( 1 + \frac{4}{N} \sum_{k=1}^{N} \left( a_k^t \right)^2 \right)^{1/2},   (28)

a_k^{t+1} = \sum_{\mu=1}^{L+U} x_{\mu k}\, C_\mu\!\left( a_\mu^t, \frac{1}{\kappa^t}, h \right) + B^t a_k^t,   (29)

B^{t+1} = \frac{1}{N} \sum_{\mu=1}^{L+U} D_\mu\!\left( a_\mu^t, \frac{1}{\kappa^t}, h \right),   (30)

where

C_\mu(a, b, h) = \begin{cases} \dfrac{y_\mu \exp(-z_-^2/2)}{\sqrt{2\pi b}\, H(z_-)} & (\mu \le L) \\[2mm] \dfrac{\exp(-z_-^2/2) - \exp(-z_+^2/2)}{\sqrt{2\pi b}\, \{ H(z_-) + H(z_+) \}} & (\mu > L) \end{cases}   (31)

D_\mu(a, b, h) = \begin{cases} y_\mu \dfrac{z_-}{\sqrt{b}}\, C_\mu(a, b, h) - C_\mu^2(a, b, h) & (\mu \le L) \\[2mm] -C_\mu^2(a, b, h) + \dfrac{z_- \exp(-z_-^2/2) + z_+ \exp(-z_+^2/2)}{\sqrt{2\pi b}\, \{ H(z_-) + H(z_+) \}} & (\mu > L). \end{cases}   (32)

Here $x_{\mu k}$ is the kth component of the feature vector of the datum µ and $w_k$ is the kth component of the weight vector. We use the abbreviation $z_\pm = (h \pm a)/\sqrt{b}$, and estimate the weight vector from $w = a^t/\kappa^t$. In the numerical experiments, we first estimate the weight vector using only the unlabelled data, i.e., α = 0. We then gradually increase the number of labelled data while estimating the weight vector. The system size is set to N = 100, and the number of samples to $N_{\rm sam} = 1000$. The maximum iteration number for fine tuning is set to 20. In Fig. 2, we plot the average generalization error over $N_{\rm sam}$ independent runs starting from randomized initial conditions. As theoretically predicted, our results confirm the water-falling phenomenon for several cases with h = 0.5. Increasing the number of labelled data in the fine tuning step allows us to escape from the metastable state. Therefore, fine tuning is a necessary component of the remarkable performance of deep learning. However, the difficulty of classification, represented by h, demands a large number of training data. We therefore require the initial condition in the fine tuning to be as good as possible in order to reach the lower-error state. Several empirical studies of the deep learning algorithm have revealed that special techniques such as the auto-encoder can provide initial conditions that are sufficiently good to improve the performance after fine tuning.9) In future work, we intend to clarify that such specific techniques do indeed overcome the degradation in performance caused by the metastable state.
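For concreteness, the update equations (27)-(32) translate into the following sketch. The initialization, the number of iterations, and the rescaling of the feature matrix by 1/√N (so that the fields $a_\mu$ are comparable with the margin h; the normalization of $x_{\mu k}$ is left implicit above) are assumptions made only for this illustration.

import numpy as np
from scipy.special import erfc

def H(x):
    return np.maximum(0.5 * erfc(x / np.sqrt(2.0)), 1e-300)   # floor avoids division by zero

def C_mu(a, b, h, y=None):
    # Output function of Eq. (31); y is None for an unlabelled datum (mu > L).
    zm, zp = (h - a) / np.sqrt(b), (h + a) / np.sqrt(b)
    if y is not None:
        return y * np.exp(-0.5 * zm**2) / (np.sqrt(2*np.pi*b) * H(zm))
    num = np.exp(-0.5 * zm**2) - np.exp(-0.5 * zp**2)
    return num / (np.sqrt(2*np.pi*b) * (H(zm) + H(zp)))

def D_mu(a, b, h, y=None):
    # Derivative-like function of Eq. (32).
    zm, zp = (h - a) / np.sqrt(b), (h + a) / np.sqrt(b)
    c = C_mu(a, b, h, y)
    if y is not None:
        return y * zm / np.sqrt(b) * c - c**2
    num = zm * np.exp(-0.5 * zm**2) + zp * np.exp(-0.5 * zp**2)
    return -c**2 + num / (np.sqrt(2*np.pi*b) * (H(zm) + H(zp)))

def amp(X, y_labelled, h, iters=20, seed=0):
    # Iterate Eqs. (27)-(30); X has one datum per row, and the first L rows are labelled.
    M, N = X.shape
    L = y_labelled.size
    ys = [y_labelled[m] if m < L else None for m in range(M)]
    rng = np.random.default_rng(seed)
    a_k = 0.01 * rng.standard_normal(N)        # randomized initial condition
    a_mu = np.zeros(M)
    kappa, B = 1.0, 0.0
    for _ in range(iters):
        b = 1.0 / kappa
        C = np.array([C_mu(a_mu[m], b, h, ys[m]) for m in range(M)])
        D = np.array([D_mu(a_mu[m], b, h, ys[m]) for m in range(M)])
        a_mu = X @ (a_k / kappa) - C / kappa                            # Eq. (27)
        kappa = (N / M) * np.sqrt(1.0 + (4.0 / N) * np.sum(a_k**2))     # Eq. (28)
        a_k = X.T @ C + B * a_k                                         # Eq. (29)
        B = np.sum(D) / N                                               # Eq. (30)
    return a_k / kappa                         # weight-vector estimate w = a^t / kappa^t

# usage with the synthetic data of Sect. 2: w_est = amp(X / np.sqrt(N), y_labelled, h=g)

Running this iteration from many random seeds with α = 0 and then gradually adding the labelled data mirrors the experimental protocol described above.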
4. Conclusion

We have analyzed a simplified perceptron model under a combination of unsupervised and supervised learning for data with a margin. The margin imitates the structure of the training data. We have found nontrivial behaviour in the generalization error of the classifier obtained by this hybrid of unsupervised and supervised learning. First, we confirmed the remarkable improvement in the generalization error obtained by increasing the number of unlabelled data. In this sense, the pre-training step in deep learning is essential when few labelled data are available. In addition, our result reveals the existence of the metastable solution, which hampers ordinary gradient-based iterations in their pursuit of the optimal estimate. In the deep learning algorithm, the pre-training technique is crucial for reducing the computation time and attaining good performance, because good initial conditions allow the algorithm to reach the lower-error state. Instead of focusing on the specialized pre-training techniques, we have investigated the nontrivial behaviour involved in the metastable state and the existence of the