MASSACHUSETTS INSTITUTE OF TECHNOLOGY
ARTIFICIAL INTELLIGENCE LABORATORY
and
CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING
DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES

A.I. Memo No. 1560                                   January, 1996
C.B.C.L. Memo No. 129

Fast Learning by Bounding Likelihoods in Sigmoid Type Belief Networks

Tommi S. Jaakkola, Lawrence K. Saul, and Michael I. Jordan
{tommi, lksaul, [email protected]}

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu.

Abstract

Sigmoid type belief networks, a class of probabilistic neural networks, provide a natural framework for compactly representing probabilistic information in a variety of unsupervised and supervised learning problems. Often the parameters used in these networks need to be learned from examples. Unfortunately, estimating the parameters via exact probabilistic calculations (i.e., the EM algorithm) is intractable even for networks with fairly small numbers of hidden units. We propose to avoid the infeasibility of the E step by bounding likelihoods instead of computing them exactly. We introduce extended and complementary representations for these networks and show that the estimation of the network parameters can be made fast (reduced to quadratic optimization) by performing the estimation in either of the alternative domains. The complementary networks can be used for continuous density estimation as well.

Copyright © Massachusetts Institute of Technology, 1996

In press: Advances in Neural Information Processing Systems 8, MIT Press.

This report describes research done at the Dept. of Brain and Cognitive Sciences, the Center for Biological and Computational Learning, and the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. Support for CBCL is provided in part by a grant from the NSF (ASC-9217041). Support for the laboratory's artificial intelligence research is provided in part by the Advanced Research Projects Agency of the Dept. of Defense. The authors were supported by a grant from the McDonnell-Pew Foundation, by a grant from Siemens Corporation, by a grant from Daimler-Benz Systems Technology Research, and by a grant from the Office of Naval Research. Michael I. Jordan is an NSF Presidential Young Investigator.

1 Introduction

The appeal of probabilistic networks for knowledge representation, inference, and learning (Pearl, 1988) derives both from the sound Bayesian framework and from the explicit representation of dependencies among the network variables, which allows ready incorporation of prior information into the design of the network. The Bayesian formalism permits full propagation of probabilistic information across the network regardless of which variables in the network are instantiated. In this sense these networks can be "inverted" probabilistically.

This inversion, however, relies heavily on the use of look-up table representations of conditional probabilities, or representations equivalent to them, for modeling dependencies between the variables. For sparse dependency structures such as trees or chains this poses no difficulty. In more realistic cases of reasonably interdependent variables, the exact algorithms developed for these belief networks (Lauritzen & Spiegelhalter, 1988) become infeasible due to the exponential growth in the size of the conditional probability tables needed to store the exact dependencies. Therefore the use of compact representations to model probabilistic interactions is unavoidable in large problems. As belief network models move away from tables, however, the representations can be harder to assess from expert knowledge, and the important role of learning is further emphasized.

Compact representations of interactions between simple units have long been emphasized in neural networks. Lacking a thorough probabilistic interpretation, however, classical feed-forward neural networks cannot be inverted in the above sense; e.g., given the output pattern of a feed-forward neural network it is not feasible to compute a probability distribution over the possible input patterns that would have resulted in the observed output. On the other hand, stochastic neural networks such as Boltzmann machines admit probabilistic interpretations and therefore, at least in principle, can be inverted and used as a basis for inference and learning in the presence of uncertainty.

Sigmoid belief networks (Neal, 1992) form a subclass of probabilistic neural networks where the activation function has a sigmoidal form, usually the logistic function. Neal (1992) proposed a learning algorithm for these networks which can be viewed as an improvement of the algorithm for Boltzmann machines. Recently Hinton et al. (1995) introduced the wake-sleep algorithm for layered bi-directional probabilistic networks. This algorithm relies on forward sampling and has an appealing coding theoretic motivation. The Helmholtz machine (Dayan et al., 1995), on the other hand, can be seen as an alternative technique for these architectures that avoids Gibbs sampling altogether. Dayan et al. also introduced the important idea of bounding likelihoods instead of computing them exactly. Saul et al. (1995) subsequently derived rigorous mean field bounds for the likelihoods. In this paper we introduce the idea of alternative (extended and complementary) representations of these networks by reinterpreting the nonlinearities in the activation function. We show that deriving likelihood bounds in the new representational domains leads to efficient (quadratic) estimation procedures for the network parameters.

2 The probability representations

Belief networks represent the joint probability of a set of variables {S} as a product of conditional probabilities given by

    P(S_1, \ldots, S_n) = \prod_{k=1}^{n} P(S_k \mid \mathrm{pa}[k]),                              (1)

where the notation pa[k], "parents of S_k", refers to all the variables that directly influence the probability of S_k taking on a particular value (for equivalent representations, see Lauritzen et al. 1988). The fact that the joint probability can be written in the above form implies that there are no "cycles" in the network; i.e., there exists an ordering of the variables in the network such that no variable directly influences any preceding variables.

In this paper we consider sigmoid belief networks where the variables S are binary (0/1), the conditional probabilities have the form

    P(S_i \mid \mathrm{pa}[i]) = g\Big( (2S_i - 1) \sum_j W_{ij} S_j \Big)                          (2)

and the weights W_{ij} are zero unless S_j is a parent of S_i, thus preserving the feed-forward directionality of the network. For notational convenience we have assumed the existence of a bias variable whose value is clamped to one. The activation function g(\cdot) is chosen to be the cumulative Gaussian distribution function given by

    g(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{1}{2} z^2}\, dz
         = \frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} e^{-\frac{1}{2}(z - x)^2}\, dz                   (3)

Although very similar to the standard logistic function, this activation function derives a number of advantages from its integral representation. In particular, we may reinterpret the integration as a marginalization and thereby obtain alternative representations for the network. We consider two such representations.
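As a concrete illustration of equations (1)-(3), the short sketch below builds a small sigmoid belief network with the cumulative Gaussian activation, draws a sample by ancestral sampling, and evaluates the log joint probability. This is only an illustrative sketch and not code from the paper; the network size, the random strictly lower-triangular weight matrix W, and the NumPy-based implementation are assumptions made here, with unit 0 standing in for the bias variable clamped to one.

```python
# A minimal sketch (not the authors' code) of a sigmoid belief network with the
# cumulative-Gaussian activation of equation (3).  Network size, random weights,
# and the bias handling are illustrative assumptions.
import numpy as np
from math import erf, sqrt

def g(x):
    """Cumulative Gaussian activation: g(x) = P(z <= x) for z ~ N(0, 1)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ancestral_sample(W, rng):
    """Sample the units in topological order.  W[i, j] is nonzero only if j is a
    parent of i (here j < i), so the network is feed-forward; unit 0 plays the
    role of the bias variable clamped to one, as assumed in the text."""
    n = W.shape[0]
    S = np.zeros(n)
    S[0] = 1.0                                # bias unit
    for i in range(1, n):
        p_on = g(W[i] @ S)                    # P(S_i = 1 | pa[i]), equation (2)
        S[i] = float(rng.random() < p_on)
    return S

def log_joint(S, W):
    """log P(S_1, ..., S_n) as a sum of log conditionals, equations (1)-(2)."""
    return sum(np.log(g((2.0 * S[i] - 1.0) * (W[i] @ S))) for i in range(1, len(S)))

rng = np.random.default_rng(0)
n = 5
W = np.tril(rng.normal(scale=0.8, size=(n, n)), k=-1)   # feed-forward weights
S = ancestral_sample(W, rng)
print("sample:", S, " log P(S):", log_joint(S, W))
```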
We derive an extended representation by making explicit the nonlinearities in the activation function. More precisely,

    P(S_i \mid \mathrm{pa}[i]) = g\Big( (2S_i - 1) \sum_j W_{ij} S_j \Big)
        = \int_0^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} [ Z_i - (2S_i - 1) \sum_j W_{ij} S_j ]^2}\, dZ_i
        \stackrel{\mathrm{def}}{=} \int_0^{\infty} P(S_i, Z_i \mid \mathrm{pa}[i])\, dZ_i           (4)

This suggests defining the extended network in terms of the new conditional probabilities P(S_i, Z_i \mid \mathrm{pa}[i]). By construction, then, the original binary network is obtained by marginalizing over the extra variables Z. In this sense the extended network is (marginally) equivalent to the binary network.

We distinguish a complementary representation from the extended one by writing the probabilities entirely in terms of continuous variables.[1] Such a representation can be obtained from the extended network by a simple transformation of variables. The new continuous variables are defined by \tilde{Z}_i = (2S_i - 1) Z_i, or, equivalently, by Z_i = |\tilde{Z}_i| and S_i = \theta(\tilde{Z}_i), where \theta(\cdot) is the step function. Performing this transformation yields

    P(\tilde{Z}_i \mid \mathrm{pa}[i]) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} [ \tilde{Z}_i - \sum_j W_{ij} \theta(\tilde{Z}_j) ]^2}           (5)

which defines a network of conditionally Gaussian variables. The original network in this case can be recovered by conditional marginalization over \tilde{Z}, where the conditioning variables are \theta(\tilde{Z}).

[1] While the binary variables are the outputs of each unit, the continuous variables pertain to the inputs; hence the name complementary.

Figure 1 below summarizes the relationships between the different representations. As will become clear later, working with the alternative representations instead of the original binary representation can lead to more flexible and efficient (least-squares) parameter estimation.

[Figure 1: The relationship between the alternative representations. Augmentation takes the original network over {S} to the extended network over {S, Z}; averaging over Z recovers the original network; a transformation of variables relates the extended network to the complementary network over {\tilde{Z}}.]
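The marginal equivalence of the complementary network can be checked numerically. The snippet below is an illustration added here, not from the paper: a single parent clamped to one (playing the role of the bias) feeds one child, the child's \tilde{Z} is sampled from the Gaussian conditional of equation (5), and thresholding it reproduces the binary conditional of equation (2) up to Monte Carlo noise.

```python
# A small numerical check (an illustration, not from the paper) that thresholding
# the complementary Gaussian network of equation (5) reproduces the binary
# conditional of equation (2).  The one-parent setup and the Monte Carlo
# comparison are assumptions of this sketch.
import numpy as np
from math import erf, sqrt

def g(x):                                     # cumulative Gaussian, equation (3)
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(1)
w = 1.3                                       # weight from the clamped parent to the child

# Complementary network: Z~_child ~ N(w * theta(Z~_parent), 1), with the parent
# clamped on so that theta(Z~_parent) = 1.
z_tilde = rng.normal(loc=w, scale=1.0, size=200_000)
s_child = (z_tilde > 0).astype(float)         # recover the binary unit: S = theta(Z~)

print("Monte Carlo estimate of P(S = 1):", s_child.mean())
print("Binary network, equation (2):    ", g(w))   # agrees up to sampling noise
```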
3 The learning problem

We consider the problem of learning the parameters of the network from instantiations of variables contained in a training set. Such instantiations, however, need not be complete; there may be variables that have no value assignments in the training set, as well as variables that are always instantiated. The tacit division between hidden (H) and visible (V) variables therefore depends on the particular training example considered and is not an intrinsic property of the network.

To learn from these instantiations we adopt the principle of maximum likelihood to estimate the weights in the network. In essence, this is a density estimation problem where the weights are chosen so as to match the probabilistic behavior of the network with the observed activities in the training set. Central to this estimation is the ability to compute likelihoods (or log-likelihoods) for any (partial) configuration of variables appearing in the training set. In other words, if we let X^V be the configuration of visible or instantiated variables and X^H denote the hidden or uninstantiated variables, we need[2] to compute marginal probabilities of the form

    \log P(X^V) = \log \sum_{X^H} P(X^V, X^H)           (6)

If the training samples are independent, then these log marginals can be added to give the overall log-likelihood of the training set,

    \log P(\mathrm{training\ set}) = \sum_t \log P(X^{V_t})           (7)

Unfortunately, computing each of these marginal probabilities involves summing (integrating) over an exponential number of different configurations assumed by the hidden variables in the network. This renders the sum (integration) intractable in all but a few special cases (e.g., trees and chains). It is possible, however, to instead find a manageable lower bound on the log-likelihood and optimize the weights in the network so as to maximize this bound.

To obtain such a lower bound we resort to Jensen's inequality:

    \log P(X^V) = \log \sum_{X^H} P(X^H, X^V)
                = \log \sum_{X^H} Q(X^H) \frac{P(X^H, X^V)}{Q(X^H)}
                \ge \sum_{X^H} Q(X^H) \log \frac{P(X^H, X^V)}{Q(X^H)}           (8)

Although this bound holds for all distributions Q(X) over the hidden variables, the accuracy of the bound is determined by how closely Q approximates the posterior distribution P(X^H \mid X^V) in terms of the Kullback-Leibler divergence; if the approximation is perfect, the divergence is zero and the inequality is satisfied with equality. Suitable choices for Q can make the bound both accurate and easy to compute. The feasibility of finding such a Q, however, is highly dependent on the choice of the representation for the network.

[2] To postpone the issue of representation we use X to denote S, {S, Z}, or \tilde{Z}, depending on the particular representation chosen.
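On a network small enough to enumerate, the Jensen bound of equation (8) can be compared directly against the exact marginal of equation (6). The toy example below is an illustration added here, not from the paper: two hidden root units and one visible child, a factorized Q over the hidden units, and brute-force enumeration are all assumptions of this sketch.

```python
# A toy illustration (not from the paper) of the Jensen bound in equation (8):
# for a tiny binary network the hidden configurations can be enumerated, so the
# exact log-likelihood and the lower bound can both be computed and compared.
import itertools
import numpy as np
from math import erf, sqrt, log

def g(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Joint P(H1, H2, V): H1, H2 are roots with P(H = 1) = g(b); V depends on both.
b = 0.3
W = np.array([0.8, -1.2])

def log_joint(h, v):
    lp = sum(log(g((2 * hi - 1) * b)) for hi in h)
    lp += log(g((2 * v - 1) * (W @ np.array(h, dtype=float))))
    return lp

v_obs = 1
configs = list(itertools.product([0, 1], repeat=2))

# Exact marginal likelihood by brute-force summation, equation (6).
exact = log(sum(np.exp(log_joint(h, v_obs)) for h in configs))

def bound(q):
    """Jensen bound (8) for a factorized Q(H) = prod_i q_i^{H_i} (1 - q_i)^{1 - H_i}."""
    val = 0.0
    for h in configs:
        qh = np.prod([q[i] if h[i] else 1 - q[i] for i in range(2)])
        if qh > 0:
            val += qh * (log_joint(h, v_obs) - log(qh))
    return val

print("exact log P(V):          ", exact)
print("bound, arbitrary Q:      ", bound([0.5, 0.5]))      # always <= exact
# The bound tightens as Q approaches the posterior P(H | V); here we use the
# posterior marginals (equality holds only if the posterior factorizes).
post = np.array([np.exp(log_joint(h, v_obs)) for h in configs])
post /= post.sum()
q_star = [post[2] + post[3], post[1] + post[3]]
print("bound, posterior marginals:", bound(q_star))
```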
4 Likelihood bounds in different representations

To complete the derivation of the likelihood bound (equation 8) we need to fix the representation for the network. Which representation to select, however, affects the quality and accuracy of the bound. In addition, the accompanying bound of the chosen representation implies bounds in the other two representational domains, as they all code the same distributions over the observables. In this section we illustrate these points by deriving bounds in the complementary and extended representations and discuss the corresponding bounds in the original binary domain.

Now, to obtain a lower bound we need to specify the approximate posterior Q. In the complementary representation the conditional probabilities are Gaussians, and therefore a reasonable approximation (mean field) is found by choosing the posterior approximation from the family of factorized Gaussians:

    Q(\tilde{Z}) = \prod_i \frac{1}{\sqrt{2\pi}}\, e^{-(\tilde{Z}_i - h_i)^2 / 2}           (9)

Substituting this into equation 8 we obtain the bound

    \log P(S^*) \ge -\frac{1}{2} \sum_i \Big( h_i - \sum_j J_{ij}\, g(h_j) \Big)^2
                    - \frac{1}{2} \sum_{ij} J_{ij}^2\, g(h_j)\, g(-h_j)           (10)

The means h_i for the hidden variables are adjustable parameters that can be tuned to make the bound as tight as possible. For the instantiated variables we need to enforce the constraints g(h_i) = S_i^* to respect the instantiation. These can be satisfied very accurately by setting h_i = 4(2S_i^* - 1). A very convenient property of this bound, and of the complementary representation in general, is the quadratic weight dependence, a property very conducive to fast learning.
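To make the quadratic weight dependence concrete, the sketch below (an assumption-laden illustration, not the authors' implementation) evaluates the right-hand side of equation (10) and re-estimates each row of the weights in closed form with the means h held fixed, which is a regularized least-squares problem. The small network, the random means, and the row-wise normal-equation update are assumptions of this sketch; restricting each row to the units that precede it mirrors the feed-forward zero-weight constraint of equation (2).

```python
# Sketch of the complementary mean-field bound, equation (10), and its quadratic
# dependence on the weights J (not the authors' code; setup is illustrative).
import numpy as np
from math import erf, sqrt

def g(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

g_vec = np.vectorize(g)

def bound(J, h):
    """Right-hand side of equation (10); note g(-h) = 1 - g(h)."""
    a = g_vec(h)                              # g(h_j)
    v = a * (1.0 - a)                         # g(h_j) g(-h_j)
    resid = h - J @ a
    return -0.5 * np.sum(resid ** 2) - 0.5 * np.sum((J ** 2) * v)

def update_row(i, J, h):
    """Maximize the bound over row i of J (its parents j < i), holding h fixed;
    setting the gradient to zero gives ridge-like normal equations."""
    parents = np.arange(i)
    a = g_vec(h[parents])
    v = a * (1.0 - a)
    J_new = J.copy()
    J_new[i, parents] = np.linalg.solve(np.outer(a, a) + np.diag(v), h[i] * a)
    return J_new

rng = np.random.default_rng(2)
n = 6
h = rng.normal(size=n)                        # mean-field means, equation (9)
J = np.tril(rng.normal(scale=0.3, size=(n, n)), k=-1)

print("bound before row updates:", bound(J, h))
for i in range(1, n):
    J = update_row(i, J, h)
print("bound after row updates: ", bound(J, h))   # the bound can only increase
```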
Finally, we note that the complementary representation transforms the binary estimation problem into a continuous density estimation problem.

We now turn to the interpretation of the above bound in the binary domain. The same bound can be obtained by first fixing the inputs to all the units to be the means h_i and then computing the negative total mean squared error between the fixed inputs and the corresponding probabilistic inputs propagated from the parents. The fact that this procedure in fact gives a lower bound on the log-likelihood would be more difficult to justify by working with the binary representation alone.

In the extended representation the probability distribution for Z_i is a truncated Gaussian given S_i and its parents. We therefore propose the partially factorized posterior approximation

    Q(S, Z) = \prod_i Q(Z_i \mid S_i)\, Q(S_i)           (11)

where Q(Z_i \mid S_i) is a truncated Gaussian (supported on Z_i \ge 0):

    Q(Z_i \mid S_i) = \frac{1}{g((2S_i - 1) h_i)} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} ( Z_i - (2S_i - 1) h_i )^2}           (12)

As in the complementary domain, the resulting bound depends quadratically on the weights. Instead of writing out the bound here, however, it is more informative to see its derivation in the binary domain. A factorized posterior approximation (mean field) Q(S) = \prod_i q_i^{S_i} (1 - q_i)^{1 - S_i} for the binary network yields the bound

    \log P(S^*) \ge \sum_i \Big\langle S_i \log g\Big( \sum_j J_{ij} S_j \Big) \Big\rangle
                  + \sum_i \Big\langle (1 - S_i) \log\Big( 1 - g\Big( \sum_j J_{ij} S_j \Big) \Big) \Big\rangle
                  - \sum_i \big[ q_i \log q_i + (1 - q_i) \log(1 - q_i) \big]           (13)

where the averages \langle \cdot \rangle are with respect to the Q distribution. These averages, however, do not conform to analytical expressions. The tractable posterior approximation in the extended domain avoids the problem by implicitly making the following Legendre transformation:

    \log g(x) = \Big[ \tfrac{1}{2} x^2 + \log g(x) \Big] - \tfrac{1}{2} x^2
              \ge \lambda x - G(\lambda) - \tfrac{1}{2} x^2           (14)

which holds since \frac{1}{2} x^2 + \log g(x) is a convex function (G(\lambda) denotes its conjugate). Inserting this back into the relevant parts of equation 13 and performing the averages gives

    \log P(S^*) \ge \sum_i \big[ q_i \lambda_i - (1 - q_i) \bar{\lambda}_i \big] \sum_j J_{ij} q_j
                  - \sum_i \big[ q_i G(\lambda_i) + (1 - q_i) G(\bar{\lambda}_i) \big]
                  - \frac{1}{2} \sum_i \Big( \sum_j J_{ij} q_j \Big)^2
                  - \frac{1}{2} \sum_{ij} J_{ij}^2\, q_j (1 - q_j)
                  - \sum_i \big[ q_i \log q_i + (1 - q_i) \log(1 - q_i) \big]           (15)

which is quadratic in the weights, as expected. The mean activities q for the hidden variables and the parameters \lambda can be optimized to make the bound tight. For the instantiated variables we set q_i = S_i^*.
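The Legendre transformation in equation (14) can be checked numerically. The sketch below is an illustration added here, not from the paper: it approximates the conjugate G(\lambda) by a grid search, which is an assumption of this sketch; the paper instead keeps \lambda as an adjustable variational parameter.

```python
# Numerical illustration (a sketch, not from the paper) of equation (14):
# f(x) = x^2/2 + log g(x) is convex, so log g(x) >= lambda*x - G(lambda) - x^2/2
# for every lambda, where G is the conjugate of f.  The grid-search conjugate is
# an assumption of this sketch.
import numpy as np
from math import erf, sqrt, log

def g(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

grid = np.linspace(-8.0, 8.0, 16001)
f_grid = np.array([0.5 * x * x + log(g(x)) for x in grid])   # f(x) = x^2/2 + log g(x)

def G(lam):
    """Conjugate function G(lambda) = max_x [lambda*x - f(x)], taken over the grid."""
    return np.max(lam * grid - f_grid)

x = 0.7
for lam in (0.2, 1.0, 2.5):
    lower = lam * x - G(lam) - 0.5 * x * x
    print(f"lambda = {lam:3.1f}   bound = {lower:+.4f}   log g(x) = {log(g(x)):+.4f}")
# The bound touches log g(x) only at the optimal lambda = f'(x); for any other
# lambda it is a strict lower bound, which is what equation (15) exploits by
# keeping the lambda_i and lambda-bar_i as adjustable variational parameters.
```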
5 Numerical experiments

To test these techniques in practice we applied the complementary network to the problem of detecting motor failures from spectra obtained during motor operation (see Petsche et al. 1995). We cast the problem as a continuous density estimation problem. The training set consisted of 800 out of 1283 FFT spectra, each with 319 components, measured from an electric motor in a good operating condition but under varying loads. The test set included the remaining 483 FFTs from the same motor in a good condition, in addition to three sets of 1340 FFTs each, measured when a particular fault was present. The goal was to use the likelihood of a test FFT with respect to the estimated density to determine whether there was a fault present in the motor.

We used a layered 6 → 20 → 319 generative model to estimate the training set density. The resulting classification error rates on the test set are shown in figure 2 as a function of the threshold likelihood. The achieved error rates are comparable to those of Petsche et al. (1995).

[Figure 2: The probability of error curves for missing a fault (dashed lines) and misclassifying a good motor (solid line) as a function of the likelihood threshold.]

6 Conclusions

Network models that admit probabilistic formulations derive a number of advantages from probability theory. Moving away from explicit representations of dependencies, however, can make these properties harder to exploit in practice. We showed that an efficient estimation procedure can be derived for sigmoid belief networks, where standard methods are intractable in all but a few special cases (e.g., trees and chains). The efficiency of our approach derived from the combination of two ideas. First, we avoided the intractability of computing likelihoods in these networks by computing lower bounds instead. Second, we introduced new representations for these networks and showed how the lower bounds in the new representational domains transform the parameter estimation problem into quadratic optimization.

Acknowledgements

The authors wish to thank Peter Dayan for helpful comments on the manuscript.

References

P. Dayan, G. Hinton, R. Neal, and R. Zemel (1995). The Helmholtz machine. Neural Computation 7: 889-904.

A. Dempster, N. Laird, and D. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B 39: 1-38.

G. Hinton, P. Dayan, B. Frey, and R. Neal (1995). The wake-sleep algorithm for unsupervised neural networks. Science 268: 1158-1161.

S. L. Lauritzen and D. J. Spiegelhalter (1988). Local computations with probabilities on graphical structures and their application to expert systems. J. Roy. Statist. Soc. B 50: 154-227.

R. Neal (1992). Connectionist learning of belief networks. Artificial Intelligence 56: 71-113.

J. Pearl (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann: San Mateo.

T. Petsche, A. Marcantonio, C. Darken, S. J. Hanson, G. M. Kuhn, and I. Santoso (1995). A neural network autoassociator for induction motor failure prediction. In Advances in Neural Information Processing Systems 8. MIT Press.

L. K. Saul, T. Jaakkola, and M. I. Jordan (1995). Mean field theory for sigmoid belief networks. M.I.T. Computational Cognitive Science Technical Report 9501.