Learning in Graphical Models (Adaptive Computation and Machine Learning) PDF

615 Pages·1998·56.83 MB·English


Series Foreword

The goal of building systems that can adapt to their environments and learn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience, and cognitive science. Out of this research has come a wide variety of learning techniques that have the potential to transform many industrial and scientific fields. Recently, several research communities have begun to converge on a common set of issues surrounding supervised, unsupervised, and reinforcement learning problems. The MIT Press Series on Adaptive Computation and Machine Learning seeks to unify the many diverse strands of machine learning research and to foster high-quality research and innovative applications.

This book collects recent research on representing, reasoning, and learning with belief networks. Belief networks (also known as graphical models and Bayesian networks) are a widely applicable formalism for compactly representing the joint probability distribution over a set of random variables. Belief networks have revolutionized the development of intelligent systems in many areas. They are now poised to revolutionize the development of learning systems. The papers in this volume reveal the many ways in which ideas from belief networks can be applied to understand and analyze existing learning algorithms (especially for neural networks). They also show how methods from machine learning can be extended to learn the structure and parameters of belief networks. This book is an exciting illustration of the convergence of many disciplines in the study of learning and adaptive computation.

Preface

Graphical models are a marriage between probability theory and graph theory.
They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering, uncertainty and complexity, and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity: a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph-theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly interacting sets of variables and a data structure that lends itself naturally to the design of efficient general-purpose algorithms.

Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism; examples include mixture models, factor analysis, hidden Markov models, Kalman filters and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This has many advantages; in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems.

This book presents an in-depth exploration of issues related to learning within the graphical model formalism. Four of the chapters are tutorial articles (those by Cowell, MacKay, Jordan et al., and Heckerman). The remaining articles cover a wide spectrum of topics of current research interest.
The book is divided into four main sections: Inference, Independence, Foundations for Learning, and Learning from Data. While the sections can be read independently of each other and the articles are to a large extent self-contained, there is also a logical flow to the material. A full appreciation of the material in later sections requires an understanding of the material in the earlier sections.

The book begins with the topic of probabilistic inference. Inference refers to the problem of calculating the conditional probability distribution of a subset of the nodes in a graph given another subset of the nodes. Much effort has gone into the design of efficient and accurate inference algorithms. The book covers three categories of inference algorithms: exact algorithms, variational algorithms and Monte Carlo algorithms. The first chapter, by Cowell, is a tutorial chapter that covers the basics of exact inference, with particular focus on the popular junction tree algorithm. This material should be viewed as basic for the understanding of graphical models. A second chapter by Cowell picks up where the former leaves off and covers advanced issues arising in exact inference. Kjærulff presents a method for increasing the efficiency of the junction tree algorithm. The basic idea is to take advantage of additional independencies which arise due to the particular messages arriving at a clique; this leads to a data structure known as a "nested junction tree." Dechter presents an alternative perspective on exact inference, based on the notion of "bucket elimination." This is a unifying perspective that provides insight into the relationship between junction tree and conditioning algorithms, and insight into space/time tradeoffs.

Variational methods provide a framework for the design of approximate inference algorithms. Variational algorithms are deterministic algorithms that provide bounds on probabilities of interest.
The chapter by Jordan, Ghahramani, Jaakkola, and Saul is a tutorial chapter that provides a general overview of the variational approach, emphasizing the important role of convexity. The ensuing article by Jaakkola and Jordan proposes a new method for improving the mean field approximation (a particular form of variational approximation). In particular, the authors propose to use mixture distributions as approximating distributions within the mean field formalism.

The inference section closes with two chapters on Monte Carlo methods. Monte Carlo provides a general approach to the design of approximate algorithms based on stochastic sampling. MacKay's chapter is a tutorial presentation of Monte Carlo algorithms, covering simple methods such as rejection sampling and importance sampling, as well as more sophisticated methods based on Markov chain sampling. A key problem that arises with the Markov chain Monte Carlo approach is the tendency of the algorithms to exhibit random-walk behavior; this slows the convergence of the algorithms. Neal presents a new approach to this problem, showing how a sophisticated form of overrelaxation can cause the chain to move more systematically along surfaces of high probability.

The second section of the book addresses the issue of Independence. Much of the aesthetic appeal of the graphical model formalism comes from the "Markov properties" that graphical models embody. A Markov property is a relationship between the separation properties of nodes in a graph (e.g., the notion that a subset of nodes is separated from another subset of nodes, given a third subset of nodes) and conditional independencies in the family of probability distributions associated with the graph (e.g., A is independent of B given C, where A, B and C are subsets of random variables). In the case of directed graphs and undirected graphs the relationships are well understood (cf. Lauritzen, 1997).
Chain graphs, however, which are mixed graphs containing both directed and undirected edges, are less well understood. The chapter by Richardson explores two of the Markov properties that have been proposed for chain graphs and identifies natural "spatial" conditions on Markov properties that distinguish between these Markov properties and those for both directed and undirected graphs. Chain graphs appear to have a richer conditional independence semantics than directed and undirected graphs. The chapter by Studeny and Vejnarova addresses the problem of characterizing stochastic dependence. Studeny and Vejnarova discuss the properties of the multiinformation function, a general information-theoretic function from which many useful quantities can be computed, including the conditional mutual information for all disjoint subsets of nodes in a graph.

The book then turns to the topic of learning. The section on Foundations for Learning contains two articles that cover fundamental concepts that are used in many of the following articles. The chapter by Heckerman is a tutorial article that covers many of the basic ideas associated with learning in graphical models. The focus is on Bayesian methods, both for parameter learning and for structure learning. Neal and Hinton discuss the expectation-maximization (EM) algorithm. EM plays an important role in the graphical model literature, tying together inference and learning problems. In particular, EM is a method for finding maximum likelihood (or maximum a posteriori) parameter values, by making explicit use of a probabilistic inference (the "E step"). Thus EM-based approaches to learning generally make use of inference algorithms as subroutines. Neal and Hinton describe the EM algorithm as coordinate ascent in an appropriately defined cost function.
This point of view allows them to consider algorithms that take partial E steps, and provides an important justification for the use of approximate inference algorithms in learning.

The section on Learning from Data contains a variety of papers concerned with the learning of parameters and structure in graphical models. Bishop provides an overview of latent variable models, focusing on probabilistic principal component analysis, mixture models, topographic maps and time series analysis. EM algorithms are developed for each case. The article by Buhmann complements the Bishop article, describing methods for dimensionality reduction, clustering, and data visualization, again with the EM algorithm providing the conceptual framework for the design of the algorithms. Buhmann also presents learning algorithms based on approximate inference and deterministic annealing.

Friedman and Goldszmidt focus on the problem of representing and learning the local conditional probabilities for graphical models. In particular, they are concerned with representations for these probabilities that make explicit the notion of "context-specific independence," where, for example, A is independent of B for some values of C but not for others. This representation can lead to significantly more parsimonious models than standard techniques. Geiger, Heckerman, and Meek are concerned with the problem of model selection for graphical models with hidden (unobserved) nodes. They develop asymptotic methods for approximating the marginal likelihood and demonstrate how to carry out the calculations for several cases of practical interest. The paper by Hinton, Sallans, and Ghahramani describes a graphical model called the "hierarchical community of experts" in which a collection of local linear models are used to fit data.
As opposed to mixture models, in which each data point is assumed to be generated from a single local model, their model allows a data point to be generated from an arbitrary subset of the available local models. Kearns, Mansour, and Ng provide a careful analysis of the relationships between EM and the K-means algorithm. They discuss an "information-modeling tradeoff," which characterizes the ability of an algorithm both to find balanced assignments of data to model components and to find a good overall fit to the data. Monti and Cooper discuss the problem of structural learning in networks with both discrete and continuous nodes. They are particularly concerned with the issue of the discretization of continuous data, and how this impacts the performance of a learning algorithm. Saul and Jordan present a method for unsupervised learning in layered neural networks based on mean field theory. They discuss a mean field approximation that is tailored to the case of large networks in which each node has a large number of parents. Smith and Whittaker discuss tests for conditional independence in graphical Gaussian models. They show that several of the appropriate statistics turn out to be functions of the sample partial correlation coefficient. They also develop asymptotic expansions for the distributions of the test statistics and compare their accuracy as a function of the dimensionality of the model. Spiegelhalter, Best, Gilks, and Inskip describe an application of graphical models to the real-life problem of assessing the effectiveness of an immunization program. They demonstrate the use of the graphical model formalism to represent statistical hypotheses of interest and show how Monte Carlo methods can be used for inference.
Finally, Williams provides an overview of Gaussian processes, deriving the Gaussian process approach from a Bayesian point of view, and showing how it can be applied to problems in nonlinear regression, classification, and hierarchical modeling.

This volume arose from the proceedings of the International School on Neural Nets "E.R. Caianiello," held at the Ettore Maiorana Centre for Scientific Culture in Erice, Italy, in September 1996. Lecturers from the school contributed chapters to the volume, and additional authors were asked to contribute chapters to provide a more complete and authoritative coverage of the field. All of the chapters have been carefully edited, following a review process in which each chapter was scrutinized by two anonymous reviewers and returned to authors for improvement.

There are a number of people to thank for their role in organizing the Erice meeting. First I would like to thank Maria Marinaro, who initiated the ongoing series of Schools to honor the memory of E.R. Caianiello, and who co-organized the first meeting. David Heckerman was also a co-organizer of the school, providing helpful advice and encouragement throughout. Anna Esposito at the University of Salerno also deserves sincere thanks for her help in organizing the meeting. The staff at the Ettore Maiorana Centre were exceedingly professional and helpful, initiating the attendees of the school into the wonders of Erice. Funding for the School was provided by the NATO Advanced Study Institute program; this program provided generous support that allowed nearly 80 students to attend the meeting.

I would also like to thank Jon Heiner, Thomas Hofmann, Nuria Oliver, Barbara Rosario, and Jon Yi for their help with preparing the final document. Finally, I would like to thank Barbara Rosario, whose fortuitous attendance as a participant at the Erice meeting rendered the future conditionally independent of the past.

Michael I.
Jordan

INTRODUCTION TO INFERENCE FOR BAYESIAN NETWORKS

ROBERT COWELL
City University, London. The School of Mathematics, Actuarial Science and Statistics, City University, Northampton Square, London EC1E OHT

1. Introduction

The field of Bayesian networks, and graphical models in general, has grown enormously over the last few years, with theoretical and computational developments in many areas. As a consequence there is now a fairly large set of theoretical concepts and results for newcomers to the field to learn. This tutorial aims to give an overview of some of these topics, which hopefully will provide such newcomers with a conceptual framework for following the more detailed and advanced work. It begins with a revision of some of the basic axioms of probability theory.

2. Basic axioms of probability

Probability theory, also known as inductive logic, is a system of reasoning under uncertainty, that is, under the absence of certainty. Within the Bayesian framework, probability is interpreted as a numerical measure of the degree of consistent belief in a proposition, consistency being with the data at hand. Early expert systems used deductive, or Boolean, logic, encapsulated by sets of production rules. Attempts were made to cope with uncertainty using probability theory, but the calculations became prohibitive, and the use of probability theory for inference in expert systems was abandoned. It is with the recent development of efficient computational algorithms that probability theory has had a revival within the AI community.

Let us begin with some basic axioms of probability theory. The probability of an event A, denoted by P(A), is a number in the interval [0, 1] which obeys the following axioms:

1. P(A) = 1 if and only if A is certain.
2. If A and B are mutually exclusive, then P(A or B) = P(A) + P(B).

We will be dealing exclusively with discrete random variables and their probability distributions.
Capital letters will denote a variable, or perhaps a set of variables; lower-case letters will denote values of variables. Thus suppose A is a random variable having a finite number of mutually exclusive states (a_1, ..., a_n). Then P(A) will be represented by a vector of non-negative real numbers P(A) = (x_1, ..., x_n), where P(A = a_i) = x_i is a scalar and Σ_i x_i = 1.

A basic concept is that of conditional probability, a statement of which takes the form: Given the event B = b, the probability of the event A = a is x, written P(A = a | B = b) = x. It is important to understand that this is not saying: "If B = b is true then the probability of A = a is x". Instead it says: "If B = b is true, and any other information to hand is irrelevant to A, then P(A = a) = x". (To see this, consider what the probabilities would be if the state of A was part of the extra information.)

Conditional probabilities are important for building Bayesian networks, as we shall see. But Bayesian networks are also built to facilitate the calculation of conditional probabilities, namely the conditional probabilities for variables of interest given the data (also called evidence) at hand. The fundamental rule of probability calculus is the product rule¹

P(A and B) = P(A | B)P(B).    (1)

This equation tells us how to combine conditional probabilities for individual variables to define joint probabilities for sets of variables.

¹ Or more generally, P(A and B | C) = P(A | B, C)P(B | C).

3. Bayes' theorem

The simplest form of Bayes' theorem relates the joint probability P(A and B), written as P(A, B), of two events or hypotheses A and B in terms of marginal and conditional probabilities:

P(A, B) = P(A | B)P(B) = P(B | A)P(A).    (2)

By rearrangement we easily obtain

P(A | B) = P(B | A)P(A) / P(B),    (3)

which is Bayes' theorem. This can be interpreted as follows. We are interested in A, and we begin with a prior probability P(A) for our belief about A, and then we observe B. Then Bayes' theorem, (3), tells us that our revised belief for A, the posterior probability P(A | B), is obtained by multiplying the prior P(A) by the ratio P(B | A)/P(B). The quantity P(B | A), as a function of varying A for fixed B, is called the likelihood of A. We can express this relationship in the form:

posterior ∝ prior × likelihood
P(A | B) ∝ P(A)P(B | A).

Figure 1 illustrates this prior-to-posterior inference process. [Figure 1. Bayesian inference as reversing the arrows.] Each diagram represents in a different way the joint distribution P(A, B); the first represents the prior beliefs while the third represents the posterior beliefs. Often we will think of A as a possible "cause" of the "effect" B; the downward arrow represents such a causal interpretation. The "inferential" upward arrow then represents an "argument against the causal flow", from the observed effect to the inferred cause. (We will not go into a definition of "causality" here.)

Bayesian networks are generally more complicated than the ones in Figure 1, but the general principles are the same in the following sense. A Bayesian network provides a model representation for the joint distribution of a set of variables in terms of conditional and prior probabilities, in which the orientations of the arrows represent influence, usually though not always of a causal nature, such that the conditional probabilities for these particular orientations are relatively straightforward to specify (from data or by eliciting from an expert). When data are observed, then typically an inference procedure is required.
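The prior-to-posterior update of equation (3) can be sketched in a few lines of code. The numbers below are invented purely for illustration (they do not come from the chapter): A has two states, and we observe an event B = b whose likelihood under each state is given.

```python
# Prior-to-posterior update for a discrete variable A after observing B = b.
# The prior and likelihood values are hypothetical, chosen only to
# illustrate equation (3): posterior ∝ prior × likelihood.

prior = {"a1": 0.6, "a2": 0.4}        # P(A)
likelihood = {"a1": 0.9, "a2": 0.2}   # P(B = b | A), the likelihood of A

# Marginal probability of the evidence: P(B = b) = sum_A P(B = b | A) P(A)
p_b = sum(likelihood[a] * prior[a] for a in prior)

# Bayes' theorem (3), applied state by state
posterior = {a: prior[a] * likelihood[a] / p_b for a in prior}

print(posterior)  # ≈ {'a1': 0.871, 'a2': 0.129}
```

Observing B = b raises the belief in a1 from 0.6 to about 0.87, because a1 assigns the evidence a much higher likelihood than a2 does.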
Inference involves calculating marginal probabilities conditional on the observed data using Bayes' theorem, which is diagrammatically equivalent to reversing one or more of the Bayesian network arrows. The algorithms which have been developed in recent years allow these calculations to be performed in an efficient and straightforward manner.

4. Simple inference problems

Let us now consider some simple examples of inference. The first is simply Bayes' theorem with evidence included on a simple two-node network; the remaining examples treat a simple three-node problem.

4.1. PROBLEM I

Suppose we have the simple model X → Y, and are given P(X), P(Y | X) and Y = y. The problem is to calculate P(X | Y = y). Now from P(X) and P(Y | X) we can calculate the marginal distribution P(Y) and hence P(Y = y). Applying Bayes' theorem we obtain

P(X | Y = y) = P(Y = y | X)P(X) / P(Y = y).    (4)

4.2. PROBLEM II

Suppose now we have a more complicated model in which X is a parent of both Y and Z: Z ← X → Y, with specified probabilities P(X), P(Y | X) and P(Z | X), and we observe Y = y. The problem is to calculate P(Z | Y = y). Note that the joint distribution is given by P(X, Y, Z) = P(Y | X)P(Z | X)P(X). A 'brute force' method is to calculate:

1. The joint distribution P(X, Y, Z).
2. The marginal distribution P(Y) and thence P(Y = y).
3. The marginal distribution P(Z, Y) and thence P(Z, Y = y).
4. P(Z | Y = y) = P(Z, Y = y)/P(Y = y).

An alternative method is to exploit the given factorization:

1. Calculate P(X | Y = y) = P(Y = y | X)P(X)/P(Y = y) using Bayes' theorem, where P(Y = y) = Σ_X P(Y = y | X)P(X).
2. Find P(Z | Y = y) = Σ_X P(Z | X)P(X | Y = y).

Note that the first step essentially reverses the arrow between X and Y. Although the two methods give the same answer, the second is generally more efficient. For example, suppose that all three variables have 10 states.
Then the first method, in explicitly calculating P(X, Y, Z), requires a table of 1000 states. In contrast, the largest table required for the second method has size 100. This gain in computational efficiency by exploiting the given factorizations is the basis of the arc-reversal method for solving influence diagrams.
