P1: RPS/ASH P2: RPS/ASH QC: RPS Data Mining and Knowledge Discovery KL411-04-Heckerman February 26, 1997 18:6

Data Mining and Knowledge Discovery 1, 79–119 (1997)
© 1997 Kluwer Academic Publishers. Manufactured in The Netherlands.

Bayesian Networks for Data Mining

DAVID HECKERMAN [email protected]
Microsoft Research, 9S, Redmond, WA 98052-6399

Editor: Usama Fayyad

Received June 27, 1996; Revised November 5, 1996; Accepted November 5, 1996

Abstract. A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data modeling. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data. In this paper, we discuss methods for constructing Bayesian networks from prior knowledge and summarize Bayesian statistical methods for using data to improve these models. With regard to the latter task, we describe methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, we relate Bayesian-network methods for learning to techniques for supervised and unsupervised learning. We illustrate the graphical-modeling approach using a real-world case study.

Keywords: Bayesian networks, Bayesian statistics, learning, missing data, classification, regression, clustering, causal discovery

1. Introduction

A Bayesian network is a graphical model for probabilistic relationships among a set of variables.
Over the last decade, the Bayesian network has become a popular representation for encoding uncertain expert knowledge in expert systems (Heckerman et al., 1995a). More recently, researchers have developed methods for learning Bayesian networks from data. The techniques that have been developed are new and still evolving, but they have been shown to be remarkably effective for some data-modeling problems.

In this paper, we provide a tutorial on Bayesian networks and associated Bayesian techniques for data mining, the process of extracting knowledge from data. There are numerous representations available for data mining, including rule bases, decision trees, and artificial neural networks; and there are many techniques for data mining such as density estimation, classification, regression, and clustering. So what do Bayesian networks and Bayesian methods have to offer? There are at least four answers.

One, Bayesian networks can readily handle incomplete data sets. For example, consider a classification or regression problem where two of the explanatory or input variables are strongly anti-correlated. This correlation is not a problem for standard supervised learning techniques, provided all inputs are measured in every case. When one of the inputs is not observed, however, many models will produce an inaccurate prediction, because they do not encode the correlation between the input variables. Bayesian networks offer a natural way to encode such dependencies.

Two, Bayesian networks allow one to learn about causal relationships. Learning about causal relationships is important for at least two reasons. The process is useful when we are trying to gain understanding about a problem domain, for example, during exploratory data analysis. In addition, knowledge of causal relationships allows us to make predictions in the presence of interventions. For example, a marketing analyst may want to know whether or not it is worthwhile to increase exposure of a particular advertisement in order to increase the sales of a product.
To answer this question, the analyst can determine whether or not the advertisement is a cause for increased sales, and to what degree. The use of Bayesian networks helps to answer such questions even when no experiment about the effects of increased exposure is available.

Three, Bayesian networks in conjunction with Bayesian statistical techniques facilitate the combination of domain knowledge and data. Anyone who has performed a real-world modeling task knows the importance of prior or domain knowledge, especially when data is scarce or expensive. The fact that some commercial systems (i.e., expert systems) can be built from prior knowledge alone is a testament to the power of prior knowledge. Bayesian networks have a causal semantics that makes the encoding of causal prior knowledge particularly straightforward. In addition, Bayesian networks encode the strength of causal relationships with probabilities. Consequently, prior knowledge and data can be combined with well-studied techniques from Bayesian statistics.

Four, Bayesian methods in conjunction with Bayesian networks and other types of models offer an efficient and principled approach for avoiding the overfitting of data.

This tutorial is organized as follows. In Section 2, we discuss the Bayesian interpretation of probability and review methods from Bayesian statistics for combining prior knowledge with data. In Section 3, we describe Bayesian networks and discuss how they can be constructed from prior knowledge alone. In Section 4, we discuss algorithms for probabilistic inference in a Bayesian network. In Sections 5 and 6, we show how to learn the probabilities in a fixed Bayesian-network structure, and describe techniques for handling incomplete data including Monte-Carlo methods and the Gaussian approximation. In Sections 7 through 10, we show how to learn both the probabilities and structure of a Bayesian network. Topics discussed include methods for assessing priors for Bayesian-network structure and parameters, and methods for avoiding the overfitting of data including Monte-Carlo, Laplace, BIC, and MDL approximations.
In Sections 11 and 12, we describe the relationships between Bayesian-network techniques and methods for supervised and unsupervised learning. In Section 13, we show how Bayesian networks facilitate the learning of causal relationships. In Section 14, we illustrate techniques discussed in the tutorial using a real-world case study. In Section 15, we give pointers to software and additional literature.

2. The Bayesian approach to probability and statistics

To understand Bayesian networks and associated data-mining techniques, it is important to understand the Bayesian approach to probability and statistics. In this section, we provide an introduction to the Bayesian approach for those readers familiar only with the classical view.

In a nutshell, the Bayesian probability of an event $x$ is a person's degree of belief in that event. Whereas a classical probability is a physical property of the world (e.g., the probability that a coin will land heads), a Bayesian probability is a property of the person who assigns the probability (e.g., your degree of belief that the coin will land heads). To keep these two concepts of probability distinct, we refer to the classical probability of an event as the true or physical probability of that event, and refer to a degree of belief in an event as a Bayesian or personal probability. Alternatively, when the meaning is clear, we refer to a Bayesian probability simply as a probability.

One important difference between physical probability and personal probability is that, to measure the latter, we do not need repeated trials. For example, imagine the repeated tosses of a sugar cube onto a wet surface. Every time the cube is tossed, its dimensions will change slightly. Thus, although the classical statistician has a hard time measuring the probability that the cube will land with a particular face up, the Bayesian simply restricts his or her attention to the next toss, and assigns a probability.
As another example, consider the question: What is the probability that the Chicago Bulls will win the championship in 2001? Here, the classical statistician must remain silent, whereas the Bayesian can assign a probability (and perhaps make a bit of money in the process).

One common criticism of the Bayesian definition of probability is that probabilities seem arbitrary. Why should degrees of belief satisfy the rules of probability? On what scale should probabilities be measured? In particular, it makes sense to assign a probability of one (zero) to an event that will (not) occur, but what probabilities do we assign to beliefs that are not at the extremes? Not surprisingly, these questions have been studied intensely.

With regard to the first question, many researchers have suggested different sets of properties that should be satisfied by degrees of belief (e.g., Ramsey, 1931; Cox, 1946; Good, 1950; Savage, 1954; DeFinetti, 1970). It turns out that each set of properties leads to the same rules: the rules of probability. Although each set of properties is in itself compelling, the fact that different sets all lead to the rules of probability provides a particularly strong argument for using probability to measure beliefs.

The answer to the question of scale follows from a simple observation: people find it fairly easy to say that two events are equally likely. For example, imagine a simplified wheel of fortune having only two regions (shaded and not shaded), such as the one illustrated in figure 1. Assuming everything about the wheel is symmetric (except for shading), you should conclude that it is equally likely for the wheel to stop in any one position. From this judgment and the sum rule of probability (probabilities of mutually exclusive and collectively exhaustive events sum to one), it follows that your probability that the wheel will stop in the shaded region is the percent area of the wheel that is shaded (in this case, 0.3).

Figure 1. The probability wheel: a tool for assessing probabilities.

This probability wheel now provides a reference for measuring your probabilities of other events. For example, what is your probability that Al Gore will run on the Democratic ticket in 2000? First, ask yourself the question: Is it more likely that Gore will run or that the wheel when spun will stop in the shaded region? If you think that it is more likely that Gore will run, then imagine another wheel where the shaded region is larger. If you think that it is more likely that the wheel will stop in the shaded region, then imagine another wheel where the shaded region is smaller. Now, repeat this process until you think that Gore running and the wheel stopping in the shaded region are equally likely. At this point, your probability that Gore will run is just the percent surface area of the shaded area on the wheel.

In general, the process of measuring a degree of belief is commonly referred to as a probability assessment. The technique for assessment that we have just described is one of many available techniques discussed in the Management Science, Operations Research, and Psychology literature. One problem with probability assessment that is addressed in this literature is that of precision. Can one really say that his or her probability for event $x$ is 0.601 and not 0.599? In most cases, no. Nonetheless, in most cases, probabilities are used to make decisions, and these decisions are not sensitive to small variations in probabilities. Well-established practices of sensitivity analysis help one to know when additional precision is unnecessary (e.g., Howard and Matheson, 1983). Another problem with probability assessment is that of accuracy. For example, recent experiences or the way a question is phrased can lead to assessments that do not reflect a person's true beliefs (Tversky and Kahneman, 1974). Methods for improving accuracy can be found in the decision-analysis literature (e.g., Spetzler et al., 1975).

Now let us turn to the issue of learning with data. To illustrate the Bayesian approach, consider a common thumbtack: one with a round, flat head that can be found in most supermarkets.
If we throw the thumbtack up in the air, it will come to rest either on its point (heads) or on its head (tails).¹ Suppose we flip the thumbtack $N+1$ times, making sure that the physical properties of the thumbtack and the conditions under which it is flipped remain stable over time. From the first $N$ observations, we want to determine the probability of heads on the $(N+1)$th toss.

In the classical analysis of this problem, we assert that there is some physical probability of heads, which is unknown. We estimate this physical probability from the $N$ observations using criteria such as low bias and low variance. We then use this estimate as our probability for heads on the $(N+1)$th toss. In the Bayesian approach, we also assert that there is some physical probability of heads, but we encode our uncertainty about this physical probability using (Bayesian) probabilities, and use the rules of probability to compute our probability of heads on the $(N+1)$th toss.²

To examine the Bayesian analysis of this problem, we need some notation. We denote a variable by an upper-case letter (e.g., $X, Y, X_i, \Theta$), and the state or value of a corresponding variable by that same letter in lower case (e.g., $x, y, x_i, \theta$). We denote a set of variables by a bold-face upper-case letter (e.g., $\mathbf{X}, \mathbf{Y}, \mathbf{X}_i$). We use a corresponding bold-face lower-case letter (e.g., $\mathbf{x}, \mathbf{y}, \mathbf{x}_i$) to denote an assignment of state or value to each variable in a given set. We say that variable set $\mathbf{X}$ is in configuration $\mathbf{x}$. We use $p(X = x \mid \xi)$ (or $p(x \mid \xi)$ as a shorthand) to denote the probability that $X = x$ of a person with state of information $\xi$. We also use $p(x \mid \xi)$ to denote the probability distribution for $X$ (both mass functions and density functions). Whether $p(x \mid \xi)$ refers to a probability, a probability density, or a probability distribution will be clear from context. We use this notation for probability throughout the paper.

Returning to the thumbtack problem, we define $\Theta$ to be a variable³ whose values $\theta$ correspond to the possible true values of the physical probability.
We sometimes refer to $\theta$ as a parameter. We express the uncertainty about $\Theta$ using the probability density function $p(\theta \mid \xi)$. In addition, we use $X_l$ to denote the variable representing the outcome of the $l$th flip, $l = 1, \ldots, N+1$, and $D = \{X_1 = x_1, \ldots, X_N = x_N\}$ to denote the set of our observations. Thus, in Bayesian terms, the thumbtack problem reduces to computing $p(x_{N+1} \mid D, \xi)$ from $p(\theta \mid \xi)$.

To do so, we first use Bayes' rule to obtain the probability distribution for $\Theta$ given $D$ and background knowledge $\xi$:

$$p(\theta \mid D, \xi) = \frac{p(\theta \mid \xi)\, p(D \mid \theta, \xi)}{p(D \mid \xi)} \tag{1}$$

where

$$p(D \mid \xi) = \int p(D \mid \theta, \xi)\, p(\theta \mid \xi)\, d\theta \tag{2}$$

Next, we expand the term $p(D \mid \theta, \xi)$. Both Bayesians and classical statisticians agree on this term: it is the likelihood function for binomial sampling. In particular, given the value of $\Theta$, the observations in $D$ are mutually independent, and the probability of heads (tails) on any one observation is $\theta$ ($1 - \theta$). Consequently, Eq. (1) becomes

$$p(\theta \mid D, \xi) = \frac{p(\theta \mid \xi)\, \theta^{h} (1 - \theta)^{t}}{p(D \mid \xi)} \tag{3}$$

where $h$ and $t$ are the number of heads and tails observed in $D$, respectively. The probability distributions $p(\theta \mid \xi)$ and $p(\theta \mid D, \xi)$ are commonly referred to as the prior and posterior for $\Theta$, respectively. The quantities $h$ and $t$ are said to be sufficient statistics for binomial sampling, because they provide a summarization of the data that is sufficient to compute the posterior from the prior. Finally, we average over the possible values of $\Theta$ (using the expansion rule of probability) to determine the probability that the $(N+1)$th toss of the thumbtack will come up heads:

$$p(X_{N+1} = \text{heads} \mid D, \xi) = \int p(X_{N+1} = \text{heads} \mid \theta, \xi)\, p(\theta \mid D, \xi)\, d\theta = \int \theta\, p(\theta \mid D, \xi)\, d\theta \equiv E_{p(\theta \mid D, \xi)}(\theta) \tag{4}$$

where $E_{p(\theta \mid D, \xi)}(\theta)$ denotes the expectation of $\theta$ with respect to the distribution $p(\theta \mid D, \xi)$.

Figure 2. Several beta distributions.

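
The computation in Eqs. (1) through (4) can be sketched numerically for any prior density, not only a conjugate one. The following is a minimal illustration rather than a method from the paper: it assumes a uniform prior on $\theta$, approximates the integrals with a midpoint rule on a grid, and uses a hypothetical function name (`posterior_predictive_heads`) and hypothetical counts (3 heads, 7 tails).

```python
def posterior_predictive_heads(prior_density, heads, tails, grid_size=10_000):
    """Approximate p(X_{N+1} = heads | D) and p(D) by midpoint-rule integration.

    prior_density: callable giving p(theta) for theta in (0, 1) (an assumption
    of this sketch; any normalized density on the unit interval will do).
    """
    step = 1.0 / grid_size
    thetas = [(i + 0.5) * step for i in range(grid_size)]
    # Numerator of Eq. (3): prior times binomial likelihood theta^h (1 - theta)^t.
    weights = [prior_density(th) * th**heads * (1.0 - th) ** tails for th in thetas]
    # Eq. (2): the normalizing constant p(D).
    evidence = sum(weights) * step
    # Eq. (4): posterior expectation of theta.
    predictive = sum(th * w for th, w in zip(thetas, weights)) * step / evidence
    return predictive, evidence


def uniform_prior(theta):
    """Hypothetical prior for the illustration: constant density on (0, 1)."""
    return 1.0


predictive, evidence = posterior_predictive_heads(uniform_prior, heads=3, tails=7)
print(round(predictive, 4))  # close to 4/12, the mean of the resulting posterior
```

Only the counts `heads` and `tails` enter the computation, which is the sense in which they are sufficient statistics for binomial sampling.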
To complete the Bayesian story for this example, we need a method to assess the prior distribution for $\Theta$. A common approach, usually adopted for convenience, is to assume that this distribution is a beta distribution:

$$p(\theta \mid \xi) = \mathrm{Beta}(\theta \mid \alpha_h, \alpha_t) \equiv \frac{\Gamma(\alpha)}{\Gamma(\alpha_h)\,\Gamma(\alpha_t)}\, \theta^{\alpha_h - 1} (1 - \theta)^{\alpha_t - 1} \tag{5}$$

where $\alpha_h > 0$ and $\alpha_t > 0$ are the parameters of the beta distribution, $\alpha = \alpha_h + \alpha_t$, and $\Gamma(\cdot)$ is the Gamma function which satisfies $\Gamma(x + 1) = x\,\Gamma(x)$ and $\Gamma(1) = 1$. The quantities $\alpha_h$ and $\alpha_t$ are often referred to as hyperparameters to distinguish them from the parameter $\theta$. The hyperparameters $\alpha_h$ and $\alpha_t$ must be greater than zero so that the distribution can be normalized. Examples of beta distributions are shown in figure 2.

The beta prior is convenient for several reasons. By Eq. (3), the posterior distribution will also be a beta distribution:

$$p(\theta \mid D, \xi) = \frac{\Gamma(\alpha + N)}{\Gamma(\alpha_h + h)\,\Gamma(\alpha_t + t)}\, \theta^{\alpha_h + h - 1} (1 - \theta)^{\alpha_t + t - 1} = \mathrm{Beta}(\theta \mid \alpha_h + h, \alpha_t + t) \tag{6}$$

We say that the set of beta distributions is a conjugate family of distributions for binomial sampling. Also, the expectation of $\theta$ with respect to this distribution has a simple form:

$$\int \theta\, \mathrm{Beta}(\theta \mid \alpha_h, \alpha_t)\, d\theta = \frac{\alpha_h}{\alpha} \tag{7}$$

Hence, given a beta prior, we have a simple expression for the probability of heads in the $(N+1)$th toss:

$$p(X_{N+1} = \text{heads} \mid D, \xi) = \frac{\alpha_h + h}{\alpha + N} \tag{8}$$

Assuming $p(\theta \mid \xi)$ is a beta distribution, it can be assessed in a number of ways. For example, we can assess our probability for heads in the first toss of the thumbtack (e.g., using a probability wheel). Next, we can imagine having seen the outcomes of $k$ flips, and reassess our probability for heads in the next toss. From Eq. (8), we have (for $k = 1$)

$$p(X_1 = \text{heads} \mid \xi) = \frac{\alpha_h}{\alpha_h + \alpha_t}, \qquad p(X_2 = \text{heads} \mid X_1 = \text{heads}, \xi) = \frac{\alpha_h + 1}{\alpha_h + \alpha_t + 1}$$

Given these probabilities, we can solve for $\alpha_h$ and $\alpha_t$. This assessment technique is known as the method of imagined future data.

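
The conjugate update of Eq. (6), the predictive probability of Eq. (8), and the method of imagined future data for $k = 1$ can be sketched in a few lines. This is an illustrative sketch, not code from the paper: the function names and the assessed probabilities (1/2 and 2/3) are hypothetical, and the solver simply rearranges the two equations above to $\alpha = (1 - p_2)/(p_2 - p_1)$, $\alpha_h = p_1 \alpha$, $\alpha_t = (1 - p_1)\alpha$.

```python
from fractions import Fraction


def beta_update(alpha_h, alpha_t, heads, tails):
    """Eq. (6): a Beta(alpha_h, alpha_t) prior becomes Beta(alpha_h + h, alpha_t + t)."""
    return alpha_h + heads, alpha_t + tails


def prob_heads(alpha_h, alpha_t):
    """Eqs. (7) and (8): probability of heads on the next toss under a beta distribution."""
    return alpha_h / (alpha_h + alpha_t)


def hyperparameters_from_imagined_data(p1, p2):
    """Solve for (alpha_h, alpha_t) from two assessed probabilities.

    p1: assessed probability of heads on the first toss.
    p2: assessed probability of heads on the second toss, after imagining
        that the first toss came up heads.
    """
    if not 0 < p1 < p2 < 1:
        raise ValueError("a beta prior requires 0 < p1 < p2 < 1")
    alpha = (1 - p2) / (p2 - p1)
    return p1 * alpha, (1 - p1) * alpha


# Hypothetical assessments: 1/2 before any toss, 2/3 after one imagined head.
alpha_h, alpha_t = hyperparameters_from_imagined_data(Fraction(1, 2), Fraction(2, 3))
posterior = beta_update(alpha_h, alpha_t, heads=3, tails=7)
print(alpha_h, alpha_t, prob_heads(*posterior))  # 1 1 1/3
```

Exact rational arithmetic is used here only so the recovered hyperparameters can be checked by hand; floating-point inputs work equally well.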
Another assessment method is based on Eq. (6). This equation says that, if we start with a $\mathrm{Beta}(0, 0)$ prior⁴ and observe $\alpha_h$ heads and $\alpha_t$ tails, then our posterior (i.e., new prior) will be a $\mathrm{Beta}(\alpha_h, \alpha_t)$ distribution. Recognizing that a $\mathrm{Beta}(0, 0)$ prior encodes a state of minimum information, we can assess $\alpha_h$ and $\alpha_t$ by determining the (possibly fractional) number of observations of heads and tails that is equivalent to our actual knowledge about flipping thumbtacks. Alternatively, we can assess $p(X_1 = \text{heads} \mid \xi)$ and $\alpha$, which can be regarded as an equivalent sample size for our current knowledge. This technique is known as the method of equivalent samples. Other techniques for assessing beta distributions are discussed by Winkler (1967) and Chaloner and Duncan (1983).

Although the beta prior is convenient, it is not accurate for some problems. For example, suppose we think that the thumbtack may have been purchased at a magic shop. In this case, a more appropriate prior may be a mixture of beta distributions, for example,

$$p(\theta \mid \xi) = 0.4\, \mathrm{Beta}(\theta \mid 20, 1) + 0.4\, \mathrm{Beta}(\theta \mid 1, 20) + 0.2\, \mathrm{Beta}(\theta \mid 2, 2)$$

where 0.4 is our probability that the thumbtack is heavily weighted toward heads (tails). In effect, we have introduced an additional hidden or unobserved variable $H$, whose states correspond to the three possibilities: (1) thumbtack is biased toward heads, (2) thumbtack is biased toward tails, and (3) thumbtack is normal; and we have asserted that $\theta$ conditioned on each state of $H$ is a beta distribution. In general, there are simple methods (e.g., the method of imagined future data) for determining whether or not a beta prior is an accurate reflection of one's beliefs. In those cases where the beta prior is inaccurate, an accurate prior can often be assessed by introducing additional hidden variables, as in this example.

So far, we have only considered observations drawn from a binomial distribution. In general, observations may be drawn from any physical probability distribution:

$$p(x \mid \theta, \xi) = f(x, \theta)$$

where $f(x, \theta)$ is the likelihood function with parameters $\theta$.
For purposes of this discussion, we assume that the number of parameters is finite. As an example, $X$ may be a continuous variable and have a Gaussian physical probability distribution with mean $\mu$ and variance $v$:

$$p(x \mid \theta, \xi) = (2\pi v)^{-1/2}\, e^{-(x - \mu)^2 / 2v}$$

where $\theta = \{\mu, v\}$.

Regardless of the functional form, we can learn about the parameters given data using the Bayesian approach. As we have done in the binomial case, we define variables corresponding to the unknown parameters, assign priors to these variables, and use Bayes' rule to update our beliefs about these parameters given data:

$$p(\theta \mid D, \xi) = \frac{p(D \mid \theta, \xi)\, p(\theta \mid \xi)}{p(D \mid \xi)} \tag{9}$$

We then average over the possible values of $\Theta$ to make predictions. For example,

$$p(x_{N+1} \mid D, \xi) = \int p(x_{N+1} \mid \theta, \xi)\, p(\theta \mid D, \xi)\, d\theta \tag{10}$$

For a class of distributions known as the exponential family, these computations can be done efficiently and in closed form.⁵ Members of this class include the binomial, multinomial, normal, Gamma, Poisson, and multivariate-normal distributions. Each member of this family has sufficient statistics that are of fixed dimension for any random sample, and a simple conjugate prior.⁶ Bernardo and Smith (pp. 436–442, 1994) have compiled the important quantities and Bayesian computations for commonly used members of the exponential family. Here, we summarize these items for multinomial sampling, which we use to illustrate many of the ideas in this paper.

In multinomial sampling, the observed variable $X$ is discrete, having $r$ possible states $x^1, \ldots, x^r$. The likelihood function is given by

$$p(X = x^k \mid \theta, \xi) = \theta_k, \qquad k = 1, \ldots, r$$

where $\theta = \{\theta_2, \ldots, \theta_r\}$ are the parameters. (The parameter $\theta_1$ is given by $1 - \sum_{k=2}^{r} \theta_k$.) In this case, as in the case of binomial sampling, the parameters correspond to physical probabilities. The sufficient statistics for data set $D = \{X_1 = x_1, \ldots, X_N = x_N\}$ are $\{N_1, \ldots, N_r\}$, where $N_k$ is the number of times $X = x^k$ in $D$.
The simple conjugate prior used with multinomial sampling is the Dirichlet distribution:

$$p(\theta \mid \xi) = \mathrm{Dir}(\theta \mid \alpha_1, \ldots, \alpha_r) \equiv \frac{\Gamma(\alpha)}{\prod_{k=1}^{r} \Gamma(\alpha_k)} \prod_{k=1}^{r} \theta_k^{\alpha_k - 1} \tag{11}$$

where $\alpha = \sum_{k=1}^{r} \alpha_k$, and $\alpha_k > 0$, $k = 1, \ldots, r$. The posterior distribution is $p(\theta \mid D, \xi) = \mathrm{Dir}(\theta \mid \alpha_1 + N_1, \ldots, \alpha_r + N_r)$. Techniques for assessing the beta distribution, including the methods of imagined future data and equivalent samples, can also be used to assess Dirichlet distributions. Given this conjugate prior and data set $D$, the probability distribution for the next observation is given by

$$p(X_{N+1} = x^k \mid D, \xi) = \int \theta_k\, \mathrm{Dir}(\theta \mid \alpha_1 + N_1, \ldots, \alpha_r + N_r)\, d\theta = \frac{\alpha_k + N_k}{\alpha + N} \tag{12}$$

As we shall see, another important quantity in Bayesian analysis is the marginal likelihood or evidence $p(D \mid \xi)$. In this case, we have

$$p(D \mid \xi) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \cdot \prod_{k=1}^{r} \frac{\Gamma(\alpha_k + N_k)}{\Gamma(\alpha_k)} \tag{13}$$

We note that the explicit mention of the state of knowledge $\xi$ is useful, because it reinforces the notion that probabilities are subjective. Nonetheless, once this concept is firmly in place, the notation simply adds clutter. In the remainder of this tutorial, we shall not mention $\xi$ explicitly.

In closing this section, we emphasize that, although the Bayesian and classical approaches may sometimes yield the same prediction, they are fundamentally different methods for learning from data. As an illustration, let us revisit the thumbtack problem. Here, the Bayesian "estimate" for the physical probability of heads is obtained in a manner that is essentially the opposite of the classical approach.

Namely, in the classical approach, $\theta$ is fixed (albeit unknown), and we imagine all data sets of size $N$ that may be generated by sampling from the binomial distribution determined by $\theta$. Each data set $D$ will occur with some probability $p(D \mid \theta)$ and will produce an estimate $\theta^*(D)$.
To evaluate an estimator, we compute the expectation and variance of the estimate with respect to all such data sets:

$$E_{p(D \mid \theta)}(\theta^*) = \sum_{D} p(D \mid \theta)\, \theta^*(D)$$
$$\mathrm{Var}_{p(D \mid \theta)}(\theta^*) = \sum_{D} p(D \mid \theta)\, \bigl(\theta^*(D) - E_{p(D \mid \theta)}(\theta^*)\bigr)^2 \tag{14}$$

We then choose an estimator that somehow balances the bias ($\theta - E_{p(D \mid \theta)}(\theta^*)$) and variance of these estimates over the possible values for $\theta$.⁷ Finally, we apply this estimator to the data set that we actually observe. A commonly-used estimator is the maximum-likelihood (ML) estimator, which selects the value of $\theta$ that maximizes the likelihood $p(D \mid \theta)$. For binomial sampling (written in the multinomial notation above, with $r = 2$), we have

$$\theta^*_{ML}(D) = \frac{N_k}{\sum_{j=1}^{r} N_j}$$

In contrast, in the Bayesian approach, $D$ is fixed, and we imagine all possible values of $\theta$ from which this data set could have been generated. Given $\theta$, the "estimate" of the physical probability of heads is just $\theta$ itself. Nonetheless, we are uncertain about $\theta$, and so our final estimate is the expectation of $\theta$ with respect to our posterior beliefs about its value:

$$E_{p(\theta \mid D, \xi)}(\theta) = \int \theta\, p(\theta \mid D, \xi)\, d\theta \tag{15}$$

The expectations in Eqs. (14) and (15) are different and, in many cases, lead to different "estimates". One way to frame this difference is to say that the classical and Bayesian approaches have different definitions for what it means to be a good estimator. Both solutions are "correct" in that they are self consistent. Unfortunately, both methods have their drawbacks, which has led to endless debates about the merit of each approach. For example, Bayesians argue that it does not make sense to consider the expectations in Eq. (14), because we only see a single data set. If we saw more than one data set, we should combine them into one larger data set. In contrast, classical statisticians argue that sufficiently accurate priors cannot be assessed in many situations.
The common view that seems to be emerging is that one should use whichever method is most sensible for the task at hand. We share this view, although we also believe that the Bayesian approach has been underused, especially in light of its advantages mentioned in the introduction (points three and four). Consequently, in this paper, we concentrate on the Bayesian approach.

3. Bayesian networks

So far, we have considered only simple problems with one or a few variables. In real data-mining problems, however, we are typically interested in looking for relationships among a large number of variables. The Bayesian network is a representation suited to this task. It is a graphical model that efficiently encodes the joint probability distribution (physical or Bayesian) for a large set of variables. In this section, we define a Bayesian network and show how one can be constructed from prior knowledge.

A Bayesian network for a set of variables $\mathbf{X} = \{X_1, \ldots, X_n\}$ consists of (1) a network structure $S$ that encodes a set of conditional independence assertions about variables in $\mathbf{X}$, and (2) a set $P$ of local probability distributions associated with each variable. Together, these components define the joint probability distribution for $\mathbf{X}$. The network structure $S$ is a directed acyclic graph. The nodes in $S$ are in one-to-one correspondence with the variables $\mathbf{X}$. We use $X_i$ to denote both the variable and its corresponding node, and $\mathbf{Pa}_i$ to denote the parents of node $X_i$ in $S$ as well as the variables corresponding to those parents. The lack of possible arcs in $S$ encodes conditional independencies. In particular, given structure $S$, the joint probability distribution for $\mathbf{X}$ is given by

$$p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_i) \tag{16}$$

The local probability distributions $P$ are the distributions corresponding to the terms in the product of Eq. (16). Consequently, the pair $(S, P)$ encodes the joint distribution $p(\mathbf{x})$.

The probabilities encoded by a Bayesian network may be Bayesian or physical. When building Bayesian networks from prior knowledge alone, the probabilities will be Bayesian. When learning these networks from data, the probabilities will be physical (and their values may be uncertain).
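
The factorization in Eq. (16) can be made concrete with a small sketch. Everything specific below is a hypothetical illustration rather than an example from the paper: three binary variables `A`, `B`, and `C`, a structure in which `A` and `B` are the parents of `C`, and made-up local probabilities. The pair of dictionaries plays the role of $(S, P)$.

```python
from itertools import product

# Hypothetical structure S: A -> C <- B. Each variable maps to its tuple of parents.
parents = {"A": (), "B": (), "C": ("A", "B")}

# Hypothetical local distributions P: p(X_i = True | pa_i), keyed by parent values.
p_true = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {
        (True, True): 0.9,
        (True, False): 0.5,
        (False, True): 0.4,
        (False, False): 0.1,
    },
}


def joint_probability(assignment):
    """Eq. (16): multiply the local term p(x_i | pa_i) for every variable."""
    prob = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        p = p_true[var][pa_values]
        prob *= p if assignment[var] else 1.0 - p
    return prob


def all_assignments():
    """Enumerate every configuration x of the variable set."""
    names = list(parents)
    for values in product([True, False], repeat=len(names)):
        yield dict(zip(names, values))


total = sum(joint_probability(a) for a in all_assignments())
print(round(total, 12))  # the joint distribution sums to one
```

The structure stores 1 + 1 + 4 = 6 local probabilities instead of the 7 free entries of an unconstrained joint table over three binary variables; the saving grows quickly as more arcs are absent.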
In subsequent sections, we describe how we can learn the structure and probabilities of a Bayesian network from data. In the remainder of this section, we explore the construction of Bayesian networks from prior knowledge. As we shall see in Section 10, this procedure can be useful in learning Bayesian networks as well.

To illustrate the process of building a Bayesian network, consider the problem of detecting credit-card fraud. We begin by determining the variables to model. One possible choice