Learning, Bayesian Probability, Graphical Models and Abduction

David Poole
poole@cs.ubc.ca
Department of Computer Science
University of British Columbia
2366 Main Mall
Vancouver, B.C., Canada V6T 1Z4
http://www.cs.ubc.ca/spider/poole

Abstract

In this chapter I review Bayesian statistics as used for induction and relate it to logic-based abduction. Much reasoning under uncertainty, including induction, is based on Bayes' rule. Bayes' rule is interesting precisely because it provides a mechanism for abduction. I review work of Buntine that argues that much of the work on Bayesian learning can be best viewed in terms of graphical models such as Bayesian networks, and review previous work of Poole that relates Bayesian networks to logic-based abduction. This lets us see how much of the work on induction can be viewed in terms of logic-based abduction. I then explore what this means for extending logic-based abduction to richer representations, such as learning decision trees with probabilities at the leaves. Much of this paper is tutorial in nature; both the probabilistic and logic-based notions of abduction and induction are introduced and motivated.

1 Introduction

This paper explores the relationship between learning (induction) and abduction. I take what can be called the Bayesian view, where all uncertainty is reflected in probabilities. In this paper I argue that not only can abduction be used for induction, but that most current learning techniques (from statistical learning to neural networks to decision trees to inductive logic programming to unsupervised learning) can be best viewed in terms of abduction.

1.1 Causal and Evidential Modelling and Reasoning

In order to understand abduction and its role in reasoning, it is important to understand ways to model, as well as ways to reason. In this section we consider reasoning strategies independently of learning, and return to learning in Section 1.2.

Many reasoning problems can be best understood as evidential reasoning tasks.

Definition 1.1 An evidential reasoning task is one where some parts of a system are observed and you want to make inferences about other (hidden) parts.

Example 1.2 The problem of diagnosis is an evidential reasoning task. Given observations about the symptoms of a patient or artifact, we want to determine what is going on inside the system to produce those symptoms.

Example 1.3 The problem of perception (including vision) is an evidential reasoning task. In the world the scene produces the image, but the problem of vision is, given an image, to determine what is in the scene.

Evidential reasoning tasks are often of the form where there is a cause-effect relationship between the parts. In diagnosis we can think of the disease causing the symptoms. In vision we can think of the scene causing the image. By causation (see http://singapore.cs.ucla.edu/LECTURE/lecture_sec1.htm for a fascinating lecture by Judea Pearl on causation), I mean that different diseases can result in different symptoms (but changing the symptoms doesn't affect the disease) and different scenes can result in different images (but manipulating an image doesn't affect the scene).
There are a number of different ways of modelling such a causal domain:

causal modelling: where we model the function from causes to effects. For example, we can model how diseases or faults manifest their symptoms. We can model how scenes produce images.

evidential modelling: where we model the function from effects to causes. For example, we can model the mapping from symptoms to diseases, or from image to scene.

[Figure 1: Causal and evidential reasoning. Evidential reasoning maps an observation to its causes; causal reasoning maps causes to predictions.]

Independently of these two modelling strategies, we can consider two reasoning tasks:

Evidential Reasoning: given an observation of the effects, determine the causes. For example, determine the disease from the symptoms, or the scene from the image.

Causal Reasoning: given some cause, make a prediction of the effects. For example, predicting symptoms or prognoses from a disease, or predicting an image from a scene. This is often called simulation.

In particular, much reasoning consists of evidential reasoning followed by causal reasoning (see Figure 1). For example, a doctor may observe a patient, determine possible diseases, then make predictions of other symptoms or prognoses. This can then feed back into making the doctor look for the presence or absence of these symptoms, forming the cycle of perception [20]. Similarly, a robot can observe its world, determine what is where, and act on its beliefs, leading to further observations.

There are a number of combinations of modelling and reasoning strategies that have been proposed:

- The simplest strategy is to do evidential modelling and only evidential reasoning. Examples of this are neural networks [15] and old-fashioned expert systems such as Mycin [2]. A neural network for character recognition may be able to recognise an "A" from a bitmap, but could not say what an "A" looks like. In Mycin there are rules leading from the symptoms to the diseases, but the system can't tell you what the symptoms of some disease are.

- The second strategy is to model both causally and evidentially and to use the causal model for causal reasoning and the evidential model for evidential reasoning. The main problem with this is the redundancy of the knowledge, and its associated problem of consistency, although there are techniques for automatically inferring the evidential model from the causal model for limited cases [25, 7, 16, 29]. Pearl [23] has pointed out how naive representations of evidential and causal knowledge can lead to problems.
- The third strategy is to model causally and use different reasoning strategies for causal and evidential reasoning. For causal reasoning we can directly use the causal model, and for evidential reasoning we can use abduction.

This leads to an abstract formulation of abduction that includes both the logical and the probabilistic formulations of abduction:

Definition 1.4 Abduction is the use of a model in its opposite direction. That is, if a model specifies how x gives a y, abduction lets us infer x from y.

Abduction is usually evidential reasoning from a causal model. (Neither the standard logical definition of abduction nor the probabilistic version of abduction, presented below, prescribes that the given knowledge is causal. It shouldn't be surprising that the formal definitions don't depend on the knowledge base being causal, as the causal relationship is a modelling assumption; we don't want the logic to impose arbitrary restrictions on modelling.) If we have a model of how causes produce effects, abduction lets us infer causes from effects. Abduction depends on an implicit assumption of complete knowledge of possible causes [7, 29]: when an effect is observed, one of its causes must be present.
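To make Definition 1.4 concrete, here is a minimal sketch of logic-based abduction in Python. It is illustrative only: the causal model, the cause and effect names, and the function explanations are all invented here. The sketch enumerates the minimal sets of causes whose predicted effects cover the observations, which reflects the complete-knowledge assumption just described.

    from itertools import combinations

    # Hypothetical causal model: each candidate cause maps to the set of
    # effects it would produce.  All names here are invented.
    causes = {
        "cold":      {"sneezing", "runny_nose"},
        "hay_fever": {"sneezing", "itchy_eyes"},
        "flu":       {"fever", "sneezing"},
    }

    def explanations(observed):
        """Minimal sets of causes whose predicted effects cover every
        observed effect (complete knowledge: an observed effect must be
        produced by one of the known causes)."""
        found = []
        for k in range(1, len(causes) + 1):
            for combo in combinations(sorted(causes), k):
                if any(set(f) <= set(combo) for f in found):
                    continue  # a smaller set already explains the data
                predicted = set().union(*(causes[c] for c in combo))
                if observed <= predicted:
                    found.append(combo)
        return found

    print(explanations({"sneezing", "fever"}))  # -> [('flu',)]

Attaching probabilities to the candidate causes turns this same inversion into the probabilistic abduction that Bayes' rule performs, as Section 2 develops.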
1.2 Learning as an evidential reasoning task

In this section we explore learning as an evidential reasoning task. Given a task, a prior belief or bias, and some data, the learning task is to produce an updated theory of the data (the posterior belief) that can be used in the task. In order to make this clear, we must be very careful to distinguish:

- the task being learned
- the task of learning itself.

This distinction is very important when the task being learned is also an evidential reasoning task (e.g., learning to do diagnosis, or learning a perceptual task). The task of learning can be seen as an evidential reasoning task where the model "causes" the data. The aim of learning is: given the data, to find appropriate models (evidential reasoning), and from the model(s) to make predictions on unseen cases (causal reasoning).

When we look at learning as an evidential reasoning task, not surprisingly, we find learning methods that correspond to the two strategies that allow both causal and evidential reasoning (the second and third strategies of the previous section).

The second strategy of the previous section is to build special-purpose reasoning strategies to carry out the evidential reasoning task (i.e., inferring the model from the data) that are separate from the causal reasoning task (predicting new data from the model). Examples of such special-purpose mechanisms are decision-tree learning algorithms such as C4.5 [33] and CART [1], and backpropagation for neural network learning [34].

The rest of this paper will show how the third strategy of the previous section, namely causal modelling and using different strategies for causal and evidential reasoning, can be used for learning, and can be carried out with both logical and probabilistic specifications of abductive reasoning. Such a strategy implies that we need a specification of the models to be learned and of what these models predict in order to build a learning algorithm.

2 Bayesian Probability

In this section we introduce and motivate probability theory independently of learning. The interpretation of probability theory we use here is called Bayesian, personal, or subjective probability, as opposed to the frequentist interpretation of probability as the study of the frequency of repeatable events.

Probability theory [24] is a study of belief update: how an agent's knowledge affects its beliefs. An agent's probability of a proposition is a measure of how much the proposition is believed by the agent. Rather than considering an agent maintaining one coherent set of beliefs (for example, the most plausible way the world could be based on the agent's knowledge), Bayesian probability specifies that an agent must consider all possible ways that the world could be and their relative plausibilities. This plausibility, when normalised to the range [0, 1] so that the values for all possible situations sum to one, is called a probability.

There are a number of reasons why we would be interested in probability, including:

- An agent can only act according to its beliefs and its goals. An agent doesn't have access to everything that is true in its domain, but only to its beliefs. An agent must somehow be able to decide on actions based on its beliefs.

- It is not enough for an agent to have just a single model of the world in which it is interacting and act on that model. It also needs to consider what other alternatives may be true, and make sure that its actions are not too disastrous if these other contingencies happen to arise. A classic example is wearing a seat belt: an agent may assume that it won't have an accident on a particular trip, but wears a seat belt to cover the possibility that it does have an accident. Under normal circumstances the seat belt is a slight nuisance, but if there is an accident the agent is much better off wearing it. Whether the agent wears a seat belt depends on how inconvenient it is when there is no accident, how much better off the agent would be if it were wearing a seat belt when there is an accident, and how likely an accident is. This tradeoff between various outcomes, their relative desirability, and their likelihood is the subject of decision theory.

- As we will see below, probabilities are what can be obtained from data. Probability lets us explicitly model noise in data, and lets us update our beliefs based on noisy data.

The formalisation of probability theory is simple. A random variable is a term in a language that can take one of a number of different values. The set of all possible values a variable can take is called the domain of the variable. We write x = v to mean the proposition that variable x has value v.
A Boolean random variable is one where the domain is {true, false}. Often we write x rather than x = true and ¬x rather than x = false. A proposition is a Boolean formula made from assignments of values to variables.

Some example random variables may be a patient's blood pressure at 2:00 p.m. on July 13, 1998, the value of the Australian dollar relative to the Canadian dollar on January 1, 2001, whether a patient has cancer at a particular time, whether a light is lit at some time point, or whether a particular coin lands heads on a particular toss. There is nothing random about random variables. We introduce them because it is often useful to be able to refer to a variable without specifying its value.

Suppose we have a set of random variables. A possible world specifies an assignment of one value to each random variable. If w is a world, x is a random variable and v is a value in the domain of x, we write w |= x = v to mean that variable x is assigned value v in world w. We can allow Boolean combinations on the right-hand side of |=, where the logical connectives have their standard meaning; for example,

    w |= α ∧ β   iff   w |= α and w |= β

So far this is just standard logic, but using the terminology of random variables.

Let's assign a nonnegative measure μ(w) to each world w so that the measures of the possible worlds sum to 1. The use of 1 is purely by convention; we could have just as easily used 100, for example. (When there are infinitely many possible worlds, we need to use some form of measure theory, so that the measure of all of the possible worlds is 1. This requires us to assign probabilities to measurable sets of worlds, but the general idea is essentially the same.)

The probability of proposition α, written P(α), is the sum of the measures of the worlds in which α is true:

    P(α) = Σ_{w |= α} μ(w)

The most important part of Bayesian probability is conditioning on observations. The set of all observations is called the evidence. If you are given evidence e, conditioning means that all worlds in which e is false are eliminated, and the remaining worlds are renormalised so that their probabilities sum to 1. This can be seen as creating a new measure μ_e defined by:

    μ_e(w) = 0             if w |= e is false
    μ_e(w) = μ(w) / P(e)   if w |= e

We can then define the conditional probability of a given e, written P(a|e), in terms of the new measure:

    P(a|e) = Σ_{w |= a} μ_e(w)
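This semantics is operational enough to run. The following small Python sketch (mine, with invented variables and measures) represents each possible world as a dictionary, computes P(φ) by summing measures, and implements conditioning by eliminating worlds and renormalising:

    # A small sketch (invented for illustration) of the possible-worlds
    # semantics: a world assigns a value to every random variable,
    # P(phi) sums the measures of the worlds satisfying phi, and
    # conditioning zeroes out worlds falsifying e and renormalises.

    worlds = [  # (world, measure); the measures sum to 1
        ({"cold": "severe", "sneeze": "yes"}, 0.20),
        ({"cold": "severe", "sneeze": "no"},  0.05),
        ({"cold": "mild",   "sneeze": "yes"}, 0.15),
        ({"cold": "mild",   "sneeze": "no"},  0.20),
        ({"cold": "none",   "sneeze": "yes"}, 0.05),
        ({"cold": "none",   "sneeze": "no"},  0.35),
    ]

    def P(phi, ws=worlds):
        """P(phi) = sum of measures of worlds in which phi is true."""
        return sum(mu for w, mu in ws if phi(w))

    def condition(e, ws=worlds):
        """The renormalised measure mu_e over worlds satisfying e."""
        pe = P(e, ws)
        return [(w, mu / pe) for w, mu in ws if e(w)]

    severe = lambda w: w["cold"] == "severe"
    sneezing = lambda w: w["sneeze"] == "yes"

    print(P(sneezing))                     # 0.4
    print(P(sneezing, condition(severe)))  # P(sneeze=yes | cold=severe) = 0.8

For these invented measures, the conditioned query computes P(sneeze = yes | cold = severe) = 0.20/0.25 = 0.8, which is exactly the proportion described in the following example.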
Example 2.1 The probability P(sneeze = yes | cold = severe) specifies, out of all of the worlds where cold is severe, what proportion have sneeze with value yes. It is the measure of belief in the proposition sneeze = yes given that all you knew was that the cold was severe. The probability P(sneeze = yes | cold ≠ severe) considers the other worlds, where the cold isn't severe, and specifies the proportion of these in which sneeze has value yes. This second probability is independent of the first.

2.1 Bayes' Rule

Given the above semantic definition of conditioning, it is easy to prove:

    P(h|e) = P(h ∧ e) / P(e)

Rewriting the above formula, and noticing that h ∧ e is the same proposition as e ∧ h, we get:

    P(h ∧ e) = P(h|e) × P(e) = P(e|h) × P(h)

We can divide the right-hand sides by P(e), giving

    P(h|e) = P(e|h) × P(h) / P(e)

if P(e) ≠ 0. This equation is known as Bayes' theorem or Bayes' rule. It was first given in this generality by Laplace [17].

It may seem puzzling why such an innocuous-looking equation should be so celebrated. It is important because it tells us how to do evidential reasoning from a causal knowledge base: Bayes' rule is an equation for abduction. Suppose P(e|h) specifies a causal model: it gives the propensity of effect e in the context when h is true. Bayes' rule specifies how to do evidential reasoning: it tells us how to infer the cause h from the effect e.

The numerator is the product of the likelihood, P(e|h), which specifies how well the hypothesis h predicts the evidence e, and the prior probability, P(h), which specifies how much the hypothesis was believed before any evidence arrived.

The denominator, P(e), is a normalising constant to ensure that the probabilities are well formed. If {h_1, ..., h_k} is a set of hypotheses that is pairwise incompatible (h_i and h_j cannot both be true if i ≠ j) and covering (one h_i must be true), then

    P(e) = Σ_i P(e|h_i) × P(h_i)

If you are only interested in comparing hypotheses, this denominator can be ignored.

2.2 Bayesian Learning

Bayesian learning, or Bayesian statistics [5, 19, 12, 13], is the method of using Bayes' rule for the evidential reasoning task of learning. Bayes' rule is

    P(h|e) = P(e|h) × P(h) / P(e)

If e is the data (all of the training examples), and h is a hypothesis, Bayes' rule specifies how, given the model of how the hypothesis h produces the data e and the prior propensity of h, you can infer how likely the hypothesis is, given the data. One of the main reasons why this is of interest is that the hypotheses can be noisy: a hypothesis can specify a probability distribution over the data it predicts. Moreover, Bayes' rule allows us to compare the hypotheses that predict the data exactly (where P(e|h) = 1) amongst themselves and with the hypotheses that specify any other probability of the data.
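Operationally, Bayes' rule over a pairwise incompatible and covering hypothesis set amounts to a few lines of code: multiply each prior by its likelihood, sum to get P(e), and renormalise. Here is a hedged sketch of my own, with invented hypothesis names and numbers:

    # Bayes' rule over a pairwise incompatible, covering hypothesis set
    # (all numbers invented): posterior(h_i) = P(e|h_i) P(h_i) / P(e).

    prior = {"flu": 0.1, "cold": 0.3, "healthy": 0.6}        # P(h_i)
    likelihood = {"flu": 0.9, "cold": 0.7, "healthy": 0.05}  # P(e|h_i)

    p_e = sum(likelihood[h] * prior[h] for h in prior)       # P(e)
    posterior = {h: likelihood[h] * prior[h] / p_e for h in prior}

    print(posterior)  # cold ~ 0.636, flu ~ 0.273, healthy ~ 0.091

As the text notes, the division by p_e only rescales: if you merely want to rank hypotheses, comparing likelihood × prior suffices.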
Example 2.2 Suppose we are doing Bayesian learning of decision trees, and are considering a number of definitive decision trees (i.e., they predict classifications with 0 or 1 probabilities, and thus have no room for noise). For each such decision tree h, either P(e|h) = 1 or P(e|h) = 0. Bayes' theorem tells us that the trees that don't predict the data have posterior probability 0, and the trees that predict the observed data have posterior probabilities proportional to their priors. Thus the prior probability specifies the learning bias (for example, towards simpler decision trees): out of all of the trees that match the data, it determines which are to be preferred. Without such a bias there can be no learning, as every possible function can be represented as a decision tree. Bayes' rule also specifies how to compare simpler decision trees that may not exactly fit the data (e.g., if they have probabilities at the leaves) with more complex ones that exactly fit the data. This gives a principled way to handle overfitting.

Example 2.3 The simplest form of Bayesian learning with probabilistic hypotheses is when there is a single binary event that is repeated and statistics are collected. That is, we are trying to learn probabilities. Suppose we have some object that can fall down such that either there is some distinguishing feature (which we will call heads) showing on top, or there is no heads (which we will call tails) showing on top. We would like to learn the probability that there is a heads showing on top. Suppose our hypothesis space consists of hypotheses that specify P(heads) = p, where heads is the proposition that says heads is on top, and p is a number that specifies the probability of a heads on top. Implicit in this hypothesis is that repeated tosses are independent. (Bayesian probability doesn't require independent trials; you can model the interdependence of the trials in the hypothesis space.) Suppose we have an observation e consisting of a particular sequence of outcomes in which heads is true in n out of m outcomes. Let h_p be the hypothesis that P(heads) = p, for some 0 ≤ p ≤ 1. Then we have, by elementary probability theory,

    P(e|h_p) = p^n × (1 - p)^(m-n)

Suppose that our prior probability is uniform on [0, 1]. That is, we consider each value for P(heads) to be equally likely before we see any data. Figure 2 shows the posterior distributions for various values of n and m. Note that the only hypotheses that are inconsistent with the observations (i.e., have P(e|h_p) = 0) are those that assign probability 0 to an outcome that was actually observed: h_0 when n > 0, and h_1 when n < m.
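To make Example 2.3 concrete, here is a small sketch of my own, assuming the uniform prior used in the text. It evaluates the posterior over p on a grid by normalising the likelihood numerically; with a uniform prior, the exact answer is the Beta(n+1, m-n+1) density.

    # Posterior over p after n heads in m tosses, under the uniform prior
    # of Example 2.3: proportional to the likelihood p^n (1-p)^(m-n),
    # normalised here by a simple numerical integral over a grid.

    def posterior_on_grid(n, m, steps=1000):
        grid = [i / steps for i in range(steps + 1)]
        unnorm = [p**n * (1 - p)**(m - n) for p in grid]  # likelihood x prior
        z = sum(unnorm) / steps                           # approximates P(e)
        return grid, [u / z for u in unnorm]

    grid, post = posterior_on_grid(n=8, m=10)
    map_p = max(zip(post, grid))[1]  # maximum a posteriori value of p
    print(map_p)                     # -> 0.8, i.e. n/m
    print(max(post))                 # peak density, about 3.32 (Beta(9, 3))

As m grows, the posterior concentrates around the observed proportion n/m, which is how the influence of the uniform prior washes out with data.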