Lectures for APM 541: Stochastic Modeling in Biology

Jay Taylor

November 3, 2011

Contents

1 Distributions, Expectations, and Random Variables  4
  1.1 Probability Spaces  4
  1.2 Conditional Probabilities  5
  1.3 Discrete Random Variables  7
  1.4 Continuous Random Variables  10
  1.5 Multivariate Distributions  13
  1.6 Sums of Independent Random Variables  17

2 Approximation and Limit Theorems in Probability  20
  2.1 Convergent Sequences and Approximation  20
  2.2 Modes of Convergence of Random Variables  22
  2.3 Laws of Large Numbers  23
  2.4 The Central Limit Theorem  24
  2.5 The Law of Rare Events  26

3 Random Number Generation  28
  3.1 Pseudorandom Number Generators  28
  3.2 The Inversion Method  30
  3.3 Rejection Sampling  32
  3.4 Simulating Discrete Random Variables  34

4 Discrete-time Markov Chains  37
  4.1 Definitions and Properties  37
  4.2 Asymptotic Behavior of Markov Chains  44
    4.2.1 Class Structure  44
    4.2.2 Hitting Times and Absorption Probabilities  45
    4.2.3 Stationary Distributions  49

5 Biological Applications of Markov Chains  54
  5.1 The Wright-Fisher Model and its Relatives  54
    5.1.1 Cannings' Models  57
  5.2 Galton-Watson Processes  61
  5.3 Chain Epidemic Models  66
    5.3.1 Epidemics with Household and Community Transmission  69

6 Continuous-time Markov Chains  72
  6.1 Definitions and Properties  72
  6.2 Kolmogorov Equations  77
  6.3 Gillespie's Algorithm and the Jump Chain  79
  6.4 Stationary Distributions  81
  6.5 Time Reversal  84
  6.6 Poisson Processes and Measures  86

7 Diffusions and Stochastic Calculus  92
  7.1 Brownian Motion  92
    7.1.1 The Invariance Principle  92
    7.1.2 Diffusion Approximations for CTMCs via the Heat Equation  94
    7.1.3 Properties of Standard Brownian Motion  96
  7.2 Diffusion Processes  97
  7.3 Diffusion Approximations  99
  7.4 Technical Interlude: Generators and Martingales  100
    7.4.1 Martingales  105

Chapter 1

Distributions, Expectations, and Random Variables

1.1 Probability Spaces

We can think about probability in two ways:

• Frequentist interpretation: The probability of an event is the limiting frequency with which the event occurs when we conduct an infinite series of identical but independent trials.

• Subjective interpretation: The probability of an event measures the strength of our subjective belief that the event will occur in one trial.

There has been much argument, especially amongst statisticians, as to which of these interpretations is 'correct'. I tend to take a fairly pragmatic view of things and switch between these perspectives as best suits the problem that I am working on. However, we can at least write down a formal (and pretty much universally accepted) mathematical definition of probability.

Definition 1.1. A probability space is a triple (Ω, F, P) where:

• Ω is the sample space, i.e., the set of all possible outcomes.

• F is a collection of subsets of Ω which we call events. F is called a σ-algebra and is required to satisfy the following conditions:
  1. The empty set and the sample space are both events: ∅, Ω ∈ F.
  2. If E is an event, then its complement $E^c = \Omega \setminus E$ is also an event.
  3. If $E_1, E_2, \ldots$ are events, then their union $\bigcup_n E_n$ is an event.

• P is a function from F into [0, 1]: if E is an event, then P(E) is the probability of E. P is said to be a probability distribution or probability measure on F and is also required to satisfy several conditions:
  1. P(∅) = 0; P(Ω) = 1.
  2. Countable additivity: If $E_1, E_2, \ldots$ are mutually exclusive events, i.e., $E_i \cap E_j = \emptyset$ whenever $i \neq j$, then
\[
P\Big( \bigcup_{n=1}^{\infty} E_n \Big) = \sum_{n=1}^{\infty} P(E_n).
\]

If you want to read or publish articles on mathematical probability and statistics, then you will need to come to grips with this definition. David Williams' little book, Probability with Martingales, provides an excellent introduction to this theory. In this course, we will usually be very informal and ignore the role played by the σ-algebra F. However, the properties described in the third part of the definition are both useful and intuitive:

• P(∅) = 0 means that the probability that nothing (whatsoever) happens is zero.

• P(Ω) = 1 means that the probability that something (whatever it is) happens is one.

• If $E_1$ and $E_2$ are mutually exclusive events, then $E_1 \cup E_2$ is the event that either $E_1$ or $E_2$ happens, and the probability of that is just the sum of the probability that $E_1$ happens and the probability that $E_2$ happens:
\[
P(E_1 \cup E_2) = P(E_1) + P(E_2).
\]
Countable additivity says that this property holds when we have a countable collection of disjoint events.

The following lemma lists some other useful properties that can be deduced from Definition 1.1.

Lemma 1.1. The following properties hold for any two events A, B in a probability space:
1. $P(A^c) = 1 - P(A)$.
2. If A and B are mutually exclusive, then P(A ∩ B) = 0.
3. For any two events A and B (not necessarily mutually exclusive), we have
\[
P(A \cup B) = P(A) + P(B) - P(A \cap B).
\]

Exercise 1.1. Prove Lemma 1.1.
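Identities like part 3 of Lemma 1.1 can also be checked empirically by simulation, a theme we will return to in Chapter 3. Below is a minimal Python sketch for a single roll of a fair die; the choice of events A (an even roll) and B (a roll of at least 4) is arbitrary and purely for illustration.

```python
import random

# Monte Carlo check of Lemma 1.1(3) for one roll of a fair die:
# A = {roll is even}, B = {roll is at least 4}.
trials = 100_000
n_A = n_B = n_AB = n_AuB = 0
for _ in range(trials):
    roll = random.randint(1, 6)
    a, b = (roll % 2 == 0), (roll >= 4)
    n_A += a
    n_B += b
    n_AB += a and b   # A and B both occur
    n_AuB += a or b   # A or B (or both) occurs

# Both printed values estimate P(A u B); the exact answer is 4/6 = 2/3.
print(n_AuB / trials)
print(n_A / trials + n_B / trials - n_AB / trials)
```

With 100,000 trials, both estimates should be close to the exact value 2/3, and close to one another, as Lemma 1.1 requires.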
1.2 Conditional Probabilities

It is often the case that we have some partial information about the outcome of an experiment or the state of an unknown system. Our next definition shows how we should modify our beliefs about the unobserved outcome given this additional information:

Definition 1.2. Suppose that A and B are events and that P(B) > 0. Then the conditional probability that A occurs given that B occurs is
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}.
\]

In frequentist terms, we can think of the conditional probability P(A|B) as the fraction of trials resulting in both A and B divided by the fraction of trials resulting in B. In general, $P(A|B) \neq P(A)$, in which case we say that B contains some information about A, i.e., knowing whether B does or does not occur gives us some information about whether A does or does not occur. On the other hand, if P(A|B) = P(A), then B gives us no information about A. This important scenario motivates the next definition.

Definition 1.3. Independent Events
1. Two events A and B are said to be independent if P(A ∩ B) = P(A) · P(B).
2. A countable collection of events $E_1, E_2, \ldots$ is said to be independent if for every finite subcollection $E_{i_1}, \ldots, E_{i_n}$ we have
\[
P(E_{i_1} \cap \cdots \cap E_{i_n}) = P(E_{i_1}) \cdots P(E_{i_n}).
\]

Example 1.1. Three events A, B, and C are independent if all of the following identities hold:
\[
\begin{aligned}
P(A \cap B \cap C) &= P(A) \cdot P(B) \cdot P(C) \\
P(A \cap B) &= P(A) \cdot P(B) \\
P(A \cap C) &= P(A) \cdot P(C) \\
P(B \cap C) &= P(B) \cdot P(C)
\end{aligned}
\]

Theorem 1.1. If A and B are independent and P(B) > 0, then
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A).
\]
In other words, if A and B are independent, then, as we would expect, B gives us no information about A.

Notice that the expression for conditional probability stated in Definition 1.2 can be rearranged to give
\[
P(A \cap B) = P(A \mid B) \cdot P(B),
\]
i.e., the probability that both A and B occur is equal to the conditional probability that A occurs given that B occurs times the probability that B occurs. Notice that, by symmetry, we also have
\[
P(A \cap B) = P(B \mid A) \cdot P(A).
\]

Although elementary, these simple algebraic manipulations lead to two of the most useful formulas in probability. The Law of Total Probability is important because it can often be used to compute the probability of a complicated event by conditioning on additional information. We will see many examples of this procedure throughout the course. Bayes' formula is important, of course, because it forms the foundation of Bayesian statistics, which we will also discuss at length in this course.

Theorem 1.2. (Law of Total Probability) If A is an event and $B_1, \ldots, B_n$ is a collection of disjoint events such that $A \subset B_1 \cup \cdots \cup B_n$, then
\[
\begin{aligned}
P(A) &= P(A \cap B_1) + \cdots + P(A \cap B_n) \\
     &= P(A \mid B_1) \cdot P(B_1) + \cdots + P(A \mid B_n) \cdot P(B_n).
\end{aligned}
\]

Theorem 1.3. (Bayes' formula) If A and B are events with P(A) > 0 and P(B) > 0, then
\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.
\]
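To make Theorems 1.2 and 1.3 concrete, here is a short Python sketch of a standard diagnostic-testing calculation, with B the event that a patient is infected and A the event that the test is positive. The prevalence, sensitivity, and false-positive rate are hypothetical numbers chosen for illustration.

```python
# Hypothetical diagnostic-test parameters (illustrative only).
p_B = 0.01           # P(B): prevalence of infection
p_A_given_B = 0.95   # P(A|B): sensitivity of the test
p_A_given_Bc = 0.02  # P(A|B^c): false-positive rate

# Law of Total Probability (Theorem 1.2), conditioning on B and B^c:
# P(A) = P(A|B) P(B) + P(A|B^c) P(B^c)
p_A = p_A_given_B * p_B + p_A_given_Bc * (1 - p_B)

# Bayes' formula (Theorem 1.3): P(B|A) = P(A|B) P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print(p_A, p_B_given_A)  # approximately 0.0293 and 0.324
```

Notice that even though the test is fairly accurate, the conditional probability P(B|A) is only about 0.32, because the infection is rare. This is exactly the kind of calculation that the two theorems above make routine.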
1.3 Discrete Random Variables

In practice, we are often unable to directly observe the state of the systems that we study in biology and instead must make do with indirect information provided by experiments. One way to model this situation mathematically is by identifying the probability space (Ω, F, P) with the true but unknown state of the system of interest and then introducing random variables that represent the outcomes of the experiments that we perform on that system. For example, if we perform just one experiment and if the set of possible outcomes is denoted E, then we would define a random variable X which is a function from Ω into E. Thus, if the state of the system is ω, then the result of our experiment will be the value X(ω).

To be more concrete, suppose that we choose a saguaro cactus at random from Picacho Peak State Park and we then measure its height. In this case, the probability space could encode all of the processes (e.g., climatic and ecological) influencing the heights of the saguaros in the park, as well as those influencing our sampling of an individual, while the random variable X will denote just the height of that individual, which will be some value in the set E = [0, ∞).

Remark 1.1. As promised, we are skirting over many formalities that are important if we want to prove theorems about random variables. In particular, to define random variables rigorously, we need to attach some additional structure to the set E and then require that the function X is measurable. For our purposes we can ignore these technical issues, but see Chapter 3 in Williams (1991) for the details.

Definition 1.4. Suppose that (Ω, F, P) is a probability space and that X is a random variable that takes values in the set E. Then the distribution of X is the probability distribution µ defined on E by the formula
\[
\mu(A) \equiv P(X \in A) \equiv P\big(\{\omega \in \Omega : X(\omega) \in A\}\big).
\]
Here A is a subset of E, i.e., A is a collection of possible outcomes for our experiment, whereas the set {ω ∈ Ω : X(ω) ∈ A} is a subset of Ω.

Remark 1.2. Much of the time we will simply ignore the underlying probability space and restrict our attention to the distributions of the random variables defined on that space. In particular, we will usually just write X rather than X(ω) even when we have a particular value of X in mind. On the other hand, we will often be content to use the notation P(X ∈ A) rather than explicitly introduce the probability measure µ as we did in Definition 1.4. With practice, this shorthand will become very natural.

Discrete random variables provide an important special case of these concepts.

Definition 1.5.
1. A random variable X is said to be discrete if it takes values in a set E that is either finite or contains countably infinitely many points (e.g., the integers).
2. If $E = \{x_1, x_2, \ldots\}$, then the probability mass function of X is the function p : E → [0, 1] defined by the formula $p(x_i) = P(X = x_i)$.

The probability mass function of a discrete random variable completely determines its distribution:
\[
P(X \in A) = \sum_{x_i \in A} p(x_i).
\]
In other words, to calculate the probability that X takes a value in a set A ⊂ E, we simply need to sum the probability mass function of X over all of the points that belong to A. Notice that this implies that
\[
\sum_{x_i \in E} p(x_i) = P(X \in E) = 1,
\]
since E is defined to be the set of all possible values that X can take.
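As a concrete illustration of these formulas, the following Python sketch builds the probability mass function of X = the sum of two fair dice by enumerating the 36 equally likely outcomes, and then computes P(X ∈ A) by summing the pmf over A. The set A = {7, 11} is an arbitrary choice.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Build the pmf of X = sum of two fair dice by enumerating the
# 36 equally likely outcomes of the sample space.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in counts.items()}

# P(X in A) is the sum of the pmf over the points of A.
A = {7, 11}
print(sum(pmf[x] for x in A))  # 8/36 = 2/9

# The pmf sums to 1 over the whole set E of possible values.
assert sum(pmf.values()) == 1
```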
Definition 1.6. If X is a discrete random variable that takes values in a subset of the real numbers, then the expected value of X is defined to be the weighted average of these values:
\[
EX = \sum_i p(x_i) \cdot x_i.
\]

Remark 1.3. The expected value of a random variable is also called its expectation or its mean and is sometimes written as E[X] for clarity. In some respects, the name 'expected value' is misleading, since EX could well be a value that X never takes. For example, if E = {0, 1} and P(X = 0) = P(X = 1) = 1/2, then
\[
EX = \frac{1}{2} \cdot 0 + \frac{1}{2} \cdot 1 = \frac{1}{2},
\]
even though X is never equal to 1/2.

An important property of expectations is that they are linear:

Theorem 1.4. (Linearity) Suppose that X and Y are discrete random variables and that a and b are real numbers. Then
\[
E\big[a \cdot X + b \cdot Y\big] = a \cdot EX + b \cdot EY.
\]

The next theorem describes another important property of expectations that is sometimes incorrectly stated as a definition, hence the tongue-in-cheek name:

Theorem 1.5. (The Law of the Unconscious Statistician) If X is a discrete random variable with values in a set E and f : E → R is a real-valued function, then f(X) is a discrete random variable and
\[
E[f(X)] = \sum_{x_i} p(x_i) \cdot f(x_i).
\]

If you want a challenge, then try the following exercise.

Exercise 1.2. Prove Theorems 1.4 and 1.5.

Definition 1.7. The variance of a discrete real-valued random variable X is defined as
\[
\mathrm{Var}(X) = E\big[(X - EX)^2\big].
\]

Exercise 1.3. Use Theorems 1.4 and 1.5 to show that
\[
\mathrm{Var}(X) = E\big[X^2\big] - (EX)^2.
\]

The next four examples describe some of the more important discrete distributions that we will encounter this semester. In each case, E will denote the set of possible values of the random variable and p(x) will denote its probability mass function.

Example 1.2. X is said to have a Bernoulli distribution with parameter p if E = {0, 1} and
\[
P(X = 1) = p; \qquad P(X = 0) = 1 - p.
\]
In this case the mean and variance of X are given by
\[
EX = p, \qquad \mathrm{Var}(X) = p(1 - p).
\]
Bernoulli random variables are the simplest non-constant random variables and are often used to represent the success (1) or failure (0) of a random trial.

Example 1.3. X is said to have a binomial distribution with parameters n and p if X takes values in the set E = {0, 1, ..., n} with probability mass function
\[
P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}.
\]
Recall that the binomial coefficient that appears in this definition is equal to
\[
\binom{n}{k} = \frac{n!}{k!\,(n - k)!},
\]
where n! = n(n − 1)(n − 2) ··· 1, and counts the number of ways of choosing a subset of k objects from a collection of n objects. The mean and variance of X are given by
\[
EX = np, \qquad \mathrm{Var}(X) = np(1 - p).
\]
Binomial distributions often arise when we carry out n independent but identical trials, each having probability p of success, and we count the total number of successes.

Exercise 1.4. Suppose that $X_1, \ldots, X_n$ are independent, identically-distributed (abbreviated i.i.d.) Bernoulli random variables with parameter p and let $X = X_1 + \cdots + X_n$. Show that X is a binomial random variable with parameters n and p.

Example 1.4. X is said to have a geometric distribution with parameter p if X takes values in the non-negative integers E = {0, 1, ...} with probability mass function
\[
P(X = k) = (1 - p)^k p.
\]
The mean and variance of X are given by
\[
EX = \frac{1 - p}{p}, \qquad \mathrm{Var}(X) = \frac{1 - p}{p^2}.
\]
Geometric distributions also arise when we carry out independent but identical trials. Let $X_1, X_2, \ldots$ be an infinite collection of i.i.d. Bernoulli random variables, each with parameter p, and define X to be the number of failures that occur before the first success. Then X is a geometric random variable with parameter p.

Example 1.5. X is said to have a Poisson distribution with parameter λ if X takes values in the non-negative integers E = {0, 1, ...} with probability mass function
\[
P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}.
\]
The mean and variance of X are given by
\[
EX = \lambda, \qquad \mathrm{Var}(X) = \lambda.
\]
Poisson distributions often arise in situations where a large number of independent trials are carried out and the probability of success of any one trial is small. We will discuss this in the next lecture when we consider the Law of Rare Events.
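The mean and variance formulas in Examples 1.2 through 1.5 can be checked empirically with any scientific computing library. Below is a minimal sketch using NumPy's random samplers; the parameter values are arbitrary. One caveat: NumPy's geometric sampler counts the number of trials up to and including the first success, so we subtract 1 to match the failures-counting convention of Example 1.4.

```python
import numpy as np

# Empirical check of the mean/variance formulas in Examples 1.2-1.5.
rng = np.random.default_rng(541)
n_samples, n, p, lam = 200_000, 10, 0.3, 2.5

samples = {
    "Bernoulli(p)":  rng.binomial(1, p, n_samples),   # Bernoulli = binomial with n = 1
    "Binomial(n,p)": rng.binomial(n, p, n_samples),
    "Geometric(p)":  rng.geometric(p, n_samples) - 1, # shift to count failures
    "Poisson(lam)":  rng.poisson(lam, n_samples),
}
theory = {
    "Bernoulli(p)":  (p, p * (1 - p)),
    "Binomial(n,p)": (n * p, n * p * (1 - p)),
    "Geometric(p)":  ((1 - p) / p, (1 - p) / p**2),
    "Poisson(lam)":  (lam, lam),
}
for name, x in samples.items():
    print(name, x.mean(), x.var(), "vs", theory[name])
```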
1.4 Continuous Random Variables

Some of the variables that we will be interested in take values in sets that are continuous, e.g., the height (in cm) of a randomly sampled individual could be regarded as a random variable that can assume any value between 0 and 300. In this case, the probability mass function is zero at every point and we need to describe the distribution of the random variable in a different way.

Definition 1.8. A real-valued random variable X is said to be continuous if there is a non-negative function p(x), called the probability density function of X, such that
\[
P(X \in A) = \int_A p(x)\, dx,
\]
where A is a subset of R. In particular, by taking A = R, we see that
\[
\int_{-\infty}^{\infty} p(x)\, dx = P(X \in \mathbb{R}) = 1,
\]
i.e., the density must integrate to 1 over the whole real line.

Remark 1.4. If X is any real-valued random variable (not necessarily continuous), then the distribution of X is completely determined by its cumulative distribution function (often abbreviated c.d.f.)
\[
F(x) = P(X \leq x).
\]
Notice that F(x) is a non-decreasing function of x, i.e., if x < y, then F(x) ≤ F(y). Also,
\[
\lim_{x \to -\infty} F(x) = P(X \leq -\infty) = 0, \qquad \lim_{x \to \infty} F(x) = P(X \leq \infty) = 1.
\]
If X is also continuous, then the cumulative distribution function F(x) and the density function p(x) are related in the following way:
\[
F(x) = \int_{-\infty}^{x} p(y)\, dy \quad \text{and} \quad p(x) = F'(x),
\]
i.e., the density is just the derivative of the cumulative distribution function. Furthermore, we can estimate the density of X at a value x using the approximate formula
\[
p(x) \approx \frac{P(x - \epsilon < X \leq x + \epsilon)}{2\epsilon} = \frac{F(x + \epsilon) - F(x - \epsilon)}{2\epsilon},
\]
where ε > 0 is any small positive number.
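The finite-difference approximation at the end of Remark 1.4 is easy to test numerically when F is known in closed form. The following Python sketch does this for the exponential distribution with rate 1, for which F(x) = 1 − e^(−x) and p(x) = e^(−x) for x > 0; the value of ε and the test points are arbitrary.

```python
import math

# c.d.f. of the exponential distribution with rate 1.
def F(x: float) -> float:
    return 1.0 - math.exp(-x) if x > 0 else 0.0

# Central-difference estimate of the density p(x) = F'(x),
# compared against the exact density exp(-x).
eps = 1e-4
for x in (0.5, 1.0, 2.0):
    approx = (F(x + eps) - F(x - eps)) / (2 * eps)
    exact = math.exp(-x)
    print(x, approx, exact)  # the two columns agree closely
```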