Full Solution Manual for “Probabilistic Machine Learning: An Introduction”
Kevin Murphy
November 21, 2021

1 Solutions

Part I Foundations

2 Solutions

2.1 Conditional independence PRIVATE

1. Bayes’ rule gives
\[ P(H \mid E_1, E_2) = \frac{P(E_1, E_2 \mid H)\, P(H)}{P(E_1, E_2)} \]
Thus the information in (ii) is sufficient. In fact, we don’t need $P(E_1, E_2)$, because it is equal to the normalization constant (which enforces the sum-to-one constraint). (i) and (iii) are insufficient.

2. Now the equation simplifies to
\[ P(H \mid E_1, E_2) = \frac{P(E_1 \mid H)\, P(E_2 \mid H)\, P(H)}{P(E_1, E_2)} \]
so (i) and (ii) are obviously sufficient. (iii) is also sufficient, because we can compute $P(E_1, E_2)$ by normalization.

2.2 Pairwise independence does not imply mutual independence

We provide two counterexamples.

Let $X_1$ and $X_2$ be independent binary random variables, and let $X_3 = X_1 \oplus X_2$, where $\oplus$ is the XOR operator. We have $p(X_3 \mid X_1, X_2) \neq p(X_3)$, since $X_3$ can be deterministically calculated from $X_1$ and $X_2$. So the variables $\{X_1, X_2, X_3\}$ are not mutually independent. However, we also have $p(X_3 \mid X_1) = p(X_3)$, since without $X_2$ no information can be provided about $X_3$. So $X_1 \perp X_3$, and similarly $X_2 \perp X_3$. Hence $\{X_1, X_2, X_3\}$ are pairwise independent.

Here is a different example. Let there be four balls in a bag, numbered 1 to 4. Suppose we draw one at random. Define 3 events as follows:
• $X_1$: ball 1 or 2 is drawn.
• $X_2$: ball 2 or 3 is drawn.
• $X_3$: ball 1 or 3 is drawn.
We have $p(X_1) = p(X_2) = p(X_3) = 0.5$. Also, $p(X_1, X_2) = p(X_2, X_3) = p(X_1, X_3) = 0.25$. Hence $p(X_1, X_2) = p(X_1)\, p(X_2)$, and similarly for the other pairs, so the events are pairwise independent. However,
\[ p(X_1, X_2, X_3) = 0 \neq 1/8 = p(X_1)\, p(X_2)\, p(X_3). \]

2.3 Conditional independence iff joint factorizes PRIVATE

Independence ⇒ factorization. Let $g(x,z) = p(x \mid z)$ and $h(y,z) = p(y \mid z)$. If $X \perp Y \mid Z$ then
\[ p(x, y \mid z) = p(x \mid z)\, p(y \mid z) = g(x,z)\, h(y,z) \]

Factorization ⇒ independence. If $p(x, y \mid z) = g(x,z)\, h(y,z)$, then
\[ 1 = \sum_{x,y} p(x, y \mid z) = \sum_{x,y} g(x,z)\, h(y,z) = \Big(\sum_x g(x,z)\Big)\Big(\sum_y h(y,z)\Big) \]
\[ p(x \mid z) = \sum_y p(x, y \mid z) = \sum_y g(x,z)\, h(y,z) = g(x,z) \sum_y h(y,z) \]
\[ p(y \mid z) = \sum_x p(x, y \mid z) = \sum_x g(x,z)\, h(y,z) = h(y,z) \sum_x g(x,z) \]
Hence
\[ p(x \mid z)\, p(y \mid z) = g(x,z)\, h(y,z) \Big(\sum_x g(x,z)\Big)\Big(\sum_y h(y,z)\Big) = g(x,z)\, h(y,z) = p(x, y \mid z) \]
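This equivalence is easy to sanity-check numerically for discrete variables. The following Python sketch (my own illustration, not part of the manual; the array shapes and random factors are arbitrary choices) builds $p(x,y \mid z)$ from nonnegative factors $g$ and $h$ and confirms that it factorizes into its conditional marginals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary nonnegative factors g(x, z) and h(y, z) for discrete
# variables with |X| = 3, |Y| = 4, |Z| = 2 (sizes chosen arbitrarily).
g = rng.random((3, 2))
h = rng.random((4, 2))

# p(x, y | z) proportional to g(x, z) * h(y, z), normalized for each z.
joint = np.einsum('xz,yz->xyz', g, h)
joint /= joint.sum(axis=(0, 1), keepdims=True)

# Conditional marginals p(x | z) and p(y | z).
px = joint.sum(axis=1)
py = joint.sum(axis=0)

# Factorization => conditional independence: p(x, y | z) = p(x | z) p(y | z).
assert np.allclose(joint, np.einsum('xz,yz->xyz', px, py))
```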
2.4 Convolution of two Gaussians is a Gaussian

We follow the derivation of [Jay03, p. 221]. Define
\[ \phi(x - \mu \mid \sigma) = \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sigma}\, \phi\!\left(\frac{x - \mu}{\sigma}\right) \]
where $\phi(z)$ is the pdf of the standard normal. We have
\[ p(y) = \mathcal{N}(x_1 \mid \mu_1, \sigma_1^2) \otimes \mathcal{N}(x_2 \mid \mu_2, \sigma_2^2) = \int dx_1\, \phi(x_1 - \mu_1 \mid \sigma_1)\, \phi(y - x_1 - \mu_2 \mid \sigma_2) \]
Now the product inside the integral can be written as follows:
\[ \phi(x_1 - \mu_1 \mid \sigma_1)\, \phi(y - x_1 - \mu_2 \mid \sigma_2) = \frac{1}{2\pi \sigma_1 \sigma_2} \exp\left\{ -\frac{1}{2}\left[ \left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2 + \left(\frac{y - x_1 - \mu_2}{\sigma_2}\right)^2 \right] \right\} \]
We can bring out the dependency on $x_1$ by rearranging the quadratic form inside the exponent as follows:
\[ \left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2 + \left(\frac{y - x_1 - \mu_2}{\sigma_2}\right)^2 = (w_1 + w_2)(x_1 - \hat{x})^2 + \frac{w_1 w_2}{w_1 + w_2}\big(y - (\mu_1 + \mu_2)\big)^2 \]
where we have defined the precision or weighting terms $w_1 = 1/\sigma_1^2$, $w_2 = 1/\sigma_2^2$, and the term
\[ \hat{x} = \frac{w_1 \mu_1 + w_2 (y - \mu_2)}{w_1 + w_2} \]
Note that
\[ \frac{w_1 w_2}{w_1 + w_2} = \frac{1}{\sigma_1^2 \sigma_2^2} \cdot \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} = \frac{1}{\sigma_1^2 + \sigma_2^2} \]
Hence
\[ p(y) = \frac{1}{2\pi \sigma_1 \sigma_2} \int \exp\left[ -\frac{1}{2}(w_1 + w_2)(x_1 - \hat{x})^2 - \frac{1}{2} \frac{w_1 w_2}{w_1 + w_2}\big(y - (\mu_1 + \mu_2)\big)^2 \right] dx_1 \]
The integral over $x_1$ becomes one over the normalization constant for the Gaussian:
\[ \int \exp\left[ -\frac{1}{2}(w_1 + w_2)(x_1 - \hat{x})^2 \right] dx_1 = (2\pi)^{1/2} \left( \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} \right)^{1/2} \]
Hence
\[ p(y) = (2\pi)^{-1} (\sigma_1^2 \sigma_2^2)^{-1/2} (2\pi)^{1/2} \left( \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} \right)^{1/2} \exp\left[ -\frac{1}{2(\sigma_1^2 + \sigma_2^2)} (y - \mu_1 - \mu_2)^2 \right] \]
\[ = (2\pi)^{-1/2} (\sigma_1^2 + \sigma_2^2)^{-1/2} \exp\left[ -\frac{1}{2(\sigma_1^2 + \sigma_2^2)} (y - \mu_1 - \mu_2)^2 \right] = \mathcal{N}(y \mid \mu_1 + \mu_2,\, \sigma_1^2 + \sigma_2^2) \]

2.5 Expected value of the minimum of two rv’s PRIVATE

Let $Z = \min(X, Y)$, where $X$ and $Y$ are independent and uniform on $[0,1]$, so $P(X > z) = P(Y > z) = 1 - z$. We have
\[ P(Z > z) = P(X > z, Y > z) = P(X > z)\, P(Y > z) = (1 - z)^2 = 1 + z^2 - 2z \]
Hence the cdf of the minimum is $P(Z \le z) = 2z - z^2$ and the pdf is $p(z) = 2 - 2z$. Hence the expected value is
\[ E[Z] = \int_0^1 z (2 - 2z)\, dz = 1/3 \]

2.6 Variance of a sum

We have
\[ V[X + Y] = E[(X + Y)^2] - (E[X] + E[Y])^2 \]
\[ = E[X^2 + Y^2 + 2XY] - (E[X]^2 + E[Y]^2 + 2 E[X] E[Y]) \]
\[ = E[X^2] - E[X]^2 + E[Y^2] - E[Y]^2 + 2 E[XY] - 2 E[X] E[Y] \]
\[ = V[X] + V[Y] + 2\,\mathrm{Cov}[X, Y] \]
If $X$ and $Y$ are independent, then $\mathrm{Cov}[X, Y] = 0$, so $V[X + Y] = V[X] + V[Y]$.

2.7 Deriving the inverse gamma density PRIVATE

With the transformation $y = 1/x$, the change-of-variables formula gives
\[ p_y(y) = p_x(x) \left| \frac{dx}{dy} \right| \]
where
\[ \frac{dx}{dy} = -\frac{1}{y^2} = -x^2 \]
So
\[ p_y(y) = x^2\, \frac{b^a}{\Gamma(a)} x^{a-1} e^{-xb} = \frac{b^a}{\Gamma(a)} x^{a+1} e^{-xb} = \frac{b^a}{\Gamma(a)} y^{-(a+1)} e^{-b/y} = \mathrm{IG}(y \mid a, b) \]

2.8 Mean, mode, variance for the beta distribution

For the mode we can use simple calculus, as follows. Let $f(x) = \mathrm{Beta}(x \mid a, b)$. Setting the derivative of the unnormalized density $x^{a-1}(1-x)^{b-1}$ to zero gives
\[ 0 = (a-1) x^{a-2} (1-x)^{b-1} - (b-1) x^{a-1} (1-x)^{b-2} \]
Dividing through by $x^{a-2}(1-x)^{b-2}$ gives
\[ 0 = (a-1)(1-x) - (b-1)x \quad\Longrightarrow\quad x = \frac{a-1}{a+b-2} \]
For the mean we have
\[ E[\theta \mid D] = \int_0^1 \theta\, p(\theta \mid D)\, d\theta = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int_0^1 \theta^{(a+1)-1} (1-\theta)^{b-1}\, d\theta = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+1+b)} = \frac{a}{a+b} \]
where we used the definition of the Gamma function and the fact that $\Gamma(x+1) = x\,\Gamma(x)$. We can find the variance in the same way, by first showing that
\[ E[\theta^2] = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int_0^1 \theta^{(a+2)-1} (1-\theta)^{b-1}\, d\theta = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(a+2)\Gamma(b)}{\Gamma(a+2+b)} = \frac{a}{a+b} \cdot \frac{a+1}{a+1+b} \]
Now we use $V[\theta] = E[\theta^2] - E[\theta]^2$ and $E[\theta] = a/(a+b)$ to get the variance,
\[ V[\theta] = \frac{ab}{(a+b)^2 (a+b+1)} \]

2.9 Bayes rule for medical diagnosis PRIVATE

Let $T = 1$ represent a positive test outcome, $T = 0$ a negative test outcome, $D = 1$ mean you have the disease, and $D = 0$ mean you don’t. We are told
\[ P(T=1 \mid D=1) = 0.99, \qquad P(T=0 \mid D=0) = 0.99, \qquad P(D=1) = 0.0001 \]
We are asked to compute $P(D=1 \mid T=1)$, which we can do using Bayes’ rule:
\[ P(D=1 \mid T=1) = \frac{P(T=1 \mid D=1)\, P(D=1)}{P(T=1 \mid D=1)\, P(D=1) + P(T=1 \mid D=0)\, P(D=0)} = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} = 0.009804 \]
So although you are much more likely to have the disease (given that you have tested positive) than a random member of the population, you are still unlikely to have it.
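The final arithmetic is easy to reproduce. Here is a minimal Python sketch (an illustration added for this edit, not from the manual) that evaluates the posterior from the three numbers given above:

```python
# Bayes' rule for the test in Exercise 2.9.
sensitivity = 0.99   # P(T=1 | D=1)
specificity = 0.99   # P(T=0 | D=0)
prior = 0.0001       # P(D=1), the base rate of the disease

numer = sensitivity * prior
denom = numer + (1 - specificity) * (1 - prior)
print(numer / denom)   # ~0.009804: still under 1% despite the positive test
```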
2.10 Legal reasoning

Let $E$ be the evidence (the observed blood type), $I$ be the event that the defendant is innocent, and $G = \neg I$ be the event that the defendant is guilty.

1. The prosecutor is confusing $p(E \mid I)$ with $p(I \mid E)$. We are told that $p(E \mid I) = 0.01$, but the relevant quantity is $p(I \mid E)$. By Bayes’ rule, this is
\[ p(I \mid E) = \frac{p(E \mid I)\, p(I)}{p(E \mid I)\, p(I) + p(E \mid G)\, p(G)} = \frac{0.01\, p(I)}{0.01\, p(I) + (1 - p(I))} \]
since $p(E \mid G) = 1$ and $p(G) = 1 - p(I)$. So we cannot determine $p(I \mid E)$ without knowing the prior probability $p(I)$. Note that $p(E \mid I) = p(I \mid E)$ only if $p(G) = p(I) = 0.5$, which is hardly a presumption of innocence.

To understand this more intuitively, consider the following isomorphic problem (from http://en.wikipedia.org/wiki/Prosecutor's_fallacy):

A big bowl is filled with a large but unknown number of balls. Some of the balls are made of wood, and some of them are made of plastic. Of the wooden balls, 100 are white; out of the plastic balls, 99 are red and only 1 is white. A ball is pulled out at random, and observed to be white. Without knowledge of the relative proportions of wooden and plastic balls, we cannot tell how likely it is that the ball is wooden. If the number of plastic balls is far larger than the number of wooden balls, for instance, then a white ball pulled from the bowl at random is far more likely to be a white plastic ball than a white wooden ball — even though white plastic balls are a minority of the whole set of plastic balls.

2. The defender is quoting $p(G \mid E)$ while ignoring $p(G)$. The prior odds are
\[ \frac{p(G)}{p(I)} = \frac{1}{799{,}999} \]
The posterior odds are
\[ \frac{p(G \mid E)}{p(I \mid E)} = \frac{1}{7999} \]
So the evidence has increased the odds of guilt by a factor of 100. This is clearly relevant, although perhaps still not enough to find the suspect guilty.

2.11 Probabilities are sensitive to the form of the question that was used to generate the answer PRIVATE

1. The event space is shown below, where $X$ is one child and $Y$ the other.

X  Y  Prob.
G  G  1/4
G  B  1/4
B  G  1/4
B  B  1/4

Let $N_g$ be the number of girls and $N_b$ the number of boys. We have the constraint (side information) that $N_b + N_g = 2$ and $0 \le N_b, N_g \le 2$. We are told $N_b \ge 1$ and are asked to compute the probability of the event $N_g = 1$ (i.e., one child is a girl). By Bayes’ rule we have
\[ p(N_g = 1 \mid N_b \ge 1) = \frac{p(N_b \ge 1 \mid N_g = 1)\, p(N_g = 1)}{p(N_b \ge 1)} = \frac{1 \times 1/2}{3/4} = 2/3 \]

2. Let $Y$ be the identity of the observed child and $X$ be the identity of the other child. We want $p(X = g \mid Y = b)$. By Bayes’ rule we have
\[ p(X = g \mid Y = b) = \frac{p(Y = b \mid X = g)\, p(X = g)}{p(Y = b)} = \frac{(1/2) \times (1/2)}{1/2} = 1/2 \]

Tom Minka [Min98] has written the following about these results:

This seems like a paradox because it seems that in both cases we could condition on the fact that “at least one child is a boy.” But that is not correct; you must condition on the event actually observed, not its logical implications. In the first case, the event was “He said yes to my question.” In the second case, the event was “One child appeared in front of me.” The generating distribution is different for the two events. Probabilities reflect the number of possible ways an event can happen, like the number of roads to a town. Logical implications are further down the road and may be reached in more ways, through different towns. The different number of ways changes the probability.
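Minka's distinction between the two observation processes can be checked by simulation. The sketch below (my own illustration; the variable names and sample size are arbitrary) generates random two-child families and estimates both conditional probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
kids = rng.integers(0, 2, size=(n, 2))    # 0 = girl, 1 = boy

# Question 1: condition on "at least one child is a boy".
at_least_one_boy = kids.max(axis=1) == 1
one_girl = kids.sum(axis=1) == 1          # exactly one girl and one boy
print(one_girl[at_least_one_boy].mean())  # ~2/3

# Question 2: one child, chosen uniformly, is observed to be a boy.
pick = rng.integers(0, 2, size=n)
shown = kids[np.arange(n), pick]
other = kids.sum(axis=1) - shown          # sex of the unobserved child
print((other[shown == 1] == 0).mean())    # ~1/2 (other child is a girl)
```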
2.12 Normalization constant for a 1D Gaussian

Following the first hint we have
\[ Z^2 = \int_0^{2\pi} \int_0^\infty r \exp\left( -\frac{r^2}{2\sigma^2} \right) dr\, d\theta = \left[ \int_0^{2\pi} d\theta \right] \left[ \int_0^\infty r \exp\left( -\frac{r^2}{2\sigma^2} \right) dr \right] = (2\pi) I \]
where $I$ is the inner integral
\[ I = \int_0^\infty r \exp\left( -\frac{r^2}{2\sigma^2} \right) dr \]
Following the second hint we have
\[ I = -\sigma^2 \int_0^\infty -\frac{r}{\sigma^2}\, e^{-r^2/2\sigma^2}\, dr = -\sigma^2 \left[ e^{-r^2/2\sigma^2} \right]_0^\infty = -\sigma^2 [0 - 1] = \sigma^2 \]
Hence
\[ Z^2 = 2\pi\sigma^2, \qquad Z = \sigma\sqrt{2\pi} \]

3 Solutions

3.1 Uncorrelated does not imply independent PRIVATE

Recall from the exercise that $X \sim U(-1, 1)$ and $Y = X^2$. We have
\[ E[X] = 0 \]
\[ V[X] = (1 - (-1))^2 / 12 = 1/3 \]
\[ E[Y] = E[X^2] = V[X] + (E[X])^2 = 1/3 + 0 \]
\[ E[XY] = \int_{-1}^{1} p(x)\, x^3\, dx = \frac{1}{2}\left[ \int_{-1}^{0} x^3\, dx + \int_0^1 x^3\, dx \right] = \frac{1}{2}\left[ \frac{1}{4}[x^4]_{-1}^{0} + \frac{1}{4}[x^4]_{0}^{1} \right] = \frac{1}{8}[-1 + 1] = 0 \]
where we have split the integral into two pieces, since $x^3$ changes sign in the interval $[-1, 1]$. Hence
\[ \rho = \frac{\mathrm{Cov}[X, Y]}{\sigma_X \sigma_Y} = \frac{E[XY] - E[X] E[Y]}{\sigma_X \sigma_Y} = \frac{0}{\sigma_X \sigma_Y} = 0 \]

3.2 Correlation coefficient is between -1 and +1

We have
\[ 0 \le V\left[ \frac{X}{\sigma_X} + \frac{Y}{\sigma_Y} \right] = V\left[ \frac{X}{\sigma_X} \right] + V\left[ \frac{Y}{\sigma_Y} \right] + 2\,\mathrm{Cov}\left[ \frac{X}{\sigma_X}, \frac{Y}{\sigma_Y} \right] = \frac{V[X]}{\sigma_X^2} + \frac{V[Y]}{\sigma_Y^2} + 2\rho = 1 + 1 + 2\rho = 2(1 + \rho) \]
Hence $\rho \ge -1$. Similarly,
\[ 0 \le V\left[ \frac{X}{\sigma_X} - \frac{Y}{\sigma_Y} \right] = 2(1 - \rho) \]
so $\rho \le 1$.

3.3 Correlation coefficient for linearly related variables is ±1 PRIVATE

Let $E[X] = \mu$ and $V[X] = \sigma^2$, and assume $a > 0$. Then we have
\[ E[Y] = E[aX + b] = a E[X] + b = a\mu + b \]
\[ V[Y] = V[aX + b] = a^2 V[X] = a^2 \sigma^2 \]
\[ E[XY] = E[X(aX + b)] = E[aX^2 + bX] = a E[X^2] + b E[X] = a(\mu^2 + \sigma^2) + b\mu \]
\[ \mathrm{Cov}[X, Y] = E[XY] - E[X] E[Y] = a(\mu^2 + \sigma^2) + b\mu - \mu(a\mu + b) = a\sigma^2 \]
\[ \rho = \frac{\mathrm{Cov}[X, Y]}{\sqrt{V[X]\, V[Y]}} = \frac{a\sigma^2}{\sqrt{\sigma^2} \sqrt{a^2 \sigma^2}} = 1 \]
If $Y = aX + b$ with $a < 0$, we find $\rho = -1$, since $a/\sqrt{a^2} = a/|a| = -1$.
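As an empirical check of the ±1 result, the following Python sketch (added here as an illustration; the values of $a$, $b$, and the sampling distribution are arbitrary choices) estimates the correlation coefficient for linearly related samples:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1_000_000)

for a in (2.5, -2.5):
    y = a * x + 1.0                 # Y = aX + b with b = 1
    rho = np.corrcoef(x, y)[0, 1]
    print(a, round(rho, 6))         # ~ +1 for a > 0, ~ -1 for a < 0
```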

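Finally, the closed form $Z = \sigma\sqrt{2\pi}$ from Exercise 2.12 can be confirmed by direct numerical integration. This sketch uses SciPy's quad ($\sigma = 1.7$ is just a test value; the snippet is an illustration, not part of the manual):

```python
import numpy as np
from scipy.integrate import quad

sigma = 1.7
Z_numeric, _ = quad(lambda x: np.exp(-x**2 / (2 * sigma**2)), -np.inf, np.inf)
print(Z_numeric, sigma * np.sqrt(2 * np.pi))   # the two values agree closely
```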