Full Solution Manual for
“Probabilistic Machine Learning: An Introduction”
Kevin Murphy
November 21, 2021
1 Solutions
Part I
Foundations
2 Solutions
2.1 Conditional independence
1. Bayes' rule gives

P(H|E_1,E_2) = P(E_1,E_2|H) P(H) / P(E_1,E_2)    (1)

Thus the information in (ii) is sufficient. In fact, we don't need P(E_1,E_2) because it is equal to the
normalization constant (which enforces the sum-to-one constraint). (i) and (iii) are insufficient.
2. Now the equation simplifies to

P(H|E_1,E_2) = P(E_1|H) P(E_2|H) P(H) / P(E_1,E_2)    (2)

so (i) and (ii) are obviously sufficient. (iii) is also sufficient, because we can compute P(E_1,E_2) using
normalization.
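As a quick numerical illustration (an addition to the solution, with made-up numbers), the sketch below computes the posterior in Eq. (2) from P(E_1|H), P(E_2|H), and P(H) alone, recovering P(E_1,E_2) as the normalizer:

# Minimal sketch (hypothetical numbers): posterior over two hypotheses from
# class-conditional likelihoods and a prior, normalizing at the end.
import numpy as np

p_h = np.array([0.3, 0.7])            # hypothetical prior P(H)
p_e1_given_h = np.array([0.9, 0.2])   # hypothetical P(E1 = 1 | H)
p_e2_given_h = np.array([0.8, 0.4])   # hypothetical P(E2 = 1 | H)

unnorm = p_e1_given_h * p_e2_given_h * p_h    # numerator of Eq. (2)
posterior = unnorm / unnorm.sum()             # P(E1, E2) is just the normalizer
print(posterior)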
2.2 Pairwise independence does not imply mutual independence
We provide two counterexamples.
Let X_1 and X_2 be independent binary random variables, and let X_3 = X_1 ⊕ X_2, where ⊕ is the XOR
operator. We have p(X_3|X_1,X_2) ≠ p(X_3), since X_3 can be deterministically calculated from X_1 and X_2. So
the variables {X_1,X_2,X_3} are not mutually independent. However, we also have p(X_3|X_1) = p(X_3), since
without X_2, X_1 provides no information about X_3. So X_1 ⊥ X_3 and similarly X_2 ⊥ X_3. Hence {X_1,X_2,X_3}
are pairwise independent.
Here is a different example. Let there be four balls in a bag, numbered 1 to 4. Suppose we draw one at
random. Define 3 events as follows:
• X_1: ball 1 or 2 is drawn.
• X_2: ball 2 or 3 is drawn.
• X_3: ball 1 or 3 is drawn.
We have p(X_1) = p(X_2) = p(X_3) = 0.5. Also, p(X_1,X_2) = p(X_2,X_3) = p(X_1,X_3) = 0.25. Hence
p(X_1,X_2) = p(X_1)p(X_2), and similarly for the other pairs. Hence the events are pairwise independent.
However, p(X_1,X_2,X_3) = 0 ≠ 1/8 = p(X_1)p(X_2)p(X_3).
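A quick simulation (an addition, not part of the original solution) confirms the XOR counterexample: X_3 is independent of X_1 alone but fully determined by the pair (X_1, X_2).

# Sanity check of the XOR example with one million simulated draws.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=1_000_000)
x2 = rng.integers(0, 2, size=1_000_000)
x3 = x1 ^ x2

print(x3.mean(), x3[x1 == 1].mean())        # both ~0.5, so p(X3|X1) = p(X3)
print(x3[(x1 == 1) & (x2 == 1)].mean())     # exactly 0: X3 is determined by (X1, X2)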
2.3 Conditional independence iff joint factorizes
Independence ⇒ Factorization. Let g(x,z) = p(x|z) and h(y,z) = p(y|z). If X ⊥ Y | Z then

p(x,y|z) = p(x|z) p(y|z) = g(x,z) h(y,z)    (3)
Factorization ⇒ Independence. If p(x,y|z) = g(x,z) h(y,z) then

1 = ∑_{x,y} p(x,y|z) = ∑_{x,y} g(x,z) h(y,z) = [∑_x g(x,z)] [∑_y h(y,z)]    (4)

p(x|z) = ∑_y p(x,y|z) = ∑_y g(x,z) h(y,z) = g(x,z) ∑_y h(y,z)    (5)

p(y|z) = ∑_x p(x,y|z) = ∑_x g(x,z) h(y,z) = h(y,z) ∑_x g(x,z)    (6)

p(x|z) p(y|z) = g(x,z) h(y,z) [∑_x g(x,z)] [∑_y h(y,z)]    (7)
              = g(x,z) h(y,z) = p(x,y|z)    (8)

where the last step uses Eq. (4).
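The following numerical illustration (an addition; the factors are random and hypothetical) builds p(x,y|z) ∝ g(x,z) h(y,z) on a small discrete grid and verifies that it equals p(x|z) p(y|z):

# Build a conditional joint from arbitrary nonnegative factors and check X ⊥ Y | Z.
import numpy as np

rng = np.random.default_rng(1)
g = rng.random((4, 3))                          # hypothetical g(x, z): 4 x-values, 3 z-values
h = rng.random((5, 3))                          # hypothetical h(y, z): 5 y-values

joint = np.einsum('xz,yz->xyz', g, h)           # unnormalized p(x, y | z)
joint /= joint.sum(axis=(0, 1), keepdims=True)  # normalize over (x, y) for each z

px = joint.sum(axis=1, keepdims=True)           # p(x | z)
py = joint.sum(axis=0, keepdims=True)           # p(y | z)
print(np.allclose(joint, px * py))              # True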
2.4 Convolution of two Gaussians is a Gaussian
We follow the derivation of [Jay03, p221]. Define

φ(x − µ | σ) = N(x | µ, σ²) = (1/σ) φ((x − µ)/σ)    (9)

where φ(z) is the pdf of the standard normal. We have

p(y) = N(x_1 | µ_1, σ_1²) ⊗ N(x_2 | µ_2, σ_2²)    (10)
     = ∫ dx_1 φ(x_1 − µ_1 | σ_1) φ(y − x_1 − µ_2 | σ_2)    (11)

Now the product inside the integral can be written as follows:

φ(x_1 − µ_1 | σ_1) φ(y − x_1 − µ_2 | σ_2) = (1/(2π σ_1 σ_2)) exp{ −(1/2) [ ((x_1 − µ_1)/σ_1)² + ((y − x_1 − µ_2)/σ_2)² ] }    (12)

We can bring out the dependency on x_1 by rearranging the quadratic form inside the exponent as follows:

((x_1 − µ_1)/σ_1)² + ((y − x_1 − µ_2)/σ_2)² = (w_1 + w_2)(x_1 − x̂)² + (w_1 w_2 / (w_1 + w_2)) (y − µ_1 − µ_2)²    (13)

where we have defined the precision or weighting terms w_1 = 1/σ_1², w_2 = 1/σ_2², and the term

x̂ = (w_1 µ_1 + w_2 y − w_2 µ_2) / (w_1 + w_2)    (14)

Note that

w_1 w_2 / (w_1 + w_2) = (1/(σ_1² σ_2²)) (σ_1² σ_2² / (σ_1² + σ_2²)) = 1/(σ_1² + σ_2²)    (15)

Hence

p(y) = (1/(2π σ_1 σ_2)) ∫ exp[ −(1/2)(w_1 + w_2)(x_1 − x̂)² − (1/2)(w_1 w_2 / (w_1 + w_2))(y − µ_1 − µ_2)² ] dx_1    (16)

The integral over x_1 becomes one over the normalization constant for the Gaussian:

∫ exp[ −(1/2)(w_1 + w_2)(x_1 − x̂)² ] dx_1 = (2π)^{1/2} (σ_1² σ_2² / (σ_1² + σ_2²))^{1/2}    (17)

Hence

p(y) = (2π)^{−1} (σ_1² σ_2²)^{−1/2} (2π)^{1/2} (σ_1² σ_2² / (σ_1² + σ_2²))^{1/2} exp[ −(y − µ_1 − µ_2)² / (2(σ_1² + σ_2²)) ]    (18)
     = (2π)^{−1/2} (σ_1² + σ_2²)^{−1/2} exp[ −(y − µ_1 − µ_2)² / (2(σ_1² + σ_2²)) ]    (19)
     = N(y | µ_1 + µ_2, σ_1² + σ_2²)    (20)
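As a sanity check on Eq. (20) (an addition, with arbitrary parameter values), a Monte Carlo simulation of the sum of two independent Gaussians recovers the predicted mean and variance:

# Sum of independent Gaussians: empirical moments vs. mu1+mu2 and sigma1^2+sigma2^2.
import numpy as np

rng = np.random.default_rng(0)
mu1, sigma1, mu2, sigma2 = 1.0, 2.0, -3.0, 0.5          # arbitrary choices
y = rng.normal(mu1, sigma1, 1_000_000) + rng.normal(mu2, sigma2, 1_000_000)

print(y.mean(), mu1 + mu2)                # both ~ -2.0
print(y.var(), sigma1**2 + sigma2**2)     # both ~ 4.25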
2.5 Expected value of the minimum of two rv’s
Let Z = min(X,Y). We have P(Z > z) = P(X > z, Y > z) = P(X > z) P(Y > z) = (1 − z)² = 1 − 2z + z².
Hence the cdf of the minimum is P(Z ≤ z) = 2z − z² and the pdf is p(z) = 2 − 2z. Hence the expected value
is

E[Z] = ∫_0^1 z(2 − 2z) dz = 1/3    (21)
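A short simulation (an addition, assuming X and Y are independent Uniform(0,1) variables as in the exercise) confirms the value 1/3:

# Empirical mean of min(X, Y) for X, Y ~ Unif(0, 1).
import numpy as np

rng = np.random.default_rng(0)
z = np.minimum(rng.uniform(size=1_000_000), rng.uniform(size=1_000_000))
print(z.mean())   # ~0.333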
2.6 Variance of a sum
We have
V[X + Y] = E[(X + Y)²] − (E[X] + E[Y])²    (22)
         = E[X² + Y² + 2XY] − (E[X]² + E[Y]² + 2E[X]E[Y])    (23)
         = E[X²] − E[X]² + E[Y²] − E[Y]² + 2E[XY] − 2E[X]E[Y]    (24)
         = V[X] + V[Y] + 2Cov[X,Y]    (25)
If X and Y are independent, then Cov[X,Y]=0, so V[X+Y]=V[X]+V[Y].
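The identity is easy to check numerically (an addition; the correlated pair below is arbitrary):

# V[X+Y] vs. V[X] + V[Y] + 2 Cov[X, Y] for a correlated pair.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = 0.5 * x + rng.normal(size=1_000_000)                 # arbitrary correlated Y

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, ddof=0)[0, 1]
print(lhs, rhs)   # agree up to floating-point error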
2.7 Deriving the inverse gamma density
We have

p_y(y) = p_x(x) |dx/dy|, where x = 1/y    (26)

and

dx/dy = −1/y² = −x²    (27)

So

p_y(y) = x² (b^a / Γ(a)) x^{a−1} e^{−xb}    (28)
       = (b^a / Γ(a)) x^{a+1} e^{−xb}    (29)
       = (b^a / Γ(a)) y^{−(a+1)} e^{−b/y} = IG(y | a, b)    (30)
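As a check (an addition), the density in Eq. (30) matches scipy's inverse-gamma pdf with shape a and scale b, for an arbitrary choice of parameters:

# Compare the derived IG(y|a, b) density with scipy.stats.invgamma.
import numpy as np
from scipy.special import gamma as gamma_fn
from scipy.stats import invgamma

a, b = 3.0, 2.0                                              # arbitrary parameters
y = np.linspace(0.1, 5.0, 50)
ours = b**a / gamma_fn(a) * y**(-(a + 1)) * np.exp(-b / y)   # Eq. (30)
print(np.allclose(ours, invgamma.pdf(y, a, scale=b)))        # True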
2.8 Mean, mode, variance for the beta distribution
For the mode we can use simple calculus, as follows. Let f(x)=Beta(x|a,b). We have
0 = df/dx    (31)
  = (a − 1) x^{a−2} (1 − x)^{b−1} − (b − 1) x^{a−1} (1 − x)^{b−2}    (32)

Dividing through by x^{a−2}(1 − x)^{b−2} gives

0 = (a − 1)(1 − x) − (b − 1)x    (33)
x = (a − 1) / (a + b − 2)    (34)

For the mean we have

E[θ|D] = ∫_0^1 θ p(θ|D) dθ    (35)
       = (Γ(a+b) / (Γ(a)Γ(b))) ∫_0^1 θ^{(a+1)−1} (1 − θ)^{b−1} dθ    (36)
       = (Γ(a+b) / (Γ(a)Γ(b))) (Γ(a+1)Γ(b) / Γ(a+1+b)) = a / (a + b)    (37)

where we used the definition of the Gamma function and the fact that Γ(x+1) = xΓ(x).
We can find the variance in the same way, by first showing that

E[θ²] = (Γ(a+b) / (Γ(a)Γ(b))) ∫_0^1 θ^{(a+2)−1} (1 − θ)^{b−1} dθ    (38)
      = (Γ(a+b) / (Γ(a)Γ(b))) (Γ(a+2)Γ(b) / Γ(a+2+b)) = (a / (a + b)) ((a + 1) / (a + 1 + b))    (39)

Now we use V[θ] = E[θ²] − E[θ]² and E[θ] = a/(a+b) to get the variance.
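These formulas are easy to verify against scipy (an addition; the values of a and b are arbitrary):

# Mean, mode, and variance of Beta(a, b) vs. scipy.stats.beta.
from scipy.stats import beta

a, b = 3.0, 5.0                                        # arbitrary parameters
mean = a / (a + b)
mode = (a - 1) / (a + b - 2)
var = a * (a + 1) / ((a + b) * (a + b + 1)) - mean**2  # E[theta^2] - E[theta]^2

print(mean, beta.mean(a, b))   # 0.375  0.375
print(var, beta.var(a, b))     # both ~0.026
print(mode)                    # 0.333..., location of the density's maximum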
2.9 Bayes rule for medical diagnosis
Let T =1 represent a positive test outcome, T =0 represent a negative test outcome, D =1 mean you have
the disease, and D =0 mean you don’t have the disease. We are told
P(T =1|D =1)=0.99 (40)
P(T =0|D =0)=0.99 (41)
P(D =1)=0.0001 (42)
We are asked to compute P(D =1|T =1), which we can do using Bayes’ rule:
P(D=1|T=1) = P(T=1|D=1) P(D=1) / [P(T=1|D=1) P(D=1) + P(T=1|D=0) P(D=0)]    (43)
           = (0.99 × 0.0001) / (0.99 × 0.0001 + 0.01 × 0.9999)    (44)
           = 0.009804    (45)
So although you are much more likely to have the disease (given that you have tested positive) than a random
member of the population, you are still unlikely to have it.
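The arithmetic in Eqs. (44)-(45) can be confirmed in a few lines (an addition):

# Posterior probability of disease given a positive test.
sens, spec, prior = 0.99, 0.99, 0.0001
post = sens * prior / (sens * prior + (1 - spec) * (1 - prior))
print(post)   # 0.00980...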
2.10 Legal reasoning
Let E be the evidence (the observed blood type), and I be the event that the defendant is innocent, and
G=¬I be the event that the defendant is guilty.
1. The prosecutor is confusing p(E|I) with p(I|E). We are told that p(E|I) = 0.01 but the relevant
quantity is p(I|E). By Bayes rule, this is
p(I|E) = p(E|I) p(I) / [p(E|I) p(I) + p(E|G) p(G)] = 0.01 p(I) / [0.01 p(I) + (1 − p(I))]    (46)
since p(E|G) = 1 and p(G) = 1−p(I). So we cannot determine p(I|E) without knowing the prior
probability p(I). So p(E|I) = p(I|E) only if p(G) = p(I) = 0.5, which is hardly a presumption of
innocence.
To understand this more intuitively, consider the following isomorphic problem (from http://en.wikipedia.org/wiki/Prosecutor’s_fallacy):
A big bowl is filled with a large but unknown number of balls. Some of the balls are made of
wood, and some of them are made of plastic. Of the wooden balls, 100 are white; out of the
plastic balls, 99 are red and only 1 is white. A ball is pulled out at random, and observed
to be white.
Without knowledge of the relative proportions of wooden and plastic balls, we cannot tell how likely it
is that the ball is wooden. If the number of plastic balls is far larger than the number of wooden balls,
for instance, then a white ball pulled from the bowl at random is far more likely to be a white plastic
ball than a white wooden ball — even though white plastic balls are a minority of the whole set of
plastic balls.
2. The defender is quoting p(G|E) while ignoring p(G). The prior odds are
p(G) / p(I) = 1/799,999    (47)
The posterior odds are
p(G|E) / p(I|E) = 1/7999    (48)
So the evidence has increased the odds of guilt by a factor of 100. This is clearly relevant, although
perhaps still not enough to find the suspect guilty.
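A quick check of these odds (an addition, assuming the exercise's setup of a population of 800,000 people and a blood type shared by 1% of them):

# Prior odds, likelihood ratio of the blood-type evidence, and posterior odds.
prior_odds = 1 / 799_999
likelihood_ratio = 1 / 0.01                 # p(E|G) / p(E|I)
posterior_odds = likelihood_ratio * prior_odds
print(posterior_odds, 1 / posterior_odds)   # ~1.25e-4, i.e. roughly 1/8000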
2.11 Probabilities are sensitive to the form of the question that was used to
generate the answer
1. The event space is shown below, where X is one child and Y the other.
X Y Prob.
G G 1/4
G B 1/4
B G 1/4
B B 1/4
Let N_g be the number of girls and N_b the number of boys. We have the constraint (side information)
that N_g + N_b = 2 and 0 ≤ N_g, N_b ≤ 2. We are told N_b ≥ 1 and are asked to compute the probability of
the event N_g = 1 (i.e., one child is a girl). By Bayes rule we have

p(N_g = 1 | N_b ≥ 1) = p(N_b ≥ 1 | N_g = 1) p(N_g = 1) / p(N_b ≥ 1)    (49)
                     = (1 × 1/2) / (3/4) = 2/3    (50)
2. Let Y be the identity of the observed child and X be the identity of the other child. We want
p(X =g|Y =b). By Bayes rule we have
p(X = g | Y = b) = p(Y = b | X = g) p(X = g) / p(Y = b)    (51)
                 = ((1/2) × (1/2)) / (1/2) = 1/2    (52)
Tom Minka [Min98] has written the following about these results:
This seems like a paradox because it seems that in both cases we could condition on the fact that
"at least one child is a boy." But that is not correct; you must condition on the event actually
observed, not its logical implications. In the first case, the event was "He said yes to my question."
In the second case, the event was "One child appeared in front of me." The generating distribution
is different for the two events. Probabilities reflect the number of possible ways an event can
happen, like the number of roads to a town. Logical implications are further down the road and
may be reached in more ways, through different towns. The different number of ways changes the
probability.
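Both answers are easy to reproduce by simulation (an addition, not part of the quoted discussion):

# Two children with independent equiprobable sexes; 1 = boy, 0 = girl.
import numpy as np

rng = np.random.default_rng(0)
kids = rng.integers(0, 2, size=(1_000_000, 2))

# Question 1: condition on "at least one boy", ask for exactly one girl.
at_least_one_boy = kids.max(axis=1) == 1
print((kids.sum(axis=1) == 1)[at_least_one_boy].mean())   # ~2/3

# Question 2: a randomly chosen child is observed to be a boy; ask about the other.
idx = np.arange(len(kids))
observed = rng.integers(0, 2, size=len(kids))
seen = kids[idx, observed]
other = kids[idx, 1 - observed]
print((other[seen == 1] == 0).mean())                     # ~1/2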
2.12 Normalization constant for a 1D Gaussian
Following the first hint we have

Z² = ∫_0^{2π} ∫_0^∞ r exp(−r²/(2σ²)) dr dθ    (53)
   = [∫_0^{2π} dθ] [∫_0^∞ r exp(−r²/(2σ²)) dr]    (54)
   = (2π) I    (55)

where I is the inner integral

I = ∫_0^∞ r exp(−r²/(2σ²)) dr    (56)

Following the second hint we have

I = −σ² ∫_0^∞ (−r/σ²) e^{−r²/(2σ²)} dr    (57)
  = −σ² [e^{−r²/(2σ²)}]_0^∞    (58)
  = −σ² [0 − 1] = σ²    (59)

Hence

Z² = 2πσ²    (60)
Z = σ √(2π)    (61)
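As a numerical confirmation (an addition, for one arbitrary σ), quadrature over the unnormalized Gaussian reproduces σ√(2π):

# Integrate exp(-x^2 / (2 sigma^2)) over the real line; compare with sigma*sqrt(2*pi).
import numpy as np
from scipy.integrate import quad

sigma = 1.7                                               # arbitrary choice
Z, _ = quad(lambda x: np.exp(-x**2 / (2 * sigma**2)), -np.inf, np.inf)
print(Z, sigma * np.sqrt(2 * np.pi))   # both ~4.2614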
3 Solutions
3.1 Uncorrelated does not imply independent
We have

E[X] = 0    (62)
V[X] = (1 − (−1))²/12 = 1/3    (63)
E[Y] = E[X²] = V[X] + (E[X])² = 1/3 + 0    (64)

E[XY] = ∫_{−1}^{1} p(x) x³ dx = (1/2) [ ∫_{−1}^{0} x³ dx + ∫_{0}^{1} x³ dx ]    (65)
      = (1/2) [ (1/4)[x⁴]_{−1}^{0} + (1/4)[x⁴]_{0}^{1} ] = (1/8)[−1 + 1] = 0    (66)

where we have split the integral into two pieces, since x³ changes sign in the interval [−1,1]. Hence

ρ = Cov[X,Y] / (σ_X σ_Y) = (E[XY] − E[X]E[Y]) / (σ_X σ_Y) = 0 / (σ_X σ_Y) = 0    (67)
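A simulation (an addition, using the exercise's setup X ~ Unif(−1,1) and Y = X²) shows the pair is uncorrelated even though Y is a deterministic function of X:

# Uncorrelated but clearly dependent: Y = X^2 with X uniform on [-1, 1].
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1_000_000)
y = x**2
print(np.corrcoef(x, y)[0, 1])               # ~0: uncorrelated
print(y[np.abs(x) < 0.1].mean(), y.mean())   # ~0.003 vs. ~0.33: knowing X changes Y's distribution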
3.2 Correlation coefficient is between -1 and +1
We have

0 ≤ V[X/σ_X + Y/σ_Y]    (68)
  = V[X/σ_X] + V[Y/σ_Y] + 2 Cov[X/σ_X, Y/σ_Y]    (69)
  = V[X]/σ_X² + V[Y]/σ_Y² + 2 Cov[X/σ_X, Y/σ_Y]    (70)
  = 1 + 1 + 2ρ = 2(1 + ρ)    (71)

Hence ρ ≥ −1. Similarly,

0 ≤ V[X/σ_X − Y/σ_Y] = 2(1 − ρ)    (72)

so ρ ≤ 1.
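The key identity V[X/σ_X + Y/σ_Y] = 2(1 + ρ) is easy to verify empirically (an addition; the correlated pair is arbitrary):

# Variance of the standardized sum vs. 2(1 + rho).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = -0.7 * x + rng.normal(size=1_000_000)   # arbitrary correlated Y
rho = np.corrcoef(x, y)[0, 1]

print(np.var(x / x.std() + y / y.std()), 2 * (1 + rho))   # agree up to floating-point error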
3.3 Correlation coefficient for linearly related variables is ±1
Let E[X] = µ and V[X] = σ², and assume a > 0. Then we have

E[Y] = E[aX + b] = aE[X] + b = aµ + b    (73)
V[Y] = V[aX + b] = a²V[X] = a²σ²    (74)
E[XY] = E[X(aX + b)] = E[aX² + bX] = aE[X²] + bE[X] = a(µ² + σ²) + bµ    (75)
Cov[X,Y] = E[XY] − E[X]E[Y] = a(µ² + σ²) + bµ − µ(aµ + b) = aσ²    (76)
ρ = Cov[X,Y] / √(V[X]V[Y]) = aσ² / (√(σ²) √(a²σ²)) = 1    (77)

If Y = aX + b, where a < 0, we find ρ = −1, since a/√(a²) = a/|a| = −1.
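A short numerical check (an addition, with arbitrary a and b):

# Correlation of Y = aX + b with X, for positive and negative slope.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
print(np.corrcoef(x,  2.0 * x + 3.0)[0, 1])   # 1.0
print(np.corrcoef(x, -2.0 * x + 3.0)[0, 1])   # -1.0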