TRANSCRIPT
Clustering and the EM algorithm
Rich Turner and Jose Miguel Hernandez-Lobato
What is clustering?

Example applications: network community detection (Campbell et al., Social Network Analysis), image segmentation, vector quantisation, genetic clustering, anomaly detection, crime analysis.
What is clustering?

- Roughly speaking, two points belonging to the same cluster are generally more similar to each other, or closer to each other, than two points belonging to different clusters.
- D = {x1, . . . , xN} → s = {s1, . . . , sN} where sn ∈ {1, . . . , K}
- Unsupervised learning problem (no labels or rewards)
A first clustering algorithm: k-means

[figure: 2D data (axes x1, x2) with cluster centres, shown over successive k-means iterations]

input: D = {x1, . . . , xN}, xn ∈ R^D
initialise: mk ∈ R^D for k = 1 . . . K
repeat
  for n = 1 . . . N
    sn = arg min_k ||xn − mk||
  endfor
  for k = 1 . . . K
    mk = mean(xn : sn = k)
  endfor
until convergence (sn fixed)
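In code, one sweep of the two loops above is an assignment step followed by a mean-update step. A minimal NumPy sketch of the whole procedure (function and variable names are mine, not from the lecture):

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Basic k-means: X is an (N, D) data array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    m = X[rng.choice(N, size=K, replace=False)].copy()  # initialise centres at random data points
    s = np.full(N, -1)
    for _ in range(max_iters):
        # assignment step: each point goes to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)  # (N, K)
        s_new = dists.argmin(axis=1)
        if np.array_equal(s_new, s):
            break  # assignments fixed -> converged
        s = s_new
        # update step: each centre moves to the mean of its assigned points
        for k in range(K):
            if np.any(s == k):
                m[k] = X[s == k].mean(axis=0)
    return s, m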
Question: is K-means guaranteed to converge for any dataset D?

Could one or more of the cluster centres diverge or oscillate?
K-means as optimisation

Let sn,k = 1 if data point n is assigned to cluster k and zero otherwise. Note: ∑_{k=1..K} sn,k = 1.

Cost:

C({sn,k}, {mk}) = ∑_{n=1..N} ∑_{k=1..K} sn,k ||xn − mk||^2

K-means tries to minimise the cost function C with respect to {sn,k} and {mk}, subject to ∑_k sn,k = 1 and sn,k ∈ {0, 1}.

K-means sequentially:
- minimises C with respect to {sn,k}, holding {mk} fixed.
- minimises C with respect to {mk}, holding {sn,k} fixed.

C is a Lyapunov function =⇒ K-means converges, but... finding the global optimum of C is a hard problem.
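The cost is a couple of lines of NumPy; this sketch assumes the data X, centres m and assignments s are arrays as in the k-means sketch earlier:

import numpy as np

def kmeans_cost(X, m, s):
    """C = sum_n ||x_n - m_{s_n}||^2: the within-cluster sum of squared distances."""
    return float(np.sum((X - m[s]) ** 2))

Each of the two alternating steps can only decrease (or leave unchanged) this quantity, which is what makes C act as a Lyapunov function for the algorithm.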
Where will K-means converge to when run on these data?

[figure: successive K-means iterations on a 2D dataset, axes x1 and x2]
Pathologies of K-means

- strange results when clusters have very different shapes
- returns hard assignments; soft assignments would be more sensible
- initialisation is delicate; the k-means++ algorithm spreads out the initial centres (a minimal sketch of this seeding rule follows below)
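For reference, a short sketch of the k-means++ seeding mentioned above: the first centre is a random data point and each subsequent centre is drawn with probability proportional to its squared distance from the nearest centre chosen so far (the function name is mine):

import numpy as np

def kmeanspp_init(X, K, seed=0):
    """k-means++ seeding: returns K initial centres chosen from the rows of X."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    centres = [X[rng.integers(N)]]  # first centre uniformly at random
    for _ in range(K - 1):
        # squared distance from every point to its nearest already-chosen centre
        d2 = ((X[:, None, :] - np.array(centres)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centres.append(X[rng.choice(N, p=d2 / d2.sum())])  # sample proportional to d^2
    return np.array(centres)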
Mixture of Gaussians: Generative Model

[figure: three Gaussian clusters in the (x1, x2) plane with parameters {π1, µ1, Σ1}, {π2, µ2, Σ2}, {π3, µ3, Σ3}]

For each data point n = 1 . . . N:

  sample cluster membership: p(sn = k|θ) = πk, noting ∑_{k=1..K} πk = 1

  sample the data value given the cluster membership: p(xn|sn = k, θ) = N(xn; µk, Σk)
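The generative process translates directly into a sampler; a short NumPy sketch with made-up 2D parameter values (none of these numbers come from the slides):

import numpy as np

rng = np.random.default_rng(0)

pis    = np.array([0.3, 0.4, 0.3])                                         # mixing proportions, sum to 1
mus    = np.array([[-1.0, -1.0], [0.0, 1.0], [1.5, -0.5]])                 # component means
Sigmas = np.array([0.10 * np.eye(2), 0.20 * np.eye(2), 0.15 * np.eye(2)])  # component covariances

N = 500
s = rng.choice(len(pis), size=N, p=pis)                                # sample cluster memberships s_n
X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in s])  # sample x_n given s_n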
How can we learn the parameters of this model from data using maximum likelihood?

θML = arg max_θ log p({xn}_{n=1}^N |θ)
Which plot shows the density p(x|θ) of this 1D Mixture of Gaussians?

mixing proportions: π1 = 1/4, π2 = 1/2 and π3 = 1/4
means: µ1 = 0, µ2 = 3 and µ3 = −3
standard deviations: σ1 = 1/2, σ2 = 1/2 and σ3 = 1

[figure: three candidate densities A, B and C plotted over x ∈ [−5, 5]]

Mixture of Gaussians: p(s = k|θ) = πk,  p(x|s = k, θ) = N(x; µk, σk^2)
Answer: p(x|θ) = (1/4) N(x; 0, (1/2)^2) + (1/2) N(x; 3, (1/2)^2) + (1/4) N(x; −3, 1)
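One way to check the answer (and to produce plots like A-C) is simply to evaluate the mixture density on a grid; a small sketch using SciPy:

import numpy as np
from scipy.stats import norm

pis    = [0.25, 0.50, 0.25]    # mixing proportions
mus    = [0.0, 3.0, -3.0]      # means
sigmas = [0.5, 0.5, 1.0]       # standard deviations

xs = np.linspace(-5, 5, 400)
# p(x|theta) = sum_k pi_k N(x; mu_k, sigma_k^2)
density = sum(pi * norm.pdf(xs, mu, sigma) for pi, mu, sigma in zip(pis, mus, sigmas))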
Different ways of maximising the likelihood

log p({xn}_{n=1}^N |θ) = log ∏_{n=1..N} p(xn|θ) = ∑_{n=1..N} log p(xn|θ)
                       = ∑_{n=1..N} log ∑_{k=1..K} p(xn, sn = k|θ)
                       = ∑_{n=1..N} log ∑_{k=1..K} p(sn = k|θ) p(xn|sn = k, θ)
                       = ∑_{n=1..N} log ∑_{k=1..K} πk N(xn; µk, σk^2)

- Method 1: direct gradient-based optimisation of the log-likelihood
- Method 2: the expectation-maximisation (EM) algorithm

Direct optimisation is generally faster, but the EM algorithm (i) is sometimes simpler to implement, (ii) is more widely used, and (iii) leads to important generalisations.

For generality, we simplify notation: let x = {xn}_{n=1}^N and s = {sn}_{n=1}^N, so

log p({xn}_{n=1}^N |θ) = log p(x|θ) = log ∑_s p(x, s|θ)
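Whichever method is used, one needs to evaluate this log-likelihood; a numerically stable sketch for the 1D case, computing each log ∑_k term with a log-sum-exp (the function name is mine):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mog_loglik(x, pis, mus, sigmas):
    """log p(x|theta) = sum_n log sum_k pi_k N(x_n; mu_k, sigma_k^2)."""
    x, pis, mus, sigmas = map(np.asarray, (x, pis, mus, sigmas))
    # log of each weighted component density, shape (N, K)
    log_terms = np.log(pis)[None, :] + norm.logpdf(x[:, None], mus[None, :], sigmas[None, :])
    return logsumexp(log_terms, axis=1).sum()

Here pis, mus and sigmas are length-K arrays; the log-sum-exp avoids underflow when some components assign tiny probability to a point.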
A lower bound on the log-likelihood log p(x|θ)

The log-likelihood can be decomposed as

log p(x|θ) = F(q(s), θ) + KL(q(s)||p(s|x, θ)),

where q(s) is an arbitrary distribution over class memberships, p(s|x, θ) is the posterior distribution over class memberships, KL(·||·) is the Kullback-Leibler divergence, and F(q(s), θ) is the free-energy.
A brief introduction to the Kullback-Leibler divergence

KL(p1(z)||p2(z)) = ∑_z p1(z) log [ p1(z) / p2(z) ]

Important properties:
- Gibbs' inequality: KL(p1(z)||p2(z)) ≥ 0, with equality only at p1(z) = p2(z)
  - proof via Jensen's inequality or differentiation (see MacKay pg. 35)
- Non-symmetric: KL(p1(z)||p2(z)) ≠ KL(p2(z)||p1(z))
  - hence named a divergence and not a distance

Example:
- binary variables z ∈ {0, 1}
- p(z = 1) = 0.8 and q(z = 1) = ρ

[figure: KL(q||p) and KL(p||q) plotted against ρ ∈ [0, 1]; both are zero at ρ = 0.8]
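The binary example is easy to reproduce numerically; a minimal sketch (the helper name is mine, and 0 log 0 is treated as 0):

import numpy as np

def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    total = 0.0
    for pa, pb in ((a, b), (1.0 - a, 1.0 - b)):
        if pa > 0:
            total += pa * np.log(pa / pb)
    return total

rho = 0.3
print(kl_bernoulli(rho, 0.8), kl_bernoulli(0.8, rho))  # KL(q||p) and KL(p||q) differ

Sweeping ρ over a grid in (0, 1) reproduces the two asymmetric curves in the figure, both touching zero at ρ = 0.8.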
A lower bound on the log-likelihood log p(x|θ)

- Non-negativity of the KL-divergence =⇒ the free-energy is a lower bound on the log-likelihood: F(q(s), θ) ≤ log p(x|θ).
- The KL-divergence equals 0 when the arbitrary distribution q(s) matches the posterior distribution over class memberships p(s|x, θ) =⇒ the free-energy equals the log-likelihood at that point.
- The free-energy is simple to compute.
Visualising the free-energy lower bound
F(q(s), θ) = log p(x|θ)−KL(q(s)||p(s|x, θ))
What is the maximal value of the free-energy along this vertical slice (i.e. with θ held fixed)?

F(q(s), θ) = log p(x|θ)−KL(q(s)||p(s|x, θ))

Since the KL term is non-negative and vanishes when q(s) = p(s|x, θ), the maximum is log p(x|θ).
The Expectation Maximisation (EM) algorithm

From initial (random) parameters θ0, iterate t = 1, . . . , T the two steps:

E step: for fixed θ_{t−1}, maximise the lower bound F(q(s), θ_{t−1}) wrt q(s). As the log-likelihood log p(x|θ) is independent of q(s), this is equivalent to minimising KL(q(s)||p(s|x, θ_{t−1})), so q_t(s) = p(s|x, θ_{t−1}).

M step: for fixed q_t(s), maximise the lower bound F(q_t(s), θ) wrt θ. Writing

F(q(s), θ) = ∑_s q(s) log[ p(x|s, θ) p(s|θ) ] − ∑_s q(s) log q(s),

the second term is the entropy of q(s), independent of θ, so the M step is

θ_t = arg max_θ ∑_s q_t(s) log[ p(x|s, θ) p(s|θ) ].

Although the steps work with the lower bound (a Lyapunov function), each iteration cannot decrease the log-likelihood, as

log p(x|θ_{t−1}) = F(q_t(s), θ_{t−1})   (E step)
                 ≤ F(q_t(s), θ_t)       (M step)
                 ≤ log p(x|θ_t)         (lower bound)
Application of EM to Mixture of Gaussians (E Step)

- Assume D = 1 dimensional data x for simplicity
- Gaussian mixture model parameters: θ = {µk, σk^2, πk}, k = 1 . . . K
- One latent variable per data point, sn, n = 1 . . . N, taking values 1 . . . K

The probability of the observations given the latent variables and the parameters, and the prior on the latent variables, are:

p(xn|sn = k, θ) = (1 / √(2πσk^2)) exp( −(xn − µk)^2 / (2σk^2) ),    p(sn = k|θ) = πk

so the E step becomes:

q(sn = k) = p(sn = k|xn, θ) ∝ p(xn, sn = k|θ) = (πk / √(2πσk^2)) exp( −(xn − µk)^2 / (2σk^2) ) = unk

That is: q(sn = k) = rnk = unk / un, where un = ∑_{k=1..K} unk.

The posterior for each latent variable sn follows a categorical distribution with probability given by the product of the prior and likelihood, renormalised. rnk is called the responsibility that component k takes for data point n.
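In code the E step is exactly this renormalised product of prior and likelihood; a minimal 1D sketch (function and variable names are mine):

import numpy as np
from scipy.stats import norm

def e_step(x, pis, mus, sigmas):
    """Responsibilities r[n, k] = q(s_n = k) for a 1D mixture of Gaussians."""
    x, pis, mus, sigmas = map(np.asarray, (x, pis, mus, sigmas))
    # u[n, k] = pi_k * N(x_n; mu_k, sigma_k^2)
    u = pis[None, :] * norm.pdf(x[:, None], mus[None, :], sigmas[None, :])
    return u / u.sum(axis=1, keepdims=True)  # renormalise over components k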
Application of EM to Mixture of Gaussians (M Step)

The lower bound is

F(q(s), θ) = ∑_{n=1..N} ∑_{k=1..K} q(sn = k) [ log(πk) − (xn − µk)^2 / (2σk^2) − (1/2) log(σk^2) ] + const.

The M step optimises F(q(s), θ) wrt the parameters θ:

∂F/∂µj = ∑_{n=1..N} q(sn = j) (xn − µj) / σj^2 = 0
  ⇒ µj = ∑_{n=1..N} q(sn = j) xn / ∑_{n=1..N} q(sn = j),

∂F/∂σj^2 = ∑_{n=1..N} q(sn = j) [ (xn − µj)^2 / (2σj^4) − 1 / (2σj^2) ] = 0
  ⇒ σj^2 = ∑_{n=1..N} q(sn = j) (xn − µj)^2 / ∑_{n=1..N} q(sn = j),

∂[F + λ(1 − ∑_k πk)]/∂πj = ∑_{n=1..N} q(sn = j) / πj − λ = 0
  ⇒ πj = (1/N) ∑_{n=1..N} q(sn = j).

The E step fills in the values of the hidden variables; the M step is then just like performing supervised learning with known (soft) cluster assignments.
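These three closed-form updates translate directly into code; a minimal sketch that pairs with the e_step function above:

import numpy as np

def m_step(x, r):
    """M step for a 1D mixture of Gaussians given responsibilities r[n, k]."""
    Nk = r.sum(axis=0)                                                 # effective number of points per component
    mus = (r * x[:, None]).sum(axis=0) / Nk                            # responsibility-weighted means
    vars_ = (r * (x[:, None] - mus[None, :]) ** 2).sum(axis=0) / Nk    # responsibility-weighted variances
    pis = Nk / x.shape[0]                                              # updated mixing proportions
    return pis, mus, np.sqrt(vars_)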
EM for MoGs: soft, non-axis aligned K-means

[figure: EM iterations on 2D data (axes x1, x2), with the free-energy plotted against θ at each step]

initialise: θ = {πk, mk, Σk}, k = 1 . . . K
repeat
  E-Step: rnk = p(sn = k|xn, θ) for n = 1 . . . N
  M-Step: θ ← arg max_θ ∑_{n,k} rnk log p(sn = k, xn|θ)
until convergence
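Putting the pieces together gives the full fitting loop; a compact 1D sketch built from the e_step and m_step functions defined earlier (the initialisation and the fixed iteration count are simple placeholder choices):

import numpy as np

def fit_mog_em(x, K, n_iters=200, seed=0):
    """Fit a 1D mixture of Gaussians by EM; returns (pis, mus, sigmas)."""
    rng = np.random.default_rng(seed)
    pis = np.full(K, 1.0 / K)
    mus = rng.choice(x, size=K, replace=False)  # initialise means at distinct data points
    sigmas = np.full(K, x.std())
    for _ in range(n_iters):
        r = e_step(x, pis, mus, sigmas)         # soft assignments (responsibilities)
        pis, mus, sigmas = m_step(x, r)         # re-estimate the parameters
    return pis, mus, sigmas

In practice one would also track the free-energy (or the log-likelihood, e.g. with mog_loglik above) and stop once it changes by less than some tolerance.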
What will happen with this initialisation?

[figure: EM (same algorithm as on the previous slide) on the 2D data (axes x1, x2), with the free-energy plotted against θ (Σ11)]
KABOOM! Maximum likelihood can overfit: Σk must be restricted for F(q(s), θ) to remain bounded.
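One common practical guard against this collapse (a suggestion of mine, not from the slides) is to put a floor on the variances in the M step so that no component can shrink onto a single data point; a one-line modification of the m_step sketch above:

import numpy as np

def m_step_floored(x, r, min_var=1e-3):
    """M step with a variance floor, keeping F(q(s), theta) bounded."""
    pis, mus, sigmas = m_step(x, r)                 # plain M step from earlier
    sigmas = np.maximum(sigmas, np.sqrt(min_var))   # clamp each sigma_k away from zero
    return pis, mus, sigmas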
Summary

- MoG + EM algorithm = soft k-means clustering with non-axis aligned, non-equally weighted clusters
- EM can be used to fit latent variable models, e.g. PCA, factor analysis, MoGs, HMMs, ...
  - requires a tractable posterior p(s|x, θ), its entropy, and the average log-joint E_{q(s)}[log p(s, x|θ)]

Limitations
- MoG clusters still have simple shapes (ellipses)
  - a single real cluster might be described by many components
  - more complex cluster models have been developed
- maximum likelihood can overfit
  - Bayesian approaches avoid overfitting: p(θ, s|x)
- coordinate ascent is often slow to converge (lots of iterations required)
  - joint optimisation of F(q(s), θ) is faster
  - so is direct optimisation of the log-likelihood log p(x|θ)
Appendix: proof of KL divergence properties

Minimise the Kullback-Leibler divergence (relative entropy) KL(q(x)||p(x)): add a Lagrange multiplier (to enforce that q(x) normalises) and take variational derivatives:

δ/δq(x) [ ∫ q(x) log (q(x)/p(x)) dx + λ (1 − ∫ q(x) dx) ] = log (q(x)/p(x)) + 1 − λ.

Find the stationary point by setting the derivative to zero:

q(x) = exp(λ − 1) p(x); the normalisation condition gives λ = 1, so q(x) = p(x),

which corresponds to a minimum, since the second derivative is positive:

δ²/δq(x)δq(x) KL(q(x)||p(x)) = 1/q(x) > 0.

The minimum value attained at q(x) = p(x) is KL(p(x)||p(x)) = 0, showing that KL(q(x)||p(x))
- is non-negative
- attains its minimum 0 when p(x) and q(x) are equal