TRANSCRIPT
Clustering and the EM algorithm
Rich Turner and Jose Miguel Hernandez-Lobato
What is clustering?

Example applications: network community detection (Campbell et al., Social Network Analysis), image segmentation, vector quantisation, genetic clustering, anomaly detection, crime analysis.
What is clustering?

- Roughly speaking, two points belonging to the same cluster are generally more similar to each other, or closer to each other, than two points belonging to different clusters.
- D = {x1, . . . , xN} → s = {s1, . . . , sN} where sn ∈ {1, . . . , K}
- Unsupervised learning problem (no labels or rewards)
A first clustering algorithm: k-means

[figure: 2D data (axes x1, x2) with cluster centres, shown over successive k-means iterations]

input: D = {x1, . . . , xN}, xn ∈ R^D
initialise: mk ∈ R^D for k = 1 . . . K
repeat
  for n = 1 . . . N
    sn = arg min_k ||xn − mk||
  endfor
  for k = 1 . . . K
    mk = mean(xn : sn = k)
  endfor
until convergence (sn fixed)
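In code, one sweep of the two loops above is an assignment step followed by a mean-update step. A minimal NumPy sketch of the whole procedure (function and variable names are mine, not from the lecture):

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Basic k-means: X is an (N, D) data array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    m = X[rng.choice(N, size=K, replace=False)].copy()  # initialise centres at random data points
    s = np.full(N, -1)
    for _ in range(max_iters):
        # assignment step: each point goes to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)  # (N, K)
        s_new = dists.argmin(axis=1)
        if np.array_equal(s_new, s):
            break  # assignments fixed -> converged
        s = s_new
        # update step: each centre moves to the mean of its assigned points
        for k in range(K):
            if np.any(s == k):
                m[k] = X[s == k].mean(axis=0)
    return s, m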
Question: is K-means guaranteed to converge for any dataset D?

Could one or more of the cluster centres diverge or oscillate?
K-means as optimisation

Let sn,k = 1 if data point n is assigned to cluster k and zero otherwise. Note: ∑_{k=1..K} sn,k = 1.

Cost:

C({sn,k}, {mk}) = ∑_{n=1..N} ∑_{k=1..K} sn,k ||xn − mk||^2

K-means tries to minimise the cost function C with respect to {sn,k} and {mk}, subject to ∑_k sn,k = 1 and sn,k ∈ {0, 1}.

K-means sequentially:
- minimises C with respect to {sn,k}, holding {mk} fixed.
- minimises C with respect to {mk}, holding {sn,k} fixed.

C is a Lyapunov function =⇒ K-means converges, but... finding the global optimum of C is a hard problem.
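The cost is a couple of lines of NumPy; this sketch assumes the data X, centres m and assignments s are arrays as in the k-means sketch earlier:

import numpy as np

def kmeans_cost(X, m, s):
    """C = sum_n ||x_n - m_{s_n}||^2: the within-cluster sum of squared distances."""
    return float(np.sum((X - m[s]) ** 2))

Each of the two alternating steps can only decrease (or leave unchanged) this quantity, which is what makes C act as a Lyapunov function for the algorithm.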
Where will K-means converge to when run on these data?

[figure: successive K-means iterations on a 2D dataset, axes x1 and x2]
Pathologies of K-means

- strange results when clusters have very different shapes
- returns hard assignments; soft assignments would be more sensible
- initialisation is delicate; the k-means++ algorithm spreads out the initial centres (a minimal sketch of this seeding rule follows below)
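For reference, a short sketch of the k-means++ seeding mentioned above: the first centre is a random data point and each subsequent centre is drawn with probability proportional to its squared distance from the nearest centre chosen so far (the function name is mine):

import numpy as np

def kmeanspp_init(X, K, seed=0):
    """k-means++ seeding: returns K initial centres chosen from the rows of X."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    centres = [X[rng.integers(N)]]  # first centre uniformly at random
    for _ in range(K - 1):
        # squared distance from every point to its nearest already-chosen centre
        d2 = ((X[:, None, :] - np.array(centres)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centres.append(X[rng.choice(N, p=d2 / d2.sum())])  # sample proportional to d^2
    return np.array(centres)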
Mixture of Gaussians: Generative Model

[figure: three Gaussian clusters in the (x1, x2) plane with parameters {π1, µ1, Σ1}, {π2, µ2, Σ2}, {π3, µ3, Σ3}]

For each data point n = 1 . . . N:

  sample cluster membership: p(sn = k|θ) = πk, noting ∑_{k=1..K} πk = 1

  sample the data value given the cluster membership: p(xn|sn = k, θ) = N(xn; µk, Σk)
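The generative process translates directly into a sampler; a short NumPy sketch with made-up 2D parameter values (none of these numbers come from the slides):

import numpy as np

rng = np.random.default_rng(0)

pis    = np.array([0.3, 0.4, 0.3])                                         # mixing proportions, sum to 1
mus    = np.array([[-1.0, -1.0], [0.0, 1.0], [1.5, -0.5]])                 # component means
Sigmas = np.array([0.10 * np.eye(2), 0.20 * np.eye(2), 0.15 * np.eye(2)])  # component covariances

N = 500
s = rng.choice(len(pis), size=N, p=pis)                                # sample cluster memberships s_n
X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in s])  # sample x_n given s_n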
How can we learn the parameters of this model from data using maximum likelihood?

θML = arg max_θ log p({xn}_{n=1}^N |θ)
Which plot shows the density p(x|θ) of this 1D Mixture of Gaussians?

mixing proportions: π1 = 1/4, π2 = 1/2 and π3 = 1/4
means: µ1 = 0, µ2 = 3 and µ3 = −3
standard deviations: σ1 = 1/2, σ2 = 1/2 and σ3 = 1

[figure: three candidate densities A, B and C plotted over x ∈ [−5, 5]]

Mixture of Gaussians: p(s = k|θ) = πk,  p(x|s = k, θ) = N(x; µk, σk^2)
Answer: p(x|θ) = (1/4) N(x; 0, (1/2)^2) + (1/2) N(x; 3, (1/2)^2) + (1/4) N(x; −3, 1)
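One way to check the answer (and to produce plots like A-C) is simply to evaluate the mixture density on a grid; a small sketch using SciPy:

import numpy as np
from scipy.stats import norm

pis    = [0.25, 0.50, 0.25]    # mixing proportions
mus    = [0.0, 3.0, -3.0]      # means
sigmas = [0.5, 0.5, 1.0]       # standard deviations

xs = np.linspace(-5, 5, 400)
# p(x|theta) = sum_k pi_k N(x; mu_k, sigma_k^2)
density = sum(pi * norm.pdf(xs, mu, sigma) for pi, mu, sigma in zip(pis, mus, sigmas))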
Different ways of maximising the likelihood

log p({xn}_{n=1}^N |θ) = log ∏_{n=1..N} p(xn|θ) = ∑_{n=1..N} log p(xn|θ)
                       = ∑_{n=1..N} log ∑_{k=1..K} p(xn, sn = k|θ)
                       = ∑_{n=1..N} log ∑_{k=1..K} p(sn = k|θ) p(xn|sn = k, θ)
                       = ∑_{n=1..N} log ∑_{k=1..K} πk N(xn; µk, σk^2)

- Method 1: direct gradient-based optimisation of the log-likelihood
- Method 2: the expectation-maximisation (EM) algorithm

Direct optimisation is generally faster, but the EM algorithm (i) is sometimes simpler to implement, (ii) is more widely used, and (iii) leads to important generalisations.

For generality, we simplify notation: let x = {xn}_{n=1}^N and s = {sn}_{n=1}^N, so

log p({xn}_{n=1}^N |θ) = log p(x|θ) = log ∑_s p(x, s|θ)
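Whichever method is used, one needs to evaluate this log-likelihood; a numerically stable sketch for the 1D case, computing each log ∑_k term with a log-sum-exp (the function name is mine):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mog_loglik(x, pis, mus, sigmas):
    """log p(x|theta) = sum_n log sum_k pi_k N(x_n; mu_k, sigma_k^2)."""
    x, pis, mus, sigmas = map(np.asarray, (x, pis, mus, sigmas))
    # log of each weighted component density, shape (N, K)
    log_terms = np.log(pis)[None, :] + norm.logpdf(x[:, None], mus[None, :], sigmas[None, :])
    return logsumexp(log_terms, axis=1).sum()

Here pis, mus and sigmas are length-K arrays; the log-sum-exp avoids underflow when some components assign tiny probability to a point.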
A lower bound on the log-likelihood log p(x|θ)

The log-likelihood can be decomposed as

log p(x|θ) = F(q(s), θ) + KL(q(s)||p(s|x, θ)),

where q(s) is an arbitrary distribution over class memberships, p(s|x, θ) is the posterior distribution over class memberships, KL(·||·) is the Kullback-Leibler divergence, and F(q(s), θ) is the free-energy.
A brief introduction to the Kullback-Leibler divergence

KL(p1(z)||p2(z)) = ∑_z p1(z) log [ p1(z) / p2(z) ]

Important properties:
- Gibbs' inequality: KL(p1(z)||p2(z)) ≥ 0, with equality only at p1(z) = p2(z)
  - proof via Jensen's inequality or differentiation (see MacKay pg. 35)
- Non-symmetric: KL(p1(z)||p2(z)) ≠ KL(p2(z)||p1(z))
  - hence named a divergence and not a distance

Example:
- binary variables z ∈ {0, 1}
- p(z = 1) = 0.8 and q(z = 1) = ρ

[figure: KL(q||p) and KL(p||q) plotted against ρ ∈ [0, 1]; both are zero at ρ = 0.8]
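The binary example is easy to reproduce numerically; a minimal sketch (the helper name is mine, and 0 log 0 is treated as 0):

import numpy as np

def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    total = 0.0
    for pa, pb in ((a, b), (1.0 - a, 1.0 - b)):
        if pa > 0:
            total += pa * np.log(pa / pb)
    return total

rho = 0.3
print(kl_bernoulli(rho, 0.8), kl_bernoulli(0.8, rho))  # KL(q||p) and KL(p||q) differ

Sweeping ρ over a grid in (0, 1) reproduces the two asymmetric curves in the figure, both touching zero at ρ = 0.8.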
A lower bound on the log-likelihood log p(x|θ)

- Non-negativity of the KL-divergence =⇒ the free-energy is a lower bound on the log-likelihood: F(q(s), θ) ≤ log p(x|θ).
- The KL-divergence equals 0 when the arbitrary distribution q(s) matches the posterior distribution over class memberships p(s|x, θ) =⇒ the free-energy equals the log-likelihood at that point.
- The free-energy is simple to compute.
Visualising the free-energy lower bound
F(q(s), θ) = log p(x|θ)−KL(q(s)||p(s|x, θ))
What is the maximal value of the free-energy along this vertical slice (i.e. with θ held fixed)?

F(q(s), θ) = log p(x|θ)−KL(q(s)||p(s|x, θ))

Since the KL term is non-negative and vanishes when q(s) = p(s|x, θ), the maximum is log p(x|θ).
The Expectation Maximisation (EM) algorithm

From initial (random) parameters θ0, iterate t = 1, . . . , T the two steps:

E step: for fixed θ_{t−1}, maximise the lower bound F(q(s), θ_{t−1}) wrt q(s). As the log-likelihood log p(x|θ) is independent of q(s), this is equivalent to minimising KL(q(s)||p(s|x, θ_{t−1})), so q_t(s) = p(s|x, θ_{t−1}).

M step: for fixed q_t(s), maximise the lower bound F(q_t(s), θ) wrt θ. Writing

F(q(s), θ) = ∑_s q(s) log[ p(x|s, θ) p(s|θ) ] − ∑_s q(s) log q(s),

the second term is the entropy of q(s), independent of θ, so the M step is

θ_t = arg max_θ ∑_s q_t(s) log[ p(x|s, θ) p(s|θ) ].

Although the steps work with the lower bound (a Lyapunov function), each iteration cannot decrease the log-likelihood, as

log p(x|θ_{t−1}) = F(q_t(s), θ_{t−1})   (E step)
                 ≤ F(q_t(s), θ_t)       (M step)
                 ≤ log p(x|θ_t)         (lower bound)
Application of EM to Mixture of Gaussians (E Step)

- Assume D = 1 dimensional data x for simplicity
- Gaussian mixture model parameters: θ = {µk, σk^2, πk}, k = 1 . . . K
- One latent variable per data point, sn, n = 1 . . . N, taking values 1 . . . K

The probability of the observations given the latent variables and the parameters, and the prior on the latent variables, are:

p(xn|sn = k, θ) = (1 / √(2πσk^2)) exp( −(xn − µk)^2 / (2σk^2) ),    p(sn = k|θ) = πk

so the E step becomes:

q(sn = k) = p(sn = k|xn, θ) ∝ p(xn, sn = k|θ) = (πk / √(2πσk^2)) exp( −(xn − µk)^2 / (2σk^2) ) = unk

That is: q(sn = k) = rnk = unk / un, where un = ∑_{k=1..K} unk.

The posterior for each latent variable sn follows a categorical distribution with probability given by the product of the prior and likelihood, renormalised. rnk is called the responsibility that component k takes for data point n.
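In code the E step is exactly this renormalised product of prior and likelihood; a minimal 1D sketch (function and variable names are mine):

import numpy as np
from scipy.stats import norm

def e_step(x, pis, mus, sigmas):
    """Responsibilities r[n, k] = q(s_n = k) for a 1D mixture of Gaussians."""
    x, pis, mus, sigmas = map(np.asarray, (x, pis, mus, sigmas))
    # u[n, k] = pi_k * N(x_n; mu_k, sigma_k^2)
    u = pis[None, :] * norm.pdf(x[:, None], mus[None, :], sigmas[None, :])
    return u / u.sum(axis=1, keepdims=True)  # renormalise over components k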
Application of EM to Mixture of Gaussians (M Step)

The lower bound is

F(q(s), θ) = ∑_{n=1..N} ∑_{k=1..K} q(sn = k) [ log(πk) − (xn − µk)^2 / (2σk^2) − (1/2) log(σk^2) ] + const.

The M step optimises F(q(s), θ) wrt the parameters θ:

∂F/∂µj = ∑_{n=1..N} q(sn = j) (xn − µj) / σj^2 = 0
  ⇒ µj = ∑_{n=1..N} q(sn = j) xn / ∑_{n=1..N} q(sn = j),

∂F/∂σj^2 = ∑_{n=1..N} q(sn = j) [ (xn − µj)^2 / (2σj^4) − 1 / (2σj^2) ] = 0
  ⇒ σj^2 = ∑_{n=1..N} q(sn = j) (xn − µj)^2 / ∑_{n=1..N} q(sn = j),

∂[F + λ(1 − ∑_k πk)]/∂πj = ∑_{n=1..N} q(sn = j) / πj − λ = 0
  ⇒ πj = (1/N) ∑_{n=1..N} q(sn = j).

The E step fills in the values of the hidden variables; the M step is then just like performing supervised learning with known (soft) cluster assignments.
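These three closed-form updates translate directly into code; a minimal sketch that pairs with the e_step function above:

import numpy as np

def m_step(x, r):
    """M step for a 1D mixture of Gaussians given responsibilities r[n, k]."""
    Nk = r.sum(axis=0)                                                 # effective number of points per component
    mus = (r * x[:, None]).sum(axis=0) / Nk                            # responsibility-weighted means
    vars_ = (r * (x[:, None] - mus[None, :]) ** 2).sum(axis=0) / Nk    # responsibility-weighted variances
    pis = Nk / x.shape[0]                                              # updated mixing proportions
    return pis, mus, np.sqrt(vars_)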
EM for MoGs: soft, non-axis aligned K-means

[figure: EM iterations on 2D data (axes x1, x2), with the free-energy plotted against θ at each step]

initialise: θ = {πk, mk, Σk}, k = 1 . . . K
repeat
  E-Step: rnk = p(sn = k|xn, θ) for n = 1 . . . N
  M-Step: θ ← arg max_θ ∑_{n,k} rnk log p(sn = k, xn|θ)
until convergence
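Putting the pieces together gives the full fitting loop; a compact 1D sketch built from the e_step and m_step functions defined earlier (the initialisation and the fixed iteration count are simple placeholder choices):

import numpy as np

def fit_mog_em(x, K, n_iters=200, seed=0):
    """Fit a 1D mixture of Gaussians by EM; returns (pis, mus, sigmas)."""
    rng = np.random.default_rng(seed)
    pis = np.full(K, 1.0 / K)
    mus = rng.choice(x, size=K, replace=False)  # initialise means at distinct data points
    sigmas = np.full(K, x.std())
    for _ in range(n_iters):
        r = e_step(x, pis, mus, sigmas)         # soft assignments (responsibilities)
        pis, mus, sigmas = m_step(x, r)         # re-estimate the parameters
    return pis, mus, sigmas

In practice one would also track the free-energy (or the log-likelihood, e.g. with mog_loglik above) and stop once it changes by less than some tolerance.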
What will happen with this initialisation?

[figure: EM (same algorithm as on the previous slide) on the 2D data (axes x1, x2), with the free-energy plotted against θ (Σ11)]
KABOOM! Maximum likelihood can overfit: Σk must be restricted for F(q(s), θ) to remain bounded.
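One common practical guard against this collapse (a suggestion of mine, not from the slides) is to put a floor on the variances in the M step so that no component can shrink onto a single data point; a one-line modification of the m_step sketch above:

import numpy as np

def m_step_floored(x, r, min_var=1e-3):
    """M step with a variance floor, keeping F(q(s), theta) bounded."""
    pis, mus, sigmas = m_step(x, r)                 # plain M step from earlier
    sigmas = np.maximum(sigmas, np.sqrt(min_var))   # clamp each sigma_k away from zero
    return pis, mus, sigmas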
Summary

- MoG + EM algorithm = soft k-means clustering with non-axis aligned, non-equally weighted clusters
- EM can be used to fit latent variable models, e.g. PCA, factor analysis, MoGs, HMMs, ...
  - requires a tractable posterior p(s|x, θ), its entropy, and the average log-joint E_{q(s)}[log p(s, x|θ)]

Limitations
- MoG clusters still have simple shapes (ellipses)
  - a single real cluster might be described by many components
  - more complex cluster models have been developed
- maximum likelihood can overfit
  - Bayesian approaches avoid overfitting: p(θ, s|x)
- coordinate ascent is often slow to converge (lots of iterations required)
  - joint optimisation of F(q(s), θ) is faster
  - so is direct optimisation of the log-likelihood log p(x|θ)
Appendix: proof of KL divergence properties

Minimise the Kullback-Leibler divergence (relative entropy) KL(q(x)||p(x)): add a Lagrange multiplier (to enforce that q(x) normalises) and take variational derivatives:

δ/δq(x) [ ∫ q(x) log (q(x)/p(x)) dx + λ (1 − ∫ q(x) dx) ] = log (q(x)/p(x)) + 1 − λ.

Find the stationary point by setting the derivative to zero:

q(x) = exp(λ − 1) p(x); the normalisation condition gives λ = 1, so q(x) = p(x),

which corresponds to a minimum, since the second derivative is positive:

δ²/δq(x)δq(x) KL(q(x)||p(x)) = 1/q(x) > 0.

The minimum value attained at q(x) = p(x) is KL(p(x)||p(x)) = 0, showing that KL(q(x)||p(x))
- is non-negative
- attains its minimum 0 when p(x) and q(x) are equal