
  • MLPR: Clustering, Mixture Models, and the EM Algorithm

    Machine Learning and Pattern Recognition

    Charles Sutton

    School of Informatics, University of Edinburgh


  • Overview

    - Up until now: supervised learning. Training data contains examples of inputs and outputs.

    - Now we will see a few examples of unsupervised learning.

    - Unsupervised methods: the algorithm is given a set of training inputs but no outputs. Just "find some patterns".

    - This lecture: clustering.

    - Clustering: given a set of feature vectors {x_1, x_2, ..., x_N}, assign each data point x_i to a cluster label z_i ∈ {1, 2, ..., K}.

    - Sort of like classification, except that "the class labels" for the training data aren't given!

    - The idea is that items that are similar in feature space should be assigned to the same group.


  • Example: Image Clustering

    Input: Each x_i represents one image.
    Output: z_i ∈ {1, 2, ..., K}, a cluster label.

    e.g., Input:

    Source: http://137.189.35.203/WebUI/CatDatabase/catData.html



  • Example: Image Clustering

    e.g., Outputs:

    [Figure: the cat images from the previous slide, each circled and labelled with a cluster assignment from 1 to 4]

    - The circles give a potential assignment to the z_i.

    - Notice that we did not get a group label in the training inputs! This is the key difference from classification.

  • Examples: Time Series Clustering

    Input: Each of x_1, ..., x_N is one of the lines in the figure below; x_i = (x_i1, x_i2, ..., x_i6). Each x_it is one of the points (circles) in the figure.

    Example input:

    [Figure: yeast microarray data; each line is one gene's expression level plotted against time, with braces marking Cluster 1 and Cluster 2]


  • Examples: Time Series Clustering

    - Details: Each line represents a gene in yeast.

    - We measure how much each gene is expressed (i.e., converted into RNA) over time, as some experimental condition is changed.

    [Figure: the same yeast microarray plot as on the previous slide]


  • Examples: Time Series Clustering

    Output: Assignment of each vector to a cluster

    Example output:

    [Figure: the yeast microarray plot again, with braces grouping the genes into Cluster 1 and Cluster 2]


  • Ill-Posed Problem

    - There is no "one right answer" to a clustering problem.

    - For example, could cluster cats by colour:

    [Figure: the cat images labelled 1-4 according to a clustering by colour]


  • Ill-Posed Problem

    - Or by location:

    [Figure: the same cat images labelled 1-4 according to a clustering by location]

    - Remember no free lunch from the first lecture? Different clustering methods will make different prior assumptions, which determine which of these clusterings they find.


  • Clustering Methods

    - There are lots and lots of clustering methods:
      - K-means, K-medoids
      - Hierarchical clustering (agglomerative or top down)
      - Spectral clustering
      - Biclustering

    - Many of these aren't probabilistic.

    - We won't go through the above methods in this course.

    - Today we will do:
      - Mixture models
    - which are commonly used for clustering, but are used for other problems as well (density estimation, and even supervised learning).


  • Mixture Modelling

    - A mixture model is a way of making a simple density more flexible.

    - Previously, we have had models that looked like p(x_i | θ).

    - To make a mixture model, you need to start from a base distribution, like a Gaussian, beta, Bernoulli, etc.

    - From this, create a "mixture of Gaussians", "mixture of betas", etc.

    - The main idea will be to add a new random variable to the model.

    - We will have a joint distribution over z_i and x_i.

    - z_i is a variable that we add just to make the model more convenient. It is called a latent variable.


  • Mixture Modelling

    - Previously, we have had models that looked like p(x_i | θ).

    - Now we will have one set of parameters for each cluster: θ_1, θ_2, ..., θ_K.

    - In general, the model will be

      p(z_i = k) = π_k
      p(x_i | z_i = k) = p(x_i | θ_k)

    - π = (π_1, π_2, ..., π_K) represents how common the clusters are. These are called the mixing weights.

    - The p(x_i | θ_k) are the base distributions that we are mixing together.

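To make the two-stage generative process concrete, here is a minimal sketch of ancestral sampling from such a model, using a mixture of 1D Gaussians. The mixing weights and per-cluster parameters below are made up for illustration and do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters for a K = 3 mixture of 1D Gaussians.
pi = np.array([0.5, 0.3, 0.2])     # mixing weights pi_k, sum to 1
mu = np.array([-2.0, 0.0, 3.0])    # one mean per cluster
sigma = np.array([0.5, 1.0, 0.8])  # one standard deviation per cluster

def sample(n):
    """Ancestral sampling: draw z_i ~ Categorical(pi), then x_i | z_i = k ~ N(mu_k, sigma_k^2)."""
    z = rng.choice(len(pi), size=n, p=pi)
    x = rng.normal(mu[z], sigma[z])
    return x, z

x, z = sample(5)
print(np.c_[x, z])  # each sampled point together with the cluster that generated it
```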

  • Example 1: Mixture of Gaussians

    [Figure: plots of the density of a mixture of Gaussians]

    p(z_i = k) = π_k
    p(x_i | z_i = k) = N(x_i; μ_k, Σ_k)

    Notice a different mean and covariance for each cluster.
    (The plots above show the density of a MoG for K = 3.)


  • Example 2: Mixture of Multinoullis

    - Suppose each data point is a bit vector, i.e., x_i ∈ {0, 1}^D.

    - We can choose p(x | z) to be a product of Bernoullis. Let one training instance be x = (x_1, x_2, ..., x_D) where x_j ∈ {0, 1}. We define

      p(x | z = k, θ_k) = ∏_{j=1}^D θ_kj^{x_j} (1 − θ_kj)^{1 − x_j},

    i.e., if we are in cluster k, then the distribution over the bit vector x is a product of independent Bernoullis.

    θ_k = (θ_k1, θ_k2, ..., θ_kD) gives the Bernoulli parameter for each dimension, i.e.,

    p(x_j = 1 | z = k, θ_k) = θ_kj

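As a sketch of how this per-cluster likelihood can be evaluated, the snippet below computes log p(x | z = k, θ_k) for every cluster at once. The θ values are made up for illustration and are not from the slides.

```python
import numpy as np

# Made-up parameters: K = 2 clusters, D = 4 binary dimensions.
# theta[k, j] = p(x_j = 1 | z = k).
theta = np.array([[0.9, 0.8, 0.1, 0.2],
                  [0.1, 0.2, 0.9, 0.8]])

def log_p_x_given_z(x, theta):
    """log p(x | z = k, theta_k) = sum_j [x_j log theta_kj + (1 - x_j) log(1 - theta_kj)],
    returned as a length-K vector (one entry per cluster)."""
    x = np.asarray(x, dtype=float)
    return (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)

print(log_p_x_given_z([1, 1, 0, 0], theta))  # far higher under cluster 0 than cluster 1
```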

  • A Point about Expressivity

    - Squinting at this equation

      p(x | z = k, θ_k) = ∏_{j=1}^D θ_kj^{x_j} (1 − θ_kj)^{1 − x_j},

      we see that given z = k, all pairs of variables x_i and x_j are independent.

    - But marginally x_i and x_j are dependent under this model!

    - This is because x_i provides information about z, which provides info about x_j.

    - For example, consider D = 2 and K = 2, and

      θ_1 = (0.99, 0.01)
      θ_2 = (0.01, 0.99)
      π = (0.5, 0.5)

      Once I see x_1, I am almost certain about the value of x_2.

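This can be checked numerically. Using the D = 2, K = 2 parameters above, the short sketch below compares the marginal p(x_2 = 1) with the conditional p(x_2 = 1 | x_1 = 1) by summing over z.

```python
import numpy as np

# Parameters from the slide: theta[k, j] = p(x_j = 1 | z = k), pi[k] = p(z = k).
theta = np.array([[0.99, 0.01],
                  [0.01, 0.99]])
pi = np.array([0.5, 0.5])

p_x2 = np.sum(pi * theta[:, 1])                        # marginal p(x_2 = 1)
p_x1 = np.sum(pi * theta[:, 0])                        # marginal p(x_1 = 1)
p_x1_and_x2 = np.sum(pi * theta[:, 0] * theta[:, 1])   # joint p(x_1 = 1, x_2 = 1)
p_x2_given_x1 = p_x1_and_x2 / p_x1                     # conditional p(x_2 = 1 | x_1 = 1)

print(p_x2, p_x2_given_x1)  # 0.5 vs. ~0.02, so x_1 and x_2 are marginally dependent
```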

  • Summary: A Point about Expressivity

    - A product of independent Bernoullis is not a very expressive model.

    - We have made the model more expressive by embedding it within a mixture.

    - Now we can represent correlation, when we couldn't before.

    - This is a general point about mixtures: the family of distributions in a mixture model will be larger than the original models.


  • Maximizing Likelihood

    - Parameters to estimate are θ = (θ_1, ..., θ_K, π).

    - The definition of the likelihood doesn't change just because we added latent variables. It is still the (log) probability of the data given the parameters, i.e.,

      L(θ) = ∑_{i=1}^N log p(x_i | θ)

    - But now we have to compute a marginal distribution to get p(x_i | θ), i.e.,

      L(θ) = ∑_{i=1}^N log ∑_{k=1}^K p(x_i | z_i = k, θ) p(z_i = k | θ)


  • Likelihood Example

    - For a mixture of Gaussians we would have

      L(θ) = ∑_{i=1}^N log ∑_{k=1}^K N(x_i; μ_k, Σ_k) π_k

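A minimal sketch of evaluating this log likelihood for 1D data is below, using the log-sum-exp trick so the inner sum stays numerically stable. The data points are the ones used in the motivation slides later; the parameter values are placeholders I have made up.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mog_log_likelihood(x, pi, mu, sigma):
    """L(theta) = sum_i log sum_k pi_k N(x_i; mu_k, sigma_k^2)."""
    x = np.asarray(x)
    # log_comp[i, k] = log pi_k + log N(x_i; mu_k, sigma_k^2)
    log_comp = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
    return logsumexp(log_comp, axis=1).sum()

x = [1.25, 1.073, 1.75, 0.234, 0.484]  # the small dataset from the EM motivation slides
pi = np.array([0.5, 0.5])              # placeholder parameters
mu = np.array([1.5, 0.3])
sigma = np.array([0.3, 0.3])           # standard deviations, not variances
print(mog_log_likelihood(x, pi, mu, sigma))
```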

  • Multimodality

    - The likelihood for a mixture model can be multimodal, for the same reason as the neural network likelihood can be.

    - For example, we can do label switching, i.e., permute θ_1, ..., θ_K, and then permute π_1, ..., π_K to match (see the sketch after this slide).

    - When we run an optimization procedure, we will only find a local optimum.

    - More technically, the log likelihood is not convex (we can verify this by taking second derivatives).

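A quick way to see label switching concretely: permuting the components leaves the likelihood value unchanged, so the maximizer cannot be unique. This sketch repeats the log-likelihood helper from the earlier snippet; the parameter values are made up.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mog_log_likelihood(x, pi, mu, sigma):
    log_comp = np.log(pi) + norm.logpdf(np.asarray(x)[:, None], loc=mu, scale=sigma)
    return logsumexp(log_comp, axis=1).sum()

x = [1.25, 1.073, 1.75, 0.234, 0.484]
pi, mu, sigma = np.array([0.3, 0.7]), np.array([1.5, 0.4]), np.array([0.2, 0.3])

perm = [1, 0]  # label switching: swap the two components
print(mog_log_likelihood(x, pi, mu, sigma))
print(mog_log_likelihood(x, pi[perm], mu[perm], sigma[perm]))  # identical value
```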

  • How to apply to Clustering

    - Given data {x_1, x_2, ..., x_N} and a model

      p(x_1, ..., x_N, z_1, ..., z_N | θ_1, ..., θ_K, π)

    - Maximize the likelihood to get ML parameters θ̂_1, ..., θ̂_K, π̂.

    - For each training point x_i, compute

      r_ik := p(z_i = k | x_i, θ̂_1, ..., θ̂_K, π̂)

      This is called the responsibility of cluster k for point i.

    - Interpret the responsibilities as a soft assignment of each point to the clusters.

    - If you need x_i to be assigned to exactly one cluster, take arg max_k r_ik.

    - Soft clustering: each point gets a distribution over clusters; hard clustering: each point gets exactly one cluster.

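Here is a sketch of computing the responsibilities r_ik and the corresponding hard assignment for a 1D Gaussian mixture; the "fitted" parameter values below are placeholders standing in for the ML estimates θ̂, π̂.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

x = np.array([1.25, 1.073, 1.75, 0.234, 0.484])

# Placeholder values standing in for the ML estimates theta_hat, pi_hat.
pi_hat = np.array([0.4, 0.6])
mu_hat = np.array([1.5, 0.3])
sigma_hat = np.array([0.3, 0.25])

# r[i, k] = p(z_i = k | x_i, theta_hat) via Bayes' rule, computed in log space.
log_joint = np.log(pi_hat) + norm.logpdf(x[:, None], loc=mu_hat, scale=sigma_hat)
r = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

print(r)                 # soft clustering: one distribution over clusters per point
print(r.argmax(axis=1))  # hard clustering: arg max_k r_ik
```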

  • Optimizing the Likelihood

    - Could use any of the methods from the last lecture, e.g., gradient descent, etc.

    - But in practice we usually don't use those for mixture models.

    - It turns out to be easier to apply a different optimization algorithm called expectation maximization (EM).

    - EM is a special optimization algorithm for probabilistic models.

    - Think of it as a competitor to gradient descent, conjugate gradient, etc.


  • Motivation for EM

    Consider a 1D mixture of Gaussians. The model is:

      p(x) = ∑_{k=1}^K π_k (2πσ_k²)^{−1/2} exp{ −(x − μ_k)² / (2σ_k²) }

    Your data look like:

      x_i      z_i
      1.25     ??
      1.073    ??
      1.75     ??
      0.234    ??
      0.484    ??

    It is difficult to figure out how to estimate π_k, μ_k, σ_k².


  • Motivation for EM

    Consider a 1D mixture of Gaussians. The model is:

      p(x) = ∑_{k=1}^K π_k (2πσ_k²)^{−1/2} exp{ −(x − μ_k)² / (2σ_k²) }

    Suppose someone tells you the cluster labels. Now your data becomes:

      x_i      z_i
      1.25     1
      1.073    2
      1.75     1
      0.234    2
      0.484    2

    Now do you know how to estimate π_k, μ_k, σ_k²?



  • Hard Cluster Assignment Given

    Yes! This is just a class-conditional Gaussian. ML is easy.

      x_i      z_i
      1.25     1
      1.073    2
      1.75     1
      0.234    2
      0.484    2

    e.g.,

      π_k = (∑_{i=1}^N I{z_i = k}) / N

      μ_k = (1 / ∑_{i=1}^N I{z_i = k}) ∑_{i=1}^N I{z_i = k} x_i

      σ_k² = (1 / ∑_{i=1}^N I{z_i = k}) ∑_{i=1}^N I{z_i = k} (x_i − μ_k)²

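A sketch of these class-conditional estimates applied to the table above; it just uses the indicator-weighted averages written on the slide, with no fitting library assumed.

```python
import numpy as np

x = np.array([1.25, 1.073, 1.75, 0.234, 0.484])
z = np.array([1, 2, 1, 2, 2])  # hard cluster labels from the table

for k in (1, 2):
    mask = (z == k)                          # I{z_i = k}
    pi_k = mask.mean()                       # sum_i I{z_i = k} / N
    mu_k = x[mask].mean()                    # mean of the points in cluster k
    var_k = ((x[mask] - mu_k) ** 2).mean()   # ML variance (divides by the cluster count)
    print(k, pi_k, mu_k, var_k)
```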

  • Soft Cluster Assignment Given

      x_i      z_i
      1.25     [0.9, 0.1]
      1.073    2
      1.75     1
      0.234    2
      0.484    2

    - Suppose instead you get a soft assignment to clusters, [r_i1, r_i2].

    - r_i1 is the degree to which the point belongs to cluster 1, r_i2 the degree to cluster 2, and r_i1 + r_i2 = 1.

    Intuitively we'd like to treat each point as partially belonging to both clusters, i.e.,

      π_k = (∑_{i=1}^N r_ik) / N

      μ_k = (1 / ∑_{i=1}^N r_ik) ∑_{i=1}^N r_ik x_i

      σ_k² = (1 / ∑_{i=1}^N r_ik) ∑_{i=1}^N r_ik (x_i − μ_k)²

    We're taking weighted averages rather than sample averages.
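
The same computation with responsibilities simply replaces the indicators with weights. In the sketch below only the [0.9, 0.1] row comes from the table; the remaining rows are written as one-hot assignments for illustration.

```python
import numpy as np

x = np.array([1.25, 1.073, 1.75, 0.234, 0.484])
# r[i, k]: soft assignments. Only the first row is soft on the slide; the rest
# are the hard labels written as one-hot rows.
r = np.array([[0.9, 0.1],
              [0.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

N_k = r.sum(axis=0)                                    # effective number of points per cluster
pi = N_k / len(x)                                      # pi_k = sum_i r_ik / N
mu = (r * x[:, None]).sum(axis=0) / N_k                # responsibility-weighted means
var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / N_k   # responsibility-weighted variances
print(pi, mu, var)
```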

  • Outline of EM

    - EM is an algorithm for performing maximum likelihood where there is missing data.

    - It can be used where we have:
      - Data x = (x_1, ..., x_N)
      - A model p(x, z | θ)
      - We want the ML solution

        θ̂ = arg max_θ log ∑_z p(x, z | θ)

    - A model of the form p(x, z | θ) is called a latent variable model because we do not get to observe the random variable z in the observed data.


  • EM for Mixture Models

    EM is an iterative algorithm, like gradient descent. At every iteration t we have a current guess θ^(t) of the parameters.

    - Repeat

      1. E-step. Compute a soft assignment to the latent variables: compute q(z) := p(z | x, θ^(t)).

      2. M-step. Re-estimate the parameters based on the soft assignment. Define the complete-data log likelihood

         ℓ_c(θ) = log p(x, z | θ)

         Then

         θ^(t+1) ← arg max_θ ∑_z q(z) ℓ_c(θ)

         where the objective being maximized, ∑_z q(z) ℓ_c(θ), is denoted Q(θ, θ^(t)).

      3. t ← t + 1

    - until converged, i.e., d(θ^(t), θ^(t−1)) < ε


  • Specializing it to a Model

    - Given a new model p(z, x | θ), we may need to derive the EM algorithm for that model.

    - To do this, we need:
      - A method for computing the E-step: q(z) := p(z | x, θ^(t))
      - A method for computing the M-step: max_θ Q(θ, θ^(t))

    - Example: Mixture of Gaussians
      - We have already described p(z | x, θ^(t)); this is the responsibility.
      - It turns out that max_θ Q(θ, θ^(t)) occurs at the weighted-average estimates on the previous slide!


  • Example: EM on a Gaussian Mixture

    - Repeat

      1. E-step. Assign the points to clusters. Compute

         r_ik = p(z_i = k | x_i, θ_1^(t), ..., θ_K^(t), π^(t))

         for all i ∈ 1, ..., N and k ∈ 1, ..., K.

      2. M-step. Re-estimate the model parameters:

         π_k = (∑_{i=1}^N r_ik) / N

         μ_k = (1 / ∑_{i=1}^N r_ik) ∑_{i=1}^N r_ik x_i

         σ_k² = (1 / ∑_{i=1}^N r_ik) ∑_{i=1}^N r_ik (x_i − μ_k)²

      3. t ← t + 1

    - until converged, i.e., d(θ^(t), θ^(t−1)) < ε

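Putting the two steps together, below is a minimal sketch of EM for a 1D mixture of Gaussians. The initialization, convergence tolerance, and variance floor are choices made for illustration; a more careful implementation would guard against collapsing components and try several restarts.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Simple initialization: uniform weights, means at random data points.
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, x.std() + 1e-3)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(z_i = k | x_i, current parameters).
        log_joint = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
        ll = logsumexp(log_joint, axis=1).sum()  # current log likelihood
        if ll - prev_ll < tol:                   # EM increases this monotonically
            break
        prev_ll = ll
        r = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
        # M-step: the weighted-average updates from this slide.
        N_k = r.sum(axis=0)
        pi = N_k / N
        mu = (r * x[:, None]).sum(axis=0) / N_k
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / N_k + 1e-8)
    return pi, mu, sigma, ll

x = [1.25, 1.073, 1.75, 0.234, 0.484]  # the toy data from the motivation slides
print(em_gmm_1d(x, K=2))
```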

  • [Figure: illustration accompanying the Gaussian mixture EM example; both axes run from −2 to 2]

  • Why does EM work?

    - The M-step maximizes a function Q(θ, θ^(t)) = ∑_z q(z) log p(x, z | θ).

    - Compare this to the true log likelihood

      L(θ) = log [ ∑_z p(x, z | θ) ]

    - Key point 1: Q(θ, θ^(t)) + const ≤ L(θ) for all θ.

    - Key point 2: Because q arises from the E-step,

      Q(θ^(t), θ^(t)) + const = log p(x | θ^(t))


  • Why does EM work? (cont)

    [Figure: the log likelihood l(θ) with successive lower bounds Q(θ, θ^t) and Q(θ, θ^(t+1)), and the iterates θ^t, θ^(t+1), θ^(t+2)]

    - E-step: create a lower bound

      L(θ, q) = Q(θ, θ^(t)) + const

    - M-step: maximize the lower bound with respect to θ.

    This means that EM monotonically increases the likelihood at every iteration:

      L(θ^(t+1)) ≥ L(θ^(t+1), q) ≥ L(θ^(t), q) = L(θ^(t))


  • Why does Key Point 1 hold? log is concave

    α log x_0 + (1 − α) log x_1 ≤ log(α x_0 + (1 − α) x_1)
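
    For instance, with α = 1/2, x_0 = 1, and x_1 = 4, the left-hand side is ½ log 1 + ½ log 4 = log 2 ≈ 0.69, while the right-hand side is log 2.5 ≈ 0.92, so the inequality holds.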


  • Lower bound on L

    L(θ) = log [ ∑_z p(x, z | θ) ]

         = log [ ∑_z q(z) p(x, z | θ) / q(z) ]

         ≥ ∑_z q(z) log [ p(x, z | θ) / q(z) ]        (Jensen's inequality)

         = H(q) + E_q[log p(x, z | θ)]

         := L(θ, q)

    This is the lower bound. So the "const" on the previous slides is H(q), which is constant with respect to θ.


  • Summary

    - Mixture models provide a method for making a family of distributions richer.

    - They are also widely used for performing clustering.

    - Maximum likelihood in mixture models: use the EM algorithm.
