
  • MLPR: Clustering, Mixture Models, and the EM Algorithm

    Machine Learning and Pattern Recognition

    Charles Sutton

    School of Informatics, University of Edinburgh


  • Overview

    - Up until now: supervised learning. Training data contains examples of inputs and outputs.

    - Now we will see a few examples of unsupervised learning.

    - Unsupervised methods: the algorithm is given a set of training inputs but no outputs. Just "find some patterns".

    - This lecture: clustering.

    - Clustering: given a set of feature vectors {x_1, x_2, ..., x_N}, assign each data point x_i to a cluster label z_i ∈ {1, 2, ..., K}.

    - Sort of like classification, except that "the class labels" for the training data aren't given!

    - The idea is that items that are similar in feature space should be assigned to the same group.


  • Example: Image Clustering

    Input: Each x_i represents one image.
    Output: z_i ∈ {1, 2, ..., K}, a cluster label.

    e.g., Input:

    Source: http://137.189.35.203/WebUI/CatDatabase/catData.html



  • Example: Image Clustering

    e.g., Outputs:

    [Figure: the cat images from the previous slide, each circled and labelled with a cluster assignment from 1 to 4]

    - The circles give a potential assignment to the z_i.

    - Notice that we did not get a group label in the training inputs! This is the key difference from classification.

  • Examples: Time Series Clustering

    Input: Each of x_1, ..., x_N is one of the lines in the figure below; x_i = (x_i1, x_i2, ..., x_i6). Each x_it is one of the points (circles) in the figure.

    Example input:

    [Figure: yeast microarray data; each line is one gene's expression level plotted against time, with braces marking Cluster 1 and Cluster 2]


  • Examples: Time Series Clustering

    - Details: Each line represents a gene in yeast.

    - We measure how much each gene is expressed (i.e., converted into RNA) over time, as some experimental condition is changed.

    [Figure: the same yeast microarray plot as on the previous slide]


  • Examples: Time Series Clustering

    Output: Assignment of each vector to a cluster

    Example output:

    [Figure: the yeast microarray plot again, with braces grouping the genes into Cluster 1 and Cluster 2]


  • Ill-Posed Problem

    - There is no "one right answer" to a clustering problem.

    - For example, could cluster cats by colour:

    [Figure: the cat images labelled 1-4 according to a clustering by colour]


  • Ill-Posed Problem

    - Or by location:

    [Figure: the same cat images labelled 1-4 according to a clustering by location]

    - Remember no free lunch from the first lecture? Different clustering methods will make different prior assumptions, which determine which of these clusterings they find.


  • Clustering Methods

    - There are lots and lots of clustering methods:
      - K-means, K-medoids
      - Hierarchical clustering (agglomerative or top down)
      - Spectral clustering
      - Biclustering

    - Many of these aren't probabilistic.

    - We won't go through the above methods in this course.

    - Today we will do:
      - Mixture models
    - which are commonly used for clustering, but are used for other problems as well (density estimation, and even supervised learning).


  • Mixture Modelling

    - A mixture model is a way of making a simple density more flexible.

    - Previously, we have had models that looked like p(x_i | θ).

    - To make a mixture model, you need to start from a base distribution, like a Gaussian, beta, Bernoulli, etc.

    - From this, create a "mixture of Gaussians", "mixture of betas", etc.

    - The main idea will be to add a new random variable to the model.

    - We will have a joint distribution over z_i and x_i.

    - z_i is a variable that we add just to make the model more convenient. It is called a latent variable.


  • Mixture Modelling

    - Previously, we have had models that looked like p(x_i | θ).

    - Now we will have one set of parameters for each cluster: θ_1, θ_2, ..., θ_K.

    - In general, the model will be

      p(z_i = k) = π_k
      p(x_i | z_i = k) = p(x_i | θ_k)

    - π = (π_1, π_2, ..., π_K) represents how common the clusters are. These are called the mixing weights.

    - The p(x_i | θ_k) are the base distributions that we are mixing together.

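To make the two-stage generative process concrete, here is a minimal sketch of ancestral sampling from such a model, using a mixture of 1D Gaussians. The mixing weights and per-cluster parameters below are made up for illustration and do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters for a K = 3 mixture of 1D Gaussians.
pi = np.array([0.5, 0.3, 0.2])     # mixing weights pi_k, sum to 1
mu = np.array([-2.0, 0.0, 3.0])    # one mean per cluster
sigma = np.array([0.5, 1.0, 0.8])  # one standard deviation per cluster

def sample(n):
    """Ancestral sampling: draw z_i ~ Categorical(pi), then x_i | z_i = k ~ N(mu_k, sigma_k^2)."""
    z = rng.choice(len(pi), size=n, p=pi)
    x = rng.normal(mu[z], sigma[z])
    return x, z

x, z = sample(5)
print(np.c_[x, z])  # each sampled point together with the cluster that generated it
```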

  • Example 1: Mixture of Gaussians

    [Figure: plots of the density of a mixture of Gaussians]

    p(z_i = k) = π_k
    p(x_i | z_i = k) = N(x_i; μ_k, Σ_k)

    Notice a different mean and covariance for each cluster.
    (The plots above show the density of a MoG for K = 3.)


  • Example 2: Mixture of Multinoullis

    - Suppose each data point is a bit vector, i.e., x_i ∈ {0, 1}^D.

    - We can choose p(x | z) to be a product of Bernoullis. Let one training instance be x = (x_1, x_2, ..., x_D) where x_j ∈ {0, 1}. We define

      p(x | z = k, θ_k) = ∏_{j=1}^D θ_kj^{x_j} (1 − θ_kj)^{1 − x_j},

    i.e., if we are in cluster k, then the distribution over the bit vector x is a product of independent Bernoullis.

    θ_k = (θ_k1, θ_k2, ..., θ_kD) gives the Bernoulli parameter for each dimension, i.e.,

    p(x_j = 1 | z = k, θ_k) = θ_kj

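As a sketch of how this per-cluster likelihood can be evaluated, the snippet below computes log p(x | z = k, θ_k) for every cluster at once. The θ values are made up for illustration and are not from the slides.

```python
import numpy as np

# Made-up parameters: K = 2 clusters, D = 4 binary dimensions.
# theta[k, j] = p(x_j = 1 | z = k).
theta = np.array([[0.9, 0.8, 0.1, 0.2],
                  [0.1, 0.2, 0.9, 0.8]])

def log_p_x_given_z(x, theta):
    """log p(x | z = k, theta_k) = sum_j [x_j log theta_kj + (1 - x_j) log(1 - theta_kj)],
    returned as a length-K vector (one entry per cluster)."""
    x = np.asarray(x, dtype=float)
    return (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)

print(log_p_x_given_z([1, 1, 0, 0], theta))  # far higher under cluster 0 than cluster 1
```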

  • A Point about Expressivity

    - Squinting at this equation

      p(x | z = k, θ_k) = ∏_{j=1}^D θ_kj^{x_j} (1 − θ_kj)^{1 − x_j},

      we see that given z = k, all pairs of variables x_i and x_j are independent.

    - But marginally x_i and x_j are dependent under this model!

    - This is because x_i provides information about z, which provides info about x_j.

    - For example, consider D = 2 and K = 2, and

      θ_1 = (0.99, 0.01)
      θ_2 = (0.01, 0.99)
      π = (0.5, 0.5)

      Once I see x_1, I am almost certain about the value of x_2.

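This can be checked numerically. Using the D = 2, K = 2 parameters above, the short sketch below compares the marginal p(x_2 = 1) with the conditional p(x_2 = 1 | x_1 = 1) by summing over z.

```python
import numpy as np

# Parameters from the slide: theta[k, j] = p(x_j = 1 | z = k), pi[k] = p(z = k).
theta = np.array([[0.99, 0.01],
                  [0.01, 0.99]])
pi = np.array([0.5, 0.5])

p_x2 = np.sum(pi * theta[:, 1])                        # marginal p(x_2 = 1)
p_x1 = np.sum(pi * theta[:, 0])                        # marginal p(x_1 = 1)
p_x1_and_x2 = np.sum(pi * theta[:, 0] * theta[:, 1])   # joint p(x_1 = 1, x_2 = 1)
p_x2_given_x1 = p_x1_and_x2 / p_x1                     # conditional p(x_2 = 1 | x_1 = 1)

print(p_x2, p_x2_given_x1)  # 0.5 vs. ~0.02, so x_1 and x_2 are marginally dependent
```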

  • Summary: A Point about Expressivity

    - A product of independent Bernoullis is not a very expressive model.

    - We have made the model more expressive by embedding it within a mixture.

    - Now we can represent correlation, when we couldn't before.

    - This is a general point about mixtures: the family of distributions in a mixture model will be larger than the original models.


  • Maximizing Likelihood

    - Parameters to estimate are θ = (θ_1, ..., θ_K, π).

    - The definition of the likelihood doesn't change just because we added latent variables. It is still the (log) probability of the data given the parameters, i.e.,

      L(θ) = ∑_{i=1}^N log p(x_i | θ)

    - But now we have to compute a marginal distribution to get p(x_i | θ), i.e.,

      L(θ) = ∑_{i=1}^N log ∑_{k=1}^K p(x_i | z_i = k, θ) p(z_i = k | θ)


  • Likelihood Example

    - For a mixture of Gaussians we would have

      L(θ) = ∑_{i=1}^N log ∑_{k=1}^K N(x_i; μ_k, Σ_k) π_k

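A minimal sketch of evaluating this log likelihood for 1D data is below, using the log-sum-exp trick so the inner sum stays numerically stable. The data points are the ones used in the motivation slides later; the parameter values are placeholders I have made up.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mog_log_likelihood(x, pi, mu, sigma):
    """L(theta) = sum_i log sum_k pi_k N(x_i; mu_k, sigma_k^2)."""
    x = np.asarray(x)
    # log_comp[i, k] = log pi_k + log N(x_i; mu_k, sigma_k^2)
    log_comp = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
    return logsumexp(log_comp, axis=1).sum()

x = [1.25, 1.073, 1.75, 0.234, 0.484]  # the small dataset from the EM motivation slides
pi = np.array([0.5, 0.5])              # placeholder parameters
mu = np.array([1.5, 0.3])
sigma = np.array([0.3, 0.3])           # standard deviations, not variances
print(mog_log_likelihood(x, pi, mu, sigma))
```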

  • Multimodality

    - The likelihood for a mixture model can be multimodal, for the same reason as the neural network likelihood can be.

    - For example, we can do label switching, i.e., permute θ_1, ..., θ_K, and then permute π_1, ..., π_K to match (see the sketch after this slide).

    - When we run an optimization procedure, we will only find a local optimum.

    - More technically, the log likelihood is not convex (we can verify this by taking second derivatives).

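A quick way to see label switching concretely: permuting the components leaves the likelihood value unchanged, so the maximizer cannot be unique. This sketch repeats the log-likelihood helper from the earlier snippet; the parameter values are made up.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mog_log_likelihood(x, pi, mu, sigma):
    log_comp = np.log(pi) + norm.logpdf(np.asarray(x)[:, None], loc=mu, scale=sigma)
    return logsumexp(log_comp, axis=1).sum()

x = [1.25, 1.073, 1.75, 0.234, 0.484]
pi, mu, sigma = np.array([0.3, 0.7]), np.array([1.5, 0.4]), np.array([0.2, 0.3])

perm = [1, 0]  # label switching: swap the two components
print(mog_log_likelihood(x, pi, mu, sigma))
print(mog_log_likelihood(x, pi[perm], mu[perm], sigma[perm]))  # identical value
```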

  • How to apply to Clustering

    - Given data {x_1, x_2, ..., x_N} and a model

      p(x_1, ..., x_N, z_1, ..., z_N | θ_1, ..., θ_K, π)

    - Maximize the likelihood to get ML parameters θ̂_1, ..., θ̂_K, π̂.

    - For each training point x_i, compute

      r_ik := p(z_i = k | x_i, θ̂_1, ..., θ̂_K, π̂)

      This is called the responsibility of cluster k for point i.

    - Interpret the responsibilities as a soft assignment of each point to the clusters.

    - If you need x_i to be assigned to exactly one cluster, take arg max_k r_ik.

    - Soft clustering: each point gets a distribution over clusters; hard clustering: each point gets exactly one cluster.

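Here is a sketch of computing the responsibilities r_ik and the corresponding hard assignment for a 1D Gaussian mixture; the "fitted" parameter values below are placeholders standing in for the ML estimates θ̂, π̂.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

x = np.array([1.25, 1.073, 1.75, 0.234, 0.484])

# Placeholder values standing in for the ML estimates theta_hat, pi_hat.
pi_hat = np.array([0.4, 0.6])
mu_hat = np.array([1.5, 0.3])
sigma_hat = np.array([0.3, 0.25])

# r[i, k] = p(z_i = k | x_i, theta_hat) via Bayes' rule, computed in log space.
log_joint = np.log(pi_hat) + norm.logpdf(x[:, None], loc=mu_hat, scale=sigma_hat)
r = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

print(r)                 # soft clustering: one distribution over clusters per point
print(r.argmax(axis=1))  # hard clustering: arg max_k r_ik
```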

  • Optimizing the Likelihood

    - Could use any of the methods from the last lecture, e.g., gradient descent, etc.

    - But in practice we usually don't use those for mixture models.

    - It turns out to be easier to apply a different optimization algorithm called expectation maximization (EM).

    - EM is a special optimization algorithm for probabilistic models.

    - Think of it as a competitor to gradient descent, conjugate gradient, etc.


  • Motivation for EM

    Consider a 1D mixture of Gaussians. The model is:

      p(x) = ∑_{k=1}^K π_k (2πσ_k²)^{−1/2} exp{ −(x − μ_k)² / (2σ_k²) }

    Your data look like:

      x_i      z_i
      1.25     ??
      1.073    ??
      1.75     ??
      0.234    ??
      0.484    ??

    It is difficult to figure out how to estimate π_k, μ_k, σ_k².


  • Motivation for EM

    Consider a 1D mixture of Gaussians. The model is:

      p(x) = ∑_{k=1}^K π_k (2πσ_k²)^{−1/2} exp{ −(x − μ_k)² / (2σ_k²) }

    Suppose someone tells you the cluster labels. Now your data becomes:

      x_i      z_i
      1.25     1
      1.073    2
      1.75     1
      0.234    2
      0.484    2

    Now do you know how to estimate π_k, μ_k, σ_k²?



  • Hard Cluster Assignment Given

    Yes! This is just a class-conditional Gaussian. ML is easy.

      x_i      z_i
      1.25     1
      1.073    2
      1.75     1
      0.234    2
      0.484    2

    e.g.,

      π_k = (∑_{i=1}^N I{z_i = k}) / N

      μ_k = (1 / ∑_{i=1}^N I{z_i = k}) ∑_{i=1}^N I{z_i = k} x_i

      σ_k² = (1 / ∑_{i=1}^N I{z_i = k}) ∑_{i=1}^N I{z_i = k} (x_i − μ_k)²

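A sketch of these class-conditional estimates applied to the table above; it just uses the indicator-weighted averages written on the slide, with no fitting library assumed.

```python
import numpy as np

x = np.array([1.25, 1.073, 1.75, 0.234, 0.484])
z = np.array([1, 2, 1, 2, 2])  # hard cluster labels from the table

for k in (1, 2):
    mask = (z == k)                          # I{z_i = k}
    pi_k = mask.mean()                       # sum_i I{z_i = k} / N
    mu_k = x[mask].mean()                    # mean of the points in cluster k
    var_k = ((x[mask] - mu_k) ** 2).mean()   # ML variance (divides by the cluster count)
    print(k, pi_k, mu_k, var_k)
```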

  • Soft Cluster Assignment Given

      x_i      z_i
      1.25     [0.9, 0.1]
      1.073    2
      1.75     1
      0.234    2
      0.484    2

    - Suppose instead you get a soft assignment to clusters, [r_i1, r_i2].

    - r_i1 is the degree to which the point belongs to cluster 1, r_i2 the degree to cluster 2, and r_i1 + r_i2 = 1.

    Intuitively we'd like to treat each point as partially belonging to both clusters, i.e.,

      π_k = (∑_{i=1}^N r_ik) / N

      μ_k = (1 / ∑_{i=1}^N r_ik) ∑_{i=1}^N r_ik x_i

      σ_k² = (1 / ∑_{i=1}^N r_ik) ∑_{i=1}^N r_ik (x_i − μ_k)²

    We're taking weighted averages rather than sample averages.
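
The same computation with responsibilities simply replaces the indicators with weights. In the sketch below only the [0.9, 0.1] row comes from the table; the remaining rows are written as one-hot assignments for illustration.

```python
import numpy as np

x = np.array([1.25, 1.073, 1.75, 0.234, 0.484])
# r[i, k]: soft assignments. Only the first row is soft on the slide; the rest
# are the hard labels written as one-hot rows.
r = np.array([[0.9, 0.1],
              [0.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

N_k = r.sum(axis=0)                                    # effective number of points per cluster
pi = N_k / len(x)                                      # pi_k = sum_i r_ik / N
mu = (r * x[:, None]).sum(axis=0) / N_k                # responsibility-weighted means
var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / N_k   # responsibility-weighted variances
print(pi, mu, var)
```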

  • Outline of EM

    - EM is an algorithm for performing maximum likelihood where there is missing data.

    - It can be used where we have:
      - Data x = (x_1, ..., x_N)
      - A model p(x, z | θ)
      - We want the ML solution

        θ̂ = arg max_θ log ∑_z p(x, z | θ)

    - A model of the form p(x, z | θ) is called a latent variable model because we do not get to observe the random variable z in the observed data.


  • EM for Mixture Models

    EM is an iterative algorithm, like gradient descent. At every iteration t we have a current guess θ^(t) of the parameters.

    - Repeat

      1. E-step. Compute a soft assignment to the latent variables: compute q(z) := p(z | x, θ^(t)).

      2. M-step. Re-estimate the parameters based on the soft assignment. Define the complete-data log likelihood

         ℓ_c(θ) = log p(x, z | θ)

         Then

         θ^(t+1) ← arg max_θ ∑_z q(z) ℓ_c(θ)

         where the objective being maximized, ∑_z q(z) ℓ_c(θ), is denoted Q(θ, θ^(t)).

      3. t ← t + 1

    - until converged, i.e., d(θ^(t), θ^(t−1)) < ε


  • Specializing it to a Model

    - Given a new model p(z, x | θ), we may need to derive the EM algorithm for that model.

    - To do this, we need:
      - A method for computing the E-step: q(z) := p(z | x, θ^(t))
      - A method for computing the M-step: max_θ Q(θ, θ^(t))

    - Example: Mixture of Gaussians
      - We have already described p(z | x, θ^(t)); this is the responsibility.
      - It turns out that max_θ Q(θ, θ^(t)) occurs at the weighted-average estimates on the previous slide!


  • Example: EM on a Gaussian Mixture

    - Repeat

      1. E-step. Assign the points to clusters. Compute

         r_ik = p(z_i = k | x_i, θ_1^(t), ..., θ_K^(t), π^(t))

         for all i ∈ 1, ..., N and k ∈ 1, ..., K.

      2. M-step. Re-estimate the model parameters:

         π_k = (∑_{i=1}^N r_ik) / N

         μ_k = (1 / ∑_{i=1}^N r_ik) ∑_{i=1}^N r_ik x_i

         σ_k² = (1 / ∑_{i=1}^N r_ik) ∑_{i=1}^N r_ik (x_i − μ_k)²

      3. t ← t + 1

    - until converged, i.e., d(θ^(t), θ^(t−1)) < ε

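Putting the two steps together, below is a minimal sketch of EM for a 1D mixture of Gaussians. The initialization, convergence tolerance, and variance floor are choices made for illustration; a more careful implementation would guard against collapsing components and try several restarts.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Simple initialization: uniform weights, means at random data points.
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, x.std() + 1e-3)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(z_i = k | x_i, current parameters).
        log_joint = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
        ll = logsumexp(log_joint, axis=1).sum()  # current log likelihood
        if ll - prev_ll < tol:                   # EM increases this monotonically
            break
        prev_ll = ll
        r = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
        # M-step: the weighted-average updates from this slide.
        N_k = r.sum(axis=0)
        pi = N_k / N
        mu = (r * x[:, None]).sum(axis=0) / N_k
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / N_k + 1e-8)
    return pi, mu, sigma, ll

x = [1.25, 1.073, 1.75, 0.234, 0.484]  # the toy data from the motivation slides
print(em_gmm_1d(x, K=2))
```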

  • [Figure: illustration accompanying the Gaussian mixture EM example; both axes run from −2 to 2]

  • Why does EM work?

    - The M-step maximizes a function Q(θ, θ^(t)) = ∑_z q(z) log p(x, z | θ).

    - Compare this to the true log likelihood

      L(θ) = log [ ∑_z p(x, z | θ) ]

    - Key point 1: Q(θ, θ^(t)) + const ≤ L(θ) for all θ.

    - Key point 2: Because q arises from the E-step,

      Q(θ^(t), θ^(t)) + const = log p(x | θ^(t))


  • Why does EM work? (cont)

    [Figure: the log likelihood l(θ) with successive lower bounds Q(θ, θ^t) and Q(θ, θ^(t+1)), and the iterates θ^t, θ^(t+1), θ^(t+2)]

    - E-step: create a lower bound

      L(θ, q) = Q(θ, θ^(t)) + const

    - M-step: maximize the lower bound with respect to θ.

    This means that EM monotonically increases the likelihood at every iteration:

      L(θ^(t+1)) ≥ L(θ^(t+1), q) ≥ L(θ^(t), q) = L(θ^(t))


  • Why does Key Point 1 hold? log is concave

    α log x_0 + (1 − α) log x_1 ≤ log(α x_0 + (1 − α) x_1)
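
    For instance, with α = 1/2, x_0 = 1, and x_1 = 4, the left-hand side is ½ log 1 + ½ log 4 = log 2 ≈ 0.69, while the right-hand side is log 2.5 ≈ 0.92, so the inequality holds.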


  • Lower bound on L

    L(θ) = log [ ∑_z p(x, z | θ) ]

         = log [ ∑_z q(z) p(x, z | θ) / q(z) ]

         ≥ ∑_z q(z) log [ p(x, z | θ) / q(z) ]        (Jensen's inequality)

         = H(q) + E_q[log p(x, z | θ)]

         := L(θ, q)

    This is the lower bound. So the "const" on the previous slides is H(q), which is constant with respect to θ.


  • Summary

    - Mixture models provide a method for making a family of distributions richer.

    - They are also widely used for performing clustering.

    - Maximum likelihood in mixture models: use the EM algorithm.
