Ch 9. Mixture Models and EM
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/


Page 1: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University

Ch 9. Mixture Models and EM

Pattern Recognition and Machine Learning, C. M. Bishop, 2006.

Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

Page 2: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


Contents

9.1 K-means Clustering
9.2 Mixtures of Gaussians
  9.2.1 Maximum likelihood
  9.2.2 EM for Gaussian mixtures
9.3 An Alternative View of EM
  9.3.1 Gaussian mixtures revisited
  9.3.2 Relation to K-means
  9.3.3 Mixtures of Bernoulli distributions
9.4 The EM Algorithm in General

Page 3: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.1 K-means Clustering (1/3)

The problem of identifying groups, or clusters, of data points in a multidimensional space: partitioning the data set into some number K of clusters.

Cluster: a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster.

Goal: an assignment of data points to clusters such that the sum of the squares of the distances of each data point to its closest prototype vector (the center of its cluster) is a minimum.
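In Bishop's notation, with binary indicator variables r_nk assigning point x_n to cluster k and prototype vectors μ_k, the objective described here is the distortion measure

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2 , \qquad r_{nk} \in \{0, 1\}, \quad \sum_{k} r_{nk} = 1 .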

Page 4: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.1 K-means Clustering (2/3)

Two-stage optimization of the distortion measure J:

In the first stage, minimize J with respect to the r_nk, keeping the μ_k fixed: assign each data point to its nearest cluster center.

In the second stage, minimize J with respect to the μ_k, keeping the r_nk fixed: set μ_k to the mean of all of the data points assigned to cluster k,

\mu_k = \frac{\sum_n r_{nk} \, x_n}{\sum_n r_{nk}} .

The two stages are repeated until convergence (a minimal sketch follows below).
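A minimal sketch of these two alternating stages, assuming NumPy and Euclidean distance; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def kmeans(X, K, n_iters=100, rng=None):
    """Minimal K-means: alternate the assignment step and the mean-update step."""
    rng = np.random.default_rng(rng)
    # Initialize centers with K randomly chosen data points.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Stage 1: minimize J w.r.t. r_nk with mu fixed -> nearest-center assignment.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        r = d2.argmin(axis=1)
        # Stage 2: minimize J w.r.t. mu with r_nk fixed -> cluster means.
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    J = ((X - mu[r]) ** 2).sum()  # distortion measure at the final assignment
    return mu, r, J
```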

Page 5: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.1 K-means Clustering (3/3)

Page 6: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.2 Mixtures of Gaussians (1/3)

A formulation of Gaussian mixtures in terms of discrete latent variables.

The Gaussian mixture distribution can be written as a linear superposition of Gaussians:

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)   ...(*)

An equivalent formulation of the Gaussian mixture involves an explicit latent variable: a binary random variable z having a 1-of-K representation, with a graphical representation of the mixture model linking z to x.

The marginal distribution of x is then a Gaussian mixture of the form (*); for every observed data point x_n there is a corresponding latent variable z_n.

Page 7: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.2 Mixtures of Gaussians (2/3)

γ(zk) can also be viewed as the responsibility that component k takes for explaining the observation x
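In Bishop's notation, this responsibility is the posterior probability of the latent variable obtained from Bayes' theorem,

\gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)} ,

where π_k = p(z_k = 1) plays the role of the prior probability of component k.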

Page 8: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.2 Mixtures of Gaussians (3/3)

Generating random samples distributed according to the Gaussian mixture model (ancestral sampling): first generate a value for z from the marginal distribution p(z), and then generate a value for x from the conditional distribution p(x | z).

a. The three states of z, corresponding to the three components of the mixture, are depicted in red, green, and blue.
b. The corresponding samples from the marginal distribution p(x).
c. The same samples, with the colors representing the value of the responsibilities γ(z_nk) associated with each data point, illustrating the responsibilities obtained by evaluating the posterior probability of each component in the mixture distribution from which this data set was generated.
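A small sketch of this ancestral sampling procedure for a Gaussian mixture, assuming NumPy; the three-component parameters below are made up purely for illustration:

```python
import numpy as np

def sample_gmm(pi, mus, Sigmas, n_samples, rng=None):
    """Ancestral sampling: draw z ~ p(z), then x ~ p(x|z) = N(x | mu_z, Sigma_z)."""
    rng = np.random.default_rng(rng)
    z = rng.choice(len(pi), size=n_samples, p=pi)          # component labels
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

# Illustrative three-component mixture in 2-D (parameters are arbitrary).
pi = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.3])])
X, Z = sample_gmm(pi, mus, Sigmas, n_samples=500, rng=0)
```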

Page 9: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.2.1 Maximum likelihood (1/3)

Graphical representation of a Gaussian mixture model for a set of N i.i.d. data points {x_n}, with corresponding latent points {z_n}.

The log of the likelihood function:

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}   ...(*1)

Page 10: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.2.1 Maximum likelihood (2/3)

For simplicity, consider a Gaussian mixture whose components have covariance matrices given by Σ_k = σ_k² I. Suppose that one of the components of the mixture model has its mean μ_j exactly equal to one of the data points, so that μ_j = x_n.

This data point will then contribute a term to the likelihood function of the form \mathcal{N}(x_n \mid x_n, \sigma_j^2 I) = \frac{1}{(2\pi \sigma_j^2)^{D/2}}, which goes to infinity as σ_j → 0.

Once there are at least two components in the mixture, one of the components can have a finite variance, and therefore assign finite probability to all of the data points, while the other component can shrink onto one specific data point and thereby contribute an ever-increasing additive value to the log likelihood: this is the over-fitting (singularity) problem.

Page 11: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


Over-fitting when one of the Gaussian components collapses onto a specific data point

Example: for a small data set D = {x_1, x_2, x_3}, the likelihood is a product of factors of the form \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x_n - \mu)^2}{2\sigma^2} \right\}.

Single-Gaussian case: if the Gaussian collapses onto one data point (μ = x_1, σ → 0), the factor for x_1 grows like 1/σ but the factors for the remaining data points go to zero exponentially fast, so the overall likelihood goes to zero and there is no singularity.

Mixture case, with p(x | z_1 = 1) = \mathcal{N}(x | \mu_1, \sigma_1^2) and p(x | z_2 = 1) = \mathcal{N}(x | \mu_2, \sigma_2^2): the second component can keep a finite variance and assign finite probability to all of the data points, while the first component sets μ_1 = x_1 and lets σ_1 → 0, so that its factor

\frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left\{ -\frac{(x_1 - \mu_1)^2}{2\sigma_1^2} \right\} = \frac{1}{\sqrt{2\pi}\,\sigma_1} \to \infty,

and hence \lim_{\sigma_1 \to 0} L = \infty: the likelihood has no finite maximum.

Page 12: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.2.1 Maximum likelihood (3/3)

Over-fitting problem: an example of over-fitting in a maximum likelihood approach. This problem does not occur in the Bayesian approach. In applying maximum likelihood to Gaussian mixture models, there should be steps to avoid finding such pathological solutions and instead seek local maxima of the likelihood function that are well behaved.

Identifiability problem: a K-component mixture has a total of K! equivalent solutions, corresponding to the K! ways of assigning K sets of parameters to K components.

Difficulty of maximizing the log likelihood function: the presence of the summation over k inside the logarithm means there is no closed-form solution, in contrast to the single-Gaussian case.

Page 13: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.2.2 EM for Gaussian mixtures (1/4)

I. Assign some initial values for the means, covariances, and mixing coefficients.
II. Expectation or E step: use the current values of the parameters to evaluate the posterior probabilities, or responsibilities.
III. Maximization or M step: use the result of II to re-estimate the means, covariances, and mixing coefficients.

It is common to run the K-means algorithm first in order to find suitable initial values (a minimal initialization sketch follows below):
• the means are set to the cluster centers found by K-means;
• the covariance matrices are set to the sample covariances of the clusters found by the K-means algorithm;
• the mixing coefficients are set to the fractions of data points assigned to the respective clusters.
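A minimal sketch of this K-means-based initialization, assuming NumPy and the kmeans function sketched earlier in Section 9.1; names and the small-cluster fallback are illustrative assumptions:

```python
import numpy as np

def init_gmm_from_kmeans(X, K, rng=None):
    """Initialize GMM parameters from a K-means clustering of X (shape (N, D))."""
    mu, r, _ = kmeans(X, K, rng=rng)             # cluster centers and hard assignments
    N, D = X.shape
    pi = np.array([(r == k).mean() for k in range(K)])          # cluster fractions
    Sigma = np.stack([np.cov(X[r == k].T) + 1e-6 * np.eye(D)    # sample covariances
                      if (r == k).sum() > D else np.eye(D)      # fallback for tiny clusters
                      for k in range(K)])
    return pi, mu, Sigma
```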

Page 14: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.2.2 EM for Gaussian mixtures (2/4)

Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters.

1. Initialize the means μ_k, covariances Σ_k and mixing coefficients π_k.
2. E step: evaluate the responsibilities using the current parameter values.
3. M step: re-estimate the parameters using the current responsibilities.
4. Evaluate the log likelihood and check for convergence of either the parameters or the log likelihood; if not converged, return to step 2.

(The update equations for these steps are summarized below.)
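In Bishop's notation, the E-step and M-step quantities referred to here are:

E step:
\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

M step, with N_k = \sum_{n=1}^{N} \gamma(z_{nk}):
\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n
\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T
\pi_k^{new} = \frac{N_k}{N}

Log likelihood:
\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}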

Page 15: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.2.2 EM for Gaussian mixtures (3/4)

Setting the derivatives of the log likelihood (*2) with respect to the means μ_k of the Gaussian components to zero gives the weighted-mean update for μ_k.

Setting the derivatives of (*2) with respect to the covariances Σ_k of the Gaussian components to zero gives the weighted sample-covariance update for Σ_k.

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}   ...(*2)

Responsibility: γ(z_nk).

The mean update is a weighted mean of all of the points in the data set:
- each data point is weighted by the corresponding posterior probability (responsibility);
- the denominator is given by the effective number of points N_k associated with the corresponding component.
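A compact sketch of one full EM iteration for a Gaussian mixture along the lines of these updates, assuming NumPy and SciPy's multivariate normal density; this is an illustrative implementation, not code from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a Gaussian mixture.
    X: (N, D); pi: (K,); mu: (K, D); Sigma: (K, D, D)."""
    N, K = X.shape[0], len(pi)
    # E step: responsibilities gamma(z_nk).
    dens = np.stack([multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                     for k in range(K)], axis=1)               # (N, K) component densities
    weighted = pi * dens                                        # pi_k * N(x_n | mu_k, Sigma_k)
    gamma = weighted / weighted.sum(axis=1, keepdims=True)
    # M step: re-estimate parameters from the responsibilities.
    Nk = gamma.sum(axis=0)                                      # effective number of points
    mu_new = (gamma.T @ X) / Nk[:, None]
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    pi_new = Nk / N
    log_lik = np.log(weighted.sum(axis=1)).sum()                # log likelihood under the pre-update parameters
    return pi_new, mu_new, Sigma_new, log_lik
```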

Page 16: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.2.2 EM for Gaussian mixtures (4/4)

Page 17: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.3 An Alternative View of EM

In maximizing the log likelihood function, the summation over k prevents the logarithm from acting directly on the joint distribution. In contrast, the log likelihood function for the complete data set {X, Z} is straightforward to maximize. In practice, since we are not given the complete data set, we consider instead its expected value Q under the posterior distribution p(Z | X, Θ) of the latent variables.

General EM (a generic code skeleton follows below):
1. Choose an initial setting for the parameters Θ^old.
2. E step: evaluate p(Z | X, Θ^old).
3. M step: evaluate Θ^new given by
   Θ^new = argmax_Θ Q(Θ, Θ^old),  where  Q(Θ, Θ^old) = Σ_Z p(Z | X, Θ^old) ln p(X, Z | Θ).
4. If the convergence criterion is not satisfied, let Θ^old ← Θ^new and return to step 2.
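A generic skeleton of this loop, assuming the model supplies an e_step function returning the posterior quantities needed for Q and an m_step function maximizing Q; both names are illustrative placeholders:

```python
def em(theta_old, e_step, m_step, log_lik, tol=1e-6, max_iters=200):
    """Generic EM loop: alternate E and M steps until the log likelihood stops improving."""
    prev = log_lik(theta_old)
    for _ in range(max_iters):
        posterior = e_step(theta_old)      # E step: evaluate p(Z | X, theta_old)
        theta_new = m_step(posterior)      # M step: theta_new = argmax_theta Q(theta, theta_old)
        curr = log_lik(theta_new)
        if abs(curr - prev) < tol:         # convergence criterion on the log likelihood
            return theta_new
        theta_old, prev = theta_new, curr
    return theta_old
```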

Page 18: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.3.1 Gaussian mixtures revisited (1/2)

Maximizing the likelihood for the complete data {X, Z}: the logarithm now acts directly on the Gaussian distribution, giving a much simpler solution to the maximum likelihood problem; the maximization with respect to a mean or a covariance is exactly as for a single Gaussian (closed form).

p(X, Z \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}

\ln p(X, Z \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

(cf. the incomplete-data log likelihood: \ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k))

Page 19: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.3.1 Gaussian mixtures revisited (2/2)

With unknown latent variables, we consider the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables.

Posterior distribution (cf. (9.10), (9.11)):
p(Z \mid X, \mu, \Sigma, \pi) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]^{z_{nk}}

The expected value of the indicator variable under this posterior distribution:
E[z_{nk}] = \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

The expected value of the complete-data log likelihood function:
E_Z[\ln p(X, Z \mid \mu, \Sigma, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}   ...(*3)

Page 20: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.3.2 Relation to K-means

K-means performs a hard assignment of data points to clusters (each data point is associated uniquely with one cluster), whereas EM makes a soft assignment based on the posterior probabilities.

K-means can be derived as a particular limit of EM for Gaussian mixtures with covariances εI: as ε → 0, the terms for which ||x_n − μ_k||² is largest go to zero most quickly, hence the responsibilities γ(z_nk) all go to zero except for the term k with the smallest ||x_n − μ_k||², whose responsibility goes to unity.

Thus maximizing the expected complete-data log likelihood becomes equivalent to minimizing the distortion measure J for K-means.

(In elliptical K-means, the covariance is also estimated.)

p(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi\epsilon)^{D/2}} \exp\left\{ -\frac{1}{2\epsilon} \lVert x - \mu_k \rVert^2 \right\}

\gamma(z_{nk}) = \frac{\pi_k \exp\{ -\lVert x_n - \mu_k \rVert^2 / 2\epsilon \}}{\sum_j \pi_j \exp\{ -\lVert x_n - \mu_j \rVert^2 / 2\epsilon \}} \;\to\; r_{nk} \quad \text{as } \epsilon \to 0

E_Z[\ln p(X, Z \mid \mu, \Sigma, \pi)] \;\to\; -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \lVert x_n - \mu_k \rVert^2 + \text{const}

Page 21: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.3.3 Mixtures of Bernoulli distributions

Page 22: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.3.3 Mixtures of Bernoulli distributions (1/2)

Single component: for a D-dimensional binary vector x (e.g. x = (1, 0, 0, 1, 1)) with parameters μ = (μ_1, ..., μ_D),

p(x \mid \mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i},
\qquad E[x] = \mu,
\qquad \mathrm{Cov}[x] = \mathrm{diag}\{ \mu_i (1 - \mu_i) \}

(note that the individual variables x_i are independent given μ).

Mixture of Bernoulli distributions:

p(x \mid \mu, \pi) = \sum_{k=1}^{K} \pi_k \, p(x \mid \mu_k),
\qquad p(x \mid \mu_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}

E[x] = \sum_{k=1}^{K} \pi_k \mu_k,
\qquad
\mathrm{Cov}[x] = \sum_{k=1}^{K} \pi_k \left\{ \Sigma_k + \mu_k \mu_k^T \right\} - E[x] E[x]^T,
\quad \Sigma_k = \mathrm{diag}\{ \mu_{ki} (1 - \mu_{ki}) \}

* Because the covariance matrix Cov[x] is no longer diagonal, the mixture distribution can capture correlations between the variables, unlike a single Bernoulli distribution.

Page 23: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.3.3 Mixtures of Bernoulli distributions (2/2)

Log likelihood for the mixture:
\ln p(X \mid \mu, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, p(x_n \mid \mu_k) \right\}
(the summation inside the logarithm again means there is no closed-form maximum likelihood solution).

Complete-data log likelihood (z_n is a binary indicator variable with 1-of-K coding):
\ln p(X, Z \mid \mu, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \sum_{i=1}^{D} \left[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln (1 - \mu_{ki}) \right] \right\}

(E step)
\gamma(z_{nk}) = E[z_{nk}] = \frac{\pi_k \, p(x_n \mid \mu_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n \mid \mu_j)}

E_Z[\ln p(X, Z \mid \mu, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_k + \sum_{i=1}^{D} \left[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln (1 - \mu_{ki}) \right] \right\}

(M step)
N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n, \qquad \mu_k = \bar{x}_k, \qquad \pi_k = \frac{N_k}{N}

* In contrast to the mixture of Gaussians, there are no singularities in which the likelihood goes to infinity.

Page 24: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.3.3 Mixtures of Bernoulli distributions (2a/2)

Example: N = 600 digit images, fit with a mixture of K = 3 Bernoulli distributions by 10 EM iterations; shown are the parameters for each of the three components, together with a single multivariate Bernoulli distribution fit to the same data for comparison.

The analysis of Bernoulli mixtures can be extended to the case of multinomial binary variables having M > 2 states (Exercise 9.19).
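A short sketch of one EM iteration for a Bernoulli mixture following the E-step and M-step formulas above, assuming NumPy and binary data X of shape (N, D); this is an illustrative implementation, not the code used for the digit experiment:

```python
import numpy as np

def bernoulli_em_step(X, pi, mu, eps=1e-10):
    """One EM iteration for a mixture of Bernoulli distributions.
    X: (N, D) binary data; pi: (K,) mixing coefficients; mu: (K, D) Bernoulli means."""
    # E step: responsibilities gamma(z_nk), computed in log space for numerical stability.
    log_px = X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T   # (N, K)
    log_w = np.log(pi + eps) + log_px
    log_w -= log_w.max(axis=1, keepdims=True)
    gamma = np.exp(log_w)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M step: effective counts N_k, weighted means x_bar_k, and updated parameters.
    Nk = gamma.sum(axis=0)
    mu_new = (gamma.T @ X) / Nk[:, None]       # mu_k = x_bar_k
    pi_new = Nk / X.shape[0]                   # pi_k = N_k / N
    return pi_new, mu_new, gamma
```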

Page 25: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.4 The EM Algorithm in General (1/3)

Direct optimization of p(X | θ) is difficult, while optimization of the complete-data likelihood p(X, Z | θ) is significantly easier.

Decomposition of the log likelihood ln p(X | θ):

\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + KL(q \,\|\, p), \qquad p(X \mid \theta) = \sum_Z p(X, Z \mid \theta)

where
\mathcal{L}(q, \theta) = \sum_Z q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)},
\qquad
KL(q \,\|\, p) = -\sum_Z q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)}

(substitute \ln p(X, Z \mid \theta) = \ln p(Z \mid X, \theta) + \ln p(X \mid \theta) into the expression for \mathcal{L}(q, \theta)).

Since KL(q \,\|\, p) \ge 0, we have \mathcal{L}(q, \theta) \le \ln p(X \mid \theta): \mathcal{L}(q, \theta) is a lower bound on \ln p(X \mid \theta).

Page 26: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.4 The EM Algorithm in General (2/3)

(E step) The lower bound L(q, θ^old) is maximized with respect to q(Z) while holding θ^old fixed. Since ln p(X | θ^old) does not depend on q(Z), L(q, θ^old) is largest when KL(q||p) vanishes, i.e. when q(Z) is equal to the posterior distribution p(Z | X, θ^old).

(M step) q(Z) is held fixed and the lower bound L(q, θ) is maximized with respect to θ to give θ^new. This increases the lower bound and, because q is now held at the old posterior, it also makes KL(q||p) greater than zero; thus the increase in the log likelihood function is greater than the increase in the lower bound.

In the M step, the quantity being maximized is the expectation of the complete-data log likelihood:

\mathcal{L}(q, \theta) = \sum_Z p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta) - \sum_Z p(Z \mid X, \theta^{old}) \ln p(Z \mid X, \theta^{old}) = Q(\theta, \theta^{old}) + \text{const}

(where q(Z) is set to p(Z \mid X, \theta^{old})).

Page 27: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


c.f. - 9.3 An Alternative View of EM

In maximizing the log likelihood function, the summation over k prevents the logarithm from acting directly on the joint distribution. In contrast, the log likelihood function for the complete data set {X, Z} is straightforward to maximize. In practice, since we are not given the complete data set, we consider instead its expected value Q under the posterior distribution p(Z | X, Θ) of the latent variables.

General EM:
1. Choose an initial setting for the parameters Θ^old.
2. E step: evaluate p(Z | X, Θ^old).
3. M step: evaluate Θ^new given by
   Θ^new = argmax_Θ Q(Θ, Θ^old),  where  Q(Θ, Θ^old) = Σ_Z p(Z | X, Θ^old) ln p(X, Z | Θ).
4. If the convergence criterion is not satisfied, let Θ^old ← Θ^new and return to step 2.

Page 28: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.4 The EM Algorithm in General (3/3)

Start with an initial parameter value θ^old.

In the first E step, evaluation of the posterior distribution over latent variables gives rise to a lower bound L(q, θ^old) whose value equals the log likelihood at θ^old (blue curve).

Note that the bound makes a tangential contact with the log likelihood at θ^old, so that both curves have the same gradient. For mixture components from the exponential family, this bound is a convex function of θ.

In the M step, the bound is maximized, giving the value θ^new, which gives a larger value of the log likelihood than θ^old.

The subsequent E step then constructs a bound that is tangential at θ^new (green curve).

Page 29: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.4 The EM Algorithm in General (3a/3)

EM can also be used to maximize the posterior distribution p(θ | X) over parameters (MAP estimation).

Optimize the right-hand side alternately with respect to q and θ: the optimization with respect to q is the same E step as before, while the M step requires only a small modification through the introduction of the prior term ln p(θ).

p(\theta \mid X) = \frac{p(\theta, X)}{p(X)}, \qquad \ln p(\theta \mid X) = \ln p(X \mid \theta) + \ln p(\theta) - \ln p(X)

By the decomposition (9.70), \ln p(X \mid \theta) = \mathcal{L}(q, \theta) + KL(q \,\|\, p), so

\ln p(\theta \mid X) = \mathcal{L}(q, \theta) + KL(q \,\|\, p) + \ln p(\theta) - \ln p(X) \;\ge\; \mathcal{L}(q, \theta) + \ln p(\theta) + \text{const}.

Page 30: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.4 The EM Algorithm in General (3b/3)

For complex problems, either the E step or the M step, or both, may be intractable. Intractable M step: generalized EM (GEM), expectation conditional maximization (ECM). Intractable E step: partial E step.

GEM: instead of maximizing L(q, θ) with respect to θ, it seeks to change the parameters so as to increase its value.

ECM: makes several constrained optimizations within each M step. For instance, the parameters are partitioned into groups and the M step is broken down into multiple steps, each of which involves optimizing one of the subsets with the remainder held fixed.

Partial (or incremental) E step: (Note) For any given θ, there is a unique maximum of L(q, θ) with respect to q. Since L(q*, θ) = ln p(X | θ), there is a θ* at the global maximum of L(q, θ), and ln p(X | θ*) is a global maximum too. Any algorithm that converges to the global maximum of L(q, θ) will find a value of θ that is also a global maximum of the log likelihood ln p(X | θ).

Each E or M step in a partial E step algorithm increases the value of L(q, θ), and if the algorithm converges to a local (or global) maximum of L(q, θ), this will correspond to a local (or global) maximum of the log likelihood function ln p(X | θ).

Page 31: Ch 9. Mixture Models and EM Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Biointelligence Laboratory, Seoul National University


9.4 The EM Algorithm in General (3c/3)

(Incremental EM) For a Gaussian mixture, suppose a single data point x_m is updated in the E step, its responsibility changing from the old value γ^old(z_mk) to the new value γ^new(z_mk).

In the M step, the effective counts and the means are then updated as shown below.

Both the E step and the M step take a fixed time, independent of the total number of data points. Because the parameters are revised after each data point, rather than waiting until the whole data set has been processed, this incremental version can converge faster than the batch version.

From (9.17) and (9.18), \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n and N_k = \sum_{n=1}^{N} \gamma(z_{nk}), the incremental updates are

N_k^{new} = N_k^{old} + \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})

\mu_k^{new} = \frac{1}{N_k^{new}} \left[ N_k^{old} \mu_k^{old} + \left( \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk}) \right) x_m \right]
= \mu_k^{old} + \left( \frac{\gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})}{N_k^{new}} \right) \left( x_m - \mu_k^{old} \right)
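A tiny sketch of this incremental M-step update; gamma_old and gamma_new denote the responsibilities of the single updated point x_m for component k, and x_m, mu_k may be NumPy arrays or scalars (names are illustrative):

```python
def incremental_update(mu_k, N_k, x_m, gamma_old, gamma_new):
    """Incremental EM update of one component's effective count and mean
    after the responsibility of a single data point x_m changes."""
    N_k_new = N_k + gamma_new - gamma_old
    mu_k_new = mu_k + ((gamma_new - gamma_old) / N_k_new) * (x_m - mu_k)
    return mu_k_new, N_k_new
```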