TRANSCRIPT
CSCE 633: Machine Learning
Lecture 24: Unsupervised Learning: EM
Texas A&M University
10-18-19
Last Time
• PCA
• Clustering, K-means
Goals of this lecture
• Expectation Maximization
Clustering
Finding patterns/structure/sub-populations in data
K-means Clustering
Representation
• Input: data $D = \{x_1, \ldots, x_N\}$
• Output: cluster centroids $\mu_1, \ldots, \mu_K$
• Decision: cluster membership, i.e., the cluster id assigned to sample $x_n$: $A(x_n) \in \{1, \ldots, K\}$
• Evaluation metric: the distortion measure
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|_2^2, \quad \text{where } r_{nk} = 1 \text{ if } A(x_n) = k \text{, and } 0 \text{ otherwise}$$
• Intuition: data points assigned to cluster $k$ should be close to the centroid $\mu_k$
K-means Clustering
Evaluation metric:
$$\min_{r_{nk}} J = \min_{r_{nk}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|_2^2$$
Optimization:
• Step 0: Initialize $\mu_k$ to some values
• Step 1: Assuming the current values of $\mu_k$ are fixed, minimize $J$ over $r_{nk}$, which leads to the following cluster assignment rule:
$$r_{nk} = \begin{cases} 1, & \text{if } k = \arg\min_j \|x_n - \mu_j\|_2^2 \\ 0, & \text{otherwise} \end{cases}$$
• Step 2: Assuming the current values of $r_{nk}$ are fixed, minimize $J$ over $\mu_k$, which leads to the following rule for updating the cluster prototypes:
$$\mu_k = \frac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}$$
• Step 3: Determine whether to stop or return to Step 1 (a sketch of the full loop follows below)
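As a concrete illustration, here is a minimal NumPy sketch of this loop; the function name and defaults are illustrative, not part of the lecture:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: X has shape (N, d); returns centroids (K, d) and labels (N,)."""
    rng = np.random.default_rng(seed)
    # Step 0: initialize the centroids as K distinct random data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centroid (sets r_nk)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                           else mu[k] for k in range(K)])
        # Step 3: stop once the centroids no longer move
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, labels
```

Since each iteration can only decrease the distortion $J$, the loop is guaranteed to stop, as the remarks that follow explain.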
K-means Clustering
Remarks
• The centroid $\mu_k$ is the mean of the data points assigned to cluster $k$, hence the name K-means clustering.
• The procedure terminates after a finite number of steps, since it reduces $J$ in both Step 1 and Step 2.
• There is no guarantee that the procedure terminates at the global optimum of $J$. In most cases, the algorithm stops at a local optimum, which depends on the initial values in Step 0 → use random restarts to improve the chances of getting closer to the global optimum.
K-means Clustering
Example
K-means Clustering
Application: vector quantization
• We can replace each data point with the centroid $\mu_k$ of the cluster it is assigned to → vector quantization
• We have thereby compressed the data points into
  • a codebook of all the centroids $\{\mu_1, \ldots, \mu_K\}$
  • a list of indices into the codebook, one per data point (created based on $r_{nk}$)
• This compression is lossy: information is lost, especially if we use a very small $K$ (a sketch follows below)
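A minimal sketch of the corresponding encode/decode pair (function names are illustrative):

```python
import numpy as np

def vq_encode(X, mu):
    """Map each row of X to the index of its nearest centroid in the codebook mu."""
    dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    return dists.argmin(axis=1)                                   # one index per point

def vq_decode(indices, mu):
    """Lossy reconstruction: replace each index by its centroid."""
    return mu[indices]
```

Storing one small integer per point plus the $K$ centroids, rather than a full $d$-dimensional vector per point, is where the compression comes from.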
Probabilistic interpretation of clustering
• We want to find a density $p(x)$ that best describes our data
• The data points seem to form 3 clusters
• However, we cannot model $p(x)$ with a single simple, known distribution, e.g. one Gaussian
Probabilistic interpretation of clustering
• Instead, we will model each region with its own Gaussian distribution → Gaussian mixture models (GMMs)
• Question 1: How do we know which (color) region a data point comes from?
• Question 2: What are the parameters of the Gaussian distribution in each region?
• We will answer both in an unsupervised way from the data $D = \{x_1, \ldots, x_N\}$
Gaussian mixture models: formal definition
A Gaussian mixture model has the following density function for $x$:
$$p(x) = \sum_{k=1}^{K} \omega_k\, \mathcal{N}(x; \mu_k, \Sigma_k)$$
• $K$: number of Gaussian components
• $\mu_k, \Sigma_k$: mean and covariance of the $k$-th component
• $\omega_k$: component weights, i.e., how much component $k$ contributes to the final distribution, with
$$\omega_k > 0 \;\;\forall k \quad \text{and} \quad \sum_{k=1}^{K} \omega_k = 1$$
The weights can be interpreted as a prior distribution, $\omega_k = p(z = k)$, which decides which mixture component to use (a sketch follows below)
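The density can be evaluated directly from this definition; a minimal sketch using SciPy (the function name and example parameters are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Evaluate p(x) = sum_k w_k * N(x; mu_k, Sigma_k) at a point x."""
    return sum(w * multivariate_normal(mean=m, cov=c).pdf(x)
               for w, m, c in zip(weights, means, covs))

# Example: a two-component mixture in 2D with hypothetical parameters
w = [0.3, 0.7]
mu = [np.zeros(2), 3.0 * np.ones(2)]
Sigma = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([0.5, 0.5]), w, mu, Sigma))
```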
GMM as the marginal distribution of a joint distribution
• Consider the following joint distribution: $p(x, z) = p(z)\, p(x|z)$
• $z$ is a discrete random variable taking values between 1 and $K$; it "selects" a Gaussian component
• We denote $\omega_k = p(z = k)$
• Assume Gaussian conditional distributions: $p(x|z = k) = \mathcal{N}(x; \mu_k, \Sigma_k)$
• Then the marginal distribution of $x$ is
$$p(x) = \sum_{k=1}^{K} \omega_k\, \mathcal{N}(x; \mu_k, \Sigma_k)$$
(which is exactly the Gaussian mixture model)
GMM as the marginal distribution of a joint distribution
• The conditional distributions of $x$ given $z$ (representing color) are
$$p(x|z = \text{red}) = \mathcal{N}(x; \mu_1, \Sigma_1), \quad p(x|z = \text{blue}) = \mathcal{N}(x; \mu_2, \Sigma_2), \quad p(x|z = \text{green}) = \mathcal{N}(x; \mu_3, \Sigma_3)$$
• The marginal distribution is thus
$$p(x) = p(\text{red})\,\mathcal{N}(x; \mu_1, \Sigma_1) + p(\text{blue})\,\mathcal{N}(x; \mu_2, \Sigma_2) + p(\text{green})\,\mathcal{N}(x; \mu_3, \Sigma_3)$$
Parameter estimation for GMMs: the easy case with complete data
We know the component to which each sample belongs
• Data $D = \{(x_1, z_1), \ldots, (x_N, z_N)\}$
• We want to find $\theta = \{\mu_k, \Sigma_k, \omega_k\}$
• Maximum log-likelihood: $\theta^* = \arg\max_\theta \log p(D|\theta)$
Solution:
$$\omega_k = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}}, \quad \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n, \quad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T$$
where $\gamma_{nk} = 1$ if $z_n = k$, and $0$ otherwise
Parameter estimation for GMMs: the easy case with complete data
Understanding the intuition:
$$\omega_k = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}}, \quad \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n, \quad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T$$
• For $\omega_k$: count the number of data points whose $z_n$ is $k$ and divide by the total number of data points
• For $\mu_k$: take all the data points whose $z_n$ is $k$ and compute their mean
• For $\Sigma_k$: take all the data points whose $z_n$ is $k$ and compute their covariance
(a sketch follows below)
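A minimal NumPy sketch of these closed-form estimates for the complete-data case (illustrative names; labels are 0-indexed here, whereas the slides use 1 to K):

```python
import numpy as np

def fit_gmm_complete(X, z, K):
    """ML estimates of (weights, means, covariances) when the labels z are observed."""
    N, d = X.shape
    gamma = np.eye(K)[z]                 # one-hot (N, K): gamma[n, k] = 1 iff z[n] == k
    Nk = gamma.sum(axis=0)               # number of points per component
    weights = Nk / N
    means = (gamma.T @ X) / Nk[:, None]  # per-component means
    covs = np.empty((K, d, d))
    for k in range(K):
        Xc = X - means[k]                # per-component covariance
        covs[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]
    return weights, means, covs
```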
Parameter estimation for GMMs: incomplete data
Trick: estimation with soft $\gamma_{nk}$
$$\gamma_{nk} = p(z_n = k|x_n)$$
$$\omega_k = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}}, \quad \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n, \quad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T$$
Every data point $x_n$ is now assigned to each component fractionally according to $p(z_n = k|x_n)$, also called the responsibility
Parameter estimation for GMMs: incomplete data
Trick: estimation with soft $\gamma_{nk}$
Since we do not know $\theta$ to begin with, we cannot compute the soft $\gamma_{nk}$. But we can invoke an iterative procedure that alternates between estimating $\gamma_{nk}$ and using the estimated $\gamma_{nk}$ to compute $\theta = \{\mu_k, \Sigma_k, \omega_k\}$:
• Step 0: guess $\theta$ with initial values
• Step 1: compute $\gamma_{nk}$ using the current $\theta$
• Step 2: update $\theta$ using the computed $\gamma_{nk}$
• Step 3: go back to Step 1
Questions: (i) is this procedure correct, i.e., does it optimize a sensible criterion? (ii) practically, will this procedure ever stop instead of iterating forever? The answers lie in the Expectation Maximization (EM) algorithm, a powerful procedure for model estimation with missing data.
Expectation Maximization: motivation and setup
• EM is used to estimate parameters for probabilistic models with hidden/latent variables
$$p(x|\theta) = \sum_z p(x, z|\theta)$$
where $x$ is the observed random variable and $z$ is the hidden one
• We are given data $D = \{x_1, \ldots, x_N\}$, where the corresponding hidden variable values $z$ are not included
• Our goal is to obtain the maximum likelihood estimate of $\theta$:
$$\theta^* = \arg\max_\theta \sum_{n=1}^{N} \log p(x_n|\theta) = \arg\max_\theta \sum_{n=1}^{N} \log \sum_{z_n} p(x_n, z_n|\theta)$$
The above objective function is called the incomplete log-likelihood, and it is computationally intractable (since it needs to sum over all possible values of $z$ and then take the logarithm).
Expectation Maximization: Complete log-likelihood
• Incomplete log-likelihood:
$$\ell(\theta) = \sum_{n=1}^{N} \log \sum_{z_n} p(x_n, z_n|\theta)$$
• EM uses a clever trick to change this into sum-log form by defining
$$Q_q(\theta) = \sum_{n=1}^{N} \mathbb{E}_{z_n \sim q(z_n)} \log p(x_n, z_n|\theta) = \sum_{n=1}^{N} \sum_{z_n} q(z_n) \log p(x_n, z_n|\theta)$$
where $q(z)$ is a distribution over $z$.
The above is called the expected (complete) log-likelihood; since it takes the sum-log form, it is computationally tractable.
Expectation Maximization: Choice of q(z)
• We will choose a special $q(z) = p(z|x, \theta)$, i.e., the posterior probability of $z$:
$$Q(\theta) = Q_{z \sim p(z|x,\theta)}(\theta)$$
• We will show that
$$\ell(\theta) = Q(\theta) + \sum_{n=1}^{N} H[p(z|x_n, \theta)]$$
where $H$ is the entropy of the probability distribution $p(z|x, \theta)$ (a small computational sketch follows below):
$$H[p(z|x, \theta)] = -\sum_{z} p(z|x, \theta) \log p(z|x, \theta)$$
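As a small aside, the entropy of a discrete distribution is straightforward to compute; a minimal sketch (illustrative helper, not from the lecture):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H[p] = -sum_i p_i log p_i of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                   # convention: 0 * log 0 = 0
    return -(p * np.log(p)).sum()

print(entropy([0.5, 0.5]))         # log 2 ~ 0.693, the maximum for two outcomes
print(entropy([1.0, 0.0]))         # 0.0: a deterministic posterior has no entropy
```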
Expectation Maximization: Choice of q(z)
Proof
$$Q_q(\theta) = \sum_n \sum_{z_n} q(z_n) \log p(x_n, z_n|\theta)$$
$$\begin{aligned}
Q(\theta) = Q_{z \sim p(z|x,\theta)}(\theta) &= \sum_n \sum_{z_n} p(z_n|x_n, \theta) \log p(x_n, z_n|\theta) \\
&= \sum_n \sum_{z_n} p(z_n|x_n, \theta)\big[\log p(x_n|\theta) + \log p(z_n|x_n, \theta)\big] \\
&= \sum_n \sum_{z_n} p(z_n|x_n, \theta) \log p(x_n|\theta) + \sum_n \sum_{z_n} p(z_n|x_n, \theta) \log p(z_n|x_n, \theta) \\
&= \sum_n \log p(x_n|\theta) \underbrace{\sum_{z_n} p(z_n|x_n, \theta)}_{=\,1} - \sum_n H[p(z|x_n, \theta)] \\
&= \ell(\theta) - \sum_n H[p(z|x_n, \theta)]
\end{aligned}$$
Expectation Maximization
• As before, $Q(\theta)$ cannot be computed, since $p(z|x, \theta)$ depends on the unknown parameter values $\theta$
• Instead, we will use a known value $\theta^{old}$ to compute the expected likelihood
$$Q(\theta, \theta^{old}) = \sum_n \sum_{z_n} p(z_n|x_n, \theta^{old}) \log p(x_n, z_n|\theta)$$
• In the above, the variable is $\theta$, while $\theta^{old}$ is known
• By definition, $Q(\theta) = Q(\theta, \theta)$
• But how does $Q(\theta, \theta^{old})$ relate to $\ell(\theta)$?
• We will show that
$$\ell(\theta) \geq Q(\theta, \theta^{old}) + \sum_n H[p(z|x_n, \theta^{old})]$$
• Thus, $Q(\theta)$ is better than $Q(\theta, \theta^{old})$ (it gives equality rather than a lower bound), except that we cannot compute the former
Expectation Maximization
Proof
$$\begin{aligned}
\ell(\theta) = \sum_n \log p(x_n|\theta) &= \sum_n \log \sum_{z_n} p(x_n, z_n|\theta) \\
&= \sum_n \log \sum_{z_n} p(z_n|x_n, \theta^{old})\, \frac{p(x_n, z_n|\theta)}{p(z_n|x_n, \theta^{old})} \\
&\geq \sum_n \sum_{z_n} p(z_n|x_n, \theta^{old}) \log p(x_n, z_n|\theta) - \sum_n \sum_{z_n} p(z_n|x_n, \theta^{old}) \log p(z_n|x_n, \theta^{old}) \\
&= \underbrace{Q(\theta, \theta^{old}) + \sum_n H[p(z|x_n, \theta^{old})]}_{A(\theta, \theta^{old})\,:\ \text{auxiliary function}}
\end{aligned}$$
In the above we have used Jensen's inequality: $\log \sum_i w_i x_i \geq \sum_i w_i \log x_i$ for all $w_i \geq 0$ such that $\sum_i w_i = 1$ (a quick numeric check follows below).
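A quick numeric sanity check of this inequality (the numbers are arbitrary, chosen only for illustration):

```python
import numpy as np

w = np.array([0.2, 0.5, 0.3])   # weights: nonnegative and summing to 1
x = np.array([1.0, 4.0, 9.0])   # arbitrary positive values

lhs = np.log(np.sum(w * x))     # log of the weighted sum ~ 1.589
rhs = np.sum(w * np.log(x))     # weighted sum of logs ~ 1.352
print(lhs >= rhs)               # True: log is concave, so Jensen's inequality holds
```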
Expectation Maximization
• Suppose we have an initial guess $\theta^{old}$ and we maximize the auxiliary function:
$$\theta^{new} = \arg\max_\theta A(\theta, \theta^{old})$$
• With the new guess, we have
$$\ell(\theta^{new}) \geq A(\theta^{new}, \theta^{old}) \geq A(\theta^{old}, \theta^{old}) = \ell(\theta^{old})$$
• By maximizing the auxiliary function, we keep increasing the likelihood (this is the core of EM)
Expectation Maximization
Algorithm outline
• Step 0: Initialize $\theta$ with $\theta^{(0)}$
• Step 1 (E-step): Compute the auxiliary function $A(\theta, \theta^{(t)})$ using the current value of $\theta$
• Step 2 (M-step): Maximize the auxiliary function:
$$\theta^{(t+1)} \leftarrow \arg\max_\theta A(\theta, \theta^{(t)})$$
• Step 3: Increase $t$ to $t + 1$ and go back to Step 1, or stop if $\ell(\theta^{(t+1)})$ does not improve much (a generic sketch of this loop follows below)
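A generic sketch of this loop in Python; the callbacks e_step, m_step, and log_lik are supplied by the specific model, and all names here are illustrative:

```python
def em(X, theta0, e_step, m_step, log_lik, tol=1e-6, max_iters=200):
    """Generic EM: alternate E- and M-steps until the log-likelihood stalls."""
    theta, ll_old = theta0, -float("inf")
    for _ in range(max_iters):
        stats = e_step(X, theta)   # E-step: posterior over the hidden variables
        theta = m_step(X, stats)   # M-step: maximize Q(theta, theta^(t))
        ll = log_lik(X, theta)     # monitor the incomplete log-likelihood
        if ll - ll_old < tol:      # Step 3: stop when the improvement is small
            break
        ll_old = ll
    return theta
```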
Expectation Maximization
Remarks
• EM converges, but only to a local optimum: the global optimum is not guaranteed
• The E-step depends on computing the posterior probability $p(z_n|x_n, \theta^{(t)})$
• The M-step does not depend on the entropy term, so we only need to do the following:
$$\theta^{(t+1)} \leftarrow \arg\max_\theta A(\theta, \theta^{(t)}) = \arg\max_\theta Q(\theta, \theta^{(t)})$$
We often call the last term the Q-function
Expectation Maximization for GMMs
• The hidden variable follows a multinomial distribution: $z_n \sim \text{Multinomial}(\omega_1, \ldots, \omega_K)$
• The likelihood of observation $x_n$ equals the probability of the corresponding component: $p(x_n|z_n, \theta) = \prod_{k=1}^{K} \mathcal{N}(x_n|\mu_k, \Sigma_k)^{\gamma_{nk}}$, where $\gamma_{nk}$ is the one-hot indicator of $z_n = k$
• Complete data log-likelihood:
$$\begin{aligned}
Q(\theta) = \mathbb{E}\left[\sum_n \log p(x_n, z_n|\theta)\right] &= \mathbb{E}\left[\sum_n \log\big(p(z_n|\theta)\, p(x_n|z_n, \theta)\big)\right] \\
&= \mathbb{E}\left[\sum_n \log\Big(\prod_k \omega_k^{\gamma_{nk}} \prod_k \mathcal{N}(x_n|\mu_k, \Sigma_k)^{\gamma_{nk}}\Big)\right] \\
&= \mathbb{E}\left[\sum_n \sum_k \gamma_{nk}\big(\log \omega_k + \log \mathcal{N}(x_n|\mu_k, \Sigma_k)\big)\right] \\
&= \sum_n \sum_k \mathbb{E}[\gamma_{nk}|x_n, \mu_k, \Sigma_k]\,\big(\log \omega_k + \log \mathcal{N}(x_n|\mu_k, \Sigma_k)\big)
\end{aligned}$$
Expectation Maximization for GMMs
What is the E-step in GMM?
$$\mathbb{E}[\gamma_{nk}|x_n, \theta] = p(\gamma_{nk} = 1|x_n, \theta) = \frac{p(x_n|\gamma_{nk} = 1, \theta)\, p(\gamma_{nk} = 1|\theta)}{\sum_{k'} p(x_n|\gamma_{nk'} = 1, \theta)\, p(\gamma_{nk'} = 1|\theta)}$$
That is, we compute the responsibility using the current parameters $\theta^{(t)}$:
$$\gamma_{nk} = p(z = k|x_n, \theta^{(t)}) = \frac{\omega_k^{(t)}\, \mathcal{N}(x_n|\mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{k'} \omega_{k'}^{(t)}\, \mathcal{N}(x_n|\mu_{k'}^{(t)}, \Sigma_{k'}^{(t)})}$$
(a sketch follows below)
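A minimal NumPy/SciPy sketch of this E-step (the function name is illustrative; theta packs the parameters so it plugs into the generic em loop sketched earlier):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, theta):
    """Responsibilities gamma[n, k] = w_k N(x_n; mu_k, Sigma_k) / sum over components."""
    weights, means, covs = theta
    # unnormalized posterior: prior weight times Gaussian likelihood, per component
    gamma = np.column_stack([weights[k] * multivariate_normal(means[k], covs[k]).pdf(X)
                             for k in range(len(weights))])
    return gamma / gamma.sum(axis=1, keepdims=True)   # normalize over components
```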
Expectation Maximization for GMMs
What is the M-step in GMM?
Maximize the auxiliary function:
$$\begin{aligned}
Q(\theta, \theta^{(t)}) &= \sum_n \sum_k p(z = k|x_n, \theta^{(t)}) \log p(x_n, z = k|\theta) \\
&= \sum_n \sum_k \gamma_{nk} \log p(x_n, z = k|\theta) \\
&= \sum_n \sum_k \gamma_{nk} \log\big(p(z = k)\, p(x_n|z = k)\big) \\
&= \sum_n \sum_k \gamma_{nk}\big[\log \omega_k + \log \mathcal{N}(x_n|\mu_k, \Sigma_k)\big]
\end{aligned}$$
Setting the derivatives with respect to each parameter to zero yields the closed-form updates (sketched below):
$$\omega_k = \frac{\sum_n \gamma_{nk}}{N}, \quad \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n, \quad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T$$
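And a matching sketch of the M-step (illustrative names; together with e_step above and the generic em loop sketched earlier, this completes EM for GMMs):

```python
import numpy as np

def m_step(X, gamma):
    """Closed-form updates from responsibilities gamma of shape (N, K)."""
    N, d = X.shape
    Nk = gamma.sum(axis=0)                      # effective count per component
    weights = Nk / N                            # w_k = (sum_n gamma_nk) / N
    means = (gamma.T @ X) / Nk[:, None]         # responsibility-weighted means
    covs = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        Xc = X - means[k]
        covs[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]   # weighted covariances
    return weights, means, covs                 # this tuple is theta for e_step
```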
Expectation Maximization for GMMs
Example of EM implementation
GMMs and K-means
GMMs provide a probabilistic interpretation for K-means
• Assume all Gaussian components have $\sigma^2 I$ as their covariance matrices
• Further assume $\sigma \to 0$
• Then we only need to estimate $\mu_k$, i.e., the means
• Under these assumptions, EM for GMM parameter estimation simplifies to K-means
For this reason, K-means is often called hard GMM, or GMMs are called soft K-means. The soft posterior $\gamma_{nk}$ provides a probabilistic assignment of $x_n$ to the cluster $k$ represented by the corresponding Gaussian distribution.
Takeaways and Next Time
• Expectation Maximization and GMM
• Next Time: More Unsupervised Learning