CSCE 633: Machine Learning

Lecture 24: Unsupervised Learning: EM

Texas A&M University

10-18-19

Last Time

• PCA
• Clustering, K-Means

Goals of this lecture

• Expectation Maximization

Clustering

Finding patterns/structure/sub-populations in data

K-means Clustering

Representation

• Input: Data D = {x1, . . . , xN}
• Output: Clusters µ1, . . . , µK
• Decision: Cluster membership, the cluster id assigned to sample xn, i.e. A(xn) ∈ {1, . . . , K}
• Evaluation metric: Distortion measure

  J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|_2^2, \quad \text{where } r_{nk} = 1 \text{ if } A(x_n) = k, \text{ 0 otherwise}

  (a small helper for evaluating J is given after this list)
• Intuition: Data points assigned to cluster k should be close to centroid µk
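As a concrete reading of the distortion measure, here is a short NumPy helper (a sketch, not part of the slides) that evaluates J for given data, centroids, and a one-hot assignment matrix:

```python
import numpy as np

def distortion(X, mu, r):
    """Distortion J = sum_n sum_k r_nk * ||x_n - mu_k||_2^2.

    X  : (N, D) data matrix
    mu : (K, D) centroids
    r  : (N, K) one-hot assignment matrix, r[n, k] = 1 if A(x_n) = k
    """
    # Squared Euclidean distance from every point to every centroid: shape (N, K)
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    return float((r * sq_dists).sum())
```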

K-means Clustering

Evaluation metric:

  \min_{r_{nk}} J = \min_{r_{nk}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|_2^2

Optimization:
• Step 0: Initialize µk to some values
• Step 1: Assume the current value of µk fixed, minimize J over rnk, which leads to the following cluster assignment rule

  r_{nk} = \begin{cases} 1, & \text{if } k = \arg\min_j \| x_n - \mu_j \|_2^2 \\ 0, & \text{otherwise} \end{cases}

• Step 2: Assume the current value of rnk fixed, minimize J over µk, which leads to the following rule to update the prototypes of the clusters

  \mu_k = \frac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}

• Step 3: Determine whether to stop or return to Step 1 (a minimal implementation sketch is given below)
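A minimal NumPy sketch of this alternating procedure, assuming random-point initialization and a fixed iteration cap (the function name `kmeans` and its defaults are illustrative, not from the lecture):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate assignment (Step 1) and centroid update (Step 2)."""
    rng = np.random.default_rng(seed)
    # Step 0: initialize centroids at K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centroid
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        assign = sq_dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        # Step 3: stop if the centroids no longer move
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, assign
```

The distortion helper above can be used to compare runs started from different random seeds, which is one way to carry out the random restarts mentioned in the remarks that follow.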

K-means Clustering

Remarks
• The centroid µk is the mean of the data points assigned to cluster k, hence the name K-means clustering.
• The procedure terminates after a finite number of steps, as it reduces J in both Step 1 and Step 2.
• There is no guarantee the procedure terminates at the global optimum of J. In most cases, the algorithm stops at a local optimum, which depends on the initial values in Step 0 → use random restarts to improve the chances of getting closer to the global optimum.

K-means Clustering

Example

K-means Clustering

Application: vector quantization
• We can replace our data points with the centroids µk of the clusters they are assigned to → vector quantization
• We have compressed the data points into
  • a codebook of all the centroids {µ1, . . . , µK}
  • a list of indices into the codebook, one per data point (created based on rnk)
• This compression is obviously lossy, as certain information will be lost if we use a very small K (a compression/decompression sketch is given below)

Probabilistic interpretation of clustering

• We want to find p(x) that best describes our data
• The data points seem to form 3 clusters
• However, we cannot model p(x) with simple and known distributions, e.g. one Gaussian

Probabilistic interpretation of clustering

• Instead, we will model each region with a Gaussian distribution → Gaussian mixture models (GMMs)
• Question 1: How do we know which (color) region a data point comes from?
• Question 2: What are the parameters of the Gaussian distribution in each region?
• We will answer both in an unsupervised way from data D = {x1, . . . , xN}

Gaussian mixture models: formal definition

A Gaussian mixture model has the following density function for x

  p(x) = \sum_{k=1}^{K} \omega_k\, \mathcal{N}(x; \mu_k, \Sigma_k)

• K: number of Gaussian components
• µk, Σk: mean and covariance of the k-th component
• ωk: component weights, i.e. how much component k contributes to the final distribution

  \omega_k > 0 \;\; \forall k \quad \text{and} \quad \sum_{k=1}^{K} \omega_k = 1

ωk can be represented by a prior distribution, ωk = p(z = k), which decides which mixture component to use
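A small sketch of evaluating this density with SciPy's `multivariate_normal` (the weights, means, and covariances below are illustrative placeholders, not values from the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """p(x) = sum_k w_k * N(x; mu_k, Sigma_k) for a single point x."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(weights, means, covs))

# Example: a 1-D mixture of two components (illustrative numbers only)
weights = [0.3, 0.7]
means   = [np.array([0.0]), np.array([4.0])]
covs    = [np.array([[1.0]]), np.array([[2.0]])]
print(gmm_density(np.array([1.0]), weights, means, covs))
```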

GMM as the marginal distribution of a joint distribution

• Consider the following joint distribution: p(x, z) = p(z) p(x|z)
• z is a discrete random variable taking values between 1 and K; it “selects” a Gaussian component
• We denote ωk = p(z = k)
• Assume Gaussian conditional distributions: p(x|z = k) = N(x; µk, Σk)
• Then the marginal distribution of x is

  p(x) = \sum_{k=1}^{K} \omega_k\, \mathcal{N}(x; \mu_k, \Sigma_k)

  (which is the Gaussian mixture model)
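This joint-distribution view suggests ancestral sampling: draw z from p(z), then draw x from the selected Gaussian. A hedged NumPy sketch (the function name and defaults are assumptions):

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Ancestral sampling from p(x, z) = p(z) p(x|z): draw z, then x given z."""
    rng = np.random.default_rng(seed)
    K = len(weights)
    z = rng.choice(K, size=n, p=weights)   # z ~ p(z), with p(z = k) = w_k
    x = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return x, z
```

Discarding the sampled z values and keeping only x gives draws from the marginal p(x), i.e. from the mixture itself.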

GMM as the marginal distribution of a joint distribution

• The conditional distributions of x given z (representing color) are

  p(x|z = \text{red}) = \mathcal{N}(x; \mu_1, \Sigma_1)
  p(x|z = \text{blue}) = \mathcal{N}(x; \mu_2, \Sigma_2)
  p(x|z = \text{green}) = \mathcal{N}(x; \mu_3, \Sigma_3)

• The marginal distribution is thus

  p(x) = p(\text{red})\, \mathcal{N}(x; \mu_1, \Sigma_1) + p(\text{blue})\, \mathcal{N}(x; \mu_2, \Sigma_2) + p(\text{green})\, \mathcal{N}(x; \mu_3, \Sigma_3)

Parameter estimation for GMMs: the easy case with complete data

We know the component to which each sample belongs
• Data D = {(x1, z1), . . . , (xN, zN)}
• We want to find θ = {µk, Σk, ωk}
• Maximum log-likelihood: \theta^* = \arg\max_{\theta} \log p(D|\theta)

Solution:

  \omega_k = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}}

  \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n

  \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T

where γnk = 1 if zn = k and 0 otherwise (a short sketch of these estimates is given below)
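A compact NumPy sketch of these closed-form estimates when the labels zn are observed (the helper name `fit_gmm_complete` is illustrative; labels are assumed to be 0-indexed):

```python
import numpy as np

def fit_gmm_complete(X, z, K):
    """Closed-form MLE when every x_n comes with its component label z_n in {0, ..., K-1}."""
    N = len(X)
    gamma = np.zeros((N, K))
    gamma[np.arange(N), z] = 1.0                 # gamma_nk = 1 iff z_n = k
    Nk = gamma.sum(axis=0)                       # number of points per component
    weights = Nk / N                             # omega_k
    means = (gamma.T @ X) / Nk[:, None]          # mu_k: mean of the points with z_n = k
    covs = []
    for k in range(K):
        d = X - means[k]
        covs.append((gamma[:, k, None] * d).T @ d / Nk[k])   # Sigma_k
    return weights, means, np.stack(covs)
```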

Parameter estimation for GMMs: the easy case with complete data

Understanding the intuition

  \omega_k = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}}

  \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n

  \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T

• For ωk: count the number of data points whose zn is k and divide by the total number of data points
• For µk: get all the data points whose zn is k and compute their mean
• For Σk: get all the data points whose zn is k and compute their covariance

Parameter estimation for GMMs: incomplete data

Trick: estimation with soft γnk

  \gamma_{nk} = p(z_n = k \mid x_n)

  \omega_k = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}}

  \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n

  \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T

Every data point xn is assigned to a component fractionally according to p(zn = k|xn), which is also called the responsibility

Parameter estimation for GMMs: incomplete data

Trick: estimation with soft γnk
Since we do not know θ to begin with, we cannot compute the soft γnk. But we can invoke an iterative procedure that alternates between estimating γnk and using the estimated γnk to compute θ = {µk, Σk, ωk}:
• Step 0: guess θ with initial values
• Step 1: compute γnk using the current θ
• Step 2: update θ using the computed γnk
• Step 3: go back to Step 1

Questions: i) is this procedure correct, e.g. is it optimizing a sensible criterion? ii) practically, will this procedure ever stop instead of iterating forever? The answers lie in the Expectation Maximization (EM) algorithm, a powerful procedure for model estimation with hidden/missing data.

Expectation Maximization: motivation and setup

• EM is used to estimate parameters of probabilistic models with hidden/latent variables

  p(x|\theta) = \sum_{z} p(x, z|\theta)

  where x is the observed random variable and z is the hidden variable
• We are given data D = {x1, . . . , xN}, where the corresponding hidden variable values z are not included
• Our goal is to obtain the maximum likelihood estimate of θ

  \theta^* = \arg\max_{\theta} \sum_{n=1}^{N} \log p(x_n|\theta) = \arg\max_{\theta} \sum_{n=1}^{N} \log \sum_{z_n} p(x_n, z_n|\theta)

The above objective function is called the incomplete log-likelihood and it is computationally intractable (since it needs to sum over all possible values of z and then take the logarithm).

Expectation Maximization: Complete log-likelihood

• Incomplete log-likelihood

  \ell(\theta) = \sum_{n=1}^{N} \log \sum_{z_n} p(x_n, z_n|\theta)

• EM uses a clever trick to change this into sum-log form by defining

  Q_q(\theta) = \sum_{n=1}^{N} \mathbb{E}_{z_n \sim q(z_n)}\!\left[ \log p(x_n, z_n|\theta) \right] = \sum_{n=1}^{N} \sum_{z_n} q(z_n) \log p(x_n, z_n|\theta)

  where q(z) is a distribution over z

The above is called the expected (complete) log-likelihood; it takes the sum-log form (the logarithm is inside the sum), which is computationally tractable.

Expectation Maximization: Choice of q(z)

• We will choose a special q(z) = p(z|x, θ), i.e. the posterior probability of z

  Q(\theta) = Q_{z \sim p(z|x,\theta)}(\theta)

• We will show that

  \ell(\theta) = Q(\theta) + \sum_{n=1}^{N} H\!\left[ p(z|x_n, \theta) \right]

  where H is the entropy of the probability distribution p(z|x, θ)

  H\!\left[ p(z|x_n, \theta) \right] = - \sum_{z_n} p(z_n|x_n, \theta) \log p(z_n|x_n, \theta)

Expectation Maximization: Choice of q(z)

Proof

  Q_q(\theta) = \sum_n \sum_{z_n} q(z_n) \log p(x_n, z_n|\theta)

  Q(\theta) = Q_{z \sim p(z|x,\theta)}(\theta) = \sum_n \sum_{z_n} p(z_n|x_n, \theta) \log p(x_n, z_n|\theta)
            = \sum_n \sum_{z_n} p(z_n|x_n, \theta) \left[ \log p(x_n|\theta) + \log p(z_n|x_n, \theta) \right]
            = \sum_n \sum_{z_n} p(z_n|x_n, \theta) \log p(x_n|\theta) + \sum_n \sum_{z_n} p(z_n|x_n, \theta) \log p(z_n|x_n, \theta)
            = \sum_n \log p(x_n|\theta) \sum_{z_n} p(z_n|x_n, \theta) - \sum_n H\!\left[ p(z|x_n, \theta) \right]
            = \ell(\theta) - \sum_n H\!\left[ p(z|x_n, \theta) \right]

Expectation Maximization

• As before, Q(θ) cannot be computed, as p(z|x, θ) depends on the unknown parameter values θ
• Instead, we will use a known value θ^old to compute the expected likelihood

  Q(\theta, \theta^{old}) = \sum_n \sum_{z_n} p(z_n|x_n, \theta^{old}) \log p(x_n, z_n|\theta)

• In the above, the variable is θ, while θ^old is known
• By definition, Q(θ) = Q(θ, θ), i.e. the two agree when θ^old = θ
• But how does Q(θ, θ^old) relate to ℓ(θ)?
• We will show that

  \ell(\theta) \ge Q(\theta, \theta^{old}) + \sum_n H\!\left[ p(z|x_n, \theta^{old}) \right]

• Thus, Q(θ) is better than Q(θ, θ^old), except that we cannot compute the former

Expectation Maximization

Proof

  \ell(\theta) = \sum_n \log p(x_n|\theta) = \sum_n \log \sum_{z_n} p(x_n, z_n|\theta)
              = \sum_n \log \sum_{z_n} p(z_n|x_n, \theta^{old}) \frac{p(x_n, z_n|\theta)}{p(z_n|x_n, \theta^{old})}
              \ge \sum_n \sum_{z_n} p(z_n|x_n, \theta^{old}) \log p(x_n, z_n|\theta) - \sum_n \sum_{z_n} p(z_n|x_n, \theta^{old}) \log p(z_n|x_n, \theta^{old})
              = \underbrace{Q(\theta, \theta^{old}) + \sum_n H\!\left[ p(z|x_n, \theta^{old}) \right]}_{A(\theta,\, \theta^{old})\;:\; \text{auxiliary function}}

In the above we have used Jensen's inequality for the (concave) logarithm: \log \sum_i w_i x_i \ge \sum_i w_i \log x_i for all w_i \ge 0 such that \sum_i w_i = 1.
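A quick numeric sanity check of this inequality (the weights and values below are arbitrary illustrative numbers):

```python
import numpy as np

# Check log(sum_i w_i x_i) >= sum_i w_i log(x_i) for non-negative weights summing to one
w = np.array([0.2, 0.5, 0.3])
x = np.array([1.0, 4.0, 9.0])
lhs = np.log(np.dot(w, x))          # log of the weighted sum
rhs = np.dot(w, np.log(x))          # weighted sum of the logs
print(lhs, rhs, lhs >= rhs)         # lhs should always be >= rhs
```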

Expectation Maximization

• Suppose we have an initial guess θ^old and we maximize the auxiliary function

  \theta^{new} = \arg\max_{\theta} A(\theta, \theta^{old})

• With the new guess, we have

  \ell(\theta^{new}) \ge A(\theta^{new}, \theta^{old}) \ge A(\theta^{old}, \theta^{old}) = \ell(\theta^{old})

• By maximizing the auxiliary function, we keep increasing the likelihood (this is the core of EM)

Expectation Maximization

Algorithm outline
• Step 0: Initialize θ with θ^(0)
• Step 1 (E-step): Compute the auxiliary function using the current value of θ

  A(\theta, \theta^{(t)})

• Step 2 (M-step): Maximize the auxiliary function

  \theta^{(t+1)} \leftarrow \arg\max_{\theta} A(\theta, \theta^{(t)})

• Step 3: Increase t to t + 1 and go back to Step 1, or stop if ℓ(θ^(t+1)) does not improve much (a generic sketch of this loop is given below)
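A generic sketch of this loop, assuming user-supplied `e_step`, `m_step`, and `log_lik` callables (all names here are hypothetical, not from the lecture):

```python
import numpy as np

def em(X, theta0, e_step, m_step, log_lik, max_iters=100, tol=1e-6):
    """Generic EM loop: alternate E-step and M-step until the log-likelihood stalls.

    e_step(X, theta)  -> posterior quantities (e.g. responsibilities)
    m_step(X, post)   -> new parameter estimate theta
    log_lik(X, theta) -> incomplete log-likelihood, used only for stopping
    """
    theta = theta0
    prev_ll = -np.inf
    for t in range(max_iters):
        post = e_step(X, theta)          # Step 1 (E-step)
        theta = m_step(X, post)          # Step 2 (M-step)
        ll = log_lik(X, theta)           # Step 3: check improvement
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return theta
```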

Expectation Maximization

Remarks
• EM converges, but only to a local optimum: the global optimum is not guaranteed
• The E-step depends on computing the posterior probability

  p(z_n|x_n, \theta^{(t)})

• The M-step does not depend on the entropy term, so we only need to do the following

  \theta^{(t+1)} \leftarrow \arg\max_{\theta} A(\theta, \theta^{(t)}) = \arg\max_{\theta} Q(\theta, \theta^{(t)})

  We often call the last term the Q-function

Expectation Maximization for GMMs

• The hidden variable follows a multinomial distribution: zn ∼ Multinomial(ω1, . . . , ωK)
• The likelihood of observation xn equals the density of the corresponding component:

  p(x_n|z_n, \theta) = \prod_{k=1}^{K} \mathcal{N}(x_n|\mu_k, \Sigma_k)^{\gamma_{nk}}

• Expected complete-data log-likelihood

  Q(\theta) = \mathbb{E}\!\left[ \sum_n \log p(x_n, z_n|\theta) \right]
            = \mathbb{E}\!\left[ \sum_n \log \big( p(z_n|\theta)\, p(x_n|z_n, \theta) \big) \right]
            = \mathbb{E}\!\left[ \sum_n \log \Big( \prod_k \omega_k^{\gamma_{nk}} \prod_k \mathcal{N}(x_n|\mu_k, \Sigma_k)^{\gamma_{nk}} \Big) \right]
            = \mathbb{E}\!\left[ \sum_n \sum_k \gamma_{nk} \big( \log \omega_k + \log \mathcal{N}(x_n|\mu_k, \Sigma_k) \big) \right]
            = \sum_n \sum_k \mathbb{E}[\gamma_{nk}|x_n, \mu_k, \Sigma_k] \big( \log \omega_k + \log \mathcal{N}(x_n|\mu_k, \Sigma_k) \big)

Expectation Maximization for GMMs

What is the E-step in a GMM?

  \mathbb{E}[\gamma_{nk}|x_n, \theta] = p(\gamma_{nk} = 1|x_n, \theta) = \frac{p(x_n|\gamma_{nk} = 1, \theta)\, p(\gamma_{nk} = 1|\theta)}{\sum_{j} p(x_n|\gamma_{nj} = 1, \theta)\, p(\gamma_{nj} = 1|\theta)}

We compute the probability (responsibility)

  \gamma_{nk} = p(z_n = k|x_n, \theta^{(t)}) = \frac{\omega_k^{(t)}\, \mathcal{N}(x_n; \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j} \omega_j^{(t)}\, \mathcal{N}(x_n; \mu_j^{(t)}, \Sigma_j^{(t)})}
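A sketch of this E-step for a GMM, computing the responsibility matrix with SciPy densities (it assumes current parameter arrays `weights`, `means`, and `covs`; names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Responsibilities gamma[n, k] = w_k N(x_n; mu_k, Sigma_k) / sum_j w_j N(x_n; mu_j, Sigma_j)."""
    N, K = len(X), len(weights)
    gamma = np.zeros((N, K))
    for k in range(K):
        gamma[:, k] = weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
    gamma /= gamma.sum(axis=1, keepdims=True)   # normalize over components
    return gamma
```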

Expectation Maximization for GMMs

What is the M-step in a GMM?
Maximize the auxiliary function

  Q(\theta, \theta^{(t)}) = \sum_n \sum_k p(z = k|x_n, \theta^{(t)}) \log p(x_n, z = k|\theta)
                          = \sum_n \sum_k \gamma_{nk} \log p(x_n, z = k|\theta)
                          = \sum_n \sum_k \gamma_{nk} \log \big( p(z = k)\, p(x_n|z = k) \big)
                          = \sum_n \sum_k \gamma_{nk} \left[ \log \omega_k + \log \mathcal{N}(x_n|\mu_k, \Sigma_k) \right]

  \Rightarrow \; \omega_k = \frac{\sum_n \gamma_{nk}}{N}

  \Rightarrow \; \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n

  \Rightarrow \; \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T
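A matching sketch of these M-step updates given the responsibility matrix from the E-step sketch above (shapes and names are assumptions); alternating `e_step` and `m_step` yields EM for a GMM:

```python
import numpy as np

def m_step(X, gamma):
    """Update (w_k, mu_k, Sigma_k) from responsibilities gamma of shape (N, K)."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                        # effective number of points per component
    weights = Nk / N                              # omega_k
    means = (gamma.T @ X) / Nk[:, None]           # mu_k: responsibility-weighted means
    covs = np.zeros((len(Nk), D, D))
    for k in range(len(Nk)):
        d = X - means[k]
        covs[k] = (gamma[:, k, None] * d).T @ d / Nk[k]   # Sigma_k: weighted covariance
    return weights, means, covs
```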

Expectation Maximization for GMMs

Example of EM implementation

GMMs and K-means

GMMs provide a probabilistic interpretation of K-means
• Assume all Gaussian components have σ²I as their covariance matrices
• Further assume σ → 0
• Thus, we only need to estimate the means µk
• Then, EM for GMM parameter estimation simplifies to K-means

For this reason, K-means is often called hard GMM, or GMMs are called soft K-means. The soft posterior γnk provides a probabilistic assignment of xn to cluster k, represented by the corresponding Gaussian distribution (a small numerical illustration of the σ → 0 limit is given below).
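A small numerical illustration of this limit: with equal weights and a shared spherical covariance σ²I, the responsibilities of a point collapse to a one-hot (K-means style) assignment as σ shrinks. Centroids and σ values below are illustrative only:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two equal-weight spherical components; the point x is closer to the first centroid.
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
x = np.array([1.0, 0.0])

for sigma in [2.0, 0.5, 0.1]:
    cov = (sigma ** 2) * np.eye(2)
    dens = np.array([multivariate_normal.pdf(x, mean=m, cov=cov) for m in means])
    gamma = dens / dens.sum()                    # responsibilities for this single point
    print(sigma, np.round(gamma, 4))             # gamma -> (1, 0): a hard K-means assignment
```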

Takeaways and Next Time

• Expectation Maximization and GMMs
• Next Time: More Unsupervised Learning
