TRANSCRIPT
CSCE 633: Machine Learning
Lecture 24: Unsupervised Learning: EM
Texas A&M University
10-18-19
Last Time
• PCA
• Clustering, K-means
Goals of this lecture
• Expectation Maximization
Clustering
Finding patterns/structure/sub-populations in data
K-means Clustering
Representation
• Input: data $D = \{x_1, \ldots, x_N\}$
• Output: cluster centroids $\mu_1, \ldots, \mu_K$
• Decision: cluster membership, i.e., the cluster id assigned to sample $x_n$: $A(x_n) \in \{1, \ldots, K\}$
• Evaluation metric: the distortion measure
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|_2^2, \quad \text{where } r_{nk} = 1 \text{ if } A(x_n) = k \text{, and } 0 \text{ otherwise}$$
• Intuition: data points assigned to cluster $k$ should be close to the centroid $\mu_k$
K-means Clustering
Evaluation metric:
$$\min_{r_{nk}} J = \min_{r_{nk}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|_2^2$$
Optimization:
• Step 0: Initialize $\mu_k$ to some values
• Step 1: Assuming the current values of $\mu_k$ are fixed, minimize $J$ over $r_{nk}$, which leads to the following cluster assignment rule:
$$r_{nk} = \begin{cases} 1, & \text{if } k = \arg\min_j \|x_n - \mu_j\|_2^2 \\ 0, & \text{otherwise} \end{cases}$$
• Step 2: Assuming the current values of $r_{nk}$ are fixed, minimize $J$ over $\mu_k$, which leads to the following rule for updating the cluster prototypes:
$$\mu_k = \frac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}$$
• Step 3: Determine whether to stop or return to Step 1 (a sketch of the full loop follows below)
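As a concrete illustration, here is a minimal NumPy sketch of this loop; the function name and defaults are illustrative, not part of the lecture:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: X has shape (N, d); returns centroids (K, d) and labels (N,)."""
    rng = np.random.default_rng(seed)
    # Step 0: initialize the centroids as K distinct random data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centroid (sets r_nk)
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                           else mu[k] for k in range(K)])
        # Step 3: stop once the centroids no longer move
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, labels
```

Since each iteration can only decrease the distortion $J$, the loop is guaranteed to stop, as the remarks that follow explain.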
K-means Clustering
Remarks
• The centroid $\mu_k$ is the mean of the data points assigned to cluster $k$, hence the name K-means clustering.
• The procedure terminates after a finite number of steps, since it reduces $J$ in both Step 1 and Step 2.
• There is no guarantee that the procedure terminates at the global optimum of $J$. In most cases, the algorithm stops at a local optimum, which depends on the initial values in Step 0 → use random restarts to improve the chances of getting closer to the global optimum.
K-means Clustering
Example
K-means Clustering
Application: vector quantization
• We can replace each data point with the centroid $\mu_k$ of the cluster it is assigned to → vector quantization
• We have thereby compressed the data points into
  • a codebook of all the centroids $\{\mu_1, \ldots, \mu_K\}$
  • a list of indices into the codebook, one per data point (created based on $r_{nk}$)
• This compression is lossy: information is lost, especially if we use a very small $K$ (a sketch follows below)
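A minimal sketch of the corresponding encode/decode pair (function names are illustrative):

```python
import numpy as np

def vq_encode(X, mu):
    """Map each row of X to the index of its nearest centroid in the codebook mu."""
    dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    return dists.argmin(axis=1)                                   # one index per point

def vq_decode(indices, mu):
    """Lossy reconstruction: replace each index by its centroid."""
    return mu[indices]
```

Storing one small integer per point plus the $K$ centroids, rather than a full $d$-dimensional vector per point, is where the compression comes from.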
Probabilistic interpretation of clustering
• We want to find a density $p(x)$ that best describes our data
• The data points seem to form 3 clusters
• However, we cannot model $p(x)$ with a single simple, known distribution, e.g. one Gaussian
Probabilistic interpretation of clustering
• Instead, we will model each region with its own Gaussian distribution → Gaussian mixture models (GMMs)
• Question 1: How do we know which (color) region a data point comes from?
• Question 2: What are the parameters of the Gaussian distribution in each region?
• We will answer both in an unsupervised way from the data $D = \{x_1, \ldots, x_N\}$
Gaussian mixture models: formal definition
A Gaussian mixture model has the following density function for $x$:
$$p(x) = \sum_{k=1}^{K} \omega_k\, \mathcal{N}(x; \mu_k, \Sigma_k)$$
• $K$: number of Gaussian components
• $\mu_k, \Sigma_k$: mean and covariance of the $k$-th component
• $\omega_k$: component weights, i.e., how much component $k$ contributes to the final distribution, with
$$\omega_k > 0 \;\;\forall k \quad \text{and} \quad \sum_{k=1}^{K} \omega_k = 1$$
The weights can be interpreted as a prior distribution, $\omega_k = p(z = k)$, which decides which mixture component to use (a sketch follows below)
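The density can be evaluated directly from this definition; a minimal sketch using SciPy (the function name and example parameters are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Evaluate p(x) = sum_k w_k * N(x; mu_k, Sigma_k) at a point x."""
    return sum(w * multivariate_normal(mean=m, cov=c).pdf(x)
               for w, m, c in zip(weights, means, covs))

# Example: a two-component mixture in 2D with hypothetical parameters
w = [0.3, 0.7]
mu = [np.zeros(2), 3.0 * np.ones(2)]
Sigma = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([0.5, 0.5]), w, mu, Sigma))
```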
GMM as the marginal distribution of a joint distribution
• Consider the following joint distribution: $p(x, z) = p(z)\, p(x|z)$
• $z$ is a discrete random variable taking values between 1 and $K$; it "selects" a Gaussian component
• We denote $\omega_k = p(z = k)$
• Assume Gaussian conditional distributions: $p(x|z = k) = \mathcal{N}(x; \mu_k, \Sigma_k)$
• Then the marginal distribution of $x$ is
$$p(x) = \sum_{k=1}^{K} \omega_k\, \mathcal{N}(x; \mu_k, \Sigma_k)$$
(which is exactly the Gaussian mixture model)
GMM as the marginal distribution of a joint distribution
• The conditional distributions of $x$ given $z$ (representing color) are
$$p(x|z = \text{red}) = \mathcal{N}(x; \mu_1, \Sigma_1), \quad p(x|z = \text{blue}) = \mathcal{N}(x; \mu_2, \Sigma_2), \quad p(x|z = \text{green}) = \mathcal{N}(x; \mu_3, \Sigma_3)$$
• The marginal distribution is thus
$$p(x) = p(\text{red})\,\mathcal{N}(x; \mu_1, \Sigma_1) + p(\text{blue})\,\mathcal{N}(x; \mu_2, \Sigma_2) + p(\text{green})\,\mathcal{N}(x; \mu_3, \Sigma_3)$$
Parameter estimation for GMMs: the easy case with complete data
We know the component to which each sample belongs
• Data $D = \{(x_1, z_1), \ldots, (x_N, z_N)\}$
• We want to find $\theta = \{\mu_k, \Sigma_k, \omega_k\}$
• Maximum log-likelihood: $\theta^* = \arg\max_\theta \log p(D|\theta)$
Solution:
$$\omega_k = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}}, \quad \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n, \quad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T$$
where $\gamma_{nk} = 1$ if $z_n = k$, and $0$ otherwise
Parameter estimation for GMMs: the easy case with complete data
Understanding the intuition:
$$\omega_k = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}}, \quad \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n, \quad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T$$
• For $\omega_k$: count the number of data points whose $z_n$ is $k$ and divide by the total number of data points
• For $\mu_k$: take all the data points whose $z_n$ is $k$ and compute their mean
• For $\Sigma_k$: take all the data points whose $z_n$ is $k$ and compute their covariance
(a sketch follows below)
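A minimal NumPy sketch of these closed-form estimates for the complete-data case (illustrative names; labels are 0-indexed here, whereas the slides use 1 to K):

```python
import numpy as np

def fit_gmm_complete(X, z, K):
    """ML estimates of (weights, means, covariances) when the labels z are observed."""
    N, d = X.shape
    gamma = np.eye(K)[z]                 # one-hot (N, K): gamma[n, k] = 1 iff z[n] == k
    Nk = gamma.sum(axis=0)               # number of points per component
    weights = Nk / N
    means = (gamma.T @ X) / Nk[:, None]  # per-component means
    covs = np.empty((K, d, d))
    for k in range(K):
        Xc = X - means[k]                # per-component covariance
        covs[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]
    return weights, means, covs
```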
Parameter estimation for GMMs: incomplete data
Trick: estimation with soft $\gamma_{nk}$
$$\gamma_{nk} = p(z_n = k|x_n)$$
$$\omega_k = \frac{\sum_n \gamma_{nk}}{\sum_k \sum_n \gamma_{nk}}, \quad \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n, \quad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T$$
Every data point $x_n$ is now assigned to each component fractionally according to $p(z_n = k|x_n)$, also called the responsibility
Parameter estimation for GMMs: incomplete data
Trick: estimation with soft $\gamma_{nk}$
Since we do not know $\theta$ to begin with, we cannot compute the soft $\gamma_{nk}$. But we can invoke an iterative procedure that alternates between estimating $\gamma_{nk}$ and using the estimated $\gamma_{nk}$ to compute $\theta = \{\mu_k, \Sigma_k, \omega_k\}$:
• Step 0: guess $\theta$ with initial values
• Step 1: compute $\gamma_{nk}$ using the current $\theta$
• Step 2: update $\theta$ using the computed $\gamma_{nk}$
• Step 3: go back to Step 1
Questions: (i) is this procedure correct, i.e., does it optimize a sensible criterion? (ii) practically, will this procedure ever stop instead of iterating forever? The answers lie in the Expectation Maximization (EM) algorithm, a powerful procedure for model estimation with missing data.
Expectation Maximization: motivation and setup
• EM is used to estimate parameters for probabilistic models with hidden/latent variables
$$p(x|\theta) = \sum_z p(x, z|\theta)$$
where $x$ is the observed random variable and $z$ is the hidden one
• We are given data $D = \{x_1, \ldots, x_N\}$, where the corresponding hidden variable values $z$ are not included
• Our goal is to obtain the maximum likelihood estimate of $\theta$:
$$\theta^* = \arg\max_\theta \sum_{n=1}^{N} \log p(x_n|\theta) = \arg\max_\theta \sum_{n=1}^{N} \log \sum_{z_n} p(x_n, z_n|\theta)$$
The above objective function is called the incomplete log-likelihood, and it is computationally intractable (since it needs to sum over all possible values of $z$ and then take the logarithm).
Expectation Maximization: Complete log-likelihood
• Incomplete log-likelihood:
$$\ell(\theta) = \sum_{n=1}^{N} \log \sum_{z_n} p(x_n, z_n|\theta)$$
• EM uses a clever trick to change this into sum-log form by defining
$$Q_q(\theta) = \sum_{n=1}^{N} \mathbb{E}_{z_n \sim q(z_n)} \log p(x_n, z_n|\theta) = \sum_{n=1}^{N} \sum_{z_n} q(z_n) \log p(x_n, z_n|\theta)$$
where $q(z)$ is a distribution over $z$.
The above is called the expected (complete) log-likelihood; since it takes the sum-log form, it is computationally tractable.
Expectation Maximization: Choice of q(z)
• We will choose a special $q(z) = p(z|x, \theta)$, i.e., the posterior probability of $z$:
$$Q(\theta) = Q_{z \sim p(z|x,\theta)}(\theta)$$
• We will show that
$$\ell(\theta) = Q(\theta) + \sum_{n=1}^{N} H[p(z|x_n, \theta)]$$
where $H$ is the entropy of the probability distribution $p(z|x, \theta)$ (a small computational sketch follows below):
$$H[p(z|x, \theta)] = -\sum_{z} p(z|x, \theta) \log p(z|x, \theta)$$
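As a small aside, the entropy of a discrete distribution is straightforward to compute; a minimal sketch (illustrative helper, not from the lecture):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H[p] = -sum_i p_i log p_i of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                   # convention: 0 * log 0 = 0
    return -(p * np.log(p)).sum()

print(entropy([0.5, 0.5]))         # log 2 ~ 0.693, the maximum for two outcomes
print(entropy([1.0, 0.0]))         # 0.0: a deterministic posterior has no entropy
```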
Expectation Maximization: Choice of q(z)
Proof
$$Q_q(\theta) = \sum_n \sum_{z_n} q(z_n) \log p(x_n, z_n|\theta)$$
$$\begin{aligned}
Q(\theta) = Q_{z \sim p(z|x,\theta)}(\theta) &= \sum_n \sum_{z_n} p(z_n|x_n, \theta) \log p(x_n, z_n|\theta) \\
&= \sum_n \sum_{z_n} p(z_n|x_n, \theta)\big[\log p(x_n|\theta) + \log p(z_n|x_n, \theta)\big] \\
&= \sum_n \sum_{z_n} p(z_n|x_n, \theta) \log p(x_n|\theta) + \sum_n \sum_{z_n} p(z_n|x_n, \theta) \log p(z_n|x_n, \theta) \\
&= \sum_n \log p(x_n|\theta) \underbrace{\sum_{z_n} p(z_n|x_n, \theta)}_{=\,1} - \sum_n H[p(z|x_n, \theta)] \\
&= \ell(\theta) - \sum_n H[p(z|x_n, \theta)]
\end{aligned}$$
Expectation Maximization
• As before, $Q(\theta)$ cannot be computed, since $p(z|x, \theta)$ depends on the unknown parameter values $\theta$
• Instead, we will use a known value $\theta^{old}$ to compute the expected likelihood
$$Q(\theta, \theta^{old}) = \sum_n \sum_{z_n} p(z_n|x_n, \theta^{old}) \log p(x_n, z_n|\theta)$$
• In the above, the variable is $\theta$, while $\theta^{old}$ is known
• By definition, $Q(\theta) = Q(\theta, \theta)$
• But how does $Q(\theta, \theta^{old})$ relate to $\ell(\theta)$?
• We will show that
$$\ell(\theta) \geq Q(\theta, \theta^{old}) + \sum_n H[p(z|x_n, \theta^{old})]$$
• Thus, $Q(\theta)$ is better than $Q(\theta, \theta^{old})$ (it gives equality rather than a lower bound), except that we cannot compute the former
Expectation Maximization
Proof
$$\begin{aligned}
\ell(\theta) = \sum_n \log p(x_n|\theta) &= \sum_n \log \sum_{z_n} p(x_n, z_n|\theta) \\
&= \sum_n \log \sum_{z_n} p(z_n|x_n, \theta^{old})\, \frac{p(x_n, z_n|\theta)}{p(z_n|x_n, \theta^{old})} \\
&\geq \sum_n \sum_{z_n} p(z_n|x_n, \theta^{old}) \log p(x_n, z_n|\theta) - \sum_n \sum_{z_n} p(z_n|x_n, \theta^{old}) \log p(z_n|x_n, \theta^{old}) \\
&= \underbrace{Q(\theta, \theta^{old}) + \sum_n H[p(z|x_n, \theta^{old})]}_{A(\theta, \theta^{old})\,:\ \text{auxiliary function}}
\end{aligned}$$
In the above we have used Jensen's inequality: $\log \sum_i w_i x_i \geq \sum_i w_i \log x_i$ for all $w_i \geq 0$ such that $\sum_i w_i = 1$ (a quick numeric check follows below).
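A quick numeric sanity check of this inequality (the numbers are arbitrary, chosen only for illustration):

```python
import numpy as np

w = np.array([0.2, 0.5, 0.3])   # weights: nonnegative and summing to 1
x = np.array([1.0, 4.0, 9.0])   # arbitrary positive values

lhs = np.log(np.sum(w * x))     # log of the weighted sum ~ 1.589
rhs = np.sum(w * np.log(x))     # weighted sum of logs ~ 1.352
print(lhs >= rhs)               # True: log is concave, so Jensen's inequality holds
```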
Expectation Maximization
• Suppose we have an initial guess $\theta^{old}$ and we maximize the auxiliary function:
$$\theta^{new} = \arg\max_\theta A(\theta, \theta^{old})$$
• With the new guess, we have
$$\ell(\theta^{new}) \geq A(\theta^{new}, \theta^{old}) \geq A(\theta^{old}, \theta^{old}) = \ell(\theta^{old})$$
• By maximizing the auxiliary function, we keep increasing the likelihood (this is the core of EM)
Expectation Maximization
Algorithm outline
• Step 0: Initialize $\theta$ with $\theta^{(0)}$
• Step 1 (E-step): Compute the auxiliary function $A(\theta, \theta^{(t)})$ using the current value of $\theta$
• Step 2 (M-step): Maximize the auxiliary function:
$$\theta^{(t+1)} \leftarrow \arg\max_\theta A(\theta, \theta^{(t)})$$
• Step 3: Increase $t$ to $t + 1$ and go back to Step 1, or stop if $\ell(\theta^{(t+1)})$ does not improve much (a generic sketch of this loop follows below)
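A generic sketch of this loop in Python; the callbacks e_step, m_step, and log_lik are supplied by the specific model, and all names here are illustrative:

```python
def em(X, theta0, e_step, m_step, log_lik, tol=1e-6, max_iters=200):
    """Generic EM: alternate E- and M-steps until the log-likelihood stalls."""
    theta, ll_old = theta0, -float("inf")
    for _ in range(max_iters):
        stats = e_step(X, theta)   # E-step: posterior over the hidden variables
        theta = m_step(X, stats)   # M-step: maximize Q(theta, theta^(t))
        ll = log_lik(X, theta)     # monitor the incomplete log-likelihood
        if ll - ll_old < tol:      # Step 3: stop when the improvement is small
            break
        ll_old = ll
    return theta
```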
Expectation Maximization
Remarks
• EM converges, but only to a local optimum: the global optimum is not guaranteed
• The E-step depends on computing the posterior probability $p(z_n|x_n, \theta^{(t)})$
• The M-step does not depend on the entropy term, so we only need to do the following:
$$\theta^{(t+1)} \leftarrow \arg\max_\theta A(\theta, \theta^{(t)}) = \arg\max_\theta Q(\theta, \theta^{(t)})$$
We often call the last term the Q-function
Expectation Maximization for GMMs
• The hidden variable follows a multinomial distribution: $z_n \sim \text{Multinomial}(\omega_1, \ldots, \omega_K)$
• The likelihood of observation $x_n$ equals the probability of the corresponding component: $p(x_n|z_n, \theta) = \prod_{k=1}^{K} \mathcal{N}(x_n|\mu_k, \Sigma_k)^{\gamma_{nk}}$, where $\gamma_{nk}$ is the one-hot indicator of $z_n = k$
• Complete data log-likelihood:
$$\begin{aligned}
Q(\theta) = \mathbb{E}\left[\sum_n \log p(x_n, z_n|\theta)\right] &= \mathbb{E}\left[\sum_n \log\big(p(z_n|\theta)\, p(x_n|z_n, \theta)\big)\right] \\
&= \mathbb{E}\left[\sum_n \log\Big(\prod_k \omega_k^{\gamma_{nk}} \prod_k \mathcal{N}(x_n|\mu_k, \Sigma_k)^{\gamma_{nk}}\Big)\right] \\
&= \mathbb{E}\left[\sum_n \sum_k \gamma_{nk}\big(\log \omega_k + \log \mathcal{N}(x_n|\mu_k, \Sigma_k)\big)\right] \\
&= \sum_n \sum_k \mathbb{E}[\gamma_{nk}|x_n, \mu_k, \Sigma_k]\,\big(\log \omega_k + \log \mathcal{N}(x_n|\mu_k, \Sigma_k)\big)
\end{aligned}$$
Expectation Maximization for GMMs
What is the E-step in GMM?
$$\mathbb{E}[\gamma_{nk}|x_n, \theta] = p(\gamma_{nk} = 1|x_n, \theta) = \frac{p(x_n|\gamma_{nk} = 1, \theta)\, p(\gamma_{nk} = 1|\theta)}{\sum_{k'} p(x_n|\gamma_{nk'} = 1, \theta)\, p(\gamma_{nk'} = 1|\theta)}$$
That is, we compute the responsibility using the current parameters $\theta^{(t)}$:
$$\gamma_{nk} = p(z = k|x_n, \theta^{(t)}) = \frac{\omega_k^{(t)}\, \mathcal{N}(x_n|\mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{k'} \omega_{k'}^{(t)}\, \mathcal{N}(x_n|\mu_{k'}^{(t)}, \Sigma_{k'}^{(t)})}$$
(a sketch follows below)
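A minimal NumPy/SciPy sketch of this E-step (the function name is illustrative; theta packs the parameters so it plugs into the generic em loop sketched earlier):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, theta):
    """Responsibilities gamma[n, k] = w_k N(x_n; mu_k, Sigma_k) / sum over components."""
    weights, means, covs = theta
    # unnormalized posterior: prior weight times Gaussian likelihood, per component
    gamma = np.column_stack([weights[k] * multivariate_normal(means[k], covs[k]).pdf(X)
                             for k in range(len(weights))])
    return gamma / gamma.sum(axis=1, keepdims=True)   # normalize over components
```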
Expectation Maximization for GMMs
What is the M-step in GMM?
Maximize the auxiliary function:
$$\begin{aligned}
Q(\theta, \theta^{(t)}) &= \sum_n \sum_k p(z = k|x_n, \theta^{(t)}) \log p(x_n, z = k|\theta) \\
&= \sum_n \sum_k \gamma_{nk} \log p(x_n, z = k|\theta) \\
&= \sum_n \sum_k \gamma_{nk} \log\big(p(z = k)\, p(x_n|z = k)\big) \\
&= \sum_n \sum_k \gamma_{nk}\big[\log \omega_k + \log \mathcal{N}(x_n|\mu_k, \Sigma_k)\big]
\end{aligned}$$
Setting the derivatives with respect to each parameter to zero yields the closed-form updates (sketched below):
$$\omega_k = \frac{\sum_n \gamma_{nk}}{N}, \quad \mu_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, x_n, \quad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T$$
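And a matching sketch of the M-step (illustrative names; together with e_step above and the generic em loop sketched earlier, this completes EM for GMMs):

```python
import numpy as np

def m_step(X, gamma):
    """Closed-form updates from responsibilities gamma of shape (N, K)."""
    N, d = X.shape
    Nk = gamma.sum(axis=0)                      # effective count per component
    weights = Nk / N                            # w_k = (sum_n gamma_nk) / N
    means = (gamma.T @ X) / Nk[:, None]         # responsibility-weighted means
    covs = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        Xc = X - means[k]
        covs[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]   # weighted covariances
    return weights, means, covs                 # this tuple is theta for e_step
```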
Expectation Maximization for GMMs
Example of EM implementation
GMMs and K-means
GMMs provide a probabilistic interpretation for K-means
• Assume all Gaussian components have $\sigma^2 I$ as their covariance matrices
• Further assume $\sigma \to 0$
• Then we only need to estimate $\mu_k$, i.e., the means
• Under these assumptions, EM for GMM parameter estimation simplifies to K-means
For this reason, K-means is often called hard GMM, or GMMs are called soft K-means. The soft posterior $\gamma_{nk}$ provides a probabilistic assignment of $x_n$ to the cluster $k$ represented by the corresponding Gaussian distribution.
Takeaways and Next Time
• Expectation Maximization and GMM
• Next Time: More Unsupervised Learning