K-means, E.M. and Mixture Models
Machine Learning
Department of Computer Science
November 22, 2010
Reminder: Three Main Problems in ML
• Three main problems in ML:
– Regression: linear regression, neural networks, ...
– Classification: decision trees, kNN, Bayesian classifiers, ...
– Density estimation: Gaussian naive density estimators, ...
• Today, we will learn:
– K-means: a simple unsupervised clustering algorithm.
– Expectation Maximization (EM): a general algorithm for density estimation.
∗ We will see how to use EM in the general case and in the specific case of GMMs.
– GMM: a tool for modelling data in the wild (a density estimator).
∗ We will also learn how to use a GMM in a Bayesian classifier.
Contents
• Unsupervised Learning
• K-means clustering
• Expectation Maximization (E.M.)
– Regularized EM
– Model Selection
• Gaussian mixtures as a Density Estimator
– Gaussian mixtures
– EM for mixtures
• Gaussian mixtures for classification
• Case studies
Unsupervised Learning
• So far, we have considered supervised learning techniques:
– The label of each sample is included in the training set

Sample  Label
x1      y1
...     ...
xn      yk

• Unsupervised learning:
– The training set contains the samples only

Sample  Label
x1
...
xn
Unsupervised Learning
[Figure 1: Unsupervised vs. Supervised Learning. (a) Supervised learning. (b) Unsupervised learning.]
What is unsupervised learning useful for?
• Collecting and labeling a large training set can be very expensive.
• It can find features which are helpful for categorization.
• It can give insight into the natural structure of the data.
Contents
• Unsupervised Learning
• K-means clustering
• Expectation Maximization (E.M.)
– Regularized EM
– Model Selection
• Gaussian mixtures as a Density Estimator
– Gaussian mixtures
– EM for mixtures
• Gaussian mixtures for classification
• Case studies
K-means clustering
• Clustering algorithms aim to find groups of “similar” data points among the input data.
• K-means is an effective algorithm to extract a given number of clusters from a training set.
• Once done, the cluster locations can be used to classify data into distinct classes.
[Figure: scatter plot of example data to be clustered.]
K-means clustering
• Given:
– The dataset: {x_n}_{n=1}^{N} = {x_1, x_2, ..., x_N}
– Number of clusters: K (K < N)
• Goal: find a partition S = {S_k}_{k=1}^{K} that minimizes the objective function

J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ∥ x_n − µ_k ∥²    (1)

where r_nk = 1 if x_n is assigned to cluster S_k, and r_nj = 0 for j ≠ k.
i.e. find values for the {r_nk} and the {µ_k} that minimize (1).
K-means clustering

J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ∥ x_n − µ_k ∥²

• Select some initial values for the µ_k.
• Expectation: keep the µ_k fixed, minimize J with respect to the r_nk.
• Maximization: keep the r_nk fixed, minimize J with respect to the µ_k.
• Loop until there is no change in the partitions (or the maximum number of iterations is exceeded).
K-means clustering

J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ∥ x_n − µ_k ∥²

• Expectation: J is a linear function of the r_nk:

r_nk = 1 if k = arg min_j ∥ x_n − µ_j ∥², and 0 otherwise.

• Maximization: setting the derivative of J with respect to µ_k to zero gives:

µ_k = ( Σ_n r_nk x_n ) / ( Σ_n r_nk )

• Convergence of K-means: assured [why?], but it may lead to a local minimum of J [8].
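The two alternating steps above can be sketched directly in NumPy. This is a minimal illustration (function and variable names are our own, not from the slides), with the usual guard for clusters that end up empty:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Alternate the assignment (E) and mean-update (M) steps until the
    partitions stop changing."""
    rng = np.random.default_rng(seed)
    # Initialize the means with K randomly selected data points.
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iters):
        # E-step: r_nk = 1 for the closest mean, 0 otherwise.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break  # no change in the partitions -> converged
        assign = new_assign
        # M-step: mu_k = average of the points assigned to cluster k.
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:
                mu[k] = members.mean(axis=0)
    return mu, assign
```

Each iteration can only decrease J, which is why convergence is assured; the result still depends on the random initialization.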
K-means clustering: How to understand it?

J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ∥ x_n − µ_k ∥²

• Expectation: minimize J with respect to the r_nk.
– For each x_n, find the “closest” cluster mean µ_k and put x_n into cluster S_k.
• Maximization: minimize J with respect to the µ_k.
– For each cluster S_k, re-estimate the cluster mean µ_k to be the average value of all samples in S_k.
• Loop until there is no change in the partitions (or the maximum number of iterations is exceeded).
K-means clustering: Demonstration
K-means clustering: some variations
• Initial cluster centroids:
– Randomly selected
– Iterative procedure: k-means++ [2]
• Number of clusters K:
– Chosen empirically/experimentally: 2 ∼ √n
– Learned from the data [6]
• Objective function:
– General dissimilarity measures: the k-medoids algorithm.
• Speeding up:
– kd-trees for pre-processing [7]
– The triangle inequality for distance calculations [4]
Contents
• Unsupervised Learning
• K-means clustering
• Expectation Maximization (E.M.)
– Regularized EM
– Model Selection
• Gaussian mixtures as a Density Estimator
– Gaussian mixtures
– EM for mixtures
• Gaussian mixtures for classification
• Case studies
Expectation Maximization
E.M.
Expectation Maximization
• A general-purpose algorithm for MLE in a wide range of situations.
• First formally stated by Dempster, Laird and Rubin in 1977 [1].
– There are even several books discussing only EM and its variations!
• An excellent way of solving our unsupervised learning problem, as we will see.
– EM is also widely used in other domains.
EM: a solution for MLE
• Given a statistical model with:
– a set X of observed data,
– a set Z of unobserved latent data,
– a vector of unknown parameters θ,
– a likelihood function L(θ; X, Z) = p(X, Z | θ).
• Roughly speaking, the aim of MLE is to determine θ̂ = arg max_θ L(θ; X, Z).
– We know the old trick: take partial derivatives of the log likelihood...
– But it is not always tractable [e.g.]
– Other solutions are available.
EM: General Case

L(θ; X, Z) = p(X, Z | θ)

• EM is just an iterative procedure for finding the MLE.
• Expectation step: keep the current estimate θ(t) fixed, and calculate the expected value of the log likelihood, where the expectation is over the latent variables Z given X and θ(t):

Q(θ | θ(t)) = E_{Z|X,θ(t)} [log L(θ; X, Z)] = E_{Z|X,θ(t)} [log p(X, Z | θ)]

• Maximization step: find the parameter that maximizes this quantity:

θ(t+1) = arg max_θ Q(θ | θ(t))
EM: Motivation
• If we know the value of the parameters θ, we can find the values of the latent variables Z by maximizing the log likelihood over all possible values of Z.
– Searching the value space of Z.
• If we know Z, we can find an estimate of θ:
– typically by grouping the observed data points according to the values of their associated latent variables,
– then averaging the values (or some functions of the values) of the points in each group.
To understand this motivation, let’s take K-means as a trivial example...
EM: informal description
Both θ and Z are unknown; EM is an iterative algorithm:
1. Initialize the parameters θ to some random values.
2. Compute the best values of Z given these parameter values.
3. Use the just-computed values of Z to find better estimates for θ.
4. Iterate until convergence.
EM Convergence
• E.M. convergence: yes.
– After each iteration, p (X, Z | θ) must increase or stay the same [NOT OBVIOUS]
– But it cannot exceed 1 [OBVIOUS]
– Hence it must converge [OBVIOUS]
• Bad news: E.M. converges to a local optimum.
– Whether the algorithm converges to the global optimum depends on the initialization.
• Let’s take K-means as an example, again...
• Details can be found in [9].
Regularized EM (REM)
• EM tries to infer the latent (missing) data Z from the observations X.
– We want to choose missing data that have a strong probabilistic relation to the observations, i.e. we assume that the observations contain lots of information about the missing data.
– But E.M. does not have any control over the relationship between the missing data and the observations!
• Regularized EM (REM) [5] tries to optimize the penalized likelihood

L̃ (θ | X, Z) = L (θ | X, Z) − γ H (Z | X, θ)

where H (Y) is Shannon’s entropy of the random variable Y:

H (Y) = − Σ_y p (y) log p (y)

and the positive value γ is the regularization parameter. [What happens when γ = 0?]
Regularized EM (REM)
• E-step: unchanged.
• M-step: find the parameter that maximizes this quantity:

θ(t+1) = arg max_θ Q̃ (θ | θ(t))

where

Q̃ (θ | θ(t)) = Q (θ | θ(t)) − γ H (Z | X, θ)

• REM is expected to converge faster than EM (and it does!)
• So, to apply REM, we just need to determine the H (·) part...
Model Selection
• Consider a parametric model:
– When estimating model parameters using MLE, it is possible to increase the likelihood by adding parameters,
– but this may result in over-fitting.
• e.g. K-means with different values of K...
• We need a criterion for model selection, e.g. to “judge” which model configuration is better, or how many parameters are sufficient...
– Cross Validation
– Akaike Information Criterion (AIC)
– Bayes Factor
∗ Bayesian Information Criterion (BIC)
∗ Deviance Information Criterion
– ...
Bayesian Information Criterion

BIC = − log p (data | θ̂) + (# of parameters / 2) log n

• Where:
– θ̂: the estimated parameters.
– p (data | θ̂): the maximized value of the likelihood function for the estimated model.
– n: the number of data points.
– Note that there are other ways to write the BIC expression, but they are all equivalent.
• Given any two estimated models, the model with the lower value of BIC is preferred.
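In this form, BIC is a one-liner given a fitted model's maximized log likelihood (the helper below is our own, not from the slides):

```python
import math

def bic(log_likelihood, n_params, n_points):
    """BIC = -log p(data | theta_hat) + (#params / 2) * log n; lower is better."""
    return -log_likelihood + 0.5 * n_params * math.log(n_points)
```

To select among candidate models, compute `bic(...)` for each and keep the one with the smallest value.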
Bayesian Score
• BIC is an asymptotic (large n) approximation to the better (and harder to evaluate) Bayesian score:

Bayesian score = ∫ p (θ) p (data | θ) dθ

• Given two models, model selection is based on the Bayes factor:

K = [ ∫ p (θ_1) p (data | θ_1) dθ_1 ] / [ ∫ p (θ_2) p (data | θ_2) dθ_2 ]
Contents
• Unsupervised Learning
• K-means clustering
• Expectation Maximization (E.M.)
– Regularized EM
– Model Selection
• Gaussian mixtures as a Density Estimator
– Gaussian mixtures
– EM for mixtures
• Gaussian mixtures for classification
• Case studies
Reminder: Bayes Classifier
[Figure: scatter plot of labeled samples from several classes.]

p (y = i | x) = p (x | y = i) p (y = i) / p (x)
Reminder: Bayes Classifier
[Figure: scatter plot of labeled samples from several classes.]
In the case of a Gaussian Bayes classifier:

p (y = i | x) = [ 1 / ((2π)^{d/2} |Σ_i|^{1/2}) ] exp[ −(1/2) (x − µ_i)^T Σ_i^{−1} (x − µ_i) ] p_i / p (x)

How can we deal with the denominator p (x)?
Reminder: The Single Gaussian Distribution
• Multivariate Gaussian:

N (x; µ, Σ) = [ 1 / ((2π)^{d/2} |Σ|^{1/2}) ] exp[ −(1/2) (x − µ)^T Σ^{−1} (x − µ) ]

• For maximum likelihood:

0 = ∂ ln N (x_1, x_2, ..., x_N; µ, Σ) / ∂µ

• and the solution is

µ_ML = (1/N) Σ_{i=1}^{N} x_i

Σ_ML = (1/N) Σ_{i=1}^{N} (x_i − µ_ML)(x_i − µ_ML)^T
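The two MLE solutions above translate directly into NumPy (a small sketch; the function name is ours):

```python
import numpy as np

def gaussian_mle(X):
    """MLE for a single multivariate Gaussian: the sample mean and the
    (biased, divide-by-N) sample covariance."""
    mu = X.mean(axis=0)
    diff = X - mu
    sigma = diff.T @ diff / len(X)  # (1/N) sum_i (x_i - mu)(x_i - mu)^T
    return mu, sigma
```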
The GMM assumption
• There are k components: {c_i}_{i=1}^{k}
• Component c_i has an associated mean vector µ_i
[Figure: three component means µ_1, µ_2, µ_3.]
The GMM assumption
• There are k components: {c_i}_{i=1}^{k}
• Component c_i has an associated mean vector µ_i
• Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i
• Each sample is generated according to the following guidelines:
[Figure: Gaussian contours around µ_1, µ_2, µ_3.]
The GMM assumption
• There are k components: {c_i}_{i=1}^{k}
• Component c_i has an associated mean vector µ_i
• Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i
• Each sample is generated according to the following guidelines:
– Randomly select component c_i with probability P (c_i) = w_i, s.t. Σ_{i=1}^{k} w_i = 1
[Figure: the selected component mean µ_2.]
The GMM assumption
• There are k components: {c_i}_{i=1}^{k}
• Component c_i has an associated mean vector µ_i
• Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i
• Each sample is generated according to the following guidelines:
– Randomly select component c_i with probability P (c_i) = w_i, s.t. Σ_{i=1}^{k} w_i = 1
– Sample x ~ N (µ_i, Σ_i)
[Figure: a sample x drawn from the component with mean µ_2.]
Probability density function of GMM
A “linear combination” of Gaussians:

f (x) = Σ_{i=1}^{k} w_i N (x; µ_i, Σ_i), where Σ_{i=1}^{k} w_i = 1

[Figure 2: Probability density function of some GMMs. (a) The pdf of a 1D GMM with 3 components, showing w_1 N(µ_1, σ_1²), w_2 N(µ_2, σ_2²), w_3 N(µ_3, σ_3²) and their sum f(x). (b) The pdf of a 2D GMM with 3 components.]
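A minimal sketch of evaluating this density in plain NumPy (all names are ours; no safeguards for singular covariances):

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    """Multivariate normal density N(x; mu, Sigma)."""
    x, mu, cov = np.atleast_1d(x), np.atleast_1d(mu), np.atleast_2d(cov)
    d = len(mu)
    diff = x - mu
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm)

def gmm_pdf(x, weights, means, covs):
    """f(x) = sum_i w_i N(x; mu_i, Sigma_i)."""
    return sum(w * gauss_pdf(x, m, c) for w, m, c in zip(weights, means, covs))
```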
GMM: Problem definition

f (x) = Σ_{i=1}^{k} w_i N (x; µ_i, Σ_i), where Σ_{i=1}^{k} w_i = 1

Given a training set, how do we model these data points using a GMM?
• Given:
– The training set: {x_i}_{i=1}^{N}
– Number of clusters: k
• Goal: model this data using a mixture of Gaussians
– Weights: w_1, w_2, ..., w_k
– Means and covariances: µ_1, µ_2, ..., µ_k; Σ_1, Σ_2, ..., Σ_k
Computing likelihoods in the unsupervised case

f (x) = Σ_{i=1}^{k} w_i N (x; µ_i, Σ_i), where Σ_{i=1}^{k} w_i = 1

• Given a mixture of Gaussians, denoted by G, we can define the likelihood of any x:

P (x | G) = P (x | w_1, µ_1, Σ_1, ..., w_k, µ_k, Σ_k)
          = Σ_{i=1}^{k} P (x | c_i) P (c_i)
          = Σ_{i=1}^{k} w_i N (x; µ_i, Σ_i)

• So we can define the likelihood for the whole training set [Why?]:

P (x_1, x_2, ..., x_N | G) = Π_{i=1}^{N} P (x_i | G)
                           = Π_{i=1}^{N} Σ_{j=1}^{k} w_j N (x_i; µ_j, Σ_j)
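The training-set likelihood is best computed in the log domain, turning the product into a sum and avoiding numerical underflow. A sketch with 1-D components for brevity (names are ours):

```python
import numpy as np

def log_likelihood(X, weights, means, variances):
    """log P(x_1, ..., x_N | G) = sum_i log sum_j w_j N(x_i; mu_j, sigma_j^2)."""
    total = 0.0
    for x in np.asarray(X, dtype=float):
        # Mixture density f(x) = sum_j w_j N(x; mu_j, sigma_j^2)
        p = sum(w * np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)
                for w, m, v in zip(weights, means, variances))
        total += np.log(p)  # log of the product = sum of the logs
    return float(total)
```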
Estimating GMM parameters
• We know this one: Maximum Likelihood Estimation

ln P (X | G) = Σ_{i=1}^{N} ln Σ_{j=1}^{k} w_j N (x_i; µ_j, Σ_j)

– For the maximum likelihood:

0 = ∂ ln P (X | G) / ∂µ_j

– This leads to non-linear, non-analytically-solvable equations!
• Use gradient descent:
– Slow but doable.
• A much cuter and recently popular method...
E.M. for GMM
• Remember:
– We have the training set {x_i}_{i=1}^{N} and the number of components k.
– Assume we know p (c_1) = w_1, p (c_2) = w_2, ..., p (c_k) = w_k.
– We don’t know µ_1, µ_2, ..., µ_k.
The likelihood (with a shared spherical covariance σ²I and Gaussian normalizing constant K):

p (data | µ_1, µ_2, ..., µ_k) = p (x_1, x_2, ..., x_N | µ_1, µ_2, ..., µ_k)
 = Π_{i=1}^{N} p (x_i | µ_1, µ_2, ..., µ_k)
 = Π_{i=1}^{N} Σ_{j=1}^{k} p (x_i | c_j, µ_1, µ_2, ..., µ_k) p (c_j)
 = Π_{i=1}^{N} Σ_{j=1}^{k} K exp( −(1/(2σ²)) (x_i − µ_j)² ) w_j
E.M. for GMM
• For maximum likelihood, we know that ∂/∂µ_j log p (data | µ_1, µ_2, ..., µ_k) = 0.
• Some wild algebra turns this into: for maximum likelihood, for each j:

µ_j = [ Σ_{i=1}^{N} p (c_j | x_i, µ_1, µ_2, ..., µ_k) x_i ] / [ Σ_{i=1}^{N} p (c_j | x_i, µ_1, µ_2, ..., µ_k) ]

This is a set of k coupled non-linear equations in the µ_j’s.
• So:
– If, for each x_i, we knew p (c_j | x_i, µ_1, µ_2, ..., µ_k), then we could easily compute each µ_j.
– If we knew each µ_j, we could compute p (c_j | x_i, µ_1, µ_2, ..., µ_k) for each x_i and c_j.
E.M. for GMM
• E.M. is coming: on the t-th iteration, let our estimates be

λ_t = {µ_1(t), µ_2(t), ..., µ_k(t)}

• E-step: compute the expected class memberships of all data points for each class:

p (c_j | x_i, λ_t) = p (x_i | c_j, λ_t) p (c_j | λ_t) / p (x_i | λ_t)
                   = p (x_i | c_j, µ_j(t), σ_j I) p (c_j) / Σ_{m=1}^{k} p (x_i | c_m, µ_m(t), σ_m I) p (c_m)

• M-step: compute the means given our data’s class membership distributions:

µ_j(t+1) = [ Σ_{i=1}^{N} p (c_j | x_i, λ_t) x_i ] / [ Σ_{i=1}^{N} p (c_j | x_i, λ_t) ]
E.M. for General GMM: E-step
• On the t-th iteration, let our estimates be

λ_t = {µ_1(t), µ_2(t), ..., µ_k(t), Σ_1(t), Σ_2(t), ..., Σ_k(t), w_1(t), w_2(t), ..., w_k(t)}

• E-step: compute the expected class memberships of all data points for each class:

τ_ij(t) ≡ p (c_j | x_i, λ_t) = p (x_i | c_j, λ_t) p (c_j | λ_t) / p (x_i | λ_t)
                             = p (x_i | c_j, µ_j(t), Σ_j(t)) w_j(t) / Σ_{m=1}^{k} p (x_i | c_m, µ_m(t), Σ_m(t)) w_m(t)
E.M. for General GMM: M-step
• M-step: compute the parameters given our data’s class membership distributions:

w_j(t+1) = [ Σ_{i=1}^{N} p (c_j | x_i, λ_t) ] / N = (1/N) Σ_{i=1}^{N} τ_ij(t)

µ_j(t+1) = [ Σ_{i=1}^{N} p (c_j | x_i, λ_t) x_i ] / [ Σ_{i=1}^{N} p (c_j | x_i, λ_t) ]
         = [1 / (N w_j(t+1))] Σ_{i=1}^{N} τ_ij(t) x_i

Σ_j(t+1) = [ Σ_{i=1}^{N} p (c_j | x_i, λ_t) [x_i − µ_j(t+1)][x_i − µ_j(t+1)]^T ] / [ Σ_{i=1}^{N} p (c_j | x_i, λ_t) ]
         = [1 / (N w_j(t+1))] Σ_{i=1}^{N} τ_ij(t) [x_i − µ_j(t+1)][x_i − µ_j(t+1)]^T
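The E- and M-steps above can be sketched as follows. This is our own minimal implementation: a fixed number of iterations instead of a convergence test, a small ridge on the covariances for numerical stability, and optional user-supplied initial means (e.g. from K-means):

```python
import numpy as np

def gauss_pdf(X, mu, cov):
    """N(x; mu, Sigma) evaluated at each row of X."""
    d = len(mu)
    diff = X - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, k, iters=50, seed=0, init_means=None):
    """Fit a k-component GMM by alternating the E- and M-steps."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.full(k, 1.0 / k)                       # w_j = 1/k
    if init_means is None:                        # random data points (or K-means)
        mu = X[rng.choice(N, size=k, replace=False)].astype(float)
    else:
        mu = np.array(init_means, dtype=float)
    cov = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    for _ in range(iters):
        # E-step: tau_ij = w_j N(x_i; mu_j, Sigma_j) / sum_m w_m N(x_i; mu_m, Sigma_m)
        tau = np.stack([w[j] * gauss_pdf(X, mu[j], cov[j]) for j in range(k)], axis=1)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: re-estimate w_j, mu_j, Sigma_j from the responsibilities tau_ij.
        Nj = tau.sum(axis=0)
        w = Nj / N
        mu = (tau.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (tau[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return w, mu, cov
```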
E.M. for General GMM: Initialization
• w_j = 1/k, for j = 1, 2, ..., k
• Each µ_j is set to a randomly selected data point.
– Or use K-means for this initialization.
• Each Σ_j is computed using the equation on the previous slide...
Regularized E.M. for GMM
• In the case of REM, the entropy H (·) is

H (C | X; λ_t) = − Σ_{i=1}^{N} Σ_{j=1}^{k} p (c_j | x_i; λ_t) log p (c_j | x_i; λ_t)
              = − Σ_{i=1}^{N} Σ_{j=1}^{k} τ_ij(t) log τ_ij(t)

and the penalized likelihood will be

L̃ (λ_t; X, C) = L (λ_t; X, C) − γ H (C | X; λ_t)
              = Σ_{i=1}^{N} log [ Σ_{j=1}^{k} w_j p (x_i | c_j, λ_t) ] + γ Σ_{i=1}^{N} Σ_{j=1}^{k} τ_ij(t) log τ_ij(t)
Regularized E.M. for GMM
• Some algebra [5] turns this into:

w_j(t+1) = (1/N) Σ_{i=1}^{N} p (c_j | x_i, λ_t) (1 + γ log p (c_j | x_i, λ_t))
         = (1/N) Σ_{i=1}^{N} τ_ij(t) (1 + γ log τ_ij(t))

µ_j(t+1) = [ Σ_{i=1}^{N} p (c_j | x_i, λ_t) x_i (1 + γ log p (c_j | x_i, λ_t)) ] / [ Σ_{i=1}^{N} p (c_j | x_i, λ_t) (1 + γ log p (c_j | x_i, λ_t)) ]
         = [1 / (N w_j(t+1))] Σ_{i=1}^{N} τ_ij(t) x_i (1 + γ log τ_ij(t))
Regularized E.M. for GMM
• Some algebra [5] turns this into (cont.):

Σ_j(t+1) = [1 / (N w_j(t+1))] Σ_{i=1}^{N} τ_ij(t) (1 + γ log τ_ij(t)) d_ij(t+1)

where

d_ij(t+1) = [x_i − µ_j(t+1)][x_i − µ_j(t+1)]^T
Demonstration
• EM for GMM
• REM for GMM
Local optimum solution
• E.M. is guaranteed to find a locally optimal solution by monotonically increasing the log-likelihood.
• Whether it converges to the globally optimal solution depends on the initialization.
[Figure: two mixtures fitted to the same data from different initializations, converging to different solutions.]
GMM: Selecting the number of components
• We can run the E.M. algorithm with different numbers of components.
– We need a criterion for selecting the “best” number of components.
[Figure: mixtures fitted to the same data with different numbers of components.]
GMM: Model Selection
• Empirically/Experimentally [Sure!]
• Cross-Validation [How?]
• BIC
• ...
GMM: Model Selection
• Empirically/Experimentally
– Typically 3–5 components
• Cross-Validation: K-fold, leave-one-out...
– Omit each point x_i in turn, estimate the parameters θ_−i on the basis of the remaining points, then evaluate

Σ_{i=1}^{N} log p (x_i | θ_−i)

• BIC: find k (the number of components) that minimizes the BIC

BIC = − log p (data | θ̂_k) + (d_k / 2) log n

where d_k is the number of (effective) parameters in the k-component mixture.
Contents
• Unsupervised Learning
• K-means clustering
• Expectation Maximization (E.M.)
– Regularized EM
– Model Selection
• Gaussian mixtures as a Density Estimator
– Gaussian mixtures
– EM for mixtures
• Gaussian mixtures for classification
• Case studies
Gaussian mixtures for classification

p (y = i | x) = p (x | y = i) p (y = i) / p (x)

• To build a Bayesian classifier based on GMMs, we can use a GMM to model the data in each class,
– so each class is modeled by one k-component GMM.
• For example:
Class 0: p (y = 0), p (x | θ_0) (a 3-component mixture)
Class 1: p (y = 1), p (x | θ_1) (a 3-component mixture)
Class 2: p (y = 2), p (x | θ_2) (a 3-component mixture)
...
GMM for Classification
• As before, each class is modeled by a k-component GMM.
• A new test sample x is classified according to

c = arg max_i p (y = i) p (x | θ_i)

where

p (x | θ_i) = Σ_{j=1}^{k} w_ij N (x; µ_ij, Σ_ij)

• Simple, quick (and actually used!)
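A sketch of this decision rule, assuming the per-class mixture parameters have already been estimated (1-D components for brevity; all names are ours):

```python
import numpy as np

def gmm_class_likelihood(x, weights, means, variances):
    """p(x | theta_i) = sum_j w_ij N(x; mu_ij, sigma_ij^2), with 1-D components."""
    w, m, v = (np.asarray(a, dtype=float) for a in (weights, means, variances))
    return float(np.sum(w * np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)))

def classify(x, priors, class_params):
    """c = argmax_i p(y = i) p(x | theta_i)."""
    scores = [prior * gmm_class_likelihood(x, *params)
              for prior, params in zip(priors, class_params)]
    return int(np.argmax(scores))
```

Here `class_params[i]` holds the fitted (weights, means, variances) of class i's mixture, and `priors[i]` is p(y = i).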
Contents
• Unsupervised Learning
• K-means clustering
• Expectation Maximization (E.M.)
– Regularized EM
– Model Selection
• Gaussian mixtures as a Density Estimator
– Gaussian mixtures
– EM for mixtures
• Gaussian mixtures for classification
• Case studies
Case studies
• Background subtraction
– GMM for each pixel
• Speech recognition
– GMM for the underlying distribution of feature vectors of each phone
• Many, many others...
What you should know by now
• K-means as a simple clustering algorithm
• E.M.: an algorithm for solving many MLE problems
• GMM: a tool for modeling data
– Note 1: We can have a mixture model of many different types of distributions, not only Gaussians.
– Note 2: Computing the sum of Gaussians may be expensive; some approximations are available [3].
• Model selection:
– Bayesian Information Criterion
Q & A
References
[1] A. Dempster, N. Laird and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.
[2] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
[3] C. Yang, R. Duraiswami, N. Gumerov and L. Davis. Improved fast Gauss transform and efficient kernel density estimation. In IEEE International Conference on Computer Vision, pages 464–471, 2003.
[4] Charles Elkan. Using the triangle inequality to accelerate k-means. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), 2003.
[5] Haifeng Li, Keshu Zhang and Tao Jiang. The regularized EM algorithm. In Proceedings of the 20th National Conference on Artificial Intelligence, pages 807–812, Pittsburgh, PA, 2005.
[6] Greg Hamerly and Charles Elkan. Learning the k in k-means. In Neural Information Processing Systems. MIT Press, 2003.
[7] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881–892, July 2002.
[8] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.
[9] C. F. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.