K-means and GMM
TRANSCRIPT
Network Intelligence and Analysis Lab
Clustering methods via EM algorithm
2014.07.10 Sanghyuk Chun

Machine Learning and Unsupervised Learning
• Machine Learning
  • Training data
  • Learning model
• Unsupervised Learning
  • Training data without labels
  • Input data: D = {x_1, x_2, …, x_N}
  • Most unsupervised learning problems try to find hidden structure in unlabeled data
  • Examples: clustering, dimensionality reduction (PCA, LDA), …

Unsupervised Learning and Clustering
• Clustering
  • Grouping objects in such a way that objects in the same group are more similar to each other than to objects in other groups
  • Input: a set of objects (or data) without group information
  • Output: a cluster index for each object
• Usage: customer segmentation, image segmentation, …
[Figure: input data → clustering algorithm → clustered output]

K-means Clustering
• Introduction
• Optimization

K-means Clustering
• Intuition: data points in the same cluster are closer to each other than to data points in other clusters
• Goal: minimize the distance between data points in the same cluster
• Objective function:

  J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| \mathbf{x}_n - \boldsymbol{\mu}_k \|^2

  • where N is the number of data points and K is the number of clusters
  • r_{nk} ∈ {0, 1} are indicator variables describing which of the K clusters the data point x_n is assigned to
  • μ_k is a prototype associated with the k-th cluster
  • Eventually μ_k is the same as the center (mean) of cluster k

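As an illustration (not part of the original slides), a minimal NumPy sketch of this objective, assuming data X of shape (N, D), one-hot assignments r of shape (N, K), and means mu of shape (K, D):

    import numpy as np

    def kmeans_objective(X, r, mu):
        # X: (N, D) data, r: (N, K) one-hot assignment matrix r_nk, mu: (K, D) cluster means
        # Squared Euclidean distance from every point to every mean: shape (N, K)
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        # J sums only the distances to each point's assigned cluster
        return float((r * sq_dist).sum())
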
K-means Clustering: Optimization
• Objective function:

  \min_{\{r_{nk}, \boldsymbol{\mu}_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| \mathbf{x}_n - \boldsymbol{\mu}_k \|^2

• This function can be minimized through an iterative procedure
  • Step 1: minimize J with respect to the r_{nk}, keeping μ_k fixed
  • Step 2: minimize J with respect to the μ_k, keeping r_{nk} fixed
  • Repeat Steps 1 and 2 until convergence
• Does it always converge?

Optional: Biconvex Optimization
• Biconvex optimization is a generalization of convex optimization in which the objective function and the constraint set can be biconvex
• f(x, y) is biconvex if, fixing x, f_x(y) = f(x, y) is convex over Y and, fixing y, f_y(x) = f(x, y) is convex over X
• One way to solve a biconvex optimization problem is to iteratively solve the corresponding convex subproblems
  • This does not guarantee a globally optimal point
  • But it always converges to some local optimum

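A simple illustrative example (added here, not on the slide): f(x, y) = xy is biconvex but not jointly convex.

    % Fixing x, the map y -> xy is linear, hence convex; fixing y, the map x -> xy is linear, hence convex.
    % Jointly, however, the Hessian has eigenvalues +1 and -1, so f is not convex in (x, y) together:
    f(x, y) = xy, \qquad
    \nabla^2 f(x, y) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
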
K-means Clustering: Optimization
• Objective:

  \min_{\{r_{nk}, \boldsymbol{\mu}_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| \mathbf{x}_n - \boldsymbol{\mu}_k \|^2

• Step 1: minimize J with respect to the r_{nk}, keeping μ_k fixed

  r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| \mathbf{x}_n - \boldsymbol{\mu}_j \|^2 \\ 0 & \text{otherwise} \end{cases}

• Step 2: minimize J with respect to the μ_k, keeping r_{nk} fixed
  • Setting the derivative with respect to μ_k to zero gives

    2 \sum_n r_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k) = 0
    \quad\Rightarrow\quad
    \boldsymbol{\mu}_k = \frac{\sum_n r_{nk} \mathbf{x}_n}{\sum_n r_{nk}}

  • μ_k is equal to the mean of all the data points assigned to cluster k

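Putting the two steps together, a minimal NumPy sketch of the full iteration (illustrative only; X is (N, D) data and mu is a (K, D) array of initial means):

    import numpy as np

    def kmeans(X, mu, n_iters=100):
        for _ in range(n_iters):
            # Step 1: hard-assign each point to its nearest prototype (the r_nk update)
            sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
            assign = sq_dist.argmin(axis=1)                                  # (N,)
            # Step 2: move each prototype to the mean of its assigned points
            new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                               for k in range(len(mu))])
            if np.allclose(new_mu, mu):                                      # converged
                break
            mu = new_mu
        return mu, assign
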
K-means Clustering: Conclusion
• Advantages of K-means clustering
  • Easy to implement (kmeans in Matlab, kcluster in Python)
  • In practice, it works well
• Disadvantages of K-means clustering
  • It can converge to a local optimum
  • Computing the Euclidean distance for every data point is expensive
    • Solution: mini-batch K-means
  • The Euclidean distance is not robust to outliers
    • Solution: K-medoids algorithms (use a different metric)

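For reference, a usage sketch with scikit-learn (one common Python implementation, in addition to the kcluster routine mentioned above; MiniBatchKMeans is the library's mini-batch variant):

    import numpy as np
    from sklearn.cluster import KMeans, MiniBatchKMeans

    X = np.random.randn(1000, 2)                               # toy data
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)    # standard (Lloyd's) K-means
    mb_labels = MiniBatchKMeans(n_clusters=3).fit_predict(X)   # cheaper per-iteration updates
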
Mixture of Gaussians
• Mixture Model
• EM Algorithm
• EM for Gaussian Mixtures

Mixture of Gaussians
• Assumption: there are K components {c_k}, k = 1, …, K
  • Component c_k has an associated mean vector μ_k
  • Each component generates data from a Gaussian with mean μ_k and covariance matrix Σ_k
[Figure: example with five Gaussian components centered at means μ_1, …, μ_5]

Gaussian Mixture Model
• Represent the model as a linear combination of Gaussians
• Probability density function of a GMM:

  p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

  \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}

• This is called a mixture of Gaussians or Gaussian Mixture Model
• Each Gaussian density is called a component of the mixture and has its own mean μ_k and covariance Σ_k
• The parameters π_k are called mixing coefficients (\sum_k \pi_k = 1)

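A minimal SciPy sketch of evaluating this density (assuming mixing coefficients pi of shape (K,), means mu of shape (K, D), and covariances Sigma of shape (K, D, D)):

    from scipy.stats import multivariate_normal

    def gmm_pdf(x, pi, mu, Sigma):
        # p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
        return sum(pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
                   for k in range(len(pi)))
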
Clustering using a Mixture Model
• p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k), where \sum_k \pi_k = 1
• Input:
  • The training set: {x_n}, n = 1, …, N
  • Number of clusters: K
• Goal: model this data using a mixture of Gaussians
  • Mixing coefficients π_1, π_2, …, π_K
  • Means and covariances: μ_1, μ_2, …, μ_K; Σ_1, Σ_2, …, Σ_K

Maximum Likelihood of GMM
• p(x \mid G) = p(x \mid \pi_1, \mu_1, …) = \sum_k p(x \mid c_k)\, p(c_k) = \sum_k \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)
• p(x_1, x_2, …, x_N \mid G) = \prod_n p(x_n \mid G)
• The log-likelihood function is given by

  \ln p(\mathbf{X} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) \right\}

• Goal: find the parameters which maximize the log-likelihood
• Problem: the maximum likelihood solution is hard to compute directly
• Solution: use the EM algorithm

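A numerically stable sketch of this log-likelihood (same parameter shapes as above; the log-sum-exp trick is an added detail, not shown on the slide):

    import numpy as np
    from scipy.special import logsumexp
    from scipy.stats import multivariate_normal

    def gmm_log_likelihood(X, pi, mu, Sigma):
        # log p(X) = sum_n log { sum_k pi_k N(x_n | mu_k, Sigma_k) }
        log_terms = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
                              for k in range(len(pi))], axis=1)   # (N, K)
        return float(logsumexp(log_terms, axis=1).sum())
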
EM (Expectation-Maximization) Algorithm
• The EM algorithm is an iterative procedure for finding the maximum likelihood estimate (MLE)
• An expectation (E) step creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters
• A maximization (M) step computes parameters maximizing the expected log-likelihood found in the E step
• These parameter estimates are then used to determine the distribution of the latent variables in the next E step
• EM always converges to a local optimum

K-means Revisited: EM and K-means
• Objective:

  \min_{\{r_{nk}, \boldsymbol{\mu}_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| \mathbf{x}_n - \boldsymbol{\mu}_k \|^2

• E-step: minimize J with respect to the r_{nk}, keeping μ_k fixed

  r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| \mathbf{x}_n - \boldsymbol{\mu}_j \|^2 \\ 0 & \text{otherwise} \end{cases}

• M-step: minimize J with respect to the μ_k, keeping r_{nk} fixed

  \boldsymbol{\mu}_k = \frac{\sum_n r_{nk} \mathbf{x}_n}{\sum_n r_{nk}}

Latent Variable for GMM
• Let z_k be a Bernoulli random variable with probability π_k
  • p(z_k = 1) = π_k, where \sum_k z_k = 1 and \sum_k \pi_k = 1
• Because z uses a 1-of-K representation, this distribution can be written in the form

  p(z) = \prod_{k=1}^{K} \pi_k^{z_k}

• Similarly, the conditional distribution of x given a particular value of z is a Gaussian:

  p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}

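A small sketch of this generative story (illustrative names; draw z from a categorical distribution with probabilities pi, then draw x from the selected Gaussian component):

    import numpy as np

    def sample_gmm(pi, mu, Sigma, n_samples=1, rng=None):
        rng = rng or np.random.default_rng()
        # z_n ~ Categorical(pi): choose one component per sample (1-of-K coding)
        z = rng.choice(len(pi), size=n_samples, p=pi)
        # x_n | z_n ~ N(mu_{z_n}, Sigma_{z_n})
        x = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
        return x, z
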
Latent Variable for GMM
• The joint distribution is given by p(x, z) = p(z)\, p(x \mid z)
• p(x) = \sum_z p(z)\, p(x \mid z) = \sum_k \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)
• Thus the marginal distribution of x is a Gaussian mixture of the above form
• Now we are able to work with the joint distribution instead of the marginal distribution
• Graphical representation of a GMM for a set of N i.i.d. data points {x_n} with corresponding latent variables {z_n}, where n = 1, …, N
[Figure: plate-notation graphical model with latent z_n and observed x_n inside a plate over n = 1, …, N, and shared parameters π, μ, Σ]

EM for Gaussian Mixtures (E-step)
• Conditional probability of z given x
• From Bayes' theorem,

  \gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x})
    = \frac{p(z_k = 1)\, p(\mathbf{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(\mathbf{x} \mid z_j = 1)}
    = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \mu_j, \Sigma_j)}

• γ(z_k) can also be viewed as the responsibility that component k takes for "explaining" the observation x

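A minimal E-step sketch in NumPy/SciPy (returns the (N, K) matrix of responsibilities γ(z_nk) for a whole data set X; parameter shapes as in the earlier sketches):

    import numpy as np
    from scipy.stats import multivariate_normal

    def e_step(X, pi, mu, Sigma):
        # gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
        weighted = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                             for k in range(len(pi))], axis=1)   # (N, K)
        return weighted / weighted.sum(axis=1, keepdims=True)
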
EM for Gaussian Mixtures (M-step)
• Likelihood function for the GMM:

  \ln p(\mathbf{X} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) \right\}

• Setting the derivatives of the log-likelihood with respect to the means μ_k of the Gaussian components to zero, we obtain

  \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n, \qquad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk})

EM for Gaussian Mixtures (M-step)
• Setting the derivatives of the log-likelihood with respect to Σ_k to zero, we obtain

  \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \mu_k)(\mathbf{x}_n - \mu_k)^\top

• Maximizing the likelihood with respect to the mixing coefficients π_k by using a Lagrange multiplier,

  \ln p(\mathbf{X} \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right),

  we obtain

  \pi_k = \frac{N_k}{N}

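A minimal M-step sketch in NumPy implementing these three updates (X of shape (N, D), responsibilities gamma of shape (N, K) from the E-step above):

    import numpy as np

    def m_step(X, gamma):
        N, D = X.shape
        Nk = gamma.sum(axis=0)                      # N_k = sum_n gamma(z_nk), shape (K,)
        mu = (gamma.T @ X) / Nk[:, None]            # mu_k = (1/N_k) sum_n gamma(z_nk) x_n
        Sigma = np.empty((len(Nk), D, D))
        for k in range(len(Nk)):
            diff = X - mu[k]                        # (N, D)
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N                                 # pi_k = N_k / N
        return pi, mu, Sigma
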
EM for Gaussian Mixtures
• The updates for μ_k, Σ_k, π_k do not constitute a closed-form solution for the parameters of the mixture model, because the responsibilities γ(z_nk) depend on those parameters in a complex way:

  \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j)}

• In the EM algorithm for a GMM, γ(z_nk) and the parameters are optimized iteratively
  • In the E step, the responsibilities (posterior probabilities) are evaluated using the current values of the parameters
  • In the M step, the means, covariances, and mixing coefficients are re-estimated using these responsibilities

EM for Gaussian Mixtures
• Initialize the means μ_k, covariances Σ_k, and mixing coefficients π_k, and evaluate the initial value of the log-likelihood
• E step: evaluate the responsibilities using the current parameters

  \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j)}

• M step: re-estimate the parameters using the current responsibilities

  \mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n

  \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \mu_k^{\text{new}})(\mathbf{x}_n - \mu_k^{\text{new}})^\top

  \pi_k^{\text{new}} = \frac{N_k}{N}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})

• Repeat the E step and the M step until convergence

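In practice these iterations are available off the shelf; a usage sketch with scikit-learn's GaussianMixture, which fits a GMM by this EM procedure (toy data, illustrative settings):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.randn(500, 2)                                    # toy data
    gmm = GaussianMixture(n_components=3, covariance_type='full').fit(X)

    labels = gmm.predict(X)            # hard cluster assignments
    resp = gmm.predict_proba(X)        # responsibilities gamma(z_nk)
    params = (gmm.weights_, gmm.means_, gmm.covariances_)
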
Relationship between the K-means Algorithm and GMM
• We can derive the K-means algorithm as a particular limit of EM for the Gaussian mixture model
• Consider a Gaussian mixture model whose covariance matrices are given by εI, where ε is a variance parameter and I is the identity matrix
• If we consider the limit ε → 0, the expected complete-data log-likelihood of the GMM becomes

  \mathbb{E}_z\!\left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \mu, \Sigma, \pi) \right] \;\to\; -\frac{1}{2} \sum_{n} \sum_{k} r_{nk} \| \mathbf{x}_n - \boldsymbol{\mu}_k \|^2 + C

• Thus we see that in this limit, maximizing the expected complete-data log-likelihood is equivalent to the K-means algorithm

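A brief supporting step for this limit (following the standard argument, e.g. in Bishop's Pattern Recognition and Machine Learning): with Σ_k = εI, the responsibilities harden into the K-means indicator variables as ε → 0.

  \gamma(z_{nk})
    = \frac{\pi_k \exp\{-\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 / 2\epsilon\}}
           {\sum_j \pi_j \exp\{-\|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 / 2\epsilon\}}
    \;\longrightarrow\;
    r_{nk} =
    \begin{cases}
      1 & \text{if } k = \arg\min_j \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 \\
      0 & \text{otherwise}
    \end{cases}
    \qquad (\epsilon \to 0)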