
Clustering

What is Cluster Analysis, k-Means, Adaptive Initialization, EM, Learning Mixture Gaussians, E-step, M-step, k-Means vs Mixture of Gaussians

k-Means Clustering


Feature space: a sample of n points

$\{\vec{x}(1), \vec{x}(2), \ldots, \vec{x}(k), \ldots, \vec{x}(n)\}$

$\vec{x} = (x_1, x_2, \ldots, x_d)^T \in \mathbb{R}^d$

Euclidean distance:

$\|\vec{x} - \vec{y}\| = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$

Norm

$\|x\| \ge 0$, with equality only if $x = 0$; $\|\lambda x\| = |\lambda|\,\|x\|$; $\|x_1 + x_2\| \le \|x_1\| + \|x_2\|$

lp norm

$\|\vec{x}\|_p = \left( \sum_{i=1}^{d} |x_i|^p \right)^{1/p}$

Metric

$d(x,y) \ge 0$, with equality only if $x = y$; $d(x,y) = d(y,x)$; $d(x,y) \le d(x,z) + d(z,y)$

$d_2(\vec{x}, \vec{z}) = \left( \sum_{i=1}^{d} (x_i - z_i)^2 \right)^{1/2}$
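To make these definitions concrete, here is a minimal NumPy sketch (not part of the original slides; the names lp_norm and d2 are my own) that evaluates the lp norm and the Euclidean metric:

```python
import numpy as np

def lp_norm(x, p=2):
    """l_p norm: (sum_i |x_i|^p)^(1/p)."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

def d2(x, z):
    """Euclidean metric d_2(x, z) = sqrt(sum_i (x_i - z_i)^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.sqrt(np.sum(diff ** 2))

x = np.array([3.0, 4.0])
print(lp_norm(x, p=1))    # 7.0 (Manhattan norm)
print(lp_norm(x, p=2))    # 5.0 (Euclidean norm)
print(d2([0.0, 0.0], x))  # 5.0
```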

k-means Clustering

Cluster centers $c_1, c_2, \ldots, c_k$ with clusters $C_1, C_2, \ldots, C_k$

$d_2(\vec{x}, \vec{z}) = \left( \sum_{i=1}^{d} (x_i - z_i)^2 \right)^{1/2}$

Error

The error function is

$E = \sum_{j=1}^{k} \sum_{\vec{x} \in C_j} d_2(\vec{x}, c_j)^2$

k-means converges to a local minimum of $E$.

k-means Example (K=2)

Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!

Algorithm

Random initialization of k cluster centers
do {
    - assign each xi in the dataset to the nearest cluster center (centroid) cj according to d2
    - compute all new cluster centers
} until ( |Enew - Eold| < ε or number of iterations ≥ max_iterations )
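The loop above translates directly into Python/NumPy. The sketch below is mine, not the slides' (the names kmeans, eps and seed are assumptions), and follows the pseudocode with the stated convergence test:

```python
import numpy as np

def kmeans(X, k, max_iterations=100, eps=1e-6, seed=None):
    """Plain k-means: random init, assign each point to its nearest centroid, recompute centroids."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Random initialization of k cluster centers, picked from the data points
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    E_old = np.inf
    for _ in range(max_iterations):
        # Assign each x_i to the nearest cluster center (centroid) c_j according to d_2
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Compute all new cluster centers
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        # Error E = sum_j sum_{x in C_j} d_2(x, c_j)^2
        E_new = float(np.sum((X - centers[labels]) ** 2))
        if abs(E_new - E_old) < eps:
            break
        E_old = E_new
    return centers, labels, E_new
```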

Adaptive k-means learning (batch mode) for large datasets

Random initialization of cluster centers
do {
    choose xi from the dataset
    cj* ← nearest cluster center (centroid) cj according to d2
    update cj* as below
} until ( |Enew - Eold| < ε or number of iterations ≥ max_iterations )

$\vec{c}_{j^*}^{\,new} = \vec{c}_{j^*}^{\,old} + \frac{1}{|C_{j^*}^{old}| + 1}\left(\vec{x} - \vec{c}_{j^*}^{\,old}\right)$
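A minimal sketch of this update, assuming NumPy (the name adaptive_kmeans_step and the counts bookkeeping are my own):

```python
import numpy as np

def adaptive_kmeans_step(x, centers, counts):
    """One adaptive update: pull the nearest centroid c_j* toward x by 1/(|C_j*_old| + 1)."""
    x = np.asarray(x, dtype=float)
    j = int(np.argmin(np.linalg.norm(centers - x, axis=1)))  # nearest centroid according to d_2
    counts[j] += 1                                           # now holds |C_j*_old| + 1
    centers[j] += (x - centers[j]) / counts[j]               # c_j*_new = c_j*_old + (x - c_j*_old)/(|C_j*_old|+1)
    return centers, counts
```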

How to choose k? You have to know your data!

Repeated runs of k-means clustering on the same data can lead to quite different partition results. Why? Because we use random initialization.


Adaptive Initialization

Choose a maximum radius within which every data point should have a cluster seed after completion of the initialization phase.

In a single sweep, go through the data and assign cluster seeds according to the chosen radius: a data point becomes a new cluster seed if it is not covered by the spheres (of the chosen radius) around the already assigned seeds.

K-MAI clustering (Wichert et al. 2003)
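The single-sweep idea can be sketched as below; this only illustrates the radius-based seeding described above, not necessarily the exact K-MAI procedure of Wichert et al. (2003), and radius_seeds is a hypothetical name:

```python
import numpy as np

def radius_seeds(X, radius):
    """Single sweep: a point becomes a new seed if it is not within `radius` of any existing seed."""
    seeds = []
    for x in np.asarray(X, dtype=float):
        if not seeds or np.min(np.linalg.norm(np.array(seeds) - x, axis=1)) > radius:
            seeds.append(x)
    return np.array(seeds)
```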

EM

Expectation Maximization Clustering

Feature space: a sample of n points

$\{\vec{x}(1), \vec{x}(2), \ldots, \vec{x}(k), \ldots, \vec{x}(n)\}$

$\vec{x} = (x_1, x_2, \ldots, x_d)^T \in \mathbb{R}^d$

Mahalanobis distance:

$\|\vec{x} - \vec{\mu}\|_m = (\vec{x} - \vec{\mu})^T\, \Sigma^{-1}\, (\vec{x} - \vec{\mu})$
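A small NumPy sketch of this quantity (mine, not the slides'; as on the slide, no square root is taken):

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """(x - mu)^T Sigma^{-1} (x - mu), as written on the slide (no square root taken)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(diff @ np.linalg.inv(Sigma) @ diff)

# With Sigma = I this reduces to the squared Euclidean distance
print(mahalanobis([1.0, 2.0], [0.0, 0.0], np.eye(2)))  # 5.0
```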

Bayes’s rule

After the evidence is obtained, the posterior probability P(a|b) is the probability of a given that all we know is b.

(Reverend Thomas Bayes, 1702-1761)

$P(b \mid a) = \frac{P(a \mid b)\, P(b)}{P(a)}$

Covariance

Covariance measures the tendency of two features xi and xj to vary in the same direction. The covariance between features xi and xj, estimated from n patterns, is

$c_{ij} = \frac{\sum_{k=1}^{n} \left(x_i(k) - m_i\right)\left(x_j(k) - m_j\right)}{n - 1}$

$\Sigma = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1d} \\ c_{21} & c_{22} & \cdots & c_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ c_{d1} & c_{d2} & \cdots & c_{dd} \end{bmatrix}$
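As an illustration (not from the slides; covariance_matrix is my own name), the estimator above in NumPy, checked against np.cov:

```python
import numpy as np

def covariance_matrix(X):
    """Estimate c_ij = sum_k (x_i(k) - m_i)(x_j(k) - m_j) / (n - 1) from n patterns (rows of X)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    centered = X - X.mean(axis=0)        # subtract the feature means m_i
    return centered.T @ centered / (n - 1)

# Sanity check against NumPy's own estimator (rows = patterns, columns = features)
X = np.random.default_rng(0).normal(size=(100, 3))
assert np.allclose(covariance_matrix(X), np.cov(X, rowvar=False))
```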

Learning Mixture Gaussians

What kind of probability distribution might have generated the data?

Clustering presumes that the data are generated from a mixture distribution P.

The Normal Density: univariate density

A density that is analytically tractable; a continuous density; a lot of processes are asymptotically Gaussian.

$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$

where $\mu$ = mean (or expected value) of x and $\sigma^2$ = expected squared deviation, or variance.
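A one-line NumPy version of this density (mine, not the slides'; normal_pdf is an assumed name):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate normal density P(x) = exp(-((x - mu)/sigma)^2 / 2) / (sqrt(2*pi) * sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

print(normal_pdf(0.0, mu=0.0, sigma=1.0))  # ~0.3989, the standard normal at its mean
```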

Example: Mixture of 2 Gaussians


Multivariate density

Multivariate normal density in d dimensions is:

$P(\vec{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\vec{x} - \vec{\mu})^t\, \Sigma^{-1}\, (\vec{x} - \vec{\mu})\right]$

where:

x = (x1, x2, …, xd)^t (t stands for the transpose vector form)
μ = (μ1, μ2, …, μd)^t is the mean vector
Σ is the d×d covariance matrix
|Σ| and Σ^{-1} are its determinant and inverse, respectively

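The multivariate density translates directly to NumPy; this sketch is mine (multivariate_normal_pdf is an assumed name) and is reused by the EM sketches later on:

```python
import numpy as np

def multivariate_normal_pdf(x, mu, Sigma):
    """P(x) = exp(-(x - mu)^t Sigma^{-1} (x - mu) / 2) / ((2*pi)^(d/2) |Sigma|^(1/2))."""
    x, mu, Sigma = np.asarray(x, float), np.asarray(mu, float), np.asarray(Sigma, float)
    d = mu.shape[0]
    diff = x - mu
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm)
```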

Example: Mixture of 3 Gaussians

A mixture distribution has k components, each of which is a distribution in its own right.

A data point is generated by first choosing a component and then generating a sample from that component.

Let C denote the component, with values 1, …, k. The mixture distribution is given by

$P(x) = \sum_{i=1}^{k} P(C = i)\, P(x \mid C = i)$

where x refers to the data point, $w_i = P(C = i)$ is the weight of each component, $\mu_i$ the mean (vector) of each component, and $\Sigma_i$ the covariance (matrix) of each component.

Each component density $P(x \mid C = i)$ is a multivariate normal:

$P(\vec{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\vec{x} - \vec{\mu})^t\, \Sigma^{-1}\, (\vec{x} - \vec{\mu})\right]$

$1 = \sum_{i=1}^{k} P(C = i)$
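The generative process (choose a component, then sample from it) can be sketched as follows; this is an illustration of mine, not the slides' code, and sample_mixture is an assumed name:

```python
import numpy as np

def sample_mixture(n, weights, means, covs, seed=None):
    """Generate n points: first choose a component C=i with probability w_i = P(C=i),
    then draw the point from that component's Gaussian."""
    rng = np.random.default_rng(seed)
    components = rng.choice(len(weights), size=n, p=weights)
    return np.array([rng.multivariate_normal(means[i], covs[i]) for i in components])

# Example: mixture of 3 Gaussians in two dimensions
X = sample_mixture(300,
                   weights=[0.5, 0.3, 0.2],
                   means=[[0.0, 0.0], [4.0, 4.0], [-4.0, 3.0]],
                   covs=[np.eye(2)] * 3,
                   seed=0)
```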

If we knew which component generated each data point, then it would be easy to recover the component Gaussians

We could fit the parameters of a Gaussian to a data set

$P(\vec{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\vec{x} - \vec{\mu})^t\, \Sigma^{-1}\, (\vec{x} - \vec{\mu})\right]$

Basic EM idea

Pretend that we know the parameters of the model. Infer the probability that each data point belongs to each component. Refit each component to the data, where each component is fitted to the entire data set and each point is weighted by the probability that it belongs to that component.

Algorithm: we initialize the mixture parameters arbitrarily.

E-step (expectation): compute the probabilities $p_{ij} = P(C = i \mid x_j)$, the probability that $x_j$ was generated by component i.

By Bayes' rule, $p_{ij} \propto P(x_j \mid C = i)\, P(C = i)$, normalized over the components i.

• $P(x_j \mid C = i)$ is just the probability at $x_j$ of the i-th Gaussian
• $P(C = i)$ is just the weight parameter of the i-th Gaussian

$p_i = \sum_{j=1}^{n} p_{ij}$
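A minimal E-step sketch, assuming NumPy and reusing the multivariate_normal_pdf helper defined earlier (e_step is my own name):

```python
import numpy as np

def e_step(X, weights, means, covs):
    """E-step: p_ij = P(C=i | x_j), proportional to P(x_j | C=i) * P(C=i), normalized over i."""
    n, k = len(X), len(weights)
    p = np.empty((k, n))
    for i in range(k):
        # P(x_j | C=i): density of the i-th Gaussian at x_j (multivariate_normal_pdf from above)
        p[i] = weights[i] * np.array([multivariate_normal_pdf(x, means[i], covs[i]) for x in X])
    p /= p.sum(axis=0, keepdims=True)   # normalize so that sum_i p_ij = 1 for every j
    return p                            # p[i, j] = p_ij; p.sum(axis=1) gives the p_i
```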

M-step (maximization): with $w_i = P(C = i)$,

$\vec{\mu}_i \leftarrow \frac{\sum_{j=1}^{n} p_{ij}\, \vec{x}_j}{p_i}$

$\Sigma_i \leftarrow \frac{\sum_{j=1}^{n} p_{ij}\, \vec{x}_j \vec{x}_j^{T}}{p_i}$

$w_i \leftarrow p_i$
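A companion M-step sketch (mine, not the slides' code; m_step is an assumed name). It follows the slide's update rules, including the $\vec{x}_j \vec{x}_j^T$ form of the covariance update; the only addition is that the weights $w_i$ are rescaled to sum to one:

```python
import numpy as np

def m_step(X, p):
    """M-step: refit each component from the membership probabilities p_ij (slide's update rules)."""
    X = np.asarray(X, dtype=float)
    p_i = p.sum(axis=1)                                  # p_i = sum_j p_ij
    means = (p @ X) / p_i[:, None]                       # mu_i <- sum_j p_ij x_j / p_i
    outer = np.einsum('ja,jb->jab', X, X)                # the x_j x_j^T terms
    covs = [(p[i][:, None, None] * outer).sum(axis=0) / p_i[i] for i in range(len(p_i))]
    weights = p_i / p_i.sum()                            # w_i <- p_i, rescaled to sum to one
    return weights, means, covs
```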


Problems

A Gaussian component can shrink so that it covers just a single point: the variance goes to zero and the likelihood goes to infinity. Two components can “merge”, acquiring identical means and variances and sharing their data points. These are serious problems, especially in high dimensions.

It helps to initialize the parameters with reasonable values.

k-Means vs Mixture of Gaussians

Both are iterative algorithms to assign points to clusters.

k-Means: minimize

$E = \sum_{j=1}^{k} \sum_{\vec{x} \in C_j} d_2(\vec{x}, c_j)^2$

Mixture of Gaussians: maximize $P(x \mid C = i)$, where

$P(\vec{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\vec{x} - \vec{\mu})^t\, \Sigma^{-1}\, (\vec{x} - \vec{\mu})\right]$

Mixture of Gaussians is the more general formulation. It is equivalent to k-Means when $\Sigma_i = I$ and

$P(C = i) = \begin{cases} \frac{1}{k} & C = i \\ 0 & \text{else} \end{cases}$

What is Cluster Analysis, k-Means, Adaptive Initialization, EM, Learning Mixture Gaussians, E-step, M-step, k-Means vs Mixture of Gaussians

Tree Clustering, COBWEB
