Data Mining Course 2007 Eric Postma Clustering

Data Mining Course 2007

Eric Postma

Clustering

Overview

Three approaches to clustering

1. Minimization of reconstruction error
• PCA, nlPCA, k-means clustering

2. Distance preservation
• Sammon mapping, Isomap, SPE

3. Maximum likelihood density estimation
• Gaussian Mixtures

• These datasets have identical statistics up to 2nd order
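The point of this slide can be reproduced numerically. The sketch below is not the data shown on the slide: it builds two hypothetical 2D datasets (a Gaussian blob and a noisy ring), whitens both so that their sample mean and covariance coincide, and then shows that a higher-order quantity (the spread of the radii) still tells them apart. The names `whiten` and `radius_spread` are illustrative.

```python
import numpy as np

def whiten(X):
    """Shift and linearly rescale X so its sample mean is 0 and its covariance is I."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    # Symmetric inverse square root of the covariance matrix.
    W = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    return Xc @ W

def radius_spread(X):
    """Standard deviation of the distances to the origin (not a 2nd-order statistic)."""
    return np.std(np.linalg.norm(X, axis=1))

rng = np.random.default_rng(0)
n = 2000

# Hypothetical dataset 1: an isotropic Gaussian blob.
blob = whiten(rng.normal(size=(n, 2)))

# Hypothetical dataset 2: points on a ring with a little noise.
angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
ring = whiten(np.column_stack([np.cos(angles), np.sin(angles)])
              + 0.05 * rng.normal(size=(n, 2)))

print("means:      ", blob.mean(axis=0), ring.mean(axis=0))
print("covariances:", np.cov(blob, rowvar=False), np.cov(ring, rowvar=False), sep="\n")
print("radius spread (blob, ring):", radius_spread(blob), radius_spread(ring))
```

Any method that only looks at means and covariances (such as PCA) cannot distinguish these two sets, which motivates the non-linear techniques that follow.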

1. Minimization of reconstruction error

Illustration of PCA (1)

• Face dataset (Rice database)

Illustration of PCA (2)

• Average face

Illustration of PCA (3)

• Top 10 Eigenfaces

Each 39-dimensional data item describes different aspects of the welfare and poverty of one country.

2D PCA projection
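A 2D PCA projection of this kind can be sketched as follows. The welfare and poverty data are not part of these slides, so the example uses random 39-dimensional vectors as a stand-in; `pca_project` is an illustrative name. The function also reports the mean squared reconstruction error, which is the quantity PCA minimizes among linear projections.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X onto its top principal components.

    Returns the low-dimensional coordinates, the component vectors
    (one per row), and the mean squared reconstruction error.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    Y = Xc @ components.T                 # low-dimensional coordinates
    X_hat = Y @ components + mean         # reconstruction in the original space
    error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    return Y, components, error

# Stand-in data (assumption): 100 items with 39 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 39)) @ rng.normal(size=(39, 39))

Y, components, error = pca_project(X, n_components=2)
print(Y.shape, error)
```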

Non-linear PCA

• Using neural networks (to be discussed tomorrow)

2. Distance preservation

Sammon mapping

• Given a data set X, the distance between any two samples i and j is defined as D_ij

• We consider the projection onto a two-dimensional plane, where the projected points are separated by d_ij

• Define an error function

$$E = \frac{1}{\sum_{i<j} D_{ij}} \sum_{i<j} \frac{(D_{ij} - d_{ij})^2}{D_{ij}}$$
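The error function can be evaluated directly from the two sets of pairwise distances. The sketch below assumes Euclidean distances in both spaces and distinct input points (so that no D_ij is zero); the function names are illustrative. It only computes the stress and does not perform the gradient descent that the Sammon mapping uses to minimize it.

```python
import numpy as np

def pairwise_distances(X):
    """Matrix of Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_stress(X, Y):
    """Sammon error E between original points X and projected points Y."""
    upper = np.triu_indices(len(X), k=1)      # each pair i < j once
    D = pairwise_distances(X)[upper]          # distances in the input space
    d = pairwise_distances(Y)[upper]          # distances in the projection
    return np.sum((D - d) ** 2 / D) / np.sum(D)

# Hypothetical example: project 3D points by dropping the third coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
stress_2d = sammon_stress(X, X[:, :2])
print(stress_2d)
```

Dividing each term by D_ij weights small input distances more heavily, so the mapping favours preserving local structure.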

Sammon mapping

Main limitations of Sammon

• The Sammon mapping procedure is a gradient descent method

• Main limitation: local minima

• Classical MDS may be preferred because it finds the global minimum (it reduces to an eigendecomposition, as in PCA)

• Both methods have difficulty with “curved or curly subspaces”

Isomap

• Tenenbaum
• Build a graph in which each node represents a data point
• Compute shortest distances along the graph (e.g., Dijkstra’s algorithm)
• Store all distances in a matrix D
• Perform MDS on the matrix D
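The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the graph is a symmetrized K-nearest-neighbour graph, shortest paths are computed with Floyd-Warshall rather than Dijkstra's algorithm (simpler to write, but cubic in the number of points), and the final step is classical MDS. The test data (points on a half circle) are hypothetical.

```python
import numpy as np

def isomap(X, n_neighbors=7, n_components=2):
    """Return the embedding and the matrix D of graph shortest-path distances."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # 1. Graph: connect each point to its nearest neighbours (symmetrized).
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    nearest = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]  # column 0 is the point itself
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = nearest.ravel()
    graph[rows, cols] = dist[rows, cols]
    graph = np.minimum(graph, graph.T)

    # 2. Shortest distances along the graph (Floyd-Warshall).
    for k in range(n):
        graph = np.minimum(graph, graph[:, k:k + 1] + graph[k:k + 1, :])
    if not np.isfinite(graph).all():
        raise ValueError("neighbourhood graph is disconnected; increase n_neighbors")

    # 3. Classical MDS on the distance matrix D.
    centring = np.eye(n) - 1.0 / n
    gram = -0.5 * centring @ (graph ** 2) @ centring
    eigval, eigvec = np.linalg.eigh(gram)
    top = np.argsort(eigval)[::-1][:n_components]
    Y = eigvec[:, top] * np.sqrt(np.maximum(eigval[top], 0.0))
    return Y, graph

# Hypothetical data: 100 points on a half circle of radius 1.
t = np.linspace(0.0, np.pi, 100)
arc = np.column_stack([np.cos(t), np.sin(t)])
Y, geodesic = isomap(arc, n_neighbors=2, n_components=1)

print("Euclidean distance between the end points:", np.linalg.norm(arc[0] - arc[-1]))
print("Graph distance between the end points:    ", geodesic[0, -1])
```

For the half circle, the Euclidean distance between the end points is the diameter, whereas the graph distance follows the arc, which is what the one-dimensional embedding preserves.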

Illustration of Isomap (1)

• For two arbitrary points on the manifold, Euclidean distance does not always reflect similarity (cf. dashed blue line)

Illustration of Isomap (2)

• Isomap finds the appropriate shortest path along the graph (red curve, for K=7, N=1000)

Illustration of Isomap (3)

• Two-dimensional embedding (red line is the shortest path along the graph, blue line is the true distance in the embedding).

Illustration of Isomap (4)

• Isomap’s (●) ability to find the intrinsic dimensionality, compared with PCA and MDS (∆ and o).

Illustration of Isomap (5)

Illustration of Isomap (6)

Illustration of Isomap (7)

• Interpolation along a straight line

Stochastic Proximity Embedding

• SPE algorithm

• Agrafiotis, D.K. and Xu, H. (2002). A self-organizing principle for learning nonlinear manifolds. Proceedings of the National Academy of Sciences U.S.A.

Stress function

$$f(d_{ij}, r_{ij}) = (d_{ij} - r_{ij})^2$$

• d_ij: output proximity between points i and j
• r_ij: input proximity between points i and j
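A minimal sketch of how SPE reduces this stress, following the pairwise update rule as I understand it from Agrafiotis and Xu (2002): pick a random pair of points, compare their output distance d_ij with the input proximity r_ij, and move both points along the line joining them to reduce the difference, with a learning rate that decreases over time. The neighbourhood cutoff `rc`, the schedule parameters, and the test data are assumptions made for illustration.

```python
import numpy as np

def stress(Y, R):
    """Mean of (d_ij - r_ij)^2 over all pairs i < j."""
    upper = np.triu_indices(len(Y), k=1)
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)[upper]
    return np.mean((d - R[upper]) ** 2)

def spe(R, Y0, n_cycles=100, n_steps=500, lam_start=1.0, lam_end=0.01,
        rc=None, eps=1e-10, seed=0):
    """Stochastic proximity embedding by repeated pairwise refinement.

    R  : matrix of input proximities r_ij
    Y0 : initial output coordinates
    rc : optional cutoff; pairs with r_ij > rc are only pushed apart,
         never pulled together (None means every pair is always updated)
    """
    rng = np.random.default_rng(seed)
    Y = np.array(Y0, dtype=float)
    n = len(Y)
    for lam in np.linspace(lam_start, lam_end, n_cycles):   # decreasing learning rate
        for i, j in rng.integers(0, n, size=(n_steps, 2)):
            if i == j:
                continue
            d = np.linalg.norm(Y[i] - Y[j])
            r = R[i, j]
            if rc is None or r <= rc or d < r:
                step = lam * 0.5 * (r - d) / (d + eps) * (Y[i] - Y[j])
                Y[i] += step
                Y[j] -= step
    return Y

# Hypothetical test: recover a 2D configuration from its distance matrix.
rng = np.random.default_rng(1)
points = rng.uniform(size=(50, 2))
R = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
Y0 = rng.uniform(size=(50, 2))
Y = spe(R, Y0)

print("stress before:", stress(Y0, R))
print("stress after: ", stress(Y, R))
```

Each step touches only one pair of points, so the cost per step does not depend on the size of the data set.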

Swiss roll data set

Original 3D set; 2D embedding obtained by SPE

Stress as a function of embedding dimension (averaged over 30 runs)

Scalability (# steps for four set sizes)

Linear scaling

Conformations of methylpropylether (C1C2C3O4C5)

Diamine combinatorial library

Clustering

• Minimize the total within-cluster variance (reconstruction error)

$$E = \sum_{c=1}^{C} \sum_{i=1}^{N} k_{ic} \lVert x_i - w_c \rVert^2$$

• k_ic = 1 if data point i belongs to cluster c, and 0 otherwise

K-means clustering

1. Random selection of C cluster centres
2. Partition the data by assigning each point to its nearest cluster centre
3. The mean of each partition is the new cluster centre

A distance threshold may be used…
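The three steps can be sketched as follows, assuming Euclidean distance and repeating steps 2 and 3 until the centres stop moving (the slide does not state a stopping rule, so that part is an assumption). The optional `init` argument and the two-blob test data are illustrative additions; the distance threshold mentioned above is not implemented.

```python
import numpy as np

def assign(X, centres):
    """Index of the nearest centre for every data point (step 2)."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def kmeans(X, n_clusters, init=None, n_iter=100, seed=0):
    """Return cluster centres w_c, labels, and the reconstruction error E."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Step 1: random selection of C data points as initial centres.
        centres = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    else:
        centres = np.array(init, dtype=float)
    for _ in range(n_iter):
        labels = assign(X, centres)
        new_centres = centres.copy()
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members) > 0:          # keep the old centre if a cluster is empty
                new_centres[c] = members.mean(axis=0)   # step 3
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    labels = assign(X, centres)
    error = np.sum((X - centres[labels]) ** 2)   # E: total within-cluster variance
    return centres, labels, error

# Hypothetical data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])
centres, labels, error = kmeans(X, n_clusters=2)
print(centres, error)
```

Running `kmeans` with different `seed` values shows how the result can depend on the initial configuration.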

• Effect of distance threshold on the number of clusters

Main limitation of k-means clustering

• Final partitioning and cluster centres depend on initial configuration

• Discrete partitioning may introduce errors

• Instead of minimizing the reconstruction error, we may maximize the likelihood of the data (given some probabilistic model)

Neural algorithms related to k-means

• Kohonen self-organizing feature maps

• Competitive learning networks

3. Maximum likelihood

Gaussian Mixtures

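• Model the pdf of the data using a mixture of distributions

$$p(x) = \sum_{i=1}^{K} p(x \mid i)\, P(i)$$

• K is the number of kernels (<< # data points)
• Common choice for the component densities p(x|i):

$$p(x \mid i) = \frac{1}{(2\pi\sigma_i^2)^{d/2}} \exp\left(-\frac{\lVert x - \mu_i \rVert^2}{2\sigma_i^2}\right)$$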

Illustration of EM applied to GM model

The solid line gives the initialization of the EM algorithm: two kernels, P(1) = P(2) = 0.5, μ1 = 0.0752, μ2 = 1.0176, σ1 = σ2 = 0.2356

Convergence after 10 EM steps.
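A sketch of EM for a one-dimensional Gaussian mixture, started from the initialization quoted on the previous slide and run for 10 steps. The data used in the original illustration are not available, so the sample below is drawn from a hypothetical two-component mixture; only the initial parameter values come from the slides. The function name `em_gmm_1d` is illustrative.

```python
import numpy as np

def em_gmm_1d(x, P, mu, sigma, n_steps=10):
    """EM for a 1D Gaussian mixture; returns P(i), mu_i, sigma_i and the log-likelihoods."""
    P, mu, sigma = (np.array(v, dtype=float) for v in (P, mu, sigma))
    log_lik = []
    for _ in range(n_steps):
        # E-step: posterior probability of each kernel i for each data point.
        dens = (np.exp(-(x[:, None] - mu) ** 2 / (2.0 * sigma ** 2))
                / np.sqrt(2.0 * np.pi * sigma ** 2))       # p(x|i)
        joint = dens * P                                   # p(x|i) P(i)
        total = joint.sum(axis=1, keepdims=True)           # p(x)
        log_lik.append(np.log(total).sum())
        resp = joint / total
        # M-step: re-estimate the parameters from the weighted data.
        weight = resp.sum(axis=0)
        P = weight / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / weight
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / weight)
    return P, mu, sigma, np.array(log_lik)

# Hypothetical data (assumption): 150 samples from each of two Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.2, 0.10, size=150),
                    rng.normal(0.9, 0.15, size=150)])

# Initialization as given on the slide.
P, mu, sigma, log_lik = em_gmm_1d(x, P=[0.5, 0.5], mu=[0.0752, 1.0176],
                                  sigma=[0.2356, 0.2356], n_steps=10)
print("P:", P, "mu:", mu, "sigma:", sigma)
print("log-likelihood per step:", log_lik)
```

The log-likelihood recorded at each step should never decrease, which is a useful sanity check on any EM implementation.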

Relevant literature

• L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik (submitted). Dimensionality Reduction: A Comparative Review.

• http://www.cs.unimaas.nl/l.vandermaaten