Data Mining Course 2007
Eric Postma
Clustering
Overview
Three approaches to clustering
1. Minimization of reconstruction error
• PCA, nlPCA, k-means clustering
2. Distance preservation
• Sammon mapping, Isomap, SPE
3. Maximum likelihood density estimation
• Gaussian Mixtures
• These datasets have identical statistics up to 2nd order
1. Minimization of reconstruction error
Illustration of PCA (1)
• Face dataset (Rice database)
Illustration of PCA (2)
• Average face
Illustration of PCA (3)
• Top 10 Eigenfaces
Each 39-dimensional data item describes different aspects of the welfare and poverty of one country.
2D PCA projection
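As a concrete sketch, a 2D PCA projection like the one above can be computed with plain NumPy via the SVD of the centred data (an illustrative implementation, not the course's own code; the example data are synthetic):

```python
import numpy as np

def pca_2d(X):
    """Project data onto its top-2 principal components.

    X: (n_samples, n_features) array.
    Returns the 2-D projection and the two leading components.
    """
    Xc = X - X.mean(axis=0)                 # centre the data (cf. the "average face")
    # SVD of the centred data: rows of Vt are the principal axes,
    # ordered by decreasing singular value (= decreasing variance)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:2]                     # top-2 eigen-directions
    projection = Xc @ components.T          # 2-D coordinates of each sample
    return projection, components

# Usage: synthetic 39-dimensional data with decreasing per-feature variance
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 39)) * np.linspace(3, 0.1, 39)
Y, comps = pca_2d(X)
```

Minimizing the squared reconstruction error over linear projections is exactly what keeping the top singular directions achieves.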
Non-linear PCA
• Using neural networks (to be discussed tomorrow)
2. Distance preservation
Sammon mapping
• Given a data set X, the distance between any two samples i and j in the input space is defined as Dij
• Consider the projection onto a two-dimensional plane in which the projected points are separated by distances dij
• Define an error (stress) function:
$$ E = \frac{1}{\sum_{i<j} D_{ij}} \sum_{i<j} \frac{\left(D_{ij} - d_{ij}\right)^2}{D_{ij}} $$
Sammon mapping
Main limitations of Sammon
• The Sammon mapping procedure is a gradient descent method
• Main limitation: local minima
• Classical MDS may be preferred because it finds a global minimum in closed form (being based on an eigendecomposition, like PCA)
• Both methods have difficulty with “curved or curly subspaces”
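A minimal sketch of Sammon mapping as plain gradient descent on the stress E (the learning rate, iteration count, and random initialisation are illustrative assumptions; Sammon's original procedure uses a Newton-like step):

```python
import numpy as np

def sammon(X, n_iter=500, lr=0.3, seed=0):
    """Sammon mapping sketch: gradient descent on the Sammon stress.

    X: (n, d) data; returns an (n, 2) embedding.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # pairwise input distances D_ij
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scale = D[np.triu_indices(n, 1)].sum()     # normalising constant sum_{i<j} D_ij
    Y = rng.normal(scale=1e-2, size=(n, 2))    # small random 2-D initialisation
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.linalg.norm(diff, axis=-1)      # output distances d_ij
        np.fill_diagonal(d, 1.0)               # avoid division by zero on the diagonal
        Dd = D.copy()
        np.fill_diagonal(Dd, 1.0)
        # per-pair weight (d_ij - D_ij) / (D_ij d_ij) from dE/dy_i
        w = (d - D) / (d * Dd + 1e-12)
        np.fill_diagonal(w, 0.0)
        grad = (2.0 / scale) * (w[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad                         # first-order descent step
    return Y
```

Because this is plain gradient descent, different random initialisations can end in different local minima, which is exactly the limitation noted above.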
Isomap
• Tenenbaum et al.
• Build a graph in which each node represents a data point (e.g., connect each point to its K nearest neighbours)
• Compute shortest distances along the graph (e.g., Dijkstra’s algorithm)
• Store all distances in a matrix D
• Perform MDS on the matrix D
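The steps above can be sketched as follows (assuming NumPy plus SciPy's `shortest_path`; the K-nearest-neighbour graph construction is an assumption consistent with the K mentioned in the illustrations, and the classical-MDS step is spelled out explicitly):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=7, n_components=2):
    """Isomap sketch: kNN graph -> geodesic distances -> classical MDS."""
    n = X.shape[0]
    # pairwise Euclidean distances in the input space
    E = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # keep only edges to the K nearest neighbours (inf = no edge)
    G = np.full((n, n), np.inf)
    idx = np.argsort(E, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        G[i, idx[i]] = E[i, idx[i]]
    G = np.minimum(G, G.T)                      # symmetrise the neighbourhood graph
    # geodesic distances: shortest paths along the graph (Dijkstra)
    D = shortest_path(G, method="D", directed=False)
    # classical MDS on the squared geodesic distances
    J = np.eye(n) - np.ones((n, n)) / n         # double-centring matrix
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
```

Note that the graph must be connected (K large enough), otherwise some geodesic distances are infinite and the MDS step fails.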
Illustration of Isomap (1)
• For two arbitrary points on the manifold, the Euclidean distance does not always reflect similarity (cf. the dashed blue line)
Illustration of Isomap (2)
• Isomap finds the appropriate shortest path along the graph (red curve, for K=7, N=1000)
Illustration of Isomap (3)
• Two-dimensional embedding (red line is the shortest path along the graph, blue line is the true distance in the embedding)
Illustration of Isomap (4)
• Isomap’s (●) ability to find the intrinsic dimensionality as compared to PCA and MDS (∆ and o)
Illustration of Isomap (5)
Illustration of Isomap (6)
Illustration of Isomap (7)
• Interpolation along a straight line
Stochastic Proximity Embedding
• SPE algorithm
• Agrafiotis, D.K. and Xu, H. (2002). A self-organizing principle for learning nonlinear manifolds. Proceedings of the National Academy of Sciences U.S.A.
Stress function
• dij: output proximity between points i and j; rij: input proximity between points i and j

$$ f(d_{ij}, r_{ij}) = (d_{ij} - r_{ij})^2 $$
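The SPE idea of Agrafiotis & Xu can be sketched as a pairwise stochastic refinement: repeatedly pick a random pair of points and nudge their embedded positions so the output distance moves toward the input proximity, with a decaying learning rate. The specific decay schedule, epsilon, and initialisation below are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def spe(X, n_components=2, n_cycles=20000, lam=1.0, eps=1e-7, seed=0):
    """Stochastic Proximity Embedding sketch (after Agrafiotis & Xu, 2002)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Y = rng.uniform(size=(n, n_components))   # random initial embedding
    decay = lam / n_cycles                    # linear learning-rate decay
    for _ in range(n_cycles):
        i, j = rng.integers(0, n, size=2)
        if i == j:
            continue
        r = np.linalg.norm(X[i] - X[j])       # input proximity r_ij
        d = np.linalg.norm(Y[i] - Y[j])       # current output distance d_ij
        # move the pair apart if d < r, together if d > r
        step = lam * 0.5 * (r - d) / (d + eps)
        Y[i] += step * (Y[i] - Y[j])
        Y[j] += step * (Y[j] - Y[i])
        lam -= decay
    return Y
```

Because each cycle touches only one pair, the cost per step is constant, which is what gives SPE its linear scaling in the number of points.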
Swiss roll data set
Original 3D set 2D embedding obtained by SPE
Stress as a function of embedding dimension (averaged over 30 runs)
Scalability (# steps for four set sizes): linear scaling
Conformations of methyl propyl ether (C1–C2–C3–O4–C5)
Diamine combinatorial library
Clustering
• Minimize the total within-cluster variance (reconstruction error)
$$ E = \sum_{c=1}^{C} \sum_{i=1}^{N} k_{ic}\, \lVert x_i - w_c \rVert^2 $$

• kic = 1 if data point i belongs to cluster c (and 0 otherwise)

K-means clustering
1. Random selection of C cluster centres
2. Partition the data by assigning each point to its nearest cluster centre
3. The mean of each partition becomes the new cluster centre

A distance threshold may be used…
• Effect of distance threshold on the number of clusters
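The three k-means steps can be sketched in NumPy as follows (initialising the centres from random data points and stopping when the centres no longer move are illustrative choices):

```python
import numpy as np

def kmeans(X, C, n_iter=100, seed=0):
    """k-means sketch: alternate assignment and centre-update steps."""
    rng = np.random.default_rng(seed)
    # step 1: random selection of C cluster centres (drawn from the data)
    w = X[rng.choice(X.shape[0], size=C, replace=False)]
    for _ in range(n_iter):
        # step 2: partition the data by assigning each point to its nearest centre
        d = np.linalg.norm(X[:, None, :] - w[None, :, :], axis=-1)
        k = d.argmin(axis=1)
        # step 3: the mean of each partition is the new cluster centre
        new_w = np.array([X[k == c].mean(axis=0) if np.any(k == c) else w[c]
                          for c in range(C)])
        if np.allclose(new_w, w):              # centres stopped moving
            break
        w = new_w
    return w, k
```

Each iteration can only decrease the within-cluster error E, but the final partitioning still depends on the initial configuration, as noted below.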
Main limitation of k-means clustering
• Final partitioning and cluster centres depend on initial configuration
• Discrete partitioning may introduce errors
• Instead of minimizing the reconstruction error, we may maximize the likelihood of the data (given some probabilistic model)
Neural algorithms related to k-means
• Kohonen self-organizing feature maps
• Competitive learning networks
3. Maximum likelihood
Gaussian Mixtures
• Model the pdf of the data using a mixture of distributions
• K is the number of kernels (K << # data points)

$$ p(x) = \sum_{i=1}^{K} p(x \mid i)\, P(i) $$

• Common choice for the component densities p(x|i): isotropic Gaussians

$$ p(x \mid i) = \frac{1}{(2\pi \sigma_i^2)^{d/2}} \exp\!\left( -\frac{\lVert x - \mu_i \rVert^2}{2 \sigma_i^2} \right) $$
Illustration of EM applied to GM model
The solid line gives the initialization of the EM algorithm: two kernels, P(1) = P(2) = 0.5, μ1 = 0.0752, μ2 = 1.0176, σ1 = σ2 = 0.2356
Convergence after 10 EM steps.
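The EM fit shown above can be sketched for the 1-D case as follows (the quantile-based initialisation of the means is an illustrative choice, not the initialization used in the figure):

```python
import numpy as np

def em_gmm_1d(x, K=2, n_iter=20):
    """EM sketch for a 1-D Gaussian mixture with K kernels.

    Alternates the E-step (posterior responsibility of each kernel for
    each point) with the M-step (re-estimation of P(i), mu_i, sigma_i),
    which increases the likelihood of the data at every step.
    """
    n = len(x)
    P = np.full(K, 1.0 / K)                          # mixing proportions P(i)
    mu = np.quantile(x, np.linspace(0.25, 0.75, K))  # spread initial means over the data
    sigma = np.full(K, x.std())                      # initial kernel widths
    for _ in range(n_iter):
        # E-step: responsibilities p(i | x_n) via Bayes' rule
        dens = P * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
                 / (np.sqrt(2 * np.pi) * sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted re-estimation of the parameters
        Nk = resp.sum(axis=0)
        P = Nk / n
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return P, mu, sigma
```

On well-separated bimodal data a handful of EM steps already places the two kernels on the two modes, matching the rapid convergence seen in the illustration.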
Relevant literature
• L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik (submitted). Dimensionality Reduction: A Comparative Review.
• http://www.cs.unimaas.nl/l.vandermaaten