AACIMP 2011 Summer School. Operational Research Stream. Lecture by Erik Kropat. (Transcript)
Cluster Analysis
Summer School
“Achievements and Applications of Contemporary Informatics,
Mathematics and Physics” (AACIMP 2011)
August 8-20, 2011, Kiev, Ukraine
Erik Kropat
University of the Bundeswehr Munich Institute for Theoretical Computer Science,
Mathematics and Operations Research
Neubiberg, Germany
The Knowledge Discovery Process
Raw Data → PREPROCESSING → Preprocessed Data → DATA MINING → Patterns → PATTERN EVALUATION → Knowledge
• Preprocessing: standardizing, handling missing values / outliers
• Data mining: patterns, clusters, correlations, automated classification, outlier / anomaly detection, association rule learning, …
• Pattern evaluation: knowledge for strategic planning
Clustering
Clustering
… is a tool for data analysis that solves classification problems.
Problem
Given n observations, split them into K similar groups.
Question
How can we define “similarity” ?
Similarity
A cluster is a set of entities which are alike,
and entities from different clusters are not alike.
Distance
A cluster is an aggregation of points such that
the distance between any two points in the cluster
is less than
the distance between any point in the cluster and any point not in it.
Density
Clusters may be described as
connected regions of a multidimensional space containing a relatively high density of points,
separated from other such regions by a region containing a relatively low density of points.
Min-Max Problem
Homogeneity: Objects within the same cluster should be similar to each other.
Separation: Objects in different clusters should be dissimilar from each other.
similarity ⇔ distance (distance between objects, distance between clusters)
Types of Clustering
• Hierarchical Clustering: agglomerative or divisive
• Partitional Clustering
Similarity and Distance
Distance Measures
A metric on a set G is a function d: G x G → R+ that satisfies the following conditions:
(D1) d(x, y) = 0 ⇔ x = y (identity)
(D2) d(x, y) = d(y, x) ≥ 0 for all x, y ∈ G (symmetry & non-negativity)
(D3) d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z ∈ G (triangle inequality)
Examples: Minkowski Distance
d_r(x, y) = ( Σ_{i=1}^{n} |xi − yi|^r )^{1/r}, r ∈ [1, ∞), x, y ∈ Rn
o r = 1: Manhattan distance
o r = 2: Euclidean distance
Euclidean Distance
d2(x, y) = ( Σ_{i=1}^{n} (xi − yi)² )^{1/2}, x, y ∈ Rn
Example: x = (1, 1), y = (4, 3)
d2(x, y) = ( (1 − 4)² + (1 − 3)² )^{1/2} = √13
Manhattan Distance
d1(x, y) = Σ_{i=1}^{n} |xi − yi|, x, y ∈ Rn
Example: x = (1, 1), y = (4, 3)
d1(x, y) = |1 − 4| + |1 − 3| = 3 + 2 = 5
Maximum Distance
d∞(x, y) = max_{1 ≤ i ≤ n} |xi − yi|, x, y ∈ Rn
Example: x = (1, 1), y = (4, 3)
d∞(x, y) = max(3, 2) = 3
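As a quick check, these distances can be computed directly; a minimal NumPy sketch reproducing the worked examples for x = (1, 1) and y = (4, 3):

```python
import numpy as np

def minkowski(x, y, r):
    # d_r(x, y) = ( sum_i |x_i - y_i|^r )^(1/r), r in [1, inf)
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x = np.array([1.0, 1.0])
y = np.array([4.0, 3.0])

print(minkowski(x, y, 1))      # Manhattan distance: 5.0
print(minkowski(x, y, 2))      # Euclidean distance: sqrt(13) ≈ 3.61
print(np.max(np.abs(x - y)))   # maximum distance (r → ∞): 3.0
```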
Similarity Measures
A similarity function on a set G is a function S: G × G → R that satisfies the following conditions:
(S1) S(x, y) ≥ 0 for all x, y ∈ G (non-negativity)
(S2) S(x, y) ≤ S(x, x) for all x, y ∈ G (auto-similarity)
(S3) S(x, y) = S(x, x) ⇔ x = y for all x, y ∈ G (identity)
The value of the similarity function is greater when two points are closer.
Similarity Measures
• There are many different definitions of similarity.
• Often used:
(S4) S(x, y) = S(y, x) for all x, y ∈ G (symmetry)
Hierarchical Clustering
Dendrogram
[Figure: cluster dendrogram — gross national product of EU countries, agriculture (1993); Euclidean distance, complete linkage. Source: www.isa.uni-stuttgart.de/lehre/SAHBD]
Hierarchical Clustering
Hierarchical clustering creates a hierarchy of clusters of the set G.
Agglomerative clustering: Clusters are successively merged together
Divisive clustering: Clusters are recursively split
Agglomerative Clustering
Merge the two clusters with the smallest distance between them:
Step 0: {e1}, {e2}, {e3}, {e4} — 4 clusters
Step 1: {e1, e2}, {e3}, {e4} — 3 clusters
Step 2: {e1, e2, e3}, {e4} — 2 clusters
Step 3: {e1, e2, e3, e4} — 1 cluster
Divisive Clustering
Choose a cluster that is optimally split into two clusters according to a given criterion:
Step 0: {e1, e2, e3, e4} — 1 cluster
Step 1: {e1, e2}, {e3, e4} — 2 clusters
Step 2: {e1, e2}, {e3}, {e4} — 3 clusters
Step 3: {e1}, {e2}, {e3}, {e4} — 4 clusters
Agglomerative Clustering
INPUT
Given n objects G = { e1, ..., en }, represented by p-dimensional feature vectors x1, ..., xn ∈ Rp:

Object | Feature 1 | Feature 2 | Feature 3 | … | Feature p
x1 = ( x11, x12, x13, …, x1p )
x2 = ( x21, x22, x23, …, x2p )
⁞
xn = ( xn1, xn2, xn3, …, xnp )
Example I
An online shop collects data from its customers. For each of the n customers there exists a p-dimensional feature vector.
Example II
In a clinical trial, laboratory values of a large number of patients are gathered. For each of the n patients there exists a p-dimensional feature vector.
Agglomerative Algorithms
• Begin with the disjoint clustering C1 = { {e1}, {e2}, ..., {en} }.
• Iterate: find the most similar pair of clusters and merge them into a single cluster.
• Terminate when all objects are in one cluster Cn = { {e1, e2, ..., en} }.
This yields a sequence of clusterings (Ci)i=1,...,n of G with Ci−1 ⊂ Ci for i = 2,...,n.
What is the distance between two clusters?
⇒ Various hierarchical clustering algorithms
Agglomerative Hierarchical Clustering
There exist many metrics to measure the distance between clusters. They lead to particular agglomerative clustering methods:
• Single-Linkage Clustering
• Complete-Linkage Clustering
• Average Linkage Clustering
• Centroid Method
• . . .
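These methods are also available in standard libraries. A sketch using SciPy (assumed available), where the method argument selects the inter-cluster distance from the list above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(10, 2)   # 10 objects with p = 2 features

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                     # (n-1) x 4 merge history
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, labels)
```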
Single-Linkage Clustering
Nearest-Neighbor-Method
The distance between the clusters A and B is the minimum distance between the elements of each cluster: d(A, B) = min { d(a, b) | a ∈ A, b ∈ B }
Single-Linkage Clustering
• Advantage: Can detect very long and even curved clusters. Can be used to detect outliers.
• Drawback: Chaining phenomenon. Clusters that are far apart may be forced together because single elements of each cluster are close to each other.
Complete-Linkage Clustering
Furthest-Neighbor-Method
The distance between the clusters A and B is the maximum distance between the elements of each cluster:
d(A,B) = max { d(a,b) | a ∈ A, b ∈ B }
Complete-Linkage Clustering
• … tends to find compact clusters of approximately equal diameters.
• … avoids the chaining phenomenon.
• … cannot be used for outlier detection.
Average-Linkage Clustering
The distance between the clusters A and B is the mean distance between the elements of each cluster:
d(A, B) = (1 / (|A| ⋅ |B|)) Σ_{a ∈ A, b ∈ B} d(a, b)
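The single-, complete-, and average-linkage distances are direct translations of their definitions; a small NumPy sketch (Euclidean base distance, illustrative point sets):

```python
import numpy as np

def pairwise(A, B):
    # all Euclidean distances d(a, b) for a in A, b in B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def d_single(A, B):   return pairwise(A, B).min()    # nearest neighbor
def d_complete(A, B): return pairwise(A, B).max()    # furthest neighbor
def d_average(A, B):  return pairwise(A, B).mean()   # mean over |A|*|B| pairs

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 3.0], [5.0, 3.0]])
print(d_single(A, B), d_complete(A, B), d_average(A, B))
```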
Centroid Method
The distance between the clusters A and B is the (squared) Euclidean distance of the cluster centroids.
Agglomerative Hierarchical Clustering
[Figure: the four inter-cluster distances d(A, B) — single linkage, complete linkage, average linkage, centroid — illustrated side by side]
Bioinformatics
Alizadeh et al., Nature 403 (2000): pp.503–511
Exercise
The following table shows the distances between 4 cities:

         Kiev   Odessa   Berlin   Paris
Kiev       –      440     1200    2000
Odessa    440      –      1400    2100
Berlin   1200    1400       –      900
Paris    2000    2100     900       –

Determine a hierarchical clustering with the single-linkage method.
Solution - Single Linkage
Step 0: Clustering {Kiev}, {Odessa}, {Berlin}, {Paris}
Distances between clusters:

         Kiev   Odessa   Berlin   Paris
Kiev       –      440     1200    2000
Odessa    440      –      1400    2100
Berlin   1200    1400       –      900
Paris    2000    2100     900       –

Minimal distance: 440 ⇒ merge clusters {Kiev} and {Odessa}.
Solution - Single Linkage
Step 1: Clustering {Kiev, Odessa}, {Berlin}, {Paris}
Distances between clusters:

               Kiev, Odessa   Berlin   Paris
Kiev, Odessa         –         1200    2000
Berlin             1200          –      900
Paris              2000         900      –

Minimal distance: 900 ⇒ merge clusters {Berlin} and {Paris}.
Solution - Single Linkage
Step 2: Clustering {Kiev, Odessa}, {Berlin, Paris}
Distances between clusters:

                Kiev, Odessa   Berlin, Paris
Kiev, Odessa          –            1200
Berlin, Paris       1200             –

Minimal distance: 1200 ⇒ merge clusters {Kiev, Odessa} and {Berlin, Paris}.
Solution - Single Linkage
Step 3: Clustering
{Kiev, Odessa, Berlin, Paris}
Solution - Single Linkage
Hierarchy (dendrogram):
• 4 clusters below distance 440
• 3 clusters: {Kiev} and {Odessa} merge at distance 440
• 2 clusters: {Berlin} and {Paris} merge at distance 900
• 1 cluster: {Kiev, Odessa} and {Berlin, Paris} merge at distance 1200
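The exercise can be reproduced with SciPy (a sketch, assuming SciPy is available); single linkage on the given distance matrix recovers the merge heights 440, 900, and 1200:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# order: Kiev, Odessa, Berlin, Paris
D = np.array([
    [   0,  440, 1200, 2000],
    [ 440,    0, 1400, 2100],
    [1200, 1400,    0,  900],
    [2000, 2100,  900,    0],
])

Z = linkage(squareform(D), method="single")
print(Z)   # each row: the two merged clusters and the merge distance
```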
Divisive Clustering
Divisive Algorithms
• Begin with one cluster C1 = { {e1, e2, ..., en} }.
• Iterate: choose a cluster Cf that is optimally split into two clusters Ci and Cj according to a given criterion.
• Terminate when all objects are in disjoint clusters Cn = { {e1}, {e2}, ..., {en} }.
This yields a sequence of clusterings (Ci)i=1,...,n of G with Ci ⊃ Ci+1 for i = 1,...,n−1.
Partitional Clustering
Minimal Distance Methods
Partitional Clustering
• Aims to partition n observations into K clusters.
• The number of clusters and an initial partition are given.
• The initial partition is considered "not optimal" and is iteratively repartitioned.
Note: the number of clusters K is given in advance!
[Figure: an initial partition with K = 2 is iteratively improved into the final partition]
Partitional Clustering
Differences from hierarchical clustering:
• The number of clusters is fixed.
• An object can change its cluster.
The initial partition is obtained
• at random, or
• by applying a hierarchical clustering algorithm in advance.
The number of clusters can be estimated with
• specialized methods (e.g., silhouette analysis), or
• a hierarchical clustering algorithm applied in advance.
Partitional Clustering - Methods
In this course we will introduce the minimal distance methods:
• K-Means and
• Fuzzy-c-Means
K-Means
Aims to partition n observations into K clusters in which each observation belongs to the cluster with the nearest mean.
Find K cluster centroids µ1, ..., µK that minimize the objective function
J = Σ_{i=1}^{K} Σ_{x ∈ Ci} dist(µi, x)²
[Figure: a set G partitioned into clusters C1, C2, C3, centroids marked ×]
K-Means - Minimal Distance Method
Given: n objects, K clusters
1. Determine an initial partition.
2. Calculate the cluster centroids.
3. For each object, calculate the distances to all cluster centroids.
4. If the distance to the centroid of another cluster is smaller than the distance to the current cluster's centroid, assign the object to the other cluster (repartition).
5. If any cluster was repartitioned: GOTO 2. Else: STOP.
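A minimal NumPy sketch of the minimal distance method above; the random initial partition, seed, and iteration cap are illustrative choices, and non-empty clusters are assumed throughout:

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))    # 1. initial partition
    for _ in range(max_iter):
        # 2. centroids of the current clusters (assumes none is empty)
        mu = np.array([X[labels == i].mean(axis=0) for i in range(K)])
        # 3. distance of every object to every centroid
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)
        # 4. assign each object to its nearest centroid (repartition)
        new_labels = dist.argmin(axis=1)
        # 5. stop as soon as the partition no longer changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, mu
```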
Example
[Figure: an initial partition (centroids marked ×) converging to the final partition]
Exercise
[Figure: given the initial partition (centroids marked ×), determine the final partition]
K-Means
• K-Means does not necessarily find the globally optimal partition.
• The final partition obtained by K-Means depends on the initial partition.
Hard Clustering / Soft Clustering
• Hard clustering (e.g., K-Means): each object is a member of exactly one cluster.
• Soft clustering (e.g., Fuzzy-c-Means): each object has a fractional membership in all clusters.
Fuzzy-c-Means
• When clusters are well separated, hard clustering (K-Means) makes sense.
• In many cases, clusters are not well separated.
In hard clustering, borderline objects are assigned to a cluster in an arbitrary manner.
Fuzzy Clustering vs. Hard Clustering
Fuzzy Set Theory
• Fuzzy theory was introduced by Lotfi Zadeh in 1965.
• An object can belong to a set with a degree of membership between 0 and 1.
• Classical set theory is a special case of fuzzy theory that restricts membership values to be either 0 or 1.
Fuzzy Clustering
• Is based on fuzzy logic and fuzzy set theory.
• Objects can belong to more than one cluster.
• Each object belongs to all clusters with some weight (degree of membership).
[Figure: membership degrees between 0 and 1 for Cluster 1, Cluster 2, and Cluster 3]
Hard Clustering
• K-Means
− The number K of clusters is given.
− Each object is assigned to exactly one cluster (a partition):

Cluster   e1   e2   e3   e4
C1         0    1    0    0
C2         1    0    0    0
C3         0    0    1    1
Fuzzy Clustering
• Fuzzy-c-Means
− The number c of clusters is given.
− Each object has a fractional membership in all clusters; there is no strict subdivision into clusters.

Cluster   e1    e2    e3    e4
C1       0.8   0.2   0.1   0.0
C2       0.2   0.2   0.2   0.0
C3       0.0   0.6   0.7   1.0
Σ        1.0   1.0   1.0   1.0
Fuzzy-c-Means
• Membership Matrix U = ( u_ik ) ∈ [0, 1]^{c×n}
The entry u_ik denotes the degree of membership of object k in cluster i.

            Object 1   Object 2   …   Object n
Cluster 1     u11        u12      …     u1n
Cluster 2     u21        u22      …     u2n
⁞
Cluster c     uc1        uc2      …     ucn
Restrictions (Membership Matrix)
1. All weights for a given object ek must add up to 1:
   Σ_{i=1}^{c} u_ik = 1   (k = 1,...,n)
2. Each cluster contains, with non-zero weight, at least one object, but does not contain, with a weight of one, all the objects:
   0 < Σ_{k=1}^{n} u_ik < n   (i = 1,...,c)
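A quick NumPy check of both restrictions on the example membership matrix from the fuzzy clustering table above:

```python
import numpy as np

U = np.array([[0.8, 0.2, 0.1, 0.0],    # rows: clusters C1..C3
              [0.2, 0.2, 0.2, 0.0],    # columns: objects e1..e4
              [0.0, 0.6, 0.7, 1.0]])

c, n = U.shape
assert np.allclose(U.sum(axis=0), 1.0)          # restriction 1: columns sum to 1
row_sums = U.sum(axis=1)
assert np.all((row_sums > 0) & (row_sums < n))  # restriction 2
```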
Fuzzy-c-Means
• Vector of prototypes (cluster centroids): V = ( v1, ..., vc )^T, vi ∈ Rp
Remark: The cluster centroids and the membership matrix are initialized randomly and afterwards iteratively optimized.
Fuzzy-c-Means
ALGORITHM
1. Select an initial fuzzy partition U = ( u_ik ), i.e., assign values to all u_ik.
2. Repeat:
3.   Compute the centroid of each cluster using the fuzzy partition.
4.   Update the fuzzy partition U = ( u_ik ).
5. Until the centroids do not change.
Other stopping criterion: the change in the u_ik is below a given threshold.
Fuzzy-c-Means
• K-Means and Fuzzy-c-Means attempt to minimize the sum of the squared errors (SSE).
• In K-Means: SSE = Σ_{i=1}^{K} Σ_{x ∈ Ci} dist(vi, x)²
• In Fuzzy-c-Means: SSE = Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik^m dist(vi, xk)²
m ∈ [1, ∞) is a parameter (fuzzifier) that determines the influence of the weights.
[Figure: object xk with membership degrees u1k, u2k, u3k to the centroids v1, v2, v3]
Computing Cluster Centroids
• For each cluster i = 1,...,c the centroid is defined by
  vi = ( Σ_{k=1}^{n} u_ik^m xk ) / ( Σ_{k=1}^{n} u_ik^m )   (V)
• This is an extension of the definition of centroids of K-Means.
• All points are considered, and the contribution of each point to the centroid is weighted by its membership degree.
Update of the Fuzzy Partition (Membership Matrix)
• Minimization of the SSE subject to the constraints leads to the following update formula:
  u_ik = 1 / Σ_{s=1}^{c} ( dist(vi, xk)² / dist(vs, xk)² )^{1/(m−1)}   (U)
Fuzzy-c-Means
Initialization
Determine (randomly)
• the matrix U of membership grades,
• the matrix V of cluster centroids.
Iteration
Calculate updates of
• the matrix U of membership grades with (U),
• the matrix V of cluster centroids with (V),
until the cluster centroids are stable or the maximum number of iterations is reached.
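Putting (V) and (U) together gives a compact fuzzy-c-means loop; a NumPy sketch in which the fuzzifier m, the tolerance, and the seed are illustrative choices:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                   # columns sum to 1
    V = rng.choice(X, size=c, replace=False)             # random initial centroids
    for _ in range(max_iter):
        W = U ** m
        V_new = (W @ X) / W.sum(axis=1, keepdims=True)   # update (V)
        d2 = ((X[None, :, :] - V_new[:, None, :]) ** 2).sum(-1)
        d2 = np.fmax(d2, 1e-12)                          # guard against zero distance
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)                        # update (U)
        if np.abs(V_new - V).max() < tol:                # centroids stable?
            V = V_new
            break
        V = V_new
    return U, V
```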
Fuzzy-c-means
• Fuzzy-c-Means depends on the Euclidean metric ⇒ spherical clusters.
• Other metrics can be applied to obtain different cluster shapes.
• Fuzzy covariance matrix (Gustafson/Kessel 1979) ⇒ ellipsoidal clusters.
Cluster Validity Indexes
Cluster Validity Indexes
Fuzzy-c-Means requires the number of clusters as input.
Question: How can we determine the “optimal” number of clusters?
Idea: Determine the cluster partition for a given number of clusters. Then evaluate the cluster partition by a cluster validity index.
Method: For every feasible number of clusters, calculate the cluster validity index. Then determine the optimal number of clusters.
Note: Cluster validity indexes (CVIs) usually do not depend on the clustering algorithm.
Cluster Validity Indexes
• Partition Coefficient (Bezdek 1981)
  PC(c) = (1/n) Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik²,   2 ≤ c ≤ n−1
• Optimal number of clusters c∗: PC(c∗) = max_{2 ≤ c ≤ n−1} PC(c)
Cluster Validity Indexes
• Partition Entropy (Bezdek 1974)
  PE(c) = −(1/n) Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik log2 u_ik,   2 ≤ c ≤ n−1
• Optimal number of clusters c∗: PE(c∗) = min_{2 ≤ c ≤ n−1} PE(c)
• Drawback of PC and PE: only the degrees of membership are considered; the geometry of the data set is neglected.
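Both indexes are one-liners over the membership matrix U (shape c × n); a sketch, with PC to be maximized and PE minimized over candidate values of c:

```python
import numpy as np

def partition_coefficient(U):
    # PC(c) = (1/n) * sum_i sum_k u_ik^2   (maximize)
    return (U ** 2).sum() / U.shape[1]

def partition_entropy(U, eps=1e-12):
    # PE(c) = -(1/n) * sum_i sum_k u_ik * log2(u_ik)   (minimize)
    return -(U * np.log2(U + eps)).sum() / U.shape[1]
```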
Cluster Validity Indexes
• Fukuyama-Sugeno Index (Fukuyama/Sugeno 1989)
  FS(c) = Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik^m dist(vi, xk)²  −  Σ_{i=1}^{c} Σ_{k=1}^{n} u_ik^m dist(vi, v̄)²,
  where v̄ = (1/c) Σ_{i=1}^{c} vi.
  The first term measures the compactness of the clusters, the second their separation.
• Optimal number of clusters c∗: FS(c∗) = min_{2 ≤ c ≤ n−1} FS(c)
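A sketch of the Fukuyama-Sugeno index as reconstructed above (compactness minus separation; smaller values indicate a better choice of c):

```python
import numpy as np

def fukuyama_sugeno(U, V, X, m=2.0):
    # dist(v_i, x_k)^2 for all i, k — shape (c, n)
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)
    v_bar = V.mean(axis=0)                   # mean of the cluster centroids
    sep = ((V - v_bar) ** 2).sum(-1)         # dist(v_i, v_bar)^2 — shape (c,)
    W = U ** m
    return (W * d2).sum() - (W * sep[:, None]).sum()
```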
Application
Data Mining and Decision Support Systems for Landslide Events (UniBw, Geoinformatics Group: W. Reinhardt, E. Nuhn)
Problem: uncertain data from measurements and simulations
• Measurements (pressure values, tension, deformation vectors)
• Simulations (finite-element model)
→ Spatial data mining / early warning systems for landslide events
→ Fuzzy clustering approaches (feature weighting)
[Diagram: data → hard clustering → partition; data → fuzzy clustering (with feature weighting) → fuzzy clusters / fuzzy partition]
Nuhn/Kropat/Reinhardt/Pickl: Preparation of complex landslide simulation results with clustering approaches for decision support and early warning. Submitted to the Hawaii International Conference on System Sciences (HICSS 45), Grand Wailea, Maui, 2012.
Thank you very much!