
Page 1: Lecture Notes for Chapter 9 Introduction to Data Mining

Data Mining
Cluster Analysis: Advanced Concepts and Algorithms

Lecture Notes for Chapter 9
Introduction to Data Mining
by Tan, Steinbach, Kumar

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Page 2: Lecture Notes for Chapter 9 Introduction to Data Mining


Types of Clustering Algorithms

– Prototype-based
– Density-based
– Graph-based

(Added by Dr. Rafea)

Page 3: Lecture Notes for Chapter 9 Introduction to Data Mining


Prototype-Based (Fuzzy C-Means)

http://www.cse.aucegypt.edu/~rafea/CSCE564/slides/Clustering.pdf

http://www.cse.aucegypt.edu/~rafea/CSCE564/slides/Fuzzy-C_Means%20example.pdf

(Added by Dr. Rafea)

Page 4: Lecture Notes for Chapter 9 Introduction to Data Mining


Hard (Crisp) vs Soft (Fuzzy) Clustering

– In soft (fuzzy) clustering, a point is allowed to belong to more than one cluster.
– For K-means, generalize the objective function:

  SSE = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}\, \mathrm{dist}(\mathbf{x}_i, \mathbf{c}_j)^2, \qquad \sum_{j=1}^{k} w_{ij} = 1

  where w_{ij} is the weight with which object \mathbf{x}_i belongs to cluster \mathbf{c}_j.
– To minimize SSE, repeat the following steps:
  Fix \mathbf{c}_j and determine w_{ij} (cluster assignment).
  Fix w_{ij} and recompute \mathbf{c}_j.
– Hard clustering: w_{ij} \in \{0, 1\}.
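To make the assignment step concrete: with the centroids fixed, each point's contribution to the SSE is linear in its weights, so for hard clustering the minimizing choice gives all of a point's weight to its nearest centroid. This is the standard K-means assignment rule, written out here for reference (implied by, but not spelled out on, the slide):

w_{ij} =
\begin{cases}
1 & \text{if } j = \arg\min_{l}\ \mathrm{dist}(\mathbf{x}_i, \mathbf{c}_l)^2 \\
0 & \text{otherwise}
\end{cases}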

Page 5: Lecture Notes for Chapter 9 Introduction to Data Mining


Soft (Fuzzy) Clustering: Estimating Weights

Consider a one-dimensional example with centroids \mathbf{c}_1 = 1 and \mathbf{c}_2 = 5 and a single point x = 2:

SSE(x) = w_{x1}(1 - 2)^2 + w_{x2}(5 - 2)^2 = w_{x1} + 9\,w_{x2}

Since SSE(x) is linear in the weights and the weights must sum to 1, it is minimized by giving all the weight to the nearer centroid: SSE(x) is minimized when w_{x1} = 1, w_{x2} = 0.

Page 6: Lecture Notes for Chapter 9 Introduction to Data Mining


Fuzzy C-means

Objective function:

SSE = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^{p}\, \mathrm{dist}(\mathbf{x}_i, \mathbf{c}_j)^2, \qquad \sum_{j=1}^{k} w_{ij} = 1

– w_{ij}: weight with which object \mathbf{x}_i belongs to cluster \mathbf{c}_j.
– p: the fuzzifier (p > 1), a power applied to the weight (not a superscript on w) that controls how "fuzzy" the clustering is.
– To minimize the objective function, repeat the following:
  Fix \mathbf{c}_j and determine w_{ij}.
  Fix w_{ij} and recompute \mathbf{c}_j.
– Fuzzy c-means clustering: w_{ij} \in [0, 1].

Bezdek, James C. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, 1981.

Page 7: Lecture Notes for Chapter 9 Introduction to Data Mining


Fuzzy C-means

With the same centroids \mathbf{c}_1 = 1 and \mathbf{c}_2 = 5, the point x = 2, and p = 2:

SSE(x) = w_{x1}^2 (1 - 2)^2 + w_{x2}^2 (5 - 2)^2 = w_{x1}^2 + 9\,w_{x2}^2

SSE(x) is minimized when w_{x1} = 0.9, w_{x2} = 0.1.

(Figure: plot of SSE(x) as a function of the weights.)
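The stated minimum can be checked directly; the following short derivation is an addition, not part of the original slide. Minimizing w_{x1}^2 + 9(1 - w_{x1})^2 over w_{x1}:

\frac{d}{dw_{x1}}\left[ w_{x1}^2 + 9(1 - w_{x1})^2 \right] = 20\,w_{x1} - 18 = 0 \;\Rightarrow\; w_{x1} = 0.9,\; w_{x2} = 0.1

The same values follow from the weight-update formula on the next slide (with p = 2):

w_{x1} = \frac{1/1^2}{1/1^2 + 1/3^2} = \frac{1}{1 + 1/9} = 0.9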

Page 8: Lecture Notes for Chapter 9 Introduction to Data Mining


Fuzzy C-means

Objective function:

SSE = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^{p}\, \mathrm{dist}(\mathbf{x}_i, \mathbf{c}_j)^2, \qquad \sum_{j=1}^{k} w_{ij} = 1, \qquad p: \text{fuzzifier } (p > 1)

Initialization: choose the weights w_{ij} randomly.

Repeat:
– Update centroids:

  \mathbf{c}_j = \sum_{i=1}^{m} w_{ij}^{p}\, \mathbf{x}_i \Big/ \sum_{i=1}^{m} w_{ij}^{p}

– Update weights:

  w_{ij} = \big(1/\mathrm{dist}(\mathbf{x}_i, \mathbf{c}_j)^2\big)^{\frac{1}{p-1}} \Big/ \sum_{l=1}^{k} \big(1/\mathrm{dist}(\mathbf{x}_i, \mathbf{c}_l)^2\big)^{\frac{1}{p-1}}
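A minimal NumPy sketch of this procedure (illustrative only: the function name, the Euclidean distance, the random initialization, and the convergence test are assumptions, not taken from the slides):

import numpy as np

def fuzzy_c_means(X, k, p=2.0, n_iter=100, tol=1e-5, seed=0):
    """Fuzzy c-means by alternating the weight and centroid updates above.

    X: (m, d) data matrix; k: number of clusters; p: fuzzifier (> 1).
    Returns (centroids, weights), where weights has shape (m, k) and rows sum to 1.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Initialization: choose the weights w_ij randomly, each row summing to 1.
    w = rng.random((m, k))
    w /= w.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        w_old = w
        # Update centroids: c_j = sum_i w_ij^p x_i / sum_i w_ij^p
        wp = w ** p
        centroids = (wp.T @ X) / wp.sum(axis=0)[:, None]
        # Squared Euclidean distances dist(x_i, c_j)^2, shape (m, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)            # guard against division by zero
        # Update weights: w_ij proportional to (1 / d_ij^2)^(1/(p-1)), rows normalized.
        w = (1.0 / d2) ** (1.0 / (p - 1.0))
        w /= w.sum(axis=1, keepdims=True)
        if np.abs(w - w_old).max() < tol:     # stop once the weights stabilize
            break
    return centroids, w

# Example: one-dimensional data like the earlier worked example.
X = np.array([[0.5], [1.0], [2.0], [5.0], [5.5]])
centroids, weights = fuzzy_c_means(X, k=2)

On data like this, the two centroids should settle near 1 and 5, and each row of the weight matrix gives soft memberships analogous to the w_{x1}, w_{x2} values computed by hand.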

Page 9: Lecture Notes for Chapter 9 Introduction to Data Mining


Fuzzy K-means Applied to Sample Data

(Figure: fuzzy clustering of sample data; each point is colored by its maximum membership weight, on a color scale from 0.5 to 0.95.)

Page 10: Lecture Notes for Chapter 9 Introduction to Data Mining


An Example Application: Image Segmentation

Modified versions of fuzzy c-means have been used for image segmentation,
– especially fMRI images (functional magnetic resonance images).

(Figure, from left to right: original images, fuzzy c-means, EM, BCFCM.)

References:
– Gong, Maoguo, Yan Liang, Jiao Shi, Wenping Ma, and Jingjing Ma. "Fuzzy c-means clustering with local information and kernel metric for image segmentation." IEEE Transactions on Image Processing 22, no. 2 (2013): 573-584.
– Ahmed, Mohamed N., Sameh M. Yamany, Nevin Mohamed, Aly A. Farag, and Thomas Moriarty. "A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data." IEEE Transactions on Medical Imaging 21, no. 3 (2002): 193-199.

Page 11: Lecture Notes for Chapter 9 Introduction to Data Mining


Hard (Crisp) vs Soft (Probabilistic) Clustering

The idea is to model the set of data points as arising from a mixture of distributions.
– Typically, the normal (Gaussian) distribution is used.
– But other distributions have been used very profitably.

Clusters are found by estimating the parameters of the statistical distributions.
– A K-means-like algorithm, the Expectation-Maximization (EM) algorithm, can be used to estimate these parameters. (In fact, K-means is a special case of this approach.)
– Provides a compact representation of clusters.
– The probabilities with which a point belongs to each cluster provide a functionality similar to fuzzy clustering.

Page 12: Lecture Notes for Chapter 9 Introduction to Data Mining


Probabilistic Clustering: Example

Informal example: consider modeling the points that generate the following histogram.

Looks like a combination of two normal (Gaussian) distributions.

Suppose we can estimate the mean and standard deviation of each normal distribution.
– This completely describes the two clusters.
– We can compute the probabilities with which each point belongs to each cluster.
– We can assign each point to the cluster (distribution) for which it is most probable.

Page 13: Lecture Notes for Chapter 9 Introduction to Data Mining


Probabilistic Clustering: EM Algorithm

Initialize the parameters.
Repeat:
  For each point, compute its probability under each distribution.
  Using these probabilities, update the parameters of each distribution.
Until there is no change.

Very similar to K-means:
– Consists of assignment and update steps.
– Can use random initialization (problem of local minima).
– For normal distributions, K-means is typically used to initialize.
– If using normal distributions, can find elliptical as well as spherical shapes.
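A minimal NumPy sketch of this loop for a two-component, one-dimensional Gaussian mixture (illustrative only: the synthetic data, the initialization choices, and the stopping test are assumptions, not taken from the slides):

import numpy as np

def em_gaussian_mixture(x, k=2, n_iter=200, tol=1e-6, seed=0):
    """EM for a one-dimensional Gaussian mixture: alternate the E (assignment)
    and M (parameter update) steps until the means stop changing."""
    rng = np.random.default_rng(seed)
    m = len(x)
    # Initialize the parameters: means from random points, common variance, equal priors.
    means = rng.choice(x, size=k, replace=False)
    variances = np.full(k, np.var(x))
    priors = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        old_means = means.copy()
        # E step: for each point, compute its probability under each distribution,
        # then the posterior p(C_j | x_i) via Bayes rule.
        dens = (np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
                / np.sqrt(2.0 * np.pi * variances))
        resp = dens * priors
        resp /= resp.sum(axis=1, keepdims=True)          # shape (m, k), rows sum to 1
        # M step: use these probabilities to update the parameters of each distribution.
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        priors = nk / m
        if np.abs(means - old_means).max() < tol:        # "until there is no change"
            break
    return means, variances, priors, resp

# Example: points drawn from two normal distributions, as in the histogram example.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4.0, 1.0, 200), rng.normal(3.0, 1.5, 300)])
means, variances, priors, resp = em_gaussian_mixture(x)
labels = resp.argmax(axis=1)    # assign each point to its most probable cluster

Random initialization is used here only for simplicity; as noted above, K-means is typically used to initialize EM when fitting normal distributions.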

Page 14: Lecture Notes for Chapter 9 Introduction to Data Mining


Probabilistic Clustering: Updating Centroids

Update formula for the centroids, assuming an estimate of the statistical parameters
(here \mathbf{x}_i is a data point, C_j is a cluster, and \mathbf{c}_j is a centroid):

\mathbf{c}_j = \sum_{i=1}^{m} p(C_j \mid \mathbf{x}_i)\, \mathbf{x}_i \Big/ \sum_{i=1}^{m} p(C_j \mid \mathbf{x}_i)

Very similar to the fuzzy k-means formula:
– Weights are probabilities.
– Weights are not raised to a power.
– Probabilities are calculated using Bayes rule:

  p(C_j \mid \mathbf{x}_i) = \frac{p(\mathbf{x}_i \mid C_j)\, p(C_j)}{\sum_{l=1}^{k} p(\mathbf{x}_i \mid C_l)\, p(C_l)}

Need to assign a weight p(C_j) to each cluster:
– Weights may not be equal.
– Similar to prior probabilities.
– Can be estimated as:

  p(C_j) = \frac{1}{m} \sum_{i=1}^{m} p(C_j \mid \mathbf{x}_i)

Page 15: Lecture Notes for Chapter 9 Introduction to Data Mining


More Detailed EM Algorithm

Page 16: Lecture Notes for Chapter 9 Introduction to Data Mining


Probabilistic Clustering Applied to Sample Data

(Figure: probabilistic (EM) clustering of sample data; each point is colored by its maximum cluster probability, on a color scale from 0.5 to 0.95.)

Page 17: Lecture Notes for Chapter 9 Introduction to Data Mining


Probabilistic Clustering: Dense and Sparse Clusters

(Figure: scatter plot of sample data, x roughly -10 to 4 and y roughly -8 to 6, containing one dense cluster and one sparse cluster, annotated with a "?".)

Page 18: Lecture Notes for Chapter 9 Introduction to Data Mining


Problems with EM

Convergence can be slow.

Only guarantees finding local maxima.

Makes some significant statistical assumptions.

The number of parameters for a Gaussian distribution grows as O(d^2), where d is the number of dimensions.
– These parameters are associated with the covariance matrix.
– K-means only estimates cluster means, which grow as O(d).
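To make the O(d^2) growth concrete, here is a standard parameter count (added here, not on the slide) for one Gaussian cluster with a full covariance matrix:

d \;(\text{mean}) \;+\; \frac{d(d+1)}{2} \;(\text{symmetric } d \times d \text{ covariance matrix}) \;=\; O(d^2) \ \text{parameters per cluster}

whereas K-means stores only the d-dimensional centroid per cluster, i.e. O(d).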

Page 19: Lecture Notes for Chapter 9 Introduction to Data Mining


Useful Links to Probabilistic Clustering

https://www.youtube.com/watch?v=iQoXFmbXRJA&t=6s

https://www.youtube.com/watch?v=TG6Bh-NFhA0

Page 20: Lecture Notes for Chapter 9 Introduction to Data Mining


Density-Based (Grid-Based) Clustering

Algorithm:
1. Define a set of grid cells.
2. Assign objects to the appropriate cells and compute the density of each cell.
3. Eliminate cells having a density below a specified threshold.
4. Form clusters from adjacent groups of dense cells.

(Added by Dr. Rafea)
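A minimal NumPy/SciPy sketch of these four steps for two-dimensional data (illustrative only: the function name, the grid resolution, the density threshold of 9 taken from the example on the next slide, and the use of scipy.ndimage.label to group adjacent dense cells are assumptions):

import numpy as np
from scipy import ndimage

def grid_based_clustering(X, n_bins=10, density_threshold=9):
    """Grid-based clustering of 2-D points X with shape (m, 2).
    Returns (labels, n_clusters); label 0 means the point fell in a discarded cell."""
    # Step 1: define a set of grid cells over the bounding box of the data.
    x_edges = np.linspace(X[:, 0].min(), X[:, 0].max(), n_bins + 1)
    y_edges = np.linspace(X[:, 1].min(), X[:, 1].max(), n_bins + 1)

    # Step 2: assign objects to cells and compute the density (count) of each cell.
    counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=[x_edges, y_edges])

    # Step 3: eliminate cells whose density is below the threshold.
    dense = counts >= density_threshold

    # Step 4: form clusters from adjacent groups of dense cells
    # (connected components of the dense-cell mask).
    cell_labels, n_clusters = ndimage.label(dense)

    # Map each point back to the label of the cell it falls in.
    ix = np.clip(np.digitize(X[:, 0], x_edges) - 1, 0, n_bins - 1)
    iy = np.clip(np.digitize(X[:, 1], y_edges) - 1, 0, n_bins - 1)
    return cell_labels[ix, iy], n_clusters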

Page 21: Lecture Notes for Chapter 9 Introduction to Data Mining


Example

– Define a set of grid cells.
– Assign objects to the cells and compute their densities.
– Discard cells having fewer than 9 objects (losing parts of the cluster).

(Figure: grid of per-cell object counts for the sample data.)

Page 22: Lecture Notes for Chapter 9 Introduction to Data Mining


Graph-Based Clustering

Graph-based clustering uses the proximity graph.
– Start with the proximity matrix.
– Consider each point as a node in a graph.
– Each edge between two nodes has a weight, which is the proximity between the two points.
– Initially the proximity graph is fully connected.
– MIN (single link) and MAX (complete link) can be viewed as starting with this graph.

In the simplest case, clusters are connected components in the graph.
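A minimal SciPy sketch of this simplest case, using a sparsified (nearest-neighbor) proximity graph as discussed on the following slides (illustrative only: the function name, the number of neighbors, and the specific SciPy routines are assumptions):

import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial import distance_matrix

def knn_graph_clustering(X, n_neighbors=5):
    """Sparsify the fully connected proximity graph to a k-nearest-neighbor graph,
    then return the connected components as clusters."""
    m = X.shape[0]
    d = distance_matrix(X, X)               # proximity matrix (Euclidean distances)
    np.fill_diagonal(d, np.inf)             # ignore self-edges

    # Sparsification: keep an edge from each point to its n_neighbors nearest points.
    adj = np.zeros((m, m), dtype=bool)
    nearest = np.argsort(d, axis=1)[:, :n_neighbors]
    rows = np.repeat(np.arange(m), n_neighbors)
    adj[rows, nearest.ravel()] = True

    # In the simplest case, clusters are the connected components of the
    # (undirected) sparsified graph.
    n_clusters, labels = connected_components(adj, directed=False)
    return n_clusters, labels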

Page 23: Lecture Notes for Chapter 9 Introduction to Data Mining


Graph-Based Clustering: Sparsification

The amount of data that needs to be processed is drastically reduced.
– Sparsification can eliminate more than 99% of the entries in a proximity matrix.
– The amount of time required to cluster the data is drastically reduced.
– The size of the problems that can be handled is increased.

Page 24: Lecture Notes for Chapter 9 Introduction to Data Mining


Graph-Based Clustering: Sparsification …

Clustering may work better.
– Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points.
– The nearest neighbors of a point tend to belong to the same class as the point itself.
– This reduces the impact of noise and outliers and sharpens the distinction between clusters.

Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning).
– Examples: Chameleon and hypergraph-based clustering.

Page 25: Lecture Notes for Chapter 9 Introduction to Data Mining


Sparsification in the Clustering Process