

Page 1: Click to add title - WordPress.com ·  · 11/7/2016 What is Cluster Analysis? Clustering is the process of grouping large data sets according to their similarity. A cluster is a

Clustering

YZM 3226 – Machine Learning (Makine Öğrenmesi)


Outline

◘ What is Cluster Analysis?

◘ Similarity Measure

◘ Clustering Applications

◘ Clustering Methods

◘ Clustering Algorithm Selection

◘ Cluster Validation


What is Cluster Analysis?

◘ Clustering is the process of grouping large data sets according to their similarity.

◘ A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.

[Figure: the same houses clustered in two ways, size based and geographic distance based. Each point represents a house.]


Clustering Definition

◘ Grouping similar data objects into clusters.

◘ Clustering

– Given a set of data points,

– each described by a set of attributes,

– and a similarity measure among them,

– find clusters of similar data points.


Clustering Applications

◘ Spatial Data Analysis

– Detect spatial clusters

– e.g. Create thematic maps in GIS

◘ Image Processing

◘ WWW

– Document clustering

• To find groups of documents that are similar to each other based on the important terms appearing in them.

– Cluster Weblog data

• To discover groups of similar access patterns

◘ Marketing

◘ Medical Applications

◘ Science Applications


Example Clustering Applications

◘ Marketing:

– Customer Segmentation: Find clusters of similar customers

– Market Segmentation: Subdivide a market into subsets

◘ Medical: Clustering of diseases

◘ Insurance: Identifying groups of insurance policy holders

◘ City-planning: Identifying groups of houses according to their house type, value, and geographical location

◘ Earthquake studies: Observed earthquake epicenters should be clustered along continental faults


Data Types

◘ Categorical: Nominal, Ordinal

◘ Numeric: Continuous, Discrete

Example values for each type:

Continuous   Discrete   Binary       Categorical (Nominal)   Categorical (Ordinal)
12           0-18       Smoker       Mountain bicycle        Very Unhappy
45           18-40      Non-Smoker   Utility bicycle         Unhappy
34           40-100                  Racing bicycle          Neutral
9                                                            Happy
48                                                           Very Happy


Similarity Measures

◘ Numeric Data

– If attributes are continuous:

• Manhattan Distance (p=1)

• Euclidean Distance (p=2)

• Minkowski Distance

◘ Categorical Data

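– Jaccard's distance: $d(i, j) = \frac{b + c}{a + b + c}$, where a, b and c are counts from the contingency table of the two objects i and j

– ...

◘ Others

– Problem-specific measures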


1- Similarity Measures for Numeric Data

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

[Figure: the four points p1 to p4 plotted in the x-y plane.]

Euclidean Distance Matrix

      p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0

Manhattan Distance Matrix

      p1   p2   p3   p4
p1    0    4    4    6
p2    4    0    2    4
p3    4    2    0    2
p4    6    4    2    0
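The two matrices can be reproduced with a few lines of Python. This is a minimal sketch: the helper names `minkowski` and `distance_matrix` are illustrative and not part of the slides. It uses the Minkowski distance, which gives the Manhattan distance for p = 1 and the Euclidean distance for p = 2.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p between two points of equal dimension."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def distance_matrix(points, p):
    """Pairwise distances, rounded to 3 decimals, in the insertion order of `points`."""
    names = list(points)
    return [[round(minkowski(points[r], points[c], p), 3) for c in names]
            for r in names]


# The four example points from the slide.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

manhattan = distance_matrix(points, 1)  # p = 1
euclidean = distance_matrix(points, 2)  # p = 2

for name, row in zip(points, euclidean):
    print(name, row)
```

The first row of `euclidean` is 0, 2.828, 3.162, 5.099 and the first row of `manhattan` is 0, 4, 4, 6, matching the tables.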


Example for Clustering Numeric Data

◘ Document Clustering

– Each document becomes a 'term' vector,

• each term is a component (attribute) of the vector,

• the value of each component is the number of times the corresponding term occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1   3     0      5     0     2      6     0    2     0        2
Document 2   0     7      0     2     1      0     0    3     0        0
Document 3   0     1      0     0     1      2     2    0     3        0
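As a rough illustration of how a document becomes a term vector, the sketch below counts how often each vocabulary term occurs in a text. The function name `term_vector`, the shortened vocabulary and the sample sentence are made up for this example; real pipelines also handle punctuation, stemming and stop words.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Return the number of occurrences of each vocabulary term in `text`."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "team play score play team ball team"

vector = term_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Once every document is a numeric vector over the same vocabulary, the numeric distance measures can be applied to pairs of documents.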


2- Similarity Measures for Categorical Data

◘ Categorical Data:

◘ e.g. Binary Variables - 0/1 - presence/absence

– Jaccard's coefficient (measure similarity): $sim_{Jaccard}(i, j) = \frac{a}{a + b + c}$

– Jaccard's distance (measure dissimilarity): $d(i, j) = \frac{b + c}{a + b + c}$

Contingency table for two binary objects i and j:

                   Object j
                   1      0      sum
Object i    1      a      b      a+b
            0      c      d      c+d
            sum    a+c    b+d    p


Example for Clustering Categorical Data

◘ Find the Jaccard's distance between Apple and Banana.

Feature of Fruit     Sphere shape   Sweet   Sour   Crunchy
Object i = Apple     Yes            Yes     Yes    Yes
Object j = Banana    No             Yes     No     No

With Yes = 1 and No = 0, the contingency counts are a = 1, b = 3, c = 0, d = 0, so

$d(i, j) = \frac{b + c}{a + b + c} = \frac{3 + 0}{1 + 3 + 0} = \frac{3}{4} = 0.75$
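The calculation can be checked with a short Python sketch. The function names are illustrative, and returning 0.0 when a + b + c = 0 is a convention assumed here, not something the slides specify.

```python
def contingency(i, j):
    """Count a (1,1), b (1,0), c (0,1) and d (0,0) over two binary vectors."""
    pairs = list(zip(i, j))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)
    b = sum(1 for x, y in pairs if x == 1 and y == 0)
    c = sum(1 for x, y in pairs if x == 0 and y == 1)
    d = sum(1 for x, y in pairs if x == 0 and y == 0)
    return a, b, c, d


def jaccard_distance(i, j):
    """Jaccard distance (b + c) / (a + b + c); d is ignored."""
    a, b, c, _ = contingency(i, j)
    if a + b + c == 0:
        return 0.0  # assumed convention for two all-zero objects
    return (b + c) / (a + b + c)


# Features: sphere shape, sweet, sour, crunchy (Yes = 1, No = 0).
apple = [1, 1, 1, 1]
banana = [0, 1, 0, 0]

print(contingency(apple, banana))       # (1, 3, 0, 0)
print(jaccard_distance(apple, banana))  # 0.75
```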


Example for Clustering Categorical Data

◘ Who are the most likely to have a similar disease?

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

Let the values Y and P be set to 1, and the value N be set to 0. Gender is left out of the calculation; only the six binary attributes Fever through Test-4 are compared.

$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$

$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$

$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$

Result: Jim and Mary are unlikely to have a similar disease.

Jack and Mary are the most likely to have a similar disease.
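The same distances can be obtained from a set view of the data: if each patient is represented by the set of attributes whose value is 1 (Y or P), the Jaccard distance equals 1 minus the size of the intersection divided by the size of the union. The sketch below assumes that representation and leaves Gender out, as in the calculation above; the variable and function names are illustrative.

```python
from itertools import combinations

# Attributes with value 1 (Y or P) for each patient; Gender is not used.
patients = {
    "Jack": {"Fever", "Test-1"},
    "Mary": {"Fever", "Test-1", "Test-3"},
    "Jim": {"Fever", "Cough"},
}


def jaccard_distance_sets(s, t):
    """1 - |s & t| / |s | t|, which equals (b + c) / (a + b + c)."""
    union = s | t
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return 1 - len(s & t) / len(union)


distances = {
    (x, y): round(jaccard_distance_sets(patients[x], patients[y]), 2)
    for x, y in combinations(patients, 2)
}
closest_pair = min(distances, key=distances.get)

print(distances)
print("Most similar:", closest_pair)
```

The smallest distance identifies the pair most likely to have a similar disease, and the largest distance identifies the least similar pair.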


Clustering Methods

◘ Partitioning Methods

– K-Means, K-Medoids, PAM, CLARA, CLARANS, ...

◘ Hierarchical Methods

– AGNES, DIANA, BIRCH, CURE, CHAMELEON, ...

◘ Density-Based Methods

– DBSCAN, OPTICS, DENCLUE, ...

◘ Grid-Based Methods

– STING, WaveCluster, CLIQUE ...

◘ Model-Based Methods

– COBWEB, CLASSIT, SOM (Self-Organizing Feature Maps) ...


Clustering Algorithm Selection

1. Scalability

– Efficient execution on large databases

– Scanning the database only a few times

2. Support for different data types

– Continuous, discrete, binary, nominal, ordinal, …

3. Updateability

– Updating clusters after insertion and deletion of some data values

4. Efficient memory usage

5. Input parameters

– Different parameter values lead to different results

– Too many or hard-to-understand input parameters should be avoided

6. No prior knowledge required

7. Discovery of different cluster shapes

8. Robustness to dirty data

– Works on missing, erroneous, and noisy data

9. Insensitivity to data ordering

10. High dimensionality

– Works on multi-dimensional datasets

11. Usability in different application areas


Cluster Characteristics

◘ Each cluster is represented by the following characteristics


Categories of Clustering Algorithms

[Diagram: the five categories of clustering algorithms and their representative methods]

– Partitioning Methods: K-Means, K-Medoid, PAM, CLARA, CLARANS

– Hierarchical Methods: AGNES, DIANA, BIRCH, CURE, CHAMELEON

– Density Based Methods: DBSCAN, OPTICS, DENCLUE

– Grid Based Methods: STING, WaveCluster, CLIQUE

– Model Based Methods: COBWEB, CLASSIT, SOM


Partitioning Methods

◘ Construct a partition of a database D of n objects into a set of k clusters.

◘ Given k, find a partition into k clusters that optimizes the chosen partitioning criterion, e.g. minimizes the SSE (Sum of Squared Error), the sum of squared distances between each object and the center of its cluster.
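Written out, the SSE criterion is commonly expressed as follows. The symbols are assumptions introduced here rather than notation from the slides: $C_i$ is the i-th cluster, $m_i$ its center (the mean of its objects), and $x$ ranges over the objects assigned to $C_i$.

```latex
\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^{2}
```

A smaller SSE means the objects lie closer to their cluster centers, so for a fixed k a partition with lower SSE is preferred.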


K-Means

◘ K-Means is an algorithm that clusters n objects, based on their attributes, into k partitions, where k < n.
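The slides do not list the steps of the algorithm, so the following is a minimal sketch of the standard iterative procedure (often called Lloyd's algorithm), under stated assumptions: initial centers are k randomly sampled data points, an empty cluster keeps its previous center, and the loop stops when the centers no longer change. The function names and the sample data are illustrative.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random data points as initial centers
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [mean(cl) if cl else centers[i] for i, cl in enumerate(clusters)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters


data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(data, k=2)
print(centers)
print(clusters)
```

On this small, well-separated data set the two groups are recovered; on harder data the result can depend on the initial centers.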


K-Means Example


K-Means Demo

◘ http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html


K-Means: Advantages and Disadvantages

◘ Strengths

– Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations.

– Easy to understand

◘ Weaknesses

– Applicable only when the mean is defined, so it cannot be applied directly to categorical data

– Need to specify k, the number of clusters, in advance

– Sensitive to noisy data and outliers

– Not suitable for discovering clusters with non-convex shapes

[Figure: the K-Means result compared with the result of another clustering algorithm.]


K-Medoids Method

◘ K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
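To make the difference concrete, the sketch below compares the mean and the medoid of one small cluster that contains an outlier. It assumes Euclidean distance and defines the medoid as the member with the smallest total distance to all other members; the function names and the data are made up for illustration.

```python
import math


def mean_point(points):
    """Coordinate-wise mean; usually not one of the data points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """The data point with the smallest total distance to all other points."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


# Four nearby points plus one outlier.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (20.0, 20.0)]

print(mean_point(cluster))  # pulled towards the outlier
print(medoid(cluster))      # stays on an actual, central object
```

The mean is dragged towards the outlier, while the medoid remains one of the four nearby objects, which is why medoid-based methods are less sensitive to outliers.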


[Figure: K-Means and K-Medoids compared.]


Hierarchical Clustering

◘ Create a hierarchical decomposition of the set of data using some criterion

◘ Strength: This method does not require the number of clusters k as an input.

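◘ Weakness: It needs a termination condition.

[Figure: hierarchical clustering of five objects a, b, c, d, e.]

– Agglomerative (AGNES), bottom-up, Step 0 to Step 4: {a}, {b}, {c}, {d}, {e} → {a, b} → {d, e} → {c, d, e} → {a, b, c, d, e}

– Divisive (DIANA), top-down: the same hierarchy traversed in reverse, from {a, b, c, d, e} at its Step 0 down to single objects at its Step 4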


How Are the Clusters Merged?

◘ Given: Matrix of similarity between every point pair

◘ Start with each point in a separate cluster and merge clusters based on some criteria:

– Single link: The minimum of the distances between the members of the clusters.

– Complete link: The maximum of the distances between the members of the clusters.

– Average link: The average of the distances between the members of the clusters.

[Figure: minimum, average, and maximum distance between two clusters.]


Single Link: The distance between the closest members of the clusters

Complete Link: The distance between the farthest members of the clusters

Average Link: The average distance between the members of one cluster and the members of the other cluster

Centroid Link: The distance between the centroids of the two clusters
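The four linkage criteria can be sketched as plain functions over two clusters. This is a minimal illustration assuming Euclidean distance; the function names and the two example clusters are hypothetical.

```python
import math
from itertools import product


def pairwise(c1, c2):
    """All distances between a member of c1 and a member of c2."""
    return [math.dist(p, q) for p, q in product(c1, c2)]


def single_link(c1, c2):
    return min(pairwise(c1, c2))      # closest members


def complete_link(c1, c2):
    return max(pairwise(c1, c2))      # farthest members


def average_link(c1, c2):
    d = pairwise(c1, c2)
    return sum(d) / len(d)            # average over all cross-cluster pairs


def centroid(c):
    return tuple(sum(coords) / len(c) for coords in zip(*c))


def centroid_link(c1, c2):
    return math.dist(centroid(c1), centroid(c2))


A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (5.0, 0.0)]

for name, f in [("single", single_link), ("complete", complete_link),
                ("average", average_link), ("centroid", centroid_link)]:
    print(name, round(f(A, B), 3))
```

In an agglomerative method, the pair of clusters with the smallest value of the chosen criterion is merged at each step.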


[Figure: nested clusters and dendrograms of the same six points (1 to 6) under three linkage criteria. Dendrogram leaf order: Single Link 3, 6, 2, 5, 4, 1; Complete Link 3, 6, 4, 1, 2, 5; Average Link 3, 6, 4, 1, 2, 5.]


Hierarchical Clustering Demo

◘ http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html


Density-Based Clustering

◘ Objects in dense regions should be grouped together into one cluster.

◘ These methods use a fixed threshold value to determine dense regions.

◘ Density-based clustering algorithms

– DBSCAN (1996)

– OPTICS (1999)

– DENCLUE (1998)

◘ Check the number of points within a specified radius (Eps) of each point and compare it with MinPts.

[Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5.]
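The core / border / outlier distinction can be sketched as follows. This follows the usual DBSCAN definitions, stated here as assumptions: a point is a core point if at least MinPts points (itself included) lie within Eps of it, a border point is a non-core point that lies within Eps of some core point, and every other point is an outlier. The function name and the sample points are illustrative, and the sketch only labels points; it does not build the clusters.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'outlier'."""
    # Neighbourhood of p: all points within eps of p, p itself included.
    neighbors = {p: [q for q in points if math.dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbors[p]) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors[p]):
            labels[p] = "border"
        else:
            labels[p] = "outlier"
    return labels


points = [(0.0, 0.0), (0.4, 0.0), (0.0, 0.4), (-0.4, 0.0), (0.0, -0.4),
          (1.2, 0.0), (5.0, 5.0)]
labels = classify_points(points, eps=1.0, min_pts=5)

for p, label in labels.items():
    print(p, label)
```

Here the five central points are core points, (1.2, 0.0) is a border point because it is close to one core point but has too few neighbours itself, and (5.0, 5.0) is an outlier.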


Grid Based Clustering Methods

◘ The simplest approach is to divide the region into a number of rectangular cells of equal volume and define the density of a cell as the number of points it contains.
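A minimal sketch of this idea: map each point to the index of the cell that contains it, count the points per cell, and keep the cells whose count reaches a threshold. The cell size, the threshold and the function names are assumptions made for the example; grouping adjacent dense cells into clusters is left out.

```python
import math
from collections import Counter


def grid_density(points, cell_size):
    """Number of points per grid cell; a cell is identified by its integer index."""
    return Counter(tuple(math.floor(c / cell_size) for c in p) for p in points)


def dense_cells(counts, threshold):
    """Cells that contain at least `threshold` points."""
    return {cell for cell, n in counts.items() if n >= threshold}


points = [(0.2, 0.3), (0.7, 0.9), (0.5, 0.5), (1.2, 0.4), (3.6, 3.1), (3.9, 3.8)]
counts = grid_density(points, cell_size=1.0)

print(counts)
print(dense_cells(counts, threshold=2))
```

Because the work is done per cell rather than per pair of points, the cost depends mainly on the number of cells, which is part of the appeal of grid-based methods on large data sets.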


Model Based Methods

◘ Attempt to optimize the fit between the given data and some mathematical model

◘ They use statistical functions


Clustering Algorithms

General Overview


Factors Affecting Clustering Results

◘ Outliers

◘ Inappropriate parameter values

◘ Drawbacks of the clustering algorithms themselves

[Figure: the same input dataset clustered well with parameter k = 6 and badly with parameter k = 20.]


Cluster Validation - SSE

◘ Clustering

• Data points in one cluster are more similar to one another.

• Data points in separate clusters are less similar to one another.

◘ A good clustering method will produce high quality clusters with

• high intra-class similarity

• low inter-class similarity

[Figure: clustering in 3-D space; intracluster distances are minimized and intercluster distances are maximized.]

Sum of Squared Error (SSE)
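As a rough illustration of SSE as a validation measure, the sketch below computes it for two different groupings of the same four points. It assumes the common definition: for every cluster, sum the squared Euclidean distances between its points and the cluster mean, then add the cluster totals. The function name and the data are made up for the example.

```python
def sse(clusters):
    """Sum of squared distances between each point and the mean of its cluster."""
    total = 0.0
    for cluster in clusters:
        center = tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
        total += sum(
            sum((a - b) ** 2 for a, b in zip(point, center)) for point in cluster
        )
    return total


# The same four points, grouped in two different ways.
good = [[(1.0, 1.0), (1.0, 3.0)], [(9.0, 1.0), (9.0, 3.0)]]
bad = [[(1.0, 1.0), (9.0, 1.0)], [(1.0, 3.0), (9.0, 3.0)]]

print(sse(good))  # 4.0
print(sse(bad))   # 64.0
```

The grouping that keeps nearby points together has the lower SSE. Note that SSE generally decreases as the number of clusters grows, so it is most useful for comparing clusterings with the same k.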


Cluster Validation - Correlation

◘ The correlation value lies between -1.0 and +1.0.