
Clustering

YZM 3226 – Machine Learning

Outline

◘ What is Cluster Analysis?

◘ Similarity Measure

◘ Clustering Applications

◘ Clustering Methods

◘ Clustering Algorithm Selection

◘ Cluster Validation

What is Cluster Analysis?

◘ Clustering is the process of grouping large data sets according to their similarity.

◘ A cluster is a collection of data objects that are similar to one another within the same cluster.

[Figure: two clusterings of the same houses, size-based vs. geographic-distance-based; each point represents a house]

Clustering Definition

◘ Grouping similar data objects into clusters.

◘ Clustering: given a set of data points, where each data point has a set of attributes, and a similarity measure defined over those attributes, find the clusters.

Clustering Applications

◘ Spatial Data Analysis
– Detect spatial clusters
– e.g. create thematic maps in GIS

◘ Image Processing

◘ WWW
– Document clustering
• To find groups of documents that are similar to each other based on the important terms appearing in them.
– Clustering weblog data
• To discover groups of similar access patterns

◘ Marketing

◘ Medical Applications

◘ Science Applications

Example Clustering Applications

◘ Marketing:
– Customer Segmentation: find clusters of similar customers
– Market Segmentation: subdivide a market into subsets

◘ Medical: clustering of diseases

◘ Insurance: identifying groups of insurance policy holders

◘ City-planning: identifying groups of houses according to their house type, value, and geographical location

◘ Earthquake studies: observed earthquake epicenters should be clustered along continent faults

[Figure: customer clusters plotted by Income]

Data Types

◘ Numeric: continuous, discrete
◘ Binary
◘ Categorical: nominal, ordinal

Example values:

Numeric (discrete)   Intervals   Binary       Categorical (Nominal)   Categorical (Ordinal)
12                   0-18        Smoker       Mountain bicycle        Very Unhappy
45                   18-40       Non-Smoker   Utility bicycle         Unhappy
34                   40-100                   Racing bicycle          Neutral
9                                                                     Happy
48                                                                    Very Happy

Similarity Measures

◘ Numeric Data
– If attributes are continuous:
• Manhattan Distance (Minkowski with p=1)
• Euclidean Distance (Minkowski with p=2)
• Minkowski Distance: d(x, y) = (sum_i |x_i - y_i|^p)^(1/p)

◘ Categorical Data
– Jaccard distance
– ...

◘ Others
– Problem-specific measures


1. Similarity Measures for Numeric Data

[Figure: points p1-p4 plotted in the plane]

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Euclidean Distance Matrix (p=2):

      p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0

Manhattan Distance Matrix (p=1):

      p1   p2   p3   p4
p1    0    4    4    6
p2    4    0    2    4
p3    4    2    0    2
p4    6    4    2    0

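As a quick check, the following minimal NumPy sketch (not from the slides; the function name is illustrative) reproduces the two matrices above: the Minkowski distance with p=1 gives the Manhattan matrix and p=2 gives the Euclidean matrix.

```python
# A minimal NumPy sketch reproducing the two distance matrices above;
# p=1 -> Manhattan distance, p=2 -> Euclidean distance.
import numpy as np

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

def minkowski_matrix(X, p):
    """Pairwise Minkowski distances between the rows of X."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    return (diff ** p).sum(axis=-1) ** (1.0 / p)

print(np.round(minkowski_matrix(points, p=2), 3))  # Euclidean matrix
print(minkowski_matrix(points, p=1))               # Manhattan matrix
```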

Example for Clustering Numeric Data

◘ Document Clustering

– Each document becomes a `term' vector:
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the corresponding term occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0      5     0     2      6     0     2      0       2
Document 2     0     7      0     2     1      0     0     0      3       0
Document 3     0     1      0     0     1      2     2     0      3       0
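A hedged sketch of how such term vectors can be built (the vocabulary list is taken from the table above; the function name and whitespace tokenization are illustrative assumptions):

```python
# A minimal sketch of turning a document into a term vector of raw counts.
from collections import Counter

vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(text):
    counts = Counter(text.lower().split())          # count each token
    return [counts[term] for term in vocabulary]    # one entry per term

print(term_vector("team play play score game game lost season"))
```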

2. Similarity Measures for Categorical Data

◘ Categorical data: e.g. binary variables, coded 0/1 (presence/absence).
– Jaccard coefficient (measures similarity)
– Jaccard distance (measures dissimilarity)

For two objects i and j described by p binary attributes, the counts a, b, c, d form a contingency table, where a counts attributes that are 1 in both objects, b those that are 1 only in i, c those that are 1 only in j, and d those that are 0 in both:

                     Object j
                     1         0         sum
Object i   1         a         b         a+b
           0         c         d         c+d
           sum       a+c       b+d       p

sim_Jaccard(i, j) = a / (a + b + c)

d(i, j) = (b + c) / (a + b + c)

Example for Clustering Categorical Data

◘ Find the Jaccard distance between Apple and Banana.

Feature of Fruit    Sphere shape   Sweet   Sour   Crunchy
Object i = Apple    Yes            Yes     Yes    Yes
Object j = Banana   No             Yes     No     No

Counting agreements and disagreements: a = 1, b = 3, c = 0, d = 0

d(Apple, Banana) = (b + c) / (a + b + c) = (3 + 0) / (1 + 3 + 0) = 3/4 = 0.75
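A minimal Python sketch of the two formulas (the function name is illustrative), checked against the Apple/Banana example above; the same function also reproduces the distances in the disease example that follows.

```python
# Jaccard coefficient and distance for binary (0/1) vectors.
def jaccard(i, j):
    a = sum(x == 1 and y == 1 for x, y in zip(i, j))  # 1 in both
    b = sum(x == 1 and y == 0 for x, y in zip(i, j))  # 1 only in i
    c = sum(x == 0 and y == 1 for x, y in zip(i, j))  # 1 only in j
    sim = a / (a + b + c)
    return sim, 1 - sim   # (similarity, distance)

apple  = [1, 1, 1, 1]  # sphere shape, sweet, sour, crunchy
banana = [0, 1, 0, 0]
print(jaccard(apple, banana))  # (0.25, 0.75)
```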

Example for Clustering Categorical Data

◘ Who are the most likely to have a similar disease?

Let the values Y and P be set to 1 and the value N to 0 (Gender is left out of the comparison; only the symptom and test attributes are used).

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

Result: Jim and Mary are unlikely to have a similar disease.
Jack and Mary are the most likely to have a similar disease.

Clustering Methods

◘ Partitioning Methods

– K-Means, K-Medoids, PAM, CLARA, CLARANS, ...

◘ Hierarchical Methods

– AGNES, DIANA, BIRCH, CURE, CHAMELEON, ...

◘ Density-Based Methods

– DBSCAN, OPTICS, DENCLUE, ...

◘ Grid-Based Methods

– STING, WaveCluster, CLIQUE ...

◘ Model-Based Methods

– COBWEB, CLASSIT, SOM (Self-Organizing Feature Maps) ...

Clustering Algorithm Selection

1. Scalability
– Efficient execution on large databases
– Scanning the database only a few times

2. Running on different data types
– Continuous, discrete, binary, nominal, ordinal, …

3. Updateability
– Updating clusters after insertion and deletion of data values

4. Efficient memory usage

5. Input parameters
– Different inputs can produce very different outputs
– Avoid hard-to-understand or too many input parameters

6. No foreknowledge required

7. Different cluster shapes

8. Workable on dirty data
– Workable on missing, wrong, and noisy data

9. Insensitivity to data ordering

10. Multiple dimensions
– Workability on multi-dimensional datasets

11. Usability for different areas

Cluster Characteristics

◘ Each cluster is represented by the following characteristics:

[Figure: cluster characteristics]

Categories of Clustering Algorithms

◘ Partitioning Methods: K-Means, K-Medoid, PAM, CLARA, CLARANS
◘ Hierarchical Methods: AGNES, DIANA, BIRCH, CURE, CHAMELEON
◘ Density-Based Methods: DBSCAN, OPTICS, DENCLUE
◘ Grid-Based Methods: STING, WaveCluster, CLIQUE
◘ Model-Based Methods: COBWEB, CLASSIT, SOM

Partitioning Methods

◘ Construct a partition of a database D of n objects into a set of k clusters.

◘ Given k, find a partition into k clusters that optimizes the chosen partitioning criterion, e.g. minimizes the SSE (Sum of Squared Errors).

K-Means

◘ K-Means is an algorithm that clusters n objects, based on their attributes, into k partitions, where k < n.
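A minimal sketch of the standard K-Means (Lloyd's) iteration, assuming numeric data; the initialization (k randomly chosen data points) and the iteration cap are illustrative choices, not prescribed by the slides.

```python
# K-Means: alternate between assigning points to the nearest center and
# moving each center to the mean of its assigned points.
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=-1), axis=1)
        # Update step: each center moves to the mean of its points
        # (a center with no points is left where it is).
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers

X = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)
print(k_means(X, k=2))
```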

K-Means Example

[Figure: K-Means iterations on example data]

K-Means Demo

◘ http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

K-Means: Strengths and Weaknesses

◘ Strengths:
– Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations.
– Easy to understand

◘ Weaknesses:
– Applicable only when a mean is defined; what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable for discovering clusters with non-convex shapes

[Figure: K-Means result vs. another clustering algorithm's result on non-convex clusters]

K-Medoids Method

◘ K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster.

[Figure: K-Means centroid vs. K-Medoids medoid on the same cluster]


Hierarchical Clustering

◘ Create a hierarchical decomposition of the set of data using some criterion.

◘ Strength: this method does not require the number of clusters k as an input.

◘ Weakness: it needs a termination condition.

[Figure: agglomerative clustering (AGNES) merges points a, b, c, d, e bottom-up from step 0 to step 4; divisive clustering (DIANA) splits the full set top-down over the same steps in reverse]

How the Clusters are Merged?

◘ Given: a matrix of similarities between every pair of points.

◘ Start with each point in a separate cluster and merge clusters based on some criterion:
– Single link: the minimum of the distances between the members of the clusters.
– Complete link: the maximum of the distances between the members of the clusters.
– Average link: the average of the distances between the members of the clusters.

[Figure: min, average, and max inter-cluster distances]

How the Clusters are Merged?

– Single Link: the distance between the closest members of the two clusters.

– Complete Link: the distance between the farthest members of the two clusters.

– Average Link: the average distance between the members of one cluster and the members of the other.

– Centroid Link: the distance between the centroids of the two clusters.

[Figures: the same six points clustered under single link, complete link, and average link, each with its resulting dendrogram; the three criteria merge the clusters in different orders and at different heights]
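A hedged sketch comparing the three merge criteria on the same data (SciPy is an assumed library choice, and the six 2-D points are illustrative, not the exact data behind the figures):

```python
# Agglomerative clustering under single, complete, and average linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # merge tree (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, labels)
```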

Hierarchical Clustering Demo

◘ http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html


Density-Based Clustering

◘ Dense regions of objects should be grouped together into one cluster.

◘ These methods use a fixed threshold value to determine dense regions.

◘ Density-based clustering algorithms:
– DBSCAN (1996)
– OPTICS (1999)
– DENCLUE (1998)

◘ They check the number of points within a specified radius (Eps) of each point; with MinPts as the density threshold, points are classified as core, border, or outlier.

[Figure: core, border, and outlier points for Eps = 1cm, MinPts = 5]
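A hedged sketch of DBSCAN in practice (scikit-learn is an assumed library choice; the data and parameter values are illustrative). The eps and min_samples arguments play the role of the Eps and MinPts parameters above.

```python
# DBSCAN: density-based clustering; noise points get the label -1.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).random((200, 2))     # illustrative 2-D data
db = DBSCAN(eps=0.1, min_samples=5).fit(X)
print(db.labels_)  # cluster ids; label -1 marks outliers (noise)
```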


Grid-Based Clustering Methods

◘ The simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points a cell contains.
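A minimal sketch of that grid idea (the data, grid size, and density threshold are illustrative assumptions): bin the points into equal-size cells and treat the per-cell count as the density of that cell.

```python
# Grid-based density: count points per equal-size rectangular cell.
import numpy as np

X = np.random.default_rng(0).random((500, 2))
counts, x_edges, y_edges = np.histogram2d(X[:, 0], X[:, 1], bins=10)
print(np.argwhere(counts >= 10))  # cells whose density exceeds the threshold
```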


Model-Based Methods

◘ Attempt to optimize the fit between the given data and some mathematical model.

◘ They use statistical functions.
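A hedged sketch of the model-based idea using a Gaussian mixture (scikit-learn and the Gaussian model are stand-in assumptions for illustration; the slides name COBWEB, CLASSIT, and SOM as members of this family):

```python
# Fit a statistical model (mixture of Gaussians) and read cluster
# memberships from the fitted model.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))   # illustrative data
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.predict(X))        # hard cluster assignments
print(gmm.predict_proba(X))  # per-cluster membership probabilities
```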

Clustering Algorithms: General Overview

Factors Affecting Clustering Results

◘ Outliers

◘ Inappropriate values for parameters

◘ Drawbacks of the clustering algorithms themselves

[Figure: the same input dataset produces a good clustering with parameter k=6 and a bad clustering with parameter k=20]

Cluster Validation - SSE

◘ Clustering:
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.

◘ A good clustering method will produce high quality clusters with:
• high intra-class similarity
• low inter-class similarity

◘ Intracluster distances are minimized; intercluster distances are maximized.

[Figure: clustering in 3-D space]

Sum of Squared Error (SSE)
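The standard definition, consistent with the partitioning criterion named earlier, is SSE = sum over clusters C_i of the sum over points x in C_i of dist(c_i, x)^2, where c_i is the centroid of C_i. A minimal sketch of that computation (NumPy-based; the function name is illustrative):

```python
# SSE: squared distance of every point to its cluster centroid, summed.
import numpy as np

def sse(X, labels):
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]          # points assigned to cluster c
        centroid = members.mean(axis=0)   # cluster center
        total += ((members - centroid) ** 2).sum()
    return total

X = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)
print(sse(X, np.array([0, 1, 1, 1])))  # smaller SSE = tighter clusters
```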

Cluster Validation - Correlation

◘ Measure the correlation between the proximity matrix and an ideal cluster (incidence) matrix; the value lies between -1.0 and +1.0, and a large magnitude indicates a good clustering.
