what is cluster analysis clustering – partitioning a data set into several groups (clusters) such...

19
What is Cluster Analysis • Clustering– Partitioning a data set into several groups (clusters) such that Homogeneity: Objects belonging to the same cluster are similar to each other Separation: Objects belonging to different clusters are dissimilar to each other. • Three fundamental elements of clustering – The set of objects – The set of attributes Distance measure

Post on 20-Dec-2015

225 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

What is Cluster Analysis • Clustering– Partitioning a data set into several

groups (clusters) such that– Homogeneity: Objects belonging to the same cluster

are similar to each other – Separation: Objects belonging to different clusters are

dissimilar to each other. 

• Three fundamental elements of clustering – The set of objects– The set of attributes– Distance measure

Page 2: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Supervised versus Unsupervised Learning

• Supervised learning (classification)– Supervision: Training data (observations,

measurements, etc.) are accompanied by labels indicating the class of the observations

– New data is classified based on training set

• Unsupervised learning (clustering)– Class labels of training data are unknown– Given a set of measurements, observations, etc.,

need to establish existence of classes or clusters in data

Page 3: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

What Is Good Clustering?• Good clustering method will produce high

quality clusters with– high intra-class similarity– low inter-class similarity

• Quality of a clustering method is also measured by its ability to discover some or all of hidden patterns

• Quality of a clustering result depends on both the similarity measure used by the method and its implementation

Page 4: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Requirements of Clustering in Data Mining • Scalability

• Ability to deal with different types of attributes

• Minimal requirements for domain knowledge to determine input parameters

• Able to deal with noise and outliers

• Discovery of clusters with arbitrary shape

• Insensitive to order of input records

• High dimensionality

• Incorporation of user-specified constraints

• Interpretability and usability

Page 5: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Application Examples

• A stand-alone tool: explore data distribution

• A preprocessing step for other algorithms• Pattern recognition, spatial data analysis,

image processing, market research, WWW, …– Cluster documents– Cluster web log data to discover groups of

similar access patterns

Page 6: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Gene Expression Data Matrix Gene Expression Patterns

Co-expressed Genes

Why looking for co-expressed genes? Co-expression indicates co-function; Co-expression also indicates co-regulation.

Co-expressed Genes

Page 7: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Gene-based Clustering

-1.5

-1

-0.5

0

0.5

1

1.5

Time Points

Ex

pre

ss

ion

Va

lue

-1.5

-1

-0.5

0

0.5

1

1.5

Time Points

Ex

pre

ss

ion

Le

ve

l

-1.5

-1

-0.5

0

0.5

1

1.5

Time Point

Ex

pre

ss

ion

Va

lue

Examples of co-expressed genes and coherent patterns in gene expression data

Iyer’s data [2]

[2] Iyer, V.R. et al. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999.

Page 8: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Data Matrix

• For memory-based clustering– Also called object-by-variable structure

• Represents n objects with p variables (attributes, measures)– A relational table

npx

nfx

nx

ipx

ifx

ix

px

fxx

1

1

1111

Page 9: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Two-way Clustering of Micoarray Data

sample 1 sample 2sample

3sample

4sample

gene 1 0.13 0.72 0.1 0.57

gene 2 0.34 1.58 1.05 1.15

gene 3 0.43 1.1 0.97 1

gene 4 1.22 0.97 1 0.85

gene 5 -0.89 1.21 1.29 1.08

gene 6 1.1 1.45 1.44 1.12

gene 7 0.83 1.15 1.1 1

gene 8 0.87 1.32 1.35 1.13

gene 9 -0.33 1.01 1.38 1.21

gene 10 0.10 0.85 1.03 1

gene…

• Clustering genes

• Samples are attributes

• Find genes with similar function

• Clustering samples

• Genes are attributes.

• Find samples with similar phenotype, e.g. cancers.

• Feature selection.

• Informative genes.

• Curse of dimensionality.

Page 10: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Dissimilarity Matrix

• For memory-based clustering– Also called object-by-object structure– Proximities of pairs of objects– d(i,j): dissimilarity between objects i and j– Nonnegative– Close to 0: similar

0,2)(,1)(

0(3,2)(3,1)

0(2,1)

0

ndnd

dd

d

Page 11: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Distance Matrix

s 1 s 2 s 3 s 4 …

g 1 0.13 0.72 0.1 0.57

g 2 0.34 1.58 1.05 1.15

g 3 0.43 1.1 0.97 1

g 4 1.22 0.97 1 0.85

g 5 -0.89 1.21 1.29 1.08

g 6 1.1 1.45 1.44 1.12

g 7 0.83 1.15 1.1 1

g 8 0.87 1.32 1.35 1.13

g 9 -0.33 1.01 1.38 1.21

g 10 0.10 0.85 1.03 1

g 1 g 2 g 3 g 4 …

g 1 0 D(1,2) D(1,3) D(1,4)

g 2 0 D(2,3) D(2,4)

g 3 0 D(3,4)

g 4 0

Original Data Matrix

Distance Matrix

Page 12: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

How Good Is the Clustering?

• Dissimilarity/similarity depends on distance function– Different applications have different functions– Inter-clusters distance maximization– Intra-clusters distance minimization

• Judgment of clustering quality is typically highly subjective

Page 13: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Types of Data in Clustering

• Interval-scaled variables

• Binary variables

• Nominal, ordinal, and ratio variables

• Variables of mixed types

Page 14: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Interval-valued Variables

• Continuous measurements of a roughly linear scale– Weight, height, latitude and longitude

coordinates, temperature, etc.

• Effect of measurement units in attributes– Smaller unit larger variable range larger

effect to the result– Standardization + background knowledge

Page 15: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Standardization

• Calculate the mean absolute deviation

,– The mean is not squared, so the effect of outliers

is reduced.

• Calculate the standardized measurement (z-score)

• Mean absolute deviation is more robust– The effect of outliers is reduced but remains

detectable

|)|...|||(|121 fnffffff

mxmxmxns .)...21

1nffff

xx(xn m

f

fifif s

mx z

Page 16: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Minkowski Distance• Minkowski distance: a generalization

• If q = 2, d is Euclidean distance

• If q = 1, d is Manhattan distance

)0(||...||||),(2211

qqj

xi

xj

xi

xj

xi

xjid q

pp

qq

xi

xj

q=2 q=16

6

128.48

Xi (1,7)

Xj(7,1)

Page 17: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Properties of Minkowski Distance

• Nonnegative: d(i,j) 0

• The distance of an object to itself is 0

– d(i,i) = 0

• Symmetric: d(i,j) = d(j,i)• Triangular inequality

– d(i,j) d(i,k) + d(k,j) i j

k

Page 18: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Major Clustering Approaches• Partitioning algorithms: Construct various

partitions and then evaluate them by some criterion

• Hierarchy algorithms: Create a hierarchical decomposition of set of data (or objects) using some criterion

• Density-based: based on connectivity and density functions

• Grid-based: based on a multiple-level granularity structure

Page 19: What is Cluster Analysis Clustering – Partitioning a data set into several groups (clusters) such that –Homogeneity: Objects belonging to the same cluster

Clustering Algorithms• If we “clustering” the clustering algorithms

Clustering algorithms

Partition-based

Hierarchical clustering

Centroid-based

K-means

Density-based

Model-based

Medoid-based

PAM CLARA CLARANS