what is cluster analysis clustering – partitioning a data set into several groups (clusters) such...
Post on 20-Dec-2015
225 views
TRANSCRIPT
What is Cluster Analysis • Clustering– Partitioning a data set into several
groups (clusters) such that– Homogeneity: Objects belonging to the same cluster
are similar to each other – Separation: Objects belonging to different clusters are
dissimilar to each other.
• Three fundamental elements of clustering – The set of objects– The set of attributes– Distance measure
Supervised versus Unsupervised Learning
• Supervised learning (classification)– Supervision: Training data (observations,
measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on training set
• Unsupervised learning (clustering)– Class labels of training data are unknown– Given a set of measurements, observations, etc.,
need to establish existence of classes or clusters in data
What Is Good Clustering?• Good clustering method will produce high
quality clusters with– high intra-class similarity– low inter-class similarity
• Quality of a clustering method is also measured by its ability to discover some or all of hidden patterns
• Quality of a clustering result depends on both the similarity measure used by the method and its implementation
Requirements of Clustering in Data Mining • Scalability
• Ability to deal with different types of attributes
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• Discovery of clusters with arbitrary shape
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Application Examples
• A stand-alone tool: explore data distribution
• A preprocessing step for other algorithms• Pattern recognition, spatial data analysis,
image processing, market research, WWW, …– Cluster documents– Cluster web log data to discover groups of
similar access patterns
Gene Expression Data Matrix Gene Expression Patterns
Co-expressed Genes
Why looking for co-expressed genes? Co-expression indicates co-function; Co-expression also indicates co-regulation.
Co-expressed Genes
Gene-based Clustering
-1.5
-1
-0.5
0
0.5
1
1.5
Time Points
Ex
pre
ss
ion
Va
lue
-1.5
-1
-0.5
0
0.5
1
1.5
Time Points
Ex
pre
ss
ion
Le
ve
l
-1.5
-1
-0.5
0
0.5
1
1.5
Time Point
Ex
pre
ss
ion
Va
lue
Examples of co-expressed genes and coherent patterns in gene expression data
Iyer’s data [2]
[2] Iyer, V.R. et al. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999.
Data Matrix
• For memory-based clustering– Also called object-by-variable structure
• Represents n objects with p variables (attributes, measures)– A relational table
npx
nfx
nx
ipx
ifx
ix
px
fxx
1
1
1111
Two-way Clustering of Micoarray Data
sample 1 sample 2sample
3sample
4sample
…
gene 1 0.13 0.72 0.1 0.57
gene 2 0.34 1.58 1.05 1.15
gene 3 0.43 1.1 0.97 1
gene 4 1.22 0.97 1 0.85
gene 5 -0.89 1.21 1.29 1.08
gene 6 1.1 1.45 1.44 1.12
gene 7 0.83 1.15 1.1 1
gene 8 0.87 1.32 1.35 1.13
gene 9 -0.33 1.01 1.38 1.21
gene 10 0.10 0.85 1.03 1
gene…
• Clustering genes
• Samples are attributes
• Find genes with similar function
• Clustering samples
• Genes are attributes.
• Find samples with similar phenotype, e.g. cancers.
• Feature selection.
• Informative genes.
• Curse of dimensionality.
Dissimilarity Matrix
• For memory-based clustering– Also called object-by-object structure– Proximities of pairs of objects– d(i,j): dissimilarity between objects i and j– Nonnegative– Close to 0: similar
0,2)(,1)(
0(3,2)(3,1)
0(2,1)
0
ndnd
dd
d
Distance Matrix
s 1 s 2 s 3 s 4 …
g 1 0.13 0.72 0.1 0.57
g 2 0.34 1.58 1.05 1.15
g 3 0.43 1.1 0.97 1
g 4 1.22 0.97 1 0.85
g 5 -0.89 1.21 1.29 1.08
g 6 1.1 1.45 1.44 1.12
g 7 0.83 1.15 1.1 1
g 8 0.87 1.32 1.35 1.13
g 9 -0.33 1.01 1.38 1.21
g 10 0.10 0.85 1.03 1
…
g 1 g 2 g 3 g 4 …
g 1 0 D(1,2) D(1,3) D(1,4)
g 2 0 D(2,3) D(2,4)
g 3 0 D(3,4)
g 4 0
…
Original Data Matrix
Distance Matrix
How Good Is the Clustering?
• Dissimilarity/similarity depends on distance function– Different applications have different functions– Inter-clusters distance maximization– Intra-clusters distance minimization
• Judgment of clustering quality is typically highly subjective
Types of Data in Clustering
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
Interval-valued Variables
• Continuous measurements of a roughly linear scale– Weight, height, latitude and longitude
coordinates, temperature, etc.
• Effect of measurement units in attributes– Smaller unit larger variable range larger
effect to the result– Standardization + background knowledge
Standardization
• Calculate the mean absolute deviation
,– The mean is not squared, so the effect of outliers
is reduced.
• Calculate the standardized measurement (z-score)
• Mean absolute deviation is more robust– The effect of outliers is reduced but remains
detectable
|)|...|||(|121 fnffffff
mxmxmxns .)...21
1nffff
xx(xn m
f
fifif s
mx z
Minkowski Distance• Minkowski distance: a generalization
• If q = 2, d is Euclidean distance
• If q = 1, d is Manhattan distance
)0(||...||||),(2211
qqj
xi
xj
xi
xj
xi
xjid q
pp
xi
xj
q=2 q=16
6
128.48
Xi (1,7)
Xj(7,1)
Properties of Minkowski Distance
• Nonnegative: d(i,j) 0
• The distance of an object to itself is 0
– d(i,i) = 0
• Symmetric: d(i,j) = d(j,i)• Triangular inequality
– d(i,j) d(i,k) + d(k,j) i j
k
Major Clustering Approaches• Partitioning algorithms: Construct various
partitions and then evaluate them by some criterion
• Hierarchy algorithms: Create a hierarchical decomposition of set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
Clustering Algorithms• If we “clustering” the clustering algorithms
Clustering algorithms
Partition-based
Hierarchical clustering
Centroid-based
K-means
Density-based
Model-based
…
Medoid-based
PAM CLARA CLARANS