L15: Statistical clustering
• Similarity measures
• Criterion functions
• Cluster validity
• Flat clustering algorithms
  – k-means
  – ISODATA
• Hierarchical clustering algorithms
  – Divisive
  – Agglomerative
Non-parametric unsupervised learning
• In L14 we introduced the concept of unsupervised learning
  – A collection of pattern recognition methods that "learn without a teacher"
  – Two types of clustering methods were mentioned: parametric and non-parametric
• Parametric unsupervised learning
  – Equivalent to density estimation with a mixture of (Gaussian) components
  – Through the use of EM, the identity of the component that originated each data point was treated as a missing feature
• Non-parametric unsupervised learning
  – No density functions are considered in these methods
  – Instead, we are concerned with finding natural groupings (clusters) in a dataset
• Non-parametric clustering involves three steps
  – Defining a measure of (dis)similarity between examples
  – Defining a criterion function for clustering
  – Defining an algorithm to minimize (or maximize) the criterion function
Proximity measures
• Definition of metric
  – A measuring rule $d(x, y)$ for the distance between two vectors $x$ and $y$ is considered a metric if it satisfies the following properties

$$d(x, y) \ge 0$$
$$d(x, y) = 0 \iff x = y$$
$$d(x, y) = d(y, x)$$
$$d(x, y) \le d(x, z) + d(z, y)$$

• If the metric has the property $d(ax, ay) = |a| \, d(x, y)$ then it is called a norm and denoted $d(x, y) = \|x - y\|$
• The most general form of distance metric is the power norm

$$\|x - y\|_{p/r} = \left( \sum_{k=1}^{D} |x_k - y_k|^p \right)^{1/r}$$

  – $p$ controls the weight placed on each dimension's dissimilarity, whereas $r$ controls the distance growth of patterns that are further apart
  – Notice that the definition of norm must be relaxed, allowing a power factor for $|a|$

[Marques de Sá, 2001]
• The most commonly used metrics are derived from the power norm
  – Minkowski metric ($L_k$ norm)

$$\|x - y\|_k = \left( \sum_{i=1}^{D} |x_i - y_i|^k \right)^{1/k}$$

  • The choice of an appropriate value of $k$ depends on the amount of emphasis that you would like to give to the larger differences between dimensions
  – Manhattan or city-block distance ($L_1$ norm)

$$\|x - y\|_{cb} = \sum_{i=1}^{D} |x_i - y_i|$$

  • When used with binary vectors, the $L_1$ norm is known as the Hamming distance
  – Euclidean norm ($L_2$ norm)

$$\|x - y\|_e = \left( \sum_{i=1}^{D} |x_i - y_i|^2 \right)^{1/2}$$

  – Chebyshev distance ($L_\infty$ norm)

$$\|x - y\|_c = \max_{1 \le i \le D} |x_i - y_i|$$
[Figure: contours of equal distance in the $(x_1, x_2)$ plane for the $L_1$, $L_2$, and $L_\infty$ norms]
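To make the family concrete, here is a minimal numpy sketch of these norms (the function names and example vectors are ours, chosen for illustration):

```python
import numpy as np

def minkowski(x, y, k):
    """L_k norm of x - y: k=1 is city-block, k=2 is Euclidean."""
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

def chebyshev(x, y):
    """L_inf norm: the single largest per-dimension difference."""
    return np.max(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])
print(minkowski(x, y, 1))  # 5.5   (Manhattan)
print(minkowski(x, y, 2))  # ~3.64 (Euclidean)
print(chebyshev(x, y))     # 3.0   (limit of L_k as k grows)
```

Note how a larger $k$ shifts all the weight toward the dimension with the largest difference, which is exactly the emphasis trade-off mentioned above.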
• Other metrics are also popular
  – Quadratic distance

$$d(x, y) = (x - y)^T B (x - y)$$

  • The Mahalanobis distance is a particular case of this distance
  – Canberra metric (for non-negative features)

$$d_{CA}(x, y) = \sum_{i=1}^{D} \frac{|x_i - y_i|}{x_i + y_i}$$

  – Non-linear distance

$$d_N(x, y) = \begin{cases} 0 & \text{if } d(x, y) < T \\ H & \text{o.w.} \end{cases}$$

  • where $T$ is a threshold and $H$ is a distance
  • An appropriate choice for $H$ and $T$ for feature selection is that they should satisfy

$$H = \frac{\Gamma(D/2)}{\pi^{D/2} T^D}$$

  • and that $T$ satisfies the unbiasedness and consistency conditions of the Parzen estimator: $NT^D \to \infty$, $T \to 0$ as $N \to \infty$

[Webb, 1999]
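A small sketch of the first two of these metrics in numpy (helper names are ours; the Canberra version assumes strictly positive features so the denominators never vanish):

```python
import numpy as np

def quadratic(x, y, B):
    """d(x,y) = (x-y)^T B (x-y); B must be positive semi-definite."""
    d = x - y
    return float(d @ B @ d)

def mahalanobis(x, y, cov):
    """Particular case of the quadratic distance with B = inv(cov)."""
    return np.sqrt(quadratic(x, y, np.linalg.inv(cov)))

def canberra(x, y):
    """Canberra metric; assumes non-negative features with x_i + y_i > 0."""
    return float(np.sum(np.abs(x - y) / (x + y)))
```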
• The above distance metrics are measures of dissimilarity
• Some measures of similarity also exist
  – Inner product

$$s_{INNER}(x, y) = x^T y$$

  • The inner product is used when the vectors $x$ and $y$ are normalized, so that they have the same length
  – Correlation coefficient

$$s_{CORR}(x, y) = \frac{\sum_{i=1}^{D} (x_i - \bar{x})(y_i - \bar{y})}{\left[ \sum_{i=1}^{D} (x_i - \bar{x})^2 \sum_{i=1}^{D} (y_i - \bar{y})^2 \right]^{1/2}}$$

  – Tanimoto measure (for binary-valued vectors)

$$s_T(x, y) = \frac{x^T y}{\|x\|^2 + \|y\|^2 - x^T y}$$
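These three similarity measures are near one-liners in numpy; a sketch (function names ours):

```python
import numpy as np

def s_inner(x, y):
    """Inner-product similarity; assumes x and y are normalized to equal length."""
    return float(x @ y)

def s_corr(x, y):
    """Correlation coefficient between the components of x and y."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc) / np.sqrt(float(xc @ xc) * float(yc @ yc))

def s_tanimoto(x, y):
    """Tanimoto measure for binary vectors: shared attributes over total attributes."""
    shared = float(x @ y)
    return shared / (float(x @ x) + float(y @ y) - shared)
```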
Criterion function for clustering
• Once a (dis)similarity measure has been determined, we need to define a criterion function to be optimized
  – The most widely used clustering criterion is the sum-of-square-error

$$J_{SSE} = \sum_{i=1}^{C} \sum_{x \in \omega_i} \|x - \mu_i\|^2 \quad \text{where} \quad \mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x$$

  • This criterion measures how well the data set $X = \{x^{(1)}, \ldots, x^{(N)}\}$ is represented by the cluster centers $M = \{\mu^{(1)}, \ldots, \mu^{(C)}\}$ ($C < N$)
  • Clustering methods that use this criterion are called minimum-variance methods
  – Other criterion functions exist, based on the scatter matrices used in Linear Discriminant Analysis
  • For details, refer to [Duda, Hart and Stork, 2001]
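Transcribed directly into numpy, the criterion might look as follows (a sketch; `labels` is assumed to hold the cluster index of each row of `X`):

```python
import numpy as np

def j_sse(X, labels):
    """Sum-of-squared-error: squared distance of every example
    to the sample mean of its own cluster."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        mu = members.mean(axis=0)          # cluster center mu_i
        total += float(np.sum((members - mu) ** 2))
    return total
```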
Cluster validity
• The validity of the final cluster solution is highly subjective
  – This is in contrast with supervised training, where a clear objective function is known: Bayes risk
  – Note that the choice of (dis)similarity measure and criterion function will have a major impact on the final clustering produced by the algorithms
• Example
  – Which are the meaningful clusters in these cases?
  – How many clusters should be considered?
  – A number of quantitative methods for cluster validity are proposed in [Theodoridis and Koutroumbas, 1999]
Iterative optimization
• Once a criterion function has been defined, we must find a partition of the data set that minimizes the criterion
  – Exhaustive enumeration of all partitions, which guarantees the optimal solution, is unfeasible
  • For example, a problem with 5 clusters and 100 examples yields about $10^{67}$ partitionings
• The common approach is to proceed in an iterative fashion
  1) Find some reasonable initial partition and then
  2) Move samples from one cluster to another in order to reduce the criterion function
• These iterative methods produce sub-optimal solutions but are computationally tractable
• We will consider two groups of iterative methods
  – Flat clustering algorithms
  • These algorithms produce a set of disjoint clusters
  • Two algorithms are widely used: k-means and ISODATA
  – Hierarchical clustering algorithms
  • The result is a hierarchy of nested clusters
  • These algorithms can be broadly divided into agglomerative and divisive approaches
The k-means algorithm
• Method
  – k-means is a simple clustering procedure that attempts to minimize the criterion function $J_{SSE}$ in an iterative fashion (a minimal sketch is given after the steps below)

$$J_{SSE} = \sum_{i=1}^{C} \sum_{x \in \omega_i} \|x - \mu_i\|^2 \quad \text{where} \quad \mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x$$

  – It can be shown (L14) that k-means is a particular case of the EM algorithm for mixture models

1. Define the number of clusters
2. Initialize clusters by
   • an arbitrary assignment of examples to clusters, or
   • an arbitrary set of cluster centers (some examples used as centers)
3. Compute the sample mean of each cluster
4. Reassign each example to the cluster with the nearest mean
5. If the classification of all samples has not changed, stop; else go to step 3
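A minimal numpy sketch of steps 1-5 (our own implementation, using random examples as the initial centers of step 2):

```python
import numpy as np

def kmeans(X, C, n_iter=100, seed=0):
    """Plain k-means: alternate between reassignment and mean updates
    until no example changes cluster."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=C, replace=False)]  # step 2
    labels = None
    for _ in range(n_iter):
        # step 4: assign each example to the cluster with the nearest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # step 5: stop when the classification no longer changes
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step 3: recompute each center as the sample mean of its cluster
        for c in range(C):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels
```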
• Vector quantization
  – An application of k-means to signal processing and communication
  – Univariate signal values are usually quantized into a number of levels
  • Typically a power of 2, so the signal can be transmitted in binary format
  – The same idea can be extended to multiple channels
  • We could quantize each channel separately
  • Instead, we can obtain a more efficient coding if we quantize the overall multidimensional vector by finding a number of multidimensional prototypes (cluster centers)
  • The set of cluster centers is called a codebook, and the problem of finding this codebook is normally solved using the k-means algorithm
[Figure: a continuous signal and its quantized version over time, with the resulting quantization noise; a 2D vector quantizer showing vectors, codewords, and Voronoi regions]
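Reusing the `kmeans()` sketch above, a two-channel vector quantizer reduces to a few lines (the signal here is synthetic stand-in data, not the one from the figure):

```python
import numpy as np

samples = np.random.default_rng(1).normal(size=(1000, 2))  # stand-in 2-channel signal
codebook, _ = kmeans(samples, C=16)                        # 16 codewords -> 4 bits/vector

# Encode: transmit only the index of the nearest codeword for each sample
idx = np.linalg.norm(samples[:, None, :] - codebook[None, :, :], axis=2).argmin(axis=1)
reconstructed = codebook[idx]  # decode; samples - reconstructed is the quantization noise
```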
ISODATA
• Iterative Self-Organizing Data Analysis (ISODATA)
  – An extension to the k-means algorithm with some heuristics to automatically select the number of clusters
• ISODATA requires the user to select a number of parameters
  – $N_{MIN\_EX}$: minimum number of examples per cluster
  – $N_D$: desired (approximate) number of clusters
  – $\sigma_S^2$: maximum spread parameter for splitting
  – $D_{MERGE}$: maximum distance separation for merging
  – $N_{MERGE}$: maximum number of clusters that can be merged
• The algorithm works in an iterative fashion
  1) Perform k-means clustering
  2) Split any clusters whose samples are sufficiently dissimilar
  3) Merge any two clusters sufficiently close
  4) Go to 1)
1. Select an initial number of clusters $N_C$ and use the first $N_C$ examples as cluster centers $\mu_k$, $k = 1 \ldots N_C$
2. Assign each example to the closest cluster
   a. Exit the algorithm if the classification of examples has not changed
3. Eliminate clusters that contain fewer than $N_{MIN\_EX}$ examples and
   a. Assign their examples to the remaining clusters based on minimum distance
   b. Decrease $N_C$ accordingly
4. For each cluster $k$,
   a. Compute the center $\mu_k$ as the sample mean of all the examples assigned to that cluster
   b. Compute the average distance between examples and cluster centers
      $$d_{AVG} = \frac{1}{N} \sum_{k=1}^{C} N_k d_k \quad \text{where} \quad d_k = \frac{1}{N_k} \sum_{x \in \omega_k} \|x - \mu_k\|$$
   c. Compute the variance of each axis and find the axis $i^*$ with maximum variance $\sigma_k^2(i^*)$
5. For each cluster $k$ with $\sigma_k^2(i^*) > \sigma_S^2$, if $\{d_k > d_{AVG}$ and $N_k > 2N_{MIN\_EX} + 1\}$ or $\{N_C < N_D/2\}$
   a. Split that cluster into two clusters whose centers $\mu_{k1}$ and $\mu_{k2}$ differ only in the coordinate $i^*$ (a sketch of this step is given after the pseudocode)
      i. $\mu_{k1}(i^*) = \mu_k(i^*) + \delta \, \sigma_k(i^*)$ (all other coordinates remain the same, $0 < \delta < 1$)
      ii. $\mu_{k2}(i^*) = \mu_k(i^*) - \delta \, \sigma_k(i^*)$ (all other coordinates remain the same, $0 < \delta < 1$)
   b. Increment $N_C$ accordingly
   c. Reassign the cluster's examples to one of the two new clusters based on minimum distance to the cluster centers
6. If $N_C > 2N_D$ then
   a. Compute all pairwise distances between cluster centers, $D_{ij} = d(\mu_i, \mu_j)$
   b. Sort the $D_{ij}$ in increasing order
   c. For each pair of clusters sorted by $D_{ij}$, if (1) neither cluster has already been merged, (2) $D_{ij} < D_{MERGE}$, and (3) no more than $N_{MERGE}$ pairs of clusters have been merged in this loop, then
      i. Merge the $i$-th and $j$-th clusters
      ii. Compute the new cluster center $\mu' = \frac{N_i \mu_i + N_j \mu_j}{N_i + N_j}$
      iii. Decrement $N_C$ accordingly
7. Go to step 2 [Therrien, 1989]
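The split (step 5a) and merge (step 6c-ii) updates are small enough to sketch on their own; everything else is the k-means loop plus bookkeeping. The helper names and the default δ below are ours:

```python
import numpy as np

def split_center(mu, sigma_istar, i_star, delta=0.5):
    """Step 5a: produce two centers that differ from mu only in
    coordinate i*, displaced by +/- delta * sigma (0 < delta < 1)."""
    mu1, mu2 = mu.copy(), mu.copy()
    mu1[i_star] += delta * sigma_istar
    mu2[i_star] -= delta * sigma_istar
    return mu1, mu2

def merge_centers(mu_i, n_i, mu_j, n_j):
    """Step 6c-ii: center of the merged cluster as the weighted mean."""
    return (n_i * mu_i + n_j * mu_j) / (n_i + n_j)
```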
• ISODATA has been shown to be an extremely powerful heuristic
• Some of its advantages are
  – Self-organizing capabilities
  – Flexibility in eliminating clusters that have very few examples
  – Ability to divide clusters that are too dissimilar
  – Ability to merge clusters that are sufficiently similar
• However, it suffers from the following limitations
  – Data must be linearly separable (long narrow or curved clusters are not handled properly)
  – It is difficult to know a priori the "optimal" parameters
  – Performance is highly dependent on these parameters
  – For large datasets and large numbers of clusters, ISODATA is less efficient than other linear methods
  – Convergence is unknown, although it appears to work well for non-overlapping clusters
• In practice, ISODATA is run multiple times with different values of the parameters, and the clustering with minimum SSE is selected
Hierarchical clustering
• k-means and ISODATA create disjoint clusters, resulting in a flat data representation
  – However, sometimes it is desirable to obtain a hierarchical representation of data, with clusters and sub-clusters arranged in a tree-structured fashion
  – Hierarchical representations are commonly used in the sciences (e.g., biological taxonomy)
• Hierarchical clustering methods can be grouped in two general classes
  – Agglomerative
  • Also known as bottom-up or merging
  • Starting with N singleton clusters, successively merge clusters until one cluster is left
  – Divisive
  • Also known as top-down or splitting
  • Starting with a unique cluster, successively split clusters until N singleton examples are left
Dendrograms
• A binary tree that shows the structure of the clusters
  – Dendrograms are the preferred representation for hierarchical clusters
  • In addition to the binary tree, the dendrogram provides the similarity measure between clusters (the vertical axis)
  – An alternative representation is based on sets
    $\{\{x_1, \{x_2, x_3\}\}, \{\{\{x_4, x_5\}, \{x_6, x_7\}\}, x_8\}\}$
  • However, unlike the dendrogram, sets cannot express quantitative information
[Figure: dendrogram for the eight examples $x_1 \ldots x_8$, with the vertical axis running from high similarity at the bottom to low similarity at the top]
Divisive clustering
• Define
  – $N_C$: number of clusters
  – $N_{EX}$: number of examples
• How to choose the "worst" cluster
  – Largest number of examples
  – Largest variance
  – Largest sum-squared-error...
• How to split clusters
  – Mean-median in one feature direction
  – Perpendicular to the direction of largest variance...
• The computations required by divisive clustering are more intensive than for agglomerative clustering methods
  – For this reason, agglomerative approaches are more popular (a sketch of the divisive loop follows the steps below)

1. Start with one large cluster
2. Find "worst" cluster
3. Split it
4. If $N_C < N_{EX}$ go to 2
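A hedged sketch of this loop, choosing the largest-SSE cluster as "worst" and splitting it with the `kmeans()` sketch from earlier (both choices are just two of the options listed above):

```python
import numpy as np

def divisive(X, n_clusters):
    """Top-down clustering: repeatedly split the cluster with the
    largest sum-squared-error using 2-means."""
    clusters = [np.asarray(X, dtype=float)]
    while len(clusters) < n_clusters:
        sse = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        worst = clusters.pop(int(np.argmax(sse)))  # assumed to hold >= 2 examples
        _, labels = kmeans(worst, C=2)             # kmeans() sketch from earlier
        clusters += [worst[labels == 0], worst[labels == 1]]
    return clusters
```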
Agglomerative clustering
• Define
  – $N_C$: number of clusters
  – $N_{EX}$: number of examples
• How to find the "nearest" pair of clusters (all four rules are sketched in code after the steps below)
  – Minimum distance

$$d_{min}(\omega_i, \omega_j) = \min_{x \in \omega_i, \, y \in \omega_j} \|x - y\|$$

  – Maximum distance

$$d_{max}(\omega_i, \omega_j) = \max_{x \in \omega_i, \, y \in \omega_j} \|x - y\|$$

  – Average distance

$$d_{avg}(\omega_i, \omega_j) = \frac{1}{N_i N_j} \sum_{x \in \omega_i} \sum_{y \in \omega_j} \|x - y\|$$

  – Mean distance

$$d_{mean}(\omega_i, \omega_j) = \|\mu_i - \mu_j\|$$

1. Start with $N_{EX}$ singleton clusters
2. Find nearest clusters
3. Merge them
4. If $N_C > 1$ go to 2
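All four cluster-distance rules are available off the shelf; for instance, in scipy (centroid linkage is the closest match to $d_{mean}$):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data
d = pdist(X)                                       # condensed pairwise distances

Z_single   = linkage(d, method='single')    # d_min
Z_complete = linkage(d, method='complete')  # d_max
Z_average  = linkage(d, method='average')   # d_avg
Z_centroid = linkage(d, method='centroid')  # distance between cluster means (d_mean)
```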
• Minimum distance
  – When $d_{min}$ is used to measure distance between clusters, the algorithm is called the nearest-neighbor or single-linkage clustering algorithm
  – If the algorithm is allowed to run until only one cluster remains, the result is a minimum spanning tree (MST)
  – This algorithm favors elongated classes
• Maximum distance
  – When $d_{max}$ is used to measure distance between clusters, the algorithm is called the farthest-neighbor or complete-linkage clustering algorithm
  – From a graph-theoretic point of view, each cluster constitutes a complete sub-graph
  – This algorithm favors compact classes
• Average and mean distance
  – $d_{min}$ and $d_{max}$ are extremely sensitive to outliers, since their measurement of between-cluster distance involves minima or maxima
  – $d_{avg}$ and $d_{mean}$ are more robust to outliers
  – Of the two, $d_{mean}$ is more attractive computationally
  • Notice that $d_{avg}$ involves the computation of $N_i N_j$ pairwise distances
• Example
  – Perform agglomerative clustering on $X$ using the single-linkage metric
    $X = \{1, 3, 4, 9, 10, 13, 21, 23, 28, 29\}$
  • In case of ties, always merge the pair of clusters with the largest mean
  • Indicate the order in which the merging operations occur
[Figure: solution dendrogram drawn over the number line 1-33; the nine merges are numbered in order, with the vertical axis showing the merge distance (0-8)]
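The example can be checked with scipy (note that scipy's internal tie-breaking need not match the largest-mean rule above, so the numbering of equal-distance merges may differ):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([1, 3, 4, 9, 10, 13, 21, 23, 28, 29], dtype=float).reshape(-1, 1)
Z = linkage(X, method='single')   # single-linkage (d_min) on Euclidean distances

print(Z[:, 2])                    # the nine merge distances, in merge order
info = dendrogram(Z, labels=[str(int(v)) for v in X.ravel()], no_plot=True)
print(info['ivl'])                # leaf order of the resulting dendrogram
```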
$d_{min}$ vs. $d_{max}$

        BOS    NY    DC   MIA   CHI   SEA    SF    LA   DEN
BOS       0   206   429  1504   963  2976  3095  2979  1949
NY      206     0   233  1308   802  2815  2934  2786  1771
DC      429   233     0  1075   671  2684  2799  2631  1616
MIA    1504  1308  1075     0  1329  3273  3053  2687  2037
CHI     963   802   671  1329     0  2013  2142  2054   996
SEA    2976  2815  2684  3273  2013     0   808  1131  1307
SF     3095  2934  2799  3053  2142   808     0   379  1235
LA     2979  2786  2631  2687  2054  1131   379     0  1059
DEN    1949  1771  1616  2037   996  1307  1235  1059     0
[Figure: single-linkage dendrogram (leaf order BOS, NY, DC, CHI, DEN, SEA, SF, LA, MIA; dissimilarity scale 0-1000) and complete-linkage dendrogram (leaf order BOS, NY, DC, CHI, MIA, SEA, SF, LA, DEN; dissimilarity scale 0-3000)]
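A sketch that reproduces both dendrograms from the mileage table (the trailing comments describe the qualitative outcome shown in the figure):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ['BOS', 'NY', 'DC', 'MIA', 'CHI', 'SEA', 'SF', 'LA', 'DEN']
D = np.array([
    [   0,  206,  429, 1504,  963, 2976, 3095, 2979, 1949],
    [ 206,    0,  233, 1308,  802, 2815, 2934, 2786, 1771],
    [ 429,  233,    0, 1075,  671, 2684, 2799, 2631, 1616],
    [1504, 1308, 1075,    0, 1329, 3273, 3053, 2687, 2037],
    [ 963,  802,  671, 1329,    0, 2013, 2142, 2054,  996],
    [2976, 2815, 2684, 3273, 2013,    0,  808, 1131, 1307],
    [3095, 2934, 2799, 3053, 2142,  808,    0,  379, 1235],
    [2979, 2786, 2631, 2687, 2054, 1131,  379,    0, 1059],
    [1949, 1771, 1616, 2037,  996, 1307, 1235, 1059,    0]], dtype=float)

d = squareform(D)                           # condensed form expected by linkage()
Z_single   = linkage(d, method='single')    # d_min: DEN chains onto the eastern group; MIA merges last
Z_complete = linkage(d, method='complete')  # d_max: MIA stays with the east coast; DEN joins the west
```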