Chapter 1
Introduction to Clustering
Section 1.1
Introduction
Objectives
• Introduce clustering and unsupervised learning.
• Explain the various forms of cluster analysis.
• Outline several key distance metrics used as estimates of experimental unit similarity.
Course Overview
• Variable Selection: VARCLUS
• Plot Data: PRINCOMP, MDS, CANDISC
• Preprocessing: ACECLUS
• ‘Fuzzy’ Clustering: FACTOR
• Discrete Clustering
  – Hierarchical Clustering: CLUSTER
  – Optimization Clustering
    – Parametric Clustering: FASTCLUS
    – Non-Parametric Clustering: MODECLUS
Definition
“Cluster analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual.”
B. S. Everitt (1998), “The Cambridge Dictionary of Statistics”
[Figure: cluster solution diagram; labels: Sensible, Interpretable, Un-interpretable; Given Class, Derived Class]
Unsupervised Learning
Learning without a priori knowledge about the classification of samples; learning without a teacher.
Kohonen (1995), “Self-Organizing Maps”
Section 1.2
Types of Clustering
Objectives
• Distinguish between the two major classes of clustering methods:
  – hierarchical clustering
  – optimization (partitive) clustering.
Hierarchical Clustering
[Figure: agglomerative and divisive hierarchical clustering shown across iterations 1 to 4]
Propagation of Errors
[Figure: iterations 1 to 4 of a hierarchical clustering, with three steps marked “(error)”]
Optimization (Partitive) Clustering
[Figure: two panels, Initial State and Final State. “Seeds” (marked X) are placed among the observations; in the final state each seed has moved from its old location to a new location.]
Heuristic Search
1. Find an initial partition of the n objects into g groups.
2. Calculate the change in the error function produced by moving each observation from its own cluster to another group.
3. Make the change resulting in the greatest improvement in the error function.
4. Repeat steps 2 and 3 until no move results in improvement.
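
The four steps above can be sketched in Python. This is only an illustration under stated assumptions: the slide does not name an error function, so the within-cluster sum of squared distances to the centroid is assumed here, and the function names `sse` and `heuristic_search`, the sample points, and the rule that a move may not empty a cluster are all hypothetical choices made for the example.

```python
def sse(points, labels, g):
    """Assumed error function: within-cluster sum of squared distances to centroids."""
    total = 0.0
    for k in range(g):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            continue
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum((a - c) ** 2 for p in members for a, c in zip(p, centroid))
    return total


def heuristic_search(points, initial_labels, g):
    labels = list(initial_labels)          # step 1: initial partition supplied by the caller
    current = sse(points, labels, g)
    while True:
        best_error, best_move = current, None
        for i in range(len(points)):       # step 2: try moving each observation
            if labels.count(labels[i]) == 1:
                continue                   # assumption: never empty a cluster
            for k in range(g):
                if k == labels[i]:
                    continue
                trial = labels[:i] + [k] + labels[i + 1:]
                trial_error = sse(points, trial, g)
                if trial_error < best_error - 1e-12:
                    best_error, best_move = trial_error, (i, k)
        if best_move is None:              # step 4: stop when no move improves the error
            return labels, current
        i, k = best_move                   # step 3: make the single best move
        labels[i] = k
        current = best_error


# Hypothetical data: two well-separated groups, deliberately mixed at the start.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_labels, final_error = heuristic_search(points, [0, 1, 0, 1, 0, 1], g=2)
print(final_labels, final_error)
```

Because each pass makes only the single best move, the search can stop at a local optimum that depends on the initial partition.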
Section 1.3
Similarity Metrics
Objectives
• Define similarity and what comprises a good measure of similarity.
• Describe a variety of similarity metrics.
What Is Similarity?
Although the concept of similarity is fundamental to our thinking, it is often difficult to quantify precisely.
Which is more similar to a duck: a crow or a penguin?
The metric that you choose to operationalize similarity (for example, Euclidean distance or Pearson correlation) often affects the clusters you recover.
What Makes a Good Similarity Metric?
The following principles have been identified as a foundation of any good similarity metric:
1. symmetry: d(x,y) = d(y,x)
2. non-identical distinguishability: if d(x,y) ≠ 0 then x ≠ y
3. identical non-distinguishability: if d(x,y) = 0 then x = y
Some popular similarity metrics (for example, correlation) fail to meet one or more of these criteria.
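
The claim about correlation can be checked with a small Python sketch. It assumes correlation is turned into a dissimilarity as 1 − r, a common convention that the slide does not specify; the helper names and the two vectors are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(x, y):
    """Assumed dissimilarity: 1 minus the Pearson correlation."""
    return 1.0 - pearson(x, y)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]   # a different vector that is perfectly correlated with x

# Principle 1 (symmetry) holds.
symmetric = math.isclose(correlation_distance(x, y), correlation_distance(y, x))

# Principle 3 fails: the distance is zero although x and y are not equal.
zero_but_distinct = (
    math.isclose(correlation_distance(x, y), 0.0, abs_tol=1e-12) and x != y
)
print(symmetric, zero_but_distinct)
```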
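Euclidean Distance Similarity Metric
Pythagorean Theorem: The square of the hypotenuse is equal to the sum of the squares of the other two sides.
$$h^2 = \sum_{i=1}^{2} x_i^2$$
[Figure: right triangle in the (x1, x2) plane with hypotenuse h running from (0, 0) to (x1, x2)]
$$D_E = \sqrt{\sum_{i=1}^{d} (x_i - w_i)^2}$$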
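
As a minimal sketch, the Euclidean distance between two points x and w with d coordinates each can be computed as follows. The function name and the sample points are illustrative only.

```python
import math


def euclidean_distance(x, w):
    """Square root of the summed squared coordinate differences."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same number of coordinates")
    return math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))


# The 3-4-5 right triangle: distance from the origin to (3, 4).
d_e = euclidean_distance((3, 4), (0, 0))
print(d_e)  # 5.0
```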
City Block Distance Similarity Metric
City block (Manhattan) distance is the distance between two points measured along axes at right angles.
$$D_M = \sum_{i=1}^{d} |x_i - w_i|$$
[Figure: city block path between the points (w1, w2) and (x1, x2)]
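
A minimal Python sketch of the city block distance, using a hypothetical pair of points:

```python
def city_block_distance(x, w):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same number of coordinates")
    return sum(abs(xi - wi) for xi, wi in zip(x, w))


# Moving from (1, 2) to (4, 6) takes 3 units along one axis and 4 along the other.
d_m = city_block_distance((1, 2), (4, 6))
print(d_m)  # 7
```

For the same pair of points the straight-line (Euclidean) distance is 5, so the city block distance is never the smaller of the two.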
Correlation Similarity Metrics
[Figure: three scatter plots. Tom vs. Marie: Similar. Jerry vs. Marie: Dissimilar. Tom vs. Jerry: No Similarity.]
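The Problem with Correlation
| Variable  | Observation 1 | Observation 2 |
|-----------|---------------|---------------|
| x1        | 5             | 51            |
| x2        | 4             | 42            |
| x3        | 3             | 33            |
| x4        | 2             | 24            |
| x5        | 1             | 15            |
| Mean      | 3             | 33            |
| Std. Dev. | 1.5811        | 14.2302       |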
The correlation between observations 1 and 2 is a perfect 1.0, but are the observations really similar?
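
The figures in this table can be reproduced with a short Python sketch that also computes the Euclidean distance between the two observations for contrast. The variable names are illustrative.

```python
import math
import statistics

obs1 = [5, 4, 3, 2, 1]
obs2 = [51, 42, 33, 24, 15]

mean1, mean2 = statistics.mean(obs1), statistics.mean(obs2)
sd1, sd2 = statistics.stdev(obs1), statistics.stdev(obs2)   # sample standard deviations

# Pearson correlation computed from its definition.
sxy = sum((a - mean1) * (b - mean2) for a, b in zip(obs1, obs2))
sxx = sum((a - mean1) ** 2 for a in obs1)
syy = sum((b - mean2) ** 2 for b in obs2)
r = sxy / math.sqrt(sxx * syy)

# Euclidean distance between the same two observations.
euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(obs1, obs2)))

print(round(sd1, 4), round(sd2, 4))   # 1.5811 14.2302
print(round(r, 6), round(euclid, 2))  # 1.0 71.69
```

Observation 2 is an exact linear function of observation 1 (9 times the value plus 6), which is why the correlation is perfect even though the two profiles are far apart in Euclidean terms.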
Density Estimate Based Similarity Metrics
$$\hat{f}_i = \frac{n_i}{n v_i}$$
Clusters can be seen as areas of increased observation density. Similarity is a function of the distance between the identified density bubbles (hyper-spheres).
[Figure: similarity shown as the distance between Density Estimate 1 (Cluster 1) and Density Estimate 2 (Cluster 2)]
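
The slide does not define the symbols in the density formula. The sketch below assumes the usual reading: n is the total number of observations, n_i is the number of observations falling inside a hyper-sphere centered on observation i (counting the observation itself), and v_i is the volume of that sphere. A single fixed radius is also assumed, and the function names and data are hypothetical.

```python
import math


def sphere_volume(radius, dim):
    """Volume of a dim-dimensional hyper-sphere of the given radius."""
    return math.pi ** (dim / 2) * radius ** dim / math.gamma(dim / 2 + 1)


def density_estimates(points, radius):
    """Assumed estimator: f_i = n_i / (n * v_i) with a fixed-radius sphere."""
    n = len(points)
    v = sphere_volume(radius, len(points[0]))
    estimates = []
    for center in points:
        # n_i: observations within the sphere around this observation (itself included)
        n_i = sum(1 for p in points if math.dist(p, center) <= radius)
        estimates.append(n_i / (n * v))
    return estimates


# Hypothetical data: three close points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
dens = density_estimates(points, radius=1.0)
print(dens)
```

The three neighboring points receive a higher estimate than the isolated one, which is the sense in which a cluster is an area of increased observation density.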
Hamming Distance Similarity Metric
Gene expression levels under 17 conditions (low=0, high=1)
Condition: 1 2 3 4 5 … 17
Gene A: 0 1 1 0 0 1 0 0 1 0 0 1 1 1 0 0 1
Gene B: 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1
DH = 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 1 0 = 5
$$D_H = \sum_{i=1}^{d} |x_i - w_i|$$
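
The gene example can be verified with a few lines of Python. The variable names are illustrative; the data are the two rows shown on the slide.

```python
gene_a = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
gene_b = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1]

# Per-condition mismatch indicator, then the count of mismatches.
mismatches = [abs(a - b) for a, b in zip(gene_a, gene_b)]
hamming = sum(mismatches)

print(mismatches)
print(hamming)  # 5
```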
The DISTANCE Procedure
General form of the DISTANCE procedure:
    PROC DISTANCE METHOD=method <options>;
        COPY variables;
        VAR level (variables < / option-list >);
    RUN;
Both the PROC DISTANCE statement and the VAR statement are required.
Generating Distances (ch1s3d1)
This demonstration illustrates the impact on cluster formation of two distance metrics generated by the DISTANCE procedure.
Section 1.4
Classification Performance
Objectives
• Use classification matrices to determine the quality of a proposed cluster solution.
• Use the chi-square and Cramer’s V statistics to assess the relative strength of the derived association.
Quality of the Cluster Solution
[Figure: three example panels labeled Perfect Solution, Typical Solution, and No Solution]
Probability of Cluster Assignment
The probability that a cluster number represents a given class is given by the cluster’s proportion of the row total.
[Figure: a Frequency table and the corresponding Probability table]
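
The conversion from frequencies to probabilities can be sketched as follows. The counts are hypothetical, and the layout is an assumption based on the wording above: each row is a given class, each column is a cluster number, and every cell is divided by its row total.

```python
# Hypothetical classification matrix: rows are given classes, columns are clusters 1 to 3.
frequency = {
    "class A": [45, 5, 0],
    "class B": [3, 40, 7],
    "class C": [2, 8, 40],
}

# Each cell divided by its row total gives the cluster's proportion of that row.
probability = {
    cls: [count / sum(row) for count in row]
    for cls, row in frequency.items()
}

for cls, row in probability.items():
    print(cls, [round(p, 2) for p in row])
```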
The Chi-Square Statistic
$$\chi^2 = \sum_i \sum_j \frac{(\text{observed}_{ij} - \text{expected}_{ij})^2}{\text{expected}_{ij}}$$
The chi-square statistic (and associated probability)
• determine whether an association exists
• depend on sample size
• do not measure the strength of the association.
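
A minimal Python sketch of the chi-square statistic for a two-way table follows. It assumes the standard expected count under independence (row total times column total divided by the grand total); the function name and the counts are hypothetical. Multiplying every count by 10 shows the dependence on sample size.

```python
def chi_square(table):
    """Pearson chi-square for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # expected under independence
            stat += (observed - expected) ** 2 / expected
    return stat


table = [[10, 20], [20, 10]]
bigger = [[count * 10 for count in row] for row in table]   # same proportions, 10x the sample

chi2_small = chi_square(table)
chi2_big = chi_square(bigger)
print(round(chi2_small, 4), round(chi2_big, 4))  # 6.6667 66.6667
```

The cell proportions are identical in both tables, yet the statistic grows tenfold with the sample size, so it cannot serve as a measure of strength on its own.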
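Measuring Strength of an Association
CRAMER'S V STATISTIC
[Figure: scale running from 0 (WEAK) to 1 (STRONG)]
$$\text{Cramer's } V = \sqrt{\frac{\chi^2 / n}{\min(r-1,\, c-1)}}$$
Cramer’s V ranges from -1 to 1 for 2×2 tables; for larger tables it ranges from 0 to 1.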
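
The formula can be sketched in Python as follows, with r and c taken as the numbers of rows and columns of the table and n as the total count. The function names and the counts are hypothetical, and this sketch returns only the non-negative square root; it does not reproduce the signed value reported for 2×2 tables.

```python
import math


def chi_square(table):
    """Pearson chi-square for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )


def cramers_v(table):
    """Non-negative Cramer's V: sqrt((chi2 / n) / min(r - 1, c - 1))."""
    n = sum(sum(row) for row in table)
    r, c = len(table), len(table[0])
    return math.sqrt((chi_square(table) / n) / min(r - 1, c - 1))


table = [[10, 20], [20, 10]]
bigger = [[count * 10 for count in row] for row in table]   # same proportions, 10x the sample

v_small = cramers_v(table)
v_big = cramers_v(bigger)
print(round(v_small, 4), round(v_big, 4))  # 0.3333 0.3333
```

Unlike the chi-square statistic, the value is the same for both tables, because dividing by n removes the dependence on sample size.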