Cluster Analysis
Pekka Malo 30E00500 – Quantitative Empirical Research Spring 2016
What is cluster analysis?
18.01.16 Cluster Analysis
Cluster analysis is known by many names …
• Segmentation
• Q-analysis
• Classification
• Unsupervised learning
• Taximetrics
• Learning without a teacher
Purpose: Find a way to group data in a meaningful manner
Cluster Analysis (CA) ~ a method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on a set of variables that describe the key features of the observations
Cluster ~ a group of observations that are similar to each other and different from observations in other clusters
http://obelia.jde.aca.mmu.ac.uk/multivar/other.htm
Example 1
How many clusters and how do you cluster?
Example 2
How many clusters and how do you cluster?
Example 3
How many clusters and how do you cluster?
Objectives in Cluster Analysis
• Between-cluster variation: maximize
• Within-cluster variation: minimize
Source: Hair et al. (2010)
Within-groups vs. Between-groups
• Within-groups property: Each group is homogeneous with respect to certain characteristics, i.e. observations in each group are similar to each other
• Between-groups property: Each group should be different from other groups with respect to the same characteristics, i.e. observations of one group should be different from the observations of other groups
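The two properties can be made concrete by splitting the total variation into a within-cluster and a between-cluster part. The sketch below is only an illustration: the toy data and the helper names (`centroid`, `sum_of_squares`, `variation_split`) are assumptions and do not come from the slides.

```python
def centroid(points):
    """Component-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)


def variation_split(clusters):
    """Return (within, between) sums of squares for a list of clusters."""
    all_points = [p for cluster in clusters for p in cluster]
    grand = centroid(all_points)
    within = sum(sum_of_squares(c, centroid(c)) for c in clusters)
    # Each cluster contributes its size times the squared distance
    # between its centroid and the grand centroid.
    between = sum(len(c) * sum_of_squares([centroid(c)], grand) for c in clusters)
    return within, between


# Hypothetical toy data: two well-separated groups in two dimensions.
clusters = [
    [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)],
    [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
]
all_points = [p for c in clusters for p in c]
within, between = variation_split(clusters)
total = sum_of_squares(all_points, centroid(all_points))
print(within, between, total)
```

For a good grouping the within part is small relative to the between part, and the two parts add up to the total variation.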
How many clusters and how do you cluster?
What can we do with cluster analysis?
• Detect groups which are statistically significant
  – Taxonomy description: Natural groups in data
  – Simplification of data: Groups instead of individuals
• Identify meaning for the clusters
  – Which relationships can be identified?
• Explain the clusters and find ways to use them
Classification vs. Clustering
• Classification
  – We know the “groups” for at least some of the observations
  – Objective is to find a rule / function which correctly assigns observations into groups
  – Supervised learning procedure
• Clustering
  – We don’t know the groups a priori
  – Objective is to group together points “which are similar”
  – Identify the underlying “hidden” structure in the data
  – Unsupervised learning procedure (i.e. no labeled data for training)
Clustering ~ “post hoc” segmentation
• Any discrete variable is a segmentation
  – E.g., gender, geographical area, etc.
• A priori segmentation
  – Use existing discrete variables to create segments
• Post hoc segmentation
  – Collect data on various attributes
  – Apply statistical technique to find segments
Sample size considerations
• Representativeness: The sample used for the cluster analysis should be representative of the population and its underlying structure (in particular the potential groups of interest)
• Minimum group sizes should be based on relevance to the research question and the confidence needed in characterizing the groups
Phases of Clustering
1. Choice of variables
2. Similarity measures
3. Technique (hierarchical / nonhierarchical)
4. Decision regarding the number of clusters
5. Evaluation of significance
Step 1: Goals and choice of variables
• No theoretical guidelines
• Driven by the problem and practical significance
  – Do the variables help to characterize the objects?
  – Are the variables clearly related to the objectives?
• Warning:
  – Avoid including variables “just because you can”
  – Results are dramatically affected by inclusion of even one or two inappropriate or undifferentiated variables
Choice of variables
Example: Gamers and their 4 screens
Source: gameindustryblog.com (Newzoo model)
Choice of variables (cont.)
Example: Gamers and their 4 screens
Source: gameindustryblog.com (Newzoo model)
Step 2: Choice of similarity measure
How close or similar are two observations?
Interobject similarity is an empirical measure of correspondence, or resemblance, between objects to be clustered.
Types of similarity measures
• Distance (or dissimilarity) Measures
  – Euclidean Distance
  – Minkowski Metric
  – Euclidean Distance for Standardized Data
  – Mahalanobis Distance
• Association Coefficient
• Correlation Coefficient
• Subjective Similarity
Distance Measures
• Minkowski metric:
  – n = 2 : Euclidean Distance
  – n = 1 : City-block Distance
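The formula itself is not preserved in the transcript. As a minimal sketch, assuming the usual definition (the n-th root of the summed n-th powers of the absolute coordinate differences), the two special cases can be checked numerically. The function name `minkowski` and the sample points are illustrative assumptions.

```python
def minkowski(x, y, n):
    """Minkowski distance of order n between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** n for a, b in zip(x, y)) ** (1.0 / n)


# Hypothetical pair of observations measured on two variables.
x, y = (0.0, 0.0), (3.0, 4.0)
euclidean = minkowski(x, y, 2)   # n = 2: straight-line distance
city_block = minkowski(x, y, 1)  # n = 1: sum of absolute differences
print(euclidean, city_block)
```

The city-block distance is never smaller than the Euclidean distance for the same pair of points, which is one reason the choice of n can change which observations look closest.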
Euclidean Distance: Example
Standardization of variables
• Standardization of variables is commonly preferred to avoid problems due to different scales
• Most commonly done using Z-scores
• If groups are to be formed based on respondents’ response styles, then within-case or row-centering standardization can be considered
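A short sketch of both kinds of standardization mentioned above. The variables `income` and `age` and their values are made up for illustration; Z-scores here use the sample standard deviation, which is an assumption about the convention intended on the slide.

```python
import statistics


def z_scores(values):
    """Standardize one variable to mean 0 and sample standard deviation 1."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("a constant variable cannot be standardized")
    return [(v - m) / s for v in values]


def row_center(row):
    """Within-case centering: subtract a respondent's own mean from each answer."""
    m = statistics.mean(row)
    return [v - m for v in row]


# Hypothetical variables on very different scales.
income = [20000.0, 35000.0, 50000.0]  # euros
age = [25.0, 40.0, 55.0]              # years
z_income = z_scores(income)
z_age = z_scores(age)
print(z_income, z_age)

# Hypothetical respondent who answers high on every item of a 1-5 scale.
print(row_center([5.0, 4.0, 5.0]))
```

After Z-scoring, both variables contribute on the same scale to a distance calculation, whereas on the raw scale income would dominate age.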
Distance Measures (cont.)
• Euclidean Distance for Standardized Data:
• Mahalanobis Distance:
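The formulas for these two measures did not survive in the transcript. As a sketch, the standard textbook forms are shown below. The notation is an assumption rather than the slide's own: $x_{ik}$ is the value of variable $k$ for observation $i$, $p$ the number of variables, $s_k$ the sample standard deviation of variable $k$, and $S$ the sample covariance matrix.

```latex
d_{ij}^{\mathrm{std}} = \sqrt{\sum_{k=1}^{p} \frac{(x_{ik} - x_{jk})^2}{s_k^2}}
\qquad
d_{ij}^{\mathrm{Mah}} = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)'\, S^{-1}\, (\mathbf{x}_i - \mathbf{x}_j)}
```

When the variables are uncorrelated, $S$ is diagonal and the Mahalanobis distance reduces to the standardized Euclidean distance.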
Standardized Euclidean Distance: Example
Mahalanobis Distance: Example
Association Coefficients
• Consider two objects described by p binary variables. Similarity between objects A and B can be presented as follows:
where
| Object A \ Object B | 0 | 1 |
|---|---|---|
| 0 | a | b |
| 1 | c | d |

Example
A: 0 0 1 1 0 0 1 1 1 1 0
B: 1 1 0 0 1 0 0 1 1 0 0
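The coefficient formula on this slide is not preserved, so the sketch below only counts the four cells for the example vectors and then evaluates two commonly used association coefficients, simple matching and Jaccard. Both the choice of coefficients and the cell layout (rows for Object A, columns for Object B) are assumptions.

```python
def contingency_counts(obj_a, obj_b):
    """Count the cells a, b, c, d of the 2x2 table for two binary vectors."""
    if len(obj_a) != len(obj_b):
        raise ValueError("objects must be described by the same variables")
    pairs = list(zip(obj_a, obj_b))
    a = pairs.count((0, 0))  # both objects lack the attribute
    b = pairs.count((0, 1))  # only Object B has it
    c = pairs.count((1, 0))  # only Object A has it
    d = pairs.count((1, 1))  # both objects have it
    return a, b, c, d


# Example vectors from the slide.
A = [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0]
B = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0]

a, b, c, d = contingency_counts(A, B)
simple_matching = (a + d) / (a + b + c + d)  # share of variables on which A and B agree
jaccard = d / (b + c + d)                    # agreement ignoring joint absences
print((a, b, c, d), simple_matching, jaccard)
```

For these vectors the counts are a = 2, b = 3, c = 4, d = 2, so simple matching is 4/11 and Jaccard is 2/9. The two coefficients differ because Jaccard does not treat a shared 0 as evidence of similarity.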
Correlation Coefficients
• Between Observations i and j
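The slide's formula is missing from the transcript. A minimal sketch, assuming the ordinary Pearson correlation computed across the variables of two observations (rather than across observations for two variables), is given below. The function name and the profiles are hypothetical.

```python
import math


def profile_correlation(obs_i, obs_j):
    """Pearson correlation between two observations across their variables."""
    p = len(obs_i)
    mean_i = sum(obs_i) / p
    mean_j = sum(obs_j) / p
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(obs_i, obs_j))
    var_i = sum((a - mean_i) ** 2 for a in obs_i)
    var_j = sum((b - mean_j) ** 2 for b in obs_j)
    return cov / math.sqrt(var_i * var_j)


# Hypothetical profiles on four variables.
obs_1 = [1.0, 2.0, 3.0, 4.0]
obs_2 = [11.0, 12.0, 13.0, 14.0]  # same pattern as obs_1, shifted upwards
obs_3 = [4.0, 3.0, 2.0, 1.0]      # reversed pattern

print(profile_correlation(obs_1, obs_2))
print(profile_correlation(obs_1, obs_3))
```

Observations 1 and 2 correlate perfectly even though they are far apart in Euclidean terms, which illustrates that a correlation measure captures similarity of pattern rather than of magnitude.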
Procedures and choice of measure
• Some procedures are sensitive to the similarity measure
• Multiple measures should be employed to compare results and gain confidence in the solution
Outliers
• Should be removed when
  – Observation is aberrant and not representative of population
  – Observation is representative of small or insignificant segments which are not of interest for the research objectives
• Should be retained when
  – They reflect under-sampling issues
  – They reflect poor representation of relevant groups in the population
Identification of outliers using similarity measures
• Consider observations with large distances from all other observations
• Use graphic profile diagrams to highlight aberrant observations
• Appearance in very small or single-member clusters
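The first criterion can be sketched as follows: compute each observation's average distance to all others and inspect the largest values. The data and the helper name `mean_distance_to_others` are hypothetical, and using the mean distance as the screening score is one possible choice, not the slide's prescription.

```python
import math


def mean_distance_to_others(data):
    """Average Euclidean distance from each observation to all other observations."""
    scores = []
    for i, p in enumerate(data):
        dists = [math.dist(p, q) for j, q in enumerate(data) if j != i]
        scores.append(sum(dists) / len(dists))
    return scores


# Hypothetical data: four observations close together and one far away.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (10.0, 10.0)]
scores = mean_distance_to_others(data)
most_isolated = max(range(len(scores)), key=scores.__getitem__)
print(scores, most_isolated)
```

An observation flagged this way is only a candidate. Whether it is removed or retained still depends on the substantive considerations about representativeness.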
Multicollinearity of variables
• Substantial multicollinearity of input variables is problematic
• If large collinearity is found, the following strategies could be considered
  – Reduce variables to equal numbers in each set of correlated measures
  – Use a distance measure that compensates for the correlation (e.g. Mahalanobis distance)
  – Include only variables that are not highly correlated
Step 3: Hierarchical Clustering
• Centroid method
• Nearest-neighbor or single-linkage method
• Farthest-neighbor or complete-linkage method
• Average linkage method
• Ward method
Hierarchical Clustering with SPSS
Types of hierarchical methods
• Agglomerative ~ Build-up
• Divisive ~ Break-down
Agglomerative vs. Divisive
Source: Hair et al. (2010)
How do agglomerative approaches work?
1. Start with each observation as its own cluster
2. Use the selected similarity measure to combine the two most similar observations into a new cluster of two observations
3. Repeat the procedure, using the similarity measure to group together the most similar observations or combinations of observations into another new cluster
4. Continue until all observations are in a single cluster
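The procedure above can be sketched as a naive loop. This is an illustrative implementation under stated assumptions, not the algorithm used by SPSS: distances are Euclidean, the function name `agglomerate` and the data are hypothetical, and the `linkage` argument selects how the distance between two clusters is summarized (`min` for single linkage, `max` for complete linkage).

```python
import math


def agglomerate(points, linkage=min):
    """Naive agglomerative clustering; returns the merge history.

    Each history entry is (sorted member indices of the new cluster, merge distance).
    """
    clusters = [[i] for i in range(len(points))]  # every observation starts alone
    history = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        history.append((sorted(merged), d))
        # Replace the two merged clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history


# Hypothetical observations on two variables.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (20.0, 0.0)]
for members, distance in agglomerate(points):
    print(members, round(distance, 2))
```

The printed merge distances play the role of an agglomeration schedule: a large jump between consecutive merges suggests that two dissimilar clusters were joined.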
Example: Single-Linkage Method
• Principle
  – The distance between two clusters is represented by the minimum of the distance between all possible pairs of subjects in the two groups
Example: Single-Linkage Method
[Figure: “Single Linkage” chart; axis label: Points]
Example: Complete-Linkage Method
• Principle:
  – The distance between two clusters is represented by the maximum of the distance between all possible pairs of subjects in the two groups
Example: Complete-Linkage Method
[Figure: “Complete Linkage” chart; axis label: Points]
Source: Hair et al. (2010)
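Ward’s Method
• “Incremental sum of squares method” that uses the within-cluster distances and between-cluster distances
• Joins groups A and B that minimize increase in SSE:

$$\mathrm{SSE}_A = \sum_{i=1}^{n_A} (y_i - \bar{y}_A)'(y_i - \bar{y}_A)$$

$$\mathrm{SSE}_B = \sum_{i=1}^{n_B} (y_i - \bar{y}_B)'(y_i - \bar{y}_B)$$

$$\mathrm{SSE}_{AB} = \sum_{i=1}^{n_{AB}} (y_i - \bar{y}_{AB})'(y_i - \bar{y}_{AB})$$

$$I_{AB} = \mathrm{SSE}_{AB} - (\mathrm{SSE}_A + \mathrm{SSE}_B)$$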
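A small numeric sketch of the increase in SSE, following the definitions on this slide. The groups A, B and C and the helper names are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a group of observations."""
    return [sum(col) / len(points) for col in zip(*points)]


def sse(points):
    """Sum of squared deviations of the observations from their group centroid."""
    center = centroid(points)
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)


def ward_increase(group_a, group_b):
    """Increase in SSE caused by merging two groups: SSE_AB - (SSE_A + SSE_B)."""
    return sse(group_a + group_b) - (sse(group_a) + sse(group_b))


# Hypothetical groups of two observations each.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(4.0, 0.0), (4.0, 2.0)]
C = [(10.0, 0.0), (10.0, 2.0)]

print(ward_increase(A, B))  # merging A with the nearby group B
print(ward_increase(A, C))  # merging A with the distant group C
```

Ward's method would join A and B first, because that merge adds less to the total within-cluster sum of squares than joining A and C.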
Choice of hierarchical approach
Pros and cons
• Single-linkage
  – Most versatile, but poorly delineated cluster structures in a dataset may lead to snakelike cluster-chains
• Complete-linkage
  – No chaining, but impacted by outliers
• Average linkage
  – Considers average similarity of all individuals in a cluster
  – Tends to generate clusters with small within-cluster variation
  – Less affected by outliers
Choice of hierarchical approach (cont.)
Pros and cons
• Ward’s method
  – Uses total sum of squares within clusters
  – Most appropriate when equally sized clusters are expected
  – Easily distorted by outliers
• Centroid linkage
  – Considers difference between cluster centroids
  – Less affected by outliers
Choosing the number of clusters
• No single objective procedure
• Evaluation based on following considerations:
  – Single-member or extremely small clusters are not acceptable and should be eliminated
  – Ad-hoc stopping rules in hierarchical methods are based on the rate of change in a total similarity measure as the number of clusters increases or decreases
  – Clusters should be significantly different across the set of variables
  – Solutions must have theoretical validity based on external validation
Measures of heterogeneity change
• Percentage changes in heterogeneity
  – E.g. use of agglomeration coefficient in SPSS, which measures heterogeneity as distance at which clusters are formed
  – E.g. within-cluster sum of squares when Ward’s method is considered
• Measures of variance change
  – Root mean square standard deviation (RMSSTD) ~ square root of the variance of the new cluster formed by joining two clusters, where the variance is computed across all clustering variables
  – Large increase in RMSSTD indicates joining of two dissimilar clusters
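A percentage-change stopping rule can be sketched as below. The coefficients are invented for illustration, keyed by the number of clusters remaining after each merge, and the function name `suggest_cluster_count` is hypothetical. The rule shown (stop just before the largest relative jump) is one simple heuristic, not a definitive procedure.

```python
def suggest_cluster_count(coefficient_by_k):
    """Return (k, jump): the solution just before the largest percentage increase.

    coefficient_by_k maps a number of clusters k to the heterogeneity
    (agglomeration coefficient) of the k-cluster solution. Keys are assumed
    to be consecutive integers.
    """
    ks = sorted(coefficient_by_k, reverse=True)  # e.g. 6, 5, ..., 1
    best_k, best_jump = None, float("-inf")
    for before, after in zip(ks, ks[1:]):
        old, new = coefficient_by_k[before], coefficient_by_k[after]
        jump = 100.0 * (new - old) / old
        if jump > best_jump:
            best_k, best_jump = before, jump
    return best_k, best_jump


# Hypothetical agglomeration coefficients for the last six stages.
coefficients = {6: 10.0, 5: 12.0, 4: 14.5, 3: 17.0, 2: 40.0, 1: 75.0}
k, jump = suggest_cluster_count(coefficients)
print(k, round(jump, 1))
```

Here the step from three to two clusters more than doubles the heterogeneity, so the three-cluster solution is the candidate to examine further against the other considerations.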
Visualization of the solution
A dendrogram is convenient when the number of observations is not very high
Use agglomeration schedule to decide number of clusters
Look for a demarcation point
Step 4: Refine the solution with Non-hierarchical Clustering Procedures
• Sometimes a combination of hierarchical and nonhierarchical methods is considered:
– Use hierarchical method (e.g., Ward’s) to choose number of clusters and profile cluster centers that serve as initial seeds
– Use nonhierarchical method (e.g., k-Means) to cluster all observations using the seed points to provide more accurate cluster membership
Hierarchical vs. Non-hierarchical
• Choose hierarchical method when
  – A wide range of (possibly all) cluster solutions is to be examined
  – Sample size is moderate (under 300-400), no more than 1000
• Choose nonhierarchical method when
  – Number of clusters is known
  – Initial seed points can be specified on a practical, objective or theoretical basis
  – Results are less susceptible to outliers, distance measure or inclusion of irrelevant variables
  – Works on large datasets
Hybrid approaches
• Sometimes a combination of hierarchical and nonhierarchical methods is considered
• Idea:
  – Use hierarchical method to choose number of clusters and profile cluster centers that serve as initial seeds
  – Use nonhierarchical method to cluster all observations using the seed points to provide more accurate cluster memberships
Simple k-Means algorithm
Given an initial seed, the algorithm alternates between the following steps:
1. Assignment step:
  – Add each observation to the cluster whose mean leads to the least within-group sum of squares (squared Euclidean distance)
2. Update step:
  – Compute new cluster means and use them as centroids for observations in the updated cluster
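The two alternating steps can be sketched in a few lines. This is a bare-bones illustration under stated assumptions (user-supplied seeds, squared Euclidean distance, stop when assignments no longer change); the function names and data are hypothetical, and it is not the SPSS implementation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, seeds, max_iter=100):
    """Simple k-means: returns (cluster index per point, final centroids)."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the nearest mean.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no observation changed cluster, so the solution is stable
        assignment = new_assignment
        # Update step: recompute each cluster mean from its current members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical observations and seed points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
seeds = [(0.0, 0.0), (10.0, 10.0)]
assignment, centroids = k_means(points, seeds)
print(assignment, centroids)
```

In a hybrid analysis the seeds would be the cluster centers profiled from a hierarchical solution rather than arbitrary points.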
K-Means in SPSS
Save solution and examine output
Non-hierarchical Clustering
• Distance Measures
• Statistical Criteria
  – tr W: “Within-group sums of squares” (= Euclidean Distance)
  – tr W^{-1}T: “Hotelling’s trace”
  – |W|: “Wilks’ lambda”
  – Largest eigenvalue of W^{-1}T
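The matrices W and T are not defined in the surviving text. A sketch of the usual definitions follows, assuming W is the pooled within-group scatter matrix and T the total scatter matrix. The symbols $x_i$, $\bar{x}$, $\bar{x}_k$, $C_k$, $n_k$ and $B$ are notation introduced here, not taken from the slides.

```latex
T = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'
\qquad
W = \sum_{k=1}^{K} \sum_{i \in C_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)'
\qquad
B = \sum_{k=1}^{K} n_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})'

T = W + B
\qquad
\operatorname{tr} W = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2
```

Under these definitions, minimizing tr W is the same as minimizing the total squared Euclidean distance of the observations from their cluster means, which is the criterion the simple k-means algorithm works on.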
tr W
![Page 63: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/63.jpg)
$\mathrm{tr}(W^{-1}T)$
![Page 64: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/64.jpg)
|W|
![Page 65: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/65.jpg)
![Page 66: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/66.jpg)
Step 5: Evaluation of cluster solutions
Source: Marketing research (Winter 2010)
![Page 67: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/67.jpg)
How informative is your solution?
• Generalizability: Are the segments identifiable in a larger population?
• Substantiality: How sizeable are the segments when compared to each other?
• Accessibility and actionability: How easily can the segments be reached? Can we execute strategies using the solution?
• Stability: Is the solution repeatable (e.g. if new measurements are done)?
Segmentation is information compression. Good segmentation conveys key information about the important variables or attributes.
![Page 68: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/68.jpg)
Statistical vs. Practical criteria
• Statistical:
– Do the segment profiles differ in a statistically significant manner?
– What attributes contribute most to the group differences?
– Are the groups internally homogeneous and externally heterogeneous?
• Practical:
– Are the segments substantial enough for making profit?
– Is the solution stable?
– Can we reach the segment in a cost-effective manner?
– Is it useful for decision making purposes?
– Do the segments respond consistently to stimulus?
![Page 69: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/69.jpg)
Dash of criticism
• Conceptual vs. empirical support
• Descriptive, atheoretical, non-inferential?
• Clusters always produced regardless of empirical structure?
• Solution not generalizable due to dependence on variables used for defining similarity measure?
![Page 70: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/70.jpg)
Comparison of profiles
Source: Rencher: Methods of Multivariate Analysis
![Page 71: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/71.jpg)
Any differences between population means?
• $\mu_i$: a sample mean for population $P_i$
• $W/(n-1)$: a common variance-covariance matrix (estimator)
$W = W_1 + W_2 + \dots + W_g$
$\Lambda = |W| / |T| = |W| / |W + B| = |T - B| / |T|$,
where
– $T$ = "total" sum of squares and products (SSP) matrix
– $W$ = "within-groups" SSP matrix
– $W_k$ = "total" SSP matrix in group $k$
– $B$ = "between-groups" SSP matrix = $T - W$
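To make the ratio concrete, the sketch below computes $\Lambda = |W| / |T|$ for a small toy data set using only the Python standard library. The helper names (`ssp_matrix`, `determinant`, `wilks_lambda`) and the data are hypothetical. The determinant uses a simple cofactor expansion that is only sensible for a handful of variables.

```python
def ssp_matrix(rows):
    """Sum of squares and products of the rows around their own mean."""
    n = len(rows)
    mean = [sum(col) / n for col in zip(*rows)]
    p = len(mean)
    ssp = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [x - m for x, m in zip(row, mean)]
        for a in range(p):
            for b in range(p):
                ssp[a][b] += dev[a] * dev[b]
    return ssp


def determinant(matrix):
    """Determinant by cofactor expansion along the first row (small p only)."""
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0.0
    for j, value in enumerate(matrix[0]):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * value * determinant(minor)
    return total


def wilks_lambda(groups):
    """Lambda = |W| / |T|, with W = W_1 + ... + W_g."""
    all_rows = [row for group in groups for row in group]
    T = ssp_matrix(all_rows)  # total SSP matrix around the overall mean
    p = len(T)
    W = [[0.0] * p for _ in range(p)]
    for group in groups:
        Wk = ssp_matrix(group)  # SSP matrix within group k
        for a in range(p):
            for b in range(p):
                W[a][b] += Wk[a][b]
    return determinant(W) / determinant(T)


# Toy data (hypothetical): two clusters of four observations on two variables.
cluster_a = [(0, 0), (2, 0), (0, 2), (2, 2)]
cluster_b = [(10, 10), (12, 10), (10, 12), (12, 12)]
lam = wilks_lambda([cluster_a, cluster_b])
print(lam)  # |W| = 64, |T| = 3264, so Lambda = 1/51, about 0.0196
```

Values near 0 indicate that most of the total variation lies between the groups. Values near 1 indicate that the group means barely differ.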
![Page 72: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/72.jpg)
T: Total SSP-matrix
![Page 73: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/73.jpg)
W: Within Groups SSP-matrix
![Page 74: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/74.jpg)
One-Way MANOVA
$x_{kj} = \mu + \tau_k + e_{kj}$
(observation $j$ in group $k$) = (overall mean) + (treatment effect) + (error)
![Page 75: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/75.jpg)
Assumptions for MANOVA test
• Observations must be independent
• Variance-covariance matrices must be equal for all treatment groups
• The set of dependent variables must follow a multivariate normal distribution
![Page 76: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/76.jpg)
Interpretation of solutions
• Examine profiles for distinguishing characteristics
– Where and how much do the cluster profiles differ?
• Solutions failing to show substantial differences in mean profiles indicate that other solutions should be examined
• Cluster centroids should be assessed with respect to prior expectations based on theory or practical experience
![Page 77: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/77.jpg)
Profiling the cluster solutions
• Once clusters are identified, the objective is to describe the characteristics of each cluster and how they differ on relevant dimensions
• Utilize data not included in the cluster procedure to profile the characteristics of each cluster
– Demographics, psychographics, consumption patterns, etc.
• Often done using Discriminant Analysis to compare average score profiles for the clusters
– Dependent variable (categorical) = cluster membership
– Independent variables = Demographics + Psychographics + …
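Before running a discriminant analysis, a first descriptive look at the profiles can be had by averaging each external variable within each cluster. The sketch below does only that; it is not discriminant analysis. The function name `cluster_profiles`, the variable names `age` and `monthly_spend`, and the data are hypothetical.

```python
from collections import defaultdict


def cluster_profiles(labels, records):
    """Mean of every profiling variable within each cluster."""
    grouped = defaultdict(list)
    for label, record in zip(labels, records):
        grouped[label].append(record)
    profiles = {}
    for label, members in grouped.items():
        profiles[label] = {
            var: sum(m[var] for m in members) / len(members)
            for var in members[0]
        }
    return profiles


# Cluster memberships from an earlier clustering run (hypothetical).
labels = [0, 0, 1, 1, 1]
# External variables that were NOT used to form the clusters (hypothetical).
records = [
    {"age": 25, "monthly_spend": 100.0},
    {"age": 35, "monthly_spend": 140.0},
    {"age": 50, "monthly_spend": 300.0},
    {"age": 60, "monthly_spend": 330.0},
    {"age": 55, "monthly_spend": 270.0},
]
profiles = cluster_profiles(labels, records)
print(profiles[0])  # {'age': 30.0, 'monthly_spend': 120.0}
print(profiles[1])  # {'age': 55.0, 'monthly_spend': 300.0}
```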
![Page 78: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/78.jpg)
Validation
• Cross-validation:
– Create sub-samples of the dataset (random splitting)
– Compare cluster solutions for consistency (number of clusters and profiles)
■ A very stable solution has less than 10% of observations assigned differently
■ A stable solution has 10-20% of observations assigned to a different group
■ A somewhat stable solution has 20-25% of observations assigned to a different cluster
• Using relevant external variables:
– Examine differences on variables not included in the cluster analysis but for which there is a theoretical and relevant reason to expect variation across the clusters
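The stability thresholds can be applied mechanically once two cluster solutions are expressed as label lists for the same observations. How those two lists are obtained is left open here. The sketch below assumes they already exist, matches the arbitrary cluster numbers by trying every relabeling, and reports the share of observations assigned differently. The function names and the "not stable" label for shares above 25% are assumptions and do not come from the slide.

```python
from itertools import permutations


def share_assigned_differently(labels_a, labels_b, k):
    """Smallest share of mismatches over all relabelings of labels_b."""
    n = len(labels_a)
    best = n
    for perm in permutations(range(k)):
        # perm[b] is the cluster number in solution A matched to cluster b.
        mismatches = sum(1 for a, b in zip(labels_a, labels_b) if a != perm[b])
        best = min(best, mismatches)
    return best / n


def stability_label(share):
    """Map a mismatch share to the rule-of-thumb categories."""
    if share < 0.10:
        return "very stable"
    if share <= 0.20:
        return "stable"
    if share <= 0.25:
        return "somewhat stable"
    return "not stable"  # assumption: the slide gives no label above 25%


# Two solutions for the same ten observations (hypothetical).
# The cluster numbers are swapped and one observation moves.
solution_a = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
solution_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
share = share_assigned_differently(solution_a, solution_b, k=2)
print(share, stability_label(share))  # 0.1 stable
```

Trying all permutations is fine for the small numbers of clusters typical in segmentation work. The cost grows factorially with k.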
![Page 79: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/79.jpg)
Thank you!
![Page 80: Cluster Analysis - MyCourses...Cluster Analysis (CA) ~ method for organizing data (people, things, events, products, companies, etc.) into meaningful groups or taxonomies based on](https://reader036.vdocuments.mx/reader036/viewer/2022063015/5fd3764b93d9bd301e6498d8/html5/thumbnails/80.jpg)
R – give it a spin!