CLUSTER ANALYSIS
Wei-Jiun, Shen Ph. D.
When you were still young and naïve…
Classification
You may classify them by
Shape
You may classify them by
Color
You may classify them by
Sum of internal angles
Classification
Shape Color Sum of internal angles
Similarity of characteristics
Purpose of cluster analysis
Grouping objects based on the similarity of the characteristics they possess.
Homogeneity within clusters, heterogeneity between clusters.
Geometrically, the objects within a cluster will be close together, while different clusters will be far apart.
Major role that cluster analysis can play
Data reduction
  Classify a large number of observations into manageable groups
Taxonomy description
  Exploratory
  Confirmatory
Examining the influence of clusters on dependent variables
  e.g., whether different motivational constructs are differentially associated with effort and enjoyment
How does cluster analysis work?
The primary objective of cluster analysis is to define the structure of the data by placing the most similar observations into groups.
What clustering variables can be used?
How do we measure similarity?
How do we form clusters?
How many clusters do we form?
Selecting clustering variables
Statistically: any quantitative variable
Theoretically, conceptually, practically: a theoretical foundation corresponding to the research question
Measuring similarity
Similarity
  The degree of correspondence among objects across all of the characteristics.
Correlational measures
Distance measures
Similarity measure
Correlation measure: grouping cases based on the pattern of responses
Distance measure: grouping cases based on distance
[Figure: profiles of case1, case2 and case3 across variables X1 to X4; vertical axis from 0 to 5]
Distance measures
Euclidean distance
Squared Euclidean distance
City-block (Manhattan) distance
Chebychev distance
Mahalanobis D² (standardization / variance-covariance)
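The distance measures listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the routine of any statistics package: the function names, the two cases `p` and `q`, and the variance-covariance matrix `cov` are hypothetical, and the Mahalanobis version is written out for two variables only.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def city_block(a, b):
    # Manhattan distance: sum of absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

def chebychev(a, b):
    # Largest absolute difference on any single variable
    return max(abs(x - y) for x, y in zip(a, b))

def mahalanobis_d2(a, b, cov):
    """Mahalanobis D^2 for two variables, given a 2x2 variance-covariance matrix."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    (s11, s12), (_, s22) = cov
    det = s11 * s22 - s12 * s12
    return (dx * dx * s22 - 2 * dx * dy * s12 + dy * dy * s11) / det

# Two hypothetical cases measured on two variables
p, q = (1.0, 2.0), (4.0, 6.0)
cov = [[9.0, 0.0], [0.0, 16.0]]  # assumed variances 9 and 16, no covariance

print(euclidean(p, q))            # 5.0
print(squared_euclidean(p, q))    # 25.0
print(city_block(p, q))           # 7.0
print(chebychev(p, q))            # 4.0
print(mahalanobis_d2(p, q, cov))  # 2.0
```

With uncorrelated variables, the Mahalanobis D² reduces to the squared Euclidean distance computed on variables divided by their standard deviations, which is why the slide pairs it with standardization.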
Illustration
Forming clusters
Similarity
Method
  Hierarchical: agglomerative / divisive
  Non-hierarchical (quick)
Number of clusters
Theoretically specified
Statistical stopping rule
  Measures of heterogeneity change
Example
N = 7
2 variables (scale range from 0 to 10)
Hierarchical cluster analysis with agglomerative method
Example
Scatterplot
Example - measure similarity
Euclidean distance
Example - similarity & number of clusters
Procedure
0.778, -0.048, 0.090, 0.662, 0.524
(|E-F| + |E-G| + |F-G|) / 3
(|E-F| + |E-G| + |F-G|) + |F-G| / 4
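The expression (|E-F| + |E-G| + |F-G|) / 3 reads as an average of the pairwise distances inside a cluster. A minimal sketch of that calculation follows, assuming Euclidean distance. The coordinates are hypothetical and are not the data behind the slide, and the function name is made up for illustration.

```python
import math
from itertools import combinations

def average_within_cluster_distance(clusters, coords):
    """Mean Euclidean distance over all pairs of objects that share a cluster."""
    distances = [
        math.dist(coords[a], coords[b])
        for members in clusters
        for a, b in combinations(members, 2)
    ]
    return sum(distances) / len(distances) if distances else 0.0

# Hypothetical coordinates for five objects
coords = {"C": (10, 0), "D": (10, 2), "E": (0, 0), "F": (3, 4), "G": (0, 4)}

# One three-member cluster: three pairs, so the sum is divided by 3
print(average_within_cluster_distance([["E", "F", "G"]], coords))  # 4.0

# Adding a two-member cluster contributes a fourth pair, so the divisor becomes 4
print(average_within_cluster_distance([["E", "F", "G"], ["C", "D"]], coords))  # 3.5
```

Tracking how this average changes from one step to the next is one way to see when a merge joins objects that are not very similar.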
Example – graphing
Graphical portrayal
Example
Dendrogram
Standardizing the data
Clustering variables that have scales using widely differing numbers of scale points or that exhibit large differences in standard deviations should be standardized.
Z-score
Standardized distance (e.g., Mahalanobis distance)
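A small sketch of z-score standardization, assuming the sample standard deviation (n - 1 in the denominator). The two variables and their values are hypothetical. They are chosen so that the raw scales differ widely while the standardized scores coincide.

```python
import statistics

def z_scores(values):
    """Convert raw scores to z-scores: subtract the mean, divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

income = [20000, 35000, 50000]  # hypothetical variable on a large scale
rating = [1, 3, 5]              # hypothetical variable on a 5-point scale

print(z_scores(income))  # [-1.0, 0.0, 1.0]
print(z_scores(rating))  # [-1.0, 0.0, 1.0]
```

Without standardization, a distance computed on the raw values would be driven almost entirely by the income variable.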
Deriving clusters
Hierarchical cluster analysis: Hierarchical
Non-hierarchical cluster analysis: K-means
Combination of both methods: Two Step
HIERARCHICAL CLUSTER ANALYSIS
Hierarchical cluster analysis (HCA)
The stepwise procedure
  Agglomerate or divide groups step by step
Agglomerative (SPSS selected)
  Aggregate object with object / cluster with cluster
  N clusters to 1 cluster
Divisive
  Separate clusters into objects
  1 cluster to N clusters
Dendrogram / tree graph
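The agglomerative procedure (N clusters down to 1 cluster) can be sketched as a loop that repeatedly merges the two closest clusters. This is a naive illustration rather than SPSS's implementation. The function name, the four points, and the choice of the shortest between-cluster distance as the default rule are assumptions. The returned merge history is the information a tree graph displays.

```python
import math

def agglomerate(points, linkage=min):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[name] for name in points]  # start: every object is its own cluster
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Between-cluster distance: min (or max) over all cross-cluster pairs
                d = linkage(
                    math.dist(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Hypothetical objects measured on two variables
points = {"A": (0, 0), "B": (1, 0), "C": (5, 0), "D": (5, 1.5)}
for left, right, distance in agglomerate(points):
    print(left, right, distance)
# ['A'] ['B'] 1.0
# ['C'] ['D'] 1.5
# ['A', 'B'] ['C', 'D'] 4.0
```

Passing `linkage=max` switches the rule to the farthest pair of members without changing the loop.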
Agglomerative algorithms
Single linkage
Complete linkage
Average linkage
Centroid method
Ward’s method
Mahalanobis distance
Agglomerative algorithms
Single linkage / nearest-neighbor method
  Defines similarity between clusters as the shortest distance from any object in one cluster to any object in the other.
Pics: retrieved from http://ppt.cc/uKm0
Agglomerative algorithms
Complete linkage / farthest-neighbor method
  Defines similarity between two clusters as the maximum distance between any two members in the two clusters.
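Single and complete linkage differ only in which cross-cluster pair they look at. A minimal sketch, assuming Euclidean distance; the two small clusters are hypothetical.

```python
import math

def single_linkage(c1, c2):
    # Shortest distance from any object in one cluster to any object in the other
    return min(math.dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2):
    # Maximum distance between any two members of the two clusters
    return max(math.dist(a, b) for a in c1 for b in c2)

left = [(0, 0), (1, 0)]   # hypothetical cluster
right = [(4, 0), (6, 0)]  # hypothetical cluster

print(single_linkage(left, right))    # 3.0
print(complete_linkage(left, right))  # 6.0
```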
Agglomerative algorithms
Centroid method
  Cluster centroids are the mean values of the observations on the variables of the cluster.
  The distance between two clusters equals the distance between their two centroids.
Agglomerative algorithms
Average linkage
  The distance between two clusters is defined as the average distance between all pairs of the two clusters’ members.
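The centroid method and average linkage can give different answers for the same pair of clusters. A short sketch under the assumption of Euclidean distance; the clusters are hypothetical.

```python
import math

def centroid(cluster):
    """Mean of the observations on each variable."""
    n = len(cluster)
    return tuple(sum(column) / n for column in zip(*cluster))

def centroid_distance(c1, c2):
    # Distance between the two cluster centroids
    return math.dist(centroid(c1), centroid(c2))

def average_linkage(c1, c2):
    # Average distance over all pairs with one member from each cluster
    total = sum(math.dist(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

left = [(0, 0), (0, 2)]  # hypothetical cluster with centroid (0, 1)
right = [(3, 1)]         # hypothetical single-member cluster

print(centroid(left))                  # (0.0, 1.0)
print(centroid_distance(left, right))  # 3.0
print(average_linkage(left, right))    # about 3.162
```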
Agglomerative algorithms
Ward’s method
  The similarity between two clusters is the sum of squares within the clusters summed over all variables.
  Least variance within cluster
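A sketch of the quantity behind Ward's method. It assumes the common formulation in which the pair of clusters to merge is the one whose merger adds the least to the within-cluster sum of squares. The function names and the three clusters are hypothetical.

```python
def within_ss(cluster):
    """Sum of squared deviations from the cluster mean, summed over all variables."""
    n = len(cluster)
    center = [sum(column) / n for column in zip(*cluster)]
    return sum((x - m) ** 2 for point in cluster for x, m in zip(point, center))

def ward_increase(c1, c2):
    # How much the within-cluster sum of squares grows if c1 and c2 are merged
    return within_ss(c1 + c2) - within_ss(c1) - within_ss(c2)

a = [(0, 0), (0, 2)]  # hypothetical cluster
b = [(1, 1)]          # hypothetical nearby object
c = [(6, 1)]          # hypothetical distant object

print(within_ss(a))         # 2.0
print(ward_increase(a, b))  # about 0.667
print(ward_increase(a, c))  # 24.0
```

Under this rule, `a` would be merged with `b` rather than `c`, since that merger keeps the within-cluster variance smallest.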
Number of clusters
Theoretically specified
Statistical stopping rule
  Measures of heterogeneity change
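One way to apply a stopping rule based on heterogeneity change is to look for the merge that produces the largest jump. The sketch below is an assumption about how such a rule might be coded, not a standard procedure from any package. The heterogeneity values are hypothetical, and the dictionary is assumed to hold consecutive cluster counts.

```python
def suggest_number_of_clusters(heterogeneity):
    """heterogeneity maps a number of clusters to the heterogeneity of that solution.

    Returns the cluster count just before the largest increase, plus all increases.
    """
    counts = sorted(heterogeneity, reverse=True)  # e.g. 6, 5, 4, 3, 2, 1
    jumps = {}
    for more, fewer in zip(counts, counts[1:]):
        # Increase in heterogeneity caused by going from `more` to `fewer` clusters
        jumps[more] = heterogeneity[fewer] - heterogeneity[more]
    best = max(jumps, key=jumps.get)
    return best, jumps

# Hypothetical heterogeneity measure for solutions with 6 down to 1 clusters
measure = {6: 1.0, 5: 1.2, 4: 1.5, 3: 1.9, 2: 4.0, 1: 5.0}

best, jumps = suggest_number_of_clusters(measure)
print(best)  # 3
```

Here the move from three to two clusters raises heterogeneity far more than any earlier merge, which points to the three-cluster solution. A rule like this is a guide only, and the theoretical criterion on the slide still applies.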
Hierarchical cluster analysis
The hierarchical cluster analysis provides an excellent framework with which to compare any set of cluster solutions.
This method helps in judging how many clusters should be retained or considered.
NON-HIERARCHICAL CLUSTER ANALYSIS
Non-hierarchical cluster analysis (non-HCA)
Non-hierarchical cluster analysis assigns objects to clusters once the number of clusters is specified.
Two steps in non-HCA
  Specify cluster seeds: identify starting points
  Assignment: assign each observation to one of the cluster seeds.
Non-hierarchical cluster analysis-algorithm
Aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
Cluster seed assignment
  Sequential (1 by 1)
  Parallel (simultaneously)
  Optimization
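The idea of assigning each observation to the cluster with the nearest mean can be sketched as the usual iterative k-means loop. This is a minimal version that assigns all observations in parallel, and it is not the algorithm of any particular package. The function name, the seed points, and the data are hypothetical.

```python
import math

def k_means(points, seeds, max_iter=100):
    """Assign points to the nearest center, update centers, repeat until stable."""
    centers = [tuple(seed) for seed in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest cluster center
        new_assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # memberships no longer change
        assignment = new_assignment
        # Update step: each center becomes the mean of its members
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # hypothetical cases
assignment, centers = k_means(points, seeds=[(1, 1), (8, 8)])
print(assignment)  # [0, 0, 0, 1, 1, 1]
print(centers)
```

Because the result depends on where the seeds start, running the sketch with different seeds is a quick way to check how stable a solution is.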
K-means method
[Figure: scatterplot of cases]
Pros and Cons of HCA
Advantage
  Comprehensive information
  A wide range of alternative clustering solutions
Disadvantage
  Outliers
  Large samples / large numbers of variables
Pros and Cons of non-HCA
Advantage
  Less susceptible to outliers
  Extremely large data sets
Disadvantage
  Less information
  Susceptible to initial seed point
Combination of each method
Two step
  A hierarchical technique is used to select the number of clusters and profile cluster centers that serve as initial cluster seeds in the nonhierarchical procedure.
  A nonhierarchical method then clusters all observations using the seed points to provide more accurate cluster memberships.
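The combination described above can be sketched as two stages: a hierarchical pass that produces cluster centers, then a nearest-mean pass seeded with those centers. This illustrates the idea only and is not SPSS's TwoStep procedure. The centroid method in the first stage, the function names, and the data are all assumptions.

```python
import math

def mean_point(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def hierarchical_seeds(points, k):
    """Stage 1: centroid-method agglomeration down to k clusters; return the centers."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(
            pairs,
            key=lambda ij: math.dist(mean_point(clusters[ij[0]]), mean_point(clusters[ij[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return [mean_point(c) for c in clusters]

def assign_to_nearest(points, centers):
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c])) for p in points]

def two_stage(points, k, max_iter=100):
    """Stage 2: nonhierarchical reassignment, seeded with the stage 1 centers."""
    centers = hierarchical_seeds(points, k)
    labels = assign_to_nearest(points, centers)
    for _ in range(max_iter):
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = mean_point(members)
        new_labels = assign_to_nearest(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # hypothetical cases
labels, centers = two_stage(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice, the value of `k` passed to the second stage would come from inspecting the hierarchical solution rather than being fixed in advance.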
Interpretation of clusters
Mean profile of cluster
Name the clusters
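A mean profile is simply the average of each clustering variable within each cluster. A short sketch follows. The variable names, scores, and cluster labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def cluster_profiles(rows, labels, variables):
    """Return the mean of every variable within every cluster."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {var: mean(r[i] for r in members) for i, var in enumerate(variables)}
        for label, members in grouped.items()
    }

variables = ["autonomy_support", "control"]  # hypothetical clustering variables
rows = [(6, 2), (7, 1), (2, 6), (3, 7)]      # hypothetical scores for four cases
labels = [0, 0, 1, 1]                        # hypothetical cluster memberships

profiles = cluster_profiles(rows, labels, variables)
print(profiles[0])  # {'autonomy_support': 6.5, 'control': 1.5}
print(profiles[1])  # {'autonomy_support': 2.5, 'control': 6.5}
```

Profiles like these are what a cluster name is based on: here cluster 0 might be labeled as high autonomy support and low control.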
Validation of clusters
Cross validation
  Two sub-samples
Confirmatory
  Discriminant analysis
Predictive validity
  Differences on variables
  Profile analysis
  (Multivariate) analysis of variance
Assumptions of cluster analysis
Inferential statistics?
Representativeness
Multicollinearity Factor analysis Cluster analysis
Compare to other multivariate analyses
Cluster analysis (CA) vs. Factor analysis (FA)
  CA: grouping cases based on distance (proximity)
  FA: grouping observations based on pattern of variations (correlation)
Cluster analysis vs. Discriminant analysis (DA)
  CA: group is NOT given (exploratory)
  DA: group is given (confirmatory)
Summary
Research question
Assumption confirmation
  Multicollinearity
Cluster analysis
  Selecting clustering variables
  Conducting analysis
  Interpreting clusters
  Validating clusters
Main analysis
It is just a beginning…
Practice
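According to evidence from related empirical research, a range of coaching behaviors, including autonomy-supportive behavior, intimidation, excessive control, conditional regard, and controlling use of rewards, are important factors affecting athletes. Graduate student Dai Ping-Tai (戴平台) wants to know how coaches influence athletes' cognitive, affective, and behavioral outcomes. Dai believes that coaches can be divided into different types according to the combinations of coaching behaviors that athletes perceive, and that examining athletes' cognitive, affective, and behavioral responses under each type of coach should give a better understanding of the influence coaches have on athletes within sports teams. Based on the coaching behaviors perceived by athletes, use cluster analysis to help Dai classify the coaches, as perceived by athletes, into different types.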