
CLUSTER ANALYSIS

Wei-Jiun, Shen Ph. D.

When you were still young and naïve…

Classification

You may classify them by

Shape


Color


Sum of internal angles

Classification

Shape Color Sum of internal angles

Similarity of characteristics

Purpose of cluster analysis

Grouping objects based on the similarity of the characteristics they possess: homogeneity within clusters, heterogeneity between clusters.

Geometrically, the objects within clusters will be close together, while the distance between clusters will be farther apart.
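The geometric idea can be checked numerically. The sketch below uses hypothetical two-variable observations that are assumed to be already assigned to two clusters, and compares the average distance inside clusters with the average distance across clusters. The function names are illustrative only.

```python
from itertools import combinations, product
from math import dist

# Hypothetical observations, already assigned to two clusters.
cluster_1 = [(1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
cluster_2 = [(8.0, 8.0), (9.0, 7.0), (8.0, 9.0)]

def mean_within(cluster):
    """Average Euclidean distance over all pairs inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_between(a, b):
    """Average Euclidean distance over all cross-cluster pairs."""
    pairs = list(product(a, b))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

within = (mean_within(cluster_1) + mean_within(cluster_2)) / 2
between = mean_between(cluster_1, cluster_2)
print(f"mean within-cluster distance:  {within:.3f}")
print(f"mean between-cluster distance: {between:.3f}")
```

For a well-separated solution the first number should be clearly smaller than the second.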

Major role that cluster analysis can play

Data reduction: classify a large number of observations into manageable groups.

Taxonomy description: exploratory or confirmatory.

Examining the influence of clusters on dependent variables: for example, whether different motivational constructs are differentially associated with effort and enjoyment.

How does cluster analysis work?

The primary objective of cluster analysis is to define the structure of the data by placing the most similar observations into groups.

What clustering variables can be used? How do we measure similarity? How do we form clusters? How many clusters do we form?

Selecting clustering variables

Statistically: any quantitative variable.

Theoretically, conceptually, practically: variables with a theoretical foundation that corresponds to the research question.

Measuring similarity

Similarity: the degree of correspondence among objects across all of the characteristics.

Correlational measures

Distance measures

Similarity measure

Correlation measure: grouping cases based on their response patterns.

Distance measure: grouping cases based on the distance between them.

[Figure: profiles of case 1, case 2 and case 3 across variables X1 to X4]
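The two kinds of measure can disagree about which cases are alike. The sketch below uses hypothetical scores (not the values plotted on the slide): case 2 has the same pattern as case 1 but at a much higher level, while case 3 sits at a similar level with a different pattern.

```python
from math import dist, sqrt

# Hypothetical scores of three cases on four variables X1..X4.
case1 = [1.0, 3.0, 2.0, 4.0]
case2 = [6.0, 8.0, 7.0, 9.0]   # same pattern as case1, shifted up by 5
case3 = [2.0, 2.0, 3.0, 3.0]   # similar level to case1, different pattern

def pearson(x, y):
    """Pearson correlation between two score profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

for name, other in (("case2", case2), ("case3", case3)):
    r = pearson(case1, other)
    d = dist(case1, other)
    print(f"case1 vs {name}: correlation = {r:.3f}, distance = {d:.3f}")
```

A correlation measure would pair case 1 with case 2, whereas a distance measure would pair case 1 with case 3.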

Distance measures

Euclidean distance
Squared Euclidean distance
City-block (Manhattan) distance
Chebychev distance
Mahalanobis D2 (standardization / variance-covariance)
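As a minimal sketch, the first four measures can be written in a few lines of plain Python. The function names and the two example observations are assumptions made for illustration. Mahalanobis D2 is left out because it also needs the variance-covariance matrix of the variables.

```python
from math import sqrt

def squared_euclidean(p, q):
    """Sum of squared differences across variables."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance."""
    return sqrt(squared_euclidean(p, q))

def city_block(p, q):
    """Sum of absolute differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def chebyshev(p, q):
    """Largest absolute difference on any single variable."""
    return max(abs(a - b) for a, b in zip(p, q))

p, q = (1.0, 2.0), (4.0, 6.0)   # hypothetical observations
print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25.0
print(city_block(p, q))         # 7.0
print(chebyshev(p, q))          # 4.0
```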

Pics retrieved from: http://lyfat.wordpress.com/2012/05/22/euclidean-vs-chebyshev-vs-manhattan-distance/

Illustration

Forming clusters

Similarity

Method: hierarchical (agglomerative / divisive) or non-hierarchical (quick)

Number of clusters: theoretically specified, or a statistical stopping rule (measures of heterogeneity change)

Example

N = 7

2 variables (scale range from 0 to 10)

Hierarchical cluster analysis with the agglomerative method

Example

Scatterplot

Example - measure similarity

Euclidean distance
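The sketch below builds a proximity matrix for seven observations on two 0-10 variables. The coordinates are illustrative values assumed for this sketch, since the slide's own data table is not reproduced in the text.

```python
from itertools import combinations
from math import dist

# Illustrative scores for seven observations (assumed, not the slide's data).
data = {
    "A": (3, 2), "B": (4, 5), "C": (4, 7), "D": (2, 7),
    "E": (6, 6), "F": (7, 7), "G": (6, 4),
}

# Euclidean distance for every pair of observations.
distances = {
    (i, j): dist(data[i], data[j]) for i, j in combinations(data, 2)
}

# Print the lower triangle of the proximity matrix.
labels = list(data)
for row, i in enumerate(labels):
    cells = [f"{distances[(j, i)]:6.3f}" for j in labels[:row]]
    print(i, " ".join(cells))

# The most similar pair is the one with the smallest distance.
closest = min(distances, key=distances.get)
print("closest pair:", closest)
```

With seven observations there are 21 pairwise distances, and the smallest one identifies the first pair to be joined in an agglomerative procedure.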

Example - similarity & number of clusters

Procedure

0.778   -0.048   0.090   0.662   0.524

(|E-F| + |E-G| + |F-G|) / 3

(|E-F| + |E-G| + |F-G|) + |F-G| / 4
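The averaging expressions above suggest that the overall similarity measure is the mean distance over all pairs of observations that share a cluster. The sketch below follows that reading on illustrative data (assumed coordinates, not the slide's table): at each step it joins the two clusters containing the closest pair of observations, then reports the average within-cluster distance and its change from the previous step. Ties are broken arbitrarily, so the printed values are not claimed to match the numbers on the slide.

```python
from itertools import combinations
from math import dist

# Illustrative scores for seven observations (assumed, not the slide's data).
data = {
    "A": (3, 2), "B": (4, 5), "C": (4, 7), "D": (2, 7),
    "E": (6, 6), "F": (7, 7), "G": (6, 4),
}
dists = {
    frozenset((i, j)): dist(data[i], data[j]) for i, j in combinations(data, 2)
}

def avg_within(clusters):
    """Average distance over all pairs of observations sharing a cluster."""
    pairs = [frozenset(p) for c in clusters for p in combinations(sorted(c), 2)]
    return sum(dists[p] for p in pairs) / len(pairs)

clusters = [{name} for name in data]  # start: every observation on its own
history = []
while len(clusters) > 1:
    # Pick the two clusters that contain the closest pair of observations.
    a, b = min(
        combinations(clusters, 2),
        key=lambda ab: min(dists[frozenset((i, j))] for i in ab[0] for j in ab[1]),
    )
    clusters = [c for c in clusters if c is not a and c is not b] + [a | b]
    history.append((len(clusters), avg_within(clusters)))

previous = None
for n_clusters, measure in history:
    change = "" if previous is None else f"(change {measure - previous:+.3f})"
    print(f"{n_clusters} clusters: average within-cluster distance {measure:.3f} {change}")
    previous = measure
```

A comparatively large increase from one step to the next signals that two dissimilar clusters were merged, which is the basis of the stopping rule.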

Example – graphing

Graphical portrayal

Example

Dendrogram

Standardizing the data

Clustering variables that have scales using widely differing numbers of scale points or that exhibit large differences in standard deviations should be standardized.

Z-score

Standardized distance (e.g., Mahalanobis distance)
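A small sketch of why standardization matters, using hypothetical data in which one variable (income in dollars) has a far larger spread than the other (satisfaction on a 1-7 scale). The variable names and values are assumptions for illustration.

```python
from math import dist, sqrt

# Hypothetical data for three respondents.
income = [30000.0, 32000.0, 90000.0]
satisfaction = [2.0, 7.0, 2.5]

def z_scores(values):
    """Convert raw values to z-scores using the sample standard deviation."""
    mean = sum(values) / len(values)
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]

raw = list(zip(income, satisfaction))
std = list(zip(z_scores(income), z_scores(satisfaction)))

print("raw distance, respondent 0 vs 1:", round(dist(raw[0], raw[1]), 3))
print("raw distance, respondent 0 vs 2:", round(dist(raw[0], raw[2]), 3))
print("standardized, respondent 0 vs 1:", round(dist(std[0], std[1]), 3))
print("standardized, respondent 0 vs 2:", round(dist(std[0], std[2]), 3))
```

On the raw scale the income difference dominates and the large satisfaction gap between respondents 0 and 1 barely registers. After conversion to z-scores both variables contribute on a comparable footing.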

Deriving clusters

Hierarchical cluster analysis: Hierarchical

Non-hierarchical cluster analysis: K-means

Combination of both methods: Two Step

HIERARCHICAL CLUSTER ANALYSIS

Hierarchical cluster analysis (HCA)

The stepwise procedure: agglomerate or divide groups step by step.

Agglomerative (the method selected in SPSS): aggregate object with object, then cluster with cluster; from n clusters to 1 cluster.

Divisive: separate clusters into objects; from 1 cluster to n clusters.

Dendrogram / tree graph

Agglomerative algorithms

Single linkage
Complete linkage
Average linkage
Centroid method
Ward’s method
Mahalanobis distance


Single linkage / nearest-neighbor method: defines similarity between clusters as the shortest distance from any object in one cluster to any object in the other.

Pics retrieved from: http://ppt.cc/uKm0


Complete linkage / farthest-neighbor method: defines similarity between two clusters as the maximum distance between any two members, one from each cluster.


Centroid method

Cluster centroids are the mean values of the observations on the variables of the cluster.

The distance between two clusters equals the distance between their two centroids.


Average linkage: the distance between two clusters is defined as the average distance between all pairs of the two clusters’ members.
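The four definitions above (single, complete, centroid and average) can be compared on the same pair of clusters. This is a minimal sketch with hypothetical points and illustrative function names, using Euclidean distance throughout.

```python
from itertools import product
from math import dist

# Two hypothetical clusters of two-variable observations.
cluster_a = [(0.0, 0.0), (0.0, 3.0)]
cluster_b = [(4.0, 0.0), (4.0, 3.0), (4.0, 6.0)]

def cross_distances(a, b):
    """Distances for every pair with one member from each cluster."""
    return [dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    return min(cross_distances(a, b))      # nearest neighbors

def complete_linkage(a, b):
    return max(cross_distances(a, b))      # farthest neighbors

def average_linkage(a, b):
    d = cross_distances(a, b)
    return sum(d) / len(d)                 # mean over all cross pairs

def centroid(cluster):
    """Mean of the observations on each variable."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def centroid_distance(a, b):
    return dist(centroid(a), centroid(b))

for rule in (single_linkage, complete_linkage, average_linkage, centroid_distance):
    print(f"{rule.__name__:18s} {rule(cluster_a, cluster_b):.3f}")
```

The same two clusters receive four different distances, which is why the choice of algorithm can change the order in which clusters are merged.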


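Ward’s method: the similarity between two clusters is based on the sum of squares within the clusters, summed over all variables. At each step, the two clusters whose merger produces the smallest increase in that sum are combined.

Least variance within cluster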
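A minimal sketch of the Ward criterion, assuming three hypothetical clusters: for each candidate merger it computes how much the total within-cluster sum of squares would grow, and picks the merger with the smallest increase. Names such as `ward_increase` are illustrative.

```python
from itertools import combinations

def sum_of_squares(cluster):
    """Squared deviations from the cluster centroid, summed over all variables."""
    n = len(cluster)
    means = [sum(col) / n for col in zip(*cluster)]
    return sum((v - m) ** 2 for point in cluster for v, m in zip(point, means))

def ward_increase(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    return sum_of_squares(a + b) - sum_of_squares(a) - sum_of_squares(b)

# Hypothetical clusters of two-variable observations.
clusters = {
    "P": [(1.0, 1.0), (2.0, 1.0)],
    "Q": [(2.0, 3.0), (3.0, 3.0)],
    "R": [(8.0, 8.0), (9.0, 9.0)],
}

increases = {
    (i, j): ward_increase(clusters[i], clusters[j])
    for i, j in combinations(clusters, 2)
}
for pair, inc in increases.items():
    print(pair, round(inc, 2))

best = min(increases, key=increases.get)
print("merge next:", best)
```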

Number of clusters

Theoretically specified

Statistical stopping rule: measures of heterogeneity change
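One way to apply a stopping rule based on heterogeneity change is sketched below. The heterogeneity coefficients are hypothetical, and the rule used here (take the largest percentage jump, ignoring the final merger into a single cluster) is only one of several reasonable choices.

```python
# Hypothetical heterogeneity coefficients (for example, within-cluster sum of
# squares), keyed by the number of clusters left after each agglomeration step.
heterogeneity = {6: 10.0, 5: 11.0, 4: 12.5, 3: 14.0, 2: 25.0, 1: 60.0}

def percent_changes(h):
    """Percentage increase in heterogeneity when moving to each smaller solution."""
    counts = sorted(h, reverse=True)
    return {
        after: 100.0 * (h[after] - h[before]) / h[before]
        for before, after in zip(counts, counts[1:])
    }

changes = percent_changes(heterogeneity)
for k, pct in changes.items():
    print(f"moving to {k} clusters: +{pct:.1f}%")

# Ignore the final merger into one cluster, then find the largest jump.
candidates = {k: v for k, v in changes.items() if k > 1}
jump_to = max(candidates, key=candidates.get)
suggested = jump_to + 1   # stop just before the large jump
print("suggested number of clusters:", suggested)
```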

Hierarchical cluster analysis

The hierarchical cluster analysis provides an excellent framework with which to compare any set of cluster solutions.

This method helps in judging how many clusters should be retained or considered.

NON-HIERARCHICAL CLUSTER ANALYSIS

Non-hierarchical cluster analysis (non-HCA)

Non-hierarchical cluster analysis assigns objects to clusters once the number of clusters is specified.

Two steps in non-HCA

Specify cluster seeds: identify starting points.

Assignment: assign each observation to one of the cluster seeds.

Non-hierarchical cluster analysis-algorithm

Aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

Cluster seed assignment: sequential (one by one), parallel (simultaneously), or optimization.
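The partitioning idea can be sketched as a basic k-means loop in plain Python. This is an illustrative implementation under simple assumptions (Euclidean distance, seeds supplied by the user, all observations assigned in parallel on each pass), not the routine used by any particular statistics package.

```python
from math import dist

def k_means(points, seeds, max_iter=100):
    """Alternate between assigning points to the nearest center and updating centers."""
    centers = [tuple(float(v) for v in s) for s in seeds]
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:   # no center moved: converged
            break
        centers = new_centers
    return centers, groups

# Hypothetical observations and seed points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (5.0, 5.0), (6.0, 5.5), (5.5, 6.0)]
seeds = [(1.0, 1.0), (5.0, 5.0)]

centers, groups = k_means(points, seeds)
for center, members in zip(centers, groups):
    print("center:", tuple(round(c, 3) for c in center), "members:", members)
```

Running the function again with different seeds is a simple way to see how sensitive the final solution is to the starting points.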

K-means method

[Figure: scatterplot of the cases]

Pros and Cons of HCA

Advantages: comprehensive information; a wide range of alternative clustering solutions.

Disadvantages: sensitive to outliers; difficult with large samples or large numbers of variables.

Pros and Cons of non-HCA

Advantages: less susceptible to outliers; can handle extremely large data sets.

Disadvantages: less information; susceptible to the choice of initial seed points.

Combination of each method

Two step

A hierarchical technique is used to select the number of clusters and to profile the cluster centers that serve as initial cluster seeds in the nonhierarchical procedure.

A nonhierarchical method then clusters all observations using the seed points to provide more accurate cluster memberships.
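The hand-off between the two stages can be sketched as follows. The hierarchical solution is assumed to be given (the cluster labels and points are hypothetical). Its centroids become the seeds, and every observation is then assigned to the nearest seed. Only one assignment pass is shown; a full nonhierarchical procedure would go on updating the centers.

```python
from math import dist

# Hypothetical outcome of a hierarchical analysis: two clusters with their members.
hierarchical_solution = {
    "low": [(1.0, 1.0), (2.0, 2.0)],
    "high": [(8.0, 9.0), (9.0, 8.0)],
}

def centroid(cluster):
    """Mean of the observations on each variable."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

# Stage 1 result: cluster centers from the hierarchical solution serve as seeds.
seeds = {name: centroid(members) for name, members in hierarchical_solution.items()}

# Stage 2: assign all observations (hypothetical values) to the nearest seed.
all_observations = [(1.5, 1.0), (2.5, 3.0), (7.5, 8.0), (9.0, 9.5), (4.0, 4.0)]
membership = {
    obs: min(seeds, key=lambda name: dist(obs, seeds[name]))
    for obs in all_observations
}

for obs, name in membership.items():
    print(obs, "->", name)
```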

Interpretation of clusters

Mean profile of cluster

Name the clusters
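A mean profile is simply the average score of each cluster on each clustering variable. The sketch below assumes hypothetical variable names, cluster labels and scores.

```python
from collections import defaultdict

variables = ("autonomy support", "control")   # hypothetical clustering variables

# Hypothetical (cluster label, scores) pairs after clustering.
assignments = [
    (1, (6.0, 2.0)), (1, (5.0, 3.0)),
    (2, (2.0, 6.0)), (2, (3.0, 5.0)), (2, (1.0, 7.0)),
]

# Collect the score rows belonging to each cluster.
members = defaultdict(list)
for label, scores in assignments:
    members[label].append(scores)

# Mean of each variable within each cluster.
profiles = {
    label: tuple(sum(col) / len(rows) for col in zip(*rows))
    for label, rows in members.items()
}

for label, means in profiles.items():
    summary = ", ".join(f"{v} = {m:.2f}" for v, m in zip(variables, means))
    print(f"cluster {label}: {summary}")
```

Reading the profiles side by side is what suggests a name for each cluster, for example a cluster that is high on one variable and low on the other.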

Validation of clusters

Cross validation: two sub-samples

Confirmatory: discriminant analysis; predictive validity

Differences on variables: profile analysis; (M)Analysis of variance
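For a single outcome variable, differences between clusters can be examined with a one-way analysis of variance. The sketch below computes the F statistic by hand on hypothetical scores. The outcome is assumed to be a variable that was not used to form the clusters.

```python
def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical outcome scores for the members of two clusters.
outcome_by_cluster = [
    [2.0, 3.0, 4.0],   # cluster 1
    [6.0, 7.0, 8.0],   # cluster 2
]

print("F =", f_statistic(outcome_by_cluster))
```

The statistic would then be compared with an F distribution with (number of clusters - 1) and (N - number of clusters) degrees of freedom.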

Assumptions of cluster analysis

Inferential statistics?

Representativeness

Multicollinearity: Factor analysis -> Cluster analysis

Compare to other multivariate analyses

Cluster analysis (CA) vs. Factor analysis (FA)

CA: grouping cases based on distance (proximity)

FA: grouping variables based on patterns of variation (correlation)

Cluster analysis vs. Discriminant analysis (DA)

CA: group is NOT given (exploratory)

DA: group is given (confirmatory)

Summary

Research question -> Assumption confirmation (multicollinearity) -> Cluster analysis -> Main analysis

Cluster analysis: selecting clustering variables, conducting the analysis, interpreting clusters, validating clusters

It is just a beginning…

Practice

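According to evidence from related empirical studies, a range of coaching behaviors, including autonomy support, intimidation, excessive control, conditional regard and controlling use of rewards, are important factors affecting athletes. Graduate student Dai Pingtai (戴平台) wants to know how coaches influence athletes' cognitive, affective and behavioral outcomes. Dai believes that coaches can be classified into different types according to the combinations of coaching behaviors that athletes perceive, and that examining athletes' cognitive, affective and behavioral responses under each type of coach should give a better understanding of the influence coaches have on athletes within sports teams. Based on the coaching behaviors perceived by athletes, use cluster analysis to help Dai classify the coaches, as perceived by athletes, into different types.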
