cluster analysis

CLUSTER ANALYSIS Wei-Jiun, Shen Ph. D.


Post on 25-Jan-2017


TRANSCRIPT

Page 1: Cluster analysis

CLUSTER ANALYSIS Wei-Jiun, Shen Ph. D.

Page 2: Cluster analysis

When you were still young and naïve…

Classification

Page 3: Cluster analysis

You may classify them by

Shape

Page 4: Cluster analysis

You may classify them by

Color

Page 5: Cluster analysis

You may classify them by

Sum of internal angles

Page 6: Cluster analysis

Classification

Shape Color Sum of internal angles

Similarity of characteristics

Page 7: Cluster analysis

Purpose of cluster analysis

Grouping objects based on the similarity of the characteristics they possess. Homogeneity within clusters, heterogeneity between clusters.

Geometrically, the objects within a cluster will be close together, while different clusters will be far apart.

Page 8: Cluster analysis

Major roles that cluster analysis can play

Data reduction: classify a large number of observations into manageable groups

Taxonomy description: exploratory or confirmatory

Examining the influence of clusters on dependent variables: e.g., whether different motivational constructs are differentially associated with effort and enjoyment

Page 9: Cluster analysis

How does cluster analysis work?

The primary objective of cluster analysis is to define the structure of the data by placing the most similar observations into groups.

What clustering variables can be used? How do we measure similarity? How do we form clusters? How many clusters do we form?

Page 10: Cluster analysis

Selecting clustering variables

Statistically: any quantitative variable

Theoretically, conceptually, practically: a theoretical foundation corresponding to the research question

Page 11: Cluster analysis

Measuring similarity

Similarity: the degree of correspondence among objects across all of the characteristics.

Two types of measure: correlational measures and distance measures

Page 12: Cluster analysis

Similarity measure

Correlation measure: grouping cases based on response pattern

Distance measure: grouping cases based on distance

[Figure: profile plot of case1, case2, and case3 across variables X1 to X4; vertical axis 0 to 5]
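To make the contrast concrete, here is a minimal sketch with three made-up cases scored on four variables. The values are assumptions for illustration, not the data plotted on the slide. case1 and case2 share the same up-and-down pattern at different levels, while case1 and case3 sit at similar levels with opposite patterns.

```python
import math

# Hypothetical scores on X1..X4 (0-5 scale); not taken from the slide.
case1 = [1, 2, 1, 2]
case2 = [4, 5, 4, 5]   # same pattern as case1, higher level
case3 = [2, 1, 2, 1]   # similar level to case1, opposite pattern

def pearson(x, y):
    """Pearson correlation between two profiles (pattern similarity)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Euclidean distance between two profiles (proximity)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print("case1 vs case2:", pearson(case1, case2), euclidean(case1, case2))
print("case1 vs case3:", pearson(case1, case3), euclidean(case1, case3))
```

With these values, a correlational measure would pair case1 with case2 (identical pattern), whereas a distance measure would pair case1 with case3 (closest in space).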

Page 14: Cluster analysis

Illustration

Page 15: Cluster analysis

Forming clusters

Similarity

Method: hierarchical (agglomerative / divisive) or non-hierarchical (quick)

Page 16: Cluster analysis

Number of clusters

Theoretically specified

Statistical stopping rule: measures of heterogeneity change

Page 17: Cluster analysis

Example

N = 7; 2 variables (scale range 0-10); hierarchical cluster analysis with the agglomerative method

Page 18: Cluster analysis

Example

Scatterplot

Page 19: Cluster analysis

Example - measure similarity

Euclidean distance
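The Euclidean distance between two observations is the square root of the sum of squared differences across the clustering variables. A small sketch of the full distance matrix for seven observations on two variables follows. The coordinates are hypothetical, chosen only to match the "N = 7, two variables, 0-10 scale" setup, and are not claimed to be the slide's data.

```python
import math
from itertools import combinations

# Hypothetical scores for seven observations on two 0-10 variables.
observations = {
    "A": (3, 2), "B": (4, 5), "C": (4, 7), "D": (2, 7),
    "E": (6, 6), "F": (7, 7), "G": (6, 4),
}

# Euclidean distance for every pair of observations.
distances = {
    (i, j): math.dist(observations[i], observations[j])
    for i, j in combinations(observations, 2)
}

# The most similar pair is the one with the smallest distance.
closest_pair = min(distances, key=distances.get)
print(closest_pair, round(distances[closest_pair], 3))
```

In an agglomerative procedure, the pair found here would be the first to be joined.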

Page 20: Cluster analysis

Example - similarity & number of clusters

Procedure

0.778 -0.048 0.090 0.662 0.524

(|E-F| + |E-G| + |F-G|) / 3

(|E-F| + |E-G| + |F-G|) + |F-G| / 4

Page 21: Cluster analysis

Example – graphing

Graphical portrayal

Page 22: Cluster analysis

Example

Dendrogram

Page 23: Cluster analysis

Standardizing the data

Clustering variables that have scales using widely differing numbers of scale points or that exhibit large differences in standard deviations should be standardized.

Z-score

Standardized distance (e.g., Mahalanobis distance)
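As an illustration of why this matters, the sketch below converts two variables measured on very different scales to z-scores and compares a distance before and after. The variables and values are invented for the example, and only the z-score option is shown, not Mahalanobis distance.

```python
import math
import statistics

# Hypothetical data: annual income (in dollars) and a 1-5 satisfaction rating.
income = [30000, 32000, 31000, 60000]
rating = [1, 5, 3, 3]

def z_scores(values):
    """Rescale to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

raw = list(zip(income, rating))
standardized = list(zip(z_scores(income), z_scores(rating)))

# Distance between the first two cases before and after standardizing.
print("raw:", math.dist(raw[0], raw[1]))
print("standardized:", math.dist(standardized[0], standardized[1]))
```

Before standardizing, the distance is driven almost entirely by income; afterwards both variables contribute on a comparable footing.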

Page 24: Cluster analysis

Deriving clusters

Hierarchical cluster analysis: Hierarchical

Non-hierarchical cluster analysis: K-means

Combination of both methods: Two Step

Page 25: Cluster analysis

HIERARCHICAL CLUSTER ANALYSIS

Page 26: Cluster analysis

Hierarchical cluster analysis (HCA)

The stepwise procedure: agglomerate or divide groups step by step

Agglomerative (SPSS selected): aggregate object with object / cluster with cluster; N clusters to 1 cluster

Divisive: separate clusters into objects; 1 cluster to N clusters
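The agglomerative direction (N clusters down to 1) can be sketched in a few lines. This is a minimal illustration rather than what SPSS does internally: the data are hypothetical, single linkage is used as the between-cluster distance, and ties are broken by whichever pair is encountered first.

```python
import math
from itertools import combinations

# Hypothetical observations on two 0-10 variables (illustrative values only).
points = {
    "A": (3, 2), "B": (4, 5), "C": (4, 7), "D": (2, 7),
    "E": (6, 6), "F": (7, 7), "G": (6, 4),
}

def single_linkage(a, b):
    """Shortest distance between any member of a and any member of b."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerate(labels, cluster_distance):
    """Merge the two closest clusters repeatedly: N clusters down to 1."""
    clusters = [(label,) for label in labels]   # every object starts alone
    schedule = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        schedule.append((a, b, round(cluster_distance(a, b), 3)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return schedule

schedule = agglomerate(points, single_linkage)
for step, (a, b, distance) in enumerate(schedule, start=1):
    print(step, a, b, distance)
```

The printed schedule lists, for each step, the two clusters joined and the distance at which they were joined, which is the information a dendrogram displays graphically.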

Page 27: Cluster analysis

Dendrogram / tree graph

Page 28: Cluster analysis

Agglomerative algorithms

Single linkage, complete linkage, average linkage, centroid method, Ward's method, Mahalanobis distance

Page 29: Cluster analysis

Agglomerative algorithms

Single linkage / nearest-neighbor method: defines similarity between clusters as the shortest distance from any object in one cluster to any object in the other.

Pics: retrieved from http://ppt.cc/uKm0

Page 30: Cluster analysis

Agglomerative algorithms

Complete linkage / farthest-neighbor method: defines similarity between two clusters as the maximum distance between any two members of the two clusters.

Page 31: Cluster analysis

Agglomerative algorithms

Centroid method

Cluster centroids are the mean values of the observations in the cluster on each of the clustering variables.

The distance between two clusters equals the distance between their two centroids.

Page 32: Cluster analysis

Agglomerative algorithms

Average linkage: the distance between two clusters is defined as the average distance between all pairs of the two clusters' members.
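The four distance-based rules described on the preceding slides (single, complete, centroid, and average linkage) can be compared on the same pair of clusters. The two clusters below are hypothetical, and the helper names are mine.

```python
import math

# Two hypothetical clusters of two-dimensional observations.
cluster_x = [(1, 1), (1, 3)]
cluster_y = [(4, 1), (4, 5)]

def pairwise(a, b):
    """All distances between a member of a and a member of b."""
    return [math.dist(p, q) for p in a for q in b]

def centroid(cluster):
    """Mean of the cluster's members on each variable."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

between = pairwise(cluster_x, cluster_y)
single = min(between)                    # nearest neighbor
complete = max(between)                  # farthest neighbor
average = sum(between) / len(between)    # mean over all cross-cluster pairs
centroid_distance = math.dist(centroid(cluster_x), centroid(cluster_y))

print(single, complete, round(average, 3), round(centroid_distance, 3))
```

The same two clusters are 3 units apart under single linkage and 5 under complete linkage, so the choice of rule can change which clusters are merged next.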

Page 33: Cluster analysis

Agglomerative algorithms

Ward's method: the similarity between two clusters is the sum of squares within the clusters, summed over all variables.

Aims for the least variance within clusters
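One common way to put Ward's criterion into practice is to score each candidate merge by how much it would increase the total within-cluster sum of squares, and to join the pair with the smallest increase. The sketch below assumes that formulation and uses hypothetical clusters.

```python
# Two hypothetical clusters of two-dimensional observations.
cluster_a = [(1, 1), (1, 3)]
cluster_b = [(4, 1), (4, 5)]

def centroid(cluster):
    """Mean of the cluster's members on each variable."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def within_ss(cluster):
    """Sum of squared deviations from the centroid, summed over all variables."""
    center = centroid(cluster)
    return sum((x - m) ** 2 for point in cluster for x, m in zip(point, center))

def ward_increase(a, b):
    """Increase in within-cluster sum of squares if a and b were merged."""
    return within_ss(a + b) - within_ss(a) - within_ss(b)

print(within_ss(cluster_a), within_ss(cluster_b), ward_increase(cluster_a, cluster_b))
```

Because the criterion penalizes any merge that inflates within-cluster variance, it tends to favor compact clusters.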

Page 34: Cluster analysis

Number of clusters

Theoretically specified

Statistical stopping rule: measures of heterogeneity change
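A stopping rule based on heterogeneity change can be sketched as follows: track a heterogeneity measure after each merge and stop just before the merge that produces the largest relative jump. The values below are invented for a seven-observation example, and the "largest percentage increase" rule is one simple heuristic, not the only option.

```python
# Hypothetical heterogeneity values after each of the 6 merges of 7 observations.
n_observations = 7
heterogeneity = [1.4, 2.0, 2.1, 2.2, 2.3, 5.0]

# Percentage increase in heterogeneity from each merge to the next one.
increases = [
    (later - earlier) / earlier * 100
    for earlier, later in zip(heterogeneity, heterogeneity[1:])
]

# increases[i] is the jump caused by merge i + 2, so keep merges 1..i + 1.
biggest_jump = max(range(len(increases)), key=increases.__getitem__)
merges_to_keep = biggest_jump + 1
suggested_clusters = n_observations - merges_to_keep

print([round(x, 1) for x in increases], suggested_clusters)
```

Here the final merge more than doubles the heterogeneity, so the rule suggests stopping one step earlier, at two clusters.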

Page 35: Cluster analysis

Hierarchical cluster analysis

The hierarchical cluster analysis provides an excellent framework with which to compare any set of cluster solutions.

This method helps in judging how many clusters should be retained or considered.

Page 36: Cluster analysis

NON-HIERARCHICAL CLUSTER ANALYSIS

Page 37: Cluster analysis

Non-hierarchical cluster analysis (non-HCA)

Non-hierarchical cluster analysis assigns objects to clusters once the number of clusters is specified.

Two steps in non-HCA:

Specify cluster seeds: identify starting points

Assignment: assign each observation to one of the cluster seeds.

Page 38: Cluster analysis

Non-hierarchical cluster analysis-algorithm

Aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

Cluster seed assignment: sequential (one by one), parallel (simultaneously), or optimization

K-means method

[Figure: scatterplot of cases]
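The "nearest mean" idea behind k-means can be sketched as a loop that alternates an assignment step and an update step until the cluster means stop moving. This is a bare-bones version with hypothetical data and hand-picked seeds; statistical packages add refinements that are not shown here.

```python
import math

def k_means(points, seeds, max_iter=100):
    """Alternate assignment and update steps until the centers stop changing."""
    centers = [tuple(seed) for seed in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment: each observation joins the cluster with the nearest mean.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update: each center moves to the mean of its current members.
        new_centers = []
        for k, old_center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(old_center)   # keep the seed of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical cases and deliberately poor starting seeds.
cases = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5)]
labels, centers = k_means(cases, seeds=[(1, 1), (2, 1)])
print(labels, centers)
```

Even though both seeds start inside the same group of cases, the update step pulls the second center toward the other group after one pass.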

Page 39: Cluster analysis

Pros and Cons of HCA

Advantage: comprehensive information; a wide range of alternative clustering solutions

Disadvantage: outliers; large samples / large numbers of variables

Page 40: Cluster analysis

Pros and Cons of non-HCA

Advantage: less susceptible to outliers; handles extremely large data sets

Disadvantage: less information; susceptible to the initial seed points

Page 41: Cluster analysis

Combination of each method

Two step: a hierarchical technique is used to select the number of clusters and to profile cluster centers that serve as initial cluster seeds in the nonhierarchical procedure.

A nonhierarchical method then clusters all observations using the seed points to provide more accurate cluster memberships.
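The combination can be sketched as two functions chained together: a hierarchical pass that stops at k clusters and returns their centers, and a k-means pass that starts from those centers. This is a simplified illustration with hypothetical data. It uses the centroid method for the hierarchical pass and fixes k in advance, and it is not a description of SPSS's TwoStep procedure.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Mean of the cluster's members on each variable."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def hierarchical_seeds(points, k):
    """Agglomerate with the centroid method until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                     centroid(clusters[ij[1]])),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return [centroid(c) for c in clusters]

def k_means(points, seeds, max_iter=100):
    """Reassign every observation, starting from the supplied seed points."""
    centers = list(seeds)
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        new_centers = []
        for c, old_center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centers.append(centroid(members) if members else old_center)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical cases; step 1 yields the seeds, step 2 assigns every case.
cases = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5)]
seeds = hierarchical_seeds(cases, k=2)
labels, centers = k_means(cases, seeds)
print(seeds, labels)
```

Starting the nonhierarchical step from hierarchical centers avoids depending on arbitrarily chosen seed points.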

Page 42: Cluster analysis

Interpretation of clusters

Mean profile of cluster

Name the clusters
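A mean profile is simply the average of each clustering variable within each cluster. A minimal sketch with invented standardized scores and cluster labels follows.

```python
from collections import defaultdict

# Hypothetical standardized scores on two clustering variables, with cluster labels.
rows = [(-1.0, 0.5), (-0.8, 0.7), (1.2, -0.9), (0.6, -0.3)]
labels = [0, 0, 1, 1]

def mean_profiles(rows, labels):
    """Average each variable within each cluster."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: tuple(sum(column) / len(members) for column in zip(*members))
        for label, members in grouped.items()
    }

profiles = mean_profiles(rows, labels)
for label, profile in sorted(profiles.items()):
    print(label, [round(value, 2) for value in profile])
```

Reading the profiles side by side (here, one cluster is low on the first variable and high on the second, the other the reverse) is what suggests a descriptive name for each cluster.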

Page 43: Cluster analysis

Validation of clusters

Cross validation: two sub-samples

Confirmatory: discriminant analysis, predictive validity

Differences on variables: profile analysis, (M)ANOVA
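For the "differences on variables" check, a one-way analysis of variance compares the clusters on a variable that was not used to form them. The sketch below computes only the F statistic by hand, on invented scores; in practice a statistics package would also report the p-value.

```python
def one_way_f(groups):
    """F statistic for a one-way ANOVA across the given groups of scores."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]

    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, means)
    )
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, means)
        for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical scores on an outcome variable for the members of two clusters.
cluster_scores = [[1, 2, 3], [4, 5, 6]]
print(one_way_f(cluster_scores))
```

A large F relative to its reference distribution indicates that the clusters differ on the outcome, which supports their usefulness beyond the variables used to build them.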

Page 44: Cluster analysis

Assumptions of cluster analysis

Inferential statistics?

Representativeness

Multicollinearity Factor analysis Cluster analysis

Page 45: Cluster analysis

Compare to other multivariate analyses

Cluster analysis (CA) vs. Factor analysis (FA)

CA: grouping cases based on distance (proximity)

FA: grouping variables based on patterns of variation (correlation)

Cluster analysis vs. Discriminant analysis (DA)

CA: group is NOT given (exploratory)

DA: group is given (confirmatory)

Page 46: Cluster analysis

Summary

Research question Assumption confirmation

Multicollinearity Cluster analysis

Selecting clustering variables Conducting analysis Interpreting clusters Validating clusters

Main analysis

It is just a beginning…

Page 47: Cluster analysis

Practice

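According to evidence from related empirical research, a range of coaching behaviors, including autonomy support, intimidation, excessive control, conditional regard, and controlling use of rewards, are important factors affecting athletes. Graduate student 戴平台 wants to know how coaches influence athletes' cognitive, affective, and behavioral outcomes. 戴平台 believes that coaches can be divided into different types according to the combinations of coaching behaviors that athletes perceive, and that examining athletes' cognitive, affective, and behavioral responses under each type of coach should give a better understanding of the influence coaches have on athletes in sport teams. Based on the coaching behaviors perceived by athletes, use cluster analysis to help 戴平台 classify the coaches perceived by athletes into different types.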