cluster analysis


Post on 25-Jan-2017




Cluster analysis
Wei-Jiun Shen, Ph.D.

When you were still young and naïve
Classification

You may classify them by shape

You may classify them by color

You may classify them by sum of internal angles

Classification
  Shape
  Color
  Sum of internal angles

Similarity of characteristics

Purpose of cluster analysis
Grouping objects based on the similarity of the characteristics they possess.
  Homogeneity within clusters
  Heterogeneity between clusters

Geometrically, the objects within a cluster will be close together, while different clusters will be far apart.

Major roles that cluster analysis can play
Data reduction
  Classify a large number of observations into manageable groups

Taxonomy description
  Exploratory
  Confirmatory

Examining the influence of cluster membership on dependent variables
  E.g., whether different motivational constructs are differentially associated with effort and enjoyment

How does cluster analysis work?
The primary objective of cluster analysis is to define the structure of the data by placing the most similar observations into groups.

What clustering variables can be used?
How do we measure similarity?
How do we form clusters?
How many clusters do we form?

Selecting clustering variables
Statistically: any quantitative variable

Theoretically, conceptually, practically: a theoretical foundation that corresponds to the research question

Measuring similarity
Similarity: the degree of correspondence among objects across all of the characteristics.

Correlational measures
Distance measures

Similarity measure
Correlation measure: grouping cases based on response pattern
Distance measure: grouping cases based on distance
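The difference between the two kinds of measure can be illustrated with a small sketch. The three respondent profiles below are hypothetical and exist only for illustration; the point is that a correlation measure pairs profiles with the same pattern, while a distance measure pairs profiles with similar magnitudes.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical respondents rated on four variables (0-10 scale).
low = [1, 3, 2, 4]
high = [6, 8, 7, 9]   # same pattern as `low`, shifted up by 5 points
flat = [3, 2, 3, 2]   # close to `low` in magnitude, opposite pattern

print("low vs high:", pearson(low, high), euclidean(low, high))
print("low vs flat:", pearson(low, flat), euclidean(low, flat))
```

Under a correlation measure `low` would be grouped with `high`; under a distance measure it would be grouped with `flat`.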

Distance measures
Euclidean distance
Squared Euclidean distance
City-block (Manhattan) distance
Chebychev distance
Mahalanobis D² (standardization / variance-covariance)
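As a minimal sketch, the first four distance measures can be written in a few lines of plain Python. The function names and the two sample points are illustrative assumptions. Mahalanobis D² is left out because it additionally requires the inverse of the variance-covariance matrix of the variables.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Sum of squared differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def city_block(x, y):
    """Sum of absolute differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    """Largest absolute difference on any single variable."""
    return max(abs(a - b) for a, b in zip(x, y))

# Two hypothetical objects measured on two variables.
p, q = (1, 2), (4, 6)

print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25
print(city_block(p, q))         # 7
print(chebychev(p, q))          # 4
```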


Forming clusters
Similarity

Method
Hierarchical
  Agglomerative / divisive

Non-hierarchical (quick)

Number of clusters
Theoretically specified

Statistical stopping rule
  Measures of heterogeneity change

Example
N = 7
2 variables (scale range from 0 to 10)
Hierarchical cluster analysis with agglomerative method


Example - measuring similarity
Euclidean distance

Example - similarity & number of clusters
Procedure


(E-F + E-G + F-G) / 3
(E-F + E-G + F-G + F-G) / 4

Example - graphing
Graphical portrayal


Standardizing the data
Clustering variables that have scales using widely differing numbers of scale points or that exhibit large differences in standard deviations should be standardized.

Z-score
Standardized distance (e.g., Mahalanobis distance)
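A brief sketch of z-score standardization, assuming two hypothetical clustering variables measured on very different scales. The variable names and values are made up for illustration; the sample standard deviation (n - 1 denominator) is used here, which is one common choice.

```python
import statistics

def z_scores(values):
    """Rescale a variable to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator
    return [(v - mean) / sd for v in values]

# Hypothetical variables on very different scales.
income = [20000, 35000, 50000, 65000, 80000]
satisfaction = [2, 4, 5, 7, 9]

print([round(z, 3) for z in z_scores(income)])
print([round(z, 3) for z in z_scores(satisfaction)])
```

After standardization, a one-unit difference means one standard deviation on either variable, so neither variable dominates the distance calculation merely because of its scale.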

Deriving clusters
Hierarchical cluster analysis: Hierarchical

Non-hierarchical cluster analysis: K-means

Combination of both methods: Two Step

Hierarchical Cluster analysis

Hierarchical cluster analysis (HCA)
The stepwise procedure
  Agglomerate or divide groups step by step
Agglomerative (SPSS selected)
  Aggregate object with object / cluster with cluster
  N clusters to 1 cluster

Divisive
  Separate cluster into objects
  1 cluster to N clusters

Dendrogram / tree graph
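The agglomerative procedure (N clusters down to 1 cluster) can be sketched as a simple loop. This is an illustrative sketch, not the SPSS implementation: the five points are hypothetical, and the distance between two clusters is assumed to be the distance between their closest members, which is only one of several possible rules.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_members(c1, c2, points):
    """Cluster-to-cluster distance: closest pair of members (assumed rule)."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, cluster_distance=nearest_members):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(name,) for name in points]  # start: each object is its own cluster
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], points))
        d = cluster_distance(a, b, points)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append((a, b, d))
    return history

# Hypothetical objects measured on two variables.
points = {"A": (1, 1), "B": (2, 1), "C": (6, 5), "D": (8, 5), "E": (7, 9)}

for left, right, dist in agglomerate(points):
    print(left, "+", right, "at distance", round(dist, 3))
```

The merge history, read from first to last step, is the information a dendrogram displays: which clusters were joined and at what distance.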

Agglomerative algorithms
Single linkage
Complete linkage
Average linkage
Centroid method
Ward's method
Mahalanobis distance

Agglomerative algorithms
Single linkage / nearest neighbor method
  Defines similarity between clusters as the shortest distance from any object in one cluster to any object in the other.

Agglomerative algorithms
Complete linkage / farthest neighbor method
  Defines similarity between clusters as the maximum distance between any two members of the two clusters.

Agglomerative algorithms
Centroid method
  Cluster centroids are the mean values of the observations on the variables of the cluster
  The distance between two clusters equals the distance between their two centroids

Agglomerative algorithms
Average linkage
  The distance between two clusters is defined as the average distance between all pairs of the two clusters' members.
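The single linkage, complete linkage, centroid, and average linkage definitions can be compared side by side on the same pair of clusters. The sketch below assumes Euclidean distance and two small hypothetical clusters; the function names are illustrative.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(c1, c2):
    """Shortest distance between any member of c1 and any member of c2."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Longest distance between any member of c1 and any member of c2."""
    return max(euclidean(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2):
    """Mean distance over all cross-cluster pairs of members."""
    distances = [euclidean(p, q) for p in c1 for q in c2]
    return sum(distances) / len(distances)

def centroid(cluster):
    """Mean value of the cluster's members on each variable."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_distance(c1, c2):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(c1), centroid(c2))

# Two hypothetical clusters, each with two members.
left = [(0, 0), (0, 2)]
right = [(4, 0), (4, 3)]

for rule in (single_linkage, complete_linkage, average_linkage, centroid_distance):
    print(rule.__name__, round(rule(left, right), 3))
```

The four rules give different distances for the same data, which is why the choice of algorithm can change which clusters are merged first.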

Agglomerative algorithms
Ward's method
  The similarity between two clusters is the sum of squares within the clusters summed over all variables.

Least variance within cluster
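One way to read Ward's criterion is that, at each step, the pair of clusters whose merger adds the least to the within-cluster sum of squares is combined. The sketch below computes that increase for two candidate mergers; the clusters and function names are hypothetical and chosen only for illustration.

```python
def centroid(cluster):
    """Mean value of the cluster's members on each variable."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def within_ss(cluster):
    """Sum of squared deviations from the centroid, summed over all variables."""
    c = centroid(cluster)
    return sum((x - m) ** 2 for point in cluster for x, m in zip(point, c))

def ward_increase(c1, c2):
    """Increase in total within-cluster sum of squares if c1 and c2 are merged."""
    return within_ss(c1 + c2) - within_ss(c1) - within_ss(c2)

# Three hypothetical clusters (lists of points on two variables).
a = [(0, 0), (0, 2)]
b = [(1, 1)]
c = [(6, 0), (6, 2)]

print(round(ward_increase(a, b), 3))  # small increase: a and b are close
print(round(ward_increase(a, c), 3))  # large increase: a and c are far apart
```

Merging `a` with `b` keeps the within-cluster variance low, so under this criterion that merger would be preferred over merging `a` with `c`.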

Number of clusters
Theoretically specified

Statistical stopping rule
  Measures of heterogeneity change
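A stopping rule based on heterogeneity change can be sketched as follows: compute the percentage increase in a heterogeneity measure at each merge and look for an unusually large jump. The heterogeneity values below are hypothetical, and picking the single largest jump is a simplifying assumption; in practice the result is weighed against theory and interpretability.

```python
def heterogeneity_changes(heterogeneity):
    """Percentage increase in heterogeneity when going from k to k - 1 clusters.

    `heterogeneity` maps a number of clusters to a heterogeneity measure
    (for example, average within-cluster distance) for consecutive solutions.
    """
    ks = sorted(heterogeneity, reverse=True)
    changes = {}
    for k, k_next in zip(ks, ks[1:]):
        before, after = heterogeneity[k], heterogeneity[k_next]
        changes[k] = 100.0 * (after - before) / before
    return changes

def suggested_clusters(heterogeneity):
    """Number of clusters just before the largest percentage jump."""
    changes = heterogeneity_changes(heterogeneity)
    return max(changes, key=changes.get)

# Hypothetical heterogeneity values for the 5- to 1-cluster solutions.
example = {5: 1.0, 4: 1.2, 3: 1.5, 2: 3.6, 1: 4.5}

for k, pct in heterogeneity_changes(example).items():
    print(f"{k} -> {k - 1} clusters: +{pct:.1f}%")
print("suggested number of clusters:", suggested_clusters(example))
```

In this made-up example the jump from three to two clusters is by far the largest, which would point to the three-cluster solution.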

Hierarchical cluster analysis
The hierarchical cluster analysis provides an excellent framework with which to compare any set of cluster solutions.

This method helps in judging how many clusters should be retained or considered.

Non-Hierarchical Cluster analysis

Non-hierarchical cluster analysis (non-HCA)
Non-hierarchical cluster analysis assigns objects to clusters once the number of clusters is specified.

Two steps in non-HCA
  Specify cluster seeds: identify starting points
  Assignment: assign each observation to one of the cluster seeds

Non-hierarchical cluster analysis - algorithm
Aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
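The nearest-mean idea can be sketched as a basic k-means loop: assign every observation to its nearest center, recompute each center as the mean of its members, and repeat until the assignments stop changing. The data points and seed points below are hypothetical, and this is a minimal illustration rather than the routine any particular statistics package uses.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_point(points):
    """Mean of a list of points on each variable."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, seeds, max_iter=100):
    """Return (assignment, centers) after iterating from the given seed points."""
    centers = list(seeds)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every observation.
        new_assignment = [
            min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # memberships are stable, so stop
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = mean_point(members)
    return assignment, centers

# Hypothetical observations and seed points (k = 2).
data = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
seeds = [(1, 1), (8, 8)]

assignment, centers = k_means(data, seeds)
print(assignment)
print(centers)
```

Because the result depends on where the seeds start, running the loop from different seed points is a simple way to check how stable a solution is.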

Cluster seed assignment
  Sequential (1 by 1)
  Parallel (simultaneously)
  Optimization
K-means method

Pros and cons of HCA
Advantage
  Comprehensive information
  A wide range of alternative clustering solutions

Disadvantage
  Susceptible to outliers
  Not well suited to large samples / large numbers of variables

Pros and cons of non-HCA
Advantage
  Less susceptible to outliers
  Can handle extremely large data sets

Disadvantage
  Less information
  Susceptible to the choice of initial seed points

Combination of both methods
Two step
A hierarchical technique is used to select the number of clusters and to profile the cluster centers that serve as initial cluster seeds in the nonhierarchical procedure.

A nonhierarchical method then clusters all observations using the seed points to provide more accurate cluster memberships.

Interpretation of clusters
Mean profile of each cluster

Name the clusters

Validation of clusters
Cross-validation
  Two sub-samples

Confirmatory
  Discriminant analysis
  Predictive validity

Differences on variables
  Profile analysis
  (Multivariate) analysis of variance

Assumptions of cluster analysis
Inferential statistics?

Multicollinearity
Factor analysis
Cluster analysis

Comparison with other multivariate analyses
Cluster analysis (CA) vs. factor analysis (FA)
  CA: grouping cases based on distance (proximity)
  FA: grouping variables based on patterns of variation (correlation)

Cluster analysis vs. discriminant analysis (DA)
  CA: groups are NOT given (exploratory)
  DA: groups are given (confirmatory)

Summary
Research question
Assumption confirmation
  Multicollinearity
Cluster analysis
  Selecting clustering variables
  Conducting analysis
  Interpreting clusters
  Validating clusters
Main analysis

It is just a beginning