cluster analysis
TRANSCRIPT
1
Cluster Analysis
2
3
Cluster analysis
Suppose one wants to understand the consumers in the finance market. With regard to the finance market, there are a few outstanding aspects such as Risk, Returns and Liquidity. One could use any two aspects at a time to form clusters of consumers, or could also use more than two attributes.
[Scatter plot: consumers labelled A to O plotted against Risk (X1) on the horizontal axis and Returns (X2) on the vertical axis, partitioned into three clusters I, II and III]
Cluster I: Consumers believe in high-risk, high-return instruments (equity)
Cluster II: Consumers believe in average-risk, average-return instruments (mutual funds, company fixed deposits)
Cluster III: Consumers believe in low-risk, low-return instruments (savings bank, fixed deposits)
4
Cluster analysis
It is a class of techniques used to classify cases into groups that are:
• relatively homogeneous within themselves, and
• heterogeneous between each other.
Homogeneity (similarity) and heterogeneity (dissimilarity) are measured on the basis of a defined set of variables.
These groups are called clusters
5
Market segmentation
Cluster analysis is especially useful for market segmentation
Segmenting a market means dividing its potential consumers into separate sub-sets where:
• Consumers in the same group are similar with respect to a given set of characteristics
• Consumers belonging to different groups are dissimilar with respect to the same set of characteristics
This allows one to calibrate the marketing mix differently according to the target consumer group
6
Other uses of cluster analysis
Product characteristics and the identification of new product opportunities:
• Clustering similar brands or products according to their characteristics allows one to identify competitors, potential market opportunities and available niches.
Data reduction:
• Factor analysis and principal component analysis allow one to reduce the number of variables.
• Cluster analysis allows one to reduce the number of observations, by grouping them into homogeneous clusters.
Maps that profile consumers and products simultaneously, showing market opportunities and preferences, as in preference or perceptual mapping.
7
Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters with:
• high intra-class similarity
• low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
8
Steps in Cluster Analysis
Formulate the problem
Select the distance measure
Select the clustering procedure
Decide the number of clusters
Interpret and profile the clusters
Assess the validity of clustering
9
Formulate the problem
Selecting the variables on which the clustering is based – The most important part of formulating the clustering problem
Inclusion of even one or two irrelevant variables may distort an otherwise useful clustering solution.
The set of selected variables should be able to describe the similarity between objects in terms that are relevant to the marketing research problem.
The variables should be selected on the basis of past research, theory, or a consideration of the hypotheses being tested.
10
Similarity and Dissimilarity Between Objects
Distances are normally used to measure the
similarity or dissimilarity between two data objects
Some popular ones include the Minkowski distance:
$d(i,j) = \left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \dots + |x_{ip}-x_{jp}|^q \right)^{1/q}$
where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and q is a positive integer.
If q = 1, d is the Manhattan distance:
$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \dots + |x_{ip}-x_{jp}|$
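The formula translates directly into code. Below is a minimal Python sketch (not part of the original slides; the function name and the two example points are invented for illustration):

```python
import numpy as np

def minkowski_distance(x_i, x_j, q=2):
    """Minkowski distance between two p-dimensional data objects, as in the formula above."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.sum(np.abs(x_i - x_j) ** q) ** (1.0 / q)

a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.0]
print(minkowski_distance(a, b, q=1))  # q = 1: Manhattan distance = 5.0
print(minkowski_distance(a, b, q=2))  # q = 2: Euclidean distance, about 3.606
```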
11
Similarity and Dissimilarity Between Objects
Distance measures how far apart two observations are.
Similarity measures how alike two cases are.
When two or more variables are used to define distance, the one with the larger magnitude will dominate.
To avoid this it is common to first standardize all variables (see the sketch after this list).
SPSS hierarchical clustering supports these types of measures:
• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.
• Counts. Available alternatives are chi-square measure and phi-square measure.
• Binary. Binary matching is a common type of similarity data, where 1 indicates a match and 0 indicates no match between any pair of cases. There are multiple matched attributes and the similarity score is the number of matches divided by the number of attributes being matched.
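As an illustration of the standardization point above, here is a small sketch (the slides use SPSS; scipy stands in here, and the data matrix is invented):

```python
import numpy as np
from scipy.stats import zscore
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 4 cases measured on two variables with very different scales
X = np.array([[1.2, 25000.0],
              [0.8, 60000.0],
              [2.5, 31000.0],
              [1.9, 58000.0]])

# Without standardization the second variable dominates the Euclidean distances
d_raw = squareform(pdist(X, metric="euclidean"))

# z-scores (mean 0, variance 1) put the variables on an equal footing
X_std = zscore(X, axis=0)
d_std = squareform(pdist(X_std, metric="euclidean"))

print(np.round(d_raw, 1))
print(np.round(d_std, 2))
```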
12
Similarity and Dissimilarity Between Objects : Interval Measure
Euclidean distance: the square root of the sum of the square of the x difference plus the square of the y difference.
Squared Euclidean distance removes the sign and also places greater emphasis on objects further apart, thus increasing the effect of outliers. This is the default for interval data.
Cosine. Interval-level similarity based on the cosine of the angle between two vectors of values.
Pearson correlation. Interval-level similarity based on the product-moment correlation. For case clustering, as opposed to variable clustering, the researcher transposes the normal data table in which columns are variables and rows are cases. By using columns as cases and rows as variables instead, the correlation is between cases, and these correlations may constitute the cells of the similarity matrix.
13
Similarity and Dissimilarity Between Objects : Interval Measure
Chebychev distance is the maximum absolute difference between a pair of cases on any one of the two or more dimensions (variables) which are being used to define distance. Pairs will be defined as different according to their difference on a single dimension, ignoring their similarity on the remaining dimensions.
Block distance, also known as Manhattan or city-block distance, is the sum of the absolute differences across the two or more dimensions which are used to define distance.
Minkowski distance is the generalized distance function: the pth root of the sum of the absolute differences between the values for the items, each raised to the pth power:
$d_{ij} = \left[ \sum_{k} |x_{ik} - x_{jk}|^{p} \right]^{1/p}$
When p = 1, Minkowski distance is city-block distance (and, for binary data, Hamming distance). When p = 2, Minkowski distance is Euclidean distance.
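The interval measures listed on these slides have direct scipy equivalents; a small sketch (the two vectors are invented, and note that scipy's cosine is a distance, i.e. one minus the cosine similarity described above):

```python
from scipy.spatial.distance import euclidean, sqeuclidean, chebyshev, cityblock, minkowski, cosine

x = [2.0, 4.0, 6.0]
y = [5.0, 4.0, 2.0]

print(euclidean(x, y))       # square root of the sum of squared differences: 5.0
print(sqeuclidean(x, y))     # squared Euclidean distance: 25.0
print(chebyshev(x, y))       # maximum absolute difference on any one dimension: 4.0
print(cityblock(x, y))       # block / Manhattan distance (sum of absolute differences): 7.0
print(minkowski(x, y, p=3))  # generalized Minkowski distance with p = 3
print(cosine(x, y))          # 1 minus the cosine of the angle between the two vectors
```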
14
Similarity and Dissimilarity Between Objects : Count Measure
Chi-square measure. Based on the chi-square test of equality for two sets of frequencies, this measure is the default for count data.
Phi-square measure. Phi-square normalizes chi-square measure by the square root of the combined frequency.
15
Similarity and Dissimilarity Between Objects : Binary Measure
Squared Euclidean distance is also an option for binary data as it is for interval.
Size difference. This asymmetry index ranges from 0 to 1.
Pattern difference. This dissimilarity measure also ranges from 0 to 1. Computed from a 2x2 table as bc/n^2, where b and c are the off-diagonal (mismatch) cells and n is the number of observations.
Variance. Computed from a 2x2 table as (b+c)/4n. It also ranges from 0 to 1.
Dispersion. A similarity measure with a range of -1 to 1.
Shape. A distance measure with a range of 0 to 1; it penalizes asymmetry of mismatches.
Simple matching. The ratio of matches to the total number of values.
Phi 4-point correlation. A binary analog of the Pearson correlation coefficient, with a range of -1 to 1.
16
Similarity and Dissimilarity Between Objects : Binary Measure
Lambda. Goodman and Kruskal's lambda, interpreted as the proportional reduction of error (PRE) when using one item to predict the other (predicting in both directions). It ranges from 0 to 1, with 1 indicating perfect predictive monotonicity.
Anderberg's D. A variant of lambda: D is the actual reduction of error when using one item to predict the other (predicting in both directions). It ranges from 0 to 1.
Czekanowski or Sorensen measure. An index in which joint absences are excluded from computation and matches are weighted double.
Hamann. The number of matches minus the number of nonmatches, divided by the total number of items; it ranges from -1 to 1.
Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio. Jaccard is commonly recommended for binary data.
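A short sketch of how two of these binary measures are computed from the 2x2 match table (the binary profiles are invented; this is illustrative only, not the SPSS implementation):

```python
import numpy as np

def binary_counts(u, v):
    """Counts a, b, c, d of the 2x2 match table for two binary vectors."""
    u, v = np.asarray(u, dtype=bool), np.asarray(v, dtype=bool)
    a = int(np.sum(u & v))    # joint presences
    b = int(np.sum(u & ~v))   # present in u only
    c = int(np.sum(~u & v))   # present in v only
    d = int(np.sum(~u & ~v))  # joint absences
    return a, b, c, d

u = [1, 1, 0, 0, 1, 0]
v = [1, 0, 0, 1, 1, 0]
a, b, c, d = binary_counts(u, v)

simple_matching = (a + d) / (a + b + c + d)  # matches / total number of attributes
jaccard = a / (a + b + c)                    # joint absences excluded
print(simple_matching, jaccard)
```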
17
Similarity and Dissimilarity Between Objects : Binary Measure
Kulczynski 1 is the ratio of joint presences to all nonmatches. Its lower bound is 0 and it is unbounded above. It is theoretically undefined when there are no nonmatches.
Kulczynski 2. is the conditional probability that the characteristic is present in one item, given that it is present in the other.
Lance and Williams index, also called the Bray-Curtis nonmetric coefficient, is based on a 2x2 table, using the formula (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the off-diagonal cells corresponding to cases present on one item but absent on the other. It ranges from 0 to 1.
Ochiai index is the binary equivalent of the cosine similarity measure and ranges from 0 to 1.
Rogers and Tanimoto index gives double weight to nonmatches.
Russel and Rao index is equivalent to the inner (dot) product, giving equal weight to matches and nonmatches. This is a common choice for binary similarity data.
18
Similarity and Dissimilarity Between Objects : Binary Measure
Sokal and Sneath 1 index also gives double weight to matches.
Sokal and Sneath 2 index gives double weight to nonmatches, but joint absences are excluded from consideration.
Sokal and Sneath 3 index is the ratio of matches to nonmatches. Its minimum is 0 but it is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.
Sokal and Sneath 4 index is based on the conditional probability that the characteristic in one item matches the value in the other, taking the mean of the predictions in either direction.
Sokal and Sneath 5 index is the squared geometric mean of the conditional probabilities of positive and negative matches. It ranges from 0 to 1.
Yule's Y, also called the coefficient of colligation, is a similarity measure based on the cross-ratio for a 2x2 table and is independent of the marginal totals. It ranges from -1 to +1.
Yule's Q is a similarity measure which is a special 2x2 case of Goodman and Kruskal's gamma. It is also based on the cross-ratio and ranges from -1 to +1.
19
Clustering procedures
Hierarchical procedures
• Agglomerative (start from n clusters to get to 1 cluster)
  - Linkage methods (single, complete, average)
  - Variance methods (Ward's method)
  - Centroid methods
• Divisive (start from 1 cluster to get to n clusters)
Non-hierarchical procedures
• Sequential threshold
• Parallel threshold
• Optimising threshold
20
Hierarchical clustering
Agglomerative:
• Each of the n observations constitutes a separate cluster
• The two clusters that are most similar according to some distance rule are aggregated, so that in step 1 there are n-1 clusters
• In the second step another cluster is formed (n-2 clusters), by nesting the two clusters that are most similar, and so on
• There is a merging in each step until all observations end up in a single cluster in the final step.
Divisive:
• All observations are initially assumed to belong to a single cluster
• The most dissimilar observation is extracted to form a separate cluster
• In step 1 there will be 2 clusters, in the second step three clusters, and so on, until the final step produces as many clusters as there are observations.
The number of clusters determines the stopping rule for the algorithms
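A minimal agglomerative example using scipy on simulated data (scipy implements the agglomerative variant only; cutting the tree at a chosen number of clusters plays the role of the stopping rule):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two simulated groups of observations in two dimensions
X = np.vstack([rng.normal(0, 1, size=(10, 2)),
               rng.normal(5, 1, size=(10, 2))])

# Each observation starts as its own cluster; the closest pair of clusters
# is merged at every step until a single cluster remains
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree to obtain the desired number of clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```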
21
Non-hierarchical clustering
These algorithms do not follow a hierarchy and produce a single partition
Knowledge of the number of clusters (c) is required.
In the first step, initial cluster centres (the seeds) are determined for each of the c clusters, either by the researcher or by the software (usually the first c observations, or c observations chosen at random).
Each iteration allocates observations to each of the c clusters, based on their distance from the cluster centres.
Cluster centres are then recomputed, and observations may be reallocated to the nearest cluster in the next iteration.
When no observations can be reallocated or a stopping rule is met, the process stops
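As an illustration of such a partitioning procedure, a short scikit-learn sketch on simulated data (the slides describe the generic algorithm; scikit-learn's KMeans is one implementation of it):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(6, 1, size=(50, 2)),
               rng.normal(12, 1, size=(50, 2))])

# c = 3 clusters must be specified in advance; observations are (re)allocated
# to the nearest centre until no reassignment occurs or max_iter is reached
km = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0).fit(X)
print(km.cluster_centers_)
print(km.labels_[:10])
```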
22
Distance between clusters
Algorithms vary according to the way the distance between two clusters is defined.
The most common algorithms for hierarchical methods include:
• single linkage method
• complete linkage method
• average linkage method
• Ward algorithm
• centroid method
23
Linkage methods
Single linkage method (nearest neighbour): distance between two clusters is the minimum distance among all possible distances between observations belonging to the two clusters.
Complete linkage method (furthest neighbour): nests two clusters using as a basis the maximum distance between observations belonging to the separate clusters.
Average linkage method: the distance between two clusters is the average of all distances between observations in the two clusters
24
Hierarchical Clustering
Single Linkage
Clustering criterion based on the shortest distance
Complete Linkage
Clustering criterion based on the longest distance
25
Hierarchical Clustering (Contd.)
Average Linkage
Clustering criterion based on the average distance
26
Ward algorithm
1. The sum of squared distances is computed within each of the clusters, considering all distances between observations within the same cluster.
2. The algorithm proceeds by choosing the aggregation of the two clusters which generates the smallest increase in the total sum of squared distances.
It is a computationally intensive method, because at each step all the sums of squared distances need to be computed, together with all potential increases in the total sum of squared distances for each possible aggregation of clusters.
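A small sketch of the Ward criterion itself, using the usual centroid-based error sum of squares (within a cluster this is proportional to the pairwise formulation in the slide); the three clusters are invented:

```python
import numpy as np

def within_ss(cluster):
    """Sum of squared distances of the observations from their cluster centroid."""
    cluster = np.asarray(cluster, dtype=float)
    return np.sum((cluster - cluster.mean(axis=0)) ** 2)

def ward_increase(c1, c2):
    """Increase in the total within-cluster sum of squares if c1 and c2 are merged."""
    merged = np.vstack([c1, c2])
    return within_ss(merged) - within_ss(c1) - within_ss(c2)

c1 = np.array([[1.0, 1.0], [1.5, 0.5]])
c2 = np.array([[1.2, 0.8], [2.0, 1.0]])
c3 = np.array([[8.0, 9.0], [9.0, 8.5]])

# Ward's rule merges the pair giving the smallest increase (here c1 and c2)
print(ward_increase(c1, c2), ward_increase(c1, c3))
```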
27
Hierarchical Clustering (Contd.)
Ward's Method
Based on the loss of information resulting from grouping of the objects into clusters (minimize within cluster variation)
28
Centroid method
The distance between two clusters is the distance between the two centroids.
Centroids are the cluster averages for each of the variables:
• each cluster is defined by a single set of coordinates, the averages of the coordinates of all individual observations belonging to that cluster.
Difference between the centroid and the average linkage method (see the sketch below):
• Centroid: computes the average of the coordinates of the observations belonging to an individual cluster (and then the distance between these averages)
• Average linkage: computes the average of the distances between observations in the two separate clusters.
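The following small numpy sketch makes the distinction concrete (the two clusters are invented):

```python
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
cluster_b = np.array([[4.0, 4.0], [5.0, 4.0]])

# Centroid method: distance between the two cluster centroids
centroid_dist = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))

# Average linkage: mean of all pairwise distances between the two clusters
average_linkage_dist = cdist(cluster_a, cluster_b).mean()

print(round(centroid_dist, 3), round(average_linkage_dist, 3))
```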
29
Hierarchical Clustering (Contd.)
Centroid Method
Based on the distance between the group centroids (the point
whose coordinates are the means of all the observations in the
cluster)
30
Non-hierarchical Clustering
Sequential Threshold
Cluster center is selected and all objects within a prespecified
threshold value are grouped
Parallel Threshold
Several cluster centers are selected and objects within threshold
level are assigned to the nearest center
Optimizing
Objects can be later reassigned to clusters on the basis of
optimizing some overall criterion measure
31
Non-hierarchical clustering: K-means method
1. The number k of clusters is fixed
2. An initial set of k “seeds” (aggregation centres) is provided
• First k elements
• Other seeds (randomly selected or explicitly defined)
3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary
Units can be reassigned in successive steps (optimising partitioning); a bare-bones sketch of these steps is given below.
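A minimal Python sketch of steps 1 to 5 (illustrative only: empty clusters and the optional distance threshold of step 3 are not handled, and the data are simulated):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Bare-bones sketch of the k-means steps described above."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 2: initial seeds, here chosen at random among the observations
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every unit to the nearest cluster seed
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the seeds as the cluster means
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when no reclassification changes the seeds
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(8, 1, (30, 2))])
labels, centres = k_means(X, k=2)
print(centres)
```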
32
Non-hierarchical threshold methods
Sequential threshold methods:
• a prior threshold is fixed and units within that distance are allocated to the first seed
• a second seed is selected and the remaining units are allocated, etc.
Parallel threshold methods:
• more than one seed is considered simultaneously
When reallocation is possible after each stage, the methods are termed optimizing procedures.
33
Hierarchical vs. non-hierarchical methods
Hierarchical methods:
• No decision about the number of clusters is required in advance
• Problems when data contain a high level of error
• Can be very slow; preferable with small data sets
• Initial decisions are more influential (one step only)
• At each step they require computation of the full proximity matrix
Non-hierarchical methods:
• Faster, more reliable, work with large data sets
• Need to specify the number of clusters
• Need to set the initial seeds
• Only cluster distances to the seeds need to be computed in each iteration
34
Description of Variables
Variable | Corresponding Name in Output | Scale Values
Willingness to Export (Y1) | Will | 1 (definitely not interested) to 5 (definitely interested)
Level of Interest in Seeking Govt Assistance (Y2) | Govt | 1 (definitely not interested) to 5 (definitely interested)
Employee Size (X1) | Size | Greater than zero
Firm Revenue (X2) | Rev | In millions of dollars
Years of Operation in the Domestic Market (X3) | Years | Actual number of years
Number of Products Currently Produced by the Firm (X4) | Prod | Actual number
Training of Employees (X5) | Train | 0 (no formal program) or 1 (existence of a formal program)
Management Experience in International Operation (X6) | Exp | 0 (no experience) or 1 (presence of experience)
35
Export Data – K-means Clustering Results
Initial Cluster Centers
Variable | Cluster 1 | Cluster 2 | Cluster 3
Will | 4 | 1 | 5
Govt | 5 | 1 | 4
Train | 1 | 0 | 0
Experience | 1 | 0 | 1
Years | 6.0 | 7.0 | 4.5
Prod | 5 | 2 | 11
Mod_size | 5.80 | 2.80 | 2.90
Mod_Rev | 1.00 | .90 | .90

Final Cluster Centers
Variable | Cluster 1 | Cluster 2 | Cluster 3
Will | 4 | 2 | 5
Govt | 4 | 2 | 4
Train | 1 | 0 | 1
Experience | 1 | 0 | 1
Years | 6.3 | 6.7 | 6.0
Prod | 6 | 3 | 10
Mod_size | 4.97 | 3.60 | 5.42
Mod_Rev | 1.76 | 1.74 | 1.21

Distances between Final Cluster Centers: clusters 1-2: 3.899; clusters 1-3: 4.193; clusters 2-3: 7.602

(Note: Mod_size and Mod_Rev are the rescaled variables Size/10 and Revenue/1000.)
36
Export Data – K-means Clustering Results (contd.)
ANOVA
Variable | Cluster Mean Square (df = 2) | Error Mean Square (df = 117) | F | Sig.
Will | 58.540 | .683 | 85.710 | .000
Govt | 34.297 | .750 | 45.717 | .000
Train | 2.228 | .177 | 12.565 | .000
Experience | 3.640 | .142 | 25.590 | .000
Years | 4.091 | .690 | 5.932 | .004
Prod | 298.924 | 1.377 | 217.038 | .000
Mod_size | 32.451 | .537 | 60.391 | .000
Mod_Rev | 2.252 | .873 | 2.580 | .080

Number of Cases in each Cluster: cluster 1: 56, cluster 2: 46, cluster 3: 18; Valid: 120; Missing: 0
37
Determining the optimal number of cluster from hierarchical methods
Graphical
• dendrogram
• scree diagram
Statistical
• Arnold's criterion
• pseudo F statistic (see the sketch below)
• pseudo t2 statistic
• cubic clustering criterion (CCC)
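The pseudo F statistic is available in scikit-learn as the Calinski-Harabasz index; a hedged sketch on simulated data (Arnold's criterion, the pseudo t2 and the CCC have no direct scikit-learn equivalent):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (40, 2)),
               rng.normal(6, 1, (40, 2)),
               rng.normal(12, 1, (40, 2))])

# A local maximum of the index suggests a candidate number of clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(calinski_harabasz_score(X, labels), 1))
```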
38
Dendrogram
[SPSS dendrogram excerpt, Rescaled Distance Cluster Combine: the vertical axis lists individual cases (231, 275, 145, 181, 333, 117, 336, 337, 209, 431, 178); the horizontal axis shows the rescaled merging distance from 0 to 25]
This dotted line represents the distance between clusters
These are the individual cases
Case 231 and case 275 are merged
And the merging distance is relatively small
As the algorithm proceeds, the merging distances become larger
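A dendrogram like the one described above can be reproduced outside SPSS; a minimal scipy/matplotlib sketch on simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

Z = linkage(X, method="ward")
# Leaves are the individual cases; the height of each junction is the merging distance
dendrogram(Z)
plt.xlabel("Case")
plt.ylabel("Merging distance")
plt.show()
```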
39
Scree diagram
[Scree diagram: merging distance (y-axis, 0 to 12) plotted against the number of clusters (x-axis, from 11 down to 1)]
Merging distance on the y-axis
When one moves from 7 to 6 clusters, the merging distance increases noticeably
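A scree diagram of merging distances can be built from the linkage output; a sketch on simulated data (the elbow is read off the plot exactly as described above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (30, 2)),
               rng.normal(6, 1, (30, 2)),
               rng.normal(12, 1, (30, 2))])

Z = linkage(X, method="ward")
# Z[-k, 2] is the merging distance when going from k+1 to k clusters
n_clusters = np.arange(10, 0, -1)
merge_dist = Z[-10:, 2]

plt.plot(n_clusters, merge_dist, marker="o")
plt.gca().invert_xaxis()
plt.xlabel("Number of clusters")
plt.ylabel("Merging distance")
plt.show()
```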
40
Statistical tests
The rationale is that in optimal partition, variability within clusters should be as small as possible, while variability between clusters should be maximized
This principle is similar to the ANOVA F test.
However, since hierarchical algorithms proceed sequentially, the probability distribution of statistics relating within-cluster variability to between-cluster variability is unknown and differs from the F distribution.
41
Statistical criteria to detect the optimal partition
Arnold's criterion: find the minimum of the determinant of the within-cluster sum of squares matrix W.
Pseudo F, CCC and pseudo t2: the ideal number of clusters should correspond to
• a local maximum for the pseudo F and CCC, and
• a small value of the pseudo t2 which increases in the next step (preferably a local minimum).
These criteria are rarely consistent with one another, so the researcher should also rely on meaningful (interpretable) criteria.
Non-parametric methods (SAS) also allow one to determine the number of clusters:
• k-th nearest neighbour method:
  - the researcher sets a parameter (k)
  - for each k the method returns the optimal number of clusters
  - if this optimal number is the same for several values of k, then the determination of the number of clusters is relatively robust
42
Suggested approach: a two-step procedure
1. First perform a hierarchical method to define the number of clusters
2. Then use the k-means procedure to actually form the clusters
The reallocation problem:
• Rigidity of hierarchical methods: once a unit is classified into a cluster, it cannot be moved to other clusters in subsequent steps.
• The k-means method allows a reclassification of all units in each iteration.
If some uncertainty about the number of clusters remains after running the hierarchical method, one may also run several k-means clustering procedures and apply the previously discussed statistical tests to choose the best partition.
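A sketch of this two-step approach with scipy and scikit-learn on simulated data (Ward fixes the number of clusters and provides the seeds, k-means then forms the final clusters and allows reallocation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (40, 2)),
               rng.normal(6, 1, (40, 2)),
               rng.normal(12, 1, (40, 2))])

# Step 1: hierarchical (Ward) clustering to decide the number of clusters
Z = linkage(X, method="ward")
k = 3  # chosen, e.g., from the dendrogram or the scree of merging distances
ward_labels = fcluster(Z, t=k, criterion="maxclust")

# Use the Ward cluster centroids as initial seeds
seeds = np.array([X[ward_labels == j].mean(axis=0) for j in range(1, k + 1)])

# Step 2: k-means forms the final clusters, allowing units to be reallocated
km = KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)
print(km.cluster_centers_)
```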
43
The SPSS two-step procedure
The observations are preliminarily aggregated into clusters using a hybrid hierarchical procedure named the cluster feature tree.
This first step produces a number of pre-clusters, which is higher than the final number of clusters, but much smaller than the number of observations.
In the second step, a hierarchical method is used to classify the pre-clusters, obtaining the final classification.
During this second clustering step, it is possible to determine the number of clusters.
The user can either fix the number of clusters or let the algorithm search for the best one according to information criteria which are also based on goodness-of-fit measures.
44
Evaluation and validation
Goodness-of-fit of a cluster analysis:
• ratio between the sum of squared errors and the total sum of squared errors (similar to R2)
• root mean square standard deviation within clusters
Validation: if the identified cluster structure (number of clusters and cluster characteristics) is real, it should not change materially when the analysis is repeated.
Validation approaches:
• use different samples to check whether the final output is similar
• split the sample into two groups when no other samples are available (a sketch of this check is given below)
• check the impact of the initial seeds / order of cases (hierarchical approach) on the final partition
• check the impact of the selected clustering method
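One possible split-sample check, sketched with scikit-learn on simulated data (the adjusted Rand index is used here as the agreement measure; it is not mentioned in the slides):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# Split the sample into two halves and cluster each half separately
idx = rng.permutation(len(X))
half_a, half_b = X[idx[:100]], X[idx[100:]]
km_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(half_a)
km_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit(half_b)

# Assign the first half both to its own clusters and to the clusters estimated
# on the other half; high agreement suggests a stable (real) cluster structure
labels_a = km_a.labels_
labels_a_from_b = km_b.predict(half_a)
print(adjusted_rand_score(labels_a, labels_a_from_b))
```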
45
Cluster analysis in SPSS
Three types of cluster analysis are available in SPSS: hierarchical clustering, k-means clustering, and two-step clustering
46
Hierarchical cluster analysis
Variables selected for the analysis
Statistics required in the analysis
Graphs (dendrogram). Advice: no plots
Clustering method and options
Create a new variable with cluster membership for each case
47
Statistics
The agglomeration schedule is a table which shows the steps of the clustering procedure, indicating which cases (clusters) are merged and the merging distance
The proximity matrix contains all distances between cases (it may be huge)
Shows the cluster membership of individual cases only for a sub-set of solutions
48
Plots
Shows the clustering process, indicating which cases are aggregated and the merging distance
With many cases, the dendrogram is hardly readable
The icicle plot (which can be restricted to cover a small range of clusters), shows at what stage cases are clustered. The plot is cumbersome and slows down the analysis (advice: no icicle)
49
Method
Choose a hierarchical algorithm.
Choose the type of data (interval, counts, binary) and the appropriate measure.
Specify whether the variables (values) should be standardized before the analysis. Z-scores return variables with zero mean and unit variance. Other standardizations are possible. Distance measures can also be transformed.
50
Cluster memberships
If the number of clusters has been decided (or at least a range of solutions), it is possible to save the cluster membership for each case into new variables
51
The example: agglomeration schedule
Agglomeration schedule (clusters combined at each stage):
Stage | Number of clusters | Cluster 1 | Cluster 2 | Distance | Diff. dist.
490 | 10 | 8 | 12 | 544.4 |
491 | 9 | 8 | 11 | 559.3 | 14.9
492 | 8 | 3 | 7 | 575.0 | 15.7
493 | 7 | 3 | 366 | 591.6 | 16.6
494 | 6 | 3 | 6 | 610.6 | 19.0
495 | 5 | 3 | 37 | 636.6 | 26.0
496 | 4 | 13 | 23 | 663.7 | 27.1
497 | 3 | 3 | 13 | 700.8 | 37.1
498 | 2 | 1 | 8 | 754.1 | 53.3
499 | 1 | 1 | 3 | 864.2 | 110.2
Last 10 stages of the process (10 to 1 clusters)
As the algorithm proceeds towards the end, the distance increases
52
Scree diagram
[Scree diagram built from the agglomeration schedule: merging distance (y-axis, roughly 590 to 840) plotted against the number of clusters (x-axis, from 7 down to 1)]
The scree diagram (not provided by SPSS but created from the agglomeration schedule) shows a larger distance increase when the cluster number goes below 4
Elbow?
53
Hierarchical (Ward) solution with 4 clusters
Cluster means (Ward method, 4 clusters):
Variable | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Total
Case Number (N %) | 26.6% | 20.2% | 23.8% | 29.4% | 100.0%
Household size (mean) | 1.4 | 3.2 | 1.9 | 3.1 | 2.4
Gross current income of household (mean) | 238.0 | 1158.9 | 333.8 | 680.3 | 576.9
Age of Household Reference Person (mean) | 72 | 44 | 40 | 48 | 52
EFS: Total Food & non-alcoholic beverage (mean) | 28.8 | 64.4 | 29.2 | 60.6 | 45.4
EFS: Total Clothing and Footwear (mean) | 8.8 | 64.3 | 9.2 | 19.0 | 23.1
EFS: Total Housing, Water, Electricity (mean) | 25.1 | 77.7 | 33.5 | 39.1 | 41.8
EFS: Total Transport costs (mean) | 17.7 | 147.8 | 24.6 | 57.1 | 57.2
EFS: Total Recreation (mean) | 29.6 | 146.2 | 39.4 | 63.0 | 65.3
54
K-means solution (4 clusters)
Variables
Number of clusters (fixed)
Ask for one (classify only) or more iterations before stopping the algorithm
It is possible to read a file with initial seeds or write final seeds on a file
55
K-means options
Improve the algorithm by allowing for more iterations and running means (seeds are recomputed at each stage)
Creates a new variable with cluster membership for each case
More options including an ANOVA table with statistics
56
Results from k-means (initial seeds chosen by SPSS)

Final Cluster Centers:
Variable | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
Household size | 2.0 | 2.0 | 2.8 | 3.2
Gross current income of household | 264.5 | 241.1 | 791.2 | 1698.1
Age of Household Reference Person | 56 | 75 | 46 | 45
EFS: Total Food & non-alcoholic beverage | 37.3 | 22.2 | 54.1 | 66.2
EFS: Total Clothing and Footwear | 14.0 | 28.0 | 31.7 | 48.4
EFS: Total Housing, Water, Electricity | 34.7 | 100.3 | 47.3 | 64.5
EFS: Total Transport costs | 28.4 | 10.4 | 78.3 | 156.8
EFS: Total Recreation | 39.6 | 3013.1 | 74.4 | 125.9

Number of Cases in each Cluster: cluster 1: 292, cluster 2: 1, cluster 3: 155, cluster 4: 52; Valid: 500; Missing: 0
The k-means algorithm is sensitive to outliers, and SPSS chose an improbable amount for recreation expenditure as an initial seed for cluster 2 (probably an outlier due to misrecording or an exceptional expenditure)
57
Results from k-means: initial seeds from hierarchical clustering

Cluster means (k-means, 4 clusters):
Variable | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Total
Case Number (N %) | 32.6% | 10.2% | 33.6% | 23.6% | 100.0%
Household size (mean) | 1.7 | 3.1 | 2.5 | 2.9 | 2.4
Gross current income of household (mean) | 163.5 | 1707.3 | 431.8 | 865.9 | 576.9
Age of Household Reference Person (mean) | 60 | 45 | 50 | 46 | 52
EFS: Total Food & non-alcoholic beverage (mean) | 31.3 | 65.5 | 45.1 | 56.8 | 45.4
EFS: Total Clothing and Footwear (mean) | 12.3 | 48.4 | 19.1 | 32.7 | 23.1
EFS: Total Housing, Water, Electricity (mean) | 29.8 | 65.3 | 41.9 | 48.1 | 41.8
EFS: Total Transport costs (mean) | 24.6 | 156.8 | 37.4 | 87.5 | 57.2
EFS: Total Recreation (mean) | 30.3 | 126.8 | 67.9 | 83.4 | 65.3
The first cluster is now larger, but it still represents older and poorer households. The other clusters are not very different from the ones obtained with the Ward algorithm, indicating a certain robustness of the results.
58
2-step clustering
It is possible to make a distinction between categorical and continuous variables.
The search for the optimal number of clusters may be constrained.
This is the information criterion used to choose the optimal partition.
One may also ask for plots and descriptive statistics.
59
Options
It is advisable to control for outliers (OLs) because the analysis is usually sensitive to OLs
It is possible to choose which variables should be standardized prior to running the analysis.
More advanced options are available for finer control over the procedure.
60
Output
Cluster Distribution:
Cluster | N | % of Combined | % of Total
1 | 2 | .4% | .4%
2 | 5 | 1.0% | 1.0%
3 | 490 | 98.2% | 98.2%
4 | 2 | .4% | .4%
Combined | 499 | 100.0% | 100.0%
Total | 499 | | 100.0%
Results are not satisfactory.
With no prior decision on the number of clusters, two clusters are found, one with a single observation and the other with the remaining 499 observations.
Allowing for outlier treatment does not improve the results. Setting the number of clusters to four produces the distribution shown above.
It seems that the two-step clustering is biased towards finding one macro-cluster.
This might be due to the fact that the number of observations is relatively small; in any case, the combination of the Ward algorithm with the k-means algorithm is more effective here.
61
Discussion
It might seem that cluster analysis is too sensitive to the researcher’s choices
This is partly due to the relatively small data-set and possibly to correlation between variables
However, all outputs point to a segment of older and poorer households and another of younger and larger households with high expenditures.
By intensifying the search and adjusting some of the options, cluster analysis does help in identifying homogeneous groups.
"Moral": cluster analysis needs to be adequately validated, and it may be risky to run a single cluster analysis and take the results as truly informative, especially in the presence of outliers.
62
Assumptions and Limitations of Cluster Analysis
Assumptions
• The basic measure of similarity on which the clustering is based is a valid measure of the similarity between the objects.
• There is theoretical justification for structuring the objects into clusters.
Limitations
• It is difficult to evaluate the quality of the clustering.
• It is difficult to know exactly which clusters are very similar and which objects are difficult to assign.
• It is difficult to select a clustering criterion and program on any basis other than availability.