
Page 1: Cluster Analysis

1

Cluster Analysis

Page 2: Cluster Analysis

2

Collections….

Page 3: Cluster Analysis

3

Cluster analysis

Suppose one wants to understand the consumers in the finance market. With regard to the finance market, there are a few outstanding aspects, such as risk, returns and liquidity. One could use any two of these aspects at a time to form clusters of consumers, or could use more than two attributes.

[Scatter plot: consumers labelled A to O plotted by RISK (X1) on the horizontal axis and RETURNS (X2) on the vertical axis, partitioned into three clusters I, II and III]

Cluster I: Consumers believe in high-risk and high-return instruments (equity)

Cluster II: Consumers believe in average-risk and average-return instruments (mutual funds, company fixed deposits)

Cluster III: Consumers believe in low-risk and low-return instruments (savings bank, fixed deposits)
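
To make this example concrete, here is a minimal sketch (not part of the original slides) that clusters assumed, synthetic risk/return scores into three groups with scikit-learn's KMeans; the data, scales and random seed are illustrative assumptions only.

```python
# Sketch (assumed synthetic data): cluster consumers on two attributes,
# Risk (X1) and Returns (X2), into the three groups described above.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical consumers: low/low, average/average and high/high risk-return profiles
risk_returns = np.vstack([
    rng.normal(loc=[1, 1], scale=0.3, size=(30, 2)),   # savings bank / fixed deposits
    rng.normal(loc=[3, 3], scale=0.3, size=(30, 2)),   # mutual funds / company deposits
    rng.normal(loc=[5, 5], scale=0.3, size=(30, 2)),   # equity
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(risk_returns)
print(kmeans.cluster_centers_)   # one (risk, returns) centre per cluster
print(kmeans.labels_[:10])       # cluster membership of the first 10 consumers
```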

Page 4: Cluster Analysis

4

Cluster analysis

It is a class of techniques used to classify cases into groups that are
• relatively homogeneous within themselves and
• heterogeneous between each other

Homogeneity (similarity) and heterogeneity (dissimilarity) are measured on the basis of a defined set of variables

These groups are called clusters

Page 5: Cluster Analysis

5

Market segmentation

Cluster analysis is especially useful for market segmentation

Segmenting a market means dividing its potential consumers into separate sub-sets where
• consumers in the same group are similar with respect to a given set of characteristics
• consumers belonging to different groups are dissimilar with respect to the same set of characteristics

This allows one to calibrate the marketing mix differently according to the target consumer group

Page 6: Cluster Analysis

6

Other uses of cluster analysis

Product characteristics and the identification of new product opportunities.

Clustering of similar brands or products according to their characteristics allows one to identify competitors, potential market opportunities and available niches.

Data reduction
• Factor analysis and principal component analysis allow one to reduce the number of variables.
• Cluster analysis allows one to reduce the number of observations, by grouping them into homogeneous clusters.

Maps that simultaneously profile consumers and products, market opportunities and preferences, as in preference or perceptual mapping.

Page 7: Cluster Analysis

7

Quality: What Is Good Clustering?

A good clustering method will produce high-quality clusters with
• high intra-class similarity
• low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Page 8: Cluster Analysis

8

Steps in Cluster Analysis

Formulate the problem

Select the distance measure

Select the clustering procedure

Decide the number of clusters

Interpret and profile the clusters

Assess the validity of clustering

Page 9: Cluster Analysis

9

Formulate the problem

Selecting the variables on which the clustering is based – The most important part of formulating the clustering problem

Inclusion of even one or two irrelevant variables may distort an otherwise useful clustering solution

The set of selected variables should be able to describe the similarity between objects in terms that are relevant to the marketing research problem

The variables should be selected on the basis of past research, theory, or a consideration of the hypotheses being tested.

Page 10: Cluster Analysis

10

Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects.

A popular one is the Minkowski distance:

d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and q is a positive integer.

If q = 1, d is the Manhattan distance:

d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
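
A minimal sketch (not from the original slides) of the Minkowski distance defined above, written out with numpy; the example vectors are assumptions for illustration.

```python
# Sketch of the Minkowski distance d(i, j) = (sum_k |x_ik - x_jk|^q)^(1/q).
import numpy as np

def minkowski(x_i, x_j, q):
    """Minkowski distance between two p-dimensional data objects."""
    return np.sum(np.abs(np.asarray(x_i) - np.asarray(x_j)) ** q) ** (1.0 / q)

x_i = [1.0, 2.0, 3.0]   # assumed example vectors
x_j = [4.0, 0.0, 3.0]
print(minkowski(x_i, x_j, q=1))  # q = 1: Manhattan (city-block) distance = 5.0
print(minkowski(x_i, x_j, q=2))  # q = 2: Euclidean distance ~= 3.606
```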

Page 11: Cluster Analysis

11

Similarity and Dissimilarity Between Objects

Distance measures how far apart two observations are.

Similarity measures how alike two cases are.

When two or more variables are used to define distance, the one with the larger magnitude will dominate. To avoid this, it is common to first standardize all variables.

SPSS hierarchical clustering supports these types of measures:
• Interval. Available alternatives are Euclidean distance, squared Euclidean distance, cosine, Pearson correlation, Chebychev, block, Minkowski, and customized.
• Counts. Available alternatives are the chi-square measure and the phi-square measure.
• Binary. Binary matching is a common type of similarity data, where 1 indicates a match and 0 indicates no match between any pair of cases. There are multiple matched attributes, and the similarity score is the number of matches divided by the number of attributes being matched.
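
The dominance problem is easy to see numerically. Below is a sketch (an addition, with assumed data) showing how an unstandardized income variable swamps the Euclidean distance and how z-scores restore balance; the variables and values are illustrative assumptions.

```python
# Sketch (assumed data): without standardization, the variable with the larger
# magnitude (income) dominates the distance; z-scores put variables on a common scale.
import numpy as np
from scipy.spatial.distance import euclidean
from sklearn.preprocessing import StandardScaler

data = np.array([
    [25, 20_000.0],   # age, income
    [26, 80_000.0],
    [60, 21_000.0],
])

print(euclidean(data[0], data[1]))  # ~60000: driven almost entirely by income
print(euclidean(data[0], data[2]))  # ~1000: the 35-year age gap barely registers

z = StandardScaler().fit_transform(data)            # zero mean, unit variance per column
print(euclidean(z[0], z[1]), euclidean(z[0], z[2]))  # now on comparable scales
```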

Page 12: Cluster Analysis

12

Similarity and Dissimilarity Between Objects : Interval Measure

Euclidean distance: the square root of the sum of the square of the x difference plus the square of the y difference.

Squared Euclidean distance removes the sign and also places greater emphasis on objects further apart, thus increasing the effect of outliers. This is the default for interval data.

Cosine. Interval-level similarity based on the cosine of the angle between two vectors of values.

Pearson correlation. Interval-level similarity based on the product-moment correlation. For case clustering, as opposed to variable clustering, the researcher transposes the usual data table in which columns are variables and rows are cases. By using columns as cases and rows as variables instead, the correlation is between cases, and these correlations may constitute the cells of the similarity matrix.

Page 13: Cluster Analysis

13

Similarity and Dissimilarity Between Objects : Interval Measure

Chebychev distance is the maximum absolute difference between a pair of cases on any one of the two or more dimensions (variables) which are being used to define distance. Pairs will be defined as different according to their difference on a single dimension, ignoring their similarity on the remaining dimensions.

Block distance, also called Manhattan or city-block distance, is the sum of the absolute differences across the two or more dimensions which are used to define distance.

Minkowski distance is the generalized distance function: the pth root of the sum of the absolute differences to the pth power between the values for the items,

d_{ij} = \left( \sum_k |x_{ik} - x_{jk}|^p \right)^{1/p}

When p = 1, Minkowski distance is city-block distance. In the case of binary data, when p = 1 Minkowski distance is Hamming distance. When p = 2, Minkowski distance is Euclidean distance.

Page 14: Cluster Analysis

14

Similarity and Dissimilarity Between Objects : Count Measure

Chi-square measure. Based on the chi-square test of equality for two sets of frequencies, this measure is the default for count data.

Phi-square measure. Phi-square normalizes chi-square measure by the square root of the combined frequency.

Page 15: Cluster Analysis

15

Similarity and Dissimilarity Between Objects : Binary Measure

Squared Euclidean distance is also an option for binary data, as it is for interval data.

Size difference. This asymmetry index ranges from 0 to 1.

Pattern difference. This dissimilarity measure also ranges from 0 to 1. Computed from a 2x2 table as bc/n², where b and c are the diagonal cells and n is the number of observations.

Variance. Computed from a 2x2 table as (b+c)/4n. It also ranges from 0 to 1.

Dispersion is a similarity measure with a range of -1 to 1.

Shape is a distance measure with a range of 0 to 1. Shape penalizes asymmetry of mismatches.

Simple matching is the ratio of matches to the total number of values.

Phi 4-point correlation is a binary analog of the Pearson correlation coefficient and has a range of -1 to 1.

Page 16: Cluster Analysis

16

Similarity and Dissimilarity Between Objects : Binary Measure

Lambda. Goodman and Kruskal's lambda, which is interpreted as the proportional reduction of error (PRE) when using one item to predict the other (predicting in both directions). It ranges from 0 to 1, with 1 being perfect predictive monotonicity.

Anderberg's D is a variant of lambda. D is the actual reduction of error using one item to predict the other (predicting in both directions). It ranges from 0 to 1.

The Czekanowski or Sorensen measure is an index in which joint absences are excluded from computation and matches are weighted double.

Hamann is the number of matches minus the number of nonmatches, divided by the total number of items; it ranges from -1 to 1.

Jaccard. This is an index in which joint absences are excluded from consideration. Equal weight is given to matches and nonmatches. Also known as the similarity ratio. Jaccard is commonly recommended for binary data.
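
As a hedged illustration (not from the original slides), the sketch below builds the 2x2 match table for two assumed binary profiles and computes two of the measures just listed, simple matching and Jaccard.

```python
# Sketch (assumed binary profiles): build the 2x2 table and compute
# simple matching and Jaccard similarity.
import numpy as np

x = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y = np.array([1, 0, 0, 1, 0, 1, 1, 0])

a = np.sum((x == 1) & (y == 1))   # joint presences
b = np.sum((x == 1) & (y == 0))   # present in x only
c = np.sum((x == 0) & (y == 1))   # present in y only
d = np.sum((x == 0) & (y == 0))   # joint absences
n = a + b + c + d

simple_matching = (a + d) / n      # matches / total number of attributes
jaccard = a / (a + b + c)          # joint absences excluded from consideration
print(simple_matching, jaccard)    # 0.75 and 0.6 for this example
```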

Page 17: Cluster Analysis

17

Similarity and Dissimilarity Between Objects : Binary Measure

Kulczynski 1 is the ratio of joint presences to all nonmatches. Its lower bound is 0 and it is unbounded above. It is theoretically undefined when there are no nonmatches.

Kulczynski 2 is the conditional probability that the characteristic is present in one item, given that it is present in the other.

Lance and Williams index, also called the Bray-Curtis nonmetric coefficient, is based on a 2x2 table, using the formula (b+c)/(2a+b+c), where a represents the cell corresponding to cases present on both items, and b and c represent the diagonal cells corresponding to cases present on one item but absent on the other. It ranges from 0 to 1.

Ochiai index is the binary equivalent of the cosine similarity measure and ranges from 0 to 1.

Rogers and Tanimoto index gives double weight to nonmatches.

Russell and Rao index is equivalent to the inner (dot) product, giving equal weight to matches and nonmatches. This is a common choice for binary similarity data.

Page 18: Cluster Analysis

18

Similarity and Dissimilarity Between Objects : Binary Measure

Sokal and Sneath 1 index also gives double weight to matches.

Sokal and Sneath 2 index gives double weight to nonmatches, but joint absences are excluded from consideration.

Sokal and Sneath 3 index is the ratio of matches to nonmatches. Its minimum is 0 but it is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.

Sokal and Sneath 4 index is based on the conditional probability that the characteristic in one item matches the value in the other, taking the mean of the predictions in either direction.

Sokal and Sneath 5 index is the squared geometric mean of conditional probabilities of positive and negative matches. It ranges from 0 to 1.

Yule's Y, also called the coefficient of colligation, is a similarity measure based on the cross-ratio for a 2x2 table and is independent of the marginal totals. It ranges from -1 to +1.

Yule's Q is a similarity measure which is a special 2x2 case of Goodman and Kruskal's gamma. It is also based on the cross-ratio.

Page 19: Cluster Analysis

19

Clustering procedures

Hierarchical procedures
• Agglomerative (start from n clusters to get to 1 cluster)
  – Linkage methods (single, complete, average)
  – Variance methods (Ward's method)
  – Centroid methods
• Divisive (start from 1 cluster to get to n clusters)

Non-hierarchical procedures
• Sequential threshold
• Parallel threshold
• Optimising threshold

Page 20: Cluster Analysis

20

Hierarchical clustering

Agglomerative:
• Each of the n observations constitutes a separate cluster
• The two clusters that are most similar according to some distance rule are aggregated, so that in step 1 there are n-1 clusters
• In the second step another cluster is formed (n-2 clusters), by nesting the two clusters that are most similar, and so on
• There is a merging in each step until all observations end up in a single cluster in the final step

Divisive:
• All observations are initially assumed to belong to a single cluster
• The most dissimilar observation is extracted to form a separate cluster
• In step 1 there will be 2 clusters, in the second step three clusters, and so on, until the final step produces as many clusters as the number of observations

The number of clusters determines the stopping rule for the algorithms

Page 21: Cluster Analysis

21

Non-hierarchical clustering

These algorithms do not follow a hierarchy and produce a single partition

Knowledge of the number of clusters (c) is required. In the first step, initial cluster centres (the seeds) are determined for each of the c clusters, either by the researcher or by the software (usually the first c observations, or c observations chosen at random)

Each iteration allocates observations to each of the c clusters, based on their distance from the cluster centres

Cluster centres are computed again and observations may be reallocated to the nearest cluster in the next iteration

When no observations can be reallocated or a stopping rule is met, the process stops

Page 22: Cluster Analysis

22

Distance between clusters

Algorithms vary according to the way the distance between two clusters is defined.

The most common algorithms for hierarchical methods include
• single linkage method
• complete linkage method
• average linkage method
• Ward's algorithm
• centroid method

Page 23: Cluster Analysis

23

Linkage methods

Single linkage method (nearest neighbour): the distance between two clusters is the minimum distance among all possible distances between observations belonging to the two clusters.

Complete linkage method (furthest neighbour): nests two clusters using as a basis the maximum distance between observations belonging to separate clusters.

Average linkage method: the distance between two clusters is the average of all distances between observations in the two clusters
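
As a sketch of these three linkage rules (an addition, with assumed toy data), scipy's hierarchical clustering lets the merge criterion be swapped while everything else stays the same.

```python
# Sketch (assumed toy data): single, complete and average linkage in scipy;
# only the rule for measuring the distance between clusters changes.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # agglomeration schedule (n-1 merges)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes per method
```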

Page 24: Cluster Analysis

24

Hierarchical Clustering

Single Linkage

Clustering criterion based on the shortest distance

Complete Linkage

Clustering criterion based on the longest distance

Page 25: Cluster Analysis

25

Hierarchical Clustering (Contd.)

Average Linkage

Clustering criterion based on the average distance

Page 26: Cluster Analysis

26

Ward algorithm

1. The sum of squared distances is computed within each cluster, considering all distances between observations within the same cluster

2. The algorithm proceeds by choosing the aggregation of two clusters which generates the smallest increase in the total sum of squared distances.

It is a computationally intensive method, because at each step all the sums of squared distances need to be computed, together with all potential increases in the total sum of squared distances for each possible aggregation of clusters.
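
A minimal sketch (assumed data, added for illustration) of Ward's method via scipy, followed by the within-cluster sum of squares of the resulting partition:

```python
# Sketch (assumed data): Ward's method merges, at each step, the pair of clusters
# that gives the smallest increase in the total within-cluster sum of squares.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (25, 2)),
               rng.normal(5, 1, (25, 2)),
               rng.normal((0, 8), 1, (25, 2))])

Z = linkage(X, method="ward")                 # Ward requires Euclidean distances
labels = fcluster(Z, t=3, criterion="maxclust")

# Within-cluster sum of squares of the 3-cluster partition
wss = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
          for k in np.unique(labels))
print(wss)
```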

Page 27: Cluster Analysis

27

Hierarchical Clustering (Contd.)

Ward's Method

Based on the loss of information resulting from grouping of the objects into clusters (minimize within cluster variation)

Page 28: Cluster Analysis

28

Centroid method

The distance between two clusters is the distance between the two centroids.

Centroids are the cluster averages for each of the variables
• each cluster is defined by a single set of coordinates, the averages of the coordinates of all individual observations belonging to that cluster

Difference between the centroid and the average linkage method
• Centroid: computes the average of the coordinates of the observations belonging to an individual cluster
• Average linkage: computes the average of the distances between observations in two separate clusters.
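
A tiny sketch (assumed data, added here) making the distinction concrete: the centroid is simply the per-variable mean of a cluster's members, and the centroid method measures cluster distance between those means.

```python
# Sketch (assumed data): centroids are the per-variable means of the members;
# the centroid method uses the distance between the two centroids.
import numpy as np
from scipy.spatial.distance import euclidean

cluster_a = np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5]])
cluster_b = np.array([[6.0, 7.0], [7.0, 6.0]])

centroid_a = cluster_a.mean(axis=0)          # [1.5, 1.5]
centroid_b = cluster_b.mean(axis=0)          # [6.5, 6.5]
print(euclidean(centroid_a, centroid_b))     # distance between the two clusters
```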

Page 29: Cluster Analysis

29

Hierarchical Clustering (Contd.)

Centroid Method

Based on the distance between the group centroids (the point whose coordinates are the means of all the observations in the cluster)

Page 30: Cluster Analysis

30

Non-hierarchical Clustering

Sequential Threshold

Cluster center is selected and all objects within a prespecified threshold value are grouped

Parallel Threshold

Several cluster centers are selected and objects within the threshold level are assigned to the nearest center

Optimizing

Objects can be later reassigned to clusters on the basis of optimizing some overall criterion measure

Page 31: Cluster Analysis

31

Non-hierarchical clustering: K-means method

1. The number k of clusters is fixed

2. An initial set of k “seeds” (aggregation centres) is provided
• First k elements
• Other seeds (randomly selected or explicitly defined)

3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed

4. New seeds are computed

5. Go back to step 3 until no reclassification is necessary

Units can be reassigned in successive steps (optimising partitioning)
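
The steps above can be written out directly; the following sketch (assumed synthetic data, added for illustration) implements them with numpy. A production implementation would also guard against clusters that become empty during reassignment.

```python
# Sketch (assumed data): the k-means steps listed above, written out with numpy.
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])

k = 2                                   # 1. the number k of clusters is fixed
seeds = X[:k].copy()                    # 2. initial seeds (here: the first k elements)
for _ in range(100):
    # 3. assign every unit to the nearest cluster seed
    dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 4. new seeds are computed as the cluster means
    #    (no guard for empty clusters in this sketch)
    new_seeds = np.vstack([X[labels == j].mean(axis=0) for j in range(k)])
    # 5. stop when no reclassification changes the seeds any more
    if np.allclose(new_seeds, seeds):
        break
    seeds = new_seeds

print(seeds)                  # final cluster centres
print(np.bincount(labels))    # cluster sizes
```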

Page 32: Cluster Analysis

32

Non-hierarchical threshold methods

Sequential threshold methods
• a prior threshold is fixed and units within that distance are allocated to the first seed
• a second seed is selected and the remaining units are allocated, etc.

Parallel threshold methods
• more than one seed is considered simultaneously

When reallocation is possible after each stage, the methods are termed optimizing procedures.

Page 33: Cluster Analysis

33

Hierarchical vs. non-hierarchical methods

Hierarchical methods
• No decision about the number of clusters is required in advance
• Problems when data contain a high level of error
• Can be very slow, preferable with small data-sets
• Initial decisions are more influential (one-step only)
• At each step they require computation of the full proximity matrix

Non-hierarchical methods
• Faster, more reliable, work with large data sets
• Need to specify the number of clusters
• Need to set the initial seeds
• Only cluster distances to seeds need to be computed in each iteration

Page 34: Cluster Analysis

34

Description of Variables

Variable (Description) | Name in Output | Scale Values
Willingness to Export (Y1) | Will | 1 (definitely not interested) to 5 (definitely interested)
Level of Interest in Seeking Govt Assistance (Y2) | Govt | 1 (definitely not interested) to 5 (definitely interested)
Employee Size (X1) | Size | Greater than zero
Firm Revenue (X2) | Rev | In millions of dollars
Years of Operation in the Domestic Market (X3) | Years | Actual number of years
Number of Products Currently Produced by the Firm (X4) | Prod | Actual number
Training of Employees (X5) | Train | 0 (no formal program) or 1 (existence of a formal program)
Management Experience in International Operation (X6) | Exp | 0 (no experience) or 1 (presence of experience)

Page 35: Cluster Analysis

35

Export Data – K-means Clustering Results

Initial Cluster Centers

Variable | Cluster 1 | Cluster 2 | Cluster 3
Will | 4 | 1 | 5
Govt | 5 | 1 | 4
Train | 1 | 0 | 0
Experience | 1 | 0 | 1
Years | 6.0 | 7.0 | 4.5
Prod | 5 | 2 | 11
Mod_size | 5.80 | 2.80 | 2.90
Mod_Rev | 1.00 | .90 | .90

Final Cluster Centers

Variable | Cluster 1 | Cluster 2 | Cluster 3
Will | 4 | 2 | 5
Govt | 4 | 2 | 4
Train | 1 | 0 | 1
Experience | 1 | 0 | 1
Years | 6.3 | 6.7 | 6.0
Prod | 6 | 3 | 10
Mod_size | 4.97 | 3.60 | 5.42
Mod_Rev | 1.76 | 1.74 | 1.21

Distances between Final Cluster Centers

Cluster | 1 | 2 | 3
1 | – | 3.899 | 4.193
2 | 3.899 | – | 7.602
3 | 4.193 | 7.602 | –

(Mod_size = Size/10; Mod_Rev = Revenue/1000)

Page 36: Cluster Analysis

36

Export Data – K-means Clustering Results (contd.)

ANOVA

Variable | Cluster Mean Square | df | Error Mean Square | df | F | Sig.
Will | 58.540 | 2 | .683 | 117 | 85.710 | .000
Govt | 34.297 | 2 | .750 | 117 | 45.717 | .000
Train | 2.228 | 2 | .177 | 117 | 12.565 | .000
Experience | 3.640 | 2 | .142 | 117 | 25.590 | .000
Years | 4.091 | 2 | .690 | 117 | 5.932 | .004
Prod | 298.924 | 2 | 1.377 | 117 | 217.038 | .000
Mod_size | 32.451 | 2 | .537 | 117 | 60.391 | .000
Mod_Rev | 2.252 | 2 | .873 | 117 | 2.580 | .080

Number of Cases in each Cluster

Cluster 1: 56.000
Cluster 2: 46.000
Cluster 3: 18.000
Valid: 120.000
Missing: .000

Page 37: Cluster Analysis

37

Determining the optimal number of clusters from hierarchical methods

Graphical
• dendrogram
• scree diagram

Statistical
• Arnold's criterion
• pseudo-F statistic
• pseudo-t² statistic
• cubic clustering criterion (CCC)

Page 38: Cluster Analysis

38

Dendrogram

[Dendrogram excerpt: rescaled distance cluster combine; cases labelled 231, 275, 145, 181, 333, 117, 336, 337, 209, 431 and 178 listed down the side, merging distance rescaled from 0 to 25 along the top]

This dotted line represents the distance between clusters

These are the individual cases

Case 231 and case 275 are merged

And the merging distance is relatively small

As the algorithm proceeds, the merging distances become larger
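
A dendrogram like this one can be drawn directly from an agglomeration schedule; the sketch below (assumed data, not part of the original slides) uses scipy and matplotlib.

```python
# Sketch (assumed data): the height at which two branches join is the merging distance.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (15, 2)), rng.normal(6, 1, (15, 2))])

Z = linkage(X, method="ward")   # agglomeration schedule
dendrogram(Z)                   # cases on the x-axis, merging distance on the y-axis
plt.xlabel("Case")
plt.ylabel("Merging distance")
plt.show()
```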

Page 39: Cluster Analysis

39

Scree diagram

[Scree diagram: merging distance (0 to 12) on the y-axis against the number of clusters (11 down to 1) on the x-axis]

Merging distance on the y-axis

When one moves from 7 to 6 clusters, the merging distance increases noticeably
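
Most packages do not plot a scree diagram directly, but it can be built from the last merging distances of the agglomeration schedule. A sketch (assumed data, added here) using the distance column of scipy's linkage matrix:

```python
# Sketch (assumed data): scree diagram from the final merges of the
# agglomeration schedule (column 2 of scipy's linkage matrix).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(c, 1, (20, 2)) for c in (0, 5, 10)])

Z = linkage(X, method="ward")
last_merges = Z[-10:, 2]                 # distances of the final 10 merges
n_clusters = np.arange(10, 0, -1)        # those merges produce 10, 9, ..., 1 clusters

plt.plot(n_clusters, last_merges, marker="o")
plt.gca().invert_xaxis()                 # read left to right: 10 clusters down to 1
plt.xlabel("Number of clusters")
plt.ylabel("Merging distance")
plt.show()
```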

Page 40: Cluster Analysis

40

Statistical tests

The rationale is that in an optimal partition, variability within clusters should be as small as possible, while variability between clusters should be maximized

This principle is similar to the ANOVA F-test. However, since hierarchical algorithms proceed sequentially, the probability distribution of statistics relating within-cluster and between-cluster variability is unknown and differs from the F distribution

Page 41: Cluster Analysis

41

Statistical criteria to detect the optimal partition

Arnold's criterion: find the minimum of the determinant of the within-cluster sum of squares matrix W

Pseudo-F, CCC and pseudo-t²: the ideal number of clusters should correspond to
• a local maximum for the pseudo-F and the CCC, and
• a small value of the pseudo-t² which increases in the next step (preferably a local minimum)

These criteria are rarely consistent with each other, so the researcher should also rely on meaningful (interpretable) criteria.

Non-parametric methods (SAS) also allow one to determine the number of clusters
• k-th nearest neighbour method:
  – the researcher sets a parameter (k)
  – for each k the method returns the optimal number of clusters
  – if this optimal number is the same for several values of k, then the determination of the number of clusters is relatively robust
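
The pseudo-F statistic is, in essence, the Calinski-Harabasz index, which scikit-learn exposes. A sketch (assumed data, added for illustration) of scanning it over candidate numbers of clusters and looking for a local maximum:

```python
# Sketch (assumed data): pseudo-F (Calinski-Harabasz) across candidate cluster counts.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(c, 1, (30, 2)) for c in (0, 5, 10)])

Z = linkage(X, method="ward")
for k in range(2, 7):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, round(calinski_harabasz_score(X, labels), 1))  # a peak suggests k = 3 here
```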

Page 42: Cluster Analysis

42

Suggested approach: two-step procedure

1. First perform a hierarchical method to define the number of clusters

2. Then use the k-means procedure to actually form the clusters

The reallocation problem
• Rigidity of hierarchical methods: once a unit is classified into a cluster, it cannot be moved to other clusters in subsequent steps
• The k-means method allows a reclassification of all units in each iteration.

If some uncertainty about the number of clusters remains after running the hierarchical method, one may also run several k-means clustering procedures and apply the previously discussed statistical tests to choose the best partition.
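
A sketch of this two-step approach (an addition, with assumed data): Ward's method fixes the number of clusters and supplies the initial seeds, and k-means then refines the partition while allowing reallocation.

```python
# Sketch (assumed data): step 1 hierarchical (Ward), step 2 k-means seeded
# with the Ward cluster centres.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(c, 1, (40, 2)) for c in (0, 6, 12)])

# Step 1: hierarchical (Ward) solution, say with 3 clusters
Z = linkage(X, method="ward")
ward_labels = fcluster(Z, t=3, criterion="maxclust")
seeds = np.vstack([X[ward_labels == k].mean(axis=0) for k in (1, 2, 3)])

# Step 2: k-means started from the Ward cluster centres (reallocation allowed)
km = KMeans(n_clusters=3, init=seeds, n_init=1).fit(X)
print(km.cluster_centers_)
```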

Page 43: Cluster Analysis

43

The SPSS two-step procedure

The observations are preliminarily aggregated into clusters using a hybrid hierarchical procedure known as a cluster feature tree.

This first step produces a number of pre-clusters, which is higher than the final number of clusters, but much smaller than the number of observations.

In the second step, a hierarchical method is used to classify the pre-clusters, obtaining the final classification.

During this second clustering step, it is possible to determine the number of clusters.

The user can either fix the number of clusters or let the algorithm search for the best one according to information criteria which are also based on goodness-of-fit measures.

Page 44: Cluster Analysis

44

Evaluation and validation

Goodness-of-fit of a cluster analysis
• ratio between the sum of squared errors and the total sum of squared errors (similar to R²)
• root mean standard deviation within clusters

Validation: if the identified cluster structure (number of clusters and cluster characteristics) is real, it should not depend on the particular sample, the ordering of cases or the clustering method used

Validation approaches
• use of different samples to check whether the final output is similar
• split the sample into two groups when no other samples are available
• check the impact of the initial seeds / order of cases (hierarchical approach) on the final partition
• check the impact of the selected clustering method
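
The sketch below (assumed data, not part of the original slides) computes an R²-like goodness-of-fit ratio and runs a simple split-sample stability check using the adjusted Rand index; both the data and the split scheme are illustrative assumptions.

```python
# Sketch (assumed data): share of variance explained by the clusters, plus a
# split-sample comparison of cluster assignments.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(c, 1, (50, 2)) for c in (0, 6)])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
total_ss = ((X - X.mean(axis=0)) ** 2).sum()
within_ss = km.inertia_                      # sum of squared distances to the centres
print((total_ss - within_ss) / total_ss)     # R²-like ratio: close to 1 is a good fit

# Split-sample check: cluster a random half separately and compare its labels,
# on that half, with the full-sample solution (labels are compared up to permutation).
half = rng.permutation(len(X))[: len(X) // 2]
km_half = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X[half])
print(adjusted_rand_score(km.labels_[half], km_half.labels_))  # near 1 if stable
```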

Page 45: Cluster Analysis

45

Cluster analysis in SPSS

Three types of cluster analysis are available in SPSS: hierarchical clustering, k-means clustering and two-step clustering

Page 46: Cluster Analysis

46

Hierarchical cluster analysis

Variables selected for the analysis

Statistics required in the analysis

Graphs (dendrogram). Advice: no plots

Clustering method and options

Create a new variable with cluster membership for each case

Page 47: Cluster Analysis

47

Statistics

The agglomeration schedule is a table which shows the steps of the clustering procedure, indicating which cases (clusters) are merged and the merging distance

The proximity matrix contains all distances between cases (it may be huge)

Shows the cluster membership of individual cases only for a sub-set of solutions

Page 48: Cluster Analysis

48

Plots

Shows the clustering process, indicating which cases are aggregated and the merging distance

With many cases, the dendrogram is hardly readable

The icicle plot (which can be restricted to cover a small range of clusters), shows at what stage cases are clustered. The plot is cumbersome and slows down the analysis (advice: no icicle)

Page 49: Cluster Analysis

49

Method

Choose a hierarchical algorithm

Choose the type of data (interval, counts, binary) and the appropriate measure

Specify whether the variables (values) should be standardized before analysis. Z-scores return variables with zero mean and unity variance. Other standardizations are possible. Distance measures can also be transformed

Page 50: Cluster Analysis

50

Cluster memberships

If the number of clusters has been decided (or at least a range of solutions), it is possible to save the cluster membership for each case into new variables

Page 51: Cluster Analysis

51

The example: agglomeration schedule

Stage | Number of clusters | Cluster combined: Cluster 1 | Cluster combined: Cluster 2 | Distance | Diff. Dist
490 | 10 | 8 | 12 | 544.4 | –
491 | 9 | 8 | 11 | 559.3 | 14.9
492 | 8 | 3 | 7 | 575.0 | 15.7
493 | 7 | 3 | 366 | 591.6 | 16.6
494 | 6 | 3 | 6 | 610.6 | 19.0
495 | 5 | 3 | 37 | 636.6 | 26.0
496 | 4 | 13 | 23 | 663.7 | 27.1
497 | 3 | 3 | 13 | 700.8 | 37.1
498 | 2 | 1 | 8 | 754.1 | 53.3
499 | 1 | 1 | 3 | 864.2 | 110.2

Last 10 stages of the process (10 to 1 clusters)

As the algorithm proceeds towards the end, the distance increases

Page 52: Cluster Analysis

52

Scree diagram

[Scree diagram: merging distance (590 to 840) on the y-axis against the number of clusters (7 down to 1) on the x-axis]

The scree diagram (not provided by SPSS but created from the agglomeration schedule) shows a larger distance increase when the cluster number goes below 4

Elbow?

Page 53: Cluster Analysis

53

Hierarchical solution with 4 clusters (Ward method)

Variable | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Total
Case number (N %) | 26.6% | 20.2% | 23.8% | 29.4% | 100.0%
Household size (mean) | 1.4 | 3.2 | 1.9 | 3.1 | 2.4
Gross current income of household (mean) | 238.0 | 1158.9 | 333.8 | 680.3 | 576.9
Age of household reference person (mean) | 72 | 44 | 40 | 48 | 52
EFS: Total food & non-alcoholic beverage (mean) | 28.8 | 64.4 | 29.2 | 60.6 | 45.4
EFS: Total clothing and footwear (mean) | 8.8 | 64.3 | 9.2 | 19.0 | 23.1
EFS: Total housing, water, electricity (mean) | 25.1 | 77.7 | 33.5 | 39.1 | 41.8
EFS: Total transport costs (mean) | 17.7 | 147.8 | 24.6 | 57.1 | 57.2
EFS: Total recreation (mean) | 29.6 | 146.2 | 39.4 | 63.0 | 65.3

(Clusters from the Ward method)

Page 54: Cluster Analysis

54

K-means solution (4 clusters)

Variables

Number of clusters (fixed)

Ask for one (classify only) or more iterations before stopping the algorithm

It is possible to read a file with initial seeds or write final seeds on a file

Page 55: Cluster Analysis

55

K-means options

Improve the algorithm by allowing for more iterations and running means (seeds are recomputed at each stage)

Creates a new variable with cluster membership for each case

More options including an ANOVA table with statistics

Page 56: Cluster Analysis

56

Results from k-means (initial seeds chosen by SPSS)

Final Cluster Centers

Variable | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
Household size | 2.0 | 2.0 | 2.8 | 3.2
Gross current income of household | 264.5 | 241.1 | 791.2 | 1698.1
Age of household reference person | 56 | 75 | 46 | 45
EFS: Total food & non-alcoholic beverage | 37.3 | 22.2 | 54.1 | 66.2
EFS: Total clothing and footwear | 14.0 | 28.0 | 31.7 | 48.4
EFS: Total housing, water, electricity | 34.7 | 100.3 | 47.3 | 64.5
EFS: Total transport costs | 28.4 | 10.4 | 78.3 | 156.8
EFS: Total recreation | 39.6 | 3013.1 | 74.4 | 125.9

Number of Cases in each Cluster

Cluster 1: 292.000
Cluster 2: 1.000
Cluster 3: 155.000
Cluster 4: 52.000
Valid: 500.000
Missing: .000

The k-means algorithm is sensitive to outliers, and SPSS chose an improbable amount for recreation expenditure as the initial seed for cluster 2 (probably an outlier due to misrecording or an exceptional expenditure)

Page 57: Cluster Analysis

57

Results from k-means: initial seeds from hierarchical clustering

Variable | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Total
Case number (N %) | 32.6% | 10.2% | 33.6% | 23.6% | 100.0%
Household size (mean) | 1.7 | 3.1 | 2.5 | 2.9 | 2.4
Gross current income of household (mean) | 163.5 | 1707.3 | 431.8 | 865.9 | 576.9
Age of household reference person (mean) | 60 | 45 | 50 | 46 | 52
EFS: Total food & non-alcoholic beverage (mean) | 31.3 | 65.5 | 45.1 | 56.8 | 45.4
EFS: Total clothing and footwear (mean) | 12.3 | 48.4 | 19.1 | 32.7 | 23.1
EFS: Total housing, water, electricity (mean) | 29.8 | 65.3 | 41.9 | 48.1 | 41.8
EFS: Total transport costs (mean) | 24.6 | 156.8 | 37.4 | 87.5 | 57.2
EFS: Total recreation (mean) | 30.3 | 126.8 | 67.9 | 83.4 | 65.3

The first cluster is now larger, but it still represents older and poorer households. The other clusters are not very different from the ones obtained with the Ward algorithm, indicating a certain robustness of the results.

Page 58: Cluster Analysis

58

2-step clustering

It is possible to make a distinction between categorical and continuous variables

The search for the optimal number of clusters may be constrained

This is the information criterion to choose the optimal partition

One may also ask for plots and descriptive stats

Page 59: Cluster Analysis

59

Options

It is advisable to control for outliers, because the analysis is usually sensitive to them

It is possible to choose which variables should be standardized prior to running the analysis

More advanced options are available for better control over the procedure

Page 60: Cluster Analysis

60

Output

Cluster Distribution

Cluster | N | % of Combined | % of Total
1 | 2 | .4% | .4%
2 | 5 | 1.0% | 1.0%
3 | 490 | 98.2% | 98.2%
4 | 2 | .4% | .4%
Combined | 499 | 100.0% | 100.0%
Total | 499 | | 100.0%

Results are not satisfactory. With no prior decision on the number of clusters, two clusters are found, one with a single observation and the other with all the remaining observations. Allowing for outlier treatment does not improve the results. Setting the number of clusters to four produces the results above.

It seems that the two-step clustering is biased towards finding one macro-cluster.

This might be due to the fact that the number of observations is relatively small, but the combination of the Ward algorithm with the k-means algorithm is more effective

Page 61: Cluster Analysis

61

Discussion

It might seem that cluster analysis is too sensitive to the researcher’s choices

This is partly due to the relatively small data-set and possibly to correlation between variables

However, all outputs point to a segment of older and poorer households and another of younger and larger households with high expenditures.

By intensifying the search and adjusting some of the options, cluster analysis does help in identifying homogeneous groups.

“Moral”: cluster analysis needs to be adequately validated; it may be risky to run a single cluster analysis and take the results as truly informative, especially in the presence of outliers.

Page 62: Cluster Analysis

62

Assumptions and Limitations of Cluster Analysis

Assumptions
• The basic measure of similarity on which the clustering is based is a valid measure of the similarity between the objects.
• There is theoretical justification for structuring the objects into clusters.

Limitations
• It is difficult to evaluate the quality of the clustering.
• It is difficult to know exactly which clusters are very similar and which objects are difficult to assign.
• It is difficult to select a clustering criterion and program on any basis other than availability.