seven (plus or minus two) clusters, a monte carlo study

Seven (plus or minus two) Clusters, A Monte Carlo Study
Larry Hoyle, Policy Research Institute, The University of Kansas

1972 Kansas Statistical Abstract

Shading by Overprinting

Shading by Line Spacing

Line Shading Detail

What did they have in common?
Neither method is continuous
So both methods required grouping or classes
Fixed number of combinations
Characters on a fixed grid
Integer number of lines in the polygon
Lines are relatively coarse

How to Group for Shading
Equal Intervals
Equal numbers (quantiles)
By clusters
Don't group (unclassed)

Population Density - 7 Equal Intervals
100 counties fall into the bottom class

Population Density - Equal Numbers
15 counties in each class - a very different picture
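The contrast between the two maps above comes down to how class breaks are chosen. The following is a minimal sketch, not the code used for the maps: the function names and the `densities` values are hypothetical, chosen only to show how one extreme value pushes almost everything into the bottom equal-interval class while quantile classes stay balanced.

```python
def equal_interval_classes(values, k):
    """Assign each value to one of k equal-width intervals (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_number_classes(values, k):
    """Assign values to k classes of (nearly) equal size by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, i in enumerate(order):
        classes[i] = rank * k // len(values)
    return classes


# Hypothetical, strongly skewed values (NOT the Kansas county data).
densities = [1, 2, 2, 3, 3, 4, 5, 6, 8, 10, 12, 15, 20, 40, 700]

print(equal_interval_classes(densities, 3))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_number_classes(densities, 3))
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
```

With equal intervals, fourteen of the fifteen values share the bottom class and the middle class is empty. With equal numbers, each class holds five values, even though the values inside the top class differ enormously.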

Population Density - Cluster Means
Group around the 7 values that best represent the data

Population Density - Unclassed
No classes, just shade in proportion to value

Clustering
Tries for "best" grouping
Each member of a cluster can be represented by the mean of the group

Proc Fastclus
You specify the number of clusters
Minimizes the within-cluster sum of squared distances (i.e. minimum within-cluster variance)
Inspired by: k-means (MacQueen), leader algorithm (Hartigan)
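To make the idea concrete, here is a minimal one-variable k-means sketch in Python. It is a plain Lloyd-style iteration, not PROC FASTCLUS itself, and the seeding rule (evenly spaced ranks of the sorted data) is an assumption made for this illustration. The function name `kmeans_1d` is hypothetical. The data are the 21 example values used in the clustering example in this deck.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd-style k-means for one variable; returns (means, labels)."""
    data = sorted(values)
    n = len(data)
    # Assumption: seed the means at evenly spaced ranks of the sorted data.
    means = [data[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign every value to its nearest current mean.
        new_labels = [min(range(k), key=lambda j: (v - means[j]) ** 2)
                      for v in values]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each mean from its members (keep the old mean if empty).
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                means[j] = sum(members) / len(members)
    return means, labels


data = [2, 3, 5, 8, 9, 10, 11, 18, 20, 22, 25,
        40, 42, 46, 73, 75, 77, 78, 79, 80, 81]
means, labels = kmeans_1d(data, 4)
print(", ".join(f"{m:.2f}" for m in means))
```

For these data the four means come out at about 6.86, 21.25, 42.67 and 77.57, which agree with the rounded cluster means in the example. A different seeding rule can converge to a different local optimum, so results from other implementations may differ.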

Example clustering - data

4 clusters
[Plot: cluster against data.x, axis from 0 to 90]
R-squared = .9912

4 clusters data
Correlation .9956
R-squared = .9912

Out4j

Cluster Number   Original Value   Cluster Mean
1                2                6.9
1                3                6.9
1                5                6.9
1                8                6.9
1                9                6.9
1                10               6.9
1                11               6.9
2                18               21.3
2                20               21.3
2                22               21.3
2                25               21.3
3                40               42.7
3                42               42.7
3                46               42.7
4                73               77.6
4                75               77.6
4                77               77.6
4                78               77.6
4                79               77.6
4                80               77.6
4                81               77.6
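The R-squared and correlation quoted for the 4-cluster solution can be checked directly from the values and cluster memberships listed above. The sketch below is only an illustration in Python (the original analysis was done in SAS), and the variable names are chosen here. It replaces each value with its cluster mean, then computes R-squared as one minus the within-cluster sum of squares over the total sum of squares, together with the ordinary Pearson correlation between the original and fitted values.

```python
from math import sqrt

# Cluster memberships as listed in the Out4j table.
clusters = {
    1: [2, 3, 5, 8, 9, 10, 11],
    2: [18, 20, 22, 25],
    3: [40, 42, 46],
    4: [73, 75, 77, 78, 79, 80, 81],
}

values = [v for members in clusters.values() for v in members]
# Fitted value = unrounded mean of the cluster the value belongs to.
fitted = [sum(m) / len(m) for m in clusters.values() for _ in m]

grand_mean = sum(values) / len(values)
total_ss = sum((v - grand_mean) ** 2 for v in values)
within_ss = sum((v - f) ** 2 for v, f in zip(values, fitted))
r_squared = 1 - within_ss / total_ss

# Pearson correlation between original values and cluster means.
fitted_mean = sum(fitted) / len(fitted)
cov = sum((v - grand_mean) * (f - fitted_mean) for v, f in zip(values, fitted))
corr = cov / sqrt(total_ss * sum((f - fitted_mean) ** 2 for f in fitted))

print(round(r_squared, 4), round(corr, 4))  # prints 0.9912 0.9956
```

The squared correlation and the variance-explained form of R-squared are the same number here because the fitted values are group means of the original variable.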


3 clusters
[Plot: cluster against data.x, axis from 0 to 90]
R-squared = .9609

How many clusters is enough?

Plot R-squared by number of clusters
Sample of 300 observations, Uniform distribution, 11 cluster analyses

What happens if there really aren't any clusters?

Let's try 500 samples

Uniform, 300 obs. per sample
500 samples, 11 clusterings each

Uniform, 1000 obs. per sample
500 samples, 11 clusterings each

Normal, 300 obs. per sample
500 samples, 11 clusterings each

Normal, 1000 obs. per sample
500 samples, 11 clusterings each

Exponential, 300 obs. per sample
500 samples, 11 clusterings each

Distribution of worst sample

Exponential, 1000 obs. per sample
500 samples, 11 clusterings each

So What's with 7 ± 2?

Uniform, 7 ± 2
500 samples, 11 clusterings each

Normal, 7 ± 2
500 samples, 11 clusterings each

Exponential, 7 ± 2
500 samples, 11 clusterings each

Minimum R squared by sample size and distribution
At least 95% of the variance for all

            Exponential        Normal             Uniform
Clusters    300      1000      300      1000      300      1000
5           0.883    0.865     0.877    0.887     0.951    0.956
6           0.920    0.899     0.908    0.922     0.966    0.969
7           0.938    0.926     0.933    0.940     0.975    0.978
8           0.953    0.934     0.945    0.947     0.982    0.982
9           0.966    0.951     0.957    0.957     0.985    0.986
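A table of minimum R-squared values like the one above could be approximated with a simulation along the following lines. This is a sketch under stated assumptions, not the original SAS program: the clustering is a simple Lloyd-style k-means with an assumed seeding rule, "11 clusterings" is taken to mean 2 through 12 clusters, and the distribution parameters (standard normal, exponential with rate 1) are assumptions. The names `r_squared_for_k`, `SAMPLERS` and `minimum_r_squared` are hypothetical. The demonstration run is deliberately tiny, so its numbers will not match the table.

```python
import random


def r_squared_for_k(values, k, max_iter=100):
    """Cluster one variable into k groups and return the R-squared."""
    data = sorted(values)
    n = len(data)
    # Assumption: seed the means at evenly spaced ranks of the sorted data.
    means = [data[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: (v - means[j]) ** 2)
                      for v in data]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(data, labels) if lab == j]
            if members:
                means[j] = sum(members) / len(members)
    grand = sum(data) / n
    total_ss = sum((v - grand) ** 2 for v in data)
    within_ss = sum((v - means[lab]) ** 2 for v, lab in zip(data, labels))
    return 1 - within_ss / total_ss


# Assumed parameterizations of the three distributions.
SAMPLERS = {
    "uniform": lambda rng: rng.random(),
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "exponential": lambda rng: rng.expovariate(1.0),
}


def minimum_r_squared(dist, n_obs, n_samples, ks=range(2, 13), seed=0):
    """Worst (minimum) R-squared across n_samples samples, for each k."""
    rng = random.Random(seed)
    worst = {k: 1.0 for k in ks}
    for _ in range(n_samples):
        sample = [SAMPLERS[dist](rng) for _ in range(n_obs)]
        for k in ks:
            worst[k] = min(worst[k], r_squared_for_k(sample, k))
    return worst


# Small demonstration run; the deck used 500 samples of 300 or 1000 obs.
result = minimum_r_squared("uniform", n_obs=100, n_samples=5, ks=range(2, 8))
for k, r2 in result.items():
    print(k, round(r2, 3))
```

Scaling `n_obs` and `n_samples` up to the study's sizes is possible but slow in pure Python. The fixed `seed` only makes the sketch repeatable.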

Histograms
Equal intervals
Number of observations in each interval

Needle Plot of Cluster Means

Bar chart needs more bars

The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information
George Miller, The Psychological Review, 1956, vol. 63, pp. 81-97

Limits on Categories for Absolute Judgments
Pitch 6
Loudness 5
Visual position 9
Size of a square 5
Hue 8
Name the colors in this slide

And finally, what about the magical number seven? George A. Miller

Miller Quote 1

What about the
seven wonders of the world
seven seas
seven deadly sins
seven daughters of Atlas in the Pleiades
seven ages of man
seven levels of hell
seven primary colors
seven notes of the musical scale
seven days of the week

Miller Quote 2

What about the
seven-point rating scale
seven categories for absolute judgment
seven objects in the span of attention
seven digits in the span of immediate memory

Perhaps there is something deep and profound behind all these sevens, something just calling out for us to discover it.

Miller Quote 3

Miller - close
But I suspect that it is only a pernicious, Pythagorean coincidence.

Coincidence or Nature's Parsimony?
Does our capacity match what's needed for 95% of the variance?
95%? Hmmmm.
confidence intervals
an "A"
19 fingers and toes
970,000 web pages

Larry Hoyle
Policy Research Institute
University of Kansas
[email protected]

Let's begin with a trip back to those thrilling days of yesteryear. 30 years ago, computer-generated maps had limited shading options. Here is an example from the 1972 Kansas Statistical Abstract. The next slide shows how the shading was done.

This detail from the previous map shows how shading was done using overprinted characters. If you're old enough, you may remember mall vendors that would print your picture this way. This allowed shaded maps on interactive devices like teletypes and ASCII terminals.

A decade later, better options were commonly available. Plotting terminals and better plotters allowed shading like this. You can see this heritage in SAS/GRAPH today.

Here is a detail from the previous slide. Shading was done by the spacing between lines.

The characters were all on fixed centers and you couldn't get fine detail in the polygon edges. Each polygon could only hold a small integer number of lines, which were fairly coarse.

This all meant that you couldn't shade proportionally to value; you had to group values and shade the group the same. These are common methods.

Notice that the lowest 100 counties all have the same shading: the lowest interval.

7 groups, each with 15 counties. This exaggerates differences among the 100 low counties.

Here we used cluster analysis to compute 7 means that minimize the within-group variances. Shading is proportional to the group mean. This better represents the subtle differences among the low-density counties. This technique was promoted by the geographer George Jenks.

Here the counties are shaded proportionally to the density. It looks a lot like the clustered map.

Here is an artificial sample of 21 points that we are going to arrange into clusters.

The stars are cluster means; the corresponding points in the cluster share the same color. This is a good fit, explaining over 95% of the variance. If we created a new variable using the value of the cluster mean and then computed the squared correlation between the original and the new variable, it would be .9912, the R-squared shown on the slide.

Three clusters isn't quite as good a fit: only 96% of the variance. Note that one group from the 4-group clustering is split up.

Here we took a sample of 300 observations drawn from the Uniform distribution and did cluster analyses 11 times, with 2 groups, then 3, then 4, then 5, and so on. The benefit of adding extra groups trails off after 5 groups.

Here we repeated our earlier chart over 500 samples and overlaid them all onto one chart. The horizontal value (cluster) has a little random number added to help see the cloud of points.

This is with samples of size 1000.

These samples were from the normal distribution. Note I've changed the vertical scale. The clusters of clusterings are not as tight.

The exponential distribution yields a few samples for which 2 clusters really doesn't do the job.

This was the sample for which 2 clusters really didn't work. One cluster obviously gets chewed up by the extreme value. The remainder is still really skewed and will have quite a bit of internal variance.

Let's draw a line at 95% of the variance.

In every case 9 or fewer clusters explains at least 95% of the variance.

Let's change gears a little here.

When we do a histogram we are dividing the range of the variable into equal intervals.

Here is a plot of a sample as a scatter plot (the vertical is random), as a histogram, and as a needle plot of the 4 cluster means. Note that the two lowest points show up as a separate cluster but have been assimilated in the histogram.

We need a lot more bars to show the outliers.

(hmm, within 7 ± 2)

Also 7 memory chunks, 7 objects in the span of attention
