Seven (plus or minus two) Clusters, A Monte Carlo Study Larry Hoyle, Policy Research Institute, The University of Kansas


  • Seven (plus or minus two) Clusters, A Monte Carlo Study Larry Hoyle, Policy Research Institute, The University of Kansas

  • 1972 Kansas Statistical Abstract

  • Shading by Overprinting

  • Shading by Line Spacing

  • Line Shading Detail

  • What did they have in common?

    Neither method is continuous

    So both methods required grouping or classes

    Fixed number of combinations

    Characters on a fixed grid

    Integer number of lines in the polygon

    Lines are relatively coarse

  • How to Group for Shading

    Equal Intervals

    Equal numbers (quantiles)

    By clusters

    Don't group (unclassed)

  • Population Density - 7 Equal Intervals

    100 counties fall into the bottom class

  • Population Density - Equal Numbers

    15 counties in each class - a very different picture

  • Population Density - Cluster Means

    Group around the 7 values that best represent the data

  • Population Density - Unclassed

    No classes, just shade in proportion to value
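The difference between the first two grouping methods can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the densities are made-up skewed values, not the Kansas county data from the maps.

```python
def equal_interval_breaks(values, k):
    """Upper bounds of k classes that split the data range into equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]


def quantile_breaks(values, k):
    """Upper bounds of k classes holding (as nearly as possible) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    # The i-th class ends at the observation ranked i*n/k (1-based rank).
    return [ordered[(i * n) // k - 1] for i in range(1, k + 1)]


def classify(value, breaks):
    """Index of the first class whose upper bound is >= value."""
    for index, upper in enumerate(breaks):
        if value <= upper:
            return index
    return len(breaks) - 1  # guard against floating-point round-off at the top


# Made-up, strongly skewed densities (NOT the Kansas county data).
densities = [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 40, 120, 700]

ei_breaks = equal_interval_breaks(densities, 7)
q_breaks = quantile_breaks(densities, 7)
ei_counts = [0] * 7
q_counts = [0] * 7
for d in densities:
    ei_counts[classify(d, ei_breaks)] += 1
    q_counts[classify(d, q_breaks)] += 1

print("equal interval counts:", ei_counts)
print("quantile counts:      ", q_counts)
```

With skewed data like this, equal intervals put almost every value into the bottom class, while quantiles force the same count into each class regardless of how far apart the values are.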

  • Clustering

    Tries for "Best" grouping

    Each member of a cluster can be represented by the mean of the group

  • Proc Fastclus

    You specify the number of clusters

    Minimizes the within-cluster sum of squared distances (i.e. minimum within-cluster variance)

    Inspired by: k-means (MacQueen) and the leader algorithm (Hartigan)
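The k-means idea behind this kind of clustering can be sketched for a single variable as follows. This is a plain Lloyd-style loop written for illustration, not the Proc Fastclus implementation; the function name `kmeans_1d` and the evenly spread starting seeds are assumptions of this sketch.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd-style k-means for one variable; returns (means, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed initialization: seeds spread evenly through the sorted data.
    means = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest mean.
        new_labels = [
            min(range(k), key=lambda j: (v - means[j]) ** 2) for v in values
        ]
        if new_labels == labels:
            break  # no reassignments, so the solution has converged
        labels = new_labels
        # Update step: each mean moves to the average of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # leave an empty cluster's mean where it was
                means[j] = sum(members) / len(members)
    return means, labels


# The 21 artificial points used in the example clustering slides.
values = [2, 3, 5, 8, 9, 10, 11, 18, 20, 22, 25,
          40, 42, 46, 73, 75, 77, 78, 79, 80, 81]
means, labels = kmeans_1d(values, 4)
print([f"{m:.2f}" for m in means])
```

Each pass can only lower (or keep) the within-cluster sum of squares, so the loop stops at a local minimum; a different choice of seeds can give a different grouping.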

  • Example clustering - data

  • 4 clusters

    [Plot: y against clusterdata.x, horizontal axis 0 to 90]

    R-squared = .9912

  • 4 clusters data

    Correlation .9956

    R-squared = .9912

    Cluster Number   Original Value   Cluster Mean
    1                 2                6.9
    1                 3                6.9
    1                 5                6.9
    1                 8                6.9
    1                 9                6.9
    1                10                6.9
    1                11                6.9
    2                18               21.3
    2                20               21.3
    2                22               21.3
    2                25               21.3
    3                40               42.7
    3                42               42.7
    3                46               42.7
    4                73               77.6
    4                75               77.6
    4                77               77.6
    4                78               77.6
    4                79               77.6
    4                80               77.6
    4                81               77.6
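The R-squared and correlation quoted for the 4-cluster example can be checked directly from the cluster numbers and original values in the table. The sketch below uses the usual definition, one minus the within-cluster sum of squares over the total sum of squares; the variable names are illustrative.

```python
from statistics import mean

# (cluster number, original value) pairs copied from the 4-cluster table.
points = [
    (1, 2), (1, 3), (1, 5), (1, 8), (1, 9), (1, 10), (1, 11),
    (2, 18), (2, 20), (2, 22), (2, 25),
    (3, 40), (3, 42), (3, 46),
    (4, 73), (4, 75), (4, 77), (4, 78), (4, 79), (4, 80), (4, 81),
]

grand_mean = mean(v for _, v in points)
cluster_ids = sorted({c for c, _ in points})
cluster_means = {c: mean(v for cc, v in points if cc == c) for c in cluster_ids}

# Total variation around the grand mean, and what is left inside clusters.
total_ss = sum((v - grand_mean) ** 2 for _, v in points)
within_ss = sum((v - cluster_means[c]) ** 2 for c, v in points)

r_squared = 1 - within_ss / total_ss
correlation = r_squared ** 0.5  # correlation of value with its cluster mean

print(round(r_squared, 4))    # 0.9912
print(round(correlation, 4))  # 0.9956
```

Replacing every value by its cluster mean and correlating the two columns gives the same number, which is why the slide reports both figures together.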

  • 3 clusters

    [Plot: y against clusterdata.x, horizontal axis 0 to 90]

    R-squared = .9609

  • How many clusters is enough?

  • Plot R-squared by number of clusters

    Sample of 300 observations, Uniform distribution, 11 cluster analyses

  • What happens if there really aren't any clusters?

    Let's try 500 samples

  • Uniform, 300 obs. per sample

    500 samples, 11 clusterings each

  • Uniform, 1000 obs. per sample

    500 samples, 11 clusterings each

  • Normal, 300 obs. per sample

    500 samples, 11 clusterings each

  • Normal, 1000 obs. per sample

    500 samples, 11 clusterings each

  • Exponential, 300 obs. per sample

    500 samples, 11 clusterings each

  • Distribution of worst sample

  • Exponential, 1000 obs. per sample

    500 samples, 11 clusterings each
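The Monte Carlo design on these slides (many samples per distribution, each clustered with 2 through 12 groups) can be sketched in plain Python. This is an assumption-laden illustration, not the study's SAS code: it uses a simple Lloyd-style k-means with evenly spread seeds instead of Proc Fastclus, fewer samples so that it runs quickly, and hypothetical function names. Its numbers will therefore not reproduce the study's results exactly.

```python
import bisect
import random
from itertools import accumulate


def r_squared_for_k(sorted_values, k, max_iter=200):
    """R-squared of a 1-D k-means solution on data that is already sorted."""
    n = len(sorted_values)
    prefix = [0.0] + list(accumulate(sorted_values))  # running sums
    prefix_sq = [0.0] + list(accumulate(v * v for v in sorted_values))
    # Assumed initialization: seeds spread evenly through the sorted data.
    means = [sorted_values[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    cuts = None
    for _ in range(max_iter):
        # In one dimension each cluster is an interval whose edges are the
        # midpoints between neighbouring means.
        mids = [(a + b) / 2 for a, b in zip(means, means[1:])]
        new_cuts = [0] + [bisect.bisect_right(sorted_values, m) for m in mids] + [n]
        if new_cuts == cuts:
            break  # assignments unchanged: converged
        cuts = new_cuts
        for j in range(k):
            lo, hi = cuts[j], cuts[j + 1]
            if hi > lo:  # an empty cluster keeps its previous mean
                means[j] = (prefix[hi] - prefix[lo]) / (hi - lo)
    within = 0.0
    for j in range(k):
        lo, hi = cuts[j], cuts[j + 1]
        if hi > lo:
            s = prefix[hi] - prefix[lo]
            within += (prefix_sq[hi] - prefix_sq[lo]) - s * s / (hi - lo)
    total = prefix_sq[n] - prefix[n] ** 2 / n
    return 1 - within / total


def minimum_r_squared(draw, n_obs, n_samples, cluster_counts, seed=0):
    """Worst (minimum) R-squared over all samples, for each cluster count."""
    rng = random.Random(seed)
    worst = {k: 1.0 for k in cluster_counts}
    for _ in range(n_samples):
        sample = sorted(draw(rng) for _ in range(n_obs))
        for k in cluster_counts:
            worst[k] = min(worst[k], r_squared_for_k(sample, k))
    return worst


distributions = {
    "uniform": lambda rng: rng.random(),
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "exponential": lambda rng: rng.expovariate(1.0),
}

for name, draw in distributions.items():
    # 50 samples here rather than the study's 500, purely to keep it fast.
    worst = minimum_r_squared(draw, n_obs=300, n_samples=50,
                              cluster_counts=range(2, 13))
    print(name, {k: round(v, 3) for k, v in worst.items()})
```

Because the data are sorted once per sample, each k-means pass only has to move the cut points between clusters, which keeps the repeated clusterings cheap even without a numerical library.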

  • So What's with 7 ± 2?

  • Uniform, 7 ± 2

    500 samples, 11 clusterings each

  • Normal, 7 ± 2

    500 samples, 11 clusterings each

  • Exponential, 7 ± 2

    500 samples, 11 clusterings each

  • Minimum R squared by sample size and distribution

    At least 95% of the variance for all

               Exponential       Normal            Uniform
    Clusters   300      1000     300      1000     300      1000
    5          0.883    0.865    0.877    0.887    0.951    0.956
    6          0.920    0.899    0.908    0.922    0.966    0.969
    7          0.938    0.926    0.933    0.940    0.975    0.978
    8          0.953    0.934    0.945    0.947    0.982    0.982
    9          0.966    0.951    0.957    0.957    0.985    0.986
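The claim that every case reaches at least 95% of the variance can be read off the minimum R-squared table mechanically. The short sketch below copies the table's values and finds, for each distribution and sample size, the smallest number of clusters whose minimum R-squared is at least .95; the names are illustrative.

```python
# Minimum R-squared values copied from the table, keyed by number of clusters.
# Column order: Exponential 300, Exponential 1000, Normal 300, Normal 1000,
# Uniform 300, Uniform 1000.
columns = [
    ("Exponential", 300), ("Exponential", 1000),
    ("Normal", 300), ("Normal", 1000),
    ("Uniform", 300), ("Uniform", 1000),
]
min_r_squared = {
    5: [0.883, 0.865, 0.877, 0.887, 0.951, 0.956],
    6: [0.920, 0.899, 0.908, 0.922, 0.966, 0.969],
    7: [0.938, 0.926, 0.933, 0.940, 0.975, 0.978],
    8: [0.953, 0.934, 0.945, 0.947, 0.982, 0.982],
    9: [0.966, 0.951, 0.957, 0.957, 0.985, 0.986],
}
THRESHOLD = 0.95

clusters_needed = {}
for index, column in enumerate(columns):
    # Smallest tabulated cluster count that clears the threshold in this column.
    clusters_needed[column] = min(
        k for k, row in min_r_squared.items() if row[index] >= THRESHOLD
    )

for column, k in clusters_needed.items():
    print(column, k)
print("range:", min(clusters_needed.values()), "to", max(clusters_needed.values()))
```

For these values the answer runs from 5 clusters (Uniform) to 9 clusters (Normal, and Exponential with 1000 observations), which is exactly the 7 ± 2 range.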

  • Histograms

    Equal intervals

    Number of observations in each interval

  • Needle Plot of Cluster Means

  • Bar chart needs more bars

  • The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information

    George A. Miller, The Psychological Review, 1956, vol. 63, pp. 81-97

  • Limits on Categories for Absolute Judgments

    Pitch 6

    Loudness 5

    Visual position 9

    Size of a square 5

    Hue 8

    Name the colors in this slide

  • And finally, what about the magical number seven? George A. Miller

  • Miller Quote 1

    What about the

    seven wonders of the world

    seven seas

    seven deadly sins

    seven daughters of Atlas in the Pleiades

    seven ages of man

    seven levels of hell

    seven primary colors

    seven notes of the musical scale

    seven days of the week

  • Miller Quote 2

    What about the

    seven-point rating scale

    seven categories for absolute judgment

    seven objects in the span of attention

    seven digits in the span of immediate memory

  • Perhaps there is something deep and profound behind all these sevens, something just calling out for us to discover it.

    Miller Quote 3

  • Miller - close

    But I suspect that it is only a pernicious, Pythagorean coincidence.

  • Coincidence or Nature's Parsimony?

    Does our capacity match what's needed for 95% of the variance?

    95%? Hmmmm.

    confidence intervals

    an A

    19 fingers and toes

    970,000 web pages

    Larry Hoyle

    Policy Research Institute

    University of Kansas

    [email protected]

    Let's begin with a trip back to those thrilling days of yesteryear. 30 years ago computer-generated maps had limited shading options. Here is an example from the 1972 Kansas Statistical Abstract. The next slide shows how the shading was done.

    This detail from the previous map shows how shading was done using overprinted characters. If you're old enough you may remember mall vendors that would print your picture this way. This allowed shaded maps on interactive devices like teletypes and ASCII terminals.

    A decade later, better options were commonly available. Plotting terminals and better plotters allowed shading like this. You can see this heritage in SAS/GRAPH today.

    Here is a detail from the previous slide. Shading was done by the spacing between lines.

    The characters were all on fixed centers and you couldn't get fine detail in the polygon edges. Each polygon could only hold a small integer number of lines, which were fairly coarse.

    This all meant that you couldn't shade proportionally to value; you had to group values and shade the whole group the same. These are common methods.

    Notice that the lowest 100 counties all have the same shading: the lowest interval.

    7 groups, each with 15 counties. This exaggerates differences among the 100 low counties.

    Here we used cluster analysis to compute 7 means that minimize the within-group variances. Shading is proportional to the group mean. This better represents the subtle differences among the low-density counties. The technique was promoted by the geographer George Jenks.

    Here the counties are shaded proportionally to the density. It looks a lot like the clustered map.

    Here is an artificial sample of 21 points that we are going to arrange into clusters.

    The stars are cluster means; the corresponding points in the cluster share the same color. This is a good fit, explaining over 99% of the variance. If we created a new variable using the value of the cluster mean and then computed the squared correlation between the original and the new variable, it would be .9912.

    Three clusters isn't quite as good a fit: only about 96% of the variance. Note that one group from the 4-group clustering is split up.

    Here we took a sample of 200 observations drawn from the Uniform distribution and did cluster analyses 11 times, with 2 groups, then 3, then 4, then 5 and so on. The benefit of adding extra groups trails off after 5 groups.

    Here we repeated our earlier chart over 500 samples and overlaid them all onto one chart. The horizontal value (number of clusters) has a little random number added to help see the cloud of points.

    This is with samples of size 1000.

    These samples were from the normal distribution. Note I've changed the vertical scale. The clusters of clusterings are not as tight.

    The exponential distribution yields a few samples for which 2 clusters really doesn't do the job.

    This was the sample for which 2 clusters really didn't work. One cluster obviously gets chewed up by the extreme value. The remainder is still really skewed and will have quite a bit of internal variance.

    Let's draw a line at 95% of the variance.

    In every case 9 or fewer clusters explain at least 95% of the variance.

    Let's change gears a little here. When we do a histogram we are dividing the range of the variable into equal intervals.

    Here is a plot of a sample as a scatter plot (the vertical is random), as a histogram, and as a needle plot of the 4 cluster means. Note that the two lowest points show up as a separate cluster but have been assimilated in the histogram.

    We need a lot more bars to show the outliers. (Hmm, within 7 ± 2.)

    Also 7 memory chunks, 7 objects in the span of attention.