Clustering and Segmentation

TRANSCRIPT

Page 1: Clustering and segmentation

CLUSTERING AND SEGMENTATION
MIS2502 Data Analytics

Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining. http://www-users.cs.umn.edu/~kumar/dmbook/

Page 2: Clustering and segmentation

What is Cluster Analysis?

• Grouping data so that elements in a group will be
  • Similar (or related) to one another
  • Different (or unrelated) from elements in other groups


Distance within clusters is minimized.
Distance between clusters is maximized.
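To make those two objectives concrete, here is a minimal Python sketch (the points and cluster labels are invented for illustration) that measures average distance within clusters and the distance between their centroids:

```python
import numpy as np

# Two made-up clusters of 2-D points (illustrative data)
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                   [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Centroid of each cluster = mean of its points
centroids = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])

# Within-cluster: average distance from each point to its own centroid (small = cohesive)
within = np.mean([np.linalg.norm(p - centroids[l]) for p, l in zip(points, labels)])

# Between-cluster: distance between the two centroids (large = well separated)
between = np.linalg.norm(centroids[0] - centroids[1])

print(f"average within-cluster distance: {within:.2f}")
print(f"between-centroid distance:       {between:.2f}")
```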

Page 3: Clustering and segmentation

Applications

Understanding
• Group related documents for browsing
• Create groups of similar customers
• Discover which stocks have similar price fluctuations

Summarization
• Reduce the size of large data sets
• Those similar groups can be treated as a single data point

Page 4: Clustering and segmentation

Even more examples

Marketing
• Discover distinct customer groups for targeted promotions

Insurance
• Find “good customers” (low claim costs, reliable premium payments)

Healthcare
• Find patients with high-risk behaviors

Page 5: Clustering and segmentation

What cluster analysis is NOT

Manual (“supervised”) classification
• People simply place items into categories

Simple segmentation
• Dividing students into groups by last name

Main idea: the clusters must come from the data, not from external specifications. Creating the “buckets” beforehand is categorization, not clustering.

Page 6: Clustering and segmentation

Clusters can be ambiguous

The difference is the threshold you set. How distinct must a cluster be to be its own cluster?

How many clusters? Depending on the threshold, the same points could be grouped into two, four, or six clusters.

Page 7: Clustering and segmentation

Two clustering techniques

Partitional
• Non-overlapping subsets (clusters) such that each data object is in exactly one subset

Hierarchical
• A set of nested clusters organized as a hierarchical tree

Page 8: Clustering and segmentation

Partitional Clustering

Three distinct groups emerge, but…
…some curveballs behave more like splitters.
…some splitters look more like fastballs.

Page 9: Clustering and segmentation

Hierarchical Clustering

[Figure: five points p1–p5 grouped into nested clusters, with the corresponding dendrogram over p1, p2, p3, p4, p5]

This is a dendrogram: a tree diagram used to represent clusters.
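A dendrogram like the one on this slide can be produced with SciPy's hierarchical-clustering utilities. The coordinates for p1–p5 below are invented, since the slide only shows labels; treat this as a sketch, not the original figure:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Invented 2-D coordinates for the five labeled points
points = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.5, 2.0]])

# Agglomerative clustering: repeatedly merge the two closest clusters
Z = linkage(points, method="average")

# The tree of merges is exactly what the dendrogram draws
dendrogram(Z, labels=["p1", "p2", "p3", "p4", "p5"])
plt.ylabel("merge distance")
plt.show()
```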


Page 11: Clustering and segmentation

K-means (partitional)

The K-means algorithm is one method for doing partitional clustering:

1. Choose K, the number of clusters
2. Select K points as initial centroids
3. Assign all points to clusters based on distance
4. Recompute the centroid of each cluster
5. Did any centroid change? Yes: repeat from step 3. No: done!
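That loop translates almost line-for-line into code. A minimal NumPy sketch, not a production implementation (real libraries also handle empty clusters more carefully, do multiple restarts, and use smarter initialization):

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Select K points as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign all points to clusters based on distance to each centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centroid of each cluster (keep the old one if a cluster is empty)
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Did any centroid change? If not, we're done
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example: cluster 30 random 2-D points into K = 3 groups
centroids, labels = kmeans(np.random.default_rng(1).normal(size=(30, 2)), k=3)
```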

Page 12: Clustering and segmentation

K-Means Demonstration

Here is the initial data set

Page 13: Clustering and segmentation

K-Means Demonstration

Choose K points as initial centroids

Page 14: Clustering and segmentation

K-Means Demonstration

Assign data points according to distance

Page 15: Clustering and segmentation

K-Means Demonstration

And re-assign the points

Page 16: Clustering and segmentation

K-Means Demonstration

Recalculate the centroids

Page 17: Clustering and segmentation

K-Means Demonstration

And keep doing that until you settle on a final set of clusters

Page 18: Clustering and segmentation

Choosing the initial centroids

It matters:
• Choosing the right number
• Choosing the right initial location

Bad choices create bad groupings:
• The clusters won’t make sense within the context of the problem
• Unrelated data points will be included in the same group

Page 19: Clustering and segmentation

Example of Poor Initialization

This may “work” mathematically, but the clusters don’t make much sense.

Page 20: Clustering and segmentation

Evaluating K-Means Clusters

• On the previous slide we did it visually, but there is a mathematical test
• Sum-of-Squares Error (SSE)
  • Based on the distance from each point to its nearest cluster center
  • How close does each point get to the center?
• This just means
  • In each cluster, compute the distance from each point (x) to the cluster center (m)
  • Square that distance (so sign isn’t an issue)
  • Add them all together

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$
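In code, the formula is one vectorized line. This sketch assumes the array shapes from the K-means sketch earlier: points is n×d, labels holds each point's cluster index, and centroids is K×d:

```python
import numpy as np

def sse(points, labels, centroids):
    # For each point, the difference from its own cluster's centroid...
    diffs = points - centroids[labels]
    # ...squared and summed over all points and dimensions
    return float((diffs ** 2).sum())
```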

Page 21: Clustering and segmentation

Example: Evaluating Clusters

Cluster 1: the three points sit at distances 1, 1.3, and 2 from the centroid.
Cluster 2: the three points sit at distances 3, 3.3, and 1.5 from the centroid.

SSE₁ = 1² + 1.3² + 2² = 1 + 1.69 + 4 = 6.69
SSE₂ = 3² + 3.3² + 1.5² = 9 + 10.89 + 2.25 = 22.14

• Lower individual cluster SSE = a better cluster
• Lower total SSE = a better set of clusters
• Keep in mind that adding more clusters will always reduce SSE

Considerations: reducing SSE within a cluster increases cohesion (we want that).

Page 22: Clustering and segmentation

Choosing the best initial centroids

• There’s no single best way to choose initial centroids
• So what do you do?
  • Do multiple runs and keep the best result
  • Use a sample set of data first, then apply the centroids to your main data set
  • Select more centroids than you need to start with, then keep the ones that are farthest apart (because those are the most distinct)
  • Pre- and post-process the data
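The “select more centroids, keep the farthest apart” idea can be sketched as a greedy farthest-point pass. This is a simplification (K-means++ is a related, more principled initialization), and the starting candidate is arbitrary:

```python
import numpy as np

def farthest_apart(candidates, k):
    """Greedily keep k candidate centroids that are far from each other."""
    chosen = [candidates[0]]  # start from an arbitrary candidate
    while len(chosen) < k:
        # Each candidate's distance to its nearest already-chosen centroid
        d = np.linalg.norm(candidates[:, None, :] - np.array(chosen)[None, :, :],
                           axis=2).min(axis=1)
        # Keep the candidate farthest from everything chosen so far (the most distinct)
        chosen.append(candidates[d.argmax()])
    return np.array(chosen)
```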

Page 23: Clustering and segmentation

Pre-processing: Getting the right centroids

“Pre” = get the data ready for analysis

• Normalize the data
  • Reduces the dispersion of the data points by re-scaling the values that go into the distance computation
  • Rationale: preserves differences while dampening the effect of outliers
• Remove outliers
  • Reduces the dispersion of the data points by removing the atypical data
  • Rationale: they don’t represent the population anyway
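The slide doesn't say which normalization to use; a common choice is the z-score, which rescales each attribute to mean 0 and standard deviation 1 so that no single attribute dominates the distance computation. A sketch:

```python
import numpy as np

def zscore(points):
    # Rescale each column (attribute) to mean 0, standard deviation 1
    return (points - points.mean(axis=0)) / points.std(axis=0)

# Without normalization, income (in dollars) would swamp age (in years)
data = np.array([[25, 40_000.0], [60, 95_000.0], [33, 52_000.0]])
print(zscore(data))
```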

Page 24: Clustering and segmentation

Post-processing: Getting the right centroids

“Post” = interpreting the results of the clustering analysis

• Remove small clusters (they may be outliers)
• Split loose clusters (ones with high SSE that look like they are really two different groups)
• Merge clusters with relatively low SSE that are “close” together
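One heuristic way to operationalize the “merge clusters that are close” rule is a centroid-distance threshold. This is purely illustrative; the threshold is made up and would depend on the data's scale:

```python
import numpy as np

def merge_candidates(centroids, threshold):
    """Return pairs of cluster indices whose centroids lie within `threshold`
    of each other -- candidates for merging during post-processing."""
    pairs = []
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if np.linalg.norm(centroids[i] - centroids[j]) < threshold:
                pairs.append((i, j))
    return pairs
```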

Page 25: Clustering and segmentation

Limitations of K-Means Clustering

K-Means gives unreliable results when:
• Clusters vary widely in size
• Clusters vary widely in density
• Clusters are not rounded in shape
• The data set has a lot of outliers

The clusters may never make sense. In that case, the data may just not be well-suited for clustering!

Page 26: Clustering and segmentation

Similarity between clusters (inter-cluster)

• Most common: distance between centroids
• Can also use SSE
  • Look at the distance between cluster 1’s points and the other clusters’ centroids
  • You’d want to maximize SSE between clusters

[Figure: Cluster 1 vs. Cluster 5]

Increasing SSE across clusters increases separation (we want that).
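The between-cluster use of SSE can be sketched the same way as within-cluster SSE, just measured against another cluster's centroid (array shapes follow the earlier K-means sketch; illustrative only):

```python
import numpy as np

def between_sse(points, labels, centroids, i, j):
    """Sum of squared distances from cluster i's points to cluster j's centroid.
    Larger values mean cluster i sits farther from cluster j (more separation)."""
    cluster_points = points[labels == i]
    return float(((cluster_points - centroids[j]) ** 2).sum())
```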

Page 27: Clustering and segmentation

Figuring out if our clusters are good

• “Good” means
  • Meaningful
  • Useful
  • Provides insight
• The pitfalls
  • Poor clusters reveal incorrect associations
  • Poor clusters reveal inconclusive associations
  • There might be room for improvement and we can’t tell
• This is somewhat subjective and depends upon the expectations of the analyst

Page 28: Clustering and segmentation

Cluster validity assessment

• External: Do the clusters confirm predefined labels? (e.g., entropy)
• Internal: How well-formed are the clusters? (e.g., SSE or correlation)
• Relative: How well does one clustering algorithm compare to another? (e.g., compare SSEs)
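For the external criterion, one common formulation of entropy is computed per cluster over the predefined class labels: 0 means the cluster contains a single class. A sketch (the toy labels are invented):

```python
import numpy as np
from collections import Counter

def cluster_entropy(true_labels):
    """Entropy of the predefined class labels within one cluster (0 = pure)."""
    counts = np.array(list(Counter(true_labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(cluster_entropy(["a", "a", "a"]))       # 0.0 -> pure cluster
print(cluster_entropy(["a", "b", "a", "b"]))  # 1.0 -> thoroughly mixed
```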

Page 29: Clustering and segmentation

The Keys to Successful Clustering

• We want high cohesion within clusters (minimize differences)
  • Low within-cluster SSE, high correlation
• And high separation between clusters (maximize differences)
  • High between-cluster SSE, low correlation
• We need to choose the right number of clusters
• We need to choose the right initial centroids
• But there’s no easy way to do this
  • It takes a combination of trial-and-error, knowledge of the problem, and looking at the output

In SAS, cohesion is measured by root mean square standard deviation, and separation by the distance to the nearest cluster.