Jeremy Tantrum, Department of Statistics, University of Washington joint work with Alejandro Murua & Werner Stuetzle Insightful Corporation This work has been supported by NSA grant 62-1942 Hierarchical Model- Based Clustering of Large Datasets Through Fractionation and Refractionation

Post on 21-Dec-2015


Page 1

Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation

Jeremy Tantrum, Department of Statistics, University of Washington

joint work with Alejandro Murua (Insightful Corporation) & Werner Stuetzle (University of Washington)

This work has been supported by NSA grant 62-1942

Page 2

Motivating Example

• Consider clustering documents
• Topic Detection and Tracking corpus
  – 15,863 news stories for one year from Reuters and CNN
  – 25,000 unique words
  – Possibly many topics
• Large numbers of observations
• High dimensions
• Many groups

Page 3

Goal of Clustering

[Figure: plot of example data]

• Detect that there are 5 or 6 groups
• Assign observations to groups

Page 4

Nonparametric Clustering

Premise:
• Observations are sampled from a density p(x)
• Groups correspond to modes of p(x)

[Figure: density plot over x from -10 to 10, with a rug plot of the observations]

Page 5

Nonparametric Clustering

Fitting: Estimate p(x) nonparametrically and find significant modes of the estimate

[Figure: density estimate over x from -10 to 10, with a rug plot of the observations]
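The fitting step above can be sketched in one dimension as follows. This is a minimal illustration, not the method used in the talk: the Gaussian kernel, the fixed bandwidth, the grid search and the toy data are all assumptions, and no significance test for the modes is attempted.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function x -> kernel density estimate with Gaussian kernels."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

def find_modes(density, lo, hi, steps=400):
    """Return the grid points that are strict local maxima of the density."""
    xs = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    ys = [density(x) for x in xs]
    return [xs[k] for k in range(1, steps) if ys[k - 1] < ys[k] > ys[k + 1]]

# Toy data (made up): two well-separated groups on the line.
data = [-5.2, -4.9, -5.0, -4.7, -5.3, 4.8, 5.1, 5.0, 5.3, 4.6]
modes = find_modes(gaussian_kde(data, bandwidth=1.0), -10.0, 10.0)
print(modes)  # one mode per group, close to the group means
```

Each observation would then be assigned to the mode it climbs to; the choice of bandwidth controls how many modes appear.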

Page 6

Model Based Clustering

Premise:
• Observations are sampled from a mixture density $p(x) = \sum_g \pi_g \, p_g(x)$
• Groups correspond to mixture components

[Figure: density plot over x from -10 to 10, with a rug plot of the observations]

Page 7

Model Based Clustering

Fitting: Estimate the mixing proportions $\pi_g$ and the parameters of $p_g(x)$

[Figure: density plot over x from -10 to 10, with a rug plot of the observations]

Page 8

Model Based Clustering

Fitting a Mixture of Gaussians

• Use the EM algorithm to maximize the log likelihood
  – Estimates the probabilities of each observation belonging to each group
  – Maximizes the likelihood given these probabilities
  – Requires a good starting point
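The two EM steps listed above can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration under assumptions not in the slides: the data, the starting values, the fixed iteration count and the small variance floor are all made up for the example.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gaussian_mixture(data, means, variances, weights, n_iter=50):
    """Run EM for a 1-D Gaussian mixture from the given starting values."""
    means, variances, weights = list(means), list(variances), list(weights)
    n, k = len(data), len(means)
    for _ in range(n_iter):
        # E step: probability of each observation belonging to each group.
        resp = []
        for x in data:
            dens = [weights[g] * normal_pdf(x, means[g], variances[g]) for g in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: maximize the likelihood given these probabilities.
        for g in range(k):
            ng = sum(r[g] for r in resp)
            weights[g] = ng / n
            means[g] = sum(r[g] * x for r, x in zip(resp, data)) / ng
            var = sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, data)) / ng
            variances[g] = max(var, 1e-6)  # floor guards against a collapsing component
    return means, variances, weights

# Toy data (made up) and a deliberately rough starting point.
data = [-5.2, -4.9, -5.0, -4.7, -5.3, 4.8, 5.1, 5.0, 5.3, 4.6]
fit_means, fit_vars, fit_weights = em_gaussian_mixture(
    data, means=[-1.0, 1.0], variances=[4.0, 4.0], weights=[0.5, 0.5])
print(fit_means, fit_weights)
```

With well-separated groups even a rough start converges; with overlapping groups the result depends much more on the starting values, which is why the slides stress a good starting point.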

Page 9

Model Based Clustering

Hierarchical Clustering
• Provides a good starting point for the EM algorithm
• Start with every point being its own cluster
• Merge the two closest clusters
  – Closeness is measured by the decrease in likelihood when those two clusters are merged
  – Uses the classification likelihood, not the mixture likelihood
• Algorithm is quadratic in the number of observations

Page 10

Likelihood Distance

[Figure: two panels, each showing two clusters with fitted densities p1(x) and p2(x) and the density p(x) fitted to the merged cluster, with rug plots of the observations.
• Panel 1: Merge gives small decrease in likelihood
• Panel 2: Merge gives big decrease in likelihood]
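The likelihood distance can be made concrete with a small one-dimensional sketch. It assumes each cluster is modelled by a single Gaussian with its own maximum likelihood mean and variance; the function names and the toy clusters are illustrative, not taken from the talk.

```python
import math

def max_loglik(points):
    """Maximized Gaussian log likelihood of one 1-D cluster (ML mean and variance).

    Needs at least two distinct points so that the variance is positive.
    """
    n = len(points)
    mean = sum(points) / n
    var = sum((x - mean) ** 2 for x in points) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def merge_cost(a, b):
    """Decrease in classification log likelihood when clusters a and b are merged."""
    return max_loglik(a) + max_loglik(b) - max_loglik(a + b)

# Toy clusters (made up): one overlapping neighbour and one distant cluster.
a = [0.0, 0.5, 1.0, 1.5]
b_near = [0.8, 1.2, 1.7, 2.1]
b_far = [10.0, 10.5, 11.0, 11.5]
cost_near = merge_cost(a, b_near)
cost_far = merge_cost(a, b_far)
print(cost_near, cost_far)  # the overlapping pair is far cheaper to merge
```

A hierarchical pass would repeatedly merge the pair with the smallest cost. Singleton clusters have zero variance, so a practical implementation needs some regularization that this sketch omits.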

Page 11

Bayesian Information Criterion

• Choose number of clusters by maximizing the Bayesian Information Criterion
  – r is the number of parameters
  – n is the number of observations
• Log likelihood penalized for complexity
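The formula itself did not survive in this transcript. A common "larger is better" form, assumed here, is BIC = 2 log L - r log n, where L is the maximized likelihood. The sketch below uses that assumed form; the candidate fits are invented numbers for illustration only.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """BIC in the assumed 'larger is better' form: 2 log L - r log n."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)

def choose_num_clusters(candidates, n_obs):
    """candidates maps number of clusters -> (maximized log likelihood, r)."""
    return max(candidates, key=lambda g: bic(candidates[g][0], candidates[g][1], n_obs))

# Hypothetical fits of 1-D mixtures to 400 observations (numbers are made up).
fits = {1: (-950.0, 2), 2: (-820.0, 5), 3: (-815.0, 8)}
best = choose_num_clusters(fits, n_obs=400)
print(best)
```

Here the third component raises the log likelihood by only 5 while adding three parameters, so the penalty term outweighs the gain.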

Page 12

Fractionation

Invented by Cutting, Karger, Pedersen and Tukey for nonparametric clustering of large datasets.

M is the largest number of observations for which a hierarchical $O(M^2)$ algorithm is computationally feasible.

• Original data – size n
• Split into n/M fractions of size M
• Partition each fraction into $\alpha M$ clusters, $\alpha < 1$
• This gives $\alpha n$ clusters (meta-observations, $\mu_i$)
• If $\alpha n > M$, repeat with the meta-observations as the data

Page 13

Fractionation

– $\alpha n$ meta-observations after the first round
– $\alpha^2 n$ meta-observations after the second round
– $\alpha^i n$ meta-observations after the $i$th round

• For the $i$th pass, we have $\alpha^{i-1} n / M$ fractions taking $O(M^2)$ operations each

• Total number of operations is:

  $\sum_{i \ge 1} \alpha^{i-1} \frac{n}{M} \, O(M^2) = O\!\left(\frac{nM}{1-\alpha}\right)$

• Total running time is linear in n!
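The linear running time can be checked with a small cost count. This sketch only tallies work and clusters nothing; it assumes each fraction is charged M^2 operations, that a fraction of size M is reduced to a proportion α of its size (written `alpha` below), and that the surviving meta-observations get one final hierarchical pass. The function name and parameter values are illustrative.

```python
def fractionation_cost(n, m, alpha):
    """Count operations over all passes, charging m**2 per fraction."""
    assert 0.0 < alpha < 1.0 and m >= 1
    total, size, passes = 0, n, 0
    while size > m:
        fractions = -(-size // m)            # ceil(size / m)
        total += fractions * m * m           # one O(M^2) clustering per fraction
        size = max(1, int(alpha * size))     # meta-observations left after this pass
        passes += 1
    total += size * size                     # final clustering of what remains
    return total, passes

cost_small = fractionation_cost(100_000, 1000, 0.1)
cost_large = fractionation_cost(200_000, 1000, 0.1)
print(cost_small, cost_large)
```

Doubling n roughly doubles the count, in line with the geometric-series bound of about nM / (1 - α).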

Page 14

Model Based Fractionation

• Use model based clustering
• Meta-observations contain all sufficient statistics: $(n_i, \mu_i, \Sigma_i)$
  – $n_i$ is the number of observations (size)
  – $\mu_i$ is the mean (location)
  – $\Sigma_i$ is the covariance matrix (shape and volume)
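Because a meta-observation carries its size, mean and covariance, two of them can be merged without revisiting the raw data. The sketch below uses the standard pooling identity for maximum likelihood (divide-by-n) covariances; the function names and the toy points are assumptions made for illustration.

```python
def summarize(points):
    """Reduce a list of d-dimensional points to (n, mean, ML covariance)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    cov = [[sum((p[j] - mean[j]) * (p[k] - mean[k]) for p in points) / n
            for k in range(d)] for j in range(d)]
    return n, mean, cov

def merge_meta(a, b):
    """Combine two meta-observations into the statistics of their union."""
    na, ma, ca = a
    nb, mb, cb = b
    n, d = na + nb, len(ma)
    mean = [(na * ma[j] + nb * mb[j]) / n for j in range(d)]
    diff = [ma[j] - mb[j] for j in range(d)]
    # Pooled within-cluster covariance plus the between-means term.
    cov = [[(na * ca[j][k] + nb * cb[j][k]) / n + na * nb * diff[j] * diff[k] / (n * n)
            for k in range(d)] for j in range(d)]
    return n, mean, cov

# Toy points (made up).
pts_a = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
pts_b = [(4.0, 4.0), (6.0, 4.0)]
merged = merge_meta(summarize(pts_a), summarize(pts_b))
direct = summarize(pts_a + pts_b)
print(merged)
```

The merged statistics match those computed directly from the pooled points, which is what lets later passes operate on meta-observations alone.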

Page 15

Model Based Fractionation

An example, 400 observations in 4 groups

[Figure sequence, all panels on the same axes:
• Observations in the first fraction
• 10 meta-observations from the first fraction
• 10 meta-observations from the second fraction
• 10 meta-observations from the third fraction
• 10 meta-observations from the fourth fraction
• The 40 meta-observations
• The final clusters chosen by BIC]

Success!

Page 16

Example 2

The data – 400 observations in 25 groups

[Figure sequence, all panels on axes running from 1 to 5:
• Observations in fraction 1
• 10 meta-observations from the first fraction
• 10 meta-observations from the second fraction
• 10 meta-observations from the third fraction
• 10 meta-observations from the fourth fraction
• The 40 meta-observations
• The clusters chosen by BIC]

Fractionation fails!

Page 17

Refractionation

Problem:
• If the number of meta-observations generated from a fraction is less than the number of groups in that fraction then two or more groups will be merged.
• Once observations from two groups are merged they can never be split again.

Solution:
• Apply fractionation repeatedly.
• Use meta-observations from the previous pass of fractionation to create “better” fractions.
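The control flow of the solution can be sketched as a skeleton with the two clustering steps passed in as functions. Everything below is a hypothetical illustration: `cluster_fraction` and `group_meta` stand in for the model-based steps, and the toy stand-ins (pairing sorted numbers, grouping clusters by mean) are far cruder than anything described in the talk.

```python
def refractionate(data, m, n_fractions, n_passes, cluster_fraction, group_meta):
    """Repeated fractionation; returns the clusters from the last pass.

    cluster_fraction(points) -> list of clusters (each a list of points).
    group_meta(clusters, k)  -> list of k groups of clusters (the new fractions).
    """
    # Pass 1 starts from arbitrary fractions of size m.
    fractions = [data[i:i + m] for i in range(0, len(data), m)]
    meta = []
    for pass_no in range(n_passes):
        meta = [c for frac in fractions for c in cluster_fraction(frac)]
        if pass_no < n_passes - 1:
            # Pool the observations behind each group of meta-observations.
            groups = group_meta(meta, n_fractions)
            fractions = [[p for c in g for p in c] for g in groups]
    return meta

# Toy stand-ins on 1-D data (assumptions for illustration only).
def pair_up(points):
    s = sorted(points)
    return [s[i:i + 2] for i in range(0, len(s), 2)]

def group_by_mean(clusters, k):
    ordered = sorted(clusters, key=lambda c: sum(c) / len(c))
    size = -(-len(ordered) // k)  # ceil(len / k)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

data = [0, 8, 4, 12, 1, 9, 5, 13, 2, 10, 6, 14, 3, 11, 7, 15]
one_pass = refractionate(data, 4, 4, 1, pair_up, group_by_mean)
two_pass = refractionate(data, 4, 4, 2, pair_up, group_by_mean)
print(one_pass)
print(two_pass)
```

In this toy run the first pass pairs points that are 4 apart because each arbitrary fraction mixes distant values; after regrouping, the second pass pairs adjacent values, which is the sense in which the new fractions are better.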

Page 18

Example 2 Continued

[Figure sequence, all panels on axes running from 1 to 5:
• The 40 meta-observations
• 4 new clusters
• 4 new fractions]

Page 19

Example 2 – Pass 2

[Figure sequence, all panels on axes running from 1 to 5:
• Observations in the new fraction 1
• Clusters from the first fraction
• Clusters from the second fraction
• Clusters from the third fraction
• Clusters from the fourth fraction
• The 40 meta-observations
• Clusters chosen by BIC]

Page 20

Example 2 – Pass 3

[Figure sequence, all panels on axes running from 1 to 5:
• The 40 meta-observations of pass 2 of fractionation
• 4 new clusters
• 4 new fractions
• Observations in the new fraction 1
• Clusters from the first fraction
• Clusters from the second fraction
• Clusters from the third fraction
• Clusters from the fourth fraction
• The 40 meta-observations
• Clusters chosen by BIC]

Refractionation Succeeds

Page 21

Realistic Example

• 1100 documents from the TDT corpus partitioned by people into 19 topics
  – Transformed into 50 dimensional space using Latent Semantic Indexing

[Figure: projection of the data onto a plane; colors represent topics]

Page 22

Realistic Example

Want to create a dataset with more observations and more groups

Idea: Replace each group with a scaled and transformed version of the entire data set.


Page 24

Realistic Example

To measure similarity of clusters to groups: Fowlkes–Mallows index
• Geometric average of:
  – Probability of 2 randomly chosen observations from the same cluster being in the same group
  – Probability of 2 randomly chosen observations from the same group being in the same cluster
• Fowlkes–Mallows index near 1 means clusters are good estimates of the groups
• Clustering the 1100 documents gives a Fowlkes–Mallows index of 0.76 – our “gold standard”
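The index described above can be computed by counting pairs. This is a minimal sketch of the standard pair-counting definition; the label vectors are toy data, and the function assumes at least one group and one cluster contain two or more observations (otherwise a denominator is zero).

```python
from collections import Counter

def pairs(count):
    """Number of unordered pairs among `count` observations."""
    return count * (count - 1) // 2

def fowlkes_mallows(groups, clusters):
    """Geometric mean of Pr(same group | same cluster) and Pr(same cluster | same group)."""
    same_both = sum(pairs(c) for c in Counter(zip(groups, clusters)).values())
    same_group = sum(pairs(c) for c in Counter(groups).values())
    same_cluster = sum(pairs(c) for c in Counter(clusters).values())
    p_group_given_cluster = same_both / same_cluster
    p_cluster_given_group = same_both / same_group
    return (p_group_given_cluster * p_cluster_given_group) ** 0.5

# Toy labels (made up): one observation of group 0 is placed in the wrong cluster.
groups = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 1, 1]
fm = fowlkes_mallows(groups, clusters)
fm_perfect = fowlkes_mallows(groups, groups)
print(fm, fm_perfect)
```

For the toy labels, 4 pairs agree on both labelings, 7 pairs share a cluster and 6 share a group, so the index is sqrt((4/7)(4/6)).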

Page 25

Realistic Example

• 19 × 19 = 361 clusters, 19 × 1100 = 20900 observations in 50 dimensions
• Fraction size ≈ 1000 with 100 meta-observations per fraction
• 4 passes of fractionation choosing 361 clusters

Distribution of the number of groups per fraction (nf is the number of fractions):

Pass   Min   Median   Max   nf
1      270   289      296   20
2      18    88       150   18
3      18    19       60    17
4      19    19       58    16

Page 26

Realistic Example

• 19 × 19 = 361 clusters, 19 × 1100 = 20900 observations in 50 dimensions
• Fraction size ≈ 1000 with 100 meta-observations per fraction
• 4 passes of fractionation choosing 361 clusters

Pass   Fowlkes–Mallows   Purity of the clusters
1      0.325             1729
2      0.554             908
3      0.616             671
4      0.613             651

Purity of the clusters: the sum of the number of groups represented in each cluster
• 361 is perfect
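The purity measure reported on this slide, the sum over clusters of the number of groups represented in each, is simple to compute from two label vectors. The sketch below is an illustration with made-up labels; the function name is an assumption.

```python
def purity_sum(groups, clusters):
    """Sum over clusters of the number of distinct groups found in each cluster."""
    members = {}
    for g, c in zip(groups, clusters):
        members.setdefault(c, set()).add(g)
    return sum(len(found) for found in members.values())

# Toy labels (made up): cluster 1 mixes observations from both groups.
groups = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 1, 1]
mixed = purity_sum(groups, clusters)
pure = purity_sum(groups, groups)
print(mixed, pure)
```

When every cluster holds a single group the sum equals the number of clusters, which is why 361 is the perfect score for 361 clusters.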

Page 27

Realistic Example

• 19 × 19 = 361 clusters, 19 × 1100 = 20900 observations in 50 dimensions
• Fraction size ≈ 1000 with 100 meta-observations per fraction
• 4 passes of fractionation choosing 361 clusters

Refractionation:
• Purifies fractions
• Successfully deals with the case where the number of groups is greater than M, the number of meta-observations

Page 28

Contributions

• Model Based Fractionation
  – Extended fractionation idea to parametric setting
    • Incorporates information about size, shape and volume of clusters
    • Chooses number of clusters
  – Still linear in n
• Model Based Refractionation
  – Extended fractionation to handle larger number of groups

Page 29

Extensions

• Extend to 100,000s of observations and 1000s of groups
  – Currently the number of groups must be less than M
• Extend to a more flexible class of models
  – With small groups in high dimensions, we need a more constrained model (fewer parameters) than the full covariance model
  – Mixture of Factor Analyzers

Page 31

Fowlkes–Mallows Index

• Pr(2 documents in same group | they are in the same cluster)
• Pr(2 documents in same cluster | they are in the same group)

Contingency table of true groups (rows) against clusters (columns):

Groups   1     2     …   I     Total
1        n11   n12   …   n1I   n1·
2        n21   n22   …   n2I   n2·
…        …     …     …   …     …
J        nJ1   nJ2   …   nJI   nJ·
Total    n·1   n·2   …   n·I   n