jeremy tantrum, department of statistics, university of washington joint work with alejandro murua...

Jeremy Tantrum, Department of Statistics,

University of Washington

joint work with

Alejandro Murua (Insightful Corporation) & Werner Stuetzle (University of Washington)

This work has been supported by NSA grant 62-1942

Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation

Motivating Example

• Consider clustering documents
• Topic Detection and Tracking corpus

• 15,863 news stories for one year from Reuters and CNN
• 25,000 unique words
• Possibly many topics

• Large numbers of observations
• High dimensions
• Many groups

Goal of Clustering

[Figure: plot of example data; horizontal axis 40 to 55, vertical axis 74 to 84]

• Detect that there are 5 or 6 groups
• Assign observations to groups

NonParametric Clustering

Premise:
• Observations are sampled from a density p(x)
• Groups correspond to modes of p(x)

[Figure: plot over x from -10 to 10 (vertical axis 0.0 to 0.15) with a rug of the observations along the x-axis]

NonParametric Clustering

Fitting: Estimate p(x) nonparametrically and find significant modes of the estimate

[Figure: two plots over x from -10 to 10 (vertical axes 0.0 to 0.12 and 0.0 to 0.15), each with a rug of the observations along the x-axis]

Model Based Clustering

Premise:
• Observations are sampled from a mixture density $p(x) = \sum_g \pi_g \, p_g(x)$
• Groups correspond to mixture components

[Figure: plot over x from -10 to 10 (vertical axis 0.0 to 0.15) with a rug of the observations along the x-axis]

Fitting: Estimate $\pi_g$ and the parameters of $p_g(x)$


Fitting a Mixture of Gaussians

• Use the EM algorithm to maximize the log likelihood
– Estimates the probabilities of each observation belonging to each group
– Maximizes likelihood given these probabilities
– Requires a good starting point
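The two alternating steps named above can be sketched in a few lines. This is a minimal illustration for a one-dimensional Gaussian mixture, not the implementation used in the talk; the function name `em_gaussian_mixture_1d` and its arguments are assumptions made for this example, and the starting values are supplied by the caller because EM needs a good starting point.

```python
import numpy as np


def _component_densities(x, means, variances):
    """Gaussian density of every observation under every component, shape (n, k)."""
    diff = x[:, None] - means
    return np.exp(-0.5 * diff ** 2 / variances) / np.sqrt(2.0 * np.pi * variances)


def em_gaussian_mixture_1d(x, means, variances, weights, n_iter=50):
    """Run n_iter EM iterations for a 1-D Gaussian mixture (illustrative sketch).

    Returns the mixing weights, means, variances and the log likelihood
    evaluated at the returned parameters.
    """
    x = np.asarray(x, dtype=float)
    means = np.array(means, dtype=float)
    variances = np.array(variances, dtype=float)
    weights = np.array(weights, dtype=float)
    for _ in range(n_iter):
        # E-step: probability of each observation belonging to each group.
        resp = weights * _component_densities(x, means, variances)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize the likelihood given these probabilities.
        nk = resp.sum(axis=0)
        weights = nk / x.size
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    mixture = (weights * _component_densities(x, means, variances)).sum(axis=1)
    return weights, means, variances, np.log(mixture).sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
    print(em_gaussian_mixture_1d(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5]))
```

The sketch has no safeguard against a component collapsing onto a single point, which is one reason the starting point matters in practice.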


Hierarchical Clustering

• Provides a good starting point for the EM algorithm
• Start with every point being its own cluster
• Merge the two closest clusters
– Measured by the decrease in likelihood when those two clusters are merged
– Uses the Classification Likelihood – not the Mixture Likelihood
• Algorithm is quadratic in the number of observations

Likelihood Distance

[Figure: rug plots of two clusters with component densities p1(x) and p2(x) and the merged density p(x). Panel captions: "Merge gives small decrease in likelihood" and "Merge gives big decrease in likelihood".]

Bayesian Information Criterion

• Choose number of clusters by maximizing the Bayesian Information Criterion

– r is the number of parameters
– n is the number of observations

• Log likelihood penalized for complexity
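The formula itself did not survive on this slide. Under the convention commonly used in model-based clustering, where larger values are better (an assumption here; other sources flip the sign and minimize), the criterion can be written as:

```latex
\mathrm{BIC} = 2 \log L(\hat{\theta}) - r \log n
```

Here $L(\hat{\theta})$ denotes the maximized likelihood of the fitted mixture, a symbol introduced for this sketch; r and n are as defined above.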

Fractionation

• Original data – size n
• Split into n/M fractions of size M
• Partition each fraction into $\alpha M$ clusters, $\alpha < 1$
• This gives $\alpha n$ clusters (meta-observations, $\mu_i$)
• If $\alpha n > M$, repeat with the meta-observations as the new data

M is the largest number of observations for which a hierarchical $O(M^2)$ algorithm is computationally feasible

Invented by Cutting, Karger, Pedersen and Tukey for nonparametric clustering of large datasets.

Fractionation

– $\alpha n$ meta-observations after the first round
– $\alpha^2 n$ meta-observations after the second round
– $\alpha^i n$ meta-observations after the ith round

• For the ith pass, we have $\alpha^{i-1} n/M$ fractions taking $O(M^2)$ operations each

• Total number of operations is:

• Total running time is linear in n!
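The sum that followed "Total number of operations is:" did not survive on this slide. A reconstruction from the preceding bullets, writing $\alpha < 1$ for the reduction factor per pass, k for the number of passes, and $c M^2$ for the cost of clustering one fraction (k and c are symbols introduced here for illustration):

```latex
\sum_{i=1}^{k} \alpha^{i-1} \frac{n}{M} \cdot c M^2
  \;=\; c \, n M \sum_{i=1}^{k} \alpha^{i-1}
  \;\le\; \frac{c \, n M}{1 - \alpha}
  \;=\; O(nM)
```

For fixed M and $\alpha$ the bound grows linearly in n, which is the claim on the slide.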

• Use model based clustering
• Meta-observations contain all sufficient statistics – $(n_i, \mu_i, \Sigma_i)$
– $n_i$ is the number of observations – size
– $\mu_i$ is the mean – location
– $\Sigma_i$ is the covariance matrix – shape and volume
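Because a meta-observation carries only a size, a mean and a covariance matrix, two of them can be merged without revisiting the raw data. The sketch below shows the standard pooling identities; the function name `merge_meta_observations` is hypothetical, and covariances are assumed to be maximum-likelihood estimates (divided by n rather than n - 1).

```python
import numpy as np


def merge_meta_observations(n1, mu1, cov1, n2, mu2, cov2):
    """Combine two meta-observations (size, mean, covariance) into one.

    Assumes cov1 and cov2 are maximum-likelihood covariances (divisor n).
    """
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)
    n = n1 + n2
    # Size-weighted average of the two means.
    mu = (n1 * mu1 + n2 * mu2) / n
    # Pooled within-cluster covariance plus the between-cluster term.
    diff = mu1 - mu2
    cov = (n1 * cov1 + n2 * cov2) / n + (n1 * n2 / n ** 2) * np.outer(diff, diff)
    return n, mu, cov
```

The merged triple equals the statistics computed directly from the union of the two underlying sets of observations, so a hierarchical merge can treat meta-observations the same way it treats individual points.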

Model Based Fractionation

[Figure sequence: scatter plots of a two-dimensional example (horizontal axis 1.0 to 2.0, vertical axis 0.5 to 2.5). Panels, in order:
– An example, 400 observations in 4 groups
– Observations in the first fraction
– 10 meta-observations from the first fraction
– 10 meta-observations from the second fraction
– 10 meta-observations from the third fraction
– 10 meta-observations from the fourth fraction
– The 40 meta-observations
– The final clusters chosen by BIC]

Success!

Model Based Fractionation

The data – 400 observations in 25 groups

[Figure sequence: scatter plots of the second example (both axes 1 to 5), starting from the full data set. Panels, in order:
– Observations in fraction 1
– 10 meta-observations from the first fraction
– 10 meta-observations from the second fraction
– 10 meta-observations from the third fraction
– 10 meta-observations from the fourth fraction
– The 40 meta-observations
– The clusters chosen by BIC]

Fractionation fails!

Example 2

Refractionation

Problem:
• If the number of meta-observations generated from a fraction is less than the number of groups in that fraction then two or more groups will be merged.
• Once observations from two groups are merged they can never be split again.

Solution:
• Apply fractionation repeatedly.
• Use meta-observations from the previous pass of fractionation to create “better” fractions.
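One way to read the second bullet is that the meta-observations from a pass are themselves clustered, and each resulting cluster defines a new fraction containing all of its underlying observations. The sketch below covers only that bookkeeping step under this assumption; the function `new_fractions` and its two index arrays are hypothetical names, not taken from the talk.

```python
import numpy as np


def new_fractions(obs_to_meta, meta_to_cluster):
    """Group observation indices into new fractions (illustrative sketch).

    obs_to_meta[i]     : index of the meta-observation containing observation i
    meta_to_cluster[j] : cluster label given to meta-observation j when the
                         meta-observations themselves were clustered
    Returns a dict mapping each cluster label to the observation indices
    that make up the corresponding new fraction.
    """
    obs_to_meta = np.asarray(obs_to_meta)
    meta_to_cluster = np.asarray(meta_to_cluster)
    # Look up, for every observation, the cluster of its meta-observation.
    obs_cluster = meta_to_cluster[obs_to_meta]
    return {int(c): np.flatnonzero(obs_cluster == c) for c in np.unique(obs_cluster)}
```

Observations that ended up in nearby meta-observations land in the same new fraction, so each new fraction should contain fewer distinct groups than a random one.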

Example 2 Continued

[Figure sequence: scatter plots of the second example (both axes 1 to 5). Panels, in order:
– The 40 meta-observations
– 4 new clusters
– 4 new fractions
– Observations in the new fraction 1
– Clusters from the first fraction
– Clusters from the second fraction
– Clusters from the third fraction
– Clusters from the fourth fraction
– Clusters chosen by BIC]

Example 2 – Pass 2

[Figure sequence: scatter plots of the second example (both axes 1 to 5). Panels, in order:
– The 40 meta-observations of pass 2 of fractionation
– 4 new clusters
– 4 new fractions
– Observations in the new fraction 1
– Clusters from the first fraction
– Clusters from the second fraction
– Clusters from the third fraction
– Clusters from the fourth fraction
– Clusters chosen by BIC]

Refractionation succeeds

Example 2 – Pass 3

Realistic Example

• 1100 documents from the TDT corpus partitioned by people into 19 topics
– Transformed into 50 dimensional space using Latent Semantic Indexing

[Figure: projection of the data onto a plane; colors represent topics]

Realistic Example

Want to create a dataset with more observations and more groups

Idea: Replace each group with a scaled and transformed version of the entire data set.

Realistic Example

To measure similarity of clusters to groups: Fowlkes–Mallows index

• Geometric average of:

– Probability of 2 randomly chosen observations from the same cluster being in the same group

– Probability of 2 randomly chosen observations from the same group being in the same cluster

• Fowlkes–Mallows index near 1 means clusters are good estimates of the groups

• Clustering the 1100 documents gives a Fowlkes–Mallows index of 0.76 – our “gold standard”
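The two probabilities above can be computed from pair counts in the contingency table of groups against clusters. The following is a minimal sketch with a hypothetical function name, `fowlkes_mallows`; it assumes at least one pair of observations shares a group and at least one pair shares a cluster, otherwise the ratio is undefined.

```python
import numpy as np


def fowlkes_mallows(groups, clusters):
    """Fowlkes-Mallows index between true group labels and cluster labels."""
    _, g = np.unique(np.asarray(groups), return_inverse=True)
    _, c = np.unique(np.asarray(clusters), return_inverse=True)
    # Contingency table: rows are groups, columns are clusters.
    table = np.zeros((g.max() + 1, c.max() + 1), dtype=np.int64)
    np.add.at(table, (g, c), 1)
    row = table.sum(axis=1)
    col = table.sum(axis=0)
    # Numbers of unordered pairs sharing both labels, a group, or a cluster.
    same_both = (table * (table - 1)).sum() / 2
    same_group = (row * (row - 1)).sum() / 2
    same_cluster = (col * (col - 1)).sum() / 2
    # Geometric mean of same_both / same_cluster and same_both / same_group.
    return same_both / np.sqrt(same_group * same_cluster)
```

The index is symmetric in its two arguments and does not depend on how the labels are named, only on which observations are placed together.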

Realistic Example

• 19 × 19 = 361 clusters, 19 × 1100 = 20,900 observations in 50 dimensions

• Fraction size ≈ 1000 with 100 meta-observations per fraction

• 4 passes of fractionation choosing 361 clusters

Distribution of the number of groups per fraction (nf = number of fractions):

Pass  Min  Median  Max  nf
1     270  289     296  20
2     18   88      150  18
3     18   19      60   17
4     19   19      58   16

Realistic Example

Pass  Fowlkes–Mallows  Purity of the clusters
1     0.325            1729
2     0.554            908
3     0.616            671
4     0.613            651

Purity of the clusters: the sum of the number of groups represented in each cluster
• 361 is perfect




Realistic Example




Refractionation:
• Purifies fractions
• Successfully deals with the case where the number of groups is greater than M, the number of meta-observations

Contributions

• Model Based Fractionation
– Extended fractionation idea to parametric setting
  • Incorporates information about size, shape and volume of clusters
  • Chooses number of clusters
– Still linear in n

• Model Based ReFractionation
– Extended fractionation to handle larger number of groups

Extensions

• Extend to 100,000s of observations – 1000s of groups
– Currently the number of groups must be less than M

• Extend to a more flexible class of models
– With small groups in high dimensions, we need a more constrained model (fewer parameters) than the full covariance model
– Mixture of Factor Analyzers

Fowlkes-Mallows Index

• Pr(2 documents in same group | they are in the same cluster)
• Pr(2 documents in same cluster | they are in the same group)

Contingency table of true groups (rows) against clusters (columns):

Groups  1    2    …  I    Total
1       n11  n12  …  n1I  n1·
2       n21  n22  …  n2I  n2·
…       …    …    …  …    …
J       nJ1  nJ2  …  nJI  nJ·
Total   n·1  n·2  …  n·I  n
