birch: balanced iterative reducing and clustering using hierarchies by tian zhang, raghu...

33
Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: [email protected]

Upload: denisse-asbury

Post on 16-Dec-2015

223 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Birch: Balanced Iterative Reducing and

Clustering using Hierarchies

By Tian Zhang, Raghu Ramakrishnan

Presented by

Vladimir Jelić 3218/10e-mail: [email protected]

Page 2: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

What is Data Clustering?

A cluster is a closely-packed group. A collection of data objects that are similar to

one another and treated collectively as a group.

Data Clustering is the partitioning of a dataset into clusters

2 of 28 Vladimir Jelić ([email protected])

Page 3: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Data Clustering

Helps understand the natural grouping or structure in a dataset

Provided a large set of multidimensional data– Data space is usually not uniformly occupied– Identify the sparse and crowded places– Helps visualization

3 of 28 Vladimir Jelić ([email protected])

Page 4: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Some Clustering Applications

Biology – building groups of genes with related patterns

Marketing – partition the population of consumers to market segments

Division of WWW pages into genres. Image segmentations – for object recognition Land use – Identification of areas of similar

land use from satellite images

4 of 28 Vladimir Jelić ([email protected])

Page 5: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Clustering Problems

Today many datasets are too large to fit into main memory

The dominating cost of any clustering algorithm is I/O, because seek times on disk are orders of a magnitude higher than RAM access times

5 of 28 Vladimir Jelić ([email protected])

Page 6: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Previous Work

Two classes of clustering algorithms: Probability-Based

Examples: COBWEB and CLASSIT

Distance-Based Examples: KMEANS, KMEDOIDS, and CLARANS

6 of 28 Vladimir Jelić ([email protected])

Page 7: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Previous Work: COBWEB

Probabilistic approach to make decisions Clusters are represented with probabilistic

description Probability representations of clusters is

expensive Every instance (data point) translates into a

terminal node in the hierarchy, so large hierarchies tend to over fit data

7 of 28 Vladimir Jelić ([email protected])

Page 8: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Previous Work: KMeans

Distance based approach, so there must be distance measurement between any two instances

Sensitive to instance order Instances must be stored in memory All instances must be initially available May have exponential run time

8 of 28 Vladimir Jelić ([email protected])

Page 9: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Previous Work: CLARANS

Also distance based approach, so there must be distance measurement between any two instances

computational complexity of CLARANS is about O(n2)

Sensitive to instance order Ignore the fact that not all data points in the

dataset are equally important

9 of 28 Vladimir Jelić ([email protected])

Page 10: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Contributions of BIRCH

Each clustering decision is made without scanning all data points

BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes

BIRCH makes full use of available memory to derive the finest possible subclusters ( to ensure accuracy) while minimizing I/O costs ( to ensure efficiency)

10 of 28 Vladimir Jelić ([email protected])

Page 11: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Background Knowledge (1)

Given a cluster of instances , we define:

11 of 28 Vladimir Jelić ([email protected])

}{ ix

2

11 1

2

2

11

20

10

))1(

)((

))(

(

NN

xxD

N

xxR

N

xx

N

i

N

j ji

N

i i

N

i i

Centroid:

Radius:

Diameter:

Page 12: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Background Knowledge (2)

Vladimir Jelić ([email protected])12 of 28

2)

21

1

1

1

21

11 2

21 11(

2)

1

11(

2)

21

211(4

21

))121)(21(

211 2112

)((3

21

)

21

11 21 11

2)(

(2

NN

k

N

i

NN

Nj N

NNNl lx

jxN

Nl lx

ixNN

NNl lx

kxD

NNNN

NNi

NNj jxix

D

NN

Ni

NNNj jxix

D

|1

)(

20)(

10||2010|1

21

)2

)2010((0

d

i

ix

ixxxD

xxD

centroid Manhattan distance:

centroid Euclidian distance:

average inter-cluster:

average intra-cluster:

variance increase:

Page 13: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Clustering Features (CF)

The Birch algorithm builds a dendrogram called clustering feature tree (CF tree) while scanning the data set.

Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple: (N, LS, SS), where N is the number of objects in the cluster and LS, SS are defined in the following.

13 of 28 Vladimir Jelić ([email protected])

Page 14: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Clustering Feature (CF)

Given N d-dimensional data points in a cluster: {Xi} where i = 1, 2, …, N,

CF = (N, LS, SS) N is the number of data points in the cluster, LS is the linear sum of the N data points, SS is the square sum of the N data points.

14 of 28 Vladimir Jelić ([email protected])

Page 15: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

CF Additivity Theorem (1)

If CF1 = (N1, LS1, SS1), and

CF2 = (N2 ,LS2, SS2) are the CF entries of two disjoint sub-clusters.

The CF entry of the sub-cluster formed by merging the two disjoin sub-clusters is:

CF1 + CF2 = (N1 + N2 , LS1 + LS2, SS1 + SS2)

15 of 28 Vladimir Jelić ([email protected])

Page 16: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

CF Additivity Theorem (2)

Vladimir Jelić ([email protected])16 of 28

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

CF = (5, (16,30),(54,190))

(3,4)(2,6)(4,5)(4,7)(3,8)

Example:

Page 17: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Properties of CF-Tree

Each non-leaf node has at most B entriesEach leaf node has at most L CF entries which each satisfy threshold TNode size is determined by dimensionality of data space and input parameter P (page size)

17 of 28 Vladimir Jelić ([email protected])

Page 18: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

CF Tree Insertion

Identifying the appropriate leaf: recursively descending the CF tree and choosing the closest child node according to a chosen distance metric

Modifying the leaf: test whether the leaf can absorb the node without violating the threshold. If there is no room, split the node

Modifying the path: update CF information up the path.

18 of 28 Vladimir Jelić ([email protected])

Page 19: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Example of the BIRCH Algorithm

Root

LN1

LN2LN3

LN1 LN2 LN3

sc1sc2

sc3

sc4sc5

sc6sc7

sc1 sc2sc3sc4

sc5 sc6 sc7sc8

sc8

New subcluster

19 of 28 Vladimir Jelić ([email protected])

Page 20: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Vladimir Jelić ([email protected])19 of 28

Merge Operation in BIRCH

RootLN1”

LN2LN3

LN1’ LN2 LN3

sc1

sc2sc3

sc4sc5

sc6sc7

sc1 sc2sc3sc4

sc5 sc6 sc7sc8

sc8

LN1’

LN1”

If the branching factor of a leaf node can not exceed 3, then LN1 is split

Page 21: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Vladimir Jelić ([email protected])19 of 28

Merge Operation in BIRCH

Root

LN1”

LN2LN3

LN1’

LN2LN3

sc1

sc2

sc3sc4

sc5

sc6

sc7

sc1sc2 sc3 sc4

sc5 sc6 sc7sc8

sc8

LN1’LN1”

If the branching factor of a non-leaf node can not exceed 3, then the root is split and the height of the CF Tree increases by one

NLN1

NLN2

Page 22: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Vladimir Jelić ([email protected])19 of 28

Merge Operation in BIRCH

root

LN1

LN2

sc1sc2 sc3 sc4

sc5sc6

LN1

Assume that the subclusters are numbered according to the order of formation

sc3

sc4

sc5sc6

sc2sc1

LN2

Page 23: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Vladimir Jelić ([email protected])19 of 28

LN2”

LN1

sc3 sc4

sc5sc6

sc2

If the branching factor of a leaf node can not exceed 3, then LN2 is split

sc1

LN2’

root

LN2’

sc1 sc2sc3sc4sc5

sc6

LN1

LN2”

Merge Operation in BIRCH

Page 24: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Vladimir Jelić ([email protected])19 of 28

rootLN2”

LN3’

LN3”

sc3 sc4

sc5sc6

sc2

sc1sc2 sc3sc4sc5

sc6

LN3’

LN2’ and LN1 will be merged, and the newly formed nodewil be split immediately

sc1

LN3”

LN2”

Merge Operation in BIRCH

Page 25: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Birch Clustering Algorithm (1)

Phase 1: Scan all data and build an initial in-memory CF tree.

Phase 2: condense into desirable length by building a smaller CF tree.

Phase 3: Global clustering Phase 4: Cluster refining – this is optional,

and requires more passes over the data to refine the results

20 of 28 Vladimir Jelić ([email protected])

Page 26: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Birch Clustering Algorithm (2)

Vladimir Jelić ([email protected])21 of 28

Page 27: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Birch – Phase 1

Start with initial threshold and insert points into the tree

If run out of memory, increase thresholdvalue, and rebuild a smaller tree by reinserting values from older tree and then other values

Good initial threshold is important but hard to figure out

Outlier removal – when rebuilding tree remove outliers

22 of 28 Vladimir Jelić ([email protected])

Page 28: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Birch - Phase 2

Optional Phase 3 sometime have minimum size which

performs well, so phase 2 prepares the tree for phase 3.

BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.

23 of 28 Vladimir Jelić ([email protected])

Page 29: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Birch – Phase 3

Problems after phase 1:– Input order affects results– Splitting triggered by node size

Phase 3:– cluster all leaf nodes on the CF values according

to an existing algorithm– Algorithm used here: agglomerative hierarchical

clustering

24 of 28 Vladimir Jelić ([email protected])

Page 30: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Birch – Phase 4

Optional Do additional passes over the dataset &

reassign data points to the closest centroid from phase 3

Recalculating the centroids and redistributing the items.

Always converges (no matter how many time phase 4 is repeated)

25 of 28 Vladimir Jelić ([email protected])

Page 31: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Conclusions (1)

Birch performs faster than existing algorithms (CLARANS and KMEANS) on large datasets

Scans whole data only once Handles outliers better Superior to other algorithms in stability and

scalability

26 of 28 Vladimir Jelić ([email protected])

Page 32: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

Conclusions (2)

Since each node in a CF tree can hold only a limited number of entries due to the size, a CF tree node doesn’t always correspond to what a user may consider a nature cluster. Moreover, if the clusters are not spherical in shape, it doesn’t perform well because it uses the notion of radius or diameter to control the boundary of a cluster

Vladimir Jelić ([email protected])27 of 28

Page 33: Birch: Balanced Iterative Reducing and Clustering using Hierarchies By Tian Zhang, Raghu Ramakrishnan Presented by Vladimir Jelić 3218/10 e-mail: jelicvladimir5@gmail.com

References

T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering method for very large databases. SIGMOD'96

Jan Oberst: Efficient Data Clustering and How to Groom Fast-Growing Trees

Tan, Steinbach, Kumar: Introduction to Data Mining

Vladimir Jelić ([email protected])28 of 28