a matrix density based algorithm to hierarchically co-cluster documents and words

Advisor: Dr. Hsu

Graduate: Keng-Wei Chang

Authors: Bhushan Mandhani, Sachindra Joshi, Krishna Kummamuru

Outline

Motivation

Objective

Introduction

Background

Rowset Partitioning and Submatrix Agglomeration (RPSA)

Experimental results

Conclusions

Personal Opinion

Motivation

With the explosion of unstructured information, it has become increasingly important to organize that information in a comprehensible and navigable manner.

Objective

A hierarchical arrangement of documents is very useful for browsing a document collection, as the popularity of Yahoo and Google shows.

This paper proposes an algorithm that hierarchically co-clusters documents and words to build such an arrangement automatically.

Introduction

1990s: 100 thousand pages; 2002: 2 billion pages

It has become increasingly important to organize the information

Manual organization is accurate, but not always feasible

Tools are needed to automatically arrange documents into labeled hierarchies

The paper proposes RPSA, a two-step partitional-agglomerative algorithm

Background

Vector Model for Documents

Evaluation of Clustering Quality

Evaluation of Hierarchical Clustering

Vector Model for Documents

We have d documents.

Document i is represented by a vector whose j-th component $t_{ij}$ is the number of occurrences of word j in document i.

Term Frequency (TF)

Inverse Document Frequency (IDF)

Unitized TF-IDF
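
The weighting formulas on this slide did not survive extraction, so the following is only a minimal sketch of a unitized TF-IDF representation under common textbook assumptions: TF is the raw count $t_{ij}$, IDF is log(d / df_j) where df_j is the number of documents containing word j, and "unitized" means scaled to unit Euclidean length. The function name `unit_tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def unit_tfidf(docs):
    """Return one unit-length TF-IDF dict per tokenized document.

    Assumptions (not taken from the slides): tf is the raw count,
    idf is log(d / df), and vectors are scaled to Euclidean length 1.
    """
    d = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: each document contributes each distinct word once.
    df = Counter(word for c in counts for word in c)
    vectors = []
    for c in counts:
        weights = {w: t * math.log(d / df[w]) for w, t in c.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        # Leave an all-zero vector untouched instead of dividing by zero.
        vectors.append({w: v / norm for w, v in weights.items()} if norm else weights)
    return vectors

docs = [
    ["matrix", "density", "cluster"],
    ["cluster", "word", "word"],
    ["matrix", "word"],
]
vectors = unit_tfidf(docs)
```

In this toy collection "density" occurs in only one document, so it receives a larger weight in the first vector than "matrix", which occurs in two.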

Evaluation of Clustering Quality

1. Purity

2. Entropy: $E_j = -\sum_{i=1}^{g} p_{ij} \log p_{ij}$
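
As a worked illustration, both measures can be computed for a single cluster from the true class labels of its documents. This sketch assumes that p_ij is the fraction of documents in cluster j that belong to class i, that g is the number of classes, that purity is the largest such fraction (the usual definition; the slide's own purity formula is not recoverable), and that the logarithm is natural. The function name `cluster_purity_entropy` is hypothetical.

```python
import math
from collections import Counter

def cluster_purity_entropy(labels):
    """Purity and entropy of one cluster, given its documents' class labels.

    Assumed definitions: purity = max_i p_i, entropy = -sum_i p_i * log(p_i),
    where p_i is the fraction of the cluster's documents in class i.
    """
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    purity = max(probs)
    entropy = -sum(p * math.log(p) for p in probs)
    return purity, entropy

purity, entropy = cluster_purity_entropy(["sports", "sports", "sports", "politics"])
```

A cluster drawn from a single class has purity 1 and entropy 0; mixing classes lowers purity and raises entropy.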

Evaluation of Hierarchical Clustering

Rowset Partitioning and Submatrix Agglomeration (RPSA)

A two-step partitional-agglomerative algorithm

Step 1: The Partitioning Step

Step 2: The Agglomerative Step

The Partitioning Step

Define the density of submatrices

a row r, a column c

a set R of rows, a set C of columns
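
The density formulas themselves are missing from the transcript. A minimal sketch, assuming the density of the submatrix selected by a row set R and a column set C is the mean of its entries (so that a single row r and column c give just that entry), could look like this. The sparse dict-of-dicts layout and the function name `density` are illustrative choices, not taken from the paper.

```python
def density(matrix, rows, cols):
    """Mean entry of the submatrix picked out by `rows` and `cols`.

    `matrix` maps row index -> {column index: nonzero value}; entries that
    are absent count as zero. The averaging definition is an assumption.
    """
    rows, cols = list(rows), list(cols)
    if not rows or not cols:
        return 0.0
    total = sum(matrix.get(r, {}).get(c, 0.0) for r in rows for c in cols)
    return total / (len(rows) * len(cols))

# Toy document-by-word matrix: rows are documents, columns are words.
matrix = {0: {0: 1.0, 1: 0.5}, 1: {1: 0.5}, 2: {2: 2.0}}
dense_block = density(matrix, [0, 1], [0, 1])
empty_block = density(matrix, [2], [0, 1])
```

Under this assumption, documents 0 and 1 share a dense block on words 0 and 1, while document 2 contributes nothing on those words.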

The Partitioning Step

Generating a Leaf Cluster

The Partitioning Step

Choice of Leader Documents

The length of a document is the sum of the components of the TF-IDF vector representing that document.

Documents with relatively large lengths were observed to be better leader documents for the leaf-cluster generation algorithm.
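
One simple way to act on this observation is to pick, among the documents not yet assigned to a cluster, the one with the largest TF-IDF sum. Taking the strict maximum is an assumption made for illustration; the slide only says that relatively long documents worked better. The function name `pick_leader` and the toy vectors are hypothetical.

```python
def pick_leader(vectors, unassigned):
    """Return the unassigned document index with the largest TF-IDF sum.

    `vectors` is a list of {word: weight} dicts, one per document.
    Choosing the maximum is an illustrative assumption.
    """
    return max(unassigned, key=lambda i: sum(vectors[i].values()))

vectors = [
    {"a": 0.2, "b": 0.3},
    {"a": 0.9, "b": 0.8},
    {"c": 1.0},
]
first_leader = pick_leader(vectors, {0, 1, 2})
next_leader = pick_leader(vectors, {0, 2})
```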

The Partitioning Step

The Complete Partitioning Algorithm

The Partitioning Step

Complexity Analysis

The time complexity is O(mz)

The space complexity is O(z)

The Agglomerative Step

Reduce the number of clusters

The similarity measure between two clusters for merging

Flat Clustering

Hierarchical Clustering
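
The similarity measure used for merging is not recoverable from the transcript. The sketch below therefore swaps in cosine similarity between summed cluster vectors purely as a stand-in, and shows only the generic idea of merging the most similar pair of leaf clusters until k clusters remain (the flat-clustering case). All names here (`cosine`, `add_vectors`, `agglomerate`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse {word: weight} vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def add_vectors(u, v):
    """Component-wise sum of two sparse vectors."""
    out = dict(u)
    for w, x in v.items():
        out[w] = out.get(w, 0.0) + x
    return out

def agglomerate(leaves, k):
    """Merge the most similar pair repeatedly until k clusters remain.

    `leaves` is a list of (doc_ids, vector) pairs. Cosine similarity is a
    stand-in assumption, not the measure defined in the paper.
    """
    clusters = [(list(ids), dict(vec)) for ids, vec in leaves]
    while len(clusters) > max(k, 1):
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = max(pairs, key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]))
        merged = (clusters[a][0] + clusters[b][0],
                  add_vectors(clusters[a][1], clusters[b][1]))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return [sorted(ids) for ids, _ in clusters]

leaves = [([0], {"x": 1.0}), ([1], {"x": 0.9, "y": 0.1}), ([2], {"z": 1.0})]
flat = agglomerate(leaves, 2)
```

Recording each merge instead of discarding it would give a tree over the leaf clusters, which is the hierarchical variant of the same loop.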

The Agglomerative Step

Complexity Analysis

The time complexity is $O(z m^2)$

The space complexity is $O(m^2)$

Experimental results-Flat Clustering

Data Sets

Experimental results-Flat Clustering

Results

Experimental results-Flat Clustering

Experimental results-Hierarchical Clustering

Data Sets

Experimental results-Hierarchical Clustering

Data Sets

Experimental results-Hierarchical Clustering

Results

Conclusions

RPSA is comparable with or better than the best k-means run.

Its performance does not degrade on small data sets.

Its purity is acceptable in the hierarchical setting.

Personal Opinion