A matrix density based algorithm to hierarchically co-cluster documents and words
Advisor: Dr. Hsu
Graduate: Keng-Wei Chang
Authors: Bhushan Mandhani, Sachindra Joshi, Krishna Kummamuru
Outline
Motivation
Objective
Introduction
Background
Rowset Partitioning and Submatrix Agglomeration (RPSA)
Experimental results
Conclusions
Personal Opinion
Motivation
With the explosion of unstructured information, it has become increasingly important to organize that information in a comprehensible and navigable manner.
Objective
A hierarchical arrangement of documents is very useful for browsing a document collection, as the popularity of Yahoo and Google shows.
This paper proposes an algorithm that hierarchically clusters documents to produce such an arrangement.
Introduction
1990s: about 100 thousand pages; 2002: about 2 billion pages
It has become increasingly important to organize the information
Manual organization is accurate, but not always feasible
Tools are needed to automatically arrange documents into labeled hierarchies
Proposal: RPSA, a two-step partitional-agglomerative algorithm
Background
Vector Model for Documents
Evaluation of Clustering Quality
Evaluation of Hierarchical Clustering
Vector Model for Documents
We have d documents
Document i is represented by $m_i$
$t_{ij}$ is the number of occurrences of word j in document i
Term Frequency (TF)
Inverse Document Frequency (IDF)
Unitized TF-IDF
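The TF and IDF formulas did not survive in this transcript. As a minimal sketch only, the block below assumes one common formulation: TF is the raw count of a word in a document, IDF is log(d / df) where df is the number of documents containing the word, and "unitized" means scaling each vector to unit Euclidean length. The function name `unitized_tfidf` and the sample documents are hypothetical, and the paper's exact weighting may differ.

```python
import math
from collections import Counter

def unitized_tfidf(docs):
    """Return one unit-length TF-IDF dict per tokenized document.

    Assumed weighting (not confirmed by the slides): count * log(d / df).
    """
    d = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # occurrences of each word in this document
        weights = {w: t * math.log(d / df[w]) for w, t in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        # Scale to unit Euclidean length; an all-zero vector is left unchanged.
        vectors.append({w: v / norm for w, v in weights.items()} if norm else weights)
    return vectors

docs = [["matrix", "density", "cluster"],
        ["cluster", "word", "word"],
        ["matrix", "word"]]
vectors = unitized_tfidf(docs)
```

In this toy collection "density" occurs in only one document, so it receives a larger weight in the first vector than "matrix" or "cluster", which each occur in two.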
Evaluation of Clustering Quality
1. Purity:
2. Entropy: $E_i = -\sum_{j=1}^{g} p_{ij} \log p_{ij}$
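Both measures can be computed per cluster from the true class labels of its documents. The sketch below assumes that p_ij is the fraction of documents in cluster i that belong to class j and that g is the number of classes; the purity formula is missing from the transcript, so the usual definition (the largest p_ij) is assumed here. The function name and the labels are hypothetical.

```python
import math
from collections import Counter

def purity_and_entropy(class_labels):
    """class_labels: true class labels of the documents in one cluster."""
    n = len(class_labels)
    # p_ij: fraction of this cluster's documents that belong to class j.
    # Classes absent from the cluster have p_ij = 0 and contribute nothing.
    p = [count / n for count in Counter(class_labels).values()]
    purity = max(p)                                   # assumed definition
    entropy = -sum(p_ij * math.log(p_ij) for p_ij in p)
    return purity, entropy

purity, entropy = purity_and_entropy(["sports", "sports", "sports", "politics"])
pure_purity, pure_entropy = purity_and_entropy(["sports", "sports"])
```

A cluster drawn from a single class has purity 1 and entropy 0; mixing classes lowers purity and raises entropy.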
Evaluation of Hierarchical Clustering
Rowset Partitioning and Submatrix Agglomeration(RPSA)
Two-step partitional-agglomerative algorithm
Step 1: The Partitioning Step
Step 2: The Agglomerative Step
The Partitioning Step
Define the density of submatrices
a row r, a column c
a set R of rows , a set C of columns
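The density formulas themselves are missing from this transcript. As a hedged illustration, the sketch below assumes that the density of the submatrix induced by a row set R and a column set C is the average of its entries, with a single row or column treated as a one-element set. This definition, the function name `density`, and the example matrix are assumptions, not taken from the slides.

```python
def density(matrix, rows, cols):
    """Average entry of the submatrix of `matrix` induced by `rows` and `cols`.

    Assumed definition; the paper's exact density measure may differ.
    """
    total = sum(matrix[r][c] for r in rows for c in cols)
    return total / (len(rows) * len(cols))

M = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0],
     [4.0, 0.0, 0.0]]

row_density = density(M, [0], [0, 1, 2])   # a single row r over all columns
sub_density = density(M, [0, 2], [0, 2])   # rows R = {0, 2}, columns C = {0, 2}
```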
Generating a Leaf Cluster
Choice of Leader Documents
The length of a document is the sum of the entries of the TF-IDF vector representing that document
Documents with relatively large lengths were observed to be better leader documents for the algorithm above
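The observation about leader documents can be sketched as a ranking by length. The block below assumes each document is a sparse TF-IDF vector stored as a dict and simply orders documents by decreasing length; how many leaders the algorithm takes, and any threshold it applies, is not stated in the transcript. The function name and the vectors are hypothetical.

```python
def rank_leader_candidates(vectors):
    """Return document indices sorted by decreasing length.

    Length is the sum of the entries of the document's TF-IDF vector.
    """
    lengths = [sum(vec.values()) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: lengths[i], reverse=True)

vectors = [{"matrix": 0.2, "word": 0.1},
           {"cluster": 0.9, "word": 0.8},
           {"density": 0.5}]
order = rank_leader_candidates(vectors)
```

Here document 1 has the largest length and would be tried first as a leader.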
The Complete Partitioning Algorithm
Complexity Analysis
The time complexity is $O(mz)$
The space complexity is $O(z)$
The Agglomerative Step
Reduce the number of clusters
The similarity measure between two clusters for merging
Flat Clustering
Hierarchical Clustering
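The similarity measure used for merging is not preserved in this transcript. The sketch below therefore shows only the generic greedy agglomeration pattern, with cosine similarity between summed cluster vectors swapped in as a stand-in; it is not the paper's measure. Stopping at k clusters gives a flat clustering, while running to k = 1 and reading the merge history gives a hierarchy. All names and data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors (stand-in measure)."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def agglomerate(clusters, k, similarity=cosine):
    """Greedily merge the most similar pair until k (>= 1) clusters remain.

    clusters: list of sparse vectors (dicts), one per leaf cluster.
    Returns (items, history): items are (member indices, vector) pairs,
    history lists the member lists merged at each step.
    """
    items = [([i], dict(vec)) for i, vec in enumerate(clusters)]
    history = []
    while len(items) > k:
        # Find the pair of current clusters with the highest similarity.
        a, b = max(
            ((i, j) for i in range(len(items)) for j in range(i + 1, len(items))),
            key=lambda ij: similarity(items[ij[0]][1], items[ij[1]][1]),
        )
        members_a, vec_a = items[a]
        members_b, vec_b = items[b]
        merged = dict(vec_a)
        for w, x in vec_b.items():
            merged[w] = merged.get(w, 0.0) + x
        history.append((members_a, members_b))
        items = [it for idx, it in enumerate(items) if idx not in (a, b)]
        items.append((members_a + members_b, merged))
    return items, history

leaves = [{"goal": 1.0, "team": 1.0},
          {"goal": 1.0, "match": 1.0},
          {"vote": 1.0, "party": 1.0}]
flat, history = agglomerate(leaves, k=2)
```

Recomputing every pairwise similarity on each pass keeps the sketch short; a practical implementation would cache the similarity matrix.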
Complexity Analysis
The time complexity is $O(zm^2)$
The space complexity is $O(m^2)$
Experimental results-Flat Clustering
Data Sets
Results
Experimental results-Hierarchical Clustering
Data Sets
Results
Conclusions
RPSA is comparable to or better than the best k-means run
Its performance does not degrade on small data sets
Its purity on hierarchical clustering is acceptable
Personal Opinion