A New Suffix Tree Similarity Measure for Document Clustering

Hung Chim, Xiaotie Deng

WWW 07

1. Document Clustering

• Agglomerative Hierarchical Clustering (AHC)

• Suffix Tree Clustering (STC)

- commonly used in search result clustering

2-1. Suffix Tree Clustering

Ex: 3 documents

• cat ate cheese

• cat ate mouse too

• mouse ate cheese too
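The idea behind the example can be sketched in a few lines: find every word-level phrase that occurs in at least two of the three documents. This is only an illustration. It enumerates phrases naively instead of building a real suffix tree (so it is not linear time, and it lists phrases such as "cat" that a suffix tree would fold into the node "cat ate"), and the names `phrase_index` and `shared_phrases` are hypothetical.

```python
from collections import defaultdict

docs = ["cat ate cheese", "cat ate mouse too", "mouse ate cheese too"]


def phrase_index(documents):
    """Map each word-level phrase to the set of document ids that contain it.

    Naive stand-in for a suffix tree: every prefix of every suffix is a phrase.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        words = text.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                index[tuple(words[start:end])].add(doc_id)
    return index


def shared_phrases(documents):
    """Keep only the phrases that occur in at least two documents."""
    return {p: ids for p, ids in phrase_index(documents).items() if len(ids) >= 2}


shared = shared_phrases(docs)
for phrase, ids in sorted(shared.items()):
    print(" ".join(phrase), sorted(ids))
```

Each shared phrase together with its document set is a candidate base cluster, for example "ate" with all three documents and "cat ate" with the first two.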

[Figure: suffix tree example for "cat ate cheese"]

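2-2. Base Cluster

score(B) = |B| · f(|P|)

• |B|: number of documents in base cluster B

• |P|: number of words in phrase P, after removing stopwords (stoplist words, and words that appear in <= 3 documents or in > 40% of the documents)

• f: penalizes single-word phrases, constant for |P| > 6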
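A minimal sketch of the scoring rule, under stated assumptions: the slide only says that f penalizes single words and is constant beyond 6 words, so the 0.5 penalty and the linear growth between 2 and 6 words are illustrative choices, and all function and parameter names here are hypothetical.

```python
def effective_length(phrase_words, doc_freq, n_docs, stoplist=frozenset()):
    """Count the words of a phrase that are not treated as stopwords.

    A word is ignored if it is in the stoplist, appears in <= 3 documents,
    or appears in more than 40% of the documents.
    """
    count = 0
    for word in phrase_words:
        df = doc_freq.get(word, 0)
        if word in stoplist or df <= 3 or df > 0.4 * n_docs:
            continue
        count += 1
    return count


def f(length):
    """Phrase-length factor (assumed shape, see lead-in)."""
    if length <= 0:
        return 0.0
    if length == 1:
        return 0.5  # assumed penalty for single-word phrases
    return float(min(length, 6))  # assumed linear up to 6, constant afterwards


def score(cluster_size, phrase_words, doc_freq, n_docs, stoplist=frozenset()):
    """score(B) = |B| * f(|P|)."""
    return cluster_size * f(effective_length(phrase_words, doc_freq, n_docs, stoplist))


# Hypothetical document frequencies in a collection of 100 documents.
doc_freq = {"the": 90, "suffix": 10, "tree": 12, "rare": 2}
print(score(5, ["the", "suffix", "tree"], doc_freq, n_docs=100))
```

In this usage, "the" is dropped because it appears in more than 40% of the documents, so the phrase counts as two words.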

2-3. Combining Base Clusters

• Keep the top k (= 500) base clusters

• Merge base clusters with high overlap: merge Bi & Bj iff

|Bi∩Bj| / |Bi| > 0.5

and

|Bj∩Bi| / |Bj| > 0.5
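The merge rule can be sketched as follows. Treating merges as transitive (taking connected components of the "high overlap" graph) is an assumption here, since the slide only states the pairwise condition; base clusters are represented as non-empty sets of document ids, and the function names are hypothetical.

```python
def high_overlap(a, b, threshold=0.5):
    """True iff |a ∩ b| / |a| and |a ∩ b| / |b| both exceed the threshold."""
    common = len(a & b)
    return common / len(a) > threshold and common / len(b) > threshold


def merge_base_clusters(base_clusters, threshold=0.5):
    """Union base clusters that are linked by the overlap test (union-find)."""
    parent = list(range(len(base_clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(base_clusters)):
        for j in range(i + 1, len(base_clusters)):
            if high_overlap(base_clusters[i], base_clusters[j], threshold):
                parent[find(i)] = find(j)

    merged = {}
    for i, members in enumerate(base_clusters):
        merged.setdefault(find(i), set()).update(members)
    return list(merged.values())


base = [{0, 1}, {0, 1, 2}, {0, 2}, {1, 2}, {5, 6}]
clusters = merge_base_clusters(base)
print(sorted(sorted(c) for c in clusters))
```

Note that {0, 1} and {0, 2} do not pass the test directly (the overlap is exactly 0.5, not above it), but both are linked through {0, 1, 2}.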

2-4. Advantages

• High precision even when using only snippets

• Incremental and linear time

• Order independent

• No magic k (number of clusters)

- but what about the top k base clusters? And the 0.5 threshold?

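3. New Suffix Tree Clustering

• Each document di is a vector of tf-idf weights over the suffix tree nodes n1, n2, …

di^T = [tfidf(n1, di), tfidf(n2, di), …]

• Clustering: Group-average AHC (GAHC)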
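A rough sketch of the two ingredients, with several assumptions: the slide does not spell out the tf-idf weighting, so a common log-scaled variant is used; cosine similarity is assumed as the document similarity; and "group average" is taken to mean the mean pairwise similarity between two clusters. The node counts below are filled in by hand for the three example documents rather than read from a real suffix tree.

```python
import math


def tfidf_vectors(node_tf):
    """node_tf[i][j] = occurrences of node n_j's phrase in document d_i."""
    n_docs = len(node_tf)
    n_nodes = len(node_tf[0])
    df = [sum(1 for row in node_tf if row[j] > 0) for j in range(n_nodes)]
    vectors = []
    for row in node_tf:
        vec = []
        for j, tf in enumerate(row):
            if tf > 0:
                # Assumed weighting: (1 + log tf) * log(1 + N / df)
                vec.append((1 + math.log(tf)) * math.log(1 + n_docs / df[j]))
            else:
                vec.append(0.0)
        vectors.append(vec)
    return vectors


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_average(cluster_a, cluster_b, vectors):
    """Mean pairwise cosine similarity between two clusters of document ids."""
    total = sum(cosine(vectors[i], vectors[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Nodes: "cat ate", "ate", "cheese", "mouse", "too", "ate cheese"
node_tf = [
    [1, 1, 1, 0, 0, 1],  # cat ate cheese
    [1, 1, 0, 1, 1, 0],  # cat ate mouse too
    [0, 1, 1, 1, 1, 1],  # mouse ate cheese too
]
vecs = tfidf_vectors(node_tf)
print(cosine(vecs[0], vecs[1]), cosine(vecs[1], vecs[2]))
print(group_average([0], [1, 2], vecs))
```

An agglomerative loop would start with one cluster per document and repeatedly merge the pair with the highest `group_average` value until a stopping criterion is met.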

4. Evaluation

• Use F-measure

precision(Ci, Gj) = |Ci∩Gj| / |Ci|

recall(Ci, Gj) = |Ci∩Gj| / |Gj|
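The two definitions translate directly into code. The slide does not show how precision and recall are combined, so the harmonic mean and the class-size-weighted overall score below are assumptions based on the usual form of the clustering F-measure; clusters Ci and ground-truth classes Gj are sets of document ids.

```python
def precision(cluster, group):
    return len(cluster & group) / len(cluster)


def recall(cluster, group):
    return len(cluster & group) / len(group)


def f_measure(cluster, group):
    """Harmonic mean of precision and recall (assumed combination)."""
    p, r = precision(cluster, group), recall(cluster, group)
    return 2 * p * r / (p + r) if p + r else 0.0


def overall_f(clusters, groups):
    """Weight each class by its size and take its best-matching cluster."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * max(f_measure(c, g) for c in clusters) for g in groups)


clusters = [{0, 1, 2}, {3, 4}]
groups = [{0, 1}, {2, 3, 4}]
print(overall_f(clusters, groups))
```

In this toy case document 2 is placed in the wrong cluster, which costs precision for the first class and recall for the second.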

• OHSUMED Document Collection

- MeSH indexing terms

• RCV1 Document Collection

- categories

5. Comparison

• STC: seldom generates large clusters

• NSTC: not incremental