A New Suffix Tree Similarity Measure for Document Clustering

Hung Chim, Xiaotie Deng

WWW 07

1. Document Clustering

• Agglomerative Hierarchical Clustering (AHC)

• Suffix Tree Clustering (STC)

- commonly used for search result clustering

2-1. Suffix Tree Clustering

Ex: 3 documents

• cat ate cheese

• cat ate mouse too

• mouse ate cheese too

cat ate cheese
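The three example documents can be turned into candidate base clusters with a few lines of code. The sketch below is only an illustration: it enumerates every word-level phrase directly (quadratic per document) instead of building a real suffix tree, and the names `phrase_index` and `base_clusters` are hypothetical. Unlike a compact suffix tree, it also lists phrases such as "cat" that would not get their own tree node.

```python
from collections import defaultdict

def phrase_index(docs):
    """Map each contiguous word sequence to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.split()
        for start in range(len(words)):                    # every suffix
            for end in range(start + 1, len(words) + 1):   # every prefix of that suffix
                index[tuple(words[start:end])].add(doc_id)
    return index

def base_clusters(docs, min_docs=2):
    """Keep only phrases shared by at least `min_docs` documents (assumed cutoff)."""
    return {p: ids for p, ids in phrase_index(docs).items() if len(ids) >= min_docs}

docs = ["cat ate cheese", "cat ate mouse too", "mouse ate cheese too"]
clusters = base_clusters(docs)
for phrase, ids in sorted(clusters.items()):
    print(" ".join(phrase), sorted(ids))
```

A real suffix tree gives the same phrase-to-documents mapping in linear time, which is where the speed advantage of STC comes from.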

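2-2. Base Cluster

score(B) = |B| · f(|P|)

• |B|: number of documents in base cluster B

• |P|: number of words in phrase P, not counting stopwords and words that appear in <= 3 documents or in > 40% of the documents

• f: penalizes single-word phrases, constant for |P| > 6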
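The base cluster score can be sketched as follows. Several details are assumptions, not taken from the slides: the stopword list, the single-word penalty value of 0.5, and the choice to let f grow linearly between 2 and 6 words. The function names are hypothetical.

```python
STOPWORDS = {"a", "an", "the", "of", "to", "too"}  # illustrative list, an assumption

def effective_length(phrase, doc_freq, n_docs, stopwords=STOPWORDS):
    """Count words of the phrase that are not stopwords, too rare, or too common."""
    count = 0
    for word in phrase:
        df = doc_freq.get(word, 0)
        if word in stopwords or df <= 3 or df > 0.4 * n_docs:
            continue  # this word contributes nothing to |P|
        count += 1
    return count

def f(length, single_word_penalty=0.5):
    """Phrase-length factor: penalize single words, cap at 6 (shape in between is assumed)."""
    if length == 0:
        return 0.0
    if length == 1:
        return single_word_penalty
    return float(min(length, 6))

def score(doc_ids, phrase, doc_freq, n_docs):
    """score(B) = |B| * f(|P|)."""
    return len(doc_ids) * f(effective_length(phrase, doc_freq, n_docs))

# Toy document frequencies for a collection of 100 documents (made-up numbers).
doc_freq = {"suffix": 10, "tree": 12, "the": 90, "rare": 2}
print(score({1, 2, 3, 4, 5}, ("the", "suffix", "tree"), doc_freq, 100))
```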

2-3. Combining Base Cluster

• Keep the top k (= 500) base clusters

• Merge base clusters with high overlap: merge Bi & Bj iff

|Bi∩Bj| / |Bi| > 0.5

|Bi∩Bj| / |Bj| > 0.5
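The merge rule can be sketched as below. The slide only states the pairwise condition; treating linked base clusters as connected components (via a small union-find) is an assumption here, and `merge_base_clusters` is a hypothetical name. Base clusters are assumed to be non-empty sets of document ids.

```python
def merge_base_clusters(base, threshold=0.5):
    """Union base clusters that are linked by the mutual-overlap rule."""
    n = len(base)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            common = len(base[i] & base[j])
            # Both ratios must exceed the threshold (strictly).
            if common / len(base[i]) > threshold and common / len(base[j]) > threshold:
                parent[find(i)] = find(j)

    merged = {}
    for i in range(n):
        merged.setdefault(find(i), set()).update(base[i])
    return list(merged.values())

# Base clusters of the three-document example (document ids 0, 1, 2):
# "cat ate", "ate", "cheese", "mouse", "too", "ate cheese"
base = [{0, 1}, {0, 1, 2}, {0, 2}, {1, 2}, {1, 2}, {0, 2}]
print(merge_base_clusters(base))
```

In this toy example every two-document base cluster overlaps "ate" (all three documents) by more than half in both directions, so everything collapses into one cluster.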

2-4. Advantage

• High precision even when using only snippets

• Incremental and linear time

• Order independent

• No magic k

- but what about the top k base clusters and the 0.5 threshold?

3. New Suffix Tree Clustering

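• Each document di becomes a vector over the suffix tree nodes n1, n2, …:

di^T = [tfidf(n1, di), tfidf(n2, di), …]

• Cluster the vectors with Group-average AHC (GAHC)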
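A minimal sketch of this pipeline is given below, under several assumptions that the slides do not confirm: word-level phrases stand in for suffix tree nodes, the tf-idf weight is one common log-scaled variant, documents are compared by cosine similarity, cluster similarity is the average pairwise similarity between members of two clusters, and merging stops at a requested number of clusters. All function names are hypothetical.

```python
import math
from collections import Counter

def node_counts(text):
    """Count every contiguous word sequence (a stand-in for suffix tree nodes)."""
    words = text.split()
    return Counter(tuple(words[s:e])
                   for s in range(len(words))
                   for e in range(s + 1, len(words) + 1))

def tfidf_vectors(docs):
    """Sparse vectors {node: weight}; the weighting formula is an assumed variant."""
    counts = [node_counts(d) for d in docs]
    df = Counter(node for c in counts for node in c)  # document frequency per node
    n = len(docs)
    return [{node: (1 + math.log(tf)) * math.log(1 + n / df[node])
             for node, tf in c.items()} for c in counts]

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def gahc(vectors, n_clusters):
    """Repeatedly merge the two clusters with the highest average pairwise similarity."""
    clusters = [[i] for i in range(len(vectors))]
    sim = [[cosine(u, v) for v in vectors] for u in vectors]

    def avg_sim(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters)))
        a, b = max(pairs, key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

docs = ["cat ate cheese", "cat ate mouse too", "mouse ate cheese too",
        "dog chased ball", "dog chased stick"]
vectors = tfidf_vectors(docs)
result = gahc(vectors, n_clusters=2)
print(result)
```

Because all pairwise similarities are computed up front, this approach is not incremental, which matches the trade-off noted for NSTC.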

4. Evaluation

• Use F-measure

precision(Ci, Gj) = |Ci∩Gj| / |Ci|

recall(Ci, Gj) = |Ci∩Gj| / |Gj|

• OHSUMED Document Collection

- MeSH indexing terms

• RCV1 Document Collection

- categories
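The precision and recall definitions above can be combined into an F-measure as sketched below. The slides do not spell out the combination, so two things are assumptions: the harmonic mean F = 2PR / (P + R) for a cluster Ci and a class Gj, and an overall score that weights each class by its size and takes its best-matching cluster. Clusters and classes are assumed to be non-empty sets of document ids.

```python
def precision(cluster, group):
    return len(cluster & group) / len(cluster)

def recall(cluster, group):
    return len(cluster & group) / len(group)

def f_measure(cluster, group):
    """Harmonic mean of precision and recall (assumed form)."""
    p, r = precision(cluster, group), recall(cluster, group)
    return 2 * p * r / (p + r) if p + r else 0.0

def overall_f(clusters, groups):
    """Class-size-weighted average of each class's best F over all clusters (assumed)."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * max(f_measure(c, g) for c in clusters) for g in groups)

cluster = {1, 2, 3, 4}
group = {3, 4, 5, 6, 7, 8}
print(precision(cluster, group), recall(cluster, group), f_measure(cluster, group))
```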

5. Comparison

• STC: seldom generates large clusters

• NSTC: not incremental
