A New Suffix Tree Similarity Measure for Document Clustering
Hung Chim, Xiaotie Deng
WWW 07
1. Document Clustering
• Agglomerative Hierarchical Clustering (AHC)
• Suffix Tree Clustering (STC)
- commonly used for clustering search results
2-1. Suffix Tree Clustering
Ex: 3 documents
• cat ate cheese
• cat ate mouse too
• mouse ate cheese too
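As a rough illustration of the example above, the following Python sketch lists every word-level phrase that occurs in the three documents and records which documents contain it. It is only a sketch: it uses a plain dictionary instead of a real linear-time suffix tree, and the name `phrase_index` is made up for this example.

```python
from collections import defaultdict

def phrase_index(docs):
    """Map each word-level phrase to the set of ids of documents containing it.

    Every phrase is a prefix of some suffix of a document, which is what
    the nodes of a word-level suffix tree represent.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                index[tuple(words[start:end])].add(doc_id)
    return index

docs = ["cat ate cheese", "cat ate mouse too", "mouse ate cheese too"]
index = phrase_index(docs)

# Phrases shared by at least two documents are the candidate base clusters.
shared = {p: ids for p, ids in index.items() if len(ids) >= 2}
for phrase, ids in sorted(shared.items()):
    print(" ".join(phrase), sorted(ids))
```

The shared phrases here are "ate", "ate cheese", "cat", "cat ate", "cheese", "mouse" and "too". A compact suffix tree would not keep a separate node for "cat", because "cat" is always followed by "ate" in these documents.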
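2-2. Base Cluster
score(B) = |B| · f(|P|)
• |B|: number of documents in base cluster B
• |P|: number of words in phrase P, not counting stopwords or words that appear in <= 3 documents or in > 40% of the documents
• f: penalizes single-word phrases; constant for |P| > 6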
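The base cluster score can be sketched in Python as follows. The shape of `phrase_factor` is an assumption: the slides only say that f penalizes single words and is constant beyond 6 words, so the penalty value 0.5 and the linear growth in between are placeholders. All function names are hypothetical.

```python
def phrase_factor(effective_len):
    """Hypothetical f: penalize single words, grow linearly, cap at 6 words."""
    if effective_len <= 0:
        return 0.0
    if effective_len == 1:
        return 0.5  # assumed penalty value, not taken from the slides
    return float(min(effective_len, 6))  # constant once |P| exceeds 6

def effective_length(phrase, doc_freq, n_docs, stopwords):
    """Count the words of the phrase that are not ignored.

    A word is ignored if it is a stopword, appears in <= 3 documents,
    or appears in more than 40% of the documents.
    """
    count = 0
    for word in phrase:
        df = doc_freq.get(word, 0)
        if word in stopwords or df <= 3 or df > 0.4 * n_docs:
            continue
        count += 1
    return count

def base_cluster_score(doc_ids, phrase, doc_freq, n_docs, stopwords=frozenset()):
    """score(B) = |B| * f(|P|) for a base cluster with the given phrase."""
    length = effective_length(phrase, doc_freq, n_docs, stopwords)
    return len(doc_ids) * phrase_factor(length)

# Toy numbers, invented for illustration only.
doc_freq = {"the": 90, "suffix": 10, "tree": 12}
score = base_cluster_score({1, 2, 3, 4}, ("the", "suffix", "tree"), doc_freq, n_docs=100)
print(score)
```

In this toy call "the" is dropped because it appears in more than 40% of the documents, so the effective phrase length is 2 and the score is 4 times f(2).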
2-3. Combining Base Cluster
• Keep the top k (= 500) base clusters
• Merge base clusters with high overlap
merge Bi & Bj iff
|Bi∩Bj| / |Bi| > 0.5 and
|Bj∩Bi| / |Bj| > 0.5
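A minimal sketch of the merge rule above, assuming the final clusters are the connected components of the graph whose edges link base clusters that satisfy both overlap conditions. The union-find helper and the name `merge_base_clusters` are choices made for this example, and every base cluster is assumed to be non-empty.

```python
def merge_base_clusters(base_clusters, threshold=0.5):
    """Merge base clusters (sets of document ids) that overlap heavily.

    Bi and Bj are linked iff |Bi∩Bj|/|Bi| > threshold and
    |Bi∩Bj|/|Bj| > threshold; linked groups are merged by union.
    """
    n = len(base_clusters)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative, halving paths on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            bi, bj = base_clusters[i], base_clusters[j]
            common = len(bi & bj)
            if common / len(bi) > threshold and common / len(bj) > threshold:
                parent[find(i)] = find(j)

    merged = {}
    for i in range(n):
        merged.setdefault(find(i), set()).update(base_clusters[i])
    return list(merged.values())

clusters = merge_base_clusters([{1, 2, 3}, {2, 3, 4}, {7, 8}, {8, 9, 10}])
print(clusters)
```

The first two base clusters share two of their three documents and are merged. The last two share only one document, which gives an overlap of exactly 0.5 for {7, 8} and therefore does not pass the strict inequality.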
2-4. Advantage
• High precision even when using only snippets
• Incremental and linear time
• Order independent
• No magic k (the number of clusters is not fixed in advance)
But: what about the top k base clusters and the 0.5 threshold?
3. New Suffix Tree Clustering
Each document di is a vector of tf-idf weights over the suffix tree nodes n1, n2, …:
di^T = [tfidf(n1, di), tfidf(n2, di), …]
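The vector representation can be sketched as below. Several things are assumptions here: the nodes are approximated by all word-level phrases of a document rather than by a real suffix tree, the weighting (1 + log tf) · log(1 + N/df) is one common tf-idf variant and not necessarily the one used in the talk, and cosine similarity is used to compare two vectors. The helper names are hypothetical.

```python
import math
from collections import Counter

def phrases(text):
    """All word-level phrases of a document (stand-ins for suffix tree nodes)."""
    words = text.split()
    return [tuple(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]

def tfidf_vectors(docs):
    """One sparse vector (dict node -> weight) per document."""
    counts = [Counter(phrases(d)) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each node counted once per document
    return [
        {node: (1 + math.log(tf)) * math.log(1 + n_docs / df[node])
         for node, tf in c.items()}
        for c in counts
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = ["cat ate cheese", "cat ate mouse too", "mouse ate cheese too"]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Because the vectors are indexed by phrases rather than by single words, two documents score higher when they share longer word sequences, not just the same vocabulary.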
Cluster the document vectors with Group-average AHC (GAHC)
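A naive sketch of group-average agglomerative clustering, assuming the similarity between two clusters is the mean of all pairwise document similarities across them. The function name `gahc`, the stopping rule (a target number of clusters) and the toy similarity matrix are assumptions for illustration. This version recomputes averages on every pass, so it is far slower than a production implementation.

```python
def gahc(sim, n_clusters):
    """Group-average AHC over a symmetric pairwise similarity matrix.

    sim is a list of lists; the result is a list of clusters, each a
    list of document indices.
    """
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(a, b):
        # Mean similarity over all cross-cluster document pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > max(n_clusters, 1):
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge the closest pair
        del clusters[y]
    return clusters

# Toy similarity matrix, invented for illustration.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(gahc(sim, 2))
```

With this matrix the two closest documents (0 and 1) are merged first, then documents 2 and 3.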
4. Evaluation
• Use F-measure
precision(Ci, Gj) = |Ci∩Gj| / |Ci|
recall(Ci, Gj) = |Ci∩Gj| / |Gj|
• OHSUMED Document Collection
- MeSH indexing terms
• RCV1 Document Collection
- categories
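The evaluation measure can be sketched as follows. Precision and recall follow the definitions above. Combining them with the harmonic mean and taking a size-weighted maximum over clusters is the usual clustering F-measure, but that combination is an assumption here, as is treating the reference groups Gj as disjoint. The function names are hypothetical.

```python
def f_measure(cluster, group):
    """Harmonic mean of precision(Ci, Gj) and recall(Ci, Gj)."""
    common = len(cluster & group)
    if common == 0:
        return 0.0
    precision = common / len(cluster)
    recall = common / len(group)
    return 2 * precision * recall / (precision + recall)

def overall_f(clusters, groups):
    """Weight each reference group by its size and take its best-matching cluster."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * max(f_measure(c, g) for c in clusters)
               for g in groups)

# Toy clustering and reference classes, invented for illustration.
clusters = [{1, 2, 3}, {4, 5, 6}]
groups = [{1, 2}, {3, 4, 5, 6}]
print(round(overall_f(clusters, groups), 4))
```

In this toy case the first group is best matched by the first cluster (F = 0.8) and the second group by the second cluster (F = 6/7), giving an overall value of about 0.838.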
5. Comparison
• STC: seldom generates large clusters
• NSTC: not incremental