A Latent Semantic Indexing-based approach to multilingual document clustering
Chih-Ping Wei, Christopher C. Yang, Chia-Min Lin
Decision Support Systems 45 (2008) 606-620
Reporter : Yi Ru, Lee
Outline
Introduction
Latent Semantic Indexing (LSI)
LSI-based multilingual document clustering technique
Empirical evaluation
Conclusion
Translation-based
Synonymy
Polysemy
Vocabulary
Multilingual space
Latent Semantic Indexing (LSI)
Lexical matching
Reduce the dimensions
Singular Value Decomposition (SVD)
$X_k = U_k \Sigma_k V_k^T$
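As a rough illustration of the rank-k truncation, the sketch below uses NumPy's `numpy.linalg.svd` on a tiny term-by-document matrix. The matrix values, the choice `k=2`, and the helper name `truncated_svd` are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def truncated_svd(X, k):
    """Return (U_k, s_k, Vt_k), the k largest singular triplets of X."""
    # full_matrices=False gives the compact SVD; singular values come sorted descending.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical term-by-document matrix (rows: terms, columns: documents).
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

U_k, s_k, Vt_k = truncated_svd(X, k=2)

# Rank-k approximation X_k = U_k * Sigma_k * V_k^T.
X_k = U_k @ np.diag(s_k) @ Vt_k
```

Each column of `Vt_k` (scaled or not by `s_k`, depending on convention) can then serve as the reduced representation of the corresponding document.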
$d_i = U_k \Sigma_k \hat{d}_i$
$\hat{d}_i = \Sigma_k^{-1} U_k^T d_i$
Multilingual semantic space analysis
Document folding-in
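Folding-in can be sketched with the standard LSI projection, in which a term vector d is mapped to Sigma_k^{-1} U_k^T d. This is a minimal example that assumes this column-vector convention; the matrix values and the helper name `fold_in` are hypothetical.

```python
import numpy as np

def fold_in(d, U_k, s_k):
    """Project term vector d into the k-dimensional LSI space: Sigma_k^{-1} U_k^T d."""
    # Dividing by s_k is the same as multiplying by the inverse of diag(s_k).
    return (U_k.T @ d) / s_k

# Hypothetical term-by-document matrix used to build the semantic space.
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

# Sanity check: folding in a document that was used to build the space
# reproduces that document's coordinates in V_k.
d0_hat = fold_in(X[:, 0], U_k, s_k)

# A new (unseen) document is folded in the same way, without recomputing the SVD.
new_doc = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
new_doc_hat = fold_in(new_doc, U_k, s_k)
```

The appeal of folding-in is that new documents are placed in the existing space without recomputing the decomposition.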
Dimension Selection
$DL(D_j) = \sum_i w_{ji}$
$D_j$ denotes LSI dimension $j$.
$w_{ji}$ is the weight of document $i$ in $D_j$.
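A minimal sketch of how such a score might drive dimension selection follows. It assumes that DL(D_j) is the plain sum of the document weights in dimension j and that the h dimensions with the largest scores are kept. The slides do not state the ranking rule or whether absolute values are taken, so both are assumptions, as are the name `select_dimensions` and the toy weights.

```python
import numpy as np

def select_dimensions(W, h):
    """W[j, i] is the weight of document i in LSI dimension j.

    Returns (indices of the h dimensions with the largest DL, all DL scores).
    Assumption: larger DL means the dimension is kept.
    """
    dl = W.sum(axis=1)                       # DL(D_j) = sum_i w_ji
    top = np.argsort(-dl, kind="stable")[:h] # h largest scores, ties by index
    return np.sort(top), dl

# Hypothetical weights: 3 LSI dimensions (rows) by 3 documents (columns).
W = np.array([
    [1.0, 2.0, 3.0],
    [0.0, 0.0, 1.0],
    [4.0, 4.0, 4.0],
])
kept, dl = select_dimensions(W, h=2)

# Documents are then represented only by the kept dimensions.
W_reduced = W[kept, :]
```

If the weights can be negative, summing `np.abs(W)` instead may be closer to the intent; that choice is left open here.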
Clustering
Hierarchical clustering algorithm
$\cos(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
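The cosine measure and a hierarchical (agglomerative) clustering loop can be sketched as follows. The slides name only "hierarchical clustering" and cosine similarity. The average-link merge rule, the similarity-threshold stopping criterion, and the toy vectors are assumptions made for this illustration.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0  # convention for zero vectors (an assumption)
    return dot / (nx * ny)

def agglomerate(vectors, threshold):
    """Average-link agglomerative clustering.

    Starts with one cluster per document and repeatedly merges the most
    similar pair until the best similarity falls below `threshold`.
    Returns a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -2.0, None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                s = avg_sim(clusters[p], clusters[q])
                if s > best:
                    best, pair = s, (p, q)
        if best < threshold:
            break
        p, q = pair
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# Hypothetical document vectors in a 2-dimensional LSI space.
docs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
clusters = agglomerate(docs, threshold=0.8)
```

Varying `threshold` yields coarser or finer clusterings, which is one way a precision/recall trade-off curve could be traced.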
Cluster recall: $CR = \frac{|CA|}{|TA|}$
Cluster precision: $CP = \frac{|CA|}{|GA|}$
TA is the set of associations in the true categories.
GA is the set of associations in the clusters generated by the document clustering technique.
CA is the set of correct associations that exist in both the clusters and the true categories.
Examples
TA={(e1−e2),(c1−c2), (e1−c1), (e1−c2), (e2−c1), (e2−c2), (e3−e4),(c3−c4), (c3−c5), (c4−c5), (e3−c3), (e3−c4), (e3−c5), (e4−c3), (e4−c4), (e4−c5)}
GA={(e1−e2), (c1−c3), (e1−c1), (e1−c3), (e2−c1), (e2−c3), (e3−e4), (e3−c2), (e4−c2), (c4−c5)}
CA={(e1−e2), (e1−c1), (e2−c1), (e3−e4), (c4−c5)}
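The example sets can be reproduced with a few lines of Python, taking cluster recall as |CA|/|TA| and cluster precision as |CA|/|GA|. The groupings below are inferred from the pair lists above (two true categories, three generated clusters). The helper name `associations` is hypothetical.

```python
from itertools import combinations

def associations(groups):
    """All unordered pairs of documents that share a group."""
    return {frozenset(pair) for g in groups for pair in combinations(g, 2)}

# Groupings inferred from the TA and GA pair lists in the example.
true_categories = [["e1", "e2", "c1", "c2"], ["e3", "e4", "c3", "c4", "c5"]]
generated_clusters = [["e1", "e2", "c1", "c3"], ["e3", "e4", "c2"], ["c4", "c5"]]

TA = associations(true_categories)      # 16 associations
GA = associations(generated_clusters)   # 10 associations
CA = TA & GA                            # 5 correct associations

cluster_recall = len(CA) / len(TA)      # 5 / 16 = 0.3125
cluster_precision = len(CA) / len(GA)   # 5 / 10 = 0.5
```

For this example the clustering recovers 5 of the 16 true associations (recall 0.3125), and half of the associations it generates are correct (precision 0.5).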
PRT curves of the LSI-based MLDC technique
Comparisons of different representation schemes
Effect of dimension selection (h=5 for MLDC with dimension selection; k=5 for MLDC without dimension selection)
Effect of dimension selection (h=20 for MLDC with dimension selection; k=20 for MLDC without dimension selection)
Best scenario versus best scenario comparison
PRT curves of overall, monolingual, and cross-lingual performance
monolingual PRT curve > overall PRT curve > cross-lingual PRT curve
Specific domain
Thank you