A Latent Semantic Indexing-based approach to multilingual document clustering

Post on 30-Dec-2015

DESCRIPTION

A Latent Semantic Indexing-based approach to multilingual document clustering. Chih-Ping Wei, Christopher C. Yang, Chia-Min Lin. Decision Support Systems 45 (2008) 606-620. Reporter: Yi Ru, Lee. Outline: Introduction, Latent Semantic Indexing (LSI)

TRANSCRIPT

Chih-Ping Wei, Christopher C. Yang, Chia-Min Lin

Decision Support Systems 45 (2008) 606-620

Reporter : Yi Ru, Lee

Introduction

Latent Semantic Indexing (LSI)

LSI-based multilingual document clustering technique

Empirical evaluation

Conclusion

Translation-based

Synonymy

Polysemy

Vocabulary

Multilingual space

Latent Semantic Indexing (LSI)

Lexical matching

Reduce the dimensions

Singular Value Decomposition (SVD)

$$X_k = U_k \Sigma_k V_k^T$$

$$d_i = U_k \Sigma_k \hat{d}_i^T$$

$$\hat{d}_i = d_i^T U_k \Sigma_k^{-1}$$
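A minimal sketch of the rank-k SVD and of projecting a document into the reduced LSI space, assuming NumPy is available. The toy term-document matrix and the function names `truncated_svd` and `fold_in` are hypothetical, chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np

def truncated_svd(X, k):
    """Return U_k, the k largest singular values, and V_k^T for matrix X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in(d, U_k, s_k):
    """Project a term vector d into the LSI space: d^T U_k Sigma_k^{-1}."""
    # Dividing by s_k elementwise equals right-multiplying by Sigma_k^{-1}.
    return (d @ U_k) / s_k

# Hypothetical toy term-document matrix (rows: terms, columns: documents).
X = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

k = 2
U_k, s_k, Vt_k = truncated_svd(X, k)

# Rank-k approximation X_k = U_k Sigma_k V_k^T.
X_k = U_k @ np.diag(s_k) @ Vt_k

# Folding in a document that was already a column of X recovers its
# coordinates in V_k, which is a quick sanity check on the projection.
d_hat = fold_in(X[:, 0], U_k, s_k)
```

The same `fold_in` call can be applied to a term vector that was not part of `X`, which is how new documents are placed in an existing LSI space without recomputing the SVD.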

Multilingual semantic space analysis

Document folding-in

Dimension Selection

$$DL(D_j) = \sum_{i} w_{ji}$$

$D_j$ denotes the LSI dimension $j$.

$w_{ji}$ is the weight of document $i$ in $D_j$.

Clustering

Hierarchical clustering algorithm

$$\cos(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$
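A minimal sketch of cosine similarity driving an agglomerative (bottom-up) hierarchical clustering. The slides do not say which linkage or stopping rule is used, so the average-link criterion, the similarity `threshold`, and the toy vectors here are assumptions made only for illustration.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention for zero vectors (assumption)
    return dot / (norm_x * norm_y)

def average_link(c1, c2, vectors):
    """Mean pairwise cosine similarity between two clusters of indices."""
    total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def cluster(vectors, threshold):
    """Merge the most similar pair of clusters until none reaches threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best_sim, best_pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = average_link(clusters[a], clusters[b], vectors)
                if sim > best_sim:
                    best_sim, best_pair = sim, (a, b)
        if best_sim < threshold:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical document vectors in a 2-dimensional LSI space.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = cluster(docs, threshold=0.5)
```

Varying the threshold changes how many clusters survive, which is one way to trace out different precision and recall trade-offs.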

Cluster recall: $CR = \frac{|CA|}{|TA|}$

Cluster precision: $CP = \frac{|CA|}{|GA|}$

TA is the set of associations in the true categories.

GA is the set of associations in the clusters generated by the document clustering technique. CA is the set of correct associations that exists in both the clusters and the true categories.

Examples

TA={(e1−e2), (c1−c2), (e1−c1), (e1−c2), (e2−c1), (e2−c2), (e3−e4), (c3−c4), (c3−c5), (c4−c5), (e3−c3), (e3−c4), (e3−c5), (e4−c3), (e4−c4), (e4−c5)}

GA={(e1−e2), (c1−c3), (e1−c1), (e1−c3), (e2−c1), (e2−c3), (e3−e4), (e3−c2), (e4−c2), (c4−c5)}

CA={(e1−e2), (e1−c1), (e2−c1), (e3−e4), (c4−c5)}
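The association sets in this example can be rebuilt from document groups and used to compute cluster recall and precision. In the sketch below, the true categories and the generated clusters are inferred from the pairs listed in TA and GA; the function names are hypothetical.

```python
from itertools import combinations

def associations(groups):
    """Return the set of unordered document pairs that share a group."""
    pairs = set()
    for group in groups:
        # Sorting gives each pair one canonical order, e.g. ("c1", "e1").
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs

def cluster_recall_precision(true_categories, clusters):
    """Compute (|CA| / |TA|, |CA| / |GA|) from two groupings."""
    ta = associations(true_categories)
    ga = associations(clusters)
    ca = ta & ga  # correct associations appear in both sets
    return len(ca) / len(ta), len(ca) / len(ga)

# Groupings inferred from the TA and GA pairs in the example.
true_categories = [
    {"e1", "e2", "c1", "c2"},
    {"e3", "e4", "c3", "c4", "c5"},
]
clusters = [
    {"e1", "e2", "c1", "c3"},
    {"e3", "e4", "c2"},
    {"c4", "c5"},
]

recall, precision = cluster_recall_precision(true_categories, clusters)
print(recall, precision)  # 0.3125 0.5
```

With |TA| = 16, |GA| = 10 and |CA| = 5, cluster recall is 5/16 and cluster precision is 5/10. The sketch assumes each grouping yields at least one pair; otherwise the division would fail.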

PRT curves of the LSI-based MLDC technique

Comparisons of different representation schemes

Effect of dimension selection (h=5 for MLDC with dimension selection; k=5 for MLDC without dimension selection)

Effect of dimension selection (h=20 for MLDC with dimension selection; k=20 for MLDC without dimension selection)

Best scenario versus best scenario comparison

PRT curves of overall, monolingual, and cross-lingual performance

monolingual PRT curve > overall PRT curve > cross-lingual PRT curve

Specific domain

Thank you
