A Latent Semantic Indexing-based approach to multilingual document clustering

Chih-Ping Wei, Christopher C. Yang, Chia-Min Lin Decision Support Systems 45 (2008) 606-620 Reporter : Yi Ru, Lee


TRANSCRIPT

Page 1: A Latent Semantic Indexing-based approach to multilingual document clustering

Chih-Ping Wei, Christopher C. Yang, Chia-Min Lin

Decision Support Systems 45 (2008) 606-620

Reporter : Yi Ru, Lee

Page 2

Outline

Introduction

Latent Semantic Indexing(LSI)

LSI-based multilingual document clustering technique

Empirical evaluation

Conclusion

Page 3

Translation-based

Synonymy

Polysemy

Vocabulary

Multilingual space

Latent Semantic Indexing (LSI)

Lexical matching

Reduce the dimensions

Page 4

Singular Value Decomposition (SVD)

$X_k = U_k \Sigma_k V_k^T$
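The rank-k decomposition on this slide can be illustrated with a minimal sketch, assuming NumPy and a toy term-document matrix. The matrix values, the helper name `truncated_svd`, and the choice k=2 are made up for illustration and do not come from the paper.

```python
import numpy as np

def truncated_svd(X, k):
    """Return U_k, the top-k singular values, and V_k^T for a term-document matrix X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Toy term-document matrix (assumption): rows are terms, columns are documents.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

U_k, s_k, Vt_k = truncated_svd(X, k=2)

# Rank-2 approximation X_k = U_k * Sigma_k * V_k^T.
X_k = U_k @ np.diag(s_k) @ Vt_k
```

Each column of `Vt_k` gives one document's coordinates in the reduced k-dimensional space.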

Page 5

$\hat{d}_i = d_i^T U_k \Sigma_k^{-1}$

$\hat{d}_i = d_i^T U_k$

Page 6

Page 7

Multilingual semantic space analysis

Page 8

Document folding-in
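A minimal sketch of folding-in, assuming the standard LSI projection $d^T U_k \Sigma_k^{-1}$ and NumPy. The toy matrix, the helper name `fold_in`, and the new document vector are hypothetical and are not taken from the paper.

```python
import numpy as np

def fold_in(d, U_k, s_k):
    """Project a term vector d into the k-dimensional LSI space: d^T U_k Sigma_k^{-1}."""
    return (d @ U_k) / s_k

# Toy term-document matrix (assumption): rows are terms, columns are documents.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Folding in a document already in X reproduces its coordinates in V_k.
d0_hat = fold_in(X[:, 0], U_k, s_k)

# A new document (hypothetical term counts) is placed in the same space
# without recomputing the SVD.
d_new = np.array([1.0, 1.0, 0.0, 0.0])
d_new_hat = fold_in(d_new, U_k, s_k)
```

Folding-in avoids recomputing the decomposition when documents are added, at the cost of leaving the semantic space itself unchanged.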

Page 9

Dimension Selection

$DL(D_j) = \sum_i w_{ji}$

$D_j$ denotes the LSI dimension j

$w_{ji}$ is the weight of document i in $D_j$
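The score DL(D_j), the sum of the document weights w_ji in dimension j, can be computed as in the sketch below. The slides do not say how the score is used to pick dimensions, so keeping the h highest-scoring dimensions is an assumption here, as are the helper names and the weight values.

```python
def dimension_scores(W):
    """DL(D_j) = sum over documents i of w_ji, where W[j][i] is the weight of document i in D_j."""
    return [sum(row) for row in W]

def select_dimensions(W, h):
    """Assumption: keep the h dimensions with the largest DL score (not stated in the slides)."""
    scores = dimension_scores(W)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:h])

# Hypothetical weights: 3 LSI dimensions (rows) by 4 documents (columns).
W = [
    [0.9, 0.8, 0.7, 0.6],
    [0.1, 0.0, 0.2, 0.1],
    [0.5, 0.4, 0.6, 0.5],
]

scores = dimension_scores(W)
kept = select_dimensions(W, h=2)
```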

Page 10

Clustering

Hierarchical clustering algorithm

$\cos(X, Y) = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
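The cosine similarity used to compare two document vectors can be written as a short sketch in plain Python. Returning 0.0 for a zero vector is an assumption made here to avoid division by zero; the slides do not cover that case.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: treat a zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_x * norm_y)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
```

A hierarchical clustering algorithm would apply this measure to pairs of documents, or clusters, in the reduced LSI space.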

Page 11

Page 12

Cluster recall: $CR = \dfrac{|CA|}{|TA|}$

Cluster precision: $CP = \dfrac{|CA|}{|GA|}$

TA is the set of associations in the true categories.

GA is the set of associations in the clusters generated by the document clustering technique.

CA is the set of correct associations that exist in both the clusters and the true categories.

Page 13

Examples

TA={(e1−e2),(c1−c2), (e1−c1), (e1−c2), (e2−c1), (e2−c2), (e3−e4),(c3−c4), (c3−c5), (c4−c5), (e3−c3), (e3−c4), (e3−c5), (e4−c3), (e4−c4), (e4−c5)}

GA={(e1−e2), (c1−c3), (e1−c1), (e1−c3), (e2−c1), (e2−c3), (e3−e4), (e3−c2), (e4−c2), (c4−c5)}

CA={(e1−e2), (e1−c1), (e2−c1), (e3−e4), (c4−c5)}
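The example can be checked with a short sketch that computes cluster recall |CA|/|TA| and cluster precision |CA|/|GA|. The cluster memberships below are not given on the slide; they are inferred from the listed pairs, and the helper name `associations` is hypothetical.

```python
from itertools import combinations

def associations(groups):
    """All unordered document pairs that share a group."""
    return {frozenset(pair) for g in groups for pair in combinations(sorted(g), 2)}

# Memberships inferred from the TA and GA pair lists (assumption).
true_categories = [{"e1", "e2", "c1", "c2"}, {"e3", "e4", "c3", "c4", "c5"}]
generated_clusters = [{"e1", "e2", "c1", "c3"}, {"e3", "e4", "c2"}, {"c4", "c5"}]

TA = associations(true_categories)
GA = associations(generated_clusters)
CA = TA & GA  # correct associations appear in both sets

cluster_recall = len(CA) / len(TA)     # 5 / 16
cluster_precision = len(CA) / len(GA)  # 5 / 10
```

With 16 true associations, 10 generated associations, and 5 correct ones, cluster recall is 0.3125 and cluster precision is 0.5.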

Page 14

PRT curves of the LSI-based MLDC technique

Page 15

Comparisons of different representation schemes

Page 16

Effect of dimension selection (h=5 for MLDC with dimension selection; k=5 for MLDC without dimension selection)

Page 17

Effect of dimension selection (h=20 for MLDC with dimension selection; k=20 for MLDC without dimension selection)

Page 18

Best scenario versus best scenario comparison

Page 19

PRT curves of overall, monolingual, and cross-lingual performance

Page 20

monolingual PRT curve > overall PRT curve > cross-lingual PRT curve

Specific domain

Page 21

Thank you
