Document Clustering via Matrix Representation
![Page 1: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/1.jpg)
Data Mining and Machine Learning Lab
Document Clustering via Matrix Representation
Xufei Wang, Jiliang Tang and Huan Liu
Arizona State University
![Page 2: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/2.jpg)
News Article
• Lead Paragraph
• Explanations
• Additional Information
![Page 3: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/3.jpg)
Research Paper
• Introduction
• Related Work
• Problem Statement
• Solution
• Experiment
• Conclusion
![Page 4: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/4.jpg)
Book
• Chapters
• References
• Appendix
![Page 5: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/5.jpg)
Document Organization
• Not randomly organized
• Put relevant content together
• Logically independent segments
![Page 6: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/6.jpg)
Matrix Space Model
• Represent a document as a matrix
  – Segment
  – Term
• Each segment is a vector of terms
  – Terms + frequency
![Page 7: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/7.jpg)
Vector Space Model
• Oversimplify a document
  – Mixing topics
  – Word order is lost
  – Susceptible to noise

$$V_d = (w_{1,d}, w_{2,d}, \ldots, w_{N,d})^T$$
![Page 8: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/8.jpg)
An Example
• The IEEE International Conference on Data Mining series (ICDM) has established itself as the world's premier research conference in data mining. It provides an international forum for presentation of original research results, as well as exchange and dissemination of innovative, practical development experiences. The conference covers all aspects of data mining, including algorithms, software and systems, and applications.
• Vancouver is a coastal seaport city on the mainland of British Columbia, Canada. It is the hub of Greater Vancouver, which, with over 2.3 million residents, is the third most populous metropolitan area in the country, and the most populous in Western Canada.
![Page 9: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/9.jpg)
Matrix Space Model
• Segment 1: conference (3), data (3), mining (3), research (2), international (2)
• Segment 2: vancouver (2), populous (2), canada (2), metropolitan (1), columbia (1)
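The counts on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' preprocessing: the tokenizer (lower-cased alphabetic runs), the hand-picked `VOCABULARY`, and the choice of one row per segment are all assumptions made for illustration.

```python
import re
from collections import Counter

# The two segments from the "An Example" slide.
SEGMENT_1 = (
    "The IEEE International Conference on Data Mining series (ICDM) has "
    "established itself as the world's premier research conference in data "
    "mining. It provides an international forum for presentation of original "
    "research results, as well as exchange and dissemination of innovative, "
    "practical development experiences. The conference covers all aspects of "
    "data mining, including algorithms, software and systems, and applications."
)
SEGMENT_2 = (
    "Vancouver is a coastal seaport city on the mainland of British Columbia, "
    "Canada. It is the hub of Greater Vancouver, which, with over 2.3 million "
    "residents, is the third most populous metropolitan area in the country, "
    "and the most populous in Western Canada."
)

# Assumed vocabulary: the terms listed on the slide, in slide order.
VOCABULARY = [
    "conference", "data", "mining", "research", "international",
    "vancouver", "populous", "canada", "metropolitan", "columbia",
]


def term_counts(text):
    """Lower-case the text and count runs of alphabetic characters."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def segment_term_matrix(segments, vocabulary):
    """Return one row per segment and one column per vocabulary term."""
    counts = [term_counts(segment) for segment in segments]
    return [[c[term] for term in vocabulary] for c in counts]


matrix = segment_term_matrix([SEGMENT_1, SEGMENT_2], VOCABULARY)
for row in matrix:
    print(row)
# [3, 3, 3, 2, 2, 0, 0, 0, 0, 0]
# [0, 0, 0, 0, 0, 2, 2, 2, 1, 1]
```

Each row keeps one topic to itself: the data mining terms are zero in the Vancouver segment and vice versa.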
![Page 10: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/10.jpg)
Vector Space Model
• Whole document: conference (3), data (3), mining (3), research (2), international (2), canada (2), vancouver (2)
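Seen from the matrix side, the vector space representation is what remains after the segment rows are added together. The sketch below assumes the same ten-term vocabulary and per-segment counts as the example slides; `collapse_to_vector` is a hypothetical helper name.

```python
# Assumed vocabulary and per-segment counts, taken from the example slides.
VOCABULARY = [
    "conference", "data", "mining", "research", "international",
    "vancouver", "populous", "canada", "metropolitan", "columbia",
]
SEGMENT_ROWS = [
    [3, 3, 3, 2, 2, 0, 0, 0, 0, 0],  # segment 1 (ICDM)
    [0, 0, 0, 0, 0, 2, 2, 2, 1, 1],  # segment 2 (Vancouver)
]


def collapse_to_vector(rows):
    """Sum the segment rows column by column into one document vector."""
    return [sum(column) for column in zip(*rows)]


document_vector = collapse_to_vector(SEGMENT_ROWS)
print(dict(zip(VOCABULARY, document_vector)))
```

After the sum, nothing in `document_vector` says which terms co-occurred in the same segment, which is the "mixing topics" problem named on the Vector Space Model slide.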
![Page 11: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/11.jpg)
Pros of a Matrix Representation
• Interpretation
  – Segments vs. topics
• Finer granularity for data management
  – Segments vs. document
• Multiple or single class labels
  – Flexibility
![Page 12: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/12.jpg)
Verify the Effectiveness via Clustering
• Information Retrieval
  – Indexing based on segments
• Classification
  – New approaches based on matrix inputs
• Clustering
  – New approaches based on matrix inputs
![Page 13: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/13.jpg)
A Graphical Interpretation
![Page 14: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/14.jpg)
Step 1: Obtaining Segments
• Many approaches for segmentation
  – Terms
  – Sentences (Choi et al. 2000)
  – Paragraphs (Tagarelli et al. 2008)
• Determining the number of segments
  – Open research problem
![Page 15: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/15.jpg)
Step 2: Latent Topic Extraction
• Non-negative Matrix Approximation (NMA)
• $L M_i$ represents the probability of a term belonging to a latent topic
• $M_i R^T$ represents the probability of a segment belonging to a latent topic

$$\min_{L \ge 0,\; M_i \ge 0,\; R \ge 0} \; \sum_{i=1}^{n} \left\| A_i - L M_i R^T \right\|_F^2$$
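A factorization of this shape, with each document matrix $A_i$ approximated by shared non-negative factors $L$ and $R$ and a per-document core $M_i$, can be fitted with Lee-Seung-style multiplicative updates. The sketch below is a generic illustration of that technique, not the solver used by the authors. The function names, the latent sizes `l1` and `l2`, the random toy data, and the assumption that every $A_i$ has the same term-by-segment shape are all hypothetical.

```python
import numpy as np


def nma_loss(A_list, L, M_list, R):
    """Sum of squared Frobenius errors over all documents."""
    return sum(
        np.linalg.norm(A - L @ M @ R.T, "fro") ** 2
        for A, M in zip(A_list, M_list)
    )


def nma_multiplicative(A_list, l1, l2, n_iter=200, eps=1e-9, seed=0):
    """Fit A_i ~ L @ M_i @ R.T with all factors kept non-negative.

    A_list : list of (terms x segments) non-negative arrays, same shape.
    l1, l2 : assumed latent dimensions on the term and segment sides.
    """
    rng = np.random.default_rng(seed)
    r, c = A_list[0].shape
    L = rng.random((r, l1)) + 0.1
    R = rng.random((c, l2)) + 0.1
    M_list = [rng.random((l1, l2)) + 0.1 for _ in A_list]
    history = [nma_loss(A_list, L, M_list, R)]

    for _ in range(n_iter):
        # Update L with R and all M_i fixed.
        num = sum(A @ R @ M.T for A, M in zip(A_list, M_list))
        den = sum(L @ M @ R.T @ R @ M.T for M in M_list) + eps
        L = L * num / den

        # Update R with L and all M_i fixed.
        num = sum(A.T @ L @ M for A, M in zip(A_list, M_list))
        den = sum(R @ M.T @ L.T @ L @ M for M in M_list) + eps
        R = R * num / den

        # Update each M_i with L and R fixed.
        LtL, RtR = L.T @ L, R.T @ R
        M_list = [
            M * (L.T @ A @ R) / (LtL @ M @ RtR + eps)
            for A, M in zip(A_list, M_list)
        ]
        history.append(nma_loss(A_list, L, M_list, R))

    return L, M_list, R, history


# Toy data: 3 documents, 6 terms, 4 segments each (random, for illustration).
data_rng = np.random.default_rng(42)
A_list = [data_rng.random((6, 4)) for _ in range(3)]
L, M_list, R, history = nma_multiplicative(A_list, l1=2, l2=2)
print(f"loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

The factors returned here are unnormalized. Reading $L M_i$ and $M_i R^T$ as probabilities, as the slide does, would need an extra normalization step that this sketch leaves out.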
![Page 16: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/16.jpg)
Step 3: Clustering
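• Non-overlapping clustering

$$d_i = \sum_{j} d_{ij}, \qquad \min \sum_{i} \left( d_i - \mathrm{centroid}(c_k) \right)^2$$

• Overlapping clustering

$$\min \sum_{i,j} \left( d_{ij} - \mathrm{centroid}(c_k) \right)^2$$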
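The difference between the two modes shows up in the assignment step alone. The sketch below assumes that a document vector is the sum of its segment vectors and that a document inherits every cluster its segments fall into. Both are readings of the slide rather than confirmed details, and the segment features and centroids are made-up numbers. Only the nearest-centroid assignment is shown, not a full k-means loop.

```python
import numpy as np


def nearest_centroid(points, centroids):
    """Return, per point, the index of the centroid at least squared distance."""
    diffs = points[:, None, :] - centroids[None, :, :]
    return np.argmin((diffs ** 2).sum(axis=2), axis=1)


# Hypothetical latent-topic features: 2 documents x 2 segments x 2 features.
segments = np.array([
    [[1.0, 0.0], [0.9, 0.1]],  # document 0: both segments on topic 0
    [[0.8, 0.2], [0.0, 1.0]],  # document 1: one segment per topic
])

# Non-overlapping: one vector per document, so exactly one label each.
doc_vectors = segments.sum(axis=1)
doc_centroids = np.array([[2.0, 0.0], [0.0, 2.0]])  # assumed centroids
doc_labels = nearest_centroid(doc_vectors, doc_centroids)

# Overlapping: cluster the segments; a document collects its segments' labels.
seg_centroids = np.array([[1.0, 0.0], [0.0, 1.0]])  # assumed centroids
n_docs, n_segs, n_feat = segments.shape
seg_labels = nearest_centroid(segments.reshape(-1, n_feat), seg_centroids)
doc_label_sets = [
    {int(label) for label in row}
    for row in seg_labels.reshape(n_docs, n_segs)
]

print(doc_labels.tolist())  # [0, 1]
print(doc_label_sets)       # [{0}, {0, 1}]
```

Document 1 gets a single label in the first mode but belongs to both clusters in the second, because its two segments sit near different centroids.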
![Page 17: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/17.jpg)
Datasets
• 20 Newsgroups
  – 20 classes
  – 6,038 documents
• Reuters-21578
  – 26 clusters
  – 1,964 documents
• Classic
  – 3 clusters
  – 1,486 documents
![Page 18: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/18.jpg)
Experimental Method
• Generate Datasets (by specifying k)
• Evaluate the accuracy
• Repeat
![Page 19: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/19.jpg)
Number of Latent Topics
![Page 20: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/20.jpg)
Number of Segments
![Page 21: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/21.jpg)
Comparative Study
![Page 22: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/22.jpg)
Conclusion
• Proposing a matrix representation for documents
• Significant improvements with MSM
• Also applicable to information retrieval and classification tasks
![Page 23: Document Clustering via Matrix Representation](https://reader036.vdocuments.mx/reader036/viewer/2022062301/568160ff550346895dd0400a/html5/thumbnails/23.jpg)
Questions