A Soft Subspace Clustering Method for Text Data Using a Probability-Based Feature
Weighting Scheme
Abdul Wahid, Xiaoying Gao, Peter Andreae
Victoria University of Wellington
New Zealand
Soft subspace clustering
• Clustering normally uses all features
• Text data has too many features
• Subspace clustering uses subsets of features, called subspaces
• Soft: a feature has a weight in each subspace
Research questions
• What are the subspaces?
• How to define the weights mapping features to subspaces?
• LDA (Latent Dirichlet Allocation)
– Topic modelling
– Automatically detects topics
• Solution
– Topics as subspaces
– Weight: word probability in each topic
LDA: example by Edwin Chen
• Suppose you have the following set of sentences, and you want two topics:
• I like to eat broccoli and bananas.
• I ate a banana and spinach smoothie for breakfast.
• Chinchillas and kittens are cute.
• My sister adopted a kitten yesterday.
• Look at this cute hamster munching on a piece of broccoli.
LDA example by Edwin Chen
• Sentences 1 and 2: 100% Topic A
• Sentences 3 and 4: 100% Topic B
• Sentence 5: 60% Topic A, 40% Topic B
• Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
• Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
Apply LDA
• Gibbs sampling
• Generate two matrices
– Topic-document matrix (θ)
– Topic-term matrix (φ)
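The slides do not show the sampler itself. As an illustration only, here is a minimal collapsed Gibbs sampler for LDA in Python that produces the two matrices the slide names; the function name `lda_gibbs`, the hyperparameter values, and the word-id encoding of documents are assumptions for this sketch, not the authors' code:

```python
import random
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA over documents given as lists of word ids.

    Returns theta (document-topic matrix) and phi (topic-term matrix)."""
    rng = random.Random(seed)
    ndk = np.zeros((len(docs), n_topics))   # count of topic k in document d
    nkw = np.zeros((n_topics, vocab_size))  # count of word w under topic k
    nk = np.zeros(n_topics)                 # total tokens assigned to topic k
    z = []                                  # current topic assignment per token
    for d, doc in enumerate(docs):          # random initial assignments
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove this token's assignment, then resample its topic
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choices(range(n_topics), weights=p)[0]
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    # smoothed, normalised estimates of the two matrices
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + vocab_size * beta)
    return theta, phi
```

Each row of θ is a distribution over topics for one document; each row of φ is a distribution over the vocabulary for one topic, which is exactly what the feature-weighting scheme needs.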
[Flow diagram] Documents → Preprocessing → LDA (Gibbs sampling) → θ (topic-document matrix) and φ (topic-term matrix) → Assign initial clusters → Assign weights → Refine clusters
Our DWKM algorithm
• K-means-based algorithm
• Use LDA to get the two matrices
• Use the document-topic matrix to initialise the clusters
• Repeat
– Calculate the centroid of each cluster
– Assign each document to the nearest centroid
• The distance measure is weighted by the term-topic matrix
• Until convergence
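The loop above can be sketched as follows. This is a hedged reconstruction, not the published implementation: the function name `dwkm`, the argmax initialisation from θ, and weighting each cluster's distance by its row of φ are our reading of the slide.

```python
import numpy as np

def dwkm(X, theta, phi, iters=20):
    """Sketch of the DWKM loop: theta (document-topic) seeds the clusters,
    phi (topic-term) weights the distance within each cluster's subspace."""
    labels = theta.argmax(axis=1)  # initial clusters from the document-topic matrix
    k = phi.shape[0]
    for _ in range(iters):
        # centroid of each cluster (zeros for an empty cluster)
        centroids = np.vstack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else np.zeros(X.shape[1])
            for j in range(k)
        ])
        # weighted squared distance: term t in cluster j is weighted by phi[j, t]
        d = np.stack([((X - centroids[j]) ** 2 * phi[j]).sum(axis=1) for j in range(k)],
                     axis=1)
        new = d.argmin(axis=1)   # assign each document to the nearest centroid
        if np.array_equal(new, labels):
            break                # convergence: assignments stopped changing
        labels = new
    return labels
```

On a toy term-frequency matrix where two documents use one vocabulary block and two use another, the loop keeps the two groups apart because each cluster's weights emphasise its own topic's terms.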
New distance measure
Weights: word probability in a topic
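The formula on this slide was not captured in the transcript. A plausible form, consistent with "weighted by the term-topic matrix" and the k-means setting (this is a reconstruction, not the paper's exact equation), is:

```latex
d_k(x_i, c_k) = \sum_{j=1}^{m} \phi_{kj}\,(x_{ij} - c_{kj})^2,
\qquad \phi_{kj} = P(\text{word } j \mid \text{topic } k)
```

so terms that are probable under cluster k's topic dominate the distance to that cluster's centroid, while irrelevant terms are discounted.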
Subspace clustering: common approach vs. our new approach

Common approach (hard or soft subspace clustering):
• Randomly assign documents to clusters
• Randomly assign feature weights
• Refine clusters
• Refine feature weights

Our new approach:
• LDA extracts semantic information
• Initial cluster estimation from the semantic information
• Feature weighting from the semantic information
• Refine clusters using the feature weights
Experiments
• Data sets
– 4 synthetic data sets
– 6 real data sets
• Evaluation metrics
– Accuracy
– F-measure
– NMI (Normalized Mutual Information)
– Entropy
• Compared with
– K-means, LDA as a clustering method, FWKM, EWKM, FGKM
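As an illustration of one of the listed metrics, here is a minimal NMI computation in pure Python; the paper may use a different normalisation, and the geometric-mean variant below is an assumption of this sketch:

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    """Shannon entropy (natural log) of a labelling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """Normalized Mutual Information between two labellings of the same items,
    normalised by the geometric mean of the two entropies."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((c / n) * log(n * c / (ca[x] * cb[y])) for (x, y), c in cab.items())
    ha, hb = entropy(a), entropy(b)
    return mi / sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0
```

NMI is 1 when the predicted clusters match the true classes up to relabelling, and 0 when they are independent, which makes it a convenient label-permutation-invariant score for clustering.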
Results

Dataset  Metric  K-means  LDA   FWKM  EWKM  FGKM  DWKM
SD1      Acc     0.65     0.66  0.77  0.69  0.82  0.87
         F-M     0.63     0.65  0.73  0.59  0.75  0.81
SD2      Acc     0.63     0.68  0.76  0.72  0.87  0.92
         F-M     0.64     0.69  0.75  0.63  0.82  0.88
SD3      Acc     0.62     0.64  0.67  0.70  0.94  0.94
         F-M     0.62     0.63  0.64  0.59  0.91  0.92
SD4      Acc     0.60     0.61  0.61  0.69  0.91  0.93
         F-M     0.59     0.60  0.60  0.58  0.88  0.90
Results
Conclusion
• A new soft subspace clustering algorithm
• A new distance measure
• Apply LDA to get semantic information
• Improved performance
Future work
• Non-parametric LDA model
– No need to specify the number of topics
• Reduce computational complexity
• Use LDA to generate different candidate clustering solutions for clustering ensembles