
Learnability of k-means Clustering

Yuxuan Li | Advisor: Prof. Takunari Miyazaki | Department of Computer Science | Trinity College

Motivation

Machine learning has proved successful in
• spam detection and recommender systems
• image/video recognition

Machine learning = superior models?
• The most impressive achievements have come mainly from supervised learning models.

Can models learn without prior knowledge?
• This is the issue of learnability in unsupervised learning.

k-means Clustering
• Partitions data instances into k clusters.

• Performance evaluation: compare the resulting clustering assignment with the ground-truth clustering assignment (GT), e.g. on the Wine dataset, as in the sketch below.
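A minimal sketch of this evaluation, assuming scikit-learn's bundled copy of the Wine dataset (the poster's data came from the UCI repository, and the fixed random_state and n_init here are illustrative choices, not the original setup):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score

# Wine dataset: 178 instances, 13 numeric attributes, 3 ground-truth classes
X, y_true = load_wine(return_X_y=True)

# Partition the instances into k = 3 clusters
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Compare the clustering assignment with the ground truth (GT) via ARI
print(adjusted_rand_score(y_true, y_pred))
```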

Methods

• Datasets: UCI Machine Learning Repository [1]

• Algorithm: k-means from Python's machine learning library scikit-learn [2]

• Performance measure: Adjusted Rand Index (ARI)

• Missing values were replaced with the average value of the attribute

• Categorical attributes were excluded (see the preprocessing sketch below)
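The two preprocessing steps might look like the following sketch; the CSV path and DataFrame are hypothetical, and SimpleImputer is the current scikit-learn API, which may differ from what the original code called:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical: one UCI dataset loaded into a DataFrame
df = pd.read_csv("uci_dataset.csv")

# Exclude categorical attributes, keeping numeric columns only
numeric = df.select_dtypes(include="number")

# Replace missing values with the average value of each attribute
X = SimpleImputer(strategy="mean").fit_transform(numeric)
```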


Results

1. 17/40 datasets resulted in an ARI score near 0.0 (red bars): k-means failed to generate meaningful clustering assignments.

2. Optimal performance deviated from the optimal k.

- Optimal k = the "correct" number of clusters (i.e. the number of ground-truth classes).

- Ideally, the clustering assignment should be more prone to error when we start with a k value that is too "wrong".

- The heat map suggested otherwise. Each row is a dataset (one row can be recomputed with the sketch below):

- For some datasets, performance did not differ much as k varied.

- For others, optimal performance was found at a k value far from the optimal k.

[Heat map legend: ARI color scale from -1 (independent matching) through 0 (random matching) to 1 (perfect matching)]
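One heat-map row can be reproduced with a sweep over candidate k values, as in this sketch (the function name and the k range are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def ari_over_k(X, y_true, k_values):
    """One heat-map row: ARI of k-means at each candidate k."""
    return np.array([
        adjusted_rand_score(
            y_true,
            KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
        for k in k_values
    ])

# e.g. one row of the heat map; stacking such rows over all 40 datasets
# reproduces the full figure
# row = ari_over_k(X, y_true, range(2, 11))
```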


Conclusions

• What is reasonable algorithmically might not be desirable.

• The results revealed the inherent difficulty of unsupervised learning [3]:

- Supervised learning: many datasets yield a predictive accuracy of 95%+ when the labels are accessible in training [1].

- Semi-supervised learning: feeding some prior knowledge into the model will immediately boost its performance [4] (one way to do so is sketched below).

• What's promising:

- A large-scale model trained on non-labeled images can become selective for high-level features [5].
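As a concrete illustration of feeding prior knowledge into k-means, the sketch below seeds the initial centroids from a few labeled points. This is a simple stand-in, not the must-link/cannot-link algorithm of Wagstaff et al. [4]; seeded_kmeans, seed_X, and seed_y are hypothetical names:

```python
import numpy as np
from sklearn.cluster import KMeans

def seeded_kmeans(X, seed_X, seed_y, k):
    """k-means whose initial centroids come from a few labeled examples."""
    # One starting centroid per known class, averaged over its labeled points
    centers = np.vstack([seed_X[seed_y == c].mean(axis=0) for c in range(k)])
    # n_init=1 because the initial centroids are fixed by the seeds
    return KMeans(n_clusters=k, init=centers, n_init=1).fit_predict(X)
```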


Acknowledgements

Thanks to Professor Miyazaki for the support and guidance. Thanks to friends in the Department of Computer Science for helpful discussions.

References

1. Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

2. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825-2830.

3. Williams, A. (2015). What is clustering and why is it hard? http://alexhwilliams.info/itsneuronalblog/2015/09/11/clustering1/

4. Wagstaff, K., Cardie, C., Rogers, S., & Schrödl, S. (2001, June). Constrained k-means clustering with background knowledge. In ICML (Vol. 1, pp. 577-584).

5. Le, Q. V. (2013, May). Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (pp. 8595-8598). IEEE.
