k-means clustering with scikit-learn

K-Means Clustering with Scikit-Learn

Sarah Guido

PyData SV 2014

About Me

• Today: graduated from the University of Michigan!
• Soon: data scientist at Reonomy
• PyLadies co-organizer
• @sarah_guido

Outline

• What is k-means clustering?
  • How it works
  • When to use it

• K-means clustering in scikit-learn
  • Basic implementation
  • Implementation with tuned parameters

Clustering

• Unsupervised learning
  • Unlabeled data

• Split observations into groups
  • Distance between data points
  • Exploring the data

K-means clustering

• Formally: a method of vector quantization
  • Partition space into Voronoi cells

• Separate samples into n groups of equal variance

• Uses the Euclidean distance metric

K-means clustering

• Iterative refinement
• Three basic steps
  • Step 1: Choose k
  • Iterate over:
    • Step 2: Assignment
    • Step 3: Update
• Repeats until convergence has been reached
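
The three steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the random choice of starting rows, and the handling of empty clusters are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain Lloyd-style k-means (illustrative only): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: k is given; pick k distinct rows as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2 (assignment): each point joins its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (update): each centroid moves to the mean of its points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Convergence: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Calling `kmeans_sketch(X, 2)` on a two-column array `X` returns a `(2, 2)` array of centroids and one integer label per row.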

K-means clustering

• Assignment

• Update
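
A standard way to write the two steps is given below. The notation is assumed rather than taken from the slides: x_i are the observations, mu_j^(t) is the centroid of cluster j at iteration t, and S_j^(t) is the set of observations assigned to it.

```latex
% Assignment: each observation goes to the cluster with the nearest centroid
S_j^{(t)} = \left\{ x_i : \lVert x_i - \mu_j^{(t)} \rVert^2 \le \lVert x_i - \mu_l^{(t)} \rVert^2 \ \text{for all } l = 1, \dots, k \right\}

% Update: each centroid becomes the mean of the observations assigned to it
\mu_j^{(t+1)} = \frac{1}{\lvert S_j^{(t)} \rvert} \sum_{x_i \in S_j^{(t)}} x_i
```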

K-means clustering

• Advantages
  • Scales well
  • Efficient
  • Will always converge

• Disadvantages
  • Choosing the wrong k
  • Convergence to local minimum

K-means clustering

• When to use
  • Normally distributed data
  • Large number of samples
  • Not too many clusters
  • Distance can be measured in a linear fashion

Scikit-Learn

• Machine learning module
• Open-source
• Built-in datasets
• Good resources for learning

Scikit-Learn

• Model = EstimatorObject()
• Unsupervised:
  • Model.fit(dataset.data)
  • dataset.data = dataset

• Supervised would use the labels as a second parameter
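
The estimator pattern above, sketched with `KMeans` on the built-in iris dataset. The dataset and the values `n_clusters=3`, `n_init=10`, and `random_state=0` are assumptions chosen for the example, not taken from the talk.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

dataset = load_iris()  # built-in dataset; dataset.data holds the feature array

# Model = EstimatorObject()
model = KMeans(n_clusters=3, n_init=10, random_state=0)

# Unsupervised: fit on the data alone, no labels passed.
model.fit(dataset.data)

print(model.cluster_centers_.shape)  # one centroid per cluster, one column per feature
print(model.labels_[:10])            # cluster index assigned to the first 10 rows
```

A supervised estimator would be fitted the same way but with the labels as the second argument, for example `model.fit(dataset.data, dataset.target)`.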

K-means in scikit-learn

• Efficient and fast
• You: pick n clusters, kmeans: finds n initial centroids

• Run clustering jobs in parallel

Dataset

• University of California Machine Learning Repository

• Individual household power consumption


• Results

K-means parameters

• n_clusters
• max_iter
• n_init
• init
• precompute_distances
• tol
• n_jobs
• random_state
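
A sketch of how these parameters are passed to the `KMeans` constructor. The values shown are illustrative assumptions. `precompute_distances` and `n_jobs` are left out because recent scikit-learn releases (1.0 and later) no longer accept them for `KMeans`.

```python
from sklearn.cluster import KMeans

model = KMeans(
    n_clusters=8,       # k: number of clusters (and centroids) to form
    init="k-means++",   # how the initial centroids are chosen
    n_init=10,          # number of runs with different initial centroids; the best is kept
    max_iter=300,       # cap on assignment/update iterations within a single run
    tol=1e-4,           # tolerance on centroid movement used to declare convergence
    random_state=0,     # seed, so that initialization is reproducible
)

print(model.get_params())
```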

n_clusters: choosing k

• Graphing the variance
• Information criterion
• Cross-validation


• Graphing the variance
  • from scipy.spatial.distance import cdist, pdist
  • cdist: distance computation between sets of observations
  • pdist: pairwise distances between observations in the same set
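
One way `cdist` and `pdist` can be combined to compute the fraction of variance explained for each k. This is a sketch under assumptions: the data is a synthetic stand-in from `make_blobs`, not the dataset from the talk, and the range of k is arbitrary.

```python
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption; not the talk's dataset).
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# pdist: pairwise distances within one set. The sum of squared pairwise
# distances divided by n equals the total sum of squares around the mean.
total_ss = (pdist(X) ** 2).sum() / X.shape[0]

explained = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # cdist: distances between two sets, here every observation vs. every centroid.
    d = cdist(X, km.cluster_centers_, "euclidean")
    within_ss = (d.min(axis=1) ** 2).sum()
    explained[k] = 1 - within_ss / total_ss

for k, frac in explained.items():
    print(k, round(frac, 3))
```

Plotting `explained` against k and looking for the point where the curve flattens gives a candidate value for n_clusters.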


• Graphing the variance

[Figure panels: n_clusters = 4; n_clusters = 7]


• n_clusters = 8 (default)

init

• k-means++
  • Default
  • Selects initial clusters in a way that speeds up convergence

• random
  • Choose k rows at random for initial centroids

• Ndarray that gives initial centers
  • (n_clusters, n_features)
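
The three init options side by side, as a sketch. The data is a synthetic stand-in from `make_blobs`, and using the first three rows as explicit starting centers is a hypothetical choice made only to show the required shape.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # synthetic stand-in data

# Default: k-means++ seeding.
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

# random: k rows chosen at random as the initial centroids.
km_rand = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

# ndarray: explicit starting centers with shape (n_clusters, n_features).
start = X[:3].copy()  # hypothetical choice: the first three rows
km_arr = KMeans(n_clusters=3, init=start, n_init=1).fit(X)

for name, km in [("k-means++", km_pp), ("random", km_rand), ("ndarray", km_arr)]:
    print(name, km.n_iter_, round(km.inertia_, 2))
```

`n_init=1` is used with the explicit array because the starting centers are fixed, so repeated runs would be identical.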

K-means revised

• Set n_clusters
  • 7, 8

• Set init
  • k-means++, random

K-means revised

[Figure panels: n_clusters = 8, init = k-means++; n_clusters = 8, init = random]

K-means revised

[Figure panels: n_clusters = 7, init = k-means++; n_clusters = 7, init = random]

Comparing results: silhouette score

• Silhouette coefficient
  • No ground truth
  • Mean distance between an observation and all other points in its cluster
  • Mean distance between an observation and all other points in the next nearest cluster

• Silhouette score in scikit-learn
  • Mean of silhouette coefficient for all of the observations
  • Closer to 1, the better the fit
  • Large dataset == long time
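
With a as the mean distance to the other points in an observation's own cluster and b as the mean distance to the points in the next nearest cluster, the coefficient for one observation is (b - a) / max(a, b). The sketch below computes it per observation and as an overall score. The data is a synthetic stand-in, and `sample_size=300` is an arbitrary value showing one way to shorten the computation on a large dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)  # synthetic stand-in data
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

per_point = silhouette_samples(X, labels)  # one coefficient per observation
score = silhouette_score(X, labels)        # mean of those coefficients

# On a large dataset, score a random subsample instead of every observation.
approx = silhouette_score(X, labels, sample_size=300, random_state=0)

print(round(score, 4), round(approx, 4))
```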

Comparing results: silhouette score

• n_clusters=8, init=k-means++
  • 0.8117

• n_clusters=8, init=random
  • 0.6511

• n_clusters=7, init=k-means++
  • 0.7719

• n_clusters=7, init=random
  • 0.7037
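
A comparison like this can be scripted as a loop over the four configurations. The sketch runs on synthetic stand-in data, so its scores will not match the numbers above.

```python
from itertools import product

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=600, centers=5, random_state=0)  # synthetic stand-in data

scores = {}
for k, init in product([8, 7], ["k-means++", "random"]):
    labels = KMeans(n_clusters=k, init=init, n_init=10, random_state=0).fit_predict(X)
    scores[(k, init)] = silhouette_score(X, labels)

for (k, init), s in scores.items():
    print(f"n_clusters={k}, init={init}: {s:.4f}")
```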

What does this tell us?

• Patterns exist
• Groups of similar observations exist
• Sometimes, the defaults work
• We need more exploration!

A few tips

• Clustering is a good way to explore your data
• Intuition fails in high dimensions
  • Use dimensionality reduction

• Combine with other models
• Know your data
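
One possible way to combine dimensionality reduction with k-means is a pipeline that scales the features, projects them with PCA, and then clusters. PCA, the built-in digits dataset, and the choice of 10 components and 10 clusters are all assumptions for this sketch; the talk does not name a specific technique.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_digits().data  # 64 features per sample

pipeline = make_pipeline(
    StandardScaler(),                                  # put features on a common scale
    PCA(n_components=10, random_state=0),              # reduce 64 features to 10
    KMeans(n_clusters=10, n_init=10, random_state=0),  # cluster in the reduced space
)
labels = pipeline.fit_predict(X)

print(labels[:20])
```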

Materials and resources

• Scikit-learn documentation
  • scikit-learn.org/stable/documentation.html

• Datasets
  • http://archive.ics.uci.edu/ml/datasets.html
  • Mldata.org

• Blogs
  • http://datasciencelab.wordpress.com/

Contact me!

• Twitter: @sarah_guido
• www.linkedin.com/in/sarahguido/
• https://github.com/sarguido
