CSC 411: Lecture 12: Clustering

Richard Zemel, Raquel Urtasun and Sanja Fidler

University of Toronto

Zemel, Urtasun, Fidler (UofT) CSC 411: 12-Clustering 1 / 20


Today

Unsupervised learning

Clustering

- k-means
- Soft k-means



Motivating Examples

Determine groups of people in image above

- based on clothing styles
- gender, age, etc.

Determine moving objects in videos


Unsupervised Learning

Supervised learning algorithms have a clear goal: produce desired outputs for given inputs.

- You are given {(x^(i), t^(i))} during training (inputs and targets)

The goal of unsupervised learning algorithms (no explicit feedback on whether the outputs of the system are correct) is less clear.

- You are given only the inputs {x^(i)} during training; the labels are unknown.

Tasks to consider:

- Reduce dimensionality
- Find clusters
- Model data density
- Find hidden causes

Key utility:

- Compress data
- Detect outliers
- Facilitate other learning


Major Types

The primary problems and approaches in unsupervised learning fall into three classes:

1. Dimensionality reduction: represent each input case using a small number of variables (e.g., principal components analysis, factor analysis, independent components analysis)

2. Clustering: represent each input case using a prototype example (e.g., k-means, mixture models)

3. Density estimation: estimate the probability distribution over the data space


Clustering

Grouping N examples into K clusters is one of the canonical problems in unsupervised learning.

Motivation: prediction; lossy compression; outlier detection

We assume that the data was generated from a number of different classes. The aim is to cluster data from the same class together.

- How many classes?
- Why not put each datapoint into a separate class?

What is the objective function that is optimized by sensible clustering?


Clustering

Assume the data {x^(1), . . . , x^(N)} lives in a Euclidean space, x^(n) ∈ R^d.

Assume the data belongs to K classes (patterns).

How can we identify those classes, i.e., the data points that belong to each class?


K-means

Initialization: randomly initialize the cluster centers

The algorithm iteratively alternates between two steps:

- Assignment step: Assign each data point to the closest cluster
- Refitting step: Move each cluster center to the center of gravity of the data assigned to it

[Figure: assignments and refitted means]

Page 28: CSC 411: Lecture 12: Clustering - Department of Computer …urtasun/courses/CSC411_Fall16/12... · 2016. 10. 30. · CSC 411: Lecture 12: Clustering Richard Zemel, Raquel Urtasun

Figure from Bishop Simple demo: http://syskall.com/kmeans.js/

Zemel, Urtasun, Fidler (UofT) CSC 411: 12-Clustering 9 / 20

K-means Objective

What is actually being optimized?

K-means objective: find cluster centers m and assignments r to minimize the sum of squared distances of the data points {x^(n)} to their assigned cluster centers:

min_{ {m},{r} } J({m}, {r}) = min_{ {m},{r} } Σ_{n=1}^{N} Σ_{k=1}^{K} r_k^(n) ||m_k − x^(n)||^2

s.t. Σ_k r_k^(n) = 1 for all n, where r_k^(n) ∈ {0, 1} for all k, n

where r_k^(n) = 1 means that x^(n) is assigned to cluster k (with center m_k).

The optimization method is a form of coordinate descent ("block coordinate descent"):

- Fix the centers, optimize the assignments (choose the cluster whose mean is closest)
- Fix the assignments, optimize the means (average of the assigned datapoints)
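Since each r^(n) is one-hot, the double sum collapses to a single lookup when the assignments are stored as an index vector. A minimal NumPy sketch (the function name and array layout are illustrative, not from the lecture):

```python
import numpy as np

def kmeans_objective(X, means, assign):
    """J({m}, {r}) = sum_n sum_k r_k^(n) ||m_k - x^(n)||^2, with the
    one-hot r^(n) stored as an index vector: assign[n] = k iff r_k^(n) = 1."""
    X = np.asarray(X, dtype=float)
    means = np.asarray(means, dtype=float)
    # For each point, the squared distance to its assigned center
    return float(((X - means[assign]) ** 2).sum())
```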


The K-means Algorithm

Initialization: Set the K cluster means m_1, . . . , m_K to random values.

Repeat until convergence (until the assignments do not change):

- Assignment: Each data point x^(n) is assigned to the nearest mean,

  k̂_n = argmin_k d(m_k, x^(n))

  (with, for example, the L2 norm: k̂_n = argmin_k ||m_k − x^(n)||^2)

  and the responsibilities use a 1-of-K encoding:

  r_k^(n) = 1 ⟷ k̂^(n) = k

- Update: The model parameters (the means) are adjusted to match the sample means of the data points they are responsible for:

  m_k = (Σ_n r_k^(n) x^(n)) / (Σ_n r_k^(n))
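The two steps above translate directly into NumPy. This is a sketch, not the lecture's reference code: the initialization here picks K distinct data points as the initial means (one common choice), and an empty cluster simply keeps its previous mean.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain k-means. Returns (means, assign) where assign[n] is the
    index of the cluster x^(n) ends up in."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)].copy()
    assign = np.full(len(X), -1)
    for _ in range(n_iters):
        # Assignment step: nearest mean under the L2 norm
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break  # assignments unchanged -> converged
        assign = new_assign
        # Update step: move each mean to the average of its assigned points
        for k in range(K):
            if np.any(assign == k):
                means[k] = X[assign == k].mean(axis=0)
    return means, assign
```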


K-means for Vector Quantization

Figure from Bishop
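The idea behind vector quantization can be sketched in a few lines: after k-means learns a codebook of K centers, each d-dimensional vector is stored as the index of its nearest center (about log2 K bits instead of d floats), and decoding looks the index back up. The function names below are illustrative:

```python
import numpy as np

def quantize(X, codebook):
    """Encode: replace each vector by the index of its nearest codebook
    entry (a cluster center learned by k-means)."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def reconstruct(codes, codebook):
    """Decode (lossily): look the indices back up in the codebook."""
    return codebook[codes]
```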

K-means for Image Segmentation

How would you modify k-means to get superpixels?


Questions about K-means

Why does update set mk to mean of assigned points?

Where does distance d come from?

What if we used a different distance measure?

How can we choose best distance?

How to choose K?

How can we choose between alternative clusterings?

Will it converge?

Hard cases: unequal spreads, non-circular spreads, in-between points


Why K-means Converges

Whenever an assignment is changed, the sum of squared distances J of the data points from their assigned cluster centers is reduced.

Whenever a cluster center is moved, J is reduced.

Test for convergence: if the assignments do not change in the assignment step, we have converged (to at least a local minimum).

[Figure: the K-means cost function after each E step (blue) and M step (red); the algorithm has converged after the third M step.]
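Both claims can be checked numerically by recording J after every assignment step and every refitting step: the recorded sequence should never increase. A small sketch (assuming the means are initialized to some of the data points; the function name is illustrative):

```python
import numpy as np

def kmeans_cost_trace(X, means, n_iters=10):
    """Record the cost J after each assignment ("E") step and each
    refitting ("M") step; k-means guarantees this trace is non-increasing."""
    X = np.asarray(X, dtype=float)
    means = np.asarray(means, dtype=float).copy()
    trace = []
    for _ in range(n_iters):
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        trace.append(float(d2[np.arange(len(X)), assign].sum()))  # J after assignment
        for k in range(len(means)):
            if np.any(assign == k):
                means[k] = X[assign == k].mean(axis=0)
        trace.append(float(((X - means[assign]) ** 2).sum()))  # J after refitting
    return trace
```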


Local Minima

The objective J is non-convex (so coordinate descent on J is not guaranteed to converge to the global minimum)

There is nothing to prevent k-means getting stuck at local minima.

We could try many random starting points

We could try non-local split-and-merge moves:

I Simultaneously merge two nearby clusters

I and split a big cluster into two

A bad local optimum
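The "many random starting points" remedy can be sketched as follows. This is an illustrative NumPy implementation (synthetic blob data, made-up parameters, not from the lecture): run k-means from several random initializations and keep the result with the lowest cost J.

```python
import numpy as np

def kmeans(X, K, rng, n_iter=50):
    # one k-means run from a random initialization
    means = X[rng.choice(len(X), K, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for k in range(K):
            if (assign == k).any():
                means[k] = X[assign == k].mean(axis=0)
    # final assignment and cost J for this run
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    J = ((X - means[assign]) ** 2).sum()
    return J, means, assign

rng = np.random.default_rng(0)
# three well-separated blobs: different starts reach different local optima
X = np.concatenate([rng.normal(c, 0.2, size=(50, 2))
                    for c in [(0, 0), (4, 0), (0, 4)]])

# restart several times and keep the clustering with the lowest J
runs = [kmeans(X, K=3, rng=rng) for _ in range(10)]
best_J, best_means, best_assign = min(runs, key=lambda r: r[0])
```

With well-separated blobs, a bad local optimum (two centers sharing one blob) has a much larger J than the correct clustering, so selecting the minimum over restarts usually avoids it.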


Soft K-means

Instead of making hard assignments of data points to clusters, we can make soft assignments. One cluster may have a responsibility of .7 for a datapoint and another may have a responsibility of .3.

I Allows a cluster to use more information about the data in the refitting step.

I What happens to our convergence guarantee?

I How do we decide on the soft assignments?


Soft K-means Algorithm

Initialization: Set K means {m_k} to random values

Repeat until convergence (until assignments do not change):

I Assignment: Each data point n is given a soft "degree of assignment" to each cluster mean k, based on responsibilities

r_k^(n) = exp[−β d(m_k, x^(n))] / Σ_j exp[−β d(m_j, x^(n))]

I Update: Model parameters, the means, are adjusted to match the sample means of the data points they are responsible for:

m_k = Σ_n r_k^(n) x^(n) / Σ_n r_k^(n)
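The two updates above translate directly into NumPy. This is a minimal sketch (not the course's code; data and β are made up): responsibilities are a softmax over −β times the squared distances, and each mean is a responsibility-weighted average of the data.

```python
import numpy as np

def soft_kmeans(X, K, beta, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), K, replace=False)].copy()
    for _ in range(n_iter):
        # d(m_k, x^(n)): squared Euclidean distances, shape (N, K)
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        # r_k^(n) = exp[-beta d] / sum_j exp[-beta d]  (softmax over clusters)
        logits = -beta * d
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(logits)
        r /= r.sum(axis=1, keepdims=True)
        # m_k = sum_n r_k^(n) x^(n) / sum_n r_k^(n)
        means = (r.T @ X) / r.sum(axis=0)[:, None]
    return means, r

rng = np.random.default_rng(1)
# two synthetic clusters, for illustration
X = np.concatenate([rng.normal(c, 0.3, size=(60, 2)) for c in [(-3, 0), (3, 0)]])
means, r = soft_kmeans(X, K=2, beta=5.0)
# each row of r is a distribution over the K clusters
```

Note that, unlike hard k-means, a cluster's mean is pulled slightly by every data point, weighted by its responsibility.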


Questions about Soft K-means

How to set β?

What about problems with elongated clusters?

Clusters with unequal weight and width
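The role of β can be seen with made-up numbers (a small sketch, not from the slides): as β grows, the responsibility of the nearest cluster approaches 1, recovering hard k-means as a limiting case; as β shrinks, the responsibilities flatten toward uniform.

```python
import numpy as np

def responsibilities(d, beta):
    # d: distances from one point to each cluster mean
    logits = -beta * np.asarray(d, dtype=float)
    logits -= logits.max()  # numerical stability
    r = np.exp(logits)
    return r / r.sum()

d = [1.0, 2.0, 4.0]  # hypothetical distances to three cluster means
for beta in [0.1, 1.0, 10.0]:
    # small beta: nearly uniform; large beta: nearly one-hot
    print(beta, responsibilities(d, beta).round(3))
```

This makes β a knob trading off between the robustness of soft assignments and the decisiveness of hard ones; there is no single principled way to set it without further modeling assumptions.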


A Generative View of Clustering

We need a sensible measure of what it means to cluster the data well.

I This makes it possible to judge different models.

I It may make it possible to decide on the number of clusters.

An obvious approach is to imagine that the data was produced by a generative model.

I Then we can adjust the parameters of the model to maximize the probability that it would produce exactly the data we observed.
