lecture 8: clustering algorithms
STAT 545: Intro. to Computational Statistics
Vinayak Rao, Purdue University
September 11, 2019


Clustering

Given a large dataset, group data points into ‘clusters’.

Data points in the same cluster are similar in some sense.

E.g. cluster students' scores (to decide grades).

Applications:

Compression/feature-extraction/exploration/visualization

• simpler representation of complex data

Image segmentation, community detection, co-expressed genes.


Clustering (contd.)

We are given $N$ data vectors $(x_1, \ldots, x_N)$ in $\Re^d$. Let $c_i$ be the cluster assignment of observation $x_i$:

$c_i \in \{1, \ldots, K\}$

Equivalently, we can use one-hot (or 1-of-$K$) encoding:

$r_{ic} = \begin{cases} 1 & \text{if } c_i = c \\ 0 & \text{otherwise} \end{cases}$

E.g. $c_i = 3 \iff r_i = (0, 0, 1, 0, 0, \ldots, 0)$

Observe: $r_{ic} \ge 0$ and $\sum_{c=1}^{K} r_{ic} = 1$, just like a probability vector.

However, $r_{ic}$ is binary: we will relax this in later lectures.
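
To make the encoding concrete, here is a minimal numpy sketch (not from the slides; note it uses 0-based labels where the slides use 1-based):

import numpy as np

def one_hot(c, K):
    # c: length-N vector of assignments in {0, ..., K-1}
    # returns the N x K binary matrix R with R[i, c[i]] = 1
    R = np.zeros((len(c), K), dtype=int)
    R[np.arange(len(c)), c] = 1
    return R

c = np.array([2, 0, 1])
R = one_hot(c, K=5)
# R[0] = [0, 0, 1, 0, 0]; each row is nonnegative and sums to 1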


Cluster parameters

Associate cluster $i$ with a parameter $\theta_i \in \Re^d$ (the cluster prototype).

Write $\theta = \{\theta_1, \ldots, \theta_K\}$, $C = \{c_1, \ldots, c_N\}$ (or $R = \{r_1, \ldots, r_N\}$).

Problem: Given data $(x_1, \ldots, x_N)$, find $\theta$ and $C$.

Define a loss function $L(\theta, C)$, and minimize it.

Start by defining a distance (or similarity measure) $d(x, \theta)$:

$d(x, \theta) = \sum_{j=1}^{d} (x_j - \theta_j)^2$ (squared Euclidean, i.e. $L_2^2$, distance)

$d(x, \theta) = \sum_{j=1}^{d} |x_j - \theta_j|$ ($L_1$ distance)
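
Both distances are one-liners in numpy (a sketch, not course code):

import numpy as np

def sq_euclidean(x, theta):
    return np.sum((x - theta) ** 2)   # sum_j (x_j - theta_j)^2

def l1(x, theta):
    return np.sum(np.abs(x - theta))  # sum_j |x_j - theta_j|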


Clustering loss function

We want all members of a cluster to be close to the prototype:

$\sum_{i : c_i = c} d(x_i, \theta_c) = \sum_{i=1}^{N} r_{ic}\, d(x_i, \theta_c)$ should be small for each $c$.

Overall loss function:

$L(\theta, C) = \sum_{c=1}^{K} \sum_{i : c_i = c} d(x_i, \theta_c) = \sum_{c=1}^{K} \sum_{i=1}^{N} r_{ic}\, d(x_i, \theta_c)$

Optimize over both:

• cluster assignments (discrete)
• cluster parameters (continuous)
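
For the squared Euclidean case, the whole loss is one vectorized expression (a numpy sketch; X is N x d, theta is K x d, c holds 0-based assignments):

import numpy as np

def loss(X, theta, c):
    # theta[c] picks out each observation's prototype, so this is
    # sum_i d(x_i, theta_{c_i}) = sum_c sum_i r_ic d(x_i, theta_c)
    return np.sum((X - theta[c]) ** 2)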


K-means

Minimizing $L(\theta, C)$ exactly is hard ($O(N^{DK+1})$).

Instead, use heuristic greedy (local-search) algorithms. When $d(\cdot, \cdot)$ is Euclidean, the most popular is Lloyd's algorithm.

If we had the cluster parameters $\theta^*$, can we solve for $C$?

$C^{\text{opt}} = \arg\min_{C} L(\theta^*, C)$

If we had the cluster assignments $C^*$, can we solve for $\theta$?

$\theta^{\text{opt}} = \arg\min_{\theta} L(\theta, C^*)$
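
Both answers are yes, in closed form for squared Euclidean distance: assign each point to its nearest prototype, and set each prototype to the mean of its assigned points. A hedged numpy sketch:

import numpy as np

def assign(X, theta):
    # (N, K) matrix of squared distances, then nearest prototype per row
    d2 = np.sum((X[:, None, :] - theta[None, :, :]) ** 2, axis=2)
    return np.argmin(d2, axis=1)

def update(X, c, K):
    # the mean minimizes the within-cluster squared Euclidean loss
    # (an empty cluster would yield nan here; see the limitations slide)
    return np.array([X[c == k].mean(axis=0) for k in range(K)])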


K-means

Start with an initialization of the parameters, call it $\theta^0$. Assign observations to nearest clusters, giving $R^0$.

Repeat for $i$ in $1$ to $I$ (or until convergence):

• Recalculate cluster means, $\theta^i$
• Recalculate cluster assignments, $R^i$

This is coordinate descent.

The resulting algorithm has complexity $O(INKD)$, where $I$ is the number of iterations.
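
Putting the pieces together with the assign/update helpers sketched earlier (the initialization and stopping rule below are common choices, not prescribed by the slides):

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    theta = X[rng.choice(len(X), size=K, replace=False)]  # init at K random data points
    c = assign(X, theta)
    for _ in range(max_iter):
        theta = update(X, c, K)       # continuous coordinate: cluster means
        c_new = assign(X, theta)      # discrete coordinate: assignments
        if np.array_equal(c, c_new):  # assignments stable => loss cannot decrease further
            break
        c = c_new
    return theta, c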


[demo]


Questions

Does this algorithm converge to a global minimum?

Does it converge at all?

What is the convergence criterion?


Limitations

Local optima: sensitive to initialization.
Solution: run many times and pick the best clustering.

Empty clusters.
Solution: discard them, or use heuristics to assign observations to them.

Choosing $K$.
Solution: search over a set of $K$'s, penalizing larger values.

Requires roughly circular (spherical) clusters.
Solution: use some other method.
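
In practice libraries automate the restarts; e.g. scikit-learn's KMeans (an assumption that it is available, not part of the lecture) keeps the best of n_init random initializations, scored by the loss (inertia_):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# km.labels_ are the assignments C, km.cluster_centers_ the prototypes theta,
# km.inertia_ the lowest loss found across the 10 restarts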


Variations to k-means

Modify the distance function.
$L_1$ distance: k-medians.

Modify the algorithm.
$L_1$ distance: k-medoids (exemplar-based).
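
For k-medians, the only change to the earlier sketch is the update step, since the coordinate-wise median minimizes the $L_1$ within-cluster loss (again a sketch, not course code):

import numpy as np

def update_medians(X, c, K):
    # per-cluster, per-coordinate median replaces the mean
    return np.array([np.median(X[c == k], axis=0) for k in range(K)])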


Hierarchical clustering

k-means is a partitioning algorithm that assigns each observation to a unique cluster. Often there is no clear best clustering.


Often it is natural to view the data as having a hierarchical structure.


Two approaches:

Top-down (divisive) clustering:

• Initialize all observations into a single cluster, and divide clusters sequentially.

Bottom-up (agglomerative) clustering:

• Initialize each observation in its own cluster, and merge clusters sequentially.

• More flexible, and more common.


Agglomerative clustering

[Figure: dendrogram over points 1–5, merging the two closest clusters at each step.]


Agglomerative clustering

Pick a distance function (e.g. Euclidean).

Pick a linkage criterion defining distance between two clusters:

• Single linkage: $d(A, B) = \min_{x \in A,\, y \in B} d(x, y)$
• Complete linkage: $d(A, B) = \max_{x \in A,\, y \in B} d(x, y)$
• Centroid linkage: $d(A, B) = d(C_A, C_B)$ ($C_A$: centroid of $A$)
• Average linkage: $d(A, B) = \frac{1}{|A||B|} \sum_{x \in A,\, y \in B} d(x, y)$
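
SciPy implements all four criteria (a usage sketch, assuming scipy is available; the method names map directly onto the list above):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))
# method: 'single', 'complete', 'centroid', or 'average'
Z = linkage(X, method='average')                 # (N-1) x 4 table of merges
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters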
