Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16

Teaching k-Means New Tricks
Sergei Vassilvitskii, Google


TRANSCRIPT

Page 1: Teaching k-Means New Tricks

Sergei Vassilvitskii, Google

Page 2: k-Means Algorithm

The k-Means Algorithm [Lloyd '57]:
– Clusters points into groups
– Remains a workhorse of machine learning, even in the age of deep networks

Page 3: Lloyd's Method: k-means

Initialize with random clusters

Page 4: Lloyd's Method: k-means

Assign each point to nearest center

Page 5: Lloyd's Method: k-means

Recompute optimum centers (means)

Page 6: Lloyd's Method: k-means

Repeat: Assign points to nearest center

Page 7: Lloyd's Method: k-means

Repeat: Recompute centers

Page 8: Lloyd's Method: k-means

Repeat...

Page 9: Lloyd's Method: k-means

Repeat... until clustering does not change

Page 10: Lloyd's Method: k-means

Total error is reduced at every step, so the algorithm is guaranteed to converge.

Page 11: Lloyd's Method: k-means

Minimizes: $\phi(X, C) = \sum_{x \in X} d(x, C)^2$
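To make the loop on the preceding slides concrete, here is a minimal sketch of Lloyd's method in Python with NumPy. The function name, the iteration cap, and the empty-cluster handling are illustrative choices, not part of the talk:

import numpy as np

def lloyd(points, centers, max_iters=100):
    # Alternate between the two steps from the slides: assign each
    # point to its nearest center, then recompute each center as the
    # mean of its assigned points, until the clustering stops changing.
    centers = centers.astype(float)
    labels = None
    for _ in range(max_iters):
        # Assignment step: distance from every point to every center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # clustering did not change: converged
        labels = new_labels
        # Update step: move each center to the mean of its cluster.
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

Each iteration can only lower the objective $\phi(X, C)$, which is what guarantees convergence.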

Page 12: New Tricks for k-Means

Initialization:
– Is random initialization a good idea?

Large data:
– Clustering many points (in parallel)
– Clustering into many clusters

Page 13: k-means Initialization

Random?


Page 15: k-means Initialization

Random? A bad idea

Page 16: k-means Initialization

Even with many random restarts!

Page 17: Easy Fix

Select centers using a furthest point algorithm (a 2-approximation to k-Center clustering).


Page 22: Sensitive to Outliers

The furthest-point rule is sensitive to outliers: a single far-away point will be selected as a center.


Page 27: k-means++

Interpolate between the two methods: give preference to further points.

Let $D(p)$ be the distance between $p$ and the nearest cluster center. Sample the next center proportionally to $D^{\alpha}(p)$.

Page 28: k-means++

Interpolate between the two methods: give preference to further points.

Let $D(p)$ be the distance between $p$ and the nearest cluster center. Sample the next center proportionally to $D^{\alpha}(p)$.

kmeans++:
  Select first point uniformly at random
  for (int i = 1; i < k; ++i) {
    Select next point p with probability $D^{\alpha}(p) / \sum_x D^{\alpha}(x)$;
    UpdateDistances();
  }

Page 29: k-means++

– Original Lloyd's: $\alpha = 0$
– Furthest point: $\alpha = \infty$
– k-means++: $\alpha = 2$
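A short Python sketch of this seeding rule, with alpha exposed so the same code covers all three settings above (alpha = 0 is uniform random, alpha = 2 is k-means++, large alpha approaches the furthest-point rule). The function name and seed handling are illustrative:

import numpy as np

def kmeans_pp_seed(points, k, alpha=2, seed=0):
    # First center uniformly at random; each subsequent center sampled
    # with probability proportional to D(p)^alpha, where D(p) is the
    # distance from p to its nearest already-chosen center.
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]
    d = np.linalg.norm(points - centers[0], axis=1)
    for _ in range(1, k):
        weights = d ** alpha
        probs = weights / weights.sum()
        centers.append(points[rng.choice(len(points), p=probs)])
        # UpdateDistances(): keep each point's distance to its nearest center.
        d = np.minimum(d, np.linalg.norm(points - centers[-1], axis=1))
    return np.array(centers)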

Page 30: k-means++

[Illustration: k-means++ seeding on the example dataset.]

Page 34: k-means++

Theorem [AV '07]: k-means++ guarantees a $\Theta(\log k)$ approximation.
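For reference, the precise guarantee proved in [AV '07] bounds the expected cost of the seeding against the optimal k-means cost:

\mathbb{E}[\phi] \;\le\; 8(\ln k + 2)\,\phi_{\mathrm{OPT}}

The paper also gives a matching $\Omega(\log k)$ lower-bound example, which is why the slide states $\Theta(\log k)$.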

Page 35: New Tricks for k-Means

Initialization:
– Is random initialization a good idea?

Large data:
– Clustering many points (in parallel)
– Clustering into many clusters

Page 36: Dealing with Large Data

The new initialization approach:
– Leads to very good clusterings
– But is very sequential!
  • Must select one center at a time, then update the distribution we are sampling from
– How do we adapt it to the world of parallel computing?

Page 37: Speeding Up Initialization

Initialization:

kmeans++:
  Select first point uniformly at random
  for (int i = 1; i < k; ++i) {
    Select next point p with probability $D^2(p) / \sum_x D^2(x)$;
    UpdateDistances();
  }

Improving the speed:
– Instead of selecting a single point, sample many points at a time
– Oversample: select more than k centers, then select the best k out of them

Page 38: k-means||

kmeans++:
  Select first point uniformly at random
  for (int i = 1; i < k; ++i) {
    Select next point p with probability $D^2(p) / \sum_p D^2(p)$;
    UpdateDistances();
  }

Page 39: k-means||

kmeans||:
  Select first point c uniformly at random
  for (int i = 1; i < $\log_\ell(\phi(X, c))$; ++i) {
    Select each point p independently with probability $k \cdot \ell \cdot D^{\alpha}(p) / \sum_x D^{\alpha}(x)$;
    UpdateDistances();
  }
  Prune to k points total by clustering the clusters

Page 40: k-means||

– Independent selection within a round makes this an easy MapReduce step

Page 41: k-means||

– $\ell$ is the oversampling parameter

Page 42: k-means||

– The final prune back to k centers is a re-clustering step
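A sequential Python sketch of k-means||; the independent Bernoulli sampling inside each round is the step that parallelizes trivially (one MapReduce round). Following the VLDB 2012 paper, each point is kept with probability min(1, ℓ·d²(x)/φ); the fixed round count and the crude prune that keeps the k heaviest candidates are simplifications, since the real prune re-clusters the weighted candidates (e.g., with weighted k-means++):

import numpy as np

def kmeans_parallel_seed(points, k, ell, rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]
    d2 = np.linalg.norm(points - centers[0], axis=1) ** 2
    for _ in range(rounds):
        phi = d2.sum()
        # Independent selection: each point is kept on its own coin flip,
        # so this loop body is an easy map step.
        keep = rng.random(len(points)) < np.minimum(1.0, ell * d2 / phi)
        for p in points[keep]:
            centers.append(p)
            d2 = np.minimum(d2, np.linalg.norm(points - p, axis=1) ** 2)
    # Prune to k: weight each candidate by how many points it serves,
    # then (crudely, for illustration) keep the k heaviest candidates.
    C = np.array(centers)
    nearest = np.argmin(
        np.linalg.norm(points[:, None, :] - C[None, :, :], axis=-1), axis=1)
    weights = np.bincount(nearest, minlength=len(C))
    return C[np.argsort(-weights)[:k]]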

Page 43: k-means||: Analysis

How many rounds?
– Theorem: After $O(\log_\ell(n\Delta))$ rounds, guarantees an $O(1)$ approximation
– In practice: fewer iterations are needed
– Need to re-cluster the $O(k \ell \log_\ell(n\Delta))$ intermediate centers

Discussion:
– Number of rounds is independent of k
– Tradeoff between the number of rounds and memory

Page 44: How well does this work?

[Figure: clustering cost vs. number of rounds (log scale) on the KDD dataset, for k = 17, 33, 65, and 129, comparing random initialization, k-means++, and k-means|| with $\ell/k$ = 1, 2, and 4.]

Page 45: Performance vs. k-means++

– Even better on small datasets: 4,600 points, 50 dimensions (SPAM)
– Accuracy: [table not captured in the transcript]
– Time (iterations): [table not captured in the transcript]

Page 46: New Tricks for k-Means

Initialization:
– Is random initialization a good idea?

Large data:
– Clustering many points (in parallel)
– Clustering into many clusters

Page 47: Large k

How do you run k-means when k is large?
– For every point, need to find the nearest center

Page 48: Large k

– Naive approach: linear scan

Page 49: Large k

– Better approach [Elkan]:
  • Use the triangle inequality to check whether a center could possibly have gotten closer (see the sketch below)
  • Still expensive when k is large
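A minimal sketch of the triangle-inequality test at the heart of Elkan's approach. The full algorithm also maintains per-point upper and lower bounds across iterations; this shows only the core pruning rule, and the names are illustrative:

import numpy as np

def assign_with_pruning(points, centers):
    # Center-to-center distances, computed once per assignment pass.
    cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    labels = np.empty(len(points), dtype=int)
    for i, x in enumerate(points):
        best, best_d = 0, np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            # Triangle inequality: if d(c_best, c_j) >= 2 d(x, c_best),
            # then d(x, c_j) >= d(x, c_best), so center j can be skipped
            # without ever computing its distance to x.
            if cc[best, j] >= 2 * best_d:
                continue
            dj = np.linalg.norm(x - centers[j])
            if dj < best_d:
                best, best_d = j, dj
        labels[i] = best
    return labels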

Page 50: Using Nearest Neighbor Data Structures

Expensive step of k-Means:
– For every point, find the nearest center

But we have many algorithms for nearest neighbors!

Page 51: Using Nearest Neighbor Data Structures

First idea:
– Index the centers, then do a query into this data structure for every point
– But the index must be rebuilt every iteration, because the centers move

Page 52: Using Nearest Neighbor Data Structures

Better idea:
– Index the points! They never move, so the index is built only once
– For every center, query the nearest points (a sketch follows)
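A sketch of the index-the-points idea using a k-d tree (scipy.spatial.cKDTree): the index over the static points is built once, whereas an index over the centers would need rebuilding every iteration. The k-d tree is a stand-in that suits dense, low-dimensional data; the talk's setting (millions of sparse features) uses the ranked-retrieval index of the WSDM 2014 paper instead. The retrieval depth m is an illustrative knob:

import numpy as np
from scipy.spatial import cKDTree

def assign_via_point_index(points, centers, m=100):
    # Built once: the points never move (assumes m <= len(points)).
    tree = cKDTree(points)
    n = len(points)
    best_d = np.full(n, np.inf)
    labels = np.full(n, -1)
    for j, c in enumerate(centers):
        # For every center, query its m nearest points; each point keeps
        # the closest center that has reached it so far.
        dists, idx = tree.query(c, k=m)
        closer = dists < best_d[idx]
        best_d[idx[closer]] = dists[closer]
        labels[idx[closer]] = j
    # Points reached by no center's query fall back to a direct scan.
    for i in np.flatnonzero(labels < 0):
        labels[i] = np.argmin(np.linalg.norm(centers - points[i], axis=1))
    return labels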

Page 53: Performance

Two large datasets:
– 1M points in each
– 7-25M features in each (very high dimensionality)
– Clustering into k = 1000 clusters

Page 54: Performance

Index-based k-means:
– Simple implementation: 2-7x faster than traditional k-means
– No degradation in quality (same objective function value)
– More complex implementation: an additional 8-50x speed improvement!

Page 55: k-Means Algorithm

Almost 60 years on, still an incredibly popular and useful approach. It has gotten better with age:
– Better initialization approaches that are fast and accurate
– Parallel implementations to handle large datasets
– New implementations that handle points in many dimensions and clustering into many clusters
– New approaches for online clustering

Page 56: k-Means Algorithm

More work remains!
– Non-spherical clusters
– Other metric spaces
– Dealing with outliers

Page 57: Thank You.

Arthur, D., Vassilvitskii, S. k-means++: The advantages of careful seeding. SODA 2007.

Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S. Scalable k-means++. VLDB 2012.

Broder, A., Garcia, L., Josifovski, V., Vassilvitskii, S., Venkatesan, S. Scalable k-means by ranked retrieval. WSDM 2014.