clustering - wordpress.com · modern clustering problem may involve euclidean spaces of very high...

16
Clustering Lecture-9 Barna Saha Acknowledgement: Some of the slides taken from Jeff Ullman’s course on Mining Massive Datasets

Upload: others

Post on 27-Jul-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

Clustering Lecture-9

Barna Saha

Acknowledgement: Some of the slides taken from Jeff Ullman’s course on Mining Massive Datasets

Page 2: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

The Problem of Clustering �  Given a set of points, with a notion of distance

between points, group the points into some number of clusters, so that members of a cluster are “close” to each other, while members of different clusters are “far.”

2

Page 3: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

Example: Clusters

3

x x x x x x x x x x x x x

x x

x xx x x x

x x x x

x x x x

x x x x x x x x x

x

x

x

Page 4: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

Clustering in Low Dimensional Euclidean

Space is Easy

Page 5: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

Modern Clustering Problem �  May involve Euclidean spaces of very high

dimension.

�  Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance, Edit Distance etc.

�  Example: �  Cluster documents by topics based on occurrences of

unusual words �  Cluster moviegoers by the type or types of movies

they like �  Cluster genes by their sequence similarity

Page 6: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

Clustering Strategies �  Two fundamentally different approaches

�  Hierarchical or Agglomerative Clustering �  Start with each point in its own cluster

�  Merge clusters based on ``closeness’’

�  Point Assignment �  Start with some clusters (possibly empty)

�  Consider points and insert them in appropriate clusters

Page 7: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

Which is Better?

�  Point assignment good when clusters are nice, convex shapes.

�  Hierarchical can win when shapes are weird.

7

Page 8: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

8

Hierarchical Clustering �  Two important questions:

1.  How do you determine the “nearness” of clusters?

2.  How do you represent a cluster of more than one point?

Page 9: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

9

Hierarchical Clustering – (2) �  Key problem: as you build clusters, how do you

represent the location of each cluster, to tell which pair of clusters is closest?

�  Euclidean case: each cluster has a centroid = average of its points. �  Measure intercluster distances by distances of

centroids.

Page 10: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

Example

x (1.5,1.5)

x (4.5,0.5) x (1,1)

x (4.7,1.3)

Page 11: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

Tree showing the grouping of points

(0,0) (1,2) (2,1) (4,1) (5,0) (5,3)

Page 12: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

And in the Non-Euclidean Case?

�  The only “locations” we can talk about are the points themselves. �  I.e., there is no “average” of two points.

�  Approach 1: clustroid = point “closest” to other points. �  Treat clustroid as if it were centroid, when computing

intercluster distances.

Page 13: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

13

“Closest” Point? �  Possible meanings:

1.  Smallest maximum distance to the other points.

2.  Smallest average distance to other points. 3.  Smallest sum of squares of distances to other

points. 4.  Etc., etc.

Page 14: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

Efficiency of Hierarchical Clustering

�  Start by computing O(n2) distances

�  Subsequent steps taken O((n-1)2), O((n-2)2),….

�  Total=O(n3)

�  Can be reduced to O(n2logn) using priority queue (See 7.2.2)

�  Can you reduce it to o(n2)?

Page 15: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

K-Means �  An example of a point-assignment based clustering

�  Initially choose k points that are likely to be in different clusters

�  Make these points the centroids of their clusters

�  For each remaining point p DO �  Find the centroid to which p is closest

�  Add p to the cluster of that centroid

�  Adjust the centroid of that cluster to account for p

�  Optional: reassign all points based on the new centroids. Repeat as long as there is any change in assignment.

Page 16: Clustering - WordPress.com · Modern Clustering Problem May involve Euclidean spaces of very high dimension. Non Euclidean space: Jaccard distance, Cosine Distance, Hamming Distance,

How to select the K centers?