
NPFL103: Information Retrieval (10)
Document clustering

Pavel Pecina, [email protected]

Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University

Original slides are courtesy of Hinrich Schütze, University of Stuttgart.


Contents

Introduction

K-means

Evaluation

How many clusters?

Hierarchical clustering

Variants


Introduction


Clustering: Definition

(Document) clustering is the process of grouping a set of documents into clusters of similar documents.

Documents within a cluster should be similar.

Documents from different clusters should be dissimilar.

Clustering is the most common form of unsupervised learning.

Unsupervised = there are no labeled or annotated data.


Exercise: Data set with clear cluster structure

[Figure: a 2D scatter plot (x-axis 0.0-2.0, y-axis 0.0-2.5) of points with a visually obvious cluster structure.]

Classification vs. Clustering

Classification: supervised learning

Clustering: unsupervised learning

Classification: Classes are human-defined and part of the input to the learning algorithm.

Clustering: Clusters are inferred from the data without human input.

However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, …


The cluster hypothesis

Cluster hypothesis: Documents in the same cluster behave similarly with respect to relevance to information needs.

All applications of clustering in IR are based (directly or indirectly) on the cluster hypothesis.

Van Rijsbergen’s original wording (1979): “closely associated documents tend to be relevant to the same requests”.


Applications of clustering in IR

application               what is clustered?  benefit
search result clustering  search results      more effective information presentation to user
collection clustering     collection          effective information presentation for exploratory browsing
cluster-based retrieval   collection          higher efficiency: faster search


Search result clustering for better navigation


Global navigation: Yahoo


Global navigation: MeSH


Global navigation: MeSH (lower level)


Navigational hierarchies: Manual vs. automatic creation

Note: Yahoo/MeSH are not examples of clustering …

…but they are well-known examples of using a global hierarchy for navigation.

Example of global navigation/exploration based on clustering: Google News


Clustering for improving recall

To improve search recall: cluster docs in the collection a priori.

When a query matches a doc d, also return other docs in the cluster containing d.

Hope: if we do this, the query “car” will also return docs containing “automobile”.

Because the clustering algorithm groups together docs containing “car” with those containing “automobile”:

Both types of documents contain words like “parts”, “dealer”, “mercedes”, “road trip”.


Goals of clustering

General goal: put related docs in the same cluster, put unrelated docs in different clusters.

We’ll see different ways of formalizing this.

The number of clusters should be appropriate for the data set we are clustering.

Initially, we will assume the number of clusters K is given.

Later: semiautomatic methods for determining K

Secondary goals in clustering:

Avoid very small and very large clusters.

Define clusters that are easy to explain to the user.


Flat vs. hierarchical clustering

Flat algorithms: usually start with a random (partial) partitioning of docs into groups.

Refine iteratively.

Main algorithm: K-means

Hierarchical algorithms: create a hierarchy.

Bottom-up: agglomerative

Top-down: divisive


Hard vs. soft clustering

Hard clustering: Each document belongs to exactly one cluster.

More common and easier to do

Soft clustering: A document can belong to more than one cluster.

Makes more sense for applications like creating browsable hierarchies:

You may want to put sneakers in two clusters, sports apparel and shoes.

You can only do that with a soft clustering approach.

This class: flat and hierarchical hard clustering

Next class: latent semantic indexing, a form of soft clustering


Flat algorithms

Flat algorithms compute a partition of N documents into K clusters.

Given: a set of documents and the number K

Find: a partition into K clusters optimizing the chosen criterion

Global optimization: exhaustively enumerate all partitions and pick the optimal one. Not tractable.

Effective heuristic method: the K-means algorithm


K-means


Perhaps the best known clustering algorithm

Simple, works well in many cases

Use as default / baseline for clustering documents


Document representations in clustering

Vector space model

As in vector space classification, we measure relatedness between vectors by Euclidean distance …

…which is almost equivalent to cosine similarity.

Almost: centroids are not length-normalized.


K-means: Basic idea

Each cluster in K-means is defined by a centroid.

Objective/partitioning criterion: minimize the average squared difference from the centroid.

Recall the definition of a centroid (ω denotes a cluster):

µ(ω) = (1/|ω|) · Σ_{x ∈ ω} x

We search for the minimum average squared difference by iterating two steps:

reassignment: assign each vector to its closest centroid

recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment


K-means pseudocode (µk is centroid of ωk)

K-means({x1, …, xN}, K)
  (s1, s2, …, sK) ← SelectRandomSeeds({x1, …, xN}, K)
  for k ← 1 to K
      do µk ← sk
  while stopping criterion has not been met
      do for k ← 1 to K
             do ωk ← {}
         for n ← 1 to N
             do j ← argmin_j′ |µj′ − xn|
                ωj ← ωj ∪ {xn}   (reassignment of vectors)
         for k ← 1 to K
             do µk ← (1/|ωk|) · Σ_{x ∈ ωk} x   (recomputation of centroids)
  return {µ1, …, µK}
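The pseudocode maps almost line for line onto NumPy. Below is a minimal sketch, not the lecture’s reference implementation: the stopping criterion (assignments no longer change), max_iter, and the seed argument are illustrative choices.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means over an (N, M) matrix of document vectors."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # SelectRandomSeeds
    assign = None
    for _ in range(max_iter):
        # Reassignment: each vector goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # stopping criterion: assignments no longer change
        assign = new_assign
        # Recomputation: each centroid becomes the mean of its vectors.
        for k in range(K):
            if np.any(assign == k):
                mu[k] = X[assign == k].mean(axis=0)
    return mu, assign
```

For example, kmeans(np.vstack([np.random.rand(10, 2), np.random.rand(10, 2) + 3]), K=2) recovers the two blobs.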

Worked Example: K-means on a small set of points in the plane

[Figures, slides 24-46: starting from a random selection of two initial centroids (marked ×), the algorithm alternates between two steps: assign each point to its closest centroid, then recompute each centroid as the mean of the points assigned to it. The assignments change in the first few iterations, then stabilize; the final slide shows the centroids and assignments after convergence.]


K-means is guaranteed to converge: Proof

RSS = sum of all squared distances between each document vector and its closest centroid.

RSS decreases during each reassignment step, because each vector is moved to a closer centroid.

RSS decreases during each recomputation step. (See the book for a proof.)

There is only a finite number of clusterings.

Thus: we must reach a fixed point.

Assumption: ties are broken consistently.

Finite set & monotonically decreasing → convergence
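For reference, the quantity both steps decrease can be written out explicitly (standard notation, consistent with the centroid definition given earlier):

RSS = Σ_{k=1}^{K} Σ_{x ∈ ωk} |x − µ(ωk)|²

Reassignment can only decrease RSS because each vector moves to a centroid that is at least as close; recomputation can only decrease it because the mean of a cluster is the point that minimizes the sum of squared distances to the cluster’s vectors.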


Convergence and optimality of K-means

K-means is guaranteed to converge …

…but we don’t know how long convergence will take!

If we don’t care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations).

However, complete convergence can take many more iterations.

Convergence ≠ optimality

Convergence does not mean that we converge to the optimal clustering!

This is the great weakness of K-means.

If we start with a bad set of seeds, the resulting clustering can be horrible.


Exercise: Suboptimal clustering

[Figure: six points d1, d2, d3 (upper row) and d4, d5, d6 (lower row) in the plane.]

What is the optimal clustering for K = 2?

Do we converge on this clustering for arbitrary seeds di, dj?


Initialization of K-means

Random seed selection is just one of many ways K-means can be initialized.

Random seed selection is not very robust: it’s easy to get a suboptimal clustering.

Better ways of computing initial centroids:

Select seeds not randomly, but using some heuristic (e.g., filter out outliers, or find a set of seeds that has “good coverage” of the document space).

Use hierarchical clustering to find good seeds.

Select i (e.g., i = 10) different random sets of seeds, do a K-means clustering for each, and select the clustering with the lowest RSS (see the sketch below).
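The last strategy is straightforward to code. A minimal sketch, reusing the kmeans function sketched earlier; the rss helper and the default of 10 restarts are illustrative:

```python
import numpy as np

def rss(X, mu, assign):
    """Residual sum of squares of a clustering."""
    return sum(np.sum((X[assign == k] - mu[k]) ** 2) for k in range(len(mu)))

def kmeans_restarts(X, K, restarts=10):
    """Run K-means from several random seed sets; keep the lowest-RSS result."""
    best = None
    for i in range(restarts):
        mu, assign = kmeans(X, K, seed=i)  # kmeans as sketched earlier
        cost = rss(X, mu, assign)
        if best is None or cost < best[0]:
            best = (cost, mu, assign)
    return best[1], best[2]
```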


Time complexity of K-means

Computing one distance of two vectors is O(M).

Reassignment step: O(KNM) (we need to compute K · N document-centroid distances)

Recomputation step: O(NM) (we need to add each of the document’s < M values to one of the centroids)

Assume number of iterations bounded by I

Overall complexity: O(IKNM) – linear in all important dimensions

However: This is not a real worst-case analysis.

In pathological cases, complexity can be worse than linear.


Evaluation


What is a good clustering?

Internal criteria. Example of an internal criterion: RSS in K-means.

But an internal criterion often does not evaluate the actual utility of a clustering in the application.

Alternative: external criteria. Evaluate with respect to a human-defined classification.


External criteria for clustering quality

Based on a gold standard data set, e.g., the Reuters collection we also used for the evaluation of classification

Goal: the clustering should reproduce the classes in the gold standard.

(But we only want to reproduce how documents are divided into groups, not the class labels.)

First measure for how well we were able to reproduce the classes: purity


External criterion: Purity

purity(Ω, C) = (1/N) · Σ_k max_j |ωk ∩ cj|

Ω = {ω1, ω2, …, ωK} is the set of clusters and C = {c1, c2, …, cJ} is the set of classes.

For each cluster ωk: find the class cj with the most members nkj in ωk.

Sum all nkj and divide by the total number of points N.


Example for computing purity

[Figure: 17 points from three classes (x, o, ⋄) grouped into three clusters: cluster 1 contains five x’s and one o; cluster 2 contains four o’s, one x, and one ⋄; cluster 3 contains three ⋄’s and two x’s.]

To compute purity:

5 = max_j |ω1 ∩ cj| (class x, cluster 1);

4 = max_j |ω2 ∩ cj| (class o, cluster 2); and

3 = max_j |ω3 ∩ cj| (class ⋄, cluster 3).

Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.
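Purity is only a few lines of code. A sketch, assuming the clustering and the gold classes are given as parallel label sequences; the label values below mirror the figure and are illustrative:

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """purity(Ω, C) = (1/N) · Σ_k max_j |ωk ∩ cj|."""
    total = 0
    for k in set(cluster_labels):
        # Count classes inside cluster k and take the majority class count.
        counts = Counter(c for w, c in zip(cluster_labels, class_labels) if w == k)
        total += max(counts.values())
    return total / len(class_labels)

# The x/o/diamond example: 17 points in three clusters.
clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = (['x'] * 5 + ['o']) + (['o'] * 4 + ['x', 'd']) + (['d'] * 3 + ['x'] * 2)
print(purity(clusters, classes))  # (5 + 4 + 3) / 17 ≈ 0.71
```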


Another external criterion: Rand index

Purity can be increased easily by increasing K. A measure that does not have this problem: the Rand index.

RI = (TP + TN) / (TP + FP + FN + TN)

Based on a 2×2 contingency table of all pairs of documents:

                   same cluster          different clusters
same class         true positives (TP)   false negatives (FN)
different classes  false positives (FP)  true negatives (TN)

TP + FN + FP + TN is the total number of pairs: C(N, 2) for N docs.

Each pair is either positive or negative (the clustering puts the two documents in the same or in different clusters) …

…and either “true” (correct) or “false” (incorrect): the clustering decision is correct or incorrect.


Example: compute Rand Index for the o/⋄/x example

We first compute TP + FP. The three clusters contain 6, 6, and 5 points, respectively, so the total number of “positives”, i.e., pairs of documents that are in the same cluster, is:

TP + FP = C(6, 2) + C(6, 2) + C(5, 2) = 15 + 15 + 10 = 40

Of these, the x pairs in cluster 1, the o pairs in cluster 2, the ⋄ pairs in cluster 3, and the x pair in cluster 3 are true positives:

TP = C(5, 2) + C(4, 2) + C(3, 2) + C(2, 2) = 10 + 6 + 3 + 1 = 20

Thus, FP = 40 − 20 = 20.

FN and TN are computed similarly.


Rand index for the o/⋄/x example

                   same cluster  different clusters
same class         TP = 20       FN = 24
different classes  FP = 20       TN = 72

RI is then (20 + 72) / (20 + 20 + 24 + 72) ≈ 0.68.
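The Rand index can be computed directly from all document pairs. A sketch, using the same label-sequence convention (and the clusters/classes lists) as the purity example:

```python
from itertools import combinations

def rand_index(cluster_labels, class_labels):
    """RI = (TP + TN) / (TP + FP + FN + TN) over all pairs of documents."""
    tp = fp = fn = tn = 0
    for (w1, c1), (w2, c2) in combinations(zip(cluster_labels, class_labels), 2):
        if w1 == w2 and c1 == c2:
            tp += 1  # same cluster, same class
        elif w1 == w2:
            fp += 1  # same cluster, different classes
        elif c1 == c2:
            fn += 1  # different clusters, same class
        else:
            tn += 1  # different clusters, different classes
    return (tp + tn) / (tp + fp + fn + tn)

print(rand_index(clusters, classes))  # (20 + 72) / 136 ≈ 0.68
```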


Two other external evaluation measures

Normalized mutual information (NMI)

How much information does the clustering contain about the classification?

Singleton clusters (number of clusters = number of docs) have maximum MI.

Therefore: normalize by the entropy of clusters and classes.

F measure

Like Rand, but “precision” and “recall” can be weighted.


Evaluation results for the o/⋄/x example

                    purity  NMI   RI    F5
lower bound         0.0     0.0   0.0   0.0
maximum             1.0     1.0   1.0   1.0
value for example   0.71    0.36  0.68  0.46

All measures range from 0 (bad clustering) to 1 (perfect clustering).


How many clusters?


The number of clusters K is given in many applications, e.g., there may be an external constraint on K.

What if there is no external constraint? Is there a “right” number of clusters?

One way to go: define an optimization criterion. Given the docs, find the K for which the optimum is reached.

What optimization criterion can we use?

We can’t use RSS or average squared distance from the centroid as the criterion: that always chooses K = N clusters.


Simple objective function for K: Basic idea

Start with 1 cluster (K = 1)

Keep adding clusters (= keep increasing K)

Add a penalty for each new cluster

Then trade off cluster penalties against average squared distance from centroid

Choose the value of K with the best tradeoff


Simple objective function for K: Formalization

Given a clustering, define the cost for a document as the (squared) distance to its centroid.

Define the total distortion RSS(K) as the sum of all individual document costs (corresponds to average distance).

Then: penalize each cluster with a cost λ.

Thus, for a clustering with K clusters, the total cluster penalty is Kλ.

Define the total cost of a clustering as distortion plus total cluster penalty: RSS(K) + Kλ.

Select the K that minimizes RSS(K) + Kλ.

Still need to determine a good value for λ …
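A sketch of this selection procedure, reusing the kmeans and rss helpers from the earlier examples; the search range and the value of λ are up to the user:

```python
def choose_k(X, k_max, lam):
    """Pick the K that minimizes total cost RSS(K) + K * lambda."""
    best_k, best_cost = None, float("inf")
    for k in range(1, k_max + 1):
        mu, assign = kmeans(X, k)
        cost = rss(X, mu, assign) + k * lam  # distortion + total cluster penalty
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k
```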


Finding the “knee” in the curve

[Figure: residual sum of squares (y-axis, ~1750-1950) plotted against the number of clusters (x-axis, 2-10); the curve drops steeply at first and then flattens.]

Pick the number of clusters where the curve “flattens”. Here: 4 or 9.


Hierarchical clustering


Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier in Reuters:

[Figure: a tree with root TOP and two subtrees: “regions” with leaves France, UK, China, Kenya, and “industries” with leaves coffee, poultry, oil & gas.]

We want to create this hierarchy automatically.

We can do this either top-down or bottom-up.

The best known bottom-up method is hierarchical agglomerative clustering.


Hierarchical agglomerative clustering (HAC)

HAC creates a hierarchy in the form of a binary tree.

It assumes a similarity measure for determining the similarity of two clusters.

Up to now, our similarity measures were for documents.

We will look at four different cluster similarity measures.


HAC: Basic algorithm

Start with each document in a separate cluster

Then repeatedly merge the two clusters that are most similar

Until there is only one cluster.

The history of merging is a hierarchy in the form of a binary tree.

The standard way of depicting this history is a dendrogram.


A dendrogram

The history of mergers can be read off from bottom to top.

The horizontal line of each merger tells us what the similarity of the merger was.

We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering.


Divisive clustering

Divisive clustering is top-down.

It is an alternative to HAC (which is bottom-up).

Divisive clustering:

Start with all docs in one big cluster.

Then recursively split clusters.

Eventually each node forms a cluster on its own.

→ Bisecting K-means at the end

For now: HAC (= bottom-up)


Naive HAC algorithm

SimpleHAC(d1, …, dN)
  for n ← 1 to N
      do for i ← 1 to N
             do C[n][i] ← Sim(dn, di)
         I[n] ← 1   (keeps track of active clusters)
  A ← []   (collects clustering as a sequence of merges)
  for k ← 1 to N − 1
      do ⟨i, m⟩ ← argmax_{⟨i,m⟩ : i ≠ m ∧ I[i] = 1 ∧ I[m] = 1} C[i][m]
         A.Append(⟨i, m⟩)   (store merge)
         for j ← 1 to N
             do   (use i as representative for ⟨i, m⟩)
                C[i][j] ← Sim(⟨i, m⟩, j)
                C[j][i] ← Sim(⟨i, m⟩, j)
         I[m] ← 0   (deactivate cluster)
  return A
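As a concrete rendering, here is a sketch of the naive algorithm specialized to single-link similarity (clusters stored as sets of point indices; it deliberately keeps the naive structure whose cost is analysed on the next slide):

```python
import numpy as np

def naive_hac_single_link(X):
    """Naive single-link HAC over an (N, M) matrix; returns the merge sequence."""
    X = np.asarray(X, dtype=float)
    sim = -np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # sim = -distance
    clusters = {i: {i} for i in range(len(X))}  # active clusters
    merges = []
    while len(clusters) > 1:
        # Scan all pairs of active clusters for the maximum similarity.
        a, b = max(
            ((a, b) for a in clusters for b in clusters if a < b),
            key=lambda p: max(sim[i, j] for i in clusters[p[0]] for j in clusters[p[1]]),
        )
        clusters[a] |= clusters.pop(b)  # merge: a becomes the representative
        merges.append((a, b))
    return merges
```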


Computational complexity of the naive algorithm

First, we compute the similarity of all N × N pairs of documents.

Then, in each of N iterations:

We scan the O(N × N) similarities to find the maximum similarity.

We merge the two clusters with maximum similarity.

We compute the similarity of the new cluster with all other (surviving) clusters.

There are O(N) iterations, each performing an O(N × N) “scan” operation.

Overall complexity is O(N³).

We’ll look at more efficient algorithms later.


Key question: How to define cluster similarity

Single-link: maximum similarity. The maximum similarity of any two documents, one from each cluster.

Complete-link: minimum similarity. The minimum similarity of any two documents, one from each cluster.

Centroid: average “intersimilarity”. The average similarity of all document pairs (but excluding pairs of docs in the same cluster). This is equivalent to the similarity of the centroids.

Group-average: average “intrasimilarity”. The average similarity of all document pairs, including pairs of docs in the same cluster.

(A sketch of all four measures in code follows below.)
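The four definitions side by side, as a sketch over raw vectors with dot-product similarities. The group-average form assumes length-normalized vectors, so that each self-similarity is 1, matching the formula given later in this section:

```python
import numpy as np

def single_link(A, B):
    return (A @ B.T).max()   # maximum similarity of any cross-cluster pair

def complete_link(A, B):
    return (A @ B.T).min()   # minimum similarity of any cross-cluster pair

def centroid_sim(A, B):
    # Average intersimilarity = dot product of the two centroids.
    return A.mean(axis=0) @ B.mean(axis=0)

def group_average(A, B):
    # Average intrasimilarity over the merged cluster, self-similarities excluded.
    D = np.vstack([A, B])
    n = len(D)
    s = D.sum(axis=0)
    return (s @ s - n) / (n * (n - 1))  # assumes unit-length vectors (d·d = 1)
```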


Cluster similarity: Example

[Figure, slides 76-80: the same four points in the plane, shown once for each cluster-similarity measure.]

Single-link: Maximum similarity

Complete-link: Minimum similarity

Centroid: Average intersimilarity

intersimilarity = similarity of two documents in different clusters

Group average: Average intrasimilarity

intrasimilarity = similarity of any pair, including cases in the same cluster

Cluster similarity: Larger Example

[Figure, slides 81-85: a larger set of points in the plane, again shown once for each measure: single-link (maximum similarity), complete-link (minimum similarity), centroid (average intersimilarity), and group average (average intrasimilarity).]

Single link HAC

The similarity of two clusters is the maximum intersimilarity: the maximum similarity of a document from the first cluster and a document from the second cluster.

Once we have merged two clusters, how do we update the similarity matrix?

This is simple for single link:

sim(ωi, (ωk1 ∪ ωk2)) = max(sim(ωi, ωk1), sim(ωi, ωk2))


This dendrogram was produced by single-link

Notice: many small clusters (1 or 2 members) being added to the main cluster.

There is no balanced 2-cluster or 3-cluster clustering that can be derived by cutting the dendrogram.


Complete link HAC

The similarity of two clusters is the minimum intersimilarity: the minimum similarity of a document from the first cluster and a document from the second cluster.

Once we have merged two clusters, how do we update the similarity matrix?

Again, this is simple:

sim(ωi, (ωk1 ∪ ωk2)) = min(sim(ωi, ωk1), sim(ωi, ωk2))

We measure the similarity of two clusters by computing the diameter of the cluster that we would get if we merged them.


Complete-link dendrogram

Notice that this dendrogram is much more balanced than the single-link one.

We can create a 2-cluster clustering with two clusters of about the same size.


Exercise: Compute single and complete link clusterings

[Figure: eight points d1-d8 in the plane, arranged in two horizontal rows of four.]

Single-link clustering

[Figure: the single-link clustering of the eight points.]

Complete link clustering

[Figure: the complete-link clustering of the eight points.]

Single-link vs. Complete link clustering

[Figure: the two clusterings side by side.]

Single-link: Chaining

[Figure: two long horizontal rows of closely spaced points, each linked up into a single chain.]

Single-link clustering often produces long, straggly clusters. For most applications, these are undesirable.


What 2-cluster clustering will complete-link produce?

[Figure: five points d1-d5 on a line from 0 to 7.]

Coordinates: d1 = 1 + 2ε, d2 = 4, d3 = 5 + 2ε, d4 = 6, d5 = 7 − ε.


Complete-link: Sensitivity to outliers

[Figure: the same five points d1-d5 as on the previous slide.]

The complete-link clustering of this set splits d2 from its right neighbors – clearly undesirable.

The reason is the outlier d1.

This shows that a single outlier can negatively affect the outcome of complete-link clustering.

Single-link clustering does better in this case.


Centroid HAC

The similarity of two clusters is the average intersimilarity: the average similarity of documents from the first cluster with documents from the second cluster.

A naive implementation of this definition is inefficient (O(N²)), but the definition is equivalent to computing the similarity of the centroids:

sim-cent(ωi, ωj) = µ(ωi) · µ(ωj)

Hence the name: centroid HAC

Note: this is the dot product, not cosine similarity!


Exercise: Compute centroid clustering

[Figure: six points d1-d6 in the plane.]

Centroid clustering

[Figure: the centroid clustering of d1-d6, with the cluster centroids µ1, µ2, µ3 marked.]

Inversion in centroid clustering

In an inversion, the similarity increases during a merge sequence. This results in an “inverted” dendrogram.

Below: the similarity of the first merger (d1 ∪ d2) is −4.0; the similarity of the second merger ((d1 ∪ d2) ∪ d3) is ≈ −3.5.

[Figure: three points d1, d2, d3 in the plane with the centroid of d1 and d2 marked; next to it, the inverted dendrogram on a similarity scale from 0 down to −4.]


Inversions

Hierarchical clustering algorithms that allow inversions are inferior.

The rationale for hierarchical clustering is that at any given point, we’ve found the most coherent clustering for a given K.

Intuitively: smaller clusterings should be more coherent than larger clusterings.

An inversion contradicts this intuition: we have a large cluster that is more coherent than one of its subclusters.

The fact that inversions can occur in centroid clustering is a reason not to use it.


Group-average agglomerative clustering (GAAC)

GAAC also has an “average-similarity” criterion, but it does not have inversions.

The similarity of two clusters is the average intrasimilarity: the average similarity of all document pairs (including those from the same cluster).

But we exclude self-similarities.


Group-average agglomerative clustering (GAAC)

Again, a naive implementation is inefficient (O(N²)), and there is an equivalent, more efficient, centroid-based definition:

sim-ga(ωi, ωj) = 1 / ((Ni + Nj)(Ni + Nj − 1)) · [ (Σ_{dm ∈ ωi ∪ ωj} dm)² − (Ni + Nj) ]

Again, this is the dot product, not cosine similarity.


Which HAC clustering should I use?

Don’t use centroid HAC because of inversions.

In most cases: GAAC is best, since it isn’t subject to chaining or to sensitivity to outliers.

However, we can only use GAAC for vector representations.

For other types of document representations (or if only pairwise similarities between documents are available): use complete-link.

There are also some applications for single-link (e.g., duplicate detection in web search).


Flat or hierarchical clustering?

For high efficiency: use flat clustering (or perhaps bisecting K-means).

For deterministic results: HAC

When a hierarchical structure is desired: a hierarchical algorithm

HAC can also be applied if K cannot be predetermined (it can start without knowing K).


Variants


Bisecting K-means: A top-down algorithm

Start with all documents in one cluster

Split the cluster into 2 using K-means

Of the clusters produced so far, select one to split (e.g., select the largest one).

Repeat until we have produced the desired number of clusters


Bisecting K-means

BisectingKMeans(d1, …, dN)
  ω0 ← {d1, …, dN}
  leaves ← {ω0}
  for k ← 1 to K − 1
      do ωk ← PickClusterFrom(leaves)
         {ωi, ωj} ← KMeans(ωk, 2)
         leaves ← (leaves \ {ωk}) ∪ {ωi, ωj}
  return leaves
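A sketch in the same spirit, reusing the kmeans function from the K-means section; the pick-the-largest-leaf policy matches the “e.g., select the largest one” suggestion above:

```python
import numpy as np

def bisecting_kmeans(X, K):
    """Top-down clustering: repeatedly split the largest leaf in two with K-means."""
    X = np.asarray(X, dtype=float)
    leaves = [np.arange(len(X))]           # each leaf is an array of point indices
    while len(leaves) < K:
        leaves.sort(key=len)
        biggest = leaves.pop()             # PickClusterFrom: take the largest leaf
        _, assign = kmeans(X[biggest], 2)  # split it into two clusters
        leaves += [biggest[assign == 0], biggest[assign == 1]]
    return leaves
```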


Bisecting K-means

If we don’t generate a complete hierarchy, then a top-down algorithm like bisecting K-means is much more efficient than HAC algorithms.

But bisecting K-means is not deterministic.

There are deterministic versions of bisecting K-means (see the resources at the end), but they are much less efficient.


Efficient single link clustering

SingleLinkClustering(d1, …, dN, K)
  for n ← 1 to N
      do for i ← 1 to N
             do C[n][i].sim ← SIM(dn, di)
                C[n][i].index ← i
         I[n] ← n
         NBM[n] ← argmax_{X ∈ {C[n][i] : n ≠ i}} X.sim
  A ← []
  for n ← 1 to N − 1
      do i1 ← argmax_{i : I[i] = i} NBM[i].sim
         i2 ← I[NBM[i1].index]
         A.Append(⟨i1, i2⟩)
         for i ← 1 to N
             do if I[i] = i ∧ i ≠ i1 ∧ i ≠ i2
                   then C[i1][i].sim ← C[i][i1].sim ← max(C[i1][i].sim, C[i2][i].sim)
                if I[i] = i2
                   then I[i] ← i1
         NBM[i1] ← argmax_{X ∈ {C[i1][i] : I[i] = i ∧ i ≠ i1}} X.sim
  return A


Time complexity of HAC

The single-link algorithm we just saw is O(N²).

Much more efficient than the O(N³) algorithm we looked at earlier!

There is no known O(N²) algorithm for complete-link, centroid, and GAAC.

The best time complexity for these three is O(N² log N): see the book.

In practice: little difference between O(N² log N) and O(N²).


Combination similarities of the four algorithms

clustering algorithm  sim(ℓ, k1, k2)
single-link           max(sim(ℓ, k1), sim(ℓ, k2))
complete-link         min(sim(ℓ, k1), sim(ℓ, k2))
centroid              ((1/Nm) · vm) · ((1/Nℓ) · vℓ)
group-average         (1 / ((Nm + Nℓ)(Nm + Nℓ − 1))) · [(vm + vℓ)² − (Nm + Nℓ)]


Comparison of HAC algorithms

method         combination similarity             time compl.   optimal?  comment
single-link    max intersimilarity of any 2 docs  Θ(N²)         yes       chaining effect
complete-link  min intersimilarity of any 2 docs  Θ(N² log N)   no        sensitive to outliers
group-average  average of all sims                Θ(N² log N)   no        best choice for most applications
centroid       average intersimilarity            Θ(N² log N)   no        inversions can occur


What to do with the hierarchy?

Use as is (e.g., for browsing as in Yahoo hierarchy)

Cut at a predetermined threshold

Cut to get a predetermined number of clusters K.

This ignores the hierarchy below and above the cutting line.