Introduction to Information Retrieval

http://informationretrieval.org

IIR 16: Flat Clustering

Hinrich Schütze

Center for Information and Language Processing, University of Munich

2014-06-11

Overview

1 Recap

2 Clustering: Introduction

3 Clustering in IR

4 K-means

5 Evaluation

6 How many clusters?

Learning to rank for zone scoring

Given query q and document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1] by computing a linear combination of document zone scores, where each zone contributes a value.

Consider a set of documents, which have l zones.

Let $g_1, \ldots, g_l \in [0, 1]$, such that $\sum_{i=1}^{l} g_i = 1$.

For $1 \le i \le l$, let $s_i$ be the Boolean score denoting a match (or non-match) between q and the i-th zone.

$s_i = 1$ if a query term occurs in zone i, 0 otherwise.

Weighted zone scoring aka ranked Boolean retrieval

Rank documents according to $\sum_{i=1}^{l} g_i s_i$.

Learning to rank approach: learn the weights $g_i$ from training data.
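The scoring rule on this slide can be sketched in a few lines of Python. This is only an illustration: the zone names, the weights 0.3 and 0.7, and the toy document are made up for the example and do not come from the lecture.

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Return sum_i g_i * s_i for one (query, document) pair.

    doc_zones: dict mapping zone name -> list of tokens in that zone
    weights:   dict mapping zone name -> g_i, with the g_i summing to 1
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("zone weights must sum to 1")
    score = 0.0
    for zone, g in weights.items():
        tokens = set(doc_zones.get(zone, []))
        # s_i = 1 if a query term occurs in zone i, 0 otherwise
        s = 1 if any(term in tokens for term in query_terms) else 0
        score += g * s
    return score


# Hypothetical two-zone document; names and weights are assumptions.
weights = {"title": 0.3, "body": 0.7}
doc = {"title": ["linux", "kernel"], "body": ["the", "penguin", "mascot"]}

score_linux = weighted_zone_score(["linux"], doc, weights)      # matches title only
score_penguin = weighted_zone_score(["penguin"], doc, weights)  # matches body only
score_none = weighted_zone_score(["driver"], doc, weights)      # matches no zone
```

Because the weights sum to 1 and each s_i is 0 or 1, the score always stays in [0, 1].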

Training set for learning to rank

Φj    dj     qj        sT   sB   r(dj, qj)
Φ1    37     linux     1    1    Relevant
Φ2    37     penguin   0    1    Nonrelevant
Φ3    238    system    0    1    Relevant
Φ4    238    penguin   0    0    Nonrelevant
Φ5    1741   kernel    1    1    Relevant
Φ6    2094   driver    0    1    Relevant
Φ7    3194   driver    1    0    Nonrelevant
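As a rough illustration of how a weight could be learned from the seven examples in this table, the sketch below makes several assumptions that the slide does not state: sT and sB are read as Boolean match scores for two zones (title and body), the score is g·sT + (1 − g)·sB, Relevant and Nonrelevant are encoded as targets 1 and 0, and g is chosen to minimise the total squared error. Under those assumptions the minimiser has a simple closed form based on counting the examples where exactly one zone matches.

```python
# Training examples from the table: (s_T, s_B, judged relevant?)
training = [
    (1, 1, True),   # Phi_1
    (0, 1, False),  # Phi_2
    (0, 1, True),   # Phi_3
    (0, 0, False),  # Phi_4
    (1, 1, True),   # Phi_5
    (0, 1, True),   # Phi_6
    (1, 0, False),  # Phi_7
]


def squared_error(g, examples):
    """Total squared error of score = g*s_T + (1-g)*s_B against 0/1 targets."""
    total = 0.0
    for s_t, s_b, relevant in examples:
        score = g * s_t + (1 - g) * s_b
        target = 1.0 if relevant else 0.0
        total += (target - score) ** 2
    return total


def best_g(examples):
    """Closed-form minimiser of squared_error for the two-zone case."""
    n10r = sum(1 for t, b, r in examples if (t, b) == (1, 0) and r)
    n10n = sum(1 for t, b, r in examples if (t, b) == (1, 0) and not r)
    n01r = sum(1 for t, b, r in examples if (t, b) == (0, 1) and r)
    n01n = sum(1 for t, b, r in examples if (t, b) == (0, 1) and not r)
    denom = n10r + n10n + n01r + n01n
    if denom == 0:
        return 0.5  # the error does not depend on g, so any value works
    return (n10r + n01n) / denom


g = best_g(training)
```

Examples where both zones match or neither matches contribute a constant error regardless of g, which is why only the mixed cases appear in the formula.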

Summary of learning to rank approach

The problem of making a binary relevant/nonrelevant judgment is cast as a classification or regression problem, based on a training set of query-document pairs and associated relevance judgments.

In principle, any method for learning a classifier (including least squares regression) can be used to find the line that separates relevant from nonrelevant pairs.

Big advantage of learning to rank: we can avoid hand-tuning scoring functions and simply learn them from training data.

Bottleneck of learning to rank: the cost of maintaining a representative set of training examples whose relevance assessments must be made by humans.

LTR features used by Microsoft Research (1)

Zones: body, anchor, title, url, whole document

Features derived from standard IR models: query term number, query term ratio, length, idf, sum of term frequency, min of term frequency, max of term frequency, mean of term frequency, variance of term frequency, sum of length normalized term frequency, min of length normalized term frequency, max of length normalized term frequency, mean of length normalized term frequency, variance of length normalized term frequency, sum of tf-idf, min of tf-idf, max of tf-idf, mean of tf-idf, variance of tf-idf, boolean model, BM25

LTR features used by Microsoft Research (2)

Language model features: LMIR.ABS, LMIR.DIR, LMIR.JM

Web-specific features: number of slashes in url, length of url, inlink number, outlink number, PageRank, SiteRank

Spam features: QualityScore

Usage-based features: query-url click count, url click count, url dwell time

Ranking SVMs

Vector of feature differences: $\Phi(d_i, d_j, q) = \psi(d_i, q) - \psi(d_j, q)$

By hypothesis, one of $d_i$ and $d_j$ has been judged more relevant.

Notation: We write $d_i \prec d_j$ for “$d_i$ precedes $d_j$ in the results ordering”.

If $d_i$ is judged more relevant than $d_j$, then we will assign the vector $\Phi(d_i, d_j, q)$ the class $y_{ijq} = +1$; otherwise $-1$.

This gives us a training set of pairs of vectors and “precedence indicators”. Each of the vectors is computed as the difference of two document-query vectors.

We can then train an SVM on this training set with the goal of obtaining a classifier that returns

$\vec{w}^{T} \Phi(d_i, d_j, q) > 0$ iff $d_i \prec d_j$
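The construction of the pairwise training set can be sketched as follows. The feature vectors, the graded relevance labels, and the hand-picked weight vector are hypothetical; the sketch only builds the difference vectors and checks a given w against them, and leaves the actual SVM training to whatever linear SVM implementation is available.

```python
def feature_difference(psi_i, psi_j):
    """Phi(d_i, d_j, q) = psi(d_i, q) - psi(d_j, q), component-wise."""
    return [a - b for a, b in zip(psi_i, psi_j)]


def build_pairwise_training_set(judged):
    """judged: list of (psi vector, relevance grade) for ONE query.

    Returns (Phi, y) for every ordered pair of documents whose grades differ;
    y = +1 if the first document was judged more relevant, otherwise -1.
    """
    pairs = []
    for i, (psi_i, rel_i) in enumerate(judged):
        for j, (psi_j, rel_j) in enumerate(judged):
            if i == j or rel_i == rel_j:
                continue  # no precedence information in this pair
            y = 1 if rel_i > rel_j else -1
            pairs.append((feature_difference(psi_i, psi_j), y))
    return pairs


def prefers_first(w, phi):
    """True iff w^T Phi > 0, i.e. the classifier ranks d_i before d_j."""
    return sum(wk * pk for wk, pk in zip(w, phi)) > 0


# Hypothetical 2-feature document-query vectors with graded judgments.
judged = [([0.9, 0.2], 2), ([0.4, 0.1], 1), ([0.1, 0.3], 0)]
pairs = build_pairwise_training_set(judged)

w = [1.0, 0.0]  # hand-picked weight vector, not a trained SVM
consistent = all(prefers_first(w, phi) == (y == 1) for phi, y in pairs)
```

Here `consistent` reports whether the chosen w reproduces every judged precedence, which is the condition a trained ranking SVM aims to satisfy on its training pairs.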

Take-away today

What is clustering?

Applications of clustering in information retrieval

K-means algorithm

Evaluation of clustering

How many clusters?

Clustering: Definition

(Document) clustering is the process of grouping a set of documents into clusters of similar documents.

Documents within a cluster should be similar.

Documents from different clusters should be dissimilar.

Clustering is the most common form of unsupervised learning.

Unsupervised = there are no labeled or annotated data.
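One way to make "similar within a cluster, dissimilar across clusters" concrete is to compare average pairwise similarities. The sketch below assumes documents are represented as term-count vectors and uses cosine similarity; both choices, along with the toy vectors and cluster labels, are assumptions made for illustration only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_similarities(vectors, labels):
    """Average cosine similarity over same-cluster and different-cluster pairs."""
    within, between = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        if labels[i] == labels[j]:
            within.append(sim)
        else:
            between.append(sim)

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(within), mean(between)


# Hypothetical term-count vectors over a 4-term vocabulary.
vectors = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
labels = [0, 0, 1, 1]  # assumed cluster assignment

within, between = mean_similarities(vectors, labels)
```

A good clustering under this view has a high within-cluster average and a low between-cluster average.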

Data set with clear cluster structure

[Figure: scatter plot of points in the plane; horizontal axis from 0.0 to 2.0, vertical axis from 0.0 to 2.5.]

Classification vs. Clustering

Classification: supervised learning

Clustering: unsupervised learning

Classification: Classes are human-defined and part of the input to the learning algorithm.

Clustering: Clusters are inferred from the data without human input.

However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, ...

The cluster hypothesis

Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.

All applications of clustering in IR are based (directly or indirectly) on the cluster hypothesis.

Van Rijsbergen’s original wording (1979): “closely associated documents tend to be relevant to the same requests”.

Applications of clustering in IR

application                what is clustered?        benefit
search result clustering   search results            more effective information presentation to user
Scatter-Gather             (subsets of) collection   alternative user interface: “search without typing”
collection clustering      collection                effective information presentation for exploratory browsing
cluster-based retrieval    collection                higher efficiency: faster search

Search result clustering for better navigation

Scatter-Gather

Global navigation: Yahoo

Global navigation: MESH (upper level)

Global navigation: MESH (lower level)

Navigational hierarchies: Manual vs. automatic creation

Note: Yahoo/MESH are not examples of clustering.

But they are well-known examples of using a global hierarchy for navigation.

Some examples of global navigation/exploration based on clustering:

Cartia

Themescapes

Google News

Global navigation combined with visualization (1)

Global navigation combined with visualization (2)

Global clustering for navigation: Google News

http://news.google.com

Clustering for improving recall

To improve search recall:

Cluster docs in collection a priori

When a query matches a doc d, also return other docs in the cluster containing d

Hope: if we do this, the query “car” will also return docs containing “automobile”

Because the clustering algorithm groups together docs containing “car” with those containing “automobile”.

Both types of documents contain words like “parts”, “dealer”, “mercedes”, “road trip”.
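
The retrieval step described above can be sketched in a few lines of Python. This is only an illustration: the toy documents and the cluster assignment are made up (no clustering algorithm is run here), and `expand_with_clusters` is a hypothetical helper name, not something from the lecture.

```python
# Toy collection: doc id -> text. The contents are invented for illustration.
docs = {
    "d1": "car dealer sells parts",
    "d2": "automobile dealer offers parts",
    "d3": "cooking pasta at home",
}

# Assumed a priori clustering (doc id -> cluster id), computed offline.
cluster_of = {"d1": 0, "d2": 0, "d3": 1}


def expand_with_clusters(query, docs, cluster_of):
    """Return docs matching the query term plus all docs in their clusters."""
    # Step 1: plain term matching.
    matched = {d for d, text in docs.items() if query in text.split()}
    # Step 2: collect the clusters that contain a matched doc.
    clusters = {cluster_of[d] for d in matched}
    # Step 3: return every doc that lies in one of those clusters.
    return sorted(d for d in docs if cluster_of[d] in clusters)


result = expand_with_clusters("car", docs, cluster_of)
print(result)  # ['d1', 'd2']
```

Here d2 never mentions “car” but is returned because it shares a cluster with d1, which is the recall gain the slide hopes for.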

Exercise: Data set with clear cluster structure

[Figure: scatter plot of the data set; x-axis from 0.0 to 2.0, y-axis from 0.0 to 2.5]

Propose an algorithm for finding the cluster structure in this example

Desiderata for clustering

General goal: put related docs in the same cluster, put unrelated docs in different clusters.

We’ll see different ways of formalizing this.

The number of clusters should be appropriate for the data set we are clustering.

Initially, we will assume the number of clusters K is given.

Later: Semiautomatic methods for determining K

Secondary goals in clustering

Avoid very small and very large clusters

Define clusters that are easy to explain to the user

Many others ...

Flat vs. Hierarchical clustering

Flat algorithms

Usually start with a random (partial) partitioning of docs into groups

Refine iteratively

Main algorithm: K-means

Hierarchical algorithms

Create a hierarchy

Bottom-up, agglomerative

Top-down, divisive

Hard vs. Soft clustering

Hard clustering: Each document belongs to exactly one cluster.

More common and easier to do

Soft clustering: A document can belong to more than one cluster.

Makes more sense for applications like creating browsable hierarchies

You may want to put sneakers in two clusters:

sports apparel

shoes

You can only do that with a soft clustering approach.

This class: flat, hard clustering

Next time: hierarchical, hard clustering

Next week: latent semantic indexing, a form of soft clustering
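
As a rough illustration of the difference, the sketch below stores a hard clustering as one label per item and a soft clustering as membership weights per item. The items, weights, threshold, and the helper name `members` are all assumptions made for this example.

```python
# Hard clustering: each item maps to exactly one cluster,
# so "sneakers" cannot appear under both labels.
hard = {"sneakers": "shoes", "sandals": "shoes", "jersey": "sports apparel"}

# Soft clustering: each item maps to membership weights over clusters.
# The weights are invented for illustration and sum to 1 per item.
soft = {
    "sneakers": {"shoes": 0.5, "sports apparel": 0.5},
    "sandals": {"shoes": 1.0},
    "jersey": {"sports apparel": 1.0},
}


def members(cluster, soft, threshold=0.25):
    """Items whose membership weight in `cluster` is at least `threshold`."""
    return sorted(
        item for item, weights in soft.items()
        if weights.get(cluster, 0.0) >= threshold
    )


print(hard["sneakers"])                 # shoes
print(members("shoes", soft))           # ['sandals', 'sneakers']
print(members("sports apparel", soft))  # ['jersey', 'sneakers']
```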

Flat algorithms

Flat algorithms compute a partition of N documents into a set of K clusters.

Given: a set of documents and the number K

Find: a partition into K clusters that optimizes the chosen partitioning criterion

Global optimization: exhaustively enumerate partitions, pick optimal one

Not tractable

Effective heuristic method: K-means algorithm
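
To get a feeling for why exhaustive enumeration is not tractable, one can count the candidate partitions. The count of partitions of N items into K non-empty clusters is the Stirling number of the second kind; this is a standard combinatorial fact brought in for illustration, not something stated on the slide. A minimal sketch:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n items into k non-empty clusters."""
    if n == k:
        return 1  # includes the empty case n == k == 0
    if k == 0 or k > n:
        return 0
    # The n-th item either joins one of the k existing clusters
    # or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


for n in (10, 20, 50):
    print(n, stirling2(n, 3))
```

Already for K = 3 the count grows roughly like 3^N / 6, so even a few hundred documents are far beyond what enumeration can handle.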

K-means

Perhaps the best known clustering algorithm

Simple, works well in many cases

Use as default / baseline for clustering documents

Document representations in clustering

Vector space model

As in vector space classification, we measure relatedness between vectors by Euclidean distance ...

... which is almost equivalent to cosine similarity.

Almost: centroids are not length-normalized.
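
The "almost equivalent" claim can be checked numerically. For length-normalized vectors, the squared Euclidean distance equals 2(1 - cos), so ranking by distance and ranking by cosine agree; a centroid of unit vectors, however, is generally shorter than 1. The vectors below are arbitrary examples chosen for this sketch.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine(u, v):
    """Cosine similarity of two (not necessarily normalized) vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


def sq_euclidean(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


x = normalize([3.0, 1.0, 0.0])
y = normalize([1.0, 2.0, 2.0])

# For unit vectors: |x - y|^2 = 2 * (1 - cos(x, y)).
identity_holds = math.isclose(sq_euclidean(x, y), 2 * (1 - cosine(x, y)))
print(identity_holds)  # True

# The centroid of two distinct unit vectors has length below 1,
# so the identity above does not apply to doc-centroid distances.
centroid = [(a + b) / 2 for a, b in zip(x, y)]
centroid_length = math.sqrt(sum(a * a for a in centroid))
print(centroid_length < 1.0)  # True
```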

K-means: Basic idea

Each cluster in K-means is defined by a centroid.

Objective/partitioning criterion: minimize the average squared difference from the centroid

Recall definition of centroid:

$$\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}$$

where we use ω to denote a cluster.

We try to find the minimum average squared difference by iterating two steps:

reassignment: assign each vector to its closest centroid

recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment
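
The centroid definition and the quantity being minimized can be made concrete with a small sketch. The helper names and the three example vectors are assumptions for illustration; the average below is taken over a single cluster, whereas the criterion on the slide averages over all documents.

```python
import math


def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    n = len(cluster)
    return [sum(component) / n for component in zip(*cluster)]


def avg_sq_distance(cluster):
    """Average squared Euclidean distance of the vectors to their centroid."""
    mu = centroid(cluster)
    total = sum(sum((a - m) ** 2 for a, m in zip(x, mu)) for x in cluster)
    return total / len(cluster)


omega = [[1.0, 1.0], [3.0, 1.0], [2.0, 4.0]]
print(centroid(omega))  # [2.0, 2.0]
print(avg_sq_distance(omega))
```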

K-means pseudocode (μ_k is centroid of ω_k)

K-means(x_1, ..., x_N, K)
  (s_1, s_2, ..., s_K) ← SelectRandomSeeds(x_1, ..., x_N, K)
  for k ← 1 to K
    do μ_k ← s_k
  while stopping criterion has not been met
    do for k ← 1 to K
         do ω_k ← {}
       for n ← 1 to N
         do j ← argmin_{j'} |μ_{j'} − x_n|
            ω_j ← ω_j ∪ {x_n}   (reassignment of vectors)
       for k ← 1 to K
         do μ_k ← (1/|ω_k|) Σ_{x ∈ ω_k} x   (recomputation of centroids)
  return μ_1, ..., μ_K
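
A runnable Python sketch of the pseudocode follows. Several details are assumptions that the pseudocode leaves open: seeds are drawn from the data points, the stopping criterion is "assignments no longer change" (with a cap of `max_iter` iterations), and an empty cluster simply keeps its old centroid. The function also returns the clusters alongside the centroids for convenience.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, max_iter=100, seed=0):
    """Sketch of K-means; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Seeds: K distinct points chosen at random from the data (assumption).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Reassignment: each vector goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(centroids[j], p))
            for p in points
        ]
        # Stopping criterion (assumption): assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    clusters = [
        [p for p, a in zip(points, assignment) if a == j] for j in range(k)
    ]
    return centroids, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
for c, members in zip(centroids, clusters):
    print(c, members)
```

On this toy data the two well-separated groups are recovered; which group ends up as cluster 1 or cluster 2 depends on the random seeds.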

Worked Example: Set of points to be clustered

[Figure: scatter plot of the points to be clustered]

Worked Example

[Figure: scatter plot of the points to be clustered]

Exercise: (i) Guess what the optimal clustering into two clusters is in this case; (ii) compute the centroids of the clusters

Worked Example: Random selection of initial centroids

b

b

b

b

b

b

b bb

b

b

b

b

bb

b

bb

b b

×

×

Schutze: Flat clustering 39 / 86

Page 128: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assign points to closest center

b

b

b

b

b

b

b bb

b

b

b

b

bb

b

bb

b b

×

×

Schutze: Flat clustering 40 / 86

Page 129: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assignment

2

1

1

2

1

1

1 111

1

1

1

11

2

11

2 2

×

×

Schutze: Flat clustering 41 / 86

Page 130: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Recompute cluster centroids

2

1

1

2

1

1

1 111

1

1

1

11

2

11

2 2

×

×

×

×

Schutze: Flat clustering 42 / 86

Page 131: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assign points to closest centroid

b

b

b

b

b

b

b bb

b

b

b

b

bb

b

bb

b b

×

×

Schutze: Flat clustering 43 / 86

Page 132: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assignment

2

2

1

2

1

1

1 111

1

2

1

11

2

11

2 2

×

×

Schutze: Flat clustering 44 / 86

Page 133: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Recompute cluster centroids

2

2

1

2

1

1

1 111

1

2

1

11

2

11

2 2

×

×

×

×

Schutze: Flat clustering 45 / 86

Page 134: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assign points to closest centroid

b

b

b

b

b

b

b bb

b

b

b

b

bb

b

bb

b b

×

×

Schutze: Flat clustering 46 / 86

Page 135: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assignment

2

2

2

2

1

1

1 111

1

2

1

11

2

11

2 2

×

×

Schutze: Flat clustering 47 / 86

Page 136: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Recompute cluster centroids

2

2

2

2

1

1

1 111

1

2

1

11

2

11

2 2

×

×

×

×

Schutze: Flat clustering 48 / 86

Page 137: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assign points to closest centroid

b

b

b

b

b

b

b bb

b

b

b

b

bb

b

bb

b b

×

×

Schutze: Flat clustering 49 / 86

Page 138: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assignment

2

2

2

2

1

1

1 121

1

2

1

11

2

11

2 2

×

×

Schutze: Flat clustering 50 / 86

Page 139: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Recompute cluster centroids

2

2

2

2

1

1

1 121

1

2

1

11

2

11

2 2

×

×

×

×

Schutze: Flat clustering 51 / 86

Page 140: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assign points to closest centroid

b

b

b

b

b

b

b bb

b

b

b

b

bb

b

bb

b b

×

×

Schutze: Flat clustering 52 / 86

Page 141: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assignment

2

2

2

2

1

1

1 122

1

2

1

11

1

11

2 1

×

×

Schutze: Flat clustering 53 / 86

Page 142: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Recompute cluster centroids

2

2

2

2

1

1

1 122

1

2

1

11

1

11

2 1

××

×

×

Schutze: Flat clustering 54 / 86

Page 143: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assign points to closest centroid

b

b

b

b

b

b

b bb

b

b

b

b

bb

b

bb

b b

××

Schutze: Flat clustering 55 / 86

Page 144: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assignment

2

2

2

2

1

1

1 122

1

2

1

11

1

11

1 1

××

Schutze: Flat clustering 56 / 86

Page 145: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Recompute cluster centroids

2

2

2

2

1

1

1 122

1

2

1

11

1

11

1 1

××

×

×

Schutze: Flat clustering 57 / 86

Page 146: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assign points to closest centroid

b

b

b

b

b

b

b bb

b

b

b

b

bb

b

bb

b b

××

Schutze: Flat clustering 58 / 86

Page 147: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Assignment

2

2

2

2

1

1

1 122

1

1

1

11

1

11

1 1

××

Schutze: Flat clustering 59 / 86

Page 148: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Example: Recompute cluster centroids

2

2

2

2

1

1

1 122

1

1

1

11

1

11

1 1

××

×

×

Schutze: Flat clustering 60 / 86

Page 149: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Worked Ex.: Centroids and assignments after convergence

2

2

2

2

1

1

1 122

1

1

1

11

1

11

1 1

××

Schutze: Flat clustering 61 / 86


K-means is guaranteed to converge: Proof

RSS = sum of all squared distances between each document vector and its closest centroid

RSS decreases (or stays the same) during each reassignment step, because each vector is moved to a closer centroid.

RSS decreases (or stays the same) during each recomputation step (see next slide).

There is only a finite number of clusterings.

Thus: We must reach a fixed point.

Assumption: Ties are broken consistently.

Finite set & monotonically decreasing → convergence
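The monotonicity argument can be checked empirically with a small Python sketch that records RSS after every reassignment and every recomputation step. The function names `rss` and `rss_trace`, the toy data, and the starting centroids are assumptions chosen for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def rss(xs, mus, assign):
    """Sum of squared distances between each vector and its assigned centroid."""
    return sum(sq_dist(x, mus[a]) for x, a in zip(xs, assign))

def rss_trace(xs, mus, max_iter=50):
    """Run K-means from the given centroids and record RSS after every step."""
    k = len(mus)
    mus = [list(m) for m in mus]
    assign, trace = None, []
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda j: sq_dist(mus[j], x)) for x in xs]
        if new_assign == assign:
            break  # fixed point reached
        assign = new_assign
        trace.append(rss(xs, mus, assign))  # RSS after reassignment
        for j in range(k):
            members = [x for x, a in zip(xs, assign) if a == j]
            if members:
                mus[j] = [sum(c) / len(members) for c in zip(*members)]
        trace.append(rss(xs, mus, assign))  # RSS after recomputation
    return trace

# Hypothetical points on a line, with deliberately poor starting centroids.
points = [(0, 0), (1, 0), (4, 0), (5, 0), (9, 0)]
trace = rss_trace(points, [(0, 0), (1, 0)])
print(trace)
assert all(later <= earlier + 1e-9 for earlier, later in zip(trace, trace[1:]))
```

The recorded values never increase from one step to the next, which is the property the proof relies on.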


Recomputation decreases average distance

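RSS = $\sum_{k=1}^{K} \mathrm{RSS}_k$: the residual sum of squares (the "goodness" measure)

$$\mathrm{RSS}_k(\vec{v}) = \sum_{\vec{x} \in \omega_k} \|\vec{v} - \vec{x}\|^2 = \sum_{\vec{x} \in \omega_k} \sum_{m=1}^{M} (v_m - x_m)^2$$

$$\frac{\partial\, \mathrm{RSS}_k(\vec{v})}{\partial v_m} = \sum_{\vec{x} \in \omega_k} 2 (v_m - x_m) = 0$$

$$v_m = \frac{1}{|\omega_k|} \sum_{\vec{x} \in \omega_k} x_m$$

The last line is the componentwise definition of the centroid! We minimize RSS_k when the old centroid is replaced with the new centroid. RSS, the sum of the RSS_k, must then also decrease during recomputation.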
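A quick numerical check of this result, as a sketch: for a small hypothetical cluster, RSS_k evaluated at the centroid is compared with RSS_k evaluated at nearby points. The function name `rss_k`, the cluster, and the offsets are illustrative assumptions.

```python
def rss_k(v, cluster):
    """RSS_k(v): sum over the cluster of squared distances to v."""
    return sum(sum((vm - xm) ** 2 for vm, xm in zip(v, x)) for x in cluster)

# Hypothetical cluster of three 2-dimensional vectors.
cluster = [(1.0, 2.0), (3.0, 0.0), (2.0, 4.0)]

# Componentwise mean, i.e. v_m = (1/|omega_k|) * sum of x_m.
mu = [sum(col) / len(cluster) for col in zip(*cluster)]

best = rss_k(mu, cluster)
for dx, dy in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.3, -0.4)]:
    moved = (mu[0] + dx, mu[1] + dy)
    # Moving away from the centroid in any direction increases RSS_k.
    assert rss_k(moved, cluster) > best

print(mu, best)
```

Here the centroid is (2.0, 2.0) with RSS_k = 10, and every shifted candidate gives a strictly larger value.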


K-means is guaranteed to converge

But we don't know how long convergence will take!

If we don't care about a few docs switching back and forth, then convergence is usually fast (fewer than 10-20 iterations).

However, complete convergence can take many more iterations.
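One way to act on this in practice is to stop early once further iterations barely improve RSS. The following Python sketch shows such a stopping test. The function name `should_stop` and the default values `max_iter=20` and `rel_tol=1e-3` are assumptions chosen for illustration, not values given in the slides.

```python
def should_stop(prev_rss, curr_rss, iteration, max_iter=20, rel_tol=1e-3):
    """Return True when iterating further is unlikely to be worth the cost.

    Stops when the iteration cap is reached, or when RSS improved by less
    than the relative tolerance rel_tol compared with the previous iteration.
    """
    if iteration >= max_iter:
        return True
    if prev_rss is None:  # first iteration: nothing to compare against yet
        return False
    if prev_rss == 0:     # already a perfect fit
        return True
    return (prev_rss - curr_rss) / prev_rss < rel_tol

print(should_stop(None, 100.0, 1))
print(should_stop(100.0, 90.0, 2))
print(should_stop(100.0, 99.99, 3))
```

A looser tolerance trades a slightly worse clustering for fewer iterations, which matches the remark about a few documents still switching back and forth.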


Optimality of K-means

Convergence ≠ optimality

Convergence does not mean that we converge to the optimal clustering!

This is the great weakness of K-means.

If we start with a bad set of seeds, the resulting clustering can be horrible.
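The effect of the seeds can be illustrated with a small Python sketch that runs K-means twice on the same data from different seeds and compares the resulting RSS. The point coordinates below are hypothetical (two rows of three points, with the right-hand column set apart), and the function name `kmeans_from_seeds` is an assumption of this sketch.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_from_seeds(xs, seeds, max_iter=100):
    """Run K-means from explicit seeds; returns (assignment, RSS)."""
    mus = [list(s) for s in seeds]
    k = len(mus)
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda j: sq_dist(mus[j], x)) for x in xs]
        if new_assign == assign:
            break
        assign = new_assign
        for j in range(k):
            members = [x for x, a in zip(xs, assign) if a == j]
            if members:
                mus[j] = [sum(c) / len(members) for c in zip(*members)]
    rss = sum(sq_dist(x, mus[a]) for x, a in zip(xs, assign))
    return assign, rss

# Hypothetical layout, not the coordinates of any figure in the slides.
pts = [(1, 2), (2, 2), (4, 2), (1, 1), (2, 1), (4, 1)]

rows_assign, rows_rss = kmeans_from_seeds(pts, [pts[1], pts[4]])
cols_assign, cols_rss = kmeans_from_seeds(pts, [pts[1], pts[2]])
print(rows_assign, rows_rss)
print(cols_assign, cols_rss)
```

With these assumed coordinates, both runs converge, but the first stops at a split into the two rows (RSS about 9.33) while the second finds a split with RSS 2.5. Both are fixed points of the algorithm; only one of them is the better clustering.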


Exercise: Suboptimal clustering

[Figure: six points in the plane (horizontal axis 0 to 4, vertical axis 0 to 3), with d1, d2, d3 in one row and d4, d5, d6 in a second row]

What is the optimal clustering for K = 2?

Do we converge on this clustering for arbitrary seeds d_i, d_j?


Initialization of K-means

Random seed selection is just one of many ways K-means can be initialized.

Random seed selection is not very robust: It's easy to get a suboptimal clustering.

Better ways of computing initial centroids:

Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has "good coverage" of the document space)

Use hierarchical clustering to find good seeds

Select i (e.g., i = 10) different random sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS
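The last option, several random restarts with the lowest-RSS result kept, can be sketched in Python as follows. The function names `kmeans` and `best_of_restarts`, the fixed random seed, and the toy data are assumptions made for this illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(xs, k, rng, max_iter=100):
    """One K-means run from random seeds; returns (RSS, assignment)."""
    mus = [list(s) for s in rng.sample(xs, k)]
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda j: sq_dist(mus[j], x)) for x in xs]
        if new_assign == assign:
            break
        assign = new_assign
        for j in range(k):
            members = [x for x, a in zip(xs, assign) if a == j]
            if members:
                mus[j] = [sum(c) / len(members) for c in zip(*members)]
    return sum(sq_dist(x, mus[a]) for x, a in zip(xs, assign)), assign

def best_of_restarts(xs, k, restarts=10, seed=0):
    """Run K-means `restarts` times and keep the clustering with lowest RSS."""
    rng = random.Random(seed)
    runs = [kmeans(xs, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])

# Hypothetical data: two rows of three points.
pts = [(1, 2), (2, 2), (4, 2), (1, 1), (2, 1), (4, 1)]
best_rss, best_assign = best_of_restarts(pts, 2)
print(best_rss, best_assign)
```

Restarts do not guarantee the optimal clustering, but they make it less likely that a single unlucky set of seeds determines the result, at the price of roughly `restarts` times the running time.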


Time complexity of K-means

Computing the distance between two vectors is O(M), where M is the number of dimensions.

Reassignment step: O(KNM) (we need to compute KN document-centroid distances)

Recomputation step: O(NM) (we need to add each of the document's < M values to one of the centroids)

Assume number of iterations bounded by I

Overall complexity: O(IKNM), linear in all important dimensions

However: This is not a real worst-case analysis.

In pathological cases, complexity can be worse than linear.
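As a rough illustration of the O(IKNM) estimate, the following Python sketch counts scalar operations per run while ignoring constant factors. The function name `kmeans_cost` and the example sizes are hypothetical.

```python
def kmeans_cost(iterations, k, n, m):
    """Rough operation count for K-means, ignoring constant factors."""
    reassignment = k * n * m   # K*N distances, each costing O(M)
    recomputation = n * m      # each vector is added into one centroid
    return iterations * (reassignment + recomputation)

# Hypothetical sizes: 10 iterations, 5 clusters, 1,000 documents, 100 dimensions.
base = kmeans_cost(10, 5, 1_000, 100)
doubled = kmeans_cost(10, 5, 2_000, 100)
print(base, doubled)
assert doubled == 2 * base  # the estimate is linear in N
```

The reassignment term dominates whenever K is larger than 1, which is why the overall bound is written as O(IKNM).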




What is a good clustering?

Internal criteria

Example of an internal criterion: RSS in K -means

But an internal criterion often does not evaluate the actual utility of a clustering in the application.

Alternative: External criteria

Evaluate with respect to a human-defined classification


External criteria for clustering quality

Based on a gold standard data set, e.g., the Reuters collection we also used for the evaluation of classification

Goal: Clustering should reproduce the classes in the gold standard

(But we only want to reproduce how documents are divided into groups, not the class labels.)

First measure for how well we were able to reproduce the classes: purity


External criterion: Purity

\[ \mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j| \]

$\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of clusters and $C = \{c_1, c_2, \ldots, c_J\}$ is the set of classes.

For each cluster $\omega_k$: find class $c_j$ with most members $n_{kj}$ in $\omega_k$

Sum all $n_{kj}$ and divide by total number of points


Example for computing purity

[Figure: 17 points from three classes (x, o, ⋄) grouped into three clusters.]

             x   o   ⋄   total
cluster 1    5   1   0   6
cluster 2    1   4   1   6
cluster 3    2   0   3   5

To compute purity: $5 = \max_j |\omega_1 \cap c_j|$ (class x, cluster 1); $4 = \max_j |\omega_2 \cap c_j|$ (class o, cluster 2); and $3 = \max_j |\omega_3 \cap c_j|$ (class ⋄, cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.
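The purity computation can be sketched as follows. The per-cluster class counts are read off the o/⋄/x example (cluster 1: five x and one o; cluster 2: one x, four o, one ⋄; cluster 3: two x and three ⋄). The name `EXAMPLE` and the dictionary layout are assumptions of this sketch.

```python
from typing import Dict, List

# Class counts per cluster for the o/⋄/x example ("diamond" stands for ⋄).
EXAMPLE: List[Dict[str, int]] = [
    {"x": 5, "o": 1},                # cluster 1
    {"x": 1, "o": 4, "diamond": 1},  # cluster 2
    {"x": 2, "diamond": 3},          # cluster 3
]


def purity(clusters: List[Dict[str, int]]) -> float:
    """Sum the majority-class count of each cluster and divide by N."""
    n = sum(sum(c.values()) for c in clusters)
    majority_total = sum(max(c.values()) for c in clusters)
    return majority_total / n


print(round(purity(EXAMPLE), 2))  # (5 + 4 + 3) / 17
```

Note that only the counts matter, not the class names: relabeling the classes leaves purity unchanged.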


Another external criterion: Rand index

Purity can be increased easily by increasing K – a measure that does not have this problem: Rand index.

Definition:

\[ \mathrm{RI} = \frac{TP + TN}{TP + FP + FN + TN} \]

Based on 2x2 contingency table of all pairs of documents:

                     same cluster           different clusters
same class           true positives (TP)    false negatives (FN)
different classes    false positives (FP)   true negatives (TN)

TP+FN+FP+TN is the total number of pairs.

\[ TP + FN + FP + TN = \binom{N}{2} \quad \text{for } N \text{ documents.} \]

Example: $\binom{17}{2} = 136$ in o/⋄/x example

Each pair is either positive or negative (the clustering puts the two documents in the same or in different clusters) . . .

. . . and either “true” (correct) or “false” (incorrect): the clustering decision is correct or incorrect.


Rand Index: Example

As an example, we compute RI for the o/⋄/x example. We first compute TP + FP. The three clusters contain 6, 6, and 5 points, respectively, so the total number of “positives” or pairs of documents that are in the same cluster is:

\[ TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40 \]

Of these, the x pairs in cluster 1, the o pairs in cluster 2, the ⋄ pairs in cluster 3, and the x pair in cluster 3 are true positives:

\[ TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20 \]

Thus, FP = 40 − 20 = 20. FN and TN are computed similarly.


Rand measure for the o/⋄/x example

                     same cluster   different clusters
same class           TP = 20        FN = 24
different classes    FP = 20        TN = 72

RI is then (20 + 72)/(20 + 20 + 24 + 72) ≈ 0.68.
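A short sketch of how the four pair counts can be derived from per-cluster class counts. The counts are those of the o/⋄/x example (cluster 1: five x, one o; cluster 2: one x, four o, one ⋄; cluster 3: two x, three ⋄); the names `pair_counts` and `rand_index` are illustrative. FN follows from the same-class pairs, TP + FN, and TN from the total number of pairs.

```python
from math import comb
from typing import Dict, List, Tuple

# Class counts per cluster for the o/⋄/x example ("diamond" stands for ⋄).
EXAMPLE: List[Dict[str, int]] = [
    {"x": 5, "o": 1},
    {"x": 1, "o": 4, "diamond": 1},
    {"x": 2, "diamond": 3},
]


def pair_counts(clusters: List[Dict[str, int]]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) over all document pairs."""
    n = sum(sum(c.values()) for c in clusters)
    total_pairs = comb(n, 2)

    # Pairs placed in the same cluster: TP + FP.
    same_cluster = sum(comb(sum(c.values()), 2) for c in clusters)
    # Pairs in the same cluster AND the same class: TP.
    tp = sum(comb(count, 2) for c in clusters for count in c.values())

    # Pairs belonging to the same class: TP + FN.
    class_totals: Dict[str, int] = {}
    for c in clusters:
        for label, count in c.items():
            class_totals[label] = class_totals.get(label, 0) + count
    same_class = sum(comb(t, 2) for t in class_totals.values())

    fp = same_cluster - tp
    fn = same_class - tp
    tn = total_pairs - tp - fp - fn
    return tp, fp, fn, tn


def rand_index(clusters: List[Dict[str, int]]) -> float:
    tp, fp, fn, tn = pair_counts(clusters)
    return (tp + tn) / (tp + fp + fn + tn)


print(pair_counts(EXAMPLE), round(rand_index(EXAMPLE), 2))
```

Working from counts avoids enumerating all 136 pairs explicitly; `math.comb` requires Python 3.8 or later.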


Two other external evaluation measures

Two other measures

Normalized mutual information (NMI)

How much information does the clustering contain about the classification?

Singleton clusters (number of clusters = number of docs) have maximum MI

Therefore: normalize by entropy of clusters and classes

F measure

Like Rand, but “precision” and “recall” can be weighted
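Both measures can be sketched in a few lines. The slide does not spell out the formulas, so this sketch assumes NMI = I(Ω; C) / ((H(Ω) + H(C)) / 2) and F_β = (β² + 1)PR / (β²P + R), with P = TP/(TP + FP) and R = TP/(TP + FN). The per-cluster class counts are those of the o/⋄/x example, and the function names are illustrative.

```python
from math import log
from typing import Dict, List

# Class counts per cluster for the o/⋄/x example ("diamond" stands for ⋄).
EXAMPLE: List[Dict[str, int]] = [
    {"x": 5, "o": 1},
    {"x": 1, "o": 4, "diamond": 1},
    {"x": 2, "diamond": 3},
]


def entropy(sizes: List[int], n: int) -> float:
    return -sum((s / n) * log(s / n) for s in sizes if s > 0)


def nmi(clusters: List[Dict[str, int]]) -> float:
    """Mutual information normalized by the mean of the two entropies."""
    n = sum(sum(c.values()) for c in clusters)
    cluster_sizes = [sum(c.values()) for c in clusters]
    class_sizes: Dict[str, int] = {}
    for c in clusters:
        for label, count in c.items():
            class_sizes[label] = class_sizes.get(label, 0) + count

    mi = sum(
        (count / n) * log(n * count / (size * class_sizes[label]))
        for c, size in zip(clusters, cluster_sizes)
        for label, count in c.items()
        if count > 0
    )
    # Undefined (division by zero) if there is one cluster and one class.
    mean_entropy = (entropy(cluster_sizes, n)
                    + entropy(list(class_sizes.values()), n)) / 2
    return mi / mean_entropy


def f_measure(tp: int, fp: int, fn: int, beta: float) -> float:
    """Pairwise F measure; beta > 1 weights recall more than precision."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)


print(round(nmi(EXAMPLE), 2))              # NMI for the example
print(round(f_measure(20, 20, 24, 5), 2))  # F5 from TP = 20, FP = 20, FN = 24
```

With β = 5, false negatives (same-class pairs that were separated) are penalized more strongly than false positives.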


Evaluation results for the o/⋄/x example

                     purity   NMI    RI     F5
lower bound          0.0      0.0    0.0    0.0
maximum              1.0      1.0    1.0    1.0
value for example    0.71     0.36   0.68   0.46

All four measures range from 0 (really bad clustering) to 1 (perfect clustering).


How many clusters?

Number of clusters K is given in many applications.

E.g., there may be an external constraint on K. Example: In the case of Scatter-Gather, it was hard to show more than 10–20 clusters on a monitor in the 90s.

What if there is no external constraint? Is there a “right” number of clusters?

One way to go: define an optimization criterion

Given docs, find K for which the optimum is reached.

What optimization criterion can we use?

We can’t use RSS or average squared distance from centroid as criterion: always chooses K = N clusters.


Exercise

Your job is to develop the clustering algorithms for a competitor to news.google.com

You want to use K -means clustering.

How would you determine K?


Simple objective function for K : Basic idea

Start with 1 cluster (K = 1)

Keep adding clusters (= keep increasing K )

Add a penalty for each new cluster

Then trade off cluster penalties against average squared distance from centroid

Choose the value of K with the best tradeoff


Simple objective function for K : Formalization

Given a clustering, define the cost for a document as(squared) distance to centroid

Define total distortion RSS(K) as sum of all individualdocument costs (corresponds to average distance)

Then: penalize each cluster with a cost λ

Thus for a clustering with K clusters, total cluster penalty isKλ

Define the total cost of a clustering as distortion plus totalcluster penalty: RSS(K) + Kλ

Schutze: Flat clustering 83 / 86

Page 255: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Simple objective function for K : Formalization

Given a clustering, define the cost for a document as(squared) distance to centroid

Define total distortion RSS(K) as sum of all individualdocument costs (corresponds to average distance)

Then: penalize each cluster with a cost λ

Thus for a clustering with K clusters, total cluster penalty isKλ

Define the total cost of a clustering as distortion plus totalcluster penalty: RSS(K) + Kλ

Select K that minimizes (RSS(K) + Kλ)

Schutze: Flat clustering 83 / 86

Page 256: Introduction to Information Retrieval http ...hs/teach/14s/ir/pdf/16flat.pdf · Recap Clustering: Introduction ClusteringinIR K-means Evaluation Howmanyclusters? Learning to rank

Recap Clustering: Introduction Clustering in IR K -means Evaluation How many clusters?

Simple objective function for K: Formalization

Given a clustering, define the cost for a document as (squared) distance to centroid

Define total distortion RSS(K) as sum of all individual document costs (corresponds to average squared distance)

Then: penalize each cluster with a cost λ

Thus for a clustering with K clusters, total cluster penalty is Kλ

Define the total cost of a clustering as distortion plus total cluster penalty: RSS(K) + Kλ

Select K that minimizes (RSS(K) + Kλ)

Still need to determine good value for λ . . .
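
The cost definitions above can be sketched in Python. This is a minimal illustration under stated assumptions, not the lecture's implementation: the names `rss`, `total_cost`, and `select_k` are hypothetical, a clustering is assumed to be a list of non-empty clusters, each a list of equal-length numeric tuples, and the RSS value for each K is assumed to come from a separate K-means run that is not shown here.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rss(clusters):
    """Total distortion: sum over all documents of the squared
    distance to the centroid of the document's own cluster."""
    total = 0.0
    for cluster in clusters:
        mu = centroid(cluster)
        total += sum(squared_distance(doc, mu) for doc in cluster)
    return total


def total_cost(clusters, lam):
    """Distortion plus total cluster penalty: RSS(K) + K * lambda."""
    return rss(clusters) + len(clusters) * lam


def select_k(rss_by_k, lam):
    """Return the K that minimizes RSS(K) + K * lambda.

    rss_by_k maps each candidate K to the RSS of the best clustering
    found for that K (assumed to be computed elsewhere)."""
    return min(rss_by_k, key=lambda k: rss_by_k[k] + k * lam)
```

For example, `select_k({1: 100.0, 2: 40.0, 3: 10.0, 4: 8.0, 5: 7.0}, lam=5.0)` weighs the small RSS gains beyond K = 3 against the penalty of 5 per cluster. In this sketch, setting `lam=0.0` simply favors the K with the lowest RSS, which is typically the largest K tried, so the choice of λ does all the work.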

Finding the “knee” in the curve

[Figure: residual sum of squares (y-axis, about 1750 to 1950) plotted against number of clusters (x-axis, 2 to 10)]

Pick the number of clusters where the curve “flattens”. Here: 4 or 9.
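
One possible way to make "where the curve flattens" concrete is to look for the K at which the drop in RSS shrinks the most. The sketch below is an assumption of this note, not a method given in the lecture: the function name `knee_candidates` is hypothetical, the candidate values of K are assumed to be evenly spaced, and the RSS values in any call are made-up inputs rather than the values plotted in the figure.

```python
def knee_candidates(rss_by_k, top=2):
    """Rank interior values of K by how sharply the RSS curve bends there.

    The bend at K is the RSS drop going into K minus the RSS drop going
    out of K (a discrete second difference). A large bend means the curve
    falls steeply before K and is comparatively flat after it."""
    ks = sorted(rss_by_k)
    bends = {}
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = rss_by_k[prev_k] - rss_by_k[k]
        drop_after = rss_by_k[k] - rss_by_k[next_k]
        bends[k] = drop_before - drop_after
    # Largest bend first; return the `top` strongest candidates.
    return sorted(bends, key=bends.get, reverse=True)[:top]
```

Returning several candidates rather than one reflects that a curve can have more than one plausible knee, as in the slide's "4 or 9". The first and last K cannot be ranked because they lack a neighbor on one side.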

Take-away today

What is clustering?

Applications of clustering in information retrieval

K-means algorithm

Evaluation of clustering

How many clusters?

Resources

Chapter 16 of IIR

Resources at http://cislmu.org

Keith van Rijsbergen on the cluster hypothesis (he was one of the originators)

Bing/Carrot2/Clusty: search result clustering systems

Stirling number: the number of distinct k-clusterings of n items