The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Page 1: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

A Framework for Optimum Document Clustering:

Implementing the Cluster Hypothesis

Norbert Fuhr

University of Duisburg-Essen

March 30, 2011

Page 2: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Outline

1 Introduction

2 Cluster Metric

3 Optimum clustering

4 Towards Optimum Clustering

5 Experiments

6 Conclusion and Outlook

Page 3: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Introduction

1 Introduction

2 Cluster Metric

3 Optimum clustering

4 Towards Optimum Clustering

5 Experiments

6 Conclusion and Outlook

Page 4: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Introduction

Motivation

Ad-hoc retrieval (heuristic models): define a retrieval function, then evaluate to test whether it yields good quality

Probability Ranking Principle (PRP): theoretical foundation for optimum retrieval; numerous probabilistic models based on the PRP

Document clustering (classic approach): define a similarity function and a fusion principle, then evaluate to test whether they yield good quality

Optimum Clustering Principle?

Page 7: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Introduction

Cluster Hypothesis

Original formulation: "closely associated documents tend to be relevant to the same requests" (Rijsbergen 1979)

Idea of optimum clustering:

cluster documents in such a way that, for any request, the relevant documents occur together in one cluster

redefine document similarity: documents are similar if they are relevant to the same queries

Page 10: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Introduction

The Optimum Clustering Framework

Page 14: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Cluster Metric

1 Introduction

2 Cluster Metric

3 Optimum clustering

4 Towards Optimum Clustering

5 Experiments

6 Conclusion and Outlook

Page 15: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Cluster Metric

Defining a Metric based on the Cluster Hypothesis

General idea: evaluate a clustering with respect to a set of queries. For each query and each cluster, regard the pairs of documents co-occurring in the cluster:

relevant-relevant: good; relevant-irrelevant: bad; irrelevant-irrelevant: don't care

Page 16: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Cluster Metric

Pairwise precision

Q: set of queries
D: document collection
R: relevance judgments, R ⊂ Q × D
C: clustering, C = {C_1, ..., C_n} such that C_1 ∪ ... ∪ C_n = D and ∀ i, j: i ≠ j → C_i ∩ C_j = ∅
c_i = |C_i| (size of cluster C_i)
r_ik = |{d_m ∈ C_i | (q_k, d_m) ∈ R}| (number of relevant documents in C_i w.r.t. q_k)

Pairwise precision (weighted average over all clusters):

$$P_p(D,Q,R,C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ c_i > 1}} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik}-1)}{c_i(c_i-1)}$$

Page 21: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Cluster Metric

Pairwise precision – Example

$$P_p(D,Q,R,C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ c_i > 1}} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik}-1)}{c_i(c_i-1)}$$

Query set: a disjoint classification with two classes a and b; three clusters: (aab | bb | aa)

$$P_p = \frac{1}{7}\Bigl(3\bigl(\tfrac{1}{3} + 0\bigr) + 2(0 + 1) + 2(1 + 0)\Bigr) = \frac{5}{7}$$

A perfect clustering for a disjoint classification would yield P_p = 1; for arbitrary query sets, values > 1 are possible.
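
The metric is easy to compute directly from the definition. Below is a minimal Python sketch (function and variable names are illustrative, not from the paper) that reproduces the example above, using the two classes a and b as the query set:

```python
from fractions import Fraction

def pairwise_precision(clusters, relevant):
    """clusters: list of lists of document ids.
    relevant: dict mapping each query to the set of relevant document ids."""
    n_docs = sum(len(c) for c in clusters)
    total = Fraction(0)
    for c in clusters:
        ci = len(c)
        if ci < 2:                      # clusters of size 1 are skipped (c_i > 1)
            continue
        for rel in relevant.values():
            rik = len(rel.intersection(c))
            total += ci * Fraction(rik * (rik - 1), ci * (ci - 1))
    return total / n_docs

# Example from the slide: three clusters (aab | bb | aa)
clusters = [["d1", "d2", "d3"], ["d4", "d5"], ["d6", "d7"]]
relevant = {"a": {"d1", "d2", "d6", "d7"}, "b": {"d3", "d4", "d5"}}
print(pairwise_precision(clusters, relevant))   # 5/7
```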

Page 23: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Cluster Metric

Pairwise recall

r_ik = |{d_m ∈ C_i | (q_k, d_m) ∈ R}| (number of relevant documents in C_i w.r.t. q_k)

g_k = |{d ∈ D | (q_k, d) ∈ R}| (number of relevant documents for q_k)

Pairwise recall (micro recall):

$$R_p(D,Q,R,C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik}-1)}{\sum_{\substack{q_k \in Q \\ g_k > 1}} g_k(g_k-1)}$$

Example: (aab | bb | aa)
2 a-pairs (out of 6), 1 b-pair (out of 3)

$$R_p = \frac{2+1}{6+3} = \frac{1}{3}$$
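
Pairwise recall can be sketched in the same style (again with illustrative names); on the (aab | bb | aa) example it yields 1/3:

```python
from fractions import Fraction

def pairwise_recall(clusters, relevant):
    """Fraction of co-relevant document pairs that end up in the same cluster."""
    num, den = 0, 0
    for rel in relevant.values():
        gk = len(rel)
        if gk > 1:
            den += gk * (gk - 1)
        for c in clusters:
            rik = len(rel.intersection(c))
            num += rik * (rik - 1)
    return Fraction(num, den)

clusters = [["d1", "d2", "d3"], ["d4", "d5"], ["d6", "d7"]]
relevant = {"a": {"d1", "d2", "d6", "d7"}, "b": {"d3", "d4", "d5"}}
print(pairwise_recall(clusters, relevant))   # 1/3
```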

Page 25: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Cluster Metric

Perfect clustering

C is a perfect clustering iff there exists no clustering C′ such that
$$P_p(D,Q,R,C) < P_p(D,Q,R,C') \;\wedge\; R_p(D,Q,R,C) < R_p(D,Q,R,C')$$

strong Pareto optimum – more than one perfect clustering is possible

Example:

$P_p(\{d_1,d_2,d_3\}, \{d_4,d_5\}) = P_p(\{d_1,d_2\}, \{d_3,d_4,d_5\}) = 1$, $R_p = \tfrac{2}{3}$

$P_p(\{d_1,d_2,d_3,d_4,d_5\}) = 0.6$, $R_p = 1$
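
Because perfection is defined by strict Pareto dominance on the pair (P_p, R_p), the example can be checked mechanically. A small sketch using the values stated above (the dictionary keys are just labels):

```python
def strictly_dominates(a, b):
    """a, b are (pairwise precision, pairwise recall) pairs."""
    return a[0] > b[0] and a[1] > b[1]

# (P_p, R_p) values from the example above
candidates = {
    "{d1 d2 d3 | d4 d5}": (1.0, 2 / 3),
    "{d1 d2 | d3 d4 d5}": (1.0, 2 / 3),
    "{d1 d2 d3 d4 d5}":   (0.6, 1.0),
}

for name, score in candidates.items():
    dominated = any(strictly_dominates(candidates[other], score)
                    for other in candidates if other != name)
    print(name, "dominated" if dominated else "perfect (Pareto-optimal)")
```

All three clusterings come out as perfect, which is exactly the point: a strong Pareto optimum need not be unique.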

Page 31: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Cluster Metric

Do perfect clusterings form a hierarchy?

[Figure: $P_p$ versus $R_p$ plot comparing three clusterings]

C = {{d1, d2, d3, d4}}
C′ = {{d1, d2}, {d3, d4}}
C′′ = {{d1, d2, d3}, {d4}}

Page 32: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Optimum clustering

1 Introduction

2 Cluster Metric

3 Optimum clustering

4 Towards Optimum Clustering

5 Experiments

6 Conclusion and Outlook

Page 33: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Optimum clustering

Optimum Clustering

Usually, the clustering process has no knowledge about relevance judgments:

switch from external to internal cluster measures

replace relevance judgments by estimates of the probability of relevance; this requires a probabilistic retrieval method yielding P(rel | q, d)

compute the expected cluster quality

Page 38: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Optimum clustering

Expected cluster quality

Pairwise precision:

$$P_p(D,Q,R,C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ c_i > 1}} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik}-1)}{c_i(c_i-1)}$$

Expected precision:

$$\pi(D,Q,C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ |C_i| > 1}} \frac{c_i}{c_i(c_i-1)} \sum_{q_k \in Q} \sum_{\substack{(d_l,d_m) \in C_i \times C_i \\ d_l \neq d_m}} P(\mathrm{rel} \mid q_k, d_l)\, P(\mathrm{rel} \mid q_k, d_m)$$

Page 40: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Optimum clustering

Expected precision

$$\pi(D,Q,C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ |C_i| > 1}} \frac{c_i}{c_i(c_i-1)} \sum_{\substack{(d_l,d_m) \in C_i \times C_i \\ d_l \neq d_m}} \sum_{q_k \in Q} P(\mathrm{rel} \mid q_k, d_l)\, P(\mathrm{rel} \mid q_k, d_m)$$

Here $\sum_{q_k \in Q} P(\mathrm{rel} \mid q_k, d_l)\, P(\mathrm{rel} \mid q_k, d_m)$ gives the expected number of queries for which both d_l and d_m are relevant.

Transform a document into a vector of relevance probabilities:

$$\vec{\tau}^T(d_m) = (P(\mathrm{rel} \mid q_1, d_m), P(\mathrm{rel} \mid q_2, d_m), \ldots, P(\mathrm{rel} \mid q_{|Q|}, d_m))$$

$$\pi(D,Q,C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ |C_i| > 1}} \frac{1}{c_i - 1} \sum_{\substack{(d_l,d_m) \in C_i \times C_i \\ d_l \neq d_m}} \vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$$
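
With the documents stacked as rows of a probability matrix, the expected precision reduces to within-cluster dot products. A minimal numpy sketch (names are illustrative; T[m, k] is assumed to hold P(rel | q_k, d_m)):

```python
import numpy as np

def expected_precision(T, clusters):
    """T: |D| x |Q| matrix of relevance probabilities.
    clusters: list of lists of row indices into T."""
    n_docs = T.shape[0]
    total = 0.0
    for c in clusters:
        ci = len(c)
        if ci < 2:                        # singleton clusters contribute nothing
            continue
        S = T[c] @ T[c].T                 # tau^T(d_l) . tau(d_m) for all pairs in the cluster
        total += (S.sum() - np.trace(S)) / (ci - 1)   # drop the diagonal (d_l != d_m)
    return total / n_docs

# toy example: 4 documents, 2 queries
T = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.7]])
print(expected_precision(T, [[0, 1], [2, 3]]))
```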

Page 42: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Optimum clustering

Expected recall

$$R_p(D,Q,R,C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik}-1)}{\sum_{\substack{q_k \in Q \\ g_k > 1}} g_k(g_k-1)}$$

Direct estimation requires estimating the denominator → biased estimates
But: the denominator is constant for a given query set → ignore it

Compute an estimate for the numerator only:

$$\rho(D,Q,C) = \sum_{C_i \in C} \sum_{\substack{(d_l,d_m) \in C_i \times C_i \\ d_l \neq d_m}} \vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$$

(The scalar product $\vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$ gives the expected number of queries for which both d_l and d_m are relevant.)
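
The recall surrogate ρ is the same within-cluster pair sum, just without the per-cluster normalization and the 1/|D| factor; a sketch in the same style:

```python
import numpy as np

def rho(T, clusters):
    """Unnormalized expected-recall numerator: sum of tau dot products
    over all ordered within-cluster pairs (d_l != d_m)."""
    total = 0.0
    for c in clusters:
        S = T[c] @ T[c].T
        total += S.sum() - np.trace(S)
    return total

T = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.7]])
print(rho(T, [[0, 1], [2, 3]]))   # merging clusters can only increase rho
```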

Page 46: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Optimum clustering

Optimum clustering

C is an optimum clustering iff there exists no clustering C′ such that
$$\pi(D,Q,C) < \pi(D,Q,C') \;\wedge\; \rho(D,Q,C) < \rho(D,Q,C')$$

Pareto optima

The set of perfect (and optimum) clusterings does not even form a cluster hierarchy → no hierarchic clustering method will find all optima!

Page 50: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Towards Optimum Clustering

1 Introduction

2 Cluster Metric

3 Optimum clustering

4 Towards Optimum Clustering

5 Experiments

6 Conclusion and Outlook

Page 51: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Towards Optimum Clustering

Towards Optimum Clustering

Development of an (optimum) clustering method:

1 Set of queries,
2 Probabilistic retrieval method,
3 Document similarity metric, and
4 Fusion principle.

Page 52: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Towards Optimum Clustering

A simple application

1 Set of queries: all possible one-term queries
2 Probabilistic retrieval method: tf ∗ idf
3 Document similarity metric: $\vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$
4 Fusion principle: group average clustering

$$\pi(D,Q,C) = \frac{1}{c(c-1)} \sum_{\substack{(d_l,d_m) \in C \times C \\ d_l \neq d_m}} \vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$$

→ a standard clustering method
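
In this instantiation the τ vectors are simply rows of a tf∗idf matrix over the vocabulary (every term is a one-term query), and document similarity is their dot product. A rough sketch using scikit-learn's TfidfVectorizer as a stand-in for the tf∗idf retrieval function (the calibration of the weights to actual probabilities of relevance is glossed over):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cluster hypothesis and document clustering",
        "probabilistic retrieval and ranking",
        "document clustering with tf idf weights"]

# one-term queries = the vocabulary; tau(d) ~ the tf*idf weights of d
tfidf = TfidfVectorizer().fit_transform(docs)   # |D| x |Q| sparse matrix
sim = (tfidf @ tfidf.T).toarray()               # tau^T(d_l) . tau(d_m) for all document pairs
print(sim.round(2))
```

The resulting similarity matrix can then be fed into any group-average fusion procedure, such as the one sketched in the fusion-principles section below.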

Page 59: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Towards Optimum Clustering

Query set

Too few queries in real collections → use an artificial query set

collection clustering: the set of all possible one-term queries

  probability distribution over the query set: uniform / proportional to document frequency

  document representation: original terms / transformations of the term space

  semantic dimensions: focus on certain aspects only (e.g. images: color, contour, texture)

result clustering: the set of all query expansions

Page 60: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Towards Optimum Clustering

Probabilistic retrieval method

Model: in principle, any retrieval model is suitable

Transformation to probabilities: direct estimation / transforming the retrieval score into such a probability
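
One common way to obtain P(rel | q, d) from an arbitrary retrieval score is a calibrated logistic mapping. The sketch below is an assumption for illustration, since the slide leaves the choice of transformation open:

```python
import numpy as np

def score_to_probability(scores, a=1.0, b=0.0):
    """Map raw retrieval scores (e.g. BM25) to pseudo-probabilities of relevance.
    The parameters a and b would normally be calibrated on queries with relevance judgments."""
    return 1.0 / (1.0 + np.exp(-(a * np.asarray(scores) + b)))

print(score_to_probability([0.5, 2.0, 7.3]))
```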

Page 61: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Towards Optimum Clustering

Document similarity metric

fixed as $\vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$

Page 62: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Towards Optimum Clustering

Fusion principles

The OCF only gives guidelines for good fusion principles: consider the metrics π and/or ρ during fusion

Page 63: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Towards Optimum Clustering

Group average clustering:

$$\sigma(C) = \frac{1}{c(c-1)} \sum_{\substack{(d_l,d_m) \in C \times C \\ d_l \neq d_m}} \vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$$

expected precision as criterion!

starts with singleton clusters → minimum recall

builds larger clusters for increasing recall

forms the cluster with the highest precision (which may be lower than that of the current clusters)
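
A greedy group-average fusion step can be sketched directly on the similarity matrix: among all pairs of current clusters, merge the pair whose union has the highest group-average similarity σ. This is an illustration of the principle, not the implementation used in the paper:

```python
import numpy as np

def sigma(cluster, sim):
    """Group-average similarity (expected precision) of one cluster."""
    c = len(cluster)
    if c < 2:
        return 0.0
    block = sim[np.ix_(cluster, cluster)]
    return (block.sum() - np.trace(block)) / (c * (c - 1))

def group_average_step(clusters, sim):
    """Merge the pair of clusters whose union has the highest sigma."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            score = sigma(clusters[i] + clusters[j], sim)
            if best is None or score > best[0]:
                best = (score, i, j)
    _, i, j = best
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [clusters[i] + clusters[j]]

sim = np.array([[1.0, 0.9, 0.1, 0.2],
                [0.9, 1.0, 0.2, 0.1],
                [0.1, 0.2, 1.0, 0.8],
                [0.2, 0.1, 0.8, 1.0]])
clusters = [[0], [1], [2], [3]]               # singletons: minimum recall
print(group_average_step(clusters, sim))      # merges the most similar pair first
```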

Page 67: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Towards Optimum Clustering

Fusion principles – min cut

starts with a single cluster (maximum recall)

searches for the cut with the minimum loss in recall

$$\rho(D,Q,C) = \sum_{C_i \in C} \sum_{\substack{(d_l,d_m) \in C_i \times C_i \\ d_l \neq d_m}} \vec{\tau}^T(d_l) \cdot \vec{\tau}(d_m)$$

consider expected precision for breaking ties!
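
The split with the minimum loss in recall corresponds to a global minimum cut of the similarity graph. One way to illustrate this is the Stoer-Wagner algorithm from networkx (an assumption for illustration; the slides do not prescribe a particular min-cut routine):

```python
import networkx as nx

# similarity graph: edge weight = tau^T(d_l) . tau(d_m) (toy values)
sims = {("d1", "d2"): 0.9, ("d1", "d3"): 0.1, ("d1", "d4"): 0.2,
        ("d2", "d3"): 0.2, ("d2", "d4"): 0.1, ("d3", "d4"): 0.8}

G = nx.Graph()
for (u, v), w in sims.items():
    G.add_edge(u, v, weight=w)

cut_value, (part_a, part_b) = nx.stoer_wagner(G)   # split with minimum loss in recall
print(cut_value, sorted(part_a), sorted(part_b))   # cut weight ~0.6, parts {d1,d2} and {d3,d4}
```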

Page 70: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Towards Optimum Clustering

Finding optimum clusterings

Min cut (assuming a cohesive similarity graph):

starts with the optimum clustering for maximum recall

min cut finds the split with the minimum loss in recall

considering precision for tie breaking → optimum clustering for two clusters

$O(n^3)$ (vs. $O(2^n)$ for the general case)

subsequent splits will not necessarily reach optima

Group average:

in general, multiple fusion steps are needed to reach the first optimum

the greedy strategy does not necessarily find this optimum!

Page 79: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Experiments

1 Introduction

2 Cluster Metric

3 Optimum clustering

4 Towards Optimum Clustering

5 Experiments

6 Conclusion and Outlook

Page 80: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Experiments

Experiments with a Query Set

ADI collection: 35 queries, 70 documents (each relevant to 2.4 queries on average)

Experiments:

Q35opt  using the actual relevance judgments in $\vec{\tau}(d)$
Q35     BM25 estimates for the 35 queries
1Tuni   1-term queries, uniform distribution
1Tdf    1-term queries, weighted according to document frequency

Page 81: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Experiments

[Figure: recall–precision curves on the ADI collection for the query sets Q35opt, Q35, 1Tuni, and 1Tdf]

Page 82: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Experiments

Using Keyphrases as Query Set

Compare clustering results based on different query sets:

1 'bag-of-words': single words as queries
2 keyphrases automatically extracted as head-noun phrases; a single query = all keyphrases of a document

Test collections: 4 test collections assembled from the RCV1 (Reuters) news corpus

# documents: 600 vs. 6000
# categories: 6 vs. 12
frequency distribution of classes: [U]niform vs. [R]andom

Page 84: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Experiments

Using Keyphrases as Query Set – Results

[Charts: Average Precision and (external) F-measure]

Page 85: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Experiments

Evaluation of the Expected F-Measure

Correlation between the expected F-measure (an internal measure) and the standard F-measure (comparison with a reference classification)

test collections as before

regard the quality of 40 different clustering methods for each setting (find the optimum clustering among these 40 methods)
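
The correlation itself is a single library call. A sketch with scipy, where the two lists stand in for the internal and external scores of the clustering methods under comparison (the numbers are placeholders, not results from the paper):

```python
from scipy.stats import pearsonr

expected_f = [0.41, 0.38, 0.52, 0.47, 0.35]   # internal: expected F-measure per clustering method
external_f = [0.44, 0.36, 0.55, 0.49, 0.40]   # external: F-measure against the reference classes

r, p_value = pearsonr(expected_f, external_f)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```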

Page 86: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Experiments

Correlation results

Pearson correlation between the internal measures and the external F-measure

Page 87: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Conclusion and Outlook

1 Introduction

2 Cluster Metric

3 Optimum clustering

4 Towards Optimum Clustering

5 Experiments

6 Conclusion and Outlook

Page 88: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Conclusion and Outlook

Summary

The Optimum Clustering Framework

makes the Cluster Hypothesis a requirement

forms a theoretical basis for the development of better clustering methods

yields positive experimental evidence

Page 89: The Optimum Clustering Framework: Implementing the Cluster Hypothesis

Conclusion and Outlook

Further Research

theoretical:

compatibility of existing clustering methods with the OCF

extension of the OCF to soft clustering

extension of the OCF to hierarchical clustering

experimental:

variation of query sets

user experiments