Information Retrieval Lecture 6 Introduction to Information Retrieval (Manning et al. 2007) Chapter 16 For the MSc Computer Science Programme Dell Zhang Birkbeck, University of London


Page 1:

Information Retrieval

Lecture 6
Introduction to Information Retrieval (Manning et al. 2007)

Chapter 16

For the MSc Computer Science Programme

Dell Zhang
Birkbeck, University of London

Page 2:

What is text clustering?

Text clustering – grouping a set of documents into classes of similar documents.

Classification vs. Clustering

Classification: supervised learning. Labeled data are given for training.

Clustering: unsupervised learning. Only unlabeled data are available.

Page 3:

Why text clustering?

To improve the user interface: navigation/analysis of a corpus or of search results.

To improve recall: cluster the docs in the corpus a priori. When a query matches a doc d, also return the other docs in the cluster containing d. The hope is that, for example, the query “car” will then also return docs containing “automobile”.

To improve retrieval speed: cluster pruning.

Page 4:

http://clusty.com/

Page 5:

What makes a clustering good?

External criteria: consistent with the latent classes in gold-standard (ground-truth) data.

Internal criteria: high intra-cluster similarity and low inter-cluster similarity.
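A standard external criterion is purity: each cluster is credited with its most frequent gold-standard class, and the credited counts are summed and divided by the number of docs. A minimal sketch (the cluster ids and class labels below are made up for illustration):

```python
from collections import Counter

def purity(clusters, gold):
    """purity = (1/N) * sum over clusters of the size of the
    cluster's most common gold-standard class."""
    total = 0
    for cid in set(clusters):
        members = [g for c, g in zip(clusters, gold) if c == cid]
        total += Counter(members).most_common(1)[0][1]
    return total / len(gold)

clusters = [0, 0, 0, 1, 1, 2, 2, 2]
gold     = ['x', 'x', 'o', 'o', 'o', 'd', 'd', 'x']
score = purity(clusters, gold)  # 6 correctly-credited docs out of 8
```

A purity of 1.0 means every cluster contains docs of a single class; note that purity is trivially maximized by putting each doc in its own cluster, so it is usually reported alongside the number of clusters.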

Page 6:

Issues for Clustering

Similarity between docs: ideally semantic similarity; in practice, statistical similarity, e.g., cosine.

Number of clusters: fixed, e.g., kMeans; flexible, e.g., Single-Link HAC.

Structure of clusters: flat partition, e.g., kMeans; hierarchical tree, e.g., Single-Link HAC.
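The cosine measure mentioned above can be sketched in a few lines; a minimal illustration over plain term-weight vectors (the toy vectors below are made up, not from the lecture):

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length term-weight vectors:
    dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Two toy document vectors over a 4-term vocabulary.
d1 = [2, 1, 0, 1]
d2 = [1, 1, 1, 0]
sim = cosine_sim(d1, d2)
```

Because it normalizes by vector length, cosine compares the direction of the term-weight vectors rather than their magnitude, so a long document and a short one about the same topic can still score as highly similar.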

Page 7:

kMeans Algorithm

Pick k docs {s1, s2, …, sk} randomly as seeds.
Repeat until the clustering converges (or another stopping criterion is met):

For each doc di: assign di to the cluster cj such that sim(di, sj) is maximal.

For each cluster cj: update sj to the centroid (mean) of cluster cj.

\[
\vec{s}_j = \frac{1}{|c_j|} \sum_{\vec{d} \in c_j} \vec{d}
\qquad
c(\vec{d}) = \arg\max_{c_j \in C} \operatorname{sim}(\vec{d}, \vec{s}_j)
\]
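The two alternating steps above can be sketched in Python; a minimal illustration, with cosine similarity as sim and toy 2-D vectors of my own choosing (the function and variable names are not from the slides):

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(docs, k, n_iter=100, seed=0):
    """The kMeans loop from the slide: assign each doc to the most
    similar seed, then recompute each seed as its cluster's centroid."""
    rng = random.Random(seed)
    seeds = [list(d) for d in rng.sample(docs, k)]
    assign = [-1] * len(docs)
    for _ in range(n_iter):
        # Assignment step: c(d) = argmax_j sim(d, s_j)
        new = [max(range(k), key=lambda j: cosine(d, seeds[j])) for d in docs]
        if new == assign:
            break  # converged: clusters no longer change
        assign = new
        # Update step: s_j = mean of the docs assigned to cluster j
        for j in range(k):
            members = [d for d, a in zip(docs, assign) if a == j]
            if members:  # keep the old seed if a cluster emptied out
                seeds[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign

# Four toy 2-D "docs": two near the x-axis, two near the y-axis.
docs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = kmeans(docs, k=2)
```

The loop stops as soon as an assignment pass changes nothing, which is exactly the convergence condition discussed below.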

Page 8:

kMeans – Example (k = 2)

(Animated figure: pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged!)

Page 9:

kMeans – Example

Page 10:

kMeans – Example

Page 11:

kMeans – Online Demo

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Page 12:

Convergence

kMeans is guaranteed to converge, i.e., to reach a state in which the clusters no longer change.

kMeans usually converges quickly, i.e., the number of iterations is small in most cases.

Page 13:

Seeds

Problem: results can vary because of the random seed selection. Some seeds can result in a poor convergence rate, or in convergence to a sub-optimal clustering.

Solution: run kMeans multiple times with different random seed selections.


Example showing sensitivity to seeds: in the example above, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.
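The multiple-restart fix can be sketched as follows; a minimal illustration of my own (the names and the scoring choice are not from the slides): each run starts from different random seeds, and the run with the lowest residual sum of squares (RSS) is kept. For brevity the sketch uses Euclidean distance rather than cosine similarity.

```python
import random

def dist2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_once(points, k, rng, n_iter=100):
    """One Euclidean kMeans run from random seeds; returns (centroids, labels)."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(n_iter):
        new = [min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points]
        if new == labels:
            break  # converged: clusters no longer change
        labels = new
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, labels

def kmeans_restarts(points, k, n_restarts=10, seed=0):
    """Run kMeans several times and keep the run with the lowest RSS."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        centroids, labels = kmeans_once(points, k, rng)
        rss = sum(dist2(p, centroids[l]) for p, l in zip(points, labels))
        if best is None or rss < best[0]:
            best = (rss, labels)
    return best[1]

points = [[0, 0], [0, 1], [10, 0], [10, 1]]
labels = kmeans_restarts(points, k=2, n_restarts=20)
```

A single run that happens to seed both centroids inside the same natural group can get stuck in a sub-optimal split, exactly as in the B/E versus D/F example above; restarting and keeping the lowest-RSS result makes that much less likely.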

Page 14:

Take Home Message

kMeans

\[
\vec{s}_j = \frac{1}{|c_j|} \sum_{\vec{d} \in c_j} \vec{d}
\qquad
c(\vec{d}) = \arg\max_{c_j \in C} \operatorname{sim}(\vec{d}, \vec{s}_j)
\]