Information Retrieval
Lecture 6
Introduction to Information Retrieval (Manning et al. 2007)
Chapter 16
For the MSc Computer Science Programme
Dell Zhang
Birkbeck, University of London
What is text clustering?
Text clustering – grouping a set of documents into classes of similar documents.
Classification vs. Clustering
Classification: supervised learning; labeled data are given for training.
Clustering: unsupervised learning; only unlabeled data are available.
Why text clustering?
To improve the user interface: navigation/analysis of a corpus or of search results.
To improve recall: cluster the docs in the corpus a priori; when a query matches a doc d, also return the other docs in the cluster containing d. The hope is that the query “car” will then also return docs containing “automobile”.
To improve retrieval speed: cluster pruning.
http://clusty.com/
What makes a good clustering?
External criteria: consistent with the latent classes in gold-standard (ground-truth) data.
Internal criteria: high intra-cluster similarity, low inter-cluster similarity.
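The internal criteria can be made concrete by averaging pairwise similarities within clusters and across clusters. A minimal sketch (the function names and the averaging scheme are illustrative choices, not from the slides):

```python
import itertools

def avg_pairwise_sim(docs, sim):
    # Average similarity over all unordered pairs of docs in one cluster.
    pairs = list(itertools.combinations(docs, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def intra_inter(clusters, sim):
    # intra: mean of the per-cluster average pairwise similarities
    # inter: average similarity over all pairs drawn from different clusters
    intra = [avg_pairwise_sim(c, sim) for c in clusters if len(c) > 1]
    cross = [sim(a, b)
             for c1, c2 in itertools.combinations(clusters, 2)
             for a in c1 for b in c2]
    return (sum(intra) / len(intra) if intra else 0.0,
            sum(cross) / len(cross) if cross else 0.0)
```

A good clustering should score high on the first value and low on the second.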
Issues for Clustering
Similarity between docs: ideally semantic similarity; in practice, statistical similarity, e.g., cosine.
Number of clusters: fixed, e.g., kMeans; flexible, e.g., Single-Link HAC.
Structure of clusters: flat partition, e.g., kMeans; hierarchical tree, e.g., Single-Link HAC.
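The practical choice above, cosine similarity, can be sketched over sparse vectors; this assumes documents are represented as term-to-weight dictionaries (e.g., tf-idf), which is one common convention rather than anything fixed by the slides:

```python
import math

def cosine_sim(a, b):
    # a, b: term -> weight dictionaries (e.g., tf-idf vectors)
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is dissimilar to everything
    return dot / (norm_a * norm_b)
```

Docs sharing no terms score 0; identical docs score 1, regardless of length.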
kMeans Algorithm
Pick k docs {s1, s2, …, sk} randomly as seeds.
Repeat until the clustering converges (or another stopping criterion is met):
For each doc di: assign di to the cluster cj such that sim(di, sj) is maximal.
For each cluster cj: update sj to the centroid (mean) of cluster cj.
c(d) = \arg\max_{c_j \in C} \mathrm{sim}(d, s_j)

s_j = \frac{1}{|c_j|} \sum_{d \in c_j} d
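The two steps above can be sketched as follows; the generic `sim` and `mean` parameters, the iteration cap, and the fixed random seed are assumptions made for illustration:

```python
import random

def kmeans(docs, k, sim, mean, iters=100, seed=0):
    """Basic kMeans: docs is a list of document vectors, sim(d, s) a
    similarity function, mean(cluster) a centroid function."""
    rng = random.Random(seed)
    seeds = rng.sample(docs, k)          # pick k docs randomly as seeds
    assignment = None
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        new_assignment = []
        for d in docs:
            # assign d to the cluster whose seed is most similar
            j = max(range(k), key=lambda j: sim(d, seeds[j]))
            clusters[j].append(d)
            new_assignment.append(j)
        if new_assignment == assignment:
            break                        # converged: no doc changed cluster
        assignment = new_assignment
        # update each seed to the centroid (mean) of its cluster
        seeds = [mean(c) if c else seeds[j] for j, c in enumerate(clusters)]
    return assignment
```

For example, on 1-D points with `sim=lambda d, s: -abs(d - s)` and `mean=lambda c: sum(c) / len(c)`, two well-separated groups end up in two different clusters.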
kMeans – Example (k = 2)
Pick seeds.
Reassign clusters.
Compute centroids.
Reassign clusters.
Compute centroids.
Reassign clusters.
Converged!
[Figure: scatter plots showing the documents and the moving centroids at each step; not reproduced here.]
kMeans – Online Demo
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
Convergence
kMeans provably converges, i.e., it reaches a state in which the clusters no longer change.
kMeans usually converges quickly: the number of iterations is small in most cases.
Seeds
Problem: results can vary because of random seed selection; some seeds lead to a poor convergence rate, or to convergence to a sub-optimal clustering.
Solution: run kMeans multiple times with different random seed selections.
Example showing sensitivity to seeds: in the example above, if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}.
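The multiple-restarts remedy can be sketched as follows, here on 1-D points with Euclidean distance for brevity; keeping the run with the lowest total within-cluster distance is one common way to pick among restarts (an assumption for illustration, not prescribed by the slides):

```python
import random

def kmeans_1d(points, k, iters=100, rng=None):
    # Minimal 1-D kMeans; returns (assignment, total within-cluster distance).
    rng = rng or random.Random()
    centroids = rng.sample(points, k)    # random seed selection
    assignment = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: abs(p - centroids[j]))
               for p in points]
        if new == assignment:
            break                        # converged
        assignment = new
        clusters = [[] for _ in range(k)]
        for p, j in zip(points, assignment):
            clusters[j].append(p)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    cost = sum(abs(p - centroids[j]) for p, j in zip(points, assignment))
    return assignment, cost

def best_of_restarts(points, k, restarts=10, seed=0):
    # Run kMeans several times with different random seeds,
    # keep the lowest-cost clustering.
    rng = random.Random(seed)
    runs = [kmeans_1d(points, k, rng=rng) for _ in range(restarts)]
    return min(runs, key=lambda r: r[1])
```

With enough restarts, at least one run is likely to start from seeds that lead to the good clustering rather than a sub-optimal one.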
Take Home Message
kMeans
c(d) = \arg\max_{c_j \in C} \mathrm{sim}(d, s_j)

s_j = \frac{1}{|c_j|} \sum_{d \in c_j} d