WEB BAR 2004 Advanced Retrieval and Web Mining
Lecture 12
Today’s Topic: Clustering 1
Motivation: Recommendations Document clustering Clustering algorithms
Restaurant recommendations
We have a list of all Palo Alto restaurants, with ratings for some of them as provided by Stanford students
Which restaurant(s) should I recommend to you?
Input

| Person | Restaurant   | Rating |
|--------|--------------|--------|
| Alice  | Il Fornaio   | Yes |
| Bob    | Ming's       | No  |
| Cindy  | Straits Café | No  |
| Dave   | Ming's       | Yes |
| Alice  | Straits Café | No  |
| Estie  | Zao          | Yes |
| Cindy  | Zao          | No  |
| Dave   | Brahma Bull  | No  |
| Dave   | Zao          | Yes |
| Estie  | Ming's       | Yes |
| Fred   | Brahma Bull  | No  |
| Alice  | Mango Café   | No  |
| Fred   | Ramona's     | No  |
| Dave   | Homma's      | Yes |
| Bob    | Higashi West | Yes |
| Estie  | Straits Café | Yes |
Algorithm 0
Recommend the most popular restaurants, say by # positive votes minus # negative votes.
Ignores your culinary preferences, and the judgements of those with similar preferences.
How can we exploit the wisdom of "like-minded" people?
Basic assumption: preferences are not random. For example, if I like Il Fornaio, it's more likely I will also like Cenzo.
Another look at the input - a matrix
|       | Brahma Bull | Higashi West | Mango | Il Fornaio | Zao | Ming's | Ramona's | Straits | Homma's |
|-------|-------------|--------------|-------|------------|-----|--------|----------|---------|---------|
| Alice |    |     | No |  Yes |     |     |    | No  |     |
| Bob   |    | Yes |    |      |     | No  |    |     |     |
| Cindy |    |     |    |      | No  |     |    | No  |     |
| Dave  | No |     |    |      | Yes | Yes |    |     | Yes |
| Estie |    |     |    |      | Yes | Yes |    | Yes |     |
| Fred  | No |     |    |      |     |     | No |     |     |
Now that we have a matrix
|       | Brahma Bull | Higashi West | Mango | Il Fornaio | Zao | Ming's | Ramona's | Straits | Homma's |
|-------|-------------|--------------|-------|------------|-----|--------|----------|---------|---------|
| Alice |    |    | -1 |  1 |    |    |    | -1 |    |
| Bob   |    |  1 |    |    |    | -1 |    |    |    |
| Cindy |    |    |    |    | -1 |    |    | -1 |    |
| Dave  | -1 |    |    |    |  1 |  1 |    |    |  1 |
| Estie |    |    |    |    |  1 |  1 |    |  1 |    |
| Fred  | -1 |    |    |    |    |    | -1 |    |    |
View all other entries as zeros for now.
Similarity between two people
Similarity between their preference vectors. Inner products are a good start.
Dave has similarity 3 with Estie but -2 with Cindy.
Perhaps recommend Straits Café to Dave and Il Fornaio to Bob, etc.
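A rough sketch of the inner-product idea. The vectors below are illustrative, filled in from a subset of the votes above, so the exact similarity values differ from the slide's:

```python
# Preference vectors over a fixed restaurant order; +1 = Yes, -1 = No,
# 0 = no rating. Rows are illustrative, taken from the vote list above.
restaurants = ["Brahma Bull", "Il Fornaio", "Zao", "Ming's", "Straits", "Homma's"]
prefs = {
    "Dave":  [-1, 0,  1, 1,  0, 1],
    "Estie": [ 0, 0,  1, 1,  1, 0],
    "Cindy": [ 0, 0, -1, 0, -1, 0],
}

def inner(u, v):
    """Inner product = #agreements minus #disagreements on co-rated items."""
    return sum(a * b for a, b in zip(u, v))

sim_dave_estie = inner(prefs["Dave"], prefs["Estie"])  # agree on Zao, Ming's -> 2
sim_dave_cindy = inner(prefs["Dave"], prefs["Cindy"])  # disagree on Zao -> -1
```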
Algorithm 1.1
Goal: recommend restaurants I don't know. Input: my evaluation of restaurants I've been to. Basic idea: find the person "most similar" to me in the database and recommend something s/he likes.
Aspects to consider:
- No attempt to discern cuisines, etc.
- What if I've been to all the restaurants s/he has?
- Do you want to rely on one person's opinions?
www.everyonesacritic.net (movies)
Algorithm 1.k
Look at the k people who are most similar. Recommend what's most popular among them. Issues?
Slightly more sophisticated attempt
Group similar users together into clusters.
To make recommendations:
- Find the "nearest cluster"
- Recommend the restaurants most popular in this cluster
Features:
- efficient
- avoids data sparsity issues
- still no attempt to discern why you're recommended what you're recommended
- how do you cluster?
How do you cluster?
Two key requirements for "good" clustering:
- Keep similar people together in a cluster
- Separate dissimilar people
Factors:
- Need a notion of similarity/distance. Vector space? Normalization?
- How many clusters? Fixed a priori? Completely data driven?
- Avoid "trivial" clusters - too large or small
Looking beyond
Clustering people for restaurant recommendations
Clustering other things (documents, web pages)
Other approaches to recommendation
General unsupervised machine learning.
Amazon.com
Why cluster documents?
For improving recall in search applications Better search results
For speeding up vector space retrieval Faster search
Corpus analysis/navigation Better user interface
Improving search recall
Cluster hypothesis - documents with similar text are related.
Ergo, to improve search recall:
- Cluster docs in the corpus a priori
- When a query matches a doc D, also return other docs in the cluster containing D
Hope if we do this: the query "car" will also return docs containing automobile, because clustering grouped together docs containing car with those containing automobile.
Why might this happen?
Speeding up vector space retrieval
In vector space retrieval, we must find the doc vectors nearest to the query vector.
This would entail computing the similarity of the query to every doc - slow (for some applications).
By clustering docs in the corpus a priori, we can find the nearest docs in cluster(s) close to the query - inexact, but avoids exhaustive similarity computation.
Exercise: Make up a simple example with points on a line in 2 clusters where this inexactness shows up.
Speeding up vector space retrieval
Cluster documents into k clusters. Retrieve the closest cluster c_i to the query. Rank the documents in c_i and return them to the user. Applications? Web search engines?
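A sketch of this retrieval scheme, assuming unit-length doc vectors and precomputed cluster centroids (the data and names below are illustrative, not from the slides):

```python
import math

def norm(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos(u, v):
    # Unit-length vectors assumed, so cosine = plain inner product.
    return sum(a * b for a, b in zip(u, v))

def cluster_pruned_search(docs, centroids, assignment, query, topk=2):
    """Score only the docs in the cluster whose centroid is closest to the query."""
    best_c = max(range(len(centroids)), key=lambda c: cos(query, centroids[c]))
    cand = [i for i, c in enumerate(assignment) if c == best_c]
    return sorted(cand, key=lambda i: cos(query, docs[i]), reverse=True)[:topk]

# Illustrative corpus: two tight clusters in a 2-D term space.
docs = [norm(v) for v in [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]]
assignment = [0, 0, 1, 1]
centroids = [norm([1, 0.05]), norm([0.05, 1])]
query = norm([1, 0.2])
hits = cluster_pruned_search(docs, centroids, assignment, query)
```

Only the docs in the chosen cluster are scored, which is where the inexactness comes from: a good doc in another cluster is never considered.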
Clustering for UI (1): Corpus analysis/navigation
Given a corpus, partition it into groups of related docs.
Recursively, can induce a tree of topics.
Allows the user to browse through the corpus to find information.
Crucial need: meaningful labels for topic nodes.
Yahoo: manual hierarchy - often not available for a new document collection.
Clustering for UI (2): Navigating search results
Given the results of a search (say Jaguar, or NLP), partition them into groups of related docs.
Can be viewed as a form of word sense disambiguation.
Jaguar may have senses: the car company, the animal, the football team, the video game, …
Results list clustering example
Cluster 1:
- Jaguar Motor Cars' home page
- Mike's XJS resource page
- Vermont Jaguar owners' club

Cluster 2:
- Big cats
- My summer safari trip
- Pictures of jaguars, leopards and lions

Cluster 3:
- Jacksonville Jaguars' Home Page
- AFC East Football Teams
Search Engine Example: Vivisimo
Search for "NLP" on vivisimo (www.vivisimo.com).
Doesn't always work well: no geographic/coffee clusters for "java"!
Representation for Clustering
Similarity measure Document representation
What makes docs “related”?
Ideal: semantic similarity. Practical: statistical similarity.
We will use cosine similarity, with docs as vectors.
For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
We will describe algorithms in terms of cosine similarity.
Recall doc as vector
Each doc j is a vector of tf-idf values, one component for each term.
Can normalize to unit length. So we have a vector space:
- terms are axes - aka features
- n docs live in this space
- even with stemming, may have 10000+ dimensions
- do we really want to use all terms?
Different from using vector space for search. Why?
Intuition
Postulate: Documents that are “close together” in vector space talk about the same things.
[Figure: documents D1-D4 as points in a vector space with term axes t1, t2, t3]
Cosine similarity
Cosine similarity of D_j, D_k, aka normalized inner product:

    sim(D_j, D_k) = (D_j · D_k) / (|D_j| |D_k|) = Σ_{i=1..m} w_ij · w_ik

(the last equality holds for vectors normalized to unit length).
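The formula above can be sketched in code with toy two-dimensional "docs" (the vectors are illustrative):

```python
import math

def cosine(dj, dk):
    """Cosine similarity: inner product over the product of vector lengths."""
    dot = sum(a * b for a, b in zip(dj, dk))
    return dot / (math.sqrt(sum(a * a for a in dj)) *
                  math.sqrt(sum(b * b for b in dk)))

# For unit-length vectors the denominator is 1, and cosine reduces to
# the plain inner product of the weights w_ij.
a = [3.0, 4.0]  # length 5
b = [4.0, 3.0]  # length 5
sim = cosine(a, b)  # (12 + 12) / (5 * 5) = 0.96
```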
How Many Clusters?
Number of clusters k given in advance:
- Partition n docs into a predetermined number of clusters.
Finding the "right" number of clusters is part of the problem:
- Given docs, partition into an "appropriate" number of subsets.
- E.g., for query results - the ideal value of k is not known up front - though the UI may impose limits.
Can usually take an algorithm for one flavor and convert it to the other.
Clustering Algorithms
Hierarchical algorithms Bottom-up, agglomerative Top-down, divisive Need a notion of cluster similarity
Iterative, “flat” algorithms Usually start with a random (partial) partitioning Refine it iteratively
Dendrogram: Example
be,not,he,I,it,this,the,his,a,and,but,in,on,with,for,at,from,of,to,as,is,was
Dendrogram: Document Example
As clusters agglomerate, docs likely to fall into a hierarchy of “topics” or concepts.
[Dendrogram over docs d1-d5: d1 and d2 merge; d4 and d5 merge; then d3 joins {d4,d5} to form {d3,d4,d5}]
Agglomerative clustering
Given: target number of clusters k.
Initially, each doc is viewed as a cluster: start with n clusters.
Repeat: while there are > k clusters, find the "closest pair" of clusters and merge them.
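The loop above can be sketched as follows (a naive quadratic-per-step version on toy 2-D data; the cluster-similarity function is swappable, here defaulting to single-link):

```python
import math

def cos(u, v):
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def agglomerate(docs, k, sim=None):
    """Naive HAC: repeatedly merge the two most similar clusters until k remain.
    Cluster similarity defaults to single-link (max pairwise cosine); passing
    a different `sim` gives the other linkage variants."""
    if sim is None:
        sim = lambda A, B: max(cos(docs[i], docs[j]) for i in A for j in B)
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        pairs = [(a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))]
        a, b = max(pairs, key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two obvious groups in a 2-D term space.
docs = [[1.0, 0.0], [0.95, 0.1], [0.0, 1.0], [0.1, 0.95]]
clusters = agglomerate(docs, 2)
```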
“Closest pair” of clusters
Many variants of defining the closest pair of clusters:
- "Center of gravity": clusters whose centroids (centers of gravity) are the most cosine-similar
- Average-link: average cosine between pairs of elements
- Single-link: similarity of the most cosine-similar pair
- Complete-link: similarity of the "furthest" points, the least cosine-similar pair
Definition of Cluster Similarity
Single-link clustering: similarity of the two closest points. Can create elongated, straggly clusters (chaining effect).
Complete-link clustering: similarity of the two least similar points. Sensitive to outliers.
Centroid-based and average-link: good compromise.
Key notion: cluster representative
We want a notion of a representative point in a cluster.
The representative should be some sort of "typical" or central point in the cluster, e.g.:
- point inducing the smallest radii to docs in the cluster
- point with the smallest squared distances, etc.
- point that is the "average" of all docs in the cluster: the centroid or center of gravity
Centroid
Centroid of a cluster = component-wise average of the vectors in the cluster - it is a vector, and need not be a doc.
Centroid of (1,2,3); (4,5,6); (7,2,6) is (4,3,5).
The centroid is a good cluster representative in most cases.
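The component-wise average is a one-liner; the sketch below reproduces the slide's example:

```python
def centroid(vectors):
    """Component-wise average; the result is a vector but need not be a doc."""
    m = len(vectors)
    return [sum(v[i] for v in vectors) / m for i in range(len(vectors[0]))]

c = centroid([[1, 2, 3], [4, 5, 6], [7, 2, 6]])  # the slide's example
```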
Centroid
Is the centroid of normalized vectors normalized?
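The answer is generally no: averaging unit vectors that point in different directions shortens the result. A tiny check:

```python
import math

def length(v):
    return math.sqrt(sum(x * x for x in v))

# Two unit vectors pointing in different directions.
u = [1.0, 0.0]
v = [0.0, 1.0]
c = [(a + b) / 2 for a, b in zip(u, v)]  # their centroid
clen = length(c)  # sqrt(0.5) ~ 0.707: shorter than 1, so not normalized
```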
Outliers in centroid computation
Can ignore outliers when computing the centroid.
What is an outlier? Lots of statistical definitions, e.g., moment of a point to the centroid > M × some cluster moment (say M = 10).
Medoid As Cluster Representative
The centroid does not have to be a document.
Medoid: a cluster representative that is one of the documents, for example the document closest to the centroid.
One reason this is useful:
- Consider the representative of a large cluster (>1000 documents)
- The centroid of this cluster will be a dense vector
- The medoid of this cluster will be a sparse vector
Compare: mean/centroid vs. median/medoid
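A minimal sketch of choosing the medoid as the doc closest to the centroid (toy dense data for brevity; a real implementation would keep the docs sparse, which is the whole point):

```python
import math

def cos(u, v):
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def medoid(docs):
    """Index of the document most cosine-similar to the centroid.
    Unlike the centroid, the medoid is an actual document."""
    m = len(docs)
    c = [sum(d[i] for d in docs) / m for i in range(len(docs[0]))]
    return max(range(m), key=lambda i: cos(docs[i], c))

docs = [[1.0, 0.0], [0.8, 0.6], [0.9, 0.1]]
rep = medoid(docs)  # doc 2 points closest to the average direction
```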
Example: n=6, k=3, closest pair of centroids
[Figure: six points d1-d6; centroids shown after the first and second merge steps]
Issues
Have to support finding closest pairs: continually compare all pairs?
- Potentially O(n³) cosine similarity computations. To avoid: use approximations.
- "Points" are switching clusters as centroids change.
Naïve implementation expensive for large document sets (100,000s).
Efficient implementation:
- Cluster a sample, then assign the entire set
- Avoid dense centroids (e.g., by using medoids). Why?
Exercise
Consider agglomerative clustering on n points on a line. Explain how you could avoid n3 distance computations - how many will your scheme use?
“Using approximations”
In the standard algorithm, we must find the closest pair of centroids at each step.
Approximation: instead, find a nearly closest pair:
- use some data structure that makes this approximation easier to maintain
- simplistic example: maintain the closest pair based on distances in a projection onto a random line
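One way to sketch the random-line idea (illustrative only; a real system would maintain the projected order incrementally rather than re-sorting):

```python
import random

def nearly_closest_pair(points, seed=0):
    """Sketch: project points onto a random line and compare only
    neighbours in the projected order - O(n log n) per pass instead of
    O(n^2), at the price of sometimes missing the true closest pair."""
    rng = random.Random(seed)
    line = [rng.gauss(0, 1) for _ in range(len(points[0]))]
    proj = sorted(range(len(points)),
                  key=lambda i: sum(a * b for a, b in zip(points[i], line)))

    def dist2(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    # Best pair among neighbours in projected order (may be suboptimal).
    return min(zip(proj, proj[1:]), key=lambda p: dist2(*p))

pair = nearly_closest_pair([[0.0], [0.1], [5.0]])
```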
Different algorithm: k-means
K-means generates a "flat" set of clusters; k-means is non-hierarchical.
Given: k - the number of clusters desired.
Iterative algorithm.
Hard to get good bounds on the number of iterations to convergence; rarely a problem in practice.
Basic iteration
Reassignment: at the start of the iteration, we have k centroids. (Subproblem: where do we get them for the first iteration?) Each doc is assigned to the nearest centroid.
Centroid recomputation: all docs assigned to the same centroid are averaged to compute a new centroid; thus we have k new centroids.
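The two steps can be sketched as follows (a minimal k-means, seeded with the first k docs for determinism; picking k random docs is the usual answer to the initialization subproblem):

```python
def kmeans(docs, k, iters=10):
    """Minimal k-means sketch: reassignment, then centroid recomputation."""
    dim = len(docs[0])
    centroids = [list(docs[i]) for i in range(k)]
    assign = [0] * len(docs)
    for _ in range(iters):
        # Reassignment: each doc goes to the nearest centroid.
        for n, d in enumerate(docs):
            assign[n] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(d, centroids[c])))
        # Recomputation: average the docs assigned to each centroid.
        for c in range(k):
            members = [docs[n] for n in range(len(docs)) if assign[n] == c]
            if members:
                centroids[c] = [sum(m[i] for m in members) / len(members)
                                for i in range(dim)]
    return assign, centroids

docs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
assign, cents = kmeans(docs, 2)
```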
Iteration example
[Figure: docs and their current centroids]
Iteration example
[Figure: docs and the new centroids]
k-Means Clustering: Initialization
We could start with any k docs as centroids, but k random docs are better.
Repeat the basic iteration until a termination condition is satisfied.
Exercise: find a better approach for finding good starting points.
Termination conditions
Several possibilities, e.g.:
- A fixed number of iterations.
- Doc partition unchanged.
- Centroid positions don't change. Does this mean that the docs in a cluster are unchanged?
Convergence
Why should the k-means algorithm ever reach a fixed point? A state in which clusters don’t change.
k-means is a special case of a general procedure known as the EM algorithm. EM is known to converge. Number of iterations could be large.
Exercise
Consider running 2-means clustering on a corpus, each doc of which is from one of two different languages. What are the two clusters we would expect to see?
Is agglomerative clustering likely to produce different results?
Convergence of K-Means
Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid:
G_k = Σ_i (v_i − c_k)², summed over all v_i in cluster k
G = Σ_k G_k
Reassignment monotonically decreases G, since each vector is assigned to the closest centroid.
Recomputation monotonically decreases each G_k, since (with m_k the number of members in cluster k, working per component n) Σ_i (v_in − a)² reaches its minimum where Σ_i −2(v_in − a) = 0.
Convergence of K-Means
Σ_i −2(v_in − a) = 0
⇒ Σ_i v_in = Σ_i a = m_k a
⇒ a = (1/m_k) Σ_i v_in = c_kn
i.e., the minimizing a is exactly the n-th component of the centroid.
k not specified in advance
Say, for the results of a query.
Solve an optimization problem: penalize having lots of clusters.
- Application dependent, e.g., a compressed summary of a search results list.
Tradeoff between having more clusters (better focus within each cluster) and having too many clusters.
k not specified in advance
Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid.
Define the Total Benefit to be the sum of the individual doc Benefits.
Why is there always a clustering of Total Benefit n?
Penalize lots of clusters
For each cluster, we have a Cost C. Thus for a clustering with k clusters, the Total Cost is kC.
Define the Value of a clustering to be Total Benefit − Total Cost.
Find the clustering of highest Value, over all choices of k.
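These definitions translate directly to code. The sketch below (illustrative data) also shows why a Total Benefit of n is always achievable: make every doc its own cluster, so each doc has cosine 1 with its centroid.

```python
import math

def cos(u, v):
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def total_benefit(docs, clusters):
    """Sum over all docs of the cosine similarity to their cluster centroid."""
    benefit = 0.0
    for cl in clusters:
        c = [sum(docs[i][j] for i in cl) / len(cl) for j in range(len(docs[0]))]
        benefit += sum(cos(docs[i], c) for i in cl)
    return benefit

def value(docs, clusters, cost_per_cluster):
    """Value = Total Benefit - Total Cost (= Cost per cluster times k)."""
    return total_benefit(docs, clusters) - cost_per_cluster * len(clusters)

docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
singletons = [[0], [1], [2]]
tb = total_benefit(docs, singletons)                 # each doc: cosine 1 -> 3.0
val = value(docs, singletons, cost_per_cluster=0.5)  # 3 - 3 * 0.5 = 1.5
```

The per-cluster Cost is what keeps the trivial all-singletons clustering from always winning.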
Back to agglomerative clustering
In a run of agglomerative clustering, we can try all values of k = n, n−1, n−2, …, 1.
At each, we can measure our Value, then pick the best choice of k.
Exercise
Suppose a run of agglomerative clustering finds k=7 to have the highest value amongst all k. Have we found the highest-value clustering amongst all clusterings with k=7?
Clustering vs Classification
Clustering: unsupervised.
Input:
- clustering algorithm
- similarity measure
- number of clusters
No specific information for each document.

Classification: supervised.
- Each document is labeled with a class.
- Build a classifier that assigns documents to one of the classes.

Two types of partitioning: supervised vs unsupervised.
Clustering vs Classification
Consider clustering a large set of computer science documents what do you expect to see in the vector space?
[Figure: document clusters in the vector space, labeled NLP, Graphics, AI, Theory, Arch.]
Decision boundaries
Could we use these blobs to infer the subject of a new document?
[Figure: the same clusters with decision boundaries between NLP, Graphics, AI, Theory, Arch.]
Deciding what a new doc is about
Check which region the new doc falls into; can output "softer" decisions as well.
[Figure: a new doc falls into the AI region, so it is labeled AI]
Setup for Classification
Given "training" docs for each category: Theory, AI, NLP, etc.
Cast them into a decision space - generally a vector space with each doc viewed as a bag of words.
Build a classifier that will classify new docs:
- Essentially, partition the decision space.
- Given a new doc, figure out which partition it falls into.
Supervised vs. unsupervised learning
This setup is called supervised learning in the terminology of Machine Learning.
In the domain of text, various names: text classification, text categorization, document classification/categorization, "automatic" categorization, routing, filtering, …
In contrast, the earlier setting of clustering is called unsupervised learning:
- Presumes no availability of training samples.
- Clusters output may not be thematically unified.
Which is better?
Depends on your setting, on your application.
Can use them in combination:
- Analyze a corpus using clustering.
- Hand-tweak the clusters and label them.
- Use the tweaked clusters as training input for classification.
- Subsequent docs get classified.
Computationally, the methods are quite different.
Main issue: can you get training data?
Summary
Two types of clustering:
- Hierarchical, agglomerative clustering
- Flat, iterative clustering
How many clusters?
Key parameters:
- Representation of data points
- Similarity/distance measure