WEB BAR 2004 Advanced Retrieval and Web Mining
Lecture 12
Today’s Topic: Clustering 1
Motivation: Recommendations Document clustering Clustering algorithms
Restaurant recommendations
We have a list of all Palo Alto restaurants, with ratings for some of them as provided by Stanford students
Which restaurant(s) should I recommend to you?
Input

| Person | Restaurant   | Rating |
|--------|--------------|--------|
| Alice  | Il Fornaio   | Yes |
| Bob    | Ming's       | No  |
| Cindy  | Straits Café | No  |
| Dave   | Ming's       | Yes |
| Alice  | Straits Café | No  |
| Estie  | Zao          | Yes |
| Cindy  | Zao          | No  |
| Dave   | Brahma Bull  | No  |
| Dave   | Zao          | Yes |
| Estie  | Ming's       | Yes |
| Fred   | Brahma Bull  | No  |
| Alice  | Mango Café   | No  |
| Fred   | Ramona's     | No  |
| Dave   | Homma's      | Yes |
| Bob    | Higashi West | Yes |
| Estie  | Straits Café | Yes |
Algorithm 0
Recommend the most popular restaurants, say by # positive votes minus # negative votes.
Ignores your culinary preferences, and the judgements of those with similar preferences.
How can we exploit the wisdom of "like-minded" people?
Basic assumption: preferences are not random. For example, if I like Il Fornaio, it's more likely I will also like Cenzo.
Another look at the input - a matrix
|       | Brahma Bull | Higashi West | Mango | Il Fornaio | Zao | Ming's | Ramona's | Straits | Homma's |
|-------|-------------|--------------|-------|------------|-----|--------|----------|---------|---------|
| Alice |    |     | No |  Yes |     |     |    | No  |     |
| Bob   |    | Yes |    |      |     | No  |    |     |     |
| Cindy |    |     |    |      | No  |     |    | No  |     |
| Dave  | No |     |    |      | Yes | Yes |    |     | Yes |
| Estie |    |     |    |      | Yes | Yes |    | Yes |     |
| Fred  | No |     |    |      |     |     | No |     |     |
Now that we have a matrix
|       | Brahma Bull | Higashi West | Mango | Il Fornaio | Zao | Ming's | Ramona's | Straits | Homma's |
|-------|-------------|--------------|-------|------------|-----|--------|----------|---------|---------|
| Alice |    |    | -1 |  1 |    |    |    | -1 |    |
| Bob   |    |  1 |    |    |    | -1 |    |    |    |
| Cindy |    |    |    |    | -1 |    |    | -1 |    |
| Dave  | -1 |    |    |    |  1 |  1 |    |    |  1 |
| Estie |    |    |    |    |  1 |  1 |    |  1 |    |
| Fred  | -1 |    |    |    |    |    | -1 |    |    |
View all other entries as zeros for now.
Similarity between two people
Similarity between their preference vectors. Inner products are a good start.
Dave has similarity 3 with Estie but -2 with Cindy.
Perhaps recommend Straits Café to Dave and Il Fornaio to Bob, etc.
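A rough sketch of the inner-product idea. The vectors below are illustrative, filled in from a subset of the votes above, so the exact similarity values differ from the slide's:

```python
# Preference vectors over a fixed restaurant order; +1 = Yes, -1 = No,
# 0 = no rating. Rows are illustrative, taken from the vote list above.
restaurants = ["Brahma Bull", "Il Fornaio", "Zao", "Ming's", "Straits", "Homma's"]
prefs = {
    "Dave":  [-1, 0,  1, 1,  0, 1],
    "Estie": [ 0, 0,  1, 1,  1, 0],
    "Cindy": [ 0, 0, -1, 0, -1, 0],
}

def inner(u, v):
    """Inner product = #agreements minus #disagreements on co-rated items."""
    return sum(a * b for a, b in zip(u, v))

sim_dave_estie = inner(prefs["Dave"], prefs["Estie"])  # agree on Zao, Ming's -> 2
sim_dave_cindy = inner(prefs["Dave"], prefs["Cindy"])  # disagree on Zao -> -1
```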
Algorithm 1.1
Goal: recommend restaurants I don't know. Input: my evaluation of restaurants I've been to. Basic idea: find the person "most similar" to me in the database and recommend something s/he likes.
Aspects to consider:
- No attempt to discern cuisines, etc.
- What if I've been to all the restaurants s/he has?
- Do you want to rely on one person's opinions?
www.everyonesacritic.net (movies)
Algorithm 1.k
Look at the k people who are most similar. Recommend what's most popular among them. Issues?
Slightly more sophisticated attempt
Group similar users together into clusters.
To make recommendations:
- Find the "nearest cluster"
- Recommend the restaurants most popular in this cluster
Features:
- efficient
- avoids data sparsity issues
- still no attempt to discern why you're recommended what you're recommended
- how do you cluster?
How do you cluster?
Two key requirements for "good" clustering:
- Keep similar people together in a cluster
- Separate dissimilar people
Factors:
- Need a notion of similarity/distance. Vector space? Normalization?
- How many clusters? Fixed a priori? Completely data driven?
- Avoid "trivial" clusters - too large or small
Looking beyond
Clustering people for restaurant recommendations
Clustering other things (documents, web pages)
Other approaches to recommendation
General unsupervised machine learning.
Amazon.com
Why cluster documents?
For improving recall in search applications Better search results
For speeding up vector space retrieval Faster search
Corpus analysis/navigation Better user interface
Improving search recall
Cluster hypothesis - documents with similar text are related.
Ergo, to improve search recall:
- Cluster docs in the corpus a priori
- When a query matches a doc D, also return other docs in the cluster containing D
Hope if we do this: the query "car" will also return docs containing automobile, because clustering grouped together docs containing car with those containing automobile.
Why might this happen?
Speeding up vector space retrieval
In vector space retrieval, we must find the doc vectors nearest to the query vector.
This would entail computing the similarity of the query to every doc - slow (for some applications).
By clustering docs in the corpus a priori, we can find the nearest docs in cluster(s) close to the query - inexact, but avoids exhaustive similarity computation.
Exercise: Make up a simple example with points on a line in 2 clusters where this inexactness shows up.
Speeding up vector space retrieval
Cluster documents into k clusters. Retrieve the closest cluster c_i to the query. Rank the documents in c_i and return them to the user. Applications? Web search engines?
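A sketch of this retrieval scheme, assuming unit-length doc vectors and precomputed cluster centroids (the data and names below are illustrative, not from the slides):

```python
import math

def norm(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos(u, v):
    # Unit-length vectors assumed, so cosine = plain inner product.
    return sum(a * b for a, b in zip(u, v))

def cluster_pruned_search(docs, centroids, assignment, query, topk=2):
    """Score only the docs in the cluster whose centroid is closest to the query."""
    best_c = max(range(len(centroids)), key=lambda c: cos(query, centroids[c]))
    cand = [i for i, c in enumerate(assignment) if c == best_c]
    return sorted(cand, key=lambda i: cos(query, docs[i]), reverse=True)[:topk]

# Illustrative corpus: two tight clusters in a 2-D term space.
docs = [norm(v) for v in [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]]
assignment = [0, 0, 1, 1]
centroids = [norm([1, 0.05]), norm([0.05, 1])]
query = norm([1, 0.2])
hits = cluster_pruned_search(docs, centroids, assignment, query)
```

Only the docs in the chosen cluster are scored, which is where the inexactness comes from: a good doc in another cluster is never considered.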
Clustering for UI (1): Corpus analysis/navigation
Given a corpus, partition it into groups of related docs.
Recursively, can induce a tree of topics.
Allows the user to browse through the corpus to find information.
Crucial need: meaningful labels for topic nodes.
Yahoo: manual hierarchy - often not available for a new document collection.
Clustering for UI (2): Navigating search results
Given the results of a search (say Jaguar, or NLP), partition them into groups of related docs.
Can be viewed as a form of word sense disambiguation.
Jaguar may have senses: the car company, the animal, the football team, the video game, …
Results list clustering example
Cluster 1:
- Jaguar Motor Cars' home page
- Mike's XJS resource page
- Vermont Jaguar owners' club

Cluster 2:
- Big cats
- My summer safari trip
- Pictures of jaguars, leopards and lions

Cluster 3:
- Jacksonville Jaguars' Home Page
- AFC East Football Teams
Search Engine Example: Vivisimo
Search for "NLP" on vivisimo (www.vivisimo.com).
Doesn't always work well: no geographic/coffee clusters for "java"!
Representation for Clustering
Similarity measure Document representation
What makes docs “related”?
Ideal: semantic similarity. Practical: statistical similarity.
We will use cosine similarity, with docs as vectors.
For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
We will describe algorithms in terms of cosine similarity.
Recall doc as vector
Each doc j is a vector of tf-idf values, one component for each term.
Can normalize to unit length. So we have a vector space:
- terms are axes - aka features
- n docs live in this space
- even with stemming, may have 10000+ dimensions
- do we really want to use all terms?
Different from using vector space for search. Why?
Intuition
Postulate: Documents that are “close together” in vector space talk about the same things.
[Figure: documents D1-D4 as points in a vector space with term axes t1, t2, t3]
Cosine similarity
Cosine similarity of D_j, D_k, aka normalized inner product:

    sim(D_j, D_k) = (D_j · D_k) / (|D_j| |D_k|) = Σ_{i=1..m} w_ij · w_ik

(the last equality holds for vectors normalized to unit length).
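The formula above can be sketched in code with toy two-dimensional "docs" (the vectors are illustrative):

```python
import math

def cosine(dj, dk):
    """Cosine similarity: inner product over the product of vector lengths."""
    dot = sum(a * b for a, b in zip(dj, dk))
    return dot / (math.sqrt(sum(a * a for a in dj)) *
                  math.sqrt(sum(b * b for b in dk)))

# For unit-length vectors the denominator is 1, and cosine reduces to
# the plain inner product of the weights w_ij.
a = [3.0, 4.0]  # length 5
b = [4.0, 3.0]  # length 5
sim = cosine(a, b)  # (12 + 12) / (5 * 5) = 0.96
```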
How Many Clusters?
Number of clusters k given in advance:
- Partition n docs into a predetermined number of clusters.
Finding the "right" number of clusters is part of the problem:
- Given docs, partition into an "appropriate" number of subsets.
- E.g., for query results - the ideal value of k is not known up front - though the UI may impose limits.
Can usually take an algorithm for one flavor and convert it to the other.
Clustering Algorithms
Hierarchical algorithms Bottom-up, agglomerative Top-down, divisive Need a notion of cluster similarity
Iterative, “flat” algorithms Usually start with a random (partial) partitioning Refine it iteratively
Dendrogram: Example
be,not,he,I,it,this,the,his,a,and,but,in,on,with,for,at,from,of,to,as,is,was
Dendrogram: Document Example
As clusters agglomerate, docs likely to fall into a hierarchy of “topics” or concepts.
[Dendrogram over docs d1-d5: d1 and d2 merge; d4 and d5 merge; then d3 joins {d4,d5} to form {d3,d4,d5}]
Agglomerative clustering
Given: target number of clusters k.
Initially, each doc is viewed as a cluster: start with n clusters.
Repeat: while there are > k clusters, find the "closest pair" of clusters and merge them.
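The loop above can be sketched as follows (a naive quadratic-per-step version on toy 2-D data; the cluster-similarity function is swappable, here defaulting to single-link):

```python
import math

def cos(u, v):
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def agglomerate(docs, k, sim=None):
    """Naive HAC: repeatedly merge the two most similar clusters until k remain.
    Cluster similarity defaults to single-link (max pairwise cosine); passing
    a different `sim` gives the other linkage variants."""
    if sim is None:
        sim = lambda A, B: max(cos(docs[i], docs[j]) for i in A for j in B)
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        pairs = [(a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))]
        a, b = max(pairs, key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two obvious groups in a 2-D term space.
docs = [[1.0, 0.0], [0.95, 0.1], [0.0, 1.0], [0.1, 0.95]]
clusters = agglomerate(docs, 2)
```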
“Closest pair” of clusters
Many variants of defining the closest pair of clusters:
- "Center of gravity": clusters whose centroids (centers of gravity) are the most cosine-similar
- Average-link: average cosine between pairs of elements
- Single-link: similarity of the most cosine-similar pair
- Complete-link: similarity of the "furthest" points, the least cosine-similar pair
Definition of Cluster Similarity
Single-link clustering: similarity of the two closest points. Can create elongated, straggly clusters (chaining effect).
Complete-link clustering: similarity of the two least similar points. Sensitive to outliers.
Centroid-based and average-link: good compromise.
Key notion: cluster representative
We want a notion of a representative point in a cluster.
The representative should be some sort of "typical" or central point in the cluster, e.g.:
- point inducing the smallest radii to docs in the cluster
- point with the smallest squared distances, etc.
- point that is the "average" of all docs in the cluster: the centroid or center of gravity
Centroid
Centroid of a cluster = component-wise average of the vectors in the cluster - it is a vector, and need not be a doc.
Centroid of (1,2,3); (4,5,6); (7,2,6) is (4,3,5).
The centroid is a good cluster representative in most cases.
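The component-wise average is a one-liner; the sketch below reproduces the slide's example:

```python
def centroid(vectors):
    """Component-wise average; the result is a vector but need not be a doc."""
    m = len(vectors)
    return [sum(v[i] for v in vectors) / m for i in range(len(vectors[0]))]

c = centroid([[1, 2, 3], [4, 5, 6], [7, 2, 6]])  # the slide's example
```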
Centroid
Is the centroid of normalized vectors normalized?
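The answer is generally no: averaging unit vectors that point in different directions shortens the result. A tiny check:

```python
import math

def length(v):
    return math.sqrt(sum(x * x for x in v))

# Two unit vectors pointing in different directions.
u = [1.0, 0.0]
v = [0.0, 1.0]
c = [(a + b) / 2 for a, b in zip(u, v)]  # their centroid
clen = length(c)  # sqrt(0.5) ~ 0.707: shorter than 1, so not normalized
```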
Outliers in centroid computation
Can ignore outliers when computing the centroid.
What is an outlier? Lots of statistical definitions, e.g., moment of a point to the centroid > M × some cluster moment (say M = 10).
Medoid As Cluster Representative
The centroid does not have to be a document.
Medoid: a cluster representative that is one of the documents, for example the document closest to the centroid.
One reason this is useful:
- Consider the representative of a large cluster (>1000 documents)
- The centroid of this cluster will be a dense vector
- The medoid of this cluster will be a sparse vector
Compare: mean/centroid vs. median/medoid
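A minimal sketch of choosing the medoid as the doc closest to the centroid (toy dense data for brevity; a real implementation would keep the docs sparse, which is the whole point):

```python
import math

def cos(u, v):
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def medoid(docs):
    """Index of the document most cosine-similar to the centroid.
    Unlike the centroid, the medoid is an actual document."""
    m = len(docs)
    c = [sum(d[i] for d in docs) / m for i in range(len(docs[0]))]
    return max(range(m), key=lambda i: cos(docs[i], c))

docs = [[1.0, 0.0], [0.8, 0.6], [0.9, 0.1]]
rep = medoid(docs)  # doc 2 points closest to the average direction
```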
Example: n=6, k=3, closest pair of centroids
[Figure: six points d1-d6; centroids shown after the first and second merge steps]
Issues
Have to support finding closest pairs: continually compare all pairs?
- Potentially O(n³) cosine similarity computations. To avoid: use approximations.
- "Points" are switching clusters as centroids change.
Naïve implementation expensive for large document sets (100,000s).
Efficient implementation:
- Cluster a sample, then assign the entire set
- Avoid dense centroids (e.g., by using medoids). Why?
Exercise
Consider agglomerative clustering on n points on a line. Explain how you could avoid n3 distance computations - how many will your scheme use?
“Using approximations”
In the standard algorithm, we must find the closest pair of centroids at each step.
Approximation: instead, find a nearly closest pair:
- use some data structure that makes this approximation easier to maintain
- simplistic example: maintain the closest pair based on distances in a projection onto a random line
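One way to sketch the random-line idea (illustrative only; a real system would maintain the projected order incrementally rather than re-sorting):

```python
import random

def nearly_closest_pair(points, seed=0):
    """Sketch: project points onto a random line and compare only
    neighbours in the projected order - O(n log n) per pass instead of
    O(n^2), at the price of sometimes missing the true closest pair."""
    rng = random.Random(seed)
    line = [rng.gauss(0, 1) for _ in range(len(points[0]))]
    proj = sorted(range(len(points)),
                  key=lambda i: sum(a * b for a, b in zip(points[i], line)))

    def dist2(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    # Best pair among neighbours in projected order (may be suboptimal).
    return min(zip(proj, proj[1:]), key=lambda p: dist2(*p))

pair = nearly_closest_pair([[0.0], [0.1], [5.0]])
```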
Different algorithm: k-means
K-means generates a "flat" set of clusters; k-means is non-hierarchical.
Given: k - the number of clusters desired.
Iterative algorithm.
Hard to get good bounds on the number of iterations to convergence; rarely a problem in practice.
Basic iteration
Reassignment: at the start of the iteration, we have k centroids. (Subproblem: where do we get them for the first iteration?) Each doc is assigned to the nearest centroid.
Centroid recomputation: all docs assigned to the same centroid are averaged to compute a new centroid; thus we have k new centroids.
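The two steps can be sketched as follows (a minimal k-means, seeded with the first k docs for determinism; picking k random docs is the usual answer to the initialization subproblem):

```python
def kmeans(docs, k, iters=10):
    """Minimal k-means sketch: reassignment, then centroid recomputation."""
    dim = len(docs[0])
    centroids = [list(docs[i]) for i in range(k)]
    assign = [0] * len(docs)
    for _ in range(iters):
        # Reassignment: each doc goes to the nearest centroid.
        for n, d in enumerate(docs):
            assign[n] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(d, centroids[c])))
        # Recomputation: average the docs assigned to each centroid.
        for c in range(k):
            members = [docs[n] for n in range(len(docs)) if assign[n] == c]
            if members:
                centroids[c] = [sum(m[i] for m in members) / len(members)
                                for i in range(dim)]
    return assign, centroids

docs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
assign, cents = kmeans(docs, 2)
```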
Iteration example
[Figure: docs and their current centroids]
Iteration example
[Figure: docs and the new centroids]
k-Means Clustering: Initialization
We could start with any k docs as centroids, but k random docs are better.
Repeat the basic iteration until a termination condition is satisfied.
Exercise: find a better approach for finding good starting points.
Termination conditions
Several possibilities, e.g.:
- A fixed number of iterations.
- Doc partition unchanged.
- Centroid positions don't change. Does this mean that the docs in a cluster are unchanged?
Convergence
Why should the k-means algorithm ever reach a fixed point? A state in which clusters don’t change.
k-means is a special case of a general procedure known as the EM algorithm. EM is known to converge. Number of iterations could be large.
Exercise
Consider running 2-means clustering on a corpus, each doc of which is from one of two different languages. What are the two clusters we would expect to see?
Is agglomerative clustering likely to produce different results?
Convergence of K-Means
Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid:
G_k = Σ_i (v_i − c_k)², summed over all v_i in cluster k
G = Σ_k G_k
Reassignment monotonically decreases G, since each vector is assigned to the closest centroid.
Recomputation monotonically decreases each G_k, since (with m_k the number of members in cluster k, working per component n) Σ_i (v_in − a)² reaches its minimum where Σ_i −2(v_in − a) = 0.
Convergence of K-Means
Σ_i −2(v_in − a) = 0
⇒ Σ_i v_in = Σ_i a = m_k a
⇒ a = (1/m_k) Σ_i v_in = c_kn
i.e., the minimizing a is exactly the n-th component of the centroid.
k not specified in advance
Say, for the results of a query.
Solve an optimization problem: penalize having lots of clusters.
- Application dependent, e.g., a compressed summary of a search results list.
Tradeoff between having more clusters (better focus within each cluster) and having too many clusters.
k not specified in advance
Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid.
Define the Total Benefit to be the sum of the individual doc Benefits.
Why is there always a clustering of Total Benefit n?
Penalize lots of clusters
For each cluster, we have a Cost C. Thus for a clustering with k clusters, the Total Cost is kC.
Define the Value of a clustering to be Total Benefit − Total Cost.
Find the clustering of highest Value, over all choices of k.
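These definitions translate directly to code. The sketch below (illustrative data) also shows why a Total Benefit of n is always achievable: make every doc its own cluster, so each doc has cosine 1 with its centroid.

```python
import math

def cos(u, v):
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def total_benefit(docs, clusters):
    """Sum over all docs of the cosine similarity to their cluster centroid."""
    benefit = 0.0
    for cl in clusters:
        c = [sum(docs[i][j] for i in cl) / len(cl) for j in range(len(docs[0]))]
        benefit += sum(cos(docs[i], c) for i in cl)
    return benefit

def value(docs, clusters, cost_per_cluster):
    """Value = Total Benefit - Total Cost (= Cost per cluster times k)."""
    return total_benefit(docs, clusters) - cost_per_cluster * len(clusters)

docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
singletons = [[0], [1], [2]]
tb = total_benefit(docs, singletons)                 # each doc: cosine 1 -> 3.0
val = value(docs, singletons, cost_per_cluster=0.5)  # 3 - 3 * 0.5 = 1.5
```

The per-cluster Cost is what keeps the trivial all-singletons clustering from always winning.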
Back to agglomerative clustering
In a run of agglomerative clustering, we can try all values of k = n, n−1, n−2, …, 1.
At each, we can measure our Value, then pick the best choice of k.
Exercise
Suppose a run of agglomerative clustering finds k=7 to have the highest value amongst all k. Have we found the highest-value clustering amongst all clusterings with k=7?
Clustering vs Classification
Clustering: unsupervised.
Input:
- clustering algorithm
- similarity measure
- number of clusters
No specific information for each document.

Classification: supervised.
- Each document is labeled with a class.
- Build a classifier that assigns documents to one of the classes.

Two types of partitioning: supervised vs unsupervised.
Clustering vs Classification
Consider clustering a large set of computer science documents what do you expect to see in the vector space?
[Figure: document clusters in the vector space, labeled NLP, Graphics, AI, Theory, Arch.]
Decision boundaries
Could we use these blobs to infer the subject of a new document?
[Figure: the same clusters with decision boundaries between NLP, Graphics, AI, Theory, Arch.]
Deciding what a new doc is about
Check which region the new doc falls into; can output "softer" decisions as well.
[Figure: a new doc falls into the AI region, so it is labeled AI]
Setup for Classification
Given "training" docs for each category: Theory, AI, NLP, etc.
Cast them into a decision space - generally a vector space with each doc viewed as a bag of words.
Build a classifier that will classify new docs:
- Essentially, partition the decision space.
- Given a new doc, figure out which partition it falls into.
Supervised vs. unsupervised learning
This setup is called supervised learning in the terminology of Machine Learning.
In the domain of text, various names: text classification, text categorization, document classification/categorization, "automatic" categorization, routing, filtering, …
In contrast, the earlier setting of clustering is called unsupervised learning:
- Presumes no availability of training samples.
- Clusters output may not be thematically unified.
Which is better?
Depends on your setting, on your application.
Can use them in combination:
- Analyze a corpus using clustering.
- Hand-tweak the clusters and label them.
- Use the tweaked clusters as training input for classification.
- Subsequent docs get classified.
Computationally, the methods are quite different.
Main issue: can you get training data?
Summary
Two types of clustering:
- Hierarchical, agglomerative clustering
- Flat, iterative clustering
How many clusters?
Key parameters:
- Representation of data points
- Similarity/distance measure