the cluster hypothesis: ranking document clusters
TRANSCRIPT
![Page 1: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/1.jpg)
The Cluster Hypothesis: Ranking Document Clusters
Oren KurlandFaculty of Industrial Engineering and Management
Technion
* Based on joint work with Fiana Raiber* This work has been supported by, and carried out, at the Technion-Microsoft Electronic Commerce Research Center
![Page 2: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/2.jpg)
Query = “oren kurland dblp”
Search #1 Search #2
Is search a solved problem?
![Page 3: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/3.jpg)
The ad hoc retrieval task
Rank documents in a corpus by their relevance to the information need expressed by a query
![Page 4: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/4.jpg)
Example: Vector space model(Salton ’68)
o query= “Technion”o �⃗�𝑞 =< 0, … , 0,1,0, … , 0 >
o document= “Faculty Technion Student Technion”
o 𝑑𝑑 =< 0, … , 0,1,0, … , 1,0, … , 2,0, … , 0 >
o 𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠(𝑑𝑑; 𝑞𝑞) ≜ cos(�⃗�𝑞, 𝑑𝑑)
o Term weighting scheme: TF.IDFo TF: the number of occurrences of a term in the documento IDF: The inverse of the document frequency of the term
![Page 5: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/5.jpg)
Web search engineso Use a variety of relevance signals
o The textual similarity between the page and the query (query-dependent)
o The textual similarity between the query and the anchor text of pages that point to the page (query-dependent)
o The PageRank score of the page (query-independent)o The clickthrough rate for the page (query-independent)o …
o Learning-to-rank (Liu ’09)
![Page 6: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/6.jpg)
The document-query similarityo Relevance is determined based on whether the
document content satisfies the information need expressed by the query
o The document-query similarity is among the most important features for ranking pages in Web search engines (Liu ’09)
Back to classical information retrieval?
![Page 7: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/7.jpg)
The cluster hypothesis
“Closely associated documents tend to be relevant to the same requests”
(Jardine&van Rijsbergen ’71,van Rijsbergen ’79)
Operational consequence:
Relevant documents should be more similar to each other than to non-relevant documents
![Page 8: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/8.jpg)
Cluster-based results interface
Clustering the pages Automatically labeling the clustersFinding the highest quality clusters
![Page 9: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/9.jpg)
Cluster-based document retrieval
Query
Initial list of documents
Set of clusters
Ranking of clusters
Ranking of documents
Document ranking method
Clustering method
Each cluster is replaced with its
documents
Cluster ranking method
![Page 10: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/10.jpg)
The optimal cluster problem(Kurland&Domshlak ’08)
p@5
Doc-query similarity
Query expansion
Oracle experiment
![Page 11: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/11.jpg)
The cluster ranking task
( ) ( ) ( ),( )
,p C Q
p C Qp Q
p C Q= =
o Estimate the probability that cluster C is relevant to query Q:
o Estimate p(C,Q) using Markov Random Fields
rank
![Page 12: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/12.jpg)
Markov Random Fields
( )( )( ),
ll L Gl
p C QZ
ψ∈=
∏
G
( )L G −
( )l lψ −Z −
l −
o Define a graph : o Nodes ‒ random variables representing Q and C’s documentso Edges ‒ dependencies between the variableso set of cliques in Go clique o potential defined over lo normalization factor
o A common instantiation of potential functions:o feature function defined over lo weight associated with ( ) ( )( )expl l ll f lψ λ=( )lf l −
lλ − ( )lf ldef
![Page 13: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/13.jpg)
ClustMRF
o A linear (in feature functions) cluster ranking function that depends on the graph G
o Next:o Determine the clique set L(G)o Associate feature functions with cliques
( ) ( )( )
, l ll L G
p C Q f lλ∈
= ∑rank
![Page 14: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/14.jpg)
The lQD clique
Q
d3d2d1
lQD lQD lQD
o Contains the query Qand a single document din cluster C
o Consider query-similarity values of C’s documents independently
( )1
| |log ( , ) Cgeo qsim QDf l sim Q d− =
( , )sim
def
(cf., Liu&Croft ’08)
. . is an inter-text similarity measurenumber of documents in C| |C −
![Page 15: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/15.jpg)
The lQC cliqueo Contains the query Q
and all C’s documents
o Induce information from relations between query-similarity values of C’s documents
Q
d3d2d1
lQC
( ) { }( )log ( , )QC d CA qsimf l A sim Q d− ∈=def
{ }A∈ min, max, stdv
![Page 16: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/16.jpg)
The lC cliqueo Contains only C’s
documents
o Induce information based on query-independent properties of C’s documents (e.g., PageRank score, ratio of stopwords to non stopwords)
Q
d3d2d1
lC
( ) ( ){ }( )logdP C CAf l A P d− ∈
=
def
{ }A∈ min, max, geometric mean
P is a query-independent document (quality) measure
![Page 17: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/17.jpg)
Empirical evaluation
Query
Initial list of 50
documents
Set of clusters
Ranking of clusters
Ranking of documents
Markov Random Field (MRF)
(Metzler&Croft ’05)
Nearest Neighbor clustering
(Griffiths et al. ’86)
Each cluster is replaced with its
documents
ClustMRF
![Page 18: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/18.jpg)
11
12.5
14
15.5
17
ClueWebB
MAP
11
12
13
14
15
GOV2
MAP
Comparison with the initial ranking■ Init: Markov Random Field (Metzler&Croft ’05)■ ClustMRF (Our algorithm)♦ Statistically significant differences with ClustMRF
♦♦
Clu
stM
RF
Clu
stM
RF
![Page 19: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/19.jpg)
Comparison with other cluster ranking methods
■ AMean: Arithmetic mean of query similarity values (Liu&Croft ’08)
■ GMean: Geometric mean of query-similarity values (Liu&Croft ’08 ,Seo&Croft ’10)
■ ClustRanker: Uses measures of document and cluster biases (Kurland ’08)
■ ClustMRF (Our algorithm)♦ Statistically significant differences with ClustMRF
11
14
17
ClueWebB
MAP
12
13
14
15
GOV2
MAP
♦♦ ♦ ♦ ♦
Clu
stM
RF
Clu
stM
RF
![Page 20: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/20.jpg)
Comparison with automatic query expansion
■ RM3 (Abdul-Jaleel et al. ’04)■ ClustMRF (Our algorithm)♦ Statistically significant differences with ClustMRF
12
13
14
GOV2
MAP
13
15
17
ClueWebB
MAP
Clu
stM
RF
Clu
stM
RF
♦♦
![Page 21: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/21.jpg)
Diversifying search results
![Page 22: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/22.jpg)
Diversifying search resultso MMR (Carbonell&Goldstein ’98) and xQuAD (Santos et al. ’10)
iteratively re-rank the initial listo In each iteration a document is scored by:
Relevance estimates:■ Init (MRF)■ ClustMRF♦ Statistically significant
differences with ClustMRF
ClueWebB
(1 )β β+ −Relevance to the query
Diversity with respect todocuments already selected
30
36
42
48
MMR xQuAD
α-NDCG@20
♦♦
![Page 23: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/23.jpg)
TREC (Text REtrieval Conference)o Annual “competition” of search approacheso In 2013, the Technion won the first place in the Web
retrieval track according to most evaluation criteriao ClustMRF was used to re-rank the results of a highly
effective learning-to-rank approach
![Page 24: The Cluster Hypothesis: Ranking Document Clusters](https://reader033.vdocuments.mx/reader033/viewer/2022051323/627c35136efce0125935f8b6/html5/thumbnails/24.jpg)
Summary
o We presented a novel cluster ranking approach thato outperforms state-of-the-art document-based retrieval
methodso outperforms state-of-the-art cluster-based retrieval
methodso can be used to improve the performance of results-
diversification methods