the cluster hypothesis: ranking document clusters

The Cluster Hypothesis: Ranking Document Clusters

Oren KurlandFaculty of Industrial Engineering and Management

Technion

* Based on joint work with Fiana Raiber* This work has been supported by, and carried out, at the Technion-Microsoft Electronic Commerce Research Center

Query = “oren kurland dblp”

Search #1 Search #2

Is search a solved problem?

The ad hoc retrieval task

Rank documents in a corpus by their relevance to the information need expressed by a query

Example: Vector space model(Salton ’68)

o query= “Technion”o �⃗�𝑞 =< 0, … , 0,1,0, … , 0 >

o document= “Faculty Technion Student Technion”

o 𝑑𝑑 =< 0, … , 0,1,0, … , 1,0, … , 2,0, … , 0 >

o 𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠(𝑑𝑑; 𝑞𝑞) ≜ cos(�⃗�𝑞, 𝑑𝑑)

o Term weighting scheme: TF.IDFo TF: the number of occurrences of a term in the documento IDF: The inverse of the document frequency of the term

Web search engineso Use a variety of relevance signals

o The textual similarity between the page and the query (query-dependent)

o The textual similarity between the query and the anchor text of pages that point to the page (query-dependent)

o The PageRank score of the page (query-independent)o The clickthrough rate for the page (query-independent)o …

o Learning-to-rank (Liu ’09)

The document-query similarityo Relevance is determined based on whether the

document content satisfies the information need expressed by the query

o The document-query similarity is among the most important features for ranking pages in Web search engines (Liu ’09)

Back to classical information retrieval?

The cluster hypothesis

“Closely associated documents tend to be relevant to the same requests”

(Jardine&van Rijsbergen ’71,van Rijsbergen ’79)

Operational consequence:

Relevant documents should be more similar to each other than to non-relevant documents

Cluster-based results interface

Clustering the pages Automatically labeling the clustersFinding the highest quality clusters

Cluster-based document retrieval

Query

Initial list of documents

Set of clusters

Ranking of clusters

Ranking of documents

Document ranking method

Clustering method

Each cluster is replaced with its

documents

Cluster ranking method

The optimal cluster problem(Kurland&Domshlak ’08)

p@5

Doc-query similarity

Query expansion

Oracle experiment

The cluster ranking task

( ) ( ) ( ),( )

,p C Q

p C Qp Q

p C Q= =

o Estimate the probability that cluster C is relevant to query Q:

o Estimate p(C,Q) using Markov Random Fields

rank

Markov Random Fields

( )( )( ),

ll L Gl

p C QZ

ψ∈=

∏

G

( )L G −

( )l lψ −Z −

l −

o Define a graph : o Nodes ‒ random variables representing Q and C’s documentso Edges ‒ dependencies between the variableso set of cliques in Go clique o potential defined over lo normalization factor

o A common instantiation of potential functions:o feature function defined over lo weight associated with ( ) ( )( )expl l ll f lψ λ=( )lf l −

lλ − ( )lf ldef

ClustMRF

o A linear (in feature functions) cluster ranking function that depends on the graph G

o Next:o Determine the clique set L(G)o Associate feature functions with cliques

( ) ( )( )

, l ll L G

p C Q f lλ∈

= ∑rank

The lQD clique

Q

d3d2d1

lQD lQD lQD

o Contains the query Qand a single document din cluster C

o Consider query-similarity values of C’s documents independently

( )1

| |log ( , ) Cgeo qsim QDf l sim Q d− =

( , )sim

def

(cf., Liu&Croft ’08)

. . is an inter-text similarity measurenumber of documents in C| |C −

The lQC cliqueo Contains the query Q

and all C’s documents

o Induce information from relations between query-similarity values of C’s documents

Q

d3d2d1

lQC

( ) { }( )log ( , )QC d CA qsimf l A sim Q d− ∈=def

{ }A∈ min, max, stdv

The lC cliqueo Contains only C’s

documents

o Induce information based on query-independent properties of C’s documents (e.g., PageRank score, ratio of stopwords to non stopwords)

Q

d3d2d1

lC

( ) ( ){ }( )logdP C CAf l A P d− ∈

=

def

{ }A∈ min, max, geometric mean

P is a query-independent document (quality) measure

Empirical evaluation

Query

Initial list of 50

documents

Set of clusters

Ranking of clusters

Ranking of documents

Markov Random Field (MRF)

(Metzler&Croft ’05)

Nearest Neighbor clustering

(Griffiths et al. ’86)

Each cluster is replaced with its

documents

ClustMRF

11

12.5

14

15.5

17

ClueWebB

MAP

11

12

13

14

15

GOV2

MAP

Comparison with the initial ranking■ Init: Markov Random Field (Metzler&Croft ’05)■ ClustMRF (Our algorithm)♦ Statistically significant differences with ClustMRF

♦♦

Clu

stM

RF

Clu

stM

RF

Comparison with other cluster ranking methods

■ AMean: Arithmetic mean of query similarity values (Liu&Croft ’08)

■ GMean: Geometric mean of query-similarity values (Liu&Croft ’08 ,Seo&Croft ’10)

■ ClustRanker: Uses measures of document and cluster biases (Kurland ’08)

■ ClustMRF (Our algorithm)♦ Statistically significant differences with ClustMRF

11

14

17

ClueWebB

MAP

12

13

14

15

GOV2

MAP

♦♦ ♦ ♦ ♦

Clu

stM

RF

Clu

stM

RF

Comparison with automatic query expansion

■ RM3 (Abdul-Jaleel et al. ’04)■ ClustMRF (Our algorithm)♦ Statistically significant differences with ClustMRF

12

13

14

GOV2

MAP

13

15

17

ClueWebB

MAP

Clu

stM

RF

Clu

stM

RF

♦♦

Diversifying search results

Diversifying search resultso MMR (Carbonell&Goldstein ’98) and xQuAD (Santos et al. ’10)

iteratively re-rank the initial listo In each iteration a document is scored by:

Relevance estimates:■ Init (MRF)■ ClustMRF♦ Statistically significant

differences with ClustMRF

ClueWebB

(1 )β β+ −Relevance to the query

Diversity with respect todocuments already selected

30

36

42

48

MMR xQuAD

α-NDCG@20

♦♦

TREC (Text REtrieval Conference)o Annual “competition” of search approacheso In 2013, the Technion won the first place in the Web

retrieval track according to most evaluation criteriao ClustMRF was used to re-rank the results of a highly

effective learning-to-rank approach

Summary

o We presented a novel cluster ranking approach thato outperforms state-of-the-art document-based retrieval

methodso outperforms state-of-the-art cluster-based retrieval

methodso can be used to improve the performance of results-

diversification methods

the cluster hypothesis: ranking document clusters

Documents