Topical Query Decomposition
Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis
Yahoo! Research, Barcelona, Spain
KDD 08
Abstract
Given a query and a document retrieval system, the goal is to produce a small set of queries whose union of resulting documents corresponds approximately to that of the original query.
Set cover problem: greedy algorithm
Clustering problem: two-phase algorithm based on hierarchical agglomerative clustering (dynamic programming)
Introduction
A query log L is a list of pairs <q, D(q)>
q: a query; D(q): its result, i.e., the set of documents that answer query q
Q(q): the maximal set of queries pi such that, for each pi, the set D(pi) has at least one document in common with the documents returned by q
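As a minimal sketch of these definitions, assume the query log is held in memory as a dictionary that maps each query to its set of document ids. The function name `candidate_queries`, the toy log, and the choice to leave q itself out of Q(q) are assumptions made for illustration, not details taken from the slides.

```python
def candidate_queries(log, q):
    """Return Q(q): every other query whose results share a document with D(q)."""
    target = log[q]
    # Assumption: q itself is excluded from its own candidate set.
    return {p for p, docs in log.items() if p != q and docs & target}


# Toy query log L: query -> D(query)
log = {
    "jaguar": {"d1", "d2", "d3", "d4"},
    "jaguar car": {"d1", "d2", "d9"},
    "jaguar animal": {"d3", "d4"},
    "python": {"d7"},
}

candidates = candidate_queries(log, "jaguar")
print(sorted(candidates))
```

The query "python" is left out because its only document, d7, does not appear in D("jaguar").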
The goal is to compute a cover: select a subcollection C ⊆ Q(q7) such that it covers almost all of D(q7).
Problem Statement – 1/3
Red-blue set cover problem (for a query q)
U = {b1, …, bn, r1, …, rm}
B = {b1, …, bn}: blue points (the document set D(q))
R = {r1, …, rm}: red points (documents returned by candidate queries but not in D(q))
S = {S1, …, Sk}: the collection of sets obtained from the query log L
Each Si ⊆ U, with
Si^B = Si ∩ B: blue points in Si
Si^R = Si ∩ R: red points in Si
Goal: find a subcollection C ⊆ S that covers many blue points of U without covering too many red points.
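The blue and red parts of a candidate set can be sketched as plain set operations. This assumes U = B ∪ R, so every point of a set that is not blue counts as red; the helper name `split_red_blue` and the document ids are hypothetical.

```python
def split_red_blue(candidate_set, blue):
    """Split a candidate set into its blue part (inside B) and red part (outside B)."""
    return candidate_set & blue, candidate_set - blue


B = {"d1", "d2", "d3", "d4"}   # blue points
S1 = {"d1", "d2", "d9"}        # points of one candidate set
blue_part, red_part = split_red_blue(S1, B)
print(sorted(blue_part), sorted(red_part))
```

A set is attractive for the cover when its blue part is large and its red part is small.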
Problem Statement – 2/3
For each query q, the candidate queries Q(q)
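For each set Si with blue and red points, its weight is its scatter sc(Si) (coherence is the opposite of scatter):
$sc(S_i) = \min_{u \in S_i} \sum_{v \in S_i} d(u, v)^2$
Weighted size of a set, summed over its blue points:
$|S_i|_w = \sum_{b \in S_i^B} w(b)$
where w(b) is a base-2 logarithmic function of clicks(b, q).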
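A minimal sketch of the scatter computation, assuming scatter is the minimum, over a center point u of the set, of the summed squared distances from u to every point of the set. The Euclidean distance and the toy 2-D points are stand-ins chosen for illustration; the slides do not say which document distance is used.

```python
def scatter(points, dist):
    """Minimum over centers u of the sum of squared distances from u to all points."""
    return min(sum(dist(u, v) ** 2 for v in points) for u in points)


def euclidean(a, b):
    # Hypothetical distance; any metric over documents could be plugged in.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


tight = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
loose = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
print(scatter(tight, euclidean), scatter(loose, euclidean))
```

The tightly grouped set has the lower scatter, which corresponds to higher coherence.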
Problem Statement – 3/3
Our goal is to find a subcollection C ⊆ S that covers almost all the blue points of U and has large coherence.
More precisely, we want C to satisfy the following properties:
Cover-blue
Not-cover-red
Small-overlap
Coherence
Greedy Algorithm – 1/2
At the i-th iteration, select the set S that minimizes the score s(S, VB, VR).
The parameters C, R, and O weight the relative importance of the three terms of the score.
VB: the blue points already covered by the sets selected in previous iterations
VR: the red points already covered by the sets selected in previous iterations
Reference: D. Peleg. Approximation algorithms for the Label-CoverMAX and Red-Blue Set Cover problems. Journal of Discrete Algorithms, 2007.
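The transcript does not reproduce the score function s(S, VB, VR), so the sketch below uses a hypothetical stand-in: a weighted sum of scatter, newly covered red points, and overlap with already covered blue points, divided by the number of newly covered blue points. The names `greedy_cover`, `wC`, `wR`, `wO`, and `target` are assumptions for illustration; only the overall greedy loop follows the slide.

```python
def greedy_cover(sets, blue, scatter, wC=1.0, wR=1.0, wO=1.0, target=1.0):
    """Repeatedly pick the set with the lowest (hypothetical) score.

    sets: dict name -> set of points; blue: set of blue points;
    scatter: dict name -> scatter cost of that set.
    """
    covered_blue, covered_red, chosen = set(), set(), []
    remaining = dict(sets)
    while remaining and len(covered_blue) < target * len(blue):
        best, best_score = None, float("inf")
        for name, s in remaining.items():
            new_blue = (s & blue) - covered_blue
            if not new_blue:
                continue  # this set adds no blue coverage
            new_red = (s - blue) - covered_red
            overlap = s & covered_blue
            cost = wC * scatter[name] + wR * len(new_red) + wO * len(overlap)
            score = cost / len(new_blue)
            if score < best_score:
                best, best_score = name, score
        if best is None:
            break  # no remaining set covers a new blue point
        s = remaining.pop(best)
        chosen.append(best)
        covered_blue |= s & blue
        covered_red |= s - blue
    return chosen, covered_blue, covered_red


blue = {1, 2, 3, 4}
sets = {"a": {1, 2}, "b": {3, 4, 9}, "c": {1, 2, 3, 4, 7, 8, 9}}
scatter_cost = {"a": 0.5, "b": 0.5, "c": 5.0}
chosen, covered_blue, covered_red = greedy_cover(sets, blue, scatter_cost)
print(chosen, sorted(covered_red))
```

In this toy run the two small, coherent sets are preferred over the single large set that covers everything but drags in three red points.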
Greedy Algorithm – 2/2
Integer Programming
S1 + S2 + … + Sl <= 10
Si <= 1
Clustering-Based Method
Two-phase approach
First phase: all points in set B are clustered using a hierarchical agglomerative clustering algorithm (CLUTO toolkit).
Second phase: match the clusters of the hierarchy produced by the agglomerative algorithm with the sets of S.
The main idea is to match the sets of S to clusters of the hierarchy. Every node T of the hierarchy corresponds to a cluster.
Let TB be the set of points of B in the cluster of node T.
Clustering-Based Method
Dendrogram
Clustering-Based Method - Dynamic Programming - 1/2
Complete coverage: for each set Si ∈ S and each node T of the hierarchy, compute a matching score m(T, Si).
m*(T): the score of the best matching set in S for node T.
The dynamic program computes the optimal cost of covering the points of TB with sets in S.
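The recurrence itself is not in the transcript. The sketch below assumes a natural form for it: the cost of a node is the smaller of its best direct match m*(T), treated here as a cost where lower is better, and the summed optimal costs of its children. The `Node` class, the function `optimal_cover`, and the numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    best_match: float                      # m*(T), assumed precomputed
    children: list = field(default_factory=list)


def optimal_cover(node):
    """Return (cost, node names used) for covering all points under `node`."""
    direct = (node.best_match, [node.name])
    if not node.children:
        return direct
    split_cost, split_nodes = 0.0, []
    for child in node.children:
        cost, used = optimal_cover(child)
        split_cost += cost
        split_nodes += used
    # Keep the node's own best match unless covering the children is cheaper.
    return direct if direct[0] <= split_cost else (split_cost, split_nodes)


leaf_a = Node("A", 1.0)
leaf_b = Node("B", 2.0)
leaf_c = Node("C", 1.5)
inner = Node("AB", 5.0, [leaf_a, leaf_b])
root = Node("ABC", 4.0, [inner, leaf_c])
cost, used = optimal_cover(root)
print(cost, used)
```

Here the subtree AB is cheaper to cover through its two leaves (cost 3.0) than directly (5.0), but the root's own match (4.0) still beats the best split (4.5).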
Clustering-Based Method - Dynamic Programming - 2/2
Partial coverage:
The parameter U weights the relative importance of the two terms: the scatter cost of the selected sets and the number of uncovered points.
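One way to read the two-term trade-off is as a third option at each node of the hierarchy: leave its points uncovered and pay a penalty proportional to their number. The sketch below assumes that reading. The tuple layout `(m_star, n_points, children)` and the names `partial_cover` and `u_weight` are hypothetical.

```python
def partial_cover(node, u_weight):
    """node = (m_star, n_points, children). Returns (total cost, uncovered points)."""
    m_star, n_points, children = node
    options = [
        (m_star, 0),                        # cover the node with its best matching set
        (u_weight * n_points, n_points),    # leave all its points uncovered
    ]
    if children:
        results = [partial_cover(child, u_weight) for child in children]
        options.append((sum(r[0] for r in results), sum(r[1] for r in results)))
    return min(options)  # lowest cost; ties broken by fewer uncovered points


good = (1.0, 10, [])      # cheap to cover, many points
noisy = (8.0, 2, [])      # expensive to cover, few points
root = (12.0, 12, [good, noisy])
print(partial_cover(root, 1.0), partial_cover(root, 10.0))
```

With a small penalty the two noisy points are dropped; with a large penalty everything is covered.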
Application
Query log L: 2.9 million distinct queries
The majority of users look only at the first page of results, while few users request more result pages.
D(q): the result documents for q on the result pages that users asking for q in the query log navigated to
24 million distinct documents seen by the users
Application - Candidate queries for the cover
For each query q, the candidate queries Qk(q)
Application - Results
A set of 100 queries was randomly picked from the top 10,000 queries submitted by users.
Cost of k queries
The number of documents included outside the set D(q)
Average number of queries covering each element
Coverage after the top k candidates have been picked
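Three of these measures can be sketched directly from a list of selected result sets. The function name `cover_metrics` and the toy data are hypothetical, and the average is taken over the covered documents of D(q), which is an assumption about how the measure is normalized.

```python
def cover_metrics(chosen_sets, blue):
    """chosen_sets: result sets in pick order; blue: the documents of D(q).

    Returns (documents outside D(q), average queries per covered document,
    coverage of D(q) after each pick).
    """
    covered = set()
    coverage_by_k = []
    for s in chosen_sets:
        covered |= s
        coverage_by_k.append(len(covered & blue) / len(blue))
    outside = len(covered - blue)
    covered_blue = covered & blue
    hits = sum(len(s & covered_blue) for s in chosen_sets)
    avg_overlap = hits / len(covered_blue) if covered_blue else 0.0
    return outside, avg_overlap, coverage_by_k


blue = {1, 2, 3, 4}
chosen = [{1, 2, 9}, {2, 3}]
outside, avg_overlap, coverage_by_k = cover_metrics(chosen, blue)
print(outside, round(avg_overlap, 3), coverage_by_k)
```

In this toy case one outside document is included, document 2 is covered twice, and coverage grows from 0.5 to 0.75 as the second query is added.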
Conclusions
A novel problem: topical query decomposition
Elegant solutions: red-blue metric set cover; clustering with predefined clusters (hierarchical agglomerative clustering)
The set-cover formulation provides solutions of better quality
Code and data for reproducing the results shown in Table 3 are available at http://www.yr-bcn.es/querydecomp/ .