Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation
Kaushik Chakrabarti, Venkatesh Ganti, Jiawei Han, Dong Xin
Presented by: Vaidergorn Eitan
Outline
• Introduction
• System Overview
• Scoring Functions
• SQL implementation
• Early Termination Approach
• Experiments
• Conclusions
Introduction
• In more and more document collections, the documents relate to objects.
• Example: a laptop reviews site.
Introduction: OF (Object Finder) Queries
• Example OF query over the laptop reviews: I need the best “lightweight” and “business use” laptop.
Introduction
• The goal: get the top K target objects.
• Exploit the relationships between documents and objects.
• Exploit the fact that only K results are needed.
Introduction
• Search Objects (SOs): the documents.
• Target Objects (TOs): the objects returned by the query.
System Overview
• FTS (Full Text Search):
– Input: keyword(s).
– Output: ranked lists of documents.
Review ID   DocScore
2           1.9
3           1.0
1           0.8
System Overview
• FTS (Full Text Search):
– Most relational DBMSs now support FTS functionality.
System Overview
• DBMS:
– T: the target objects table.
– R: the relationships table, relating documents to target objects.
– T is used only for the final lookup of the TO values.
Scoring Functions
• The OF evaluation system returns the top K target objects, i.e. those with the best scores according to the scoring function.
Scoring Functions
• W = {w1, w2, …, wN}: the keywords in the OF query.
• Li: the ranked list of <document id, DocScore> pairs for keyword wi, sorted by DocScore.
• Dt: the list of documents related to the target object t.
Scoring Functions
• Score matrix Mt, for each t in TOs: one row per document d ∈ Dt, one column per keyword wi ∈ W.
      w1    w2
d1    1.1   0
d3    0     1.0
d6    2     0.8
• Score(t): the relevance score for the TO t.
• Compute row scores with Fcomb.
• Compute column scores with Fagg.
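Scoring Functions
• Row-marginal class: apply Fcomb to each row of Mt, then apply Fagg to the resulting row scores.
• Column-marginal class: apply Fagg to each column of Mt, then apply Fcomb to the resulting column scores.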
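As an illustration of the row-marginal and column-marginal classes, the sketch below scores one TO from the example score matrix (d1, d3, d6 against w1, w2). The choice Fcomb = MIN and Fagg = SUM is an assumption made only so that the two classes give different results; the function names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# Score matrix Mt for one target object: rows are documents in Dt,
# columns are the keywords w1, w2 (values from the score matrix example).
M_t: Dict[str, List[float]] = {
    "d1": [1.1, 0.0],
    "d3": [0.0, 1.0],
    "d6": [2.0, 0.8],
}

Func = Callable[[Sequence[float]], float]


def row_marginal(matrix: Dict[str, List[float]], f_comb: Func, f_agg: Func) -> float:
    """Combine each row with f_comb, then aggregate the row scores with f_agg."""
    return f_agg([f_comb(row) for row in matrix.values()])


def column_marginal(matrix: Dict[str, List[float]], f_comb: Func, f_agg: Func) -> float:
    """Aggregate each column with f_agg, then combine the column scores with f_comb."""
    columns = list(zip(*matrix.values()))
    return f_comb([f_agg(col) for col in columns])


# Assumed choices, for illustration only: Fcomb = MIN, Fagg = SUM.
row_score = row_marginal(M_t, min, sum)     # 0 + 0 + 0.8
col_score = column_marginal(M_t, min, sum)  # min(3.1, 1.8)
print(row_score, col_score)
```

With SUM used for both functions, the two classes produce the same score for this matrix.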
Scoring Functions
1. Fcomb is monotonic: Fcomb(x1,…,xn) ≤ Fcomb(y1,…,yn) when xi ≤ yi for all i.
2. Fagg is subset monotonic: Fagg(S) ≤ Fagg(S’) if S ⊆ S’.
3. Fagg distributes over append: Fagg(R1 append R2) = Fagg(Fagg(R1), Fagg(R2)), where append is the ordered concatenation of tuples.
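The third property can be checked numerically. The sketch below tests it for SUM, MAX and the arithmetic mean on two hypothetical score lists (the lists and the helper name are assumptions, not taken from the slides); the mean is included as an aggregate that does not satisfy the property.

```python
from statistics import mean


def distributes_over_append(f_agg, r1, r2) -> bool:
    """Check Fagg(R1 append R2) == Fagg(Fagg(R1), Fagg(R2)) for one pair of lists."""
    whole = f_agg(r1 + r2)
    parts = f_agg([f_agg(r1), f_agg(r2)])
    return abs(whole - parts) < 1e-9


# Hypothetical DocScore lists.
r1 = [1.0, 0.5, 0.25]
r2 = [2.0]

sum_ok = distributes_over_append(sum, r1, r2)
max_ok = distributes_over_append(max, r1, r2)
mean_ok = distributes_over_append(mean, r1, r2)
print(sum_ok, max_ok, mean_ok)
```

A single passing pair does not prove the property in general; it only shows how the definition reads for concrete lists.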
SQL Implementation
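One way such a query could look is sketched below with Python's built-in sqlite3 module. Everything concrete here is an assumption: the column names, the tables L1 and L2 holding the ranked lists, the sample rows, and the use of SUM for both Fagg and Fcomb. It is meant only to show the shape of a join-then-aggregate evaluation over R and T, not the exact query used by the authors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T (to_id INTEGER PRIMARY KEY, name TEXT);  -- target objects
CREATE TABLE R (doc_id INTEGER, to_id INTEGER);         -- document-TO relationships
CREATE TABLE L1 (doc_id INTEGER, score REAL);           -- ranked list for w1
CREATE TABLE L2 (doc_id INTEGER, score REAL);           -- ranked list for w2
INSERT INTO T VALUES (1, 'laptop A'), (2, 'laptop B'), (3, 'laptop C');
INSERT INTO R VALUES (1, 1), (2, 1), (2, 2), (3, 2), (3, 3);
INSERT INTO L1 VALUES (1, 1.0), (2, 0.5), (3, 0.25);
INSERT INTO L2 VALUES (2, 1.0), (3, 0.5);
""")


def top_k(connection, k):
    """Join all ranked lists with R, aggregate per TO, and keep the K best."""
    query = """
    SELECT T.name, SUM(L.score) AS total_score
    FROM (SELECT doc_id, score FROM L1
          UNION ALL
          SELECT doc_id, score FROM L2) AS L
    JOIN R ON R.doc_id = L.doc_id
    JOIN T ON T.to_id = R.to_id
    GROUP BY T.to_id, T.name
    ORDER BY total_score DESC
    LIMIT ?
    """
    return connection.execute(query, (k,)).fetchall()


result = top_k(conn, 2)
print(result)
```

Note that this style of query scores every related TO before sorting, which is the cost the early termination approach tries to avoid.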
Early Termination Approach
• Intuition: top-scoring documents typically contribute the most to the scores of high-scoring TOs.
• The TOs related to these top-scoring documents are the most likely candidate matches.
• We progressively retrieve documents in decreasing order of their scores, and maintain upper-bound and lower-bound scores for the related TOs.
Early Termination Approach
1. Generate-only approach:
• Relies on the bounds alone.
• Stops once the best K TOs have been identified.
2. Generate-Prune approach:
• Candidate generation phase.
• The stopping condition is more relaxed.
• Pruning phase.
Candidate Generation
• Ci: the chunk size; we retrieve documents from Li in chunks of Ci.
• Prefix(Li): the documents retrieved so far from the ranked list Li.
• SeenTOs: the current aggregation scores of the TOs seen so far.
– AggResulti: for each Li, a table containing numSeen and aggScore.
– Upper-bound and lower-bound scores.
Candidate Generation
• Example settings: Fagg = Fcomb = Sum, Ci = 1, K = 3, B = 2.
Candidate Generation
• The algorithm has 5 steps:
Candidate Generation
• Step 1 - Retrieve documents:
– We retrieve the next chunk of Ci documents from each Li.
– Retrieving in chunks reduces the number of join queries (with R).
Candidate Generation
• Step 2 - Update SeenTOs:
[Figure: Prefix(L1), Prefix(L2), AggResult(1), AggResult(2)]
Candidate Generation
[Figure: Prefix(L1), Prefix(L2), AggResult(1), AggResult(2)]
SeenTOs:
TO   numSeen[1]  aggScore[1]  numSeen[2]  aggScore[2]
T1   1           1.0          1           1.0
T2   1           1.0          0           0
T3   1           0.6          1           0.5
T4   1           0.6          1           0.5
Candidate Generation
[Figure: Prefix(L1), Prefix(L2), AggResult(1), AggResult(2)]
SeenTOs:
TO   numSeen[1]  aggScore[1]  numSeen[2]  aggScore[2]
T1   1           1.0          1           1.0
T2   1           1.0          0           0
T3   2           0.8          1           0.8
T4   1           0.6          1           0.5
T5   1           0.2          1           0.3
Candidate Generation
• Step 3 - Compute bounds:
– t.lb = Fcomb(t.aggScore[1], …, t.aggScore[N]).
Candidate Generation
• B: the maximum number of documents in any ranked list Li that can contribute to the score of any target object t.
• xi: the DocScore of the last document retrieved from Li.
• t.ub[i] = Fagg(t.aggScore[i], Fagg(xi, xi, …, xi)), where xi is repeated (B - t.numSeen[i]) times.
• t.ub = Fcomb(t.ub[1], …, t.ub[N]).
• Example (Fagg = Sum, B = 2, x1 = 1.0):
– t1.ub[1] = 1.0 + 1.0*(2-1) = 2
– t2.ub[1] = 1.0 + 1.0*(2-1) = 2
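The bound computation can be sketched as follows. This is a minimal sketch assuming Fagg = Fcomb = SUM by default; the function names are hypothetical. The first call reproduces t1.ub[1] = 2 from the example, and the later calls use x1 = 0.2, x2 = 0.3 and B = 2 together with the aggScore values listed for T1 and T2 in the SeenTOs example.

```python
def upper_bound_for_list(agg_score, num_seen, x_i, B, f_agg=sum):
    """t.ub[i]: the seen aggregate plus (B - numSeen) unseen documents scored x_i."""
    remaining = B - num_seen
    if remaining <= 0:
        return agg_score  # no further document from this list can contribute
    return f_agg([agg_score, f_agg([x_i] * remaining)])


def bounds(agg_scores, num_seen, xs, B, f_comb=sum, f_agg=sum):
    """Return (t.lb, t.ub) for one target object."""
    lb = f_comb(agg_scores)
    ub = f_comb([
        upper_bound_for_list(a, n, x, B, f_agg)
        for a, n, x in zip(agg_scores, num_seen, xs)
    ])
    return lb, ub


# t1.ub[1] with aggScore = 1.0, numSeen = 1, x1 = 1.0, B = 2.
t1_ub1 = upper_bound_for_list(1.0, 1, 1.0, 2)

# Full bounds with x1 = 0.2, x2 = 0.3, B = 2.
t1_lb, t1_ub = bounds([1.0, 1.0], [1, 1], [0.2, 0.3], 2)
t2_lb, t2_ub = bounds([1.0, 0.0], [1, 0], [0.2, 0.3], 2)
print(t1_ub1, (t1_lb, t1_ub), (t2_lb, t2_ub))
```

Under these assumptions the upper bounds come out as 2.5 for t1 and 1.8 for t2, which agrees with the UB values listed on the pruning slide.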
Candidate Generation
• Step 4 - Stopping condition:
– We can stop when there are at least K objects in SeenTOs whose lower-bound scores are no lower than the upper-bound score of any unseen TO.
– UnseenUB = Fcomb(Fagg(x1, x1, …), …, Fagg(xN, xN, …)), where each xi is repeated B times.
– So the stopping criterion is LBK ≥ UnseenUB, where LBK is the Kth-highest lower bound.
Candidate Generation
• In the example: x1 = 0.2; x2 = 0.3.
Candidate Generation
• Stopping criterion: LBK ≥ UnseenUB
• UnseenUB = ((0.2+0.2)+(0.3+0.3)) = 1
• LB3 = 1.1, so LB3 ≥ UnseenUB and candidate generation can stop.
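The stopping test can be sketched as below, assuming Fagg = Fcomb = SUM and B = 2. The lower bounds are illustrative: they are the sums of the two aggScore values listed per TO in the SeenTOs example, and the function names are hypothetical.

```python
def unseen_ub(xs, B, f_comb=sum, f_agg=sum):
    """Upper bound on the score of any TO that has not been seen yet."""
    return f_comb([f_agg([x] * B) for x in xs])


def kth_highest(values, k):
    """The Kth-highest value in a collection (1-based)."""
    return sorted(values, reverse=True)[k - 1]


def can_stop(lower_bounds, xs, B, k):
    """Stopping criterion LB_K >= UnseenUB; needs at least K seen TOs."""
    if len(lower_bounds) < k:
        return False
    return kth_highest(lower_bounds, k) >= unseen_ub(xs, B)


# Illustrative lower bounds for t1..t5.
lbs = [2.0, 1.0, 1.6, 1.1, 0.5]

threshold = unseen_ub([0.2, 0.3], 2)
stop_now = can_stop(lbs, [0.2, 0.3], 2, 3)
stop_earlier = can_stop(lbs, [0.6, 0.5], 2, 3)  # hypothetical higher x values
print(threshold, stop_now, stop_earlier)
```

With x1 = 0.2 and x2 = 0.3 the threshold is 1 and the third-highest lower bound is 1.1, so the check succeeds; with the higher hypothetical x values it does not.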
Candidate Generation
• Step 5 - Identify candidates:
– Top(List, X): the top X elements in the list.
– The set of candidates is defined by Top(UB, h), where h is the least value that satisfies LBK ≥ UBh+1.
– Top(LB, K) ⊆ Top(UB, h).
Candidate Generation
• Criterion: LBK ≥ UBh+1
• LB3 = 1.1
• LB3 ≥ UB4+1 => h = 4
• Top(LB,3) = {t1,t3,t4} ⊆ Top(UB,4) = {t1,t2,t3,t4}
• Top(UB,h) = {t1,t2,t3,t4}
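Candidate identification can be sketched as follows. The upper bounds for t1 to t4 are the ones listed on the pruning slide; the upper bound for t5 (1.0) and all lower bounds are assumptions derived from the SeenTOs example with SUM, B = 2, x1 = 0.2 and x2 = 0.3. Starting the search at h = K is also an assumption of this sketch, made so that at least K candidates are always returned.

```python
def identify_candidates(lb, ub, k):
    """Return Top(UB, h) for the least h >= K such that LB_K >= UB_{h+1}."""
    lb_k = sorted(lb.values(), reverse=True)[k - 1]
    by_ub = sorted(ub, key=ub.get, reverse=True)  # TO ids, highest UB first
    h = k
    # by_ub[h] is the (h+1)-th TO by upper bound (0-based indexing).
    while h < len(by_ub) and ub[by_ub[h]] > lb_k:
        h += 1
    return by_ub[:h]


lb = {"t1": 2.0, "t2": 1.0, "t3": 1.6, "t4": 1.1, "t5": 0.5}
ub = {"t1": 2.5, "t2": 1.8, "t3": 1.6, "t4": 1.6, "t5": 1.0}

candidates = identify_candidates(lb, ub, 3)
top_lb = sorted(lb, key=lb.get, reverse=True)[:3]
print(candidates, top_lb)
```

Here h comes out as 4, and every TO in Top(LB, 3) is among the candidates.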
Pruning to the Final Top-K
• Candidates with their upper bounds: UB = {t1(2.5), t2(1.8), t3(1.6), t4(1.6)}, K = 3
• DocScores of the documents related to t1:
      w1    w2
d1    1.0   0.1
d2    0.1   1.0
• Exact score of t1 = ((1+0.1)+(0.1+1)) = 2.2
• Exact scores: t1 = 2.2, t2 = 1.6, t3 = 1.6, t4 = 1.6
• UB = {t1(2.2), t2(1.6), t3(1.6), t4(1.6)}
• The final top-K results are {t1, t2, t3}
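A naive version of the pruning step is sketched below: compute the exact score of every candidate and keep the K best. This is a simplification under stated assumptions (Fagg = Fcomb = SUM, hypothetical function names); the exact scores of t2, t3 and t4 are taken directly from the slide, and only t1 is recomputed from its score matrix.

```python
def exact_score(matrix, f_comb=sum, f_agg=sum):
    """Exact score from the full score matrix (rows: documents, columns: keywords)."""
    columns = zip(*matrix)
    return f_comb([f_agg(col) for col in columns])


def prune(exact_scores, k):
    """Keep the K candidates with the highest exact scores."""
    return sorted(exact_scores, key=exact_scores.get, reverse=True)[:k]


# Score matrix of t1: d1 = (1.0, 0.1), d2 = (0.1, 1.0).
t1_score = exact_score([[1.0, 0.1], [0.1, 1.0]])

exact = {"t1": t1_score, "t2": 1.6, "t3": 1.6, "t4": 1.6}
final_top_k = prune(exact, 3)
print(t1_score, final_top_k)
```

The three candidates tied at 1.6 are ordered by their position in the dictionary, because Python's sort is stable; any other tie-breaking rule would be equally valid for this data.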
Exact Top-K with Approximate Scores
• Crossing objects: a TO whose rank in LB is more than K and whose rank in UB is K or less.
• Boundary objects: a pair of target objects (A, B) such that:
1. The top K in UB and LB are the same.
2. A is the Kth object in LB and the uth object in UB (u ≤ K).
3. B is the (K+1)th object in UB and the lth object in LB (l ≥ K+1).
4. LBK ≤ UBK+1
Example (K = 2):
Rank   UB        LB
1      A         C
2                A = 1.5
3      B = 1.6
4      C         B
Experiment
• Our document collection comprises 714,192 news articles from 2003-2004, obtained from the MSNBC news portal.
• We index these news articles in the SQL Server FTS engine.
• We extract three types of named entities: PersonNames, OrganizationNames, and LocationNames.
Experiment
• To get realistic OF queries, we picked the top 10 sports news queries on Google in 2004.
Experiment
• “PersonNames” is the desired entity type for all the queries. All our measurements are averaged across the 10 queries.
• We implemented all 3 approaches to evaluating OF queries: the SQL implementation, GenPrune, and GenOnly.
• SUM is used as both the combination function and the aggregation function.
Conclusions
• We introduced the class of OF queries and defined its semantics.
• We defined two broad classes of scoring functions, which exploit relationships between documents and objects to compute the relevance score of the target objects for a given set of keywords.
• We presented early termination techniques; the experiments show that our approach is 4-5 times faster than the SQL implementation.