Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation

Kaushik Chakrabarti

Venkatesh Ganti

Dong Xin

Jiawei Han

Presented by: Vaidergorn Eitan

Outline

• Introduction

• System Overview

• Scoring Functions

• SQL implementation

• Early Termination Approach

• Experiments

• Conclusions

Introduction

• In more and more document collections, the documents relate to objects.

• Example: a laptop reviews site.

[Figure: laptop reviews]

OF (Object Finder) Queries

OF: I need the best laptop that is “lightweight” and for “business use”.

[Figure: laptop reviews]

• The goal:
  – Get the top K objects.

• Exploit the relationships between documents and objects.

• Exploit the fact that we need only the top K.

• Search Objects (SOs): the documents.

• Target Objects (TOs): the objects to be ranked.


System Overview

• FTS (Full Text Search):
  – Input: keyword(s).
  – Output: ranked lists of documents.

Review ID   DocScore
2           1.9
3           1.0
1           0.8

• FTS (Full Text Search):
  – Most relational DBMSs now support FTS functionality.

• DBMS:
  – T: the target objects table.
  – R: the relationships table (document-to-TO pairs).
  – T is used only for the final lookup of the TO values.


Scoring Functions

• The OF evaluation system returns the top K target objects with the best scores according to a scoring function.

• W = {w1, w2, …, wN} – the keywords in the OF query.

• Li – the ranked list of documents for keyword wi, as <document id, DocScore> pairs sorted by DocScore.

• Dt – the list of documents related to the TO t.

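• Score matrix Mt, for each t in TOs: one row per document di ∈ Dt, one column per keyword wi ∈ W.

        w1    w2
d1      1.1   0
d3      0     1.0
d6      2     0.8

• Score(t) – the relevance score for the TO t, computed from Mt in one of two ways:
  – compute the row scores first, or
  – compute the column scores first,
  using Fcomb (combines scores across keywords) and Fagg (aggregates scores across documents).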

• Row-marginal class: apply Fcomb to each row of Mt, then apply Fagg to the resulting row scores.

• Column-marginal class: apply Fagg to each column of Mt, then apply Fcomb to the resulting column scores.
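As a minimal sketch of the two classes, the snippet below scores the example matrix Mt (rows d1, d3, d6; columns w1, w2) both ways. The choice of min for Fcomb and sum for Fagg is an assumption made only so that the two classes give different results; the function names are hypothetical.

```python
from typing import Callable, List, Sequence

Matrix = Sequence[Sequence[float]]  # rows = documents in Dt, columns = keywords in W
Reducer = Callable[[List[float]], float]


def row_marginal(m: Matrix, f_comb: Reducer, f_agg: Reducer) -> float:
    """Combine each row (one document) across keywords, then aggregate the rows."""
    return f_agg([f_comb(list(row)) for row in m])


def column_marginal(m: Matrix, f_comb: Reducer, f_agg: Reducer) -> float:
    """Aggregate each column (one keyword) across documents, then combine the columns."""
    columns = [list(col) for col in zip(*m)]
    return f_comb([f_agg(col) for col in columns])


# Score matrix from the slide: rows d1, d3, d6; columns w1, w2.
M_T = [
    [1.1, 0.0],
    [0.0, 1.0],
    [2.0, 0.8],
]

# Assumed functions for illustration: Fcomb = min, Fagg = sum.
row_score = row_marginal(M_T, f_comb=min, f_agg=sum)     # min per row, then sum
col_score = column_marginal(M_T, f_comb=min, f_agg=sum)  # sum per column, then min
```

With Fcomb = Fagg = Sum, as used later in the talk, both classes reduce to the sum of all entries of Mt.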

1. Fcomb is monotonic: Fcomb(x1,…,xn) ≤ Fcomb(y1,…,yn) when xi ≤ yi for all i.

2. Fagg is subset monotonic: Fagg(S) ≤ Fagg(S’) if S ⊆ S’.

3. Fagg distributes over append: Fagg(R1 append R2) = Fagg(Fagg(R1), Fagg(R2)), where append is the ordered concatenation of tuples.
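To make property 3 concrete, here is a small sketch that checks "distributes over append" on one pair of lists. Sum and max pass the check; the arithmetic mean (a hypothetical counterexample, not from the talk) does not. A single check like this illustrates the property and is not a proof.

```python
import math
from typing import Callable, List

Aggregate = Callable[[List[float]], float]


def distributes_over_append(f_agg: Aggregate, r1: List[float], r2: List[float]) -> bool:
    """Check Fagg(R1 append R2) == Fagg(Fagg(R1), Fagg(R2)) for one pair of lists."""
    whole = f_agg(r1 + r2)                  # aggregate the concatenated list
    parts = f_agg([f_agg(r1), f_agg(r2)])   # aggregate the two partial aggregates
    return math.isclose(whole, parts)


def mean(values: List[float]) -> float:
    """Arithmetic mean, used here only as a counterexample."""
    return sum(values) / len(values)


r1, r2 = [1.0, 0.5], [0.2]
sum_ok = distributes_over_append(sum, r1, r2)
max_ok = distributes_over_append(max, r1, r2)
mean_ok = distributes_over_append(mean, r1, r2)
```

This property is what lets partial aggregates be kept per TO and extended as more documents arrive, instead of re-aggregating from scratch.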


SQL Implementation
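A rough sketch of what a SQL-only evaluation could look like, assuming SUM for both Fcomb and Fagg: join the ranked lists with R, group by TO, and keep the K best. This is an illustration and not the query from the talk; the table L, its columns, and the sample rows are hypothetical, and SQLite stands in for the DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema: L holds the FTS output, R the document-to-TO relationships.
    CREATE TABLE L (keyword INTEGER, doc_id INTEGER, doc_score REAL);
    CREATE TABLE R (doc_id INTEGER, to_id TEXT);
    """
)
conn.executemany(
    "INSERT INTO L VALUES (?, ?, ?)",
    [(1, 1, 1.0), (1, 2, 0.5), (2, 1, 0.25), (2, 3, 0.75)],
)
conn.executemany(
    "INSERT INTO R VALUES (?, ?)",
    [(1, "t1"), (2, "t1"), (2, "t2"), (3, "t2"), (3, "t3")],
)

# With Fcomb = Fagg = SUM, Score(t) is the sum of all DocScores of documents related to t.
TOP_K_SQL = """
    SELECT R.to_id, SUM(L.doc_score) AS score
    FROM L JOIN R ON L.doc_id = R.doc_id
    GROUP BY R.to_id
    ORDER BY score DESC, R.to_id
    LIMIT ?
"""
top_k = conn.execute(TOP_K_SQL, (2,)).fetchall()
```

A query of this shape reads every document in every ranked list, which is the cost the early termination approach tries to avoid.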


Early Termination Approach

• Intuition: top-scoring documents typically contribute the most to the scores of high-scoring TOs.

• The TOs related to these top-scoring documents are most likely to be the best candidate matches.

• We progressively retrieve documents in decreasing order of their scores, and maintain upper and lower bound scores for the related TOs.

1. Generate-only approach:
   • Relies on the bounds alone.
   • Stops once the best K TOs have been identified.

2. Generate-Prune approach:
   • Candidate generation phase.
   • The stopping condition is more relaxed.
   • Pruning phase.

Candidate Generation

• Ci – the chunk size: documents are retrieved from Li in chunks of Ci.

• Prefix(Li) – the documents retrieved so far from the ranked list Li.

• SeenTOs – the TOs seen so far, with their current aggregation scores:
  – AggResulti – for each Li, a table containing numSeen and aggScore.
  – Upper bound and lower bound scores.

• Example setting: Fagg = Fcomb = Sum, Ci = 1, K = 3, B = 2.

• The algorithm has five steps:

• Step 1 - Retrieve documents:
  – Retrieve the next chunk of Ci documents from each Li.
  – Retrieving in chunks reduces the number of join queries (with R).
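Step 1 can be sketched as a generator that hands out a ranked list in chunks. The helper name `chunks` is hypothetical, and the sample list reuses the Review ID / DocScore values from the FTS example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

DocEntry = Tuple[int, float]  # (document id, DocScore)


def chunks(ranked_list: Iterable[DocEntry], chunk_size: int) -> Iterator[List[DocEntry]]:
    """Yield successive chunks of at most chunk_size entries, preserving rank order."""
    it = iter(ranked_list)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Ranked list from the FTS example, already sorted by decreasing DocScore.
L1 = [(2, 1.9), (3, 1.0), (1, 0.8)]
retrieved = list(chunks(L1, chunk_size=2))
```

Each chunk would then be joined with R once, so a larger chunk size means fewer joins at the cost of possibly reading more documents than needed.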

• Step 2 - Update SeenTOs:

[Figure: Prefix(L1), Prefix(L2) → AggResult(1), AggResult(2)]

[Figure: Prefix(L1) and Prefix(L2)]

SeenTOs (bounds not yet computed):

TO   numSeen[1]  aggScore[1]  ub[1]  numSeen[2]  aggScore[2]  ub[2]  lb  ub
T1   1           1.0          -      1           1.0          -      -   -
T2   1           1.0          -      0           0            -      -   -
T3   1           0.6          -      1           0.5          -      -   -
T4   1           0.6          -      1           0.5          -      -   -

SeenTOs after retrieving further documents:

TO   numSeen[1]  aggScore[1]  ub[1]  numSeen[2]  aggScore[2]  ub[2]  lb  ub
T1   1           1.0          -      1           1.0          -      -   -
T2   1           1.0          -      0           0            -      -   -
T3   2           0.8          -      1           0.8          -      -   -
T4   1           0.6          -      1           0.5          -      -   -
T5   1           0.2          -      1           0.3          -      -   -
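A minimal sketch of Step 2, assuming Fagg = Sum: each retrieved document is looked up in the relationships and the numSeen and aggScore entries of its TOs are updated for the list it came from. The `relationships` mapping, the sample scores, and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

DocEntry = Tuple[int, float]        # (document id, DocScore)
SeenTOs = Dict[str, Dict[str, list]]


def update_seen_tos(
    seen: SeenTOs,
    relationships: Dict[int, List[str]],
    list_idx: int,
    chunk: List[DocEntry],
    num_lists: int,
) -> None:
    """Fold one chunk from ranked list `list_idx` into the SeenTOs table (Fagg = Sum)."""
    for doc_id, doc_score in chunk:
        for to_id in relationships.get(doc_id, []):
            entry = seen.setdefault(
                to_id,
                {"numSeen": [0] * num_lists, "aggScore": [0.0] * num_lists},
            )
            entry["numSeen"][list_idx] += 1
            entry["aggScore"][list_idx] += doc_score


# Hypothetical stand-in for the relationships table R: document id -> related TOs.
R = {1: ["t1", "t2"], 2: ["t3"], 3: ["t3", "t5"]}

seen: SeenTOs = {}
update_seen_tos(seen, R, 0, [(1, 1.0), (2, 0.5)], num_lists=2)  # chunk from L1
update_seen_tos(seen, R, 1, [(1, 1.0)], num_lists=2)            # chunk from L2
update_seen_tos(seen, R, 0, [(3, 0.25)], num_lists=2)           # next chunk from L1
```

Because Sum distributes over append, each new chunk only adds to the stored aggScore; earlier documents never need to be revisited.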

• Step 3 - Compute bounds:
  – t.lb = Fcomb(t.aggScore[1], …, t.aggScore[N]).

• B – the maximum number of documents in any ranked list Li that can contribute to the score of any target object t.

• xi – the DocScore of the last document retrieved from Li.

• t.ub[i] = Fagg(t.aggScore[i], Fagg(xi, xi, …, xi)), with xi repeated (B − t.numSeen[i]) times.

• t.ub = Fcomb(t.ub[1], …, t.ub[N]).

• Example: t1.ub[1] = 1.0 + 1.0*(2-1) = 2 and t2.ub[1] = 1.0 + 1.0*(2-1) = 2.
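The bound formulas can be sketched as follows for Fagg = Fcomb = Sum and B = 2. The input rows are read from the later SeenTOs table (the one that includes T5), and x1 = 0.2, x2 = 0.3 are the last DocScores retrieved; the function names are hypothetical.

```python
from typing import Dict, List, Tuple

B = 2  # max documents per ranked list that can contribute to one TO (example setting)


def lower_bound(agg_score: List[float]) -> float:
    """t.lb = Fcomb(t.aggScore[1..N]) with Fcomb = Sum."""
    return sum(agg_score)


def upper_bound(num_seen: List[int], agg_score: List[float], x: List[float]) -> float:
    """t.ub = Fcomb(t.ub[1..N]); t.ub[i] adds x[i] once per document still unseen."""
    per_list = [
        agg_score[i] + x[i] * max(B - num_seen[i], 0)
        for i in range(len(x))
    ]
    return sum(per_list)


# (numSeen, aggScore) per TO, taken from the SeenTOs table that includes T5.
seen: Dict[str, Tuple[List[int], List[float]]] = {
    "t1": ([1, 1], [1.0, 1.0]),
    "t2": ([1, 0], [1.0, 0.0]),
    "t3": ([2, 1], [0.8, 0.8]),
    "t4": ([1, 1], [0.6, 0.5]),
    "t5": ([1, 1], [0.2, 0.3]),
}
x = [0.2, 0.3]  # DocScore of the last document retrieved from L1 and L2

bounds = {t: (lower_bound(a), upper_bound(n, a, x)) for t, (n, a) in seen.items()}
```

Under these inputs the upper bounds of t1, t2 and t4 come out as 2.5, 1.8 and 1.6, which agrees with the UB values shown on the pruning slide. For t3 the same formula gives 1.9 from the table values, whereas the pruning slide lists 1.6; the slides do not say which input differs.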

• Step 4 - Stopping condition:
  – We can stop when at least K objects in SeenTOs have lower bound scores at least as high as the upper bound score of any unseen TO.
  – UnseenUB = Fcomb(Fagg(x1, x1, …, x1), …, Fagg(xN, xN, …, xN)), with each xi repeated B times.
  – So the stopping criterion is LB_K ≥ UnseenUB, where LB_K is the K-th highest lower bound.

• Example: x1 = 0.2, x2 = 0.3.
  – UnseenUB = (0.2 + 0.2) + (0.3 + 0.3) = 1
  – LB_3 = 1.1
  – LB_3 ≥ UnseenUB, so the stopping condition holds.
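A minimal sketch of the stopping test for Fagg = Fcomb = Sum. The lower bounds are the row sums aggScore[1] + aggScore[2] of the SeenTOs table that includes T5; the function names are hypothetical.

```python
from typing import List


def unseen_ub(x: List[float], b: int) -> float:
    """Upper bound for any unseen TO: each x[i] counted B times, then summed."""
    return sum(xi * b for xi in x)


def can_stop(lower_bounds: List[float], x: List[float], b: int, k: int) -> bool:
    """True when the K-th highest lower bound is at least UnseenUB."""
    if len(lower_bounds) < k:
        return False  # fewer than K TOs seen so far
    lb_k = sorted(lower_bounds, reverse=True)[k - 1]
    return lb_k >= unseen_ub(x, b)


# Lower bounds of T1..T5 (sum of the two aggScore columns).
lbs = [2.0, 1.0, 1.6, 1.1, 0.5]
stop = can_stop(lbs, x=[0.2, 0.3], b=2, k=3)
```

Here the third highest lower bound is 1.1 and UnseenUB is 1, so retrieval can stop, as in the example above.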

• Step 5 - Identify candidates:
  – Top(List, X) – the top X elements in the list.
  – The set of candidates is Top(UB, h), where h is the least value that satisfies LB_K ≥ UB_(h+1).
  – Top(LB, K) ⊆ Top(UB, h).

• Example: LB_3 = 1.1, and LB_3 ≥ UB_(4+1), so h = 4.
  – Top(LB, 3) = {t1, t3, t4}
  – Top(UB, 4) = {t1, t2, t3, t4}
  – Top(LB, K) ⊆ Top(UB, h)
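Step 5 can be sketched as a scan down the UB-sorted list for the first position whose upper bound no longer exceeds LB_K. In the sample data, the lower bounds are the aggScore sums from the SeenTOs table, the upper bounds of t1..t4 are the values listed on the pruning slide, and the upper bound of t5 (1.0) is derived from the upper-bound formula; the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def identify_candidates(bounds: Dict[str, Tuple[float, float]], k: int) -> List[str]:
    """Return Top(UB, h) for the least h with LB_K >= UB_(h+1).

    `bounds` maps each seen TO to (lb, ub) and must hold at least k entries.
    """
    lb_k = sorted((lb for lb, _ in bounds.values()), reverse=True)[k - 1]
    by_ub = sorted(bounds, key=lambda t: bounds[t][1], reverse=True)
    h = len(by_ub)  # fall back to every seen TO if no position qualifies
    for idx, to_id in enumerate(by_ub):
        # by_ub[idx] is the (idx+1)-th highest UB, so idx plays the role of h.
        if lb_k >= bounds[to_id][1]:
            h = idx
            break
    return by_ub[:h]


bounds = {
    "t1": (2.0, 2.5),
    "t2": (1.0, 1.8),
    "t3": (1.6, 1.6),
    "t4": (1.1, 1.6),
    "t5": (0.5, 1.0),  # UB derived from the formula, not shown on a slide
}
candidates = identify_candidates(bounds, k=3)
```

With K = 3 this yields h = 4 and the candidate set {t1, t2, t3, t4}, matching the example.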

Pruning to the Final Top-K

• UB = {t1(2.5), t2(1.8), t3(1.6), t4(1.6)}, K = 3

• Exact score of t1 from its score matrix:

        w1    w2
d1      1.0   0.1
d2      0.1   1.0

  t1 = (1 + 0.1) + (0.1 + 1) = 2.2

• Exact scores: t1 = 2.2, t2 = 1.6, t3 = 1.6, t4 = 1.6

• UB = {t1(2.2), t2(1.6), t3(1.6), t4(1.6)}

• The final top-K results are {t1, t2, t3}.
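One plausible way to organize the pruning phase, sketched under an assumption that is not stated on the slide: candidates are resolved in decreasing UB order, and the loop stops as soon as K exact scores are at least as high as the best remaining upper bound. The function name and the `exact_scores` lookup (standing in for a full score computation) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def prune(
    candidate_ub: Dict[str, float],
    exact_score: Callable[[str], float],
    k: int,
) -> Tuple[List[str], Dict[str, float]]:
    """Resolve candidates in decreasing UB order until the top K is certain."""
    remaining = sorted(candidate_ub, key=lambda t: candidate_ub[t], reverse=True)
    exact: Dict[str, float] = {}
    while remaining:
        if len(exact) >= k:
            kth_exact = sorted(exact.values(), reverse=True)[k - 1]
            # Assumed stopping rule: no unresolved candidate can beat the K-th exact score.
            if kth_exact >= candidate_ub[remaining[0]]:
                break
        to_id = remaining.pop(0)
        exact[to_id] = exact_score(to_id)
    top_k = sorted(exact, key=lambda t: exact[t], reverse=True)[:k]
    return top_k, exact


# Values from the slide: upper bounds of the candidates and their exact scores.
ub = {"t1": 2.5, "t2": 1.8, "t3": 1.6, "t4": 1.6}
exact_scores = {"t1": 2.2, "t2": 1.6, "t3": 1.6, "t4": 1.6}

top_k, resolved = prune(ub, exact_scores.__getitem__, k=3)
```

On the slide's numbers this returns {t1, t2, t3} without computing the exact score of t4, whereas the slide lists exact scores for all four candidates.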

Exact Top-K with Approximate Scores

• Crossing object: a target object whose rank in LB is more than K and whose rank in UB is K or less.

• Boundary objects: a pair of target objects (A, B) such that:
  1. The top K objects in UB and in LB are the same.
  2. A is the K-th object in LB and the u-th object in UB (u ≤ K).
  3. B is the (K+1)-th object in UB and the l-th object in LB (l ≥ K+1).
  4. LB_K ≤ UB_(K+1).

Example (K = 2):

Rank   UB        LB
1      A         C
2      -         A = 1.5
3      B = 1.6   -
4      C         B


Experiments

• Our document collection comprises 714,192 news articles from 2003-2004, obtained from the MSNBC news portal.

• We index these news articles using the SQL Server FTS engine.

• We extract three types of named entities: PersonNames, OrganizationNames, and LocationNames.

• To get realistic OF queries, we picked the top 10 sports news queries on Google in 2004.

• “PersonNames” is the desired entity type for all the queries. All our measurements are averaged across the 10 queries.

• We implemented all three approaches to evaluating OF queries: the SQL implementation, GenPrune, and GenOnly.

• SUM is used as both the combination function and the aggregation function.


Conclusions

• We introduced the class of OF queries and defined its semantics.

• We defined two broad classes of scoring functions, which exploit the relationships between documents and objects to compute the relevance score of the target objects for a given set of keywords.

• We presented early termination techniques; in our experiments this approach is 4-5 times faster than the SQL implementation.