Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation


Upload: feryal

Post on 08-Jan-2016


DESCRIPTION

Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation. Kaushik Chakrabarti, Venkatesh Ganti, Dong Xin, Jiawei Han. Presented by: Vaidergorn Eitan. Outline: Introduction, System Overview, Scoring Functions, SQL Implementation, Early Termination Approach, Experiments.

TRANSCRIPT

Page 1: Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation

Kaushik Chakrabarti, Venkatesh Ganti, Dong Xin, Jiawei Han

Presented by: Vaidergorn Eitan

Outline

• Introduction

• System Overview

• Scoring Functions

• SQL implementation

• Early Termination Approach

• Experiments

• Conclusions

Introduction

• In more and more document collections, the documents relate to objects.
• Example: a laptop review site, where the reviews relate to laptops.

[Figure: laptop reviews]

Introduction: OF (Object Finder) Queries

• Example OF query: "I need the best laptop that is lightweight and suited to business use."

[Figure: laptop reviews]

Introduction

• The goal: get the top K objects by
  – exploiting the relationships between documents and objects;
  – exploiting the fact that only K results are needed.

Introduction

• Search Objects (SOs): the documents.
• Target Objects (TOs): the objects to be ranked and returned.


System Overview

• FTS (Full Text Search):
  – Input: one or more keywords.
  – Output: a ranked list of documents.

Review ID   DocScore
2           1.9
3           1.0
1           0.8

System Overview

• FTS (Full Text Search):
  – Most relational DBMSs now support FTS functionality.

System Overview

• DBMS, holding two tables:
  – T: the target objects table.
  – R: the relationships table, linking documents to target objects.
• T is used only for the final lookup of the TO values.


Scoring Functions

• The OF evaluation system returns the top K target objects, that is, those with the best scores according to the scoring function.

Scoring Functions

• W = {w1, w2, …, wN}: the keywords in the OF query.
• Li: the ranked list of documents for keyword wi, sorted by score; each entry is <document id, DocScore>.
• Dt: the list of documents related to the target object t.

Scoring Functions

• Score matrix Mt, one for each t in TOs: rows are the documents di ∈ Dt, columns are the keywords wi ∈ W.

        w1     w2
d1      1.1    0
d3      0      1.0
d6      2      0.8

• Score(t): the relevance score for the TO t, computed from Mt with two functions:
  – Fcomb computes the score of a row.
  – Fagg computes the score of a column.

Scoring Functions: Row-marginal Class

Scoring Functions: Column-marginal Class
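The formulas for the two classes did not survive in this transcript. The sketch below assumes the reading suggested by the score matrix slide: the row-marginal class applies Fcomb to each row and then Fagg to the row scores, while the column-marginal class applies Fagg to each column and then Fcomb to the column scores. The function names and the choice of SUM for Fcomb and MAX for Fagg are illustrative assumptions, picked only because they make the two classes give different answers on the example matrix.

```python
from typing import Callable, List, Sequence

# Score matrix M_t from the score matrix slide: one row per document
# in D_t, one column per keyword (w1, w2).
M_T: List[List[float]] = [
    [1.1, 0.0],  # d1
    [0.0, 1.0],  # d3
    [2.0, 0.8],  # d6
]

ScoreFn = Callable[[Sequence[float]], float]


def row_marginal(matrix: List[List[float]], f_comb: ScoreFn, f_agg: ScoreFn) -> float:
    """Combine each row across keywords, then aggregate the row scores."""
    return f_agg([f_comb(row) for row in matrix])


def column_marginal(matrix: List[List[float]], f_comb: ScoreFn, f_agg: ScoreFn) -> float:
    """Aggregate each column across documents, then combine the column scores."""
    columns = list(zip(*matrix))
    return f_comb([f_agg(col) for col in columns])


# Assumed functions for illustration: Fcomb = SUM, Fagg = MAX.
row_score = row_marginal(M_T, f_comb=sum, f_agg=max)
col_score = column_marginal(M_T, f_comb=sum, f_agg=max)
print(row_score, col_score)
```

With SUM used for both functions, as in the running example later in the talk, the two classes coincide because both simply add up every cell of the matrix.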

Scoring Functions

Required properties:

1. Fcomb is monotonic: Fcomb(x1, …, xn) ≤ Fcomb(y1, …, yn) when xi ≤ yi for all i.
2. Fagg is subset monotonic: Fagg(S) ≤ Fagg(S') if S ⊆ S'.
3. Fagg distributes over append: Fagg(R1 append R2) = Fagg(Fagg(R1), Fagg(R2)), where append is the ordered concatenation of tuples.
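As a quick sanity check, the three properties can be spot-checked for SUM, the function used in the running example. This is only a sketch on hand-picked values, not a proof; it assumes non-negative DocScores, which is what makes SUM subset monotonic. All variable names are illustrative.

```python
def f_comb(*scores: float) -> float:
    """Combination function assumed for this sketch: SUM."""
    return sum(scores)


def f_agg(scores) -> float:
    """Aggregation function assumed for this sketch: SUM."""
    return sum(scores)


# 1. Monotonicity of Fcomb: x is component-wise <= y.
x, y = (0.2, 0.5), (0.3, 0.5)
is_monotonic = f_comb(*x) <= f_comb(*y)

# 2. Subset monotonicity of Fagg: S is a subset of S_prime
#    (holds for SUM because scores are assumed non-negative).
S, S_prime = [0.6, 0.2], [0.6, 0.2, 0.3]
is_subset_monotonic = f_agg(S) <= f_agg(S_prime)

# 3. Distribution over append: aggregating the concatenation equals
#    aggregating the two partial aggregates.
R1, R2 = [1.0, 0.6], [0.2]
lhs = f_agg(R1 + R2)
rhs = f_agg([f_agg(R1), f_agg(R2)])
distributes = abs(lhs - rhs) < 1e-12

print(is_monotonic, is_subset_monotonic, distributes)
```

Property 3 is what lets the algorithm keep one running aggScore per TO and fold in each new chunk, instead of re-aggregating all documents seen so far.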


SQL Implementation


Early Termination Approach

• Intuition: the top-scoring documents typically contribute the most to the scores of high-scoring TOs.
• The TOs related to these top-scoring documents are therefore the most likely to be the best candidate matches.
• We progressively retrieve documents in decreasing order of their scores, and maintain upper and lower bound scores for the related TOs.

Early Termination Approach

1. Generate-only approach:
   • Relies on the bounds alone.
   • Stops when the best K TOs have been identified.
2. Generate-Prune approach:
   • Candidate generation phase, with a more relaxed stopping condition.
   • Pruning phase.

Candidate Generation

• Ci: the chunk size for Li; documents are retrieved from Li in chunks.
• Prefix(Li): the documents retrieved so far from the ranked list Li.
• SeenTOs: the TOs seen so far, with their current aggregation scores:
  – AggResulti: for each Li, a table containing numSeen and aggScore.
  – Upper bound and lower bound scores.

Candidate Generation

• Running example: Fagg = Fcomb = Sum, Ci = 1, K = 3, B = 2.

Candidate Generation

• The algorithm has 5 steps:

Candidate Generation

• Step 1 - Retrieve documents:
  – Retrieve the next chunk of Ci documents from each Li.
  – Retrieving in chunks reduces the number of join queries (with R).

Candidate Generation

• Step 2 - Update SeenTOs:

[Figure: Prefix(L1), Prefix(L2), AggResult(1), AggResult(2)]
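Steps 1 and 2 can be sketched as follows: each retrieved chunk is joined with the relationships table R, and the result is folded into the per-list numSeen and aggScore entries of SeenTOs. The contents of R, the document ids and the function names are hypothetical; SUM is assumed as Fagg, as in the running example.

```python
from typing import Dict, List, Tuple

# Hypothetical relationships table R: document id -> related target objects.
R: Dict[str, List[str]] = {
    "d1": ["t1", "t2"],
    "d2": ["t1"],
}


def new_entry(n_lists: int) -> dict:
    """Empty SeenTOs entry: one numSeen and one aggScore slot per ranked list."""
    return {"numSeen": [0] * n_lists, "aggScore": [0.0] * n_lists}


def update_seen_tos(seen_tos: dict, chunk: List[Tuple[str, float]],
                    i: int, n_lists: int) -> dict:
    """Fold one chunk of <document id, DocScore> pairs from list i into SeenTOs."""
    for doc_id, doc_score in chunk:
        for t in R.get(doc_id, []):  # join the chunk with R
            entry = seen_tos.setdefault(t, new_entry(n_lists))
            entry["numSeen"][i] += 1
            entry["aggScore"][i] += doc_score  # Fagg = SUM (assumed)
    return seen_tos


seen: dict = {}
update_seen_tos(seen, [("d1", 1.0)], i=0, n_lists=2)  # chunk from L1
update_seen_tos(seen, [("d2", 1.0)], i=1, n_lists=2)  # chunk from L2
print(seen)
```

Because SUM distributes over append, adding each new DocScore to the running aggScore gives the same result as re-aggregating the whole prefix.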

Candidate Generation

[Figure: Prefix(L1), Prefix(L2), AggResult(1), AggResult(2)]

SeenTOs (columns ub[1], ub[2], lb and ub are not yet filled in):

TO   numSeen[1]   aggScore[1]   numSeen[2]   aggScore[2]
T1   1            1.0           1            1.0
T2   1            1.0           0            0
T3   1            0.6           1            0.5
T4   1            0.6           1            0.5

Candidate Generation

[Figure: Prefix(L1), Prefix(L2), AggResult(1), AggResult(2)]

SeenTOs after the next chunk (columns ub[1], ub[2], lb and ub are not yet filled in):

TO   numSeen[1]   aggScore[1]   numSeen[2]   aggScore[2]
T1   1            1.0           1            1.0
T2   1            1.0           0            0
T3   2            0.8           2            0.8
T4   1            0.6           1            0.5
T5   1            0.2           1            0.3

Candidate Generation

• Step 3 - Compute bounds:
  – t.lb = Fcomb(t.aggScore[1], …, t.aggScore[N]).

Candidate Generation

• B: the maximum number of documents in any ranked list Li that can contribute to the score of any target object t.
• xi: the DocScore of the last document retrieved from Li.
• t.ub[i] = Fagg(t.aggScore[i], Fagg(xi, xi, …, xi)), where xi is repeated (B − t.numSeen[i]) times.
• t.ub = Fcomb(t.ub[1], …, t.ub[N]).
• Example with Sum and B = 2:
  – t1.ub[1] = 1.0 + 1.0 × (2 − 1) = 2
  – t2.ub[1] = 1.0 + 1.0 × (2 − 1) = 2
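The bound computation of Step 3 can be sketched with SUM for both Fagg and Fcomb. The state below is taken from the later point in the running example where x1 = 0.2 and x2 = 0.3; the entry for t3 assumes numSeen = 2 in both lists, which is the value consistent with the upper bound of 1.6 quoted for t3 in the pruning example. The dictionary layout and function names are illustrative.

```python
B = 2            # max documents per list that can contribute to one TO
x = [0.2, 0.3]   # DocScore of the last document retrieved from L1 and L2

seen_tos = {
    "t1": {"numSeen": [1, 1], "aggScore": [1.0, 1.0]},
    "t2": {"numSeen": [1, 0], "aggScore": [1.0, 0.0]},
    "t3": {"numSeen": [2, 2], "aggScore": [0.8, 0.8]},  # numSeen assumed, see text
    "t4": {"numSeen": [1, 1], "aggScore": [0.6, 0.5]},
    "t5": {"numSeen": [1, 1], "aggScore": [0.2, 0.3]},
}


def lower_bound(entry: dict) -> float:
    """t.lb = Fcomb(aggScore[1..N]) with Fcomb = SUM."""
    return sum(entry["aggScore"])


def upper_bound(entry: dict, last_scores: list, b: int) -> float:
    """t.ub = Fcomb(ub[1..N]); ub[i] pads aggScore[i] with (B - numSeen[i]) copies of xi."""
    per_list = [
        agg + xi * (b - n)
        for agg, n, xi in zip(entry["aggScore"], entry["numSeen"], last_scores)
    ]
    return sum(per_list)


# Rounded to two decimals only to keep the printed values tidy.
bounds = {
    t: (round(lower_bound(e), 2), round(upper_bound(e, x, B), 2))
    for t, e in seen_tos.items()
}
print(bounds)
```

The upper bounds computed this way for t1 to t4 (2.5, 1.8, 1.6, 1.6) match the values quoted at the start of the pruning example.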

Candidate Generation

• Step 4 - Stopping condition:
  – We can stop when there are at least K objects in SeenTOs whose lower bound scores are higher than the upper bound score of any unseen TO.
  – UnseenUB = Fcomb(Fagg(x1, x1, …), …, Fagg(xN, xN, …)), where each xi is repeated B times.
  – So the stopping criterion is LBK ≥ UnseenUB, where LBK is the Kth highest lower bound.

Candidate Generation

[Figure: Prefix(L1)]

• x1 = 0.2, x2 = 0.3

Candidate Generation

• Stopping criterion: LBK ≥ UnseenUB
• UnseenUB = (0.2 + 0.2) + (0.3 + 0.3) = 1.0
• LB3 = 1.1
• Since 1.1 ≥ 1.0, candidate generation stops.
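The stopping test of Step 4 can be sketched in a few lines, again assuming SUM for Fagg and Fcomb. The lower bounds are the sums of the aggScore columns in the SeenTOs table of the running example; the function names are illustrative.

```python
def unseen_ub(last_scores: list, b: int) -> float:
    """Best score an unseen TO could reach: each xi repeated B times, summed."""
    return sum(xi * b for xi in last_scores)


def kth_highest(values, k: int) -> float:
    """Kth highest value, or -inf when fewer than K TOs have been seen."""
    ordered = sorted(values, reverse=True)
    return ordered[k - 1] if len(ordered) >= k else float("-inf")


K, B = 3, 2
x = [0.2, 0.3]
lower_bounds = {"t1": 2.0, "t2": 1.0, "t3": 1.6, "t4": 1.1, "t5": 0.5}

lb_k = kth_highest(lower_bounds.values(), K)
threshold = unseen_ub(x, B)
should_stop = lb_k >= threshold
print(lb_k, threshold, should_stop)
```

Returning negative infinity when fewer than K objects have been seen is a design choice of this sketch; it simply guarantees that the loop keeps retrieving documents until K candidates exist.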

Candidate Generation

• Step 5 - Identify candidates:
  – Top(List, X): the top X elements in the list.
  – The set of candidates is Top(UB, h), where h is the least value that satisfies LBK ≥ UB(h+1), the (h+1)th highest upper bound.
  – Top(LB, K) ⊆ Top(UB, h).

Candidate Generation

• Criterion: LBK ≥ UB(h+1)
• LB3 = 1.1
• LB3 ≥ UB(4+1), so h = 4.
• Top(LB, 3) = {t1, t3, t4}
• Top(UB, h) = Top(UB, 4) = {t1, t2, t3, t4}
• Top(LB, K) ⊆ Top(UB, h)
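Step 5 can be sketched as a scan over the TOs sorted by upper bound. The (lb, ub) pairs below are the ones implied by the running example with Sum, B = 2, x1 = 0.2 and x2 = 0.3; the function name is illustrative. Starting the scan at h = K is an assumption of this sketch, made so that ties between LBK and an upper bound can never shrink the candidate set below K.

```python
def identify_candidates(bounds: dict, k: int) -> list:
    """Return Top(UB, h) for the least h >= K with LB_K >= UB_(h+1)."""
    lbs = sorted((lb for lb, _ in bounds.values()), reverse=True)
    lb_k = lbs[k - 1]
    by_ub = sorted(bounds, key=lambda t: bounds[t][1], reverse=True)
    for h in range(k, len(by_ub)):
        # by_ub[h] is the (h+1)th TO in decreasing upper-bound order.
        if lb_k >= bounds[by_ub[h]][1]:
            return by_ub[:h]
    return by_ub  # no TO could be excluded


# (lower bound, upper bound) per seen TO in the running example.
bounds = {
    "t1": (2.0, 2.5),
    "t2": (1.0, 1.8),
    "t3": (1.6, 1.6),
    "t4": (1.1, 1.6),
    "t5": (0.5, 1.0),
}

candidates = identify_candidates(bounds, k=3)
print(candidates)
```

Here t5 is excluded because its upper bound of 1.0 cannot exceed LB3 = 1.1, while t2 stays in the candidate set even though it is not in Top(LB, 3).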


Pruning to the Final Top-K

• Candidates with their upper bounds: UB = {t1 (2.5), t2 (1.8), t3 (1.6), t4 (1.6)}, K = 3.
• Exact score of t1, from the DocScores of its documents:

        w1     w2
d1      1.0    0.1
d2      0.1    1.0

  t1 = (1.0 + 0.1) + (0.1 + 1.0) = 2.2
• Exact scores: t1 = 2.2, t2 = 1.6, t3 = 1.6, t4 = 1.6.
• UB = {t1 (2.2), t2 (1.6), t3 (1.6), t4 (1.6)}
• The final top-K results are {t1, t2, t3}.
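The arithmetic of this example can be reproduced with a short sketch. It computes the exact score of t1 from its two documents and then ranks the candidates by exact score. The exact scores of t2, t3 and t4 are copied from the slide rather than computed, and scoring every candidate exhaustively is a simplification of this sketch; the pruning phase described in the talk may well avoid some of these computations. Ties are broken by candidate order, which is an assumption.

```python
# DocScores of the documents related to t1 (columns: w1, w2).
M_T1 = {
    "d1": [1.0, 0.1],
    "d2": [0.1, 1.0],
}


def exact_score(matrix: dict) -> float:
    """Exact score with Fcomb = Fagg = SUM: the sum of every cell."""
    return sum(sum(row) for row in matrix.values())


K = 3
exact = {
    "t1": exact_score(M_T1),
    "t2": 1.6,  # from the slide
    "t3": 1.6,  # from the slide
    "t4": 1.6,  # from the slide
}

# sorted() is stable, so tied candidates keep their original order.
final_top_k = sorted(exact, key=lambda t: exact[t], reverse=True)[:K]
print(exact["t1"], final_top_k)
```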

Exact Top-K with Approximate Scores

• Crossing object: a target object whose rank in LB is greater than K and whose rank in UB is K or less.
• Boundary objects: a pair of target objects (A, B) such that:
  1. The top K in UB and in LB are the same.
  2. A is the Kth object in LB and the uth object in UB (u ≤ K).
  3. B is the (K+1)th object in UB and the lth object in LB (l ≥ K+1).
  4. LBK ≤ UB(K+1).

Example with K = 2:

Rank   UB         LB
1      A          C
2                 A = 1.5
3      B = 1.6
4      C          B


Experiment

• The documents are a collection of 714,192 news articles from 2003-2004, obtained from the MSNBC news portal.
• The news articles are indexed in the SQL Server FTS engine.
• Three types of named entities are extracted: PersonNames, OrganizationNames, and LocationNames.

Experiment

• To get realistic OF queries, we picked the top 10 sports news queries on Google in 2004.

Experiment

• "PersonNames" is the desired entity type for all the queries. All measurements are averaged across the 10 queries.
• All 3 approaches to evaluating OF queries were implemented: the SQL implementation, GenPrune, and GenOnly.
• SUM is used as the combination function and SUM as the aggregation function.


Conclusions

• We introduced the class of OF queries and defined its semantics.
• We identified two broad classes of scoring functions, which exploit the relationships between documents and objects to compute the relevance score of the target objects for a given set of keywords.
• We presented early termination techniques; the experiments show that this approach is 4-5 times faster than the SQL implementation.
