IO-Top-k: Index-access Optimized Top-k Query Processing Debapriyo Majumdar Max-Planck-Institut für Informatik Saarbrücken, Germany Joint work with Holger Bast, Ralf Schenkel, Martin Theobald, Gerhard Weikum VLDB 2006, Seoul, Korea


IO-Top-k: Index-access Optimized Top-k Query Processing

Debapriyo Majumdar
Max-Planck-Institut für Informatik
Saarbrücken, Germany

Joint work with Holger Bast, Ralf Schenkel, Martin Theobald, Gerhard Weikum

VLDB 2006, Seoul, Korea

Setup

Pre-computed index-lists over multiple attributes, with a single numeric score for every item for each attribute:

price              resolution         zoom
camera 1  €300     camera 5  8MP      camera 3  7x
camera 3  €330     camera 1  7MP      camera 1  5x
camera 5  €490     camera 4  6MP      camera 2  4x
camera 4  €580     camera 2  4MP      camera 5  4x

Lists are accessible by both sorted and random accesses.

Scores are combined by some monotonic aggregation function, for example a weighted combination of the form $w_1 \cdot \text{res} + w_2 \cdot \text{zoom} - w_3 \cdot \text{price}$.

Goal: find the top-k items with the highest total scores.

Top-k algorithms: example

Lists sorted by score:

List 1           List 2           List 3
item 25  0.6     item 17  0.6     item 83  0.9
item 78  0.5     item 38  0.6     item 17  0.7
item 83  0.4     item 14  0.6     item 61  0.3
item 17  0.3     item 5   0.6     item 81  0.2
item 21  0.2     item 83  0.5     item 65  0.1
item 91  0.1     item 21  0.3     item 10  0.1
item 44  0.1

Fagin’s NRA Algorithm:

Top-k algorithms: example

Fagin’s NRA Algorithm: round 1

Read one item from every list.

Candidates (item: [current score, best-score]):
  item 83  [0.9, 2.1]
  item 17  [0.6, 2.1]
  item 25  [0.6, 2.1]

min top-2 score: 0.6
maximum score for unseen items: 2.1

min-top-2 < best-score of candidates

Top-k algorithms: example

Fagin’s NRA Algorithm: round 2

Read one item from every list.

Candidates (item: [current score, best-score]):
  item 17  [1.3, 1.8]
  item 83  [0.9, 2.0]
  item 25  [0.6, 1.9]
  item 38  [0.6, 1.8]
  item 78  [0.5, 1.8]

min top-2 score: 0.9
maximum score for unseen items: 1.8

min-top-2 < best-score of candidates

Top-k algorithms: example

Fagin’s NRA Algorithm: round 3

Read one item from every list.

Candidates (item: [current score, best-score]):
  item 83  [1.3, 1.9]
  item 17  [1.3, 1.7]
  item 25  [0.6, 1.5]
  item 78  [0.5, 1.4]

min top-2 score: 1.3
maximum score for unseen items: 1.3

min-top-2 < best-score of candidates

No more new items can get into the top-2, but extra candidates are left in the queue.

Top-k algorithms: example

Fagin’s NRA Algorithm: round 4

Read one item from every list.

Candidates (item: [current score, best-score]):
  item 17  1.6
  item 83  [1.3, 1.9]
  item 25  [0.6, 1.4]

min top-2 score: 1.3
maximum score for unseen items: 1.1

min-top-2 < best-score of candidates

No more new items can get into the top-2, but extra candidates are left in the queue.

Top-k algorithms: example

Fagin’s NRA Algorithm: round 5

Read one item from every list.

Candidates:
  item 83  1.8
  item 17  1.6

min top-2 score: 1.6
maximum score for unseen items: 0.8

No extra candidate in queue. Done!
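The rounds above can be replayed with a short script. This is only a minimal sketch of round-robin NRA, not the authors' implementation: it assumes summation as the aggregation function, a score of 0 for items missing from a list, and ties broken by item id. Scores are stored in tenths to keep the arithmetic exact, and the placement of the last entry (item 44) in List 1 is an assumption about the example data.

```python
# Scores are stored in tenths (6 means 0.6) so that sums stay exact.
LISTS = [
    [(25, 6), (78, 5), (83, 4), (17, 3), (21, 2), (91, 1), (44, 1)],  # List 1
    [(17, 6), (38, 6), (14, 6), (5, 6), (83, 5), (21, 3)],            # List 2
    [(83, 9), (17, 7), (61, 3), (81, 2), (65, 1), (10, 1)],           # List 3
]


def nra(lists, k):
    """Round-robin NRA with summation as the aggregation function.

    Returns (top_k, rounds); top_k is a list of (item, current score).
    """
    seen = {}  # item -> {list index: score read by sorted access}
    max_depth = max(len(lst) for lst in lists)
    for depth in range(1, max_depth + 1):
        # One sorted access on every list that still has entries.
        for i, lst in enumerate(lists):
            if depth <= len(lst):
                item, score = lst[depth - 1]
                seen.setdefault(item, {})[i] = score
        # high[i] bounds the score of anything not yet read from list i.
        high = [lst[depth - 1][1] if depth <= len(lst) else 0 for lst in lists]
        current = {d: sum(s.values()) for d, s in seen.items()}
        best = {
            d: current[d] + sum(h for i, h in enumerate(high) if i not in seen[d])
            for d in seen
        }
        top = sorted(seen, key=lambda d: (-current[d], d))[:k]
        if len(top) < k:
            continue
        min_top_k = current[top[-1]]
        rivals = [d for d in seen if d not in top]
        # Stop when neither an unseen item nor a queued candidate can still win.
        if sum(high) <= min_top_k and all(best[d] <= min_top_k for d in rivals):
            return [(d, current[d]) for d in top], depth
    # Lists exhausted: every current score is final.
    top = sorted(seen, key=lambda d: (-current[d], d))[:k]
    return [(d, current[d]) for d in top], max_depth


top2, rounds = nra(LISTS, 2)
print(top2, rounds)  # -> [(83, 18), (17, 16)] 5
```

With k = 2 the sketch stops after five rounds with item 83 (1.8) and item 17 (1.6), matching the example.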

Top-k algorithms

NRA (No Random Access) performs only sorted accesses (SA).

Random access (RA)
– looks up the actual (final) score of an item
– costlier than SA (100 to 100,000 times); cR/cS := (cost of RA)/(cost of SA)
– often very useful

CA (Combined Algorithm) (Fagin et al., 2001)
– one RA after every cR/cS SAs
– total cost of SA ~ total cost of RA

Measure of effectiveness (access cost): #SA + cR/cS × #RA

Full-merge: compute scores for all items, followed by a partial sort
– simple and efficient
– important baseline for any top-k algorithm

Problems with NRA, CA
– high bookkeeping overhead: cannot beat full-merge in runtime
– for “high” values of k, even the gain in access cost is not significant
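As an illustration of the full-merge baseline and of the access-cost measure, here is a minimal sketch. It assumes summation as the aggregation function and in-memory lists; the function names `full_merge` and `access_cost` are hypothetical, and the data is the three-list example with scores in tenths.

```python
import heapq
from collections import defaultdict

# Example lists, scores in tenths (6 means 0.6).
LISTS = [
    [(25, 6), (78, 5), (83, 4), (17, 3), (21, 2), (91, 1), (44, 1)],
    [(17, 6), (38, 6), (14, 6), (5, 6), (83, 5), (21, 3)],
    [(83, 9), (17, 7), (61, 3), (81, 2), (65, 1), (10, 1)],
]


def full_merge(lists, k):
    """Aggregate the scores of all items, then extract the k largest totals."""
    totals = defaultdict(int)
    for lst in lists:
        for item, score in lst:
            totals[item] += score
    # Partial sort: only the k best totals are pulled out of the table.
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])


def access_cost(num_sa, num_ra, cr_over_cs):
    """Access cost measured in units of one sorted access."""
    return num_sa + cr_over_cs * num_ra


top2 = full_merge(LISTS, 2)
# Full-merge reads every entry by sorted access and needs no random access.
cost_full_merge = access_cost(sum(len(lst) for lst in LISTS), 0, 1000)
```

On this toy data full-merge touches all 19 entries; a top-k algorithm pays off when it can stop well before that without too much bookkeeping.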

Top-k algorithms

Greedy heuristics for sorted access scheduling, based on crude estimates of scores (Güntzer, Balke, Kiessling, ITCC 2001)

RankSQL: ordering of binary rank joins at query planning time (Ilyas et al., SIGMOD ’04 and Li et al., SIGMOD ’05)

Scheduling RAs on “expensive predicates”, where SAs may not even be possible on all attributes (our setting is different)
– MPro (Chang and Hwang, SIGMOD 2002)
– Upper, Pick (Bruno, Gravano and Marian, ICDE ’02, ACM TODS ’04)

Probabilistic pruning of candidates, RA scheduling (Theobald, Schenkel and Weikum, VLDB ’04, VLDB ’05)

Main related previous works: NRA, CA

Our algorithm: IO-Top-k

Round 1: same as NRA

Candidates (item: [current score, best-score]):
  item 83  [0.9, 2.1]
  item 17  [0.6, 2.1]
  item 25  [0.6, 2.1]

min top-2 score: 0.6
maximum score for unseen items: 2.1

min-top-2 < best-score of candidates

Sorted accesses are not necessarily round robin.

Our algorithm: IO-Top-k

Round 2

Candidates (item: [current score, best-score]):
  item 17  [1.3, 1.8]
  item 83  [0.9, 2.0]
  item 25  [0.6, 1.9]
  item 78  [0.5, 1.4]

min top-2 score: 0.9
maximum score for unseen items: 1.4

min-top-2 < best-score of candidates

Sorted accesses are not necessarily round robin.

Our algorithm: IO-Top-k

Round 3

Candidates (item: [current score, best-score]):
  item 17  1.6
  item 83  [1.3, 1.9]
  item 25  [0.6, 1.4]

min top-2 score: 1.3
maximum score for unseen items: 1.1

min-top-2 < best-score of candidates

Potential candidate for top-2.

Sorted accesses are not necessarily round robin.

Our algorithm: IO-Top-k

Round 4: random access for item 83

Candidates:
  item 83  1.8
  item 17  1.6

min top-2 score: 1.6
maximum score for unseen items: 1.1

No extra candidate in queue. Done!

– fewer sorted accesses (not necessarily round robin)
– carefully scheduled random access

Outline

Our contributions
– Inverted block-index data structure
– Sorted access scheduling
– Random access scheduling
– Lower bound

Experiments

Conclusion

Inverted block-index

Lists are first sorted by score, then split into blocks; each block is sorted by item-id.

Top-k algorithm with block-index
– blocks are sorted by item ids, so they are efficiently merged by full-merge
– candidates are pruned after a full merge, then the next full merge follows, and so on

Choose the block size by balancing disk seek time and data transfer rate.

Low overhead: prune once every round.
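The layout described above can be sketched as follows. This is a toy, in-memory illustration under assumptions of my own (a block size of 3, summation as the aggregation function, hypothetical function names), not the on-disk structure used in the work itself.

```python
import heapq
from itertools import groupby

# Example lists as (item id, score in tenths), in arbitrary input order.
LISTS = [
    [(25, 6), (78, 5), (83, 4), (17, 3), (21, 2), (91, 1), (44, 1)],
    [(17, 6), (38, 6), (14, 6), (5, 6), (83, 5), (21, 3)],
    [(83, 9), (17, 7), (61, 3), (81, 2), (65, 1), (10, 1)],
]


def build_block_index(postings, block_size):
    """Sort by descending score, cut into blocks, sort each block by item id."""
    by_score = sorted(postings, key=lambda p: -p[1])  # stable for equal scores
    blocks = [by_score[i:i + block_size] for i in range(0, len(by_score), block_size)]
    return [sorted(block, key=lambda p: p[0]) for block in blocks]


def merge_blocks(blocks):
    """Merge id-sorted blocks in one pass, summing the scores per item."""
    merged = heapq.merge(*blocks, key=lambda p: p[0])
    return [(item, sum(score for _, score in group))
            for item, group in groupby(merged, key=lambda p: p[0])]


index = [build_block_index(lst, 3) for lst in LISTS]
# Merge the first block of every list, as a first round would.
first_round = merge_blocks([blocks[0] for blocks in index])
```

Because every block is ordered by item id, combining blocks is a plain merge join with no hash lookups, which is what keeps the per-round overhead low.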

Sorted access scheduling

General Paradigm

Inverted Block-Index:

List 1   List 2   List 3
b11      b21      b31
b12      b22      b32
b13      b23      b33
b14      b24      b34

We assign benefits to every block of each list.

Optimization problem
– Goal: choose a total of 3 blocks from any of the lists such that the total benefit is maximized
– We can show: this problem is NP-hard; the well-known Knapsack problem reduces to it
– But the number of blocks to choose and the number of lists to choose from are small: we can solve it by enumerating all possibilities
– We choose the schedule with maximum benefit, and continue to the next round
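The enumeration step can be sketched in a few lines. The benefit numbers below are hypothetical, and the sketch assumes that the blocks of a list must be read in order, so a schedule is just the number of next blocks taken from each list.

```python
from itertools import product


def best_schedule(benefits, budget):
    """Try every way to split `budget` blocks over the lists.

    benefits[i][j] is the benefit of the j-th unread block of list i.
    Returns (allocation, total benefit); allocation[i] is the number of
    blocks to read next from list i.
    """
    best, best_value = None, float("-inf")
    ranges = [range(min(budget, len(b)) + 1) for b in benefits]
    for alloc in product(*ranges):
        if sum(alloc) != budget:
            continue
        value = sum(sum(b[:x]) for b, x in zip(benefits, alloc))
        if value > best_value:
            best, best_value = alloc, value
    return best, best_value


# Hypothetical benefits for the next four blocks of three lists.
benefits = [[5, 1, 1, 1], [2, 2, 2, 2], [4, 3, 0, 0]]
schedule, value = best_schedule(benefits, 3)
print(schedule, value)  # -> (1, 0, 2) 12
```

With three lists and three blocks per round there are only ten allocations to try, so brute force is cheap even though the general problem is hard.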


The chosen schedule scans to different depths in the lists.

Sorted access scheduling

Knapsack for Score Reduction (KSR)

Pre-compute the score reduction $\Delta_{ij}$ of every block j of each list i: (max-score of the block − min-score of the block).

Setting: the lists of the inverted block-index have been scanned till some depth; candidate item d has [current score $s_d$, best-score $b_d$].

Candidate item d is already seen in list 3. If we scan list 3 further, score $s_d$ and best-score $b_d$ of d do not change: no benefit.

In list 2, d is not yet seen. If we scan one block (block 22) from list 2
– with high probability d will not be found in that block: best-score $b_d$ of d decreases by $\Delta_{22}$

Benefit of block B in list i:

$$\sum_{d} \Delta_B \,\bigl(1 - \Pr[d \text{ found in } B]\bigr) \approx \sum_{d} \Delta_B$$

where the sum is taken over all candidates d not yet seen in list i.
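A minimal sketch of the KSR benefit of one block, under the simplification stated in the slide that a candidate is unlikely to be found in the block. The data layout (a dict from candidate to the set of list indices where it has been seen) and the function name are assumptions made for illustration.

```python
def ksr_benefit(delta_block, seen_in, list_index, p_found=0.0):
    """Benefit of scanning one block of list `list_index`.

    delta_block: score reduction of the block (max-score minus min-score).
    seen_in: dict mapping each candidate to the set of list indices
             in which it has already been seen.
    p_found: assumed probability that a candidate is found in the block;
             0.0 gives the approximation (number of candidates) * delta.
    """
    unseen = [d for d, lists in seen_in.items() if list_index not in lists]
    return sum(delta_block * (1 - p_found) for _ in unseen)


# Hypothetical candidates: d seen only in list 3 (index 2), e in lists 1 and 3.
candidates = {"d": {2}, "e": {0, 2}, "f": {1}}
benefit_list2 = ksr_benefit(2, candidates, 1)  # d and e are unseen in list 2
benefit_list3 = ksr_benefit(2, candidates, 2)  # only f is unseen in list 3
```

A block whose list has already shown most candidates earns little benefit, which steers the scan toward the lists that can still tighten many best-scores.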

Random access scheduling

Redundant random accesses of CA
– CA: one RA after every cR/cS SAs
– Example: CA makes an RA for item d, but d is found anyway in a subsequent SA round
– Many RAs turn out to be redundant

Our strategy: two-phase algorithm
– First sorted access rounds only, then switch to random access: no redundant random access

Switch from SA to RA when
– max-score for unseen ≤ min-top-k score
– estimated RA-cost ≤ total SA-cost so far (need to estimate the cost of RA)
– cost of SA ~ cost of RA
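The switching rule can be written as a small predicate. This sketch assumes that both conditions must hold at the same time and that all costs are expressed in units of one sorted access; the parameter names are hypothetical.

```python
def should_switch_to_ra(max_unseen_score, min_top_k_score,
                        estimated_num_ra, num_sa_so_far, cr_over_cs):
    """Decide whether to stop sorted accesses and start random accesses."""
    # No unseen item can enter the top-k any more.
    no_new_items = max_unseen_score <= min_top_k_score
    # The remaining random accesses cost no more than the scans done so far.
    ra_affordable = estimated_num_ra * cr_over_cs <= num_sa_so_far
    return no_new_items and ra_affordable


# Hypothetical numbers: one pending RA, 9 sorted accesses done so far.
print(should_switch_to_ra(1.1, 1.3, 1, 9, 5))     # True
print(should_switch_to_ra(1.1, 1.3, 1, 9, 1000))  # False: the RA is too expensive
```

Switching at the point where the two costs balance is what yields the "cost of SA ~ cost of RA" behaviour listed above.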

Random access scheduling

Estimating number of random accesses

Setting: lists scanned till some depths by sorted access; candidate items sorted by best score (CA style), each with a current score and a best-score, compared against the current min-top-3 score.

A crude upper estimate: # of items in queue.

Each random access can prune some candidates, so a better estimate of #RAs is necessary.

Random access scheduling

Estimating number of random accesses

Setting: candidate items sorted by best score (CA style); current min-top-3 score; item d with [$s_d$, $b_d$].

If there are at least three items before d with final score > $b_d$, d will be pruned before its random access: d is pruned.

Random access scheduling

Estimating number of random accesses

Setting: candidate items sorted by best score (CA style); current min-top-3 score; item d with [$s_d$, $b_d$].

If there are fewer than three items before d with final score > $b_d$, a random access for d must be made (next: RA for d).

Random access scheduling

Estimating number of random accesses

Setting: candidate items sorted by best score; current min-top-k score; item d with [$s_d$, $b_d$].

For general k: there will be a random access for d if and only if the number of items before d (j−1 items) with final score > $b_d$ is less than k.

Let d be the j-th item $d_j$ by best-score ordering.

For all i < j, define random variables $F_{i,j}$:

$$F_{i,j} = \begin{cases} 1 & \text{if final-score}(d_i) > \text{best-score}(d_j) \\ 0 & \text{otherwise} \end{cases}$$

We compute $\Pr[F_{i,j} = 1]$ using histograms of the score distributions of the lists.

Observation:

$$\Pr[\text{RA is made for } d_j] = \Pr[F_{1,j} + \dots + F_{j-1,j} < k]$$

Expected # of random accesses:

$$\sum_{j} \Pr[F_{1,j} + \dots + F_{j-1,j} < k]$$

where the sum is taken over all candidate items.
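Given the probabilities Pr[F_i,j = 1], the expected number of random accesses can be evaluated directly. The sketch below assumes the indicators are independent and computes Pr[F_1,j + … + F_j-1,j < k] exactly with a small dynamic program; how the probabilities are obtained from histograms is left out, and the input values are hypothetical.

```python
def prob_fewer_than_k(probs, k):
    """Pr[fewer than k of the independent indicators equal 1], for k >= 1."""
    # dist[c] = probability that exactly c indicators seen so far are 1 (c < k).
    dist = [1.0] + [0.0] * (k - 1)
    for p in probs:
        new = [0.0] * k
        for c in range(k):
            new[c] += dist[c] * (1 - p)
            if c + 1 < k:
                new[c + 1] += dist[c] * p
        dist = new
    return sum(dist)


def expected_random_accesses(prob_rows, k):
    """prob_rows[j] holds Pr[F_i,j = 1] for all items i ranked before item j."""
    return sum(prob_fewer_than_k(row, k) for row in prob_rows)


# Hypothetical queue of three candidates, k = 2.
rows = [[], [1.0], [1.0, 1.0]]
print(expected_random_accesses(rows, 2))  # -> 2.0
```

In this toy queue the third candidate is certain to be beaten by two items, so it is pruned without a random access and only two accesses are expected.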

Experiments: estimate of RA

TREC Terabyte data, TREC 2005 adhoc task queries

[Figure: total RA for 50 queries, after all sorted accesses. Bars: queue size (# items in queue), EST (# RA estimated), DONE (# RA actually done).]

Lower bound: what is the best possible?

Try every possible SA-schedule over List 1, List 2, List 3 and count the essential number of RAs that must be done (block size 10,000):

              #SA           cR/cS × #RA     Total cost
Schedule 1    6 × 10000  +  1000 × 75    =  135,000
Schedule 2    9 × 10000  +  1000 × 12    =  102,000
Schedule 3    12 × 10000 +  1000 × 3     =  123,000
…             …             …               …
Lower bound   …             …               102,000

Carefully engineered dynamic programming to try out all schedules.
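The arithmetic of the table can be checked with a few lines. The sketch reads "6 × 10000" as six blocks of 10,000 sorted accesses each and takes cR/cS = 1000 from the table; it only covers the three schedules shown, not the full search.

```python
def total_cost(blocks_read, block_size, cr_over_cs, num_ra):
    """Access cost #SA + cR/cS * #RA, with #SA = blocks_read * block_size."""
    return blocks_read * block_size + cr_over_cs * num_ra


# (blocks read, essential random accesses) for the schedules in the table.
schedules = {"Schedule 1": (6, 75), "Schedule 2": (9, 12), "Schedule 3": (12, 3)}
costs = {name: total_cost(blocks, 10_000, 1000, num_ra)
         for name, (blocks, num_ra) in schedules.items()}
best = min(costs, key=costs.get)
print(costs)  # -> {'Schedule 1': 135000, 'Schedule 2': 102000, 'Schedule 3': 123000}
print(best)   # -> Schedule 2
```

Reading more blocks lowers the number of essential RAs but raises the SA cost, so the minimum sits at an intermediate schedule.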

Experiments: TREC

TREC Terabyte benchmark collection: over 25 million documents, 426 GB raw data; 50 queries from the TREC 2005 adhoc task.

[Figure, left: average cost (#SA + 1000 × #RA) versus k. Curves: full merge, NRA, CA, IO-Top-k (OUR), lower bound.]

[Figure, right: average running time (milliseconds) versus k. Curves: full merge, NRA, CA, IO-Top-k (OUR).]

Experiments: HTTP logs

FIFA World Cup HTTP logs
– World Cup 1998: 1.3 billion HTTP requests
– schema Log(interval, user-id, bytes), aggregated for each user within one-day intervals
– typical query: find the k users with the most usage during June 1-10

[Figure: curves for full merge, NRA, CA, IO-Top-k (OUR), lower bound.]

Experiments: IMDB

IMDB movie data
– more than 375,000 movies, 1,200,000 persons
– attributes: Title, Genre, Actors, Description
– 20 human-generated queries

[Figure: curves for full merge, CA, NRA, IO-Top-k (OUR), lower bound.]

Conclusion

We presented

An inverted block-index data structure
– efficient: optimizes disk access
– performs fast merge in blocks, minimizes overhead

Integrated sorted access and random access scheduling
– SA scheduling: maximizes the benefit of scanning blocks
– RA scheduling: effectively estimates RA-cost at every round
– postpones RA till the end of all SA: saves redundant RAs

Lower bound
– shows that our algorithm is close to the best possible


Thank you!


Appendix

Sorted access scheduling

Knapsack for Benefit Aggregation (KBA)

Pre-compute the expected score $e_{ij}$ of an item seen in block j of list i: (average score of the block).

Pre-compute the score reduction $\Delta_{ij}$ of every block of each list: (max-score of the block − min-score of the block).

Inverted Block-Index:

List 1   List 2   List 3
e11      e21      e31
e12      e22      e32
e13      e23      e33
e14      e24      e34

Candidate item d with [$s_d$, $b_d$] is already seen in list 3. If we scan list 3 further, score $s_d$ and best-score $b_d$ of d do not change.

In list 2, d is not yet seen. If we scan one block from list 2
– either d is found in that block: score $s_d$ of d increases, expected increase = $e_{22}$
– or d is not found in that block: best-score $b_d$ of d decreases by $\Delta_{22}$

Benefit of block B in list i:

$$\sum_{d} \Bigl( e_B \,\Pr[d \text{ found in } B] + \Delta_B \,\bigl(1 - \Pr[d \text{ found in } B]\bigr) \Bigr)$$

The sum is taken over all candidates d not yet seen in list i.
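The KBA benefit of a block can be evaluated as below. This is a minimal sketch in which the probabilities that each unseen candidate is found in the block are given as inputs; how they are estimated is not shown, and the numbers are hypothetical.

```python
def kba_benefit(e_block, delta_block, p_found):
    """Benefit of one block under Knapsack for Benefit Aggregation.

    e_block: expected (average) score of an item in the block.
    delta_block: score reduction of the block (max-score minus min-score).
    p_found: one probability per candidate not yet seen in this list,
             giving the chance that the candidate is found in the block.
    """
    return sum(e_block * p + delta_block * (1 - p) for p in p_found)


# Two hypothetical candidates unseen in the list: one found with probability 0.5.
print(kba_benefit(4.0, 2.0, [0.5, 0.0]))  # -> 5.0
# With all probabilities at 0 only the score-reduction term remains.
print(kba_benefit(4.0, 2.0, [0.0, 0.0]))  # -> 4.0
```

The second call shows that when every probability is zero the benefit reduces to the number of unseen candidates times the score reduction.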

Random access scheduling: details

Estimating number of random accesses

Setting: candidate items sorted by best score; current min-top-k score; item d with [$s_d$, $b_d$].

There will be a random access for d if and only if the number of items before d (j−1 items) with final score > $b_d$ is less than k.

Let d be the j-th item by best-score ordering.

For all i < j, define random variables $F_{i,j}$, which take value 1 if the final score of the i-th item is greater than the best-score of d, and 0 otherwise.

Compute $\Pr[F_{i,j} = 1]$ using the expected score gain of the i-th item from lists where it is not yet seen.

Also define a random variable $R_j$, which takes value 1 if a random access is made for d, and 0 otherwise.

Observation: $\Pr[R_j = 1] = \Pr[F_{1,j} + \dots + F_{j-1,j} < k]$

Let $X_j := F_{1,j} + \dots + F_{j-1,j}$.

Assume the $F_{i,j}$ are independent; then $X_j$ approximately follows a Poisson distribution with mean $\sum_{i} \Pr[F_{i,j} = 1]$.

We can compute $\Pr[X_j < k]$ using the incomplete gamma function.

Expected number of random accesses is

$$\sum_{j} E(R_j) = \sum_{j} \Pr[R_j = 1] = \sum_{j} \Pr[X_j < k]$$

where the sum is taken over all candidate items.
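Under the Poisson assumption, Pr[X_j < k] is a Poisson cumulative probability, which equals the regularized upper incomplete gamma function Q(k, λ). The sketch below evaluates it by direct summation with the standard library instead of calling a gamma routine; the input probabilities are hypothetical.

```python
import math


def poisson_prob_below(k, lam):
    """Pr[X < k] for X ~ Poisson(lam), i.e. the sum of the first k terms."""
    return math.exp(-lam) * sum(lam ** c / math.factorial(c) for c in range(k))


def expected_ra_poisson(prob_rows, k):
    """prob_rows[j] holds Pr[F_i,j = 1] for all items i ranked before item j."""
    return sum(poisson_prob_below(k, sum(row)) for row in prob_rows)


# Hypothetical queue: the first item has no predecessors, the second has two.
rows = [[], [0.5, 0.5]]
estimate = expected_ra_poisson(rows, 1)  # 1 + exp(-1)
```

Direct summation is adequate for the small k used in top-k queries; for large k a library implementation of the incomplete gamma function would be the more robust choice.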

Other Experiments

TREC Terabyte collection indexed with BM25 scores

For different values of the cost of RA compared to the cost of SA
– cR/cS ratio: 100, 1000 and 10000

Varying query size
– title fields: average size 3
– description fields: average size 8

[Figure: two panels, "query size: 3" and "query size: 8"; cost versus (cost of RA)/(cost of SA).]


End of appendix