
IO-Top-k: Index-access Optimized Top-k Query Processing

Debapriyo Majumdar
Max-Planck-Institut für Informatik
Saarbrücken, Germany

Joint work with Holger Bast, Ralf Schenkel, Martin Theobald, Gerhard Weikum

VLDB 2006, Seoul, Korea

Setup

price              resolution         zoom
camera 1  €300     camera 5  8MP      camera 3  7x
camera 3  €330     camera 1  7MP      camera 1  5x
camera 5  €490     camera 4  6MP      camera 2  4x
camera 4  €580     camera 2  4MP      camera 5  4x
…         …        …         …        …         …

Pre-computed index-lists over multiple attributes

single numeric score for every item for each attribute

lists are accessible by both sorted and random accesses

combine scores by some monotonic aggregation function, for example a weighted combination of the form

w1 · res + w2 · zoom − w3 · price

Goal: find the top-k items with highest total scores

Top-k algorithms: example

Three index lists, each sorted by score:

List 1           List 2           List 3
item 25  0.6     item 17  0.6     item 83  0.9
item 78  0.5     item 38  0.6     item 17  0.7
item 83  0.4     item 14  0.6     item 61  0.3
item 17  0.3     item 5   0.6     item 81  0.2
item 21  0.2     item 83  0.5     item 65  0.1
item 91  0.1     item 21  0.3     item 10  0.1
item 44  0.1


Fagin’s NRA Algorithm: round 1

read one item from every list

Candidates, shown as item [current score, best-score]:

item 83 [0.9, 2.1]
item 17 [0.6, 2.1]
item 25 [0.6, 2.1]

min top-2 score: 0.6

maximum score for unseen items: 2.1

min-top-2 < best-score of candidates


Fagin’s NRA Algorithm: round 2

read one item from every list

item 17 [1.3, 1.8]
item 83 [0.9, 2.0]
item 25 [0.6, 1.9]
item 38 [0.6, 1.8]
item 78 [0.5, 1.8]



Fagin’s NRA Algorithm: round 3

item 83 [1.3, 1.9]
item 17 [1.3, 1.7]
item 25 [0.6, 1.5]
item 78 [0.5, 1.4]

no more new items can get into top-2

but, extra candidates left in queue


Fagin’s NRA Algorithm: round 4

item 17 1.6
item 83 [1.3, 1.9]
item 25 [0.6, 1.4]

no more new items can get into top-2

but, extra candidates left in queue


Fagin’s NRA Algorithm: round 5

item 83 1.8
item 17 1.6

Done!

no extra candidate in queue
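The rounds above can be replayed with a small Python sketch. It is only an illustration of round-robin NRA on the three example lists, not the authors' implementation; the names `LISTS` and `nra` are hypothetical, and scores are stored in tenths (6 means 0.6) so that the sums stay exact.

```python
# Example lists from the slides; scores in tenths to avoid float rounding.
LISTS = [
    [(25, 6), (78, 5), (83, 4), (17, 3), (21, 2), (91, 1), (44, 1)],  # List 1
    [(17, 6), (38, 6), (14, 6), (5, 6), (83, 5), (21, 3)],            # List 2
    [(83, 9), (17, 7), (61, 3), (81, 2), (65, 1), (10, 1)],           # List 3
]


def nra(lists, k):
    """Round-robin sorted accesses only; returns (top-k, number of rounds)."""
    m = len(lists)
    seen = {}  # item -> {list index: score read from that list}
    high = [lst[0][1] for lst in lists]  # bound on unread scores, per list
    rounds = 0
    while rounds < max(len(lst) for lst in lists):
        for i, lst in enumerate(lists):
            if rounds < len(lst):
                item, score = lst[rounds]
                seen.setdefault(item, {})[i] = score
                high[i] = score
            else:
                high[i] = 0  # list exhausted
        rounds += 1

        # Current score (lower bound) and best-score (upper bound) per item.
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}
        top = sorted(seen, key=worst.get, reverse=True)[:k]
        if len(top) < k:
            continue
        min_k = worst[top[-1]]
        rivals = [d for d in seen if d not in top]
        # Stop when neither unseen items nor queued candidates can enter the top-k.
        if sum(high) <= min_k and all(best[d] <= min_k for d in rivals):
            break
    return [(d, worst[d]) for d in top], rounds


top, rounds = nra(LISTS, 2)
print(top, rounds)
```

On the example this stops after 5 rounds, that is 15 sorted accesses, with item 83 (1.8) and item 17 (1.6) as the top-2.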

Top-k algorithms

NRA (No Random Access) performs only sorted accesses (SA)

Random access (RA)
– lookup actual (final) score of an item
– costlier than SA (100 – 100,000 times), cR/cS := (cost of RA)/(cost of SA)
– often very useful

CA (Combined Algorithm) (Fagin et al., 2001)
– one RA after every cR/cS SAs
– total cost of SA ~ total cost of RA

Measure of effectiveness (access cost): #SA + cR/cS x #RA

Full-merge: compute scores for all items followed by partial sort
– simple and efficient
– important baseline for any top-k algorithm

Problems with NRA, CA
– high bookkeeping overhead: cannot beat full-merge in runtime
– for “high” values of k, even the gain in access cost is not significant
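As a point of reference, the full-merge baseline and the access-cost measure can be sketched in a few lines of Python. This is a minimal illustration on the example lists, assuming scores stored in tenths; `full_merge` and `access_cost` are hypothetical names, not part of any published code.

```python
import heapq
from collections import defaultdict

# Example lists from the slides; scores in tenths to avoid float rounding.
LISTS = [
    [(25, 6), (78, 5), (83, 4), (17, 3), (21, 2), (91, 1), (44, 1)],  # List 1
    [(17, 6), (38, 6), (14, 6), (5, 6), (83, 5), (21, 3)],            # List 2
    [(83, 9), (17, 7), (61, 3), (81, 2), (65, 1), (10, 1)],           # List 3
]


def full_merge(lists, k):
    """Read every entry of every list, sum scores per item, partial-sort."""
    totals = defaultdict(int)
    for lst in lists:
        for item, score in lst:
            totals[item] += score
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])


def access_cost(num_sa, num_ra, cost_ratio):
    """Access cost as defined on the slide: #SA + cR/cS x #RA."""
    return num_sa + cost_ratio * num_ra


print(full_merge(LISTS, 2))
print(access_cost(sum(len(lst) for lst in LISTS), 0, 1000))
```

Full-merge reads all 19 entries of the example lists and performs no random access, so its access cost here is 19 regardless of the cR/cS ratio.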

Top-k algorithms

Main related previous works: NRA, CA

Greedy heuristics for sorted access scheduling, based on crude estimate of scores (Güntzer, Balke, Kiessling, ITCC 2001)

RankSQL: ordering of binary rank joins at query planning time (Ilyas et al., SIGMOD ’04 and Li et al., SIGMOD ’05)

Scheduling RAs on “expensive predicates”, where SAs may not even be possible on all attributes (our setting is different)
– MPro (Chang and Hwang, SIGMOD 2002)
– Upper, Pick (Bruno, Gravano and Marian, ICDE ’02, ACM TODS ’04)

Probabilistic pruning of candidates, RA scheduling (Theobald, Schenkel and Weikum, VLDB ’04, VLDB ’05)

Our algorithm: IO-Top-k


Round 1: same as NRA

item 83

[0.9, 2.1]

item 17

[0.6, 2.1]

item 25

[0.6, 2.1]




Round 2: not necessarily round robin

item 17 [1.3, 1.8]
item 83 [0.9, 2.0]
item 25 [0.6, 1.9]
item 78 [0.5, 1.4]






Round 3

item 17 1.6
item 83 [1.3, 1.9]
item 25 [0.6, 1.4]

potential candidate for top-2


Round 4: random access for item 83

item 83 1.8
item 17 1.6

Done!

fewer sorted accesses

carefully scheduled random access

no extra candidate in queue
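The following Python sketch illustrates the two-phase idea on the example lists: sorted accesses down to chosen depths, then random accesses only for candidates that can still matter. The scan depths (4, 1, 4) are an assumption picked to be consistent with the candidate bounds shown in Round 3; the slides do not state the actual schedule, and `two_phase_top_k` is a hypothetical name. Scores are stored in tenths.

```python
# Example lists from the slides; scores in tenths to avoid float rounding.
LISTS = [
    [(25, 6), (78, 5), (83, 4), (17, 3), (21, 2), (91, 1), (44, 1)],  # List 1
    [(17, 6), (38, 6), (14, 6), (5, 6), (83, 5), (21, 3)],            # List 2
    [(83, 9), (17, 7), (61, 3), (81, 2), (65, 1), (10, 1)],           # List 3
]


def two_phase_top_k(lists, depths, k):
    """Phase 1: sorted access to the given depths (each between 1 and the
    list length, with at least k items seen). Phase 2: random accesses in
    descending best-score order. Returns (top-k, #SA, #RA)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    seen = {}
    for i, (lst, depth) in enumerate(zip(lists, depths)):
        for item, score in lst[:depth]:
            seen.setdefault(item, {})[i] = score
    # Last score read from each list bounds every unread score in it.
    high = [lst[depth - 1][1] for lst, depth in zip(lists, depths)]
    worst = {d: sum(s.values()) for d, s in seen.items()}

    def best(d):
        return worst[d] + sum(high[i] for i in range(m) if i not in seen[d])

    def min_k():
        return sorted(worst.values(), reverse=True)[k - 1]

    if sum(high) > min_k():
        raise ValueError("unseen items could still reach the top-k: scan deeper")

    num_ra = 0
    for d in sorted(seen, key=best, reverse=True):
        missing = [i for i in range(m) if i not in seen[d]]
        if not missing or best(d) <= min_k():
            continue  # score already exact, or d cannot beat the current top-k
        for i in missing:
            seen[d][i] = lookup[i].get(d, 0)
            num_ra += 1
        worst[d] = sum(seen[d].values())
    top = sorted(worst.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return top, sum(depths), num_ra


print(two_phase_top_k(LISTS, (4, 1, 4), 2))
```

With these assumed depths the sketch uses 9 sorted accesses and a single random access (for item 83 in list 2), after which item 25 is pruned without any lookup.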


Outline

Our contributions
– Inverted block-index data structure
– Sorted access scheduling
– Random access scheduling
– Lower bound

Experiments

Conclusion

Inverted block-index

Lists are first sorted by score

split into blocks

sort each block by item-id

Choose block size balancing disk seek time and data transfer rate

Top-k algorithm with block-index

blocks are sorted by item ids, efficiently merged by full-merge!

full-merge the blocks read in round 1, then those read in round 2, and so on; the rest of the lists is pruned

Low overhead: prune once every round
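A minimal Python sketch of the block layout and the per-round merge, assuming an in-memory list of (item id, score) pairs per attribute. `build_block_index` and `merge_blocks` are hypothetical names, the block size of 3 is only for illustration, and scores are stored in tenths.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Example lists from the slides; scores in tenths to avoid float rounding.
LISTS = [
    [(25, 6), (78, 5), (83, 4), (17, 3), (21, 2), (91, 1), (44, 1)],  # List 1
    [(17, 6), (38, 6), (14, 6), (5, 6), (83, 5), (21, 3)],            # List 2
    [(83, 9), (17, 7), (61, 3), (81, 2), (65, 1), (10, 1)],           # List 3
]


def build_block_index(entries, block_size):
    """Sort by descending score, cut into blocks, sort each block by item id."""
    by_score = sorted(entries, key=itemgetter(1), reverse=True)
    blocks = [by_score[i:i + block_size]
              for i in range(0, len(by_score), block_size)]
    return [sorted(block) for block in blocks]


def merge_blocks(blocks):
    """Merge blocks that are each sorted by item id, summing scores per item."""
    merged = heapq.merge(*blocks)  # a streaming merge, as in full-merge
    return [(item, sum(score for _, score in group))
            for item, group in groupby(merged, key=itemgetter(0))]


indexes = [build_block_index(lst, 3) for lst in LISTS]
first_round = merge_blocks([index[0] for index in indexes])
print(first_round)
```

Merging the first block of every list yields the same partial scores that NRA holds after three round-robin rounds, but computed with one sequential merge instead of per-item bookkeeping.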

Sorted access scheduling


General Paradigm

Inverted Block-Index, where bij denotes block j of list i:

b11 b21 b31
b12 b22 b32
b13 b23 b33
b14 b24 b34

We assign benefits to every block of each list

Optimization problem
– Goal: choose a total of 3 blocks from any of the lists such that the total benefit is maximized; this allows scans to different depths in lists
– We can show: this problem is NP-Hard, the well known Knapsack problem reduces to it
– But, the number of blocks to choose and number of lists to choose from are small: we can solve it by enumerating all possibilities
– We choose the schedule with maximum benefit, and continue to next round
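The enumeration step can be sketched as follows, assuming that a schedule takes the next c_i unscanned blocks from list i (lists are read top-down) and that the benefit of a schedule is the sum of its blocks' benefits. The benefit numbers and the name `best_schedule` are made up for illustration.

```python
from itertools import product


def best_schedule(benefits, budget):
    """benefits[i][j] is the benefit of the j-th not-yet-scanned block of list i.
    Tries every way to split `budget` blocks across the lists."""
    best_counts, best_total = None, float("-inf")
    for counts in product(range(budget + 1), repeat=len(benefits)):
        if sum(counts) != budget:
            continue
        if any(c > len(b) for c, b in zip(counts, benefits)):
            continue  # cannot scan more blocks than a list has left
        total = sum(sum(b[:c]) for c, b in zip(counts, benefits))
        if total > best_total:
            best_counts, best_total = counts, total
    return best_counts, best_total


# Hypothetical benefits for the next three blocks of three lists.
example_benefits = [[5, 1, 1], [2, 2, 2], [4, 3, 0]]
print(best_schedule(example_benefits, 3))
```

For these made-up benefits the best choice is one block from list 1 and two from list 3, so the scan depths differ across lists. The search space grows exponentially with the number of lists, which is acceptable only because both numbers are small, as the slide notes.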



Knapsack for Score Reduction (KSR)

Pre-compute score reduction $\Delta_{ij}$ of every block of each list: (max-score of the block – min-score of the block)

Figure: candidate item d with [$s_d$, $b_d$]; lists scanned till some depth

Candidate item d is already seen in list 3. If we scan list 3 further, score $s_d$ and best-score $b_d$ of d do not change: no benefit

In list 2, d is not yet seen. If we scan one block (block 22) from list 2
– with high probability d will not be found in that block: best-score $b_d$ of d decreases by $\Delta_{22}$

Benefit of block B in list i

$\sum_d \Delta_B \, (1 - \Pr[d \text{ found in } B]) \approx \sum_d \Delta_B$

sum taken over all candidates d not yet seen in list i
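Under the approximation above, the KSR benefit of a block is its score reduction times the number of candidates not yet seen in that list. A small Python sketch, with a hypothetical `ksr_benefit` helper and a made-up candidate queue (scores in tenths):

```python
def ksr_benefit(block_scores, seen_in, list_index):
    """Approximate KSR benefit: (max - min score of the block) summed over
    all candidates that have not yet been seen in list `list_index`."""
    delta = max(block_scores) - min(block_scores)
    unseen = sum(1 for lists_seen in seen_in.values()
                 if list_index not in lists_seen)
    return delta * unseen


# Hypothetical state: which lists (0-based) each candidate was already seen in.
seen_in = {83: {0, 2}, 17: {0, 1, 2}, 25: {0}}
# Next block of list 2 (index 1) holds scores 0.6, 0.5, 0.3, stored in tenths.
print(ksr_benefit([6, 5, 3], seen_in, 1))
```

Here two candidates (items 83 and 25) are still unseen in list 2 and the block spans 0.6 down to 0.3, so scanning it is credited with a total best-score reduction of 0.6.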

Random access scheduling


Redundant random accesses of CA

CA: one RA after every cR/cS SAs

Many RAs turn out to be redundant: CA makes an RA for item d, but d is found anyway in a subsequent SA round

Our strategy: two-phase algorithm

First sorted access rounds only, then switch to random access: no redundant random access

Switch from SA to RA, when
– max-score for unseen ≤ min-top-k score
– estimated RA-cost ≤ total SA-cost so far (need to estimate cost of RA)
– cost of SA ~ cost of RA

Figure: lists scanned till some depths by sorted access; candidate items sorted by best score, with the current min-top-3 score marked; random accesses are made CA style

Estimating number of random accesses

Figure: candidate items sorted by best score, each with its current score and best-score; the current min-top-k score is marked

A crude upper estimate: # of items in queue

Each random access can prune some candidates, so a better estimate of #RAs is necessary

Example with k = 3, for a candidate item d with [$s_d$, $b_d$]:
– If there are at least three items before d with final score > $b_d$, d will be pruned before random access
– If there are fewer than three items before d with final score > $b_d$, a random access for d must be made

For general k: there will be a random access for d if and only if the number of items before d (j-1 items) with final score > $b_d$ is less than k

Let d be the j-th item $d_j$ in best-score ordering

For all i < j, define random variables $F_{i,j}$:
$F_{i,j} = 1$ if final-score($d_i$) > best-score(d), 0 otherwise

We compute $\Pr[F_{i,j} = 1]$ using histograms of the score distributions of the lists

Observation:

$\Pr[\text{RA is made for } d] = \Pr[F_{1,j} + \dots + F_{j-1,j} < k]$

Expected # of random accesses

$\sum_j \Pr[F_{1,j} + \dots + F_{j-1,j} < k]$

the sum is taken over all candidate items

Experiments: estimate of RA

TREC Terabyte data, TREC 2005 adhoc task queries

After all sorted accesses: #items in queue, #RA estimated (EST) and #RA actually done (DONE)

[Figure: total RA for 50 queries, comparing queue size, EST and DONE]

Lower bound: what is the best possible?

Try every possible SA-schedule (scan depths in List 1, List 2, List 3)

Count essential number of RAs that must be done

block size 10,000

              #SA          +  cR/cS x #RA  =  Total cost
Schedule 1    6 x 10000    +  1000 x 75    =  135,000
Schedule 2    9 x 10000    +  1000 x 12    =  102,000
Schedule 3    12 x 10000   +  1000 x 3     =  123,000
…             …               …               …
Lower bound                                   102,000

carefully engineered dynamic programming to try out all schedules
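The arithmetic in the table can be checked with a few lines of Python. The sketch only covers the three schedules listed on the slide and takes their minimum; the actual lower bound ranges over all schedules via dynamic programming, which is not reproduced here. `total_cost` is a hypothetical helper.

```python
def total_cost(blocks_scanned, block_size, cost_ratio, num_ra):
    """#SA + cR/cS x #RA, with #SA = blocks scanned x block size."""
    return blocks_scanned * block_size + cost_ratio * num_ra


# (blocks scanned, essential RAs) for the schedules shown on the slide.
schedules = {"Schedule 1": (6, 75), "Schedule 2": (9, 12), "Schedule 3": (12, 3)}
costs = {name: total_cost(blocks, 10_000, 1000, ras)
         for name, (blocks, ras) in schedules.items()}
cheapest = min(costs, key=costs.get)
print(costs, cheapest)
```

Among the listed schedules, Schedule 2 is cheapest: scanning more blocks than Schedule 1 saves enough random accesses to pay off, while Schedule 3 scans too far.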

Experiments: TREC

TREC Terabyte benchmark collection: over 25 million documents, 426 GB raw data

50 queries from TREC 2005 adhoc task

[Figure: average cost (#SA + 1000 x #RA) versus k (10, 50, 100, 200, 500), axis from 0 to 4,000,000; curves: full merge, NRA, CA, IO-Top-k (OUR), lower bound]

[Figure: average running time (milliseconds) versus k (10, 50, 100, 200, 500), axis from 0 to 250; curves: full merge, NRA, CA, IO-Top-k (OUR)]

Experiments: HTTP logs

FIFA World Cup HTTP logs

World Cup 1998: 1.3 billion HTTP requests

schema Log(interval, user-id, bytes), aggregated for each user within one-day intervals

typical query: find k users with most usage during June 1-10

[Figure: comparison of full merge, NRA, CA, IO-Top-k (OUR) and lower bound]

Experiments: IMDB

IMDB movie data: more than 375,000 movies, 1,200,000 persons

attributes: Title, Genre, Actors, Description

20 human generated queries

[Figure: comparison of full merge, NRA, CA, IO-Top-k (OUR) and lower bound]

Conclusion

We presented

An inverted block-index data structure
– efficient: optimizes disk access
– performs fast merge in blocks, minimizes overhead

Integrated sorted access and random access scheduling
– SA scheduling: maximizes benefit of scanning blocks
– RA scheduling: effectively estimates RA-cost at every round
– postpones RA till the end of all SA: saves redundant RAs

Lower bound
– shows that our algorithm is close to the best possible

Thank you!

Appendix



Knapsack for Benefit Aggregation (KBA)

Pre-compute expected score $e_{ij}$ of an item seen in block j of list i: (average score of the block)

Pre-compute score reduction $\Delta_{ij}$ of every block of each list: (max-score of the block – min-score of the block)

Expected scores per block (column: list i, row: block j):

e11 e21 e31
e12 e22 e32
e13 e23 e33
e14 e24 e34

Candidate item d, with [$s_d$, $b_d$], is already seen in list 3. If we scan list 3 further, score $s_d$ and best-score $b_d$ of d do not change

In list 2, d is not yet seen. If we scan one block from list 2
– either d is found in that block: score $s_d$ of d increases, expected increase = $e_{22}$
– or d is not found in that block: best-score $b_d$ of d decreases by $\Delta_{22}$

Benefit of block B in list i

$\sum_d \left( e_B \Pr[d \text{ found in } B] + \Delta_B \, (1 - \Pr[d \text{ found in } B]) \right)$

The sum is taken over all candidates d not yet seen in list i

item d[sd,bd]
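The KBA benefit formula translates directly into code. In this sketch the per-candidate probabilities of being found in the block are simply passed in; how they are estimated is not shown here, and `kba_benefit` as well as the numbers are hypothetical.

```python
def kba_benefit(expected_score, delta, p_found):
    """Sum over candidates not yet seen in the list of
    e_B * Pr[d found in B] + Delta_B * (1 - Pr[d found in B])."""
    return sum(expected_score * p + delta * (1 - p) for p in p_found)


# Made-up block with average score 0.5 and score reduction 0.25, and two
# candidates with assumed probabilities 0.5 and 0.0 of being in the block.
print(kba_benefit(0.5, 0.25, [0.5, 0.0]))
```

Setting every probability to 0 reduces this to the approximate KSR benefit, the score reduction times the number of unseen candidates.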

Random access scheduling: details

Figure: candidate items sorted by best score, with the current min-top-k score marked; item d with [$s_d$, $b_d$] has j-1 items before it

Let d be the j-th item in best-score ordering

There will be a random access for d if and only if the number of items before d with final score > $b_d$ is less than k

For all i < j, define random variables $F_{i,j}$ which take value 1 if the final score of the i-th item is greater than the best-score of d, 0 otherwise

Compute $\Pr[F_{i,j} = 1]$ using the expected score gain of the i-th item from lists where it is not yet seen

Also define a random variable $R_j$ which takes value 1 if a random access is made for d, 0 otherwise

Observation: $\Pr[R_j = 1] = \Pr[F_{1,j} + \dots + F_{j-1,j} < k]$

Let $X_j := F_{1,j} + \dots + F_{j-1,j}$

Assume the $F_{i,j}$ are independent; then $X_j$ approximately follows a Poisson distribution with mean $\sum_{i<j} \Pr[F_{i,j} = 1]$

We can compute $\Pr[X_j < k]$ using the incomplete gamma function

Expected number of random accesses is

$\sum_j E(R_j) = \sum_j \Pr[R_j = 1] = \sum_j \Pr[X_j < k]$

the sum is taken over all candidate items

j-1 items
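A short Python sketch of this estimate under the Poisson approximation. The probabilities Pr[F_{i,j} = 1] are taken as given input, and the layout `p[j][i]` (0-based, candidates in best-score order) and both function names are assumptions for illustration. For integer k the Poisson tail Pr[X < k] equals the regularized upper incomplete gamma function Q(k, λ), which is evaluated here as a finite sum.

```python
import math


def poisson_below(k, lam):
    """Pr[X < k] for X ~ Poisson(lam), as a finite sum over x = 0 .. k-1."""
    return sum(math.exp(-lam) * lam ** x / math.factorial(x) for x in range(k))


def expected_random_accesses(p, k):
    """p[j][i] = Pr[F_{i,j} = 1] for i < j. Returns sum_j Pr[X_j < k], where
    X_j is approximated as Poisson with mean sum_{i<j} p[j][i]."""
    return sum(poisson_below(k, sum(p[j][:j])) for j in range(len(p)))


# If no earlier item can exceed any best-score, every candidate needs an RA.
print(expected_random_accesses([[], [0.0], [0.0, 0.0]], 2))
```

With all probabilities at 0 the estimate equals the queue size, the crude upper estimate. As probabilities grow, later candidates contribute less, reflecting that they are likely to be pruned before their turn.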

Other Experiments

For different values of cost of RA compared to cost of SA
– cR/cS ratio: 100, 1000 and 10000

Varying query size
– title fields: average size 3
– description fields: average size 8

TREC Terabyte collection indexed with BM25 scores

[Figures: cost plotted against (cost of RA)/(cost of SA), and cost for query size 3 and query size 8; the cost axes run from 0 to 20,000,000 and from 0 to 3,000,000]

End of appendix
