Top-k Query Evaluation with Probabilistic Guarantees

Martin Theobald, Gerhard Weikum, Ralf Schenkel
Presenter: Avinandan Sengupta

Presentation Outline

• Introduction to Top-k query processing
• The threshold algorithm and its variants
• Are we solving the right problem?
• A probabilistic algorithm
• Implementation Details
• Results
• Conclusion

Data and a Query

Scrip ID   Earnings Per Share   P/E ratio   β      ...   Average Market Cap (B$)
SNPS       1.27                 17.63       0.69   ...   3.27
IBM        12.28                13.85       0.72   ...   200
...        ...                  ...         ...    ...   ...
INFY       2.72                 19.51       1.17         30.4
MSFT       2.70                 9.32        1.03         210
GOOG       27.73                19.33       1.13         173

Rows are objects; columns are attributes.

Query: Top 10 midcap stocks with low β

Hypothetical DB of NASDAQ traded stocks. Data collated from Google Finance.

Hypothetical Graded Lists (made fit for consumption by Top-k processors)

P/E Ratio (norm)   β⁻¹ (norm)    Average MarketCap (norm)
INFY: 1            SNPS: 1       SNPS: 1
GOOG: 0.99         IBM: 0.96     INFY: 0.80
SNPS: 0.90         MSFT: 0.67    GOOG: 0.05
IBM: 0.70          GOOG: 0.61    IBM: 0.07
MSFT: 0.47         INFY: 0.59    MSFT: 0.08

Normalization:
• P/E Ratio: PE_j / highest PE
• β⁻¹: β⁻¹_j / max(β⁻¹_j)
• Average MarketCap: grades based on how close the market cap is to the midcap median (≅ 4.5 B$); normalized

Aggregate function (weights applied to the normalized grades):

f = 0.5*P/E + 1.0*β⁻¹ + 1.0*MCap

Top-k results

Graded lists (P/E Ratio, β⁻¹, Average MarketCap) → Top-k Processor → Top-k list:

SNPS, X
INFY, Y
GOOG, Z

Fagin’s Threshold Algorithm (TA)

• Access the n lists in parallel (sorted access, one row at a time).
• As an object o_i is seen, perform a random access to the other lists to find the complete score for o_i.
• Do the same for all objects in the current row.
• Now compute the threshold τ as the sum of the scores in the current row (in general, the aggregate function applied to them).
• The algorithm stops once k objects have been found with a score of at least τ.
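The steps above can be sketched in Python as follows. This is a minimal illustration, not the presenter's or the paper's code: the function name `threshold_algorithm`, the dictionary-based random access, and the choice k = 2 are assumptions made for the example. The grades are the hypothetical ones from the graded lists, the aggregate is the weighted sum f, and the MarketCap list is re-sorted by grade because sorted access assumes descending order.

```python
import heapq

def threshold_algorithm(lists, weights, k):
    """Sketch of TA for a weighted-sum aggregate.

    lists: one list of (object, grade) pairs per attribute, sorted by grade
           in descending order; all lists are assumed to have equal length.
    Returns the top-k (object, score) pairs and the number of rows read.
    """
    # Random access is simulated with one dictionary per list.
    lookup = [dict(lst) for lst in lists]
    full_scores = {}
    depth = len(lists[0])
    for row in range(depth):
        last_seen = []
        for lst in lists:
            obj, grade = lst[row]          # sorted access
            last_seen.append(grade)
            if obj not in full_scores:     # random access to the other lists
                full_scores[obj] = sum(
                    w * table[obj] for w, table in zip(weights, lookup)
                )
        # Threshold: aggregate of the grades in the current row.
        tau = sum(w * g for w, g in zip(weights, last_seen))
        top = heapq.nlargest(k, full_scores.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= tau:
            return top, row + 1
    return heapq.nlargest(k, full_scores.items(), key=lambda kv: kv[1]), depth

pe = [("INFY", 1.00), ("GOOG", 0.99), ("SNPS", 0.90), ("IBM", 0.70), ("MSFT", 0.47)]
inv_beta = [("SNPS", 1.00), ("IBM", 0.96), ("MSFT", 0.67), ("GOOG", 0.61), ("INFY", 0.59)]
mcap = sorted(
    [("SNPS", 1.00), ("INFY", 0.80), ("GOOG", 0.05), ("IBM", 0.07), ("MSFT", 0.08)],
    key=lambda pair: pair[1],
    reverse=True,
)

top, rows_read = threshold_algorithm([pe, inv_beta, mcap], [0.5, 1.0, 1.0], k=2)
print([obj for obj, _ in top], rows_read)  # ['SNPS', 'INFY'] 3
```

With these grades the threshold drops to 1.20 in the third row, below the second-best complete score, so the scan stops after three of the five rows.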

TA with No Random Access (TA-NRA)

• Access the n lists in parallel.
• For an item a, compute its (B)est score
    B_a = f( f{score_j | j ∈ seen-attributes(a)}, f{high_i | i ∉ seen-attributes(a)} )
  where high_i = last seen score for the i-th attribute,
  and its (W)orst score
    W_a = f( f{score_j | j ∈ seen-attributes(a)}, f{0 | i ∉ seen-attributes(a)} )
• Halt when k distinct objects have been seen and there is no object m outside the Top-k list with B_m > W_k, where W_k is the worst score of the k-th entry. This means that we also maintain a table of all seen objects with their W/B scores.

Running Top-k list; contains the k objects with the largest W values; ties broken with B values:

SNPS, W1, B1
INFY, W2, B2
GOOG, Wk, Bk
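A minimal Python sketch of the worst/best-score bookkeeping described above, assuming a weighted-sum aggregate. The function name `nra`, the data layout, and k = 2 are illustrative assumptions. The grades are the hypothetical ones from the graded lists, with the MarketCap list re-sorted in descending order. Objects not yet seen in any list are covered by a virtual candidate whose best score is the aggregate of the current high values.

```python
def nra(lists, weights, k):
    """Sketch of TA-NRA: sorted access only, with worst/best score bounds.

    lists: one list of (object, grade) pairs per attribute, sorted by grade
           in descending order; all lists are assumed to have equal length.
    Returns the top-k (object, worst, best) triples and the rows read.
    """
    n = len(lists)
    seen = {}                  # object -> {list index: grade}
    high = [0.0] * n           # last seen grade per list
    depth = len(lists[0])
    top = []
    for row in range(depth):
        for i, lst in enumerate(lists):
            obj, grade = lst[row]
            seen.setdefault(obj, {})[i] = grade
            high[i] = grade
        # Worst score: unseen attributes count as 0.
        worst = {
            o: sum(weights[i] * g for i, g in part.items())
            for o, part in seen.items()
        }
        # Best score: unseen attributes count as the last seen grade.
        best = {
            o: worst[o] + sum(weights[i] * high[i] for i in range(n) if i not in part)
            for o, part in seen.items()
        }
        ranked = sorted(seen, key=lambda o: (worst[o], best[o]), reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            w_k = worst[top[-1]]
            unseen_best = sum(w * h for w, h in zip(weights, high))
            if unseen_best <= w_k and all(best[o] <= w_k for o in rest):
                return [(o, worst[o], best[o]) for o in top], row + 1
    return [(o, worst[o], best[o]) for o in top], depth

pe = [("INFY", 1.00), ("GOOG", 0.99), ("SNPS", 0.90), ("IBM", 0.70), ("MSFT", 0.47)]
inv_beta = [("SNPS", 1.00), ("IBM", 0.96), ("MSFT", 0.67), ("GOOG", 0.61), ("INFY", 0.59)]
mcap = [("SNPS", 1.00), ("INFY", 0.80), ("MSFT", 0.08), ("IBM", 0.07), ("GOOG", 0.05)]

result, rows_read = nra([pe, inv_beta, mcap], [0.5, 1.0, 1.0], k=2)
print([obj for obj, _, _ in result], rows_read)  # ['SNPS', 'INFY'] 5
```

On this toy data the bounds only close in the last row: INFY's β⁻¹ grade sits at the bottom of its list, so its best score stays above the running W_k until then.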

Issues with TA and TA-NRA

• High space-time costs
• Overly conservative

Are we solving the right problem?

• Is random access possible in most common scenarios?
  – Web content
  – XML data, hierarchical data sets
• Does the user need an exact top-k query result?
  – Or is she satisfied with an approximation?

How about an approximate solution?

• Can we remove candidates (objects that we think can make it to the top-k list) from consideration early on in the process?
  – Quickly reach solution

Pictorially...

Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)

Probabilistic TA-NRA - 1

• Predict the total score of an item for which a partial score is known

• Avoid the overly conservative best-score/worst-score bounds of the original TA-NRA
  – Instead, calculate the probability that the total score of the item exceeds a threshold (making the item interesting for the top-k result)

Probabilistic TA-NRA - 2

• If this probability is sufficiently low (below a threshold), drop the item from the candidate list.

• The probabilistic prediction involves computing the convolution of the score distributions of different index lists.
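The pruning test can be sketched as follows, assuming a summation aggregate and a small discrete score distribution per list. The helper names, the invented distribution `list_dist`, and the cutoff `epsilon` are all assumptions made for illustration; the estimator in the paper may differ. Each distribution is assumed to describe the still-unseen part of its list.

```python
def convolve(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q.

    p and q map score values to probabilities.
    """
    out = {}
    for x, px in p.items():
        for y, py in q.items():
            key = round(x + y, 6)   # merge sums that differ only by float noise
            out[key] = out.get(key, 0.0) + px * py
    return out

def exceed_probability(worst, unseen_dists, min_k):
    """P[worst + sum of unseen scores > min_k] for one candidate."""
    total = {0.0: 1.0}
    for dist in unseen_dists:
        total = convolve(total, dist)
    delta = min_k - worst           # gap the unseen scores still have to close
    return sum(prob for value, prob in total.items() if value > delta)

def should_drop(worst, unseen_dists, min_k, epsilon):
    """Drop the candidate when its chance of reaching the top-k is below epsilon."""
    return exceed_probability(worst, unseen_dists, min_k) < epsilon

# Invented score distribution, shared by two unseen lists.
list_dist = {0.1: 0.5, 0.5: 0.3, 0.9: 0.2}

p_keep = exceed_probability(0.5, [list_dist, list_dist], min_k=1.3)
drop = should_drop(0.0, [list_dist, list_dist], min_k=1.5, epsilon=0.1)
print(round(p_keep, 2), drop)  # 0.45 True
```

The first candidate still has a 45% chance of overtaking the current k-th score and stays in the queue; the second has only a 4% chance and is dropped at ε = 0.1.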

Score Distribution of Lists - How?

β⁻¹ (norm)
SNPS: 1
IBM: 0.96
MSFT: 0.67
GOOG: 0.61
INFY: 0.59

[Figure: score distribution of the list; score axis from 0.59 to 1.0, median 0.65]

Approach: parameter fitting / curve fitting
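One simple way to obtain a score distribution from a list is an equi-width histogram. The sketch below is an illustration only, not the fitting procedure from the slide (which names parameter and curve fitting); the function name and the choice of five bins are assumptions.

```python
def equi_width_histogram(scores, bins, lo=0.0, hi=1.0):
    """Relative frequency of scores in `bins` equal-width buckets over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in scores:
        # A score equal to `hi` falls into the last bucket.
        idx = min(int((s - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(scores) for c in counts]

# Normalized β⁻¹ grades from the list above.
inv_beta_scores = [1.0, 0.96, 0.67, 0.61, 0.59]
hist = equi_width_histogram(inv_beta_scores, bins=5)
print(hist)  # [0.0, 0.0, 0.2, 0.4, 0.4]
```

Each entry is the fraction of list entries in the buckets [0, 0.2), [0.2, 0.4), and so on; such a histogram can be computed once per list, offline.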

What it is and What it is not

• Probabilistic guarantees are not about query run-times but about query result quality

• Probabilistic guarantees refer to the approximation of the top-k ranks in a completely scored and exactly ranked result set

The Math

Set of seen attributes for an object

More Math...

What distributions to consider?

• Uniform distribution
  – simplest assumptions
  – convolutions based on moment-generating functions with generalized Chernoff-Hoeffding bounds
• Poisson estimations
  – efficiently evaluated, provides a reasonable fit for tf*idf based score distributions for Web corpora
• Histograms
  – when above methods fail
  – involves non-trivial computation (done offline per list)
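For the uniform case, a tail bound can stand in for an explicit convolution. The sketch below uses the plain Hoeffding inequality, a simpler relative of the generalized Chernoff-Hoeffding bounds named above, and assumes that each unseen score is independent and uniform on [0, high_i]. The function name and the example numbers are illustrative assumptions.

```python
import math

def hoeffding_exceed_bound(highs, delta):
    """Upper bound on P[S_1 + ... + S_m > delta].

    Assumes independent S_i, each uniform on [0, highs[i]], so that
    E[S_i] = highs[i] / 2 and S_i has range width highs[i].
    Hoeffding: P[sum - mean >= t] <= exp(-2 t^2 / sum(width_i^2)) for t > 0.
    """
    mean = sum(h / 2.0 for h in highs)
    t = delta - mean
    if t <= 0:
        return 1.0              # the bound says nothing below the mean
    width_sq = sum(h * h for h in highs)
    if width_sq == 0:
        return 0.0              # all scores are exactly 0 and delta > 0
    return math.exp(-2.0 * t * t / width_sq)

# Last seen grades of three lists and a gap of 1.5 still to be closed.
bound = hoeffding_exceed_bound([0.90, 0.67, 0.08], delta=1.5)
print(round(bound, 3))  # 0.487
```

The bound is loose but cheap to evaluate, which is the appeal over computing the full convolution for every candidate.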

Solving Convolutions? Difficult

• When the PDF is a uniform distribution, the convolution becomes difficult to solve
  – Use techniques other than direct convolution
  – Off-load computation to available probabilistic engines, e.g. OpenMaple

Queue Management

Source: http://www.mpi-inf.mpg.de/~mtb/pub/imprs-topk-xml_poster.pdf (author’s webpage)

Results

Performance as a function of ε

Source: Paper

Precision of probabilistic predictors for tf*idf, Uniform-, and Zipf-distributed scores

Source: Paper

Conclusion

• New algorithms were developed based on probabilistic score predictions
  – Trade off a small amount of top-k result quality for a drastic reduction of sorted accesses
• Intelligent management of priority queues for efficient implementation was presented
• The aggregation function was assumed to be summation
• Future work: ranked retrieval of XML data and integration into the XXL search engine

Thanks!
