ranksql: query algebra and optimization for relational top-k queries

23
1 RankSQL: Query Algebra and RankSQL: Query Algebra and Optimization for Relational Top-k Optimization for Relational Top-k Queries Queries AUTHORS: Chengkai Li Kevin Chen-Chuan Chang Ihab F. Ilyas Sumin Song Presenter: Roman Yarovoy October 3, 2007

Upload: norton

Post on 21-Mar-2016

36 views

Category:

Documents


0 download

DESCRIPTION

RankSQL: Query Algebra and Optimization for Relational Top-k Queries. AUTHORS: Chengkai Li Kevin Chen-Chuan Chang Ihab F. Ilyas Sumin Song Presenter: Roman Yarovoy October 3, 2007. Before RankSQL. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

1

RankSQL: Query Algebra and RankSQL: Query Algebra and Optimization for Relational Top-k Optimization for Relational Top-k

QueriesQueries

AUTHORS: Chengkai Li Kevin Chen-Chuan Chang Ihab F. Ilyas Sumin Song

Presenter: Roman Yarovoy

October 3, 2007

Page 2: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

2

Before RankSQLBefore RankSQL

• Ranking (top-k) queries: Query result is sorted by rank and limited to top k results.

• Support for ranking was lacking from RDBMS.

• Previously, isolated cases of top-k query processing were studied.

• No way to integrate top-k operations with other relational operations.

Page 3: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

3

Previous (traditional) Previous (traditional) approachapproach

• Query processing without ranking support: 1. Evaluate select-project-join (SPJ) query and

materialize the result. 2. Sort the result according to a given ranking function.

3. Take only top k tuples. • Associated problems:

– No interest in total order of all the results. – Evaluating ranking function(s) can be expensive.

Page 4: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

4

Key contributionKey contribution

Li et al. proposed: Extending relational algebra to support

ranking as a first-class database construct.

Consequence: Rank-aware relational query engine Rank-aware query optimization.

Page 5: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

5

Top-k query: Example 1Top-k query: Example 1

• R • T TID a1 a2 p1 p2 p3

r1 27 50 0.5 0.3 0.25

r2 17 60 0.6 0.5 0.65

r3 47 90 0.7 0.7 0.95

r4 87 10 0.8 0.1 0.35

r5 47 70 0.9 0.1 0.75

TID b1 b2 p4 p5

t1 47 55 0.5 0.1

t2 66 65 0.5 0.7

t3 27 15 0.5 0.8

t4 99 95 0.5 0.2

Page 6: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

6

Example 1 (cont’d)Example 1 (cont’d)

SELECT *FROM R r, T tWHERE r.a1=t.b1 AND r.a2>t.b2ORDER-BY p1+p2+p3+p4+p5LIMIT 2

(where F = p1+p2+p3+p4+p5)

TID a1 a2 b1 b2 Fr3/t1 47 90 47 55 2.95r1/t3 27 50 27 15 2.35r5/t1 47 70 47 55 2.35

Page 7: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

7

Rank-relational algebraRank-relational algebra

• There was no way to express such query in relational algebra.

• Extend relational algebra by adding rank as a first-class operation.

• Based on the observations of first-class constructs (eg. selection), two requirements are needed to support ranking:

1. Splitting – Predicate-by-predicate rank evaluation. 2. Interleaving – Swapping rank operator with other

operators (i.e. ranking is not only applied after filtering).

Page 8: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

8

Ranking PrincipleRanking Principle• Def: Given a ranking function F and a set of evaluated

predicates P={p1, p2, … , pn}, maximal-possible score of a tuple t is defined as:

• Ranking Principle: If FP[t1] > FP[t2], then t1 must be ranked before t2.

Page 9: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

9

Rank-RelationRank-Relation

Def: For monotonic scoring function F(p1, …, pn) and a subset P of {p1, …, pn}, a relation R augmented with ranking induced by P is called a rank-relation, denoted by RP.

• Implicit attribute of RP is the score of tuple t, that is FP[t].

• Order relationship of RP : – For all t1, t2 Є RP : t1 < RP t2 ↔ FP[t1] < FP[t2]

Page 10: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

10

Operators of rank-relationsOperators of rank-relations

• Rank (or μ) operator “adds” a predicate p to set P.

– i.e. μp(RP) ≡ R P U{p}.

• Example 2: μp1(R{p2}) ≡ R{p1, p2}, where

F=∑(p1, p2, p3). TID a1 a2 p1 p2 p3 F{p1, p2}

r3 47 90 0.7 0.7 0.95 2.4

r2 17 60 0.6 0.5 0.65 2.1

r5 47 70 0.9 0.1 0.75 2.0

r4 87 10 0.8 0.1 0.35 1.9

r1 27 50 0.5 0.3 0.25 1.8

Page 11: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

11

Extended operatorsExtended operators

Page 12: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

12

Example 3: Extended JoinExample 3: Extended Join

πa1,a2,b2(σc (R{p1, p2 p3} JOIN T{p4, p5}))

SELECT r.a1, r.a2, t.b1 FROM R r, T tWHERE cORDER-BY FLIMIT 2

(F = ∑ P and c = r.a1+r.a2 < t.b1)

TID a1 a2 b2 F {p1, p2, p3, p4, p5}

r2/t4 17 60 99 2.45

r4/t4 87 10 99 1.95

r1/t4 27 50 99 1.75

Page 13: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

13

Extended operators (cont’d)Extended operators (cont’d)

• Note: – Cartesian product is defined similarly to join,

but not discussed in the paper. – Projection operator π has not changed. – Computation is based on both Boolean and

ranking logical properties. – Perform Boolean operations and maintain

the order induced by all given ranking predicates.

Page 14: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

14

Equivalence relationsEquivalence relations

• In the extended rank-relational model, ranking is a first-class construct.

• Can derive algebraic equivalences from the definitions of operators (Proofs are omitted).

• Example 4:

1. σc(RP) ≡ (σcR)P 2. RP1 ∩ TP2 ≡ (R ∩ T)P1 U P2

• Thus, we can interleave the rank operator with other operators (i.e. push μ down across operators).

Page 15: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

15

Equivalence relations (cont’d)Equivalence relations (cont’d)

Page 16: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

16

Equivalence relations (cont’d)Equivalence relations (cont’d)

• Note: – Proposition 1 states that ranking can be done in

stages (i.e. one predicate at the time). – By Propositions 2, 3, and 4, the relations hold

commutative and associative laws. – By Propositions 4 and 5, μ can be swapped with

other operators.

Page 17: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

17

Incremental executionIncremental execution

• Blocking operators (eg. sort) lead to materialization of intermediate results.

• Goal: To avoid materialization and implement a pipelining execution strategy.

• We want to split rank computation into stages and to reduce the number of tuples considered in the upcoming stages.

• We can output (i.e. advance to the next stage) a tuple t, whenever t has a score which is greater or equal to the score of any future tuple t′′ .

Page 18: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

18

Incremental execution (cont’d)Incremental execution (cont’d)

• Apply μp to RP and maintain priority queue ordered by P U{p}. – Let X = set of tuples from preceding stage.

• Draw t′ from X.

• If FP U{p}[t] ≥ FP[t′] and FP[t′] ≥ FP[t′′] for any future t′′ drawn from x,

then FP U{p}[t] ≥ FP U{p}[t′′] and t can be output (proceed to next stage).

Page 19: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

19

Example 5: Top 2 of WExample 5: Top 2 of W

• Given F = AVG(p6, p7, p8)

• idxScanp6(W) μp7 μp8

TID x p6 p7 p8 Fw1 3 0.9 0.8 0.2 29/30

w2 7 0.8 0.7 0.1 14/15

w3 5 0.7 0.6 0.1 9/10

w4 1 0.5 0.4 0.9 5/6

TID x Fw1 3 9/10

w2 7 5/6

w3 5 23/30

w4 1 19/30

TID x Fw1 3 19/30

w4 1 3/5

w2 7 8/15

w3 5 7/15

Page 20: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

20

Different evaluation plansDifferent evaluation plans

• There exist algorithms to implement rank-aware operators as well as incremental evaluation.

• Efficiency of query evaluation will now depend not only on the regular operators, but also on the rank-aware operators.

• Due to algebraic equivalence laws, we can define additional evaluation plans.

• Hence, we want a query optimizer to take additional execution plans into consideration.

Page 21: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

21

Rank-aware optimizerRank-aware optimizer• Extended algebra Extended search space. • Impact on enumeration algorithm:

– Li et al. designed a 2-dimension enumeration algorithm: Dimension 1 = Join size, Dimension 2 = Ranking predicates.

– The algorithm is exponential in both dimensions. – Heuristics applied to reduce search space.

• Impact on cost model: – For ranking queries, it is more difficult to estimate the

query cardinality of the intermediate results, whose accuracy is the core of the cost model.

– Authors proposed to estimate cardinality by randomly sampling tuples.

Page 22: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

22

CritiqueCritique

• Erroneous examples. • No example of “tie-breaking” function. • Bad explanation of incremental evaluation.

Page 23: RankSQL: Query Algebra and Optimization for Relational Top-k Queries

23

Future research directionsFuture research directions

• Cardinality estimation: New/improved techniques for random sampling over joins.

• Dynamically determined/chosen k. • Exploring physical properties of rank-aware

execution plans.