robust query processing through progressive optimization

35
Robust Query Processing through Progressive Optimization SIGMOD 2004 Volker Markl, Vijayshankar Raman, David Simmen, Guy Lohman, Hamid Pirahesh, Miso Cilimdzic Modified by S. Sudarshan from talk by Raja Agrawal

Upload: senta

Post on 18-Mar-2016

45 views

Category:

Documents


0 download

DESCRIPTION

Robust Query Processing through Progressive Optimization. Volker Markl, Vijayshankar Raman, David Simmen, Guy Lohman, Hamid Pirahesh, Miso Cilimdzic. SIGMOD 2004. Modified by S. Sudarshan from talk by Raja Agrawal. Motivation. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Robust Query Processing through Progressive Optimization

Robust Query Processing through Progressive Optimization

SIGMOD 2004

Volker Markl, Vijayshankar Raman, David Simmen, Guy Lohman, Hamid Pirahesh, Miso Cilimdzic

Modified by S. Sudarshan from talk by Raja Agrawal

Page 2: Robust Query Processing through Progressive Optimization

Motivation Current optimizers depend heavily

upon the cardinality estimations What if there errors in those

estimations? Errors can occur due to …

Inaccurate statistics Invalid assumptions (e.g. attribute

independence)

Page 3: Robust Query Processing through Progressive Optimization

Progressive Query Optimization Idea: lazily trigger reoptimization

during execution if cardinality counts indicate current plan is suboptimal introduces checkpoint (CHECK) operator

to compare actual vs estimated cardinality

key idea: precompute cardinality ranges for which plan is optimal

Page 4: Robust Query Processing through Progressive Optimization

Evaluating a re-optimization scheme Risk Vs Opportunity Risk:

Extent to which re-optimization is not worthwhile reoptimization chooses another bad plan work redone cardinality errors may even cancel, and fixing

one may give an even worse plan! Opportunity:

Refers to the aggressiveness more CHECK operators..

Page 5: Robust Query Processing through Progressive Optimization

Background

Redbrick Star schema with fact table and multiple dimension

tables First apply selections on dimension tables Then decide what plan to use

Kabra & DeWitt 98 (KD98) Introduced idea of mid-query reoptimization Allow partial results to be use like materialized views But ad-hoc cardinality threshold, and only reoptimize

fully materialized plans

Opportunity

Risk

Page 6: Robust Query Processing through Progressive Optimization

Background

Tukwila data integration system optimizer may have no idea of statistics interleave optimization and query execution

partial query plans Fragment: fully pipelined tree with doubly

pipelined hash join Query Scrambling

reorder query to deal with delayed sources

Opportunity

Risk

Page 7: Robust Query Processing through Progressive Optimization

Background

Eddies (Telegraph) Ingres/DEC Rdb: run multiple access methods

competitively then choose Parametric Query Optimization (PQO)

e.g. Cole and Graefe 94, Hulgeri and Sudarshan 02 Choose from a set of plans, each optimal for

selectivity range POP: converse: find optimal cardinality range for a

give plan

Opportunity

Risk

Page 8: Robust Query Processing through Progressive Optimization

Example of Progressive Optimization in Action

Page 9: Robust Query Processing through Progressive Optimization

Progressive Query Optimization(POP)

Page 10: Robust Query Processing through Progressive Optimization

Architecture of POP

Page 11: Robust Query Processing through Progressive Optimization

Architecture of POP CHECK operator to find if a plan is

suboptimal At optimization time, find out cardinality range

(at CHECK location) for which plan is optimal At run time, ensure cardinality within [l,u] If violated, stop plan execution and reoptimize

Location of CHECKs Re-optimize

taking observed cardinality into account, and exploiting intermediate results where beneficial Heuristic: limit number of reoptimizations

(default: 3)

Page 12: Robust Query Processing through Progressive Optimization

Validity Ranges Consider a plan edge e that flows rows into

operator o, let P be the subplan rooted at o. The validity range for e is an upper and lower

bound on the number of rows flowing through e, such that if the range is violated at runtime, we can guarantee P is suboptimal

Ad-hoc thresholds (proposed earlier) are a bad idea E.g. even a 100x error on very small relation

may not make a difference in optimal plan

Page 13: Robust Query Processing through Progressive Optimization

Finding Optimality Ranges Plan Popt with root operator oopt is being

compared with another plan Palt different only in the root operator oalt.

Page 14: Robust Query Processing through Progressive Optimization

Finding Optimality Ranges Need to solve

cost(Palt , c) – cost(Popt , c) = 0 where c is the cardinality on edge e

Cost functions can be complex/non-linear/non-continuous

Page 15: Robust Query Processing through Progressive Optimization

Newton-Raphson Iteration

Page 16: Robust Query Processing through Progressive Optimization

What does this achieve? Detects suboptimality of the root operator where Popt

and Palt share the same input edges. Validity range might miss a cross-over point with a

plan that uses a different join order (and hence has different input edges).

Two plans are structurally equivalent if they share the same set of edges where an edge is defined by the set of rows flowing

through it during query execution. Allows different algorithms, and flipping inner/outer

Page 17: Robust Query Processing through Progressive Optimization

Optimality wrt structurally equivalent plans Theorem: …. Suppose edges edges ei1 , ei2 ,

… , eik are seen to be “erroneous” wrt cardinality. Then the following statements are equivalent:1. P is suboptimal with respect to another plan P' that

has the same set of edges {e1 , e2 , … , em}2. At least one of Pi1 , Pi2 , … , Pik is suboptimal given

the cardinality errors in those edges in {e1 , e2 , … , em } that lie under them.

3. At least one of oi1 , oi2 , … , oik is a suboptimal operator given the cardinality errors in {e1 , e2 , … , em} that are in its input edges.

Page 18: Robust Query Processing through Progressive Optimization

Conservative detection of suboptimality Suppose we “detect” suboptimality of (R

Join S) Join T wrt estimated costs of (R Join T) Join S During run time, we can never observe the

cardinality of R Join T We would be making an arbitrary guess as to the

correlation of the predicates on the R and T tables

Best not to infer suboptimality wrt such estimates

However, reoptimization may result in a different join order

Page 19: Robust Query Processing through Progressive Optimization

Exploiting Intermediate Results All the intermediate results are stored as

temporary MVs with cardinalities available to the optimizer can be reused if it leads to a better plan

but not necessarily used, e.g. if join result is very large, and a different join order is preferred

must be reused if it has performed side-effects Reoptimization done as part of same

transaction

Page 20: Robust Query Processing through Progressive Optimization

Optional use of MV

Page 21: Robust Query Processing through Progressive Optimization

Variants of CHECK Variants applicable in different cases,

trade off risk for opportunity Variants

Lazy checking Lazy checking with eager materialization Eager checking without compensation Eager checking with buffering Eager checking with deferred

compensation

Page 22: Robust Query Processing through Progressive Optimization

Lazy Checking Adding CHECKs above a materialization

point (SORT, TEMP etc) No results have been output yet And materialized results can be re-used very low overhead

Page 23: Robust Query Processing through Progressive Optimization

Lazy checking with eager materialization Can insert materialization point if it

does not exists already Risk: overhead of materialization Typically done only for outer input of

indexed nested-loop join low cost if outer is small (as estimated by

optimizer) and INL is in trouble anyway if outer is large

Page 24: Robust Query Processing through Progressive Optimization

Eager Checking Lazy checking may be too late

e.g. if very bad join order chosen, with huge intermediate results

Idea: check even before entire result is materialized, and stop early

Problem: what if some results have already been output? Compensation

Page 25: Robust Query Processing through Progressive Optimization

Eager Checking EC without Compensation:

CHECK is pushed down the materialization point, into pipeline

Page 26: Robust Query Processing through Progressive Optimization

Eager Checking EC with buffering

CHECK and buffer output from buffer once sure about bound

e.g. [0,b), or [b,infinity] else reoptimize “delayed pipelining”

Page 27: Robust Query Processing through Progressive Optimization

EC with Deferred Compensation Only SPJ queries Identifier of all rows returned to the user

are stored in a table S, which is used later in the new plan for anti-join with the new-result stream

Page 28: Robust Query Processing through Progressive Optimization

CHECK Placement

Page 29: Robust Query Processing through Progressive Optimization

CHECK Placement LCEM and ECB – outer side of nested-

loop join LC – above materialization points ECWC and ECDC – anywhere Do not place CHECKs if

no alternative plan above CHECK simple queries with low estimated cost

Page 30: Robust Query Processing through Progressive Optimization

Performance Analysis: Robustness TPC-H Q10: Replace constant in selection on lineitem

by parameter marker, so optimizer doesn’t know actual selectivity

5 different optimal plans

Page 31: Robust Query Processing through Progressive Optimization

Risk Analysis Analyze LC, LCEM, ECB Can be reoptimized more than once Conclusion: low overhead/risk

Page 32: Robust Query Processing through Progressive Optimization

Opportunity Analysis Goal: how often does opportunity to reoptimize arise? Introduce LC/LCEM/ECB checkpoints But turn off reoptimization, and run same plan Opportunity region for ECB: dotted line

Page 33: Robust Query Processing through Progressive Optimization

POP in (in)action Real world workload (DMV data and queries) Complex predicates leading to cardinality estimation

errors substring comparison, like, IN, ..

Page 34: Robust Query Processing through Progressive Optimization

POP in (in)action (contd.) Re-optimization may result in the

choice of worse plan due to: Two estimation errors canceling out each

other Re-using intermediate results

Page 35: Robust Query Processing through Progressive Optimization

Conclusions POP gives us a robust mechanism for

re-optimization through inserting of CHECK (in its various flavors)

Higher opportunity at low risk