dynamic query optimization

Transparent Access to Grid Data Objects | IBM Confidential © 2003 IBM Corporation | Progressive Query Processing | ACM SIGMOD 2004 © 2004 IBM Corporation

Upload: nascha

Post on 12-Jan-2016

Page 1: Dynamic Query Optimization

Dynamic Query Optimization

Page 2: Dynamic Query Optimization

Problems with static optimization

• Cost function instability: the cardinality error of an n-way join grows exponentially with n
• Unknown run-time bindings for host variables
• Changing environment parameters: amount of available space, concurrency rate, etc.

Static optimization comes in two flavours:

1. Optimize query Q, store the plan, and run it whenever Q is posed
2. Every time Q is posed, optimize it and run it
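To make the cost function instability concrete, here is a minimal sketch. It assumes (this is not stated on the slide) that each of the n - 1 join selectivities in an n-way join is misestimated by the same factor and that the errors multiply. The function name and the factor of 2 are hypothetical.

```python
def compounded_error(per_join_error: float, n_relations: int) -> float:
    """Worst-case multiplicative cardinality error of an n-way join.

    Assumption (illustrative only): every one of the n - 1 join
    selectivities is off by the same factor and the errors multiply.
    """
    if n_relations < 1:
        raise ValueError("need at least one relation")
    return per_join_error ** (n_relations - 1)


if __name__ == "__main__":
    # With a 2x error per join, the total error doubles with every extra join.
    for n in (2, 4, 8):
        print(n, compounded_error(2.0, n))
```

Under these assumptions an 8-way join with a modest 2x error per join is already off by more than two orders of magnitude.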

Page 3: Dynamic Query Optimization

Early Solutions

1. Run several plans simultaneously for a short time, then select one "best" plan and run it for a long time
2. At every point in a standard query plan where the optimizer cannot accurately estimate the selectivity of an input, a choose-plan operator is inserted

[Figure: plan tree with operators Select (unbound predicate), Choose-Plan, File Scan, B-tree-scan, and Get-Set R]
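A choose-plan operator defers the access-path decision until the predicate's host variable is bound. The sketch below illustrates the idea with a toy cost model. The cost formulas, the `TableStats` fields, and all numbers are assumptions made for this example, not the model of any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    rows: int          # tuples in R
    pages: int         # data pages in R
    btree_height: int  # levels of the B-tree index


def file_scan_cost(stats: TableStats) -> float:
    # A full scan reads every page once, whatever the selectivity.
    return float(stats.pages)


def btree_scan_cost(stats: TableStats, selectivity: float) -> float:
    # Toy model: descend the index, then fetch one page per qualifying
    # tuple (as with an unclustered index).
    return stats.btree_height + selectivity * stats.rows


def choose_plan(stats: TableStats, selectivity: float) -> str:
    """Run-time decision, made once the unbound predicate has a value."""
    if btree_scan_cost(stats, selectivity) < file_scan_cost(stats):
        return "B-tree-scan"
    return "File Scan"


if __name__ == "__main__":
    r = TableStats(rows=100_000, pages=1_000, btree_height=3)
    print(choose_plan(r, 0.001))  # very selective predicate
    print(choose_plan(r, 0.5))    # unselective predicate
```

Both alternatives are compiled into the plan ahead of time, so the run-time work is only the cost comparison.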

Page 4: Dynamic Query Optimization

Dynamic Mid-Query Reoptimization

Features of the algorithm:

• Annotated query execution plan
• Runtime collection of statistics
• Dynamic resource reallocation
• Query plan modification
• Keeping overhead low

Page 5: Dynamic Query Optimization

Motivating Example

select avg(Rel1.selectattr1),
       avg(Rel1.selectattr2),
       Rel1.groupattr
from Rel1, Rel2, Rel3
where Rel1.selectattr1 < :value1
  and Rel1.selectattr2 < :value2
  and Rel1.joinattr2 = Rel2.joinattr2
  and Rel1.joinattr3 = Rel3.joinattr3
group by Rel1.groupattr

Page 6: Dynamic Query Optimization

Dynamic Resource Reallocation

• Assume 8MB of memory is available and 4.2MB is necessary for each hash-join
• The optimizer allocates 4.2MB for the first hash-join and 250KB for the second (causing it to execute in two passes)
• During execution, the statistics collector finds out that only 7,500 tuples are produced by the filter
• The memory manager allocates each of the two hash-joins 2.05MB

Page 7: Dynamic Query Optimization

Query Plan Modification

• Once the statistics are available, modify the plan on the fly
– Hard to implement!

[Figure: original plan and modified plan (optimal solution)]

Page 8: Dynamic Query Optimization

Query Plan Modification: practical solution

select avg(Temp1.selectattr1),
       avg(Temp1.selectattr2),
       Temp1.groupattr
from Temp1, Rel3
where Temp1.joinattr3 = Rel3.joinattr3
group by Temp1.groupattr

• Store a partially computed query to disk
• Submit a new query using the partial results
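The two steps can be tried end to end with Python's built-in `sqlite3` module. This is only a sketch under stated assumptions: the table contents, the bind values, and the choice to cut the plan after the Rel1/Rel2 join are made up for the example, and SQLite stands in for whatever engine the slides had in mind.

```python
import sqlite3


def run_demo():
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE Rel1 (selectattr1 INT, selectattr2 INT, groupattr INT,
                           joinattr2 INT, joinattr3 INT);
        CREATE TABLE Rel2 (joinattr2 INT);
        CREATE TABLE Rel3 (joinattr3 INT);
        CREATE TABLE Temp1 (selectattr1 INT, selectattr2 INT, groupattr INT,
                            joinattr3 INT);
    """)
    # Synthetic data, invented for this example.
    cur.executemany("INSERT INTO Rel1 VALUES (?, ?, ?, ?, ?)",
                    [(i, 2 * i, i % 3, i % 5, i % 7) for i in range(100)])
    cur.executemany("INSERT INTO Rel2 VALUES (?)", [(i,) for i in range(5)])
    cur.executemany("INSERT INTO Rel3 VALUES (?)", [(i,) for i in range(7)])
    binds = {"value1": 50, "value2": 60}

    # Reference answer: the original three-way query run in one piece.
    original = cur.execute("""
        SELECT avg(Rel1.selectattr1), avg(Rel1.selectattr2), Rel1.groupattr
        FROM Rel1, Rel2, Rel3
        WHERE Rel1.selectattr1 < :value1 AND Rel1.selectattr2 < :value2
          AND Rel1.joinattr2 = Rel2.joinattr2
          AND Rel1.joinattr3 = Rel3.joinattr3
        GROUP BY Rel1.groupattr ORDER BY Rel1.groupattr
    """, binds).fetchall()

    # Step 1: store the partially computed query (filter plus first join).
    cur.execute("""
        INSERT INTO Temp1
        SELECT Rel1.selectattr1, Rel1.selectattr2, Rel1.groupattr,
               Rel1.joinattr3
        FROM Rel1, Rel2
        WHERE Rel1.selectattr1 < :value1 AND Rel1.selectattr2 < :value2
          AND Rel1.joinattr2 = Rel2.joinattr2
    """, binds)
    # The actual cardinality is now known and can drive re-optimization.
    temp_cardinality = cur.execute("SELECT count(*) FROM Temp1").fetchone()[0]

    # Step 2: submit a new query over the partial result.
    remainder = cur.execute("""
        SELECT avg(Temp1.selectattr1), avg(Temp1.selectattr2), Temp1.groupattr
        FROM Temp1, Rel3
        WHERE Temp1.joinattr3 = Rel3.joinattr3
        GROUP BY Temp1.groupattr ORDER BY Temp1.groupattr
    """).fetchall()
    conn.close()
    return original, remainder, temp_cardinality


if __name__ == "__main__":
    original, remainder, card = run_demo()
    print(card, original == remainder)
```

The remainder query is an ordinary query, so the regular optimizer can plan it with the exact cardinality of Temp1 instead of an estimate.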

Page 9: Dynamic Query Optimization

Robust Query Processing through Progressive Optimization

Page 10: Dynamic Query Optimization

Motivation

• Estimation errors in query optimization
– Due to correlations in data
  SELECT count(*) FROM cars c, accidents a, owners o
  WHERE c.id = a.cid AND c.id = o.cid AND c.make = 'Honda' AND c.model = 'Accord'
– Over-specified queries
  SELECT * FROM customers WHERE SSN = blah AND name = blah
– Mis-estimated single-predicate selectivity
  SELECT count(*) FROM cars c WHERE c.make = ?
– Out-of-date statistics
• Can cause bad plans
• Leads to unpredictable performance
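The correlation case can be reproduced on a toy table. The sketch below assumes an optimizer that multiplies single-predicate selectivities as if they were independent. The `cars` data is invented for the example and chosen so that every Accord is a Honda.

```python
from fractions import Fraction

# Hypothetical toy table of (make, model) pairs; every Accord is a Honda,
# so the two predicates are perfectly correlated.
cars = ([("Honda", "Accord")] * 10 + [("Honda", "Civic")] * 10
        + [("Toyota", "Camry")] * 40 + [("Ford", "Focus")] * 40)


def selectivity(predicate) -> Fraction:
    """Fraction of rows in `cars` that satisfy the predicate."""
    return Fraction(sum(1 for car in cars if predicate(car)), len(cars))


sel_make = selectivity(lambda car: car[0] == "Honda")
sel_model = selectivity(lambda car: car[1] == "Accord")

# Independence assumption: multiply the single-predicate selectivities.
estimated_rows = sel_make * sel_model * len(cars)

# What the data actually contains.
actual_rows = selectivity(
    lambda car: car[0] == "Honda" and car[1] == "Accord") * len(cars)

if __name__ == "__main__":
    print(estimated_rows, actual_rows)
```

On this data the independence assumption underestimates the result by a factor of five, and the error would be carried into every join above the selection.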

Page 11: Dynamic Query Optimization

Traditional Query Processing

[Figure: SQL compilation flow with boxes Statistics, Optimizer, Best Plan, and Plan Execution]

Page 12: Dynamic Query Optimization

LEO: DB2's Learning Optimizer

[Figure: feedback loop around SQL Compilation, Optimizer, Best Plan, and Plan Execution, with Statistics, Adjustments, Estimated Cardinalities, and Actual Cardinalities]

1. Monitor
2. Analyze
3. Feedback
4. Exploit

Use feedback from cardinality errors to improve future plans
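The monitor, analyze, feedback, exploit cycle can be illustrated with a very small adjustment store. This is a sketch of the general idea only, not LEO's actual data structures. The class name, the per-predicate keying, and the simple ratio used as the adjustment are all assumptions.

```python
class FeedbackStore:
    """Keeps one adjustment factor per predicate, learned from past runs."""

    def __init__(self) -> None:
        self._factors: dict[str, float] = {}

    def record(self, predicate: str, estimated: float, actual: float) -> None:
        # Monitor and analyze: compare the optimizer's raw (unadjusted)
        # estimate with the cardinality observed during execution.
        if estimated > 0:
            # Feedback: remember the ratio as the adjustment.
            self._factors[predicate] = actual / estimated

    def adjust(self, predicate: str, raw_estimate: float) -> float:
        # Exploit: scale future raw estimates; unknown predicates are
        # left unchanged.
        return raw_estimate * self._factors.get(predicate, 1.0)


if __name__ == "__main__":
    store = FeedbackStore()
    store.record("c.make = 'Honda'", estimated=100, actual=400)
    print(store.adjust("c.make = 'Honda'", 50))
    print(store.adjust("c.model = 'Accord'", 50))
```

Note that the correction only helps later compilations, which is the limitation that progressive optimization addresses.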

Page 13: Dynamic Query Optimization

Progressive Optimization (POP)

[Figure: POP flow, steps numbered 1 to 6, with boxes SQL Compilation, Statistics, Optimizer, Best Plan with CHECK, Plan Execution with CHECK, Re-optimize if CHECK fails, Partial Results, "MQT" with Actual Cardinality, New Best Plan, and New Plan Execution]

Use feedback from cardinality errors to improve current plan

Page 14: Dynamic Query Optimization

Outline

• Progressive Optimization
– Solution overview
– Checkpoint placement
– Validity range computation
• Performance Results

Page 15: Dynamic Query Optimization

Progressive Optimization

• Why wait till the query is finished to correct the problem?
– Can detect problem early!
– Correct the plan dynamically before we waste any more time!
• May never execute this exact query again
– Parameter markers
– Rare correlations
– Complex predicates
• Long-running query won't notice re-optimization overhead

Result: Plan more robust to optimizer mis-estimates

Page 16: Dynamic Query Optimization

Solution Overview

• Add CHECKpoints to Query Execution Plans
– Check estimated cardinalities vs. actuals at runtime
• When checking fails:
– Treat already computed (intermediate) results as materialized views
– Correct the cardinality estimates based on the actual cardinalities
– Re-optimize the query, possibly exploiting already performed work

Questions:
– Where to add checkpoints?
– When is an error big enough to be worth reoptimizing?

Tradeoff between opportunity (# reoptimization points) and risk (performance regression)
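The overall control flow in this overview can be sketched as a small driver loop. The `optimize` and `execute` callables, the `CheckFailed` exception, and the cap on the number of re-optimizations are hypothetical names and choices made for this illustration. The slides do not prescribe an interface.

```python
class CheckFailed(Exception):
    """Raised by an (assumed) executor when a CHECK sees a bad cardinality."""

    def __init__(self, node: str, actual: int) -> None:
        super().__init__(f"CHECK failed at {node}: actual cardinality {actual}")
        self.node = node
        self.actual = actual


def run_progressively(optimize, execute, max_reoptimizations: int = 3):
    """optimize(known_cards) -> plan; execute(plan) -> result or CheckFailed."""
    known_cards: dict[str, int] = {}
    for _ in range(max_reoptimizations + 1):
        plan = optimize(known_cards)
        try:
            return execute(plan)
        except CheckFailed as failure:
            # Correct the estimate with the observed cardinality, then let
            # the next iteration re-optimize with that knowledge.
            known_cards[failure.node] = failure.actual
    raise RuntimeError("gave up after too many re-optimizations")


if __name__ == "__main__":
    def optimize(known):
        # Toy optimizer: a large filter output favours a hash join.
        return "hash-join" if known.get("filter", 10) > 100 else "nested-loops"

    def execute(plan):
        if plan == "nested-loops":
            raise CheckFailed("filter", 5000)  # estimate of 10 was far off
        return f"result via {plan}"

    print(run_progressively(optimize, execute))
```

The cap on iterations is one simple way to bound the risk side of the tradeoff; a real system would also reuse the intermediate results rather than only their cardinalities.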

Page 17: Dynamic Query Optimization

CHECK Placement (1)

Three constraints:

• Must not have performed side-effects
– Given out results to application
– Performed updates
• Want to reuse as much as possible
• Don't reoptimize if the plan is almost finished

Page 18: Dynamic Query Optimization

CHECK Placement (2)

• Lazy CHECK:
– Just above a dam: TEMP, SORT, HSJN inner
– Very low risk of regression
– Provides safeguard for hash-join, merge-join, etc.
• Lazy Checking with Eager Materialization
– Pro-actively add dams to enable checkpointing
– E.g. outer of nested-loops join
• Eager Checking
– It may be too late to wait until the dam is complete
– Check cardinalities before tuples are inserted into the dam
– Can extrapolate to estimate final cardinality

[Figure: plan fragment with an NLJN, a DAM, a Lazy Check, and an Eager Check]
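Eager checking can be pictured as a counter on the stream that feeds the dam. The sketch below is a simplification under stated assumptions: it only detects a violated upper bound (a lower bound cannot be confirmed before the input ends), it does not model extrapolation, and all names are hypothetical.

```python
from typing import Iterable, Iterator


class CardinalityExceeded(Exception):
    def __init__(self, seen: int, high: int) -> None:
        super().__init__(f"saw {seen} tuples, range allows at most {high}")
        self.seen = seen
        self.high = high


def eager_check(rows: Iterable, high: int) -> Iterator:
    """Count tuples on their way into the dam and stop at the first excess."""
    seen = 0
    for row in rows:
        seen += 1
        if seen > high:
            # No need to finish materialising: the plan is already invalid.
            raise CardinalityExceeded(seen, high)
        yield row


def fill_dam(rows: Iterable, high: int) -> list:
    """Materialise the input (the dam) behind an eager check."""
    return list(eager_check(rows, high))


if __name__ == "__main__":
    print(len(fill_dam(range(5), high=10)))
    try:
        fill_dam(range(1_000_000), high=10)
    except CardinalityExceeded as exc:
        print(exc)
```

In the second call the check fires after 11 tuples instead of after the full million, which is the benefit of not waiting for the dam to complete.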

Page 19: Dynamic Query Optimization

CHECK Operator Execution

• IF actual cardinality not in [low, high]:
– Save as a "view match structure" whose definition ("matching") was pre-computed at compile time and whose cardinality is the actual cardinality
– Terminate execution & return special error code
– Re-invoke query compiler
• ELSE continue execution

How to set the [low, high] range?
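The CHECK operator's behaviour on this slide can be sketched for the lazy case, where the input is already materialised. The `ViewMatch` fields and the exception standing in for the "special error code" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ViewMatch:
    definition: str   # which subplan the rows represent (known at compile time)
    cardinality: int  # actual cardinality observed at run time
    rows: list        # the intermediate result kept for reuse


class ReoptimizationRequested(Exception):
    """Stands in for the special error code returned to the compiler."""

    def __init__(self, view: ViewMatch) -> None:
        super().__init__(f"cardinality {view.cardinality} outside validity range")
        self.view = view


def check(rows: list, low: int, high: int, definition: str) -> list:
    actual = len(rows)
    if not (low <= actual <= high):
        # Save the intermediate result with its true cardinality, then
        # terminate so that the query compiler can be re-invoked.
        raise ReoptimizationRequested(ViewMatch(definition, actual, rows))
    return rows  # within the validity range: continue execution


if __name__ == "__main__":
    print(check([1, 2, 3], low=1, high=5, definition="filter(Rel1)"))
    try:
        check(list(range(10)), low=1, high=5, definition="filter(Rel1)")
    except ReoptimizationRequested as exc:
        print(exc.view.definition, exc.view.cardinality)
```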

Page 20: Dynamic Query Optimization

Outline

• Progressive Query Processing
– Solution overview
– Checkpoint placement
– Validity range computation
• Performance Results

Page 21: Dynamic Query Optimization

Validity Range Determination (1)

• At a given operator, what input cardinality change will cause a plan change? I.e., when is this plan valid?
• In general, equivalent to parametric optimization
– Super-exponential explosion of alternative plans to consider
– Finds optimal plan for each value range, for each subset of predicates, ...
• So we focus on changes in a single operator
– Local decision
– E.g. NLJN → HSJN
– Not join order changes
– Advantage: can be tracked during original optimization
– Disadvantage: pessimistic model, since it misses reoptimization opportunities

Page 22: Dynamic Query Optimization

Validity Range Determination (2)

• Suppose P1 and P2 considered during optimizer pruning
– cost(P1, est_card_outer) < cost(P2, est_card_outer)
– Estimate upper and lower bounds on card_outer s.t. P2 dominates P1
– Use bounds to update (narrow) the validity range of outer (likewise for inner)
• Applies to arbitrary operators
• Can be applied all the way up the plan tree

[Figure: two alternative plans, P1 with root operator L1 and P2 with root operator L2, each joining inputs P and Q as outer and inner]
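One way to estimate such a bound is to search for the cardinality at which the two cost functions cross. The sketch below uses bisection and assumes that P1 is cheaper at the estimate and that the cost difference changes sign at most once in the searched interval. The cost functions, the search limit, and the tolerance are invented for the example and are not the method described in the talk.

```python
from typing import Callable

CostFn = Callable[[float], float]


def upper_validity_bound(cost_p1: CostFn, cost_p2: CostFn,
                         est_card: float, max_card: float,
                         tol: float = 1.0) -> float:
    """Approximate the largest outer cardinality at which P1 still wins.

    Assumes P1 is no more expensive than P2 at est_card and that the two
    cost curves cross at most once between est_card and max_card.
    """
    if cost_p1(est_card) > cost_p2(est_card):
        raise ValueError("P1 must win at the estimated cardinality")
    if cost_p1(max_card) <= cost_p2(max_card):
        return max_card  # P2 never dominates within the searched interval
    lo, hi = est_card, max_card  # invariant: P1 wins at lo, P2 wins at hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cost_p1(mid) <= cost_p2(mid):
            lo = mid
        else:
            hi = mid
    return lo  # conservative: a cardinality at which P1 is still the winner


if __name__ == "__main__":
    def nljn(card_outer: float) -> float:
        return 10.0 * card_outer          # toy cost: one probe per outer tuple

    def hsjn(card_outer: float) -> float:
        return 5000.0 + 1.0 * card_outer  # toy cost: build cost plus scan

    print(upper_validity_bound(nljn, hsjn, est_card=100, max_card=1_000_000))
```

Returning `lo` rather than `hi` keeps the bound on the side where P1 is known to win under these assumptions. A lower bound can be found the same way by searching below the estimate.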

Page 23: Dynamic Query Optimization

Example of a Cost Analysis

• Lineitem × Orders query
– Vary selectivity of o_orderdate < 'date' predicate
• N1, M1, H1: Orders as outer
• N2, M2, H2: Lineitem as outer
• Optimal Plan: N1, H2, M1

[Figure: plot with regions labelled N1, H2, and M1]

Page 24: Dynamic Query Optimization

Upper Bounds from pruning M1 with N1

• Upper bounds vary
• Misses pruning with H2 because outer/inner reversed
• Still upper bounds set conservatively; no false reoptimization

Page 25: Dynamic Query Optimization

Lower Bounds from pruning N1 with M1

[Figure: plot with regions labelled N1, H2, and M1]

Page 26: Dynamic Query Optimization

Outline

• Progressive Query Processing
– Solution overview
– Checkpoint placement
– Validity range computation
• Performance Results
– Parameter markers (TPC-H query)
– Correlations (customer workload for a motor vehicles department)
– Re-optimization Opportunities with POP

Page 27: Dynamic Query Optimization

Robustness for Parameter Marker in TPC-H Query 10

[Figure: execution time (axis 0 to 800) vs. actual selectivity (axis 0 to 100) for three series: default selectivity estimate with POP, default selectivity estimate, and correct selectivity estimate. Annotation: 4-way join goes through 5 different optimal plans.]

Page 28: Dynamic Query Optimization

Response Time of DMV with and without POP

[Figure: box plot of response time (axis 0 to 1400) for standard vs. POP. Box: 25th to 75th percentile of queries.]

Page 29: Dynamic Query Optimization

Speed-Up (+) vs. Regression (-) of DMV with POP

[Figure: speedup (+) vs. regression (-) factor (axis -10 to 90) for 39 real-world complex queries]

Page 30: Dynamic Query Optimization

Scatter Plot of Response Times for DMV

[Figure: scatter plot of response time with POP vs. response time without POP (both axes 0 to 1500), with regions labelled Degradation and Improvement]

Page 31: Dynamic Query Optimization

Reoptimization Opportunities with POP

[Figure: fraction of query execution completed (axis 0 to 1.2) for queries Q2, Q3, Q4, Q5, Q7, Q8, Q11, and Q18. Series: LC (above HJ), LCEM, LC (above TMP/SORT).]

Page 32: Dynamic Query Optimization

Related Work

• Choose-Plans: Graefe/Cole, Redbrick, …
• Parametric Query Optimization
• Least-expected cost optimization
• Kabra/DeWitt Mid-query re-optimization, Query Scrambling
• Runtime Adaptation
– Adaptive Operators:
  DB2/zOS, DEC RDB, …: adaptive selection of access methods
  Ingres: adaptive nested loop join
  XJoin, Tukwila: adaptive hash join
  Pang/Carey/Livny, Zhang/Larson: dynamic memory adjustment
  …
– Convergent query processing
– Eddies: adaptation of join orders
– SteMs: adaptation of join algorithms, spanning trees, …

Page 33: Dynamic Query Optimization

Conclusions

• POP makes plans for complex queries more robust to optimizer misestimates
• Significant performance improvement on real workloads
• Overhead of re-optimization is very low, scales with DB size
• Validity ranges tell us how risky a plan is
– Can be used for many applications to act upon cardinality sensitivity

Future Work:
– CHECK estimates other than cardinality: # concurrent applications; memory available in buffer pool, sort heap; actual run time, actual # I/Os
– Avoid re-optimization too late in plan or if cost of optimization too high
– Re-optimization in shared-nothing query plans
– Extend validity ranges to more general plan robustness measures