EPL446: Advanced Database Systems - Demetris Zeinalipour (University of Cyprus)

EPL446 Advanced Database Systems
Lecture 12: A Typical Relational Query Optimizer
Chapter 15: Ramakrishnan & Gehrke (excluding 15.5 and 15.7)
Demetris Zeinalipour
http://www.cs.ucy.ac.cy/~dzeina/courses/epl446
Department of Computer Science, University of Cyprus


Lecture Outline: Relational Query Optimizer

• Introduction to Relational Query Optimization
• Relational Algebra Equivalences
• Query Blocks: Units of Optimization
• Enumeration of Alternative Plans
• Cost Estimation of Plans

[Figure: DBMS layers, top to bottom: Query Optimization and Execution; Relational Operators; Files and Access Methods; Buffer Management; Disk Space Management; DB]

Relational Query Optimization

• A user of a DBMS formulates SQL queries.
• The query optimizer translates each query into an equivalent Relational Algebra (RA) query, i.e., an RA query with the same result.
• To improve the efficiency of query processing, the query optimizer reorders the individual operations (operators) within the RA query.
• Reordering has to preserve the query semantics and is based on Relational Algebra equivalences (covered in the following slides).

Relational Query Optimization

• Why can reordering improve efficiency?
• Different orders can imply different sizes of the intermediate results.
• The smaller the intermediate results, the more efficient the execution plan!
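As a small illustration of this point, the Python sketch below joins three toy relations in two different orders. The relations R, S, T, their contents, and the helper `join` are hypothetical, chosen only to make the intermediate sizes easy to see. A real optimizer would estimate these sizes rather than materialize the joins.

```python
from itertools import product

def join(left, right, cond):
    """Naive nested-loops join: concatenate every pair of tuples satisfying cond."""
    return [l + r for l, r in product(left, right) if cond(l, r)]

# Hypothetical toy relations: R(a), S(a, b), T(b).
R = [(i,) for i in range(100)]
S = [(i % 100, i % 2) for i in range(200)]
T = [(0,)]

# Plan 1: (R join S) join T; result layout is (R.a, S.a, S.b, T.b).
rs = join(R, S, lambda r, s: r[0] == s[0])
plan1 = join(rs, T, lambda x, t: x[2] == t[0])

# Plan 2: R join (S join T); same result layout.
st = join(S, T, lambda s, t: s[1] == t[0])
plan2 = join(R, st, lambda r, x: r[0] == x[0])

print("intermediate sizes:", len(rs), "vs", len(st))
print("same result:", sorted(plan1) == sorted(plan2))
```

Both orders return the same tuples, but the second one carries a smaller intermediate result into the final join.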

Relational Algebra Equivalences

• The most important RA equivalences are the commutative and associative laws.
• A commutative law about some operation (e.g., about join) states that the order of its (two) arguments does not matter.
– e.g., Join is commutative: (R ⋈ S) ≡ (S ⋈ R)
• An associative law about some (binary) operation states that (more than two) arguments can be grouped either from the left or from the right.
– e.g., Join is associative: R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T
• If an operation is both commutative and associative, then any number of arguments can be (re-)ordered in an arbitrary manner.

RA Equivalences: Joins

• The following (binary) RA operations are commutative and associative: ×, ⋈, ∪, ∩
• For example, we have:
(R ⋈ S) ≡ (S ⋈ R) (Commutative)
– the order of the (two) arguments does not matter.
R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T (Associative)
– arguments can be grouped either from the left or from the right.
• The Set Difference ( − ) is neither commutative nor associative:
(R − S) ≢ (S − R) and, in general, R − (S − T) ≢ (R − S) − T
– e.g., for R = S = T = {1}: R − (S − T) = {1}, whereas (R − S) − T = ∅.
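These laws are easy to check on small examples. The sketch below uses Python sets as stand-ins for relations with identical schemas. The sample sets are hypothetical, and only union, intersection and difference are illustrated (not join).

```python
# Hypothetical sample "relations" represented as sets of values.
R = {1, 2, 3}
S = {2, 3, 4}
T = {3, 4, 5}

# Union and intersection: commutative and associative.
commutative = (R | S == S | R) and (R & S == S & R)
associative = (R | (S | T) == (R | S) | T) and (R & (S & T) == (R & S) & T)

# Set difference: compare both argument orders and both groupings.
diff_commutes = (R - S) == (S - R)
diff_associates = (R - (S - T)) == ((R - S) - T)

print(commutative, associative)        # laws hold for union and intersection
print(diff_commutes, diff_associates)  # neither law holds for difference here
```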

RA Equivalences: Selections

• Selections are crucial from the point of view of query optimization, because they typically reduce the size of intermediate results by a significant factor.
• Laws for selections only:
– σA1 ∧ … ∧ An (R) ≡ σA1 (… σAn (R)) (Cascading of conditions)
– σA1(σA2(R)) ≡ σA2(σA1(R)) (Commutative)

RA Equivalences: Selections

• Laws for the combination of selections and ×, ⋈: if R has all attributes mentioned in c,
σc(R ⋈ S) ≡ σc(R) ⋈ S
• Laws for the combination of selections and the other set-theoretic operations (∪, ∩, −), e.g.:
σc(R ∪ S) ≡ σc(R) ∪ σc(S)
• The above laws can be applied to "push selections down" as much as possible in an expression, i.e., to perform selections as early as possible. For example, if A = B ∧ C ∧ D, where predicate B mentions only attributes of R, C only attributes of S, and D attributes of both:
σA(R ⋈ S) ≡ σB ∧ C ∧ D (R ⋈ S) ≡ σD(σB(R) ⋈ σC(S))
• A selection over a Cartesian product yields a join:
σc(R × S) ≡ R ⋈c S
• Here A, B, C, D and c are predicates.
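The push-down law can be checked on a toy instance. In this Python sketch, relations are lists of dictionaries. The helpers `natural_join` and `select`, and the sample tuples, are hypothetical illustrations rather than part of any DBMS API.

```python
def natural_join(r, s, attr):
    """Join two relations (lists of dicts) on equality of a shared attribute."""
    return [{**a, **b} for a in r for b in s if a[attr] == b[attr]]

def select(rel, pred):
    """Relational selection: keep the tuples satisfying pred."""
    return [t for t in rel if pred(t)]

# Hypothetical sample data.
sailors = [{"sid": 1, "rating": 7}, {"sid": 2, "rating": 3}, {"sid": 3, "rating": 9}]
reserves = [{"sid": 1, "bid": 100}, {"sid": 2, "bid": 100}, {"sid": 3, "bid": 101}]

def c(t):
    # The predicate mentions only attributes of Sailors, so it can be pushed down.
    return t["rating"] > 5

late = select(natural_join(sailors, reserves, "sid"), c)   # selection after the join
early = natural_join(select(sailors, c), reserves, "sid")  # selection pushed down

print(late == early)            # same result either way
print(len(select(sailors, c)))  # the pushed-down plan joins fewer Sailors tuples
```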

RA Equivalences: Projections

• Projections can be cascaded:
πA1(R) ≡ πA1(πA2(… (πAn(R)) …)), where A1 ⊆ A2 ⊆ … ⊆ An
• Projection distributes over union (∪). It does not, in general, distribute over intersection (∩) or set difference (−).
– e.g., for R = {(1, a)} and S = {(1, b)}, projecting on the first attribute gives ∅ for R ∩ S but {1} for the intersection of the two projections.
• Selection and projection: a projection commutes with a selection.
– Applies only if σ uses only attributes retained by π.
• Projection and joins: we can "push" the projection down by retaining only those attributes of R (and S) that are needed for the join (or are kept by the projection).

* Study book chapter 15.3 for more details on RA equivalences.

Query Blocks: Units of Optimization

• An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
• Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.

SELECT S.sname
FROM Sailors S
WHERE S.age IN (SELECT MAX(S2.age)
                FROM Sailors S2
                GROUP BY S2.rating)

Outer block: SELECT S.sname FROM Sailors S WHERE S.age IN (...)
Nested block: SELECT MAX(S2.age) FROM Sailors S2 GROUP BY S2.rating

• For each block, the plans considered are:
– All available access methods, for each relation in the FROM clause.
– All possible join trees for the relations in the FROM clause.
• We shall see the above in further detail in the following slides.

Optimizing Query Block Example

• Example Schema
– Sailors (sid: integer, sname: string, rating: integer, age: real)
– Reserves (sid: integer, bid: integer, day: date, rname: string)
– Boats (bid: integer, bname: string, color: string)
• Example Query
– For each sailor with the highest rating (over all sailors) and at least two reservations for red boats, find the sailor id and the earliest date on which the sailor has a reservation for a red boat.

SELECT S.sid, MIN(R.day)   -- Find sailor ID and earliest day of a red-boat reservation
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red' AND
      S.rating = ( SELECT MAX(S2.rating) FROM Sailors S2 )   -- Highest rating
GROUP BY S.sid
HAVING COUNT(*) > 1   -- At least two such reservations

Query Blocks: Units of Optimization

SQL: User's Query

SELECT S.sid, MIN(R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red' AND
      S.rating = ( SELECT MAX(S2.rating) FROM Sailors S2 )
GROUP BY S.sid
HAVING COUNT(*) > 1

SQL: Only consider the Outer Block for the Optimization part

SELECT S.sid, MIN(R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red' AND
      S.rating = Reference to Nested (Inner) Block
GROUP BY S.sid
HAVING COUNT(*) > 1

Extended* Relational Algebra Block: [expression shown as a figure in the original slides]

* Recall that HAVING, GROUP BY and aggregates are not part of Relational Algebra.

Query Blocks: Units of Optimization

• A query is treated as a σ-π-× algebra expression, with the remaining operations (if any) carried out on the result.
• For our example, the optimizer only considers the Relational Algebra block below (the part that will be considered for evaluation):
[expression shown as a figure in the original slides]
• Aggregates, HAVING and GROUP BY are calculated after computing the σ-π-× part of a query.
• Now the optimizer needs to i) enumerate the alternative plans and ii) estimate the cost of each plan.

Enumeration of Alternative Plans

• Problem: The space of alternative plans for a given query is very large!
• To motivate the discussion, consider the binary query evaluation plans and assume that only 1 join algorithm exists.
• Question: How many such plans can we have?
• Answer: The number of binary trees with n nodes, Cn = (2n)! / (n+1)!
– n=4: 336 possible trees
– n=5: 5040 possible trees
– ….
– n=10: about 6 x 10^10 possible trees

[Figure: three example binary join trees over the relations A, B, C, D]

We certainly need to prune the search space!
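The growth of this count can be checked directly. The short Python sketch below evaluates (2n)! / (n+1)!, the closed form whose values for n=4 and n=10 are the 336 and 6 x 10^10 quoted on the slide. The function name `binary_plan_count` is a hypothetical label.

```python
from math import factorial

def binary_plan_count(n):
    """Evaluate (2n)! / (n+1)! using exact integer arithmetic."""
    return factorial(2 * n) // factorial(n + 1)

for n in (4, 5, 10):
    print(n, binary_plan_count(n))
```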

Enumeration of Alternative Plans

• The Query Optimizer therefore focuses on a subset of plans:
– Algebraic plans: those that can be expressed with Relational Algebra operators.
– Enumerable plans: e.g., only binary plans.
– Searched plans: among binary plans, only consider the left-deep plans, i.e., those where the right child of each join is a leaf (base relation). This is the focus of the Query Optimizer.
– Constructed plans: those that are actually constructed.

Enumeration of Alternative Plans

• Left-deep join trees:
– A left-deep tree is a tree in which the right child of each join is a leaf (i.e., a base table or index).
– Left-deep trees allow us to generate all fully pipelined plans.
• As results are generated, they are forwarded to the operator higher in the tree hierarchy.
• Intermediate results are not written to temporary files.
• Not all left-deep plans are fully pipelined (e.g., with a sort-merge join, no results are generated during sorting, only during merging).

[Figure: left-deep join tree over the relations A, B, C, D]
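Pipelining can be mimicked with Python generators, where each operator pulls tuples from its child on demand. This is a minimal sketch with hypothetical operator names, sample data and a `scanned` log. It is not how any particular DBMS implements its iterators.

```python
scanned = []  # records which outer tuples have been read so far

def scan(table):
    for row in table:
        scanned.append(row)
        yield row

def select(rows, pred):
    for row in rows:
        if pred(row):
            yield row

def nested_loops_join(outer_rows, inner_table, key):
    # The inner (right) input is a base table, as in a left-deep plan.
    for o in outer_rows:
        for i in inner_table:
            if o[key] == i[key]:
                yield {**o, **i}

# Hypothetical sample data.
reserves = [{"sid": 1, "bid": 100}, {"sid": 2, "bid": 101}, {"sid": 3, "bid": 100}]
sailors = [{"sid": 1, "sname": "Dustin"}, {"sid": 2, "sname": "Lubber"},
           {"sid": 3, "sname": "Rusty"}]

plan = nested_loops_join(select(scan(reserves), lambda r: r["bid"] == 100),
                         sailors, "sid")

first = next(plan)  # the first result is produced...
print(first, len(scanned))  # ...after reading only part of the outer input
rest = list(plan)
```

No intermediate result is stored: the join emits its first tuple before the scan of the outer relation has finished.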

Enumeration of Alternative Plans

• Even when considering only left-deep plans, the number of plans still grows rapidly as the number of joins increases!
• In particular, we have n! possible plans*, where n is the number of base relations participating in the join.
– With n=4, we have 24 possible plans
– With n=5, we have 120 possible plans
– With n=6, we have 720 possible plans
– ….
– With n=10, we have 3628800 possible plans

[Figure: example left-deep join trees over the relations A, B, C, D in different join orders]

* Again assuming that only 1 join algorithm exists
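The n! figure corresponds to the number of orderings of the base relations. The Python sketch below enumerates them for four relations. The function name and the textual plan notation are hypothetical, and a single join algorithm is assumed, as on the slide.

```python
from itertools import permutations

def left_deep_plans(relations):
    """Build one left-deep join expression per ordering of the relations."""
    plans = []
    for order in permutations(relations):
        plan = order[0]
        for rel in order[1:]:
            # The right child of every join is a base relation.
            plan = f"({plan} JOIN {rel})"
        plans.append(plan)
    return plans

plans = left_deep_plans(["A", "B", "C", "D"])
print(len(plans))
print(plans[0])
```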

Enumeration of Alternative Plans

• When enumerating plans, we need a way to determine the cost of each plan.
• The cost of a query plan is determined largely by the order in which the tables are joined.
• Most query optimizers determine the join order via a dynamic programming algorithm pioneered by IBM's System R database project (next slide).

Enumeration of Alternative Plans

Sketch of the Enumeration Algorithm (uses Dynamic Programming)

• Pass 1: Find the access paths (file scan, indexes, etc.) for each relation in the query.
– Objective: Record the cheapest way to scan the relation, as well as the cheapest way to scan the relation that produces records in a particular sorted order.
– e.g., FileScan for fetching all tuples and B+ tree for fetching IDs in sorted order.
• Pass 2: For each pair of relations (for which a join condition exists), find the cheapest way to join the relations and generate results i) with no order and ii) with order.
– Utilize the available join algorithms implemented by the DBMS (nested-loops join, sort-merge join, etc.).
• Pass 3: For each combination of 3 relations (for which a join condition exists), find the cheapest way to join the relations and generate results i) with no order and ii) with order.
– In particular, it will join each two-relation plan produced by the previous pass with the remaining relations in the query.
• Pass N: Continue the above until all the relations in the query are considered.
• At the end we will obtain the overall best plan!
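The passes above can be sketched in Python as a dynamic program over subsets of relations. This is a heavily simplified illustration, not the System R algorithm itself: it assumes a toy cost model (the sum of estimated intermediate result sizes), ignores access paths, join algorithms and sort orders, and uses hypothetical cardinalities and join selectivities.

```python
from itertools import combinations

def best_left_deep_order(cards, selectivity):
    """cards: {relation: cardinality}; selectivity: {frozenset({a, b}): factor}.
    Returns (cost, estimated rows, join order) under the toy cost model."""
    rels = sorted(cards)
    # Pass 1: one entry per single relation (scan cost is ignored here).
    best = {frozenset([r]): (0.0, float(cards[r]), (r,)) for r in rels}
    # Pass k: extend the best (k-1)-relation plans by one base relation.
    for size in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, size)):
            for inner in sorted(subset):
                outer = subset - {inner}
                if outer not in best:
                    continue
                sels = [selectivity[frozenset((inner, o))]
                        for o in outer if frozenset((inner, o)) in selectivity]
                if not sels:
                    continue  # no join condition: skip cross products
                cost, rows, order = best[outer]
                out_rows = rows * cards[inner]
                for s in sels:
                    out_rows *= s
                new_cost = cost + out_rows
                if subset not in best or new_cost < best[subset][0]:
                    best[subset] = (new_cost, out_rows, order + (inner,))
    return best[frozenset(rels)]

# Hypothetical statistics; "Boats" stands for the few boats left after a selection.
cards = {"Sailors": 40000, "Reserves": 100000, "Boats": 10}
selectivity = {frozenset(("Sailors", "Reserves")): 1 / 40000,
               frozenset(("Reserves", "Boats")): 1 / 100}

cost, rows, order = best_left_deep_order(cards, selectivity)
print(order, round(cost))
```

Only the cheapest plan per subset of relations is kept, which is what keeps the search far below n! full enumerations. The function raises a `KeyError` if the relations cannot all be connected through join conditions.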

Cost Estimation of Plans

• Consider a Query Block:

SELECT attribute list
FROM A, B, …, Z
WHERE term1 AND ... AND termz

• The maximum number of tuples in the result is the product of the cardinalities of the relations in the FROM clause.
– i.e., |A| * |B| * … * |Z|
• Reduction factor (RF): defines the ratio of the expected result size to the input size.
– e.g., term1 yields 200 expected answers out of 1000 => RF of term1 = 0.2
– Result cardinality = maximum number of tuples * product of all RFs.
• How can a DBMS know these RFs for a table without spending too much time? (next slide)
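The estimate is simple arithmetic, sketched below in Python. The function name and the sample cardinalities and reduction factors are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
from math import prod

def estimated_cardinality(cardinalities, reduction_factors):
    """Maximum number of tuples times the product of all reduction factors."""
    return prod(cardinalities) * prod(reduction_factors)

# Two relations with 1000 and 500 tuples; two WHERE terms with RF 0.2 and 0.01.
print(estimated_cardinality([1000, 500], [0.2, 0.01]))
```

Multiplying the reduction factors in this way treats the WHERE terms as independent of each other.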

Reduction Factors Using Histograms

• Wrong Answer: Scan the table => Too expensive
• Correct Answer: Utilize histograms, i.e., tiny data structures that approximate the real distribution of values in a table (stored in the system catalog).
• Example:

[Figure: initial distribution of "age" (y-axis: frequency of appearance), together with the corresponding equiwidth histogram and equidepth histogram]
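A minimal Python sketch of the idea follows, assuming an equiwidth histogram and a uniform distribution of values inside each bucket. The function names, the sample ages and the bucket layout are hypothetical, and an equidepth histogram would differ only in how the bucket boundaries are chosen.

```python
def equiwidth_histogram(values, lo, hi, buckets):
    """Count the values falling into each of `buckets` equal-width ranges of [lo, hi)."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        idx = min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return counts

def estimate_rf_greater_than(counts, lo, hi, threshold):
    """Estimate the reduction factor of `attr > threshold` from the bucket counts."""
    width = (hi - lo) / len(counts)
    estimate = 0.0
    for i, count in enumerate(counts):
        b_lo = lo + i * width
        b_hi = b_lo + width
        if threshold <= b_lo:
            estimate += count  # whole bucket qualifies
        elif threshold < b_hi:
            estimate += count * (b_hi - threshold) / width  # part of the bucket
    return estimate / sum(counts)

# Hypothetical ages.
ages = [5, 12, 18, 22, 25, 27, 31, 33, 35, 38, 41, 44, 52, 58]
counts = equiwidth_histogram(ages, 0, 60, 3)
rf = estimate_rf_greater_than(counts, 0, 60, 30)
actual = sum(a > 30 for a in ages) / len(ages)
print(counts, round(rf, 3), round(actual, 3))
```

The estimate is computed from three counters instead of a table scan, at the price of some error relative to the actual fraction.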

Query Optimization Example

• Consider that we have the following access methods for the working example we've been using:
• Sailors:
– Clustered B+ tree on rating
– Clustered Hash index on sid
• Reserves:
– Unclustered B+ tree on bid
• Task: The query optimizer needs to optimize the query evaluation plan shown in the figure.

[Figure: query evaluation plan, bottom to top: Reserves and Sailors; join on sid=sid; selections bid=100 and rating > 5; projection on sname]

Query Optimization Example

• Pass 1:
– Sailors:
• Utilize the clustered B+ tree for rating > 5.
• If the index were unclustered, we would consider the FileScan.
• In many cases the B+ tree might be considered anyway, as it returns tuples in rating order.
– Reserves:
• Utilize the B+ tree on bid, as it can quickly match bid=100 (regardless of whether the index is clustered or unclustered).
• Pass 2:
– Consider each plan retained from Pass 1 as the outer, and consider how to join it with the (only) other inner relation.
• e.g., Reserves as outer: the Hash index can be used to get the Sailors (inner) tuples that satisfy sid = outer tuple's sid value.

[Figure: the same query evaluation plan, annotated with the access methods: B+ tree on Reserves, Hash index and B+ tree on Sailors]

Highlights of System R Optimizer

• Basic Ideas in the System R Query Optimizer:
– Plan Space: Too large, must be pruned!
– Cost estimation: Approximate art at best.
• Characteristics:
– Statistics, maintained in system catalogs, are used to estimate the cost of operations and result sizes.
– Considers only left-deep plans.
– No nested sub-queries (as these would increase the plan search space, making optimization slow).
– No duplicate elimination in the tree (only as a final step).
• Why? Duplicate elimination requires sorting or hashing; consequently, the operator cannot pipeline its results higher in the query plan.
– Considers a combination of CPU and I/O costs.
• Impact:
– Most widely used currently; works well for < 10 joins.

Summary

• Query optimization is an important task in a relational DBMS.
• Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
• Two parts to optimizing a query:
– Consider a set of alternative plans.
• Must prune search space; typically, left-deep plans only.
– Must estimate cost of each plan that is considered.
• Must estimate size of result and cost for each plan node.
• Key issues: Statistics, indexes, operator implementations.