
Database Management Systems II, Huiping Cao

Query Optimization

References:
- [RG-3ed] Chapter 15
- [SKS-6ed] Chapter 13

Query Optimization

- Introduction
- Evaluation of Expressions
- Query Blocks
- Transformation of Relational Expressions
- Cost Estimation
- Enumeration of Alternative Plans
- Nested Subqueries

Individual Operators

- Queries are composed of a few basic operators; the implementation of these operators can be carefully tuned (and it is important to do this!).
- There are many alternative implementation techniques for each operator; for most operators, no technique is universally superior.
- We must consider the available alternatives for each operation in a query and choose the best one based on system statistics, etc. This is part of the broader task of optimizing a query composed of several operations.

Introduction

- Alternative ways of evaluating a given query
  - Equivalent expressions
  - Different algorithms for each operation

SELECT name, title
FROM instructor, teaches, course
WHERE dept_name = 'Music'
  AND instructor.id = teaches.iid
  AND course.cid = teaches.cid;


Introduction (Cont.)

- An evaluation plan defines exactly which algorithm is used for each operation, and how the execution of the operations is coordinated.
- Find out how to view query execution plans on your favorite database.

PostgreSQL

EXPLAIN select * from measurement_instance as mi, measurement_type as mt
where mt.annot_id=mi.did and mt.mtypelabel=mi.mtypelabel;

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Hash Join  (cost=11.75..25.96 rows=1 width=1377)
   Hash Cond: ((mi.did = mt.annot_id) AND (mi.mtypelabel = mt.mtypelabel))
   ->  Seq Scan on measurement_instance mi  (cost=0.00..12.40 rows=240 width=310)
   ->  Hash  (cost=10.70..10.70 rows=70 width=1067)
         ->  Seq Scan on measurement_type mt  (cost=0.00..10.70 rows=70 width=1067)
(5 rows)

Introduction (Cont.)

- Cost difference between evaluation plans for a query can be enormous
  - E.g. seconds vs. days in some cases
- Steps in cost-based query optimization
  1. Generate logically equivalent expressions using equivalence rules
  2. Annotate resultant expressions to get alternative query plans
  3. Choose the cheapest plan based on estimated cost

Introduction (Cont.)

- Estimation of plan cost is based on:
  - Statistical information about relations. Examples:
    - number of tuples, number of distinct values for an attribute
  - Statistics estimation for intermediate results
    - to compute cost of complex expressions
  - Cost formulae for algorithms, computed using statistics


Evaluation of Expressions

- So far: we have seen algorithms for individual operations
- Alternatives for evaluating an entire expression tree
  - Materialization: generate the results of an expression whose inputs are relations or are already computed, and materialize (store) them on disk.
  - Pipelining: pass on tuples to parent operations even as an operation is being executed

Materialization

- Materialized evaluation: evaluate one operation at a time, starting at the lowest level. Use intermediate results materialized into temporary relations to evaluate next-level operations.
- E.g., in the figure below, compute and store $\sigma_{building = \text{"Watson"}}(department)$, then compute and store its join with instructor, and finally compute the projection on name.

[Figure: expression tree for $\Pi_{name}(\sigma_{building = \text{"Watson"}}(department) \bowtie instructor)$]

Materialization (Cont.)

- Materialized evaluation is always applicable
- Cost of writing results to disk and reading them back can be quite high
  - The cost formulae for individual operations ignore the cost of writing results to disk, so
    - Overall cost = Sum of costs of individual operations + cost of writing intermediate results to disk
- Double buffering: use two output buffers for each operation; when one is full, write it to disk while the other is getting filled
  - Allows overlap of disk writes with computation and reduces execution time

Pipelining

- Pipelined evaluation: evaluate several operations simultaneously, passing the results of one operation on to the next.
- E.g., in the previous expression tree, do not store the results of $\sigma_{building = \text{"Watson"}}(department)$
  - Instead, pass tuples directly to the join. Similarly, do not store the results of the join; pass tuples directly to the projection.
- Much cheaper than materialization: no need to store a temporary relation to disk.
- Pipelining may not always be possible, e.g., sort, hash-join.
- Pipelines can be executed in two ways: demand driven and producer driven

Pipelining (Cont.)

- In demand-driven or lazy evaluation (pull model)
  - System repeatedly requests next tuple from the top-level operation
  - Each operation requests next tuple from children operations as required, in order to output its next tuple
  - In between calls, operation has to maintain "state" so it knows what to return next
- In producer-driven or eager pipelining (push model)
  - Operators produce tuples eagerly and pass them up to their parents
    - Buffer maintained between operators; child puts tuples in buffer, parent removes tuples from buffer
    - If buffer is full, child waits till there is space in the buffer, and then generates more tuples
  - System schedules operations that have space in the output buffer and can process more input tuples
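The demand-driven (pull) model can be illustrated with Python generators. This is only a minimal sketch, not how any particular DBMS implements its operators: the operator names `scan`, `select`, and `project` and the toy `department` data are assumptions made for illustration. Each generator keeps its own "state" between requests, which is exactly what the pull model requires.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]

def scan(relation: Iterable[Row]) -> Iterator[Row]:
    """Leaf operator: hands out one stored tuple each time the parent asks."""
    for row in relation:
        yield row

def select(child: Iterator[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Pulls tuples from its child until one satisfies the predicate."""
    for row in child:
        if predicate(row):
            yield row

def project(child: Iterator[Row], attrs: List[str]) -> Iterator[Row]:
    """Keeps only the requested attributes of each tuple it pulls."""
    for row in child:
        yield {a: row[a] for a in attrs}

# Toy relation (invented data, for illustration only).
department = [
    {"dept_name": "Music", "building": "Packard"},
    {"dept_name": "Physics", "building": "Watson"},
    {"dept_name": "Biology", "building": "Watson"},
]

# No intermediate result is stored: tuples flow one at a time on demand.
plan = project(
    select(scan(department), lambda r: r["building"] == "Watson"),
    ["dept_name"],
)
first = next(plan)   # the system requests one tuple from the top-level operator
rest = list(plan)    # ...and keeps requesting until the pipeline is exhausted
```

Nothing is computed until `next(plan)` is called, and the scan advances only as far as needed to produce each requested tuple.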

Evaluation Algorithms for Pipelining

- Some algorithms are not able to output results even as they get input tuples
  - E.g. merge join, or hash join
  - Intermediate results are written to disk and then read back
- Blocking operations
- Operations are pipelined

- PostgreSQL
  - Explain command: show the execution plan of a statement
  - Ref: http://www.postgresql.org/docs/8.1/static/sql-explain.html
- MySQL
  - Explain command
  - Ref: http://dev.mysql.com/doc/refman/5.0/en/explain.html


Query Blocks: Units of Optimization

- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- A query block
  - No nesting
  - Exactly one SELECT and one FROM clause
  - At most one WHERE clause, GROUP BY clause, and HAVING clause
    - WHERE clause in conjunctive normal form

SELECT S.sname
FROM Sailors S
WHERE S.age IN
      (SELECT MAX (S2.age)
       FROM Sailors S2
       GROUP BY S2.rating)

Outer block: SELECT S.sname FROM Sailors S WHERE S.age IN (nested block)
Nested block: SELECT MAX (S2.age) FROM Sailors S2 GROUP BY S2.rating

Query Blocks: Units of Optimization

- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. (This is an over-simplification, but serves for now.)
- For each block, the plans considered are:
  - All available access methods, for each relation in the FROM clause.
  - All left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

SELECT S.sname
FROM Sailors S
WHERE S.age IN
      (SELECT MAX (S2.age)
       FROM Sailors S2
       GROUP BY S2.rating)

Outer block: SELECT S.sname FROM Sailors S WHERE S.age IN (nested block)
Nested block: SELECT MAX (S2.age) FROM Sailors S2 GROUP BY S2.rating


Transformation of Relational Expressions

- Two relational algebra expressions are said to be equivalent if the two expressions generate the same set of tuples on every legal database instance
  - Note: order of tuples is irrelevant
  - We do not care if they generate different results on databases that violate integrity constraints
- In SQL, inputs and outputs are multisets of tuples
  - Two expressions in the multiset version of the relational algebra are said to be equivalent if the two expressions generate the same multiset of tuples on every legal database instance.
- An equivalence rule says that expressions of two forms are equivalent
  - Can replace an expression of the first form by the second, or vice versa

Equivalence Rules

1. Conjunctive selection operations can be deconstructed into a sequence of individual selections.
   $\sigma_{\theta_1 \wedge \theta_2}(E) = \sigma_{\theta_1}(\sigma_{\theta_2}(E))$
2. Selection operations are commutative.
   $\sigma_{\theta_1}(\sigma_{\theta_2}(E)) = \sigma_{\theta_2}(\sigma_{\theta_1}(E))$
3. Only the last in a sequence of projection operations is needed; the others can be omitted.
   $\Pi_{L_1}(\Pi_{L_2}(\ldots(\Pi_{L_n}(E))\ldots)) = \Pi_{L_1}(E)$
4. Selections can be combined with Cartesian products and theta joins.
   a. $\sigma_{\theta}(E_1 \times E_2) = E_1 \bowtie_{\theta} E_2$
   b. $\sigma_{\theta_1}(E_1 \bowtie_{\theta_2} E_2) = E_1 \bowtie_{\theta_1 \wedge \theta_2} E_2$

Equivalence Rules (Cont.)

5. Theta-join operations (and natural joins) are commutative.
   $E_1 \bowtie_{\theta} E_2 = E_2 \bowtie_{\theta} E_1$
6. (a) Natural join operations are associative:
   $(E_1 \bowtie E_2) \bowtie E_3 = E_1 \bowtie (E_2 \bowtie E_3)$
   (b) Theta joins are associative in the following manner:
   $(E_1 \bowtie_{\theta_1} E_2) \bowtie_{\theta_2 \wedge \theta_3} E_3 = E_1 \bowtie_{\theta_1 \wedge \theta_3} (E_2 \bowtie_{\theta_2} E_3)$
   where $\theta_2$ involves attributes from only $E_2$ and $E_3$.

Equivalence Rules (Cont.)

7. The selection operation distributes over the theta join operation under the following two conditions:
   (a) When all the attributes in $\theta_0$ involve only the attributes of one of the expressions being joined (say $E_1$):
   $\sigma_{\theta_0}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_0}(E_1)) \bowtie_{\theta} E_2$
   (b) When $\theta_1$ involves only the attributes of $E_1$ and $\theta_2$ involves only the attributes of $E_2$:
   $\sigma_{\theta_1 \wedge \theta_2}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_1}(E_1)) \bowtie_{\theta} (\sigma_{\theta_2}(E_2))$
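Rule 7(a) can be checked on a small instance. The sketch below is illustrative only: the relations, the helper names `theta_join`, `sigma`, and `as_multiset`, and the data are all invented, and a check on one instance is not a proof of the rule. It evaluates both sides of the equivalence and compares them as multisets.

```python
from itertools import product

# Toy relations (invented data); each tuple is a dict.
instructor = [
    {"id": 1, "name": "Mozart", "dept_name": "Music"},
    {"id": 2, "name": "Einstein", "dept_name": "Physics"},
    {"id": 3, "name": "Bach", "dept_name": "Music"},
]
teaches = [
    {"iid": 1, "cid": "MU-199"},
    {"iid": 2, "cid": "PHY-101"},
    {"iid": 3, "cid": "MU-201"},
    {"iid": 3, "cid": "MU-199"},
]

def theta_join(r, s, theta):
    """Nested-loop theta join: concatenate every pair that satisfies theta."""
    return [{**a, **b} for a, b in product(r, s) if theta(a, b)]

def sigma(r, pred):
    """Selection: keep the tuples that satisfy pred."""
    return [t for t in r if pred(t)]

def as_multiset(r):
    """Order-insensitive representation, since tuple order is irrelevant."""
    return sorted(sorted(t.items()) for t in r)

theta = lambda a, b: a["id"] == b["iid"]          # join condition
theta0 = lambda t: t["dept_name"] == "Music"      # involves only instructor

late = sigma(theta_join(instructor, teaches, theta), theta0)    # left-hand side
early = theta_join(sigma(instructor, theta0), teaches, theta)   # right-hand side

same_result = as_multiset(late) == as_multiset(early)
```

Both forms return the same tuples, but the second joins only the Music instructors, so the join examines fewer pairs.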


Pictorial Depiction of Equivalence Rules

Equivalence Rules (Cont.)

8. The projection operation distributes over the theta join operation as follows:
   (a) If $\theta$ involves only attributes from $L_1 \cup L_2$:
   $\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = (\Pi_{L_1}(E_1)) \bowtie_{\theta} (\Pi_{L_2}(E_2))$
   (b) Consider a join $E_1 \bowtie_{\theta} E_2$.
   - Let $L_1$ and $L_2$ be sets of attributes from $E_1$ and $E_2$, respectively.
   - Let $L_3$ be attributes of $E_1$ that are involved in join condition $\theta$, but are not in $L_1 \cup L_2$, and
   - Let $L_4$ be attributes of $E_2$ that are involved in join condition $\theta$, but are not in $L_1 \cup L_2$.
   $\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = \Pi_{L_1 \cup L_2}((\Pi_{L_1 \cup L_3}(E_1)) \bowtie_{\theta} (\Pi_{L_2 \cup L_4}(E_2)))$

Equivalence Rules (Cont.)

9. The set operations union and intersection are commutative:
   $E_1 \cup E_2 = E_2 \cup E_1$
   $E_1 \cap E_2 = E_2 \cap E_1$
   (Set difference is not commutative.)
10. Set union and intersection are associative:
   $(E_1 \cup E_2) \cup E_3 = E_1 \cup (E_2 \cup E_3)$
   $(E_1 \cap E_2) \cap E_3 = E_1 \cap (E_2 \cap E_3)$
11. The selection operation distributes over $\cup$, $\cap$, and $-$:
   $\sigma_{\theta}(E_1 - E_2) = \sigma_{\theta}(E_1) - \sigma_{\theta}(E_2)$, and similarly for $\cup$ and $\cap$ in place of $-$.
   Also: $\sigma_{\theta}(E_1 - E_2) = \sigma_{\theta}(E_1) - E_2$, and similarly for $\cap$ in place of $-$, but not for $\cup$.
12. The projection operation distributes over union:
   $\Pi_{L}(E_1 \cup E_2) = (\Pi_{L}(E_1)) \cup (\Pi_{L}(E_2))$

Transformation Example: Pushing Selections

- Query: Find the names of all instructors in the Music department, along with the titles of the courses that they teach
  - $\Pi_{name, title}(\sigma_{dept\_name = \text{"Music"}}(instructor \bowtie (teaches \bowtie \Pi_{course\_id, title}(course))))$
- Transformation using rule 7a:
  - $\Pi_{name, title}((\sigma_{dept\_name = \text{"Music"}}(instructor)) \bowtie (teaches \bowtie \Pi_{course\_id, title}(course)))$
- Performing the selection as early as possible reduces the size of the relation to be joined.

Example with Multiple Transformations

- Query: Find the names of all instructors in the Music department who have taught a course in 2009, along with the titles of the courses that they taught
  - $\Pi_{name, title}(\sigma_{dept\_name = \text{"Music"} \wedge year = 2009}(instructor \bowtie (teaches \bowtie \Pi_{course\_id, title}(course))))$
- Transformation using join associativity (Rule 6a):
  - $\Pi_{name, title}(\sigma_{dept\_name = \text{"Music"} \wedge year = 2009}((instructor \bowtie teaches) \bowtie \Pi_{course\_id, title}(course)))$
- The second form provides an opportunity to apply the "perform selections early" rule, resulting in the subexpression
  $\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie \sigma_{year = 2009}(teaches)$


Multiple Transformations (Cont.)

Transformation Example: Pushing Projections

- Consider: $\Pi_{name, title}((\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie teaches) \bowtie \Pi_{course\_id, title}(course))$
- When we compute $(\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie teaches)$ we obtain a relation whose schema is:
  (ID, name, dept_name, salary, course_id, sec_id, semester, year)
- Push projections using equivalence rules 8a and 8b; eliminate unneeded attributes from intermediate results to get:
  $\Pi_{name, title}((\Pi_{name, course\_id}(\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie teaches)) \bowtie \Pi_{course\_id, title}(course))$
- Performing the projection as early as possible reduces the size of the relation to be joined.

Join Ordering Example

- For all relations $r_1$, $r_2$, and $r_3$,
  $(r_1 \bowtie r_2) \bowtie r_3 = r_1 \bowtie (r_2 \bowtie r_3)$ (Join Associativity)
- If $r_2 \bowtie r_3$ is quite large and $r_1 \bowtie r_2$ is small, we choose
  $(r_1 \bowtie r_2) \bowtie r_3$
  so that we compute and store a smaller temporary relation.

Join Ordering Example (Cont.)

- Consider the expression
  $\Pi_{name, title}((\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie teaches) \bowtie \Pi_{course\_id, title}(course))$
- Could compute $teaches \bowtie \Pi_{course\_id, title}(course)$ first, but the result of this join is likely to be a large relation
- Only a small fraction of the university's instructors are likely to be from the Music department
  - It is better to compute $\sigma_{dept\_name = \text{"Music"}}(instructor) \bowtie teaches$ first


Cost Estimation

- For each plan considered, must estimate cost:
  - Must estimate cost of each operation in plan tree.
    - Depends on input cardinalities.
    - Cost of operations (sequential scan, index scan, joins, etc.)
  - Must also estimate size of result for each operation in tree!
    - Use information about the input relations.
    - For selections and joins, assume independence of predicates.

Estimating result size

- Reduction factor
  - The ratio of the (expected) result size to the input size, considering only the selection represented by the term.
- How to calculate reduction factors:
  - Column = value
    - 1/Nkeys(I), where I is an index on the column
    - 1/10: arbitrary default when there is no index on the column
  - Column1 = column2
    - 1/Max(Nkeys(I1), Nkeys(I2)): both columns have indexes
    - 1/Nkeys(I): either column1 or column2 has index I
    - 1/10: no index

Size estimation (cont.)

- How to calculate reduction factors:
  - Column > value
    - (High(I) - value)/(High(I) - Low(I)): with index
    - Less than half: no index
  - Column IN (list of values)
    - RF(column = value) * number of items
    - At most half
- Reduction factor for projection
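The reduction-factor rules on the last two slides can be written down as small functions. This is a sketch under stated assumptions: the function names are invented, `None` stands for "no index on the column", and clamping the range estimate to [0, 1] is an addition of this sketch rather than something the slides specify.

```python
from typing import Optional

DEFAULT_RF = 1 / 10  # default used when no index statistics are available

def rf_equals_value(nkeys: Optional[int]) -> float:
    """column = value: 1/Nkeys(I) with an index I, else the default."""
    return 1 / nkeys if nkeys else DEFAULT_RF

def rf_equals_column(nkeys1: Optional[int], nkeys2: Optional[int]) -> float:
    """column1 = column2: 1/Max(Nkeys) over the available indexes, else the default."""
    known = [n for n in (nkeys1, nkeys2) if n]
    return 1 / max(known) if known else DEFAULT_RF

def rf_greater_than(value: float, low: float, high: float) -> float:
    """column > value, with an index supplying Low(I) and High(I)."""
    rf = (high - value) / (high - low)
    return min(1.0, max(0.0, rf))  # clamping is an assumption of this sketch

def rf_in_list(nkeys: Optional[int], n_items: int) -> float:
    """column IN (list): RF(column = value) * number of items, at most half."""
    return min(0.5, rf_equals_value(nkeys) * n_items)
```

For example, `rf_equals_column(10, 40)` uses the larger key count and returns 1/40, while `rf_in_list(4, 3)` would be 0.75 and is capped at one half.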

Improved Statistics: Histograms

- Column > value
  - rN for a uniform distribution (in the example below, r = 1/15 and N = 45)
- Histogram on age
- Age > 13
  - Nonuniform: 9
  - Uniform: (1/15)*45 = 3

[Figure: histograms of age (values 0 to 14, 45 tuples): the actual nonuniform distribution next to its uniform approximation]

Histograms

- Equi-width histograms
  - Divide the range into subranges of equal size (in terms of the values)
- Equi-depth histograms
  - Divide the range into subranges such that the number of tuples in each subrange is equal
- Age > 13

Equi-width histogram (age values 0 to 14):

Bucket | Count | Tuples per value
1      | 8     | 2.67
2      | 4     | 1.33
3      | 15    | 5.0
4      | 3     | 1.0
5      | 15    | 5.0

Equi-depth histogram (age values 0 to 14):

Bucket | Count | Tuples per value
1      | 9     | 2.25
2      | 10    | 2.5
3      | 10    | 5.0
4      | 7     | 1.75
5      | 9     | 9.0
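The effect of the histogram type on the estimate for age > 13 can be worked through numerically. The sketch below rests on several assumptions: the per-age frequencies in `freq` are reconstructed so that they agree with the equi-width bucket counts quoted above (they are not given explicitly in the slides), the bucket boundaries are derived from each bucket's count and average per value, and values are assumed to be uniformly distributed inside a bucket.

```python
# Per-age tuple counts for ages 0..14 (assumption: reconstructed to match
# the equi-width bucket counts 8, 4, 15, 3, 15; 45 tuples in total).
freq = [2, 3, 3, 1, 2, 1, 3, 8, 4, 2, 0, 1, 2, 4, 9]

def actual_greater_than(v):
    """Exact number of tuples with age > v."""
    return sum(freq[v + 1:])

def uniform_estimate(v):
    """Assume every one of the 15 age values is equally frequent."""
    n_values = len(freq)
    return (n_values - 1 - v) / n_values * sum(freq)

def histogram_estimate(buckets, v):
    """buckets: (low, high, count) with inclusive bounds; uniform within a bucket."""
    total = 0.0
    for low, high, count in buckets:
        width = high - low + 1
        qualifying = max(0, high - max(low - 1, v))  # values in bucket that are > v
        total += count * qualifying / width
    return total

# Bucket boundaries are assumptions derived from count / tuples-per-value.
equi_width = [(0, 2, 8), (3, 5, 4), (6, 8, 15), (9, 11, 3), (12, 14, 15)]
equi_depth = [(0, 3, 9), (4, 7, 10), (8, 9, 10), (10, 13, 7), (14, 14, 9)]

est_uniform = uniform_estimate(13)
est_width = histogram_estimate(equi_width, 13)
est_depth = histogram_estimate(equi_depth, 13)
actual = actual_greater_than(13)
```

Under these assumptions the estimates for age > 13 are about 3 (uniform), 5 (equi-width), and 9 (equi-depth), against an actual count of 9: the equi-depth histogram isolates the frequent value 14 in its own bucket.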


Enumeration of Alternative Plans

- Given a query, an optimizer essentially
  - enumerates a certain set of plans,
  - chooses the plan with the least estimated cost.
- Algebraic equivalence
- Cost estimation
- Subset of plans considered by a typical optimizer

Enumeration of Alternative Plans

- There are two main cases:
  - Single-relation plans
  - Multiple-relation plans
- Consider a query block:
  - Maximum # tuples in result is the product of the cardinalities of relations in the FROM clause.
  - Reduction factor (RF) associated with each term reflects the impact of the term in reducing result size. Result cardinality = Max # tuples * product of all RFs.
- For queries over a single relation, queries consist of a combination of selects, projects, and aggregate ops:
  - Each available access path (file scan/index) is considered, and the one with the least estimated cost is chosen.
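The result-cardinality formula for a query block is a one-liner. In this sketch the function name and all the numbers are hypothetical, chosen only to show the arithmetic; the formula also relies on the independence assumption for the terms.

```python
from math import prod

def result_cardinality(relation_cardinalities, reduction_factors):
    """Max # tuples (product of cardinalities) times the product of all RFs."""
    max_tuples = prod(relation_cardinalities)
    return max_tuples * prod(reduction_factors)

# Hypothetical block: two relations with 40,000 and 100,000 tuples,
# a join term with RF 1/40,000 and a selection term with RF 1/10.
estimate = result_cardinality([40_000, 100_000], [1 / 40_000, 1 / 10])
```

With these made-up inputs the estimate is roughly 10,000 tuples out of a maximum of 4 billion.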

Single Relation Queries -- Plans without Index

SELECT S.rating, COUNT(DISTINCT S.sname) AS dsname
FROM Sailors S
WHERE S.rating > 5 AND S.age = 20
GROUP BY S.rating
HAVING dsname > 2;

- Plans without index
  - Scan the relation and apply selections and projections
  - Write out tuples after the selections and projections
  - Sort these tuples to implement the GROUP BY clause

Example

- File scan of Sailors: 500 pages
- Writing out (S.rating, S.sname) costs 500 * ratio
  - Let the selection RF of rating be 0.5
  - Let the selection RF of age be 0.1
  - Let the projection RF be 0.8
  - Result: 500 * 0.04 = 20
- Sorting the intermediate relation
  - Assume memory is sufficient to finish sorting in two passes. (Relational optimizers often assume that a relation can be sorted in two passes to simplify the estimation of sorting costs.)
  - 3 * 20 = 60
- Total cost: 500 + 20 + 60 = 580
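The arithmetic of this example can be replayed step by step. The variable names are invented for this sketch, costs are counted in page I/Os, and the comment on the factor 3 is one plausible reading of the two-pass assumption, not something the slide spells out.

```python
file_pages = 500                       # pages in Sailors
rf_rating, rf_age, rf_projection = 0.5, 0.1, 0.8

scan_cost = file_pages                 # read every page once

# Pages that survive the selections and the projection.
temp_pages = file_pages * rf_rating * rf_age * rf_projection

write_cost = temp_pages                # write out the temporary relation

# Assumed reading of "3 *": in a two-pass sort the temporary relation is
# read and written in the first pass and read again in the second.
sort_cost = 3 * temp_pages

total_cost = scan_cost + write_cost + sort_cost
```

The combined ratio 0.5 * 0.1 * 0.8 = 0.04 is where the 500 * 0.04 = 20 figure comes from, and the three components add up to the slide's total of 580.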

Single Relation Queries -- Plans Utilizing an Index

- Single-index access path
  - When several indexes match the selection conditions, choose the access path that is expected to retrieve the fewest pages
- Multiple-index access path
  - Intersect the sets of record ids
  - Sort according to page ids
  - Retrieve data
- Sorted index access path
  - GROUP BY attributes form a prefix of a tree index
- Index-only access path
  - Only index scan; avoid retrieving data tuples
  - Steps: (1) apply selections, (2) remove unwanted attributes, (3) sort for grouping, (4) compute aggregation
  - Works even if the index does not match the selections in the WHERE clause

Example

SELECT S.rating, COUNT(DISTINCT S.sname) AS dsname
FROM Sailors S
WHERE S.rating > 5 AND S.age = 20
GROUP BY S.rating
HAVING dsname > 2;

Assumptions:
- (1) B+-tree index on rating
- (2) Hash index on age
- (3) B+-tree index on <rating, sname, age>

Example

- Single-index access path
  - Use the hash index on age
    - Cost: retrieve the index entries + tuples
    - Apply the rating > 5 condition
  - Project out fields mentioned in SELECT, GROUP BY, HAVING
  - Write out temporary results (only keep sname and rating)
  - Sort on the rating field for GROUP BY
  - Apply aggregation and HAVING

Example

- Multiple-index access path
  - Retrieve rids of tuples satisfying rating > 5 (B+-tree index)
  - Retrieve rids of tuples satisfying age = 20 (hash index)
  - Intersect the two sets of rids and sort the result according to page id
  - Retrieve the corresponding data tuples; retain just the rating and name fields
  - Write temporary results
  - Sort on the rating field for GROUP BY
  - Apply aggregation and HAVING
- The other two cases?
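The rid-handling part of the multiple-index access path can be sketched as follows. Everything here is illustrative: rids are modeled as hypothetical (page_id, slot) pairs, and the two sets stand in for the results of the B+-tree and hash index lookups.

```python
# Stand-ins for the two index lookups (invented data); a rid is (page_id, slot).
rids_rating_gt_5 = {(3, 1), (1, 4), (7, 2), (1, 2), (5, 0)}
rids_age_eq_20 = {(1, 2), (9, 3), (7, 2), (3, 1)}

# Only rids that satisfy both conditions need to be fetched.
candidates = rids_rating_gt_5 & rids_age_eq_20

# Sorting the pairs orders them by page id first, so each data page
# is visited once, in page order.
fetch_order = sorted(candidates)
pages_to_read = sorted({page for page, _ in fetch_order})
```

Sorting by page id before fetching means tuples on the same page are retrieved together instead of re-reading that page for each rid.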

Queries Over Multiple Relations

- Linear trees: at least one child of a join node is a base table
- Left-deep tree: the right child of each join node is a base table
- Fundamental decision in System R: only left-deep join trees are considered.
  - As the number of joins increases, the number of alternative plans grows rapidly; we need to restrict the search space.
  - Left-deep trees allow us to generate all fully pipelined plans.
    - Intermediate results not written to temporary files.
    - Not all left-deep trees are fully pipelined (e.g., Sort-Merge join).

[Figure: three example join trees over the relations A, B, C, and D]

Page 51: Query Optimization - New Mexico State Universityhcao/teaching/cs582/note/... · Cost difference between evaluation plans for a query can be enormous! E.g. seconds vs. days in some

51Database Management Systems II, Huiping Cao

Enumeration of Left-Deep Plans

- Left-deep plans differ only in (1) the order of relations, (2) the access method for each relation, and (3) the join method for each join.
- Enumerated using N passes (if N relations are joined):
  - Pass 1: Find the best 1-relation plan for each relation.
    - Apply the selection terms that involve only this relation.
    - Project out unneeded attributes.
    - Keep the cheapest plan.
  - Pass 2: Find the best way to join the result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.)
  - Pass N: Find the best way to join the result of an (N-1)-relation plan (as outer) to the N’th relation. (All N-relation plans.)
- For each subset of relations, retain only:
  - the cheapest plan overall, plus
  - the cheapest plan for each interesting order of the tuples.
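The pass structure can be sketched as dynamic programming over subsets of relations. This is a simplified illustration, not the System R implementation: the cost model (sum of intermediate result sizes), the statistics, and the function name `best_left_deep_plan` are assumptions, and access methods, join methods, and interesting orders are ignored, so only the cheapest plan per subset is retained.

```python
from itertools import combinations

def best_left_deep_plan(card, join_sel):
    """Pass-by-pass dynamic programming over subsets of relations.

    card: relation name -> number of tuples (assumed statistics).
    join_sel: frozenset({r, s}) -> reduction factor of the join term
              between r and s (missing pairs are cross products, RF = 1).
    Assumed cost model: sum of the sizes of all intermediate join results.
    Returns (cost, result_size, join_order).
    """
    rels = sorted(card)
    # Pass 1: the best 1-relation plan is simply a scan of the relation.
    best = {frozenset([r]): (0.0, float(card[r]), (r,)) for r in rels}
    for k in range(2, len(rels) + 1):        # Pass k builds k-relation plans
        for subset in combinations(rels, k):
            s = frozenset(subset)
            candidates = []
            for inner in subset:             # the inner is always a base relation
                outer = s - {inner}
                cost, size, order = best[outer]
                rf = 1.0
                for o in outer:
                    rf *= join_sel.get(frozenset([o, inner]), 1.0)
                out_size = size * card[inner] * rf
                candidates.append((cost + out_size, out_size, order + (inner,)))
            best[s] = min(candidates)        # retain only the cheapest plan
    return best[frozenset(rels)]

# Made-up statistics for three relations.
card = {"A": 1000, "B": 100, "C": 10}
join_sel = {frozenset(["A", "B"]): 0.001, frozenset(["B", "C"]): 0.01}
cost, size, order = best_left_deep_plan(card, join_sel)
```

With these numbers the sketch joins B and C first and adds A last, because that order keeps the intermediate results smallest.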


Enumeration of Left-Deep Plans

- Pass 2: Find the best way to join the result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.)
  - Consider each single relation retained after Pass 1 as the outer relation and every other relation as the inner relation.
  - A: outer relation; B: inner relation.
    - Selections that involve only B → apply before the join.
    - Selections that define the join.
    - Selections that involve attributes of other relations → apply after the join.
  - The first two groups of selections can be considered when choosing the access path for B.
  - Project out unneeded attributes from B.
  - Pipelined?
  - Best access method.
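The grouping of selections for one join step can be sketched as follows. The encoding of a WHERE term as a label plus the set of relations it mentions is a hypothetical one, and the `outer_only` group (terms already applied while building the outer plan) is an addition for completeness that the slide does not list.

```python
def classify_selections(terms, outer_rels, inner_rel):
    """Split WHERE terms for one join step (outer plan joined with inner_rel).

    terms: list of (label, set of relation names the term mentions).
    """
    groups = {"inner_only": [], "join": [], "later": [], "outer_only": []}
    available = set(outer_rels) | {inner_rel}
    for label, rels in terms:
        if rels == {inner_rel}:
            groups["inner_only"].append(label)   # apply before the join
        elif inner_rel in rels and rels <= available:
            groups["join"].append(label)         # defines the join
        elif rels <= set(outer_rels):
            groups["outer_only"].append(label)   # already applied in the outer plan
        else:
            groups["later"].append(label)        # mentions relations not joined yet
    return groups

# Sailors (S) as outer, Reserves (R) as inner, Boats (B) not joined yet.
terms = [
    ("R.bid=100", {"R"}),
    ("S.sid=R.sid", {"S", "R"}),
    ("R.bid=B.bid", {"R", "B"}),
    ("S.rating>5", {"S"}),
]
groups = classify_selections(terms, ["S"], "R")
```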


Enumeration of Plans (Cont.)

- ORDER BY, GROUP BY, aggregates, etc. are handled as a final step, using either an “interestingly ordered” plan or an additional sorting operator.
- In spite of pruning the plan space, this approach is still exponential in the number of tables.
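A quick count shows why the approach stays exponential. The sketch assumes the simplest setting (join order only, ignoring access methods, join methods, and interesting orders): there are n! left-deep join orders, while the pass-based enumeration keeps a best plan for each of the 2^n - 1 non-empty subsets of relations.

```python
from math import factorial

def left_deep_orders(n):
    """Number of left-deep join orders for n relations: n!."""
    return factorial(n)

def dp_subsets(n):
    """Number of non-empty subsets of n relations for which a plan is kept."""
    return 2 ** n - 1

# Compare the two counts for a few values of n.
table = [(n, left_deep_orders(n), dp_subsets(n)) for n in (2, 5, 10)]
```

Pruning turns a factorial search into one that grows like 2^n, which is a large saving but still exponential.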


Cost Estimation for Multi-relation Plans

- Consider a query block:

    SELECT attribute list
    FROM relation list
    WHERE term1 AND ... AND termk

- Maximum # tuples in the result is the product of the cardinalities of the relations in the FROM clause.
- The reduction factor (RF) associated with each term reflects the impact of the term in reducing the result size. Result cardinality = max # tuples × product of all RFs.
- Multi-relation plans are built up by joining one new relation at a time.
  - The cost of the join method, plus the estimate of the join cardinality, gives us both a cost estimate and a result size estimate.
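The result-cardinality formula is a direct product. In the sketch below, the cardinalities and reduction factors are made-up values chosen only to make the arithmetic easy to follow; `Fraction` is used so the result is exact.

```python
from fractions import Fraction
from math import prod

def estimate_result_size(cardinalities, reduction_factors):
    """Result cardinality = (product of cardinalities) * (product of RFs)."""
    max_tuples = prod(cardinalities)
    return max_tuples * prod(reduction_factors)

# Assumed statistics: 40,000 Sailors tuples and 100,000 Reserves tuples.
cardinalities = [40_000, 100_000]
# Assumed reduction factors for S.sid = R.sid, bid = 100, and rating > 5.
reduction_factors = [Fraction(1, 40_000), Fraction(1, 100), Fraction(1, 2)]

estimate = estimate_result_size(cardinalities, reduction_factors)
```

Under these assumptions the maximum of 4 × 10^9 tuples shrinks to an estimate of 500.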


Example

    SELECT sname
    FROM Sailors AS S, Reserves AS R
    WHERE S.sid = R.sid AND bid = 100 AND rating > 5;

Available indexes:
- Sailors: unclustered B+ tree on rating; unclustered hash on sid.
- Reserves: unclustered B+ tree on bid.

- Pass 1:
  - Sailors: the B+ tree matches rating>5 and is probably cheapest. However, if this selection is expected to retrieve a lot of tuples, a file scan may be cheaper.
    - Still, the B+ tree plan is kept (because tuples are in rating order).
  - Reserves: the B+ tree on bid matches bid=100; cheapest.

[Figure: query tree. Projection on sname over the join sid=sid of the selection bid=100 on Reserves and the selection rating>5 on Sailors.]
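The Pass 1 trade-off between the index and a file scan can be sketched with the usual textbook approximations: a file scan costs NPages(R) I/Os, and an unclustered B+ tree with a matching selection costs about (NPages(I) + NTuples(R)) × RF. All page counts, tuple counts, and reduction factors below are assumptions chosen for illustration.

```python
def file_scan_cost(n_pages):
    """I/O cost of scanning the whole relation."""
    return n_pages

def unclustered_btree_cost(index_pages, n_tuples, rf):
    """Approximate I/O cost: roughly one page fetch per qualifying tuple."""
    return (index_pages + n_tuples) * rf

def pass1_plans(n_pages, n_tuples, index_pages, rf):
    scan = file_scan_cost(n_pages)
    index = unclustered_btree_cost(index_pages, n_tuples, rf)
    cheapest = "index" if index < scan else "file scan"
    return {"file scan": scan, "index": index, "cheapest": cheapest}

# Sailors with a selection that retrieves many tuples (assumed RF = 0.5).
sailors_many = pass1_plans(n_pages=500, n_tuples=40_000, index_pages=50, rf=0.5)
# Sailors with a very selective condition (assumed RF = 0.005).
sailors_few = pass1_plans(n_pages=500, n_tuples=40_000, index_pages=50, rf=0.005)
# Reserves with bid = 100 (assumed RF = 0.001).
reserves = pass1_plans(n_pages=1000, n_tuples=100_000, index_pages=100, rf=0.001)
```

Even when the file scan wins on cost, the index plan may still be retained because it delivers tuples in rating order.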


Example (cont.)

- Pass 2:
  - Consider each plan retained from Pass 1 as the outer, and consider how to join it with the (only) other relation.
  - Reserves as outer: the hash index can be used to get Sailors tuples that satisfy sid = outer tuple’s sid value.
  - Sailors as outer:
    - Two selection conditions apply to Reserves: (1) bid=100; (2) sid = outer tuple’s sid value.
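The "Reserves as outer" alternative corresponds to an index nested loops join that probes the hash index on Sailors.sid. The sketch below models the hash index as a Python dictionary; the helper names and the sample rows are assumptions for illustration.

```python
from collections import defaultdict

def build_hash_index(rows, key):
    """Model a hash index as a dict from key value to matching rows."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index

def index_nested_loops(outer_rows, inner_index, outer_key):
    """For each outer tuple, probe the inner index with its join value."""
    for o in outer_rows:
        for i in inner_index.get(o[outer_key], []):
            yield {**i, **o}

# Made-up data.
sailors = [
    {"sid": 1, "sname": "ann", "rating": 7},
    {"sid": 2, "sname": "bob", "rating": 3},
    {"sid": 3, "sname": "cat", "rating": 9},
]
reserves = [
    {"sid": 1, "bid": 100},
    {"sid": 2, "bid": 100},
    {"sid": 3, "bid": 101},
]

outer = [r for r in reserves if r["bid"] == 100]     # selection on the outer
joined = index_nested_loops(outer, build_hash_index(sailors, "sid"), "sid")
snames = [t["sname"] for t in joined if t["rating"] > 5]
```

Here rating>5 is applied to the joined tuples, since the hash index on sid cannot evaluate it.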





Nested Sub-queries

- Find the names of sailors with the highest rating:

    SELECT S.sname
    FROM Sailors S
    WHERE S.rating = (SELECT MAX (S2.rating)
                      FROM Sailors S2)

- Steps:
  - The nested subquery can be evaluated just once → a single value.
  - This value is incorporated into the top-level query.
    - E.g., S.rating = 8
- The sub-query returns a value.
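The two steps can be reproduced by hand with Python's built-in `sqlite3` module. This only imitates the strategy from the client side (it says nothing about what SQLite's own optimizer does), and the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sailors (sid INTEGER PRIMARY KEY, sname TEXT, rating INTEGER)"
)
conn.executemany(
    "INSERT INTO Sailors VALUES (?, ?, ?)",
    [(1, "ann", 7), (2, "bob", 8), (3, "cat", 8)],
)

# Step 1: evaluate the nested subquery once; it yields a single value.
(max_rating,) = conn.execute("SELECT MAX(S2.rating) FROM Sailors S2").fetchone()

# Step 2: plug that value into the top-level query.
two_step = [
    row[0]
    for row in conn.execute(
        "SELECT S.sname FROM Sailors S WHERE S.rating = ? ORDER BY S.sname",
        (max_rating,),
    )
]

# The original nested query, for comparison.
nested = [
    row[0]
    for row in conn.execute(
        "SELECT S.sname FROM Sailors S "
        "WHERE S.rating = (SELECT MAX(S2.rating) FROM Sailors S2) "
        "ORDER BY S.sname"
    )
]
```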


Nested Sub-queries

- Find the names of sailors who have reserved boat number 103:

    SELECT S.sname
    FROM Sailors S
    WHERE S.sid IN (SELECT R.sid
                    FROM Reserves R
                    WHERE R.bid = 103)

- Steps:
  - The nested subquery can be evaluated just once → a relation.
  - Join S with this temporary relation.
  - A smart plan: use the temporary relation as the outer relation; since S.sid has an index, use index nested loops join. Generally, optimizers do NOT do this.
- The sub-query returns a relation.


Nested Sub-queries

- Correlated queries: Find the names of sailors who have reserved boat number 103:

    SELECT S.sname
    FROM Sailors S
    WHERE EXISTS (SELECT *
                  FROM Reserves R
                  WHERE R.bid = 103 AND S.sid = R.sid)

- Steps: Evaluate the nested sub-query for each tuple of Sailors.
- Problems:
  - The nested sub-query is evaluated once per outer tuple.
    - What if the same value appears in the correlation field more than once?
  - Not set-oriented; precludes other join alternatives.
  - Even index nested loops join has problems.
  - The implicit ordering of these blocks means that some good strategies are not considered (Sailors as outer vs. Reserves as outer).
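The cost of per-tuple evaluation, and the repeated-value question, can be illustrated with a small sketch. The function `eval_correlated`, its caching flag, and the sample rows (an outer relation whose correlation field is deliberately not a key) are all assumptions for illustration; caching is one possible remedy, not something the slide prescribes.

```python
def eval_correlated(outer_rows, corr_field, subquery, cache_results):
    """Evaluate a correlated EXISTS-style subquery once per outer tuple,
    optionally reusing results for repeated correlation values."""
    cache = {}
    evaluations = 0
    kept = []
    for row in outer_rows:
        value = row[corr_field]
        if cache_results and value in cache:
            ok = cache[value]
        else:
            ok = subquery(value)      # one full evaluation of the nested block
            evaluations += 1
            cache[value] = ok
        if ok:
            kept.append(row)
    return kept, evaluations

reserves = [{"sid": 1, "bid": 103}, {"sid": 2, "bid": 104}]

def reserved_103(sid):
    return any(r["bid"] == 103 and r["sid"] == sid for r in reserves)

outer_rows = [{"sid": 1}, {"sid": 2}, {"sid": 1}, {"sid": 1}]
naive_rows, naive_evals = eval_correlated(outer_rows, "sid", reserved_103, False)
cached_rows, cached_evals = eval_correlated(outer_rows, "sid", reserved_103, True)
```

Both runs return the same rows, but the naive run evaluates the nested block four times and the cached run only twice.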


Nested Queries

- A nested query: equivalent query without nesting.
- A correlated query: equivalent query without correlation.
- The unnested and “decorrelated” version of the query is typically optimized better.
- Many current optimizers cannot transform a nested version into the equivalent non-nested version.

Nested query:

    SELECT S.sname
    FROM Sailors S
    WHERE EXISTS (SELECT *
                  FROM Reserves R
                  WHERE R.bid=103 AND R.sid=S.sid)

Nested block to optimize:

    SELECT *
    FROM Reserves R
    WHERE R.bid=103 AND S.sid= outer value

Equivalent non-nested query:

    SELECT S.sname
    FROM Sailors S, Reserves R
    WHERE S.sid=R.sid AND R.bid=103
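The nested and non-nested forms can be compared on a small made-up database using Python's built-in `sqlite3` module. The `day` column and all rows are assumptions. Note that the join form can return a name once per matching reservation, so the comparison below is on the set of names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sailors (sid INTEGER PRIMARY KEY, sname TEXT)")
conn.execute("CREATE TABLE Reserves (sid INTEGER, bid INTEGER, day TEXT)")
conn.executemany(
    "INSERT INTO Sailors VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cat")]
)
conn.executemany(
    "INSERT INTO Reserves VALUES (?, ?, ?)",
    [(1, 103, "d1"), (1, 103, "d2"), (2, 101, "d1"), (3, 103, "d1")],
)

nested_names = [
    row[0]
    for row in conn.execute(
        "SELECT S.sname FROM Sailors S "
        "WHERE EXISTS (SELECT * FROM Reserves R "
        "              WHERE R.bid = 103 AND R.sid = S.sid) "
        "ORDER BY S.sname"
    )
]

join_names = [
    row[0]
    for row in conn.execute(
        "SELECT S.sname FROM Sailors S, Reserves R "
        "WHERE S.sid = R.sid AND R.bid = 103 "
        "ORDER BY S.sname"
    )
]

same_set = sorted(set(join_names)) == nested_names
```

In this data set, ann reserved boat 103 on two days, so the join lists her twice while the EXISTS form lists her once.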


System R Optimizer

- Uses statistics to estimate the cost of a query evaluation plan.
- Considers only plans with binary joins in which the inner relation is a base relation (i.e., not a temporary relation).
  - This reduces the number of alternative plans.
- Focuses on optimizing unnested SQL queries.
- Uses a cost model that accounts for both I/O costs and CPU costs.
- Does not perform duplicate elimination (except when a DISTINCT clause is present).


Summary

- Query optimization is an important task in a relational DBMS.
- Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- Two parts to optimizing a query:
  - Consider a set of alternative plans.
    - Must prune the search space; typically, left-deep plans only.
  - Must estimate the cost of each plan that is considered.
    - Must estimate the size of results and the cost for each plan node.
    - Key issues: statistics, indexes, operator implementations.