icom 6005 – database management systems design

28
ICOM 6005 – Database Management ICOM 6005 – Database Management Systems Design Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 15 – Query Optimization

Upload: lonato

Post on 14-Jan-2016

29 views

Category:

Documents


0 download

DESCRIPTION

ICOM 6005 – Database Management Systems Design. Dr. Manuel Rodr í guez-Mart í nez Electrical and Computer Engineering Department Lecture 15 – Query Optimization. Query Optimization. Read : Chapter 12, sec 12.4 Chapter 15 SAC+79 Pages Purpose: - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: ICOM 6005 – Database Management Systems Design

ICOM 6005 – Database Management ICOM 6005 – Database Management Systems DesignSystems Design

Dr. Manuel Rodríguez-Martínez

Electrical and Computer Engineering Department

Lecture 15 – Query Optimization

Page 2: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 2

Query OptimizationQuery Optimization

• Read :– Chapter 12, sec 12.4– Chapter 15– SAC+79 Pages

• Purpose:– Study different algorithms to optimize queries

submitted to the DBMS

Page 3: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 3

IntroductionIntroduction

• SQL query gets translated into relational algebra expression

• Relational algebra expression is represented as tree– This is what DBMS “understands” how to process– Expression becomes a plan once we identify access

methods for each operator

• Relational algebra expression might have an equivalent expression– Example: R(A,B,C) , S(A, D, F)

• But, each expression might have different cost• How do we find the cheapest expression?

))(( ...,.,.,.,. SRSR ASARFSDSCRBRAR ×= =σπ><

Page 4: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 4

Relational DBMS ArchitectureRelational DBMS Architecture

Disk Space Management

Buffer Management

File and Access Methods

Relational Operators

Query Optimizer

Query Parser

Client API

Client

DB

ExecutionEngine

Concurrencyand Recovery

Page 5: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 5

Query OptimizerQuery Optimizer

• Module in DBMS in charge of finding cheapest available plan to execute a query

• Building one is not easy!• Optimizer searches for plans and compares then

based on cost• Cost can be:

– Resource usage– Response time– Power consumption – Number of I/Os– Network transmission cost

Page 6: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 6

Query PlansQuery Plans

• Query plan specifies the operations to be executed• Tree of operators

– Each operator corresponds to a relational operator

• Leaf nodes usually represent base tables

R S

R S

T

S

A,B

S

A,B

σA>2

Page 7: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 7

Executing Query PlansExecuting Query Plans

• Plans generated by the optimizer are fed to the execution engine

• Plans support iterator interface – Open – initialize the operator– Next – get next tuple from operator– Close – de-allocate resources from operator

• Execution engine invokes each method• Invocation triggers cascade of calls

– Each operator call the corresponding methods on child nodes

– Example: open on join, causes call to open on outer table and call to open on inner table. Same for next and close.

Page 8: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 8

Pipelined vs Materialized ExecutionPipelined vs Materialized Execution

• Pipelined – The output tuple from one operator immediately becomes

input tuple to its parent operator in the tree

• Materialized– The output tuples from one operator must be stored to disk

first (into a temporary table)– Once the operator finishes, its parent operator can access

the materialed tuples

• Most execution engines use pipelined– Saving in I/O can be substantial!

• Some operator cannot be pipelined– Sorting, projections with duplicate elimination

• Query optimizer must be aware of this issue!

Page 9: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 9

Generation of Query PlansGeneration of Query Plans

• Optimizer generates query plan after a search finds the optimal one– According to some criteria

• Search is a search by construction– Alterative plans are built and compared

– Cheapest one is kept

• Two major algorithms exist– Dynamic programming (SAC+79)

• Exhaustive search of plan space• Finds the optimal

– Randomized Algorithm• Random search of plan space• Quickly finds sub-optimal but good plan

• Optimization philosophy find a good plan quick, avoid bad ones!

Page 10: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 10

Left-Deep Join vs Bushy PlansLeft-Deep Join vs Bushy Plans

• Query Optimizer generate two major types of plans– Left-deep plans– Bushy plans

• Left-deep plans– Every join has a base table as the inner join table– Use in commercial systems (first in System R)– Good for dynamic programming– Good for optimizing resource usage

• Bushy plans– Joins might have intermediate tables as input to the join– Good for randomized search– Use in research prototypes for distributed databases– Good for optimizing response time

Page 11: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 11

Left-deep plansLeft-deep plans

R S

R S

T

R S

T

U

Each join always has base table as inner table

Page 12: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 12

Bushy PlansBushy Plans

R S

T

R S

R S

T

U

U V

Page 13: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 13

Left-Deep vs Bushy PlansLeft-Deep vs Bushy Plans

• Bushy plans– Enable parallelism in operator evaluation– Operator can execute at different rates

• Good in distributed environments

– More complicate to build (harder optimizer)

• Left-deep plans– Joins are run in sequence

• Susceptible to bottleneck at some operator

– Simpler to build (easier optimizer)– Good for single site systems

• Everything runs on the same machine

• Commercial DBMS systems use Left-deep plans

Page 14: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 14

Cost of a planCost of a plan

• The cost of plan depends on the metric you wish to optimize

• Resource usage (CPU + I/O + Network)– Cost is the sum of the resources used by each operator

• Response time– Cost of the slowest path in the tree

• Number of I/Os– Cost is the sum of the I/Os generated by each operator

• Network cost– Cost is sum of cost in moving data between operators

Page 15: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 15

Organization of an optimizerOrganization of an optimizer

Query ExecutionEngine

Query Parser

Parse Tree

Query Plan

SQL Query

CatalogManager

Catalog

QueryOptimizer

PlanGenerator

CostEstimator

Page 16: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 16

Generating AlternativesGenerating Alternatives

• Relational equivalences are used by the optimizer to generate different operators that do the same

• Selection equivalences:– Cascading selections

– Commutative selections

))))((...(()(2121 ... RR

kk pppppp σσσσ =∧∧∧

))(())((1221RR pppp σσσσ =

Page 17: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 17

Equivalence RulesEquivalence Rules

• Projections– Cascading projections

• Joins– Commutative rule

– Associative rule

)( RSSR ><>< =

TSRTSR ><><><>< )()( =

)))((()(211

RRkAAAA ππππ ≡

Page 18: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 18

Equivalence Rules(2)Equivalence Rules(2)

• Commute selections and projections

• Pushing selections

• Decomposing selections

)()( ...21RR

kpppP ∧∧∧=σσ

SRSR pp ><>< )()( σσ =

val ))(())(( opApRR APPA ≡⇔= πσσπ

Page 19: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 19

System R OptimizerSystem R Optimizer

• Based on left-deep plans and dynamic programming– Most commercial systems use a System R type of optimizer

• Cost is based on resource usage– Cost = CPU Cost + I/O Cost– Given a plan P, cost of P is computed as

• Cost(P) = operatorCost(P.root) + Cost(P.root.leftChild) + Cost(P.root.rightChild)

Page 20: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 20

Estimating Cost of OperatorsEstimating Cost of Operators

• Key feature for this is selectivity factors, selectivity, and join costs

• Example:– R has no index

• |R| = 100,000, ||R|| = 5000

– S has un-clustered Index on

Join attribute• |S| = 70,000, ||S|| = 2500

– What algorithm shall be use for join?• Chose between: BNLJ, INLJ, e GHJ with 20 B

– What is the cost?

R S

Page 21: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 21

System R Search AlgorithmSystem R Search Algorithm

• Idea:– Build every possible plan and keep track of

• Cheapest plan (overall)

• Cheapest plans that bring data in sorted order (called interesting orders)

• Dynamic Programming (divide-and-conquer)– To find plan for n-way join you

• First find singe table plans

• Then find plans for all (n-1)-way join and find a plan to join missing table with an (n-1)-way join

• Plan for smaller joins are saved on a table

Page 22: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 22

System R Search Algorithm (2)System R Search Algorithm (2)

• Process:– For an n-way join between tables R1, R2, …, Rn:

• Find the access path to access each table– Plans access to get R1, R2, …, Rn

» This includes application of selection and projections for each table

• Find the access path to compute 2-way joins– 2-way joins for all possible pairs of tables

• Find the access path to compute 3-way joins– Add a table to a 2-way join (forms all possible 3-way joins)

• Find the access path to compute 4-way joins– Add a table to a 3-way join (forms all possible 4-way joins)

• …

• Find the access path to compute the n-way join– Add a table to a n-1 way join (forms all possible n-way joins)

Page 23: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 23

System R Search Algorithm (2)System R Search Algorithm (2)Plan SystemROptimizer(R1, R2, …, Rn){

for (int i = 0; i < n; ++i){ // single table access pathsoptPlan(Ri) = selectPlan(Ri);

}for (int i=2; I < n; ++i){ // join access paths, start with 2 tables, then 3, …,

for all S {R1, R2, R3, …Rn} s.t. |S| == i { // S is the next set to join

bestPlan = dummy plan with infinite cost;for all Rj, Sj s.t. S = Sj {Rj} { // Sj & Rj are pieces of S

P = joinPlan(optPlan(Sj), optPlan(Rj))if (cost(P) < cost(bestPlan)){

bestPlan = P;}

}optPlan(Sj) = bestPlan;

}}S = {R1, R2, …, Rn);return optPlan(S);

}

Page 24: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 24

IllustrationIllustration

• How does the algorithm works for this case• Tables

– R - |R| = 100,000, ||R|| = 8,000– S - |S| = 90,000, ||S|| = 6,000– T - |T| = 120,000 ||T|| = 10,000– U - |U| = 80,000 ||U|| = 4,000

• All tables are stored in heap files• DBMS has: Blocked nested loops join, Hash Join and

25 free buffers• What is the best plan for query:

– RS T U

Page 25: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 25

Illustration (2)Illustration (2)

• How does the algorithm works for this case• Tables

– R - |R| = 100,000, ||R|| = 8,000• R is stored on a clustered B+tree matching join attribute with T

– S - |S| = 90,000, ||S|| = 6,000• |R| is stored on

– T - |T| = 120,000 ||T|| = 10,000

• DBMS has: Blocked nested loops join, Indexed-nested loops join, and Hash Join and 3 free buffers

• What is the best plan for query:– RS T

Page 26: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 26

Illustration (3)Illustration (3)

• How does the algorithm works for this case• Tables

– R - |R| = 100,000, ||R|| = 8,000• R is stored on a clustered B+tree matching join attribute with T

– S - |S| = 90,000, ||S|| = 6,000• |R| is stored on

– T - |T| = 120,000 ||T|| = 10,000

• DBMS has: Blocked nested loops join, Indexed-nested loops join, and Hash Join and 25 free buffers

• What is the best plan for query: σA>3 B = ‘NY’ (R)S T– If SFA>3 = .10 and SF B= ‘NY’ = 0.05 and A>3 matches index

on R.

Page 27: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 27

Issues with SystemR optimizerIssues with SystemR optimizer

• Algorithm performs exhaustive search of left-deep plans

• Dynamic Programming is ill-suited for optimization of response time– Principle of optimality is not observed

• Difficult (but not impossible) to modify for bushy plans– Search space is huge– Need pruning techniques to cut on the number of plans

stored

• Do we need exhaustive search?– Optimal plan vs sub-optimal that is good and quick to find

• Disaster avoidance – More important to avoid bad plans!!!

Page 28: ICOM 6005 – Database Management Systems Design

ICOM 6005 Dr. Manuel Rodriguez Martinez 28

Alternative approaches Alternative approaches

• Randomized Query Optimization– Use randomized algorithms to build and search plans– Good for bushy plans

• Rule-based Query Optimization– Use rules to guide the search and better prune space– Good to apply special cases and pruning

• Parametric Query Optimization– Add run-time parameters to really capture the reality of the

system

• Multiple-query Optimization– Optimizer takes 2 or more queries at a time for optimization