Query Planning & Optimization - Imperial College London (pjm/CO572/lectures/Query...)


Post on 22-May-2020


Query Planning & Optimization

Holger Pirk


Query Optimization


Motivation

Query Optimization
- Start with a correct plan
- Create a better plan
  - maintain correctness
  - a better plan is often much more complicated

A step back
- Secondary goal: Performance
  - some kind of numeric value
- Primary goal: Correctness
  - a boolean


Plan Correctness

Expectation Management
- Correctness is hard to prove when semantics are fuzzy
- Query optimizers settle for equivalence
- This puts the burden of correctness on the initial compiler

Visualisation
[Figure: the compiler turns SQL into a plan; the optimizer rewrites that plan]


Plan equivalence

Relational Algebra
- is closed, i.e., every operator takes relations as input and produces relations
- Operators are easily composable
- Syntactically correct plans: easy
- Semantically correct plans: harder
- Semantically equivalent plans: very tricky

Semantic equivalence
- Plans are (semantically) equivalent if they (provably) produce the same output on any database
- Plan equivalence is quite hard to prove

Idea
- Divide and Conquer


Transformations

Rationale
- An equivalent transformation of a subplan is an equivalent transformation of the entire plan
- For example: adjacent selections can be reordered

Hence
- Create localized transformation rules in the form Pattern => Rewrite
  - For example Select(Select(input, condition1), condition2) => Select(Select(input, condition2), condition1)

Application
- Traverse the plan tree from the root on (in any order)
- For every traversed node, see if the pattern matches
- If so, replace it with the rewrite and start again from the root
- If the pattern never matched, you are done
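The selection-swap rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Scan` and `Select` classes, the `PREDICATES` table and the three sample rows are hypothetical, and agreeing on one sample database is a sanity check, not a proof of equivalence.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Select:
    input: "Plan"
    condition: str  # name of a predicate in PREDICATES (hypothetical encoding)


Plan = Union[Scan, Select]

# Hypothetical named predicates, so that plans stay comparable with ==.
PREDICATES = {
    "urgent": lambda row: row["priority"] == "urgent",
    "open": lambda row: row["status"] != "fulfilled",
}


def evaluate(plan, db):
    """Run a plan against a database given as {table name: list of rows}."""
    if isinstance(plan, Scan):
        return list(db[plan.table])
    return [row for row in evaluate(plan.input, db) if PREDICATES[plan.condition](row)]


def swap_selections(plan) -> Optional[Plan]:
    """Pattern => Rewrite for two adjacent selections; None if the pattern does not match."""
    if isinstance(plan, Select) and isinstance(plan.input, Select):
        inner = plan.input
        return Select(Select(inner.input, plan.condition), inner.condition)
    return None


db = {"order": [
    {"status": "pending", "priority": "urgent"},
    {"status": "fulfilled", "priority": "urgent"},
    {"status": "pending", "priority": "normal"},
]}
before = Select(Select(Scan("order"), "open"), "urgent")
after = swap_selections(before)
print(after)
print(evaluate(before, db) == evaluate(after, db))
```

Because the rule only inspects a node and its direct child, it can be applied anywhere in a larger plan without looking at the rest of the tree.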


The two dimensions of query optimization

                | Algorithm-Agnostic | Algorithm-Aware
Data-Agnostic   |                    |
Data-Aware      |                    |


The two dimensions of query optimization

                | Logical | Physical
Rule-Based      |         |
Cost-Based      |         |


Logical Optimization


Rule-Based Logical Query Optimization


Rule-Based Query Optimization

What could possibly be done. . .
- without knowledge of the data and
- without knowledge of algorithms?

Well,
- apply heuristics: things that are (almost) universally good,
- understand the principles,
- build something very portable/robust (on relational algebra)


Operator Pushdown

Selections
- can be pushed through joins if they only refer to attributes from one side of the join
- This is often a good optimization (depends on join costs & selectivity)
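A small sketch of why pushdown tends to help, assuming hypothetical `orders` and `customers` rows and using the number of tuples each operator produces as a rough proxy for work. The selection on priority only touches the order side, so it may run before the join.

```python
# Hypothetical sample data: six orders, two of them urgent, three customers.
orders = [
    {"customer_id": 1, "priority": "urgent"},
    {"customer_id": 2, "priority": "normal"},
    {"customer_id": 3, "priority": "normal"},
    {"customer_id": 1, "priority": "normal"},
    {"customer_id": 2, "priority": "urgent"},
    {"customer_id": 3, "priority": "normal"},
]
customers = [
    {"id": 1, "nation": "UK"},
    {"id": 2, "nation": "USA"},
    {"id": 3, "nation": "China"},
]


def join(order_rows, customer_rows):
    """Equi-join on order.customer_id = customer.id."""
    return [{**o, **c} for o in order_rows for c in customer_rows
            if o["customer_id"] == c["id"]]


def select_urgent(rows):
    """Selection that refers only to an attribute of the order side."""
    return [r for r in rows if r["priority"] == "urgent"]


# Selection above the join: the join produces every order first.
joined = join(orders, customers)
late_result = select_urgent(joined)
late_cost = len(joined) + len(late_result)

# Selection pushed below the join: the join only sees urgent orders.
urgent_orders = select_urgent(orders)
early_result = join(urgent_orders, customers)
early_cost = len(urgent_orders) + len(early_result)

print(late_cost, early_cost)
```

Both variants return the same rows; the gap in produced tuples grows with the size of the join input and shrinks as the selection becomes less selective.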


Operator Reordering: Recall

Hence
- Create localized transformation rules in the form Pattern => Rewrite
  - For example Select(Select(input, condition1), condition2) => Select(Select(input, condition2), condition1)

Application
- Traverse the plan tree from the root on (in any order)
- For every traversed node, see if the pattern matches
- If so, replace it with the rewrite and start again from the root
- If the pattern never matched, you are done

The problem
- How do you decide when to apply the rule?
- How do you avoid infinite loops through contradicting rules?


Illustration: Pushing down selections

[Figure: plan tree over Order and Customer with the operators σ priority="urgent", σ status<>"fulfilled", ⨝ order.customer_id = customer.id and Γ nation, min(order.date)]


Operator Reordering

Many operators are associative
- Two-way Joins
- Selections
- Unions
- Differences

This allows us to swap their order
- In the hope of reducing costs

Heavily heuristics driven
- relative selectivities can be estimated from the shapes of predicates
  - s(==) < s(<) < s(<>)


Rule-Based Query Optimization

The solution: guards
- Place guard conditions on the rules
  - An easy example
  - Select(Select(input, condition1), condition2) if condition1.cmp = '>' and condition2.cmp = '==' => Select(Select(input, condition2), condition1)

Context
- Rule-based Query Optimization is the standard in "simple" DBMSs
  - MonetDB, Spark
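A minimal sketch of a guarded rule inside the traverse, match, restart loop. It rests on assumptions: the `Cond`, `Scan` and `Select` classes are hypothetical, and the guard is a generalisation of the slide's single example that ranks comparison operators by the heuristic s(==) < s(<) < s(<>), treating '>' like '<'. The swap only fires when it moves the more selective predicate inward, so each application removes one inversion and the loop terminates.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass(frozen=True)
class Cond:
    attr: str
    cmp: str
    value: object


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Select:
    input: "Plan"
    condition: Cond


Plan = Union[Scan, Select]
Rule = Callable[[Plan], Optional[Plan]]

# Assumed ranking of comparison operators: lower rank = expected to keep fewer tuples.
RANK = {"==": 0, "<": 1, ">": 1, "<>": 2}


def guarded_swap(plan):
    """Swap adjacent selections only if the outer one is expected to be more selective."""
    if isinstance(plan, Select) and isinstance(plan.input, Select):
        outer, inner = plan.condition, plan.input.condition
        if RANK[inner.cmp] > RANK[outer.cmp]:  # the guard
            return Select(Select(plan.input.input, outer), inner)
    return None


def rewrite_once(plan, rule):
    """Apply the rule at the first matching node, searching from the root; None if no match."""
    replaced = rule(plan)
    if replaced is not None:
        return replaced
    if isinstance(plan, Select):
        child = rewrite_once(plan.input, rule)
        if child is not None:
            return Select(child, plan.condition)
    return None


def optimize(plan, rule):
    """Restart from the root after every rewrite until the pattern never matches."""
    steps = 0
    while True:
        rewritten = rewrite_once(plan, rule)
        if rewritten is None:
            return plan, steps
        plan, steps = rewritten, steps + 1


status = Cond("status", "<>", "fulfilled")
date = Cond("date", "<", 20)
priority = Cond("priority", "==", "urgent")

plan = Select(Select(Select(Scan("order"), status), date), priority)
optimized, steps = optimize(plan, guarded_swap)
print(optimized, steps)
```

Without the guard, the same loop would swap the two selections back and forth forever, which is the infinite-loop problem raised earlier.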


Transformations

Before

Γ nation, min(order.date)
  ⨝ order.customer_id = customer.id
    σ priority="urgent"
      σ status<>"fulfilled"
        Order
    Customer

After

Γ nation, min(order.date)
  ⨝ order.customer_id = customer.id
    σ status<>"fulfilled"
      σ priority="urgent"
        Order
    Customer


Rule-Based Query Optimization in Action

Before Optimization

Γ nation, min(order.date)
  ⨝ order.customer_id = customer.id
    σ status<>"fulfilled"
      σ priority="urgent"
        Order
    Customer

Customer
id  nation
1   UK
2   USA
3   China
4   Uk

Order
status  priority  c_id  date
x       x         1     17
x       x         2     12
x       x         1     5
x       x         3     93
x       x         3     21
x       x         3     42
x       x         1     31
x       x         2     8
x       x         3     74
x       x         2     44
x       x         1     94
x       x         2     88
. . . and four million more

Solution

[Figure: two rewritten plan trees. Both keep σ status<>"fulfilled", σ priority="urgent", the join ⨝ order.customer_id = customer.id and the final Γ nation, min(order.date). One adds Γ customer_id, min(nation), min(order.date); the other adds Γ customer_id, min(order.date) as the join input next to Customer.]


Problems with rule-based optimization

- Very brittle: a missing rule can have devastating consequences
- Contradicting rules lead to infinite cycles
- Often wrong
  - Does not take data into account


Cost-based logical query optimization


Cost Metric

What constitutes a "better" plan?
- Not an easy question to answer
- We define some numeric cost metric

For now
- Sum of the number of tuples produced by each operator
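The metric can be computed by walking a plan and adding up the output size of every operator. The sketch below makes assumptions: plans are encoded as nested tuples, the base-table scan counts as an operator that produces all of its tuples, and the 100 sample orders (2 urgent, 60 not fulfilled) are made up.

```python
def run(plan):
    """Return (rows, cost), where cost is the sum of tuples produced by every operator."""
    kind = plan[0]
    if kind == "scan":
        rows = plan[1]
        return rows, len(rows)
    if kind == "select":
        _, child, predicate = plan
        rows, cost = run(child)
        produced = [row for row in rows if predicate(row)]
        return produced, cost + len(produced)
    raise ValueError(f"unknown operator: {kind}")


# Hypothetical data: orders 0 and 50 are urgent, 60 of the 100 orders are not fulfilled.
orders = [
    {"priority": "urgent" if i % 50 == 0 else "normal",
     "status": "fulfilled" if i % 5 >= 3 else "pending"}
    for i in range(100)
]


def is_urgent(row):
    return row["priority"] == "urgent"


def is_open(row):
    return row["status"] != "fulfilled"


open_first = ("select", ("select", ("scan", orders), is_open), is_urgent)
urgent_first = ("select", ("select", ("scan", orders), is_urgent), is_open)

rows_a, cost_a = run(open_first)
rows_b, cost_b = run(urgent_first)
print(cost_a, cost_b)
```

Both plans return the same two rows, but applying the more selective predicate first produces fewer intermediate tuples.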


Cost-Based logical query optimization

Idea
- Costs are data dependent. . .
- . . . but we don't know the data (before running the query). . .
- . . . so, let's estimate it!

A simple approach
- Let's estimate the number of tuples produced by an equality select
  - (remember, joins are cross products with selects on top)
- We are selecting one value out of all values in the database
  - Assuming uniform distribution: 1 / (number of distinct values in column)
- Let's keep the number of distinct values as a "statistic"
- Selectivity of priority=="urgent" predicate: 50%
- In practice: only very few orders are urgent, say 2%
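The gap between the uniform estimate and the real selectivity can be reproduced with a toy column. The `priorities` list is hypothetical data chosen to match the 2% figure above.

```python
def uniform_equality_selectivity(distinct_values: int) -> float:
    """Estimate for an equality predicate when every value is assumed equally likely."""
    return 1.0 / distinct_values


# Hypothetical priority column: 2 urgent orders out of 100.
priorities = ["urgent"] * 2 + ["normal"] * 98

estimate = uniform_equality_selectivity(len(set(priorities)))
actual = priorities.count("urgent") / len(priorities)

print(f"estimated: {estimate:.0%}, actual: {actual:.0%}")
```

With only the distinct-value count as a statistic, the estimate is off by a factor of 25 on this column.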


Statistics

Histograms
- keep a tuple count for every unique value for every column
- Equality predicate selectivity is simply (occurrences of a value) / (total tuple count)
- General predicate estimates basically evaluate the query on the histogram first

[Figure: Status Histogram with bars for pending, shipping, shipped and fulfilled]

[Figure: Priority Histogram with bars for urgent and normal]
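A histogram of this kind is just a value-to-count map, so both estimates are a few lines of Python. The counts below are made up for illustration, since the bar heights of the original figures are not recoverable.

```python
from collections import Counter

# Hypothetical status column; the counts are illustrative only.
status_column = (["pending"] * 80 + ["shipping"] * 10
                 + ["shipped"] * 5 + ["fulfilled"] * 5)

# One tuple count per unique value of the column.
histogram = Counter(status_column)


def equality_selectivity(histogram, value):
    """Occurrences of the value divided by the total tuple count."""
    return histogram[value] / sum(histogram.values())


def predicate_selectivity(histogram, predicate):
    """Evaluate an arbitrary predicate on the histogram instead of on the data."""
    total = sum(histogram.values())
    matching = sum(count for value, count in histogram.items() if predicate(value))
    return matching / total


pending = equality_selectivity(histogram, "pending")
not_fulfilled = predicate_selectivity(histogram, lambda status: status != "fulfilled")
print(pending, not_fulfilled)
```

The histogram has one entry per distinct value, so evaluating a predicate on it is far cheaper than scanning the column when the column has few distinct values.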


Complicating Factors


Attribute Correlation

The question
- What is the selectivity of the second selection (status=="pending")?
- Assume we have a histogram
- well, it is 2% (assuming attribute independence)

Now, assume the following
- The median time to fulfill an order is a week
- Some orders take more than two weeks
- Someone gets upset about this and tries to fix it
- The person sets all orders that have been pending for more than a week to priority urgent
- That means that 50% of the pending orders are now urgent


Dealing with correlation

Multidimensional histograms assuming independence

         Pending  Shipping  Shipped  Fulfilled
Normal   160      20        10       10
Urgent   16       2         1        1


Dealing with correlation

Multidimensional histograms

         Pending  Shipping  Shipped  Fulfilled
Normal   80       20        10       10
Urgent   96       2         1        1
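The two tables can be compared directly in code. The sketch below rebuilds a table from its one-dimensional marginals, which is what the independence assumption amounts to; the helper names `make_joint` and `independence_estimate` are hypothetical, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

STATUSES = ["pending", "shipping", "shipped", "fulfilled"]


def make_joint(normal, urgent):
    """Build a two-dimensional histogram keyed by (priority, status)."""
    joint = {}
    for status, n, u in zip(STATUSES, normal, urgent):
        joint[("normal", status)] = n
        joint[("urgent", status)] = u
    return joint


def independence_estimate(joint):
    """Rebuild every cell as count(priority) * count(status) / total."""
    total = sum(joint.values())
    by_priority, by_status = {}, {}
    for (priority, status), count in joint.items():
        by_priority[priority] = by_priority.get(priority, 0) + count
        by_status[status] = by_status.get(status, 0) + count
    return {(p, s): Fraction(by_priority[p] * by_status[s], total) for (p, s) in joint}


independent = make_joint(normal=[160, 20, 10, 10], urgent=[16, 2, 1, 1])
correlated = make_joint(normal=[80, 20, 10, 10], urgent=[96, 2, 1, 1])

# The first table is exactly what its marginals predict.
print(independence_estimate(independent) == independent)

# The second is not: the marginals predict 80 urgent pending orders, the table says 96.
estimated = independence_estimate(correlated)[("urgent", "pending")]
print(estimated, correlated[("urgent", "pending")])
```

One-dimensional histograms cannot tell the two situations apart from the joint counts alone, which is the reason for keeping a multidimensional histogram for correlated attributes.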


Physical Optimization


It all comes down to the metric

Examples for physical Cost Metrics
- Number of volcano function calls
- Sum of all produced Tuples (intermediate and final)
- Number of Page Faults (I/O)
- CPU costs
- max(I/O, CPU)
- Total Intermediate Size


Physical plan optimization

- Counting tuples is hard, counting costs is harder
- Physical plans are more complicated: they don't only contain the relational operator but the algorithm
- Different algorithms have different costs
  - In terms of intermediate sizes
  - In terms of CPU (i.e., function calls)
- For example: Nested Loop Joins
  - require less space (no need for overallocation)
  - and don't require hash calculation
  - But induce more comparisons
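The trade-off can be made visible by instrumenting two toy join implementations. This is a sketch under assumptions: the sample rows are hypothetical, the hash join uses a Python dict as its hash table, and only comparisons against candidates in the matching bucket are counted, not the work done inside the dict.

```python
def nested_loop_join(left, right, left_key, right_key):
    """Compare every left row with every right row; no extra data structure."""
    comparisons = 0
    result = []
    for l in left:
        for r in right:
            comparisons += 1
            if l[left_key] == r[right_key]:
                result.append((l, r))
    return result, comparisons


def hash_join(left, right, left_key, right_key):
    """Build a hash table on the right input, then probe it with the left input."""
    hashes = 0
    table = {}
    for r in right:
        hashes += 1  # one hash calculation per build tuple
        table.setdefault(r[right_key], []).append(r)
    comparisons = 0
    result = []
    for l in left:
        hashes += 1  # one hash calculation per probe tuple
        for r in table.get(l[left_key], []):
            comparisons += 1  # only candidates from the matching bucket
            result.append((l, r))
    return result, comparisons, hashes


# Hypothetical inputs: six orders referencing three customers.
orders = [{"customer_id": 1 + i % 3, "date": i} for i in range(6)]
customers = [{"id": 1, "nation": "UK"}, {"id": 2, "nation": "USA"}, {"id": 3, "nation": "China"}]

nl_result, nl_comparisons = nested_loop_join(orders, customers, "customer_id", "id")
hj_result, hj_comparisons, hj_hashes = hash_join(orders, customers, "customer_id", "id")
print(nl_comparisons, hj_comparisons, hj_hashes)
```

The nested loop join needs no hash table and no hash calculations but performs one comparison per pair of input tuples, while the hash join pays for building and probing the table and compares far fewer pairs.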


Rule-based physical optimization


Rule-based physical optimization

State of the art
- is rule based
  - Remember the rules for join algorithms? Yeah, that!
- Very focused on hardware:
  - Parallelism
  - Cache-conscious partitioning
- Cost based optimization of physical plans is a research topic (incidentally, one of mine)
- When it comes to indices: use them!


Cost-based physical optimization


Access Path Selection

Data can be read from multiple sources
- The base table
- A column-store index
- A tree index
- Bitmaps

Indices usually don't contain all the necessary data
- They are mainly used for tuple selection
  - not attribute projection
- They may need to be combined with base table data


Access Path Selection

Example
- A customer has six attributes: id, name, address, nation, phone, accountNumber
- Suppose you have a column-index on nation
- The query is select * from customer where nation = "UK"


Worksheet: By-Reference Bulk Processing of DSM Data

Consider the following query

    ALTER TABLE orders
    ADD FOREIGN KEY (customer_id) REFERENCES customer(id);

    select phone, count(*) from orders, customer
    where order.date > 2018-01-01 and customer.nation = UK
      and customer_id = customer.id
    group by phone

Attributes (assume all integers)
- Order: date, status, priority, discount, customer_id
- Customer: id, name, address, nation, phone, accountNumber

Task
- Find the best plan in terms of Page Faults/Cache Misses
  - Selectivities:
    - order.date > 2018-01-01: 10%
    - customer.nation = UK: 90%
  - DSM, spanned pages, 5M Orders, 1M customers
  - 512 KB Cache, 64 Byte Pages


The End
