
Query Evaluation

Partially using Prof. Hector Garcia-Molina’s slides (Notes06, Notes07)
http://www-db.stanford.edu/~ullman/dscb.html

Donghui Zhang
Northeastern University

Query Evaluation

SQL query:

SELECT E.Name
FROM Emp E
WHERE E.SSN<5000 AND E.Age>50

Server:
• Check the data and meta data;
• Produce the query result.

Query result:

Michael Jordan
Donghui Zhang

How does the server get from the query to the result?

Query Evaluation Steps

• Query Compiling: get a logical query plan (l.q.p.).
• Query Optimization: choose a physical query plan (p.q.p.).
• Query Execution: execute the chosen plan.

Query evaluation pipeline (flowchart)

Query compiling:
1. parse: SQL query → parse tree
2. convert: parse tree → logical query plan
3. apply laws: logical query plan → “improved” l.q.p.

Query optimization:
4. estimate result sizes (uses statistics): “improved” l.q.p. → l.q.p. + sizes
5. consider physical plans: l.q.p. + sizes → {P1, P2, …}
6. estimate costs: {P1, P2, …} → {(P1,C1), (P2,C2), …}
7. pick best: {(P1,C1), (P2,C2), …} → Pi

Query execution:
8. execute: Pi → answer

Query Compiling: Parse

• Background knowledge: grammar.
• Input: SQL query.
• Output: a parse tree.

• Start with a simple grammar:
– Only SFW (no GROUP BY, HAVING, or nested query)
– Simple AND condition (no OR, UNION, EXISTS, IN, …)
– One table (no conditions like E.did=D.did)

SELECT E.Name
FROM Emp E
WHERE E.SSN<5000 AND E.Age>50

Query Compiling: Parse (Grammar)

• <SFW> := SELECT <SelList> FROM <Table> WHERE <CondList>
• <SelList> := <Attribute> | <Attribute>, <SelList>
• <CondList> := <Condition> | <Condition> AND <CondList>
• <Condition> := <Attribute> <op> <value>
• <op> := > | < | = | >= | <=
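The grammar above is small enough to parse by recursive descent. The following is a minimal sketch, not the parser of any real DBMS: the tokenizer, the function names `tokenize` and `parse_sfw`, and the nested-tuple tree representation are all assumptions made for illustration. A table is assumed to be written as a name followed by an alias, as in "Emp E".

```python
import re

# Hypothetical tokenizer: operators, commas, identifiers (with dots), integers.
TOKEN = re.compile(r"\s*(>=|<=|[<>=,]|[A-Za-z_][\w.]*|\d+)")
OPS = {">", "<", "=", ">=", "<="}

def tokenize(sql):
    sql, pos, tokens = sql.strip(), 0, []
    while pos < len(sql):
        m = TOKEN.match(sql, pos)
        if not m:
            raise SyntaxError(f"unexpected text at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_sfw(tokens):
    """Parse <SFW> and return the parse tree as nested tuples."""
    pos = 0

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of query")
        pos += 1
        return tokens[pos - 1]

    def expect(keyword):
        if take().upper() != keyword:
            raise SyntaxError(f"expected {keyword}")

    def peek_is(word):
        return pos < len(tokens) and tokens[pos].upper() == word

    def sel_list():      # <SelList> := <Attribute> | <Attribute>, <SelList>
        attr = ("Attribute", take())
        if peek_is(","):
            take()
            return ("SelList", attr, sel_list())
        return ("SelList", attr)

    def condition():     # <Condition> := <Attribute> <op> <value>
        attr, op, value = take(), take(), take()
        if op not in OPS:
            raise SyntaxError(f"unknown operator {op}")
        return ("Condition", ("Attribute", attr), ("op", op), ("value", value))

    def cond_list():     # <CondList> := <Condition> | <Condition> AND <CondList>
        cond = condition()
        if peek_is("AND"):
            take()
            return ("CondList", cond, cond_list())
        return ("CondList", cond)

    expect("SELECT")
    sel = sel_list()
    expect("FROM")
    table = ("Table", take(), take())   # table name, then alias
    expect("WHERE")
    conds = cond_list()
    if pos != len(tokens):
        raise SyntaxError("trailing tokens")
    return ("SFW", sel, table, conds)

tree = parse_sfw(tokenize("SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50"))
print(tree)
```

Each grammar rule maps to one nested function, so extending the grammar (for example with a table list) means adding or changing one function.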

Query Compiling: Parse (Parse Tree)

<SFW>
  SELECT
  <SelList>
    <Attribute>: E.Name
  FROM
  <Table>: Emp E
  WHERE
  <CondList>
    <Condition>
      <Attribute>: E.SSN
      <op>: <
      <value>: 5000
    AND
    <CondList>
      <Condition>
        <Attribute>: E.Age
        <op>: >
        <value>: 50

Query Compiling: Convert

• Input: a parse tree.
• Output: a logical query plan.

• Algorithm: $\sigma$ followed by $\pi$.
  $\pi_{E.Name}(\sigma_{E.SSN<5000 \wedge E.Age>50}(E))$

• Alternatively, a l.q.p. tree:

$\pi_{E.Name}$
  $\sigma_{E.SSN<5000 \wedge E.Age>50}$
    Emp E
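To make the logical plan concrete, here is a minimal sketch that evaluates selection followed by projection over an in-memory table. The helper names `select` and `project` and all SSN and Age values are hypothetical; only the two names in the query result come from the slides.

```python
def select(predicate, rows):
    """Relational selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def project(attributes, rows):
    """Relational projection: keep only the listed attributes."""
    return [{a: row[a] for a in attributes} for row in rows]

# Hypothetical Emp table; SSN and Age values are made up for illustration.
emp = [
    {"SSN": 1234, "Name": "Michael Jordan", "Age": 55},
    {"SSN": 4321, "Name": "Donghui Zhang", "Age": 51},
    {"SSN": 7000, "Name": "Employee C", "Age": 60},   # fails SSN<5000
    {"SSN": 100, "Name": "Employee D", "Age": 30},    # fails Age>50
]

# Selection first, then projection, as in the logical query plan.
result = project(
    ["Name"],
    select(lambda e: e["SSN"] < 5000 and e["Age"] > 50, emp),
)
print(result)
```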

Query Compiling: Apply Laws

• Replace $\sigma$ over $\times$ with $\bowtie$, push $\sigma$ [and $\pi$] down.
• Only used for multiple tables. So skip.

(The l.q.p. tree is unchanged.)


Query Optimization: Estimate Result Sizes

• The size of each input table is stored as meta data.
• The size of an intermediate result is not known in advance, but it is needed to estimate the I/O cost of a physical plan.
• But for the simple case, $\pi$ can be evaluated on the fly, so there is no need to estimate the size of the $\sigma$ result. So skip.

Query Optimization: Consider Physical Plans

• Associate each RA operator with an implementation scheme.
• Multiple implementation schemes? Enumerate all.

Plan 1 (always works!):

$\pi_{E.Name}$ [on-the-fly]
  $\sigma_{E.SSN<5000 \wedge E.Age>50}$ [scan]
    Emp E

Query Optimization: Consider Physical Plans

• For the other physical plans, need to know what indices exist.
• Primary index: controls the actual storage of a table.
– Suppose a primary B+-tree index exists on SSN.
• Secondary index: built on some other attribute. Does not store the actual records. Each leaf entry stores a set of page IDs in the primary index.
– Suppose a secondary B+-tree index exists on Age.
– E.g. an entry in the Age index: Age=50, pageIDs={1, 4, 6}

[Figure: leaf pages 1 to 6 of the SSN index]

Query Optimization: Consider Physical Plans

Plan 2:

$\pi_{E.Name}$ [on-the-fly]
  $\sigma_{E.SSN<5000 \wedge E.Age>50}$ [range search in SSN index]
    Emp E

Query Optimization: Consider Physical Plans

Plan 3:

$\pi_{E.Name}$ [on-the-fly]
  $\sigma_{E.SSN<5000 \wedge E.Age>50}$ [range search in Age index, follow pointers to SSN index]
    Emp E

Query Optimization: Estimate Costs

• Estimate #I/Os for each physical plan.
• Pick the cheapest one.

• Input: physical plan.
• Additional input:
– meta data (e.g. how many levels a B+-tree has)
– assumptions (e.g. the root node of every B+-tree is pinned)
– memory buffer size.

Query Optimization: Estimate Costs (Meta Data)

• All the database tables.
• For each table R:
– Schema
– T(R): #records in R
– For every attribute A:
  • V(R, A): #distinct values of A
  • min(R, A): minimum value of A
  • max(R, A): maximum value of A
– Primary index: #levels, #leaf nodes.
– Secondary index: #levels, #leaf nodes, average #pageIDs per leaf entry.

Query Optimization: Estimate Costs (Sample Input)

• Assume for table E:
– Schema = (SSN: int, Name: string, Age: int, Salary: int)
– T(E) = 100 tuples.
– For attribute SSN:
  • V(E, SSN)=100, min(E, SSN)=0000, max(E, SSN)=9999
– For attribute Age:
  • V(E, Age)=20, min(E, Age)=21, max(E, Age)=60
– Primary index on SSN: 3-level B+-tree, 50 leaf nodes.
– Secondary index on Age: 2-level B+-tree, 10 leaf nodes, every leaf entry points to 3.5 pageIDs (on average).
• Assumptions: all B+-tree roots are pinned. Can reach the first leaf page of a B+-tree directly.
• Memory buffer size: 2 pages.

Query Optimization: Estimate Costs

Plan 1 (always works!): scan; $\pi$ on-the-fly.

• Cost = 50. (The primary index has 50 leaf nodes. Assume we can reach the first leaf page of a B+-tree directly.)

Query Optimization: Estimate Costs

Plan 2: range search in SSN index; $\pi$ on-the-fly.

• Cost = 25. SSN<5000 selects half of the employees, so 50/2 = 25 leaf nodes.
• Note: if the condition is E.SSN>5000, needs 1 more I/O, because the search cannot start at the first leaf page and must read one non-root index node to locate the starting leaf.

Query Optimization: Estimate Costs

Plan 3: range search in Age index, follow pointers to SSN index; $\pi$ on-the-fly.

• Cost = $\lceil 10/4 \rceil + \lceil 20/4 \times 3.5 \rceil = 3 + 18 = 21$.
– First term: #I/Os in the Age index. The Age index has 10 leaf nodes. Check 1/4 of them, since [51,60] is 1/4 of [21,60].
– Second term: #I/Os in the SSN index. 20 distinct ages divided by 4 gives #ages in [51,60]; times 3.5 (#pageIDs per leaf entry) gives #I/Os in the SSN index.
– Both terms are rounded up to whole pages, which is how 2.5 + 17.5 becomes 21.

Query Optimization: Pick Best

physical plan                     I/O cost
Plan 1: scan                      50
Plan 2: range search SSN index    25
Plan 3: range search Age index    21   ← Pick!
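The three cost estimates can be reproduced from the sample meta data. This is only a sketch under the slides' assumptions (uniform value distribution, pinned roots, direct access to the first leaf page); the variable names are hypothetical, and rounding each term up to whole pages is an assumption that makes Plan 3 come out at 21.

```python
import math

# Sample meta data for table E (from the sample-input slide).
SSN_LEAVES = 50            # leaf nodes in the primary index on SSN
AGE_LEAVES = 10            # leaf nodes in the secondary index on Age
V_AGE = 20                 # distinct Age values
PAGE_IDS_PER_ENTRY = 3.5   # average pageIDs per Age leaf entry

# Selectivities under the uniform-distribution assumption.
sel_ssn = (5000 - 0) / (9999 - 0 + 1)        # SSN < 5000 over [0, 9999]
sel_age = (60 - 51 + 1) / (60 - 21 + 1)      # Age > 50 over [21, 60]

costs = {
    # Plan 1: scan every leaf page of the primary index.
    "Plan 1": SSN_LEAVES,
    # Plan 2: read only the leaf pages that hold SSN < 5000.
    "Plan 2": math.ceil(SSN_LEAVES * sel_ssn),
    # Plan 3: read part of the Age index, then follow pageIDs into the SSN index.
    "Plan 3": math.ceil(AGE_LEAVES * sel_age)
              + math.ceil(V_AGE * sel_age * PAGE_IDS_PER_ENTRY),
}

best = min(costs, key=costs.get)
print(costs, "-> pick", best)
```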


Another case study: two tables.

• Extended grammar:
– Only SFW (no GROUP BY, HAVING, or nested query)
– Simple AND condition (no OR, UNION, EXISTS, IN, …)
– Allow two tables (allow conditions like E.did=D.did)

• Example query:

SELECT E.Name, D.Dname
FROM Emp E, Dept D
WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000

Query Compiling: Parse (Grammar)

• <SFW> := SELECT <SelList> FROM <TableList> WHERE <CondList>
• <SelList> := <Attribute> | <Attribute>, <SelList>
• <TableList> := <Table> | <Table>, <Table>
• <CondList> := <Condition> | <Condition> AND <CondList>
• <Condition> := <Attribute> <op> <value> | <Attribute> = <Attribute>
• <op> := > | < | = | >= | <=

Query Compiling: Parse (Parse Tree)

<SFW>
  SELECT
  <SelList>
    <Attribute>: E.Name
    ,
    <SelList>
      <Attribute>: D.Dname
  FROM
  <TableList>
    <Table>: Emp E
    ,
    <Table>: Dept D
  WHERE
  <CondList>
    <Condition>
      <Attribute>: E.Did
      =
      <Attribute>: D.Did
    AND
    <CondList>
      <Condition>: E.SSN<5000
      AND
      <CondList>
        <Condition>: D.budget=1000

Query Compiling: Convert

• Algorithm: $\times$ then $\sigma$ then $\pi$.
  $\pi_{E.Name, D.Dname}(\sigma_{E.Did=D.Did \wedge E.SSN<5000 \wedge D.budget=1000}(E \times D))$

• The l.q.p. tree:

$\pi_{E.Name, D.Dname}$
  $\sigma_{E.Did=D.Did \wedge E.SSN<5000 \wedge D.budget=1000}$
    $\times$
      Emp E
      Dept D

Query Compiling: Apply Laws

• Always always: (try to) replace $\sigma$ over $\times$ with $\bowtie$!

After replacing $\sigma_{E.Did=D.Did}$ over $\times$ with $\bowtie$:

$\pi_{E.Name, D.Dname}$
  $\sigma_{E.SSN<5000 \wedge D.budget=1000}$
    $\bowtie$
      Emp E
      Dept D

• Also, push $\sigma$ down.

After pushing the selections down:

$\pi_{E.Name, D.Dname}$
  $\bowtie$
    $\sigma_{E.SSN<5000}$
      Emp E
    $\sigma_{D.budget=1000}$
      Dept D

Query Compiling: Apply Laws (Theory Behind)

• Let
– p = a predicate with only E attributes
– q = a predicate with only D attributes
– m = the predicate that E and D’s common attributes are equal
• We have:
  $\sigma_{p \wedge q \wedge m}(E \times D) = \sigma_p(E) \bowtie \sigma_q(D)$
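The law can be checked on a toy instance. The sketch below builds both sides with list comprehensions and compares them; the rows in `emp` and `dept` are hypothetical, and p, q, m are the predicates of the example query (E.SSN<5000, D.budget=1000, E.Did=D.Did).

```python
# Hypothetical toy data; only the attribute names come from the example query.
emp = [
    {"Name": "e1", "SSN": 1000, "Did": 1},
    {"Name": "e2", "SSN": 6000, "Did": 1},
    {"Name": "e3", "SSN": 2000, "Did": 2},
    {"Name": "e4", "SSN": 3000, "Did": 3},
]
dept = [
    {"Did": 1, "Dname": "d1", "budget": 1000},
    {"Did": 2, "Dname": "d2", "budget": 500},
    {"Did": 3, "Dname": "d3", "budget": 1000},
]

p = lambda e: e["SSN"] < 5000            # predicate on E only
q = lambda d: d["budget"] == 1000        # predicate on D only
m = lambda e, d: e["Did"] == d["Did"]    # common attributes are equal

# Left side: select on the full cross product.
lhs = [(e, d) for e in emp for d in dept if p(e) and q(d) and m(e, d)]

# Right side: filter each input first, then join on the common attribute.
emp_sel = [e for e in emp if p(e)]
dept_sel = [d for d in dept if q(d)]
rhs = [(e, d) for e in emp_sel for d in dept_sel if m(e, d)]

print(lhs == rhs, [(e["Name"], d["Dname"]) for e, d in rhs])
```

The right side touches 3 × 2 pairs instead of 4 × 3, which is the point of pushing selections down.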


Query Optimization: Consider Physical Plans

• Because join is so important, let’s skip result size estimation for now, and let’s assume selections are not pushed down.

$\pi_{E.Name, D.Dname}$
  $\sigma_{E.SSN<5000 \wedge D.budget=1000}$
    $\bowtie$
      Emp E
      Dept D

Four Join Algorithms

• Iteration join (nested loop join)
• Merge join
• Hash join
• Join with index

Example: $E \bowtie D$ over common attribute Did

• E:
– T(E) = 10,000
– primary index on SSN, 3 levels.
– |E| = 1,000 leaf nodes.
• D:
– T(D) = 5,000
– primary index on Did, 3 levels.
– |D| = 500 leaf nodes.
• Memory available = 101 blocks

Iteration Join

1. for every block in E
2.   scan through D;
3.   join records in the E block with records in the D block.

• I/O cost = |E| + |E| * |D| = 1000 + 1000*500 = 501,000.
• Works well with a small buffer (e.g. two blocks).

• Can we do better? Use our memory:
(1) Read 100 blocks of E
(2) Read all of D (using 1 block) + join
(3) Repeat until done

• I/O cost = |E| + |E|/100 * |D| = 1000 + 10*500 = 6,000.

• Can we do better? Reverse the join order: $D \bowtie E$, i.e. for every 100 D blocks, go through E.

• I/O cost = |D| + |D|/100 * |E| = 500 + 5*1000 = 5,500.
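The three iteration-join costs follow one formula: read the outer table once, and scan the inner table once per chunk of the outer table that fits in memory. A minimal sketch, assuming one buffer block is reserved for the inner table (the function name and parameter names are hypothetical):

```python
import math

def block_nested_loop_cost(outer_blocks, inner_blocks, memory_blocks):
    """I/O cost of iteration join with memory_blocks - 1 blocks for the outer table."""
    chunk = memory_blocks - 1                  # one block is kept for the inner table
    passes = math.ceil(outer_blocks / chunk)   # number of scans of the inner table
    return outer_blocks + passes * inner_blocks

E_BLOCKS, D_BLOCKS = 1000, 500

print(block_nested_loop_cost(E_BLOCKS, D_BLOCKS, 2))     # two-block buffer, E outer
print(block_nested_loop_cost(E_BLOCKS, D_BLOCKS, 101))   # 101 blocks, E outer
print(block_nested_loop_cost(D_BLOCKS, E_BLOCKS, 101))   # 101 blocks, D outer
```

With the smaller table as the outer table, fewer passes over the inner table are needed, which is why reversing the join order helps here.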

• Merge join (conceptually)
(1) If R1 and R2 are not sorted, sort them.
(2) i ← 1; j ← 1;
    While (i ≤ T(R1)) ∧ (j ≤ T(R2)) do
        if R1{ i }.C = R2{ j }.C then Output-Tuples
        else if R1{ i }.C > R2{ j }.C then j ← j+1
        else if R1{ i }.C < R2{ j }.C then i ← i+1

Procedure Output-Tuples
While (i ≤ T(R1)) ∧ (R1{ i }.C = R2{ j }.C) do
    [ jj ← j;
      while (jj ≤ T(R2)) ∧ (R1{ i }.C = R2{ jj }.C) do
          [ output pair R1{ i }, R2{ jj };
            jj ← jj+1 ]
      i ← i+1 ]

Example

i    R1{i}.C    R2{j}.C    j
1    10         5          1
2    20         20         2
3    20         20         3
4    30         30         4
5    40         30         5
                50         6
                52         7
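A runnable version of the conceptual merge join, applied to the example columns above. This is a sketch rather than a disk-based implementation: both inputs are assumed to be sorted in-memory lists, indices are 0-based, and the function name `merge_join` is hypothetical. Unlike the pseudocode, it advances j past the matched group right away instead of leaving that to the main loop.

```python
def merge_join(r1, r2, key=lambda t: t):
    """Join two lists that are already sorted on the join key."""
    out = []
    i = j = 0
    while i < len(r1) and j < len(r2):
        a, b = key(r1[i]), key(r2[j])
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            # Same role as Output-Tuples: pair every r1 tuple that has this
            # key with every r2 tuple that has this key.
            while i < len(r1) and key(r1[i]) == b:
                jj = j
                while jj < len(r2) and key(r2[jj]) == b:
                    out.append((r1[i], r2[jj]))
                    jj += 1
                i += 1
            j = jj  # skip the whole matched group in r2

    return out

R1 = [10, 20, 20, 30, 40]
R2 = [5, 20, 20, 30, 30, 50, 52]
print(merge_join(R1, R2))
```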

Merge Join Cost

• Recall that |E|=1000, |D|=500, and D is already sorted on Did.
• External sort of E: pass 0, by reading and writing E, produces a file with 10 sorted runs. One more read is enough to merge them.
• No need to write the sorted E! It can be pipelined to the join operator.
• Cost = 3*1000 + 500 = 3,500 (read E, write the runs, read the runs, plus read D).

• Hash join (conceptual)
– Hash function h, range 0 → k
– Buckets for R1: G0, G1, ... Gk
– Buckets for R2: H0, H1, ... Hk

Algorithm
(1) Hash R1 tuples into G buckets
(2) Hash R2 tuples into H buckets
(3) For i = 0 to k do
      match tuples in Gi, Hi buckets

Simple example hash: even/odd

R1: 2, 4, 3, 5, 8, 9
R2: 5, 4, 12, 3, 13, 8, 11, 14

Buckets    R1         R2
Even:      2 4 8      4 12 8 14
Odd:       3 5 9      5 3 13 11
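The even/odd example can be run directly. The sketch below keeps everything in memory and uses `value % num_buckets` as the hash function, which is an assumption chosen to match the even/odd split; `partition` and `hash_join` are hypothetical names.

```python
from collections import defaultdict

def partition(values, num_buckets):
    """Hash each value into one of num_buckets buckets."""
    buckets = defaultdict(list)
    for v in values:
        buckets[v % num_buckets].append(v)
    return buckets

def hash_join(r1, r2, num_buckets=2):
    g = partition(r1, num_buckets)   # G buckets for R1
    h = partition(r2, num_buckets)   # H buckets for R2
    out = []
    for i in range(num_buckets):
        # Only tuples in buckets with the same number can match.
        for a in g[i]:
            for b in h[i]:
                if a == b:
                    out.append((a, b))
    return out

R1 = [2, 4, 3, 5, 8, 9]
R2 = [5, 4, 12, 3, 13, 8, 11, 14]
print(dict(partition(R1, 2)), dict(partition(R2, 2)))
print(hash_join(R1, R2))
```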

Hash Join Cost

• Read + write both E and D for partitioning, then read to join.
• Cost = 3 * (1000 + 500) = 4,500.

• Join with index (conceptually)

For each r ∈ E do
    find the corresponding D tuple by probing the index.

• Assuming the root is pinned in memory, each probe of the 3-level index on Did costs 2 I/Os:
  Cost = |E| + T(E)*2 = 1000 + 10,000*2 = 21,000.

Note:

• The costs are different if we integrate the selection conditions!
• E.g. for the index join, only check half of E. So the cost should be 500 + 5,000*2 = 10,500.
• A selection condition that is not used during the join should be evaluated to filter the join result. E.g. the index join checked D without evaluating the selection condition on D.
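The merge, hash, and index join costs quoted on the preceding slides are simple functions of the statistics for E and D. A small sketch that recomputes them (the constant and variable names are hypothetical):

```python
E_BLOCKS, D_BLOCKS = 1000, 500   # |E|, |D|
T_E = 10_000                     # T(E)
INDEX_LEVELS = 3                 # levels of the primary index on D.Did

# Merge join: read E, write sorted runs, read the runs; D is already sorted.
merge_cost = 3 * E_BLOCKS + D_BLOCKS

# Hash join: read and write both tables to partition, then read both to join.
hash_cost = 3 * (E_BLOCKS + D_BLOCKS)

# Index join: the root is pinned, so each probe reads the other levels only.
probe_cost = INDEX_LEVELS - 1
index_cost = E_BLOCKS + T_E * probe_cost

# Index join when the selection E.SSN<5000 (half of E) is integrated.
index_cost_with_selection = E_BLOCKS // 2 + (T_E // 2) * probe_cost

print(merge_cost, hash_cost, index_cost, index_cost_with_selection)
```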

Physical plan with selections being pushed down

• Finally, let’s consider pushing down selections.
• Now that the join operator takes intermediate results (which could be written to disk), we need to estimate their sizes…

$\pi_{E.Name, D.Dname}$
  $\bowtie$
    $\sigma_{E.SSN<5000}$
      Emp E
    $\sigma_{D.budget=1000}$
      Dept D


Estimating result size

• Keep statistics for relation R
– T(R): # tuples in R
– S(R): # of bytes in each R tuple
– V(R, A): # distinct values in R for attribute A
– min(R, A)
– max(R, A)

Example R

A: 20 byte string
B: 4 byte integer
C: 8 byte date
D: 5 byte string

A      B    C     D
cat    1    10    a
cat    1    20    b
dog    1    30    a
dog    1    40    c
bat    1    50    d

T(R) = 5      S(R) = 37
V(R,A) = 3    V(R,C) = 5
V(R,B) = 1    V(R,D) = 4

Size estimates for $W = R1 \times R2$

$T(W) = T(R1) \times T(R2)$

$S(W) = S(R1) + S(R2)$

Size estimate for $W = \sigma_{A=a}(R)$

$S(W) = S(R)$

$T(W) = ?$

Example R

A      B    C     D
cat    1    10    a
cat    1    20    b
dog    1    30    a
dog    1    40    c
bat    1    50    d

V(R,A)=3, V(R,B)=1, V(R,C)=5, V(R,D)=4

$W = \sigma_{Z=val}(R)$: $T(W) = \frac{T(R)}{V(R,Z)}$

Assumption:

Values in the select expression Z = val are uniformly distributed over the possible V(R,Z) values.

What about $W = \sigma_{Z \geq val}(R)$?

$T(W) = ?$

• $T(W) = T(R)/2$?
• Solution: estimate the fraction of values in the range.

Example: R has attribute Z with Min=1, Max=20, V(R,Z)=10.

$W = \sigma_{Z \geq 16}(R)$

$f = \frac{5}{20}$ (fraction of range)

$T(W) = f \times T(R)$
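Both selection estimates fit in a few lines. A minimal sketch, assuming integer-valued attributes and uniformly distributed values; the function names are hypothetical, and T(R) = 100 in the range example is an assumed value because the slide does not give one.

```python
def equality_selection_size(t_r, v_rz):
    """Estimate T(W) for W = sigma_{Z=val}(R) as T(R) / V(R,Z)."""
    return t_r / v_rz

def range_fraction(lo, hi, min_z, max_z):
    """Fraction of the integer range [min_z, max_z] covered by [lo, hi]."""
    lo, hi = max(lo, min_z), min(hi, max_z)
    if hi < lo:
        return 0.0
    return (hi - lo + 1) / (max_z - min_z + 1)

# Equality on the example relation R: T(R)=5, V(R,A)=3, V(R,B)=1.
print(equality_selection_size(5, 3))   # selection on A
print(equality_selection_size(5, 1))   # selection on B

# Range Z >= 16 with Min=1, Max=20.
f = range_fraction(16, 20, 1, 20)
T_R = 100                              # assumed; not given on the slide
print(f, f * T_R)
```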

Size estimate for $W = R1 \bowtie R2$

Let X = attributes of R1, Y = attributes of R2.

Case 1: $X \cap Y = \emptyset$

Same as $R1 \times R2$.

Case 2: $W = R1 \bowtie R2$ with $X \cap Y = A$, e.g. R1(A, B, C) and R2(A, D)

Assumption:

$V(R1,A) \leq V(R2,A) \Rightarrow$ every A value in R1 is in R2
$V(R2,A) \leq V(R1,A) \Rightarrow$ every A value in R2 is in R1

Computing T(W) when $V(R1,A) \leq V(R2,A)$

Take 1 tuple of R1 and match it against R2. 1 tuple matches with $\frac{T(R2)}{V(R2,A)}$ tuples, so

$T(W) = \frac{T(R2) \times T(R1)}{V(R2,A)}$

• $V(R1,A) \leq V(R2,A)$: $T(W) = \frac{T(R2) \times T(R1)}{V(R2,A)}$
• $V(R2,A) \leq V(R1,A)$: $T(W) = \frac{T(R2) \times T(R1)}{V(R1,A)}$

[A is the common attribute]

In general, for $W = R1 \bowtie R2$:

$T(W) = \frac{T(R2) \times T(R1)}{\max\{V(R1,A), V(R2,A)\}}$

$S(W) = S(R1) + S(R2) - S(A)$, where S(A) is the size of attribute A.

Note: for complex expressions, need intermediate T, S, V results.

E.g. $W = [\sigma_{A=a}(R1)] \bowtie R2$

Treat $\sigma_{A=a}(R1)$ as relation U:

$T(U) = T(R1)/V(R1,A)$, $S(U) = S(R1)$

Also need V(U, *)!!

To estimate Vs

E.g., $U = \sigma_{A=a}(R1)$. Say R1 has attributes A, B, C, D.

V(U, A) = ?  V(U, B) = ?  V(U, C) = ?  V(U, D) = ?

Example R1

A      B    C     D
cat    1    10    10
cat    1    20    20
dog    1    30    10
dog    1    40    30
bat    1    50    10

V(R1,A)=3, V(R1,B)=1, V(R1,C)=5, V(R1,D)=3

$U = \sigma_{A=a}(R1)$

V(U,A) = 1
V(U,B) = 1
V(U,C) = $\frac{T(R1)}{V(R1,A)}$
V(U,D) ... somewhere in between

For an arbitrary attribute D other than A (the attribute being selected), V(R1,D) ranges from 1 to T(R1), and V(U,D) ranges from 1 to T(R1)/V(R1,A).

Let’s make

$\frac{V(R1,D)}{T(R1)} = \frac{V(U,D)}{T(R1)/V(R1,A)}$

Or, $V(U,D) = V(R1,D)/V(R1,A)$.

For joins $U = R1(A,B) \bowtie R2(A,C)$

V(U,A) = min { V(R1, A), V(R2, A) }
V(U,B) = V(R1, B)
V(U,C) = V(R2, C)

Example:

$Z = R1(A,B) \bowtie R2(B,C) \bowtie R3(C,D)$

R1: T(R1) = 1000, V(R1,A)=50, V(R1,B)=100
R2: T(R2) = 2000, V(R2,B)=200, V(R2,C)=300
R3: T(R3) = 3000, V(R3,C)=90, V(R3,D)=500

Partial result: $U = R1 \bowtie R2$

$T(U) = \frac{1000 \times 2000}{200}$
V(U,A) = 50
V(U,B) = 100
V(U,C) = 300

$Z = U \bowtie R3$

$T(Z) = \frac{1000 \times 2000 \times 3000}{200 \times 300}$
V(Z,A) = 50
V(Z,B) = 100
V(Z,C) = 90
V(Z,D) = 500
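The three-way example can be replayed by applying the join estimates twice. The sketch below assumes exactly one common attribute per join and represents the V statistics as dictionaries; `estimate_join` is a hypothetical helper name.

```python
def estimate_join(t1, v1, t2, v2, attr):
    """Estimate T and V for a natural join on the single common attribute attr."""
    t = t1 * t2 / max(v1[attr], v2[attr])
    v = {**v1, **v2}                      # non-join attributes keep their V
    v[attr] = min(v1[attr], v2[attr])     # join attribute takes the minimum
    return t, v

t_r1, v_r1 = 1000, {"A": 50, "B": 100}
t_r2, v_r2 = 2000, {"B": 200, "C": 300}
t_r3, v_r3 = 3000, {"C": 90, "D": 500}

t_u, v_u = estimate_join(t_r1, v_r1, t_r2, v_r2, "B")   # U = R1 join R2
t_z, v_z = estimate_join(t_u, v_u, t_r3, v_r3, "C")     # Z = U join R3

print(t_u, v_u)
print(t_z, v_z)
```

The estimate for T(Z) evaluates to 100,000 tuples, the same value as the fraction given for T(Z).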

Example

SELECT E.Name, D.Dname
FROM Emp E, Dept D
WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000

• E:
– T(E) = 10,000
– primary index on SSN, 3 levels.
– |E| = 1,000 leaf nodes.
– V(E,SSN) = 10,000: from 0000 to 9999.
• D:
– T(D) = 5,000
– primary index on Did, 3 levels.
– |D| = 500 leaf nodes.
– V(D,budget) = 20: from 100 to 10,000.
• Memory available = 11 blocks
• ?? What’s the best physical plan?

Note: |E’| = 500 and |D’| = 25, where E’ = $\sigma_{E.SSN<5000}(E)$ and D’ = $\sigma_{D.budget=1000}(D)$.

l.q.p.

$\pi_{E.Name, D.Dname}$
  $\bowtie$
    $\sigma_{E.SSN<5000}$
      Emp E
    $\sigma_{D.budget=1000}$
      Dept D

p.q.p. #1

$\pi_{E.Name, D.Dname}$
  $\bowtie$ [iteration join; D is outer table]
    $\sigma_{E.SSN<5000}$ [range search]
      Emp E
    $\sigma_{D.budget=1000}$ [scan]
      Dept D

Cost = 500 (read D) + 25 (write D’) + 25 (read D’) + $\lceil 25/10 \rceil$ * 500 = 2,050

p.q.p. #2

$\pi_{E.Name, D.Dname}$
  $\bowtie$ [sort merge]
    $\sigma_{E.SSN<5000}$ [range search]
      Emp E
    $\sigma_{D.budget=1000}$ [scan]
      Dept D

Cost = 5*500 (sort E’; no write) + 500 (read D) = 3,000

p.q.p. #3

$\pi_{E.Name, D.Dname}$
  $\bowtie$ [hash join]
    $\sigma_{E.SSN<5000}$ [range search]
      Emp E
    $\sigma_{D.budget=1000}$ [scan]
      Dept D

Cost = 3*500 (for E’) + 500 (read D) + 25 (write D’) + 3*25 (for D’) = 2,100

Note: M should be bigger than $\sqrt{\min\{|E’|, |D’|\}} + 1$.
– Why?
– What if not?

p.q.p. #4

$\pi_{E.Name, D.Dname}$
  $\bowtie$ [index nested loop join]
    $\sigma_{E.SSN<5000}$ [range search]
      Emp E
    $\sigma_{D.budget=1000}$
      Dept D

Cost = 500 (scan E’) + 5000*(3-1) (for D) = 10,500
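The four physical plans can be compared by summing the cost terms listed on each plan slide. This is a sketch with hypothetical names; the formulas are transcribed term by term from the slides, and the totals are whatever those terms add up to.

```python
import math

E_SEL_BLOCKS = 500     # |E'|: leaf nodes holding E.SSN < 5000
D_SEL_BLOCKS = 25      # |D'|: blocks holding D.budget = 1000
D_BLOCKS = 500         # |D|
T_E_SEL = 5000         # tuples in E'
MEMORY = 11            # available memory blocks
INDEX_LEVELS = 3       # levels of the primary index on D.Did

plan_costs = {
    # read D, write D', read D', then one pass over E' per 10 blocks of D'
    "p.q.p #1 iteration join": D_BLOCKS + D_SEL_BLOCKS + D_SEL_BLOCKS
        + math.ceil(D_SEL_BLOCKS / (MEMORY - 1)) * E_SEL_BLOCKS,
    # sort E' (5 passes over 500 blocks, no final write), read D
    "p.q.p #2 sort merge": 5 * E_SEL_BLOCKS + D_BLOCKS,
    # 3 passes over E', read D, write D', 3 passes over D'
    "p.q.p #3 hash join": 3 * E_SEL_BLOCKS + D_BLOCKS + D_SEL_BLOCKS
        + 3 * D_SEL_BLOCKS,
    # scan E', then probe the D index once per E' tuple (root pinned)
    "p.q.p #4 index nested loop": E_SEL_BLOCKS + T_E_SEL * (INDEX_LEVELS - 1),
}

best_plan = min(plan_costs, key=plan_costs.get)
for name, cost in plan_costs.items():
    print(f"{name}: {cost}")
print("pick:", best_plan)
```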

Some notes

• For BNL (block nested loop), merge, and hash joins: always push selections down!
• For index join, do not push the selection on the inner table (the one whose primary key is involved in the join condition).
• For BNL, make the smaller table the outer table: the join could be free if it fits in memory!