query processor a query processor is a module in the dbms that performs the tasks to process, to...

Query Processor

A query processor is the DBMS module that processes and optimizes a high-level query and generates an execution strategy for it.

In a distributed DBMS (DDBMS), the query processor also localizes the data referenced by the query, based on the fragmentation scheme, and generates an execution strategy that incorporates the communication operations involved in processing the query.

Query Optimizer

Queries expressed in SQL can have multiple equivalent relational algebra query expressions.

The distributed query optimizer must select the ordering of relational algebra operations, the sites at which to process data, and possibly the way data should be transferred. This makes distributed query processing significantly more difficult than centralized query processing.

Complexity of Relational Algebra Operations

Relational algebra is used to express the output of the query. The complexity of relational algebra operations plays a role in defining some of the principles of query optimization. All complexity measures are based on the cardinality n of the relations.

Operation                                        Complexity
Select, Project (without duplicate elimination)  O(n)
Project (with duplicate elimination), Group      O(n log n)
Join, Semi-join, Division, Set Operators         O(n log n)
Cartesian Product                                O(n^2)

Characteristics of Query Processors

Languages

The input language can be relational algebra or relational calculus; the output language is relational algebra annotated with communication primitives. The query processor must map the input language to the output language efficiently.

Types of Optimization

The output language specification represents the execution strategy. There can be many such strategies. The best one can be selected through exhaustive search or by applying heuristics, such as minimizing the size of intermediate relations. For distributed databases, semijoins can be applied to reduce data transfer.

When to Optimize

Static: optimization is done before the query is executed (at compilation time), so its cost is amortized over multiple executions. It is mostly based on exhaustive search. Because the sizes of intermediate relations must be estimated, it can produce sub-optimal strategies.

Dynamic: optimization is done at run time, every time the query is executed. It can use the exact sizes of intermediate relations, but it is expensive and is therefore based on heuristics.

Hybrid: mixes the static and dynamic approaches. The approach is mainly static, but dynamic query optimization may take place when a large difference between the predicted and actual sizes of intermediate relations is detected.

Characteristics of Query Processors

Statistics
- fragment cardinality and size
- size and number of distinct values for each attribute
- detailed histograms of attribute values for better selectivity estimation

Decision Sites
- one site or several sites participate in the selection of the strategy

Exploitation of network topology
- wide area network: communication cost
- local area network: parallel execution

Exploitation of replicated fragments
- larger number of possible strategies

Use of Semijoins
- reduce the size of data transfers
- increase the number of messages and the local processing
- good for fast or slow networks?

Layers of Query Processing

Control site:
1. Query decomposition. Input: calculus query on distributed relations. Uses the global schema. Output: algebra query on distributed relations.
2. Data localization. Uses the fragment schema. Output: fragment query.
3. Global optimization. Uses statistics on fragments. Output: optimized fragment query with communication operations.

Local sites:
4. Local optimization. Uses the local schema. Output: optimized local queries.

Query Decomposition

Normalization
- The calculus query is written in a normalized form (CNF or DNF) for subsequent manipulation.

Analysis
- The query is analyzed for semantic correctness.

Simplification
- Redundant predicates are eliminated to obtain simplified queries.

Restructuring
- The calculus query is translated to an optimal algebraic query representation.

Query Decomposition: Normalization

Lexical and syntactic analysis
- check validity
- check for attributes and relations
- type checking on the qualification

The predicates in the query qualification can be represented in one of two forms, Conjunctive Normal Form (CNF) or Disjunctive Normal Form (DNF):

CNF: (p11 ∨ p12 ∨ ... ∨ p1n) ∧ ... ∧ (pm1 ∨ pm2 ∨ ... ∨ pmn)
DNF: (p11 ∧ p12 ∧ ... ∧ p1n) ∨ ... ∨ (pm1 ∧ pm2 ∧ ... ∧ pmn)

ORs are mapped into unions; ANDs are mapped into joins or selections.

Query Decomposition: Analysis

A query is rejected if
- its attributes or relations are not defined in the global schema, or
- the operations used in its qualification are semantically incorrect.

Semantic correctness can be determined with a query graph only for queries that do not use disjunction or negation.

In the query graph, one node represents the result and every other node represents an operand relation. An edge between two operand nodes represents a join, and an edge between an operand node and the result node represents a projection.

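Query Graph and Join Graph

SELECT Ename, Resp
FROM E, G, J
WHERE E.ENO = G.ENO
AND G.JNO = J.JNO
AND JNAME = "CAD"
AND DUR >= 36
AND Title = "Prog"

[Figure: query graph. Operand nodes EMP, ASG and PROJ plus a Result node. Join edges EMP-ASG (E.ENO = G.ENO) and ASG-PROJ (G.JNO = J.JNO); projection edges to Result labelled Ename and Resp; selection labels Title = "Prog", DUR >= 36 and JNAME = "CAD".]

[Figure: join graph. Nodes EMP, ASG and PROJ with edges EMP-ASG (E.ENO = G.ENO) and ASG-PROJ (G.JNO = J.JNO).]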

Disconnected Query Graph

A semantically incorrect conjunctive multivariable query without negation has a query graph that is not connected.

SELECT Ename, Resp
FROM E, G, J
WHERE E.ENO = G.ENO
AND JNAME = "CAD"
AND DUR >= 36
AND Title = "Prog"

[Figure: query graph for this query. The join predicate G.JNO = J.JNO is missing, so the PROJ node is not connected to EMP, ASG or Result.]

Simplification: Eliminating Redundancy

Redundant predicates are eliminated using well-known idempotency rules:

p ∧ p = p
p ∨ p = p
p ∨ true = true
p ∨ false = p
p ∧ true = p
p ∧ false = false
p1 ∧ (p1 ∨ p2) = p1
p1 ∨ (p1 ∧ p2) = p1

Such redundant predicates arise when a user query is enriched with several predicates to incorporate view-relation correspondence and to ensure semantic integrity and security.

Eliminating Redundancy -- An Example

SELECT TITLE
FROM E
WHERE (NOT (TITLE = "Programmer")
AND (TITLE = "Programmer" OR TITLE = "Elec.Engr")
AND NOT (TITLE = "Elec.Engr"))
OR ENAME = "J.Doe";

simplifies to

SELECT TITLE
FROM E
WHERE ENAME = "J.Doe";

Let
p1 = <TITLE = "Programmer">
p2 = <TITLE = "Elec.Engr">
p3 = <ENAME = "J.Doe">

The query qualification is
(¬p1 ∧ (p1 ∨ p2) ∧ ¬p2) ∨ p3

Its disjunctive normal form is
(¬p1 ∧ p1 ∧ ¬p2) ∨ (¬p1 ∧ p2 ∧ ¬p2) ∨ p3
= (false ∧ ¬p2) ∨ (¬p1 ∧ false) ∨ p3
= false ∨ false ∨ p3
= p3
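The simplification in this example can be checked mechanically. The sketch below is only an illustration: it enumerates all truth assignments of p1, p2 and p3 and confirms that the original qualification always has the same value as p3. The function names are hypothetical and not part of any DBMS.

```python
from itertools import product


def qualification(p1: bool, p2: bool, p3: bool) -> bool:
    """Original qualification: (NOT p1 AND (p1 OR p2) AND NOT p2) OR p3."""
    return ((not p1) and (p1 or p2) and (not p2)) or p3


def equivalent_to_p3() -> bool:
    """True if the qualification equals p3 for every truth assignment."""
    return all(
        qualification(p1, p2, p3) == p3
        for p1, p2, p3 in product([False, True], repeat=3)
    )


print(equivalent_to_p3())
```

A truth table is only practical for a handful of predicates; a real simplifier would apply the idempotency rules symbolically.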

Query Decomposition: Rewriting

The calculus query is rewritten in relational algebra in two steps:
- a straightforward transformation from relational calculus to relational algebra, and
- a restructuring of the relational algebra expression to improve performance.

Rewriting -- Transformation Rules (I)

Commutativity of binary operations:
R × S ⇔ S × R
R ⋈ S ⇔ S ⋈ R
R ∪ S ⇔ S ∪ R

Associativity of binary operations:
(R × S) × T ⇔ R × (S × T)
(R ⋈ S) ⋈ T ⇔ R ⋈ (S ⋈ T)

Idempotence of unary operations (grouping of projections and selections):
Π_{A'}(Π_{A''}(R)) ⇔ Π_{A'}(R), for A' ⊆ A'' ⊆ A
σ_{p1(A1)}(σ_{p2(A2)}(R)) ⇔ σ_{p1(A1) ∧ p2(A2)}(R)

Rewriting -- Transformation Rules (II)

Commuting selection with projection:
Π_{A1, …, An}(σ_{p(Ap)}(R)) ⇔ Π_{A1, …, An}(σ_{p(Ap)}(Π_{A1, …, An, Ap}(R)))

Commuting selection with binary operations:
σ_{p(Ai)}(R × S) ⇔ (σ_{p(Ai)}(R)) × S
σ_{p(Ai)}(R ⋈ S) ⇔ (σ_{p(Ai)}(R)) ⋈ S
σ_{p(Ai)}(R ∪ S) ⇔ σ_{p(Ai)}(R) ∪ σ_{p(Ai)}(S)

Commuting projection with binary operations:
Π_{C}(R × S) ⇔ Π_{A}(R) × Π_{B}(S), where C = A ∪ B
Π_{C}(R ⋈ S) ⇔ Π_{C}(R) ⋈ Π_{C}(S)
Π_{C}(R ∪ S) ⇔ Π_{C}(R) ∪ Π_{C}(S)

An SQL Query and Its Query Tree

SELECT Ename
FROM J, G, E
WHERE G.ENO = E.ENO
AND G.JNO = J.JNO
AND ENAME <> "J.Doe"
AND JNAME = "CAD"
AND (DUR = 12 OR DUR = 24)

[Figure: query tree. Root Π_{ENAME}; below it the selection σ_{ENAME<>"J.DOE" ∧ JNAME="CAD/CAM" ∧ (DUR=12 ∨ DUR=24)}; below that the join ⋈_{JNO} of PROJ with the result of ASG ⋈_{ENO} EMP.]

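Query Decomposition: Rewriting

[Figure: restructured query tree, with selections and projections pushed down to the base relations:
Π_{ENAME}( Π_{JNO}(σ_{JNAME="CAD/CAM"}(PROJ)) ⋈_{JNO} Π_{JNO,ENAME}( Π_{JNO,ENO}(σ_{DUR=12 ∨ DUR=24}(ASG)) ⋈_{ENO} Π_{ENO,ENAME}(σ_{ENAME<>"J.DOE"}(EMP)) ) )]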

Data Localization

Input: algebraic query on distributed relations

Determine which fragments are involved

Localization program
- substitute for each global relation its materialization program
- optimize

Data Localization -- An Example

EMP is fragmented into
EMP1 = σ_{ENO ≤ "E3"}(EMP)
EMP2 = σ_{"E3" < ENO ≤ "E6"}(EMP)
EMP3 = σ_{ENO > "E6"}(EMP)

ASG is fragmented into
ASG1 = σ_{ENO ≤ "E3"}(ASG)
ASG2 = σ_{ENO > "E3"}(ASG)

[Figure: the query tree Π_{ENAME}(σ_{DUR=12 ∨ DUR=24}(σ_{JNAME="CAD/CAM"}(σ_{ENAME<>"J.DOE"}(PROJ ⋈_{JNO} (ASG ⋈_{ENO} EMP))))) and its localized form, in which EMP is replaced by EMP1 ∪ EMP2 ∪ EMP3 and ASG by ASG1 ∪ ASG2.]

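Reduction with Selection

Given a relation R with horizontal fragments FR = {R1, R2, …, Rn}, where Rj = σ_{pj}(R):

σ_{pi}(Rj) = ∅ if ∀x ∈ R: ¬(pi(x) ∧ pj(x))

Example, with EMP fragmented into EMP1 (ENO ≤ "E3"), EMP2 ("E3" < ENO ≤ "E6") and EMP3 (ENO > "E6"):

SELECT *
FROM EMP
WHERE ENO = "E5"

Generic query: σ_{ENO="E5"}(EMP) = σ_{ENO="E5"}(EMP1 ∪ EMP2 ∪ EMP3)
Reduced query: σ_{ENO="E5"}(EMP2)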
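The selection reduction rule can be sketched in a few lines. This is a minimal illustration, not the author's implementation: employee numbers are modelled as integers (so "E5" becomes 5), each fragment is described by a hypothetical (low, high] range on ENO, and the names `Fragment` and `reduce_selection` are made up for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str
    low: float   # exclusive lower bound on ENO
    high: float  # inclusive upper bound on ENO

    def may_contain(self, eno: int) -> bool:
        """True if the fragment predicate does not contradict ENO = eno."""
        return self.low < eno <= self.high


# Assumed encoding of the EMP fragmentation: "E3" -> 3, "E6" -> 6.
EMP_FRAGMENTS = [
    Fragment("EMP1", -math.inf, 3),
    Fragment("EMP2", 3, 6),
    Fragment("EMP3", 6, math.inf),
]


def reduce_selection(fragments, eno):
    """Keep only the fragments on which the selection ENO = eno can be non-empty."""
    return [f.name for f in fragments if f.may_contain(eno)]


print(reduce_selection(EMP_FRAGMENTS, 5))  # ['EMP2']
```

Only EMP2 survives for ENO = "E5", which matches the reduced query of the example.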

Reduction with Join

EMP is fragmented into EMP1 (ENO ≤ "E3"), EMP2 ("E3" < ENO ≤ "E6") and EMP3 (ENO > "E6").
ASG is fragmented into ASG1 (ENO ≤ "E3") and ASG2 (ENO > "E3").

SELECT *
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO

Generic query: (EMP1 ∪ EMP2 ∪ EMP3) ⋈_{ENO} (ASG1 ∪ ASG2)

Reduction with Join (I)

Distribute the join over the unions, using (R1 ∪ R2) ⋈ S ⇔ (R1 ⋈ S) ∪ (R2 ⋈ S):

(EMP1 ⋈_{ENO} ASG1) ∪ (EMP2 ⋈_{ENO} ASG1) ∪ (EMP2 ⋈_{ENO} ASG2) ∪ (EMP3 ⋈_{ENO} ASG1) ∪ (EMP3 ⋈_{ENO} ASG2) ∪ (EMP1 ⋈_{ENO} ASG2)

Reduction with Join (II)

Given Ri = σ_{pi}(R) and Rj = σ_{pj}(R):

Ri ⋈ Rj = ∅ if ∀x ∈ Ri, ∀y ∈ Rj: ¬(pi(x) ∧ pj(y))

Reduction with join:
1. Distribute the join over the unions.
2. Eliminate unnecessary work, that is, joins whose result is known to be empty.

Reduced query: (EMP1 ⋈_{ENO} ASG1) ∪ (EMP2 ⋈_{ENO} ASG2) ∪ (EMP3 ⋈_{ENO} ASG2)
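The two steps of reduction with join can be illustrated as follows. The sketch assumes integer employee numbers and describes each fragment by a hypothetical (low, high] range on ENO; a join of two fragments is kept only if their ranges overlap. The helper names are illustrative.

```python
import math
from itertools import product

# Fragment name -> (exclusive low, inclusive high) bound on ENO.
# Assumed encoding: "E3" -> 3, "E6" -> 6.
EMP = {"EMP1": (-math.inf, 3), "EMP2": (3, 6), "EMP3": (6, math.inf)}
ASG = {"ASG1": (-math.inf, 3), "ASG2": (3, math.inf)}


def ranges_overlap(a, b):
    """(low, high] ranges overlap when the larger low is below the smaller high."""
    return max(a[0], b[0]) < min(a[1], b[1])


def useful_joins(left, right):
    """Step 1: pair every fragment with every fragment (join over union).
    Step 2: drop the pairs whose predicates contradict each other."""
    return [
        (l_name, r_name)
        for (l_name, l_range), (r_name, r_range) in product(left.items(), right.items())
        if ranges_overlap(l_range, r_range)
    ]


print(useful_joins(EMP, ASG))
# [('EMP1', 'ASG1'), ('EMP2', 'ASG2'), ('EMP3', 'ASG2')]
```

Three of the six fragment joins remain, which agrees with the reduced query.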

Reduction for Vertical Fragmentation (VF)

Find useless intermediate relations. Let relation R be defined over attributes A = {A1, A2, …, An} and vertically fragmented as Ri = Π_{A'}(R), where A' ⊆ A and K is the key of R:

Π_{D,K}(Ri) is useless if the set of projection attributes D is not in A'.

Example:
EMP1 = Π_{ENO,ENAME}(EMP)
EMP2 = Π_{ENO,TITLE}(EMP)

SELECT ENAME
FROM EMP

Generic query: Π_{ENAME}(EMP1 ⋈_{ENO} EMP2)
Reduced query: Π_{ENAME}(EMP1)

Reduction for Derived Horizontal Fragmentation (DHF)

- Distribute joins over unions.
- Apply the join reduction for horizontal fragmentation.

Example:
EMP1 = σ_{TITLE="Programmer"}(EMP)
EMP2 = σ_{TITLE≠"Programmer"}(EMP)
ASG1 = ASG ⋉_{ENO} EMP1
ASG2 = ASG ⋉_{ENO} EMP2

SELECT *
FROM EMP, ASG
WHERE ASG.ENO = EMP.ENO
AND EMP.TITLE = "Mech. Eng."

Generic query: (ASG1 ∪ ASG2) ⋈_{ENO} σ_{TITLE="Mech. Eng."}(EMP1 ∪ EMP2)

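Reduction for DHF (II)

Selection first (σ_{TITLE="Mech. Eng."}(EMP1) is empty):
(ASG1 ∪ ASG2) ⋈_{ENO} σ_{TITLE="Mech. Eng."}(EMP2)

Joins over union:
(ASG1 ⋈_{ENO} σ_{TITLE="Mech. Eng."}(EMP2)) ∪ (ASG2 ⋈_{ENO} σ_{TITLE="Mech. Eng."}(EMP2))

Eliminate the empty join of ASG1 with EMP2 (ASG1 only contains assignments of employees in EMP1):
ASG2 ⋈_{ENO} σ_{TITLE="Mech. Eng."}(EMP2)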

Reduction for Hybrid Fragmentation (HF)

- Remove empty relations generated by contradicting selections on horizontal fragments.
- Remove useless relations generated by projections on vertical fragments.
- Distribute joins over unions in order to isolate and remove useless joins.

Reduction for HF -- An Example

EMP1 = σ_{ENO ≤ "E4"}(Π_{ENO,ENAME}(EMP))
EMP2 = σ_{ENO > "E4"}(Π_{ENO,ENAME}(EMP))
EMP3 = Π_{ENO,TITLE}(EMP)

Query:

SELECT ENAME
FROM EMP
WHERE ENO = "E5"

Generic query: Π_{ENAME}(σ_{ENO="E5"}((EMP1 ∪ EMP2) ⋈_{ENO} EMP3))
Reduced query: Π_{ENAME}(σ_{ENO="E5"}(EMP2))

Why Optimization -- An Example Query

Database:
EMP(eno, ename, title)
ASG(eno, jno, resp, dur)

Query: find the names of the employees who are managing a project.

SQL:
SELECT ename
FROM EMP e, ASG g
WHERE e.eno = g.eno
AND resp = "manager"

RA tree: Π_{ename}(EMP ⋈_{EMP.eno=ASG.eno} σ_{resp="manager"}(ASG))

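Example - Strategies

Fragment schema:
EMP1 = σ_{ENO ≤ 100}(EMP) at site 1
EMP2 = σ_{ENO > 100}(EMP) at site 2
ASG1 = σ_{ENO ≤ 100}(ASG) at site 3
ASG2 = σ_{ENO > 100}(ASG) at site 4

Query site: site 5

Plan A:
- compute ASG1' = σ_{resp="manager"}(ASG1) at site 3 and ASG2' = σ_{resp="manager"}(ASG2) at site 4
- ship ASG1' to site 1 and ASG2' to site 2
- compute EMP1' = EMP1 ⋈_{ENO} ASG1' at site 1 and EMP2' = EMP2 ⋈_{ENO} ASG2' at site 2
- ship EMP1' and EMP2' to site 5 and compute EMP1' ∪ EMP2'

Plan B:
- ship EMP1, EMP2, ASG1 and ASG2 to site 5
- compute (EMP1 ∪ EMP2) ⋈_{ENO} σ_{resp="manager"}(ASG1 ∪ ASG2) at site 5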

Example -- DB Statistics & Costs

Database statistics
- EMP has 400 tuples and ASG has 1000 tuples
- there are 20 managers in ASG
- the data is uniformly distributed among sites
- ASG and EMP are locally clustered on attributes RESP and ENO, respectively

Costs
- tuple access: tacc = 1 unit
- tuple transfer: ttrans = 10 units

Costs for Example Plans

Cost of Plan A:
Produce ASG' = 20 * tacc = 20 (processing locally)
Transfer ASG' = 20 * ttrans = 200 (transfer to the EMP sites)
Produce EMP' = (10 + 10) * tacc * 2 = 40 (join at the EMP sites)
Transfer EMP' = 20 * ttrans = 200 (send to site 5)
Total cost = 460

Cost of Plan B:
Transfer EMP = 400 * ttrans = 4,000 (send EMP to site 5)
Transfer ASG = 1000 * ttrans = 10,000 (send ASG to site 5)
Produce ASG' = 1000 * tacc = 1,000 (selection at site 5)
Join EMP and ASG' = 400 * 20 * tacc = 8,000 (join at site 5)
Total cost = 23,000
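The two totals can be reproduced with a short script. It is only a restatement of the arithmetic in this example under the stated statistics (400 EMP tuples, 1000 ASG tuples, 20 managers, tacc = 1, ttrans = 10); the function and constant names are illustrative.

```python
T_ACC = 1      # cost of one tuple access
T_TRANS = 10   # cost of one tuple transfer
CARD_EMP = 400
CARD_ASG = 1000
MANAGERS = 20  # ASG tuples with resp = "manager"


def cost_plan_a() -> int:
    """Select at the ASG sites, join at the EMP sites, ship results to site 5."""
    per_site = MANAGERS // 2                  # managers are spread evenly over two sites
    produce_asg = MANAGERS * T_ACC            # clustered on RESP: only managers are read
    transfer_asg = MANAGERS * T_TRANS         # ship ASG' to the EMP sites
    produce_emp = (per_site + per_site) * T_ACC * 2   # join at both EMP sites
    transfer_emp = MANAGERS * T_TRANS         # ship EMP' to site 5
    return produce_asg + transfer_asg + produce_emp + transfer_emp


def cost_plan_b() -> int:
    """Ship everything to site 5, then select and join there."""
    transfer_emp = CARD_EMP * T_TRANS
    transfer_asg = CARD_ASG * T_TRANS
    produce_asg = CARD_ASG * T_ACC            # the clustering is lost after the transfer
    join = CARD_EMP * MANAGERS * T_ACC
    return transfer_emp + transfer_asg + produce_asg + join


print(cost_plan_a(), cost_plan_b())  # 460 23000
```

Plan B costs 50 times as much as Plan A, almost entirely because whole relations are shipped.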

Query Optimization

Problems in query optimization:
1. Determining the physical copies of the fragments upon which to execute the fragment query expressions (also known as materialization)
2. Selecting the order of execution of operations
3. Selecting the method for executing each operation

These problems are not independent; for instance, the choice of the best materialization for a query depends on the order in which operations are executed. They are nevertheless treated as independent. Further:
- we bypass (1) by taking materialization for granted
- we bypass (3) by clustering all operations at the same site and treating them as a problem that depends on the local database system

Query Optimization - Objectives

The choice among alternative query execution strategies is made based on predetermined objectives.

Two main objectives:

Minimize the total processing time (total cost)
- the network and the computers at the nodes are loaded less
- response time cannot be guaranteed

Minimize the response time
- allocation must facilitate parallel execution of the query
- throughput may decrease, and the total cost can be higher than under total-cost minimization

Total processing time (cost) is the sum of all the time (cost) incurred in executing the query (CPU, I/O, data transfer).

Response time is the elapsed time from the initiation to the completion of the query.

Optimization Algorithms -- The Issues

Cost model
- cost components
- weights for each component
- costs for primitive operations

Search space
- the set of equivalent algebra expressions (query trees)

Search strategies
- how do we move inside the search space?
- exhaustive search, heuristics, …

Cost Models

The cost measures are I/O and CPU costs for centralized DBMSs, and I/O, CPU and data transfer costs for DDBMSs.

Total cost = CPU cost + I/O cost + communication cost
CPU cost = Ccpu * #insts
I/O cost = Ci/o * #i/os
Communication cost = Cmsg * #msgs + Ctr * #bytes

Ccpu, Ci/o, Cmsg and Ctr are all assumed to be constants.

Response time = sum over sequential operations
= Ccpu * s_#insts + Ci/o * s_#i/os + Cmsg * s_#msgs + Ctr * s_#bytes

s_x stands for the maximum number of sequential x's that need to be executed to process the query.
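The difference between total cost and response time shows up as soon as two transfers can run in parallel. The sketch below considers only the communication component and uses hypothetical values for Cmsg and Ctr; the scenario (two sites each sending one message to a third site) is an assumption made for illustration.

```python
C_MSG = 5  # hypothetical fixed cost per message
C_TR = 1   # hypothetical cost per unit of data transferred


def transfer_cost(units: int) -> int:
    """Communication cost of one message carrying `units` of data."""
    return C_MSG + C_TR * units


def total_cost(transfers) -> int:
    """Total cost adds up every transfer, whether or not they overlap in time."""
    return sum(transfer_cost(u) for u in transfers)


def response_time(parallel_transfers) -> int:
    """With fully parallel transfers, only the longest one is sequential."""
    return max(transfer_cost(u) for u in parallel_transfers)


# Site 1 sends 100 units and site 2 sends 200 units to site 3, in parallel.
print(total_cost([100, 200]))     # 310
print(response_time([100, 200]))  # 205
```

Minimizing one measure does not minimize the other: adding parallel transfers can lower the response time while raising the total cost.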

Intermediate Result Size

The sizes of the intermediate relations produced during execution guide the selection of the execution strategy. In particular, they help in selecting a strategy that reduces data transfer.

The sizes of intermediate relations need to be estimated based on the cardinalities of relations and the lengths of attributes.

For a relation R{A1, A2, ..., An} fragmented as R1, R2, …, Rn, the statistical data typically collected are:
- len(Ai): length of attribute Ai in bytes
- min(Ai) and max(Ai) for ordered domains
- card(dom(Ai)): number of unique values in dom(Ai)
- card(Rj): number of tuples in each fragment Rj

Intermediate Size Estimation

Join selectivity factor
SFJ(R, S) = card(R ⋈ S) / (card(R) * card(S))

Selection selectivity factor
SFS(F) = card(σ_{F}(R)) / card(R)

Size of a relation
size(R) = card(R) * len(R)

Cardinality of intermediate relations: selection
SFS(A = value) = 1 / card(dom(A))
SFS(A > value) = (max(A) - value) / (max(A) - min(A))
SFS(A < value) = (value - min(A)) / (max(A) - min(A))
SFS(p(Ai) ∧ p(Aj)) = SFS(p(Ai)) * SFS(p(Aj))
SFS(p(Ai) ∨ p(Aj)) = SFS(p(Ai)) + SFS(p(Aj)) - SFS(p(Ai)) * SFS(p(Aj))
SFS(A ∈ {values}) = SFS(A = value) * card({values})
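These selectivity formulas translate directly into small helper functions. The sketch below assumes uniformly distributed, independent attribute values, which is what the formulas themselves assume; the function names and the DUR statistics in the usage line are hypothetical.

```python
def sf_eq(card_dom: int) -> float:
    """SFS(A = value) = 1 / card(dom(A))."""
    return 1 / card_dom


def sf_gt(value: float, min_a: float, max_a: float) -> float:
    """SFS(A > value) = (max(A) - value) / (max(A) - min(A))."""
    return (max_a - value) / (max_a - min_a)


def sf_lt(value: float, min_a: float, max_a: float) -> float:
    """SFS(A < value) = (value - min(A)) / (max(A) - min(A))."""
    return (value - min_a) / (max_a - min_a)


def sf_and(sf_i: float, sf_j: float) -> float:
    """Conjunction of two predicates, assuming independence."""
    return sf_i * sf_j


def sf_or(sf_i: float, sf_j: float) -> float:
    """Disjunction of two predicates, assuming independence."""
    return sf_i + sf_j - sf_i * sf_j


def sf_in(card_dom: int, n_values: int) -> float:
    """SFS(A IN {values}) = SFS(A = value) * card({values})."""
    return sf_eq(card_dom) * n_values


def card_selection(card_r: int, sf: float) -> float:
    """Estimated cardinality of a selection: SFS(F) * card(R)."""
    return sf * card_r


# Hypothetical statistics: DUR ranges over [0, 48] and ASG has 1000 tuples.
print(card_selection(1000, sf_gt(36, 0, 48)))  # 250.0
```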

Intermediate Size Estimation (II)

Projection (without duplicate elimination)
card(Π_{A}(R)) = card(R)

Cartesian product
card(R × S) = card(R) * card(S)

Join
card(R ⋈_{A=B} S) = card(S), if A is a key of R and B is a foreign key of S
card(R ⋈_{A=B} S) = SFJ(R, S) * card(R) * card(S), in general

Union
Upper bound: card(R) + card(S)
Lower bound: max{card(R), card(S)}

Cost of Processing Primitive Operations

- Selection
- Projection
- Union
- Join: nested-loops, sort-merge, hash-based

For distributed joins, the semi-join has been proposed as a way to perform the join.

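Semi-join

To compute R ⋈_{A} S, with R and S stored at different sites:

R' = Π_{A}(R), computed at the site of R and sent to the site of S
S' = R' ⋈_{A} S (that is, S ⋉_{A} R), computed at the site of S and sent to the site of R
R ⋈_{A} S = R ⋈_{A} S', computed at the site of R

Amount of data transferred: |R'| + |S'|

1. The join is replaced by a projection, followed by a semi-join, and then a join.
2. The projection and the join are done at one site, and the semi-join at the other site.
3. Amount of data transferred: |R'| + |S'|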
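The three steps can be traced on two small in-memory relations. This is a single-process sketch, so "shipping" is only implied by the comments; the relations, their contents and the helper names are made up for illustration.

```python
def project(relation, attr):
    """R' = projection of R on the join attribute (duplicates removed)."""
    return {row[attr] for row in relation}


def semijoin(relation, attr, keys):
    """S' = the tuples of S whose join attribute appears in R'."""
    return [row for row in relation if row[attr] in keys]


def join(r, s, attr):
    """Plain nested-loop equi-join on `attr`."""
    return [{**x, **y} for x in r for y in s if x[attr] == y[attr]]


R = [{"A": 1, "x": "r1"}, {"A": 2, "x": "r2"}, {"A": 3, "x": "r3"}]
S = [{"A": a, "y": f"s{a}"} for a in range(2, 10)]  # 8 tuples, A = 2..9

r_prime = project(R, "A")            # computed at R's site, shipped to S's site
s_prime = semijoin(S, "A", r_prime)  # computed at S's site, shipped to R's site
result = join(R, s_prime, "A")       # computed at R's site

print(len(S), len(s_prime))          # 8 2
print(result == join(R, S, "A"))     # True
```

Only two of the eight S tuples travel back to R's site, at the price of one extra message carrying R'.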

Semi-join versus Join

- Using a semi-join increases local processing costs, because a relation must be scanned twice (once for the projection and once for the join).
- Indices on the base relations cannot be exploited when joining the intermediate relations produced by a semi-join.
- A semi-join may not be worthwhile when communication costs are low.

Search Space

The search space is characterized by the alternative execution plans.

Most optimizers focus on join trees.

For N relations, there are O(N!) equivalent join trees.

SELECT ENAME, RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO

[Figure: three equivalent join trees for this query:
(EMP ⋈_{ENO} ASG) ⋈_{PNO} PROJ
(PROJ ⋈_{PNO} ASG) ⋈_{ENO} EMP
(EMP × PROJ) ⋈_{PNO,ENO} ASG]

Restricting Search Space

O(N!) is large, and the search space is even bigger once the choice of join methods is considered.

Restrict by means of heuristics
- ignore Cartesian products
- …

Restrict the shape of the join tree
- only consider deep trees
- …

[Figure: two join trees over R1, R2, R3 and R4. Left-deep tree: ((R1 ⋈ R2) ⋈ R3) ⋈ R4. Bushy tree: (R1 ⋈ R2) ⋈ (R3 ⋈ R4).]

Search Strategy

How to move in the search space to find the optimal plan:

Deterministic
- start from the base relations and build plans by adding one relation at each step
- dynamic programming: breadth-first
- greedy: depth-first

Randomized
- search for the optimal plan around a particular starting point
- simulated annealing
- iterative improvement
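A deterministic, dynamic-programming search over left-deep join orders can be sketched as below. Everything numeric here is an assumption made for illustration: the cardinalities, the join selectivity factors, and the toy cost model (the cost of a plan is the sum of its estimated intermediate result sizes). It is not the cost model of any particular system.

```python
from itertools import combinations

# Hypothetical statistics for illustration only.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 100}
# Join selectivity factors; a pair that is not listed is a Cartesian product.
SF = {frozenset({"EMP", "ASG"}): 1 / 400, frozenset({"ASG", "PROJ"}): 1 / 100}


def result_card(relations) -> float:
    """Estimated cardinality of joining a set of relations."""
    card = 1.0
    for r in relations:
        card *= CARD[r]
    for a, b in combinations(sorted(relations), 2):
        card *= SF.get(frozenset({a, b}), 1.0)
    return card


def plan_cost(order) -> float:
    """Toy cost of a left-deep plan: sum of its intermediate result sizes."""
    return sum(result_card(order[:k]) for k in range(2, len(order) + 1))


def best_left_deep(relations):
    """Build the best plan for every subset, smallest subsets first."""
    best = {frozenset({r}): (0.0, (r,)) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in combinations(relations, size):
            s = frozenset(subset)
            candidates = []
            for last in subset:                    # relation added at this step
                cost, order = best[s - {last}]
                candidates.append((cost + result_card(s), order + (last,)))
            best[s] = min(candidates)
    return best[frozenset(relations)]


cost, order = best_left_deep(["EMP", "ASG", "PROJ"])
print(order, round(cost))
```

Because the best plan of each subset is kept and reused, the search avoids re-costing every permutation; the plan it returns here avoids the EMP × PROJ Cartesian product.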

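Search Strategies -- Example

[Figure: deterministic search builds a plan one relation at a time: R1 ⋈ R2, then (R1 ⋈ R2) ⋈ R3, then ((R1 ⋈ R2) ⋈ R3) ⋈ R4. Randomized search transforms one complete plan into a neighbouring one, for example ((R1 ⋈ R2) ⋈ R3) ⋈ R4 into ((R1 ⋈ R3) ⋈ R2) ⋈ R4.]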

Distributed Query Optimization Algorithms

- System R and R*
- Hill Climbing and SDD-1

System R (Centralized) Algorithm

- Simple (one-relation) queries are executed according to the best access path.
- Joins are executed as follows:
  - determine the possible orderings of joins
  - determine the cost of each ordering
  - choose the join ordering with the minimal cost
- For joins, two join methods are considered: nested loops and merge join.

System R Algorithm -- Example

Query: names of employees working on the CAD/CAM project

Assume
- EMP has an index on ENO
- ASG has an index on PNO
- PROJ has an index on PNO and an index on PNAME

System R Algorithm -- Example

Choose the best access path to each relation
- EMP: sequential scan (no selection on EMP)
- ASG: sequential scan (no selection on ASG)
- PROJ: index on PNAME (there is a selection on PROJ based on PNAME)

Determine the best join ordering
- EMP ⋈ ASG ⋈ PROJ
- ASG ⋈ PROJ ⋈ EMP
- PROJ ⋈ ASG ⋈ EMP
- ASG ⋈ EMP ⋈ PROJ
- EMP × PROJ ⋈ ASG
- PROJ × EMP ⋈ ASG

Select the best ordering based on the join costs evaluated according to the two methods.

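System R Example (cont'd)

[Figure: tree of alternative join orders. First level: the candidate pairs EMP ⋈ ASG, ASG ⋈ EMP, ASG ⋈ PROJ and PROJ ⋈ ASG, plus the Cartesian products involving EMP and PROJ. Second level: (ASG ⋈ EMP) ⋈ PROJ and (PROJ ⋈ ASG) ⋈ EMP.]

The best total join order is one of
(ASG ⋈ EMP) ⋈ PROJ
(PROJ ⋈ ASG) ⋈ EMP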

System R Algorithm

(PROJ ⋈ ASG) ⋈ EMP has a useful index on the select attribute and direct access to the join attributes of ASG and EMP.

Final plan:
1. select PROJ using the index on PNAME
2. then join with ASG using the index on PNO
3. then join with EMP using the index on ENO

System R* Distributed Query Optimization

Total-cost minimization: the cost function includes local processing as well as transmission.

Algorithm
- For each relation in the query tree, find the best access path.
- For the join of n relations, find the optimal join order strategy.
- Each local site optimizes the local query processing.

Data Transfer Strategies

Ship-whole: the entire relation is shipped and stored as a temporary relation. When the merge join algorithm is used, the join can be done in pipeline mode.

Fetch-as-needed: tuples are fetched only as they are needed. This method is equivalent to a semijoin of the inner relation with the outer relation tuple.

Join Strategies

Consider the join of an outer (external) relation R with an inner (internal) relation S. Let LC be the local processing cost, CC the data transfer cost, and s the average number of tuples of S that match one tuple of R.

Strategy 1. Ship the entire outer relation to the site of the inner relation:
TC = LC(get R) + CC(size(R)) + LC(get s tuples from S) * card(R)

Strategy 2. Ship the entire inner relation to the site of the outer relation:
TC = LC(get S) + CC(size(S)) + LC(store S) + LC(get R) + LC(get s tuples from S) * card(R)

Strategy 3. Fetch tuples of the inner relation as needed for each tuple of the outer relation (A is the join attribute):
TC = LC(get R) + CC(len(A)) * card(R) + LC(get s tuples from S) * card(R) + CC(s * len(S)) * card(R)

Strategy 4. Move both relations to a third site and join there:
TC = LC(get R) + LC(get S) + CC(size(S)) + LC(store S) + CC(size(R)) + LC(get s tuples from S) * card(R)

Conceptually, the algorithm does an exhaustive search among all alternatives and selects the one that minimizes total cost.
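To see how the four total-cost formulas compare, they can be evaluated side by side. The sketch below is an illustration under strong assumptions: LC is taken to be linear in the number of tuples read or stored, CC is a fixed message cost plus a per-byte cost, and all constants and relation statistics are hypothetical.

```python
T_ACC = 1.0  # hypothetical local cost per tuple read or stored
C_MSG = 5.0  # hypothetical fixed cost per message
C_TR = 1.0   # hypothetical cost per byte transferred


def lc(tuples: float) -> float:
    """Assumed local processing cost: linear in the number of tuples."""
    return T_ACC * tuples


def cc(nbytes: float) -> float:
    """Assumed communication cost of one message of `nbytes` bytes."""
    return C_MSG + C_TR * nbytes


def strategy_costs(card_r, len_r, card_s, len_s, len_a, s):
    """Total cost of the four join strategies for outer R and inner S."""
    size_r, size_s = card_r * len_r, card_s * len_s
    probe = lc(s) * card_r  # LC(get s tuples from S) * card(R)
    return {
        "1 ship outer": lc(card_r) + cc(size_r) + probe,
        "2 ship inner": lc(card_s) + cc(size_s) + lc(card_s) + lc(card_r) + probe,
        "3 fetch as needed": lc(card_r) + cc(len_a) * card_r + probe
        + cc(s * len_s) * card_r,
        "4 third site": lc(card_r) + lc(card_s) + cc(size_s) + lc(card_s)
        + cc(size_r) + probe,
    }


costs = strategy_costs(card_r=100, len_r=10, card_s=1000, len_s=20, len_a=4, s=2)
print(min(costs, key=costs.get), costs)
```

With a small outer relation and a large inner relation, shipping the outer relation wins; other statistics can favour a different strategy, which is why the optimizer evaluates all of them.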

Hill Climbing Algorithm - Algorithm

Inputs
- query graph, locations of relations, and relation statistics

Initial solution
- the least costly among all the strategies in which the relations are sent to a candidate result site; it is denoted ES0, and its result site is the chosen site

Split ES0 into
- ES1: ship one relation of a join to the site of the other relation
- ES2: these two relations are joined locally and the result is transmitted to the chosen site

If cost(ES1) + cost(ES2) + LC > cost(ES0), select ES0; otherwise select ES1 and ES2.

The process can be applied recursively to ES1 and ES2 until no more benefit occurs.

Hill Climbing Algorithm - Example

Query: Π_{SAL}(PAY ⋈_{TITLE} (EMP ⋈_{ENO} (ASG ⋈_{PNO} σ_{PNAME="CAD/CAM"}(PROJ))))

Relation  Size  Site
EMP       8     1
PAY       4     2
PROJ      1     3
ASG       10    4

Assumptions: ignore the local processing cost; the length of tuples is 1 for all relations.

Initial solution ES0: ship EMP (8) from site 1, PAY (4) from site 2 and PROJ (1) from site 3 to site 4, where ASG (10) is stored.
Cost = 8 + 4 + 1 = 13

HCA - Example

[Figure: alternative strategies ES1, ES2 and ES3 over Site 1 (EMP, 8), Site 2 (PAY, 4), Site 3 (PROJ, 1) and Site 4 (ASG, 10), obtained by splitting ES0 around the join on TITLE, compared with ES0 (cost = 13).]

Solution 1: Cost =
Solution 2: Cost =

ES0 is the "BEST".

Hill Climbing Algorithm - Comments

Greedy algorithm: it determines an initial feasible solution and iteratively tries to improve it.

If there are local minima, it may not find the global minimum.

If the optimal solution has a high initial cost, it will not be found, since it will not be chosen as the initial feasible solution.

[Figure: an alternative schedule over Site 1 (EMP, 8), Site 2 (PAY, 4), Site 3 (PROJ, 1) and Site 4 (ASG, 10).]

COST =

SDD-1 Algorithm

The SDD-1 algorithm generalizes the hill-climbing algorithm to determine an ordering of beneficial semijoins. It uses statistics on the database, called database profiles.

Cost of a semijoin:
Cost(R SJ_A S) = CMSG + CTR * size(Π_{A}(S))

The benefit is the cost of transferring the irrelevant tuples of R that the semijoin eliminates:
Benefit(R SJ_A S) = (1 - SFSJ(S.A)) * size(R) * CTR

A semijoin is beneficial if cost < benefit.

SDD-1: The Algorithm

1. Initialization: generate all beneficial semijoins and an execution strategy that includes only local processing.
2. Select the most beneficial semijoin, then update the statistics and select new beneficial semijoins.
3. Repeat step 2 until no beneficial semijoins are left.
4. Assembly site selection: choose the site at which the local operations are performed.
5. Postoptimization: remove unnecessary semijoins.

SDD1 - Example

SELECT *
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO

[Figure: join graph EMP (site 1) - ASG (site 2) - PROJ (site 3), with edges labelled ENO and PNO.]

Relation  Card  Tup_Len  Rel_size
EMP       30    50       1500
ASG       100   30       3000
PROJ      50    40       2000

Attribute  SFSJ  Size(PJ(attr))
EMP.ENO    0.3   120
ASG.ENO    0.8   400
ASG.PNO    1.0   400
PROJ.PNO   0.4   200

SDD1 - First Iteration

SJ1: ASG SJ EMP, benefit = (1 - 0.3) * 3000 = 2100; cost = 120
SJ2: ASG SJ PROJ, benefit = (1 - 0.4) * 3000 = 1800; cost = 200
SJ3: EMP SJ ASG, benefit = (1 - 0.8) * 1500 = 300; cost = 400
SJ4: PROJ SJ ASG, benefit = 0; cost = 400

SJ1 is selected.
- ASG' = ASG SJ EMP
- The size of ASG is reduced to 3000 * 0.3 = 900.
- The semijoin selectivity factor is reduced; it is approximated by SFSJ(ASG.ENO) = 0.8 * 0.3 = 0.24.

SDD-1 - Second & Third Iterations

Second iteration
SJ2: ASG' SJ PROJ, benefit = (1 - 0.4) * 900 = 540; cost = 200
SJ3: EMP SJ ASG', benefit = (1 - 0.24) * 1500 = 1140; cost = 400
SJ3 is selected: EMP' = EMP SJ ASG', size(EMP') = 1500 * 0.24 = 360

Third iteration
SJ2: ASG' SJ PROJ, benefit = (1 - 0.4) * 900 = 540; cost = 200
SJ2 is selected; it reduces the size of ASG' further, to 900 * 0.4 = 360
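The benefit and cost figures of the first iteration can be recomputed from the database profile. The sketch assumes CTR = 1 and CMSG = 0, which is what the numbers in this example imply, and it ranks beneficial semijoins by benefit minus cost, which is one reasonable reading of "most beneficial". The data structures and function names are illustrative.

```python
C_TR = 1   # assumed: the example's numbers correspond to CTR = 1
C_MSG = 0  # assumed: the example ignores the fixed message cost

SIZE = {"EMP": 1500, "ASG": 3000, "PROJ": 2000}
# (relation, attribute) -> (semijoin selectivity factor, size of the projection)
PROFILE = {
    ("EMP", "ENO"): (0.3, 120),
    ("ASG", "ENO"): (0.8, 400),
    ("ASG", "PNO"): (1.0, 400),
    ("PROJ", "PNO"): (0.4, 200),
}
# (reduced relation, reducing relation, join attribute): SJ1, SJ2, SJ3, SJ4
CANDIDATES = [
    ("ASG", "EMP", "ENO"),
    ("ASG", "PROJ", "PNO"),
    ("EMP", "ASG", "ENO"),
    ("PROJ", "ASG", "PNO"),
]


def semijoin_stats(reduced, reducer, attr):
    """Benefit and cost of `reduced SJ reducer` on attribute `attr`."""
    sf, proj_size = PROFILE[(reducer, attr)]
    benefit = (1 - sf) * SIZE[reduced] * C_TR
    cost = C_MSG + C_TR * proj_size
    return benefit, cost


def most_beneficial(candidates):
    """Among semijoins with cost < benefit, pick the largest net gain."""
    gains = []
    for cand in candidates:
        benefit, cost = semijoin_stats(*cand)
        if cost < benefit:
            gains.append((benefit - cost, cand))
    return max(gains)[1] if gains else None


print(most_beneficial(CANDIDATES))  # ('ASG', 'EMP', 'ENO')
```

SJ3 and SJ4 are not beneficial in the first iteration because their cost exceeds their benefit; SJ3 only becomes beneficial after SJ1 has lowered the selectivity factor of ASG.ENO.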

Local Optimization

Each site optimizes the plan to be executed at that site. This is a centralized query optimization problem.

SDD-1 - Assembly Site Selection

After reduction:
- EMP' is at site 1 with size 360
- ASG' is at site 2 with size 360
- PROJ is at site 3 with size 2000

Site 3 is chosen as the assembly site, since it holds the largest amount of data.

No semijoins are removed in postoptimization.

Final strategy:
- (ASG SJ EMP) SJ PROJ is sent to site 3
- (EMP SJ ASG') is sent to site 3
- the join is performed at site 3
