Query Processor
A query processor is the module of the DBMS that processes and optimizes a high-level query and generates an execution strategy for it.
For a DDBMS, the query processor (QP) also performs data localization for the query, based on the fragmentation scheme, and generates an execution strategy that incorporates the communication operations involved in processing the query.
Query Optimizer
Queries expressed in SQL can have multiple equivalent relational algebra query expressions
The distributed query optimizer must select the ordering of relational algebra operations, sites to process data, and possibly the way data should be transferred. This makes distributed query processing significantly more difficult
Complexity of Relational Algebra Operations
Relational algebra is used to express the output of the query. The complexity of relational algebra operations plays a role in defining some of the principles of query optimization. All complexity measures are based on the cardinality n of the relations.
Operation                                          Complexity
Select, Project (without duplicate elimination)    O(n)
Project (with duplicate elimination), Group        O(n log n)
Join, Semi-join, Division, Set operators           O(n log n)
Cartesian product                                  O(n^2)
Characteristics of Query Processors
Languages
Input language can be relational algebra or calculus; output language is relational algebra (annotated with communication primitives). The query processor must efficiently map the input language to the output language.
Types of Optimization
The output language specification represents the execution strategy. There can be many such strategies; the best one can be selected through exhaustive search or by applying heuristics (for example, minimizing the size of intermediate relations). For distributed databases, semijoins can be applied to reduce data transfer.
When to Optimize
Static: done before executing the query (at compilation time). The cost of optimization is amortized over multiple executions, and the search is mostly exhaustive. Because the sizes of intermediate relations must be estimated, it can result in sub-optimal strategies.
Dynamic: done at run time, every time the query is executed. It can make use of the exact sizes of intermediate relations, but it is expensive and is based on heuristics.
Hybrid: mixes static and dynamic approaches. The approach is mainly static, but dynamic query optimization may take place when a large difference between predicted and actual sizes is detected.
Characteristics of Query Processors
Statistics
- fragment cardinality and size
- size and number of distinct values for each attribute
- detailed histograms of attribute values for better selectivity estimation
Decision Sites
- one site or several sites participate in the selection of the strategy
Exploitation of network topology
- wide area network: communication cost
- local area network: parallel execution
Exploitation of replicated fragments
- larger number of possible strategies
Use of semijoins
- reduce the size of data transfer
- increase the number of messages and local processing
- good for fast or slow networks?
Layers of Query Processing
[Figure: layers of query processing]
Input: calculus query on distributed relations
Control site
1. Query decomposition (uses the global schema); output: algebra query on distributed relations
2. Data localization (uses the fragment schema); output: fragment query
3. Global optimization (uses statistics on fragments); output: optimized fragment query with communication operations
Local sites
4. Local optimization (uses the local schema); output: optimized local queries
Query Decomposition
Normalization: the calculus query is written in a normalized form (CNF or DNF) for subsequent manipulation.
Analysis: the query is analyzed for semantic correctness.
Simplification: redundant predicates are eliminated to obtain simplified queries.
Restructuring: the calculus query is translated into an optimal algebraic query representation.
Query Decomposition: Normalization
Lexical and syntactic analysis
- check validity
- check for attributes and relations
- type checking on the qualification
There are two possible forms of representing the predicates in the query qualification: Conjunctive Normal Form (CNF) or Disjunctive Normal Form (DNF).
CNF: $(p_{11} \lor p_{12} \lor \dots \lor p_{1n}) \land \dots \land (p_{m1} \lor p_{m2} \lor \dots \lor p_{mn})$
DNF: $(p_{11} \land p_{12} \land \dots \land p_{1n}) \lor \dots \lor (p_{m1} \land p_{m2} \land \dots \land p_{mn})$
ORs are mapped into unions; ANDs are mapped into joins or selections.
Query Decomposition: Analysis
Queries are rejected because
- the attributes or relations are not defined in the global schema; or
- operations used in qualifiers are semantically incorrect.
Semantic correctness can be determined by using a query graph only for queries that do not use disjunction or negation.
One node of the query graph represents the result; the others represent operand relations. An edge between two operand nodes represents a join, and an edge between an operand node and the result node represents a projection.
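Query Graph and Join Graph
SELECT Ename, Resp FROM E, G, J WHERE E.ENO = G.ENO AND G.JNO = J.JNO AND JNAME = "CAD" AND DUR >= 36 AND Title = "Prog"
[Figure: query graph. Nodes: EMP (Title = "Prog"), ASG (DUR >= 36), PROJ (JNAME = "CAD") and Result. Join edges: EMP to ASG (E.ENO = G.ENO) and ASG to PROJ (G.JNO = J.JNO). Projection edges: EMP to Result (Ename) and ASG to Result (Resp).]
[Figure: join graph. Nodes: EMP, ASG and PROJ. Edges: EMP to ASG (E.ENO = G.ENO) and ASG to PROJ (G.JNO = J.JNO).]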
Disconnected Query Graph
A semantically incorrect conjunctive multivariable query without negation has a query graph that is not connected.
SELECT Ename, Resp FROM E, G, J WHERE E.ENO = G.ENO AND JNAME = "CAD" AND DUR >= 36 AND Title = "Prog"
[Figure: disconnected query graph. The join edge EMP to ASG (E.ENO = G.ENO) and the projection edges to Result (Ename, Resp) are present, but PROJ (JNAME = "CAD") is not connected to any other node.]
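The connectivity test can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `is_connected` and the edge-list encoding of the query graph are assumptions made for the example.

```python
from collections import defaultdict

def is_connected(nodes, edges):
    """Return True if every node is reachable from the first one."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = set(), [nodes[0]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return seen == set(nodes)

# Hypothetical encoding of the two query graphs from the slides.
NODES = ["EMP", "ASG", "PROJ", "Result"]
PROJECTIONS = [("EMP", "Result"), ("ASG", "Result")]          # Ename, Resp
CORRECT_QUERY = [("EMP", "ASG"), ("ASG", "PROJ")] + PROJECTIONS
INCORRECT_QUERY = [("EMP", "ASG")] + PROJECTIONS              # G.JNO = J.JNO missing
```

With this encoding, the first query yields a connected graph, while dropping the predicate G.JNO = J.JNO leaves PROJ isolated, so the second query would be rejected.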
Simplification: Eliminating Redundancy
Redundant predicates are eliminated using well-known idempotency rules:
$p \land p = p$;  $p \lor p = p$;
$p \lor \text{true} = \text{true}$;  $p \lor \text{false} = p$;
$p \land \text{true} = p$;  $p \land \text{false} = \text{false}$;
$p_1 \land (p_1 \lor p_2) = p_1$;  $p_1 \lor (p_1 \land p_2) = p_1$
Such redundant predicates arise when the user query is enriched with several predicates to incorporate view-relation correspondence and to ensure semantic integrity and security.
Eliminating Redundancy -- An Example
SELECT TITLE FROM E WHERE (NOT (TITLE = "Programmer") AND (TITLE = "Programmer" OR TITLE = "Elec. Engr") AND NOT (TITLE = "Elec. Engr")) OR ENAME = "J.Doe";
simplifies to
SELECT TITLE FROM E WHERE ENAME = "J.Doe";
Let p1 = <TITLE = "Programmer">, p2 = <TITLE = "Elec. Engr"> and p3 = <ENAME = "J.Doe">.
The query qualification is $(\lnot p_1 \land (p_1 \lor p_2) \land \lnot p_2) \lor p_3$.
Its disjunctive normal form is
$(\lnot p_1 \land p_1 \land \lnot p_2) \lor (\lnot p_1 \land p_2 \land \lnot p_2) \lor p_3$
$= (\text{false} \land \lnot p_2) \lor (\lnot p_1 \land \text{false}) \lor p_3$
$= \text{false} \lor \text{false} \lor p_3$
$= p_3$
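The simplification can be double-checked by brute force over all truth assignments. The sketch below is an illustration added here, with hypothetical function names; it treats p1, p2 and p3 as independent Boolean variables.

```python
from itertools import product

def original(p1: bool, p2: bool, p3: bool) -> bool:
    # (NOT p1 AND (p1 OR p2) AND NOT p2) OR p3
    return ((not p1) and (p1 or p2) and (not p2)) or p3

def simplified(p1: bool, p2: bool, p3: bool) -> bool:
    # The qualification after eliminating redundancy.
    return p3

def equivalent() -> bool:
    """Compare both qualifications on all 8 truth assignments."""
    return all(
        original(*values) == simplified(*values)
        for values in product([False, True], repeat=3)
    )
```

Calling `equivalent()` confirms that the two qualifications agree on every assignment, which is what the algebraic derivation shows.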
Query Decomposition: Rewriting
Rewriting the calculus query in relational algebra:
- a straightforward transformation from relational calculus to relational algebra, and
- restructuring of the relational algebra expression to improve performance
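Rewriting -- Transformation Rules (I)
Commutativity of binary operations:
$R \times S \Leftrightarrow S \times R$
$R \bowtie S \Leftrightarrow S \bowtie R$
$R \cup S \Leftrightarrow S \cup R$
Associativity of binary operations:
$(R \times S) \times T \Leftrightarrow R \times (S \times T)$
$(R \bowtie S) \bowtie T \Leftrightarrow R \bowtie (S \bowtie T)$
Idempotence of unary operations (grouping of projections and selections):
$\Pi_{A'}(\Pi_{A''}(R)) \Leftrightarrow \Pi_{A'}(R)$ for $A' \subseteq A''$
$\sigma_{p_1(A_1)}(\sigma_{p_2(A_2)}(R)) \Leftrightarrow \sigma_{p_1(A_1) \land p_2(A_2)}(R)$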
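Rewriting -- Transformation Rules (II)
Commuting selection with projection:
$\Pi_{A_1,\dots,A_n}(\sigma_{p(A_p)}(R)) \Leftrightarrow \Pi_{A_1,\dots,A_n}(\sigma_{p(A_p)}(\Pi_{A_1,\dots,A_n,A_p}(R)))$
Commuting selection with binary operations:
$\sigma_{p(A_i)}(R \times S) \Leftrightarrow (\sigma_{p(A_i)}(R)) \times S$
$\sigma_{p(A_i)}(R \bowtie S) \Leftrightarrow (\sigma_{p(A_i)}(R)) \bowtie S$
$\sigma_{p(A_i)}(R \cup S) \Leftrightarrow \sigma_{p(A_i)}(R) \cup \sigma_{p(A_i)}(S)$
Commuting projection with binary operations:
$\Pi_C(R \times S) \Leftrightarrow \Pi_A(R) \times \Pi_B(S)$, where $C = A \cup B$
$\Pi_C(R \bowtie S) \Leftrightarrow \Pi_C(R) \bowtie \Pi_C(S)$
$\Pi_C(R \cup S) \Leftrightarrow \Pi_C(R) \cup \Pi_C(S)$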
An SQL Query and Its Query Tree
SELECT Ename FROM J, G, E WHERE G.ENO = E.ENO AND G.JNO = J.JNO AND ENAME <> "J.Doe" AND JNAME = "CAD/CAM" AND (DUR = 12 OR DUR = 24)
[Figure: query tree, written here as an expression]
$\Pi_{ENAME}(\sigma_{ENAME \neq \text{"J.Doe"} \land JNAME = \text{"CAD/CAM"} \land (DUR = 12 \lor DUR = 24)}(PROJ \bowtie_{JNO} (ASG \bowtie_{ENO} EMP)))$
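Query Decomposition: Rewriting
[Figure: restructured query tree with selections and projections pushed toward the leaves, written here as an expression]
$\Pi_{ENAME}(\Pi_{JNO}(\sigma_{JNAME = \text{"CAD/CAM"}}(PROJ)) \bowtie_{JNO} \Pi_{JNO,ENAME}(\Pi_{JNO,ENO}(\sigma_{DUR = 12 \lor DUR = 24}(ASG)) \bowtie_{ENO} \Pi_{ENO,ENAME}(\sigma_{ENAME \neq \text{"J.Doe"}}(EMP))))$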
Data Localization
Input: algebraic query on distributed relations
Determine which fragments are involved
Localization program
- substitute for each global query its materialization program
- optimize
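Data Localization -- An Example
EMP is fragmented into
$EMP_1 = \sigma_{ENO \le \text{"E3"}}(EMP)$
$EMP_2 = \sigma_{\text{"E3"} < ENO \le \text{"E6"}}(EMP)$
$EMP_3 = \sigma_{ENO > \text{"E6"}}(EMP)$
ASG is fragmented into
$ASG_1 = \sigma_{ENO \le \text{"E3"}}(ASG)$
$ASG_2 = \sigma_{ENO > \text{"E3"}}(ASG)$
[Figure: localized query tree. EMP is replaced by $EMP_1 \cup EMP_2 \cup EMP_3$ and ASG by $ASG_1 \cup ASG_2$; the joins on ENO and JNO, the selections on DUR, JNAME and ENAME, and the projection on ENAME are unchanged.]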
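Reduction with Selection
EMP is fragmented into
$EMP_1 = \sigma_{ENO \le \text{"E3"}}(EMP)$
$EMP_2 = \sigma_{\text{"E3"} < ENO \le \text{"E6"}}(EMP)$
$EMP_3 = \sigma_{ENO > \text{"E6"}}(EMP)$
SELECT * FROM EMP WHERE ENO = "E5"
[Figure: the localized query $\sigma_{ENO = \text{"E5"}}(EMP_1 \cup EMP_2 \cup EMP_3)$ reduces to $\sigma_{ENO = \text{"E5"}}(EMP_2)$.]
Rule: given a relation $R$ and $F_R = \{R_1, R_2, \dots, R_n\}$ where $R_j = \sigma_{p_j}(R)$,
$\sigma_{p_i}(R_j) = \emptyset$ if $\forall x \in R: \lnot(p_i(x) \land p_j(x))$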
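For range-based fragmentation, the rule amounts to checking whether the selection value falls inside each fragment's range. The sketch below is an illustration only: the interval encoding `(low, high]` and the names `EMP_FRAGMENTS`, `may_contain` and `reduce_selection` are assumptions introduced here, and employee numbers are compared as strings.

```python
# Hypothetical encoding of the fragmentation predicates on ENO:
# a fragment holds the tuples with low < ENO <= high (None = unbounded).
EMP_FRAGMENTS = {
    "EMP1": (None, "E3"),
    "EMP2": ("E3", "E6"),
    "EMP3": ("E6", None),
}

def may_contain(bounds, value):
    """True if a tuple with ENO = value can belong to the fragment."""
    low, high = bounds
    return (low is None or value > low) and (high is None or value <= high)

def reduce_selection(fragments, value):
    """Keep only the fragments whose predicate does not contradict ENO = value."""
    return [name for name, bounds in fragments.items() if may_contain(bounds, value)]
```

For the query above, `reduce_selection(EMP_FRAGMENTS, "E5")` keeps only EMP2, matching the reduced query in the figure.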
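Reduction with Join
EMP is fragmented into
$EMP_1 = \sigma_{ENO \le \text{"E3"}}(EMP)$
$EMP_2 = \sigma_{\text{"E3"} < ENO \le \text{"E6"}}(EMP)$
$EMP_3 = \sigma_{ENO > \text{"E6"}}(EMP)$
ASG is fragmented into
$ASG_1 = \sigma_{ENO \le \text{"E3"}}(ASG)$
$ASG_2 = \sigma_{ENO > \text{"E3"}}(ASG)$
SELECT * FROM EMP, ASG WHERE EMP.ENO = ASG.ENO
[Figure: the generic query $ASG \bowtie_{ENO} EMP$ and its localized form $(ASG_1 \cup ASG_2) \bowtie_{ENO} (EMP_1 \cup EMP_2 \cup EMP_3)$.]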
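Reduction with Join (I)
Distribute the join over the union: $(R_1 \cup R_2) \bowtie S \Leftrightarrow (R_1 \bowtie S) \cup (R_2 \bowtie S)$
[Figure: the localized query becomes the union of six joins on ENO: $ASG_1 \bowtie EMP_1$, $ASG_1 \bowtie EMP_2$, $ASG_2 \bowtie EMP_2$, $ASG_1 \bowtie EMP_3$, $ASG_2 \bowtie EMP_3$ and $ASG_2 \bowtie EMP_1$.]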
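Reduction with Join (II)
[Figure: after the empty joins are eliminated, the query is the union of $ASG_1 \bowtie_{ENO} EMP_1$, $ASG_2 \bowtie_{ENO} EMP_2$ and $ASG_2 \bowtie_{ENO} EMP_3$.]
Rule: given $R_i = \sigma_{p_i}(R)$ and $R_j = \sigma_{p_j}(R)$,
$R_i \bowtie R_j = \emptyset$ if $\forall x \in R_i, \forall y \in R_j: \lnot(p_i(x) \land p_j(y))$
Reduction with join:
1. Distribute the join over the union
2. Eliminate unnecessary work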
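The two steps can be sketched for range-fragmented relations: enumerate every fragment pair (join distributed over union), then drop the pairs whose ENO ranges cannot overlap. The interval encoding `(low, high]` and the function names are assumptions made for this illustration.

```python
from itertools import product

# Hypothetical encoding: a fragment holds tuples with low < ENO <= high.
ASG_FRAGMENTS = {"ASG1": (None, "E3"), "ASG2": ("E3", None)}
EMP_FRAGMENTS = {"EMP1": (None, "E3"), "EMP2": ("E3", "E6"), "EMP3": ("E6", None)}

def overlaps(a, b):
    """True if the two (low, high] ranges share at least one possible value."""
    low = max((x for x in (a[0], b[0]) if x is not None), default=None)
    high = min((x for x in (a[1], b[1]) if x is not None), default=None)
    return low is None or high is None or low < high

def reduced_joins(left, right):
    """Distribute the join over the unions, then keep only non-empty joins."""
    return [
        (l_name, r_name)
        for (l_name, l_rng), (r_name, r_rng) in product(left.items(), right.items())
        if overlaps(l_rng, r_rng)
    ]
```

Under these assumptions, `reduced_joins(ASG_FRAGMENTS, EMP_FRAGMENTS)` keeps three of the six fragment joins, the same three shown in the reduced query.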
Reduction for VF
Find useless intermediate relations.
A relation $R$ defined over attributes $A = \{A_1, A_2, \dots, A_n\}$ is vertically fragmented as $R_i = \Pi_{A'}(R)$ where $A' \subseteq A$.
$\Pi_{K,D}(R_i)$ is useless if the set of projection attributes $D$ is not in $A'$.
$EMP_1 = \Pi_{ENO,ENAME}(EMP)$
$EMP_2 = \Pi_{ENO,TITLE}(EMP)$
SELECT ENAME FROM EMP
[Figure: the localized query $\Pi_{ENAME}(EMP_1 \bowtie_{ENO} EMP_2)$ reduces to $\Pi_{ENAME}(EMP_1)$.]
Reduction for DHF
Distribute joins over unions
Apply the join reduction for horizontal fragmentation
$EMP_1 = \sigma_{TITLE = \text{"Programmer"}}(EMP)$
$EMP_2 = \sigma_{TITLE \neq \text{"Programmer"}}(EMP)$
$ASG_1 = ASG \ltimes_{ENO} EMP_1$
$ASG_2 = ASG \ltimes_{ENO} EMP_2$
SELECT * FROM EMP, ASG WHERE ASG.ENO = EMP.ENO AND EMP.TITLE = "Mech. Eng."
[Figure: localized query $(ASG_1 \cup ASG_2) \bowtie_{ENO} \sigma_{TITLE = \text{"Mech. Eng."}}(EMP_1 \cup EMP_2)$.]
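Reduction for DHF (II)
[Figure: reduction steps for the query with TITLE = "Mech. Eng."]
1. Selection first: the selection on $EMP_1$ is empty, leaving $(ASG_1 \cup ASG_2) \bowtie_{ENO} \sigma_{TITLE = \text{"Mech. Eng."}}(EMP_2)$
2. Joins over union: $(ASG_1 \bowtie_{ENO} \sigma_{TITLE = \text{"Mech. Eng."}}(EMP_2)) \cup (ASG_2 \bowtie_{ENO} \sigma_{TITLE = \text{"Mech. Eng."}}(EMP_2))$
3. The join with $ASG_1$ is empty, leaving $ASG_2 \bowtie_{ENO} \sigma_{TITLE = \text{"Mech. Eng."}}(EMP_2)$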
Reduction for HF
Remove empty relations generated by contradicting selections on horizontal fragments
Remove useless relations generated by projections on vertical fragments
Distribute joins over unions in order to isolate and remove useless joins
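Reduction for HF -- An Example
$EMP_1 = \sigma_{ENO \le \text{"E4"}}(\Pi_{ENO,ENAME}(EMP))$
$EMP_2 = \sigma_{ENO > \text{"E4"}}(\Pi_{ENO,ENAME}(EMP))$
$EMP_3 = \Pi_{ENO,TITLE}(EMP)$
Query:
SELECT ENAME FROM EMP WHERE ENO = "E5"
[Figure: the localized query $\Pi_{ENAME}(\sigma_{ENO = \text{"E5"}}((EMP_1 \cup EMP_2) \bowtie_{ENO} EMP_3))$ reduces to $\Pi_{ENAME}(\sigma_{ENO = \text{"E5"}}(EMP_2))$.]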
Why Optimization – An Example Query
Schema: EMP(eno, ename, title), ASG(eno, jno, resp, dur)
Query: find the names of the employees who are managing a project.
SELECT ename FROM EMP e, ASG g WHERE e.eno = g.eno AND resp = "manager"
[Figure: the SQL query posed against the database and its relational algebra (RA) tree, $\Pi_{ENAME}(EMP \bowtie_{EMP.ENO = ASG.ENO} \sigma_{RESP = \text{"manager"}}(ASG))$.]
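Example - Strategies
Fragment schema
$EMP_1 = \sigma_{ENO \le 100}(EMP)$ at site 1
$EMP_2 = \sigma_{ENO > 100}(EMP)$ at site 2
$ASG_1 = \sigma_{ENO \le 100}(ASG)$ at site 3
$ASG_2 = \sigma_{ENO > 100}(ASG)$ at site 4
Query site: Site 5
[Figure: Plan A. $ASG_1' = \sigma_{RESP = \text{"manager"}}(ASG_1)$ and $ASG_2' = \sigma_{RESP = \text{"manager"}}(ASG_2)$ are computed locally and shipped to the sites of $EMP_1$ and $EMP_2$, where they are joined on ENO; the two join results are sent to Site 5.]
[Figure: Plan B. $ASG_1$, $ASG_2$, $EMP_1$ and $EMP_2$ are all shipped to Site 5, where the selection on RESP = "manager" and the join on ENO are performed.]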
Example – DB Statistics & Costs
Database statistics
- EMP has 400 tuples and ASG has 1000 tuples
- there are 20 managers in ASG
- the data is uniformly distributed among sites
- ASG and EMP are locally clustered on attributes RESP and ENO, respectively
Costs
- tuple access: tacc = 1 unit
- tuple transfer: ttrans = 10 units
Costs for Example Plans
The cost of Plan A:
Produce ASG' = 20 * tacc = 20 (processing locally)
Transfer ASG' = 20 * ttrans = 200 (transfer to the EMP sites)
Produce EMP' = (10 + 10) * tacc * 2 = 40 (join at the EMP sites)
Transfer EMP' = 20 * ttrans = 200 (send to Site 5)
Total cost = 460
The cost of Plan B:
Transfer EMP = 400 * ttrans = 4,000 (send EMP to Site 5)
Transfer ASG = 1000 * ttrans = 10,000 (send ASG to Site 5)
Produce ASG' = 1000 * tacc = 1,000 (selection at Site 5)
Join EMP and ASG' = 400 * 20 * tacc = 8,000 (join at Site 5)
Total cost = 23,000
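The arithmetic for the two plans can be reproduced with a short script. The constants come from the statistics in the slides; the function names are hypothetical and the sketch simply mirrors the line items listed above.

```python
T_ACC = 1     # cost of one tuple access
T_TRANS = 10  # cost of one tuple transfer

def plan_a_cost() -> int:
    produce_asg = 20 * T_ACC              # select the 20 managers locally
    transfer_asg = 20 * T_TRANS           # ship them to the EMP sites
    produce_emp = (10 + 10) * T_ACC * 2   # join at the two EMP sites
    transfer_emp = 20 * T_TRANS           # send the results to Site 5
    return produce_asg + transfer_asg + produce_emp + transfer_emp

def plan_b_cost() -> int:
    transfer_emp = 400 * T_TRANS          # ship all of EMP to Site 5
    transfer_asg = 1000 * T_TRANS         # ship all of ASG to Site 5
    produce_asg = 1000 * T_ACC            # selection at Site 5
    join = 400 * 20 * T_ACC               # join at Site 5
    return transfer_emp + transfer_asg + produce_asg + join
```

With these unit costs, Plan B is about 50 times more expensive than Plan A, mostly because whole relations are transferred before any selection is applied.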
Query Optimization
Problems in query optimization
1. Determining the physical copies of the fragments upon which to execute the fragment query expressions (also known as materialization)
2. Selecting the order of execution of operations
3. Selecting the method for executing each operation
These problems are not independent; for instance, the choice of the best materialization for a query depends on the order in which operations are executed. Nevertheless, they are treated as independent. Further,
- we bypass (1) by taking materialization for granted
- we bypass (3) by clustering all operations at the same site and treating them as a problem that depends on the local database system
Query Optimization - Objectives
The selection among alternative query execution strategies is made based on predetermined objectives.
Two main objectives:
- minimize the total processing time (total cost)
  - the network and the computers at the nodes do not get overloaded
  - response time cannot be guaranteed
- minimize the response time
  - allocation must facilitate parallel execution of the query
  - but throughput may decrease, and the cost can be higher than under total-cost minimization
Total processing time (cost) is the sum of all the time (cost) incurred in executing the query (CPU, I/O, data transfer).
Response time is the elapsed time from the initiation to the completion of the query.
Optimization Algorithms – The Issues
Cost model
- cost components
- weights for each component
- costs for primitive operations
Search space
- the set of equivalent algebra expressions (query trees)
Search strategies
- how do we move inside the search space?
- exhaustive search, heuristics, ...
Cost Models
The cost measures are I/O and CPU for centralized DBMSs, and I/O, CPU and data transfer costs for DDBMSs.
Total cost = CPU cost + I/O cost + communication cost
- CPU cost: $C_{cpu} \cdot \#insts$
- I/O cost: $C_{I/O} \cdot \#I/Os$
- Communication cost: $C_{msg} \cdot \#msgs + C_{tr} \cdot \#bytes$
$C_{cpu}$, $C_{I/O}$, $C_{tr}$ and $C_{msg}$ are all assumed to be constants.
Response time = sum over sequential operations:
$C_{cpu} \cdot s\_\#insts + C_{I/O} \cdot s\_\#I/Os + C_{msg} \cdot s\_\#msgs + C_{tr} \cdot s\_\#bytes$
where s_x stands for the maximum number of sequential x's that need to be executed to process the query.
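The difference between the two measures shows up as soon as transfers can run in parallel. The sketch below is an illustration under stated assumptions: it considers communication cost only (CPU and I/O are ignored), each transfer is one message, and the scenario of several sites sending data to one site at the same time is hypothetical.

```python
def communication_cost(c_msg, c_tr, n_msgs, n_bytes):
    """C_msg * #msgs + C_tr * #bytes for one transfer."""
    return c_msg * n_msgs + c_tr * n_bytes

def total_cost(c_msg, c_tr, transfers):
    """Every transfer contributes, whether or not it runs in parallel."""
    return sum(communication_cost(c_msg, c_tr, 1, size) for size in transfers)

def response_time(c_msg, c_tr, transfers):
    """Parallel transfers overlap, so only the longest one counts."""
    return max(communication_cost(c_msg, c_tr, 1, size) for size in transfers)
```

For example, with `c_msg = 5`, `c_tr = 2` and two parallel transfers of 100 and 300 bytes, the total cost is 810 while the response time is 605.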
Intermediate Result Size
The size of the intermediate relations produced during execution influences the selection of the execution strategy.
It is useful in selecting an execution strategy that reduces data transfer.
The sizes of intermediate relations need to be estimated based on the cardinalities of relations and the lengths of attributes.
For a relation R{A1, A2, ..., An} fragmented as R1, R2, ..., Rn, the statistical data typically collected are
- len(Ai): length of attribute Ai in bytes
- min(Ai) and max(Ai) for ordered domains
- card(dom(Ai)): number of unique values in dom(Ai)
- card(Rj): number of tuples in each fragment Rj
Intermediate Size Estimation
Join selectivity factor:
$SF_J(R, S) = \frac{card(R \bowtie S)}{card(R) \cdot card(S)}$
Selection selectivity factor:
$SF_S(F) = \frac{card(\sigma_F(R))}{card(R)}$
Size of a relation:
$size(R) = card(R) \cdot len(R)$
Cardinality of intermediate relations (selection):
$SF_S(A = value) = \frac{1}{card(dom(A))}$
$SF_S(A > value) = \frac{max(A) - value}{max(A) - min(A)}$
$SF_S(A < value) = \frac{value - min(A)}{max(A) - min(A)}$
$SF_S(p(A_i) \land p(A_j)) = SF_S(p(A_i)) \cdot SF_S(p(A_j))$
$SF_S(p(A_i) \lor p(A_j)) = SF_S(p(A_i)) + SF_S(p(A_j)) - SF_S(p(A_i)) \cdot SF_S(p(A_j))$
$SF_S(A \in \{values\}) = SF_S(A = value) \cdot card(\{values\})$
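The selection selectivity formulas translate directly into small functions. This is a minimal sketch with hypothetical function names; like the formulas, it assumes uniformly distributed values and independent predicates.

```python
def sf_equal(card_dom: int) -> float:
    """SF_S(A = value) = 1 / card(dom(A))."""
    return 1 / card_dom

def sf_greater(value: float, min_a: float, max_a: float) -> float:
    """SF_S(A > value) = (max(A) - value) / (max(A) - min(A))."""
    return (max_a - value) / (max_a - min_a)

def sf_less(value: float, min_a: float, max_a: float) -> float:
    """SF_S(A < value) = (value - min(A)) / (max(A) - min(A))."""
    return (value - min_a) / (max_a - min_a)

def sf_and(sf_i: float, sf_j: float) -> float:
    """Conjunction of two independent predicates."""
    return sf_i * sf_j

def sf_or(sf_i: float, sf_j: float) -> float:
    """Disjunction of two independent predicates."""
    return sf_i + sf_j - sf_i * sf_j

def estimated_card(card_r: int, sf: float) -> float:
    """card(sigma_F(R)) = SF_S(F) * card(R)."""
    return sf * card_r
```

For instance, with an attribute ranging over 0 to 100, the predicate A > 75 has selectivity 0.25, so a selection on a 1000-tuple relation is estimated to return 250 tuples.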
Intermediate Size Estimation (II)
Projection:
$card(\Pi_A(R)) = card(R)$
Cartesian product:
$card(R \times S) = card(R) \cdot card(S)$
Join:
$card(R \bowtie_{A=B} S) = card(S)$ if A is a key of R and B is a foreign key of S
$card(R \bowtie_{A=B} S) = SF_J(R, S) \cdot card(R) \cdot card(S)$ otherwise
Union:
Upper bound = $card(R) + card(S)$
Lower bound = $\max\{card(R), card(S)\}$
Cost of Processing Primitive Operations
Selection
Projection
Union
Join
- nested-loops
- sort-merge
- hash-based
For distributed joins, the semi-join is proposed as a way to perform joins.
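Semi-join
[Figure: computing $R \bowtie S$ with a semi-join]
$R' = \Pi_A(R)$, computed at the site of R and shipped to the site of S
$S' = S \ltimes_A R'$, computed at the site of S and shipped to the site of R
$R \bowtie S'$, computed at the site of R, equals $R \bowtie S$
1. The join is replaced with a projection, followed by a semi-join, and then a join
2. The projection and the join are done at one site, and the semi-join at another site
3. Amount of data transferred: $|R'| + |S'|$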
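The three steps can be sketched on in-memory relations. This is an illustration only: relations are lists of dictionaries, "shipping" is just a function call, and the amount of data transferred is counted in values and tuples rather than bytes, all of which are assumptions made for the example.

```python
def project(relation, attr):
    """R' = distinct values of the join attribute in R."""
    return {row[attr] for row in relation}

def semijoin(relation, values, attr):
    """S' = the tuples of S whose join attribute appears in R'."""
    return [row for row in relation if row[attr] in values]

def join(r, s, attr):
    """Plain nested-loop equijoin on attr."""
    return [{**x, **y} for x in r for y in s if x[attr] == y[attr]]

def semijoin_based_join(r, s, attr):
    r_proj = project(r, attr)               # computed at R's site, shipped to S's site
    s_reduced = semijoin(s, r_proj, attr)   # computed at S's site, shipped back
    transferred = len(r_proj) + len(s_reduced)
    return join(r, s_reduced, attr), transferred

# Hypothetical sample data.
R = [{"A": 1, "x": "a"}, {"A": 2, "x": "b"}]
S = [{"A": 2, "y": "p"}, {"A": 3, "y": "q"}, {"A": 2, "y": "r"}]
```

On the sample data, the semi-join-based plan returns the same result as joining R and S directly, while the tuple of S with A = 3 is never shipped.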
Semi-join versus Join
Using a semi-join increases local processing costs because a relation must be scanned twice (project, join)
When joining the intermediate relations produced by a semi-join, one cannot exploit indices on the base relations
A semi-join may not be worthwhile when communication costs are low
Search Space
The search space is characterized by alternative execution plans
Most optimizers focus on join trees
For N relations, there are O(N!) equivalent join trees
SELECT ENAME, RESP FROM EMP, ASG, PROJ WHERE EMP.ENO = ASG.ENO AND ASG.PNO = PROJ.PNO
[Figure: three equivalent join trees]
$(EMP \bowtie_{ENO} ASG) \bowtie_{PNO} PROJ$
$(PROJ \bowtie_{PNO} ASG) \bowtie_{ENO} EMP$
$(PROJ \times EMP) \bowtie_{PNO,ENO} ASG$
Restricting Search Space
O(N!) is large; when join methods are also considered, the search space is even bigger
Restrict by means of heuristics
- ignore Cartesian products, ...
Restrict the shape of the join tree
- only consider deep trees, ...
[Figure: a left-deep tree, $((R_1 \bowtie R_2) \bowtie R_3) \bowtie R_4$, and a bushy tree, $(R_1 \bowtie R_2) \bowtie (R_3 \bowtie R_4)$]
Search Strategy
How to move in the search space to find the optimal plan
Deterministic
- start from base relations and build plans by adding relations at each step
- dynamic programming: breadth-first
- greedy: depth-first
Randomized
- search for the optimal plan around a particular starting point
- simulated annealing
- iterative improvement
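Search Strategies -- Example
[Figure: deterministic strategy. The plan is built one join at a time: $R_1 \bowtie R_2$, then $(R_1 \bowtie R_2) \bowtie R_3$, then $((R_1 \bowtie R_2) \bowtie R_3) \bowtie R_4$.]
[Figure: randomized strategy. A complete plan is transformed into a neighbouring plan, for example by exchanging $R_2$ and $R_3$ in the join tree.]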
Distributed Query Optimization Algorithms
- System R and R*
- Hill Climbing and SDD-1
System R (Centralized) Algorithm
Simple (one-relation) queries are executed according to the best access path.
Execute joins
- determine the possible orderings of joins
- determine the cost of each ordering
- choose the join ordering with the minimal cost
For joins, two join methods are considered:
- nested loops
- merge join
System R Algorithm -- Example
Names of employees working on the CAD/CAM project
Assume
- EMP has an index on ENO
- ASG has an index on PNO
- PROJ has an index on PNO and an index on PNAME
System R Algorithm -- Example (cont'd)
Choose the best access path to each relation
- EMP: sequential scan (no selection on EMP)
- ASG: sequential scan (no selection on ASG)
- PROJ: index on PNAME (there is a selection on PROJ based on PNAME)
Determine the best join ordering
- $EMP \bowtie ASG \bowtie PROJ$
- $ASG \bowtie PROJ \bowtie EMP$
- $PROJ \bowtie ASG \bowtie EMP$
- $ASG \bowtie EMP \bowtie PROJ$
- $EMP \times PROJ \bowtie ASG$
- $PROJ \times EMP \bowtie ASG$
Select the best ordering based on the join costs evaluated according to the two methods
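System R Example (cont'd)
[Figure: alternative join orders, built level by level from EMP, ASG and PROJ. Two-relation plans include $EMP \bowtie ASG$, $ASG \bowtie EMP$, $ASG \bowtie PROJ$, $PROJ \bowtie ASG$ and the Cartesian products of EMP and PROJ; the surviving three-relation plans are $(ASG \bowtie EMP) \bowtie PROJ$ and $(PROJ \bowtie ASG) \bowtie EMP$.]
The best total join order is one of
$(ASG \bowtie EMP) \bowtie PROJ$
$(PROJ \bowtie ASG) \bowtie EMP$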
System R Algorithm
$(PROJ \bowtie ASG) \bowtie EMP$ has a useful index on the select attribute and direct access to the join attributes of ASG and EMP.
Final plan:
1. select PROJ using the index on PNAME
2. then join with ASG using the index on PNO
3. then join with EMP using the index on ENO
System R* Distributed Query Optimization
Total-cost minimization: the cost function includes local processing as well as transmission.
Algorithm
- for each relation in the query tree, find the best access path
- for the join of n relations, find the optimal join order strategy
- each local site optimizes the local query processing
Data Transfer Strategies
Ship-whole: the entire relation is shipped and stored as a temporary relation; when the merge join algorithm is used, the join can be done in pipeline mode.
Fetch-as-needed: this method is equivalent to a semijoin of the inner relation with each outer relation tuple.
Join Strategy 1
Consider joining an external (outer) relation R with an internal (inner) relation S. Let LC be the local processing cost, CC the data transfer cost, and s the average number of tuples of S that match one tuple of R.
Strategy 1: ship the entire outer relation to the site of the inner relation
TC = LC(get R) + CC(size(R)) + LC(get s tuples from S) * card(R)
Join Strategy 2
Ship the entire inner relation to the site of the outer relation
TC = LC(get S) + CC(size(S)) + LC(store S) + LC(get R) + LC(get s tuples from S) * card(R)
Join Strategy 3
Fetch tuples of the inner relation for each tuple of the outer relation
TC = LC(get R) + CC(len(A)) * card(R) + LC(get s tuples from S) * card(R) + CC(s * len(S)) * card(R)
Join Strategy 4
Move both relations to a third site and join there
TC = LC(get R) + LC(get S) + CC(size(S)) + LC(store S) + CC(size(R)) + LC(get s tuples from S) * card(R)
Conceptually, the algorithm does an exhaustive search among all alternatives and selects the one that minimizes the total cost.
Hill Climbing Algorithm - Algorithm
Inputs
- query graph, locations of relations, and relation statistics
Initial solution
- the least costly among all strategies in which the relations are sent to a candidate result site; it is denoted by ES0, and the site is the chosen site
Split ES0 into
- ES1: ship one relation of a join to the site of the other relation
- ES2: these two relations are joined locally and the result is transmitted to the chosen site
If cost(ES1) + cost(ES2) + LC > cost(ES0), select ES0; else select ES1 and ES2.
The process can be applied recursively to ES1 and ES2 until no more benefit occurs.
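Hill Climbing Algorithm - Example
[Figure: query graph. The result attribute is SAL; PAY joins EMP on TITLE, EMP joins ASG on ENO, and ASG joins PROJ on PNO, with the selection PNAME = "CAD/CAM" on PROJ.]
Relation  Size  Site
EMP       8     1
PAY       4     2
PROJ      1     3
ASG       10    4
Ignore the local processing cost. The length of tuples is 1 for all relations.
[Figure: initial solution ES0. EMP (8) at Site 1, PAY (4) at Site 2 and PROJ (1) at Site 3 are shipped to Site 4, which holds ASG (10). Cost = 8 + 4 + 1 = 13.]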
HCA - Example
[Figure: candidate splits of ES0 (cost = 13) into ES1, ES2 and ES3 over Site 1 EMP (8), Site 2 PAY (4), Site 3 PROJ (1) and Site 4 ASG (10), with a join on TITLE performed before shipping to Site 4. The costs of Solution 1 and Solution 2 are left blank in the slide.]
ES0 is the "BEST".
Hill Climbing Algorithm - Comments
Greedy algorithm: determines an initial feasible solution and iteratively tries to improve it.
If there are local minima, it may not find the global minimum.
If the optimal solution has a high initial cost, it will not be found, since it will not be chosen as the initial feasible solution.
[Figure: an alternative schedule over Site 1 EMP (8), Site 2 PAY (4), Site 3 PROJ (1) and Site 4 ASG (10); its cost is left blank in the slide.]
SDD-1 Algorithm
The SDD-1 algorithm generalizes the hill-climbing algorithm to determine an ordering of beneficial semijoins, and uses statistics on the database, called database profiles.
Cost of a semijoin:
$Cost(R \ltimes_A S) = C_{MSG} + C_{TR} \cdot size(\Pi_A(S))$
Benefit is the cost of transferring the irrelevant tuples that the semijoin eliminates:
$Benefit(R \ltimes_A S) = (1 - SF_{SJ}(S.A)) \cdot size(R) \cdot C_{TR}$
A semijoin is beneficial if cost < benefit.
SDD-1: The Algorithm
1. The initialization phase generates all beneficial semijoins, and an execution strategy that includes only local processing.
2. The most beneficial semijoin is selected; statistics are modified and new beneficial semijoins are selected.
3. Step 2 is repeated until no more beneficial semijoins are left.
4. Assembly site selection chooses the site at which to perform the local operations.
5. Postoptimization removes unnecessary semijoins.
SDD-1 - Example
SELECT * FROM EMP, ASG, PROJ WHERE EMP.ENO = ASG.ENO AND ASG.PNO = PROJ.PNO
[Figure: join graph. EMP (Site 1) joins ASG (Site 2) on ENO, and ASG joins PROJ (Site 3) on PNO.]
Relation  Card  Tup_Len  Rel_size
EMP       30    50       1500
ASG       100   30       3000
PROJ      50    40       2000
Attribute  SFsj  Size(PJ(attr))
EMP.ENO    0.3   120
ASG.ENO    0.8   400
ASG.PNO    1.0   400
PROJ.PNO   0.4   200
SDD-1 - First Iteration
SJ1: ASG SJ EMP; benefit = (1 - 0.3) * 3000 = 2100; cost = 120
SJ2: ASG SJ PROJ; benefit = (1 - 0.4) * 3000 = 1800; cost = 200
SJ3: EMP SJ ASG; benefit = (1 - 0.8) * 1500 = 300; cost = 400
SJ4: PROJ SJ ASG; benefit = 0; cost = 400
SJ1 is selected
- the size of ASG is reduced to 3000 * 0.3 = 900; ASG' = ASG SJ EMP
- the semijoin selectivity factor is reduced; it is approximated by SFSJ(ASG'.ENO) = 0.8 * 0.3 = 0.24
SDD-1 - Second & Third Iterations
Second iteration
SJ2: ASG' SJ PROJ; benefit = (1 - 0.4) * 900 = 540; cost = 200
SJ3: EMP SJ ASG'; benefit = (1 - 0.24) * 1500 = 1140; cost = 400
SJ3 is selected: EMP' = EMP SJ ASG'; size(EMP') = 1500 * 0.24 = 360
Third iteration
SJ2: ASG' SJ PROJ; benefit = (1 - 0.4) * 900 = 540; cost = 200
SJ2 is selected; it reduces the size of ASG' further to 900 * 0.4 = 360
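The selection made in each iteration can be reproduced from the cost and benefit formulas. The sketch below is an illustration under stated assumptions: C_MSG = 0 and C_TR = 1 are chosen because they match the slide numbers, "most beneficial" is taken to mean the largest benefit minus cost, and the candidate lists are typed in by hand from the slides rather than derived by updating the statistics.

```python
from typing import NamedTuple, Optional

C_MSG, C_TR = 0, 1  # assumption: values that reproduce the slide numbers

class Semijoin(NamedTuple):
    name: str         # for example "ASG SJ EMP"
    size_r: float     # size of the relation being reduced
    sf_sj: float      # semijoin selectivity factor of S.A
    size_proj: float  # size of the projection of S on A

def cost(sj: Semijoin) -> float:
    return C_MSG + C_TR * sj.size_proj

def benefit(sj: Semijoin) -> float:
    return (1 - sj.sf_sj) * sj.size_r * C_TR

def most_beneficial(candidates) -> Optional[Semijoin]:
    """Pick the beneficial semijoin (cost < benefit) with the largest net gain."""
    good = [sj for sj in candidates if cost(sj) < benefit(sj)]
    return max(good, key=lambda sj: benefit(sj) - cost(sj), default=None)

FIRST = [
    Semijoin("ASG SJ EMP", 3000, 0.3, 120),
    Semijoin("ASG SJ PROJ", 3000, 0.4, 200),
    Semijoin("EMP SJ ASG", 1500, 0.8, 400),
    Semijoin("PROJ SJ ASG", 2000, 1.0, 400),
]
SECOND = [
    Semijoin("ASG' SJ PROJ", 900, 0.4, 200),
    Semijoin("EMP SJ ASG'", 1500, 0.24, 400),
]
THIRD = [Semijoin("ASG' SJ PROJ", 900, 0.4, 200)]
```

Applying `most_beneficial` to the three candidate lists selects SJ1, then SJ3, then SJ2, the same order as in the iterations above.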
Local Optimization
Each site optimizes the plan to be executed at the site
A centralized query optimization problem
SDD-1 - Assembly Site Selection
After reduction
- EMP is at site 1 with size 360
- ASG is at site 2 with size 360
- PROJ is at site 3 with size 2000
Site 3 is chosen as the assembly site.
No semijoins are removed in postoptimization.
[Figure: final strategy. (ASG SJ EMP) SJ PROJ is sent from Site 2 to Site 3, EMP SJ ASG' is sent from Site 1 to Site 3, and the join is performed at Site 3.]