1 5. distributed query processing chapter 7 overview of query processing chapter 8 query...

of 64 /64
1 5. Distributed Query Processing Chapter 7 Overview of Query Processing Chapter 8 Query Decomposition and Data Localization

Author: lynette-mills

Post on 31-Dec-2015

229 views

Category:

Documents


0 download

Embed Size (px)

TRANSCRIPT

  • 5. Distributed Query ProcessingChapter 7 Overview of Query Processing

    Chapter 8 Query Decomposition and Data Localization

  • OutlineOverview of Query Processing ()Query Decomposition and Localization ()

  • Query Processing High level user queryLow level data manipulation commandsQuery Processor

  • Query Processing ComponentsQuery language that is usedSQL (Structured Query Language)Query execution methodologyThe steps that the system goes through in executing high-level (declarative) user queriesQuery optimization How to determine the best execution plan?

  • Tuple calculus: { t | F(t) } where t is a tuple variable, and F(t) is a well formed formulaExample: Get the numbers and names of all managers.

    Query Language

  • Domain calculus: where xi is a domain variable, and is a well formed formulaExample: { x, y | EMP(x, y, Manager") }

    Variables are position sensitive!Query Language (cont.)

  • SQL is a tuple calculus language.

    SELECTENO,ENAMEFROMEMPWHERETITLE=ProgrammerEnd user uses non-procedural (declarative) languages to express queries.Query Language (cont.)

  • Query Processing Objectives & ProblemsQuery processor transforms queries into procedural operations to access data in an optimal way.

    Distributed query processor has to deal with query decomposition and data localization.

  • Centralized Query Processing AlternativesSELECTENAMEFROMEMP E, ASG GWHEREE.ENO = G.ENO AND TITLE=manager

    Strategy 2 avoids Cartesian product, so is better.Strategy 1:

    Strategy 2:

  • Distributed Query ProcessingQuery processor must consider the communication cost and select the best site.The same query example, but relation G and E are fragmented and distributed.

  • Distributed Query Processing Plans By centralized optimization,

    Two distributed query processing plans

    *

  • Distributed Query Plan IResult = (EMP1 EMP2) ENO TITLE=manager (ASG1 ASG2) Site 1ASG2ASG1Site 2Site 3EMP2EMP1Site 4Plan I: To transport all segments to query site and execute there.

    This causes too much network traffic, very costly.Site 5

  • Distributed Query Plan IIPlan II (Optimized):

    Site 1ASG2ASG1EMP2EMP1ASG1 = TITLE=manager (ASG1) Site 2 ASG2 = TITLE=manager (ASG2) Site 3EMP1 = EMP1 ENO ASG1 Site 4EMP2 = EMP2 ENO ASG2 Result = (EMP1 EMP2 )Site 5

  • Costs of the Two PlansAssume:size(EMP)=400, size(ASG)=1000, 20 tuples with TITLE=manager tuple access cost = 1 unit; tuple transfer cost = 10 unitsASG and EMP are locally clustered on attribute TITLE and ENO, respectively.Plan 1Transfer EMP to site 5: 400*tuple transfer cost 4000Transfer ASG to site 5: 1000*tuple transfer cost 10000Produce ASG: 1000*tuple access cost 1000Join EMP and ASG: 400*20*tuple access cost 8000 Total cost 23,000Plan 2Produce ASG: (10+10)*tuple access cost 20Transfer ASG to the sites of EMP: (10+10)*tuple transfer cost 200Produce EMP: (10+10)*tuple access cost * 2 40Transfer EMP to result site: (10+10)*tuple transfer cost 200 Total cost 460

  • Query Optimization ObjectivesMinimize a cost function I/O cost + CPU cost + communication costThese might have different weights in different distributed environments

    Can also maximize throughout

  • Communication Cost Wide area network Communication cost will dominateLow bandwidthLow speedHigh protocol overheadMost algorithms ignore all other cost components Local area networkCommunication cost not that dominateTotal cost function should be considered

  • Complexity of Relational Algebra OperationsMeasured by cardinality n and tuples are sorted on comparison attributesCartesian-Product X

  • Types of Query OptimizationExhaustive searchCost-basedOptimalCombinatorial complexity in the number of relationsWorkable for small solution spacesHeuristicsNot optimalRe-group common sub-expressionsPerform selection and projection ( ) firstReplace a join by a series of semijoinsReorder operations to reduce intermediate relation sizeOptimize individual operations

  • Query Optimization GranularitySingle query at a timeCannot use common intermediate results

    Multiple queries at a timeEfficient if many similar queriesDecision space is much larger

  • Query Optimization TimingStaticDo it at compilation time by using statistics, appropriate for exhaustive search, optimized once, but executed many times.Difficult to estimate the size of the intermediate resultsCan amortize over many executions

    DynamicDo it at execution time, accurate about the size of the intermediate results, repeated for every execution, expensive.

  • Query Optimization Timing (cont.)HybridCompile using a static algorithmIf the error in estimate size > threshold, re-optimizing at run time

  • StatisticsRelationCardinalitySize of a tupleFraction of tuples participating in a join with another relationAttributesCardinality of the domainActual number of distinct valuesCommon assumptionsIndependence between different attribute valuesUniform distribution of attribute values within their domain

  • Decision SitesFor query optimization, it may be done bySingle site centralized approach Single site determines the best scheduleSimpleNeed knowledge about the entire distributed databaseAll the sites involved distributed approachCooperation among sites to determine the scheduleNeed only local informationCost of operationHybrid one site makes major decision in cooperation with other sites making local decisionsOne site determines the global scheduleEach site optimizes the local subqueries

  • Network TopologyWide Area Network (WAN) point-to-pointCharacteristicsLow bandwidthLow speedHigh protocol overheadCommunication cost will dominate; ignore all other cost factorsGlobal schedule to minimize communication costLocal schedules according to centralized query optimization

  • Network Topology (cont.)Local Area Network (LAN)communication cost not that dominateTotal cost function should be consideredBroadcasting can be exploitedSpecial algorithms exist for star networks

  • Other Information to ExploitUsing replications to minimize communication costs

    Using semijoins to reduce the size of operand relations to cut down communication costs when overhead is not significant.

  • Layers of Query Processing QUERY DECOMPOSITIONDATA LOCALIZATIONGLOBAL OPTIMIZATIONLOCAL OPTIMIZATIONCalculus Query on Distributed RelationsAlgebra Query on Distributed Relations Fragment QueryOptimized Fragment Query With Communication OperationsOptimized Local QueriesCONTROL SITELOCAL SITE

  • Step 1 - Query DecompositionDecompose calculus query into algebra query using global conceptual schema information.

  • Step 1 - Query Decomposition (cont.)1) Normalization The calculus query is written in a normalized form (CNF or DNF) for subsequent manipulation2) Analysis To reject normalized queries for which further processing is either impossible or unnecessary (type incorrect or semantically incorrect)3) Simplification Redundant predicates are eliminated to obtain simplified queries 4) Restructuring The calculus query is translated to optimal algebraic query representation More than one translation is possible

  • 1) Normalization Lexical and syntactic analysis check validity (similar to compilers)check for attributes and relations type checking on the qualification There are two possible forms of representing the predicates in query qualification: Conjunctive Normal Form (CNF) or Disjunctive Normal Form (DNF) CNF: (p11 p12 ... p1n) ... (pm1 pm2 ... pmn)DNF: (p11 p12 ... p1n) ... (pm1 pm2 ... pmn)OR's mapped into union AND's mapped into join or selection

  • 1) Normalization (cont.)The transformation of the quantifier-free predicate is straightforward using the well-known equivalence rules for logical operations ( ):1.2.3.4.5.6.7.8.9.10.

  • 1) Normalization (cont.)ExampleSELECTENAME FROMEMP, ASGWHEREEMP.ENO=ASG.ENO ANDASG.JNO=J1AND (DUR=12 OR DUR=24)

    The conjunctive normal form:

  • 2) Analysis Objectivereject type incorrect or semantically incorrect queries.

    Type incorrect if any of its attribute or relation names is not defined in the global schema; if operations are applied to attributes of the wrong type

  • 2) Analysis (cont.) Type incorrect exampleSELECTE# FROMEMPWHEREENAME>200

    ! Undefined attribute! Type mismatch

  • 2) Analysis (cont.)Semantically incorrectComponents do not contribute in any way to the generation of the resultFor only those queries that do not use disjunction () or negation (), semantic correctness can be determined by using query graph

  • Query GraphTwo kinds of nodesOne node represents the result relationOther nodes represent operand relations Two types of edgesan edge to represent a join if neither of its two nodes is the resultan edge to represent a projection if one of its node is the result nodeNodes and edges may be labeled by predicates for selection, projection, or join.

  • Query Graph ExampleSELECTENAME, RESP FROMEMP,ASG,PROJ WHEREEMP.ENO=ASG.GNO AND ASG.PNO=PROJ.PNO AND PNAME=CAD/CAM ANDDUR>36 ANDTITLE=Programmer

  • Join Graph Example 1 A subgraph of query graph for join operation.

  • Tool of Analysis A conjunctive query without negation is semantically incorrect if its query graph is NOT connected!

  • Analysis Example

  • Query Graph Example 2

  • 3) SimplificationUsing idempotency rules to eliminate redundant predicates from WHERE clause.

  • SELECT TITLEFROM EMPWHERE ENAME="J.Doe"SELECT TITLEFROM EMPWHERE (NOT(TITLE=Programmer) AND (TITLE=Programmer OR TITLE=Electrical Eng.) AND NOT(TITLE=Electrical Eng.)) OR ENAME=J.Doeis equivalent toSimplification Example

  • Simplification Example (cont.)p1 = p2 = p3 = The disjunctive normal form of the query is = ( p1 p1 p2) ( p1 p2 p2) p3 = (false p2) ( p1 false) p3 = false false p3 = p3Let the query qualification is ( p1 (p1 p2) p2) p3

  • 4) RewritingConverting a calculus query in relational algebrastraightforward transformation from relational calculus to relational algebra restructuring relational algebra expression to improve performancemaking use of query trees

  • Relational Algebra TreeA tree defined by:a root node representing the query resultleaves representing database relationsnon-leaf nodes representing relations produced by operations edges from leaves to root representing the sequences of operations

  • An SQL Query and Its Query TreeASGEMPENAME(ENAMEJ.DOE )(JNAME=CAD/CAM ) (Dur=12 Dur=24)

    PROJSELECT Ename FROM J, G, E WHERE G.Eno=E.ENo AND G.JNo = J.JNo AND ENAME `J.Doe' AND JName = `CAD' AND (Dur=12 or Dur=24)

  • How to translate an SQL query into an algebra tree?Create a leaf for every relation in the FROM clauseCreate the root as a project operation involving attributes in the SELECT clauseCreate the operation sequence by the predicates and operators in the WHERE clause

  • Rewriting -- Transformation Rules (I) Commutativity of binary operations: R S S R R S S RAssociativity of binary operations: (R S) T R ( S T )

    Idempotence of unary operations: grouping of projections and selections A ( A (R )) A (R ) for AA Ap1(A1) ( p2(A2) (R )) p1(A1) p2(A2) (R )

  • Rewriting -- Transformation Rules (II)Commuting selection with projectionA1, , An ( p (Ap) (R )) A1, , An ( p (Ap) ( A1, , An, Ap(R )))Commuting selection with binary operations p (Ai)(R S) ( p (Ai)(R)) S p (Ai)(R S) ( p (Ai)(R)) S p (Ai)(R S) p (Ai)(R) p (Ai)(S) Commuting projection with binary operations C(R S) A(R) B (S) C = A B C(R S) C(R) C (S) C (R S) C (R) C (S)

  • How to use transformation rules to optimize?Unary operations on the same relation may be grouped to access the same relation onceUnary operations may be commuted with binary operations, so that they may be performed first to reduce the size of intermediate relationsBinary operations may be reordered

  • Optimization of Previous Query Tree

  • Step 2 : Data LocalizationTask : To translate a query on global relation into algebra queries on physical fragments, and optimize the query by reduction.

  • Data Localization-- An ExamplePROJASG1EMP1ENAMEDur=12 Dur=24JNAME=CAD/CAMENAMEJ.DOEEMP is fragmented intoEMP1 = ENO E3 (EMP)EMP2 = E3 < ENO E6 (EMP)EMP3 = ENO >E6 (EMP)ASG is fragmented intoASG1 = ENO E3 (ASG)ASG2 = ENO >E3 (ASG)EMP2EMP3ASG2ASG1

  • Reduction with Selection for PHFEMP is fragmented intoEMP1 = ENO E3 (EMP)EMP2 = E3 < ENO E6 (EMP)EMP3 = ENO >E6 (EMP)SELECT *FROM EMPWHERE ENO=E5Given Relation R, FR={R1, R2, , Rn} where Rj =pj(R)pj(Rj) = if x R: (pi(x)pj(x))

  • Reduction with Join for PHFEMP is fragmented intoEMP1 = ENO E3 (EMP)EMP2 = E3 < ENO E6 (EMP)EMP3 = ENO >E6 (EMP)ASG is fragmented intoASG1 = ENO E3 (ASG)ASG2 = ENO >E3 (ASG)SELECT *FROM EMP, ASGWHERE EMP.ENO=ASG.ENO

  • Reduction with Join for PHF (I)(R1 R2) S (R1 S) (R2 S)

  • Reduction with Join for PHF (II)Given Ri =pi(R) and Rj =pj(R)Ri Rj = if x Ri , y Rj: (pi(x)pj(y))Reduction with join1. Distribute join over union2. Eliminate unnecessary work

  • Reduction for VFFind useless intermediate relationsRelation R defined over attributes A = {A1, A2, , An} vertically fragmented as Ri =A (R) where A A D (Ri) is useless if the set of projection attributes D is not in A

    EMP1= ENO,ENAME (EMP)EMP2= ENO,TITLE (EMP)SELECT ENAMEFROM EMPEMP1ENAME

  • Reduction for DHFDistribute joins over unionApply the join reduction for horizontal fragmentationSELECT *FROM EMP, ASGWHERE ASG.ENO = EMP.ENO AND EMP.TITLE = Mech. Eng.

  • Reduction for DHF (II)Joins over union

  • Reduction for Hybrid Fragmentation

    Combines the rules already specified:Remove empty relations generated by contradicting selection on horizontal fragments;Remove useless relations generated by projections on vertical fragments;Distribute joins over unions in order to isolate and remove useless joins.

  • Reduction for Hybrid Fragmentation - ExampleEMP1 = ENOE4 (ENO,ENAME (EMP))EMP2 = ENO>E4 (ENO,ENAME (EMP))EMP3 = ENO,TITLE (EMP)

    QUERY:SELECT ENAMEFROM EMPWHERE ENO = E5

  • Question & Answer

    *1*