Query Optimization for Semistructured Data

Post on 31-Dec-2015

DESCRIPTION

Query Optimization for Semistructured Data. Jason McHugh, Jennifer Widom, Stanford University. - Rajendra S. Thapa. Road Map: Lore System, Query Execution Engine, Statistics and Cost Model, Performance Results. Lore Data Model - OEM. Data Guide. Path Expression. Simple Path Expression - PowerPoint PPT Presentation

TRANSCRIPT

  • Query Optimization for Semistructured Data

    Jason McHugh, Jennifer Widom, Stanford University
    - Rajendra S. Thapa

  • Road Map

    Lore System
    Query Execution Engine
    Statistics and Cost Model
    Performance Results

  • Lore Data Model - OEM

  • Data Guide

  • Path Expression

    Simple path expression
    - specifies a single-step navigation in the database
    - DBGroup.Member y: variable y ranges over all Member-labeled sub-objects of the object named DBGroup

    Path expression
    - an ordered list of simple path expressions
    - DBGroup.Member x, x.Age y: variable y ranges over all objects that can be reached by starting at the DBGroup object, following an edge labeled Member, then following an edge labeled Age
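
    The two definitions above can be illustrated with a small sketch. The graph, the object identifiers and the helper names (step, evaluate_path) are hypothetical and are not part of Lore; the sketch only shows how a path expression is a sequence of single-step navigations.

```python
# Hypothetical toy OEM graph: oid -> list of (label, child oid).
EDGES = {
    "&1": [("Member", "&2"), ("Member", "&3")],
    "&2": [("Age", "&4")],
    "&3": [("Name", "&5")],
}
# Names are aliases for single objects (here DBGroup denotes &1).
NAMES = {"DBGroup": "&1"}


def step(oids, label):
    """Simple path expression: follow one labeled edge from every object in oids."""
    return [child
            for oid in oids
            for (lbl, child) in EDGES.get(oid, [])
            if lbl == label]


def evaluate_path(name, labels):
    """Path expression: start at a named object and apply each step in order."""
    current = [NAMES[name]]
    for label in labels:
        current = step(current, label)
    return current


members = evaluate_path("DBGroup", ["Member"])      # DBGroup.Member x
ages = evaluate_path("DBGroup", ["Member", "Age"])  # DBGroup.Member x, x.Age y
```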

  • Query language

    Query:
    SELECT x
    FROM DBGroup.Member x
    WHERE exists y in x.Age: y
  • Lore architecture

    Textual Interface
    Data Engine
    Query Processing:
    - Parsing
    - Preprocessor
    - Logical Query Plan Generation
    - Query Optimization
    - Physical Query Plan Generation
    - Execution of Physical Query Plan

  • Queries can be executed in many ways

    SELECT x
    FROM DBGroup.Member x
    WHERE exists y in x.Age: y
  • Top-down preferred

    Query: Select x from A.B x where exists y in x.C: y = 5

    There is only one A.B.C path, so top-down would explore only this path.

    Bottom-up would visit all leaf objects with value 5 and their parents.

    [Figure: graph with a single A.B.C path and several leaf objects with value 5]

  • Bottom-up preferred

    Query: Select x from A.B x where exists y in x.C: y = 5

    There are many A.B.C paths, but only one leaf satisfies the predicate, so bottom-up is a good candidate.

    [Figure: graph with many A.B.C paths; leaf values 5, 4, 4]

  • Hybrid preferred

    Query: Select x from A.B x where exists y in x.C: y = 5

    [Figure: graph with node labels A, B, C and D; leaf values 5, 4, 4]
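
    The trade-off between the top-down and bottom-up strategies can be made concrete with a small sketch that counts visited objects. Everything here is an assumption made for illustration: the two toy graphs, the function names and the "visited" counter are hypothetical, and a real system would locate matching leaves through a value index rather than by looping over a dictionary.

```python
def top_down(edges, values, root, target):
    """Start at root, follow B edges then C edges, and test the predicate last."""
    visited, result = 0, []
    for lbl_b, x in edges.get(root, []):
        if lbl_b != "B":
            continue
        visited += 1                      # visit a B sub-object
        for lbl_c, y in edges.get(x, []):
            if lbl_c != "C":
                continue
            visited += 1                  # visit a C leaf
            if values.get(y) == target:
                result.append(x)
                break
    return result, visited


def bottom_up(edges, values, root, target):
    """Start at leaves satisfying the predicate, then walk C and B edges backwards."""
    parents = {}                          # child -> list of (label, parent)
    for parent, outs in edges.items():
        for lbl, child in outs:
            parents.setdefault(child, []).append((lbl, parent))
    visited, result = 0, []
    for y, value in values.items():
        if value != target:
            continue                      # stands in for a value-index lookup
        visited += 1                      # visit a matching leaf
        for lbl_c, x in parents.get(y, []):
            if lbl_c != "C":
                continue
            visited += 1                  # visit the leaf's parent
            if ("B", root) in parents.get(x, []) and x not in result:
                result.append(x)
    return result, visited


# One A.B.C path, many leaves with value 5 elsewhere: top-down does less work.
edges1 = {"A": [("B", "b1"), ("D", "d1")],
          "b1": [("C", "c1")],
          "d1": [("C", "c2"), ("C", "c3")]}
values1 = {"c1": 5, "c2": 5, "c3": 5}

# Many A.B.C paths, one leaf with value 5: bottom-up does less work.
edges2 = {"A": [("B", "b1"), ("B", "b2"), ("B", "b3")],
          "b1": [("C", "c1")], "b2": [("C", "c2")], "b3": [("C", "c3")]}
values2 = {"c1": 5, "c2": 4, "c3": 4}
```

    Both strategies return the same answer on each graph; only the number of objects touched differs, which is the quantity a cost-based optimizer tries to estimate.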

  • Query Execution Engine

    Logical Query Plans
    - logical query plan operators
    - structure of the plan

    Physical Query Plans
    - operators
    - some physical plans

    Statistics and Cost Model

    Plan Enumeration

  • Query Execution Engine: Logical Query Plans

    Logical operators
    - Discover
    - Chain
    - Glue
    - CreateTemp
    - Project

    Variable binding: a variable x in the query is said to be bound if an object o has been assigned to x.

    Evaluation: an evaluation of a query plan (or sub-plan) is a list of all variables appearing in the plan, along with the object (if any) bound to each variable.

    Rotation

  • Representation of a path expression in the logical query plan

    Path expression: x.B y, y.C z, z.D v

    [Figure: plan tree with two Chain nodes over Discover(x,B,y), Discover(y,C,z) and Discover(z,D,v)]
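
    As a rough sketch of how such a plan could be assembled, the snippet below turns the path expression into Discover nodes joined by Chain nodes. The tuple encoding, the function names and the left-deep arrangement are assumptions made for illustration; the slide only shows that two Chain nodes connect the three Discover nodes.

```python
def parse_step(step):
    """Turn a simple path expression such as 'x.B y' into a Discover node."""
    path, target = step.split()
    source, label = path.split(".")
    return ("Discover", source, label, target)


def logical_plan(path_expression):
    """Chain the Discover nodes together, one Chain per additional step (left-deep)."""
    steps = [parse_step(s.strip()) for s in path_expression.split(",")]
    plan = steps[0]
    for nxt in steps[1:]:
        plan = ("Chain", plan, nxt)
    return plan


plan = logical_plan("x.B y, y.C z, z.D v")
```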

  • CreateTemp(x,t2) Select(y,

  • Query Execution Engine: Physical Query Plans

    Operators
    - Scan(x, l, y)
    - Lindex(x, l, y)
    - Pindex(Path Expression, x)
    - Bindex(l, x, y)
    - Name(x, n)
    - Vindex(Op, Value, l, x)

    [Figure: object x with three l-labeled edges to objects a, b and c; y = {a, b, c}]

  • Some physical plans for a simple logical Query Plan

    Path expression: A.B x, x.C y

    Logical query plan: Chain over Discover(A,B,x) and Discover(x,C,y)

  • Physical plans for A.B x, x.C y

    Scan plan: NLJ over Scan(A,B,x) and Scan(x,C,y)

    Lindex plan: NLJ over Lindex(x,C,y), Lindex(t,B,x) and Name(t, A)

  • More physical plans for A.B x, x.C y

    Bindex plan: NLJ over Bindex(t,B,x), Name(t, A) and Scan(x,C,y)

    Pindex plan: Pindex(A.B x, x.C y, y)
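
    A minimal sketch of the Scan plan and the Lindex plan follows. The operator semantics are assumptions, since the slides only list the operator names: Scan is taken to go from a parent to its children along a label, and Lindex from a child back to its parents. The edge list, the object identifiers and the candidate set passed to lindex_plan are hypothetical.

```python
# Hypothetical edge list: (parent oid, label, child oid).
EDGES = [("&1", "B", "&2"), ("&1", "B", "&3"), ("&2", "C", "&4"), ("&3", "D", "&5")]
NAMES = {"A": "&1"}


def scan(parent, label):
    """Assumed Scan semantics: children of parent along edges with this label."""
    return [c for (p, l, c) in EDGES if p == parent and l == label]


def lindex(child, label):
    """Assumed Lindex semantics: parents of child along edges with this label."""
    return [p for (p, l, c) in EDGES if c == child and l == label]


def scan_plan():
    """Nested-loop join: the outer loop binds x via Scan(A,B,x), the inner binds y."""
    a = NAMES["A"]
    return [(x, y) for x in scan(a, "B") for y in scan(x, "C")]


def lindex_plan(candidate_ys):
    """Walk upwards from each candidate y, then check that the top object is named A."""
    out = []
    for y in candidate_ys:
        for x in lindex(y, "C"):          # Lindex(x,C,y)
            for t in lindex(x, "B"):      # Lindex(t,B,x)
                if t == NAMES["A"]:       # Name(t, A)
                    out.append((x, y))
    return out
```

    Both plans produce the same (x, y) bindings; they differ in the direction of traversal and therefore in how much of the graph they touch.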

  • How physical plans are produced

    Each logical plan node creates an optimal physical plan given a set of bound variables.

    During plan enumeration we track:
    1. Whether the variable is bound or not
    2. Which plan operator has bound the variable
    3. All other plan operators that use the variable
    4. Whether the variable is stored within a temporary result

  • How physical plans are produced

    SELECT x
    FROM DBGroup.Member x
    WHERE exists y in x.Age: y

  • Possible physical plans

    Fig. (a): logical plan

  • Possible physical plans

    Fig. (c): logical plan and physical plans

  • More physical plans

    Fig. (d): logical plan

  • Statistics and Cost Model

    Each physical plan is assigned a cost based on the estimated I/O and CPU time required to execute the plan.
    The costing procedure is recursive.
    Plans are compared on I/O cost first, then on CPU time, to decide which plan is cheaper.
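
    The recursive costing and the "I/O first, then CPU" comparison can be sketched as follows. The PlanNode class, the per-operator numbers and the plain summation of child costs are assumptions made for illustration; they are not Lore's actual cost formulas, which would also account for how often an inner operator is re-executed.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    name: str
    io: float                 # hypothetical I/O estimate for this operator alone
    cpu: float                # hypothetical CPU estimate for this operator alone
    children: list = field(default_factory=list)


def cost(node):
    """Recursively total (I/O, CPU) over the plan tree rooted at node."""
    io, cpu = node.io, node.cpu
    for child in node.children:
        child_io, child_cpu = cost(child)
        io += child_io
        cpu += child_cpu
    return (io, cpu)


def cheaper(p1, p2):
    """Tuples compare I/O first; CPU time only breaks ties on I/O."""
    return p1 if cost(p1) <= cost(p2) else p2


scan_plan = PlanNode("NLJ", 0, 1, [PlanNode("Scan(A,B,x)", 10, 2),
                                   PlanNode("Scan(x,C,y)", 20, 4)])
pindex_plan = PlanNode("Pindex(A.B x, x.C y, y)", 5, 9)
best = cheaper(scan_plan, pindex_plan)
```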

  • Performance Result

    Experiment 1

    A simple query: SELECT DBGroup.Movie.Title
    - 11 different query plans
    - the best plan uses Lore's path index to quickly locate all the movie titles
    - the second plan is a top-down strategy
    - the worst plan uses Bindex operators and hash joins

  • Performance Result

    Experiment 2

    The same query, restricted to movies with a Genre sub-object having the value Comedy
    - a point query

  • Performance Result

    Experiment 3
    - the same point query
    - not all possible plans are executed
    - different plans were generated by disallowing the use of particular operators or indexes

  • Performance Result

    Experiment 4

    The query selects movies with a certain quality rating.

  • Future Work

    Optimization techniques for branching path expressions
    - a query rewrite that moves Where clause predicates into the From clause
    - a transformation that introduces a Group-by clause when a large number of paths pass through a small number of objects

    Partially correlated sub-plans
    - similar to correlated subqueries, but they rely on the bindings passed between portions of the physical query plan rather than on the query itself

    In the area of statistics
    - efficient statistics-gathering algorithms
    - statistics about the location of objects on disk
    - modifications to the cost formulas to generate more accurate cost estimates

    OEM (Object Exchange Model)
    - schema-less
    - self-describing
    - labeled directed graph
    - vertices are objects, and each object has a unique object identifier (oid)
    - atomic objects: no outgoing edges; contain a value (integer, real, string, gif, java, audio, etc.)
    - complex objects: have outgoing edges
    - names are special labels that serve as aliases for a single object (e.g. DBGroup is a name that denotes object &1)
    - an OEM object corresponds to an element in XML

    A Data Guide is a concise and accurate summary of the structure of an OEM database, stored itself as an OEM object.

    Data Guides are dynamically generated and maintained over all or part of an existing database.

    Two user interfaces
    - a textual interface used by developers for debugging
    - a graphical interface for end users, which provides tools for browsing query results

    The object manager component, which appears just above the storage component, functions as an interface between the query processor and the low-level file constructs.

    The query processor, which sits between the user interface and the object manager, follows the basic steps of answering a query.

    After a query is parsed, it is preprocessed to factor out common sub-expressions and to convert Lorel shorthands into a more OQL-like form.

    The logical query plan generator then creates a single logical query plan describing a very high-level execution strategy for the query.

    The logical query plan is transformed into physical query plans. The query optimizer selects the best physical query plan based on I/O cost and CPU cost.

    Top-down - Explores all Member sub-objects of DBGroup and, for each one, looks for the existence of an Age sub-object of the Member object whose value is less than 30.

    Bottom Up - First identify all objects that satisfy the y