Query Optimization for Semistructured Data
Jason McHugh, Jennifer Widom, Stanford University
- Rajendra S. Thapa
Path Expression
Simple Path Expression
– specifies a single navigation step in the database
DBGroup.Member y – variable y ranges over all Member-labeled subobjects of the DBGroup object
Path Expression – an ordered list of simple path expressions
DBGroup.Member x, x.Age y
– variable y ranges over all objects that can be reached by starting at the DBGroup object, following an edge labeled Member, and then following an edge labeled Age.
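As a rough illustration of the two definitions above, the following Python sketch evaluates a simple path expression and a path expression over a tiny edge-labeled graph. The graph contents and the names `EDGES`, `step`, and `evaluate_path` are hypothetical, chosen only for this example; they are not part of Lore.

```python
# Hypothetical edge-labeled graph: object id -> list of (label, child id).
EDGES = {
    "DBGroup": [("Member", "m1"), ("Member", "m2")],
    "m1": [("Name", "n1"), ("Age", "a1")],
    "m2": [("Name", "n2")],  # irregular data: this member has no Age
}

def step(obj, label):
    """Simple path expression: all label-labeled subobjects of obj."""
    return [child for (lbl, child) in EDGES.get(obj, []) if lbl == label]

def evaluate_path(start, labels):
    """Path expression: apply each simple path expression in order."""
    frontier = [start]
    for label in labels:
        frontier = [child for obj in frontier for child in step(obj, label)]
    return frontier

members = step("DBGroup", "Member")                  # DBGroup.Member x
ages = evaluate_path("DBGroup", ["Member", "Age"])   # DBGroup.Member x, x.Age y
```

Here `members` holds both member objects, while `ages` holds only the single Age object, because the second member has no Age edge.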
Query language
Query:
SELECT x
FROM DBGroup.Member x
WHERE exists y in x.Age: y<30
Result:
<Member>
<Name>Smith</Name>
<Age>28</Age>
<Office>Gates 252</Office>
<Office>
<Building>CIS</Building>
<Room>411</Room>
</Office>
</Member>
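The meaning of the query can be approximated in Python with the standard-library XML parser. This is only a sketch of the query semantics, not of how Lore executes it; the members Jones and Lee and the function name `select_young_members` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Smith comes from the slide; Jones and Lee are hypothetical extra members.
XML = """
<DBGroup>
  <Member><Name>Smith</Name><Age>28</Age><Office>Gates 252</Office></Member>
  <Member><Name>Jones</Name><Age>46</Age></Member>
  <Member><Name>Lee</Name></Member>
</DBGroup>
""".strip()

def select_young_members(root, limit=30):
    """SELECT x FROM DBGroup.Member x WHERE exists y in x.Age: y < limit."""
    result = []
    for x in root.findall("Member"):           # x ranges over Member subobjects
        ages = x.findall("Age")                # y ranges over x.Age
        if any(int(y.text) < limit for y in ages):
            result.append(x)
    return result

root = ET.fromstring(XML)
names = [m.findtext("Name") for m in select_young_members(root)]
```

A member without an Age subobject (Lee) simply fails the existential test; no error is raised, which is the behavior one wants for irregular data.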
Lore architecture
Textual Interface
Data Engine
Query Processing
Parsing
Preprocessor
Logical Query Plan Generation
Query Optimization
Physical Query Plan Generation
Execution of Physical Query Plan
Queries can be executed in many ways
Top down
Bottom Up
Hybrid
SELECT x FROM DBGroup.Member x
WHERE exists y in x.Age: y<30
[Figure: data graph rooted at A, with B and D subobjects and C leaves]
Top-down preferred
Query:
SELECT x
FROM A.B x
WHERE exists y in x.C: y = 5
• There is only one A.B.C path, so top-down explores only that path
• Bottom-up would visit all leaf objects with value 5 and their parents (leaf values in the figure: 5, 5, 5)
[Figure: data graph rooted at A, with several B subobjects and C leaves]
Bottom-up preferred
Query:
SELECT x
FROM A.B x
WHERE exists y in x.C: y = 5
• There are many A.B.C paths
• But only one leaf satisfies the predicate (leaf values in the figure: 5, 4, 4)
• Bottom-up is a good candidate
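The trade-off between the two strategies can be made concrete by counting how many objects each one touches. The sketch below uses two small hypothetical graphs shaped like the two cases described above; the exact graph shapes, the visit-counting rule, and the assumption that bottom-up can jump straight to leaves with value 5 (as a value index would allow) are simplifications for illustration.

```python
def top_down(edges, values, root="A"):
    """Follow B edges from the root, then C edges, testing each leaf value."""
    visited, result = 0, []
    for label, x in edges.get(root, []):
        if label != "B":
            continue
        visited += 1
        c_leaves = [y for (lbl, y) in edges.get(x, []) if lbl == "C"]
        visited += len(c_leaves)
        if any(values.get(y) == 5 for y in c_leaves):
            result.append(x)
    return result, visited

def bottom_up(edges, values, root="A"):
    """Start at leaves with value 5 and walk C, then B edges backwards."""
    parents = {}  # child -> list of (label, parent)
    for parent, out in edges.items():
        for label, child in out:
            parents.setdefault(child, []).append((label, parent))
    visited, result = 0, []
    for y, value in values.items():
        if value != 5:          # assumed: a value index skips these leaves
            continue
        visited += 1
        for label, x in parents.get(y, []):
            if label != "C":
                continue
            visited += 1
            if ("B", root) in parents.get(x, []) and x not in result:
                result.append(x)
    return result, visited

# One A.B.C path, but three leaves with value 5 (top-down preferred).
G1 = {"A": [("B", "b1"), ("D", "d1"), ("D", "d2")],
      "b1": [("C", "c1")], "d1": [("C", "c2")], "d2": [("C", "c3")]}
V1 = {"c1": 5, "c2": 5, "c3": 5}

# Many A.B.C paths, but only one leaf with value 5 (bottom-up preferred).
G2 = {"A": [("B", "b1"), ("B", "b2"), ("B", "b3")],
      "b1": [("C", "c1")], "b2": [("C", "c2")], "b3": [("C", "c3")]}
V2 = {"c1": 5, "c2": 4, "c3": 4}

td1, bu1 = top_down(G1, V1), bottom_up(G1, V1)
td2, bu2 = top_down(G2, V2), bottom_up(G2, V2)
```

Both strategies return the same answer on each graph; only the number of visited objects differs, which is what a cost-based optimizer has to estimate.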
Query Execution Engine
• Logical Query Plans
-logical query plan operators
- structure of the plan
• Physical Query Plans
-operators
- some physical plans
• Statistics and Cost Model
• Plan Enumeration
Logical operators
Discover
Chain
Glue
CreateTemp
Project
...
Logical Query plans
•Variable binding
a variable x in the query is said to be bound if an object o has been assigned to x
•Evaluation
an evaluation of a query plan (or sub-plan) is a list of all variables appearing in the plan, along with the object (if any) bound to each variable
•Rotation
Representation of a path expression in the logical query plan
Path expression: x.B y, y.C z, z.D v
Chain
  Chain
    Discover(x,"B",y)
    Discover(y,"C",z)
  Discover(z,"D",v)
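One way to picture this representation is as a small tree data structure. The sketch below parses a path expression into Discover nodes and links them with Chain nodes; the class names mirror the operator names, but the parsing helper and the choice to nest the Chain nodes left-deep are assumptions made for this example.

```python
from dataclasses import dataclass
from functools import reduce

@dataclass(frozen=True)
class Discover:
    src: str     # variable holding the parent object
    label: str   # edge label to follow
    dst: str     # variable bound to the subobject

@dataclass(frozen=True)
class Chain:
    left: object
    right: object

def parse_path_expression(text):
    """Turn 'x.B y, y.C z' into a list of Discover nodes (hypothetical helper)."""
    ops = []
    for part in text.split(","):
        step, dst = part.split()
        src, label = step.split(".")
        ops.append(Discover(src, label, dst))
    return ops

def left_deep_plan(ops):
    """Link Discover nodes with Chain nodes, nesting to the left (assumed shape)."""
    return reduce(Chain, ops)

plan = left_deep_plan(parse_path_expression("x.B y, y.C z, z.D v"))
```

Each Discover node binds its destination variable, so the right operand of every Chain can rely on a variable bound somewhere in the left operand.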
Complete logical query plan
SELECT x
FROM DBGroup.Member x
WHERE exists y in x.Age: y<30
[Figure: plan tree composed of the operators Project(t2), CreateTemp(x,t2), Glue, Glue, Chain, Name("DBGroup",t1), Discover(t1,"Member",x), Exists(y), Discover(x,"Age",y), and Select(y,<30)]
Operators
Scan(x, l, y)
Lindex(x, l, y)
Pindex(Path Expression, x)
Bindex(l, x, y)
Name(x, n)
Vindex(Op, Value, l, x)
...
Physical Query plans
[Figure: object x with three l-labeled edges to subobjects a, b, and c, so y = {a, b, c}]
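To make two of the physical operators more tangible, here is a minimal Python sketch of Scan and Lindex over an in-memory edge list. The semantics are an assumption inferred from the operator signatures and the figure: Scan(x, l, y) is taken to bind y to the l-labeled subobjects of x, and Lindex(x, l, y) to bind x to the parents of y reached through an l-labeled edge. The edge list and object ids are hypothetical.

```python
# Hypothetical edge list: (parent, label, child).
EDGES = [
    ("x1", "l", "a"),
    ("x1", "l", "b"),
    ("x1", "l", "c"),
    ("x1", "m", "d"),
]

def scan(x, label):
    """Scan(x, l, y): candidate bindings for y, given a bound x (assumed semantics)."""
    return [child for (parent, lbl, child) in EDGES
            if parent == x and lbl == label]

def lindex(y, label):
    """Lindex(x, l, y): candidate bindings for x, given a bound y (assumed semantics)."""
    return [parent for (parent, lbl, child) in EDGES
            if child == y and lbl == label]

children = scan("x1", "l")
parents = lindex("a", "l")
```

Scan moves forward along an edge and Lindex moves backward along it, which is what lets the optimizer choose between top-down and bottom-up evaluation. A real implementation would use an index rather than a linear pass over the edges.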
Some physical plans for a simple logical query plan
Path expression: A.B x, x.C y
Logical query plan:
Chain
  Discover(A,"B",x)
  Discover(x,"C",y)
Physical plans:
Scan plan: Scan(A,"B",x) and Scan(x,"C",y), combined with NLJ
Lindex plan: Lindex(x,"C",y), Lindex(t,"B",x), and Name(t, A), combined with NLJ
Bindex plan: Bindex("B",t,x), Name(t, A), and Scan(x,"C",y), combined with NLJ
Pindex plan: Pindex("A.B x, x.C y", y)
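The Scan plan can be read as a nested-loop join in which the outer operator binds x and the inner operator is re-run for each binding. The following sketch shows that reading in Python; the graph, the `scan` helper, and the dictionary form of an evaluation are hypothetical choices for this example.

```python
# Hypothetical data graph: object id -> list of (label, child id).
GRAPH = {
    "A": [("B", "b1"), ("B", "b2"), ("D", "d1")],
    "b1": [("C", "c1"), ("C", "c2")],
    "b2": [("E", "e1")],
    "d1": [("C", "c3")],
}

def scan(x, label):
    """All label-labeled subobjects of x."""
    return [child for (lbl, child) in GRAPH.get(x, []) if lbl == label]

def scan_plan(root="A"):
    """NLJ over Scan(A,"B",x) and Scan(x,"C",y): yields one evaluation per match."""
    for x in scan(root, "B"):        # outer: bind x
        for y in scan(x, "C"):       # inner: bind y using the current x
            yield {"x": x, "y": y}

evaluations = list(scan_plan())
```

The object c3 is never reached because its parent hangs off a D edge, and b2 contributes nothing because it has no C subobject.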
How physical plans are produced
• Each logical plan node creates an optimal physical plan given a set of bound variables.
• During plan enumeration we track:
1. Whether the variable is bound or not
2. Which plan operator has bound the variable
3. All other plan operators that use the variable
4. Whether the variable is stored within a temporary result.
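The four tracked items map naturally onto a small per-variable record. The sketch below is one possible bookkeeping structure, not Lore's actual implementation; the names `VariableInfo` and `BindingTable` and the operator strings are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VariableInfo:
    bound: bool = False                 # 1. is the variable bound?
    bound_by: Optional[str] = None      # 2. operator that bound it
    used_by: list = field(default_factory=list)  # 3. operators that use it
    in_temp: bool = False               # 4. stored in a temporary result?

class BindingTable:
    """Hypothetical table consulted while enumerating physical plans."""

    def __init__(self):
        self.vars = {}

    def _get(self, name):
        return self.vars.setdefault(name, VariableInfo())

    def bind(self, name, operator):
        info = self._get(name)
        info.bound = True
        info.bound_by = operator

    def use(self, name, operator):
        self._get(name).used_by.append(operator)

    def store_in_temp(self, name):
        self._get(name).in_temp = True

table = BindingTable()
table.bind("x", 'Scan(t1,"Member",x)')
table.use("x", 'Scan(x,"Age",y)')
table.store_in_temp("x")
table.use("y", "Select(y,<30)")   # y is used here but not yet bound
```

With such a table, a plan node can check whether the variables it needs are already bound before deciding which physical operator is applicable.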
How physical plans are produced
SELECT x
FROM DBGroup.Member x
WHERE exists y in x.Age: y<30
Logical plan
Statistics and Cost Model
• Each physical plan is assigned a cost based on the estimated I/O and CPU time required to execute the plan.
• The costing procedure is recursive.
• To decide which of two plans is cheaper, estimated I/O is compared first, then CPU time.
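A recursive costing procedure with an I/O-first comparison can be sketched as follows. Two simplifications are assumed here and are not taken from the slides: a node's cost is the plain sum of its own cost and its children's costs, and "I/O first, then CPU" is read as a lexicographic comparison. All cost numbers are invented.

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    name: str
    io: int                      # estimated I/O for this node alone
    cpu: int                     # estimated CPU time for this node alone
    children: list = field(default_factory=list)

def cost(node):
    """Recursively total (io, cpu) over the subtree (additive model, assumed)."""
    io, cpu = node.io, node.cpu
    for child in node.children:
        child_io, child_cpu = cost(child)
        io += child_io
        cpu += child_cpu
    return (io, cpu)

def cheaper(a, b):
    """Tuple comparison checks I/O first and falls back to CPU on a tie."""
    return a if cost(a) <= cost(b) else b

scan_plan = PlanNode("NLJ", 0, 1, [PlanNode("Scan", 10, 5),
                                   PlanNode("Scan", 20, 5)])
pindex_plan = PlanNode("Pindex", 3, 2)
best = cheaper(scan_plan, pindex_plan)

tie_a = PlanNode("P1", 5, 4)
tie_b = PlanNode("P2", 5, 2)
tie_winner = cheaper(tie_a, tie_b)
```

A realistic model would not simply add child costs (for example, the inner side of a nested-loop join is executed once per outer binding), but the recursion and the comparison order carry over.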
Performance Result: Experiment 1
A simple query
SELECT DBGroup.Movie.Title
- 11 different query plans
- the best plan uses Lore's path index to quickly locate all the movie titles
- the second plan is the top-down strategy
- the worst plan uses Bindex operators and hash joins
Performance Result: Experiment 2
Same query with a Genre subobject having value 'Comedy'
- point query
Performance Result: Experiment 3
- Same point query
- not all possible plans are executed
- different plans were generated by disallowing the use of particular operators or indexes
Future Work
• Optimization techniques for branching path expressions
– a query rewrite that moves Where clause predicates into the From clause, and a transformation that introduces a Group-by clause when a large number of paths pass through a small number of objects
• Partially correlated sub-plans
– similar to correlated subqueries, but rely on the bindings passed between portions of the physical query plan rather than on the query itself
• In the area of statistics
– efficient statistics-gathering algorithms
– statistics about the location of objects on disk
– modifications to the cost formulas to generate more accurate cost estimates