vldb'02, aug 20 efficient structural joins on indexed xml1 efficient structural joins on...
TRANSCRIPT
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
1
Efficient Structural Joins on Indexed XML Documents
Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras,
Carlo Zaniolo
VLDB 2002
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
2
Overview
MotivationProblem DescriptionStructural JoinsStructural Joins using B+-treesStructural Joins using R-treesProblem VariationsExperimental Results
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
3
Motivation (1)
Query languages for XML qualify documents for retrieval both by their
structure and the values of their elements.
Example:
section[title=“Overview”]//figure[caption=“R-tree”]
(path-expression query)
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
4
Motivation (2)
When the XML document is combined with a numbering scheme, path expression queries require the computation of structural joins.
Numbering Schemes Each node is assigned a unique interval. The intervals of a parent node contains
the intervals of all its children.
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
5
Motivation (2)
When the XML document is combined with a numbering scheme, path expression queries require the computation of structural joins.
From path expressions to structural join: two nodes qualify for a path expression query if one is an ancestor of the other. With intervals, this is equivalent to containment.
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
6
Problem Description
• Structural Join: Let A and D be two lists containing the instances of two particular tags in an XML document, join A and D using their containment associations as the join condition.
• [Al-Khalifa, etc. 2002] proposed non-indexed structural join algorithms.
• We extend their algorithms to take advantage of existing indices on the two lists.
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
7
Structural Joins, no indices1. Let a, d be the first elements of A and D
2. while (A, D are not empty or the stack is not empty) do
3. if (a.start > stack.top and d.start > stack.top) then
4. stack.pop()
5. else if (a.start < d.start) then
6. stack.push(a)
7. Let a be the next element in A
8. else
9. output d as descendant of all elements in stack
10. let d be the next element in D
11. endif
12. endwhile
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
9
Structural Joins using B+-trees
Existing structural join algorithms sequentially scan the input lists.Durable numbering schemes have enabled indexing of XML files with mainstream indices.Such indices can result in sub-linear access time as they provide the facility to skip elements that don’t participate in the join.
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
10
Motivation for using the B+-tree index (1)
a1
d1 d2
a2 a12
a3 a4 a8
a5 a9
a6 a7 a10a11
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
11
Motivation for using the B+-tree index (2)
d2
d3 d13
d4 d5 d9
d6d10
d7 d8 d11d12
a1 a2
d1d14
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
12
Structural Joins using B+-trees1. Put pointers a and d at the beginning of lists A and D
2. while ( not at the end of A or D ) do
3. if ( a is an ancestor of d ) then
4. Push into stack all elements in A that are ancestors of d
5. Join d with all elements in stack and let d=d->next
6. else if ( a.start < d.start ) then // jump ancestor A
7. Pop all elements in stack which are before d
8. Move a forward by skipping sub-trees of last element popped
12. else // a is after d; jump descendant D
13. Join d with all elements in stack
14. Move d forward by skipping all D elements with start<a.start
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
13
Containment forest
Structure linking elements that belong to the same tag.Each element corresponds to a node in the structure and is linked to other elements via parent, first-child and right-sibling pointers.Can be embedded within the associated B+-treeImproves CPU time
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
14
Containment forest example
A(150,250)
A(10,500)
A(800,900)
A(1400,2000)
A(300,400)
A(830,860)
A(1530,1560)
A(1700,1800)
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
15
Containment forest properties
The (start, end) interval of each node contains all intervals in its subtree.The start numbers in the forest follow a preorder traversal.The start (end) numbers of sibling nodes are in increasing order.
Containment forest can be dynamically maintained. Efficient algorithms for element
insertion/deletion
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
16
Structural Join using R-trees (1)
The interval (start, end) of an element can be mapped to a point (e.start, e.end) in the 2-D space which is then indexed by an R-tree.An R-tree can also be used to index the element (start, end) ranges as 1-D intervals
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
17
Structural Join using R-trees (2)
two points two pages
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
18
Problem Variations
Self Joins non-indexed algorithm that traverses the
element list exactly once
Structural Join in a pipelining environment
Feedback between modules can help to skip elements that don’t take part in the join
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
19
Performance Analysis (1)
Effect of skipping only ancestors in join performance
Join Ancestor
s
no-index B+ B+psp B+sp R* R*2
90% 182 180 180 190 230 228
70% 150 149 150 155 198 196
55% 132 130 130 140 176 178
40% 109 108 108 114 160 156
25% 86 84 84 90 132 130
10% 74 67 67 70 122 119
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
20
Performance Analysis (2)
Effect of skipping only descendants in join performance
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
21
Performance Analysis (3)
Effect of skipping both ancestors and descendants
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
22
Performance Analysis (4)
Comparison of B+-tree and B+psp algorithms
VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML
23
Conclusions
We presented efficient ways to perform structural joins over XML data utilizing existing indices.Experimental results showed that among the indexed approaches, the B+-tree with sibling pointers performs the best.Easily maintainable solution that provided drastic improvement over no-index case.