vldb'02, aug 20 efficient structural joins on indexed xml1 efficient structural joins on...

23
VLDB'02, Aug 20 Efficient Structural Join s on Indexed XML 1 Efficient Structural Joins on Indexed XML Documents Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, Carlo Zaniolo VLDB 2002

Upload: vivian-clarke

Post on 02-Jan-2016

218 views

Category:

Documents


1 download

TRANSCRIPT

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

1

Efficient Structural Joins on Indexed XML Documents

Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras,

Carlo Zaniolo

VLDB 2002

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

2

Overview

MotivationProblem DescriptionStructural JoinsStructural Joins using B+-treesStructural Joins using R-treesProblem VariationsExperimental Results

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

3

Motivation (1)

Query languages for XML qualify documents for retrieval both by their

structure and the values of their elements.

Example:

section[title=“Overview”]//figure[caption=“R-tree”]

(path-expression query)

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

4

Motivation (2)

When the XML document is combined with a numbering scheme, path expression queries require the computation of structural joins.

Numbering Schemes Each node is assigned a unique interval. The intervals of a parent node contains

the intervals of all its children.

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

5

Motivation (2)

When the XML document is combined with a numbering scheme, path expression queries require the computation of structural joins.

From path expressions to structural join: two nodes qualify for a path expression query if one is an ancestor of the other. With intervals, this is equivalent to containment.

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

6

Problem Description

• Structural Join: Let A and D be two lists containing the instances of two particular tags in an XML document, join A and D using their containment associations as the join condition.

• [Al-Khalifa, etc. 2002] proposed non-indexed structural join algorithms.

• We extend their algorithms to take advantage of existing indices on the two lists.

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

7

Structural Joins, no indices1. Let a, d be the first elements of A and D

2. while (A, D are not empty or the stack is not empty) do

3. if (a.start > stack.top and d.start > stack.top) then

4. stack.pop()

5. else if (a.start < d.start) then

6. stack.push(a)

7. Let a be the next element in A

8. else

9. output d as descendant of all elements in stack

10. let d be the next element in D

11. endif

12. endwhile

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

8

Example

a1

d1

d2

a2 a4

a3 d3

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

9

Structural Joins using B+-trees

Existing structural join algorithms sequentially scan the input lists.Durable numbering schemes have enabled indexing of XML files with mainstream indices.Such indices can result in sub-linear access time as they provide the facility to skip elements that don’t participate in the join.

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

10

Motivation for using the B+-tree index (1)

a1

d1 d2

a2 a12

a3 a4 a8

a5 a9

a6 a7 a10a11

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

11

Motivation for using the B+-tree index (2)

d2

d3 d13

d4 d5 d9

d6d10

d7 d8 d11d12

a1 a2

d1d14

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

12

Structural Joins using B+-trees1. Put pointers a and d at the beginning of lists A and D

2. while ( not at the end of A or D ) do

3. if ( a is an ancestor of d ) then

4. Push into stack all elements in A that are ancestors of d

5. Join d with all elements in stack and let d=d->next

6. else if ( a.start < d.start ) then // jump ancestor A

7. Pop all elements in stack which are before d

8. Move a forward by skipping sub-trees of last element popped

12. else // a is after d; jump descendant D

13. Join d with all elements in stack

14. Move d forward by skipping all D elements with start<a.start

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

13

Containment forest

Structure linking elements that belong to the same tag.Each element corresponds to a node in the structure and is linked to other elements via parent, first-child and right-sibling pointers.Can be embedded within the associated B+-treeImproves CPU time

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

14

Containment forest example

A(150,250)

A(10,500)

A(800,900)

A(1400,2000)

A(300,400)

A(830,860)

A(1530,1560)

A(1700,1800)

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

15

Containment forest properties

The (start, end) interval of each node contains all intervals in its subtree.The start numbers in the forest follow a preorder traversal.The start (end) numbers of sibling nodes are in increasing order.

Containment forest can be dynamically maintained. Efficient algorithms for element

insertion/deletion

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

16

Structural Join using R-trees (1)

The interval (start, end) of an element can be mapped to a point (e.start, e.end) in the 2-D space which is then indexed by an R-tree.An R-tree can also be used to index the element (start, end) ranges as 1-D intervals

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

17

Structural Join using R-trees (2)

two points two pages

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

18

Problem Variations

Self Joins non-indexed algorithm that traverses the

element list exactly once

Structural Join in a pipelining environment

Feedback between modules can help to skip elements that don’t take part in the join

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

19

Performance Analysis (1)

Effect of skipping only ancestors in join performance

Join Ancestor

s

no-index B+ B+psp B+sp R* R*2

90% 182 180 180 190 230 228

70% 150 149 150 155 198 196

55% 132 130 130 140 176 178

40% 109 108 108 114 160 156

25% 86 84 84 90 132 130

10% 74 67 67 70 122 119

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

20

Performance Analysis (2)

Effect of skipping only descendants in join performance

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

21

Performance Analysis (3)

Effect of skipping both ancestors and descendants

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

22

Performance Analysis (4)

Comparison of B+-tree and B+psp algorithms

VLDB'02, Aug 20 Efficient Structural Joins on Indexed XML

23

Conclusions

We presented efficient ways to perform structural joins over XML data utilizing existing indices.Experimental results showed that among the indexed approaches, the B+-tree with sibling pointers performs the best.Easily maintainable solution that provided drastic improvement over no-index case.