tree-pattern queries on a lightweight xml processor
DESCRIPTION
Tree-Pattern Queries on a Lightweight XML Processor. MIRELLA M. MORO Zografoula Vagena Vassilis J. Tsotras. Research partially supported by CAPES, NSF grant IIS 0339032, UC Micro, and Lotus Interworks. Outline. Motivation and Contributions Background Method Categorization - PowerPoint PPT PresentationTRANSCRIPT
Tree-Pattern Queries on a Lightweight XML Processor
MIRELLA M. MORO
Zografoula Vagena
Vassilis J. Tsotras
Research partially supported by CAPES, NSF grant IIS 0339032, UC Micro, and Lotus Interworks
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 2
Outline
Motivation and Contributions Background Method Categorization Experimental Evaluation Conclusions
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 3
Motivation
XML query languages: selection on both value and structure “Tree-pattern” queries (TPQ) very common in XML
Many promising holistic solutions
None in lightweight XML engines Without optimization module (e.g. eXist, Galax) Effective, robust processing method
Reasons: No systematic comparison of query methods under a common
storage model No integration of all methods under such storage model
Context: XPath semantics, stored data (indexed at will)
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 4
Contributions
TPQ methods over unified environment Method Categorization: data access patterns and
matching algorithm Common storage model + integration of all methods
Capture the access features Permit clustering data with off-the-shelf access methods
(e.g. B+tree) Novel variations of methods using index structures +
Handle TPQ Extensive comparative study
Synthetic, benchmark and real datasets Decision in the applicability, robustness and efficiency
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 5
article
author
last
procs
conf
Background
article (2,19)
last(7,9)
2<7<9<19
Bib (1,20)
article (2,19)
title(3,5)
procs(14,18)
author(6,13)
last(7,9)
first(10,12)
David J.(11)
DeWitt(8)
conf(15,17)
t1(4)
VLDB(16)
XML database = forest of unranked, ordered, node-labeled trees, one tree per document
TPQ
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 6
Common Storage Model
B+ Tree on ( tag, initial )
bib (1,16)
book (2,9) (10,17)
author (3,8) (11,16) (19,24)
name (4,5) (12,13) (20,21)
paper (18,25)
address (6,7) (14,15) (22,23)
bib(1,26)
book (2,9) paper (18,25)
author (3,8) author (19,24)
name(4,5)
address(6,7)
name(20,21)
address(22,23)
book (10,17)
author(11,16)
name(12,13)
address(14,15)
Input = sequence (list) of elements One list per document tag = element list
Node clustering by index structures Numbering scheme
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 7
Method Categorization
Parameters: access pattern and matching algorithm
(1) set based techniques
(2) query driven
(3) input driven
(4) structural summaries
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 8
Cat 1: Set-based Techniques
Access Pattern Matching Process
Sorted/indexedJoin sets,
merge individual paths
Input: sequences of elements, one list per query node element, possibly indexed (set-based)
Major representative: TwigStack Optimal XML pattern matching algorithm (ancestor/descendant)
Stack-based processing Set of stacks = compact encoding of partial and total results in
linear space (possibly exponential number of answers)
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 9
TwigStack + Indexes
B+tree, built on the left attribute From ancestor: probe descendants: skip initial nodes Ancestor skipping not effective (up to 1st element that follows)
XB-tree: on (left,right) bounding segment XR-tree: on (left,right), B+tree with complex index key + stab lists
A comparative study* shows that Skipping ancestors: XBTree better (XBTree size is smaller) Recursive level of ancestors: XBTree better again
Searching on stab lists of XR-tree is less efficient Plain B+tree: skips descendants, BUT not ancestors XBTwigStack is our choice
* H.Li et al. “An Evaluation of XML Indexes for Structural Joins”. Sigmod Record, 33(3), Sept 04
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 10
Cat 2: Query Driven Techniques
Processing: the query defines the way input is probed
Major representatives: ViST and PRIX Specific details: significantly different Same strategy
Convert both document and query to sequences Processing query = subsequence matching
Access Pattern Matching Process
Indexed/random Incremental construction of each result instance
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 11
ViST and PRIX
Recursively identify matches = quadratic time Optimize the naïve solution:
Identify candidate nodes for each matching step Index structures to cluster those candidates
Subsequence matching process = a plan consisting of INLJ among relations, each of which groups document nodes with the same label
For a given query, joins sequence statically defined by the sequencing of the query
INLJ plans are a superset of the static plans that PRIX and VIST use
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 12
ViST x PRIX x INLJ
Percentage of nodes processed by each algorithm INLJ: best plan
Dataset #nodes VIST PRIX INLJ
100% 100 100 100
LEAVES: 80% 100 84.23 84.20
LEAVES: 1% 100 1.33 1.32
ROOT: 80% 84.22 100 84.18
ROOT: 1% 1.33 100 1.33
INTERNAL: 80% 89.48 89.49 84.20
INTERNAL: 1% 34.24 34.22 1.64
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 13
INLJ : improved B+tree
TPQ evaluation of relational plan Independence of the ordered XML model Total avoidance of false positives
a 1,52
b32,41
b34,37
a33,40
c 38,39
c35,36
b 2,31 b42,51b elem. list
33
34,41 42,51 2,31 32,41
Consider b//cStarting from c
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 14
Cat 3: Input Driven Techniques
Processing: at each point, the flow of computation is guided entirely by the input through a Finite State Machine (DFA/NFA)
Advantages Each node processed only once Simplicity, sequential access pattern
Problem: skipping elements
Access Pattern Matching Process
Sequential Input drives computation, merge individual paths
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 15
SingleDFA and IdxDFA
SingleDFA <element> triggers the DFA, choosing next state </element>: execution backtracks to when start processed TPQ matching: intermediate results compacted on stacks
Experiments show reading whole input = not enough
Speeding up navigation: IdxDFA Instead of reading sequentially: use indexes and
skip descendants
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 16
IdxDFA: example
c1
b2 a3
c4 d6
c5 d7
b9
d9 c10
d11 a12
c16 d6b13
d14 c15
b21
c22
a
b
c d
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 17
IdxDFA: example
c1
b2 a3
c4 d6
c5 d7
b9
d9 c10
d11 a12
c16 d6b13
d14 c15
b21
c22
b9c4 d6
c5 d7
a
b
c d
d9 c10
d11 a12
b13
d14 c15
c16 d6 b21
c22
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 18
Cat 4: Graph Summary Evaluation
Structural summary: index node identifies a group of nodes in the document
Processing: identify index nodes that satisfy the query + post processing filtering
Beneficial: when there is a reasonable structural index, much smaller than document
Problem: graph size comparable/larger than original document
Access Pattern Matching Process
Indexed/Random Merge-join partitioned input,merge individual paths
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 19
Categories Summary
Access Pattern
Matching Process Methods
Set Based Sorted/Indexed
Join sets, merge individual paths
Twigstack /XB, B+tree, XR-tree
Query Driven
Indexed/random
Incremental construction of each result instance
(ViST, PRIX)INLJ
Input Driven
Sequential Input drives computation, merge individual paths
SingleDFA, IdxDFA
Structural Summary
Indexed/random
Merge-join partitioned input, merge individual paths
Structural indexes
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 20
Experimental Evaluation
1. Experiments with real datasets
2. Experiments with synthetic datasets Further analyze each method Characterize the methods according to specific
features available in each custom dataset
3. More sets of experiments Closely verify XBTWIGSTACK versus INLJ
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 21
Algorithms using the same API Analysis varying structure and selectivity Performance measure = total time required to
compute a query Number of nodes as secondary information
Intel Pentium 4 2.6GHz, 1Gb ram Berkeley DB: 100 buffers, page size 8Kb, B+ tree Real/benchmark datasets
XMark (Internet auction, 1.4 GB raw data, ± 17 million nodes), Protein Sequence Database
Setup
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 22
XMark
0
5
10
15
20
25
30
35
40
X1 X2 X4 X6
Queries
Tim
e (
sec)
XBTwigStack
SingleDFA
IdxDFA
INLJ
StrIdx
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 23
Custom Data
Goal: isolate important features Query //a//b[.//c]//d
Simple enough for detailed investigation Complex enough to provide large number of
different data access possibilities Vary selectivity of each element separately Add recursion to key elements (root, leaf)
a
b
c d
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 24
Custom Data
0
5
10
15
20
25
30
D1:100 D2:80 D3:50 D4:10 D5:01
Dataset:Selectivity
Tim
e (
sec)
XBTwigStack
IdxDFA
INLJ
a
b
c d
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 25
Custom Data
0
5
10
15
20
25
30
D1:100 D6:80 D8:10 D10:80 D12:10
Dataset:Selectivity
Tim
e (
sec)
XBTwigStack
IdxDFA
INLJ
a
b
c d
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 26
XBTwigStack x INLJ
On large dataset, 40mi nodes, 1Gb, 1% selectivity Difference of 40s between XBTwig and INLJ best plan
0
50
100
150
200
250
R40
Tim
e (
se
c)
ABCD
ABDC
BACD
BADC
BCAD
BCDA
BDAC
BDCA
CBAD
CBDA
DBAC
DBCA
XBTwig
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 27
0
25
50
75
100
125
150
175
200
225
250
275
300
325
350
R1 R3 R4 R7 R9
ABCD ABDC BACD BADC BCAD BCDA BDAC BDCA CBAD CBDA DBAC DBCA XBTWIG
XBTwigStack x INLJ
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 28
Conclusions
Categorization of TPQ processing algorithms Adaptations for processing TPQ
DFA + accessing nodes from B+tree INLJ + ancestor skipping
DFA-based improved, IdxDFA, not enough Structural summary available and smaller than
document: StrIdx XBTwigStack: most robust and predictable
INLJ when high selectivity: no guarantee about chosen plan without optimizer module
Questions?
EXTRA SLIDES
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 31
Bib (1,36)
article (2,19)
title(3,5)
procs(14,18)
author(6,13)
last(7,9)
first(10,12)
David J.(11)
DeWitt(8)
conf(15,17)
t1(4)
article(20,35)
title(21,23)
author(24,31)
last(25,27)
first(28,30)
procs(32,34)
Hongjun(29)
Lu(26)
conf(33)
t2(22)
article
author
last
procs
conf
Background
Region numbering scheme : (left, right)
VLDB(16)
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 32
TwigStack
a1
b1
b2
a2
c2
c1
a
b
c
a1
Sa Sb Sc
a2
b1
b2
c1c2
a2 b2 c1
a1 b1 c1
a1 b1 c2
a1 b2 c1
1) solutions individual root-to-leaf paths2) merge-join those partial solutions→ before adding element to stack:
(i) the node has a descendant on each of the query children streams
(ii) each of those descendant nodes recursively satisfies this property
→ optimized by indexes
doc query
results
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 33
TwigStack + Indexes
B+-tree: built on the left attribute Access ancestor then probe descendant stream
to skip unmatchable initial nodes Ancestor skipping not effective:
Skip only up to the first element following a given one
XB-tree: index on (left,right) bounding segment Pointer to children (region completely included in parent) Leaves sorted on left Region: ancestor access effective
XR-tree: index on (left,right) = B+tree with complex index key + stab lists Ancestor skipping: elements stabbed by left
b1
b2
b3
a1
c2
c1
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 34
ViST, Virtual Suffix Tree
Input: sequence of (symbol, path) pairs
(a1,)(b1,a1)(a2,a1b1)(b2,a1b1a2)(c1,a1b1a2b2)(c2, a1b1a2) Document and query translated Virtual suffix tree (B+-tree) indexed left
Processing Structural query = find (non-contiguous)
subsequence matches → suffix tree Benefit: query as a whole instead of merging parts
One query path per time Efficient when query top defines the results
a1
b1
b2
a2
c2
c1
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 35
ViST, index
Virtual suffix tree B+tree, nodes indexed on the left position D- ancestor and S-ancestor
a a
b c
b
a
c
c
1,13
2,4
3
5,7
6 9,11
10
8,12
D-Ancestor
(c,bac)
B+
(b,)
(a,b)
(b,ba)
(c,ba)
S-Ancestor
1,13
69,11
2,45,78,12
3
10
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 36
A18
B11
B4
B14
B17
C6
C1
F10
C13
D3
D8
D16
Document
v5v
0v
9 v
12 v
2 v7
v15
A
B
C D
N N
Query
(A,ε ) (B,A)(C,B)(D,B)
Query Sequence
ViST
(A,ε)1(B,A)
2(C,AB)
3(D,AB)
4(B,A)
5(C,AB)
6(D,AB)
7(F,AB)
8(B,A)
9(C,AB)
10(B,A)
11(D,AB)
12
(A,ε)1(B,A)
2(C,AB)
3(D,AB)
4, (A,ε)
1(B,A)
2(C,AB)
3(D,AB)
7
(A,ε)1(B,A)
2(C,AB)
3(D,AB)
12, (A,ε)
1(B,A)
2(C,AB)
6(D,AB)
7
(A,ε)1(B,A)
2(C,AB)
6(D,AB)
12, (A,ε)
1(B,A)
2(C,AB)
10(D,AB)
12
(A,ε)1(B,A)
5(C,AB)
6(D,AB)
7, (A,ε)
1(B,A)
5(C,AB)
6(D,AB)
12
(A,ε)1(B,A)
5(C,AB)
10(D,AB)
12, (A,ε)
1(B,A)
9(C,AB)
10(D,AB)
12
ViST
ViST SubsequenceMatching
FinalFiltering
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 37
ViST, algorithm
Q = q1, … qk, query sequenceD-tree B+ index of (symbol, prefix)S-tree B+ index of region labels
function Search (region, i)if i < |Q|
T = retrieve qi S-tree from D-treeN = retrieve from S-tree all nodes
in the range regionfor each node c(left,right) N
Search ( (left,right), i+1)else return result
Search ( (1,13), (a,b) )
Search ( (1,13), (b,) )
Search ( (2,4), (c,ba) )
(c,bac)
B+
(b,)
(a,b)
(b,ba)
(c,ba)
1,13
69,11
2,45,78,12
3
10
Search ( (5,7), (c,ba) )Search ( (8,12), (c,ba) )
b
a
c
Q = (b,) (a,b) (c,ba)
a a
b c
b
a
c
c
1,13
2,4
3
5,7
6 9,11
10
8,12
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 38
ViST, access order
Search ( (1,13), (a,b) )
Search ( (1,13), (b,) )
Search ( (2,4), (c,ba) )
(c,bac)
B+
(b,)
(a,b)
(b,ba)
(c,ba)
1,13
69,11
2,45,78,12
3
10
Search ( (5,7), (c,ba) )Search ( (8,12), (c,ba) )
Q = (b,) (a,b) (c,ba)
a a
b c
b
a
c
c
1
2
X
4
5 7
6
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 39
ViST, discussion
Worst-case storage requirement for D-Ancestor is > linear in #elements E.g. unary tree with n nodes, sequence O(n2)
False alarms
Our implementation: no false alarms //a[//b]//c unordered
Vist: (a, )(b,a)(c,a) & (a, )(c,a)(b,a) Our implementation: run the twig query only once
a
b c
d e f d
a
b b
d e
a
b
d e
D1 = (a,) (b,a) (d,ab) (e,ab) (c,a) (f,ac) (d,ac)
D2 = (a,) (b,a) (d,ab) (b,a) (e,ab)
Q = (a,) (b,a) (d,ab) (e,ab)
a
b c
a
c b
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 40
PRIX, PRüfer seqs. for Indexing XML
Input: sequence of labels Document & query mapped by Prüfer’s method Tree → sequence: remove one node at a time
Processing Sequence matching against indexed db: filter non-matches Refinement phases: filter twig-matches, the results:
Form a tree, satisfy the twig query, include the leaf nodes
LPS = A C B C C B A C A E E E D ANPS = 15 3 7 6 6 7 15 9 15 13 13 13 14 15
(Any numbering scheme, here is post-order)
Bottom-up approach
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 41
PRIX, Processing
Problems Complex solution //a[//b]//c unordered: same problem as ViST
What we do Region based numbering scheme and XB-tree Bottom-up traversal of the query + subtwig
merging Access nodes in the same order
Efficient when query bottom defines results
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 42
A(k) index
A(k) → k is the degree of similarity, “size of common path” k k-bisimilarity
1) for any two nodes u and v, u 0 v iff u and v have same label2) u k v iff u k-1 v and for every parent u’ of v’, there is a parent v’ of v s.t. u’ k-1 v’ and vice-versa
2 3
4 5
6 7
1 A
D D
C
E
B
E
2 3
4,5
6,7
1 A
D
CB
E
A(0) A(1)
2 3
4 5
1 A
D D
CB
6,7 E
2 3
4 5
6 7
1 A
D D
C
E
B
E
A(2)Original document
UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 43
Protein
0
10
20
30
40
50
60
P1 P2 P4 P5 P6
Queries
Tim
e (s
ec) XBTwigStack
SingleDFA
IdxDFA
INLJ
StrIdx