Tree-Pattern Queries on a Lightweight XML Processor MIRELLA M. MORO Zografoula Vagena Vassilis J. Tsotras Research partially supported by CAPES, NSF grant IIS 0339032, UC Micro, and Lotus Interworks

Page 1: Tree-Pattern Queries on a Lightweight XML Processor

Tree-Pattern Queries on a Lightweight XML Processor

MIRELLA M. MORO

Zografoula Vagena

Vassilis J. Tsotras

Research partially supported by CAPES, NSF grant IIS 0339032, UC Micro, and Lotus Interworks

Page 2: Tree-Pattern Queries on a Lightweight XML Processor

UC Riverside Tree-Pattern Queries on a Lightweight XML Processor 2

Outline

Motivation and Contributions
Background
Method Categorization
Experimental Evaluation
Conclusions

Page 3: Tree-Pattern Queries on a Lightweight XML Processor

Motivation

XML query languages: selection on both value and structure
"Tree-pattern" queries (TPQ) are very common in XML
Many promising holistic solutions
None in lightweight XML engines
  Without an optimization module (e.g. eXist, Galax)
  Need: an effective, robust processing method
Reasons:
  No systematic comparison of query methods under a common storage model
  No integration of all methods under such a storage model
Context: XPath semantics, stored data (indexed at will)

Page 4: Tree-Pattern Queries on a Lightweight XML Processor

Contributions

TPQ methods over a unified environment
  Method categorization: data access patterns and matching algorithm
  Common storage model + integration of all methods
    Captures the access features
    Permits clustering data with off-the-shelf access methods (e.g. B+tree)
  Novel variations of methods using index structures + handling TPQ
Extensive comparative study
  Synthetic, benchmark and real datasets
  Conclusions on applicability, robustness and efficiency

Page 5: Tree-Pattern Queries on a Lightweight XML Processor

Background

XML database = forest of unranked, ordered, node-labeled trees, one tree per document

Example document, nodes labeled (left,right):

Bib (1,20)
  article (2,19)
    title (3,5)
      t1 (4)
    author (6,13)
      last (7,9)
        DeWitt (8)
      first (10,12)
        David J. (11)
    procs (14,18)
      conf (15,17)
        VLDB (16)

Ancestor test: article (2,19) is an ancestor of last (7,9) because 2 < 7 < 9 < 19

TPQ (query tree nodes): article, author, last, procs, conf
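The region labels in this example can be reproduced with a short Python sketch. The nested-tuple tree encoding and the helper names `number_regions` and `is_ancestor` are assumptions made for illustration; they are not taken from the slides.

```python
from itertools import count

# Hypothetical encoding: an element is (label, [children]); a text value is a str.
BIB = ("Bib", [
    ("article", [
        ("title", ["t1"]),
        ("author", [("last", ["DeWitt"]), ("first", ["David J."])]),
        ("procs", [("conf", ["VLDB"])]),
    ]),
])

def number_regions(node, ticks=None, out=None):
    """Return [(label, left, right), ...] in document order."""
    ticks = count(1) if ticks is None else ticks
    out = [] if out is None else out
    if isinstance(node, str):            # text value: a single position
        pos = next(ticks)
        out.append((node, pos, pos))
        return out
    label, children = node
    left = next(ticks)                   # position of the start tag
    slot = len(out)
    out.append(None)                     # reserve the slot to keep document order
    for child in children:
        number_regions(child, ticks, out)
    out[slot] = (label, left, next(ticks))   # position of the end tag
    return out

def is_ancestor(anc, desc):
    """Region containment test on (label, left, right) triples."""
    return anc[1] < desc[1] and desc[2] < anc[2]

regions = {label: (label, left, right) for label, left, right in number_regions(BIB)}
print(regions["article"], regions["last"])
print(is_ancestor(regions["article"], regions["last"]))
```

With this labeling, an ancestor/descendant check needs only two integer comparisons and never touches the tree itself.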

Page 6: Tree-Pattern Queries on a Lightweight XML Processor

Common Storage Model

Input = sequence (list) of elements
One list per document tag = element list
Node clustering by index structures
Numbering scheme

Example document, nodes labeled (left,right):

bib (1,26)
  book (2,9)
    author (3,8)
      name (4,5)
      address (6,7)
  book (10,17)
    author (11,16)
      name (12,13)
      address (14,15)
  paper (18,25)
    author (19,24)
      name (20,21)
      address (22,23)

B+ Tree on (tag, initial), one element list per tag:

address   (6,7) (14,15) (22,23)
author    (3,8) (11,16) (19,24)
bib       (1,26)
book      (2,9) (10,17)
name      (4,5) (12,13) (20,21)
paper     (18,25)
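A rough Python sketch of this storage model: a sorted list stands in for the leaf level of a B+ tree keyed on (tag, left), and `bisect` plays the role of an index probe. The names `element_list` and `descendants` are hypothetical, and the root is taken as bib (1,26) so that it encloses paper (18,25).

```python
import bisect
import math

ELEMENTS = [
    ("bib", 1, 26), ("book", 2, 9), ("author", 3, 8), ("name", 4, 5),
    ("address", 6, 7), ("book", 10, 17), ("author", 11, 16), ("name", 12, 13),
    ("address", 14, 15), ("paper", 18, 25), ("author", 19, 24),
    ("name", 20, 21), ("address", 22, 23),
]

# Entries ordered by (tag, left): all elements of one tag are contiguous
# and sorted in document order, which is what a B+ tree leaf scan would give.
INDEX = sorted(ELEMENTS)

def element_list(index, tag):
    """Range scan: every entry of one tag, in document order."""
    lo = bisect.bisect_left(index, (tag,))
    hi = bisect.bisect_left(index, (tag, math.inf))
    return index[lo:hi]

def descendants(index, tag, left, right):
    """Entries of `tag` whose left position lies strictly inside (left, right)."""
    lo = bisect.bisect_right(index, (tag, left, math.inf))
    hi = bisect.bisect_left(index, (tag, right))
    return index[lo:hi]

print(element_list(INDEX, "author"))
print(descendants(INDEX, "name", 10, 17))   # names under the second book
```

Because regions nest properly, any element whose left position falls inside an ancestor's region is a descendant, so one range probe is enough to skip everything outside the region.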

Page 7: Tree-Pattern Queries on a Lightweight XML Processor


Method Categorization

Parameters: access pattern and matching algorithm

(1) set based techniques

(2) query driven

(3) input driven

(4) structural summaries

Page 8: Tree-Pattern Queries on a Lightweight XML Processor

Cat 1: Set-based Techniques

Access pattern: sorted/indexed
Matching process: join sets, merge individual paths

Input: sequences of elements, one list per query node element, possibly indexed (set-based)
Major representative: TwigStack
  Optimal XML pattern matching algorithm (ancestor/descendant)
  Stack-based processing
  Set of stacks = compact encoding of partial and total results in linear space (possibly exponential number of answers)

Page 9: Tree-Pattern Queries on a Lightweight XML Processor

TwigStack + Indexes

B+tree, built on the left attribute
  From an ancestor, probe the descendants: skips initial nodes
  Ancestor skipping is not effective (only up to the first element that follows)
XB-tree: index on (left,right) bounding segments
XR-tree: index on (left,right), B+tree with complex index key + stab lists
A comparative study* shows that
  Skipping ancestors: XB-tree is better (the XB-tree is smaller)
  Recursive level of ancestors: XB-tree is better again
  Searching on the stab lists of the XR-tree is less efficient
  Plain B+tree: skips descendants, BUT not ancestors
XBTwigStack is our choice

* H. Li et al. "An Evaluation of XML Indexes for Structural Joins". SIGMOD Record, 33(3), Sept. 2004

Page 10: Tree-Pattern Queries on a Lightweight XML Processor

Cat 2: Query Driven Techniques

Access pattern: indexed/random
Matching process: incremental construction of each result instance

Processing: the query defines the way the input is probed
Major representatives: ViST and PRIX
  Specific details: significantly different
  Same strategy
    Convert both document and query to sequences
    Processing a query = subsequence matching

Page 11: Tree-Pattern Queries on a Lightweight XML Processor

ViST and PRIX

Recursively identifying matches = quadratic time
Optimizing the naïve solution:
  Identify candidate nodes for each matching step
  Use index structures to cluster those candidates
Subsequence matching process = a plan consisting of index nested-loop joins (INLJ) among relations, each of which groups document nodes with the same label
For a given query, the join sequence is statically defined by the sequencing of the query
INLJ plans are a superset of the static plans that PRIX and ViST use

Page 12: Tree-Pattern Queries on a Lightweight XML Processor

ViST x PRIX x INLJ

Percentage of nodes processed by each algorithm (INLJ: best plan)

Dataset, #nodes    ViST     PRIX     INLJ
100%               100      100      100
LEAVES: 80%        100      84.23    84.20
LEAVES: 1%         100      1.33     1.32
ROOT: 80%          84.22    100      84.18
ROOT: 1%           1.33     100      1.33
INTERNAL: 80%      89.48    89.49    84.20
INTERNAL: 1%       34.24    34.22    1.64

Page 13: Tree-Pattern Queries on a Lightweight XML Processor

INLJ: improved B+tree

TPQ evaluation of a relational plan
Independence of the ordered XML model
Total avoidance of false positives

Example: consider b//c, starting from c

Document nodes (left,right):

a (1,52)
  b (2,31)
  b (32,41)
    a (33,40)
      b (34,37)
        c (35,36)
      c (38,39)
  b (42,51)

[Figure: B+tree over the b element list, probed from c]
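The idea of evaluating a pattern as a plan of index nested-loop joins can be sketched in Python on the example nodes of this slide. This is a minimal illustration, not the authors' implementation: sorted per-label lists stand in for the B+tree, the ancestor probe is a plain scan rather than the improved index, and the names `build_index`, `descendants`, `ancestors`, `inlj` and `as_set` are assumptions. A plan is a join order over the query nodes; each node added must be adjacent to one already bound.

```python
import bisect
from collections import defaultdict

NODES = [("a", 1, 52), ("b", 2, 31), ("b", 32, 41), ("a", 33, 40),
         ("b", 34, 37), ("c", 35, 36), ("c", 38, 39), ("b", 42, 51)]

def build_index(nodes):
    """One sorted list of (left, right) regions per label."""
    idx = defaultdict(list)
    for label, left, right in nodes:
        idx[label].append((left, right))
    for regions in idx.values():
        regions.sort()
    return idx

def descendants(idx, label, region):
    left, right = region
    regions = idx[label]
    lo = bisect.bisect_right(regions, (left, right))
    hi = bisect.bisect_left(regions, (right,))
    return regions[lo:hi]

def ancestors(idx, label, region):
    left, right = region
    regions = idx[label]
    hi = bisect.bisect_left(regions, (left,))
    # Simple scan over earlier regions; a real index would skip non-ancestors.
    return [r for r in regions[:hi] if r[1] > right]

def inlj(idx, parent_of, plan):
    """parent_of: query child -> query parent (ancestor/descendant edges)."""
    bindings = [{plan[0]: r} for r in idx[plan[0]]]
    for q in plan[1:]:
        extended = []
        for b in bindings:
            if parent_of.get(q) in b:        # q hangs below a bound node
                matches = descendants(idx, q, b[parent_of[q]])
            else:                            # q is the parent of a bound node
                child = next(c for c in b if parent_of.get(c) == q)
                matches = ancestors(idx, q, b[child])
            extended.extend({**b, q: m} for m in matches)
        bindings = extended
    return bindings

def as_set(bindings):
    return {tuple(sorted(b.items())) for b in bindings}

IDX = build_index(NODES)
B_C = {"c": "b"}                  # query b//c
A_B_C = {"b": "a", "c": "b"}      # query a//b//c
print(sorted(as_set(inlj(IDX, B_C, "cb"))))   # b//c, starting from c
```

Every valid join order returns the same bindings; the plans differ only in how many index entries they touch, which is why the choice of plan matters when no optimizer is available.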

Page 14: Tree-Pattern Queries on a Lightweight XML Processor

Cat 3: Input Driven Techniques

Access pattern: sequential
Matching process: input drives computation, merge individual paths

Processing: at each point, the flow of computation is guided entirely by the input through a finite state machine (DFA/NFA)
Advantages
  Each node is processed only once
  Simplicity, sequential access pattern
Problem: skipping elements

Page 15: Tree-Pattern Queries on a Lightweight XML Processor

SingleDFA and IdxDFA

SingleDFA
  <element> triggers the DFA, choosing the next state
  </element>: execution backtracks to the state at which the start tag was processed
  TPQ matching: intermediate results are compacted on stacks
Experiments show that reading the whole input is not efficient enough
Speeding up navigation: IdxDFA
  Instead of reading sequentially: use indexes and skip descendants
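The start-tag/end-tag behaviour described for SingleDFA can be illustrated with a much simpler Python sketch. It handles only a linear path with descendant steps (//p0//p1//...), not a full tree pattern, and it tracks a set of automaton states per open element instead of a precompiled DFA. The function name `stream_match` is hypothetical; `xml.etree.ElementTree.iterparse` is used only as a source of start/end events.

```python
import io
import xml.etree.ElementTree as ET

def stream_match(xml_text, path):
    """Return the start-event ordinals of elements matching //p0//p1//...//pk."""
    k = len(path)
    # State i means: the first i steps are matched by currently open elements.
    stack = [frozenset({0})]
    matches = []
    ordinal = 0
    events = ET.iterparse(io.BytesIO(xml_text.encode()), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            ordinal += 1
            current = stack[-1]
            advanced = {s + 1 for s in current if s < k and path[s] == elem.tag}
            if k in advanced:
                matches.append(ordinal)
            stack.append(current | advanced)   # start tag: move to the next state set
        else:
            stack.pop()                        # end tag: back to the enclosing state set
    return matches

DOC = "<a><b><a><b><c/></b><c/></a></b></a>"
print(stream_match(DOC, ["a", "b", "c"]))
```

Every element is seen exactly once and in document order, which is the strength of the input-driven approach and also its weakness: nothing is skipped, even when most of the input cannot contribute to a result.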

Page 16: Tree-Pattern Queries on a Lightweight XML Processor

IdxDFA: example

[Figure: example document tree with nodes c1 to c22 (labels a, b, c, d) and the query tree with nodes a, b, c, d]

Page 17: Tree-Pattern Queries on a Lightweight XML Processor

IdxDFA: example

[Figure: the same document tree and query tree (nodes a, b, c, d), with the subtrees reached through the index shown separately]

Page 18: Tree-Pattern Queries on a Lightweight XML Processor

Cat 4: Graph Summary Evaluation

Access pattern: indexed/random
Matching process: merge-join partitioned input, merge individual paths

Structural summary: each index node identifies a group of nodes in the document
Processing: identify the index nodes that satisfy the query + post-processing filtering
Beneficial: when there is a reasonable structural index, much smaller than the document
Problem: graph size comparable to or larger than the original document

Page 19: Tree-Pattern Queries on a Lightweight XML Processor

Categories Summary

Set Based
  Access pattern: sorted/indexed
  Matching process: join sets, merge individual paths
  Methods: TwigStack / XB-tree, B+tree, XR-tree
Query Driven
  Access pattern: indexed/random
  Matching process: incremental construction of each result instance
  Methods: (ViST, PRIX) INLJ
Input Driven
  Access pattern: sequential
  Matching process: input drives computation, merge individual paths
  Methods: SingleDFA, IdxDFA
Structural Summary
  Access pattern: indexed/random
  Matching process: merge-join partitioned input, merge individual paths
  Methods: structural indexes

Page 20: Tree-Pattern Queries on a Lightweight XML Processor

Experimental Evaluation

1. Experiments with real datasets
2. Experiments with synthetic datasets
  Further analyze each method
  Characterize the methods according to specific features available in each custom dataset
3. More sets of experiments
  Closely verify XBTwigStack versus INLJ

Page 21: Tree-Pattern Queries on a Lightweight XML Processor

Setup

Algorithms use the same API
Analysis varying structure and selectivity
Performance measure = total time required to compute a query
  Number of nodes as secondary information
Intel Pentium 4 2.6 GHz, 1 GB RAM
Berkeley DB: 100 buffers, page size 8 KB, B+ tree
Real/benchmark datasets
  XMark (Internet auction, 1.4 GB raw data, approx. 17 million nodes), Protein Sequence Database

Page 22: Tree-Pattern Queries on a Lightweight XML Processor

XMark

[Figure: bar chart of time (sec), axis 0 to 40, for queries X1, X2, X4, X6; series: XBTwigStack, SingleDFA, IdxDFA, INLJ, StrIdx]

Page 23: Tree-Pattern Queries on a Lightweight XML Processor

Custom Data

Goal: isolate important features
Query //a//b[.//c]//d
  Simple enough for detailed investigation
  Complex enough to provide a large number of different data access possibilities
Vary the selectivity of each element separately
Add recursion to key elements (root, leaf)

[Figure: query tree with nodes a, b, c, d]

Page 24: Tree-Pattern Queries on a Lightweight XML Processor

Custom Data

[Figure: bar chart of time (sec), axis 0 to 30, per Dataset:Selectivity D1:100, D2:80, D3:50, D4:10, D5:01; series: XBTwigStack, IdxDFA, INLJ; inset: query tree with nodes a, b, c, d]

Page 25: Tree-Pattern Queries on a Lightweight XML Processor

Custom Data

[Figure: bar chart of time (sec), axis 0 to 30, per Dataset:Selectivity D1:100, D6:80, D8:10, D10:80, D12:10; series: XBTwigStack, IdxDFA, INLJ; inset: query tree with nodes a, b, c, d]

Page 26: Tree-Pattern Queries on a Lightweight XML Processor

XBTwigStack x INLJ

On a large dataset: 40 million nodes, 1 GB, 1% selectivity
Difference of 40 s between XBTwig and the best INLJ plan

[Figure: bar chart of time (sec), axis 0 to 250, on dataset R40; series: INLJ plans ABCD, ABDC, BACD, BADC, BCAD, BCDA, BDAC, BDCA, CBAD, CBDA, DBAC, DBCA, and XBTwig]

Page 27: Tree-Pattern Queries on a Lightweight XML Processor

XBTwigStack x INLJ

[Figure: chart, axis 0 to 350, for datasets R1, R3, R4, R7, R9; series: INLJ plans ABCD, ABDC, BACD, BADC, BCAD, BCDA, BDAC, BDCA, CBAD, CBDA, DBAC, DBCA, and XBTwig]

Page 28: Tree-Pattern Queries on a Lightweight XML Processor

Conclusions

Categorization of TPQ processing algorithms
Adaptations for processing TPQ
  DFA + accessing nodes from B+tree
  INLJ + ancestor skipping
DFA-based, even when improved (IdxDFA): not enough
Structural summary available and smaller than the document: StrIdx
XBTwigStack: most robust and predictable
INLJ when selectivity is high: no guarantee about the chosen plan without an optimizer module

Page 29: Tree-Pattern Queries on a Lightweight XML Processor

Questions?

Page 30: Tree-Pattern Queries on a Lightweight XML Processor

EXTRA SLIDES

Page 31: Tree-Pattern Queries on a Lightweight XML Processor

Background

Region numbering scheme: (left, right)

Bib (1,36)
  article (2,19)
    title (3,5)
      t1 (4)
    author (6,13)
      last (7,9)
        DeWitt (8)
      first (10,12)
        David J. (11)
    procs (14,18)
      conf (15,17)
        VLDB (16)
  article (20,35)
    title (21,23)
      t2 (22)
    author (24,31)
      last (25,27)
        Lu (26)
      first (28,30)
        Hongjun (29)
    procs (32,34)
      conf (33)

TPQ (query tree nodes): article, author, last, procs, conf

Page 32: Tree-Pattern Queries on a Lightweight XML Processor

TwigStack

1) Find solutions to the individual root-to-leaf paths
2) Merge-join those partial solutions
→ before adding an element to a stack:
  (i) the node has a descendant on each of the query children streams
  (ii) each of those descendant nodes recursively satisfies this property
→ optimized by indexes

Example
  doc nodes: a1, b1, a2, b2, c1, c2
  query nodes: a, b, c
  results:
    a1 b1 c1
    a1 b1 c2
    a1 b2 c1
    a2 b2 c1

[Figure: document tree, query tree, and stacks Sa (a1, a2), Sb (b1, b2), Sc (c1, c2)]
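How linked stacks can encode many results compactly is easier to see in code. The Python sketch below is not TwigStack itself: it is a simplified stack-based matcher for a single linear path (a//b//c), in the spirit of the path-matching building block that twig algorithms extend. The region numbers are an assumption (the slide gives none); they follow the nesting a1 ⊃ b1 ⊃ a2 ⊃ {b2 ⊃ c1, c2}. The sketch also assumes each label appears at most once in the query path.

```python
from typing import NamedTuple

class Node(NamedTuple):
    name: str
    label: str
    left: int
    right: int

# Hypothetical region numbers for the example document.
DOC = [Node("a1", "a", 1, 12), Node("b1", "b", 2, 11), Node("a2", "a", 3, 10),
       Node("b2", "b", 4, 7), Node("c1", "c", 5, 6), Node("c2", "c", 8, 9)]

def path_stack(nodes, path):
    """Match q0//q1//...//qk; return tuples of node names, one per solution."""
    level = {label: i for i, label in enumerate(path)}
    # One stack per query node; each entry is (node, number of entries in the
    # parent stack that are ancestors of this node).
    stacks = [[] for _ in path]
    results = []

    def expand(i, idx, suffix):
        node, ptr = stacks[i][idx]
        suffix = (node.name,) + suffix
        if i == 0:
            results.append(suffix)
            return
        for j in range(ptr):                 # every linked ancestor gives a solution
            expand(i - 1, j, suffix)

    relevant = sorted((n for n in nodes if n.label in level), key=lambda n: n.left)
    for e in relevant:
        for s in stacks:                     # drop entries that ended before e starts
            while s and s[-1][0].right < e.left:
                s.pop()
        i = level[e.label]
        if i > 0 and not stacks[i - 1]:
            continue                         # no open ancestor for the parent step
        stacks[i].append((e, len(stacks[i - 1]) if i > 0 else 0))
        if i == len(path) - 1:               # leaf of the path: emit solutions
            expand(i, len(stacks[i]) - 1, ())
            stacks[i].pop()
    return results

print(path_stack(DOC, ["a", "b", "c"]))
```

The stacks never hold more entries than the depth of the document, yet the pointers between them let all combinations be enumerated, which is the "linear space, possibly exponential number of answers" property.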

Page 33: Tree-Pattern Queries on a Lightweight XML Processor

TwigStack + Indexes

B+-tree: built on the left attribute
  Access the ancestor, then probe the descendant stream to skip unmatchable initial nodes
  Ancestor skipping is not effective: skips only up to the first element following a given one
XB-tree: index on (left,right) bounding segments
  Pointers to children (region completely included in the parent)
  Leaves sorted on left
  Region: ancestor access is effective
XR-tree: index on (left,right) = B+tree with complex index key + stab lists
  Ancestor skipping: elements stabbed by left

[Figure: example regions b1, b2, b3, a1, c1, c2]

Page 34: Tree-Pattern Queries on a Lightweight XML Processor

ViST, Virtual Suffix Tree

Input: sequence of (symbol, path) pairs
  (a1,)(b1,a1)(a2,a1b1)(b2,a1b1a2)(c1,a1b1a2b2)(c2,a1b1a2)
  Document and query are both translated
  Virtual suffix tree (B+-tree), indexed on left
Processing
  Structural query = find (non-contiguous) subsequence matches → suffix tree
  Benefit: handles the query as a whole instead of merging parts
  One query path at a time
  Efficient when the query top defines the results

[Figure: document tree with nodes a1, b1, a2, b2, c1, c2]
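The (symbol, path) sequence on this slide is a preorder traversal in which every node carries the labels of its ancestors. A small Python sketch, assuming a nested-tuple tree encoding and a hypothetical helper name `vist_sequence`:

```python
def vist_sequence(node, prefix=()):
    """Preorder list of (symbol, prefix) pairs; node is (symbol, [children])."""
    symbol, children = node
    pairs = [(symbol, "".join(prefix))]
    for child in children:
        pairs.extend(vist_sequence(child, prefix + (symbol,)))
    return pairs

# The example document: a1 > b1 > a2 > {b2 > c1, c2}
DOC = ("a1", [("b1", [("a2", [("b2", [("c1", [])]), ("c2", [])])])])
print(vist_sequence(DOC))

def chain(n):
    """Unary tree (a single path) with n nodes, all labeled 'x'."""
    node = ("x", [])
    for _ in range(n - 1):
        node = ("x", [node])
    return node

# Total number of prefix symbols stored for a 100-node chain: 0 + 1 + ... + 99
total = sum(len(prefix) for _, prefix in vist_sequence(chain(100)))
print(total)
```

The second part shows why the encoding can grow faster than the document: on a unary tree every node repeats its whole ancestor path, so the total prefix length is quadratic in the number of nodes.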

Page 35: Tree-Pattern Queries on a Lightweight XML Processor

ViST, index

Virtual suffix tree
B+tree, nodes indexed on the left position
D-Ancestor and S-Ancestor

Example document (left,right):

b (1,13)
  a (2,4)
    b (3)
  a (5,7)
    c (6)
  a (8,12)
    c (9,11)
      c (10)

D-Ancestor B+tree key → S-Ancestor entries:

(b,)      (1,13)
(a,b)     (2,4) (5,7) (8,12)
(b,ba)    3
(c,ba)    6 (9,11)
(c,bac)   10

Page 36: Tree-Pattern Queries on a Lightweight XML Processor

ViST

Query: A, B, C, D
Query sequence: (A,ε) (B,A) (C,B) (D,B)

Document sequence (number = position):
(A,ε)1 (B,A)2 (C,AB)3 (D,AB)4 (B,A)5 (C,AB)6 (D,AB)7 (F,AB)8 (B,A)9 (C,AB)10 (B,A)11 (D,AB)12

ViST subsequence matching, positions matched for A, B, C, D:
1 2 3 4
1 2 3 7
1 2 3 12
1 2 6 7
1 2 6 12
1 2 10 12
1 5 6 7
1 5 6 12
1 5 10 12
1 9 10 12

Final filtering is then applied to these candidate matches

[Figure: document tree and query tree]
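The candidate matches of this example can be recomputed with a few lines of Python. This is a brute-force sketch of non-contiguous subsequence matching, not the indexed ViST procedure. Two assumptions are made: the query is written with full prefixes, (C,AB) and (D,AB), so that pairs can be compared by equality with the document sequence, and the final filter shown here (checking parent positions in the preorder sequence) is just one possible way to remove false alarms. Helper names are hypothetical.

```python
DOC_SEQ = [("A", ""), ("B", "A"), ("C", "AB"), ("D", "AB"), ("B", "A"), ("C", "AB"),
           ("D", "AB"), ("F", "AB"), ("B", "A"), ("C", "AB"), ("B", "A"), ("D", "AB")]
QUERY_SEQ = [("A", ""), ("B", "A"), ("C", "AB"), ("D", "AB")]   # normalized prefixes

def subsequence_matches(doc, query, start=0, picked=()):
    """All tuples of 1-based positions where query occurs as a subsequence of doc."""
    if len(picked) == len(query):
        return [picked]
    found = []
    for pos in range(start, len(doc)):
        if doc[pos] == query[len(picked)]:
            found += subsequence_matches(doc, query, pos + 1, picked + (pos + 1,))
    return found

def parent_position(doc, pos):
    """Position of the parent of doc[pos - 1]: the nearest earlier element whose
    own path (prefix + symbol) equals this element's prefix. Assumes preorder."""
    _, prefix = doc[pos - 1]
    for p in range(pos - 1, 0, -1):
        symbol, pre = doc[p - 1]
        if pre + symbol == prefix:
            return p
    return None

def is_true_match(doc, match):
    """For the query A/B[C][D]: C and D must be children of the matched B."""
    a, b, c, d = match
    return (parent_position(doc, b) == a
            and parent_position(doc, c) == b
            and parent_position(doc, d) == b)

candidates = subsequence_matches(DOC_SEQ, QUERY_SEQ)
confirmed = [m for m in candidates if is_true_match(DOC_SEQ, m)]
print(len(candidates), confirmed)
```

Ten subsequences match, but only those whose C and D sit under the same B are real answers; the rest are the false alarms that a filtering step has to discard.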

Page 37: Tree-Pattern Queries on a Lightweight XML Processor

ViST, algorithm

Q = q1, …, qk: query sequence
D-tree: B+ index of (symbol, prefix)
S-tree: B+ index of region labels

function Search(region, i)
  if i < |Q|
    T = retrieve the S-tree of qi from the D-tree
    N = retrieve from T all nodes in the range region
    for each node c(left,right) ∈ N
      Search((left,right), i+1)
  else
    return result

Example: Q = (b,) (a,b) (c,ba)
  Search((1,13), (b,))
  Search((1,13), (a,b))
  Search((2,4), (c,ba))
  Search((5,7), (c,ba))
  Search((8,12), (c,ba))

[Figure: query tree b, a, c; the example document tree with regions (1,13), (2,4), 3, (5,7), 6, (8,12), (9,11), 10; and its D-Ancestor/S-Ancestor index]
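A toy Python rendering of the Search function, run on the example index as it can be read from the figure. The dictionary stands in for the D-tree and its lists for the S-trees; single-number nodes are written as (n, n); the range test is plain region containment. All of these are simplifying assumptions for illustration, and the query index `i` is 0-based here.

```python
# D-tree: (symbol, prefix) -> S-tree entries (regions), as read from the slide.
D_TREE = {
    ("b", ""): [(1, 13)],
    ("a", "b"): [(2, 4), (5, 7), (8, 12)],
    ("b", "ba"): [(3, 3)],
    ("c", "ba"): [(6, 6), (9, 11)],
    ("c", "bac"): [(10, 10)],
}

def search(region, i, query, d_tree, partial=(), results=None):
    """Recursive subsequence search: match query[i] inside `region`."""
    results = [] if results is None else results
    if i < len(query):
        s_tree = d_tree.get(query[i], [])
        lo, hi = region
        for left, right in s_tree:
            if lo <= left and right <= hi:       # node lies in the current range
                search((left, right), i + 1, query, d_tree,
                       partial + ((left, right),), results)
    else:
        results.append(partial)                  # all query elements matched
    return results

Q = [("b", ""), ("a", "b"), ("c", "ba")]
print(search((1, 13), 0, Q, D_TREE))
```

The probe under (2,4) finds no (c,ba) entry and stops, while (5,7) and (8,12) each lead to one match, which mirrors the sequence of Search calls listed in the example.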

Page 38: Tree-Pattern Queries on a Lightweight XML Processor

ViST, access order

Q = (b,) (a,b) (c,ba)
  Search((1,13), (b,))
  Search((1,13), (a,b))
  Search((2,4), (c,ba))
  Search((5,7), (c,ba))
  Search((8,12), (c,ba))

[Figure: the example document tree and its D-Ancestor/S-Ancestor index, annotated with the order in which nodes are accessed]

Page 39: Tree-Pattern Queries on a Lightweight XML Processor

ViST, discussion

Worst-case storage requirement for D-Ancestor is more than linear in the number of elements
  E.g. unary tree with n nodes: sequence of size O(n^2)
False alarms
  D1 = (a,) (b,a) (d,ab) (e,ab) (c,a) (f,ac) (d,ac)
  D2 = (a,) (b,a) (d,ab) (b,a) (e,ab)
  Q = (a,) (b,a) (d,ab) (e,ab)
  Our implementation: no false alarms
//a[//b]//c is unordered
  ViST: (a,)(b,a)(c,a) & (a,)(c,a)(b,a)
  Our implementation: runs the twig query only once

[Figure: trees for D1, D2 and Q, and the two orderings of the a, b, c query]

Page 40: Tree-Pattern Queries on a Lightweight XML Processor

PRIX, PRüfer seqs. for Indexing XML

Input: sequence of labels
  Document & query mapped by Prüfer's method
  Tree → sequence: remove one node at a time
Processing
  Sequence matching against the indexed db: filter non-matches
  Refinement phases: filter twig-matches; the results:
    Form a tree, satisfy the twig query, include the leaf nodes
Bottom-up approach

Example (any numbering scheme; here it is post-order):
  LPS = A C B C C B A C A E E E D A
  NPS = 15 3 7 6 6 7 15 9 15 13 13 13 14 15
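The two sequences in this example can be regenerated with a short Python sketch of Prüfer-style sequence construction: repeatedly delete the leaf with the smallest number and record its parent's number (NPS) and label (LPS). The parent map below is derived from the NPS on the slide (with post-order numbering, node i is the i-th node deleted, so its parent is the i-th NPS entry); labels are known only for nodes that appear as parents. The function name `prufer_sequences` is hypothetical.

```python
import heapq

def prufer_sequences(parent, label):
    """parent: node number -> parent number (None for the root).
    label: labels of the nodes that have children.
    Returns (LPS, NPS)."""
    child_count = {n: 0 for n in parent}
    for p in parent.values():
        if p is not None:
            child_count[p] += 1
    leaves = [n for n, c in child_count.items() if c == 0]
    heapq.heapify(leaves)
    lps, nps = [], []
    while leaves:
        node = heapq.heappop(leaves)         # smallest-numbered leaf
        p = parent[node]
        if p is None:                        # only the root is left
            break
        nps.append(p)
        lps.append(label[p])
        child_count[p] -= 1
        if child_count[p] == 0:              # the parent has become a leaf
            heapq.heappush(leaves, p)
    return lps, nps

PARENT = {1: 15, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: 15, 8: 9, 9: 15,
          10: 13, 11: 13, 12: 13, 13: 14, 14: 15, 15: None}
LABEL = {3: "C", 6: "C", 7: "B", 9: "C", 13: "E", 14: "D", 15: "A"}

lps, nps = prufer_sequences(PARENT, LABEL)
print(" ".join(lps))
print(" ".join(map(str, nps)))
```

Since only parents are recorded, leaf labels never appear in the sequences, which is consistent with the slide's note that results must additionally "include the leaf nodes" during refinement.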

Page 41: Tree-Pattern Queries on a Lightweight XML Processor

PRIX, Processing

Problems
  Complex solution
  //a[//b]//c unordered: same problem as ViST
What we do
  Region-based numbering scheme and XB-tree
  Bottom-up traversal of the query + subtwig merging
  Access nodes in the same order
Efficient when the query bottom defines the results

Page 42: Tree-Pattern Queries on a Lightweight XML Processor

A(k) index

A(k) → k is the degree of similarity, "size of common path"
$\approx^{k}$: k-bisimilarity
1) for any two nodes $u$ and $v$, $u \approx^{0} v$ iff $u$ and $v$ have the same label
2) $u \approx^{k} v$ iff $u \approx^{k-1} v$ and for every parent $u'$ of $u$ there is a parent $v'$ of $v$ s.t. $u' \approx^{k-1} v'$, and vice versa

Example (node numbers and labels): 1 A, 2 B, 3 C, 4 D, 5 D, 6 E, 7 E
  A(0): {1} A, {2} B, {3} C, {4,5} D, {6,7} E
  A(1): {1} A, {2} B, {3} C, {4} D, {5} D, {6,7} E
  A(2): {1} A, {2} B, {3} C, {4} D, {5} D, {6} E, {7} E (same as the original document)

[Figure: original document and its A(0), A(1), A(2) index graphs]
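For a tree, where every node has at most one parent, the k-bisimilarity classes can be computed by refining the label partition k times. The Python sketch below does that on the example nodes. The edges of the example document are an assumption (the slide text does not preserve them): 1 is the root, 2 and 3 are its children, 4 is under 2, 5 under 3, 6 under 4 and 7 under 5, which is the shape consistent with the A(0), A(1) and A(2) groupings. The function name `ak_partition` is hypothetical.

```python
def ak_partition(parent, label, k):
    """Group nodes into k-bisimilarity classes (tree case: one parent per node)."""
    cls = dict(label)                            # A(0): same label, same class
    for _ in range(k):
        # Refine: same previous class and parents in the same previous class.
        cls = {n: (cls[n], cls[parent[n]] if parent[n] is not None else None)
               for n in cls}
    groups = {}
    for node, c in cls.items():
        groups.setdefault(c, []).append(node)
    return sorted(sorted(g) for g in groups.values())

# Assumed shape of the example document.
PARENT = {1: None, 2: 1, 3: 1, 4: 2, 5: 3, 6: 4, 7: 5}
LABEL = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}

for k in range(3):
    print(k, ak_partition(PARENT, LABEL, k))
```

Larger k separates nodes that differ further up their incoming paths, so the summary becomes more precise but also closer in size to the document, which is the trade-off noted for structural summaries.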

Page 43: Tree-Pattern Queries on a Lightweight XML Processor

Protein

[Figure: bar chart of time (sec), axis 0 to 60, for queries P1, P2, P4, P5, P6; series: XBTwigStack, SingleDFA, IdxDFA, INLJ, StrIdx]