Benchmarking Holistic Approaches to XML TPQ Processing
Jiaheng Lu
Renmin University of China
BenchmarX 2010
BenchmarX 10 Keynote
A little bit of history
Database world:
  1970 relational databases
  1990 object-oriented databases
  1995 semi-structured databases
Document world:
  1974 SGML (Standard Generalized Markup Language)
  1990 HTML (Hypertext Markup Language)
  1992 URL (Universal Resource Locator)
  1996 XML (eXtensible Markup Language)
What is XML
The eXtensible Markup Language (XML) is the universal format for structured documents and data on the Web.
Advantages of XML:
  Human- and machine-readable format
  More flexible than HTML, not as complicated as SGML
  Unlike a relational table, XML can describe tree- and graph-structured data
What is XML
Basic specification: XML 1.0, W3C Recommendation, Feb '98
<book year="1967">
  <title>The politics of experience</title>
  <author>
    <firstname>Ronald</firstname>
    <lastname>Laing</lastname>
  </author>
</book>
XML Tree
An XML document is commonly modeled as a rooted, ordered tree.
book
  @year: "1967"
  title: "The politics…"
  author
    firstname: "Ronald"
    lastname: "Laing"
"year" is an attribute.
XML query language
Major standards for querying XML data: XPath and XQuery
"XPath is a language for addressing parts of an XML document" (XPath 1.0, W3C, Nov 1999). E.g. paper[title="XML"]/author
"XQuery is an XML query language which provides features for retrieving and interpreting information from XML documents." (XQuery 1.0, Nov 2005)
An XQuery example
XQuery:
<results> {
  for $b in doc("bib.xml")/bib//book,
      $t in $b/title,
      $a in $b/author
  return <result> { $t } { $a } </result>
} </results>
Create a flat list of all the title-author pairs for every book in the bibliography.
XML Twig Pattern
XML Twig Pattern Query (TPQ) is a core operation in XPath and XQuery
Definition of XML twig pattern: an XML twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either parent-child (P-C) or ancestor-descendant (A-D) relationships.
An XML twig pattern example
XQuery:
<results> {
  for $b in doc("bib.xml")/bib//book,
      $t in $b/title,
      $a in $b/author
  return <result> { $t } { $a } </result>
} </results>
Create a flat list of all the title-author pairs for every book in the bibliography.
To answer the XQuery, we first need to match the following XML twig pattern:
bib
  //book ($b)
    /title ($t)
    /author ($a)
Research Problem
Given an XML twig pattern Q, and an XML database D, we need to find ALL the matches of Q on D efficiently.
E.g. consider the following twig pattern and document:
Twig pattern:
section
  //title
  //figure
An XML tree:
s1
  t1
  s2
    t2
    p1
      f1
Query solutions:
(s1, t1, f1) (s2, t2, f1) (s1, t2, f1)
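The problem statement above can be made concrete with a brute-force sketch. It is not one of the holistic algorithms discussed in this talk; it only spells out what a match is. The tree shape (s1 with children t1 and s2; s2 with children t2 and p1; f1 below p1) and the tag name "paragraph" for p1 are assumptions chosen to be consistent with the query solutions listed on the slide.

```python
from itertools import product

# Assumed example tree, stored as child -> parent.
PARENT = {"s1": None, "t1": "s1", "s2": "s1", "t2": "s2", "p1": "s2", "f1": "p1"}
# Tag of every element; "paragraph" for p1 is a hypothetical name.
TAG = {"s1": "section", "s2": "section", "t1": "title", "t2": "title",
       "p1": "paragraph", "f1": "figure"}


def is_ancestor(a, d):
    """Return True if element a is a proper ancestor of element d."""
    node = PARENT[d]
    while node is not None:
        if node == a:
            return True
        node = PARENT[node]
    return False


def elements_with_tag(tag):
    return [e for e in PARENT if TAG[e] == tag]


def match_twig():
    """Enumerate all (section, title, figure) matches of section[.//title]//figure."""
    candidates = product(elements_with_tag("section"),
                         elements_with_tag("title"),
                         elements_with_tag("figure"))
    return [(s, t, f) for s, t, f in candidates
            if is_ancestor(s, t) and is_ancestor(s, f)]
```

Calling `match_twig()` tries every combination of candidate elements, which is exactly the cost that labeling schemes and holistic joins are designed to avoid.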
Why research XML twig pattern match
An XML query includes two parts: value match and twig match.
XPath: paper[title="XML"]/author
Value (content) match: title = "XML"
Twig match (new challenge!):
paper
  /title
  /author
Approach Overview
(1) Labeling: Assign each element in the XML document tree an integer label to capture the structural information of documents
(2) Computing: Use labels to answer the twig pattern without traversing the original document
Related work graph
XML TPQ algorithms
Labeling schemes:
  Containment scheme [SIGMOD'01]
  Dewey scheme [SIGMOD'02]
  Dynamic Dewey scheme [SIGMOD'09]
Computing algorithms:
  Stack-merge [ICDE'02]
  XPath-SQL [SIGMOD'02]
  TwigStack [SIGMOD'02]
  TJFast [VLDB'05]
  Twig2Stack [VLDB'06]
  TreeMatch [TKDE 2010]
Approach Overview
(1) Labeling: region encoding (also called containment) labeling scheme (start, end, level)
An example XML tree with region encoding labels:
s1 (1,12,1)
  t1 (2,3,2)
  s2 (4,11,2)
    t2 (5,6,3)
    p1 (7,10,3)
      f1 (8,9,4)
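The region encoding above can be produced with a single depth-first pass. The following is a minimal sketch, assuming the tree shape s1 > (t1, s2), s2 > (t2, p1), p1 > f1, which is the shape that reproduces the six labels on the slide. The helper names are illustrative only.

```python
# Tree as nested (name, children) tuples; the shape is an assumption.
TREE = ("s1", [("t1", []),
               ("s2", [("t2", []),
                       ("p1", [("f1", [])])])])


def region_labels(tree):
    """Assign a (start, end, level) label to every element in one DFS pass."""
    labels = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        name, children = node
        counter += 1          # position where the element opens
        start = counter
        for child in children:
            visit(child, level + 1)
        counter += 1          # position where the element closes
        labels[name] = (start, counter, level)

    visit(tree, 1)
    return labels


def is_ancestor(a, d):
    """a is an ancestor of d iff a's region strictly contains d's region."""
    return a[0] < d[0] and d[1] < a[1]


def is_parent(a, d):
    """a is the parent of d iff a contains d and is exactly one level above."""
    return is_ancestor(a, d) and a[2] + 1 == d[2]
```

With these labels, structural relationships are decided by integer comparisons alone, without touching the document again.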
Approach Overview
(1) Labeling: Dewey (also called prefix) labeling scheme: integer sequence
An example XML tree with Dewey labels:
s1 (ε)
  t1 (0)
  s2 (1)
    t2 (1.0)
    p1 (1.1)
      f1 (1.1.0)
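A Dewey label is the sequence of child positions on the path from the root. A minimal sketch follows, assuming the same tree shape as in the region encoding example and 0-based child positions (which is what the labels on this slide suggest). Labels are held as tuples and `fmt` is a hypothetical helper for printing.

```python
# Tree as nested (name, children) tuples; the shape is an assumption.
TREE = ("s1", [("t1", []),
               ("s2", [("t2", []),
                       ("p1", [("f1", [])])])])


def dewey_labels(tree):
    """Label the root with the empty sequence and each child with parent label + position."""
    labels = {}

    def visit(node, label):
        name, children = node
        labels[name] = label
        for position, child in enumerate(children):
            visit(child, label + (position,))

    visit(tree, ())
    return labels


def is_ancestor(a, d):
    """a is an ancestor of d iff a's label is a proper prefix of d's label."""
    return len(a) < len(d) and d[:len(a)] == a


def fmt(label):
    """Render a label in dotted form; the empty label is shown as ε."""
    return ".".join(str(x) for x in label) or "ε"
```

For example, `fmt(dewey_labels(TREE)["f1"])` gives the dotted label of f1, and the ancestor test reduces to a prefix comparison.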
Approach Overview
(2) Computing: inverted data lists. Each data list contains the labels of all elements with the same tag name.
Query:
s
  //t
  //f
An XML tree:
s1 (1,12,1)
  t1 (2,3,2)
  s2 (4,11,2)
    t2 (5,6,3)
    p1 (7,10,3)
      f1 (8,9,4)
Data lists:
s: (1,12,1), (4,11,2)
t: (2,3,2), (5,6,3)
f: (8,9,4)
Previous work: TwigStack [1]
(2) Computing: TwigStack [1] is a holistic algorithm for XML twig matching on the containment labeling scheme.
Two steps in TwigStack:
(1) intermediate path solutions are output to match each query root-to-leaf path; and
(2) these intermediate path solutions are merged to get the final results.
[1] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
Running example: TwigStack algorithm
Query:
s
  //t
  //f
Data streams:
s: (1,12,1), (4,11,2)
t: (2,3,2), (5,6,3)
f: (8,9,4)
State of stacks (elements pushed during the run):
stack of s: (1,12,1), (4,11,2)
stack of t: (2,3,2), (5,6,3)
stack of f: (8,9,4)
Output path intermediate solutions:
s//t: (1,12,1) (2,3,2); (1,12,1) (5,6,3); (4,11,2) (5,6,3)
s//f: (1,12,1) (8,9,4); (4,11,2) (8,9,4)
Final results:
(1,12,1) (2,3,2) (8,9,4)
(1,12,1) (5,6,3) (8,9,4)
(4,11,2) (5,6,3) (8,9,4)
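The two phases of the running example (path solutions, then merging) can be reproduced from the region labels alone. The sketch below uses plain nested loops and a hash join instead of TwigStack's stacks, so it illustrates the inputs and outputs of each phase rather than the algorithm itself. The function names are illustrative.

```python
# Data streams from the running example, keyed by query node.
STREAMS = {
    "s": [(1, 12, 1), (4, 11, 2)],
    "t": [(2, 3, 2), (5, 6, 3)],
    "f": [(8, 9, 4)],
}


def contains(a, d):
    """Region containment: a is an ancestor of d."""
    return a[0] < d[0] and d[1] < a[1]


def path_solutions(anc_tag, desc_tag):
    """Phase 1: all (ancestor, descendant) pairs for one root-to-leaf path."""
    return [(a, d)
            for a in STREAMS[anc_tag]
            for d in STREAMS[desc_tag]
            if contains(a, d)]


def merge_paths(left, right):
    """Phase 2: join two path-solution lists on their shared root element."""
    by_root = {}
    for root, leaf in right:
        by_root.setdefault(root, []).append(leaf)
    return [(root, leaf, other)
            for root, leaf in left
            for other in by_root.get(root, [])]


final_results = merge_paths(path_solutions("s", "t"), path_solutions("s", "f"))
```

Here every path solution happens to join with a partner, which is the situation TwigStack guarantees when all query edges are ancestor-descendant.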
Limitations of TwigStack
(1) TwigStack may output many useless intermediate results for queries with parent-child relationships.
(2) TwigStack cannot process XML twig queries with ordered predicates, such as the "preceding" and "following" axes in XPath.
(3) TwigStack cannot answer queries with wildcards in branching nodes.
E.g. a wildcard root * with branches B and C: the parent of B should be an ancestor of C.
Outline
Introduction
Holistic algorithms:
  TwigStackList (CIKM2005)
  OrderedTJ (DEXA2006)
  iTwigJoin (SIGMOD2005)
  TJFast (VLDB2005)
  Twig2Stack (VLDB2006)
  TreeMatch (TKDE2010)
Benchmark experiments
Conclusions and future work
Inefficiency of TwigStack
TwigStack is inefficient at answering twig queries with parent-child edges.
More than 99% of the intermediate results are useless; TwigStack wastes too much time outputting them!
[Figure: bar chart of the number of intermediate paths (0 to 80000), useful vs. useless, for Q1, Q2 and Q3]
Q1 = VP[/DT]//PRP_DOLLAR_, Q2 = S[/JJ]/NP, Q3 = S[//VP/IN]//NP on TreeBank data
Example to illustrate the inefficiency of TwigStack for queries with P-C edge
[Figure: twig pattern with root A, children B and D, and C below B; XML tree with root A1 containing E1, D1 and the elements B1 ... Bn and C1 ... Cn]
TwigStack outputs the useless root-to-leaf intermediate path solutions:
(A1, B1, C1), (A1, B2, C1) …… (A1, Bn, Cn)
The reason for the inefficiency of TwigStack: TwigStack assumes that all edges are A-D relationships in the first step and does not consider level information.
Naïve improvement is incorrect
[Figure: twig pattern with root A, children B and D, and C below B; XML tree with root A1 containing E1, D1 and the elements B1 ... Bn and C1 ... Cn]
Naïve improvement: because A1 is not the parent of D1, we do not output the following path solutions, by considering level information:
(A1, B1, C1), (A1, B2, C1) …… (A1, Bn, Cn)
But this naïve approach is NOT correct for some cases!
Problem of naïve approach
The naïve approach may make a wrong decision about whether the current element contributes to final results.
Example:
[Figure: twig pattern with root A, children B and C, and D below C; XML tree with root A1 containing B1, C1, C2 ... Cn and D1, D2 ... Dm]
When we read A1, B1, C1 and D1, since C1 is not the parent of D1, the naïve approach decides that C1 and D1 do not belong to query answers.
But it is wrong!
Our solution: Look-ahead
New technique used in our new algorithm, called TwigStackList: look-ahead
[Figure: twig pattern with root A, children B and C, and D below C; XML tree with root A1 containing B1, C1, C2 ... Cn and D1 ... Dm, Dm+1]
When we read A1, B1, C1 and D1, we do not hurriedly decide whether C1 or D1 belongs to final solutions, but buffer C1 to Cn in a main-memory list structure.
Since Cn is the parent of D1, we are sure that (A1, B1, Cn, D1) is a real match.
Why not buffer D1 to Dm? Too many!
Running example: TwigStackList algorithm
Query:
A
  //B
  //C
    /D
XML tree:
A1 (1,11,1)
  B1 (2,2,2)
  C1 (3,10,2)
    C2 (4,8,3)
      C3 (5,7,4)
        D1 (6,6,5)
    D2 (9,9,3)
Data streams:
A: (1,11,1)
B: (2,2,2)
C: (3,10,2), (4,8,3), (5,7,4)
D: (6,6,5), (9,9,3)
Stacks SA, SB, SC, SD and list LC (the slide shows (5,7,4) in LC)
Output path solutions:
A//B: (1,11,1) (2,2,2)
A//C/D: (1,11,1) (5,7,4) (6,6,5); (1,11,1) (3,10,2) (9,9,3)
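The key decision in this example is which C element is the parent of each D element. The sketch below illustrates only that check on the region labels of the running example: it collects the nested C elements that contain a D element (the chain a look-ahead list would buffer) and picks the one exactly one level above. It is a simplified illustration under these assumptions, not the TwigStackList algorithm.

```python
# Region labels (start, end, level) from the running example.
C_STREAM = [(3, 10, 2), (4, 8, 3), (5, 7, 4)]   # C1, C2, C3
D_STREAM = [(6, 6, 5), (9, 9, 3)]               # D1, D2


def contains(a, d):
    """a is an ancestor of d."""
    return a[0] < d[0] and d[1] < a[1]


def is_parent(a, d):
    """a is the parent of d: containment plus a level difference of one."""
    return contains(a, d) and a[2] + 1 == d[2]


def buffered_parent(d, c_stream):
    """Buffer every C that contains d, then return the buffered C that is d's parent."""
    buffered = [c for c in c_stream if contains(c, d)]
    for c in buffered:
        if is_parent(c, d):
            return c
    return None


parents = {d: buffered_parent(d, C_STREAM) for d in D_STREAM}
```

Judging D1 against C1 alone would wrongly discard it, since its parent C3 only appears two elements later in the C stream.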
Features of TwigStackList
Main-memory efficient:
  The size of the stacks and list is no more than |Depth(Tree)|
  TwigStackList can process very large documents with small main-memory cost
I/O efficient:
  Each element is scanned once
  For a large query class, TwigStackList guarantees that each output path solution is useful to final answers.
Optimal query classes
If an algorithm does not output any useless intermediate path solution for a query Q on any given document, we say the algorithm is optimal with respect to Q.
If an algorithm has a larger optimal query class, it is better able to control the size of intermediate results.
Optimal query classes
Optimal class of TwigStack: only A-D in all edges
Optimal class of TwigStackList: only A-D in branching edges
[Figure: nested regions for the two classes, each with an example twig over nodes A, B, C, D]
Motivation
TwigStack and TwigStackList cannot handle order-based twig queries.
XPath and XQuery include ordered axes such as following, preceding, following-sibling and preceding-sibling.
XPath expression: A/B[following-sibling::C]
[Figure: twig with root A and children B and C, annotated with "<"; this symbol shows that B and C are ordered]
Ordered twig query pattern
Ordered XML twig pattern: sibling query nodes should be matched according to their order in the twig query.
Example:
[Figure: ordered twig pattern with root A and branches B (with C below B) and D, where B < D; XML tree with root A1 containing B1, C1 and D1, D2, D3]
Only D2 and D3 contribute to final results.
OrderedTJ
OrderedTJ is a new algorithm proposed for evaluating ordered twig query patterns. OrderedTJ extends TwigStackList and also uses stack and list data structures.
What is the main modification of OrderedTJ over TwigStackList?
OrderedTJ additionally checks the order conditions of elements before outputting intermediate paths.
OrderedTJ
Before any element is pushed to the stack, OrderedTJ checks the order condition.
Query:
A
  /B
    /C
  //D
with the order constraint B < D
Data:
A1 (1,9,1)
  D1 (2,2,2)
  B1 (3,5,2)
    C1 (4,4,3)
  D2 (6,8,2)
    D3 (7,7,3)
Data streams:
A: (1,9,1)
B: (3,5,2)
C: (4,4,3)
D: (2,2,2), (6,8,2), (7,7,3)
Stacks: SA, SB, SC, SD
Output intermediate path solutions:
A/B/C: (1,9,1) (3,5,2) (4,4,3)
A//D: (1,9,1) (6,8,2); (1,9,1) (7,7,3)
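With region labels, the order condition in this example reduces to one comparison: an element follows another in document order if it starts after the other ends. The sketch below applies that test to the B and D streams of the running example. It illustrates only the order check, under the assumption that the ordered twig requires B before D; it is not the full OrderedTJ algorithm.

```python
# Region labels (start, end, level) from the running example.
B_STREAM = [(3, 5, 2)]                          # B1
D_STREAM = [(2, 2, 2), (6, 8, 2), (7, 7, 3)]    # D1, D2, D3


def follows(later, earlier):
    """True if `later` starts after `earlier` has ended in document order."""
    return later[0] > earlier[1]


def ordered_candidates(b_stream, d_stream):
    """Keep only the D elements that follow at least one B element."""
    return [d for d in d_stream if any(follows(d, b) for b in b_stream)]


useful_d = ordered_candidates(B_STREAM, D_STREAM)
```

D1 is filtered out because it ends before B1 starts, which matches the path solutions for A//D shown on the slide.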
The optimal query classes of OrderedTJ
OrderedTJ can guarantee optimality for ordered queries with A-D relationships from the second branching edge onward. In other words, OrderedTJ remains optimal for queries with a P-C relationship in the first branching edge.
[Figure: two twigs with root A and branches B and C, labeled Q1 and Q2; one carries the order constraint B < C]
TwigStackList is non-optimal for Q1.
OrderedTJ is optimal for Q2.
iTwigJoin algorithm
TwigStack and OrderedTJ partition data into streams according to tag names alone.
We propose two new data partition schemes:
  (1) Tag+level scheme
  (2) Prefix path scheme
Potential benefits:
  Enlarge the optimal query classes
  Reduce I/O cost
Data partition scheme
XML tree:
A1
  B1
    C1
  C2
    C3
Tag partition:
  TA: A1
  TB: B1
  TC: C1, C2, C3
Tag+Level partition (tag partition refined by level):
  T1A: A1
  T2B: B1
  T2C: C2
  T3C: C1, C3
Prefix path partition (tag+level partition refined by path):
  TA: A1
  TAB: B1
  TAC: C2
  TABC: C1
  TACC: C3
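The three partition schemes differ only in the key used to group elements into streams. A minimal sketch follows, assuming the tree A1 > B1 > C1 and A1 > C2 > C3 (the shape implied by the prefix path streams on the slide). Each element is described by its root-to-element tag path, and `partition` is an illustrative helper.

```python
from collections import defaultdict

# Each element mapped to its root-to-element tag path (assumed tree shape).
ELEMENTS = {
    "A1": ["A"],
    "B1": ["A", "B"],
    "C1": ["A", "B", "C"],
    "C2": ["A", "C"],
    "C3": ["A", "C", "C"],
}


def partition(key):
    """Group elements into streams according to the given key function."""
    streams = defaultdict(list)
    for name, path in ELEMENTS.items():
        streams[key(path)].append(name)
    return dict(streams)


by_tag = partition(lambda path: path[-1])                        # tag only
by_tag_level = partition(lambda path: (len(path), path[-1]))     # level + tag
by_prefix_path = partition(lambda path: "".join(path))           # whole path
```

The number of streams grows from 3 to 4 to 5 as the key becomes more specific, while each individual stream gets shorter.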
Property of three schemes
From the tag scheme to the tag+level scheme (refined by level) to the prefix path scheme (refined by path):
1. the number of inverted lists increases (CPU cost increases correspondingly)
2. the optimal query classes enlarge (output cost decreases correspondingly)
3. the number of scanned elements decreases (input cost decreases correspondingly)
The number of inverted lists : increasing
For the XML tree with root A1, children B1 and C2, C1 below B1 and C3 below C2:
Tag partition: 3 lists (TA, TB, TC)
Tag+Level partition: 4 lists (T1A, T2B, T2C, T3C)
Prefix path partition: 5 lists (TA, TAB, TAC, TABC, TACC)
The optimal query classes : enlarging
Optimal class of the tag scheme: only A-D in branching edges
Optimal class of the tag+level scheme: only A-D in branching edges, and only P-C in all edges
Optimal class of the prefix path scheme: only A-D in branching edges, and only P-C in all edges, and only 1-branching
[Figure: nested regions for the three classes, with example twigs over nodes A, B, C, D, E]
The number of elements scan : decreasing
Query:
A
  B
  C
Data (levels 1 to 3):
D1
  A1
    B1
  C1
    C2
Tag scheme:
  TA: A1
  TB: B1
  TC: C1, C2
Tag+Level scheme:
  T2A: A1
  T3B: B1
  T2C: C1
  T3C: C2
Prefix path scheme:
  TDA: A1
  TDAB: B1
  TDC: C1
  TDCC: C2
iTwigJoin algorithm
A general algorithm which can be applied to all three schemes.
For different schemes, iTwigJoin achieves different performance.
The main technical difficulty in designing iTwigJoin is handling many current nodes for one tag name.
We classify the currently visited elements into three categories: current-match, current-useless and current-blocked.
Three kinds of elements
Current-match : the element is guaranteed to contribute to final answers with current elements.
Current-useless : the element is guaranteed not to contribute to final answers with current and remaining elements.
Current-blocked: the element is neither current-match nor current-useless.
[Figure: a current-blocked element becomes a match when matching data appears, or useless when it cannot get any matching data]
Example on three kinds of elements
Query:
A
  B
  C
Document: elements A1, A2, A3, B1, B2, C1, C2 on levels 1 to 3
Tag+level scheme:
  T1A: A1
  T2A: A2, A3
  T2B: B1
  T3B: B2
  T2C: C2
  T3C: C1
Current-match: A1, B1, C2
Current-blocked: B2, C1
Current-useless: A2
Example on three kinds of elements
[Figure: the same query, document and tag+level streams as in the previous example, with the stream cursors advanced to A3]
B2 and C1 are converted from current-blocked to current-match due to the appearance of A3.
Main flowchart of iTwigJoin
1. Are all elements scanned? If yes, end of the algorithm.
2. Is there any current-useless element? If yes, see whether it contributes to a previous match, and advance to the next element.
3. Otherwise, is there any current-match element? If yes, output intermediate path solutions, and advance to the next element.
4. Otherwise, choose the smallest current-blocked element and output intermediate path solutions, then advance to the next element.
Motivation: new labeling scheme
TwigStackList, OrderedTJ and iTwigJoin are all based on the containment labeling scheme.
Why not try the Dewey labeling scheme for XML twig pattern queries?
Oh, it is really a novel idea!
Original Dewey Labeling Scheme
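In the Dewey labeling scheme, each element is represented by an integer sequence:
(i) the root is labeled by the empty string ε;
(ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s.
For example:
[Figure: XML tree with elements s1, s2, f1, f2, t1, t2; the root is labeled ε, its children 1, 2, 3, and the children of node 2 are labeled 2.1 and 2.2]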
Main problem of the original Dewey
If we use the original Dewey labeling scheme to answer a twig query, we need to read the labels for all query nodes. Thus, this is not a better solution than previous algorithms.
Extend the original Dewey labeling scheme so that, given the label of any element e, we can derive the path of e from this label alone.
Modular function
We need to know some schema information: a DTD (Document Type Definition) or XML Schema.
Given DTD information: book → author, title, chapter*
Our solution: using a modular function, we create a mapping between an element tag and an integer. We define
  X_author mod 3 = 0
  X_title mod 3 = 1
  X_chapter mod 3 = 2
where X_t is the last integer of the label of an element with tag t.
Example:
book (ε)
  author (0)
  title (1)
  chapter (2)
  chapter (5)
Why is the second chapter labeled 5, not 3 as in the original Dewey? Because 3 mod 3 = 0 would denote author; 5 is the next integer with 5 mod 3 = 2.
The modulus 3 is the number of distinct tags under book.
Derive element tag
From a label, we can derive its tag name.
book → author, title, chapter*
Recall that we define: X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2.
Example: for the root book (ε) with children labeled 0, 1, 2 and 5, the derived tags are author, title, chapter and chapter.
More examples for assigning labels
Let us consider a more complicated DTD: a → (b | c)*, d?, c+
We define: X_b mod 3 = 0, X_c mod 3 = 1, X_d mod 3 = 2
(Why do we use mod 3 instead of 4?)
Example:
a (ε)
  b (0)
  d (2)
  c (4)
  c (7)
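The label assignments in the book and a examples are consistent with a simple rule: each child receives the smallest integer that is larger than its left sibling's integer and has the right remainder for its tag. The sketch below implements that rule as an assumption drawn from the slide examples; `assign_components` and `tag_order` are illustrative names, not part of any published API.

```python
def assign_components(child_tags, tag_order):
    """Assign the last label component to each child, left to right.

    child_tags: tags of the children in document order.
    tag_order:  distinct child tags allowed by the DTD; the index of a tag
                in this list is the remainder it must produce.
    """
    modulus = len(tag_order)
    components = []
    previous = -1
    for tag in child_tags:
        remainder = tag_order.index(tag)
        candidate = previous + 1
        # Advance to the next integer with the remainder that encodes this tag.
        while candidate % modulus != remainder:
            candidate += 1
        components.append(candidate)
        previous = candidate
    return components


book_children = assign_components(
    ["author", "title", "chapter", "chapter"], ["author", "title", "chapter"])
a_children = assign_components(["b", "d", "c", "c"], ["b", "c", "d"])
```

Because the integers keep increasing from left to right, the labels still reflect sibling order while also encoding the tag.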
Derive the path from a label
By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label.
For example:
DTD:
book → author, title, chapter*
chapter → (paragraph | section)*
section → (paragraph | section)*
FST:
from book: mod 3 = 0 → author; mod 3 = 1 → title; mod 3 = 2 → chapter
from chapter: mod 2 = 0 → paragraph; mod 2 = 1 → section
from section: mod 2 = 0 → paragraph; mod 2 = 1 → section
[Figure: example document tree with root book]
Question: given a label 5.1.0, what is the corresponding path?
Derive the path from a label
By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label.
For the DTD and FST of the previous slide:
5 mod 3 = 2 gives chapter, 1 mod 2 = 1 gives section, and 0 mod 2 = 0 gives paragraph.
So 5.1.0 denotes: book/chapter/section/paragraph
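Deriving a path from an extended Dewey label amounts to walking the transducer one label component at a time. A minimal sketch follows, assuming the FST is stored as a dictionary from a tag to the ordered list of its child tags, so that the remainder of a component selects the child. The ordering paragraph = 0, section = 1 is inferred from the example answer on the slide.

```python
# FST for the example DTD: the state is a tag, the list gives the child tag
# selected by (component mod number_of_child_tags).
FST = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}


def derive_path(label, root="book"):
    """Translate an extended Dewey label (list of ints) into a tag path."""
    path = [root]
    for component in label:
        child_tags = FST[path[-1]]
        path.append(child_tags[component % len(child_tags)])
    return path


example = "/".join(derive_path([5, 1, 0]))
```

No stream other than the one holding the label itself is consulted, which is what lets TJFast avoid reading labels for internal query nodes.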
Two properties of extended Dewey
Find Ancestor Label: from the label of any element, we can derive the labels of all its ancestors.
Find Ancestor Name: from the label of any element, we can derive the tag names of all its ancestors.
These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.
A new algorithm: TJFast
For each leaf node n in the query, there exists a corresponding input stream Tn.
Tn contains the extended Dewey labels of elements of tag n. Those labels are arranged in document order.
For each branching node b of the twig pattern, there is a corresponding set Sb, which contains elements possibly involved in query answers. (What is the difference compared to TwigStackList?)
At any point of the computation, the size of set Sb is bounded by the depth of the XML document.
An example for TJFast algorithm
Query:
A
  //D
  /B
    //C
Document (extended Dewey labels):
a1 (0)
  a2 (0.0)
    d1 (0.0.1)
  a3 (0.3)
    d2 (0.3.1)
    b1 (0.3.2)
      c1 (0.3.2.1)
  b2 (0.5)
    d3 (0.5.0)
      c2 (0.5.0.0)
DTD:
a -> a*, d*, b*
b -> d*, c*
d -> c*
Input streams:
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
Why are there only two streams?
Set for the branching node A: { }
An example for TJFast algorithm
[Figure: the same query, document and streams TD and TC as on the previous slide]
By the finite state transducer of the extended Dewey labeling scheme:
0.0.1 derives a1/a2/d1
0.3.2.1 derives a1/a3/b1/c1
Set for the branching node A: { }
An example for TJFast algorithm
[Figure: the same query, document and streams TD and TC as on the previous slide]
Both a1 and a3 are possibly involved in query answers. (Why not a2?)
Set for the branching node A: { }
An example for TJFast algorithm
[Figure: the same query, document and streams TD and TC as on the previous slide]
Then we insert a1 and a3 into the set.
Set for the branching node A: {a1, a3}
Output path solutions:
A//D: (a1, d1)
A/B//C: (a3, b1, c1)
An example for TJFast algorithm
[Figure: the same query, document and streams TD and TC as on the previous slide]
Move the cursor of TD from d1 to d2.
Set for the branching node A: {a1, a3}
Output path solutions:
A//D: (a1, d1), (a1, d2), (a3, d2)
A/B//C: (a3, b1, c1)
An example for TJFast algorithm
[Figure: the same query, document and streams TD and TC as on the previous slide]
Move the cursor of stream TD from d2 to d3.
Set for the branching node A: {a1, a3}
Output path solutions:
A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3)
A/B//C: (a3, b1, c1)
An example for TJFast algorithm
[Figure: the same query, document and streams TD and TC as on the previous slide]
Move the cursor of stream TC from c1 to c2.
Set for the branching node A: {a1, a3}
Output path solutions:
A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3)
A/B//C: (a3, b1, c1), (a1, b2, c2)
Sort and merge-join in TJFast
Phase 1. Intermediate paths
A//D: <a1, d1>, <a1, d2>, <a1, d3>, <a3, d2>
A/B//C: <a1, b2, c2>, <a3, b1, c1>
Phase 2. Final solutions (join on A)
<A, D, B, C>: <a1, d1, b2, c2>, <a1, d2, b2, c2>, <a1, d3, b2, c2>, <a3, d2, b1, c1>
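The second phase can be sketched as a classic sort-merge join on the branching node A. The sketch below takes the intermediate paths of this example as input and, as a simplifying assumption, sorts elements by their names as strings; a real implementation would compare labels in document order. `sort_merge_join` is an illustrative name.

```python
# Intermediate path solutions from phase 1 of the example.
A_D = [("a1", "d1"), ("a1", "d2"), ("a3", "d2"), ("a1", "d3")]
A_B_C = [("a3", "b1", "c1"), ("a1", "b2", "c2")]


def sort_merge_join(left, right):
    """Join two lists of path solutions on their first component (the A element)."""
    left, right = sorted(left), sorted(right)
    results = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            # Combine every pair within the two runs.
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    results.append(l + r[1:])
            i, j = i_end, j_end
    return results


final_solutions = sort_merge_join(A_D, A_B_C)
```

Each result tuple is ordered as (A, D, B, C), matching the layout of the final solutions on the slide.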
TJFast+L
Applying the extended Dewey labeling scheme to the tag+level streaming scheme, we propose the TJFast+L algorithm, which extends TJFast.
Two benefits of TJFast+L over TJFast:
  reduce I/O cost by reading fewer elements
  enlarge the optimal query classes
Q: Why not apply extended Dewey to the prefix-path scheme?
A: Because with the finite state transducer, we already know the path information…
Optimal query classes
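Optimal class of TJFast: only A-D in branching edges
Optimal class of TJFast+L: the TJFast class, plus queries with only P-C in all edges
[Figure: nested regions for the two classes, each with an example twig over nodes A, B, C, D]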
State-of-the-art: XML Query Processing
Holistic approaches by query type:
  Path: PathStack [Bruno et al.]
  Tree: TwigStack [Bruno et al.]
  Generalized Tree Pattern (GTP): ? (Twig2Stack)
Processing Generalized Tree Pattern (GTP) Queries
XQuery:
FOR $b in //A[E]//B, $d in $b/D
LET $c = $b/C
RETURN $b, $c, $d
[Figure: the corresponding GTP over nodes A, B, C, D, E. Legend: mandatory axis, optional axis, return node, group return node]
Motivation: PathStack [Bruno et.al]
Query: //A//B
[Figure: data tree with elements a1, a2, b1, b2, and the stacks S[A] (a1, a2) and S[B] (b1, b2)]
Key observation: minimize intermediate results through a compact representation of path matches:
  Inter-node: record the A-D relationship between elements in different query nodes, e.g., b1→a2, b2→a2
  Intra-node: record the A-D relationship between elements within the same query node, e.g., b1, b2
TwigStack [Bruno et al.] minimizes intermediate results by outputting only those path matches that are in final twig results.
  However, such optimality cannot be guaranteed [Choi et al.]
  Not helpful for processing GTP queries
Question: can we minimize intermediate results for twig queries through compact result encoding (similar to PathStack)?
Hierarchical Stack Encoding
Inter-node (//A//B): can still use explicit edges
Intra-node (A): matching elements form a tree structure as well
Associate each query node with a hierarchical stack:
  Push element e into hierarchical stack HS[E] iff e satisfies the sub-twig query rooted at E
  Matching can be determined when the entire sub-tree of e has been seen
  Requires post-order document traversal
[Figure: elements a1, a2, a3, a4 and the hierarchical stack HS[A] holding them]
Twig2Stack: Running Example
Query:
A
  /B
    //C
    //D
Data ([start,end], level):
a1 [1,20], 1
  a2 [2,15], 2
    b1 [3,14], 3
      b2 [4,11], 4
        c1 [5,10], 5
          d2 [6,7], 6
          d1 [8,9], 6
      c2 [12,13], 4
  b3 [16,19], 2
    d3 [17,18], 3
Hierarchical stacks shown on the slide (merging stacks): HS[A]: a2; HS[B]: b1, b2; HS[C]: c1, c2; HS[D]: d1, d2, d3
TwigStack needs to enumerate 3 matches for //A/B//D and 2 for //A/B//C, then join them together.
Twig2Stack requires neither path joins nor path enumeration!
Not yet done: Memory Usage
Hierarchical stack encoding could hold the entire document in memory in the worst case.
  Unlike the DOM approach, only matches need to be stored: tag match, (partial) twig match, predicate evaluation
Early result enumeration dramatically reduces the memory usage.
  Enumerate query results before the end of the document and release the buffer
  Main idea: hybrid of top-down (PathStack) and bottom-up (Twig2Stack) approaches
TreeMatch (TKDE 2010)
[Figure: twig pattern with root A and children B and C; XML tree with elements A1, A2, B1, B2, C1, C2]
Matching cross: B1, B2 and C1, C2
It is the real reason for sub-optimality!
Bounded and Unbounded Matching Cross
[Figure: twig pattern with root A and children B and C, and two XML trees. One shows an unbounded matching cross between B1 ... B2n and C1 ... C2n; the other shows a bounded matching cross between A1 ... An and C1 ... C2n]
BMC and UMC
Bounded Matching Cross (BMC): optimal class
  Stores a limited number of nodes in main memory
Unbounded Matching Cross (UMC): sub-optimal class, but not all of it
  Cannot guarantee to store a limited number of nodes in main memory, but a sub-class of UMC is still optimal
Unbounded Matching Cross with Mediator
Twig pattern (output: node C): root A with children B and C
[Figure: XML tree with elements A1 ... An, B1 ... B2n and C1 ... Cn, showing an unbounded matching cross between B1 ... Bn+1 and C1 ... Cn]
Node A is a mediator node and we do not need to store all Bi in main memory!
Optimal query classes
Optimal class of TwigStack: only A-D in all edges
Optimal class of TwigStackList: only A-D in branching edges
Optimal class of TreeMatch: only A-D in non-output branching edges
[Figure: example twig patterns over nodes A, B, C, D for each class]
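As a rough illustration of the three optimal classes named on this slide, the sketch below checks which of them a given twig pattern falls into. The `Node` class, the `"AD"`/`"PC"` edge labels, and the `output` flag are hypothetical names introduced for this example. It also assumes that a "branching edge" is an edge from a node with two or more children to one of those children, and that "non-output" means the child is not an output node. That reading is an assumption, not a definition taken from the papers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A query node; `edge` is the edge type to its parent ('AD' or 'PC')."""
    tag: str
    edge: str = "AD"          # ignored for the root
    output: bool = False      # assumed marker for output nodes
    children: List["Node"] = field(default_factory=list)


def edges(root):
    """Yield (parent, child) for every edge of the twig pattern."""
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            yield parent, child
            stack.append(child)


def optimal_classes(root):
    """Return the algorithms whose optimal class (as read above) contains the query."""
    all_edges = list(edges(root))
    # Assumed reading: an edge is "branching" if its parent has 2+ children.
    branching = [(p, c) for p, c in all_edges if len(p.children) >= 2]
    non_output_branching = [(p, c) for p, c in branching if not c.output]

    result = []
    if all(c.edge == "AD" for _, c in all_edges):
        result.append("TwigStack")
    if all(c.edge == "AD" for _, c in branching):
        result.append("TwigStackList")
    if all(c.edge == "AD" for _, c in non_output_branching):
        result.append("TreeMatch")
    return result


# A[.//B]/C with C as the output node: the only P-C edge is an output branching edge.
query = Node("A", children=[Node("B", "AD"), Node("C", "PC", output=True)])
print(optimal_classes(query))
```

Under these assumptions, each class contains the one before it: a query with only A-D edges is reported for all three algorithms, while the example query is reported for TreeMatch only.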
Outline
Introduction
Holistic algorithms: TwigStackList (CIKM 2004), OrderedTJ (DEXA 2005), iTwigJoin (SIGMOD 2005), TJFast (VLDB 2005), Twig2Stack (VLDB 2006), TreeMatch (TKDE 2010)
Benchmark experiments
Conclusions and future work
Experiment Setup
Implementation (seven algorithms): TwigStack (SIGMOD 2002), TwigStackList (CIKM 2004), OrderedTJ (DEXA 2005), iTwigJoin (SIGMOD 2005), TJFast (VLDB 2005), Twig2Stack (VLDB 2006), TreeMatch (TKDE 2010)
Datasets: XMark, DBLP, TreeBank
Metrics: query processing time, I/O time
Experiments
Benchmarks
XMark: synthetic data
DBLP: real data from the DBLP database
Treebank: real data from the Wall Street Journal

                  XMark   DBLP    Treebank
Data size (MB)    582     130     82
Nodes (million)   8       3.3     2.4
Max/Avg depth     12/5    6/2.9   36/7.8
Tested queries
      Source     Twig query
Q1    DBLP       //proceedings//title[.//i]//sup
Q2    DBLP       //article[.//sup]//title//sub
Q3    Treebank   /S[.//VP/IN]//NP
Q4    Treebank   /S/VP/PP[IN]/NP/VBN
Q5    Treebank   //VP[DT]//PRP_DOLLAR_
Some tested queries
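To make concrete what such a twig query selects, the sketch below runs Q5 over a tiny hand-made document using the limited XPath support in Python's standard `xml.etree.ElementTree`. The document is invented for illustration and is not Treebank data. The evaluation is plain navigation, not one of the holistic join algorithms compared in this talk.

```python
import xml.etree.ElementTree as ET

# Toy document (invented for illustration; not taken from Treebank).
DOC = """
<S>
  <VP><DT>the</DT><NP><PRP_DOLLAR_>his</PRP_DOLLAR_></NP></VP>
  <VP><NP><PRP_DOLLAR_>her</PRP_DOLLAR_></NP></VP>
</S>
"""

root = ET.fromstring(DOC)

# Q5: //VP[DT]//PRP_DOLLAR_
# VP is the branching node: it needs a DT child and a PRP_DOLLAR_ descendant.
# ElementTree paths are relative to an element, hence the leading '.'.
matches = root.findall(".//VP[DT]//PRP_DOLLAR_")
q5_result = [m.text for m in matches]
print(q5_result)
```

Only the PRP_DOLLAR_ element under the first VP qualifies, because the second VP has no DT child.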
Tested queries (Cont.)
Q1, Q2, and Q3 are based on XMark data; Q4, Q5, and Q6 are on TreeBank data.
TwigStackList vs. TwigStack
Experiment data: TreeBank
Compared to TwigStack, TwigStackList significantly reduces the number of useless elements it outputs.
[Figure: number of intermediate paths (0 to 80000) for Q1, Q2, Q3; series: Useful, TwigStack, TwigStackList]
Q1=VP[/DT]//PRP DOLLAR, Q2=S[/JJ]/NP, Q3=S[//VP/IN]//NP
TwigStackList vs. OrderedTJ
STW: Straightforward-TwigStack; STWL: Straightforward-TwigStackList
[Figure: execution time (s) for Q1, Q2, Q3 on XMark; series: STW, STWL, OrderedTJ]
[Figure: execution time (seconds) for Q4, Q5, Q6 on TreeBank; series: STW, STWL, OrderedTJ]
OrderedTJ is significantly better than the two straightforward methods on XMark and TreeBank data.
iTwigJoin
The decrease in the number of elements scanned
[Figure: bytes scanned (M) for Q1, Q2, Q3 on XMark data; series: tag, tag+level, prefix path]
[Figure: bytes scanned (M) for Q4, Q5, Q6 on Treebank data; series: tag, tag+level, prefix path]
More refined schemes scan fewer elements to answer a query.
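The sketch below shows why a finer streaming scheme can scan fewer elements. It groups the elements of a toy document into streams keyed by tag, by tag plus level, and by prefix path. The document, the `partition` helper, and the use of path strings as stand-ins for element labels are all assumptions made for this example. Real streams would hold positional labels in document order.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy document (invented for illustration).
DOC = "<a><b><c/></b><d><c/></d><b><a><c/></a></b></a>"


def partition(root):
    """Group elements into streams under three (assumed) stream keys."""
    streams = {
        "tag": defaultdict(list),
        "tag+level": defaultdict(list),
        "prefix path": defaultdict(list),
    }

    def visit(elem, level, parent_path):
        path = parent_path + "/" + elem.tag
        streams["tag"][elem.tag].append(path)
        streams["tag+level"][(elem.tag, level)].append(path)
        streams["prefix path"][path].append(path)
        for child in elem:
            visit(child, level + 1, path)

    visit(root, 1, "")
    return streams


streams = partition(ET.fromstring(DOC))
stream_counts = {scheme: len(groups) for scheme, groups in streams.items()}
print(stream_counts)

# For a query such as //d//c, the tag scheme has to scan every c element,
# while the prefix path scheme can skip c streams that have no d ancestor.
c_tag = streams["tag"]["c"]
c_under_d = [p for path, items in streams["prefix path"].items()
             if path.endswith("/c") and "/d/" in path for p in items]
print(len(c_tag), len(c_under_d))
```

The finer the stream key, the more streams there are and the fewer elements each one holds, so a query can skip whole streams that cannot contribute to a match.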
iTwigJoin
Performance of queries for three streaming schemes
[Figure: execution time for Q1, Q2, Q3 on XMark; series: tag, tag+level, prefix path]
[Figure: execution time (s) for Q4, Q5, Q6 on TreeBank; series: tag, tag+level, prefix path]
The prefix path scheme is suitable for large but shallow documents, and the tag+level scheme generally works well even for complicated recursive documents.
TwigStackList vs. iTwigJoin
Observation: iTwigJoin scans far fewer elements than TwigStack and TwigStackList in two twig queries.
[Figure: number of elements read (0 to 1200000) for Q3, Q4, Q5 on TreeBank data; series: TwigStack, TwigStackList, iTwigJoin]
TwigStackList vs. iTwigJoin
Observation: iTwigJoin performs much better than TwigStack and TwigStackList.
Explanation: iTwigJoin reduces I/O cost by reading fewer elements.
[Figure: execution time (seconds) for Q3, Q4, Q5 on TreeBank data; series: TwigStack, TwigStackList, iTwigJoin]
[Figure: execution time (seconds) for Q1, Q2 on DBLP data; series: TwigStack, TwigStackList, iTwigJoin]
iTwigJoin, TJFast, Twig2Stack
Observation: iTwigJoin and TJFast perform better than Twig2Stack.
Reason: iTwigJoin and TJFast reduce I/O cost by reading fewer elements.
[Figure: two panels (TreeBank data, DBLP data) of execution time (s), one over queries 1, 2 and one over queries 1, 2, 3; series: iTwigJoin, TJFast, Twig2Stack]
Experiments: TJFastL and iTwigJoin
Observation: Both algorithms are based on the tag+level scheme. TJFastL performs much better than iTwigJoin under this scheme.
Explanation: TJFast reduces I/O cost by reading fewer elements.
[Figure: two panels (DBLP data, TreeBank data) of execution time (seconds), one over Q3, Q4, Q5 and one over Q1, Q2; series: iTwigJoin, TJFastL]
TJFast and TreeMatch
Observation: TreeMatch performs much better than TJFast.
Explanation: TreeMatch reduces I/O cost over TJFast.
[Figure: two panels (DBLP data, TreeBank data) of execution time (seconds), one over Q1, Q2 and one over Q3, Q4, Q5; series: TJFast, TreeMatch]
Conclusions
Efficient processing of twig queries is a core operation in XPath and XQuery
We reviewed and compared seven holistic algorithms: TwigStack (SIGMOD 2002), TwigStackList (CIKM 2004), OrderedTJ (DEXA 2005), iTwigJoin (SIGMOD 2005), TJFast (VLDB 2005), Twig2Stack (VLDB 2006), TreeMatch (TKDE 2010)
Comprehensive benchmark experiments show the correctness and efficiency of holistic algorithms
Conclusions (Cont.)
In holistic TPQ processing, I/O cost takes most of the time
TJFast reduces input data size
Twig2Stack reduces output size
TreeMatch reduces both input and output data size
Reference works
[1] J. Lu, T. W. Ling, Z. Bao, and C. Wang. Extended XML tree pattern matching: theories and algorithms. IEEE TKDE, 2010 (to appear).
Proposes the TreeMatch algorithm
[2] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, 2004, pp. 533–542.
Proposes the TwigStackList algorithm
[3] J. Lu and T. W. Ling. Labeling and querying dynamic XML trees. In Proceedings of the Sixth Asia Pacific Web Conference, 2004, pp. 180–189.
Proposes a new labeling scheme for dynamic XML documents
[4] T. Chen, J. Lu, and T. W. Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In SIGMOD, 2005.
Proposes two new data streaming techniques
[5] J. Lu, T. W. Ling, C. Chan, and T. Chen. From region encoding to extended Dewey: on efficient processing of XML twig pattern matching. In Proceedings of VLDB, 2005, pp. 193–204.
Proposes the TJFast algorithm
Reference works (Cont.)
[6] J. Lu, T. W. Ling, T. Yu, C. Li, and W. Ni. Efficient processing of ordered XML twig pattern matching. In Proceedings of DEXA, 2005, pp. 300–309.
Proposes the OrderedTJ algorithm
[7] J. Lu, T. W. Ling, and T. Chen. TJFast: effective processing of XML twig pattern matching. In Proceedings of WWW, 2005, pp. 1118–1119.
Proposes the extended Dewey labeling scheme
[8] T. Yu, T. W. Ling, and J. Lu. TwigStackListNot: a holistic twig join algorithm for twig query with not-predicates on XML data. In DASFAA, pp. 249–263.
Proposes an algorithm for twig queries with NOT predicates
[9] J. Lu, R. Yang, W. Ling, and A. K. H. Tung. Efficient XML tree pattern matching: theory and algorithm. Submitted to IEEE TKDE.
Proposes a theory and algorithm for extended XML tree patterns
Reference works (Cont.)
[10] S. Al-Khalifa, H. V. Jagadish, J. Patel, Y. Wu, N. Koudas, and D. Srivastava. Structural joins: a primitive for efficient XML query pattern matching. In ICDE, 2002, pp. 141–152.
Proposes the StackTree algorithm
[11] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
Proposes the TwigStack algorithm
[12] C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. M. Lohman. On supporting containment queries in relational database management systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2001, pp. 425–436.
Proposes the containment labeling scheme
Reference works (Cont.)
[13] H. Jiang, W. Wang, and H. Lu. Holistic twig joins on indexed XML documents. In VLDB, 2003.
Proposes the TSGeneric algorithm
[14] I. Tatarinov, S. Viglas, K. S. Beyer, J. Shanmugasundaram, E. J. Shekita, and C. Zhang. Storing and querying ordered XML using a relational database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2002, pp. 204–215.
Proposes the Dewey labeling scheme
[15] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: a dynamic index method for querying XML data by tree structures. In SIGMOD, 2003.
Proposes the ViST system
[16] B. Yang, M. Fontoura, E. J. Shekita, S. Rajagopalan, and K. S. Beyer. Virtual cursors for XML joins. In CIKM, 2004, pp. 523–532.
Proposes the virtual cursor algorithm