![Page 1: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/1.jpg)
From Region Encoding To Extended Dewey: On Efficient
Processing of XML Twig Pattern Matching
Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, Ting Chen
National University of Singapore
![Page 2: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/2.jpg)
Outline
- Background
  - Define our problem: XML twig pattern matching
  - Previous work and problems
- Our new twig matching algorithms
  - A new labeling scheme: extended Dewey
  - A new holistic algorithm: TJFast
- Experimental results
- Conclusion
![Page 3: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/3.jpg)
XML basics
- Short for Extensible Markup Language
- A language for defining the syntax and semantics of structured data
- An XML document is commonly modeled as a rooted, ordered and tagged tree.
(Figure: an example XML tree rooted at book, with element tags preface, chapter, section, paragraph and title, and text values such as "XML", "Data" and "Intro".)
![Page 4: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/4.jpg)
Querying XML Data
- Major standards for querying XML data: XPath and XQuery
- XML twig pattern matching is a core operation in XPath and XQuery
- Definition of XML twig pattern: an XML twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either parent-child edges or ancestor-descendant edges.
![Page 5: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/5.jpg)
An XML twig pattern example
Create a flat list of all the title-author pairs for every book in the bibliography.
XQuery:
<results>
{
for $b in doc("bib.xml")/bib//book,
$t in $b/title,
$a in $b/author
return
<result> { $t } { $a } </result>
}
</results>
To answer the XQuery, we first need to match the following XML twig pattern:
(Figure: a twig with root bib; bib is connected to $b: book by an ancestor-descendant edge, and book is connected to $t: title and $a: author by parent-child edges.)
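
The twig pattern above can be written down as a small tree data structure. The following is a minimal sketch in Python, not code from the talk: the class name `TwigNode`, the `axis` field and the helper `root_to_leaf_paths` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TwigNode:
    tag: str
    # Edge from the parent: "/" (parent-child) or "//" (ancestor-descendant).
    axis: str = "/"
    children: list = field(default_factory=list)


def root_to_leaf_paths(node, prefix=""):
    """List every root-to-leaf path of the twig in an XPath-like notation."""
    path = prefix + node.axis + node.tag
    if not node.children:
        return [path]
    return [p for child in node.children for p in root_to_leaf_paths(child, path)]


# The twig behind the XQuery: bib //book, with book /title and book /author.
bib = TwigNode("bib", "/", [
    TwigNode("book", "//", [TwigNode("title"), TwigNode("author")]),
])
print(root_to_leaf_paths(bib))
```

The two root-to-leaf paths printed here are the units that a holistic twig join evaluates and then joins at the branching node (book in this example).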
![Page 6: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/6.jpg)
Our research problem
Problem statement: Given an XML twig pattern Q and an XML database D, we need to find ALL the matches of Q in D.
E.g., consider the following twig pattern and document:
(Figure: an XML tree with elements s1, s2, t1, t2, p1 and f1, and a twig pattern with root Section and children Title and Figure.)
Query answers: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
![Page 10: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/10.jpg)
Related work
- TreeMerge and Stack-Tree [Al-Khalifa ICDE 2002]: binary structural join algorithms (Stack-Tree is stack-based), but they produce large intermediate results.
- TwigStack [Bruno SIGMOD 2002]: a holistic twig join algorithm; sub-optimal for queries with parent-child relationships.
- TwigStackList [Lu CIKM 2004]: a holistic twig join algorithm that produces fewer useless intermediate results than TwigStack for queries with parent-child relationships.
![Page 11: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/11.jpg)
Our research goal
In this research, we want to design a new holistic twig join algorithm that is more efficient than previous work.
Two aspects to achieve this goal:
(1) Input: reduce the input I/O cost
(2) Output: reduce the size of intermediate results
![Page 13: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/13.jpg)
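Original Dewey Labeling Scheme
In the Dewey labeling scheme, each element is represented by a vector:
(i) the root is labeled by the empty string ε;
(ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s.
For example:
(Figure: a tree with elements s1, s2, t1, t2, f1 and f2; the root is labeled ε, its children are labeled 1, 2 and 3, and the children of node 2 are labeled 2.1 and 2.2.)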
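
Rules (i) and (ii) can be sketched in a few lines of Python. This is an illustrative sketch only: the tuple representation of labels, the `(tag, children)` tree encoding and the function names are assumptions, and the example tree merely has the same shape as the figure (three children under the root, two children under the second one).

```python
def dewey_labels(tree, label=()):
    """Yield (label, tag) pairs in document order.

    tree is a (tag, [children]) pair; a label is a tuple of 1-based child
    positions, so the root gets the empty tuple (the empty string ε).
    """
    tag, children = tree
    yield label, tag
    for position, child in enumerate(children, start=1):
        yield from dewey_labels(child, label + (position,))


def is_ancestor(a, b):
    """a is an ancestor of b exactly when a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


# Hypothetical tags; only the shape follows the figure.
tree = ("s", [("t", []), ("s", [("t", []), ("f", [])]), ("f", [])])
for label, tag in dewey_labels(tree):
    print(".".join(map(str, label)) or "ε", tag)
```

The prefix test in `is_ancestor` is what makes ancestor labels recoverable from a Dewey label; what the original scheme does not provide is the tag names along that path.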
![Page 14: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/14.jpg)
Main problem of the original Dewey
If we use the original Dewey labeling scheme to answer a twig query, we need to read the labels for all query nodes. Thus, it offers no performance benefit over previous methods.
Our idea: extend the original Dewey labeling scheme so that, given the label of any element e, we can derive the path from the root to e from this label alone.
![Page 15: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/15.jpg)
Modulo function
- We need to know some schema information: a DTD (Document Type Definition) or an XML Schema.
- Given DTD information: book → author, title, chapter*
- Our solution: using a modulo function, we create a mapping between element tags and integers. We define $X_{\text{author}} \bmod 3 = 0$, $X_{\text{title}} \bmod 3 = 1$, $X_{\text{chapter}} \bmod 3 = 2$, where $X_t$ is the last component of the label of an element with tag t.
(Figure: book, labeled ε, with children author = 0, title = 1, chapter = 2 and a second chapter = 5.)
Why not 3, as in the original Dewey?
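
One assignment rule that reproduces the labels in the figure is: give each child the smallest integer that is larger than its preceding sibling's component and has the remainder reserved for its tag. The sketch below assumes this rule; the function names are hypothetical, and the rule is inferred from the slide's example rather than quoted from it.

```python
def next_component(prev, remainder, modulus):
    """Smallest integer x > prev with x % modulus == remainder.

    Use prev = -1 for the first child of a parent.
    """
    x = prev + 1
    return x + (remainder - x) % modulus


def label_children(child_tags, allowed_tags):
    """Last label components for a sequence of sibling elements.

    allowed_tags lists the child tags of the parent in DTD order; the
    position of a tag in that list is its reserved remainder.
    """
    modulus = len(allowed_tags)
    prev, components = -1, []
    for tag in child_tags:
        prev = next_component(prev, allowed_tags.index(tag), modulus)
        components.append(prev)
    return components


# book -> author, title, chapter*
print(label_children(["author", "title", "chapter", "chapter"],
                     ["author", "title", "chapter"]))
```

Under this rule the second chapter cannot be labeled 3, because 3 mod 3 = 0 is the remainder reserved for author; the next integer with remainder 2 is 5.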
![Page 16: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/16.jpg)
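Derive element tag
From a label, we can derive its tag name.
book → author, title, chapter*
Recall that we define: $X_{\text{author}} \bmod 3 = 0$, $X_{\text{title}} \bmod 3 = 1$, $X_{\text{chapter}} \bmod 3 = 2$.
(Figure: book, labeled ε, with children labeled 0, 1, 2 and 5; applying the modulo rule to these labels gives the tags author, title, chapter and chapter.)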
![Page 17: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/17.jpg)
Derive the path from a label
By following a finite state transducer (FST), we can recursively derive the whole path from any extended Dewey label. For example:
DTD:
book → author, title, chapter*
chapter → (paragraph | section)*
section → (paragraph | section)*
FST transitions:
book → author (mod 3 = 0), title (mod 3 = 1), chapter (mod 3 = 2)
chapter → paragraph (mod 2 = 0), section (mod 2 = 1)
section → paragraph (mod 2 = 0), section (mod 2 = 1)
(Figure: a document tree with book, author, title, chapter, section and paragraph elements.)
Question: Given a label 5.1.0 for an element, what is the corresponding path?
![Page 18: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/18.jpg)
Derive the path from a label
By following a finite state transducer (FST), we can recursively derive the whole path from any extended Dewey label.
(Figure: same DTD, document and FST as on the previous slide, with the followed transitions highlighted in red.)
Following the red path: 5 mod 3 = 2 leads from book to chapter, 1 mod 2 = 1 leads from chapter to section, and 0 mod 2 = 0 leads from section to paragraph. Hence 5.1.0 denotes:
book/chapter/section/paragraph
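
The derivation can be sketched as a table-driven walk. In this Python sketch the dictionary `FST` encodes the transitions of the slide's DTD, with the list position of each child tag serving as its remainder; the dictionary layout and the function name `derive_path` are assumptions made for illustration.

```python
# Child tags per state, in DTD order; the index of a tag is its remainder.
FST = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}


def derive_path(label, root="book"):
    """Translate an extended Dewey label such as "5.1.0" into a tag path.

    The empty label (ε) denotes the root itself. Labels that descend
    below a tag without children in FST raise KeyError in this sketch.
    """
    path, state = [root], root
    if label:
        for component in label.split("."):
            children = FST[state]
            state = children[int(component) % len(children)]
            path.append(state)
    return "/".join(path)


print(derive_path("5.1.0"))
```

Each label component costs one table lookup, so the work grows with the length of the label and not with the size of the document.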
![Page 19: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/19.jpg)
Two properties of extended Dewey
- Find Ancestor Label: from the label of any element, we can derive the labels of all its ancestors.
- Find Ancestor Name: from the label of any element, we can derive the tag names of all its ancestors.
These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.
![Page 20: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/20.jpg)
Outline
- Background
  - Define our problem: XML twig pattern matching
  - Previous work and challenges
- Our new twig matching algorithms
  - A new labeling scheme: extended Dewey
  - A new holistic algorithm: TJFast (a Fast Twig Join algorithm)
- Experiments
- Conclusion
![Page 21: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/21.jpg)
A new algorithm: TJFast
- For each leaf node n in the query, there is a corresponding input stream Tn.
- Tn contains the extended Dewey labels of the elements with tag n, arranged in document order.
- For each branching node b of the twig pattern, there is a corresponding set Sb, which contains elements that may contribute to query answers. (How does this differ from TwigStack?)
- At any point during the computation, the size of set Sb is bounded by the depth of the XML document.
![Page 22: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/22.jpg)
A new algorithm: TJFast
Two-phase algorithm:
- Phase 1: solutions to individual root-to-leaf query paths are output as intermediate paths.
  - Insert elements that possibly contribute to query answers into the sets.
  - Output intermediate paths according to the elements in the sets.
- Phase 2: the intermediate paths are merge-joined to get the final results.
![Page 23: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/23.jpg)
An example for TJFast algorithm
Query: a twig with root A and two branches, A//D and A/B//C (A is the branching node).
DTD:
a → a*, d*, b*
b → d*, c*
d → c*
Document (element: extended Dewey label; Root is labeled ε):
a1: 0
a2: 0.0
d1: 0.0.1
a3: 0.3
d2: 0.3.1
b1: 0.3.2
c1: 0.3.2.1
b2: 0.5
d3: 0.5.0
c2: 0.5.0.0
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
Set for the branching node A: { }
Why do we not need TA, TB streams?
![Page 24: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/24.jpg)
An example for TJFast algorithm
(Figure: same document, query and streams as on the previous slide. TD: 0.0.1, 0.3.1, 0.5.0; TC: 0.3.2.1, 0.5.0.0; set for A: { }.)
By the finite state transducer of the extended Dewey labeling scheme, we derive:
0.0.1 → a1/a2/d1
0.3.2.1 → a1/a3/b1/c1
![Page 25: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/25.jpg)
An example for TJFast algorithm
(Figure: same document, query and streams. TD: 0.0.1, 0.3.1, 0.5.0; TC: 0.3.2.1, 0.5.0.0; set for A: { }.)
Both a1 and a3 possibly contribute to query answers. (Why not a2?)
![Page 26: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/26.jpg)
An example for TJFast algorithm
(Figure: same document, query and streams. TD: 0.0.1, 0.3.1, 0.5.0; TC: 0.3.2.1, 0.5.0.0; set for A: { }.)
Then we insert a1 into the set, since a1 is an ancestor of a3.
![Page 27: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/27.jpg)
An example for TJFast algorithm
(Figure: same document, query and streams. TD: 0.0.1, 0.3.1, 0.5.0; TC: 0.3.2.1, 0.5.0.0; set for A: {a1}.)
Move the cursor of TD from d1 to d2 and output one path solution <a1,d1>.
![Page 28: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/28.jpg)
An example for TJFast algorithm
(Figure: same document, query and streams. TD: 0.0.1, 0.3.1, 0.5.0; TC: 0.3.2.1, 0.5.0.0; set for A: {a1, a3}.)
We derive 0.3.1 → a1/a3/d2.
We insert a3 into the set, since a3 definitely contributes to query answers.
![Page 29: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/29.jpg)
An example for TJFast algorithm
(Figure: same document, query and streams. TD: 0.0.1, 0.3.1, 0.5.0; TC: 0.3.2.1, 0.5.0.0; set for A: {a1, a3}.)
Move the cursor of stream TD from d2 to d3 and output <a1,d2> and <a3,d2>.
![Page 30: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/30.jpg)
An example for TJFast algorithm
(Figure: same document, query and streams. TD: 0.0.1, 0.3.1, 0.5.0; TC: 0.3.2.1, 0.5.0.0; set for A: {a1, a3}.)
Move the cursor of stream TC from c1 to c2 and output the path <a3,b1,c1>.
![Page 31: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/31.jpg)
An example for TJFast algorithm
(Figure: same document, query and streams. TD: 0.0.1, 0.3.1, 0.5.0; TC: 0.3.2.1, 0.5.0.0; set for A: {a1, a3}.)
Move the cursor of TD to the end and output the path solution <a1,d3>.
![Page 32: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/32.jpg)
An example for TJFast algorithm
(Figure: same document, query and streams. TD: 0.0.1, 0.3.1, 0.5.0; TC: 0.3.2.1, 0.5.0.0; set for A: {a1, a3}.)
Move the cursor of TC to the end and output <a1,b2,c2>.
![Page 33: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/33.jpg)
An example for TJFast algorithm
(Figure: same document, query and streams. TD: 0.0.1, 0.3.1, 0.5.0; TC: 0.3.2.1, 0.5.0.0; set for A: {a1, a3}.)
Now that all five stream elements have been scanned, in the second phase we merge-join all output path solutions.
![Page 34: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/34.jpg)
An example for TJFast algorithm
(Figure: same document and query as on the previous slides.)
Phase 1. Intermediate paths
A//D: <a1,d1>, <a1,d2>, <a1,d3>, <a3,d2>
A/B//C: <a1,b2,c2>, <a3,b1,c1>
Phase 2. Final solutions, obtained by joining the intermediate paths, in the form <A, D, B, C>:
<a1,d1,b2,c2>, <a1,d2,b2,c2>, <a1,d3,b2,c2>, <a3,d2,b1,c1>
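
The two phases of the example can be replayed with a simplified Python sketch. It is not the TJFast algorithm itself: it decodes each leaf label with the FST and matches every root-to-leaf query path independently, without the set for the branching node A, and it uses a nested-loop join instead of a merge join. As a result its first phase also emits the useless path <a2,d1>, which TJFast avoids. The table `CHILD_TAGS`, the virtual `"root"` state and all function names are assumptions for illustration; elements are identified by their labels (a1 = 0, a3 = 0.3, and so on).

```python
# Child tags per parent in DTD order (a -> a*, d*, b*; b -> d*, c*; d -> c*).
# "root" is a hypothetical virtual parent of the document element a1.
CHILD_TAGS = {"root": ["a"], "a": ["a", "d", "b"], "b": ["d", "c"], "d": ["c"]}


def decode(label):
    """Return [(tag, label), ...] for the element and its ancestors, top-down."""
    parts, parent, path = label.split("."), "root", []
    for i, component in enumerate(parts):
        tags = CHILD_TAGS[parent]
        parent = tags[int(component) % len(tags)]
        path.append((parent, ".".join(parts[: i + 1])))
    return path


def match_path(pattern, path):
    """All embeddings of a query path whose leaf is the last node of path.

    pattern is [(tag, axis), ...]; axis is the edge to the previous
    pattern node ("/" or "//") and is ignored for the query root.
    """
    found = []

    def rec(pi, di, acc):
        tag, axis = pattern[pi]
        if path[di][0] != tag:
            return
        acc = (path[di][1],) + acc
        if pi == 0:
            found.append(acc)
        elif axis == "/":
            if di > 0:
                rec(pi - 1, di - 1, acc)
        else:
            for dj in range(di):
                rec(pi - 1, dj, acc)

    rec(len(pattern) - 1, len(path) - 1, ())
    return found


T_D = ["0.0.1", "0.3.1", "0.5.0"]   # stream of D labels
T_C = ["0.3.2.1", "0.5.0.0"]        # stream of C labels
A_D = [("a", None), ("d", "//")]
A_B_C = [("a", None), ("b", "/"), ("c", "//")]

# Phase 1: path solutions for each leaf stream.
paths_d = [m for label in T_D for m in match_path(A_D, decode(label))]
paths_c = [m for label in T_C for m in match_path(A_B_C, decode(label))]

# Phase 2: join the two path lists on the branching node A.
final = [(a, d, b, c) for a, d in paths_d for a2, b, c in paths_c if a == a2]
for row in final:
    print(row)
```

Only the leaf streams TD and TC are read; the labels of the A and B elements are recovered from the leaf labels, which is why no TA or TB stream is needed.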
![Page 36: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/36.jpg)
Experiments
Benchmarks:
- XMark: synthetic data
- DBLP: real data from the DBLP database
- Treebank: real data from the Wall Street Journal
| | XMark | DBLP | Treebank |
|---|---|---|---|
| Data size (MB) | 582 | 130 | 82 |
| Nodes (million) | 8 | 3.3 | 2.4 |
| Max/Avg depth | 12/5 | 6/2.9 | 36/7.8 |
![Page 37: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/37.jpg)
Path query
We compared PathStack [1] and TJFast on the following four path queries on XMark data.
| Query | Path query |
|---|---|
| PQ1 | /site/closed_auctions/closed_auction/price |
| PQ2 | /site/regions//item/location |
| PQ3 | /site/people/person/gender |
| PQ4 | /site/open_auctions/open_auction/reserve |
![Page 38: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/38.jpg)
Experiments: Number of elements read and input file size for path queries
(Figure: two bar charts comparing PathStack and TJFast on Q1 to Q4. Left: disk files read (KBytes). Right: number of elements read.)
Observation: TJFast scans fewer elements than PathStack does.
Explanation: TJFast scans only the labels for the leaf nodes of a query, whereas PathStack scans elements for all nodes in the query.
![Page 39: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/39.jpg)
Experiments: Execution time for path queries
(Figure: bar chart of execution time (seconds) for PathStack and TJFast on Q1 to Q4.)
Observation: TJFast performs better than PathStack on all four path queries.
Explanation: TJFast reduces I/O cost by reading fewer elements.
![Page 40: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/40.jpg)
Twig queries
We compared TwigStack, TwigStackList and TJFast on the following five twig queries on DBLP and TreeBank data.
| Query | Source | Twig query |
|---|---|---|
| TQ1 | DBLP | //proceedings//title[.//i]//sup |
| TQ2 | DBLP | //article[.//sup]//title//sub |
| TQ3 | Treebank | /S[.//VP/IN]//NP |
| TQ4 | Treebank | /S/VP/PP[IN]/NP/VBN |
| TQ5 | Treebank | //VP[DT]//PRP_DOLLAR_ |
![Page 41: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/41.jpg)
Experiments: Number of elements read and input file size for twig queries
(Figure: two bar charts comparing TwigStack, TwigStackList and TJFast on Q1 and Q2. Left: disk file size (KBytes). Right: number of elements read.)
Observation: TJFast scans far fewer elements than TwigStack and TwigStackList do for the two twig queries.
Explanation: TJFast scans only the elements for the leaf nodes of a query, whereas TwigStack and TwigStackList need to scan elements for all nodes. Moreover, there are many more elements for the non-leaf nodes than for the leaf nodes.
![Page 42: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/42.jpg)
Experiments: Execution time for twig queries
(Figure: bar chart of execution time (seconds) for TW-SS, TwigStack, TwigStackList, TJ-SS and TJFast on Q1 and Q2.)
Observation: For DBLP data, TJFast performs much better than TwigStack and TwigStackList.
Explanation: TJFast reduces I/O cost by reading fewer elements.
TW-SS and TJ-SS denote the sequential scan time of the input data for TwigStack/TwigStackList and TJFast, respectively.
![Page 44: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/44.jpg)
Conclusions
- Efficient processing of twig queries is a core operation in XPath and XQuery.
- We have proposed a new labeling scheme, extended Dewey, and a new holistic twig pattern matching algorithm, TJFast.
- Compared to previous work:
  - TJFast reduces the input I/O cost.
  - TJFast reduces the output I/O cost for intermediate results.
![Page 45: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/45.jpg)
References
[1] S. Al-Khalifa, H. V. Jagadish, J. Patel, Y. Wu, N. Koudas, D. Srivastava. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. ICDE 2002, pages 141-152. (Proposes the StackTree algorithm.)
[2] N. Bruno, D. Srivastava, N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002. (Proposes the TwigStack algorithm.)
[3] T. Chen, J. Lu, T. Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In SIGMOD, 2005. (Proposes two new data streaming techniques.)
[4] Y. Chen, S. B. Davidson, Y. Zheng. BLAS: An efficient XPath processing system. In Proc. of SIGMOD, pages 47-58, 2004. (Proposes a new algorithm for XPath queries.)
![Page 46: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/46.jpg)
References
[5] H. Jiang, W. Wang, H. Lu. Holistic twig joins on indexed XML documents. VLDB 2003. (Proposes the TSGeneric algorithm.)
[6] J. Lu, T. Chen, T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM, pages 533-542, 2004. (Proposes the TwigStackList algorithm.)
[7] P. Rao, B. Moon. PRIX: Indexing and querying XML using Prüfer sequences. In ICDE, pages 288-300, 2004. (Proposes the PRIX system.)
[8] H. Wang, S. Park, W. Fan, P. S. Yu. ViST: A dynamic index method for querying XML data by tree structures. In SIGMOD, 2003. (Proposes the ViST system.)
[9] B. Yang, M. Fontoura, E. J. Shekita, S. Rajagopalan, K. S. Beyer. Virtual Cursors for XML joins. In CIKM, pages 523-532, 2004. (Proposes the Virtual Cursor algorithm.)
![Page 47: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/47.jpg)
END
Thank you!
Q & A
![Page 48: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/48.jpg)
Related work
Comparison between Virtual Cursor (VC) [Yang CIKM 2004] and our work:
- The two approaches were developed independently.
- TJFast uses a finite state transducer; VC uses a path table.
- The size of the path table depends on the number of distinct paths, whereas the size of the FST depends on the number of distinct element types.
- TJFast reduces the number of useless intermediate paths for queries with parent-child edges; VC does not have this property.
![Page 49: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/49.jpg)
Backup
(Figure: a twig query over the nodes a, b, c, d and e, and a document with the elements a1, a2, b1, c1, c2, d1, e1, f1 and f2.)
TwigStackList outputs <a1,b1>, but TJFast does not output this path solution.
![Page 50: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/50.jpg)
Label size
| Labeling scheme | XMark | DBLP | TreeBank |
|---|---|---|---|
| Region encoding (MB) | 71.9 | 21.6 | 23.3 |
| Original Dewey (MB) | 56.2 | 18.1 | 22.8 |
| Extended Dewey (MB) | 72.6 | 19.5 | 28.7 |
![Page 51: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/51.jpg)
Optimal query classes
- If an algorithm does not output any useless intermediate results for a query Q on any given document, we say that the algorithm is optimal for Q.
- The larger the optimal query class of an algorithm, the better it controls the size of intermediate results.
![Page 52: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/52.jpg)
Optimal class of TJFast and TwigStack
| Algorithm | Optimal query class |
|---|---|
| TwigStack | All edges are ancestor-descendant relationships |
| TJFast | All edges connecting branching nodes and their children are ancestor-descendant relationships |
(Figure: three example twig queries over the nodes a, b, c and d.)
Even for non-optimal queries, TJFast usually outputs fewer useless intermediate paths than TwigStack does.
![Page 53: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/53.jpg)
Update of XML documents
- To support updates of XML documents, we need to slightly modify the extended Dewey labeling scheme.
- Our idea comes from ORDPATH*.
- We can avoid relabeling the document under any update.
* P. O'Neil, E. O'Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. ORDPATHs: Insert-friendly XML node labels. In SIGMOD, pages 903-908, 2004.
![Page 54: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/54.jpg)
More examples for assigning labels
Let us consider a more complicated DTD:
a → (b | c)*, d?, c+
We define: $X_b \bmod 3 = 0$, $X_c \bmod 3 = 1$, $X_d \bmod 3 = 2$.
(Why do we use mod 3 instead of 4?)
(Figure: a, labeled ε, with children b = 0, d = 2, c = 4 and c = 7.)
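
A plausible reading of the question is that the modulus counts the distinct child tag names (b, c, d), not the four positions in the content model, since c appears twice. The Python sketch below illustrates that reading; the regular expression used to extract tag names, the smallest-larger-integer assignment rule and both function names are assumptions for illustration.

```python
import re


def distinct_tags(content_model):
    """Child tag names of a DTD content model, in order of first appearance."""
    tags = []
    for name in re.findall(r"[A-Za-z_][\w.-]*", content_model):
        if name not in tags:
            tags.append(name)
    return tags


def label_children(child_tags, tags):
    """Give each child the smallest integer above its preceding sibling's
    component whose remainder (mod the number of tags) is its tag's index."""
    modulus, prev, components = len(tags), -1, []
    for tag in child_tags:
        x = prev + 1
        prev = x + (tags.index(tag) - x) % modulus
        components.append(prev)
    return components


tags = distinct_tags("(b | c)*, d?, c+")
labels = label_children(["b", "d", "c", "c"], tags)
print(tags, labels)
```

With three distinct tags the children b, d, c, c receive the components shown in the figure; a modulus of 4 would reserve a remainder that no tag uses and make labels grow faster.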
![Page 55: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56814f60550346895dbd1683/html5/thumbnails/55.jpg)
Computing cost of FST
- The CPU time of the FST is linear in the length of an extended Dewey label and independent of the complexity of the schema definition.
- The main-memory size of the FST is quadratic in the number of distinct element names in the XML documents, since the number of transitions in the FST is quadratic in the worst case.