1
Holistic Twig Joins: Optimal XML Pattern Matching
ACM SIGMOD 2002
2
In this lecture
The Problem, Idea, Preliminaries, PathStack Algorithm, TwigStack Algorithm, Conclusions
3
The problem
To find semantically connected data in an XML document efficiently.
Existing approaches produce many intermediate results that do not participate in the final answers.
4
The problem (example)
For example, take this XQuery path expression:
book[title = 'XML']//author[fn = 'jane' and ln = 'doe']
It can be translated into a twig (small tree) pattern:
book
  title
    XML
  author (descendant of book)
    fn
      jane
    ln
      doe
5
The problem (example)
To solve this problem we have to:
Find all binary relationships, like (book, title) and (author, fn).
Connect all the matches we have found to compile the answer.
The problem is that every book has a title, but only some of them have the title 'XML', so we produce many intermediate results that do not participate in the final answer.
6
In this lecture
The Problem, Idea, Preliminaries, PathStack Algorithm, TwigStack Algorithm, Conclusions
7
Idea
The main idea of the paper is to store intermediate results in a compact way.
The goal is an algorithm whose cost is independent of the size of the intermediate results.
A family of stack-based algorithms was invented for this purpose.
8
In this lecture
The Problem, Idea, Preliminaries, PathStack Algorithm, TwigStack Algorithm, Conclusions
9
Representing position of elements
Every node in the XML document is represented as a 3-tuple:
Leaf: (DocId, LeftPos, LevelNum)
Node: (DocId, LeftPos : RightPos, LevelNum)
10
Representing position of elements
For example:
book (1,1:31,1)
  title (1,2:4,2)
    XML (1,3,3)
  authors (1,5:30,2)
    author (1,6:13,2)
      fn (1,7:9,3)
        jane (1,8,4)
      ln (1,10:12,3)
        poe (1,11,4)
    author (1,14:21,2)
      fn (1,15:17,3)
        john (1,16,4)
      ln (1,18:20,3)
        doe (1,19,4)
    author (1,22:29,2)
      fn (1,23:25,3)
        jane (1,24,2)
      ln (1,26:28,2)
        doe (1,27,2)
11
Representing position of elements
For example
12
Representing position of elements
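Benefits: the structural relationships are easy to determine.
Ancestor-descendant relationship: a node n1(D1,L1:R1,N1) is a descendant of a node n2(D2,L2:R2,N2) iff D1 = D2, L2 < L1 and R1 < R2.
Parent-child relationship: a node n1(D1,L1:R1,N1) is a child of a node n2(D2,L2:R2,N2) iff D1 = D2, L2 < L1, R1 < R2 and N1 = N2 + 1.
Examples from the document above:
fn (1,7:9,3) is a descendant of book (1,1:31,1), because 1 < 7 and 9 < 31.
poe (1,11,4) is a child of ln (1,10:12,3), because 10 < 11 < 12 and 4 = 3 + 1.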
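The two containment tests can be sketched in a few lines of Python. This is only an illustration: the names Pos, is_descendant and is_child are assumptions made here, and a leaf is modelled by setting its RightPos equal to its LeftPos.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    """Positional representation (DocId, LeftPos:RightPos, LevelNum)."""
    doc: int
    left: int
    right: int   # assumption: for a leaf, right == left
    level: int


def is_descendant(d: Pos, a: Pos) -> bool:
    """True if d lies strictly inside a in the same document."""
    return d.doc == a.doc and a.left < d.left and d.right < a.right


def is_child(c: Pos, p: Pos) -> bool:
    """True if c is a descendant of p exactly one level below it."""
    return is_descendant(c, p) and c.level == p.level + 1


book = Pos(1, 1, 31, 1)
fn = Pos(1, 7, 9, 3)
ln = Pos(1, 10, 12, 3)
poe = Pos(1, 11, 11, 4)

print(is_descendant(fn, book))  # book contains fn
print(is_child(poe, ln))        # ln is the direct parent of poe
```

Both tests need only integer comparisons, which is what makes the encoding convenient for join algorithms.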
13
Representing position of elements
Available cases:
14
Matching stream
A stream Tq contains the positional representations of the database nodes that match the query node q.
The nodes in the stream are sorted by (DocId, LeftPos).
15
Matching stream (example)
Streams for the query nodes author and jane over the example document:
Tauthor: author(1,6:13,2), author(1,14:21,2), author(1,22:29,2)
Tjane: jane(1,8,4), jane(1,24,2)
The operations available on the streams: eof, advance, next, nextL, nextR
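A minimal sketch of such a stream, assuming the five operations have the obvious cursor semantics (the slide only lists their names): next returns the current element, nextL and nextR return its LeftPos and RightPos, advance moves the cursor, and eof reports exhaustion. The class name Stream and the tuple layout are assumptions for illustration.

```python
class Stream:
    """Cursor over the nodes matching one query node, sorted by (DocId, LeftPos)."""

    def __init__(self, items):
        # Each item is (doc_id, left_pos, right_pos, level); tuple order
        # gives exactly the (DocId, LeftPos) sort the stream requires.
        self.items = sorted(items)
        self.i = 0

    def eof(self):
        return self.i >= len(self.items)

    def next(self):
        return self.items[self.i]

    def next_l(self):          # nextL on the slide
        return self.items[self.i][1]

    def next_r(self):          # nextR on the slide
        return self.items[self.i][2]

    def advance(self):
        self.i += 1


t_author = Stream([(1, 14, 21, 2), (1, 6, 13, 2), (1, 22, 29, 2)])
while not t_author.eof():
    print(t_author.next_l(), t_author.next_r())
    t_author.advance()
```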
16
Linked stacks
Idea:
Repeatedly construct stacks that contain partial and total answers.
Remove partial answers that cannot be extended to total answers.
17
Linked stacks (example)
Data (a single nested path):
A1
  B1
    A2
      B2
        C1
Query: A // B // C
Stack encoding (bottom to top):
SA: A1, A2
SB: B1, B2
SC: C1
Query results:
A1 B1 C1
A2 B2 C1
A1 B2 C1
18
In this lecture
The Problem, Idea, Preliminaries, PathStack Algorithm, TwigStack Algorithm, Conclusions
19
Stack based algorithms
The stack-based algorithms use a chain of linked stacks to compactly represent partial and full results.
20
PathStack algorithm
Data (LeftPos:RightPos):
A1 (1:9)
  B1 (2:8)
    A2 (3:7)
      B2 (4:6)
        C1 (5)
Query: A // B // C
Streams:
TA: A1 (1:9), A2 (3:7)
TB: B1 (2:8), B2 (4:6)
TC: C1 (5)
21
PathStack algorithm
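Data (LeftPos:RightPos):
A1 (1:9)
  B1 (2:8)
    A2 (3:7)
      B2 (4:6)
        C1 (5)
Query: A // B // C
Streams:
TA: A1 (1:9), A2 (3:7)
TB: B1 (2:8), B2 (4:6)
TC: C1 (5)
Stack encoding (bottom to top):
SA: A1 (1:9), A2 (3:7)
SB: B1 (2:8), B2 (4:6)
SC: C1 (5)
Query results:
A1 B1 C1
A1 B2 C1
A2 B2 C1
Always take the stream element with the smallest LeftPos.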
22
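PathStack algorithm
Add C2 (8) to the data, after A2 and inside B1:
A1 (1:10)
  B1 (2:9)
    A2 (3:7)
      B2 (4:6)
        C1 (5)
    C2 (8)
Query: A // B // C
Stack encoding before C2 is read (bottom to top):
SA: A1 (1:10), A2 (3:7)
SB: B1 (2:9), B2 (4:6)
When C2 is read, A2 and B2 are popped because RightPos < LeftPos (7 < 8 and 6 < 8), so C2 only pairs with A1 and B1.
Query results:
A1 B1 C1
A1 B2 C1
A2 B2 C1
A1 B1 C2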
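The walk-through above can be reproduced with a short Python sketch of a PathStack-style loop. It is a simplified illustration under stated assumptions, not the paper's pseudocode: it handles a single document, a pure ancestor-descendant path query, and skips elements whose parent stack is empty. The function name path_stack and the (name, left, right) tuples are hypothetical.

```python
def path_stack(streams):
    """streams[i] holds (name, left, right) tuples for the i-th query node,
    sorted by left; streams[0] is the path root, streams[-1] the leaf."""
    n = len(streams)
    pos = [0] * n                      # cursor into each stream
    stacks = [[] for _ in range(n)]    # entries: (element, index into parent stack)
    results = []

    def expand(level, idx):
        """All root-to-level chains that end at stacks[level][idx]."""
        elem, ptr = stacks[level][idx]
        if level == 0:
            return [(elem[0],)]
        chains = []
        for j in range(ptr + 1):       # every entry at or below the pointer is an ancestor
            for prefix in expand(level - 1, j):
                chains.append(prefix + (elem[0],))
        return chains

    while pos[n - 1] < len(streams[n - 1]):
        # Pick the stream whose next element has the smallest LeftPos.
        qmin = min((i for i in range(n) if pos[i] < len(streams[i])),
                   key=lambda i: streams[i][pos[i]][1])
        elem = streams[qmin][pos[qmin]]
        pos[qmin] += 1
        # Pop entries that end before the new element starts (RightPos < LeftPos).
        for stack in stacks:
            while stack and stack[-1][0][2] < elem[1]:
                stack.pop()
        ptr = len(stacks[qmin - 1]) - 1 if qmin > 0 else -1
        if qmin > 0 and ptr < 0:
            continue                   # no ancestor available: cannot contribute
        stacks[qmin].append((elem, ptr))
        if qmin == n - 1:              # leaf: emit all encoded solutions
            results.extend(expand(qmin, len(stacks[qmin]) - 1))
            stacks[qmin].pop()
    return results


t_a = [("A1", 1, 10), ("A2", 3, 7)]
t_b = [("B1", 2, 9), ("B2", 4, 6)]
t_c = [("C1", 5, 5), ("C2", 8, 8)]
for row in path_stack([t_a, t_b, t_c]):
    print(" ".join(row))
```

Each stack entry stores only a pointer into the parent stack, so the three results that end in C1 are encoded by five stack entries rather than materialised as separate intermediate tuples.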
23
PathStack algorithm problems
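To match a twig we have to split it into several root-to-leaf paths and join the path results.
Again we get intermediate results that do not participate in the final result.
Data (LeftPos:RightPos):
authors (5:30)
  author (6:13)
    fn (7:9)
      jane (8)
    ln (10:12)
      poe (11)
  author (14:21)
    fn (15:17)
      john (16)
    ln (18:20)
      doe (19)
  author (22:29)
    fn (23:25)
      jane (24)
    ln (26:28)
      doe (27)
Query:
author
  fn
    jane
  ln
    doe
Here the path author/fn/jane matches the first and third authors and the path author/ln/doe matches the second and third, but only the third author (22:29) satisfies the whole twig.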
24
In this lecture
The Problem, Idea, Preliminaries, PathStack Algorithm, TwigStack Algorithm, Conclusions
25
TwigStack Algorithm
Idea:
Before adding a node to the stack, check that it has descendants that satisfy the twig pattern.
When the children are checked, their own children are checked too, recursively.
Now we can be sure that every path result is joinable with at least one other path result and participates in at least one full answer.
26
TwigStack Algorithm
[Figure: the authors (5:30) data tree with its three author subtrees (6:13), (14:21) and (22:29), matched against the twig query author with branches fn/jane and ln/doe.]
27
In this lecture
The Problem, Idea, Preliminaries, PathStack Algorithm, TwigStack Algorithm, Conclusions
28
Conclusions
The PathStack and TwigStack algorithms are efficient in terms of the amount of intermediate results.
But:
They are only optimal for finding ancestor-descendant relationships.
If the twig also contains parent-child relationships, then not all nodes that are pushed onto the stacks participate in the final result.
29
Break?
30
Querying Structured Text in an XML Database
ACM SIGMOD 2003
31
In this lecture
Abstract, Introduction, Motivation, Algebra, Access methods, Conclusions
32
Abstract
XML databases often contain documents with structured text.
It is important to integrate "information retrieval" style query evaluation.
IR-style evaluation is well studied for natural-language text.
But in the case of XML the relevant data may reside in an element's descendants.
33
In this lecture
Abstract, Introduction, Motivation, Algebra, Access methods, Conclusions
34
Introduction
Boolean-style queries (XQuery):
Useful when users are aware of the underlying schema.
But:
Users often don't know the schema.
Collections of XML documents are frequently heterogeneous.
35
Introduction
So we have to use relevance ranking in order to define IR on XML.
Problem: traditional IR is "document-centric".
XML IR should:
Be much more fine-grained.
Take document structure into account.
Allow more complex analysis than just determining relevance.
36
In this lecture
Abstract, Introduction, Motivation, Algebra, Access methods, Conclusions
37
Motivation
We have the following XML document, named articles.xml:
article #a1
  article-title #a2: Internet Technologies
  author #a3
    fname #a4: Jane
    sname #a5: Doe
  chapter #a6
    ct #a7: Caching and Replication
  chapter #a10
    ct #a11: Search and Retrieval
    section #a12
      section-title #a13: Search Engine
      …
    section #a14
      section-title #a15: Information Retrieval
      …
    section #a16
      section-title #a17: Examples
      p #a18: … Here are some IR based Search Engines: …
      p #a19: … search engine NewSearch uses a new information retrieval technology
      p #a20: semantic information retrieval techniques are also being incorporated into some search engines
38
Motivation
Consider the query: find document components in articles.xml that are about "search engine". Relevance to "internet" and "information retrieval" is desirable but not necessary.
Using AND and OR predicates will not give us the desired result.
39
Motivation
[Figure: the articles.xml document tree again, nodes #a1 to #a20, including the three paragraphs #a18 to #a20 that mention "search engine" and "information retrieval".]
40
Motivation
Illustrating the granularity problem: which elements should be ranked?
If we rank articles:
The user will see the whole article, while the relevant information is concentrated only in the third chapter.
If we rank paragraphs:
The paragraphs of the last section will be returned separately.
• The semantic linkage is broken and has to be reconstructed by the user.
41
Motivation
IR-style XML queries don't have to stand alone.
If the user knows the structure of the XML document, he can add structural constraints and limit the number of uninteresting results.
42
In this lecture
Abstract, Introduction, Motivation, Algebra, Access methods, Conclusions
43
Algebra
We want to fold into a database framework the notion of relevance scoring and ranking
45
Algebra
Scored Data Tree
Definition:
A rooted ordered tree such that each node has attribute-value pairs, including at least a tag and a real-valued score.
The score of a tree is the score of its root node.
Example:
article[3.6] #a1
  author #a3
    sname #a5
  section[3.6] #a16
46
Algebra
Scored Pattern Tree
Definition: P = (T, F, S)
• T => a node-labeled and edge-labeled tree
• F => a formula: a Boolean combination of predicates applicable to nodes
• S => a set of scoring functions
47
Algebra
Scored Pattern Tree Example:
Query2: Find document components in articles.xml that are part of an article written by an author with last name "Doe" and are about "search engine". Relevance to "internet" and "information retrieval" is desirable but not necessary.
T:
$1
  $2 (pc edge from $1)
    $3 (pc edge from $2)
  $4 (ad* edge from $1)
F:
$1.tag = article & $2.tag = author & $3.tag = sname & $3.content = "Doe"
S:
$4.score = { ScoreFoo({"search engine"}, {"internet", "information retrieval"}) }
$1.score = $4.score
48
Algebra
Common operators:
Selection => Scored Selection
Projection => Scored Projection
Join => Scored Join
New operators:
Threshold
Pick
49
Algebra (New Operators)
Threshold
T:
$1
  $2 (pc edge from $1)
    $3 (pc edge from $2)
  $4 (ad* edge from $1)
F:
$1.tag = article & $2.tag = author & $3.tag = sname & $3.content = "Doe"
S:
$4.score = { ScoreFoo({"search engine"}, {"internet", "information retrieval"}) }
$1.score = $4.score
Threshold condition as shown on the slide: TC%a > ...
[Figure: the operator applied to a collection of scored data trees. Two distinct trees appear:
article[3.6] #a1
  author #a3
    sname #a5
  section[3.6] #a16
and
article[3.6] #a3
  author #a23
    sname #a25
  section[3.6] #a36
]
50
Algebra (New Operators)
Pick
Input trees:
article[3.6] #a1
  author #a3
    sname #a5
  section[3.6] #a16
article[3.6] #a2
  author #a13
    sname #a15
  section[3.6] #a26
article[3.6] #a3
  author #a23
    sname #a25
  section[3.6] #a36
T:
$1
  $2 (pc edge from $1)
    $3 (pc edge from $2)
  $4 (ad* edge from $1)
F:
$1.tag = article & $2.tag = author & $3.tag = sname & $3.content = "Doe"
S:
$4.score = { ScoreFoo({"search engine"}, {"internet", "information retrieval"}) }
$1.score = $4.score
Pick condition: PC
Output trees:
article[3.6] #a1
  author #a3
    sname #a5
  section[3.6] #a16
article[3.6] #a3
  author #a23
    sname #a25
  section[3.6] #a36
51
Algebra (New Operators)
Pick example
Data tree:
article[5.6] #a1
  title[0.6] #a2
  sname #a5
  chapter[5.0] #a10
    section[0.8] #a12
      title[0.8] #a13
    section[0.6] #a14
      title[0.6] #a15
    section[3.6] #a16
      p[0.8] #a18
      p[1.4] #a19
      p[1.4] #a20
Pick condition
Data is relevant if:
1. its score > 0.8
2. more than 50% of its children are relevant
3. its direct parent node is not picked
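One possible reading of this pick condition, sketched in Python. The slide leaves some details open, so the following are assumptions made only for the sketch: a node is "relevant" when its score exceeds 0.8, a leaf trivially passes the children test, a node without a score counts as 0, and a node is "picked" when it is relevant, passes the children test, and its direct parent was not picked. Node and pick are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    score: float = 0.0                      # assumption: unscored nodes count as 0
    children: List["Node"] = field(default_factory=list)


def pick(root: Node, min_score: float = 0.8, min_fraction: float = 0.5) -> List[str]:
    """Return the names of picked nodes in document order."""
    picked: List[str] = []

    def relevant(n: Node) -> bool:
        return n.score > min_score

    def visit(n: Node, parent_picked: bool) -> None:
        kids = n.children
        # Assumption: a leaf has no children to check, so it passes this test.
        enough = (not kids) or sum(relevant(k) for k in kids) / len(kids) > min_fraction
        is_picked = relevant(n) and enough and not parent_picked
        if is_picked:
            picked.append(n.name)
        for k in kids:
            visit(k, is_picked)

    visit(root, False)
    return picked


tree = Node("#a1", 5.6, [
    Node("#a2", 0.6),
    Node("#a5"),
    Node("#a10", 5.0, [
        Node("#a12", 0.8, [Node("#a13", 0.8)]),
        Node("#a14", 0.6, [Node("#a15", 0.6)]),
        Node("#a16", 3.6, [Node("#a18", 0.8), Node("#a19", 1.4), Node("#a20", 1.4)]),
    ]),
])
print(pick(tree))
```

Under these assumptions only section #a16 is picked: the article and the chapter each have just one relevant child out of three, and the two relevant paragraphs are suppressed because their parent is already picked.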
52
Translating to XQuery
Query1: Find document components in articles.xml that are about "search engine". Relevance to "internet" and "information retrieval" is desirable but not necessary.
XQuery:
For $a in document("articles.xml")//article/descendant-or-self::*
Score $a using ScoreFoo($a, {"search engine"}, {"internet", "information retrieval"})
Pick $a using PickFoo($a)
Return
  <result><score>$a</score></result>
Sortby(score)
Threshold @score >= 4 stop after 5
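The last two clauses of the query (sort by score, keep results scoring at least 4, stop after 5) can be sketched as follows. This is an illustrative reading of those clauses, not the paper's operator definition; the function name threshold and the (item, score) pair layout are assumptions.

```python
def threshold(scored, min_score=4.0, stop_after=5):
    """scored: iterable of (item, score) pairs.
    Keep pairs with score >= min_score, best first, at most stop_after of them."""
    kept = [(item, score) for item, score in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:stop_after]


# Hypothetical scored results, for illustration only.
results = [("r1", 3.0), ("r2", 4.0), ("r3", 9.5), ("r4", 5.0),
           ("r5", 4.5), ("r6", 6.0), ("r7", 7.0)]
for item, score in threshold(results):
    print(item, score)
```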
53
In this lecture
Abstract, Introduction, Motivation, Algebra, Access methods, Conclusions
54
Access Methods
Score-Generating Methods: TermJoin
Score-Utilizing Methods: Pick
55
Score-Generating Methods
How do we give an initial score to the data tree?
The score of every node should be computed from the number of occurrences of the search terms in the node or its descendants.
56
Naïve algorithm
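For every node, recompute the scores of all its ancestors.
[Figure: a small tree whose nodes contain the terms a, b and c.]
The runtime is bad, because every term occurrence touches the whole chain of ancestors.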
57
TermJoin
Stack-based algorithm:
Use a stack to store the ancestors of the current node.
Now every ancestor on the stack can be updated by the node.
58
TermJoin
[Figure: an example run for the phrase "a" over the stream Ta. The data tree has elements at (1:9), (2:7) and (3:5) and text nodes at (4), (6) and (8); the entries of the encoding stack carry a counter for the term (0a, 1a, 2a, 3a, 4a).]
If the phrase contains more than one word, several matching streams are processed simultaneously.
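The stack idea can be illustrated with a small Python sketch that counts the occurrences of one term in every element's subtree in a single merged pass. It is written in the spirit of the slide, not as the paper's TermJoin: the function name, the (name, left, right) element tuples and the list of occurrence positions are assumptions for illustration.

```python
def count_term_in_subtrees(elements, occurrences):
    """elements: (name, left, right) tuples sorted by left.
    occurrences: sorted positions at which the term occurs.
    Returns {name: number of occurrences inside that element}."""
    counts = {}
    stack = []   # open ancestors, each as [name, right, running_count]
    i = 0

    def pop():
        name, _, c = stack.pop()
        counts[name] = c
        if stack:                       # pass the subtree total to the parent
            stack[-1][2] += c

    sentinel = (None, float("inf"), float("inf"))
    for name, left, right in list(elements) + [sentinel]:
        # Consume term occurrences located before this element starts.
        while i < len(occurrences) and occurrences[i] < left:
            p = occurrences[i]
            while stack and stack[-1][1] < p:
                pop()                   # element closed before this occurrence
            if stack:
                stack[-1][2] += 1       # credit the innermost open element
            i += 1
        while stack and stack[-1][1] < left:
            pop()
        if name is not None:
            stack.append([name, right, 0])
    return counts


elems = [("e1", 1, 9), ("e2", 2, 7), ("e3", 3, 5)]
print(count_term_in_subtrees(elems, [4, 8]))
```

Each element is pushed and popped once, and each occurrence is credited once to the innermost open element, so the work no longer grows with the depth of the ancestor chain per occurrence as in the naive approach.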
59
Score-Utilizing Methods
Methods that help us filter the data according to its scores.
Two such methods are:
Threshold
Pick
Pick can be quite a challenge to implement.
60
Score-Utilizing Methods
Pick algorithm:
The most complex part of the algorithm is removing redundancy.
There is vertical (parent-child) and horizontal (among the siblings, e.g. return the first author from the relevant article) redundancy.
The problem is solved with a stack-based algorithm.
61
Pick algorithm
Data tree (nodes numbered in document order):
1 chapter
  2 title: Search and retrieval
  3 section
    4 p: … IR … Search engine
    5 p: … Search engine retrieval of syntactic information
  6 section
  7 section
Node scores shown on the slide: 1, 2, 2, 0, 4, 5, 0
Pick condition: score >= 2, percentage >= 50%
Ancestor stack: contains elements not yet fully explored.
Main stack: contains elements that cannot yet be eliminated.
[Figure: step-by-step contents of both stacks, with the fraction of relevant children kept per entry (0/1, 1/1, 2/2, 1/2, 1/3, 1/4).]
62
Algebra (New Operators)
Pick
T:
$1
  $4 (ad* edge from $1)
F:
$1.tag = article
S:
$4.score = { ScoreFoo({"search engine"}) }
$1.score = $4.score
Pick condition PC: score >= 2, percentage >= 50%
Input:
1 chapter
  2 title: Search and retrieval
  3 section
    4 p: … IR … Search engine
    5 p: … Search engine retrieval of syntactic information
  6 section
  7 section
Output:
section
  p: … IR … Search engine
  p: … Search engine retrieval of syntactic information
63
In this lecture
Abstract, Introduction, Motivation, Algebra, Access methods, Conclusions
64
Conclusion
Stack-based algorithms are used to implement the new operators efficiently.
A usable algebra is presented that deals with scoring and relevance in XML keyword search.
It is a possible extension of XQuery.