1 holistic twig joins: optimal xml pattern matching acm sigmod 2002

1

Holistic Twig Joins: Optimal XML Pattern Matching

ACM SIGMOD 2002

2

In this lecture

The Problem
Idea
Preliminaries
PathStack Algorithm
TwigStack Algorithm
Conclusions

3

The problem

To find semantically connected data in an XML document efficiently.

Many intermediate results are produced that do not participate in the final answers.

4

The problem (example)

For example, we have this XQuery expression:
book[title = ‘XML’]//author[fn = ‘jane’ and ln = ‘doe’]

We can translate it to the twig (small tree) pattern:

book
  title
    XML
  author   (reached through //, i.e. a descendant of book)
    fn
      jane
    ln
      doe

5

The problem (example)

To solve this problem we have to:
Find all binary relationships, like (book, title) and (author, fn).
Connect all the pattern matches we have found into complete answers.

The problem is that every book has a title, but only some of them have the title ‘XML’, so we produce many intermediate answers that do not participate in the final answer.



7

Idea

The main idea of the paper is to store intermediate results in a compact way.

The goal is to develop algorithms whose cost is independent of the size of the intermediate results.

A family of stack-based algorithms was invented for this purpose.



9

Representing position of elements

Every node in the XML document is represented as:
Leaf: 3-tuple (DocId, LeftPos, LevelNum)
Node: 3-tuple (DocId, LeftPos : RightPos, LevelNum)

10


For example:

book (1,1:31,1)
  title (1,2:4,2)
    XML (1,3,3)
  authors (1,5:30,2)
    author (1,6:13,2)
      fn (1,7:9,3)
        jane (1,8,4)
      ln (1,10:12,3)
        poe (1,11,4)
    author (1,14:21,2)
      fn (1,15:17,3)
        john (1,16,4)
      ln (1,18:20,3)
        doe (1,19,4)
    author (1,22:29,2)
      fn (1,23:25,3)
        jane (1,24,2)
      ln (1,26:28,2)
        doe (1,27,2)
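A minimal sketch of how such positions could be assigned, assuming a single depth-first pass with one shared counter. The nested `(tag, children)` input format and the function name `encode` are assumptions made for illustration; they are not taken from the paper.

```python
from itertools import count


def encode(tree, doc_id=1):
    """Assign region-encoding positions in one depth-first pass.

    tree is (tag, [children]); a child that is a plain str is a text leaf.
    Returns (label, position) pairs in document order.
    """
    counter = count(1)
    out = []

    def visit(node, level):
        if isinstance(node, str):
            # Text leaf: a single position, (DocId, LeftPos, LevelNum).
            out.append((node, (doc_id, next(counter), level)))
            return
        tag, children = node
        left = next(counter)
        slot = len(out)
        out.append(None)  # reserve the slot to keep document order
        for child in children:
            visit(child, level + 1)
        right = next(counter)
        # Element: (DocId, LeftPos, RightPos, LevelNum).
        out[slot] = (tag, (doc_id, left, right, level))

    visit(tree, 1)
    return out


positions = encode(("book", [("title", ["XML"])]))
```

On this three-node fragment the sketch gives title the interval 2:4 at level 2 and the text XML position 3 at level 3, as in the example.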


12


Benefits: structural relationships are easy to determine.

Ancestor-descendant relationship: a node n1(D1, L1:R1, N1) is a descendant of a node n2(D2, L2:R2, N2) iff D1 = D2, L2 < L1 and R1 < R2.

Parent-child relationship: a node n1(D1, L1:R1, N1) is a child of a node n2(D2, L2:R2, N2) iff D1 = D2, L2 < L1, R1 < R2 and N1 = N2 + 1.

Examples:
fn (1,7:9,3) is a descendant of book (1,1:31,1).
poe (1,11,4) is a child of ln (1,10:12,3).
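The two tests can be written as short predicates. This is a sketch only: the `Pos` tuple and the function names are assumptions for illustration, and a text leaf is modelled with `left == right`.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    doc: int
    left: int
    right: int
    level: int


def is_descendant(n1: Pos, n2: Pos) -> bool:
    """True if n1 lies strictly inside the interval of n2 in the same document."""
    return n1.doc == n2.doc and n2.left < n1.left and n1.right < n2.right


def is_child(n1: Pos, n2: Pos) -> bool:
    """True if n1 is a descendant of n2 exactly one level below it."""
    return is_descendant(n1, n2) and n1.level == n2.level + 1


book = Pos(1, 1, 31, 1)
fn = Pos(1, 7, 9, 3)
ln = Pos(1, 10, 12, 3)
poe = Pos(1, 11, 11, 4)  # text leaf: left == right
```

Both checks need only the two tuples, so a structural relationship is decided in constant time without touching the document.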


14

Matching stream

A stream Tq contains the positional representations of the database nodes that match the query node q.

The nodes in the stream are sorted by (DocId, LeftPos).

15

Matching stream (example)

For the example document, the streams of the query nodes author and jane are:

Tauthor: author(1,6:13,2), author(1,14:21,2), author(1,22:29,2)
Tjane: jane(1,8,4), jane(1,24,2)

The operations available on the streams: eof, advance, next, nextL, nextR
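A small sketch of a stream with these five operations. The class, the method names `next_l` and `next_r`, and the `(doc, left, right, level)` tuple layout are assumptions; the semantics follow the operation names, with `next` returning the current head without consuming it.

```python
class Stream:
    """Matching stream: nodes of one query node, sorted by (DocId, LeftPos)."""

    def __init__(self, nodes):
        # Tuples are (doc, left, right, level), so plain tuple order
        # sorts by DocId first and LeftPos second.
        self._nodes = sorted(nodes)
        self._i = 0

    def eof(self):
        return self._i >= len(self._nodes)

    def advance(self):
        self._i += 1

    def next(self):
        return self._nodes[self._i]

    def next_l(self):
        return self.next()[1]  # LeftPos of the head

    def next_r(self):
        return self.next()[2]  # RightPos of the head


t_author = Stream([(1, 22, 29, 2), (1, 6, 13, 2), (1, 14, 21, 2)])
first_left = t_author.next_l()
t_author.advance()
second = t_author.next()
```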

16

Linked stacks

Idea:
Repeatedly construct stacks that contain partial and total answers.
Remove partial answers that cannot be extended to total answers.

17

Linked stacks (example)

Data (a single nested path, outermost first): A1, B1, A2, B2, C1

Query: A // B // C

Stack encoding (bottom to top):
SA: A1, A2
SB: B1, B2
SC: C1

Query results:
A1 B1 C1
A2 B2 C1
A1 B2 C1

18

In this lecture


19

Stack based algorithms

The stack-based algorithms use a chain of linked stacks to compactly represent partial and full results.

20

PathStack algorithm

Data (nested path, positions as LeftPos:RightPos):
A1 (1:9)
  B1 (2:8)
    A2 (3:7)
      B2 (4:6)
        C1 (5)

Query: A // B // C

Streams:
TA: A1 (1:9), A2 (3:7)
TB: B1 (2:8), B2 (4:6)
TC: C1 (5)

21

PathStack algorithm

Data (nested path, positions as LeftPos:RightPos):
A1 (1:9)
  B1 (2:8)
    A2 (3:7)
      B2 (4:6)
        C1 (5)

Query: A // B // C

Always take the element with the smallest LeftPos.

Stack encoding (bottom to top):
SA: A1 (1:9), A2 (3:7)
SB: B1 (2:8), B2 (4:6)
SC: C1 (5)

Query results:
A1 B1 C1
A1 B2 C1
A2 B2 C1

22

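PathStack algorithm

Data (C2 is added inside B1, after A2 closes):
A1 (1:10)
  B1 (2:9)
    A2 (3:7)
      B2 (4:6)
        C1 (5)
    C2 (8)

Query: A // B // C

When C2 arrives, every stack entry whose RightPos < LeftPos of C2 is popped: A2 (3:7) and B2 (4:6) are removed, since 7 < 8 and 6 < 8.

Stack encoding when C2 is processed (bottom to top):
SA: A1 (1:10)
SB: B1 (2:9)
SC: C2 (8)

Query results:
A1 B1 C1
A1 B2 C1
A2 B2 C1
A1 B1 C2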
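The walkthrough can be reproduced with a compact sketch of the stack mechanism for a linear query with ancestor-descendant edges only. It is a simplification written for illustration, not the paper's pseudocode: streams are plain lists of `(name, left, right)` tuples, the function name `path_stack` is hypothetical, and the leaf element is expanded directly instead of being pushed and popped.

```python
def path_stack(streams):
    """streams[k] holds the matches of the k-th query node (root first),
    each as (name, left, right), sorted by left."""
    n = len(streams)
    cursor = [0] * n
    # Each stack entry is (element, index of the parent-stack top at push time).
    stacks = [[] for _ in range(n)]
    results = []

    def expand(level, top):
        # Every entry at index <= top is an ancestor chain candidate.
        for i in range(top + 1):
            (name, _, _), ptr = stacks[level][i]
            if level == 0:
                yield (name,)
            else:
                for prefix in expand(level - 1, ptr):
                    yield prefix + (name,)

    while cursor[-1] < len(streams[-1]):  # stop when the leaf stream is done
        live = [k for k in range(n) if cursor[k] < len(streams[k])]
        # Always take the stream head with the smallest LeftPos.
        q = min(live, key=lambda k: streams[k][cursor[k]][1])
        elem = streams[q][cursor[q]]
        cursor[q] += 1
        # Pop entries that ended before this element starts (RightPos < LeftPos).
        for s in stacks:
            while s and s[-1][0][2] < elem[1]:
                s.pop()
        if q > 0 and not stacks[q - 1]:
            continue  # no open ancestor in the parent stack: cannot contribute
        ptr = len(stacks[q - 1]) - 1 if q > 0 else -1
        if q == n - 1:
            if q == 0:
                results.append((elem[0],))
            else:
                results.extend(p + (elem[0],) for p in expand(q - 1, ptr))
        else:
            stacks[q].append((elem, ptr))
    return results


t_a = [("A1", 1, 10), ("A2", 3, 7)]
t_b = [("B1", 2, 9), ("B2", 4, 6)]
t_c = [("C1", 5, 5), ("C2", 8, 8)]
answers = path_stack([t_a, t_b, t_c])
```

On the data with C2 this yields the four results listed on the slide. The stacks never hold more entries than the depth of the document, which is what keeps the representation of partial answers compact.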

23

PathStack algorithm problems

To find a twig we have to split it into several root-to-leaf paths and then join the path results.
Again we get intermediate results that do not participate in the final result.

Data:
authors (5:30)
  author (6:13)
    fn (7:9)
      jane (8)
    ln (10:12)
      poe (11)
  author (14:21)
    fn (15:17)
      john (16)
    ln (18:20)
      doe (19)
  author (22:29)
    fn (23:25)
      jane (24)
    ln (26:28)
      doe (27)

Query:
author
  fn
    jane
  ln
    doe
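A short sketch of the decomposition, assuming the twig is given as nested `(tag, children)` tuples (a format chosen here for illustration). The two sets of matching authors are read off the slide data by hand and identified by the LeftPos of the author element.

```python
def root_to_leaf_paths(twig, prefix=()):
    """Split a twig pattern into its root-to-leaf paths."""
    tag, children = twig
    path = prefix + (tag,)
    if not children:
        return [path]
    paths = []
    for child in children:
        paths.extend(root_to_leaf_paths(child, path))
    return paths


twig = ("author", [("fn", [("jane", [])]), ("ln", [("doe", [])])])
paths = root_to_leaf_paths(twig)

# Authors matching each path in the slide data, by LeftPos of the author.
authors_with_fn_jane = {6, 22}
authors_with_ln_doe = {14, 22}

# Joining the path results on the shared author node.
final_authors = authors_with_fn_jane & authors_with_ln_doe
```

Four path matches are produced but only the author at 22:29 survives the join, so three of the four are wasted intermediate results.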



25

TwigStack Algorithm

Idea:
Before adding a node to the stack, check that it has descendants that satisfy the twig pattern.
When checking the descendants, their descendants are checked too.

Now we can be sure that every path result is joinable with at least one other path result and participates in at least one full answer.

26

TwigStack Algorithm

Data:
authors (5:30)
  author (6:13)
    fn (7:9)
      jane (8)
    ln (10:12)
      poe (11)
  author (14:21)
    fn (15:17)
      john (16)
    ln (18:20)
      doe (19)
  author (22:29)
    fn (23:25)
      jane (24)
    ln (26:28)
      doe (27)

Query:
author
  fn
    jane
  ln
    doe
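The check can be illustrated with a recursive test. This is a deliberately naive sketch of the idea only: the actual TwigStack algorithm inspects just the current heads of the streams while advancing them, whereas this version scans whole streams. The dictionary layout and the name `has_twig_extension` are assumptions.

```python
def has_twig_extension(elem, qnode, twig, streams):
    """True if elem has, for every child of qnode in the twig, a descendant
    in that child's stream which itself has a twig extension."""
    left, right = elem
    for child in twig.get(qnode, []):
        if not any(
            left < c[0] and c[1] < right
            and has_twig_extension(c, child, twig, streams)
            for c in streams[child]
        ):
            return False
    return True


# Query twig: author with branches fn/jane and ln/doe.
twig = {"author": ["fn", "ln"], "fn": ["jane"], "ln": ["doe"]}

# Streams as (LeftPos, RightPos) pairs taken from the slide data.
streams = {
    "author": [(6, 13), (14, 21), (22, 29)],
    "fn": [(7, 9), (15, 17), (23, 25)],
    "jane": [(8, 8), (24, 24)],
    "ln": [(10, 12), (18, 20), (26, 28)],
    "doe": [(19, 19), (27, 27)],
}

pushed = [a for a in streams["author"]
          if has_twig_extension(a, "author", twig, streams)]
```

Only the author at 22:29 passes the check, so the first two authors never enter the stack and no path result is produced for them.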



28

Conclusions

The PathStack and TwigStack algorithms are efficient in terms of the amount of intermediate results.

But:
For twigs, this guarantee holds only for ancestor-descendant relationships.
If the twig also contains parent-child relationships, then not all nodes that are inserted into the stacks participate in the final result.

29

Break?

30

Querying Structured Text in an XML Database

ACM SIGMOD 2003

31

In this lecture

Abstract
Introduction
Motivation
Algebra
Access methods
Conclusions

32

Abstract

XML databases often contain documents with structured text.

It is important to integrate “information retrieval” style query evaluation.

This is well studied for natural-language documents.
But in the case of XML, the relevant data could reside in element descendants.



34

Introduction

Boolean-style queries (XQuery):
Useful when users are aware of the underlying schema.

But:
Users often don’t know the schema.
Collections of XML documents are frequently heterogeneous.

35

Introduction

So we have to use relevance ranking in order to define IR on XML.

Problem: traditional IR is “document-centric”.

XML IR should:
Be much more fine-grained
Take document structure into account
Allow more complex analysis than determination of relevance



37

Motivation

We have the following XML document named article.xml:

article #a1
  article-title #a2: Internet Technologies
  author #a3
    fname #a4: Jane
    sname #a5: Doe
  chapter #a6
    ct #a7: Caching and Replication
  chapter #a10
    ct #a11: Search and Retrieval
    section #a12
      section-title #a13: Search Engine
      …
    section #a14
      section-title #a15: Information Retrieval
      …
    section #a16
      section-title #a17: Examples
      p #a18: … Here are some IR based Search Engines: …
      p #a19: … search engine NewSearch uses a new information retrieval technology
      p #a20: semantic information retrieval techniques are also being incorporated into some search engines

38

Motivation

Consider the query:
Find document components in articles.xml that are about “search engine”. Relevance to “internet” and “information retrieval” is desirable but not necessary.

Using AND and OR predicates will not give us the desired result.


40

Motivation

Illustrating the granularity problem
What elements to rank?

If we rank articles:
The user will see the whole article, while the relevant information is concentrated only in the third chapter.

If we rank paragraphs:
The paragraphs of the last section will be returned separately.
• The semantic linkage is broken and has to be reconstructed by the user.

41

Motivation

IR-style XML queries don’t have to stand alone.

If the user knows the structure of the XML document, they can add structural constraints and limit the number of uninteresting results.



43

Algebra

We want to fold into a database framework the notion of relevance scoring and ranking

45

Algebra

Scored Data Tree

Definition:
A rooted ordered tree such that each node has attribute-value pairs, including at least a tag and a real-valued score.
The score of a tree is the score of its root node.

Example:
article[3.6] #a1
  author #a3
    sname #a5
  section[3.6] #a16

46

Algebra

Scored Pattern Tree

Definition:
P = (T, F, S)
• T: a node-labeled and edge-labeled tree
• F: a formula that is a boolean combination of predicates applicable to nodes
• S: a set of scoring functions

47

Algebra

Scored Pattern Tree

Example:
Query2: Find document components in article.xml that are part of an article written by an author with last name “Doe” and are about “search engine”. Relevance to “internet” and “information retrieval” is desirable but not necessary.

T:
$1
  pc $2
    pc $3
  ad* $4

F:
$1.tag = article & $2.tag = author & $3.tag = sname & $3.content = “Doe”

S:
$4.score = { ScoreFoo({“search engine”}, {“internet”, “information retrieval”}) }
$1.score = $4.score

48

Algebra

Common operators:
Selection => Scored Selection
Projection => Scored Projection
Join => Scored Join

New operators:
Threshold
Pick

49

Algebra (New Operators)

Threshold

T:
$1
  pc $2
    pc $3
  ad* $4

F:

S:
$1.score = $4.score

TC%a > ...

Example trees shown on the slide:

article[3.6] #a1
  author #a3
    sname #a5
  section[3.6] #a16

article[3.6] #a3
  author #a23
    sname #a25
  section[3.6] #a36

50


Pick

T:
$1
  pc $2
    pc $3
  ad* $4

F:

S:
$1.score = $4.score

PC

Input:

article[3.6] #a1
  author #a3
    sname #a5
  section[3.6] #a16

article[3.6] #a2
  author #a13
    sname #a15
  section[3.6] #a26

article[3.6] #a3
  author #a23
    sname #a25
  section[3.6] #a36

Output:

article[3.6] #a1
  author #a3
    sname #a5
  section[3.6] #a16

article[3.6] #a3
  author #a23
    sname #a25
  section[3.6] #a36

51

Pick Example:

Data Tree

article[5.6] #a1
  title[0.6] #a2
  sname #a5
  chapter[5.0] #a10
    section[0.8] #a12
      title[0.8] #a13
    section[0.6] #a14
      title[0.6] #a15
    section[3.6] #a16
      p[0.8] #a18
      p[1.4] #a19
      p[1.4] #a20

Pick Condition

Data is relevant if:

1. score > 0.8

2. more than 50% of children are relevant

3. its direct parent node is not picked
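One possible reading of this pick condition, sketched for illustration. The interpretation is an assumption: a leaf is relevant when its score exceeds 0.8, an inner node additionally needs more than half of its children to be relevant, and a relevant node is picked only when its direct parent was not picked. The `Node` class, the function names and the reduced chapter subtree are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    score: float
    children: list = field(default_factory=list)


def is_relevant(node, min_score=0.8):
    """Conditions 1 and 2 under the reading described above."""
    if node.score <= min_score:
        return False
    if not node.children:
        return True
    relevant = sum(is_relevant(c, min_score) for c in node.children)
    return relevant > 0.5 * len(node.children)


def pick(node, parent_picked=False, out=None):
    """Condition 3: walk top-down and skip nodes whose direct parent was picked."""
    if out is None:
        out = []
    picked = (not parent_picked) and is_relevant(node)
    if picked:
        out.append(node)
    for child in node.children:
        pick(child, picked, out)
    return out


# Reduced chapter subtree with scores in the spirit of the example.
chapter = Node("chapter", 5.0, [
    Node("section", 0.8),
    Node("section", 0.6),
    Node("section", 3.6, [Node("p", 0.8), Node("p", 1.4), Node("p", 1.4)]),
])
picked = pick(chapter)
```

Under this reading the chapter is not returned, because only one of its three sections is relevant, while the last section is returned as a whole instead of its paragraphs one by one.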

52

Translating to XQuery

Query1: Find document components in articles.xml that are about “search engine”. Relevance to “internet” and “information retrieval” is desirable but not necessary.

XQuery:
For $a in document(“articles.xml”)//article/descendant-or-self::*
Score $a using ScoreFoo($a, {“search engine”}, {“internet”, “information retrieval”})
Pick $a using PickFoo($a)
Return
  <result><score>$a</score></result>
Sortby(score)
Threshold @score >= 4 stop after 5
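The last two clauses can be mimicked in a few lines. This sketch assumes that `Sortby(score)` ranks results by descending score and that `Threshold @score >= 4 stop after 5` keeps results scoring at least 4 and returns at most the first five; the `(node id, score)` pair format and the function name are hypothetical.

```python
def threshold(results, min_score=4.0, stop_after=5):
    """Rank by descending score, keep scores >= min_score, return at most stop_after."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    kept = [r for r in ranked if r[1] >= min_score]
    return kept[:stop_after]


# Hypothetical scored results as (node id, score) pairs.
scored = [("#a16", 3.6), ("#a10", 5.0), ("#a1", 5.6)]
top = threshold(scored)
```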



54

Access Methods

Score-Generating Methods
TermJoin

Score-Utilizing Methods
Pick

55

Score-Generating Methods

How to give an initial score to the data tree?
The score of every node should be computed according to the number of occurrences of the search terms in the node or its descendants.

56

Naïve algorithm

For every node, recompute the scores of all its ancestors.

The runtime is poor, because every term occurrence touches every ancestor.

57

TermJoin

Stack-based algorithm:
Use a stack to store the ancestors of every node.
This way, the contribution of a node reaches all of its ancestors on the stack.

58

TermJoin (example)

Phrase: “a”

Figure: a data tree with elements at (1:9), (2:7) and (3:5) and text nodes at (4), (6) and (8), the matching stream Ta, and the encoding stack with per-element occurrence counters for “a”.

If the phrase has more than one word, several matching streams are processed simultaneously.
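A minimal sketch of the stack idea for counting occurrences of a single term. It is an illustration under assumptions, not the paper's TermJoin: elements are `(left, right)` intervals, occurrences are positions, and the placement of “a” at positions 4 and 8 is hypothetical. Each occurrence increments only the top of the stack, and a count is passed to the parent when an element is popped.

```python
def term_counts(elements, occurrences):
    """Count term occurrences inside every element in one merged pass."""
    events = [(left, 0, (left, right)) for left, right in elements]
    events += [(pos, 1, None) for pos in occurrences]
    events.sort(key=lambda e: e[0])  # process in document order

    stack = []   # entries are [element, count]; these are the open ancestors
    result = {}

    def pop():
        elem, cnt = stack.pop()
        result[elem] = cnt
        if stack:
            stack[-1][1] += cnt  # hand the subtree count to the parent

    for pos, kind, elem in events:
        # Close every element that ended before this position.
        while stack and stack[-1][0][1] < pos:
            pop()
        if kind == 0:
            stack.append([elem, 0])
        elif stack:
            stack[-1][1] += 1    # touch only the innermost open element
    while stack:
        pop()
    return result


counts = term_counts([(1, 9), (2, 7), (3, 5)], [4, 8])
```

Each element is pushed and popped once and each occurrence does constant work, in contrast to the naive approach that walks the whole ancestor chain for every occurrence.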

59

Score-Utilizing Methods

Methods that help us filter the data according to its scores.

Two such methods are:
Threshold
Pick

Pick can be quite a challenge to implement.

60

Score-Utilizing Methods

Pick algorithm:
The most complex part of the algorithm is removing redundancy.
There is vertical (parent-child) and horizontal (among siblings, e.g. return the first author of a relevant article) redundancy.

The problem is solved with a stack-based algorithm.

61

Pick algorithm

Data tree (nodes numbered in document order):
1 chapter
  2 title: Search and retrieval
  3 section
    4 p: … IR … Search engine
    5 p: … Search engine retrieval of syntactic information
  6 section
  7 section

Node scores shown on the slide: 1, 2, 2, 0, 4, 5, 0

Pick condition: score >= 2, percentage >= 50%

Ancestor stack: contains elements not yet fully explored.
Main stack: contains elements that cannot yet be eliminated.

62


Pick

T:
$1
  ad* $4

F:
$1.tag = article

S:
$4.score = { ScoreFoo({“search engine”}) }
$1.score = $4.score

PC: score >= 2, percentage >= 50%

Input:
1 chapter
  2 title: Search and retrieval
  3 section
    4 p
    5 p
  6 section
  7 section

Result:
section
  p
  p



64

Conclusion

Stack-based algorithms are used for an efficient implementation of the new ideas.

A usable algebra is presented that deals with scoring and relevance in XML keyword search.

A possible extension of XQuery is proposed.
