Holistic Twig Joins: Optimal XML Pattern Matching

ACM SIGMOD 2002

In this lecture

The Problem
Idea
Preliminaries
PathStack Algorithm
TwigStack Algorithm
Conclusions

The problem

To find semantically connected data in an XML document efficiently.

Many intermediate results are produced that do not participate in the final answers.

The problem (example)

For example, we have this XQuery expression: book[title = ‘XML’]//author[fn = ‘jane’ and ln = ‘doe’]

We can translate it into a twig (small tree) pattern:

book
  title
    XML
  author
    fn
      jane
    ln
      doe

The problem (example)

To solve this problem we have to:
Find all binary relationships like (book, title) and (author, fn).
Connect all the patterns we have found into the complete answer.

The problem is that every book has a title, but only some of them have the title ‘XML’, so we produce many intermediate answers that do not participate in the final answer.


Idea

The main idea of the paper is to store intermediate results in a compact way.

The goal is an algorithm whose cost does not depend on the size of the intermediate results.

A family of stack-based algorithms was invented for this purpose.


Representing position of elements

Every node in the XML document is represented as:
Leaf: 3-tuple (DocId, LeftPos, LevelNum)
Node: 3-tuple (DocId, LeftPos : RightPos, LevelNum)

Representing position of elements

For example:

book (1,1:31,1)
  title (1,2:4,2)
    XML (1,3,3)
  authors (1,5:30,2)
    author (1,6:13,3)
      fn (1,7:9,4)
        jane (1,8,5)
      ln (1,10:12,4)
        poe (1,11,5)
    author (1,14:21,3)
      fn (1,15:17,4)
        john (1,16,5)
      ln (1,18:20,4)
        doe (1,19,5)
    author (1,22:29,3)
      fn (1,23:25,4)
        jane (1,24,5)
      ln (1,26:28,4)
        doe (1,27,5)


Representing position of elements

Benefits: it is easy to determine

the ancestor-descendant relationship: a node n1(D1,L1:R1,N1) is a descendant of node n2(D2,L2:R2,N2) iff D1 = D2, L2 < L1 and R1 < R2

the parent-child relationship: a node n1(D1,L1:R1,N1) is a child of node n2(D2,L2:R2,N2) iff D1 = D2, L2 < L1, R1 < R2 and N1 = N2 + 1

Examples: fn (1,7:9,4) is a descendant of book (1,1:31,1); poe (1,11,5) is a child of ln (1,10:12,4).
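The two containment tests can be sketched in Python as follows. This is a minimal illustration, not code from the paper: the `Pos` tuple and the function names are assumptions, a text leaf is stored with RightPos equal to LeftPos, and levels are counted from the root (book = 1).

```python
from typing import NamedTuple


class Pos(NamedTuple):
    """Positional representation (DocId, LeftPos:RightPos, LevelNum)."""
    doc: int
    left: int
    right: int  # assumption: equals left for a text leaf
    level: int


def is_descendant(n1: Pos, n2: Pos) -> bool:
    """True if n1 lies strictly inside n2 in the same document."""
    return n1.doc == n2.doc and n2.left < n1.left and n1.right < n2.right


def is_child(n1: Pos, n2: Pos) -> bool:
    """True if n1 is a descendant of n2 exactly one level deeper."""
    return is_descendant(n1, n2) and n1.level == n2.level + 1


# Nodes taken from the example document, levels counted from the root.
book = Pos(1, 1, 31, 1)
fn = Pos(1, 7, 9, 4)
ln = Pos(1, 10, 12, 4)
poe = Pos(1, 11, 11, 5)

print(is_descendant(fn, book), is_child(fn, book), is_child(poe, ln))
```

Both tests need only the two tuples being compared, which is what lets the join algorithms work on sorted streams of positions without touching the document itself.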


Matching stream

A stream Tq contains the positional representations of the database nodes that match the query node q.

The nodes in the stream are sorted by (DocId, LeftPos).

Matching stream (example)

For the example document (book, title, authors and three author subtrees), the streams for the query nodes author and jane are:

Tauthor: author (1,6:13,3), author (1,14:21,3), author (1,22:29,3)
Tjane: jane (1,8,5), jane (1,24,5)

The operations available on the streams are eof, advance, next, nextL and nextR.
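A stream with these five operations can be sketched as a cursor over a sorted list. The class below is an assumption made for illustration (the slides only name the operations): `next` returns the current head without consuming it, `next_l` and `next_r` return its LeftPos and RightPos, and `advance` moves to the following node.

```python
class Stream:
    """Cursor over the nodes matching one query node, in (DocId, LeftPos) order."""

    def __init__(self, nodes):
        # Each node is (doc_id, left, right, level); a text leaf has right == left.
        self._nodes = sorted(nodes, key=lambda n: (n[0], n[1]))
        self._i = 0

    def eof(self):
        return self._i >= len(self._nodes)

    def next(self):
        return self._nodes[self._i]

    def next_l(self):
        return self.next()[1]

    def next_r(self):
        return self.next()[2]

    def advance(self):
        self._i += 1


# The author stream of the example document, given in arbitrary order.
t_author = Stream([(1, 22, 29, 3), (1, 6, 13, 3), (1, 14, 21, 3)])
lefts = []
while not t_author.eof():
    lefts.append(t_author.next_l())
    t_author.advance()
print(lefts)
```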

Linked stacks

Idea:
Repeatedly construct stacks that contain partial and total answers.
Remove partial answers that cannot be extended to total answers.

Linked stacks (example)

Data: A1 contains B1, which contains A2, which contains B2, which contains C1
Query: A // B // C
Stack encoding: SA = [A1, A2], SB = [B1, B2], SC = [C1]; B1 points to A1, B2 points to A2, C1 points to B2

Query results:
A1 B1 C1
A2 B2 C1
A1 B2 C1
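How three query results come out of such a small encoding can be sketched as follows. This is an illustrative decoder, not the paper's pseudocode: each stack entry is assumed to be a pair (label, index of the top of the parent stack at push time), and every entry at or below that index counts as an ancestor, which is valid for ancestor-descendant edges only.

```python
def expand(stacks, level, idx):
    """Yield every root-to-leaf match that ends at stacks[level][idx]."""
    label, parent_top = stacks[level][idx]
    if level == 0:
        yield (label,)
        return
    # All parent-stack entries up to the recorded top are ancestors.
    for j in range(parent_top + 1):
        for prefix in expand(stacks, level - 1, j):
            yield prefix + (label,)


# Stack encoding of the example: SA, SB, SC with their parent pointers.
stacks = [
    [("A1", -1), ("A2", -1)],  # SA (root stack, no parent)
    [("B1", 0), ("B2", 1)],    # SB: B1 points to A1, B2 points to A2
    [("C1", 1)],               # SC: C1 points to B2
]
results = sorted(expand(stacks, 2, 0))
for r in results:
    print(" ".join(r))
```

Five stack entries stand for three full answers here; on deeper data the gap between the size of the encoding and the number of answers it represents grows quickly.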


Stack based algorithms

The stack-based algorithms use a chain of linked stacks to compactly represent partial and full results.

PathStack algorithm

Data: A1 (1:9) contains B1 (2:8), which contains A2 (3:7), which contains B2 (4:6), which contains C1 (5)
Query: A // B // C
Streams: TA = A1 (1:9), A2 (3:7); TB = B1 (2:8), B2 (4:6); TC = C1 (5)

PathStack algorithm

Data: A1 (1:9), B1 (2:8), A2 (3:7), B2 (4:6), C1 (5), each nested in the previous one
Query: A // B // C
Stack encoding: SA = [A1 (1:9), A2 (3:7)], SB = [B1 (2:8), B2 (4:6)], SC = [C1 (5)]

Query results:
A1 B1 C1
A1 B2 C1
A2 B2 C1

Always take the element with the smallest LeftPos.

PathStack algorithm

Data: A1 (1:10), B1 (2:9), A2 (3:7), B2 (4:6), C1 (5), plus a new element C2 (8) added inside B1, after A2 has closed
Query: A // B // C
Stack encoding before C2 is read: SA = [A1 (1:10), A2 (3:7)], SB = [B1 (2:9), B2 (4:6)]
When C2 is read, A2 and B2 are popped because their RightPos < LeftPos of C2 (7 < 8 and 6 < 8).

Query results:
A1 B1 C1
A1 B2 C1
A2 B2 C1
A1 B1 C2
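The steps of this example can be put together in a short Python sketch. It follows the rules stated on the slides (take the smallest LeftPos, pop entries whose RightPos is smaller than the new LeftPos, push with a pointer to the top of the parent stack, output when a leaf is pushed), but the function names, the tuple layout and the shortcut of skipping a node whose parent stack is empty are assumptions made for illustration. It handles one document and ancestor-descendant edges only.

```python
def expand(stacks, level, idx):
    """Yield every root-to-leaf match that ends at stacks[level][idx]."""
    label, _left, _right, parent_top = stacks[level][idx]
    if level == 0:
        yield (label,)
        return
    for j in range(parent_top + 1):
        for prefix in expand(stacks, level - 1, j):
            yield prefix + (label,)


def path_stack(streams):
    """streams[i] holds (label, left, right) for the i-th query node, root
    first, each list sorted by left."""
    n = len(streams)
    cursor = [0] * n
    stacks = [[] for _ in range(n)]
    results = []
    while cursor[-1] < len(streams[-1]):  # stop once the leaf stream is exhausted
        live = [i for i in range(n) if cursor[i] < len(streams[i])]
        # Always take the element with the smallest LeftPos.
        qmin = min(live, key=lambda i: streams[i][cursor[i]][1])
        label, left, right = streams[qmin][cursor[qmin]]
        # Pop everything that ended before the new element starts.
        for stack in stacks:
            while stack and stack[-1][2] < left:
                stack.pop()
        # A non-root node with an empty parent stack cannot join any answer.
        if qmin == 0 or stacks[qmin - 1]:
            parent_top = len(stacks[qmin - 1]) - 1 if qmin > 0 else -1
            stacks[qmin].append((label, left, right, parent_top))
            if qmin == n - 1:  # leaf query node: output, then drop it
                results.extend(expand(stacks, qmin, len(stacks[qmin]) - 1))
                stacks[qmin].pop()
        cursor[qmin] += 1
    return results


ta = [("A1", 1, 10), ("A2", 3, 7)]
tb = [("B1", 2, 9), ("B2", 4, 6)]
tc = [("C1", 5, 5), ("C2", 8, 8)]
answers = path_stack([ta, tb, tc])
for a in answers:
    print(" ".join(a))
```

Each stream is scanned once, and the stacks never hold more than one root-to-leaf chain of the data, so the working memory is bounded by the document depth rather than by the number of answers.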

PathStack algorithm problems

To find a twig we have to divide it into many paths, and again we get intermediate results that do not participate in the final result.

Data:

authors (5:30)
  author (6:13)
    fn (7:9)
      jane (8)
    ln (10:12)
      poe (11)
  author (14:21)
    fn (15:17)
      john (16)
    ln (18:20)
      doe (19)
  author (22:29)
    fn (23:25)
      jane (24)
    ln (26:28)
      doe (27)

Query:

author
  fn
    jane
  ln
    doe


TwigStack Algorithm

Idea:
Before adding a node to the stack, check that it has descendants that satisfy the twig pattern below it.
When checking these descendants, their descendants are checked too.

Now we can be sure that every path result is joinable with at least one other path result and participates in at least one full answer.
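The recursive check can be illustrated with a deliberately naive Python function. It is not TwigStack itself, which performs the same test in a single pass over the sorted streams; here every candidate is compared against the whole child stream, and the names `has_extension`, `children` and `streams` are assumptions. All edges are treated as ancestor-descendant.

```python
def has_extension(node, q, children, streams):
    """True if `node`, a match for query node q, has for every child of q a
    descendant in that child's stream which itself has an extension."""
    _label, left, right = node
    for child_q in children.get(q, []):
        found = any(
            left < d[1] and d[2] < right
            and has_extension(d, child_q, children, streams)
            for d in streams[child_q]
        )
        if not found:
            return False
    return True


# Query twig: author with branches fn//jane and ln//doe.
children = {"author": ["fn", "ln"], "fn": ["jane"], "ln": ["doe"]}
# Streams of (label, left, right) taken from the example data.
streams = {
    "author": [("author", 6, 13), ("author", 14, 21), ("author", 22, 29)],
    "fn": [("fn", 7, 9), ("fn", 15, 17), ("fn", 23, 25)],
    "jane": [("jane", 8, 8), ("jane", 24, 24)],
    "ln": [("ln", 10, 12), ("ln", 18, 20), ("ln", 26, 28)],
    "doe": [("doe", 19, 19), ("doe", 27, 27)],
}
pushed = [a for a in streams["author"]
          if has_extension(a, "author", children, streams)]
print(pushed)
```

On this data only the author at (22:29) passes, so the partial paths that PathStack would have produced for the other two authors are never created.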

TwigStack Algorithm

[Figure: the same data tree (authors (5:30) with the three author subtrees at (6:13), (14:21) and (22:29)) and the same query twig (author with branches fn, jane and ln, doe) as in the PathStack example. Only the third author (22:29) has both a jane (24) and a doe (27) among its descendants.]


Conclusions

The PathStack and TwigStack algorithms are efficient in terms of the amount of intermediate results.

But:
For twig patterns this holds only for ancestor-descendant relationships.
If the twig also contains parent-child relationships, then not all nodes that are inserted into the stacks participate in the final result.

Break?

Querying Structured Text in an XML Database

ACM SIGMOD 2003

In this lecture

Abstract
Introduction
Motivation
Algebra
Access methods
Conclusions

Abstract

XML databases often contain documents with structured text.

It is important to integrate “information retrieval”-style query evaluation.

This is well studied for natural-language text, but in the case of XML the data could reside in the descendants of an element.


Introduction

Boolean-style queries (XQuery)
Useful when users are aware of the underlying schema

But
Users often don’t know the schema
And collections of XML documents are frequently heterogeneous.

Introduction

So we have to use relevance ranking in order to define IR on XML.

Problem: traditional IR is “document-centric”.

XML IR should:
Be much more fine-grained
Take document structure into account
Allow more complex analysis than determination of relevance


Motivation

We have the following XML document named articles.xml:

article #a1
  article-title #a2: Internet Technologies
  author #a3
    fname #a4: Jane
    sname #a5: Doe
  chapter #a6
    ct #a7: Caching and Replication
  chapter #a10
    ct #a11: Search and Retrieval
    section #a12
      section-title #a13: Search Engine
    section #a14
      section-title #a15: Information Retrieval
    section #a16
      section-title #a17: Examples
      p #a18: … Here are some IR based Search Engines: …
      p #a19: …search engine NewSearch uses a new information retrieval technology
      p #a20: semantic information retrieval techniques are also being incorporated into some search engines

Motivation

Consider the query: Find document components in articles.xml that are about “search engine”. Relevance to “internet” and “information retrieval” is desirable but not necessary.

Using AND and OR predicates will not give us the desired result.


Motivation

Illustrating the granularity problem: which elements should be ranked?

If we rank articles:
The user will see the whole article, while the relevant information is concentrated only in the third chapter.

If we rank paragraphs:
The paragraphs of the last section will be returned separately.
• The semantic linkage is broken and has to be reconstructed by the user.

Motivation

IR-style XML queries don’t have to stand alone.

If the user knows the structure of the XML document, they can add structural constraints and limit the number of uninteresting results.


Algebra

We want to fold into a database framework the notion of relevance scoring and ranking

Algebra

Scored Data Tree
Definition: A rooted ordered tree in which each node has attribute-value pairs, including at least a tag and a real-valued score.

The score of a tree is the score of its root node.

Example:

article[3.6] #a1
  author #a3
    sname #a5
  section[3.6] #a16

Algebra

Scored Pattern Tree
Definition: P = (T, F, S)
• T: a node-labeled and edge-labeled tree
• F: a formula, a boolean combination of predicates applicable to nodes
• S: a set of scoring functions

Algebra

Scored Pattern Tree
Example:

Query 2: Find document components in articles.xml that are part of an article written by an author with last name “Doe” and are about “search engine”. Relevance to “internet” and “information retrieval” is desirable but not necessary.

T:
$1
  $2 (pc)
    $3 (pc)
  $4 (ad*)

F:
$1.tag = article & $2.tag = author & $3.tag = sname & $3.content = “Doe”

S:
$4.score = { ScoreFoo({“search engine”}, {“internet”, “information retrieval”}) }
$1.score = $4.score

Algebra

Common operators:
Selection => Scored Selection
Projection => Scored Projection
Join => Scored Join

New operators:
Threshold
Pick

Algebra (New Operators)

Threshold

[Figure: Threshold applied with the scored pattern tree of Query 2 (T, F and S as in the Scored Pattern Tree example) to a collection of scored data trees of the form

article[3.6]
  author
    sname
  section[3.6]

The output is a subset of the input trees.]

Algebra (New Operators)

Pick

[Figure: Pick with pick condition PC and the scored pattern tree of Query 2 (T, F and S as in the Scored Pattern Tree example), applied to three scored data trees of the form

article[3.6]
  author
    sname
  section[3.6]

Two of the three trees appear in the output.]

Algebra (New Operators)

Pick Example:

Data Tree

article[5.6] #a1
  title[0.6] #a2
  sname #a5
  chapter[5.0] #a10
    section[0.8] #a12
      title[0.8] #a13
    section[0.6] #a14
      title[0.6] #a15
    section[3.6] #a16
      p[0.8] #a18
      p[1.4] #a19
      p[1.4] #a20

Pick Condition

Data is relevant if:

1. score > 0.8

2. more than 50% of children are relevant

3. its direct parent node is not picked
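One possible reading of this pick condition is sketched below. The interpretation is an assumption: a node counts as relevant when its score exceeds 0.8, it is picked when it is relevant, more than 50% of its children are relevant (a leaf passes this test vacuously) and its direct parent was not picked. The tree is a simplified, hand-built subtree whose scores are borrowed from the slide; the node names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    score: float
    children: list = field(default_factory=list)


def pick(node, min_score=0.8, min_fraction=0.5, parent_picked=False, out=None):
    """Return the names of picked nodes in document order."""
    if out is None:
        out = []
    relevant = node.score > min_score
    kids = node.children
    relevant_kids = sum(1 for c in kids if c.score > min_score)
    # Assumption: a leaf has no children to check and passes this test.
    enough = not kids or relevant_kids / len(kids) > min_fraction
    picked = relevant and enough and not parent_picked
    if picked:
        out.append(node.name)
    for child in kids:
        pick(child, min_score, min_fraction, picked, out)
    return out


# Hypothetical simplified tree: one chapter with three sections.
tree = Node("chapter", 5.0, [
    Node("section1", 0.8),
    Node("section2", 0.6),
    Node("section3", 3.6, [
        Node("p1", 0.8),
        Node("p2", 1.4),
        Node("p3", 1.4),
    ]),
])
picked_names = pick(tree)
print(picked_names)
```

The chapter is relevant but only one of its three sections is, so it is skipped; the third section has two relevant paragraphs out of three and is picked, which in turn suppresses its paragraphs.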

Translating to XQuery

Query 1: Find document components in articles.xml that are about “search engine”. Relevance to “internet” and “information retrieval” is desirable but not necessary.

XQuery:
For $a in document(“articles.xml”)//article/descendant-or-self::*
Score $a using ScoreFoo($a, {“search engine”}, {“internet”, “information retrieval”})
Pick $a using PickFoo($a)
Return
<result><score>$a</score></result>
Sortby(score)
Threshold @score >= 4 stop after 5


Access Methods

Score-Generating Methods
TermJoin

Score-Utilizing Methods
Pick

Score-Generating Methods

How do we give an initial score to the data tree?
The score of every node should be computed from the number of occurrences of the search terms in the node or its descendants.

Naïve algorithm

For every node, recompute the scores of all its ancestors.

The running time is poor, because every node triggers an update of each of its ancestors.

TermJoin

Stack-based algorithm:
Use a stack to store the ancestors of the current node.
Each term occurrence is then credited to all the ancestors held on the stack.
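A stack-based count in the spirit of TermJoin can be sketched as below. This is a simplified variant written for illustration, not the paper's algorithm: it counts occurrences of a single term, requires unique element names, and hands each element's subtotal to its parent when the element is popped instead of updating every ancestor on each occurrence. The function name and the input data are assumptions.

```python
def term_join(elements, occurrences):
    """elements: (name, left, right) sorted by left; occurrences: sorted
    positions of the term. Returns {name: occurrences inside the element}."""
    counts = {}
    stack = []  # open elements as [name, right, count]

    def close_until(pos):
        # Pop every element that ends before pos and pass its count upward.
        while stack and stack[-1][1] < pos:
            name, _right, count = stack.pop()
            counts[name] = count
            if stack:
                stack[-1][2] += count

    i = j = 0
    while i < len(elements) or j < len(occurrences):
        take_element = j >= len(occurrences) or (
            i < len(elements) and elements[i][1] < occurrences[j]
        )
        if take_element:
            name, left, right = elements[i]
            close_until(left)
            stack.append([name, right, 0])
            i += 1
        else:
            close_until(occurrences[j])
            if stack:
                stack[-1][2] += 1  # credit the innermost open element
            j += 1
    close_until(float("inf"))
    return counts


# Hypothetical data: three nested elements and three occurrences of the term.
elements = [("article", 1, 9), ("chapter", 2, 7), ("section", 3, 5)]
occurrences = [4, 6, 8]
term_counts = term_join(elements, occurrences)
print(term_counts)
```

Each element and each occurrence is pushed or counted once, so the work is linear in the input size, whereas the naive approach pays the depth of the tree for every occurrence.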

TermJoin

[Figure: a TermJoin run for the phrase “a” over a tree with nodes at (1:9), (2:7), (3:5), (4), (6) and (8), showing the stream Ta and the encoding stack with a running count of “a” for each element.]

If we have more than one word in the phrase, we process several matching streams simultaneously.

Score-Utilizing Methods

Methods that help us filter the data according to its scores.

Two such methods are:
Threshold
Pick

Pick can be quite a challenge to implement.

Score-Utilizing Methods

Pick algorithm
The most complex part of the algorithm is removing redundancy.
There is vertical (parent-child) and horizontal (among siblings, e.g. return only the first author of a relevant article) redundancy.

The problem is solved with a stack-based algorithm.

Pick algorithm

Pick condition: score >= 2, percentage >= 50%

Data:

1 chapter
  2 title: Search and retrieval
  3 section
    4 p: … IR … Search engine
    5 p: … Search engine retrieval of syntactic information
  6 section
  7 section

[Figure: the nodes are annotated with scores (1, 2, 2, 0, 4, 5, 0 as drawn) and processed with two stacks. The ancestor stack contains elements not yet fully explored; the main stack contains elements that cannot yet be eliminated. Each open element carries a counter of relevant children over children seen so far (0/1, 1/1, 2/2, 1/2, 1/3, 1/4).]

Algebra (New Operators)

Pick

T:
$1
  $4 (ad*)

F:
$1.tag = article

S:
$4.score = { ScoreFoo({“search engine”}) }
$1.score = $4.score

PC: score >= 2, percentage >= 50%

Input:

1 chapter
  2 title: Search and retrieval
  3 section
    4 p: … IR … Search engine
    5 p: … Search engine retrieval of syntactic information
  6 section
  7 section

Output:

section
  p: … IR … Search engine
  p: … Search engine retrieval of syntactic information


Conclusion

Stack-based algorithms are used for efficient implementation of the new ideas.

A usable algebra is presented that deals with scoring and relevance in XML keyword search.

It is a possible extension of XQuery.