1 Schema & Schema Integration Carsten Karl Dennis Schade Thorsten Dollmann


Post on 21-Jan-2016


Schema & Schema Integration

Carsten Karl

Dennis Schade

Thorsten Dollmann


Outline

XTRACT System for inferring DTDs from a set of XML documents

Incremental validation of XML Documents

Schema & XML Databases

• Databases need a schema
• DTDs serve the role of the schema of the document
• Efficient storage of XML data
• Optimization of XML queries
• DTDs are not mandatory!

XTRACT

Goal: Infer DTDs from a set of XML documents

Problem Simplification and Abstraction

• Infer a DTD for each tag separately
• Separate the example sequences for each tag <e>
• Infer a “good” DTD for each <e>
• The resulting document DTD is a composition of all inferred “tag” DTDs

Example

<book>
  <title> </title>
  <author>
    <name> </name>
    <age> </age>
  </author>
  <author>
    <name> </name>
  </author>
  <editor>
    <name> </name>
  </editor>
</book>

[Figure: document tree. book has children title, author, author, editor; the first author has children name and age; the second author and the editor each have one child, name.]

Tag      Example sequence set
book     { <title><author><author><editor> }
author   { <name><age>, <name> }
editor   { <name> }
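The per-tag separation shown in the table can be sketched in a few lines. This is only an illustration, not XTRACT's own code: the function name `example_sequences` and the constant `BOOK` are made up here, and the standard-library `xml.etree.ElementTree` parser is assumed to be an acceptable stand-in for whatever parser the system uses.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, Set, Tuple


def example_sequences(xml_text: str) -> Dict[str, Set[Tuple[str, ...]]]:
    """Map each tag to the set of child-tag sequences observed under it."""
    sequences = defaultdict(set)
    root = ET.fromstring(xml_text)
    for element in root.iter():          # visits the root and all descendants
        children = tuple(child.tag for child in element)
        if children:                     # leaf tags get no row, as on the slide
            sequences[element.tag].add(children)
    return dict(sequences)


# The book document from the slide (hypothetical constant name).
BOOK = """
<book>
  <title> </title>
  <author><name> </name><age> </age></author>
  <author><name> </name></author>
  <editor><name> </name></editor>
</book>
"""

if __name__ == "__main__":
    for tag, seqs in example_sequences(BOOK).items():
        print(tag, sorted(seqs))
```

Using a set per tag means that repeated identical sequences are stored once, which matches the "example sequence set" wording of the table.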

What is a “good” DTD?

Given the example sequence set I = { ab, abab, ababab }

Possible DTDs:

Candidate DTD        Concise   Precise
(a|b)*               Yes       No
(ab|abab|ababab)     No        Yes
ab|ab(ab|abab)       No        Yes
(ab)*                Yes       Somewhat

What is a “good” DTD? (ctd.)

A good DTD D must satisfy two restrictions:
• R1: D should be concise
• R2: D should be precise

The Minimum Description Length (MDL) principle quantifies and resolves the tradeoff between R1 and R2

The MDL Principle

The MDL principle states: the best theory to infer from a given set of data is the one that minimizes the sum of
1. The length of the theory, in bits
2. The length of the data, in bits, when encoded with the help of the theory
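Written as a formula, the principle reads as follows. The symbols are notation assumed here for illustration, not taken from the slides: D ranges over candidate DTDs (the theory), I is the set of example sequences (the data), L(D) is the length of the DTD in bits, and L(I | D) is the length of the sequences when encoded with the help of D.

```latex
D^{*} \;=\; \arg\min_{D} \bigl( L(D) + L(I \mid D) \bigr)
```

A very general DTD keeps L(D) small but makes L(I | D) large, while a DTD that enumerates the examples does the opposite; the minimum balances conciseness (R1) against precision (R2).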

Overview of XTRACT System

Input Sequences: I = { ab, abab, ac, ad, bc, bd, bbd, bbbe }

1. Generalization: Sg = I ∪ { (ab)*, (a|b)*, b*d, b*e }
2. Factoring: Sf = Sg ∪ { (a|b)(c|d), b*(d|e) }
3. MDL Module: Inferred DTD: (ab)* | (a|b)(c|d) | b*(d|e)

MDL Subsystem

In order to use the MDL principle, we need to
• Define the theory description length
• Define the data description length
• Solve the resulting minimization problem

MDL Coding scheme

Description length of a DTD
• Number of characters of the DTD

Cost of encoding the example sequences
• Encoding of b in terms of DTD a | b | c is 1 (position of b in the DTD), cost 1
• Encoding of bbb in terms of DTD b* is 3 (number of repetitions of b), cost 1
• Encoding of b in terms of DTD b is ε (the empty encoding), cost 0

MDL Subsystem Minimization

Input sequences: ab, abb, abbb, abbbb, abbbbb
Candidate DTDs: ab, (a|b)*, ab*, abb

[Figure: each input sequence is linked to the candidate DTDs that cover it; the links and candidates are annotated with MDL costs.]

Candidate DTD   Figures shown on the slides                                  Total
(a|b)*          6; 3, 4, 5, 6, 7 (for ab: 3 = 1 for * + (1 for a + 1 for b))  30
ab*             3; 1, 1, 1, 1, 1                                              8
abb             3; 0                                                          3
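The simplified cost model used on these slides can be sketched as below. This is an illustration under stated assumptions, not the XTRACT implementation: the names `COST_RULES`, `encoding_cost` and `mdl_cost` are hypothetical, the per-candidate cost rules are written by hand for the three candidates above instead of being derived from the regular expression, and Python's `re` syntax is assumed to coincide with the DTD notation for these simple cases.

```python
import re
from typing import Callable, Dict, List, Optional

SEQUENCES = ["ab", "abb", "abbb", "abbbb", "abbbbb"]

# Hand-written encoding costs (an assumption of this sketch):
#   (a|b)*  one unit for the repetition count plus one unit per symbol choice
#   ab*     one unit for the repetition count of b
#   abb     nothing to encode when the sequence equals the DTD
COST_RULES: Dict[str, Callable[[str], int]] = {
    "(a|b)*": lambda seq: 1 + len(seq),
    "ab*": lambda seq: 1,
    "abb": lambda seq: 0,
}


def encoding_cost(dtd: str, seq: str) -> Optional[int]:
    """Cost of encoding seq with dtd, or None if dtd does not cover seq."""
    if re.fullmatch(dtd, seq) is None:
        return None
    return COST_RULES[dtd](seq)


def mdl_cost(dtd: str, seqs: List[str]) -> Optional[int]:
    """DTD length in characters plus the encoding cost of every sequence."""
    costs = [encoding_cost(dtd, s) for s in seqs]
    if any(c is None for c in costs):
        return None                      # the DTD alone does not cover the input
    return len(dtd) + sum(costs)


if __name__ == "__main__":
    print(mdl_cost("ab*", SEQUENCES))
    print([encoding_cost("(a|b)*", s) for s in SEQUENCES])
```

Under these rules `ab*` costs 3 characters plus one unit per sequence, and the per-sequence costs for `(a|b)*` grow with the sequence length, which is why the more precise candidate wins.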

Generalization Subsystem

Goal: Infer regular expressions from example sequences
• Produce candidate DTDs such as a*bc, (abc)*, (a|b|c)*, ((ab)*c)*

Generate more general DTDs with two heuristics:
• DiscoverSeqPattern(s, r): s = abbbbc => ab*c
• DiscoverOrPattern(s, d): s = abacbc => (a|b|c)*

Candidate DTDs are generated by calling the above functions for appropriate values of r and d

DiscoverSeqPattern Example

The pattern must occur at least two times: r = 2

s = a b a b a b c a b c a b a b c
 => (ab)* c a b c a b a b c
 => (ab)* c a b c (ab)* c
 => ((ab)* c)*
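A much simplified sketch of the idea behind DiscoverSeqPattern follows. It is not the algorithm from the XTRACT system: it only collapses a block that repeats at least r times back to back, always picks the shortest such block at the leftmost position, and repeats until nothing changes. The function names and the second test sequence are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def find_run(tokens: List[str], r: int) -> Optional[Tuple[int, int, int]]:
    """Find the shortest, leftmost block repeated >= r times in a row.

    Returns (start, block_length, repetitions) or None.
    """
    n = len(tokens)
    for blen in range(1, n // r + 1):
        for start in range(0, n - blen * r + 1):
            block = tokens[start:start + blen]
            count = 1
            while tokens[start + count * blen:start + (count + 1) * blen] == block:
                count += 1
            if count >= r:
                return start, blen, count
    return None


def star(block: List[str]) -> str:
    """Render a repeated block as a starred term, e.g. b* or (ab)*."""
    body = "".join(block)
    return body + "*" if len(body) == 1 else "(" + body + ")*"


def discover_seq_pattern(s: str, r: int) -> str:
    if r < 2:
        raise ValueError("r must be at least 2")
    tokens = list(s)                     # one token per symbol
    while True:
        run = find_run(tokens, r)
        if run is None:
            return "".join(tokens)
        start, blen, count = run
        # Replace the whole run by a single auxiliary token such as (ab)*.
        tokens[start:start + blen * count] = [star(tokens[start:start + blen])]


if __name__ == "__main__":
    print(discover_seq_pattern("abbbbc", 2))
```

Because the starred term is kept as one token, a later pass can find repetitions of blocks that contain it, which is how nested patterns such as ((ab)*c)* arise in this sketch.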

DiscoverOrPattern Example

Given:

• the example sequence s = axcxac

• distance parameter d = 2

Step 1: Partition

a | x c x | a | c

Step 2: replace pattern a1…an by (a1|..|an)*

a (x|c)* a c

x is an auxiliary symbol introduced by DiscoverSeqPattern: x = ((de)*e)*

a (((de)*e)*|c)* a c
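The two steps can be sketched as follows. This is one simplified reading of the heuristic, not the original implementation: two occurrences of the same symbol at most d positions apart are forced into the same part, overlapping spans are merged, and every resulting part is replaced by a starred alternation of its distinct symbols. The function name is hypothetical, and symbols are assumed to be single characters.

```python
from typing import Dict, List, Tuple


def discover_or_pattern(s: str, d: int) -> str:
    # Step 1: collect spans [i, j] where the same symbol recurs within distance d.
    spans: List[Tuple[int, int]] = []
    last_seen: Dict[str, int] = {}
    for j, symbol in enumerate(s):
        if symbol in last_seen and j - last_seen[symbol] <= d:
            spans.append((last_seen[symbol], j))
        last_seen[symbol] = j

    # Merge overlapping spans so that the parts form a partition of s.
    merged: List[Tuple[int, int]] = []
    for lo, hi in sorted(spans):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))

    # Step 2: replace each merged part a1...an by (a1|...|an)*.
    out: List[str] = []
    pos = 0
    for lo, hi in merged:
        out.extend(s[pos:lo])                       # untouched single symbols
        distinct = list(dict.fromkeys(s[lo:hi + 1]))  # keep first-seen order
        out.append("(" + "|".join(distinct) + ")*")
        pos = hi + 1
    out.extend(s[pos:])
    return " ".join(out)


if __name__ == "__main__":
    print(discover_or_pattern("axcxac", 2))
```

With d = 2 only the two x occurrences are close enough to force a common part, so a and the trailing c stay outside the alternation; a larger d merges more of the sequence.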

Factoring Subsystem

Goal: Combine different candidates to derive more compact, factored DTDs

Example candidate set Sg = { ac, ad, bc, bd }

ac | ad | bc | bd  =>  a(c|d) | b(c|d)  =>  (a|b)(c|d)

• Reduces the MDL description length of the candidate DTDs
• Adapts factoring algorithms for Boolean expressions
• Uses a heuristic algorithm to select subsets of candidate DTDs that give a good factored form
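The two rewriting steps of the example can be mimicked with a small sketch. It is a toy version under strong assumptions (candidates are non-empty plain strings of single-character symbols, without operators) and is not the Boolean-expression factoring algorithm the slide refers to; `alternation` and `factor` are hypothetical names.

```python
from typing import Dict, List


def alternation(parts: List[str]) -> str:
    """Join alternatives, adding parentheses only when there is a choice."""
    return parts[0] if len(parts) == 1 else "(" + "|".join(parts) + ")"


def factor(candidates: List[str]) -> str:
    # Step 1: factor out the first symbol, e.g. ac, ad -> a(c|d).
    by_prefix: Dict[str, List[str]] = {}
    for cand in candidates:
        by_prefix.setdefault(cand[0], []).append(cand[1:])

    # Step 2: merge prefixes that share the same remainder,
    # e.g. a(c|d), b(c|d) -> (a|b)(c|d).
    by_rest: Dict[str, List[str]] = {}
    for prefix, rests in by_prefix.items():
        by_rest.setdefault(alternation(rests), []).append(prefix)

    return " | ".join(alternation(prefixes) + rest
                      for rest, prefixes in by_rest.items())


if __name__ == "__main__":
    print(factor(["ac", "ad", "bc", "bd"]))
```

The factored form has 10 characters where the plain alternation of the four candidates needs 11, and the saving grows quickly with the number of candidates, which is the source of the MDL reduction.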

Factoring Subsystem Heuristics

Choose subsets S of candidate DTDs from SG such that
• DTDs in S have a common prefix p or suffix s
• the number of DTDs with this common prefix in SG is high

Factoring Prefixes

Input sequences: abcddd, abceee, abcfff, abcggg
Candidate DTDs: abcd*, abce*, abcf*, abcg*
Factored DTD: abc(d*|e*|f*|g*)

• Longer prefixes result in MDL cost reduction
• The factored DTD covers all input sequences
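Factoring out a longest common prefix can be illustrated with the standard-library helper `os.path.commonprefix`, which compares strings character by character. The function name `factor_common_prefix` is hypothetical, and the sketch assumes the shared prefix never cuts through a parenthesised sub-expression and that no candidate equals the prefix itself.

```python
import os
from typing import List


def factor_common_prefix(candidates: List[str]) -> str:
    """Pull the longest common prefix out of a list of candidate DTDs."""
    prefix = os.path.commonprefix(candidates)   # character-wise common prefix
    rests = [cand[len(prefix):] for cand in candidates]
    return prefix + "(" + "|".join(rests) + ")"


if __name__ == "__main__":
    print(factor_common_prefix(["abcd*", "abce*", "abcf*", "abcg*"]))
```

The four candidates together take 20 characters, while the factored form takes 16; the longer the shared prefix, the larger this saving.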

Factoring Subsystem Heuristics (ctd.)

Choose subsets S of candidate DTDs from SG such that
• DTDs in S have a common prefix p or suffix s
• the number of DTDs with this common prefix in SG is high
• the overlap between every pair of DTDs D, D’ in S is minimal

Factoring Subsystem Overlap

Input sequences: eab, eabb, eabbb, eababab
Candidate DTDs: e(a|b)*, eab*
Factored DTD: e((a|b)*|ab*)

• The new factored form has a much higher MDL cost!
• It does not cover more input sequences than e(a|b)*

Experimental Validation

Comparison of XTRACT with IBM DDbE (Data Description by Example)

Synthetic documents
• Randomly generated example sequences for synthetic DTDs

Real-life documents
• Example documents from different sources, e.g. Newspaper Association of America

Synthetic Documents

No  Synthetic DTD
1   abcde|efgh|ij|klm
2   (a|b|c|d|f)*gh
3   (a|b|c)d*e*(fgh)*
4   (abcd)*|(e|f|g)*|h|(ijklm)*
5   a*|(b|c|d|e|f)*|gh|(i|j|k)*|(lmn)*

XTRACT recovers every one of them

DDbE shows serious weaknesses
• Recovers only the first one correctly
• Deduced DTDs are over-generalizations
• Does not even cover all example sequences
• Level of factoring is limited

Real Life Documents

No  Simplified DTD   DTD obtained by XTRACT   DTD obtained by DDbE
1   a|b|c|d|e        a|b|c|d|e                a|b|c|d|e
2   (a|b|c|d|e)*     (a|b|c|d|e)*             (a|b|c|d|e)*
3   ab*c*            ab*c*                    (ab+c*)|(ac*)
4   a*b?c?d?         a*b?c?d?                 (a+b(c|(c?d))?)|((b|a+)?cd)|((a+|b)?d)|((a+|b)?c)|(a+|b)
5   (a(bc)+d)*       (a(bc)*d)*               (a|b|c|d)+
6   (ab?c*d?)*       -                        (a|b|c|d)+

Conclusion

• MDL principle used to control the tradeoff between model simplicity and model generalisation
• General-purpose tool to extract regular expressions from example documents
• Experimental results provide strong support

Future work:
• Generalization subsystem should detect patterns containing ? nested within Kleene stars, e.g. (a(bc)?)*
• Enhance the system to detect even more complex DTDs

Incremental Validation of XML Documents

Abstraction of XML and DTDs

XML documents are abstracted as labeled ordered trees (LOT)
• element content and attribute values are ignored

DTD as extended CFG
• start symbol (root)
• productions: associate to each label a regular expression that specifies the acceptable labels of the list of children of a node with the given label

A LOT satisfies a DTD <-> the tree is a derivation of the grammar

DTDs: Abstraction & Example

root: cars
cars → used new
used → car*
new → car*
car → (year|ε) model

[Figure: example tree rooted at cars with children used and new; four car nodes with children (year, model), (year, model), (model), (year, model) and leaf values 95 Tigra, 94 Astra, Mini, 03 Boxster.]
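Checking a labeled ordered tree against such a DTD can be sketched as follows. The encoding is an assumption of this example: each rule is a Python regular expression over the children's labels, with every label followed by one space so that plain string matching works; `Node`, `CARS_DTD`, `satisfies` and `is_valid` are hypothetical names, and the placement of the four cars under used and new is chosen for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


# One regular expression per label, over space-terminated child labels.
CARS_DTD: Dict[str, str] = {
    "cars": r"used new ",
    "used": r"(car )*",
    "new": r"(car )*",
    "car": r"(year )?model ",
    "year": r"",      # element content is ignored, so leaves have no children
    "model": r"",
}


def satisfies(node: Node, dtd: Dict[str, str]) -> bool:
    """True if every node's child-label sequence matches the rule of its label."""
    rule = dtd.get(node.label)
    word = "".join(child.label + " " for child in node.children)
    if rule is None or re.fullmatch(rule, word) is None:
        return False
    return all(satisfies(child, dtd) for child in node.children)


def is_valid(tree: Node, root_label: str, dtd: Dict[str, str]) -> bool:
    return tree.label == root_label and satisfies(tree, dtd)


def car(*labels: str) -> Node:
    return Node("car", [Node(label) for label in labels])


TREE = Node("cars", [
    Node("used", [car("year", "model"), car("year", "model"), car("model")]),
    Node("new", [car("year", "model")]),
])

if __name__ == "__main__":
    print(is_valid(TREE, "cars", CARS_DTD))
```

This validates from scratch by visiting every node; the incremental approach on the following slides avoids that full pass after an update.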

Tree Satisfying DTD, General Case

[Figure: a node whose children carry the labels σ1 σ2 … σi-1 σi σi+1 … σk-1 σk and root the subtrees s1 s2 … sk-1 sk; the DTD associates the regular expression r with the node's label, and the children's label sequence must be in L(r).]

Incremental Validation Problem Statement

For each valid tree T: given a series of update commands,

• efficiently decide if the updated tree T’ is valid

• efficiently update the auxiliary structure A(T) and T

Updates (1): Node Renaming u(vi, )

[Figure: node vi, the i-th child of a node with rule r and children labeled σ1 … σk (subtrees s1 … sk), receives a new label.]

Incremental Validation of Strings

Renaming u(i, b) in string σ1 … σn with respect to the regular language specified by NFA N(Σ, Q, q0, F, δ)

Validating the updated string from scratch: O(n |Q|^2 log |Q|)

Maintain auxiliary information:
• Pre(i) = δ(q0, σ1 … σi-1)
• Post(i) = { s | δ(s, σi+1 … σn) ∩ F ≠ ∅ }

σ1 … σi-1 b σi+1 … σn valid <-> there exist s1 ∈ Pre(i), s2 ∈ Post(i) such that s2 ∈ δ(s1, b)

Validating a Renaming u(i, b)

[Figure: NFA N run over the string σ1 σ2 … σi-1 σi σi+1 … σn-1 σn; Pre(i) holds the states reachable from q0 after σ1 … σi-1, and Post(i) holds the states from which σi+1 … σn leads to a final state qF.]

Validation of one update in O(1) given precomputed Pre and Post

But u(i, b) requires recomputation of Pre(i+1), Pre(i+2), … and of Post(i-1), Post(i-2), …
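The Pre/Post check can be sketched directly from the definitions. Everything below is an illustrative assumption: positions are 0-based (the slides count from 1), the NFA is given as a dictionary from (state, symbol) to a set of states, the word is non-empty, and the example automaton for (ab)* is made up for the demonstration.

```python
from typing import Dict, List, Set, Tuple

Delta = Dict[Tuple[int, str], Set[int]]


def step(delta: Delta, states: Set[int], symbol: str) -> Set[int]:
    """All states reachable from `states` by reading one symbol."""
    result: Set[int] = set()
    for q in states:
        result |= delta.get((q, symbol), set())
    return result


def compute_pre(delta: Delta, q0: int, word: str) -> List[Set[int]]:
    """pre[i] = states reachable from q0 after reading word[:i]."""
    pre = [{q0}]
    for symbol in word[:-1]:
        pre.append(step(delta, pre[-1], symbol))
    return pre


def compute_post(delta: Delta, states: Set[int], final: Set[int],
                 word: str) -> List[Set[int]]:
    """post[i] = states from which reading word[i+1:] can reach a final state."""
    post: List[Set[int]] = [set() for _ in word]
    post[-1] = set(final)
    for i in range(len(word) - 1, 0, -1):
        post[i - 1] = {q for q in states if step(delta, {q}, word[i]) & post[i]}
    return post


def renaming_is_valid(delta: Delta, pre: List[Set[int]], post: List[Set[int]],
                      i: int, b: str) -> bool:
    """Valid iff some s1 in pre[i] has a b-successor s2 in post[i]."""
    return bool(step(delta, pre[i], b) & post[i])


# Hypothetical NFA for (ab)*: states 0 and 1, start state 0, final state 0.
AB_STAR: Delta = {(0, "a"): {1}, (1, "b"): {0}}

if __name__ == "__main__":
    word = "abab"
    pre = compute_pre(AB_STAR, 0, word)
    post = compute_post(AB_STAR, {0, 1}, {0}, word)
    print(renaming_is_valid(AB_STAR, pre, post, 1, "a"))
```

The check itself only touches pre[i] and post[i], but after an accepted renaming the entries to the right of i in pre and to the left of i in post are stale, which motivates the tree structure introduced next.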

Transition Relation Definition

Ti,j = { (q, q’) | q’ ∈ δ(q, σi … σj) }

[Figure: string σ1 σ2 … σi … σm σm+1 … σj … σn-1 σn; reading σi … σj takes the NFA from state q to state q’.]

For i ≤ m < j: Ti,j = Ti,m ° Tm+1,j

Divide-and-conquer approach

Transition-Relation-Tree Τn (n = 2^k)
• root: T1,2^k
• node Ti,j has children Ti,m and Tm+1,j, where m is the midpoint of i … j
• leaves Ti,i, 1 ≤ i ≤ n
• number of nodes: n + (n/2) + … + 2 + 1 = 2n-1
• balanced → Τn has depth log n

Transition Relation Trees

T1,8
T1,4                T5,8
T1,2      T3,4      T5,6      T7,8
T1,1 T2,2 T3,3 T4,4 T5,5 T6,6 T7,7 T8,8
1    2    3    4    5    6    7    8

Updating Τn

• affected nodes lie on the path from a leaf to the root
• the Ti,j are recomputed bottom-up: each Ti,j with children Ti,m and Tm+1,j for which at least one child has been recomputed is replaced by Ti,m ° Tm+1,j
→ O(log n) recomputations

The updated string is valid if <q0, f> ∈ T1,n for some f ∈ F

Maintenance of the Structure and Validation in O(log n)

Renaming at position 6: only the nodes on the path from leaf T6,6 to the root are recomputed

T6,6 → T5,6 → T5,8 → T1,8

[Figure: the transition relation tree over positions 1 … 8 with the path T6,6, T5,6, T5,8, T1,8 highlighted.]

If (q0, qF) ∈ T1,8 then valid
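The transition relation tree can be sketched as an array-based binary tree. The layout is an assumption of this example and not taken from the slides: node k has children 2k and 2k+1, leaves sit at indices n … 2n-1, positions are 0-based, and the string length must be a power of two. The class and function names and the (ab)* automaton are illustrative.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Delta = Dict[Tuple[int, str], Set[int]]
Relation = FrozenSet[Tuple[int, int]]


def leaf_relation(delta: Delta, symbol: str) -> Relation:
    """T(i,i): all pairs (q, q') with q' in delta(q, symbol)."""
    return frozenset((q, q2) for (q, sym), targets in delta.items()
                     if sym == symbol for q2 in targets)


def compose(left: Relation, right: Relation) -> Relation:
    """Relational composition: (p, r) if (p, q) in left and (q, r) in right."""
    return frozenset((p, r) for (p, q) in left for (q2, r) in right if q == q2)


class TransitionRelationTree:
    def __init__(self, delta: Delta, word: str) -> None:
        n = len(word)
        if n == 0 or n & (n - 1):
            raise ValueError("this sketch needs a length that is a power of two")
        self.delta, self.n = delta, n
        self.nodes = [frozenset()] * (2 * n)
        for i, symbol in enumerate(word):
            self.nodes[n + i] = leaf_relation(delta, symbol)
        for k in range(n - 1, 0, -1):               # build bottom-up
            self.nodes[k] = compose(self.nodes[2 * k], self.nodes[2 * k + 1])

    def rename(self, i: int, symbol: str) -> None:
        """Relabel position i and recompute the O(log n) nodes above it."""
        k = self.n + i
        self.nodes[k] = leaf_relation(self.delta, symbol)
        k //= 2
        while k >= 1:
            self.nodes[k] = compose(self.nodes[2 * k], self.nodes[2 * k + 1])
            k //= 2

    def is_valid(self, q0: int, final: Iterable[int]) -> bool:
        """The string is accepted iff the root relates q0 to a final state."""
        return any((q0, f) in self.nodes[1] for f in final)


# Hypothetical NFA for (ab)*: states 0 and 1, start state 0, final state 0.
AB_STAR: Delta = {(0, "a"): {1}, (1, "b"): {0}}

if __name__ == "__main__":
    tree = TransitionRelationTree(AB_STAR, "abababab")
    tree.rename(5, "a")                             # sixth position, as on the slide
    print(tree.is_valid(0, {0}))
```

Each composition is independent of the string length, so a renaming costs a number of compositions proportional to the depth of the tree.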

Insertions and Deletions

• positions of nodes in the string can change
• length n of the string is dynamic
→ recomputation of the entire tree Τn necessary

New approach based on B-trees:
• tree structure can be incrementally maintained
• tree is still balanced and has depth O(log n)

Transition B-Trees (2-3 Trees)

[Figure: 2-3 tree over the string positions 1, 2, 3, 5, 6, 7, 9 with leaf relations T1, T2, T3, T5, T6, T7, T9 and inner nodes Ta, Tb, Tc.]

Ta = T1 ° T2

If (q0, qF) ∈ Ta ° Tb ° Tc then valid

Transition B-Trees (2-3 Trees) for O(log n) Insertions and Deletions

[Figure sequence: insertions into the 2-3 tree from the previous slide.]
1. Position 8 is inserted: the leaves are T1 T2 T3 T5 T6 T7 T8 T9 under the inner nodes Ta, Tb, Tc
2. Position 4 is inserted: the leaves T3 T4 T5 T6 now fall under a single inner node
3. The inner nodes become Ta Td Te Tc, grouped under the new nodes Tf and Tg

Auxiliary Structures for Incremental DTD Validation

[Figure: a node with rule r and children labeled σ1 … σk (subtrees s1 … sk); the renaming u(vi, ) of child vi is validated against r using an auxiliary structure over the children's labels.]

XML Schema Validation

• XML Schema provides a mechanism to decouple element names from their types and thus allows context-dependent definitions of their structure
• An update to a single node may have global repercussions for the typing of the tree
• More theory is needed: specialized DTDs, binary tree encoding, non-deterministic tree automata… details are left to the interested reader…

Review

Given m updates on a tree of size n:
• incrementally validate DTD in O(m log n)
• validate XML Schema in O(m log^2 n)

Weakness
• Only updates that affect one node at a time are considered

Summary

• XTRACT as a tool to infer DTDs from a set of example XML documents
• An approach to incrementally validate an XML document after an update

Questions?