
1

Schema & Schema Integration

Carsten Karl

Dennis Schade

Thorsten Dollmann

2

Outline

XTRACT System for inferring DTDs from a set of XML documents

Incremental validation of XML Documents

3

Schema & XML Databases

• Databases need a schema
• DTDs serve the role of the schema of the document
• Efficient storage of XML data
• Optimization of XML queries

DTDs are not mandatory!

4

XTRACT

Goal: Infer DTDs from a set of XML documents

5

Problem Simplification and Abstraction

• Infer a DTD for each tag separately
• Separate example sequences for each <e>
• Infer a “good” DTD for each <e>
• Resulting document DTD is a composition of all inferred “tag”-DTDs
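The per-tag abstraction can be sketched in a few lines of Python. This is only an illustration, assuming the documents are available as strings and using the standard `xml.etree.ElementTree` parser; the function name `example_sequences` is hypothetical and not part of XTRACT.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def example_sequences(documents):
    """Map each tag to the set of child-tag sequences observed below it."""
    sequences = defaultdict(set)
    for text in documents:
        root = ET.fromstring(text)
        for element in root.iter():
            children = tuple(child.tag for child in element)
            if children:  # assumption: leaf elements contribute no sequence
                sequences[element.tag].add(children)
    return dict(sequences)


doc = """<book><title/><author><name/><age/></author>
<author><name/></author><editor><name/></editor></book>"""

for tag, seqs in sorted(example_sequences([doc]).items()):
    print(tag, sorted(seqs))
```

Each tag's sequence set is then the input for inferring that tag's DTD independently of the others.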

6

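Example

<book>
  <title> </title>
  <author>
    <name> </name>
    <age> </age>
  </author>
  <author>
    <name> </name>
  </author>
  <editor>
    <name> </name>
  </editor>
</book>

[Figure: document tree. book has the children title, author, author, editor; the first author has the children name and age; the second author and the editor each have the child name.]

Tag    Example sequence set
book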

7

Example (ctd.)

Tag      Example sequence set
book     { <title><author><author><editor> }
author   { <name><age>, <name> }
editor   { <name> }

18

What is a “good” DTD ?

Given the example sequence set I={ ab, abab, ababab }

Possible DTDs:

Candidate DTD       Concise   Precise
(a|b)*              Yes       No
(ab|abab|ababab)    No        Yes
ab|ab(ab|abab)      No        Yes
(ab)*               Yes       Somewhat

19

What is a “good” DTD ? (ctd.)

A good DTD D must satisfy two restrictions:
• R1: D should be concise
• R2: D should be precise

Minimum Description Length (MDL) quantifies and resolves the tradeoff between R1 and R2

20

The MDL Principle

The MDL principle states: the best theory to infer from a given set of data is the one that minimizes the sum of
1. the length of the theory, in bits
2. the length of the data, in bits, when encoded with the help of the theory

21

Overview of XTRACT System

Input Sequences: I = { ab, abab, ac, ad, bc, bd, bbd, bbbe }

Generalization: Sg = I ∪ { (ab)*, (a|b)*, b*d, b*e }

Factoring: Sf = Sg ∪ { (a|b)(c|d), b*(d|e) }

MDL Module: Inferred DTD: (ab)* | (a|b)(c|d) | b*(d|e)

22

MDL Subsystem

In order to use the MDL principle, we need to
• Define theory description length
• Define data description length
• Solve the resulting minimization problem

23

MDL Coding scheme

Description length of a DTD: number of characters of the DTD

Cost of encoding the example sequences:
• encoding of b in terms of DTD a | b | c is 1 (position of b in the DTD), cost 1
• encoding of bbb in terms of DTD b* is 3 (number of repetitions of b), cost 1
• encoding of b in terms of DTD b is ε (the empty encoding), cost 0
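The simplified coding scheme above can be made concrete with a small Python sketch. It is an illustration under assumptions, not the XTRACT implementation: the classes `Sym`, `Seq`, `Or`, `Star` and the functions `ends`, `encoding_cost`, `mdl_cost` are hypothetical names, element names are single characters, every choice in an alternation costs 1, every repetition count costs 1, and a plain symbol costs 0.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class Sym:
    name: str  # a single element name, here one character


@dataclass(frozen=True)
class Seq:
    parts: Tuple["Node", ...]


@dataclass(frozen=True)
class Or:
    options: Tuple["Node", ...]


@dataclass(frozen=True)
class Star:
    body: "Node"


Node = Union[Sym, Seq, Or, Star]


def ends(node: Node, s: str, i: int) -> Dict[int, int]:
    """Map each end position reachable from i to its cheapest encoding cost."""
    if isinstance(node, Sym):
        return {i + 1: 0} if i < len(s) and s[i] == node.name else {}
    if isinstance(node, Seq):
        frontier = {i: 0}
        for part in node.parts:
            nxt: Dict[int, int] = {}
            for pos, cost in frontier.items():
                for end, c in ends(part, s, pos).items():
                    nxt[end] = min(nxt.get(end, cost + c), cost + c)
            frontier = nxt
        return frontier
    if isinstance(node, Or):
        best: Dict[int, int] = {}
        for option in node.options:
            for end, c in ends(option, s, i).items():
                best[end] = min(best.get(end, c + 1), c + 1)  # +1: chosen index
        return best
    # Star: cost 1 for the repetition count, plus the cost of each repetition
    best = {i: 1}
    frontier = {i: 1}
    while frontier:
        nxt = {}
        for pos, cost in frontier.items():
            for end, c in ends(node.body, s, pos).items():
                if end > pos and cost + c < best.get(end, float("inf")):
                    best[end] = cost + c
                    nxt[end] = cost + c
        frontier = nxt
    return best


def encoding_cost(node: Node, s: str) -> Optional[int]:
    """Cost of encoding s with the DTD, or None if the DTD does not cover s."""
    return ends(node, s, 0).get(len(s))


def mdl_cost(dtd_text: str, node: Node, sequences) -> int:
    """DTD length in characters plus the cost of encoding every sequence."""
    return len(dtd_text) + sum(encoding_cost(node, s) for s in sequences)


a, b, c = Sym("a"), Sym("b"), Sym("c")
print(encoding_cost(Or((a, b, c)), "b"))   # choice among a | b | c
print(encoding_cost(Star(b), "bbb"))       # one repetition count
print(encoding_cost(b, "b"))               # nothing to encode
```

Under these assumptions the three calls reproduce the costs 1, 1 and 0 listed above.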

24

MDL Subsystem Minimization

[Figure: each input sequence (ab, abb, abbb, abbbb, abbbbb) is connected to the candidate DTDs that cover it; candidates and connections are annotated with their MDL costs.]

Candidate DTD   Total MDL cost
(a|b)*          30
ab*             8
abb             3

• (a|b)*: encoding ab costs 1 (for *) + 1 (for a) + 1 (for b) = 3
• ab*: the DTD has length 3 and every covered sequence costs 1
• abb: the DTD has length 3 and encoding abb costs 0


30

Generalization Subsystem

Goal: Infer regular expressions from example sequences
• Produce candidate DTDs such as a*bc, (abc)*, (a|b|c)*, ((ab)*c)*
• Generate more general DTDs

Two heuristics:
• DiscoverSeqPattern(s,r): s=abbbbc => ab*c
• DiscoverOrPattern(s,d): s=abacbc => (a|b|c)*

Candidate DTDs are generated by calling the above functions for appropriate values of r and d

31

DiscoverSeqPattern Example

The pattern must occur at least two times: r=2

a b a b a b c a b c a b a b c
(ab)* c a b c a b a b c
(ab)* c a b c (ab)* c
((ab)* c)*

In the last step the remaining a b c also counts as an instance of (ab)* c, so (ab)* c occurs three times in a row.
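The basic idea of DiscoverSeqPattern can be sketched as a single left-to-right pass. This is a simplified illustration, not the XTRACT algorithm: the function name `discover_seq_pattern` is hypothetical, element names are single characters, and the sketch neither introduces auxiliary symbols nor re-applies itself to its own output, so it does not produce nested patterns such as ((ab)*c)*.

```python
def discover_seq_pattern(s: str, r: int = 2) -> str:
    """Replace each run of at least r consecutive copies of a block by block*."""
    out = []
    i, n = 0, len(s)
    while i < n:
        best = None  # (number of characters covered, repeated block)
        for size in range(1, (n - i) // r + 1):
            block = s[i:i + size]
            reps = 1
            while s[i + reps * size:i + (reps + 1) * size] == block:
                reps += 1
            if reps >= r and (best is None or reps * size > best[0]):
                best = (reps * size, block)
        if best is not None:
            covered, block = best
            out.append(block + "*" if len(block) == 1 else "(" + block + ")*")
            i += covered
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


print(discover_seq_pattern("abbbbc", r=2))  # ab*c
```

A fuller version would replace the discovered pattern by an auxiliary symbol and run again on the shortened sequence, which is how nested stars arise.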

32

DiscoverOrPattern Example

Given:

• the example sequence s=axcxac

• distance parameter d=2

Step 1: Partition

a | x c x | a | c

Step 2: replace pattern a1…an by (a1|..|an)*

a (x|c)* a c

x is an auxiliary symbol introduced by DiscoverSeqPattern, x = ((de)*e)*:

a (((de)*e)*|c)* a c
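The two steps can be sketched in Python as follows. This is an illustration under assumptions: the function name `discover_or_pattern` is hypothetical, element names are single characters, and the partition rule used here is that two occurrences of the same symbol at most d positions apart must fall into the same subsequence.

```python
def discover_or_pattern(s: str, d: int) -> str:
    """Partition s by the distance rule, then generalize each repeated part."""
    out = []
    i, n = 0, len(s)
    while i < n:
        end = i
        j = i
        # Step 1: grow the current part while a symbol recurs within distance d.
        while j <= end:
            for k in range(j + 1, min(n, j + d + 1)):
                if s[k] == s[j]:
                    end = max(end, k)
            j += 1
        part = s[i:end + 1]
        # Step 2: replace a part a1...an by (a1|...|an)*; keep single symbols.
        if end > i:
            symbols = list(dict.fromkeys(part))  # distinct, first-seen order
            out.append("(" + "|".join(symbols) + ")*")
        else:
            out.append(part)
        i = end + 1
    return "".join(out)


print(discover_or_pattern("axcxac", d=2))  # a(x|c)*ac
```

The larger d is, the more occurrences are pulled into one part and the more general the resulting pattern becomes; with d=3 the sequence abacbc collapses into a single part.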

42

Factoring Subsystem

Goal: Combine different candidates to derive more compact, factored DTDs

Example candidate set Sg = { ac, ad, bc, bd }

ac | ad | bc | bd
=> a(c|d) | b(c|d)
=> (a|b)(c|d)

• Reduces MDL description length of the candidate DTDs
• Adapts factoring algorithms for Boolean expressions
• Uses a heuristic algorithm for selecting subsets of candidate DTDs that give a good factored form
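The two factoring steps of the example can be reproduced with a short Python sketch. It is deliberately minimal and not the algorithm used by XTRACT: the names `alternation` and `factor` are hypothetical, candidates are assumed to be plain strings of at least two single-character symbols, and only the first symbol is considered as common prefix.

```python
def alternation(parts):
    """Join distinct parts with |, adding parentheses only when needed."""
    parts = list(dict.fromkeys(parts))
    return parts[0] if len(parts) == 1 else "(" + "|".join(parts) + ")"


def factor(candidates):
    """Factor a common first symbol, then merge groups with equal remainders."""
    by_prefix = {}
    for cand in candidates:
        by_prefix.setdefault(cand[0], []).append(cand[1:])
    step1 = [(prefix, alternation(rest)) for prefix, rest in by_prefix.items()]

    by_suffix = {}
    for prefix, suffix in step1:
        by_suffix.setdefault(suffix, []).append(prefix)
    step2 = [alternation(prefixes) + suffix for suffix, prefixes in by_suffix.items()]

    return "|".join(p + s for p, s in step1), "|".join(step2)


prefix_form, factored = factor(["ac", "ad", "bc", "bd"])
print(prefix_form)  # a(c|d)|b(c|d)
print(factored)     # (a|b)(c|d)
```

The factored form has 10 characters instead of the 11 of ac|ad|bc|bd, which is the kind of description-length reduction the subsystem aims for.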

49

Factoring Subsystem Heuristics

Choose subsets S of candidate DTDs from SG such that
• DTDs in S have a common prefix p or suffix s
• the number of DTDs with this common prefix in SG is high

50

Factoring Prefixes

Input sequences: abcddd, abceee, abcfff, abcggg
Candidate DTDs: abcd*, abce*, abcf*, abcg*
Factored DTD: abc(d*|e*|f*|g*)

• longer prefixes result in MDL cost reduction
• factored DTD covers all input sequences

51




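Factoring Subsystem Heuristics (ctd.)

Choose subsets S of candidate DTDs from SG such that
• DTDs in S have a common prefix p or suffix s
• the number of DTDs with this common prefix in SG is high
• the overlap between every pair of DTDs D, D’ in S is minimal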

53

Factoring Subsystem Overlap


Input sequences: eab, eabb, eabbb, eababab
Candidate DTDs: e(a|b)*, eab*
Factored DTD: e((a|b)*|ab*)

• New factored form has much higher MDL cost!
• Does not cover more input sequences than e(a|b)*

55

Experimental Validation

Comparison of XTRACT with IBM DDbE (Data Description by Example)

Synthetic documents
• Randomly generated example sequences for synthetic DTDs

Real-life documents
• Example documents from different sources, e.g. Newspaper Association of America

56

Synthetic Documents

No   DTD
1    abcde|efgh|ij|klm
2    (a|b|c|d|f)*gh
3    (a|b|c)d*e*(fgh)*
4    (abcd)*|(e|f|g)*|h|(ijklm)*
5    a*|(b|c|d|e|f)*|gh|(i|j|k)*|(lmn)*

• XTRACT recovers every one of them
• DDbE shows serious weaknesses
  • Recovers only the first one correctly
  • Deduced DTDs are over-generalizations
  • Does not even cover all example sequences
  • Level of factoring is limited

57

Real Life Documents

No   Simplified DTD   DTD obtained by XTRACT   DTD obtained by DDbE
1    a|b|c|d|e        a|b|c|d|e                a|b|c|d|e
2    (a|b|c|d|e)*     (a|b|c|d|e)*             (a|b|c|d|e)*
3    ab*c*            ab*c*                    (ab+c*)|(ac*)
4    a*b?c?d?         a*b?c?d?                 (a+b(c|(c?d))?)|((b|a+)?cd)|((a+|b)?d)|((a+|b)?c)|(a+|b)
5    (a(bc)+d)*       (a(bc)*d)*               (a|b|c|d)+
6    (ab?c*d?)*       -                        (a|b|c|d)+

58

Conclusion

• MDL principle used to control the tradeoff between model simplicity and model generalisation
• General-purpose tool to extract regular expressions from example documents
• Experimental results provide strong support

Future work:
• Generalization subsystem should detect patterns containing ? nested within Kleene stars, e.g. (a(bc)?)*
• Enhance the system to detect even more complex DTDs

59

Incremental Validation of XML Documents

60

Abstraction of XML and DTDs

XML documents abstracted as Labeled Ordered Trees (LOT)
• element content and attribute values are ignored

DTD as extended CFG
• start symbol (root)
• productions: associate with each label a regular expression that specifies the acceptable labels of the list of children of a node with the given label

LOT satisfies a DTD ⇔ the tree is a derivation of the grammar

61

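DTDs: Abstraction & Example

root : cars
cars → used new
used → car*
new → car*
car → (year|ε) model

[Figure: example tree with root cars and children used and new; four car nodes with the children year model, year model, model, and year model; leaf values 95 Tigra, 94 Astra, Mini, 03 Boxster.]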
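Checking whether a labeled ordered tree satisfies such a DTD can be sketched with Python's `re` module. This is an illustration only: the class `Node`, the dictionary `CARS_DTD` and the function `satisfies` are hypothetical names, each production is written as a regular expression over space-terminated child labels, and the rules for year and model (no children) are an assumption, since element content is ignored in the abstraction.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


# One production per label; every child label is followed by one space.
CARS_DTD: Dict[str, str] = {
    "cars": r"used new ",
    "used": r"(car )*",
    "new": r"(car )*",
    "car": r"(year )?model ",
    "year": r"",   # assumption: leaves have no element children
    "model": r"",
}


def satisfies(node: Node, dtd: Dict[str, str]) -> bool:
    """True if every node's child-label word matches the rule of its label."""
    rule = dtd.get(node.label)
    if rule is None:
        return False
    word = "".join(child.label + " " for child in node.children)
    if re.fullmatch(rule, word) is None:
        return False
    return all(satisfies(child, dtd) for child in node.children)


def car(with_year: bool) -> Node:
    kids = ([Node("year")] if with_year else []) + [Node("model")]
    return Node("car", kids)


tree = Node("cars", [Node("used", [car(True), car(True)]),
                     Node("new", [car(False), car(True)])])
print(satisfies(tree, CARS_DTD))
```

Validating from scratch like this visits every node, which is the cost that incremental validation tries to avoid after small updates.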

62

Tree Satisfying DTD, General Case

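[Figure: a node whose label has the production r, with k children (positions 1, 2, …, i-1, i, i+1, …, k-1, k) that root the subtrees s1, s2, …, sk-1, sk; the sequence of child labels must belong to L(r).]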

63

Incremental Validation Problem Statement

For each valid tree T : given a series of update commands,

• efficiently decide if the updated tree T’ is valid

• efficiently update auxiliary structure A(T) and T

64

Updates (1): Node Renaming u(vi,)

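[Figure: a node with production r and children at positions 1, 2, …, i-1, i, i+1, …, k-1, k rooting the subtrees s1, s2, …, sk-1, sk; the child vi at position i is renamed.]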

65

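Incremental Validation of Strings

Renaming u(i,b) in string $\sigma_1 \dots \sigma_n$ with respect to the regular language specified by NFA $N = (\Sigma, Q, q_0, F, \delta)$

• validating the updated string from scratch: $O(n\,|Q|^2 \log|Q|)$
• maintain auxiliary information:
  $\mathrm{Pre}(i) = \delta(q_0, \sigma_1 \dots \sigma_{i-1})$
  $\mathrm{Post}(i) = \{\, s \mid \delta(s, \sigma_{i+1} \dots \sigma_n) \cap F \neq \emptyset \,\}$
• $\sigma_1 \dots \sigma_{i-1}\, b\, \sigma_{i+1} \dots \sigma_n$ valid $\Leftrightarrow$ there exist $s_1 \in \mathrm{Pre}(i)$, $s_2 \in \mathrm{Post}(i)$ such that $s_2 \in \delta(s_1, b)$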
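The Pre/Post idea can be sketched in Python as follows. This is an illustration under assumptions: the names `step`, `pre_sets`, `post_sets` and `rename_is_valid` are hypothetical, the NFA is given as a dictionary from (state, symbol) to a set of successor states, positions are 0-based, and the example automaton for ab* is made up for the demonstration.

```python
from typing import Dict, List, Set, Tuple

Delta = Dict[Tuple[str, str], Set[str]]


def step(delta: Delta, states: Set[str], symbol: str) -> Set[str]:
    """States reachable from any state in `states` by reading `symbol`."""
    return {t for s in states for t in delta.get((s, symbol), set())}


def pre_sets(delta: Delta, q0: str, word: str) -> List[Set[str]]:
    """pre[i] = states reachable from q0 after reading word[:i]."""
    pre = [{q0}]
    for symbol in word:
        pre.append(step(delta, pre[-1], symbol))
    return pre


def post_sets(delta: Delta, states: Set[str], final: Set[str], word: str) -> List[Set[str]]:
    """post[i] = states from which reading word[i:] can reach a final state."""
    post = [set(final)]
    for symbol in reversed(word):
        post.append({s for s in states if step(delta, {s}, symbol) & post[-1]})
    post.reverse()
    return post


def rename_is_valid(delta: Delta, pre, post, i: int, b: str) -> bool:
    """Would the word stay in the language if position i were renamed to b?"""
    return bool(step(delta, pre[i], b) & post[i + 1])


# Hypothetical NFA for ab*: q0 --a--> q1, q1 --b--> q1, final state q1.
delta = {("q0", "a"): {"q1"}, ("q1", "b"): {"q1"}}
states, final = {"q0", "q1"}, {"q1"}
word = "abb"
pre = pre_sets(delta, "q0", word)
post = post_sets(delta, states, final, word)
print(rename_is_valid(delta, pre, post, 1, "b"))
print(rename_is_valid(delta, pre, post, 0, "b"))
```

The check itself touches only one prefix set and one suffix set, but after an accepted update the stored sets on both sides of position i become stale, which motivates a different auxiliary structure.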

66

Validating a Renaming u(ai, )

[Figure: run of N on the string. From q0, positions 1, 2, …, i-1 lead to the states in Pre(i); from the states in Post(i), positions i+1, …, n-1, n lead to qF.]

• Validation of one update in O(1) given precomputed Pre and Post
• But u(i, ) requires recomputation of Pre(i), Pre(i+1), … and of Post(i), Post(i-1), …

67

Transition Relation Definition

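[Figure: string positions 1, 2, …, i, …, m, m+1, …, j, …, n-1, n; a run from state q before position i to state q’ after position j.]

$T_{i,j} = \{\, (q, q') \mid q' \in \delta(q, \sigma_i \dots \sigma_j) \,\}$, where $\sigma_i \dots \sigma_j$ is the substring from position i to position j

$T_{i,j} = T_{i,m} \circ T_{m+1,j}$ for $i \le m < j$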

68

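Divide-and-conquer approach

Transition-Relation-Tree $\mathcal{T}_n$ ($n = 2^k$)
• root: $T_{1,2^k}$
• node $T_{i,j}$ has children $T_{i,m}$ and $T_{m+1,j}$ with $m = \lfloor (i+j)/2 \rfloor$
• leaves $T_{i,i}$, $1 \le i \le n$
• number of nodes: $n + n/2 + \dots + 2 + 1 = 2n - 1$
• balanced → $\mathcal{T}_n$ has depth $\log n$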

69

Transition Relation Trees

T1,8

T1,4 T5,8

T1,2 T3,4 T5,6 T7,8

T1,1 T2,2 T3,3 T4,4 T5,5 T6,6 T7,7 T8,8

1 2 3 4 5 6 7 8

70

Updating Tn

• affected nodes lie on the path from a leaf to the root
• bottom-up recomputation of the $T_{i,j}$: each $T_{i,j}$ with children $T_{i,m}$ and $T_{m+1,j}$ for which at least one child has been recomputed is replaced by $T_{i,m} \circ T_{m+1,j}$
• → $O(\log n)$ recomputations
• updated string valid if $(q_0, f) \in T_{1,n}$ for some $f \in F$

71

Maintenance of the Structure and Validation in O(log n)

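u(6, )

[Figure: transition relation tree over positions 1 … 8 with leaves T1,1 … T8,8, inner nodes T1,2, T3,4, T5,6, T7,8, T1,4, T5,8 and root T1,8.]

• the update at position 6 recomputes only T6,6, T5,6, T5,8 and T1,8
• if (q0, qF) ∈ T1,8 then valid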
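A transition relation tree for renamings can be sketched in Python with an array-based binary tree. This is an illustration under assumptions, not the authors' implementation: the names `leaf_relation`, `compose` and `TransitionTree` are hypothetical, the string length must be a power of two (as on the slides), positions are 0-based, and the example NFA for ab* is made up.

```python
from typing import Dict, FrozenSet, Set, Tuple

Rel = FrozenSet[Tuple[str, str]]
Delta = Dict[Tuple[str, str], Set[str]]


def leaf_relation(delta: Delta, states: Set[str], symbol: str) -> Rel:
    """All pairs (q, q') such that reading `symbol` can lead from q to q'."""
    return frozenset((q, t) for q in states for t in delta.get((q, symbol), ()))


def compose(left: Rel, right: Rel) -> Rel:
    """Relational composition: follow `left`, then `right`."""
    return frozenset((p, t) for (p, q) in left for (q2, t) in right if q == q2)


class TransitionTree:
    def __init__(self, delta, states, q0, final, word):
        n = len(word)
        if n == 0 or n & (n - 1):
            raise ValueError("this sketch assumes len(word) is a power of two")
        self.delta, self.states, self.q0, self.final = delta, states, q0, final
        self.n = n
        # node[1] is the root; node[k] has the children node[2k] and node[2k+1].
        self.node = [frozenset()] * (2 * n)
        for i, symbol in enumerate(word):
            self.node[n + i] = leaf_relation(delta, states, symbol)
        for k in range(n - 1, 0, -1):
            self.node[k] = compose(self.node[2 * k], self.node[2 * k + 1])

    def rename(self, i: int, b: str) -> None:
        """Rename position i to b and recompute the path up to the root."""
        k = self.n + i
        self.node[k] = leaf_relation(self.delta, self.states, b)
        k //= 2
        while k >= 1:
            self.node[k] = compose(self.node[2 * k], self.node[2 * k + 1])
            k //= 2

    def valid(self) -> bool:
        return any((self.q0, f) in self.node[1] for f in self.final)


# Hypothetical NFA for ab*: q0 --a--> q1, q1 --b--> q1, final state q1.
delta = {("q0", "a"): {"q1"}, ("q1", "b"): {"q1"}}
tree = TransitionTree(delta, {"q0", "q1"}, "q0", {"q1"}, "abbb")
print(tree.valid())
tree.rename(0, "b")
print(tree.valid())
```

Each renaming recomputes one relation per tree level, so the number of compositions grows with log n rather than n.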

72

Insertions and Deletions

• positions of nodes in the string can change
• length n of the string is dynamic
• → recomputing the entire tree Tn would be necessary

New approach based on B-Trees:
• tree structure can be incrementally maintained
• tree is still balanced and has depth O(log n)

73

Transition B-Trees (2-3 Trees)

[Figure: 2-3 tree over the string positions 1, 2, 3, 5, 6, 7, 9 with the leaves T1 T2 T3 T5 T6 T7 T9 and the inner nodes Ta, Tb, Tc.]

Ta = T1 ∘ T2

If (q0, qF) ∈ Ta ∘ Tb ∘ Tc then valid

74

Transition B-Trees (2-3 Trees) for O(log n) Insertions and Deletions

[Figure sequence: inserting position 8 adds the leaf T8 below Tc, which then covers T7 T8 T9. Inserting position 4 gives the node above T3 T5 T6 a fourth child (T3 T4 T5 T6), so Tb is split into Td and Te. The top level then holds Ta Td Te Tc and is split in turn into Tf and Tg.]

78

Auxiliary Structures for Incremental DTD Validation

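[Figure: a node with production r and children at positions 1, 2, …, i-1, i, i+1, …, k-1, k rooting the subtrees s1, s2, …, sk-1, sk; an auxiliary structure for r is kept over the sequence of child labels, and the update u(vi, ) changes its entry at position i.]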

79

XML Schema Validation

• XML Schema provides a mechanism to decouple element names from their types and thus allows context-dependent definitions of their structure
• An update to a single node may have global repercussions for the typing of the tree
• Need more theory: specialized DTDs, binary tree encoding, non-deterministic tree automata… details are left to the interested reader…

80

Review

Given m updates on a tree of size n:
• incrementally validate DTD in O(m log n)
• validate XML Schema in O(m log² n)

Weakness
• Only updates that affect one node at a time are considered

81

Summary

XTRACT as a tool to infer DTDs from a set of example XML documents

An approach to incrementally validate an XML document after an update

Questions?