1
Schema & Schema Integration
Carsten Karl
Dennis Schade
Thorsten Dollmann
2
Outline
XTRACT System for inferring DTDs from a set of XML documents
Incremental validation of XML Documents
3
Schema & XML Databases
• Databases need a schema
• DTDs serve the role of the schema of the document
  • Efficient storage of XML data
  • Optimization of XML queries
• DTDs are not mandatory!
4
XTRACT
Goal: Infer DTDs from a set of XML documents
5
Problem Simplification and Abstraction
• Infer a DTD for each tag separately
  • Separate example sequences for each <e>
  • Infer a “good” DTD for each <e>
• Resulting document DTD is a composition of all inferred “tag”-DTDs
17
Example
<book>
  <title> </title>
  <author>
    <name> </name>
    <age> </age>
  </author>
  <author>
    <name> </name>
  </author>
  <editor>
    <name> </name>
  </editor>
</book>
Element tree: book has the children title, author, author, editor; the first author has the children name, age; the second author and the editor each have the child name.
Tag      Example sequence set
book     { <title><author><author><editor> }
author   { <name><age>, <name> }
editor   { <name> }
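The per-tag separation of example sequences can be sketched in a few lines of Python. This is only an illustration, not the XTRACT implementation: the function name `example_sequences` is hypothetical, and child sequences are represented as tuples of tag names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def example_sequences(xml_text):
    """Map each tag to the set of child-tag sequences observed under it."""
    sequences = defaultdict(set)
    for element in ET.fromstring(xml_text).iter():
        children = tuple(child.tag for child in element)
        if children:  # elements without children contribute no sequence
            sequences[element.tag].add(children)
    return dict(sequences)

doc = """
<book>
  <title/>
  <author><name/><age/></author>
  <author><name/></author>
  <editor><name/></editor>
</book>
"""
seqs = example_sequences(doc)
for tag, examples in seqs.items():
    print(tag, examples)
```

Each set in `seqs` is then the input for inferring the DTD of that one tag.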
18
What is a “good” DTD ?
Given the example sequence set I = { ab, abab, ababab }
Possible DTDs:
Candidate DTD        Concise   Precise
(a|b)*               Yes       No
(ab|abab|ababab)     No        Yes
ab|ab(ab|abab)       No        Yes
(ab)*                Yes       Somewhat
19
What is a “good” DTD ? (ctd.)
A good DTD D must satisfy two restrictions
• R1: D should be concise
• R2: D should be precise
Minimum Description Length quantifies and resolves the tradeoff between R1 and R2
20
The MDL Principle
MDL principle states: The best theory to infer from a given set of data is the one which minimizes the sum of
1. The length of the theory in bits
2. The length of the data, in bits, when encoded with the help of the theory
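Written as a formula, the principle picks the theory with the smallest combined length. The symbols are assumptions introduced here for illustration: D ranges over candidate DTDs, I is the set of example sequences, L(D) is the length of the theory and L(I | D) the length of the data encoded with its help.

```latex
D^{*} \;=\; \arg\min_{D} \Bigl( L(D) + L(I \mid D) \Bigr)
```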
21
Overview of XTRACT System
Input Sequences: I = { ab, abab, ac, ad, bc, bd, bbd, bbbe }
Generalization: Sg = I ∪ { (ab)*, (a|b)*, b*d, b*e }
Factoring: Sf = Sg ∪ { (a|b)(c|d), b*(d|e) }
MDL Module: Inferred DTD: (ab)* | (a|b)(c|d) | b*(d|e)
22
MDL Subsystem
In order to use the MDL principle, we need to
• Define theory description length
• Define data description length
• Solve the resulting minimization problem
23
MDL Coding scheme
• Description length of a DTD: number of characters of the DTD
• Cost of encoding the example sequences:
  • encoding of b in terms of DTD a | b | c is 1 (position of b in the DTD), cost 1
  • encoding of bbb in terms of DTD b* is 3 (number of repetitions of b), cost 1
  • encoding of b in terms of DTD b is ε (empty), cost 0
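The three encoding examples can be reproduced with a small cost function. This is a sketch under stated assumptions, not the XTRACT coding scheme itself: DTDs are written as a hypothetical nested-tuple mini-AST (`"sym"`, `"seq"`, `"or"`, `"star"`), and every choice of an alternative and every repetition count is charged one unit, as the slide does.

```python
from functools import lru_cache

def encoding_cost(dtd, s):
    """Cheapest cost of encoding s with dtd, or None if dtd does not match s."""

    @lru_cache(maxsize=None)
    def cost(node, i, j):
        kind, parts = node[0], node[1:]
        if kind == "sym":   # a literal costs nothing to encode
            return 0 if s[i:j] == parts[0] else None
        if kind == "or":    # one unit to name the chosen alternative
            found = [c for c in (cost(p, i, j) for p in parts) if c is not None]
            return 1 + min(found) if found else None
        if kind == "seq":
            return seq(parts, i, j)
        if kind == "star":  # one unit for the repetition count
            body = 0 if i == j else reps(parts[0], i, j)
            return None if body is None else 1 + body
        raise ValueError(kind)

    @lru_cache(maxsize=None)
    def seq(parts, i, j):
        if not parts:
            return 0 if i == j else None
        best = None
        for k in range(i, j + 1):
            head = cost(parts[0], i, k)
            tail = seq(parts[1:], k, j) if head is not None else None
            if tail is not None and (best is None or head + tail < best):
                best = head + tail
        return best

    @lru_cache(maxsize=None)
    def reps(node, i, j):
        best = None
        for k in range(i + 1, j + 1):  # every repetition consumes input
            head = cost(node, i, k)
            if head is None:
                continue
            tail = 0 if k == j else reps(node, k, j)
            if tail is not None and (best is None or head + tail < best):
                best = head + tail
        return best

    return cost(dtd, 0, len(s))

A, B, C = ("sym", "a"), ("sym", "b"), ("sym", "c")
print(encoding_cost(("or", A, B, C), "b"))   # b encoded with a|b|c
print(encoding_cost(("star", B), "bbb"))     # bbb encoded with b*
print(encoding_cost(B, "b"))                 # b encoded with b
```

The description length of the DTD itself is simply the character count of its textual form, e.g. `len("(a|b)*")`.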
24
MDL Subsystem Minimization
Input Sequences: ab, abb, abbb, abbbb, abbbbb
Candidate DTDs: (a|b)*, ab*, abb
25
MDL Subsystem Minimization
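Input Sequences: ab, abb, abbb, abbbb, abbbbb
Candidate DTDs: (a|b)*, ab*, abb
Candidate (a|b)*: DTD length 6; encoding costs ab: 3, abb: 4, abbb: 5, abbbb: 6, abbbbb: 7
Encoding cost of ab: 3 = 1 (for *) + (1 for a + 1 for b)
MDL cost shown for (a|b)*: 30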
26
MDL Subsystem Minimization
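Input Sequences: ab, abb, abbb, abbbb, abbbbb
Candidate DTDs: (a|b)*, ab*, abb
Candidate ab*: DTD length 3; encoding cost 1 for each of the five input sequences
MDL cost shown for (a|b)*: 30
MDL cost shown for ab*: 8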
27
MDL Subsystem Minimization
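Input Sequences: ab, abb, abbb, abbbb, abbbbb
Candidate DTDs: (a|b)*, ab*, abb
Candidate abb: DTD length 3; encoding cost 0 for the input sequence abb
MDL cost shown for (a|b)*: 30
MDL cost shown for ab*: 8
MDL cost shown for abb: 3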
28
MDL Subsystem Minimization
Input Sequences: ab, abb, abbb, abbbb, abbbbb
Candidate DTDs: (a|b)*, ab*, abb
MDL cost shown for (a|b)*: 30
MDL cost shown for ab*: 8
MDL cost shown for abb: 3
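The minimization can be illustrated as a covering problem: pick a subset of candidate DTDs so that every input sequence is encoded by its cheapest chosen candidate, and the sum of DTD lengths plus encoding costs is minimal. The sketch below uses exhaustive search, which is only feasible for tiny candidate sets and is not how XTRACT solves the problem. The cost table is an assumption that follows the simplified unit costs of these slides, and `best_cover` is a hypothetical name.

```python
from itertools import combinations

def best_cover(inputs, candidates):
    """candidates maps a DTD to (dtd_length, {input sequence: encoding cost})."""
    best_total, best_choice = None, None
    names = list(candidates)
    for size in range(1, len(names) + 1):
        for chosen in combinations(names, size):
            total = sum(candidates[name][0] for name in chosen)
            for s in inputs:
                costs = [candidates[name][1][s]
                         for name in chosen if s in candidates[name][1]]
                if not costs:      # s is not covered by this subset
                    break
                total += min(costs)
            else:                  # every input sequence was covered
                if best_total is None or total < best_total:
                    best_total, best_choice = total, chosen
    return best_total, best_choice

inputs = ["ab", "abb", "abbb", "abbbb", "abbbbb"]
candidates = {
    "(a|b)*": (6, {s: len(s) + 1 for s in inputs}),  # 1 for * plus 1 per symbol
    "ab*":    (3, {s: 1 for s in inputs}),           # repetition count only
    "abb":    (3, {"abb": 0}),                       # covers just itself
}
total, chosen = best_cover(inputs, candidates)
print(total, chosen)
```

With these numbers the single candidate ab* is selected: its DTD length 3 plus five encodings of cost 1 gives 8.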
30
Generalization Subsystem
Goal: Infer regular expressions from example sequences
• Produce candidate DTDs such as a*bc, (abc)*, (a|b|c)*, ((ab)*c)*
• Generate more general DTDs
Two heuristics:
• DiscoverSeqPattern(s,r): s=abbbbc => ab*c
• DiscoverOrPattern(s,d): s=abacbc => (a|b|c)*
Candidate DTDs are generated by calling the above functions for appropriate values of r and d
31
DiscoverSeqPattern Example
The pattern must occur at least two times: r=2
a b a b a b c a b c a b a b c
( a b ) * c a b c a b a b c
( a b ) * c a b c ( a b ) * c
( ( a b ) * c ) *
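A simplified version of this heuristic fits in a short Python sketch. It is an assumption-laden illustration rather than the XTRACT algorithm: it repeatedly picks the subsequence with the most contiguous repetitions (at least r), replaces every run of at least r repetitions by a starred auxiliary token, and stops when no such run is left. The helper names `star` and `count_reps` are hypothetical.

```python
def star(x):
    """Textual form of the starred subsequence x."""
    body = "".join(x)
    return body + "*" if len(body) == 1 else "(" + body + ")*"

def count_reps(tokens, start, x):
    """Number of back-to-back copies of x beginning at position start."""
    n, count = len(x), 0
    while tokens[start + count * n:start + (count + 1) * n] == x:
        count += 1
    return count

def discover_seq_pattern(s, r=2):
    tokens = list(s)
    while True:
        best = None  # (repetitions, subsequence)
        for start in range(len(tokens)):
            for length in range(1, (len(tokens) - start) // 2 + 1):
                x = tokens[start:start + length]
                reps = count_reps(tokens, start, x)
                if reps >= r and (best is None or reps > best[0]):
                    best = (reps, x)
        if best is None:
            return "".join(tokens)
        x, out, i = best[1], [], 0
        while i < len(tokens):
            reps = count_reps(tokens, i, x)
            if reps >= r:          # collapse the whole run into one token
                out.append(star(x))
                i += reps * len(x)
            else:
                out.append(tokens[i])
                i += 1
        tokens = out

print(discover_seq_pattern("abbbbc"))
print(discover_seq_pattern("abababcababc"))
```

Because this sketch never folds a single occurrence of ab into (ab)*, it leaves the input abababcabcababc at (ab)*cabc(ab)*c instead of generalizing further.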
32
DiscoverOrPattern Example
Given:
• the example sequence s=axcxac
• distance parameter d=2
a x c x a c
33
DiscoverOrPattern Example
Given:
• the example sequence s=axcxac
• distance parameter d=2
a x c x a c
Step 1: Partition
a | x c x | a | c
39
DiscoverOrPattern Example
Given:
• the example sequence s=axcxac
• distance parameter d=2
a | x c x | a | c
Step 2: replace pattern a1…an by (a1|..|an)*
40
DiscoverOrPattern Example
Given:
• the example sequence s=axcxac
• distance parameter d=2
a ( x | c ) * a c
Step 2: replace pattern a1…an by (a1|..|an)*
41
DiscoverOrPattern Example
Given:
• the example sequence s=axcxac
• distance parameter d=2
a ( x | c ) * a c
x is an auxiliary symbol introduced by DiscoverSeqPattern
x = ((de)*e)*
a ( ((de)*e)* | c ) * a c
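The two steps can be sketched as follows. The reading of the distance parameter is an assumption: two occurrences of the same symbol at most d positions apart are forced into the same part of the partition, overlapping parts are merged, and each such part is replaced by a starred alternation of its distinct symbols.

```python
def discover_or_pattern(s, d):
    # Step 1: find spans between occurrences of a symbol at most d apart.
    last_seen, spans = {}, []
    for j, symbol in enumerate(s):
        i = last_seen.get(symbol)
        if i is not None and j - i <= d:
            spans.append((i, j))
        last_seen[symbol] = j
    # Merge overlapping spans into the parts of the partition.
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Step 2: replace each part a1...an by (a1|...|an)*.
    out, pos = [], 0
    for start, end in merged:
        out.append(s[pos:start])
        symbols = list(dict.fromkeys(s[start:end + 1]))  # distinct, in order
        out.append("(" + "|".join(symbols) + ")*")
        pos = end + 1
    out.append(s[pos:])
    return "".join(out)

print(discover_or_pattern("axcxac", 2))
print(discover_or_pattern("abacbc", 3))
```

Symbols here are single characters; an auxiliary symbol such as x would be expanded to its own pattern afterwards.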
48
Factoring Subsystem
Goal: Combine different candidates to derive more compact, factored DTDs
Example candidate set Sg = { ac, ad, bc, bd }
ac | ad | bc | bd => a(c|d) | b(c|d)
=> (a|b)(c|d)
• Reduces MDL description length of the candidate DTDs
• Adapts factoring algorithms for Boolean expressions
• Uses a heuristic algorithm for selecting subsets of candidate DTDs that give a good factored form
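The two factoring steps of the example can be reproduced with a small sketch. It assumes each candidate is a plain concatenation given as a tuple of at least two tokens, and it only factors on the first or last token. The function names are hypothetical, and a real implementation would work on parsed expressions.

```python
from collections import defaultdict

def alt(options):
    """Alternation of the options, parenthesized only if there are several."""
    options = list(dict.fromkeys(options))  # drop duplicates, keep order
    return options[0] if len(options) == 1 else "(" + "|".join(options) + ")"

def factor_by_prefix(dtds):
    """Group candidates by their first token: a c, a d -> a (c|d)."""
    groups = defaultdict(list)
    for dtd in dtds:
        groups[dtd[0]].append("".join(dtd[1:]))
    return [(head, alt(tails)) for head, tails in groups.items()]

def factor_by_suffix(dtds):
    """Group candidates by their last token: a X, b X -> (a|b) X."""
    groups = defaultdict(list)
    for dtd in dtds:
        groups[dtd[-1]].append("".join(dtd[:-1]))
    return [(alt(heads), tail) for tail, heads in groups.items()]

sg = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]
step1 = factor_by_prefix(sg)     # a(c|d) and b(c|d)
step2 = factor_by_suffix(step1)  # (a|b)(c|d)
factored = "|".join("".join(dtd) for dtd in step2)
print(factored)
```

The factored form has 10 characters, compared with 11 for ac|ad|bc|bd, which is where the reduction in description length comes from.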
49
Factoring Subsystem Heuristics
Choose subsets S of candidate DTDs from SG such that
• DTDs in S have a common prefix p or suffix s
• The number of DTDs with this common prefix in SG is high
50
Factoring Prefixes
Input Sequences: abcddd, abceee, abcfff, abcggg
Candidate DTDs: abcd*, abce*, abcf*, abcg*
Factored DTD: abc(d*|e*|f*|g*)
• longer prefixes result in MDL cost reduction
• factored DTD covers all input sequences
52
Factoring Subsystem Heuristics
Choose subsets S of candidate DTDs from SG such that
• DTDs in S have a common prefix p or suffix s
• The number of DTDs with this common prefix in SG is high
• The overlap between every pair of DTDs D, D’ in S should be minimal
54
Factoring Subsystem Overlap
Input Sequences: eab, eabb, eabbb, eababab
Candidate DTDs: e(a|b)*, eab*
Factored form: e((a|b)*|ab*)
• New factored form has much higher MDL cost!
• Does not cover more input sequences than e(a|b)*
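The overlap can be checked directly. For single-character element names the DTD expressions above are also valid Python regular expressions, so this sketch (with a hypothetical helper `covered`) simply tests which input sequences each candidate matches in full.

```python
import re

def covered(dtd, sequences):
    """Input sequences that the DTD expression matches completely."""
    return {s for s in sequences if re.fullmatch(dtd, s)}

inputs = ["eab", "eabb", "eabbb", "eababab"]
general = covered("e(a|b)*", inputs)
specific = covered("eab*", inputs)
overlap = general & specific
factored = covered("e((a|b)*|ab*)", inputs)

print(sorted(overlap))      # sequences covered by both candidates
print(factored == general)  # factoring adds no coverage
```

Every sequence covered by eab* is already covered by e(a|b)*, so combining the two only lengthens the DTD.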
55
Experimental Validation
Comparison of XTRACT with IBM DDbE (Data Description by Example)
• Synthetic Documents: randomly generated example sequences for synthetic DTDs
• Real Life Documents: example documents from different sources, e.g. Newspaper Association of America
56
Synthetic Documents
1 abcde|efgh|ij|klm
2 (a|b|c|d|f)*gh
3 (a|b|c)d*e*(fgh)*
4 (abcd)*|(e|f|g)*|h|(ijklm)*
5 a*|(b|c|d|e|f)*|gh|(i|j|k)*|(lmn)*
XTRACT recovers each single one of them
DDbE shows serious weaknesses
Recovers only the first one correctly
Deduced DTDs are over-generalizations
Does not even cover all example sequences
Level of factoring is limited
57
Real Life Documents
No  Simplified DTD  DTD obtained by XTRACT  DTD obtained by DDbE
1   a|b|c|d|e       a|b|c|d|e               a|b|c|d|e
2   (a|b|c|d|e)*    (a|b|c|d|e)*            (a|b|c|d|e)*
3   ab*c*           ab*c*                   (ab+c*)|(ac*)
4   a*b?c?d?        a*b?c?d?                (a+b(c|(c?d))?)|((b|a+)?cd)|((a+|b)?d)|((a+|b)?c)|(a+|b)
5   (a(bc)+d)*      (a(bc)*d)*              (a|b|c|d)+
6   (ab?c*d?)*      -                       (a|b|c|d)+
58
Conclusion
• MDL principle used to control the tradeoff between model simplicity and model generalisation
• General purpose tool to extract regular expressions from example documents
• Experimental results provide strong support
Future work:
• Generalization subsystem should detect patterns containing ? nested within Kleene stars, e.g. (a(bc)?)*
• Enhance the system to detect even more complex DTDs
59
Incremental Validation of XML - Documents
60
Abstraction of XML and DTDs
XML docs abstracted as Labeled Ordered Trees (LOT)
• element content and attribute values are ignored
DTD as extended CFG
• start symbol (root)
• productions: associate to each label a regular expression that specifies the acceptable labels of the list of children of a node with the given label
LOT satisfies a DTD <-> tree is a derivation of the grammar
61
DTDs: Abstraction & Example
root : cars
cars → used new
used → car*
new → car*
car → (year|ε) model
Example tree (element values shown in parentheses):
cars
  used
    car: year (95), model (Tigra)
    car: year (94), model (Astra)
    car: model (Mini)
  new
    car: year (03), model (Boxster)
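Checking a labeled ordered tree against such a grammar can be sketched as below. The encoding is an assumption made for illustration: a tree is a `(label, children)` pair, the children of a node are turned into a comma-terminated string of labels, and each production is a Python regular expression over that string. The leaf labels year and model are given empty content, which the slide does not state.

```python
import re

DTD = {
    "cars": r"used,new,",
    "used": r"(car,)*",
    "new": r"(car,)*",
    "car": r"(year,)?model,",
    "year": r"",   # assumed: leaves have no element children
    "model": r"",
}

def satisfies(tree, dtd, root):
    """True if the tree is a derivation of the grammar starting at root."""
    def check(node):
        label, children = node
        word = "".join(child[0] + "," for child in children)
        if label not in dtd or re.fullmatch(dtd[label], word) is None:
            return False
        return all(check(child) for child in children)
    return tree[0] == root and check(tree)

def car(*labels):
    return ("car", [(label, []) for label in labels])

tree = ("cars", [
    ("used", [car("year", "model"), car("year", "model"), car("model")]),
    ("new", [car("year", "model")]),
])
broken = ("cars", [("used", [car("model", "year")]), ("new", [])])

print(satisfies(tree, DTD, "cars"))
print(satisfies(broken, DTD, "cars"))
```

This validates from scratch and visits every node, which is the cost the incremental approach tries to avoid.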
62
Tree Satisfying DTD, General Case
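[Figure: a tree node whose label has a DTD production with regular expression r; its children at positions 1, 2, …, i-1, i, i+1, …, k-1, k are the roots of the subtrees s1, s2, …, sk-1, sk, and the word formed by the children's labels (e.g. a, b, c) must be in L(r).]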
63
Incremental Validation Problem Statement
For each valid tree T : given a series of update commands,
• efficiently decide if the updated tree T’ is valid
• efficiently update auxiliary structure A(T) and T
64
Updates (1): Node Renaming u(vi,)
[Figure: node vi, the child at position i among the children with subtrees s1, s2, …, sk-1, sk of a node with regular expression r, is renamed.]
65
Incremental Validation of Strings
Renaming u(i,b) in string σ1…σn with respect to regular language specified by NFA N(Σ,Q,q0,F,δ)
• validating updated string from scratch: O(n |Q|^2 log|Q|)
• maintain auxiliary information:
  Pre(i) = δ(q0, σ1 … σi-1)
  Post(i) = { s | δ(s, σi+1 … σn) ∩ F ≠ ∅ }
σ1 … σi-1 b σi+1 … σn valid <-> exists s1 ∈ Pre(i), s2 ∈ Post(i) such that s2 ∈ δ(s1, b)
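The Pre/Post idea can be sketched for a small NFA. The representation is an assumption: δ is a dictionary from (state, symbol) to a set of states, positions are 0-based, and the example automaton for (ab)* is made up for illustration.

```python
def step(delta, states, symbol):
    """All states reachable from the given states by reading one symbol."""
    return {t for s in states for t in delta.get((s, symbol), ())}

def pre_post(delta, q0, finals, states, word):
    n = len(word)
    pre, post = [None] * n, [None] * n
    current = {q0}
    for i in range(n):              # pre[i]: reachable from q0 on word[:i]
        pre[i] = current
        current = step(delta, current, word[i])
    accepting = set(finals)
    for i in range(n - 1, -1, -1):  # post[i]: states from which word[i+1:] reaches F
        post[i] = accepting
        accepting = {s for s in states if step(delta, {s}, word[i]) & accepting}
    return pre, post

def valid_after_renaming(delta, pre, post, i, b):
    """Is the word still accepted once position i is renamed to b?"""
    return bool(step(delta, pre[i], b) & post[i])

# NFA for (ab)*: state 0 is both initial and final.
delta = {(0, "a"): {1}, (1, "b"): {0}}
pre, post = pre_post(delta, 0, {0}, {0, 1}, "abab")
print(valid_after_renaming(delta, pre, post, 1, "b"))
print(valid_after_renaming(delta, pre, post, 1, "a"))
```

The check itself touches only position i, but an accepted renaming leaves Pre and Post stale for the neighbouring positions, so they would have to be recomputed.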
66
Validating a Renaming u(ai, )
[Figure: NFA N reads positions 1, 2, …, i-1 starting in q0 and reaches the states in Pre(i); from the states in Post(i) it reads positions i+1, …, n-1, n and reaches qF.]
Validation of one update in O(1) given precomputed Pre and Post
But u(i, ) requires recomputation of Pre(i), Pre(i+1), … and of Post(i), Post(i-1), …
67
Transition Relation Definition
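[Figure: string positions 1, 2, …, i, …, m, m+1, …, j, …, n-1, n; a run that starts in state q and reads positions i … j ends in state q’.]
Ti,j = { (q, q’) | q’ ∈ δ(q, σi … σj) }
Ti,j = Ti,m ∘ Tm+1,j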
68
Divide-and-conquer approach
Transition-Relation-Tree Τn (n = 2^k)
• root: T1,n
• node Ti,j has children Ti,m and Tm+1,j (m the midpoint of i … j)
• leaves Ti,i, 1≤i≤n
• number of nodes: n + (n/2) + … + 2 + 1 = 2n-1
• balanced → Τn has depth log n
69
Transition Relation Trees
T1,8
T1,4 T5,8
T1,2 T3,4 T5,6 T7,8
T1,1 T2,2 T3,3 T4,4 T5,5 T6,6 T7,7 T8,8
1 2 3 4 5 6 7 8
70
Updating Tn
• affected nodes are lying on the path from a leaf to the root
• bottom-up recomputing of the Ti,j: each Ti,j with children Ti,m and Tm+1,j for which at least one child has been recomputed is replaced by Ti,m ∘ Tm+1,j
→ O(log n) recomputations
updated string valid if (q0, f) ∈ T1,n for some f ∈ F
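A transition relation tree with logarithmic renaming can be sketched as an array-based binary tree. This is an illustration under stated assumptions: the string length is a power of two, positions are 0-based, δ is a dictionary from (state, symbol) to a set of states, and the class name `TransitionTree` is hypothetical.

```python
def leaf(delta, states, symbol):
    """Transition relation of a single symbol."""
    return {(q, t) for q in states for t in delta.get((q, symbol), ())}

def compose(left, right):
    """Relational composition: first follow left, then right."""
    return {(q, t) for (q, m) in left for (m2, t) in right if m == m2}

class TransitionTree:
    def __init__(self, delta, states, word):
        self.delta, self.states, self.n = delta, states, len(word)
        self.nodes = [set() for _ in range(2 * self.n)]  # node 1 is the root
        for i, symbol in enumerate(word):
            self.nodes[self.n + i] = leaf(delta, states, symbol)
        for k in range(self.n - 1, 0, -1):
            self.nodes[k] = compose(self.nodes[2 * k], self.nodes[2 * k + 1])

    def rename(self, i, symbol):
        """Rename position i and recompute only the path to the root."""
        k = self.n + i
        self.nodes[k] = leaf(self.delta, self.states, symbol)
        k //= 2
        while k >= 1:
            self.nodes[k] = compose(self.nodes[2 * k], self.nodes[2 * k + 1])
            k //= 2

    def is_valid(self, q0, finals):
        return any((q0, f) in self.nodes[1] for f in finals)

# NFA for (ab)*: state 0 is both initial and final.
delta = {(0, "a"): {1}, (1, "b"): {0}}
tree = TransitionTree(delta, {0, 1}, "abab")
before = tree.is_valid(0, {0})
tree.rename(1, "a")             # the string becomes aaab
after_bad = tree.is_valid(0, {0})
tree.rename(1, "b")             # back to abab
after_fix = tree.is_valid(0, {0})
print(before, after_bad, after_fix)
```

Each renaming recomputes one leaf and its log n ancestors, and validity is then read off the root relation.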
71
Maintenance of the Structure and Validation in O(log n)
u(6, )
T1,8
T1,4 T5,8
T1,2 T3,4 T5,6 T7,8
T1,1 T2,2 T3,3 T4,4 T5,5 T6,6 T7,7 T8,8
1 2 3 4 5 6 7 8
Recomputed nodes: T6,6, T5,6, T5,8, T1,8
If (q0, qF) ∈ T1,8 then valid
72
Insertions and Deletions
• positions of nodes in the string can change
• length n of string is dynamic
→ Recomputing of the entire tree Tn necessary
New approach based on B-Trees:
• tree structure can be incrementally maintained
• tree is still balanced and has depth O(log n)
73
Transition B-Trees (2-3 Trees)
Leaves (string positions): 1 2 3 5 6 7 9
Leaf relations: T1 T2 T3 T5 T6 T7 T9
Inner nodes: Ta Tb Tc, with Ta = T1 ∘ T2
If (q0, qF) ∈ Ta ∘ Tb ∘ Tc then valid
74
Transition B-Trees (2-3 Trees) for O(log n) Insertions and Deletions
Leaves (string positions): 1 2 3 5 6 7 8 9
Leaf relations: T1 T2 T3 T5 T6 T7 T8 T9
Inner nodes: Ta Tb Tc
75
Transition B-Trees (2-3 Trees) for O(log n) Insertions and Deletions
Leaves (string positions): 1 2 3 4 5 6 7 8 9
Leaf relations: T1 T2 | T3 T5 T6 | T7 T8 T9
Inner nodes: Ta Tb Tc
76
Transition B-Trees (2-3 Trees) for O(log n) Insertions and Deletions
Leaves (string positions): 1 2 3 4 5 6 7 8 9
Leaf relations: T1 T2 | T3 T4 T5 T6 | T7 T8 T9
Inner nodes: Ta Tb Tc
77
Transition B-Trees (2-3 Trees) for O(log n) Insertions and Deletions
Leaves (string positions): 1 2 3 4 5 6 7 8 9
Leaf relations: T1 T2 | T3 T4 T5 T6 | T7 T8 T9
Inner nodes: Ta Td Te Tc
Upper level: Tf Tg
78
Auxiliary Structures for Incremental DTD Validation
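[Figure: renaming u(vi, ) of the node vi, the child at position i among the children with subtrees s1, s2, …, sk-1, sk of a node with regular expression r; auxiliary structures for r are attached to the child sequences.]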
79
XML Schema Validation
• XML Schema provides a mechanism to decouple element names from their types and thus allows context-dependent definitions of their structure
• Update to a single node may have global repercussions for the typing of the tree
• Need more theory: specialized DTDs, binary tree encoding, non-deterministic tree automata… details are left to the interested reader…
80
Review
Given m updates on tree of size n:
• incrementally validate DTD in O(m log n)
• validate XML Schema in O(m log^2 n)
Weakness
• Only updates that affect one node at a time are considered
81
Summary
XTRACT as a tool to infer DTDs from a set of example XML documents
An approach to incrementally validate an XML document after an update
Questions?