Inference of Concise DTDs from XML data
Geert Jan Bex (1)
Frank Neven (1)
Thomas Schwentick (2)
Karl Tuyls (3)
(1) Hasselt University and Transnational University of Limburg
(2) Dortmund University
(3) Maastricht University and Transnational University of Limburg
Outline
• Goals & motivation
• Problem setting
• iDTD: Sample → SOA → SORE
• CRX: Sample → CHARE
• Experiments
• Extensions
• Conclusions
Aims & requirements
• Problem: infer DTD from XML corpus
• Requirements:
– Concise: humans can interpret/validate
– Work on large data sets
– Work on small data sets
– Robust to noise
[Figure: XML corpus → DTD]
Why DTD inference?
• Schema inference
– ≈ 50 % of XML documents: no schema [Barbosa et al. 2005]
– ≈ 66 % of DTDs and XSDs: not valid [Bex et al. 2005]
– Improving existing schemas
– “Noisy” XML documents: ≈ 90 % of XHTML docs are not valid
• Related work
– Fails on real-world, large data sets
– Results not concise
Why schemas?
• Validation: efficiency, security
• Optimization: search, processing
• Static analysis, type checking (e.g., XQuery)
• Software development: modeling, OR-mapping
• Integration: (meta-)data sources
• Schema matching
• Semantics
XML documents
[Figure: two XML trees. First tree: book with children title, author, author, year. Second tree: book with children title, editor, year, isbn.]
Learning a regular expression from a set of strings
title (author+ + editor+) year isbn?
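A minimal sketch of what this content model means for the two book elements above. The translation into Python's `re` syntax (union `+` written as `|`, one trailing space per element name) and the helper name `matches` are assumptions made for illustration, not part of the slides.

```python
import re

# Child-element sequences of the two <book> elements on the previous slide.
samples = [
    ["title", "author", "author", "year"],
    ["title", "editor", "year", "isbn"],
]

# Content model: title (author+ + editor+) year isbn?
# In Python syntax the union '+' becomes '|'; every element name is
# followed by one space so that names behave as whole tokens.
CONTENT_MODEL = re.compile(r"title (?:(?:author )+|(?:editor )+)year (?:isbn )?")

def matches(children):
    """Return True if the child sequence is in the language of the model."""
    return CONTENT_MODEL.fullmatch(" ".join(children) + " ") is not None

for children in samples:
    print(children, matches(children))
```

Both sample sequences are accepted, while a book without any author or editor is rejected.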
Learning automata?
Well studied, but…
Learning automata ≠ learning regular expressions
((b?(a+c))+ d)+ e
Learning regular languages?
S = { abbb, abbd, acd, ac }
• abbb + abbd + acd + ac
– most specific regex for S
• (a + b + c + d)*
– most general regex for S
• a (b* + c) d? ???
• Generalization vs. specificity
• Positive examples only!
• Impossible… in general
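A small check of the trade-off, as a sketch: all three candidate expressions accept S, but they differ on strings outside S. The probe strings `abd` and `dcba` and the helper `accepts_all` are made up for illustration.

```python
import re

S = ["abbb", "abbd", "acd", "ac"]

# Slide notation uses '+' for union; Python's re module uses '|'.
candidates = {
    "most specific": r"abbb|abbd|acd|ac",
    "most general": r"(a|b|c|d)*",
    "in between": r"a(b*|c)d?",
}

def accepts_all(pattern, strings):
    """True if every string is fully matched by the pattern."""
    return all(re.fullmatch(pattern, s) for s in strings)

def accepted_by(probe):
    """Labels of the candidates that accept the probe string."""
    return {label for label, p in candidates.items() if re.fullmatch(p, probe)}

for label, pattern in candidates.items():
    print(label, accepts_all(pattern, S))

print("abd:", sorted(accepted_by("abd")))
print("dcba:", sorted(accepted_by("dcba")))
```

With positive examples only, nothing in S itself rules out any of the three candidates.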
Subclasses
• Single Occurrence Regular Expressions (SOREs)
– 99 % of regular expressions in DTDs/XSDs
– Infer with iDTD
• CHAin Regular Expressions (CHAREs)
– 90 % of regular expressions in DTDs/XSDs
– Infer with CRX
SOREs
• What’s a SORE
header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence
authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?
title . (author . affiliation?)+ . abstract
• … and what’s not
title . ((author . affiliation)+ + (editor . affiliation)+) . abstract
– duplicate element names
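The single-occurrence condition can be checked syntactically. The sketch below only counts element names; it does not parse the expression, and the function names `repeated_names` and `is_sore` are illustrative.

```python
import re
from collections import Counter

def repeated_names(expr):
    """Element names that occur more than once in a content model."""
    names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", expr)
    return sorted(n for n, k in Counter(names).items() if k > 1)

def is_sore(expr):
    """True if no element name is repeated (single occurrence)."""
    return not repeated_names(expr)

good = "title . (author . affiliation?)+ . abstract"
bad = "title . ((author . affiliation)+ + (editor . affiliation)+) . abstract"

print(is_sore(good), repeated_names(good))
print(is_sore(bad), repeated_names(bad))
```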
Sample → SOA
W = {bacacdacde, cbacdbacde, abccaadcde}
[Figure: Single Occurrence Automaton for W with states a, b, c, d, e]
• Algorithm: 2T-Inf [Garcia & Vidal 1990]
Sample → SOA
• SOA size
– |Σ| + 2 states
– O(|Σ|²) transitions
• Complexity of algorithm
– O(||W||), where ||S|| = ∑_{s ∈ S} |s|
– streaming
• Algorithm sound
– W ⊆ L(SOA)
– in general: |S| << |L(SOA)|
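A sketch of the sample-to-SOA step in the spirit of 2T-Inf: one pass over the sample records which symbols start a string, which end one, and which pairs are adjacent. Representing the automaton as the triple `(initial, final, edges)` is an assumption made here for brevity, and the function names are illustrative.

```python
def two_t_inf(sample):
    """Build a single occurrence automaton from a sample of strings.

    Returns (initial, final, edges): symbols that start a string, symbols
    that end a string, and the set of adjacent symbol pairs.
    """
    initial, final, edges = set(), set(), set()
    for w in sample:
        if not w:
            continue  # empty strings are ignored in this sketch
        initial.add(w[0])
        final.add(w[-1])
        edges.update(zip(w, w[1:]))
    return initial, final, edges

def accepts(soa, w):
    """True if w starts and ends correctly and uses only known edges."""
    initial, final, edges = soa
    return (bool(w) and w[0] in initial and w[-1] in final
            and all(pair in edges for pair in zip(w, w[1:])))

W = ["bacacdacde", "cbacdbacde", "abccaadcde"]
soa = two_t_inf(W)
alphabet = {a for w in W for a in w}
print(len(alphabet) + 2, "states,", len(soa[2]), "symbol-to-symbol transitions")
print(all(accepts(soa, w) for w in W), accepts(soa, "ade"))
```

Every string of W is accepted, and so are strings that never occurred in W, such as `ade`.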
SOA → SORE: REWRITE
[Figure: SOA over a, b, c, d, e rewritten step by step]
• optional b: b?
• disjunction a, c: a+c
• concatenation b?, a+c: b? (a+c)
• self-loop b? (a+c): (b? (a+c))+
• final SORE: ((b? (a+c))+ d)+ e
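A brute-force sanity check, not the REWRITE algorithm itself: it compares the SOA of W = {bacacdacde, cbacdbacde, abccaadcde} with the final expression on every non-empty string up to length 6. The length bound and the helper names are assumptions of this sketch.

```python
import re
from itertools import product

W = ["bacacdacde", "cbacdbacde", "abccaadcde"]

# SOA of W: first symbols, last symbols, adjacent pairs.
initial = {w[0] for w in W}
final = {w[-1] for w in W}
edges = {pair for w in W for pair in zip(w, w[1:])}

def soa_accepts(w):
    return (w[0] in initial and w[-1] in final
            and all(pair in edges for pair in zip(w, w[1:])))

# ((b? (a+c))+ d)+ e in Python syntax: union '+' becomes '|'.
SORE = re.compile(r"((b?(a|c))+d)+e")

def sore_accepts(w):
    return SORE.fullmatch(w) is not None

# Compare the two languages on every non-empty string up to length 6.
alphabet = sorted({a for w in W for a in w})
disagreements = [
    "".join(t)
    for n in range(1, 7)
    for t in product(alphabet, repeat=n)
    if soa_accepts("".join(t)) != sore_accepts("".join(t))
]
print(len(disagreements))
```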
REWRITE: properties
• Theorem
– REWRITE transforms SOA into equivalent SORE for sufficient data, reports failure otherwise (sound & complete)
– Complexity: O(|Σ|⁴)
• SORE size
– |Σ| symbols
– O(|Σ|) operators
REWRITE + repairs = iDTD
W = {bacacdacde, cbacdbacde}
[Figure: SOA for W over a, b, c, d, e]
• no rules apply!
• almost disjunction a, c
• Fix: enable-disjunction, enable-optional
• Result: ((b? (a+c))+ d)+ e
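To see what "insufficient data" looks like, the sketch below lists the SOA transitions that disappear when the third string abccaadcde is dropped from the sample. The comparison of successor sets is only an illustration of why a and c "almost" form a disjunction; the exact preconditions of the rewrite rules are not stated on the slides.

```python
def soa_edges(sample):
    """Adjacent symbol pairs of a sample (SOA transitions between symbols)."""
    return {pair for w in sample for pair in zip(w, w[1:])}

def successors(symbol, sample):
    """Symbols that directly follow the given symbol somewhere in the sample."""
    return {y for x, y in soa_edges(sample) if x == symbol}

full = ["bacacdacde", "cbacdbacde", "abccaadcde"]
small = ["bacacdacde", "cbacdbacde"]

missing = sorted("".join(p) for p in soa_edges(full) - soa_edges(small))
print(missing)

# With the full sample a and c have identical successor sets;
# with the small sample they do not.
print(successors("a", full) == successors("c", full))
print(successors("a", small), successors("c", small))
```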
iDTD: properties
• Theorem
– iDTD transforms SOA into SORE such that L(SOA) ⊆ L(SORE)
• iDTD can be parameterized for performance
CHAREs
• CRX (Chain Regular expression eXtraction) derives CHAin Regular Expressions (CHAREs)
• Definition: a chain regular expression is a sequence of factors f1, …, fn such that no alphabet symbol occurs more than once and each factor is one of
– (a1 + … + ak)
– (a1 + … + ak)?
– (a1 + … + ak)+
– (a1 + … + ak)*
CHAREs
• What’s a chain
header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence
authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?
• … and what’s not
title . (author . affiliation?)+ . abstract
– not a factor
title . ((author . affiliation)+ + (editor . affiliation)+) . abstract
– duplicate element names
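The definition translates into a simple syntactic test: split the expression into top-level factors, require each factor to be a (possibly parenthesized) union of names with an optional ?, + or *, and require all names to be distinct. This is a sketch for the notation used on these slides (factors separated by " . "); the helper names are illustrative.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_]*"
# A factor: one name or a parenthesized union of names, then an optional ?, + or *.
FACTOR = re.compile(rf"(?:{NAME}|\({NAME}(?: \+ {NAME})*\))[?+*]?")

def split_top_level(expr):
    """Split on '.' separators that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(expr):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "." and depth == 0:
            parts.append(expr[start:i].strip())
            start = i + 1
    parts.append(expr[start:].strip())
    return parts

def is_chare(expr):
    """True if every top-level factor is well formed and no name repeats."""
    if not all(FACTOR.fullmatch(f) for f in split_top_level(expr)):
        return False
    names = re.findall(NAME, expr)
    return len(names) == len(set(names))

print(is_chare("authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?"))
print(is_chare("title . (author . affiliation?)+ . abstract"))
```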
CRX run: pre-order relation
Sample W:
a b c c d e
c c c a d
b f e g
b f h i
Pre-order relation →W:
a → b, b → c, c → d, d → e
c → a, a → d
b → f, f → e, e → g
f → h, h → i
[Figure: graph of →W over a, b, c, d, e, f, g, h, i]
If a →W b and b →W c then a →W c
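The pairs listed on this slide can be read off the sample directly. In this sketch, pairs of a symbol with itself (such as c c) are dropped so that the output matches the slide's list; that choice, and the function name `direct_pairs`, are assumptions.

```python
W = ["a b c c d e", "c c c a d", "b f e g", "b f h i"]

def direct_pairs(sample):
    """Pairs (x, y) such that y directly follows x in some string, x != y."""
    pairs = set()
    for line in sample:
        symbols = line.split()
        pairs.update((x, y) for x, y in zip(symbols, symbols[1:]) if x != y)
    return pairs

pairs = direct_pairs(W)
print(sorted(pairs))
```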
CRX run: transitive closure
Sample W:
a b c c d e
c c c a d
b f e g
b f h i
[Figure: transitive closure of →W, with a, b, c merged into a single node]
• Equivalence class: {a, b, c}
• If a →W b and b →W a then a ≈W b
• Symbol occurs in exactly one equivalence class
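A sketch of this step: take the reflexive-transitive closure of "directly followed by" and group symbols that can reach each other. Warshall's algorithm is used here as a stand-in; the slides do not say how the closure is computed, and the function names are illustrative.

```python
from itertools import product

W = ["a b c c d e", "c c c a d", "b f e g", "b f h i"]

def reachability(sample):
    """Reflexive-transitive closure of 'directly followed by' (Warshall)."""
    symbols = sorted({s for line in sample for s in line.split()})
    reach = {(s, s) for s in symbols}
    for line in sample:
        toks = line.split()
        reach.update(zip(toks, toks[1:]))
    # product varies its first component slowest, so k is the outer loop.
    for k, i, j in product(symbols, repeat=3):
        if (i, k) in reach and (k, j) in reach:
            reach.add((i, j))
    return symbols, reach

def equivalence_classes(sample):
    """Group symbols that reach each other in both directions."""
    symbols, reach = reachability(sample)
    classes = {
        frozenset(t for t in symbols if (s, t) in reach and (t, s) in reach)
        for s in symbols
    }
    return sorted(classes, key=min)

classes = equivalence_classes(W)
print([sorted(c) for c in classes])
```

Each symbol ends up in exactly one class: a, b and c share one, every other symbol is alone.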
CRX run: folding
Sample W:
a b c c d e
c c c a d
b f e g
b f h i
[Figure: partial order →W on equivalence classes: a,b,c, then d and f, then e, g, h, i; after folding, d and f form a single node d,f]
• Predecessor set: pred(γ) = {γ’ | γ’ →W γ}
• Successor set: succ(γ) = {γ’ | γ →W γ’}
• →W: partial order on equivalence classes
CRX run: multiplicity & RE
Sample W:
a b c c d e
c c c a d
b f e g
b f h i
• Multiplicities: (a,b,c)+, (d,f), e?, g?, h?, i?
• Topological sort
• Chain Regular Expression: (a + b + c)+ . (d + f) . e? . g? . h? . i?
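A sketch of the multiplicity step only. The folded classes and their order are taken as given here (one topological order consistent with the slide's diagram), not computed; a class gets `+`, `?`, `*` or no marker depending on how often its symbols occur in each string of the sample a b c c d e, c c c a d, b f e g, b f h i. The helper names are illustrative.

```python
W = ["a b c c d e", "c c c a d", "b f e g", "b f h i"]

# Classes after folding, in a topological order of the partial order
# (assumed as input for this sketch rather than derived).
chain = [("a", "b", "c"), ("d", "f"), ("e",), ("g",), ("h",), ("i",)]

def quantifier(cls, sample):
    """Pick '', '?', '+' or '*' from per-string occurrence counts."""
    counts = [sum(tok in cls for tok in line.split()) for line in sample]
    lo, hi = min(counts), max(counts)
    if lo >= 1:
        return "" if hi == 1 else "+"
    return "?" if hi <= 1 else "*"

def factor(cls, sample):
    """Render one class as a factor of the chain regular expression."""
    body = " + ".join(cls)
    q = quantifier(cls, sample)
    return f"({body}){q}" if len(cls) > 1 else f"{body}{q}"

chare = " . ".join(factor(cls, W) for cls in chain)
print(chare)
```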
CRX algorithm: properties
• Optimality: ∀ W linearly ordered, ∀ CHARE r with W ⊆ L(r) and L(r) ⊆ L(rW): rW = r
• Performance: O(||W|| + |Σ|³)
• Training set size: any CHARE r can be learned from {w | w ∈ L(r) ∧ ∀w’ ∈ L(r): |w| ≤ |w’| + 2}
Related work
• XTRACT [Garofalakis et al. 2000]
– Pioneer
– More general than iDTD
– Focuses on regular expressions that don’t occur in real DTDs ⇒ no concise schemas
• Trang: roughly equivalent to CRX
– Inconsistent results
Data
• Real world regular expressions
– SOREs
– Non SOREs
• Real world data when available
• Synthetic data otherwise
[Figure labels: real world data, real world regexes]
Experiments: generalization
[Figure: CRX and iDTD, no repairs]
Experiments: generalization
[Figure: CRX and iDTD]
Extensions
• Incremental computation
– new data → update internal representation (SOA or partial order)
• Noise
– Support for element name too small → ignore element
– SOA: support for edges too small → delete edges before repair
• Numerical predicates
– Bookkeeping: minOccurs, maxOccurs
• Generating XSDs
– Infer data types (integer, double, date, …)
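A sketch of the edge-support idea from the noise bullet. Defining support as the number of sample strings that contain an edge, the threshold value, and the toy sample are all assumptions made for illustration; the slides do not fix any of them.

```python
from collections import Counter

def edge_support(sample):
    """Number of strings in which each adjacent symbol pair occurs."""
    support = Counter()
    for w in sample:
        support.update(set(zip(w, w[1:])))
    return support

def prune_edges(sample, min_support):
    """Keep only SOA edges seen in at least min_support strings."""
    return {e for e, n in edge_support(sample).items() if n >= min_support}

# Hypothetical data: 'acb' plays the role of a noisy string.
sample = ["abc", "abc", "abc", "acb"]
kept = prune_edges(sample, min_support=2)
print(sorted(kept))
```

The same counters can be updated one string at a time, which fits the incremental setting in the first bullet.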
Conclusions
• iDTD + CRX
– learn a robust class of regexes from positive examples
– are complete in their target class for sufficient data
– deal with insufficient data
– perform well on real world data
– run efficiently
• Future work: inferring XML Schemas