
Inference of Concise DTDs from XML data

Geert Jan Bex (1), Frank Neven (1), Thomas Schwentick (2), Karl Tuyls (3)

(1) Hasselt University and Transnational University of Limburg
(2) Dortmund University
(3) Maastricht University and Transnational University of Limburg

Outline

• Goals & motivation
• Problem setting
• iDTD: Sample → SOA → SORE
• CRX: Sample → CHARE
• Experiments
• Extensions
• Conclusions

Aims & requirements

• Problem: infer DTD from XML corpus

• Requirements:

– Concise: humans can interpret/validate

– Work on large data sets

– Work on small data sets

– Robust to noise

[Figure: XML corpus → DTD]

Why DTD inference?

• Schema inference
– ≈ 50 % of XML documents: no schema [Barbosa et al. 2005]
– ≈ 66 % of DTDs and XSDs: not valid [Bex et al. 2005]
– Improving existing schemas
– “Noisy” XML documents: ≈ 90 % of XHTML docs are not valid

• Related work
– Fails on real-world, large data sets
– Results not concise

Why schemas?

• Validation: efficiency, security

• Optimization: search, processing

• Static analysis, type checking (e.g., XQuery)

• Software development: modeling, OR-mapping

• Integration : (meta-)data sources

• Schema matching

• Semantics

XML documents

[Figure: two example document trees. A book element with children title, author, author, year; a book element with children title, editor, year, isbn.]

Learning a regular expression from a set of strings

title (author+ + editor+) year isbn?
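
To make the problem setting concrete, the sketch below collects, for every element name, the sequences of child names observed in a corpus. These sequences are the "set of strings" from which a content model is then learned. The helper name `child_sequences` and the two tiny documents are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def child_sequences(xml_texts):
    """Map each element name to the list of child-name sequences seen for it."""
    samples = defaultdict(list)
    for text in xml_texts:
        root = ET.fromstring(text)
        for element in root.iter():
            # One training string per element occurrence: the names of its children.
            samples[element.tag].append([child.tag for child in element])
    return samples


# Hypothetical documents mirroring the two book trees on the slide.
docs = [
    "<book><title/><author/><author/><year/></book>",
    "<book><title/><editor/><year/><isbn/></book>",
]
samples = child_sequences(docs)
print(samples["book"])
```

Each element name gets its own sample, so one regular expression is inferred per element.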

Learning automata?

Well studied, but…

Learning automata ≠ learning regular expressions

((b?(a+c))+ d)+ e

Learning regular languages?

S = { abbb, abbd, acd, ac }

• abbb + abbd + acd + ac
– most specific regex for S

• (a + b + c + d)*
– most general regex for S

• a (b* + c) d?
– ???

Generalization vs. specificity

Positive examples only!

Impossible… in general
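
The trade-off can be checked directly: all three candidate expressions accept the sample S, yet they disagree on strings outside S. The following minimal sketch uses Python's `re` module, where union is written `|` instead of the slide's `+`; the extra test strings `abd` and `ba` are illustrative choices.

```python
import re

S = ["abbb", "abbd", "acd", "ac"]

# The slides write union as "+"; Python's re module writes it as "|".
candidates = {
    "most specific": r"abbb|abbd|acd|ac",
    "most general": r"(a|b|c|d)*",
    "in between": r"a(b*|c)d?",
}


def accepts(pattern, word):
    return re.fullmatch(pattern, word) is not None


# Every candidate is consistent with the positive examples ...
consistent = {name: all(accepts(p, s) for s in S) for name, p in candidates.items()}
# ... but they generalize differently on unseen strings.
on_abd = {name: accepts(p, "abd") for name, p in candidates.items()}
on_ba = {name: accepts(p, "ba") for name, p in candidates.items()}

print(consistent)
print(on_abd)
print(on_ba)
```

With positive examples only, nothing in S says which of these generalizations is the intended one.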

Subclasses

• Single Occurrence Regular Expressions (SOREs)
– 99 % of regular expressions in DTDs/XSDs
– Infer with iDTD

• CHAin Regular Expressions (CHAREs)
– 90 % of regular expressions in DTDs/XSDs
– Infer with CRX


SOREs

• What’s a SORE

header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence

authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?

title . (author . affiliation?)+ . abstract

• … and what’s not

title . ((author . affiliation)+ + (editor . affiliation)+) . abstract

(duplicate element names: affiliation occurs twice)
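
The single-occurrence condition is purely syntactic, so it can be tested by counting element names. A minimal sketch, assuming element names are simple identifiers and the expression is written in the notation used on the slides (the function names are hypothetical):

```python
import re
from collections import Counter


def repeated_names(expression):
    """Return the element names that occur more than once in the expression."""
    names = re.findall(r"[A-Za-z_][\w\-]*", expression)
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


def is_single_occurrence(expression):
    # A SORE mentions every element name at most once.
    return not repeated_names(expression)


sore = "title . (author . affiliation?)+ . abstract"
not_sore = "title . ((author . affiliation)+ + (editor . affiliation)+) . abstract"

print(is_single_occurrence(sore))
print(repeated_names(not_sore))
```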

Sample → SOA

W = {bacacdacde, cbacdbacde, abccaadcde}

[Figure: Single Occurrence Automaton (SOA) over the symbols a, b, c, d, e, built from W with 2T-Inf [Garcia & Vidal 1990]]

Sample → SOA

• SOA size
– |Σ| + 2 states
– O(|Σ|²) transitions

• Complexity of algorithm
– O(||W||), where ||W|| = Σ_{w ∈ W} |w|
– streaming

• Algorithm sound
– W ⊆ L(SOA)
– in general: |W| << |L(SOA)|
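
A minimal sketch of how an SOA can be built from a sample, assuming the usual reading of 2T-INF: one state per symbol plus a source and a sink, and one edge per pair of adjacent symbols seen in some string. The names `two_t_inf`, `accepts`, `SRC` and `SNK` are illustrative, not taken from the original work.

```python
SRC, SNK = "<src>", "<snk>"  # hypothetical names for the two extra states


def two_t_inf(sample):
    """Collect the edges of the SOA in one streaming pass over the sample."""
    edges = set()
    for word in sample:
        path = [SRC, *word, SNK]
        edges.update(zip(path, path[1:]))
    return edges


def accepts(edges, word):
    """A word is accepted if every adjacent pair along its path is an edge."""
    path = [SRC, *word, SNK]
    return all(pair in edges for pair in zip(path, path[1:]))


W = ["bacacdacde", "cbacdbacde", "abccaadcde"]
soa = two_t_inf(W)
states = {state for edge in soa for state in edge}

print(len(states), len(soa))
print(all(accepts(soa, w) for w in W))  # soundness: W is contained in L(SOA)
print(accepts(soa, "acde"))             # a string outside W that is also accepted
```

Each string is read once, which matches the O(||W||) streaming behaviour stated above; the accepted string `acde` illustrates why L(SOA) is in general much larger than the sample.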

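SOA → SORE: REWRITE

[Figure: the SOA for W is rewritten step by step; each rule replaces states by a single state labeled with a regular expression]

• optional b: b?
• disjunction a, c: a+c
• concatenation b?, a+c: b? (a+c)
• self-loop b? (a+c): (b? (a+c))+

Result: ((b? (a+c))+ d)+ e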

REWRITE: properties

• Theorem
– REWRITE transforms an SOA into an equivalent SORE for sufficient data, and reports failure otherwise (sound & complete)
– Complexity: O(|Σ|⁴)

• SORE size
– |Σ| symbols
– O(|Σ|) operators
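
The equivalence claim can be spot-checked on the running example: the SOA built from W = {bacacdacde, cbacdbacde, abccaadcde} and the expression ((b? (a+c))+ d)+ e should accept exactly the same strings. The sketch below does not implement REWRITE itself; it only compares the two on all strings up to a small length, using an edge-set SOA as an assumed representation and Python's `re` syntax for the SORE.

```python
import re
from itertools import product

SRC, SNK = "<src>", "<snk>"  # hypothetical names for the source and sink states


def two_t_inf(sample):
    edges = set()
    for word in sample:
        path = [SRC, *word, SNK]
        edges.update(zip(path, path[1:]))
    return edges


def soa_accepts(edges, word):
    path = [SRC, *word, SNK]
    return all(pair in edges for pair in zip(path, path[1:]))


W = ["bacacdacde", "cbacdbacde", "abccaadcde"]
soa = two_t_inf(W)
sore = re.compile(r"((b?(a|c))+d)+e")  # the slide's SORE, with "+" union written as "|"


def disagreements(max_len):
    """List every string up to max_len on which SOA and SORE disagree."""
    bad = []
    for n in range(max_len + 1):
        for letters in product("abcde", repeat=n):
            word = "".join(letters)
            if soa_accepts(soa, word) != (sore.fullmatch(word) is not None):
                bad.append(word)
    return bad


print(disagreements(6))
```

An empty list means no counterexample was found up to that length; it is a sanity check, not a proof.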

REWRITE + repairs = iDTD

W = {bacacdacde, cbacdbacde}

[Figure: SOA for the smaller sample W]

• No rules apply!
• Almost disjunction a, c
• Fix: enable-disjunction, enable-optional

Result: ((b? (a+c))+ d)+ e
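
To see why the rewrite rules get stuck on the smaller sample, it helps to list the edges that the two-string sample fails to witness compared with the three-string sample used earlier. This sketch does not reproduce iDTD's repair rules; it only computes the edge difference, with the SOA represented as a set of edges (an assumption made for illustration).

```python
SRC, SNK = "<src>", "<snk>"  # hypothetical names for the source and sink states


def two_t_inf(sample):
    edges = set()
    for word in sample:
        path = [SRC, *word, SNK]
        edges.update(zip(path, path[1:]))
    return edges


full = two_t_inf(["bacacdacde", "cbacdbacde", "abccaadcde"])
small = two_t_inf(["bacacdacde", "cbacdbacde"])

# Edges present in the SOA of the larger sample but unseen in the smaller one.
missing = sorted(full - small)
for source, target in missing:
    print(source, "->", target)
```

Several of the missing edges involve a and c, which is consistent with the slide's remark that a and c form only an "almost" disjunction in the smaller SOA.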

iDTD: properties

• Theorem
– iDTD transforms an SOA into a SORE such that L(SOA) ⊆ L(SORE)

• iDTD can be parameterized for performance

CHAREs

• Definition: A chain regular expression is a sequence of factors f1,…,fn such that no alphabet symbol occurs more than once and each factor is one of
– (a1 + … + ak)
– (a1 + … + ak)?
– (a1 + … + ak)+
– (a1 + … + ak)*

• CRX (Chain Regular expression eXtraction) derives CHAin Regular Expressions

CHAREs

• What’s a chain

header . protein . organism . reference* . comment* . genetics* . complex* . function* . classification? . keywords? . feature* . summary . sequence

authors . citation . volume? . month? . year . pages? . (title + descr)? . xrefs?

• … and what’s not

title . (author . affiliation?)+ . abstract
(not a factor: (author . affiliation?)+)

title . ((author . affiliation)+ + (editor . affiliation)+) . abstract
(duplicate element names)
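
The definition suggests a very simple data structure: a list of factors, each a set of alternative symbols plus one multiplicity marker. The sketch below is one possible encoding; the `Factor` class and `make_chare` helper are hypothetical names. It rejects any symbol that occurs twice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factor:
    symbols: tuple          # the alternatives a1, ..., ak
    multiplicity: str = ""  # one of "", "?", "+", "*"

    def render(self):
        body = " + ".join(self.symbols)
        if len(self.symbols) > 1:
            body = f"({body})"
        return body + self.multiplicity


def make_chare(factors):
    """Validate the chain conditions and render the expression."""
    seen = set()
    for factor in factors:
        if factor.multiplicity not in ("", "?", "+", "*"):
            raise ValueError(f"unknown multiplicity: {factor.multiplicity!r}")
        for symbol in factor.symbols:
            if symbol in seen:
                raise ValueError(f"symbol occurs more than once: {symbol}")
            seen.add(symbol)
    return " . ".join(factor.render() for factor in factors)


citation = make_chare([
    Factor(("authors",)), Factor(("citation",)), Factor(("volume",), "?"),
    Factor(("month",), "?"), Factor(("year",)), Factor(("pages",), "?"),
    Factor(("title", "descr"), "?"), Factor(("xrefs",), "?"),
])
print(citation)
```

Nested expressions such as (author . affiliation?)+ cannot be expressed in this structure at all, which is exactly the "not a factor" case.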

CRX run: pre-order relation

Sample W

a b c c d e
c c c a d
b f e g
b f h i

Pre-order relation →W

a → b, b → c, c → d, d → e
c → a, a → d
b → f, f → e, e → g
f → h, h → i

[Figure: graph of →W over the symbols a, b, c, d, e, f, g, h, i]

If a →W b and b →W c then a →W c

CRX run: transitive closure

[Figure: graph of →W after transitive closure; a, b, c form one equivalence class]

• Equivalence class: a,b,c
• If a →W b and b →W a then a ≈W b
• Each symbol occurs in exactly one equivalence class
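
The first steps of the CRX run can be sketched as follows: collect the adjacent pairs of the sample, close them transitively, and group symbols that reach each other into equivalence classes. This is an illustrative sketch under the assumption that each symbol is a single character; the function names are hypothetical, and the closure uses a plain Warshall-style triple loop.

```python
from itertools import product


def preorder_pairs(sample):
    """Pairs (a, b) such that b directly follows a in some string."""
    pairs = set()
    for word in sample:
        pairs.update(zip(word, word[1:]))
    return pairs


def transitive_closure(pairs):
    closure = set(pairs)
    symbols = {s for pair in pairs for s in pair}
    # product() varies its first component slowest, so k is the outer loop.
    for k, i, j in product(sorted(symbols), repeat=3):
        if (i, k) in closure and (k, j) in closure:
            closure.add((i, j))
    return closure


def equivalence_classes(sample):
    symbols = {s for word in sample for s in word}
    closure = transitive_closure(preorder_pairs(sample))
    classes = set()
    for a in symbols:
        members = {b for b in symbols
                   if a == b or ((a, b) in closure and (b, a) in closure)}
        classes.add(frozenset(members))
    return classes


W = ["abccde", "cccad", "bfeg", "bfhi"]
classes = equivalence_classes(W)
print(sorted("".join(sorted(c)) for c in classes))
```

On the slide's sample this groups a, b and c together and leaves every other symbol in a class of its own.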

CRX run: folding

[Figure: partial order →W over the nodes {a,b,c}, d, e, f, g, h, i, with predecessor set and successor set marked]

pred(γ) = {γ’ | γ’ →W γ}

succ(γ) = {γ’ | γ →W γ’}

CRX run: folding

[Figure: after folding, d and f are merged; nodes {a,b,c}, {d,f}, e, g, h, i]

→W: partial order

CRX run: multiplicity & RE

[Figure: nodes annotated with multiplicities: {a,b,c} +, {d,f} (none), e ?, g ?, h ?, i ?]

Topological sort yields the Chain Regular Expression:

(a + b + c)+ . (d + f) . e? . h? . i? . g?
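
The multiplicity of a factor can be read off from how often its symbols occur in each string of the sample. The sketch below assumes the natural rule (exactly once everywhere: no marker; at most once: ?; at least once: +; otherwise: *), which reproduces the annotations on the slide. The node list and its order are taken as given here (one valid topological order, chosen for illustration), and the function names are hypothetical.

```python
import re


def multiplicity(symbols, sample):
    """Choose "", "?", "+" or "*" from per-string occurrence counts."""
    counts = [sum(word.count(s) for s in symbols) for word in sample]
    lowest, highest = min(counts), max(counts)
    if lowest >= 1:
        return "" if highest == 1 else "+"
    return "?" if highest <= 1 else "*"


def render(symbols, mark):
    body = " + ".join(symbols)
    if len(symbols) > 1:
        body = f"({body})"
    return body + mark


W = ["abccde", "cccad", "bfeg", "bfhi"]
# Nodes after folding, listed in one valid topological order (assumed, not computed).
order = [("a", "b", "c"), ("d", "f"), ("e",), ("h",), ("i",), ("g",)]

chare = " . ".join(render(node, multiplicity(node, W)) for node in order)
print(chare)

# Sanity check: the resulting expression accepts every string of the sample.
pattern = re.compile(r"(a|b|c)+(d|f)e?h?i?g?")
print(all(pattern.fullmatch(w) for w in W))
```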

CRX algorithm: properties

• Optimality: ∀ W linearly ordered, ∀ CHARE r with W ⊆ L(r) and L(r) ⊆ L(rW): rW = r

• Performance: O(||W|| + |Σ|³)

• Training set size: any CHARE r can be learned from {w | w ∈ L(r) ∧ ∀w’ ∈ L(r): |w| ≤ |w’| + 2}


Related work

• XTRACT [Garofalakis et al. 2000]
– Pioneer
– More general than iDTD
– Focuses on regular expressions that don’t occur in real DTDs → no concise schemas

• Trang: roughly equivalent to CRX
– Inconsistent results

Data

• Real world regular expressions
– SOREs
– Non SOREs

• Real world data when available

• Synthetic data otherwise

[Figure: axis labels "real world data" and "real world regexes"]

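Experiments: generalization

[Figure: generalization plots for CRX and iDTD; one series labeled "no repairs"]

Experiments: generalization

[Figure: generalization plots for CRX and iDTD]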


Extensions

• Incremental computation
– new data → update internal representation (SOA or partial order)

• Noise
– Support for element name too small → ignore element
– SOA: support for edges too small → delete edges before repair

• Numerical predicates
– Bookkeeping: minOccurs, maxOccurs

• Generating XSDs
– Infer data types (integer, double, date, …)
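
The incremental and noise extensions fit together naturally if the SOA stores a support count per edge instead of a plain edge set. The sketch below assumes that the support of an edge is the number of strings in which it occurs; the class name `SupportSOA`, the threshold parameter and the tiny sample are illustrative assumptions.

```python
from collections import Counter

SRC, SNK = "<src>", "<snk>"  # hypothetical names for the source and sink states


class SupportSOA:
    """SOA edges with support counts; new strings simply update the counts."""

    def __init__(self):
        self.support = Counter()

    def add(self, word):
        path = [SRC, *word, SNK]
        # Count each edge at most once per string.
        self.support.update(set(zip(path, path[1:])))

    def edges(self, min_support=1):
        # Edges whose support is too small are dropped before any repair step.
        return {edge for edge, n in self.support.items() if n >= min_support}


soa = SupportSOA()
for word in ["abc", "abc", "abc", "acb"]:  # "acb" plays the role of a noisy string
    soa.add(word)

print(sorted(soa.edges(min_support=2)))
```

Because `add` only touches counters, feeding in new documents later requires no recomputation over the old data.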


Conclusions

• iDTD + CRX
– learn a robust class of regexes from positive examples
– complete in their target class for sufficient data
– deal with insufficient data
– perform well on real world data
– run efficiently

• Future work: inferring XML Schemas
