A Normal Form for XML Documents Marcelo Arenas Leonid Libkin Department of Computer Science University of Toronto

Post on 31-Jan-2016

Page 1: A Normal Form for  XML Documents

A Normal Form for XML Documents

Marcelo Arenas Leonid Libkin

Department of Computer Science, University of Toronto

Page 2: A Normal Form for  XML Documents

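Motivating Example

[Figure: XML tree rooted at courses with two course children, @cno = “cs100” and @cno = “cs225”. Each course has a student child with @sno = “123” and name “Fox”, plus a grade (“A+”, “B+”), so the name “Fox” is stored once per course. An info node with @sno = “123” and name “Fox” is also shown.]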

Page 3: A Normal Form for  XML Documents

Our Goal

DTD:

courses → course*, info*
course → @cno, student*
student → @sno, name, grade
name → S
grade → S
info → @sno, name

Integrity constraints:

Two students with the same @sno value must have the same name.

@sno is the key of info.

Page 4: A Normal Form for  XML Documents

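A “Non-relational” Example

[Figure: XML tree rooted at DBLP with conf children. A conf has a title (“ICDT”) and issue children; each issue has article children, and each article has a title, author elements (“Dong”, “Jarke”, ...) and a @year attribute. The articles of one issue all carry @year = “1999” and those of another all carry @year = “2001”, so the year is repeated for every article of an issue.]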

Page 5: A Normal Form for  XML Documents

Problems to Address

Functional dependencies for XML.

Normal form for XML documents (XNF). Generalizes BCNF.

Algorithm for normalizing XML documents.

Implication problem for functional dependencies.

Page 6: A Normal Form for  XML Documents

Framework: DTDs

We do not consider mixed content, IDs, or IDREFs.

Paths(D): all paths in a DTD D, for example:

courses.course
courses.course.@cno
courses.course.student.name
courses.course.student.name.S

We distinguish three kinds of path elements: attributes (@), strings (S), and element types.
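A minimal sketch of how Paths(D) could be enumerated for the university DTD. The dictionary encoding of the DTD and the function name `paths` are assumptions made for illustration, and the sketch only handles non-recursive DTDs (a recursive DTD has infinitely many paths).

```python
# Hypothetical encoding of the university DTD: each element type maps to its
# attributes (@...), its child element types, and "S" for string content.
DTD = {
    "courses": ["course"],
    "course": ["@cno", "student"],
    "student": ["@sno", "name", "grade"],
    "name": ["S"],
    "grade": ["S"],
}

def paths(dtd, root):
    """Return all paths of a non-recursive DTD as dot-separated strings."""
    result = []

    def walk(prefix, element):
        path = f"{prefix}.{element}" if prefix else element
        result.append(path)
        # Attributes and string content always end a path.
        if element.startswith("@") or element == "S":
            return
        for child in dtd.get(element, []):
            walk(path, child)

    walk("", root)
    return result

for p in paths(DTD, "courses"):
    print(p)
```

The occurrence markers (*, ?, +) are dropped in this encoding because they do not affect which paths exist.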

Page 7: A Normal Form for  XML Documents

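Framework: XML Trees

[Figure: XML tree with vertices v0 to v7. The root courses (v0) has course children; the course with @cno = “cs100” (v1) has two student children, one with @sno = “123”, name “Fox”, grade “B+” and one with @sno = “456”, name “Smith”, grade “A-”. The name and grade elements have string (S) content.]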

Page 8: A Normal Form for  XML Documents

Towards FDs for XML

We know how to handle FDs in relational DBs.

We need a relational representation of XML.

We do it by considering tree tuples: a tree tuple in a DTD D is a mapping

t : Paths(D) → Vertices ∪ Strings ∪ {⊥}

consistent with D: it could be extracted from an XML tree conforming to D.

Page 9: A Normal Form for  XML Documents

Tree Tuples

A tree tuple represents an XML tree:

[Figure: the XML tree described by t. The root courses (v0) has a course child (v1), which has @cno = “cs100” and a student child (v2).]

t(courses) = v0

t(courses.course) = v1

t(courses.course.@cno) = “cs100”

t(courses.course.student) = v2

t(p) = ⊥ for the remaining paths

Page 10: A Normal Form for  XML Documents

XML Tree: set of Tree Tuples

[Figure: the XML tree with the course “cs100” and its two students, shown next to two tree tuples extracted from it. Both tuples contain courses (v0), course (v1) and @cno = “cs100”; one contains the student with @sno = “123” (name “Fox”, grade “B+”), the other the student with @sno = “456” (name “Smith”, grade “A-”).]
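One way to picture the decomposition of an XML tree into tree tuples is the sketch below. It assumes a hypothetical nested-tuple encoding of nodes and represents each tree tuple as a dictionary from paths to vertex ids or strings, with absent paths standing for the null value; the vertex ids are illustrative.

```python
from itertools import product

# Hypothetical node encoding: (vertex id, label, attributes, content), where
# content is a string (S) or a list of child nodes.
TREE = ("v0", "courses", {}, [
    ("v1", "course", {"@cno": "cs100"}, [
        ("v2", "student", {"@sno": "123"}, [
            ("v3", "name", {}, "Fox"),
            ("v4", "grade", {}, "B+"),
        ]),
        ("v5", "student", {"@sno": "456"}, [
            ("v6", "name", {}, "Smith"),
            ("v7", "grade", {}, "A-"),
        ]),
    ]),
])

def tree_tuples(node, prefix=""):
    """Yield one dict per tree tuple, mapping paths to vertices or strings."""
    vertex, label, attrs, content = node
    path = f"{prefix}.{label}" if prefix else label
    base = {path: vertex}
    for name, value in attrs.items():
        base[f"{path}.{name}"] = value
    if isinstance(content, str):
        base[f"{path}.S"] = content
        yield base
        return
    # Group children by label: a tuple keeps exactly one child per label
    # that occurs, so it follows a single branch through repeated elements.
    groups = {}
    for child in content:
        groups.setdefault(child[1], []).append(child)
    choices = [[t for c in group for t in tree_tuples(c, path)]
               for group in groups.values()]
    for combo in product(*choices):
        merged = dict(base)
        for part in combo:
            merged.update(part)
        yield merged

for t in tree_tuples(TREE):
    print(t)
```

For this tree the generator produces two tuples, one per student, which share the values for courses, courses.course and courses.course.@cno.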

Page 11: A Normal Form for  XML Documents

Functional Dependencies

Expressions of the form X → Y, defined over a DTD D, where X and Y are finite non-empty subsets of Paths(D).

An XML tree T can be tested for satisfaction of X → Y if:

X ∪ Y ⊆ Paths(T) ⊆ Paths(D)

T ⊨ X → Y if for every pair u, v of tree tuples in T:

u.X = v.X and u.X ≠ ⊥ implies u.Y = v.Y
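The satisfaction condition can be read as a pairwise check over tree tuples. The sketch below is one possible rendering, assuming tree tuples are dictionaries from paths to values in which an absent path plays the role of the null value; the function name `satisfies` and the sample data (including the name “Lee”) are hypothetical.

```python
def satisfies(tuples, lhs, rhs):
    """Check X -> Y (lhs -> rhs) over tree tuples given as dicts."""
    for u in tuples:
        for v in tuples:
            # The premise needs u and v to agree on X with no null in X.
            if any(p not in u or p not in v or u[p] != v[p] for p in lhs):
                continue
            # The conclusion compares Y values; two nulls count as equal.
            if any(u.get(p) != v.get(p) for p in rhs):
                return False
    return True

NAME_FD = (["courses.course.student.@sno"], ["courses.course.student.name.S"])
tuples = [
    {"courses.course.student.@sno": "123", "courses.course.student.name.S": "Fox"},
    {"courses.course.student.@sno": "123", "courses.course.student.name.S": "Smith"},
    {"courses.course.student.name.S": "Lee"},  # @sno is null in this tuple
]
print(satisfies(tuples, *NAME_FD))      # the first two tuples clash on name
print(satisfies(tuples[1:], *NAME_FD))  # a null on the left imposes nothing
```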

Page 12: A Normal Form for  XML Documents

FD: Examples

University DTD:

courses → course*
course → @cno, student*
student → @sno, name, grade

Two students with the same @sno value must have the same name:

courses.course.student.@sno → courses.course.student.name.S

Every student can have at most one grade in every course:

{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S

Page 13: A Normal Form for  XML Documents

Checking FD Satisfaction

[Figure: XML tree with two course nodes, v1 (@cno = “cs100”) and v5 (@cno = “cs225”). Each has a student with @sno = “123” and name “Fox”; the grade is “B+” under v1 and “A+” under v5. The tree is shown next to its two tree tuples, which agree on @sno but differ on courses.course, so the FD below is satisfied.]

{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S

Page 14: A Normal Form for  XML Documents

Checking FD Satisfaction

[Figure: XML tree with a single course node v1 (@cno = “cs100”) that has two student children, v2 and v5, both with @sno = “123” and name “Fox” but with grades “B+” and “A+”. The tree is shown next to its two tree tuples, which agree on courses.course and @sno but differ on the grade, so the FD below is violated.]

{ courses.course, courses.course.student.@sno } → courses.course.student.grade.S
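The two cases can be replayed with a small grouping-based check. This is a sketch under assumptions: tree tuples are dictionaries restricted to the three paths the FD mentions, course vertices are named v1 and v5 as in the figures, and the helper `holds` is a hypothetical name.

```python
def holds(tuples, lhs, rhs):
    """Group tuples by their lhs values and compare rhs values per group."""
    first_seen = {}
    for t in tuples:
        if any(p not in t for p in lhs):
            continue  # a null on the left-hand side imposes nothing
        key = tuple(t[p] for p in lhs)
        value = tuple(t.get(p) for p in rhs)
        # setdefault stores the first rhs value seen for this key.
        if first_seen.setdefault(key, value) != value:
            return False
    return True

COURSE = "courses.course"
SNO = "courses.course.student.@sno"
GRADE = "courses.course.student.grade.S"

# Same student in two different courses (course vertices v1 and v5).
two_courses = [
    {COURSE: "v1", SNO: "123", GRADE: "B+"},
    {COURSE: "v5", SNO: "123", GRADE: "A+"},
]
# Same student listed twice, with different grades, in one course (v1).
one_course = [
    {COURSE: "v1", SNO: "123", GRADE: "B+"},
    {COURSE: "v1", SNO: "123", GRADE: "A+"},
]

print(holds(two_courses, [COURSE, SNO], [GRADE]))
print(holds(one_course, [COURSE, SNO], [GRADE]))
```

Note that courses.course is compared as a vertex, not by its @cno value, which is why two grades for the same student are acceptable only when they sit under different course nodes.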

Page 15: A Normal Form for  XML Documents

Implication Problem for FD

Given a DTD D and a set of functional dependencies Σ ∪ {φ}:

(D, Σ) ⊢ φ if for any XML tree T conforming to D and satisfying Σ, it is the case that T ⊨ φ.

(D, Σ)+ = { φ | (D, Σ) ⊢ φ }

A functional dependency φ is trivial if it is implied by the DTD alone: (D, ∅) ⊢ φ

Page 16: A Normal Form for  XML Documents

XNF: XML Normal Form

XML specification: a DTD D and a set of functional dependencies Σ.

A relational DB is in BCNF if for every non-trivial functional dependency X → Y in the specification, X is a key.

(D, Σ) is in XNF if:

For each non-trivial FD X → p.@l or X → p.S in (D, Σ)+, X → p is in (D, Σ)+.

Page 17: A Normal Form for  XML Documents

Back to DBLP

DBLP is not in XNF:

DBLP.conf.issue → DBLP.conf.issue.article.@year ∈ (D, Σ)+

DBLP.conf.issue → DBLP.conf.issue.article ∉ (D, Σ)+

Proposed solution is in XNF.
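The XNF condition can be checked mechanically once the relevant FDs are known. The sketch below assumes that a finite set of non-trivial FDs from (D, Σ)+ is already given as (left-hand side, right-hand path) pairs; computing that set is the implication problem and is not attempted here. The function name `xnf_violations` is hypothetical.

```python
def xnf_violations(closure):
    """Return the FDs X -> p.@l or X -> p.S for which X -> p is missing.

    closure: set of (frozenset of paths, path) pairs, assumed to hold the
    relevant non-trivial FDs of (D, Sigma)+.
    """
    bad = []
    for lhs, rhs in closure:
        parent, _, last = rhs.rpartition(".")
        # Only right-hand sides ending in an attribute or S are constrained.
        if not (last.startswith("@") or last == "S"):
            continue
        if (lhs, parent) not in closure:
            bad.append((lhs, rhs))
    return bad

ISSUE = "DBLP.conf.issue"
dblp = {(frozenset({ISSUE}), ISSUE + ".article.@year")}
print(xnf_violations(dblp))
```

For the DBLP specification the single listed FD is reported, because an issue determines the year of its articles but not the article node itself.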

Page 18: A Normal Form for  XML Documents

Normalization Algorithm

The algorithm applies two transformations until the schema is in XNF.

If there is an anomalous FD of the form:

DBLP.conf.issue → DBLP.conf.issue.article.@year

then apply the “DBLP example rule”.

Otherwise: choose a minimal anomalous FD and apply the “University example rule”.

Page 19: A Normal Form for  XML Documents

Normalizing XML Documents

Theorem: The decomposition algorithm terminates and outputs a specification in XNF.

It does not lose information:

[Figure: the unnormalized XML document and the normalized XML document, connected by the queries Q1 and Q2.]

Q1, Q2 are XQuery core queries.

It works even if we cannot compute (D, Σ)+. If we know (D, Σ)+, the output is better.

Page 20: A Normal Form for  XML Documents

Implication Problem (cont’d)

Typically, regular expressions used in DTDs are rather simple.

Trivial regular expression: s1, s2, …, sn

Each si is one of ai, ai?, ai+ or ai*

For i ≠ j, ai ≠ aj

D is a simple DTD if all its productions use “permutations” of trivial regular expressions.

Example: (a | b)* is a permutation of a*, b*
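A small sketch of a syntactic test for trivial regular expressions, assuming productions are written as comma-separated strings such as "@cno, student*". The function name `is_trivial` is hypothetical, and the sketch does not recognize permutations: (a | b)* is rejected even though it is a permutation of the trivial expression a*, b*.

```python
import re

def is_trivial(expression):
    """Check the form s1, ..., sn with each si one of a, a?, a+, a*,
    and with no element name occurring twice."""
    names = []
    for item in expression.split(","):
        # One name (optionally an attribute) plus at most one of ? + *.
        match = re.fullmatch(r"\s*(@?\w+)[?+*]?\s*", item)
        if match is None:
            return False
        names.append(match.group(1))
    return len(names) == len(set(names))

print(is_trivial("@cno, student*"))
print(is_trivial("a, a*"))
print(is_trivial("(a | b)*"))
```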

Page 21: A Normal Form for  XML Documents

Implication Problem (cont’d)

Theorem: For simple DTDs:

The implication problem for FDs is solvable in O(n²).

Testing if a specification is in XNF can be done in O(n³).

Other results:

There is a larger class of DTDs for which these problems are tractable.

There is an even larger class of DTDs for which they are coNP-complete.

Page 22: A Normal Form for  XML Documents

Final Comments

The normalization algorithm can be improved in various ways.

Why XNF? Theorem: XNF generalizes BCNF and a normal form for nested relations (called NNF) when those are coded as XML documents.

A complete classification of the complexity of the implication problem for FDs remains open.

Implementation.

Page 23: A Normal Form for  XML Documents

Backup Slides

Page 24: A Normal Form for  XML Documents

Normalization Algorithm

We consider FDs of the form:

{q, p1.@l1, …, pn.@ln } → p

where n ≥ 0 and q ends with an element type.

The algorithm applies two transformations until the schema is in XNF.

If there is an anomalous FD with n = 0:

DBLP.conf.issue → DBLP.conf.issue.article.@year

then apply the “DBLP example rule”.

Otherwise: choose a minimal anomalous FD and apply the “University example rule”.