
L09: Introduction to XML Data Management

Outline:
• XML and XML Query Languages
• Structural Summary and Coding Scheme
• Managing XML Data in Relational Systems

XML and XML Query Languages


H. Lu/HKUST L09: Introduction to XML Data Management 3

XML

Extensible Markup Language for data
A W3C standard to complement HTML
• http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, Second Edition, October 2000)
Standard for publishing and interchange
Origins: structured text (SGML)
• A "cleaner" SGML for the Internet
Motivation: HTML describes presentation; XML describes content
Relationship: XML is a simplified subset of SGML, while HTML 4.0 is an application of SGML

XML – Describing the Content

<project>
  <talk>
    <title> XML Query Processing &amp; Optimization </title>
    <date> March 18, 2004 </date>
    <instructor> Instructor
      <name> Lu Hongjun </name>
      <affiliation> HKUST </affiliation>
      <email> [email protected] </email>
      <name> Jeffrey X. Yu </name>
      <affiliation> CUHK </affiliation>
      <email> [email protected] </email>
    </instructor>
  </talk>
</project>


XML Document/Data

Hierarchical document format for information exchange on the WWW
Self-describing data (tags)
Nested element structures having a root
Element data can have
• Attributes
• Sub-elements

Basic XML Structures

Elements: <title>… </title>, <name>… </name>
• Open and close tags, or an "empty tag"
• Ordered and nestable; an element can be empty
Attributes
PCDATA/CDATA
An XML document has a single root element
A well-formed XML document has matching tags

Basic XML Structures: Attributes

Single-valued, unordered

<project proj_id="P1234" budget="1000000">
  <title> XML Data Management </title>
  …
  <year> 2003-2004 </year>
</project>

Special types: ID, IDREF, IDREFS

<member id="m007"> <name> James </name> </member>
<project id="p123">
  <title> XML Data Management </title>
  <member idref="m007 m008"/>
</project>

Other XML Structures

Processing instructions: instructions for applications
• <?xml version="1.0"?>
CDATA sections: treat content as character data
• <![CDATA[<tag>Whatever!!!</tag><whatever>]]>
Comments: just like HTML
Entities: external resources and macros
• &my-entity; (non-parameter entity)
• %param-entity; (parameter entity for DTD declarations)

Data Centric vs. Document Centric

Data-centric example:

<project>
  <pname> XML </pname>
  <member ID="&3" age="50">
    <name> H. Lu </name>
    <email> [email protected] </email>
    <publication author="H. Lu">
      <title> Managing XML data using RDBMS </title>
      <year> 2001 </year>
    </publication>
    …
  </member>
  <member ID="&24" age="35">
    <name> J.X. Yu </name>
    <project>
      <pname> Data mining </pname>
    </project>
  </member>
</project>

Document-centric example:

<bio> Dr Lu is a professor at HKUST. He worked at NUS before 1998. </bio>


XML Data Model

Several competing models exist
Document Object Model (DOM)
• A platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents

http://www.w3.org/DOM/


DOM Core Interface : Node

DOM tree: a tree-like structure of Node objects; the root of the tree is a document object.

Node object properties: nodeName, nodeValue, nodeType, parentNode, childNodes, firstChild, lastChild, previousSibling, nextSibling, attributes, ownerDocument

nodeType values: ELEMENT_NODE, ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, DOCUMENT_FRAGMENT_NODE, NOTATION_NODE


DOM Interface

Each node of the document tree may have a number of child nodes, contained in a NodeList object.

Two ways of accessing a node object:
• Based on the location of an object in the document tree
• Based on the name of an object
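The two access styles can be sketched with Python's standard `xml.dom.minidom` module. This is a minimal illustration, not part of the original slides; the small document below is a hypothetical cut-down version of the project example.

```python
from xml.dom import minidom

# Hypothetical sample document modelled on the project example in these slides.
XML = """<project>
  <pname>XML</pname>
  <member age="50"><name>H. Lu</name></member>
  <member age="35"><name>J.X. Yu</name></member>
</project>"""

doc = minidom.parseString(XML)   # DOCUMENT_NODE at the root of the DOM tree
root = doc.documentElement       # the <project> ELEMENT_NODE

# 1. Access by location: walk childNodes / firstChild from the root.
#    Whitespace between tags shows up as TEXT_NODE children, so filter them out.
element_children = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]
first_member = element_children[1]                        # second element child
by_location = first_member.firstChild.firstChild.nodeValue

# 2. Access by name: collect every <name> element, wherever it sits in the tree.
by_name = [n.firstChild.nodeValue for n in doc.getElementsByTagName("name")]
```

Note that the text "H. Lu" lives in a TEXT_NODE child of the `name` element; the element node itself has a `nodeValue` of `None`.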


A Sample DOM Tree

[Figure: DOM tree of the sample project document. Nodes carry object identifiers (&1, &2, &3, &24, &26, …) and edges carry the labels project, pname, member, name, email, publication, title, year and author. Three nodes are annotated as follows.]

publication Node: nodeType = ELEMENT_NODE, tagName = "publication", nodeValue = null
project Node: nodeType = ELEMENT_NODE, tagName = "project", nodeValue = null
name Node: nodeType = TEXT_NODE, tagName = "name", nodeValue = "H. Lu"


Data Graph

Similar to DOM tree, but may have different notations that represent an XML document

[Figure: data graph of the sample project document, using the same object identifiers and edge labels as the DOM tree (project, pname, member, name, email, publication, title, year, author), plus age attribute nodes.]


Document Type Definition

Inherited from the SGML DTD standard
BNF grammar establishing constraints on element structure and content
Specification of attributes and their types
Definitions of entities

A Sample DTD

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE project [
  <!ELEMENT project (pname, member*, publication*)>
  <!ELEMENT pname (#PCDATA)>
  <!ELEMENT member (name, email?, publication*, project*)>
  <!ATTLIST member ID ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
  <!ELEMENT publication (title, year)>
  <!ATTLIST publication author IDREF #IMPLIED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT year (#PCDATA)>
]>

[Figure: DTD graph for this DTD, with nodes project, pname, member, ID, name, email, publication, title, year and author; edges are annotated with the occurrence indicators * and ?.]


XML Query Languages

There have been a large number of proposals during the past few years:
• XPath [Clark, DeRose, W3C 1999]
• XQuery [Boag, Chamberlin et al, W3C 2003]
• XML-QL [Deutsch, Fernandez et al, QL99]
• XQL [Robie, Lapp, QL99]
• XML-GL [Ceri, Comai et al, WWW99]
• Quilt [Chamberlin, Robie et al, 2000]

From W3C:
• XQuery 1.0 (W3C Working Draft, 12 November 2003): http://www.w3.org/TR/xquery/
• XPath 2.0 (W3C Working Draft, 12 November 2003): http://www.w3.org/TR/xpath20/


XPath: XML Path Language

The purpose
• To address the nodes of an XML tree, using a path notation for navigating through the hierarchical structure of an XML document
• Uses a compact, non-XML syntax
• Designed to be embedded in a host language (e.g., XSLT, XQuery)
XPath expressions
• An expression is a string of characters
• The value of an expression is always an ordered collection of zero or more items (atomic values or nodes)


XPath: Steps

An XPath expression has the following syntax:
Path ::= /Step1/Step2/…/Stepn
where each XPath step is defined as follows:
Step ::= Axis::Node-test Predicate*
The axis specifies the "direction" in which the document should be navigated. For example, child::title[position() = 2]

There are 13 axes: child, descendant, descendant-or-self, parent, ancestor, ancestor-or-self, following, preceding, following-sibling, preceding-sibling, attribute, self, namespace

XPath Path Expressions

project                    matches a project element
*                          matches any element
/                          matches the document root
/project                   matches a project element under the root
project/member             matches a member in project
project//name              matches a name in project, at any depth
//title                    matches a title at any depth
member|publication         matches a member or a publication
@age                       matches an age attribute
project/member/@age        matches the age attribute of a member in project
project/member[@age < 45]  matches a member with age < 45

XPath Query Examples

/project/member/name: matches a name of a member in project
Result: <name> H. Lu </name>
        <name> J.X. Yu </name>

/project/publication/venue
Result: empty (there is no venue element)

//pname: matches a pname at any depth
Result: <pname> XML </pname>
        <pname> Data mining </pname>

/project/member/name/text(): text of name elements
Result: H. Lu
        J.X. Yu
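These queries can be tried out with Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The sketch below is an illustration under that limitation: paths are written relative to the root `project` element rather than as absolute paths, the numeric comparison on `@age` is done in Python because the subset has no `<` predicates, and the document is a simplified copy of the sample (no ID attributes or email elements).

```python
import xml.etree.ElementTree as ET

# Simplified copy of the sample project document (assumption: IDs and emails omitted).
XML = """<project>
  <pname>XML</pname>
  <member age="50">
    <name>H. Lu</name>
    <publication><title>Managing XML data using RDBMS</title><year>2001</year></publication>
  </member>
  <member age="35">
    <name>J.X. Yu</name>
    <project><pname>Data mining</pname></project>
  </member>
</project>"""

root = ET.fromstring(XML)   # root is the <project> element

# /project/member/name/text()
names = [e.text for e in root.findall("member/name")]
# /project/publication/venue (no such element)
venues = root.findall("publication/venue")
# //pname (descendants of the root at any depth)
pnames = [e.text for e in root.findall(".//pname")]
# /project/member[publication]
with_pub = [m.findtext("name") for m in root.findall("member[publication]")]
# /project/member[@age < 45], with the comparison done in Python
young = [m.findtext("name") for m in root.findall("member") if int(m.get("age")) < 45]
```

A full XPath engine would evaluate the absolute paths and the `@age < 45` predicate directly; the rewriting here is only a workaround for the standard library's subset.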

More XPath Queries

/project/member[publication]
Result: <member ID="&3" age="50">
          <name> H. Lu </name>
          <email> [email protected] </email>
          <publication author="H. Lu">
            <title> Managing XML data using RDBMS </title>
            <year> 2001 </year>
          </publication>
        </member>

/project/member[@age < 45]
Result: <member ID="&24" age="35"> <name> J.X. Yu </name> <project> …

/project[member/@age < 25]
Result: no element returned

/project/member[email/text()] [email protected]


XQuery

XQuery 1.0: An XML Query Language W3C Working Draft 12 November 2003 http://www.w3.org/TR/xquery/

XPath expressions are still the basic building block


XQuery: FLWR Expressions

FLWR expressions: FOR-LET-WHERE-RETURN
• FOR/LET clauses produce an ordered list of tuples of bound variables
• The WHERE clause produces a pruned list of tuples of bound variables
• The RETURN clause produces an instance of the XML Query data model

FOR $x in expr binds $x to each value in the list expr
LET $x := expr binds $x to the entire list expr
• Useful for common subexpressions and for aggregations

XQuery Examples

<result>{
  for $x in /project/member/publication
  where $x/year > 2000
  return <recentpub>{ $x/title }</recentpub>
}</result>

<active_members>{
  for $m in distinct(document("project.xml")//member)
  let $p := document("project.xml")//publication[author = $m]
  where count($p) > 10
  return $m
}</active_members>

distinct = a function that eliminates duplicates

count = a (aggregate) function that returns the number of elements
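To make the FOR / WHERE / RETURN pipeline concrete, the first query can be imitated in plain Python with `xml.etree.ElementTree`. This is only a rough analogy, not an XQuery implementation; the second publication ("An older paper", 1995) is a made-up entry added so that the WHERE clause has something to filter out.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the sample member plus one invented older publication.
XML = """<project>
  <member><name>H. Lu</name>
    <publication><title>Managing XML data using RDBMS</title><year>2001</year></publication>
    <publication><title>An older paper</title><year>1995</year></publication>
  </member>
</project>"""

root = ET.fromstring(XML)
result = ET.Element("result")
for x in root.findall("member/publication"):         # FOR $x in /project/member/publication
    if int(x.findtext("year")) > 2000:                # WHERE $x/year > 2000
        recent = ET.SubElement(result, "recentpub")   # RETURN <recentpub> ... </recentpub>
        recent.append(x.find("title"))                #   $x/title
serialized = ET.tostring(result, encoding="unicode")
```

The loop plays the role of the tuple stream: each binding of `x` is one tuple, the `if` prunes tuples, and the body constructs one result fragment per surviving tuple.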

Structural Summary and Coding Scheme



Structural Summary

A structural summary for a data graph $G_D(V_D, E_D)$ is another labeled graph $G_I(V_I, E_I)$.

Each node $v_i \in V_I$ represents a set of data nodes, $extent(v_i)$, with $extent(v_i) \subseteq V_D$.

An edge $(v_i, v_i') \in E_I$ exists if there is an edge $(v_d, v_d') \in E_D$ with $v_d \in extent(v_i)$ and $v_d' \in extent(v_i')$.

The summary preserves all the paths in the data graph. A path expression query can be evaluated on $G_I$ instead of $G_D$, which is usually more efficient because $G_I$ is much smaller than $G_D$.
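The definition can be sketched in a few lines of Python. The data graph and the `build_summary` helper below are hypothetical, and grouping nodes by label is just one possible grouping criterion chosen for illustration.

```python
from collections import defaultdict

# Hypothetical data graph: node id -> label, plus parent -> child edges.
labels = {"r": "R", "a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
edges = [("r", "a1"), ("r", "a2"), ("a1", "b1"), ("a2", "b2"), ("b1", "c1")]

def build_summary(labels, edges, group_of):
    """Return (extent, summary_edges) for a grouping function node -> index node."""
    extent = defaultdict(set)
    for node in labels:
        extent[group_of(node)].add(node)
    # A summary edge exists whenever some data edge connects the two extents.
    summary_edges = {(group_of(u), group_of(v)) for u, v in edges}
    return dict(extent), summary_edges

# Group data nodes by label (an assumption made for this example).
extent, summary_edges = build_summary(labels, edges, group_of=lambda n: labels[n])
```

Six data nodes and five edges collapse into four index nodes and three edges; the path R/A/B/C can be matched on the summary and answered with `extent["C"]`.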


Structural Summary

Basically, nodes in the data graph are grouped based on certain criteria, and each group of nodes is represented by one node in the summary.

The size of the summary is determined by the grouping criteria.

Desired properties when evaluating path expression queries using a summary:
• The results are safe: they contain no false negatives. If not safe, only approximate answers can be obtained.
• The results are precise: they contain no false positives. If not precise, the results need to be validated using the data graph.


Structural Summary

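[Figure: a data graph with root r, nodes a1, a2, a3, nodes b1, b2, b3 and nodes c1, c2, c3, next to its structural summary with nodes R, A, B and C, whose extents are {r}, {a1, a2, a3}, {b1, b2, b3} and {c1, c2, c3}.]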


Sample Structural Summaries

Query workload independent summaries
• DataGuide
• 1-index [Milo, Suciu, ICDT99]
• A(k)-index [Kaushik, Shenoy, ICDE02]

Query workload dependent summaries
• APEX [Chung, Min et al, SIGMOD02]
• D(k)-index [Chen, Lim et al, SIGMOD03]


Data Guides

DataGuide: dynamic structural summary of the current database
• Each label path in the database appears once in the DataGuide
• No extraneous paths in the DataGuide
• Maintained incrementally as the database evolves
• Serves the role of a schema

[Figure: a data graph and its DataGuide; C1 is duplicated to achieve determinism in the DataGuide.]


Bisimilarity and 1-Index

Most existing structural summaries are based on graph bisimilarity, defined as follows. Two data nodes u and v are bisimilar ($u \approx v$) if
• u and v have the same label;
• if u' is a parent of u, then there is a parent v' of v such that $u' \approx v'$, and vice versa.

Intuitively, if two nodes are bisimilar, the sets of paths coming into them are the same.

Tova Milo and Dan Suciu. Index structures for path expressions. In ICDT'99, 277-295, January 1999.


1-Index

1-index: each index node represents an equivalence class in which the data nodes are mutually bisimilar.

Evaluating a path expression query using the 1-index is
• safe: the result always contains the result of evaluating the query on the data graph;
• precise: the result contains no false data node.


K-bisimilarity

The 1-index can be big. A smaller summary can be based on the notion of k-bisimilarity ($\approx^k$), which is defined inductively:
• For any two nodes u and v, $u \approx^0 v$ iff u and v have the same label;
• $u \approx^k v$ iff $u \approx^{k-1} v$ and, for every parent u' of u, there is a parent v' of v such that $u' \approx^{k-1} v'$, and vice versa.

Intuitively, if two data nodes are k-bisimilar, the sets of paths of length $\le k$ coming into them are the same.
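The inductive definition translates directly into k rounds of partition refinement. The following Python sketch is an illustration rather than the algorithm of any cited paper; the small data graph and the name `k_bisimulation` are assumptions made for the example.

```python
def k_bisimulation(labels, edges, k):
    """Partition nodes into k-bisimilarity classes (defined over parents)."""
    parents = {n: set() for n in labels}
    for u, v in edges:
        parents[v].add(u)
    block = dict(labels)   # k = 0: two nodes are equivalent iff they share a label
    for _ in range(k):
        # Refine: keep the old class and add the set of the parents' old classes.
        block = {n: (block[n], frozenset(block[p] for p in parents[n])) for n in labels}
    classes = {}
    for n, b in block.items():
        classes.setdefault(b, set()).add(n)
    return sorted(sorted(c) for c in classes.values())

# Hypothetical data graph: c1 is reached by R/A/C, c2 by R/B/A/C.
labels = {"r": "R", "a1": "A", "a2": "A", "b1": "B", "c1": "C", "c2": "C"}
edges = [("r", "a1"), ("r", "b1"), ("b1", "a2"), ("a1", "c1"), ("a2", "c2")]

p0 = k_bisimulation(labels, edges, 0)
p1 = k_bisimulation(labels, edges, 1)
p2 = k_bisimulation(labels, edges, 2)
```

With k = 1 the nodes c1 and c2 still share a class because both have an A parent; with k = 2 they separate because their length-2 incoming paths differ (R/A versus B/A). Iterating until the partition stops changing yields full bisimilarity, the grouping used by the 1-index.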


A(k)-Index

A(k)-Index: groups nodes based on their local structure (incoming paths of length up to k) instead of the global path information
• Data nodes in each index node of the A(k)-index are mutually k-bisimilar

Evaluating a path expression query using the A(k)-index is
• safe: the result always contains the result of evaluating the query on the data graph;
• precise for path expressions of length up to k: the result contains no false data node. Longer path expressions may return false positives, which must be validated against the data graph.

Raghav Kaushik, Pradeep Shenoy, Philip Bohannon and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. ICDE'02, 129-140.


A(2)-Index

C2 and C3 can be grouped because their length-2 incoming paths are the same


APEX: Adaptive Path Index

The 1-index, A(k)-index and F&B index are all workload independent.

APEX: Adaptive Path Index
• Maintains two types of paths in the summary:
  • All paths of length two, so that all queries can be answered using APEX
  • Full paths for those paths that appear frequently in the query workload, so that frequently asked queries can be answered efficiently
• A hash table is included in the index so that partial matching queries with the descendant-or-self axis (//) can be processed efficiently

C-W Chung, J-K Min, K. Shim. APEX: An Adaptive Path Index for XML Data. SIGMOD 02.


D(k)-Index

A generalization of the 1-Index and the A(k)-Index.

Assigns different local bisimilarities to index nodes in the summary structure according to the query load, to optimize its structure.
• For any two index nodes $n_i$ and $n_j$, $k(n_i) \ge k(n_j) - 1$ if there is an edge from $n_i$ to $n_j$, where $k(n_i)$ and $k(n_j)$ are the local bisimilarities of $n_i$ and $n_j$, respectively.

Advantages over the 1-Index and the A(k)-Index:
• workload-sensitive;
• can be updated more efficiently.

Qun Chen, Andrew Lim and Kian Win Ong. D(k)-index: An adaptive structural summary for graph-structured data. SIGMOD 03, 134-144.


Node (Edge) Encoding

Structural relationships
• Is node u an ancestor of node v?
• Is node u the parent of node v?

Assign a unique code to each node (edge) in the data graph so that the above questions can be answered by looking at the codes rather than the original data graph.

Issues:
• Length of the code
• Complexity of computing the structural relationship between two nodes from their codes
• Efficient code generation and code maintenance


XML Data Coding Scheme

Region-based
• An XML document is ordered
• Codes are assigned based on the lexicographical location of an element in the original document
Path-based
• An XML document is nested
• Codes are assigned based on the nesting structure of the document, or the path that reaches an element from the root

There are quite a number of variants for both categories of coding schemes.

XML Region Based Coding

Region code: (start, end, level)
• u is an ancestor of v iff u.start < v.start < u.end
• u is the parent of v iff, additionally, u.level = v.level - 1
Only one depth-first traversal is needed for code generation.
Property: strict nesting
• Two regions are either completely disjoint (cases 1 and 4) or one contains the other (cases 2 and 3)
• Formally, a.start < b.start < a.end if a is an ancestor of b

[Figure: four cases of the relative position of two regions a and b under the root; cases 1 and 4 show disjoint regions, cases 2 and 3 show one region containing the other.]


Sample of Region Codes

[Figure: tree of a sample dblp document with two paper elements, annotated with region codes and levels 0 to 4. Each text value (for example "XML", "2001", "VLDB", "2003", "SIGMOD") occupies a single position inside its parent's region.]

Element      Region    Level
dblp         (1,37)    0
paper        (2,20)    1
title        (3,5)     2
allauthors   (6,13)    2
author       (7,9)     3
author       (10,12)   3
year         (14,16)   2
conf         (17,19)   2
paper        (21,36)   1
title        (22,24)   2
allauthors   (25,29)   2
author       (26,28)   3
year         (30,32)   2
conf         (33,35)   2

The order of start values is also the document order.
The region can also be interpreted as an interval.
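A single depth-first traversal is enough to produce such codes. The Python sketch below numbers positions the way the sample figure appears to (a start tag, a text value and an end tag each consume one position); that numbering, the level of the root being 0, and the helper names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def region_codes(root):
    """Assign (start, end, level) to every element in one depth-first traversal."""
    codes, counter = {}, 0
    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        if elem.text and elem.text.strip():
            counter += 1            # a text value occupies one position
        for child in elem:
            visit(child, level + 1)
        counter += 1
        codes[elem] = (start, counter, level)
    visit(root, 0)
    return codes

def is_ancestor(u, v):
    return u[0] < v[0] < u[1]       # u.start < v.start < u.end

def is_parent(u, v):
    return is_ancestor(u, v) and u[2] == v[2] - 1

# Hypothetical fragment of the dblp example.
doc = ET.fromstring("<dblp><paper><title>XML</title><year>2001</year></paper></dblp>")
by_tag = {elem.tag: code for elem, code in region_codes(doc).items()}
```

Both structural tests are constant-time comparisons on the codes, which is what makes region codes attractive for join-based query processing; the cost is that inserting a node may force renumbering.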

Dewey

<contact>
  <name>blah</name>
  <phone>
    <office>1234</office>
    <home>5678</home>
    <mobile>0000</mobile>
  </phone>
</contact>

Node       Dewey code
contact    1
name       1.1
blah       1.1.1
phone      1.2
office     1.2.1
1234       1.2.1.1
home       1.2.2
5678       1.2.2.1
mobile     1.2.3
0000       1.2.3.1

Node a is an ancestor of node d iff a.Dewey is a prefix of d.Dewey.

Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. SIGMOD 2002.
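Dewey codes for the contact example can be generated with a short recursive walk. In this Python sketch a text value is treated as the first child of its element, which reproduces the labels shown for this example; the function names are hypothetical and mixed content is not handled.

```python
import xml.etree.ElementTree as ET

def dewey_codes(root):
    """Map Dewey codes (tuples of ints) to element tags or text values."""
    codes = {}
    def visit(elem, code):
        codes[code] = elem.tag
        position = 0
        if elem.text and elem.text.strip():
            position += 1                       # the text value is child number 1
            codes[code + (position,)] = elem.text.strip()
        for child in elem:
            position += 1
            visit(child, code + (position,))
    visit(root, (1,))
    return codes

def is_ancestor(a, d):
    """a is an ancestor of d iff a's code is a proper prefix of d's code."""
    return len(a) < len(d) and d[:len(a)] == a

doc = ET.fromstring(
    "<contact><name>blah</name><phone><office>1234</office>"
    "<home>5678</home><mobile>0000</mobile></phone></contact>"
)
codes = dewey_codes(doc)
dotted = {".".join(map(str, c)): v for c, v in codes.items()}
```

Unlike region codes, a Dewey code encodes the whole path from the root, so the ancestors of a node can be read off its own code without consulting any other node.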

Managing XML Data in Relational Systems



XML-Enabled DB Systems

IBM DB2 XML Extender
• XML column support, XML Collection, files linked from the DBMS, or Character Large Objects (CLOBs)
• Side tables serve as XML indexes
Oracle 9i
• CLOB, OracleText Cartridge, XMLType, and XML SQL Utility
Microsoft SQL Server
• CLOBs, generic Edge technique and user-defined decomposition (from XML to tables), XML views


Storing XML Data in RDBMSs

RDBMS: a mature technology
RDBMSs are widely available
• Less investment to adopt the new technology
• Easy to integrate with other existing applications
Impedance mismatch
• Two-level nature of the relational schema (tuples and attributes) vs. arbitrary nesting of an XML DTD
• Flat structure vs. recursion
• Structure-based and content-based queries


XQuery vs SQL: Different Culture

Data characteristics
• Relational data: regular, homogeneous, flat in structure, with no order among tuples
• XML data: irregular, heterogeneous, unpredictable in structure, order sensitive
Query languages
• SQL: select-from-where, with the capability to support some fix-point operations
• XQuery: FLWOR (pronounced "flower"): for-let-where-order by-return; simple/regular path expressions


Storing XML Data in RDBMSs: Architecture

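[Figure: architecture for storing XML data in a commercial RDBMS. Automatic schema/data mapping turns a DTD into a relational schema and XML documents into tuples; an XML query is translated into an SQL query, and the relational result is converted back into an XML result.]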


Storing XML Data in RDBMSs: Issues

Schema/Data mapping: Automate storage of XML in RDBMS

Query mapping: Provide XML views of relational sources

Result construction: Export existing data as XML


XML-Relational Mapping

Model mapping
• Database schemas represent constructs of the XML document model
• DTD independent
• [Florescu & Kossmann 99, Yoshikawa et. al. TOIT01]
Structure mapping
• Database schemas represent the logical structure of target XML documents
• DTD dependent
• [Shanmugasundaram et. al. VLDB 99]

A Simple XML Document

<project>
  <pname> XML </pname>
  <member ID="&3">
    <name> H. Lu </name>
    <email> [email protected] </email>
    <publication author="H. Lu">
      <title> Managing XML data using RDBMS </title>
      <year> 2001 </year>
    </publication>
    …
  </member>
  <member ID="&24">
    <name> J.X. Yu </name>
    <project>
      …
    </project>
  </member>
</project>


A Sample DOM Tree

[Figure: DTD-independent DOM tree of the simple XML document, with object identifiers (&1, &2, &3, &24, &26, …) on nodes and the labels project, pname, member, name, email, publication, title, year and author on edges.]


Model Mapping: Document Model to Relation

Database schema represents the constructs of XML documents
Fixed database schema for all XML documents
• Data graph: tree (may contain cycles)
• Relational schema represents a tree
Pros and cons
• DTD is not required; documents need not conform to a DTD
• Fixed schema: no schema evolution issue
• Handles large collections of documents with various DTDs
• Semantics get (totally) lost


Model Mapping – Edge/Monet Approach

Edge-oriented approach
• Single table schema [Florescu & Kossmann 99]: Edge (source, ordinal, target, label, flag, value)
• Monet [Schmidt et. al. WebDB00]: multiple tables, horizontal partitions of the edge table on label-path

Source  Ordinal  Target  Label     Flag  Value
&1      1        &2      "Pname"   val   "XML"
&1      1        &3      "Member"  ref   -
&1      2        &24     "Member"  ref   -
&3      1        &26     "name"    val   "H.Lu"

Note: Document ID is omitted here


Querying with Edge

select name.Value
from Edge dbgroup, Edge member, Edge age, Edge name
where dbgroup.Label = 'DBGroup' and member.Label = 'Member'
  and age.Label = 'Age' and name.Label = 'Name'
  and dbgroup.Source = 0
  and dbgroup.Target = member.Source
  and member.Target = age.Source
  and member.Target = name.Source
  and cast(age.Value as int) > 20

/DBGroup/Member[Age>20]/Name
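The Edge query can be run end to end with Python's built-in `sqlite3` module. The rows below are invented sample data for a tiny DBGroup document (two members, "Alice" aged 30 and "Bob" aged 18); node 0 stands for the document root, as the query assumes.

```python
import sqlite3

# Hypothetical Edge rows: (source, ordinal, target, label, flag, value).
rows = [
    (0, 1, 1, "DBGroup", "ref", None),
    (1, 1, 2, "Member", "ref", None),
    (2, 1, 3, "Name", "val", "Alice"),
    (2, 2, 4, "Age", "val", "30"),
    (1, 2, 5, "Member", "ref", None),
    (5, 1, 6, "Name", "val", "Bob"),
    (5, 2, 7, "Age", "val", "18"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Edge (source INT, ordinal INT, target INT, label TEXT, flag TEXT, value TEXT)"
)
conn.executemany("INSERT INTO Edge VALUES (?, ?, ?, ?, ?, ?)", rows)

# /DBGroup/Member[Age>20]/Name as a four-way self-join on the Edge table.
QUERY = """
SELECT name.value
FROM Edge dbgroup, Edge member, Edge age, Edge name
WHERE dbgroup.label = 'DBGroup' AND member.label = 'Member'
  AND age.label = 'Age' AND name.label = 'Name'
  AND dbgroup.source = 0
  AND dbgroup.target = member.source
  AND member.target = age.source
  AND member.target = name.source
  AND CAST(age.value AS INT) > 20
"""
result = [row[0] for row in conn.execute(QUERY)]
```

Every step of the path costs one self-join, which illustrates why long path expressions are expensive under the Edge approach.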


Model Mapping – Node Approach

XRel [Yoshikawa et. al. TOIT 2001]: four-table schema

Element(pathID, start, end, ordinal)
Attribute(pathID, start, end, value)
Text(pathID, start, end, value)
Path(pathID, pathexp)

Path
PathID  PathExp
1       #/Project
2       #/Project#/pname
3       #/Project#/member
4       #/Project#/member#/name
…       …

Element
PathID  Start  End  Ordinal
2       1      5    1
4       6      9    1
4       21     25   2
…       …      …    …

Text
PathID  Start  End  Value
2       3      4    "XML"
4       7      8    "H. Lu"
4       22     24   "J.X. Yu"
…       …      …    …

Querying with XRel

select v2.Value
from Element e1, Path p1, Path p2, Path p3, Text v1, Text v2
where p1.Pathexp = '#/DBGroup#/Member'
  and p2.Pathexp = '#/DBGroup#/Member#/Age'
  and p3.Pathexp = '#/DBGroup#/Member#/Name'
  and e1.PathID = p1.PathID
  and v1.PathID = p2.PathID
  and v2.PathID = p3.PathID
  /* containment testing */
  and e1.Start < v1.Start and e1.End > v1.End
  and e1.Start < v2.Start and e1.End > v2.End
  and cast(v1.Value as int) > 20

/DBGroup/Member[Age>20]/Name

Structural Mapping: Simplifying DTDs

DTD element specifications can be of arbitrary complexity
• <!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)> is valid!

Simplify the DTD for translation purposes. Key observations:
• It is not necessary to regenerate the DTD from the relational schema
• XML queries query the position of an element relative to its siblings, and the parent/child relationships

DTD Simplification: Transformations

Flattening transformations:
  (e1, e2)*  →  e1*, e2*
  (e1, e2)?  →  e1?, e2?
  (e1 | e2)  →  e1?, e2?

Simplification transformations:
  e1**  →  e1*
  e1*?  →  e1*
  e1?*  →  e1*
  e1??  →  e1?

Grouping transformations:
  ..., a*, ..., a*, ...  →  a*, ...
  ..., a*, ..., a?, ...  →  a*, ...
  ..., a?, ..., a*, ...  →  a*, ...
  ..., a?, ..., a?, ...  →  a*, ...
  ..., a, ..., a, ...    →  a*, ...

Example:
  <!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)>  →  <!ELEMENT a (b*, c?, e*, f*)>

[Deutsch, Fernandez, and Suciu, SIGMOD99]
[Shanmugasundaram, Tufte, He, Zhang, DeWitt, and Naughton, VLDB99]

A Sample DTD

<!ELEMENT book (booktitle, author)>
<!ELEMENT booktitle (#PCDATA)>
<!ELEMENT author (name, address)>
<!ATTLIST author id ID #REQUIRED>
<!ELEMENT name (firstname?, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT address ANY>
<!ELEMENT article (title, author*, contactauthor)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT contactauthor EMPTY>
<!ATTLIST contactauthor authorID IDREF #IMPLIED>
<!ELEMENT monograph (title, author, editor)>
<!ELEMENT editor (monograph*)>
<!ATTLIST editor name CDATA #REQUIRED>

[Figure: DTD graph for this DTD, with nodes book, article, monograph, booktitle, title, author, contactauthor, authorid, editor, name, firstname, lastname and address; edges are annotated with the occurrence indicators * and ?.]

[Shanmugasundaram et. al. VLDB 99]

DTD to Relational Schema: Naïve Approach

• Each element ==> relation
• Each attribute of an element ==> column of the relation
• Connect elements using foreign keys

<!ELEMENT author (name, address)>
<!ATTLIST author id ID #REQUIRED>
<!ELEMENT name (firstname?, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT address ANY>

author (authorID: integer, id: string)
name (nameID: integer, authorID: integer)
firstname (firstnameID: integer, nameID: integer, value: string)
lastname (lastnameID: integer, nameID: integer, value: string)
address (addressID: integer, authorID: integer, value: string)


Basic Inlining Technique

Problem of the naïve approach: fragmentation (too many tables)
• Results in 5 relations in the previous example, so retrieving the first and last names of an author requires several joins
Intuition:
• Inline as many sub-elements as possible
• Do not inline only if it is a set sub-element (RDBMSs do not all support set-valued columns)
• Connect relations using foreign keys (can handle recursion)
• A document can be rooted at any element, so create a separate relation for each root


Basic Inlining Technique: Relation Schemas

article (articleID: integer,
         article.contactauthor.authorid: string,
         article.title: string)

article.author (article.authorID: integer,
                article.author.parentID: integer,
                article.author.name.firstname: string,
                article.author.name.lastname: string,
                article.author.address: string,
                article.author.authorid: string)

[Figure: DTD graph rooted at article, with nodes title, contactauthor, authorid, author, name, firstname, lastname, address and authorid; the author edge is marked * and optional edges are marked ?.]


Basic Inlining Technique: Pros & Cons

+ Reduces the number of joins for queries like "get the first and last names of a book author"
+ Efficient for queries such as "list all authors of books"
- Queries like "list all authors with name Ullman" require a union of 5 queries!
- Large number of relations:
• Unrolling recursive strongly connected components (major)
• Separate relational schema for each element as root (minor)


Shared Inlining Technique

Intuition:
• Inline as many sub-elements as possible
• Do not inline only if it is a shared, recursive or set sub-element
• An element node is represented in exactly one relation
Technique: map the following nodes into relations:
• Shared elements: in-degree >= 2 in the DTD graph
• Root elements: in-degree = 0


Issues with Sharing Elements

Parent of elements not fixed at schema level
Need to store the type and ids of parents (or the fact that there are no parents)
• parentCODE field (type of parent)
• parentID field (id of parent)
• Not a foreign key relationship


Shared: Relational Schema

book (bookID: integer, book.booktitle.isroot: boolean, book.booktitle: string)

article (articleID: integer, article.contactauthor.isroot: boolean,
         article.contactauthor.authorid: string)

monograph (monographID: integer, monograph.parentID: integer,
           monograph.parentCODE: integer, monograph.editor.isroot: boolean,
           monograph.editor.name: string)

title (titleID: integer, title.parentID: integer, title.parentCODE: integer, title: string)

author (authorID: integer, author.parentID: integer, author.parentCODE: integer,
        author.name.isroot: boolean, author.name.firstname.isroot: boolean,
        author.name.firstname: string, author.name.lastname.isroot: boolean,
        author.name.lastname: string, author.address.isroot: boolean,
        author.address: string, author.authorid: string)


Shared Inlining Techniques: Pros & Cons

+ Reduces number of joins for queries like “get the first and last names of an author”

+ Efficient for queries such as “list all authors with name Ullman”

- Sharing whenever possible implies extra joins for path expressions
• "Article with a given title name"


Hybrid Inlining Technique

Inlines some elements that are kept as separate relations in Shared:
• Elements with in-degree >= 2 that are not set sub-elements or recursive
Handles set and recursive sub-elements as in Shared


Hybrid: Relational Schema

book (bookID: integer, book.booktitle.isroot: boolean, book.booktitle: string,
      author.name.firstname: string, author.name.lastname: string,
      author.address: string, author.authorid: string)

article (articleID: integer, article.contactauthor.isroot: boolean,
         article.contactauthor.authorid: string, article.title.isroot: boolean,
         article.title: string)

monograph (monographID: integer, monograph.parentID: integer,
           monograph.parentCODE: integer, monograph.title: string,
           monograph.editor.isroot: boolean, monograph.editor.name: string,
           author.name.firstname: string, author.name.lastname: string,
           author.address: string, author.authorid: string)

author (authorID: integer, author.parentID: integer, author.parentCODE: integer,
        author.name.isroot: boolean, author.name.firstname.isroot: boolean,
        author.name.firstname: string, author.name.lastname.isroot: boolean,
        author.name.lastname: string, author.address.isroot: boolean,
        author.address: string, author.authorid: string)


Hybrid Inlining Technique: Pros & Cons

+ Reduces joins through shared elements (that are not set or recursive elements)
+ Shares some strengths of Shared:
• Reduces joins for queries like "get first and last names of a book author"
- Requires more SQL sub-queries to retrieve all authors with name Ullman
• Tradeoff between reducing the number of queries and reducing the number of joins
• Shared and Hybrid target query reduction and join reduction, respectively


More on Shared and Hybrid

Shared and Hybrid each have pros and cons
In many cases, Shared and Hybrid are nearly identical
• Number of joins per SQL query ~ path length
• Mainly due to the large number of set nodes
• A problem, as join processing is expensive!


Regular Expressions

Path expression queries can be represented by regular expressions.

Consider path expressions of the following form:

r ::= (r)* | (r)+ | (r)? | r1/r2 | (r1|r2) | r1//r2 | name

*: 0 or more occurrences

+: 1 or more occurrences

? : 0 or 1 occurrences

r1/r2 : form a path from r1 to r2 (child)

r1//r2 : form a path from r1 to r2 (descendant)

| : disjunction.


SPE to SQL

/member/publication/author/name
Find the names of the authors of all members' publications

Relational schema:
member (ID, name, email, PARENTID);
publication (ID, title, author, year, PARENTID);

select m2.name
from member m1, publication, member m2
where publication.PARENTID = m1.ID
  and publication.author = m2.ID

[Figure: DTD graph fragment with nodes member, ID, name, email, publication and author.]


RPE Expansion

List the title of publications for all projects

Substitute //:

project//publication
→ project/member/(project/member)*/publication | project/(member/project)*/publication

[Figure: DTD graph with nodes project, member, ID, name, email, publication, title, year and author.]


RPE Expansion

List the title of publications for all projects

project/member/(project/member)*/publication/title | project/(member/project)*/publication/title

[Figure: DTD graph with nodes project, member, ID, name, email, publication, title, year and author.]

Expanding *:

select project.publication.title union
select project.member.publication.title union
select project.member.project.publication.title
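Unrolling a starred sub-path up to a fixed depth is mechanical. The helper below is a hypothetical sketch that expands one alternative, project/(member/project)*/publication/title, into simple paths; the name `expand_star` and the depth bound `max_depth` are assumptions, since the slides do not say how the bound is chosen.

```python
def expand_star(prefix, repeated, suffix, max_depth):
    """Unroll prefix/(repeated)*/suffix into simple paths with 0..max_depth repetitions."""
    return ["/".join(prefix + repeated * n + suffix) for n in range(max_depth + 1)]

paths = expand_star(
    prefix=["project"],
    repeated=["member", "project"],
    suffix=["publication", "title"],
    max_depth=2,
)
```

Each resulting simple path becomes one branch of the union; a practical system has to decide when to stop, for example by issuing deeper queries until one returns an empty result.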


Recursive Path Expression Queries to SQL

Some DBMSs support least fixed-point computation, e.g., the WITH statement in DB2.

project/(member/project)*/publication

WITH R(PARENTID, ID) AS (
  select m.PARENTID, p1.ID
  from member m, project p1
  where m.ID = p1.PARENTID
  UNION ALL
  select R.PARENTID, p1.ID
  from R, member m, project p1
  where R.ID = m.PARENTID and m.ID = p1.PARENTID)
select p3.*
from project p2, R, publication p3
where p2.ID = R.PARENTID and R.ID = p3.PARENTID;

[Figure: DTD graph fragment with project, member and publication nodes.]
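The same idea can be tried in SQLite, which supports recursive common table expressions through `WITH RECURSIVE`. The sketch below uses invented tables and rows, and it differs from the DB2 query in one respect that is my own choice: the base case pairs every project with itself, so that zero repetitions of (member/project) are covered and publications directly under a project are also returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (ID INT, PARENTID INT, pname TEXT);
CREATE TABLE member (ID INT, PARENTID INT);
CREATE TABLE publication (ID INT, PARENTID INT, title TEXT);
-- Hypothetical data: project 5 is nested under member 3 of project 1.
INSERT INTO project VALUES (1, NULL, 'XML'), (5, 3, 'Data mining');
INSERT INTO member VALUES (2, 1), (3, 1);
INSERT INTO publication VALUES (10, 1, 'P-top'), (11, 5, 'P-nested'), (12, 2, 'P-member');
""")

# project/(member/project)*/publication
QUERY = """
WITH RECURSIVE R(PARENTID, ID) AS (
  SELECT ID, ID FROM project                      -- zero repetitions (added base case)
  UNION ALL
  SELECT R.PARENTID, p1.ID                        -- one more member/project step
  FROM R, member m, project p1
  WHERE R.ID = m.PARENTID AND m.ID = p1.PARENTID)
SELECT DISTINCT p3.title
FROM project p2, R, publication p3
WHERE p2.ID = R.PARENTID AND R.ID = p3.PARENTID
ORDER BY p3.title
"""
titles = [row[0] for row in conn.execute(QUERY)]
```

The publication 'P-member' hangs directly under a member, so it does not match the path and is not returned.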


Expanding Recursive Path Expression Queries

Expand wild cards before sending the query to the DBMS
• Transitive closure operation is not always supported by RDBMSs
• Transitive closure with arbitrary nesting seems not to be supported
• Expansion can handle nested recursive queries (though DB2 does not support them)
How many SQL statements are required?
• Execute SQL statements until an empty result is returned
• VXMLR approach: keep statistics [Zhou et. al. VLDB 2001]


Query Translation for Structural Mapping

Translating XML-QL into SQL [Shanmugasundaram, et al, VLDB99]
• Simple Path Expressions to SQL
• Simple Recursive Path Expressions to SQL
• Arbitrary Path Expressions to Simple Recursive Path Expressions
Discussion based on the Shared approach

Queries with Simple Path Expressions

WHERE <book>
        <booktitle> The Selfish Gene </booktitle>
        <author>
          <name>
            <firstname> $f </firstname>
            <lastname> $l </lastname>
          </name>
        </author>
      </book> IN * CONFORMING TO pubs.dtd
CONSTRUCT <result> $f $l </result>

Select A."author.name.firstname", A."author.name.lastname"
From author A, book B
Where B.bookID = A.parentID
  AND A.parentCODE = 0
  AND B."book.booktitle" = 'The Selfish Gene'

Queries with Recursive Path Expressions

WHERE <*.monograph>
        <editor.(monograph.editor)*>
          <name> $n </name>
        </>
        <title> Subclass Cirripedia </title>
      </> IN * CONFORMING TO pubs.dtd
CONSTRUCT <result> $n </result>

With Q1 (monographID, name) AS (
  Select X.monographID, X."editor.name"
  From monograph X
  Where X.title = 'Subclass Cirripedia'
  UNION ALL
  Select Z.monographID, Z."editor.name"
  From Q1 Y, monograph Z
  Where Y.monographID = Z.parentID AND Z.parentCODE = 0)
Select A.name From Q1 A

Queries with Arbitrary Path Expressions

Split complex path expression to (possibly many) simple recursive path expressions

Has effect of splitting a single XML-QL query to (possibly many) SQL queries

Can handle nested recursive queries

WHERE <(article|monograph).$*.name> $n </>

CONSTRUCT <name> $n </>


References (1)

[Aboulnaga, Alameldeen et al, VLDB01] Ashraf Aboulnaga, Alaa R. Alameldeen, and Jeffrey F. Naughton. Estimating the selectivity of XML path expressions for Internet scale applications. VLDB 2001.

[Bohannon et al, ICDE 2002] P. Bohannon, J. Freire, P. Roy, and J. Simeon. From XML schema to relations: A cost-based approach to XML storage. In Proceedings of ICDE, 2002.

[Boag, Chamberlin et al, W3C 2003] Scott Boag, Don Chamberlin, Mary F. Fernández, Daniela Florescu, Jonathan Robie, Jérôme Siméon. XQuery 1.0: An XML Query Language. http://www.w3.org/TR/xquery

[Bruno et al, SIGMOD02] N. Bruno, N. Koudas, D. Srivastava. Holistic twig joins: Optimal XML pattern matching. In SIGMOD Int'l Conf. on Management of Data, 310-311, 2002.

[Chen, Jagadish et al, ICDE01] Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting twig matches in a tree. In Proceedings of the IEEE International Conference on Data Engineering, pages 595-604, 2001.

[Cohen, Kaplan et al, PODS02] E. Cohen, H. Kaplan, T. Milo. Labeling dynamic XML trees. In Symposium on Principles of Database Systems (PODS), 271-281, 2002.

[Clark, DeRose, W3C 1999] James Clark and Steven DeRose. XML Path Language (XPath) Version 1.0. http://www.w3.org/TR/xpath


References (2)

[Ceri, Comai et al, WWW99] S. Ceri, S. Comai, E. Damiani, P. Fraternali, S. Paraboschi, and L. Tanca. XML-GL: a graphical language for querying and restructuring WWW data. In International World Wide Web Conference (WWW), Toronto, Canada, May 1999.

[Chamberlin, Robie et al, 2000] Don Chamberlin, Jonathan Robie, and Daniela Florescu. Quilt: An XML query language for heterogeneous data sources. In Proceedings of the Third International Workshop on the Web and Databases, May 2000.

[Chamberlin, Draper et al, 2003] Don Chamberlin, Denise Draper, Mary Fernandez, Michael Kay, Jonathan Robie, Michael Rys, Jerome Simeon, Jim Tivy, Philip Wadler. Editor: Howard Katz. XQuery from the Experts: A Guide to the W3C XML Query Language. Addison-Wesley Press, 2003.

[Chaudhri, Rashid et al, 2003] Akmal B. Chaudhri, Awais Rashid, Roberto Zicari. XML Data Management: Native XML and XML-Enabled Database Systems. Addison-Wesley Press, 2003.

[Chen, Lim et al, SIGMOD03] Qun Chen, Andrew Lim and Kian Win Ong. D(k)-index: An adaptive structural summary for graph-structured data. In SIGMOD'03, 134-144.

[Chien, Vagena and Zhang et al, VLDB02] S.-Y. Chien, Z. Vagena, D. Zhang, V. Tsotras, and C. Zaniolo. Efficient structural joins on indexed XML documents. In VLDB02, pages 263--274, 2002.

[Chung, Min et al, SIGMOD02] C-W Chung, J-K Min, K. Shim. APEX: An Adaptive Path Index for XML Data. In SIGMOD'02, 2002.


References (3)

[Deutsch, Fernandez et al, QL98] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A query language for XML. In M. Marchiori, editor. QL'98--The Query Languages Workshop. W3C, Dec. 1998. http://www.w3.org/TR/1998/NOTE-xml-ql-19980819/

[Deutsch, Fernandez, and Suciu, SIGMOD99] A. Deutsch, M. Fernandez, and D. Suciu. Storing Semistructured Data with STORED. In Proc. of the ACM SIGMOD Conference on Management of Data, June 1999.

[Dietz STOC 82] Paul F. Dietz. Maintaining order in a linked list. STOC 1982.

[Grust SIGMOD02] Torsten Grust. Accelerating XPath Location Steps. In Proc. of the 21st ACM SIGMOD Conference, pages 109--120, Madison, Wisconsin, USA, June 2002. ACM Press.

[Jiang, Lu, Wang and Ooi, ICDE03] Haifeng Jiang, Hongjun Lu, Wei Wang, Beng Chin Ooi. XR-Tree: Indexing XML Data for Efficient Structural Joins. The 19th International Conference on Data Engineering (ICDE 2003), pages 253-264, Bangalore, India, March 5-8, 2003.

[Jiang, Wang, Lu and Yu, VLDB03] Haifeng Jiang, Wei Wang, Hongjun Lu, Jeffrey Xu Yu. Holistic Twig Joins on Indexed XML Documents. The 29th International Conference on Very Large Data Bases (VLDB 2003), pages 273-284, Berlin, Germany, September 9-12, 2003.


References (4)

[Kaushik, Shenoy, ICDE02] Raghav Kaushik, Pradeep Shenoy, Philip Bohannon and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. In ICDE’02, 129-140.

[Kha et al, ICDE01] Dao Dinh Kha, Masatoshi Yoshikawa, and Shunsuke Uemura. An XML indexing structure with relative region coordinate. ICDE 2001.

[Krishnamurthy et al, 2003] R. Krishnamurthy, R. Kaushik, J. Naughton. XML-to-SQL Query Translation Literature: The State of the Art and Open Problems. XML Symposium (XSym), Sep 2003.

[Li and Moon, VLDB01] Quanzhong Li and Bongki Moon. Indexing and querying XML data for regular path expressions. VLDB 2001.

[Milo, Suciu, ICDT99] Tova Milo and Dan Suciu. Index structures for path expressions. In ICDT’99, 277-295, January 1999.

[Lee, Srivastava DASFAA04] Dongwon Lee and Divesh Srivastava. Counting relaxed twig matches in a tree. DASFAA 2004.

[Lim, Wang et al, VLDB02] Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, and Ronald Parr. XPathLearner: An on-line self-tuning Markov histogram for XML path selectivity estimation. VLDB 2002.

[Lee, Yoo et al, 1996] Yong Kyu Lee, Seong-Joon Yoo, Kyoungro Yoon, and P. Bruce Berra. Index structures for structured documents. In Proceedings of the ACM Conference on Digital Libraries, 1996.


References (5)

[Manolescu, Florescu et al, 2001] I. Manolescu, D. Florescu, and D. Kossmann. Pushing XML queries inside relational databases. Tech. Report no. 4112, INRIA, 2001.

[Manolescu, Florescu et al, VLDB01] I. Manolescu, D. Florescu, and D. Kossmann. Answering XML queries over heterogeneous data sources. In Proceedings of the International Conference on Very Large Data Bases (VLDB), Rome, Italy, September 2001.

[Meier, 2002] Wolfgang Meier. eXist: An open source native XML database. In Web, Web-Services, and Database Systems 2002, 2002.

[McHugh, Widom, VLDB99] Jason McHugh and Jennifer Widom. Query optimization for XML. VLDB 1999.

[Polyzotis, Garofalakis SIGMOD02] Neoklis Polyzotis and Minos N. Garofalakis. Statistical synopses for graph-structured XML databases. SIGMOD 2002.

[Polyzotis, Garofalakis VLDB02] Neoklis Polyzotis and Minos N. Garofalakis. Structure and value synopses for XML data graphs. VLDB 2002.

[Robie, Lapp, QL98] J. Robie, J. Lapp, and D. Schach. XML query language (XQL). In M. Marchiori, editor. QL'98--The Query Languages Workshop. W3C, Dec. 1998. http://www.w3.org/TandS/QL/QL98/pp/xql.html


References (6)

[Schmidt et. al. WebDB00] A. Schmidt, M. L. Kersten, M. Windhouwer, and F. Waas. Efficient relational storage and retrieval of XML documents. In WebDB (Informal Proceedings), pages 47--52, 2000.

[Shanmugasundaram, Tufte, He, Zhang, DeWitt, and Naughton, VLDB99] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, and Jeffrey F. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In Proceedings of 25th International Conference on Very Large Data Bases (VLDB'99), pages 79-90. Morgan Kaufmann, 1999.

[Shanmugasundaram et. al. VLDB 99] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, Jeffrey F. Naughton: Relational Databases for Querying XML Documents: Limitations and Opportunities. VLDB 1999: 302-314

[Srivastava, Al-Khalifa et al, ICDE02] D. Srivastava, S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, and Y. Wu. Structural joins: A primitive for efficient XML query pattern matching. In ICDE, pages 141--152, 2002.

[Tatarinov, Viglas et al, SIGMOD02] Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. SIGMOD 2002.

[Wang, Jiang et al, SIGMOD03] Wei Wang, Haifeng Jiang, Hongjun Lu, Jeffrey Xu Yu. Containment Join Size Estimation: Models and Methods. The 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD03), San Diego, California, June 9 - June 12, 2003. Pages 145-156.


References (7)

[Wang, Jiang et al, ICDE03] Wei Wang, Haifeng Jiang, Hongjun Lu, and Jeffrey Xu Yu. PBiTree coding and efficient processing of containment joins. ICDE 2003.

[Wu et al., EDBT02] Yuqing Wu, Jignesh Patel, H. V. Jagadish. Using Histograms to Estimate Answer Size for XML Queries. Information Systems 28 (1-2): 33-59 (2003) -- Special Issue: Best Papers from EDBT 2002.

[Wu et al, ICDE03] Yuqing Wu, Jignesh Patel and H. V. Jagadish. Structural Join Order Selection for XML Query Optimization. ICDE 2003.

[Yoshikawa, et. al. TOIT01] Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki Shimura, Shunsuke Uemura: XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Trans. Internet Techn. 1(1): 110-141 (2001)

[Zhou et. al. VLDB 2001] Aoying Zhou, Hongjun Lu, Shihui Zheng, Yuqi Liang, Long Zhang, Wenyun Ji, Zengping Tian: VXMLR: A Visual XML-Relational Database System. VLDB 2001: 719-720

[Zhang, Naughton SIGMOD01] Chun Zhang, Jeffrey F. Naughton, David J. DeWitt, Qiong Luo, and Guy M. Lohman. On supporting containment queries in relational database management systems. SIGMOD 2001.