L09: Introduction to XML Data Management
XML and XML Query Languages Structural Summary and Coding Scheme Managing XML Data in Relational Systems
XML and XML Query Languages
H. Lu/HKUST L09: Introduction to XML Data Management 3
XML
Extensible Markup Language for data
A W3C standard to complement HTML
http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, Second Edition, 10/2000)
Standard for publishing and interchange
Origins: structured text, SGML
"Cleaner" SGML for the Internet
Motivation: HTML describes presentation; XML describes content
[Figure: SGML, XML, HTML 4.0]
XML – Describing the Content
<project>
  <talk>
    <title> XML Query Processing &amp; Optimization </title>
    <date> March 18, 2004 </date>
    <instructor> Instructor
      <name> Lu Hongjun </name> <affiliation> HKUST </affiliation> <email> [email protected] </email>
      <name> Jeffrey X. Yu </name> <affiliation> CUHK </affiliation> <email> [email protected] </email>
    </instructor>
  </talk>
</project>
XML Document/Data
Hierarchical document format for information exchange on the WWW
Self-describing data (tags)
Nested element structures having a root
Element data can have
  Attributes
  Sub-elements
Basic XML Structures
Elements: <title>… </title>, <name>… </name>
  Open and close tags, or an "empty tag"
  Ordered and nestable; an element can be empty
Attributes
PCDATA/CDATA
An XML document has a single root element
Well-formed XML document: all tags match and are properly nested
Basic XML Structures: Attributes
Single-valued, unordered
<project proj_id="P1234" budget="1000000">
  <title> XML Data Management </title>
  …
  <year> 2003-2004 </year>
</project>
Special types: ID, IDREF, IDREFS
<member id="m007"> <name> James </name> </member>
<project id="p123">
  <title> XML Data Management </title>
  <member idref="m007 m008"/>
</project>
Other XML Structures
Processing instructions: instructions for applications
  <?xml version="1.0"?>
CDATA sections: treat content as character data
  <![CDATA[<tag>Whatever!!!</tag><whatever>]]>
Comments: just like HTML
  <!-- Comments -->
Entities: external resources and macros
  &my-entity; (non-parameter entity)
  %param-entity; (parameter entity for DTD declarations)
Data Centric vs. Document Centric
Data centric:
<project>
  <pname> XML </pname>
  <member ID="&3" age="50">
    <name> H. Lu </name>
    <email> [email protected] </email>
    <publication author="H. Lu">
      <title> Managing XML data using RDBMS </title>
      <year> 2001 </year>
    </publication>
    …
  </member>
  <member ID="&24" age="35">
    <name> J.X. Yu </name>
    <project>
      <pname> Data mining </pname>
    </project>
  </member>
</project>
Document centric:
<bio>
  <p> Dr Lu is a professor at <b> HKUST. </b> He worked at <b> NUS </b> before 1998. </p>
</bio>
XML Data Model
Several competing models
Document Object Model (DOM)
  A platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents
http://www.w3.org/DOM/
DOM Core Interface : Node
DOM tree: a tree-like structure of Node objects; the root of the tree is a Document object.
Node object properties: nodeName, nodeValue, nodeType, parentNode, childNodes, firstChild, lastChild, previousSibling, nextSibling, attributes, ownerDocument
nodeType values: ELEMENT_NODE, ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, DOCUMENT_FRAGMENT_NODE, NOTATION_NODE
DOM Interface
Each node of the document tree may have a number of child nodes, contained in a NodeList object.
Two ways of accessing a node object:
  Based on the location of the object in the document tree
  Based on the name of the object
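The two access styles can be sketched with Python's standard `xml.dom.minidom`, used here only as a convenient stand-in for a DOM implementation. The tiny document and the variable names are illustrative assumptions, not part of the lecture.

```python
from xml.dom import minidom

# Hypothetical miniature of the project document (no whitespace between tags,
# so no extra whitespace-only text nodes appear in the tree).
doc = minidom.parseString(
    "<project><pname>XML</pname>"
    "<member age='50'><name>H. Lu</name></member></project>"
)

root = doc.documentElement                    # the single root element

# Access by location: navigate with firstChild / nextSibling / childNodes.
first = root.firstChild                       # <pname>
second = first.nextSibling                    # <member>

# Access by name: look elements up by their tag name.
name = doc.getElementsByTagName("name")[0]    # <name>

print(doc.nodeType == doc.DOCUMENT_NODE)      # the tree root is a Document node
print(root.nodeName, root.nodeValue)          # element nodes carry no nodeValue
print(name.firstChild.nodeType == name.TEXT_NODE, name.firstChild.nodeValue)
```

Note that in this API the character data "H. Lu" lives in a separate text node that is the child of the `name` element node.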
A Sample DOM Tree
[Figure: DOM tree of the project document. Nodes are identified by object IDs such as &1, &2, &3 and &24; element nodes include project, pname, member, name, email, publication, title, year and author; leaf values include XML, H. Lu, Managing …, 2001, J.X. Yu and Data mining.]
Callouts in the figure:
  project node: nodeType = ELEMENT_NODE, tagName = "project", nodeValue = null
  publication node: nodeType = ELEMENT_NODE, tagName = "publication", nodeValue = null
  name node: nodeType = TEXT_NODE, tagName = "name", nodeValue = "H. Lu"
Data Graph
Similar to a DOM tree, but may use a different notation to represent an XML document
[Figure: data graph of the project document. Nodes are identified by object IDs (&1, &2, &3, &24, …); edges are labeled project, pname, member, name, email, publication, title, year, author and age; leaf values include XML, H. Lu, Managing …, 2001, J.X. Yu, Data mining and 50.]
Document Type Definition
Inherited from SGML DTD standard
BNF grammar establishing constraints on element structure and content
Specification of attributes and their types
Definitions of entities
A Sample DTD
[Figure: DTD graph. project has children pname, member (*) and publication (*); member has attribute ID and children name, email (?), publication (*) and project (*); publication has attribute author and children title and year.]
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE Research >
<!ELEMENT project (pname, member*, publication*)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT member (name, email?, publication*, project*)>
<!ATTLIST member ID ID #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT publication (title, year)>
<!ATTLIST publication author IDREF #IMPLIED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
XML Query Languages
There have been a large number of proposals during the past few years:
  XPath [Clark, DeRose, W3C 1999]
  XQuery [Boag, Chamberlin et al, W3C 2003]
  XML-QL [Deutsch, Fernandez et al, QL99]
  XQL [Robie, Lapp, QL99]
  XML-GL [Ceri, Comai et al, WWW99]
  Quilt [Chamberlin, Robie et al, 2000]
From W3C:
  XQuery 1.0 (W3C Working Draft, 12 November 2003)
    • http://www.w3.org/TR/xquery/
  XPath 2.0 (W3C Working Draft, 12 November 2003)
    • http://www.w3.org/TR/xpath20/
XPath: XML Path Language
The purpose
  To address the nodes of an XML tree using a path notation for navigating through the hierarchical structure of an XML document
  Uses a compact, non-XML syntax
  Designed to be embedded in a host language (e.g., XSLT, XQuery)
XPath expressions
  String of characters
  The value of an expression is always an ordered collection of zero or more items (atomic values, nodes)
XPath: Steps
An XPath expression has the following syntax:
  Path ::= /Step1/Step2/…/Stepn
where each XPath step is defined as follows:
  Step ::= Axis::Node-test Predicate*
Axis specifies the "direction" in which the document should be navigated.
For example, child::title[position() = 2]
There are 13 axes: child, descendant, descendant-or-self, parent, ancestor, ancestor-or-self, following, preceding, following-sibling, preceding-sibling, attribute, self, namespace
XPath Path Expressions
project matches a project element
* matches any element
/ matches the root element
/project matches a project element under root
project/member matches a member in project
project//name matches a name in project, at any depth
//title matches a title at any depth
member|publication matches a member or a publication
@age matches an age attribute
project/member/@age matches age attribute in member, in project
project/member[@age<"45"] matches a member with age < 45
XPath Query Examples
/project/member/name: matches a name of a member in project
  Result: <name> H. Lu </name>
          <name> J.X. Yu </name>
/project/publication/venue
  Result: empty (there is no venue element)
//pname: matches a pname at any depth
  Result: <pname> XML </pname>
          <pname> Data mining </pname>
/project/member/name/text(): text of name elements
  Result: H. Lu
          J.X. Yu
More XPath Queries
/project/member[publication]
  <member ID="&3" age="50">
    <name> H. Lu </name>
    <email> [email protected] </email>
    <publication author="H. Lu">
      <title> Managing XML data using RDBMS </title>
      <year> 2001 </year>
    </publication>
  </member>
/project/member[@age < "45"]
  <member ID="&24" age="35">
    <name> J.X. Yu </name>
    <project>
      <pname> Data mining </pname>
    </project>
  </member>
/project[member/@age < "25"]
  No element returned
/project/member[email/text()]
  [email protected]
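Several of these queries can be tried with the limited XPath subset of Python's standard `xml.etree.ElementTree`. This is a sketch under assumptions: the IDs are written as "m3" and "m24" (a bare "&" is not well-formed XML), the email elements are left out, and the numeric comparison is done in Python because ElementTree does not support comparison predicates such as `@age < 45`.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed version of the running example.
XML = """<project>
  <pname>XML</pname>
  <member ID="m3" age="50">
    <name>H. Lu</name>
    <publication author="m3">
      <title>Managing XML data using RDBMS</title>
      <year>2001</year>
    </publication>
  </member>
  <member ID="m24" age="35">
    <name>J.X. Yu</name>
    <project><pname>Data mining</pname></project>
  </member>
</project>"""

root = ET.fromstring(XML)  # root is the outer <project> element

# /project/member/name/text()
names = [e.text for e in root.findall("./member/name")]
# //pname (descendants of the root, in document order)
pnames = [e.text for e in root.findall(".//pname")]
# /project/publication/venue (no such element)
venues = root.findall("./publication/venue")
# /project/member[publication]
with_pubs = [m.get("ID") for m in root.findall("./member[publication]")]
# /project/member[@age < 45], filtered in Python
under_45 = [m.get("ID") for m in root.findall("./member") if int(m.get("age")) < 45]

print(names, pnames, venues, with_pubs, under_45)
```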
XQuery
XQuery 1.0: An XML Query Language W3C Working Draft 12 November 2003 http://www.w3.org/TR/xquery/
XPath expressions are still the basic building block
FLWR Expressions: FOR-LET-WHERE-RETURN
FOR/LET clauses
  Produce an ordered list of tuples of bound variables
  FOR $x in expr binds $x to each value in the list expr
  LET $x := expr binds $x to the entire list expr
    Useful for common subexpressions and for aggregations
WHERE clause
  Produces a pruned list of tuples of bound variables
RETURN clause
  Produces an instance of the XML Query data model
XQuery Examples
<result>
  FOR $x IN /project/member/publication
  WHERE $x/year > 2000
  RETURN <recentpub> $x/title </recentpub>
</result>

<active_members>
  FOR $m IN distinct(document("project.xml")//member)
  LET $p := document("project.xml")//publication[author = $m]
  WHERE count($p) > 10
  RETURN $m
</active_members>

distinct = a function that eliminates duplicates
count = an (aggregate) function that returns the number of elements
Structural Summary and Coding Scheme
Structural Summary
A structural summary for a data graph $G_D(V_D, E_D)$ is another labeled graph $G_I(V_I, E_I)$.
Each node $v_i \in G_I$ represents a set of nodes, $extent(v_i)$, with $extent(v_i) \subseteq V_D$.
An edge $e(v_i, v_i') \in G_I$ exists if there is an edge $e(v_d, v_d') \in G_D$ with $v_d \in extent(v_i)$ and $v_d' \in extent(v_i')$.
The summary preserves all the paths in the data graph. A path expression query can be executed on $G_I$ instead of $G_D$, which is usually more efficient because $G_I$ is much smaller than $G_D$.
Structural Summary
Basically, nodes in the data graph are grouped based on certain criteria, and each group of nodes is represented by one node in the summary.
The size of the summary is determined by the grouping criteria.
Desired properties when evaluating path expression queries using a summary:
  The results are safe: they contain no false negatives
    • If not safe, only approximate answers can be obtained
  The results are precise: they contain no false positives
    • If not precise, the results must be validated against the data graph
Structural Summary
[Figure: a data graph with root r and nodes a1, a2, a3, b1, b2, b3, c1, c2, c3, next to its structural summary with nodes R = {r}, A = {a1, a2, a3}, B = {b1, b2, b3} and C = {c1, c2, c3}.]
Sample Structural Summaries
Query workload independent summaries
  DataGuide
  1-index [Milo, Suciu, ICDT99]
  A(k)-index [Kaushik, Shenoy, ICDE02]
Query workload dependent summaries
  APEX [Chung, Min et al, SIGMOD02]
  D(k)-index [Chen, Lim et al, SIGMOD03]
Data Guides
DataGuide: dynamic structural summary of the current database
  Each label path in the database appears once in the DataGuide
  No extraneous paths in the DataGuide
Maintained incrementally as the database evolves
Serves the role of a schema
[Figure note: C1 is duplicated to achieve determinism in DataGuides]
Bisimilarity and 1-Index
Most existing structural summaries are based on graph bisimilarity, defined as follows:
Two data nodes u and v are bisimilar ($u \approx v$) if
  • u and v have the same label;
  • if u' is a parent of u, then there is a parent v' of v such that $u' \approx v'$, and vice versa.
Intuitively, if two nodes are bisimilar, the sets of paths coming into them are the same.
Tova Milo and Dan Suciu. Index structures for path expressions. In ICDT’99. 277-295, January 1999.
1-Index
1-index: each index node represents an equivalence class in which the data nodes are mutually bisimilar.
Evaluating a path expression query using the 1-index is
  safe: the result always contains the result of evaluating on the data graph;
  precise: the result contains no false data nodes.
K-bisimilarity
The 1-index can be big
A smaller summary is based on the notion of k-bisimilarity ($\approx^k$), which is defined inductively:
  For any two nodes u and v, $u \approx^0 v$ iff u and v have the same label;
  $u \approx^k v$ iff $u \approx^{k-1} v$, and for every parent u' of u there is a parent v' of v such that $u' \approx^{k-1} v'$, and vice versa.
Intuitively, if two data nodes are k-bisimilar, the sets of paths of length $\le k$ coming into them are the same.
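The inductive definition can be turned into a simple partition-refinement loop. The following is a minimal sketch, not the construction algorithm from the cited papers: the graph encoding (a `labels` dict and a `parents` dict), the function name and the example graph are all assumptions made for illustration.

```python
def k_bisim_partition(labels, parents, k):
    """Group nodes into k-bisimilarity classes.

    labels:  node -> label
    parents: node -> list of parent nodes (missing key means no parents)
    """
    # Round 0: two nodes are 0-bisimilar iff they have the same label.
    block = dict(labels)
    for _ in range(k):
        # Round i: same class in round i-1, and the same set of
        # round-(i-1) classes among the parents.
        block = {
            v: (block[v], frozenset(block[p] for p in parents.get(v, ())))
            for v in labels
        }
    groups = {}
    for v, b in block.items():
        groups.setdefault(b, set()).add(v)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data graph: r -> a1, b1; a1 -> c1, c3; b1 -> c2.
labels = {"r": "R", "a1": "A", "b1": "B", "c1": "C", "c2": "C", "c3": "C"}
parents = {"a1": ["r"], "b1": ["r"], "c1": ["a1"], "c2": ["b1"], "c3": ["a1"]}

print(k_bisim_partition(labels, parents, 0))  # all C nodes share one class
print(k_bisim_partition(labels, parents, 1))  # c2 splits off: its parent is a B
```

Larger k gives a finer partition (a larger but more precise summary); with k = 0 the summary has one node per label.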
A(k)-Index
A(k)-Index: group nodes based on their local structure (incoming paths of length up to k) instead of the global path information
  Data nodes in each index node of the A(k)-index are mutually k-bisimilar
Evaluating a path expression query using the A(k)-index:
  safe: the result always contains the result of evaluating on the data graph;
  precise: the result contains no false data nodes for path expressions of length at most k; longer paths must be validated against the data graph.
Raghav Kaushik, Pradeep Shenoy, Philip Bohannon and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. ICDE’02, 129-140.
A(2)-Index
C2 and C3 can be grouped because their length-2 incoming paths are the same
APEX: Adaptive Path Index
1-index, A(k)-index and F&B index are all workload independent
APEX: Adaptive Path index
  Maintains two types of paths in the summary:
    • All paths of length two, so that all queries can be answered using APEX
    • Full paths for those paths that frequently appear in the query workload, so that frequently asked queries can be answered efficiently
  A hash table is included in the index so that partial-matching queries with the descendant-or-self axis (//) can be processed efficiently
C-W Chung, J-K Min, K. Shim, APEX: An Adaptive Path Index for XML Data, SIGMOD 02
D(k)-Index
A generalization of the 1-index and the A(k)-index.
Assigns different local bisimilarities to index nodes in the summary structure according to the query load, to optimize its structure.
For any two index nodes $n_i$ and $n_j$, $k(n_i) \ge k(n_j) - 1$ if there is an edge from $n_i$ to $n_j$, where $k(n_i)$ and $k(n_j)$ are the local bisimilarities of $n_i$ and $n_j$, respectively.
Advantages over the 1-index and the A(k)-index:
  workload-sensitive;
  can be updated more efficiently.
Qun Chen, Andrew Lim and Kian Win Ong. D(k)-index: An adaptive structural summary for graph-structured data. SIGMOD 03, 134-144.
Node (Edge) Encoding
Structural relationships
  Is node u an ancestor of node v?
  Is node u the parent of node v?
Assign a unique code to each node (edge) in the data graph so that the above questions can be answered by looking at the codes rather than the original data graph.
Issues:
  Length of the code.
  Complexity of computing the structural relationship between two nodes from their codes.
  Efficient code generation and code maintenance.
XML Data Coding Scheme
Region-based
  An XML document is ordered
  Codes are assigned based on the lexicographical location of an element in the original document
Path-based
  An XML document is nested
  Codes are assigned based on the nesting structure of the document, or the path that reaches an element from the root
There are quite a number of variants in both categories of coding schemes
XML Region Based Coding
Region code: (start, end, level)
  u is an ancestor of v iff u.start < v.start < u.end
  u is the parent of v if, additionally, u.level = v.level - 1
Only a depth-first traversal is needed for code generation
Property: strict nesting
  Two regions are either completely disjoint (cases 1, 4) or one contains the other (cases 2, 3)
  Formally, a.start < b.start < a.end if a is an ancestor of b
[Figure: four cases of the relative position of the regions of a and b under the root: cases 1 and 4 are disjoint, cases 2 and 3 are containing.]
Sample of Region Codes
[Figure: a dblp document tree (levels 0 to 4) annotated with region codes. The root dblp has region (1,37); its two paper children have regions (2,20) and (21,36) and contain title, allauthors, author, year and conf elements; text leaves such as jane poe, jack lee, 2001, 2003, VLDB, SIGMOD, XML and XML database carry a single position, e.g. (4), (8), (11).]
The order of start values is also the document order
The region can also be interpreted as an interval
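Region codes can be generated in one depth-first traversal, after which the ancestor and parent tests are pure comparisons on the codes. This is a minimal sketch under assumptions: the tree is given as nested `(name, children)` tuples with unique names, text leaves are not numbered, and the function names are illustrative.

```python
def assign_regions(tree):
    """Return name -> (start, end, level) using one depth-first traversal.

    tree is a (name, [children]) tuple; names are assumed to be unique.
    """
    codes = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        name, children = node
        counter += 1            # position of the open tag
        start = counter
        for child in children:
            visit(child, level + 1)
        counter += 1            # position of the close tag
        codes[name] = (start, counter, level)

    visit(tree, 0)
    return codes


def is_ancestor(u, v):
    # u is an ancestor of v iff u.start < v.start < u.end
    return u[0] < v[0] < u[1]


def is_parent(u, v):
    # parent = ancestor that is exactly one level above
    return is_ancestor(u, v) and u[2] == v[2] - 1


# Hypothetical miniature of the dblp example.
tree = ("dblp", [
    ("paper1", [("title1", []),
                ("allauthors1", [("author1", []), ("author2", [])])]),
    ("paper2", [("title2", [])]),
])
codes = assign_regions(tree)
print(codes["dblp"], codes["paper1"], codes["author1"])
print(is_ancestor(codes["paper1"], codes["author1"]))   # ancestor, not parent
print(is_parent(codes["allauthors1"], codes["author1"]))
```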
Dewey
<contact>
<name>blah</name>
<phone>
<office>1234</office>
<home>5678</home>
<mobile>0000</mobile>
</phone>
</contact>
[Figure: the contact document tree labeled with Dewey codes: contact 1; name 1.1; blah 1.1.1; phone 1.2; office 1.2.1; 1234 1.2.1.1; home 1.2.2; 5678 1.2.2.1; mobile 1.2.3; 0000 1.2.3.1.]
Ancestor test: a.Dewey is a prefix of d.Dewey
Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. SIGMOD 2002.
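The prefix test is easy to get wrong if the codes are compared as raw strings ("1.2" is a string prefix of "1.20"). A minimal sketch that compares component by component follows; the helper names are illustrative assumptions, and the codes are those of the contact example.

```python
def dewey(code):
    """Parse a Dewey code such as '1.2.3' into a tuple of integers."""
    return tuple(int(part) for part in code.split("."))


def is_ancestor(a, d):
    """a is an ancestor of d iff a's components are a proper prefix of d's."""
    a, d = dewey(a), dewey(d)
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(a, d):
    """Parent = ancestor whose code is exactly one component shorter."""
    return is_ancestor(a, d) and len(dewey(d)) == len(dewey(a)) + 1


print(is_ancestor("1.2", "1.2.3.1"))   # phone is an ancestor of the text 0000
print(is_parent("1.2", "1.2.3"))       # phone is the parent of mobile
print(is_ancestor("1.2", "1.20"))      # siblings, despite the string prefix
# Sorting by the parsed tuples yields document order.
print(sorted(["1.2.1", "1.10", "1.2", "1.1"], key=dewey))
```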
Managing XML Data in Relational Systems
XML-Enabled DB Systems
IBM DB2 XML Extender
  XML column support, XML Collection, files linked from the DBMS, or Character Large Objects (CLOBs)
  Side tables serve as XML indexes
Oracle 9i
  CLOB, OracleText Cartridge, XMLType, and XML SQL Utility
Microsoft SQL Server
  CLOBs, generic Edge technique and user-defined decomposition (from XML to tables), XML views
Storing XML Data in RDBMSs
RDBMS: a mature technology
RDBMSs are widely available
  Less investment to adopt the new technology
  Easy to integrate with other existing applications
Impedance mismatch
  Two-level nature of relational schema (tuples and attributes) vs. arbitrary nesting of XML DTD
  Flat structure vs. recursion
  Structure-based and content-based queries
XQuery vs SQL: Different Culture
Data characteristics
  Relational data: regular, homogeneous, flat in structure, and no order among tuples
  XML data: irregular, heterogeneous, unpredictable structure, order sensitive
Query languages
  SQL:
    • Select-from-where
    • With capability to support some fix-point operations
  XQuery:
    • FLWOR (pronounced "flower"): For-let-where-order-return
    • Simple/regular path expressions
Storing XML Data in RDBMSs: Architecture
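[Figure: architecture. Automatic schema/data mapping turns the DTD into a relational schema and the XML documents into tuples stored in a commercial RDBMS; an XML query is translated into an SQL query, and the relational result is converted into an XML result.]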
Storing XML Data in RDBMSs: Issues
Schema/Data mapping: Automate storage of XML in RDBMS
Query mapping: Provide XML views of relational sources
Result construction: Export existing data as XML
XML-Relational Mapping
Model mapping
  Database schemas represent constructs of the XML document model
    • DTD independent
  [Florescu & Kossmann 99, Yoshikawa et al. TOIT01]
Structure mapping
  Database schemas represent the logical structure of target XML documents
    • DTD dependent
  [Shanmugasundaram et al. VLDB 99]
A Simple XML Document
<project>
  <pname> XML </pname>
  <member ID="&3">
    <name> H. Lu </name>
    <email> [email protected] </email>
    <publication author="H. Lu">
      <title> Managing XML data using RDBMS </title>
      <year> 2001 </year>
    </publication>
    …
  </member>
  <member ID="&24">
    <name> J.X. Yu </name>
    <project>
      <pname> Data mining </pname>
    </project>
  </member>
</project>
A Sample DOM Tree
[Figure: DOM tree of the simple XML document, with nodes identified by object IDs such as &1, &2, &3 and &24, element nodes project, pname, member, name, email, publication, title, year and author, and leaf values XML, H. Lu, Managing …, 2001, J.X. Yu and Data mining.]
Model Mapping: Document Model to Relation
Database schema represents the constructs of XML documents
Fixed database schema for all XML documents
Data graph: tree (may contain cycles)
Relational schema represents a tree
Pros and cons
  DTD is not required; documents may not conform to a DTD
  Fixed schema: no schema evolution issue
  Large collection of documents with various DTDs
  Semantics get (totally) lost
Model Mapping – Edge/Monet Approach
Edge-oriented approach
  Single table schema [Florescu & Kossmann 99]
    Edge (source, ordinal, target, label, flag, value)
  Monet [Schmidt et al. WebDB00]
    • Multiple tables, horizontal partitions of the edge table on label path

Source  Ordinal  Target  Label     Flag  Value
&1      1        &2      "Pname"   val   "XML"
&1      1        &3      "Member"  ref   -
&1      2        &24     "Member"  ref   -
&3      1        &26     "name"    val   "H.Lu"

Note: Document ID is omitted here
Querying with Edge
select name.Value
from Edge dbgroup, Edge member, Edge age, Edge name
where dbgroup.Label = 'DBGroup'
  and member.Label = 'Member'
  and age.Label = 'Age'
  and name.Label = 'Name'
  and dbgroup.Source = 0
  and dbgroup.Target = member.Source
  and member.Target = age.Source
  and member.Target = name.Source
  and cast(age.Value as int) > 20
/DBGroup/Member[Age>20]/Name
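The Edge query can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite stands in for the commercial RDBMS, node IDs are plain integers with 0 as the virtual document root, and the two members (Ann, Bob) are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Edge (source INTEGER, ordinal INTEGER, target INTEGER, "
    "label TEXT, flag TEXT, value TEXT)"
)
# Hypothetical document: <DBGroup> with two <Member> children.
conn.executemany(
    "INSERT INTO Edge VALUES (?, ?, ?, ?, ?, ?)",
    [
        (0, 1, 1, "DBGroup", "ref", None),
        (1, 1, 2, "Member", "ref", None),
        (1, 2, 3, "Member", "ref", None),
        (2, 1, 4, "Name", "val", "Ann"),
        (2, 2, 5, "Age", "val", "19"),
        (3, 1, 6, "Name", "val", "Bob"),
        (3, 2, 7, "Age", "val", "42"),
    ],
)

# /DBGroup/Member[Age>20]/Name: one self-join of Edge per path step.
query = """
SELECT name.value
FROM Edge dbgroup, Edge member, Edge age, Edge name
WHERE dbgroup.label = 'DBGroup' AND member.label = 'Member'
  AND age.label = 'Age' AND name.label = 'Name'
  AND dbgroup.source = 0
  AND dbgroup.target = member.source
  AND member.target = age.source
  AND member.target = name.source
  AND CAST(age.value AS INTEGER) > 20
"""
result = [row[0] for row in conn.execute(query)]
print(result)
```

Each step of the path costs one more self-join of the Edge table, which is the main performance concern with this mapping.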
Model Mapping – Node Approach
XRel [Yoshikawa et al. TOIT 2001]
Four-table schema
  Element(pathID, start, end, ordinal)
  Attribute(pathID, start, end, value)
  Text(pathID, start, end, value)
  Path(pathID, pathexp)

Path
PathID  PathExp
1       #/Project
2       #/Project#/pname
3       #/Project#/member
4       #/Project#/member#/name
…       …

Element
PathID  Start  End  Ordinal
2       1      5    1
4       6      9    1
4       21     25   2
…       …      …    …

Text
PathID  Start  End  Value
2       3      4    "XML"
4       7      8    "H. Lu"
4       22     24   "J.X. Yu"
…       …      …    …
Querying with XRel
select v2.Value
from Element e1, Path p1, Path p2, Path p3, Text v1, Text v2
where p1.Pathexp = '#/DBGroup#/Member'
  and p2.Pathexp = '#/DBGroup#/Member#/Age'
  and p3.Pathexp = '#/DBGroup#/Member#/Name'
  and e1.PathID = p1.PathID
  and v1.PathID = p2.PathID
  and v2.PathID = p3.PathID
  /* containment testing */
  and e1.Start < v1.Start and e1.End > v1.End
  and e1.Start < v2.Start and e1.End > v2.End
  and cast(v1.Value as int) > 20
/DBGroup/Member[Age>20]/Name
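The same query can be run against an XRel-style schema, where path matching is a lookup in the Path table and structure is checked by region containment. This is a sketch under assumptions: SQLite stands in for the RDBMS, the columns are renamed to `start_pos` and `end_pos` and the text table to `TextNode` to stay clear of SQL keywords, and the positions and members (Ann, Bob) are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Path     (pathID INTEGER, pathexp TEXT);
CREATE TABLE Element  (pathID INTEGER, start_pos INTEGER, end_pos INTEGER, ordinal INTEGER);
CREATE TABLE TextNode (pathID INTEGER, start_pos INTEGER, end_pos INTEGER, value TEXT);
""")
conn.executemany("INSERT INTO Path VALUES (?, ?)", [
    (1, "#/DBGroup"),
    (2, "#/DBGroup#/Member"),
    (3, "#/DBGroup#/Member#/Name"),
    (4, "#/DBGroup#/Member#/Age"),
])
# Hypothetical regions: two Member elements inside one DBGroup.
conn.executemany("INSERT INTO Element VALUES (?, ?, ?, ?)", [
    (1, 1, 100, 1),
    (2, 2, 40, 1),
    (2, 41, 80, 2),
])
# Text nodes carry the path of their parent element.
conn.executemany("INSERT INTO TextNode VALUES (?, ?, ?, ?)", [
    (3, 5, 7, "Ann"), (4, 12, 13, "19"),
    (3, 45, 47, "Bob"), (4, 52, 53, "42"),
])

query = """
SELECT v2.value
FROM Element e1, Path p1, Path p2, Path p3, TextNode v1, TextNode v2
WHERE p1.pathexp = '#/DBGroup#/Member'
  AND p2.pathexp = '#/DBGroup#/Member#/Age'
  AND p3.pathexp = '#/DBGroup#/Member#/Name'
  AND e1.pathID = p1.pathID
  AND v1.pathID = p2.pathID
  AND v2.pathID = p3.pathID
  -- containment testing: both text nodes lie inside the same Member region
  AND e1.start_pos < v1.start_pos AND e1.end_pos > v1.end_pos
  AND e1.start_pos < v2.start_pos AND e1.end_pos > v2.end_pos
  AND CAST(v1.value AS INTEGER) > 20
"""
result = [row[0] for row in conn.execute(query)]
print(result)
```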
Structural Mapping: Simplifying DTDs
DTD element specifications can be of arbitrary complexity
<!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)> is valid!
Simplify the DTD for translation purposes. Key observations:
  It is not necessary to regenerate the DTD from the relational schema
  XML queries query the position of an element relative to its siblings, and the parent/child relationships
DTD Simplification: Transformations
Flattening transformations
  (e1, e2)*  →  e1*, e2*
  (e1, e2)?  →  e1?, e2?
  (e1|e2)    →  e1?, e2?
Simplification transformations
  e1**  →  e1*
  e1*?  →  e1*
  e1?*  →  e1*
  e1??  →  e1?
Grouping transformations
  ..., a*, ..., a*, ...  →  a*, ...
  ..., a*, ..., a?, ...  →  a*, ...
  ..., a?, ..., a*, ...  →  a*, ...
  ..., a?, ..., a?, ...  →  a*, ...
  ..., a, ..., a, ...    →  a*, ...
Example:
  <!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)>  →  <!ELEMENT a (b*, c?, e*, f*)>
[Deutsch, Fernandez, and Suciu, SIGMOD99]
[Shanmugasundaram, Tufte, He, Zhang, DeWitt, and Naughton, VLDB99]
A Sample DTD
<!ELEMENT book (booktitle, author)>
<!ELEMENT booktitle (#PCDATA)>
<!ELEMENT author (name, address)>
<!ATTLIST author id ID #REQUIRED>
<!ELEMENT name (firstname?, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT address ANY>
<!ELEMENT article (title, author*, contactauthor)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT contactauthor EMPTY>
<!ATTLIST contactauthor authorID IDREF #IMPLIED>
<!ELEMENT monograph (title, author, editor)>
<!ELEMENT editor (monograph*)>
<!ATTLIST editor name CDATA #REQUIRED>
[Figure: DTD graph for this DTD, with nodes book, booktitle, article, title, contactauthor, authorid, monograph, editor, author, name, firstname (?), lastname and address, and * edges for the author set under article and the monograph set under editor.]
[Shanmugasundaram et al. VLDB 99]
DTD to Relational Schema: Naïve Approach
Each element ==> relation
Each attribute of an element ==> column of the relation
Connect elements using foreign keys

<!ELEMENT author (name, address)>
<!ATTLIST author id ID #REQUIRED>
<!ELEMENT name (firstname?, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT address ANY>

author (authorID: integer, id: string)
name (nameID: integer, authorID: integer)
firstname (firstnameID: integer, nameID: integer, value: string)
lastname (lastnameID: integer, nameID: integer, value: string)
address (addressID: integer, authorID: integer, value: string)
Basic Inlining Technique
Problem of the naïve approach: fragmentation (too many tables)
  It results in 5 relations in the previous example, so retrieving the first and last names of an author requires joining them
Intuition:
  Inline as many sub-elements as possible
  Do not inline only if it is a set sub-element
    • RDBMSs do not all support set-valued columns
  Connect relations using foreign keys
    • Can handle recursion
  A document can be rooted at any element
    • Create a separate relation for each root
Basic Inlining Technique: Relation Schemas
article (articleID: integer, article.contactauthor.authorid: string, article.title: string)
article.author (article.authorID: integer, article.author.parentID: integer, article.author.name.firstname: string, article.author.name.lastname: string, article.author.address: string, article.author.authorid: string)
[Figure: DTD graph rooted at article, with children title, contactauthor (attribute authorid) and author (*); author has attribute authorid and children name (firstname ?, lastname) and address.]
Basic Inlining Technique: Pros & Cons
+ Reduces number of joins for queries like "get the first and last names of a book author"
+ Efficient for queries such as "list all authors of books"
- Queries like "list all authors with name Ullman" need a union of 5 queries!
- Large number of relations:
    Unrolling recursive strongly connected components (major)
    Separate relational schema for each element as root (minor)
Shared Inlining Technique
Intuition:
  Inline as many sub-elements as possible.
  Do not inline only if it is a shared, recursive or set sub-element.
  An element node is represented in exactly one relation.
Technique:
  Map the following nodes into relations:
    • Shared: in-degree >= 2 in the DTD graph
    • Root elements: in-degree = 0
Issues with Sharing Elements
Parent of elements not fixed at schema level
Need to store the type and IDs of parents (or whether there are no parents)
  parentCODE field (type of parent)
  parentID field (id of parent)
  Not a foreign key relationship
Shared: Relational Schema
book (bookID: integer, book.booktitle.isroot: boolean, book.booktitle: string)
article (articleID: integer, article.contactauthor.isroot: boolean, article.contactauthor.authorid: string)
monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.editor.isroot: boolean, monograph.editor.name: string)
title (titleID: integer, title.parentID: integer, title.parentCODE: integer, title: string)
author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.name.isroot: boolean, author.name.firstname.isroot: boolean, author.name.firstname: string, author.name.lastname.isroot: boolean, author.name.lastname: string, author.address.isroot: boolean, author.address: string, author.authorid: string)
Shared Inlining Techniques: Pros & Cons
+ Reduces number of joins for queries like “get the first and last names of an author”
+ Efficient for queries such as “list all authors with name Ullman”
- Sharing whenever possible implies extra joins for path expressions
    • "Article with a given title name"
Hybrid Inlining Technique
Inlines some elements that are shared (kept in separate relations) in Shared
  Elements with in-degree >= 2 that are not set sub-elements or recursive
Handles set and recursive sub-elements as in Shared
Hybrid: Relational Schema
book (bookID: integer, book.booktitle.isroot: boolean, book.booktitle: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)
article (articleID: integer, article.contactauthor.isroot: boolean, article.contactauthor.authorid: string, article.title.isroot: boolean, article.title: string)
monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.title: string, monograph.editor.isroot: boolean, monograph.editor.name: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)
author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.name.isroot: boolean, author.name.firstname.isroot: boolean, author.name.firstname: string, author.name.lastname.isroot: boolean, author.name.lastname: string, author.address.isroot: boolean, author.address: string, author.authorid: string)
Hybrid Inlining Technique: Pros & Cons
+ Reduces joins through shared elements (that are not set or recursive elements)
+ Shares some strengths of Shared:
    • Reduces joins for queries like "get first and last names of a book author"
- Requires more SQL sub-queries to retrieve all authors with name Ullman
    • Tradeoff between reducing the number of queries and reducing the number of joins
    • Shared and Hybrid target query reduction and join reduction, respectively
More on Shared and Hybrid
Shared and Hybrid have pros and cons
In many cases, Shared and Hybrid are nearly identical
Number of joins per SQL query ~ path length
  Mainly due to the large number of set nodes
  A problem, as join processing is expensive!
Regular Expressions
Path expression queries can be represented by regular expressions.
Consider path expressions of the following form:
r = (r)* | (r)+ | (r)? | r1/r2 | r1|r2 | r1//r2 | name.
*: 0 or more occurrences
+: 1 or more occurrences
? : 0 or 1 occurrences
r1/r2 : form a path from r1 to r2 (child)
r1//r2 : form a path from r1 to r2 (descendant)
| : disjunction.
SPE to SQL
/member/publication/author/name
Find the names of the authors of all members' publications
select m2.name
from member m1, publication, member m2
where publication.PARENTID = m1.ID
and publication.author = m2.ID
[Figure: schema graph over member, publication, author, ID, name]
member (ID, name, email, PARENTID);
publication (ID, title, author, year, PARENTID);
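The translation above can be tried on a small in-memory database. This is a minimal sketch using Python's `sqlite3` module: the two relations follow the schema on the slide, while the sample rows and the added `ORDER BY` (used only to make the output deterministic) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE member (ID INTEGER PRIMARY KEY, name TEXT, email TEXT, PARENTID INTEGER);
CREATE TABLE publication (ID INTEGER PRIMARY KEY, title TEXT, author INTEGER,
                          year INTEGER, PARENTID INTEGER);
-- Hypothetical sample rows, not taken from the lecture.
INSERT INTO member VALUES (4, 'Alice', 'alice@example.org', 2),
                          (5, 'Bob', 'bob@example.org', 2);
INSERT INTO publication VALUES (10, 'Paper A', 5, 2003, 4),
                               (12, 'Paper B', 4, 2004, 5);
""")

# /member/publication/author/name: m1 is the member owning the publication,
# m2 is the member referenced as its author.
SPE_SQL = """
SELECT m2.name
FROM member m1, publication, member m2
WHERE publication.PARENTID = m1.ID
  AND publication.author = m2.ID
ORDER BY m2.name
"""
author_names = [row[0] for row in conn.execute(SPE_SQL)]
print(author_names)
```

Each step of the simple path expression that crosses a relation boundary costs one join, which is why the number of joins grows with the path length.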
RPE Expansion
List the title of publications for all projects
project//publication
Substitute //:
project/member/(project/member)*/publication | project/(member/project)*/publication
[Figure: schema graph over project, member, author, publication, ID, name, title, year]
RPE Expansion
List the title of publications for all projects
project/member/(project/member)*/publication/title | project/(member/project)*/publication/title
Expanding *:
select project.publication.title union
select project.member.publication.title union
select project.member.project.publication.title
[Figure: schema graph over project, member, author, publication, ID, name, title, year]
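The expansion of * into a union of simple paths can be sketched as a bounded enumeration. The helper names `expand_star` and `expand_descendant_title` and the `max_repeats` bound are hypothetical; the lecture does not say how the bound is chosen at this point.

```python
def expand_star(prefix, cycle, suffix, max_repeats):
    """Enumerate prefix/(cycle)*/suffix for 0..max_repeats repetitions."""
    paths = []
    for k in range(max_repeats + 1):
        steps = prefix + cycle * k + suffix
        paths.append("/".join(steps))
    return paths

def expand_descendant_title(max_repeats):
    """Simple paths for project//publication/title up to a repetition bound."""
    tail = ["publication", "title"]
    # project/(member/project)*/publication/title
    branch_a = expand_star(["project"], ["member", "project"], tail, max_repeats)
    # project/member/(project/member)*/publication/title
    branch_b = expand_star(["project", "member"], ["project", "member"], tail, max_repeats)
    # Shortest paths first; ties broken alphabetically.
    return sorted(set(branch_a + branch_b), key=lambda p: (p.count("/"), p))

for path in expand_descendant_title(1):
    print(path)
```

With `max_repeats = 1` the first three paths correspond to the three `select` branches listed on the slide; a larger bound simply adds longer branches to the union.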
Recursive Path Expression Queries to SQL
Some DBMSs support least-fixed-point computation, e.g., the WITH statement in DB2.
project/(member/project)*/publication
WITH R(PARENTID, ID) AS (
  select m.PARENTID, p1.ID
  from member m, project p1
  where m.ID = p1.PARENTID
  UNION ALL
  select R.PARENTID, p1.ID
  from R, member m, project p1
  where R.ID = m.PARENTID and m.ID = p1.PARENTID)
select p3.*
from project p2, R, publication p3
where p2.ID = R.PARENTID and R.ID = p3.PARENTID;
[Figure: schema graph over project, member, publication]
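The WITH query above can be exercised outside DB2 as well. The sketch below uses SQLite's `WITH RECURSIVE` through Python's `sqlite3` module; the three-column tables, the sample rows, and the assumption that `publication.PARENTID` always refers to a project are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (ID INTEGER, PARENTID INTEGER);
CREATE TABLE member (ID INTEGER, PARENTID INTEGER);
CREATE TABLE publication (ID INTEGER, title TEXT, PARENTID INTEGER);
-- Hypothetical nesting: project 1 > member 2 > project 3 > member 4 > project 5.
INSERT INTO project VALUES (1, NULL), (3, 2), (5, 4);
INSERT INTO member VALUES (2, 1), (4, 3);
INSERT INTO publication VALUES (10, 'P-top', 1), (11, 'P-mid', 3), (12, 'P-deep', 5);
""")

# R pairs each project with every project reachable below it through
# one or more member/project steps (least fixed point of the join).
RPE_SQL = """
WITH RECURSIVE R(PARENTID, ID) AS (
  SELECT m.PARENTID, p1.ID
  FROM member m, project p1
  WHERE m.ID = p1.PARENTID
  UNION ALL
  SELECT R.PARENTID, p1.ID
  FROM R, member m, project p1
  WHERE R.ID = m.PARENTID AND m.ID = p1.PARENTID
)
SELECT p2.ID, p3.title
FROM project p2, R, publication p3
WHERE p2.ID = R.PARENTID AND R.ID = p3.PARENTID
ORDER BY p2.ID, p3.title
"""
rows = conn.execute(RPE_SQL).fetchall()
print(rows)
```

In this sketch R only holds pairs connected by at least one member/project step, so the publication attached directly to project 1 ('P-top') does not appear in the result for project 1.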
Expanding Recursive Path Expression Queries
Expanding wild cards before sending to DBMS
Transitive closure operation is not always supported by RDBMS
Transitive closure with arbitrary nesting seems not supported
Can handle nested recursive queries (though DB2 does not support it)
How many SQL statements are required?
Executing SQL until empty result returned
VXMLR approach: keep statistics [Zhou et. al. VLDB 2001]
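The "execute until an empty result is returned" strategy can be sketched as a loop that issues one expanded SQL statement per repetition count of (member/project). Everything below is an assumption made for illustration: the tables and rows, the helper names `build_sql` and `titles_by_expansion`, the restriction to root projects, and the use of a LEFT JOIN so that an empty result means no project exists at that depth. It also assumes tree-shaped data, otherwise the loop would not terminate.

```python
import sqlite3

def build_sql(k):
    """SQL for project/(member/project){k}/publication, starting at root projects."""
    parts = ["FROM project p0"]
    for i in range(1, k + 1):
        parts.append(f"JOIN member m{i} ON m{i}.PARENTID = p{i - 1}.ID")
        parts.append(f"JOIN project p{i} ON p{i}.PARENTID = m{i}.ID")
    # LEFT JOIN keeps projects without publications, so "no rows" really
    # means "no project at depth k".
    parts.append(f"LEFT JOIN publication pub ON pub.PARENTID = p{k}.ID")
    return (f"SELECT p{k}.ID, pub.title\n" + "\n".join(parts)
            + "\nWHERE p0.PARENTID IS NULL\nORDER BY pub.title")

def titles_by_expansion(conn):
    """Run one statement per depth until a statement returns no rows."""
    titles, statements, k = [], 0, 0
    while True:
        rows = conn.execute(build_sql(k)).fetchall()
        statements += 1
        if not rows:
            return titles, statements
        titles.extend(title for _, title in rows if title is not None)
        k += 1

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (ID INTEGER, PARENTID INTEGER);
CREATE TABLE member (ID INTEGER, PARENTID INTEGER);
CREATE TABLE publication (ID INTEGER, title TEXT, PARENTID INTEGER);
-- Hypothetical nesting: project 1 > member 2 > project 3 > member 4 > project 5.
INSERT INTO project VALUES (1, NULL), (3, 2), (5, 4);
INSERT INTO member VALUES (2, 1), (4, 3);
INSERT INTO publication VALUES (10, 'P-top', 1), (11, 'P-mid', 3), (12, 'P-deep', 5);
""")
titles, statements = titles_by_expansion(conn)
print(titles, statements)
```

On this sample the loop issues four statements, the last of which comes back empty. Keeping statistics about the maximum nesting depth, as in the VXMLR approach cited above, would let a system know the number of statements in advance instead of probing for it.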
Query Translation for Structural Mapping
Translating XML-QL into SQL [Shanmugasundaram, et al, VLDB99]
Simple Path Expressions to SQL
Simple Recursive Path Expressions to SQL
Arbitrary Path Expressions to Simple Recursive Path Expressions
Discussion based on Shared approach
Queries with Simple Path Expressions
WHERE <book>
  <booktitle> The Selfish Gene </booktitle>
  <author> <name>
    <firstname> $f </firstname>
    <lastname> $l </lastname>
  </name> </author>
</book> IN * CONFORMING TO pubs.dtd
CONSTRUCT <result> $f $l </result>
Select A."author.name.firstname", A."author.name.lastname"
From author A, book B
Where B.bookID = A.parentID
AND A.parentCODE = 0
AND B."book.booktitle" = 'The Selfish Gene'
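To see how the parentID / parentCODE pair selects the right parent relation, the SQL above can be run on a cut-down version of the Shared schema. This is a sketch with assumptions: only the columns used by the query are created, the rows are made up for illustration, and parentCODE 0 is taken to mean "the parent is a book" as the query implies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE book (bookID INTEGER, "book.booktitle" TEXT);
CREATE TABLE author (authorID INTEGER, parentID INTEGER, parentCODE INTEGER,
                     "author.name.firstname" TEXT, "author.name.lastname" TEXT);
-- Hypothetical rows. Author 2 shares parentID 1 but has a different
-- parentCODE, i.e. its parent lives in another relation.
INSERT INTO book VALUES (1, 'The Selfish Gene'), (2, 'Another Title');
INSERT INTO author VALUES
  (1, 1, 0, 'Richard', 'Dawkins'),
  (2, 1, 1, 'Jane', 'Doe'),
  (3, 2, 0, 'John', 'Roe');
''')

SIMPLE_PATH_SQL = '''
SELECT A."author.name.firstname", A."author.name.lastname"
FROM author A, book B
WHERE B.bookID = A.parentID
  AND A.parentCODE = 0
  AND B."book.booktitle" = 'The Selfish Gene'
'''
names = conn.execute(SIMPLE_PATH_SQL).fetchall()
print(names)
```

Without the `parentCODE = 0` condition, author 2 would be joined to book 1 by accident, since IDs are only unique within one parent relation in this sketch.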
Queries with Recursive Path Expressions
WHERE <*.monograph>
  <editor.(monograph.editor)*>
    <name> $n </name>
  </>
  <title> Subclass Cirripedia </title>
</> IN * CONFORMING TO pubs.dtd
CONSTRUCT <result> $n </result>
With Q1 (monographID, name) AS
 (Select X.monographID, X."editor.name"
  From monograph X
  Where X.title = 'Subclass Cirripedia'
  UNION ALL
  Select Z.monographID, Z."editor.name"
  From Q1 Y, monograph Z
  Where Y.monographID = Z.parentID AND Z.parentCODE = 0)
Select A.name From Q1 A
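The recursive query above can be checked on a toy monograph relation. The sketch below uses SQLite's `WITH RECURSIVE`; the column subset, the sample rows, and the reading of parentCODE 0 as "the parent is another monograph" are assumptions made for illustration, and `ORDER BY` is added only to fix the output order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE monograph (monographID INTEGER, parentID INTEGER, parentCODE INTEGER,
                        title TEXT, "editor.name" TEXT);
-- Hypothetical rows: 2 and 3 are nested below 1; 4 has a parent of another type.
INSERT INTO monograph VALUES
  (1, NULL, NULL, 'Subclass Cirripedia', 'Editor A'),
  (2, 1, 0, 'Nested volume', 'Editor B'),
  (3, 2, 0, 'Deeper volume', 'Editor C'),
  (4, 1, 1, 'Other parent type', 'Editor D'),
  (5, NULL, NULL, 'Unrelated', 'Editor E');
''')

RECURSIVE_SQL = '''
WITH RECURSIVE Q1 (monographID, name) AS (
  SELECT X.monographID, X."editor.name"
  FROM monograph X
  WHERE X.title = 'Subclass Cirripedia'
  UNION ALL
  SELECT Z.monographID, Z."editor.name"
  FROM Q1 Y, monograph Z
  WHERE Y.monographID = Z.parentID AND Z.parentCODE = 0
)
SELECT A.name FROM Q1 A ORDER BY A.name
'''
editor_names = [row[0] for row in conn.execute(RECURSIVE_SQL)]
print(editor_names)
```

The seed of Q1 is the monograph with the requested title, and every recursive step follows one more monograph/editor nesting level, which mirrors the `(monograph.editor)*` part of the XML-QL pattern.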
Queries with Arbitrary Path Expressions
Split a complex path expression into (possibly many) simple recursive path expressions
Has the effect of splitting a single XML-QL query into (possibly many) SQL queries
Can handle nested recursive queries
WHERE <(article|monograph).$*.name> $n </>
CONSTRUCT <name> $n </>
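The splitting step can be illustrated for the disjunction in the query above. This is only a sketch of one part of the idea: the function name `split_alternations` is hypothetical, it handles just top-level `(a|b)` steps in a dot-separated path, and it leaves `$*` untouched rather than resolving it against the DTD.

```python
import itertools
import re

def split_alternations(path):
    """Split a dotted path with (a|b) steps into alternation-free paths."""
    choices = []
    for step in path.split("."):
        group = re.fullmatch(r"\((.+)\)", step)
        # A parenthesised step contributes one option per alternative.
        choices.append(group.group(1).split("|") if group else [step])
    return [".".join(combo) for combo in itertools.product(*choices)]

print(split_alternations("(article|monograph).$*.name"))
```

Each resulting path would then be translated separately, which is how a single XML-QL query can turn into several SQL queries.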
References (1)
[Aboulnaga, Alameldeen et al, VLDB01] Ashraf Aboulnaga, Alaa R. Alameldeen, and Jeffrey F. Naughton. Estimating the selectivity of XML path expressions for Internet scale applications. VLDB 2001.
[Bohannon et al, ICDE 2002] P. Bohannon, J. Freire, P. Roy, and J. Simeon. From XML schema to relations: A cost-based approach to XML storage. In Proceedings of ICDE, 2002.
[Boag, Chamberlin et al, W3C 2003] Scott Boag, Don Chamberlin, Mary F. Fernández, Daniela Florescu, Jonathan Robie, Jérôme Siméon. XQuery 1.0: An XML Query Language. http://www.w3.org/TR/xquery
[Bruno et al, SIGMOD02] N. Bruno, N. Koudas, D. Srivastava. Holistic twig joins: Optimal XML pattern matching. In SIGMOD Int'l Conf. on Management of Data, pages 310-321, 2002.
[Chen, Jagadish et al, ICDE01] Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting twig matches in a tree. In Proceedings of the IEEE International Conference on Data Engineering, pages 595-604, 2001.
[Cohen, Kaplan et al, PODS02] E. Cohen, H. Kaplan, T. Milo. Labeling dynamic XML trees. In Symposium on Principles of Database Systems (PODS), pages 271-281, 2002.
[Clark, DeRose, W3C 1999] James Clark and Steven DeRose. XML Path Language (XPath) Version 1.0. http://www.w3.org/TR/xpath
References (2)
[Ceri, Comai et al, WWW99] S. Ceri, S. Comai, E. Damiani, P. Fraternali, S. Paraboschi, and L. Tanca. XML-GL: a graphical language for querying and restructuring WWW data. In International World Wide Web Conference (WWW), Toronto, Canada, May 1999.
[Chamberlin, Robie et al, 2000] Don Chamberlin, Jonathan Robie, and Daniela Florescu. Quilt: An XML query language for heterogeneous data sources. In Proceedings of the Third International Workshop on the Web and Databases, May 2000.
[Chamberlin, Draper et al, 2003] Don Chamberlin, Denise Draper, Mary Fernandez, Michael Kay, Jonathan Robie, Michael Rys, Jerome Simeon, Jim Tivy, Philip Wadler. Editor: Howard Katz. XQuery from the Experts: A Guide to the W3C XML Query Language. Addison-Wesley Press, 2003.
[Chaudhri, Rashid et al, 2003] Akmal B. Chaudhri, Awais Rashid, Roberto Zicari. XML Data Management: Native XML and XML-Enabled Database Systems. Addison-Wesley Press, 2003.
[Chen, Lim et al, SIGMOD03] Qun Chen, Andrew Lim and Kian Win Ong. D(k)-index: An adaptive structural summary for graph-structured data. In SIGMOD'03, pages 134-144.
[Chien, Vagena and Zhang et al, VLDB02] S.-Y. Chien, Z. Vagena, D. Zhang, V. Tsotras, and C. Zaniolo. Efficient structural joins on indexed XML documents. In VLDB02, pages 263-274, 2002.
[Chung, Min et al, SIGMOD02] C-W Chung, J-K Min, K. Shim. APEX: An Adaptive Path Index for XML Data. In SIGMOD'02, 2002.
References (3)
[Deutsch, Fernandez et al, QL98] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A query language for XML. In M. Marchiori, editor, QL'98--The Query Languages Workshop. W3C, Dec. 1998. http://www.w3.org/TR/1998/NOTE-xml-ql-19980819/
[Deutsch, Fernandez, and Suciu, SIGMOD99] A. Deutsch, M. Fernandez, and D. Suciu. Storing Semistructured Data with STORED. In Proc. of the ACM SIGMOD Conference on Management of Data, June 1999.
[Dietz STOC 82] Paul F. Dietz. Maintaining order in a linked list. STOC 1982.
[Grust SIGMOD02] Torsten Grust. Accelerating XPath Location Steps. In Proc. of the 21st ACM SIGMOD Conference, pages 109-120, Madison, Wisconsin, USA, June 2002. ACM Press.
[Jiang, Lu, Wang and Ooi, ICDE03] Haifeng Jiang, Hongjun Lu, Wei Wang, Beng Chin Ooi. XR-Tree: Indexing XML Data for Efficient Structural Joins. The 19th International Conference on Data Engineering (ICDE 2003), pages 253-264, Bangalore, India, March 5-8, 2003.
[Jiang, Wang, Lu and Yu, VLDB03] Haifeng Jiang, Wei Wang, Hongjun Lu, Jeffrey Xu Yu. Holistic Twig Joins on Indexed XML Documents. The 29th International Conference on Very Large Data Bases (VLDB 2003), pages 273-284, Berlin, Germany, September 9-12, 2003.
References (4)
[Kaushik, Shenoy, ICDE02] Raghav Kaushik, Pradeep Shenoy, Philip Bohannon and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. In ICDE'02, pages 129-140.
[Kha et al, ICDE01] Dao Dinh Kha, Masatoshi Yoshikawa, and Shunsuke Uemura. An XML indexing structure with relative region coordinate. ICDE 2001.
[Krishnamurthy et al, 2003] R. Krishnamurthy, R. Kaushik, J. Naughton. XML-to-SQL Query Translation Literature: The State of the Art and Open Problems. XML Database Symposium (XSym), Sep 2003.
[Li and Moon, VLDB01] Quanzhong Li and Bongki Moon. Indexing and querying XML data for regular path expressions. VLDB 2001.
[Milo, Suciu, ICDT99] Tova Milo and Dan Suciu. Index structures for path expressions. In ICDT'99, pages 277-295, January 1999.
[Lee, Srivastava DASFAA04] Dongwon Lee and Divesh Srivastava. Counting relaxed twig matches in a tree. DASFAA 2004.
[Lim, Wang et al, VLDB02] Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, and Ronald Parr. XPathLearner: An on-line self-tuning Markov histogram for XML path selectivity estimation. VLDB 2002.
[Lee, Yoo et al, 1996] Yong Kyu Lee, Seong-Joon Yoo, Kyoungro Yoon, and P. Bruce Berra. Index structures for structured documents. In Proceedings of the ACM Conference on Digital Libraries, 1996.
References (5)
[Manolescu, Florescu et al, 2001] I. Manolescu, D. Florescu, and D. Kossmann. Pushing XML queries inside relational databases. Tech. Report no. 4112, INRIA, 2001.
[Manolescu, Florescu et al, VLDB01] I. Manolescu, D. Florescu, and D. Kossmann. Answering XML queries over heterogeneous data sources. In Proceedings of the International Conference on Very Large Data Bases (VLDB), Rome, Italy, September 2001.
[Meier, 2002] Wolfgang Meier. eXist: An open source native XML database. In Web, Web-Services, and Database Systems 2002, 2002.
[McHugh, Widom, VLDB99] Jason McHugh and Jennifer Widom. Query optimization for XML. VLDB 1999.
[Polyzotis, Garofalakis SIGMOD02] Neoklis Polyzotis and Minos N. Garofalakis. Statistical synopses for graph-structured XML databases. SIGMOD 2002.
[Polyzotis, Garofalakis VLDB02] Neoklis Polyzotis and Minos N. Garofalakis. Structure and value synopses for XML data graphs. VLDB 2002.
[Robie, Lapp, QL98] J. Robie, J. Lapp, and D. Schach. XML query language (XQL). In M. Marchiori, editor, QL'98--The Query Languages Workshop. W3C, Dec. 1998. http://www.w3.org/TandS/QL/QL98/pp/xql.html
References (6)
[Schmidt et. al. WebDB00] A. Schmidt, M. L. Kersten, M. Windhouwer, and F. Waas. Efficient relational storage and retrieval of XML documents. In WebDB (Informal Proceedings), pages 47-52, 2000.
[Shanmugasundaram, Tufte, He, Zhang, DeWitt, and Naughton, VLDB99] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, and Jeffrey F. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In Proceedings of 25th International Conference on Very Large Data Bases (VLDB'99), pages 302-314. Morgan Kaufmann, 1999.
[Shanmugasundaram et. al. VLDB 99] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, Jeffrey F. Naughton. Relational Databases for Querying XML Documents: Limitations and Opportunities. VLDB 1999: 302-314.
[Srivastava, Al-Khalifa et al, ICDE02] D. Srivastava, S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, and Y. Wu. Structural joins: A primitive for efficient XML query pattern matching. In ICDE, pages 141-152, 2002.
[Tatarinov, Viglas et al, SIGMOD02] Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. SIGMOD 2002.
[Wang, Jiang et al, SIGMOD03] Wei Wang, Haifeng Jiang, Hongjun Lu, Jeffrey Xu Yu. Containment Join Size Estimation: Models and Methods. The 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD03), San Diego, California, June 9 - June 12, 2003, pages 145-156.
References (7)
[Wang, Jiang et al, ICDE03] Wei Wang, Haifeng Jiang, Hongjun Lu, and Jeffrey Xu Yu. PBiTree coding and efficient processing of containment joins. ICDE 2003.
[Wu et al., EDBT02] Yuqing Wu, Jignesh Patel, H. V. Jagadish. Using Histograms to Estimate Answer Size for XML Queries. Information Systems 28 (1-2): 33-59 (2003) -- Special Issue: Best Papers from EDBT 2002.
[Wu et al, ICDE03] Yuqing Wu, Jignesh Patel and H. V. Jagadish. Structural Join Order Selection for XML Query Optimization. ICDE 2003.
[Yoshikawa, et. al. TOIT01] Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki Shimura, Shunsuke Uemura. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Trans. Internet Techn. 1(1): 110-141 (2001).
[Zhou et. al. VLDB 2001] Aoying Zhou, Hongjun Lu, Shihui Zheng, Yuqi Liang, Long Zhang, Wenyun Ji, Zengping Tian. VXMLR: A Visual XML-Relational Database System. VLDB 2001: 719-720.
[Zhang, Naughton SIGMOD01] Chun Zhang, Jeffrey F. Naughton, David J. DeWitt, Qiong Luo, and Guy M. Lohman. On supporting containment queries in relational database management systems. SIGMOD 2001.