semistructured data and xml · 2010-11-15 · 7 the semistructured data model arc labels...
TRANSCRIPT
SEMISTRUCTURED DATA AND XML
2222
HOW THE WEB IS TODAY
HTML documents
often generated by applications
consumed by humans only
easy access: across platforms, across organizations
only layout, no semantic information
No application interoperability:
HTML not understood by applications
screen scraping brittle
Database technology: client-server
still vendor specific
3333
XML DATA EXCHANGE FORMAT
A standard from the W3C (World Wide WebConsortium, http://www.w3.org).
The mission of the W3C
„. . . developing common protocols that promoteits evolution and ensure its interoperability. . .“.
Basic ideas XML = data
XML generated by applications
XML consumed by applications
Easy access: across platforms, organizations.
4444
PARADIGM SHIFT ON THE WEB
For web search engines:
From documents (HTML) to data (XML)
From document management to documentunderstanding (e.g., question answering)
From information retrieval to data management
For database systems:
From relational (structured) model to semistructureddata
From data processing to data /query translation
From storage to transport
5555
THE SEMISTRUCTURED DATA MODEL
&o1
&o12 &o24 &o29
&o43&96
&243 &206
&25
“Serge”“Abiteboul”
1997
“Victor”“Vianu”
122 133
paperbook
paper
references
referencesreferences
authortitle
yearhttp
author
authorauthor
title publisherauthor
authortitle
page
firstnamelastname
firstname lastname firstlast
Bib
Object ExchangeModel (OEM)complex object
atomic object
6666
THE SEMISTRUCTURED DATA MODEL
Data is self-describing, i.e. the data description isintegrated with the data itself rather than in aseparate schema.
Database is a collection of nodes and arcs(directed graph).
Leaf nodes represent data of some atomic type(atomic objects, such as numbers or strings).
Interior nodes represent complex objects consistingof components (child nodes), connected by arcs tothis node.
Arcs are directed and connect two nodes.
7777
THE SEMISTRUCTURED DATA MODEL
Arc labels indicates the relationship between thetwo corresponding nodes.
The root node is the only interior node without in-arcs, representing the entire database.
All database objects are children of the root node.
Every node must be reachable from the root.
A general graph structure is possible, i.e. thegraph need not be a tree structure.
8888
SYNTAX FOR SEMISTRUCTURED DATA
Bib: &o1 { paper: &o12 { … },
book: &o24 { … },
paper: &o29
{ author: &o52 “Abiteboul”,
author: &o96 { firstname: &243 “Victor”,
lastname: &o206 “Vianu”},
title: &o93 “Regular path queries with constraints”,
references: &o12,
references: &o24,
pages: &o25 { first: &o64 122, last: &o92 133}
}
}
Observe: Nested tuples, set-values, oids!
9999
SYNTAX FOR SEMISTRUCTURED DATA
May omit oids:
{ paper: { author: “Abiteboul”,
author: { firstname: “Victor”,
lastname: “Vianu”},
title: “Regular path queries …”,
page: { first: 122, last: 133 }
}
}
10101010
VS. RELATIONAL MODEL
Missing attributes
Additional attributes
Multiple attribute values (set-valued attributes)
Objects as attribute values
No global schema
only the first characteristics supported by relationalmodel, all others are not
11111111
VS. RELATIONAL MODEL
Semistructured data
Self-describing,
Irregular data,
No a-priori structure.
Relational DB
Separate schema,
Regular data,
A-priori structure.
XML
13131313
IMPORTANT XML STANDARDS
XSL/XSLT: presentation and transformationstandards
RDF: resource description framework (meta-infosuch as ratings, categorizations, etc.)
Xpath/Xpointer/Xlink: standard for linking todocuments and elements within
Namespaces: for resolving name clashes DOM: Document Object Model for manipulating
XML documents SAX: Simple API for XML parsing XQuery: query language
14141414
XML
A W3C standard to complement HTML
Origins: Structured text SGML
Large-scale electronic publishing
Data exchange on the web
Motivation:
HTML describes presentation
XML describes content
http://www.w3.org/TR/2000/REC-xml-20001006 (version 2,10/2000)
SGMLXMLHTML4.0
15151515
FROM HTML TO XML
HTML describes the presentation
16161616
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
HTML describes the presentation
17171717
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML describes the content
18181818
WHY ARE WE DB’ERS INTERESTED?
It’s data. That’s us.
Database issues: How are we going to model XML? (graphs).
How are we going to query XML? (XQuery)
How are we going to store XML (in a relationaldatabase? object-oriented? native?)
How are we going to process XML efficiently? (manyinteresting research questions!)
19191919
ELEMENTS
Tagsbook, title, author, …
start tag: <book>, end tag: </book>
defined by user / programmer (different from HTML!)
Elements <book>…<book>,<author>…</author>
An element consists of a matching start and end tagand the enclosed content.
Elements can be nested, i.e. content of one elementcan consist of sequence of other elements.
20202020
ATTRIBUTES
Attributes can be associated with any element.
Provide additional information about elements.
Attributes can have only one value.
Example
<book price = “55” currency = “USD”>
<title> Foundations of Databases </title>
<author> Abiteboul </author>
…
<year> 1995 </year>
</book>
Attributes can also be used to connect elements.
21212121
NON-TREE-LIKE XML
So far: only tree-like XML documents,i.e. each element is nested within at most oneother element.
Attributes can also be used to create non-treeXML documents.
Attributes with a domain of ID serve as primarykeys of elements.
Attributes with a domain of IDREF serve asforeign keys referencing the ID of anotherelement.
22222222
NON-TREE-LIKE XML
Example of a non-tree structure<persons>
<person personid=“o555”><name> Jane </name>
</person>
<person personid=“o456”>
<name> Mary </name>
<children refs=“o123 o555”</children >
</person>
<person personid=“o123” mother=“o456”>
<name>John</name>
</person>
</persons>
23232323
NAMESPACES
An XML document can involve tags that comefor multiple sources.
One and the same tag can appear in more thanone source.
<table> <tr>
<td>Apples</td>
<td>Bananas</td>
</tr> </table>
<table>
<name>African Coffee Table</name>
<width>80</width>
<length>120</length>
</table>
24242424
NAMESPACES
Name conflicts can be resolved by prefixing tagnames according to their source.<h:table>
<h:tr> <h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>
</h:table>
<f:table><f:name>African Coffee Table</f:name><f:width>80</f:width><f:length>120</f:length>
</f:table>
When using prefixes in XML, a namespace forthe prefix must be defined.
The namespace must be referenced (via an URI)in the start tag of an enclosing element .
25252525
WELL-FORMED XML
A well-formed XML document satisfies the
following conditions:
Begins with a declaration that it is XML.
Has a single root element that encloses the whole
document.
Consists of properly nested elements, i.e. start and end
tag of an element are within the same enclosing
element.
standalone =“yes” states that document has no
DTD.
In this mode, you can invent your own tags, like in
semistructured data model.
26262626
WELL-FORMED XML
<?XML version=“1.0” standalone =“yes” ?><bibliography>
<book> <title> Foundations… </title><author> Abiteboul </author><author> Hull </author><author> Vianu </author><publisher> Addison Wesley </publisher><year> 1995 </year>
</book><book> <title> … </title>
. . .</book>…
</bibliography>
27272727
WELL-FORMED XML
HTML browsers will display documents with
errors (like missing end tags).
The W3C XML specification states that a
program should stop processing an XML
document if it finds an error.
The main reason is that XML is being consumed
by programs rather than by humans (as HTML).
W3C provides a validator that checks whether an
XML document is well-formed.
28282828
VALID XML
The validator can also check whether an XML
document is valid, i.e. conforms to a Document
Type Definition (DTD).
A DTD specifies the allowable tags and how they
can be nested.
XML with a DTD is no longer semistructured
(self-describing).
However, a DTD is less rigid than the schema of a
relational DB. E.g., a DTD allows missing and
multiple attributes / elements.
DTD
30303030
DOCUMENT TYPE DEFINITIONS
Document Type Definition (DTD): set of rules
(grammar) specifying elements, attributes and all
other aspects of XML documents.
For each element, specify name and content type.
Content type can, e.g., be
#PCDATA (character string),
other elements,
regular expression made of the above content types* = zero or more occurrences? = zero or one occurrence+ = one or more occurrences, = sequence of elements.
31313131
DOCUMENT TYPE DESCRIPTORS
Sort of like a schema but not really.
Inherited from SGML DTD standard
BNF grammar establishing constraints on elementstructure and content
Definitions of entities
<!ELEMENT Book (title, author*) >
<!ELEMENT title #PCDATA><!ELEMENT author (name, address,age?)>
<!ATTLIST Book id ID #REQUIRED><!ATTLIST Book pub IDREF #IMPLIED>
32323232
EXAMPLE DTD: PRODUCT CATALOG
<!DOCTYPE CATALOG [
<!ELEMENT CATALOG (PRODUCT+)>
<!ELEMENT PRODUCT (SPECIFICATIONS+,OPTIONS?,PRICE+,NOTES?)>
<!ATTLIST PRODUCT NAME CDATA #IMPLIED
CATEGORY (HandTool|Table|Shop-Professional) "HandTool"
PARTNUM CDATA #IMPLIED
PLANT (Pittsburgh|Milwaukee|Chicago) "Chicago"
INVENTORY (InStock|Backordered|Discontinued) "InStock">
<!ELEMENT SPECIFICATIONS (#PCDATA)>
<!ATTLIST SPECIFICATIONS WEIGHT CDATA #IMPLIED
POWER CDATA #IMPLIED>
<!ELEMENT OPTIONS (#PCDATA)>
<!ATTLIST OPTIONS FINISH (Metal|Polished|Matte) "Matte"
ADAPTER (Included|Optional|NotApplicable) "Included"
CASE (HardShell|Soft|NotApplicable) "HardShell">
<!ELEMENT PRICE (#PCDATA)>
<!ATTLIST PRICE MSRP CDATA #IMPLIED
WHOLESALE CDATA #IMPLIED
STREET CDATA #IMPLIED
SHIPPING CDATA #IMPLIED>
<!ELEMENT NOTES (#PCDATA)> ]>
33333333
SHORTCOMINGS OF DTDS
Useful for documents, but not so good for data:
Element name and type are associated globally
No support for structural re-use Object-oriented-like structures aren’t supported
No support for data types Can’t do data validation
Can have a single key item (ID), but: No support for multi-attribute keys
No support for foreign keys (references to other keys)
No constraints on IDREFs (reference only a Section)
XML SCHEMA
35353535
XML SCHEMA
The successor of DTDs to specify a schema for
XML documents.
A W3C standard.
Includes and extends functionality of DTDs.
In particular, XML Schemas support data types.
This makes it easier to validate the correctness of
data and to work with data from a database.
XML Schemas are written in XML. You don't have
to learn a new language and can use your XML
parser to parse your Schema files.
36363636
EXAMPLE XML SCHEMA
<schema version=“1.0”xmlns=“http://www.w3.org/1999/XMLSchema”>
<element name=“author” type=“string” /><element name=“date” type = “date” /><element name=“abstract”><type> … </type>
</element><element name=“paper”><type><attribute name=“keywords” type=“string”/><element ref=“author” minOccurs=“0”maxOccurs=“*” />
<element ref=“date” /><element ref=“abstract” minOccurs=“0”maxOccurs=“1” />
<element ref=“body” /></type>
</element></schema>
37373737
SIMPLE ELEMENTS
Simple elements contain only text.
They can have one of the built-in datatypes:
xs:string, xs:decimal, xs:integer, xs:boolean
xs:date, xs:time.
Example
<xs:element name="lastname“ type="xs:string"/>
<xs:element name="age" type="xs:integer"/>
<xs:element name="dateborn" type="xs:date"/>
38383838
SIMPLE ELEMENTS
Restrictions allow you to further constrain the
content of simple elements.
<xs:element name="age">
<xs:simpleType>
<xs:restriction base="xs:integer">
<xs:minInclusive value="0"/>
<xs:maxInclusive value="120"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
39393939
ATTRIBUTES
Attributes can be specified using the attribute
element:
<xs:attribute name="xxx" type="yyy"/>
Attribute elements are nested within the element
of the element with which they are associated.
By default, attributes are optional.
To make an attribute mandatory, use
<xs:attribute name="lang“ type="xs:string“use="required"/>
Attributes can have the same built-in datatypes
as simple elements.
40404040
COMPLEX ELEMENTS
Complex elements can contain other elements and can
have attributes.
Nested elements need to occur in the order specified.
The number of repetitions of elements are controlled by
the attributes minOccurs and maxOccurs. The default
is one repetition.
A complex element with an attribute:
<xs:element name="product">
<xs:complexType>
<xs:attribute name="prodid" type="xs:positiveInteger"/>
</xs:complexType>
</xs:element>
41414141
COMPLEX ELEMENTS
A complex element containing a sequence of
nested (simple) elements:
<xs:element name="employee">
<xs:complexType>
<xs:sequence>
<xs:element name="firstname" type="xs:string"/>
<xs:element name="lastname" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
42424242
COMPLEX ELEMENTS
If you name the complex element, other elements
can reference and include it:
<xs:complexType name="persontype">
<xs:sequence>
<xs:element name="firstname" type="xs:string"/>
<xs:element name="lastname" type="xs:string"/>
</xs:sequence>
</xs:complexType>
<xs:element name="person" type="persontype"/>
43434343
EXAMPLE XML SCHEMA
<schema version=“1.0”xmlns=“http://www.w3.org/1999/XMLSchema”>
<element name=“author” type=“string” /><element name=“date” type = “date” /><element name=“abstract”><type> … </type>
</element><element name=“paper”><type><attribute name=“keywords” type=“string”/><element ref=“author” minOccurs=“0”maxOccurs=“*” />
<element ref=“date” /><element ref=“abstract” minOccurs=“0”maxOccurs=“1” />
<element ref=“body” /></type>
</element></schema>
44444444
XML VS. SEMISTRUCTURED DATA
Both described best by a graph.
Both are schema-less, self-describing(XML without DTD / XML schema).
XML is ordered, semistructured data is not.
XML can mix text and elements:
<talk> Making Java easier to type and easier to type
<speaker> Phil Wadler </speaker>
</talk>
XML has lots of other stuff: attributes, entities,processing instructions, comments.
XML-PATH = XPATH
46464646
QUERY LANGUAGES FOR XML
XPath is a simple query language based ondescribing similar paths in XML documents.
XQuery extends XPath in a style similar to SQL,introducing iterations, subqueries, etc.
XPath and XQuery expressions are applied to anXML document and return a sequence ofqualifying items.
Items can be primitive values or nodes (elements,attributes, documents).
The items returned do not need to be of the sametype.
47474747
XPATH
A path expression returns the sequence of allqualifying items that are reachable from the inputitem following the specified path.
A path expression is a sequence consisting of tagsor attributes and special characters such asslashes (“/”).
Absolute path expressions are applied to someXML document and returns all elements that arereachable from the document’s root elementfollowing the specified path.
Relative path expressions are applied to anarbitrary node.
48484848
XPATH
<?XML version=“1.0” standalone =“yes” ?>
<bibliography>
<book bookID = “b100“> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher>Addison Wesley </publisher>
<year> 1995</year> </book>
…
</bibliography>
Applied to the above document, the XPath expression/bibliography/book/author returns the sequence
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author> . . .
49494949
ATTRIBUTES
If we do not want to return the qualifying elements, but the valueone of their attributes, we end the path expression with @attribute.
<?XML version=“1.0” standalone =“yes” ?>
<bibliography>
<book bookID = “b100“> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year> </book>
the XPath expression
/bibliography/book/@bookID
returns the sequence
“b100“ . . .
50505050
WILDCARDS
We can use wildcards instead of actual tags andattributes:* means any tag, and@* means any attribute.
Examples/bibliography/*/author returns the sequence
<author> Abiteboul </author>
<author> Hull </author>./bibliography//author/@* returns the sequence
“IBM““a739“.
51515151
PATH EXPRESSIONS
Examples:
Bib.paper
Bib.book.publisher
Bib.paper.author.lastname
Given an OEM instance, the value of a pathexpression p is a set of objects
52525252
PATH EXPRESSIONS
Examples:
DB =&o1
&o12 &o24 &o29
&o43
&o70 &o71
&96
&243 &206
&25
“Serge”“Abiteboul”
1997
“Victor”“Vianu”
122 133
paperbook
paper
references
referencesreferences
authortitle year httpauthor
authorauthor
titlepublis herauthor
authortitle
page
firstnamelastname
firstname lastname firstlast
Bib
&o44 &o45 &o46
&o47 &o48 &o49 &o50 &o51
&o52
Bib.paper={&o12,&o29}
Bib.book.publisher={&o51}
Bib.paper.author.lastname={&o71,&206}
XML-QUERY = XQUERY
54545454
XQUERY
Summary:
FOR-LET-WHERE-ORDERBY-RETURN = FLWOR
FOR/LET Clauses
WHERE Clause
ORDERBY/RETURN Clause
List of tuples
List of tuples
Instance of Xquery data model
55555555
XQUERY
FLWOR expressions are similar to SQL select . .from . . . where . . . queries.
XQuery allows zero, one or more for and letclauses.
The where clause is optional.
There is one optional order-by clause.
Finally, there is exactly one return clause.
XQuery is case-sensitive.
XQuery (and XPath) is a W3C standard.
56565656
XQUERY CLAUSES
for $x in expr
Defines node variable $x.
The expression expr evaluates to a sequence of items.
The variable $x is assigned to each item, in turn, andthe body of the for clause is executed once for eachassignment.
let $x := expr
Defines collection variable $x.
The expression expr evaluates to a sequence of items.
The variable is bound to the entire sequence of items.
Useful for common subexpressions and foraggregations.
57575757
XQUERY CLAUSES
where condition
The condition is a boolean expression.
The clause is applied to some item.
If and only if the condition evaluates to true, thefollowing return clause is executed for that item.
return expression
The result of a FLWOR clause is a sequence of items.
Expression defines the result format for the current(qualifying) item.
The sequence of items produced by expression isappended to the sequence of items produced so far.
58585858
INTERPRETATION AS XQUERY
XQuery expressions can be used wherever anXML expression of any kind is permitted.
Any text string is acceptable as content of a tag orvalue of an attribute.
If a string contains an XQuery expression thatshould be evaluated, this substring must besurrounded by curly brackets {}.
Example
for $b in doc("bib.xml")/bibliography/bookreturn <result id = {$b/@bookID}>{$b/title}</result>
59595959
FOR V.S. LET
Find all books
FOR $x IN document("bib.xml")/bib/book
RETURN <result> $x </result>
Returns:
<result> <book>...</book></result><result> <book>...</book></result><result> <book>...</book></result>...
LET $x IN document("bib.xml")/bib/book
RETURN <result> $x </result>
Returns:
<result> <book>...</book><book>...</book><book>...</book>...
</result>
60606060
XQUERY
Find all book titles published after 1995:
FOR $x IN document("bib.xml")/bib/book
WHERE $x/year > 1995
RETURN $x/title
Result:<title> abc </title><title> def </title><title> ghi </title>
61616161
ORDERING THE QUERY RESULT
The order-by clause allows you to order theresults of an XQuery expression.
order-by list of expressions
The sort order is based on the value of the firstexpression. Ties are broken based on the value ofthe second (if necessary third etc.) expression.
By default, the order is ascending.
A descending sort order can be specified usingdescending.
62626262
ELIMINATION OF DUPLICATES
The built-in function distinct-values eliminatesduplicates from a sequence of result items.
In principle, it applies only to primitive (atomic)types.
It can also be applied to elements, but then it willremove their tags, replacing them by quotes “”.
Example
If return $b/title produces<title> aaa </title> <title> bbb </title><title> aaa </title>
then distinct-values (return $b/title) produces
“aaa” “bbb”.
63636363
XQUERY
For each author of a book by Morgan Kaufmann, listall books she published:
FOR $a IN distinct(document("bib.xml")/bib/book[publisher=“Morgan Kaufmann”]/author)
RETURN <result>
$a,
FOR $t IN /bib/book[author=$a]/title
RETURN $t
</result>
distinct = a function thateliminates duplicates
Result:
<result>
<author>Jones</author>
<title> abc </title>
<title> def </title>
</result>
<result>
<author> Smith</author>
<title> ghi </title>
</result>
64646464
JOINS
We can join two or more documents, by using one
variable for each of the documents .
We let a variable range over the elements of the
corresponding document, within a for-clause.
Need to be careful when comparing elements for
equality, since their equality is by element
identity, not by element content.
Typically, we want to compare the element
content.
The built-in function data(E) returns the content
of an element E.
65656565
XQUERY
Find books whose price is larger than average:
LET $a=avg(document("bib.xml")/bib/book/price)
FOR $b in document("bib.xml")/bib/book
WHERE $b/price > $a
RETURN $b
66666666
SORTING IN XQUERY
<publisher_list>FOR $p IN distinct(document("bib.xml")//publisher)
ORDERBY $pRETURN <publisher> <name> $p/text() </name> ,
FOR $b IN document("bib.xml")//book[publisher = $p]
ORDERBY $b/price DESCENDINGRETURN <book>
$b/title ,$b/price
</book></publisher>
</publisher_list>
67676767
IF-THEN-ELSE
FOR $h IN //holding
ORDERBY $h/titleRETURN <holding>
$h/title,
IF $h/@type = "Journal"
THEN $h/editor
ELSE $h/author
</holding>
68686868
EXISTENTIAL QUANTIFIERS
FOR $b IN //book
WHERE SOME $p IN $b//para SATISFIES
contains($p, "sailing")
AND contains($p, "windsurfing")
RETURN $b/title
69696969
QUANTIFICATION
XQuery supports the existential and the universal
quantifier.
Universal quantifier
every $v in expression1 satisfies expression 2
Existential quantifier
some $v in expression1 satisfies expression 2
Expression1 evaluates to a sequence of items,
expression 2 is a boolean expression.
70707070
AGGREGATION
XQuery provides built-in functions for the
standard aggregations such as SUM, MIN,
COUNT and AVG.
They can be applied to any XQuery expression, i.e.
to any sequence of items.
Example
avg(doc("bib.xml")/bibliography/book/price)
count(doc("bib.xml")/bibliography/book/price)
Computes the average book price and the number of
books, resp.
71717171
XQUERY EXAMPLES
Find books whose price is larger than the average
price.
Uses aggregate operator (avg), applied to the result of
a path expression.
let $a:=avg(doc("bib.xml")/bibliography/book/price)
for $b in doc("bib.xml")/bibliography/book
where $b/price > $a
return $b
72727272
XQUERY EXAMPLES
Find title of books with a paragraph containing the
terms “sailing” and “windsurfing”.
Uses existential quantifier (some) and string
matching (contains).
for $b in doc("bib.xml")//book
where some $p in $b//para satisfies
contains($p, "sailing") and contains($p, "windsurfing")
return $b/title