semistructured data and xml · 2010-11-15 · 7 the semistructured data model arc labels...

SEMISTRUCTURED DATA AND XML

2222

HOW THE WEB IS TODAY

HTML documents

often generated by applications

consumed by humans only

easy access: across platforms, across organizations

only layout, no semantic information

No application interoperability:

HTML not understood by applications

screen scraping brittle

Database technology: client-server

still vendor specific

3333

XML DATA EXCHANGE FORMAT

A standard from the W3C (World Wide WebConsortium, http://www.w3.org).

The mission of the W3C

„. . . developing common protocols that promoteits evolution and ensure its interoperability. . .“.

Basic ideas XML = data

XML generated by applications

XML consumed by applications

Easy access: across platforms, organizations.

4444

PARADIGM SHIFT ON THE WEB

For web search engines:

From documents (HTML) to data (XML)

From document management to documentunderstanding (e.g., question answering)

From information retrieval to data management

For database systems:

From relational (structured) model to semistructureddata

From data processing to data /query translation

From storage to transport

5555

THE SEMISTRUCTURED DATA MODEL

&o1

&o12 &o24 &o29

&o43&96

&243 &206

&25

“Serge”“Abiteboul”

1997

“Victor”“Vianu”

122 133

paperbook

paper

references

referencesreferences

authortitle

yearhttp

author

authorauthor

title publisherauthor

authortitle

page

firstnamelastname

firstname lastname firstlast

Bib

Object ExchangeModel (OEM)complex object

atomic object

6666


Data is self-describing, i.e. the data description isintegrated with the data itself rather than in aseparate schema.

Database is a collection of nodes and arcs(directed graph).

Leaf nodes represent data of some atomic type(atomic objects, such as numbers or strings).

Interior nodes represent complex objects consistingof components (child nodes), connected by arcs tothis node.

Arcs are directed and connect two nodes.

7777


Arc labels indicates the relationship between thetwo corresponding nodes.

The root node is the only interior node without in-arcs, representing the entire database.

All database objects are children of the root node.

Every node must be reachable from the root.

A general graph structure is possible, i.e. thegraph need not be a tree structure.

8888

SYNTAX FOR SEMISTRUCTURED DATA

Bib: &o1 { paper: &o12 { … },

book: &o24 { … },

paper: &o29

{ author: &o52 “Abiteboul”,

author: &o96 { firstname: &243 “Victor”,

lastname: &o206 “Vianu”},

title: &o93 “Regular path queries with constraints”,

references: &o12,

references: &o24,

pages: &o25 { first: &o64 122, last: &o92 133}

}

}

Observe: Nested tuples, set-values, oids!

9999

SYNTAX FOR SEMISTRUCTURED DATA

May omit oids:

{ paper: { author: “Abiteboul”,

author: { firstname: “Victor”,

lastname: “Vianu”},

title: “Regular path queries …”,

page: { first: 122, last: 133 }

}

}

10101010

VS. RELATIONAL MODEL

Missing attributes

Additional attributes

Multiple attribute values (set-valued attributes)

Objects as attribute values

No global schema

only the first characteristics supported by relationalmodel, all others are not

11111111

VS. RELATIONAL MODEL

Semistructured data

Self-describing,

Irregular data,

No a-priori structure.

Relational DB

Separate schema,

Regular data,

A-priori structure.

XML

13131313

IMPORTANT XML STANDARDS

XSL/XSLT: presentation and transformationstandards

RDF: resource description framework (meta-infosuch as ratings, categorizations, etc.)

Xpath/Xpointer/Xlink: standard for linking todocuments and elements within

Namespaces: for resolving name clashes DOM: Document Object Model for manipulating

XML documents SAX: Simple API for XML parsing XQuery: query language

14141414

XML

A W3C standard to complement HTML

Origins: Structured text SGML

Large-scale electronic publishing

Data exchange on the web

Motivation:

HTML describes presentation

XML describes content

http://www.w3.org/TR/2000/REC-xml-20001006 (version 2,10/2000)

SGMLXMLHTML4.0

15151515

FROM HTML TO XML

HTML describes the presentation

16161616

HTML

<h1> Bibliography </h1>

<p> <i> Foundations of Databases </i>

Abiteboul, Hull, Vianu

<br> Addison Wesley, 1995

<p> <i> Data on the Web </i>

Abiteboul, Buneman, Suciu

<br> Morgan Kaufmann, 1999

HTML describes the presentation

17171717

XML

<bibliography>

<book> <title> Foundations… </title>

<author> Abiteboul </author>

<author> Hull </author>

<author> Vianu </author>

<publisher> Addison Wesley </publisher>

<year> 1995 </year>

</book>

…

</bibliography>

XML describes the content

18181818

WHY ARE WE DB’ERS INTERESTED?

It’s data. That’s us.

Database issues: How are we going to model XML? (graphs).

How are we going to query XML? (XQuery)

How are we going to store XML (in a relationaldatabase? object-oriented? native?)

How are we going to process XML efficiently? (manyinteresting research questions!)

19191919

ELEMENTS

Tagsbook, title, author, …

start tag: <book>, end tag: </book>

defined by user / programmer (different from HTML!)

Elements <book>…<book>,<author>…</author>

An element consists of a matching start and end tagand the enclosed content.

Elements can be nested, i.e. content of one elementcan consist of sequence of other elements.

20202020

ATTRIBUTES

Attributes can be associated with any element.

Provide additional information about elements.

Attributes can have only one value.

Example

<book price = “55” currency = “USD”>

<title> Foundations of Databases </title>


…

<year> 1995 </year>

</book>

Attributes can also be used to connect elements.

21212121

NON-TREE-LIKE XML

So far: only tree-like XML documents,i.e. each element is nested within at most oneother element.

Attributes can also be used to create non-treeXML documents.

Attributes with a domain of ID serve as primarykeys of elements.

Attributes with a domain of IDREF serve asforeign keys referencing the ID of anotherelement.

22222222

NON-TREE-LIKE XML

Example of a non-tree structure<persons>

<person personid=“o555”><name> Jane </name>

</person>

<person personid=“o456”>

<name> Mary </name>

<children refs=“o123 o555”</children >

</person>

<person personid=“o123” mother=“o456”>

<name>John</name>

</person>

</persons>

23232323

NAMESPACES

An XML document can involve tags that comefor multiple sources.

One and the same tag can appear in more thanone source.

<table> <tr>

<td>Apples</td>

<td>Bananas</td>

</tr> </table>

<table>

<name>African Coffee Table</name>

<width>80</width>

<length>120</length>

</table>

24242424

NAMESPACES

Name conflicts can be resolved by prefixing tagnames according to their source.<h:table>

<h:tr> <h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>

</h:table>

<f:table><f:name>African Coffee Table</f:name><f:width>80</f:width><f:length>120</f:length>

</f:table>

When using prefixes in XML, a namespace forthe prefix must be defined.

The namespace must be referenced (via an URI)in the start tag of an enclosing element .

25252525

WELL-FORMED XML

A well-formed XML document satisfies the

following conditions:

Begins with a declaration that it is XML.

Has a single root element that encloses the whole

document.

Consists of properly nested elements, i.e. start and end

tag of an element are within the same enclosing

element.

standalone =“yes” states that document has no

DTD.

In this mode, you can invent your own tags, like in

semistructured data model.

26262626

WELL-FORMED XML

<?XML version=“1.0” standalone =“yes” ?><bibliography>

<book> <title> Foundations… </title><author> Abiteboul </author><author> Hull </author><author> Vianu </author><publisher> Addison Wesley </publisher><year> 1995 </year>

</book><book> <title> … </title>

. . .</book>…

</bibliography>

27272727

WELL-FORMED XML

HTML browsers will display documents with

errors (like missing end tags).

The W3C XML specification states that a

program should stop processing an XML

document if it finds an error.

The main reason is that XML is being consumed

by programs rather than by humans (as HTML).

W3C provides a validator that checks whether an

XML document is well-formed.

28282828

VALID XML

The validator can also check whether an XML

document is valid, i.e. conforms to a Document

Type Definition (DTD).

A DTD specifies the allowable tags and how they

can be nested.

XML with a DTD is no longer semistructured

(self-describing).

However, a DTD is less rigid than the schema of a

relational DB. E.g., a DTD allows missing and

multiple attributes / elements.

DTD

30303030

DOCUMENT TYPE DEFINITIONS

Document Type Definition (DTD): set of rules

(grammar) specifying elements, attributes and all

other aspects of XML documents.

For each element, specify name and content type.

Content type can, e.g., be

#PCDATA (character string),

other elements,

regular expression made of the above content types* = zero or more occurrences? = zero or one occurrence+ = one or more occurrences, = sequence of elements.

31313131

DOCUMENT TYPE DESCRIPTORS

Sort of like a schema but not really.

Inherited from SGML DTD standard

BNF grammar establishing constraints on elementstructure and content

Definitions of entities

<!ELEMENT Book (title, author*) >

<!ELEMENT title #PCDATA><!ELEMENT author (name, address,age?)>

<!ATTLIST Book id ID #REQUIRED><!ATTLIST Book pub IDREF #IMPLIED>

32323232

EXAMPLE DTD: PRODUCT CATALOG

<!DOCTYPE CATALOG [

<!ELEMENT CATALOG (PRODUCT+)>

<!ELEMENT PRODUCT (SPECIFICATIONS+,OPTIONS?,PRICE+,NOTES?)>

<!ATTLIST PRODUCT NAME CDATA #IMPLIED

CATEGORY (HandTool|Table|Shop-Professional) "HandTool"

PARTNUM CDATA #IMPLIED

PLANT (Pittsburgh|Milwaukee|Chicago) "Chicago"

INVENTORY (InStock|Backordered|Discontinued) "InStock">

<!ELEMENT SPECIFICATIONS (#PCDATA)>

<!ATTLIST SPECIFICATIONS WEIGHT CDATA #IMPLIED

POWER CDATA #IMPLIED>

<!ELEMENT OPTIONS (#PCDATA)>

<!ATTLIST OPTIONS FINISH (Metal|Polished|Matte) "Matte"

ADAPTER (Included|Optional|NotApplicable) "Included"

CASE (HardShell|Soft|NotApplicable) "HardShell">

<!ELEMENT PRICE (#PCDATA)>

<!ATTLIST PRICE MSRP CDATA #IMPLIED

WHOLESALE CDATA #IMPLIED

STREET CDATA #IMPLIED

SHIPPING CDATA #IMPLIED>

<!ELEMENT NOTES (#PCDATA)> ]>

33333333

SHORTCOMINGS OF DTDS

Useful for documents, but not so good for data:

Element name and type are associated globally

No support for structural re-use Object-oriented-like structures aren’t supported

No support for data types Can’t do data validation

Can have a single key item (ID), but: No support for multi-attribute keys

No support for foreign keys (references to other keys)

No constraints on IDREFs (reference only a Section)

XML SCHEMA

35353535

XML SCHEMA

The successor of DTDs to specify a schema for

XML documents.

A W3C standard.

Includes and extends functionality of DTDs.

In particular, XML Schemas support data types.

This makes it easier to validate the correctness of

data and to work with data from a database.

XML Schemas are written in XML. You don't have

to learn a new language and can use your XML

parser to parse your Schema files.

36363636

EXAMPLE XML SCHEMA

<schema version=“1.0”xmlns=“http://www.w3.org/1999/XMLSchema”>

<element name=“author” type=“string” /><element name=“date” type = “date” /><element name=“abstract”><type> … </type>

</element><element name=“paper”><type><attribute name=“keywords” type=“string”/><element ref=“author” minOccurs=“0”maxOccurs=“*” />

<element ref=“date” /><element ref=“abstract” minOccurs=“0”maxOccurs=“1” />

<element ref=“body” /></type>

</element></schema>

37373737

SIMPLE ELEMENTS

Simple elements contain only text.

They can have one of the built-in datatypes:

xs:string, xs:decimal, xs:integer, xs:boolean

xs:date, xs:time.

Example

<xs:element name="lastname“ type="xs:string"/>

<xs:element name="age" type="xs:integer"/>

<xs:element name="dateborn" type="xs:date"/>

38383838

SIMPLE ELEMENTS

Restrictions allow you to further constrain the

content of simple elements.

<xs:element name="age">

<xs:simpleType>

<xs:restriction base="xs:integer">

<xs:minInclusive value="0"/>

<xs:maxInclusive value="120"/>

</xs:restriction>

</xs:simpleType>

</xs:element>

39393939

ATTRIBUTES

Attributes can be specified using the attribute

element:

<xs:attribute name="xxx" type="yyy"/>

Attribute elements are nested within the element

of the element with which they are associated.

By default, attributes are optional.

To make an attribute mandatory, use

<xs:attribute name="lang“ type="xs:string“use="required"/>

Attributes can have the same built-in datatypes

as simple elements.

40404040

COMPLEX ELEMENTS

Complex elements can contain other elements and can

have attributes.

Nested elements need to occur in the order specified.

The number of repetitions of elements are controlled by

the attributes minOccurs and maxOccurs. The default

is one repetition.

A complex element with an attribute:

<xs:element name="product">

<xs:complexType>

<xs:attribute name="prodid" type="xs:positiveInteger"/>

</xs:complexType>

</xs:element>

41414141

COMPLEX ELEMENTS

A complex element containing a sequence of

nested (simple) elements:

<xs:element name="employee">

<xs:complexType>

<xs:sequence>

<xs:element name="firstname" type="xs:string"/>

<xs:element name="lastname" type="xs:string"/>

</xs:sequence>

</xs:complexType>

</xs:element>

42424242

COMPLEX ELEMENTS

If you name the complex element, other elements

can reference and include it:

<xs:complexType name="persontype">

<xs:sequence>

<xs:element name="firstname" type="xs:string"/>

<xs:element name="lastname" type="xs:string"/>

</xs:sequence>

</xs:complexType>

<xs:element name="person" type="persontype"/>

43434343

EXAMPLE XML SCHEMA

<schema version=“1.0”xmlns=“http://www.w3.org/1999/XMLSchema”>

<element name=“author” type=“string” /><element name=“date” type = “date” /><element name=“abstract”><type> … </type>

</element><element name=“paper”><type><attribute name=“keywords” type=“string”/><element ref=“author” minOccurs=“0”maxOccurs=“*” />

<element ref=“date” /><element ref=“abstract” minOccurs=“0”maxOccurs=“1” />

<element ref=“body” /></type>

</element></schema>

44444444

XML VS. SEMISTRUCTURED DATA

Both described best by a graph.

Both are schema-less, self-describing(XML without DTD / XML schema).

XML is ordered, semistructured data is not.

XML can mix text and elements:

<talk> Making Java easier to type and easier to type

<speaker> Phil Wadler </speaker>

</talk>

XML has lots of other stuff: attributes, entities,processing instructions, comments.

XML-PATH = XPATH

46464646

QUERY LANGUAGES FOR XML

XPath is a simple query language based ondescribing similar paths in XML documents.

XQuery extends XPath in a style similar to SQL,introducing iterations, subqueries, etc.

XPath and XQuery expressions are applied to anXML document and return a sequence ofqualifying items.

Items can be primitive values or nodes (elements,attributes, documents).

The items returned do not need to be of the sametype.

47474747

XPATH

A path expression returns the sequence of allqualifying items that are reachable from the inputitem following the specified path.

A path expression is a sequence consisting of tagsor attributes and special characters such asslashes (“/”).

Absolute path expressions are applied to someXML document and returns all elements that arereachable from the document’s root elementfollowing the specified path.

Relative path expressions are applied to anarbitrary node.

48484848

XPATH

<?XML version=“1.0” standalone =“yes” ?>

<bibliography>

<book bookID = “b100“> <title> Foundations… </title>




<publisher>Addison Wesley </publisher>

<year> 1995</year> </book>

…

</bibliography>

Applied to the above document, the XPath expression/bibliography/book/author returns the sequence



<author> Vianu </author> . . .

49494949

ATTRIBUTES

If we do not want to return the qualifying elements, but the valueone of their attributes, we end the path expression with @attribute.

<?XML version=“1.0” standalone =“yes” ?>

<bibliography>

<book bookID = “b100“> <title> Foundations… </title>




<publisher> Addison Wesley </publisher>

<year> 1995 </year> </book>

the XPath expression

/bibliography/book/@bookID

returns the sequence

“b100“ . . .

50505050

WILDCARDS

We can use wildcards instead of actual tags andattributes:* means any tag, and@* means any attribute.

Examples/bibliography/*/author returns the sequence


<author> Hull </author>./bibliography//author/@* returns the sequence

“IBM““a739“.

51515151

PATH EXPRESSIONS

Examples:

Bib.paper

Bib.book.publisher

Bib.paper.author.lastname

Given an OEM instance, the value of a pathexpression p is a set of objects

52525252

PATH EXPRESSIONS

Examples:

DB =&o1

&o12 &o24 &o29

&o43

&o70 &o71

&96

&243 &206

&25

“Serge”“Abiteboul”

1997

“Victor”“Vianu”

122 133

paperbook

paper

references

referencesreferences

authortitle year httpauthor

authorauthor

titlepublis herauthor

authortitle

page

firstnamelastname

firstname lastname firstlast

Bib

&o44 &o45 &o46

&o47 &o48 &o49 &o50 &o51

&o52

Bib.paper={&o12,&o29}

Bib.book.publisher={&o51}

Bib.paper.author.lastname={&o71,&206}

XML-QUERY = XQUERY

54545454

XQUERY

Summary:

FOR-LET-WHERE-ORDERBY-RETURN = FLWOR

FOR/LET Clauses

WHERE Clause

ORDERBY/RETURN Clause

List of tuples

List of tuples

Instance of Xquery data model

55555555

XQUERY

FLWOR expressions are similar to SQL select . .from . . . where . . . queries.

XQuery allows zero, one or more for and letclauses.

The where clause is optional.

There is one optional order-by clause.

Finally, there is exactly one return clause.

XQuery is case-sensitive.

XQuery (and XPath) is a W3C standard.

56565656

XQUERY CLAUSES

for $x in expr

Defines node variable $x.

The expression expr evaluates to a sequence of items.

The variable $x is assigned to each item, in turn, andthe body of the for clause is executed once for eachassignment.

let $x := expr

Defines collection variable $x.

The expression expr evaluates to a sequence of items.

The variable is bound to the entire sequence of items.

Useful for common subexpressions and foraggregations.

57575757

XQUERY CLAUSES

where condition

The condition is a boolean expression.

The clause is applied to some item.

If and only if the condition evaluates to true, thefollowing return clause is executed for that item.

return expression

The result of a FLWOR clause is a sequence of items.

Expression defines the result format for the current(qualifying) item.

The sequence of items produced by expression isappended to the sequence of items produced so far.

58585858

INTERPRETATION AS XQUERY

XQuery expressions can be used wherever anXML expression of any kind is permitted.

Any text string is acceptable as content of a tag orvalue of an attribute.

If a string contains an XQuery expression thatshould be evaluated, this substring must besurrounded by curly brackets {}.

Example

for $b in doc("bib.xml")/bibliography/bookreturn <result id = {$b/@bookID}>{$b/title}</result>

59595959

FOR V.S. LET

Find all books

FOR $x IN document("bib.xml")/bib/book

RETURN <result> $x </result>

Returns:

<result> <book>...</book></result><result> <book>...</book></result><result> <book>...</book></result>...

LET $x IN document("bib.xml")/bib/book

RETURN <result> $x </result>

Returns:

<result> <book>...</book><book>...</book><book>...</book>...

</result>

60606060

XQUERY

Find all book titles published after 1995:

FOR $x IN document("bib.xml")/bib/book

WHERE $x/year > 1995

RETURN $x/title

Result:<title> abc </title><title> def </title><title> ghi </title>

61616161

ORDERING THE QUERY RESULT

The order-by clause allows you to order theresults of an XQuery expression.

order-by list of expressions

The sort order is based on the value of the firstexpression. Ties are broken based on the value ofthe second (if necessary third etc.) expression.

By default, the order is ascending.

A descending sort order can be specified usingdescending.

62626262

ELIMINATION OF DUPLICATES

The built-in function distinct-values eliminatesduplicates from a sequence of result items.

In principle, it applies only to primitive (atomic)types.

It can also be applied to elements, but then it willremove their tags, replacing them by quotes “”.

Example

If return $b/title produces<title> aaa </title> <title> bbb </title><title> aaa </title>

then distinct-values (return $b/title) produces

“aaa” “bbb”.

63636363

XQUERY

For each author of a book by Morgan Kaufmann, listall books she published:

FOR $a IN distinct(document("bib.xml")/bib/book[publisher=“Morgan Kaufmann”]/author)

RETURN <result>

$a,

FOR $t IN /bib/book[author=$a]/title

RETURN $t

</result>

distinct = a function thateliminates duplicates

Result:

<result>

<author>Jones</author>

<title> abc </title>

<title> def </title>

</result>

<result>

<author> Smith</author>

<title> ghi </title>

</result>

64646464

JOINS

We can join two or more documents, by using one

variable for each of the documents .

We let a variable range over the elements of the

corresponding document, within a for-clause.

Need to be careful when comparing elements for

equality, since their equality is by element

identity, not by element content.

Typically, we want to compare the element

content.

The built-in function data(E) returns the content

of an element E.

65656565

XQUERY

Find books whose price is larger than average:

LET $a=avg(document("bib.xml")/bib/book/price)

FOR $b in document("bib.xml")/bib/book

WHERE $b/price > $a

RETURN $b

66666666

SORTING IN XQUERY

<publisher_list>FOR $p IN distinct(document("bib.xml")//publisher)

ORDERBY $pRETURN <publisher> <name> $p/text() </name> ,

FOR $b IN document("bib.xml")//book[publisher = $p]

ORDERBY $b/price DESCENDINGRETURN <book>

$b/title ,$b/price

</book></publisher>

</publisher_list>

67676767

IF-THEN-ELSE

FOR $h IN //holding

ORDERBY $h/titleRETURN <holding>

$h/title,

IF $h/@type = "Journal"

THEN $h/editor

ELSE $h/author

</holding>

68686868

EXISTENTIAL QUANTIFIERS

FOR $b IN //book

WHERE SOME $p IN $b//para SATISFIES

contains($p, "sailing")

AND contains($p, "windsurfing")

RETURN $b/title

69696969

QUANTIFICATION

XQuery supports the existential and the universal

quantifier.

Universal quantifier

every $v in expression1 satisfies expression 2

Existential quantifier

some $v in expression1 satisfies expression 2

Expression1 evaluates to a sequence of items,

expression 2 is a boolean expression.

70707070

AGGREGATION

XQuery provides built-in functions for the

standard aggregations such as SUM, MIN,

COUNT and AVG.

They can be applied to any XQuery expression, i.e.

to any sequence of items.

Example

avg(doc("bib.xml")/bibliography/book/price)

count(doc("bib.xml")/bibliography/book/price)

Computes the average book price and the number of

books, resp.

71717171

XQUERY EXAMPLES

Find books whose price is larger than the average

price.

Uses aggregate operator (avg), applied to the result of

a path expression.

let $a:=avg(doc("bib.xml")/bibliography/book/price)

for $b in doc("bib.xml")/bibliography/book

where $b/price > $a

return $b

72727272

XQUERY EXAMPLES

Find title of books with a paragraph containing the

terms “sailing” and “windsurfing”.

Uses existential quantifier (some) and string

matching (contains).

for $b in doc("bib.xml")//book

where some $p in $b//para satisfies

contains($p, "sailing") and contains($p, "windsurfing")

return $b/title

semistructured data and xml · 2010-11-15 · 7 the semistructured data model arc labels...

Documents