Introduction to XML
Yanlei Diao, UMass Amherst
Slides courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau.
Data Integration & Sharing
Two sources on an intranet: the UMass Student Database and the Amherst College Student Database.

One source:

Sid   Name          Contact
107   Joe Black     413-555-1223
109   Frank James   513-123-0102
111   Anna Wang     NULL
…     …             …

The other source:

Sid   FirstName   LastName   Contact (NOT NULL)
…     …           …          …
WWW
Structured data - Databases
Unstructured Text - Documents
Semistructured Data
Integration of Text & Structured Data
Needs for A New Data Model
Loose and rich structure
Evolving, unknown, irregular structures
Integration of structured, but heterogeneous data sources
Textual data with tags and links
Combination of data models (relational, hierarchical)
Answer to all of the above: Extensible Markup Language (XML), http://www.w3.org/TR/REC-xml/
From HTML to XML
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu
    <br> Addison Wesley, 1995
<p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu
    <br> Morgan Kaufmann, 1999
HTML describes the presentation.
XML
<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>
XML describes the content!
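
Because the tags name the content rather than its layout, a program can pull fields out directly. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample document is a one-book abbreviation of the bibliography above, and the helper name `authors_by_title` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the bibliography example (one book only).
BIBLIOGRAPHY = """
<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>
"""

def authors_by_title(xml_text):
    """Map each book title to its list of authors (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        book.findtext("title"): [author.text for author in book.findall("author")]
        for book in root.findall("book")
    }

print(authors_by_title(BIBLIOGRAPHY))
```

The same query against the HTML version would have to guess that the text after `</i>` is an author list, since `<i>` and `<br>` say nothing about meaning.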
Outline
XML Syntax
XML typing
XML Syntax
Tags: book, title, author, …
− start tag: <book>
− end tag: </book>
Elements: <book>…</book>, <author>…</author>
− An element can have text data in its content.
− An element can have nested elements.
− Empty element: <red></red>, abbreviated <red/>
XML document: single root element
An XML document is well formed if it has a single root element and matching tags.
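
Well-formedness can be tested by simply trying to parse the text. This is a small sketch with Python's standard `xml.etree.ElementTree`; the function name `is_well_formed` and the sample strings are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as one root element with matching tags."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<book><title>Foundations</title></book>"))  # matching tags
print(is_well_formed("<book><title>Foundations</book></title>"))  # tags cross
print(is_well_formed("<book/><book/>"))                           # two roots

# <red></red> and its abbreviation <red/> parse to the same empty element.
same = ET.tostring(ET.fromstring("<red></red>")) == ET.tostring(ET.fromstring("<red/>"))
print(same)
```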
XML Syntax
<book price="55" currency="USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  …
  <year> 1995 </year>
</book>
Attributes are alternative ways to represent data.
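
To illustrate that attributes and child elements are interchangeable encodings, the sketch below reads the price from either form. It uses Python's standard `xml.etree.ElementTree`; the element-based variant of the book and the helper `price_of` are assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET

WITH_ATTRIBUTES = ('<book price="55" currency="USD">'
                   '<title>Foundations of Databases</title></book>')
WITH_ELEMENTS = ('<book><price>55</price><currency>USD</currency>'
                 '<title>Foundations of Databases</title></book>')

def price_of(book):
    """Read the price whether it is stored as an attribute or as a child element."""
    value = book.get("price")            # attribute form; None if absent
    if value is None:
        value = book.findtext("price")   # element form; None if absent
    return value

a = ET.fromstring(WITH_ATTRIBUTES)
b = ET.fromstring(WITH_ELEMENTS)
print(a.attrib)
print(price_of(a), price_of(b))
```

Either way the value arrives as a string; nothing in the document itself says that 55 is a number.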
XML Syntax
<family>
  <person id="o555"> <name> Jane </name> </person>
  <person id="o456"> <name> Mary </name>
    <children idrefs="o123 o555"/>
  </person>
  <person id="o123" mother="o456"> <name>John</name> </person>
</family>
Oids and references in XML are just syntax.
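
"Just syntax" means the parser hands back the ids and references as ordinary strings; following them is up to the application. The sketch below, using Python's standard `xml.etree.ElementTree`, builds the id index by hand. The names `by_id` and `children_of` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

FAMILY = """
<family>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name>
    <children idrefs="o123 o555"/>
  </person>
  <person id="o123" mother="o456"><name>John</name></person>
</family>
"""

root = ET.fromstring(FAMILY)

# The parser returns plain attribute strings; the application builds the index.
by_id = {person.get("id"): person for person in root.iter("person")}

def children_of(person):
    """Follow the idrefs attribute of a <children> element, if there is one."""
    refs = person.find("children")
    if refs is None:
        return []
    return [by_id[ref].findtext("name") for ref in refs.get("idrefs").split()]

mary = by_id["o456"]
john = by_id["o123"]
print(children_of(mary))
print(by_id[john.get("mother")].findtext("name"))
```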
XML Semantics: a Tree !
<data>
  <person id="o555">
    <name> Mary </name>
    <address>
      <street> Maple </street>
      <no> 345 </no>
      <city> Seattle </city>
    </address>
  </person>
  <person>
    <name> John </name>
    <address> Thailand </address>
    <phone> 23456 </phone>
  </person>
</data>
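[Figure: the document drawn as a tree rooted at data, with two person subtrees. Legend: element nodes (data, person, name, address, street, no, city, phone), text nodes (Mary, Maple, 345, Seattle, John, Thailand, 23456), attribute node (id = o555).]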
Order matters! IDREF turns the tree into a graph.
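
The tree view can be made concrete by walking a parsed document and listing its element, attribute, and text nodes in document order. This is a sketch with Python's standard `xml.etree.ElementTree`; the generator `describe` is a hypothetical helper, and as a simplification it skips any text that follows a child element.

```python
import xml.etree.ElementTree as ET

DATA = """
<data>
  <person id="o555">
    <name>Mary</name>
    <address><street>Maple</street><no>345</no><city>Seattle</city></address>
  </person>
  <person>
    <name>John</name>
    <address>Thailand</address>
    <phone>23456</phone>
  </person>
</data>
"""

def describe(elem, depth=0):
    """Yield one line per element, attribute, and non-blank leading text node."""
    pad = "  " * depth
    yield f"{pad}element {elem.tag}"
    for key, value in elem.attrib.items():
        yield f"{pad}  attribute {key} = {value}"
    if elem.text and elem.text.strip():
        yield f"{pad}  text {elem.text.strip()}"
    for child in elem:                      # children come back in document order
        yield from describe(child, depth + 1)

lines = list(describe(ET.fromstring(DATA)))
print("\n".join(lines))
```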
XML Data
XML is self-describing due to the use of tags
Schema elements become part of the data
– Relational schema: persons(name,phone)
– In XML <persons>, <name>, <phone> are part of the data, and are repeated many times
Consequence: XML is much more flexible
Some real data: http://www.cs.washington.edu/research/xmldatasets/
Relational Data as XML
Relation person:

name   phone
John   3634
Sue    6343
Dick   6363

XML:

<person>
  <row> <name>John</name> <phone>3634</phone> </row>
  <row> <name>Sue</name>  <phone>6343</phone> </row>
  <row> <name>Dick</name> <phone>6363</phone> </row>
</person>

[Figure: the XML drawn as a tree. The root person has three row children; each row has a name child and a phone child holding the text of one tuple ("John" 3634, "Sue" 6343, "Dick" 6363).]
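
The encoding is mechanical: one element for the table, one `row` element per tuple, one child per column. A sketch with Python's standard `xml.etree.ElementTree` follows; the function name `table_to_xml` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

def table_to_xml(table_name, columns, rows):
    """Encode a relation as <table><row><column>value</column>...</row>...</table>."""
    root = ET.Element(table_name)
    for values in rows:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, values):
            ET.SubElement(row, column).text = str(value)
    return root

person = table_to_xml("person", ["name", "phone"],
                      [("John", 3634), ("Sue", 6343), ("Dick", 6363)])
print(ET.tostring(person, encoding="unicode"))
```

Note how the column names, stated once in a relational schema, are repeated in every row of the XML.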
XML is Semi-structured Data
Missing attributes:

<data>
  <person> <name> John</name> <phone>1234</phone> </person>
  <person> <name>Joe</name> </person>     ← no phone!
</data>

Could represent in a table with nulls:

name   phone
John   1234
Joe    -
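
Going the other way, a missing child maps naturally to a null. In this sketch (Python's standard `xml.etree.ElementTree`) `findtext` returns `None` when the child is absent; the helper `to_rows` and its default column list are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

DATA = """
<data>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
</data>
"""

def to_rows(xml_text, columns=("name", "phone")):
    """Flatten each <person> into a tuple, with None where a child is missing."""
    root = ET.fromstring(xml_text)
    return [tuple(person.findtext(column) for column in columns)
            for person in root.findall("person")]

print(to_rows(DATA))
```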
XML is Semi-structured Data
Repeated attributes:

<person>
  <name> Mary</name>
  <phone>2345</phone>
  <phone>3456</phone>     ← two phones!
</person>

Impossible in tables: nested collections (non-1NF)

name   phone
Mary   2345 3456 ???
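
One common relational workaround, which goes beyond what the slide states, is to unnest the collection into one row per phone. A sketch with Python's standard `xml.etree.ElementTree`; `phone_rows` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

PERSON = "<person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>"

def phone_rows(person):
    """Return one (name, phone) tuple per <phone> child."""
    name = person.findtext("name")
    # findall returns every matching child, so repetition needs no special case.
    return [(name, phone.text) for phone in person.findall("phone")]

print(phone_rows(ET.fromstring(PERSON)))
```

The XML keeps both phones inside a single person element, while the flat version has to repeat the name.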
XML is Semi-structured Data
Attributes with different types in different objects:

<data>
  <person>
    <name> <first> John </first> <last> Smith </last> </name>     ← structured name!
    <phone>1234</phone>
  </person>
  <person>
    <name> M. Carey</name>     ← unstructured name!
    <phone>3456</phone>
  </person>
</data>

Mixed content: <p><person>Joe</person> went to <school>Umass</school>.</p>
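
Code that consumes such data has to cope with both shapes. The sketch below uses Python's standard `xml.etree.ElementTree`, where text before the first child is stored in `.text` and text after a child in that child's `.tail`. The helper `display_name` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

def display_name(name):
    """Return a printable name for either a structured or an unstructured <name>."""
    if len(name) > 0:   # has child elements such as <first> and <last>
        return " ".join(child.text.strip() for child in name)
    return name.text.strip()

structured = ET.fromstring("<name><first> John </first><last> Smith </last></name>")
flat = ET.fromstring("<name> M. Carey</name>")
print(display_name(structured), "|", display_name(flat))

# Mixed content: text is interleaved with child elements.
p = ET.fromstring("<p><person>Joe</person> went to <school>Umass</school>.</p>")
print([(child.tag, child.text, child.tail) for child in p])
print("".join(p.itertext()))
```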
Outline
XML Syntax
XML typing
Data Typing in XML
Data typing in the relational model: schema
Data typing in XML
– Much more complex
– Typing restricts valid trees that can occur
  • theoretical foundation: tree languages
– Practical methods:
  • DTD (Document Type Definition)
  • XML Schema: superset of DTD (not covered in detail in this class)
Document Type Definitions (DTD)
Part of the original XML specification.
• DTD can be part of an XML document.
An XML document may have a DTD.
• But it is not required.
An XML document is
• well-formed = if tags are correctly closed
• valid = if it further has a DTD and conforms to it
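
The distinction shows up in practice with non-validating parsers. Python's standard `xml.etree.ElementTree` is built on Expat, which is documented as non-validating, so the sketch below accepts a document that is well-formed but breaks its own DTD. The tiny DTD and document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed, but NOT valid: the DTD requires <ssn> before <name>.
DOC = """<!DOCTYPE company [
  <!ELEMENT company (person*)>
  <!ELEMENT person (ssn, name)>
  <!ELEMENT ssn (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
]>
<company>
  <person><name>John</name></person>
</company>
"""

root = ET.fromstring(DOC)   # succeeds: only well-formedness is checked
print(root.tag, [child.tag for child in root[0]])
```

Checking validity requires a validating parser or a separate validation step.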
DTD Example
<!DOCTYPE company [
  <!ELEMENT company ((person|product)*)>
  <!ELEMENT person (ssn, name, office, phone?)>
  <!ELEMENT ssn (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT office (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT product (pid, name, description?)>
  <!ELEMENT pid (#PCDATA)>
  <!ELEMENT description (#PCDATA)>
]>
DTD Example
Example of valid XML document:

<company>
  <person>
    <ssn> 123456789 </ssn>
    <name> John </name>
    <office> B432 </office>
    <phone> 1234 </phone>
  </person>
  <person>
    <ssn> 987654321 </ssn>
    <name> Jim </name>
    <office> B123 </office>
  </person>
  <product> ... </product>
  ...
</company>
DTD: The Content Model
<!ELEMENT tag_name (CONTENT)>
where CONTENT is the content model.

Content model:
– Complex = a regular expression over other elements
– Text-only = #PCDATA
– Empty = EMPTY
– Mixed content = (#PCDATA | A | B | C)*
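
As a rough illustration of the four cases, the sketch below classifies what an element instance actually contains. It inspects the instance, not the DTD declaration, so an element declared as mixed that happens to hold only text is reported as text-only. It uses Python's standard `xml.etree.ElementTree`; `content_kind` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def content_kind(elem):
    """Classify an element's actual content: empty, text-only, complex, or mixed."""
    has_children = len(elem) > 0
    # Text can sit before the first child (.text) or after any child (.tail).
    texts = [elem.text] + [child.tail for child in elem]
    has_text = any(t and t.strip() for t in texts)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "complex"
    if has_text:
        return "text-only"
    return "empty"

for sample in ("<red/>",
               "<name>John</name>",
               "<person><name>John</name></person>",
               "<p><person>Joe</person> went to <school>Umass</school>.</p>"):
    print(content_kind(ET.fromstring(sample)), sample)
```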
DTD: Regular Expressions
DTD (left) and matching XML (right):

sequence:     <!ELEMENT name (firstName, lastName)>

              <name>
                <firstName> . . . . . </firstName>
                <lastName> . . . . . </lastName>
              </name>

optional:     <!ELEMENT name (firstName?, lastName)>

Kleene star:  <!ELEMENT person (name, phone*)>

              <person>
                <name> . . . . . </name>
                <phone> . . . . . </phone>
                <phone> . . . . . </phone>
                <phone> . . . . . </phone>
                . . . . . .
              </person>

alternation:  <!ELEMENT person (name, (phone|email))>
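
Since a content model is a regular expression over child element names, it can be imitated with an ordinary regex over the sequence of child tags. The sketch below hand-translates the four models from this slide into Python `re` patterns. This is an illustration of the idea, not how a real DTD validator is implemented, and `MODELS` and `children_match` are assumed names.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the four content models into Python regexes.
# Each child tag is written as "tag," so that tag names cannot run together.
MODELS = {
    "sequence":    r"firstName,lastName,",      # (firstName, lastName)
    "optional":    r"(firstName,)?lastName,",   # (firstName?, lastName)
    "kleene star": r"name,(phone,)*",           # (name, phone*)
    "alternation": r"name,(phone,|email,)",     # (name, (phone|email))
}

def children_match(elem, model):
    """True if the sequence of child tags of elem matches the named model."""
    tags = "".join(child.tag + "," for child in elem)
    return re.fullmatch(MODELS[model], tags) is not None

three_phones = ET.fromstring("<person><name/><phone/><phone/><phone/></person>")
print(children_match(three_phones, "kleene star"))
print(children_match(three_phones, "alternation"))
```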
Attributes in DTDs
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age CDATA #REQUIRED>

<person age="25">
  <name> ....</name>
  ...
</person>
Attributes in DTDs
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age     CDATA  #REQUIRED
                 id      ID     #REQUIRED
                 manager IDREF  #REQUIRED
                 manages IDREFS #REQUIRED>

<person age="25" id="p29432" manager="p48293" manages="p34982 p423234">
  <name> ....</name>
  ...
</person>
Attributes in DTDs
Types:
– CDATA = string
– ID = key
– IDREF = foreign key
– IDREFS = foreign keys separated by space
– (Monday | Wednesday | Friday) = enumeration
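
The key and foreign-key reading of ID and IDREF amounts to two checks: IDs are unique, and every reference points at an existing ID. The sketch below performs them by hand with Python's standard `xml.etree.ElementTree`, assuming the attribute names `id`, `manager`, and `manages` from the earlier ATTLIST. The function `check_references` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

def check_references(root, id_attr="id",
                     idref_attrs=("manager",), idrefs_attrs=("manages",)):
    """Return a list of problems: duplicate IDs and references to missing IDs."""
    problems, seen = [], set()
    for elem in root.iter():                       # pass 1: collect IDs (keys)
        ident = elem.get(id_attr)
        if ident is not None:
            if ident in seen:
                problems.append(f"duplicate ID {ident}")
            seen.add(ident)
    for elem in root.iter():                       # pass 2: check references
        refs = [elem.get(a) for a in idref_attrs if elem.get(a) is not None]
        for a in idrefs_attrs:
            refs.extend((elem.get(a) or "").split())   # IDREFS: space separated
        problems.extend(f"dangling reference {r}" for r in refs if r not in seen)
    return problems

COMPANY = """
<company>
  <person id="p1" manager="p2"/>
  <person id="p2" manager="p2" manages="p1 p9"/>
</company>
"""
print(check_references(ET.fromstring(COMPANY)))
```

Unlike a relational foreign key, an IDREF cannot say which kind of element it must point to; any ID in the document satisfies it.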
Attributes in DTDs
Kind:
– #REQUIRED = the attribute must be present
– #IMPLIED = optional
– value = default value
– #FIXED value = the only value allowed
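
The four kinds can be read as rules for the value an application ends up seeing. The sketch below spells those rules out in plain Python. It is a hand-written illustration rather than a parser API, and the function `effective_value` and the label `"default"` for the plain default-value case are assumptions.

```python
def effective_value(kind, declared=None, given=None):
    """Return the attribute value an application sees, or raise ValueError.

    kind     -- "#REQUIRED", "#IMPLIED", "default", or "#FIXED"
    declared -- the value written in the DTD (for "default" and "#FIXED")
    given    -- the value in the document, or None if the attribute is absent
    """
    if kind == "#REQUIRED":
        if given is None:
            raise ValueError("required attribute is missing")
        return given
    if kind == "#IMPLIED":
        return given                      # may be None: optional, no default
    if kind == "default":
        return declared if given is None else given
    if kind == "#FIXED":
        if given is not None and given != declared:
            raise ValueError("fixed attribute must equal " + declared)
        return declared
    raise ValueError("unknown kind " + kind)

print(effective_value("default", declared="USD"))
print(effective_value("default", declared="USD", given="EUR"))
```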
Using DTDs
The DTD must be declared in the XML document.
Either include the entire DTD:
– <!DOCTYPE nameRootElement [ DTD ]>
Or include a reference to it:
– <!DOCTYPE nameRootElement SYSTEM "http://www.mydtd.org/">
XML Schema
DTDs capture grammatical structure, but have limitations:
− Not themselves in XML, inconvenient to build tools
− Don’t capture database data types
− No way of defining OO-like inheritance
− …
XML Schema addresses shortcomings of DTDs
− XML syntax
− Subclassing
− Domains and built-in data types
− MIN and MAX # of occurrences of elements
− http://www.w3.org/XML/Schema