XML Tutorial
Walter Underwood, Senior Staff Engineer, Infoseek
Outline
I. XML: Why? What is it?
II. Document Types: representing content
III. Stylesheets: representing presentation
What is XML?
Extensible Markup Language
Structured markup
Simplified SGML
Next-generation HTML
W3C Recommendation (spec)
Easy to use, easy to implement
A buzzword the press can spell
What is XML not?
A programming language
A single document type (memo, paper)
Replacement for MS Word or FrameMaker
An ANSI or ISO standard
Family Tree
GML (1969)
SGML (1985)
HTML (1993)
XML (1998)
Dates are first publication of draft specification
Why not SGML?
Tools are hard to write
Tools are expensive
Depends on environment (interchange is difficult)
If it did the job, we'd already be using it
Why not HTML?
Backward compatibility, old browsers
Hard to extend (still no formulas, figures)
Based on SGML (see previous slide)
Too much illegal HTML in use, need clean slate
An HTML example
<html><body><h1>The Purple Cow</h1>I never saw a purple cow,<br>I never hope to see one;<br>But I can tell you, anyhow,<br>I'd rather see than be one.<br></body></html>
Same thing in XML
<?xml version="1.0"?>
<!DOCTYPE TEI.2 SYSTEM "tei.dtd">
<TEI.2><text><body><div1 type="poem">
<head>The Purple Cow</head>
<lg>
<l>I never saw a purple cow,</l>
<l>I never hope to see one;</l>
<l>But I can tell you, anyhow,</l>
<l>I'd rather see than be one.</l>
</lg>
</div1></body></text></TEI.2>
Same thing formatted
The Purple Cow
I never saw a purple cow,
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
Basic Syntax
Starts with XML declaration
<?xml version="1.0"?>
Rest of document inside the "root element"
<TEI.2>…</TEI.2>
All text contained in some element
<head>The Purple Cow</head>
Start and end tags must match exactly
Well-formed vs. Valid
XML must be well-formed
  correct syntax
  tags match, tags nest, all characters legal
  parser must reject if not well-formed
XML may be valid with respect to a DTD (Document Type Definition)
  tags are used correctly
  tags are all declared
  attributes are declared
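The well-formedness rules above can be tried out with any conforming parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is made up for this illustration, and this parser checks well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # A conforming parser must reject input that is not well-formed.
        return False

good = "<lg><l>I never saw a purple cow,</l></lg>"
bad_nesting = "<lg><l>I never saw a purple cow,</lg></l>"  # tags overlap
bad_match = "<head>The Purple Cow</Head>"                  # end tag differs in case

print(is_well_formed(good), is_well_formed(bad_nesting), is_well_formed(bad_match))
```

Note that tag names are case-sensitive, so `<head>` and `</Head>` do not match.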
Validity Checking
Checks everything specified in a DTD
Can't check text (currency, spelling)
Checks against DTD: this is a valid memo, book, bibliography, ...
XML editors usually require validity
Other tools (search engines) might not
XML Syntax
The XML declaration
Elements
Entities
Text
Declarations and Notations
Processing Instructions
Comments
The XML Declaration
At very beginning of file
Officially optional, but always use it
Can declare version, encoding, standalone
  Must be in that order
  Version is required; encoding and standalone are optional
Must declare encodings other than UTF-8 and UTF-16
<?xml version="1.0" encoding="Big5"?>
<?xml version="1.0" encoding="ISO-8859-1"?>
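To see why other encodings must be declared, here is a small sketch with Python's standard `xml.etree.ElementTree`. The `<street>` element and the variable names are illustrative assumptions; the street address is borrowed from the Elements slide.

```python
import xml.etree.ElementTree as ET

address = "Kurfürstendamm 175"

# Bytes in ISO-8859-1 with a matching encoding declaration: parses fine.
declared = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
            "<street>" + address + "</street>").encode("iso-8859-1")
parsed_text = ET.fromstring(declared).text

# The same bytes without a declaration are read as UTF-8, where the
# single byte used for "ü" is not a legal sequence.
undeclared = ("<street>" + address + "</street>").encode("iso-8859-1")
try:
    ET.fromstring(undeclared)
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False

print(parsed_text, undeclared_accepted)
```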
Elements
Containing: <person>Nico</person>
Empty: <br/>
Attributes: <date format="iso8601">…
Names can use any Unicode letter, digit, or '.', '-', '_', or ':' (':' is reserved), and must start with a letter, '_', or ':'
<Straße>Kurfürstendamm 175</Straße>
Elements Express Structure
Heading is inside poem element
<div1 type="poem"><head>The Purple Cow</head>
Shows the lines of the poem, not the line breaks on the page
HTML: I never saw a purple cow<br>
XML: <l>I never saw a purple cow</l>
Space between elements is ignored
The Document Tree
<TEI.2>
  <text>
    <body>
      <div1>
        <head></head>
        <lg>
          <l></l>
          <l></l>
        </lg>
      </div1>
    </body>
  </text>
</TEI.2>
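The tree can also be recovered programmatically. This is a rough sketch using Python's standard `xml.etree.ElementTree`; the `outline` helper and the shortened two-line poem are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

POEM = """<TEI.2><text><body><div1 type="poem">
<head>The Purple Cow</head>
<lg><l>I never saw a purple cow,</l><l>I never hope to see one;</l></lg>
</div1></body></text></TEI.2>"""

def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(POEM))
print("\n".join(tree_lines))
```

The printed outline has the same shape as the tree on the slide, with the root element at the top and one level of indentation per level of nesting.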
Elements and Attributes
Attributes can parameterize an element
<div1 type="poem">
<div1 type="abstract">
<div1 type="chapter">
<date format="iso8601">
<subject scheme="LCSH">
Not as flexible as elements
Don't use to save bytes, compress instead
<author first="Fred" last="Flintstone"/> not good
Attribute Syntax
Name can use any Unicode letter, digit, or '.', '-', '_', or ':' (':' is reserved), and must start with a letter, '_', or ':'
Cannot repeat
Order doesn't matter
Values must be quoted (single or double)
Values may not contain "<"
Values may have defaults in DTD
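Several of these attribute rules are well-formedness constraints, so a parser enforces them. A minimal sketch with Python's standard `xml.etree.ElementTree` follows; the `parses` helper and the sample elements are illustrative only.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Order and quote style do not matter: both elements carry the same attributes.
a = ET.fromstring('<subject scheme="LCSH" format="iso8601"/>')
b = ET.fromstring("<subject format='iso8601' scheme='LCSH'/>")
same_attributes = a.attrib == b.attrib

repeated = parses('<div1 type="poem" type="chapter"/>')  # attribute repeated
unquoted = parses("<div1 type=poem/>")                   # value not quoted
raw_lt = parses('<div1 type="a<b"/>')                    # "<" inside a value
escaped_lt = parses('<div1 type="a&lt;b"/>')             # "<" written as &lt;

print(same_attributes, repeated, unquoted, raw_lt, escaped_lt)
```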
Special Attributes
xml:lang for language
id has unique identifier for element
idref references an id
xml:* is reserved
Entities
Just like HTML, but better
Five predefined entities
&amp; &apos; &lt; &gt; &quot;
Define your own in DTD
<!ENTITY euro "&#8364;">
Use numeric character references
&#8364; &#x20AC;
Use Unicode directly
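A quick way to see what the predefined entities and numeric character references stand for is to let a parser expand them. This sketch uses Python's standard `xml.etree.ElementTree`; the `<t>` wrapper element is just a placeholder.

```python
import xml.etree.ElementTree as ET

# The five predefined entities expand to the characters they protect.
predefined = ET.fromstring("<t>&amp; &apos; &lt; &gt; &quot;</t>").text

# Three spellings of the euro sign: decimal reference, hex reference, direct.
euro_decimal = ET.fromstring("<t>&#8364;</t>").text
euro_hex = ET.fromstring("<t>&#x20AC;</t>").text
euro_direct = ET.fromstring("<t>€</t>").text

print(predefined, euro_decimal, euro_hex, euro_direct)
```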
Text
Unicode 2.0, see www.unicode.org
Use predefined entities (&lt; &amp; …)
XML Example: &lt;char&gt;&amp;&lt;/char&gt;
CDATA ("character data") section for raw text without using entities
<![CDATA[ XML example: <char>&</char> ]]>
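Both ways of writing raw text give the application the same characters. Here is a small sketch, assuming Python's standard `xml.etree.ElementTree` and `xml.sax.saxutils.escape`; the `<example>` wrapper element is invented for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "<char>&</char>"

# Option 1: replace the markup characters with predefined entities.
escaped = escape(raw)
with_entities = ET.fromstring("<example>" + escaped + "</example>").text

# Option 2: wrap the raw text in a CDATA section, unchanged.
with_cdata = ET.fromstring("<example><![CDATA[" + raw + "]]></example>").text

print(escaped)
print(with_entities == with_cdata == raw)
```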
Declarations
Allow validity checking
Optional
May be internal (in document), external, or both
DTD (Document Type Definition) is all active declarations
Use existing DTDs when possible
External DTD
Most common
Use DOCTYPE declaration before root element
<!DOCTYPE greeting SYSTEM "hello.dtd">
<greeting>Hello, world!</greeting>
Internal (standalone) DTD
For custom documents
Also uses DOCTYPE declaration
<!DOCTYPE greeting [
<!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>
Specify in XML declaration
<?xml version="1.0" standalone="yes"?>
External plus Internal DTD
Usually to declare entities
Use DOCTYPE declaration before root element
<!DOCTYPE greeting SYSTEM "hello.dtd" [
<!ENTITY excl "!">
]>
<greeting>Hello, world!</greeting>
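An entity declared in the internal subset can then be used in the document body. The sketch below uses Python's standard `xml.etree.ElementTree`, which does not fetch external DTDs, so it keeps only the internal part; the use of `&excl;` in the greeting text is an assumption added for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
  <!ENTITY excl "!">
]>
<greeting>Hello, world&excl;</greeting>"""

# The parser replaces &excl; with the text from the entity declaration.
greeting_text = ET.fromstring(DOC).text
print(greeting_text)
```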
Element Type Declarations
Declare name
Declare allowed content
<!ELEMENT a EMPTY>
<!ELEMENT b ANY>
<!ELEMENT either (one | theother)>
<!ELEMENT ordered (first, second)>
<!ELEMENT list (item+)>
<!ELEMENT dl ((dt?, dd?)*)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT mixed (#PCDATA | b | i | em)*>
Attribute List Declarations
Declare attributes for an element
Declare value types
Declare defaults
<!ATTLIST termdef id ID #REQUIRED name CDATA #IMPLIED>
<!ATTLIST list type (bullets|ordered|glossary) "ordered">
<!ATTLIST form method CDATA #FIXED "POST">
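Attribute defaults are supplied by the parser when the document omits the attribute. The following sketch uses Python's low-level `xml.parsers.expat` module, which by default reports defaulted attributes along with specified ones; the `start_attributes` helper and the `<item>` content are assumptions for illustration, and no validation of the enumerated values takes place here.

```python
import xml.parsers.expat

DTD = """<!DOCTYPE list [
  <!ELEMENT list (item+)>
  <!ELEMENT item (#PCDATA)>
  <!ATTLIST list type (bullets|ordered|glossary) "ordered">
]>
"""

def start_attributes(document):
    """Collect (element name, attributes) pairs as the parser reports them."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append((name, dict(attrs)))
    parser.Parse(document, True)
    return seen

# No type attribute in the document: the declared default is filled in.
defaulted = start_attributes(DTD + "<list><item>first</item></list>")
# An explicit value wins over the default.
explicit = start_attributes(DTD + '<list type="bullets"><item>first</item></list>')

print(defaulted[0], explicit[0])
```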
Entity Declarations
Pretty names for characters
<!ENTITY copy "©">
Boilerplate
<!ENTITY copyright "© Infoseek Corp. 1999, All rights reserved">
Used extensively in complex DTDs
Notations
A name of something outside of XML
  an unparsed entity
  target of a processing instruction
Mostly useful to applications
<!NOTATION WunderFormatter SYSTEM "http://wunderco.com/formatter/">
Processing Instructions
Instructions to applications
  fonts? security? correctness checks?
Linking to a style sheet
<?xml-stylesheet href="mystyle.css" type="text/css"?>
Instructions to indexing robots
<?robots index="no" follow="yes"?>
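An application sees each processing instruction as a target name plus an uninterpreted data string. A short sketch with Python's standard `xml.dom.minidom` shows this; the `<page>` element and the helper name are made up for the example.

```python
from xml.dom import Node, minidom

DOC = ('<?xml version="1.0"?>'
       '<?xml-stylesheet href="mystyle.css" type="text/css"?>'
       '<?robots index="no" follow="yes"?>'
       "<page>Hello</page>")

def processing_instructions(document):
    """Return (target, data) for each top-level processing instruction."""
    dom = minidom.parseString(document)
    return [(node.target, node.data)
            for node in dom.childNodes
            if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE]

instructions = processing_instructions(DOC)
for target, data in instructions:
    print(target, "->", data)
```

The data looks like attributes, but the parser hands it over as plain text; interpreting it is up to the application named by the target. The XML declaration itself is not reported as a processing instruction.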
Comments
Like HTML and SGML
<!-- a comment -->
Anything except "--" is OK inside a comment
<!-- <head> & <tail> are elements -->
<!-- <?xml?> declaration goes here -->
But don't use structured comments, use processing instructions instead
<!-- Font: Treefrog --> wrong
<?WunderFormatter font="Treefrog"?> right
Unicode and Encodings
Unicode in programs UCS-2: two-byte characters UCS-4: four-byte characters (future)
Unicode in files UTF-8: ASCII is ASCII, rest are 1- to 4-bytes UTF-16: two octets per character, initial
ASCII with numeric character references works, too (&#169; for ©)
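The byte-level differences between these encodings are easy to inspect. This is a small sketch in Python using the built-in codecs and `xml.etree.ElementTree`; the sample characters and the `<c>` element are arbitrary choices.

```python
import xml.etree.ElementTree as ET

# UTF-8: ASCII stays one byte, other characters take more.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in "A©€"}

# UTF-16 as written by Python's codec: a two-byte byte order mark, then
# two bytes for the letter.
utf16_bytes = "A".encode("utf-16")

# Pure ASCII output: the serializer falls back to a numeric character reference.
ascii_doc = ET.tostring(ET.fromstring("<c>©</c>"), encoding="us-ascii")
round_trip = ET.fromstring(ascii_doc).text

print(utf8_lengths, len(utf16_bytes), ascii_doc, round_trip)
```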
What is a "document type"?
Technical report
Specification
Bug report
Experiment run summary
Software manual
Novel
Poem
Play
What is a DTD?
"Document Type Definition"
Bunch of XML declarations
Usually external to document
Designed for some purpose (use one that matches your needs)
Best left to experts
Types of Document Types
Text
  TEI (scholarly editions)
  DocBook (software documentation)
  NITF (news articles)
Data
  CML (Chemical Markup Language)
  AIML (Astronomical Instrument ML)
Mixed
  often custom (bug reports)
A Bug Report Document
<?xml version="1.0"?>
<bugreport>
<product>xmltron</product>
<version>1.1</version>
<os>RTE</os>
<osversion>4.0</osversion>
<date scheme="ISO8601">1999-11-03</date>
<report>
<summary>doesn't work</summary>
<detail>at all</detail>
</report>
<solution>none yet</solution>
</bugreport>
Make a Document Type
<!DOCTYPE bugreport [
<!-- declarations go here -->
]>
<bugreport> ...
Doctype and root element must match
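The name match is a validity constraint, so a non-validating parser will not complain about a mismatch. A simple check can be sketched with Python's standard `xml.dom.minidom`; the helper name and the two sample documents are assumptions made for this example.

```python
from xml.dom import minidom

def doctype_matches_root(document):
    """Return True if the DOCTYPE name equals the root element's name."""
    dom = minidom.parseString(document)
    return dom.doctype is not None and dom.doctype.name == dom.documentElement.tagName

matching = "<!DOCTYPE bugreport [<!ELEMENT bugreport ANY>]><bugreport/>"
mismatched = "<!DOCTYPE memo [<!ELEMENT memo ANY>]><bugreport/>"

print(doctype_matches_root(matching), doctype_matches_root(mismatched))
```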
Declarations for Elements
<!DOCTYPE bugreport [
<!-- bugreport: wait 'til next slide -->
<!ELEMENT product (#PCDATA)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT os (#PCDATA)>
<!ELEMENT osversion (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT report (summary, detail)>
<!ELEMENT summary (#PCDATA)>
<!ELEMENT detail (#PCDATA)>
<!ELEMENT solution (#PCDATA)>
]>
Declaration for Root Element
<!DOCTYPE bugreport [
<!ELEMENT bugreport (product, version, os, osversion, date, report, solution?)>
<solution> is optional, others required and must be in this order.
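A content model like this one behaves much like a regular expression over the child element names. The sketch below is not a DTD validator: it hand-translates this single declaration into a Python regular expression and checks the children of a root element against it. The names `ROOT_MODEL` and `children_match` are invented for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of (product, version, os, osversion, date, report, solution?)
ROOT_MODEL = re.compile(r"product version os osversion date report( solution)?")

def children_match(element, model):
    """Check the sequence of child element names against a content model."""
    names = " ".join(child.tag for child in element)
    return model.fullmatch(names) is not None

complete = ET.fromstring(
    "<bugreport><product/><version/><os/><osversion/><date/>"
    "<report/><solution/></bugreport>")
no_solution = ET.fromstring(
    "<bugreport><product/><version/><os/><osversion/><date/>"
    "<report/></bugreport>")
wrong_order = ET.fromstring(
    "<bugreport><version/><product/><os/><osversion/><date/>"
    "<report/></bugreport>")

print(children_match(complete, ROOT_MODEL),
      children_match(no_solution, ROOT_MODEL),
      children_match(wrong_order, ROOT_MODEL))
```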
Declarations for Attributes
"CDATA" instead of "PCDATA" means the value is plain character data: it cannot contain element markup, though entity and character references are still expanded
#IMPLIED means optional (value implied by document)
separate ATTLIST declarations for the same element are OK
internal ATTLIST declarations override external
<!ATTLIST date scheme CDATA #IMPLIED>
Reusing Element Declarations
<product>
  <name>xmltron</name>
  <version>1.1</version>
</product>
<os>
  <name>RTE</name>
  <version>4.0</version>
</os>
Use the same elements for product and OS info.
New Declarations for Elements
<!ELEMENT product (name, version)>
<!ELEMENT os (name, version)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT version (#PCDATA)>
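One payoff of reusing `name` and `version` is that processing code can be shared as well. Here is a minimal sketch with Python's standard `xml.etree.ElementTree`; the `name_and_version` helper and the trimmed-down report are assumptions for the example.

```python
import xml.etree.ElementTree as ET

REPORT = """<bugreport>
  <product><name>xmltron</name><version>1.1</version></product>
  <os><name>RTE</name><version>4.0</version></os>
</bugreport>"""

def name_and_version(element):
    """Works for any element whose content is (name, version)."""
    return element.findtext("name"), element.findtext("version")

report = ET.fromstring(REPORT)
product = name_and_version(report.find("product"))
operating_system = name_and_version(report.find("os"))

print(product, operating_system)
```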
Customizing Existing DTDs
Add attributes
Add entities
Rarely change elements
  Can't override element declarations
  Can add new child elements to those that allow ANY
Some DTDs are designed for extensions
documents = contents + style
Extensible Stylesheet Language (XSL)
Specifications still in draft
But implementations keeping pace
XSL is in Three Parts
XSLT: transformation
XPath: addressing parts of an XML document
FO: formatting objects
We will cover only XSLT today
XML into HTML
XSLT can transform into (called "output method"):
  XML
  HTML
  text
Server-side XSLT engine
  content in XML
  served as HTML
  browser never knows
Transforming The Purple Cow
Add HTML intro and outro
convert <head> to <h1>
convert <lg> to <p> (at beginning of stanza)
convert <l> to <br> (at end of line)
The Purple Cow (XML)
<?xml version="1.0"?>
<!DOCTYPE TEI.2 SYSTEM "tei.dtd">
<?xml-stylesheet href="purple.xsl" type="text/xml"?>
<TEI.2><text><body><div1 type="poem">
<head>The Purple Cow</head>
<lg>
<l>I never saw a purple cow,</l>
<l>I never hope to see one;</l>
<l>But I can tell you, anyhow,</l>
<l>I'd rather see than be one.</l>
</lg>
</div1></body></text></TEI.2>
The Purple Cow (HTML)
<html><body><h1>The Purple Cow</h1>I never saw a purple cow,<br>I never hope to see one;<br>But I can tell you, anyhow,<br>I'd rather see than be one.<br></body></html>
Intro and Outro
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="TEI.2">
    <html>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
XSLT So Far
It is XML
Uses XML Namespaces: no name conflicts
Defaults to text/xml output method
Uses text/html if <html> is output at root
Applies templates to input
A Template for Text Content
<xsl:template match="head">
  <h1>
    <xsl:apply-templates/>
  </h1>
</xsl:template>
Default element rule applies templates
Default text rule copies to output
IE5 doesn't implement the default rules
Default Templates
<!-- Default template for elements, applies to children -->
<xsl:template match="*|/">
  <xsl:apply-templates/>
</xsl:template>
<!-- Default template for text and attribute nodes, copies content to output -->
<xsl:template match="text()|@*">
  <xsl:value-of select="."/>
</xsl:template>
Line Groups and Lines
<!-- wrap each stanza in a <p>; a stylesheet is XML, so the tag must be closed -->
<xsl:template match="lg">
  <p><xsl:apply-templates/></p>
</xsl:template>
<!-- put a <br> after each line; written as <br/> so the stylesheet stays well-formed -->
<xsl:template match="l">
  <xsl:apply-templates/><br/>
</xsl:template>
A Complete Stylesheet
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="TEI.2">
    <html><body><xsl:apply-templates/></body></html>
  </xsl:template>
  <xsl:template match="head">
    <h1><xsl:apply-templates/></h1>
  </xsl:template>
  <xsl:template match="lg">
    <p><xsl:apply-templates/></p>
  </xsl:template>
  <xsl:template match="l">
    <xsl:apply-templates/><br/>
  </xsl:template>
</xsl:stylesheet>
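To make the template-matching idea concrete without an XSLT engine, here is a rough hand-written Python equivalent of the four templates plus the default rules. It is only an analogy, not XSLT: the `TEMPLATES` table, the `{}` placeholder standing in for `<xsl:apply-templates/>`, and the shortened two-line poem are all assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

POEM = ('<TEI.2><text><body><div1 type="poem">'
        "<head>The Purple Cow</head>"
        "<lg><l>I never saw a purple cow,</l><l>I never hope to see one;</l></lg>"
        "</div1></body></text></TEI.2>")

# One entry per template; {} marks where the transformed children go.
TEMPLATES = {
    "TEI.2": "<html><body>{}</body></html>",
    "head": "<h1>{}</h1>",
    "lg": "<p>{}</p>",
    "l": "{}<br>",
}

def apply_templates(element):
    """Default text rule: copy text; then transform children in document order."""
    parts = [escape(element.text or "")]
    for child in element:
        parts.append(transform(child))
        parts.append(escape(child.tail or ""))
    return "".join(parts)

def transform(element):
    """Use the matching template, or the default element rule if none matches."""
    template = TEMPLATES.get(element.tag, "{}")
    return template.format(apply_templates(element))

html = transform(ET.fromstring(POEM))
print(html)
```

Elements without an entry, such as `text`, `body`, and `div1`, fall through to the default rule, which is why they leave no trace in the output.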
Other XSL Features
Cascading stylesheets
Including stylesheets
Conditionals (if/else), variables
Relative selectors, XPath selectors
Counting, sorting
String and number manipulation
Template modes (e.g. table-of-contents and full)
Why do it?
Different HTML for different browsers (make sure the default works!)
Index only content with search engine
Generate RTF or TeX with text output method
Analyze XML files (all meta data defined?)
Convert between DTDs
XML Information
XML at W3C
  www.w3.org/XML
  www.w3.org/TR/REC-xml
The Annotated XML Spec
  www.xml.com/pub/axml/axmlintro.html
The Robin Cover SGML/XML page (encyclopedic!)
  www.oasis-open.org/cover/
The XML Bible, Elliotte Rusty Harold
  updates at: metalab.unc.edu/xml/books/bible/
www.xml.com (articles and directory)
XML Software
SAX (Simple API for XML)
  www.megginson.com/SAX
  www.jclark.com/XML (C and Java parsers)
DOM (Document Object Model)
  www.w3c.org/DOM (specs)
  www.alphaworks.ibm.com (XML4J parser)
  developer.java.sun.com/developer/products/xml/ (Project X)
Parser conformance testing
  www.xml.com/pub/1999/09/conformance/
  www.oasis-open.org/cover/xmlConformance.html
Avoid MSXML (Microsoft), non-standard and buggy
General DTD Resources
Structuring XML Documents, David Megginson
The XML and SGML Cookbook: Recipes for Structured Information, Rick Jelliffe
  more an SGML book, but excellent on internationalization
Specific DTD Resources
Inside XML DTDs: Scientific and Technical, Simon St. Laurent
DocBook: The Definitive Guide, Norman Walsh and Leonard Muellner
TEI Lite and Bare Bones TEI (SGML)
  www.tei-c.org (TEI Consortium)
  www-tei.uic.edu/orgs/tei/intros/teiu5.html
  www-tei.uic.edu/orgs/tei/intros/teiu6.html
Chemical Markup Language: www.xml-cml.org
MathML: www.w3.org/TR/REC-MathML