XML Validation I: DTDs, Robin Burke, ECT 360, Winter 2004



XML Validation I: DTDs

Robin Burke

ECT 360

Winter 2004

Outline

History
Grammars / Regular expressions
DTDs
  elements
  attributes
  entities

Declarations

Validation

Why bother?

The idea

A language consists of
  terminals: a, b, c
  non-terminals: A, B, C
  a set of productions: rules, each beginning with a non-terminal, that specify how to generate sequences of terminals

Example

A → aB
A → aBA
B → b
generates the strings ab, abab, ababab, etc.
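
As a rough illustration of how the productions above generate strings, the following Python sketch expands the leftmost non-terminal until only terminals remain. The names PRODUCTIONS and generate, and the length cut-off, are assumptions made for this example and are not part of the slides.

```python
from collections import deque

# Productions from the slide: A -> aB, A -> aBA, B -> b.
# Upper-case letters are non-terminals, lower-case letters are terminals.
PRODUCTIONS = {
    "A": ["aB", "aBA"],
    "B": ["b"],
}


def generate(start="A", max_len=6):
    """Return every terminal string of length <= max_len derivable from start."""
    results = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Locate the leftmost non-terminal, if there is one.
        idx = next((i for i, ch in enumerate(form) if ch.isupper()), None)
        if idx is None:
            results.add(form)  # only terminals left: a finished string
            continue
        for rhs in PRODUCTIONS[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            # Every symbol yields at least one terminal, so a form that is
            # already too long can never shrink back under the limit.
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(results, key=len)


strings = generate()
print(strings)
```

Raising max_len yields longer repetitions of "ab"; the grammar never produces the empty string because every production for A emits at least "ab".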

Grammar

Can be used to parse a language efficiently
  the basis of all modern programming language parsing since Algol-60
  the Java Language Specification defines the language with a complete EBNF-style grammar

Grammar

XML
  grammar-based syntax
  adheres to EBNF
SGML
  had a more complex language definition syntax
  HTML is defined the SGML way

Regular expressions

Language for expressing patterns
Basic components
  pattern elements
  optional element = ?
  repetition (1 or more) = +
  repetition (0 or more) = *
  choice = |
  grouping = ( )
  sequence = ,

Examples

(a, b)*
  the empty string and all strings "ab", "abab", etc.
(a | b | c)+, q, (b | c)*
  aaqb
  bq
  bqcccccccc
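
The patterns above can be tried out with Python's re module. This is only a sketch: the helper names dtd_to_regex and matches are invented for the example, each symbol is assumed to be a single letter, and the final group of the second pattern is written as the choice (b | c)*, which is what the three sample strings require.

```python
import re


def dtd_to_regex(model):
    """Translate a DTD-style pattern over single-letter symbols to a Python regex.

    Only the sequence operator needs translating: "," means "followed by",
    which Python writes as plain concatenation. Whitespace is dropped too.
    """
    return re.sub(r"[,\s]", "", model)


def matches(model, text):
    """True if the whole of text is described by the pattern."""
    return re.fullmatch(dtd_to_regex(model), text) is not None


pairs = "(a, b)*"
mixed = "(a | b | c)+, q, (b | c)*"

for sample in ["", "ab", "abab", "aba"]:
    print(repr(sample), matches(pairs, sample))

for sample in ["aaqb", "bq", "bqcccccccc", "q"]:
    print(repr(sample), matches(mixed, sample))
```

With the sequence form (b, c)* in place of the choice, "aaqb" and "bqcccccccc" would be rejected, because b and c would then have to appear in pairs.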

Note

Regular expressions are different in different applications
  Perl
  Javascript
  XML Schemas
DTDs only support ? + * | , ( )

EBNF

EBNF is a more compact version of BNF
  it uses regular expressions to simplify the grammar
A → aB
A → aBA
turns into
A → aB(A)?
only one production per non-terminal is allowed

DTDs

Use EBNF to specify the structure of XML documents
Plus
  attributes
  entities
Syntax is a holdover from SGML
  ugly

DTD Syntax

<!ELEMENT element-name content_model>

Content model contains the RHS of the production rule

Example

<!ELEMENT name (firstName, lastName)>

DTD Syntax cont'd

Not XML
  <! begins a declaration
  no "content"
  empty elements are not indicated with />

Simple content models

Content can be any text
  #PCDATA
Content can be anything at all (useful for debugging)
  ANY
Element has no content
  EMPTY

Example

<grades>
  <grade>
    <student>Jane Doe</student>
    <assigned-grade>A</assigned-grade>
  </grade>
  <grade>
    <student>John Doe</student>
    <assigned-grade>A-</assigned-grade>
  </grade>
</grades>

Example

<grades>
  <grade>
    <student>Jane Doe</student>
    <assigned-grade>A</assigned-grade>
  </grade>
  <grade>
    <student>John Doe</student>
    <assigned-grade>A-</assigned-grade>
  </grade>
  <grade>
    <student>Wayne Doe</student>
    <assigned-grade>I</assigned-grade>
    <reason>Alien abduction</reason>
  </grade>
</grades>
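
The slides do not give a DTD for the grades document, so the content models below are a guess at what one might look like, with reason made optional. The Python sketch checks the children of each element against its content model by turning the model into a regular expression. It is not a real validating parser (Python's standard library does not include one); CONTENT_MODELS, model_to_regex and validate are names made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models, corresponding to the declarations
#   <!ELEMENT grades (grade*)>
#   <!ELEMENT grade  (student, assigned-grade, reason?)>
# Elements without an entry are treated as text-only and not checked.
CONTENT_MODELS = {
    "grades": "grade*",
    "grade": "student, assigned-grade, reason?",
}


def model_to_regex(model):
    """Turn a content model into a regex over child names, each ending in ';'."""
    pattern = re.sub(r"[\w.-]+", lambda m: "(?:" + re.escape(m.group()) + ";)", model)
    # "," is sequence, which a regex expresses by simple concatenation.
    return pattern.replace(",", "").replace(" ", "")


def validate(element):
    """Return a list of error messages for element and all its descendants."""
    errors = []
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + ";" for child in element)
        if not re.fullmatch(model_to_regex(model), children):
            errors.append(f"<{element.tag}> has unexpected children: {children or '(none)'}")
    for child in element:
        errors.extend(validate(child))
    return errors


GOOD = """<grades>
  <grade><student>Jane Doe</student><assigned-grade>A</assigned-grade></grade>
  <grade><student>Wayne Doe</student><assigned-grade>I</assigned-grade>
         <reason>Alien abduction</reason></grade>
</grades>"""

# Same elements, but reason comes first, which the sequence does not allow.
BAD = """<grades>
  <grade><reason>Alien abduction</reason><student>Wayne Doe</student>
         <assigned-grade>I</assigned-grade></grade>
</grades>"""

good_errors = validate(ET.fromstring(GOOD))
bad_errors = validate(ET.fromstring(BAD))
print(good_errors)
print(bad_errors)
```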

Mixed content

Legal to have a content model with text and element data

<story category="national" byline="Karen Wheatley">
  <headline>President Meets with Congress</headline>
  <![CDATA[ The President met with Congressional leaders today in an effort to jump-start faltering budget negotiations. Sources described the mood of the meeting as "cordial". ]]>
  <full_text ref="news801" />
  <image src="img2071.jpg" />
  <image src="img2072.jpg" />
  <image src="img2073.jpg" />
</story>

CDATA?

Forgot to mention last week
Content that appears in a CDATA section is not parsed as markup
  can include arbitrary text, including <, &, etc.
Only restriction
  the termination sequence ]]> cannot appear inside the section
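
A quick way to see the effect of a CDATA section is to parse the same text with and without it. The sketch below uses Python's xml.etree.ElementTree; the note element and its content are invented for the example.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, < and & are ordinary characters.
with_cdata = """<note><![CDATA[if (a < b && b < c) { print "<ok>"; }]]></note>"""
note = ET.fromstring(with_cdata)
print(note.text)   # the raw text, delivered unchanged
print(len(note))   # number of child elements found inside the section

# The same text without the CDATA wrapper is not well-formed XML.
without_cdata = """<note>if (a < b && b < c) { print "<ok>"; }</note>"""
try:
    ET.fromstring(without_cdata)
    parsed_without_cdata = True
except ET.ParseError as err:
    parsed_without_cdata = False
    print("not well-formed:", err)
```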

Mixed content, cont'd

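<!ELEMENT story (#PCDATA | headline | full_text | image)*>

In a DTD, mixed content must be declared this way: #PCDATA first, then the allowed elements as alternatives, all repeated with *
  the order and number of the child elements cannot be constrained

Mixed content
  makes handling XML complex
  necessary for many applications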

Recursion

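Unlike in grammars
  a recursive formulation ≠ repetition
Difference between
  <!ELEMENT students (student+)>
  <!ELEMENT students (student, students?)>
The first describes a flat list of student elements; the second nests each remaining list inside a further students element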
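
To make the difference concrete, here are two documents, one shaped like the repetition model (student+) and one like the recursive model (student, students?). The student names and the helpers depth and names are invented for this sketch; it only inspects the resulting trees and does not validate anything.

```python
import xml.etree.ElementTree as ET

# Shape allowed by <!ELEMENT students (student+)>: one flat list.
FLAT = (
    "<students>"
    "<student>Ann</student><student>Bob</student><student>Cy</student>"
    "</students>"
)

# Shape required by <!ELEMENT students (student, students?)>:
# each further student sits one level deeper.
NESTED = (
    "<students><student>Ann</student>"
    "<students><student>Bob</student>"
    "<students><student>Cy</student></students>"
    "</students></students>"
)


def depth(element):
    """Number of levels in the tree rooted at element."""
    return 1 + max((depth(child) for child in element), default=0)


def names(element):
    """Text of every student element, in document order."""
    return [student.text for student in element.iter("student")]


flat, nested = ET.fromstring(FLAT), ET.fromstring(NESTED)
print(names(flat), depth(flat))
print(names(nested), depth(nested))
```

Both documents carry the same three names, but an application walking the tree sees very different structures, and the nested one grows a level deeper with every student.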

Restriction

The grammar cannot be ambiguous
  A → (a, b) | (a, c)
  after reading a, the parser cannot tell which alternative it is in; allowing this would make the parser implementation difficult
Usually easy to make non-ambiguous
  A → a, (b | c)
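
The rewritten model accepts exactly the same sequences as the ambiguous one, which can be checked by brute force. The sketch below uses Python regular expressions as a stand-in for content models; note that Python's backtracking engine happily accepts both forms, whereas an XML parser is expected to decide with one symbol of lookahead. The constant and function names are assumptions for this example.

```python
import re
from itertools import product

AMBIGUOUS = r"(ab)|(ac)"      # A → (a, b) | (a, c)
DETERMINISTIC = r"a(b|c)"     # A → a, (b | c)


def language(pattern, alphabet="abc", max_len=3):
    """List every string up to max_len over alphabet that pattern fully matches."""
    return [
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if re.fullmatch(pattern, "".join(chars))
    ]


print(language(AMBIGUOUS))
print(language(DETERMINISTIC))
```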

Attribute lists

Declared separately from elements
  can be anywhere in the DTD
Specification includes
  name of the element
  name of the attribute
  attribute type
  default

Attribute types

Character data
  CDATA
  different from an XML CDATA section!
Enumerated
  (yes|no)
ID
  must be unique in the document
IDREF
  must refer to an ID in the document
NMTOKEN
  a restriction of CDATA to a single "word"
Also IDREFS and NMTOKENS

Default declaration

#REQUIRED
#IMPLIED
  means optional
Value
  this becomes the default
#FIXED
  value provided, and the attribute must always have that value

Examples

<!ATTLIST img
  src   CDATA               #REQUIRED
  alt   CDATA               #REQUIRED
  align (left|right|center) "left"
  id    ID                  #IMPLIED
>

<!ATTLIST timestamp
  time-zone NMTOKEN #IMPLIED>
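
What the img declaration asks of a document can be sketched in a few lines of Python. The table IMG_ATTLIST re-expresses the declaration as data, and apply_attlist is a made-up helper that reports missing required attributes, values outside an enumeration and undeclared attributes, and fills in literal defaults. It deliberately ignores the ID uniqueness rule and is not a substitute for a validating parser.

```python
import xml.etree.ElementTree as ET

# The img declaration from the slide, re-expressed as Python data:
# attribute name -> (type, default declaration).
# An enumerated type is written as a tuple of its allowed values.
IMG_ATTLIST = {
    "src": ("CDATA", "#REQUIRED"),
    "alt": ("CDATA", "#REQUIRED"),
    "align": (("left", "right", "center"), "left"),
    "id": ("ID", "#IMPLIED"),
}


def apply_attlist(element, attlist):
    """Check element against attlist; return (attributes with defaults, errors)."""
    attrs = dict(element.attrib)
    errors = []
    for name, (attr_type, default) in attlist.items():
        if name not in attrs:
            if default == "#REQUIRED":
                errors.append(f"missing required attribute {name!r}")
            elif default != "#IMPLIED":
                attrs[name] = default  # a literal value acts as the default
        elif isinstance(attr_type, tuple) and attrs[name] not in attr_type:
            errors.append(f"{name}={attrs[name]!r} is not one of {attr_type}")
    for name in attrs:
        if name not in attlist:
            errors.append(f"undeclared attribute {name!r}")
    return attrs, errors


ok_attrs, ok_errors = apply_attlist(
    ET.fromstring('<img src="logo.png" alt="Logo"/>'), IMG_ATTLIST)
bad_attrs, bad_errors = apply_attlist(
    ET.fromstring('<img src="logo.png" align="top"/>'), IMG_ATTLIST)

print(ok_attrs, ok_errors)
print(bad_errors)
```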

Entities

Like macros
  content to be inserted
  indicated with &name;
Predefined general entities
  &amp; &lt;
  essential part of XML
User-defined general entities
  &disclaimer;
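
The macro-like behaviour can be observed with Python's xml.etree.ElementTree, which expands general entities declared in a document's internal DTD subset. The story element and the shortened disclaimer text are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# The DTD is placed inside the document (the internal subset) so that the
# example is self-contained; the entity text is shortened for readability.
DOC = """<!DOCTYPE story [
  <!ELEMENT story (#PCDATA)>
  <!ENTITY disclaimer "This is a work of fiction.">
]>
<story>Once upon a time. &disclaimer; &lt;The End&gt;</story>"""

story = ET.fromstring(DOC)
print(story.text)
```

The user-defined &disclaimer; and the predefined &lt; and &gt; are all replaced before the application sees the text.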

Entities, cont'd

Parameter entities
  can also be used to simplify DTD creation or to combine DTDs
  indicated with a %

More on this next week

Defining general entities

<!ENTITY name "content">

Example

<!ENTITY disclaimer "This is a work of fiction. Any resemblance to persons living or dead is unintentional.">

Unparsed data

What about non-text data?
  images, audio files
In XML we define a notation
  create a name and associate an application
  a suggestion to the application on how to interpret the unparsed data
  not part of the parsing operation

Using Notation

<!NOTATION name SYSTEM "url">

Example
<!NOTATION jpeg SYSTEM "IExplore.exe">
  declares the jpeg notation

Example
<!ENTITY photo53 SYSTEM "photo53.jpg" NDATA jpeg>
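
A parser does not read the unparsed data; it only reports the declarations and leaves the rest to the application. The sketch below shows this with Python's xml.parsers.expat, using its NotationDeclHandler and UnparsedEntityDeclHandler callbacks. The album and photo elements and the file attribute are invented for the example, and the entity name is written without quotes, as the XML syntax requires.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE album [
  <!ELEMENT album (photo*)>
  <!ELEMENT photo EMPTY>
  <!ATTLIST photo file ENTITY #REQUIRED>
  <!NOTATION jpeg SYSTEM "IExplore.exe">
  <!ENTITY photo53 SYSTEM "photo53.jpg" NDATA jpeg>
]>
<album><photo file="photo53"/></album>"""

notations = {}   # notation name -> system identifier
unparsed = {}    # entity name -> (system identifier, notation name)


def on_notation(notation_name, base, system_id, public_id):
    notations[notation_name] = system_id


def on_unparsed_entity(entity_name, base, system_id, public_id, notation_name):
    unparsed[entity_name] = (system_id, notation_name)


parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.UnparsedEntityDeclHandler = on_unparsed_entity
parser.Parse(DOC, True)

print(notations)
print(unparsed)
```

The parser never opens photo53.jpg; it hands the file name and the notation to the application, which decides what to do with them.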

Notation, cont'd

Note that the unparsed entity is declared in the DTD, not in the document
  the binary data is not embedded in the XML document; the entity refers to an external file
Not that useful in practice
  more likely to use URLs

Typical Example

<story category="national" byline="Karen Wheatley">

...

<full_text ref="news801" />

<image src="img2071.jpg" />

<image src="img2072.jpg" />

<image src="img2073.jpg" />

</story>

Now it is up to the application to do something appropriate with the src attribute

A better solution

Use XLink
We'll talk about this later

DTD limitations

Not in XML
  need a special parser for the DTD
No content type restrictions
  #PCDATA can be anything
Element names must be globally unique
  cannot reuse a common term at different places in the document
  course-name, professor-name

DTD benefits

Relatively easy to write and understand
  wait until you see XML Schema!
Possible to modularize and combine DTDs
  more next week

Next week

More DTDs
  modularization and parameterization
  on-line reading
Beginning Schemas
  4.1-4.30
Lab