XML – Extensible Markup Language
Objectives
• To understand various ways in which XML can be used
  • History of XML
  • Syntax of XML
  • Difference between HTML, XML and XHTML
  • XML Document Type Definitions (DTDs)
  • XML Schemas
• To understand types of XML parsers
  • Validating vs. non-validating parsers
• To understand different XML parser interfaces
  • Tree-based interface standard: DOM
  • Event-based interface standard: SAX
• Evaluating parsers
  • Which parser to use?
History of XML
• The World Wide Web Consortium (W3C) is an international consortium where member organizations, a full-time staff, and the public work together to develop Web standards.
• Tim Berners-Lee, who invented the World Wide Web in 1989, founded the W3C with others in 1994.
• In the late 1960s IBM developed GML (Generalized Markup Language), the forerunner of SGML; SGML itself became an ISO standard (ISO 8879) in 1986.
• SGML: Standard Generalized Markup Language
• SGML is a semantic and structural language for text documents.
• SGML is complicated.
• The XML Working Group was formed under the W3C in 1996.
• In 1998 the W3C introduced XML 1.0.
• Extensible Markup Language (XML) is a subset of SGML.
What is XML?
• XML stands for eXtensible Markup Language.
• XML is a universal method of representing data, used in applications, on the web, and for data exchange.
• XML is a markup language much like HTML, but used for different purposes.
• XML is not a replacement for HTML.
• XML was designed to describe data.
• XML is a cross-platform, software- and hardware-independent tool for transmitting or exchanging information.
• XML is an open, standards-based technology: extensible, and both human- and machine-readable.
• XML standards: XML 1.0 (1998) and XML 1.1 (Feb 2004).
What Exactly is XML used for?
• Storing data in a structured manner (tree structure)
• Storing configuration information, typically application data that is not stored in a database
  • Many server products keep their configuration files in XML format
• Transmitting data between applications
  • Overcomes problems in client-server applications that are cross-platform in nature
  • Ex: a Windows program talking to a mainframe
• XML is a universal, standardized language used to represent data such that it can be both processed independently and exchanged between programs and applications and between clients and servers
• Disparate systems can exchange information in a common format
XML Syntax
The syntax rules of XML are very simple and very strict.
• XML tags are not predefined. You must define your own tags
    <college>GCET</college>
• All XML elements must have a closing tag
    <para>This is a paragraph</para>
• XML tags are case sensitive
    <Msg>This is incorrect</msg>    Incorrect
    <msg>This is correct</msg>      Correct
• All XML elements must be properly nested
    <name>Jill<lname>Jack</name></lname>    Incorrect
    <name>Jill<lname>Jack</lname></name>    Correct
• Attribute values must always be quoted
    <pen color=red>reynolds</pen>      Incorrect
    <pen color="red">reynolds</pen>    Correct
• All XML documents must have a root element
    <parent>
      <child>
        <subchild>.....</subchild>
      </child>
    </parent>
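The rules above can be checked mechanically, because any XML parser rejects text that breaks them. The following is a minimal sketch using the JAXP DOM parser that ships with the JDK; the class name WellFormedCheck and the method isWellFormed are made up for this illustration and are not part of the slides.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class WellFormedCheck {

    // Returns true if a non-validating parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler keeps the parser from printing to stderr;
            // its fatalError() still rethrows, which is caught below.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness rule was broken
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<msg>This is correct</msg>",
            "<Msg>This is incorrect</msg>",          // tag case mismatch
            "<name>Jill<lname>Jack</name></lname>",  // bad nesting
            "<pen color=red>reynolds</pen>",         // unquoted attribute
            "<a/><b/>"                               // two root elements
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```

Only the first sample is accepted; each of the others violates exactly one of the rules listed above.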
XML Comments
Comments in XML are similar to those in HTML:
    <!-- This is a comment -->
Example:
    <?xml version="1.0"?>
    <!-- Customer details -->
    <customer>
      <name>John</name>
      <email>[email protected]</email>
    </customer>
XML Code (cust.xml)
    <?xml version="1.0"?>
    <customers>
      <customer>
        <name>John</name>
        <email>[email protected]</email>
      </customer>
      <customer>
        <name>Tom</name>
        <email/>
      </customer>
    </customers>
Extensibility in XML
• A typical XML document is made up of tags enclosing the data; tag names describe the data
• Because the language is extensible, you can create tags that are specific to your needs
• For example, your document may contain tags to structure information about employees; the tags may include <Name>, <Designation>, and <Address>
• Data stored in XML is self-descriptive: one can understand the data just by looking at the tag names
XML – Exchanging Info Between Apps
• Convert information stored in the database (or any other format) to an XML format
• Once it is in XML format, other applications/programs can parse (read) the XML document, which carries the original data
• XML parsers are freely available and ship with many programming languages
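As a rough illustration of the "convert to XML" step, the sketch below turns a few in-memory rows into a customers document with the JDK's DOM and Transformer APIs. The class name CustomerExport, the map standing in for database rows, and the e-mail address are assumptions made for this example.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical exporter, for illustration only.
class CustomerExport {

    // Converts name -> email rows (standing in for database rows) to XML text.
    static String toXml(Map<String, String> rows) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("customers");
            doc.appendChild(root);
            for (Map.Entry<String, String> row : rows.entrySet()) {
                Element customer = doc.createElement("customer");
                Element name = doc.createElement("name");
                name.setTextContent(row.getKey());
                Element email = doc.createElement("email");
                email.setTextContent(row.getValue());
                customer.appendChild(name);
                customer.appendChild(email);
                root.appendChild(customer);
            }
            // An identity transform serializes the DOM tree to text.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("John", "john@example.com"); // made-up sample data
        rows.put("Tom", "");
        System.out.println(toXml(rows));
    }
}
```

Any receiving program can then read this text with its own XML parser, regardless of platform or language.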
[Figure: XML as the common exchange format between an application, a spreadsheet package, a CAD package, statistical processing, and a database]
[Figure: content is held in the XML Doc, structure is defined by DTD/XSD, presentation is defined by XSL]
XSD - XML Schema Definition
DTD - Document Type Definition.
XSL - Extensible Stylesheet Language.
Document Type Declaration and Definition (DTD)
• A DTD (Document Type Definition) is used to enforce structure requirements for an XML document
• The document type declaration (<!DOCTYPE ...>) contains or references the Document Type Definition and tells the parser which DTD to use for validation
Example (xmldtd.xml):
    <?xml version="1.0"?>
    <!DOCTYPE customers [
      <!ELEMENT customers (customer)>
      <!ELEMENT customer (name,email)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT email (#PCDATA)>
    ]>
    <customers>
      <customer>
        <name>John Conlon</name>
        <email>[email protected]</email>
      </customer>
    </customers>
XML Schema
• An XML-based alternative to DTD
• Richer and more expressive than DTDs
• Written in XML itself, so no separate syntax has to be learned
• Supports data type validation (DTD does not)
add.xml
    <?xml version="1.0"?>
    <addressBook>
      <person>
        <cname>Harrison Ford</cname>
        <email>[email protected]</email>
      </person>
      <person>
        <cname>Julie</cname>
        <email>[email protected]</email>
      </person>
    </addressBook>
Schema for add.xml:
    <?xml version="1.0"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:complexType name="record">
        <xs:sequence>
          <xs:element name="cname" type="xs:string"/>
          <xs:element name="email" type="xs:string"/>
        </xs:sequence>
      </xs:complexType>
      <xs:element name="addressBook">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="person" type="record" minOccurs="0" maxOccurs="unbounded"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>
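To see the address-book schema in action, the sketch below validates small documents against it with the javax.xml.validation API from the JDK. The class name AddressBookCheck and the sample e-mail address are assumptions for this example; the schema text is the one shown above with the namespace prefix declared consistently.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical validator, for illustration only.
class AddressBookCheck {

    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:complexType name='record'><xs:sequence>"
        + "<xs:element name='cname' type='xs:string'/>"
        + "<xs:element name='email' type='xs:string'/>"
        + "</xs:sequence></xs:complexType>"
        + "<xs:element name='addressBook'><xs:complexType><xs:sequence>"
        + "<xs:element name='person' type='record' minOccurs='0' maxOccurs='unbounded'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Returns true if the document conforms to the schema above.
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        try {
            // With no error handler set, the validator throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ok = "<addressBook><person><cname>Julie</cname>"
                + "<email>julie@example.com</email></person></addressBook>";
        String missingEmail =
                "<addressBook><person><cname>Julie</cname></person></addressBook>";
        System.out.println(isValid(ok));            // true
        System.out.println(isValid(missingEmail));  // false
    }
}
```

An empty `<addressBook/>` is also accepted, because person is declared with minOccurs="0".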
XSD Syntax
Simple XML Elements with Pre-defined Data Types
Simple XML element: an XML element that has no child elements and no attributes. Simple XML elements can be defined in XSD with the following statement:
    <xsd:element name="element_name" type="xsd:type_name"/>
where "element_name" is the name of the XML element, and "type_name" is one of the data type names pre-defined in XSD.
XSD pre-defined data types are divided into groups, including:
• Numeric data types
• Date and time data types
• String data types
• Binary data types
• Boolean data type
XSD Syntax
Simple XML Elements with Extended Data Types
Simple XML element: an XML element that has no child elements and no attributes. Simple XML elements can be defined by using the pre-defined XSD data types. They can also be defined by using extended data types, which are defined by "simpleType" statements:
    <xsd:simpleType name="my_type_name">
      <xsd:restriction base="xsd:type_name">
        XSD facet statements
      </xsd:restriction>
    </xsd:simpleType>
    <xsd:element name="element_name" type="my_type_name"/>
where "element_name" is the name of the XML element, "xsd:type_name" is a pre-defined data type serving as the base data type, and "my_type_name" is the new data type derived from the base data type by restriction.
XSD Syntax
Complex XML Elements
Complex XML element: an XML element that has at least one child element or at least one attribute. Complex XML elements must be defined with complex data types, which are defined by "complexType" statements:
    <xsd:element name="element_name" type="my_data_type"/>
    <xsd:complexType name="my_data_type">
      <xsd:sequence>
        <xsd:element name="child_element_1" type="data_type_1"/>
        <xsd:element name="child_element_2" type="data_type_2"/>
        ...
      </xsd:sequence>
      <xsd:attribute name="attribute_a" type="data_type_a"/>
      <xsd:attribute name="attribute_b" type="data_type_b"/>
      ...
    </xsd:complexType>
where the "attribute" statement defines an attribute, and the "sequence" statement defines the group of child elements and the order in which they must appear in the XML structure.
Note that "attribute" statements must appear after the child element definition statements.
XSD Syntax
Empty XML Elements
Empty XML element: a special complex XML element that has one or more attributes but no child elements and no text content. Empty XML elements must be defined with complex data types in the following format:
<xsd:complexType name="my_data_type">
<xsd:attribute name="attribute_a" type="data_type_a"/>
<xsd:attribute name="attribute_b" type="data_type_b"/>
...
</xsd:complexType>
XSD Syntax
Anonymous Data Types
If a data type is specific to a child element in a parent data type, and there is no need to share it with data types outside the parent data type, you can define it as an anonymous data type: a non-named data type defined inline. For example, the following code:
    <xsd:complexType name="my_data_type">
      <xsd:sequence>
        <xsd:element name="setting">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="property" type="xsd:string"/>
              <xsd:element name="value" type="xsd:integer"/>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
defines "my_data_type", which has a "setting" element whose data type is anonymous and defined inline.
Well-formed XML Documents
• A document is made of elements; there is exactly one element, called the root or document element
• All other elements, delimited by start- and end-tags, nest properly within each other
• Attributes, if any, must have their values enclosed in quotes
Valid XML Documents
• An XML document is valid if it has an associated DTD or Schema and the document complies with the constraints expressed in it
• If an XML document is valid, it is also well-formed
Document Type Definitions (DTDs)
• Describe the rules of a document type: which elements may appear in the XML document, and what their contents and attributes are
Need for DTD
• A validating parser (a program) can be used to check whether XML data adheres to the rules in the DTD
• The parser can do appropriate error handling if there are any violations
• A validity error is not necessarily a fatal error, but some applications may treat it as a fatal error
Document Type Declarations
• A valid XML document must include a reference to the DTD that validates it
Types of DTD
• Internal DTD: the DTD is embedded in the XML document
• External DTD: the DTD is in a separate file
Internal DTD
• The DTD is embedded in the XML document
• The declarations appear between [ and ]
E.g. AddressBook.xml
    <?xml version='1.0' encoding='utf-8'?>
    <!-- DTD for AddressBook.xml -->
    <!DOCTYPE AddressBook [
      <!ELEMENT AddressBook (Address+)>
      <!ELEMENT Address (Name, Street, City)>
      <!ELEMENT Name (#PCDATA)>
      <!ATTLIST Name salutation CDATA #REQUIRED>
      <!ELEMENT Street (#PCDATA)>
      <!ELEMENT City (#PCDATA)>
    ]>
    <AddressBook>
      <Address>
        <Name salutation="Mr.">Ram</Name>
        <Street>M G Road</Street>
        <City>Bangalore</City>
      </Address>
    </AddressBook>
External DTD
• The DTD is present in a separate file
• Example: the DTD for AddressBook.xml is contained in the file AddressBook.dtd
• AddressBook.xml contains only XML data with a reference to the DTD file
AddressBook.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE AddressBook SYSTEM "file:///c:/XML/AddressBook.dtd">
    <AddressBook>
      <Address>
        <Name salutation="Mr.">Ram</Name>
        <Street>M G Road</Street>
        <City>Bangalore</City>
      </Address>
    </AddressBook>
Anatomy of DTD – Defining new XML tags (Elements)
<!ELEMENT element_name content_specification>
• element_name: specifies the name of the XML tag
• content_specification: specifies the contents of the element
  • (#PCDATA): parsed character data, i.e. text that the parser scans for markup and entity references
  • Nested (child) elements
  • EMPTY: the element has no content
  • ANY: any declared element or text is allowed (generally avoided)
  • Mixed content: text interleaved with child elements, e.g. (#PCDATA|PRODUCTID)*
• Note: CDATA is an attribute type used in <!ATTLIST> declarations, not an element content specification
Example:
    <!ELEMENT Street (#PCDATA)>
  Element Street contains parsed character data
    <!ELEMENT Address (Name, Street, City)>
  Element Address contains three nested tags: Name, Street and City, in that order
    <!ELEMENT AddressBook (Address+)>
  Element AddressBook contains one or more occurrences of element Address
Anatomy of DTD – Dealing with multiple children
To declare the children of an element we use a syntax similar to regular expressions in Perl. (Assume a and b are child elements of the element being declared.)
• a+ : one or more occurrences of a
• a* : zero or more occurrences of a
• a? : a or nothing
• a, b : a followed by b
• a | b : a or b, but not both
• (expression) : surrounding an expression with parentheses means that it is treated as a unit and may take the suffix operator ?, * or +
Some examples
    <!ELEMENT ITEM (PRODUCT,NUMBER,(PRICE|CHARGEACCT|SAMPLE))>
    <!ELEMENT ITEM (PRODUCT,NUMBER,(PRICE|CHARGEACCT*|SAMPLE)+)>
    <!ELEMENT ITEM (#PCDATA|PRODUCTID)*>
    <!ELEMENT BOOK (OPENER,SUBTITLE?,INTRODUCTION?,(SECTION|PART)+)>
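One way to get a feel for these operators is to let a validating parser judge a few documents against the BOOK content model. The sketch below is an assumption-laden illustration: the class name ContentModelDemo is invented, and the child elements are declared as (#PCDATA) only so that the DTD is complete.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
class ContentModelDemo {

    // Internal DTD subset; the child declarations are assumed to be plain text.
    static final String DTD =
        "<!DOCTYPE BOOK ["
        + "<!ELEMENT BOOK (OPENER,SUBTITLE?,INTRODUCTION?,(SECTION|PART)+)>"
        + "<!ELEMENT OPENER (#PCDATA)><!ELEMENT SUBTITLE (#PCDATA)>"
        + "<!ELEMENT INTRODUCTION (#PCDATA)>"
        + "<!ELEMENT SECTION (#PCDATA)><!ELEMENT PART (#PCDATA)>]>";

    // Counts the validity errors reported for the given document body.
    static int validityErrors(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            AtomicInteger errors = new AtomicInteger();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.incrementAndGet(); // validity errors are recoverable
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return errors.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // OPENER followed by one SECTION satisfies the model.
        System.out.println(validityErrors(
            "<BOOK><OPENER>o</OPENER><SECTION>s</SECTION></BOOK>"));
        // No SECTION or PART at all: (SECTION|PART)+ needs at least one.
        System.out.println(validityErrors("<BOOK><OPENER>o</OPENER></BOOK>"));
    }
}
```

The first call reports no errors; the second reports at least one, because the + operator requires one or more occurrences.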
Anatomy of DTD – Attribute Declarations
Specifies the allowable attributes of each element
<!ATTLIST Tag-name Attr-Name Attr-Type Restriction>
• Tag-name: element name
• Attr-Name: name of the attribute; the attribute is defined for element Tag-name
• Attr-Type: type of the attribute value, e.g. CDATA (character data)
• Restriction:
  • "Value": a default value, given as simple text enclosed in quotes
  • #IMPLIED: there is no default value for this attribute, and the attribute need not be used
  • #REQUIRED: there is no default value for this attribute, but a value must be assigned to it
  • #FIXED "Value": Value is the attribute's value, and the attribute must always have this value
Anatomy of DTD – Attribute Declarations Example
    <!ATTLIST Name salutation CDATA #REQUIRED>
• The element Name has an attribute salutation, which is of type CDATA
• The attribute salutation must be specified in the Name tag
Anatomy of DTD – Entity Declarations (1 of 2)
• Entity references are a way to escape special characters
• Some special characters such as < and & cannot appear literally in #PCDATA; > is usually escaped as well
• This escaping of characters is done with an "entity reference"
• The following entity references are used in XML documents:
  • Built-in entities: &amp; &lt; &gt; &apos; &quot; (for &, <, >, ', ")
  • Character references: &#243; representing ó
• Example
    <State>Jammu &amp; Kashmir</State>
Anatomy of DTD – Entity Declarations (2 of 2)
• Data that is frequently used can be declared as a general entity
    <!ENTITY entity_name entity_contents>
  • entity_name: name of the new entity
  • entity_contents: contents of the new entity
• Example
    <!ENTITY MyCountry "India">
  • Defines the entity called MyCountry
  • "India" is the contents of entity MyCountry
• Usage in the XML document
    <Country>&MyCountry;</Country>
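The effect of entity references is easiest to see from the application side: by the time a DOM parser hands the text over, the references have been replaced. A minimal sketch, assuming the JDK's default DOM parser settings; the class name EntityDemo and the method rootText are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
class EntityDemo {

    // Parses the document and returns the text content of its root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Built-in entity: &amp; becomes &
        System.out.println(rootText("<State>Jammu &amp; Kashmir</State>"));
        // General entity declared in the internal DTD subset
        System.out.println(rootText(
            "<!DOCTYPE Country [<!ENTITY MyCountry \"India\">]>"
            + "<Country>&MyCountry;</Country>"));
    }
}
```

The two calls return "Jammu & Kashmir" and "India" respectively; the application never sees the escaped forms.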
XML Schema
What is XML Schema?
• An XML vocabulary for expressing your data's structure and business rules
• Validating parsers can use a Schema to check whether XML data adheres to the rules in the schema
• More robust and extensive than DTD; it can even do data type validations
E.g.: consider the following XML document
    <Result>
      <EmpNo>45609</EmpNo>
      <Name>Kiran</Name>
      <Subject>
        <Name>IWT</Name>
        <Marks>80</Marks>
        <Grade>A</Grade>
      </Subject>
    </Result>
Is this data valid?
To be valid, it must meet the following business rules (constraints):
• The Result must be comprised of a Subject, Marks, Grade in the order shown
• The Subject must be a valid subject from the list (DC, IWT, Cryptography)
• The Marks must be between 0 and 100 only, and Grade can be either A or B or C
How can XML Schema help to accomplish this?
Answer
• It creates an XML vocabulary: it defines the set of elements <Result>, <Subject>, <Marks>, <Grade>
• It specifies the contents of each element and the restrictions on each element
  • The <Result> element must contain <Subject>, <Marks>, <Grade> in that order
  • <Subject> must be one of the valid subjects (IWT, Cryptography, DC)
  • The Marks must be between 0 and 100 only
  • Grade can be either A or B or C
• XML Schema specifies the namespace the created vocabulary must be in
  • The namespace name is not an actual URL, but uses URL syntax and should be a unique string
  • Example: the namespace http://www.Results.com defines the vocabulary above
Example of referring to Schema (Result.xml)
    <?xml version="1.0" encoding="UTF-8"?>
    <res:Result xmlns:res="http://www.Results.com"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.Results.com Result.xsd">
      <res:Name>Kiran</res:Name>
      <res:EmpNo>45609</res:EmpNo>
      <res:Subject>
        <res:Name>IWT</res:Name>
        <res:Marks>80.70</res:Marks>
        <res:Grade>A</res:Grade>
      </res:Subject>
      <res:Subject>
        <res:Name>PF</res:Name>
        <res:Marks>78.30</res:Marks>
        <res:Grade>B+</res:Grade>
      </res:Subject>
    </res:Result>
Schema example : Result.xsd
    <?xml version="1.0" encoding="UTF-8"?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                targetNamespace="http://www.Results.com"
                xmlns="http://www.Results.com"
                elementFormDefault="qualified">
      <!-- Root Element Declaration -->
      <xsd:element name="Result">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="Name" type="xsd:string"/>
            <xsd:element name="EmpNo" type="xsd:int"/>
            <xsd:element name="Subject" type="SubjectType" maxOccurs="5"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
      <xsd:simpleType name="NameType">
        <xsd:restriction base="xsd:string">
          <xsd:pattern value="CHSSC|PF|RDBMS|IWT|AOA"/>
        </xsd:restriction>
      </xsd:simpleType>
      <xsd:complexType name="SubjectType">
        <xsd:sequence>
          <xsd:element name="Name" type="NameType"/>
          <!-- Reference to the element Marks -->
          <xsd:element ref="Marks"/>
          <xsd:element name="Grade">
            <xsd:simpleType>
              <xsd:restriction base="xsd:string">
                <!-- "+" is escaped: an unescaped + in a pattern means "one or more" -->
                <xsd:pattern value="A|B\+|B|C|D"/>
              </xsd:restriction>
            </xsd:simpleType>
          </xsd:element>
        </xsd:sequence>
      </xsd:complexType>
      <xsd:element name="Marks">
        <xsd:simpleType>
          <xsd:restriction base="xsd:float">
            <xsd:minInclusive value="0.0"/>
            <xsd:maxInclusive value="100.0"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
    </xsd:schema>
DTD vs Schema
• Syntax: an XML document and its DTD use different syntaxes, which is inconsistent; a Schema uses XML syntax
• Limited data type capability in DTDs
  • DTDs support only a very limited capability for specifying data types; they do not support field-level validation or complex types
  • E.g.: you can't express "I want the <Marks> element to hold an integer with a range of 0 to 100" in a DTD
• Schema describes a set of data types compatible with those found in databases
  • E.g.: a database supports integer, string, etc. data types; Schema supports them too, while DTD does not
Element Declarations: Simple Element
Syntax: <xsd:element name="Element_name" type="Element_type" Occurrence/>
• Element_name: any valid XML name
• Element_type: built-in simple type
• Occurrence: number of occurrences of that element (minOccurs / maxOccurs), optional
Example:
    <xsd:element name="Name" type="xsd:string"/>
  Defines the element Name of type string
    <xsd:element name="Marks" type="xsd:float" maxOccurs="5"/>
  Defines the element Marks of simple type float; Marks may appear at most 5 times and, by default, at least once
Element Declarations
Syntax:
    <xsd:element name="Element_name">
      <xsd:complexType>
        <!-- Element Specification -->
      </xsd:complexType>
    </xsd:element>
Example:
    <xsd:element name="Subject">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="Name" type="xsd:string"/>
          <xsd:element name="Marks" type="xsd:float"/>
          <xsd:element name="Grade" type="xsd:string"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
• Defines a non-reusable complex element called 'Subject'
• The child elements appear in the declared order because the <xsd:sequence> tag is used
Element Declarations: Reusable Simple Type
Syntax:
    <xsd:simpleType name="Element_type_name">
      <xsd:restriction base="Base_data_type">
        <!-- Restriction specification -->
      </xsd:restriction>
    </xsd:simpleType>
• Element_type_name: name of the data type
• Base_data_type: any of the built-in simple data types (integer, float, etc.)
• Restriction specification: specifies restrictions on the element, if any
Example:
    <xsd:simpleType name="MarksType">
      <xsd:restriction base="xsd:float">
        <xsd:minInclusive value="0.0"/>
        <xsd:maxInclusive value="100.0"/>
      </xsd:restriction>
    </xsd:simpleType>
• Defines the reusable element type MarksType
• An element defined as MarksType may take a minimum value of 0.0 and a maximum value of 100.0
    <xsd:element name="Marks" type="MarksType"/>
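To check that the MarksType restriction behaves as described, one can validate a few Marks values against it. This is a small sketch using javax.xml.validation; the class name MarksCheck is invented, and the schema here has no target namespace, which is a simplification for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class, for illustration only.
class MarksCheck {

    // MarksType from the example, applied to a global Marks element.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "<xsd:simpleType name='MarksType'>"
        + "<xsd:restriction base='xsd:float'>"
        + "<xsd:minInclusive value='0.0'/><xsd:maxInclusive value='100.0'/>"
        + "</xsd:restriction></xsd:simpleType>"
        + "<xsd:element name='Marks' type='MarksType'/>"
        + "</xsd:schema>";

    // Returns true if <Marks>value</Marks> satisfies MarksType.
    static boolean isValid(String value) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
        try {
            String xml = "<Marks>" + value + "</Marks>";
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // wrong type or outside the 0.0 to 100.0 range
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String v : new String[] {"80.5", "100.0", "120", "-1", "abc"}) {
            System.out.println(v + " -> " + isValid(v));
        }
    }
}
```

Both the range facets and the float base type are enforced: 120 and -1 fail the range, while abc is not a float at all.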
Element Declarations: Reusable Complex Type
Syntax: <xsd:complexType name="Type_name"> defines the reusable type Type_name
Example:
    <xsd:complexType name="SubjectType">
      <xsd:sequence>
        <xsd:element name="Name" type="xsd:string"/>
        <xsd:element name="Marks" type="xsd:int"/>
        <xsd:element name="Grade" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
• Defines the reusable complex element type SubjectType
• It comprises the following elements, in the sequence specified by the <xsd:sequence> tag: Name, Marks, Grade
• This type can be used to define elements in your XML:
    <xsd:element name="Subject" type="SubjectType"/>
Defining the Attributes
Syntax: <xsd:attribute name="Attr_Name" type="Attr_Type"/>
Example:
    <xsd:attribute name="Project" type="xsd:string"/>
• All attributes are declared as simple types.
• Only complex elements can have attributes.
Anatomy of XML Schema : Constraints specification
Controls the occurrence of an individual element or a group of elements
Types of constraints
• <choice>: allows only one of the listed elements to appear
• <sequence>: elements must appear in the same order as they are declared
• <all>: elements can occur in any order, each at most once
<choice> constraint
E.g.:
    <xsd:choice>
      <xsd:element name="first"/>
      <xsd:element name="last"/>
    </xsd:choice>
Allows either the first or the last name to be used in the instance XML document
<sequence> constraint
E.g.:
    <xsd:sequence>
      <xsd:element name="Name" type="xsd:string"/>
      <xsd:element name="EmpNo" type="xsd:int"/>
      <xsd:element name="Subject" type="SubjectType" maxOccurs="5"/>
    </xsd:sequence>
All elements must appear in the defined order only
<all> constraint
E.g.:
    <xsd:all>
      <xsd:element name="invoice"/>
      <xsd:element name="purchaseOrder"/>
      <xsd:element name="mailingLabel"/>
    </xsd:all>
• Elements may appear in any order
• An element may be left out only if it is declared with minOccurs="0"
XML Parsers
XML Parser : The Big Picture
[Figure: usage of the XML parser. The parser reads the XML document together with its DTD / Schema and hands the parsed data to the client application through APIs]
Why use a Parser?
• Typically you use a pre-built XML parser (e.g. JAXP, Apache Xerces, etc.)
• This enables you to build your application much more quickly
Need for Parser: Defining the Parser's Responsibilities
• Ensure that the document adheres to specific standards
  • Does the document match the DTD or Schema?
  • Is the document well-formed?
• Make the document contents available to your application
  • The parser parses the XML document and makes its data available to your application
  • An application using a parser can access data in the XML by going through the hierarchy or by using tag names
Types of XML Parsers
• Validating parser: a parser that verifies that the XML document adheres to the DTD or Schema
• Non-validating parser: a parser that does not verify the XML document against the DTD or Schema
• Most parsers provide an option to turn validation on or off
• All parsers check the well-formedness of the XML document at all times
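The difference between the two modes can be sketched by parsing one document twice, once with validation off and once with it on. The class name ValidationToggle is an assumption for this example; the DTD is the customers DTD from earlier, and the sample document deliberately omits the email element.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
class ValidationToggle {

    // Well-formed but invalid: the DTD requires <email> after <name>.
    static final String DOC =
        "<!DOCTYPE customers ["
        + "<!ELEMENT customers (customer)>"
        + "<!ELEMENT customer (name,email)>"
        + "<!ELEMENT name (#PCDATA)>"
        + "<!ELEMENT email (#PCDATA)>]>"
        + "<customers><customer><name>John</name></customer></customers>";

    // Parses DOC and returns how many validity errors were reported.
    static int validityErrors(boolean validating) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            AtomicInteger errors = new AtomicInteger();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.incrementAndGet();
                }
            });
            builder.parse(new InputSource(new StringReader(DOC)));
            return errors.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("non-validating: " + validityErrors(false));
        System.out.println("validating:     " + validityErrors(true));
    }
}
```

The non-validating run reports nothing because the document is well-formed; the validating run reports the missing email element as a recoverable error.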
XML Parser Interfaces
Two types of interfaces are provided by XML parsers:
• SAX: an event-based interface
• DOM: a tree-based interface
JAXP
• "Java API for XML Processing"
• JAXP is part of the JDK
• Provides parsers which can be used in any Java application
• It supports both
  • a tree-based parser: DOM
  • an event-based parser: SAX
DOM Parser
Tree Based Parser
• Definition: the parser reads the XML document and creates an in-memory "tree" representation of the XML document
• For example, given the sample XML document below, what kind of tree would be produced?
    <Result>
      <Name>Kiran</Name>
      <EmpNo>45609</EmpNo>
      <Subject>
        <Name>CHSSC</Name>
        <Marks>80</Marks>
        <Grade>A</Grade>
      </Subject>
    </Result>
• The in-memory tree created by the tree-based parser represents the hierarchy of the XML document
[Figure: DOM tree with element nodes Result, Name, and EmpNo and text nodes Kiran and 45609]
DOM Parser
• Tree-based APIs present a memory model of the entire document to an application once parsing has concluded
• There is no need to use extra data structures to maintain the information during parsing
• An application can navigate through the tree to find the desired pieces of the document
• Document Object Model (DOM) is the standard for tree-based parsing of XML documents
Document Object Model (DOM)
• The Document Object Model (DOM) is a set of interfaces defined by the W3C DOM Working Group
• DOM is the tree-based interface used by programmers to manipulate the XML document
• A DOM parser can be validating or non-validating
• A DOM parser represents the logical model of the XML document in memory
• All entity references are expanded before the DOM tree is constructed
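Navigating the in-memory tree can be sketched as a short recursive walk over the Result document from the earlier slide. The class name DomWalk and its methods are invented for this illustration; the document is written without whitespace between tags so that no whitespace-only text nodes appear.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
class DomWalk {

    static final String RESULT =
        "<Result><Name>Kiran</Name><EmpNo>45609</EmpNo>"
        + "<Subject><Name>CHSSC</Name><Marks>80</Marks><Grade>A</Grade></Subject>"
        + "</Result>";

    // Returns one indented line per element node and per non-blank text node.
    static List<String> outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> lines = new ArrayList<>();
            walk(doc.getDocumentElement(), "", lines);
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Depth-first traversal using only methods of the Node interface.
    static void walk(Node node, String indent, List<String> lines) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            lines.add(indent + node.getNodeName());
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                lines.add(indent + "\"" + text + "\"");
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child, indent + "  ", lines);
        }
    }

    public static void main(String[] args) {
        for (String line : outline(RESULT)) {
            System.out.println(line);
        }
    }
}
```

The printed outline mirrors the tree: Result at the top, its child elements one level in, and each text value one level below its element.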
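DOM Structure representing XML
[Figure: generic DOM structure: a Document node containing Element nodes, which in turn contain Attribute, Element, Text, and Comment nodes]
XML Document Structure
[Figure: document structure representing Result.xml: the document root holds the element node Result, whose children are Name (text node Kiran), EmpNo (text node 45609), and Subject; Subject contains Name (text node IWT), Marks (text node 80.0), and Grade (text node A)]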
Document Object Model (DOM) : Overview
• The root of the DOM hierarchy is called the Document node; its element child is the document (root) element. Example: Result
• The child nodes of the document element are element nodes, comment nodes, etc. Example: Name, Subject, EmpNo are all child nodes of Result
• All the nodes in the DOM tree implement the interface org.w3c.dom.Node
The Big picture : Parsing the XML Document
• The document builder factory creates an instance of the parser with the required characteristics:
  • whether the parser should be a validating parser or not
  • whether namespace support is required or not
  • whether to ignore the white space between elements or not
• The factory hides the implementation details of the parser and gives a standard DOM interface for parsing XML (analogous to a JDBC driver)
[Figure: XML data is read by a DocumentBuilder (parser) obtained from the DocumentBuilderFactory, producing a Document object (DOM) made up of a tree of objects]
DomApp.java : Parsing XML Document using DOM Parser
    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class DomApp {
        public static void main(String[] argv) {
            // Configure the factory: validating, namespace-aware parser
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            factory.setNamespaceAware(true);
            try {
                // MyErrorHandler is the application's own org.xml.sax.ErrorHandler
                MyErrorHandler hErr = new MyErrorHandler();
                DocumentBuilder hBuilder = factory.newDocumentBuilder();
                // Set the error handler
                hBuilder.setErrorHandler(hErr);
                Document hDocument = hBuilder.parse(new File("Result.xml"));
            } catch (Exception e) {
                // Handle exception if generated during parsing
                e.printStackTrace();
            }
        } // End of function main
    }
Parsing the XML Document using DOM Parser
Step 1: Get the instance of the document-builder factory. This will be used to produce the DOM parser (called DocumentBuilder)
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
Step 2: Set the properties of the DOM parser to be produced
  a. It should validate the XML document against the Schema / DTD
  b. It should be namespace aware
    factory.setValidating(true);
    factory.setNamespaceAware(true);
  Note: in JAXP, setValidating(true) on its own turns on DTD validation; validating against an XML Schema requires configuring the factory with a schema in addition.
Step 3: Obtain the instance of the MyErrorHandler class. This instance handles the errors generated during parsing in an application-specific way
    hErr = new MyErrorHandler();
Step 4: Obtain the instance of the DOM parser, and register the error handler. The parser will be used to parse the XML document and create the memory-based tree representation of the XML document
    DocumentBuilder hBuilder = factory.newDocumentBuilder();
    hBuilder.setErrorHandler(hErr);
Step 5: Parse the XML document (Result.xml) using the parser created above
    hDocument = hBuilder.parse(new File("Result.xml"));
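DomApp relies on a MyErrorHandler class that the slides do not show. The following is one possible sketch of such a class, assuming it simply implements org.xml.sax.ErrorHandler, prints each problem, and counts validity errors; the getErrorCount method and the main method are additions made for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Assumed shape of the error handler used by DomApp; not taken from the slides.
class MyErrorHandler implements ErrorHandler {

    private int errorCount = 0;

    @Override
    public void warning(SAXParseException e) {
        report("Warning", e);
    }

    // Recoverable errors, e.g. validity errors: count them and keep parsing.
    @Override
    public void error(SAXParseException e) {
        errorCount++;
        report("Error", e);
    }

    // Non-recoverable errors, e.g. a document that is not well-formed.
    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        report("Fatal error", e);
        throw e;
    }

    int getErrorCount() {
        return errorCount;
    }

    private static void report(String kind, SAXParseException e) {
        System.err.println(kind + " at line " + e.getLineNumber()
                + ", column " + e.getColumnNumber() + ": " + e.getMessage());
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        MyErrorHandler handler = new MyErrorHandler();
        builder.setErrorHandler(handler);
        // No DOCTYPE is present, so the validating parser reports validity errors.
        builder.parse(new InputSource(new StringReader("<Result/>")));
        System.out.println("validity errors: " + handler.getErrorCount());
    }
}
```

Keeping error() non-throwing lets the parser continue after a validity error, which matches the earlier point that such errors need not be fatal; an application that prefers to stop could rethrow there instead.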
DOM : Exploring the org.w3c.dom.Node Interface
• The Node interface is the root of the DOM Core class hierarchy
• This interface can be used to extract information from any DOM object without knowing the actual type of the underlying node (e.g. Element node, Text node, Attr node)
• i.e. it is possible to access a document's complete structure and content using only the methods and properties exposed by the Node interface
The Class Hierarchy rooted at org.w3c.dom.Node
[Figure: Node at the top, with sub-interfaces Element, Document, Attr, Text, Comment, and Entity]
DOM : Important Methods of Node interface
Methods to retrieve information from the XML DOM tree
• Node getFirstChild(): returns the first child of the current node
• Node getLastChild(): returns the last child of the current node
• String getNodeName(): the name of this node
• String getNodeValue(): the value of this node, depending on its type
• short getNodeType(): a code representing the type of the underlying object
Methods to alter the elements of the XML DOM tree
• Node insertBefore(Node newChild, Node refChild)
• Node appendChild(Node newChild)
• Node removeChild(Node oldChild)
• Node replaceChild(Node newChild, Node oldChild)
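The four tree-altering methods can be tried out on a tiny Result document. This is a sketch only: the class name DomEdit, the Dept element, and the replacement values are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
class DomEdit {

    // Applies each tree-altering method once and returns the resulting
    // child element names of <Result>, in document order.
    static List<String> editedChildren() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(
                        "<Result><Name>Kiran</Name><EmpNo>45609</EmpNo></Result>")));
            Element root = doc.getDocumentElement();
            Node name = root.getFirstChild();   // <Name>
            Node empNo = root.getLastChild();   // <EmpNo>

            Element grade = doc.createElement("Grade");
            grade.setTextContent("A");
            root.appendChild(grade);            // Name, EmpNo, Grade

            Element dept = doc.createElement("Dept"); // made-up element
            dept.setTextContent("IT");
            root.insertBefore(dept, empNo);     // Name, Dept, EmpNo, Grade

            Element newName = doc.createElement("Name");
            newName.setTextContent("Kiran K");
            root.replaceChild(newName, name);   // new Name takes the old one's place

            root.removeChild(empNo);            // Name, Dept, Grade

            List<String> names = new ArrayList<>();
            for (Node c = root.getFirstChild(); c != null; c = c.getNextSibling()) {
                names.add(c.getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(editedChildren()); // [Name, Dept, Grade]
    }
}
```

New nodes are always created through the owning Document (createElement) before being attached to the tree.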
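Using Node Interface
[Figure: DOM tree for Result with children Name (text Kiran), EmpNo (text 45609), and Subject (containing Name)]
Assuming there is no white space between the elements of the document:
    Node hNode = hDocument.getDocumentElement();   // <Result>
    Node hFirstChild = hNode.getFirstChild();      // <Name>
    Node hLastChild = hNode.getLastChild();        // <Subject>
    hFirstChild = hFirstChild.getFirstChild();     // text node inside <Name>
    String sName = hFirstChild.getNodeName();      // "#text"
    String sVal = hFirstChild.getNodeValue();      // "Kiran"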
XML Parser Interfaces : Event Based Interface
• Definition: the parser reads the XML document and generates events for each parsing step
• Some common parsing events
  • Element start-tag read
  • Element content read
  • Element end-tag read
Example
    <Result>
      <Name>Kiran</Name>
      <EmpNo>45609</EmpNo>
      <Subject>
        <Name>CHSSC</Name>
        <Marks>80</Marks>
        <Grade>A</Grade>
      </Subject>
    </Result>
XML Parser Interfaces : Events Generated
    startElement : Result
    startElement : Name
    contents : Kiran
    endElement : Name
    startElement : EmpNo
    contents : 45609
    endElement : EmpNo
    ... (Subject and its children generate events in the same way)
    endElement : Result
XML Parser Interfaces : Event Based Interface
• For each of these events, your application implements "event handlers"
• Each time an event occurs, the corresponding event handler is called
• Your application intercepts these events and handles them in any way you want
• The application does not wait until the entire document has been parsed
• The application has to keep the information from the XML document in local data structures until it has been processed completely
• Simple API for XML (SAX) is the standard for event-based parsing of XML documents
SAXApp.java : Parsing XML Document using SAX Parser
    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    public class SAXApp {
        public static void main(String[] argv) {
            // Get the instance of the parser event handling class
            DefaultHandler handler = new Handler();
            // Get the instance of SAXParserFactory
            SAXParserFactory factory = SAXParserFactory.newInstance();
            try {
                // Set the properties of the parser to be obtained
                factory.setValidating(true);
                factory.setNamespaceAware(true);
                // Get the new SAX parser
                SAXParser saxParser = factory.newSAXParser();
                // Parse the file; handler processes events generated during parsing
                saxParser.parse(new File("Result.xml"), handler);
            } catch (Throwable t) {
                // Handle any exceptions generated during parsing
                t.printStackTrace();
            }
        } // End of function main
    }
SAXApp.java : Parsing XML Document using SAX Parser (event handler class)
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXParseException;
    import org.xml.sax.helpers.DefaultHandler;

    class Handler extends DefaultHandler {
        // Process any recoverable errors (e.g. validity errors) in the XML document
        @Override
        public void error(SAXParseException e) throws SAXException {
            System.out.println("Error At Line: " + e.getLineNumber()
                    + " Column: " + e.getColumnNumber());
            // Print the error message
            System.out.println(e.getMessage());
        }

        // Process any fatal errors in the XML document
        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            System.out.println("Fatal Error At Line: " + e.getLineNumber()
                    + " Column: " + e.getColumnNumber());
            // Print the error message
            System.out.println(e.getMessage());
        }
    } // End class Handler
Understanding The Simple API for XML (SAX)
Step 1: Get the instance of SAXParserFactory. This instance is used to obtain the SAX parser
    SAXParserFactory factory = SAXParserFactory.newInstance();
Step 2: Get the instance of the event handler class. This class handles all the events generated by the parser
    DefaultHandler handler = new Handler();
Step 3: Set the properties of the parser to be obtained
  a. It should validate the XML document against the Schema / DTD
  b. It should be namespace aware
    factory.setValidating(true);
    factory.setNamespaceAware(true);
Step 4: Obtain the instance of the SAX parser using the factory just obtained
    SAXParser saxParser = factory.newSAXParser();
Step 5: Parse the Result.xml file using the SAX parser obtained above. Events generated during parsing will be handled by the object handler
    saxParser.parse(new File("Result.xml"), handler);
The Big picture : Parsing the XML Document using SAX
[Figure: the SAX parser, obtained from the SAX parser factory, reads the XML document and sends parser events to DefaultHandler / MyHandler, which implements the org.xml.sax interfaces ContentHandler, ErrorHandler, and EntityResolver]
org.xml.sax Interfaces
org.xml.sax.helpers.DefaultHandler class
• Provides a default implementation of all the event callbacks
• DefaultHandler implements the ContentHandler, ErrorHandler, DTDHandler, and EntityResolver interfaces (with default, mostly empty, methods)
• Only the methods that are required need to be overridden
org.xml.sax.ContentHandler interface
• Receives notification of the logical content of a document
• Defines methods like startDocument(), endDocument(), startElement(), and endElement(), which are invoked when the document boundaries and XML tags are recognized
• Also defines the method characters(), which is invoked when the parser encounters text in an XML element
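The SAXApp handler shown earlier only overrides the error callbacks. A content-oriented handler could look like the sketch below, which records start, content, and end events in the same style as the event list on the earlier slide. The class name SaxEvents and the log format are assumptions for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
class SaxEvents {

    // Parses the text with SAX and returns the events in the order received.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                log.add("startElement : " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver long text in several chunks;
                // short values like the ones here arrive in one call.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("contents : " + text);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("endElement : " + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events(
                "<Result><Name>Kiran</Name><EmpNo>45609</EmpNo></Result>")) {
            System.out.println(event);
        }
    }
}
```

Because the handler sees one event at a time, anything the application needs later (here, the log list) has to be stored by the application itself.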
org.xml.sax Interfaces
org.xml.sax.ErrorHandler interface
• Allows a SAX application to do customized error handling
• The parser will then report all errors and warnings through this interface
• Important methods
  • void error(): receives the notification of a recoverable error
  • void fatalError(): receives the notification of a non-recoverable error
  • void warning(): receives the notification of a warning
Evaluating Parsers : SAX vs. DOM
SAX
• Advantage: good when serial processing of the document is required and the document is very large, i.e. when the size of the XML document is measured in GBs
• Disadvantage: the application must keep its own data structures for the parts of the XML document it still needs until processing is finished; for small XML documents this extra bookkeeping is usually not worth the effort
DOM
• Advantages
  • Supports DOM tree traversal methods
  • Allows modification of the XML document
  • Good when random access to the document is required
• Disadvantage: for large XML documents (size in GBs) it requires much more memory than parsing the same document with a SAX parser