Post on 21-Dec-2015
Extensible Markup Language: XML
• HTML: widely supported markup language for formatting data
• XML: widely supported markup language for describing data
• XML is quickly becoming the standard for data exchange between applications
Root element contains all other document elements
Optional XML declaration includes version information parameter (MUST be very first line of file)
Because of the nice <tag>.. </tag> structure, the data can be viewed as organized in a tree:
article
  title
  date
  author
    firstName
    lastName
  summary
  content
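A minimal sketch of the tree view in Python, assuming the standard-library `xml.dom.minidom` parser. The document text is illustrative only: the tag names follow the tree above, the text values are placeholders, and `outline` is a hypothetical helper, not part of any library.

```python
from xml.dom import minidom

# Illustrative document: tag names follow the tree above, text values are placeholders.
# The XML declaration, if present, must be the very first thing in the document.
ARTICLE_XML = """<?xml version = "1.0"?>
<article>
   <title>Simple XML</title>
   <date>placeholder date</date>
   <author>
      <firstName>John</firstName>
      <lastName>Doe</lastName>
   </author>
   <summary>placeholder summary</summary>
   <content>placeholder content</content>
</article>"""


def outline(node, depth=0):
    """Return one indented line per element below (and including) node."""
    lines = ["  " * depth + node.nodeName]
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:  # skip text and comments
            lines.extend(outline(child, depth + 1))
    return lines


document = minidom.parseString(ARTICLE_XML)
root = document.documentElement      # the root element contains all others
print("\n".join(outline(root)))
```

The printed outline should have the same shape as the tree drawn above, with `article` as the single root.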
<?xml version = "1.0"?>
<!-- I-sequence structured as XML. -->
<SEQUENCEDATA>
<TYPE>dna</TYPE>
<SEQ>
<NAME>Aspergillus awamori</NAME>
<ID>U03518</ID>
<DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggccca
acctcccatccgtgtctattgtaccctgttgcttcgg
cgggcccgccgcttgtcggccgccgggggggcgcctctg
ccccccgggcccgtgcccgccggagaccccaacacgaac
actgtctgaaagcgtgcagtctgagttgattgaatgcaat
cagttaaaactttcaacaatggatctcttggttccggc
</DATA>
</SEQ>
</SEQUENCEDATA>
An I-sequence might be structured as XML as shown above; the second line of the file is a comment. Viewed as a tree:
SEQUENCEDATA
  TYPE
  SEQ
    NAME
    ID
    DATA
XML is standard: Parsers exist already!
Standard browsers can format XML documents nicely!
Each parent element/node can be expanded and collapsed (a minus sign marks an expanded node, a plus sign a collapsed one)
Python offers a Document Object Model parser!
• A DOM parser returns the whole XML document represented as a tree
• All nodes have name (of tag) and value (data)
• Text (including whitespace) represented in nodes with tag name #text
article
  #text
  title
    #text  (Simple XML)
  #text
  date
    #text  (Dec..2001)
  #text
  author
    #text
    firstName
      #text  (John)
    #text
    lastName
      #text  (Doe)
    #text
  #text
  summary
    #text  (XML..easy.)
  #text
  content
    #text  (In this..XML.)
  #text
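The whitespace-only `#text` nodes can be seen directly with a short sketch, assuming the standard-library `xml.dom.minidom` parser. The document below is a shortened, illustrative version of the article example, not the file used in the slides.

```python
from xml.dom import minidom

# Small illustrative document; the line breaks and indentation between
# tags are what produce the extra #text nodes.
xml_text = """<article>
   <title>Simple XML</title>
   <author>
      <firstName>John</firstName>
      <lastName>Doe</lastName>
   </author>
</article>"""

root = minidom.parseString(xml_text).documentElement

# Names of the direct children of the root, including whitespace text nodes
names = [child.nodeName for child in root.childNodes]
print(names)

# Text nodes that contain only whitespace can be recognised like this:
whitespace_only = [
    child for child in root.childNodes
    if child.nodeType == child.TEXT_NODE and child.nodeValue.strip() == ""
]
print(len(whitespace_only))

# The text of an element is the value of its #text child
title = root.getElementsByTagName("title")[0]
print(title.firstChild.nodeName, title.firstChild.nodeValue)
```

Checking `nodeType` and stripping `nodeValue` is one simple way to ignore the layout-only nodes when traversing the tree.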
deitel_fig16_04revised.py
• Parse XML document and load data into variable document
• documentElement attribute refers to root node
• nodeName refers to element’s tag name
• Various node attributes: firstChild, nextSibling, nodeValue, parentNode
NB: Changes since book!
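The program itself is not reproduced in this transcript. The following is only a sketch of the node attributes listed above, assuming the standard-library `xml.dom.minidom` parser and a shortened placeholder document; the actual revised figure may import a different parser.

```python
from xml.dom import minidom

# Placeholder document; in the slides the data comes from an XML file.
xml_text = """<?xml version = "1.0"?>
<article>
   <title>Simple XML</title>
   <author>
      <firstName>John</firstName>
      <lastName>Doe</lastName>
   </author>
</article>"""

document = minidom.parseString(xml_text)   # minidom.parse(filename) for a file
root = document.documentElement            # root node of the DOM tree

print("Here is the root element of the document:", root.nodeName)

first = root.firstChild                    # whitespace before <title>
print("The first child of root element is:", first.nodeName)

title = first.nextSibling                  # the <title> element
print("whose next sibling is:", title.nodeName)

# The text of <title> lives in its #text child node
print('Text inside "title" tag is', title.firstChild.nodeValue)
print("Parent node of title is:", title.parentNode.nodeName)
```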
Program output
Here is the root element of the document: article
The following are its child elements:
#text
title
#text
date
#text
author
#text
summary
#text
content
#text
The first child of root element is: #text
whose next sibling is: title
Text inside "title" tag is Simple XML
Parent node of title is: article
Parsing XML sequence?
• We have an i2xml filter (exercise); we want xml2i also
• New XML structure for Isequences: one file can hold more than one sequence
• Algorithm:
  – Open file
  – Use Python parser to obtain the DOM tree
  – Traverse tree to extract sequence information, build Isequence objects
Ignoring whitespace nodes, we have to search a tree like this:
SEQUENCEDATA
  SEQ (type)
    NAME
    ID
    DATA
  SEQ (type)
    NAME
    ID
    DATA
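The algorithm can be sketched end to end as follows. This is not the course's xml2i.py: the `Isequence` class here is a hypothetical stand-in, `text_of` and `parse_sequences` are made-up helper names, the attribute name `type` is assumed from the "SEQ (type)" label in the tree, and the parser is assumed to be the standard-library `xml.dom.minidom`.

```python
from xml.dom import minidom


class Isequence:
    """Hypothetical stand-in for the course's Isequence class."""

    def __init__(self, name, seq_id, data, seq_type):
        self.name = name
        self.id = seq_id
        self.data = data
        self.type = seq_type


def text_of(parent, tag):
    """Return the stripped text inside the first <tag> element below parent."""
    element = parent.getElementsByTagName(tag)[0]
    parts = [n.nodeValue for n in element.childNodes if n.nodeType == n.TEXT_NODE]
    return "".join(parts).strip()


def parse_sequences(xml_text):
    """Build one Isequence per SEQ element in a new-format document."""
    document = minidom.parseString(xml_text)  # minidom.parse(filename) for a file
    sequences = []
    for seq in document.getElementsByTagName("SEQ"):
        data = "".join(text_of(seq, "DATA").split())  # remove line breaks
        sequences.append(
            Isequence(text_of(seq, "NAME"), text_of(seq, "ID"),
                      data, seq.getAttribute("type"))
        )
    return sequences


# Made-up two-sequence document in the new format
example = """<SEQUENCEDATA>
<SEQ type="dna"><NAME>first</NAME><ID>A1</ID><DATA>aacc
tgcg</DATA></SEQ>
<SEQ type="dna"><NAME>second</NAME><ID>B2</ID><DATA>ggtt</DATA></SEQ>
</SEQUENCEDATA>"""

for s in parse_sequences(example):
    print(s.id, s.name, s.type, s.data)
```

Using `getElementsByTagName` sidesteps the whitespace nodes entirely, which keeps the sketch short; a hand-written traversal is the alternative.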
xml2i.py (part 1)
• We're still being systematic: usual name for parse method
• Obtain a parse tree with the XML data for free
• Convert each SEQ subtree to an Isequence object:
SEQUENCEDATA
  SEQ (type)
  SEQ (type)
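The code of xml2i.py is not included in this transcript. As a rough sketch of the traversal step only, the snippet below walks the children of the root with `firstChild` and `nextSibling` and keeps the SEQ subtrees. `seq_subtrees` is a hypothetical helper name, the example document is made up, and `xml.dom.minidom` is assumed as the parser.

```python
from xml.dom import minidom

# Made-up new-format document with two sequences
example = """<SEQUENCEDATA>
   <SEQ type="dna"><NAME>first</NAME></SEQ>
   <SEQ type="protein"><NAME>second</NAME></SEQ>
</SEQUENCEDATA>"""


def seq_subtrees(document):
    """Return the SEQ element nodes directly below the root element."""
    root = document.documentElement
    found = []
    node = root.firstChild
    while node is not None:
        # Whitespace shows up as #text nodes; only element nodes named SEQ count.
        if node.nodeType == node.ELEMENT_NODE and node.nodeName == "SEQ":
            found.append(node)
        node = node.nextSibling
    return found


document = minidom.parseString(example)
subtrees = seq_subtrees(document)
print(len(subtrees), [s.nodeName for s in subtrees])
```

Each node in `subtrees` is the root of one SEQ subtree and can then be handed to whatever code builds the Isequence object.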
xml2i.py (part 2)
• Way of getting to all attributes of a node
• Way of getting to the value of a specific attribute
• Recall: text kept in a #text node underneath
SEQ (type)
  NAME
  ID
  DATA
(each of NAME, ID and DATA holds its text in a #text child)
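The three points above can be sketched with `xml.dom.minidom` as follows. The slide's own code is not in the transcript, so the calls below are one possible way of doing it, and the attribute name `type` is assumed from the "SEQ (type)" label.

```python
from xml.dom import minidom

document = minidom.parseString(
    '<SEQ type="dna"><NAME>Aspergillus awamori</NAME><ID>U03518</ID></SEQ>'
)
seq = document.documentElement

# All attributes of a node: a NamedNodeMap of (name, value) pairs
all_attributes = dict(seq.attributes.items())
print(all_attributes)

# The value of one specific attribute
print(seq.getAttribute("type"))

# Text is kept in a #text node underneath the element
name_element = seq.getElementsByTagName("NAME")[0]
print(name_element.firstChild.nodeName, name_element.firstChild.nodeValue)
```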
See all the methods and attributes of a DOM tree on pages 537ff
| Attribute/Method | Description |
|---|---|
| appendChild( newChild ) | Appends newChild to the list of child nodes. Returns the appended child node. |
| attributes | NamedNodeMap that contains the attribute nodes for the current node. |
| childNodes | NodeList that contains the node’s current children. |
| firstChild | First child node in the NodeList or None, if the node has no children. |
| insertBefore( newChild, refChild ) | Inserts the newChild node before the refChild node. refChild must be a child node of the current node; otherwise, insertBefore raises a ValueError exception. |
| isSameNode( other ) | Returns true if other is the current node. |
| lastChild | Last child node in the NodeList or None, if the current node has no children. |
| nextSibling | The next node in the NodeList, or None, if the node has no next sibling. |
| nodeName | Name of the node, or None, if the node does not have a name. |
Possible to manipulate the DOM tree using these methods (add new nodes, remove nodes, set attributes etc.)
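A small sketch of such manipulation, assuming `xml.dom.minidom`. Besides `appendChild` and `insertBefore`, it uses `createElement`, `createTextNode`, `setAttribute` and `removeChild`, which are standard DOM calls in minidom; the document content is a made-up fragment.

```python
from xml.dom import minidom

document = minidom.parseString(
    "<SEQUENCEDATA><SEQ><ID>U03518</ID></SEQ></SEQUENCEDATA>"
)
seq = document.getElementsByTagName("SEQ")[0]
id_element = seq.firstChild

# Add a new node: create <NAME> with a text child and insert it before <ID>
name_element = document.createElement("NAME")
name_element.appendChild(document.createTextNode("Aspergillus awamori"))
seq.insertBefore(name_element, id_element)

# Set an attribute on <SEQ>
seq.setAttribute("type", "dna")

# Remove a node again
seq.removeChild(id_element)

# Serialise the modified subtree to check the result
print(seq.toxml())
```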
Convert old format XML sequence to new format
Old format: sequence type has its own tag TYPE
SEQUENCEDATA
  TYPE
  SEQ
    NAME
    ID
    DATA
New format: sequence type is attribute of SEQ tag
SEQUENCEDATA
  SEQ (type)
    NAME
    ID
    DATA
old_xml2i.py
Add new method to original xml2i.py and call it after parsing the XML file
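The new method is not shown in this transcript. One way the conversion could look, written as a standalone function rather than a method, is sketched below. `old_to_new` is a hypothetical name, the attribute name `type` is assumed, the DATA content is shortened for the example, and `xml.dom.minidom` is assumed as the parser.

```python
from xml.dom import minidom


def old_to_new(document):
    """Move the text of a top-level TYPE element into a type attribute on each SEQ."""
    root = document.documentElement
    type_elements = [
        n for n in root.childNodes
        if n.nodeType == n.ELEMENT_NODE and n.nodeName == "TYPE"
    ]
    if not type_elements:
        return document          # nothing to convert: no TYPE tag present
    type_element = type_elements[0]
    seq_type = type_element.firstChild.nodeValue.strip()
    for seq in root.getElementsByTagName("SEQ"):
        seq.setAttribute("type", seq_type)
    root.removeChild(type_element)   # the TYPE tag is no longer needed
    return document


# Old-format document with shortened sequence data
old = """<?xml version = "1.0"?>
<SEQUENCEDATA>
<TYPE>dna</TYPE>
<SEQ><NAME>Aspergillus awamori</NAME><ID>U03518</ID><DATA>aacctgcg</DATA></SEQ>
</SEQUENCEDATA>"""

document = old_to_new(minidom.parseString(old))
seq = document.getElementsByTagName("SEQ")[0]
print(seq.getAttribute("type"))
```

After the call the tree has the new-format shape, so the rest of the parsing code can treat both formats the same way.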
old_xml2phylip.py
Import new module
Check that type information is saved in the Isequence (not used in phylip format)
Testing on old format XML sequence
U03518b.xml:
<?xml version = "1.0"?>
<SEQUENCEDATA>
<TYPE>dna</TYPE>
<SEQ>
<NAME>Aspergillus awamori</NAME>
<ID>U03518</ID>
<DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggcccaacctcccatccgtgtctattgtaccctgttgcttcggcgggcccgccgcttgtcggccgccgggggggcgcctctgccccccgggcccgtgcccgccggagaccccaacacgaacactgtctgaaagcgtgcagtctgagttgattgaatgcaatcagttaaaactttcaacaatggatctcttggttccggc</DATA>
</SEQ>
</SEQUENCEDATA>
python old_xml2phylip.py U03518b.xml U03518b
sequence is of type dna
Remark: book uses old version of DOM parser
• XML examples in book won’t work (except the revised fig16.04)
• Look in the presented example programs to see what you have to import
• All the methods and attributes of a DOM tree on pages 537ff are the same
.. on to the exercises