UFCEKG-20-2 Data, Schemas & Applications
Lecture 3: Data Representation, XML & RSS
N. H. N. D. de Silva (Slides adapted from Prakash Chatterjee, UWE)




Last week:

o introduction to the web
o URI schemes & encoding
o HTTP protocol
o media types
o request / response cycle
o GET, POST, PUT and DELETE
o introduction to mashups
o simple mashup example with forms

Feb 2013, N. H. N. D. de Silva


WWW : definition

The World Wide Web (abbreviated as WWW or W3, commonly known as the Web) is a system of interlinked hypertext documents accessed via the Internet. With a web browser, one can view web pages that may contain text, images, videos, and other multimedia, and navigate between them via hyperlinks.

Wikipedia : World Wide Web

Concept originally proposed by Sir Tim Berners-Lee (1989), based on earlier hypertext systems. Berners-Lee and Belgian computer scientist Robert Cailliau proposed in 1990 to use hypertext "to link and access information of various kinds as a web of nodes in which the user can browse at will", and they publicly introduced the project in December of the same year.


Problem : How to encode data for communication

Competing constraints
o Data must be serialised into a character stream
o Communicate the meaning of the data as well as the data
o Error-free
o Minimal size
o Handle multi-lingual text
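The tension between "minimal size" and "multi-lingual text" can be seen with a small sketch. The sample string is made up for illustration; it simply shows that a single-byte encoding such as ISO-8859-1 cannot represent every script, while UTF-8 can at the cost of more bytes per character.

```python
# Hypothetical sample text mixing Latin and Japanese characters.
text = "Zürich 東京"

# UTF-8 can serialise any Unicode text, using 1 to 4 bytes per character.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# ISO-8859-1 uses one byte per character but only covers Western European text.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(len(text), "characters ->", len(utf8_bytes), "bytes in UTF-8")
print("ISO-8859-1 can encode it:", latin1_ok)
```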

Bank of America Market Data Mirrors


Solutions

o Card file based
o csv
o xls - Excel file format
o XML
o SQL export
o JSON - JavaScript Object Notation

The Medabar in Asmara, Eritrea Google Map


Card-based

Examples
o ATCO-CIF for timetables
o IGES for Computer-Aided Design

Characteristics
o Based on old 80-column punched cards
o Multiple record types
o Fixed field widths
o No formal language to define the format
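A minimal sketch of reading a card-based file. The record types ("ST", "TM") and column positions below are invented for illustration and are not the real ATCO-CIF or IGES layout; the point is that the layout lives only in the reader's code, because the format has no formal definition language.

```python
# Hypothetical layout: the first 2 columns give the record type,
# and each record type has its own fixed-width fields (start, end columns).
LAYOUTS = {
    "ST": [("stop_code", 2, 10), ("stop_name", 10, 30)],
    "TM": [("stop_code", 2, 10), ("time", 10, 14)],
}


def parse_card(line):
    """Slice one 80-column card into named fields based on its record type."""
    record_type = line[:2]
    fields = {"type": record_type}
    for name, start, end in LAYOUTS[record_type]:
        fields[name] = line[start:end].strip()
    return fields


# Two made-up cards, padded to 80 columns like a punched card.
cards = [
    ("ST" + "BRS00001".ljust(8) + "Frenchay Campus".ljust(20)).ljust(80),
    ("TM" + "BRS00001".ljust(8) + "0915").ljust(80),
]
records = [parse_card(card) for card in cards]
print(records)
```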


CSV

Examples
o Alveston (Bristol) weather data
o World Health Organization (WHO) - generated estimates of TB mortality, prevalence, incidence (including incidence of HIV+TB) and case detection rate
o 1000 Songs - Google Spreadsheet

Characteristics
o Data values separated by a common separator character - space, comma or tab
o Column position is significant
o Lines separated by newlines - coding depends on OS - line feed (x0A) on Unix, carriage-return (x0D) + line feed on Windows, carriage-return on old Macs
o Separator must not occur in data values, or some other convention is needed - quotes around the value, or an escape character
o Column headings may be the first line
o Only tables - all lines the same
o All columns required - problem for space-separated data
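The quoting and line-ending points above can be sketched with Python's standard csv module. The column names and row values are made up for illustration (they are not taken from the Alveston data); the sample uses Windows-style line endings and a comma inside a quoted value.

```python
import csv
import io

# Made-up CSV text: heading line first, CRLF line endings,
# and values containing the separator wrapped in double quotes.
raw = 'station,date,comment\r\n"Alveston, Bristol",2008-10-14,"wet, windy"\r\n'

# newline="" leaves line endings untouched so the csv module can handle them itself.
reader = csv.DictReader(io.StringIO(raw, newline=""))
rows = list(reader)

print(reader.fieldnames)
print(rows[0]["station"])
```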


Tagged record structures

Data with optional data and repeated data need more complex structures. Many have been developed for specific domains

o MARC library catalogue records
o EDIFACT for commercial Electronic Data Interchange (EDI)
o EDIF - LISP-based nested data
o EXIF data embedded in a JPEG image


XML

A generic data format based on tagged elements in a tree structure.

Developed from GML, via SGML.

GML, a document mark-up language developed by Charles Goldfarb at IBM in 1969.

Examples
o Alveston WDL config file
o UWE news RSS feed

Tree with Buddhist prayer flags

XML domain vocabularies

XML defines only the rules for a well-formed document. The allowable tags, their structuring and order in a document, the range of allowable values and the meaning of those tags depend on the XML application - called a vocabulary.

There are now hundreds of XML vocabularies designed for every sort of data
o XHTML - the version of HTML which conforms to XML
o SVG - graphics
o TransXChange for timetables
o RSS and Atom for news


XML processing vocabularies

There are also vocabularies for languages for processing XML

o XSLT - for transforming XML documents
o XSL-FO - for transforming to PDF documents
o XML Schema - for defining XML vocabularies
o XProc - for defining XML pipelines


Problem: News dissemination

I want to disseminate news about my project/company, and allow interested people to read it, e.g. the university wants to spread the news about successful staff

Solution 1 : HTML page
Publish a page of news on the website in HTML

Problems
o how do visitors know when it's changed?
o news from different universities cannot be easily combined - (why?)


Solution : email
Encourage interested users to subscribe to your company newsletter.

Problems
o Subscription is a barrier
o Clutters up email boxes
o can look like spam
o List management and emailing overhead


Solution : Create XML document for news
UWE makes up its own set of additional tags

<newsItem date="2007-10-2">
  <newsTitle>UWE best in West</newsTitle>
  <newsBody>UWE wins tiddlewinks again</newsBody>
  <contact>[email protected]</contact>
</newsItem>

Problems
o someone has to design this language
o has to be translated to HTML to display
o A reader has to understand multiple new tags from different sources
o needs to be distinguished from standard HTML
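The "has to be translated to HTML to display" problem can be sketched as follows. The newsItem, newsTitle and newsBody names come from the slide; the function name and the choice of HTML output are assumptions made for illustration, and every home-grown vocabulary would need its own translator like this.

```python
import xml.etree.ElementTree as ET
from html import escape

# The home-grown news vocabulary from the slide (contact element omitted).
doc = """<newsItem date="2007-10-2">
  <newsTitle>UWE best in West</newsTitle>
  <newsBody>UWE wins tiddlewinks again</newsBody>
</newsItem>"""


def news_item_to_html(xml_text):
    """Translate one newsItem element into an HTML fragment (hypothetical helper)."""
    item = ET.fromstring(xml_text)
    title = item.findtext("newsTitle", default="")
    body = item.findtext("newsBody", default="")
    date = item.get("date", "")
    return '<h2>{}</h2><p class="date">{}</p><p>{}</p>'.format(
        escape(title), escape(date), escape(body)
    )


html_fragment = news_item_to_html(doc)
print(html_fragment)
```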


Aside: Namespaces

Problem
How to distinguish in a document XML tags from different vocabularies?

Solution
o define a (global) unique URI for the vocabulary
o use an arbitrary prefix - news: for all tags in the same vocabulary - unique within a document
o link the prefix to the vocabulary in the document

<h1>UWE news</h1>
<p>
  <news:item xmlns:news="http://www.uwe.ac.uk/news" date="2007-10-2">
    <news:Title>UWE best in West</news:Title>
    <news:Body>UWE wins tiddlewinks again</news:Body>
    <news:Contact>[email protected]</news:Contact>
  </news:item>
</p>
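A short sketch of how a parser sees namespaced tags, using Python's standard ElementTree. It shows that the prefix is arbitrary: the reader maps its own prefix ("n" here, an arbitrary choice) to the same URI, and the parser stores each tag name expanded with the URI.

```python
import xml.etree.ElementTree as ET

NEWS_NS = "http://www.uwe.ac.uk/news"

# The fragment from the slide, with the prefix bound via xmlns:news.
doc = """<p>
  <news:item xmlns:news="http://www.uwe.ac.uk/news" date="2007-10-2">
    <news:Title>UWE best in West</news:Title>
    <news:Body>UWE wins tiddlewinks again</news:Body>
  </news:item>
</p>"""

root = ET.fromstring(doc)

# The reader's prefix does not have to match the one used in the document;
# only the URI identifies the vocabulary.
ns = {"n": NEWS_NS}
item = root.find("n:item", ns)
title = item.findtext("n:Title", namespaces=ns)

print(item.tag)   # tag name expanded with the namespace URI
print(title)
```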


Solution : RSS
o Standardize on one (or several !) standard tags
o Tags are machine-readable to identify news items in a list of web sites
o RSS 2.0
  o Really Simple Syndication
  o Rich Site Summary
o Atom - a more recent format
  o Differences - dates (RFC 822 v RFC 3339 timestamps), multi-lingual content

Characteristics
o Structure: rss / channel / item tree
o Items in reverse chronological order
o Few mandatory tags
o Namespaces allow additional vocabularies to be added
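The date difference between RSS and Atom can be illustrated with the standard library. This is a sketch only: the timestamp is the lastBuildDate value from the BBC example in these slides, and the conversion relies on Python's ISO 8601 output, which for this value is also a valid RFC 3339 timestamp.

```python
from datetime import datetime
from email.utils import format_datetime, parsedate_to_datetime

# RFC 822 style date, as used by RSS 2.0.
rss_date = "Mon, 13 Oct 2008 14:28:30 GMT"
parsed = parsedate_to_datetime(rss_date)

# RFC 3339 style timestamp, as used by Atom.
atom_date = parsed.isoformat()

# And back again: Atom style to RSS style.
back_to_rss = format_datetime(datetime.fromisoformat(atom_date), usegmt=True)

print(atom_date)
print(back_to_rss)
```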


Example RSS - UWE news

<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
<channel>
  <title>UWE News</title>
  <link>http://www.uwe.ac.uk</link>
  <description>Latest UWE press releases</description>
  <image>
    <url>http://info.uwe.ac.uk/common/assets/2004Design/logoNoBorder.gif</url>
    <title>University of the West of England</title>
    <link>http://www.uwe.ac.uk</link>
  </image>
  <pubDate>Fri, 13 Oct 2008 15:15:44 GMT</pubDate>
  <item>
    <title>New research looks to transport users for solutions</title>
    <link>http://info.uwe.ac.uk/news/uwenews/article.asp?item=1363</link>
    <description>'Ideas in Transit' is a new initiative which will look to transport users' experiences and creativity as a source of innovation to tackle the UK's transport problems....</description>
  </item>
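A minimal sketch of reading the rss / channel / item structure with Python's standard ElementTree. The feed text is a shortened, closed-off copy of the excerpt above (image and pubDate dropped, description abbreviated) so that it parses on its own; the function name read_items is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the UWE excerpt, completed so that it is well-formed.
feed = """<rss version="2.0">
<channel>
  <title>UWE News</title>
  <link>http://www.uwe.ac.uk</link>
  <description>Latest UWE press releases</description>
  <item>
    <title>New research looks to transport users for solutions</title>
    <link>http://info.uwe.ac.uk/news/uwenews/article.asp?item=1363</link>
    <description>'Ideas in Transit' is a new initiative...</description>
  </item>
</channel>
</rss>"""


def read_items(rss_text):
    """Return the title and link of every item under rss/channel."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in channel.findall("item")
    ]


items = read_items(feed)
channel_title = ET.fromstring(feed).findtext("channel/title")
print(channel_title, items)
```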


Example RSS - BBC Finance News

<?xml version="1.0" encoding="ISO-8859-1" ?>
<?xml-stylesheet title="XSL_formatting" type="text/xsl" href="/shared/bsp/xsl/rss/nolsol.xsl"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss">
<channel>
  <title>BBC News | Business | UK Edition</title>
  <link>http://news.bbc.co.uk/go/rss/-/1/hi/business/default.stm</link>
  <description>Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.</description>
  <language>en-gb</language>
  <lastBuildDate>Mon, 13 Oct 2008 14:28:30 GMT</lastBuildDate>
  <copyright>Copyright: (C) British Broadcasting Corporation, see http://news.bbc.co.uk/1/hi/help/rss/4498287.stm for terms and conditions of reuse</copyright>
  <docs>http://www.bbc.co.uk/syndication/</docs>
  <ttl>15</ttl>
  <image>
    <title>BBC News</title>
    <url>http://news.bbc.co.uk/nol/shared/img/bbc_news_120x60.gif</url>
    <link>http://news.bbc.co.uk/go/rss/-/1/hi/business/default.stm</link>
  </image>
  <item>
    <title>UK banks receive &#163;37bn bail-out</title>
    <description>The UK government says it is to inject a total of up to &#163;37bn into Royal
    …..
  </item>


RSS aggregation

Problem
How to keep track of multiple feeds

Solution
http://www.youtube.com/watch?v=0klgLsSxGsU&feature=player_embedded#t=0s

o Application needed which is stateful - remembers what items you have read
o Integrates multiple feeds into one 'magazine'
o Polls RSS providers on a regular basis

Feed integrators (Bloglines, Google Reader) reduce the load on the provider and provide some filtering. There is an RSS reader integrated into MyUWE
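The stateful and integrating parts of an aggregator can be sketched as below. This is an illustration under stated assumptions, not how Bloglines or Google Reader work: the Aggregator class, make_feed helper, feed names and example.org links are all hypothetical, the feeds are in-memory strings rather than polled over HTTP, and each item is assumed to carry a pubDate.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def make_feed(channel_title, entries):
    """Build a tiny RSS 2.0 document; entries are (title, link, pubDate) tuples."""
    items = "".join(
        "<item><title>{}</title><link>{}</link><pubDate>{}</pubDate></item>".format(*e)
        for e in entries
    )
    return '<rss version="2.0"><channel><title>{}</title>{}</channel></rss>'.format(
        channel_title, items
    )


class Aggregator:
    """Merges several feeds into one list and remembers which links were read."""

    def __init__(self):
        self.read_links = set()  # the state kept between polls

    def mark_read(self, link):
        self.read_links.add(link)

    def merge(self, feeds):
        unread = []
        for rss_text in feeds:
            channel = ET.fromstring(rss_text).find("channel")
            source = channel.findtext("title", default="")
            for item in channel.findall("item"):
                link = item.findtext("link", default="")
                if link in self.read_links:
                    continue  # skip items the user has already read
                unread.append({
                    "source": source,
                    "title": item.findtext("title", default=""),
                    "link": link,
                    "published": parsedate_to_datetime(item.findtext("pubDate")),
                })
        # One 'magazine': newest first, regardless of which feed it came from.
        unread.sort(key=lambda entry: entry["published"], reverse=True)
        return unread


feed_a = make_feed("Feed A", [
    ("A1", "http://example.org/a/1", "Mon, 13 Oct 2008 09:00:00 GMT"),
    ("A2", "http://example.org/a/2", "Mon, 13 Oct 2008 15:00:00 GMT"),
])
feed_b = make_feed("Feed B", [
    ("B1", "http://example.org/b/1", "Mon, 13 Oct 2008 12:00:00 GMT"),
])

aggregator = Aggregator()
first_poll = [entry["title"] for entry in aggregator.merge([feed_a, feed_b])]
aggregator.mark_read("http://example.org/a/2")
second_poll = [entry["title"] for entry in aggregator.merge([feed_a, feed_b])]
print(first_poll, second_poll)
```

Keying the read state on the item link is a simplification; a real reader would more likely use the item's guid where one is present.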

RSS Aggregation with Bloglines

RSS as a tree structure

o UWE news
o BBC Finance news
o Earthquakes


XML Characteristics

o strings enclosed in tags which provide a humanly readable name for the element - so-called self-describing
o elements may be nested to create hierarchical data structures
o element tags may be repeated
o element names can be relative to their parent
o element structure can be formally defined


Aside: Self-describing

o Element names provide a clue about the meaning of the data, but not enough
  o names are ambiguous
  o names may be misleading
  o what units?
  o what accuracy?
  o what origin? - leads to need for meta-data
    o who created
    o when
    o what license to use
    o why


XML terminology

XML documents are tree-structures, with each node bounded by an opening and a closing tag
o Element: the opening tag, attributes, the body of the element and the closing tag. Elements are not elemental!
o Tag name: the name in angle brackets - must conform to rules, may have a prefix
o Attribute: a name="value" pair attached to an element. Names follow the same rules as tag names.
o Parent: all elements except the root have one parent
o Child: an element nested in another parent element
o Root: every document has a single root element with no parent
o Mixed Content: an element may contain a mixture of text and other elements
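The terms above can be mapped onto a parsed tree with Python's standard ElementTree. The note document is made up for illustration; it has one root, one attribute, two children and mixed content.

```python
import xml.etree.ElementTree as ET

# Made-up document: root element "note" with an attribute and mixed content.
doc = '<note lang="en">See <b>RSS</b> and <i>Atom</i> feeds</note>'
root = ET.fromstring(doc)

# Root, tag name and attributes.
root_tag = root.tag
attributes = root.attrib

# Children, and a child-to-parent map (ElementTree stores no parent pointer).
child_tags = [child.tag for child in root]
parent_of = {child: parent for parent in root.iter() for child in parent}

# Mixed content: text before the first child is .text,
# text after each child is that child's .tail.
pieces = [root.text] + [child.tail for child in root]
full_text = "".join(root.itertext())

print(root_tag, attributes, child_tags, pieces, full_text)
```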


Basic XML rules
o A single root element
o Tags must be properly nested
o An element must be closed:
  o Open and closing tag <p>... </p>
  o Empty element <br /> or <hr size="3"/>

Other formatting rules
o XML names are case sensitive, no spaces, restricted character set
o Attribute values must be single or double-quoted
o Special characters coded as references: &#10; (a line feed), &gt; (>)
o Some characters have special meaning, e.g. < is the start of a tag within XML data, & is the first character of an entity reference. In XML data these have to be encoded as &lt; and &amp; or enclosed in <![CDATA[ ....]]>
o Preferably use standard formats for representing values, e.g. 2008-10-14 for a date
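These rules can be checked with a parser. The sketch below uses Python's standard library; the is_well_formed helper and the price element are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Rule checks: single root, proper nesting, special characters encoded.
two_roots = is_well_formed("<a/><b/>")
bad_nesting = is_well_formed("<a><b></a></b>")
raw_ampersand = is_well_formed("<p>Fish & chips</p>")
empty_element = is_well_formed('<hr size="3"/>')

# Encoding special characters when building XML text by hand.
raw = "Fish & chips < 5"
element = "<price note={}>{}</price>".format(quoteattr('say "hi"'), escape(raw))
parsed = ET.fromstring(element)

# A CDATA section carries the same characters without encoding them.
cdata = ET.fromstring("<p><![CDATA[Fish & chips < 5]]></p>")

print(element)
```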
