
9/9/09


Introduction to XML for parliamentary documents (and all other kinds of documents, actually)

Prof. Fabio Vitali, University of Bologna

Part 1

Purpose of these slides

●  Introduce the principal aspects of electronic management of documents
  -  What we actually mean by documents (the FRBR hierarchy)
  -  What are the components of documents
  -  What do we mean by data and metadata about documents
●  Introduce some technologies related to electronic management of documents…
  -  XML
  -  DTDs
  -  XML Schema
  -  XSLT
  -  RDF and OWL
●  … all somehow connected and related to parliamentary documents (but not necessarily only to them)


Parliamentary activities

●  A complex production system that generates documents with different legal status:
  -  Bills and acts to become the law of a country
  -  Debate records (or hansards) to become a lasting log of the activities of the parliament
  -  Daily and weekly announcements, tablings and reports to organize and master the internal logistics
●  Each document is of the utmost importance and must be printed in large quantities and/or made available to a wide part of the population, daily and within a very limited amount of time


Computer support for parliamentary activities

●  Support for generating documents
  -  Drafting activities, record keeping, translation into national languages, etc.
●  Support for workflow
  -  Management of documents across lifecycle, storage, security, timely involvement of relevant individuals and offices
●  Support for citizens’ access
  -  Multi-channel publication (on paper and on the web), search, classification, identification
●  Further activities
  -  Consolidation, comparison, language synchronization, etc.


How the Web can help

●  Born as a publishing medium
●  HTML helped make it a big success
●  HTML is constraining by its own simplicity
  -  Excessive reliance on typographic rather than semantic description
  -  Few rules, and even those are not strictly enforced
●  A new language was invented, called XML, that could solve that
  -  Clear separation between appearance and meaning
  -  Strict syntactic rules, rigorously enforced to guarantee uniformity, homogeneity and sophisticated applications


XML

●  XML (Extensible Markup Language) is a W3C standard in very widespread use.

●  XML is pure syntax, without pre-defined semantics. This allows document designers to provide their own semantics.

●  Thanks to the associated languages (DTD, XSLT, RDF) we can create sophisticated applications with great flexibility of use.

●  XML allows us to create markup languages that are readable, generic, structured and hierarchical.


Parliamentary documents and XML

●  XML is ideal for representing parliamentary documents (and especially bills and acts):

-  They have a well-defined structure, which is systematic and standardized

-  There are required and optional parts according to rules and tradition

-  There are containment constraints that determine the global correctness of the document

-  There are references to other texts (schedules, other acts, etc.) that can fruitfully be used to create a hypertext network.


Why is XML good?

[Figure: "Energy / Information", with the labels "Conversion is difficult" and "Conversion is very easy"]

Documents


3 problems

●  When are two documents the same document?

●  What is important to record of a document?

●  How do we refer precisely to the normative content of the documents and of their parts?


3 problems (2)

●  When are two documents the same document?
  -  When they are different physical copies of the same document (two identical books)
  -  When they are different ways by which the same words appear (a MS Word file and its printout on paper)
  -  When they are two different sets of sentences with the same name and purpose (two versions of the same act)
●  What is important to record of a document?
●  How do we refer precisely to the normative content of the documents and of their parts?



3 problems (2)

●  When are two documents the same document?
●  What is important to record of a document?
  -  The words and punctuation it is composed of.
  -  The way in which it is shown on the page (pagination, typography, colors, margins and fonts)
  -  The conceptual role of each fragment (this sentence is a title, this is a clause, this is a reference, etc.)
●  How do we refer precisely to the normative content of the documents and of their parts?


3 problems (2)

●  When are two documents the same document?
●  What is important to record of a document?
●  How do we refer precisely to the normative content of the documents and of their parts?
  -  The meaning
  -  The words
  -  The name


3 solutions

●  When are two documents the same document?
  -  The IFLA FRBR hierarchy: from abstract ideas to physical files
    ●  Work, Expression, Manifestation, Item
●  What is important to record of a document?
  -  The SGML components: from meaning to typography
    ●  Content, presentation, structure
●  How do we refer precisely to the normative content of the documents and of their parts?
  -  The semantic web approach: applying semantics where it fits
    ●  data, metadata, ontology


The IFLA FRBR hierarchy (1)

●  Work: a distinct intellectual creation.
●  Expression: the specific form in which a work is realized
  -  In our model, all variants and versions of a text that incorporate amendments and updates to an earlier version are considered expressions of the same work.
●  Manifestation: the representation of an expression according to the requirements of a medium
●  Item: a single exemplar of a manifestation
  -  In our model, a manifestation is a representation of an expression as an electronic document in a specific format
  -  All copies of the same (identical) manifestation are items. All items are accessible in a specific position on a specific computer.


The IFLA FRBR hierarchy (2)

●  Work:
  -  The play “Hamlet” by William Shakespeare
  -  The Italian act #3 (5 January 2001)
●  Expression:
  -  The first quarto of “Hamlet” (1603);
  -  the first folio of “Hamlet” (1623);
  -  the movie version of “Hamlet” by Kenneth Branagh (1996)
  -  The original version of Italian act 3/2001;
  -  the amended version of Italian act 3/2001 as of 19/12/2003


The IFLA FRBR hierarchy (3)

●  Manifestation:
  -  One of the printed versions of the first folio version of “Hamlet” (e.g.: Penguin Books, 1994)
  -  One of the computer versions of “Hamlet” (e.g., Project Gutenberg)
  -  The NIR XML version of the amended version of Italian Act 3/2001 as of 19/12/2003
  -  The printed version of the original version of Italian Act 3/2001 on the Italian Gazette #2 (2001)
●  Item:
  -  My own copy of “Hamlet” by Penguin Books; the copy of “Hamlet” on Project Gutenberg’s own site
  -  The copy of the NIR XML version of Italian Act 3/2001 on my computer. The one I copy on your computers.
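
The four FRBR levels above can be sketched as a small data model. This is only an illustration in Python: the class and field names, and the file locations, are assumptions made for the example and are not part of FRBR or of any parliamentary standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A single exemplar: one copy at one specific location."""
    location: str


@dataclass
class Manifestation:
    """An expression rendered in a specific format or medium."""
    fmt: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """One version of a work (e.g. the original or an amended text)."""
    label: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """A distinct intellectual creation."""
    title: str
    expressions: List[Expression] = field(default_factory=list)


def count_items(work: Work) -> int:
    """Count every copy (item) reachable from a work."""
    return sum(len(m.items) for e in work.expressions for m in e.manifestations)


act = Work(
    title="Italian Act 3/2001",
    expressions=[
        Expression(
            label="original version",
            manifestations=[Manifestation(fmt="print, Italian Gazette #2 (2001)")],
        ),
        Expression(
            label="amended version as of 19/12/2003",
            manifestations=[
                Manifestation(
                    fmt="NIR XML",
                    # Hypothetical file locations, for illustration only.
                    items=[
                        Item("file:///home/me/act3-2001.xml"),
                        Item("file:///home/you/act3-2001.xml"),
                    ],
                )
            ],
        ),
    ],
)

print(count_items(act))
```

Note that only items carry a location; works, expressions and manifestations are abstract and merely group what lies below them.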


The SGML components (1)

●  Content
  -  What exactly was written in the document.
  -  The content is composed of words, punctuation, sentences, images, paragraphs and so on.
●  Structure
  -  How the content is organized
  -  All documents have an internal organization, composed of subdivisions, hierarchies, preambles and conclusions, attachments, and so on. Within a paragraph, structure also covers all parts that have some relevance (e.g. references, quotations, etc.)
●  Presentation
  -  The typographical choices to present a document on screen or on paper.



The SGML components (2)

●  The structure adds meaning to pieces of content.
  -  The text “Interpretation” assumes meaning once we know it is the title of article #2 of the Italian Act 3/2001
●  The structure connects the presentation to the content
  -  Once we know that the text “Interpretation” is the title of an article, we can apply the typographical choices associated with article titles.
●  The structure can be used to test the correctness of a document
  -  We can deduce that a document is incorrect if there is no title associated with an article.
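
The correctness test mentioned in the last bullet can be sketched in a few lines of Python. The element names (`act`, `article`, `title`, `p`) and the sample fragment are assumptions made for this example; a real vocabulary would define its own names.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the second article has no title.
SAMPLE = """<act>
  <article id="art1"><title>Short title</title><p>...</p></article>
  <article id="art2"><p>In this Act ...</p></article>
</act>"""


def articles_without_title(xml_text):
    """Return the ids of all articles that lack a <title> child."""
    root = ET.fromstring(xml_text)
    return [a.get("id") for a in root.iter("article") if a.find("title") is None]


missing = articles_without_title(SAMPLE)
print(missing)
```

The check relies only on structure: it never looks at fonts or layout, which is why it keeps working whatever presentation is applied later.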


The SGML components (3)

●  The content itself can be divided into categories:
  -  Pure content
    ●  Appears in the document because it is instrumental to the message conveyed by the document. For instance, the text “THE RETIREMENT BENEFITS AUTHORITY”
    ●  This is what we really are interested in
  -  Structural content
    ●  Appears because it marks the beginning or the end of a structure. For instance, the text “Part II”
    ●  This can be used for deducing information about the structure
  -  Presentation-oriented content
    ●  Appears because it is dictated by the presentation choices of the document. For instance, page numbers and repeating headers.
    ●  This can be safely ignored and thrown away.


The Semantic Web approach (1)

●  Data:
  -  The actual text as it was provided initially by the author of the document
●  Metadata:
  -  Any consideration, comment or additional information that can be expressed on the content and on the document.
  -  Metadata is generated either by human intervention, or through automated processes.
●  Ontology (in short):
  -  A representation of the conceptual model that shapes all metadata associated with a document.


The Semantic Web approach (2)

●  Authors’ contribution: data
  -  The words and punctuation and breaks, exactly as they were written and accepted by the original author (with legislation, the legislative body)
●  Editors’ contribution: metadata
  -  Publication data. Lifecycle information. Footnotes. Analysis of provisions.
  -  Metadata is useless unless it is provided following a precise schema, called ontology.
    ●  In a way, editors are the authors of the metadata
    ●  Put another way, metadata is information about a document that was not provided by its authors.


Markup

●  We call markup the additions to a written text that let applications work on the text:
  -  Structural markup
  -  Descriptive markup
  -  Presentation markup
●  With XML, we add markup to the text of a document so that further applications can work on it.
●  XML uses a special syntax to add markup and to distinguish it from text


XML Markup (1)

●  XML markup clearly distinguishes elements, text (or #PCDATA) and attributes.
●  An element is delimited by a start tag and an end tag, which are recognizable by their angle brackets:
  -  <title>Interpretation.</title>
●  The content of an element can be
  -  Just text (simple text elements)
  -  Other elements (structural elements)
  -  A mix of text and other elements (mixed content elements)
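
The three kinds of element content listed above can be told apart mechanically. The following Python sketch uses the standard `xml.etree.ElementTree` module; the sample fragment and the `content_kind` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mixing the three kinds of content.
SAMPLE = (
    "<section>"
    "<title>Interpretation.</title>"
    "<clause>See <ref>section 2</ref> of this act.</clause>"
    "</section>"
)


def content_kind(elem):
    """Classify an element as 'text', 'structural' or 'mixed'."""
    has_children = len(elem) > 0
    # Text can sit before the first child (elem.text) or after any child (tail).
    has_text = bool((elem.text or "").strip()) or any(
        (child.tail or "").strip() for child in elem
    )
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "structural"
    return "text"  # Empty elements also land here in this simple sketch.


root = ET.fromstring(SAMPLE)
kinds = {elem.tag: content_kind(elem) for elem in root.iter()}
print(kinds)
```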


XML Markup (2)

●  Within the start tag we can sometimes find attributes, i.e. additional information about the element
    ●  <act contains="SingleVersion"> … </act>
  -  A special attribute is “href”, which indicates the destination of a reference
    ●  As in <ref href="#sec2">section 2</ref> of this act.
  -  Another special attribute is “id”, which provides a reliable name for the element to be used in references
    ●  <clause id="sec1-cla1"> … </clause>
    ●  <section id="sec2"> … </section>
●  In a way, metadata is information about the document, while the attribute is information about the element
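
How “href” and “id” work together can be sketched as follows. The fragment reuses the element and attribute names from the examples above; the surrounding document and the `resolve` helper are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<act contains="SingleVersion">'
    '<section id="sec1"><clause id="sec1-cla1">As in '
    '<ref href="#sec2">section 2</ref> of this act.</clause></section>'
    '<section id="sec2"><title>Interpretation.</title></section>'
    "</act>"
)

root = ET.fromstring(SAMPLE)

# Index every element that carries an id attribute.
by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}


def resolve(ref):
    """Return the element that a local reference (href="#...") points to."""
    href = ref.get("href", "")
    if not href.startswith("#"):
        return None  # Not a local reference; out of scope for this sketch.
    return by_id.get(href[1:])


targets = []
for ref in root.iter("ref"):
    target = resolve(ref)
    targets.append(target.tag if target is not None else None)

print(targets)
```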


Naming documents and fragments

●  Uniform Resource Identifiers
  -  These are used throughout the World Wide Web to indicate resources.
  -  The best known are the URLs (Uniform Resource Locators) that are used to navigate on the web
    ●  http://www.akomantoso.org/09-examples
●  Fragment Identifiers
  -  Within a document, one can point to a specific part of the text through the fragment identifier
    ●  http://www.akomantoso.org/09-examples#part3
  -  This corresponds to an element whose attribute id is “part3”
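
Splitting a URI into its document part and its fragment identifier, and then locating the matching element, might look like this in Python. The document content in `DOC` is invented for the example; nothing is fetched from the network.

```python
from urllib.parse import urldefrag
import xml.etree.ElementTree as ET

# Hypothetical content standing in for the document found at the URL.
DOC = '<doc><part id="part1"/><part id="part3"><title>Part III</title></part></doc>'


def find_fragment(uri, xml_text):
    """Split off the fragment identifier and find the element with that id."""
    url, fragment = urldefrag(uri)
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if fragment and elem.get("id") == fragment:
            return url, elem
    return url, None


url, elem = find_fragment("http://www.akomantoso.org/09-examples#part3", DOC)
print(url)
print(elem.findtext("title"))
```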


Naming documents and fragments (2)

●  In our case the situation is more complex. Works, expressions and manifestations are not physical resources, but abstract entities.

●  Yet, references are rarely (or never) to items, but to those concepts

●  So works, expressions and manifestations must have their own URI, which is not a URL (i.e., it does not correspond to a physical address on a computer)

●  The act of finding out what is the URL of the item that best represents the manifestation that we are looking for is called URI resolution.
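
URI resolution can be pictured as a lookup from an abstract name to the URL of a concrete item. The sketch below is purely illustrative: the URI scheme, the catalogue and the `example.org` addresses are all assumptions made for this example and do not reproduce any actual naming convention.

```python
# Hypothetical catalogue: work -> expression (by date) -> format -> item URLs.
CATALOGUE = {
    "/it/act/2001/3": {
        "2001-01-05": {
            "xml": ["http://example.org/files/act3-2001-original.xml"],
        },
        "2003-12-19": {
            "xml": ["http://example.org/files/act3-2001-amended.xml"],
            "pdf": ["http://example.org/files/act3-2001-amended.pdf"],
        },
    },
}


def resolve(work, expression=None, fmt="xml"):
    """Map an abstract URI (work, optional expression, format) to an item URL."""
    expressions = CATALOGUE.get(work)
    if not expressions:
        return None
    if expression is None:
        # No version requested: pick the latest one.
        # ISO dates sort chronologically as plain strings.
        expression = max(expressions)
    items = expressions.get(expression, {}).get(fmt, [])
    return items[0] if items else None


print(resolve("/it/act/2001/3"))
print(resolve("/it/act/2001/3", "2001-01-05"))
```

A reference to the work alone is resolved to the most recent expression here; that default is a design choice of the sketch, and a real resolver could just as well pick the version in force at a given date.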

Markups and languages


Procedural and descriptive markup

●  With procedural markup we precisely indicate the task to apply to each fragment of text in order to, say, display the document.

●  We indicate bold, italic, font name, font size, margins, etc.

●  Basically, the actual usage determines the markup inserted in the text

●  With descriptive markup we precisely identify the structural or semantic roles of each text fragment.

●  Rather than bold or font size, we indicate aspects such as heading, caption, quotation, paragraph, reference, etc.

●  Basically, since structural and semantic roles are independent of usage, I fill the document with persistent information.


Structured and hierarchical markup

●  Markup can be used to identify and exploit structures, i.e. organization of content in connected fragments. It is possible to identify rules to define a concept of correctness of text.

●  Structures can be suggested (descriptive markup) or required (prescriptive markup). Documents are correct (valid) if they adhere to the rules specified.

●  Some structures can be hierarchical. Legislative documents are often a hierarchy of containers.

●  Capturing correctly the hierarchy of containment is an important characteristic for markup languages.


Markup meta-language

●  A meta-language is a language to define languages, a grammar to build new languages.
●  XML is not a markup language, but a language used to create markup languages.
●  XML does not provide suggestions on how to define specific aspects of a document: bold or italic or reference or paragraph. Rather, it provides a grammar through which such aspects can be defined in a new language.


Document Type Definitions and XML schema languages

●  The DTD or the XML Schema (XSD) are documents that describe an XML-based language.

●  They are the necessary step between the meta-language and the language.

●  A schema document contains the list of allowed elements, attributes and repeatable document fragments (entities)

●  A schema document further contains the set of all constraints that all elements and attributes must satisfy.

●  Constraints are expressed in terms of presence, repeatability and order.
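
Constraints on presence, repeatability and order can be imitated with regular expressions over the sequence of child element names. This is a toy stand-in for a DTD or XML Schema, not a validator: the content models and element names below are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models, written as regular expressions over
# space-terminated child names: "?" = optional, "+" = one or more,
# and the left-to-right sequence fixes the order.
CONTENT_MODELS = {
    "act": r"(preamble )?(section )+",
    "section": r"title (clause )+",
}


def invalid_elements(xml_text):
    """Return the id (or tag) of every element violating its content model."""
    root = ET.fromstring(xml_text)
    bad = []
    for elem in root.iter():
        model = CONTENT_MODELS.get(elem.tag)
        if model is None:
            continue  # No rule declared for this element in the sketch.
        children = "".join(child.tag + " " for child in elem)
        if re.fullmatch(model, children) is None:
            bad.append(elem.get("id", elem.tag))
    return bad


GOOD = '<act><section id="s1"><title/><clause/><clause/></section></act>'
BAD = '<act><section id="s1"><clause/><title/></section></act>'

print(invalid_elements(GOOD))
print(invalid_elements(BAD))
```

In `BAD` both required children are present, yet the section is rejected because the title comes after the clause: order is a constraint in its own right.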


DTDs

●  DTD is the most basic validation language for XML documents.
  -  A W3C standard. Indeed, part of the XML language definition itself.
  -  Uses its own (odd) syntax
  -  Compact, easy to learn and manage
  -  Can stay with the XML document or be referred to by the XML document
  -  Adequately expressive on structures, less so on data content
  -  Universally known and used. All tools support it.


XML Schema

●  XML Schema is another validation language:
  -  Also a W3C standard, but independent of XML (and independently evolving… version 1.1 is to be standardized later this year)
  -  Uses XML-based syntax
  -  Much longer, precise, difficult to read and use
  -  Needs to stay outside of the XML document
  -  More precise both for structures and data content
    ●  You can require a date fragment to actually contain a valid date
  -  Also widely known, but fights against a number of competitors, among which RELAX NG, an ISO standard.
  -  Aimed at cross pollination between information engineering and database structuring.


Displaying XML documents: XSLT

●  Displaying XML documents is a downstream activity: it is very easy

●  XSLT (Extensible Stylesheet Language Transformations) is used to generate displayable versions of XML documents.

●  XSLT is very flexible, and the same XML document can use many different XSLT stylesheets for different media and with different graphical layouts and typographical characteristics.

●  XSLT can be used for generating both online and print versions of the same document.
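
The idea of one source document with several renderings can be imitated in plain Python. This is not XSLT: Python's standard library has no XSLT processor, so the two functions below merely play the role of two stylesheets. The element names and the mapping to HTML tags are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

SAMPLE = (
    '<section id="sec2"><title>Interpretation.</title>'
    "<clause>In this Act ...</clause></section>"
)

# Hypothetical mapping from document elements to HTML elements.
HTML_RULES = {"section": "div", "title": "h2", "clause": "p"}


def to_html(elem):
    """Render an element and its descendants as HTML (first 'stylesheet')."""
    tag = HTML_RULES.get(elem.tag, "span")
    parts = ["<%s>" % tag, escape(elem.text or "")]
    for child in elem:
        parts.append(to_html(child))
        parts.append(escape(child.tail or ""))
    parts.append("</%s>" % tag)
    return "".join(parts)


def to_text(elem):
    """Render a section as plain text (second 'stylesheet')."""
    lines = [elem.findtext("title", default="").upper()]
    lines += ["  " + (clause.text or "") for clause in elem.iter("clause")]
    return "\n".join(lines)


root = ET.fromstring(SAMPLE)
html_out = to_html(root)
text_out = to_text(root)
print(html_out)
print(text_out)
```

Both outputs come from the same descriptive markup; only the rules applied to each element role differ.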


Metadata and the Semantic Web

●  Traditional Web technologies have only dealt with display on-screen (and, partially, on paper).

●  Metadata are information stored about documents, and can be used for proper cataloguing, classification, search and sophisticated applications.

●  The Semantic Web
  -  RDF, OWL, Ontologies, Topic Maps, etc.
  -  Connected initiatives to provide web applications with the capabilities to reason about, rather than just display, documents
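
Reasoning about a document, rather than displaying it, can be hinted at with a handful of subject-predicate-object statements in the style of RDF. The sketch uses plain Python tuples instead of an RDF library; the `act:` and `ex:` identifiers and the tiny class hierarchy are invented for the example.

```python
# Statements as (subject, predicate, object). All identifiers are hypothetical.
TRIPLES = [
    ("act:3-2001", "rdf:type", "ex:Act"),
    ("act:3-2001", "ex:publishedIn", "Italian Gazette #2 (2001)"),
    ("act:3-2001", "ex:amendedOn", "2003-12-19"),
    ("ex:Act", "rdfs:subClassOf", "ex:LegalDocument"),
]


def objects(subject, predicate):
    """Return all objects stated for a subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def is_a(subject, cls):
    """True if the subject's type is cls or a (transitive) subclass of it."""
    pending = objects(subject, "rdf:type")
    seen = set()
    while pending:
        current = pending.pop()
        if current == cls:
            return True
        if current not in seen:
            seen.add(current)
            pending.extend(objects(current, "rdfs:subClassOf"))
    return False


print(is_a("act:3-2001", "ex:LegalDocument"))
```

No statement says directly that the act is a legal document; the answer is derived from the ontology-like statement about `ex:Act`, which is the kind of inference display-only technologies cannot make.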


Next

●  After the break we shall discuss
  -  The syntax of DTDs
  -  The basic ideas of XML Schema
  -  The fundamental concepts of XSLT
  -  A few points about metadata, metadata schemas, and ontologies


Conclusions

●  Markup languages are necessary for enriching data with information about the usages and the applications that can use the data
●  Descriptive markup is the best starting point for the creation of new markup languages.
●  XML is best among markup languages for several reasons:
  -  It is a non-proprietary, widely accepted standard
  -  It is structured, hierarchical, descriptive
  -  It allows both prescriptive and descriptive approaches
  -  Tools exist in all operating systems and computer architectures.