9/9/09
Introduction to XML for parliamentary documents (and all other kinds of documents, actually)
Prof. Fabio Vitali, University of Bologna
Part 1
Purpose of these slides
● Introduce the principal aspects of electronic management of documents
  - What we actually mean by documents (the FRBR hierarchy)
  - What are the components of documents
  - What do we mean by data and metadata about documents
● Introduce some technologies related to electronic management of documents…
  - XML
  - DTDs
  - XML Schema
  - XSLT
  - RDF and OWL
● … all somehow connected and related to parliamentary documents (but not necessarily only to them)
Parliamentary activities
● A complex production system that generates documents with different legal status:
  - Bills and acts to become the law of a country
  - Debate records (or hansards) to become a lasting log of the activities of the parliament
  - Daily and weekly announcements, tablings and reports to organize and master the internal logistics
● Each document is of the utmost importance and must be printed in large quantities and/or made available to a wide part of the population, daily and in a very limited amount of time
Computer support for parliamentary activities
● Support for generating documents
  - Drafting activities, record keeping, translation into national languages, etc.
● Support for workflow
  - Management of documents across lifecycle, storage, security, timely involvement of relevant individuals and offices
● Support for citizens’ access
  - Multi-channel publication (on paper and on the web), search, classification, identification
● Further activities
  - Consolidation, comparison, language synchronization, etc.
How the Web can help
● Born as a publishing medium
● HTML helped make it a big success
● HTML is constrained by its own simplicity
  - Excessive reliance on typographic rather than semantic description
  - Few rules, and even those are not strictly enforced
● A new language was invented, called XML, that could solve that
  - Clear differentiation between appearance and meaning
  - Strong syntactic rules, strictly enforced to guarantee uniformity, homogeneity, sophisticated applications
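As a minimal sketch of what "strictly enforced syntactic rules" means in practice, the Python fragment below feeds two small strings to a standard XML parser. The element names and the helper `is_well_formed` are assumptions made for illustration only. The point is that an XML parser rejects a document with an unclosed tag, where a tolerant HTML browser would try to guess.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Tags are properly nested and closed: the parser accepts the document.
good = "<section><title>Interpretation.</title></section>"
# The <title> tag is never closed: the parser refuses the whole document.
bad = "<section><title>Interpretation.</section>"

print(is_well_formed(good))
print(is_well_formed(bad))
```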
XML
● XML (Extensible Markup Language) is an extremely widely adopted W3C standard.
● XML is pure syntax, without pre-defined semantics. This allows document designers to provide their own semantics.
● Thanks to the associated languages (DTD, XSLT, RDF) we can create sophisticated applications with great flexibility of use.
● XML allows us to create markup languages that are readable, generic, structured, hierarchical.
Parliamentary documents and XML
● XML is ideal for representing parliamentary documents (and especially bills and acts):
- They have a well-defined structure, which is systematic and standardized
- There are required and optional parts according to rules and tradition
- There are containment constraints that determine the global correctness of the document
- There are references to other texts (schedules, other acts, etc.) that can fruitfully be used to create a hypertext network.
Why is XML good?
[Figure: diagram labelled “Energy / Information”, contrasting “Conversion is difficult” with “Conversion is very easy”]
Documents
3 problems
● When are two documents the same document?
● What is important to record of a document?
● How do we refer precisely to the normative content of the documents and of their parts?
3 problems (2)
● When are two documents the same document?
  - When they are different physical copies of the same document (two identical books)
  - When they are different ways by which the same words appear (a MS Word file and its printout on paper)
  - When they are two different sets of sentences with the same name and purpose (two versions of the same act)
● What is important to record of a document?
● How do we refer precisely to the normative content of the documents and of their parts?
3 problems (2)
● When are two documents the same document?
● What is important to record of a document?
  - The words and punctuation it is composed of.
  - The way in which it is shown on the page (pagination, typography, colors, margins and fonts)
  - The conceptual role of each fragment (this sentence is a title, this is a clause, this is a reference, etc.)
● How do we refer precisely to the normative content of the documents and of their parts?
3 problems (2)
● When are two documents the same document?
● What is important to record of a document?
● How do we refer precisely to the normative content of the documents and of their parts?
  - The meaning
  - The words
  - The name
3 solutions
● When are two documents the same document?
  - The IFLA FRBR hierarchy: from abstract ideas to physical files
    ● Work, Expression, Manifestation, Item
● What is important to record of a document?
  - The SGML components: from meaning to typography
    ● Content, presentation, structure
● How do we refer precisely to the normative content of the documents and of their parts?
  - The semantic web approach: applying semantics where it fits
    ● data, metadata, ontology
The IFLA FRBR hierarchy (1)
● Work: a distinct intellectual creation.
● Expression: the specific form in which a work is realized
  - In our model, all variants and versions of a text that incorporate amendments and updates to an earlier version are considered expressions of the same work.
● Manifestation: the representation of an expression according to the requirements of a medium
● Item: a single exemplar of a manifestation
  - In our model, a manifestation is a representation of an expression as an electronic document in a specific format
  - All copies of the same (identical) manifestation are items. All items are accessible in a specific position on a specific computer.
The IFLA FRBR hierarchy (2)
● Work:
  - The play “Hamlet” by William Shakespeare
  - The Italian act #3 (5 January 2001)
● Expression:
  - The first quarto of “Hamlet” (1603)
  - The first folio of “Hamlet” (1623)
  - The movie version of “Hamlet” by Kenneth Branagh (1996)
  - The original version of Italian Act 3/2001
  - The amended version of Italian Act 3/2001 as of 19/12/2003
The IFLA FRBR hierarchy (3)
● Manifestation:
  - One of the printed versions of the first folio version of “Hamlet” (e.g., Penguin Books, 1994)
  - One of the computer versions of “Hamlet” (e.g., Project Gutenberg)
  - The NIR XML version of the amended version of Italian Act 3/2001 as of 19/12/2003
  - The printed version of the original version of Italian Act 3/2001 on the Italian Gazette #2 (2001)
● Item:
  - My own copy of “Hamlet” by Penguin Books; the copy of “Hamlet” on Project Gutenberg’s own site
  - The copy of the NIR XML version of Italian Act 3/2001 on my computer. The one I copy on your computers.
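The four FRBR levels can be sketched as a chain of small records, each pointing to the level above it. This is only an illustration under stated assumptions: the class and field names (`version`, `data_format`, `location`) are made up for this example and are not part of the FRBR specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    title: str


@dataclass(frozen=True)
class Expression:
    work: Work
    version: str  # hypothetical field: which version of the text this is


@dataclass(frozen=True)
class Manifestation:
    expression: Expression
    data_format: str  # hypothetical field: the medium or file format


@dataclass(frozen=True)
class Item:
    manifestation: Manifestation
    location: str  # hypothetical field: where this particular copy lives


act = Work("Italian Act 3/2001")
original = Expression(act, "original version")
amended = Expression(act, "as amended on 19/12/2003")
nir_xml = Manifestation(amended, "NIR XML")
my_copy = Item(nir_xml, "file on my computer")
your_copy = Item(nir_xml, "file on your computer")

# Two different items, one manifestation; two different expressions, one work.
print(my_copy != your_copy, my_copy.manifestation == your_copy.manifestation)
print(original != amended, original.work == amended.work)
```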
The SGML components (1)
● Content
  - What exactly was written in the document.
  - The content is composed of words, punctuation, sentences, images, paragraphs and so on.
● Structure
  - How the content is organized
  - All documents have an internal organization, composed of subdivisions, hierarchies, preambles and conclusions, attachments, and so on. Within a paragraph, the structure also covers all parts that have a relevance (e.g. references, quotations, etc.)
● Presentation
  - The typographical choices to present a document on screen or on paper.
The SGML components (2)
● The structure adds meaning to pieces of content.
  - The text “Interpretation” assumes meaning once we know it is the title of article #2 of the Italian Act 3/2001
● The structure connects the presentation to the content
  - Once we know that the text “Interpretation” is the title of an article, we can apply the typographical choices associated with article titles.
● The structure can be used to test the correctness of a document
  - We can deduce that a document is incorrect if there is no title associated with an article.
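The last point (a document is incorrect if an article has no title) can be sketched as a small structural check. The element names `act`, `article`, `title` and `p`, the sample text and the helper `articles_without_title` are assumptions for illustration; a real vocabulary would define its own names.

```python
import xml.etree.ElementTree as ET


def articles_without_title(root):
    """Return the id of every <article> that has no <title> child."""
    return [a.get("id") for a in root.iter("article") if a.find("title") is None]


# Hypothetical document: the third article is missing its title.
doc = ET.fromstring(
    "<act>"
    '<article id="art1"><title>Short title</title><p>Some text.</p></article>'
    '<article id="art2"><title>Interpretation.</title><p>Some text.</p></article>'
    '<article id="art3"><p>Some text.</p></article>'
    "</act>"
)

missing = articles_without_title(doc)
print(missing)
```

An empty list would mean that every article carries a title, so the document passes this particular rule.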
The SGML components (3)
● The content itself can be divided into categories:
  - Pure content
    ● Appears in the document because it is instrumental to the message conveyed by the document. For instance, the text “THE RETIREMENT BENEFITS AUTHORITY”
    ● This is what we really are interested in
  - Structural content
    ● Appears because it marks the beginning or the end of a structure. For instance, the text “Part II”
    ● This can be used for deducing information about the structure
  - Presentation-oriented content
    ● Appears because it is dictated by the presentation choices of the document. For instance, page numbers and repeating headers.
    ● This can be safely ignored and thrown away.
The Semantic Web approach (1)
● Data:
  - The actual text as it was provided initially by the author of the document
● Metadata:
  - Any consideration or comment or additional information that can be expressed on the content and on the document.
  - Metadata is generated either by human intervention, or through automated processes.
● Ontology (in short):
  - A representation of the conceptual model that shapes all metadata associated with a document.
The semantic web approach (2)
● Authors’ contribution: data
  - The words and punctuation and breaks, exactly as they have been written and accepted by the original author (with legislation, the legislative body)
● Editors’ contribution: metadata
  - Publication data. Lifecycle information. Footnotes. Analysis of provisions.
  - Metadata is useless unless it is provided following a precise schema, called ontology.
    ● In a way, editors are the authors of the metadata
    ● Put another way, metadata is information about a document that was not provided by its authors.
Markup
● We call markup the additions to a written text that let applications work on the text:
  - Structural markup
  - Descriptive markup
  - Presentation markup
● With XML, we add markup to the text of a document so that further applications can work on it.
● XML uses a special syntax to add markup and to distinguish it from the text
XML Markup (1)
● XML markup clearly distinguishes elements, text (or #PCDATA) and attributes.
● An element is delimited by a start tag and an end tag, which are distinguishable through angle brackets:
  - <title>Interpretation.</title>
● The content of an element can be
  - Just text (simple text elements)
  - Other elements (structural elements)
  - A mix of text and other elements (mixed content elements)
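The three kinds of element content can be told apart programmatically. The sketch below assumes a small made-up fragment and a hypothetical helper `content_kind`; it relies on the way Python's `xml.etree.ElementTree` stores the text before the first child in `.text` and the text after each child in that child's `.tail`.

```python
import xml.etree.ElementTree as ET


def content_kind(element):
    """Classify an element as simple text, structural, or mixed content."""
    has_children = len(element) > 0
    # Text can sit before the first child (.text) or after any child (.tail).
    has_text = bool((element.text or "").strip()) or any(
        (child.tail or "").strip() for child in element
    )
    if has_children and has_text:
        return "mixed content"
    if has_children:
        return "structural"
    return "simple text"  # empty elements are also reported here


section = ET.fromstring(
    "<section>"
    "<title>Interpretation.</title>"
    "<clause>As in <ref>section 2</ref> of this act.</clause>"
    "</section>"
)

print(content_kind(section))
print(content_kind(section.find("title")))
print(content_kind(section.find("clause")))
```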
XML Markup (2)
● Within the start tag we can sometimes find attributes, i.e. additional information about the element
    ● <act contains="SingleVersion"> … </act>
  - A special attribute is “href”, which indicates the destination of a reference
    ● As in <ref href="#sec2">section 2</ref> of this act.
  - Another special attribute is “id”, which provides a reliable name for the element to be used in references
    ● <clause id="sec1-cla1"> … </clause>
    ● <section id="sec2"> … </section>
● In a way, metadata is information about the document, while the attribute is information about the element
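How “href” and “id” work together can be sketched in a few lines: an index of all elements by their id, and a lookup that follows a local reference. The fragment reuses the element and attribute names from the slide, but the document itself and the helper `resolve` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical act assembled from the examples on the slide.
act = ET.fromstring(
    '<act contains="SingleVersion">'
    '<section id="sec1">'
    '<clause id="sec1-cla1">As in <ref href="#sec2">section 2</ref> of this act.</clause>'
    "</section>"
    '<section id="sec2"><title>Interpretation.</title></section>'
    "</act>"
)

# Index every element that carries an id attribute.
by_id = {el.get("id"): el for el in act.iter() if el.get("id") is not None}


def resolve(ref):
    """Return the element a local reference points to, or None."""
    href = ref.get("href", "")
    if not href.startswith("#"):
        return None  # not a reference inside this document
    return by_id.get(href[1:])


target = resolve(act.find(".//ref"))
print(target.tag, target.get("id"))
```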
Naming documents and fragments
● Uniform Resource Identifiers
  - These are used throughout the World Wide Web to indicate resources.
  - The best known are the URLs (Uniform Resource Locators) that are used to navigate on the web
    ● http://www.akomantoso.org/09-examples
● Fragment Identifiers
  - Within a document, one can point to a specific part of the text through the fragment identifier
    ● http://www.akomantoso.org/09-examples#part3
  - This corresponds to an element whose attribute id is “part3”
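A short sketch of how a fragment identifier is used: the URI is split into the document address and the fragment, and the fragment is then matched against id attributes. The XML document below is a made-up stand-in for whatever would be retrieved from that address; nothing is actually downloaded.

```python
from urllib.parse import urldefrag
import xml.etree.ElementTree as ET

uri = "http://www.akomantoso.org/09-examples#part3"

# Split the URI into the address of the document and the fragment identifier.
document_url, fragment = urldefrag(uri)

# Hypothetical content standing in for the document found at document_url.
document = ET.fromstring(
    '<act><part id="part1"/><part id="part2"/>'
    '<part id="part3"><title>Part III</title></part></act>'
)

# The fragment designates the element whose id attribute has the same value.
target = next((el for el in document.iter() if el.get("id") == fragment), None)
print(document_url, fragment, target.tag)
```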
Naming documents and fragments (2)
● In our case the situation is more complex. Works, expressions and manifestations are not physical resources, but abstract entities.
● Yet, references are rarely (or never) to items, but to those concepts
● So works, expressions and manifestations must have their own URI, which is not a URL (i.e., it does not correspond to a physical address on a computer)
● The act of finding out what is the URL of the item that best represents the manifestation that we are looking for is called URI resolution.
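URI resolution can be pictured as a lookup from an abstract name to the address of a concrete copy. The sketch below is purely illustrative: the URI syntax, the server names and the catalogue `ITEMS` are all invented for this example and do not reproduce any actual naming convention or resolver.

```python
# Hypothetical catalogue: abstract URI of a manifestation -> URLs of known items.
ITEMS = {
    "/it/act/2001/3/2003-12-19/nir.xml": [
        "http://server-a.example.org/acts/2001-3-20031219.xml",
        "http://server-b.example.org/mirror/2001-3-20031219.xml",
    ],
}


def resolve(manifestation_uri, is_reachable=lambda url: True):
    """Return the URL of the first reachable item for the URI, or None."""
    for url in ITEMS.get(manifestation_uri, []):
        if is_reachable(url):
            return url
    return None


uri = "/it/act/2001/3/2003-12-19/nir.xml"
print(resolve(uri))
# If the first server is down, the resolver falls back to another copy.
print(resolve(uri, is_reachable=lambda url: "server-b" in url))
```

The `is_reachable` parameter is a placeholder for whatever policy decides which item "best represents" the manifestation; here it simply accepts the first candidate.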
Markups and languages
Procedural and descriptive markup
● With procedural markup we precisely indicate the task to apply to each fragment of text in order to, say, display the document.
● We indicate bold, italic, font name, font size, margins, etc.
● Basically, the actual usage determines the markup inserted in the text
● With descriptive markup we precisely identify the structural or semantic roles of each text fragment.
● Rather than bold or font size, we indicate aspects such as heading, caption, quotation, paragraph, reference, etc.
● Basically, since structural and semantic roles are independent of usage, I fill the document with persistent information.
Structured and hierarchical markup
● Markup can be used to identify and exploit structures, i.e. organization of content in connected fragments. It is possible to identify rules to define a concept of correctness of text.
● Structures can be suggested (descriptive markup) or required (prescriptive markup). Documents are correct (valid) if they adhere to the rules specified.
● Some structures can be hierarchical. Legislative documents are often a hierarchy of containers.
● Capturing correctly the hierarchy of containment is an important characteristic for markup languages.
Markup meta-language
● A meta-language is a language to define languages, a grammar to build new languages.
● XML is not a markup language, but a language used to create markup languages.
● XML does not provide suggestions on how to define specific aspects of a document: bold or italic or reference or paragraph. Rather, it provides a grammar through which such aspects can be defined in a new language.
Document Type Definitions and XML schema languages
● A DTD or an XML Schema (XSD) is a document that describes an XML-based language.
● They are the necessary step between the meta-language and the language.
● A schema document contains the list of allowed elements, attributes and repeatable document fragments (entities)
● A schema document further contains the set of all constraints that elements and attributes must satisfy.
● Constraints are expressed in terms of presence, repeatability and order.
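To make "presence, repeatability and order" concrete, here is a hand-written check in Python. It is not a DTD or XSD validator, only a sketch of the kind of rule such schemas express; the content model `ARTICLE_MODEL` and the element names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model for <article>, in the required order:
# (child name, minimum occurrences, maximum occurrences or None for unbounded)
ARTICLE_MODEL = [("num", 1, 1), ("title", 0, 1), ("clause", 1, None)]


def conforms(element, model):
    """Check the children of element against an ordered content model."""
    children = [child.tag for child in element]
    pos = 0
    for name, min_occurs, max_occurs in model:
        count = 0
        # Consume consecutive children with the expected name, up to the maximum.
        while (
            pos < len(children)
            and children[pos] == name
            and (max_occurs is None or count < max_occurs)
        ):
            pos += 1
            count += 1
        if count < min_occurs:
            return False  # a required child is missing or out of order
    return pos == len(children)  # no unexpected children left over


valid = ET.fromstring(
    "<article><num>2</num><title>Interpretation.</title><clause/><clause/></article>"
)
wrong_order = ET.fromstring("<article><title/><num/><clause/></article>")
print(conforms(valid, ARTICLE_MODEL), conforms(wrong_order, ARTICLE_MODEL))
```

In DTD notation the same hypothetical model would read roughly `(num, title?, clause+)`.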
DTDs
● DTD is the most basic validation language for XML documents.
  - A W3C standard. Indeed, part of the XML language definition itself.
  - Uses its own (odd) syntax
  - Compact, easy to learn and manage
  - Can stay with the XML document or be referred to by the XML document
  - Adequately expressive on structures, less so on data content
  - Universally known and used. All tools support it.
XML Schema
● XML Schema is another validation language:
  - Also a W3C standard, but independent of XML (and independently evolving… version 1.1 is to be standardized later this year)
  - Uses XML-based syntax
  - Much longer, precise, difficult to read and use
  - Needs to stay outside of the XML document
  - More precise both for structures and data content
    ● You can require a date fragment to actually contain a valid date
  - Also widely known, but fights against a number of competitors, among which Relax NG, an ISO standard.
  - Aimed at cross-pollination between information engineering and database structuring.
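The remark about dates can be illustrated with a data-content check of the kind a schema with a date type would perform. The sketch below only approximates it in plain Python with `date.fromisoformat`; the element name `date` and the helper `has_valid_date` are assumptions, and the exact set of accepted spellings differs from the schema date type (for instance regarding time zone suffixes).

```python
from datetime import date
import xml.etree.ElementTree as ET


def has_valid_date(element):
    """Return True if the element's text is a real calendar date (YYYY-MM-DD)."""
    try:
        date.fromisoformat((element.text or "").strip())
        return True
    except ValueError:
        return False


ok = ET.fromstring("<date>2003-12-19</date>")
impossible = ET.fromstring("<date>2003-02-30</date>")  # 30 February does not exist
free_text = ET.fromstring("<date>19/12/2003</date>")   # not in the expected format

print(has_valid_date(ok), has_valid_date(impossible), has_valid_date(free_text))
```

A purely structural check would accept all three documents, since each contains a `date` element with some text; only a data-content check tells them apart.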
Displaying XML documents: XSLT
● Displaying XML documents is a downstream activity: it is very easy
● XSLT (Extensible Stylesheet Language Transformations) is used to generate displayable versions of XML documents.
● XSLT is very flexible, and the same XML document can use many different XSLT stylesheets for different media and with different graphical layouts and typographical characteristics.
● XSLT can be used for generating both on-line and print versions of the same document.
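The idea of one source and several renderings can be sketched without XSLT itself. In the fragment below two plain Python functions stand in for two stylesheets, one producing HTML and one producing plain text from the same descriptive markup. This is an analogy under stated assumptions, not XSLT: the element names, the sample clause text and both functions are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source document with descriptive markup only.
section = ET.fromstring(
    '<section id="sec2">'
    "<title>Interpretation.</title>"
    "<clause>Text of the first clause.</clause>"
    "</section>"
)


def to_html(sec):
    """Stand-in for an on-line stylesheet: titles become headings."""
    parts = [f"<h2>{escape(sec.findtext('title', ''))}</h2>"]
    parts += [f"<p>{escape(c.text or '')}</p>" for c in sec.findall("clause")]
    return "\n".join(parts)


def to_plain_text(sec):
    """Stand-in for a print stylesheet: upper-case title, numbered clauses."""
    lines = [sec.findtext("title", "").upper()]
    lines += [
        f"  {i}. {c.text or ''}"
        for i, c in enumerate(sec.findall("clause"), start=1)
    ]
    return "\n".join(lines)


print(to_html(section))
print(to_plain_text(section))
```

Because the source says only that a piece of text is a title or a clause, each rendering is free to choose its own typography.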
Metadata and the Semantic Web
● Traditional Web technologies have only dealt with display on-screen (and, partially, on paper).
● Metadata are information stored about documents, and can be used for proper cataloguing, classification, search, sophisticated applications.
● The Semantic Web
  - RDF, OWL, Ontologies, Topic Maps, etc.
  - Connected initiatives to provide web applications with the capabilities to reason about, rather than just display, documents
Next
● After the break we shall discuss
  - The syntax of DTDs
  - The basic ideas of XML Schema
  - The fundamental concepts of XSLT
  - A few points about metadata, metadata schemas, and ontologies
Conclusions
● Markup languages are necessary for enriching data with information about the usages and the applications that can use the data
● Descriptive markup is the best starting point for the creation of new markup languages.
● XML is the best foundation for markup languages for several reasons:
  - It is a non-proprietary, widely accepted standard
  - It is structured, hierarchical, descriptive
  - It allows both prescriptive and descriptive approaches
  - Tools exist for all operating systems and computer architectures.