lis512 lecture 5 xml based metadata formats. constraints on xml xml defines a genaral syntax for...

30
lis512 lecture 5 XML based metadata formats

Upload: myles-whitehead

Post on 16-Jan-2016

249 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

lis512 lecture 5

XML based metadata formats

Page 2: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

constraints on XML

XML defines a genaral syntax for documents. These documents can contain records. Records are put together in an XML

document. But to makes sense of XML, we need

something more.

Page 3: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

valid XML

XML that is well-formed may be valid against a set of constraints.

Such a set of constraints can be formally written down.

An old form of a formally written set of constraints is called a DTD.

More recently schema files have become more popular, they do the same thing.

Page 4: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

example: XHTML DTD

[some of this not quite true] XHTML is an XML DTD. Basically it is a set of constraints that an XML

documents has to follow if it is to be a web page.

These constraints are written out in a file called the XHTML DTD.

Page 5: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

sample HTML constraints The root element of a HTML document is

called <html>. It admits two child elements, called <head>

and <body>. There must be one <head> and one <body>. The <head> appears first, then appears the

<body>. <html><head>...</head> <body>...</body></html>

Page 6: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

more HTML constraints

All elements in the body are classified as text level and block-level.

A text-level element has to be a child of a block-level element.

A text-level element can not have block-level elements as children.

Page 7: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

more HTML constraints

There is an element <img> that can be placed in the body. It is a text level element. It is always an empty elemnet. It requires an attribute called src= and an attribute called alt=. It may take a bunch of other attributes.

The fact that <img> creates an image is understood. It is not something that can be formally specified in the DTD.

Page 8: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

meaning what

meaning that a computer can check that there is an src= attribute pointing to an image file.

It can check that the image can be accessed (at this time) and

It can not check the image is “appropriate” (e.g. it is the American flag rather than the Russian flag)

Page 9: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

other XML formats

I will mention here a few other formats that are used in the bibligraphic universe.

Contrary to HTML, which is really a document format, these formas are used to encode certain record.

One we have already seen, MARC XML.

Page 10: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

constrain languages

• There are basically languages that are used to model constraints on XML documents.– DTDs – XML Schema files– Relax NG

• The don't have the same capabilities. Some can implement constraints that others can't.

• No language is the to implement all constraints possible.

Page 11: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

DTDs

• DTDs were define with a parent format of XML called SGML.

• DTDs are not written in an XML syntax, they have their own syntax.

• No software has been written that fully implements SGML and its DTDs.

• DTDs are now considered a legacy format.

Page 12: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

XML Schema

• It's a WC3 sponsored format that aimed to be the ultimate schema language.

• It is quiet abominably complicated and fails miserably in certain areas. For example, it has poor support for unordered contents.

• A look at http://www.w3.org/TR/xmlschema-1/ will help.

Page 13: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

Relax NG

• This is a more informal standard for constraining XML documents.

• It is much easier to learn and use than XML Schema.

• Free software implements the entire standard.

Page 14: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

in the archival world

• Documents and other artifacts held in archives are usually not catalogued item by item. That would be too expensive.

• Instead archivists create "finding aids", documents that describe what can be found in an archival collection.

• A large archive may have hundreds of finding aid, usually on per box of "stuff".

Page 15: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

encoded archival description• http://www.archivists.org/saagroups/ead/

aboutEAD.html says "EAD stands for Encoded Archival Description, and is a non-proprietary de facto standard for the encoding of finding aids for use in a networked (online) environment. Finding aids are inventories, indexes, or guides that are created by archival and manuscript repositories to provide information about specific collections. (cont.)

Page 16: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

continuation of previous quote.

• While the finding aids may vary somewhat in style, their common purpose is to provide detailed description of the content and intellectual organization of collections of archival materials. EAD allows the standardization of collection information in finding aids within and across repositories."

Page 17: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

EAD history

• EAD started in 1993 (just as RePEc) at the University of California as a project lead by Daniel V. Pitti.

• EAD is formally specified as an SGML DTD, but now also as XML Schema file and ReLAX NG Schema.

• The current version is called EAD 2002. It is maintained by the Library of Congress.

Page 18: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

other formats

• There are two modern versions of EAD around. – A RelaxNG form http://www.loc.gov/ead/ead.rng– An XML Schema http://www.loc.gov/ead/ead.xsd

• Both RelaxNG and XML Schema are schema languages. They are ways to specify constraints on XML documents.

Page 19: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

EAD problem

• EAD is very difficult to implement for archivist who generally have low IT skills.

• They are then reliant on somebody else to do it for them. Such personnel does not come cheap.

• See Sonia Yaco’s paper: "It’s Complicated: Barriers to EAD Implementation", based on Survey work by Anlex consulting.

• Library of Congress example appears invalid!

Page 20: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

in the humanities: TEI

• The TEI is the Text Encoding Initiative. • It is a DTD, Relax NG and XML Schema

specification, with documentation, on how to encode texts.

• The texts they talk about are mainly historic and cultural artifacts, – poetry– historical documents

Page 21: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

problems to address

• How to represent faithfully a printed text in an XML form?– imitate page appearance– character recognition and normalization– guessing formal structure from visual appearance.

Page 22: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

example: Langland's Piers Plowman• <l>

<seg>In a somer seson,</seg> <seg>whan softe was the sonne,</seg></l><l> <seg>I shoop me into shroudes</seg> <seg>as I a sheep were,</seg></l><l> <seg>In habite as an heremite </seg> <seg>unholy of werkes,</seg></l><l> <seg>Went wide in this world </seg> <seg>wondres to here.</seg></l>

Page 23: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

example from Pope's Essay on Criticism

<div type="book"n="1"met="-+|-+|-+|-+|-+/"rhyme="aa"><lg n="1" type="paragraph"> <l>'Tis hard to say, if greater Want of Skill</l> <l>Appear in <hi>Writing</hi> or in <hi>Judging</hi> ill;</l> <l>But, of the two, less dang'rous is th'Offence,</l><l>To tire our <hi>Patience</hi>, than mis-lead our <hi>Sense</hi>:</l></lg></div>

Page 24: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

example for manuscript description<handDesc hands="3"><handNote xml:id="Eirsp-1" scope="minor"><p>The first part of the manuscript, <locus from="1v" to="72v:4">fols 1v-72v:4</locus>, is written in a practised

Icelandic Gothic bookhand. This hand is not found elsewhere.</p></handNote><handNote xml:id="Eirsp-2" scope="major"> <p>The second part of the manuscript,

<locus from="72v:4" to="194v">fols 72v:4-194</locus>, is written in a hand contemporary with the first; it can also be found in a fragment of <title>Knýtlinga saga</title>, <ref>AM 20b II fol.</ref>.</p> </handNote> <handNote xml:id="Eirsp-3" scope="minor"> <p>The third hand has written the majority of the chapter headings.

This hand has been identified as the one also found in <ref>AM 221 fol.</ref>.</p></handNote></handDesc>

Page 25: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

ONIX

• ONIX is a record format used in the book trading and publishing industry.

• The London based EDItEUR group, a membership based organization maintains the standard.

Page 26: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

purpose

• ONIX is essentially related to product descriptions in the publishing industry. – Publishers can use it to communicate with

vendors. – Publishers can use it internally to streamline

procedures.• There are there formats– books– serials– rights

Page 27: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

MODS

• This is the Metadata Objects and Description Schema.

• It is basically a MARC 'light'.• By converting MARC to MODS, you loose

some information.• The resulting information is less detailed and

less complicated.

Page 28: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

METS

• http://www.loc.gov/standards/mets/METS%20Documentation%20final%20070930%20msw.pdf says

• "The Metadata Encoding and Transmission Standard (METS) is a data encoding and transmission specification, expressed in XML, that provides the means to convey the metadata necessary for both the management of digital objects within a repository and the exchange of such objects between repositories (or between repositories and their users)."

Page 29: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

METS

• METS is as container structure that does not only allow data about an object of interest be encoded, but also allows for the object itself to be encoded.

• As a result, it is very complicated.

Page 30: Lis512 lecture 5 XML based metadata formats. constraints on XML XML defines a genaral syntax for documents. These documents can contain records. Records

Journal Publishing Tag Set

• This XML DTD is issued by the National Library of Medicine.

• It is the format followed by records in the PubMed database.

• PubMed is the largest sort-of freely available bibliographic dataset for scientific articles.

• Because of its market power, the tag set is spreading through the publishing industry.