METADATA TRANSFORMATION
Important technical considerations: extraction / normalization / enrichment
Extraction: XML is picky
• All tags must be closed, as opposed to HTML
– <element></element>
– <element/>
• Doesn't like special characters; they must be escaped as entities
– & = &amp;
– < = &lt;
– > = &gt;
• Encoding sensitive: the declared encoding must match the actual bytes
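A minimal sketch of the points above, using only the Python standard library. The element name "title" and the sample value are illustrative assumptions, not taken from any real export.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def to_element(tag: str, value: str) -> str:
    """Serialize one value as a closed element, escaping &, < and >."""
    return f"<{tag}>{escape(value)}</{tag}>"


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


raw = "Marks & Spencer <London>"       # illustrative value with special characters
good = to_element("title", raw)        # escaped before export
bad = f"<title>{raw}</title>"          # pasted in unescaped

print(good)
print(is_well_formed(good), is_well_formed(bad))
```

Escaping at export time is usually cheaper than repairing rejected records later; parsing the output once before delivery catches unclosed tags as well.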
Extraction: Attributes
• Consider the following example:
• If you export names, post codes and coordinates into the "coverage" element, how can you use these afterwards?
– <coverage>London</coverage>
– <coverage>12.1234,89.1235531</coverage>
• The ESE doesn't define attributes for anything but language, so a distinction like this is not available in plain ESE
– <coverage type="text">London</coverage>
– <coverage type="coordinates">12.1234,89.1235531</coverage>
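Without a type attribute, a consumer has to guess what each "coverage" value means. The sketch below shows one way such guessing might look; the "record" wrapper element and the pattern for coordinates are assumptions made for illustration, and a heuristic like this will misclassify some values.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern: two decimal numbers separated by a comma.
COORD_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*$")


def classify_coverage(value: str):
    """Guess whether a coverage value is a coordinate pair or free text."""
    m = COORD_RE.match(value)
    if m:
        return ("coordinates", (float(m.group(1)), float(m.group(2))))
    return ("text", value.strip())


# Hypothetical record wrapper; only the coverage values come from the slide.
record = ET.fromstring(
    "<record>"
    "<coverage>London</coverage>"
    "<coverage>12.1234,89.1235531</coverage>"
    "</record>"
)

classified = [classify_coverage(el.text or "") for el in record.findall("coverage")]
print(classified)
```

A post code and a place name both end up as "text" here, which is exactly the ambiguity the slide warns about.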
Extraction: Additional data
• ESE may not always contain all the information that may be interesting from an aggregator's perspective
• The ESE can be extended without breaking the format, but it needs to be done in a way that does not conflict or interfere with the XML structure of the ESE elements
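One common way to add data without interfering with existing elements is to put the additions in a separate XML namespace. The sketch below assumes that approach; the "ext" namespace URI, the "coordinates" element and the "record" wrapper are hypothetical, and only the Dublin Core namespace is a real one.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element namespace
EXT = "http://example.org/ext/"           # hypothetical extension namespace

ET.register_namespace("dc", DC)
ET.register_namespace("ext", EXT)

# Hypothetical record wrapper holding one standard and one extension element.
record = ET.Element("record")
ET.SubElement(record, f"{{{DC}}}coverage").text = "London"
ET.SubElement(record, f"{{{EXT}}}coordinates").text = "12.1234,89.1235531"

xml_text = ET.tostring(record, encoding="unicode")


def standard_elements(rec):
    """Return the child elements that are not in the extension namespace."""
    return [el for el in rec if not el.tag.startswith(f"{{{EXT}}}")]


print(xml_text)
print([el.tag for el in standard_elements(ET.fromstring(xml_text))])
```

A consumer that only knows the standard elements can skip everything in the extension namespace, while an aggregator that understands it can pick up the extra data.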
Normalization: dates
• Date extraction is somewhat inaccurate and may well render "bogus" output
– "...it was almost as bad as in the 1920s..."
– "...back in the dark ages..."
– Values given by reference may be erroneously considered valid for the content
• If uncertain about what to put where – consider what is most useful to the end-user
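The pitfall can be sketched with a naive year extractor. Both patterns below are illustrative assumptions, not a recommended extraction method; the second one merely flags decade references such as "1920s" as unreliable instead of accepting them.

```python
import re

# Naive pattern: any four-digit number from 1000 to 2099.
YEAR_RE = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")
# Same pattern, but also captures a trailing "s" (decade reference).
CANDIDATE_RE = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(s?)(?!\d)")


def naive_years(text: str):
    """Return every four-digit year found, with no sanity checks."""
    return [int(y) for y in YEAR_RE.findall(text)]


def year_candidates(text: str):
    """Return (year, reliable) pairs; decade references are marked unreliable."""
    return [(int(y), suffix == "") for y, suffix in CANDIDATE_RE.findall(text)]


print(naive_years("...it was almost as bad as in the 1920s..."))
print(naive_years("...back in the dark ages..."))
print(year_candidates("...it was almost as bad as in the 1920s..."))
```

The first call returns a year that says nothing about when the object was made, and the second finds nothing at all, which matches the two failure modes on the slide.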
Normalization: vocabularies
• Van Eyck, Jan
– Jan Van Eyck
– Van Eyck Jan
– Van Eyck, Jan en Hubert
– gebroeders Van Eyck
– Van Eyck, J. (1395-1441)
• (Example from “Erfgoedplus.be”, courtesy of Jef Malliet)
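A minimal sketch of mapping such variants to one preferred form through a lookup table. The preferred labels, the table itself and the decision to read "J." as Jan are assumptions for illustration; in practice the table would come from an authority file.

```python
import re

JAN = "Van Eyck, Jan"        # assumed preferred label
HUBERT = "Van Eyck, Hubert"  # assumed preferred label


def match_key(name: str) -> str:
    """Reduce a name to a comparison key: no dates, punctuation or case."""
    name = re.sub(r"\(.*?\)", " ", name)   # drop parenthesised dates
    name = re.sub(r"[^\w\s]", " ", name)   # drop punctuation
    return " ".join(name.casefold().split())


# Hypothetical variant table, keyed by the comparison key.
VARIANTS = {
    match_key(variant): persons
    for variant, persons in [
        ("Van Eyck, Jan", [JAN]),
        ("Jan Van Eyck", [JAN]),
        ("Van Eyck, J.", [JAN]),                  # assumes "J." means Jan
        ("Van Eyck, Jan en Hubert", [JAN, HUBERT]),
        ("gebroeders Van Eyck", [JAN, HUBERT]),
    ]
}


def normalize_name(name: str):
    """Return the preferred labels for a name variant, or [] if unknown."""
    return VARIANTS.get(match_key(name), [])


print(normalize_name("Van Eyck, J. (1395-1441)"))
print(normalize_name("gebroeders Van Eyck"))
```

Returning an empty list for unknown variants keeps unmatched values visible for manual review instead of guessing.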
Normalization: precision
• ca. 1560
• 1560 ?
• 16th century
• 1500-1599
• (Example from “Erfgoedplus.be”, courtesy of Jef Malliet)
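One way to make these values comparable is to turn each into a year range plus a note on how precise it is. The sketch below is an assumption about how that could be done: the patterns cover only the four forms listed, the ten-year margin for "ca." is arbitrary, and centuries follow the slide's own 1500-1599 convention.

```python
import re

CIRCA_MARGIN = 10  # assumed margin in years for "ca." dates


def to_year_range(value: str):
    """Map a date expression to (start, end, precision), or None if unknown."""
    v = value.strip().lower()

    m = re.fullmatch(r"(\d{4})\s*-\s*(\d{4})", v)
    if m:
        return (int(m[1]), int(m[2]), "range")

    m = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th) century", v)
    if m:
        century = int(m[1])
        return ((century - 1) * 100, century * 100 - 1, "century")

    m = re.fullmatch(r"(?:ca\.?|circa)\s*(\d{4})", v)
    if m:
        year = int(m[1])
        return (year - CIRCA_MARGIN, year + CIRCA_MARGIN, "circa")

    m = re.fullmatch(r"(\d{4})\s*\?", v)
    if m:
        return (int(m[1]), int(m[1]), "uncertain")

    m = re.fullmatch(r"\d{4}", v)
    if m:
        return (int(v), int(v), "exact")

    return None  # leave unrecognised values for manual review


for example in ["ca. 1560", "1560 ?", "16th century", "1500-1599"]:
    print(example, "->", to_year_range(example))
```

Keeping the precision note alongside the range lets a search interface match "1560" against all four values while still showing the end-user how certain each one is.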
Enrichment: what is it?
• Example
– Mapping content values to a common vocabulary with defined relationships between them
– Enables vast quantities of unrelated content to be automatically linked to each other, rendering considerable added value
• Example
– Automatic language translation
– Poor quality, but possibly better than nothing
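The first example can be sketched as follows. The vocabulary, the concept identifiers, the "broader" relationship and the sample records are all hypothetical; the point is only that records with different literal values become linked once they map to the same concept.

```python
from collections import defaultdict

# Hypothetical common vocabulary: labels (in any language) -> concept id.
LABEL_TO_CONCEPT = {
    "london": "place:london",
    "londen": "place:london",    # Dutch label for the same concept
    "england": "place:england",
}
# Hypothetical defined relationship: concept -> broader concept.
BROADER = {"place:london": "place:england"}


def enrich(record: dict) -> dict:
    """Add matched concepts, plus all their broader concepts, to a record."""
    concepts = set()
    for value in record["coverage"]:
        concept = LABEL_TO_CONCEPT.get(value.strip().casefold())
        while concept:
            concepts.add(concept)
            concept = BROADER.get(concept)
    return {**record, "concepts": sorted(concepts)}


def link_by_concept(records):
    """Index record ids by concept so related records can find each other."""
    index = defaultdict(list)
    for rec in map(enrich, records):
        for concept in rec["concepts"]:
            index[concept].append(rec["id"])
    return dict(index)


records = [
    {"id": "a", "coverage": ["London"]},
    {"id": "b", "coverage": ["Londen"]},
    {"id": "c", "coverage": ["England"]},
]
links = link_by_concept(records)
print(links)
```

Records "a" and "b" share no literal value, yet both end up under the same concept, and all three are reachable from the broader one.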