Managing digital objects and their metadata:
challenges and responses
Douglas Campbell and Adrienne Kebbell
National Library of New Zealand Te Puna Mātauranga o Aotearoa
DC-2004 Conference, 12 October 2004
Agenda
• Our situation
• Digital Preservation Frameworks
• Digital Objects
– Complex objects
– Identifiers
– File naming
• Metadata
– Frameworks
– Descriptive metadata
– Preservation metadata
– Structural metadata
– Automatic extraction
– Modularity
• Integration
– Business process workflows
National Library of New Zealand Te Puna Mātauranga o Aotearoa
• Collect, maintain, and make accessible literature and information resources that relate to New Zealand and the Pacific
• Alexander Turnbull Library: Preserve New Zealand's documentary heritage for generations to come
• Develop and deliver services for schools to support teaching and learning
• Apply the partnership responsibilities of the Treaty of Waitangi to all activities
National Digital Heritage Archive
• National Library Act 2003 gives legal deposit of electronic
materials to the National Library
• Archive development funded by Government
• Working towards “Trusted Digital Repository” certification
Part 1 Digital Preservation Framework
Open Archival Information System (OAIS) Model
KEY:
SIP – Submission Information Package (Ingest)
AIP – Archival Information Package (Archive)
DIP – Dissemination Information Package (Access)
Applying OAIS – building our framework
[Diagram: the framework drawn as two layers, Metadata and Digital Objects. Labels recoverable from the diagram:]
• Metadata layer: Catalogues, Technical Info, Preservation Info, Rights
• Metadata actions: selection, describe, extract, manage
• Intake: harvest or digitise; acquire by legal deposit or donation; load
• Digital Store and Digital Object Workbench functions: Archive, Migrate, Manage media, Identity, Prepare, Arrange, Authenticate, Create derivatives
• Access: retrieve, metadata conversion, search, export, manage
Part 2 Digital Objects
Digital objects are complex
• Website – hundreds of files
• CD-ROM – hard-coded operation
• Diskette of accounts spreadsheets and correspondence –
dissimilar but related
• Self-contained single file, eg. MS Excel
• Dependent multiple files, eg. HTML + GIFs, or EXE + DLLs
• Self-contained multiple files, eg. Series of MS Word letters
Classifying the “conceptual object”
• Simple digital object
– A single file
– MS Word document, TIFF image
• Digital object group
– A set of independent but related files described as a group
– Disk of 100 MS Word letters
• Complex digital object
– A group of dependent files intended to be viewed as a single conceptual object, often with only one entry point
– Website, CD-ROM
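The three classes above can be sketched as a small decision rule. This is only an illustration: the `ConceptualObject` record, its `dependent` flag and the `classify` function are hypothetical names, not part of any NLNZ system.

```python
from dataclasses import dataclass
from enum import Enum


class ObjectClass(Enum):
    SIMPLE = "simple digital object"
    GROUP = "digital object group"
    COMPLEX = "complex digital object"


@dataclass
class ConceptualObject:
    # Hypothetical record; field names are illustrative only.
    files: list[str]
    dependent: bool = False  # True when the files only work together (eg. HTML + GIFs)


def classify(obj: ConceptualObject) -> ObjectClass:
    """Apply the slide's three-way classification to one conceptual object."""
    if not obj.files:
        raise ValueError("a conceptual object needs at least one file")
    if len(obj.files) == 1:
        return ObjectClass.SIMPLE
    # Several files: dependent files form a complex object,
    # independent but related files form a group.
    return ObjectClass.COMPLEX if obj.dependent else ObjectClass.GROUP
```

In practice the `dependent` flag would be a curatorial judgement made at ingest rather than something detected automatically.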
Simple Digital Object
1 Simple Object, eg. text document
• 1 Descriptive Record
• 1 Preservation Object Record (for PM Word file)
• 1 Original file [Word]
• 1 Preservation Master file [Word]
• 2 Access files [PDF + XML]
• 1 PID for 4 files
Object Group
1 Object Group, eg. 200 letters from a donor
• 1 Descriptive Record for 800 files [Word, XML, PDF]
• 1 Object Pres Data
• 200 File Data
• NN Process Data
• NN Metadata Modification Data
• 200 Original files [Word]
• 200 Preservation Master files [Word]
• 400 Access files [PDF + XML]
• 1 PID for 800 files
Complex Digital Object
1 Complex Object, eg. Web Site of 80 html files + 20 gifs
• 1 Descriptive Record for 300 files [HTML + gif]
• 1 Object Pres Data
• 100 File Data
• NN Process Data
• NN Metadata Modification Data
• 100 Original files [HTML + gif]
• 100 Preservation Master files [processed for local delivery]
• 100 Access files [HTML + gif]
• 1 PID for 300 files
Complexity of components
Identifiers
Key characteristics of identifiers to consider:
• Granularity – Question: What do we need to identify? Answer: Whatever we need to identify!
• Intelligence – Unanticipated changes may render intelligent identifiers inaccurate, though dumb identifiers place a reliance on external metadata
• Actionable – Need to separate identity from location, eg. two URLs may be two locations of the same entity
• Persistence – Depends mostly on your commitment
• Extensibility – Be generic, follow standards, stay application independent
Persistent Identifiers
Persistence means different things to different communities,
so we distinguish two kinds of identifier:
• Persistent Identifier (PID) – assigned at the “conceptual”
level of an object, persists in perpetuity
• Persistent Locator (PL) – file locator, persists only for the
life of the file
We guarantee PIDs, but PLs to the “best current format” will
become inoperative over the decades as formats become
obsolescent
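A minimal sketch of the PID / PL split, assuming one record per conceptual object that holds a permanent PID and a changing list of locators. The class and method names are hypothetical, and the `example.org` URLs are placeholders; only the handle value comes from the sample record later in this deck.

```python
from dataclasses import dataclass, field


@dataclass
class PersistentLocator:
    url: str
    operative: bool = True  # a PL lasts only for the life of its file


@dataclass
class ConceptualObjectIds:
    pid: str  # assigned at the conceptual level, never changes
    locators: list[PersistentLocator] = field(default_factory=list)

    def add_locator(self, url: str) -> None:
        """Register a locator for a newly created file (eg. after migration)."""
        self.locators.append(PersistentLocator(url))

    def retire(self, url: str) -> None:
        """Mark a locator inoperative once its format is obsolete."""
        for locator in self.locators:
            if locator.url == url:
                locator.operative = False

    def resolve(self) -> str:
        """Return the newest operative locator, ie. the best current format."""
        for locator in reversed(self.locators):
            if locator.operative:
                return locator.url
        raise LookupError(f"no operative locator for {self.pid}")
```

The point of the design is that citations use the PID, while locators can be added and retired underneath it without breaking those citations.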
File naming conventions – Plan “A”
Plan A: Make filenames unique by including role code, eg:
• DO – Digital Original
• DD – Digital Derivative
• PM – Preservation Master (best attempt to replicate in a
currently accessible format)
• AF – Access Format
• TN – Thumbnail
Filename: IID_role_instance.extension, eg. 1234_af_01.doc
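The Plan A pattern can be sketched as a pair of helpers. The assumptions here go beyond the slide: the IID is taken to be numeric, the role code is written in lower case, and the instance number is zero-padded to two digits, all inferred from the single example `1234_af_01.doc`. The function names are hypothetical.

```python
import re

# Role codes from the slide, keyed in the lower-case form used in filenames.
ROLES = {
    "do": "Digital Original",
    "dd": "Digital Derivative",
    "pm": "Preservation Master",
    "af": "Access Format",
    "tn": "Thumbnail",
}

_FILENAME = re.compile(r"(\d+)_([a-z]{2})_(\d+)\.(\w+)")


def build_filename(iid: int, role: str, instance: int, extension: str) -> str:
    """Compose IID_role_instance.extension, eg. 1234_af_01.doc."""
    role = role.lower()
    if role not in ROLES:
        raise ValueError(f"unknown role code: {role}")
    return f"{iid}_{role}_{instance:02d}.{extension}"


def parse_filename(name: str) -> dict:
    """Split a Plan A filename back into its parts."""
    match = _FILENAME.fullmatch(name)
    if match is None or match.group(2) not in ROLES:
        raise ValueError(f"not a Plan A filename: {name}")
    iid, role, instance, extension = match.groups()
    return {"iid": int(iid), "role": role, "instance": int(instance), "extension": extension}
```

Because the role is embedded in the name, any change of role or format forces a rename, which is the weakness that Plan B addresses.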
File naming conventions – Plan “B”
Plan B: “Virtualisation”
• Decouple locator and location
• Location and disk partitioning managed dynamically internally, delivered externally via persistent locator
– /1234 (to access the default format)
– /1234?role=TN&size=150
• Locator may be HTTP, SOAP, etc.
• Provides additional opportunities such as transparent “on the fly” format conversions or correcting the MIME type reported
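A rough sketch of how a Plan B locator might be resolved. The storage table, its paths and the choice of AF as the default role are all hypothetical; only the two locator forms (`/1234` and `/1234?role=TN&size=150`) come from the slide.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical internal storage table: (object id, role) -> current disk location.
# In Plan B this mapping can change freely without affecting the external locator.
STORE = {
    ("1234", "AF"): "/vol07/a9/1234_af_01.pdf",
    ("1234", "TN"): "/vol02/3c/1234_tn_01.jpg",
}
DEFAULT_ROLE = "AF"  # assumption: the default format is the access format


def resolve(locator: str) -> dict:
    """Translate an external persistent locator into an internal location."""
    parts = urlsplit(locator)
    object_id = parts.path.strip("/")
    query = parse_qs(parts.query)
    role = query.get("role", [DEFAULT_ROLE])[0].upper()
    size = query.get("size", [None])[0]
    try:
        path = STORE[(object_id, role)]
    except KeyError:
        raise LookupError(f"no {role} file for object {object_id}") from None
    return {"path": path, "role": role, "size": int(size) if size else None}
```

A delivery service sitting behind this lookup is also where "on the fly" conversion, such as resizing a thumbnail to the requested `size`, could be applied.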
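• FRBR
[Diagram: a novel modelled at the FRBR levels Work, Expression, Manifestation, Component and Item. Labels recoverable from the diagram: expressions "Manuscript" and "Published"; manifestations "Word v5", "PDF", "XML" and "Book"; components "Chap 1" and "Chap 2"; item files "XML" and "XSL" tagged with role codes (DO, PM, AF, AS); and "Preservation" and "Lending" items for the book.]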
Part 3 Metadata
Metadata Framework
Four key categories of metadata for digital objects:
• Resource discovery – finding and identifying
• Structural – presenting in context (eg. pages in a book
rather than a bunch of files, navigation, etc)
• Rights management and Access control – protection
of property rights, authentication and authorisation
• Technical and Administrative – properties of the
objects, how they were created, changes made, etc.
Metadata Framework
Metadata Standards Framework for National Library of New Zealand
[Diagram: three layers of metadata standards.]
• Generic or global access: Dublin Core, RDF, XML
• Community / sector specific application profiles, following international guidelines
– Library: MARC, DCQ, MODS, METS
– Education: DC-Ed, LOM
– Archival: EAD, ISAD(G)
– Government: NZGLS, DC-Gov, GILS, AGLS
• Local
Descriptive metadata
Digital Resource Description (DRD) Application Profile
• Lightweight alternative to METS for simple objects based on Qualified DC
• XLink extensions to differentiate links to the multiple derivative files
• Local refinements for different identifier types, eg. local id, persistent id, locator
• RDF/XML encoding syntax
• Used in our “Discover” and “Matapihi” products
Preservation metadata
NLNZ Preservation Metadata (2002)
– Object – preservation info for object, eg. ID, software needed
– File – preservation info for a file, eg. format, size
– Process – record of actions taken, eg. format migration
– Metadata modification – record of changes to above metadata
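The four entities could be modelled roughly as follows. This is a sketch under assumptions, not the NLNZ schema: every class, field and method name is hypothetical, chosen only to mirror the examples given on the slide (ID, software needed, format, size, format migration).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class FileData:
    name: str
    format: str
    size: int  # bytes


@dataclass
class ProcessData:
    action: str  # eg. "format migration"
    when: datetime


@dataclass
class MetadataModification:
    field_name: str
    old_value: str
    new_value: str
    when: datetime


@dataclass
class ObjectData:
    object_id: str
    software_needed: str
    files: list[FileData] = field(default_factory=list)
    processes: list[ProcessData] = field(default_factory=list)
    modifications: list[MetadataModification] = field(default_factory=list)

    def record_process(self, action: str) -> None:
        """Log an action taken on the object."""
        self.processes.append(ProcessData(action, datetime.now(timezone.utc)))

    def update_software(self, new_value: str) -> None:
        """Change a metadata value and keep an audit entry of the change."""
        self.modifications.append(
            MetadataModification("software_needed", self.software_needed,
                                 new_value, datetime.now(timezone.utc))
        )
        self.software_needed = new_value
```

Keeping process and modification history as separate lists matches the "NN Process Data" and "NN Metadata Modification Data" counts shown for object groups and complex objects earlier in the deck.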
Structural metadata
Metadata Encoding & Transmission Standard (METS)
METS record:
• Header
• Descriptive
• Administrative
• Content Files
• Structural Map
• Structural Links
• Behaviour
Metadata Pieces for a Single TIFF Image
Preservation
DCQ Description
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:nlnzdl="http://www.natlib.govt.nz/dl#"
         xmlns:ead="http://www.natlib.govt.nz/dl#"
         xmlns:dcq="http://purl.org/dc/terms/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="hdl:1727.11/00002170">
    <dc:title>Blue smoke = Kohu auwahi</dc:title>
    <dc:creator>Karaitiana, Rangi Ruru Wananga, 1909-1970</dc:creator>
    <dc:subject>
      <dcq:LCSH> <rdf:value>Popular music - 1941-1950</rdf:value> </dcq:LCSH>
    </dc:subject>
    <dc:subject>Maori music - Waiata</dc:subject>
    <nlnzdl:category>
      <nlnzdl:NZCT>
        <rdf:value>A-M-02-03</rdf:value>
        <rdfs:label>Music Score Covers and Record Sleeves</rdfs:label>
      </nlnzdl:NZCT>
    </nlnzdl:category>
    <dc:description>For voices and piano.</dc:description>
    <dc:description>Arranged with tonic sol-fa, chord symbols and two part vocal harmony"--Cover.</dc:description>
    <dc:description>Words in English and Maori.</dc:description>
    <dc:description>Caption title.</dc:description>
    <dc:publisher>C. Begg, [Dunedin] N.Z.</dc:publisher>
    <dc:contributor>Winchester, George, ca. 1900- , arranger</dc:contributor>
    <dcq:issued>c1947.</dcq:issued>
    <dc:type>
      <dcq:DCMIType> <rdf:value>Text</rdf:value> </dcq:DCMIType>
    </dc:type>
    <dc:type>
      <nlnzdl:LCSHFormOfComposition>
        <rdf:value>pp</rdf:value>
        <rdfs:label>Popular music</rdfs:label>
      </nlnzdl:LCSHFormOfComposition>
    </dc:type>
    <dc:format>1 score cover ([1]) p. ; 31 cm.</dc:format>
    <dcq:extent>17,640KB</dcq:extent>
    <dcq:extent>81KB</dcq:extent>
    <dc:format> <dcq:IMT> <rdf:value>image/tiff</rdf:value> </dcq:IMT> </dc:format>
    <dc:format> <dcq:IMT> <rdf:value>image/jpeg</rdf:value> </dcq:IMT> </dc:format>
    <dc:format> <dcq:IMT> <rdf:value>image/jpeg</rdf:value> </dcq:IMT> </dc:format>
    <dc:format> <dcq:IMT> <rdf:value>image/jpeg</rdf:value> </dcq:IMT> </dc:format>
    <dc:identifier rdf:resource="00182451_00002170_ds.tif"/>
    <dc:identifier rdf:resource="00182451_00002170_df.jpg"/>
    <dc:identifier rdf:resource="00182451_00002170_pv.jpg"/>
    <dc:identifier rdf:resource="00182451_00002170_tn.jpg"/>
    <nlnzdl:pid rdf:resource="hdl:1727.11/00002170"/>
    <nlnzdl:object rdf:resource="http://hdl.handle.net/1727.11/00002170"/>
    <dc:language> <dcq:ISO639-2> <rdf:value>eng</rdf:value> </dcq:ISO639-2> </dc:language>
    <dc:language> <dcq:ISO639-2> <rdf:value>eng</rdf:value> </dcq:ISO639-2> </dc:language>
    <dc:language> <dcq:ISO639-2> <rdf:value>mao</rdf:value> </dcq:ISO639-2> </dc:language>
    <dcq:hasFormat>Also available as an electronic resource.</dcq:hasFormat>
    <dcq:spatial>New Zealand</dcq:spatial>
    <dcq:temporal>1947</dcq:temporal>
    <dc:rights>Permission of the National Library of New Zealand, Te Puna Matauranga o Aotearoa must be obtained before any re-use of this item.</dc:rights>
    <ead:daoloc ead:behavior="image/tiff" ead:href="http://digital.natlib.govt.nz/source/20020605/00182451_00002170_ds.tif" ead:role="source"/>
    <ead:daoloc ead:behavior="image/jpeg" ead:href="http://digital.natlib.govt.nz/20020604/00182451_00002170_df.jpg" ead:role="reference" ead:title="Digital image of the cover of the score for Blue smoke. (81KB)"/>
    <ead:daoloc ead:behavior="image/jpeg" ead:href="http://digital.natlib.govt.nz/20020604/00182451_00002170_pv.jpg" ead:role="display"/>
    <ead:daoloc ead:behavior="image/jpeg" ead:href="http://digital.natlib.govt.nz/20020604/00182451_00002170_tn.jpg" ead:role="thumbnail"/>
  </rdf:Description>
</rdf:RDF>
METS File Group and Structural Map
<fileSec>
  <fileGrp ID="FG2170_pm" USE="Preservation Master">
    <file ID="F2170_pm" MIMETYPE="image/tiff" SIZE="17379652" CREATED="1997-04-13T14:51:14">
      <FLocat ID="FL2170_pm" LOCTYPE="URL" xlink:href="objects/preservation/1/2170_pm.tif" xlink:actuate="onRequest"/>
    </file>
  </fileGrp>
  <fileGrp ID="FG2170_ds" USE="Dissemination Source">
    <file ID="F2170_ds" MIMETYPE="image/tiff" SIZE="17379652" CREATED="2002-09-11T09:12:20">
      <FLocat ID="FL2170_ds" LOCTYPE="URL" xlink:href="objects/source/1/2170_ds.tif" xlink:actuate="onRequest"/>
    </file>
  </fileGrp>
  <fileGrp ID="FG2170_df" USE="Dissemination Format">
    <file ID="F2170_df" MIMETYPE="image/jpeg" SIZE="123394" CREATED="2002-10-31T15:32:26">
      <FLocat ID="FL2170_df" LOCTYPE="URL" xlink:href="objects/access/1/2170_df.jpg" xlink:actuate="onRequest"/>
    </file>
  </fileGrp>
  <fileGrp ID="FG2170_pv" USE="Preview Image">
    <file ID="F2170_pv" MIMETYPE="image/jpeg" SIZE="99725" CREATED="2003-04-08T10:56:22">
      <FLocat ID="FL2170_pv" LOCTYPE="URL" xlink:href="objects/access/1/2170_pv.jpg" xlink:actuate="onRequest"/>
    </file>
  </fileGrp>
  <fileGrp ID="FG2170_tn" USE="Thumbnail Image">
    <file ID="F2170_tn" MIMETYPE="image/jpeg" SIZE="23162" CREATED="2003-04-07T11:33:13">
      <FLocat ID="FL2170_tn" LOCTYPE="URL" xlink:href="objects/access/1/2170_tn.jpg" xlink:actuate="onRequest"/>
    </file>
  </fileGrp>
</fileSec>
<structMap ID="SM2170" TYPE="LOGICAL">
  <div ID="DIV2170" LABEL="Blue smoke = Kohu auwahi" TYPE="Image">
    <div ID="DIV2170_pm" LABEL="Preservation master" TYPE="tiff image">
      <fptr FILEID="F2170_pm"/>
    </div>
    <div ID="DIV2170_ds" LABEL="Dissemination Source" TYPE="tiff image">
      <fptr FILEID="F2170_ds"/>
    </div>
    <div ID="DIV2170_df" LABEL="Dissemination Format" TYPE="jpeg image">
      <fptr FILEID="F2170_df"/>
    </div>
    <div ID="DIV2170_pv" LABEL="Preview Image" TYPE="jpeg image">
      <fptr FILEID="F2170_pv"/>
    </div>
    <div ID="DIV2170_tn" LABEL="Thumbnail Image" TYPE="jpeg image">
      <fptr FILEID="F2170_tn"/>
    </div>
  </div>
</structMap>
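To show how the structural map ties labels to files, here is a small sketch that follows each `fptr` back to its `FLocat`. It uses a cut-down copy of the fragment above; the `<mets>` wrapper and the `xmlns:xlink` declaration are added here so the fragment parses, and the function name is hypothetical. A complete METS document places its elements in the METS namespace, so the element lookups would need namespace-qualified names there.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Cut-down copy of the slide's fragment; the <mets> wrapper is an assumption.
SAMPLE = """<mets xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileSec>
    <fileGrp ID="FG2170_pm" USE="Preservation Master">
      <file ID="F2170_pm" MIMETYPE="image/tiff" SIZE="17379652">
        <FLocat LOCTYPE="URL" xlink:href="objects/preservation/1/2170_pm.tif"/>
      </file>
    </fileGrp>
    <fileGrp ID="FG2170_tn" USE="Thumbnail Image">
      <file ID="F2170_tn" MIMETYPE="image/jpeg" SIZE="23162">
        <FLocat LOCTYPE="URL" xlink:href="objects/access/1/2170_tn.jpg"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap TYPE="LOGICAL">
    <div LABEL="Blue smoke = Kohu auwahi" TYPE="Image">
      <div LABEL="Preservation master"><fptr FILEID="F2170_pm"/></div>
      <div LABEL="Thumbnail Image"><fptr FILEID="F2170_tn"/></div>
    </div>
  </structMap>
</mets>"""


def files_by_label(xml_text: str) -> dict:
    """Map each structMap division label to the location of the file it points at."""
    root = ET.fromstring(xml_text)
    # First index every file ID to its FLocat href.
    hrefs = {}
    for file_el in root.iter("file"):
        flocat = file_el.find("FLocat")
        hrefs[file_el.get("ID")] = flocat.get(f"{{{XLINK}}}href")
    # Then follow each division's file pointer into that index.
    result = {}
    for div in root.iter("div"):
        fptr = div.find("fptr")  # direct children only, so the outer div is skipped
        if fptr is not None:
            result[div.get("LABEL")] = hrefs[fptr.get("FILEID")]
    return result
```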
NLNZ Metadata Extraction Tool
Automatic metadata extraction is essential
• Extracts embedded metadata from 15 common file
formats (eg. TIFF, JPEG, MS Word, PDF) and file details
for other formats
• Built in Java, outputs in XML (customisable using XSLT)
• Graphical interface or command line batch
• 10,000 JPEG files per hour
• Finalist in UK Pilgrim Trust’s 2004 Preservation Awards
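As a rough illustration of the "file details for other formats, output as XML" idea, the sketch below gathers basic file system details using only the Python standard library. It is not the NLNZ Metadata Extraction Tool (which is written in Java) and does not read embedded metadata; the element names and function names are made up for the example.

```python
import mimetypes
import os
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def describe_file(path: str) -> ET.Element:
    """Collect basic file details (no embedded metadata) as an XML element."""
    stat = os.stat(path)
    mime, _ = mimetypes.guess_type(path)  # guessed from the extension only
    element = ET.Element("file")
    ET.SubElement(element, "name").text = os.path.basename(path)
    ET.SubElement(element, "size").text = str(stat.st_size)
    ET.SubElement(element, "mimetype").text = mime or "application/octet-stream"
    ET.SubElement(element, "modified").text = datetime.fromtimestamp(
        stat.st_mtime, tz=timezone.utc
    ).isoformat()
    return element


def describe_batch(paths: list[str]) -> str:
    """Describe several files in one XML document, as a batch run might."""
    root = ET.Element("files")
    for path in paths:
        root.append(describe_file(path))
    return ET.tostring(root, encoding="unicode")
```

The XML output could then be reshaped with XSLT into whatever preservation metadata schema is in use, which is the customisation route the slide mentions.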
Metadata Conversion Engine
Metadata modularity
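[Diagram: descriptive records and additional data pass through a crosswalk to produce several output formats for different services.]
• Sources: Descriptive Records (MARC, ISAD(G)), Additional Data
• Conversion: CROSSWALK
• Output formats: DC XML, METS, DC RDF/XML, DRD RDF AP, NZGLS
• Destinations: Picture Australia, Matapihi, Govt Portal, Digital Archive, Discover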
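A toy crosswalk from MARC to Dublin Core, to make the "one source record, many outputs" idea concrete. The tag-to-element table follows the commonly used MARC to Dublin Core mapping but is heavily simplified (subfields are ignored and each field is a flat string); the `<record>` wrapper and the function names are hypothetical, and this is not the NLNZ conversion engine.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Simplified crosswalk table: MARC tag -> Dublin Core element.
CROSSWALK = {
    "100": "creator",
    "245": "title",
    "260": "publisher",
    "650": "subject",
    "700": "contributor",
}


def marc_to_dc(fields: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Convert (MARC tag, value) pairs to (DC element, value) pairs, dropping unmapped tags."""
    return [(CROSSWALK[tag], value) for tag, value in fields if tag in CROSSWALK]


def dc_to_xml(pairs: list[tuple[str, str]]) -> str:
    """Serialise DC pairs as simple DC XML under a hypothetical <record> wrapper."""
    root = ET.Element("record")
    for name, value in pairs:
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")
```

Because the intermediate form is just a list of DC pairs, further serialisers (eg. RDF/XML or a METS descriptive section) could be added alongside `dc_to_xml` without touching the crosswalk itself.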
Part 4 Business Processes
Integration into the business
• We’re moving from an era of “pilots” to implementation
• Integrating into existing staff workflows rather than
establishing a separate unit
• Documenting the business process workflows
Part 5 Tying it all together
The Digital Archive Environment
[Diagram: the framework diagram from Part 1 ("Applying OAIS – building our framework") shown again, with its Metadata and Digital Objects layers, the Digital Store, the Digital Object Workbench, and Access.]
Digital Preservation Report Card 2004
Digital preservation has come a long way in 5 years:
• From “overwhelmingly daunting” to “potentially achievable”
• A lot of thought, pilots, developments around the world
Improvements needed:
• Tools are still at the emerging stage
• Workflows/social side is sometimes forgotten
• Identifier scheme for PIDs - major outstanding issue
Questions…?