managing digital objects and their metadata: challenges

17
Managing digital objects and their metadata: challenges and responses Adrienne Kebbell and Douglas Campbell National Library of New Zealand, Wellington, New Zealand Tel. +64 4 474-3000 Fax +64 4 474-3140 Mail: [email protected], [email protected] Abstract: This paper looks at the challenges facing institutions managing ever- increasing numbers of digital objects over the long term. It considers the current environments and challenges for managing preservation digital objects, the metadata used to manage them, and the business processes that support these activities, and documents the National Library of New Zealand's responses to these challenges. These responses include policies, schemas, systems, and procedures in both implementation and development. Keywords: digital object; standards; metadata; preservation; access; persistent identifier; persistent locator; file format; file role; application profile; crosswalk; OAIS; FRBR; Dublin Core; METS; EAD; DRD; RDF; URI; DOI; Handle; PURL, National Library of New Zealand; Te Puna M_tauranga o Aotearoa. 1 Introduction Institutions around the world are accumulating more and more digital objects. The task of managing these objects throughout their life cycle, especially for institutions tasked to preserve them in perpetuity, becomes more complex the more deeply it is investigated. The National Library of New Zealand Te Puna M_tauranga o Aotearoa (NLNZ) has a number of initiatives underway to build its digital library infrastructure. Some of these were piloted in 2002 when NLNZ released its Discover – Te Kohinga Taonga (1). Discover presents over 2,000 digitised images, audio and video clips in the context of the nation's school arts curriculum and was built using Dublin Core metadata in RDF/XML format. NLNZ is now turning its attention to the challenges inherent in dealing with 'complex digital objects'. This includes both digitised physical resources and those 'born digital', such as multi-file e- text publications, digital archival manuscripts or collections of papers. The challenges fall broadly into three areas: 1. Managing, storing, delivering and preserving the digital objects 2. Generating, collecting and managing the metadata used to manage the objects 3. Managing the business processes around both of these. In this paper we examine the challenges in the context of current global thinking, and look at the strategies NLNZ is pursuing to deal with them. Given that this is a fast-developing area, the solutions presented are interim steps to an overall infrastructure for the National Library's digital library. 2 An NLNZ Framework for Managing Digital Objects There are many facets to the management of digital objects. NLNZ is establishing a framework for all of the processes needed to select, ingest, describe, manage, disseminate and preserve different kinds of digitised and born digital resources. The Open Archival Information System [OAIS] Reference Model (2) provides a useful framework, but the challenge is to translate this conceptual model into practical processes and supporting systems. At present a number of independent systems support the key OAIS processes. Some of these systems

Upload: others

Post on 30-May-2022

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Managing digital objects and their metadata: challenges

Managing digital objects and their metadata:challenges and responses

Adrienne Kebbell and Douglas CampbellNational Library of New Zealand, Wellington, New Zealand

Tel. +64 4 474-3000Fax +64 4 474-3140

Mail: [email protected], [email protected]

Abstract: This paper looks at thechallenges facing institutions managing ever-increasing numbers of digital objects over thelong term. It considers the current environmentsand challenges for managing preservationdigital objects, the metadata used to managethem, and the business processes that supportthese activities, and documents the NationalLibrary of New Zealand's responses to thesechallenges. These responses include policies,schemas, systems, and procedures in bothimplementation and development.

Keywords: digital object; standards;metadata; preservation; access; persistentidentifier; persistent locator; file format; filerole; application profile; crosswalk; OAIS;FRBR; Dublin Core; METS; EAD; DRD; RDF;URI; DOI; Handle; PURL, National Library ofNew Zealand; Te Puna M_tauranga o Aotearoa.

1 IntroductionInstitutions around the world areaccumulating more and more digitalobjects. The task of managing theseobjects throughout their life cycle,especially for institutions tasked topreserve them in perpetuity, becomesmore complex the more deeply it isinvestigated.

The National Library of New ZealandTe Puna M_tauranga o Aotearoa(NLNZ) has a number of initiativesunderway to build its digital libraryinfrastructure. Some of these werepiloted in 2002 when NLNZ released itsDiscover – Te Kohinga Taonga (1).Discover presents over 2,000 digitisedimages, audio and video clips in thecontext of the nation's school artscurriculum and was built using DublinCore metadata in RDF/XML format.

NLNZ is now turning its attention to thechallenges inherent in dealing with'complex digital objects'. This includesboth digitised physical resources andthose 'born digital', such as multi-file e-text publications, digital archivalmanuscripts or collections of papers.The challenges fall broadly into threeareas:1. Managing, storing, delivering andpreserving the digital objects2. Generating, collecting andmanaging the metadata used to managethe objects3. Managing the business processesaround both of these.

In this paper we examine the challengesin the context of current global thinking,and look at the strategies NLNZ ispursuing to deal with them. Given thatthis is a fast-developing area, thesolutions presented are interim steps toan overall infrastructure for the NationalLibrary's digital library.

2 An NLNZ Framework forManaging Digital Objects

There are many facets to themanagement of digital objects. NLNZ isestablishing a framework for all of theprocesses needed to select, ingest,describe, manage, disseminate andpreserve different kinds of digitised andborn digital resources. The OpenArchival Information System [OAIS]Reference Model (2) provides a usefulframework, but the challenge is totranslate this conceptual model intopractical processes and supportingsystems. At present a number ofindependent systems support the keyOAIS processes. Some of these systems

Page 2: Managing digital objects and their metadata: challenges

leverage NLNZ's investment in productsfrom Endeavor Information Systems –the "Voyager" integrated librarymanagement system and the"ENCompass" digital management

system – while others have beendeveloped internally, and still others areidentified as a gap.

Figure 1 – The OAIS Reference Model for digital preservation (3)

Preservation adds another level ofcomplexity to the set of tasks thatorganisations face as they attempt toassimilate digital content into businessworkflows. NLNZ has a legislativemandate 'to collect, preserve and makeavailable recorded knowledge … relatedto New Zealand' and increasingly thismaterial is digital.

NLNZ must ensure the authenticity andintegrity of digital objects in perpetuity.The attributes of repositories with thistype of responsibility and the potentialfor certification in this area have beendescribed by RLG/OCLC in theirpapers on the ‘Trusted DigitalRepository’ (4).

Other organisations will have a lesserimperative to preserve digital resourcesfor the very long term, but will have

some understanding of the technical,legal, organisational and socialchallenges involved, and collaborationwill be essential.

2.1 International protocols, standardsand best practices

In developing internal digital protocolsNLNZ builds on international standardsand experience. In this environmentsolutions seem always to betantalisingly just around the corner,though surely never so close as they arenow. Some standards and protocols thatNLNZ tentatively chose two to threeyears ago have become much moremainstream. These include the CNRIHandle system, the Metadata EncodingTransmission Standard [METS], and thePDF file format that althoughproprietary, is now such a commondissemination format that a forward

Page 3: Managing digital objects and their metadata: challenges

migration path is assured.

2.2 Complexity of multi-file objectsNLNZ's experience in the Discoverpilot was limited to single image files.The objects NLNZ is now consideringoften contain multiple component files,for example web sites, CD-ROMs,diskettes of correspondence in word-processing fi les or accountsspreadsheets. Some of these files areself-contained (e.g. a singlespreadsheet), while others aredependent on additional files for theiroperation (e.g. HTML files requiringGIF image files to build up a web page,or an executable file requiringsupporting DLL library files in order tooperate).

This inherent complexity can becompounded by actions taken as part ofpreserving and disseminating digitalobjects. Ensuring long-term authenticityand accessibility involves migrating andtransforming files over time into moremanageable formats, and convertingthem into suitable formats fordissemination.

NLNZ groups objects by theircomplexity so that processes performedon them are appropriate. Thesecategories are based on the original‘conceptual object’ as determined anddescribed by the curators, rather thanthe particular structure of the file(s). Wedefine Simple, Group and Complexobjects as follows:• Simple digital objects – A simple

digital object consists of a single filethat is intended to be viewed as oneconceptual object, e.g. a Worddocument or TIFF image.

• Digital object group – A digitalobject group consists of a set ofindependent but related files that

have been collectively described,e.g. a floppy disk containing 100letters. Each file is accessibleindependently (as a Simple object),but its relationship to other objectsin the group provides valuablecontext.

• Complex digital objects – Acomplex digital object consists of agroup of dependent files intended tobe viewed as a single conceptualobject, e.g. a web site or CD-ROM.Often there is only one entry point.

In an attempt to understand thetechnical and business issues associatedwith complex objects, NLNZ created atestbed working with a group of objectscalled the 'Survey Objects'. These wereselected as being representative of thespectrum of born-digital materials thatN L N Z w i l l a c q u i r e(published/unpublished, online/offline,simple/complex; and covering a rangeof MIME types, e.g. .htm, .pdf, .mdb,.doc, .dot, .exe, .xlm. The challengesthat NLNZ has experienced inmanaging and delivering complexdigital objects throughout the digitalcontinuum has informed the practicalaspects of our digital framework.

Figure 2 illustrates the growingcomplexity of the metadatacomponents, even within simpleobjects. These examples show the useof a single Persistent Identifier [PID] foreach 'conceptual object ' orindependently accessible component ofan object group. NLNZ will review thislevel of identity if it proves impracticalfor particular kinds of objects not yettested

.

Page 4: Managing digital objects and their metadata: challenges

Figure 2 – Simple, Complex and Group objects have multiple components

3 SelectionIt is useful to examine the major phasesof the digital continuum in more detail.

The selection process for publisheddigital resources is supported by aspecialised Voyager database. This isused to record decisions leading to theselection or rejection of a resource forthe Digital Archive, and any rights oruse negotiations that may have takenplace with the publisher. Some of thisdata is imported into the IntegratedLibrary System when the item isacquired.

Selection of unpublished digitalresources continues to be recorded inour archival tool, TAPUHI.

4 IngestNLNZ is evaluating a number of toolsto support ingest of born digital anddigitised, and e-resources of all sorts(e.g. learning objects, e-journals, full-text databases) to the digital archive.These include D-Space from the MIT ,the PANDORA Archiving System(PANDAS) from the National Libraryof Australia, and the Fedora Open-Source Digital Repository ManagementSystem from the University of Virginia

and Cornell University. We are alsoexploring the possibility of building ourown digital repository to encompass allof the preservation processes fromselection to long-term preservation.

5 IdentifiersPersistent identifiers [PIDs] are criticalto the successful long-term managementof digital objects. Because of their verypersistence, PIDs require carefulconsideration of both form and contentfrom the outset.NLNZ curators assign identifiers to newarchival resources using a purpose builtInternal Identifier Database [IID DB].This is a simple MS Access Clientrunning on top of a MS SQL Serverdatabase. The system assigns IIDs(discussed below) that become thelocally unique component of a PID, andincorporates this number into thestructured filenames for new objects.The IID DB Client is focused onintegration into workflows: it providesdata to configure uploads of new objectsinto the Digital Archive, to generateappropriate derivatives, and to supportthe replacement of digital objects. TheClient provides an integrated view forcurators of the object together with theassociated administrative data.

Page 5: Managing digital objects and their metadata: challenges

5.1 Identifier characteristicsIn our search for a preferred identifierframework NLNZ investigated the moreimportant characteristics of identifiers(5)(6).

• Granular i ty : The most basicidentity question to resolve is:"What do we need to identify?" Itappears the best answer is"Whatever we need to identify".Determining what we need toidentify is discussed below [seeApplying Identifiers].

Our identifiers will be madeglobally unique by using URIs, asrecommended by the W3C's recentdraft Architecture of the World WideWeb (7), so they can "stand alone",e.g. "hdl:1727.11/1854" instead of"Alexander Turnbull Library's localreference number EP-1957/643".

• I n t e l l i g e n c e : The danger ofintelligent identifiers lies in beingunable to anticipate future changesthat may render them inaccurate.NLNZ considered dumb identifiersto be safer in the long-term, thoughthis places more reliance on externalintelligence, for example inmetadata.

• Actionable: The danger ofactionable identifiers is the easewith which location and identity canbe confused. An entity may exist inmultiple locations so using alocation as an identifier (e.g. a URL)may mean identifying as multipleresources something that should beconsidered a single entity. Whetherour identifiers should be actionableor not depends on our requirementsfor the identifiers, but NLNZ makesa clear distinction between the two.

• Persistence: We recognise thepersistence of our identifiersdepends on our requirements for itand our level of commitment to its

persistence, as discussed below [seePersistent Identifiers].

• Extensibility: We intend to followas generic a scheme as possible, tofollow international standards, andto be application independent.

5.2 Internal identifiersNLNZ sees the identifier question ashaving two domains, internal andexternal. Most external identifiersconsist of: a type prefix, an issuingauthority and a local identifier, so theinternal identifier is seen as the firstpriority to solve.

NLNZ developed an "InternalIdentifier" (IID) scheme, which consistsof a simple running number. Thecharacteristics it exhibits are:• unique – within NLNZ• dumb• not actionable – no need to be• persistent – numbers are not reused• extensible – simple numbers can be

applied to anything.

5.3 Applying IdentifiersThe question of which resources get anIID is more complex and includes thegranularity question.

To help determine what we were tryingto identify and why, NLNZ looked atIFLA's "FRBR" model (8), which isclosely related to both the <indecs>'framework (9) and the DOI framework(10). We found FRBR's to be useful forthe "big picture" (how large numbers ofdisparate resources relate to each other)but less developed around componentsbelow the manifestation level. However,NLNZ work could map successfullyinto FRBR, which also provided aworkable model for relating digital andnon-digital resources to each other [seefigure 3].

Page 6: Managing digital objects and their metadata: challenges

Figure 3 – Digital files occupy the lower levels of FRBR but FRBR provides arelationship with non-digital materials. The item file labels shown use NLNZ RoleCodes.To complement this, NLNZ developed aclassification for file roles, as discussedbelow [see File Naming Conventions].

5.4 Persistent identifiers and locatorsHaving examined the options for PIDs,NLNZ discarded URNs (notactionable), URLs (persistence is relianton the level of commitment), and D.I.Y.(do-it-yourself; doesn ' t fo l lowstandards). This left three maincontenders: DOIs (expensive andclosed/proprietary), Handles (opensystem), and PURLs (whose futureappeared uncertain). We chose to pilotthe Handle server, with each Handleconsisting of NLNZ's Handle prefix andour IID.

As a result of assigning Handles to the2,000+ objects in Discover, NLNZdiscovered the following limitations ofHandles:

• Native support in Web browsers islacking.

• Resolution to multiple locations issupported, but it is not possible tospecify which is preferred in anHTTP request.

• There are questions over thescalability of the database.

• A development path is not assured.NLNZ is now re-evaluating DOIs andPURLs, and is looking at splitting itspersistent identifiers into:

• Persistent identifiers [PID] – thatpersist in perpetuity and areassigned at the "conceptual" level ofan object

• Persistent locators [PL] – filelocators that persist, but only for thelength of the life of the file

This split acknowledges that"persistent" means different things tothe Web community (i.e. no "404 – not

Page 7: Managing digital objects and their metadata: challenges

found" errors) and the archivecommunity (i.e. in perpetuity). NLNZwould guarantee the PL links to eachderivative or component file arepersistent while those files are the "bestcurrent format", but over the decades, asthey become obsolescent, the linkswould become inoperative. It is the PIDlinks (at the conceptual object level)that are the reliable, permanent accesspoints. At all times the originalbitstream would still be available, andtransient formats provided so that thecurrent-day audience can view themeasily without requiring specialresources.

6 Storage and Preservation of FilesNLNZ already has some 750,000 filesin its digital store, and it was inanticipation of very large numbers ofdigital resources that it decided to storethem outside of a database, in astructured Unix directory.

Each filename in the digital store mustbe unique as some batch processes willgather files from separate directories.Unique filenames also make it easier forusers downloading files (e.g. there maybe a lot of "letter.doc" files). NLNZdecided to include the identifier in thefilename, ensuring that all files have anidentifier that travels with them. Thisalso alleviates human error and lostfiles.

The exception to this approach is multi-file or complex objects, where use isdependant on consistent filenames andfile locations (e.g. HTML pagespointing to HTML/GIFs or executablespointing to DLLs). These must bestored with the original directorystructure and filenames intact.

Issues that need to be addressed at thedirectory level include:• predictability of the location/path

for a given file, e.g. able to predictthe path when only the file'sidentifier is known

• storing multiple derivative formatsfrom an object either together orseparately grouped by format

• balancing the maximum number offiles in each directory againstminimising the empty space in each,and so considering the differingsizes of different file types (e.g. athumbnail image versus a full lengthmovie)

• sett ing directory and fi lepermissions

• catering for future growth, includingconversion of formats that maychange the file size yet need toremain in the same location/path forpersistence of access.

Above the directories, NLNZ currentlyplaces files into disk partitions that areassigned arbitrarily but will not changeover time, and that have headroom forfuture file conversions.

6.1 File Naming ConventionsIn order to make filenames uniquewithin NLNZ, names for simple andgroup objects follow the format:"IID_Role_Instance.Ext":IID followedby a code representing its role, anumber for which instance of that role itis, and the format extension, e.g."1234_ah_01.jpg". NLNZ considereddropping the format extension as it mayhave a limited life. In the end, theextension was retained to assist Webbrowsers to display the file and users tosave files locally.

NLNZ developed the "NLNZ RoleDefinitions" to be able to refer to theclasses of the derivatives it generated[see figure 4]. The roles fall into threecategories: original, preservation, andaccess.

Page 8: Managing digital objects and their metadata: challenges

Figure 4 – NLNZ Role Definitions for derivatives and surrogates

6.2 VirtualisationThe above solution was not ideal inpractice as requirements for a persistentlocation clashed with the need over timeto re-locate or re-arrange storage space.To overcome this we are adding a layerof virtualisation.

The virtualisation layer allows filemanagement – partitioning and location– to be managed dynamically internally,while files are presented externally as ifin a persistent location. A proxy-likeprogram is being developed which isoptimised to resolve file requests bylooking up the file's current location anddelivering its content. The approach tophysical location and file namingremains as discussed above, but thevirtualisation layer presents thisexternally in a simplified way. Thisnegates the need for a partition numberin the location path and moves the rolecodes into optional parameters, therebymaking it more extensible, e.g. "/1234"for the default version, or"/1234?role=TN&size=150" for the 150pixel thumbnail. These parameters arestill being worked on.

This l ayer o ffers additionalopportunities such as transparent "on-the-fly" conversions and adjusting the

MIME type reported, which is useful,for instance, for harvested web pageswith file extensions that don't match thereported MIME type.

6.3 File formats and obsolescenceEven lay people are likely to haveencountered obsolescence of fileformats: some files only a few years oldare unreadable already. In addition,media obsolescence can make itdifficult to obtain the files in the firstplace, e.g. 5_ inch floppy disks receivedrequire a disk drive now considered"non-standard" to process them andmany PCs no longer even have a 3 inchfloppy disk drive!

There are three major approaches totackling obsolescence (11).• The Museum approach is to collect

the hardware and software needed toensure ongoing accessibility to theoriginal file formats. The issuesinclude expertise and offsiteaccessibility.

• With the Migration option, files areconverted into current formats. Thisconversion process becomesongoing as technologies continue tochange. The issues include lossyconversions, deciding what must bepreserved (intellectual content,

Page 9: Managing digital objects and their metadata: challenges

presentation, the user experience,etc), and balancing the access-driven desire for current formats vs.the preservation requirement tochange formats as seldom aspossible.

• Emulation requires software to bedeveloped that can simulate theoriginal experience using theoriginal file format but with currenttechnologies. The issues includedevelopment expertise and lossyemulations.

Of these, NLNZ currently prefersmigration because it requires lessdevelopment resources. However werecognise that migration has itslimitations and may not work well withsome materials, so emulation is alsobeing considered. Our PreservationMetadata Schema is seen as integral tocapturing records of all migrationactivities.

6.4 Preferred file formatsNLNZ has considered specifying fileformats for both archival anddissemination purposes. The practicaladvantages of limiting the range of fileformats that have to be managed overtime within the digital archive areobvious. But the tensions between thebest format for faithfully preserving theoriginal and that for providing easyongoing access need to be considered.In some instances it will be possible tospecify preferred formats for deposit,but in many instances it will not. Fileswill be offered to NLNZ in all kinds offormats, often already redundant or

impossible to read. These files will needto be transformed or migrated to currentformats. Proprietary formats may haveto be archived as preservation masters,when other options could alter thecontent or intended form of the object,but dissemination formats should benon-proprietary to ensure easy andenduring access.

NLNZ's Survey Objects work tested theviability of representing various textdocuments for web delivery asXML/HTML and PDF. The accurateconversion of many of the Microsoftapplications to XML is currently aresource-heavy task, and NLNZ truststools for these processes will continueto develop. Other dissemination fileformats NLNZ are currently workingwith include JPEG for images, WAVEand MP3 for sound, MOV and MPEGfor video.

7 Metadata FrameworkA range of metadata is essential to thesuccessful management of digitalresources. NLNZ published the firstphase of its Metadata StandardsFramework (12) in October 2000,focusing on resource discovery. Theframework essentially states that themost appropriate international metadatastandard will be used in any givencircumstances, but it must be able to bemapped to Dublin Core [ISO 15836] atthe common resource discovery layer[see figure 5]. The second phasecovering a schema and data model forpreservation metadata was published inNovember 2002.

Page 10: Managing digital objects and their metadata: challenges

Figure 5 – NLNZ's Metadata Framework with community-specific metadata forvertical access and Dublin Core for horizontal access

There is a plethora of metadata schemasavailable today, each serving differentpurposes, so determining therequirements was NLNZ's first step inselecting appropriate schemas. Theframework established by NLNZfollows the taxonomy in Kenney andRieger's Moving Theory into Practice(13). This taxonomy describes four keymetadata categories for digital objects:• Resource discovery – How do we

ensure that the materials we havecollected can be found and retrievedby our clients? Dublin Core is anearly attempt to provide a 'linguafranca' for resource discovery in anonline environment and is stillevolving.

• Structural – How do we presentour objects in context (e.g. asordered pages of a digitised book)and not just as a bunch of files andhow do we navigate within thiscontext (e.g. page 1 to page 2)?

• Rights management and Accesscontrol – How do we ensureprotection of intellectual propertyrights, authentication of clients andauthorisation of clients?

• Technical and Administrative –What are the essential attributes ofdigital objects and the processes andtechnologies that created them, andwhich are required for long-termstorage, management, preservationand access?

7.1 Building modular metadata

NLNZ's M e t a d a t a S t a n d a rdsFramework identified Dublin Core asthe essential core data for resourcediscovery. Qualified Dublin Core isused for 'simple digital objects' with theaddition of local administrativeelements, built in a modular way withinRDF/XML. This Digital ResourceDescription (DRD) schema (14) isdiscussed below.

Descriptive metadata for NLNZ'sarchival digital collections will be EAD.For complex digital objects, both DCand EAD will be used within the METSframework, with METS providing thestructural and behavioral components ofthe metadata.

This modular approach to buildingmetadata ensures rigid compliance withinternational standard schema, which isan important factor in achievinginteroperabi l i ty with differentconstituent communities nationally andinternationally.

7.2 Descriptive MetadataNLNZ sources metadata from a numberof systems, but primarily from itsMARC-based ILS, and ISAD(G)-basedarchival system, TAPUHI.

Each of these is optimised for publishedand unpublished collection managementtasks respectively. This means thatdescriptive metadata is duplicated in themainstream and digital environments.

Page 11: Managing digital objects and their metadata: challenges

Additional metadata is collected as theoutput of several digital libraryprocesses and combined as part ofscripted data conversions processes. Anumber of scripts convert theseISAD(G) and MARC records to XMLfor loading to ENCompass using eitherNLNZ's qualified Dublin Core [DC] orEncoded Archival Description [EAD]metadata schemes.

7.3 Metadata Conversion EngineThe expense in creating metadatamanually means institutions mustleverage their existing metadata whengenerating metadata in new formats. Alot of new metadata can be generated by"crosswalking" (converting) fromexisting sources. It can often be mappedsuccessfully but is sometimes

fundamentally different, resulting in a"square peg in a round hole". Thesemis-matches may have to be acceptedwithin budgetary constraints, meaningcrosswalking is the only option. Theeffectiveness of the conversion has to beevaluated against the cost of additionalmanual work.

NLNZ realised the need to delivermetadata in different formats to variousaudiences. This metadata also had to bederived from a number of sources(various formats of human-generateddescriptive metadata and auto-generatedtechnical metadata) [see figure 6].

Figure 6 – Multiple sources of metadata feed the conversion

Experience showed that the mostefficient way to achieve this was togenericise the conversion processes [see

figure 7]. When we attempted to deliverDC metadata for Discover using aprocess developed previously for the

Page 12: Managing digital objects and their metadata: challenges

National Library of Australia's PictureAustralia product–we realised that wehad DC data hard-coded to suit PictureAustral ia ' s interface did not suitDiscover's interface. Our process nowconverts the metadata into a base of

pure DC, following DCMI's guidelinesstrictly, and p roduc t - spec i f i crequirements are then added on asnecessary.

Figure 7 – Modular crosswalking as the most efficient model

Currently NLNZ's Metadata ConversionEngine (MCE) consists of severalindependent Perl, Java, and XSLTscripts. There is potential for thedevelopment of a "Metadata CrosswalkDefinition Language" that could capturecrosswalk algorithms generically forapplication within particular MCEimplementations or between MCEs.

7.4 DRD Application ProfileNLNZ has developed "Digital ResourceDescription" (14), an ApplicationProfile based on Dublin Core, forrecords in Discover. One intention forDRD is to provide a "lightweight"alternative to METS for simple digitalobjects. The rationale is that DCQualifiers provide sufficient granularityfor both finding and using single imagefiles in a Web interface, with two

exceptions:• identifying the multiple derivative

files – there is no way ofdifferentiating URLs in multipleIdentifier or hasFormat properties

• identifying the type of identifiers inthe Identifier property, e.g. local,persistent, or location.

The extensions to DC Qualifiers thatNLNZ uses are: Local Identifier,Persistent Identifier, Digital ObjectLocation, XLink 'simple' type attributes(type, href, title, role, arcrole, show,actuate) for more detailed descriptionsof the multiple linking elements, andMetadata Rights Ownership (used fortracking the source of metadata notcreated by NLNZ).

Part of the discipline of creating an

Page 13: Managing digital objects and their metadata: challenges

Application Profile is ensuring importedelements comply with the externalschema they were sourced from. Eachelement was researched and appropriateguidelines for use prepared.

The RDF/XML syntax was chosen as itis the Dublin Core Metadata Initiative'spreferred syntax and it is part of W3C'svision for the future of the Web.

An opportunity identified during thedevelopment of DRD was for amechanism to make managingApplication Profiles easier. NLNZ hasbegun work on a 'Metadata SchemaDescription Language' for centralisingApplication Profile maintenance inXML "super-documents" from whichcan be derived (using XSLT) anyrequired DTDs, XML Schemas, RDFschemas, HTML documentation, orRDDL directory pages.

7.5 NLNZ Preservation MetadataSchema

NLNZ's preservation metadata schemawas developed with practicalimplementation in mind, and waspublished for comment in 2002 (12).This metadata supports curators anddigital library administrators as theymonitor and migrate the different fileformats stored in the repository, toensure they are authentic and accessibleover time. Also referred to as digitalasset management, this is one part of thework of a 'Trusted Digital Repository' as

defined in Trusted Digital Repositories:Attributes and Responsibilities (4).NLNZ have not identified a tool for thispart of the process.

NLNZ's Preservation Metadata wasdeveloped in the light of internationalresearch, particularly that of theNational Library of Australia, andaddresses two functional objectives:• Providing sufficient knowledge to

take appropriate action in order tomaintain a digital object's bitstreamover the long-term

• Ensuring that the content of anarchived object can be rendered andinterpreted, in spite of futurechanges in storage and accesstechnologies.

Preservation metadata will be largelyextracted from the digital files in anautomated process. However it will alsoin part be collected manually over alengthy period, as a digital collection isdonated or selected, and objects areappraised, arranged, described,validated and re-formatted. Curators,digital archivists and technical staff willall record components of thepreservation metadata, and each will allhave different ongoing requirements ofthe preservation records. Formattransformations, digital recovery orconservation processes will requirecurator approval.

Figure 8 – Preservation Metadata Model

The four entities of the NLNZ schema,as shown in figure 8, represent aminimum set of metadata necessary for

preservation:• Object – preservation specific

values associated with the whole of

Page 14: Managing digital objects and their metadata: challenges

the object, e.g. reference number,hardware requirements, softwarerequirements

• Process – actions that havehappened, and changes made, to anobject in NLNZ's care, such as theconversion/migration to a moreviable version

• File – preservation data associatedwith the individual files thatcomprise a digital object, e.g. size,date created, filename

• Metadata Modification – a l lchanges made to the metadatarecord, along with date, time andwho made those changes

7.6 Metadata Extraction ToolNLNZ is concerned about the sheerquantity of metadata needed for themanagement and preservation for everydigital asset [see figure 9]. Automatedsystems are essential.

Figure 9 – The metadata pieces for a single TIFF image (the elements in red arederived programmatically)

A major advantage of digital objects isthat computer programs can runprocesses over them with results thatcould never be achieved by humans.Collecting detailed technical metadataautomatically is not uncommon, butNLNZ could find no tool that wouldopen all of the common file formats andextract the metadata embedded inside.

T h e L i b r a r y c o m m i s s i o n e d

development of the Java-basedPreservation Metadata Extract Tool.This can be run with a GUI or in batch,currently handles 5 common fileformats with 10 additional formats indevelopment, and outputs the results inany XML format – e.g. processing10,000 JPEG files per hour. The outputis another metadata source for the MCE.The tool was selected as a finalist in thePilgrim Trust's 2004 Preservation

Page 15: Managing digital objects and their metadata: challenges

Awards (15).

7.7 Structural MetadataHaving identified all the metadatacomponents needed to supportselecting, ingesting, identifying,describing, managing and preservingdigital objects, the question was, whereto put it? The simple group of XLinksused in NLNZ's DRD schema cannotcater for the complex administrative andtechnical metadata that multi-layeredcomponent and derivative files require.

When the Metadata Encoding &Transmission Standard [METS] cameout in an early release it all suddenlyseemed possible. NLNZ anticipate thedescriptive and administrative metadatawould be stored externally from theMETS records, though the ability toembed it within METS is attractivefrom the metadata interchange point ofview.

It is METS' 'Structural Map'functionality that NLNZ are primarilyattracted to: this provides the first strongmechanism available to track all thedigital files across all purposes(administration, preservation, anddelivery).

8 Integration into the businessNew technologies are often developedwithin projects that sit outside "businessas usual", but eventually the resultsmust be integrated back into thebusiness.

NLNZ has been developing its DigitalLibrary strategies for a number of yearsand has run a number of pilots.Understanding of the requirements isdeepening and the drive now is to movethe collective processes from theory toimplementation. Some staff who will bewith digital objects every day have onlybeen involved in projects on an ad hocbasis – there is a sizable task to up-skill

them and determine the best fit of theprocesses into their work.

8.1 Development of business processworkflows

NLNZ has developed a series ofbusiness process diagrams that reflectthe current workflows. At a verydetailed level, these outline tasks anddecisions, the business unit and staffmember/s responsible, the tools used(which could be particular pieces ofhardware and software, but may also beoffline tools such as collection policiesor printed forms or manuals) and thepieces of metadata that are input/output.Each workflow diagram combinessections that are particular to one typeof digital object (the main categoriesbeing published online, publishedoffline, unpublished and digitised) withgeneric workflow processes (e.g. aroundupload of objects to the digital archive).

These diagrams are a work in progressand are revised constantly. Theproposed outcome is a comprehensiveend-to-end manual for digital objectprocesses and procedures to supportNLNZ's efforts towards trusted digitalrepository status.

9 ConclusionNLNZ's understanding of the processesfor managing, storing, delivering andpreserving digital objects, and ofgenerating and managing the metadatathat supports these processes, hasprogressed a long way in the last fiveyears. The once mammoth-looking listof challenges is being systematicallyreduced as successive components areresolved. NLNZ has built on its earlyexperiences: pilot projects have broughtus face-to-face with the issues and weoffer up the responses here to add to thepool of solutions. NLNZ is still refiningits digital strategies but feels it has afirm basis on which to build its digitalarchive.

Page 16: Managing digital objects and their metadata: challenges

Figure 10 – Simplified view of NLNZ's current solution

AcknowledgementThe authors would like to thank theircolleagues in the Library's DigitalInitiatives and Digital StrategyImplementation teams for theirencouragement, assistance andcontributions.

References

1. National Library of New Zealand.Discover – Te Kohinga Taonga.Av a i l a b l e a t :http://discover.natlib.govt.nz

2. Consultative Committee for SpaceData Systems. Reference Model foran Open Archival InformationSystem (OAIS). CCSDS 650.0-B-1,January 2002. Retrieved 1 May2 0 0 4 f r o mhttp://ssdoo.gsfc.nasa.gov/nost/isoas/

3. Lavoie, B. F. The Open ArchiveInformation System ReferenceModel: Introductory Guide. DPCTechnology Watch Series Report 04-01, January 2004. Retrieved 1 May2 0 0 4 f r o mhttp://www.dpconline.org/docs/lavoie_OAIS.pdf

4. RLG/OCLC Working Group onDigital Archive Attributes. TrustedDigital Repositories: Attributes andResponsibilities. 2002.

5. Green, B. & Bide, M. UniqueIdentifiers: a brief introduction,revised edition. March 1997.Retrieved 1 May 2004 fromhttp://www.bic.org.uk/uniquid.html

6. Paskin, N. On Making andIdentifying a "Copy". D-Lib

Magazine, Volume 9 Number 1,January 2003. Retrieved 1 May2 0 0 4 f r o mhttp://www.dlib.org/dlib/january03/paskin/01paskin.html

7. W3C Technical Architecture Group.Architecture of the World WideWeb, First Edition, Section 2. 9December 2003. Retrieved 1 May2 0 0 4 f r o mhttp://www.w3.org/TR/webarch/

8. IFLA Study Group on theFunctional Requirements forBibliographic Records. Functionalrequirements for bibliographicrecords : final report. 1998.Retrieved 1 May 2004 fromhttp://www.ifla.org/VII/s13/wgfrbr/wgfrbr.htm

9. Rust, G. & Bide, M. INDECSMetadata Framework: Principles,model and data dictionary, v2.0.June 2000. Retrieved 1 May 2004fromhttp://www.indecs.org/project.htm#finalDocs

10. International DOI Foundation. TheDOI Handbook, Edition 3.3.November 2003. Retrieved 1 March2 0 0 4 f r o mhttp://www.doi.org/hb.html

11. Thibodeau, K. Overview ofTechnological Approaches to DigitalPreservation and Challenges inComing Years. In The State ofDig i t a l P rese rva t ion : AnInternational Perspective conferenceproceedings, 2002. Retrieved 1M a r c h 2 0 0 4 f r o mhttp://www.clir.org/pubs/reports/pub107/thibodeau.html

Page 17: Managing digital objects and their metadata: challenges

12. National Library of New Zealand.Metadata Standards Framework.October 2000, November 2002 &June 2003. Retrieved 1 May 2004fromhttp://www.natlib.govt.nz/en/whatsnew/4initiatives.html#meta

13. Kenney, A. & Rieger, O. MovingTheory into Practice: DigitalImaging for Libraries and Archives,Chapter 5. Research LibrariesGroup, 2000.

14. National Library of New Zealand.Digital Resource Description.Retrieved 1 May 2004 fromhttp://www.natlib.govt.nz/dr/drd.html

15. Digital Preservation Coalition.Pilgrim Trust's Digital PreservationAward 2004. Retrieved 1 May 2004fromhttp://www.consawards.ukic.org.uk/digital.html