ESIP 2009 Summer Meeting, UC Santa Barbara, CA, July 7 – 10, 2009
Stanford Digital Repository
PREMIS & Geospatial Resources
Nancy J. Hoebelheinrich
InfoAnalytics
San Mateo, CA
To Be Discussed
  A Brief History of PREMIS
  An Overview of PREMIS data elements
  Uses for Geospatial Resources: Examples
A Brief History of PREMIS
PREMIS – Preservation Metadata came initially from cultural heritage / digital preservation communities
Built upon previous initiative (2001-02)
  Sponsored by two key library descriptive MD utilities (OCLC and RLG)
  Preservation Metadata Framework working group
  Issued a report outlining types of information that should be associated with an archived digital object
A Brief History of PREMIS
In 2003 a PREMIS working group formed
  Comprised of practitioners building or working on preservation repositories, including national data centers in the UK & US, Netherlands, etc.
  Focused upon implementable data elements
Resulted in a two-pronged effort:
  Implementation survey
  Data dictionary of CORE preservation semantic units (= data elements)
A Brief History of PREMIS
PREMIS working group publications:
  “Implementing Preservation Repositories for Digital Materials: Current Practice and Emerging Trends in the Cultural Heritage Community”, December 2004
  “PREMIS Data Dictionary for Preservation Metadata, version 1.0”, May 2005
A Brief History of PREMIS
PREMIS Implementation
  PREMIS Editorial committee formed
  Maintained by Library of Congress
  “PREMIS Data Dictionary for Preservation Metadata, version 2.0”, March 2008
Who uses?
  See implementation registry
  PREMIS Implementors Group (PIG) listserv for practitioners
PREMIS Data Model for an “intellectual entity”
  OBJECT: Discrete unit of information in digital form
  RIGHTS: Rights or permissions info associated with Object or Agent
  EVENTS: Important lifecycle events
  AGENTS: Parties to Events and/or Rights
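The four entities and their links can be sketched as plain data classes. This is only an illustration of the relationships named on the slide, not the PREMIS schema itself; the class names, field names, and the agent name are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agent:
    """Party to Events and/or Rights."""
    name: str


@dataclass
class Event:
    """Important lifecycle event, with the Agents involved."""
    event_type: str
    agents: List[Agent] = field(default_factory=list)


@dataclass
class Rights:
    """Rights or permissions info, with the Agents involved."""
    statement: str
    agents: List[Agent] = field(default_factory=list)


@dataclass
class PremisObject:
    """Discrete unit of information in digital form."""
    identifier: str
    events: List[Event] = field(default_factory=list)
    rights: List[Rights] = field(default_factory=list)


# Illustrative values only; the agent name is a made-up placeholder.
ingest = Agent(name="ingest code")
tiff = PremisObject(
    identifier="0372001.tif",
    events=[Event(event_type="normalization", agents=[ingest])],
    rights=[Rights(statement="Public Access", agents=[ingest])],
)
```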
PREMIS Data Model (diagram)
More about PREMIS Object
Is an abstraction, meant to cluster semantic units and clarify relationships
Has 3 subtypes:
  File – the usual suspect
  Bitstream – contiguous or non-contiguous data within a file that has meaningful common properties for preservation purposes
  Representation – set of files, including structural metadata, needed for a complete and reasonable rendition of an Intellectual Entity
Assumptions underlying PREMIS
Not about “descriptive” metadata (used for search & discovery)
Not about “technical” metadata (usually about the format(s) of the component files or bitstreams)
These areas to be covered by domain specific metadata, e.g., FGDC or ISO profiles
Mind the Gap!
Simple Example of use of PREMIS Object Data Elements
Applied at file level
Automatic insertion by Ingest code to retain important provenance info for each file before moving into the preservation repository:
  Original file name from data provider
  Original checksum
  Original file size
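A minimal sketch of how ingest code might capture these three values for one file. The function name and the returned keys are assumptions for illustration (the keys borrow PREMIS element names); this is not the SDR ingest code, and MD5 is used only because it is the algorithm shown in the Object excerpt.

```python
import hashlib
import os


def capture_file_provenance(path, chunk_size=1 << 20):
    """Return original name, MD5 checksum and size for one file (hypothetical helper)."""
    digest = hashlib.md5()
    size = 0
    # Read in chunks so that large image files are not loaded into memory at once.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {
        "originalName": os.path.basename(path),
        "messageDigestAlgorithm": "MD5",
        "messageDigest": digest.hexdigest(),
        "size": size,
    }
```

Calling capture_file_provenance on each file before it moves into the repository gives values that can later be compared against a fresh checksum to confirm the bits are unchanged.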
PREMIS Object Excerpt (v1.1)

Element                Subelement or Attribute                  Value
objectIdentifier       objectIdentifierType                     filename
                       objectIdentifierValue                    0372001.tif
preservationLevel                                               bit preservation
objectCategory                                                  file
objectCharacteristics  compositionLevel                         0
                       fixity / messageDigestAlgorithm          MD5
                       fixity / messageDigest                   0c77e67bebe3f3384ec8bf4736648e41
                       size                                     315827432
                       format / formatDesignation / formatName  TIFF
originalName                                                    0372001.tif
More about PREMIS Object relationships
Defined as associations between two or more:
  Object entities, or
  Entities of different types, e.g., an Object & an Agent
Recorded for long term preservation purposes
Typical relationship types = structural (component of representation), derivative (format varieties), dependent (required schema or database structure)
Could be expressed using other schemas for packaging the resource such as METS or XFDU or MPEG DIDL
Use of PREMIS Rights data elements
Applied at representation level
Reference to donor’s Deposit Agreement (using METS)
Key info from the ingested Deposit Agreement for immediate playback
PREMIS Rights Excerpt (v1.1)

Element                        Subelement or Attribute                          Value
permissionStatement            xmlID                                            SDR Access Phase 1
permissionStatementIdentifier  permissionStatementIdentifierType                Repository Permissions
                               permissionStatementIdentifierValue               All digital objects falling under SDR Preservation Agreement_BitPreservation, v6.0, David Rumsey Map Collection
grantingAgreement              grantingAgreementIdentification                  library_stanford_edu_fcab81ee605011db96c4339be
                               grantingAgreementInformation / contractAbstract  Version 6.0 of Agreement for Bit Preservation of Rumsey Collection
permissionGranted              act                                              Public Access
                               termOfGrant / startDate                          2006-11-01
                               termOfGrant / endDate                            2011-11-01
                               permissionNote / restrictionDefinition           restriction="Stanford only": Stanford community only as defined in agreement.
                                                                                restriction="SDR_GROUP_xxx": Named group controlled by SUNET group as defined in agreement.
                                                                                restriction="No access": No access to content allowed.
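One way a repository might act on the termOfGrant dates is a simple date-range check. The sketch below is an assumption for illustration only: the function name is made up, and treating both the start and end dates as inclusive is a guess, since the excerpt does not say how the boundaries are interpreted.

```python
from datetime import date


def grant_is_active(start_date, end_date, on_date):
    """Return True if on_date falls within the term of grant.

    Both boundary dates are treated as inclusive (an assumption).
    """
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    return start <= on_date <= end


# Dates taken from the termOfGrant rows of the Rights excerpt.
term = {"startDate": "2006-11-01", "endDate": "2011-11-01"}
active_in_2009 = grant_is_active(term["startDate"], term["endDate"], date(2009, 7, 7))
active_in_2012 = grant_is_active(term["startDate"], term["endDate"], date(2012, 1, 1))
```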
Use of PREMIS Event for simple event
Event 1: Transform of descriptive MD from MS Access db => XML => MODS
Applied at representation level
Why this event?
  In case of questions from outside data provider
  Retain singular scripts & transform mechanisms
  Test practicability of recording such events in production environment
PREMIS Event Excerpt (v1.1)

Element          Subelement or Attribute  Value
eventIdentifier  eventIdentifierType      MD_Transformation_Process
                 eventIdentifierValue     Rumsey-MODS 3.2 for SDR
eventType                                 normalization
eventDateTime                             2006-12-01T02:48:22
eventDetail                               Steps of process transforming data provider's descriptive metadata to MODS 3.2 records as required for ingestion into SDR.

eventOutcomeInformation / eventOutcomeDetail / SDR_Rumsey_Transformation / SDR_RumseyTransformationOutput:
  The Rumsey Access database, as delivered by Luna Insight, was converted to a single XML document using the MS Access Export function. Both the MS Access database and the XML file are included.
  A PERL script was used to break the monolithic XML document representing the MS Access database into many XML documents, each representing a single image in the Rumsey collection. The single XML document was broken into separate documents at each occurrence of the "Object" tag. The PERL script is included in text format.
  An XSLT was used to make MODS documents for all the Rumsey images. The XSLT file is included.
  SDR conversion code was written to pull geographic coordinates and scale metadata out of SUL MARC records from the Unicorn catalog and insert them into the MODS records when available.
  SDR conversion code was written to insert the composite MODS records into the METS record for each Rumsey digital object.
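The splitting step recorded in this event used a PERL script that is not reproduced here. As a rough Python equivalent, the sketch below breaks one XML document into separate documents at each "Object" element. The function name, the sample wrapper element "dataroot", and the "Title" child are assumptions for illustration, not details of the Rumsey export.

```python
import xml.etree.ElementTree as ET


def split_objects(xml_text, tag="Object"):
    """Return one serialized XML document per element named `tag`."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so matching elements are found at any depth.
    return [ET.tostring(elem, encoding="unicode") for elem in root.iter(tag)]


# Made-up sample standing in for the monolithic export.
sample = (
    "<dataroot>"
    "<Object><Title>Map A</Title></Object>"
    "<Object><Title>Map B</Title></Object>"
    "</dataroot>"
)
documents = split_objects(sample)
```

Each string in documents could then be written to its own file and passed to the XSLT step.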
Another example: GIS Dataset: Street network of given metropolitan area
Dataset 1: official street centerline file used by emergency services to locate street addresses
Dataset 2: aspects of the road network including topography, angles & geometry of the road network used for a tourist map
Event to be documented: Merge c:\temp\states1;c:\temp\states2; c:\temp\USA
Use of PREMIS Event Data Elements
Want to describe full process of data creation
  Includes “merge” and data sources
Advantage of PREMIS – can describe events once in repository
Why this event?
  Important to describe processes during different phases of lifecycle, even prior to ingestion
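A merge event of this kind could be recorded with the same elements that appear in the Event excerpt. The sketch below builds such a record with Python's standard library. The root element name "event", the absence of a namespace, and every value passed in (identifier type and value, the event type "merge", the date) are placeholders assumed for illustration; it has not been validated against the PREMIS schema.

```python
import xml.etree.ElementTree as ET


def build_event(id_type, id_value, event_type, date_time, detail):
    """Build a simple event record using element names from the Event excerpt."""
    event = ET.Element("event")  # root element name is an assumption
    identifier = ET.SubElement(event, "eventIdentifier")
    ET.SubElement(identifier, "eventIdentifierType").text = id_type
    ET.SubElement(identifier, "eventIdentifierValue").text = id_value
    ET.SubElement(event, "eventType").text = event_type
    ET.SubElement(event, "eventDateTime").text = date_time
    ET.SubElement(event, "eventDetail").text = detail
    return event


# All values below are hypothetical placeholders.
merge_event = build_event(
    id_type="GIS_Merge_Process",
    id_value="street-network-merge-001",
    event_type="merge",
    date_time="2009-07-07T00:00:00",
    detail=(
        "Merged the official street centerline file (Dataset 1) with the "
        "road network dataset used for a tourist map (Dataset 2)."
    ),
)
xml_text = ET.tostring(merge_event, encoding="unicode")
```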
Use of PREMIS Agent Data Elements
For data management within the repository
Audit trail for descriptive MD
Version of Ingest code?
Data provider who created / altered the resource or the metadata, e.g., USGS which added FGDC MD to HRO from Monterey Bay Water Resource
PREMIS & Geospatial data -- Comments based on experiences:
Works well when:
  Domain specific MD exists, e.g., FGDC for descriptive and technical MD
  There are levels of the resource with MD to be associated, e.g., at representation & file(s) level
  Need to document various points in the lifecycle of the data
PREMIS & Geospatial data -- Comments based on experiences:
In earlier versions of PREMIS unclear how to document:
  Context
  Environment, including at time of creation
  “Significant properties”
  Existence of geospatial format registries
PREMIS v 2.0 more flexible
Still XML binding
Allows for containers
Allows hierarchical relationships
Extensible by use of new <premis:extension> element to insert other elements, XML fragments, e.g., technical MD, provenance metadata, etc.
Board considering the inclusion of mechanism used by packaging schemas to “wrap” or “reference” other metadata
PREMIS & Complex Geospatial Data
For more detail, see “An Investigation into Archiving Geospatial data Formats”, prepared for NGDA Project, funded by NDIIPP (http://www.ngda.org/research.php)
  Formats examined
  Approaches of FGDC, PREMIS, and Center for International Earth Science Information Network (CIESIN)’s Geospatial Electronic Record (GER) model on basis of:
    Environment / computer platform
    Semantic underpinnings
      domain specific terminology
      provenance
      data quality
      appropriate use
Examples of Geospatial “Context”
Placing dataset in Time & Space
Semantic underpinnings, e.g.,
  Abstract
  Description of purpose / research methodology
  Intended use of data to avoid misinterpretation or misuse
Where to put?
  FGDC has place
  PREMIS would not necessarily consider this as “preservation” metadata, but rather “descriptive” or technical MD; however, see v 2.0
Examples of “Environment” and/or “Significant properties” for geospatial data
HW info pertinent at time of data creation
SW info pertinent at time of data creation (?)
Lineage or “provenance” data, e.g., to communicate processing steps used to create scientific data product
Events, parameters & source data which influenced or impacted the creation of the data set prior to its ingestion into the archive, in order to fully understand the data that you’re getting
“Environment” & “Significant properties”, continued…
Data Quality – describing completeness, logical consistency, attribute accuracy
Data Trustworthiness – data creator / provider reliable? = “authentic”
Data Provenance – processes & sources for dataset = “understandable & reliable”
Understanding of the specific needs of the “designated community”
Questions? / comments?
Nancy J. Hoebelheinrich