Considerations for Data Preservation across the Information Life Cycle
- National Digital Archiving Strategy Research TF Internal Seminar -
2010. 4. 1.
정영임
Korea Institute of Science and Technology Information (KISTI), Information Distribution Division, Knowledge Base Department
- 2 -
Table of Contents
1. Digital Archiving in the Framework of Information Life Cycle
Management
2. Creation
3. Acquisition
4. Cataloging/Identification
5. Storage
6. Preservation
7. Access
Digital Archiving in the Framework of Information Life Cycle Management
Digital archiving framework
– Considered at all stages of information life cycle management
– Information life cycle
• Creation
• Acquisition
• Cataloging/Identification
• Storage
• Preservation
• Access
- 3 -
Creation
Creation
– Defined as an act of producing the information product in the broadest
sense
– Should be regarded as the starting point of long-term archiving and preservation
Suggestion that creators provide a preservation indicator
– U.S. Department of Agriculture’s Digital Publications Preservation
Steering Committee
Establishment of guidelines for creators
– Oak Ridge National Laboratory, USA
• A Guide To Record Series Supporting Epidemiological Studies Conducted
for the Department of Energy
• Limits on software
• Format and layout of the documents
- 4 -
Creation
Adoption of Standard Descriptive Languages
– Standards groups incorporate XML and RDF architectures
Attachment of Metadata on Digital Contents
- 5 -
Acquisition and Collection Development
Three main aspects to acquisition of digital objects
– Collection policies
– Gathering methods
– Intellectual Property Concerns
- 6 -
Establishment of Collection Policies
Collection policies
– Selecting What to Archive
• Purpose
– For Dark Archiving: Back issues
– For Light Archiving: Current issues
• Criteria
– Ease of Content Acquisition
– Quality of Contents
– Utilization
– On-going access fee
• Content Type Coverage: E-journals/R&D Reports/Patents/Scientific Data
– Determining Extent
– Archiving Links
– Refreshing the Archived Contents
- 7 -
Considerations on Gathering Method
Gathering methods
– Hand selection
• Value Judgment and Retention Scheduling (Edinburgh University Library)
– Not preserved
– Preserved for defined period
– Preserved indefinitely
– Automatic selection
• National Library of Sweden: Automatic acquisition without making value
judgment (priority: periodicals, static documents, HTML pages >>
conferences, usenet groups, ftp archives)
• EVA project: Establishment of time limits to avoid overloading
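The three retention outcomes listed under hand selection can be sketched as a small data model. This is a minimal illustration only: the class names, the `review_date` field, and the `is_retained` helper are assumptions made for the example and do not come from the Edinburgh University Library scheme.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Retention(Enum):
    # The three outcomes of the value judgment named on the slide.
    NOT_PRESERVED = "not preserved"
    DEFINED_PERIOD = "preserved for defined period"
    INDEFINITE = "preserved indefinitely"


@dataclass
class RetentionDecision:
    category: Retention
    # Hypothetical field: end of the retention period, used only for DEFINED_PERIOD.
    review_date: Optional[date] = None


def is_retained(decision: RetentionDecision, today: date) -> bool:
    """Return True if the object should still be held on the given day."""
    if decision.category is Retention.NOT_PRESERVED:
        return False
    if decision.category is Retention.INDEFINITE:
        return True
    # DEFINED_PERIOD: retained up to and including the review date.
    return decision.review_date is not None and today <= decision.review_date
```

A selector would record one `RetentionDecision` per object at acquisition time and re-run `is_retained` during periodic reviews.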
- 8 -
Considerations on Intellectual Property Concerns
Reliance on Legislation
– Freedom of Information Act 2001
• The public may have unrestricted access to certain records.
(Consider what categories of information may need to be viewed by the public - these
records need to remain accessible at all times.)
– In general, practice varies because there is no international digital deposit legislation
• PANDORA project seeks permission from the copyright owner
• Swedish and Finnish national library projects do not contact the owners
Making Agreement with Content Providers
– E-journal: Publishers or academic associations
• CLIR/DLF draft model license, NESLi2 Standard license model
• Agreement of Cornell University with publishers
– Government document: Open to public
– Scientific data: individual creators or data centers
• Arts and Humanities Data Service provides information on what is needed
for a digital archive and what creators are likely to be willing to deposit
- 9 -
Agreement of Cornell University with Publishers
Topics identified in the agreement (Thomas and Kroch, 2000)
– The general responsibilities of the publishers and Cornell
– Characteristics of the data, accompanying metadata, and any additional documentation that are
to be deposited
– Guidelines on transmission methods and media for deposit
– Procedures for the deposit
– Procedures and protocols Cornell will use to verify the arrival and completeness of the data
– Rights of the depositing organizations to audit the repository
– The respective roles, responsibilities, and rights of Cornell and the data producers with
regard to the data
– Articulation of Cornell's responsibilities and capabilities with regard to the accessioning,
description, management, and even transformation of the deposited data
– Access policies for users of the repository, and how they may vary over time
– Conditions on the use of the data, and again how they may vary over time
– Fees (if any) associated with the deposit
– Cornell's ability to share the data with partners to create an agreed-upon level of redundancy
– Clarification of issues surrounding copyright retained by authors
- 10 -
Identification and Cataloging
Identification
– Provision of a unique key for finding the digital object and linking object
to other related objects
Cataloging in the form of metadata
– Support for organization, access and curation
- 11 -
Persistent Identification
Problems in using URL as Identifier
– Using the server as a location identifier can result in a lack of persistence over
time, both for the source object and for any linked objects
– URLs nevertheless remain in continuous use
New approaches on persistent identification
– OCLC: PURLs
– ACS: Digital Object Identifier (DOI), MN (Manuscript Number)
– DTIC: Handle® system
– AAS: Bibcode, PubRef numbers
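The idea shared by these approaches is a level of indirection: citations carry a stable identifier, and a resolver maps it to wherever the object currently lives. The sketch below illustrates that idea only. The `Resolver` class, the identifier string, and the URLs are hypothetical and do not reproduce the PURL, DOI, or Handle protocols.

```python
class Resolver:
    """Maps a persistent identifier to the object's current location."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, url):
        # First registration of an identifier with its initial location.
        self._locations[pid] = url

    def relocate(self, pid, new_url):
        # The object moved to another server; the identifier stays the same.
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_url

    def resolve(self, pid):
        return self._locations[pid]


resolver = Resolver()
# Hypothetical identifier and URLs, for illustration only.
resolver.register("example:report-001", "http://old-server.example.org/reports/001.pdf")
resolver.relocate("example:report-001", "http://new-server.example.org/archive/001.pdf")
current_url = resolver.resolve("example:report-001")
```

Links that cite `example:report-001` keep working after the move, whereas links that cited the old URL directly would break.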
- 12 -
Creation of Metadata at Cataloging Stage (1/3)
Creation Method of Metadata
– Manual creation of metadata
– Automatic generation of metadata
• A project by US Environmental Protection Agency
• Defense Information Technology Testbed project
- 13 -
Creation of Metadata at Cataloging Stage (2/3)
Formats of Descriptive Metadata
– E-journal
• Full MARC cataloging
– Traditional library cataloging standards
– NLA’s PANDORA Archive
• Current development of descriptive metadata standards
– MARCXML, MODS (Metadata Object Description Schema)
– Web-based resources
• Dublin Core-like format
• EVA project
– Non-textual data
• Identification of metadata elements needed for non-textual data types such
as images, video, multimedia and others
– Z39.87 NISO/AIIM Technical metadata for digital still images
– AES X089 core audio metadata
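As an illustration of a Dublin Core-like record for a web-based resource, the sketch below serializes a few descriptive elements as XML. The namespace URI is the one for the Dublin Core element set. The wrapping `record` element, the helper name, and the sample values are assumptions for the example and are not taken from the EVA project.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core element names and values as XML text."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Sample values are invented for illustration.
xml_text = dublin_core_record({
    "title": "Example Web Page",
    "creator": "Example Author",
    "date": "2010-04-01",
    "format": "text/html",
})
```

Non-textual data would need additional technical elements, as the standards listed above suggest, which a flat record of this kind does not cover.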
- 14 -
Creation of Metadata at Cataloging Stage (3/3)
Management of Heterogeneous Metadata Format
– Translation between various metadata formats
– Key to the development of networked, heterogeneous archives
– Adoption of packaging metadata standards
• Open Archival Information System (OAIS) Reference Model
– Developed by the Consultative Committee for Space Data Systems (CCSDS) and published as an ISO standard (ISO 14721)
– Encapsulates specific metadata as needed for each object type in a consistent
data model
• Metadata Encoding and Transmission Standard (METS)
– Produced by the Library of Congress standards office and the Digital Library
Federation
– Provides a framework for holding all types of metadata for a digital object
• Others
– MPEG-21 Digital Item Declaration Language
– IMS Global Learning Consortium Content Packaging Standards
– Sharable Content Object Reference Model (SCORM)
– CCSDS XML Packaging scheme
- 15 -
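Translation between metadata formats, mentioned on the slide above, is often implemented as a crosswalk table. The sketch below is a deliberately simplified assumption: it maps three MARC-style field tags to Dublin Core element names and reports anything it cannot map. The field keys (including the compact `260c` key) and the sample record are invented for the example and are not an official crosswalk.

```python
# Simplified, illustrative crosswalk: MARC-style tag -> Dublin Core element.
CROSSWALK = {
    "245": "title",    # title statement
    "100": "creator",  # main entry, personal name
    "260c": "date",    # publication date (hypothetical compact key)
}


def translate(record, crosswalk):
    """Translate a flat record; return (translated fields, unmapped tags)."""
    translated = {}
    unmapped = []
    for field, value in record.items():
        target = crosswalk.get(field)
        if target is None:
            # Keep track of fields that would be lost in translation.
            unmapped.append(field)
        else:
            translated[target] = value
    return translated, unmapped


marc_like = {
    "245": "Best Practices for Digital Archiving",
    "100": "Hodge",
    "260c": "2000",
    "999": "local note",
}
dc_fields, lost = translate(marc_like, CROSSWALK)
```

Reporting the unmapped tags makes the information loss of a translation visible, which matters when heterogeneous archives exchange records.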
Development of Technical Model for Storage
Recommendations for developing a technical model for the
repository (Cornell University)
– Establishing a baseline of e-journal software and file format needs
– Specifying the archival repository
– Specifying monitoring tools that will flag documents within the
repository that require migration
– Specifying a baseline hardware and software infrastructure to house
the repository
– Exploring the need and implementation models for redundancy in the
repository
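The monitoring tools recommended above could, in the simplest case, compare each stored file's format against a list of formats considered at risk. The following is a minimal sketch under that assumption. The format labels, the at-risk list, and the record structure are hypothetical and are not part of the Cornell proposal.

```python
from dataclasses import dataclass


@dataclass
class ArchivedFile:
    identifier: str
    file_format: str


# Hypothetical watch list; a real repository would maintain this from a format registry.
AT_RISK_FORMATS = {"wordperfect-5.1", "ms-word-2000"}


def flag_for_migration(files, at_risk):
    """Return identifiers of files whose format is on the at-risk list."""
    return [f.identifier for f in files if f.file_format in at_risk]


holdings = [
    ArchivedFile("article-001", "sgml"),
    ArchivedFile("article-002", "ms-word-2000"),
    ArchivedFile("article-003", "tiff-group4"),
]
flagged = flag_for_migration(holdings, AT_RISK_FORMATS)
```

Running such a check on a schedule turns migration from an ad hoc decision into a routine report.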
- 16 -
Issues on Changing Storage Media
Problem of changing storage media
– Block size, tape size and tape drive mechanism have changed over
time.
Common Solution
– Data migration to new storage systems
• High cost and imperfect transfer are still issues.
• Check/validation algorithms are extremely important
• Manual check is still necessary.
• Atmospheric Radiation Monitoring Center plans to migrate to new storage
systems every 4-5 years
– Each data migration will take 6-12 months
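One common form of the check/validation step is a fixity comparison: compute a checksum before the copy and compare it with the checksum of the migrated file. The sketch below uses SHA-256 from the Python standard library. The function names and file names are assumptions for illustration, and it does not describe the procedure actually used at any of the centers named above.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_and_verify(source, destination):
    """Copy a file to new storage and confirm the bits are unchanged."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    return sha256_of(destination) == expected


with tempfile.TemporaryDirectory() as workdir:
    old_copy = Path(workdir) / "old_media.dat"
    new_copy = Path(workdir) / "new_media.dat"
    old_copy.write_bytes(b"archived observation data")
    migration_ok = migrate_and_verify(old_copy, new_copy)

    # Simulate a faulty transfer to show that the check detects it.
    new_copy.write_bytes(b"archived observation dat?")
    corruption_detected = sha256_of(new_copy) != sha256_of(old_copy)
```

A checksum only proves that the bits survived the transfer. It cannot confirm that the content is still interpretable, which is one reason manual checks remain necessary.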
- 17 -
Issues on Terabytes of Data Storage
Problem of dealing with large-scale data
– Extensive validation routines to ensure the quality of the information as
the information is migrated
• NCBI has 30 Ph.D.s reviewing the information manually, even after it has
passed a variety of validation algorithms
• Comparable cost is incurred for
– Corrections and additions to particular records
– Maintenance of a history of changes
– Approval by the owner of all changes controlled by NCBI
Common Solution
– Large-scale data can be stored in different file formats
• Biological sequence data is held in simple ASCII files for preservation
purposes.
• Data in a structured database is provided for searching, reporting and
maintenance
– Extensive tasks can be transitioned to a non-profit consortium
• Protein Data Bank: Research Collaboratory for Structural Bioinformatics
- 18 -
Preservation
Long-term preservation
– No common agreement on the definition of long-term preservation
Main aspects of preservation
– Selection of digital preservation strategies/technologies
– Cycle for hardware/software migration
• No specific investigation of the cycle for hw/sw migration has been done.
• Depending on the particular technologies and subject disciplines, it can
vary from 2 to 10 years.
– Preservation of the “look and feel” of digital contents
- 19 -
Digital Preservation Strategies
Bitstream Copying
Refreshing
Durable/Persistent Media
Technology Preservation
Digital Archaeology
Analog Backups
Migration (SW, HW migration)
Replication
Reliance on Standards
Normalization
Canonicalization
Emulation
Encapsulation
Universal Virtual Computer
- 20 -
Hardware and Software Migration
Problems with Migration
– Migration is not guaranteed to work for all data types
– Migration of information products that use sophisticated software
features is unreliable
– Generally, there is no backward compatibility, and where it is possible,
the result certainly loses some integrity.
Emulation as an alternative to migration
– Encapsulates the behavior of the hardware/software with the objects
• MS Word 2000 document with metadata indicating how to reconstruct the
document at the engineering level
– Creates an emulation registry identifying the HW/SW environment and
providing information on how to recreate the environment
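The two ideas above, encapsulating an object with a pointer to its environment and keeping a registry that describes how to recreate that environment, can be sketched as follows. All class names, the registry key, and the environment details are illustrative assumptions. In particular, the operating system and hardware entries are placeholders, not documented requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    software: str
    operating_system: str
    hardware: str


@dataclass
class EncapsulatedObject:
    identifier: str
    payload: bytes
    environment_key: str  # points into the emulation registry


# Hypothetical emulation registry with a single entry.
REGISTRY = {
    "msword-2000": Environment(
        software="MS Word 2000",
        operating_system="Windows 98 (assumed for illustration)",
        hardware="x86 PC (assumed for illustration)",
    ),
}


def environment_for(obj, registry):
    """Look up the environment needed to render an encapsulated object."""
    return registry[obj.environment_key]


document = EncapsulatedObject("report-42", b"...binary content...", "msword-2000")
needed = environment_for(document, REGISTRY)
```

Storing only a key with each object keeps the environment description in one place, so it can be corrected or enriched without touching the archived objects.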
- 21 -
Advantages and Disadvantages of Preservation Strategies
- 22 -
Selection of Preservation Strategies
A schematic diagram for selection of preservation techniques of digital information.
(Lee et al, 2002)
- 23 -
Preservation of the Look and Feel
Format of materials
– In order to save the “look and feel” of material
• TIFF
– The most prevalent format among organizations involved in converting
paper back files
» e.g., JSTOR
– This does not allow the embedded references to be active hyper links
• SGML/HTML
– Used by many large publishers after years of converting publication systems
from proprietary format to SGML
– American Astronomical Society (AAS)
– The most prevalent format for purely electronic documents used for both formal
publications and grey literature
– National Library of Sweden
– Concerns remain for long-term preservation
» It may not be accepted as a legal depository format because of its
proprietary nature
- 24 -
Normalization vs. Native Formats
Normalization
– Process of converting the native format to a standard format
• AAS, ACS transform the incoming file into SGML-tagged ASCII format
– Electronic master copy is able to serve as the robust electronic archival copy.
– Well-tagged copy can be updated periodically, at very little cost.
– It takes advantage of advances in both technology and standards.
» Content remains unchanged, but the public electronic version can be
updated to remain compatible with the browsers and other access
technology
– Examples of data normalization from the data community
• NASA Distributed Active Archive Centers
– Transform incoming satellite and ground monitoring information into standard
Common Data Format
• UK's National Digital Archive of Datasets
– Transforms the native format into one of its own devising
• Normalized formats are considered to be the archival versions
– Intellectual property question
- 25 -
Reliance on Standards
Emphasis on Standards
– DOE OSTI
• Limited the number of acceptable input formats
• Text in SGML (and its relatives HTML and XML), PDF, WordPerfect and
Word.
• Image in TIFF Group 4 and PDF Image
- 26 -
Preservation Strategies Used in Major Projects
CSI: CISTI Csi, ECO: OCLC Electronic Collections Online, EJO: OhioLINK Electronic Journal Center,
KB: KB e-Depot, KOP: Kopal DDB, LA: LOCKSS Alliance, LANL: Los Alamos National Laboratory Research Library,
NLA: National Library of Australia PANDORA, OSP: Ontario Scholars Portal, PMC: PubMed Central, PORT: Portico
- 27 -
Issues on Access
Access Mechanisms
– Access and display mechanisms
• Providing access
• Restricting access
Rights Management and Security Requirements
– Security and version control
– Creation of metadata to manage encryption, watermarks, and digital
signatures
- 28 -
Access Mechanisms
Providing Access
– NLM’s Profiles in Science
• Creates an electronic archive of the photographs, text, video, etc
• Electronic archive is used to create new access versions as access
mechanisms change
– Providing access technologies
• Super Distribution
• Value-chain support
Restricting Access
– Usage rule
– Persistent protection
- 29 -
Access
Rights Management and Security Requirements
– Most difficult access issues for digital archiving
– Security and version control impact digital archiving
• Rights management includes providing or restricting access as appropriate
• Content protection technologies
– Content encryption
– Trusted Environment
– Metadata for managing encryption, watermarks, digital signatures
needs to be created.
- 30 -
References
CLIR, 2002. The State of Digital Preservation: An International Perspective [online] [cited 2009-07-23]
Hodge, 2000. Best Practices for Digital Archiving: An Information Life Cycle Approach, D-Lib Magazine: 6(1) [online] [cited 2009-07-23] <http://www.dlib.org/dlib/january00/01hodge.html>
Hodge et al, 2004. Digital Preservation and Permanent Access to Scientific Information [online] [cited 2009-07-23]
ICPSR, 2009. Digital Preservation Management: Implementing Short-term Strategies for Long-term Problems [online] [cited 2009-12-03] <http://www.icpsr.umich.edu/dpm/index.html>
Kenney, A. R., Entlich, R., Hirtle, P. B., McGovern, N. Y. and Buckley, E. L., 2006. E-Journal Archiving Metes and Bounds: A Survey of the Landscape [online] [cited 2009-12-03]
Lee, K., Slattery, O., Lu, R., Tang, X. and McCrary, V., 2002. The State of the Art and Practice in Digital Preservation, Journal of Research of the National Institute of Standards and Technology: 107(1), 93-106.
Thomas, S. E. and Kroch, C. A., 2000. Project Harvest: The Cornell University Library's Proposal to The Andrew W. Mellon Foundation To Develop a Repository for E-Journals [online] [cited 2010-03-26] <http://www.diglib.org/preserve/cornellprop.htm>
Edinburgh University Library Digital Archives Research Project. A report and recommendations
- 31 -