JHOVE2: A Next-Generation Architecture for
Format-Aware Preservation Processing
Stephen Abrams, Harvard University
Evan Owens, Portico
Tom Cramer, Stanford University
Digital Library Federation Fall Forum, Philadelphia, November 5-7, 2007
JHOVE2 project
• Two-year NDIIPP-funded collaborative project to develop a “next generation” architecture for format-aware preservation processing
– Harvard University
  • Stephen Abrams, Gary McGath, Robin Wendler
– Portico
  • Evan Owens, John Meyer, Sheila Morrissey
– Stanford University
  • Tom Cramer, Richard Anderson, Hannah Frost, Rachel Gollub, Nancy Hoebelheinrich, Keith Johnson
• Open source
– Educational Community License (ECL)
– SourceForge
JHOVE2 project goals
• Refactor the existing architecture
– Rectify known inefficiencies and idiosyncrasies
– Simplify the process of integration
– Encourage third-party extensions
• Provide enhancements
– Separate identification from validation
– Standardized error handling
– Standardized handling of validation profiles
– Standardized reporting using METS, with XSL transform
– More sophisticated data model
– Arbitrary processing modules
JHOVE2 project goals
• Develop modules
– Signature-based identification using DROID
– Validation and characterization
– Symbolic display of selected binary formats
– API-level editing capability
– Policy-based assessment
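To make "signature-based identification" concrete, here is a minimal sketch of matching leading magic bytes against a small table. It is not DROID's API or its signature file format; the `SIGNATURES` table and the `identify` function are hypothetical names used only for illustration. It also shows why identification can run as a cheap, separate step before validation.

```python
from typing import Optional

# Hypothetical signature table: (format label, magic bytes at offset 0).
# The byte values are the well-known leading signatures of each format.
SIGNATURES = [
    ("GIF", b"GIF87a"),
    ("GIF", b"GIF89a"),
    ("PDF", b"%PDF-"),
    ("TIFF", b"II*\x00"),      # little-endian TIFF header
    ("TIFF", b"MM\x00*"),      # big-endian TIFF header
    ("JPEG", b"\xff\xd8\xff"),
]


def identify(data: bytes) -> Optional[str]:
    """Return the first format whose signature matches, or None."""
    for label, magic in SIGNATURES:
        if data.startswith(magic):
            return label
    return None
```

A match only suggests a format; whether the bit stream is well-formed or valid is left to a later validation step.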
Data model
• Implicit assumption in JHOVE
– 1 object = 1 file = 1 format
• But what about…
– TIFF with embedded ICC profile and XMP metadata
  • 1 object = 1 file = 3 formats
– JPEG 2000 JPX fragmentation
  • 1 object = n files = 1 format
– ESRI Shapefile
  • 1 object = 3 files = 3 formats
• JHOVE2 will support processing of complex aggregate objects and nested formatted bit streams
– 1 object = n files = m formats
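One way to picture the "1 object = n files = m formats" model is as a small tree: an object holds files, and each file holds a bit stream that may nest further bit streams. The sketch below is an assumption about how such a model could look, not the JHOVE2 design; the class names (`DigitalObject`, `SourceFile`, `BitStream`) and the format labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class BitStream:
    fmt: str                                        # format label of this stream
    nested: List["BitStream"] = field(default_factory=list)

    def formats(self) -> Set[str]:
        """Collect this stream's format and those of all nested streams."""
        found = {self.fmt}
        for child in self.nested:
            found |= child.formats()
        return found


@dataclass
class SourceFile:
    name: str
    content: BitStream


@dataclass
class DigitalObject:
    files: List[SourceFile]

    def formats(self) -> Set[str]:
        found: Set[str] = set()
        for f in self.files:
            found |= f.content.formats()
        return found


# 1 object = 1 file = 3 formats
tiff = DigitalObject([
    SourceFile("image.tif", BitStream("TIFF", [BitStream("ICC"), BitStream("XMP")])),
])

# 1 object = 3 files = 3 formats (main file, index, attribute table)
shapefile = DigitalObject([
    SourceFile("roads.shp", BitStream("SHP")),
    SourceFile("roads.shx", BitStream("SHX")),
    SourceFile("roads.dbf", BitStream("DBF")),
])
```

The file names are invented; only the file and format counts come from the examples above.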
Common “backplane”
• Outer loop is an iteration over digital objects
• Inner loop of processes applied against each object, passing a common memory structure

while (has-another-object) {
  while (has-another-process) {
    process (object, state);
  }
}
[Figure: backplane diagram. Labels: iterator, object, module, common data, METS writer, XSLT, XSL, display]
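The two nested loops of the backplane can be sketched in a few lines. This is an illustration under stated assumptions, not the JHOVE2 engine: objects are represented by file names, the shared memory structure is a plain dictionary created fresh for each object, and `run_backplane`, `guess_format`, and `check_known` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List

State = Dict[str, object]                 # the common memory structure
Process = Callable[[str, State], None]    # a module: reads and updates state


def run_backplane(objects: Iterable[str], processes: List[Process]) -> Dict[str, State]:
    results: Dict[str, State] = {}
    for obj in objects:                   # outer loop: digital objects
        state: State = {}                 # assumption: fresh state per object
        for process in processes:         # inner loop: processes per object
            process(obj, state)
        results[obj] = state
    return results


def guess_format(obj: str, state: State) -> None:
    # Hypothetical module: crude identification by file extension.
    state["format"] = "TIFF" if obj.endswith(".tif") else "unknown"


def check_known(obj: str, state: State) -> None:
    # Hypothetical module: depends on what an earlier module recorded.
    state["validated"] = state["format"] != "unknown"


results = run_backplane(["a.tif", "b.bin"], [guess_format, check_known])
```

Because every module sees the same state, a later module can build on what an earlier one recorded, which is what lets identification and validation be separate steps.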
Validation
• There is a useful distinction between well-formedness, validity, renderability, and usability
– Well-formedness and validity are “bright line” determinations relative to a specification
– Renderability is a “bright line” determination relative to a specific rendering tool
– Usability is a “fuzzy” determination relative to local policies and heuristics
Policy-based assessment
• Evaluate objects based on prior characterization and locally-defined policy rules and heuristics, for example:
– Risk of technological obsolescence
– Risk of transformative loss
• Codify assessment methodologies and best practice recommendations
• Develop a formal language in which to express policy rules
• Implement a rules engine
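As a rough picture of what a rules engine over prior characterization might look like, the sketch below evaluates a list of rules against a dictionary of properties. Everything here is assumed for illustration: the `Rule` class, the `assess` function, the property names (`format`, `compression`), and the "locally preferred" format set are hypothetical, and no formal policy language is implied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Characterization = Dict[str, object]      # output of an earlier characterization step


@dataclass
class Rule:
    name: str
    applies: Callable[[Characterization], bool]   # True when the risk is present
    message: str


def assess(props: Characterization, rules: List[Rule]) -> List[str]:
    """Return one finding per rule that fires for this object."""
    return [f"{rule.name}: {rule.message}" for rule in rules if rule.applies(props)]


# Hypothetical local policy, chosen only for illustration.
PREFERRED_FORMATS = {"TIFF", "PDF", "XML"}

RULES = [
    Rule(
        "obsolescence",
        lambda p: p.get("format") not in PREFERRED_FORMATS,
        "format is outside the locally preferred set",
    ),
    Rule(
        "transformative-loss",
        lambda p: p.get("compression") == "lossy",
        "lossy compression; migration may compound loss",
    ),
]
```

Keeping the rules as data, separate from the loop that evaluates them, is what would allow each institution to supply its own policy.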
Format support
• Audio: AIFF, WAVE
• Color: ICC
• Document: PDF
• GIS: Shapefile
• Image: GIF, JPEG, JPEG 2000, TIFF
• Text: ASCII, HTML, SGML, UTF-8, XML
Schedule
• 6 months of community outreach, requirements gathering, and design
• 6 months implementation of core APIs and the engine
• 1 year implementation of modules
• Continual prototyping and re-factoring
Questions (for you)?
• Do you care about the open source license (ECL)?
• Do you care about the distribution platform (SourceForge)?
• Do you have functional requirements or use cases?
– How do you use JHOVE today?
– What needs doesn’t it meet?
• What types of policy assessments do you perform?
– How do you quantify risk?
– What is your underlying assessment model?
• Are you aware of existing expression languages and engines for rules-based assessment?
Questions (for you)?
• What can we do to facilitate integration into existing (or planned) systems and workflows?
• What can we do to facilitate third-party development and extension?
– What help would you need to implement your own modules?
– Would you be interested in a co-development arrangement with the JHOVE2 project?
• Do you have interesting test files that you are willing to share?
Questions (from you)?