XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Post on 11-May-2015

DESCRIPTION

Digital preservation deals with the problem of retaining the meaning of digital information over time to ensure its accessibility. The process often involves a workflow that transforms the digital objects. The workflow defines document pipelines containing transformations and validation checkpoints, either to facilitate migration for persistent archival or to extract metadata. The transformations, however, are computationally expensive, so digital preservation can be out of reach for an organization whose core operation is not data conservation. At the same time, the operations described in the document workflow do not recur frequently. This paper combines an implementation-independent workflow designer with cloud computing to support small institutions in the ad-hoc peak computing needs that stem from their digital preservation efforts.

TRANSCRIPT

XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Peter Wittek

Swedish School of Library and Information Science
University of Borås

16/05/11

Outline

1 Workflows and Digital Preservation

2 Computational Requirements of Digital Preservation

3 Preservation Workflow in the Cloud

4 Experimental Results

5 Open Issues

6 Conclusions

Workflows and Digital Preservation

Fundamental Issues in Digital Preservation

Digital objects remain authentic and accessible
Component and management failures
Natural disasters
Attacks

Materials resulting from digital reformatting
Information that is born-digital and has no analog counterpart

Migration, Enrichment, and Other Approaches

Keeping the content of legacy file formats accessible
Most prominent with proprietary file formats
Infrastructure-independent rendering of content
Migration (legal issues)

Dynamic collections: scalability
Reuse

Exploitation with a novel purpose
Sufficient metadata at document and collection level

An Example of Enrichment: ToC Extraction
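
The slide's figure is not part of this transcript. As a rough illustration of what table-of-contents extraction could look like, the sketch below walks a small XML document and collects section titles with their nesting depth. The element names `section` and `title`, the sample document, and the function `extract_toc` are assumptions made for this example and are not taken from the presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: nested <section> elements, each with a <title> child.
SAMPLE = """
<document>
  <section><title>Introduction</title></section>
  <section>
    <title>Methods</title>
    <section><title>Data</title></section>
  </section>
</document>
"""


def extract_toc(element, depth=0):
    """Return (depth, title) pairs for every <section> below `element`,
    in document order."""
    entries = []
    for section in element.findall("section"):  # direct children only
        title = section.findtext("title", default="(untitled)").strip()
        entries.append((depth, title))
        entries.extend(extract_toc(section, depth + 1))
    return entries


toc = extract_toc(ET.fromstring(SAMPLE.strip()))
for depth, title in toc:
    print("  " * depth + title)
```

The extracted list is metadata at the document level; storing it next to the document is one way to enrich a collection without altering the original content.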

Preserving the Pipeline

Reuse of digital content asks for metadata on both the content and how it was transformed to its most recent form
Document process preservation helps
Architecture-independent description of the intent behind a document process

An XML Processing Pipeline
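
The pipeline diagram is missing from this transcript. A minimal sketch of the idea, assuming a pipeline is an ordered list of transformation steps and validation checkpoints applied to one document, might look as follows. The helpers `rename_tag`, `require`, and `run_pipeline` and the element names are hypothetical; a real workflow would more likely call an XSLT processor at the transformation steps.

```python
import xml.etree.ElementTree as ET


def rename_tag(old, new):
    """Build a transformation step that renames every <old> element to <new>."""
    def step(root):
        for element in list(root.iter(old)):
            element.tag = new
        return root
    return step


def require(tag):
    """Build a validation checkpoint that fails if no <tag> element exists."""
    def step(root):
        if root.tag != tag and root.find(f".//{tag}") is None:
            raise ValueError(f"validation failed: missing <{tag}>")
        return root
    return step


def run_pipeline(xml_text, steps):
    """Parse the document, apply each step in order, and serialize the result."""
    root = ET.fromstring(xml_text)
    for step in steps:
        root = step(root)
    return ET.tostring(root, encoding="unicode")


# Checkpoint, transformation, checkpoint.
PIPELINE = [require("heading"), rename_tag("heading", "title"), require("title")]
result = run_pipeline("<doc><heading>Annual report</heading></doc>", PIPELINE)
print(result)
```

Keeping the list of steps separate from the code that runs them mirrors the point of the previous slide: the description of what happened to a document can be preserved independently of the machine that executed it.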

Deployment

Translation of abstract description of workflow
Eclipse Modeling Framework generates Python source code
Grid implementation using iRODS

Integrated Rule-Oriented Data System
Policy-based data grid software system

Current experiment using Amazon Web Services

Computational Requirements of Digital Preservation

Conversion

Steps of a workflow are computationally expensive
XSLT processors

Processing a single large document tree can take hours
Deep parsing and named entity recognition

May involve high-complexity natural language processing

Ad-hoc computations

Learning

A step towards digital curation
SaaS approach to digital curation

Indexing by Lucene/Nutch
Collection-level metadata extraction by Mahout

Preservation Workflow in the Cloud

MapReduce and Deployment

No internal dependencies for the processes
Designed process is exported via the EMF interface to Python
Simple MapReduce driver to execute the process on individual documents
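
Because each document is processed independently, the map step needs no shared state. The sketch below shows that shape in plain Python, without any MapReduce framework. `process_document` is a hypothetical stand-in for the generated per-document process, and the status-counting reducer and the sample documents are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def process_document(xml_text):
    """Hypothetical stand-in for the generated per-document process:
    parse the document and count its elements."""
    root = ET.fromstring(xml_text)
    return sum(1 for _ in root.iter())


def map_document(doc_id, xml_text):
    """Map step: one document in, one (status, doc_id) pair out."""
    try:
        process_document(xml_text)
        return ("ok", doc_id)
    except ET.ParseError:
        return ("failed", doc_id)


def reduce_status(pairs):
    """Reduce step: count documents per status."""
    return Counter(status for status, _ in pairs)


# Two tiny in-memory documents; the second one is malformed on purpose.
DOCUMENTS = {
    "a.xml": "<doc><p>text</p></doc>",
    "b.xml": "<doc><p>broken</doc>",
}

pairs = [map_document(doc_id, text) for doc_id, text in DOCUMENTS.items()]
summary = reduce_status(pairs)
print(dict(summary))
```

Since `map_document` depends only on its own arguments, the list comprehension could be replaced by any distributed map without changing the per-document code.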

The Proposed Architecture

Experimental Results

Cost

Figure: Comparison of average cost of computations with different collection sizes. x-axis: number of processing cores (1, 4, 10, 20, 40, 80); y-axis: average cost in USD (0 to 0.08); legend: 100, 1000, 10000.
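
The plotted values are not recoverable from this transcript. To make the trade-off behind such a plot concrete, the sketch below uses a simple cost model: work is split evenly across cores and each core is billed in whole hours. The model, the function `average_cost_per_document`, and every number in it are assumptions for illustration and are not results from the experiment.

```python
import math


def average_cost_per_document(n_documents, n_cores, total_core_minutes,
                              price_per_core_hour):
    """Average cost in USD per document, assuming perfect load balance and
    billing rounded up to whole hours per core (both are assumptions)."""
    minutes_per_core = total_core_minutes / n_cores
    billed_hours = math.ceil(minutes_per_core / 60)
    return n_cores * billed_hours * price_per_core_hour / n_documents


# Hypothetical workload and price, not taken from the experiment.
costs = {
    cores: average_cost_per_document(1000, cores, total_core_minutes=600,
                                     price_per_core_hour=0.10)
    for cores in (1, 4, 10, 20, 40, 80)
}
for cores, cost in costs.items():
    print(cores, round(cost, 4))
```

Under these assumptions, adding cores beyond the point where each core runs for less than one billed hour raises the average cost per document, because the idle remainder of each hour is still paid for.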

Running time

Figure: Comparison of running times with different collection sizes. x-axis: number of processing cores (1, 4, 10, 20, 40, 80); y-axis: running time in minutes (0 to 8000); legend: 100, 1000, 10000.

Open Issues

Obstacles to Adoption

Persistence and high-reliability
MapReduce
Not just a technological issue

Service-level agreement
Particularly problematic
Another EU FP7 project working on it: SLA@SOI
Niche for alternative cloud providers

Conclusions

Acknowledgment

Work has been funded by Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN), an EU FP7 large integrated project
http://shaman-ip.eu/shaman/

Summary

Digital preservation is an attractive area to be offered as SaaS

Computational needs
Expertise
Complexity

Since persistence requires architecture-independence, cloud adoption is straightforward
High-reliability can be an issue
Service-level agreements need further research