XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Peter Wittek

Swedish School of Library and Information Science, University of Borås

16/05/11

Outline

1 Workflows and Digital Preservation

2 Computational Requirements of Digital Preservation

3 Preservation Workflow in the Cloud

4 Experimental Results

5 Open Issues

6 Conclusions

Workflows and Digital Preservation

Fundamental Issues in Digital Preservation

Digital objects remain authentic and accessible
Component and management failures
Natural disasters
Attacks

Materials resulting from digital reformatting
Information that is born-digital and has no analog counterpart

Migration, Enrichment, and Other Approaches

Keeping the content of legacy file formats accessible
Most prominent with proprietary file formats
Infrastructure-independent rendering of content
Migration (legal issues)

Dynamic collections: scalability
Reuse

Exploitation with a novel purpose
Sufficient metadata at document and collection level

An Example of Enrichment: ToC Extraction
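As a rough illustration of what table-of-contents extraction as an enrichment step could look like, the sketch below walks an XML document and attaches a `toc` element to it. The input schema (nested `section` elements with a `title` child), the `toc`/`entry` output elements, and the function names are all assumptions made for this example; they are not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: nested <section> elements, each with a <title> child.
SAMPLE = """<document>
  <section><title>Introduction</title></section>
  <section><title>Methods</title>
    <section><title>Data</title></section>
  </section>
</document>"""


def extract_toc(element, depth=0):
    """Return (depth, title) pairs for every <section> below element, in document order."""
    entries = []
    for section in element.findall("section"):
        title = section.findtext("title", default="(untitled)").strip()
        entries.append((depth, title))
        entries.extend(extract_toc(section, depth + 1))
    return entries


def enrich_with_toc(xml_text):
    """Parse a document, insert a <toc> element as its first child, return the root."""
    root = ET.fromstring(xml_text)
    toc = ET.Element("toc")
    for depth, title in extract_toc(root):
        entry = ET.SubElement(toc, "entry", level=str(depth))
        entry.text = title
    root.insert(0, toc)
    return root


if __name__ == "__main__":
    enriched = enrich_with_toc(SAMPLE)
    for entry in enriched.find("toc"):
        # Indent each entry by its nesting level.
        print("  " * int(entry.get("level")) + entry.text)
```

The original content is left untouched; the extracted structure is added alongside it as metadata.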

Preserving the Pipeline

Reuse of digital content asks for metadata on both the content and how it was transformed to its most recent form
Document process preservation helps
Architecture-independent description of the intent behind a document process
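One way to picture "metadata on how the content was transformed" is a pipeline that is declared as data and leaves a provenance record for every step it runs. The following is a minimal sketch under that assumption; the step names, the `intent` strings, and the record fields are hypothetical and do not describe the SHAMAN implementation.

```python
import hashlib
import xml.etree.ElementTree as ET


def digest(root):
    """SHA-256 of the serialized tree, tying a provenance record to a concrete state."""
    return hashlib.sha256(ET.tostring(root, encoding="utf-8")).hexdigest()


def normalize_whitespace(root):
    """Collapse runs of whitespace in element text."""
    for element in root.iter():
        if element.text:
            element.text = " ".join(element.text.split())
    return root


def tag_language(root):
    """Record a (hard-coded, illustrative) language attribute on the root."""
    root.set("lang", "en")
    return root


# The process is described as data: what each step is called and why it exists.
PIPELINE = [
    {"name": "normalize-whitespace",
     "intent": "make text comparable across renderings",
     "apply": normalize_whitespace},
    {"name": "tag-language",
     "intent": "add document-level metadata for reuse",
     "apply": tag_language},
]


def run_pipeline(root, pipeline):
    """Apply each step in order and return the result with its provenance trail."""
    provenance = []
    for step in pipeline:
        before = digest(root)
        root = step["apply"](root)
        provenance.append({
            "step": step["name"],
            "intent": step["intent"],
            "input_sha256": before,
            "output_sha256": digest(root),
        })
    return root, provenance
```

Because the description names steps and intents rather than a particular runtime, the same list could in principle be replayed on a different architecture, and the digests show which state each step started from and produced.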

An XML Processing Pipeline

Deployment

Translation of abstract description of workflow
Eclipse Modeling Framework generates Python source code
Grid implementation using iRODS

Integrated Rule-Oriented Data System
Policy-based data grid software system

Current experiment using Amazon Web Services

Computational Requirements of Digital Preservation

Conversion

Steps of a workflow are computationally expensive
XSLT processors

Processing a single large document tree can take hours
Deep parsing and named entity recognition

May involve high-complexity natural language processing

Ad-hoc computations

Learning

A step towards digital curation
SaaS approach to digital curation

Indexing by Lucene/Nutch
Collection-level metadata extraction by Mahout

Preservation Workflow in the Cloud

MapReduce and Deployment

No internal dependencies for the processes
Designed process is exported via the EMF interface to Python
Simple MapReduce driver to execute the process on individual documents
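Since the processes have no internal dependencies, each document can be handled in its own map task and the reduce step only has to collect results. The sketch below imitates that shape locally: a thread pool stands in for the cluster, and `process_document` is a placeholder for the exported process (here it merely counts elements). None of the names come from the slides, and a real deployment would use a MapReduce framework rather than threads.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def process_document(xml_text):
    """Placeholder for the exported process: counts the elements in one document."""
    root = ET.fromstring(xml_text)
    return sum(1 for _ in root.iter())


def map_document(item):
    """Map step: process one (doc_id, xml_text) pair and emit a keyed result."""
    doc_id, xml_text = item
    try:
        return ("ok", (doc_id, process_document(xml_text)))
    except ET.ParseError as error:
        # A malformed document must not stop the rest of the collection.
        return ("failed", (doc_id, str(error)))


def run_job(documents, workers=4):
    """Driver: run the map step over all documents, then group results by key."""
    grouped = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for key, value in pool.map(map_document, documents.items()):
            grouped[key].append(value)  # reduce step: collect per key
    return dict(grouped)


if __name__ == "__main__":
    collection = {"a.xml": "<doc><p/><p/></doc>", "b.xml": "<doc>"}
    print(run_job(collection))
```

Keeping failures as ordinary keyed output, rather than raising, lets a large batch finish and leaves a list of documents to re-examine afterwards.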

The Proposed Architecture

Experimental Results

Figure: Comparison of average cost of computations with different collection sizes (x-axis: Number of Processing Cores, 1, 4, 10, 20, 40, 80; series: 100, 1000, 10000)

Running time

Figure: Comparison of running times with different collection sizes (x-axis: Number of Processing Cores, 1, 4, 10, 20, 40, 80; series: 100, 1000, 10000)

Open Issues

Obstacles to Adoption

Persistence and high-reliability
MapReduce
Not just a technological issue

Service-level agreement
Particularly problematic
Another EU FP7 project working on it: SLA@SOI
Niche for alternative cloud providers

Conclusions

Acknowledgment

Work has been funded by Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN), an EU FP7 large integrated project
http://shaman-ip.eu/shaman/

Summary

Digital preservation is an attractive area to be offered as SaaS

Computational needs
Expertise
Complexity

Since persistence requires architecture-independence, cloud adoption is straightforward
High-reliability can be an issue
Service-level agreements need further research
