Using Archivematica to preserve research data

Filling the Digital Preservation Gap

Chris Awre (@clawre) and Jenny Mitcham (@Jenny_Mitcham)

7th or 8th November 2016

Introduction: who are we and why are we doing this?

Research at Hull and York

• University of Hull:
– 5 Faculties, 11 Schools
– c. 22,000 students
– 62% research classed as 3* or 4* in REF 2014
– In top 50 UK institutions by ‘research power’

• University of York:
– 30+ academic departments
– c. 16,000 students
– Ranked in the top ten of UK universities for research council income
– Secured £46 million in research council income in 2014/15

Why do we need digital preservation for research data?

We can’t ignore digital preservation – moving targets for data retention mean we need to take this seriously

Funder requirements around retention:
• NERC - data should be retained for a minimum of 10 years but for projects of major importance this may need to be 20 years or longer
• STFC - expect data to be retained for a minimum of 10 years and data that cannot be re-measured should be retained indefinitely
• Wellcome Trust – expect data to be kept for a minimum of 10 years but suggest longer periods for certain types of data

Why do we need digital preservation for research data?

University of York RDM questionnaire 2013:

Which data management issues have you come across in your research over the last five years?

“Inability to read files in old software formats on old media or because of expired software licences”

24% of the 181 researchers who answered this question said this had been a problem for them

Jisc Research Data Spring initiative

• A three phase funding programme starting in March 2015
• Looking for ideas for technical tools to help with RDM
• Ideas crowd-sourced and voted on by anyone interested
• Best ideas invited to pitch for funding
• See https://www.jisc.ac.uk/rd/projects/research-data-spring for more information

Project aim - our pitch!

“…to investigate Archivematica and explore how it might be used to provide digital preservation functionality within a wider infrastructure for Research Data Management.”

Project structure

• Phase 1 – explore: testing, research, thinking (3 months)
• Phase 2 – develop: make Archivematica better for RDM, plan implementation (4 months)
• Phase 3 – implement: set up proofs of concept at York and Hull and further investigate the file format problem (6 months)

The team

University of Hull:
• Chris Awre – Head of Information Services, Library and Learning Innovation
• Richard Green – Independent Consultant
• Simon Wilson – University Archivist

University of York:
• Julie Allinson – Manager, Digital York
• Jen Mitcham – Digital Archivist

Phase 1 - Explore

What were we trying to achieve?

A feasibility study:
• What does research data look like?
• What does Archivematica do to preserve data?
• How can Archivematica integrate with our other RDM systems?
• What are our institutional requirements for digital preservation?
• Does Archivematica meet those requirements?
• Where does Archivematica fall short?

All written up and published in a report…

http://dx.doi.org/10.6084/m9.figshare.1481170

What is Archivematica?

● Free and open-source digital preservation system (AGPLv3) designed to maintain standards-based, long-term access to digital objects
● Allows users to process digital objects from ingest to access using the OAIS functional model
● Implements format normalization upon ingest and preserves originals to support emulation and migration strategies

What is Archivematica?

● Archivematica is a processing pipeline consisting of a bundle of open-source tools and Python scripts which deliver a series of preservation micro-services
● Archivematica is designed to output high-quality, standards-compliant Archival Information Packages (AIPs)
– BagIt, METS, PREMIS

Archivematica development partners

and more!

Why would we recommend Archivematica for RDM?

• It is flexible and can be configured in different ways for different institutional needs and workflows

• It allows many of the tasks around digital preservation to be carried out in an automated fashion

• It can be used alongside other existing systems as part of a wider workflow for research data

• It is a good digital preservation solution for those with limited resources

• It is an evolving solution that is continually driven and enhanced by and for the digital preservation community

• It gives institutions greater confidence that they will be able to continue to provide access to usable copies of research data over time

What are the downsides?

• It isn’t a magic bullet

• There is no guarantee your data will be readable in the future

• It can only be as good as current digital preservation practice

• It can be fiddly to install correctly

• The GUI isn’t that intuitive

• You need staff who understand it

How could you use Archivematica?

• Host it in-house and link it to an existing repository/access system (for example DSpace, CONTENTdm, Fedora/Hydra ...or a CRIS)

• Host it in-house and use as a standalone system (you would need to have a storage system in place and establish a way of facilitating access to the data)

• Sign up for a hosted instance of Archivematica with archivesDIRECT (combines Archivematica with DuraCloud storage)

• Sign up for a hosted instance of Archivematica with Arkivum (combines Archivematica with Arkivum storage)

Phase 2 - Develop

Phase 2 development work

• Six different areas of work
• Development carried out by Artefactual Systems from July 2015 to January 2016
• Weekly Google Hangouts to report on progress
• Will be available in Archivematica soon...

Deliverable 1

Problem: Research data needs to be kept, but we don’t know if anyone will ever want it and it might be *massive*

The Solution: enable the DIP to be generated ‘on request’ and not as part of the initial ingest

Deliverable 2

Problem: We want to be able to grab the DIP, and metadata about it, and pull it into our repository

The Solution: a library to help with parsing and creating METS files
https://github.com/artefactual-labs/mets-reader-writer
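To make the METS parsing step concrete, here is a minimal sketch that lists the files recorded in a METS `fileSec`. It uses only the Python standard library rather than the mets-reader-writer library itself, and the sample document, file IDs, paths and the `list_files` helper are all hypothetical, invented for illustration.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespace URIs.
NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, heavily simplified METS document (not real Archivematica output).
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                            xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="file-1">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"
                     xlink:href="objects/data/results.csv"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="preservation">
      <mets:file ID="file-2">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM"
                     xlink:href="objects/data/results-normalised.csv"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def list_files(mets_xml):
    """Return one dict per mets:file with its ID, file group USE and path."""
    root = ET.fromstring(mets_xml)
    files = []
    for group in root.findall("mets:fileSec/mets:fileGrp", NS):
        use = group.get("USE", "")
        for entry in group.findall("mets:file", NS):
            location = entry.find("mets:FLocat", NS)
            path = location.get(XLINK_HREF) if location is not None else None
            files.append({"id": entry.get("ID"), "use": use, "path": path})
    return files


for record in list_files(SAMPLE_METS):
    print(record["use"], record["path"])
```

A repository integration could use a listing like this to decide which files to pull in, because the paths in the METS file retain the original folder structure.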

Deliverable 3

Problem: We want to be able to report on what we have

The Solution: a search API to answer basic questions about number of files in storage, their formats, date of ingest etc.

Deliverable 4

Problem: With large datasets, the current checksum mechanism in Archivematica could be a bottleneck

The Solution: support for multiple checksum algorithms
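As an illustration of why a choice of algorithm matters for large datasets, the sketch below computes a checksum with a selectable algorithm, reading the data in chunks so a large file never has to fit in memory. This is not Archivematica's code: the `stream_checksum` function and its parameters are hypothetical, and the algorithm names are simply those accepted by Python's `hashlib`.

```python
import hashlib
import io


def stream_checksum(stream, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a binary stream chunk by chunk with the named hashlib algorithm."""
    digest = hashlib.new(algorithm)
    # iter() with a sentinel keeps calling read() until it returns b"" (end of stream).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# The same content hashed with two different algorithms.
for name in ("md5", "sha256"):
    print(name, stream_checksum(io.BytesIO(b"abc"), name))
```

For a file on disk the same function can be called with `open(path, "rb")`. Faster algorithms reduce processing time on very large files, while stronger ones give more assurance, so the right choice depends on local policy.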

Deliverable 5

Problem: What about all those file formats that Archivematica can’t identify?

The Solution: mechanism for running file identification with multiple tools and a report of unidentified formats.

...and I’ll talk a bit more about file format identification later!
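The idea of trying several identification tools in turn and reporting whatever remains unidentified can be sketched as follows. The two "tools" here are toy stand-ins (a check on leading bytes and a file extension lookup), and every function and table name is hypothetical; this is not the mechanism implemented in Archivematica.

```python
# Toy signature table: leading bytes that mark two well-known formats.
MAGIC_PREFIXES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
]

# Toy extension table used as a fallback identifier.
EXTENSION_LABELS = {"csv": "CSV", "txt": "Plain text"}


def magic_identifier(name, data):
    """Identify by leading bytes; return None when nothing matches."""
    for prefix, label in MAGIC_PREFIXES:
        if data.startswith(prefix):
            return label
    return None


def extension_identifier(name, data):
    """Identify by file extension; return None when nothing matches."""
    if "." not in name:
        return None
    return EXTENSION_LABELS.get(name.rsplit(".", 1)[-1].lower())


def identify(name, data, identifiers):
    """Try each identifier in order and return the first label found."""
    for identifier in identifiers:
        label = identifier(name, data)
        if label is not None:
            return label
    return None


def report_unidentified(files, identifiers):
    """Return the sorted names of files that no identifier recognised."""
    return sorted(
        name for name, data in files.items()
        if identify(name, data, identifiers) is None
    )


sample_files = {
    "paper.pdf": b"%PDF-1.4 ...",
    "table.csv": b"a,b\n1,2\n",
    "run.dat": b"\x00\x01\x02",
}
print(report_unidentified(sample_files, [magic_identifier, extension_identifier]))
```

Ordering the identifiers from most to least reliable means a weak match, such as an extension, is only used when a stronger one is unavailable.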

Deliverable 6

Problem: We want to make it easier for institutions to adopt Archivematica

The Solution: a webinar describing Archivematica’s Automation Tools

What worked well

• Artefactual staff were good people to work with (and patient)
• Artefactual have the bigger picture in mind and really want to understand the use cases
• Our work builds on work that others have done and is being used and built on by future work that is in the pipeline:
– Search API work being looked at by Bentley Historical Library
– DIP generation by Simon Fraser University

What didn’t work well

• Many of the areas of development were only partially solved through our work:
– the problems were big and complex – what does success look like?
– perhaps we tried to do too much?
• It was hard to prove the impact of our checksum work
• Solving the file identification problem is a huge task and needed more thought....

Implementation plans

Hull and York also worked on implementation plans for Archivematica. This was key because….

Deciding to use a system is easy...deciding exactly how to use it is much harder

Separate plans created for Hull and York as different RDM systems were in place and there were different institutional needs and priorities.

Phase 3 - Implement

York p-o-c implementation

York wanted to provide:
• an easy way of depositing data
• a way of monitoring datasets for RDM staff
• a way of requesting access to data

with:
• data sent to Archivematica
• dataset metadata pulled from PURE

York p-o-c implementation

Metadata from PURE pulled in nightly or on-demand

Fedora objects created for the dataset to store local admin info and help connect the PURE and Archivematica records

Visual representation of workflow status

York PCDM modelling

Dataset = Dataset record from PURE

Individual data files are stored, but the folder structure is not

Folder structure available in Archivematica METS

Dataset can be made up of multiple ‘Packages’ of data, e.g. a newer version

What next in York?

• Our RDM staff love the p-o-c and we have agreement to turn it into a production system over the autumn/winter
• This has been a helpful exercise for broader data modelling / Hydra implementation at York
• York is a pilot in the Jisc Shared Service for Research Data and will move forward with this work over the next couple of years

Hull p-o-c implementation

• Hull keen to make Archivematica part of a workflow for any type of repository content – not just research data. You may have seen a poster at Hydra Connect last year:

Hull’s p-o-c implements most of the automated bulk ingest route, creates AIP(s) and builds repository objects from the DIP(s)

Hull p-o-c implementation

• User assembles files and simple descriptive file(s) in Box folder. Shares the folder with Archivematica

• System checks folder contents and if OK creates a bag (BagIt standard) for each object which is passed to Archivematica

• Archivematica processes the bag to create an AIP which goes to a preservation store…

• …and also a DIP which is passed to the DIP processor

• DIP processor creates Hydra objects from the DIP contents and injects them into the repository QA queue…

• …matched to the AIP by UUID
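The bag creation step in the workflow above can be illustrated with a minimal sketch that wraps a folder in the basic BagIt layout: a `data/` payload directory, a checksum manifest and a `bagit.txt` declaration. It is not the Hull or Cottage Labs code. The function names, the choice of SHA-256 and the BagIt version written are all assumptions, and optional tag files such as `bag-info.txt` are left out.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def make_minimal_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data and write a manifest and bag declaration."""
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    payload_dir = bag_dir / "data"
    # copytree creates bag_dir and data/; it raises an error if data/ already exists.
    shutil.copytree(source_dir, payload_dir)

    lines = []
    for path in sorted(p for p in payload_dir.rglob("*") if p.is_file()):
        # read_bytes() loads the whole file; large files would need chunked hashing.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}\n")

    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    return bag_dir


def demo():
    """Bag a one-file deposit in a temporary directory and return the manifest text."""
    with tempfile.TemporaryDirectory() as tmp:
        deposit = Path(tmp) / "deposit"
        deposit.mkdir()
        (deposit / "readme.txt").write_bytes(b"abc")
        bag = make_minimal_bag(deposit, Path(tmp) / "bag")
        return (bag / "manifest-sha256.txt").read_text(encoding="utf-8")


print(demo())
```

The manifest is what allows the receiving system to confirm that every file arrived intact before processing begins.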

Thanks to Cottage Labs for all the new development work!

Hull p-o-c options

• Depositors have several options:

• A folder containing multiple data files and one descriptive file ➔ a single AIP and a single repository object with (optionally) one or more surrogate files for download (so can be a “metadata-only” record)

• A folder containing multiple files and a csv file (one row per file) ➔ multiple AIPs with multiple repository objects, each with (optionally) a surrogate for download

• A folder containing the top-level folder of a structure ➔ a zipped structure in a single AIP and a single repository object (optionally) containing the zipped file for download

What’s next in Hull?

• We hope to be able to take the p-o-c work and turn it into a production system

•Hull is the UK’s “City of Culture” next year and there will be a great deal of digital material that the University Archives want to capture for posterity

Phase 3 - The file formats problem

File formats problem

Research data file formats are:
• Numerous
• Sometimes a bit obscure
• Sometimes very big
• Ever-changing
• Often very new

This means they can be hard to preserve... The first hurdle is that we can’t identify them. If we can’t identify them how can we carry out preservation activities?

Top research data applications at York

Can we identify our research data?

We ran Droid* over the research data deposited with Research Data York over the past year.

Out of 3752 individual files:
• only 37% (1382) of the files were identified (with varying degrees of accuracy)
• there were 34 different identified file formats in the sample

* Droid is a free tool from The National Archives that can be used to automatically identify file formats

Unidentified research data files

Files not identified by Droid (listed by file ext):
– 107 different file extensions not identified
– huge number with no extension (help!)
– how do we solve the .dat file problem?
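A tally like the one summarised above can be produced with a few lines of Python. This sketch assumes the unidentified files are available as a plain list of paths (it does not read Droid's own export format), and the function names and sample paths are hypothetical. It also reproduces the identification rate quoted earlier from the figures 1382 and 3752.

```python
from collections import Counter
from pathlib import PurePosixPath


def tally_extensions(paths):
    """Count files per lower-cased extension, grouping those with none."""
    counts = Counter()
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        counts[suffix or "(no extension)"] += 1
    return counts


def identified_percentage(identified, total):
    """Share of files identified, rounded to the nearest whole percent."""
    return round(100 * identified / total)


# Hypothetical paths standing in for a list of unidentified files.
unidentified = ["exp1/run1.dat", "exp1/run2.DAT", "exp2/notes", "exp2/model.xyz"]
for extension, count in tally_extensions(unidentified).most_common():
    print(extension, count)

print(identified_percentage(1382, 3752), "% identified")
```

Sorting by count shows which extensions are worth investigating first when deciding where new file signatures would have the most effect.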

Supporting signature development at The National Archives

Creating our own signatures

Conclusions and where to find out more

Impact

“In many ways the project at York and Hull felt like a precursor to the Shared Services pilot; highlighting both the potential problems in working with a wide range of stakeholders and systems, as well as the massive benefits possible from pooling our collective knowledge and resources to tackle the technical challenges which remain in RDM.”

From ‘Unlocking Research’ blog from the University of Cambridge Office of Scholarly Communication (16 September 2016)

“I've just read your paper on linking repositories and Archivematica - fascinating and full of very useful information! I will certainly be following up with many of the links and ideas you presented, especially the Jisc Research Data Shared Service work on digital preservation, and I will discuss with my colleagues the possibility of using Archivematica at our data centre and our options for collaboration. Many of our issues relate to the long tail of research data and the preservation of data already archived.”

Comment on Digital Archiving blog (18 October 2016)

Challenges

• Impact of short, but focused, timeframes and short lead-in times

• Access to appropriate skills (mainly technical development) limited scope of work

• Limited budget, hence ‘parsimonious’ approach to making the best use of this

• Interpretations of digital archiving across preservation, RDM and IT communities

• Balancing dissemination with actual doing!

What have we learned?

• Archivematica can be used to manage the preservation of research data

• And that this can be embedded within similar, but different, institutional workflows

• There is benefit in getting focused systems to do what they do best rather than adding functionality to any particular one

• There is a file format recognition issue that will affect long-term preservation of research data files

• But there is a way to address this through the development of additional file signatures

• There is real benefit in working collaboratively in this area, both within the project and beyond it, to identify common ways of tackling the problem of preserving research data

Jisc Research Data Shared Service

• Almost in parallel with the Research Data Spring projects Jisc were planning a Research Data Shared Service
• The resulting system will be managed and hosted, and will offer three core modules: repository, preservation and reporting
• Phase 1 and 2 reports from Hull and York very influential for the preservation module
• Commercial and open source offerings for each module, including Archivematica (for preservation) and Hydra (for the repo)
• Over 20 pilot institutions recruited (including York) – all identified preservation as a priority

• https://www.jisc.ac.uk/rd/projects/research-data-shared-service

Further information

Website: http://www.york.ac.uk/borthwick/projects/archivematica
Blog: http://digital-archiving.blogspot.co.uk/
Reports: https://figshare.com/
