
Post on 17-May-2015


‘Persistently’ Identifying Formats

PRONOM, DROID and the NDHA

Jay Gattuso Digital Preservation Analyst

National Digital Heritage Archive

National Library of New Zealand

Summary

How Rosetta uses DROID

How DROID has changed

Research NDHA completed

Results

Recommendations

DROID & PRONOM

• PRONOM is the most widely used file format registry in the sector

• DROID is a tool that ‘identifies’ file types (based on PRONOM records)

• Both are from TNA (UK)

• DROID Signature v59
– 551 signature sets
– 864 file type records

EP/1958/2520-F Registry, Hunter Building, Victoria University of Wellington

Photograph taken for the Evening Post newspaper, 31 Jul 1958 Alexander Turnbull Library

www.nationalarchives.gov.uk/PRONOM/Default.aspx

Rosetta – A Brief History

• NLNZ Digital Preservation Repository

• 4 years since inception

• 18 months out of project

• 8 significant upgrades/software revisions

• ~6 Million digital objects to date

• Backbone of the ANZ GDAP

1/1-000008-G Smiley's stables and horse repository, Whanganui

Harding, William James, 1826-1899: Negatives of Wanganui district. Alexander Turnbull Library

Write Once, Read Many

Inside Rosetta, format identification is a ‘WORM’ process.

As part of the ingest routine, format identification is undertaken automatically, written to the file records and the system database, and used thereafter as a consistent ‘label’.

E-272-f-001 Abbot, John 1751-1840:

Original drawings of insects by J Abott. [1816?] Alexander Turnbull Library

We rely on the persistence of the label to accurately plan activities and ‘measure’ the content or shape of the repository.
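The write-once behaviour described above can be sketched roughly as follows. This is a minimal illustration only, not Rosetta's implementation: the class names, field names and the in-memory dictionary are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FormatLabel:
    """Hypothetical record of one format identification made at ingest."""
    puid: str               # PRONOM identifier, e.g. "fmt/12"
    tool_version: str       # version of the identification tool used
    signature_version: str  # version of the signature file used


class WriteOnceLabels:
    """Keeps one label per digital object and refuses to overwrite it."""

    def __init__(self) -> None:
        self._labels: Dict[str, FormatLabel] = {}

    def ingest(self, object_id: str, label: FormatLabel) -> None:
        # 'Write once': a second write for the same object is an error.
        if object_id in self._labels:
            raise ValueError(f"{object_id} already has a format label")
        self._labels[object_id] = label

    def label_for(self, object_id: str) -> FormatLabel:
        # 'Read many': every later activity reads the same stored label.
        return self._labels[object_id]
```

Storing the tool and signature versions next to the PUID is a design choice of this sketch; it makes it possible to tell later which 'sorter' produced a given label.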

Behaviours and functions based on DROID format assertions

Rosetta uses DROID to automatically establish format type.

Rosetta Overview

Validation Stack

Automated Format Identification via DROID

Shape Sorting...

Where:

• The area inside the box is Rosetta

• Each block is a DO

• Each shape is a format

• The ‘Sorter’ is DROID

Shape Sorting...

Process:

• A record is kept of the ‘shape’ through which the DO entered the box

• The record is used by the system to trigger activities

• The DO can be removed from the box using the same shaped hole it used on entry

Shape Sorting...

Expectations:

• The ‘Sorter’ never changes

• The blocks never change

• A DO placed in the box yesterday will be the same shape tomorrow

• A DO placed in the box yesterday will be extractable via the shape tomorrow
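One way to test these expectations is to re-run identification with a newer 'sorter' and compare the result with the label recorded at ingest. The sketch below assumes a hypothetical `identify` callable standing in for an identification tool; it is not DROID's API, and the PUIDs in the usage note are illustrative rather than observed results.

```python
from typing import Callable, Dict, List, Tuple


def find_label_drift(
    stored: Dict[str, str],
    identify: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (object_id, stored_puid, new_puid) for every object whose
    re-identification no longer matches the label recorded at ingest."""
    drift = []
    for object_id, old_puid in sorted(stored.items()):
        new_puid = identify(object_id)  # hypothetical newer 'sorter'
        if new_puid != old_puid:
            drift.append((object_id, old_puid, new_puid))
    return drift
```

For example, with `stored = {"do-1": "fmt/12", "do-2": "x-fmt/263"}` and a newer sorter that reports `"fmt/189"` for `do-2`, the function returns a single entry for `do-2`. An empty list means the 'same shape tomorrow' expectation held for that set.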

Shape Sorting...

The reality for NDHA:

• DROID has undergone 2 major revisions

• Container signatures have been included

• Since Rosetta v1 release:
– 406 new formats
– 600 changes to signatures
– (This is generally a good thing!)

• Rosetta has used DROID versions 3 and 5, currently testing with 6

• Rosetta has used DROID signature versions v13, v37, v45 and v49, testing with v52

• Proposal to use a new DROID method in Rosetta

• How has/will this affect the way we characterise Digital Objects at the NDHA?

Identifying and Quantifying Change

EP/1958/0585-F Signature of Queen Elizabeth II in a visitors book

Negatives of the Evening Post newspaper. Feb 1958 Alexander Turnbull Library

• Source set:
– 26,000 digital objects
– ~600 GB of content
– spanning 61 format types
– all from the live system

• DROID v3, DROID v5, DROID v6 and DROID v6 ‘FAST’ tested

• Signatures v13, v37, v45, v49 and v50 tested

• All files tested with and without file extensions
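The test design above can be written out as a run matrix. The sketch assumes a full cross product of tool build, signature version and extension state; the slides do not confirm that every combination was run (some tool builds may not accept every signature version), so the totals are indicative only. The labels used for the builds are hypothetical.

```python
from itertools import product
from typing import List, Tuple

# Values taken from the slide; the string labels are illustrative.
DROID_BUILDS = ["v3", "v5", "v6", "v6-FAST"]
SIGNATURE_VERSIONS = ["v13", "v37", "v45", "v49", "v50"]
WITH_EXTENSION = [True, False]


def build_run_matrix() -> List[Tuple[str, str, bool]]:
    """Every (build, signature, keep_extension) combination,
    assuming a full cross product."""
    return list(product(DROID_BUILDS, SIGNATURE_VERSIONS, WITH_EXTENSION))


def expected_assertions(object_count: int) -> int:
    """One assertion per digital object per run."""
    return object_count * len(build_run_matrix())
```

Under that assumption there are 4 × 5 × 2 = 40 runs per object, and 26,000 objects give 1,040,000 assertions, which is in line with the roughly 1 million assertions reported.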

Identifying and Quantifying Change

EP/1990/0432/29-F New school patrol system being tested, Wellington

Photograph taken by John Nicholson ca 2 Feb 1990

Alexander Turnbull Library

• 1 million DROID ‘assertions’ captured

• Python and MySQL used to sort, clean, filter, draw graphics and otherwise interpret results

• Paper completed and will be available on the OPF website

www.openplanetsfoundation.org

Identifying and Quantifying Change

DCDL-0004533 Eric Idle. 5 December, 2007.

Webb, Murray, 1947- : Digital caricatures published from 29 July 2005 onwards

Alexander Turnbull Library

Summary of Results

Of the 61 tested file types :

75% performed identically for all tested versions of DROID and signature versions

fmt/49 (RTF 1.4)

Summary of Results

Of the 61 tested file types :

40% consistently offered a single PUID across the range of DROID tests

By extension: gif, avi, png, jpg, html, xml, bmp, wp, and some subsets of doc, ppt and exe

fmt/12 (PNG 1.1)

Summary of Results

Of the 61 tested file types :

In 26% of the file types multiple PUIDs are equally asserted by DROID at various times.

By extension: docx,xlsx,pptx, some pdf, doc, xls, ppt, txt, log, aiff, and arc

fmt/7 (TIF format)
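The split between file types that keep a single PUID and those that attract several can be sketched as a grouping over the captured assertions. This is one reading of the analysis, not the code used for the paper: the row layout `(extension, run_id, puid)` and the function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Assertion = Tuple[str, str, str]  # (extension, run_id, puid), hypothetical layout


def puids_by_extension(assertions: Iterable[Assertion]) -> Dict[str, Set[str]]:
    """Collect every PUID asserted for each extension across all runs."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for extension, _run_id, puid in assertions:
        seen[extension].add(puid)
    return dict(seen)


def split_by_consistency(
    assertions: Iterable[Assertion],
) -> Tuple[List[str], List[str]]:
    """Return (single_puid_extensions, multiple_puid_extensions)."""
    single, multiple = [], []
    for extension, puids in sorted(puids_by_extension(assertions).items()):
        (single if len(puids) == 1 else multiple).append(extension)
    return single, multiple
```

Grouping by extension mirrors the 'By extension' lists on the slides; grouping by the PUID recorded at ingest would be an equally reasonable choice.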

Summary of Results

Of the 61 tested file types :

In 16% of the file types, DROID version 6 in ‘FAST’ mode performs differently from DROID version 6 in standard mode

By extension: epub, mp4, flac, wav, zip and some subsets of pdf, xls, tif and exe

fmt/6 (Waveform Audio)
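A comparison of this kind can be sketched by pairing, for each object, the PUID from standard mode with the PUID from ‘FAST’ mode and reporting the share of extensions where any object differs. The row layout and function name are assumptions, and the sketch says nothing about why the two modes disagree.

```python
from typing import Iterable, List, Tuple

# (object_id, extension, standard_puid, fast_puid), hypothetical layout
ModeRow = Tuple[str, str, str, str]


def extensions_changed_by_mode(rows: Iterable[ModeRow]) -> Tuple[List[str], float]:
    """Return the extensions with at least one differing assertion,
    and their share of all extensions seen."""
    all_extensions = set()
    changed = set()
    for _object_id, extension, standard_puid, fast_puid in rows:
        all_extensions.add(extension)
        if standard_puid != fast_puid:
            changed.add(extension)
    share = len(changed) / len(all_extensions) if all_extensions else 0.0
    return sorted(changed), share
```

An extension counts as changed here as soon as one object differs, which is one way to accommodate the 'some subsets of' wording on the slide; a stricter rule would require every object of that extension to differ.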

Recommendation 1

There is a clear need for a community owned dataset that spans the PRONOM catalogue to support testing

(This should be community created)

ExL-fmt/62 - fmt/189 (MS Office Open XML 2007)

Recommendation 2

It is strongly recommended that more research is undertaken into the persistence of PUIDs, to give a more complete history of file type assertions by PRONOM/DROID

fmt/14 (PDF 1.0)

Recommendation 3

Given the variances observed, especially with DROID v6 ‘FAST’ mode, it is recommended that all signatures are robustly tested prior to release, and efforts are made to maintain consistency with legacy signatures, and limit impact on users

x-fmt/263 (ZIP format)

Recap

How Rosetta uses DROID

How DROID has changed

Research NDHA completed

Results

Recommendations

Thank you

jay.gattuso@dia.govt.nz

Rosetta demo – Wednesday 28th March 9am to 1pm @ NLNZ - 77 Thorndon Quay

Paper available through the Open Planets Website www.openplanetsfoundation.org
