People Mashing: What we did in the AQuA Project
Paul Wheatley
(and)
Andrew Jackson, Bo Middleton, Jodie Double, Rebecca McGuinness
2
Background
Increasing recognition of need for quite basic preservation tools:
What have we got?
Is it of sufficient quality?
Have I just broken it?
Characterisation and Quality Assurance (QA) in three main areas:
QA of digitised material. Eg: missing pages, out of focus pages, duplicate pages, inconsistent metadata, incorrect cropping, “thumb in picture”…
QA of processing/storage/handling. Eg: unpacking containers, moving content, processing, storage…
Identification of preservation risks. Eg: non-embedded fonts in PDFs, Kakadu produced JPEG2000s missing resolution information…
Particularly the case when working at scale: Move from thousands to millions of objects shows up processes that are not as solid as we thought
Causes: software errors, disks get full, networks drop out, human error…
Human QA does not scale well and is costly: automation is the solution
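A minimal sketch of automating the "Have I just broken it?" check after content has been moved or copied. This is an illustration only, not a tool from the AQuA project: the function names `sha256_of` and `find_broken_copies` are hypothetical, and SHA-256 is an assumed choice of checksum.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_broken_copies(source_dir, target_dir):
    """List (relative path, problem) pairs for files that are missing
    from target_dir or whose content differs from the source copy."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    problems = []
    for source in sorted(source_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(source_dir)
        target = target_dir / relative
        if not target.is_file():
            problems.append((relative.as_posix(), "missing"))
        elif sha256_of(source) != sha256_of(target):
            problems.append((relative.as_posix(), "checksum mismatch"))
    return problems
```

Run against the original and the moved copy, an empty list means every file arrived intact. A full disk or a dropped network connection would typically show up as "missing" or "checksum mismatch" entries.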
3
Digitised newspaper: Quality Assurance miss
4
Migration from TIFF to JPEG2000, corruption example
5
How do we move things forward…
Characterisation and QA challenges
Anecdotal evidence from colleagues suggested we were not alone
Existing digital preservation tools aren’t meeting the challenge
Open Source possibilities. Which tools, how effective are they, how do we make them work to solve our problems?
6
AQuA: Automating Quality Assurance Project
Project funding opportunity from JISC
Focus on putting existing tools into use and solving DP problems
Partnership with Universities of Leeds and York, The Open Planets Foundation (OPF) and the British Library (BL) : AQuA
Constraints: modest funding, 6 month project length, no lead time
How could we make this work?
7
Staffing requirements
Skills and understanding required:
A good understanding of the specific QA and preservation challenges faced by UK institutions
Access to samples of problematic digital collections where these challenges were present, to support solution testing
Knowledge of likely open source toolsets that might provide useful solutions
Effort to progress, test, evaluate the new tools
Solution:
Event focussed approach that would utilise the knowledge and the expertise of a community rather than of a few individuals.
The project effort would focus on facilitating events at which our attendees would deliver the results
8
Existing event formats
Hackathons: technical focus, not heavily structured, prizes for challenges. OPF events, DevCSI (UK)
Mashups: technical focus, combining open data sources in innovative ways to create new services
Wikiathons: non-technical, pooling effort of attendees to update and add detail to wiki data
Unconference events: agile, participant driven workshops and discussions, CURATE Camp (USA)
AQuA approach borrows elements from each and adds some structure and a mix of technical and non-technical participants
9
AQuA events
Project held two events in the UK, Leeds and London
20-30 people, 3 days long
Strict participant roles: No observers! Participation is mandatory!
Techies/hackers, some with DP experience, some programmers from a library or archive background who hadn't worked on DP
Practitioners/collection owners, who were asked to bring along at least one sample of a digital collection
Spoke to all attendees in advance to make sure they knew what to expect and what we expected of them at the event
Worked in an agile manner. Lots of lightning talks, knowledge exchange, group brainstorming, as well as hacking time
10
AQuA Event format:
Day 1: Capturing the preservation challenges
Introductions, learn about collections
Brainstorming and recording collection challenges and initial thoughts about solutions
Teamed up the participants in practitioner – techie pairs. These pairs would then work together across the 3 days
Day 2: Hacking and mashing
Techies – developing solutions
Practitioners – capturing and recording their requirements
Workshop/discussion sessions on a variety of topics
Lots of reporting back to facilitate knowledge exchange
Day 3: Wrap up, results, reporting back and evaluation
Completing developments
Presentations and demos
Evaluating the preservation solutions against their requirements
Evaluating the event
11
Capturing the outcomes
Short events, important to capture the outcomes
Checking in software developments to the GitHub code repository
Emphasis throughout event on writing up results in a wiki
Collection - Issues - Solution structure
Tool list of some 50 mainly open source tools used at the two events
http://wiki.opf-labs.org/display/AQuA
12
Example results
No time to describe all of them - 20 digital collections examined, 40 different preservation issues described, 25 different solutions implemented. http://bit.ly/ufyk4R
Collection: BL audio field recordings submitted by the British public, created on a variety of recording devices, in a cross section of esoteric file formats
Issue: Identify and validate sound files comprised of multiple file formats and containers; some with embedded metadata, some corrupted during upload, and some with incorrect file extensions
Solution: (produced by Maurice de Rooij, NANETH) AQUAudio is a wrapper script around the Open Source getID3() PHP-library. It extracts information from audio files, such as audio properties (bitrate, #channels, sample-frequency, etc.) and metadata (ID3v1, ID3v2, BWAVE metadata, etc), and writes the results to an XML file, optionally normalise
13
Example results (2)
Existing digital preservation tools (eg. DROID) support only a handful of audio formats
getID3 is a comprehensive and well supported open source project.
Supports 60 different audio file formats
Used extensively in other products, so it's well tested and reliable, as evidenced by the active support forum
Achieving that level of support and quality from a home grown digital preservation tool would take years of development and funding that we simply don't have
Instead we leveraged an existing solution in 3 days that is now in production use at the British Library
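The kind of check described for the audio collection (identify the real format, flag incorrect file extensions, write the results to XML) can be sketched roughly as below. This is not AQUAudio and does not use getID3: it is a hypothetical Python illustration that recognises only a few formats from their leading bytes, and the names `sniff_format`, `describe`, `report_xml` and the XML element names are all assumptions. Among other gaps, MP3 files without an ID3v2 tag would be reported as "unknown".

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Extensions we would expect for each detected format (illustrative only).
EXPECTED_EXTENSIONS = {
    "wav": {".wav"},
    "aiff": {".aif", ".aiff"},
    "flac": {".flac"},
    "ogg": {".ogg", ".oga"},
    "mp3": {".mp3"},
}


def sniff_format(header):
    """Guess the container format from the first 12 bytes of a file."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"FORM" and header[8:12] == b"AIFF":
        return "aiff"
    if header[:4] == b"fLaC":
        return "flac"
    if header[:4] == b"OggS":
        return "ogg"
    if header[:3] == b"ID3":  # MP3 that starts with an ID3v2 tag
        return "mp3"
    return "unknown"


def describe(path):
    """Return the detected format and whether the extension agrees with it."""
    path = Path(path)
    with open(path, "rb") as handle:
        header = handle.read(12)
    detected = sniff_format(header)
    matches = path.suffix.lower() in EXPECTED_EXTENSIONS.get(detected, set())
    return {"name": path.name, "detected": detected, "extension_matches": matches}


def report_xml(paths):
    """Build an XML report with one <file> element per input path."""
    root = ET.Element("audioReport")
    for path in paths:
        info = describe(path)
        ET.SubElement(
            root,
            "file",
            name=info["name"],
            detected=info["detected"],
            extensionMatches=str(info["extension_matches"]).lower(),
        )
    return ET.tostring(root, encoding="unicode")
```

A file named `recording.mp3` that actually begins with a RIFF/WAVE header would be reported with `detected="wav"` and `extensionMatches="false"`. A mature library covers far more formats and metadata than a sketch like this, which is the point made on the slide.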
14
Useful outputs
Focused way of solving preservation challenges. Some production solutions, some prototypes, some approaches ruled out.
Detailed record of genuine preservation challenges and requirements, that can inform future work
People Mashing / knowledge sharing:
Better understanding between practitioners and techies
Knowledge sharing on tools, approaches, techniques
Community building: sharing expertise, encouraging those with problems to seek help in a non-judgemental environment
Hands on digital preservation training for DP beginners
Participants wanted to continue to develop the fledgling community that was created
15
Thank you!
Any questions?
Further information:
AQuA Project: http://wiki.opf-labs.org/display/AQuA
Open Planets Foundation: http://openplanetsfoundation.org/
Me on twitter: @prwheatley