Digitised collections: Toward a digital strategy for the NHM, London

Digitised collections: Toward a digital strategy for the NHM, London. Vince Smith, Workshop 3, pro-iBiosphere, Berlin, 23 May 2013

Upload: vincent-smith

Post on 10-May-2015

DESCRIPTION

Presented by Vince Smith at the pro-iBiosphere meeting in Berlin, 21-23 May 2013.

TRANSCRIPT

Page 1

Digitised collections: Toward a digital strategy for the NHM, London

Vince Smith

Workshop 3, pro-iBiosphere, Berlin, 23 May 2013

Page 2

Digital Ambition: NHM Science Strategy 2013-2017

A New Voyage of Discovery

Three Focal Areas
1. Scientific discovery
2. Scientific infrastructure
3. Scientific engagement

Five Challenges
1. The Digital NHM
2. Origins, evolution & futures
3. Biodiversity discovery
4. Natural resources & hazards
5. Science, society & skills

Resources & funding

Measuring success

Page 3

data.nhm.ac.uk/globe/

Page 4

Digital Ambition: NHM Science Strategy 2013-2017

Measuring success
• Scientific impact: 1,000 papers in leading journals
• Digital access: 20M specimens available digitally
• Engagement: 1M face-to-face engagements
• Collections: Globally important collections
• Diagnostic tools: Diagnostic tools for key groups
• Deep time: Timeline of key transitions
• Science & society: Articulate the role of science
• UK network: Act as a national museum
• Earth sciences: Earth Sciences Centre
• Funding: £10M for Five Challenge Areas

Page 5

Overview

1. Existing digital content, sources & formats
• Research data
• Collections data

2. Making collections data digital
• Priorities
• Protocols & pathfinder activities
• Crowdsourcing transcription

3. Aggregation & delivery
• The NHM data portal
• Data visualisation, data sub-portals

4. Identifiers, links & interoperability
• DataCite DOIs
• Third party aggregators
• Portal APIs, download & analytical functions

5. Timeline & constraints
• Data policies
• Next steps

Digitisation activities

Data portal

Page 6

NHM Research Outputs

One Month of NHM Science group papers

• 49 papers, 45 available online (4 print only or behind paywalls)
• 9 had supplementary data files
• 39 papers with tables, charts & other data
o >1000 sequences
o 826 figures
o 76 tables
o 1 genome
• No collective view of these data (37 journals)
• No consistent way of citing NHM data
• No consistent mechanism to access data
• Effectively invisible at the institutional level

Data via Carolyn Lowry e-mail, 13th Feb. 2013

1. Existing digital content

Page 7

NHM Collections Outputs: data

• Huge investment in NHM collection management system
• ≠ Imaging
• Most research projects need spatio-temporal records
• Different requirements for different purposes

NHM COLLECTIONS April 2013

Collection area   Estimated no. of specimens   No. records in database   % collection in database   % records with location info
Botany            6,000,000                    626,000                   ~10%                       96%
Entomology        32,000,000                   316,000                   <1%                        68%
Mineralogy        500,000                      422,000                   ~95%                       79%
Palaeontology     9,000,000                    342,000                   ~3%                        89%
Zoology           28,000,000                   1,131,000                 ~60% (via lots)            69%
TOTAL             76,000,000                   2,837,000                 3% (23%)

1. Existing digital content
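As a rough cross-check of the table, the sketch below recomputes database records as a share of estimated specimens, using only the specimen and record counts from the slide. The raw ratios for Mineralogy and Zoology come out lower than the percentages on the slide. The slide figures are estimates, and the Zoology figure is marked "via lots", which presumably means one record can cover many specimens.

```python
# Figures copied from the "NHM COLLECTIONS April 2013" table:
# collection area -> (estimated specimens, records in database)
COLLECTIONS = {
    "Botany": (6_000_000, 626_000),
    "Entomology": (32_000_000, 316_000),
    "Mineralogy": (500_000, 422_000),
    "Palaeontology": (9_000_000, 342_000),
    "Zoology": (28_000_000, 1_131_000),
}


def record_ratio(specimens: int, records: int) -> float:
    """Database records as a percentage of estimated specimens."""
    return 100.0 * records / specimens


def coverage(collections: dict):
    """Per-area ratios plus the ratio over the summed totals."""
    ratios = {area: record_ratio(s, r) for area, (s, r) in collections.items()}
    total_specimens = sum(s for s, _ in collections.values())
    total_records = sum(r for _, r in collections.values())
    ratios["TOTAL"] = record_ratio(total_specimens, total_records)
    return ratios, total_specimens, total_records


ratios, total_specimens, total_records = coverage(COLLECTIONS)
for area, pct in ratios.items():
    print(f"{area:<14}{pct:>6.1f}%")
```

The summed specimen estimate is 75.5 million, which the slide rounds to 76 million.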

Page 8

NHM Collections Outputs: images

• Many, many imaging projects (highly fragmented)
• Circa 40 TB for major collections (excluding library)
• 120,000 images in KE EMu (many others not in KE!)
• Circa 250,000 via NHM Photo unit (limited metadata)

Collection area   No. image files   Disk space (GB)
Botany            140,133           35,302
Entomology        529,106           3,172
Mineralogy        14,000            6
Palaeontology     122,548           993
Zoology           12,975            1,598
TOTAL             818,762           41,070

1. Existing digital content

Page 9

Current data formats

• Darwin Core Archive (DwCA) & extensions (collections)
• Circa 2020 fields mapped to 50 fields to generate archive
• Images mainly JPG & TIFF
• Metadata using EML & Genesis II standard
• Research data files in a wide array of formats (blob files)

Nexus (character data and Newick formatted phylogenetic trees)
Non-NHM specimen lists (as Darwin Core Archive files)
PhyloXML (an XML standard for representing phylogenetic trees)
Output from the Imaging and Analysis Centre (Micro CT data file formats)
NeXML (an XML standard for representing character data)
Collections of images from digitisation projects (as a collection of links or a zipped archive)
Sequence trace files (.scf sequence chromatogram format files)
Environmental sequence files
Taxon checklists (as Darwin Core Archive files)
Collection level descriptions

1. Existing digital content
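The field mapping mentioned above (many internal fields reduced to a small set of archive fields) can be pictured with a minimal sketch. The Darwin Core term names on the right are real terms. The internal field names on the left and the sample record are invented for illustration, and this is not the NHM mapping.

```python
import csv
import io

# Hypothetical internal field names -> real Darwin Core terms.
FIELD_MAP = {
    "RegistrationNumber": "catalogNumber",
    "TaxonName": "scientificName",
    "CollectionDate": "eventDate",
    "Country": "country",
    "Latitude": "decimalLatitude",
    "Longitude": "decimalLongitude",
}


def to_darwin_core(record: dict) -> dict:
    """Keep only mapped fields and rename them to Darwin Core terms."""
    return {dwc: record.get(src, "") for src, dwc in FIELD_MAP.items()}


def write_core_csv(records: list) -> str:
    """Serialise records as a CSV table with Darwin Core column names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(FIELD_MAP.values()), lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(to_darwin_core(record))
    return buffer.getvalue()


# Made-up sample record, not real collection data.
sample = [{
    "RegistrationNumber": "EXAMPLE-0001",
    "TaxonName": "Pieris brassicae",
    "CollectionDate": "1923-06-14",
    "Country": "United Kingdom",
    "Latitude": "51.4967",
    "Longitude": "-0.1764",
    "InternalNote": "not exported",  # unmapped fields are dropped
}]
core_csv = write_core_csv(sample)
print(core_csv)
```

Unmapped internal fields never reach the output, which is the point of mapping to a fixed, shared vocabulary.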

Page 11

Digitisation Priorities

• Priorities linked to science strategic priorities
o Disease, sustainability, crop wild relatives, pests etc.
• Tiered approach, different needs for different collections
• Low hanging fruit (2D objects e.g. herbarium sheets & slides)
• Linked to strategic collaborations & financial opportunities
o e.g. RBG Kew, RBG Edinburgh, Nat. Mus. Wales, Hunterian etc.
• Priorities dictate order – we plan to do it all (eventually)!

2. Making collections data digital

Page 13

Digitisation Protocols

• Exercise to develop digitisation protocols across the collection
o Slides, spirit, herbarium sheets, pinned, multi-specimen/drawer
• Protocols mapped to high level collections descriptions
• Workflow software supporting rapid digitisation (to KE & DAMS)
• Pathfinder activities for less well understood projects
o Entomological dry material (30 M specimens)
- iCollections (specimen-by-specimen) approach
- SatScan (drawer level multi-specimen) approach

2. Making collections data digital

Page 14

iCollections Initiative

• Specimen-by-specimen, traditional, dedicated 6 person team
• Digitising British Isles Lepidoptera collection
• ~500,000 specimens, 5,000 drawers
• Re-curation & specimen imaging
• Complete label information including georeferencing
• For use in Climate Change initiative

2. Making collections data digital

Page 15

iCollections Initiative

• 4-6 people over 3 years, work broken into small tasks by teams
• Average imaging rate 163 specimens per person per day
• Averaging >3 min per specimen (prep., imaging & databasing)
• >£1/specimen
• BUT: 6,800 person years for the entire collection

2. Making collections data digital
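The throughput figures can be turned into a back-of-envelope effort estimate. The sketch below applies the quoted rate of 163 specimens per person per day to the ~500,000 British Isles Lepidoptera specimens. The 220 working days per year is an assumption, not a figure from the slides. The 6,800 person-year figure for the entire collection depends on assumptions the slide does not state, so the sketch does not try to reproduce it.

```python
def person_days(specimens: int, rate_per_person_day: float) -> float:
    """Person-days needed to process `specimens` at a fixed daily rate."""
    return specimens / rate_per_person_day


def person_years(specimens: int, rate_per_person_day: float,
                 working_days_per_year: int = 220) -> float:
    """Person-years of effort; 220 working days per year is an assumption."""
    return person_days(specimens, rate_per_person_day) / working_days_per_year


def minimum_cost_gbp(specimens: int, cost_per_specimen: float = 1.0) -> float:
    """Lower bound on cost, treating the slide's '>£1/specimen' as exactly £1."""
    return specimens * cost_per_specimen


RATE = 163             # specimens per person per day (from the slide)
LEPIDOPTERA = 500_000  # British Isles Lepidoptera specimens (from the slides)

print(f"{person_days(LEPIDOPTERA, RATE):,.0f} person-days")
print(f"{person_years(LEPIDOPTERA, RATE):.1f} person-years")
print(f"at least £{minimum_cost_gbp(LEPIDOPTERA):,.0f}")
```

Under these assumptions the estimate is about 14 person-years, which sits inside the 12 to 18 person-years implied by 4-6 people working for 3 years.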

Page 16

SatScan Initiative

• Drawer level digitisation, segmented down to specimens
• Very fast imaging, no specimen handling, just one view
• No label information, but some data extracted from drawer
• Specimens retrospectively cropped & annotated

2. Making collections data digital
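To illustrate what "segmented down to specimens" can mean, here is a toy sketch that finds connected foreground regions in a binary drawer mask and crops each one. It is a generic connected-component approach on a hand-made grid, offered only as an illustration. The slides do not describe how SatScan itself segments drawers.

```python
from collections import deque


def specimen_boxes(mask):
    """Bounding boxes (top, left, bottom, right) of 4-connected foreground
    regions in a binary grid, in row-major order of first contact."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def crop(image, box):
    """Cut the rectangle described by `box` out of a 2D grid."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Toy "drawer": 1 marks specimen pixels, 0 marks background.
drawer = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
boxes = specimen_boxes(drawer)
print(boxes)
```

Each box can then be cropped and annotated separately, which matches the retrospective crop-and-annotate step on the slide.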

Page 18

SatScan Initiative

• Dedicated specimen-level rapid annotation software

2. Making collections data digital

Page 19

Crowdsourcing & Transcription

• We have a massive transcription problem
• Experiments via Notes-from-Nature (a Zooniverse project)
• Transcribing the NHM ornithological accession registers
• Wikimedian in Residence (Wikisource transcription)
• 4 month project, including specimen label transcription

2. Making collections data digital

Page 20

NHM Data Portal

data.nhm.ac.uk
• A focus for deposition and discovery of major NHM data sets
• Promote innovation through re-use of museum data
• Open Access, at a dedicated subdomain of the NHM website
• Started Jan. 2013 (3 years), consultation throughout 2012

Functional components of the data portal

3. Aggregation & Delivery

Page 21

NHM Data Portal: Registry

[Figure: annotated view of the dataset registry. Labels: Search; Datasets matching criteria; Individual dataset; Results; Browse & search criteria; Advanced display options]

• Dataset registry, for dataset discovery, modelled on data.gov.uk
• Uses CKAN, an open-source data portal software platform

3. Aggregation & Delivery
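Because the registry is built on CKAN, dataset discovery can in principle be scripted through CKAN's Action API, which exposes dataset search as `package_search`. The sketch below builds such a request and parses the standard response envelope. The base URL is a placeholder and the response is a hand-written stand-in, not output from the NHM portal.

```python
import json
from urllib.parse import urlencode


def package_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a CKAN Action API (v3) dataset search URL."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text: str) -> list:
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [d.get("title") or d.get("name", "")
            for d in payload["result"]["results"]]


# Placeholder base URL and a stand-in response with the usual CKAN envelope.
url = package_search_url("https://data.example.org", "lepidoptera", rows=5)
fake_response = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "example-dataset", "title": "Example dataset"}],
    },
})
print(url)
print(dataset_titles(fake_response))
```

Keeping URL construction and response parsing separate lets the parsing be checked without any network access.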

Page 22

NHM Data Portal: Registry

[Figure: annotated view of a dataset record. Labels: Metadata about the dataset; Name; Geographic scope; Tags; “Social”; Authors; License; Download; Developer tools; Technical info. (extracted from data file)]

• Dataset metadata discovery

3. Aggregation & Delivery

Page 23

NHM Data Portal: Dataset upload

• Simple dataset upload workflow for non-collections data

1. Name the dataset
2. Upload / link the data file
3. Describe the data file
4. Theme & tag
5. Add additional resources
6. Temporal coverage
7. Geographic coverage
8. Save & finish

3. Aggregation & Delivery
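One way to picture the upload steps is as a record that is filled in gradually and checked before saving. The sketch below is purely illustrative. The class, its field names and the choice of which steps are required are all assumptions, not the portal's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetSubmission:
    """Hypothetical record mirroring the upload steps (not a real schema)."""
    name: str = ""
    data_file: str = ""            # uploaded file name or link
    description: str = ""
    tags: list = field(default_factory=list)
    extra_resources: list = field(default_factory=list)  # treated as optional
    temporal_coverage: str = ""    # e.g. "1900/1950"
    geographic_coverage: str = ""

    def missing_steps(self) -> list:
        """Names of the (assumed) required steps that are still empty."""
        required = {
            "name": self.name,
            "data_file": self.data_file,
            "description": self.description,
            "tags": self.tags,
            "temporal_coverage": self.temporal_coverage,
            "geographic_coverage": self.geographic_coverage,
        }
        return [step for step, value in required.items() if not value]

    def ready_to_save(self) -> bool:
        """True once every required step has a value."""
        return not self.missing_steps()


draft = DatasetSubmission(name="Example checklist", data_file="checklist.csv")
print(draft.missing_steps())
```

Listing the outstanding steps, rather than returning a bare pass or fail, is what lets a form show the depositor what remains before "Save & finish".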

Page 24

NHM Data Portal: Data visualisation

[Figure: annotated view of the data visualisation interface. Labels: Zoomable map; Applied filters; Toggle map, table & stats views; Search, download & display options; No. records; No. georef. records]

• Dedicated interface to visualise & explore major datasets
• Focused on collections data, based on Canadensys.net, uses CartoDB

3. Aggregation & Delivery
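The "No. records" and "No. georef. records" counters shown for a filtered view can be sketched in a few lines. The records below are made up, the function names are hypothetical, and the coordinate fields use the Darwin Core terms `decimalLatitude` and `decimalLongitude`. This is not how the portal or CartoDB computes its counts.

```python
def is_georeferenced(record: dict) -> bool:
    """True when both coordinates parse as numbers within valid ranges."""
    try:
        lat = float(record.get("decimalLatitude", ""))
        lon = float(record.get("decimalLongitude", ""))
    except (TypeError, ValueError):
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180


def view_counts(records: list, **filters) -> dict:
    """Count records matching exact-value filters, and how many of
    those matches carry usable coordinates."""
    matching = [r for r in records
                if all(r.get(key) == value for key, value in filters.items())]
    georef = sum(1 for r in matching if is_georeferenced(r))
    return {"records": len(matching), "georeferenced": georef}


# Made-up sample records.
records = [
    {"country": "United Kingdom", "decimalLatitude": "51.5", "decimalLongitude": "-0.17"},
    {"country": "United Kingdom", "decimalLatitude": "", "decimalLongitude": ""},
    {"country": "France", "decimalLatitude": "48.85", "decimalLongitude": "2.35"},
    {"country": "France", "decimalLatitude": "95", "decimalLongitude": "2.35"},  # invalid latitude
]
print(view_counts(records))
print(view_counts(records, country="United Kingdom"))
```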

Page 25

NHM Data Portal: Data visualisation

[Figure: annotated views of the data visualisation interface. Labels: Collections views; Statistical summary; Specimen record views; Data field mappings; Summary preview; Full record; Tables; Download]

3. Aggregation & Delivery

Page 26

NHM Data Portal & DataCite

• Using DataCite DOIs in the data portal
• Datasets (2014) & specimens (2015)
• Unique, persistent and resolvable identifiers
• Easy to cite, alias existing specimen identifiers
• Conform to minimum DataCite requirements
• Landing page, min. metadata standard, fee, min. 10 yr. contract, DOI (pre)fixes

Breaks us out of the biodiversity data silo

4. Identifiers, links & interoperability
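A DOI name is a prefix and a suffix joined by a slash, and it resolves through the doi.org proxy. The sketch below shows how an existing specimen identifier might be aliased as a DOI suffix. The prefix `10.5072` and the identifier are placeholders, and the lower-casing and hyphenation rule is an arbitrary choice for illustration, not NHM or DataCite policy.

```python
import re
from urllib.parse import quote

# Simplified shape check: "10.<registrant digits>/<non-empty suffix>".
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def make_doi(prefix: str, local_id: str) -> str:
    """Alias a local identifier as a DOI name under the given prefix."""
    suffix = local_id.strip().lower().replace(" ", "-")
    doi = f"{prefix}/{suffix}"
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a well-formed DOI name: {doi!r}")
    return doi


def resolver_url(doi: str) -> str:
    """URL at which the DOI proxy resolves the name to its landing page."""
    return "https://doi.org/" + quote(doi, safe="/")


doi = make_doi("10.5072", "SPECIMEN 0001")  # placeholder prefix and identifier
print(doi)
print(resolver_url(doi))
```

The resolvable URL is what makes the identifier easy to cite outside biodiversity-specific systems.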

Page 27

Data Aggregation, APIs & download

• Content within the NHM data portal will be highly accessible
o Collections harvestable (e.g. by GBIF as a DwCA)
o Download DwCAs on any search facet
o Wide set of APIs available for datasets (part of CKAN)
• Sub-portals (selected content, themed by topic)
o e.g. Virtual Herbarium, NHM Science initiatives, geographic regions
• Analytical interface planned for 2015 (but not specified)

4. Identifiers, links & interoperability
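A Darwin Core Archive is a zip file whose `meta.xml` descriptor names the core data file and maps its columns to terms. The sketch below builds a tiny archive in memory and reads it back. It assumes a comma-delimited core file and ignores extensions, and the archive contents are made up.

```python
import csv
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/country"/>
  </core>
</archive>"""


def build_example_archive() -> bytes:
    """Create a minimal in-memory archive with one made-up record."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("meta.xml", META)
        archive.writestr(
            "occurrence.csv",
            "id,scientificName,country\n1,Pieris brassicae,United Kingdom\n",
        )
    return buffer.getvalue()


def read_core_records(archive_bytes: bytes) -> list:
    """Return core rows as dicts keyed by short term name (comma-delimited only)."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        core = ET.fromstring(archive.read("meta.xml")).find("dwc:core", NS)
        location = core.find("dwc:files/dwc:location", NS).text
        # Column index -> term name (last path segment of the term URI).
        columns = {int(core.find("dwc:id", NS).get("index")): "id"}
        for fld in core.findall("dwc:field", NS):
            columns[int(fld.get("index"))] = fld.get("term").rsplit("/", 1)[-1]
        skip = int(core.get("ignoreHeaderLines", "0"))
        text = archive.read(location).decode(core.get("encoding", "UTF-8"))
    rows = list(csv.reader(io.StringIO(text)))[skip:]
    return [{name: row[i] for i, name in columns.items()} for row in rows]


records = read_core_records(build_example_archive())
print(records)
```

Because the column meanings live in `meta.xml` rather than in the CSV header, a harvester can interpret the file without knowing the publisher's own field names.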

Page 28

Data Policies & Next Steps

• Data portal will be “open-by-default”
• Ambiguity in what this means & top-down schizophrenia
• Conflicting mandates on open access & revenue opportunities
• Lots of guidance available, will use to form a common policy
• A cross-institutional policy would be useful (but challenging)

5. Timeline & constraints

Page 29

Data Policies & Next Steps

NHM Data portal timeline

[Figure: timeline with marks at Jan 2013, Jan 2014, Jan 2015 and Jan 2016. Labels: Requirements & dataset discovery; Private alpha; Stable public beta; Full release & sub-portals; Internal feedback, data visualisation & DOIs; Subportals & analytical tools; Project start]

Next 6 months
• More documentation (PID and Tech Spec)
• Consultation and advocacy (internal and external)
• Data mapping from KE EMu and software testing
• Development
o Website wireframe design
o Drafting data visualisation subcontract
o Construction of private alpha release

5. Timeline & constraints

Page 30

Data Policies & Next Steps

NHM digitisation timeline

[Figure: timeline with marks at Jan 2013, 2014, 2015, 2016, 2017 and 2018. Labels: Path-finding & Programme development; Private alpha; Stable public beta; 20 Million!!; Project start; Major funding applications & a new gallery?; Digitise… Digitise… Digitise…]

Next 6 months
• Initial conclusions from path-finding digitisation activities
• Initial grant funding bids developed
• Advocacy, outreach & development of a digitisation “programme”
• Investigate possibilities for gallery development
• Develop crowdsourcing strategy

5. Timeline & constraints

Page 31

QUESTIONS

Page 32

Digitisation Priorities

• Priorities linked to science strategic priorities
o Disease, sustainability, crop wild relatives, pests etc.

[Figure: bar chart, "Crop Wild Relatives (accepted taxa only)", axis ticks 0 to 700, one bar per plant family: Poaceae, Brassicaceae, Solanaceae, Rubiaceae, Anacardiaceae, Arecaceae, Malvaceae, Cucurbitaceae, Grossulariaceae, Aquifoliaceae, Juglandaceae, Apiaceae, Asparagaceae, Pedaliaceae, Lauraceae, Convolvulaceae, Oleaceae, Bromeliaceae, Lecythidaceae]

2. Making collections data digital

Page 33

Digitisation Priorities

• Priorities linked to science strategic priorities
o Disease, sustainability, crop wild relatives, pests etc.
• Tiered approach, different needs for different collections

Nick Poole, UK Collections Trust

2. Making collections data digital