No specimen left behind: Collections digitisation at the NHM, London*

Post on 16-Apr-2017

Vince Smith
Collections for the 21st Century, Florida
5-6 May 2014

No specimen left behind: Collections digitisation at the NHM, London*

1

Some history

"... the rate of progress by the UK taxonomic institutions in digitising and making collections information available is disappointingly low ... there is a significant risk of damage to the international reputation of major institutions such as The Natural History Museum"

House of Lords Science and Technology Committee, Report on Taxonomy and Systematics, 2009

2

Digitisation rates at the NHM (circa 2009)

900 years to digitise the collection!

3

The prevailing attitude to collections digitisation

Biodiversity Informatics, 2010, 7: 120-129
2010 GBIF Task Group: Global Strategy and Action Plan for the Digitisation of Natural History Collections

However desirable, digitization of all specimens across the globe is a noble but impracticable goal.

Digitizing all specimens is not an achievable aim at present

Our collections are

4

More technology, more automation, more speed

Whole drawer scanning

Herbarium sheet scanning

Microscope slide scanning

5

European collections rising to the challenge

Large-scale data capture & digitisation in France, Netherlands & Finland

6

NHM London Science Strategy 2013-17
A New Voyage of Discovery

Three Focal Areas
1. Scientific discovery
2. Scientific infrastructure
3. Scientific engagement

Five Challenges
1. The Digital NHM
2. Origins, evolution & futures
3. Biodiversity discovery
4. Natural resources & hazards
5. Science, society & skills

Resources & funding

Measuring success

7

data.nhm.ac.uk/globe/

Digitisation target: 20M specimens available by 2017

8

A long way to go, practically, technically & culturally

NHM collections comprise c.80m objects
Physical register: c.5m
Digital data: 2.8m
Images: 350k
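The size of the gap is easier to see as percentages. The following is a rough sketch only: it reuses the approximate figures quoted on this slide and the 20M target from the previous one, and the names `HOLDINGS` and `share` are illustrative, not part of any NHM system.

```python
# Approximate figures from the slides (all "c." values, so results are rough).
COLLECTION_SIZE = 80_000_000   # c.80m objects
HOLDINGS = {
    "physical register": 5_000_000,
    "digital data": 2_800_000,
    "images": 350_000,
}
TARGET_2017 = 20_000_000       # 20M specimens available by 2017


def share(count: int, total: int = COLLECTION_SIZE) -> float:
    """Return count as a percentage of the whole collection."""
    return 100.0 * count / total


for label, count in HOLDINGS.items():
    print(f"{label}: {share(count):.2f}% of the collection")
print(f"2017 target: {share(TARGET_2017):.0f}% of the collection")

# Records still to be created to move from today's digital data to the target.
gap = TARGET_2017 - HOLDINGS["digital data"]
print(f"records still needed to reach the target: {gap:,}")
```

On these figures the existing digital data covers only a few percent of the collection, while the 2017 target corresponds to a quarter of it.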

9

NHM Digital Collections Programme

A 2, 5 and 10 year plan...
To collate, organise and make available one of the world's most important natural history collections as a digital resource, delivering:

an online specimen / lot-level database to manage all holdings

core meta-data and / or images for key parts of the collection

flexible informatics tools

750,000 for first 2 years

10

Outline

Why
  Internal objectives & benefits
  Research opportunity - the iCollections example
What
  How much data to digitise
  Linking digitisation effort to project benefits
How
  Digi-street pilots, quick wins (herbarium, drawer & slide scanning)
  Crowdsourcing pilots & options
Where
  NHM Data Portal
  External Portals (e.g. GBIF, Europeana)
Links
  Crowdfunding
  H2020 projects (COST, SYNTHESYS, LOD, VRE, Dig. Inf.)
  Other museums, herbaria & partners (e.g. CETAF & publishers)
When

11

1. Why: Objectives

PEOPLE & SKILLS
DATA CAPTURE
POLICY & PROTOCOL
INFRASTRUCTURE
PARTNERSHIPS
RESEARCH
ACCESS
STAKEHOLDERS & GOVERNANCE

12

1. Why: Research opportunity & the iCollections pilot

Using the NHM collections to track long-term seasonal response of butterflies to climate change

Digitisation of British and Irish Lepidoptera collection
  Species poor, specimen rich
  ~500,000 specimens, 5,000 drawers
  Re-curation, imaging, label data, georeferenced
  ~25% complete (started Jan. 2013)
About 50% of specimens useable
Many specimens in most years (late 19th century to 1970)
Provides a longer time perspective than most observational records (BMS post-1976)

13

1. Why: Research opportunity & the iCollections pilot

[Figure: Relationship between 10th percentile collection date of Anthocharis cardamines (Orange tip) and mean Mar.-May temp. (N.B. temp. axis reversed). Axes run from cooler springs to warmer springs and from earlier collection to later collection.]

1900-2000, strong correlation between initial collection dates & temperature
Critical marker on phenological response prior to recent rapid climate change
Longer time perspective than most observational records (BMS post-1976)
Museum data available for rare or hard to record species
An example of unique biological and ecological data from collections

Brooks, Self, Toloni & Sparks, 2014, Int. J. Biometeorol. DOI 10.1007/s00484-013-0780-6
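The comparison described on this slide pairs, for each year, the 10th percentile of specimen collection dates with the mean March to May temperature. The sketch below shows one way such a calculation could be set up. It is not the method of Brooks et al.; the data are invented for illustration, and the function names are hypothetical.

```python
from statistics import mean, quantiles


def tenth_percentile(days_of_year):
    """10th percentile of collection dates (day-of-year) for one year."""
    # quantiles(n=10) returns the nine decile cut points; the first is the 10th percentile.
    return quantiles(days_of_year, n=10, method="inclusive")[0]


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# INVENTED example data: year -> (collection days-of-year, mean Mar-May temp in deg C)
records = {
    1901: ([120, 125, 131, 134, 140, 145], 7.2),
    1902: ([112, 118, 124, 129, 133, 139], 8.1),
    1903: ([105, 111, 117, 122, 128, 131], 9.0),
    1904: ([116, 121, 127, 130, 137, 142], 7.8),
}

early_dates = [tenth_percentile(days) for days, _ in records.values()]
spring_temps = [temp for _, temp in records.values()]
r = pearson_r(spring_temps, early_dates)
print(f"r = {r:.2f}")
```

With data shaped like this, warmer springs go with earlier collection dates, so the coefficient is negative. That is consistent with the reversed temperature axis noted on the slide.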

14

2. What: Linking data capture effort to research benefits

Field | 1 | 2 | 3 | 4 | 5

Administrative
BM Number of object | Unknown / Not assigned | In label image | Transcribed | Transcribed & linked to digitised register | Transcribed & linked to digitised register
Object-level identifier if different | Not assigned | In label image; not machine-readable | Transcribed | Machine readable barcode | Machine readable barcode
Location within the NHM | Defined at project level | Collection | Collection + Cabinet | Collection + Cabinet + Drawer | Collection + Cabinet + Drawer + Location within drawer

Images
Object image | None | Multi-object (e.g. drawer level); not separable | Multi-object (e.g. drawer level); separable | Single object (low-res) | Single object (high-res)
Label image | None | Multi-object (e.g. drawer level); human-readable | Single object; human-readable | Multi-object (e.g. drawer level); transcribed | Single object; transcribed

Object(s) metadata
Taxonomic name | Unidentified | In label image | Transcribed but above species-level | Transcribed to species-level | Name linked to current taxonomy
Type status (if single specimen) | Unknown | In label image | Yes/No | Transcribed & specified | Transcribed & specified
Geographical location | Unknown or in label image | Continent (TDWG level 1) | Country or region (TDWG levels 2-4) | Georeferenced named locality (e.g., town) | Coordinates based on narrative or GPS
Date of collection | Unknown | In label image | Transcribed year | Transcribed month and year or date range | Transcribed date
Collector | Unknown | In label image | Transcribed | Transcribed & linked to controlled list | Transcribed; linked to controlled list & collector's notes
Stratigraphy | Unknown | In label image | Transcribed verbatim | Chronostratigraphic interpretation | Chronostratigraphy, biostratigraphy and lithostratigraphy
Additional data (e.g. host) | Unknown | Imaged but not transcribed | Partially transcribed | Transcribed in full | Transcribed in full and integrated

Level of Data Capture: Low (1) to High (5), from First Sweep to Benefit Justified

Research Areas: We have mapped out how much data we need to address these questions

Origins
  Coarse biogeography
  Coarse macroecology
  Comparative analysis of traits
  Macroecological modelling of phenotype
  Community phylogenetics
  List-based macroecology
  Temporal diversity curves
  Evolution of disparity
Futures
  Coarse analysis of spp. distribution change
  Coarse Species Distribution Models
  Phenological change
Hazards
  Earliest records of invaders
  Effects of decadal climate oscillations
  Modelling biotic consequences of weather
  Evolution of invasive species
Resources
  Semi-automated capture of trait data
  Modelling within-spp variation across range

[Figure: Radar Plots. Axes: Object image, Stratigraphy, Collector, Tax. name, Geographical location, Date of collection; scale 1-5.]
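The capture levels and radar plots suggest a simple check: compare the level at which each field of a record has been captured with the minimum level a research use needs. The sketch below illustrates that idea under stated assumptions. The requirement profile is hypothetical (the actual radar plot values are not recoverable from this transcript), and all names are illustrative.

```python
# Fields shown on the radar plots; levels run from 1 (low) to 5 (high).
RADAR_FIELDS = (
    "object_image", "stratigraphy", "collector",
    "taxonomic_name", "geographical_location", "date_of_collection",
)

# HYPOTHETICAL minimum levels for one research use; the real profiles
# are in the radar plots, which did not survive in this transcript.
PHENOLOGY_PROFILE = {
    "taxonomic_name": 4,          # transcribed to species-level
    "geographical_location": 3,   # country or region
    "date_of_collection": 5,      # transcribed date
}


def shortfalls(record_levels, profile):
    """Return {field: (have, need)} for fields captured below the profile."""
    gaps = {}
    for field, need in profile.items():
        have = record_levels.get(field, 1)  # treat missing fields as level 1
        if have < need:
            gaps[field] = (have, need)
    return gaps


# A record captured at level 2 on every field.
level2_record = {field: 2 for field in RADAR_FIELDS}
print(shortfalls(level2_record, PHENOLOGY_PROFILE))
```

A record that already meets or exceeds the profile on every listed field yields an empty dictionary, which is one way to flag the point at which further capture effort is no longer justified for that use.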

15

3. How: Digi-street pilots (Herbarium Sheets)

PROCESS

16

3. How: Digi-street pilots (Herbarium Sheets)

33k specimens per day, 3 shifts (6am-10pm), Netherlands collection complete in 1.5 years
1.29 Euros per specimen image (if outsourced), transcription at similar cost

Video of Herbarium Sheet Digitisation (not available on SlideShare version of this presentation)
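The throughput and unit cost quoted here allow a back-of-envelope estimate for any batch of sheets. The sketch below is an illustration only: the batch size is hypothetical, and "transcription at similar cost" is modelled as the same unit price as imaging, which is an assumption.

```python
import math

SHEETS_PER_DAY = 33_000        # quoted throughput across 3 shifts
EUR_PER_IMAGE = 1.29           # quoted outsourced imaging cost
# ASSUMPTION: "transcription at similar cost" is modelled as the same unit price.
EUR_PER_TRANSCRIPTION = EUR_PER_IMAGE


def digitisation_estimate(n_sheets: int, transcribe: bool = True):
    """Return (working days, cost in euros) for imaging n_sheets."""
    days = math.ceil(n_sheets / SHEETS_PER_DAY)
    unit_cost = EUR_PER_IMAGE + (EUR_PER_TRANSCRIPTION if transcribe else 0.0)
    return days, n_sheets * unit_cost


# HYPOTHETICAL batch size, chosen only for illustration.
days, cost = digitisation_estimate(1_000_000)
print(f"{days} working days, about {cost:,.0f} euros")
```

The estimate counts working days at full throughput and ignores set-up, re-curation and quality control, so it is a lower bound rather than a plan.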

17

3. How: Digi-street pilots (Drawer scanning & segmentation)

SatScan whole drawer scanning
  30 million specimens, 130k drawers
  Fast, high res. multi-specimen drawer images (5 mins. each)
  No specimen handling
  Limited drawer / unit tray metadata, plus identifiers
Specimen segmentation problem
  Digital and physical collection gets out of sync
  Need to automate specimen segmentation

18

3. How: Digi-street pilots (Drawer scanning & segmentation)

Starting image

Auto-segment

Mark errors

Correct

Work with Pieter Holtzhausen and Stéfan van der Walt (Stellenbosch University)
Software: Inselect
Main language: Python
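To give a feel for the auto-segment step, the sketch below finds bounding boxes of connected foreground regions in a tiny binary image. This is a generic connected-component approach chosen for illustration; it is not a description of how Inselect works, and the toy "drawer" and the function name are invented.

```python
from collections import deque


def bounding_boxes(mask):
    """Find bounding boxes (top, left, bottom, right) of 4-connected
    foreground regions in a 2D list of 0/1 values."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy "drawer": 1 = specimen pixel, 0 = background (already thresholded).
drawer = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
]
for box in bounding_boxes(drawer):
    print(box)
```

Touching or overlapping specimens would merge into one box under this approach, which is the kind of error the "mark errors" and "correct" steps exist to catch.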

19

3. How: Digi-street pilots (Slide scanning)

Originally for histological sections
Can be adapted for NH specimens
Many issues:
  Speed (max. 500 per day)
  File size (2-5GB per slide)
  Network ingestion (100MBps)
  Reading labels at both ends
NHM testing 6 systems
NERC capital grant awarded
Fully operational early 2015

1. Slides cleaned & barcoded
2. Loaded into hopper (50-100)
3. High resolution scan
4. Images stored & databased
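The speed, file size and network figures listed among the issues can be combined into a rough daily data budget. The sketch below does that arithmetic. It reads "100MBps" as megabytes per second, which is an assumption; if the figure means megabits per second, the transfer times are eight times longer.

```python
SLIDES_PER_DAY = 500           # quoted maximum
GB_PER_SLIDE = (2, 5)          # quoted file size range
# ASSUMPTION: "100MBps" is read as 100 megabytes per second; if it means
# megabits per second, transfer times are eight times longer.
LINK_MB_PER_S = 100


def daily_volume_gb(gb_per_slide: float, slides: int = SLIDES_PER_DAY) -> float:
    """Data produced in one day at the given file size per slide."""
    return gb_per_slide * slides


def transfer_hours(volume_gb: float, mb_per_s: float = LINK_MB_PER_S) -> float:
    """Hours needed to move volume_gb over the link (1 GB taken as 1000 MB)."""
    return volume_gb * 1000 / mb_per_s / 3600


for size in GB_PER_SLIDE:
    volume = daily_volume_gb(size)
    print(f"{size} GB/slide: {volume:,.0f} GB per day, "
          f"{transfer_hours(volume):.1f} h to ingest")
```

Even at the low end of the file size range, a full day of scanning produces on the order of a terabyte, which is why ingestion and storage appear alongside scan speed in the list of issues.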

20

3. How: Crowdsourcing pilot

NHM Bird registers
  No advertising
  Hard to transcribe
  Challenging starting project

1 user with 32,629 transcriptions!
92 users with 100+ transcriptions
363 users with 1 transcription

[Figure: Log no. of records transcribed, plotted against ranked users.]
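A ranked-user plot of this kind can be produced from a simple log of who transcribed each record. The sketch below shows the tallying step only, using an invented log; the user names and counts are not NHM data.

```python
import math
from collections import Counter

# INVENTED transcription log: one user name per transcribed record.
log = ["ann"] * 120 + ["bo"] * 15 + ["cy"] * 3 + ["di", "ed", "flo"]

counts = Counter(log)
# Rank users from most to least active, as on the slide's x-axis.
ranked = counts.most_common()

for rank, (user, n) in enumerate(ranked, start=1):
    print(f"{rank:>2}  {user:<4} {n:>4}  log10 = {math.log10(n):.2f}")

single_record_users = sum(1 for n in counts.values() if n == 1)
top_share = ranked[0][1] / len(log)
print(f"{single_record_users} users with 1 transcription; "
      f"top user made {top_share:.0%} of all transcriptions")
```

The log scale matters because contribution is so uneven: a handful of users account for most records, while many contribute a single transcription.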

21

3. How: Crowdsourcing options

Zooniverse Projects
Smithsonian Digital Volunteers
Wikisource transcription (WiR)
Herbarium@Home

Next steps: Survey and review of natural history transcription projects cf. paying transcribers

22

4. Where: NHM Data Portal

A focus for deposition and discovery of NHM research & collections data
Stable, citable identifiers on datasets & specimen / lot records
Transparent data quality (un-reviewed, reviewed, reviewed & updated)
Download (DwCA), web-services & Linked Open Data
Built using CKAN, with enhanced mapping functionality

[Screenshots: Search; Datasets matching criteria; Individual dataset; Results; Browse & search criteria; Mapping, table & statistical views.]
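Because the portal is built on CKAN, dataset discovery could in principle go through CKAN's standard Action API. The sketch below builds a `package_search` request and parses a response in CKAN's usual envelope format. It assumes the portal exposes the standard API under `https://data.nhm.ac.uk`; that host path, the helper names and the sample response are illustrative, not confirmed by the slides.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# ASSUMPTION: the portal exposes CKAN's standard Action API under this host.
BASE_URL = "https://data.nhm.ac.uk"


def package_search_url(query: str, rows: int = 5) -> str:
    """Build a CKAN package_search URL for a free-text dataset query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{BASE_URL}/api/3/action/package_search?{params}"


def dataset_titles(response: dict) -> list:
    """Extract dataset titles from a decoded package_search response."""
    if not response.get("success"):
        raise RuntimeError("CKAN reported a failed request")
    return [pkg.get("title", pkg["name"]) for pkg in response["result"]["results"]]


def search(query: str) -> list:
    """Perform the request (needs network access; not called below)."""
    with urlopen(package_search_url(query)) as reply:
        return dataset_titles(json.load(reply))


# Offline demonstration with a hand-written response in CKAN's envelope format.
fake = {
    "success": True,
    "result": {"count": 1, "results": [{"name": "demo", "title": "Demo dataset"}]},
}
print(package_search_url("lepidoptera"))
print(dataset_titles(fake))
```

Keeping URL construction and response parsing separate from the network call makes the logic testable offline, which is useful while the portal itself is still pre-release.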

23

4. Where: External Portals

Flickr
GBIF
Europeana

e.g. NHM Coleoptera
NHM almost getting data to GBIF!
Submitting to Europeana portal (via Open-Up)
Niche collections on Flickr
Robust API services
Gateway to image analysis projects (e.g. species recognition & trait extraction tools)

24

5. Links

Crowdfunding
  Personalizes donation
  Scales well
  Requires lots of data
  Most crowdsourcing platforms unsuitable
  Potential for a data visualization to support our needs
H2020 Projects
  EU Research & Innovation funding Programme
  €80 Billion from 2014-2020
  Strong record (EDIT, ViBRANT, SYNTHESYS1/2/3)
  5 proposals in development for 2014/15
  Better alignment with Digital Collections Programme
Partners
  Major museums & herbaria (Kew, Smithsonian, & Euro.6)
  Umbrella organisations & projects (GBIF, CETAF, iDigBio)
  Universities (e.g. on image analysis)
  Data publishers (engagement on data & systems)

Crowdfunding
  Explicitly link donations to collections digitisation (personalisation). Makes it possible to link with rewards better.
  Scales from 1 specimen to the whole collection
  Requires lots of data on the collection & cost of digitising
  Many platforms, but most unsuitable for our needs
  Develop a compelling visualisation to support the programme

25

6. When

Key dates over next 2 years

Herbarium scanning
  Pilot TBC (starting late-2014)
Drawer scanning
  Segmentation software (Aug. 2014)
  Pilots (ongoing)
Slide scanner
  Testing 6 systems (complete)
  Procurement / purchase (July 2014)
  Pilot projects & system integration (from Sept. 2014)
Crowdsourcing pilots
  Draft review paper (Aug. 2014)
  Additional Notes from Nature Project (early 2015)
NHM Data Portal
  Internal release (June 2014)
  Public release (Jan. 2015)
Funding
  H2020 projects (submitted, Sept. 14 & Jan. 15)

26

Acknowledgements

Digital Collections Programme
  Planning: Ian Owens, Ben Atkinson, Dave Thomas, Andy Purvis, Emilie Smith & Vince Smith.
iCollections
  Project team: Gordon Paterson, Geoff Martin, Martin Honey, Blanca Huertas, Darrell Siebert, Vladimir Blagoderov, Steve Cafferty, Adrian Hine, Chris Sleep, Mike Sadka, Elisa Cane, Lyndsey Douglas, Joanna Durant, Gerardo Mazzetta, Flavia Toloni, Peter Wing, Malcolm Penn & Liz Duffle.
  Research: Steve Brooks, Angela Self, Flavia Toloni & Tim Sparks.
Drawer scanning
  NHM Satscan development: Vladimir Blagoderov, Laurence Livermore & Vince Smith.
  Software: Pieter Holtzhausen & Stéfan van der Walt (Stellenbosch University).
Slide scanner
  Testing: Vladimir Blagoderov & Alex Ball.
Crowdsourcing
  Pilots (NHM Team): Tim Conyers, Lawrence Brooks & Adrian Hine.
  Review paper: Laurence Livermore & Vince Smith.
NHM Data Portal
  Project team: Vince Smith, Darrell Siebert, Dave Thomas & Adrian Hine.
  Development: Ben Scott & Alice Heaton.

Apologies to anyone I have missed!

27

Questions

28