The use of OCR in the digitisation of herbarium specimens Robyn E Drinkwater, Robert Cubey & Elspeth Haston


Post on 25-Feb-2016


Page 1: The use of OCR in the digitisation of herbarium specimens

The use of OCR in the digitisation of herbarium specimens

Robyn E Drinkwater, Robert Cubey & Elspeth Haston

Page 2: The use of OCR in the digitisation of herbarium specimens

What is happening in digitisation?

Page 3: The use of OCR in the digitisation of herbarium specimens
Page 4: The use of OCR in the digitisation of herbarium specimens

• … and these minimal data records are going to need data added to them.

Page 5: The use of OCR in the digitisation of herbarium specimens

What are the options when using optical character recognition (OCR)?

• Parse OCR text directly into the database fields

• Use OCR data to prepare the specimens for manual / semi-automated data entry
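To make the first option concrete, here is a minimal sketch of parsing OCR text straight into fields with regular expressions. The label layout, the patterns and the field names are assumptions made for illustration; the slides do not describe any parser, and real labels vary far more than this.

```python
import re

# Hypothetical patterns for one invented label style, e.g. "Coll. P. H. Davis 12345".
COLLECTOR_RE = re.compile(r"(?:Coll\.|Leg\.)\s+([A-Za-z. ]+?)\s+(\d+)")
# Hypothetical date style: day.month.year, e.g. "14.6.1957".
DATE_RE = re.compile(r"\b(\d{1,2}\.\d{1,2}\.\d{4})\b")


def parse_label(text):
    """Return a dict of fields found in the OCR text; missing fields stay None."""
    fields = {"collector": None, "collector_number": None, "date": None}
    collector = COLLECTOR_RE.search(text)
    if collector:
        fields["collector"] = collector.group(1).strip()
        fields["collector_number"] = collector.group(2)
    date = DATE_RE.search(text)
    if date:
        fields["date"] = date.group(1)
    return fields


# Invented sample text, not taken from a real specimen.
sample = "Flora of Iraq\nColl. P. H. Davis 12345\n14.6.1957\nMosul, rocky slopes"
parsed = parse_label(sample)
```

Any label that does not follow the assumed layout comes back with empty fields, so a parser like this needs a pattern for each label style it is expected to handle.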

Page 6: The use of OCR in the digitisation of herbarium specimens

• We have had a digitisation project running to digitise all the specimens from SW Asia and the Middle East at RBGE.

• Minimal data had been captured originally*
  – Filing name
  – Geographical filing region
  – Barcode

• We have been routinely processing all our specimen images through ABBYY OCR software.

* E Haston, R Cubey, DJ Harris (2011). Data concepts and their relevance for data capture in large scale digitisation of biological collections. International Journal of Humanities and Arts Computing 6 (1-2), 111-119.

Page 7: The use of OCR in the digitisation of herbarium specimens

Exploring the data…

Page 8: The use of OCR in the digitisation of herbarium specimens

Step One

• We used the OCR output text to pull out over 7,000 specimen images and associated data records

• These were then prepared into batches:
  – some random
  – some sorted by collector and / or country
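As a rough illustration of this step, the sketch below matches search terms against OCR text and groups the matching barcodes into batches by collector and country. The barcodes, label text and search terms are all invented, and the data layout is an assumption; the slides do not describe the scripts used at RBGE or the format of the ABBYY output.

```python
from collections import defaultdict

# Hypothetical OCR output: barcode -> raw label text (invented examples).
ocr_text = {
    "E00000001": "Flora of Iraq\nColl. P. H. Davis 12345\nMosul, rocky slopes",
    "E00000002": "FLORA OF TURKEY\nleg. P.H. Davis\nAnkara",
    "E00000003": "Plants of Iraq, Rawi 2001, desert",
    "E00000004": "illegible ~~ label",
}

# Hypothetical search terms: several spellings can map to one batch label.
collector_terms = {"Davis": ["p. h. davis", "p.h. davis"], "Rawi": ["rawi"]}
country_terms = {"Iraq": ["iraq"], "Turkey": ["turkey"]}


def first_match(text, terms):
    """Return the first label whose search terms occur in the text, else None."""
    lowered = text.lower()
    for label, variants in terms.items():
        if any(variant in lowered for variant in variants):
            return label
    return None


def build_batches(records):
    """Group barcodes by the (collector, country) pair found in their OCR text."""
    batches = defaultdict(list)
    for barcode, text in sorted(records.items()):
        key = (first_match(text, collector_terms), first_match(text, country_terms))
        batches[key].append(barcode)
    return dict(batches)


batches = build_batches(ocr_text)
```

Specimens whose text matches no term end up under the key `(None, None)`, which could serve as the unsorted batch.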

Page 9: The use of OCR in the digitisation of herbarium specimens

Step Two

• A team of six digitisers at RBGE completed a series of trials

• They used two different protocols for data entry:
  – complete records
  – partial records (including collector and geographical information but not habitat and description)

• In total 7,200 specimens were processed

Page 10: The use of OCR in the digitisation of herbarium specimens

Results…

• Compared to unsorted, random specimens, those which were sorted based on data from the OCR output were quicker to digitise

• Of the methods tested here, the most efficient used a protocol based on partial data entry, working with specimens which had been filtered by Collector and Country

Page 11: The use of OCR in the digitisation of herbarium specimens

The human factor…

Thinking about the ease of entering the data for each test, rate them on their relative ease of use

[Chart: vertical axis from 0% to 100%; tests shown: Random 1, Collector, Country, Collector & Country, Collector & Country (OCR), Random 2; legend: ratings from 1 (easiest) to 5 (hardest).]

Page 12: The use of OCR in the digitisation of herbarium specimens

The human factor…

• Digitisation staff preferred working with sorted specimens

• They also preferred working with physical specimens rather than images

Page 13: The use of OCR in the digitisation of herbarium specimens

Some more thoughts…

• This work is more easily applied than parsing data from the OCR output

• It can be used in conjunction with other tools later in the digitisation process, since these other processes will almost certainly be more efficient with sorted batches of specimens

• Other tasks can also be built on top of this, e.g. condition assessment, QC, etc.

Page 14: The use of OCR in the digitisation of herbarium specimens

• It’s surprising what can be used to help filter specimens – the black art of search terms!

Page 15: The use of OCR in the digitisation of herbarium specimens

Acknowledgments

• The digitisation team at RBGE: Nicky Sharp, David Braidwood, Muhammad Ghazali, Lorna Glancy, Dorota Jaworska, Esther Nieto.

• The Andrew W Mellon Foundation

• Dr Antje Ahrends (RBGE) & Dr Chris Glasbey (BIOSS) for statistical advice