Digitisation Doctor Optical Character Recognition

OCR: Optical Character Recognition. Simon Tanner. Blog: simon-tanner.blogspot.co.uk. Twitter: @SimonTanner

Upload: simon-tanner

Post on 08-May-2015


DESCRIPTION

Optical Character Recognition guidance and advice

TRANSCRIPT

Page 1: Digitisation Doctor Optical Character Recognition

OCR

Optical Character Recognition

Simon Tanner

Blog: simon-tanner.blogspot.co.uk

Twitter: @SimonTanner

www.slideshare.net/KDCS/

Page 2: Digitisation Doctor Optical Character Recognition

King’s Digital Consultancy Services

www.digitalconsultancy.net

Page 3: Digitisation Doctor Optical Character Recognition

Some OCR resources

By Simon Tanner:

Deciding whether Optical Character Recognition is feasible (PDF document), created for the Oxford University Digital Library
www.odl.ox.ac.uk/papers/OCRFeasibility_final.pdf

Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library's 19th Century Online Newspaper Archive
www.dlib.org/dlib/july09/munoz/07munoz.html

The IMPACT project: Improving Access to Text
www.impact-project.eu

Page 4: Digitisation Doctor Optical Character Recognition

OCR – How it works

Image optimisation

Document Image Analysis

Character recognition

Word identification/recognition

Correction

Formatting output

Page 5: Digitisation Doctor Optical Character Recognition

Assessing a resource for OCR

Scanning methods possible

Nature of original paper

Nature of printing: uniformity; language; text alignment; complexity of alignment; lines, graphics and pictures; handwriting

Nature of document

Nature of output requirements

Page 6: Digitisation Doctor Optical Character Recognition

OCR Accuracy

Evaluating OCR accuracy is about more than just character to character accuracy rates

Character accuracy rates are misleading (more later…)

It is also about assessing the functionality enabled through the OCR’s output

Search accuracy

Volume of hits returned

Ability to structure searches and results

Accuracy of result ranking

Amount of correction required to achieve the required performance
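The two basic rates referred to on this slide can be sketched as follows. This is a minimal illustration, not the method used in any of the studies cited: it assumes accuracy is defined as 1 minus the edit distance divided by the length of a hand-checked reference text, and the function names are hypothetical. The reference sentence is an assumed correct reading of a short OCR fragment.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def accuracy(reference, ocr_output):
    """Assumed definition: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        raise ValueError("reference must not be empty")
    return max(0.0, 1.0 - levenshtein(reference, ocr_output) / len(reference))


def character_accuracy(reference_text, ocr_text):
    # Compare the texts character by character.
    return accuracy(reference_text, ocr_text)


def word_accuracy(reference_text, ocr_text):
    # Compare the texts word by word, splitting on whitespace.
    return accuracy(reference_text.split(), ocr_text.split())


# Assumed ground truth versus an OCR reading with one wrong character.
reference = "There must be more Love"
ocr = "There mult be more Love"
print(f"characters: {character_accuracy(reference, ocr):.1%}")
print(f"words:      {word_accuracy(reference, ocr):.1%}")
```

A single wrong character costs one character out of 23 but one word out of 5, which is why the two rates diverge so quickly.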

Page 7: Digitisation Doctor Optical Character Recognition

Character accuracy rates may mislead

Consider this scenario: 1,000 words with 5,000 characters (an average of 5 per word), excluding spaces

90% character accuracy means:

4,500 characters correct

Possibly a maximum 900 words correct (90%)

Possibly a minimum 500 words correct (50%)

Reality is somewhere in between

Depending on the number of “significant words” the search results could still be almost 100% or near zero
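The arithmetic behind the maximum and minimum can be checked with a short sketch. It assumes, as the scenario does, that every word has the same number of characters; the function name is hypothetical.

```python
import math


def word_accuracy_bounds(total_words, chars_per_word, character_accuracy):
    """Return (minimum, maximum) number of fully correct words."""
    total_chars = total_words * chars_per_word
    wrong_chars = round(total_chars * (1 - character_accuracy))
    # Best case: errors are packed into as few words as possible.
    fewest_wrong_words = math.ceil(wrong_chars / chars_per_word)
    # Worst case: every wrong character falls in a different word.
    most_wrong_words = min(wrong_chars, total_words)
    return total_words - most_wrong_words, total_words - fewest_wrong_words


low, high = word_accuracy_bounds(1000, 5, 0.90)
print(low, high)  # 500 900
```

The 500 wrong characters either fill 100 words completely (900 words correct) or spoil 500 separate words (500 words correct).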

Page 8: Digitisation Doctor Optical Character Recognition

OCR Accuracy: Balancing factors

Character accuracy vs:

Word accuracy

Significant word accuracy

Significant words with capital letter start accuracy
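A rough sketch of a significant word rate follows. It assumes that "significant words" means words left after removing common function words, and it counts how many of them from a reference text reappear in the OCR output. The stop-word list, the function names and the reference sentence are illustrative assumptions, not taken from the studies.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "be", "for", "in", "of", "the", "to"}


def tokens(text):
    """Lower-case words with surrounding punctuation stripped."""
    return [w.strip(".,;:!?\"'").lower() for w in text.split()]


def significant_word_accuracy(reference_text, ocr_text, stop_words=STOP_WORDS):
    """Share of the reference's significant words also found in the OCR text."""
    wanted = Counter(w for w in tokens(reference_text) if w and w not in stop_words)
    if not wanted:
        raise ValueError("reference contains no significant words")
    found = Counter(tokens(ocr_text))
    # Each OCR word can satisfy at most one occurrence in the reference.
    matched = sum(min(count, found[word]) for word, count in wanted.items())
    return matched / sum(wanted.values())


# Assumed ground truth versus an OCR reading that misreads the long s.
reference = "more Love, and Charity, and Unanimity amongst Christians"
ocr = "more Love, and Charity, and Unanimity amongft Chriftians"
print(f"{significant_word_accuracy(reference, ocr):.1%}")
```

This bag-of-words comparison ignores word order, which is usually acceptable when the aim is to judge whether a keyword search would find the page.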

Bit-depth is the number one factor that can improve OCR accuracy once a base level of 300+dpi resolution is achieved.

Bitonal emphasises foxing and obscures characters in words such as "consequently", "clergy", "matrimonial" and "the" that would be captured accurately from the greyscale image.

Page 9: Digitisation Doctor Optical Character Recognition

BL Newspaper Results: arranged by date

[Chart: results plotted by year, 1801 to 1900, on a vertical axis running from 50 to 100 (presumably % accuracy). Series: characters, words, significant words, and words with capital letter start, each with a polynomial (Poly.) trend line.]

Page 10: Digitisation Doctor Optical Character Recognition

OCR Quiz

Look at the examples on screen

Make a note of any features you think might affect OCR accuracy

Have a guess of what you think the accuracy in % terms might be

Page 11: Digitisation Doctor Optical Character Recognition
Page 12: Digitisation Doctor Optical Character Recognition

I am petfood, God toil! uttedy-toverthroW, at feaft; $gy abafe Men's affections tp; and seal for all Party-making Notions amdngft CfiriftiansybefGieirie will raife his,Church to that prof-perous, flourilhing State prophefied of, and prOmifed in the Scrip* tures. There mult be more Love, and Charity, and Unanimity amongft Chriftians,.

OCR Results

OCR Engine    % characters correct    % words correct    No. of corrections
FineReader    91.1                    70.9               110
PrimeOCR      93.95                   79.1               79

Total number of characters = 2109
Total number of words = 379

Page 13: Digitisation Doctor Optical Character Recognition
Page 14: Digitisation Doctor Optical Character Recognition

OCR Results

OCR Engine    % characters correct    % words correct    No. of corrections
FineReader    73.7                    57.5               31
PrimeOCR      75.9                    62.37              28

A THEATRE erein be reprc-fented as wel the miferies & calamities tijat foiioto tht too*e^jr alfo the greate toyts andplefures tobtcf) tbe fatrfc faltooenio^An Argument both profitable anddele&able, to all that finccrclyloue the word of Codt'.*Deuifedby S. hhnv&n~ derlS^oodt.s 3^ Scene and allowed according to the order appointed., ^ Imprinted at London by Henry Bynncman*Anno Domini.CVM PHIT

Total number of characters = 411
Total number of words = 73

Page 15: Digitisation Doctor Optical Character Recognition

OCR

Optical Character Recognition

Simon Tanner

Blog: simon-tanner.blogspot.co.uk

Twitter: @SimonTanner