Training & Quality Assessment of an Optical Character Recognition Model for Northern Haida

Isabell Hubert1, Antti Arppe1, Jordan Lachler1, Eddie A. Santos2
1Department of Linguistics, University of Alberta; 2Department of Computing Science, University of Alberta

Introduction
The first OCR model for Northern Haida, also known as Masset or Xaad Kil, a severely endangered First Nations language spoken in British Columbia, Canada, and Alaska, USA.

Goals
• Train the most accurate OCR model with the least amount of manual labor
• Compare OCR model training approaches [1]
• Update the de facto standard ISRI analysis tools to a Unicode version [6, 7]
  • model accuracy cannot be predicted in advance; assessment tools are crucial [8, 9]
• Examine what influences model accuracy

Textual Source
Pre-phonemic transcriptions of Northern Haida stories from the early 20th century [2]
• complex & extensive character set: 78 characters + 12 punctuation marks
• >100k words & 500k characters total: too large for manual digitization
• aims: make legacy materials digitally accessible; preserve Haida language and culture
• original font could not be retrieved
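The accuracy assessment the Goals point to can be illustrated with a minimal sketch of an ISRI-style character accuracy measure, assuming errors are counted as the edit distance between ground truth and OCR output; the function names here are illustrative, not the actual toolkit API:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """ISRI-style accuracy: (n - #errors) / n, with n the ground-truth length."""
    n = len(ground_truth)
    errors = edit_distance(ground_truth, ocr_output)
    return (n - errors) / n

# One substituted character in an 8-character string:
print(character_accuracy("xaad kil", "xaat kil"))  # → 0.875
```

The same edit-distance idea, applied to whitespace-delimited tokens instead of characters, yields a word accuracy (WRA) analogue.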
Model Training
OCR engine: Tesseract [3]
• open source (Apache 2.0), accurate; integration with FSTs possible
• two training approaches compared: Image Generation (IG) and Source Image (SI)
• six models trained per approach, based on 1, 2, 3, 4, 5, and 10 pages
• ~1,200 characters per page (13.3 × the character set)
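The ~1,200-characters-per-page figure can be checked against the character set size reported above (78 characters + 12 punctuation marks); a quick arithmetic sketch:

```python
charset_size = 78 + 12     # characters + punctuation marks = 90
chars_per_page = 1200      # approximate characters on one page

# How many times a single page covers the character set on average:
coverage = chars_per_page / charset_size
print(round(coverage, 1))  # → 13.3

# Total training characters available per model (1..10 pages):
for pages in (1, 2, 3, 4, 5, 10):
    print(pages, pages * chars_per_page)
```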
Recommendations & Conclusion
Training an accurate OCR model for an extensive character set, where the original font cannot be retrieved:
• base training on ~13× as many characters as are in the character set
• use the Source Image approach
• choose training material that
  • is of good print quality
  • has all characters & a high number of tokens per character
• optional, if resources are available: train ~2 additional models, then select the best
• the Northern Haida OCR model will be among the first for North American indigenous languages
• the resulting electronic corpus of Northern Haida will be the largest to date
• the de facto standard ISRI toolkit has been updated to a Unicode version [6, 7]
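The material-selection recommendation (all characters present, many tokens per character) can be sketched as a simple coverage check; the character set and page text below are toy placeholders, not the actual Haida data:

```python
from collections import Counter

def coverage_report(page_text: str, charset: set):
    """Check whether a candidate training page contains every character
    in the character set, and how many tokens of each it provides."""
    counts = Counter(c for c in page_text if c in charset)
    missing = charset - counts.keys()
    return counts, missing

charset = set("xaadkil ")  # toy stand-in for the 90-character set
counts, missing = coverage_report("xaad kil xaadas", charset)
print(missing)                  # characters with zero tokens on this page
print(counts.most_common(3))    # best-attested characters
```

A page with an empty `missing` set and high minimum counts is a better training candidate than one that merely has many characters overall.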
References
[1] van Beusekom, J., Shafait, F., & Breuel, T. M. (2008). Automated OCR Ground Truth Generation. 8th IAPR Workshop, 111–117.
[2] Swanton, J. R. (1908). Haida Texts - Masset Dialect. In F. Boas (Ed.), The Jesup North Pacific Expedition, Memoir of the American Museum of Natural History, Vol. X. Brill & Stechert, Leiden/New York.
[3] Google (2012). tesseract v3.2.2. https://github.com/tesseract-ocr, accessed 21 Nov 2015.
[4] Project Gutenberg (2014). www.gutenberg.org, accessed 22 Nov 2015.
[5] Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B), 73(1), 3–36.
[6] Rice, S. V., & Nartker, T. A. (1996). The ISRI Analytic Tools for OCR Evaluation.
[7] Rice, S. V., & Nartker, T. A. (2016). The ISRI Analytic Tools for OCR Evaluation - Unicode Version. Ported and extended by Eddie Antonio Santos. https://github.com/eddieantonio/isri-ocr-evaluation-tools, accessed 5 Jan 2016.
[8] Nagy, G., Nartker, T. A., & Rice, S. V. (2000). Optical Character Recognition: An Illustrated Guide to the Frontier. SPIE Proceedings, 58–69.
[9] Rice, S. V. (1996). Measuring the Accuracy of Page-reading Systems.
Haida example text: [2], p. 208.

Acknowledgements
Funded by SSHRC Partnership Development Grant (890-2013-0047) 21st Century Tools for Indigenous Languages, and KIAS Research Cluster Grant 21st Century Tools for Indigenous Languages. We thank Megan Bontogon, Dustin Bowers, Darren Flavelle, Catherine Ford, Evan Lloyd, Kaidi Lõo, Lauren Rudat, Miikka Silfverberg, Corey Telfer, and the members of ALTLab for helpful remarks.
Training approaches require:
• Image Generation (IG) [3]: ground truth (.txt) + the original print font
• Source Image (SI) [4]: ground truth (.box) + a foreign OCR model
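For the SI approach, the .box ground truth feeds the standard Tesseract 3.x training pipeline. A sketch of the commands involved, with placeholder file names ("page1" and the language code "hai" are hypothetical, not the actual files used here):

```python
import subprocess

page, lang = "page1", "hai"  # placeholders

steps = [
    # Produce a .tr feature file from the page image and its .box file:
    ["tesseract", f"{page}.tif", page, "box.train"],
    # Extract the set of characters seen in the .box files:
    ["unicharset_extractor", f"{page}.box"],
    # Cluster character features into prototypes:
    ["mftraining", "-F", "font_properties", "-U", "unicharset", f"{page}.tr"],
    ["cntraining", f"{page}.tr"],
    # Bundle everything into a loadable traineddata file:
    ["combine_tessdata", f"{lang}."],
]

for cmd in steps:
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # requires the Tesseract 3.x training tools
```

The IG approach differs only in how the page images and .box files come into existence: they are rendered from ground-truth text with the original font rather than taken from scans.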
Results

• Most accurate models overall: the 1- and 2-page models trained in the SI approach (CRA 96.30%/96.47%, WRA 89.03%/88.22%)
• Adding pages to a model in the SI approach caused a decline in accuracy (Fig. 1)***
• Adding pages to a model in the IG approach improved WRA,*** but not CRA (ns); still worse than the best SI models
• Image pre-processing improved CRA by up to 1.8% & WRA by up to 1.42%***
• Choice of training page influenced CRA*** and WRA***

Fig. 1: Influence of training approach (SI = grey, IG = black) and page count on CRA.

Character-Level Analyses
• Main effect of frequency on CRA across all character types*** (Fig. 2), but also within basic,*** composed,*** and punctuation*** types (generalized additive modelling [5], Fig. 3)
• No effect of character type on CRA (ns)

Fig. 2: Average CRA as a function of a character’s frequency (1-page Source Image model on pre-processed testing set).
Fig. 3: Average CRA as a function of a character’s logged frequency; same model & data as for Fig. 2.
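The character-level analysis behind Figs. 2–3 can be approximated with a sketch that tallies per-character accuracy alongside raw and logged frequency from aligned ground-truth/OCR string pairs (assuming a 1:1 character alignment for simplicity; the real analysis used the ISRI tools and generalized additive models [5]):

```python
from collections import defaultdict
from math import log

def per_character_accuracy(pairs):
    """pairs: iterable of (ground_truth, ocr_output) strings of equal
    length, i.e. already aligned character by character."""
    hits = defaultdict(int)
    freq = defaultdict(int)
    for gt, ocr in pairs:
        for g, o in zip(gt, ocr):
            freq[g] += 1
            hits[g] += (g == o)
    # Per character: CRA, raw frequency, logged frequency.
    return {c: (hits[c] / freq[c], freq[c], log(freq[c])) for c in freq}

stats = per_character_accuracy([("xaad", "xaat"), ("kil", "kil")])
for char, (cra, n, log_n) in sorted(stats.items()):
    print(char, cra, n, log_n)
```

Feeding such per-character triples into a GAM of CRA on (logged) frequency mirrors the modelling reported above.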