datech2014 - session 3 - pocoto - an open source system for efficient interactive postcorrection of...

PoCoTo: An Open Source System for Efficient Interactive Postcorrection of OCRed Historical Texts

Thorsten Vobl, Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter, Klaus U. Schulz

CIS - Center for Information and Language Processing, University of Munich; Gini GmbH, Munich

Motivation

- Historical texts still contain many OCR errors

- Downstream applications are harmed

- Option: improve quality with interactive postcorrection

Why: Selected and important texts/corpora, or parts of them, can or must be lifted to a much higher level of accuracy, up to perfection. Somehow “business driven”.

How: The user experience of the software has a major influence on the time and effort needed to improve accuracy.

Approach

Features to raise productivity, within our competence and explorative:

•  Plug in language technology that unmasks orthographic variation in historical language and returns document-specific distributions of OCR errors.

•  Tool visualizes series of similar OCR errors

•  Error series can be corrected in one shot

•  Implement productive UX through interface and functionality

Evaluation

Tool developed in a university environment during the EU project IMPACT and maintained since, despite serious fluctuation.

Practical user tests in three major European libraries have shown gains in time and correction rates.

User ratings from practitioners are high.

Maintaining interest; open for new languages and new functionalities.

Division of language resources and tool through a server-client model.

Published as an open source tool on GitHub.

§  Language technology used for improvement of interactive postcorrection

§  Lexica, matching tool, profiler integrated as background technology

§  Document-centric knowledge from unsupervised analysis of the OCRed document used for detection of error classes and suggested corrections

§  Batch mode for corrections of many errors in „one shot“

§  Rich graphical user interface to let users fully benefit from „knowledge“ on document derived error classes

Starting Point: Postcorrection Tool as a Carrier of Technology

Flexible GUI

[Figure: GUI layout with panels labeled OCR, Image, and Correction candidates / Special workflows]

§  Unlimited configuration of the views:

–  OCR with image snippets

–  Complete image page

–  Correction candidates, special workflows

Font-/window size configuration

§  OCRed text is presented to the user with word-image alignment.

§  Natural flow of text is maintained; comparison with the original text images is a lot easier than with focus hopping

View: OCR + Image Snippets

§  Alternative view with the complete page image.

–  Useful for difficult to read words

–  Useful if word segmentation of the OCR is too poor

–  Useful if long distance text understanding is needed

View: Original Image

§  Classical correction workflow through sequential manual input

Manual Correction

§  Speed-up through selection of proposed correction candidates

In line with what is usually offered: „Base Mode“

Drop Down Selection of Correction Candidates

Modern word form (Wmod) -> word form in ground truth (Wgt) -> word form in OCRed text (Wocr)

Wmod -> Wgt: patterns applied, „pattern trace“

Wgt -> Wocr: OCR errors applied, „OCR trace“

„Interpretation“ of the OCR token: starting from the OCR token Wocr

Estimation of the channel model

Two-Channel Model for OCRed historical Text
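The two-channel reading of an OCR token can be illustrated with a small scoring routine. This is a minimal sketch, not PoCoTo's implementation: it assumes that an interpretation is scored as the product of the modern word's probability, the probabilities of the patterns in its pattern trace, and the probabilities of the OCR errors in its OCR trace. The names `Interpretation`, `score`, and `rank`, and all probability values, are hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Interpretation:
    w_mod: str            # modern word form
    w_gt: str             # assumed ground-truth (historical) form
    pattern_trace: tuple  # historical patterns applied, as (modern, historical) pairs
    ocr_trace: tuple      # OCR errors applied, as (correct, recognized) pairs


def score(interp, p_word, p_pattern, p_ocr):
    """Unnormalized score: P(word) times the pattern and OCR error probabilities."""
    return (
        p_word.get(interp.w_mod, 0.0)
        * prod(p_pattern.get(p, 0.0) for p in interp.pattern_trace)
        * prod(p_ocr.get(e, 0.0) for e in interp.ocr_trace)
    )


def rank(interps, p_word, p_pattern, p_ocr):
    """Return (interpretation, normalized probability) pairs, best first."""
    scored = [(i, score(i, p_word, p_pattern, p_ocr)) for i in interps]
    total = sum(s for _, s in scored)
    if total == 0:
        return []
    return sorted(((i, s / total) for i, s in scored),
                  key=lambda pair: pair[1], reverse=True)


# Toy values for the OCR token "thnn"; all numbers are invented for illustration.
p_word = {"tun": 0.002, "dann": 0.004}
p_pattern = {("t", "th"): 0.3}
p_ocr = {("u", "n"): 0.05, ("d", "t"): 0.01, ("a", "h"): 0.01}
candidates = [
    Interpretation("tun", "thun", (("t", "th"),), (("u", "n"),)),
    Interpretation("dann", "dann", (), (("d", "t"), ("a", "h"))),
]
ranked = rank(candidates, p_word, p_pattern, p_ocr)
```

In this toy setting the interpretation via the historical form "thun" ranks first, because one plausible pattern plus one frequent OCR error outweighs two rare OCR errors.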

Improved model for

•  words

•  patterns

•  OCR errors

and their probabilities

For each OCR token Wocr: improved list of interpretations with probabilities

Final result

[Figure labels: Modern word, Ground truth, OCR trace, Hist trace, Local guess, Global guess]

Profiling of historical OCRed corpora with EM
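The alternation between a local guess per token and a global guess for the whole document can be sketched as a small EM loop. This is a deliberately simplified sketch, not the profiler itself: it assumes each candidate interpretation carries exactly one error type (or "none") and a fixed lexical prior, and it estimates only the document-wide error-type probabilities. The function name `em_profile` and the toy data are hypothetical.

```python
from collections import defaultdict


def em_profile(tokens, n_iter=20):
    """Estimate document-wide error-type probabilities with EM.

    tokens: one candidate list per OCR token; each candidate is
    (word, word_prior, error_type). Every token needs at least one
    candidate with a positive prior.
    Returns (theta, posteriors).
    """
    error_types = sorted({e for cands in tokens for _, _, e in cands})
    theta = {e: 1.0 / len(error_types) for e in error_types}  # uniform start
    posteriors = []
    for _ in range(n_iter):
        # E-step ("local guess"): posterior over the candidates of each token.
        posteriors = []
        for cands in tokens:
            weights = [prior * theta[e] for _, prior, e in cands]
            norm = sum(weights)
            posteriors.append([w / norm for w in weights])
        # M-step ("global guess"): expected relative frequency of each error type.
        counts = defaultdict(float)
        for cands, post in zip(tokens, posteriors):
            for (_, _, e), p in zip(cands, post):
                counts[e] += p
        theta = {e: counts[e] / len(tokens) for e in error_types}
    return theta, posteriors


# Toy document: three clean tokens and two ambiguous ones (invented priors).
tokens = [
    [("und", 0.9, "none")],
    [("und", 0.9, "none")],
    [("und", 0.9, "none")],
    [("und", 0.9, "u->n"), ("end", 0.1, "e->n")],   # OCR token "nnd"
    [("uns", 0.8, "u->n"), ("ans", 0.05, "a->n")],  # OCR token "nns"
]
theta, posteriors = em_profile(tokens)
```

Each iteration shifts probability mass toward error types that explain many tokens, which in turn sharpens the per-token ranking of interpretations.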

Document Eckartshausen. Result: probabilities of historical patterns

LMF

Document Eckartshausen. Result: probabilities of OCR errors

§  Valid historical words not marked as errors even if not in the lexicon („hypothetical lexicon“)

§  Historical variants proposed as correction candidates

Lexicons Triggered by Profiles
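One way to picture a „hypothetical lexicon“ is to generate historical variants from a modern lexicon using the patterns found in the profile. The following is a minimal sketch under that assumption; the function names, the single-application restriction, and the example patterns are hypothetical and not taken from the tool.

```python
def historical_variants(word, patterns):
    """Yield (variant, pattern) for each single application of a pattern."""
    for left, right in patterns:
        start = word.find(left)
        while start != -1:
            yield word[:start] + right + word[start + len(left):], (left, right)
            start = word.find(left, start + 1)


def build_hypothetical_lexicon(modern_lexicon, patterns):
    """Map each generated historical variant to the modern words it may stand for."""
    hypothetical = {}
    for word in modern_lexicon:
        for variant, pattern in historical_variants(word, patterns):
            if variant not in modern_lexicon:
                hypothetical.setdefault(variant, []).append((word, pattern))
    return hypothetical


# Invented example: two rewrite patterns as (modern, historical) pairs.
modern_lexicon = {"tun", "teil", "sein"}
patterns = [("t", "th"), ("ei", "ey")]
hypothetical = build_hypothetical_lexicon(modern_lexicon, patterns)
```

A token such as "thun" or "seyn" would then be found in the hypothetical lexicon and need not be flagged as an error, while its modern counterpart remains available for search and ranking.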

§  Improved ranking of candidates through document-specific language and error profile

§  Concordance Error View with high confidence corrections

Selection of Correction Candidates

§  High Probability Identical strings corrected as batch

§  Concordance views optional

Rapid Workflow - Batch Processing Identical Strings
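Batch correction of identical strings amounts to indexing the document by OCR string and applying one decision to every occurrence. A minimal sketch, assuming the document is available as a flat list of tokens; `group_identical` and `batch_correct` are hypothetical helper names.

```python
from collections import defaultdict


def group_identical(tokens):
    """Map each OCR string to the positions where it occurs."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    return positions


def batch_correct(tokens, wrong, right):
    """Return a corrected copy of tokens and the number of replacements made."""
    hits = group_identical(tokens).get(wrong, [])
    corrected = list(tokens)
    for index in hits:
        corrected[index] = right
    return corrected, len(hits)


# Invented example: one user decision fixes every occurrence of "nnd".
ocr_tokens = ["nnd", "der", "nnd", "die", "nnd"]
corrected, n_fixed = batch_correct(ocr_tokens, "nnd", "und")
```

The position index is also what a concordance view would need: each hit can be shown in context before the batch is confirmed.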

§  Strings with identical error patterns corrected as batch

§  In the example: n -> u

Rapid Workflow - Batch Processing Identical Error Patterns
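Grouping by error pattern rather than by string can be sketched in the same way. The sketch below assumes that each suspicious token already has a top correction candidate together with its OCR trace, written as (recognized, correct) pairs so that the example pattern n -> u becomes `("n", "u")`. The function names and the sample data are hypothetical.

```python
def find_series(suggestions, pattern):
    """Select tokens whose OCR trace consists of exactly the given pattern.

    suggestions: {ocr_token: (candidate, trace)} with trace as a tuple of
    (recognized, correct) pairs.
    """
    return {
        token: candidate
        for token, (candidate, trace) in suggestions.items()
        if trace == (pattern,)
    }


def apply_series(tokens, series):
    """Replace every token that belongs to the series; leave the rest untouched."""
    return [series.get(token, token) for token in tokens]


# Invented example data.
suggestions = {
    "nnd": ("und", (("n", "u"),)),
    "zn": ("zu", (("n", "u"),)),
    "rnit": ("mit", (("rn", "m"),)),
}
series = find_series(suggestions, ("n", "u"))
fixed = apply_series(["nnd", "zn", "rnit", "nnd"], series)
```

Because the series spans different strings, a review step before applying it is advisable; the sketch leaves that decision to the caller.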

Controlled “Hard” Evaluations

[Figure: BSB Dokument1, corrections made. x-axis: time in minutes (0 to 90); y-axis: corrections made (0 to 800). Series: User1 Full, User2 Full, User3 Base, User4 Base, User5 Full, User6 Base.]

§  Measurement points every 10 minutes for 90 minutes

§  Each user with a base/full session (inter/intra user comparison)

§  On average 1.5x to 3x more corrections in Full Mode

§  Early gains: first 10 minutes

Closer Look into the Data

Soft Evaluations

Questionnaires with all three institutions.

Favorite aspect: batch corrections

Main problems: stability, correction of segmentation errors

Future work

•  Extend to new Languages e.g. Latin

•  New Correction Scenarios e.g. specific Named Entity Correction

•  Turn Interest into a Community and Implement Industrial Tool Partnerships for isolated parts of the Software

Thanks for your attention!

… and special thanks to the University of Alicante, the Bavarian State Library, and the Royal Library of the Netherlands for their time and effort during the experiments