datech2014 - session 3 - pocoto - an open source system for efficient interactive postcorrection of...

PoCoTo - An Open Source System for Efficient Interactive Postcorrection of OCRed Historical Texts. Thorsten Vobl, Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter, Klaus U. Schulz. CIS - Center for Information and Language Processing, University of Munich; Gini GmbH, Munich

Upload: impact-centre-of-competence

Post on 22-Nov-2014


DESCRIPTION

Presentation of the paper PoCoTo - An Open Source System For Efficient Interactive Postcorrection of OCRed Historical Text by Thorsten Vobl, Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter and Klaus Schulz in DATeCH 2014. #digidays

TRANSCRIPT

Page 1: Datech2014 - Session 3 - PoCoTo - An Open Source System For Efficient Interactive Postcorrection of OCRed Historical Text

PoCoTo An Open Source System for Efficient Interactive Postcorrection of OCRed Historical Texts

Thorsten Vobl, Annette Gotscharek, Ulrich Reffle,

Christoph Ringlstetter, Klaus U. Schulz

CIS - Center for Information and Language Processing University of Munich Gini GmbH Munich

Page 2:

Motivation

- Historical texts still contain many OCR errors
- Downstream applications are harmed
- Option to improve quality with interactive postcorrection

Why: selected and important texts/corpora, or parts of them, can or must be lifted to a much higher level of accuracy, up to perfection. Somewhat “business driven”.

How: the user experience of the software has a major influence on the time and effort needed to improve accuracy.

Page 3:

Approach

Features to raise productivity, within our competence and explorative:

•  Plugin language technology that unmasks orthographic variation in historical language and returns document specific distributions of OCR errors
•  Tool visualizes series of similar OCR errors
•  Error series can be corrected in one shot
•  Implement productive UX through interface and functionality

Page 4:

Evaluation

- Tool developed in a university environment during the EU project IMPACT and maintained since, despite serious fluctuation
- Practical user tests in three major European libraries have shown gains in time/correction rates
- User ratings from practitioners are high
- Maintaining interest; open for new languages and new functionalities
- Division of language resources and tool through a server-client model
- Published as an open source tool on GitHub

Page 5:

§  Language technology used for improvement of interactive postcorrection

§  Lexica, matching tool, profiler integrated as background technology

§  Document centric knowledge from unsupervised analysis of the OCRed document used for detection of error classes and suggested corrections

§  Batch mode for correction of many errors in „one shot“

§  Rich graphical user interface to let users fully benefit from „knowledge“ on document derived error classes

Starting Point: Postcorrection Tool as a Carrier of Technology

Page 6:

Flexible GUI

[Screenshot labels: OCR; Correction candidates, Special workflows; Image]

§  Unlimited configuration of the views:

–  OCR with image snippets
–  Complete image page
–  Correction candidates, special workflows

Font-/window size configuration

Page 7:

§  OCRed text is presented to the user with word-image alignment.

§  Natural flow of text is maintained; comparison with the original text images is a lot easier than with focus hopping

View: OCR + Image Snippets

Page 8:

§  Alternative view with the complete page image.

–  Useful for difficult to read words
–  Useful if word segmentation of the OCR is too poor
–  Useful if long distance text understanding is needed

View: Original Image

Page 9:

§  Classical correction workflow through sequential manual input

Manual Correction

Page 10:

§  Speed-up through selection of proposed correction candidates

In line with what is usually offered: „Base Mode“

Drop Down Selection of Correction Candidates

Page 11:

Modern word form (Wmod) -> word form in ground truth (Wgt) -> word form in OCRed text (Wocr)

§  Wmod -> Wgt: patterns applied („pattern trace“)

§  Wgt -> Wocr: OCR errors applied („OCR trace“)

§  „Interpretation“ of the OCR token: starting from the OCR token Wocr

§  Estimation of the channel model

Two-Channel Model for OCRed historical Text
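The two channels can be illustrated with a minimal Python sketch. It is not taken from the PoCoTo code base: the `Rewrite` class, the function names and the example word are assumptions made only to show how a pattern trace and an OCR trace chain together.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rewrite:
    """One substitution left -> right at a fixed position (hypothetical helper)."""
    left: str
    right: str
    pos: int

    def apply(self, word: str) -> str:
        end = self.pos + len(self.left)
        # The rewrite is only valid if `left` really occurs at `pos`.
        if word[self.pos:end] != self.left:
            raise ValueError(f"{self.left!r} not found at {self.pos} in {word!r}")
        return word[:self.pos] + self.right + word[end:]


def apply_trace(word: str, trace: List[Rewrite]) -> str:
    """Apply rewrites in order; positions refer to the current string."""
    for rewrite in trace:
        word = rewrite.apply(word)
    return word


def interpret(w_mod: str, pattern_trace: List[Rewrite],
              ocr_trace: List[Rewrite]) -> Tuple[str, str]:
    """Channel 1: Wmod -> Wgt (historical patterns). Channel 2: Wgt -> Wocr (OCR errors)."""
    w_gt = apply_trace(w_mod, pattern_trace)
    w_ocr = apply_trace(w_gt, ocr_trace)
    return w_gt, w_ocr


# Illustrative example (assumed, not from the slides):
# modern "teil" -> historical "theil" -> misrecognised "tbeil".
w_gt, w_ocr = interpret(
    "teil",
    pattern_trace=[Rewrite("t", "th", 0)],
    ocr_trace=[Rewrite("h", "b", 1)],
)
print(w_gt, w_ocr)
```

Interpreting an OCR token means running this chain backwards: starting from Wocr, find plausible triples of modern word, pattern trace and OCR trace.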

Page 12:

Global guess: improved model for

•  words
•  patterns
•  OCR errors

and their probabilities

Local guess: for each OCR token Wocr, an improved list of interpretations with probabilities

Each interpretation: modern word, ground truth, OCR trace, hist trace

Final Result

Profiling of historical OCRed corpora with EM
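The alternation between local and global guesses can be sketched as a small EM loop. This is a strongly simplified illustration, not the profiler itself: it assumes each candidate interpretation of a token is summarised by a single label (an OCR error type such as `"n->u"`, or `"none"`), and the function name and data are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List


def em_profile(tokens: List[List[str]], iterations: int = 20) -> Dict[str, float]:
    """Estimate a global distribution over error labels from ambiguous tokens.

    tokens[i] lists the distinct candidate labels for OCR token i.
    """
    labels = sorted({label for cands in tokens for label in cands})
    # Start from a uniform global guess.
    theta = {label: 1.0 / len(labels) for label in labels}
    for _ in range(iterations):
        expected = defaultdict(float)
        # E-step (local guess): weight each token's candidates by the global model.
        for cands in tokens:
            total = sum(theta[c] for c in cands)
            for c in cands:
                expected[c] += theta[c] / total
        # M-step (global guess): re-estimate label probabilities from expected counts.
        theta = {label: expected[label] / len(tokens) for label in labels}
    return theta


# Made-up observations: the third token is ambiguous between two error types.
observations = [["n->u"], ["n->u"], ["n->u", "b->h"], ["none"]]
profile = em_profile(observations)
print(profile)
```

Because `"n->u"` is also supported by unambiguous tokens, the ambiguous token is pulled towards it over the iterations, which is the effect a document-specific profile is meant to exploit.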

Page 13:

Document Eckartshausen

Result Probabilities historical patterns

Page 14:

LMF

Document Eckartshausen

Result Probabilities OCR errors

Page 15:

§  Valid historical words not marked as errors even if not in the lexicon („hypothetical lexicon“)

§  Historical variants proposed as correction candidates

Lexicons Triggered by Profiles
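A hypothetical lexicon can be sketched as follows: historical spelling patterns are applied to modern lexicon entries, and tokens matching one of the generated forms are not flagged. The function names, the tiny lexicon and the single pattern `t -> th` are assumptions for illustration, not the resources used by the tool.

```python
from typing import Dict, Iterable, Set, Tuple


def historical_variants(word: str, patterns: Iterable[Tuple[str, str]]) -> Set[str]:
    """All strings obtained by applying one pattern at one position of `word`."""
    variants = set()
    for left, right in patterns:
        start = word.find(left)
        while start != -1:
            variants.add(word[:start] + right + word[start + len(left):])
            start = word.find(left, start + 1)
    return variants


def build_hypothetical_lexicon(modern: Iterable[str],
                               patterns: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each generated historical form back to its modern word."""
    patterns = list(patterns)
    hypo: Dict[str, str] = {}
    for word in modern:
        for variant in historical_variants(word, patterns):
            hypo.setdefault(variant, word)
    return hypo


def is_suspicious(token: str, modern: Set[str], hypo: Dict[str, str]) -> bool:
    """Flag a token only if neither lexicon explains it."""
    return token not in modern and token not in hypo


modern_lexicon = {"teil", "tun"}          # assumed toy lexicon
patterns = [("t", "th")]                  # assumed historical pattern
hypo_lexicon = build_hypothetical_lexicon(modern_lexicon, patterns)

print(is_suspicious("theil", modern_lexicon, hypo_lexicon))  # historical form, not flagged
print(is_suspicious("tbeil", modern_lexicon, hypo_lexicon))  # unexplained, flagged
```

The same mapping also yields the correction candidates: a flagged token can be compared against both the modern and the generated historical forms.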

Page 16:

§  Improved Ranking of candidates through document specific language and error profile

§  Concordance Error View with high confidence corrections

Selection of Correction Candidates
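One way to picture profile-based ranking is to multiply a word probability by the document-specific probabilities of the OCR errors a candidate requires. The scoring formula, the label notation (`"x->y"` meaning the OCR shows x where y is correct) and all numbers below are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

# (correction, word probability, OCR trace as error labels) - hypothetical structure
Candidate = Tuple[str, float, Tuple[str, ...]]


def score(candidate: Candidate, error_profile: Dict[str, float],
          floor: float = 1e-6) -> float:
    """Word probability times the profile probability of every required OCR error."""
    _, word_prob, ocr_trace = candidate
    p = word_prob
    for error in ocr_trace:
        # Errors unseen in this document get a small floor probability.
        p *= error_profile.get(error, floor)
    return p


def rank_candidates(candidates: List[Candidate],
                    error_profile: Dict[str, float]) -> List[Candidate]:
    """Sort candidates from most to least plausible under the profile."""
    return sorted(candidates, key=lambda c: score(c, error_profile), reverse=True)


# Made-up profile and candidates for the OCR token "nnd".
error_profile = {"n->u": 0.02, "n->a": 0.0005}
candidates = [("and", 0.001, ("n->a",)), ("und", 0.03, ("n->u",))]
ranked = rank_candidates(candidates, error_profile)
print([c[0] for c in ranked])
```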

Page 17:

§  High Probability Identical strings corrected as batch

§  Concordance views optional

Rapid Workflow - Batch Processing Identical Strings
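Batch correction of identical strings, with an optional concordance check, can be sketched like this. The function names and the sample tokens are assumptions; the sketch only shows the idea of reviewing all occurrences in context and then replacing them in one step.

```python
from typing import List, Tuple


def concordance(tokens: List[str], target: str,
                width: int = 2) -> List[Tuple[str, str, str]]:
    """Return (left context, token, right context) for every occurrence of `target`."""
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def batch_correct(tokens: List[str], wrong: str,
                  right: str) -> Tuple[List[str], int]:
    """Replace every occurrence of `wrong`; also report how many were changed."""
    count = sum(1 for token in tokens if token == wrong)
    corrected = [right if token == wrong else token for token in tokens]
    return corrected, count


tokens = "nnd er kam nnd sah".split()   # made-up OCR output
views = concordance(tokens, "nnd", width=1)
corrected, changed = batch_correct(tokens, "nnd", "und")
print(views)
print(corrected, changed)
```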

Page 18:

§  Strings with identical error patterns corrected as batch

§  In the example: n -> u

Rapid Workflow - Batch Processing Identical Error Patterns
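Grouping by error pattern can be sketched by comparing each OCR token with its suggested correction and collecting tokens that differ by the same single substitution. The helper names and the suggestion table are assumptions, and only one-character substitutions are handled to keep the sketch short.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Pattern = Tuple[str, str]  # (character in OCR token, character in correction)


def single_substitution(ocr: str, candidate: str) -> Optional[Pattern]:
    """Return the pattern if the two strings differ in exactly one character."""
    if len(ocr) != len(candidate):
        return None
    diffs = [(a, b) for a, b in zip(ocr, candidate) if a != b]
    return diffs[0] if len(diffs) == 1 else None


def group_by_pattern(suggestions: Dict[str, str]) -> Dict[Pattern, List[str]]:
    """Collect OCR tokens whose suggested correction uses the same pattern."""
    groups: Dict[Pattern, List[str]] = defaultdict(list)
    for ocr, candidate in suggestions.items():
        pattern = single_substitution(ocr, candidate)
        if pattern is not None:
            groups[pattern].append(ocr)
    return groups


def correct_pattern(tokens: List[str], suggestions: Dict[str, str],
                    pattern: Pattern) -> List[str]:
    """Apply the suggested correction to every token in the chosen pattern group."""
    targets = set(group_by_pattern(suggestions).get(pattern, []))
    return [suggestions[t] if t in targets else t for t in tokens]


# Made-up suggestions: two tokens share the pattern n -> u, one uses b -> h.
suggestions = {"nnd": "und", "hans": "haus", "tbeil": "theil"}
tokens = "das hans nnd der tbeil".split()
fixed = correct_pattern(tokens, suggestions, ("n", "u"))
print(fixed)
```

Only the `n -> u` series is corrected here; the `b -> h` token stays untouched until its own series is selected.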

Page 19:

Controlled “Hard” Evaluations

[Figure: BSB Dokument1. Corrections made (y-axis, 0 to 800) over time in minutes (x-axis, 0 to 90), one curve per session: User1 Full, User2 Full, User3 Base, User4 Base, User5 Full, User6 Base]

§  Measure points every 10 minutes for 90 minutes

§  Each user with a base/full session (inter/intra user comparison)

§  More corrections in Full Mode: on average 1.5x to 3x

§  Early gains: first 10 minutes

Page 20:

Closer Look into the Data

Page 21:

Soft Evaluations

Questionnaires with all three institutions.

Most favorite aspect: Batch Corrections

Main problems: Stability, Correction of Segmentation Errors

Page 22:

Future work

•  Extend to new Languages e.g. Latin

•  New Correction Scenarios e.g. specific Named Entity Correction

•  Turn Interest into a Community and Implement Industrial Tool Partnerships for isolated parts of the Software

Page 23:

Thanks for your attention!

… and special thanks to University of Alicante, Bavarian State Library, Royal Library of the Netherlands for their Time and Efforts during the Experiments