datech2014 - session 3 - pocoto - an open source system for efficient interactive postcorrection of...
DESCRIPTION
Presentation of the paper PoCoTo - An Open Source System For Efficient Interactive Postcorrection of OCRed Historical Text by Thorsten Vobl, Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter and Klaus Schulz in DATeCH 2014. #digidays
TRANSCRIPT
PoCoTo - An Open Source System for Efficient Interactive Postcorrection of OCRed Historical Texts
Thorsten Vobl, Annette Gotscharek, Ulrich Reffle,
Christoph Ringlstetter, Klaus U. Schulz
CIS - Center for Information and Language Processing, University of Munich
Gini GmbH, Munich
Motivation
- Historical texts still contain many OCR errors
- Downstream applications are harmed
- Option to improve quality with interactive postcorrection
Why: Selected and important texts/corpora, or parts of them, can or must be lifted to a much higher level of accuracy, up to perfection. Somewhat “business driven”.
How: The user experience of the software has a major influence on the time and effort needed to improve accuracy.
Approach
Features to raise productivity, within our competence and explorative:
• Plug-in language technology that unmasks orthographic variation in historical language and returns document-specific distributions of OCR errors
• Tool visualizes series of similar OCR errors
• Error series can be corrected in one shot
• Implement productive UX through interface and functionality
Evaluation
Tool developed in a university environment during the EU project IMPACT and maintained since, despite serious fluctuation.
Practical user tests in three major European libraries have shown gains in time and correction rates. User ratings from practitioners are high.
Maintaining interest; open for new languages and new functionalities.
Division of language resources and tool through a server-client model.
Published as an open source tool on GitHub.
§ Language technology used for improvement of interactive postcorrection
§ Lexica, matching tool, profiler integrated as background technology
§ Document centric knowledge from unsupervised analysis of OCRed document used for detection of error classes and suggested corrections
§ Batch mode for corrections of many errors in „one shot“
§ Rich graphical user interface to let users fully benefit from „knowledge“ on document derived error classes
Starting Point: Postcorrection Tool as a Carrier of Technology
Flexible GUI
OCR
Correction candidates, Special workflows
Image
§ Unlimited configuration of the views:
– OCR with image snippets
– Complete image page
– Correction candidates, special workflows
Font-/window size configuration
§ OCRed text is presented to the user with word-image alignment.
§ Natural flow of text is maintained, comparison with original text images a lot easier than with focus hopping
View: OCR + Image Snippets
§ Alternative view with the complete page image.
– Useful for difficult to read words
– Useful if word segmentation of the OCR is too poor
– Useful if long distance text understanding is needed
View: Original Image
§ Classical correction workflow through sequential manual input
Manual Correction
§ Speed-up through selection of proposed correction candidates
In line with what is usually offered: „Base Mode“
Drop Down Selection of Correction Candidates
Modern word form Wmod → word form in ground truth Wgt → word form in OCRed text Wocr
Patterns applied (Wmod → Wgt): „pattern trace“
OCR errors applied (Wgt → Wocr): „OCR trace“
„Interpretation“ of the OCR token, starting from OCR token Wocr
Estimation of the Channel Model
Two-Channel Model for OCRed historical Text
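The two channels can be pictured as two successive rewrite steps. The following is a minimal sketch only: the `Rewrite` tuple, the `Interpretation` class and the example word are illustrative assumptions, not PoCoTo's or the profiler's actual data structures.

```python
from dataclasses import dataclass

# A rewrite replaces `left` with `right` at character offset `pos`.
# Hypothetical representation chosen for this sketch.
Rewrite = tuple[str, str, int]


def apply_trace(word: str, trace: list[Rewrite]) -> str:
    """Apply rewrites in order; offsets refer to the string as it is at that step."""
    for left, right, pos in trace:
        if word[pos:pos + len(left)] != left:
            raise ValueError(f"{left!r} not found at offset {pos} in {word!r}")
        word = word[:pos] + right + word[pos + len(left):]
    return word


@dataclass
class Interpretation:
    modern: str                    # Wmod
    pattern_trace: list[Rewrite]   # channel 1: historical spelling patterns
    ocr_trace: list[Rewrite]       # channel 2: OCR errors

    def ground_truth(self) -> str:
        """Wgt: the modern form after the historical patterns."""
        return apply_trace(self.modern, self.pattern_trace)

    def ocr_token(self) -> str:
        """Wocr: the ground truth form after the OCR errors."""
        return apply_trace(self.ground_truth(), self.ocr_trace)


# Assumed example: modern "teil", historical pattern t -> th, OCR error h -> b.
interp = Interpretation("teil", [("t", "th", 0)], [("h", "b", 1)])
print(interp.modern, "->", interp.ground_truth(), "->", interp.ocr_token())
```

Interpreting an OCR token means running this in reverse: starting from Wocr, find the combinations of modern word, pattern trace and OCR trace that could have produced it.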
Improved model for
• words
• patterns
• OCR errors
and their probabilities
For each OCR token Wocr: improved list of interpretations with probabilities
Final Result
Modern word
Ground truth
OCR trace
Hist trace
Local guess Global guess
Profiling of historical OCRed corpora with EM
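To illustrate the loop between "local guess" and "global guess", here is a heavily simplified EM sketch. It assumes each OCR token has been reduced to the set of labels that could explain it (for example "none" or one OCR error pattern) and estimates a single categorical distribution over labels. The function name, the labels and the toy data are assumptions for illustration; the actual profiler estimates words, historical patterns and OCR errors jointly.

```python
from collections import defaultdict


def em_label_probs(observations, iterations=20):
    """Estimate label probabilities when each token only narrows the label to a set."""
    labels = sorted(set().union(*observations))
    theta = {z: 1.0 / len(labels) for z in labels}  # uniform start
    for _ in range(iterations):
        expected = defaultdict(float)
        for candidates in observations:
            # E-step (local guess): posterior over this token's candidate labels
            norm = sum(theta[z] for z in candidates)
            for z in candidates:
                expected[z] += theta[z] / norm
        # M-step (global guess): relative expected counts over all tokens
        theta = {z: expected[z] / len(observations) for z in labels}
    return theta


# Toy data (assumed): two tokens clearly show n->u, three are clean,
# two are ambiguous between a clean reading and an error reading.
observations = [
    {"n->u"}, {"n->u"},
    {"none"}, {"none"}, {"none"},
    {"none", "n->u"},
    {"none", "c->e"},
]
theta = em_label_probs(observations)
for label, p in sorted(theta.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {p:.3f}")
```

The ambiguous tokens are split according to the current global estimate, so an error type already well attested in the document gains weight while one seen only in ambiguous cases fades.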
Document Eckartshausen
Result Probabilities historical patterns
LMF
Document Eckartshausen
Result Probabilities OCR errors
§ Valid historical words not marked as errors even if not in the lexicon („hypothetical lexicon“)
§ Historical variants proposed as correction candidates
Lexicons Triggered by Profiles
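A minimal sketch of how a „hypothetical lexicon“ could be derived, assuming the profile yields a list of (modern, historical) rewrite patterns and that one pattern is applied at one position per variant. The function names and the tiny lexicon are illustrative assumptions, not the tool's API.

```python
def historical_variants(word, patterns):
    """Yield spellings obtained by applying one pattern at one position."""
    for modern, hist in patterns:
        start = word.find(modern)
        while start != -1:
            yield word[:start] + hist + word[start + len(modern):]
            start = word.find(modern, start + 1)


def build_hypothetical_lexicon(modern_lexicon, patterns):
    """Map each generated historical spelling to the modern words it may stem from."""
    hypothetical = {}
    for word in modern_lexicon:
        for variant in historical_variants(word, patterns):
            hypothetical.setdefault(variant, set()).add(word)
    return hypothetical


modern_lexicon = {"teil", "tun", "und"}   # assumed toy lexicon
patterns = [("t", "th")]                  # assumed pattern: modern t written as th
hypothetical = build_hypothetical_lexicon(modern_lexicon, patterns)


def is_suspicious(token):
    """A token is flagged only if neither lexicon knows it."""
    return token not in modern_lexicon and token not in hypothetical


print(is_suspicious("theil"), is_suspicious("tbeil"))
```

Under these assumptions "theil" is accepted as a valid historical spelling of "teil", while "tbeil" is still marked as a likely error.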
§ Improved Ranking of candidates through document specific language and error profile
§ Concordance Error View with high confidence corrections
Selection of Correction Candidates
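One plausible way to rank candidates with a document-specific profile is to multiply a word probability by the profile probabilities of the patterns each candidate needs. This is a sketch under that assumption; the `Candidate` class, the profile layout, the floor value and all numbers are invented for illustration.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Candidate:
    correction: str
    word_prob: float
    hist_patterns: tuple = ()   # (modern, historical) pairs
    ocr_errors: tuple = ()      # (ocr, corrected) pairs


def score(candidate, profile, floor=1e-6):
    """Word probability times pattern probabilities; unseen patterns get `floor`."""
    hist = prod(profile["hist"].get(p, floor) for p in candidate.hist_patterns)
    ocr = prod(profile["ocr"].get(e, floor) for e in candidate.ocr_errors)
    return candidate.word_prob * hist * ocr


def rank(candidates, profile):
    return sorted(candidates, key=lambda c: score(c, profile), reverse=True)


# Assumed profile of one document: n is frequently a misread u.
profile = {
    "hist": {("t", "th"): 0.3},
    "ocr": {("n", "u"): 0.05, ("n", "a"): 0.002},
}
candidates_for_nnd = [
    Candidate("and", 0.0001, ocr_errors=(("n", "a"),)),
    Candidate("und", 0.02, ocr_errors=(("n", "u"),)),
]
ranked = rank(candidates_for_nnd, profile)
print([c.correction for c in ranked])
```

Because the error probabilities come from the document itself, the same OCR token can receive a different top candidate in a document with a different error profile.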
§ High Probability Identical strings corrected as batch
§ Concordance views optional
Rapid Workflow - Batch Processing Identical Strings
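Correcting identical strings as a batch amounts to one pass over all tokens of the document. A minimal sketch, assuming a simple `Token` record; the field names and the example tokens are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    page: int
    position: int
    text: str
    corrected: bool = False


def correct_identical(tokens, ocr_string, correction):
    """Replace every uncorrected occurrence of `ocr_string`; return how many changed."""
    hits = [t for t in tokens if t.text == ocr_string and not t.corrected]
    for token in hits:
        token.text = correction
        token.corrected = True
    return len(hits)


tokens = [Token(1, 0, "nnd"), Token(1, 5, "Haus"), Token(2, 3, "nnd"), Token(7, 1, "nnd")]
changed = correct_identical(tokens, "nnd", "und")
print(changed, [t.text for t in tokens])
```

In an interactive setting the `hits` list is what an optional concordance view would display before the user confirms the batch.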
§ Strings with identical error patterns corrected as batch
§ In the example: n -> u
Rapid Workflow - Batch Processing Identical Error Patterns
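Batching by error pattern groups different strings that share the same error. The sketch below assumes that each token comes with a suggested correction, that only single-character substitutions are considered, and that "n -> u" is read as "OCR n corrected to u". All names and example words are illustrative.

```python
from collections import defaultdict


def single_substitution(ocr, correction):
    """Return (ocr_char, corrected_char) if the strings differ in exactly one position."""
    if len(ocr) != len(correction):
        return None
    diffs = [(a, b) for a, b in zip(ocr, correction) if a != b]
    return diffs[0] if len(diffs) == 1 else None


def group_by_pattern(suggestions):
    """Group (ocr, correction) pairs by the substitution that separates them."""
    groups = defaultdict(list)
    for ocr, correction in suggestions:
        pattern = single_substitution(ocr, correction)
        if pattern is not None:
            groups[pattern].append((ocr, correction))
    return groups


suggestions = [("nnd", "und"), ("Hans", "Haus"), ("nnter", "unter"), ("dcr", "der")]
groups = group_by_pattern(suggestions)

# One-shot correction of the whole n -> u series.
batch = groups[("n", "u")]
corrected = [correction for _, correction in batch]
print(corrected)
```

Unlike the identical-string batch, one confirmation here covers several distinct word forms, which is where the pattern view saves the most keystrokes.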
Controlled “Hard” Evaluations
[Plot: BSB Dokument1, corrections made (y-axis, 0 to 800) against time in minutes (x-axis, 0 to 90); series: User1 Full, User2 Full, User3 Base, User4 Base, User5 Full, User6 Base]
§ Measurement points every 10 minutes for 90 minutes
§ Each user with a base/full session (inter/intra user comparison)
§ On average 1.5x – 3x more corrections in Full Mode than in Base Mode
§ Early gains: first 10 minutes
Closer Look into the Data
Soft Evaluations
Questionnaires with all three institutions.
Favorite aspect: Batch Corrections
Main problems: Stability, Correction of Segmentation Errors
Future work
• Extend to new Languages e.g. Latin
• New Correction Scenarios e.g. specific Named Entity Correction
• Turn Interest into a Community and Implement Industrial Tool Partnerships for isolated parts of the Software
Thanks for your attention!
… and special thanks to University of Alicante, Bavarian State Library, Royal Library of the Netherlands for their Time and Efforts during the Experiments