IMPACT Final Conference - Ulrich Reffle
DESCRIPTION
Post-correction in IMPACT with Ulrich Reffle from the University of Munich
TRANSCRIPT
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Analysis and Post-Correction of OCR-processed historical documents
Ulrich Reffle
CIS, University of Munich
24.10.2011 Ulrich Reffle
Overview
– Document-specific analysis of OCR results of historical documents
– A system for interactive OCR post-correction
Document-specific analysis of OCR results of historical documents
Why do we need special methods?
Problems specific to the processing of historical language in the context of mass digitization:
– High OCR error rates
– No standardized language
Special resources and methods are needed for OCR, post-processing and Information Retrieval
[Diagram: digital image, OCR result, OCR post-correction, IR. Label: problem of historical language variation]
Why do we need special methods?
Diversity of input material makes document-specific parameter settings important:
– Distribution of spelling variants
– Special vocabulary
– OCR channel model
Document-specific language and error profiles
Language and error profiles provide document-specific characteristics of the language and of the OCR errors.
Language profile: shares of foreign languages (such as Latin, French), frequencies for language modeling, important patterns of spelling variation (in English: e.g. o→ou, v→u)
Error profile: estimated error rate, important error patterns (like e→c, i→l), frequent erroneous words
Language and error profiles are computed fully automatically; no manual interaction or ground truth is needed.
Global Profile of a document
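Language profile

| Lexicon | % |
| --- | --- |
| Modern | 82% |
| Historic | 9% |
| Place names | 6% |
| Latin | 3% |

| Pattern | Frequency |
| --- | --- |
| t→th | 120 |
| i→y | 106 |
| ä→a | 38 |
| … | … |

Error profile

| Words | % |
| --- | --- |
| Correct words | 72% |
| Erroneous words | 20% |
| Unknown words | 8% |

| Pattern | Frequency |
| --- | --- |
| e→c | 51 |
| n→u | 45 |
| t→i | 34 |
| … | … |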
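As an illustration of how per-word readings could be aggregated into the kind of global profile shown on this slide, here is a minimal sketch. It assumes each OCR token already comes with one chosen interpretation; the `Interpretation` class, the `global_profile` function and the sample tokens are hypothetical and are not the (patent-protected) profiling method of the talk.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    """Hypothetical best reading of a single OCR token."""
    token: str
    lexicon: str  # lexicon that explains the token, e.g. "modern" or "latin"
    variant_patterns: list = field(default_factory=list)  # e.g. ["t→th"]
    ocr_patterns: list = field(default_factory=list)      # e.g. ["e→c"]


def global_profile(interpretations):
    """Aggregate per-token readings into document-level counts and shares."""
    variants, errors, lexica = Counter(), Counter(), Counter()
    erroneous = 0
    for reading in interpretations:
        variants.update(reading.variant_patterns)
        errors.update(reading.ocr_patterns)
        lexica[reading.lexicon] += 1
        if reading.ocr_patterns:  # at least one OCR error pattern involved
            erroneous += 1
    total = len(interpretations)
    return {
        "variant_patterns": variants,
        "ocr_error_patterns": errors,
        "lexicon_shares": {k: v / total for k, v in lexica.items()} if total else {},
        "error_rate": erroneous / total if total else 0.0,
    }


# Invented sample data, for illustration only.
sample = [
    Interpretation("theil", "modern", variant_patterns=["t→th"]),
    Interpretation("thcil", "modern", variant_patterns=["t→th"], ocr_patterns=["e→c"]),
    Interpretation("et", "latin"),
    Interpretation("London", "place names"),
]
profile = global_profile(sample)
print(profile["variant_patterns"].most_common(3))
print(profile["ocr_error_patterns"].most_common(3))
```

The two counters correspond to the two pattern tables above; the shares correspond to the lexicon and word-status percentages.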
Local profile of all words of a document
Example tokens: „theil“, „hatn“
Weighted set of interpretations/correction suggestions for each word of the document.

| Correction suggestion | Modern spelling | Probability |
| --- | --- | --- |
| hath | has | 0.95 |
| hat | hat | 0.01 |
| hate | hate | 0.04 |
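A local profile of this kind can be pictured as a mapping from each OCR token to a weighted list of candidates. The sketch below is only an illustration using the values from the table; the names `Candidate`, `normalize` and `best` are assumptions and not part of the system described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    suggestion: str  # proposed correction, possibly a historical spelling
    modern: str      # modern spelling of the suggestion
    weight: float    # probability-like score


def normalize(candidates):
    """Rescale the weights so that they sum to 1."""
    total = sum(c.weight for c in candidates)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [Candidate(c.suggestion, c.modern, c.weight / total) for c in candidates]


def best(candidates):
    """Return the candidate with the highest weight."""
    return max(candidates, key=lambda c: c.weight)


# Local profile for the OCR token "hatn", using the values from the table.
local_profile = {
    "hatn": normalize([
        Candidate("hath", "has", 0.95),
        Candidate("hat", "hat", 0.01),
        Candidate("hate", "hate", 0.04),
    ])
}
top = best(local_profile["hatn"])
print(top.suggestion, top.modern)
```

Keeping the historical suggestion and its modern spelling side by side lets a correction tool restore the historical form while downstream retrieval can use the modern one.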
Summary
Document-specific profiles …
– are computed in a fully automated way from OCR output
– provide characteristics of language and OCR error channel in order to adapt OCR and downstream processes.
System for interactive post-correction of OCR results
Post-correction system
A graphical user interface for fast and convenient post-correction specifically for OCRed historical documents
Novel possibilities for detection, presentation and correction of systematic OCR errors.
Post-correction system
[Screenshot of the interface with three labeled areas: special functionality, image, OCR editor]
Proper treatment of spelling variants
Historical spelling variants are identified with the help of historical lexica and language profiles.
Local profiles include non-modern words as correction suggestions.
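One simple way to picture pattern-based identification of spelling variants is to undo a historical pattern inside a token and check whether the result is a modern lexicon word. The following is a minimal sketch under that assumption; `modern_readings`, the tiny lexicon and the single-pattern restriction are illustrative choices, not the matching procedure used in IMPACT.

```python
def modern_readings(token, lexicon, patterns):
    """Find modern words that explain `token` via at most one variant pattern.

    `patterns` holds (modern, historic) pairs such as ("t", "th"), read as
    "modern t may be spelled th in the historical text".
    Returns a set of (modern_word, pattern_label) pairs; the label is None
    when the token itself is a modern word.
    """
    readings = set()
    if token in lexicon:
        readings.add((token, None))
    for modern, historic in patterns:
        start = token.find(historic)
        while start != -1:
            # Undo the pattern at this position and look the result up.
            candidate = token[:start] + modern + token[start + len(historic):]
            if candidate in lexicon:
                readings.add((candidate, f"{modern}→{historic}"))
            start = token.find(historic, start + 1)
    return readings


# Toy lexicon and patterns, for illustration only.
lexicon = {"teil", "heil", "bei"}
patterns = [("t", "th"), ("i", "y")]
print(modern_readings("theil", lexicon, patterns))
print(modern_readings("bey", lexicon, patterns))
```

A token explained this way would be treated as a legitimate historical spelling rather than flagged as an OCR error.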
Conventional correction methods
Correcting words in the text view:
– Manual input
– Selection of a correction suggestion
Batch correction of systematic OCR errors
Systematic OCR errors are identified by the error profile.
Batches of errors can be corrected with just a few keystrokes.
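The idea of correcting a whole batch at once can be sketched as follows: tokens whose top suggestion involves the same OCR error pattern are grouped, and one confirmation applies all corrections in the group. The functions `group_by_pattern` and `apply_batch` and the sample data are hypothetical; the actual tool may organize batches differently.

```python
from collections import defaultdict


def group_by_pattern(tokens, suggestions):
    """Group token positions by the OCR error pattern of their suggestion.

    `suggestions` maps an OCR token to a (correction, pattern) pair.
    Returns a dict: pattern -> list of (position, correction).
    """
    batches = defaultdict(list)
    for index, token in enumerate(tokens):
        if token in suggestions:
            correction, pattern = suggestions[token]
            batches[pattern].append((index, correction))
    return batches


def apply_batch(tokens, batch):
    """Return a copy of `tokens` with every correction in `batch` applied."""
    corrected = list(tokens)
    for index, correction in batch:
        corrected[index] = correction
    return corrected


# Invented OCR output with a systematic e→c confusion.
tokens = ["thc", "house", "thc", "wcll"]
suggestions = {"thc": ("the", "e→c"), "wcll": ("well", "e→c")}
batches = group_by_pattern(tokens, suggestions)
corrected = apply_batch(tokens, batches["e→c"])
print(corrected)
```

In an interactive setting the user would inspect the batch for a pattern first and then accept or reject it as a whole, which is where the saving in keystrokes would come from.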
Evaluation
User experiment with 14 participants. Novel technology makes correction up to 2.7 times faster.
Availability
The graphical interface is going to be distributed as open source.
Document pre-processing to obtain language and error profiles is protected by a US patent application.
– Pre-processing is offered as a web service, as of now free of charge.
Thank you!
http://[email protected]