
Page 1: Europeana Newspapers LFT Infoday Genereux


Institute for Specialised Communication and Multilingualism

Michel Généreux 27.10.2014

Correcting OCR results with computational linguistic methods

Michel Généreux, EURAC

Bozen / Bolzano

in collaboration with Egon W. Stemle, Lionel Nicolas, Verena Lyding and Katalin Szabò

Page 2: Europeana Newspapers LFT Infoday Genereux


Introduction

The OPATCH project (Open Platform for Access to and Analysis of Textual Documents of Cultural Heritage) aims at creating an advanced on-line search infrastructure for research on a historical newspaper archive.

• Duration: 24 months (Jan 2014 – Dec 2015)
• Funding: Autonome Provinz Bozen-Südtirol, Landesgesetz Nr. 14, „Forschung und Innovation“
• Partners:
  • Landesbibliothek Dr. Friedrich Teßmann, Bozen
  • Institut für Corpuslinguistik und Texttechnologie (ICLTT), Wien

To implement this, OPATCH builds on computational linguistic (CL) methods for structural parsing, word class tagging and named entity recognition.

The newspapers date from between 1910 and 1920, are typeset in the blackletter Fraktur font, and the paper quality has deteriorated with age.

Hence, in OPATCH we start from highly error-prone OCR-ed text, in quantities that cannot realistically be corrected manually.

Page 3: Europeana Newspapers LFT Infoday Genereux


A Glance at the Teßmann collection

626,287 pages for a total of 819,310,354 tokens;

616,751,127 of these tokens are in the reference dictionary, i.e. a degree of cleanness of 0.75.

Our post-OCR correction system is based on 10 OCR-ed pages with a cleanness of 0.70.

Given that the dictionary covers on average 91% of all words, a cleanness of 0.90 corresponds to almost perfect OCR.
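The cleanness figure is simply the share of tokens found in the reference dictionary. A minimal sketch of that computation, assuming whitespace tokenization and a plain word-list dictionary (both simplifications; the project's actual tokenizer and dictionary are not described here):

    def cleanness(tokens, dictionary):
        """Fraction of tokens that appear in the reference dictionary."""
        if not tokens:
            return 0.0
        return sum(1 for t in tokens if t.lower() in dictionary) / len(tokens)

    # Hypothetical toy example:
    dictionary = {"die", "zeitung", "erscheint", "heute", "nicht"}
    tokens = "die Zeitung erschclnt heute nicht".split()
    print(cleanness(tokens, dictionary))  # 0.8: 'erschclnt' is an OCR error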

Focusing on the best pages from the Teßmann collection ...

Page 4: Europeana Newspapers LFT Infoday Genereux


OPATCH 100k pages

Newsp_Decade | Pages | Tokens | Tokens in dict. (fraction) | Cumul. pages | Cumul. tokens | Cumul. tokens in dict. | Cumul. fraction
PM_1900 | 1032 | 1137449 | 0.901 | 1032 | 1137449 | 1024804 | 90.1%
AM_1890 | 8 | 9457 | 0.899 | 1040 | 1146906 | 1033302 | 90.1%
FA_1900 | 8 | 8652 | 0.887 | 1048 | 1155558 | 1040978 | 90.1%
PM_1910 | 737 | 750131 | 0.886 | 1785 | 1905689 | 1705286 | 89.5%
IS_1900 | 2970 | 2817010 | 0.879 | 4755 | 4722699 | 4180623 | 88.5%
WB_1900 | 72 | 45481 | 0.872 | 4827 | 4768180 | 4220278 | 88.5%
SVB_1890 | 9749 | 14306473 | 0.809 | 70546 | 92484068 | 75861319 | 82.0%
SVB_1900 (Volksblatt) | 17545 | 26314923 | 0.807 | 88091 | 118798991 | 97103539 | 81.7%
BZN_1890 (Bozner Nachrichten) | 16015 | 12153784 | 0.806 | 104106 | 130952775 | 106894635 | 81.6%

Page 5: Europeana Newspapers LFT Infoday Genereux


OPATCH unannotated 5k pages

We select the 5000 cleanest pages from the Teßmann collection above in the years 1910-1920. The average cleanness for this corpus is 91%, so no further cleaning is needed. This results in the following un-annotated example corpus:

Page 6: Europeana Newspapers LFT Infoday Genereux


OPATCH annotated 5k pages

Automated annotations for Part-of-Speech (POS), Lemmas and Named Entities (NE). We also have a list of roughly 31k locations and names for South Tyrol compiled by Teßmann.

Bozen hat ein Museum für Ötzi.

TOKEN | POS | LEMMA | NE
Bozen | NE | Bozen | I-LOC
hat | VAFIN | haben |
ein | ART | eine |
Museum | NN | Museum |
für | APPR | für |
Ötzi | NE | Ötzi | I-PER
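The annotation pipeline used in OPATCH is not detailed here; as an illustrative sketch, a comparable POS/lemma/NE annotation can be produced with an off-the-shelf tool such as spaCy and its German model (an assumption, not the project's actual toolchain; tag sets and lemmas may differ):

    import spacy

    # Assumes the German model is installed: python -m spacy download de_core_news_sm
    nlp = spacy.load("de_core_news_sm")

    for token in nlp("Bozen hat ein Museum für Ötzi."):
        ne = f"{token.ent_iob_}-{token.ent_type_}" if token.ent_type_ else ""
        # token.tag_ is the STTS part-of-speech tag (NE, VAFIN, ART, ...)
        print(token.text, token.tag_, token.lemma_, ne, sep="\t")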

Page 7: Europeana Newspapers LFT Infoday Genereux


Corpora

Ten OCR-ed pages with their manually corrected versions.

• 10,468 tokens and 3,621 types. Eight pages (8,324 tokens / 2,487 types) are used as training data and two pages (2,144 / 1,134) for testing.

• More than one in two tokens is misrecognized; of these, almost half (48%) need at least three edit operations for correction.

Reference corpus: 5M words and 5M bigrams
• From the Web and from http://www.gutenberg.org/ (Romane und Erzählungen, i.e. novels and stories, 1910-20)
• The dictionary covers 91% of all words in the ten OCR-ed pages
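A minimal sketch of how such unigram and bigram frequency tables could be collected from the reference corpus (the file name and whitespace tokenization are placeholders, not the project's actual setup):

    from collections import Counter

    def count_ngrams(text):
        """Collect unigram and bigram counts from a reference text."""
        tokens = text.lower().split()  # placeholder tokenization
        return Counter(tokens), Counter(zip(tokens, tokens[1:]))

    # Hypothetical usage:
    # with open("reference_corpus.txt", encoding="utf-8") as f:
    #     unigrams, bigrams = count_ngrams(f.read())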

Page 8: Europeana Newspapers LFT Infoday Genereux


Approach for OCR-ed correction

1. Probability models. Collate and tally all edit operations (delete, insert and replace) needed to transform each unrecognized token from the training OCR-ed texts into its corrected form in the Gold Standard, e.g. n|u 98.

* we obtain two probability models: constrained and unconstrained

2. Candidate generation is achieved by finding the closest entries in the dictionary, i.e. by applying the minimum number of edit operations to an unrecognized OCR-ed token. The number of candidates is a function of the maximum number of edit operations allowed and the model used.

* wundestc → wundesten: replacing 'c' with 'e' and inserting an 'n' after the 'e'.

3. Selection of the most suitable candidate, given relative frequency and context:

... word word word word word WORD word word word word ...
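A minimal sketch of steps 2 and 3, assuming a plain word-list dictionary and the unigram/bigram counts sketched earlier; the scoring is an illustrative combination of frequency and context, not the exact weighting used in the experiments:

    def edit_distance(a, b):
        """Levenshtein distance over delete, insert and replace operations."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # delete
                               cur[j - 1] + 1,               # insert
                               prev[j - 1] + (ca != cb)))    # replace
            prev = cur
        return prev[-1]

    def candidates(token, dictionary, max_ops=2):
        """Dictionary entries reachable with the fewest edit operations."""
        scored = [(edit_distance(token, w), w) for w in dictionary]
        best = min(d for d, _ in scored)
        return [w for d, w in scored if d == best and d <= max_ops]

    def select(cands, left, right, unigrams, bigrams):
        """Pick the candidate with the best frequency/context score."""
        def score(w):
            return (unigrams.get(w, 0)
                    + bigrams.get((left, w), 0)
                    + bigrams.get((w, right), 0))
        return max(cands, key=score) if cands else None

    # Hypothetical usage:
    # best = select(candidates("wundestc", dictionary), "am", "tage", unigrams, bigrams)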

Page 9: Europeana Newspapers LFT Infoday Genereux


Experiment 1: Artificially created errors

• To achieve this we extracted random trigrams from the GS (left context, target, right context) and applied, in reverse, the edit error model.
• Errors were introduced up to a limit of two per target and per context word.
• At the end of this process, we have two context words and five candidates, including the target.
• En is the maximum number of edit operations performed to generate candidates.
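A minimal sketch of this reverse application of the edit model, assuming the model is stored as a mapping from characters to weighted replacements (the representation and weights are placeholders):

    import random

    # Hypothetical error model: character -> list of (replacement, weight)
    ERROR_MODEL = {
        "e": [("c", 0.6), ("o", 0.4)],
        "n": [("u", 1.0)],
    }

    def corrupt(word, max_errors=2):
        """Introduce OCR-like errors by applying the edit model in reverse."""
        chars = list(word)
        positions = [i for i, c in enumerate(chars) if c in ERROR_MODEL]
        random.shuffle(positions)
        for i in positions[:max_errors]:
            repls, weights = zip(*ERROR_MODEL[chars[i]])
            chars[i] = random.choices(repls, weights=weights)[0]
        return "".join(chars)

    # e.g. corrupt("wundesten") might yield "wundcsteu"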

Page 10: Europeana Newspapers LFT Infoday Genereux


Experiment 2: Real errors

Page 11: Europeana Newspapers LFT Infoday Genereux


Discussion

The approach we presented to correct OCR errors considered four features of two types: edit distances and n-gram frequencies. Results showed that a simple scoring system can correct OCR-ed texts with very high accuracy under idealized conditions: no more than two edit operations and a wide-coverage dictionary. Obviously, these conditions do not always hold in practice, and the observed accuracy then drops to 10%. Incorrect substitutions made by the OCR process have also been neglected.

Nevertheless, we can expect to improve our dictionary coverage so that even very noisy OCR-ed texts (i.e. the 48% of errors at a distance of at least three from the target) can be corrected with accuracies of up to 20%.

OCR-ed texts with less challenging error patterns can be corrected with accuracies of up to 61% (distance two) and 86% (distance one).

Reference: Michel Généreux, Egon W. Stemle, Verena Lyding and Lionel Nicolas. Correcting OCR errors for German in Fraktur font. First Italian Conference on Computational Linguistics, Pisa, 9-10 December 2014.