Language Tools for OCR with Katrien Depuydt

Post on 14-Jun-2015

1. Computer Lexica in OCR and Retrieval

Katrien Depuydt (Instituut voor Nederlandse Lexicologie, Leiden)

2. Overview

  • What is a computer lexicon
  • Lexica in IMPACT
  • Tools for lexicon building and applying lexica
  • Some results
  • Searching Demonstration

IMPACT

3. What is a computer lexicon?

4. Computer lexicon vs electronic dictionary (1)

An electronic dictionary is:

  • Digitised full text (no pictures)
  • For human use
  • Ideally: searchable, with explicitly coded material (XML) such as lemma, part of speech (PoS), meaning, quotes, etc.
  • Examples: OED online, WNT online

5. Dictionary XML (example)

7. Computer Lexicon vs Electronic Dictionary (2)

A computer lexicon is:

  • Always in a structured digital format (XML, relational database)
  • Main purpose: computer application
  • Explicitly coded information (e.g. lemma, part of speech, morphology, syntax)
  • Examples of use:
    • Linguistic enrichment of text material
    • Advanced searching (words with all spelling variants and inflections)
    • Automatic summarization, keyword extraction

9. Lexica in IMPACT

10. The OCR lexicon

An OCR lexicon is:

  • A checked list of words in a language
  • Based on a corpus (collection) of dated texts (selection!)
  • Preferably with frequency information
  • Preferably from the same time period or of the same text type as the texts you wish to digitize
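As a minimal sketch of the idea above, the following Python snippet counts word forms in a small dated corpus and keeps only texts from a chosen period. The function name `build_ocr_lexicon`, the toy corpus and the tokenisation rule are assumptions made for illustration; the manual checking step that makes the list a "checked" list is not modelled.

```python
import re
from collections import Counter


def build_ocr_lexicon(dated_texts, start_year, end_year):
    """Count word forms in texts dated within [start_year, end_year].

    dated_texts is an iterable of (year, text) pairs. Case is preserved,
    so "Water" and "water" are counted as different word forms.
    """
    counts = Counter()
    for year, text in dated_texts:
        if start_year <= year <= end_year:
            # Runs of Unicode letters; digits and punctuation are skipped.
            counts.update(re.findall(r"[^\W\d_]+", text))
    return counts


# Hypothetical toy corpus: (year, text) pairs.
corpus = [
    (1637, "In den beginne schiep Godt den hemel ende de aerde"),
    (1650, "ende de aerde was woest ende ledich"),
    (1950, "de televisie en de radio"),
]

lexicon = build_ocr_lexicon(corpus, 1550, 1750)
for word, freq in lexicon.most_common(3):
    print(word, freq)
```

Restricting the year range is what keeps modern vocabulary such as "televisie" out of a lexicon intended for early modern texts.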

11. OCR lexicon: example

1550-1750             > 1900
Word      Freq.       Word          Freq.
song      820         television    418
rihte     818         electronic    375
theire    818         video         194
manye     818         hormone       176
sume      815         jazz          162
Do        814         eco           142
Whiche    811         software      136
fyrst     811         vitamin       128
while     811         movie         121
Water     810         taxi          113
wt        809         isotopic      108
shalbe    808         electronics   95
thingis   807         radar         86
again     806         basically     71
sona      806         sabotage      71
wa        805         homozygote    70
mode      804         psychedelic   67
work      802         phonemic      66
between   801         insulin       64
law       799         zap           64
moder     798         antibody      61
mis       798         fungicidal    61
softe     798

12. The IR lexicon

  • IR lexicon: most important information categories
    • Word forms (lists of words), +/- frequency information
    • Quotes (dated sources) from corpora or electronic dictionaries
    • MODERN LEMMA (cf. dictionary entry), linked to spelling variants and inflected forms of the same word
  • The modern lemma is used for searching in texts
  • Standard use in corpus linguistics and modern historical lexicography
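A minimal sketch of how a modern lemma can serve as the search entry: the query is expanded to all word forms linked to that lemma, and a document matches if it contains any of them. The dictionary `IR_LEXICON`, the functions `expand_query` and `search`, and the sample documents are hypothetical; the word forms themselves are taken from the variation lists in this presentation.

```python
import re

# Hypothetical miniature IR lexicon: modern lemma -> attested word forms.
IR_LEXICON = {
    "wereld": {"wereld", "werelt", "weerelt", "waereld", "werlt"},
    "uiterlijk": {"uiterlijk", "uyterlijck", "wterlick", "uiterlyk"},
}


def expand_query(modern_lemma):
    """Return all word forms linked to the lemma (or the lemma itself)."""
    return IR_LEXICON.get(modern_lemma, {modern_lemma})


def search(modern_lemma, documents):
    """Return the ids of documents containing any form of the lemma."""
    forms = expand_query(modern_lemma)
    hits = []
    for doc_id, text in documents.items():
        tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
        if tokens & forms:
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical sample documents.
docs = {
    "a": "Al de weerelt is een speeltoneel",
    "b": "Het uyterlijck vertoon",
    "c": "De zee is diep",
}

print(search("wereld", docs))
print(search("uiterlijk", docs))
```

The user types only the modern spelling; the lexicon supplies the historical forms.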

13. 219490 <modern_lemma> aantuilen VRB 850026 <written_form> tuyld 92141 <quote> Verhael ick (t.w. een als vrouw verkleede man) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan, 0204119124

14. Tools for lexicon building and application of lexica

15. Types variation (spelling, inflection)

I: uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk

II: werelt weerelt wereld weerelds wereldt werelden weereld werrelts waereldsweerlytwereldts vveerelts waereld weerelden waerelden weerlt werlt wereldssweerels zwerlys swarels swereltswereltsswerrelsweirelts tsweereldswerretvverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weled

Patterns to predict variation: a number of variants are predictable with patterns; others need to be taken from a lexicon.

16. Neil Fitzgerald, 7th July 2011

17. Computer lexica

  • For OCR and OCR post correction
  • Improving the searchability of historical text material by building a lexicon that links variants to a modern lemma, which serves as the search entry
  • Tools for lexicon building
  • Tools for application of lexicon in search engines
  • Lexicon cookbook
  • Guidelines and tools to use the lexica in OCR

18. Tools (more specific)

  • Lexicon building from corpus material and dictionaries
  • Use of lexica in search engines
  • Tool to extract spelling variation patterns from historical material
  • Tool to relate previously unrecognised spelling variations to their standard form
  • Tool to reduce previously unrecognised inflected forms to their base form
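To illustrate how spelling variation patterns can relate an unrecognised historical form to its standard form, here is a minimal sketch. It is not the IMPACT tooling: the rewrite rules in `PATTERNS`, the tiny `MODERN_LEXICON` and the function names are assumptions chosen to fit a few of the variants listed earlier. Candidates are generated by applying rules one occurrence at a time, and only candidates found in the modern lexicon are kept.

```python
# Hypothetical rewrite patterns (historical spelling -> modern spelling).
PATTERNS = [
    ("uy", "ui"), ("wt", "uit"), ("ijck", "ijk"), ("ick", "ijk"),
    ("lt", "ld"), ("ee", "e"), ("ae", "e"),
]

# Hypothetical modern lexicon used to validate candidates.
MODERN_LEXICON = {"uiterlijk", "wereld"}


def candidates(word, patterns, max_steps=4):
    """Return the word plus every form reachable in at most max_steps rewrites."""
    seen = {word}
    frontier = {word}
    for _ in range(max_steps):
        next_frontier = set()
        for form in frontier:
            for old, new in patterns:
                start = form.find(old)
                while start != -1:
                    cand = form[:start] + new + form[start + len(old):]
                    if cand not in seen:
                        seen.add(cand)
                        next_frontier.add(cand)
                    start = form.find(old, start + 1)
        if not next_frontier:
            break
        frontier = next_frontier
    return seen


def normalise(word, patterns=PATTERNS, lexicon=MODERN_LEXICON):
    """Return the modern forms the historical word can be rewritten to."""
    return sorted(c for c in candidates(word, patterns) if c in lexicon)


for variant in ["uyterlijck", "wterlick", "weerelt", "waerelt", "wald"]:
    print(variant, normalise(variant))
```

A form such as "wald" matches no rule and yields no candidate, which is the case where the variant has to come from a lexicon instead of from patterns.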

19. Ordinary words vs Names (NEs)

  • Tools for the automatic recognition, classification and finding of variant names
    • Wish of the libraries
    • Separate regular vocabulary from names
    • Reduce unpleasant results: Abimelech matched to apemelk! (b/p; i/e; e/0; k/ch) ("apemelk" means "monkey milk")
  • NE lexica

20. A number of results for Dutch and German

21. Ground truth data: Dutch

Type and genre                               # words
Gold Standard Book                           300k
Random Set Books                             340k
Random Set Staten Generaal (Legal Papers)    2.5M
Gold Standard Staten Generaal                500k
Gold Standard Newspapers 1                   3.4M
Gold Standard Newspapers 2                   170k
Random Set Newspapers                        3.2M
Total                                        13.1M

22. Lexicon coverage (1: ground truth books)

Lexicon                           Type coverage   Token coverage
Modern lexicon (e-Lex)            46%             76%
Core general lexicon              56%             84%
1 + 2                             63%             89%
Expansion with corpus material    78%             95%

23. Lexicon coverage (2: GT newspapers, 18th-19th C.)

Lexicon                           Type coverage   Token coverage
Modern lexicon (e-Lex)            40%             83%
Core general lexicon              41%             84%
1 + 2                             51%             89%
Expansion with corpus material    62%             95%

24. Lexicon coverage (3: GT Staten Generaal, 19th C.)

Lexicon                           Type coverage   Token coverage
Modern lexicon (e-Lex)            51%             89%
Core general lexicon              47%             88%
1 + 2                             58%             93%
Expansion with corpus material    68%             97%

25. Lexicon coverage (4: GT Staten Generaal, 20th C.)

Lexicon                           Type coverage   Token coverage
Modern lexicon (e-Lex)            70%             93%
Core general lexicon              66%             93%
1 + 2                             76%             96%
Expansion with corpus material    81%             98%

26. Lexicon coverage (5: Genesis, 1637 bible)

Lexicon                           Type coverage   Token coverage
Modern lexicon (e-Lex)            31%             61%
Core lexicon                      62%             83%
1 + 2                             65%             89%
Expansion with corpus material    87%             98.6%

27. Lexicon coverage (6: P.C. Hooft, histories)

Lexicon                           Type coverage   Token coverage
Modern lexicon (e-Lex)            26%             67%
Core lexicon                      47%             88%
1 + 2                             50%             90%
Expansion with corpus material    58%             96%

28. Evaluation of OCR

  • FineReader SDK (version 9, 10)
  • External dictionary interface (implementation module)
  • Challenges
    • Translation of corpus frequencies to weights 0-100
    • Broken words, case sensitivity
    • Problem with long s (workaround)
  • Lexicon data
    • IMPACT OCR lexicon for Dutch
    • FineReader internal lexicon
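The slides do not say how corpus frequencies were translated to weights in the range 0-100. One plausible approach, shown here purely as an assumption, is logarithmic scaling relative to the most frequent word, so that the long tail of rare words is not squashed to zero. The function `frequency_to_weight` is hypothetical and is not part of the FineReader SDK; the sample frequencies come from the OCR lexicon example earlier in the presentation.

```python
import math


def frequency_to_weight(freq, max_freq):
    """Map a corpus frequency to an integer weight in 0..100 (log scale).

    Assumes 1 <= freq <= max_freq for attested words. A frequency of 0
    maps to weight 0; any attested word gets a weight of at least 1.
    """
    if freq <= 0:
        return 0
    scaled = 100 * math.log(freq + 1) / math.log(max_freq + 1)
    return max(1, min(100, round(scaled)))


# Frequencies taken from the "> 1900" column of the OCR lexicon example.
frequencies = {"television": 418, "jazz": 162, "antibody": 61}
top = max(frequencies.values())
weights = {word: frequency_to_weight(f, top) for word, f in frequencies.items()}
print(weights)
```

A linear mapping would also fit the 0-100 range; the log scale is only one design choice, and the best mapping would have to be determined against OCR results.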

29. OCR results: word recognition rate

Dataset: DPO35

Lexicon                                                               Word recognition rate
With ABBYY internal Dutch lexicon                                     88.8%
With IMPACT lexicon for Dutch (case, hyphenation)                     90.9%
With IMPACT lexicon for Dutch (case, hyphenation + long s problem)    93.5%

30. An example

OCR at the beginning of the project:

A. Deeerde was degevaarlykfltiom de verlei ding aan 't Hof; de tweede deftillieenveiligde ; de derde dezwaarde , daar hy byna drie millioenen harde enonbefchaafde Menfchen beftierenmoest.

Results:

A. De eerste was de gevaarlykste om de verlei- ding aan 't Hof; de tweede de stilste en veiligste; de derde de zwaarste, daar hy byna drie millioenen harde en onbeschaafde Menschen bestieren moest.

31. Errors = number of word errors; Reduction = reduction of error rate.

Dictionary               16th century          18th century          19th century
                         Errors  Reduction     Errors  Reduction     Errors  Reduction
No Lexicon               1306    -             827     -             2074    -
Optimal Lexicon          756     42%           395     52%           612     70%
Modern Lexicon           1096    16%           501     39%           888     57%
W. Historical Lexicon    938     28%           481     42%           856     59%
Modern + Virtual H.L.    1011    25%           480     42%           849     59%

32. Languages in IMPACT

  • Dutch, German, English, Spanish, French
  • Polish, Czech, Slovene and Bulgarian
  • Cross language perspective paper
  • Parallel OCR and IR experiments
  • GT datasets
  • Language tools: language independent
  • Except for the 3 core languages: proof-of-concept lexica

33. English in IMPACT

  • Lexicon building using OED
    • OCR lexicon from the full text of the quotations, possibly supplemented with corpus material
    • IR lexicon from headword variants in quotations (small demo)