Targeted Language Resources for the Digitisation of Historical Collections

Post on 09-May-2015

TRANSCRIPT

1. Targeted Language Resources for the Digitization of Historical Collections
   Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter, Klaus U. Schulz
   CIS, University of Munich
   JISC Workshop 2009, Bath, UK
   IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

2. Questions and Methods
   For historical documents of a specific period:
   What kind of linguistic resources?
   What kind of improvements can be expected?
   Consequences for engineering and processes?
   Methods: (1) Corpus analysis; (2) Quantitative experiments on OCR and IR

3. Survey
   1. Special Challenges to Digitize Historical Materials
   2. Composing and Analyzing a Historical Corpus
   3. Types of Linguistic Resources
   4. Evaluation of Benefits: OCR, IR
   5. Consequences for Engineering

4. 0. CIS within IMPACT
   (1) Text Recognition: Adaptation of Optical Character Recognition to historical documents
   (2) Resources Building: Enrichment of texts to improve Information Retrieval (IR) on historical documents
   (3) Research beyond IMPACT: steps to a next-generation interface to access collections of historical documents

5. Digitization Projects for Historical Materials
   SELECT: Create a Historical Collection
   SCAN: Create Images: Greyscale, Color, QA
   PROCESS: Improve Images
   OCR/Type: Create a Symbolical Representation, QA
   INDEX: Process a Term-Document Representation
   PRESENT: Provide a User Interface for Access

6. 1. Special Challenges for the Digitization of Historical Materials
   [Figure: timeline 1500, 1600, 1700, 1800, 1900]
   Imaging: Damages on Originals
   Optical Character Recognition: Rate of Recognition Errors
   Information Retrieval: Historical Variants
   Human Reading: Unknown Words

7. Optical Character Recognition: Gothic, good quality
   OCR output: Stdte den rmischen mumcizmg gleich zu stellen. Allem wenn sich je in einem Rechtstheile die altrechtlichen teutschen Gewohnheiten, und Gesetze erhalten haben, so ist es gewi in dieser Lehre, man mag entweder auf die Befugni, die Stadtgerechtigkeit zu ertheilen , oder auf die innere Regimentsverfftssung so-

8. Optical Character Recognition: medium quality
   OCR output: Frsten zu Gstternwerden/wer wollte vermainen / dawtIhroKhurftrstl Durchl gndiglsterHcttVatterinderpictcrrndFrombkcltallmFrstenvorzusetzen!scyn/vnd das halst>in^cclcQ^ vci pluz^uzn 5accr6o5 da tl iN KilchkN GottWwehr als ein Priester.

9. Examples of Noise introduced by imperfect OCR
   (1) Processed word images may lead to false friends: Fischerei - Tischlerei: F -> T, h -> hl (Eng. fishery - carpentry)
   (2) Processed word images may relate to no word at all: ^.uglltt. schreibet/
   (3) Severe word segmentation errors: vndExcmpelFrstl-vnd HeroischerTuzenF
   OCR on Gothic materials: good (WER < 10%); medium (10-30%); bad (> 30%)

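The quality bands on slide 9 are stated in terms of word error rate (WER). The slides do not define WER, so the sketch below assumes the common definition: word-level edit distance between the ground truth and the OCR output, divided by the number of ground-truth words. The function names `word_error_rate` and `quality_band` are hypothetical, and the 10% and 30% thresholds are simply the ones quoted on the slide.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing in output
                            curr[j - 1] + 1,      # extra word in output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def quality_band(wer: float) -> str:
    """Map a WER value to the good / medium / bad bands quoted on slide 9."""
    if wer < 0.10:
        return "good"
    if wer <= 0.30:
        return "medium"
    return "bad"


if __name__ == "__main__":
    wer = word_error_rate("a b c d", "a x c d")
    print(wer, quality_band(wer))
```

Under this definition a segmentation error such as two words fused into one counts as one substitution plus one deletion, which is one reason segmentation errors weigh heavily on WER.
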
10. Why is it so bad?
   (1) The image quality is challenging; further processing is needed
   (2) The classifiers of the OCR disregard certain typefaces used in historic print
   (3) The language resources of the OCR are inappropriate for historical language

11. IR: Search Problem, even for keyed collections
   [Figure: variants Kreter ?, krauter, Kräuter (Eng. herbs), kreuter, creuther; search result: 0 Results for Kreuter]
   kräuter (= Eng. herbs) as krauter, kreuter, kreter, kreuter, creuther

12. Special challenge for IR and OCR on historical texts: Spelling variation
   Missing normalization of orthography leads to plenty of spelling variants in historical documents, e.g. in German texts (1500-1850):
   teil (= Eng. part) as theil, teyl, theyl
   kräuter (= Eng. herbs) as krauter, kreuter, kreter, kreuter, creuther
   fragte (= Eng. asked) as frug, fruk
   The user is not aware of the variants and misses many documents; sometimes there are even false friends
   Solution: Mapping from variants to modern lemma

13. Language Resources to tackle challenges encountered in OCR and IR
   Lexica for OCR
   Language models for OCR
   Statistical information about transformation patterns
   Historical stopwords for IR
   Normalization lexica with a mapping between modern and historical word form for IR
   Syntactical information for paradigmatic expansion and disambiguation at POS level

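A normalization lexicon of the kind listed on slide 13 can be pictured as a plain mapping between historical and modern word forms. The following is only a minimal sketch, not the IMPACT implementation: the names `VARIANT_TO_MODERN`, `normalize` and `expand_query` are hypothetical, and the entries are limited to the teil and fragte examples from slide 12.

```python
from collections import defaultdict

# Historical variant -> modern word form (examples from slide 12 only).
VARIANT_TO_MODERN = {
    "theil": "teil",
    "teyl": "teil",
    "theyl": "teil",
    "frug": "fragte",
    "fruk": "fragte",
}

# Inverted view: modern form -> set of recorded historical variants.
MODERN_TO_VARIANTS = defaultdict(set)
for variant, modern in VARIANT_TO_MODERN.items():
    MODERN_TO_VARIANTS[modern].add(variant)


def normalize(token: str) -> str:
    """Map a historical token to its modern form; unknown tokens pass through."""
    token = token.lower()
    return VARIANT_TO_MODERN.get(token, token)


def expand_query(term: str) -> set:
    """Return the modern form of a query term plus all recorded variants."""
    modern = normalize(term)
    return {modern} | MODERN_TO_VARIANTS.get(modern, set())


if __name__ == "__main__":
    print(sorted(expand_query("Teil")))
```

The same table serves both directions: `normalize` can be applied at indexing time, while `expand_query` rewrites a modern user query into the set of spellings to search for.
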
14. Mantra: All Language Resources are Corpus Based
   Possible sources:
   Keyed materials on the Web
   Non-public electronic corpora
   Keying/corrected OCR of image corpora
   Noisy OCR corpora

15. Status of German historical Corpora
   1. Main development corpus
      Proofread texts from 1400 to 1900
      Medium size: 2.7 million tokens
      For lexicon construction
      For diachronic analysis/classification of vocabulary of distinct periods
   2. OCR corpus for lexicon testing
      OCRed images + ground truth, aligned
      Texts from 16th, 18th, 19th century (5034, 2659, 18052 tokens)
   3. IR test corpus for lexicon testing
      Special linguistically annotated ground truth
      Texts from 16th, 17th, 18th, 19th century
      31080 tokens

16. Language in a historical corpus for German
   [Figure: Modern lexicon (CISLEX); coverage on Main Corpus on 10 periods]

17. Language in a historical corpus for German
   [Figure: Compounds (modern components); coverage on Main Corpus on 10 periods]

18. Two Variants of Lexica for IR and OCR
   Hypothetical Lexicon: Trying to map input strings to modern lexicon entries in a dynamic way via a special approximate matching procedure using historical transformation patterns.
   Witnessed Lexicon: Corpus-checked lexicon entries:
      Historical spelling variant + modern lemma for IR
      Historical word list for OCR

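The coverage figures on slides 16 and 17 are not reproduced in this transcript. As a rough illustration of what such a figure measures, the sketch below assumes that coverage means the share of corpus tokens found in a lexicon, computed separately per period. The function names, the period labels and the toy tokens are all made up for the example and are not data from the talk.

```python
def token_coverage(tokens, lexicon):
    """Share of tokens (case-insensitive) that occur in the lexicon."""
    tokens = [t.lower() for t in tokens]
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in lexicon)
    return known / len(tokens)


def coverage_by_period(corpus, lexicon):
    """Compute token coverage for each period of a {period: tokens} corpus."""
    return {period: token_coverage(tokens, lexicon)
            for period, tokens in corpus.items()}


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the Main Corpus.
    modern_lexicon = {"der", "teil", "ist", "gut"}
    toy_corpus = {
        "early": ["der", "theyl", "ist", "gut"],
        "late": ["der", "teil", "ist", "gut"],
    }
    print(coverage_by_period(toy_corpus, modern_lexicon))
```
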
19. Hypothetical Lexicon: approximate matching procedure
   Many of the spelling variations can be traced back to a modern word by applying characteristic patterns / rewrite rules, e.g. the historical string theyle can be traced to its modern equivalent teile by applying th → t and ey → ei.
   Required resources:
   Contemporary lexicon with inflected word forms
   Set of typical language-specific spelling variation patterns

20-23. Approximate matching procedure
   Spelling variation: theile
   Patterns (~ 140): th → t, ei → ai, ey → ei, l → ll
   Modern lexicon (inflected forms → lemmatizing information):
      teile → teil (= part), teilen (= to share)
      ...
      taille → taille (= waist)
      fragte → fragen (= to ask)

24-27. Approximate matching procedure
   Spelling variations: theile (slides 24-25); frug → ? (slides 26-27)
   Patterns (~ 140): th → t, ei → ai, ey → ei, l → ll
   Modern lexicon (inflected forms → lemmatizing information):
      teile → teil (= part), teilen (= to share)
      ...
      taille → taille (= waist)
      fragte → fragen (= to ask)

28. Approximate matching procedure
   Advantages:
      No manual work needed
      Dynamic approach
   Limitations:
      Mismatches may link a historical spelling variation to a wrong modern word.
      A part of the historical vocabulary cannot be reduced to a modern word by simple matching

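The matching idea on these slides can be sketched in a few lines. This is an illustrative toy, not the matcher used in the project: it assumes each pattern may be applied or skipped at every position where its left-hand side occurs in the historical string, it uses only the four patterns and three lexicon entries visible on the slides, and the names `candidates` and `match` are hypothetical.

```python
# Rewrite patterns (historical -> modern) as shown on the slides.
PATTERNS = [("th", "t"), ("ey", "ei"), ("ei", "ai"), ("l", "ll")]

# Modern lexicon: inflected form -> lemmas (entries from the slides).
MODERN_LEXICON = {
    "teile": ["teil", "teilen"],
    "taille": ["taille"],
    "fragte": ["fragen"],
}


def candidates(word, patterns=PATTERNS):
    """All strings reachable by optionally applying a pattern at each position."""
    results = set()

    def expand(pos, prefix):
        if pos == len(word):
            results.add(prefix)
            return
        # Option 1: copy the current character unchanged.
        expand(pos + 1, prefix + word[pos])
        # Option 2: apply any pattern whose left-hand side starts here.
        for lhs, rhs in patterns:
            if word.startswith(lhs, pos):
                expand(pos + len(lhs), prefix + rhs)

    expand(0, "")
    return results


def match(word, lexicon=MODERN_LEXICON):
    """Modern inflected forms (with lemmas) reachable from a historical string."""
    return {c: lexicon[c] for c in sorted(candidates(word.lower())) if c in lexicon}


if __name__ == "__main__":
    for historical in ("theyle", "theile", "frug"):
        print(historical, match(historical))
```

With these toy resources, theyle resolves only to teile, theile resolves to both teile and taille, and frug resolves to nothing. The last two cases correspond to the two limitations named on slide 28: a mismatch to a wrong modern word, and a form that simple matching cannot reduce. The number of candidates grows quickly with word length and pattern count, so a realistic matcher with about 140 patterns would need a more economical search than this exhaustive enumeration.
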
29. Language in a historical corpus for German
   [Figure: Hypothetical lexicon; coverage on Main Corpus on 10 periods]

30-31. Manually collected special lexica
   Spelling variations: theile, frug
   Manual mapping: theile → teile; frug → fragte
   Modern lexicon (inflected forms → lemmatizing information):
      teile → teil (= part), teilen (= to share)
      ...
      taille → taille (= waist)
      fragte → fragen (= to ask)

32. Manually collected special lexica
   Advantages:
      Associations between historical variant and modern lemma are safe
      Associations that are not covered by the matching approach can be stored explicitly
   Limitations:
      Time-consuming and labor-intensive; situations occur where specialists (historical linguists) are needed.
      Hardly ever complete because of the immense number of spelling variants

33. Evaluation of the approximative matcher: is the lexicon redundant?
   Few empirical studies on crucial decisions for IR and OCR on historical texts:
   Is a matching approach enough?
   Do we need a lexicon, and if so, in which scenarios?

34. Lexica built
   Both assign modern lemmas to historical full forms
   Validated lexicon constructed from Main Corpus: currently ca. 15,000 entries (kernel lexicon for hist. German); poor coverage
   Witnessed lexicon for OCR: from Main Corpus, 200,000 tokens without modern correspondence; still limited coverage: corpus size
   Hypothetical lexicon for IR: matching procedure mapping historical full form to modern counterpart, plus lemmatizer for modern language (historical full form → modern lemma); based on 140 patterns; theoretically 100 million entries. High coverage; assignments can be erroneous; only able to capture regular correspondences (pattern based)

35. Evaluation of OCR Results
   Test corpus for 16th, 18th, and 19th century
   Development version of a professional OCR engine with an external dictionary interface
   Experiments with different lexicon settings:
      No additional lexicon, character model only
      Modern German lexicon
      Corpus-based witnessed lexicon
      Hypothetical lexicon

36. Number of word errors and reduction of error rate, by dictionary and century
   Dictionary              16th century         18th century         19th century
                           Errors   Reduction   Errors   Reduction   Errors   Reduction
   No Lexicon              1306     -           827      -           2074     -
   Optimal Lexicon         756      42%         395      52%         612      70%
   Modern Lexicon          1096     16%         501      39%         888      57%
   W.Historical Lexicon    938      28%         481      42%         856      59%
   Modern + Virtual H.L.   1011     25%         480      42%         849      59%
   Annotations on the slide: WER > 50%; WER ~ 10%

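The reduction column of the table on slide 36 appears to be the relative drop in word errors against the "No Lexicon" baseline; this reading is an assumption, since the slide does not spell out the formula. The sketch below recomputes the figures for three of the rows under that assumption. The function name `error_reduction` is hypothetical; the error counts are copied from the slide.

```python
def error_reduction(baseline_errors: int, errors: int) -> float:
    """Relative reduction in word errors compared with the baseline run."""
    return 1 - errors / baseline_errors


# Word error counts from slide 36, in the order 16th, 18th, 19th century.
BASELINE = [1306, 827, 2074]  # no lexicon, character model only
RUNS = {
    "Optimal Lexicon": [756, 395, 612],
    "Modern Lexicon": [1096, 501, 888],
    "W.Historical Lexicon": [938, 481, 856],
}

if __name__ == "__main__":
    for name, errors in RUNS.items():
        reductions = [round(100 * error_reduction(b, e))
                      for b, e in zip(BASELINE, errors)]
        print(name, reductions)
```
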
37. Evaluation of the approximate matcher in an IR scenario
   Proofread documents from the 16th, 17th, 18th and 19th century and tagged each token manually.
   Collected a list of (historical) stopwords
   Defined precision and recall for our scenario.

38. Main insights of the IR experiment, 18th and 19th century
   The pure matching approach leads to good precision values.
   Recall values are acceptable

39. Main insights of the IR experiment, 16th and 17th century
   Precision of the matching approach is poor; a lexicon will help to avoid wrong matches.
   Recall values show that a large number of words can only be explained by a special lexicon

40. Answer to our refined Question
   Is the lexicon redundant for OCR/IR on historical texts?
   It depends on the material, especially on the date of origin of the collection:
   The matching approach leads to acceptable results for 19th and 18th century collections.
   Serious limitations for 16th and 17th century collections. Special lexica will lead to important improvements
   A combination of the matching approach and manually collected lexica may lead to optimal results.
   For postprocessing, validated lexica are needed

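One way to read the suggested combination of matching approach and manually collected lexica is as a two-stage lookup: consult the manually collected lexicon first and fall back to pattern matching only for unseen forms. The sketch below illustrates that reading under stated assumptions. It is not the project's implementation; the names `WITNESSED_LEXICON`, `rewrite_candidates` and `lemmas_for` are hypothetical, and the entries are limited to the examples from the slides.

```python
# Rewrite patterns and modern lexicon as shown on the earlier slides.
PATTERNS = [("th", "t"), ("ey", "ei"), ("ei", "ai"), ("l", "ll")]
MODERN_LEXICON = {
    "teile": ["teil", "teilen"],
    "taille": ["taille"],
    "fragte": ["fragen"],
}

# Manually collected entries: historical form -> modern lemmas (assumed content).
WITNESSED_LEXICON = {
    "theile": ["teil", "teilen"],
    "frug": ["fragen"],
}


def rewrite_candidates(word, pos=0):
    """Yield every string obtained by optionally applying patterns left to right."""
    if pos == len(word):
        yield ""
        return
    for tail in rewrite_candidates(word, pos + 1):
        yield word[pos] + tail
    for lhs, rhs in PATTERNS:
        if word.startswith(lhs, pos):
            for tail in rewrite_candidates(word, pos + len(lhs)):
                yield rhs + tail


def lemmas_for(word):
    """Return (lemmas, source): lexicon entries win, matching is the fallback."""
    word = word.lower()
    if word in WITNESSED_LEXICON:
        return WITNESSED_LEXICON[word], "lexicon"
    lemmas = sorted({lemma
                     for candidate in rewrite_candidates(word)
                     for lemma in MODERN_LEXICON.get(candidate, [])})
    if lemmas:
        return lemmas, "matching"
    return [], "unknown"


if __name__ == "__main__":
    for token in ("theile", "theyle", "frug", "xyz"):
        print(token, lemmas_for(token))
```

In this toy setup the lexicon entry for theile suppresses the wrong match to taille that pure matching would produce, and the entry for frug covers a form that no pattern can reach, while theyle is still resolved dynamically without any manual work.
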
41. Engineering Consequences
   A Focus Collection of the Bavarian State Library
   VD16: Collection of Early High German Books
   Collaborative project with BSB on OCR/IR for this collection: Clemens Neudecker/Fedor Bochow
   Special lexicon building needed
   No 16th century electronic corpora available for lexicon development
   For a real-world test we defined a topic area as main interest: Theology

42. Engineering Consequences
   Iterative process with the Bavarian State Library to create resources for the VD16 collection:
   (1) A random selection of 200 pages from 100 sources
   (2) OCR and corpus experiments
   (3) Selection of usable sources
   (4) Specification of keying by BSB/CIS for 70 complete books usable for both presentation and linguistic resources building
   (5) Contract with service providers

43. Engineering Consequences
   A Focus Collection of the Bavarian State Library
   VD16: Collection of Early High German Books with 30 million pages
   Integrate OCR supplier: special typeface models; character models
   Linguistic database:
      Improve OCR with a specialized historical lexicon for VD16
      Improve IR access with a normalization lexicon

44. Wrap Up
   For challenging historical materials, specialized lexica are needed
   Special lexica directly implemented into basic OCR lift OCR quality significantly.
   For bigger projects, seek direct collaboration with OCR partners
   For IR: use approximative matching or normalization lexica to process user queries
   Integrate research institutions and collection holders
