Europeana Newspapers German Infoday: Processing Digital Newspapers
Post on 11-May-2015
Digital Newspapers: Processing in Europeana Newspapers
Information Day SBB
Berlin, 27 February 2014
Clemens Neudecker, KB, Twitter: @cneudecker
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
Overview
• Goals & Challenges
• Newspapers in the Project
• Workflow & Technologies
• Questions & Answers
Goals
• Process 8 million newspaper pages with OCR (UIBK)
• Process 2 million newspaper pages with OLR (CCS)
• Build software for NER in 3 languages (KB)
• Develop tools that automate the workflow
• Produce guidelines and recommendations ("best practices")
Challenges
• Quality vs. throughput
• Complexity of newspaper layouts (columns, advertisements, illustrations)
• Widely varying quality of the digitised images (microfilm, bitonal)
• Different file formats, languages, alphabets
• Historical spelling variants
• A clearly structured and largely automated workflow
The Newspapers
Europeana Newspapers Dataset (1)
Europeana Newspapers Dataset (2)
Europeana Newspapers Dataset (3)
Europeana Newspapers Dataset (4)
Workflow
OCR @ UIBK
• OCR = Optical Character Recognition
• Technologies: ABBYY FineReader SDK
• State-of-the-art OCR software, supports Fraktur/Latin/Cyrillic out of the box
• Export as a METS/ALTO package consisting of images, metadata & full text
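To illustrate what the full-text part of such a package looks like to a consumer, here is a minimal sketch that pulls the text lines out of an ALTO file. It relies only on the ALTO convention that words are `String` elements with a `CONTENT` attribute inside `TextLine` elements. The sample document is made up for this example, and the namespace is ignored so that the sketch does not depend on a particular ALTO version.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, e.g. '{ns}String' -> 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml):
    """Return the text of each TextLine in an ALTO document as a list of strings."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            # Words are String elements; SP (space) and HYP elements are skipped.
            words = [child.attrib.get("CONTENT", "")
                     for child in elem
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


# Made-up sample, not taken from the project data.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Berliner"/><SP/><String CONTENT="Zeitung"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_lines(SAMPLE))
```

A real package also carries a METS file that links images, ALTO files, and metadata; that part is not covered here.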
Tools (BCT)
• BCT = Binarisation and Colour Reduction Tool
• Goal: convert colour/greyscale scans to 1-bit using a method optimised for OCR (GPP) + JP2k
• Background: reduce the file size of the images to make the data volume manageable (hundreds of TB)
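The idea of reducing a greyscale scan to 1-bit can be sketched in a few lines. Note that this is not the GPP method named on the slide and not how BCT is implemented: it uses Otsu's global threshold, a much simpler technique, purely to show what binarisation does to pixel values. The function names are made up for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their grey levels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarise(image, threshold):
    """Map each pixel to 1 (brighter than threshold) or 0 (ink/dark)."""
    return [[1 if p > threshold else 0 for p in row] for row in image]


image = [[10, 12], [200, 210]]          # tiny made-up greyscale image
flat = [p for row in image for p in row]
print(binarise(image, otsu_threshold(flat)))
```

A global threshold like this tends to struggle with uneven illumination and stained paper, which is one reason adaptive methods are used for historical material.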
Tools (FRT)
• FRT = File Rename Tool
• Goal: support the libraries in data delivery by renaming files and folders
• Background: prepare the data in the structure required for automated processing
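As a rough illustration of this kind of renaming step, the sketch below maps the TIFF files in a folder to zero-padded sequential names. The naming scheme (`<prefix>_0001.tif`) and the function names are assumptions made for this example; the slides do not describe the structure FRT actually produces.

```python
from pathlib import Path


def plan_renames(folder, prefix):
    """Pair each TIFF in `folder` with a sequential target name (hypothetical scheme)."""
    files = sorted(p for p in Path(folder).iterdir()
                   if p.suffix.lower() in {".tif", ".tiff"})
    return [(p, p.with_name(f"{prefix}_{i:04d}{p.suffix.lower()}"))
            for i, p in enumerate(files, start=1)]


def apply_renames(plan, dry_run=True):
    """Rename files according to `plan`; with dry_run=True only check for clashes."""
    for src, dst in plan:
        if dst.exists() and dst != src:
            raise FileExistsError(dst)
        if not dry_run:
            src.rename(dst)
```

Splitting the work into a plan and an apply step makes it possible to review the mapping, or run it with `dry_run=True`, before any file is touched.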
Tools (FAT)
• FAT = File Analyzer Tool
• Goal: check and validate the data structure before delivery for processing
• Background: assurance for everyone involved that the data is in a suitable form for further processing
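A pre-delivery check of this kind could look roughly like the sketch below. The expected layout (one folder per issue named `YYYY-MM-DD`, containing page images named `NNNN.tif`) is a hypothetical convention chosen for the example, not the structure FAT actually enforces.

```python
import re
from pathlib import Path

# Hypothetical convention: <root>/<YYYY-MM-DD>/<NNNN>.tif
ISSUE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")
PAGE_FILE = re.compile(r"^\d{4}\.tif$")


def validate_delivery(root):
    """Return a list of problems found; an empty list means the layout passes."""
    problems = []
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_dir() or not ISSUE_DIR.match(entry.name):
            problems.append(f"unexpected entry at top level: {entry.name}")
            continue
        pages = sorted(p.name for p in entry.iterdir())
        if not pages:
            problems.append(f"{entry.name}: no page images")
        for name in pages:
            if not PAGE_FILE.match(name):
                problems.append(f"{entry.name}: unexpected file {name}")
    return problems
```

Returning all problems at once, rather than stopping at the first, lets a library fix a whole delivery in one pass.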
OLR @ CCS
• OLR = Optical Layout Recognition
• Technologies: docWorks
• Segmentation of the page into columns, articles, headings, "page types" (advertisements)
• Export as a METS/ALTO package consisting of images, metadata & full text
OLR: Article Recognition
NER @ KB
• NER = Named Entity Recognition
• Technologies: Stanford CRF-NER
• 3 languages: German, Dutch, French
• Open source: https://github.com/KBNLresearch/europeananp-ner
• Recognition of 3 classes: person, location, organisation
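A sequence tagger such as a CRF assigns one label per token, so multi-word names have to be reassembled afterwards. The sketch below shows that post-processing step on a made-up sentence. The label names (`PER`, `LOC`, `O`) are assumptions for this example; the labels the project's models actually emit may differ.

```python
def group_entities(tagged_tokens, outside="O"):
    """Merge consecutive tokens sharing a non-outside label into (text, label) spans."""
    entities = []
    current_words, current_label = [], None
    for token, label in tagged_tokens:
        if label != outside and label == current_label:
            current_words.append(token)   # same entity continues
            continue
        if current_words:                 # label changed: close the open entity
            entities.append((" ".join(current_words), current_label))
        if label != outside:
            current_words, current_label = [token], label
        else:
            current_words, current_label = [], None
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


# Made-up tagger output for illustration.
tagged = [("Anna", "PER"), ("Schmidt", "PER"), ("reiste", "O"),
          ("nach", "O"), ("Den", "LOC"), ("Haag", "LOC")]
print(group_entities(tagged))
```

Without begin/inside markers, two adjacent entities of the same class are merged into one span; that is a known limitation of this simple grouping.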
Results for NL
Model trained on manually tagged newspaper pages from 1618 to 1900.
100 pages with a total of 183,421 tokens ("words")

             Persons   Locations   Organisations
Precision    0.940     0.950       0.942
Recall       0.588     0.760       0.559
F-measure    0.689     0.838       0.671

* K-fold cross-validation = 1/4 of the training data used only for evaluation
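For readers unfamiliar with the evaluation setup, the following sketch shows the two ingredients: splitting the data into folds so that each quarter is held out once, and the F-measure as the harmonic mean of precision and recall. The choice of `k=4` is inferred from the "1/4" in the footnote, and the round-robin fold assignment is an assumption; the slides do not say how pages were assigned to folds.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def k_fold_splits(items, k=4):
    """Yield (train, test) pairs; each of the k folds is held out exactly once."""
    folds = [items[i::k] for i in range(k)]   # round-robin assignment (assumed)
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


pages = list(range(100))                      # stand-ins for the 100 tagged pages
for train, test in k_fold_splits(pages):
    print(len(train), len(test))

print(round(f_measure(0.940, 0.588), 3))
```

Applying `f_measure` directly to the precision and recall shown for persons gives roughly 0.72 rather than 0.689. One possible explanation, which is an assumption and not stated on the slide, is that the reported F-measure was computed per fold and then averaged.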
NER vs. OCR
[Chart: NER and OCR series, axis ticks from 0.25 to 0.95]
Thank you for your attention!
Any questions?
clemens.neudecker@kb.nl