
An Open Corpus for Named Entity Recognition in Historic Newspapers

Clemens Neudecker
Berlin State Library

@cneudecker

LREC2016, 23-28 May 2016, Portorož, Slovenia

https://twitter.com/cneudecker


Background

• Europeana Newspapers EU project: www.europeana-newspapers.eu

• OCRed 12m pages of historic newspapers from Europe (an estimated 25 billion words!)

• Newspaper content from 23 libraries, in 40 languages, covering 4 centuries (1618-1990)

• Public domain full-text available for download per language/content provider


Formats & Standards

• Full-text produced in ALTO
• Metadata (structural) in METS
• Metadata (bibliographic) in EDM
• Not a fan of XML? Good ol' plain text (UTF-8) is also available: research.europeana.eu/itemtype/newspapers
• Currently working on:
– API for text/search
– API for images (IIIF)

https://github.com/altoxml

http://www.loc.gov/standards/mets/

http://pro.europeana.eu/page/edm-documentation





http://iiif.io/

Approach

• 3 languages selected for NER: Dutch, German, French – in collab. with

• Content in these languages constitutes about 50% of the overall full-text in the collection

Methodology

• Select 100 representative pages per language
– If a classifier already exists for the given language, run it on the selected 100 pages
– Ingest tagged/untagged pages into the annotation tool
– Manually add/correct annotations (>=2 librarians per language)
– Export and convert tagged data to BIO format
– Train classifier from BIO data & gazetteers (if available)
– Evaluate derived classifier using 4-fold cross-evaluation
– Repeat until classification performance converges
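The "export and convert to BIO format" step can be sketched roughly as follows. This is a minimal illustration, not the project's converter: the `(text, entity_type, entity_id)` token tuples, the `to_bio` helper and the sample sentence are assumptions made for this example, and the column layout of the published corpus may differ.

```python
from typing import List, Optional, Tuple

# Each token is (text, entity_type or None, entity_id or None).
# entity_id groups the tokens of one mention; it is an assumption of
# this sketch, not a documented field of the corpus.
Token = Tuple[str, Optional[str], Optional[int]]


def to_bio(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Label the first token of a mention B-TYPE, later ones I-TYPE, others O."""
    labelled = []
    previous_id = None
    for text, entity_type, entity_id in tokens:
        if entity_type is None:
            labelled.append((text, "O"))
            previous_id = None
            continue
        same_mention = entity_id is not None and entity_id == previous_id
        prefix = "I" if same_mention else "B"
        labelled.append((text, f"{prefix}-{entity_type}"))
        previous_id = entity_id
    return labelled


# Invented sample sentence, for illustration only.
sentence = [
    ("Clemens", "PER", 1), ("Neudecker", "PER", 1),
    ("works", None, None), ("in", None, None), ("Berlin", "LOC", 2),
]
bio = to_bio(sentence)
for text, label in bio:
    print(f"{text}\t{label}")
```

One token per line with a tab-separated label is a common input shape for CRF trainers, which is why the sketch prints it that way.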

NER software

• Tested Stanford NER, OpenNLP, NLTK, GATE
• Adaptation of Stanford NER package (CRF)
– Mature, well-documented, widely used
– Open source (GPL)
– Thread-safe & platform-independent (JVM)
– Machine learning scales out more easily to multiple languages
– Prior experience working with CRF

http://nlp.stanford.edu/software/CRF-NER.shtml

https://opennlp.apache.org/

http://www.nltk.org/

http://gate.ac.uk/

NER encoding in ALTO

• In ALTO versions >2.1, this is possible:

<String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0" VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"></String>
<String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0" VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"></String>
…
<Tags>
  <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
  <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
</Tags>
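To read such annotations back out, each `TAGREFS` value on a `String` has to be resolved against the `NamedEntityTag` elements. A minimal sketch using only the Python standard library follows; it assumes a simplified, namespace-free document (real ALTO files declare an XML namespace that would need to be added to the element names), and the `<alto>` wrapper, the `named_entities` helper and the untagged word are made up for the example.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an ALTO page: no namespace, most attributes dropped.
ALTO_SNIPPET = """
<alto>
  <Tags>
    <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
    <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
  </Tags>
  <TextLine>
    <String CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"/>
    <String CONTENT="visited" WC="0.9"/>  <!-- invented untagged word -->
    <String CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"/>
  </TextLine>
</alto>
"""


def named_entities(alto_xml: str):
    """Return (word, entity type) pairs for every tagged String element."""
    root = ET.fromstring(alto_xml)
    # Map tag IDs to their entity type, e.g. "Tag5" -> "Person".
    tag_types = {t.get("ID"): t.get("TYPE") for t in root.iter("NamedEntityTag")}
    pairs = []
    for string in root.iter("String"):
        # TAGREFS may hold several space-separated IDs.
        for ref in string.get("TAGREFS", "").split():
            if ref in tag_types:
                pairs.append((string.get("CONTENT"), tag_types[ref]))
    return pairs


entities = named_entities(ALTO_SNIPPET)
print(entities)
```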

Annotation

• Evaluated BRAT, WebAnno, INL Attestation
• Reasons for selection of INL Attestation:
– Speed
– Support of ALTO format
– Support from INL available

http://brat.nlplab.org/

https://webanno.github.io/webanno/

https://github.com/INL/AttestationTool

Annotation stats

Annotated tokens and entities (counts)

Language   # tokens   # PER   # LOC   # ORG
French     207,000    5,672   5,614   2,574
Dutch      182,483    4,492   4,448   1,160
German      96,735    7,914   6,143   2,784

Annotated entities as share of tokens

Language   tokens   PER     LOC     ORG
French     100%     2.75%   2.71%   1.24%
Dutch      100%     2.46%   2.44%   0.64%
German     100%     8.18%   6.35%   2.88%

OCR quality

Language   Word-Error-Rate (Bag of Words)   Reading Order Success Rate
French     16.6%                            19.9%
Dutch      17.6%                            23.2%
German     15.9% / 21.9%                    13.6%
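The percentage figures in the annotation stats are the entity counts divided by the token counts. A small sketch that recomputes them is given below; the `counts` dictionary and the `share` helper are just names chosen here, and a recomputed value can differ from the slide in the last digit where a token count appears to be rounded (as the French one does).

```python
# Counts copied from the annotation stats above.
counts = {
    "French": {"tokens": 207_000, "PER": 5_672, "LOC": 5_614, "ORG": 2_574},
    "Dutch": {"tokens": 182_483, "PER": 4_492, "LOC": 4_448, "ORG": 1_160},
    "German": {"tokens": 96_735, "PER": 7_914, "LOC": 6_143, "ORG": 2_784},
}


def share(language: str, entity_type: str) -> float:
    """Entity count as a percentage of all tokens for one language."""
    row = counts[language]
    return 100.0 * row[entity_type] / row["tokens"]


for language in counts:
    shares = ", ".join(
        f"{entity_type} {share(language, entity_type):.2f}%"
        for entity_type in ("PER", "LOC", "ORG")
    )
    print(f"{language}: {shares}")
```

The German sample has roughly three times the entity density of the other two, which is worth keeping in mind when comparing classifiers across languages.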

Challenges

• Clear, comprehensive & common guidelines for manual annotation

• OCR quality – on average 80% word accuracy
• Wide variation in historical spelling
• Mix of languages on a single page
• Lack/loss of metadata on page/word level
• Some data corruption occurred when ingesting pre-tagged data into the annotation tool

Attempted workarounds

• Introducing OCR error patterns into the training data
– Actually yields lower precision/recall
• Introducing a spelling variation module in the NER classifier
– Rewrite rules (e.g. „frorn“ → „from“)
– High integration effort
– Requires a reasonable amount of rules
– Abandoned due to high complexity
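The rewrite-rule idea can be illustrated with a few lines of Python. This is only a sketch of the general technique, not the abandoned module: the `REWRITE_RULES` table and the `normalise` helper are hypothetical, and only the „frorn“ → „from“ rule comes from the slide.

```python
import re

# Hypothetical rule table of (pattern, replacement) pairs. Only the
# "frorn" -> "from" rule is taken from the slide; a real module would
# need a curated rule set per language.
REWRITE_RULES = [
    (re.compile(r"\bfrorn\b"), "from"),
]


def normalise(token: str) -> str:
    """Apply every rewrite rule in order and return the rewritten token."""
    for pattern, replacement in REWRITE_RULES:
        token = pattern.sub(replacement, token)
    return token


tokens = ["News", "frorn", "Baltimore"]
normalised = [normalise(t) for t in tokens]
print(normalised)
```

Even this toy version hints at the scaling problem named above: every OCR confusion and historical variant needs its own rule, and the rules must not rewrite valid words.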

Evaluation NL

Derived via 4-fold cross-evaluation (25 out of 100 annotated pages)

Evaluation FR

Derived via 4-fold cross-evaluation (25 out of 100 annotated pages)
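A 4-fold cross-evaluation over 100 pages can be sketched as below, reading "25 out of 100" as each fold holding out 25 pages for testing while training on the remaining 75 (an interpretation, not something the slides spell out). The page names, the `four_folds` and `precision_recall_f1` helpers and the toy entity sets are all assumptions for illustration; training the classifier itself is omitted.

```python
from typing import Iterator, List, Sequence, Set, Tuple


def four_folds(pages: Sequence[str], k: int = 4) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, test) page lists; every page is held out exactly once."""
    for i in range(k):
        test = [p for j, p in enumerate(pages) if j % k == i]
        train = [p for j, p in enumerate(pages) if j % k != i]
        yield train, test


def precision_recall_f1(gold: Set, predicted: Set) -> Tuple[float, float, float]:
    """Score predicted entities against gold entities by exact match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder page identifiers, not real corpus file names.
pages = [f"page_{n:03d}" for n in range(100)]
folds = list(four_folds(pages))
for train, test in folds:
    print(len(train), "training pages,", len(test), "test pages")

# Toy example: one of two predictions has the wrong type.
gold = {("Reynolds", "PER"), ("Baltimore", "LOC")}
predicted = {("Reynolds", "PER"), ("Baltimore", "ORG")}
scores = precision_recall_f1(gold, predicted)
print(scores)
```

Averaging the per-fold scores gives a single figure per language and entity type.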

Use cases

• Improving search, information retrieval
– Within digital newspapers, a vast majority of user queries are person and place names
• Linking of named entities to authority files to create linked data
– The classification and disambiguation of named entities allows the assignment of unique identifiers from authoritative sources, thus enabling cross-language/cross-collection linking

Next steps

• Volunteers wanted! Help correct the corpus and collaboratively create a free dataset – instructions on GitHub wiki:
– github.com/EuropeanaNewspapers/ner-corpora/wiki/Corpus-cleanup
• Plans to improve performance:
– Add distributional similarity as feature (Clark 2003)
– Semantic generalisation (Faruqui & Padó 2010)
– Specialised gazetteers (e.g. list of historic place names)
– Data, data, data









Open resources

• Europeana Newspapers NER dataset (CC0):
– github.com/EuropeanaNewspapers/ner-corpora
• Europeana Newspapers NER software (EUPL):
– github.com/EuropeanaNewspapers/europeananp-ner
– github.com/EuropeanaNewspapers/europeananp-dbpedia-disambiguation
• Annotated ALTO files:
– lab.kbresearch.nl/static/html/eunews.html






References

• C. Neudecker, W.J. Faber, L. Wilms, T. van Veen: Large scale refinement of digital historical newspapers with named entity recognition. Proceedings of the IFLA Newspaper Section Satellite Meeting, 2014, Geneva, Switzerland.

• Y. Mossalam, A. Abi-Haidar, J.G. Ganascia: Unsupervised named entity recognition and disambiguation: An application to old French journals. Advances in Data Mining. Applications and Theoretical Aspects, Springer LNCS, 2014.

Thank you for your attention!
Questions?

Clemens Neudecker
Berlin State Library

@cneudecker


