An Open Corpus for Named Entity Recognition in Historic Newspapers. Clemens Neudecker, Berlin State Library, @cneudecker. LREC2016, 23-28 May 2016, Portorož, Slovenia

Upload: cneudecker

Post on 11-Jan-2017

TRANSCRIPT

Page 1: An Open Corpus for Named Entity Recognition in Historic Newspapers

An Open Corpus for Named Entity Recognition in Historic Newspapers

Clemens Neudecker
Berlin State Library

@cneudecker

LREC2016, 23-28 May 2016, Portorož, Slovenia

Page 2: An Open Corpus for Named Entity Recognition in Historic Newspapers

Background

• Europeana Newspapers EU-project: www.europeana-newspapers.eu

• OCRed 12m pages of historic newspapers from Europe (an estimated 25 billion words!)

• Newspaper content from 23 libraries, in 40 languages, covering 4 centuries (1618-1990)

• Public domain full-text available for download per language/content provider

Page 3: An Open Corpus for Named Entity Recognition in Historic Newspapers

Formats & Standards

• Full-text produced in ALTO
• Metadata (structural) in METS
• Metadata (bibliographic) in EDM
• Not a fan of XML? Good ol' plain text (UTF-8) is also available…
research.europeana.eu/itemtype/newspapers
• Currently working on:
– API for text/search
– API for images (IIIF)

Page 4: An Open Corpus for Named Entity Recognition in Historic Newspapers

Approach

• 3 languages selected for NER: Dutch, German, French – in collab. with

• Content in these languages constitutes about 50% of the overall full-text in the collection

Page 5: An Open Corpus for Named Entity Recognition in Historic Newspapers

Methodology

• Select 100 representative pages per language
– If a classifier already exists for the given language, run it on the selected 100 pages
– Ingest tagged/untagged pages to annotation tool
– Manually add/correct annotations (>=2 librarians per language)
– Export and convert tagged data to BIO format
– Train classifier from BIO & gazetteers (if available)
– Evaluate derived classifier using 4-fold cross-evaluation
– Repeat until classification performance converges
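The "export and convert tagged data to BIO format" step can be illustrated with a minimal Python sketch. It assumes annotations arrive as token-index spans; the `Span` layout, the function names and the sample sentence are hypothetical and are not taken from the project's actual export code. The tab-separated output is one common layout for CRF training data, not necessarily the exact one used here.

```python
from typing import List, Tuple

# Hypothetical span layout: (first token index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def to_bio(tokens: List[str], spans: List[Span]) -> List[Tuple[str, str]]:
    """Assign a B-/I-/O tag to every token, given non-overlapping entity spans."""
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span {(start, end)} is outside the token list")
        tags[start] = f"B-{entity_type}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"      # continuation tokens
    return list(zip(tokens, tags))


def to_tsv(pairs: List[Tuple[str, str]]) -> str:
    """Serialise as one 'token<TAB>tag' pair per line."""
    return "\n".join(f"{token}\t{tag}" for token, tag in pairs)


# Illustrative sentence, not taken from the corpus
tokens = ["Mr.", "Reynolds", "left", "Baltimore", "yesterday"]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
bio = to_bio(tokens, spans)
print(to_tsv(bio))
```

Tokens outside any span keep the tag `O`, so untagged pages pass through the same function with an empty span list.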

Page 6: An Open Corpus for Named Entity Recognition in Historic Newspapers

NER software

• Tested Stanford NER, OpenNLP, NLTK, GATE
• Adaptation of Stanford NER package (CRF)
– Mature, well-documented, widely used
– Open source (GPL)
– Thread-safe & platform-independent (JVM)
– Machine learning scales out more easily to multiple languages
– Prior experience working with CRF

Page 7: An Open Corpus for Named Entity Recognition in Historic Newspapers

NER encoding in ALTO

• In ALTO versions >2.1, this is possible:

<String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0" VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"></String>
<String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0" VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"></String>
…
<Tags>
  <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
  <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
</Tags>
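To show how such TAGREFS links can be resolved, here is a small Python sketch using only the standard library. The wrapper document is a simplified assumption: a real ALTO file declares a namespace and nests the String elements inside layout elements, which is why the code matches on local element names. The untagged "left" token and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an ALTO file (assumption: no namespace, flat layout)
ALTO_SNIPPET = """<alto>
  <Tags>
    <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
    <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
  </Tags>
  <Layout>
    <String CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"/>
    <String CONTENT="left" WC="0.9"/>
    <String CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"/>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{uri}String' down to 'String'."""
    return tag.rsplit("}", 1)[-1]


def extract_entities(alto_xml: str):
    """Return (word, entity type) pairs for every String that references a tag."""
    root = ET.fromstring(alto_xml)
    # First pass: map tag IDs to entity types
    tag_types = {
        el.get("ID"): el.get("TYPE")
        for el in root.iter()
        if local_name(el.tag) == "NamedEntityTag"
    }
    # Second pass: resolve the TAGREFS attribute of each String
    entities = []
    for el in root.iter():
        if local_name(el.tag) != "String":
            continue
        for ref in el.get("TAGREFS", "").split():
            if ref in tag_types:
                entities.append((el.get("CONTENT"), tag_types[ref]))
    return entities


entities = extract_entities(ALTO_SNIPPET)
print(entities)
```

Keeping the entity type in a separate Tags section means the word coordinates (HPOS, VPOS) stay attached to each String, so a recognised name can still be highlighted on the page image.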

Page 8: An Open Corpus for Named Entity Recognition in Historic Newspapers

Annotation

• Evaluated BRAT, WebAnno, INL Attestation
• Reasons for selection of INL Attestation:
– Speed
– Support of ALTO format
– Support from INL available

Page 9: An Open Corpus for Named Entity Recognition in Historic Newspapers

Annotation stats

Language   # tokens   # PER   # LOC   # ORG
French      207,000   5,672   5,614   2,574
Dutch       182,483   4,492   4,448   1,160
German       96,735   7,914   6,143   2,784

Language   % tokens   % PER   % LOC   % ORG
French         100%   2.75%   2.71%   1.24%
Dutch          100%   2.46%   2.44%   0.64%
German         100%   8.18%   6.35%   2.88%

Language   Word-Error-Rate (Bag of Words)   Reading Order Success Rate
French     16.6%                            19.9%
Dutch      17.6%                            23.2%
German     15.9% / 21.9%                    13.6%
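The percentage figures are the entity counts divided by the token counts. A short Python sketch of that arithmetic follows; the dictionary layout and function name are illustrative only. Values recomputed this way may differ from the slide in the last digit, for instance if a token count shown on the slide is rounded.

```python
# Counts copied from the annotation stats table
COUNTS = {
    "French": {"tokens": 207_000, "PER": 5_672, "LOC": 5_614, "ORG": 2_574},
    "Dutch": {"tokens": 182_483, "PER": 4_492, "LOC": 4_448, "ORG": 1_160},
    "German": {"tokens": 96_735, "PER": 7_914, "LOC": 6_143, "ORG": 2_784},
}


def entity_density(row: dict) -> dict:
    """Express each entity count as a percentage of the token count."""
    tokens = row["tokens"]
    return {
        label: round(100 * count / tokens, 2)
        for label, count in row.items()
        if label != "tokens"
    }


for language, row in COUNTS.items():
    print(language, entity_density(row))
```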

Page 10: An Open Corpus for Named Entity Recognition in Historic Newspapers

Challenges

• Clear, comprehensive & common guidelines for manual annotation

• OCR quality – on average 80% word accuracy
• Wide variation in historical spelling
• Mix of languages on a single page
• Lack/loss of metadata on page/word level
• Some data corruption occurred when ingesting pre-tagged data into the annotation tool

Page 11: An Open Corpus for Named Entity Recognition in Historic Newspapers

Attempted workarounds

• Introduce OCR error patterns into training data
– Actually yields lower precision/recall
• Introduce a spelling variation module in the NER classifier
– Rewrite rules (e.g. "frorn" → "from")
– High integration effort
– Requires a reasonable amount of rules
– Abandoned due to high complexity
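To make the rewrite-rule idea concrete, here is a hypothetical Python sketch. It is not the module that was attempted in the project: the rule list, the lexicon and the function names are all invented for illustration. A rule is applied only when the rewritten form appears in a lexicon, since blindly replacing "rn" with "m" would also damage correct words such as "corner".

```python
from typing import Iterator, Set

# Hypothetical OCR confusion pairs: (misrecognised sequence, intended sequence)
RULES = [("rn", "m"), ("ii", "u"), ("vv", "w")]

# Hypothetical lexicon of known word forms (lower-cased)
LEXICON = {"from", "corner", "new", "union"}


def candidates(token: str) -> Iterator[str]:
    """Yield variants of the token with one rule applied at one position."""
    for wrong, right in RULES:
        start = token.find(wrong)
        while start != -1:
            yield token[:start] + right + token[start + len(wrong):]
            start = token.find(wrong, start + 1)


def normalise(token: str, lexicon: Set[str] = LEXICON) -> str:
    """Return a lexicon-attested rewrite of the token, or the token unchanged."""
    if token.lower() in lexicon:
        return token                  # already a known form, leave it alone
    for candidate in candidates(token):
        if candidate.lower() in lexicon:
            return candidate          # first rewrite that the lexicon confirms
    return token                      # nothing matched, keep the OCR output


print(normalise("frorn"))   # rewritten via the ("rn", "m") rule
print(normalise("corner"))  # kept, because it is already in the lexicon
```

Even this toy version hints at why the approach needs many rules and a good lexicon: names like "Baltirnore" stay uncorrected unless the lexicon also covers proper nouns.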

Page 12: An Open Corpus for Named Entity Recognition in Historic Newspapers

Evaluation NL

Derived via 4-fold cross-evaluation (25 out of 100 annotated pages)

Page 13: An Open Corpus for Named Entity Recognition in Historic Newspapers

Evaluation FR

Derived via 4-fold cross-evaluation (25 out of 100 annotated pages)
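The 4-fold cross-evaluation used for both languages can be sketched in Python as follows. It assumes the split is made at page level, so that each fold holds out 25 of the 100 annotated pages; the page identifiers, the fixed seed and the function name are hypothetical, and the training and scoring of the classifier itself is left out.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def k_fold_pages(
    page_ids: Sequence[str], k: int = 4, seed: int = 0
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_pages, test_pages) pairs; every page is held out exactly once."""
    pages = list(page_ids)
    random.Random(seed).shuffle(pages)          # reproducible shuffle
    folds = [pages[i::k] for i in range(k)]     # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


# Hypothetical page identifiers for one language
pages = [f"page_{n:03d}" for n in range(100)]
splits = list(k_fold_pages(pages))

for train, test in splits:
    print(len(train), "training pages,", len(test), "held-out pages")
```

Splitting by page rather than by sentence keeps text from the same page out of both the training and the held-out set within a fold.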

Page 14: An Open Corpus for Named Entity Recognition in Historic Newspapers

Use cases

• Improving search, information retrieval
– Within digital newspapers, the vast majority of user queries are person and place names
• Linking of named entities to authority files to create linked data
– The classification and disambiguation of named entities allows the assignment of unique identifiers from authoritative sources – thus enabling cross-language/cross-collection linking

Page 15: An Open Corpus for Named Entity Recognition in Historic Newspapers

Next steps

• Volunteers wanted! Help correct the corpus and collaboratively create a free dataset – instructions on GitHub wiki:
– github.com/EuropeanaNewspapers/ner-corpora/wiki/Corpus-cleanup
• Plans to improve performance:
– Add distributional similarity as feature (Clark 2003)
– Semantic generalisation (Faruqui & Padó 2010)
– Specialised gazetteers (e.g. list of historic place names)
– Data, data, data

Page 17: An Open Corpus for Named Entity Recognition in Historic Newspapers

References

• C. Neudecker, W.J. Faber, L. Wilms, T. van Veen: Large scale refinement of digital historical newspapers with named entity recognition. Proceedings of the IFLA Newspaper Section Satellite Meeting, 2014, Geneva, Switzerland.

• Y. Mosallam, A. Abi-Haidar, J.G. Ganascia: Unsupervised named entity recognition and disambiguation: An application to old French journals. Advances in Data Mining. Applications and Theoretical Aspects, Springer LNCS, 2014.

Page 18: An Open Corpus for Named Entity Recognition in Historic Newspapers

Thank you for your attention!
Questions?

Clemens Neudecker
Berlin State Library

@cneudecker