Polyglot-NER: Massive Multilingual Named Entity Recognition

SDM May 2, 2015

Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, Steve Skiena

Stony Brook University


Named Entity Recognition (NER) Problem

■ Input: Plain text, T

■ Output: The spans of T that constitute proper names, and the classification of the entity’s type.
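One way to make the input/output contract concrete is a small data structure for a labeled span. This is only a sketch: the `EntitySpan` class, its field names, and the use of character offsets are illustrative assumptions, not part of Polyglot-NER.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """Hypothetical container for one NER output item."""
    start: int  # character offset where the span begins (inclusive)
    end: int    # character offset where the span ends (exclusive)
    label: str  # entity type, e.g. "Location" or "Person"


def span_text(text: str, span: EntitySpan) -> str:
    """Return the substring of the input text covered by the span."""
    return text[span.start:span.end]


# Input: plain text T. Output: spans of T plus a type for each span.
T = "The city's mayor is Gregor Robertson."
start = T.index("Gregor Robertson")
span = EntitySpan(start, start + len("Gregor Robertson"), "Person")
```

Calling `span_text(T, span)` recovers the proper name that the span points at.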


NER Examples

Input: Vancouver is a coastal seaport city on the mainland of British Columbia. The city's mayor is Gregor Robertson.

Output: [Vancouver: Location] is a coastal seaport city on the mainland of [British Columbia: Location]. The city's mayor is [Gregor Robertson: Person].


Multilingual NER

❑ NLTK ■ English

❑ Stanford ■ English, Spanish, Chinese, Arabic

❑ OpenNLP ■ English, German, Dutch, Spanish

❑ Polyglot-NER ■ 40 Major Languages! (English, Spanish, French, German, Russian, Polish, Portuguese, Italian, Dutch, Arabic, Hebrew, Hindi, Korean, Japanese, Vietnamese, …)

While many pipelines exist, most languages are unsupported.


Does Multilingual Matter?

Yes! Only 55% of the top 10 million websites are in English! [1]

There are 51 languages on Wikipedia with 100,000+ articles. [2]

[1] http://w3techs.com/technologies/history_overview/content_language/ms/y

[2] http://meta.wikimedia.org/wiki/List_of_Wikipedias



Multilingual is Hard

Feature Scarcity: NLP tasks typically rely on language-specific feature engineering

❑ Orthographic features

❑ Part of Speech Tags

❑ Parallel Corpora

❑ WordNet

Our solution: neural word embeddings.

Annotation Scarcity: Need NER examples - labeled data is expensive.

Our solution: Wikipedia/Freebase for training examples.


Sub-problem: Word Representation

Input: Unstructured text

Output: Low dimensional word embeddings


Distributed Word Representations

Big Idea: Give similar words similar representations

[Figure: the words pine, oak, rose, daisy, reading, writing, read, write shown twice. With explicit dimensions, each word is a vector of size |V| (|V|: size of vocabulary). With latent dimensions, each word is a vector of size d, where d << |V|.]

Similar words share similar representations.
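The contrast between explicit and latent dimensions can be sketched in a few lines of NumPy. The dense vectors below are hand-picked for illustration with d = 3; they are not real Polyglot embeddings.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


vocab = ["pine", "oak", "rose", "daisy", "reading", "writing", "read", "write"]

# Explicit dimensions: one axis per word, so each vector has size |V|.
one_hot = np.eye(len(vocab))

# Latent dimensions: d = 3 << |V| in a real vocabulary.
# Values are made up so that trees, flowers, and verbs cluster together.
dense = np.array([
    [1.0, 0.1, 0.0],  # pine
    [0.9, 0.2, 0.0],  # oak
    [0.1, 1.0, 0.0],  # rose
    [0.2, 0.9, 0.0],  # daisy
    [0.0, 0.1, 1.0],  # reading
    [0.0, 0.2, 0.9],  # writing
    [0.1, 0.0, 1.0],  # read
    [0.1, 0.1, 0.9],  # write
])

pine, oak, reading = (vocab.index(w) for w in ("pine", "oak", "reading"))
sim_one_hot = cosine(one_hot[pine], one_hot[oak])    # every distinct pair is orthogonal
sim_dense_close = cosine(dense[pine], dense[oak])    # similar words, similar vectors
sim_dense_far = cosine(dense[pine], dense[reading])  # unrelated words, dissimilar vectors
```

With one-hot vectors every pair of distinct words looks equally unrelated, while the dense vectors place "pine" near "oak" and far from "reading".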


Polyglot Embeddings

● Wikipedia article text

● 137 Languages

● Available: http://bit.ly/embeddings

[Al-Rfou, Perozzi, Skiena, 13]

[Figure: model architecture. Each word of "Imagination is greater than detail" is mapped through the projection layer C, the result feeds a hidden layer H, and the network outputs a score.]
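The projection, hidden, and score layers in the diagram can be sketched as a forward pass in NumPy. Everything numeric here is an assumption made for illustration: the layer sizes, the random initialization, and the tanh nonlinearity are not stated on the slide, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary covering the example window from the diagram.
vocab = {"imagination": 0, "is": 1, "greater": 2, "than": 3, "detail": 4}
d, window, hidden = 4, 5, 8  # illustrative sizes, not the real model's

C = rng.normal(scale=0.1, size=(len(vocab), d))       # projection layer (embeddings)
W1 = rng.normal(scale=0.1, size=(window * d, hidden))  # hidden layer weights
b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.1, size=hidden)                # scoring weights
b2 = 0.0


def score(words):
    """Score a window of words: look up C, concatenate, apply H, output a scalar."""
    ids = [vocab[w.lower()] for w in words]
    x = C[ids].reshape(-1)    # concatenated embeddings, shape (window * d,)
    h = np.tanh(x @ W1 + b1)  # hidden layer H (tanh is an assumption)
    return float(h @ w2 + b2)


s = score(["Imagination", "is", "greater", "than", "detail"])
```

After training, the rows of `C` would serve as the word embeddings.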






Sub-Problem: Annotation Mining

Input: Wikipedia, Freebase

Output: Labeled NER training examples


Related Work


Annotations from Wikipedia

Wikipedia: Inter-wiki links are a great potential source of mentions.

Freebase: Freebase tells us which articles are entity articles.


Example

Wiki Text:

Vancouver is a coastal seaport city on the mainland of British Columbia. The city's mayor is Gregor Robertson.

Strings               Freebase MID    Freebase Category    NER Label

“Vancouver”           /m/080h2        City                 Location

“British Columbia”    /m/015jr        Region               Location

“Gregor Robertson”    /m/0grlms       Person               Person
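The chain from link text to NER label in this example can be sketched as three lookups. The dictionaries hold only the three rows from the example, and the function name `label_mention` is hypothetical; a real pipeline would read these mappings from Wikipedia and Freebase dumps.

```python
# Linked anchor text -> Freebase MID (rows from the example).
LINK_TO_MID = {
    "Vancouver": "/m/080h2",
    "British Columbia": "/m/015jr",
    "Gregor Robertson": "/m/0grlms",
}

# Freebase MID -> Freebase category.
MID_TO_CATEGORY = {
    "/m/080h2": "City",
    "/m/015jr": "Region",
    "/m/0grlms": "Person",
}

# Freebase category -> NER label.
CATEGORY_TO_LABEL = {
    "City": "Location",
    "Region": "Location",
    "Person": "Person",
}


def label_mention(anchor_text):
    """Return the NER label for a linked mention, or None if any lookup fails."""
    mid = LINK_TO_MID.get(anchor_text)
    category = MID_TO_CATEGORY.get(mid)
    return CATEGORY_TO_LABEL.get(category)
```

A mention whose link does not resolve to an entity article simply gets no label, which matches the idea that Freebase decides which articles count as entities.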


The Bad News

Many false negatives in our dataset!

■ Wikipedia editors annotate only the first mention of

an entity but not later ones.

■ Most of the named entity mentions are not linked!

Example:

Vancouver is a coastal seaport city on the

mainland of British Columbia. Vancouver’s

mayor is Gregor Robertson.



The Good News

Positive labels are very high quality!

Need to emphasize this in our training.


‘Learning Classifiers from only positive and unlabeled examples’ [Elkan & Noto, 08]


The trick: Oversampling

We can change the label distribution by oversampling from the positive labels.

p is the percentage of positive labels in the training dataset.

[Figure: two panels, captioned "Initially no oversampling" and "p = 0.5, much better".]
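Oversampling to a target share p of positive labels can be sketched as below. The function name, the `(features, is_positive)` tuple layout, and sampling with replacement are assumptions for illustration; the slide only states that positives are oversampled until they make up p of the training data.

```python
import random


def oversample(examples, p, seed=0):
    """Repeat positive examples until they make up about fraction p of the data.

    examples: list of (features, is_positive) pairs.
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    positives = [e for e in examples if e[1]]
    negatives = [e for e in examples if not e[1]]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    # Solve pos / (pos + neg) = p for the required number of positives.
    target = round(p * len(negatives) / (1 - p))

    rng = random.Random(seed)
    # Draw extra positives with replacement; never drop existing ones.
    extra = [rng.choice(positives) for _ in range(max(0, target - len(positives)))]

    result = positives + extra + negatives
    rng.shuffle(result)
    return result


# 10 positives and 90 negatives: positives start at 10% of the data.
data = [(i, True) for i in range(10)] + [(i, False) for i in range(90)]
balanced = oversample(data, p=0.5)
```

With p = 0.5 the 10 positives are repeated until there are 90 of them, matching the 90 negatives.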


Cross-Domain Performance

[Figure: cross-domain testing on CoNLL, comparing Oversampling against Oversampling + Exact Matching.]
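The slides name exact matching without defining it. One plausible reading, assumed here, is that once a string has been linked and labeled somewhere in an article, every other exact occurrence of that string receives the same label, which recovers some of the unlinked later mentions. The function name and span format below are hypothetical.

```python
import re


def exact_match_labels(text, linked):
    """Label every exact occurrence of an already-linked surface string.

    linked: dict mapping surface string -> NER label (from linked mentions).
    Returns a sorted list of (start, end, label) character spans.
    """
    spans = []
    # Try longer names first so they win over any shorter overlapping name.
    for surface in sorted(linked, key=len, reverse=True):
        pattern = r"\b" + re.escape(surface) + r"\b"
        for m in re.finditer(pattern, text):
            overlaps = any(m.start() < e and s < m.end() for s, e, _ in spans)
            if not overlaps:
                spans.append((m.start(), m.end(), linked[surface]))
    return sorted(spans)


text = ("Vancouver is a coastal seaport city on the mainland of British Columbia. "
        "Vancouver's mayor is Gregor Robertson.")
linked = {
    "Vancouver": "Location",
    "British Columbia": "Location",
    "Gregor Robertson": "Person",
}
spans = exact_match_labels(text, linked)
```

Here the second, unlinked "Vancouver" picks up the Location label from the first, linked one.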


NER Demo @ http://bit.ly/polyglot-ner

Legend: Location Organization Person






But How to Evaluate?

■ We have labeled data for a few languages

■ Would like to evaluate everything


Distant Evaluation

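[Figure: the sentences "John proviene de la ciudad de Nueva York." and "John is coming from New York City." are linked by Machine Translation. Entity counts on one side are Person: 1, Location: 1, Organization: 0; on the other side they are Person: 0, Location: 1, Organization: 1.]

Calculate the error of omitting entities and the error of adding entities. In this example, each error is 1.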
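The two errors can be sketched from per-category entity counts. This sketch uses raw, unnormalized count differences and treats the counts Person: 1, Location: 1, Organization: 0 as the reference side; both choices are assumptions, since the slide does not give the exact formula or say which counts belong to which sentence.

```python
def distant_errors(reference, predicted):
    """Compare per-category entity counts for one sentence pair.

    Returns (missing, adding): entities the predicted side omitted and
    entities it added, relative to the reference counts.
    """
    categories = set(reference) | set(predicted)
    missing = sum(max(0, reference.get(c, 0) - predicted.get(c, 0)) for c in categories)
    adding = sum(max(0, predicted.get(c, 0) - reference.get(c, 0)) for c in categories)
    return missing, adding


reference = {"Person": 1, "Location": 1, "Organization": 0}
predicted = {"Person": 0, "Location": 1, "Organization": 1}
missing, adding = distant_errors(reference, predicted)
```

In the example one Person is omitted and one Organization is added, so both errors are 1; swapping which side is the reference gives the same two values.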


Experimental Design

Distant Evaluation for Polyglot-NER:

1. Annotate English Wikipedia sentences using Stanford NER.

2. Randomly pick 1500 sentences that have at least one entity detected.

3. Translate these sentences using Google Translate to 40 languages.

4. Run Polyglot-NER on the translated datasets.

5. Compare the number of entity chunks our annotators found to the ones detected by Stanford per sentence.

6. Calculate the error of omitting (ℰ𝓜) and adding entities (ℰ𝒜).
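Steps 1 to 5 can be sketched as a small driver that takes the three external systems as plain callables. The names `reference_ner`, `translate`, and `target_ner` are hypothetical stand-ins for Stanford NER, a translation service, and Polyglot-NER; no real API of those tools is assumed. Each tagger is assumed to return a list of `(chunk, label)` pairs.

```python
import random
from collections import Counter


def distant_evaluation(sentences, reference_ner, translate, target_ner,
                       sample_size=1500, seed=0):
    """Return one (reference_counts, predicted_counts) pair per sampled sentence."""
    # Step 1: annotate the English sentences with the reference tagger.
    annotated = [(s, reference_ner(s)) for s in sentences]

    # Step 2: randomly pick sentences with at least one detected entity.
    with_entities = [(s, ents) for s, ents in annotated if ents]
    rng = random.Random(seed)
    sample = rng.sample(with_entities, min(sample_size, len(with_entities)))

    results = []
    for sentence, ref_entities in sample:
        translated = translate(sentence)       # Step 3: translate
        pred_entities = target_ner(translated)  # Step 4: tag the translation
        # Step 5: compare entity chunk counts per category.
        ref_counts = Counter(label for _, label in ref_entities)
        pred_counts = Counter(label for _, label in pred_entities)
        results.append((ref_counts, pred_counts))
    return results


# Stub systems so the sketch runs on its own.
def stub_reference_ner(s):
    return [("John", "Person")] if "John" in s else []


def stub_translate(s):
    return s


def stub_target_ner(s):
    return []


pairs = distant_evaluation(["John runs.", "It rains."],
                           stub_reference_ner, stub_translate, stub_target_ner)
```

The per-sentence count pairs are what step 6 would consume to compute the omitting and adding errors.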


Effect of Data Size

■ Size of training data matters!

■ Tokenization is quite important when the word embeddings coverage is limited.

[Figure: Error Missing versus # Words (Log Scale), with regions annotated "More Data Will Help", "Anomalies", and "Good".]


Performance by Category

[Figure: ℰ𝒜 (Adding Error) and ℰ𝓜 (Missing Error) by category, with panels for Person and Location.]


Limitations

■ Named entities don’t always translate well:

❑ Ex: “Γείτονας Shanna Rudd δήλωσε στο CNN …”

■ Need a working translation system for the language


Take-aways

■NER in 40 languages!

■Word embeddings & oversampling offer performance equal to or better than feature engineering for NER annotation mining.

■Translation based evaluation?


Thanks!

NER Demo: http://bit.ly/polyglot-ner

NER Code: http://polyglot-nlp.com


www.perozzi.net

Bryan Perozzi




