Polyglot-NER: Massive Multilingual Named Entity Recognition

Polyglot-NER: Massive Multilingual Named Entity Recognition SDM May 2, 2015 Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, Steve Skiena Stony Brook University

Page 1: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Polyglot-NER: Massive Multilingual Named Entity Recognition

SDM May 2, 2015

Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, Steve Skiena

Stony Brook University

Page 2: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition

Named Entity Recognition (NER) Problem

■ Input: Plain text, T

■ Output: The spans of T that constitute proper names, and the classification of the entity’s type.

Page 3: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

NER Examples

Input: Vancouver is a coastal seaport city on the mainland of British Columbia. The city's mayor is Gregor Robertson.

Output: [Vancouver: Location] is a coastal seaport city on the mainland of [British Columbia: Location]. The city's mayor is [Gregor Robertson: Person].
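One way to make the output format concrete is to store each entity as a character span plus a type. The sketch below is only an illustration of that idea; the `EntitySpan` class and the `span_for` helper are hypothetical names, not part of Polyglot-NER.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    start: int  # character offset where the mention begins
    end: int    # character offset one past the mention's last character
    label: str  # entity type, e.g. "Location" or "Person"


TEXT = ("Vancouver is a coastal seaport city on the mainland of "
        "British Columbia. The city's mayor is Gregor Robertson.")


def span_for(text: str, mention: str, label: str) -> EntitySpan:
    """Build a span for the first occurrence of `mention` in `text`."""
    start = text.index(mention)
    return EntitySpan(start, start + len(mention), label)


# The three entities from the slide's example, with the slide's labels.
SPANS = [
    span_for(TEXT, "Vancouver", "Location"),
    span_for(TEXT, "British Columbia", "Location"),
    span_for(TEXT, "Gregor Robertson", "Person"),
]

for span in SPANS:
    print(TEXT[span.start:span.end], "->", span.label)
```

Storing offsets rather than strings keeps the annotation tied to a position in T, so a repeated mention can be labeled independently of the first one.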

Page 4: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Multilingual NER

❑ NLTK ■ English

❑ Stanford ■ English, Spanish, Chinese, Arabic

❑ OpenNLP ■ English, German, Dutch, Spanish

❑ Polyglot-NER ■ 40 Major Languages! (English, Spanish, French, German, Russian, Polish, Portuguese, Italian, Dutch, Arabic, Hebrew, Hindi, Korean, Japanese, Vietnamese, …)

While many pipelines exist, most languages are unsupported.

Page 5: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Does Multilingual Matter?

Yes! Only 55% of the top 10 million websites are in English! [1]

There are 51 languages on Wikipedia with 100,000+ articles. [2]

[1] http://w3techs.com/technologies/history_overview/content_language/ms/y

[2] http://meta.wikimedia.org/wiki/List_of_Wikipedias

Page 6: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Multilingual is Hard

Feature Scarcity

NLP tasks typically rely on language-specific feature engineering:

❑ Orthographic features

❑ Part of Speech Tags

❑ Parallel Corpora

❑ WordNet

Our solution: neural word embeddings.

Annotation Scarcity

Need NER examples, and labeled data is expensive.

Our solution: Wikipedia/Freebase for training examples.

Page 7: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Sub-problem: Word Representation

Input: Unstructured text

Output: Low dimensional word embeddings

Page 8: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Distributed Word Representations

Big Idea: Give similar words similar representations

[Figure: the words pine, oak, rose, daisy, reading, writing, read, write shown under two representations. Explicit Dimensions: vectors of size |V| (|V|: size of vocabulary). Latent Dimensions: vectors of size d, with d << |V|.]

Similar words share similar representations.
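The contrast between explicit and latent dimensions can be sketched in a few lines. In the example below the explicit vectors have one dimension per vocabulary word, while the latent vectors are hypothetical 2-dimensional values picked by hand for illustration; they are not taken from the Polyglot embeddings.

```python
import math

VOCAB = ["pine", "oak", "rose", "daisy", "reading", "writing", "read", "write"]


def explicit(word):
    """Explicit representation: one dimension per vocabulary word (|V| = 8)."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


# Hypothetical latent vectors with d = 2 << |V|, hand-picked for illustration.
LATENT = {
    "pine": [1.0, 0.1], "oak": [0.9, 0.2],
    "rose": [0.8, 0.3], "daisy": [0.7, 0.3],
    "reading": [0.1, 0.9], "writing": [0.2, 1.0],
    "read": [0.1, 0.8], "write": [0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Distinct words never overlap in the explicit space.
print(cosine(explicit("pine"), explicit("oak")))
# In the latent space, related words can end up close together.
print(cosine(LATENT["pine"], LATENT["oak"]))
print(cosine(LATENT["pine"], LATENT["reading"]))
```

In the explicit space every pair of distinct words is equally dissimilar, which is why a classifier trained on it cannot generalize from "pine" to "oak"; a latent space makes that generalization possible.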

Page 9: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Polyglot Embeddings

● Wikipedia article text

● 137 Languages

● Available: http://bit.ly/embeddings

[Al-Rfou, Perozzi, Skiena, 13]

[Figure: the embedding network. Each word of the window "Imagination is greater than detail" passes through the projection layer C, then the hidden layer H, to produce a score.]
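The diagram's data flow (projection layer C, hidden layer H, score) can be sketched as a forward pass. Everything beyond that flow is an assumption made for illustration: the layer sizes, the tanh activation, the absence of bias terms, and the random weights are not taken from the slides, and the training objective is not shown.

```python
import math
import random

random.seed(0)

WINDOW = ["Imagination", "is", "greater", "than", "detail"]
VOCAB = sorted(set(WINDOW))
D = 4       # embedding size (assumed)
HIDDEN = 3  # number of hidden units (assumed)

# Projection layer C: one D-dimensional embedding per vocabulary word.
C = {w: [random.uniform(-0.1, 0.1) for _ in range(D)] for w in VOCAB}
# Hidden layer H: maps the concatenated window (5 * D values) to HIDDEN units.
H = [[random.uniform(-0.1, 0.1) for _ in range(len(WINDOW) * D)]
     for _ in range(HIDDEN)]
# Output weights: map the hidden units to a single score (assumed linear).
W_OUT = [random.uniform(-0.1, 0.1) for _ in range(HIDDEN)]


def score(window):
    """Score a window of words: look up, concatenate, hidden layer, scalar."""
    projected = [x for word in window for x in C[word]]
    hidden = [math.tanh(sum(w * x for w, x in zip(row, projected)))
              for row in H]
    return sum(w * h for w, h in zip(W_OUT, hidden))


print(score(WINDOW))
```

After training, the rows of C are the word embeddings; the hidden layer and the score exist only to provide a training signal.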

Page 10: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Sub-Problem: Annotation Mining

Input: Wikipedia, Freebase

Output: Labeled NER training examples

Page 11: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Related Work

Page 12: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Annotations from Wikipedia

Wikipedia: Inter-wiki links are a great potential source of mentions.

Freebase: Freebase tells us which articles are entity articles.

Page 13: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Example

Wiki Text: Vancouver is a coastal seaport city on the mainland of British Columbia. The city's mayor is Gregor Robertson.

Strings               Freebase MID    Freebase Category    NER Label
“Vancouver”           /m/080h2        City                 Location
“British Columbia”    /m/015jr        Region               Location
“Gregor Robertson”    /m/0grlms       Person               Person
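The chain in this example (linked string, Freebase MID, Freebase category, NER label) can be sketched as three lookups. The tables below hold only the three rows from the slide; the dictionary names and the `label_mentions` function are hypothetical, and a real pipeline would read these mappings from Wikipedia and Freebase dumps.

```python
# Linked string -> Freebase MID (the three rows from the slide).
LINK_TO_MID = {
    "Vancouver": "/m/080h2",
    "British Columbia": "/m/015jr",
    "Gregor Robertson": "/m/0grlms",
}
# Freebase MID -> Freebase category.
MID_TO_CATEGORY = {
    "/m/080h2": "City",
    "/m/015jr": "Region",
    "/m/0grlms": "Person",
}
# Freebase category -> NER label.
CATEGORY_TO_LABEL = {
    "City": "Location",
    "Region": "Location",
    "Person": "Person",
}


def label_mentions(linked_mentions):
    """Return (mention, NER label) pairs for links that resolve to an entity."""
    labeled = []
    for mention in linked_mentions:
        mid = LINK_TO_MID.get(mention)
        category = MID_TO_CATEGORY.get(mid)
        label = CATEGORY_TO_LABEL.get(category)
        if label is not None:  # skip links that are not entity articles
            labeled.append((mention, label))
    return labeled


print(label_mentions(["Vancouver", "seaport", "British Columbia",
                      "Gregor Robertson"]))
```

A link such as "seaport" has no entry in the tables, so it yields no training label; that filter is the role Freebase plays in this sketch.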

Page 14: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

The Bad News

Many false negatives in our dataset!

■ Wikipedia editors annotate only the first mention of an entity but not later ones.

■ Most of the named entity mentions are not linked!

Example: Vancouver is a coastal seaport city on the mainland of British Columbia. Vancouver’s mayor is Gregor Robertson.

Page 15: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

The Good News

Positive labels are very high quality!

Need to emphasize this in our training.

‘Learning Classifiers from Only Positive and Unlabeled Data’ [Elkan & Noto, 08]

Page 16: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

The trick: Oversampling

We can change the label distribution by oversampling from the positive labels.

p is the fraction of positive labels in the training dataset.

[Figure: plot over p, annotated "Initially no oversampling" and "p = 0.5, much better".]
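A minimal sketch of oversampling to a target fraction p follows. It assumes that all negative examples are kept and that positives are drawn with replacement until they make up a fraction p of the training set; the slides do not say which sampling scheme was actually used, and the `oversample` function is a hypothetical name.

```python
import random


def oversample(positives, negatives, p, seed=0):
    """Return a shuffled training list in which a fraction p is positive.

    Keeps every negative example and draws positives with replacement
    (an assumed scheme, chosen for illustration).
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    rng = random.Random(seed)
    # Solve n_pos / (n_pos + n_neg) = p for n_pos.
    n_pos = round(p * len(negatives) / (1 - p))
    sampled = [rng.choice(positives) for _ in range(n_pos)]
    data = sampled + list(negatives)
    rng.shuffle(data)
    return data


# Toy data: 10 positive and 90 negative (token, label) examples.
positives = [("entity_%d" % i, 1) for i in range(10)]
negatives = [("word_%d" % i, 0) for i in range(90)]

balanced = oversample(positives, negatives, p=0.5)
print(len(balanced), sum(label for _, label in balanced))
```

With p = 0.5 each positive example is seen about nine times as often as before, which is one way to emphasize the high-quality positive labels during training.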

Page 17: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Cross-Domain Performance

[Figure: Cross-Domain Testing on CoNLL, comparing Oversampling with Oversampling + Exact Matching.]

Page 18: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

NER Demo @ http://bit.ly/polyglot-ner

Legend: Location Organization Person

Page 19: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

But How to Evaluate?

■ We have labeled data for a few languages

■ Would like to evaluate everything

Page 20: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Distant Evaluation

English: John is coming from New York City.

Person: 1, Location: 1, Organization: 0

Machine Translation

Spanish: John proviene de la ciudad de Nueva York.

Person: 0, Location: 1, Organization: 1

Calculate the error of omitting entities and the error of adding entities.

Omitted entities: 1, added entities: 1

Page 21: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Experimental Design

Distant Evaluation for Polyglot-NER:

1. Annotate English Wikipedia sentences using Stanford NER.

2. Randomly pick 1500 sentences that have at least one entity detected.

3. Translate these sentences into 40 languages using Google Translate.

4. Run Polyglot-NER on the translated datasets.

5. Per sentence, compare the number of entity chunks our annotators found to the number detected by Stanford.

6. Calculate the error of omitting entities (ℰ𝓜) and the error of adding entities (ℰ𝒜).
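Steps 5 and 6 can be sketched as a comparison of per-category entity counts for one sentence. The slides do not give the exact definitions of ℰ𝓜 and ℰ𝒜, so the formulas below (raw count differences, with no normalization) are an assumption, and `distant_errors` is a hypothetical name. The counts are the ones shown on the Distant Evaluation slide, read here as the English reference first and the output on the translation second.

```python
CATEGORIES = ("Person", "Location", "Organization")


def distant_errors(reference, predicted):
    """Return (omitted, added) entity counts for a single sentence.

    omitted: entities in the reference that the prediction lacks.
    added:   entities in the prediction that the reference lacks.
    """
    omitted = sum(max(0, reference.get(c, 0) - predicted.get(c, 0))
                  for c in CATEGORIES)
    added = sum(max(0, predicted.get(c, 0) - reference.get(c, 0))
                for c in CATEGORIES)
    return omitted, added


# Reference counts on the English sentence.
reference = {"Person": 1, "Location": 1, "Organization": 0}
# Counts found on the translated sentence.
predicted = {"Person": 0, "Location": 1, "Organization": 1}

print(distant_errors(reference, predicted))  # (1, 1)
```

Comparing counts rather than spans sidesteps word alignment between the two languages, at the cost of not checking that the same entity was found.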

Page 22: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Effect of Data Size

■ Size of training data matters!

■ Tokenization is quite important when the word embeddings coverage is limited.

[Figure: Missing Error versus # Words (Log Scale), with regions annotated "More Data Will Help", "Anomalies", and "Good".]

Page 23: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Performance by Category

[Figure: ℰ𝒜 (Adding Error) and ℰ𝓜 (Missing Error) by category; panels: Person, Location.]

Page 24: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Limitations

■ Named entities don’t always translate well:

❑ Ex: “Γείτονας Shanna Rudd δήλωσε στο CNN …” (Greek: “Neighbor Shanna Rudd told CNN …”; the names stay in Latin script)

■ Need a working translation system for the language

Page 25: POLYGLOT-NER: Massive Multilingual Named Entity Recognition

Take-aways

■ NER in 40 languages!

■ Word embeddings & oversampling offer performance equal to or better than feature engineering for NER annotation mining.

■ Translation-based evaluation?