Polyglot-NER: Massive Multilingual
Named Entity Recognition
SDM May 2, 2015
Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, Steve Skiena
Stony Brook University
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Named Entity Recognition (NER) Problem
■Input:
Plain text, T
■Output:
The spans of T that constitute proper names,
and the classification of the entity’s type.
NER Examples
Input: Vancouver is a coastal seaport city on the mainland
of British Columbia. The city's mayor is Gregor Robertson.
Output: [Vancouver: Location] is a coastal seaport city on the mainland
of [British Columbia: Location]. The city's mayor is [Gregor Robertson: Person].
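A minimal sketch of how the output above could be represented in code, assuming character-offset spans. The `EntitySpan` type and `find_span` helper are hypothetical names used for illustration, not part of Polyglot-NER.

```python
from typing import List, NamedTuple


class EntitySpan(NamedTuple):
    start: int   # character offset where the mention begins
    end: int     # character offset one past the last character
    label: str   # entity type, e.g. "Location" or "Person"


def find_span(text: str, mention: str, label: str) -> EntitySpan:
    """Locate the first occurrence of `mention` in `text` and wrap it as a span."""
    start = text.index(mention)
    return EntitySpan(start, start + len(mention), label)


text = ("Vancouver is a coastal seaport city on the mainland of "
        "British Columbia. The city's mayor is Gregor Robertson.")

spans: List[EntitySpan] = [
    find_span(text, "Vancouver", "Location"),
    find_span(text, "British Columbia", "Location"),
    find_span(text, "Gregor Robertson", "Person"),
]

for s in spans:
    print(text[s.start:s.end], "->", s.label)
```

Storing offsets rather than strings keeps repeated mentions of the same name distinguishable.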
Multilingual NER
❑ NLTK: English
❑ Stanford: English, Spanish, Chinese, Arabic
❑ OpenNLP: English, German, Dutch, Spanish
❑ Polyglot-NER: 40 Major Languages! (English, Spanish, French, German, Russian, Polish, Portuguese, Italian, Dutch, Arabic, Hebrew, Hindi, Korean, Japanese, Vietnamese, …)
While many pipelines exist, most languages are unsupported
Does Multilingual Matter?
Yes! Only 55% of the top 10 million websites are in English! [1]
There are 51 languages on Wikipedia with 100,000+
articles. [2]
[1] http://w3techs.com/technologies/history_overview/content_language/ms/y
[2] http://meta.wikimedia.org/wiki/List_of_Wikipedias
Multilingual is Hard
Feature Scarcity
NLP tasks typically rely on language-specific feature engineering:
❑ Orthographic features
❑ Part of Speech Tags
❑ Parallel Corpora
❑ WordNet
Our solution: neural word embeddings.
Annotation Scarcity
Need NER examples - labeled data is expensive.
Our solution: Wikipedia/Freebase for training examples.
Sub-problem: Word Representation
Input: Unstructured text
Output: Low dimensional word embeddings
Distributed Word Representations
Big Idea: Give similar words similar representations
[Figure: two matrices over the words pine, oak, rose, daisy, reading, writing, read, write. Explicit Dimensions: each word is a vector of length |V| (|V|: size of vocabulary). Latent Dimensions: each word is a vector of length d, with d << |V|.]
Similar words share similar representations.
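The contrast between explicit and latent dimensions can be sketched with a toy comparison. The 2-dimensional vectors below are made-up values chosen for illustration, not real Polyglot embeddings; only the word list comes from the slide.

```python
import math

words = ["pine", "oak", "rose", "daisy", "reading", "writing", "read", "write"]


def one_hot(word):
    """Explicit representation: one dimension per vocabulary word (|V| dims)."""
    return [1.0 if w == word else 0.0 for w in words]


# Made-up d = 2 latent vectors, for illustration only.
dense = {
    "pine": [0.9, 0.1], "oak": [0.8, 0.2],
    "rose": [0.7, 0.4], "daisy": [0.6, 0.5],
    "reading": [0.1, 0.9], "writing": [0.2, 0.8],
    "read": [0.1, 0.8], "write": [0.2, 0.7],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


print(cosine(one_hot("pine"), one_hot("oak")))   # distinct one-hot vectors are orthogonal
print(cosine(dense["pine"], dense["oak"]))       # similar words, high similarity
print(cosine(dense["pine"], dense["write"]))     # dissimilar words, lower similarity
```

With one-hot vectors every pair of distinct words looks equally unrelated; a dense space is what lets "pine" sit closer to "oak" than to "write".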
Polyglot Embeddings
● Wikipedia article text
● 137 Languages
● Available: http://bit.ly/embeddings
[Al-Rfou, Perozzi, Skiena, 13]
[Figure: network diagram. Each word of "Imagination is greater than detail" is looked up in C (Projection Layer), followed by Hidden Layer H and a Score output.]
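The diagram's labels (C, Projection Layer, Hidden Layer H, Score) suggest a window-scoring network. Below is a rough forward-pass sketch under assumptions the slide does not state: a tanh nonlinearity, no bias terms, toy layer sizes, and random untrained weights. It illustrates the data flow only and is not the authors' implementation.

```python
import math
import random

random.seed(0)

DIM = 4      # embedding size d (toy value, an assumption)
HIDDEN = 3   # hidden layer width (toy value, an assumption)
window = ["Imagination", "is", "greater", "than", "detail"]

# C: embedding table, one d-dimensional row per vocabulary word.
C = {w: [random.uniform(-0.1, 0.1) for _ in range(DIM)] for w in window}

# H: hidden-layer weights over the concatenated window.
H = [[random.uniform(-0.1, 0.1) for _ in range(DIM * len(window))]
     for _ in range(HIDDEN)]

# w_out: maps the hidden units to a single score.
w_out = [random.uniform(-0.1, 0.1) for _ in range(HIDDEN)]


def score(words):
    """Projection layer -> hidden layer (tanh, assumed) -> scalar score."""
    projection = [x for w in words for x in C[w]]   # concatenate the lookups
    hidden = [math.tanh(sum(h * x for h, x in zip(row, projection)))
              for row in H]
    return sum(o * h for o, h in zip(w_out, hidden))


print(score(window))
```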
Sub-Problem: Annotation Mining
Input: Wikipedia, Freebase
Output: Labeled NER training examples
Related Work
Annotations from Wikipedia
Inter-wiki links are a great
potential source of mentions.
[Figure: Wikipedia and Freebase]
Freebase tells us which articles
are entity articles.
Example
Wiki Text:
Vancouver is a coastal seaport city on the mainland of
British Columbia. The city's mayor is Gregor Robertson.
Strings              Freebase MID   Freebase Category   NER Label
“Vancouver”          /m/080h2       City                Location
“British Columbia”   /m/015jr       Region              Location
“Gregor Robertson”   /m/0grlms      Person              Person
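The lookup chain in this example (linked string, then Freebase MID, then category, then NER label) can be sketched as two dictionary lookups. The MIDs, categories and labels are copied from the slide; the dictionary names, the `label_links` helper and the `(anchor, MID)` link format are hypothetical.

```python
# Values below are copied from the slide's example.
MID_TO_CATEGORY = {
    "/m/080h2": "City",
    "/m/015jr": "Region",
    "/m/0grlms": "Person",
}

# Category -> NER label; only the three rows from the slide are covered.
CATEGORY_TO_LABEL = {
    "City": "Location",
    "Region": "Location",
    "Person": "Person",
}

# Hypothetical representation of links found in the wiki text: (anchor, MID).
links = [
    ("Vancouver", "/m/080h2"),
    ("British Columbia", "/m/015jr"),
    ("Gregor Robertson", "/m/0grlms"),
]


def label_links(links):
    """Turn linked mentions into (string, NER label) training annotations."""
    annotations = []
    for anchor, mid in links:
        category = MID_TO_CATEGORY.get(mid)
        label = CATEGORY_TO_LABEL.get(category)
        if label is not None:   # skip links whose target is not a known entity
            annotations.append((anchor, label))
    return annotations


print(label_links(links))
```

Links whose target has no known category are dropped rather than guessed, which keeps the positive labels clean.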
The Bad News
Many false negatives in our dataset!
■ Wikipedia editors annotate only the first mention of
an entity but not later ones.
■ Most of the named entity mentions are not linked!
Example:
Vancouver is a coastal seaport city on the
mainland of British Columbia. Vancouver’s
mayor is Gregor Robertson.
The Good News
Positive labels are very
high quality!
Need to emphasize this in
our training.
‘Learning Classifiers from Only Positive and Unlabeled Data’ [Elkan & Noto, 08]
The trick: Oversampling
We can change the label distribution by oversampling from the positive labels.
p is the fraction of positive labels in the training dataset.
[Figure: plot over p. Annotations: "Initially no oversampling"; "p = 0.5, much better".]
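A minimal sketch of oversampling to a target positive fraction p. It assumes examples are `(features, label)` pairs and that the label "O" marks a non-entity word; both conventions, and the `oversample_positives` name, are illustrative assumptions rather than details from the talk.

```python
import random


def oversample_positives(examples, p, seed=0):
    """Return a training list in which about a fraction p of examples are positive.

    Positives are duplicated by sampling with replacement; negatives are kept
    unchanged. The label "O" is assumed to mark non-entity words.
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] != "O"]
    negatives = [e for e in examples if e[1] == "O"]
    if not positives:
        return list(negatives)
    # Solve n_pos / (n_pos + n_neg) = p for n_pos.
    target = round(p * len(negatives) / (1 - p))
    extra = max(0, target - len(positives))
    resampled = positives + [rng.choice(positives) for _ in range(extra)]
    combined = resampled + negatives
    rng.shuffle(combined)
    return combined


data = [("Vancouver", "LOC")] + [(f"w{i}", "O") for i in range(9)]
balanced = oversample_positives(data, p=0.5)
share = sum(1 for _, y in balanced if y != "O") / len(balanced)
print(len(balanced), share)
```

Duplicating positives instead of discarding negatives keeps all of the unlabeled text in play while shifting the label distribution.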
Cross-Domain Performance
[Figure: cross-domain testing on CoNLL, comparing Oversampling with Oversampling + Exact Matching.]
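The slides name exact matching without defining it. One plausible reading, given that only the first mention of an entity tends to be linked, is to label every later occurrence of a surface form that was linked somewhere in the article. The sketch below implements that reading only; the `exact_match_labels` function, the token format and the "O"/"LOC"/"PER" tags are assumptions.

```python
def exact_match_labels(tokens, linked_mentions):
    """Label every occurrence of a linked mention's surface form.

    tokens: list of word strings.
    linked_mentions: dict mapping a tuple of tokens (surface form) to a label.
    Returns one label per token, "O" for tokens outside any match.
    """
    labels = ["O"] * len(tokens)
    # Try longer surface forms first so multi-word names win over their parts.
    for form in sorted(linked_mentions, key=len, reverse=True):
        n = len(form)
        for i in range(len(tokens) - n + 1):
            window_free = all(l == "O" for l in labels[i:i + n])
            if window_free and tuple(tokens[i:i + n]) == form:
                labels[i:i + n] = [linked_mentions[form]] * n
    return labels


tokens = ("Vancouver is a coastal seaport city on the mainland of "
          "British Columbia . Vancouver 's mayor is Gregor Robertson .").split()
linked = {
    ("Vancouver",): "LOC",
    ("British", "Columbia"): "LOC",
    ("Gregor", "Robertson"): "PER",
}
labels = exact_match_labels(tokens, linked)
print(list(zip(tokens, labels)))
```

Here the second, unlinked "Vancouver" picks up the label of the first, which is the kind of false negative the approach is meant to recover.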
NER Demo @ http://bit.ly/polyglot-ner
Legend: Location Organization Person
But How to Evaluate?
■ We have labeled data for a few languages
■ Would like to evaluate everything
Distant Evaluation
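English: John is coming from New York City.
Machine Translation: John proviene de la ciudad de Nueva York.
Entity counts for the two annotations being compared:
Person: 1, Location: 1, Organization: 0
Person: 0, Location: 1, Organization: 1
Calculate the error of omitting entities and the error of adding entities.
[Figure: the two error counts shown for this example are 1 and 1.]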
Experimental Design
Distant Evaluation for Polyglot-NER:
1. Annotate English Wikipedia sentences using Stanford NER.
2. Randomly pick 1500 sentences that have at least one entity detected.
3. Translate these sentences into 40 languages using Google Translate.
4. Run Polyglot-NER on the translated datasets.
5. Compare the number of entity chunks our annotators found to the
ones detected by Stanford per sentence.
6. Calculate the error of omitting (ℰ𝓜) and adding entities (ℰ𝒜)
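Steps 5 and 6 can be sketched as a comparison of per-sentence entity counts. The slides do not give the exact formulas for ℰ𝓜 and ℰ𝒜, so the normalization below (dividing by the total number of reference entities) is an assumption, as are the function name and the choice of which count set in the example is the reference.

```python
CATEGORIES = ("Person", "Location", "Organization")


def distant_errors(reference_counts, system_counts):
    """Compare per-sentence entity counts from a reference tagger and a system.

    Both arguments are lists of dicts mapping category -> number of entity
    chunks in that sentence. Returns (omitting_error, adding_error), each
    normalized by the total number of reference entities (an assumed choice).
    """
    omitted = added = total = 0
    for ref, sys_ in zip(reference_counts, system_counts):
        for cat in CATEGORIES:
            r, s = ref.get(cat, 0), sys_.get(cat, 0)
            omitted += max(0, r - s)   # reference found more than the system
            added += max(0, s - r)     # system found more than the reference
            total += r
    if total == 0:
        return 0.0, 0.0
    return omitted / total, added / total


# Counts from the Distant Evaluation example; which side is the reference
# is assumed here.
reference = [{"Person": 1, "Location": 1, "Organization": 0}]
system = [{"Person": 0, "Location": 1, "Organization": 1}]
print(distant_errors(reference, system))
```

Because only counts are compared, the method needs no word alignment between the English sentence and its translation.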
Effect of Data Size
■ Size of training data matters!
■ Tokenization is quite important when the word embeddings coverage is limited.
[Figure: Error Missing vs. # Words (Log Scale). Annotations: "More Data Will Help", "Anomalies", "Good".]
Performance by Category
[Figure: Adding Error (ℰ𝒜) and Missing Error (ℰ𝓜) by category, with panels for Person and Location.]
Limitations
■ Named entities don’t always translate well:
❑ Ex: “Γείτονας Shanna Rudd δήλωσε στο CNN …” (Greek: “Neighbor Shanna Rudd told CNN …”)
■ Need a working translation system for the language
Take-aways
■ NER in 40 languages!
■ Word embeddings & oversampling offer performance equal to or better than feature engineering for NER annotation mining.
■ Translation-based evaluation?
Thanks!
NER Demo: http://bit.ly/polyglot-ner
NER Code: http://polyglot-nlp.com
www.perozzi.net
Bryan Perozzi