Named Entity Recognition for Twitter Microposts (only) using Distributed Word Representations
TRANSCRIPT
ELIS – Multimedia Lab
Fréderic Godin, Baptist Vandersmissen, Wesley De Neve & Rik Van de Walle
Multimedia Lab, Ghent University – iMinds
Find me at: @frederic_godin / www.fredericgodin.com
NER in Twitter Microposts using Distributed Word Representations, Fréderic Godin et al., 31 July 2015
Introduction
Goal: Recognizing 10 types of named entities (NEs) in noisy Twitter microposts
Problem: Tweets contain spelling mistakes, slang and lack uniform grammar rules
Traditional solutions
Typical features: orthographic features, gazetteers, corpus statistics or other parsing techniques (PoS tagging and chunking)
Typical machine learning techniques: CRF, HMM
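As a hedged illustration (not the authors' code), this is the kind of hand-crafted, per-token orthographic feature function a CRF or HMM tagger typically consumes; the feature names are hypothetical:

```python
# Sketch of typical hand-crafted orthographic features for one token,
# as consumed by CRF/HMM-based NER systems. Names are illustrative.
def orthographic_features(token):
    return {
        "is_capitalized": token[:1].isupper(),   # e.g. "Beijing"
        "is_all_caps": token.isupper(),          # e.g. "NASA"
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "prefix_3": token[:3],
        "suffix_3": token[-3:],
    }
```

Exactly this kind of manual feature design is what the embedding-based approach below tries to avoid.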
| Team          | POS | Orthographic | Gazetteers | Brown clustering | Word embedding       | ML                       | F1 (%) |
|---------------|-----|--------------|------------|------------------|----------------------|--------------------------|--------|
| ousia         | X   | X            | X          | –                | GloVe                | entity linking using SVM | 56.41  |
| NLANGP        | –   | X            | X          | X                | word2vec & GloVe     | CRF++                    | 51.40  |
| nrc           | –   | –            | X          | X                | word2vec             | semi-Markov MIRA         | 44.74  |
| multimedialab | –   | –            | –          | –                | word2vec             | FFNN                     | 43.75  |
| USFD          | X   | X            | X          | X                | –                    | CRF (L-BFGS)             | 42.46  |
| iitp          | X   | X            | X          | –                | –                    | CRF++                    | 39.84  |
| Hallym        | X   | –            | –          | X                | correlation analysis | CRFsuite                 | 37.21  |
| lattice       | X   | X            | –          | X                | –                    | CRF (wapiti)             | 16.47  |
| Baseline      | –   | X            | X          | –                | –                    | CRFsuite                 | 31.97  |
An overview of the approaches used
A simple, general but effective neural network architecture
Use word2vec to generate good feature representations for words (=unsupervised learning)
Feed those word representations to another neural network (NN) for any classification task (=supervised learning)
Pipeline: Example → Feature representation → Machine learning → Label(s)
Learn word2vec word representations once in advance
Train a new NN for any task
Word2vec: automatically learning good features
2D projection of a 400-dimensional space of the top 1,000 words used on Twitter. The model was trained on 400 million tweets containing 5 billion words. (Figure omitted.)
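The clustering behaviour such a projection reveals can be sketched with cosine similarity; the three-dimensional vectors below are invented for illustration (real embeddings are 400-dimensional and learned from the tweets):

```python
import math

# Toy sketch: words that behave alike end up close in embedding space.
# These vectors are made up; real word2vec vectors are 400-dimensional.
emb = {
    "monday":  [0.9, 0.1, 0.0],
    "tuesday": [0.8, 0.2, 0.1],
    "pizza":   [0.0, 0.9, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Days of the week are mutually closer than they are to unrelated words.
```

This closeness is what makes the embeddings useful as features: a classifier that has seen "monday" can generalize to "tuesday".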
A simple, general but effective neural network architecture (1)
Diagram: the embeddings of W(t-1), W(t) and W(t+1) are looked up (N-dim each) and concatenated into a 3N-dim vector (window = 3); a feed-forward neural network maps this vector to Tag(W(t)).
Pipeline: Example → Feature representation → Machine learning → Label(s)
A simple, general but effective neural network architecture (2)
Diagram: for the input "from Beijing to", the three word embeddings (N-dim each) are looked up and concatenated into a 3N-dim vector (window = 3); the feed-forward neural network tags the centre word "Beijing" as Location.
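The lookup-concatenate-classify step can be sketched in plain Python; dimensions are shrunk (N = 4 instead of 400) and the weights are random stand-ins, so this only illustrates the shapes, not the trained model:

```python
import random

# Sketch of the architecture: look up the window's embeddings,
# concatenate them, and run one hidden ReLU layer. Weights are random
# placeholders, not the trained parameters.
random.seed(0)
N, HIDDEN, TAGS = 4, 5, 3   # shrunk dimensions for illustration

emb = {w: [random.uniform(-1, 1) for _ in range(N)]
       for w in ["from", "Beijing", "to"]}

def window_vector(left, center, right):
    # Concatenate the three N-dim embeddings into one 3N-dim input.
    return emb[left] + emb[center] + emb[right]

def ffnn(x, w1, w2):
    # One hidden layer with ReLU, then a linear output layer.
    hidden = [max(0.0, sum(xi * wij for xi, wij in zip(x, row)))
              for row in w1]
    return [sum(hi * wij for hi, wij in zip(hidden, row)) for row in w2]

w1 = [[random.uniform(-1, 1) for _ in range(3 * N)] for _ in range(HIDDEN)]
w2 = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(TAGS)]

scores = ffnn(window_vector("from", "Beijing", "to"), w1, w2)
# `scores` holds one value per tag for the centre word "Beijing".
```

In the real system the argmax over the tag scores would yield "Location" for "Beijing".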
Postprocessing (1)
Diagram: feature representation (FR) and machine learning (ML) map W(1), W(2), W(3) to Label(1), Label(2), Label(3); a postprocessing step then corrects these labels.
Correct for inconsistencies: an NE starting with an I-tag; multi-word expressions having different categories
Postprocessing (2)
Example: the machine learning step tags "Manchester United is" as B-Loc, I-sportsteam, O; postprocessing corrects this to B-sportsteam, I-sportsteam, O.
Correct for inconsistencies: an NE starting with an I-tag; multi-word expressions having different categories
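The two correction rules can be sketched as follows; the tie-breaking choice of propagating the I-tag's category backwards is an assumption consistent with the slide's example, not necessarily the authors' exact rule:

```python
# Sketch of the two postprocessing rules for BIO-style tags such as
# "B-sportsteam" / "I-sportsteam" / "O". The backwards-propagation
# tie-break is an assumption matching the "Manchester United" example.
def postprocess(tags):
    tags = list(tags)
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            category = tag[2:]
            if i == 0 or tags[i - 1] == "O":
                # Rule 1: an NE may not start with an I-tag.
                tags[i] = "B-" + category
            elif tags[i - 1][2:] != category:
                # Rule 2: give a multi-word NE one consistent category
                # by propagating this I-tag's category backwards.
                j = i - 1
                while j >= 0 and tags[j] != "O":
                    tags[j] = tags[j][:2] + category
                    j -= 1
    return tags
```

On the slide's example, `postprocess(["B-Loc", "I-sportsteam", "O"])` yields `["B-sportsteam", "I-sportsteam", "O"]`.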
Experimental setup
Feature Learning
word2vec skip-gram with negative sampling
400 million raw English tweets (limited preprocessing)
Neural Network
One hidden layer with 500 hidden units
Word embeddings of size 400, vocabulary of 3 million words
Mini-batch SGD and dropout
Experiments with tanh and ReLU
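Of the listed training ingredients, dropout is easy to sketch; this is the generic inverted-dropout formulation, not the authors' implementation:

```python
import random

# Sketch of inverted dropout: during training, each hidden unit is
# zeroed with probability p, and survivors are rescaled by 1/(1-p) so
# the expected activation is unchanged at test time.
def dropout(activations, p, rng=random.random):
    if p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng() < keep else 0.0 for a in activations]
```

At test time the layer is simply left untouched (equivalent to `p = 0`), which is why the rescaling happens during training.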
Word2vec results
Examples shown (figure omitted): slang and spelling variants; wrong capitalization; words sometimes not in a gazetteer
Normalizing slang words/spelling
Dealing with capitalization + gazetteer functionality
Results
| Team          | POS | Orthographic | Gazetteers | Brown clustering | Word embedding       | ML                       | F1 (%) |
|---------------|-----|--------------|------------|------------------|----------------------|--------------------------|--------|
| ousia         | X   | X            | X          | –                | GloVe                | entity linking using SVM | 56.41  |
| NLANGP        | –   | X            | X          | X                | word2vec & GloVe     | CRF++                    | 51.40  |
| nrc           | –   | –            | X          | X                | word2vec             | semi-Markov MIRA         | 44.74  |
| multimedialab | –   | –            | –          | –                | word2vec             | FFNN                     | 43.75  |
| USFD          | X   | X            | X          | X                | –                    | CRF (L-BFGS)             | 42.46  |
| iitp          | X   | X            | X          | –                | –                    | CRF++                    | 39.84  |
| Hallym        | X   | –            | –          | X                | correlation analysis | CRFsuite                 | 37.21  |
| lattice       | X   | X            | –          | X                | –                    | CRF (wapiti)             | 16.47  |
| Baseline      | –   | X            | X          | –                | –                    | CRFsuite                 | 31.97  |
Lessons learned
Feature learning: a word2vec window of 1 worked best, yielding more syntax-oriented embeddings
Neural networks: multiple layers did not improve the F1-score; dropout and ReLU worked best
Postprocessing: multi-word expressions often have different categories
Conclusion
End-to-end semi-supervised neural network architecture
No feature engineering needed
Reusable architecture
Beats traditional systems that only use hand-crafted features
#Questions?
The word2vec Twitter model is available at: http://www.fredericgodin.com/software/
@frederic_godin