Improved Relation Extraction with Feature-Rich Compositional
Embedding Models
EMNLP · September 21, 2015
Mo Yu*, Matt Gormley*, Mark Dredze
*Co-first authors
FCM or: How I Learned to Stop Worrying (about Deep Learning)
and Love Features
Handcrafted Features
[Figure: the sentence "Egypt-born Proyas directed ...", annotated with POS tags (NNP : VBN NNP VBD), entity types (PER, LOC), a constituency parse (S → NP VP; NP, ADJP, VP), and lemmas (egypt, born, proyas, direct). Hand-crafted features f(x) are extracted from these annotations.]
p(y|x) ∝ exp(Θ_y · f(x)), predicting the relation born-in
Where do features come from?
[Figure: a spectrum running from Feature Engineering to Feature Learning; hand-crafted features (Zhou et al., 2005; Sun et al., 2011) sit at the Feature Engineering end.]
Hand-crafted features:
– First word before M1
– Second word before M1
– Bag-of-words in M1
– Head word of M1
– Other word in between
– First word after M2
– Second word after M2
– Bag-of-words in M2
– Head word of M2
– Bigrams in between
– Words on dependency path
– Country name list
– Personal relative triggers
– Personal title list
– WordNet tags
– Heads of chunks in between
– Path of phrase labels
– Combination of entity types
Where do features come from?
[Figure: the spectrum again, now with word embeddings (Mikolov et al., 2013) on the Feature Learning side.]
CBOW model in Mikolov et al. (2013): a look-up table maps the input (context words) to embeddings, and a classifier predicts the missing word. Learning is unsupervised, and similar words get similar embeddings (e.g., dog: 0.13 .26 … -.52; cat: 0.11 .23 … -.45).
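To make the CBOW idea concrete, here is a minimal numpy sketch on a toy corpus (my illustration, not the word2vec implementation): the hidden layer is the average of the context embeddings, and a softmax classifier predicts the center word.

```python
import numpy as np

corpus = "the dog barked at the cat and the cat ran from the dog".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 8, 2, 0.1

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, D))  # look-up table of input embeddings
W = rng.normal(scale=0.1, size=(D, V))  # output classifier weights

for epoch in range(100):
    for t, center in enumerate(corpus):
        ctx = [w2i[corpus[j]]
               for j in range(max(0, t - window), min(len(corpus), t + window + 1))
               if j != t]
        h = E[ctx].mean(axis=0)                          # average context embeddings
        scores = h @ W
        p = np.exp(scores - scores.max()); p /= p.sum()  # softmax over vocabulary
        p[w2i[center]] -= 1.0                            # cross-entropy gradient w.r.t. scores
        W -= lr * np.outer(h, p)                         # update classifier
        E[ctx] -= lr * (W @ p) / len(ctx)                # update context embeddings
```

After training, the embedding rows for words that appear in similar contexts (here, dog and cat) drift toward each other.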
Where do features come from?
[Figure: the spectrum again, adding string embeddings (Collobert & Weston, 2008; Socher, 2011).]
[Figure: two string-embedding architectures over "The [movie] showed [wars]": a Convolutional Neural Network with pooling (Collobert and Weston, 2008) and a Recursive Auto Encoder (Socher, 2011).]
Where do features come from?
[Figure: the spectrum again, adding tree embeddings (Socher et al., 2013; Hermann & Blunsom, 2013).]
[Figure: a tree embedding of "The [movie] showed [wars]": composition matrices indexed by category pairs (W_{DT,NN}, W_{V,NN}, W_{NP,VP}) combine child vectors up the parse (S → NP VP).]
Where do features come from?
[Figure: the full spectrum from Feature Engineering to Feature Learning: hand-crafted features (Zhou et al., 2005; Sun et al., 2011), word embedding features (Koo et al., 2008; Turian et al., 2010; Hermann et al., 2014), string embeddings (Collobert & Weston, 2008; Socher, 2011), tree embeddings (Socher et al., 2013; Hermann & Blunsom, 2013), and word embeddings (Mikolov et al., 2013).]
Where do features come from?
[Figure: the same spectrum, with our model (FCM) placed in the middle, bridging hand-crafted features and the embedding methods.]
Feature-rich Compositional Embedding Model (FCM)
Goals for our Model:
1. Incorporate semantic/syntactic structural information
2. Incorporate word meaning
3. Bridge the gap between feature engineering and feature learning – but remain as simple as possible
Feature-rich Compositional Embedding Model (FCM)
The [movie]M1 I watched depicted [hope]M2
(word classes: nil, noun-other, noun-person, verb-percep., verb-comm., noun-other)

Per-word Features:
                  The  movie    I  watched  depicted  hope
                  (f1)  (f2)  (f3)   (f4)     (f5)    (f6)
on-path(wi)         0     1     0      0        1       1
is-between(wi)      0     0     1      1        1       0
head-of-M1(wi)      0     1     0      0        0       0
head-of-M2(wi)      0     0     0      0        0       1
before-M1(wi)       1     0     0      0        0       0
before-M2(wi)       0     0     0      0        1       0
…                   …     …     …      …        …       …
Feature-rich Compositional Embedding Model (FCM)
The [movie]M1 I watched depicted [hope]M2
Per-word Features, zooming in on f5 (for w5 = "depicted"):
on-path(wi) = 1
is-between(wi) = 1
head-of-M1(wi) = 0
head-of-M2(wi) = 0
before-M1(wi) = 0
before-M2(wi) = 1
…
Feature-rich Compositional Embedding Model (FCM)
The [movie]M1 I watched depicted [hope]M2
Per-word Features (with conjunction), again for f5:
on-path(wi) & wi = "depicted" = 1
is-between(wi) & wi = "depicted" = 1
head-of-M1(wi) & wi = "depicted" = 0
head-of-M2(wi) & wi = "depicted" = 0
before-M1(wi) & wi = "depicted" = 0
before-M2(wi) & wi = "depicted" = 1
…
Feature-rich Compositional Embedding Model (FCM)
The [movie]M1 I watched depicted [hope]M2
Per-word Features (with soft conjunction): instead of conjoining each binary feature with the word identity, take the outer product of the feature vector
f5 = [on-path, is-between, head-of-M1, head-of-M2, before-M1, before-M2, …] = [1, 1, 0, 0, 0, 1, …]
with the word embedding e_depicted = [-.3, .9, .1, -1].
Feature-rich Compositional Embedding Model (FCM)
The [movie]M1 I watched depicted [hope]M2
Per-word Features (with soft conjunction), the result of the outer product f5 ⊗ e_depicted:
on-path(wi)       [ -.3  .9  .1  -1 ]
is-between(wi)    [ -.3  .9  .1  -1 ]
head-of-M1(wi)    [  0    0   0   0 ]
head-of-M2(wi)    [  0    0   0   0 ]
before-M1(wi)     [  0    0   0   0 ]
before-M2(wi)     [ -.3  .9  .1  -1 ]
…                 [  …    …   …   … ]
Each active feature selects a copy of e_depicted; inactive features give zero rows.
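As a quick sketch (my illustration, using the toy values above), the soft conjunction is just an outer product:

```python
import numpy as np

# Binary features for w5 = "depicted": on-path, is-between, head-of-M1,
# head-of-M2, before-M1, before-M2
f5 = np.array([1, 1, 0, 0, 0, 1])
e_depicted = np.array([-0.3, 0.9, 0.1, -1.0])  # toy 4-dimensional embedding

h5 = np.outer(f5, e_depicted)  # shape (num_features, embedding_dim)
# Rows for active features equal e_depicted; rows for inactive features are zero.
```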
Feature-rich Compositional Embedding Model (FCM)
p(y|x) ∝ exp( Σ_{i=1}^{n} T_y · (f_i ⊗ e_{w_i}) )
Our full model sums the per-word outer products f_i ⊗ e_{w_i} over each word in the sentence, takes the dot-product of that sum with a parameter tensor T_y, and finally exponentiates and renormalizes.
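A minimal numpy sketch of this scoring function (my illustration with toy dimensions, not the released implementation):

```python
import numpy as np

num_labels, num_feats, dim, n = 3, 6, 4, 6  # toy sizes
rng = np.random.default_rng(0)
T = rng.normal(size=(num_labels, num_feats, dim))          # parameter tensor, slice T[y] per label
F = rng.integers(0, 2, size=(n, num_feats)).astype(float)  # binary features f_1..f_n
E = rng.normal(size=(n, dim))                              # embeddings e_{w_1}..e_{w_n}

S = np.einsum('if,id->fd', F, E)       # sum over words of the outer products f_i (x) e_{w_i}
scores = np.einsum('yfd,fd->y', T, S)  # dot-product with T_y for each label y
p = np.exp(scores - scores.max())
p /= p.sum()                           # exponentiate and renormalize
```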
Features for FCM
• Let M1 and M2 denote the left and right entity mentions
• Our per-word binary features (see the sketch after this list):
– head of M1
– head of M2
– in between M1 and M2
– at position -2, -1, +1, or +2 relative to M1
– at position -2, -1, +1, or +2 relative to M2
– on the dependency path between M1 and M2
• Optionally: add the entity type of M1, M2, or both
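A hypothetical sketch of such a per-word feature function (the names and the exact position convention are my own, for illustration):

```python
def word_features(i, m1_head, m2_head, between, dep_path):
    """Binary features that fire for the token at index i."""
    feats = set()
    if i == m1_head:
        feats.add("head-of-M1")
    if i == m2_head:
        feats.add("head-of-M2")
    if i in between:
        feats.add("is-between")
    for k in (-2, -1, 1, 2):
        if i == m1_head + k:
            feats.add(f"M1{k:+d}")  # e.g. "M1-2", "M1+1"
        if i == m2_head + k:
            feats.add(f"M2{k:+d}")
    if i in dep_path:
        feats.add("on-path")
    return feats
```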
FCM as a Neural Network
[Figure: FCM drawn as a neural network. Words w1 … wn pass through an embedding look-up; each embedding e_{w_i} combines with its binary feature vector f_i to form a per-word layer h_i; the h_i are summed (Σ), multiplied by the parameter tensor T, and normalized to give p(y|x).]
• Embeddings are (optionally) treated as model parameters
• A log-bilinear model
• We initialize, then fine-tune the embeddings
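For fine-tuning, the log-bilinear form gives a simple embedding gradient. Below is a sketch of the derivative implied by the scoring function above (my derivation, not from the paper; f_i^T T_y denotes contraction over the feature index):

```latex
% Gradient of the log-likelihood with respect to one word's embedding
\frac{\partial \log p(y \mid x)}{\partial e_{w_i}}
  = f_i^{\top} T_y \;-\; \sum_{y'} p(y' \mid x)\, f_i^{\top} T_{y'}
```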
Baseline Model
[Figure: the annotated sentence "Egypt-born Proyas directed ..." again (POS tags, PER/LOC entity types, parse, lemmas), with a factor connecting the relation variable Y_{i,j} to the extracted features.]
p(y|x) ∝ exp(Θ_y · f(x)), predicting born-in
• Multinomial logistic regression (the standard approach)
• Bring in all the usual binary NLP features (Sun et al., 2011):
– type of the left entity mention
– dependency path between mentions
– bag of words in right mention
– …
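A generic sketch of such a baseline (my illustration using scikit-learn; the feature names are invented, and the actual feature set follows Sun et al., 2011):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each instance is a bag of binary features extracted from the annotated sentence.
train_feats = [
    {"left-type=PER": 1, "dep-path=amod": 1, "bow-right=Proyas": 1},
    {"left-type=ORG": 1, "dep-path=nsubj": 1, "bow-right=company": 1},
]
train_labels = ["born-in", "employee-of"]

vec = DictVectorizer()
X = vec.fit_transform(train_feats)
clf = LogisticRegression()  # multinomial logistic regression over relation labels
clf.fit(X, train_labels)
```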
Hybrid Model: Baseline + FCM
[Figure: the FCM network (left) and the baseline log-linear model (right), each producing a distribution over the relation variable Y_{i,j} for the same annotated sentence.]
Product of Experts:
p(y|x) = (1/Z(x)) · p_Baseline(y|x) · p_FCM(y|x)
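In code, the combination is a pointwise product followed by renormalization (a minimal sketch):

```python
import numpy as np

def product_of_experts(p_baseline, p_fcm):
    """Multiply the two experts' distributions over labels and renormalize."""
    p = np.asarray(p_baseline) * np.asarray(p_fcm)
    return p / p.sum()  # the 1/Z(x) term
```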
Experimental Setup
ACE 2005
• Data: 6 domains – Newswire (nw), Broadcast Conversation (bc), Broadcast News (bn), Telephone Speech (cts), Usenet Newsgroups (un), Weblogs (wl)
• Train: bn+nw (~3600 relations); Dev: ½ of bc; Test: ½ of bc, cts, wl
• Metric: Micro F1 (given entity mention)

SemEval-2010 Task 8
• Data: Web text
• Train / Dev / Test: standard split from the shared task
• Metric: Macro F1 (given entity boundaries)
ACE 2005 Results
[Figure: bar chart of Micro F1 (45%–65%) on the three test sets – Broadcast Conversation, Conversational Telephone Speech, and Weblogs – comparing Baseline, FCM, and Baseline+FCM.]
SemEval-2010 Results

Source                     Classifier                                   F1
Socher et al. (2012)       RNN                                          74.8
Socher et al. (2012)       MVRNN                                        79.1
Hashimoto et al. (2015)    RelEmb                                       81.8
Rink and Harabagiu (2010)  SVM (best in the SemEval-2010 shared task)   82.2
Xu et al. (2015)           SDP-LSTM                                     82.4
Zeng et al. (2014)         CNN                                          82.7
Santos et al. (2015)       CR-CNN (log-loss)                            82.7
Liu et al. (2015)          DepNN                                        82.8
Hashimoto et al. (2015)    RelEmb (task-spec-emb)                       82.8
Xu et al. (2015)           SDP-LSTM (full)                              83.7
Santos et al. (2015)       CR-CNN (ranking-loss)                        84.1
(this work)                FCM (log-linear)                             81.4
(this work)                FCM (log-bilinear)                           83.0
(this work)                FCM (log-bilinear, task-spec-emb)            83.7
Takeaways
FCM bridges the gap between feature engineering and feature learning
• If you are allergic to deep learning:
– Try the FCM for your task: it is simple, easy to implement, and was shown to be effective on two relation extraction benchmarks
• If you are a deep learning expert:
– Inject the FCM (i.e., the outer product of features and embeddings) into your fancy deep network
Questions?
Two open source implementations:
– Java (within the Pacaya framework): https://github.com/mgormley/pacaya
– C++ (from our NAACL 2015 paper on LRFCM): https://github.com/Gorov/ERE_RE