Improved Relation Extraction with Feature-Rich Compositional
Embedding Models
EMNLP · September 21, 2015
Mo Yu*, Matt Gormley*, Mark Dredze
*Co-first authors
FCM or: How I Learned to Stop Worrying (about Deep Learning)
and Love Features
Handcrafted Features
[Figure: the sentence "Egypt-born Proyas directed ...", annotated with POS tags (NNP : VBN NNP VBD), entity types (PER, LOC), a constituency parse (S → NP VP; NP, ADJP, VP), and lemmas (egypt, born, proyas, direct). Hand-crafted features f(x) are extracted from these annotations.]
p(y|x) ∝ exp(Θ_y · f(x)), predicting the relation born-in
Where do features come from?
[Figure: a spectrum running from Feature Engineering to Feature Learning; hand-crafted features (Zhou et al., 2005; Sun et al., 2011) sit at the Feature Engineering end.]
Hand-crafted features:
– First word before M1
– Second word before M1
– Bag-of-words in M1
– Head word of M1
– Other word in between
– First word after M2
– Second word after M2
– Bag-of-words in M2
– Head word of M2
– Bigrams in between
– Words on dependency path
– Country name list
– Personal relative triggers
– Personal title list
– WordNet tags
– Heads of chunks in between
– Path of phrase labels
– Combination of entity types
Where do features come from?
[Figure: the spectrum again, now with word embeddings (Mikolov et al., 2013) on the Feature Learning side.]
CBOW model in Mikolov et al. (2013): a look-up table maps the input (context words) to embeddings, and a classifier predicts the missing word. Learning is unsupervised, and similar words get similar embeddings (e.g., dog: 0.13 .26 … -.52; cat: 0.11 .23 … -.45).
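To make the CBOW idea concrete, here is a minimal numpy sketch on a toy corpus (my illustration, not the word2vec implementation): the hidden layer is the average of the context embeddings, and a softmax classifier predicts the center word.

```python
import numpy as np

corpus = "the dog barked at the cat and the cat ran from the dog".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 8, 2, 0.1

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, D))  # look-up table of input embeddings
W = rng.normal(scale=0.1, size=(D, V))  # output classifier weights

for epoch in range(100):
    for t, center in enumerate(corpus):
        ctx = [w2i[corpus[j]]
               for j in range(max(0, t - window), min(len(corpus), t + window + 1))
               if j != t]
        h = E[ctx].mean(axis=0)                          # average context embeddings
        scores = h @ W
        p = np.exp(scores - scores.max()); p /= p.sum()  # softmax over vocabulary
        p[w2i[center]] -= 1.0                            # cross-entropy gradient w.r.t. scores
        W -= lr * np.outer(h, p)                         # update classifier
        E[ctx] -= lr * (W @ p) / len(ctx)                # update context embeddings
```

After training, the embedding rows for words that appear in similar contexts (here, dog and cat) drift toward each other.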
Where do features come from?
[Figure: the spectrum again, adding string embeddings (Collobert & Weston, 2008; Socher, 2011).]
[Figure: two string-embedding architectures over "The [movie] showed [wars]": a Convolutional Neural Network with pooling (Collobert and Weston, 2008) and a Recursive Auto Encoder (Socher, 2011).]
Where do features come from?
[Figure: the spectrum again, adding tree embeddings (Socher et al., 2013; Hermann & Blunsom, 2013).]
[Figure: a tree embedding of "The [movie] showed [wars]": composition matrices indexed by category pairs (W_{DT,NN}, W_{V,NN}, W_{NP,VP}) combine child vectors up the parse (S → NP VP).]
Where do features come from?
[Figure: the full spectrum from Feature Engineering to Feature Learning: hand-crafted features (Zhou et al., 2005; Sun et al., 2011), word embedding features (Koo et al., 2008; Turian et al., 2010; Hermann et al., 2014), string embeddings (Collobert & Weston, 2008; Socher, 2011), tree embeddings (Socher et al., 2013; Hermann & Blunsom, 2013), and word embeddings (Mikolov et al., 2013).]
Where do features come from?
[Figure: the same spectrum, with our model (FCM) placed in the middle, bridging hand-crafted features and the embedding methods.]
Feature-rich Compositional Embedding Model (FCM)
Goals for our Model:
1. Incorporate semantic/syntactic structural information
2. Incorporate word meaning
3. Bridge the gap between feature engineering and feature learning – but remain as simple as possible
Feature-rich Compositional Embedding Model (FCM)
The [movie]M1 I watched depicted [hope]M2
(word classes: nil, noun-other, noun-person, verb-percep., verb-comm., noun-other)

Per-word Features:
                  The  movie    I  watched  depicted  hope
                  (f1)  (f2)  (f3)   (f4)     (f5)    (f6)
on-path(wi)         0     1     0      0        1       1
is-between(wi)      0     0     1      1        1       0
head-of-M1(wi)      0     1     0      0        0       0
head-of-M2(wi)      0     0     0      0        0       1
before-M1(wi)       1     0     0      0        0       0
before-M2(wi)       0     0     0      0        1       0
…                   …     …     …      …        …       …
Feature-rich Compositional Embedding Model (FCM)
The [movie]M1 I watched depicted [hope]M2
Per-word Features, zooming in on f5 (for w5 = "depicted"):
on-path(wi) = 1
is-between(wi) = 1
head-of-M1(wi) = 0
head-of-M2(wi) = 0
before-M1(wi) = 0
before-M2(wi) = 1
…
Feature-rich Compositional Embedding Model (FCM)
The [movie]M1 I watched depicted [hope]M2
Per-word Features (with conjunction), again for f5:
on-path(wi) & wi = "depicted" = 1
is-between(wi) & wi = "depicted" = 1
head-of-M1(wi) & wi = "depicted" = 0
head-of-M2(wi) & wi = "depicted" = 0
before-M1(wi) & wi = "depicted" = 0
before-M2(wi) & wi = "depicted" = 1
…
Feature-rich Compositional Embedding Model (FCM)
The [movie]M1 I watched depicted [hope]M2
Per-word Features (with soft conjunction): instead of conjoining each binary feature with the word identity, take the outer product of the feature vector
f5 = [on-path, is-between, head-of-M1, head-of-M2, before-M1, before-M2, …] = [1, 1, 0, 0, 0, 1, …]
with the word embedding e_depicted = [-.3, .9, .1, -1].
Feature-rich Compositional Embedding Model (FCM)
The [movie]M1 I watched depicted [hope]M2
Per-word Features (with soft conjunction), the result of the outer product f5 ⊗ e_depicted:
on-path(wi)       [ -.3  .9  .1  -1 ]
is-between(wi)    [ -.3  .9  .1  -1 ]
head-of-M1(wi)    [  0    0   0   0 ]
head-of-M2(wi)    [  0    0   0   0 ]
before-M1(wi)     [  0    0   0   0 ]
before-M2(wi)     [ -.3  .9  .1  -1 ]
…                 [  …    …   …   … ]
Each active feature selects a copy of e_depicted; inactive features give zero rows.
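As a quick sketch (my illustration, using the toy values above), the soft conjunction is just an outer product:

```python
import numpy as np

# Binary features for w5 = "depicted": on-path, is-between, head-of-M1,
# head-of-M2, before-M1, before-M2
f5 = np.array([1, 1, 0, 0, 0, 1])
e_depicted = np.array([-0.3, 0.9, 0.1, -1.0])  # toy 4-dimensional embedding

h5 = np.outer(f5, e_depicted)  # shape (num_features, embedding_dim)
# Rows for active features equal e_depicted; rows for inactive features are zero.
```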
Feature-rich Compositional Embedding Model (FCM)
p(y|x) ∝ exp( Σ_{i=1}^{n} T_y · (f_i ⊗ e_{w_i}) )
Our full model sums the per-word outer products f_i ⊗ e_{w_i} over each word in the sentence, takes the dot-product of that sum with a parameter tensor T_y, and finally exponentiates and renormalizes.
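A minimal numpy sketch of this scoring function (my illustration with toy dimensions, not the released implementation):

```python
import numpy as np

num_labels, num_feats, dim, n = 3, 6, 4, 6  # toy sizes
rng = np.random.default_rng(0)
T = rng.normal(size=(num_labels, num_feats, dim))          # parameter tensor, slice T[y] per label
F = rng.integers(0, 2, size=(n, num_feats)).astype(float)  # binary features f_1..f_n
E = rng.normal(size=(n, dim))                              # embeddings e_{w_1}..e_{w_n}

S = np.einsum('if,id->fd', F, E)       # sum over words of the outer products f_i (x) e_{w_i}
scores = np.einsum('yfd,fd->y', T, S)  # dot-product with T_y for each label y
p = np.exp(scores - scores.max())
p /= p.sum()                           # exponentiate and renormalize
```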
Features for FCM
• Let M1 and M2 denote the left and right entity mentions
• Our per-word binary features (see the sketch after this list):
– head of M1
– head of M2
– in between M1 and M2
– at position -2, -1, +1, or +2 relative to M1
– at position -2, -1, +1, or +2 relative to M2
– on the dependency path between M1 and M2
• Optionally: add the entity type of M1, M2, or both
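A hypothetical sketch of such a per-word feature function (the names and the exact position convention are my own, for illustration):

```python
def word_features(i, m1_head, m2_head, between, dep_path):
    """Binary features that fire for the token at index i."""
    feats = set()
    if i == m1_head:
        feats.add("head-of-M1")
    if i == m2_head:
        feats.add("head-of-M2")
    if i in between:
        feats.add("is-between")
    for k in (-2, -1, 1, 2):
        if i == m1_head + k:
            feats.add(f"M1{k:+d}")  # e.g. "M1-2", "M1+1"
        if i == m2_head + k:
            feats.add(f"M2{k:+d}")
    if i in dep_path:
        feats.add("on-path")
    return feats
```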
FCM as a Neural Network
[Figure: FCM drawn as a neural network. Words w1 … wn pass through an embedding look-up; each embedding e_{w_i} combines with its binary feature vector f_i to form a per-word layer h_i; the h_i are summed (Σ), multiplied by the parameter tensor T, and normalized to give p(y|x).]
• Embeddings are (optionally) treated as model parameters
• A log-bilinear model
• We initialize, then fine-tune the embeddings
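For fine-tuning, the log-bilinear form gives a simple embedding gradient. Below is a sketch of the derivative implied by the scoring function above (my derivation, not from the paper; f_i^T T_y denotes contraction over the feature index):

```latex
% Gradient of the log-likelihood with respect to one word's embedding
\frac{\partial \log p(y \mid x)}{\partial e_{w_i}}
  = f_i^{\top} T_y \;-\; \sum_{y'} p(y' \mid x)\, f_i^{\top} T_{y'}
```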
Baseline Model
[Figure: the annotated sentence "Egypt-born Proyas directed ..." again (POS tags, PER/LOC entity types, parse, lemmas), with a factor connecting the relation variable Y_{i,j} to the extracted features.]
p(y|x) ∝ exp(Θ_y · f(x)), predicting born-in
• Multinomial logistic regression (the standard approach)
• Bring in all the usual binary NLP features (Sun et al., 2011):
– type of the left entity mention
– dependency path between mentions
– bag of words in right mention
– …
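A generic sketch of such a baseline (my illustration using scikit-learn; the feature names are invented, and the actual feature set follows Sun et al., 2011):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each instance is a bag of binary features extracted from the annotated sentence.
train_feats = [
    {"left-type=PER": 1, "dep-path=amod": 1, "bow-right=Proyas": 1},
    {"left-type=ORG": 1, "dep-path=nsubj": 1, "bow-right=company": 1},
]
train_labels = ["born-in", "employee-of"]

vec = DictVectorizer()
X = vec.fit_transform(train_feats)
clf = LogisticRegression()  # multinomial logistic regression over relation labels
clf.fit(X, train_labels)
```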
Hybrid Model: Baseline + FCM
[Figure: the FCM network (left) and the baseline log-linear model (right), each producing a distribution over the relation variable Y_{i,j} for the same annotated sentence.]
Product of Experts:
p(y|x) = (1/Z(x)) · p_Baseline(y|x) · p_FCM(y|x)
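In code, the combination is a pointwise product followed by renormalization (a minimal sketch):

```python
import numpy as np

def product_of_experts(p_baseline, p_fcm):
    """Multiply the two experts' distributions over labels and renormalize."""
    p = np.asarray(p_baseline) * np.asarray(p_fcm)
    return p / p.sum()  # the 1/Z(x) term
```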
Experimental Setup
ACE 2005
• Data: 6 domains – Newswire (nw), Broadcast Conversation (bc), Broadcast News (bn), Telephone Speech (cts), Usenet Newsgroups (un), Weblogs (wl)
• Train: bn+nw (~3600 relations); Dev: ½ of bc; Test: ½ of bc, cts, wl
• Metric: Micro F1 (given entity mention)

SemEval-2010 Task 8
• Data: Web text
• Train / Dev / Test: standard split from the shared task
• Metric: Macro F1 (given entity boundaries)
ACE 2005 Results
[Figure: bar chart of Micro F1 (45%–65%) on the three test sets – Broadcast Conversation, Conversational Telephone Speech, and Weblogs – comparing Baseline, FCM, and Baseline+FCM.]
SemEval-2010 Results

Source                     Classifier                                   F1
Socher et al. (2012)       RNN                                          74.8
Socher et al. (2012)       MVRNN                                        79.1
Hashimoto et al. (2015)    RelEmb                                       81.8
Rink and Harabagiu (2010)  SVM (best in the SemEval-2010 shared task)   82.2
Xu et al. (2015)           SDP-LSTM                                     82.4
Zeng et al. (2014)         CNN                                          82.7
Santos et al. (2015)       CR-CNN (log-loss)                            82.7
Liu et al. (2015)          DepNN                                        82.8
Hashimoto et al. (2015)    RelEmb (task-spec-emb)                       82.8
Xu et al. (2015)           SDP-LSTM (full)                              83.7
Santos et al. (2015)       CR-CNN (ranking-loss)                        84.1
(this work)                FCM (log-linear)                             81.4
(this work)                FCM (log-bilinear)                           83.0
(this work)                FCM (log-bilinear, task-spec-emb)            83.7
Takeaways
FCM bridges the gap between feature engineering and feature learning
• If you are allergic to deep learning:
– Try the FCM for your task: it is simple, easy to implement, and was shown to be effective on two relation extraction benchmarks
• If you are a deep learning expert:
– Inject the FCM (i.e., the outer product of features and embeddings) into your fancy deep network
Questions?
Two open source implementations:
– Java (within the Pacaya framework): https://github.com/mgormley/pacaya
– C++ (from our NAACL 2015 paper on LRFCM): https://github.com/Gorov/ERE_RE