Modeling Missing Data in Distant Supervision for Information Extraction (Ritter+, TACL 2013)

Modeling Missing Data in Distant Supervision for Information Extraction Alan Ritter (CMU) Luke Zettlemoyer (University of Washington) Mausam (University of Washington) Oren Etzioni (Vulcan Inc.) TACL, 1, 367-378, 2013. Presented by Naoaki Okazaki (Tohoku University) 2014-09-05 Modeling Missing Data in Distant Supervision 1

Upload: naoaki-okazaki

Post on 28-Nov-2014


DESCRIPTION

Presentation slides for a session of the Cutting-Edge NLP Study Group (最先端NLP勉匷䌚) http://www.cl.ecei.tohoku.ac.jp/~y-matsu/snlp6/

TRANSCRIPT

Page 1: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

Modeling Missing Data in Distant Supervision for Information Extraction

Alan Ritter (CMU)
Luke Zettlemoyer (University of Washington)
Mausam (University of Washington)
Oren Etzioni (Vulcan Inc.)

TACL, 1, 367-378, 2013.

Presented by Naoaki Okazaki (Tohoku University)

2014-09-05 Modeling Missing Data in Distant Supervision 1

Page 2: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

Relation instance extraction

Steven Spielberg’s film Saving Private Ryan is loosely based on the brothers’ story.

Extractor output (film-director relation):

Film: Saving Private Ryan
Director: Steven Spielberg

• Fully-supervised learning (Zhou+ 05, ...)
    • Uses ACE corpora to build relation-instance classifiers
    • Suffers from the limited amount of training data
• Unsupervised information extraction (Banko+ 07, ...)
    • Extracts relational patterns between entities, and clusters the patterns into relations
    • Difficult to map clusters into relations of interest
• Bootstrap learning (Brin 98, ...)
    • Uses seed instances to extract a new set of relational patterns
    • Often suffers from low precision (semantic drift)
• Distant supervision (Mintz+ 09, ...)
    • Combines the advantages of the above approaches


Page 3: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

Distant supervision (Mintz+, 09)

Knowledge base:

Person: Edwin Hubble
Birthplace: Marshfield



Automatic annotation:

Astronomer Edwin Hubble was born in Marshfield, Missouri.

Feature extraction*

* Each row presents a single feature. Features from different sentences containing the same entity pair are concatenated.

Mintz et al. (2009) Distant supervision for relation extraction without labeled data. ACL-2009, pages 1003–1011.

Problem: an entity pair cannot be assigned multiple relations, even though, e.g., Founded(Jobs, Apple) and CEO-of(Jobs, Apple) are both true.
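The automatic annotation step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: entity mentions are matched by plain substring search, the toy knowledge base `KB` and the second sentence are made up for the example, and `annotate` is a hypothetical helper, not code from Mintz et al. Storing one label per entity pair mirrors the limitation noted on this slide.

```python
# Hypothetical toy knowledge base: (entity1, entity2) -> a single relation label.
KB = {("Edwin Hubble", "Marshfield"): "born-in"}

def annotate(sentences, kb):
    """Label every sentence that mentions both entities of a KB pair."""
    labeled = []
    for sent in sentences:
        for (e1, e2), rel in kb.items():
            # Naive substring matching stands in for real entity linking.
            if e1 in sent and e2 in sent:
                labeled.append((sent, e1, e2, rel))
    return labeled

sentences = [
    "Astronomer Edwin Hubble was born in Marshfield, Missouri.",
    "Marshfield is a city in Missouri.",  # made-up sentence without the pair
]
labeled = annotate(sentences, KB)
for sent, e1, e2, rel in labeled:
    print(f"{rel}({e1}, {e2}) <- {sent}")
```

Only the first sentence is labeled, because it is the only one that contains both entities of the knowledge-base pair.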


Page 4: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

MultiR (Hoffmann+, 11)

Introduces latent variables ($z_i$) to indicate the relation expressed by sentence $x_i$.

For entity pair (Steve Jobs, Apple):

$\boldsymbol{y}$: $y_{\text{born-in}} = 0$, $y_{\text{founder}} = 1$, $y_{\text{CEO-of}} = 1$, $y_{\text{capital-of}} = 0$
$\boldsymbol{z}$: $z_1$ = Founder, $z_2$ = Founder, $z_3$ = CEO-of
$\boldsymbol{x}$:
$x_1$: Steve Jobs was founder of Apple.
$x_2$: Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple.
$x_3$: Steve Jobs is CEO of Apple.

$$p(\boldsymbol{y}, \boldsymbol{z} \mid \boldsymbol{x}) = \frac{1}{Z_x} \prod_r \Phi^{\text{join}}(y_r, \boldsymbol{z}) \prod_i \Phi^{\text{extract}}(z_i, x_i)$$

• $x_i$: a sentence containing the entity pair
• $y_r \in \{0,1\}$: 1 if the knowledge base includes the pair with relation $r$, 0 otherwise
• $z_i \in R$: the relation expressed by sentence $x_i$

$$\Phi^{\text{extract}}(z_i, x_i) = \exp\Big(\sum_j \theta_j \phi_j(z_i, x_i)\Big)$$

The same as (Mintz+ 09).

$$\Phi^{\text{join}}(y_r, \boldsymbol{z}) = \mathbb{1}(\neg y_r \lor \exists i: r = z_i)$$

(Deterministic OR)

$\Phi^{\text{join}}$ ensures that a sentence $x_i$ expressing the relation $r$ exists if $r$ is true.

Allows multiple relations for the same entity pair.
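A minimal Python sketch of the two factors may help to read the formula. It assumes the join factor is the indicator "not y_r, or some sentence is labeled r", and it takes the per-sentence log scores (the sums of theta_j * phi_j) as given numbers. The names `phi_join`, `unnormalized_p` and the all-zero `log_scores` are hypothetical, and the normalizer Z_x is omitted.

```python
import math

RELATIONS = ["born-in", "founder", "CEO-of", "capital-of"]

def phi_join(y_r, r, z):
    """Deterministic OR factor: 0 only if y_r is true and no sentence expresses r."""
    return 1 if (not y_r) or (r in z) else 0

def unnormalized_p(y, z, log_scores):
    """Product of join and extract factors, without the normalizer Z_x."""
    join = 1
    for r in RELATIONS:
        join *= phi_join(y[r], r, z)
    # Extract factors: exp of the summed linear scores of the chosen labels.
    extract = math.exp(sum(log_scores[i][z_i] for i, z_i in enumerate(z)))
    return join * extract

# KB facts for (Steve Jobs, Apple) as on the slide.
y = {"born-in": 0, "founder": 1, "CEO-of": 1, "capital-of": 0}
# Placeholder log scores (all zero) so that only the join factors matter here.
log_scores = [{r: 0.0 for r in RELATIONS} for _ in range(3)]

ok = unnormalized_p(y, ["founder", "founder", "CEO-of"], log_scores)
bad = unnormalized_p(y, ["founder", "founder", "founder"], log_scores)
print(ok, bad)
```

The second assignment gets probability zero because the KB fact CEO-of is true but no sentence is labeled with it.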


Page 5: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

MultiR: Training

Hoffmann et al. (2011) Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations. ACL-2011, pages 541–550.

• Loop for passes over the training data
    • Loop for entity pairs in the KB
        • Predict sentence-level and KB-level relations (ignoring the facts in the KB)
        • Find an optimal assignment of sentence-level relations consistent with the facts in KB
        • Update feature weights similarly to the perceptron algorithm

The prediction step and the constrained assignment step require two kinds of inference.
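The training loop can be sketched as follows. This is a toy illustration, not Hoffmann et al.'s implementation: sentences are represented by made-up feature strings, the constrained inference is done by brute-force enumeration, a `NONE` label is assumed for sentences expressing no relation, and ties are broken by list order. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

NONE = "NONE"
RELATIONS = [NONE, "founder", "CEO-of"]

def score(theta, feats, r):
    # Linear score of labeling a sentence (a list of feature strings) with r.
    return sum(theta[(f, r)] for f in feats)

def predict(theta, sents):
    # Inference 1: label each sentence independently, then aggregate.
    z = [max(RELATIONS, key=lambda r: score(theta, s, r)) for s in sents]
    return z, set(z) - {NONE}

def constrained(theta, sents, facts):
    # Inference 2 by brute force: best z whose aggregated facts equal the KB's.
    best, best_score = None, float("-inf")
    for z in product(RELATIONS, repeat=len(sents)):
        if set(z) - {NONE} != facts:
            continue
        total = sum(score(theta, s, r) for s, r in zip(sents, z))
        if total > best_score:
            best, best_score = z, total
    return best

def train(data, epochs=5):
    theta = defaultdict(float)
    for _ in range(epochs):                # passes over the training data
        for sents, facts in data:          # entity pairs in the KB
            z_pred, y_pred = predict(theta, sents)
            if y_pred == facts:
                continue                   # prediction agrees with the KB
            z_gold = constrained(theta, sents, facts)
            if z_gold is None:             # no consistent assignment exists
                continue
            for s, zp, zg in zip(sents, z_pred, z_gold):
                for f in s:                # perceptron-style update
                    theta[(f, zg)] += 1.0
                    theta[(f, zp)] -= 1.0
    return theta

# One entity pair with two sentences (toy features) and two KB facts.
data = [([["was_founder_of"], ["is_CEO_of"]], {"founder", "CEO-of"})]
theta = train(data)
z, y = predict(theta, data[0][0])
print(z, y)
```

After one update the unconstrained prediction already agrees with the KB, so later passes leave the weights unchanged.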


Page 6: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

MultiR: Inference 1: $\arg\max_{\boldsymbol{y},\boldsymbol{z}} p(\boldsymbol{y}, \boldsymbol{z} \mid \boldsymbol{x})$

For entity pair (Steve Jobs, Apple), both $\boldsymbol{y}$ ($y_{\text{born-in}}$, $y_{\text{founder}}$, $y_{\text{CEO-of}}$, $y_{\text{capital-of}}$) and $\boldsymbol{z}$ ($z_1$, $z_2$, $z_3$) are unknown (?).

$x_1$: Steve Jobs was founder of Apple.
$x_2$: Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple.
$x_3$: Steve Jobs is CEO of Apple.

Sentence-level scores:

         born-in   founder   CEO-of   capital-of
$x_1$    0.5       16.0      9.0      0.1
$x_2$    8.0       11.0      6.0      0.1
$x_3$    7.0       8.0       7.0      0.2

• Predict a relation label for each sentence independently
• Aggregate sentence-level predictions into global-level predictions


Page 7: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

MultiR: Inference 1: $\arg\max_{\boldsymbol{y},\boldsymbol{z}} p(\boldsymbol{y}, \boldsymbol{z} \mid \boldsymbol{x})$

For entity pair (Steve Jobs, Apple):

$x_1$: Steve Jobs was founder of Apple.
$x_2$: Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple.
$x_3$: Steve Jobs is CEO of Apple.

Sentence-level scores:

         born-in   founder   CEO-of   capital-of
$x_1$    0.5       16.0      9.0      0.1
$x_2$    8.0       11.0      6.0      0.1
$x_3$    7.0       8.0       7.0      0.2

Result:

$\boldsymbol{z}$: $z_1$ = founder, $z_2$ = founder, $z_3$ = founder
$\boldsymbol{y}$: $y_{\text{born-in}} = 0$, $y_{\text{founder}} = 1$, $y_{\text{CEO-of}} = 0$, $y_{\text{capital-of}} = 0$

• Predict a relation label for each sentence independently
• Aggregate sentence-level predictions into global-level predictions

Very easy to find! Computational cost: $O(|R| \cdot |\boldsymbol{x}|)$

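The first inference can be sketched directly from the numbers on this slide. The code reads the twelve scores as one row per sentence and one column per relation, which is an assumption about the figure layout; `infer_joint` is a hypothetical name.

```python
RELATIONS = ["born-in", "founder", "CEO-of", "capital-of"]

# Sentence-level scores from the slide, assumed to be one row per sentence.
SCORES = [
    [0.5, 16.0, 9.0, 0.1],  # x1: Steve Jobs was founder of Apple.
    [8.0, 11.0, 6.0, 0.1],  # x2: Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple.
    [7.0, 8.0, 7.0, 0.2],   # x3: Steve Jobs is CEO of Apple.
]

def infer_joint(scores, relations):
    """Pick the best label per sentence, then OR the labels into KB-level facts."""
    z = []
    for row in scores:
        best = max(range(len(relations)), key=lambda j: row[j])
        z.append(relations[best])
    y = [1 if r in z else 0 for r in relations]
    return y, z

y, z = infer_joint(SCORES, RELATIONS)
print(y, z)
```

Each sentence is scored against each relation exactly once, which matches the stated cost of $O(|R| \cdot |\boldsymbol{x}|)$.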

Page 8: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

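MultiR: Inference 2: $\arg\max_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{y})$

For entity pair (Steve Jobs, Apple), $\boldsymbol{y}$ is given ($y_{\text{born-in}} = 0$, $y_{\text{founder}} = 1$, $y_{\text{CEO-of}} = 1$, $y_{\text{capital-of}} = 0$) and $\boldsymbol{z}$ ($z_1$, $z_2$, $z_3$) is unknown (?).

$x_1$: Steve Jobs was founder of Apple.
$x_2$: Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple.
$x_3$: Steve Jobs is CEO of Apple.

Sentence-level scores:

         born-in   founder   CEO-of   capital-of
$x_1$    0.5       16.0      9.0      0.1
$x_2$    8.0       11.0      6.0      0.1
$x_3$    7.0       8.0       7.0      0.2

[Figure: bipartite graph connecting every $y_r$ to every $z_i$, with the scores above as edge weights]

• Define an edge weight: $w(y_r, z_i) = \Phi^{\text{extract}}(r, x_i)$
• A node with $y_r = 1$ must have at least one edge connecting to some $z_i$
• Each node $z_i$ must have an edge connecting to some $y_r$
• Find a set of edges that maximizes the sum of weights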


Page 9: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

MultiR: Inference 2: $\arg\max_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{y})$

For entity pair (Steve Jobs, Apple), $\boldsymbol{y}$ is given ($y_{\text{born-in}} = 0$, $y_{\text{founder}} = 1$, $y_{\text{CEO-of}} = 1$, $y_{\text{capital-of}} = 0$).

$x_1$: Steve Jobs was founder of Apple.
$x_2$: Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple.
$x_3$: Steve Jobs is CEO of Apple.

Sentence-level scores:

         born-in   founder   CEO-of   capital-of
$x_1$    0.5       16.0      9.0      0.1
$x_2$    8.0       11.0      6.0      0.1
$x_3$    7.0       8.0       7.0      0.2

[Figure: edges from $y_{\text{founder}}$ (weights 16, 11, 8) and $y_{\text{CEO-of}}$ (weights 9, 6, 7) to $z_1$, $z_2$, $z_3$]

Result: $z_1$ = founder, $z_2$ = founder, $z_3$ = CEO-of

• Define an edge weight: $w(y_r, z_i) = \Phi^{\text{extract}}(r, x_i)$
• A node with $y_r = 1$ must have at least one edge connecting to some $z_i$
• Each node $z_i$ must have an edge connecting to some $y_r$
• Find a set of edges that maximizes the sum of weights

Exact solution in polynomial time.

In practice, an approximate solution by greedy search (assigning $z_i$ for each node $y_r = 1$) is sufficient.
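For an example this small, the constrained inference can be checked by brute force. The sketch below is not the polynomial-time or greedy procedure mentioned on the slide: it simply enumerates every assignment in which each sentence takes a relation with $y_r = 1$ and every such relation is used at least once. It reads the slide's scores as one row per sentence (an assumption about the figure layout), and `infer_constrained` is a hypothetical name.

```python
from itertools import product

RELATIONS = ["born-in", "founder", "CEO-of", "capital-of"]
SCORES = [
    [0.5, 16.0, 9.0, 0.1],  # x1
    [8.0, 11.0, 6.0, 0.1],  # x2
    [7.0, 8.0, 7.0, 0.2],   # x3
]

def infer_constrained(scores, relations, y):
    """Best z that uses only relations with y_r = 1 and covers each of them."""
    active = [j for j, y_r in enumerate(y) if y_r == 1]
    best, best_total = None, float("-inf")
    for z in product(active, repeat=len(scores)):
        if set(z) != set(active):      # some y_r = 1 node has no edge
            continue
        total = sum(scores[i][j] for i, j in enumerate(z))
        if total > best_total:
            best, best_total = z, total
    if best is None:
        raise ValueError("no assignment is consistent with y")
    return [relations[j] for j in best], best_total

z, total = infer_constrained(SCORES, RELATIONS, y=[0, 1, 1, 0])
print(z, total)
```

The best consistent assignment gives up the slightly higher founder score of the third sentence (8.0) for CEO-of (7.0), because CEO-of must be expressed by at least one sentence.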

Page 10: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

Contribution of this work

• MultiR makes two assumptions (hard constraints):
    • If a fact is not found in the database, it cannot be mentioned in the text
    • If a fact is in the database, it must be mentioned in at least one sentence
• Relax MultiR to handle the situation where:
    • A fact is not mentioned in text (MIT)
    • A fact mentioned in text is missing in database (MID)
• Side effect of this relaxation
    • Incorporates the tendency that the knowledge base is likely to include popular entities and relations

Page 11: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

Distant Supervision with Data Not Missing at Random (DNMAR)

Introduce a layer of latent variables ($t_r$) to handle missing cases.

For entity pair (Steve Jobs, Apple):

                         born-in   founder   CEO-of   visit
$\boldsymbol{y}$ (KB)    0         1         1        0
$\boldsymbol{t}$ (text)  0         1         0        1

$\boldsymbol{z}$: $z_1$ = Founder, $z_2$ = Founder, $z_3$ = visit
$x_1$: Steve Jobs was founder of Apple.
$x_2$: Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple.
$x_3$: Steve Jobs visited Apple store

Introduce a new factor:

$$\phi^{\text{miss}}(y_r, t_r) = \begin{cases} -\alpha_{\text{MIT}} & (y_r = 1 \land t_r = 0) \quad \text{(missing in text)} \\ -\alpha_{\text{MID}} & (y_r = 0 \land t_r = 1) \quad \text{(missing in DB)} \\ 0 & \text{(otherwise)} \end{cases}$$

Relaxing the two hard constraints in MultiR into soft ones with penalty factors $-\alpha_{\text{MIT}}$ and $-\alpha_{\text{MID}}$.

Training algorithm is the same as the one used in MultiR.
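The new factor is easy to state in code. The sketch below evaluates it on the example from this slide; the penalty values 1.0 and 2.0 are arbitrary placeholders chosen for illustration, and `phi_miss` is a hypothetical function name.

```python
def phi_miss(y_r, t_r, alpha_mit, alpha_mid):
    """Soft penalty for a disagreement between KB (y_r) and text (t_r)."""
    if y_r == 1 and t_r == 0:
        return -alpha_mit   # fact in the KB but missing in text
    if y_r == 0 and t_r == 1:
        return -alpha_mid   # fact in text but missing in the DB
    return 0.0

relations = ["born-in", "founder", "CEO-of", "visit"]
y = [0, 1, 1, 0]   # KB facts for (Steve Jobs, Apple)
t = [0, 1, 0, 1]   # facts expressed in text

# Placeholder penalty values, not taken from the paper.
total = sum(phi_miss(y_r, t_r, alpha_mit=1.0, alpha_mid=2.0)
            for y_r, t_r in zip(y, t))
print(total)
```

Here CEO-of incurs the missing-in-text penalty and visit incurs the missing-in-DB penalty, while born-in and founder agree in both layers and contribute nothing.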


Page 12: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

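Constrained inference: $\arg\max_{\boldsymbol{z}} p(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{y})$

For entity pair (Steve Jobs, Apple), $\boldsymbol{y}$ is given ($y_{\text{born-in}} = 0$, $y_{\text{founder}} = 1$, $y_{\text{CEO-of}} = 1$, $y_{\text{visit}} = 0$); $\boldsymbol{z}$ ($z_1$, $z_2$, $z_3$) and $\boldsymbol{t}$ are unknown (?).

$x_1$: Steve Jobs was founder of Apple.
$x_2$: Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple.
$x_3$: Steve Jobs visited Apple store

$$\boldsymbol{z}^* = \arg\max_{\boldsymbol{z}} \sum_{i=1}^{n} \theta \cdot \phi(z_i, x_i) + \sum_r \Big( \alpha_{\text{MIT}} \cdot \mathbb{1}(y_r \land \exists i: r = z_i) - \alpha_{\text{MID}} \cdot \mathbb{1}(\neg y_r \land \exists i: r = z_i) \Big)$$

Here $\theta \cdot \phi(z_i, x_i) = \log \Phi^{\text{extract}}(z_i, x_i)$, and $t_r = 1$ exactly when $\exists i: r = z_i$. The first indicator rewards a KB fact that is expressed in text; the second penalizes a relation expressed in text that is missing from the KB.

• Inference becomes more challenging
• A* search can find an exact solution, but does not scale to many variables
• Present a greedy hill climbing approach for the inference:
    1. Initialize $z_i$ at random
    2. Obtain neighborhoods of the current solution
    3. Move to the neighbor yielding the highest score
    4. Repeat this process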
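The four steps above can be sketched as a steepest-ascent hill climber. Several things here are assumptions made for illustration: the objective adds the sentence scores, rewards each KB fact expressed by some sentence with `alpha_mit`, and penalizes each expressed relation that is absent from the KB with `alpha_mid`; a neighbor differs from the current solution in a single $z_i$; and the scores and penalty values are made up. The paper's actual neighborhood and any restart strategy may differ.

```python
import random

RELATIONS = ["born-in", "founder", "CEO-of", "visit"]

def objective(z, scores, y, alpha_mit, alpha_mid):
    """Sentence scores plus a reward or penalty per mentioned relation."""
    total = sum(scores[i][r] for i, r in enumerate(z))
    for r, y_r in enumerate(y):
        mentioned = r in z
        if y_r and mentioned:
            total += alpha_mit      # KB fact explained by the text
        elif not y_r and mentioned:
            total -= alpha_mid      # mention that is missing in the KB
    return total

def hill_climb(scores, y, alpha_mit, alpha_mid, rng):
    n, k = len(scores), len(y)
    z = [rng.randrange(k) for _ in range(n)]          # 1. random start
    current = objective(z, scores, y, alpha_mit, alpha_mid)
    while True:
        best_z, best_val = None, current
        for i in range(n):                            # 2. single-label changes
            for r in range(k):
                if r == z[i]:
                    continue
                cand = z[:i] + [r] + z[i + 1:]
                val = objective(cand, scores, y, alpha_mit, alpha_mid)
                if val > best_val:
                    best_z, best_val = cand, val
        if best_z is None:                            # local optimum reached
            return z, current
        z, current = best_z, best_val                 # 3. move, 4. repeat

# Made-up scores: one row per sentence, one column per relation.
scores = [
    [0.5, 16.0, 9.0, 0.1],
    [8.0, 11.0, 6.0, 0.1],
    [0.2, 1.0, 2.0, 9.0],
]
z, value = hill_climb(scores, y=[0, 1, 1, 0], alpha_mit=2.0, alpha_mid=3.0,
                      rng=random.Random(0))
labels = [RELATIONS[r] for r in z]
print(labels, value)
```

With these particular numbers every starting point climbs to the same assignment, so the result does not depend on the random seed; in general hill climbing can stop at a local optimum that is not the global one.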


Page 13: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

Incorporating popularity in KB

• We tune the penalty factors $\alpha_{\text{MIT}}$ and $\alpha_{\text{MID}}$ on a development set
• We can take into account how likely each fact is to be observed in the text and the knowledge base
    • Facts about Barack Obama are likely to exist
    • Facts about Naoaki Okazaki are unlikely to exist
• Control the penalty factor for each entity pair
    • Popularity of entities: $\alpha_{\text{MID}}^{(e_1,e_2)} = -\gamma \min(c(e_1), c(e_2))$
        • A larger penalty if the model predicts that a fact about a popular entity does not exist in KB
    • Well-aligned relations: assign 3 kinds of values of $\alpha_{\text{MIT}}^{r}$
        • A larger penalty if a popular relation such as contains, place_lived, and nationality does not exist in text
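The entity-popularity formula can be illustrated with a short sketch. The slide does not define $c(e)$, so the code assumes it is some popularity count for entity $e$; the counts, the value of `gamma`, and the pairing with "Hawaii" are made-up placeholders.

```python
def alpha_mid(e1, e2, counts, gamma):
    """Pair-specific MID parameter: -gamma * min(c(e1), c(e2))."""
    return -gamma * min(counts[e1], counts[e2])

# Made-up popularity counts c(e), for illustration only.
counts = {"Barack Obama": 1000, "Hawaii": 500, "Naoaki Okazaki": 2}

popular = alpha_mid("Barack Obama", "Hawaii", counts, gamma=0.5)
rare = alpha_mid("Naoaki Okazaki", "Hawaii", counts, gamma=0.5)
print(popular, rare)
```

The pair of popular entities gets the larger magnitude, so predicting that one of its facts is missing from the KB costs more than for a pair involving a rarely mentioned entity.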


Page 14: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

Experiments

• Binary relation extraction
    • The standard setting (Riedel+, 10)
        • Knowledge base: Freebase relations
        • Text corpus: 1.8 million New York Times articles
    • Two kinds of evaluation
        • Sentence-level extractions using the dataset (Hoffmann+, 11)
        • Holdout evaluation on Freebase knowledge
• Unary relation extraction (NE categorization)
    • Twitter NE categorization dataset (Ritter+, 11)
        • Knowledge base: Freebase (instances and their categories)
        • Text corpus: tweets
    • Hold-out evaluation


Page 15: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

Results

17% increase in area under the curve. Incorporating popularity yielded a 27% increase over the baseline.

This evaluation underestimates precision because many facts correctly extracted from text are missing in the database. DNMAR doubled the recall.

Ritter et al. (2013) Modeling Missing Data in Distant Supervision for Information Extraction, TACL(1), 367-378.


Page 16: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

Conclusion

• Investigated the problem of missing data in distant supervision
• Presented an extension of MultiR to handle missing data
• Could incorporate the popularity of facts to be included in the knowledge base and text
• Presented a scalable inference algorithm based on greedy hill-climbing
• Demonstrated the effectiveness of the modeling


Page 17: Modeling missing data in distant supervision for information extraction (Ritter+, TACL 2013)

References

• Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, Daniel S. Weld. (2011) Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations. ACL-2011, pages 541–550.
    • Slides and code
• Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky. (2009) Distant supervision for relation extraction without labeled data. ACL-2009, pages 1003–1011.
