Identifying Constant and Unique Relations by using Time-Series Text

Yohei Takaku, Nobuhiro Kaji, Naoki Yoshinaga, Masashi Toyoda
Toyo Keizai Inc.; Institute of Industrial Science, University of Tokyo



Relation Extraction from the Evolving Web

• Web (text) as a growing goldmine for extracting relations between real-world entities
[Pantel+ 06; Banko+ 07; Suchanek+ 07; Wu+ 08,10; Zhu+ 09; Mintz+ 09]
– Processing more text leads to more relations, but
– Relations in text could be obsolete / will become outdated

[Figure: timeline 2009 to 2012; "Apple’s CEO is Steve Jobs" (marked obsolete) is followed by "Apple’s CEO is Tim Cook"]

How to Consistently Compile Extracted Relations?

• <arg1, flows through, arg2>
• <arg1, ’s CEO is, arg2>
• <arg1, sells, arg2>

How to Consistently Compile Extracted Relations?

• <arg1, flows through, arg2>
• <arg1, ’s CEO is, arg2>
• <arg1, sells, arg2>

[Figure: timeline 2009 to 2012 with "Moselle river flows through Germany" and "Moselle river flows through France"]

Accumulate all the relations, because the relation does not evolve over time

How to Consistently Compile Extracted Relations?

• <arg1, flows through, arg2>
• <arg1, ’s CEO is, arg2>
• <arg1, sells, arg2>

[Figure: timeline 2009 to 2012 with "Apple’s CEO is Steve Jobs" (overwritten) followed by "Apple’s CEO is Tim Cook", and "Apple sells MacBook Air" followed by "Apple sells iPhone 4S"]

Overwrite with the new one, because the relation can take only one value of arg2
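The two compilation rules (accumulate vs. overwrite) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `compile_relations`, the tuple layout, and the years attached to each instance are assumptions, and the slides do not say how non-constant, non-unique relations such as "sells" should be handled, so the sketch simply accumulates them.

```python
from collections import defaultdict

def compile_relations(instances, unique_relations):
    """Fold (time, arg1, relation, arg2) tuples into a store keyed by (arg1, relation).

    unique_relations is a hypothetical input: the set of relations labelled
    unique (in the talk, that label would come from a classifier).
    """
    store = defaultdict(set)
    for _, arg1, relation, arg2 in sorted(instances):  # oldest first
        key = (arg1, relation)
        if relation in unique_relations:
            store[key] = {arg2}   # overwrite: only one arg2 holds at a time
        else:
            store[key].add(arg2)  # accumulate (assumed default for all other relations)
    return dict(store)

# Years are illustrative only; the slide timeline does not give exact dates.
instances = [
    (2009, "Moselle river", "flows through", "Germany"),
    (2011, "Moselle river", "flows through", "France"),
    (2009, "Apple", "'s CEO is", "Steve Jobs"),
    (2011, "Apple", "'s CEO is", "Tim Cook"),
]
compiled = compile_relations(instances, unique_relations={"'s CEO is"})
```

With these inputs, both arg2 values are kept for "flows through", while only the latest value survives for "'s CEO is".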

Constant and Unique Relations

• Given value of arg1,
– Constant rel.: value of arg2 is independent of time
– Unique rel.: arg2 takes exactly one value at any point in time

• constant, non-unique: <arg1, flows through, arg2>
• non-constant, non-unique: <arg1, sells, arg2>
• constant, unique: <arg1, ’s father is, arg2>, <arg1, was born in, arg2>
• non-constant, unique: <arg1, belongs to, arg2>, <arg1, ’s CEO is, arg2>

Overview

• Constancy and Uniqueness of Relations
• Our Approach
• Features for Constancy Classification
• Features for Uniqueness Classification
• Experiments
• Conclusion

Two Binary Classification Tasks

• <arg1, ’s father is, arg2>
• <arg1, ’s CEO is, arg2>
• <arg1, sells, arg2>
• <arg1, borders on, arg2>
• <arg1, was born in, arg2>
• <arg1, belongs to, arg2>

Two Binary Classification Tasks
Task 1: constancy classification (constant vs. non-constant)

• <arg1, ’s father is, arg2>
• <arg1, ’s CEO is, arg2>
• <arg1, sells, arg2>
• <arg1, borders on, arg2>
• <arg1, was born in, arg2>
• <arg1, belongs to, arg2>

Two Binary Classification Tasks
Task 2: uniqueness classification (unique vs. non-unique)

• <arg1, ’s father is, arg2>
• <arg1, ’s CEO is, arg2>
• <arg1, sells, arg2>
• <arg1, borders on, arg2>
• <arg1, was born in, arg2>
• <arg1, belongs to, arg2>

Two Kinds of Features for Training Supervised Classifiers

• Frequency obtained from time-series text
– Detailed later
– Based on blog posts crawled from 2006 to 2011

• Linguistic cues
– Prefix: "George Bush is ex-president of USA", e.g., <arg1, ’s president is, arg2>non-const.
– Coordination: "France borders on Italy as well as Spain", e.g., <arg1, borders on, arg2>non-uniq.


Using Time-series Text for Constancy Classification

<Keisuke Honda(=arg1), belongs to, arg2>non-const.

2008 – 2010, VVV-Venlo
2010 – now, CSKA Moscow

Using Time-series Text for Constancy Classification

<Keisuke Honda(=arg1), belongs to, arg2>non-const.

[Figure: two bar charts of frequency (0 to 30) over arg2 fillers (CSKA Moscow, VVV Venlo, VVV, Italy, Dutch league, Luzhniki Stadium, Chairman Haiberuden, Chelsea, PAO), one panel for 2008 – 2009 (VVV Venlo) and one for 2010 – 2011 (CSKA Moscow)]

Cosine Similarity as a Feature Value

<Keisuke Honda, belongs to, arg2>non-const.
[Figure: two bar charts of frequency (0 to 30) over arg2 fillers: CSKA Moscow, VVV Venlo, VVV, Italy, Dutch league, Luzhniki Stadium, Chairman Haiberuden, Chelsea, PAO]
(19, 0, 0, 0, 0, 0, 0, 0, 1)
( 0, 27, 7, 0, 0, 0, 1, 0, 0)
cosine = 0.0

<Tokyo, has river, arg2>const.
[Figure: two bar charts of frequency (0 to 25) over arg2 fillers: Tama river, Sumida river, Meguro river, Asa river, Kannda river, Edo river, Sakai river, Minamiasa river, Onda river]
( 4, 2, 1, 1, 1, 1, 0, 0, 0)
( 4, 0, 1, 3, 1, 21, 3, 1, 1)
cosine = 0.39
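The two cosine values on this slide can be reproduced from the frequency vectors. The sketch below uses only the standard library; the function name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty window counts as zero similarity
    return dot / (norm_u * norm_v)

# Vectors copied from the slide (arg2 filler counts in two time periods).
honda = cosine((19, 0, 0, 0, 0, 0, 0, 0, 1), (0, 27, 7, 0, 0, 0, 1, 0, 0))
tokyo = cosine((4, 2, 1, 1, 1, 1, 0, 0, 0), (4, 0, 1, 3, 1, 21, 3, 1, 1))
print(round(honda, 2), round(tokyo, 2))  # prints: 0.0 0.39
```

The Honda vectors share no non-zero dimension, so their dot product is exactly zero; the Tokyo vectors overlap on several rivers, which yields the higher similarity expected of a constant relation.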

Importance of Choosing Time Windows

<Keisuke Honda, belongs to, arg2>non-const.

[Figure: timeline 2009 to 2012 marked "Team changed" (X) between VVV Venlo and CSKA Moscow, with two bar charts of frequency (0 to 30) over arg2 fillers: CSKA Moscow, VVV Venlo, VVV, Italy, Dutch league, Luzhniki Stadium, Chairman Haiberuden, Chelsea, PAO]

Importance of Choosing Time Windows

<Keisuke Honda, belongs to, arg2>non-const.

[Figure: timeline 2009 to 2012 marked "Team changed" (X), with two bar charts of frequency (0 to 30) over the same arg2 fillers, both panels labelled CSKA Moscow]

Using Multiple Time Windows

[Figure: bar charts of frequency over arg2 fillers (CSKA…, VVV, Dutch…, Chair…, PAO) for successive windows of T months on a 2009 to 2012 time axis, with similarity values 0.0, 0.4, 0.3, ... along the timeline]

Window size T = { 1, 3, 6, 12 (months) }
Integration method = { ave., min., max. }
12 (= 4 x 3) features

Example (ave.): (0.4 + 0.3 + 0.0) / 3 = 0.233
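A minimal sketch of how the 12 (= 4 x 3) features could be computed from monthly arg2 counts. It assumes that similarities are taken between adjacent, non-overlapping windows of T months, which the slide suggests but does not state; the function names and the feature keys (`T3_ave`, etc.) are hypothetical.

```python
import math
from collections import Counter

WINDOW_SIZES = (1, 3, 6, 12)  # months, as listed on the slide

def cosine(c1, c2):
    """Cosine similarity between two Counters of arg2 fillers."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def window_similarities(monthly_counts, size):
    """Merge months into windows of `size` months, then compare neighbours.

    Assumption: windows are adjacent and non-overlapping.
    """
    windows = []
    for start in range(0, len(monthly_counts) - size + 1, size):
        merged = Counter()
        for month in monthly_counts[start:start + size]:
            merged.update(month)
        windows.append(merged)
    return [cosine(a, b) for a, b in zip(windows, windows[1:])]

def constancy_features(monthly_counts):
    """Return ave./min./max. of window similarities for each window size."""
    features = {}
    for size in WINDOW_SIZES:
        sims = window_similarities(monthly_counts, size)
        if not sims:
            continue  # not enough months for two windows of this size
        features[f"T{size}_ave"] = sum(sims) / len(sims)
        features[f"T{size}_min"] = min(sims)
        features[f"T{size}_max"] = max(sims)
    return features

# Toy data (illustrative): 12 months at one team, then 12 months at another.
months = [Counter({"VVV Venlo": 2})] * 12 + [Counter({"CSKA Moscow": 2})] * 12
features = constancy_features(months)
slide_average = sum([0.4, 0.3, 0.0]) / 3  # the slide's ave. example
```

On the toy data, every window size sees a similarity of 0.0 at the team change, so the min. features stay at zero even when the ave. features are diluted by the many unchanged months.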


Using Time-series Text for Uniqueness Classification

Feature: number of entity types

<Keisuke Honda, belongs to, arg2>uniq.: single entity
[Figure: bar chart of frequency (0 to 30) over arg2 fillers: CSKA Moscow, VVV Venlo, VVV, Italy, Dutch league, Luzhniki Stadium, Chairman…, Chelsea, PAO]

<Google, acquires, arg2>non-uniq.: 12 entities
[Figure: bar chart of frequency (0 to 30) over arg2 fillers: YouTube, Groupon, Slide, Demand, SayNow, Twitter, DSP, FeedBurner, Widevine, Fflick, Jaiku, PushLife]

Using Time-series Text for Uniqueness Classification (Cont.)

Data can be noisy

Feature: frequency ratio between the 1st and 2nd most frequent entities

<Keisuke Honda, belongs to, arg2>uniq.
[Figure: bar chart of frequency (0 to 30) over values of arg2; the two most frequent values occur 19 times and 1 time]
1/19 ≈ 0.05

<Google, acquires, arg2>non-uniq.
[Figure: bar chart of frequency (0 to 30) over values of arg2; the two most frequent values occur 9 times and 7 times]
7/9 ≈ 0.78
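Both uniqueness features (number of entity types, and the frequency ratio of the second to the first most frequent entity) can be sketched for a single time window as below. The function name and dictionary keys are assumptions; the counts 19/1 and 9/7 come from the slide, but the slide residue does not say which Google fillers carry the counts 9 and 7, so placeholder labels are used.

```python
from collections import Counter

def uniqueness_features(arg2_counts):
    """Features for one arg1 within one time window.

    arg2_counts: Counter mapping each observed arg2 filler to its frequency.
    """
    ranked = [count for _, count in arg2_counts.most_common()]
    num_types = len(ranked)
    # Ratio of the 2nd most frequent filler to the most frequent one;
    # defined as 0.0 here when fewer than two fillers were seen (assumption).
    ratio = ranked[1] / ranked[0] if num_types >= 2 else 0.0
    return {"num_entity_types": num_types, "freq_ratio": ratio}

honda = uniqueness_features(Counter({"CSKA Moscow": 19, "PAO": 1}))
# Placeholder filler names: only the counts 9 and 7 are given on the slide.
google = uniqueness_features(Counter({"top filler": 9, "second filler": 7}))
print(round(honda["freq_ratio"], 2), round(google["freq_ratio"], 2))  # prints: 0.05 0.78
```

A ratio close to 0 means one filler dominates even if noise adds a few others, which is why this feature is more robust than the raw number of entity types.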

Setting Appropriate Time Windows is also Important

<Keisuke Honda, belongs to, arg2>uniq.

[Figure: timeline 2009 to 2012 marked "Team changed" (X) between VVV Venlo and CSKA Moscow, with a bar chart of frequency (0 to 30) over arg2 fillers: CSKA Moscow, VVV Venlo, VVV, Italy, Dutch league, Luzhniki Stadium, Chairman Haiberuden, Chelsea, PAO]


Experiments

• Evaluate our method with relations extracted from time-series text
– Classifier: Passive-aggressive algorithm w/ proposed features
– Time-series text: six years’ worth of Japanese blog posts (2.3 billion sentences)

• Conduct experiments to:
– Evaluate the constancy classification
– Evaluate the uniqueness classification
– Investigate the impact of multiple window sizes
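For readers unfamiliar with the classifier named above, here is a small pure-Python sketch of a passive-aggressive binary learner. The slide does not say which variant or hyperparameters were used; the PA-I step size with aggressiveness `C`, the function names, and the toy feature values are all assumptions for illustration.

```python
def score(weights, x):
    """Linear score of a sparse feature dict x under the weight dict."""
    return sum(weights.get(f, 0.0) * v for f, v in x.items())

def pa_train(examples, epochs=20, C=1.0):
    """Train a PA-I classifier (assumed variant).

    examples: list of (feature_dict, label) with label in {+1, -1}.
    """
    weights = {}
    for _ in range(epochs):
        for x, y in examples:
            loss = max(0.0, 1.0 - y * score(weights, x))  # hinge loss
            sq_norm = sum(v * v for v in x.values())
            if loss > 0.0 and sq_norm > 0.0:
                tau = min(C, loss / sq_norm)  # PA-I step size
                for f, v in x.items():
                    weights[f] = weights.get(f, 0.0) + tau * y * v
    return weights

# Toy data: +1 = constant, -1 = non-constant; the feature value stands in
# for an averaged window similarity (made-up numbers).
train = [
    ({"bias": 1.0, "sim_ave": 0.9}, +1),
    ({"bias": 1.0, "sim_ave": 0.1}, -1),
]
weights = pa_train(train)
margin_const = score(weights, {"bias": 1.0, "sim_ave": 0.9})
margin_nonconst = score(weights, {"bias": 1.0, "sim_ave": 0.1})
```

The signed score returned by `score` is the margin; thresholding it at values other than zero is what produces the recall-precision curves reported in the results.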

Data and Settings

• Parse time-series text to extract dependency paths connecting two named entities as relation instances

• Annotate 1000 relations (majority vote, 3 humans)
– Major reason for disagreement: type ambiguity
Ex. <arg1(human), is seen in, arg2>non-const.
<arg1(mountain), is seen in, arg2>const.

• Use the labeled relations for training & testing
– Evaluation metric: Precision & Recall (5-fold cross validation)

Inter-annotator agreement:
                      Constancy       Uniqueness
Kappa [Fleiss 1971]   0.346 (fair)    0.428 (moderate)
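Fleiss' kappa, the agreement measure reported above, can be computed from per-item category counts as sketched below. The toy ratings are invented for illustration and are unrelated to the 1000 annotated relations; the function names are assumptions.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item category counts.

    ratings[i][j] = number of annotators who put item i in category j.
    Every item must be rated by the same number of annotators (at least 2).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that went to each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement per item, then averaged.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]
    p_obs = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)

def majority_label(row):
    """Index of the category chosen by most annotators."""
    return max(range(len(row)), key=lambda j: row[j])

# Toy ratings by 3 annotators over 2 categories (e.g. constant / non-constant).
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
kappa_perfect = fleiss_kappa(perfect)
kappa_split = fleiss_kappa(split)
```

With three annotators and two categories, a majority always exists, so `majority_label` never has to break a tie.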

Classification Result: Constancy

• Vary the threshold on the classifier’s output (margin) to plot a recall-precision curve
– Baseline: cosine similarity between distributions over arg2 in the first and last month

[Figure: recall-precision curves for Baseline and Proposed, annotated "Recovered most of the constant relations while keeping precision" and "Suffered from data sparseness"]
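The threshold sweep behind a recall-precision curve can be sketched as follows. The margins and labels are made-up values, and the function name is an assumption; the sketch treats an item as positive when its margin is at or above the threshold.

```python
def recall_precision_curve(margins, labels):
    """Return (threshold, recall, precision) triples, highest threshold first.

    margins: classifier outputs, one per relation.
    labels: True where the relation is a gold positive (e.g. constant).
    """
    total_pos = sum(labels)
    points = []
    for threshold in sorted(set(margins), reverse=True):
        predicted = [m >= threshold for m in margins]
        tp = sum(1 for p, gold in zip(predicted, labels) if p and gold)
        fp = sum(1 for p, gold in zip(predicted, labels) if p and not gold)
        precision = tp / (tp + fp)  # tp + fp >= 1: the threshold is itself a margin
        recall = tp / total_pos if total_pos else 0.0
        points.append((threshold, recall, precision))
    return points

# Toy classifier outputs and gold labels (illustrative only).
margins = [2.0, 1.0, 0.5, -0.3, -1.2]
labels = [True, True, False, True, False]
points = recall_precision_curve(margins, labels)
```

Lowering the threshold moves along the curve from high precision and low recall toward full recall at lower precision.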

Classification Result: Uniqueness

• Vary the threshold on the classifier’s output (margin) to plot a recall-precision curve
– Baseline: re-implementation of [Lin+ 10] (based on gross distributions over arg2)

[Figure: recall-precision curves for Baseline and Proposed, annotated "Time-series distributions over arg2 greatly improved the precision" and "Confused non-constant, unique rels. with non-unique rels."]

Impact of using Multiple Time Windows, T

• Compare our method (multiple time windows) with methods using a single time window
– Combining features computed from multiple time windows greatly improved the precision of uniqueness classification

[Figure: result plots in two panels, Constancy and Uniqueness]

Error Analysis

• Investigate 200 misclassified relations

Error Type                   Const.  Uniq.  Total
Paraphrases                  16      52     68
Topical bias                 0       56     56
Short/long-term evolution    20      27     47
Obsolete/Speculative facts   10      19     29

Paraphrases
ex. <Obama, is president of, {USA, the United States}>uniq.

Topical bias
ex. <Will Smith, ’s son is, {Jaden Smith/Trey Smith}>non-uniq. (co-starred in a film)

Short/long-term evolution
ex. <Real Madrid, beats, Valencia>uniq. Apr. 7th, 2012
<Real Madrid, beats, Barcelona>uniq. Apr. 21st, 2012

Obsolete/Speculative facts
ex. <Japan, ’s capital city is, {Tokyo, Kyoto}>uniq. (outdated fact)
<Sun, is bought out by, {Oracle, IBM}>const. (speculation)

Related Work

• TempEval temporal relation identification [Verhagen+ 2007, 2010]
– Associate event (relation instance) with time
ex. <Charles Chaplin, was born in, London> OVERLAP 1889
– Do not address constancy of relation

• Functional relation identification [Ritter+ 10]
– Use distributions over arg2 to identify uniqueness [Lin+ 10]
– Time dimension should be considered to identify uniqueness
ex. <Naoki Yoshinaga, lives in, {Kyoto (-’96), Tokyo (’96-’05, ’08-)}>uniq.


2019/11/7, EMNLP-CoNLL 2012 (slide 31)

Conclusion

• A novel notion of constancy of relations
• A method of identifying constant and unique relations
– Use massive time-series text to induce features
– Time-series distributions over arg2 computed with multiple time windows were quite effective

• Future work
– Apply our method to relations with typed arguments [Lin+ 10]
Ex. <arg1(mountain), is seen in, arg2>const.
– Use constancy/uniqueness to compile extracted relations