Hitting The Right Paraphrases In Good Time




Page 1: Hitting The Right Paraphrases In Good Time

Hitting The Right Paraphrases In Good Time

Stanley Kok
Dept. of Comp. Sci. & Eng., Univ. of Washington, Seattle, USA

Chris Brockett
NLP Group, Microsoft Research, Redmond, USA

Page 2: Hitting The Right Paraphrases In Good Time

Overview

• Motivation
• Background
• Hitting Time Paraphraser
• Experiments
• Future Work


Page 4: Hitting The Right Paraphrases In Good Time

What’s a paraphrase of…

“is on good terms with”

Paraphrase System

• “is friendly with”
• “is a friend of”
• …

Applications

• Query expansion
• Document summarization
• Natural language generation
• Question answering
• etc.

Page 5: Hitting The Right Paraphrases In Good Time

What’s a paraphrase of…

“is on good terms with”

Paraphrase System (built from Bilingual Parallel Corpora)

• “is friendly with”
• “is a friend of”
• …

Page 6: Hitting The Right Paraphrases In Good Time

Bilingual Parallel Corpus

…the cost dynamic is under control…
…die kostenentwicklung unter kontrolle…
…keep the cost in check…
…die kosten unter kontrolle…
…

Phrase Table

English Phrase (E)    German Phrase (G)    P(G|E)    P(E|G)
under control         unter kontrolle      0.75      0.40
in check              unter kontrolle      0.60      0.20
…                     …                    …         …

Page 7: Hitting The Right Paraphrases In Good Time

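State of the Art

BCB system [Bannard & Callison-Burch, ACL’05]

$P(E_2 \mid E_1) \approx \sum_{C} \sum_{G} P(E_2 \mid G)\, P(G \mid E_1)$

SBP system [Callison-Burch, EMNLP’08]

$P(E_2 \mid E_1) \approx \sum_{C} \sum_{G} P(E_2 \mid G, \mathrm{syn}(E_1))\, P(G \mid E_1, \mathrm{syn}(E_1))$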
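The pivot idea behind the BCB formula can be illustrated with the two phrase-table rows shown on the previous slide. The following is a minimal sketch, assuming a single phrase table and a sum over the foreign phrases G that are aligned with the query; the function name `pivot_paraphrase_probs` and the tuple layout are hypothetical, not taken from either system.

```python
from collections import defaultdict

# Phrase-table rows from the slide: (English, German, P(G|E), P(E|G)).
PHRASE_TABLE = [
    ("under control", "unter kontrolle", 0.75, 0.40),
    ("in check", "unter kontrolle", 0.60, 0.20),
]

def pivot_paraphrase_probs(query, phrase_table):
    """Return {E2: sum over G of P(E2|G) * P(G|query)} for every E2 != query.

    Hypothetical helper: it pivots through the foreign phrases of one table.
    """
    # P(G|E1) for each foreign phrase G aligned with the query phrase E1.
    p_g_given_e1 = {g: p_ge for e, g, p_ge, _ in phrase_table if e == query}
    scores = defaultdict(float)
    for e2, g, _, p_eg in phrase_table:
        if e2 != query and g in p_g_given_e1:
            scores[e2] += p_eg * p_g_given_e1[g]
    return dict(scores)

probs = pivot_paraphrase_probs("under control", PHRASE_TABLE)
print(probs)  # "in check" scores 0.20 * 0.75, i.e. about 0.15
```

With several bilingual corpora, the same accumulation would presumably run over each phrase table in turn.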

Page 8: Hitting The Right Paraphrases In Good Time

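Graphical View

[Figure: graph linking English phrase nodes E1 to E4 (e.g., “in check”, “under control”) to foreign phrase nodes G1 to G3 (e.g., “unter kontrolle”) and F1, F2. Edges carry translation probabilities such as P(G1|E1), P(E2|G1), P(F2|E1), and P(E2|F2).]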

Page 9: Hitting The Right Paraphrases In Good Time

Graphical View

• Path lengths > 2
• General graph
• Add nodes to represent domain knowledge

Random Walks, Hitting Times

[Figure: graph over English phrase nodes E1 to E4 and foreign phrase nodes G1 to G3, F1, F2.]


Page 11: Hitting The Right Paraphrases In Good Time

Random Walk

• Begin at node A
• Randomly pick neighbor n

[Figure: graph with nodes A to F.]

Page 12: Hitting The Right Paraphrases In Good Time

Random Walk

• Begin at node A
• Randomly pick neighbor n
• Move to node n

[Figure: graph with nodes A to F.]

Page 13: Hitting The Right Paraphrases In Good Time

Random Walk

• Begin at node A
• Randomly pick neighbor n
• Move to node n
• Repeat

[Figure: graph with nodes A to F.]

Page 14: Hitting The Right Paraphrases In Good Time

Hitting Time from node i to j

• Expected number of steps, starting from node i, before node j is visited for the first time
• Smaller hitting time → closer to start node i

Truncated Hitting Time [Sarkar & Moore, UAI’07]

• Random walks are limited to T steps
• Computed efficiently & with high probability by sampling random walks [Sarkar, Moore & Prakash, ICML’08]
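The definitions on this slide can be written compactly as follows. The notation is an assumption introduced here for illustration: X_t is the position of the walk at step t, τ_ij is the first step at which a walk started at i reaches j, m is the number of sampled walks, and τ^(k) is the value observed in the k-th walk. Only T comes from the slide.

```latex
\tau_{ij} = \min\{\, t \ge 0 : X_t = j \mid X_0 = i \,\}, \qquad
h^{T}_{ij} = \mathbb{E}\big[\min(\tau_{ij},\, T)\big], \qquad
\hat{h}^{T}_{ij} = \frac{1}{m} \sum_{k=1}^{m} \min\big(\tau^{(k)}_{ij},\, T\big)
```

Under this reading, a node that is never reached within T steps contributes T to the average, and the start node has hitting time 0.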

Page 15: Hitting The Right Paraphrases In Good Time

Finding Truncated Hitting Time By Sampling

T = 5

Walk so far: A

[Figure: graph with nodes A to F.]

Page 16: Hitting The Right Paraphrases In Good Time

Finding Truncated Hitting Time By Sampling

T = 5

Walk so far: A D

[Figure: graph with nodes A to F.]

Page 17: Hitting The Right Paraphrases In Good Time

Finding Truncated Hitting Time By Sampling

T = 5

Walk so far: A D E

[Figure: graph with nodes A to F.]

Page 18: Hitting The Right Paraphrases In Good Time

Finding Truncated Hitting Time By Sampling

T = 5

Walk so far: A D E D

[Figure: graph with nodes A to F.]

Page 19: Hitting The Right Paraphrases In Good Time

Finding Truncated Hitting Time By Sampling

T = 5

Walk so far: A D E D F

[Figure: graph with nodes A to F.]

Page 20: Hitting The Right Paraphrases In Good Time

Finding Truncated Hitting Time By Sampling

T = 5

Walk so far: A D E D F E

[Figure: graph with nodes A to F.]

Page 21: Hitting The Right Paraphrases In Good Time

Finding Truncated Hitting Time By Sampling

T = 5

Sampled walk: A D E D F E

Truncated hitting times from A:

• hAA = 0
• hAB = 5
• hAC = 5
• hAD = 1
• hAE = 2
• hAF = 4

[Figure: graph with nodes A to F.]
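The hitting times read off this sampled walk can be reproduced in a few lines. This is a minimal sketch, assuming that a node's truncated hitting time in a single walk is the step of its first visit, and T when the node is never visited; the function name `truncated_hitting_times` is hypothetical.

```python
def truncated_hitting_times(walk, nodes, T):
    """First-visit step of each node in one sampled walk, capped at T."""
    first_visit = {}
    # walk[0] is the start node (step 0); at most T further steps are used.
    for step, node in enumerate(walk[: T + 1]):
        first_visit.setdefault(node, step)
    # Nodes that the walk never reaches get the truncation value T.
    return {node: min(first_visit.get(node, T), T) for node in nodes}

walk = ["A", "D", "E", "D", "F", "E"]  # the walk shown on the slide
h = truncated_hitting_times(walk, "ABCDEF", T=5)
print(h)
```

Averaging these per-walk values over many sampled walks gives the estimate used for ranking.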


Page 23: Hitting The Right Paraphrases In Good Time

Hitting Time Paraphraser (HTP)

Phrase: “is on good terms with”

HTP, using phrase tables (English-German, English-French, German-French, etc.)

Paraphrases:

• “is friendly with”
• “is a friend of”
• …

Page 24: Hitting The Right Paraphrases In Good Time


Graph Construction


Page 26: Hitting The Right Paraphrases In Good Time

Graph Construction

• BFS from query phrase up to depth d or up to max. number n of nodes
• d = 6, n = 50,000
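The bounded breadth-first search described here could look roughly like the sketch below. It assumes a caller-supplied `neighbors` function that returns the phrases linked to a phrase in the phrase tables; that function, the name `build_graph`, and the toy adjacency data are hypothetical, while the defaults d = 6 and n = 50,000 come from the slide.

```python
from collections import deque

def build_graph(query, neighbors, max_depth=6, max_nodes=50_000):
    """Collect nodes reachable from `query` by BFS, bounded by depth and size.

    Returns a dict mapping each collected node to its BFS depth.
    """
    depth = {query: 0}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand nodes at the depth limit d
        for nxt in neighbors(node):
            if nxt not in depth:
                if len(depth) >= max_nodes:
                    return depth  # node budget n exhausted
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth

# Toy adjacency (hypothetical), echoing the phrases used earlier in the deck.
toy = {
    "under control": ["unter kontrolle"],
    "unter kontrolle": ["under control", "in check"],
    "in check": ["unter kontrolle"],
}
reached = build_graph("under control", lambda p: toy.get(p, []))
print(reached)
```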

Page 27: Hitting The Right Paraphrases In Good Time

Graph Construction

[Figure: constructed graph with edge weights 0.25 and 0.35 shown.]

Page 28: Hitting The Right Paraphrases In Good Time

Graph Construction

[Figure: constructed graph with edge weight 0.6 shown.]

Page 29: Hitting The Right Paraphrases In Good Time

Graph Construction

[Figure: constructed graph with edge weights 0.5 and 0.5 shown.]

Page 30: Hitting The Right Paraphrases In Good Time

Estimate Trunc. Hitting Times

• Run m truncated random walks to estimate truncated hitting time of each node
• T = 10, m = 1,000,000
• Prune nodes with hitting times = T
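The estimate-then-prune step can be sketched as follows. This assumes the graph is stored as a dict mapping each node to a dict of neighbor transition probabilities, and that every node has at least one outgoing edge; the names `estimate_hitting_times` and `prune`, the toy graph, and the small m are hypothetical choices for illustration (the slide uses T = 10 and m = 1,000,000).

```python
import random

def estimate_hitting_times(graph, start, T=10, m=2_000, seed=0):
    """Average first-visit step over m random walks of at most T steps."""
    rng = random.Random(seed)
    totals = {node: 0 for node in graph}
    for _ in range(m):
        first_visit = {start: 0}
        node = start
        for step in range(1, T + 1):
            nbrs, weights = zip(*graph[node].items())
            node = rng.choices(nbrs, weights=weights)[0]
            first_visit.setdefault(node, step)
        for n in graph:
            totals[n] += first_visit.get(n, T)  # unreached nodes count as T
    return {n: totals[n] / m for n in graph}

def prune(hitting_times, T):
    """Drop nodes whose estimated hitting time equals the truncation value T."""
    return {n: h for n, h in hitting_times.items() if h < T}

# Toy graph (hypothetical): Z has no incoming edge, so walks from A never hit it.
graph = {
    "A": {"B": 0.5, "C": 0.5},
    "B": {"A": 1.0},
    "C": {"A": 1.0},
    "Z": {"A": 1.0},
}
h = estimate_hitting_times(graph, "A", T=10)
kept = prune(h, T=10)
print(sorted(kept))
```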

Page 31: Hitting The Right Paraphrases In Good Time

Add Ngram Nodes

Phrase nodes: “achieve the goal”, “achieve the aim”, “reach the objective”

Ngram nodes: “achieve the”, “the aim”, “reach”, “objective”, “the”, …
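One way to derive ngram nodes like those shown is to split each phrase into word n-grams and link every n-gram to the phrases containing it. This is a minimal sketch under assumptions: the maximum n-gram length of 4 and the choice to skip an n-gram equal to the whole phrase are illustrative, and the helper names `ngrams` and `ngram_edges` are hypothetical.

```python
def ngrams(phrase, max_n=4):
    """All word n-grams of `phrase` with 1 <= n <= max_n."""
    words = phrase.split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }

def ngram_edges(phrases, max_n=4):
    """Map each ngram node to the set of phrase nodes that contain it."""
    edges = {}
    for phrase in phrases:
        for gram in ngrams(phrase, max_n):
            if gram != phrase:  # assumption: skip the phrase itself
                edges.setdefault(gram, set()).add(phrase)
    return edges

phrases = ["achieve the goal", "achieve the aim", "reach the objective"]
edges = ngram_edges(phrases)
print(sorted(edges["achieve the"]))
```

Phrases that share many ngram nodes end up close to each other in the graph, which is what lets a random walk move between them.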

Page 32: Hitting The Right Paraphrases In Good Time

Add “Syntax” Nodes

Phrase nodes: “whose goal is”, “the aim is”, “the objective is”, “what goal”

“Syntax” nodes: start with article, end with be, start with interrogatives
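The three “syntax” features named on this slide can be approximated with simple word-list checks. The sketch below is only an illustration: the word lists and the function name `syntax_features` are assumptions, and the slide does not say how the features are actually detected.

```python
# Assumed word lists; the slide names the features but not their definitions.
ARTICLES = {"a", "an", "the"}
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}
INTERROGATIVES = {"who", "whom", "whose", "what", "which",
                  "when", "where", "why", "how"}

def syntax_features(phrase):
    """Return the set of "syntax" node labels that a phrase would link to."""
    words = phrase.lower().split()
    feats = set()
    if not words:
        return feats
    if words[0] in ARTICLES:
        feats.add("start with article")
    if words[-1] in BE_FORMS:
        feats.add("end with be")
    if words[0] in INTERROGATIVES:
        feats.add("start with interrogatives")
    return feats

for p in ["whose goal is", "the aim is", "the objective is", "what goal"]:
    print(p, "->", sorted(syntax_features(p)))
```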

Page 33: Hitting The Right Paraphrases In Good Time

Add Not-Substring-Of Nodes

Phrase nodes: “reach the”, “reach the aim”, “reach the objective”, “objective”

Feature node: not-substring-of

Page 34: Hitting The Right Paraphrases In Good Time

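Feature Nodes

[Figure: a phrase node with outgoing edges to ngram nodes, “syntax” nodes, not-substring nodes, and phrase nodes, with transition probabilities labeled p1 to p4. The values shown are 0.4, 0.1, 0.4, and 0.1.]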

Page 35: Hitting The Right Paraphrases In Good Time

Re-estimate Truncated Hitting Times

• Run m truncated random walks again
• Rank paraphrases in increasing order of hitting times


Page 37: Hitting The Right Paraphrases In Good Time

Data

• Europarl dataset [Koehn, MT-Summit’05]
• Use 6 of 11 languages: English, Danish, German, Spanish, Finnish, Dutch
• About a million sentences per language
• English−Foreign phrasal alignments by GIZA++ [Callison-Burch, EMNLP’08]
• Foreign−Foreign phrasal alignments by MSR aligner

Page 38: Hitting The Right Paraphrases In Good Time

Comparison Systems

• SBP system [Callison-Burch, EMNLP’08]
• HTP with no feature node
• HTP with bipartite graph

Page 39: Hitting The Right Paraphrases In Good Time

Evaluation Methodology

• NIST dataset
  • 4 English translations per Chinese sentence
  • 33,216 English translations
• Randomly selected 100 English phrases
  • From 1-4grams in both NIST & Europarl datasets
  • Exclude stop words, numbers, phrases containing periods and commas

Page 40: Hitting The Right Paraphrases In Good Time

Methodology

• For each phrase, randomly select a sentence from NIST dataset containing it
• Substituted top 1 to 10 paraphrases for phrase

Page 41: Hitting The Right Paraphrases In Good Time

Methodology

• Manually evaluated resulting sentences
  • 0: Clearly wrong; grammatically incorrect or does not preserve meaning
  • 1: Minor grammatical errors (e.g., subject-verb disagreement, wrong tenses), or meaning largely preserved but not completely
  • 2: Totally correct; grammatically correct and meaning is preserved
• Correct: 1 and 2; Wrong: 0
• Two evaluators; Kappa = 0.62 (substantial agreement)
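For readers unfamiliar with the agreement statistic, the sketch below shows how Cohen's kappa is computed for two annotators. It assumes the reported figure is Cohen's kappa; the labels are hypothetical toy data and do not reproduce the 0.62 reported on the slide.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected).

    Undefined (division by zero) when expected agreement is exactly 1.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical judgments (1 = correct, 0 = wrong) from two evaluators.
evaluator_1 = [1, 1, 0, 0]
evaluator_2 = [1, 1, 0, 1]
kappa = cohen_kappa(evaluator_1, evaluator_2)
print(kappa)  # observed 0.75, expected 0.5, so kappa is 0.5
```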

Page 42: Hitting The Right Paraphrases In Good Time

HTP vs. SBP

[Table: for each of the 100 query phrases q1 to q100, the ranked paraphrase lists returned by HTP and by SBP. Values shown, in HTP, SBP column order: 0.71, 0.53.]

Page 43: Hitting The Right Paraphrases In Good Time

HTP vs. SBP

[Table: for each of the 100 query phrases q1 to q100, the ranked paraphrase lists returned by HTP and by SBP. Values shown, in HTP, SBP column order: 0.56, 0.39. Annotation: 373 paraphrases per system.]

Page 44: Hitting The Right Paraphrases In Good Time

HTP vs. SBP

[Table: for each of the 100 query phrases q1 to q100, the ranked paraphrase lists returned by HTP and by SBP. Annotation: 483 paraphrases. Value shown: 0.54.]

Page 45: Hitting The Right Paraphrases In Good Time

HTP vs. SBP

[Table: for each of the 100 query phrases q1 to q100, the ranked paraphrase lists returned by HTP and by SBP. Values shown: 0.53, 0.50, 0.71, 0.61.]

Page 46: Hitting The Right Paraphrases In Good Time

HTP vs. SBP

[Table: for each of the 100 query phrases q1 to q100, the ranked paraphrase lists returned by HTP and by SBP. Values shown: 0.54, 0.39, 0.32, 0.43. Annotations: 975 paraphrases; 373 paraphrases; 492 paraphrases; 420 correct paraphrases; 145 correct paraphrases.]

Page 47: Hitting The Right Paraphrases In Good Time

Timings

System    Timing (secs/phrase)
HTP       48
SBP       468


Page 49: Hitting The Right Paraphrases In Good Time

Future Work

• Apply HTP to languages other than English
• Evaluate HTP impact on applications
  • e.g., improve performance of resource-sparse machine translation systems
• Add more features
• etc.

Page 50: Hitting The Right Paraphrases In Good Time

Conclusion

• HTP: a paraphrase system based on random walks
• Good paraphrases have smaller hitting times
• General graph
• Path length > 2
• Incorporate domain knowledge
• HTP outperforms state-of-the-art