From Paraphrase Database to Compositional Paraphrase Model and Back
John Wieting, University of Illinois
Joint work with Mohit Bansal, Kevin Gimpel, Karen Livescu, and Dan Roth
Motivation
The PPDB (Ganitkevitch et al., 2013) is a vast collection of paraphrase pairs:
that allow the | which enable the
be given the opportunity to | have the possibility of
i can hardly hear you . | you 're breaking up .
and the establishment | as well as the development
laying the foundations | pave the way
making every effort | to do its utmost
… | …
Motivation
• Improve coverage
• Have a parametric model
• Improve phrase pair scores
Contributions
• Powerful word embeddings that have human-level performance on SimLex-999 and WordSim-353
• Phrase embeddings
• Model can re-rank phrases in PPDB 1.0 (improves human correlation from 25 to 52 ρ)
• Parameterization of PPDB that can be used downstream
• New datasets
Datasets
We wanted a clean way to evaluate paraphrase composition.
Two new datasets: one for bigram paraphrases and one for short-phrase paraphrases from PPDB.
[Diagram: evaluation datasets by unit and similarity type (topical vs. paraphrastic) —
Words: WordSim353 (topical), SimLex-999 (paraphrastic)
Bigrams: MLSim (Mitchell and Lapata, 2010) (topical), MLPara (this talk) (paraphrastic)]

Example bigram pairs (MLSim score | MLPara score):
television programme | tv set | 5.8 | 1.0
training programme | education course | 5.7 | 5.0
bedroom window | education officer | 1.3 | 1.0
Inter-annotator agreement on MLPara:
| Spearman's ρ | Cohen's κ
adjective noun | 0.87 | 0.79
noun noun | 0.64 | 0.58
verb noun | 0.73 | 0.73
[Diagram adds a Phrases row: AnnoPPDB (this talk), paraphrastic]
AnnoPPDB (this talk)
Example phrase pairs with average human scores:
can not be separated from | is inseparable from | 5.0
hoped to be able to | looked forward to | 3.4
come on , think about it | people , please | 2.2
how do you mean that | what worst feelings | 1.6

Mean deviation among annotators: 0.60

Dev and test sets were designed to have:
1) a variety of lengths
2) a variety of quality
3) low word overlap

See Pavlick et al., 2015 for a similar but larger dataset.
Learning Embeddings
We now have datasets to test paraphrase similarity. Next we learn to embed words and phrases.
All similarities are computed using cosine similarity.
Related work on using PPDB to improve word embeddings: Yu and Dredze, 2014; Faruqui et al., 2015.
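To make the scoring concrete, here is a minimal cosine-similarity sketch. The toy 3-dimensional vectors are invented purely for illustration (real paragram vectors are 25- or 300-dimensional):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, invented for illustration only.
contamination = np.array([0.9, 0.1, 0.2])
pollution = np.array([0.8, 0.2, 0.1])
villain = np.array([-0.5, 0.9, 0.3])

# A paraphrase pair should score higher than an unrelated pair.
print(cosine_similarity(contamination, pollution))
print(cosine_similarity(contamination, villain))
```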
Training examples (word pairs from PPDB):
contamination pollution
converged convergence
captioned subtitled
outwit thwart
bad villain
broad general
permanent permanently
bed sack
carefree reckless
absolutely urgently
… …
Loss Function for Learning
The loss (shown as an equation on the slide):
• sums over word pairs in PPDB
• the paraphrase pair itself is the positive example
• negative examples are drawn from other words
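The equation itself survives only as an image in the slides; reconstructed in the paper's notation (a best-effort reconstruction, not a verbatim copy), the word-level objective is a margin-based hinge loss:

```latex
\min_{W}\;\frac{1}{|X|}\sum_{\langle x_1, x_2\rangle \in X}
  \Big[ \max\!\big(0,\ \delta - x_1 \cdot x_2 + x_1 \cdot t_1\big)
      + \max\!\big(0,\ \delta - x_1 \cdot x_2 + x_2 \cdot t_2\big) \Big]
  + \lambda \,\lVert W - W_{\text{initial}} \rVert^2
```

where the sum is over word pairs ⟨x1, x2⟩ in PPDB (the positive example), t1 and t2 are negative examples, δ is the margin, and the final term regularizes toward the initial embeddings.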
Choosing Negative Examples?
• only do the argmax over the current mini-batch (for efficiency)
• we regularize by penalizing squared L2 distance to the initial embeddings
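A minimal sketch of the in-batch argmax strategy, assuming embeddings are stored in a dict keyed by word; the function name and data layout here are hypothetical:

```python
import numpy as np

def batch_negatives(emb, batch_pairs):
    """For each pair (x1, x2) in the mini-batch, pick the most confusable
    negative for x1: the other in-batch word whose embedding has the
    highest dot product with emb[x1] (excluding x1 and its paraphrase)."""
    words = sorted({w for pair in batch_pairs for w in pair})
    negatives = []
    for x1, x2 in batch_pairs:
        best, best_score = None, -np.inf
        for t in words:
            if t in (x1, x2):
                continue
            score = float(np.dot(emb[x1], emb[t]))
            if score > best_score:
                best, best_score = t, score
        negatives.append(best)
    return negatives

# Tiny demo with made-up 2-dim embeddings.
emb = {"bad": np.array([1.0, 0.0]), "villain": np.array([0.9, 0.1]),
       "broad": np.array([0.1, 1.0]), "general": np.array([0.0, 1.0])}
pairs = [("bad", "villain"), ("broad", "general")]
print(batch_negatives(emb, pairs))
```

Restricting the argmax to the mini-batch avoids scanning the whole vocabulary for every update while still finding hard negatives.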
Training: 113k word pairs from PPDB (XL)
Tuning: WordSim353
Test: SimLex-999
Notes:
1. trained with AdaGrad; tuned stepsize, mini-batch size, and regularization
2. initialized with 25-dim skip-gram vectors trained on Wikipedia
3. statistical significance computed using the one-tailed method of Steiger (1980)
4. output of training: "paragram" embeddings
Example training pairs:
contamination | pollution
converged | convergence
captioned | subtitled
… | …
Results: SimLex-999 (Spearman's ρ × 100)
[Bar chart:]
skip-gram (25-dim): 21
skip-gram (1000-dim): 38
Hill et al. (2014): 52
paragram (25-dim): 56
Average Human: 65.1
Scaling up to 300 dimensions
Training: 170k word pairs from PPDB (XL)
Tuning: WordSim353
Test: SimLex-999
Notes:
1. replaced the dot product in the objective with cosine similarity
2. trained with AdaGrad; tuned stepsize, mini-batch size, margin, and regularization
3. initialized with 300-dim GloVe common-crawl embeddings
4. output of training: "paragram-ws353" embeddings ("paragram-sl999" if tuned on SimLex-999)
Example training pairs: contamination | pollution; converged | convergence; captioned | subtitled; …
Results: SimLex-999, 300 dimensions (Spearman's ρ × 100)
[Bar chart:]
GloVe: 37.6
Schwartz et al. 2015: 56.3
Faruqui and Dyer 2015: 57.8
paragram-ws353: 66.7
paragram-sl999: 68.5
Average Human: 65.1
Results: WordSim-353 (Spearman's ρ × 100)
Tune on SimLex-999, test on WordSim-353
[Bar chart:]
GloVe: 57.9
Faruqui et al. 2015: 68.1
Huang et al. 2012: 71.3
paragram-sl999: 72
paragram-ws353: 76.9
Average Human: 75.6
Extrinsic Evaluation: Sentiment Analysis
Stanford Sentiment Treebank, binary classification
convolutional neural network (Kim, 2014) with 200 unigram filters; static: no fine-tuning of word vectors

25-dimension case:
word vectors | dimensionality | accuracy
skip-gram | 25 | 77.0
skip-gram | 50 | 79.6
paragram | 25 | 80.9

300-dimension case:
word vectors | dimensionality | accuracy
GloVe | 300 | 81.4
paragram-ws353 | 300 | 83.9
paragram-sl999 | 300 | 84.0
Embedding Phrases?
We compare standard approaches:
• vector addition
• recursive neural network (RvNN) (Socher et al., 2011) (requires a binarized parse; we use the Stanford parser)
• recurrent neural network (RtNN)
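The three composition functions can be sketched as follows. This is a minimal illustration: the tanh nonlinearity, parameter shapes, and random initialization follow common practice for these models and are assumptions, not necessarily the exact configuration used in the talk:

```python
import numpy as np

def compose_addition(word_vecs):
    """Vector addition: the phrase embedding is the sum of word embeddings."""
    return np.sum(word_vecs, axis=0)

def compose_rvnn(tree, W, b):
    """Recursive NN (RvNN) over a binarized parse: leaves are word
    vectors; each internal node merges its two children."""
    if isinstance(tree, np.ndarray):      # leaf: a word vector
        return tree
    left = compose_rvnn(tree[0], W, b)
    right = compose_rvnn(tree[1], W, b)
    return np.tanh(W @ np.concatenate([left, right]) + b)

def compose_rtnn(word_vecs, W_h, W_x, b):
    """Recurrent NN (RtNN): read words left to right; the final hidden
    state is the phrase embedding."""
    h = np.zeros_like(b)
    for x in word_vecs:
        h = np.tanh(W_h @ h + W_x @ x + b)
    return h

d = 4
rng = np.random.default_rng(0)
v1, v2 = rng.standard_normal(d), rng.standard_normal(d)
W, b = rng.standard_normal((d, 2 * d)), rng.standard_normal(d)
W_h, W_x = rng.standard_normal((d, d)), rng.standard_normal((d, d))

added = compose_addition([v1, v2])
tree_vec = compose_rvnn((v1, v2), W, b)       # parse: (v1 v2)
seq_vec = compose_rtnn([v1, v2], W_h, W_x, b)
```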
Loss Functions for Phrases
• replace word vectors by phrase vectors (computed by RvNN, RtNN, etc.)
• sum over phrase pairs in PPDB
• we regularize by penalizing squared L2 distance to the initial (skip-gram) embeddings, plus L2 regularization on the composition parameters
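Reconstructed from the bullets above, the phrase objective mirrors the word objective, with a composition function g(·) producing phrase vectors; this is a paraphrase of the slide's image-only equation, not a verbatim copy:

```latex
\min_{W,\,\theta}\;\frac{1}{|X|}\sum_{\langle p_1, p_2\rangle \in X}
  \Big[ \max\!\big(0,\ \delta - g(p_1)\cdot g(p_2) + g(p_1)\cdot g(t_1)\big)
      + \max\!\big(0,\ \delta - g(p_1)\cdot g(p_2) + g(p_2)\cdot g(t_2)\big) \Big]
  + \lambda_W \lVert W - W_{\text{initial}} \rVert^2
  + \lambda_\theta \lVert \theta \rVert^2
```

where X now contains phrase pairs from PPDB, g is the composition model (addition, RvNN, RtNN, …) with parameters θ, and the two regularizers correspond to the two penalties listed above.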
Training: bigram pairs extracted from PPDB
Tuning: MLSim (Mitchell & Lapata, 2010)
Test: MLPara
adjective noun (134k): easy job | simple task
noun noun (36k): town meeting | town council
verb noun (63k): achieve goal | achieve aim
Notes: we extract bigram pairs of each type from PPDB using a part-of-speech tagger; when tuning/testing on one subset, we train only on bigram pairs of that subset.
Results: MLPara (Spearman's ρ × 100)
averages over three data splits: adj noun, noun noun, verb noun
[Bar chart:]
skip-gram (25), +: 36
skip-gram (1000), +: 45
Hashimoto et al. (2014): 41
paragram (25), +: 46
paragram (25), RNN: 52
Average Human: 75
Results: MLPara, 300 dimensions (Spearman's ρ × 100)
averages over three data splits: adj noun, noun noun, verb noun
[Bar chart:]
GloVe: 40
paragram-ws353, +: 51
paragram-sl999, +: 52
paragram (25), RNN: 52
Average Human: 75
Training: 60k phrase pairs from PPDB
Tuning: 260 annotated phrase pairs
Test: 1000 annotated phrase pairs
Example phrase pairs:
that allow the | which enable the
be given the opportunity to | have the possibility of
i can hardly hear you . | you 're breaking up .
and the establishment | as well as the development
laying the foundations | pave the way
making every effort | to do its utmost
… | …
Results: AnnoPPDB (Spearman's ρ × 100)
support vector regression to predict gold similarities; 5-fold cross-validation on the 260-example dev set

25-dimension case:
[Bar chart:]
skip-gram (25): 20
PPDB: 25
PPDB (tuned): 33
paragram (25), +: 32
paragram (25), RtNN: 39
paragram (25), RvNN: 40

300-dimension case:
[Bar chart:]
PPDB: 25
paragram (25), RtNN: 40
paragram-ws353: 43
paragram-sl999: 41
RtNN (300): 49
LSTM (300): 52
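The evaluation setup (regressing gold similarities with an SVR, 5-fold cross-validated on the 260-example dev set) might look roughly like this in scikit-learn. The features here are random placeholders, since the slides do not list the feature set:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# Hypothetical per-pair features, e.g. the model's cosine score plus
# simple surface features; the real feature set is an assumption here.
rng = np.random.default_rng(0)
X = rng.random((260, 3))                          # 260-example dev set
y = 1.0 + 4.0 * X[:, 0] + 0.1 * rng.random(260)   # synthetic gold scores

svr = SVR(kernel="rbf")
scores = cross_val_score(svr, X, y, cv=5)         # 5-fold cross-validation
print(scores)
```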
Qualitative Analysis: For positive examples, the addition model outperforms the RvNN when the phrases 1) have similar length and 2) have more "synonyms" in common.

RvNN is better:
phrase 1 | phrase 2 | gold | RvNN | +
does not exceed | is no more than | 5.0 | 4.8 | 3.5
could have an impact on | may influence | 4.6 | 4.2 | 3.2
earliest opportunity | early as possible | 4.4 | 4.3 | 2.9

Addition is better:
phrase 1 | phrase 2 | gold | RvNN | +
scheduled to be held in | that will take place in | 4.6 | 2.9 | 4.4
according to the paper , | the newspaper reported that | 4.6 | 2.8 | 4.1
's surname | family name of | 4.4 | 2.8 | 4.1
Conclusion
Our work shows how to use PPDB to:
1) create word embeddings that have human-level performance on SimLex-999 and WordSim-353
2) create compositional paraphrase models that improve human correlation on PPDB 1.0 from 25 to 52 ρ
We have also released two new datasets for evaluating short-phrase paraphrase models.
Ongoing work: phrase model improvements, off-the-shelf testing on downstream tasks.
Thanks!