From Paraphrase Database to Compositional Paraphrase Model and Back
John Wieting, University of Illinois
Joint work with Mohit Bansal, Kevin Gimpel, Karen Livescu, and Dan Roth
Motivation
The PPDB (Ganitkevitch et al., 2013) is a vast collection of paraphrase pairs:
that allow the | which enable the
be given the opportunity to | have the possibility of
i can hardly hear you . | you 're breaking up .
and the establishment | as well as the development
laying the foundations | pave the way
making every effort | to do its utmost
… | …
Motivation
• Improve coverage
• Have a parametric model
• Improve phrase pair scores
Contributions
• Powerful word embeddings that have human-level performance on SimLex-999 and WordSim-353
• Phrase embeddings
• Model can re-rank phrases in PPDB 1.0 (improves human correlation from 25 to 52 ρ)
• Parameterization of PPDB that can be used downstream
• New datasets
Datasets
We wanted a clean way to evaluate paraphrase composition.
Two new datasets: one for bigram paraphrases and one for short-phrase paraphrases from PPDB.
[Diagram: evaluation datasets by unit and similarity type (topical vs. paraphrastic) —
Words: WordSim353 (topical), SimLex-999 (paraphrastic)
Bigrams: MLSim (Mitchell and Lapata, 2010) (topical), MLPara (this talk) (paraphrastic)]

Example bigram pairs (MLSim score | MLPara score):
television programme | tv set | 5.8 | 1.0
training programme | education course | 5.7 | 5.0
bedroom window | education officer | 1.3 | 1.0
Inter-annotator agreement on MLPara:
| Spearman's ρ | Cohen's κ
adjective noun | 0.87 | 0.79
noun noun | 0.64 | 0.58
verb noun | 0.73 | 0.73
[Diagram adds a Phrases row: AnnoPPDB (this talk), paraphrastic]
AnnoPPDB (this talk)
Example phrase pairs with average human scores:
can not be separated from | is inseparable from | 5.0
hoped to be able to | looked forward to | 3.4
come on , think about it | people , please | 2.2
how do you mean that | what worst feelings | 1.6

Mean deviation among annotators: 0.60

Dev and test sets were designed to have:
1) a variety of lengths
2) a variety of quality
3) low word overlap

See Pavlick et al., 2015 for a similar but larger dataset.
Learning Embeddings
We now have datasets to test paraphrase similarity. Next we learn to embed words and phrases.
All similarities are computed using cosine similarity.
Related work on using PPDB to improve word embeddings: Yu and Dredze, 2014; Faruqui et al., 2015.
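To make the scoring concrete, here is a minimal cosine-similarity sketch. The toy 3-dimensional vectors are invented purely for illustration (real paragram vectors are 25- or 300-dimensional):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, invented for illustration only.
contamination = np.array([0.9, 0.1, 0.2])
pollution = np.array([0.8, 0.2, 0.1])
villain = np.array([-0.5, 0.9, 0.3])

# A paraphrase pair should score higher than an unrelated pair.
print(cosine_similarity(contamination, pollution))
print(cosine_similarity(contamination, villain))
```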
Training examples (word pairs from PPDB):
contamination pollution
converged convergence
captioned subtitled
outwit thwart
bad villain
broad general
permanent permanently
bed sack
carefree reckless
absolutely urgently
… …
Loss Function for Learning
The loss (shown as an equation on the slide):
• sums over word pairs in PPDB
• the paraphrase pair itself is the positive example
• negative examples are drawn from other words
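The equation itself survives only as an image in the slides; reconstructed in the paper's notation (a best-effort reconstruction, not a verbatim copy), the word-level objective is a margin-based hinge loss:

```latex
\min_{W}\;\frac{1}{|X|}\sum_{\langle x_1, x_2\rangle \in X}
  \Big[ \max\!\big(0,\ \delta - x_1 \cdot x_2 + x_1 \cdot t_1\big)
      + \max\!\big(0,\ \delta - x_1 \cdot x_2 + x_2 \cdot t_2\big) \Big]
  + \lambda \,\lVert W - W_{\text{initial}} \rVert^2
```

where the sum is over word pairs ⟨x1, x2⟩ in PPDB (the positive example), t1 and t2 are negative examples, δ is the margin, and the final term regularizes toward the initial embeddings.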
Choosing Negative Examples?
• only do the argmax over the current mini-batch (for efficiency)
• we regularize by penalizing squared L2 distance to the initial embeddings
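A minimal sketch of the in-batch argmax strategy, assuming embeddings are stored in a dict keyed by word; the function name and data layout here are hypothetical:

```python
import numpy as np

def batch_negatives(emb, batch_pairs):
    """For each pair (x1, x2) in the mini-batch, pick the most confusable
    negative for x1: the other in-batch word whose embedding has the
    highest dot product with emb[x1] (excluding x1 and its paraphrase)."""
    words = sorted({w for pair in batch_pairs for w in pair})
    negatives = []
    for x1, x2 in batch_pairs:
        best, best_score = None, -np.inf
        for t in words:
            if t in (x1, x2):
                continue
            score = float(np.dot(emb[x1], emb[t]))
            if score > best_score:
                best, best_score = t, score
        negatives.append(best)
    return negatives

# Tiny demo with made-up 2-dim embeddings.
emb = {"bad": np.array([1.0, 0.0]), "villain": np.array([0.9, 0.1]),
       "broad": np.array([0.1, 1.0]), "general": np.array([0.0, 1.0])}
pairs = [("bad", "villain"), ("broad", "general")]
print(batch_negatives(emb, pairs))
```

Restricting the argmax to the mini-batch avoids scanning the whole vocabulary for every update while still finding hard negatives.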
Training: 113k word pairs from PPDB (XL)
Tuning: WordSim353
Test: SimLex-999
Notes:
1. trained with AdaGrad; tuned stepsize, mini-batch size, and regularization
2. initialized with 25-dim skip-gram vectors trained on Wikipedia
3. statistical significance computed using the one-tailed method of Steiger (1980)
4. output of training: "paragram" embeddings
Example training pairs:
contamination | pollution
converged | convergence
captioned | subtitled
… | …
Results: SimLex-999 (Spearman's ρ × 100)
[Bar chart:]
skip-gram (25-dim): 21
skip-gram (1000-dim): 38
Hill et al. (2014): 52
paragram (25-dim): 56
Average Human: 65.1
Scaling up to 300 dimensions
Training: 170k word pairs from PPDB (XL)
Tuning: WordSim353
Test: SimLex-999
Notes:
1. replaced the dot product in the objective with cosine similarity
2. trained with AdaGrad; tuned stepsize, mini-batch size, margin, and regularization
3. initialized with 300-dim GloVe common-crawl embeddings
4. output of training: "paragram-ws353" embeddings ("paragram-sl999" if tuned on SimLex-999)
Example training pairs: contamination | pollution; converged | convergence; captioned | subtitled; …
Results: SimLex-999, 300 dimensions (Spearman's ρ × 100)
[Bar chart:]
GloVe: 37.6
Schwartz et al. 2015: 56.3
Faruqui and Dyer 2015: 57.8
paragram-ws353: 66.7
paragram-sl999: 68.5
Average Human: 65.1
Results: WordSim-353 (Spearman's ρ × 100)
Tune on SimLex-999, test on WordSim-353
[Bar chart:]
GloVe: 57.9
Faruqui et al. 2015: 68.1
Huang et al. 2012: 71.3
paragram-sl999: 72
paragram-ws353: 76.9
Average Human: 75.6
Extrinsic Evaluation: Sentiment Analysis
Stanford Sentiment Treebank, binary classification
convolutional neural network (Kim, 2014) with 200 unigram filters; static: no fine-tuning of word vectors

25-dimension case:
word vectors | dimensionality | accuracy
skip-gram | 25 | 77.0
skip-gram | 50 | 79.6
paragram | 25 | 80.9

300-dimension case:
word vectors | dimensionality | accuracy
GloVe | 300 | 81.4
paragram-ws353 | 300 | 83.9
paragram-sl999 | 300 | 84.0
Embedding Phrases?
We compare standard approaches:
• vector addition
• recursive neural network (RvNN) (Socher et al., 2011) (requires a binarized parse; we use the Stanford parser)
• recurrent neural network (RtNN)
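The three composition functions can be sketched as follows. This is a minimal illustration: the tanh nonlinearity, parameter shapes, and random initialization follow common practice for these models and are assumptions, not necessarily the exact configuration used in the talk:

```python
import numpy as np

def compose_addition(word_vecs):
    """Vector addition: the phrase embedding is the sum of word embeddings."""
    return np.sum(word_vecs, axis=0)

def compose_rvnn(tree, W, b):
    """Recursive NN (RvNN) over a binarized parse: leaves are word
    vectors; each internal node merges its two children."""
    if isinstance(tree, np.ndarray):      # leaf: a word vector
        return tree
    left = compose_rvnn(tree[0], W, b)
    right = compose_rvnn(tree[1], W, b)
    return np.tanh(W @ np.concatenate([left, right]) + b)

def compose_rtnn(word_vecs, W_h, W_x, b):
    """Recurrent NN (RtNN): read words left to right; the final hidden
    state is the phrase embedding."""
    h = np.zeros_like(b)
    for x in word_vecs:
        h = np.tanh(W_h @ h + W_x @ x + b)
    return h

d = 4
rng = np.random.default_rng(0)
v1, v2 = rng.standard_normal(d), rng.standard_normal(d)
W, b = rng.standard_normal((d, 2 * d)), rng.standard_normal(d)
W_h, W_x = rng.standard_normal((d, d)), rng.standard_normal((d, d))

added = compose_addition([v1, v2])
tree_vec = compose_rvnn((v1, v2), W, b)       # parse: (v1 v2)
seq_vec = compose_rtnn([v1, v2], W_h, W_x, b)
```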
Loss Functions for Phrases
• replace word vectors by phrase vectors (computed by RvNN, RtNN, etc.)
• sum over phrase pairs in PPDB
• we regularize by penalizing squared L2 distance to the initial (skip-gram) embeddings, plus L2 regularization on the composition parameters
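Reconstructed from the bullets above, the phrase objective mirrors the word objective, with a composition function g(·) producing phrase vectors; this is a paraphrase of the slide's image-only equation, not a verbatim copy:

```latex
\min_{W,\,\theta}\;\frac{1}{|X|}\sum_{\langle p_1, p_2\rangle \in X}
  \Big[ \max\!\big(0,\ \delta - g(p_1)\cdot g(p_2) + g(p_1)\cdot g(t_1)\big)
      + \max\!\big(0,\ \delta - g(p_1)\cdot g(p_2) + g(p_2)\cdot g(t_2)\big) \Big]
  + \lambda_W \lVert W - W_{\text{initial}} \rVert^2
  + \lambda_\theta \lVert \theta \rVert^2
```

where X now contains phrase pairs from PPDB, g is the composition model (addition, RvNN, RtNN, …) with parameters θ, and the two regularizers correspond to the two penalties listed above.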
Training: bigram pairs extracted from PPDB
Tuning: MLSim (Mitchell & Lapata, 2010)
Test: MLPara
adjective noun (134k): easy job | simple task
noun noun (36k): town meeting | town council
verb noun (63k): achieve goal | achieve aim
Notes: we extract bigram pairs of each type from PPDB using a part-of-speech tagger; when tuning/testing on one subset, we train only on bigram pairs of that subset.
Results: MLPara (Spearman's ρ × 100)
averages over three data splits: adj noun, noun noun, verb noun
[Bar chart:]
skip-gram (25), +: 36
skip-gram (1000), +: 45
Hashimoto et al. (2014): 41
paragram (25), +: 46
paragram (25), RNN: 52
Average Human: 75
Results: MLPara, 300 dimensions (Spearman's ρ × 100)
averages over three data splits: adj noun, noun noun, verb noun
[Bar chart:]
GloVe: 40
paragram-ws353, +: 51
paragram-sl999, +: 52
paragram (25), RNN: 52
Average Human: 75
Training: 60k phrase pairs from PPDB
Tuning: 260 annotated phrase pairs
Test: 1000 annotated phrase pairs
Example phrase pairs:
that allow the | which enable the
be given the opportunity to | have the possibility of
i can hardly hear you . | you 're breaking up .
and the establishment | as well as the development
laying the foundations | pave the way
making every effort | to do its utmost
… | …
Results: AnnoPPDB (Spearman's ρ × 100)
support vector regression to predict gold similarities; 5-fold cross-validation on the 260-example dev set

25-dimension case:
[Bar chart:]
skip-gram (25): 20
PPDB: 25
PPDB (tuned): 33
paragram (25), +: 32
paragram (25), RtNN: 39
paragram (25), RvNN: 40

300-dimension case:
[Bar chart:]
PPDB: 25
paragram (25), RtNN: 40
paragram-ws353: 43
paragram-sl999: 41
RtNN (300): 49
LSTM (300): 52
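The evaluation setup (regressing gold similarities with an SVR, 5-fold cross-validated on the 260-example dev set) might look roughly like this in scikit-learn. The features here are random placeholders, since the slides do not list the feature set:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# Hypothetical per-pair features, e.g. the model's cosine score plus
# simple surface features; the real feature set is an assumption here.
rng = np.random.default_rng(0)
X = rng.random((260, 3))                          # 260-example dev set
y = 1.0 + 4.0 * X[:, 0] + 0.1 * rng.random(260)   # synthetic gold scores

svr = SVR(kernel="rbf")
scores = cross_val_score(svr, X, y, cv=5)         # 5-fold cross-validation
print(scores)
```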
Qualitative Analysis: For positive examples, the addition model outperforms the RvNN when the phrases 1) have similar length and 2) have more "synonyms" in common.

RvNN is better:
phrase 1 | phrase 2 | gold | RvNN | +
does not exceed | is no more than | 5.0 | 4.8 | 3.5
could have an impact on | may influence | 4.6 | 4.2 | 3.2
earliest opportunity | early as possible | 4.4 | 4.3 | 2.9

Addition is better:
phrase 1 | phrase 2 | gold | RvNN | +
scheduled to be held in | that will take place in | 4.6 | 2.9 | 4.4
according to the paper , | the newspaper reported that | 4.6 | 2.8 | 4.1
's surname | family name of | 4.4 | 2.8 | 4.1
Conclusion
Our work shows how to use PPDB to:
1) create word embeddings that have human-level performance on SimLex-999 and WordSim-353
2) create compositional paraphrase models that improve human correlation on PPDB 1.0 from 25 to 52 ρ
We have also released two new datasets for evaluating short-phrase paraphrase models.
Ongoing work: phrase model improvements, off-the-shelf testing on downstream tasks.
Thanks!