
Notes on (Socher's) RNN-based models of semantic composition
Chris Potts, Ling 236/Psych 236c: Representations of meaning, Spring 2013

May 23

Overview

These notes attempt to make concrete the models and experiments of Socher et al. 2012 (the MV-RNN model). I also draw on Socher et al. 2011 (the basic RNN model).

1 Relationships between models

Based on slides from Richard¹ and Socher et al. 2012, §2.2:

(1) MV-RNN (Socher et al. 2012):

p = f(W [Ba; Ab])

(writing [x; y] for the vertical concatenation of the daughter terms)

• f is sigmoid (ℝ → [0, 1]) or tanh (ℝ → [−1, 1]), applied element-wise.
• W, A, and B all do compositional work.

(2) RNN (Socher et al. 2011):

p = f(W [I_{n×n} a; I_{n×n} b]) = f(W [a; b])

• f is sigmoid or tanh.
• A and B are trivialized (fixed to the identity I_{n×n}); W does all of the compositional work.

(3) Mitchell & Lapata 2010:

p = Ba + Ab = identity([I_{n×n} I_{n×n}] [Ba; Ab])

• No transformation to the parent vector.
• No global composition matrix (the identities trivialize it).
• Each daughter contributes a lexical matrix.

(4) Baroni & Zamparelli 2010: A is an adjective matrix, b is a noun vector:

p = Ab = identity([0_{n×n} I_{n×n}] [Ba; Ab])

• No transformation to the parent vector.
• The global composition matrix cancels any contribution made by Ba and is identity on the contribution of Ab.
• The contribution of the adjective is solely its matrix.
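To make these relationships concrete in code, here is a minimal NumPy sketch (my own illustration, not Socher et al.'s implementation; the names and the toy dimensionality are assumptions) of the shared composition step p = f(W [Ba; Ab]), with (2)–(4) recovered by fixing particular choices of f, W, A, and B.

    import numpy as np

    def mvrnn_compose(a, A, b, B, W, f=np.tanh):
        # MV-RNN composition: p = f(W [Ba; Ab]), with [x; y] = vertical stacking.
        return f(W @ np.concatenate([B @ a, A @ b]))

    n = 4
    rng = np.random.default_rng(0)
    a, b = rng.normal(size=n), rng.normal(size=n)             # daughter vectors
    A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))   # daughter matrices
    I, Z = np.eye(n), np.zeros((n, n))
    W = rng.normal(size=(n, 2 * n))                           # global composition matrix

    p_mvrnn = mvrnn_compose(a, A, b, B, W)                                  # (1) MV-RNN
    p_rnn   = mvrnn_compose(a, I, b, I, W)                                  # (2) RNN: A = B = I
    p_ml    = mvrnn_compose(a, A, b, B, np.hstack([I, I]), f=lambda x: x)   # (3) Ba + Ab
    p_bz    = mvrnn_compose(a, A, b, B, np.hstack([Z, I]), f=lambda x: x)   # (4) Ab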

¹ http://www.stanford.edu/class/cs224u/slides/2013-01-21-224U-RichardSocher.pdf


2 Propositional logic (§3.2)

Socher et al. (2012) present this in terms of the MV-RNN model, but I think the RNN suffices. (To present this model as an MV-RNN, assume that all of the matrices are identities.) I found W by just playing around with equations until I hit a viable solution. Socher et al. used L-BFGS.

(5) T = 1

(6) F = 0

(7) ¬ = 1

(8) ∧ = 1

(9) W = [1, −1]

(10) g(x) = max(min(x, 1), 0)

(11) Negation:

• ¬T: g(W [1; 1]) = g(1 − 1) = g(0) = 0 = F   (daughters ¬ = 1, T = 1)
• ¬F: g(W [1; 0]) = g(1) = 1 = T   (daughters ¬ = 1, F = 0)

(12) Conjunction:

The connective composes first with its right sister; the result then composes with the left conjunct:

• T ∧ T: [∧ T] = g(W [1; 1]) = g(0) = 0, then g(W [1; 0]) = g(1) = 1 = T
• T ∧ F: [∧ F] = g(W [1; 0]) = g(1) = 1, then g(W [1; 1]) = g(0) = 0 = F
• F ∧ T: [∧ T] = g(W [1; 1]) = g(0) = 0, then g(W [0; 0]) = g(0) = 0 = F
• F ∧ F: [∧ F] = g(W [1; 0]) = g(1) = 1, then g(W [0; 1]) = g(−1) = 0 = F
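A quick check of (5)–(12) in code (a minimal sketch of my own, not Socher et al.'s setup): with W = [1, −1] and the clipping function g, the negation and conjunction trees come out as expected.

    import numpy as np

    W = np.array([1.0, -1.0])                      # (9)
    g = lambda x: max(min(x, 1.0), 0.0)            # (10): clip to [0, 1]
    compose = lambda left, right: g(W @ np.array([left, right]))

    T, F, NOT, AND = 1.0, 0.0, 1.0, 1.0            # (5)-(8)

    # Negation: the tree [¬ X].
    assert compose(NOT, T) == F                    # ¬T = F
    assert compose(NOT, F) == T                    # ¬F = T

    # Conjunction: the tree [X [∧ Y]]; ∧ composes with its right sister first.
    for x in (T, F):
        for y in (T, F):
            assert compose(x, compose(AND, y)) == (T if (x == T and y == T) else F)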

Question   Could this be done without a global matrix W? If the value obtained by combining any two nodes of dimension n must itself be a vector of dimension n, then it seems impossible to simulate the two-place nature of binary connectives.


3 Adverb–adjective combinations

Additive combination with (scaled) identity matrices:

  [1 0; 0 1] ⟦extremely⟧ᵀ + [2 0; 0 2] ⟦good⟧ᵀ = [4, −8]ᵀ
  [1 0; 0 1] ⟦extremely⟧ᵀ + [2 0; 0 2] ⟦awful⟧ᵀ = [1, −4]ᵀ

A non-diagonal matrix for the adverb:

  ⟦extremely⟧ = [8 1; 0 2]
  ⟦extremely⟧ ⟦good⟧ᵀ = [2, 4]ᵀ
  ⟦extremely⟧ ⟦awful⟧ᵀ = [4, −8]ᵀ

Figure 1: A toy example to illustrate the effects of matrix multiplication.
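The lexical vectors for good and awful are not fully recoverable from this copy of Figure 1, so the sketch below uses hypothetical two-dimensional values simply to contrast an additive combination (scaled identities, so dimensions never interact) with multiplication by a non-diagonal matrix for extremely, whose effect depends on the adjective it applies to.

    import numpy as np

    # Hypothetical 2-d lexical vectors (not the figure's actual values).
    good  = np.array([ 2.0, 4.0])
    awful = np.array([-2.0, 3.0])
    extremely_vec = np.array([1.0, 1.0])

    # Additive-style combination: scaled identities shift every adjective the same way.
    p_add_good  = np.eye(2) @ extremely_vec + 2 * np.eye(2) @ good
    p_add_awful = np.eye(2) @ extremely_vec + 2 * np.eye(2) @ awful

    # MV-RNN-style combination: a full matrix for "extremely" mixes and rescales
    # the adjective's dimensions, so its effect depends on the adjective.
    extremely_mat = np.array([[8.0, 1.0],
                              [0.0, 2.0]])
    p_mat_good  = extremely_mat @ good    # [20., 8.]
    p_mat_awful = extremely_mat @ awful   # [-13., 6.]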

Socher et al.'s (2012) adverb–adjective experiments show off the power of the MV-RNN model.²

Figure 2: Adverb–adjective distributions. [Six panels, one per bigram, plot the rating distribution (x-axis: rating from −4.5 to 4.5; y-axis: (Count/Total) / sum(Count/Total)) for the bigram and for its component words: really good (25911 tokens), not good (7739 tokens), not amazing (281 tokens), really bad (9943 tokens), not bad (10681 tokens), not terrible (861 tokens).]

² Data and code: http://nasslli2012.christopherpotts.net/composition.html, http://www.socher.org/index.php/Main/SemanticCompositionalityThroughRecursiveMatrix-VectorSpaces


Softmax layer   To predict the distribution over star-ratings, Socher et al. (2012) train a logistic classifier on the parent vectors, which maps them to [0, 1], a normalized version of the ratings 1 . . . 10. This is the technique I described on p. 18 of the handout 'Distributional approaches to word meanings', and it is basically what Baroni et al. (2012) do with their VSMs.
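As a minimal sketch of that prediction layer (my own illustration; the parameter names and sizes are assumptions, not Socher et al.'s code), a softmax classifier maps a composed parent vector to a distribution over the ten rating bins:

    import numpy as np

    def softmax(z):
        z = z - z.max()              # stabilize before exponentiating
        e = np.exp(z)
        return e / e.sum()

    def predict_rating_dist(p, W_s, b_s):
        # p: composed parent vector (length n); returns a distribution over rating bins.
        return softmax(W_s @ p + b_s)

    n, n_bins = 8, 10
    rng = np.random.default_rng(0)
    W_s, b_s = rng.normal(size=(n_bins, n)), np.zeros(n_bins)
    dist = predict_rating_dist(rng.normal(size=n), W_s, b_s)   # sums to 1 over the 10 bins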

Method          Avg KL
Uniform         0.327
Mean train      0.193
p = ½(a + b)    0.103
p = a ⊙ b       0.103
p = [a; b]      0.101
p = Ab          0.103
RNN             0.093
Linear MVR      0.092
MV-RNN          0.091

[Nine panels plot predicted sentiment distributions over 1–10 stars (x-axis; y-axis 0–0.5) for fairly/not/unbelievably crossed with annoying/awesome/sad, with curves for the MV-RNN, the RNN, and the training pair.]

Figure 3: Left: Average KL-divergence for predicting sentiment distributions of unseen adverb–adjective pairs of the test set. See text for p descriptions. Lower is better. The main difference in the KL divergence comes from the few negation pairs in the test set. Right: Predicting sentiment distributions (over 1–10 stars on the x-axis) of adverb–adjective pairs. Each row has the same adverb and each column the same adjective. Many predictions are similar between the two models. The RNN and linear MVR are not able to modify the sentiment correctly: not awesome is more positive than fairly awesome and not annoying has a similar shape as unbelievably annoying. Predictions of the linear MVR model are almost identical to the standard RNN for these examples.

defined as KL(g‖p) = Σᵢ gᵢ log(gᵢ/pᵢ), where g is the gold distribution and p is the predicted one.

We compare to several baselines and ablations of the MV-RNN model. An (adverb, adjective) pair is described by its vectors (a, b) and matrices (A, B).
1. p = 0.5(a + b), vector average
2. p = a ⊙ b, element-wise vector multiplication
3. p = [a; b], vector concatenation
4. p = Ab, similar to (Baroni and Lenci, 2010)
5. p = g(W[a; b]), RNN, similar to Socher et al.
6. p = Ab + Ba, Linear MVR, similar to (Mitchell and Lapata, 2010; Zanzotto et al., 2010)
7. p = g(W[Ba; Ab]), MV-RNN
The final distribution is always predicted by a softmax classifier whose inputs p vary for each of the models. This objective function (see Sec. 2.4) is different to all previously published work except that of (Socher et al., 2011c).

We cross-validated all models over regularization parameters for word vectors, the softmax classifier, the RNN parameter W and the word operators (10⁻⁴, 10⁻³) and word vector sizes (n = 6, 8, 10, 12, 15, 20). All models performed best at vector sizes of below 12. Hence, it is the model's power and not the number of parameters that determines the performance. The table in Fig. 3 shows the average KL-divergence on the test set. It shows that the idea of matrix-vector representations for all words and having a nonlinearity are both important. The MV-RNN which combines these two ideas is best able to learn the various compositional effects. The main difference in KL divergence comes from the few negation cases in the test set. Fig. 3 shows examples of predicted distributions. Many of the predictions are accurate and similar between the top models. However, only the MV-RNN has enough expressive power to allow negation to completely shift the sentiment with respect to an adjective. A negated adjective carrying negative sentiment becomes slightly positive, whereas not awesome is correctly attenuated. All three top models correctly capture the U-shape of unbelievably sad. This pair peaks at both the negative and positive spectrum because it is ambiguous. When referring to the performance of actors, it is very negative, but, when talking about the plot, many people enjoy sad and thought-provoking movies. The p = Ab model does not perform well because it cannot model the fact that for an adjective like "sad," the operator of "unbelievably" behaves differently.


Figure 3: From Socher et al. (2012), p. 1206.
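The evaluation metric in the excerpt above is the average of KL(g‖p) over test pairs. Here is a minimal sketch of that computation (my own illustration; the smoothing constant and function names are assumptions):

    import numpy as np

    def kl(g, p, eps=1e-12):
        # KL(g || p) = sum_i g_i log(g_i / p_i), with clipping to avoid log(0).
        g = np.clip(np.asarray(g, dtype=float), eps, None); g = g / g.sum()
        p = np.clip(np.asarray(p, dtype=float), eps, None); p = p / p.sum()
        return float(np.sum(g * np.log(g / p)))

    def avg_kl(gold_dists, predicted_dists):
        # Average KL-divergence over a test set of (gold, predicted) rating distributions.
        return float(np.mean([kl(g, p) for g, p in zip(gold_dists, predicted_dists)]))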

References

Baroni, Marco, Raffaella Bernardi, Ngoc-Quynh Do & Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 23–32. Avignon, France: ACL.

Baroni, Marco & Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective–noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP '10), 1183–1193. Stroudsburg, PA: Association for Computational Linguistics.

Mitchell, Jeff & Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science 34(8). 1388–1429.

Socher, Richard, Brody Huval, Christopher D. Manning & Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing, 1201–1211. Stroudsburg, PA: ACL.

Socher, Richard, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng & Christopher D. Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 151–161. Edinburgh, Scotland, UK: ACL.
