Getting Closer to Reality: Grounding and Interaction in Models of Human Language Acquisition

Afra Alishahi


TRANSCRIPT

Page 1

Getting Closer to Reality: Grounding and Interaction in Models of Human Language Acquisition

Afra Alishahi

Page 2

Getting Closer to Reality: Grounding and Interaction in Models of Human Language Acquisition

Afra Alishahi


Page 3

3

Can we simulate language learning in a naturalistic setting?

Page 4

4

Language Acquisition in the Wild

Patrz, łódka przybija do brzegu. ("Look, the boat is coming in to the shore.")

Page 5

Grounded Language Learning

• Acquire knowledge of language from utterances grounded in perceptual context

• What linguistic representations are learned?

• Historically, linguistic formalisms were given, and models were adapted to these formalisms.

• Example: how to induce a Context Free Grammar from a corpus?

• How do we know that humans use similar representations?

5

Page 6

An Alternative Approach

• Neural models of language allow general-purpose architectures to be applied to different language tasks

• What type(s) of linguistic knowledge are necessary for performing a given task?

6

Page 7

Focus of this Talk

7

How to simulate grounded language learning?

How to analyze emerging knowledge in neural models of language?

Page 8

A Naturalistic Scenario

8

Look, they are washing the elephants!

• Where to get the input data from?

Page 9

Using Images of Natural Scenes

• Datasets such as Flickr30K and MS-COCO provide images of natural scenes paired with textual descriptions

• Image-caption pairs as proxy for linguistic & perceptual context

9

Page 10

Imaginet: Learning by Processing

10

Architecture sketch: the input words w1, w2, w3, … feed shared word representations into a visual pathway that predicts the image and a linguistic pathway that predicts the next word w_{t+1}.

Computational Linguistics, Volume 43, Number 4

Formally, each sentence is mapped to two sequences of hidden states, one by VISUAL and the other by TEXTUAL:

$h^V_t = \mathrm{GRU}_V(h^V_{t-1}, x_t)$   (5)

$h^T_t = \mathrm{GRU}_T(h^T_{t-1}, x_t)$   (6)

At each time step TEXTUAL predicts the next word in the sentence S from its current hidden state $h^T_t$, and VISUAL predicts the image vector[1] $i$ from its last hidden representation $h^V_\tau$:

$\hat{i} = V h^V_\tau$   (7)

$p(S_{t+1} \mid S_{1:t}) = \mathrm{softmax}(L h^T_t)$   (8)

The loss function is a multi-task objective that penalizes error on the visual and the textual targets simultaneously. The objective combines the cross-entropy loss $\mathcal{L}_T$ for the word predictions and the cosine distance $\mathcal{L}_V$ for the image predictions,[2] weighting them with the parameter $\alpha$ (set to 0.1):

$\mathcal{L}_T(\theta) = -\frac{1}{\tau} \sum_{t=1}^{\tau} \log p(S_t \mid S_{1:t})$   (9)

$\mathcal{L}_V(\theta) = 1 - \frac{i \cdot \hat{i}}{\lVert i \rVert\, \lVert \hat{i} \rVert}$   (10)

$\mathcal{L} = \alpha \mathcal{L}_T + (1 - \alpha)\, \mathcal{L}_V$   (11)

For more details about the IMAGINET model and its performance, see Chrupała, Kádár, and Alishahi (2015). Note that we introduce a small change in the image representation: we observe that using standardized image vectors, where each dimension is transformed by subtracting the mean and dividing by the standard deviation, improves performance.

3.3 Unimodal Language Model

The model LM is a language model analogous to the TEXTUAL pathway of IMAGINET, with the difference that its word embeddings are not shared and its loss function is the cross-entropy on word prediction. Using this model we remove the visual objective as a factor, as the model does not use the images corresponding to the captions in any way.

3.4 Sum of Word Embeddings

The model SUM is a stripped-down version of the VISUAL pathway, which does not share word embeddings, only uses the cosine loss function, and replaces the GRU network with a summation over word embeddings. This removes the effect of word order from consideration.

[1] Representing the full image, extracted from the pre-trained CNN of Simonyan and Zisserman (2015).
[2] Note that the original formulation in Chrupała, Kádár, and Alishahi (2015) uses mean squared error instead; as the performance of VISUAL is measured on image retrieval (which is based on cosine distances), we use cosine distance as the visual loss here.
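To make the two-pathway setup concrete, here is a minimal PyTorch sketch of an IMAGINET-style model: shared word embeddings feed a TEXTUAL and a VISUAL GRU, and the loss combines Eqs. (9)–(11). This is an illustration under assumed layer sizes and names, not the original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImaginetSketch(nn.Module):
    """Illustrative two-pathway model with shared word embeddings (not the original code)."""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, img_dim=4096, alpha=0.1):
        super().__init__()
        self.alpha = alpha
        self.emb = nn.Embedding(vocab_size, emb_dim)              # shared word representations
        self.gru_T = nn.GRU(emb_dim, hid_dim, batch_first=True)   # TEXTUAL pathway
        self.gru_V = nn.GRU(emb_dim, hid_dim, batch_first=True)   # VISUAL pathway
        self.to_vocab = nn.Linear(hid_dim, vocab_size)            # matrix L in Eq. (8)
        self.to_image = nn.Linear(hid_dim, img_dim, bias=False)   # matrix V in Eq. (7)

    def forward(self, words, image):
        """words: (batch, T) token ids; image: (batch, img_dim) standardized CNN features."""
        x = self.emb(words)
        h_T, _ = self.gru_T(x)                   # hidden states of TEXTUAL, Eq. (6)
        _, h_V_last = self.gru_V(x)              # final hidden state of VISUAL, Eq. (5)
        logits = self.to_vocab(h_T[:, :-1])      # predict word t+1 from the state at t
        img_pred = self.to_image(h_V_last[-1])   # predicted image vector, Eq. (7)

        loss_T = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                 words[:, 1:].reshape(-1))            # Eq. (9), averaged
        loss_V = (1 - F.cosine_similarity(img_pred, image)).mean()    # Eq. (10)
        return self.alpha * loss_T + (1 - self.alpha) * loss_V        # Eq. (11)

# toy usage with random data
model = ImaginetSketch(vocab_size=1000)
loss = model(torch.randint(0, 1000, (8, 12)), torch.randn(8, 4096))
print(loss.item())
```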


Page 11

Image Representations

11

Image model

VGG-16 (Simonyan & Zisserman, 2014): image features are taken from the pre-classification layer, i.e. just before the softmax over object labels such as BOAT, BIRD, BOAR.

Page 12

What Does the Model Learn?

• Evaluate the model through a set of comprehension and production tasks

• Production: retrieve the closest words or phrases to an image

• Comprehension: retrieve the closest images to a word or multi-word phrase

• Estimate word pair similarity, and correlate with human judgments

12
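Both retrieval directions come down to nearest-neighbour search by cosine similarity in the learned embedding space. A small sketch with stand-in vectors (not the actual evaluation code):

```python
import numpy as np

def retrieve(query, candidates, k=3):
    """Return the indices of the k candidates closest to the query by cosine similarity."""
    q = query / np.linalg.norm(query)
    C = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argsort(-(C @ q))[:k]

rng = np.random.default_rng(0)
word_vec = rng.normal(size=128)             # stand-in for an embedded word or phrase
image_vecs = rng.normal(size=(1000, 128))   # stand-in for embedded images
print(retrieve(word_vec, image_vecs))       # comprehension: images closest to the word
```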

Page 13

Word Meaning

Example 1: word "parachute"; original label: parachute. Example 2: word "bicycle"; original label: bicycle-built-for-two.

Page 14

Utterance Meaning

14

Input sentence: “a brown teddy bear lying on top of a dry grass covered ground .”

• Does the model learn and use any knowledge about sentence structure?

Page 15

Utterance Meaning

Scrambled input sentence: “a a of covered laying bear on brown grass top teddy ground . dry”

Input sentence: “a brown teddy bear lying on top of a dry grass covered ground .”

Page 16

Another Example

16

Original: “a variety of kitchen utensils hanging from a UNK board .”

Scrambled: “kitchen of from hanging UNK variety a board utensils .”

Page 17

What Kind of Structure is Learned?

• Some structural patterns:

• Topics appear in sentence-initial position

• The same words are more important as heads than as modifiers

• Periods terminate a sentence

• Sentences tend not to start with conjunctions

• Can we pin down the acquired structure more systematically?

17

Page 18

Analysis Technique: Omission Score

• Perturbation study: how would the omission of a word from an input sentence affect the model?

18

Kádár, Chrupała, and Alishahi: Linguistic Form and Function in RNNs

We use this model as a baseline in the sections that focus on language structure.

4. Experiments

In this section, we report a series of experiments in which we explore the kinds of linguistic regularities the networks learn from word-level input. In Section 4.1 we introduce the omission score, a metric to measure the contribution of each token to the prediction of the networks, and in Section 4.2 we analyze how omission scores are distributed over dependency relations and part-of-speech categories. In Section 4.3 we investigate the extent to which the importance of words for the different networks depends on the words themselves, their sequential position, and their grammatical function in the sentences. Finally, in Section 4.4 we systematically compare the types of n-gram contexts that trigger individual dimensions in the hidden layers of the networks, and discuss their level of abstractness.

In all these experiments we report our findings based on the IMAGINET model, and whenever appropriate compare it with our two other models, LM and SUM. For all the experiments, we trained the models on the training portion of the MS-COCO image-caption data set (Lin et al. 2014), and analyzed the representations of the sentences in the validation set corresponding to 5,000 randomly chosen images. The target image representations were extracted from the pre-softmax layer of the 16-layer CNN of Simonyan and Zisserman (2015).

4.1 Computing Omission Scores

We propose a novel technique for interpreting the activation patterns of neural networks trained on language tasks from a linguistic point of view, and focus on a high-level understanding of which parts of the input sentence the networks pay most attention to. Furthermore, we investigate whether the networks learn to assign different amounts of importance to tokens depending on their position and grammatical function in the sentences.

In all the models the full sentence is represented by the activation vector at the end-of-sentence symbol, $h_{\mathrm{end}}$. We measure the salience of each word $S_i$ in an input sentence $S_{1:n}$ based on how much the representation of the partial sentence $S_{\setminus i} = S_{1:i-1} S_{i+1:n}$, with the word $S_i$ omitted, deviates from the original sentence representation. For example, the distance between $h_{\mathrm{end}}$(the black dog is running) and $h_{\mathrm{end}}$(the dog is running) determines the importance of black in the first sentence. We introduce the measure omission(i, S) for estimating the salience of a word $S_i$:

$\mathrm{omission}(i, S) = 1 - \mathrm{cosine}(h_{\mathrm{end}}(S), h_{\mathrm{end}}(S_{\setminus i}))$   (12)

Figure 2 demonstrates the omission scores for the LM, VISUAL, and TEXTUAL models for an example caption. Figure 3 shows the images retrieved by VISUAL for the full caption and for the one with the word "baby" omitted. The images are retrieved from the validation set of MS-COCO by: 1) computing the image representation of the given sentence with VISUAL; 2) extracting the CNN features for the images in the set; and 3) finding the image that minimizes the cosine distance to the query. The omission scores for VISUAL show that the model paid attention mostly to baby and bed and slightly to laptop, and retrieved an image depicting a baby sitting on a bed with a laptop. Removing the word baby leads to an image that depicts an adult male lying on a bed.
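A minimal sketch of Eq. (12), assuming only some encode function that maps a token sequence to the model's final hidden state; the toy bag-of-embeddings encoder below is purely illustrative:

```python
import torch
import torch.nn.functional as F

def omission_scores(sentence, encode):
    """Salience of each token: 1 - cosine(h_end(S), h_end(S with the token omitted)), Eq. (12)."""
    h_full = encode(sentence)
    scores = []
    for i in range(len(sentence)):
        reduced = sentence[:i] + sentence[i + 1:]          # the sentence without token i
        cos = F.cosine_similarity(h_full, encode(reduced), dim=0)
        scores.append(1.0 - cos.item())
    return scores

# toy encoder: mean of word embeddings (a stand-in for a trained model's h_end)
vocab = {w: i for i, w in enumerate("a baby sits on bed".split())}
emb = torch.nn.Embedding(len(vocab), 8)
encode = lambda toks: emb(torch.tensor([vocab[t] for t in toks])).mean(dim=0)
print(omission_scores("a baby sits on a bed".split(), encode))
```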


Page 19

Contribution of each Word

19

Figure 2: Omission scores for the example sentence "a baby sits on a bed laughing with a laptop computer open" for LM and the two pathways, TEXTUAL and VISUAL, of IMAGINET.

Figure 2 also shows that, in contrast to VISUAL, TEXTUAL distributes its attention more evenly across time steps instead of focusing on the types of words related to the corresponding visual scene. The peaks for LM are the same as for TEXTUAL, but the variance of the omission scores is higher, suggesting that the unimodal language model is more sensitive overall to input perturbations than TEXTUAL.

4.2 Omission Score Distributions

The omission scores can be used to estimate the importance not only of individual words, but also of syntactic categories. We estimate the salience of each syntactic category by accumulating the omission scores for all words in that category. We tag every word in a sentence with its part-of-speech (POS) category and the dependency relation label of its incoming arc. For example, for the sentence "the black dog", we get …

Figure 3: Images retrieved for the example sentence "a baby sits on a bed laughing with a laptop computer open" (left) and the same sentence with the second word omitted (right).


Page 20

Contribution of each Word

20


Page 21

An Example

21

Input sentence: “Stop sign with words written on it with a black marker .”

Page 22

Impact of POS & Dependency Categories

22

Page 23

What We Have Learned

• Perceptual and linguistic context provide complementary sources of information for learning form-meaning associations.

• Analysis of the model’s hidden states gives us insight into useful structural knowledge for language comprehension and production, but more sophisticated techniques are needed.

23

Page 24

24

A Typical Language Learning Scenario

Patrz, łódka przybija do brzegu. ("Look, the boat is coming in to the shore.")

Page 25

25

A Typical Language Learning Scenario

Grounded speech perception

Page 26

Modeling Grounded Speech

26

Grounded speech perception

Image Model

Speech Model

Joint Semantic Space

Page 27

Joint Semantic Space

27

Project speech and image to joint space

a bird walks on a beam

bears play in water

… final representation by removing them from the input and comparing the resulting representations to the ones generated by the original input. In a similar approach, Li et al. (2016) study the contribution of individual input tokens as well as hidden units and word embedding dimensions by erasing them from the representation and analyzing how this affects the model.

Miao et al. (2016) and Tang et al. (2016) use visualization techniques for fine-grained analysis of GRU and LSTM models for ASR. Visualization of input and forget gate states allows Miao et al. (2016) to make informed adaptations to gated recurrent architectures, resulting in more efficiently trainable models. Tang et al. (2016) visualize qualitative differences between LSTM- and GRU-based architectures regarding the encoding of information, as well as how it is processed through time.

We specifically study linguistic properties of the information encoded in the trained model. Adi et al. (2016) introduce prediction tasks to analyze information encoded in sentence embeddings about word order, sentence length, and the presence of individual words. We use related techniques to explore the encoding of aspects of form and meaning within components of our stacked architecture.

3 Models

We use a multi-layer, gated recurrent neural network (RHN) to model the temporal nature of the speech signal. Recurrent neural networks are designed for modeling sequential data, and gated variants (GRUs, LSTMs) are widely used with speech and text in both cognitive modeling and engineering contexts. RHNs are a simple generalization of GRU networks such that the transform between time points can consist of several steps.

Our multimodal model projects spoken utterances and images to a joint semantic space. The idea of projecting different modalities to a shared semantic space via a pair of encoders has been used in work on language and vision (among them Vendrov et al. (2015)). The core idea is to encourage inputs representing the same meaning in different modalities to end up nearby, while maintaining a distance from unrelated inputs.

The model consists of two parts: an utterance encoder and an image encoder. The utterance encoder starts from MFCC speech features, while the image encoder starts from features extracted with a VGG-16 network pre-trained on ImageNet. Our loss function attempts to make the cosine distance between encodings of matching utterances and images smaller, by a margin, than the distance between encodings of mismatching utterance/image pairs:

$\sum_{u,i} \Big( \sum_{u'} \max[0, \alpha + d(u,i) - d(u',i)] + \sum_{i'} \max[0, \alpha + d(u,i) - d(u,i')] \Big)$   (1)

where $d(u, i)$ is the cosine distance between the encoded utterance $u$ and encoded image $i$. Here $(u, i)$ is the matching utterance-image pair, $u'$ ranges over utterances not describing $i$, and $i'$ ranges over images not described by $u$. The image encoder $\mathrm{enc}_i$ is a simple linear projection, followed by normalization to unit L2 norm:

$\mathrm{enc}_i(i) = \mathrm{unit}(A i + b)$   (2)

where $\mathrm{unit}(x) = x / (x^T x)^{0.5}$, with $(A, b)$ as learned parameters. The utterance encoder $\mathrm{enc}_u$ consists of a 1-dimensional convolutional layer of length $s$, size $d$ and stride $z$, whose output feeds into a Recurrent Highway Network with $k$ layers and $L$ microsteps, whose output in turn goes through an attention-like lookback operator, and finally L2 normalization:

$\mathrm{enc}_u(u) = \mathrm{unit}(\mathrm{Attn}(\mathrm{RHN}_{k,L}(\mathrm{Conv}_{s,d,z}(u))))$   (3)

The main function of the convolutional layer $\mathrm{Conv}_{s,d,z}$ is to subsample the input along the temporal dimension. We use a 1-dimensional convolution with full border mode padding. The attention operator simply computes a weighted sum of the RHN activations at all timesteps:

$\mathrm{Attn}(x) = \sum_t \alpha_t x_t$   (4)

where the weights $\alpha_t$ are determined by learned parameters $U$ and $W$, and passed through the timewise softmax function:

$\alpha_t = \exp(U \tanh(W x_t)) \,/\, \sum_{t'} \exp(U \tanh(W x_{t'}))$   (5)

The main component of the utterance encoder is a recurrent network, specifically a Recurrent Highway Network (Zilly et al., 2016). The idea behind …
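For concreteness, a minimal PyTorch sketch of this margin-based ranking loss over a batch of matched utterance and image encodings; it assumes L2-normalized inputs and is an illustration, not the authors' code:

```python
import torch

def contrastive_margin_loss(U, I, margin=0.2):
    """Margin loss in the spirit of Eq. (1): U, I are (batch, d) unit-norm encodings;
    row k of U matches row k of I. d(u, i) = 1 - u . i for unit vectors."""
    dist = 1.0 - U @ I.t()                       # dist[a, b] = cosine distance d(u_a, i_b)
    pos = dist.diag()                            # d(u, i) for the matching pairs
    # mismatched utterances u' for each image (vary rows), mismatched images i' (vary columns)
    cost_u = torch.clamp(margin + pos.unsqueeze(0) - dist, min=0.0)
    cost_i = torch.clamp(margin + pos.unsqueeze(1) - dist, min=0.0)
    eye = torch.eye(dist.size(0), dtype=torch.bool)
    cost_u = cost_u.masked_fill(eye, 0.0)        # do not penalize the matching pair itself
    cost_i = cost_i.masked_fill(eye, 0.0)
    return cost_u.sum() + cost_i.sum()

# toy usage with random unit vectors
U = torch.nn.functional.normalize(torch.randn(4, 8), dim=1)
I = torch.nn.functional.normalize(torch.randn(4, 8), dim=1)
print(contrastive_margin_loss(U, I))
```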


Page 28

Image Model

28

Image model

VGG-16 (Simonyan & Zisserman, 2014): image features are taken from the pre-classification layer, i.e. just before the softmax over object labels such as BOAT, BIRD, BOAR.

Page 29

Speech Model

29

Project to the joint semantic space

• Attention: weighted sum of last RHN layer units

• RHN: Recurrent Highway Networks (Zilly et al., 2016)

• Convolution: subsampling MFCC vector

Architecture (bottom to top): MFCC → Convolution → 5 × RHN → Attention.

Grounded speech perception

Page 30

What does the Model Learn?

• What aspects of language does the model encode?

• Does the model encode linguistic form, and on which layers?

• Does it encode meaning, and on which layers?

30

Grounded speech perception

• Auxiliary tasks using layer activations as features

Page 31

“Auxiliary Tasks” or “Probing Classifiers”

31

Grounded speech perception

Classifier/ Regressor

Trained to predict linguistic knowledge Y

Trained to perform Task X

Page 32

Analysis via Auxiliary Tasks

• Predicting utterance length

• Predicting the presence of specific words

• Representation of similarity

• Homonym detection

• Synonym detection

32

Page 33

Predicting Utterance Length

• Input: MFCC features from the speech signal (layer 0), layer activations (layers 1–4/5), or utterance embeddings (5–6)

• Prediction model: Linear Regression

33

Target: number of words
Input: activations for the utterance
Model: linear regression
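A minimal sketch of such a probe with scikit-learn, using random stand-in data in place of the real utterance representations and word counts:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# X holds one fixed-size representation per utterance (MFCC summary, a layer's
# activations, or the utterance embedding); y holds the number of words per utterance.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))          # stand-in for utterance representations
y = rng.integers(3, 15, size=1000)        # stand-in for utterance lengths in words

probe = LinearRegression().fit(X[:800], y[:800])
r2 = probe.score(X[800:], y[800:])        # how well length is decodable from this layer
print(f"held-out R^2: {r2:.3f}")
```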

Page 34

Presence of Individual Words

• Input: MFCC features for words + layer activations

• Prediction model: Multilayer Perceptron

34

Target: word presence
Input: activations for the utterance + MFCC features for the word
Model: MLP

Page 35

Synonym Discrimination

• Distinguishing between synonym pairs in the same context:

• A girl looking at a photo

• A girl looking at a picture

• Synonyms were selected using WordNet synsets:

• Both words have the same POS tag and are interchangeable

• The two forms clearly differ (unlike donut/doughnut)

• The more frequent form in a pair constitutes less than 95% of the occurrences.

35
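A rough sketch of how candidate pairs could be gathered with WordNet and caption frequency counts; thresholds and helper names are illustrative, and the interchangeability and spelling checks above are assumed to be applied manually afterwards:

```python
from collections import Counter
from nltk.corpus import wordnet as wn  # assumes the NLTK WordNet data is installed

def candidate_synonym_pairs(captions, max_ratio=0.95):
    """Collect word pairs that share a WordNet synset and are reasonably balanced in frequency."""
    counts = Counter(w for cap in captions for w in cap.lower().split())
    pairs = set()
    for w1 in counts:
        for synset in wn.synsets(w1):
            for w2 in synset.lemma_names():
                w2 = w2.lower()
                if w2 == w1 or "_" in w2 or w2 not in counts:
                    continue
                a, b = counts[w1], counts[w2]
                if max(a, b) / (a + b) < max_ratio:   # more frequent form < 95% of occurrences
                    pairs.add(tuple(sorted((w1, w2))))
    return pairs

print(candidate_synonym_pairs(["a girl looking at a photo", "a girl looking at a picture"]))
```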

Page 36

Synonym Discrimination

36

Figure 4: Hierarchical clustering of phoneme activation vectors on the first hidden layer.

Figure 6: Synonym discrimination error rates, per representation (mfcc, conv, rec1–rec5, emb) and synonym pair. Pairs tested: cut/slice, make/prepare, someone/person, photo/picture, picture/image, kid/child, photograph/picture, slice/piece, bicycle/bike, photograph/photo, couch/sofa, tv/television, vegetable/veggie, sidewalk/pavement, rock/stone, store/shop, purse/bag, assortment/variety, spot/place, pier/dock, direction/way, carpet/rug, bun/roll, large/big, small/little.

… sentence embeddings give relatively high error rates, suggesting that the attention layer acts to focus on semantic information and to filter out much of the phonological form.

6 Discussion

Understanding distributed representations learned by neural networks is important but has the reputation of being hard to impossible. In this work we focus on making progress on this problem for a particular domain: representations of phonology in a multilayer recurrent neural network trained on grounded speech signal. We believe it is important to carry out multiple analyses using diverse methodology: any single experiment may be misleading, as it depends on analytical choices such as the type of supervised model used for decoding, the algorithm used for clustering, or the similarity metric for representational similarity analysis. To the extent that more than one experiment points to the same conclusion, our confidence in the reliability of the insights gained increases.

The main high-level result of our study confirms earlier work: encoding of semantics becomes stronger in higher layers, while encoding of form becomes weaker. This general pattern is to be expected, as the objective of the utterance encoder is to transform the input acoustic features in such a way that they can be matched to their counterpart in a completely separate modality. Many of the details of how this happens, however, are far from obvious: perhaps most surprisingly, we found that a large amount of phonological information persists up to the top recurrent layer. Evidence for this pattern emerges from the phoneme decoding task, the ABX task, and the synonym discrimination task. The last one also shows that the attention layer filters out and significantly attenuates the encoding of phonology and makes the utterance embeddings much more invariant to synonymy.

In future work we would like to apply our methodology to other models and data, especially human speech data. We would also like to make comparisons to the results that emerge from similar analyses applied to neuroimaging data.


Page 37

Homonym Disambiguation

• Distinguishing between homonym pairs in different contexts, e.g. suite versus sweet or sale versus sail

• We selected homonym pairs in our dataset where

• both forms appear more than 20 times

• the two forms have different meanings (not theatre/theater)

• neither form is a function word

• the more frequent form constitutes less than 95% of the occurrences.

37

Page 38

38

Homonym disambiguation

Page 39

39

Diagram: auxiliary tasks (utterance length, word presence, meaning similarity, homonym detection, synonym detection, edit similarity) shown against the model stack (MFCC → Convolution → 5 × RHN → Attention).

Page 40

40

Diagram: auxiliary tasks (utterance length, word presence, meaning similarity, homonym detection, synonym detection, edit similarity) shown against the model stack (MFCC → Convolution → 5 × RHN → Attention).

Page 41

Encoding of Phonology

• Questions: how is phonology encoded in

• MFCC features extracted from speech signal?

• activations of the layers of the model?

41

Page 42

Encoding of Phonology

• Questions: how is phonology encoded in

• MFCC features extracted from speech signal?

• activations of the layers of the model?

• Data: Synthetically Spoken COCO dataset

• Experiments:

• Phoneme decoding and clustering

• Phoneme discrimination

42

Grounded speech perception

Page 43

Phoneme Decoding

• Identifying phonemes from speech signal/activation patterns: supervised classification of aligned phonemes

• The speech signal was aligned with its phonemic transcription using the Gentle toolkit (based on Kaldi; Povey et al., 2011)

43

Page 44

Phoneme Decoding

• Identifying phonemes from speech signal/activation patterns: supervised classification of aligned phonemes

44


… (MFCC features and activations) are stored in a $t_r \times D_r$ matrix, where $t_r$ and $D_r$ are the number of time steps and the dimensionality, respectively, for each representation $r$. Given the alignment of each phoneme token to the underlying audio, we then infer the slice of the representation matrix corresponding to it.

5 Experiments

In this section we report on four experiments which we designed to elucidate to what extent information about phonology is represented in the activations of the layers of the COCO Speech model. In Section 5.1 we quantify how easy it is to decode phoneme identity from activations. In Section 5.2 we determine phoneme discriminability in a controlled task with minimal-pair stimuli. Section 5.3 shows how the phoneme inventory is organized in the activation space of the model. Finally, in Section 5.4 we tackle the general issue of the representation of phonological form versus meaning with the controlled task of synonym discrimination.

5.1 Phoneme decoding

In this section we quantify to what extent phoneme identity can be decoded from the input MFCC features as compared to the representations extracted from the COCO Speech model. As explained in Section 4.3, we use phonemic transcriptions aligned to the corresponding audio in order to segment the signal into chunks corresponding to individual phonemes.

We take a subset of 5000 utterances from the validation set of Synthetically Spoken COCO, and extract the force-aligned representations from the COCO Speech model. We split this data into 2/3 training and 1/3 heldout portions, and use supervised classification in order to quantify the recoverability of phoneme identities from the representations. Each phoneme slice is averaged over time, so that it becomes a $D_r$-dimensional vector. For each representation we then train L2-penalized logistic regression (with the fixed penalty weight 1.0) on the training data and measure the classification error rate on the heldout portion.

Figure 1 shows the results. As can be seen from this plot, phoneme recoverability is poor for the representations based on MFCC and the convolutional layer activations, but improves markedly for the recurrent layers. Phonemes are most easily recovered from the activations at recurrent layers 1 and 2, and the accuracy decreases thereafter. This suggests that the bottom recurrent layers of the model specialize in recognizing this type of low-level phonological information. It is notable, however, that even the last recurrent layer encodes phoneme identity to a substantial degree.
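A minimal sketch of this decoding setup with scikit-learn, using random stand-in features and labels in place of the real time-averaged phoneme slices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(6000, 512))    # stand-in: one time-averaged representation per phoneme token
y = rng.integers(0, 40, size=6000)  # stand-in phoneme labels (roughly 40 classes for English)

split = len(X) * 2 // 3             # 2/3 training, 1/3 heldout
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X[:split], y[:split])
error_rate = 1.0 - clf.score(X[split:], y[split:])  # classification error on heldout data
print(f"heldout error rate: {error_rate:.3f}")
```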

Figure 1: Accuracy of phoneme decoding with input MFCC features and COCO Speech model activations (error rate per representation: MFCC, Conv, Rec1–Rec5). The boxplot shows error rates bootstrapped with 1000 resamples.

5.2 Phoneme discrimination

Schatz et al. (2013) propose a framework for evaluating speech features learned in an unsupervised setup that does not depend on phonetically labeled data. They propose a set of tasks called Minimal-Pair ABX tasks that allow linguistically precise comparisons between syllable pairs that only differ by one phoneme. They use variants of this task to study phoneme discrimination across talkers and phonetic contexts, as well as talker discrimination across phonemes.

Here we evaluate the COCO Speech model on the Phoneme across Context (PaC) task of Schatz et al. (2013). This task consists of presenting a series of equal-length tuples (A, B, X) to the model, where A and B differ by one phoneme (either a vowel or a consonant), as do B and X, but A and X are not minimal pairs. For example, in the tuple (be /bi/, me /mi/, my /maI/), human subjects are asked to identify which of the two syllables be or me is closer to my. The goal is to measure …

Page 45

Organization of Phonemes

• Agglomerative hierarchical clustering of phoneme activation vectors from the first hidden layer:

45


Figure 4: Hierarchical clustering of phoneme activation vectors on the first hidden layer.


Page 46

• ABX task (Schatz et al., 2013): discriminate minimal pairs; is X closer to A or to B?

• A, B and X are CV syllables

• (A,B) and (B,X) are minimal pairs, but (A,X) are not (34,288 tuples in total)

Phoneme Discrimination

46

A: be /bi/ B: me /mi/

X: my /maI/

Page 47

Phoneme Discrimination

47


Table 3: Accuracy of choosing the correct target in an ABX task using different representations.

MFCC           0.71
Convolutional  0.73
Recurrent 1    0.82
Recurrent 2    0.82
Recurrent 3    0.80
Recurrent 4    0.76
Recurrent 5    0.74

… context invariance in phoneme discrimination by evaluating how often the model recognises X as the syllable closer to B than to A.

We used a list of all attested consonant-vowel (CV) syllables of American English according to the syllabification method described in Gorman (2013). We excluded the ones which could not be unambiguously represented using English spelling for input to the TTS system (e.g. /baU/). We then compiled a list of all possible (A, B, X) tuples from this list where (A, B) and (B, X) are minimal pairs, but (A, X) are not. This resulted in 34,288 tuples in total. For each tuple, we measure sign(dist(A, X) − dist(B, X)), where dist(i, j) is the Euclidean distance between the vector representations of syllables i and j. These representations are either the audio feature vectors or the layer activation vectors. A positive value for a tuple means that the model has correctly discriminated the phonemes that are shared or different across the syllables.
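A minimal sketch of this per-tuple test, assuming each syllable is already represented as a single NumPy vector (for activation layers, e.g. a mean-pooled vector); the variable abx_tuples and the random example vectors are hypothetical placeholders, not part of the original setup:

import numpy as np

def dist(u, v):
    # Euclidean distance between two representation vectors
    return np.linalg.norm(u - v)

def abx_correct(a, b, x):
    # sign(dist(A, X) - dist(B, X)) > 0 means X is closer to B, which is the
    # correct answer when (A, B) and (B, X) are minimal pairs but (A, X) is not.
    return np.sign(dist(a, x) - dist(b, x)) > 0

# Hypothetical tuples of (A, B, X) representations: MFCC vectors or
# layer activation vectors would be plugged in here.
abx_tuples = [(np.random.rand(512), np.random.rand(512), np.random.rand(512))]

accuracy = np.mean([abx_correct(a, b, x) for a, b, x in abx_tuples])
print(f"ABX accuracy: {accuracy:.2f}")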

Table 3 shows the discrimination accuracy in this task using various representations. The pattern is similar to what we observed in the phoneme identification task: best accuracy is achieved using representation vectors from recurrent layers 1 and 2, and it drops as we move further up in the model. The accuracy is lowest when MFCC features are used for this task.

However, the PaC task is most meaningful and challenging where the target and the distractor phonemes belong to the same phoneme class. Figure 2 shows the accuracies for this subset of cases, broken down by class. As can be seen, the model can discriminate between phonemes with high accuracy across all the layers, and the layer activations are more informative for this task than the MFCC features. Again, phonemes seem to be represented more accurately in the lower layers, and the performance of the model in this task drops

Figure 2: Accuracies for the ABX CV task for the cases where the target and the distractor belong to the same phoneme class. (Plot: accuracy, 0.5 to 0.9, per representation (mfcc, conv, rec1 to rec5), broken down by phoneme class: affricate, approximant, fricative, nasal, plosive, vowel.)

as we move towards higher hidden layers. There are also clear differences in the pattern of discriminability for the phoneme classes. The vowels are especially easy to tell apart, but accuracy on vowels drops most acutely in the higher layers. Meanwhile the accuracy on fricatives and approximants starts low, but improves rapidly and peaks around recurrent layer 2.

5.3 Organization of phonemes

In this section we take a closer look at the underlying organization of phonemes in the model. Our experiment is inspired by Khalighinejad et al. (2017), who study how the speech signal is represented in the brain at different stages of the auditory pathway by collecting and analyzing electroencephalography responses from participants listening to continuous speech, and show that brain responses to different phoneme categories turn out to be organized by phonetic features.

We carry out an analogous experiment by analyzing the hidden layer activations of our model in response to each phoneme in the input. First, we generated a distance matrix for every pair of phonemes by calculating the Euclidean distance between the phoneme pair's activation vectors for each layer separately, as well as a distance matrix for all phoneme pairs based on their MFCC features. Similar to what Khalighinejad et al. (2017) report, we observe that the phoneme activations on …
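The distance matrices described above can be reproduced with a few lines of NumPy/SciPy; this is only a sketch, and the arrays layer_activations and phoneme_labels are hypothetical placeholders standing in for one layer's (e.g. time-averaged) activation vectors and their aligned phoneme labels:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical inputs: one activation vector per phoneme occurrence,
# plus the phoneme label of each occurrence.
layer_activations = np.random.rand(1000, 512)
phoneme_labels = np.random.choice(list("aiubdgmnszfv"), size=1000)

# Average the activation vectors per phoneme category.
phonemes = sorted(set(phoneme_labels))
means = np.stack([layer_activations[phoneme_labels == p].mean(axis=0) for p in phonemes])

# Pairwise Euclidean distances between phoneme categories; repeating this per
# layer (and for MFCC features) gives one distance matrix per representation.
dist_matrix = squareform(pdist(means, metric="euclidean"))
print(dist_matrix.shape)  # (n_phonemes, n_phonemes)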

Page 48: Getting Closer to Reality: Grounding and Interaction in Models ...

48

(Figure: the grounded speech model, consisting of MFCC features, a convolutional layer, five stacked RHN layers and an attention layer, annotated with the tasks used to probe the representations at each level: Utterance Length, Word Presence, Meaning Similarity, Homonym Detection, Synonym Detection, Edit Similarity, Phoneme Decoding, Phoneme Discrimination.)

Page 49: Getting Closer to Reality: Grounding and Interaction in Models ...

What We Have Learned

• Encodings of form and meaning emerge and evolve in hidden layers of stacked RNNs processing grounded speech

• Phoneme representations are most salient in lower layers, although a large amount of phonological information persists up to the top recurrent layer

49

Page 50: Getting Closer to Reality: Grounding and Interaction in Models ...

Back to Structure

• Diagnostic classifiers are not useful in probing structured representations

50

Linguistic structured spaces

Page 51: Getting Closer to Reality: Grounding and Interaction in Models ...

Representational Similarity Analysis (RSA)

• RSA (Kriegeskorte et al., 2008): Measuring correlations between representations A and B in a similarity space

• Compute representation (dis)similarity matrices in two spaces

• Measure correlations between upper triangles (see the sketch below)

51
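A minimal sketch of this procedure, assuming for simplicity that both spaces are vector spaces; the variables space_a and space_b are hypothetical placeholders, and cosine distance and Spearman correlation are just common choices, not necessarily the ones used in the work discussed here:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(space_a, space_b):
    # Pairwise dissimilarities of the same n items in each space; pdist
    # returns exactly the upper triangle in condensed form.
    dissim_a = pdist(space_a, metric="cosine")
    dissim_b = pdist(space_b, metric="cosine")
    # Correlate the two upper triangles.
    rho, _ = spearmanr(dissim_a, dissim_b)
    return rho

# Hypothetical example: the same 100 items represented in two spaces.
space_a = np.random.rand(100, 300)   # e.g. model activations
space_b = np.random.rand(100, 50)    # e.g. another representation
print(rsa_score(space_a, space_b))

When one of the spaces is not a vector space, the pdist call is simply replaced by pairwise (dis)similarities computed with a metric native to that space, for example a tree kernel over syntax trees, as discussed below.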

Page 52: Getting Closer to Reality: Grounding and Interaction in Models ...

RSA: An Example

• Sentence similarity according to

• Sim A: human judgment

• Sim B: estimated by a model

• RSA score: correlation between Sim A and Sim B

52

Page 53: Getting Closer to Reality: Grounding and Interaction in Models ...

Applying RSA to Language

53

(Diagram: two representation spaces, A and B, each containing the same items x, y and z.)

• What we need: a similarity metric within each of the two spaces A & B

• E.g., A is a vector space, B is a space of trees or graphs

• What we do not need: a mapping between spaces A & B

Page 54: Getting Closer to Reality: Grounding and Interaction in Models ...

Applying RSA to Language

54


Representational Similarity Analysis

Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational similarity analysis: connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 4.


Page 55: Getting Closer to Reality: Grounding and Interaction in Models ...

Tree Kernels

• Measuring the similarity between two syntactic trees: count their overlapping subtrees (see the sketch below)

55
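A minimal sketch of this idea in the spirit of the classic subset-tree kernel of Collins and Duffy; the nested-tuple tree encoding and the example trees are illustrative assumptions, not the exact formulation used in the work presented here:

def production(node):
    # A node is (label, child1, child2, ...); terminals are plain strings.
    return (node[0],) + tuple(c if isinstance(c, str) else c[0] for c in node[1:])

def nodes(tree):
    yield tree
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from nodes(child)

def matching_fragments(n1, n2, lam=1.0):
    # Number of common tree fragments rooted at n1 and n2.
    if production(n1) != production(n2):
        return 0.0
    if all(isinstance(c, str) for c in n1[1:]):   # pre-terminal nodes
        return lam
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, str) or isinstance(c2, str):
            continue
        score *= 1.0 + matching_fragments(c1, c2, lam)
    return score

def tree_kernel(t1, t2, lam=1.0):
    # Total count of overlapping subtree fragments between the two trees.
    return sum(matching_fragments(n1, n2, lam) for n1 in nodes(t1) for n2 in nodes(t2))

# Hypothetical example trees.
t1 = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks")))
t2 = ("S", ("NP", ("D", "the"), ("N", "cat")), ("VP", ("V", "barks")))
print(tree_kernel(t1, t2))

One common way to turn the raw count into a similarity is to normalise it as K(t1, t2) / sqrt(K(t1, t1) * K(t2, t2)), which can then feed the pairwise similarity matrix needed for RSA.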

Page 56: Getting Closer to Reality: Grounding and Interaction in Models ...

Applying RSA to NN Models of Language

• BERT: a state-of-the-art language model

56

(Plots: results shown across BERT layers.)
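As a rough illustration of how per-layer representations could be extracted from such a model for an RSA-style analysis, here is a sketch assuming the HuggingFace transformers library; the sentences, the mean-pooling choice and the variable names are illustrative assumptions rather than the setup used in the work discussed here:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = ["A dog chases a ball.", "Two children are playing outside."]
enc = tokenizer(sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

# out.hidden_states holds the embedding layer plus one tensor per Transformer
# layer, each of shape (batch, tokens, hidden_size); mean-pool over tokens.
mask = enc["attention_mask"].unsqueeze(-1)
layer_reps = [(h * mask).sum(dim=1) / mask.sum(dim=1) for h in out.hidden_states]

# Each element of layer_reps is a (batch, hidden_size) matrix of sentence vectors
# for one layer; these can be correlated, layer by layer, with similarities
# derived from another space (e.g. tree-kernel similarities over parses).
print(len(layer_reps), layer_reps[0].shape)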

Page 57: Getting Closer to Reality: Grounding and Interaction in Models ...

Analysis Techniques: Summary

• Interpreting hidden activations in NN models of language

• Perturbation-based techniques

• Probing classifiers and auxiliary tasks (see the sketch below)

• Representation similarity analysis

57

BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

1st edition: EMNLP 2018, Brussels; 2nd edition: ACL 2019, Florence
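As one concrete instance of the probing-classifier idea listed above, a minimal sketch of a diagnostic classifier that predicts, for example, phoneme labels from hidden-layer activations; the arrays activations and phoneme_labels are hypothetical placeholders:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: one activation vector per speech segment, with the
# aligned phoneme label as the probing target.
activations = np.random.rand(5000, 512)
phoneme_labels = np.random.choice(list("aiubdgmnszfv"), size=5000)

X_train, X_test, y_train, y_test = train_test_split(
    activations, phoneme_labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Higher held-out accuracy suggests the probed property is (linearly) decodable
# from this layer; repeating per layer shows where it is encoded most strongly.
print("probing accuracy:", probe.score(X_test, y_test))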

Page 58: Getting Closer to Reality: Grounding and Interaction in Models ...

Grzegorz Chrupała Ákos Kádár

• Chrupała, Kádár & Alishahi (2015). Learning language through pictures. ACL'2015.

• Kádár, Chrupała & Alishahi (2017). Representation of linguistic form and function in recurrent neural networks. Computational Linguistics.

• Chrupała, Gelderloos & Alishahi (2017). Representation of language in a model of visually grounded speech signal. ACL'2017.

• Alishahi, Barking & Chrupała (2017). Encoding of phonology in a recurrent neural model of grounded speech. CoNLL'2017.

• Kádár, Elliott, Côté, Chrupała & Alishahi (2018). Lessons learned in multilingual grounded language learning. CoNLL'2018.

• Chrupała & Alishahi (2019). Correlating neural and symbolic representations of language. ACL'2019.

• Alishahi, Chrupała & Linzen (2019). Analyzing and Interpreting Neural Networks for NLP. Natural Language Engineering.

Lieke Gelderloos, Marie Barking

(Thumbnail: Representations of language in a model of visually grounded speech signal. Grzegorz Chrupała, Lieke Gelderloos, Afra Alishahi.)

Page 59: Getting Closer to Reality: Grounding and Interaction in Models ...

Modeling Language Learning: What Next?

• Direct comparison with infant behaviour

• Learning phases, behavioural patterns

• Videos rather than static images

• Exploiting movement, change and non-verbal cues

• Interaction through dialogue

• Relying on feedback from communication success

59

Page 60: Getting Closer to Reality: Grounding and Interaction in Models ...

Getting Closer to Reality

• Grounded language learning

• Mapping utterances to visual representations involves learning linguistic knowledge on various levels

• Examining the nature of emerging linguistic knowledge

• Developing techniques for interpreting the acquired knowledge in (neural) models of language

• Learning language through conversation

• Moving away from passive, data-driven models and towards frameworks which allow for interaction

60