TRANSCRIPT
Getting Closer to Reality: Grounding and Interaction in Models of Human Language Acquisition
Afra Alishahi
Can we simulate language learning in a naturalistic setting?
4
Language Acquisition in the Wild
Patrz, łódka przybija do brzegu. ("Look, the boat is docking at the shore.")
Grounded Language Learning
• Acquire knowledge of language from utterances grounded in perceptual context
• What linguistic representations are learned?
• Historically, linguistic formalisms were taken as given, and models were adapted to these formalisms.
• Example: how to induce a Context-Free Grammar from a corpus?
• How do we know that humans use similar representations?
5
An Alternative Approach
• Neural models of language allow general-purpose architectures to be applied to different language tasks
• What type(s) of linguistic knowledge are necessary for performing a given task?
6
Focus of this Talk
7
How to simulate grounded language learning?
How to analyze emerging knowledge in neural models of language?
A Naturalistic Scenario
8
Look, they are washing the elephants!
• Where to get the input data from?
Using Images of Natural Scenes
• Datasets such as Flickr30K and MS-COCO provide images of natural scenes and their textual descriptions
• Image-caption pairs as proxy for linguistic & perceptual context
9
Imaginet: Learning by Processing
10
[Figure: the Imaginet architecture: shared word representations (input w1 w2 w3 …) feed two pathways, a visual pathway predicting the image and a linguistic pathway predicting the next word wt+1.]
Formally, each sentence is mapped to two sequences of hidden states, one by VISUAL and the other by TEXTUAL:

$h^V_t = \mathrm{GRU}_V(h^V_{t-1}, x_t)$ (5)

$h^T_t = \mathrm{GRU}_T(h^T_{t-1}, x_t)$ (6)

At each time step TEXTUAL predicts the next word in the sentence S from its current hidden state $h^T_t$, and VISUAL predicts the image vector¹ $\hat{i}$ from its last hidden representation $h^V_\tau$:

$\hat{i} = V h^V_\tau$ (7)

$p(S_{t+1} \mid S_{1:t}) = \mathrm{softmax}(L h^T_t)$ (8)

The loss function is a multi-task objective that penalizes error on the visual and the textual targets simultaneously. The objective combines cross-entropy loss $L_T$ for the word predictions and cosine distance $L_V$ for the image predictions,² weighting them with the parameter $\alpha$ (set to 0.1):

$L_T(\theta) = -\frac{1}{\tau} \sum_{t=1}^{\tau} \log p(S_t \mid S_{1:t})$ (9)

$L_V(\theta) = 1 - \frac{i \cdot \hat{i}}{\lVert i \rVert \, \lVert \hat{i} \rVert}$ (10)

$L = \alpha L_T + (1 - \alpha) L_V$ (11)

For more details about the IMAGINET model and its performance, see Chrupała, Kádár, and Alishahi (2015). Note that we introduce a small change in the image representation: we observe that using standardized image vectors, where each dimension is transformed by subtracting the mean and dividing by the standard deviation, improves performance.
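As a minimal sketch of the multi-task objective in equations (9)-(11) (not the authors' code; PyTorch tensors, shapes, and variable names are illustrative assumptions):

```python
# Sketch of Imaginet's multi-task loss: cross-entropy on next-word
# prediction plus cosine distance on the predicted image vector.
import torch
import torch.nn.functional as F

def imaginet_loss(word_logits, target_words, image_pred, image_true, alpha=0.1):
    """word_logits: (seq_len, vocab) next-word scores from the TEXTUAL pathway
    target_words: (seq_len,) gold next-word indices
    image_pred:   (dim,) image vector predicted by the VISUAL pathway
    image_true:   (dim,) target image vector (standardized CNN features)
    """
    # Equation (9): mean cross-entropy over the word predictions
    L_T = F.cross_entropy(word_logits, target_words)
    # Equation (10): cosine distance between predicted and true image vectors
    L_V = 1.0 - F.cosine_similarity(image_pred, image_true, dim=0)
    # Equation (11): weighted combination of the two objectives
    return alpha * L_T + (1.0 - alpha) * L_V
```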
3.3 Unimodal Language Model

The model LM is a language model analogous to the TEXTUAL pathway of IMAGINET, with the difference that its word embeddings are not shared and its loss function is the cross-entropy on word prediction. Using this model we remove the visual objective as a factor, as the model does not use the images corresponding to captions in any way.

3.4 Sum of Word Embeddings

The model SUM is a stripped-down version of the VISUAL pathway, which does not share word embeddings, only uses the cosine loss function, and replaces the GRU network with a summation over word embeddings. This removes the effect of word order from consideration.
¹ Representing the full image, extracted from the pre-trained CNN of Simonyan and Zisserman (2015).
² Note that the original formulation in Chrupała, Kádár, and Alishahi (2015) uses mean squared error instead; as the performance of VISUAL is measured on image retrieval (which is based on cosine distances), we use cosine distance as the visual loss here.
Image Representations
11
[Figure: image model: VGG-16 (Simonyan & Zisserman, 2014); image features are taken from the pre-classification layer; example output labels: BOAT, BIRD, BOAR.]
What Does the Model Learn?
• Evaluate the model through a set of comprehension and production tasks
• Production: retrieve the closest words or phrases to an image
• Comprehension: retrieve the closest images to a word or multi-word phrase
• Estimate word pair similarity, and correlate with human judgments
12
Word Meaning
[Figure: images retrieved for the words parachute and bicycle; original dataset labels: parachute and bicycle-built-for-two.]
Utterance Meaning
14
Input sentence: “a brown teddy bear lying on top of a dry grass covered ground .”
• Does the model learn and use any knowledge about sentence structure?
Utterance Meaning
Scrambled input sentence: “a a of covered lying bear on brown grass top teddy ground . dry”
Input sentence: “a brown teddy bear lying on top of a dry grass covered ground .”
Another Example
16
Original: “a variety of kitchen utensils hanging from a UNK board .”
Scrambled: “kitchen of from hanging UNK variety a board utensils .”
What Kind of Structure is Learned?
• Some structural patterns:
• Topics appear in sentence-initial position
• The same words are more important as heads than as modifiers
• Periods terminate a sentence
• Sentences tend not to start with conjunctions
• Can we pin down the acquired structure more systematically?
17
Analysis Technique: Omission Score
• Perturbation study: how would the omission of a word from an input sentence affect the model?
18
We use this model as a baseline in the sections that focus on language structure.
4. Experiments
In this section, we report a series of experiments in which we explore the kinds of linguistic regularities the networks learn from word-level input. In Section 4.1 we introduce the omission score, a metric to measure the contribution of each token to the prediction of the networks, and in Section 4.2 we analyze how omission scores are distributed over dependency relations and part-of-speech categories. In Section 4.3 we investigate the extent to which the importance of words for the different networks depends on the words themselves, their sequential position, and their grammatical function in the sentences. Finally, in Section 4.4 we systematically compare the types of n-gram contexts that trigger individual dimensions in the hidden layers of the networks, and discuss their level of abstractness.
In all these experiments we report our findings based on the IMAGINET model, and whenever appropriate compare it with our two other models LM and SUM. For all the experiments, we trained the models on the training portion of the MSCOCO image-caption data set (Lin et al. 2014), and analyzed the representations of the sentences in the validation set corresponding to 5,000 randomly chosen images. The target image representations were extracted from the pre-softmax layer of the 16-layer CNN of Simonyan and Zisserman (2015).
4.1 Computing Omission Scores

We propose a novel technique for interpreting the activation patterns of neural networks trained on language tasks from a linguistic point of view, and focus on a high-level understanding of what parts of the input sentence the networks pay most attention to. Furthermore, we investigate whether the networks learn to assign different amounts of importance to tokens depending on their position and grammatical function in the sentences.

In all the models the full sentences are represented by the activation vector at the end-of-sentence symbol ($h_{\mathrm{end}}$). We measure the salience of each word $S_i$ in an input sentence $S_{1:n}$ based on how much the representation of the partial sentence $S_{\setminus i} = S_{1:i-1}S_{i+1:n}$, with the word $S_i$ omitted, deviates from the original sentence representation. For example, the distance between $h_{\mathrm{end}}$(the black dog is running) and $h_{\mathrm{end}}$(the dog is running) determines the importance of black in the first sentence. We introduce the measure omission(i, S) for estimating the salience of a word $S_i$:

$\mathrm{omission}(i, S) = 1 - \mathrm{cosine}(h_{\mathrm{end}}(S), h_{\mathrm{end}}(S_{\setminus i}))$ (12)

Figure 2 demonstrates the omission scores for the LM, VISUAL, and TEXTUAL models for an example caption. Figure 3 shows the images retrieved by VISUAL for the full caption and for the one with the word baby omitted. The images are retrieved from the validation set of MSCOCO by: 1) computing the image representation of the given sentence with VISUAL; 2) extracting the CNN features for the images from the set; and 3) finding the image that minimizes the cosine distance to the query. The omission scores for VISUAL show that the model paid attention mostly to baby and bed and slightly to laptop, and retrieved an image depicting a baby sitting on a bed with a laptop. Removing the word baby leads to an image that depicts an adult male lying on a bed.
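A minimal sketch of equation (12), assuming an `encode` function that returns the $h_{\mathrm{end}}$ activation vector for a tokenized sentence (the names are illustrative, not the authors' code):

```python
# Omission score: how much does leaving out token i change the
# final sentence representation?
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def omission_scores(tokens, encode):
    """Return omission(i, S) for each position i in the sentence."""
    h_full = encode(tokens)
    scores = []
    for i in range(len(tokens)):
        # Representation of the sentence with token i left out
        h_partial = encode(tokens[:i] + tokens[i + 1:])
        scores.append(1.0 - cosine(h_full, h_partial))
    return scores
```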
Contribution of each Word
19
[Figure 2: Omission scores for the example sentence a baby sits on a bed laughing with a laptop computer open, for LM and the two pathways, TEXTUAL and VISUAL, of IMAGINET.]
Figure 2 also shows that in contrast to VISUAL, TEXTUAL distributes its attention more evenly across time steps instead of focusing on the types of words related to the corresponding visual scene. The peaks for LM are the same as for TEXTUAL, but the variance of the omission scores is higher, suggesting that the unimodal language model is overall more sensitive to input perturbations than TEXTUAL.
4.2 Omission Score Distributions
The omission scores can be used to estimate the importance not only of individual words but also of syntactic categories. We estimate the salience of each syntactic category by accumulating the omission scores for all words in that category. We tag every word in a sentence with its part-of-speech (POS) category and the dependency relation label of its incoming arc. For example, for the sentence the black dog, we get …
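A sketch of this aggregation step, assuming spaCy for tagging and the `omission_scores` helper above (not the authors' pipeline):

```python
# Accumulate omission scores per POS tag and per dependency label.
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

def scores_by_category(sentence, encode):
    doc = nlp(sentence)
    tokens = [t.text for t in doc]
    scores = omission_scores(tokens, encode)
    by_pos, by_dep = defaultdict(list), defaultdict(list)
    for tok, s in zip(doc, scores):
        by_pos[tok.pos_].append(s)   # part-of-speech category
        by_dep[tok.dep_].append(s)   # dependency relation of incoming arc
    return by_pos, by_dep
```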
[Figure 3: Images retrieved for the example sentence a baby sits on a bed laughing with a laptop computer open (left) and the same sentence with the second word omitted (right).]
Contribution of each Word
20
An Example
21
Input sentence: “Stop sign with words written on it with a black marker .”
Impact of POS & Dependency Categories
22
What We Have Learned
• Perceptual and linguistic context provide complementary sources of information for learning form-meaning associations.
• Analysis of the model’s hidden states gives us insight into useful structural knowledge for language comprehension and production, but more sophisticated techniques are needed.
23
24
A Typical Language Learning Scenario
Patrz, łódka przybija do brzegu. ("Look, the boat is docking at the shore.")
25
A Typical Language Learning Scenario
Grounded speech perception
Modeling Grounded Speech
26
Grounded speech perception
Image Model
Speech Model
Joint Semantic Space
Joint Semantic Space
27
Project speech and image to joint space
a bird walks on a beam
bears play in water
…final representation by removing them from the input and comparing the resulting representations to the ones generated by the original input. In a similar approach, Li et al. (2016) study the contribution of individual input tokens as well as hidden units and word embedding dimensions by erasing them from the representation and analyzing how this affects the model.

Miao et al. (2016) and Tang et al. (2016) use visualization techniques for fine-grained analysis of GRU and LSTM models for ASR. Visualization of input and forget gate states allows Miao et al. (2016) to make informed adaptations to gated recurrent architectures, resulting in more efficiently trainable models. Tang et al. (2016) visualize qualitative differences between LSTM- and GRU-based architectures regarding the encoding of information, as well as how it is processed through time.

We specifically study linguistic properties of the information encoded in the trained model. Adi et al. (2016) introduce prediction tasks to analyze information encoded in sentence embeddings about word order, sentence length, and the presence of individual words. We use related techniques to explore the encoding of aspects of form and meaning within components of our stacked architecture.
3 Models
We use a multi-layer, gated recurrent neural network (RHN) to model the temporal nature of the speech signal. Recurrent neural networks are designed for modeling sequential data, and gated variants (GRUs, LSTMs) are widely used with speech and text in both cognitive modeling and engineering contexts. RHNs are a simple generalization of GRU networks such that the transform between time points can consist of several steps.

Our multimodal model projects spoken utterances and images to a joint semantic space. The idea of projecting different modalities to a shared semantic space via a pair of encoders has been used in work on language and vision (among them Vendrov et al. (2015)). The core idea is to encourage inputs representing the same meaning in different modalities to end up nearby, while maintaining a distance from unrelated inputs.
The model consists of two parts: an utterance encoder and an image encoder. The utterance encoder starts from MFCC speech features, while the image encoder starts from features extracted with a VGG-16 pre-trained on ImageNet. Our loss function attempts to make the cosine distance between encodings of matching utterances and images smaller, by a margin, than the distance between encodings of mismatching utterance/image pairs:

$\sum_{u,i} \left( \sum_{u'} \max[0, \alpha + d(u, i) - d(u', i)] + \sum_{i'} \max[0, \alpha + d(u, i) - d(u, i')] \right)$ (1)

where $d(u, i)$ is the cosine distance between the encoded utterance $u$ and encoded image $i$. Here $(u, i)$ is the matching utterance-image pair, $u'$ ranges over utterances not describing $i$, and $i'$ ranges over images not described by $u$.

The image encoder $\mathrm{enc}_i$ is a simple linear projection, followed by normalization to unit L2 norm:

$\mathrm{enc}_i(i) = \mathrm{unit}(A i + b)$ (2)

where $\mathrm{unit}(x) = \frac{x}{(x^\top x)^{0.5}}$ and with $(A, b)$ as learned parameters. The utterance encoder $\mathrm{enc}_u$ consists of a 1-dimensional convolutional layer of length $s$, size $d$, and stride $z$, whose output feeds into a Recurrent Highway Network with $k$ layers and $L$ microsteps, whose output in turn goes through an attention-like lookback operator, and finally L2 normalization:

$\mathrm{enc}_u(u) = \mathrm{unit}(\mathrm{Attn}(\mathrm{RHN}_{k,L}(\mathrm{Conv}_{s,d,z}(u))))$ (3)

The main function of the convolutional layer $\mathrm{Conv}_{s,d,z}$ is to subsample the input along the temporal dimension. We use a 1-dimensional convolution with full border mode padding. The attention operator simply computes a weighted sum of the RHN activations at all time steps:

$\mathrm{Attn}(x) = \sum_t \alpha_t x_t$ (4)

where the weights $\alpha_t$ are determined by learned parameters $U$ and $W$, and passed through the timewise softmax function:

$\alpha_t = \frac{\exp(U \tanh(W x_t))}{\sum_{t'} \exp(U \tanh(W x_{t'}))}$ (5)

The main component of the utterance encoder is a recurrent network, specifically a Recurrent Highway Network (Zilly et al., 2016). The idea behind …
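A minimal sketch of the margin loss in equation (1), assuming batched PyTorch tensors and using the other items in the batch as the mismatched utterances and images (not the authors' implementation):

```python
# Contrastive margin loss over a batch of utterance/image encodings.
import torch

def margin_loss(U, I, alpha=0.2):
    """U: (batch, dim) L2-normalized utterance encodings
    I: (batch, dim) L2-normalized image encodings
    """
    # Cosine distance d(u, i) = 1 - cosine similarity; with unit-norm
    # vectors the similarity is just the dot product.
    d = 1.0 - U @ I.t()                  # (batch, batch) distance matrix
    d_match = d.diag().unsqueeze(1)      # d(u, i) for matching pairs
    # alpha + d(u, i) - d(u', i): mismatched utterances for each image
    cost_u = torch.clamp(alpha + d_match.t() - d, min=0.0)
    # alpha + d(u, i) - d(u, i'): mismatched images for each utterance
    cost_i = torch.clamp(alpha + d_match - d, min=0.0)
    # Zero out the diagonal (matching pairs are not their own negatives)
    mask = 1.0 - torch.eye(U.size(0))
    return ((cost_u + cost_i) * mask).sum()
```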
Image Model
28
[Figure: image model: VGG-16 (Simonyan & Zisserman, 2014); image features are taken from the pre-classification layer; example output labels: BOAT, BIRD, BOAR.]
Speech Model
29
Project to the joint semantic space
• Attention: weighted sum of last RHN layer units
• RHN: Recurrent Highway Networks (Zilly et al., 2016)
• Convolution: subsampling the MFCC vectors
[Diagram: MFCC → Convolution → RHN ×5 → Attention]
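A minimal sketch of this encoder stack (MFCC → convolution → recurrent layers → attention → L2 norm), assuming PyTorch and substituting a GRU for the Recurrent Highway Network for brevity; all sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UtteranceEncoder(nn.Module):
    def __init__(self, n_mfcc=13, hidden=512, kernel=6, stride=2, layers=5):
        super().__init__()
        # 1-D convolution subsamples the MFCC frames along time
        self.conv = nn.Conv1d(n_mfcc, hidden, kernel, stride=stride)
        self.rnn = nn.GRU(hidden, hidden, num_layers=layers, batch_first=True)
        # Attention: scalar score per time step, softmaxed over time
        self.W = nn.Linear(hidden, hidden)
        self.U = nn.Linear(hidden, 1, bias=False)

    def forward(self, mfcc):                      # mfcc: (batch, time, n_mfcc)
        x = self.conv(mfcc.transpose(1, 2)).transpose(1, 2)
        h, _ = self.rnn(x)                        # (batch, time', hidden)
        weights = F.softmax(self.U(torch.tanh(self.W(h))), dim=1)
        pooled = (weights * h).sum(dim=1)         # weighted sum over time
        return F.normalize(pooled, dim=1)         # unit L2 norm
```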
Grounded speech perception
What does the Model Learn?
• What aspects of language does the model encode?
• Does the model encode linguistic form, and on which layers?
• Does it encode meaning, and on which layers?
30
Grounded speech perception
“Auxiliary Tasks” or “Probing Classifiers”
• Auxiliary tasks using layer activations as features
31
Grounded speech perception
[Diagram: a model trained to perform Task X; a classifier/regressor trained on its layer activations to predict linguistic knowledge Y.]
Analysis via Auxiliary Tasks
• Predicting utterance length
• Predicting the presence of specific words
• Representation of similarity
• Homonym detection
• Synonym detection
32
Predicting Utterance Length
• Input: MFCC features from the speech signal (layer 0), layer activations (1–4/5), or utterance embeddings (5–6)
• Prediction model: Linear Regression
33
[Setup: input = activations for the utterance; model = linear regression; target = number of words.]
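A minimal sketch of this probe (not the original experiment code), assuming `activations` is an (n_utterances, dim) array of layer activations and `lengths` the corresponding word counts:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def length_probe(activations, lengths):
    X_tr, X_te, y_tr, y_te = train_test_split(
        activations, lengths, test_size=0.33, random_state=0)
    probe = LinearRegression().fit(X_tr, y_tr)
    # R^2 on held-out data: how well the layer encodes utterance length
    return probe.score(X_te, y_te)
```

The word presence task below can be probed the same way, swapping in a classifier (e.g. scikit-learn's MLPClassifier) and concatenating the word's MFCC features with the utterance activations.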
Presence of Individual Words
• Input: MFCC features for words + layer activations
• Prediction model: Multilayer Perceptron
34
[Setup: input = activations for the utterance + MFCC for the word; model = MLP; target = word presence.]
Synonym Discrimination
• Distinguishing between synonym pairs in the same context:
• A girl looking at a photo
• A girl looking at a picture
• Synonyms were selected using WordNet synsets:
• The two words in each pair have the same POS tag and are interchangeable
• The pair clearly differ in form (not donut/doughnut)
• The more frequent token in a pair constitutes less than 95% of the occurrences.
35
Synonym Discrimination
36
[Figure 4: Hierarchical clustering of phoneme activation vectors on the first hidden layer.]
[Figure 6: Synonym discrimination error rates, per representation (mfcc, conv, rec1–rec5, emb) and synonym pair. Pairs: cut/slice, make/prepare, someone/person, photo/picture, picture/image, kid/child, photograph/picture, slice/piece, bicycle/bike, photograph/photo, couch/sofa, tv/television, vegetable/veggie, sidewalk/pavement, rock/stone, store/shop, purse/bag, assortment/variety, spot/place, pier/dock, direction/way, carpet/rug, bun/roll, large/big, small/little.]
…sentence embeddings give relatively high error rates, suggesting that the attention layer acts to focus on semantic information and filter out much of the phonological form.
6 Discussion

Understanding distributed representations learned by neural networks is important, but has the reputation of being hard to impossible. In this work we focus on making progress on this problem for a particular domain: representations of phonology in a multilayer recurrent neural network trained on grounded speech signal. We believe it is important to carry out multiple analyses using diverse methodology: any single experiment may be misleading, as it depends on analytical choices such as the type of supervised model used for decoding, the algorithm used for clustering, or the similarity metric for representational similarity analysis. To the extent that more than one experiment points to the same conclusion, our confidence in the reliability of the insights gained will be increased.

The main high-level result of our study confirms earlier work: encoding of semantics becomes stronger in higher layers, while encoding of form becomes weaker. This general pattern is to be expected, as the objective of the utterance encoder is to transform the input acoustic features in such a way that they can be matched to their counterpart in a completely separate modality. Many of the details of how this happens, however, are far from obvious: perhaps most surprisingly, we found that a large amount of phonological information persists up to the top recurrent layer. Evidence for this pattern emerges from the phoneme decoding task, the ABX task, and the synonym discrimination task. The last one also shows that the attention layer filters out and significantly attenuates the encoding of phonology and makes the utterance embeddings much more invariant to synonymy.

In future work we would like to apply our methodology to other models and data, especially human speech data. We would also like to make comparisons to the results that emerge from similar analyses applied to neuroimaging data.
Homonym Disambiguation
• Distinguishing between homonym pairs in different contexts, e.g. suite versus sweet or sale versus sail
• We selected homonym pairs in our dataset where
• both forms appear more than 20 times
• the two forms have different meanings (not theatre/theater)
• neither form is a function word
• the more frequent form constitutes less than 95% of the occurrences.
37
38
Homonym disambiguation
[Slides 39-40: the model stack (MFCC → Convolution → RHN ×5 → Attention) annotated with the probing tasks: Utterance Length, Word Presence, Meaning Similarity, Edit Similarity, Homonym Detection, Synonym Detection; asterisks mark the layers highlighted for each task.]
Encoding of Phonology
• Questions: how is phonology encoded in
• MFCC features extracted from speech signal?
• activations of the layers of the model?
41
Encoding of Phonology
• Questions: how is phonology encoded in
• MFCC features extracted from speech signal?
• activations of the layers of the model?
• Data: Synthetically Spoken COCO dataset
• Experiments:
• Phoneme decoding and clustering
• Phoneme discrimination
42
Grounded speech perception
Phoneme Decoding
• Identifying phonemes from speech signal/activation patterns: supervised classification of aligned phonemes
• The speech signal was aligned with its phonemic transcription using the Gentle toolkit (based on Kaldi; Povey et al., 2011)
43
Phoneme Decoding
• Identifying phonemes from speech signal/activation patterns: supervised classification of aligned phonemes
44
…(MFCC features and activations) are stored in a $t_r \times D_r$ matrix, where $t_r$ and $D_r$ are the number of time steps and the dimensionality, respectively, for each representation $r$. Given the alignment of each phoneme token to the underlying audio, we then infer the slice of the representation matrix corresponding to it.
5 Experiments

In this section we report on four experiments which we designed to elucidate to what extent information about phonology is represented in the activations of the layers of the COCO Speech model. In Section 5.1 we quantify how easy it is to decode phoneme identity from activations. In Section 5.2 we determine phoneme discriminability in a controlled task with minimal pair stimuli. Section 5.3 shows how the phoneme inventory is organized in the activation space of the model. Finally, in Section 5.4 we tackle the general issue of the representation of phonological form versus meaning with the controlled task of synonym discrimination.
5.1 Phoneme decoding

In this section we quantify to what extent phoneme identity can be decoded from the input MFCC features as compared to the representations extracted from the COCO Speech model. As explained in Section 4.3, we use phonemic transcriptions aligned to the corresponding audio in order to segment the signal into chunks corresponding to individual phonemes.

We take a subset of 5,000 utterances from the validation set of Synthetically Spoken COCO, and extract the force-aligned representations from the COCO Speech model. We split this data into 2/3 training and 1/3 heldout portions, and use supervised classification in order to quantify the recoverability of phoneme identities from the representations. Each phoneme slice is averaged over time, so that it becomes a $D_r$-dimensional vector. For each representation we then train L2-penalized logistic regression (with the fixed penalty weight 1.0) on the training data and measure classification error rate on the heldout portion.

Figure 1 shows the results. As can be seen from this plot, phoneme recoverability is poor for the representations based on MFCC and the convolutional layer activations, but improves markedly for the recurrent layers. Phonemes are most easily recovered from the activations at recurrent layers 1 and 2, and the accuracy decreases thereafter. This suggests that the bottom recurrent layers of the model specialize in recognizing this type of low-level phonological information. It is notable, however, that even the last recurrent layer encodes phoneme identity to a substantial degree.
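A minimal sketch of this probe (not the paper's code), assuming `slices` is a list of (time, dim) arrays, one per aligned phoneme token, and `labels` the phoneme identities; the fixed penalty weight 1.0 maps to `C=1.0` in scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def phoneme_decoding_error(slices, labels):
    # Average each phoneme slice over time -> one vector per token
    X = np.stack([s.mean(axis=0) for s in slices])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=1/3, random_state=0)
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X_tr, y_tr)
    return 1.0 - clf.score(X_te, y_te)   # heldout error rate
```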
[Figure 1: Accuracy of phoneme decoding with input MFCC features and COCO Speech model activations (MFCC, Conv, Rec1–Rec5); the boxplot shows error rates bootstrapped with 1000 resamples.]
5.2 Phoneme discrimination

Schatz et al. (2013) propose a framework for evaluating speech features learned in an unsupervised setup that does not depend on phonetically labeled data. They propose a set of tasks called Minimal-Pair ABX tasks that allow linguistically precise comparisons between syllable pairs that differ by only one phoneme. They use variants of this task to study phoneme discrimination across talkers and phonetic contexts, as well as talker discrimination across phonemes.

Here we evaluate the COCO Speech model on the Phoneme across Context (PaC) task of Schatz et al. (2013). This task consists of presenting a series of equal-length tuples (A, B, X) to the model, where A and B differ by one phoneme (either a vowel or a consonant), as do B and X, but A and X are not a minimal pair. For example, in the tuple (be /bi/, me /mi/, my /maI/), human subjects are asked to identify which of the two syllables be or me is closest to my. The goal is to measure context invariance in phoneme discrimination by evaluating how often the model recognizes X as the syllable closer to B than to A.
Organization of Phonemes
• Agglomerative hierarchical clustering of phoneme activation vectors from the first hidden layer:
45
[Figure 4: Hierarchical clustering of phoneme activation vectors on the first hidden layer.]
• ABX task (Schatz et al., 2013): discriminate minimal pairs; is X closer to A or to B?
• A, B and X are CV syllables
• (A,B) and (B,X) are minimal pairs, but (A,X) is not (34,288 tuples in total)
Phoneme Discrimination
46
A: be /bi/ B: me /mi/
X: my /maI/
Phoneme Discrimination
47
Table 3: Accuracy of choosing the correct target in an ABX task using different representations.

MFCC: 0.71
Convolutional: 0.73
Recurrent 1: 0.82
Recurrent 2: 0.82
Recurrent 3: 0.80
Recurrent 4: 0.76
Recurrent 5: 0.74
We used a list of all attested consonant-vowel (CV) syllables of American English according to the syllabification method described in Gorman (2013). We excluded the ones which could not be unambiguously represented using English spelling for input to the TTS system (e.g. /baU/). We then compiled a list of all possible (A, B, X) tuples from this list where (A, B) and (B, X) are minimal pairs, but (A, X) is not. This resulted in 34,288 tuples in total. For each tuple, we measure $\mathrm{sign}(\mathrm{dist}(A, X) - \mathrm{dist}(B, X))$, where $\mathrm{dist}(i, j)$ is the Euclidean distance between the vector representations of syllables $i$ and $j$. These representations are either the audio feature vectors or the layer activation vectors. A positive value for a tuple means that the model has correctly discriminated the phonemes that are shared or different across the syllables.

Table 3 shows the discrimination accuracy in this task using various representations. The pattern is similar to what we observed in the phoneme identification task: the best accuracy is achieved using representation vectors from recurrent layers 1 and 2, and it drops as we move further up in the model. The accuracy is lowest when MFCC features are used for this task.

However, the PaC task is most meaningful and challenging where the target and the distractor phonemes belong to the same phoneme class. Figure 2 shows the accuracies for this subset of cases, broken down by class. As can be seen, the model can discriminate between phonemes with high accuracy across all the layers, and the layer activations are more informative for this task than the MFCC features. Again, phonemes seem to be represented more accurately in the lower layers, and the performance of the model in this task drops
[Figure 2: Accuracies for the ABX CV task for the cases where the target and the distractor belong to the same phoneme class; classes: affricate, approximant, fricative, nasal, plosive, vowel; representations: mfcc, conv, rec1–rec5.]
as we move towards higher hidden layers. There are also clear differences in the pattern of discriminability for the phoneme classes. The vowels are especially easy to tell apart, but accuracy on vowels drops most acutely in the higher layers. Meanwhile, the accuracy on fricatives and approximants starts low, but improves rapidly and peaks around recurrent layer 2.
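A minimal sketch of the ABX decision rule (not the paper's code), assuming each syllable is represented by a single vector (audio features or time-averaged layer activations):

```python
import numpy as np

def abx_accuracy(tuples):
    """tuples: iterable of (A, B, X) vectors where (A, B) and (B, X)
    are minimal pairs but (A, X) is not."""
    correct = 0
    total = 0
    for A, B, X in tuples:
        # sign(dist(A, X) - dist(B, X)) > 0 means X is closer to B
        if np.linalg.norm(A - X) > np.linalg.norm(B - X):
            correct += 1
        total += 1
    return correct / total
```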
5.3 Organization of phonemes

In this section we take a closer look at the underlying organization of phonemes in the model. Our experiment is inspired by Khalighinejad et al. (2017), who study how the speech signal is represented in the brain at different stages of the auditory pathway by collecting and analyzing electroencephalography responses from participants listening to continuous speech, and show that brain responses to different phoneme categories turn out to be organized by phonetic features.

We carry out an analogous experiment by analyzing the hidden layer activations of our model in response to each phoneme in the input. First, we generated a distance matrix for every pair of phonemes by calculating the Euclidean distance between the phoneme pair's activation vectors for each layer separately, as well as a distance matrix for all phoneme pairs based on their MFCC features. Similar to what Khalighinejad et al. (2017) report, we observe that the phoneme activations on …
[Slide 48: the model stack (MFCC → Convolution → RHN ×5 → Attention) annotated with the probing tasks: Utterance Length, Word Presence, Meaning Similarity, Edit Similarity, Homonym Detection, Synonym Detection, Phoneme Decoding, Phoneme Discrimination; asterisks mark the layers highlighted for each task.]
What We Have Learned
• Encodings of form and meaning emerge and evolve in hidden layers of stacked RNNs processing grounded speech
• Phoneme representations are most salient in lower layers, although a large amount of phonological information persists up to the top recurrent layer
49
Back to Structure
• Diagnostic classifiers are not well suited to probing structured representations
50
[Diagram: linguistically structured spaces]
Representational Similarity Analysis (RSA)
• RSA (Kriegeskorte et al., 2008): Measuring correlations between representations A and B in a similarity space
• Compute representation (dis)similarity matrices in two spaces
• Measure correlations between upper triangles
51
RSA: An Example
• Sentence similarity according to
• Sim A: human judgment
• Sim B: estimated by a model
• RSA score: correlation between Sim A and Sim B
52
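A minimal sketch of an RSA score, assuming `sim_a` and `sim_b` are (n, n) similarity matrices over the same n items computed in each space (e.g. human judgments vs. model cosine similarities); names are illustrative:

```python
import numpy as np
from scipy.stats import spearmanr

def rsa_score(sim_a, sim_b):
    # Take the upper triangles (excluding the diagonal) ...
    iu = np.triu_indices_from(sim_a, k=1)
    # ... and correlate them across the two spaces
    return spearmanr(sim_a[iu], sim_b[iu]).correlation
```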
Applying RSA to Language
53
[Diagram: items x, y, z represented in two spaces, A and B.]
• What we need: a similarity metric within two spaces A & B
• E.g., A is a vector space, B is a space of trees or graphs
• What we do not need: a mapping between space A & B
Representational Similarity Analysis
Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational similarity analysis: connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 4.
Tree Kernels
• Measuring the similarity between two syntactic trees: count their overlapping subtrees
55
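A toy sketch of a subtree-overlap kernel (a simplified tree kernel, not necessarily the exact kernel used in the paper): represent each tree as nested tuples, enumerate every complete subtree, and count matches.

```python
from collections import Counter

def subtrees(tree, bag=None):
    """Collect a multiset of all complete subtrees, serialized as strings.
    A tree is a (label, child, child, ...) tuple; leaves are strings."""
    if bag is None:
        bag = Counter()
    bag[repr(tree)] += 1
    if isinstance(tree, tuple):
        for child in tree[1:]:
            subtrees(child, bag)
    return bag

def tree_kernel(t1, t2):
    """Count of overlapping complete subtrees between two trees."""
    b1, b2 = subtrees(t1), subtrees(t2)
    return sum((b1 & b2).values())   # multiset intersection

# Example: "a dog runs" vs. "a cat runs" share the determiner subtree,
# the verb phrase subtrees, and their leaves.
t1 = ("S", ("NP", ("DT", "a"), ("NN", "dog")), ("VP", ("VB", "runs")))
t2 = ("S", ("NP", ("DT", "a"), ("NN", "cat")), ("VP", ("VB", "runs")))
print(tree_kernel(t1, t2))
```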
Applying RSA to NN Models of Language
• BERT: a state-of-the-art language model
56
[Figure: results plotted across BERT layers.]
Analysis Techniques: Summary
• Interpreting hidden activations in NN models of language
• Perturbation-based techniques
• Probing classifiers and auxiliary tasks
• Representation similarity analysis
57
BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
1st edition: EMNLP’2018, Brussels 2nd edition: ACL 2019, Florence
Grzegorz Chrupała Ákos Kádár
• Chrupała, Kádár & Alishahi (2015). Learning language through pictures. ACL’2015.
• Kádár, Chrupała & Alishahi (2017). Representation of linguistic form and function in recurrent neural networks. Computational Linguistics.
• Chrupała, Gelderloos & Alishahi (2017). Representations of language in a model of visually grounded speech signal. ACL'2017.
• Alishahi, Barking and Chrupała (2017). Encoding of phonology in a recurrent neural model of grounded speech. CoNLL’2017.
• Kádár, Elliott, Côté, Chrupała & Alishahi (2018). Lessons learned in multilingual grounded language learning. CoNLL'2018.
• Chrupała and Alishahi (2019). Correlating neural and symbolic representations of language. ACL’2019.
• Alishahi, Chrupała and Linzen (2019). Analyzing and Interpreting Neural Networks for NLP. Natural Language Engineering.
Lieke Gelderloos
Representations of language in a model of visually grounded speech signal (Grzegorz Chrupała, Lieke Gelderloos, Afra Alishahi)
Marie Barking
Modeling Language Learning: What Next?
• Direct comparison with infant behaviour
• Learning phases, behavioural patterns
• Videos rather than static images
• Exploiting movement, change and non-verbal cues
• Interaction through dialogue
• Relying on feedback from communication success
59
Getting Closer to Reality
• Grounded language learning
• Mapping utterances to visual representations involves learning linguistic knowledge on various levels
• Examining the nature of emerging linguistic knowledge
• Developing techniques for interpreting the acquired knowledge in (neural) models of language
• Learning language through conversation
• Moving away from passive, data-driven models and towards frameworks which allow for interaction
60