![Page 1: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/1.jpg)
1
Dr. Odette Scharenborg
Associate Professor & Delft Technology FellowDelft University of TechnologyThe [email protected]
The representation of speech in the human and artificial brain
![Page 2: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/2.jpg)
2
Dr. Odette Scharenborg
Associate Professor & Delft Technology FellowDelft University of TechnologyThe [email protected]
The representation of speech in the human and deep neural networks
![Page 3: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/3.jpg)
3
Acknowledgements
• Polina Drozdova• Roeland van Hout
• Sebastian Tiesmeyer• Nikki van der Gouw• Martha Larson
• Najim Dehak• Junrui Ni• Mark Hasegawa-Johnson
![Page 4: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/4.jpg)
4
Ultimate goal
Automatic speech recognition (ASR) for all theworld’s languages & all types of speech
![Page 5: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/5.jpg)
5
Problem
Transcription 1
Transcription 2
Transcription 3
![Page 6: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/6.jpg)
6
Possible solution
Create a flexible ASR that is trained on language/speech type X andmapped to language/speech type Y
![Page 7: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/7.jpg)
7
What we need
1. Invariant units of speech which transfer easily and accurately to new languages & types of speech
2. ASR system that can flexibly adapt to new languages & types of speech
3. ASR system that can decide when to create a new sound category
4. ASR system that can create a new sound category
![Page 8: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/8.jpg)
8
What we need
1. Invariant units of speech which transfer easily and accurately to new languages & types of speech
2. ASR system that can flexibly adapt to new languages & types of speech
3. ASR system that can decide when to create a new sound category
4. ASR system that can create a new sound category
![Page 9: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/9.jpg)
9
Human listeners
1. Adapt to all types of pronunciations/speakers
2. Create new sound categories when learning a new language
![Page 10: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/10.jpg)
10
Comparing humans and machines
• Different hardware: vs.
• Same task: recognition of words from speech
Comparing humans and machines [Scharenborg, 2007]:– provide insights into human speech processing
(computational modelling)– improve ASR technology (MFCCs, PLPs, template-
based ASR)
![Page 11: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/11.jpg)
11
Ultimate question
• What is the optimal representation of speech fora DNN-based ASR to be able to map onelanguage/type of speech to another?
![Page 12: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/12.jpg)
12
This talk
• How do humans adapt? Perceptual learning in humans
• Perceptual learning in DNNs
• Perceptual learning in DNNs over time
• Representation of speech in DNNs
![Page 13: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/13.jpg)
13
Perceptual learning
“[…] relatively long-lasting changes to an organism’s perceptual system that improve its ability to respond to its environment and are caused by this environment” [Goldstone, 1998, p. 586]
[Drozdova, van Hout, Scharenborg, 2018; Scharenborg & Janse, 2013]
![Page 14: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/14.jpg)
14
Perceptual learning
“[…] relatively long-lasting changes to an organism’s perceptual system that improve its ability to respond to its environment and are caused by this environment” [Goldstone, 1998, p. 586]
[Drozdova, van Hout, Scharenborg, 2018; Scharenborg & Janse, 2013]
![Page 15: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/15.jpg)
15
[l/ɹ]
appe[l/ɹ] (= apple) wekke[l/ɹ] (= alarm clock)
mee[l/ɹ]
(= flour) (= lake)
Lexically-guided perceptual learningex
posu
rete
st
Ambiguous input
Adjustment of the phoneme category boundary
[Drozdova, van Hout & Scharenborg, 2015, 2016; Norris, McQueen & Cutler, 2003]
Disambiguating context
![Page 16: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/16.jpg)
16
General procedure Exposure: lexical
decision task
Amb-l group:- 40 ambiguous /l/ words- 40 natural /r/ words
Amb-r group- 40 ambiguous /r/ words- 40 natural /l/ words
Test: phoneticcategorisation task
Ambiguous [l/r] test items
+ filler words and nonwords 5 steps, each 18 times
20% 40% 50% 60% 80% /r/
![Page 17: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/17.jpg)
17
Phonetic categorisation results
R: listeners who learned to map [l/ɹ] onto [ɹ]
L: listeners who learned to map [l/ɹ] onto [l]
[Scharenborg, Mitterer & McQueen, Interspeech, 2011]
![Page 18: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/18.jpg)
18
Lexically-guided perceptual learning
• Causes a temporary change in phonetic category boundaries
• Needs lexical or phonotactic knowledge
• Generalises to words that have not been presented earlier
• Speaker dependent
• Fast: only 10-15 ambiguous items needed
• Long-lasting effect
[Clarke-Davidson et al., 2008; Eisner & McQueen, 2005, 2006; Kraljic & Samuel, 2005, 2007; McQueen et al., 2006; Scharenborg & Janse, 2013; Drozdova, van Hout, & Scharenborg, 2016]
![Page 19: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/19.jpg)
19
Perceptual learning in DNNs
inspired by
… but can they flexibly adapt to different speech types like human listeners can?
[Scharenborg, Tiesmeyer, Hasegawa-Johnson, Dehak, 2018]
![Page 20: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/20.jpg)
20
If so, visualize the mechanism by which the DNNs perform this adaptation intermediate representations that might have correlates in human perceptual adaptation
DNN
![Page 21: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/21.jpg)
21
Lexical retuning
1. Train baseline DNN on Dutch read speech ~64h:
2. Retrain the baseline model using the acoustic stimuli of the human experiment [Scharenborg & Janse, 2013]:
Baseline
• 120 natural words• 40 [ɹ]-final words• 40 [l]-final words
AmbR
• 120 natural words• 40 [l]-final words• 40 [ɹ]-final words:
[ɹ] replaced by [l/ɹ]
AmbL
• 120 natural words• 40 [ɹ]-final words• 40 [l]-final words:
[l] replaced by [l/ɹ]
![Page 22: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/22.jpg)
22
Findings Expectations
More [ɹ] responses when exposed to wekke[l/ɹ] Learn to interpret [l/ɹ] as [ɹ]
More [l] responses when exposed to appe[l/ɹ] Learn to interpret [l/ɹ] as [l]
Differences particularly between AmbL and AmbR models
![Page 23: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/23.jpg)
23
Results
• Lexical retuning results
• Inter-segment distances
• Visualisations of the clusters in the hidden layers
![Page 24: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/24.jpg)
24
Lexical retuning results
Test set = train setSound
presentedSound(s) classified (%)
Baseline, retrained model[l] l(97.6), m(3.4)[ɹ] ɹ(95.0), sil(5.0)[l/ɹ] l(46.9), sil(23.5), ə(19.8), ɹ(8.6), ɛi (1.2)AmbL model[l] o(78.0), ɔ(10.4), sil(2.4), e(2.4), ɛi (2.4), øː(2.4)[ɹ] ɹ(87.5), e(7.5), ʌu(2.5), ɛi(2.5)[l/ɹ] l(81.5), ə(12.3), ə(2.5), ɛi(2.5), t(1.2)AmbR model[l] l(97.6), sil(3.4)[ɹ] ɹ(72.5), ə(15.0), sil(10.0), t(2.5)[l/ɹ] ɹ(88.9), sil(6.2), ə(4.9)
![Page 25: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/25.jpg)
25
Inter-segment distances for each layer
For higher layers, distance [l]-[l/ɹ] decreases for AmbL model [ɹ]-[l/ɹ] decreases for AmbR model
![Page 26: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/26.jpg)
26
Visualisations of the learned clusters in the hidden layers
![Page 27: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/27.jpg)
27
PCA visualisations of the 4th hidden layers
[l/ɹ] [l][ɹ]
Baseline model
![Page 28: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/28.jpg)
28
PCA visualisations of the 4th hidden layers
AmbR model[l/ɹ] [l][ɹ]
![Page 29: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/29.jpg)
29
PCA visualisations of the 4th hidden layers
AmbL model[l/ɹ] [l][ɹ]
![Page 30: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/30.jpg)
30
So …
• DNNs trained with ambiguous sounds show perceptual learning
• Not only at the output layer but also at intermediate layers
![Page 31: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/31.jpg)
31
Perceptual learning in DNNs over time
• Human listeners only need 10-15 ambiguousitems
• How about DNNs?
[Ni, Hasegawa-Johnson, Scharenborg, 2019]
![Page 32: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/32.jpg)
32
Same experimental set-up
• Retraining in 10 bins of 4 ambiguous items
• Test on unseen data from next bin
1 2 3 4 5 6 7 8 9 10+ + + + + + + +
![Page 33: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/33.jpg)
33
Proportion of [ɹ] responses over time
Baseline model
![Page 34: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/34.jpg)
34
Proportion of [ɹ] responses over time
AmbR
![Page 35: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/35.jpg)
35
Proportion of [ɹ] responses over time
AmbL model
![Page 36: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/36.jpg)
36
So …
DNNs• Need only a few ambiguous items
• Show a similar step-like function as humans
![Page 37: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/37.jpg)
37
Representation of speech in DNNs
• Free reign to the DNN
What speech representations are learned whenfaced with the large variabiliy in speech?
[Scharenborg, van der Gouw, Larson, Marchiori, 2019]
![Page 38: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/38.jpg)
38
Research question
Can a generic DNN-based ASR trained to distinguish high-level speech sounds learn the underlying structures used by human listeners to understand speech?
![Page 39: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/39.jpg)
39
An example from vision
Extract and learn features associated with planes andchairs?
≠?
![Page 40: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/40.jpg)
40
Methodology
• Naïve, generic feed-forward deep neural network– 3 hidden layers with 1024 nodes each
• Corpus Spoken Dutch, 64h read speech
• Consonsant/vowel classification task
• Visualize the activations of the speech representations at the hidden layers– Using different linguistic labels, to check for clustering
![Page 41: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/41.jpg)
41
Linguistic labels
• Vowels/consonants differ in the way they are produced: absence/presence of a constriction in the vocal tract
• Phoneme: the smallest unit that changes the meaning of a word e.g. ball and wall
• Articulatory feature: acoustic correlate of the articulatory properties of speech sounds– Manner of articulation: type of constriction (e.g., full closure,
narrowing)– Place of articulation: location of the constriction (consonants only)– Tongue position: location of the bunch of the tongue (vowels only)
![Page 42: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/42.jpg)
42
C/V classification results
• Frame level accuracy: 85.5%– Averaged over 5 training runs– Consonants: 85.19% / Vowels: 86.69% correct
![Page 43: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/43.jpg)
43
Visualisations: V/C classification task
Original input of the network Layer 1 of the network
Layer 2 of the network Layer 3 of the network
• vowel• consonant
![Page 44: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/44.jpg)
44
Manner of articulation, 3rd layerConsonants Vowels
![Page 45: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/45.jpg)
45
Place of articulation, 3rd layerConsonants Vowels
![Page 46: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/46.jpg)
46
Summary
DNNs• Progressive abstraction in subsequent hidden
layers
• Capture the structure in speech by clustering the speech signal, without explicit training
• Adapt to nonstandard speech on the basis of only a few labelled examples, showing a step-like function
![Page 47: The representation of speech in the human and artificial brainspecom.nw.ru/history/sites/2019/keynotes/SPECOM... · The representation of speech in the human and artificial brain](https://reader035.vdocuments.mx/reader035/viewer/2022070107/6022d6a6cf867c0f2c491c68/html5/thumbnails/47.jpg)
47
Our needs
• Invariant units of speech which transfer easily and accurately to other languages
DNN phone categories are flexible DNNs automatically create linguistic categories
• ASR system that can flexibly adapt to new types of speech
DNNs are able to do that
Next: ASR system that can decide when to create a new sound category, and do so