![Page 2: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/2.jpg)
Motivation
I Natural way of communication (No trainingneeded)
I Leaves hands and eyes free (Good forfunctionally disabled)
I Effective (Higher data rate than typing)
I Can be transmitted/received inexpensively(phones)
2 / 117
![Page 3: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/3.jpg)
A dream of Artificial Intelligence
2001: A space odyssey (1968)
3 / 117
![Page 4: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/4.jpg)
ASR in a Broader Context
DialogueManager
AutomaticSpeech
Recognition
SpokenLanguage
UnderstandingText to Speech
4 / 117
![Page 5: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/5.jpg)
The ASR Scope
Convert speech into text
AutomaticSpeech
Recognition“My name is . . . ”
[confidence score]
Not considered here:
I non-verbal signals
I prosody
I multi-modal interaction
5 / 117
![Page 6: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/6.jpg)
The ASR Scope
Convert speech into text
AutomaticSpeech
Recognition“My name is . . . ”
[confidence score]
Not considered here:
I non-verbal signals
I prosody
I multi-modal interaction
5 / 117
![Page 7: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/7.jpg)
A very long endeavour1952, Bell laboratories, isolated digit recognition,single speaker, hardware based [2]
[2] K. H. Davis, R. Biddulph, and S. Balashek. “Automatic Recognition of Spoken Digits”. In: JASA 24.6 (1952),pp. 637–642
6 / 117
![Page 8: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/8.jpg)
An underestimated challenge
for 60 years many boldannouncements
7 / 117
![Page 9: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/9.jpg)
http://www.itl.nist.gov/iad/mig/publications/ASRhistory/
8 / 117
![Page 10: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/10.jpg)
Main variables in ASR
Speaking mode isolated words vs continuous speech
Speaking style read speech vs spontaneous speech
Speakers speaker dependent vs speakerindependent
Vocabulary small (<20 words) vs large (>50 000words)
Robustness against background noise
9 / 117
![Page 11: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/11.jpg)
Challenges — VariabilityBetween speakers
I AgeI GenderI AnatomyI Dialect
Within speaker
I StressI EmotionI Health conditionI Read vs SpontaneousI Adaptation to
environment (Lombardeffect)
I Adaptation to listener
Environment
I NoiseI Room acousticsI Microphone distanceI Microphone, telephoneI Bandwidth
Listener
I AgeI Mother tongueI Hearing lossI Known / unknownI Human / Machine
10 / 117
![Page 12: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/12.jpg)
Applications today
Call centers:
I traffic information
I time-tables
I booking. . .
Accessibility
I Dictation
I hand-free control (TV, video, telephone)
Smart phones
I Siri, Android. . .
11 / 117
![Page 13: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/13.jpg)
Outline
Speech Signal Representations
Template Matching
Probabilistic ApproachKnowledge Modelling
Performance Measures
Robustness and Adaptation
Speaker Recognition
More details in DT2118: “Speech and SpeakerRecognition”
12 / 117
![Page 14: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/14.jpg)
Components of ASR System
Speech SignalSpectralAnalysis
FeatureExtraction
Searchand Match
Recognised Words
Acoustic Models
Lexical Models
Language Models
Representation
Constraints - KnowledgeDecoder
13 / 117
![Page 15: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/15.jpg)
Outline
Speech Signal Representations
Template Matching
Probabilistic ApproachKnowledge Modelling
Performance Measures
Robustness and Adaptation
Speaker Recognition
14 / 117
![Page 16: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/16.jpg)
Speech Signal Representations
Speech SignalSpectralAnalysis
FeatureExtraction
Representation
Goals:
I disregard irrelevant information
I optimise relevant information for modelling
15 / 117
![Page 17: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/17.jpg)
Speech Signal Representations
Speech SignalSpectralAnalysis
FeatureExtraction
Representation
Means:
I try to model essential aspects of speechproduction
I imitate auditory processes
I consider properties of statistical modelling
15 / 117
![Page 18: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/18.jpg)
Examples of Speech Sounds
http://www.speech.kth.se/wavesurfer/
16 / 117
![Page 19: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/19.jpg)
Feature Extraction and Speech Production
Esophagus
Larynx
Soft palate(velum)
Hard palate
Lung
Trachea
Jaw
Oral cavity
Teeth
TongueLip
Nostril
Nasal cavity
Diaphragm
cavityPharyngeal
����������������������������������������
Trachea
Muscle Force and Relaxation
Lungs
FoldsVocal
Glottis
PharyngealCavity Cavity
Cavity
Oral
Nasal
Velum
NoseOutput
MouthOutput
17 / 117
![Page 20: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/20.jpg)
Source/Filter Model, General Case
Vowels
Esophagus
Larynx
Soft palate(velum)
Hard palate
Lung
Trachea
Jaw
Oral cavity
Teeth
TongueLip
Nostril
Nasal cavity
Diaphragm
cavityPharyngeal
�����������������������������������������������������������������������������������������������������������������������
��������������������������������������������������������������������������������������������������������������������������
������
������������������������������������������������������
���������������������������������������������
������������������������
������������������������
����������������������������
����������������������������
����������������������������������������
����������������������������������������
������������������������
������������������������
����������������
������������������������
������������������������
��������������������
��������������������
����������������
����������������
������������������������
������������������������
�������������������������
�������������������������
� Source (periodic)� Front Cavity� Back Cavity� Back Cavity (2ndapprox.)
18 / 117
![Page 21: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/21.jpg)
Source/Filter Model, General Case
Fricatives (e.g. sh) or Plosive (e.g. k)
Esophagus
Larynx
Soft palate(velum)
Hard palate
Lung
Trachea
Jaw
Oral cavity
Teeth
TongueLip
Nostril
Nasal cavity
Diaphragm
cavityPharyngeal
�����������������������������������������������������������������������������������������������������������������������
�����������������������������������������������������������������������������������������������������������������������
��������
������������������
������������������
������������������������
������������������������
������������������������������
������������������������������
����������������������������
����������������������������
����������������������������������������
����������������������������������������
������������������������
������������������������
����������������
������������������������
������������������������
��������������������
��������������������
����������������
����������������
������������������������
������������������������
������������������������������
������������������������������
� Source (noise orimpulsive)� Front Cavity� Back Cavity� Back Cavity (2ndapprox.)
18 / 117
![Page 22: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/22.jpg)
Source/Filter Model, General Case
Fricatives (e.g. s) or Plosive (e.g. t)
Esophagus
Larynx
Soft palate(velum)
Hard palate
Lung
Trachea
Jaw
Oral cavity
Teeth
TongueLip
Nostril
Nasal cavity
Diaphragm
cavityPharyngeal
�����������������������������������������������������������������������������������������������������������������������
�����������������������������������������������������������������������������������������������������������������������
��������
��������������������������������������������������������
��������������������������������������������������������
������������������������
������������������������
����������������������������
����������������������������
����������������������������������������
����������������������������������������
������������������������
������������������������
����������������
������������������������
������������������������
��������������������
��������������������
����������������
����������������
������������������������
������������������������
�������������������������
�������������������������
� Source (noise orimpulsive)� Front Cavity� Back Cavity� Back Cavity (2ndapprox.)
18 / 117
![Page 23: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/23.jpg)
Source/Filter Model, General Case
Nasalised Vowels
Esophagus
Larynx
Soft palate(velum)
Hard palate
Lung
Trachea
Jaw
Oral cavity
Teeth
TongueLip
Nostril
Nasal cavity
Diaphragm
cavityPharyngeal
�����������������������������������������������������������������������������������������������������������������������
��������������������������������������������������������������������������������������������������������������������������
������
������������������������������������������������������
�����������������������������������������������������������������������������������������������
��������������������������������������������������
���������������������������������������
���������������������������������������
����������������
������������������������
������������������������
��������������������
��������������������
����������������
����������������
��������������������
��������������������
������������������������
������������������������
����������������������������
����������������������������
����������������������������������������
����������������������������������������
������������������������
������������������������
��������������������
��������������������
����������������������������
����������������������������
��������������������������������
��������������������������������
��������������������
��������������������
������������
������������
�������������������������
�������������������������
� Source (periodic)� Front Cavity� Back Cavity� Back Cavity (2ndapprox.)
18 / 117
![Page 24: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/24.jpg)
Examples
0 1 2 3 4 5 6 7 8
freqency (kHz)
dB
vocalic sound
������������������������
������������������������
����������������������������
����������������������������
����������������������������������������
����������������������������������������
������������������������
������������������������
����������������
������������������������
������������������������
��������������������
��������������������
����������������
����������������
������������������������
������������������������
�������������������������
�������������������������
0 1 2 3 4 5 6 7 8
freqency (kHz)
dB
fricative sound
������������������������������
������������������������������
����������������������������
����������������������������
����������������������������������������
����������������������������������������
������������������������
������������������������
����������������
������������������������
������������������������
��������������������
��������������������
����������������
����������������
������������������������
������������������������
������������������������������
������������������������������
19 / 117
![Page 25: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/25.jpg)
Relevant vs Irrelevant InformationFor the purpose of transcribing words:Relevant: vocal tract shape → spectral envelopeIrrelevant: vocal fold vibration frequency (f0) →spectral details
0 1 2 3 4 5 6 7 8
freqency (kHz)
dB
vocalic sound
Exceptions:
I tonal languages (Chinese)
I pitch and prosody convey meaning
20 / 117
![Page 26: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/26.jpg)
Relevant vs Irrelevant InformationFor the purpose of transcribing words:Relevant: vocal tract shape → spectral envelopeIrrelevant: vocal fold vibration frequency (f0) →spectral details
0 1 2 3 4 5 6 7 8
freqency (kHz)
dB
vocalic sound
Exceptions:
I tonal languages (Chinese)
I pitch and prosody convey meaning
20 / 117
![Page 27: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/27.jpg)
Linear Prediction AnalysisAttempt to model the vocal tract filter
x [n] =
p∑k=1
akx [n − k]
better match at spectral peaks than valleys21 / 117
![Page 28: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/28.jpg)
Mel Frequency Cepstrum Coefficients
I imitate aspects of auditory processing
I de facto standard in ASR
I does not assume all-pole model of the spectrum
I uncorrelated: easier to model statistically
22 / 117
![Page 29: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/29.jpg)
MFCCs Calculation
FFT Mel FilterbankCosine
Transform
C1 C2 C3 C4−20
0
20
40cepstrum of /a:/
0 5 10 1540
60
80
100
filterbank channels
spectrum of /a:/
23 / 117
![Page 30: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/30.jpg)
Cosine Transform
Cj =√
2N
∑Ni=1 Ai cos( jπ(i−0.5)
N )
0 5 10 15−0.5
0
0.5
W1
0 5 10 15−0.5
0
0.5
W2
0 5 10 15−0.5
0
0.5
W3
0 5 10 15−0.5
0
0.5
W4
Ai
0 5 10 1540
60
80
100
filterbank channels
spectrum of /a:/
0 5 10 1520
40
60
80
filterbank channels
spectrum of /s:/
Cj
C1 C2 C3 C4−20
0
20
40cepstrum of /a:/
C1 C2 C3 C4−60
−40
−20
0
20cepstrum of /s:/
24 / 117
![Page 31: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/31.jpg)
MFCCs: typical values
I 12 Coefficients C1–C12
I Energy (could be C0)
I Delta coefficients (derivatives in time)
I Delta-delta (second order derivatives)
I total: 39 coefficients per frame (analysiswindow)
25 / 117
![Page 32: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/32.jpg)
A time varying signal
I speech is time varying
I short segment are quasi-stationary
I use short time analysis
26 / 117
![Page 33: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/33.jpg)
Short-Time Fourier Analysis
7500 8000 8500 9000 9500 10000−0.1
−0.05
0
0.05
0.1am
plit
ude
samples
0 1000 2000 3000 4000 5000 6000 7000 8000−20
−15
−10
−5
0
5
pow
er
spectr
um
(dB
)
frequency (Hz)
27 / 117
![Page 34: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/34.jpg)
Short-Time Fourier Analysis
7500 8000 8500 9000 9500 10000−0.1
−0.05
0
0.05
0.1am
plit
ude
samples
0 1000 2000 3000 4000 5000 6000 7000 8000−16
−14
−12
−10
−8
−6
−4
−2
0
2
pow
er
spectr
um
(dB
)
frequency (Hz)
27 / 117
![Page 35: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/35.jpg)
Short-Time Fourier Analysis
7500 8000 8500 9000 9500 10000−0.1
−0.05
0
0.05
0.1am
plit
ude
samples
0 1000 2000 3000 4000 5000 6000 7000 8000−16
−14
−12
−10
−8
−6
−4
−2
pow
er
spectr
um
(dB
)
frequency (Hz)
27 / 117
![Page 36: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/36.jpg)
Short-Time Fourier Analysis
27 / 117
![Page 37: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/37.jpg)
Windowing, typical values
I signal sampling frequency: 8–20kHz
I analysis window: 10–50ms
I frame interval: 10–25ms (100–40Hz)
28 / 117
![Page 38: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/38.jpg)
Frame-Based Processing
File: sx352.WAV Page: 1 of 1 Printed: Mon Dec 05 09:01:39
1
2
3
4
5
6
7
kHz
0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95
time
ttclnixaymaxttclniydxraokkclixh#
mytoaccording
29 / 117
![Page 39: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/39.jpg)
Comparing frames
I city block distance: d(x , y) =∑
i |xi − yi |I Euclidean distance: d(x , y) =
√∑i(xi − yi)2
I Mahalanobis distance:d(x , y) =
∑i(xi − µy)2/σy
I probability function:f (X = x |µ,Σ) = N(x ;µ,Σ)
I artificial neural networks: d = f (∑
i wixi − θ)
30 / 117
![Page 40: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/40.jpg)
Comparing Utterances
In order to recognise speech we have to be able tocompare different utterances
Va jobbaru me Vad jobbar du med
31 / 117
![Page 41: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/41.jpg)
Fixed vs Variable Length Representation
32 / 117
![Page 42: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/42.jpg)
Combining frame-wise scores intoutterance scores
Template Matching
I oldest technique
I simple comparison of template patterns
I compensate for varying speech rate (DynamicProgramming)
Hidden Markov Models (HMMs)
I most used technique
I models of segmental structure of speech
I recognition by Viterbi search (DynamicProgramming)
33 / 117
![Page 43: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/43.jpg)
Outline
Speech Signal Representations
Template Matching
Probabilistic ApproachKnowledge Modelling
Performance Measures
Robustness and Adaptation
Speaker Recognition
34 / 117
![Page 44: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/44.jpg)
Template Matching
35 / 117
![Page 45: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/45.jpg)
Dynamic ProgrammingI compare any possible alignmentI problem: exponential with H and K!
1 H
1K
unknown utterance
refe
renc
eut
tera
nce
36 / 117
![Page 46: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/46.jpg)
Dynamic ProgrammingDynamic Time Warping (DTW) algorithm
1: for h = 1 to H do2: for k = 1 to K do3: AccD[h, k] = LocD[h, k] + min(AccD[h − 1, k],
AccD[h − 1, k − 1],AccD[h, k − 1])
1 H
1K
unknown utterance
refe
renc
eut
tera
nce
36 / 117
![Page 47: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/47.jpg)
Dynamic ProgrammingDynamic Time Warping (DTW) algorithm
1: for h = 1 to H do2: for k = 1 to K do3: AccD[h, k] = LocD[h, k] + min(AccD[h − 1, k],
AccD[h − 1, k − 1],AccD[h, k − 1])
1 H
1K
unknown utterance
refe
renc
eut
tera
nce
36 / 117
![Page 48: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/48.jpg)
Dynamic ProgrammingDynamic Time Warping (DTW) algorithm
1: for h = 1 to H do2: for k = 1 to K do3: AccD[h, k] = LocD[h, k] + min(AccD[h − 1, k],
AccD[h − 1, k − 1],AccD[h, k − 1])
1 H
1K
unknown utterance
refe
renc
eut
tera
nce
36 / 117
![Page 49: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/49.jpg)
Dynamic ProgrammingDynamic Time Warping (DTW) algorithm
1: for h = 1 to H do2: for k = 1 to K do3: AccD[h, k] = LocD[h, k] + min(AccD[h − 1, k],
AccD[h − 1, k − 1],AccD[h, k − 1])
1 H
1K
unknown utterance
refe
renc
eut
tera
nce
36 / 117
![Page 50: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/50.jpg)
Dynamic ProgrammingDynamic Time Warping (DTW) algorithm
1: for h = 1 to H do2: for k = 1 to K do3: AccD[h, k] = LocD[h, k] + min(AccD[h − 1, k],
AccD[h − 1, k − 1],AccD[h, k − 1])
1 H
1K
unknown utterance
refe
renc
eut
tera
nce
36 / 117
![Page 51: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/51.jpg)
Dynamic ProgrammingDynamic Time Warping (DTW) algorithm
1: for h = 1 to H do2: for k = 1 to K do3: AccD[h, k] = LocD[h, k] + min(AccD[h − 1, k],
AccD[h − 1, k − 1],AccD[h, k − 1])
1 H
1K
unknown utterance
refe
renc
eut
tera
nce
36 / 117
![Page 52: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/52.jpg)
Dynamic ProgrammingDynamic Time Warping (DTW) algorithm
1: for h = 1 to H do2: for k = 1 to K do3: AccD[h, k] = LocD[h, k] + min(AccD[h − 1, k],
AccD[h − 1, k − 1],AccD[h, k − 1])
1 H
1K
unknown utterance
refe
renc
eut
tera
nce AccD[H,K]
36 / 117
![Page 53: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/53.jpg)
DP Example: Spelling
I observations are letters
I local distance: 0 (same letter), 1 (differentletter)
I Unknown utterance: ALLDRIG
I Reference1: ALDRIG
I Reference2: ALLTID
I Problem: find closest match
Distance char-by-char:
I ALLDRIG–ALDRIG = 5
I ALLDRIG–ALLTID = 4
37 / 117
![Page 54: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/54.jpg)
DP Example: Solution
LocD[h,k]=
G 1 1 1 1 1 1 0
I 1 1 1 1 1 0 1
R 1 1 1 1 0 1 1
D 1 1 1 0 1 1 1
L 1 0 0 1 1 1 1
A 0 1 1 1 1 1 1
A L L D R I G
AccD[h,k]=
G 5 4 4 3 2 1 0
I 4 3 3 2 1 0 1
R 3 2 2 1 0 1 2
D 2 1 1 0 1 2 3
L 1 0 0 1 2 3 4
A 0 1 2 3 4 5 6
A L L D R I G
Distance ALLDRIG–ALDRIG: AccD[H,K] = 0
Distance ALLDRIG–ALLTID? (5min)
38 / 117
![Page 55: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/55.jpg)
DP Example: Solution
LocD[h,k]=
G 1 1 1 1 1 1 0
I 1 1 1 1 1 0 1
R 1 1 1 1 0 1 1
D 1 1 1 0 1 1 1
L 1 0 0 1 1 1 1
A 0 1 1 1 1 1 1
A L L D R I G
AccD[h,k]=
G 5 4 4 3 2 1 0
I 4 3 3 2 1 0 1
R 3 2 2 1 0 1 2
D 2 1 1 0 1 2 3
L 1 0 0 1 2 3 4
A 0 1 2 3 4 5 6
A L L D R I G
Distance ALLDRIG–ALDRIG: AccD[H,K] = 0Distance ALLDRIG–ALLTID? (5min)
38 / 117
![Page 56: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/56.jpg)
DP Example: Solution
LocD[h,k]=
D 1 1 1 0 1 1 1
I 1 1 1 1 1 0 1
T 1 1 1 1 1 1 1
L 1 0 0 1 1 1 1
L 1 0 0 1 1 1 1
A 0 1 1 1 1 1 1
A L L D R I G
AccD[h,k]=
D 5 3 3 2 3 3 3
I 4 2 2 2 2 2 3
T 3 1 1 1 2 3 4
L 2 0 0 1 2 3 4
L 1 0 0 1 2 3 4
A 0 1 2 3 4 5 6
A L L D R I G
Distance ALLDRIG–ALDRIG: AccD[H,K] = 0Distance ALLDRIG–ALLTID: AccD[H,K] = 3
39 / 117
![Page 57: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/57.jpg)
Best path: Backtracking
Sometimes we want to know the path
1. at each point [h,k] remember the minimumdistance predecessor (back pointer)
2. at the end point [H,K] follow the back pointersuntil the start
40 / 117
![Page 58: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/58.jpg)
Properties of Template Matching
Pros:
+ No need for phonetic transcriptions
+ within-word co-articulation for free
+ high time resolution
Cons:
– cross-word co-articulation not modelled
– requires recordings of every word
– not easy to model variation
– does not scale up with vocabulary size
41 / 117
![Page 59: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/59.jpg)
Outline
Speech Signal Representations
Template Matching
Probabilistic ApproachKnowledge Modelling
Performance Measures
Robustness and Adaptation
Speaker Recognition
42 / 117
![Page 60: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/60.jpg)
Components of ASR System
Speech SignalSpectralAnalysis
FeatureExtraction
Searchand Match
Recognised Words
Acoustic Models
Lexical Models
Language Models
Representation
Constraints - KnowledgeDecoder
43 / 117
![Page 61: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/61.jpg)
A probabilistic perspective
1. Compute probability of a word sequence giventhe acoustic observation: P(words|sounds)
2. find the optimal word sequence by maximisingthe probability:
words = arg maxP(words|sounds)
44 / 117
![Page 62: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/62.jpg)
A probabilistic perspective: Bayes’ rule
P(words|sounds) =P(sounds|words)P(words)
P(sounds)
I P(sounds|words) can be estimated fromtraining data and transcriptions
I P(words): a priori probability of the words(Language Model)
I P(sounds): a priori probability of the sounds(constant, can be ignored)
45 / 117
![Page 63: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/63.jpg)
Components of ASR System
Speech SignalSpectralAnalysis
FeatureExtraction
Searchand Match
Recognised Words
Acoustic Models
Lexical Models
Language Models
Representation
Constraints - KnowledgeDecoder
Acoustic Models
46 / 117
![Page 64: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/64.jpg)
Probabilistic ModellingProblem: How do we model P(sounds|words)?
File: sx352.WAV Page: 1 of 1 Printed: Mon Dec 05 09:01:39
1
2
3
4
5
6
7
kHz
0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95
time
ttclnixaymaxttclniydxraokkclixh#
mytoaccording
Every feature vector (observation at time t) is acontinuous stochastic variable (e.g. MFCC)
47 / 117
![Page 65: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/65.jpg)
Probabilistic ModellingProblem: How do we model P(sounds|words)?
File: sx352.WAV Page: 1 of 1 Printed: Mon Dec 05 09:01:39
1
2
3
4
5
6
7
kHz
0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95
time
ttclnixaymaxttclniydxraokkclixh#
mytoaccording
Every feature vector (observation at time t) is acontinuous stochastic variable (e.g. MFCC)
47 / 117
![Page 66: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/66.jpg)
StationarityProblem: speech is not stationaryI we need to model short segments independentlyI the fundamental unit can not be the word, but
must be shorterI usually we model three segments for each
phoneme
48 / 117
![Page 67: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/67.jpg)
Local probabilities (frame-wise)
If segment sufficiently short
P(sounds|segment)
can be modelled with standard probabilitydistributionsUsually Gaussian or Gaussian Mixture
49 / 117
![Page 68: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/68.jpg)
Global Probabilities (utterance)
Problem: How do we combine the differentP(sounds|segment) to form P(sounds|words)?
Answer: Hidden Markov Model (HMM)
w1 w2 w3
w
ao1 ao2 ao3
ao
sh1 sh2 sh3
sh
wash
50 / 117
![Page 69: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/69.jpg)
Hidden Markov Models (HMMs)
s1 s2
s3
Elements:
set of states: S = {s1, s2, s3}transition probabilities: T (sa, sb) = P(sb, t|sa, t − 1)prior probabilities: π(sa) = P(sa, t0)state to observation prob-abilities:
B(o, sa) = P(o|sa)
51 / 117
![Page 70: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/70.jpg)
Hidden Markov Models (HMMs)
s1 s2 s3
Elements:
set of states: S = {s1, s2, s3}transition probabilities: T (sa, sb) = P(sb, t|sa, t − 1)prior probabilities: π(sa) = P(sa, t0)state to observation prob-abilities:
B(o, sa) = P(o|sa)
51 / 117
![Page 71: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/71.jpg)
Hidden Markov Models (HMMs)
1 Hunknown utterance
52 / 117
![Page 72: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/72.jpg)
Hidden Markov Models (HMMs)
1 Hunknown utterance
52 / 117
![Page 73: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/73.jpg)
HMM-questions
1. what is the probability that the model hasgenerated the sequence of observations?(isolated word recognition)
2. what is the most likely state sequence given theobservation sequence? (continuous speechrecognition)
3. how can the model parameters be estimatedfrom examples? (training)
53 / 117
![Page 74: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/74.jpg)
HMM-questions
1. what is the probability that the model hasgenerated the sequence of observations?(isolated word recognition) forward algorithm
2. what is the most likely state sequence given theobservation sequence? (continuous speechrecognition) Viterbi algorithm [5]
3. how can the model parameters be estimatedfrom examples? (training) Baum-Welch[1]
[5] A. J. Viterbi. “Error Bounds for Convolutional Codes and an Asymtotically optimum decoding algorithm”. In:IEEE Trans. Inform. Theory IT-13 (Apr. 1967), pp. 260–269
[1] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. “A maximization technique occurring in the statistical analysisof probabilistic functions of Markov chains”. In: Ann. Math. Statist. 41.1 (1970), pp. 164–171
53 / 117
![Page 75: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/75.jpg)
Isolated Words Recognition
hmm1
hmm2
hmm3
. . .
hmmK
Compare Likelihoods (forward-backward)
54 / 117
![Page 76: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/76.jpg)
Continuous Speech Recognition
hmm1
hmm2
hmm3
. . .
hmmK
Viterbi algorithm
55 / 117
![Page 77: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/77.jpg)
Modelling Coarticulation
Example peat /pi:t/ vs wheel /wi:l/
0 0 . 1 0 . 2 0 . 3
-2 0 0 0
-1 0 0 0
0
1 0 0 0
0 0 . 1 0 . 2 0 . 3
-2 0 0 0
-1 0 0 0
0
1 0 0 0
2 0 0 0
T im e (s e c o n d s )
Freq
uenc
y(H
z)
0 0 . 1 0 . 2 0 . 30
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
T im e (s e c o n d s )
Freq
uenc
y(H
z)
0 0 . 1 0 . 2 0 . 30
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
56 / 117
![Page 78: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/78.jpg)
Modelling Coarticulation
Context dependent models (CD-HMMs)
I Duplicate each phoneme model depending onleft and right context:
I from “a” monophone model
I to “d−a+f”, “d−a+g”, “l−a+s”. . . triphonemodels
I If there are N = 50 phonemes in the language,there are N3 = 125000 potential triphones
I many of them are not exploited by the language
57 / 117
![Page 79: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/79.jpg)
Amount of parameters
Example:
I a large vocabulary recogniser may have 60000triphone models
I each model has 3 states
I each state may have 32 mixture componentswith 1 + 39× 2 parameters each (weight,means, variances): 39× 32× 2 + 32 = 2528
Totally it is 60000× 3× 2528 = 455 millionparameters!
58 / 117
![Page 80: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/80.jpg)
Similar Coarticulation
/ri:/ vs /wi:/
0 0.1 0.2 0.3 0.4-600
-400
-200
0
200
400
Time (seconds)
Frequency(Hz)
0 0.1 0.2 0.3 0.40
1000
2000
3000
4000
0 0.1 0.2 0.3 0.4-600
-400
-200
0
200
400
Time (seconds)
Frequency(Hz)
0 0.1 0.2 0.3 0.40
1000
2000
3000
4000
59 / 117
![Page 81: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/81.jpg)
Tying to reduce complexity
Example: similar triphones d−a+m and t−a+m
I same right context, similar left context
I 3rd state is expected to be very similar
I 2nd state may also be similar
States (and their parameters) can be sharedbetween models
+ reduce complexity
+ more data to estimate each parameter
– fine detail may be lost
done with CART tree methodology
60 / 117
![Page 82: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/82.jpg)
Tying to reduce complexity
Example: similar triphones d−a+m and t−a+m
I same right context, similar left context
I 3rd state is expected to be very similar
I 2nd state may also be similar
States (and their parameters) can be sharedbetween models
+ reduce complexity
+ more data to estimate each parameter
– fine detail may be lost
done with CART tree methodology
60 / 117
![Page 83: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/83.jpg)
Components of ASR System
Speech SignalSpectralAnalysis
FeatureExtraction
Searchand Match
Recognised Words
Acoustic Models
Lexical Models
Language Models
Representation
Constraints - KnowledgeDecoder
Lexical Models
61 / 117
![Page 84: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/84.jpg)
Components of ASR System
Speech SignalSpectralAnalysis
FeatureExtraction
Searchand Match
Recognised Words
Acoustic Models
Lexical Models
Language Models
Representation
Constraints - KnowledgeDecoder
Lexical Models
61 / 117
![Page 85: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/85.jpg)
Lexical Models
I in general specify sequence of phoneme foreach word
I example:
“dictionary” IPA X-SAMPAUK: /d I k S @ n (@) ô i/ /d I k S @ n (@) r i/
USA: /d I k S @ n E ô i/ /d I k S @ n E r i/
I expensive resources
I include multiple pronunciations
I phonological rules (assimilation, deletion)
62 / 117
![Page 86: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/86.jpg)
Pronunciation Network
Example: tomato
/t/
/@/
/oU/
/m/
/A/
/eI/
/R/
/t/
/oU/
0.5
0.3
0.2
0.7
0.3
0.2
0.8
0.8
0.2
63 / 117
![Page 87: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/87.jpg)
Assimilation
did you /d I dZ j @/set you /s E tS 3/
last year /l æ s tS i: ô/because you’ve /b i: k @ Z u: v/
64 / 117
![Page 88: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/88.jpg)
Deletion
find him /f a I n I m/around this /@ ô aU n I s/
let me in /l E m i: n/
65 / 117
![Page 89: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/89.jpg)
Out of Vocabulary Words
I Proper names often not in lexicon
I derive pronunciation automatically
I English has very complexgrapheme-to-phoneme rules
I attempts to derive pronunciation from speechrecordings
66 / 117
![Page 90: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/90.jpg)
Components of ASR System
Speech SignalSpectralAnalysis
FeatureExtraction
Searchand Match
Recognised Words
Acoustic Models
Lexical Models
Language Models
Representation
Constraints - KnowledgeDecoder
Language Models
67 / 117
![Page 91: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/91.jpg)
Components of ASR System
Speech SignalSpectralAnalysis
FeatureExtraction
Searchand Match
Recognised Words
Acoustic Models
Lexical Models
Language Models
Representation
Constraints - KnowledgeDecoder
Language Models
67 / 117
![Page 92: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/92.jpg)
Why do we need language models?
Bayes’ rule:
P(words|sounds) =P(sounds|words)P(words)
P(sounds)
whereP(words): a priori probability of the words(Language Model)
We could use non informative priors(P(words) = 1/N), but. . .
68 / 117
![Page 93: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/93.jpg)
Branching Factor
I if we have N words in the dictionary
I at every word boundary we have to consider Nequally likely alternatives
I N can be in the order of millions
word
word1
word2
. . .
wordN
69 / 117
![Page 94: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/94.jpg)
Ambiguity
“ice cream” vs “I scream”
/aI s k ô i: m/
70 / 117
![Page 95: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/95.jpg)
Language Models
P(words|sounds) =P(sounds|words)P(words)
P(sounds)
Finite state networks (hand-made, see lab)
I formal language, e.g. traffic control
Statistical Models (N-grams)
I unigrams: P(wi)
I bigrams: P(wi |wi−1)
I trigrams: P(wi |wi−1,wi−2)
I . . .
71 / 117
![Page 96: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/96.jpg)
Chomsky’s formal grammar
Noam Chomsky: linguist, philosopher, . . .
G = (V ,T ,P , S)
where
V : set of non-terminal constituentsT : set of terminals (lexical items)P : set of production rulesS : start symbol
72 / 117
![Page 97: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/97.jpg)
ExampleS = sentenceV = {NP (noun phrase),
NP1, VP (verbphrase), NAME, ADJ,V (verb), N (noun)}
T = {Mary , person , loves, that , . . . }
P = {S → NP VPNP → NAMENP → ADJ NP1NP1 → NVP → VERB NPNAME → MaryV → lovesN → personADJ → that }
S
NP
NAME
Mary
VP
V
loves
NP
ADJ
that
NP1
N
person
73 / 117
![Page 98: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/98.jpg)
ExampleS = sentenceV = {NP (noun phrase),
NP1, VP (verbphrase), NAME, ADJ,V (verb), N (noun)}
T = {Mary , person , loves, that , . . . }
P = {S → NP VPNP → NAMENP → ADJ NP1NP1 → NVP → VERB NPNAME → MaryV → lovesN → personADJ → that }
S
NP
NAME
Mary
VP
V
loves
NP
ADJ
that
NP1
N
person
73 / 117
![Page 99: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/99.jpg)
Formal Language Models
I only used for simple tasks
I hard to code by hand
I people do not speak following formal grammars
74 / 117
![Page 100: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/100.jpg)
Statistical Grammar Models (N-grams)
Simply count co-occurrence of words in large textdata sets
I unigrams: P(wi)
I bigrams: P(wi |wi−1)
I trigrams: P(wi |wi−1,wi−2)
I . . .
75 / 117
![Page 101: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/101.jpg)
Language Models: complexity
Increasing N in N-grams leads to:
1. more complex decoders
2. difficulties in training the LM parameters
76 / 117
![Page 102: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/102.jpg)
Knowledge Models in ASR
Acoustic Models trained on hours of annotatedspeech recordings (especially developedspeech databases)
Lexical Model usually produced by hand by experts(or generated by rules)
Language Models trained on millions of words oftext (often from news papers)
77 / 117
![Page 103: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/103.jpg)
Outline
Speech Signal Representations
Template Matching
Probabilistic ApproachKnowledge Modelling
Performance Measures
Robustness and Adaptation
Speaker Recognition
78 / 117
![Page 104: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/104.jpg)
Word Accuracy
A = 100N − S − D − I
NWhere
I N : total number of reference words
I S : substitutions
I D: deletions
I I : insertions
79 / 117
![Page 105: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/105.jpg)
Word Accuracy: example
Ref/Rec I wanted badly to meet you
I corrreally delwanted corrto ins corrsee subyou corr
6 words, 1 substitution, 1 insertion, 1 deletion
A = 1006− 1− 1− 1
6= 50%
requires dynamic programming80 / 117
![Page 106: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/106.jpg)
Measure Difficulty
Language Perplexity
B = 2H , H = −∑∀W
P(W ) log2(P(W ))
I P(W ) is the probability of the word sequence(language model)
I H is called entropy
I B can be seen as measure of average numberof words that can follow any given word
I Example: equiprobable digit sequences B = 10
81 / 117
![Page 107: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/107.jpg)
Effect of adding features
82 / 117
![Page 108: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/108.jpg)
Effect of adding training dataSwichboard data
83 / 117
![Page 109: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/109.jpg)
Effect of adding Gaussians
84 / 117
![Page 110: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/110.jpg)
Effect of adding data for language models
Million sentences
85 / 117
![Page 111: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/111.jpg)
Some dictation systems
I vocabulary over 100 000 words
I many languages
I systems: Nuance NatuallySpeaking, Microsoft,(IBM ViaVoice), (Dragon Dictate)
86 / 117
![Page 112: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/112.jpg)
New applications
I Indexing of TV and radio programs (offline),Google
I real-time subtitling of TV programs (re-speakerthat summarises)
I voice search (Google)
I language learning
I smart phones
87 / 117
![Page 113: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/113.jpg)
Outline
Speech Signal Representations
Template Matching
Probabilistic ApproachKnowledge Modelling
Performance Measures
Robustness and Adaptation
Speaker Recognition
88 / 117
![Page 114: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/114.jpg)
Main variables in ASR
Speaking mode isolated words vs continuous speech
Speaking style read speech vs spontaneous speech
Speakers speaker dependent vs speakerindependent
Vocabulary small (<20 words) vs large (>50 000words)
Robustness against background noise
89 / 117
![Page 115: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/115.jpg)
http://www.itl.nist.gov/iad/mig/publications/ASRhistory/
90 / 117
![Page 116: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/116.jpg)
Why is it so hard?
Language
91 / 117
![Page 117: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/117.jpg)
Why is it so hard?
Language
database
91 / 117
![Page 118: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/118.jpg)
Why is it so hard?
Language
database
application
91 / 117
![Page 119: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/119.jpg)
Challenges — VariabilityBetween speakers
I AgeI GenderI AnatomyI Dialect
Within speaker
I StressI EmotionI Health conditionI Read vs SpontaneousI Adaptation to
environment (Lombardeffect)
I Adaptation to listener
Environment
I NoiseI Room acousticsI Microphone distanceI Microphone, telephoneI Bandwidth
Listener
I AgeI Mother tongueI Hearing lossI Known / unknownI Human / Machine
92 / 117
![Page 120: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/120.jpg)
Sheep and Goats [3]
[3] G. Doddington, W. Liggett, A. Martin, M. Przybocki, and D. Reynolds. “SHEEP, GOATS, LAMBS and WOLVESA Statistical Analysis of Speaker Performance in the NIST 1998 Speaker Recognition Evaluation”. In:INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING. 1998
93 / 117
![Page 121: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/121.jpg)
Sheep and Goats [3]
[3] G. Doddington, W. Liggett, A. Martin, M. Przybocki, and D. Reynolds. “SHEEP, GOATS, LAMBS and WOLVESA Statistical Analysis of Speaker Performance in the NIST 1998 Speaker Recognition Evaluation”. In:INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING. 1998
93 / 117
![Page 122: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/122.jpg)
Exmpl: spontaneous vs hyper-articulated
Va jobbaru me Vad jobbar du med
“What is your occupation”(“What work you with”)
94 / 117
![Page 123: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/123.jpg)
Examples of reduced pronunciation
Spoken Written In EnglishTesempel Till exempel for exampleahamba och han bara and he justbafatt bara for att just becausejavende jag vet inte I don’t know
95 / 117
![Page 124: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/124.jpg)
Microphone distanceHeadset
2 m distance
96 / 117
![Page 125: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/125.jpg)
How do we cope with variability?Ideally: models that generalise
Language
database
application
97 / 117
![Page 126: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/126.jpg)
How do we cope with variability?Ideally: models that generalise
Language
database
application
97 / 117
![Page 127: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/127.jpg)
How do we cope with variability?Large companies use insane quantities of data
Language
databaseapplication
98 / 117
![Page 128: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/128.jpg)
How do we cope with variability?Adaptation
Language
database
application
99 / 117
![Page 129: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/129.jpg)
How do we cope with variability?Adaptation
Language
database
application
99 / 117
![Page 130: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/130.jpg)
Adaptation: Example
Enrolment in Dictation Systems
I let the user read a small text before using thesystem
Beta version of smartphone applications
I the company has all the rights on datagenerated
100 / 117
![Page 131: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/131.jpg)
Limitations
I lack of context
I require huge amounts of training data
101 / 117
![Page 132: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/132.jpg)
Adapted from Mikael Parkvall’s Lingvistiska Samlarbilder, Nr.96:“Problem med automatisk taligenkanning”
SPRÄNGD ANKA
PASSERADE TOMATER
VEGGIE WHOPPER PIROG
PYTTIPANNA
OXJÄRPE
NASI GORENG
SEMLA
SILLSALLAD
POTATISGRATÄNG
FALUKORVKNAPRIGA KLIKEX
PEPPARROTKÖTTPEPPARBIFFKOKOSBOLLARZUCCHINI
SPRÄNGD ANKA
ROTMOS
KROPPKAKA
LINSSOPPALINGONSYLT
RIS A LA MALTA
LÖVBIFFKORV STROGANOFF
SOCKERKAKA
KÖTTBULLARLINSPASTEJ
BUILLABAISE
TACOS
GULASCHGRÖNKÅLS-BAKLAVA
OLIVBRÖD
PYTTIPANNAKASSLER
TORTILLASMOUSSAKAMARÄNGSWISS
SALSAOXRULLADER
RABARBERPAJMANGOCHUTNEY OXJÄRPE
102 / 117
![Page 133: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/133.jpg)
Lack of Generalisation[4]
0
5
10
15
20
25
30
1 10 100 1,000 10,000 100,000 1,000,000 10,000,000
Hours of training data
Wor
d Er
ror R
ate
(%)
Less supervised More supervised
Broadcast news transcription
linear extrapolation
[4] R. Moore. “A Comparison of the Data Requirements of Automatic Speech Recognition Systems and HumanListeners”. In: Proc. of Eurospeech. Geneva, Switzerland, 2003, pp. 2582–2584
103 / 117
![Page 134: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/134.jpg)
Lack of Generalisation[4]
0
5
10
15
20
25
30
1 10 100 1,000 10,000 100,000 1,000,000 10,000,000
Hours of training data
Wor
d Er
ror R
ate
(%)
Less supervised More supervised
Broadcast news transcription
linear extrapolation
In order to reach 10-years-old’s performance,ASR needs 4 to 70 human lifetimes
exposure to speech!!
[4] R. Moore. “A Comparison of the Data Requirements of Automatic Speech Recognition Systems and HumanListeners”. In: Proc. of Eurospeech. Geneva, Switzerland, 2003, pp. 2582–2584
103 / 117
![Page 135: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/135.jpg)
New directions
I Production inspired modelling
I Study children’s speech acquisitionI Modelling and decision techniques
I EigenvoicesI Deep learning neural networks
104 / 117
![Page 136: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/136.jpg)
Outline
Speech Signal Representations
Template Matching
Probabilistic ApproachKnowledge Modelling
Performance Measures
Robustness and Adaptation
Speaker Recognition
105 / 117
![Page 137: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/137.jpg)
Speaker Recognition
Created by Hakan Melin
106 / 117
![Page 138: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/138.jpg)
Person Identification
Methods rely on:
I something you posses:key, magnetic card, . . .
I something you know:PIN-code, password, . . .
I something you are:physical attributes, behaviour (biometrics)
107 / 117
![Page 139: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/139.jpg)
Recognition, Verification, IdentificationRecognition: general termSpeaker verification:
I an identity is claimed and is verified by voice
I binary decision (accept/reject)
I performance independent of number of users
Speaker identification:
I choose one of N speakers
I close set: voice belongs to one of the Nspeakers
I open set: any person can access the system
I problem difficulty increases with N
108 / 117
![Page 140: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/140.jpg)
Text Dependence
Either fix the content or recognise it. Examples:
I Fixed password (text dependent)
I User-specific password
I System prompts the text (preventsimpostors from recording and playingback the password)
I any word is allowed (text independent)
text
inde
pen
dent
109 / 117
![Page 141: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/141.jpg)
Representations
Speech Recognition:
I represent speech content
I disregard speaker identity
Speaker Recognition:
I represent speaker identity
I disregard speech content
Surprisingly:
I MFCCs used for both
I suggests that feature extraction could beimproved
110 / 117
![Page 142: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/142.jpg)
Representations
Speech Recognition:
I represent speech content
I disregard speaker identity
Speaker Recognition:
I represent speaker identity
I disregard speech content
Surprisingly:
I MFCCs used for both
I suggests that feature extraction could beimproved
110 / 117
![Page 143: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/143.jpg)
Speaker Verification
Registration (training, enrolment)
Spectralanalysis
Training utterancesfrom a new client
Trainmodel q1 q2 q7
a2 2,
Trained speaker model
Verification
Spectralanalysis MatchingAccess utterance
Claimed identity
Accept / Reject
Problem: The matching scorebetween the client model and theutterance is sensitive todistortion, utterance duration, etc.
111 / 117
![Page 144: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/144.jpg)
Modelling Techniques
HMMs
I Text dependent systems
I state sequence represents allowed utterance
GMMs (Gaussian Mixture Models)
I Text independent systems
I large number of Gaussian components
I sequential information not used
SVM (Support Vector Machines)
Combined models
112 / 117
![Page 145: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/145.jpg)
Evaluation
Claimed Decision:Identity Accept Reject
True OK False Reject (FR)
False False Accept (FA) OK
113 / 117
![Page 146: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/146.jpg)
Score Distribution and Error Balance
( )speaker"true"sf
( )speaker"false"sf
P("false accept")
P("false reject")
$s
Decision threshold
114 / 117
![Page 147: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/147.jpg)
Performance Measures
I False Rejection Rate (FR)
I False Acceptance Rate (FA)
I Half Total Error Rate (HTER = (FR+FA)/2)
I Equal Error Rate (EER)
I Detection Error Trade-off (DET) Curve
115 / 117
![Page 148: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/148.jpg)
PER vs Commercial System
2 5 10 20
2
5
10
20
False Accept Rate (in %)
FalseRejectRate(in%)
PER eval02, es=E2a_G8, ts=S2b_G8
commercial_system_with_adaptationCTT:combo,originalCTT:combo,retrained
Retrained:Background modeltrained on PER speech
Original:Background model trained ontelephone speech
116 / 117
![Page 149: DT2112 Speech Recognition by Computers - KTH · 2015. 1. 29. · letter) I Unknown utterance: ALLDRIG I Reference1: ALDRIG I Reference2: ALLTID I Problem: nd closest match Distance](https://reader034.vdocuments.mx/reader034/viewer/2022052018/6031945e78d89b42296c0a08/html5/thumbnails/149.jpg)
More information and mathematicalformulations in DT2118
117 / 117