Speech Recognition
19/12/2017
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Amnon Drory & Matan Karo
Overview
Automatic Speech Recognition
100,000,000 (H.M.U. = Hundred Million Users)
The Task: Good Speech Recognition
Traditional Speech Recognition (ASR)
Traditional ASR + Deep Learning
Baidu’s Approach: End-To-End Neural Net
100,000,000
Speed up
Training Data: Annotated Audio
• Thousands of hours of annotated speech for training, in English and Mandarin.
Training Data: Raw Text
• Use raw text to learn a lot about the language.
• This can help us in understanding speech: which words are common, and which word is reasonable in the current context.
Lecture Plan
• Overview
• Input
• Output + CTC
• Model Architecture
• Results
Input
ASR
• A complete speech application includes:
  • Speech transcription
  • Word spotting / trigger word detection
  • Speaker identification / verification
Audio Input
• Input: raw audio, a 1D signal
• Pre-processing: spectrogram
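As a concrete sketch of this pre-processing step, assuming 16 kHz mono input and the common 20 ms window / 10 ms stride convention (the talk's exact parameters may differ):

```python
import numpy as np
from scipy.signal import spectrogram

def audio_to_spectrogram(samples: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Turn a 1D raw-audio signal into a log-magnitude spectrogram."""
    nperseg = int(0.020 * sample_rate)              # 20 ms analysis window
    noverlap = nperseg - int(0.010 * sample_rate)   # hop of 10 ms between frames
    _, _, spec = spectrogram(samples, fs=sample_rate,
                             nperseg=nperseg, noverlap=noverlap)
    return np.log(spec + 1e-10)                     # log compression, avoid log(0)
```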
Preprocessing: SortaGrad
• Deals with the different lengths of utterances.
• In the first training epoch, keep similar-length utterances together by iterating over them in order of increasing length.
(Figure: LibriSpeech clean data.)
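A minimal sketch of this batching scheme, assuming each dataset example is an (audio, transcript) pair; the first epoch sorts by utterance length, later epochs shuffle:

```python
import random

def make_batches(dataset, batch_size, epoch):
    if epoch == 0:
        # first epoch: similar-length (and initially short) utterances end up together
        order = sorted(dataset, key=lambda ex: len(ex[0]))
    else:
        order = random.sample(dataset, len(dataset))   # later epochs: plain shuffle
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```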
Preprocessing: Data Augmentation
• Additive noise:
  • increases robustness to noisy speech
  • increases the data set: 10K hours of raw audio -> 100K hours
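A sketch of how additive-noise augmentation can be implemented; the target SNR is a free parameter here, an assumption for illustration rather than the talk's setting:

```python
import numpy as np

def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise track into a clean utterance at a target signal-to-noise ratio."""
    noise = np.resize(noise, clean.shape)        # tile or crop noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```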
Output
The Goal
• Create a neural network (RNN) from which we can extract a transcription, $y$.
• Train from labeled pairs $(x, y^*)$.
Connectionist Temporal Classification (CTC)
• The network is also called the "Acoustic Model".
• The acoustic model's main issue: length(x) != length(y).
• Solution: divide the transcription task into steps:
  • RNN output neurons $c$ encode a distribution over symbols. Encode: $x \to c$
  • Define a mapping from the distribution to text: $\beta(f(c)) \to y$
  • Find a function $f$ for achieving $y$. In training: summation over all mappings. In testing: maximum likelihood (ML) using beam search.
Connectionist Temporal Classification (CTC)
• The RNN creates probability vectors (a distribution) using Softmax.
• For a grapheme-based model: $c_i \in \{A, B, C, D, \dots, \text{blank}, \text{space}\}$
• Independence assumption: $P(c \mid x) = \prod_{i=1}^{N} P(c_i \mid x)$
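To make the factorization concrete, a minimal sketch, with a hypothetical `frame_probs` array standing in for the RNN's per-frame softmax outputs:

```python
import numpy as np

def sequence_prob(frame_probs: np.ndarray, c) -> float:
    """P(c|x) as the product over frames i of P(c_i|x).

    frame_probs: (T, |alphabet|) softmax outputs; c: one symbol index per frame.
    """
    return float(np.prod([frame_probs[t, ci] for t, ci in enumerate(c)]))
```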
Training With CTC
• Mapping: given a character sequence $c$, merge duplicates and remove blanks.
• Therefore $P(y \mid x)$ is the summation over all possible $c$ with the same mapping:
  $P(y \mid x) = \sum_{c \,:\, \beta(c) = y} P(c \mid x)$
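The collapse mapping $\beta$ is only a few lines of code; a sketch, with "_" standing for the blank symbol:

```python
def beta(c: str, blank: str = "_") -> str:
    """CTC collapse: merge consecutive duplicates, then drop blanks."""
    out, prev = [], None
    for ch in c:
        if ch != prev:              # merge consecutive duplicates
            out.append(ch)
        prev = ch
    return "".join(s for s in out if s != blank)

assert beta("__cc_aaaa____bbbb_____") == "cab"   # the example used later in the slides
```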
Training With CTC
• Update the network parameters $\theta$ to maximize the likelihood of the correct label $y^*$:
  $\theta^* = \arg\max_\theta \sum_i \log P\big(y^{*(i)} \mid x^{(i)}\big) \;\Rightarrow\; \theta^* = \arg\max_\theta \sum_i \log \sum_{c \,:\, \beta(c) = y^{*(i)}} P\big(c \mid x^{(i)}\big)$
• There is an efficient dynamic-programming algorithm to compute the inner summation and its gradient (also implemented in open-source packages).
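For example, PyTorch ships this dynamic-programming loss as `torch.nn.CTCLoss`; a minimal sketch, with placeholder tensors standing in for real network outputs and labels:

```python
import torch
import torch.nn as nn

T, N, C = 50, 4, 29                       # frames, batch size, symbols (blank = index 0)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)  # stand-in for RNN outputs
targets = torch.randint(1, C, (N, 12), dtype=torch.long)             # label indices
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                 # forward-backward DP runs inside
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                           # gradient of the inner summation, for free
```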
Decoding
• The network outputs $P(c \mid x)$; we want $P(y \mid x)$.
• Simple naive solution, Max Decoding: $\beta(\arg\max_c P(c \mid x))$
• Example frame-wise output: _ _ c c _ a a a a _ _ _ _ b b b b _ _ _ _ _  ->  "cab"
Max Decoding
• Doesn't work in practice; good for diagnostics.
Language model: n-gram
• A probabilistic Markov model: $P(x_i \mid x_{i-(n-1)}, \dots, x_{i-1})$
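As a toy illustration, a 2-gram (bigram) model can be estimated by simple counting; the corpus and the unsmoothed probabilities here are illustrative only:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Estimate P(w_i | w_{i-1}) from raw text by counting word pairs."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = ["<s>"] + sentence.split()
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1                  # count (w_{i-1}, w_i) pairs
    return {p: {w: n / sum(c.values()) for w, n in c.items()}
            for p, c in counts.items()}

lm = train_bigram(["the cat sat", "the cat ran"])
print(lm["cat"])    # {'sat': 0.5, 'ran': 0.5} = P(w_i | w_{i-1} = 'cat')
```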
Language model: n-gram
• Examples from Google n-gram corpus
Decoding with LM
• Even with better decoding schemes, the CTC model tends to make spelling and linguistic errors.
• Solution: combine a Language Model!
  $\arg\max_y \log\{P(y \mid x)\, P(y)^{\alpha}\, \mathrm{word\_count}(y)^{\beta}\}$
• $\alpha$ weights the LM against the CTC network
• $\beta$ encourages more words in the transcription
• Use Beam Search to find the transcript $y$
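In log form the objective becomes a weighted sum, which is what a decoder actually scores; a sketch, with the $\alpha$ and $\beta$ values left as placeholders to be tuned on a development set:

```python
import math

def transcript_score(log_p_ctc: float, log_p_lm: float, n_words: int,
                     alpha: float = 1.0, beta: float = 1.0) -> float:
    """log{ P(y|x) * P_LM(y)^alpha * word_count(y)^beta }"""
    return log_p_ctc + alpha * log_p_lm + beta * math.log(max(n_words, 1))
```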
Decoding with LM
Decoding with LM: Beam Search
• The naive approach
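Scoring every possible transcription (the naive approach) is intractable, so beam search keeps only a few top-scoring prefixes per time step. A simplified sketch, assuming `frames` is a list of per-frame symbol distributions with "_" as the blank; a real CTC beam search also merges repeated characters, sums path probabilities instead of taking the max, and folds in the language-model score at word boundaries:

```python
import math
from collections import defaultdict

def beam_search(frames, beam_width: int = 8) -> str:
    beams = {"": 0.0}                                   # prefix -> log score
    for frame in frames:
        candidates = defaultdict(lambda: -math.inf)
        for prefix, logp in beams.items():
            for ch, p in frame.items():
                ext = prefix if ch == "_" else prefix + ch
                candidates[ext] = max(candidates[ext],  # keep the best-scoring path
                                      logp + math.log(p + 1e-10))
        beams = dict(sorted(candidates.items(),         # prune to the top few prefixes
                            key=lambda kv: kv[1], reverse=True)[:beam_width])
    return max(beams, key=beams.get)
```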
Architecture
Model Architecture
• 11 layers
• The chosen architecture:
  • 3 x 2D conv
  • 7 x RNN
  • 1 x FC
• Batch Normalization along the DNN.
RNN as state machine
RNN as grid
Forward Pass
Back Propagation
Weight Update
Bi-Directional RNN
RNN with limited future context
RNN vs. LSTM vs. GRU
Model Architecture
• 11 layers
• The chosen architecture:
  • 3 x 2D conv
  • 7 x RNN
  • 1 x FC
• Batch Normalization along the DNN.
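A hedged sketch of this 11-layer shape in PyTorch; the channel counts, kernel sizes, strides, and the GRU choice are illustrative assumptions rather than the exact hyperparameters, and batch normalization of the recurrent layers is omitted for brevity:

```python
import torch
import torch.nn as nn

class DeepSpeech2Like(nn.Module):
    """3 x 2D conv -> 7 x bidirectional RNN -> 1 x FC, with batch norm."""

    def __init__(self, n_freq: int = 161, hidden: int = 512, n_symbols: int = 29):
        super().__init__()
        self.conv = nn.Sequential(                   # 3 conv layers over (freq, time)
            nn.Conv2d(1, 32, (11, 11), stride=(2, 1), padding=(5, 5)),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, (11, 7), stride=(2, 1), padding=(5, 3)),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, (11, 7), stride=(2, 1), padding=(5, 3)),
            nn.BatchNorm2d(32), nn.ReLU(),
        )
        freq_out = n_freq
        for _ in range(3):                           # each conv halves the freq axis
            freq_out = (freq_out + 1) // 2
        self.rnns = nn.ModuleList(                   # 7 recurrent layers
            [nn.GRU(32 * freq_out if i == 0 else 2 * hidden, hidden,
                    bidirectional=True, batch_first=True) for i in range(7)])
        self.fc = nn.Linear(2 * hidden, n_symbols)   # 1 FC layer -> symbol scores

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        h = self.conv(spec)                          # (batch, 32, freq_out, time)
        h = h.permute(0, 3, 1, 2).flatten(2)         # (batch, time, 32 * freq_out)
        for rnn in self.rnns:
            h, _ = rnn(h)                            # (batch, time, 2 * hidden)
        return self.fc(h).log_softmax(-1)            # per-frame log P(c|x) for CTC
```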
Convolutional Layers: Images
Convolutional Layers: Audio
(Figure: spectrogram with time and frequency axes.)
Results
Test Sets
Results: Sometimes better than Humans
Questions?
The End