Speech Recognition
19/12/2017
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Amnon Drory & Matan Karo
Overview
Automatic Speech Recognition
100,000,000 (H.M.U. = Hundred Million Users)
The Task: Good Speech Recognition
Traditional Speech Recognition (ASR)
Traditional ASR + Deep Learning
Baidu’s Approach: End-To-End Neural Net
100,000,000
Speed up
Training Data: Annotated Audio
• Thousands of hours of annotated speech for training, in English and Mandarin.
Training Data: Raw Text
• Use raw text to learn a lot about the language.
• This can help us in understanding speech: which words are common, and which word is reasonable in the current context.
Lecture Plan
• Overview
• Input
• Output + CTC
• Model Architecture
• Results
Input
ASR
• A complete speech application includes:
  • Speech transcription
  • Word spotting / trigger word detection
  • Speaker identification / verification
Audio Input
• Input: raw audio, a 1D signal
• Pre-processing: spectrogram
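As a concrete sketch of this pre-processing step, assuming 16 kHz mono input and the common 20 ms window / 10 ms stride convention (the talk's exact parameters may differ):

```python
import numpy as np
from scipy.signal import spectrogram

def audio_to_spectrogram(samples: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Turn a 1D raw-audio signal into a log-magnitude spectrogram."""
    nperseg = int(0.020 * sample_rate)              # 20 ms analysis window
    noverlap = nperseg - int(0.010 * sample_rate)   # hop of 10 ms between frames
    _, _, spec = spectrogram(samples, fs=sample_rate,
                             nperseg=nperseg, noverlap=noverlap)
    return np.log(spec + 1e-10)                     # log compression, avoid log(0)
```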
Preprocessing: SortaGrad
• Deals with the different lengths of utterances.
• In the first training epoch, keep similar-length utterances together by iterating over them in order of increasing length.
(Figure: LibriSpeech clean data.)
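A minimal sketch of this batching scheme, assuming each dataset example is an (audio, transcript) pair; the first epoch sorts by utterance length, later epochs shuffle:

```python
import random

def make_batches(dataset, batch_size, epoch):
    if epoch == 0:
        # first epoch: similar-length (and initially short) utterances end up together
        order = sorted(dataset, key=lambda ex: len(ex[0]))
    else:
        order = random.sample(dataset, len(dataset))   # later epochs: plain shuffle
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```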
Preprocessing: Data Augmentation
• Additive noise:
  • increases robustness to noisy speech
  • increases the data set: 10K hours of raw audio -> 100K hours
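A sketch of how additive-noise augmentation can be implemented; the target SNR is a free parameter here, an assumption for illustration rather than the talk's setting:

```python
import numpy as np

def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise track into a clean utterance at a target signal-to-noise ratio."""
    noise = np.resize(noise, clean.shape)        # tile or crop noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```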
Output
The Goal
• Create a neural network (RNN) from which we can extract a transcription, $y$.
• Train from labeled pairs $(x, y^*)$.
Connectionist Temporal Classification (CTC)
• The network is also called the "Acoustic Model".
• The acoustic model's main issue: length(x) != length(y).
• Solution: divide the transcription task into steps:
  • RNN output neurons $c$ encode a distribution over symbols. Encode: $x \to c$
  • Define a mapping from the distribution to text: $\beta(f(c)) \to y$
  • Find a function $f$ for achieving $y$. In training: summation over all mappings. In testing: maximum likelihood (ML) using beam search.
Connectionist Temporal Classification (CTC)
• The RNN creates probability vectors (a distribution) using Softmax.
• For a grapheme-based model: $c_i \in \{A, B, C, D, \dots, \text{blank}, \text{space}\}$
• Independence assumption: $P(c \mid x) = \prod_{i=1}^{N} P(c_i \mid x)$
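To make the factorization concrete, a minimal sketch, with a hypothetical `frame_probs` array standing in for the RNN's per-frame softmax outputs:

```python
import numpy as np

def sequence_prob(frame_probs: np.ndarray, c) -> float:
    """P(c|x) as the product over frames i of P(c_i|x).

    frame_probs: (T, |alphabet|) softmax outputs; c: one symbol index per frame.
    """
    return float(np.prod([frame_probs[t, ci] for t, ci in enumerate(c)]))
```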
Training With CTC
• Mapping: given a character sequence $c$, merge duplicates and remove blanks.
• Therefore $P(y \mid x)$ is the summation over all possible $c$ with the same mapping:
  $P(y \mid x) = \sum_{c \,:\, \beta(c) = y} P(c \mid x)$
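The collapse mapping $\beta$ is only a few lines of code; a sketch, with "_" standing for the blank symbol:

```python
def beta(c: str, blank: str = "_") -> str:
    """CTC collapse: merge consecutive duplicates, then drop blanks."""
    out, prev = [], None
    for ch in c:
        if ch != prev:              # merge consecutive duplicates
            out.append(ch)
        prev = ch
    return "".join(s for s in out if s != blank)

assert beta("__cc_aaaa____bbbb_____") == "cab"   # the example used later in the slides
```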
Training With CTC
• Update the network parameters $\theta$ to maximize the likelihood of the correct label $y^*$:
  $\theta^* = \arg\max_\theta \sum_i \log P\big(y^{*(i)} \mid x^{(i)}\big) \;\Rightarrow\; \theta^* = \arg\max_\theta \sum_i \log \sum_{c \,:\, \beta(c) = y^{*(i)}} P\big(c \mid x^{(i)}\big)$
• There is an efficient dynamic-programming algorithm to compute the inner summation and its gradient (also implemented in open-source packages).
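For example, PyTorch ships this dynamic-programming loss as `torch.nn.CTCLoss`; a minimal sketch, with placeholder tensors standing in for real network outputs and labels:

```python
import torch
import torch.nn as nn

T, N, C = 50, 4, 29                       # frames, batch size, symbols (blank = index 0)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)  # stand-in for RNN outputs
targets = torch.randint(1, C, (N, 12), dtype=torch.long)             # label indices
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                 # forward-backward DP runs inside
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                           # gradient of the inner summation, for free
```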
Decoding
• The network outputs $P(c \mid x)$; we want $P(y \mid x)$.
• Simple naive solution, Max Decoding: $\beta(\arg\max_c P(c \mid x))$
• Example frame-wise output: _ _ c c _ a a a a _ _ _ _ b b b b _ _ _ _ _  ->  "cab"
Max Decoding
• Doesn't work in practice; good for diagnostics.
Language model: n-gram
• A probabilistic Markov model: $P(x_i \mid x_{i-(n-1)}, \dots, x_{i-1})$
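As a toy illustration, a 2-gram (bigram) model can be estimated by simple counting; the corpus and the unsmoothed probabilities here are illustrative only:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Estimate P(w_i | w_{i-1}) from raw text by counting word pairs."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = ["<s>"] + sentence.split()
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1                  # count (w_{i-1}, w_i) pairs
    return {p: {w: n / sum(c.values()) for w, n in c.items()}
            for p, c in counts.items()}

lm = train_bigram(["the cat sat", "the cat ran"])
print(lm["cat"])    # {'sat': 0.5, 'ran': 0.5} = P(w_i | w_{i-1} = 'cat')
```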
Language model: n-gram
• Examples from Google n-gram corpus
Decoding with LM
• Even with better decoding schemes, the CTC model tends to make spelling and linguistic errors.
• Solution: combine a Language Model!
  $\arg\max_y \log\{P(y \mid x)\, P(y)^{\alpha}\, \mathrm{word\_count}(y)^{\beta}\}$
• $\alpha$ weights the LM against the CTC network
• $\beta$ encourages more words in the transcription
• Use Beam Search to find the transcript $y$
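In log form the objective becomes a weighted sum, which is what a decoder actually scores; a sketch, with the $\alpha$ and $\beta$ values left as placeholders to be tuned on a development set:

```python
import math

def transcript_score(log_p_ctc: float, log_p_lm: float, n_words: int,
                     alpha: float = 1.0, beta: float = 1.0) -> float:
    """log{ P(y|x) * P_LM(y)^alpha * word_count(y)^beta }"""
    return log_p_ctc + alpha * log_p_lm + beta * math.log(max(n_words, 1))
```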
Decoding with LM
Decoding with LM: Beam Search
• The naive approach
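Scoring every possible transcription (the naive approach) is intractable, so beam search keeps only a few top-scoring prefixes per time step. A simplified sketch, assuming `frames` is a list of per-frame symbol distributions with "_" as the blank; a real CTC beam search also merges repeated characters, sums path probabilities instead of taking the max, and folds in the language-model score at word boundaries:

```python
import math
from collections import defaultdict

def beam_search(frames, beam_width: int = 8) -> str:
    beams = {"": 0.0}                                   # prefix -> log score
    for frame in frames:
        candidates = defaultdict(lambda: -math.inf)
        for prefix, logp in beams.items():
            for ch, p in frame.items():
                ext = prefix if ch == "_" else prefix + ch
                candidates[ext] = max(candidates[ext],  # keep the best-scoring path
                                      logp + math.log(p + 1e-10))
        beams = dict(sorted(candidates.items(),         # prune to the top few prefixes
                            key=lambda kv: kv[1], reverse=True)[:beam_width])
    return max(beams, key=beams.get)
```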
Architecture
Model Architecture
• 11 layers
• The chosen architecture:
  • 3 x 2D conv
  • 7 x RNN
  • 1 x FC
• Batch Normalization along the DNN.
RNN as state machine
RNN as grid
Forward Pass
Back Propagation
Weight Update
Bi-Directional RNN
RNN with limited future context
RNN vs. LSTM vs. GRU
Model Architecture
• 11 layers
• The chosen architecture:
  • 3 x 2D conv
  • 7 x RNN
  • 1 x FC
• Batch Normalization along the DNN.
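A hedged sketch of this 11-layer shape in PyTorch; the channel counts, kernel sizes, strides, and the GRU choice are illustrative assumptions rather than the exact hyperparameters, and batch normalization of the recurrent layers is omitted for brevity:

```python
import torch
import torch.nn as nn

class DeepSpeech2Like(nn.Module):
    """3 x 2D conv -> 7 x bidirectional RNN -> 1 x FC, with batch norm."""

    def __init__(self, n_freq: int = 161, hidden: int = 512, n_symbols: int = 29):
        super().__init__()
        self.conv = nn.Sequential(                   # 3 conv layers over (freq, time)
            nn.Conv2d(1, 32, (11, 11), stride=(2, 1), padding=(5, 5)),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, (11, 7), stride=(2, 1), padding=(5, 3)),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, (11, 7), stride=(2, 1), padding=(5, 3)),
            nn.BatchNorm2d(32), nn.ReLU(),
        )
        freq_out = n_freq
        for _ in range(3):                           # each conv halves the freq axis
            freq_out = (freq_out + 1) // 2
        self.rnns = nn.ModuleList(                   # 7 recurrent layers
            [nn.GRU(32 * freq_out if i == 0 else 2 * hidden, hidden,
                    bidirectional=True, batch_first=True) for i in range(7)])
        self.fc = nn.Linear(2 * hidden, n_symbols)   # 1 FC layer -> symbol scores

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        h = self.conv(spec)                          # (batch, 32, freq_out, time)
        h = h.permute(0, 3, 1, 2).flatten(2)         # (batch, time, 32 * freq_out)
        for rnn in self.rnns:
            h, _ = rnn(h)                            # (batch, time, 2 * hidden)
        return self.fc(h).log_softmax(-1)            # per-frame log P(c|x) for CTC
```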
Convolutional Layers: Images
Convolutional Layers: Audio
(Figure: spectrogram with time and frequency axes.)
Results
Test Sets
Results: Sometimes better than Humans
Questions?
The End