
Speech Recognition


Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Amnon Drory & Matan Karo

Overview


Automatic Speech Recognition


100,000,000 (H.M.U = Hundred Million Users)


The Task: Good Speech Recognition


Traditional Speech Recognition (ASR)


Traditional ASR + Deep Learning


Baidu’s Approach: End-To-End Neural Net


100,000,000

Speed up


Training Data: Annotated Audio


Thousands of hours of annotated speech for training, in English and Mandarin.

Training Data: Raw Text

Use text to learn a lot about the language: which words are common, and which word is reasonable in the current context.

This can help us in understanding speech.

Lecture Plan

• Overview

• Input

• Output + CTC

• Model Architecture

• Results


Input


ASR

• A complete speech application:

  • Speech transcription

  • Word spotting / trigger word

  • Speaker identification / verification

Audio Input

• Input: raw audio, a 1D signal

• Pre-processing: spectrogram
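A minimal sketch of that spectrogram pre-processing step, assuming 16 kHz mono audio in a NumPy array; the 20 ms window and 10 ms stride are illustrative values, not necessarily the ones used in the paper.

```python
import numpy as np

def spectrogram(audio, sample_rate=16000, window_ms=20, stride_ms=10):
    window = int(sample_rate * window_ms / 1000)   # samples per frame
    stride = int(sample_rate * stride_ms / 1000)   # hop between frames
    assert len(audio) >= window
    n_frames = 1 + (len(audio) - window) // stride
    frames = np.stack([audio[i * stride : i * stride + window] * np.hanning(window)
                       for i in range(n_frames)])
    # log-magnitude of the FFT of each frame -> a (time, frequency) image
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))
```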


Preprocessing: SortaGrad

• Dealing with different lengths of utterances

• Try to keep similar-length utterances together (a rough batching sketch follows below)

(LibriSpeech clean data.)
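SortaGrad is a curriculum trick: in the first training epoch, utterances are processed roughly from shortest to longest, which stabilizes early training and also keeps similar-length utterances in the same minibatch; later epochs shuffle as usual. A rough sketch, assuming `dataset` is a list of (audio, transcript) pairs; the helper name and batching details are illustrative.

```python
import random

def make_batches(dataset, batch_size, first_epoch):
    # SortaGrad: in the first epoch, go through utterances from shortest to longest
    if first_epoch:
        ordered = sorted(dataset, key=lambda ex: len(ex[0]))
    else:
        ordered = list(dataset)
        random.shuffle(ordered)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```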

Preprocessing: Data Augmentation

• Additive noise

  • increases robustness to noisy speech

  • increases the data set: 10k hours of raw audio -> 100k hours
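A minimal sketch of additive-noise augmentation, assuming `speech` and `noise` are NumPy float arrays at the same sample rate; the SNR parameterization is illustrative.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    # tile/crop the noise clip to match the utterance length
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    # scale the noise to the requested signal-to-noise ratio
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```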


Output


The Goal

• Create a neural network (RNN) from which we can extract a transcription y.

• Train from labeled pairs (x, y*).


Connectionist Temporal Classification (CTC)

• The network is also called an “acoustic model”.

• Main issue for the acoustic model: length(x) != length(y)

• Solution: divide the transcription task into steps:

  • The RNN output neurons c encode a distribution over symbols. Encode: x → c

  • Define a mapping β from symbol sequences to text: β(c) → y

  • Find a function f for achieving y. In training: a summation over all mappings. In testing: maximum likelihood using beam search.


Connectionist Temporal Classification (CTC)

• The RNN creates probability vectors (a distribution over symbols) using softmax.

  For a grapheme-based model: c ∈ {A, B, C, D, …, blank, space}

• Independence assumption: $P(c \mid x) = \prod_{i=1}^{N} P(c_i \mid x)$


Training With CTC

• Mapping β: given a character sequence c, remove duplicates and blanks.

• Therefore P(y|x) is the summation over all possible c with the same mapping:

$$P(y \mid x) = \sum_{c \,:\, \beta(c) = y} P(c \mid x)$$
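As a toy illustration (not the algorithm actually used), the collapse mapping β and the brute-force definition of P(y|x) can be written out directly; real CTC replaces the exponential enumeration below with dynamic programming. The tiny alphabet and inputs are illustrative.

```python
from itertools import product
import numpy as np

SYMBOLS = ["_", "a", "b"]        # "_" is the blank

def beta(c):
    # remove consecutive duplicates, then remove blanks
    out, prev = [], None
    for s in c:
        if s != prev and s != "_":
            out.append(s)
        prev = s
    return "".join(out)

def p_y_given_x(y, frame_probs):
    # frame_probs: (time, symbols) per-frame distributions from the network
    total = 0.0
    for c in product(range(len(SYMBOLS)), repeat=len(frame_probs)):
        if beta([SYMBOLS[i] for i in c]) == y:
            total += np.prod([frame_probs[t][i] for t, i in enumerate(c)])
    return total
```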


Training With CTC

• Update the network parameters θ to maximize the likelihood of the correct label y*:

$$\theta^* = \arg\max_\theta \sum_i \log P\!\left(y^{*(i)} \mid x^{(i)}\right) \;\Rightarrow\; \theta^* = \arg\max_\theta \sum_i \log \sum_{c \,:\, \beta(c) = y^{*(i)}} P\!\left(c \mid x^{(i)}\right)$$

• There is an efficient dynamic programming algorithm to compute the inner summation and its gradient (also implemented in open-source packages).
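One such open-source implementation is `torch.nn.CTCLoss`. A hedged sketch of a single training step with a toy model; the shapes, symbol inventory, and the single-GRU "acoustic model" are illustrative placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

num_symbols = 29            # e.g. a-z, space, apostrophe, blank (index 0 = blank)
model = nn.GRU(input_size=161, hidden_size=256, batch_first=True)
proj = nn.Linear(256, num_symbols)
ctc = nn.CTCLoss(blank=0)

spectrograms = torch.randn(8, 200, 161)              # (batch, time, features)
input_lengths = torch.full((8,), 200, dtype=torch.long)
targets = torch.randint(1, num_symbols, (8, 30))     # label indices, no blanks
target_lengths = torch.full((8,), 30, dtype=torch.long)

hidden, _ = model(spectrograms)
log_probs = proj(hidden).log_softmax(dim=-1)          # (batch, time, symbols)
# CTCLoss expects (time, batch, symbols)
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
```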


Decoding

• The network outputs P(c|x); we want P(y|x).

• Simple naive solution: max decoding

$$\hat{y} = \beta\!\left(\arg\max_c P(c \mid x)\right)$$

For example, the frame-wise argmax sequence
_ _ c c _ a a a a _ _ _ _ b b b b _ _ _ _ _
collapses to "cab".


Max Decoding


• Doesn't work well in practice, but is good for diagnostics.
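A minimal sketch of max (greedy) decoding together with the collapse β, assuming `log_probs` is a (time, symbols) array and index 0 is the blank; the alphabet is illustrative.

```python
import numpy as np

ALPHABET = ["_"] + list("abcdefghijklmnopqrstuvwxyz ")   # "_" marks the blank

def max_decode(log_probs):
    best = np.argmax(log_probs, axis=1)          # most likely symbol per frame
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:             # drop repeats, then drop blanks
            out.append(ALPHABET[idx])
        prev = idx
    return "".join(out)
```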

Language model: n-gram

• A probabilistic Markov model: $P(x_i \mid x_{i-(n-1)}, \ldots, x_{i-1})$
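As a toy illustration of the idea (not the large n-gram model used in practice), a bigram model with add-one smoothing can be estimated from raw text like this; the training sentences and smoothing choice are illustrative.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    counts, context = defaultdict(Counter), Counter()
    for words in sentences:
        for prev, cur in zip(["<s>"] + words, words + ["</s>"]):
            counts[prev][cur] += 1
            context[prev] += 1
    vocab = {w for words in sentences for w in words} | {"</s>"}
    def prob(cur, prev):
        # add-one smoothed conditional probability P(cur | prev)
        return (counts[prev][cur] + 1) / (context[prev] + len(vocab))
    return prob

p = train_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(p("cat", "the"))   # P(cat | the)
```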


Language model: n-gram


• Examples from Google n-gram corpus

Decoding with LM

• Even with better decoding schemes, the CTC model tends to make spelling and linguistic errors.

• Solution: Combine a Language Model!

$$\arg\max_y \; \log\!\left\{ P(y \mid x)\; P_{\text{LM}}(y)^{\alpha}\; \mathrm{word\_count}(y)^{\beta} \right\}$$

• α: weights the language model against the CTC network

• β: encourages more words in the transcription

• Use beam search to find the transcript y
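A hedged sketch of how one beam-search candidate could be scored under the objective above; the α and β values and the inputs are illustrative, and a real decoder applies this incrementally inside the CTC prefix beam search rather than to finished transcripts.

```python
import math

def candidate_score(ctc_log_prob, lm_prob, transcript, alpha=2.0, beta=1.5):
    # log{ P(y|x) * P_LM(y)^alpha * word_count(y)^beta } in additive log form
    words = max(len(transcript.split()), 1)      # avoid log(0) for empty hypotheses
    return ctc_log_prob + alpha * math.log(lm_prob) + beta * math.log(words)
```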


Decoding with LM


Decoding with LM: Beam Search


• The Naive approach
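To make the naive approach concrete, here is a highly simplified beam search over per-frame symbol distributions; a real CTC prefix beam search additionally merges hypotheses that collapse to the same text and folds the language-model term into the score. Shapes and the beam width are illustrative.

```python
import numpy as np

def simple_beam_search(log_probs, beam_width=8):
    # log_probs: (time, symbols) array of per-frame log probabilities
    beams = [((), 0.0)]                      # (symbol sequence, log probability)
    for frame in log_probs:
        candidates = []
        for seq, score in beams:
            for sym, lp in enumerate(frame):
                candidates.append((seq + (sym,), score + lp))
        # keep only the highest-scoring hypotheses
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]                          # best raw symbol sequence (pre-collapse)
```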

Decoding with LM: Beam Search


Architecture


Model Architecture


• 11 layers

• The chosen architecture:

  • 3 x 2D conv

  • 7 x RNN

  • 1 x FC

• Batch normalization along the DNN.
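A hedged PyTorch sketch of the described stack (3 x 2D convolutions, 7 recurrent layers, 1 fully connected layer, with batch normalization in the convolutional blocks); the channel counts, kernel sizes, strides, GRU cells, and the unidirectional choice are illustrative placeholders, not the exact configuration from the paper.

```python
import torch.nn as nn

class DeepSpeech2Like(nn.Module):
    def __init__(self, n_freq=161, n_symbols=29, hidden=512):
        super().__init__()
        # 3 x 2D convolution over (time, frequency), each followed by batch norm
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(11, 21), stride=(1, 2), padding=(5, 10)),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 96, kernel_size=(11, 21), stride=(1, 2), padding=(5, 10)),
            nn.BatchNorm2d(96), nn.ReLU(),
        )
        f = n_freq
        for _ in range(3):
            f = (f - 1) // 2 + 1        # frequency bins left after each stride-2 conv
        # 7 x recurrent layers
        self.rnns = nn.ModuleList(
            [nn.GRU(96 * f if i == 0 else hidden, hidden, batch_first=True)
             for i in range(7)]
        )
        # 1 x fully connected layer producing per-frame symbol scores
        self.fc = nn.Linear(hidden, n_symbols)

    def forward(self, spec):                    # spec: (batch, time, freq)
        x = self.conv(spec.unsqueeze(1))        # -> (batch, channels, time', freq')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        for rnn in self.rnns:
            x, _ = rnn(x)
        return self.fc(x).log_softmax(dim=-1)   # per-frame distribution for CTC
```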

RNN as state machine


RNN as grid


Forward Pass


Back Propagation


Weight Update

Bi-Directional RNN


RNN with limited future context


RNN vs. LSTM vs. GRU


Model Architecture


• 11 layers

• The chosen architecture:

  • 3 x 2D conv

  • 7 x RNN

  • 1 x FC

• Batch normalization along the DNN.

Convolutional Layers: Images


Convolutional Layers: Audio


(Figure: spectrogram with time and frequency axes.)

Results


Test Sets


Results: Sometimes better than Humans


Questions?


The End
