
Speech Recognition


Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Amnon Drory & Matan Karo

Overview


Automatic Speech Recognition


100,000,000 (H.M.U = Hundred Million Users)


The Task: Good Speech Recognition


Traditional Speech Recognition (ASR)


Traditional ASR + Deep Learning


Baidu’s Approach: End-To-End Neural Net


100,000,000

Speed up


Training Data: Annotated Audio


Thousands of hours of annotated speech for training, in English and Mandarin.

Training Data: Raw Text

Use text to learn a lot about the language: which words are common, and which word is reasonable in the current context.

This can help us in understanding speech.

Lecture Plan

• Overview

• Input

• Output + CTC

• Model Architecture

• Results


Input


ASR

• A complete speech application:

  • Speech transcription

  • Word spotting / trigger word

  • Speaker identification / verification

Audio Input

• Input: raw audio, a 1D signal

• Pre-processing: spectrogram
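A minimal sketch of that spectrogram pre-processing step, assuming 16 kHz mono audio in a NumPy array; the 20 ms window and 10 ms stride are illustrative values, not necessarily the ones used in the paper.

```python
import numpy as np

def spectrogram(audio, sample_rate=16000, window_ms=20, stride_ms=10):
    window = int(sample_rate * window_ms / 1000)   # samples per frame
    stride = int(sample_rate * stride_ms / 1000)   # hop between frames
    assert len(audio) >= window
    n_frames = 1 + (len(audio) - window) // stride
    frames = np.stack([audio[i * stride : i * stride + window] * np.hanning(window)
                       for i in range(n_frames)])
    # log-magnitude of the FFT of each frame -> a (time, frequency) image
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))
```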


Preprocessing: SortaGrad

• Dealing with different lengths of utterances

• Try to keep similar-length utterances together (a rough batching sketch follows below)

(LibriSpeech clean data.)
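SortaGrad is a curriculum trick: in the first training epoch, utterances are processed roughly from shortest to longest, which stabilizes early training and also keeps similar-length utterances in the same minibatch; later epochs shuffle as usual. A rough sketch, assuming `dataset` is a list of (audio, transcript) pairs; the helper name and batching details are illustrative.

```python
import random

def make_batches(dataset, batch_size, first_epoch):
    # SortaGrad: in the first epoch, go through utterances from shortest to longest
    if first_epoch:
        ordered = sorted(dataset, key=lambda ex: len(ex[0]))
    else:
        ordered = list(dataset)
        random.shuffle(ordered)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```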

Preprocessing: Data Augmentation

• Additive noise

  • increases robustness to noisy speech

  • increases the data set: 10k hours of raw audio -> 100k hours
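A minimal sketch of additive-noise augmentation, assuming `speech` and `noise` are NumPy float arrays at the same sample rate; the SNR parameterization is illustrative.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    # tile/crop the noise clip to match the utterance length
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    # scale the noise to the requested signal-to-noise ratio
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```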


Output


The Goal

• Create a neural network (RNN) from which we can extract a transcription y.

• Train from labeled pairs (x, y*).


Connectionist Temporal Classification (CTC)

• The network is also called an “acoustic model”.

• Main issue for the acoustic model: length(x) != length(y)

• Solution: divide the transcription task into steps:

  • The RNN output neurons c encode a distribution over symbols. Encode: x → c

  • Define a mapping β from symbol sequences to text: β(c) → y

  • Find a function f for achieving y. In training: a summation over all mappings. In testing: maximum likelihood using beam search.


Connectionist Temporal Classification (CTC)

• The RNN creates probability vectors (a distribution over symbols) using softmax.

  For a grapheme-based model: c ∈ {A, B, C, D, …, blank, space}

• Independence assumption: $P(c \mid x) = \prod_{i=1}^{N} P(c_i \mid x)$


Training With CTC

• Mapping β: given a character sequence c, remove duplicates and blanks.

• Therefore P(y|x) is the summation over all possible c with the same mapping:

$$P(y \mid x) = \sum_{c \,:\, \beta(c) = y} P(c \mid x)$$
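As a toy illustration (not the algorithm actually used), the collapse mapping β and the brute-force definition of P(y|x) can be written out directly; real CTC replaces the exponential enumeration below with dynamic programming. The tiny alphabet and inputs are illustrative.

```python
from itertools import product
import numpy as np

SYMBOLS = ["_", "a", "b"]        # "_" is the blank

def beta(c):
    # remove consecutive duplicates, then remove blanks
    out, prev = [], None
    for s in c:
        if s != prev and s != "_":
            out.append(s)
        prev = s
    return "".join(out)

def p_y_given_x(y, frame_probs):
    # frame_probs: (time, symbols) per-frame distributions from the network
    total = 0.0
    for c in product(range(len(SYMBOLS)), repeat=len(frame_probs)):
        if beta([SYMBOLS[i] for i in c]) == y:
            total += np.prod([frame_probs[t][i] for t, i in enumerate(c)])
    return total
```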


Training With CTC

• Update the network parameters θ to maximize the likelihood of the correct label y*:

$$\theta^* = \arg\max_\theta \sum_i \log P\!\left(y^{*(i)} \mid x^{(i)}\right) \;\Rightarrow\; \theta^* = \arg\max_\theta \sum_i \log \sum_{c \,:\, \beta(c) = y^{*(i)}} P\!\left(c \mid x^{(i)}\right)$$

• There is an efficient dynamic programming algorithm to compute the inner summation and its gradient (also implemented in open-source packages).
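One such open-source implementation is `torch.nn.CTCLoss`. A hedged sketch of a single training step with a toy model; the shapes, symbol inventory, and the single-GRU "acoustic model" are illustrative placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

num_symbols = 29            # e.g. a-z, space, apostrophe, blank (index 0 = blank)
model = nn.GRU(input_size=161, hidden_size=256, batch_first=True)
proj = nn.Linear(256, num_symbols)
ctc = nn.CTCLoss(blank=0)

spectrograms = torch.randn(8, 200, 161)              # (batch, time, features)
input_lengths = torch.full((8,), 200, dtype=torch.long)
targets = torch.randint(1, num_symbols, (8, 30))     # label indices, no blanks
target_lengths = torch.full((8,), 30, dtype=torch.long)

hidden, _ = model(spectrograms)
log_probs = proj(hidden).log_softmax(dim=-1)          # (batch, time, symbols)
# CTCLoss expects (time, batch, symbols)
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
```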


Decoding

• The network outputs P(c|x); we want P(y|x).

• Simple naive solution: max decoding

$$\hat{y} = \beta\!\left(\arg\max_c P(c \mid x)\right)$$

For example, the frame-wise argmax sequence
_ _ c c _ a a a a _ _ _ _ b b b b _ _ _ _ _
collapses to "cab".


Max Decoding


• Doesn't work well in practice, but is good for diagnostics.
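A minimal sketch of max (greedy) decoding together with the collapse β, assuming `log_probs` is a (time, symbols) array and index 0 is the blank; the alphabet is illustrative.

```python
import numpy as np

ALPHABET = ["_"] + list("abcdefghijklmnopqrstuvwxyz ")   # "_" marks the blank

def max_decode(log_probs):
    best = np.argmax(log_probs, axis=1)          # most likely symbol per frame
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:             # drop repeats, then drop blanks
            out.append(ALPHABET[idx])
        prev = idx
    return "".join(out)
```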

Language model: n-gram

• A probabilistic Markov model: $P(x_i \mid x_{i-(n-1)}, \ldots, x_{i-1})$
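As a toy illustration of the idea (not the large n-gram model used in practice), a bigram model with add-one smoothing can be estimated from raw text like this; the training sentences and smoothing choice are illustrative.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    counts, context = defaultdict(Counter), Counter()
    for words in sentences:
        for prev, cur in zip(["<s>"] + words, words + ["</s>"]):
            counts[prev][cur] += 1
            context[prev] += 1
    vocab = {w for words in sentences for w in words} | {"</s>"}
    def prob(cur, prev):
        # add-one smoothed conditional probability P(cur | prev)
        return (counts[prev][cur] + 1) / (context[prev] + len(vocab))
    return prob

p = train_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(p("cat", "the"))   # P(cat | the)
```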


Language model: n-gram


• Examples from Google n-gram corpus

Decoding with LM

• Even with better decoding schemes, the CTC model tends to make spelling and linguistic errors.

• Solution: Combine a Language Model!

$$\arg\max_y \; \log\!\left\{ P(y \mid x)\; P_{\text{LM}}(y)^{\alpha}\; \mathrm{word\_count}(y)^{\beta} \right\}$$

• α: weights the language model against the CTC network

• β: encourages more words in the transcription

• Use beam search to find the transcript y
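A hedged sketch of how one beam-search candidate could be scored under the objective above; the α and β values and the inputs are illustrative, and a real decoder applies this incrementally inside the CTC prefix beam search rather than to finished transcripts.

```python
import math

def candidate_score(ctc_log_prob, lm_prob, transcript, alpha=2.0, beta=1.5):
    # log{ P(y|x) * P_LM(y)^alpha * word_count(y)^beta } in additive log form
    words = max(len(transcript.split()), 1)      # avoid log(0) for empty hypotheses
    return ctc_log_prob + alpha * math.log(lm_prob) + beta * math.log(words)
```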


Decoding with LM


Decoding with LM: Beam Search


• The Naive approach
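To make the naive approach concrete, here is a highly simplified beam search over per-frame symbol distributions; a real CTC prefix beam search additionally merges hypotheses that collapse to the same text and folds the language-model term into the score. Shapes and the beam width are illustrative.

```python
import numpy as np

def simple_beam_search(log_probs, beam_width=8):
    # log_probs: (time, symbols) array of per-frame log probabilities
    beams = [((), 0.0)]                      # (symbol sequence, log probability)
    for frame in log_probs:
        candidates = []
        for seq, score in beams:
            for sym, lp in enumerate(frame):
                candidates.append((seq + (sym,), score + lp))
        # keep only the highest-scoring hypotheses
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]                          # best raw symbol sequence (pre-collapse)
```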

Decoding with LM: Beam Search


Architecture


Model Architecture


• 11 layers

• The chosen architecture:

  • 3 x 2D conv

  • 7 x RNN

  • 1 x FC

• Batch normalization along the DNN.
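A hedged PyTorch sketch of the described stack (3 x 2D convolutions, 7 recurrent layers, 1 fully connected layer, with batch normalization in the convolutional blocks); the channel counts, kernel sizes, strides, GRU cells, and the unidirectional choice are illustrative placeholders, not the exact configuration from the paper.

```python
import torch.nn as nn

class DeepSpeech2Like(nn.Module):
    def __init__(self, n_freq=161, n_symbols=29, hidden=512):
        super().__init__()
        # 3 x 2D convolution over (time, frequency), each followed by batch norm
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(11, 21), stride=(1, 2), padding=(5, 10)),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 96, kernel_size=(11, 21), stride=(1, 2), padding=(5, 10)),
            nn.BatchNorm2d(96), nn.ReLU(),
        )
        f = n_freq
        for _ in range(3):
            f = (f - 1) // 2 + 1        # frequency bins left after each stride-2 conv
        # 7 x recurrent layers
        self.rnns = nn.ModuleList(
            [nn.GRU(96 * f if i == 0 else hidden, hidden, batch_first=True)
             for i in range(7)]
        )
        # 1 x fully connected layer producing per-frame symbol scores
        self.fc = nn.Linear(hidden, n_symbols)

    def forward(self, spec):                    # spec: (batch, time, freq)
        x = self.conv(spec.unsqueeze(1))        # -> (batch, channels, time', freq')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        for rnn in self.rnns:
            x, _ = rnn(x)
        return self.fc(x).log_softmax(dim=-1)   # per-frame distribution for CTC
```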

RNN as state machine


RNN as grid


Forward Pass


Back Propagation


Weight Update

Bi-Directional RNN


RNN with limited future context


RNN vs. LSTM vs. GRU


Model Architecture


• 11 layers

• The chosen architecture:

  • 3 x 2D conv

  • 7 x RNN

  • 1 x FC

• Batch normalization along the DNN.

Convolutional Layers: Images


Convolutional Layers: Audio


(Figure: spectrogram with time and frequency axes.)

Results


Test Sets


Results: Sometimes better than Humans


Questions?


The End
