Speech Recognition 19/12/2017 Deep Speech 1 Deep Speech 2: End-to-End Speech Recognition in English and Mandarin Amnon Drory & Matan Karo


Page 1: Deep Speech 2: End-to-End Speech Recognition in English ...web.eng.tau.ac.il/deep_learn/wp-content/uploads/2018/01/Speech-Recognition.pdfSpeech Recognition 19/12/2017 Deep Speech 1

Speech Recognition

19/12/2017 Deep Speech 1

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Amnon Drory & Matan Karo


Overview


Automatic Speech Recognition


100,000,000

H.M.U = Hundred Million Users


The Task: Good Speech Recognition


Traditional Speech Recognition (ASR)


Traditional ASR + Deep Learning


Baidu’s Approach: End-To-End Neural Net


100,000,000


Speed up


Training Data: Annotated Audio


Thousands of hours of annotated speech for training: in English and Mandarin.


Training Data: Raw Text

• Use text to learn a lot about the language:

  • which words are common

  • which word is reasonable in the current context

• This can help us in understanding speech


Lecture Plan

• Overview

• Input

• Output + CTC

• Model Architecture

• Results


Input


ASR

• A complete speech application:

  • Speech transcription

  • Word spotting / trigger word

  • Speaker identification / verification


Audio Input

• Input: Raw Audio, 1D signal

• Pre-Process: Spectrogram
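
The pre-processing step above can be sketched as follows. This is a minimal illustration, not the presenters' pipeline: the frame length, hop size, Hann window, and log-power scaling are all assumptions, and the function name `spectrogram` is hypothetical. It uses a naive DFT from the standard library so it stays self-contained; a real system would use an FFT.

```python
import cmath
import math

def spectrogram(signal, frame_len=64, hop=32):
    """Turn a 1D signal into a list of frames of log-power per frequency bin."""
    # Hann window (assumed choice) reduces leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Naive DFT of bin k; an FFT would be used in practice.
            coeff = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            row.append(math.log(abs(coeff) ** 2 + 1e-10))
        frames.append(row)
    return frames

# Usage: a pure 128 Hz tone sampled at 1024 Hz falls exactly on bin 8
# (128 * 64 / 1024 = 8) for 64-sample frames.
sample_rate = 1024
tone = [math.sin(2 * math.pi * 128 * t / sample_rate) for t in range(256)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

The result is a 2D time-by-frequency array, which is what allows the later layers to treat audio much like an image.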


Preprocessing: SortaGrad

• Dealing with different lengths of utterances

• Try to keep similar-length utterances together

(LibriSpeech clean data.)
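
A rough sketch of the idea of keeping similar-length utterances together: sort by length before cutting minibatches, so each batch needs little padding. The helper names `make_batches` and `padding_waste` are hypothetical, and the schedule (shortest-first in the first epoch, shuffled batch order afterwards) is an assumption about how the curriculum is applied rather than something stated on the slide.

```python
import random

def make_batches(utterance_lengths, batch_size, epoch, seed=0):
    """Group utterance indices into minibatches of similar length."""
    order = sorted(range(len(utterance_lengths)),
                   key=lambda i: utterance_lengths[i])
    batches = [order[i:i + batch_size]
               for i in range(0, len(order), batch_size)]
    if epoch > 0:
        # Assumed schedule: only the first epoch runs shortest-to-longest.
        random.Random(seed + epoch).shuffle(batches)
    return batches

def padding_waste(batches, lengths):
    """Padded frames needed if each batch is padded to its longest utterance."""
    return sum(max(lengths[i] for i in b) * len(b) - sum(lengths[i] for i in b)
               for b in batches)

lengths = [50, 400, 60, 390, 55, 410]          # toy utterance lengths in frames
sorted_batches = make_batches(lengths, batch_size=3, epoch=0)
naive_batches = [[0, 1, 2], [3, 4, 5]]         # batches in file order
print(padding_waste(sorted_batches, lengths))  # 45
print(padding_waste(naive_batches, lengths))   # 1065
```

On this toy data, length-sorted batching pads 45 frames in total, against 1065 for batches taken in file order.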


Preprocessing: Data Augmentation

• Additive noise

  • increases robustness to noisy speech

  • increases the data set: 10k hours of raw audio -> 100k hours
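
As a sketch of additive-noise augmentation, the snippet below mixes noise into a clean signal at a chosen signal-to-noise ratio. The Gaussian noise source, the SNR values, and the name `add_noise` are assumptions for illustration; the slide does not say what kind of noise was used.

```python
import math
import random

def add_noise(signal, snr_db, rng):
    """Return the signal plus Gaussian noise at the requested SNR (in dB)."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]

rng = random.Random(0)
clean = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
# Several noisy copies of one clean utterance, each at a hypothetical SNR.
augmented = [add_noise(clean, snr_db, rng) for snr_db in (20, 10, 5)]
```

Each clean utterance yields several distinct training examples this way, which is how a fixed amount of raw audio can be multiplied into a much larger training set.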


Output


The Goal

• Create a neural network (RNN) from which we can extract the transcription, 𝑦.

• Train from labeled pairs (𝑥, 𝑦∗)


Connectionist Temporal Classification (CTC)

• The network is also called “Acoustic Model”.

• Acoustic model main issue - length(x) != length(y)

• Solution - divide the transcription task into steps:

  • RNN output neurons 𝑐 encode a distribution over symbols. Encode: 𝑥 → 𝑐

  • Define a mapping from distribution to text. 𝛽: 𝑓(𝑐) → 𝑦

  • Find function 𝑓 for achieving 𝑦. In training: summation over all mappings. In testing: ML using beam search


Connectionist Temporal Classification (CTC)

• RNN creates probability vectors (distribution) using Softmax

For grapheme-based model: $c \in \{A, B, C, D, \dots, \text{blank}, \text{space}\}$

• Independence assumption: $P(c \mid x) = \prod_{i=1}^{N} P(c_i \mid x)$
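
To make the independence assumption concrete, here is a small sketch with a toy five-symbol alphabet and made-up logits (both are assumptions, as are the function names). Each frame's logits go through a softmax, and the probability of a whole symbol path is the product of the per-frame probabilities.

```python
import math

ALPHABET = ["_", " ", "a", "b", "c"]  # toy alphabet; "_" stands for blank

def softmax(logits):
    """Convert one frame's logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def path_probability(frame_probs, path):
    """P(c|x) as the product over frames of P(c_i|x)."""
    p = 1.0
    for probs, symbol in zip(frame_probs, path):
        p *= probs[ALPHABET.index(symbol)]
    return p

# Hypothetical network outputs for three frames.
logits = [[0.0, 0.0, 0.0, 0.0, 2.0],
          [0.0, 0.0, 3.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 2.0, 0.0]]
frame_probs = [softmax(row) for row in logits]
p_cab = path_probability(frame_probs, ["c", "a", "b"])
```

Because the frames are treated as independent given 𝑥, the probabilities of all possible paths of a given length sum to one.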


Training With CTC

• Mapping:

  • Given a character sequence 𝑐, merge repeated characters, then remove blanks

• Therefore $P(y \mid x)$ is the summation over all possible 𝑐 with the same mapping:

$$P(y \mid x) = \sum_{c \,:\, \beta(c) = y} P(c \mid x)$$
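
The mapping and the summation can be illustrated by brute force on a tiny example. This sketch enumerates every path, so it is only usable for a handful of frames; the names `collapse` and `label_probability` and the two-symbol alphabet are illustrative assumptions.

```python
from itertools import product

BLANK = "_"

def collapse(path):
    """The mapping: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)

def label_probability(frame_probs, alphabet, target):
    """P(y|x): sum of P(c|x) over every path c that collapses to target."""
    total = 0.0
    for path in product(range(len(alphabet)), repeat=len(frame_probs)):
        if collapse([alphabet[k] for k in path]) == target:
            p = 1.0
            for t, k in enumerate(path):
                p *= frame_probs[t][k]
            total += p
    return total

alphabet = [BLANK, "a"]
frame_probs = [[0.5, 0.5], [0.5, 0.5]]  # two frames, uniform toy distribution
print(collapse("aa_a"))                                # aa
print(label_probability(frame_probs, alphabet, "a"))   # 0.75
```

With two frames, the paths `_a`, `a_`, and `aa` all collapse to "a", each with probability 0.25, giving 0.75. Note that a blank between repeated characters is what lets "aa" survive the merge.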


Training With CTC

• Update network parameters 𝜃 to maximize likelihood of correct label 𝑦∗:

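$$\theta^* = \arg\max_\theta \sum_i \log P\left(y^{*(i)} \mid x^{(i)}\right)$$

$$\rightarrow \quad \theta^* = \arg\max_\theta \sum_i \log \sum_{c \,:\, \beta(c) = y^{*(i)}} P\left(c \mid x^{(i)}\right)$$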

• There is an efficient dynamic programming algorithm to compute the inner summation and its gradient. (Also implemented in open-source packages.)
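
The dynamic programming idea can be sketched with the standard CTC forward recursion over a blank-padded version of the label. This is a plain-Python illustration in probability space (real implementations work in log space and also compute gradients); the function name `ctc_forward` and the convention that index 0 is the blank are assumptions.

```python
import math

def ctc_forward(frame_probs, target, blank=0):
    """P(y|x) via the CTC forward recursion; target is a list of symbol indices."""
    # Extended label: blank, y1, blank, y2, ..., blank
    ext = [blank]
    for sym in target:
        ext += [sym, blank]
    S, T = len(ext), len(frame_probs)
    alpha = [0.0] * S
    alpha[0] = frame_probs[0][blank]
    if S > 1:
        alpha[1] = frame_probs[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]              # stay on the same extended symbol
            if s >= 1:
                total += alpha[s - 1]     # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * frame_probs[t][ext[s]]
        alpha = new
    # Valid paths end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)

frame_probs = [[0.5, 0.5]] * 3       # three frames over {blank, "a"}, toy values
p = ctc_forward(frame_probs, [1])    # probability of the label "a"
loss = -math.log(p)                  # per-utterance term of the training objective
```

The cost is proportional to the number of frames times the label length, instead of exponential in the number of frames as with explicit enumeration of paths.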


Decoding

• Network outputs $P(c \mid x)$, we want $P(y \mid x)$

• Simple naive solution: Max Decoding

$$\beta\left(\arg\max_c P(c \mid x)\right)$$

_ _ c c _ a a a a _ _ _ _ b b b b _ _ _ _ _
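
Max decoding on the example path above can be sketched as follows: take the most likely symbol in every frame, then apply the collapse mapping. The function name and the toy probabilities (0.7 for the chosen symbol, 0.1 for the others) are assumptions for illustration.

```python
def max_decode(frame_probs, alphabet, blank="_"):
    """Pick the most likely symbol per frame, then merge repeats and drop blanks."""
    best_path = [alphabet[max(range(len(p)), key=lambda k: p[k])]
                 for p in frame_probs]
    out, prev = [], None
    for s in best_path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)

alphabet = ["_", "a", "b", "c"]
path = "__cc_aaaa____bbbb_____"
# Toy per-frame distributions whose argmax reproduces the path above.
frame_probs = [[0.7 if sym == best else 0.1 for sym in alphabet] for best in path]
print(max_decode(frame_probs, alphabet))  # cab
```

This returns the collapse of the single most likely path, which is not necessarily the most likely transcription, since many paths can map to the same text.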


Max Decoding


• Doesn’t work in practice, good for diagnostics


Language model: n-gram

• A probabilistic Markov Model: $P(x_i \mid x_{i-(n-1)}, \dots, x_{i-1})$
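
For n = 2 the model conditions each word on the previous word only. A minimal sketch estimating a bigram model by counting, on a three-sentence toy corpus, is shown below. The corpus, the `<s>` start marker, and the unsmoothed maximum-likelihood estimate are assumptions; practical language models add smoothing for unseen n-grams.

```python
from collections import Counter

def train_bigram(sentences):
    """Return prob(word, prev) estimated as count(prev, word) / count(prev)."""
    context_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()   # "<s>" marks the sentence start
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, word)] += 1

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0                       # unseen context, no smoothing here
        return pair_counts[(prev, word)] / context_counts[prev]

    return prob

corpus = ["the cat sat", "the cat ran", "the dog sat"]
p = train_bigram(corpus)
print(p("cat", "the"))  # 2 of the 3 occurrences of "the" are followed by "cat"
print(p("sat", "cat"))  # 0.5
```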


Language model: n-gram


• Examples from Google n-gram corpus


Decoding with LM

• Even with better decoding schemes, the CTC model tends to make spelling and linguistic errors.

• Solution: Combine a Language Model!

$$\arg\max_y \log\left\{ P(y \mid x)\, P(y)^{\alpha}\, \text{word\_count}(y)^{\beta} \right\}$$

• 𝛼 – weights the LM relative to the CTC network

• 𝛽 – encourages more words in the transcription

• Use Beam Search to find the transcript 𝑦
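
The combined objective can be sketched as a scoring function applied to a fixed list of candidate transcripts. This is a simplification: it rescores two given candidates rather than running a beam search, and the candidate probabilities, the weights, and the name `combined_score` are all hypothetical.

```python
import math

def combined_score(log_p_ctc, log_p_lm, transcript, alpha, beta):
    """log{ P(y|x) * P(y)^alpha * word_count(y)^beta } for one candidate y."""
    word_count = len(transcript.split())  # assumes at least one word
    return log_p_ctc + alpha * log_p_lm + beta * math.log(word_count)

candidates = [
    # (transcript, log P(y|x) from the CTC network, log P(y) from the LM)
    ("the cat sat", math.log(0.30), math.log(0.010)),
    ("the cats at", math.log(0.35), math.log(0.0001)),
]

def pick_best(alpha, beta):
    return max(candidates,
               key=lambda c: combined_score(c[1], c[2], c[0], alpha, beta))[0]

print(pick_best(alpha=0.0, beta=0.0))  # the cats at
print(pick_best(alpha=1.0, beta=1.0))  # the cat sat
```

With the language model switched off, the acoustically stronger but linguistically odd candidate wins; with it on, the candidate that the language model finds plausible wins.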


Decoding with LM


Decoding with LM : Beam Search


• The Naive approach


Decoding with LM : Beam Search


Architecture


Model Architecture

• 11 layers

• The chosen architecture:

  • 3 x 2D conv,

  • 7 x RNN,

  • 1 x FC

• Batch Normalization along the DNN.


RNN as state machine


RNN as grid


Forward Pass


Back Propagation


Weight Update


Bi-Directional RNN


RNN with limited future context


RNN vs. LSTM vs. GRU


Model Architecture

• 11 layers

• The chosen architecture:

  • 3 x 2D conv,

  • 7 x RNN,

  • 1 x FC

• Batch Normalization along the DNN.


Convolutional Layers: Images


Convolutional Layers: Audio

[Figure; axes: Time, Frequency]

Results


Test Sets


Results: Sometimes better than Humans


Results: Sometimes better than Humans


Results: Sometimes better than Humans


Questions?


The End
