Intel Nervana Artificial Intelligence Meetup 11/30/16

Proprietary and confidential. Do not distribute. End-to-end speech recognition with neon Anthony Ndirango & Tyler Lee MAKING MACHINES SMARTER.™ now part of

Upload: nervana-systems

Post on 16-Apr-2017


TRANSCRIPT

A fast, scalable deep learning platform

End-to-end speech recognition with neon

Anthony Ndirango & Tyler Lee


Nervana Systems Proprietary

Outline

- Deep learning synopsis
- Large vocabulary continuous speech recognition
- End-to-end speech recognition systems
- Integrating weighted finite-state transducers for decoding


Back-propagation, End-to-end, ResNet, ImageNet, Word2Vec, Regularization, Convolution, Unrolling, RNN, Generalization, hyperparameters, Video recognition, dropout, Pooling, LSTM, AlexNet, Speech recognition

download neon!
https://github.com/NervanaSystems/neon
git clone git@github.com:NervanaSystems/neon.git

Nervana's deep learning tutorials:
https://www.nervanasys.com/deep-learning-tutorials/

We are hiring!
https://www.nervanasys.com/careers/


What is deep learning?

- A method for extracting features at multiple levels of abstraction
- Features are discovered from data
- Performance improves with more data
- Network can express complex transformations
- High degree of representational power


What can deep learning do today?


Nervana in action

- Healthcare: Tumor detection
- Automotive: Speech interfaces
- Finance: Time-series search engine
- Agricultural Robotics
- Oil & Gas
- Proteomics: Sequence analysis


Large Vocabulary Continuous Speech Recognition


State-of-the-art ASR pipeline

[Figure: ASR pipeline diagram, with the labels "kwkbranfks" and "quick brown fox"]


Emphasize that we are just replicating DS2's acoustic model

Processing data using aeon

download aeon!
https://github.com/NervanaSystems/aeon

- Train directly from raw audio, extracting spectral features on-the-fly
- Handles arbitrarily large datasets
- Loads data from disk to device with minimal latency
- Also supports image and video data
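As a rough illustration of what "extracting spectral features on-the-fly" involves, the sketch below computes a magnitude spectrogram from raw samples in plain Python. It is not aeon's implementation or API: the function name `spectrogram` and the parameters `frame_len` and `hop` are assumptions made for this example, and a real loader would use an FFT rather than the direct DFT shown here.

```python
import cmath
import math


def spectrogram(samples, frame_len=16, hop=8):
    """Return one list of DFT magnitudes (bins 0..frame_len//2) per frame.

    Hypothetical helper for illustration only; not part of aeon.
    """
    # Periodic Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        magnitudes = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of a single frequency bin.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# A pure tone that completes 3 cycles per 16-sample frame.
tone = [math.cos(2 * math.pi * 3 * n / 16) for n in range(64)]
spec = spectrogram(tone)
peak_bins = [frame.index(max(frame)) for frame in spec]
print(len(spec), peak_bins)
```

Each row of `spec` plays the role of one audio frame's feature vector; for the test tone, the strongest bin in every frame is bin 3.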


Acoustic models in neon

complete source available at https://github.com/NervanaSystems/deepspeech

CTC

The basic problem to be solved involves mapping a sequence of audio features to a sequence of characters, with no obvious relationship between the lengths of the sequences.

CTC works around this problem by first defining a collapse function.

Definition by example: Collapse(_NNN_ _EE_ _R_ _VVV_AAA_N_AAAA_) = NERVANA
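A minimal sketch of the collapse function in Python, assuming `_` is the blank symbol and that the spaces in the slide's example are only visual separators. The name `collapse` is chosen for this example and is not taken from neon.

```python
from itertools import groupby

BLANK = "_"


def collapse(path, blank=BLANK):
    """Merge runs of repeated symbols, then drop blanks."""
    merged = [symbol for symbol, _run in groupby(path)]
    return "".join(s for s in merged if s != blank)


# The slide's example with the separating spaces removed.
example = "_NNN_ _EE_ _R_ _VVV_AAA_N_AAAA_".replace(" ", "")
print(collapse(example))   # NERVANA
print(collapse("HEL_LO"))  # HELLO
print(collapse("HELLO"))   # HELO
```

The last two calls show why the blank is needed: a genuine double letter survives only when a blank separates the two runs.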


For each utterance, the model outputs a matrix of frame-wise character probabilities.

Given the ground truth transcript, the CTC algorithm:
- finds all paths which collapse onto the ground truth
- uses the probability matrix to weight each path

Example

- input audio with 5 frames
- ground truth: CAB
- find all strings of length 5, including blank characters, that collapse onto CAB

CTC Cost
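The 5-frame example is small enough to enumerate by brute force. The sketch below assumes a four-symbol alphabet (`_`, `A`, `B`, `C`) and a made-up uniform probability matrix. It lists every path that collapses onto CAB and takes the negative log of the summed path probabilities as the cost. Practical implementations avoid this exponential enumeration with a forward-backward dynamic program, and the helper names here are illustrative, not neon's.

```python
import math
from itertools import groupby, product

BLANK = "_"
ALPHABET = [BLANK, "A", "B", "C"]


def collapse(path):
    """Merge repeated symbols, then drop blanks."""
    merged = [s for s, _run in groupby(path)]
    return "".join(s for s in merged if s != BLANK)


def paths_for(target, num_frames):
    """All strings of length num_frames over ALPHABET that collapse onto target."""
    return ["".join(p) for p in product(ALPHABET, repeat=num_frames)
            if collapse(p) == target]


def ctc_cost(probs, target):
    """Negative log of the total probability of all paths collapsing onto target.

    probs[t][symbol] is the probability of `symbol` at frame t.
    """
    total = 0.0
    for path in paths_for(target, len(probs)):
        total += math.prod(probs[t][s] for t, s in enumerate(path))
    return -math.log(total)


paths = paths_for("CAB", 5)
print(len(paths))  # 28

# Toy probability matrix (made up): every symbol equally likely at every frame.
uniform = [{s: 0.25 for s in ALPHABET} for _ in range(5)]
print(ctc_cost(uniform, "CAB"))
```

With uniform probabilities each of the 28 paths has probability 0.25^5, so the cost is -log(28/1024).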

Inference

- do an argmax for each column
- concatenate the resulting characters to obtain a string
- collapse the string to get the output
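The three inference steps can be sketched in a few lines of Python. The probability matrix below is made up for illustration and is stored with one row per frame (the slide's "columns"); `argmax_decode` is a hypothetical helper name.

```python
from itertools import groupby

BLANK = "_"


def argmax_decode(probs, symbols, blank=BLANK):
    """Greedy CTC decoding; probs[t][i] is the probability of symbols[i] at frame t."""
    # 1. argmax for each frame
    best = [symbols[max(range(len(row)), key=row.__getitem__)] for row in probs]
    # 2. concatenate the resulting characters
    raw = "".join(best)
    # 3. collapse: merge repeats, then remove blanks
    merged = [s for s, _run in groupby(raw)]
    return raw, "".join(s for s in merged if s != blank)


symbols = [BLANK, "A", "B", "C"]
# Toy frame-wise probabilities (made up for illustration).
probs = [
    [0.10, 0.10, 0.10, 0.70],
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.80, 0.05, 0.05],
    [0.20, 0.60, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
]
raw, text = argmax_decode(probs, symbols)
print(raw, text)  # C_AAB CAB
```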

Examples of argmax-decoded outputs

decoded output: younited presidentiol is a lefe in surance company
ground truth:   united presidential is a life insurance company

decoded output: that was sertainly true last week
ground truth:   that was certainly true last week

decoded output: we're now ready to say we're intechnical default a spokesman said
ground truth:   we're not ready to say we're in technical default a spokesman said
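To put a number on how far such outputs are from the ground truth, character and word error rates can be computed from the edit distance. The sketch below is a standard Levenshtein implementation, not code from the talk, and normalizes by the reference length. It uses the second example pair.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits divided by number of reference words."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


truth = "that was certainly true last week"
decoded = "that was sertainly true last week"
print(cer(truth, decoded), wer(truth, decoded))
```

A single wrong character gives a CER of 1/33 but a WER of 1/6, which is why a modest CER can still come with a much higher WER.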

Decoding: from characters to language

So we have probabilities of each character at each frame. Now what?

If CER (character error rate) is nearly perfect:
- We're pretty much set. Just use the best character at each frame.

If CER is too high:
- We should enforce some rules from the language. E.g.:
  - All words must be valid
  - Favor likely word sequences

WFSTs efficiently enforce language constraints

Weighted finite state transducers:
- Automata whose state transitions map a sequence of input symbols to a sequence of output symbols
- Directed graph structure
- States enforce language structure
- Transitions choose amongst valid symbols

For an in-depth review, see Mohri, Pereira & Riley, 2008

Why do we use them?

- A lot of decoding concepts map nicely to FSTs. CTC, lexicon (vocabulary) and grammar (language model) can all be easily represented.
- Efficient algorithms exist to combine FSTs, giving a single decoding graph

[Figure: the CTC graph, lexicon graph and grammar graph are combined into the decoding graph]


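CTC is easily implemented as an FST

Removes repeated characters and blanks (_)

Input: C _ A A A _ B B _
Output: C A B

[Figure: the input is consumed in the segments "C _", "A A A _" and "B B _", while the output grows from "C" to "C A" to "C A B"]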
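One possible way to write this transducer down, as an unweighted sketch: the state is just the previous input symbol, and each transition either emits the new character or emits nothing (epsilon). This topology is an assumption for illustration and is not necessarily the graph used in the talk.

```python
EPS = ""      # epsilon: emit nothing
BLANK = "_"


def ctc_step(state, symbol):
    """One transition: returns (next_state, output).

    The state is the previous input symbol (BLANK at the start).
    """
    if symbol == BLANK or symbol == state:
        return symbol, EPS   # blank or repeated character: emit nothing
    return symbol, symbol    # new character: emit it


def run_ctc_fst(inputs):
    """Feed a sequence of input symbols through the transducer."""
    state, outputs = BLANK, []
    for symbol in inputs:
        state, out = ctc_step(state, symbol)
        if out != EPS:
            outputs.append(out)
    return outputs


print(run_ctc_fst("C _ A A A _ B B _".split()))  # ['C', 'A', 'B']
```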


A vocabulary (lexicon) is also easy to implement as an FST

Maps a sequence of characters or phonemes to words
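A lexicon transducer can be sketched as a character trie: each trie node is a state, each edge an arc, and a word is emitted when a separator arrives at a node that ends a word. The three-word vocabulary and the helper names below are hypothetical, and the sketch is unweighted, unlike a real WFST.

```python
def build_lexicon(words):
    """Build a character trie; nodes act as states and edges as arcs."""
    root = {"arcs": {}, "word": None}
    for word in words:
        node = root
        for ch in word:
            node = node["arcs"].setdefault(ch, {"arcs": {}, "word": None})
        node["word"] = word  # output label emitted when a word ends here
    return root


def transduce(lexicon, chars, sep=" "):
    """Map characters to words; return None if some word is not in the lexicon."""
    words, node = [], lexicon
    for ch in list(chars) + [sep]:
        if ch == sep:
            if node is lexicon:
                continue          # ignore leading or repeated separators
            if node["word"] is None:
                return None       # stopped in the middle of a word
            words.append(node["word"])
            node = lexicon
        elif ch in node["arcs"]:
            node = node["arcs"][ch]
        else:
            return None           # no arc for this character
    return words


lexicon = build_lexicon(["quick", "brown", "fox"])
print(transduce(lexicon, "quick brown fox"))  # ['quick', 'brown', 'fox']
print(transduce(lexicon, "quick brwn fox"))   # None
```

Rejecting "brwn" is the "all words must be valid" rule in its simplest form; a weighted version would instead assign such paths a cost.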

WFSTs have a few drawbacks

- Less end-to-end: a large number of parameters are learned completely separately from the acoustic model
- Memory issues with large vocabularies or complex language models

Graph              # States      # Arcs
CTC                3191 (states/arcs split unclear)
Lexicon            30,629        40,516
Trigram            3,538,579     10,213,039
Composed Trigram   26,817,696    54,104,686

WFSTs greatly improve word error rate

Error rates in %:

Reference                    CER (no LM)   WER (no LM)   WER (trigram LM)   WER (trigram LM w/ enhancements)
Hannun et al. (2014)         10.7          35.8          14.1               N/A
Graves-Jaitly (2014)         9.2           30.1          N/A                8.7
Hwang-Sung (2016)            10.6          38.4          8.88               8.1
Miao et al. (2015) [Eesen]   N/A           N/A           9.1                7.3
Bahdanau et al. (2016)       6.4           18.6          10.8               9.3
Nervana-Speech               8.64          32.5          8.4                N/A

younited presidentiol is a lefe in surance company
united presidential is a life insurance company

that was sertainly true last week
that was certainly true last week

we're now ready to say we're intechnical default a spokesman said
we're not ready to say we're in technical default a spokesman said

Nervana's deep learning tutorials:
https://www.nervanasys.com/deep-learning-tutorials/

Acoustic model source available at:
https://github.com/NervanaSystems/deepspeech

GitHub page:
https://github.com/NervanaSystems/neon

For more information, contact: [email protected] info

neon on

model zoo

github.com/NervanaSystems/ModelZoo
model files, parameters

- GoogLeNet
- AlexNet
- VGG
- Deep Residual Net
- bAbI Q&A
- IMDb Sentiment Analysis
- Video Activity Detection
- Deep Reinforcement Learning
- LSTM Image Captioning
- Fast-RCNN Object Localization
- AllCNN


Don't have to make a model from scratch: many examples of pre-trained models

mention Yinyin-Fast-RCNN, Sathish C3D, bAbI

THANK YOU!

QUESTIONS?


Convolution

input:
0 1 2
3 4 5
6 7 8

filter:
0 1
2 3

output:
19 25
37 43

Each element in the output is the result of a dot product between two vectors. For example, the input patch (0, 1, 3, 4) and the filter (0, 1, 2, 3) give 19.
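The slide's numbers can be reproduced with a few lines of Python, assuming the input is a 3x3 grid holding 0 to 8, the filter is a 2x2 grid holding 0 to 3, and the filter slides with stride 1, no padding and no flipping. The function name `conv2d_valid` is chosen for this sketch.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1, no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    flat_kernel = [w for row in kernel for w in row]
    output = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            # Flatten the patch under the filter into a vector.
            patch = [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            # Each output element is a dot product of patch and filter.
            out_row.append(sum(p * w for p, w in zip(patch, flat_kernel)))
        output.append(out_row)
    return output


image = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
kernel = [[0, 1], [2, 3]]
print(conv2d_valid(image, kernel))  # [[19, 25], [37, 43]]
```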


To understand convolutional networks, we should first understand the convolution operation and then see how that operation is implemented in a network structure.

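Convolution layer

[Figure: the same convolution drawn as a network layer. The nine inputs (0 to 8) connect to four output units (19, 25, 37, 43), and every output unit reuses the same four filter weights (0, 1, 2, 3).]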

Bi-directional RNN (BiRNN)
