intel nervana artificial intelligence meetup 11/30/16

A fast, scalable deep learning platform

End-to-end speech recognition with neon
Anthony Ndirango & Tyler Lee

MAKING MACHINES SMARTER.

now part of

Proprietary and confidential. Do not distribute.

Outline
Deep learning synopsis
Large vocabulary continuous speech recognition
End-to-end speech recognition systems
Integrating weighted finite-state transducers for decoding

Nervana Systems Proprietary

Back-propagation, End-to-end, ResNet, ImageNet, Word2Vec, Regularization, Convolution, Unrolling, RNN, Generalization, hyperparameters, Video recognition, dropout, Pooling, LSTM, AlexNet, Speech recognition

download neon!
https://github.com/NervanaSystems/neon
git clone [email protected]:NervanaSystems/neon.git

Nervana's deep learning tutorials:
https://www.nervanasys.com/deep-learning-tutorials/

We are hiring!
https://www.nervanasys.com/careers/


What is deep learning?
A method for extracting features at multiple levels of abstraction
Features are discovered from data
Performance improves with more data
Network can express complex transformations
High degree of representational power


What can deep learning do today?


Nervana in action
Healthcare: Tumor detection
Automotive: Speech interfaces
Finance: Time-series search engine
Agricultural Robotics
Oil & Gas
Proteomics: Sequence analysis


Large Vocabulary Continuous Speech Recognition


State-of-the-art ASR pipeline
[Figure: ASR pipeline diagram with the example transcript "quick brown fox"]




Emphasize that we are just replicating DS2's acoustic model

Processing data using aeon
download aeon!
https://github.com/NervanaSystems/aeon
Train directly from raw audio, extracting spectral features on-the-fly
Handles arbitrarily large datasets
Loads data from disk to device with minimal latency
Also supports image and video data
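
To make "extracting spectral features on-the-fly" concrete, here is a minimal sketch of a log power spectrogram in plain Python. It is not aeon's implementation or API: the function name `log_spectrogram`, the 20 ms frame, the 10 ms stride, and the Hann window are all assumptions chosen for illustration, and a naive DFT stands in for the FFT a real loader would use.

```python
import cmath
import math

def log_spectrogram(signal, sample_rate, frame_ms=20.0, stride_ms=10.0):
    """Return a list of frames, each a list of log power values per frequency bin."""
    frame_len = int(sample_rate * frame_ms / 1000)
    stride = int(sample_rate * stride_ms / 1000)
    # Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    num_bins = frame_len // 2 + 1
    features = []
    for start in range(0, len(signal) - frame_len + 1, stride):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(num_bins):
            # Naive DFT for clarity; a real pipeline would use an FFT.
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            row.append(math.log(abs(coeff) ** 2 + 1e-10))
        features.append(row)
    return features

# A 0.1 s, 500 Hz tone sampled at 8 kHz stands in for real audio.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 500 * n / sample_rate) for n in range(800)]
features = log_spectrogram(tone, sample_rate)
print(len(features), len(features[0]))  # 9 frames, 81 frequency bins
```

Each frame covers 160 samples, so the bins are 50 Hz apart and the tone's energy peaks in bin 10.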


Acoustic models in neon

complete source available at https://github.com/NervanaSystems/deepspeech

CTC (connectionist temporal classification)
The basic problem to be solved involves mapping a sequence of audio features to a sequence of characters, with no obvious relationship between the lengths of the sequences.
CTC works around this problem by first defining a collapse function.
Definition by example: Collapse(_NNN_ _EE_ _R_ _VVV_AAA_N_AAAA_) = NERVANA
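
The collapse function can be sketched in a few lines of Python. This is an illustration, not code from neon or the deepspeech repository: the name `collapse` and the use of `_` as the blank symbol follow the slide, and the example string is the slide's example written without the spacing between groups.

```python
from itertools import groupby

BLANK = "_"

def collapse(path, blank=BLANK):
    """Merge runs of repeated symbols, then drop the blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != blank)

print(collapse("_NNN__EE__R__VVV_AAA_N_AAAA_"))  # NERVANA
print(collapse("AA_A"))                          # AA
```

The second call shows why the blank is needed: it is the only way for a path to produce the same character twice in a row.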


For each utterance, the model outputs a matrix of frame-wise character probabilities
Given the ground truth transcript, the CTC algorithm:
  finds all paths which collapse onto the ground truth
  uses the probability matrix to weight each path

Example
input audio with 5 frames
ground truth: CAB
find all strings of length 5, including blank characters, that collapse onto CAB
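
For an example this small, the paths can simply be enumerated by brute force. The sketch below assumes a toy alphabet of A, B, C plus the blank `_`; the helper names `collapse` and `alignments` are hypothetical, and real training code never enumerates paths this way because their number grows exponentially with the number of frames.

```python
from itertools import groupby, product

def collapse(path, blank="_"):
    """Merge runs of repeated symbols, then drop the blanks."""
    merged = [s for s, _ in groupby(path)]
    return "".join(s for s in merged if s != blank)

def alignments(target, num_frames, alphabet, blank="_"):
    """Return every string of length num_frames that collapses onto target."""
    symbols = sorted(set(alphabet) | {blank})
    return ["".join(p) for p in product(symbols, repeat=num_frames)
            if collapse("".join(p), blank) == target]

paths = alignments("CAB", num_frames=5, alphabet="ABC")
print(len(paths))  # 28
```

Of the 1024 possible 5-frame strings over this alphabet, 28 collapse onto CAB, for example `C_AAB`, `CCCAB`, and `_CAB_`.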

CTC Cost
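
The cost is the negative log of the summed probability of all paths that collapse onto the ground truth. A minimal sketch of the usual dynamic-programming (forward) recursion follows. It assumes `probs[t][i]` holds the probability of `alphabet[i]` at frame `t`; the function name `ctc_cost` is hypothetical, and a practical version would work in log space to avoid underflow on long utterances.

```python
import math

def ctc_cost(probs, target, alphabet, blank="_"):
    """Negative log of the total probability of all paths collapsing to target."""
    index = {s: i for i, s in enumerate(alphabet)}
    # Interleave blanks with the labels: "CAB" -> "_C_A_B_"
    ext = [blank]
    for ch in target:
        ext += [ch, blank]
    # alpha[s] = total probability of all path prefixes that end in ext[s]
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][index[ext[0]]]
    if len(ext) > 1:
        alpha[1] = probs[0][index[ext[1]]]
    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for s, sym in enumerate(ext):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            # Skipping the blank between two labels is allowed only if they differ.
            if s >= 2 and sym != blank and sym != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][index[sym]]
        alpha = new
    # Valid paths end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)
    return -math.log(likelihood)

alphabet = "ABC_"
uniform = [[0.25] * 4 for _ in range(5)]   # 5 frames, every symbol equally likely
cost = ctc_cost(uniform, "CAB", alphabet)
print(cost)
```

With uniform probabilities every path has probability 0.25^5, so the likelihood is just the number of valid paths divided by 1024, which gives a quick sanity check on the recursion.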

Inference

do an argmax for each column
concatenate the resulting characters to obtain a string
collapse the string to get the output
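
The three steps can be sketched as follows. This is a toy illustration with made-up probabilities and a four-symbol alphabet, not the decoder from the deepspeech repository; the matrix is stored one row per frame here, so each row plays the role of a column on the slide.

```python
from itertools import groupby

def greedy_decode(probs, alphabet, blank="_"):
    """probs[t][i] is the probability of alphabet[i] at frame t."""
    # Step 1: argmax over characters for each frame.
    best = [alphabet[max(range(len(frame)), key=frame.__getitem__)]
            for frame in probs]
    # Steps 2 and 3: concatenate, then collapse repeats and blanks.
    merged = [s for s, _ in groupby(best)]
    return "".join(s for s in merged if s != blank)

alphabet = "ABC_"
probs = [
    [0.10, 0.10, 0.70, 0.10],  # C
    [0.10, 0.10, 0.20, 0.60],  # _
    [0.80, 0.10, 0.05, 0.05],  # A
    [0.60, 0.20, 0.10, 0.10],  # A
    [0.10, 0.70, 0.10, 0.10],  # B
]
print(greedy_decode(probs, alphabet))  # CAB
```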

Examples of argmax-decoded outputs

decoded output: younited presidentiol is a lefe in surance company
ground truth:   united presidential is a life insurance company

decoded output: that was sertainly true last week
ground truth:   that was certainly true last week

decoded output: we're now ready to say we're intechnical default a spokesman said
ground truth:   we're not ready to say we're in technical default a spokesman said

Decoding: from characters to language
So we have probabilities of each character at each frame. Now what?
If CER (character error rate) is nearly perfect:
  We're pretty much set. Just use the best character at each frame.
If CER is too high:
  We should enforce some rules from the language. E.g.:
    All words must be valid
    Favor likely word sequences
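
For reference, CER and WER are both edit distances normalized by the length of the ground truth, measured over characters and over words respectively. A minimal sketch, with hypothetical helper names and one of the argmax-decoded examples from the talk as input:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: character edits per reference character."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error rate: word edits per reference word."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

ref = "that was certainly true last week"
hyp = "that was sertainly true last week"
print(cer(ref, hyp), wer(ref, hyp))
```

One wrong character out of 33 is also one wrong word out of 6, which is why a modest CER can still translate into a high WER when no language constraints are applied.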

WFSTs efficiently enforce language constraints
Weighted finite state transducers:
  Automata whose state transitions map a sequence of input symbols to a sequence of output symbols
  Directed graph structure
  States enforce language structure
  Transitions choose amongst valid symbols

For an in-depth review, see Mohri, Pereira & Riley, 2008

Why do we use them?
A lot of decoding concepts map nicely to FSTs. CTC, lexicon (vocabulary) and grammar (language model) can all be easily represented.
Efficient algorithms exist to combine FSTs, giving a single decoding graph
[Figure: CTC graph, lexicon graph, and grammar graph combined into one decoding graph]


CTC is easily implemented as an FST
Removes repeated characters and blanks (_)
Input: C _ A A A _ B B _
Output: C A B
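
Seen as a transducer, CTC needs only one state per symbol: the state remembers the last symbol read, and an arc emits its character only when that character is new. The sketch below hand-codes this machine in Python as an illustration; it is an assumption about the general shape of such a graph, not the graph used in the talk, and `ctc_transducer` is a hypothetical name.

```python
def ctc_transducer(symbols, blank="_"):
    """Run a small transducer whose state is the previously read symbol."""
    state = blank            # start state: behaves as if a blank was just read
    output = []
    for sym in symbols:
        if sym != blank and sym != state:
            output.append(sym)   # arc sym:sym, entering a new character
        # Otherwise the arc outputs nothing (epsilon): a blank or a repeat.
        state = sym
    return output

print(ctc_transducer("C _ A A A _ B B _".split()))  # ['C', 'A', 'B']
```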


A vocabulary (lexicon) is also easy to implement as an FST
Maps a sequence of characters or phonemes to words
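
A lexicon transducer is essentially a trie: each character moves to a new state, and the state reached at the end of a valid word emits that word. The sketch below is a simplified, unweighted illustration with hypothetical names (`build_lexicon`, `transduce`, and the `<word>` marker key); it rejects any character sequence that contains an invalid word.

```python
def build_lexicon(words):
    """Build a trie: each node maps a character to the next node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["<word>"] = word      # final state emits the whole word
    return root

def transduce(lexicon, chars):
    """Map a space-separated character stream to words; None if a word is invalid."""
    words, node = [], lexicon
    for ch in chars + " ":
        if ch == " ":
            if node is lexicon:        # skip repeated spaces
                continue
            if "<word>" not in node:   # stopped in the middle of a word
                return None
            words.append(node["<word>"])
            node = lexicon
        elif ch in node:
            node = node[ch]
        else:
            return None                # no arc for this character
    return words

lexicon = build_lexicon(["quick", "brown", "fox"])
print(transduce(lexicon, "quick brown fox"))  # ['quick', 'brown', 'fox']
print(transduce(lexicon, "quik brown fox"))   # None
```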

WFSTs have a few drawbacks
Less end-to-end: A large number of parameters learned completely separate from the acoustic model
Memory issues with large vocabularies or complex language models

Graph              # States      # Arcs
CTC                31            91
Lexicon            30,629        40,516
Trigram            3,538,579     10,213,039
Composed Trigram   26,817,696    54,104,686

WFSTs greatly improve word error rate
Error rates in %.

Reference                    CER (no LM)   WER (no LM)   WER (trigram LM)   WER (trigram LM w/ enhancements)
Hannun, et al. (2014)        10.7          35.8          14.1               N/A
Graves-Jaitly (2014)         9.2           30.1          N/A                8.7
Hwang-Sung (2016)            10.6          38.4          8.88               8.1
Miao et al. (2015) [Eesen]   N/A           N/A           9.1                7.3
Bahdanau et al. (2016)       6.4           18.6          10.8               9.3
Nervana-Speech               8.64          32.5          8.4                N/A


Nervana's deep learning tutorials: https://www.nervanasys.com/deep-learning-tutorials/

Acoustic model source available at: https://github.com/NervanaSystems/deepspeech

GitHub page: https://github.com/NervanaSystems/neon

For more information, contact: [email protected] info

neon on

model zoo
github.com/NervanaSystems/ModelZoo
model files, parameters

GoogLeNet
AlexNet
VGG
Deep Residual Net
bAbI Q&A
IMDB Sentiment Analysis
Video Activity Detection
Deep Reinforcement Learning
LSTM Image Captioning
Fast-RCNN Object Localization
AllCNN

Don't have to make a model from scratch: many examples of pre-trained models

mention Yinyin (Fast-RCNN), Sathish (C3D), bAbI

THANK YOU!

QUESTIONS?


Convolution

input (3x3):
0 1 2
3 4 5
6 7 8

filter (2x2):
0 1
2 3

output (2x2):
19 25
37 43

Each element in the output is the result of a dot product between two vectors
Example: the input patch (0, 1, 3, 4) and the filter (0, 1, 2, 3) give 0*0 + 1*1 + 3*2 + 4*3 = 19
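
The slide's numbers can be reproduced with a few lines of plain Python. This is a sketch with no padding and a stride of 1, and the function name `convolve2d_valid` is hypothetical. As is common in deep learning, the filter is not flipped, which is what matches the values on the slide.

```python
def convolve2d_valid(inp, filt):
    """Slide filt over inp (no padding, stride 1); dot product at each position."""
    fh, fw = len(filt), len(filt[0])
    out_h = len(inp) - fh + 1
    out_w = len(inp[0]) - fw + 1
    flat_filter = [w for row in filt for w in row]
    output = []
    for r in range(out_h):
        out_row = []
        for c in range(out_w):
            # Flatten the patch under the filter into a vector.
            patch = [inp[r + i][c + j] for i in range(fh) for j in range(fw)]
            out_row.append(sum(p * w for p, w in zip(patch, flat_filter)))
        output.append(out_row)
    return output

inp = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
filt = [[0, 1], [2, 3]]
print(convolve2d_valid(inp, filt))  # [[19, 25], [37, 43]]
```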


To understand convolutional networks, we should first understand the convolution operation and then see how that operation is implemented in a network structure

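Convolution layer
[Figure: the same convolution drawn as a network layer. The 9 input units (0 to 8) connect to the 4 output units (19, 25, 37, 43), and every output unit reuses the same 4 filter weights (0, 1, 2, 3).]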
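
One way to see the convolution as a network layer is to write it as a sparse weight matrix: each output unit is wired to only four input units, and all output units share the same four weights. The sketch below builds that matrix for the 3x3 input and 2x2 filter used in the convolution example. The function name `conv_as_matrix` is hypothetical, and actual frameworks do not materialize this matrix.

```python
def conv_as_matrix(in_h, in_w, filt):
    """Build the weight matrix of a layer equivalent to a valid 2-D convolution."""
    fh, fw = len(filt), len(filt[0])
    out_h, out_w = in_h - fh + 1, in_w - fw + 1
    weights = [[0] * (in_h * in_w) for _ in range(out_h * out_w)]
    for r in range(out_h):
        for c in range(out_w):
            for i in range(fh):
                for j in range(fw):
                    # Output unit (r, c) is wired to input unit (r + i, c + j).
                    weights[r * out_w + c][(r + i) * in_w + (c + j)] = filt[i][j]
    return weights

weights = conv_as_matrix(3, 3, [[0, 1], [2, 3]])
inputs = list(range(9))                      # the 3x3 input, flattened row by row
outputs = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
print(outputs)  # [19, 25, 37, 43]
```

Each row of `weights` has only four non-zero slots, and every row holds the same filter values (0, 1, 2, 3) in those slots, just shifted to different positions. Those are the two properties that distinguish a convolution layer from a fully connected one.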

Bi-directional RNN (BiRNN)

