Erich Elsen, Research Scientist, Baidu Research at MLconf NYC - 4/15/16
TRANSCRIPT
Erich Elsen
Natural User Interfaces
• Goal: Make interacting with computers as natural as interacting with humans
• AI problems:
– Speech recognition
– Emotion recognition
– Semantic understanding
– Dialog systems
– Speech synthesis
Deep Speech Applications
• Voice controlled apps
• Peel Partnership
• English and Mandarin APIs in the US
• Integration into Baidu’s products in China
Deep Speech: End-to-end learning
• Deep neural network predicts probability of characters directly from audio
[Diagram: stacked network layers mapping audio input to the character sequence “T H _ E … D O G”]
Deep Speech: CTC
          Time →
E      .01  .05  .10  .10  .80  .05
H      .01  .10  .10  .60  .05  .05
T      .01  .80  .75  .20  .05  .10
BLANK  .97  .05  .05  .10  .10  .80

• Simplified sequence of network outputs (probabilities)
• Generally many more timesteps than letters
• Need to look at all the ways we can write “the”
• Adjacent characters collapse
• TTTHEE, TTTTHE, TTHHEE, THEEEE, …
• Solve with dynamic programming
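The collapse rule and the dynamic program from the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the warp-ctc implementation; the alphabet, blank symbol, and probabilities are invented for the example.

```python
import numpy as np

BLANK = "_"

def collapse(path):
    # CTC collapse: merge adjacent repeats, then drop blanks.
    # "TTTHEE" -> "THE"; a blank separates genuine repeats: "TT_T" -> "TT"
    out, prev = [], None
    for c in path:
        if c != prev and c != BLANK:
            out.append(c)
        prev = c
    return "".join(out)

def ctc_forward(probs, alphabet, label):
    # P(label | audio): sum over all timestep paths that collapse to `label`,
    # computed by dynamic programming over the blank-extended label _t_h_e_.
    idx = {c: i for i, c in enumerate(alphabet)}
    ext = [BLANK]
    for c in label:
        ext += [c, BLANK]
    T, S = len(probs), len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0][idx[ext[0]]]
    if S > 1:
        alpha[0, 1] = probs[0][idx[ext[1]]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                      # stay on same symbol
            if s > 0:
                a += alpha[t - 1, s - 1]             # advance one symbol
            if s > 1 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]             # skip a blank
            alpha[t, s] = a * probs[t][idx[ext[s]]]
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
```

For a two-symbol alphabet `[_, A]` and two timesteps, the paths collapsing to "A" are `AA`, `A_`, and `_A`, and `ctc_forward` sums exactly those three path probabilities.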
warp-ctc
• Recently open-sourced our CTC implementation
• Efficient, parallel CPU and GPU backends
• 100–400x faster than other implementations
• Apache license, C interface
• https://github.com/baidu-research/warp-ctc
Accuracy scales with Data
[Plot: performance vs. data & model size — deep learning algorithms keep improving with scale, while many previous methods plateau]
• 40% error reduction for each 10x increase in dataset size
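That rule of thumb corresponds to a power law: keeping 60% of the error per decade of data means error ∝ data^(−b) with b ≈ 0.22. A quick check (the exponent is derived here, not stated in the talk):

```python
import math

# 40% relative error reduction per 10x data: err(10*d) = 0.6 * err(d),
# i.e. err(d) is proportional to d**(-b) with b = -log10(0.6)
b = -math.log10(0.6)      # ~0.222

err = 1.0                 # arbitrary starting error
for _ in range(3):        # three decades: 1,000x more data
    err *= 0.6
# after 1,000x more data, 0.6**3 = 21.6% of the original error remains
```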
Training sets
• Train on ~1½ years of data (and growing)
• English and Mandarin
• End-to-end deep learning is key to assembling large datasets
• Datasets drive accuracy
Large Datasets = Large Models
[Plot: accuracy vs. dataset size — the big model continues to improve where the small model levels off]
• Models require over 20 exaflops to train (exa = 10^18)
• Trained on 4+ Terabytes of audio
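Those numbers fix the wall-clock cost: at the ~50 teraflop/s sustained rate quoted in the conclusion, 20 exaflops is roughly 4.6 days of compute. A back-of-envelope check, ignoring restarts and I/O:

```python
total_flops = 20e18     # ~20 exaflops to train one model
sustained = 50e12       # ~50 teraflop/s sustained (Conclusion slide)

seconds = total_flops / sustained   # 400,000 seconds
days = seconds / 86400              # ~4.6 days per training run
```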
Experiment Scaling
• Batch Norm has greater impact with deeper networks
• Sequence-wise normalization: per-feature statistics computed over the minibatch and all timesteps, not per timestep
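A minimal sketch of sequence-wise normalization, assuming (as in Baidu's Deep Speech 2 paper) that each feature's statistics are pooled over the minibatch and all timesteps; the function name and shapes are illustrative:

```python
import numpy as np

def sequence_batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, time, features). Ordinary batch norm would use per-timestep
    # statistics; the sequence-wise variant pools over batch AND time, so
    # one mean/variance per feature covers variable-length utterances.
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```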
Parallelism across GPUs
[Diagram: model parallelism vs. data parallelism — each data-parallel worker consumes its own shard of the training data, and gradients are combined with MPI_Allreduce()]
For these models, Data Parallelism works best
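Data parallelism here means replicating the model on every GPU, sharding each minibatch, and summing gradients with an all-reduce so every replica applies the same update. A toy check with a made-up one-parameter model:

```python
# Toy model: loss = 0.5 * (w*x - y)**2, so d(loss)/dw = (w*x - y) * x
def grad(w, x, y):
    return (w * x - y) * x

w = 2.0
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 5.0, 7.0]

# Serial: gradient summed over the full batch
g_serial = sum(grad(w, x, y) for x, y in zip(xs, ys))

# Data parallel: two "GPUs" each take half the batch;
# the all-reduce is just the sum of the partial gradients
g_gpu0 = sum(grad(w, x, y) for x, y in zip(xs[:2], ys[:2]))
g_gpu1 = sum(grad(w, x, y) for x, y in zip(xs[2:], ys[2:]))
g_parallel = g_gpu0 + g_gpu1
```

Because the all-reduced gradient equals the serial one, every replica stays bit-identical after each update.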
Performance for RNN training
• 55% of GPU FMA peak using a single GPU
• ~48% of peak using 8 GPUs in one node
• Weak scaling is very efficient, albeit algorithmically challenged (ever-larger global batches stop helping convergence)
[Plot: sustained TFLOP/s vs. number of GPUs (1–128), spanning one node to multi-node; a typical training run is marked]
All-reduce
• We implemented our own all-reduce out of send and receive primitives
• Several algorithm choices depending on message size
• Careful attention to process affinity and interconnect topology
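One standard way to build an all-reduce from sends and receives is the bandwidth-optimal ring: a reduce-scatter pass followed by an all-gather pass, each taking n−1 steps. The simulation below sketches the general idea, not Baidu's actual implementation (which, per the slide, also chooses among algorithms by message size):

```python
import numpy as np

def ring_allreduce(vectors):
    """Simulate a ring all-reduce built from point-to-point exchanges.
    vectors: one equal-length array per rank; length divisible by n ranks."""
    n = len(vectors)
    bufs = [np.asarray(v, dtype=float).copy() for v in vectors]
    chunk = len(bufs[0]) // n

    def seg(c):  # byte range of logical chunk c (indices wrap around the ring)
        c %= n
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. After n-1 steps, rank r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        prev = [b.copy() for b in bufs]      # model a simultaneous exchange
        for r in range(n):                   # rank r sends chunk (r - step)
            bufs[(r + 1) % n][seg(r - step)] += prev[r][seg(r - step)]

    # Phase 2: all-gather. Circulate each completed chunk around the ring.
    for step in range(n - 1):
        prev = [b.copy() for b in bufs]
        for r in range(n):                   # rank r sends chunk (r + 1 - step)
            bufs[(r + 1) % n][seg(r + 1 - step)] = prev[r][seg(r + 1 - step)]
    return bufs
```

Each rank sends 2·(n−1)/n of the data regardless of ring size, which is why this algorithm is the usual choice for large gradient buffers.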
Scalability
• Batch size is hard to increase – algorithmic and memory limits
• Performance at small batch sizes (32, 64) leads to scalability limits
Precision
• FP16 also mostly works
– Use FP32 for softmax and weight updates
• More sensitive to labeling error
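Why FP32 for the weight updates: FP16 has only ~11 bits of significand, so a small gradient step added to a weight of magnitude 1 can round away entirely. A minimal illustration:

```python
import numpy as np

w16 = np.float16(1.0)
step = np.float16(1e-4)        # a typical small update

# FP16 spacing near 1.0 is 2**-10 (about 9.8e-4), so the update
# rounds away and the weight never moves:
lost = np.float16(w16 + step)  # still exactly 1.0

# An FP32 master copy of the weight keeps the update:
w32 = np.float32(1.0)
kept = w32 + np.float32(1e-4)  # 1.0001, distinguishable from 1.0
```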
[Histogram: weight distribution — count (log scale, 1 to 10^8) vs. magnitude (binary exponents −31 to 0)]
Conclusion
• We have to do experiments at scale
• Pushing compute scaling for end-to-end deep learning
• Efficient training for large datasets
– 50 teraflop/s sustained on one model
– 20 exaflops to train each model
• Thanks to Bryan Catanzaro, Carl Case, Adam Coates for donating some slides