Source: speech.ee.ntu.edu.tw/~tlkagk/courses/mlds_2015/nn...
NVIDIA® cuDNN
GPU-Accelerated Machine Learning
How GPU Acceleration Works
Application Code = GPU + CPU
GPU: Compute-Intensive Functions (5% of Code, ~80% of run-time)
CPU: Rest of Sequential CPU Code
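To put the "5% of code, ~80% of run-time" split in numbers, Amdahl's law gives the overall speedup when only the compute-intensive part is accelerated. The following is a minimal sketch; the function name `overall_speedup` and the GPU speedup factors are illustrative assumptions, not figures from the slides.

```python
def overall_speedup(accelerated_fraction, factor):
    """Amdahl's law: overall speedup when `accelerated_fraction` of the
    run-time becomes `factor` times faster and the rest is unchanged."""
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / factor
    return 1.0 / remaining

# ~80% of run-time sits in the compute-intensive functions (from the slide).
# The speedup factors for that part are hypothetical, chosen for illustration.
for factor in (2, 10, 100):
    print(f"{factor:>3}x on the hot code -> {overall_speedup(0.8, factor):.2f}x overall")
```

However fast the GPU portion becomes, the remaining ~20% of sequential CPU time caps the overall gain at about 5x under these assumptions.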
3 Ways to Program GPUs
Applications
Libraries: “Drop-in” Acceleration
OpenACC Directives: Easily Accelerate Applications
Programming Languages: Maximum Flexibility
HPC Today
Deep Learning with cuDNN
cuDNN is a library of primitives for deep learning
Software stack, bottom to top: GPUs (Tesla, TX-1, Titan), cuDNN, Frameworks, Applications
LARGE SCALE VISUAL RECOGNITION CHALLENGE (ILSVRC)
Example image labels: person, car, helmet, motorcycle, bird, frog, dog, chair, hammer, flower pot, power drill
1.2M training images • 1000 object categories
CHALLENGE SUMMARY
Entries using GPUs (chart, 2010 to 2014; bar labels: 4, 60, 110)
Image Classification Error Rates (chart): 2010: 28%, 2011: 26%, 2012: 16%, 2013: 12%, 2014: 7%
DEEP LEARNING VISUALIZED
Example Use Cases
Image Classification, Object Detection, Localization
Face Recognition
Speech & Natural Language Processing
Medical Imaging & Interpretation
Seismic Imaging & Interpretation
Recommendation
Deep learning revolutionizing medical research
Detecting Mitosis in Breast Cancer Cells (IDSIA)
Predicting the Toxicity of New Drugs (Johannes Kepler University)
Understanding Gene Mutation to Prevent Disease (University of Toronto)
cuDNN Version 2
cuDNN Design Goals
Basic Deep Learning Subroutines: allow the user to write a DNN application without any CUDA code
Flexible Layout: handle any data layout
Memory vs. Performance: good performance with minimal memory use, great performance with more memory use
DNN ROUTINES
Convolutions – 80-90% of the execution time
Pooling – Spatial smoothing
Activation – Pointwise non-linear function
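The pooling and activation routines listed above can be illustrated in a few lines. This is a plain-Python sketch of the operations themselves, not cuDNN's API; the choice of 2x2 max pooling and ReLU as the concrete variants, and the function names, are assumptions made for the example.

```python
def relu(feature_map):
    """Activation: apply max(0, x) independently to every element."""
    return [[max(0.0, x) for x in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Pooling: keep the maximum of each non-overlapping 2x2 window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]

# A small 4x4 feature map with made-up values.
x = [[-1.0, 2.0, 0.5, -3.0],
     [4.0, -2.0, 1.0, 0.0],
     [0.0, 1.0, -1.0, -2.0],
     [3.0, -4.0, -0.5, 2.0]]

activated = relu(x)               # negatives become 0.0, same 4x4 shape
pooled = max_pool_2x2(activated)  # 4x4 -> 2x2
print(pooled)                     # [[4.0, 1.0], [3.0, 2.0]]
```

Both operations touch each input element only a few times, which fits the slide's point that convolutions, not these routines, dominate execution time.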
CONVOLUTIONS – The MAIN Workload
2D conv as a GEMV
Image (6x6):
I1  I2  I3  I4  I5  I6
I7  I8  I9  I10 I11 I12
I13 I14 I15 I16 I17 I18
I19 I20 I21 I22 I23 I24
I25 I26 I27 I28 I29 I30
I31 I32 I33 I34 I35 I36

Filter (3x3):
F1 F2 F3
F4 F5 F6
F7 F8 F9

Multi-convolve: unrolled image patches (one per row) times the filter as a column vector
I1 I2 I3 I7 I8 I9  I13 I14 I15
I2 I3 I4 I8 I9 I10 I14 I15 I16    x    [F1 F2 F3 F4 F5 F6 F7 F8 F9]^T
I3 I4 I5 I9 I10 I11 I15 I16 I17
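The unrolling shown on this slide can be reproduced in plain Python. This is a minimal sketch assuming stride 1, no padding and a single filter; the names `im2col` and `conv2d_as_gemv` are illustrative and are not cuDNN functions. As on the slide, the filter is applied without flipping (cross-correlation).

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2D image into one row (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    return [
        [image[r + dr][c + dc] for dr in range(k) for dc in range(k)]
        for r in range(h - k + 1)
        for c in range(w - k + 1)
    ]

def conv2d_as_gemv(image, filt):
    """Compute the sliding-window product as (patch matrix) x (filter vector)."""
    k = len(filt)
    filter_vector = [f for row in filt for f in row]   # F1..F9 as one vector
    patches = im2col(image, k)                         # one row per output pixel
    flat = [sum(p * f for p, f in zip(patch, filter_vector)) for patch in patches]
    out_w = len(image[0]) - k + 1
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]

# 6x6 image holding the values 1..36, matching the slide's I1..I36 numbering.
image = [[6 * r + c + 1 for c in range(6)] for r in range(6)]
# Hypothetical 3x3 filter that sums the diagonal of each patch.
filt = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(im2col(image, 3)[0])             # [1, 2, 3, 7, 8, 9, 13, 14, 15]
print(conv2d_as_gemv(image, filt)[0])  # [24, 27, 30, 33]
```

With several filters, stacking the filter vectors as columns turns the matrix-vector product into a matrix-matrix product (GEMM), at the cost of the extra memory needed to hold the duplicated patch values.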
cuDNN V2 Flexibility
cuDNN V2 new features
cuDNN Version 2
Accelerates key routines to improve performance of neural net training
Up to 1.8x faster on AlexNet than a baseline GPU implementation
New support for 3D convolutions
Integrated into all major Deep Learning frameworks: Caffe, Theano, Torch
Chart: speedup With cuDNN relative to Baseline (GPU) = 1.0x: Caffe (GoogLeNet) 1.6x, Torch (OverFeat) 1.2x
Chart: Images Trained Per Day (Caffe AlexNet), in millions of images: 16 Core CPU 2.5M, GTX Titan 18M, Titan Black with cuDNN v1 23M, Titan X with cuDNN v2 43M
CPU: E5-2698 v3 @ 2.3GHz / 3.6GHz Turbo
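The throughput figures from the chart can be turned into relative speedups with a few lines of arithmetic. This sketch assumes the bar values pair with the hardware labels in the order they appear; the helper name `speedup_over` is illustrative.

```python
# Images trained per day, in millions (Caffe AlexNet), read off the chart.
# Assumption: values and hardware labels are paired in left-to-right order.
images_per_day = {
    "16 Core CPU": 2.5,
    "GTX Titan": 18.0,
    "Titan Black, cuDNN v1": 23.0,
    "Titan X, cuDNN v2": 43.0,
}

def speedup_over(baseline, results):
    """Return each configuration's throughput relative to `baseline`."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}

ratios = speedup_over("16 Core CPU", images_per_day)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

Under that pairing, the Titan X with cuDNN v2 trains roughly 17x as many images per day as the 16-core CPU.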
NVIDIA® cuDNN Roadmap
Timeline: Q3’14, Q4’14, Q1’15, Q2’15
Release 1 (September 2014)
Layers (forward & backprop)
- Convolutional
- Pooling
- Softmax
- ReLU/Sigmoid/Tanh
Performance Features
- High performance convolution
Release 2 and Release 3
Layers
- Local receptive field
- Contrast normalization
- Fully-connected
- Recurrent
Performance Features
- Support for multiple GPUs per node
- Faster convolution routines
- Tuning for future chips
GPU-Accelerated Deep Learning Frameworks
Framework      Description                     cuDNN  License     Interface(s)
CAFFE          Deep Learning Framework         R2     BSD-2       Text-based definition files, C++, Python, MATLAB
TORCH          Scientific Computing Framework  R2     BSD         Python, Lua, MATLAB
THEANO         Math Expression Compiler        R2     BSD         Python
Nervana neon   Deep Learning Framework         --     Apache 2.0  Python
CUDA-CONVNET2  Deep Learning Application       --     Apache 2.0  C++
KALDI          Speech Recognition Toolkit      --     Apache 2.0  C++, Shell scripts

Multi-GPU In Progress In Progress In Progress (nnet2)
Multi-CPU (nnet2)
Embedded (TK1)
http://developer.nvidia.com/deeplearning
Using cuDNN
cuDNN Easy to Enable
DIGITS
Deep Learning GPU Training System
Visualization tool for DNN training
Use default network, import one, or design your own
Import your training data from disk or web
Monitor multiple trainings in parallel
DIGITS
Process Data, Configure DNN, Monitor Progress, Visualize Layers, Test Image
DIGITS
Deep Learning GPU Training System
Who it is for
Deep learning researchers
Automotive
Medical Researchers
Defense
Intelligent Video Analytics
Web Companies
Startups
Thank you!
Developer Zone: https://developer.nvidia.com/deeplearning
GPU Technology Conference: http://www.gputechconf.com/
cuDNN Download: https://developer.nvidia.com/cuDNN
DIGITS Download: https://developer.nvidia.com/digits
DIGITS Source: https://www.github.com/nvidia/digits