Posted on 29-Aug-2018

TRANSCRIPT
1
© 2018 The MathWorks, Inc.
GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB
Girish Venkataramani, Arvind Jayaraman, Jaya Shankar
2
GPUs and CUDA programming
[Chart: languages plotted by ease of programming (expressivity, algorithmic, …) against performance. MATLAB and Python are easier; CUDA, OpenCL, and C/C++ are faster.]
GPUs are “hardware on steroids”, … but, programming them is hard
GPU Coder
3
Consider an example: saxpy
Scalarized MATLAB
Vectorized MATLAB
Automatic compilation from an expressive language to a high-performance language
4
Deep Learning applications
TensorRT is great for inference, … but, applications require more than inference
TensorRT
Traffic Sign Recognition
Deep learning
5
GPU Coder is new technology released in September 2017
Neural Networks: deep learning, machine learning
Image Processing and Computer Vision
Image filtering, feature detection/extraction
Signal Processing and Communications
FFT, filtering, cross-correlation, …
5x faster than TensorFlow, 2x faster than MXNet
60x faster than CPUs for stereo disparity
20x faster than CPUs for FFTs
Accelerated implementation of parallel algorithms on GPUs
7
Talk outline
Introduction
GPU Coder internals
– Automatic parallelization
– Memory optimization
– Deep learning compilation
Application demo: Lidar processing in MATLAB using deep learning
8
GPU Coder automatically extracts parallelism from MATLAB
1. Scalarized MATLAB (“for-all” loops)
2. Vectorized MATLAB (math operators and library functions)
3. Composite functions in MATLAB (maps to cuBLAS, cuFFT, cuSolver, cuDNN, TensorRT)
Infer CUDA kernels from MATLAB loops
Vendor library replacement
9
From a loop to a CUDA kernel

for k = 1:n
    t = A(k) .* X(k);
    C(k) = t + Y(k);
end

Is this loop parallel? Dependence analysis to understand the iteration space.

Classify kernel variables (input, output, local):
Ins: A, X, Y, n; Outs: C; Local: t, k

Create kernel from loop body; compute kernel size f(n):

static __global__ void mykernel(A, X, Y, C, n) {
    int k = getThreadIndex(n);
    int t = A[k] * X[k];
    C[k] = t + Y[k];
}

{ …
  mykernel<<< f(n) >>>(A, X, Y, C, n);
… }

Extracting parallelism in MATLAB: 1. Scalarized MATLAB (for loops) 2. Vectorized MATLAB 3. Composite functions
10
From a loop-nesting to CUDA kernels

Perfect loops:
for i = 1:p
  for j = 1:m
    for k = 1:n
      …(inner loop)…
    end
  end
end

Imperfect loops:
for i = 1:p
  …(outer prologue code)…
  for j = 1:m
    for k = 1:n
      …(inner loop)…
    end
    …(outer epilogue code)…
  end
end

Fundamental requirement: loops need to be contiguous for parallelization.
11
From a loop-nesting to CUDA kernels

Imperfect loops:
for i = 1:p
  …(outer prologue code)…
  for j = 1:m
    for k = 1:n
      …(inner loop)…
    end
    …(outer epilogue code)…
  end
end

Loop perfectization (if possible):
for i = 1:p
  for j = 1:m
    for k = 1:n
      … (outer prologue code) …
      …(inner loop)…
      if k == n
        … (outer epilogue code) …
      end
    end
  end
end

Fundamental requirement: loops need to be contiguous for parallelization.
12
From a loop-nesting to CUDA kernels

Find parallel loops: dependence analysis.
Partition loop nesting: heuristic may favor the larger iteration space.

Example 1 (outer space M x N, inner space K x P):
for a = 1:M
  for b = 1:N
    …(outer prologue code)…
    for c = 1:K
      for d = 1:P
        …(inner loop)…
      end
    end
  end
end

Example 2 (outer space P, inner spaces MxN + KxL):
for i = 1:P
  for a = 1:M
    for b = 1:N
      …(inner loop)…
    end
  end
  for x = 1:K
    for y = 1:L
      …(inner loop)…
    end
  end
end

Fundamental requirement: loops need to be contiguous for parallelization.
13
From a loop-nesting to CUDA kernels

Find parallel loops: dependence analysis.
Partition loop nesting: heuristic may favor the larger iteration space.
Create a kernel for each partition: use the process from single-loop conversion.

(Example 1 and Example 2 loop nests as on the previous slide, with iteration spaces M x N / K x P, and P / MxN + KxL.)

Fundamental requirement: loops need to be contiguous for parallelization.
14
From vectorized MATLAB to CUDA kernels
output(:, 1) = (input(:, 1) – x_im) .* factor;
Loop Fusion
Create larger parallel loops (and hence CUDA kernels)
Scalarization
Reduce to for-loops
for i = 1:Mdiff(i) = input(i, 1) – x_im(i);
endfor a = 1:M
output(i, 1) = diff(i) * factor(i);end
for i = 1:Mdiff(i) = input(i, 1) – x_im(i);output(i, 1) = diff(i) * factor(i);
end
Assume the following sizes: ‘output’ : M x 3‘input’ : M x 3‘x_im’ : M x 1‘factor’ : M x 1
15
From vectorized MATLAB to CUDA kernels

output(:, 1) = (input(:, 1) - x_im) .* factor;

Loop fusion: create larger parallel loops (and hence CUDA kernels).
for i = 1:M
  diff(i) = input(i, 1) - x_im(i);
  output(i, 1) = diff(i) * factor(i);
end

Scalar replacement: reduce temp matrices to temp scalars.
for i = 1:M
  tmp = input(i, 1) - x_im(i);
  output(i, 1) = tmp * factor(i);
end

Assume the following sizes: ‘output’: M x 3, ‘input’: M x 3, ‘x_im’: M x 1, ‘factor’: M x 1
16
From composite functions to optimized CUDA
Core math:
• Matrix multiply (cuBLAS)
• Linear algebra (cuSolver)
• FFT functions (cuFFT)
• Convolution
• …

Image processing / computer vision:
• imfilter
• imresize
• imerode
• imdilate
• bwlabel
• imwarp
• …

Neural networks:
• Deep learning inference (cuDNN, TensorRT)
• …

Over 300 MATLAB functions are optimized for CUDA code generation.
17
Talk outline
Introduction
GPU Coder internals
– Automatic parallelization
– Memory optimization
– Deep learning compilation
Application demo: Lidar processing in MATLAB using deep learning
18
Optimizing CPU-GPU data movement is a challenge

MATLAB source:
A = …
…
for i = 1:N
  … A(i) …
end
…
imfilter
…

Generated CUDA kernels:
A = …
…
kernel1<<<…>>>(gA)
…
imfilter_kernel(…)
…

Naive memcpy placement:
A = …
…
cudaMemcpyHtoD(gA, A);
kernel1<<<…>>>(gA);
cudaMemcpyDtoH(…);
…
cudaMemcpyHtoD(…);
imfilter_kernel(…);
cudaMemcpyDtoH(…);
…
Where is the ideal placement of memcpy?

[Diagram: kernels K1, K2, K3 under naive memcpy placement vs. optimized placement.]
19
GPU Coder optimizes memcpy placement

A(:) = …
C(:) = …
for i = 1:N
  …
  gB = kernel1(gA);
  gA = kernel2(gB);
  if (some_condition)
    gC = kernel3(gA, gB);
  end
  …
end
… = C;

Depending on the access pattern, a cudaMemcpy is *definitely* needed, *not* needed, or *may be* needed.

Observations:
• Equivalent to partial redundancy elimination (PRE)
• Dynamic strategy: track memory location with a status flag per variable
• Use-def analysis determines where to insert memcpy

Assume gA, gB and gC are mapped to GPU memory.

Generated (pseudo) code:
A(:) = …
A_isDirtyOnCpu = true;
…
for i = 1:N
  if (A_isDirtyOnCpu)
    cudaMemcpy(gA, A);
    A_isDirtyOnCpu = false;
  end
  gB = kernel1(gA);
  gA = kernel2(gB);
  if (some_condition)
    gC = kernel3(gA, gB);
    C_isDirtyOnGpu = true;
  end
  …
end
…
if (C_isDirtyOnGpu)
  cudaMemcpy(C, gC);
  C_isDirtyOnGpu = false;
end
… = C;
20
GPU memory hierarchy is deep
Memory                 | Visibility     | Heuristics/notes                        | GPU Coder support
Global memory          | CPU + GPU      | Share data b/w CPU and GPU              | Yes
Local memory/registers | per GPU thread | Thread-local data                       | Yes
Shared memory          | per GPU block  | Shared data between threads             | Yes
Texture memory         | CPU + GPU      | Shared read-only data with 2D alignment | Future
Constant memory        | GPU            | Read-only constants                     | Yes
21
[Diagram: imfilter as a dot product (“Dotprod”) of a kh-by-kw convolution kernel against each neighborhood of the rows-by-cols input image, producing the output image.]
GPU Coder automatically maps data to shared memory
GPU Coder automatically creates shared memory for many MATLAB image processing functions:
imfilter, imerode, imdilate, conv2, …
22
GPU Coder runs a host of compiler transforms to generate CUDA
Control-flow graph intermediate representation (CFG-IR)
Front end
Traditional compiler optimizations
MATLAB library function mapping

Loop optimizations:
• Scalarization
• Loop perfectization
• Loop interchange
• Loop fusion
• Scalar replacement

CUDA kernel optimizations:
• Parallel loop creation
• CUDA kernel creation
• cudaMemcpy minimization
• Shared memory mapping
• CUDA code emission
24
Easily re-target to Jetson and Drive platforms
Cross-compile for NVIDIA boards
– Jetson boards
– Drive PX2

Two small changes:
1. Change build-type to ‘lib’
2. Select cross-compile toolchain
25
Talk outline
Introduction
GPU Coder internals
– Automatic parallelization
– Memory optimization
– Deep learning compilation
Application demo: Lidar processing in MATLAB using deep learning
26
Deep learning workflow in MATLAB

Train in MATLAB
Model importer
Trained DNN
DNN design + training

Design in MATLAB:
• Manage large image sets
• Automate data labeling
• Easy access to models

Training in MATLAB:
• Acceleration with GPUs
• Scale to clusters
27
Deep learning workflow in MATLAB

Train in MATLAB
Model importer
Trained DNN
Application logic
DNN design + training
Application design
28
Deep learning workflow in MATLAB

Train in MATLAB
Model importer
Trained DNN
Application logic
DNN design + training
Application design
C++/CUDA + TensorRT
C++/CUDA + cuDNN
Standalone deployment
29
Deep learning workflow in MATLAB

Train in MATLAB
Model importer
Trained DNN
Application logic
DNN design + training
Application design
C++/CUDA + TensorRT
C++/CUDA + cuDNN
Standalone deployment
30
Traffic sign detection and recognition
Object detection DNN → Strongest bounding box → Classifier DNN
31
C++/CUDA + TensorRT
C++/CUDA + cuDNN
C++/CUDA + TensorRT (int8)
GPU Coder allows for choice in deployment (cuDNN, TensorRT)
Application logic
32
Performance summary (VGG-16) on Titan Xp

[Bar chart, axis 0-400: MATLAB (cuDNN fp32), GPU Coder (cuDNN fp32), GPU Coder (TensorRT fp32), GPU Coder (TensorRT int8).]
33
MATLAB and GPU Coder support state-of-the-art classification networks
(GoogLeNet, ResNet-50, AlexNet, SqueezeNet)

Network    | # Layers | Size   | Frame rate (GPU Coder)
AlexNet    | 25       | 227 MB | 500 fps
ResNet-50  | 177      | 96 MB  | 160 fps
GoogLeNet  | 144      | 27 MB  | 190 fps
SqueezeNet | 68       | 5 MB   | 615 fps
34
AlexNet inference on NVIDIA Titan Xp

[Bar chart: frames per second vs. batch size for MATLAB GPU Coder + TensorRT 3.0.1, MATLAB GPU Coder + TensorRT 3.0.1 (int8), MATLAB GPU Coder + cuDNN, TensorFlow (1.6.0), and mxNet (1.1.0).]

Testing platform: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz, NVIDIA Pascal Titan Xp GPU, cuDNN v7
35
VGG-16 inference on NVIDIA Titan Xp

[Bar chart: frames per second vs. batch size for MATLAB GPU Coder + TensorRT 3.0.1, MATLAB GPU Coder + TensorRT 3.0.1 (int8), MATLAB GPU Coder + cuDNN, TensorFlow (1.6.0), and mxNet (1.1.0).]

Testing platform: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz, NVIDIA Pascal Titan Xp GPU, cuDNN v7
36
Talk outline
Introduction
GPU Coder internals
– Automatic parallelization
– Memory optimization
– Deep learning compilation
Application demo: Lidar processing in MATLAB using deep learning
37
LiDAR Processing for Autonomous Vehicles
Design deep learning-based LiDAR algorithm in MATLAB:
• Automate ground-truth labeling
• Pre-process LiDAR data for training
• GPU-accelerated training

High-performance inference
38
Why use LiDAR and deep learning for autonomous vehicles?

Why use LiDAR?
– Provides accurate 3-D structure of the scene
– Required sensor for autonomous driving

Why use deep learning?
– Best accuracy
– High-performance inference with GPU Coder

LiDAR semantic segmentation (cars, ground)
Classifying point cloud clusters
40
Lidar processing application design is easy in MATLAB
Train in MATLAB
Model importer
Trained DNN
Application logic
C++/CUDA + TensorRT
C++/CUDA + cuDNN
Standalone deployment
DNN design + training
Application design
41
Lidar processing application design is easy in MATLAB
Trained DNN
DNN design + training
Data prep, labeling
Training
Application logic
C++/CUDA + TensorRT
C++/CUDA + cuDNN
Standalone Deployment
Application design
42
Data preparation and labeling of Lidar is a challenge
Trained DNN
DNN design + training

Accessing lidar data
Lidar pre-processing
Labeling lidar data
Training
43
Access and Visualize LiDAR Data
Access stored LiDAR data:
• Velodyne file I/O (pcap)
• Individual point clouds (.pcd, .ply)

Visualize LiDAR data:
• Streaming LiDAR player
• Static point cloud display
• Point cloud differences
44
LiDAR Preprocessing
ROI + remove ground
• Fit plane using RANSAC

Cluster
• Segment clusters using Euclidean distance
48
Organize Data for Training
Raw Point Cloud Data
Ground Truth Labels Transformed to Label Mask
Project to 2D
51
Deep learning workflow in MATLAB
Train in MATLAB
Model importer
Trained DNN
Application logic
C++/CUDA + TensorRT
C++/CUDA + cuDNN
Standalone deployment
DNN design + training
Application design
52
Check Out Deep Learning in MATLAB and GPU Coder
GPU Coder: https://www.mathworks.com/products/gpu-coder.html
Deep learning in MATLAB: https://www.mathworks.com/solutions/deep-learning.html
Deep Learning Onramp (free self-paced online training): https://matlabacademy.mathworks.com/R2017b/portal.html?course=deeplearning