Posted on 29-Aug-2018

TRANSCRIPT
1
© 2018 The MathWorks, Inc.
GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB
Girish Venkataramani, Arvind Jayaraman, Jaya Shankar
2
GPUs and CUDA programming
[Chart: languages plotted by ease of programming (expressivity, algorithmic, …) against performance. MATLAB and Python are easier; CUDA, OpenCL, and C/C++ are faster.]
GPUs are “hardware on steroids”, … but, programming them is hard
GPU Coder
3
Consider an example: saxpy
Scalarized MATLAB
Vectorized MATLAB
Automatic compilation from an expressive language to a high-performance language
4
Deep Learning applications
TensorRT is great for inference, … but, applications require more than inference
TensorRT
Traffic Sign Recognition
Deep learning
5
GPU Coder is new technology released in September 2017
Neural Networks: deep learning, machine learning
Image Processing and Computer Vision
Image filtering, feature detection/extraction
Signal Processing and Communications
FFT, filtering, cross-correlation, …
5x faster than TensorFlow, 2x faster than MXNet
60x faster than CPUs for stereo disparity
20x faster than CPUs for FFTs
Accelerated implementation of parallel algorithms on GPUs
7
Talk outline
Introduction
GPU Coder internals
– Automatic parallelization
– Memory optimization
– Deep learning compilation
Application demo: Lidar processing in MATLAB using deep learning
8
GPU Coder automatically extracts parallelism from MATLAB
1. Scalarized MATLAB (“for-all” loops)
2. Vectorized MATLAB (math operators and library functions)
3. Composite functions in MATLAB (maps to cuBLAS, cuFFT, cuSolver, cuDNN, TensorRT)
Infer CUDA kernels from MATLAB loops
Vendor library replacement
9
From a loop to a CUDA kernel

for k = 1:n
    t = A(k) .* X(k);
    C(k) = t + Y(k);
end

Is this loop parallel? Dependence analysis to understand the iteration space.

Classify kernel variables (input, output, local):
Ins: A, X, Y, n; Outs: C; Local: t, k

Create kernel from loop body; compute kernel size f(n):

static __global__ void mykernel(A, X, Y, C, n) {
    int k = getThreadIndex(n);
    int t = A[k] * X[k];
    C[k] = t + Y[k];
}

{ …
  mykernel<<< f(n) >>>(A, X, Y, C, n);
… }

Extracting parallelism in MATLAB: 1. Scalarized MATLAB (for loops) 2. Vectorized MATLAB 3. Composite functions
10
From a loop-nesting to CUDA kernels

Perfect loops:
for i = 1:p
  for j = 1:m
    for k = 1:n
      …(inner loop)…
    end
  end
end

Imperfect loops:
for i = 1:p
  …(outer prologue code)…
  for j = 1:m
    for k = 1:n
      …(inner loop)…
    end
    …(outer epilogue code)…
  end
end

Fundamental requirement: loops need to be contiguous for parallelization.
11
From a loop-nesting to CUDA kernels

Imperfect loops:
for i = 1:p
  …(outer prologue code)…
  for j = 1:m
    for k = 1:n
      …(inner loop)…
    end
    …(outer epilogue code)…
  end
end

Loop perfectization (if possible):
for i = 1:p
  for j = 1:m
    for k = 1:n
      … (outer prologue code) …
      …(inner loop)…
      if k == n
        … (outer epilogue code) …
      end
    end
  end
end

Fundamental requirement: loops need to be contiguous for parallelization.
12
From a loop-nesting to CUDA kernels

Find parallel loops: dependence analysis.
Partition loop nesting: heuristic may favor the larger iteration space.

Example 1 (outer space M x N, inner space K x P):
for a = 1:M
  for b = 1:N
    …(outer prologue code)…
    for c = 1:K
      for d = 1:P
        …(inner loop)…
      end
    end
  end
end

Example 2 (outer space P, inner spaces MxN + KxL):
for i = 1:P
  for a = 1:M
    for b = 1:N
      …(inner loop)…
    end
  end
  for x = 1:K
    for y = 1:L
      …(inner loop)…
    end
  end
end

Fundamental requirement: loops need to be contiguous for parallelization.
13
From a loop-nesting to CUDA kernels

Find parallel loops: dependence analysis.
Partition loop nesting: heuristic may favor the larger iteration space.
Create a kernel for each partition: use the process from single-loop conversion.

(Example 1 and Example 2 loop nests as on the previous slide, with iteration spaces M x N / K x P, and P / MxN + KxL.)

Fundamental requirement: loops need to be contiguous for parallelization.
14
From vectorized MATLAB to CUDA kernels
output(:, 1) = (input(:, 1) – x_im) .* factor;
Loop Fusion
Create larger parallel loops (and hence CUDA kernels)
Scalarization
Reduce to for-loops
for i = 1:Mdiff(i) = input(i, 1) – x_im(i);
endfor a = 1:M
output(i, 1) = diff(i) * factor(i);end
for i = 1:Mdiff(i) = input(i, 1) – x_im(i);output(i, 1) = diff(i) * factor(i);
end
Assume the following sizes: ‘output’ : M x 3‘input’ : M x 3‘x_im’ : M x 1‘factor’ : M x 1
15
From vectorized MATLAB to CUDA kernels

output(:, 1) = (input(:, 1) - x_im) .* factor;

Loop fusion: create larger parallel loops (and hence CUDA kernels).
for i = 1:M
  diff(i) = input(i, 1) - x_im(i);
  output(i, 1) = diff(i) * factor(i);
end

Scalar replacement: reduce temp matrices to temp scalars.
for i = 1:M
  tmp = input(i, 1) - x_im(i);
  output(i, 1) = tmp * factor(i);
end

Assume the following sizes: ‘output’: M x 3, ‘input’: M x 3, ‘x_im’: M x 1, ‘factor’: M x 1
16
From composite functions to optimized CUDA
Core math:
• Matrix multiply (cuBLAS)
• Linear algebra (cuSolver)
• FFT functions (cuFFT)
• Convolution
• …

Image processing / computer vision:
• imfilter
• imresize
• imerode
• imdilate
• bwlabel
• imwarp
• …

Neural networks:
• Deep learning inference (cuDNN, TensorRT)
• …

Over 300 MATLAB functions are optimized for CUDA code generation.
17
Talk outline
Introduction
GPU Coder internals
– Automatic parallelization
– Memory optimization
– Deep learning compilation
Application demo: Lidar processing in MATLAB using deep learning
18
Optimizing CPU-GPU data movement is a challenge

MATLAB source:
A = …
…
for i = 1:N
  … A(i) …
end
…
imfilter
…

Generated CUDA kernels:
A = …
…
kernel1<<<…>>>(gA)
…
imfilter_kernel(…)
…

Naive memcpy placement:
A = …
…
cudaMemcpyHtoD(gA, A);
kernel1<<<…>>>(gA);
cudaMemcpyDtoH(…);
…
cudaMemcpyHtoD(…);
imfilter_kernel(…);
cudaMemcpyDtoH(…);
…
Where is the ideal placement of memcpy?

[Diagram: kernels K1, K2, K3 under naive memcpy placement vs. optimized placement.]
19
GPU Coder optimizes memcpy placement

A(:) = …
C(:) = …
for i = 1:N
  …
  gB = kernel1(gA);
  gA = kernel2(gB);
  if (some_condition)
    gC = kernel3(gA, gB);
  end
  …
end
… = C;

Depending on the access pattern, a cudaMemcpy is *definitely* needed, *not* needed, or *may be* needed.

Observations:
• Equivalent to partial redundancy elimination (PRE)
• Dynamic strategy: track memory location with a status flag per variable
• Use-def analysis determines where to insert memcpy

Assume gA, gB and gC are mapped to GPU memory.

Generated (pseudo) code:
A(:) = …
A_isDirtyOnCpu = true;
…
for i = 1:N
  if (A_isDirtyOnCpu)
    cudaMemcpy(gA, A);
    A_isDirtyOnCpu = false;
  end
  gB = kernel1(gA);
  gA = kernel2(gB);
  if (some_condition)
    gC = kernel3(gA, gB);
    C_isDirtyOnGpu = true;
  end
  …
end
…
if (C_isDirtyOnGpu)
  cudaMemcpy(C, gC);
  C_isDirtyOnGpu = false;
end
… = C;
20
GPU memory hierarchy is deep
Memory                 | Visibility     | Heuristics/notes                        | GPU Coder support
Global memory          | CPU + GPU      | Share data b/w CPU and GPU              | Yes
Local memory/registers | per GPU thread | Thread-local data                       | Yes
Shared memory          | per GPU block  | Shared data between threads             | Yes
Texture memory         | CPU + GPU      | Shared read-only data with 2D alignment | Future
Constant memory        | GPU            | Read-only constants                     | Yes
21
[Diagram: imfilter as a dot product (“Dotprod”) of a kh-by-kw convolution kernel against each neighborhood of the rows-by-cols input image, producing the output image.]
GPU Coder automatically maps data to shared memory
GPU Coder automatically creates shared memory for many MATLAB image processing functions:
imfilter, imerode, imdilate, conv2, …
22
GPU Coder runs a host of compiler transforms to generate CUDA
Control-flow graph intermediate representation (CFG-IR)
Front end
Traditional compiler optimizations
MATLAB library function mapping

Loop optimizations:
• Scalarization
• Loop perfectization
• Loop interchange
• Loop fusion
• Scalar replacement

CUDA kernel optimizations:
• Parallel loop creation
• CUDA kernel creation
• cudaMemcpy minimization
• Shared memory mapping
• CUDA code emission
24
Easily re-target to Jetson and Drive platforms
Cross-compile for NVIDIA boards
– Jetson boards
– Drive PX2

Two small changes:
1. Change build-type to ‘lib’
2. Select cross-compile toolchain
25
Talk outline
Introduction
GPU Coder internals
– Automatic parallelization
– Memory optimization
– Deep learning compilation
Application demo: Lidar processing in MATLAB using deep learning
26
Deep learning workflow in MATLAB

Train in MATLAB
Model importer
Trained DNN
DNN design + training

Design in MATLAB:
• Manage large image sets
• Automate data labeling
• Easy access to models

Training in MATLAB:
• Acceleration with GPUs
• Scale to clusters
27
Deep learning workflow in MATLAB

Train in MATLAB
Model importer
Trained DNN
Application logic
DNN design + training
Application design
28
Deep learning workflow in MATLAB

Train in MATLAB
Model importer
Trained DNN
Application logic
DNN design + training
Application design
C++/CUDA + TensorRT
C++/CUDA + cuDNN
Standalone deployment
29
Deep learning workflow in MATLAB

Train in MATLAB
Model importer
Trained DNN
Application logic
DNN design + training
Application design
C++/CUDA + TensorRT
C++/CUDA + cuDNN
Standalone deployment
30
Traffic sign detection and recognition
Object detection DNN → Strongest bounding box → Classifier DNN
31
C++/CUDA + TensorRT
C++/CUDA + cuDNN
C++/CUDA + TensorRT (int8)
GPU Coder allows for choice in deployment (cuDNN, TensorRT)
Application logic
32
Performance summary (VGG-16) on Titan Xp

[Bar chart, axis 0-400: MATLAB (cuDNN fp32), GPU Coder (cuDNN fp32), GPU Coder (TensorRT fp32), GPU Coder (TensorRT int8).]
33
MATLAB and GPU Coder support state-of-the-art classification networks
(GoogLeNet, ResNet-50, AlexNet, SqueezeNet)

Network    | # Layers | Size   | Frame rate (GPU Coder)
AlexNet    | 25       | 227 MB | 500 fps
ResNet-50  | 177      | 96 MB  | 160 fps
GoogLeNet  | 144      | 27 MB  | 190 fps
SqueezeNet | 68       | 5 MB   | 615 fps
34
AlexNet inference on NVIDIA Titan Xp

[Bar chart: frames per second vs. batch size for MATLAB GPU Coder + TensorRT 3.0.1, MATLAB GPU Coder + TensorRT 3.0.1 (int8), MATLAB GPU Coder + cuDNN, TensorFlow (1.6.0), and mxNet (1.1.0).]

Testing platform: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz, NVIDIA Pascal Titan Xp GPU, cuDNN v7
35
VGG-16 inference on NVIDIA Titan Xp

[Bar chart: frames per second vs. batch size for MATLAB GPU Coder + TensorRT 3.0.1, MATLAB GPU Coder + TensorRT 3.0.1 (int8), MATLAB GPU Coder + cuDNN, TensorFlow (1.6.0), and mxNet (1.1.0).]

Testing platform: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz, NVIDIA Pascal Titan Xp GPU, cuDNN v7
36
Talk outline
Introduction
GPU Coder internals
– Automatic parallelization
– Memory optimization
– Deep learning compilation
Application demo: Lidar processing in MATLAB using deep learning
37
LiDAR Processing for Autonomous Vehicles
Design deep learning-based LiDAR algorithm in MATLAB:
• Automate ground-truth labeling
• Pre-process LiDAR data for training
• GPU-accelerated training

High-performance inference
38
Why use LiDAR and deep learning for autonomous vehicles?

Why use LiDAR?
– Provides accurate 3-D structure of the scene
– Required sensor for autonomous driving

Why use deep learning?
– Best accuracy
– High-performance inference with GPU Coder

LiDAR semantic segmentation (cars, ground)
Classifying point cloud clusters
40
Lidar processing application design is easy in MATLAB
Train in MATLAB
Model importer
Trained DNN
Application logic
C++/CUDA + TensorRT
C++/CUDA + cuDNN
Standalone deployment
DNN design + training
Application design
41
Lidar processing application design is easy in MATLAB
Trained DNN
DNN design + training
Data prep, labeling
Training
Application logic
C++/CUDA + TensorRT
C++/CUDA + cuDNN
Standalone Deployment
Application design
42
Data preparation and labeling of Lidar is a challenge
Trained DNN
DNN design + training

Accessing lidar data
Lidar pre-processing
Labeling lidar data
Training
43
Access and Visualize LiDAR Data
Access stored LiDAR data:
• Velodyne file I/O (pcap)
• Individual point clouds (.pcd, .ply)

Visualize LiDAR data:
• Streaming LiDAR player
• Static point cloud display
• Point cloud differences
44
LiDAR Preprocessing
ROI + remove ground
• Fit plane using RANSAC

Cluster
• Segment clusters using Euclidean distance
48
Organize Data for Training
Raw Point Cloud Data
Ground Truth Labels Transformed to Label Mask
Project to 2D
51
Deep learning workflow in MATLAB
Train in MATLAB
Model importer
Trained DNN
Application logic
C++/CUDA + TensorRT
C++/CUDA + cuDNN
Standalone deployment
DNN design + training
Application design
52
Check Out Deep Learning in MATLAB and GPU Coder
GPU Coder: https://www.mathworks.com/products/gpu-coder.html
Deep learning in MATLAB: https://www.mathworks.com/solutions/deep-learning.html
Deep Learning Onramp (free self-paced online training): https://matlabacademy.mathworks.com/R2017b/portal.html?course=deeplearning