![Page 1: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/1.jpg)
1
Anshuman Bhat – CUDA Product ManagerSaikat Dasadhikari – CUDA Engineering
29th March 2018
S8868 - CUDA on Xavier
![Page 2: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/2.jpg)
22
CUDA ECOSYSTEM 2018
CUDA
DOWNLOADS
IN 2017
3,500,000
CUDA
REGISTERED
DEVELOPERS
800,000
GTC
ATTENDEES
8,000+
![Page 3: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/3.jpg)
A BIG “THANK YOU”For being part of DRIVE Ecosystem
Vehicle & System Engineer
Data Engineer
Deep Learning Engineer
AV Engineer
![Page 4: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/4.jpg)
4
CUDA Toolkit on Tegra
CUDA 9.1 & XAVIER SUPPORT ECO-SYSTEM ENABLEMENT
DEVELOPER TOOLSDRIVE Developer Workflow & Open
Platform
I/O Coherence
Registering Host Allocations
dGPU over NVLink
Level1 – CUDA, cuDNN and TRT
Level2 – DRIVE-OS
Level3 – DRIVE-AV
Application Notes - Tegra
Enabling Tool partners
DRIVE DEVZONE
Ease of getting started
Debug functional issues
Resource usage analysis
Performance optimization
![Page 5: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/5.jpg)
5
DRIVE Developer Workflow&
DRIVE Open Platform
![Page 6: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/6.jpg)
6
DRIVE DEVELOPER WORKFLOWIterative Workflow
DeveloperLab PC
with dGPUDRIVE™ Xavier
Vehicle Integration
Iterative Testing
Fast iteration loop with PC, same CUDA code used across PC, DRIVE Dev Platform, and vehicle
![Page 7: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/7.jpg)
7
DRIVE PLATFORMOpen Platform – Entry at Various Levels
DRIVE Development Platform
Sensors
Sensors & Maps
NVMEDIA DRIVE OS, CUDA
CUDA accelerated libraries(notably TensorRT)
DriveWorks Algorithm Modules
Autonomous Driving Applications
DriveWorks
Tools DNNs
Included Not included
DRIVE OS (Linux or QNX)NVMEDIA
SAL CUDA libraries (CuDNN…)
DriveWorks Algorithm Modules
TensorRT NvMedia (VPI)
Open GL ES
![Page 8: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/8.jpg)
8
Xavier
![Page 9: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/9.jpg)
9
DRIVE XAVIER SAMPLES IN Q1World’s First Autonomous Machine Processor
• Highest performance SoC on the planet
• 9B transistors
• Highest performance graphics
• NVIDIA’s new Volta GPU architecture
• Most advanced AI – 30 trillion
operations per second
• Dedicated H/W – PVA & DLA
• Designed for ASIL-D AV solution
![Page 10: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/10.jpg)
10
Xavier – AI Car SuperComputer
• I/O Coherence
• Registering Host Allocations - Enables
use of cudaHostRegister on Xavier
• Pegasus supports NVLink – Robotaxi L5
solution
• ISO26262
New Features
System Memory
CacheCoherency
GPU CPU
![Page 11: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/11.jpg)
11
CUDA DEVELOPER TOOLS&
GPU ACCELERATED LIBRARIES
![Page 12: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/12.jpg)
12
Why NVIDIA
Developer
Tools?
• Ease of getting started
• Debug functional issues
• Resource usage analysis
• Performance optimization
![Page 13: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/13.jpg)
13
AUTONOMOUS DRIVING SOFTWARE TOOLS
Tegra™ Hardware (ARM, GPU & SoC Peripherals)
AppApp
CUDA, DriveWorks, cuDNN, TensorRT,…
AppAppApp
JTAG
kGDB Debug Agent
Native OSDebug Tools (GDB)
& NV Debug Tools NV System
Trace andProfilingTools
<Tng
Top to Bottom Debug and Profiling Capabilities
App
GPU Graphics GPU Compute CPU Multi-core
DRIVE OS
![Page 14: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/14.jpg)
14
AUTOMOTIVE DEVELOPER TOOLS
DriveInstall
Ease of installation
Ease of Setup
Ease of Getting Started
Tegra Graphics Debugger
Performance monitoring
Debugging
Profiling
Graphics
System Profiler
CPU sampling profiler
System & Application Trace
CUDA and OpenGL Trace
Platform Profiling
NVIDIA® Nsight™ Eclipse Edition
Build management
Kernel debugging and profiling
Compute / CUDA
CLI Tools
Profiler, debugger, memory checker, etc.
NVIDIA® Visual Profiler
Kernel profiler
Complete Portfolio from Design-To-Production
![Page 15: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/15.jpg)
15
DEVELOPER TOOL SUPPORTDebug and Profiling Capabilities
CUDA Dev Tools Lab PC-dGPU-Linux DRIVE-ARM-Linux DRIVE-ARM-QNX
Nvidia Nsight Eclipse Edition Yes Yes Yes
CUDA-GDB Yes Yes Yes
CUDA-MemCheck Yes Yes Yes
Visual Profiler Yes Yes Yes
CUPTI Yes Yes Yes
TGD (Tegra Graphics Debugger) NA Yes Yes
TSP (Tegra System Profiler) NA Yes Yes
CUDA-NVPROF Yes Yes Yes
![Page 16: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/16.jpg)
16
DEEP LEARNING
GPU ACCELERATED LIBRARIES“Drop-in” Acceleration for Your DRIVE Application
LINEAR ALGEBRA PARALLEL ALGORITHMS
SIGNAL, IMAGE & VIDEO
TensorRT
cuBLAS
cuSPARSE cuRAND
NVIDIA NPPcuFFT
CUDA
Math library
cuSOLVER
cuDNN
![Page 17: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/17.jpg)
17
ECO-SYSTEM ENABLEMENT
![Page 18: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/18.jpg)
18
CUDA TEGRA APPLICATION NOTE
An overview of NVIDIA Tegra memory architecture and considerations for porting code from a discrete GPU (dGPU) to the Tegra® integrated GPU (iGPU).
For developers who are already familiar with programming in CUDA, and C/C++, and want to develop applications for the Tegra® SoC.
Complements CUDA C Programming Guide
Download - http://docs.nvidia.com/cuda
Helps Port dGPU Code to iGPU
![Page 19: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/19.jpg)
DRIVE DEVZONE
SW installer & dev tools
Latest SW
Documentation
Links to Support
Updates on ecosystem
Enabling DRIVE Developers - developer.nvidia.com/DRIVE
![Page 20: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/20.jpg)
20
PARTNERING WITH TOOL COMPANIESCoverity Static Analysis - Find critical defects, security weaknesses and
safety issues in your code – as it is written
Synopsys and NVIDIA are
working together to
provide a static analysis
solution for
CUDA-based code
ISO26262
recommendations for
Static Analysis and Code
Coverage
![Page 21: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/21.jpg)
21
PARTNERING WITH TOOL COMPANIESVector Software - VectorCAST/QA Support for CUDA
TÜV SÜD CertifiedSoftware Tool for
Safety Related
Development
NVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++ and is
being extended to CUDA C++ applications
Source-code tool that captures code coverage information during test execution
www.vectorcast.com
![Page 22: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/22.jpg)
22
CUDA@XAVIER-DEEP-DIVE
![Page 23: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/23.jpg)
23
I/O COHERENCE
![Page 24: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/24.jpg)
24
I/O COHERENCEMaintaining coherent view of the data across CPU & I/O Devices
I/O Coherence enables automatic cache coherency between CPU and GPU, by hardware.
I/O Coherence enables a GPU request to snoop the CPU cache hierarchy.
I/O Coherence enables higher throughput, lower usage of Main Memory Bandwidth.
I/O Coherence enables support for System-wide atomic read-modify-write (RMW) operations from the GPU.
Available with NVLink, Integrated GPU and over PCI-E.
CUDA 9.x
System Memory
CacheCoherency
GPUCPU
![Page 25: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/25.jpg)
25
APIs
![Page 26: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/26.jpg)
26
cudaHostAlloc/cudaMallocHost
• Inability to use CPU & GPU caches for zero-copy allocation.
• Latency & Bandwidth overhead on CPU & GPU accesses.
Pre-Xavier
READ/WRITEPAGE LOCKED
MEMORY
---------------
SYSTEM
MEMORY
READ/WRITE
CPU GPU
![Page 27: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/27.jpg)
27
cudaHostAlloc/cudaMallocHost
• CPU Cache enabled for the zero-copy allocation.
• GPU and CPU access to System Memory goes via CPU Cache.
• Improved Latency & Bandwidth.
Xavier
READ/WRITE PAGE LOCKED
MEMORY
---------------
SYSTEM
MEMORY
READ/WRITE
L1/L2/L3 Cache
I/O Coherence
CPUGPU
![Page 28: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/28.jpg)
28
cudaHostRegister
• Allows mapping system memory to CUDA.
• Eliminates Memcpy from System Memory to CUDA Memory.
• This feature is already existing on x86 + Discrete GPU Platform.
Available now on Tegra Platforms
READ/WRITE PAGE LOCKED
MEMORY
---------------
SYSTEM
MEMORY
READ/WRITE
L1/L2/L3 Cache
I/O Coherence
CPU GPU
![Page 29: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/29.jpg)
29
cudaHostRegister USE CASE
///////////////////////////////////////////////////////
FillHostBuffer(data_h);
cudaMemcpy(data_d, data_h, BUFF_SIZE, cudaMemcpyHostToDevice);
MyKernel<<<numBlocks, threadsPerBlock>>>(data_d);
cudaMemcpy(data_h, data_d, BUFF_SIZE, cudaMemcpyDeviceToHost);
UseHostBuffer(data_h);
/////////////////////////////////////////////////////////
CPU<->GPU data sharing with Memcpy (Legacy method)
Fill Host Buffer (Input)
Memcpy – Host to Device
Kernel Launch & Execution
Memcpy – Device to Host
Use the Output Data
![Page 30: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/30.jpg)
30
cudaHostRegister USE CASE
///////////////////////////////////////////////////
FillHostBuffer(data_h);
cudaHostRegister(data_h, BUFF_SIZE, cudaHostRegisterMapped);
cudaHostGetDevicePointer(&data_d, data_h, 0);
MyKernel<<<numBlocks, threadsPerBlock>>>(data_d);
cudaDeviceSynchronize();
UseHostBuffer(data_h);
////////////////////////////////////////////////////
CPU<->GPU data sharing with CUDA Mapped Host Memory
Fill Host Buffer (Input)
Map Host Buffer to CUDA
Kernel Launch & Execution
Use the Output Data
![Page 31: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/31.jpg)
31
New Instructions
![Page 32: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/32.jpg)
32
WARP LEVEL MATRIX MULTIPLY AND ACCUMULATE (WMMA)
• Highly efficient, small matrix multiply operations for Deep-learning.
• WMMA instructions make use of the Integrated Tensor Cores of the Xavier SoC.
• The matrix sizes supported are (MxNxK) : 16x16x16, 32x8x16, 8x32x16
Overview
![Page 33: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/33.jpg)
33
NVLink
![Page 34: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/34.jpg)
34
NVLink on Xavier
NVIDIA DRIVE PEGASUS
ROBOTAXI AI COMPUTER
Overview
• Xavier supports NVLink 2.0.
• Data transfer rate up to 25 Gbit/s
per data lane per direction.
• Xavier supports connection to
external Discrete GPUs over
NVLink.
• I/O Coherence and System-wide
Atomic Operations on Xavier
works over NVLink.
![Page 35: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/35.jpg)
35
Important Links
CUDA C Programming Guide:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
CUDA C Best Practices Guide:
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
NVIDIA DRIVE ZONE
http://developer.nvidia.com/drive
http://docs.nvidia.com/drive/
DRIVE and CUDA
![Page 37: S8868 - CUDA on Xavieron-demand.gputechconf.com/gtc/2018/presentation/s8868-cuda-on-xavier-what-is-new.pdfNVIDIA uses VectorCAST/QA to measure code coverage during testing of C, C++](https://reader033.vdocuments.mx/reader033/viewer/2022041914/5e68af6fd372f7064435ec13/html5/thumbnails/37.jpg)
37
THANK YOU