![Page 1: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/1.jpg)
NVIDIA PROFILING TOOLS OVERVIEW
François Courteille, Principal Solution Architect
![Page 2: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/2.jpg)
AGENDA
Introduction to NVIDIA profilers
Nvprof and the Visual Profiler
Nsight Systems
Profiling from CLI
NVTX (NVIDIA Tools Extension)
Getting Started Resources
![Page 3: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/3.jpg)
NVIDIA PROFILING TOOLS
Nsight Systems
Nsight Compute
Nsight Visual Studio Edition
Nsight Eclipse Edition
NVIDIA Profiler (nvprof)
NVIDIA Visual Profiler (nvvp)
CUPTI (CUDA Profiling Tools Interface)
![Page 4: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/4.jpg)
TOOLS COMPARISON
| | NVIDIA® Nsight™ Systems | NVIDIA® Nsight™ Compute | NVIDIA® Visual Profiler | Intel® VTune™ Amplifier | Linux perf, OProfile |
|---|---|---|---|---|---|
| Target OS | Linux, Windows | Linux, Windows | Linux, Mac, Windows | Linux, Windows | Linux |
| GPUs | Pascal+ | Pascal+ | Kepler+ | None | None |
| CPUs | x86_64 | x86_64 | x86, x86_64, Power | x86, x86_64 | x86, x86_64, Power |
| Trace | NVTX, OS runtime, CUDA, cuDNN, cuBLAS, OpenACC, OpenGL, DX12 | NVTX, CUDA | MPI, CUDA, OpenACC, NVTX | MPI, ITT | Kernel |
| CPU PC Sampling | High Speed | No | Yes | High Speed | High Speed |
| NVLINK, GPU Power, Thermal | Future | | Yes | No | No |
| Src Code View | No | Yes | Yes | Yes | No |
| Compare Sessions | No | Yes | No | Yes | No |
![Page 5: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/5.jpg)
NVPROF AND THE VISUAL PROFILER
![Page 6: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/6.jpg)
NVIDIA PROFILER
Some examples of how to use it
• GUI: nvvp
• Command line:
nvprof <application> <application options>
• Command line with GUI output:
nvprof -o prof.nvvp <application> <application options>
• Command line, MPI applications, GUI output (%h and %p will be replaced by host and pid):
mpirun -np 4 nvprof -o prof_%h_%p.nvvp <application> <application options>
• Command line, capturing all low-level metrics for later GUI analysis (slow!):
nvprof --analysis-metrics -o prof.nvvp <application> <application options>
• Command line, all low-level metrics for later GUI analysis, only for the 2nd occurrence of each kernel:
nvprof --kernels :::2 --analysis-metrics -o prof.nvvp <application> <application options>
• Command line with GUI output, limiting the profiling to the first 60 seconds (60 s after GPU context creation):
nvprof -t 60 -o prof.nvvp <application> <application options>
See the profiler documentation for more information
![Page 7: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/7.jpg)
NVPROF – Basic Usage
$ nvprof ./cg
==88109== NVPROF is profiling process 88109, command: ./cg
<program output>
==88109== Profiling application: ./cg
==88109== Profiling result:
Time(%) Time Calls Avg Min Max Name
95.62% 4.74211s 101 46.952ms 43.029ms 435.09ms matvec(matrix const &, vector const &, vector const &)_12_gpu
3.28% 162.50ms 302 538.09us 236.55us 31.155ms waxpby(double, vector const &, double, vector const &, vector const &)_26_gpu
0.73% 36.165ms 200 180.83us 138.18us 223.26us dot(vector const &, vector const&)_10_gpu
0.37% 18.310ms 200 91.549us 89.664us 93.441us dot(vector const &, vector const&)_10_gpu_red
0.00% 100.13us 200 500ns 480ns 1.0240us [CUDA memcpy HtoD]
0.00% 81.408us 200 407ns 384ns 416ns [CUDA memcpy DtoH]
==88109== Unified Memory profiling result:
Device "Tesla P100-SXM2-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
13783 197.40KB 64.000KB 960.00KB 2.594727GB 99.26990ms Host To Device
7681 - - - - 298.1818ms GPU Page fault groups
Total CPU Page faults: 7973
![Page 8: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/8.jpg)
CUDA VISUAL PROFILER
Overview of key features
Kernel profile – memory hierarchy view
Unified Memory
NVLink
PC sampling
OpenACC/OpenMP Profiling
NVTX
![Page 9: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/9.jpg)
NVIDIA’S VISUAL PROFILER (NVVP)
Timeline
Guided
System
Analysis
![Page 10: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/10.jpg)
DATA MOVEMENT IN VISUAL PROFILER
![Page 11: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/11.jpg)
UVM IN VISUAL PROFILER
![Page 12: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/12.jpg)
KERNEL PROFILE
Memory hierarchy view
![Page 13: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/13.jpg)
VISUAL PROFILER
CPU Page Fault Source Correlation
Selected interval
Source location
![Page 14: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/14.jpg)
VISUAL PROFILER
CPU Page Fault Source Correlation
Source line causing CPU page fault
![Page 15: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/15.jpg)
VISUAL PROFILER - NEW UNIFIED MEMORY EVENTS
Page throttling, Memory thrashing, Remote map
![Page 16: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/16.jpg)
VISUAL PROFILER
NVLINK visualization
Color codes for NVLink
Topology
Static properties
Runtime values
Option to collect NVLink information
Unguided Analysis
Selected NVLink
Version
![Page 17: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/17.jpg)
VISUAL PROFILER
NVLink events on timeline
MemCpy API
NVLink Events on Timeline
Color Coding of NVLink Events
![Page 18: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/18.jpg)
VISUAL PROFILER
Multi-hop remote profiling - Application Profiling
1. Select custom script, then create a remote session as usual
2. Application transparently runs on compute node and profiling data is displayed in the Visual Profiler
![Page 19: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/19.jpg)
CPU SAMPLING
• CPU profile is gathered by periodically sampling the state of each thread in the running application.
• The CPU details view summarizes the samples collected into a call-tree, listing the number of samples (or amount of time) that was recorded in each function.
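A minimal sketch of how periodic stack samples can be folded into a call tree with per-function sample counts. This is an illustration of the idea only, not how the Visual Profiler is implemented; the sample stacks are made up (`matvec` and `dot` are borrowed from the nvprof output shown earlier, `load_input` is hypothetical).

```python
def build_call_tree(samples):
    """Fold stack samples (outermost frame first) into a nested call tree."""
    tree = {}
    for stack in samples:
        level = tree
        for frame in stack:
            node = level.setdefault(frame, {"samples": 0, "children": {}})
            node["samples"] += 1          # this frame was on the stack for one sample
            level = node["children"]
    return tree


def format_tree(tree, total, indent=0):
    """Render the tree, most-sampled functions first, with share of all samples."""
    lines = []
    for name, node in sorted(tree.items(), key=lambda kv: -kv[1]["samples"]):
        pct = 100.0 * node["samples"] / total
        lines.append(f"{'  ' * indent}{name}: {node['samples']} samples ({pct:.0f}%)")
        lines.extend(format_tree(node["children"], total, indent + 1))
    return lines


# Hypothetical samples: each tuple is one thread's call stack at a sampling tick.
samples = [
    ("main", "solve", "matvec"),
    ("main", "solve", "matvec"),
    ("main", "solve", "dot"),
    ("main", "load_input"),
]
tree = build_call_tree(samples)
report = format_tree(tree, total=len(samples))
print("\n".join(report))
```

Multiplying each sample count by the sampling period gives the approximate time per function, which is the other unit the CPU details view can show.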
![Page 20: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/20.jpg)
VISUAL PROFILER
CPU Sampling
Percentage of time spent collectively by all threads
Range of time spent across all threads
Selected thread is highlighted in orange
Bar chart of the amount of time spent by thread
![Page 21: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/21.jpg)
PC SAMPLING
PC sampling is available on devices with compute capability (CC) >= 5.2
Provides parity with CPU PC sampling, plus additional information on warp states and stall reasons for GPU kernels
Effective for optimizing large kernels; pinpoints performance bottlenecks at specific source lines or assembly instructions
Samples warp states periodically, in round-robin order over all active warps
No overhead in kernel runtime; some CPU overhead to parse the records
![Page 22: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/22.jpg)
VISUAL PROFILER - PC SAMPLING
Option to select sampling period
![Page 23: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/23.jpg)
VISUAL PROFILER
PC SAMPLING UI
Pie chart for sample distribution for a CUDA function
Source-Assembly view
![Page 24: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/24.jpg)
NVPROF – MPI Profiling
NVPROF & Visual Profiler do not natively understand MPI
It is possible, though, to load data from multiple MPI ranks (same or different GPUs) into the Visual Profiler
Trick: Label your data to know which MPI rank it came from:
• --process-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}" --context-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}"
Trick: Give every rank its own filename
• -o timeline.%q{OMPI_COMM_WORLD_RANK}.nvprof
See https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler/ for one more trick for showing MPI calls on your timeline.
![Page 25: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/25.jpg)
NVPROF – MPI Profiling
$ mpirun -n 4 --gpu nvprof --process-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}" --context-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}" -o timeline.%q{OMPI_COMM_WORLD_RANK}.nvprof ./laplace2d.solution
==89073== NVPROF is profiling process 89073, command: ./laplace2d.solution
==89070== NVPROF is profiling process 89070, command: ./laplace2d.solution
==89075== NVPROF is profiling process 89075, command: ./laplace2d.solution
==89066== NVPROF is profiling process 89066, command: ./laplace2d.solution
<program output>
==89075== Generated result file: timeline.3.nvprof
==89073== Generated result file: timeline.2.nvprof
==89070== Generated result file: timeline.1.nvprof
==89066== Generated result file: timeline.0.nvprof
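To make the placeholder substitution concrete, here is a small sketch that mimics how `%q{VAR}`, `%p`, and `%h` turn one template into a distinct file name per rank. nvprof performs this substitution itself; the helper `expand_nvprof_name` is hypothetical, and replacing an unset variable with an empty string is an assumption of this sketch.

```python
import os
import re
import socket


def expand_nvprof_name(template, env=None, pid=None, host=None):
    """Mimic nvprof-style placeholders: %q{VAR} (environment), %p (pid), %h (host)."""
    env = os.environ if env is None else env
    pid = os.getpid() if pid is None else pid
    host = socket.gethostname() if host is None else host
    # %q{VAR} -> value of VAR; empty string if VAR is unset (assumption of this sketch)
    out = re.sub(r"%q\{(\w+)\}", lambda m: env.get(m.group(1), ""), template)
    return out.replace("%p", str(pid)).replace("%h", host)


# One output file per MPI rank, as in the run above.
names = [
    expand_nvprof_name("timeline.%q{OMPI_COMM_WORLD_RANK}.nvprof",
                       env={"OMPI_COMM_WORLD_RANK": str(rank)})
    for rank in range(4)
]
print(names)

# Host/pid based naming, as in the earlier prof_%h_%p.nvvp example.
by_host = expand_nvprof_name("prof_%h_%p.nvvp", env={}, pid=88109, host="node01")
print(by_host)
```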
![Page 26: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/26.jpg)
MULTI-PROCESS PROFILING
When running nvprof with multiple processes, it’s useful to label each process:
$ nvprof -o timeline_rank%q{OMPI_COMM_WORLD_RANK} \
--context-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}" \
--process-name "MPI Rank %q{OMPI_COMM_WORLD_RANK}" \
--annotate-mpi openmpi …
![Page 27: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/27.jpg)
PROFILING MPI+CUDA APPLICATIONS
Using nvprof+NVVP
Embed MPI rank in output filename, process name, and context name (OpenMPI):
nvprof --output-profile profile.%q{OMPI_COMM_WORLD_RANK} \
--process-name "rank %q{OMPI_COMM_WORLD_RANK}" \
--context-name "rank %q{OMPI_COMM_WORLD_RANK}" \
--annotate-mpi openmpi
New since CUDA 9: --annotate-mpi
MVAPICH2: MV2_COMM_WORLD_RANK, --annotate-mpi mpich
Alternatives:
Only save the textual output (--log-file)
Collect data from all processes that run on a node (--profile-all-processes)
![Page 28: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/28.jpg)
MPI PROFILING
Importing into the Visual Profiler
![Page 29: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/29.jpg)
MPI PROFILING
Visual Profiler
MPI Rank-based naming
NVTX Markers & Ranges
See: https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler
![Page 30: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/30.jpg)
PROFILER API
Real applications frequently produce too much data to manage.
Profiling can be programmatically toggled:
#include <cuda_profiler_api.h>
cudaProfilerStart();
…
cudaProfilerStop();
This can be paired with nvprof:
$ nvprof --profile-from-start off …
![Page 31: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/31.jpg)
SELECTIVE PROFILING
When the profiler API still isn’t enough, selectively profile kernels, particularly with performance counters.
$ nvprof --kernels :::1 --analysis-metrics …
The --kernels filter has the form context:stream:kernel:invocation
Record metrics for only the first invocation of each kernel.
![Page 32: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/32.jpg)
NVTX ANNOTATIONS
The NVIDIA Tools Extension (NVTX) allows you to annotate the profile:
#include <nvToolsExt.h> // Link with -lnvToolsExt
nvtxRangePushA("timestep");
timestep();
nvtxRangePop();
See https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvtx for more features, including V3 usage.
![Page 33: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/33.jpg)
NVTX IN VISUAL PROFILER
Named Range
![Page 34: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/34.jpg)
EXPORTING DATA
It’s often useful to post-process nvprof data using your favorite tool (Python, Excel, …):
$ nvprof --csv --log-file output.csv \
-i profile.nvprof
It’s often necessary to massage this file before loading into your favorite tool.
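One possible form of that massaging, sketched in Python. It assumes the log contains banner lines prefixed with `==` ahead of a regular CSV header and rows; the exact layout varies between nvprof versions, so check your own file first. The sample text is hand-written for illustration (values borrowed from the Basic Usage slide), not captured nvprof output, and `read_nvprof_csv` is a hypothetical helper.

```python
import csv
import io


def read_nvprof_csv(text):
    """Drop '==PID==' banner lines and blank lines, then parse the rest as CSV."""
    rows = [line for line in text.splitlines()
            if line.strip() and not line.startswith("==")]
    return list(csv.DictReader(io.StringIO("\n".join(rows))))


# Hand-written sample in the assumed layout (NOT real nvprof output).
sample = """==88109== NVPROF is profiling process 88109, command: ./cg
==88109== Profiling result:
"Time(%)","Time","Calls","Avg","Min","Max","Name"
95.62,4742.11,101,46.952,43.029,435.09,"matvec"
3.28,162.50,302,0.53809,0.23655,31.155,"waxpby"
"""

records = read_nvprof_csv(sample)
for rec in records:
    print(rec["Name"], float(rec["Time(%)"]), int(rec["Calls"]))
```

With a real file, pass `open("output.csv").read()` instead of the sample string.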
![Page 35: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/35.jpg)
OPENACC PROFILING
OpenACC->Driver API->Compute correlation
OpenACC->Source Code correlation
OpenACC timeline
OpenACC Properties
![Page 36: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/36.jpg)
OPENMP PROFILING
Information about OpenMP regions is collected using the OpenMP tools interface (OMPT), starting with CUDA 10.0
Supported on x86_64 and Power Linux with PGI runtime 18.1+
Support added in CUPTI, nvprof, and Visual Profiler
![Page 37: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/37.jpg)
OPENMP PROFILING IN NVPROF
nvprof option --openmp-profiling enables/disables OpenMP profiling (default: on)
$ nvprof --openmp-profiling on ./omp-app
Type Time(%) Time Calls Avg Min Max Name
OpenMP (incl): 99.97% 277.10ms 20 13.855ms 13.131ms 18.151ms omp_parallel
0.03% 72.728us 19 3.8270us 2.9840us 9.5610us omp_idle
0.00% 7.9170us 7 1.1310us 1.0360us 1.5330us omp_wait_barrier
Option --print-openmp-summary to print a summary of all recorded OpenMP activities
![Page 38: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/38.jpg)
OPENMP PROFILING IN VISUAL PROFILER
![Page 39: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/39.jpg)
OPENMP PROFILING IN VISUAL PROFILER
Table View
![Page 40: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/40.jpg)
PROFILING NVLINK USAGE
Using nvprof+NVVP
9/22/2020
Run nvprof multiple times to collect metrics
nvprof --output-profile profile.<metric>.%q{OMPI_COMM_WORLD_RANK} \
--aggregate-mode off --event-collection-mode continuous \
--metrics <metric> -f
Use `--query-metrics` and `--query-events` for full list of metrics (-m) or events (-e)
Combine with an MPI annotated timeline file for full picture
![Page 41: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/41.jpg)
PROFILING NVLINK USAGE
Using nvprof+NVVP
![Page 42: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/42.jpg)
PROFILING NVLINK USAGE
Using nvprof+NVVP
![Page 43: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/43.jpg)
PROFILING NVLINK USAGE
Using nvprof+NVVP
![Page 44: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/44.jpg)
SUMMIT NVLINK TOPOLOGY
![Page 45: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/45.jpg)
CPU PAGE FAULT SOURCE CORRELATION
Unguided Analysis
Option to collect Unified Memory information
Summary of all CPU page faults
![Page 46: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/46.jpg)
INTRODUCTION TO NSIGHT SYSTEMS
![Page 47: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/47.jpg)
NSIGHT PRODUCT FAMILY
Standalone Performance Tools
Nsight Systems: system-wide application algorithm tuning
Nsight Compute: debug/optimize specific CUDA kernel
Nsight Graphics: debug/optimize specific graphics
IDE plugins
Nsight Eclipse Edition/Visual Studio Edition: editor, debugger, some perf analysis
![Page 48: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/48.jpg)
NSIGHT SYSTEMS
Overview
Profile system-wide application
Multi-process tree, GPU workload trace, etc.
Investigate your workload across multiple CPUs and GPUs
CPU algorithms, utilization, and thread states
GPU streams, kernels, memory transfers, etc.
NVTX, CUDA & Library API, etc.
Ready for Big Data
Docker, user privilege (Linux), CLI, etc.
![Page 49: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/49.jpg)
Thread/core migration
Processes and threads
Thread state
CUDA and OpenGL API trace
cuDNN and cuBLAS trace
Kernel and memory transfer activities
Multi-GPU
![Page 50: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/50.jpg)
NVTX Tracing
![Page 51: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/51.jpg)
TRANSITIONING TO PROFILE A KERNEL
Dive into kernel analysis
![Page 52: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/52.jpg)
NVIDIA NSIGHT COMPUTE
Next Generation Kernel Profiler
Interactive CUDA API debugging and kernel profiling
Fast Data Collection
Graphical and multiple kernel comparison reports
Improved Workflow and Fully Customizable (Baselining, Programmable UI/Rules)
Source Correlation
Command Line, Standalone, IDE Integration
Platform Support
OS: Linux (x86, ARM), Windows, OSX (host only)
GPUs: Pascal, Volta, Turing
Kernel Profile Comparisons with Baseline
Metric Data
![Page 53: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/53.jpg)
PROFILING GPU APPLICATION
Focusing GPU Computing
How to measure: NVIDIA (Visual) Profiler / Nsight Compute
NVIDIA supports them with cuDNN, cuBLAS, and so on
[Diagram: Application Tracing, CPU/GPU Tracing, and GPU Profiling stages, with the symptoms and causes below]
Low GPU Utilization
Low SM Efficiency
Low Achieved Occupancy: too few threads, register limit, large shared memory, …
Memory Bottleneck: cache misses, bandwidth limit, access pattern, …
Instructions Bottleneck: arithmetic, control flow, …
![Page 54: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/54.jpg)
PROFILING GPU APPLICATION
Focusing System Operation
How to measure: Nsight Systems / Application Tracing
[Diagram: Application Tracing, CPU/GPU Tracing, and GPU Profiling stages, with the causes and symptoms below]
I/O
CPU Computation
Job Startup / Checkpoints
Kernel Launch Latency
Memcopy Latency
CPU-Only Activities
Low GPU Utilization
Low SM Efficiency
Low Achieved Occupancy
Memory Bottleneck
Instructions Bottleneck
![Page 55: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/55.jpg)
HOW TO USE
![Page 56: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/56.jpg)
NSIGHT SYSTEMS PROFILE
Profile with CLI
$ nsys profile -t cuda,osrt,nvtx,cudnn,cublas \
-o baseline.qdstrm -w true python main.py
-t: APIs to be traced
cuda – GPU kernel
osrt – OS runtime
nvtx – NVIDIA Tools Extension
cudnn – CUDA Deep NN library
cublas – CUDA BLAS library
-o: Name of output file
-w true: Show output on console
python main.py: Application command
Automatic conversion of .qdstrm temp results file to .qdrep format if converter utility is available.
https://docs.nvidia.com/nsight-systems/#nsight_systems/2019.3.6-x86/06-cli-profiling.htm
![Page 57: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/57.jpg)
NSIGHT SYSTEMS PROFILE
No NVTX
Difficult to understand → not useful
![Page 58: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/58.jpg)
NVTX (NVIDIA TOOLS EXTENSION)
![Page 59: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/59.jpg)
NVTX ANNOTATIONS
https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
![Page 60: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/60.jpg)
NVTX ANNOTATIONS
NVTX in PyTorch
Batch %d
copy to device
Forward pass
https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
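The slide's code was a screenshot, so here is a hedged sketch of what such annotations might look like using `torch.cuda.nvtx.range_push` and `range_pop`. The range names come from the slide; the `nvtx_range` wrapper, the `train` function, and the stand-in "batches" are hypothetical. The wrapper falls back to a no-op when PyTorch or CUDA is unavailable, so the sketch runs anywhere.

```python
from contextlib import contextmanager

try:
    import torch
    _NVTX_OK = torch.cuda.is_available()
except ImportError:  # PyTorch not installed: annotations become no-ops
    torch = None
    _NVTX_OK = False


@contextmanager
def nvtx_range(name):
    """Open an NVTX range for the duration of a with-block (no-op without CUDA)."""
    if _NVTX_OK:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _NVTX_OK:
            torch.cuda.nvtx.range_pop()


def train(batches, forward):
    """Hypothetical loop: one outer range per batch, nested ranges per phase."""
    outputs = []
    for i, batch in enumerate(batches):
        with nvtx_range(f"Batch {i}"):
            with nvtx_range("copy to device"):
                data = list(batch)  # stand-in for batch.to(device)
            with nvtx_range("Forward pass"):
                outputs.append(forward(data))
    return outputs


result = train([[1, 2], [3, 4]], forward=sum)
print(result)
```

Run under `nsys profile -t cuda,nvtx …` on a CUDA machine, the named ranges should then appear as rows on the timeline.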
![Page 61: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/61.jpg)
NVTX ANNOTATIONS
NVTX using cupy package
Batch %d
copy to device
Forward pass
https://docs-cupy.chainer.org/en/stable/
![Page 62: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/62.jpg)
NSIGHT SYSTEMS PROFILE
NVTX range marker tip
NVTX for data loading (data augmentation)
![Page 63: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/63.jpg)
BASELINE PROFILE
MNIST Training: 89 sec, <5% utilization
CPU waits on a semaphore and starves the GPU!
GPU Starvation
![Page 64: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/64.jpg)
BASELINE PROFILE (WITH NVTX)
GPU is idle during data loading: 5.1 ms
Data is loaded using a single thread. This starves the GPU!
fwd/bwd
![Page 65: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/65.jpg)
OPTIMIZE SOURCE CODE
Data loader was configured to use 1 worker thread
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
Let’s switch to using 8 worker threads:
kwargs = {'num_workers': 8, 'pin_memory': True} if use_cuda else {}
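A small sketch of how these keyword arguments are typically built and handed to the data loader. The helper name `loader_kwargs` and the `batch_size` value in the comment are assumptions for illustration; the block itself does not import PyTorch so it runs anywhere.

```python
def loader_kwargs(use_cuda, num_workers=8):
    """Keyword arguments for torch.utils.data.DataLoader (empty when running on CPU)."""
    return {"num_workers": num_workers, "pin_memory": True} if use_cuda else {}


baseline = loader_kwargs(use_cuda=True, num_workers=1)   # the starved configuration
optimized = loader_kwargs(use_cuda=True)                 # 8 worker processes
print(baseline, optimized)

# Typical use (requires PyTorch and a dataset object; batch_size is illustrative):
#   loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, **optimized)
```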
![Page 66: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/66.jpg)
AFTER OPTIMIZATION
Time for data loading reduced for each batch
Reduced from 5.1 ms to 60 µs for batch 50
![Page 67: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/67.jpg)
![Page 68: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/68.jpg)
CASE STUDY: OPENACC SAMPLE
![Page 69: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/69.jpg)
OPENACC SAMPLE
• Sample from https://devblogs.nvidia.com/getting-started-openacc
• Solves 2-D Laplace equation with iterative Jacobi solver
• Each iteration:
1. A stencil calculation
2. Update the matrix
3. Check if error tolerance is met. If not, go to step 1.
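The three steps can be sketched in plain Python before looking at the C version. This is an unoptimized illustration; the grid size, boundary values, and default tolerance are arbitrary choices, not taken from the sample.

```python
def jacobi(A, tol=1e-4, iter_max=1000):
    """Iterate the 2-D Laplace stencil on A (list of lists) until the update is below tol."""
    n, m = len(A), len(A[0])
    Anew = [row[:] for row in A]
    error, iters = tol + 1.0, 0
    while error > tol and iters < iter_max:      # step 3: convergence check
        error = 0.0
        for j in range(1, n - 1):                # step 1: stencil calculation
            for i in range(1, m - 1):
                Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                     A[j - 1][i] + A[j + 1][i])
                error = max(error, abs(Anew[j][i] - A[j][i]))
        for j in range(1, n - 1):                # step 2: update the matrix
            for i in range(1, m - 1):
                A[j][i] = Anew[j][i]
        iters += 1
    return A, iters, error


# Arbitrary 8x8 test grid: left boundary held at 1.0, everything else 0.0.
grid = [[0.0] * 8 for _ in range(8)]
for j in range(8):
    grid[j][0] = 1.0
solution, iterations, final_error = jacobi(grid)
print(iterations, final_error)
```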
![Page 70: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/70.jpg)
SAMPLE (CPU VERSION)
while ( error > tol && iter < iter_max ) {  // Convergence loop
    error = 0.0;
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            // Stencil calculation
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] +
                                  A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
    }
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];  // Update matrix
        }
    }
    iter++;
}
![Page 71: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/71.jpg)
OPENACC SAMPLE
while ( error > tol && iter < iter_max ) {  // Convergence loop
    error = 0.0;
    #pragma acc kernels
    {
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = …  // Stencil calculation
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];  // Update matrix
            }
        }
    }
    iter++;
}
![Page 72: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/72.jpg)
PERFORMANCE
Execution time for 1000 iterations on a system with:
Intel® Core™ i7-6850K CPU
NVIDIA TITAN X (Pascal) GPU
That is unexpected!
![Page 73: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/73.jpg)
OPTIMIZATION WORKFLOW
Profile Application (Nsight Systems)
Inspect & Analyze
Optimize
![Page 74: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/74.jpg)
BASELINE PROFILE
Excessive data copies slowing down GPU
![Page 75: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/75.jpg)
![Page 76: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/76.jpg)
OPENACC SAMPLE
#pragma acc data copy(A) create(Anew)
while ( error > tol && iter < iter_max ) {  // Convergence loop
    error = 0.0;
    #pragma acc kernels
    {
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = …  // Stencil calculation
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }
        for( int j = 1; j < n-1; j++) {
            for( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];  // Update matrix
            }
        }
    }
    iter++;
}
![Page 77: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/77.jpg)
AFTER OPTIMIZATION
Excessive data copies eliminated
![Page 78: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/78.jpg)
AFTER OPTIMIZATION
CUDA kernel coverage on GPU is ~97%
![Page 79: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/79.jpg)
AFTER OPTIMIZATION
Execution time for 1000 iterations on a system with:
Intel® Core™ i7-6850K CPU
NVIDIA TITAN X (Pascal) GPU
44x speedup!
See https://www.openacc.org/resources for more best practices
![Page 80: NVIDIA PROFILING TOOLS OVERVIEW - Indico](https://reader030.vdocuments.mx/reader030/viewer/2022032505/6234dd9c24293974203aaec7/html5/thumbnails/80.jpg)
COMMON OPTIMIZATION OPPORTUNITIES
CPU
• Thread synchronization
• Algorithm bottlenecks starve the GPUs (case study 1)
Single GPU
• Memory operations – blocking, serial, unnecessary (case study 2)
• Too much synchronization - device, context, stream, default stream, implicit
• CPU GPU Overlap – avoid excessive communication
Multi GPU
• Communication between GPUs
• Lack of Stream Overlap in memory management, kernel execution