CUDA and GPU Training Workshop, April 21, 2014
DESCRIPTION
CUDA and GPU Training Workshop, April 21, 2014. University of Georgia CUDA Teaching Center. UGA, through the efforts of Professor Thiab Taha, has been selected by NVIDIA as a CUDA Teaching Center (established 2011). Presenters: Jennifer Rouan and Sayali Kale.
TRANSCRIPT
CUDA and GPU Training Workshop
April 21, 2014
University of Georgia CUDA Teaching Center
UGA CUDA Teaching Center
UGA, through the efforts of Professor Thiab Taha, has been selected by NVIDIA as a CUDA Teaching Center, established in 2011.
Presenters: Jennifer Rouan and Sayali Kale
Visit us at http://cuda.uga.edu
Workshop Outline
Introduction to GPUs and CUDA
CUDA Programming Concepts
Current GPU Research at UGA
GPU Computing Resources at UGA
“My First CUDA Program” – hands-on programming project
A Little Bit of GPU Background
A GPU is a computer chip that performs rapid mathematical calculations in parallel, primarily for the purpose of rendering images.
NVIDIA introduced the first GPU, the GeForce 256, in 1999 and remains one of the major players in the market.
Using CUDA, GPUs can be used for general-purpose processing; this approach is known as GPGPU.
Question
What are the different ways hardware designers make computers run faster?
1. Higher clock speeds
2. More work/clock cycle
3. More processors
What is CUDA?
Compute Unified Device Architecture
CUDA is a parallel computing platform and programming model created by NVIDIA and implemented by the GPUs (graphics processing units) that they produce.
The CUDA compiler uses a variation of C, with future support for C++.
CUDA was released on February 15, 2007 for the PC, and a beta version for Mac OS X followed on August 19, 2008.
Why CUDA?
CUDA provides the ability to use high-level languages. GPUs allow the creation of a very large number of concurrently executing threads at very low system-resource cost.
CUDA also exposes fast shared memory (16 KB) that can be shared between threads.
CPU vs. GPU
Central Processing Units (CPUs) consist of a few cores optimized for sequential, serial processing.
Graphics Processing Units (GPUs) consist of thousands of smaller, more efficient cores designed to handle many tasks simultaneously.
CUDA Computing System
GPU-accelerated computing offers great performance by offloading portions of the application to the GPU while the remainder of the code still runs on the CPU.
CUDA Computing System
A CUDA computing system consists of a host (CPU) and one or more devices (GPUs)
The portions of the program that can be evaluated in parallel are executed on the device; the host handles the serial portions and the transfer of execution and data to and from the device.
CUDA Program Source Code
A CUDA program is a unified source code encompassing both host and device code. Convention: program_name.cu
NVIDIA’s compiler (nvcc) separates the host and device code at compilation.
The host code is compiled by the host’s standard C compiler; the device code is further compiled by nvcc for execution on the GPU.
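For example, a unified source file is typically compiled and run like this (the file and executable names here are placeholders, not from the slides):

```
nvcc my_program.cu -o my_program   # nvcc splits host and device code and compiles both
./my_program                       # run on a machine with a CUDA-capable GPU
```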
CUDA Program Execution Steps:
1. CPU allocates storage on the GPU (cudaMalloc)
2. CPU copies input data from CPU to GPU (cudaMemcpy)
3. CPU launches the kernel (the program invoked on the GPU) to process the data (kernel launch)
4. CPU copies results back from GPU to CPU (cudaMemcpy)
Processing Flow
Processing flow of CUDA:
1. Copy data from main memory to GPU memory.
2. The CPU instructs the GPU to start processing.
3. The GPU executes in parallel on each core.
4. Copy the result from GPU memory back to main memory.
CUDA Program Execution
Execution of a CUDA program begins on the host CPU
When a kernel function (or simply “kernel”) is launched, execution is transferred to the device and a massive “grid” of lightweight threads is spawned
When all threads of a kernel have finished executing, the grid terminates and control of the program returns to the host until another kernel is launched
Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks
All threads share data memory space
A thread block is a batch of threads that can cooperate with each other
Threads and blocks have IDs, so each thread can decide what data to work on
Grid dimensions: 1D or 2D; block dimensions: 1D, 2D, or 3D
[Diagram: the host launches Kernel 1 as Grid 1, a 3 x 2 array of blocks, and Kernel 2 as Grid 2; Block (1, 1) is shown expanded into a 5 x 3 array of threads. Courtesy: NVIDIA]
CUDA Program Structure example
int main(void) {
float *a_h, *a_d; // pointers to host and device arrays
const int N = 10; // number of elements in array
size_t size = N * sizeof(float); // size of array in memory
// allocate memory on host and device for the array
// initialize array on host (a_h)
// copy array a_h to allocated device memory location (a_d)
// kernel invocation code – to have the device perform
// the parallel operations
// copy a_d from the device memory back to a_h
// free allocated memory on device and host
}
Data Movement example
int main(void)
{
float *a_h, *a_d; const int N = 10;
size_t size = N * sizeof(float); // size of array in memory
a_h = (float *)malloc(size); // allocate array on host
cudaMalloc((void **) &a_d, size); // allocate array on device
for (int i = 0; i < N; i++) a_h[i] = (float)i; // initialize array
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// kernel invocation code
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaFree(a_d); free(a_h); // free allocated memory
}
Kernel Invocation example
int main(void)
{
float *a_h, *a_d; const int N = 10;
size_t size = N * sizeof(float); // size of array in memory
a_h = (float *)malloc(size); // allocate array on host
cudaMalloc((void **) &a_d, size); // allocate array on device
for (int i = 0; i < N; i++) a_h[i] = (float)i; // initialize array
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
int block_size = 4; // set up execution parameters
int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);
square_array <<< n_blocks, block_size >>> (a_d, N);
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaFree(a_d); free(a_h); // free allocated memory
}
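Not shown on the slide: after the device-to-host copy (and before the buffers are freed), the result can be checked with an ordinary host-side loop. A minimal sketch, added here for illustration only:

```c
// print the squared values copied back from the device to confirm the kernel ran
for (int i = 0; i < N; i++)
    printf("a_h[%d] = %f\n", i, a_h[i]);
```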
Kernel Function Call
#include <stdio.h>
#include <cuda.h>
// kernel function that executes on the CUDA device
__global__ void square_array(float *a, int N)
{ int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}
// the main routine that executes on the host is shown on the previous slides
Compare with serial C version:
void square_array(float *a, int N) {
int i;
for (i = 0; i < N; i++) a[i] = a[i] * a[i];
}
CUDA Thread Organization
Since all threads of a grid execute the same code, they rely on a two-level hierarchy of coordinates to distinguish themselves: blockIdx and threadIdx
The example code fragment
ID = blockIdx.x * blockDim.x + threadIdx.x;
will yield a unique ID for every thread across all blocks of a grid.
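The same pattern extends to two dimensions when the grid and its blocks are 2D. A sketch (the kernel name and matrix layout are illustrative, not from the slides):

```c
// each thread handles one element of an n_rows x n_cols matrix stored row-major
__global__ void scale_matrix(float *a, int n_rows, int n_cols, float s)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // unique row index
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // unique column index
    if (row < n_rows && col < n_cols)
        a[row * n_cols + col] *= s;                    // flatten (row, col) into a 1D offset
}
```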
Execution Parameters and Kernel Launch
A kernel is invoked by the host program with execution parameters surrounded by ‘<<<’ and ‘>>>’ as in: function_name <<< grid_dim, block_dim >>> (arg1, arg2);
At kernel launch, a “grid” is spawned on the device. A grid consists of a one- or two-dimensional array of “blocks”. In turn, a block consists of a one-, two-, or three-dimensional array of “threads”.
Grid and block dimensions are passed to the kernel function at invocation as execution parameters.
Execution Parameters and Kernel Launch
gridDim and blockDim are CUDA built-in variables of type dim3, essentially a C struct with three unsigned integer fields, x, y, and z
Since a grid is generally two-dimensional, gridDim.z is ignored but should be set to 1 for clarity
dim3 grid_d(n_blocks, 1, 1); // this is still host code
dim3 block_d(block_size, 1, 1);
function_name <<< grid_d, block_d >>> (arg1, arg2);
For one-dimensional grids and blocks, scalar values can be used instead of dim3 type
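For example, with the square_array kernel and the n_blocks and block_size variables from the earlier slides, the following two launches are equivalent for a one-dimensional configuration (a sketch):

```c
// scalar execution parameters
square_array <<< n_blocks, block_size >>> (a_d, N);

// the same launch written with explicit dim3 variables
dim3 grid_d(n_blocks, 1, 1);
dim3 block_d(block_size, 1, 1);
square_array <<< grid_d, block_d >>> (a_d, N);
```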
Execution Parameters and Kernel Launch
dim3 grid_dim(2, 2, 1);
dim3 block_dim(4, 2, 2);
This configuration describes a 2 x 2 grid of blocks, each containing 4 x 2 x 2 = 16 threads, for 64 threads in total.
Kernel Functions
A kernel function specifies the code to be executed by all threads in parallel – an instance of single-program, multiple-data (SPMD) parallel programming.
A kernel function declaration is a C function extended with one of three keywords: “__device__”, “__global__”, or “__host__”.
| Declaration | Executed on the: | Only callable from the: |
| --- | --- | --- |
| __device__ float DeviceFunc() | device | device |
| __global__ void KernelFunc() | device | host |
| __host__ float HostFunc() | host | host |
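A minimal sketch of how the qualifiers combine: the earlier square_array kernel rewritten to call a __device__ helper (the helper is illustrative, not from the slides). The __global__ kernel is launched from the host, while the __device__ function is callable only from device code.

```c
// device-only helper: callable from kernels and other device functions
__device__ float square(float x)
{
    return x * x;
}

// kernel: executes on the device, launched from host code
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = square(a[idx]);   // call the __device__ helper
}
```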
CUDA Device Memory Types
Global Memory and Constant Memory can be accessed by the host and device. Constant Memory serves read-only data to the device at high bandwidth. Global Memory is read-write and has a longer latency
Registers, Local Memory, and Shared Memory are accessible only to the device. Registers and Local Memory are available only to their own thread. Shared Memory is accessible to all threads within the same block.
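As an illustration of per-block shared memory (a sketch, not from the slides; it assumes the kernel is launched with BLOCK_SIZE threads per block), threads can stage data in fast __shared__ storage, synchronize, and then read a neighbor's value without another trip to global memory:

```c
#define BLOCK_SIZE 256

// each thread copies one element into shared memory, then reads its right-hand
// neighbor's value from the on-chip tile instead of from global memory
__global__ void shift_left(const float *in, float *out, int N)
{
    __shared__ float tile[BLOCK_SIZE + 1];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < N) tile[threadIdx.x] = in[idx];
    // the last thread of the block also fetches the first element of the next tile
    if (threadIdx.x == blockDim.x - 1 && idx + 1 < N)
        tile[threadIdx.x + 1] = in[idx + 1];

    __syncthreads();   // make every thread's shared-memory write visible to the block

    if (idx + 1 < N) out[idx] = tile[threadIdx.x + 1];
}
```

Because shared memory is visible only within a block, the element at the tile boundary has to be fetched explicitly by the last thread of the block.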
CUDA Device Memory Model
Each thread can:
Read/write per-thread registers
Read/write per-thread local memory
Read/write per-block shared memory
Read/write per-grid global memory
Read (only) per-grid constant memory
[Diagram: the device grid contains blocks, each with its own shared memory; every thread has its own registers and local memory; global and constant memory are shared across the grid and are accessible from the host.]
CUDA Advantages
Huge increase in processing power over conventional CPU processing. Early reports suggest speed increases of 10x to 200x over CPU processing speed.
Researchers can use several GPUs to perform the same amount of operations as many servers in less time, thus saving money, time, and space.
The C language is widely used, so it is easy for developers to learn how to program for CUDA.
All graphics cards in the G80 series and beyond support CUDA.
Harnesses the power of the GPU through parallel processing, running thousands of threads simultaneously.
Disadvantages
Limited user base: only NVIDIA G80 and later video cards can use CUDA, which excludes all ATI users.
Speeds may be bottlenecked at the bus between the CPU and GPU.
Mainly developed for researchers; there are not many uses for average users.
The system is still in development.
Current GPU Research at UGA
Institute of Bioinformatics
“Solving Nonlinear Systems of First Order Ordinary Differential Equations Using a Galerkin Finite Element Method”, Al-Omari, A., Schuttler, H.-B., Arnold, J., and Taha, T. R. IEEE Access, 1, 408-417 (2013).
“Solving Large Nonlinear Systems of First-Order Ordinary Differential Equations With Hierarchical Structure Using Multi-GPGPUs and an Adaptive Runge Kutta ODE Solver”, Al-Omari, A., Arnold, J., Taha, T. R., Schuttler, H.-B. IEEE Access, 1, 770-777 (2013).
Current GPU Research at UGA
Institute of Bioinformatics
GTC Poster:http://cuda.uga.edu/docs/GPGPU_runge-kutta.pdf
Current GPU Research at UGA
Department of Physics and Astronomy
“A generic, hierarchical framework for massively parallel Wang-Landau sampling”, T. Vogel, Y. W. Li, T. Wust, and D. P. Landau, Phys. Rev. Lett. 110, 210603 (2013).
“Massively parallel Wang-Landau Sampling on Multiple GPUs”, J. Yin and D. P. Landau, Comput. Phys. Commun. 183, 1568-1573 (2012).
Current GPU Research at UGA
Department of Physics and Astronomy
Current GPU Research at UGA
Department of Computer Science
“Analysis of Surface Folding Patterns of DICCCOLS Using the GPU-Optimized Geodesic Field Estimate”, Mukhopadhyay, A., Lim, C.-W., Bhandarkar, S. M., Chen, H., New, A., Liu, T., Rasheed, K. M., Taha, T. R. Nagoya: Proceedings of MICCAI Workshop on Mesh Processing in Medical Image Analysis (2013).
Current GPU Research at UGA
Department of Computer Science
Current GPU Research at UGA
Department of Computer Science
“Using Massively Parallel Evolutionary Computation on GPUs for Biological Circuit Reconstruction”, Cholwoo Lim, master's thesis under the direction of Dr. Khaled Rasheed (2013).
Current GPU Research at UGA
Current GPU Research at UGA
Department of Computer Science
“GPU Acceleration of High-Dimensional k-Nearest Neighbor Search for Face Recognition using EigenFaces”, Jennifer Rouan, master's thesis under the direction of Dr. Thiab Taha (2014)
“Using CUDA for GPUs over MPI to solve Nonlinear Evolution Equations”, Jennifer Rouan and Thiab Taha, presented at: The Eighth IMACS International Conference on Nonlinear Evolution Equations and Wave Phenomena: Computation and Theory (2013)
Current GPU Research at UGA
Department of Computer Science
Research Day 2014 Poster:http://cuda.uga.edu/docs/J_Rouan_Research_Day_poster.pdf
Waves 2013 Poster:http://cuda.uga.edu/docs/waves_Poster_Rouan.pdf
More CUDA Training Resources
University of Georgia CUDA Teaching Center: http://cuda.uga.edu
Nvidia training and education site: http://developer.nvidia.com/cuda-education-training
Stanford University course on iTunes U: http://itunes.apple.com/us/itunes-u/programming-massively-parallel/id384233322
University of Illinois: http://courses.engr.illinois.edu/ece498/al/Syllabus.html
University of California, Davis: https://smartsite.ucdavis.edu/xsl-portal/site/1707812c-4009-4d91-a80e-271bde5c8fac/page/de40f2cc-40d9-4b0f-a2d3-e8518bd0266a
University of Wisconsin: http://sbel.wisc.edu/Courses/ME964/2011/me964Spring2011.pdf
University of North Carolina at Charlotte: http://coitweb.uncc.edu/~abw/ITCS6010S11/index.html
GPUs Available at UGA
CUDA Teaching Center lab in 207A
Twelve NVIDIA GeForce GTX 480 GPUs
Six Linux hosts on the cs.uga.edu domain: cuda01, cuda02, cuda03, cuda04, cuda05 & cuda06
SSH from nike.cs.uga.edu with your CS login and password (see the example below)
More GPUs are available on the Z-cluster; visit http://gacrc.uga.edu for an account and more information
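For example (the username is a placeholder, and the second hop is an assumption about how the cuda hosts are reached from nike):

```
ssh your_cs_login@nike.cs.uga.edu   # log in to the departmental gateway first
ssh cuda01                          # then connect to one of the CUDA hosts
```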
References
Kirk, D., & Hwu, W. (2010). Programming Massively Parallel Processors: A Hands-on Approach, 1 – 75
Tarjan, D. (2010). Introduction to CUDA, Stanford University on iTunes U
Atallah, M. J. (Ed.), (1998). Algorithms and theory of computation handbook. Boca Raton, FL: CRC Press
von Neumann, J. (1945). First draft of a report on the EDVAC. Contract No. W-670-ORD-4926, U.S. Army Ordnance Department and University of Pennsylvania
Sutter, H., & Larus, J. (2005). Software and the concurrency revolution. ACM Queue, 3(7), 54 – 62
Stratton, J. A., Stone, S. S., & Hwu, W. W. (2008). MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. Edmonton, Canada
Vandenbout, Dave (2008). My First Cuda Program, http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/
References
“Using Massively Parallel Evolutionary Computation on GPUs for Biological Circuit Reconstruction”, Cholwoo Lim, master's thesis under the direction of Dr. Khaled Rasheed (2013); Prof. Taha is one of the advisory committee members
“Solving Large Nonlinear Systems of ODEs with Hierarchical Structure Using Multi-GPGPUs and an Adaptive Runge Kutta”, Ahmad Al-Omari, Thiab Taha, B. Schuttler, Jonathan Arnold, presented at: GPU Technology Conference, March 2014
“Using CUDA for GPUs over MPI to solve Nonlinear Evolution Equations”, Jennifer Rouan and Thiab Taha, presented at: The Eighth IMACS International Conference on Nonlinear Evolution Equations and Wave Phenomena: Computation and Theory, March 2013
“GPU Acceleration of High-Dimensional k-Nearest Neighbor Search for Face Recognition using EigenFaces”, Jennifer Rouan, and Thiab Taha, presented at: UGA Department of Computer Science Research Day, April 2014