![Page 1: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/1.jpg)
GPU for HPC
Hardware Evolution and milestones
GPU x CPU Architecture
DevelopmentDevelopment-Kernel- Execution and memory hierarchy
-threads and blocks-Global memory
- Warps and Coalesced i/o
Finite Diference Method-one dimension wave equation
-global and shared memory examples-Maxwell Equations, two dimension Yee method
-global, shared and texture examples
Outline
Hardware Evolution and milestonesHardware Evolution and milestones
![Page 2: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/2.jpg)
What is CUDA
Is a extension of C language that enable the use of GPU for general computation !
![Page 3: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/3.jpg)
Why CUDA
● Easy to use● Chip hardware● Free to use● Broaded adopted
![Page 4: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/4.jpg)
Installing
http://developer.nvidia.com/cuda-toolkit-40
We need:-Module: Developer Drivers for Linux (270.41.19)-Toolkit: Cuda Toolkit for your distribution
Compiler: NVCC (same syntax as GCC)
![Page 5: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/5.jpg)
Adoption time line
![Page 6: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/6.jpg)
Adoption
![Page 7: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/7.jpg)
GFLOPs EvolutionGPUxCPU
![Page 8: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/8.jpg)
Bandwidth EvolutionGPUxCPU
![Page 9: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/9.jpg)
Schematic Architecture CPU x GPU
CPUs: great area dedicated to control GPUs: great area dedicated to ALU(Aritmetic Logical Unit)
![Page 10: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/10.jpg)
How to develop
CPU GPU
Host Device
Copy data from Host and Device:cudaMemcpy(A_d, A_h, memsize, cudaMemcpyHostToDevice); cudaMemcpy(A_h, A_d, memsize, cudaMemcpyDeviceToHost); cudaMemcpy(B_d, A_d, memsize, cudaMemcpyDeviceToDevice);
Alloc data on Host and Device:cudaMalloc((void**)&A_d, memsize);A_h = (float *)malloc(memsize);
![Page 11: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/11.jpg)
blockIdx 0blockIdx 0 blockIdx 1blockIdx 1
Threads and Blocks build-in variables
threadIdx 1threadIdx 1
Execute one kernel instance and has a build-in indexthat points to data that should be updated !
threadIdx 0threadIdx 0 threadIdx 0threadIdx 0 threadIdx 4threadIdx 4......
Thread index = blockIdx.x*blockDim.x + threadIdx.x= blockIdx.x*blockDim.x + threadIdx.x
Threads forming a Block can be syncronized !
![Page 12: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/12.jpg)
// Kernel definition// Kernel definition__global__ void VecAdd(float* A, float* B, float* C)__global__ void VecAdd(float* A, float* B, float* C){{int i = blockIdx.x * blockDim.x + threadIdx.x;int i = blockIdx.x * blockDim.x + threadIdx.x;C[i] = A[i] + B[i];C[i] = A[i] + B[i];}}int main()int main(){{......// Kernel invocation with N threads// Kernel invocation with N threadsVecAdd<<<nBlocks, BlockSize>>>(A, B, C);VecAdd<<<nBlocks, BlockSize>>>(A, B, C);
Kernel
![Page 13: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/13.jpg)
Calling a Kernel
Main function – Host code:
kernel <<< nBlocks, BlockSize >>> (arguments); kernel <<< nBlocks, BlockSize >>> (arguments);
nBlocks → number of blocksnBlocks → number of blocksBlockSize → number of threads for each BlockBlockSize → number of threads for each Block Example:
int BlockSize=4;int BlockSize=4;
int nBlocks = N / BlockSize + (N % BlockSize == 0?0:1);int nBlocks = N / BlockSize + (N % BlockSize == 0?0:1);
nBlocks defined at run time!
![Page 14: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/14.jpg)
Programming model
![Page 15: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/15.jpg)
Memory Hierarchy
![Page 16: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/16.jpg)
Memory Hierarchy
![Page 17: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/17.jpg)
Multiprocessor
Hardware view
Each block is divided in 32 threads called Warp and queued to be executed in SIMD model on a Multiprocessor.
To warranty execute all threads in parallel in a warp the memory reads and writes must be coalesced!
![Page 18: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/18.jpg)
Hardware view
![Page 19: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/19.jpg)
![Page 20: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/20.jpg)
![Page 21: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/21.jpg)
Wave equation solution via FDTD:
ii-1 i+1
One-dimension FDTD Method
![Page 22: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/22.jpg)
![Page 23: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/23.jpg)
Idle Threads
![Page 24: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/24.jpg)
Idle Threads
Syncronization !
![Page 25: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/25.jpg)
How to Sync this two Threads ?
You have to re-run the kernel !
![Page 26: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/26.jpg)
Basic Example: Sum two arraysT
ime
/ n
um
be
r o
f lo
op
s
When adding two array of10MB GPU expende half of the time for call the kernel !
![Page 27: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/27.jpg)
main()
![Page 28: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/28.jpg)
Resultados
![Page 29: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/29.jpg)
Faraday's Law:
Ampere's Law:
Maxwell's Equations and the Yee algorithm
![Page 30: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/30.jpg)
Two-dimension FDTD Method
TEz Mode
![Page 31: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/31.jpg)
Two-dimension FDTD Method
TMz Mode
![Page 32: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/32.jpg)
Example of a computational domain in twodimensions (16x16). Each box contains information
about the fields and electromagnetic properties. Division of the computational domain into four regions. The arrows indicate the process of communication between
regions.
CUDA execution model representation. In blue, we have threads divided between the blocks, in
green, and grouped into a grid, in red.
Syncronization issue !!!
![Page 33: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/33.jpg)
dimGrid.xdimGrid.x
dim
Grid
.ydi
mG
rid.y
dimBlock.xdimBlock.x
dim
Blo
ck.y
dim
Blo
ck.y
dim3 dimBlock (BLOCKSIZEX, BLOCKSIZEY); dim3 dimBlock (BLOCKSIZEX, BLOCKSIZEY); //dim. of threads block //dim. of threads block
dim3 dimGrid ( #of Blocks X direction, #of Blocks Y direction ); dim3 dimGrid ( #of Blocks X direction, #of Blocks Y direction );
Set Threads and Blocks in 2D
![Page 34: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/34.jpg)
Mapping 2D computational domain in 1D arrays
dimGrid.xdimGrid.x
dim
Grid
.ydi
mG
rid.y
dimBlock.xdimBlock.x
dim
Blo
ck.y
dim
Blo
ck.y
int i = blockIdx.x * blockDim.x + threadIdx.x;int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;int j = blockIdx.y * blockDim.y + threadIdx.y;
Field[i+j*Lx] !Field[i+j*Lx] !
![Page 35: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/35.jpg)
Setting Threads and Blocks
Launching Kernel
![Page 36: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/36.jpg)
Example !
![Page 37: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/37.jpg)
Electric field time evolution
![Page 38: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/38.jpg)
Video
![Page 39: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/39.jpg)
Benchmark: Our Cuda code x Our CPU code x reference Cuda code3
Grid Points CPU FDTD 2D Time(s)
Reference GPU FDTD 2D Time(s)
Our GPU FDTD 2D Time(s)
128x128 0,405 0,39 0,162
256x256 2,253 0,76 0,245
512x512 10,408 2,2 0,519
1024x1024 43,531 8,05 1,735
2048x2048 183,899 31,6 6,065
4096x4096 568,199 20,402
Table 1: Real elapsed time for the execution 2000 FDTD time iterations of the propagation of a Gaussian Pulse on a conductor box
![Page 40: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/40.jpg)
Speedups
Grid Points SpeedUP (our CPU/ our GPU)
SpeedUP (GPU ref. / our GPU)
128x128 2,5 2,4
256x256 9,2 3,1
512x512 20,1 4,2
1024x1024 25,1 4,6
2048x2048 30,3 5,2
4096x4096 27,9
Table 2: Speedups for the execution 2000 FDTD time iterations of the propagation of a Gaussian Pulse on a conductor box
![Page 41: Outline - Sites do IFGWsites.ifi.unicamp.br/maicon/files/2017/01/Cuda-course.pdf · 2017. 1. 9. · Benchmark: Our Cuda code x Our CPU code x reference Cuda code3 Grid Points CPU](https://reader036.vdocuments.mx/reader036/viewer/2022081620/610471a4a4cc2b14047b4d0d/html5/thumbnails/41.jpg)
References
Dr. Dobbs: Supercomputing for masses
http://developer.nvidia.com/cuda-education-training
groups.google.com.br/group/gpubrasil