GPU Computing - Dirac - NERSC: National Energy Research Scientific Computing Center
TRANSCRIPT
GPU Computing with Dirac
Hemant Shukla
Architectural Differences
[Diagram: CPU (control logic, ALU, cache, DRAM) vs. GPU (many ALUs, DRAM)]

CPU: fewer than 20 cores, 1-2 threads per core; latency is hidden by large caches
GPU: up to 512 cores, 10s to 100s of threads per core; latency is hidden by fast context switching
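The many-thread GPU model above is what CUDA exposes directly. As a minimal sketch (not from the slides; kernel and variable names are illustrative), a kernel launched across far more threads than physical cores, each thread handling one array element:

```cuda
#include <stdio.h>

/* Each lightweight GPU thread processes one element; the hardware
   hides memory latency by switching among the many resident threads. */
__global__ void addOne(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;                 /* one million elements */
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    /* 4096 blocks of 256 threads: far more threads than cores */
    addOne<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();               /* launches are asynchronous */

    cudaFree(d_x);
    return 0;
}
```

Compile with nvcc as shown later in the deck; on a Fermi-class card like Dirac's C2050, the scheduler keeps the cores busy by swapping stalled threads for ready ones.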
Programming Models
CUDA (Compute Unified Device Architecture)
OpenCL
Microsoft's DirectCompute
Third party wrappers are also available for Python, Perl, Fortran, Java, Ruby, Lua, MATLAB, IDL, and Mathematica
Compilers and directive-based approaches from PGI, RCC, HMPP, Copperhead, and OpenACC
DIRAC

Hardware

44 nodes: 1 Tesla C2050 (Fermi) GPU, 2 Nehalem quad-core CPUs, 24 GB DDR3
4 nodes: 1 Tesla C1060 GPU, 2 Nehalem quad-core CPUs, 24 GB DDR3
1 node: 4 Tesla C1060 GPUs, 2 Nehalem quad-core CPUs, 24 GB DDR3
1 node: 4 Tesla C2050 (Fermi) GPUs, 2 Nehalem quad-core CPUs, 24 GB DDR3

Interconnect: 4X QDR InfiniBand (32 Gb/s)
DIRAC ACCOUNT
Request a Dirac account at:
http://www.nersc.gov/users/computational-systems/dirac/getting-started/request-for-dirac-account/
For questions contact NERSC Consultants at [email protected]
DIRAC Access

SINGLE NODE ACCESS

Dirac nodes reside on Carver (IBM iDataPlex).

1. Log into Carver:

> ssh <username>@carver.nersc.gov

2. Start an interactive job on a Dirac node:

> qsub -I -V -q dirac_int -l nodes=1:ppn=8:fermi
qsub: waiting for job 1822582.cvrsvc09-ib to start
qsub: job 1822582.cvrsvc09-ib ready

3. The prompt is now on a Dirac node:

[dirac28] >
First CUDA Program
whatDevice.cu:

#include <stdio.h>

int main() {
    int noOfDevices;

    /* get the number of devices */
    cudaGetDeviceCount(&noOfDevices);

    cudaDeviceProp prop;
    for (int i = 0; i < noOfDevices; i++) {
        /* get device properties */
        cudaGetDeviceProperties(&prop, i);
        printf("Device Name:\t %s\n", prop.name);
        printf("Total global memory:\t %ld\n", prop.totalGlobalMem);
        printf("No. of SMs:\t %d\n", prop.multiProcessorCount);
        printf("Shared memory / SM:\t %ld\n", prop.sharedMemPerBlock);
        printf("Registers / SM:\t %d\n", prop.regsPerBlock);
    }
    return 0;
}
First CUDA Program

Compilation (on Carver):

> module load cuda
OR
> module load cuda/<version>

> nvcc whatDevice.cu -o whatDevice

Log into Dirac and run the code:

> qsub -I -V -q dirac_int -l nodes=1:ppn=8:fermi
[dirac27] > ./whatDevice

Output:

Device Name:          Tesla C2050
Total global memory:  2817982464
No. of SMs:           14
Shared memory / SM:   49152
Registers / SM:       32768
CUDA + OpenMP + MPI
MPI + CUDA:

> mpicc -c main.c -o main.o
> nvcc -c kernel.cu -o kernel.o
> mpicc main.o kernel.o -L/usr/local/cuda/lib64 -lcudart

OpenMP (include omp.h):

> gcc -fopenmp someCode.c -o someCode

CUDA + OpenMP:

> nvcc -Xcompiler -fopenmp -lgomp someCode.cu -o someCode
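The MPI + CUDA commands above assume a host file (main.c) and a device file (kernel.cu) linked together. A minimal sketch of that split, assuming hypothetical kernel and wrapper names (only the file names come from the slide):

```cuda
/* kernel.cu -- compiled with: nvcc -c kernel.cu -o kernel.o */
__global__ void scaleKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

/* C-callable wrapper so code compiled with mpicc can launch the kernel */
extern "C" void scaleOnGPU(float *hostData, int n) {
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
    scaleKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(hostData, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
}

/* main.c -- compiled with: mpicc -c main.c -o main.o

#include <mpi.h>
void scaleOnGPU(float *hostData, int n);   // wrapper defined in kernel.cu

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float data[256];
    for (int i = 0; i < 256; i++) data[i] = (float)rank;
    scaleOnGPU(data, 256);   // each MPI rank drives its own GPU work

    MPI_Finalize();
    return 0;
}
*/
```

The extern "C" wrapper is the usual trick for this build pattern: mpicc never sees CUDA syntax, and nvcc never sees MPI headers; the final mpicc link pulls in the CUDA runtime via -lcudart.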
Matlab

2D FFT Example

>> N = 5000;
>> a = rand(N,N);
>> b = gpuArray(a);

On the CPU array:

>> tic; r1 = fft2(a); toc;
Elapsed time is 1.135059 seconds.
>> tic; gather(r1); toc;
Elapsed time is 0.015606 seconds.

On the GPU array:

>> tic; r2 = fft2(b); toc;
Elapsed time is 0.423533 seconds.
>> tic; gather(r2); toc;
Elapsed time is 0.253305 seconds.
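The same forward 2D FFT can be expressed in CUDA C through the cuFFT library. A minimal sketch, not from the slides, assuming single-precision complex data and an in-place transform, with error checking omitted:

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int N = 5000;

    /* device buffer for an N x N complex array */
    cufftComplex *d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * N * N);
    /* ... copy input data into d_data ... */

    cufftHandle plan;
    cufftPlan2d(&plan, N, N, CUFFT_C2C);               /* plan an N x N complex FFT */
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); /* in-place forward transform */
    cudaDeviceSynchronize();                           /* execution is asynchronous */

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```

Matlab's gpuArray fft2 uses cuFFT underneath; the gather timings above show the device-to-host copy that cudaMemcpy would make explicit here.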
Running Batch Jobs

submit.job:

#PBS -q dirac_reg
#PBS -l nodes=16:ppn=8:fermi
#PBS -l walltime=00:10:00
#PBS -A gpgpu
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V

cd $PBS_O_WORKDIR
mpirun -np 128 ./my_executable

Submit with:

> qsub submit.job
Programming Models on Dirac
CUDA
OpenCL
Matlab
HMPP
MINT
RCC
PGI (CUDA-FORTRAN)
OpenACC
Adaptive Mesh Refinement
Data stored in an octree data structure
Refinement with 2^l spatial resolution per level l
8^3 cells per patch
Identical spatial geometry (same kernel)
Uniform and individual time-steps

[Figure: 2D patch illustration - Hsi-Yu Schive et al., 2010]
Results

Large-scale Cosmological Simulations with GAMER
Hemant Shukla, Hsi-Yu Schive et al., SC 2011
For More Information
http://www.nersc.gov/users/computational-systems/dirac/