TRANSCRIPT

GPGPU – A Current Trend in High Performance Computing

Chokchai Box Leangsuksun, PhD
SWEPCO Endowed Professor*, Computer Science; Director, High Performance Computing Initiative
Louisiana Tech University, [email protected]

*The SWEPCO endowed professorship is made possible by the LA Board of Regents.
Outline

• Intro to HPC – Box
• GPU Tutorial – Box
• CUDA programming concepts – Box
• Case study: advanced performance improvement
Mainstream CPUs

• CPU speed plateaus at 3–4 GHz
• More cores in a single chip
  – Dual/quad core is here now
  – Many-core (GPGPU) is next
• Traditional applications won't get a free ride
• Conversion to parallel computing (HPC, MT) is required

[Diagram: CPU clock speeds flattening at the 3–4 GHz cap; from the "no free lunch" article in DDJ]
New Trends in Computing

• Old & current – SMP, cluster
• Multicore computers
  – Intel Core 2 Duo
  – AMD X2 64
• Many-core accelerators – GPGPU, FPGA, Cell
• Many brains in one computer
• Not increasing CPU frequency
• Harnessing many computers – cluster computing
What is HPC?

• High Performance Computing – parallel computing, supercomputing
  – Achieve the fastest possible computing outcome
  – Subdivide a very large job into many pieces
  – Enabled by multiple high-speed CPUs, networking, software & programming paradigms – the fastest possible solution
  – Technologies that help solve non-trivial tasks, including scientific, engineering, medical, business, and entertainment problems
• Time to insights, time to discovery, time to market
Parallel Programming Concepts

Conventional serial execution: the problem is represented as a series of instructions that are executed, one after another, by the CPU.

Parallel execution involves partitioning the problem into multiple executable parts that are mutually exclusive and collectively exhaustive, represented as a partially ordered set exhibiting concurrency; the parts are executed as tasks on multiple CPUs.

[Diagram: one problem → instructions → a single CPU, versus one problem → tasks → multiple CPUs]

Parallel computing takes advantage of concurrency to:
• Solve larger problems in less time
• Save on wall-clock time
• Overcome memory constraints
• Utilize non-local resources

Source from Thomas Sterling's intro to HPC.
HPC Applications and Major Industries

• Finite element modeling
  – Auto/aero
• Fluid dynamics
  – Auto/aero, consumer packaged goods manufacturers, process manufacturing, disaster preparedness (tsunami)
• Imaging
  – Seismic & medical
• Finance & business
  – Banks, brokerage houses (regression analysis, risk, options pricing, what-if analysis, …)
  – Wal-Mart's HPC in their operations
• Molecular modeling
  – Biotech and pharmaceuticals

Complex problems, large datasets, long runs.

This slide is from the Intel presentation "Technologies for Delivering Peak Performance on HPC and Grid Applications".
The GPGPU Tutorial

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign
What & Why is GPGPU?

• General-purpose computation using a GPU in applications other than 3D graphics
  – The GPU accelerates the critical path of the application
• One of the hottest computing trends – heterogeneous computing
• Data-parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating-point (FP) computation
• Applications – see GPGPU.org
  – Game effects (FX), physics, image processing
  – Oil exploration, real-time MRI/CT scans
  – Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
Why GPGPU?

• Large number of cores – 100–1000 cores in a single card
• Low cost – less than $100–$1500 per card
• Green computing – low power consumption, ~135 watts/card
  – 135 W vs. 30,000 W (300 watts × 100 desktops)
• 1 card can outperform 100+ desktops
  – $750 vs. $50,000 ($500 × 100)
CPU vs. GPU

• CPU
  – Fast caches
  – Branching adaptability
  – High performance
• GPU
  – Multiple ALUs
  – Fast onboard memory
  – High throughput on parallel tasks
    • Executes a program on each fragment/vertex
• CPUs are great for task parallelism
• GPUs are great for data parallelism

Supercomputing 2008 Education Program
CPU vs. GPU – Hardware

• More transistors devoted to data processing
Two major players
Parallel Computing on a GPU

• NVIDIA GPU Computing Architecture
  – Via a HW device interface
  – In laptops, desktops, workstations, servers
• Tesla T10 / S1070: 1–4 TFLOPS
• AMD/ATI 4870 X2: 1600 cores
• NVIDIA Tegra is an all-in-one (system-on-a-chip) processor architecture derived from the ARM family
• GPU parallelism is growing faster than Moore's law, roughly doubling every year
• A GPGPU is a GPU that allows the user to process both graphics and non-graphics applications.

[Images: ATI 4850, GeForce 8800]

Requirements of a GPU System

• GPGPU-capable video card
• Adequate power supply
• Cooling
• PCI-Express 16x slot

[Images: Tesla D870, GeForce 8800]
Examples of GPU Devices
NVIDIA GeForce 8800 (G80)

• The eighth generation of NVIDIA's GeForce graphics cards
• High-performance, CUDA-enabled GPGPU
• 128 cores
• Memory: 256–768 MB, or 1.5 GB in Tesla
• High-speed memory bandwidth
• Supports Scalable Link Interface (SLI)
NVIDIA Tesla™

• Features
  – GPU computing for HPC
  – No display ports
  – Dedicated to computation
  – For massively multi-threaded computing
  – Supercomputing performance
• Tesla card series
  – C-Series (card) = 1 GPU with 1.5 GB
  – D-Series (deskside unit) = 2 GPUs
  – S-Series (1U server) = 4 GPUs
• Note: 1 G80 GPU = 128 cores ≈ 500 GFLOPS; 1 T10 = 240 cores ≈ 1 TFLOPS

This slide is from the NVIDIA CUDA tutorial.
ATI Stream (1)

ATI 4870

ATI 4870 X2

Architecture of the ATI Radeon 4000 series

These slides are from an ATI presentation.
Intel Larrabee

• A hybrid between a multi-core CPU and a GPU
• Its coherent cache hierarchy and x86 architecture compatibility are CPU-like
• Its wide SIMD vector units and texture-sampling hardware are GPU-like
Introduction to OpenCL

Toward a new approach in computing

Moayad Almohaishi
Introduction to OpenCL

• OpenCL stands for Open Computing Language.
• It comes from a consortium effort including Apple, NVIDIA, AMD, etc.
• It is managed by the Khronos Group, which was also responsible for OpenGL.
• It took 6 months to come up with the specifications.
OpenCL

1. Royalty-free.
2. Supports both task- and data-parallel programming modes.
3. Works for vendor-agnostic GPGPUs,
4. including multi-core CPUs.
5. Works on Cell processors.
6. Supports handhelds and mobile devices.
7. Based on the C language under C99.
OpenCL Platform Model

CPUs + GPU platforms
Performance of GPGPU

Note: for comparison, a cluster of 30 dual-Xeon 2.8 GHz nodes has a peak performance of ~336 GFLOPS.
CUDA

• "Compute Unified Device Architecture"
• General-purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
  – Compute-oriented drivers, language, and tools
• Driver for loading computation programs onto the GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute – a graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management
An Example of Physical Reality Behind CUDA

CPU (host)  ←→  GPU w/ local DRAM (device)
Parallel Computing on a GPU

• NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
• Programmable in C with CUDA tools
• Multithreaded SIMD model uses application data parallelism and thread parallelism

[Images: GeForce 8800, Tesla D870, Tesla S870]
GeForce 8800

16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU.

[Diagram: host → input assembler → thread execution manager, feeding an array of streaming multiprocessors, each with a parallel data cache and texture unit, with load/store paths to global memory]
Introduction to CUDA Programming

These materials are excerpted from David Kirk/NVIDIA and Wen-mei W. Hwu's and Christian Trefftz / Greg Wolffe's SC08 GPU tutorials.
Data-parallel Programming

• Think of the GPU as a massively-threaded co-processor
• Write "kernel" functions that execute on the device, processing multiple data elements in parallel
• Keep it busy! – massive threading
• Keep your data close! – local memory
Pixel / Thread Processing
Steps for CUDA Programming

1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute the kernel (call a __global__ function)
5. Copy data back from device memory (retrieve results)
Initially:

[Figure: the host's memory holds "array"; the GPU card's memory is empty]
Allocate Memory in the GPU Card

[Figure: "array" in the host's memory; "array_d" now allocated in the GPU card's memory]
Copy Content from the Host's Memory to the GPU Card's Memory

[Figure: the contents of "array" are copied into "array_d"]
Execute Code on the GPU

[Figure: the kernel code runs on the GPU multiprocessors, operating on "array_d"]
Copy Results Back to the Host Memory

[Figure: the contents of "array_d" are copied back into "array"]
Steps for CUDA Programming (recap)

1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute the kernel (call a __global__ function)
5. Copy data back from device memory (retrieve results)
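To make these five steps concrete, here is a minimal end-to-end host program. This sketch is not from the original slides; the incrementArray kernel and the launch dimensions are illustrative only, and error checking is omitted.

// Minimal sketch of the five steps above.
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread adds 1.0 to one array element.
__global__ void incrementArray(float *a_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a_d[i] += 1.0f;
}

int main(void)
{
    const int N = 256;
    size_t bytes = N * sizeof(float);
    float array[N];
    for (int i = 0; i < N; i++)
        array[i] = (float)i;

    cudaSetDevice(0);                                  // 1. device initialization

    float *array_d;
    cudaMalloc((void **)&array_d, bytes);              // 2. device memory allocation

    cudaMemcpy(array_d, array, bytes,
               cudaMemcpyHostToDevice);                // 3. copy data to device memory

    incrementArray<<<N / 64, 64>>>(array_d, N);        // 4. execute the kernel

    cudaMemcpy(array, array_d, bytes,
               cudaMemcpyDeviceToHost);                // 5. copy results back

    printf("array[0] = %f\n", array[0]);               // expect 1.000000
    cudaFree(array_d);
    return 0;
}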
Hello World

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}
Hello World

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}
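As written on the slide, main() still elides the declarations of A, B, C, and N and the device-memory setup from the steps above. A complete host side for the same kernel might look like the following sketch; the h_/d_ names are illustrative, not from the slides.

#define N 16

int main()
{
    float h_A[N], h_B[N], h_C[N];                 // host arrays
    for (int i = 0; i < N; i++) {
        h_A[i] = (float)i;
        h_B[i] = 2.0f * i;
    }

    float *d_A, *d_B, *d_C;                       // device arrays
    size_t bytes = N * sizeof(float);
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);
    cudaMalloc((void **)&d_C, bytes);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(d_A, d_B, d_C);              // one block of N threads

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}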
Extended C

• Declspecs
  – __global__, __device__, __shared__, __local__, __constant__
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve(float *image)
{
    __shared__ float region[M];
    ...
    region[threadIdx] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);
Initialize Device Calls

• cudaSetDevice(device) selects the device associated with the host thread.
• cudaGetDeviceCount(&devicecount) gets the number of devices.
• cudaGetDeviceProperties(&deviceProp, device) retrieves the device's properties.
• Note: cudaSetDevice() must be called before any __global__ function; otherwise device 0 is automatically selected.
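A short sketch of how these calls fit together (my example, not from the slides): enumerate the devices, print a property of each, then select one before launching any kernel.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int devicecount = 0;
    cudaGetDeviceCount(&devicecount);             // number of CUDA devices

    for (int device = 0; device < devicecount; device++) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, device);
        printf("Device %d: %s (%d multiprocessors)\n",
               device, deviceProp.name, deviceProp.multiProcessorCount);
    }

    cudaSetDevice(0);   // call before any kernel launch; otherwise device 0 is used
    return 0;
}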
CUDA Language Concepts

• CUDA programming model
• CUDA memory model
Some Terminology

• Device = GPU = set of multiprocessors
• Multiprocessor = set of processors & shared memory
• Kernel = GPU program
• Grid = array of thread blocks that execute a kernel
• Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
Thread Batching: Grids and Blocks

• A kernel is executed as a grid of thread blocks
  – All threads share the data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution, for hazard-free shared-memory accesses
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate

[Diagram: the host launches Kernel 1 on Grid 1, made of Blocks (0,0)–(2,1), and Kernel 2 on Grid 2; Block (1,1) expands into Threads (0,0)–(4,2)]

Courtesy: NVIDIA
What Are Those blockIds and threadIds?

• blockIdx.x is a built-in variable in CUDA that returns the blockId, in the x axis, of the block that is executing this block of code.
• threadIdx.x is another built-in variable that returns the threadId, in the x axis, of the thread that is being executed by this stream processor.
• Example code in the kernel:

x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
block_d[x] = blockIdx.x;
thread_d[x] = threadIdx.x;
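Wrapped into a complete kernel and launch, the example code above might look like this sketch (the whoAmI name, BLOCK_SIZE value, and launch dimensions are illustrative, not from the slides):

#define BLOCK_SIZE 4

// Each thread records which block and which thread-within-block ran it.
__global__ void whoAmI(int *block_d, int *thread_d)
{
    int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    block_d[x] = blockIdx.x;
    thread_d[x] = threadIdx.x;
}

// Launched with 2 blocks of BLOCK_SIZE threads, covering indices 0..7:
// whoAmI<<<2, BLOCK_SIZE>>>(block_d, thread_d);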
In the GPU:

[Figure: Block 0 and Block 1, each running Threads 0–3 on processing elements, mapped onto consecutive array elements]
CUDA Device Memory Model Overview

• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can R/W global, constant, and texture memories

[Diagram: a device grid of blocks, each block with shared memory and per-thread registers and local memory; global, constant, and texture memories span the grid and are accessible from the host]
Global, Constant, and Texture Memories (Long-Latency Accesses)

• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
• Texture and constant memories
  – Constants initialized by the host
  – Contents visible to all threads

Courtesy: NVIDIA
CUDA Device Memory Allocation

• cudaMalloc()
  – Allocates an object in the device global memory
  – Requires two parameters:
    • Address of a pointer to the allocated object
    • Size of the allocated object
• cudaFree()
  – Frees an object from the device global memory
  – Takes a pointer to the freed object
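For instance, allocating and then freeing a 64 × 64 single-precision matrix in device global memory might look like this (Md is an illustrative name, not from the slides):

float *Md;
int size = 64 * 64 * sizeof(float);

cudaMalloc((void **)&Md, size);   // parameters: address of the pointer, byte count
cudaFree(Md);                     // parameter: pointer to the freed object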
CUDA Host-Device Data Transfer

• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters:
    • Pointer to source
    • Pointer to destination
    • Number of bytes copied
    • Type of transfer: host to host, host to device, device to host, or device to device
• Asynchronous in CUDA
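As an illustration, continuing with Md and size from the allocation example above (note that in the actual signature the destination pointer comes first):

float M[64 * 64];                                    // host array

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);     // host → device
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);     // device → host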
CUDA Function Declarations

                                    Executed on the:   Only callable from the:
__device__ float DeviceFunc()       device             device
__global__ void KernelFunc()        device             host
__host__ float HostFunc()           host               host

• __global__ defines a kernel function
  – Must return void
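An illustrative pair of functions showing the first two qualifiers in use (my example, not from the slides):

__device__ float square(float x)           // runs on the device, callable from device code only
{
    return x * x;
}

__global__ void squareAll(float *a, int n) // kernel: runs on the device, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = square(a[i]);
}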
Language Extensions: Variable Type Qualifiers

                                             Memory     Scope    Lifetime
__device__ __local__ int LocalVar;           local      thread   thread
__device__ __shared__ int SharedVar;         shared     block    block
__device__ int GlobalVar;                    global     grid     application
__device__ __constant__ int ConstantVar;     constant   grid     application

• __device__ is optional when used with __local__, __shared__, or __constant__
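A short kernel sketch using these qualifiers (illustrative names and sizes; assumes blocks of 256 threads):

__constant__ float scale;                  // constant memory: grid scope, application lifetime
                                           // (set from the host with cudaMemcpyToSymbol)

__global__ void scaleAll(float *data)      // data points into global memory
{
    __shared__ float tile[256];            // shared memory: block scope, block lifetime
    int i = threadIdx.x;                   // i lives in a per-thread register
    tile[i] = data[blockIdx.x * blockDim.x + i];
    __syncthreads();
    data[blockIdx.x * blockDim.x + i] = tile[i] * scale;
}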
Access Times

• Register – dedicated HW – single cycle
• Shared memory – dedicated HW – single cycle
• Local memory – DRAM, no cache – *slow*
• Global memory – DRAM, no cache – *slow*
• Constant memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Texture memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Instruction memory (invisible) – DRAM, cached
CUDA Function Call Restrictions

• __device__ functions cannot have their address taken
• For functions executed on the device:
  – No recursion
  – No static variable declarations inside the function
  – No variable number of arguments
Calling a Kernel Function – Thread Creation

• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);        // 5000 thread blocks
dim3 DimBlock(4, 8, 8);       // 256 threads per block
size_t SharedMemBytes = 64;   // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking
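A sketch of the blocking idiom that last bullet calls for, reusing the earlier vecAdd example:

vecAdd<<<1, N>>>(d_A, d_B, d_C);   // returns to the host immediately
cudaThreadSynchronize();           // block the host until the kernel finishes
                                   // (the CUDA 1.x-era call; later renamed cudaDeviceSynchronize)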
Resources Online

• http://www.ddj.com/hpc-high-performance-computing/207200659
• http://www.nvidia.com/object/cuda_home.html
• http://www.nvidia.com/object/cuda_learn.html
Case Studies

• GPU-enabled EM algorithm – applications in bioinformatics with the Genome Institute, National BIOTEC Center, Thailand
• GPU application for Monte Carlo integration – with Amir Fabin (UTA) & Dick Greenwood
• GPU checkpoint/restart
Thanks