GPGPU – A Current Trend in High Performance Computing


Page 1: GPGPU – A Current Trend in High Performance Computing

GPGPU – A Current Trend in High Performance Computing

Chokchai Box Leangsuksun, PhD

SWEPCO Endowed Professor*, Computer Science
Director, High Performance Computing Initiative

Louisiana Tech University
[email protected]


*SWEPCO endowed professorship is made possible by LA Board of Regents

Page 2: GPGPU – A Current Trend in High Performance Computing

Outline

• Intro to HPC – Box
• GPU Tutorial – Box
• CUDA programming concepts – Box
• Case study: Advanced performance improvement


Page 3: GPGPU – A Current Trend in High Performance Computing

Mainstream CPUs

• CPU speed plateaus at 3–4 GHz

• More cores in a single chip
  – Dual/quad core is now
  – Many-core (GPGPU)

[Diagram: CPU clock speed trend over time, showing a 3–4 GHz cap]

• Traditional applications won't get a free ride

• Conversion to parallel computing (HPC, MT)


This diagram is from the "No Free Lunch" article in DDJ.

Page 4: GPGPU – A Current Trend in High Performance Computing

New trends in computing

• Old & current – SMP, Cluster,• Multicore computers

– Intel Core 2 Duo– AMD 2x 64

• Many-core accelerators– GPGPU, FPGA, Cell • More Many brains in one computer• Not to increase CPU frequency• Harness many computers – a cluster computing


Page 5: GPGPU – A Current Trend in High Performance Computing

What is HPC?

• High Performance Computing – parallel computing, supercomputing
  – Achieve the fastest possible computing outcome
  – Subdivide a very large job into many pieces
  – Enabled by multiple high-speed CPUs, networking, software & programming paradigms – fastest possible solution
  – Technologies that help solve non-trivial tasks, including scientific, engineering, medical, business, and entertainment problems
• Time to insights, time to discovery, time to market


Page 6: GPGPU – A Current Trend in High Performance Computing

Parallel Programming Concepts

Conventional serial execution: the problem is represented as a series of instructions that are executed by the CPU.

Parallel execution of a problem involves partitioning the problem into multiple executable parts that are mutually exclusive and collectively exhaustive, represented as a partially ordered set exhibiting concurrency.

[Diagram: a problem solved serially as one instruction stream on a single CPU vs. the same problem partitioned into tasks, each an instruction stream executed on its own CPU]

Parallel computing takes advantage of concurrency to:
• Solve larger problems in less time
• Save on wall clock time
• Overcome memory constraints
• Utilize non-local resources

Source: Thomas Sterling's intro to HPC

Page 7: GPGPU – A Current Trend in High Performance Computing

HPC Applications and Major Industries
• Finite Element Modeling
  – Auto/aero
• Fluid Dynamics
  – Auto/aero, consumer packaged goods mfgs, process mfg, disaster preparedness (tsunami)
• Imaging
  – Seismic & medical
• Finance & Business
  – Banks, brokerage houses (regression analysis, risk, options pricing, what-if, …)
  – Wal-Mart's HPC in their operations
• Molecular Modeling
  – Biotech and pharmaceuticals

Complex Problems, Large Datasets, Long Runs
This slide is from the Intel presentation "Technologies for Delivering Peak Performance on HPC and Grid Applications".

Page 8: GPGPU – A Current Trend in High Performance Computing

The GPGPU Tutorial.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 9: GPGPU – A Current Trend in High Performance Computing

What & Why is GPGPU ?What & Why is GPGPU ?• General Purpose computation using GPU

in applications other than 3D graphicsin applications other than 3D graphics– GPU accelerates critical path of application

• One of the hottest computing trends– Heterogeneous computing

• Data parallel algorithms leverage GPU attributesLarge data arrays streaming throughput– Large data arrays, streaming throughput

– Fine-grain SIMD parallelism– Low-latency floating point (FP) computation

• Applications – see //GPGPU.org– Game effects (FX) physics, image processing

• Oil exploration, Realtime MRI-CT-scan,

© David Kirk/NVIDIA

p , ,

– Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

Page 10: GPGPU – A Current Trend in High Performance Computing

Why GPGPU?
• Large number of cores
  – 100–1000 cores in a single card
• Low cost – less than $100–$1500
• Green computing
  – Low power consumption – 135 watts/card
  – 135 W vs. 30,000 W (300 watts × 100)
• 1 card can perform like more than 100 desktops
  – $750 vs. $50,000 ($500 × 100)


Page 11: GPGPU – A Current Trend in High Performance Computing

CPU vs. GPU
• CPU
  – Fast caches
  – Branching adaptability
  – High performance
• GPU
  – Multiple ALUs
  – Fast onboard memory
  – High throughput on parallel tasks
    • Executes a program on each fragment/vertex
• CPUs are great for task parallelism
• GPUs are great for data parallelism

Supercomputing 2008 Education Program

Page 12: GPGPU – A Current Trend in High Performance Computing

CPU vs. GPU – Hardware

• More transistors devoted to data processing

Supercomputing 2008 Education Program


Page 13: GPGPU – A Current Trend in High Performance Computing

Two major players

Page 14: GPGPU – A Current Trend in High Performance Computing

Parallel Computing on a GPU
• NVIDIA GPU Computing Architecture
  – Via a HW device interface
  – In laptops, desktops, workstations, servers
• Tesla T10 / S1070: 1–4 TFLOPS
• AMD/ATI 4870 X2: 1600 cores
• NVIDIA Tegra is an all-in-one (system-on-a-chip) processor architecture derived from the ARM family
• GPU parallelism is better than Moore's law, more than doubling every year
• GPGPU is a GPU that allows the user to process both graphics and non-graphics applications.

[Images: ATI 4850, GeForce 8800]
© David Kirk/NVIDIA

Page 15: GPGPU – A Current Trend in High Performance Computing

Requirements of a GPU system
• GPGPU is a GPU that allows the user to process both graphics and non-graphics applications.
• GPGPU-capable video card
• Power supply
• Cooling
• PCI-Express 16x

[Images: Tesla D870, GeForce 8800]
© David Kirk/NVIDIA

Page 16: GPGPU – A Current Trend in High Performance Computing

Examples of GPU devices

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 17: GPGPU – A Current Trend in High Performance Computing

NVIDIA GeForce 8800 (G80)
• The eighth generation of NVIDIA's GeForce graphics cards
• High-performance CUDA-enabled GPGPU
• 128 cores
• Memory: 256–768 MB, or 1.5 GB in Tesla
• High-speed memory bandwidth
• Supports Scalable Link Interface (SLI)

Page 18: GPGPU – A Current Trend in High Performance Computing

NVIDIA Tesla™
• Features
  – GPU computing for HPC
  – No display ports
  – Dedicated to computation
  – For massively multi-threaded computing
  – Supercomputing performance

Page 19: GPGPU – A Current Trend in High Performance Computing

NVIDIA Tesla Card >>
• C-Series (card) = 1 GPU with 1.5 GB
• D-Series (deskside unit) = 2 GPUs
• S-Series (1U server) = 4 GPUs
• Note: 1 G80 GPU = 128 cores = ~500 GFLOPS
• 1 T10 = 240 cores = 1 TFLOPS

<< NVIDIA G80

Page 20: GPGPU – A Current Trend in High Performance Computing

© David Kirk/NVIDIA

This slide is from the NVIDIA CUDA tutorial.

Page 21: GPGPU – A Current Trend in High Performance Computing

ATI Stream (1)


Page 22: GPGPU – A Current Trend in High Performance Computing

ATI 4870


Page 23: GPGPU – A Current Trend in High Performance Computing

ATI 4870 X2


Page 24: GPGPU – A Current Trend in High Performance Computing

Architecture of ATI Radeon 4000 series

Page 25: GPGPU – A Current Trend in High Performance Computing

This slide is from an ATI presentation.

Page 26: GPGPU – A Current Trend in High Performance Computing

This slide is from an ATI presentation.

Page 27: GPGPU – A Current Trend in High Performance Computing

Intel Larrabee
• A hybrid between a multi-core CPU and a GPU
• Its coherent cache hierarchy and x86 architecture compatibility are CPU-like
• Its wide SIMD vector units and texture sampling hardware are GPU-like

Page 28: GPGPU – A Current Trend in High Performance Computing

Introduction to OpenCL
Toward a new approach in computing
Moayad Almohaishi

Page 29: GPGPU – A Current Trend in High Performance Computing

Introduction to OpenCL
• OpenCL stands for Open Computing Language.
• It comes from consortium efforts by Apple, NVIDIA, AMD, etc.
• It is managed by the Khronos Group, which was responsible for OpenGL.
• It took 6 months to come up with the specifications.

Page 30: GPGPU – A Current Trend in High Performance Computing

OpenCL
1. Royalty-free.
2. Supports both task- and data-parallel programming modes.
3. Works for vendor-agnostic GPGPUs,
4. including multi-core CPUs.
5. Works on Cell processors.
6. Supports handhelds and mobile devices.
7. Based on the C language under C99.

Page 31: GPGPU – A Current Trend in High Performance Computing

Page 32: GPGPU – A Current Trend in High Performance Computing

OpenCL Platform Model

Page 33: GPGPU – A Current Trend in High Performance Computing

CPUs + GPU platforms


Page 34: GPGPU – A Current Trend in High Performance Computing

Performance of GPGPU

Note: a cluster of 30 dual-Xeon 2.8 GHz nodes has a peak performance of ~336 GFLOPS.
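For context, that peak figure works out if each Xeon sustains 2 double-precision FLOPs per cycle (an SSE2-era assumption): 30 nodes × 2 CPUs × 2.8 GHz × 2 FLOPs/cycle = 336 GFLOPS.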

Page 35: GPGPU – A Current Trend in High Performance Computing

© David Kirk/NVIDIA

Page 36: GPGPU – A Current Trend in High Performance Computing

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 37: GPGPU – A Current Trend in High Performance Computing

CUDA

• "Compute Unified Device Architecture"
• General-purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
  – Compute-oriented drivers, language, and tools
• Driver for loading computation programs into the GPU
  – Standalone driver – optimized for computation
  – Interface designed for compute – graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 38: GPGPU – A Current Trend in High Performance Computing

An Example of Physical Reality Behind CUDA

[Diagram: CPU (host) connected to a GPU w/ local DRAM (device)]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 39: GPGPU – A Current Trend in High Performance Computing

Parallel Computing on a GPU
• NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
• Programmable in C with CUDA tools
• Multithreaded SIMD model uses application data parallelism and thread parallelism

[Images: GeForce 8800, Tesla D870, Tesla S870]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 40: GPGPU – A Current Trend in High Performance Computing

GeForce 8800
16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU

[Architecture diagram: Host feeds an Input Assembler and Thread Execution Manager, which dispatch to an array of streaming multiprocessors, each with a Parallel Data Cache and texture unit; load/store paths connect all of them to Global Memory]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 41: GPGPU – A Current Trend in High Performance Computing

Introduction to CUDA programming

These materials are excerpted from David Kirk/NVIDIA and Wen-mei W. Hwu's, and Christian Trefftz / Greg Wolffe's, SC08 GPU tutorials.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 42: GPGPU – A Current Trend in High Performance Computing

Data-parallel Programming
• Think of the GPU as a massively-threaded co-processor
• Write "kernel" functions that execute on the device, processing multiple data elements in parallel
• Keep it busy! – massive threading
• Keep your data close! – local memory

Supercomputing 2008 Education Program

Page 43: GPGPU – A Current Trend in High Performance Computing

Pixel / Thread Processing

Supercomputing 2008 Education Program

Page 44: GPGPU – A Current Trend in High Performance Computing

Steps for CUDA Programming
1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute kernel (calling a __global__ function)
5. Copy data from device memory (retrieve results)

Page 45: GPGPU – A Current Trend in High Performance Computing

Initially:

[Diagram: "array" sits in the host's memory; the GPU card's memory is empty]

Supercomputing 2008 Education Program

Page 46: GPGPU – A Current Trend in High Performance Computing

Allocate memory in the GPU card

[Diagram: "array" in the host's memory; "array_d" allocated in the GPU card's memory]

Supercomputing 2008 Education Program

Page 47: GPGPU – A Current Trend in High Performance Computing

Copy content from the host's memory to the GPU card's memory

[Diagram: contents of "array" copied into "array_d"]

Supercomputing 2008 Education Program

Page 48: GPGPU – A Current Trend in High Performance Computing

Execute code on the GPU

[Diagram: kernel code runs on the GPU multiprocessors, operating on "array_d"]

Supercomputing 2008 Education Program

Page 49: GPGPU – A Current Trend in High Performance Computing

Copy results back to the host memory

[Diagram: contents of "array_d" copied back into "array"]

Supercomputing 2008 Education Program

Page 50: GPGPU – A Current Trend in High Performance Computing

Steps for CUDA Programming
1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute kernel (calling a __global__ function)
5. Copy data from device memory (retrieve results)

Page 51: GPGPU – A Current Trend in High Performance Computing

Hello World

    // Kernel definition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
    }

    int main()
    {
        // Kernel invocation
        vecAdd<<<1, N>>>(A, B, C);
    }

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 52: GPGPU – A Current Trend in High Performance Computing

Hello World

    // Kernel definition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        // Kernel invocation
        vecAdd<<<1, N>>>(A, B, C);
    }

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign
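The main() above leaves out steps 2, 3, and 5 of the recipe: A, B, and C are never allocated on the device. A minimal runnable completion, assuming a vector length N of 256 and dummy input data:

    #include <stdio.h>

    #define N 256   // assumed vector length for this sketch

    // Kernel definition (as on the slide)
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        size_t size = N * sizeof(float);
        float hA[N], hB[N], hC[N];
        for (int i = 0; i < N; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }

        // Steps 2-3: device memory allocation, copy data to device memory
        float *dA, *dB, *dC;
        cudaMalloc((void**)&dA, size);
        cudaMalloc((void**)&dB, size);
        cudaMalloc((void**)&dC, size);
        cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, size, cudaMemcpyHostToDevice);

        // Step 4: execute kernel, one block of N threads
        vecAdd<<<1, N>>>(dA, dB, dC);

        // Step 5: copy data from device memory (retrieve results)
        cudaMemcpy(hC, dC, size, cudaMemcpyDeviceToHost);
        printf("hC[2] = %f (expect 6.0)\n", hC[2]);

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }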

Page 53: GPGPU – A Current Trend in High Performance Computing

Extended C
• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

    __device__ float filter[N];

    __global__ void convolve (float *image) {
        __shared__ float region[M];
        ...
        region[threadIdx] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage = cudaMalloc(bytes)

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>> (myimage);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 54: GPGPU – A Current Trend in High Performance Computing

Initialize Device calls
• cudaSetDevice(device) selects the device associated with the host thread.
• cudaGetDeviceCount(&devicecount) gets the number of devices.
• cudaGetDeviceProperties(&deviceProp, device) retrieves the device's properties.
• Note: cudaSetDevice() must be called before any __global__ function, otherwise device 0 is automatically selected. (A combined sketch follows below.)
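A minimal sketch combining the three calls (the printed fields and the pick-device-0 policy are just illustrations):

    #include <stdio.h>

    // Hypothetical helper: enumerate devices, then pick device 0.
    void pickDevice(void)
    {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);

        for (int dev = 0; dev < deviceCount; ++dev) {
            cudaDeviceProp deviceProp;
            cudaGetDeviceProperties(&deviceProp, dev);
            printf("Device %d: %s, %d multiprocessors\n",
                   dev, deviceProp.name, deviceProp.multiProcessorCount);
        }

        // Must run before the first __global__ launch, or device 0 is chosen
        cudaSetDevice(0);
    }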

Page 55: GPGPU – A Current Trend in High Performance Computing

CUDA Language concept
• CUDA Programming Model
• CUDA Memory Model

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 56: GPGPU – A Current Trend in High Performance Computing

Some Terminology
• device = GPU = set of multiprocessors
• multiprocessor = set of processors & shared memory
• kernel = GPU program
• grid = array of thread blocks that execute a kernel
• thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
(The sketch below maps these terms onto code.)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign
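A small sketch relating the terms to the launch syntax (d_data is assumed to be a device array allocated with cudaMalloc):

    // kernel = the GPU program, executed by every thread in the grid
    __global__ void fillIndex(int *data)
    {
        // blockDim.x = threads per block; this computes a global index
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = i;
    }

    // grid = array of thread blocks; thread block = group of SIMD threads
    // that can communicate via shared memory:
    //   fillIndex<<<8, 128>>>(d_data);   // a grid of 8 blocks, 128 threads each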

Page 57: GPGPU – A Current Trend in High Performance Computing

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 58: GPGPU – A Current Trend in High Performance Computing

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 59: GPGPU – A Current Trend in High Performance Computing

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 60: GPGPU – A Current Trend in High Performance Computing

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 61: GPGPU – A Current Trend in High Performance Computing

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 62: GPGPU – A Current Trend in High Performance Computing

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 63: GPGPU – A Current Trend in High Performance Computing

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 64: GPGPU – A Current Trend in High Performance Computing

Thread Batching: Grids and Blocks
• A kernel is executed as a grid of thread blocks
  – All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution
    • For hazard-free shared memory accesses
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate

[Diagram: the host launches Kernel 1 on Grid 1, a 3×2 array of blocks, and Kernel 2 on Grid 2; Block (1,1) is expanded into a 5×3 array of threads]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign
Courtesy: NVIDIA

Page 65: GPGPU – A Current Trend in High Performance Computing

What are those blockIds and threadIds?
• blockIdx.x is a built-in variable in CUDA that returns the blockId in the x axis of the block that is executing this block of code
• threadIdx.x is another built-in variable that returns the threadId in the x axis of the thread being executed by this stream processor
• Example code in the kernel (see the completed sketch below):

    x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    block_d[x] = blockIdx.x;
    thread_d[x] = threadIdx.x;

Supercomputing 2008 Education Program
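A completed version of that snippet as a full kernel (the kernel name and BLOCK_SIZE value are assumptions):

    #define BLOCK_SIZE 4   // assumed block width, matching the next slide's figure

    __global__ void whoAmI(int *block_d, int *thread_d)
    {
        // Global position = block index times block width, plus thread index
        int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        block_d[x]  = blockIdx.x;   // which block computed this element
        thread_d[x] = threadIdx.x;  // which thread within that block
    }

    // Launched as, e.g.: whoAmI<<<2, BLOCK_SIZE>>>(block_d, thread_d);
    // With 2 blocks of 4 threads, block_d = {0,0,0,0,1,1,1,1} and
    // thread_d = {0,1,2,3,0,1,2,3} after copying back to the host.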

Page 66: GPGPU – A Current Trend in High Performance Computing

In the GPU:

[Diagram: array elements mapped onto processing elements; Block 0 and Block 1 each run threads 0–3, and each thread handles its own array elements]

Supercomputing 2008 Education Program

Page 67: GPGPU – A Current Trend in High Performance Computing

CUDA Device Memory Model Overview
• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read-only per-grid constant memory
  – Read-only per-grid texture memory
• The host can R/W global, constant, and texture memories

[Diagram: (Device) Grid containing Blocks (0,0) and (1,0), each with shared memory, per-thread registers, and per-thread local memory; global, constant, and texture memories are shared across the grid and accessible from the host]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 68: GPGPU – A Current Trend in High Performance Computing

Global, Constant, and Texture Memories (Long-Latency Accesses)
• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
• Texture and constant memories
  – Constants initialized by the host
  – Contents visible to all threads

[Diagram: same device memory model as the previous slide]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign
Courtesy: NVIDIA

Page 69: GPGPU – A Current Trend in High Performance Computing

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 70: GPGPU – A Current Trend in High Performance Computing

CUDA Device Memory Allocation
• cudaMalloc()
  – Allocates an object in the device Global Memory
  – Requires two parameters
    • Address of a pointer to the allocated object
    • Size of the allocated object
• cudaFree()
  – Frees an object from device Global Memory
    • Parameter: pointer to the freed object
(See the sketch below.)

[Diagram: device memory model, with cudaMalloc/cudaFree operating on Global Memory]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign
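A short sketch of the pair in use (the 64 × 64 array size is illustrative):

    // Allocate a 64 x 64 single-precision array on the device, then free it
    float* Md;
    int size = 64 * 64 * sizeof(float);

    cudaMalloc((void**)&Md, size);   // parameter 1: address of the pointer
                                     // parameter 2: size in bytes
    // ... use Md in kernels ...
    cudaFree(Md);                    // parameter: pointer to the freed object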

Page 71: GPGPU – A Current Trend in High Performance Computing

CUDA Host-Device Data Transfer
• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer
      – Host to Host
      – Host to Device
      – Device to Host
      – Device to Device
• Asynchronous in CUDA
(See the sketch below.)

[Diagram: device memory model, with cudaMemcpy moving data between host memory and device Global Memory]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign
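Continuing the Md sketch from the previous slide (M is an assumed host array of the same size), the two common directions look like this; as with memcpy, the destination pointer comes first:

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host -> device
    // ... kernel launches operate on Md ...
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);   // device -> host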

Page 72: GPGPU – A Current Trend in High Performance Computing

CUDA Function Declarations

                                      Executed on the:   Only callable from the:
    __device__ float DeviceFunc()     device             device
    __global__ void  KernelFunc()     device             host
    __host__   float HostFunc()       host               host

• __global__ defines a kernel function
  – Must return void
(A combined sketch follows below.)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign
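A small sketch showing the three qualifiers side by side (the function names are made up for illustration):

    __device__ float square(float x)       // runs on device, callable from device
    {
        return x * x;
    }

    __global__ void squareAll(float *data) // kernel: runs on device, launched from host
    {
        int i = threadIdx.x;
        data[i] = square(data[i]);
    }

    __host__ float squareOnHost(float x)   // ordinary host function (the default)
    {
        return x * x;
    }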

Page 73: GPGPU – A Current Trend in High Performance Computing

Language Extensions: Variable Type Qualifiers

                                                   Memory     Scope    Lifetime
    __device__ __local__    int LocalVar;          local      thread   thread
    __device__ __shared__   int SharedVar;         shared     block    block
    __device__              int GlobalVar;         global     grid     application
    __device__ __constant__ int ConstantVar;       constant   grid     application

• __device__ is optional when used with __local__, __shared__, or __constant__
(A declaration sketch follows below.)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign
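A sketch declaring one variable of each class (names are illustrative; assumes a 64-thread block):

    __constant__ float coeff[16];   // constant memory: grid scope, app lifetime
    __device__   float bias;        // global memory: grid scope, app lifetime

    __global__ void qualifierDemo(float *out)
    {
        __shared__ float tile[64];  // shared memory: one copy per block
        int i = threadIdx.x;        // automatic variable: per-thread register

        tile[i] = coeff[i % 16];
        __syncthreads();            // make the block's writes to tile visible
        out[i] = tile[i] + bias;
    }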

Page 74: GPGPU – A Current Trend in High Performance Computing

Access Times
• Register – dedicated HW – single cycle
• Shared memory – dedicated HW – single cycle
• Local memory – DRAM, no cache – *slow*
• Global memory – DRAM, no cache – *slow*
• Constant memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Texture memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Instruction memory (invisible) – DRAM, cached
(See the sketch below.)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign
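To see why single-cycle shared memory matters, a sketch in which each block stages a tile of global memory once and then reads neighbors from the fast copy (the stencil itself is a made-up example):

    #define TILE 64   // assumed threads per block

    __global__ void neighborSum(const float *in, float *out)
    {
        __shared__ float tile[TILE];
        int g = blockIdx.x * TILE + threadIdx.x;

        tile[threadIdx.x] = in[g];   // one slow global (DRAM) read per element
        __syncthreads();

        // Subsequent reads hit single-cycle shared memory
        float left  = (threadIdx.x > 0)        ? tile[threadIdx.x - 1] : 0.0f;
        float right = (threadIdx.x < TILE - 1) ? tile[threadIdx.x + 1] : 0.0f;
        out[g] = left + tile[threadIdx.x] + right;
    }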

Page 75: GPGPU – A Current Trend in High Performance Computing

CUDA function call restrictions
• __device__ functions cannot have their address taken
• For functions executed on the device:
  – No recursion
  – No static variable declarations inside the function
  – No variable number of arguments
(Examples of what this rules out follow below.)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign
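For concreteness, a few patterns these restrictions rule out, shown as comments (the function bodies are invented):

    // Not allowed in device code on this hardware generation:
    //
    // __device__ int fib(int n) {            // recursion
    //     return n < 2 ? n : fib(n - 1) + fib(n - 2);
    // }
    //
    // __device__ int nextId(void) {          // static variable inside the function
    //     static int counter = 0;
    //     return counter++;
    // }
    //
    // __device__ void log(const char *fmt, ...);  // variable number of arguments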

Page 76: GPGPU – A Current Trend in High Performance Computing

Calling a Kernel Function – Thread Creation
• A kernel function must be called with an execution configuration:

    __global__ void KernelFunc(...);
    dim3 DimGrid(100, 50);       // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);      // 256 threads per block
    size_t SharedMemBytes = 64;  // 64 bytes of shared memory
    KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synch is needed for blocking

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign
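A runnable version of that configuration with explicit blocking (the empty kernel is a placeholder):

    // Hypothetical kernel with no arguments, for illustration
    __global__ void KernelFunc() { }

    int main()
    {
        dim3 DimGrid(100, 50);        // 5000 thread blocks
        dim3 DimBlock(4, 8, 8);       // 256 threads per block
        size_t SharedMemBytes = 64;   // 64 bytes of dynamic shared memory

        KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>();  // returns immediately

        // Block the host until the device finishes:
        cudaThreadSynchronize();      // era-appropriate; later renamed cudaDeviceSynchronize()
        return 0;
    }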

Page 77: GPGPU – A Current Trend in High Performance Computing

Resources online

• http://www.ddj.com/hpc-high-performance-computing/207200659

• http://www.nvidia.com/object/cuda_home.html#
• http://www.nvidia.com/object/cuda_learn.html

Supercomputing 2008 Education Program


Page 78: GPGPU – A Current Trend in High Performance Computing

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 79: GPGPU – A Current Trend in High Performance Computing

Case Studies
• GPU-enabled EM algorithm – applications in bioinformatics, with the Genome Institute, National BIOTEC Center, Thailand
• GPU application for Monte Carlo integration – with Amir Fabin (UTA) & Dick Greenwood
• GPU checkpoint/restart

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign

Page 80: GPGPU – A Current Trend in High Performance Computing

Thanks

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign