TRANSCRIPT

GPGPU – A Current Trend in High Performance Computing

Chokchai Box Leangsuksun, PhD
SWEPCO Endowed Professor*, Computer Science; Director, High Performance Computing Initiative
Louisiana Tech University, [email protected]

*The SWEPCO endowed professorship is made possible by the LA Board of Regents.
Outline

• Intro to HPC – Box
• GPU Tutorial – Box
• CUDA programming concepts – Box
• Case study: advanced performance improvement
Mainstream CPUs

• CPU speed plateaus at 3–4 GHz
• More cores in a single chip
  – Dual/quad core is here now
  – Many-core (GPGPU) is next
• Traditional applications won't get a free ride
• Conversion to parallel computing (HPC, MT) is required

[Diagram: CPU clock speeds flattening at the 3–4 GHz cap; from the "no free lunch" article in DDJ]
New Trends in Computing

• Old & current – SMP, cluster
• Multicore computers
  – Intel Core 2 Duo
  – AMD X2 64
• Many-core accelerators – GPGPU, FPGA, Cell
• Many brains in one computer
• Not increasing CPU frequency
• Harnessing many computers – cluster computing
What is HPC?

• High Performance Computing – parallel computing, supercomputing
  – Achieve the fastest possible computing outcome
  – Subdivide a very large job into many pieces
  – Enabled by multiple high-speed CPUs, networking, software & programming paradigms – the fastest possible solution
  – Technologies that help solve non-trivial tasks, including scientific, engineering, medical, business, and entertainment problems
• Time to insights, time to discovery, time to market
Parallel Programming Concepts

Conventional serial execution: the problem is represented as a series of instructions that are executed, one after another, by the CPU.

Parallel execution involves partitioning the problem into multiple executable parts that are mutually exclusive and collectively exhaustive, represented as a partially ordered set exhibiting concurrency; the parts are executed as tasks on multiple CPUs.

[Diagram: one problem → instructions → a single CPU, versus one problem → tasks → multiple CPUs]

Parallel computing takes advantage of concurrency to:
• Solve larger problems in less time
• Save on wall-clock time
• Overcome memory constraints
• Utilize non-local resources

Source from Thomas Sterling's intro to HPC.
HPC Applications and Major Industries

• Finite element modeling
  – Auto/aero
• Fluid dynamics
  – Auto/aero, consumer packaged goods manufacturers, process manufacturing, disaster preparedness (tsunami)
• Imaging
  – Seismic & medical
• Finance & business
  – Banks, brokerage houses (regression analysis, risk, options pricing, what-if analysis, …)
  – Wal-Mart's HPC in their operations
• Molecular modeling
  – Biotech and pharmaceuticals

Complex problems, large datasets, long runs.

This slide is from the Intel presentation "Technologies for Delivering Peak Performance on HPC and Grid Applications".
The GPGPU Tutorial

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign
What & Why is GPGPU?

• General-purpose computation using a GPU in applications other than 3D graphics
  – The GPU accelerates the critical path of the application
• One of the hottest computing trends – heterogeneous computing
• Data-parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating-point (FP) computation
• Applications – see GPGPU.org
  – Game effects (FX), physics, image processing
  – Oil exploration, real-time MRI/CT scans
  – Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
Why GPGPU?

• Large number of cores – 100–1000 cores in a single card
• Low cost – less than $100–$1500 per card
• Green computing – low power consumption, ~135 watts/card
  – 135 W vs. 30,000 W (300 watts × 100 desktops)
• 1 card can outperform 100+ desktops
  – $750 vs. $50,000 ($500 × 100)
CPU vs. GPU

• CPU
  – Fast caches
  – Branching adaptability
  – High performance
• GPU
  – Multiple ALUs
  – Fast onboard memory
  – High throughput on parallel tasks
    • Executes a program on each fragment/vertex
• CPUs are great for task parallelism
• GPUs are great for data parallelism

Supercomputing 2008 Education Program
CPU vs. GPU – Hardware

• More transistors devoted to data processing
Two major players
Parallel Computing on a GPU

• NVIDIA GPU Computing Architecture
  – Via a HW device interface
  – In laptops, desktops, workstations, servers
• Tesla T10 / S1070: 1–4 TFLOPS
• AMD/ATI 4870 X2: 1600 cores
• NVIDIA Tegra is an all-in-one (system-on-a-chip) processor architecture derived from the ARM family
• GPU parallelism is growing faster than Moore's law, roughly doubling every year
• A GPGPU is a GPU that allows the user to process both graphics and non-graphics applications.

[Images: ATI 4850, GeForce 8800]

Requirements of a GPU System

• GPGPU-capable video card
• Adequate power supply
• Cooling
• PCI-Express 16x slot

[Images: Tesla D870, GeForce 8800]
Examples of GPU Devices
NVIDIA GeForce 8800 (G80)

• The eighth generation of NVIDIA's GeForce graphics cards
• High-performance, CUDA-enabled GPGPU
• 128 cores
• Memory: 256–768 MB, or 1.5 GB in Tesla
• High-speed memory bandwidth
• Supports Scalable Link Interface (SLI)
NVIDIA Tesla™

• Features
  – GPU computing for HPC
  – No display ports
  – Dedicated to computation
  – For massively multi-threaded computing
  – Supercomputing performance
• Tesla card series
  – C-Series (card) = 1 GPU with 1.5 GB
  – D-Series (deskside unit) = 2 GPUs
  – S-Series (1U server) = 4 GPUs
• Note: 1 G80 GPU = 128 cores ≈ 500 GFLOPS; 1 T10 = 240 cores ≈ 1 TFLOPS

This slide is from the NVIDIA CUDA tutorial.
ATI Stream (1)

ATI 4870

ATI 4870 X2

Architecture of the ATI Radeon 4000 series

These slides are from an ATI presentation.
Intel Larrabee

• A hybrid between a multi-core CPU and a GPU
• Its coherent cache hierarchy and x86 architecture compatibility are CPU-like
• Its wide SIMD vector units and texture-sampling hardware are GPU-like
Introduction to OpenCL

Toward a new approach in computing

Moayad Almohaishi
Introduction to OpenCL

• OpenCL stands for Open Computing Language.
• It comes from a consortium effort including Apple, NVIDIA, AMD, etc.
• It is managed by the Khronos Group, which was also responsible for OpenGL.
• It took 6 months to come up with the specifications.
OpenCL

1. Royalty-free.
2. Supports both task- and data-parallel programming modes.
3. Works for vendor-agnostic GPGPUs,
4. including multi-core CPUs.
5. Works on Cell processors.
6. Supports handhelds and mobile devices.
7. Based on the C language under C99.
OpenCL Platform Model

CPUs + GPU platforms
Performance of GPGPU

Note: for comparison, a cluster of 30 dual-Xeon 2.8 GHz nodes has a peak performance of ~336 GFLOPS.
CUDA

• "Compute Unified Device Architecture"
• General-purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
  – Compute-oriented drivers, language, and tools
• Driver for loading computation programs onto the GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute – a graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management
An Example of Physical Reality Behind CUDA

CPU (host)  ←→  GPU w/ local DRAM (device)
Parallel Computing on a GPU

• NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
• Programmable in C with CUDA tools
• Multithreaded SIMD model uses application data parallelism and thread parallelism

[Images: GeForce 8800, Tesla D870, Tesla S870]
GeForce 8800

16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU.

[Diagram: host → input assembler → thread execution manager, feeding an array of streaming multiprocessors, each with a parallel data cache and texture unit, with load/store paths to global memory]
Introduction to CUDA Programming

These materials are excerpted from David Kirk/NVIDIA and Wen-mei W. Hwu's and Christian Trefftz / Greg Wolffe's SC08 GPU tutorials.
Data-parallel Programming

• Think of the GPU as a massively-threaded co-processor
• Write "kernel" functions that execute on the device, processing multiple data elements in parallel
• Keep it busy! – massive threading
• Keep your data close! – local memory
Pixel / Thread Processing
Steps for CUDA Programming

1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute the kernel (call a __global__ function)
5. Copy data back from device memory (retrieve results)
Initially:

[Figure: the host's memory holds "array"; the GPU card's memory is empty]
Allocate Memory in the GPU Card

[Figure: "array" in the host's memory; "array_d" now allocated in the GPU card's memory]
Copy Content from the Host's Memory to the GPU Card's Memory

[Figure: the contents of "array" are copied into "array_d"]
Execute Code on the GPU

[Figure: the kernel code runs on the GPU multiprocessors, operating on "array_d"]
Copy Results Back to the Host Memory

[Figure: the contents of "array_d" are copied back into "array"]
Steps for CUDA Programming (recap)

1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute the kernel (call a __global__ function)
5. Copy data back from device memory (retrieve results)
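To make these five steps concrete, here is a minimal end-to-end host program. This sketch is not from the original slides; the incrementArray kernel and the launch dimensions are illustrative only, and error checking is omitted.

// Minimal sketch of the five steps above.
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread adds 1.0 to one array element.
__global__ void incrementArray(float *a_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a_d[i] += 1.0f;
}

int main(void)
{
    const int N = 256;
    size_t bytes = N * sizeof(float);
    float array[N];
    for (int i = 0; i < N; i++)
        array[i] = (float)i;

    cudaSetDevice(0);                                  // 1. device initialization

    float *array_d;
    cudaMalloc((void **)&array_d, bytes);              // 2. device memory allocation

    cudaMemcpy(array_d, array, bytes,
               cudaMemcpyHostToDevice);                // 3. copy data to device memory

    incrementArray<<<N / 64, 64>>>(array_d, N);        // 4. execute the kernel

    cudaMemcpy(array, array_d, bytes,
               cudaMemcpyDeviceToHost);                // 5. copy results back

    printf("array[0] = %f\n", array[0]);               // expect 1.000000
    cudaFree(array_d);
    return 0;
}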
Hello World

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}
Hello World

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}
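As written on the slide, main() still elides the declarations of A, B, C, and N and the device-memory setup from the steps above. A complete host side for the same kernel might look like the following sketch; the h_/d_ names are illustrative, not from the slides.

#define N 16

int main()
{
    float h_A[N], h_B[N], h_C[N];                 // host arrays
    for (int i = 0; i < N; i++) {
        h_A[i] = (float)i;
        h_B[i] = 2.0f * i;
    }

    float *d_A, *d_B, *d_C;                       // device arrays
    size_t bytes = N * sizeof(float);
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);
    cudaMalloc((void **)&d_C, bytes);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(d_A, d_B, d_C);              // one block of N threads

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}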
Extended C

• Declspecs
  – __global__, __device__, __shared__, __local__, __constant__
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve(float *image)
{
    __shared__ float region[M];
    ...
    region[threadIdx] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);
Initialize Device Calls

• cudaSetDevice(device) selects the device associated with the host thread.
• cudaGetDeviceCount(&devicecount) gets the number of devices.
• cudaGetDeviceProperties(&deviceProp, device) retrieves the device's properties.
• Note: cudaSetDevice() must be called before any __global__ function; otherwise device 0 is automatically selected.
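A short sketch of how these calls fit together (my example, not from the slides): enumerate the devices, print a property of each, then select one before launching any kernel.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int devicecount = 0;
    cudaGetDeviceCount(&devicecount);             // number of CUDA devices

    for (int device = 0; device < devicecount; device++) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, device);
        printf("Device %d: %s (%d multiprocessors)\n",
               device, deviceProp.name, deviceProp.multiProcessorCount);
    }

    cudaSetDevice(0);   // call before any kernel launch; otherwise device 0 is used
    return 0;
}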
CUDA Language Concepts

• CUDA programming model
• CUDA memory model
Some Terminology

• Device = GPU = set of multiprocessors
• Multiprocessor = set of processors & shared memory
• Kernel = GPU program
• Grid = array of thread blocks that execute a kernel
• Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
Thread Batching: Grids and Blocks

• A kernel is executed as a grid of thread blocks
  – All threads share the data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution, for hazard-free shared-memory accesses
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate

[Diagram: the host launches Kernel 1 on Grid 1, made of Blocks (0,0)–(2,1), and Kernel 2 on Grid 2; Block (1,1) expands into Threads (0,0)–(4,2)]

Courtesy: NVIDIA
What Are Those blockIds and threadIds?

• blockIdx.x is a built-in variable in CUDA that returns the blockId, in the x axis, of the block that is executing this block of code.
• threadIdx.x is another built-in variable that returns the threadId, in the x axis, of the thread that is being executed by this stream processor.
• Example code in the kernel:

x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
block_d[x] = blockIdx.x;
thread_d[x] = threadIdx.x;
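Wrapped into a complete kernel and launch, the example code above might look like this sketch (the whoAmI name, BLOCK_SIZE value, and launch dimensions are illustrative, not from the slides):

#define BLOCK_SIZE 4

// Each thread records which block and which thread-within-block ran it.
__global__ void whoAmI(int *block_d, int *thread_d)
{
    int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    block_d[x] = blockIdx.x;
    thread_d[x] = threadIdx.x;
}

// Launched with 2 blocks of BLOCK_SIZE threads, covering indices 0..7:
// whoAmI<<<2, BLOCK_SIZE>>>(block_d, thread_d);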
In the GPU:

[Figure: Block 0 and Block 1, each running Threads 0–3 on processing elements, mapped onto consecutive array elements]
CUDA Device Memory Model Overview

• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can R/W global, constant, and texture memories

[Diagram: a device grid of blocks, each block with shared memory and per-thread registers and local memory; global, constant, and texture memories span the grid and are accessible from the host]
Global, Constant, and Texture Memories (Long-Latency Accesses)

• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
• Texture and constant memories
  – Constants initialized by the host
  – Contents visible to all threads

Courtesy: NVIDIA
CUDA Device Memory Allocation

• cudaMalloc()
  – Allocates an object in the device global memory
  – Requires two parameters:
    • Address of a pointer to the allocated object
    • Size of the allocated object
• cudaFree()
  – Frees an object from the device global memory
  – Takes a pointer to the freed object
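For instance, allocating and then freeing a 64 × 64 single-precision matrix in device global memory might look like this (Md is an illustrative name, not from the slides):

float *Md;
int size = 64 * 64 * sizeof(float);

cudaMalloc((void **)&Md, size);   // parameters: address of the pointer, byte count
cudaFree(Md);                     // parameter: pointer to the freed object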
CUDA Host-Device Data Transfer

• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters:
    • Pointer to source
    • Pointer to destination
    • Number of bytes copied
    • Type of transfer: host to host, host to device, device to host, or device to device
• Asynchronous in CUDA
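As an illustration, continuing with Md and size from the allocation example above (note that in the actual signature the destination pointer comes first):

float M[64 * 64];                                    // host array

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);     // host → device
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);     // device → host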
CUDA Function Declarations

                                    Executed on the:   Only callable from the:
__device__ float DeviceFunc()       device             device
__global__ void KernelFunc()        device             host
__host__ float HostFunc()           host               host

• __global__ defines a kernel function
  – Must return void
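An illustrative pair of functions showing the first two qualifiers in use (my example, not from the slides):

__device__ float square(float x)           // runs on the device, callable from device code only
{
    return x * x;
}

__global__ void squareAll(float *a, int n) // kernel: runs on the device, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = square(a[i]);
}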
Language Extensions: Variable Type Qualifiers

                                             Memory     Scope    Lifetime
__device__ __local__ int LocalVar;           local      thread   thread
__device__ __shared__ int SharedVar;         shared     block    block
__device__ int GlobalVar;                    global     grid     application
__device__ __constant__ int ConstantVar;     constant   grid     application

• __device__ is optional when used with __local__, __shared__, or __constant__
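A short kernel sketch using these qualifiers (illustrative names and sizes; assumes blocks of 256 threads):

__constant__ float scale;                  // constant memory: grid scope, application lifetime
                                           // (set from the host with cudaMemcpyToSymbol)

__global__ void scaleAll(float *data)      // data points into global memory
{
    __shared__ float tile[256];            // shared memory: block scope, block lifetime
    int i = threadIdx.x;                   // i lives in a per-thread register
    tile[i] = data[blockIdx.x * blockDim.x + i];
    __syncthreads();
    data[blockIdx.x * blockDim.x + i] = tile[i] * scale;
}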
Access Times

• Register – dedicated HW – single cycle
• Shared memory – dedicated HW – single cycle
• Local memory – DRAM, no cache – *slow*
• Global memory – DRAM, no cache – *slow*
• Constant memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Texture memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Instruction memory (invisible) – DRAM, cached
CUDA Function Call Restrictions

• __device__ functions cannot have their address taken
• For functions executed on the device:
  – No recursion
  – No static variable declarations inside the function
  – No variable number of arguments
Calling a Kernel Function – Thread Creation

• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);        // 5000 thread blocks
dim3 DimBlock(4, 8, 8);       // 256 threads per block
size_t SharedMemBytes = 64;   // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking
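A sketch of the blocking idiom that last bullet calls for, reusing the earlier vecAdd example:

vecAdd<<<1, N>>>(d_A, d_B, d_C);   // returns to the host immediately
cudaThreadSynchronize();           // block the host until the kernel finishes
                                   // (the CUDA 1.x-era call; later renamed cudaDeviceSynchronize)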
Resources Online

• http://www.ddj.com/hpc-high-performance-computing/207200659
• http://www.nvidia.com/object/cuda_home.html
• http://www.nvidia.com/object/cuda_learn.html
Case Studies

• GPU-enabled EM algorithm – applications in bioinformatics with the Genome Institute, National BIOTEC Center, Thailand
• GPU application for Monte Carlo integration – with Amir Fabin (UTA) & Dick Greenwood
• GPU checkpoint/restart
Thanks