
David Luebke

NVIDIA Research

GPU Architecture & Implications

© NVIDIA Corporation 2007

GPU Architecture

CUDA provides a parallel programming model

The Tesla GPU architecture implements this

This talk will describe the characteristics, goals, and implications of that architecture

G80 GPU Implementation: Tesla C870

681 million transistors, 470 mm² in 90 nm CMOS

128 thread processors, 518 GFLOPS peak, 1.35 GHz processor clock

1.5 GB DRAM, 76 GB/s peak bandwidth, 800 MHz GDDR3 clock, 384-pin DRAM interface

ATX form factor card, PCI Express x16, 170 W max with DRAM

G80 (launched Nov 2006)

128 Thread Processors execute kernel threads

Up to 12,288 parallel threads active

Per-block shared memory (PBSM) accelerates processing

Block Diagram Redux

[Block diagram: Host → Input Assembler → Thread Execution Manager, feeding an array of Thread Processors, each group with per-block shared memory (PBSM), connected via load/store to Global Memory]

Streaming Multiprocessor (SM)

Processing elements: 8 scalar thread processors (SP), 32 GFLOPS peak at 1.35 GHz, 8192 32-bit registers (32 KB)

½ MB total register file space!

usual ops: float, int, branch, …

Hardware multithreading: up to 8 blocks resident at once, up to 768 active threads in total

16 KB on-chip memory: low-latency storage, shared amongst threads of a block, supports thread communication

[SM diagram: multithreaded instruction unit (MT IU), 8 SPs, shared memory, threads t0 t1 … tB]

Goal: Scalability

Scalable execution: program must be insensitive to the number of cores

Write one program for any number of SM cores

Program runs on any size GPU without recompiling

Hierarchical execution model: decompose problem into sequential steps (kernels)

Decompose each kernel into parallel blocks

Decompose each block into parallel threads

Hardware distributes independent blocks to SMs as available
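A minimal sketch of this decomposition (hypothetical kernel and sizes, not from the deck): the host launches a kernel as a grid of independent blocks, and each block's threads index their own elements, so the same program runs unchanged on any number of SMs.

// Hypothetical kernel: one thread per element; blocks are independent,
// so hardware can distribute them across however many SMs exist.
__global__ void scale(float *data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // block index + thread index -> element
    if (i < n)
        data[i] *= alpha;
}

// Host: one sequential step (kernel), decomposed into blocks of threads:
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);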

Blocks Run on Multiprocessors

Kernel launched by host

[Diagram: kernel blocks distributed across the device processor array – multiple SMs (MT IU, SPs, shared memory) – backed by Device Memory]

Goal: easy to program

Strategies: familiar programming language mechanics

C/C++ with small extensions

Simple parallel abstractions: simple barrier synchronization

Shared memory semantics

Hardware-managed hierarchy of threads


Hardware Multithreading

Hardware allocates resources to blocks: blocks need thread slots, registers, shared memory

blocks don't run until resources are available

Hardware schedules threads: threads have their own registers

any thread not waiting for something can run

context switching is (basically) free – every cycle

Hardware relies on threads to hide latency, i.e., parallelism is necessary for performance


Goal: Performance per millimeter

For GPUs, performance == throughput

Strategy: hide latency with computation, not cache. Heavy multithreading – already discussed by Kevin

Implication: need many threads to hide latency. Occupancy: typically need 128 threads/SM minimum. Multiple thread blocks per SM help minimize the effect of barriers

Strategy: Single Instruction Multiple Thread (SIMT) balances performance with ease of programming


SIMT Thread Execution

Groups of 32 threads are formed into warps: always executing the same instruction; shared instruction fetch/dispatch; some threads become inactive when code paths diverge; hardware automatically handles divergence

Warps are the primitive unit of scheduling: pick 1 of 24 warps for each instruction slot

SIMT execution is an implementation choice: sharing control logic leaves more space for ALUs; largely invisible to the programmer; must be understood for performance, not correctness
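As an illustrative sketch (hypothetical kernel, not from the deck): threads of one warp that take different branches are masked and re-executed by the hardware, so the code is correct either way, but the two paths run serially.

// Hypothetical kernel: even and odd lanes of each 32-thread warp diverge.
// Hardware serializes the two paths and re-converges automatically.
__global__ void divergent(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        out[i] = 2 * i;   // even lanes active on this path
    else
        out[i] = 3 * i;   // odd lanes active on this path
}
// Branching at warp granularity instead (e.g. on (i / 32) % 2) would avoid the divergence.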


SIMT Multithreaded Execution

Weaving: the original parallel thread technology, about 10,000 years old. Warp: a set of 32 parallel threads that execute a SIMD instruction

SM hardware implements zero-overhead warp and thread scheduling. Each SM executes up to 768 concurrent threads, as 24 SIMD warps of 32 threads

Threads can execute independently: the SIMD warp automatically diverges and converges when threads branch. Best efficiency and performance when the threads of a warp execute together. SIMT across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency

[Diagram: SM multithreaded instruction scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, …, warp 3 instruction 96]

Memory Architecture

[Diagram: SM (MT IU, SPs, shared memory, I-cache) with read-only texture and constant caches, backed by Device Memory]

Direct load/store access to device memory: treated as the usual linear sequence of bytes (i.e., not pixels)

Texture & constant caches are read-only access paths

On-chip shared memory is shared amongst the threads of a block: important for communication amongst threads

provides low-latency temporary storage (~100x lower latency than DRAM)
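A minimal sketch of using this shared memory for thread communication (hypothetical kernel, not from the deck): each thread stages one value on chip, the block synchronizes at a barrier, then threads read their neighbors' values without touching DRAM again.

// Hypothetical kernel: adjacent difference within a block via shared memory.
__global__ void adjacentDiff(const float *in, float *out, int n)
{
    __shared__ float s[256];                 // per-block shared memory (block size 256 assumed)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        s[threadIdx.x] = in[i];              // stage this thread's element on chip
    __syncthreads();                         // barrier: the whole block has loaded
    if (i < n && threadIdx.x > 0)
        out[i] = s[threadIdx.x] - s[threadIdx.x - 1];  // read a neighbor's staged value
}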

Host memory is connected to the GPU via PCIe

Myths of GPU Computing

GPUs layer normal programs on top of graphics
NO: CUDA compiles directly to the hardware

GPU architectures are:

Very wide (1000s) SIMD machines…
NO: warps are 32-wide

…on which branching is impossible or prohibitive…
NOPE

…with 4-wide vector registers.
NO: scalar thread processors

GPUs are power-inefficient
NO: 4-10x perf/W advantage, up to 89x reported for some studies

GPUs don't do real floating point
NO: G8x GPU floating point is IEEE 754 (see the feature comparison below)

GPU Floating Point Features (G80 vs. SSE vs. IBM Altivec vs. Cell SPE)

Precision: G80 IEEE 754; SSE IEEE 754; Altivec IEEE 754; Cell SPE IEEE 754

Rounding modes for FADD and FMUL: G80 round to nearest and round to zero; SSE all 4 IEEE modes (nearest, zero, inf, -inf); Altivec round to nearest only; Cell SPE round to zero/truncate only

Denormal handling: G80 flush to zero; SSE supported, 1000's of cycles; Altivec supported, 1000's of cycles; Cell SPE flush to zero

NaN support: G80 yes; SSE yes; Altivec yes; Cell SPE no

Overflow and infinity support: G80 yes, only clamps to max norm; SSE yes; Altivec yes; Cell SPE no, infinity

Flags: G80 no; SSE yes; Altivec yes; Cell SPE some

Square root: G80 software only; SSE hardware; Altivec software only; Cell SPE software only

Division: G80 software only; SSE hardware; Altivec software only; Cell SPE software only

Reciprocal estimate accuracy: G80 24 bit; SSE 12 bit; Altivec 12 bit; Cell SPE 12 bit

Reciprocal sqrt estimate accuracy: G80 23 bit; SSE 12 bit; Altivec 12 bit; Cell SPE 12 bit

log2(x) and 2^x estimate accuracy: G80 23 bit; SSE no; Altivec 12 bit; Cell SPE no

Do GPUs Do Real IEEE FP?

G8x GPU FP is IEEE 754: comparable to other processors / accelerators

More precise / usable in some ways

Less precise in other ways

GPU FP getting better every generation: double precision support shortly

Goal: best of class by 2009

Questions?

David [email protected]

Applications & Sweet Spots

GPU Computing Sweet Spots

Applications:

High arithmetic intensity:

Dense linear algebra, PDEs, n-body, finite difference, …

High bandwidth: Sequencing (virus scanning, genomics), sorting, database…

Visual computing: graphics, image processing, tomography, machine vision…


GPU Computing Example Markets

Computational Modeling, Computational Chemistry, Computational Medicine, Computational Science, Computational Biology, Computational Finance, Computational Geoscience, Image Processing

Applications - Condensed

3D image analysis, adaptive radiation therapy, acoustics, astronomy, audio, automobile vision, bioinformatics, biological simulation, broadcast, cellular automata, computational fluid dynamics, computer vision, cryptography, CT reconstruction, data mining, digital cinema/projection, electromagnetic simulation, equity trading

Film, financial (lots of areas), languages, GIS, holographic cinema, imaging (lots), mathematics research, military (lots), mine planning, molecular dynamics, MRI reconstruction, multispectral imaging, n-body, network processing, neural networks, oceanographic research, optical inspection, particle physics

Protein folding, quantum chemistry, ray tracing, radar, reservoir simulation, robotic vision/AI, robotic surgery, satellite data analysis, seismic imaging, surgery simulation, surveillance, ultrasound, video conferencing, telescope, video, visualization, wireless, X-ray

GPU Computing Sweet Spots

From cluster to workstation: the "personal supercomputing" phase change

From lab to clinic

From machine room to engineer, grad student desks

From batch processing to interactive

From interactive to real-time

GPU-enabled clusters

A 100x or better speedup changes the science

Solve at different scales

Direct brute-force methods may outperform cleverness

New bottlenecks may emerge

Approaches once inconceivable may become practical


New Applications

Real-time options implied volatility engine

Swaption volatility cube calculator

Manifold 8 GIS

Ultrasound imaging

HOOMD Molecular Dynamics

Also… image rotation/classification, graphics processing toolbox, microarray data analysis, data-parallel primitives, astrophysics simulations

SDK: Mandelbrot, computer vision

Seismic migration


The Future of GPUs

GPU Computing drives new applications: reducing "Time to Discovery"

100x Speedup changes science and research methods

New applications drive the future of GPUs and GPU Computing

Drives new GPU capabilities

Drives hunger for more performance

Some exciting new domains: Vision, acoustic, and embedded applications

Large-scale simulation & physics


CUDA Performance Advantages

Performance: BLAS1: 60+ GB/sec

BLAS3: 127 GFLOPS

FFT: 52 benchFFT* GFLOPS

FDTD: 1.2 Gcells/sec

SSEARCH: 5.2 Gcells/sec

Black Scholes: 4.7 GOptions/sec

VMD: 290 GFLOPS

How: leveraging shared memory

GPU memory bandwidth

GPU GFLOPS performance

Custom hardware intrinsics

__sinf(), __cosf(), __expf(), __logf(), …

All benchmarks are compiled code!

GPGPU vs. GPU Computing

Problem: GPGPU

OLD: GPGPU – trick the GPU into general-purpose computing by casting problem as graphics

Turn data into images (“texture maps”)

Turn algorithms into image synthesis (“rendering passes”)

Promising results, but: tough learning curve, particularly for non-graphics experts

Potentially high overhead of graphics API

Highly constrained memory layout & access model

Need for many passes drives up bandwidth consumption


Solution: CUDA

NEW: GPU Computing with CUDA

CUDA = Compute Unified Device Architecture

Co-designed hardware & software for direct GPU computing

Hardware: fully general data-parallel architecture

Software: program the GPU in C

General thread launch

Global load-store

Parallel data cache

Scalar architecture

Integers, bit operations

Double precision (soon)

Scalable data-parallel execution/memory model

C with minimal yet powerful extensions

Graphics Programming Model

Graphics Application → Vertex Program → Rasterization → Fragment Program → Display


Streaming GPGPU Programming

OpenGL Program to Add A and B: Vertex Program → Rasterization → Fragment Program → CPU reads texture memory for results

Start by creating a quad

"Programs" created with raster operation

Read textures as input to OpenGL shader program

Write answer to texture memory as a "color"

All this just to do A + B


What’s Wrong With GPGPU?

[Diagram: Application → Vertex Program → Rasterization → Fragment Program → Display; the fragment program reads input registers, constants, texture, and temp registers, and writes only to output registers]

APIs are specific to graphics

Limited texture size and dimension

Limited shader outputs

No scatter

Limited instruction set

No thread communication

Limited local storage

Building a Better Pixel

[Diagram: fragment program with input registers, constants, texture, and registers, writing to output registers]

Building a Better Pixel Thread

[Diagram: thread program with thread number, constants, texture, and registers, writing to output registers]

Features

Millions of instructions

Full Integer and Bit instructions

No limits on branching, looping

1D, 2D, or 3D thread ID allocation

Global Memory

[Diagram: thread program with thread number, constants, texture, and registers, reading and writing global memory]

Features

Fully general load/store to GPU memory

Untyped, not fixed texture types

Pointer support
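A small illustrative sketch of what this enables (hypothetical kernel, not from the deck): ordinary C pointers and scattered writes to arbitrary global memory addresses, which the graphics pipeline's fixed output location could not express.

// Hypothetical kernel: scatter through an index array to arbitrary addresses.
__global__ void scatter(const float *src, float *dst, const int *index, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[index[i]] = src[i];   // the write address is data-dependent, not a fixed pixel
}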

Parallel Data Cache

[Diagram: thread program with thread number, constants, texture, registers, and a parallel data cache, backed by global memory]

Features

Dedicated on-chip memory

Shared between threads for inter-thread communication

Explicitly managed

As fast as registers


Example Algorithm - Fluids

Goal: Calculate PRESSURE in a fluid

Pressure = sum of neighboring pressures: Pn' = P1 + P2 + P3 + P4

Pressure depends on neighbors, so the pressure for each particle is:

Pressure1 = P1 + P2 + P3 + P4

Pressure2 = P3 + P4 + P5 + P6

Pressure3 = P5 + P6 + P7 + P8

Pressure4 = P7 + P8 + P9 + P10

Example Fluid Algorithm

CPU: single thread working out of a cache (control, ALU, cache, DRAM)

GPGPU: multiple passes through video memory; each pass reads P1…P4 from video memory and writes Pn' = P1 + P2 + P3 + P4 back as a "color"

GPU Computing with CUDA: parallel execution through the parallel data cache; the thread execution manager drives many ALUs that share P1…P5 as on-chip shared data, each computing its Pn' = P1 + P2 + P3 + P4

Parallel Data Cache

Bring the data closer to the ALU

Addresses a fundamental problem of stream computing:

• The data are far from the FLOPS, video RAM latency is high

• Threads can only communicate their results through this high latency RAM

GPGPU: multiple passes through video memory (P1…P4 shuttled between the ALUs and video memory for every Pn' = P1 + P2 + P3 + P4)

Parallel Data Cache

Parallel execution through cache

[Diagram: thread execution manager dispatching to many ALUs; P1…P5 staged once from DRAM into the parallel data cache as shared data, from which each ALU computes its Pn' = P1 + P2 + P3 + P4]

Bring the data closer to the ALU

Stage computation for the parallel data cache

Minimize trips to external memory

Share values to minimize overfetch and computation

Increases arithmetic intensity by keeping data close to the processors

User managed generic memory, threads read/write arbitrarily
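A minimal sketch of this staging pattern applied to the pressure example above (hypothetical kernel and neighbor layout, not from the deck): each block loads its tile of pressures into the parallel data cache once, then every thread sums its neighbors from on-chip memory instead of refetching them from DRAM.

// Hypothetical kernel: sum 4 neighboring pressures staged in shared memory.
// Assumes each particle's neighbors fall inside the same block's tile (halo handling omitted).
__global__ void pressureSum(const float *p, float *pOut, int n)
{
    __shared__ float tile[256];                      // this block's staged pressures
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = p[i];                    // one global read per particle
    __syncthreads();
    if (i + 3 < n && threadIdx.x + 3 < blockDim.x)   // neighbors must be inside this tile
        pOut[i] = tile[threadIdx.x]     + tile[threadIdx.x + 1]
                + tile[threadIdx.x + 2] + tile[threadIdx.x + 3];
}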


Streaming vs. GPU Computing


Streaming (GPGPU): gather in, restricted write

Memory is far from ALU

No inter-element communication

GPU Computing with CUDA: more general data-parallel model

Full Scatter / Gather

PDC brings the data closer to the ALU

App decides how to decompose the problem across threads

Share and communicate between threads to solve problems efficiently

GPU Design


CPU/GPU Parallelism

Moore's Law gives you more and more transistors. What do you want to do with them?

CPU strategy: make the workload (one compute thread) run as fast as possible

Tactics: cache (area limiting), instruction/data prefetch, speculative execution; limited by "perimeter" – communication bandwidth; …then add task parallelism… multi-core

GPU strategy: make the workload (as many threads as possible) run as fast as possible

Tactics: parallelism (1000s of threads), pipelining; limited by "area" – compute capability


Background: Unified Design

[Diagram: discrete design with separate shader units A–D, each with its own input and output buffers, vs. unified design with a single shader core]

Hardware Implementation: Collection of SIMT Multiprocessors

Each multiprocessor is a set of SIMT thread processors

Single Instruction Multiple Thread

Each thread processor has: program counter, register file, etc.; scalar data path; read/write memory access

Unit of SIMT execution: warp; executes same instruction/clock; hardware handles thread scheduling and divergence transparently

Warps enable a friendly data-parallel programming model!

[Diagram: device containing Multiprocessor 1 … Multiprocessor N, each with an instruction unit and Processor 1 … Processor M]

Hardware Implementation: Memory Architecture

The device has local device memory

Can be read and written by the host and by the multiprocessors

Each multiprocessor has: a set of 32-bit registers per processor

on-chip shared memory

A read-only constant cache

A read-only texture cache

[Diagram: device with Multiprocessor 1 … Multiprocessor N; each has per-processor registers, shared memory, an instruction unit, a constant cache, and a texture cache, all backed by device memory]

Hardware Implementation: Memory Model

Each thread can: read/write per-block on-chip shared memory

Read per-grid cached constant memory

Read/write non-cached device memory:

Per-grid global memory

Per-thread local memory

Read cached texture memory

[Diagram: a grid of thread blocks; each block has shared memory; each thread has its own registers and local memory; all blocks access global, constant, and texture memory]

CUDA Programming

CUDA SDK

[Diagram: integrated CPU and GPU C source code is split by the NVIDIA C Compiler into NVIDIA assembly for computing (run on the GPU via the CUDA driver, with debugger and profiler) and CPU host code (built with a standard C compiler and run on the CPU); libraries (FFT, BLAS, …) and example source code are included]

CUDA: Features available to kernels

Standard mathematical functions: sinf, powf, atanf, ceil, etc.

Built-in vector types: float4, int4, uint4, etc. for dimensions 1..4

Texture accesses in kernels:

texture<float,2> my_texture; // declare texture reference

float4 texel = texfetch(my_texture, u, v);
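A small illustrative kernel combining these features (hypothetical names, not from the deck), using the built-in float4 vector type together with the standard sinf function:

// Hypothetical kernel: apply sinf() to every component of a float4 array.
__global__ void sine4(float4 *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = data[i];
        v.x = sinf(v.x);  v.y = sinf(v.y);
        v.z = sinf(v.z);  v.w = sinf(v.w);
        data[i] = v;
    }
}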


G8x CUDA = C with Extensions

Philosophy: provide minimal set of extensions necessary to expose power

Function qualifiers:

__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }

Variable qualifiers:

__constant__ float MyConstantArray[32];
__shared__ float MySharedArray[32];

Execution configuration:

dim3 dimGrid(100, 50);   // 5000 thread blocks
dim3 dimBlock(4, 8, 8);  // 256 threads per block
MyKernel <<< dimGrid, dimBlock >>> (...);  // Launch kernel

Built-in variables and functions valid in device code:

dim3 gridDim;    // Grid dimension
dim3 blockDim;   // Block dimension
dim3 blockIdx;   // Block index
dim3 threadIdx;  // Thread index
void __syncthreads();  // Thread synchronization

CUDA: Runtime support

Explicit memory allocation returns pointers to GPU memory

cudaMalloc(), cudaFree()

Explicit memory copy for host ↔ device, device ↔ device

cudaMemcpy(), cudaMemcpy2D(), ...

Texture management

cudaBindTexture(), cudaBindTextureToArray(), ...

OpenGL & DirectX interoperability

cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …


Example: Adding matrices w/ 2D grids

CPU C program:

void addMatrix(float *a, float *b, float *c, int N)
{
    int i, j, index;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            index = i + j * N;
            c[index] = a[index] + b[index];
        }
    }
}

void main()
{
    .....
    addMatrix(a, b, c, N);
}

CUDA C program:

__global__ void addMatrix(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

void main()
{
    ..... // allocate & transfer data to GPU
    dim3 dimBlk(blocksize, blocksize);
    dim3 dimGrd(N / dimBlk.x, N / dimBlk.y);
    addMatrix<<<dimGrd, dimBlk>>>(a, b, c, N);
}

Example: Vector Addition Kernel

// Compute vector sum C = A+B

// Each thread performs one pair-wise addition

__global__ void vecAdd(float* A, float* B, float* C)

{

int i = threadIdx.x + blockDim.x * blockIdx.x;

C[i] = A[i] + B[i];

}


Example: Invoking the Kernel

__global__ void vecAdd(float* A, float* B, float* C);

void main()

{

// Execute on N/256 blocks of 256 threads each

vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);

}


Example: Host code for memory

// allocate host (CPU) memory
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
… initialize h_A and h_B …

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
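To complete the round trip (not shown on the slide, but standard CUDA runtime calls), the host copies the result back and frees the memory:

// copy the result back to host memory
float* h_C = (float*) malloc(N * sizeof(float));
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

// free device and host memory
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
free(h_A); free(h_B); free(h_C);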


A quick review

device = GPU = set of multiprocessors

Multiprocessor = set of processors & shared memory

Kernel = GPU program

Grid = array of thread blocks that execute a kernel

Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory spaces (location, caching, access, visibility):

Local: off-chip, not cached, read/write, one thread

Shared: on-chip (resident, no cache needed), read/write, all threads in a block

Global: off-chip, not cached, read/write, all threads + host

Constant: off-chip, cached, read only, all threads + host

Texture: off-chip, cached, read only, all threads + host
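As a quick illustrative sketch (hypothetical names, not from the deck) of how these spaces appear in CUDA source:

__constant__ float coeff[16];           // constant: cached, read-only in kernels, set by host

__global__ void demo(float *globalBuf)  // globalBuf points into global memory (all threads + host)
{
    __shared__ float tile[128];         // shared: on-chip, read/write, visible to the block
    float tmp;                          // registers / local: private to this thread
    tile[threadIdx.x] = globalBuf[threadIdx.x] * coeff[0];
    __syncthreads();
    tmp = tile[threadIdx.x];
    globalBuf[threadIdx.x] = tmp + 1.0f;
}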

Data-Parallel Programming

Scan Literature

Pre-Hibernation

First proposed in APL by Iverson (1962)

Used as a data parallel primitive in the Connection Machine (1990); feature of C* and CM-Lisp

Guy Blelloch used scan as a primitive for various parallel algorithms; his balanced-tree scan is used in the example here

Blelloch, 1990, “Prefix Sums and Their Applications”

Post-Democratization

O(n log n) work GPU implementation by Daniel Horn (GPU Gems 2); applied to summed area tables by Hensley et al. (EG05)

O(n) work GPU scan by Sengupta et al. (EDGE06) and Greß et al. (EG06)

O(n) work & space GPU implementation by Harris et al. (2007), in the NVIDIA CUDA SDK and GPU Gems 3

Applied to radix sort, stream compaction, and summed area tables

Parallel Reduction Complexity

log(N) parallel steps; each step S does N/2^S independent ops

Step Complexity is O(log N)

For N = 2^D, performs Sum_{S in [1..D]} 2^(D-S) = N-1 operations

Work Complexity is O(N) – It is work-efficient

i.e. does not perform more operations than a sequential algorithm

With P threads physically in parallel (P processors), time complexity is O(N/P + log N)

Compare to O(N) for sequential reduction
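A minimal sketch of the basic kernel behind these numbers (hypothetical, simplified form of the pattern the next slides optimize), with t = threadIdx.x, bd = blockDim.x, and data[] a shared-memory copy of the block's input:

// Hypothetical tree reduction: each step halves the number of active threads.
__global__ void reduce(const int *g_idata, int *g_odata)
{
    __shared__ int data[256];                        // one element per thread (block size 256 assumed)
    unsigned int t  = threadIdx.x;
    unsigned int bd = blockDim.x;
    data[t] = g_idata[blockIdx.x * bd + t];          // load this block's slice
    __syncthreads();
    for (unsigned int s = bd / 2; s > 0; s >>= 1) {  // log(N) steps, N/2^S ops at step S
        if (t < s) data[t] += data[t + s];
        __syncthreads();
    }
    if (t == 0) g_odata[blockIdx.x] = data[0];       // one partial sum per block
}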

Unrolling Last Steps

Only one warp is active during the last few steps

Unroll them and remove unneeded __syncthreads()

// t = threadIdx.x, bd = blockDim.x, data[] is the block's shared-memory array
for (unsigned int s = bd/2; s > 32; s >>= 1) {
    if (t < s) { data[t] += data[t + s]; }
    __syncthreads();
}
// last steps: only one warp is active, so no __syncthreads() is needed
if (t < 32) data[t] += data[t + 32];
if (t < 16) data[t] += data[t + 16];
if (t < 8)  data[t] += data[t + 8];
if (t < 4)  data[t] += data[t + 4];
if (t < 2)  data[t] += data[t + 2];
if (t < 1)  data[t] += data[t + 1];

Unrolling the Loop Completely

When block size is known at compile time, we can completely unroll the loop

It often is, since the maximum thread block size of 512 constrains us

Use templates…

#define STEP(d) if (t < (d)) data[t] += data[t + (d)];
#define SYNC __syncthreads();

template <unsigned int bsize>
__global__ void d_reduce(int *g_idata, int *g_odata)
{
    ...
    if (bsize == 512) STEP(512) SYNC
    if (bsize >= 256) STEP(256) SYNC
    if (bsize >= 128) STEP(128) SYNC
    if (bsize >= 64)  STEP(64)  SYNC
    if (bsize >= 32) {
        STEP(32) STEP(16) STEP(8) STEP(4) STEP(2) STEP(1)
    }
}
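Since bsize is a template parameter, the host picks the instantiation at launch time; a hedged sketch of the usual dispatch (variable names hypothetical):

// Launch the template instantiation that matches the runtime block size;
// the dead branches inside each instantiation are compiled away.
switch (threads) {
    case 512: d_reduce<512><<<blocks, 512>>>(d_in, d_out); break;
    case 256: d_reduce<256><<<blocks, 256>>>(d_in, d_out); break;
    case 128: d_reduce<128><<<blocks, 128>>>(d_in, d_out); break;
    case  64: d_reduce< 64><<<blocks,  64>>>(d_in, d_out); break;
}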

GPU Computing Motivation

Computing Challenge

Task Computing vs. Data Computing

Extreme Growth in Raw Data

[Charts: NOAA weather data growth in petabytes (Source: John Bates, NOAA Nat. Climate Center); YouTube bandwidth growth in millions (Source: Alexa, YouTube 2006); Walmart transaction tracking in millions (Source: Hedburg, CPI, Walmart); BP oil and gas active data in terabytes (Source: Jim Farnsworth, BP May 2005); NOAA/NASA weather data in petabytes projected from 2002 through 2017]

Computational Horsepower

GPU is a massively parallel computation engine

High memory bandwidth (5-10x CPU)

High floating-point performance (5-10x CPU)


Benchmarking: CPU vs. GPU Computing

G80 vs. Core2 Duo 2.66 GHz; measured against commercial CPU benchmarks when possible

“Free” Massively Parallel Processors

It’s not science fiction, it’s just funded by them

Success Stories

Success Stories: Data to Design

Acceleware EM Field simulation technology for the GPU

3D Finite-Difference and Finite-Element (FDTD)

Modeling of:

Cell phone irradiation

MRI Design / Modeling

Printed Circuit Boards

Radar Cross Section (Military)

Pacemaker with Transmit Antenna

[Bar chart: Performance (Mcells/s), 0-700, for CPU 3.2 GHz, 1 GPU, 2 GPUs, and 4 GPUs, showing 5X, 10X, and 20X speedups over the 1X CPU baseline]

Evolved Machines

130X speedup

Simulate brain circuitry

Sensory computing: vision, olfactory


10X with MATLAB CPU+GPU

Pseudo-spectral simulation of 2D Isotropic turbulence

Matlab: Language of Science

http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m

http://developer.nvidia.com/object/matlab_cuda.html

MATLAB Example: Advection of an elliptic vortex

MATLAB: 168 seconds

MATLAB with CUDA (single-precision FFTs): 20 seconds

256x256 mesh, 512 RK4 steps, Linux, MATLAB file:
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_vortex.m

MATLAB Example: Pseudo-spectral simulation of 2D isotropic turbulence

MATLAB: 992 seconds

MATLAB with CUDA (single-precision FFTs): 93 seconds

512x512 mesh, 400 RK4 steps, Windows XP, MATLAB file:
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m


NAMD/VMD Molecular Dynamics

http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/

240X speedup

Computational biology


Molecular Dynamics Example

Case study: molecular dynamics research at U. Illinois Urbana-Champaign

(Scientist-sponsored) course project for CS 498AL: Programming Massively Parallel Multiprocessors (Kirk/Hwu)

Next slides stolen from a nice description of problem, algorithms, and iterative optimization process available at:

http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/


Molecular Modeling: Ion Placement

Biomolecular simulations attempt to replicate in vivo conditions in silico

Model structures are initially constructed in vacuum

Solvent (water) and ions are added as necessary for the required biological conditions

Computational requirements scale with the size of the simulated structure


Evolution of Ion Placement Code

First implementation was sequential: a virus structure with 10^6 atoms would require 10 CPU days

Tuned for Intel C/C++ vectorization + SSE: ~20x speedup

Parallelized w/ pthreads: high data parallelism = linear speedup

Parallelized GPU-accelerated implementation: 3 GeForce 8800 GTX cards outrun ~300 Itanium2 CPUs!

Virus structure now runs in 25 seconds on 3 GPUs!

Further speedups should still be possible…


Multi-GPU CUDA Coulombic Potential Map Performance

Host: Intel Core 2 Quad, 8GB RAM, ~$3,000

3 GPUs: NVIDIA GeForce 8800GTX, ~$550 each

32-bit RHEL4 Linux (want 64-bit CUDA!!)

235 GFLOPS per GPU for current version of coulombic potential map kernel

705 GFLOPS total for multithreaded multi-GPU version

Three GeForce 8800 GTX GPUs in a single machine, cost ~$4,650

Professor Partnership

NVIDIA Professor Partnership

Easy: support for faculty research & teaching efforts

Small equipment gifts (1-2 GPUs)

Significant discounts on GPU purchases, especially Quadro and Tesla equipment; useful for cost matching

Competitive: research contracts

Small cash grants (typically ~$25K gifts)

Medium-scale equipment donations (10-30 GPUs)

Informal proposals, reviewed quarterly

Focus areas: GPU computing, especially with an educational mission or component

http://www.nvidia.com/page/professor_partnership.html