
Page 1: Massively Parallel Architecturesfalcou/teaching/par/accelerator.pdf · 2009-03-11 · Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI

The CELL processor

Introduction to GPGPU

Conclusion

Massively Parallel Architectures

A Take on Cell Processor and GPU programming

Joel Falcou - LRI

[email protected]

Bat. 490 - Bureau 104

20 January 2009

J. Falcou


Motivation

Harder, Better, Faster, Stronger (famous tune)
Scientific computing demands ever more computing power
Faster computation = more results now
Biology and health care
Oil and finance
Video game industry

The Silent Revolution
Computing power: 400 GFLOPS vs 32 GFLOPS
Memory bandwidth: 100-200 GB/s vs 10 GB/s
GPUs are in everyday PCs
The Cell went from server blades to the game industry (PS3)


Motivation

When Video Games Ruled the World
Game design has become ever more sophisticated.
Fast GPUs enable complex shaders for real-time effects.
In turn, the demand for speed has led to ever-increasing innovation in card design.
The gaming industry has overtaken the defense, finance, oil and healthcare industries as the main driving factor for high-performance processors.
The NV40 architecture has 225 million transistors, compared to about 175 million for the Pentium 4 EE 3.2 GHz chip.


Objectives

Theory!
Hardware architecture of the GPU and the Cell processor
Pros and cons of those architectures

... and Practice
Introduction to GPGPU
Tools and languages
Sample code


Architecture
Coding for the CELL

Motivation

Less is More
General-purpose CPUs keep growing in complexity
Peak performance gains are slowing down
Build more with less complex processing units

The CELL Processor
Heterogeneous multi-core
DSP-like coprocessors
High memory bandwidth (200 GB/s)


[Figure: Where to find it?]


The CELL Processor

Structure
1 Power Processor Element (PPE)
8 Synergistic Processor Elements (SPEs)
1 XDR DRAM interface
1 4-way DMA bus

Sources of parallelism
Thread-level parallelism (TLP) on the PPE
TLP across the SPEs
Instruction-level parallelism (ILP) inside each SPE


[Figure: The CELL Processor]


Available Tools

... that work
GCC/G++ for the Cell
GFORTRAN for the Cell
Both use a dual-source compilation process

... that don’t work
OpenMP: bad scaling, huge executable
Task-based MPI: huge latency, low bandwidth


Separate development

PPE specifics
All the features of a PowerPC core
Supports up to two hardware threads
Full-fledged AltiVec SIMD extension

SPE specifics
Specialized AltiVec SIMD extension
No scalar ALU
No cache and no branch predictor


Memory and Communications

Communicating between PPE and SPEs
SPE local stores (LS) are mapped into the PPE's virtual address space
PPE and SPE code share the same process space
SPE code must be ’downloaded’ to the SPEs when the application starts

Handling SPE Local Store
Each SPE LS is only 256 KB for code and data
SPE local stores are not shared between SPEs
Explicit data transfer primitives are required


Memory and Communications

Mailbox
Transfers small (32-bit) messages between an SPE and the PPE
Two mailboxes per SPE (in and out)
Two modes: waiting (blocking) or polling
Useful for simple synchronization (thread pool pattern)
Primitives: spe_in_mbox_write and spe_out_mbox_read (PPE side)

Signal
Transfers small (32-bit) messages between SPEs
Two general-purpose signal slots per SPE
Useful for emulating message passing on top of DMA transfers
Primitives: mfc_sndsig and spu_read_signal1 / spu_read_signal2
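The polling mode of a mailbox can be sketched on an ordinary CPU. The class below is a plain C++ stand-in, not the libspe2 API: `Mailbox`, `try_write`, `try_read` and `pending` are hypothetical names, and the queue depth is a parameter of the sketch rather than a hardware figure. A write reports failure instead of waiting when the queue is full, and a read reports failure when nothing is pending.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical CPU-side model of an SPE mailbox: a small FIFO of 32-bit words.
class Mailbox {
public:
    explicit Mailbox(std::size_t depth) : depth_(depth) { assert(depth > 0); }

    // Polling-style write: returns false instead of waiting when the queue is full.
    bool try_write(std::uint32_t word) {
        if (queue_.size() == depth_) return false;
        queue_.push_back(word);
        return true;
    }

    // Polling-style read: returns false when no message is pending,
    // otherwise stores the oldest message in `word`.
    bool try_read(std::uint32_t& word) {
        if (queue_.empty()) return false;
        word = queue_.front();
        queue_.pop_front();
        return true;
    }

    // Number of messages waiting to be read.
    std::size_t pending() const { return queue_.size(); }

private:
    std::size_t depth_;
    std::deque<std::uint32_t> queue_;
};
```

A waiting (blocking) variant would loop on `try_write` or `try_read` until it succeeds, which is what makes the mailbox usable as a simple "work available" / "work done" handshake in a thread pool.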


DMA Transfers

Principles
Keep the SPU from blocking during memory transfers
Also used to download SPE code into the SPE LS
Up to 4 transfers can run in parallel over the SPE bus
Up to one upload and one download in parallel over the PPE bus
Primitives: mfc_get, mfc_put and mfc_read_tag_status_all

Traps and Pitfalls
Data to send/receive must be aligned on a 128-bit boundary
Data size must be 1, 2, 4, 8 or any multiple of 16 bytes
Limited number of DMA channels
Consider double buffering to overlap transfers with computation
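Two of the pitfalls above lend themselves to a small host-side sketch. This is plain C++ running on an ordinary CPU, not SPE code: `dma_request_ok` and `sum_double_buffered` are hypothetical helpers, and `std::memcpy` stands in for `mfc_get`, so nothing here is actually asynchronous. The first function encodes the alignment and size rules as stated on the slide; the second shows the ordering that double buffering relies on (start fetching chunk k+1, then compute on chunk k, then swap).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Rule from the slide: 16-byte (128-bit) aligned address, and a size of
// 1, 2, 4, 8 or any non-zero multiple of 16 bytes.
bool dma_request_ok(std::uintptr_t address, std::size_t bytes) {
    const bool aligned = (address % 16) == 0;
    const bool small = bytes == 1 || bytes == 2 || bytes == 4 || bytes == 8;
    const bool multiple = bytes != 0 && (bytes % 16) == 0;
    return aligned && (small || multiple);
}

// Sums `input` chunk by chunk using two local buffers. std::memcpy is a
// synchronous stand-in for an asynchronous DMA "get"; on real hardware the
// copy into the other buffer would overlap with the work on the current one.
float sum_double_buffered(const std::vector<float>& input, std::size_t chunk) {
    assert(chunk > 0);
    std::vector<float> buffer[2] = {std::vector<float>(chunk), std::vector<float>(chunk)};
    const std::size_t total = input.size();
    float sum = 0.0f;
    if (total == 0) return sum;

    // Copies up to `chunk` elements starting at `offset` into buffer[slot].
    auto fetch = [&](std::size_t offset, int slot) {
        const std::size_t count = (total - offset < chunk) ? total - offset : chunk;
        std::memcpy(buffer[slot].data(), input.data() + offset, count * sizeof(float));
        return count;
    };

    int current = 0;
    std::size_t count = fetch(0, current);                 // prefetch the first chunk
    for (std::size_t offset = 0; offset < total;) {
        const std::size_t next_offset = offset + count;
        std::size_t next_count = 0;
        if (next_offset < total)
            next_count = fetch(next_offset, 1 - current);  // start the next transfer
        for (std::size_t i = 0; i < count; ++i)            // compute on the current chunk
            sum += buffer[current][i];
        current = 1 - current;                             // swap buffers
        offset = next_offset;
        count = next_count;
    }
    return sum;
}
```

On an SPE the two buffers would both have to fit in the 256 KB local store next to the code, which is what bounds the usable chunk size.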


The NVIDIA Architecture
Programming with CUDA

Motivation

GPU beyond 3D graphics

Data-parallel algorithms leverage GPU attributes
Large data arrays, streaming throughput
Fine-grain SIMD parallelism
Low-latency floating point (FP) computation

Back in the days of OpenGL GPGPU
Limited texture size/dimension
Limited outputs
No integer or bitwise operators
Limited communications


The NVIDIA Products

GeForce series
Separate HW interface
Works as an external MPM

Tesla machines
8-series GPUs: 200 GFLOPS
Stand-alone or 1U rack-mountable unit


Inside a GPU

Hierarchical Memory
Global Memory
Shared Memory
Local Memory

Processors
High-density SMP
Supports 4-way SIMD


Global View

Kernels
A GPGPU application is made of
CPU computation
GPU kernels

Grids and Blocks
Kernel = grid of thread blocks
All threads share the data memory space
A thread block is a batch of threads that can cooperate


Block and Thread IDs

Threads and blocks have IDs
Each thread uses its IDs to decide which data to process
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D

Memory Access
Depends on the domain
Image: 2D
Physics: 3D
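How a thread turns its IDs into a data index can be sketched on the CPU. The `Dim2` struct and the `global_index` function below are hypothetical stand-ins for the `.x`/`.y` fields of CUDA's built-in `blockIdx`, `blockDim` and `threadIdx` variables in the 2D (image) case; this is plain C++ for illustration, not device code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 2D coordinate, mirroring the .x/.y fields of CUDA's built-ins.
struct Dim2 {
    std::size_t x;
    std::size_t y;
};

// Row-major index of the element owned by one thread of a 2D grid.
// `width` is the number of elements per image row.
std::size_t global_index(Dim2 block_id, Dim2 block_dim, Dim2 thread_id, std::size_t width) {
    assert(thread_id.x < block_dim.x && thread_id.y < block_dim.y);
    const std::size_t col = block_id.x * block_dim.x + thread_id.x;  // global column
    const std::size_t row = block_id.y * block_dim.y + thread_id.y;  // global row
    return row * width + col;
}
```

For example, with 16x16 blocks over a 64-element-wide image, thread (3, 2) of block (1, 0) owns column 19 of row 2, that is index 2 * 64 + 19 = 147. A 3D domain adds a third coordinate in the same way.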


Memory Access Patterns

Each thread can
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read only per-grid constant memory

The host can
R/W constant memory
R/W texture memory
R/W global memory


Global, Constant, and Texture Memories

Global Memory
Main means of communicating between host and device
Contents visible to all threads

Texture and Constant
Constants initialized by host
Contents visible to all threads


[Figure: CUDA Processing Flow]


Copy Processing Data

Create data on Host
cudaMallocHost(): allocate (page-locked) memory on the host
cudaMalloc(): allocate memory in the device Global Memory

Copy to Device
cudaMemcpy(): copy memory between host and device
Asynchronous variant, cudaMemcpyAsync(), available since CUDA 1.1
Works 4-way: (host, device) x (host, device)

Example
float *host, *device;

cudaMallocHost(&host, sizeof(float)*64*64);
cudaMalloc(&device, sizeof(float)*64*64);

// Destination first, then source: copy from host to device
cudaMemcpy(device, host, sizeof(float)*64*64, cudaMemcpyHostToDevice);


Instruct the Processing

Define the device mapping
CUDA provides a built-in type (dim3) for dimensions
Define a grid of blocks
Define the threads of each block

Run the kernel
CUDA provides a syntax extension for calling a given function over a given grid

Example
dim3 dimBlock(16,16);
dim3 dimGrid(64 / dimBlock.x, 64 / dimBlock.y);

// Kernels work on device memory, so pass the device pointer
device_kernel<<<dimGrid, dimBlock>>>(device, 64);
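The grid computation in the example divides exactly because 64 is a multiple of 16. A common generalization, sketched here in plain C++ with a hypothetical helper `blocks_needed`, rounds the division up so that sizes that are not a multiple of the block size are still covered. The kernel then needs a bounds check for the threads that fall past the end of the data.

```cpp
#include <cassert>
#include <cstddef>

// Number of blocks of `block` threads needed to cover `n` elements,
// rounding up when `n` is not a multiple of `block`.
std::size_t blocks_needed(std::size_t n, std::size_t block) {
    assert(block > 0);
    return (n + block - 1) / block;
}
```

With this helper, a 64-element side with 16-thread blocks still gives 4 blocks, while a 65-element side gives 5 blocks, the last one mostly idle.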


Build a Parallel kernel

kernel.cu

__global__ void device_kernel(float* data, size_t size)
{
  // Global column and row of the element handled by this thread
  size_t x = blockIdx.x * blockDim.x + threadIdx.x;
  size_t y = blockIdx.y * blockDim.y + threadIdx.y;

  // Each thread inverts exactly one element; the guard covers grids
  // that overshoot the data
  if (x < size && y < size)
    data[y * size + x] = 255 - data[y * size + x];
}
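To check the indexing logic without a GPU, the launch can be emulated on the CPU. The sketch below is plain C++, not CUDA: `invert_on_grid` is a hypothetical function that loops over every (block, thread) pair a square 2D launch would create and applies the same `255 - value` update, assuming a square image and at most one thread per element. The returned counters exist only to verify that each element is updated exactly once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates a 2D launch of (grid x grid) blocks of (block x block) threads
// over a size x size image. Returns how many times each element was updated.
std::vector<int> invert_on_grid(std::vector<float>& data, std::size_t size,
                                std::size_t grid, std::size_t block) {
    assert(data.size() == size * size);
    std::vector<int> touched(data.size(), 0);
    for (std::size_t by = 0; by < grid; ++by)
        for (std::size_t bx = 0; bx < grid; ++bx)
            for (std::size_t ty = 0; ty < block; ++ty)
                for (std::size_t tx = 0; tx < block; ++tx) {
                    const std::size_t x = bx * block + tx;  // global column
                    const std::size_t y = by * block + ty;  // global row
                    if (x >= size || y >= size) continue;   // threads past the edge do nothing
                    data[y * size + x] = 255.0f - data[y * size + x];
                    ++touched[y * size + x];
                }
    return touched;
}
```

On the device the four loops disappear: every iteration of the loop body runs as its own thread, which is why two threads must never be mapped to the same element for an in-place update like this one.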


Sample Code

see mmul.*


As a Conclusion ...

Some research topics ...
High-level tools are needed. WIP includes:
Algorithmic skeletons for the Cell
Bulk Synchronous Parallelism for GPU
Architecture-independent algebra library

Some untapped domains
Operations Research
Cryptography/Compression
Artificial Intelligence
