University of Michigan, Electrical Engineering and Computer Science
Sponge: Portable Stream Programming on Graphics Engines
Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke
Why GPUs?
• Every mobile and desktop system will have one
• Affordable and high performance
• Over-provisioned
• Programmable
[Image: Sony PlayStation Phone]
[Chart: theoretical GFLOPS (0–1500), 2002–2011 — NVIDIA GPUs (GeForce 6800 Ultra, 7800 GTX, 8800 GTX, GTX 280, GTX 480) vs. Intel CPUs]
GPU Architecture
[Diagram: a CPU launches Kernel 1 and Kernel 2 over time onto the GPU; the GPU contains 30 streaming multiprocessors (SM 0 … SM 29), each with shared memory, registers, and eight cores (0–7), connected through an interconnection network to global memory (device memory)]
GPU Programming Model
[Diagram: an application launches a sequence of grids (Grid 0, Grid 1); each grid consists of blocks, and each block of threads. Per-thread registers (int RegisterVar), per-thread local memory (int LocalVarArray[10]), per-block shared memory (__shared__ int SharedVar), and per-application device global memory (__device__ int GlobalVar)]
• Threads are grouped into blocks, and blocks into a grid
• All the threads run one kernel
• Registers are private to each thread
• Registers spill to local memory
• Shared memory is shared between the threads of a block
• Global memory is shared between all blocks
GPU Execution Model
[Diagram: the blocks of Grid 1 are distributed across the streaming multiprocessors (SM 0, SM 1, SM 2, SM 3, …, SM 30), each with shared memory, registers, and eight cores (0–7)]
GPU Execution Model
[Diagram: Blocks 0–3 are assigned to SM0, which provides shared memory, registers, and cores 0–7; each block's threads are grouped into 32-thread warps (Warp 0: ThreadIds 0–31, Warp 1: ThreadIds 32–63)]
GPU Programming Challenges
[Chart: execution time (ms) vs. number of registers per thread (64, 48, 32, 16, 8) on a high-performance desktop GPU and a mobile GPU; the configuration optimized for the GeForce GTX 285 is not the one optimized for the GeForce 8400 GS]
• Restructuring data efficiently for the complex memory hierarchy: global memory, shared memory, registers
• Partitioning work between the CPU and GPU
• Lack of portability between different generations of GPUs: registers, active warps, size of global memory, size of shared memory
• Variation will only grow: newer high-performance cards (e.g., NVIDIA's Fermi) and mobile GPUs with fewer resources
Nonlinear Optimization Space
[Ryoo, CGO '08]
[Figure: SAD optimization space — 908 configurations]
We need a higher level of abstraction!
Goals
• Write-once parallel software
• Free the programmer from low-level details
Parallel Specification
  (C + Pthreads) → Shared-Memory Processors
  (C + Intrinsics) → SIMD Engines
  (Verilog/VHDL) → FPGAs
  (CUDA/OpenCL) → GPUs
Streaming
• Higher level of abstraction
• Decouples computation and memory accesses
• Coarse-grain exposed parallelism, exposed communication
• Programmers can focus on the algorithm instead of low-level details
• Streaming actors use buffers to communicate
• Much recent work on extending the portability of streaming applications
[Diagram: example stream graph — Actor 1 feeds a splitter, the splitter fans out to Actors 2–5, and a joiner merges their outputs into Actor 6]
Sponge
– Generates optimized CUDA for a wide variety of GPU targets
– Performs an array of optimizations on stream graphs
– Optimizes and ports across different GPU generations
– Utilizes the memory hierarchy (registers, shared memory, coalescing)
– Efficiently utilizes the streaming cores

Optimization phases:
• Reorganization and classification
• Memory layout
• Graph restructuring
• Register optimization
• Shared/global memory
• Helper threads
• Bank conflict resolution
• Loop unrolling
• Software prefetching
GPU Performance Model
• Memory-bound kernels: execution time ≈ memory time
  [Diagram: memory instructions M0–M7 dominate the timeline; computation instructions C0–C7 are hidden beneath them]
• Computation-bound kernels: execution time ≈ computation time
  [Diagram: computation instructions C0–C7 dominate; memory instructions M0–M7 overlap with them]
(M: memory instructions, C: computation instructions)
Actor Classification
• High-Traffic actors (HiT)
  – Large number of memory accesses per actor
  – Fewer threads when shared memory is used
  – Using shared memory underutilizes the processors
• Low-Traffic actors (LoT)
  – Fewer memory accesses per actor
  – More threads
  – Using shared memory increases performance
Global Memory Accesses
[Diagram: threads 0–3 each run actor A[4,4] directly against global memory; thread 0 consumes words 0–3, thread 1 words 4–7, thread 2 words 8–11, and thread 3 words 12–15]
• Large access latency
• The threads do not access the words in sequence
• No coalescing
(A[i, j]: actor A has i pops and j pushes)
Shared Memory
[Diagram: threads 0–3 cooperatively copy words 0–15 from global memory into shared memory — each copy step fetches four consecutive words (0–3, 4–7, 8–11, 12–15), so the global accesses coalesce — then each thread runs A[4,4] out of shared memory, and the results are copied back to global memory the same way]
Using Shared Memory
• Shared memory is ~100x faster than global memory
• Staging through it coalesces all global memory accesses
• The number of threads is limited by the size of the shared memory

Before (baseline):
  Begin Kernel <<<Blocks, Threads>>>:
    For number of iterations
      Work
  End Kernel

After (staging through shared memory):
  Begin Kernel <<<Blocks, Threads>>>:
    For number of iterations
      For number of pops
        Shared ← Global
      syncthreads
      Work
      syncthreads
      For number of pushes
        Shared → Global
  End Kernel
Helper Threads
• Shared memory limits the number of threads
• Underutilized processors can fetch data
• All the helper threads are in one warp (no control-flow divergence)

Before (baseline):
  Begin Kernel <<<Blocks, Threads>>>:
    For number of iterations
      Work
  End Kernel

After (with helper threads):
  Begin Kernel <<<Blocks, Threads + Helpers>>>:
    For number of iterations
      If helper threads
        Shared ← Global
      syncthreads
      If worker threads
        Work
      syncthreads
      If helper threads
        Shared → Global
  End Kernel
Data Prefetch
• Better register utilization
• Data for iteration i+1 is moved into registers
• Data for iteration i is moved from registers to shared memory
• Allows the GPU to overlap instructions

Before (staging through shared memory):
  Begin Kernel <<<Blocks, Threads>>>:
    For number of iterations
      For number of pops
        Shared ← Global
      syncthreads
      Work
      syncthreads
      For number of pushes
        Shared → Global
  End Kernel

After (with prefetching):
  Begin Kernel <<<Blocks, Threads>>>:
    For number of pops
      Regs ← Global
    For number of iterations
      For number of pops
        Shared ← Regs
      If not the last iteration
        For number of pops
          Regs ← Global
      syncthreads
      Work
      syncthreads
      For number of pushes
        Shared → Global
  End Kernel
Loop Unrolling
• Similar to traditional unrolling
• Allows the GPU to overlap instructions
• Better register utilization
• Less loop-control overhead
• Can also be applied to the memory-transfer loops

  Begin Kernel <<<Blocks, Threads>>>:
    For number of iterations/2
      For number of pops
        Shared ← Global
      syncthreads
      Work
      syncthreads
      For number of pushes
        Shared → Global
      For number of pops
        Shared ← Global
      syncthreads
      Work
      syncthreads
      For number of pushes
        Shared → Global
  End Kernel
Methodology
• Set of benchmarks from the StreamIt suite
• 3 GHz Intel Core 2 Duo CPU with 6 GB RAM
• NVIDIA GeForce GTX 285:
  Stream Processors: 240
  Processor Clock: 1476 MHz
  Memory Configuration: 2 GB DDR3
  Memory Bandwidth: 159.0 GB/s
Results (Baseline: CPU)
[Chart: speedup (x) over the CPU for DCT, FFT, Matrix Multiply, Matrix Multiply Block, Bitonic, Batcher, Radix, Merge Sort, Comparison Counting, Vector Add, Histogram, and the average, with and without CPU–GPU transfer time; average speedups of 10x with transfer and 24x without]
Results (Baseline: GPU)
[Chart: speedup (x) over a baseline GPU implementation for the same benchmarks as the optimizations are applied cumulatively — Shared/Global memory, Prefetch/Unrolling, Helper Threads, Graph Restructuring — with annotated contributions of 64%, 3%, 16%, and 16%]
Conclusion
• Future systems will be heterogeneous
• GPUs are an important part of such systems
• Programming complexity is a significant challenge
• Sponge automatically creates optimized CUDA code for a wide variety of GPU targets
• It provides portability by performing an array of optimizations on stream graphs
Spatial Intermediate Representation
• StreamIt
• Main constructs:
  – Filter: encapsulates computation
  – Pipeline: expresses pipeline parallelism
  – Splitjoin: expresses task-level parallelism
  – Other constructs are not relevant here
• Exposes different types of parallelism
  – Composable, hierarchical
• Stateful and stateless filters
[Diagram: a pipeline of filters containing a splitjoin]
Bank Conflict
[Diagram: threads 0–2 each run A[8,8] against shared memory; with stride s = 8, threads 0, 1, and 2 access words 0, 8, and 16, which map to banks 0, 8, and 0 — threads 0 and 2 conflict]
data = buffer[BaseAddress + s * ThreadId]
Removing Bank Conflict
[Diagram: with stride s = 9, threads 0, 1, and 2 access words 0, 9, and 18, which map to banks 0, 9, and 2 — no conflict]
data = buffer[BaseAddress + s * ThreadId]
If GCD(number of banks, s) = 1, there is no bank conflict; since the number of banks is a power of two, s must be odd.