University of Michigan, Electrical Engineering and Computer Science
Sponge: Portable Stream Programming on Graphics Engines
Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke
Why GPUs?
• Every mobile and desktop system will have one
• Affordable and high performance
• Over-provisioned
• Programmable
[Image: Sony PlayStation Phone]
[Chart: theoretical GFLOPS (0–1500), 2002–2011 — NVIDIA GPUs (GeForce 6800 Ultra, 7800 GTX, 8800 GTX, GTX 280, GTX 480) vs. Intel CPUs]
GPU Architecture
[Diagram: a CPU launches Kernel 1 and Kernel 2 over time onto the GPU; the GPU contains 30 streaming multiprocessors (SM 0 … SM 29), each with shared memory, registers, and eight cores (0–7), connected through an interconnection network to global memory (device memory)]
GPU Programming Model
[Diagram: an application launches a sequence of grids (Grid 0, Grid 1); each grid consists of blocks, and each block of threads. Per-thread registers (int RegisterVar), per-thread local memory (int LocalVarArray[10]), per-block shared memory (__shared__ int SharedVar), and per-application device global memory (__device__ int GlobalVar)]
• Threads are grouped into blocks, and blocks into a grid
• All the threads run one kernel
• Registers are private to each thread
• Registers spill to local memory
• Shared memory is shared between the threads of a block
• Global memory is shared between all blocks
GPU Execution Model
[Diagram: the blocks of Grid 1 are distributed across the streaming multiprocessors (SM 0, SM 1, SM 2, SM 3, …, SM 30), each with shared memory, registers, and eight cores (0–7)]
GPU Execution Model
[Diagram: Blocks 0–3 are assigned to SM0, which provides shared memory, registers, and cores 0–7; each block's threads are grouped into 32-thread warps (Warp 0: ThreadIds 0–31, Warp 1: ThreadIds 32–63)]
GPU Programming Challenges
[Chart: execution time (ms) vs. number of registers per thread (64, 48, 32, 16, 8) on a high-performance desktop GPU and a mobile GPU; the configuration optimized for the GeForce GTX 285 is not the one optimized for the GeForce 8400 GS]
• Restructuring data efficiently for the complex memory hierarchy: global memory, shared memory, registers
• Partitioning work between the CPU and GPU
• Lack of portability between different generations of GPUs: registers, active warps, size of global memory, size of shared memory
• Variation will only grow: newer high-performance cards (e.g., NVIDIA's Fermi) and mobile GPUs with fewer resources
Nonlinear Optimization Space
[Ryoo, CGO '08]
[Figure: SAD optimization space — 908 configurations]
We need a higher level of abstraction!
Goals
• Write-once parallel software
• Free the programmer from low-level details
Parallel Specification
  (C + Pthreads) → Shared-Memory Processors
  (C + Intrinsics) → SIMD Engines
  (Verilog/VHDL) → FPGAs
  (CUDA/OpenCL) → GPUs
Streaming
• Higher level of abstraction
• Decouples computation and memory accesses
• Coarse-grain exposed parallelism, exposed communication
• Programmers can focus on the algorithm instead of low-level details
• Streaming actors use buffers to communicate
• Much recent work on extending the portability of streaming applications
[Diagram: example stream graph — Actor 1 feeds a splitter, the splitter fans out to Actors 2–5, and a joiner merges their outputs into Actor 6]
Sponge
– Generates optimized CUDA for a wide variety of GPU targets
– Performs an array of optimizations on stream graphs
– Optimizes and ports across different GPU generations
– Utilizes the memory hierarchy (registers, shared memory, coalescing)
– Efficiently utilizes the streaming cores

Optimization phases:
• Reorganization and classification
• Memory layout
• Graph restructuring
• Register optimization
• Shared/global memory
• Helper threads
• Bank conflict resolution
• Loop unrolling
• Software prefetching
GPU Performance Model
• Memory-bound kernels: execution time ≈ memory time
  [Diagram: memory instructions M0–M7 dominate the timeline; computation instructions C0–C7 are hidden beneath them]
• Computation-bound kernels: execution time ≈ computation time
  [Diagram: computation instructions C0–C7 dominate; memory instructions M0–M7 overlap with them]
(M: memory instructions, C: computation instructions)
Actor Classification
• High-Traffic actors (HiT)
  – Large number of memory accesses per actor
  – Fewer threads when shared memory is used
  – Using shared memory underutilizes the processors
• Low-Traffic actors (LoT)
  – Fewer memory accesses per actor
  – More threads
  – Using shared memory increases performance
Global Memory Accesses
[Diagram: threads 0–3 each run actor A[4,4] directly against global memory; thread 0 consumes words 0–3, thread 1 words 4–7, thread 2 words 8–11, and thread 3 words 12–15]
• Large access latency
• The threads do not access the words in sequence
• No coalescing
(A[i, j]: actor A has i pops and j pushes)
Shared Memory
[Diagram: threads 0–3 cooperatively copy words 0–15 from global memory into shared memory — each copy step fetches four consecutive words (0–3, 4–7, 8–11, 12–15), so the global accesses coalesce — then each thread runs A[4,4] out of shared memory, and the results are copied back to global memory the same way]
Using Shared Memory
• Shared memory is ~100x faster than global memory
• Staging through it coalesces all global memory accesses
• The number of threads is limited by the size of the shared memory

Before (baseline):
  Begin Kernel <<<Blocks, Threads>>>:
    For number of iterations
      Work
  End Kernel

After (staging through shared memory):
  Begin Kernel <<<Blocks, Threads>>>:
    For number of iterations
      For number of pops
        Shared ← Global
      syncthreads
      Work
      syncthreads
      For number of pushes
        Shared → Global
  End Kernel
Helper Threads
• Shared memory limits the number of threads
• Underutilized processors can fetch data
• All the helper threads are in one warp (no control-flow divergence)

Before (baseline):
  Begin Kernel <<<Blocks, Threads>>>:
    For number of iterations
      Work
  End Kernel

After (with helper threads):
  Begin Kernel <<<Blocks, Threads + Helpers>>>:
    For number of iterations
      If helper threads
        Shared ← Global
      syncthreads
      If worker threads
        Work
      syncthreads
      If helper threads
        Shared → Global
  End Kernel
Data Prefetch
• Better register utilization
• Data for iteration i+1 is moved into registers
• Data for iteration i is moved from registers to shared memory
• Allows the GPU to overlap instructions

Before (staging through shared memory):
  Begin Kernel <<<Blocks, Threads>>>:
    For number of iterations
      For number of pops
        Shared ← Global
      syncthreads
      Work
      syncthreads
      For number of pushes
        Shared → Global
  End Kernel

After (with prefetching):
  Begin Kernel <<<Blocks, Threads>>>:
    For number of pops
      Regs ← Global
    For number of iterations
      For number of pops
        Shared ← Regs
      If not the last iteration
        For number of pops
          Regs ← Global
      syncthreads
      Work
      syncthreads
      For number of pushes
        Shared → Global
  End Kernel
Loop Unrolling
• Similar to traditional unrolling
• Allows the GPU to overlap instructions
• Better register utilization
• Less loop-control overhead
• Can also be applied to the memory-transfer loops

  Begin Kernel <<<Blocks, Threads>>>:
    For number of iterations/2
      For number of pops
        Shared ← Global
      syncthreads
      Work
      syncthreads
      For number of pushes
        Shared → Global
      For number of pops
        Shared ← Global
      syncthreads
      Work
      syncthreads
      For number of pushes
        Shared → Global
  End Kernel
Methodology
• Set of benchmarks from the StreamIt suite
• 3 GHz Intel Core 2 Duo CPU with 6 GB RAM
• NVIDIA GeForce GTX 285:
  Stream Processors: 240
  Processor Clock: 1476 MHz
  Memory Configuration: 2 GB DDR3
  Memory Bandwidth: 159.0 GB/s
Results (Baseline: CPU)
[Chart: speedup (x) over the CPU for DCT, FFT, Matrix Multiply, Matrix Multiply Block, Bitonic, Batcher, Radix, Merge Sort, Comparison Counting, Vector Add, Histogram, and the average, with and without CPU–GPU transfer time; average speedups of 10x with transfer and 24x without]
Results (Baseline: GPU)
[Chart: speedup (x) over a baseline GPU implementation for the same benchmarks as the optimizations are applied cumulatively — Shared/Global memory, Prefetch/Unrolling, Helper Threads, Graph Restructuring — with annotated contributions of 64%, 3%, 16%, and 16%]
Conclusion
• Future systems will be heterogeneous
• GPUs are an important part of such systems
• Programming complexity is a significant challenge
• Sponge automatically creates optimized CUDA code for a wide variety of GPU targets
• It provides portability by performing an array of optimizations on stream graphs
Spatial Intermediate Representation
• StreamIt
• Main constructs:
  – Filter: encapsulates computation
  – Pipeline: expresses pipeline parallelism
  – Splitjoin: expresses task-level parallelism
  – Other constructs are not relevant here
• Exposes different types of parallelism
  – Composable, hierarchical
• Stateful and stateless filters
[Diagram: a pipeline of filters containing a splitjoin]
Bank Conflict
[Diagram: threads 0–2 each run A[8,8] against shared memory; with stride s = 8, threads 0, 1, and 2 access words 0, 8, and 16, which map to banks 0, 8, and 0 — threads 0 and 2 conflict]
data = buffer[BaseAddress + s * ThreadId]
Removing Bank Conflict
[Diagram: with stride s = 9, threads 0, 1, and 2 access words 0, 9, and 18, which map to banks 0, 9, and 2 — no conflict]
data = buffer[BaseAddress + s * ThreadId]
If GCD(number of banks, s) = 1, there is no bank conflict; since the number of banks is a power of two, s must be odd.