Designing Physics Algorithms for GPU Architecture
Takahiro HARADA, AMD
2| Designing Physics Algorithms for GPU Architecture | March 1, 2011
Narrow phase on GPU
Narrow phase is parallel
– How to solve each pair?
Design it for a specific architecture
GPU Architecture
Radeon HD 5870
– 2.72 TFLOPS (single precision), 544 GFLOPS (double precision), 153.6 GB/s
– Many cores: 20 SIMDs x 64-wide SIMD
CPU: SSE is 4-wide SIMD
The program of a work item is packed into VLIW instructions, then executed
[Figure: Radeon HD 5870 with 20 SIMDs next to Phenom II X4 with 4 cores]
20 (SIMDs) x 16 (thread processors) x 5 (stream cores) = 1600
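The stream core count and the single-precision peak can be tied together with a little arithmetic. The sketch below uses the unit counts from the slide; the 850 MHz engine clock and the 2 FLOPs per stream core per cycle (one multiply-add) are assumptions that the slides do not state.

```python
# Unit counts from the slide.
SIMDS = 20
THREAD_PROCESSORS_PER_SIMD = 16
STREAM_CORES_PER_THREAD_PROCESSOR = 5

# Assumptions, not stated in the slides: 850 MHz engine clock,
# and one multiply-add (2 FLOPs) per stream core per cycle.
CLOCK_MHZ = 850
FLOPS_PER_CYCLE = 2

stream_cores = (SIMDS
                * THREAD_PROCESSORS_PER_SIMD
                * STREAM_CORES_PER_THREAD_PROCESSOR)
peak_single_mflops = stream_cores * FLOPS_PER_CYCLE * CLOCK_MHZ
peak_single_tflops = peak_single_mflops / 1_000_000

print(stream_cores)        # 1600
print(peak_single_tflops)  # 2.72
```

Under those assumptions the result matches the 2.72 TFLOPS single-precision figure quoted for the card.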
Memory
Register
Global memory
– “Main memory”
– Large
– High latency
Local Data Share (LDS)
– Low latency
– High bandwidth
– Like a user-managed cache
– Key to getting high performance
[Figure: GPU with 20 SIMDs; each SIMD has a 32 KB Local Data Share, and all SIMDs access a global memory of more than 1 GB at 153.6 GB/s]
Narrow phase on CPU
Methods on CPUs (GJK)
– Handle any convex shape
– Possible to implement on the GPU
– Complicated for the GPU
– Divergence => low ALU utilization
GPUs prefer simpler algorithms with less logic
Why are GPUs not good at complicated logic?
– Wide SIMD architecture
void Kernel()
{
    executeX();
    switch()
    {
    case A: { executeA(); break; }
    case B: { executeB(); break; }
    case C: { executeC(); break; }
    }
    finish();
}
[Figure: SIMD lanes 0-7 executing the kernel; labels 25%, 25%, 50%]
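The cost of divergence on a wide SIMD can be estimated with a small model. The sketch below assumes the 25%, 25%, 50% labels mean that 2, 2, and 4 of the 8 lanes take cases A, B, and C, and that each case costs one time unit. Both are assumptions for illustration, and `simd_utilization` is a hypothetical helper, not something from the talk.

```python
from collections import Counter

def simd_utilization(lane_cases, cost_per_case):
    """Fraction of lane-cycles doing useful work when the SIMD
    runs every taken case one after another."""
    lanes = len(lane_cases)
    taken = Counter(lane_cases)
    # The SIMD steps through each case that at least one lane takes.
    total_cycles = sum(cost_per_case[c] for c in taken)
    # A lane is only active while its own case is running.
    useful_lane_cycles = sum(cost_per_case[c] * n for c, n in taken.items())
    return useful_lane_cycles / (lanes * total_cycles)

cost = {"A": 1, "B": 1, "C": 1}
divergent = ["A", "A", "B", "B", "C", "C", "C", "C"]
uniform = ["C"] * 8

print(simd_utilization(divergent, cost))  # one third of the lane-cycles
print(simd_utilization(uniform, cost))    # 1.0
```

In this model the divergent switch keeps only a third of the lanes busy on average, while a kernel in which every lane takes the same path keeps all of them busy.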
Narrow phase on GPU
Particles
– Search for neighboring particles
– Collide with all of them
– Accurate shape representation needs higher resolution
Acceleration structure in each shape
– Increases complexity
Number of contacts explodes
Etc.
Can we make it better but keep it simple?
void Kernel()
{
    prepare();
    collide(p0);
    collide(p1);
    collide(p2);
    collide(p3);
    collide(p4);
    collide(p5);
    collide(p6);
    collide(p7);
}
[Figure: SIMD lanes 0-7]
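A particle kernel of this shape can be sketched as follows. The slides do not say what collide() computes, so the equal-radius spheres and the penetration-depth result here are assumptions for illustration. The point is only that every work item runs the same fixed sequence of calls, and that a non-contact is expressed with max() rather than an early exit (on a GPU this would typically compile to a max instruction instead of a branch).

```python
import math

def collide(p, q, radius):
    """Penetration depth of two equal-radius spheres; 0.0 if they do not touch."""
    dist = math.dist(p, q)
    return max(0.0, 2.0 * radius - dist)

def kernel(particle, neighbors, radius):
    """Same sequence of collide() calls for every work item, no early exit."""
    return sum(collide(particle, n, radius) for n in neighbors)

total = kernel((0.0, 0.0, 0.0),
               [(1.5, 0.0, 0.0), (3.0, 0.0, 0.0)],
               radius=1.0)
print(total)  # 0.5: only the first neighbor overlaps
```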
A good approach for GPUs, derived from the architecture
We have to know what GPUs like
– Fewer branches
Less divergence
– Use of the LDS on each SIMD
– Latency hiding
Why latency?
Work group (WG), work item (WI)
[Figure: Radeon HD 5870 with 20 SIMDs; Work Group 0, Work Group 1, and Work Group 2 are assigned to one SIMD]
SIMD lane (64 lanes) <-> work item (64 items)
– Work Group 0: Particle[0-63]
– Work Group 1: Particle[64-127]
– Work Group 2: Particle[128-191]
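The particle-to-work-item mapping in the figure is a plain division by the work group size. A minimal sketch, where `locate` is a hypothetical helper name:

```python
WORK_GROUP_SIZE = 64  # one work item per SIMD lane

def locate(particle_index):
    """Return (work group id, local work item id) for a particle."""
    return divmod(particle_index, WORK_GROUP_SIZE)

print(locate(0))    # (0, 0)
print(locate(63))   # (0, 63)
print(locate(64))   # (1, 0)
print(locate(191))  # (2, 63)
```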
How does the GPU hide latency?
Memory access latency
– Does not rely on a cache
A SIMD hides latency by switching between WGs
The more WGs per SIMD, the better
– 1 WG/SIMD cannot hide latency
– Overlap computation with memory requests
What determines the # of WGs/SIMD?
– Local resource usage
[Figure: WorkGroup0, WorkGroup1, WorkGroup2, and WorkGroup3 sharing one SIMD]
void Kernel()
{
    readGlobalMem();
    compute();
}
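Why more WGs per SIMD helps can be seen in an idealized model. It assumes each WG alternates one memory wait with one compute burst and that the SIMD can switch to any ready WG at no cost. The latency and compute times below are made-up numbers, and `simd_busy_fraction` is a hypothetical helper.

```python
def simd_busy_fraction(num_wgs, mem_latency, compute_time):
    """Idealized share of time the SIMD spends computing."""
    # One WG alone keeps the SIMD busy for compute_time
    # out of every (mem_latency + compute_time) time units.
    per_wg = compute_time / (mem_latency + compute_time)
    # Extra WGs fill the idle time until the SIMD is saturated.
    return min(1.0, num_wgs * per_wg)

for n in (1, 2, 4, 8):
    print(n, simd_busy_fraction(n, mem_latency=300, compute_time=100))
```

With these numbers a single WG leaves the SIMD idle three quarters of the time, and four WGs are enough to cover the memory wait completely.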
Why reduce resource usage?
Registers are a limited resource
# of WGs/SIMD is bounded by
– SIMD regs / (regs used by the kernel)
– LDS / (LDS used by the kernel)
Fewer regs
– More WGs
– Better latency hiding
Register overflow -> spills to global memory
SIMD engine (8 regs)
– KernelA (8 regs): 1 WG fits
– KernelB (4 regs): 2 WGs fit
– KernelC (2 regs): 4 WGs fit
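The two resource limits can be combined into one small calculation. This is a simplified sketch: real hardware also has other limits (for example a maximum number of resident wavefronts), and `wgs_per_simd` is a hypothetical helper. The 8-register SIMD and the three kernels come from the slide; the 32 KB LDS size and the 16 KB LDS use in the last call are assumed values for illustration.

```python
def wgs_per_simd(simd_regs, kernel_regs, simd_lds, kernel_lds):
    """Number of work groups that fit on one SIMD at the same time."""
    by_regs = simd_regs // kernel_regs
    if kernel_lds == 0:
        return by_regs  # the kernel uses no LDS, so only registers limit it
    by_lds = simd_lds // kernel_lds
    return min(by_regs, by_lds)

SIMD_REGS = 8
SIMD_LDS = 32 * 1024  # bytes, assumed

print(wgs_per_simd(SIMD_REGS, 8, SIMD_LDS, 0))          # KernelA: 1
print(wgs_per_simd(SIMD_REGS, 4, SIMD_LDS, 0))          # KernelB: 2
print(wgs_per_simd(SIMD_REGS, 2, SIMD_LDS, 0))          # KernelC: 4
print(wgs_per_simd(SIMD_REGS, 2, SIMD_LDS, 16 * 1024))  # LDS-bound: 2
```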
Preview of Current Approach
1 WG processes 1 pair
Reduce resource usage
Less branching
– compute() is branch free
– No dependency
Use of LDS
– No global memory access in compute()
– Random access to LDS
Latency hiding
– Pair data is stored per WG, not per WI
WIs work together
Unified method for all shapes
[Figure: work items 0-7]
void Kernel()
{
    fetchToLDS();
    BARRIER;
    compute();
    BARRIER;
    workTogether();
    BARRIER;
    Writeback();
}
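The phase structure of this kernel can be mimicked on the CPU by finishing each phase for all work items before starting the next, which stands in for BARRIER. The slides do not say what compute() and workTogether() actually do, so the per-item partial sum and the tree reduction below are stand-ins chosen only to show the pattern. The sketch also assumes a power-of-two work group size.

```python
def run_work_group(pair_data, num_work_items=8):
    """Simulate one work group processing one pair, phase by phase."""
    # fetchToLDS(): work items copy the pair data cooperatively (strided).
    lds = [None] * len(pair_data)
    for wi in range(num_work_items):
        for i in range(wi, len(pair_data), num_work_items):
            lds[i] = pair_data[i]
    # BARRIER
    # compute(): each work item handles its own slice, reading only the LDS copy.
    partial = [sum(lds[wi::num_work_items]) for wi in range(num_work_items)]
    # BARRIER
    # workTogether(): tree reduction across work items.
    stride = num_work_items // 2
    while stride > 0:
        for wi in range(stride):
            partial[wi] += partial[wi + stride]
        stride //= 2  # a BARRIER separates the reduction steps
    # BARRIER
    # Writeback(): work item 0 writes the single result for the pair.
    return partial[0]

print(run_work_group(list(range(20))))  # 190
```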
Solver
Fusion
Choosing a processor
The CPU can do everything
– Not as good as the GPU at highly parallel computation
The GPU is a very powerful processor
– Only for parallel computation
Real problems have both
The GPU is far from the CPU
Fusion
GPU and CPU are close
Faster communication between GPU and CPU
Use both GPU and CPU
– Parallel workload -> GPU
– Serial workload -> CPU
Collision between large and small particles
Granularity of computation
– A large particle collides with more particles
– Inefficient use of the GPU
[Figure: SIMD lanes 0-7]
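The inefficiency can be put into numbers with a simple model in which every lane of a SIMD has to wait until the lane with the most collisions is done. The collision counts below are made up for illustration, and `lane_efficiency` is a hypothetical helper.

```python
def lane_efficiency(collisions_per_lane):
    """Useful share of lane-time when all lanes wait for the slowest one."""
    longest = max(collisions_per_lane)
    return sum(collisions_per_lane) / (len(collisions_per_lane) * longest)

small_only = [4] * 8        # eight small particles, 4 collisions each
mixed = [32] + [4] * 7      # one large particle among small ones

print(lane_efficiency(small_only))  # 1.0
print(lane_efficiency(mixed))       # 0.234375
```

A single large particle in the group drags the whole SIMD down to the pace of its lane, which is the kind of serial, coarse-grained work the slides suggest moving to the CPU.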
Q & A