TRANSCRIPT
Slide 1/8
Performance Debugging for Highly Parallel Accelerator Architectures
Saurabh Bagchi
ECE & CS, Purdue University
Joint work with: Tsungtai Yeh, Amit Sabne, Rudolf Eigenmann (Purdue)
Presentation available at: engineering.purdue.edu/dcsl
Slide 2/8
Emerging Trend
• Heterogeneous computing is gaining ground as a way to accelerate performance of parallel applications
• Buzzword is “accelerators”
– Graphics Processing Units (GPUs)
– Field Programmable Gate Arrays (FPGAs)
• Attraction is high degree of parallelism close to the main processor
– Example: Dell PowerEdge servers have 2 Kepler GPUs with a total of 2 × 1536 CUDA cores
Slide 3/8
But … not so fast
• Programming models for these architectures hide lots of architecture details
– As they should
• But, these architectures are ripe for committing horrendous performance errors
– Even more so than in traditional CPU architectures
• Why?
– FPGA: Constrained on-chip memory; a careless program can wipe out any performance improvement by going to the main processor
– GPU: Multiple levels of memory hierarchy with widely different access latencies; identical control flow mandated for all threads within a block
Slide 4/8
GPU Schematic
[Figure: CUDA hierarchy of threads, thread blocks, and grids with per-thread private, per-block shared, and per-application global memory spaces]
[Figure: Memory hierarchy]
Slide 5/8
Specs leading to Performance Problem
• Shared memory and L1 cache are limited
– 16 KB-48 KB or 48 KB-16 KB
– Very fast access: 1+ TB/s
• Global memory is accessible by all threads on the GPU
– Larger amount of memory: 8 GB
– Slower access: 320 GB/s
• If communication with the host memory is required (over PCI Express bus), then much slower
– PLDI '12 paper shows a 5X speedup if avoiding cyclic communication
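To put the bandwidth figures above in proportion, here is a minimal back-of-the-envelope sketch in Python. The shared-memory and global-memory numbers come from the slide; the PCI Express figure is an assumption chosen only for illustration, and the model ignores latency and overlap entirely.

```python
# Bandwidths in bytes per second.
BANDWIDTHS = {
    "shared": 1e12,    # slide: "1+ TB/s" (treated as a lower bound)
    "global": 320e9,   # slide: "320 GB/s"
    "pcie": 8e9,       # ASSUMPTION: illustrative host<->device rate, not from the slide
}

def transfer_seconds(num_bytes, bandwidth):
    """Idealized time to move num_bytes at a sustained bandwidth."""
    return num_bytes / bandwidth

payload = 1 << 30  # 1 GiB of data
times = {name: transfer_seconds(payload, bw) for name, bw in BANDWIDTHS.items()}

for name, seconds in times.items():
    print(f"{name:>6}: {seconds * 1e3:8.2f} ms")
```

Under these assumptions the same payload takes roughly three times longer from global memory than from shared memory, and far longer again over the host link, which is why avoiding repeated host round trips can pay off.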
Slide 6/8
Common Patterns of Performance Bugs
• Memory bugs
– Uncoalesced memory accesses
– Bank conflicts in shared memory
– Channel skew in global memory
– Scheduling of host-to-device memory transfers
• Multi-thread bugs
– Block/thread configuration
– Branch divergence
• Synchronization bugs
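As an illustration of the shared-memory bank conflict item in this list, the following Python sketch models how a warp's access stride maps onto banks. It assumes 32 banks of 4-byte words and a 32-thread warp, and it treats several threads reading the same word as a conflict-free broadcast; real hardware differs by generation, so this is a simplified model, not a description of any specific GPU.

```python
from collections import defaultdict

NUM_BANKS = 32   # ASSUMPTION: 32 banks, each one 4-byte word wide
WARP_SIZE = 32   # ASSUMPTION: 32 threads per warp

def bank_conflict_degree(word_indices):
    """Return the worst-case number of distinct words that hit one bank.

    A degree of 1 means conflict-free; a degree of k means the access is
    serialized into k passes in this simplified model.
    """
    words_per_bank = defaultdict(set)
    for word in word_indices:
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())

def strided_access(stride):
    """Word index touched by each thread of a warp for a given stride."""
    return [tid * stride for tid in range(WARP_SIZE)]

for stride in (1, 2, 32, 33):
    print(f"stride {stride:2d}: {bank_conflict_degree(strided_access(stride))}-way")
```

In this model a stride of 32 words sends every thread to the same bank, while padding the stride to 33 spreads the warp across all banks again.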
Slide 7/8
Performance debugger work flow
[Flowchart boxes, in reading order]
Benchmarking (small scales or small data)
Profiling
Detect performance anomaly
Localize the problem
Automatic program transformation
Re-benchmarking
Acceptable? (No / Yes)
Break
Program Static Analysis
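The work flow can be read as an iterative loop. The Python sketch below is one possible reading, assuming that "Yes" on the acceptability check exits the loop and "No" starts another round. All function names and the toy "program" are hypothetical stand-ins; the slide does not specify how each stage is implemented.

```python
def debug_loop(program, benchmark, localize, transform, acceptable, max_rounds=8):
    """Benchmark, localize, transform, and re-benchmark until acceptable."""
    measurements = [benchmark(program)]                  # benchmarking and profiling
    for _ in range(max_rounds):
        problem = localize(program, measurements[-1])    # detect and localize
        if problem is None:                              # nothing left to fix
            break
        program = transform(program, problem)            # automatic transformation
        measurements.append(benchmark(program))          # re-benchmarking
        if acceptable(measurements[-1]):                 # "Acceptable?" check
            break
    return program, measurements

# Hypothetical toy stand-ins: a "program" whose throughput drops with its stride.
def toy_benchmark(program):
    return 100.0 / program["stride"]

def toy_localize(program, measurement):
    return "stride" if program["stride"] > 1 else None

def toy_transform(program, problem):
    return {**program, problem: program[problem] // 2}

final_program, history = debug_loop(
    {"stride": 8}, toy_benchmark, toy_localize, toy_transform,
    acceptable=lambda throughput: throughput >= 100.0,
)
print(final_program, history)
```

The `max_rounds` cap is a design choice added here so that a transformation that never reaches the target cannot loop forever.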
Slide 8/8
Example of a Performance Bug
• Matrix transpose on GPU
– The memory bandwidth of GTX 280 is 140 GB/sec
• For 2048 × 2048 matrix
– Naïve transpose: 2.2 GB/s
– Coalesced transpose: 17.1 GB/s
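A rough way to see why the two transposes differ is to count how many memory segments one warp touches. The Python sketch below assumes a row-major matrix of 4-byte elements, 32-thread warps, and 128-byte memory segments, and it uses a 32 × 32 tiled layout as one common way to obtain coalesced accesses; the slide does not say which coalescing technique was measured, so the tiled variant is an assumption.

```python
N = 2048        # matrix dimension from the slide
ELEM = 4        # ASSUMPTION: 4-byte elements
SEGMENT = 128   # ASSUMPTION: 128-byte memory segments
WARP = 32       # ASSUMPTION: 32 threads per warp

def segments_touched(element_indices):
    """Number of distinct memory segments covered by one warp's accesses."""
    return len({(index * ELEM) // SEGMENT for index in element_indices})

def naive_transpose_warp(y, x0):
    """Naive kernel: each thread reads in[y][x] and writes out[x][y]."""
    xs = range(x0, x0 + WARP)
    reads = [y * N + x for x in xs]     # consecutive addresses
    writes = [x * N + y for x in xs]    # stride of N elements
    return segments_touched(reads), segments_touched(writes)

def tiled_transpose_warp(by, bx, ty):
    """Tiled kernel: rows of a 32x32 tile are read and written contiguously."""
    txs = range(WARP)
    reads = [(by * WARP + ty) * N + bx * WARP + tx for tx in txs]
    writes = [(bx * WARP + ty) * N + by * WARP + tx for tx in txs]
    return segments_touched(reads), segments_touched(writes)

print("naive (read, write) segments:", naive_transpose_warp(0, 0))
print("tiled (read, write) segments:", tiled_transpose_warp(3, 5, 7))
```

In this model the naive kernel's writes spread one warp over 32 segments, while the tiled kernel keeps both reads and writes within a single segment per warp.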
Slide 9/8
Can We Do This Automatically?
• Training Phase (A Series of Small-scale Testing Runs)
– Instrumentation to record observational features
– Modeling to train a model that can predict observational features from control features
• Deployment Phase (Large-scale Production Runs)
– Instrumentation to record the same features
– Detection to flag production runs with negative correlation
– Localization
  • Use the trained model to reconstruct observational feature
  • Rank features by reconstruction error
• Some lessons from our prior work [HPDC '11] [HotDep '12]
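The train-then-localize idea can be sketched in a few lines of Python. This is a deliberately simplified stand-in: it fits an ordinary least-squares line per observational feature against a single control feature, rather than the model used in the cited prior work, and the feature names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def train(control, observations):
    """Training phase: one fitted line per observational feature."""
    return {name: fit_line(control, ys) for name, ys in observations.items()}

def rank_by_reconstruction_error(model, control_value, observed):
    """Deployment phase: reconstruct each feature and rank by relative error."""
    errors = {}
    for name, (slope, intercept) in model.items():
        predicted = slope * control_value + intercept
        errors[name] = abs(observed[name] - predicted) / (abs(predicted) + 1e-12)
    return sorted(errors.items(), key=lambda item: item[1], reverse=True)

# Hypothetical small-scale training runs (control feature = scale of the run).
scales = [8, 16, 32, 64, 128]
training = {
    "loop_iters": [2 * s for s in scales],
    "branch_taken": [s + 3 for s in scales],
}
model = train(scales, training)

# Hypothetical large-scale run in which one feature deviates from its trend.
ranking = rank_by_reconstruction_error(
    model, 1024, {"loop_iters": 2048, "branch_taken": 5000}
)
print(ranking)
```

The feature at the top of the ranking is the one whose observed value departs most from what the small-scale trend predicts, which is where a developer would look first.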
Slide 10/8
Can We Do This Automatically?
• Maybe
• Some lessons from our prior work [HPDC '11] [HotDep '12]
[Figure: Control Feature X is mapped by f(*) and Observational Feature Y by g(*); a run with corr(f(x), g(y)) < 0 is flagged as BUG!]
Kernel Canonical Correlation Analysis takes control feature X and observational feature Y and finds f and g such that f(X) and g(Y) are highly correlated
[Figure axes: Behavioral Feature vs. Scale of Execution]
Slide 11/8
ABHRANTA: a Predictive Model for Program Behavior at Large Scale
• ABHRANTA replaced the non-invertible transform g used by Vrisha with a linear transform g'
• The new model provides an automatic way to reconstruct “bug-free” behavior at large scale, lifting the burden of manual analysis of program scaling behavior
[Figure: Control Feature X is mapped by f(*) to f(x); Observational Feature Y is mapped by g'(*); the reconstructed behavior is g'^-1(f(x))]
Slide 12/8
Results from HPC Benchmark
• AMG2006 is a parallel algebraic multigrid solver for linear systems, written in 104K lines of C code.
– The application is configured to solve the default 3D Laplace type problem
• Train on 8-128 node runs, test at larger scales (up to 4096 nodes)
• Fault injection study
– Integer overflows, buffer overflows
• Control features: X, Y, Z dimensions of 3D grid
• Observational features: All conditionals indexed by calling context
Slide 13/8
Can We Do This for GPU Programs?
• We think we can. (Wild?) Speculation!
• Features that make this approach more feasible
1. More regular kernels than general purpose programs
2. Good places to insert monitors to observe behavioral features
3. Often spare computational capacity close by
4. Types of performance bugs are limited
5. Types of program transformations are limited
Slide 14/8
Presentation available at: Dependable Computing Systems Lab (DCSL) web site
engineering.purdue.edu/dcsl