
Performance Debugging for Highly Parallel Accelerator Architectures

Saurabh Bagchi
ECE & CS, Purdue University

Joint work with: Tsungtai Yeh, Amit Sabne, Rudolf Eigenmann (Purdue)

Presentation available at: engineering.purdue.edu/dcsl


Emerging Trend

• Heterogeneous computing is gaining ground as a way to accelerate performance of parallel applications
• Buzzword is “accelerators”
  – Graphics Processing Units (GPUs)
  – Field Programmable Gate Arrays (FPGAs)
• Attraction is high degree of parallelism close to the main processor
  – Example: Dell PowerEdge servers have 2 Kepler GPUs with a total of 2 × 1536 CUDA cores


But … not so fast

• Programming models for these architectures hide lots of architecture details
  – As they should
• But, these architectures are ripe for committing horrendous performance errors
  – Even more so than in traditional CPU architectures
• Why?
  – FPGA: Constrained on-chip memory; a careless program can wipe out any performance improvement by going to the main processor
  – GPU: Multiple levels of memory hierarchy with widely different access latencies; threads within a warp must follow identical control flow, otherwise the divergent paths are serialized
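A minimal sketch of why divergent control flow hurts, using a simplified lockstep cost model. It assumes groups of 32 threads (a warp) execute together and that each distinct branch path taken inside one group costs one serialized pass; the function name `warp_branch_passes` and the cost model are illustrative assumptions, not something taken from the talk.

```python
WARP_SIZE = 32  # assumption: typical NVIDIA warp width


def warp_branch_passes(num_threads, branch_of, warp_size=WARP_SIZE):
    """Count serialized passes under a simplified lockstep model.

    Each warp needs one pass per distinct branch target chosen by its threads.
    `branch_of` maps a thread id to the branch that thread takes.
    """
    passes = 0
    for start in range(0, num_threads, warp_size):
        lanes = range(start, min(start + warp_size, num_threads))
        passes += len({branch_of(t) for t in lanes})
    return passes


# 128 threads = 4 warps.
# Branching on the warp index keeps every warp on a single path.
uniform = warp_branch_passes(128, lambda t: (t // WARP_SIZE) % 2)
# Branching on thread parity splits every warp across two paths.
divergent = warp_branch_passes(128, lambda t: t % 2)

print("passes, branch on warp index:   ", uniform)
print("passes, branch on thread parity:", divergent)
```

In this model the parity branch doubles the number of passes, even though both kernels do the same total work per thread.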


GPU Schematic

[Figure: CUDA hierarchy of threads, thread blocks, and grids, with per-thread private, per-block shared, and per-application global memory spaces]

[Figure: Memory hierarchy]


Specs leading to Performance Problems

• Shared memory and L1 cache are limited
  – Split between the two as 16 KB-48 KB or 48 KB-16 KB
  – Very fast access: 1+ TB/s
• Global memory is accessible by all threads on the GPU
  – Larger amount of memory: 8 GB
  – Slower access: 320 GB/s
• If communication with the host memory is required (over PCI Express bus), then much slower
  – PLDI 12 paper shows a 5X speedup if avoiding cyclic communication
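A back-of-the-envelope sketch of what these bandwidths mean for moving one buffer. The shared and global figures are the ones quoted on the slide; the host-link figure `ASSUMED_PCIE_BW` and the 64 MiB payload are assumptions chosen only for illustration, and latency is ignored entirely.

```python
GB = 1e9

# Bandwidths quoted on the slide, in bytes per second.
SHARED_BW = 1000 * GB  # "1+ TB/s", taken as a lower bound
GLOBAL_BW = 320 * GB
# Assumption, not from the slide: a nominal 8 GB/s host link.
ASSUMED_PCIE_BW = 8 * GB


def transfer_ms(num_bytes, bandwidth):
    """Ideal time in milliseconds to move num_bytes, ignoring latency."""
    return num_bytes / bandwidth * 1e3


payload = 64 * 1024 * 1024  # 64 MiB, an arbitrary example size
times = {
    "shared": transfer_ms(payload, SHARED_BW),
    "global": transfer_ms(payload, GLOBAL_BW),
    "host (assumed)": transfer_ms(payload, ASSUMED_PCIE_BW),
}
for name, ms in times.items():
    print(f"{name:>15}: {ms:.3f} ms")
```

Even under these idealized numbers, a round trip through host memory dominates, which is consistent with the slide's point about avoiding cyclic communication.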


Common Patterns of Performance Bugs

• Memory bugs
  – Uncoalesced memory access
  – Bank conflicts in shared memory
  – Channel skew in global memory
  – Scheduling of host-to-device memory transfers
• Multi-thread bugs
  – Block/thread configuration
  – Branch divergence
• Synchronization bugs
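As an illustration of one pattern from the list, shared-memory bank conflicts, here is a small sketch that computes the conflict degree of a single warp access. It assumes 32 banks of 4-byte words and that threads reading the same word are served by a broadcast; `bank_conflict_degree` is a hypothetical helper, not part of any profiler.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide


def bank_conflict_degree(word_indices, num_banks=NUM_BANKS):
    """Return the worst-case number of distinct words mapped to one bank.

    1 means conflict-free; k means the access is replayed roughly k times.
    """
    words_per_bank = Counter()
    for word in set(word_indices):  # same word = broadcast, no conflict
        words_per_bank[word % num_banks] += 1
    return max(words_per_bank.values())


stride1 = bank_conflict_degree([t * 1 for t in range(32)])
stride2 = bank_conflict_degree([t * 2 for t in range(32)])
stride32 = bank_conflict_degree([t * 32 for t in range(32)])  # column of a 32-wide tile
padded = bank_conflict_degree([t * 33 for t in range(32)])    # same column, row padded to 33

print(stride1, stride2, stride32, padded)
```

Padding each row of the tile by one word moves the column accesses onto distinct banks, which is the usual fix for the stride-32 case.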


Performance debugger work flow

[Flowchart]
• Benchmarking (small scales or small data)
• Profiling
• Detect performance anomaly
• Localize the problem
• Automatic program transformation
• Re-benchmarking
• Acceptable? (No: iterate again; Yes: break)
• Program static analysis (separate box in the flowchart)
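The loop in this flowchart can be sketched as a small driver. Everything here is hypothetical scaffolding: the callables `benchmark`, `detect_and_localize`, `transform`, and `acceptable` stand in for the real profiling and transformation stages, and the toy stages below use made-up throughput units.

```python
def debug_performance(program, benchmark, detect_and_localize, transform,
                      acceptable, max_rounds=5):
    """Benchmark, localize a problem, transform, and re-benchmark until acceptable."""
    history = [benchmark(program)]
    for _ in range(max_rounds):
        if acceptable(history[-1]):
            break  # "Acceptable? Yes" exits the loop
        problem = detect_and_localize(program, history[-1])
        if problem is None:
            break  # nothing left that the debugger knows how to fix
        program = transform(program, problem)
        history.append(benchmark(program))  # re-benchmarking
    return program, history


# Toy stages: a "program" is the set of fixes applied so far.
def toy_benchmark(program):
    return 1.0 * 2 ** len(program)  # arbitrary units: each fix doubles throughput


def toy_localize(program, measurement):
    remaining = [fix for fix in ("coalesce", "pad_shared") if fix not in program]
    return remaining[0] if remaining else None


def toy_transform(program, fix):
    return program | {fix}


final, history = debug_performance(
    frozenset(), toy_benchmark, toy_localize, toy_transform,
    acceptable=lambda throughput: throughput >= 4.0,
)
print(sorted(final), history)
```

The `max_rounds` cap is a design choice of this sketch so that a transformation that never helps cannot loop forever.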


Example of a Performance Bug

• Matrix transpose on GPU
  – The memory bandwidth of GTX 280 is 140 GB/sec
• For a 2048 × 2048 matrix
  – Naïve transpose: 2.2 GB/s
  – Coalesced transpose: 17.1 GB/s
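To see where the gap comes from, the sketch below counts how many memory segments one warp touches in a naive versus a tiled transpose. It is a transaction-count model, not a bandwidth measurement, and it does not reproduce the GB/s figures above. The 128-byte segment size, 4-byte elements, 32-thread warp, and the function names are assumptions made for illustration.

```python
SEGMENT_BYTES = 128  # assumption: one memory transaction covers 128 bytes
ELEM_BYTES = 4       # assumption: single-precision floats
WARP = 32


def segments_touched(element_indices):
    """Number of distinct memory segments covered by one warp access."""
    return len({(i * ELEM_BYTES) // SEGMENT_BYTES for i in element_indices})


def naive_transpose_segments(n, row, col0):
    """Thread t reads in[row][col0 + t] and writes out[col0 + t][row]."""
    reads = [row * n + (col0 + t) for t in range(WARP)]
    writes = [(col0 + t) * n + row for t in range(WARP)]
    return segments_touched(reads), segments_touched(writes)


def tiled_transpose_segments(n, tile_row0, tile_col0, r):
    """With a shared-memory tile, row r of the tile is read and written contiguously."""
    reads = [(tile_row0 + r) * n + (tile_col0 + t) for t in range(WARP)]
    writes = [(tile_col0 + r) * n + (tile_row0 + t) for t in range(WARP)]
    return segments_touched(reads), segments_touched(writes)


n = 2048
naive = naive_transpose_segments(n, row=0, col0=0)
tiled = tiled_transpose_segments(n, tile_row0=0, tile_col0=32, r=5)
print("naive (read, write) segments:", naive)
print("tiled (read, write) segments:", tiled)
```

In the naive version each warp write lands in 32 different segments because consecutive threads are a full row apart; staging through a tile makes both sides contiguous.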


Can We Do This Automatically?

• Training Phase (A Series of Small-scale Testing Runs)
  – Instrumentation to record observational features
  – Modeling to train a model that can predict observational features from control features
• Deployment Phase (Large-scale Production Runs)
  – Instrumentation to record the same features
  – Detection to flag production runs with negative correlation
  – Localization
    • Use the trained model to reconstruct observational feature
    • Rank features by reconstruction error
• Some lessons from our prior work [HPDC '11] [HotDep '12]
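A minimal sketch of the train-then-localize idea. Ordinary least squares on a single control feature stands in here for the learned model; it is not the model used in the prior work. The feature names and all numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def train(scales, observations):
    """Training phase: one model per observational feature."""
    return {name: fit_line(scales, ys) for name, ys in observations.items()}


def rank_by_reconstruction_error(model, scale, observed):
    """Deployment phase: reconstruct each feature and rank by relative error."""
    errors = {}
    for name, (slope, intercept) in model.items():
        predicted = slope * scale + intercept
        errors[name] = abs(observed[name] - predicted) / max(abs(predicted), 1e-12)
    return sorted(errors.items(), key=lambda kv: kv[1], reverse=True)


# Small-scale training runs (hypothetical data).
scales = [8, 16, 32, 64]
training = {
    "loop_iterations": [80, 160, 320, 640],
    "messages_sent": [19, 35, 67, 131],
}
model = train(scales, training)

# One large-scale production run in which "messages_sent" misbehaves.
ranking = rank_by_reconstruction_error(
    model, 1024, {"loop_iterations": 10240, "messages_sent": 9000}
)
print(ranking)
```

The feature at the top of the ranking is the one whose observed value deviates most from what the small-scale model predicts, which is where a developer would look first.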


Can We Do This Automatically?

• Maybe
• Some lessons from our prior work [HPDC '11] [HotDep '12]

[Figure: control feature X is mapped by f(·) and observational feature Y by g(·); a run with corr(f(x), g(y)) < 0 is flagged as a BUG. Axis labels: Scale of Execution, Behavioral Feature]

Kernel Canonical Correlation Analysis takes control feature X and observational feature Y and finds f and g such that f(X) and g(Y) are highly correlated
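The detection rule can be sketched with a plain Pearson correlation over projected features. This is a simplified stand-in: the projections `f` and `g` would come from the trained model, whereas here `math.log2` and the identity are placeholders, and the sample data is made up.

```python
import math


def pearson(us, vs):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    var_u = sum((u - mu) ** 2 for u in us)
    var_v = sum((v - mv) ** 2 for v in vs)
    return cov / math.sqrt(var_u * var_v)


def is_suspect(xs, ys, f, g):
    """Flag the observations if the projected features are negatively correlated."""
    return pearson([f(x) for x in xs], [g(y) for y in ys]) < 0


f = math.log2            # placeholder projection of the control feature
g = lambda y: y          # placeholder projection of the observational feature

scales = [8, 16, 32, 64]
healthy = is_suspect(scales, [3, 4, 5, 6], f, g)
buggy = is_suspect(scales, [6, 5, 4, 3], f, g)
print("healthy flagged:", healthy, "| buggy flagged:", buggy)
```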


ABHRANTA: a Predictive Model for Program Behavior at Large Scale

• ABHRANTA replaced the non-invertible transform g used by Vrisha with a linear transform g’
• The new model provides an automatic way to reconstruct “bug-free” behavior at large scale, lifting the burden of manual analysis of program scaling behavior

[Figure: control feature X is mapped by f(·) to f(x); observational feature Y is mapped by g’(·); the bug-free behavior is reconstructed as g’⁻¹(f(x))]
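Why a linear g’ helps can be shown with a scalar toy case: a linear map with nonzero slope has a closed-form inverse, so the expected behavior follows directly from f(x). The scalar setting, the coefficients, and the particular `f` below are assumptions for illustration; the real model operates on feature vectors.

```python
import math


def make_linear(a, b):
    """Return g'(y) = a * y + b together with its inverse; requires a != 0."""
    if a == 0:
        raise ValueError("a linear transform with zero slope is not invertible")
    return (lambda y: a * y + b), (lambda z: (z - b) / a)


def reconstruct(x, f, g_inv):
    """Predicted bug-free behavior: g'^-1(f(x))."""
    return g_inv(f(x))


# Hypothetical learned transforms.
f = lambda x: 3 * math.log2(x) + 1
g_prime, g_prime_inv = make_linear(0.5, -2.0)

y_hat = reconstruct(4096, f, g_prime_inv)
print("reconstructed behavior at scale 4096:", y_hat)
print("g'(y_hat) matches f(x):", math.isclose(g_prime(y_hat), f(4096)))
```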


Results from HPC Benchmark

• AMG2006 is a parallel algebraic multigrid solver for linear systems, written in 104K lines of C code
  – The application is configured to solve the default 3D Laplace type problem
• Train on 8-128 node runs, test at larger scales (up to 4096 nodes)
• Fault injection study
  – Integer overflows, buffer overflows
• Control features: X, Y, Z dimensions of 3D grid
• Observational features: All conditionals indexed by calling context


Can We Do This for GPU Programs?

• We think we can: (wild?) speculation!
• Features that make this approach more feasible
  1. More regular kernels than general purpose programs
  2. Good places to insert monitors to observe behavioral features
  3. Often spare computational capacity close by
  4. Types of performance bugs are limited
  5. Types of program transformations are limited


Presentation available at: Dependable Computing Systems Lab (DCSL) web site
engineering.purdue.edu/dcsl
