Development Tools for Parallel Computing

David Lecomber

CTO, Allinea Software

david@allinea.com

Agenda

● Introduction

● What is HPC

● Bugs and Debugging

● Debugging parallel applications

● Challenges for the future

About Allinea

● Development tools company for HPC

● Flagship product Allinea DDT

– Now the leading debugger in parallel computing

● The most scalable debugger

– Record holder for debugging software on largest machines

– Production use at extreme scale – and desktop

● Wide customer base

– Blue-chip engineering, government and academic research

– Strong collaborative relationships with customers and partners

What is HPC

● High Performance Computing

● Common aliases

– Supercomputing

– Scientific computing

– Parallel computing

● Simulation of some natural process/thing

● Intense number crunching: CPUs work flat out

● Historically distinct from data crunching

● Very large number of usually interrelated calculations: too big/slow for single machine

● Rarely real time today (but soon?)

● Examples

● Engineering - aerospace, automotive

● Sciences – nuclear physics, molecular modelling, astrophysics

● Oil and gas – reservoir modelling

● Medical – modelling of human heart, neurology

● Climate modelling and weather forecasting

Parallel programming in HPC

● A world of pragmatists

● Scientists, academics, grad students, engineers

● Fortran, C++

– Many legacy codebases

– Distributed development

– Difficult to test – scale, platforms, ...

● One dominant standard library: MPI

– Job launch and data transfer between machines

– Point to point communication (send, receive) and collective operations

– Single program, multiple data (SPMD) - multiple processes with separate memory

● Other models: OpenMP, PGAS languages

● Decades of parallel computing

● Problems naturally parallel – although sometimes complex to partition

Parallel Programming Models

● Shared memory - OpenMP

● Pragmas added to existing code

● ….

● #pragma omp parallel for

● for (i = 0; i < n; i++) {

● ….

● Can be straightforward

– Data race conditions a potential problem

– Shared memory required

● Try it with “gcc -fopenmp”

● Distributed memory - MPI

● Distributed memory communication library

– MPI_Send – send bytes to process N

– MPI_Recv – receive bytes from …

– MPI_Bcast – broadcast from one process (the root) to all

– ….

● Around 200 functions – many codes only use ~10

● Free implementations – eg. Open MPI, MPICH – do not require a cluster/supercomputer
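As a minimal sketch of the shared-memory model on this slide, the function below sums an array with a single OpenMP pragma. The name `parallel_sum` is made up for illustration and does not come from the talk. The `reduction(+:total)` clause is one common way to avoid the data race mentioned above: each thread works on a private partial sum and OpenMP combines them when the loop ends. Built without `-fopenmp`, the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n doubles. parallel_sum is an illustrative name, not from the slides. */
double parallel_sum(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double total = 0.0;
    long i;

    /* Each thread accumulates a private copy of total; OpenMP adds the
       private copies together when the loop finishes. */
    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < (long)n; i++) {
        total += x[i];
    }
    return total;
}
```

Without the reduction clause, every thread would update the same `total` at once, which is the kind of race condition the slide warns about.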

Example Code

The impact of multicore

● Cannot wait for faster processors to arrive

● Performance/capability leaps only via more parallelism

● Reluctant adopters of multicore – but why?

● Existing codes are parallel

● Scalability (performance) often tails off as process counts rise (“weak scaling”)

– “Two strong oxen or 1,024 chickens?”

– 8 Petaflops but near 10 Megawatts – efficiency is important

● One survey of 188 supercomputing centres (IDC):

– 52% of HPC applications run above 1 node

– 12% of HPC applications scale above 1,000 cores

– 1% of applications scale above 10,000 cores

● Software development is required to efficiently use more parallel resource

How extreme is it?

[Figure: Growth in HPC core counts, 2001 to 2011. Series: Average Cores, Largest, Smallest. Axes: Year against Core count (0 to 600,000).]

[Figure: HPC core counts, 2001 to 2011. Series: Average Cores, Smallest. Axes: Year against Core count (0 to 20,000).]

● Machine sizes are exploding

● Skewed by largest machines … but a common trend

● Largest system (Nov 2011)

– Japan – 10 Petaflops

– UK's largest: 90,000 cores and 2/3rd of a Petaflop

● Easier to build a machine than it is to program it

HPC's current challenge

● GPUs – a rival to traditional processors

● AMD and NVIDIA

● OpenCL, CUDA

● Great bang-for-bucks ratios

● A big challenge for HPC developers

● Data transfer

● Several memory levels

● Grid/block layout and thread scheduling

● Synchronization

● Tiny granularity – often one thread per single calculation (SIMD)

● New languages, compilers, potential standards

Example GPU algorithm

● Matrix-matrix multiplication

● For C

● Nested loops, ~4 lines of code in C

● For CUDA

● Transfer whole matrix to device memory

● Read lines of A for block to shared memory

● Read columns of B for block to shared memory

● Synchronise

● Calculate output (loop) – one output cell of C per GPU thread

● End kernel

● Write array back to host memory

● Recognizably C – but...

● More complex

● More concurrent

● More buggy
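For comparison with the CUDA steps listed above, here is a sketch of the plain C version the slide describes as nested loops. The function name `matmul` and the row-major square layout are assumptions made for this example. The inner `k` loop computes one output cell, which is the unit of work the slide assigns to a single GPU thread.

```c
#include <assert.h>
#include <stddef.h>

/* c = a * b for n x n matrices stored row-major.
   matmul and its layout are illustrative choices, not from the slides. */
void matmul(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            /* Row i of a times column j of b: one output cell. */
            for (size_t k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

The CUDA version keeps the same inner loop but adds the transfers, shared-memory staging and synchronisation listed on the slide around it.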

A parallel hybrid world

● Hardware is determining the software

● Exploit concurrency within a multicore node: Shared memory via OpenMP, pthreads, …

● To exploit GPUs: CUDA, OpenCL, …

● For multiple nodes: MPI

● Result: Mixtures of paradigms

– Message passing, shared memory and GPU

● Very large GPU systems now in service:

● Oak Ridge National Laboratory, Tennessee – Titan (Cray XK6)

– 20,000 nodes - 299,008 CPU cores and 960 NVIDIA Tesla GPUs (and growing..)

● NUDT China – Tianhe-1A

– 86,038 CPU cores and 7168 NVIDIA Tesla GPUs

– $88M to build, $20M to run

● Many software rewrites are in progress because of GPUs

● Cost/performance vs codebase complexity – and longevity – development investment

● …. but what do we do when software fails?

So how do we fix software?

● With …

● Thousands of threads

● Millions of variables

● Terabytes of data

● How do you figure out what's going on with your code?

● Old tricks long dead: multiple terminals, print statements, …

● Different from (eg.) web farms

● Everything is inter-related – not independent

● We need to see all threads and processes – together

● Different from most other fields?

● From embedded multicore? Only in scale (sometimes in terminology)

● Does it look like your problem?

● Does it look like your next problem?

Bugs in Practice

Some types of bug

● Some Terminology

● Bohr bug

– Steady, dependable bug

● Heisenbug

– Vanishes when you try to debug (observe)

● Mandelbug

– Complexity and obscurity of the cause is so great that it appears chaotic

● Schroedinbug

– First occurs after someone reads the source file and deduces that it never worked, after which the program ceases to work

How do we debug?

● A scientific process?

● Identify and reproduce the bug

● Hypothesis, trial and observation, understand how the code is behaving ...

– Printf

– Command line debuggers

– Graphical debuggers

● Other options● Static analysis

● Race detection with automated tools

● Valgrind

● Manual source code review

– “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”

— Brian W. Kernighan and P. J. Plauger in The Elements of Programming Style.

The oldest debugger in the world

● All developers know printf

● As part of more general debugging messages

– Run binary in a “debug” mode to log behaviour

– Default for many web applications – eg. HTTP access logs

– A form of post-mortem debugging

● In response to specific problem

– Insert instructions to code to test hypothesis

● At line 53, x becomes 7 and the if statement is executed

– Recompile, run, examine output

– Hypothesis incorrect? Loop back and try again

– Hypothesis correct? Remove debug output from binary, and fix the bug

● At scale?

– Interference with timing

– Interleaving of output can be misleading

– Flushing of output can be misleading

– Too much output
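The interleaving and flushing problems above can be reduced, though not removed, by tagging every debug record with its origin and flushing straight away. The sketch below assumes each process already knows its own rank (for example from MPI); the helper names `format_debug` and `debug_log` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Build one debug record. The rank prefix keeps interleaved output
   attributable to a process. Names here are illustrative only. */
int format_debug(char *buf, size_t len, int rank,
                 const char *file, int line, const char *msg)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "[rank %d] %s:%d: %s", rank, file, line, msg);
}

/* Emit the record with a single call and flush at once, so a crash
   shortly afterwards is less likely to lose the message. */
void debug_log(int rank, const char *file, int line, const char *msg)
{
    char buf[256];
    format_debug(buf, sizeof buf, rank, file, line, msg);
    fprintf(stderr, "%s\n", buf);
    fflush(stderr);
}
```

A call such as `debug_log(rank, __FILE__, __LINE__, "x = 7")` tests the hypothesis from the slide. The timing interference and sheer volume of output at scale remain.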

Real debuggers...

● Inspect the insides of an application whilst it is alive

● Inspect process state

– Process registers, and memory

– Variables and stacktraces (nesting of function calls)

● Control/observe execution

– Step line by line, function by function through an execution

– Stop at a line or function (breakpoint)

– Stop if a memory location changes

● Ideal to watch how a program is executed

– Less intrusive on the code than printf

– See exact line of crash – unlike printf

– Test more hypotheses at a time

● Most well-known examples cater for single process debugging

– GDB, Visual Studio, ...

Debugging Parallel Applications

● The same needs: observation, control, ...

● More complex environment

– No command prompt

– Printf unreliable

– No core files

● More complex problems

– More processes

– More data

● More Heisenbugs

– Threading and communication introduce non-determinism

Allinea DDT in a nutshell

● Graphical source level debugger for

● Parallel, multi-threaded, scalar or hybrid code

● C, C++, F90, Co-Array Fortran, UPC

● Strong feature set

● Memory debugging

● Data analysis

● Managing concurrency

● Emphasizing differences

● Collective control

“Make as simple as possible, no more”

Fixing everyday crashes

● Typical crash scenario:

● Threads/processes can be anywhere

● Too many to manually examine individually

● A good overview is important

● Allinea DDT merges stacks from processes and threads into a tree

● Leap to source for crashes

● Scalable information without overload

● Common fault patterns evident instantly

● Divergence, deadlock

Process Control

● Interacting with application progress is easy with DDT

● Step, breakpoint, play, or set data watchpoints based on groups

● Change interleaving order by stepping/playing selectively

● Group creation is easy

● Integrated throughout Allinea DDT - eg. stack and data views

Simplifying data divergence

● Clear need to see data

● Too many variables to trawl manually

● Allinea DDT compares data automatically

● Smart highlighting

● Subtle hints for differences and changes

● New: Now with sparklines!

● More detailed analysis

● Full cross process comparison

● Historical values via tracepoints
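To make the idea of cross-process comparison concrete, the sketch below takes the value of one variable as seen by each process and reports the extremes and which ranks hold them. This is only an illustration of the concept, not a description of how DDT works internally; the `spread` type and `compare_across_ranks` function are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of one variable across all processes (hypothetical type). */
typedef struct {
    double min, max;
    size_t min_rank, max_rank;
} spread;

/* v[r] is the variable's value on rank r. Ties keep the lowest rank. */
spread compare_across_ranks(const double *v, size_t nprocs)
{
    assert(v != NULL && nprocs > 0);
    spread s = { v[0], v[0], 0, 0 };
    for (size_t r = 1; r < nprocs; r++) {
        if (v[r] < s.min) { s.min = v[r]; s.min_rank = r; }
        if (v[r] > s.max) { s.max = v[r]; s.max_rank = r; }
    }
    return s;
}
```

If `min` and `max` differ for a variable that should be identical everywhere, the two rank fields point at processes worth inspecting first.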

Large Array Support

● Browse arrays

● 1, 2, 3, … dimensions

● Table view

● Filtering

● Look for an outlier

● Export

● Save to a spreadsheet

● View arrays from multiple processes

● Search terabytes for rogue data – in parallel

A simple parallel debugger

● A basic parallel debugger

● Aggregate scalar debuggers

– They work: good starting point

– Control asynchronously

● Implement support for many platforms and MPI implementations

● Develop user interface

– Simplify control and state display

● Initial architecture

● Scalar debuggers connect to user interface

– Direct connections - linear performance

● Any per-process item is an eventual bottleneck

– Operating system limitations

● File handles on the GUI

● Threads, processes

– I/O limitations

● Linear access counts on the best networked file systems are still linear

– Memory and computation limitations

● Machines still getting bigger...

[Diagram: the User Interface connected to several Controller, Debugger, Process chains, one per process.]

Bug fixing at scale

● Can we reproduce at a smaller scale?

● Attempt to make problem happen on fewer nodes

– Often requires reduced data set – the large one may not fit

● Smaller data set may not trigger the problem

– Does the bug even exist on smaller problems?

● Didn't you already try the code at small scale?

– Is it a system issue – eg. an MPI problem?

● Is probability stacking up against you?

– Unlikely to spot on smaller runs – without many many runs

– But near guaranteed to see it on a many-thousand core run

● Debugging at extreme scale is a necessity
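The point about probability stacking up can be made concrete with a small worked example. Assume, purely for illustration, that each process independently hits the bug with probability p during a run; both the independence and the value of p below are assumptions, not figures from the talk.

```latex
\[
P(\text{at least one failure among } n \text{ processes}) = 1 - (1 - p)^{n}
\]
\[
p = 10^{-4}:\qquad
n = 16 \;\Rightarrow\; P \approx 0.0016,\qquad
n = 10{,}000 \;\Rightarrow\; P \approx 0.63,\qquad
n = 100{,}000 \;\Rightarrow\; P \approx 0.99995
\]
```

Under these assumptions a 16-process test run would need hundreds of repeats to show the bug once, while a 100,000-process run shows it almost every time.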

How to make a Petascale debugger

● A control tree is the solution

● Ability to send bulk commands and merge responses

– 100,000 processes in a depth 3 tree

● Compact data type to represent sets of processes

– eg. For message envelopes

– An ordered tree of intervals?

– Or a bitmap?

● Develop aggregations

– Merge operations are key

– Not everything can merge losslessly

– Maintain the essence of the information

● eg. min, max, distribution
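One of the options raised above for a compact process-set type is a list of intervals. The sketch below shows the basic idea, assuming the ranks arrive sorted in ascending order with no duplicates; the `interval` type and `compact_ranks` function are hypothetical, and a real implementation (ordered tree or bitmap) would involve further trade-offs not covered here.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive range of process ranks (illustrative type). */
typedef struct {
    int first, last;
} interval;

/* Collapse sorted, duplicate-free ranks into intervals.
   Returns how many intervals were written to out (at most cap). */
size_t compact_ranks(const int *ranks, size_t n, interval *out, size_t cap)
{
    assert(ranks != NULL && out != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (count > 0 && ranks[i] == out[count - 1].last + 1) {
            /* Rank extends the current run. */
            out[count - 1].last = ranks[i];
        } else {
            /* Gap found: start a new interval. */
            assert(count < cap);
            out[count].first = ranks[i];
            out[count].last = ranks[i];
            count++;
        }
    }
    return count;
}
```

When nearly all processes behave the same way, a set such as "ranks 0 to 99,999 except 4,711" needs two intervals instead of 99,999 entries, which keeps message envelopes small.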

For Petascale and beyond

[Figure: DDT 3.0 Performance Figures. Series: All Step, All Breakpoint. Axes: MPI Processes (0 to 200,000) against Time in seconds (0 to 0.12).]

● Partnership with largest users

● DoE Oak Ridge National Laboratory

● LLNL, ANL, CEA and others

● High performance debugging - even at 220,000 cores

● Step all and display stacks: 0.1 seconds

● Logarithmic

● Usability is a “Big Thing”

● Scalable interface and features

● One million cores?

● … waiting for the machine!
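The "Logarithmic" behaviour follows from the control tree: the number of hops a command travels grows with the depth of the tree, not with the number of processes. The sketch below computes the depth needed for a given fan-out. The function name `tree_depth` is invented, and the fan-out values in the usage note are assumptions, since the slides give only the process count and depth.

```c
#include <assert.h>

/* Smallest depth d such that fanout^d >= nprocs.
   Illustrative helper; overflow is not handled for extreme inputs. */
unsigned tree_depth(unsigned long nprocs, unsigned long fanout)
{
    assert(nprocs >= 1 && fanout >= 2);
    unsigned depth = 0;
    unsigned long reach = 1;  /* processes reachable at the current depth */
    while (reach < nprocs) {
        reach *= fanout;
        depth++;
    }
    return depth;
}
```

With an assumed fan-out of 47, `tree_depth(100000, 47)` gives 3, matching the depth quoted earlier for 100,000 processes, and a million processes at an assumed fan-out of 100 also fit in depth 3.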

The Future

● Concurrency will increase

● 2012 or early 2013 – DDT will debug a million core system

● International and national groups are preparing for Exascale:

– 100x more powerful than today's most powerful system

– Expected to be multi-level parallel (hybrid)

● Continued adoption of multicore and hybrid programming in consumer arena:

– Laptops, tablets, mobile phones