(1) Formal Verification for Message Passing and GPU Computing
(2) XUM: An Experimental Multicore supporting MCAPI

Ganesh Gopalakrishnan, School of Computing, University of Utah, and Center for Parallel Computing (CPU)
http://www.cs.utah.edu/fv


TRANSCRIPT

Page 1: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

(1) Formal Verification for Message Passing and GPU Computing

(2) XUM: An Experimental Multicore supporting MCAPI

Ganesh Gopalakrishnan

School of Computing, University of Utah

and

Center for Parallel Computing (CPU)

http://www.cs.utah.edu/fv

Page 2: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

General Theme

• Take FM where it hasn't gone before
  – Only a handful of people work in these crucially important domains
• Explore the space of concurrency based on message-passing APIs
  – A bit of a mid-life crisis for an FV person: learning which areas in SW design need help…

Page 3: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Ideas I hope to present

• L1: How to test message-passing programs used in HPC
  – Recognize the ubiquity of certain APIs in critical areas (e.g., MPI in HPC)
  – With proper semantic characterization, we can formally understand / teach, and formally test
    • No need to outright dismiss these APIs as "too hairy"
    • (To the contrary) be able to realize the fundamental issues that will be faced in any attempt along the same lines
  – Long-lived APIs create a tool ecosystem around them that becomes harder to justify replacing
    • What it takes to print a line in a "real setting"
      – Need to build a stack-walker to peel back the profiling layers and locate the actual source line
      – Expensive – and pointless – to roll your own stack walker

Page 4: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Ideas I hope to present

• L2: How to test at scale
  – The only practical way to detect communication non-determinism (in this domain)
  – Can form the backbone of future large-scale replay-based debugging tools

Page 5: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Ideas I hope to present

• Realize that the multicore landscape is rapidly changing
  – Accelerators (e.g., GPUs) are growing in use
  – Multicore CPUs and GPUs will be integrated more tightly
  – Energy is a first-rate currency
  – Lessons learned from the embedded-systems world are very relevant

Page 6: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Ideas I hope to present

• L3: Creating dedicated verification tools for GPU kernels
  – How symbolic verification methods can be effectively used to analyze GPU kernel functions
  – Status of the tool and future directions

Page 7: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Ideas I hope to present

• L4: Designing an experimental message-passing multicore
  – Implements an emerging message-passing standard called MCAPI in silicon
  – How the design of special instructions can help with fast messaging
  – How features in the Network on Chip (NoC) can help support the semantics of MCAPI
• Community involvement in the creation of such tangible artifacts can be healthy
  – Read "The Future of Microprocessors" in a recent CACM, by Shekhar Borkar and Andrew Chien

Page 8: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Organization

• Today
  – MPI and dynamic FV
• Tomorrow
  – GPU computing and FV
  – XUM

Page 9: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Context/Motivation: Demand for cycles!

• Terascale

• Petascale

• Exascale

• Zettascale

Page 10: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

More compute power enables new discoveries, solves new problems

– Molecular dynamics simulations
  • Better drug design facilitated
  • Sanbonmatsu et al., FSE 2010 keynote: 290 days of simulation to model 2 million atom interactions over 2 nanoseconds
– Better "oil caps" can be designed if we have the right compute infrastructure
  • Gropp, SC 2010 panel

Page 11: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

[Figure: the same APIs appear across scales – high-end HPC/cloud machines, desktop and compute servers, and embedded systems and devices – spanning OpenMP, CUDA/OpenCL, Pthreads, MPI, and the Multicore Association APIs]

Commonality among the different scales; also, "HPC" will increasingly go embedded

Page 12: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Difficult Road Ahead wrt Debugging

• Concurrent software debugging is hard
• It gets harder as the degree of parallelism in applications increases
  – Node level: Message Passing Interface (MPI)
  – Core level: threads, OpenMP, CUDA
• Hybrid programming will be the future
  – MPI + threads
  – MPI + OpenMP
  – MPI + CUDA
• Yet tools are lagging behind!
  – Many tools cannot operate at scale and give measurable coverage

[Figure: the growing gap between HPC applications and HPC correctness tools]

Page 13: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

High-end Debugging Methods are often Expensive, Inconclusive

• Expensive machines, resources
  – $3M of electricity a year (a megawatt)
  – $1B to install hardware
  – Months of planning to get runtime on a cluster
• Debugging tools/methods are primitive
  – The extreme-scale goal is unrealistic w/o better approaches
• Inadequate attention from "CS"
  – Little/no formal software-engineering methods
  – Almost zero critical mass

Page 14: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Importance of Message Passing in HPC (MPI)

• Born ~1994
  • The world's fastest CPU ran at 68 MHz
  • The Internet had 600 sites then!
  • Java was still not around
• Still dominant in 2011
  – Large investments in applications, tooling support
• Credible FV research in HPC must include MPI
• Use of message passing is growing
  • Erlang, actor languages, MCAPI, .NET async … (not yet for HPC)
  • Streams in CUDA, queues in OpenCL, …

Page 15: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Trend: Hybrid Concurrency

[Figure: hardware at the bottom – Sandy Bridge (courtesy anandtech.com), GeForce GTX 480 (Nvidia), AMD Fusion APU, Infiniband-style interconnect – supporting concurrent data structures and high-performance MPI libraries, problem-solving environments (e.g., Uintah, Charm++, ADLB), and both monolithic large-scale MPI-based and PSE-based user applications]

Page 16: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

The MPI verification approach depends on the type of determinism

• Execution-deterministic
  – Basically one computation per input data
• Value-deterministic
  – Multiple computations, but they yield the same "final answer"
• Nondeterministic
  – Basically reactive programs built around message passing, possibly also using threads

Examples to follow

Page 17: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

An example of parallelizing matrix multiplication using message passing

[Figure: the two matrices being multiplied]

Page 18: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

An example of parallelizing matrix multiplication using message passing

[Figure: the data is distributed with MPI_Bcast and MPI_Send; results are collected with MPI_Recv]


Page 21: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MPI_Send
MPI_Recv (from: P0, P1, P2, P3, …);

Send the next row to the first worker, which by now must be free.

Unoptimized initial version: execution-deterministic

Page 22: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MPI_Send
MPI_Recv (from: *)

OR: send the next row to the first worker that returns an answer!

Later optimized version: value-deterministic – opportunistically send work to whichever processor finishes first

Page 23: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MPI_Send
MPI_Recv (from: *)

OR: send the next row to the first worker that returns an answer!

Still more optimized value-deterministic versions: the communications are made non-blocking and software-pipelined (still expected to remain value-deterministic)
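To make the progression concrete, here is a minimal master-side sketch in MPI C (our illustration, not the deck's code; MAXN and the tag-carries-row-index convention are assumptions). Switching the receive source from a fixed rank to MPI_ANY_SOURCE is exactly what turns the execution-deterministic loop into the value-deterministic, opportunistic one.

#include <mpi.h>

#define MAXN 1024   /* assumed upper bound on the matrix dimension, for this sketch */

/* Hypothetical master loop contrasting the two receive disciplines above.
   The row index travels in the message tag so results land in the right
   rows even when they arrive out of order. */
void master(double *A, double *B, double *C, int n, int nworkers, int opportunistic)
{
    MPI_Status st;
    double row[MAXN];
    int next_row = 0, received = 0;

    MPI_Bcast(B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);        /* every worker gets B */

    for (int w = 1; w <= nworkers && next_row < n; ++w, ++next_row)
        MPI_Send(&A[next_row * n], n, MPI_DOUBLE, w, next_row, MPI_COMM_WORLD);

    while (received < n) {
        /* Execution-deterministic version: receive from workers in a fixed order.
           Value-deterministic version: take the first answer that arrives (wildcard). */
        int src = opportunistic ? MPI_ANY_SOURCE : (received % nworkers) + 1;
        MPI_Recv(row, n, MPI_DOUBLE, src, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        for (int j = 0; j < n; ++j)
            C[st.MPI_TAG * n + j] = row[j];
        ++received;
        if (next_row < n) {                                     /* refill whoever answered */
            MPI_Send(&A[next_row * n], n, MPI_DOUBLE, st.MPI_SOURCE, next_row, MPI_COMM_WORLD);
            ++next_row;
        }
    }
}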

Page 24: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Typical MPI Programs

• Value-nondeterministic MPI programs do exist
  – Adaptive dynamic load-balancing libraries
• But most are value-deterministic or execution-deterministic
  – Of course, one does not really know w/o analysis!
• Detect / replay non-determinism over the schedule space
  – Races can creep into MPI programs
    • Forgetting to wait for MPI non-blocking calls to finish
  – Floating point can make things non-deterministic

Page 25: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Gist of the bug-hunting story

• MPI programs "die by the bite of a thousand mosquitoes"
  – No major vulnerabilities one can focus on
    • E.g., in thread programming one can focus on races
  – With MPI, we need comprehensive "bug monitors"
• Building MPI bug monitors requires collaboration
  – Lucky to have collaborations with DOE labs
  – The lack of FV critical mass hurts

Page 26: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A real-world bug

P0                      P1                      P2
---                     ---                     ---
Send( (rank+1)%N );     Send( (rank+1)%N );     Send( (rank+1)%N );
Recv( (rank-1)%N );     Recv( (rank-1)%N );     Recv( (rank-1)%N );

• Expected "circular" message passing
• Found that P0's Recv entirely vanished!!
• REASON:
  – In C, -1 % N is not N-1 but -1 itself
  – In MPI, "-1" is MPI_PROC_NULL
  – A Recv posted on MPI_PROC_NULL is ignored!
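A hypothetical reconstruction of the pattern in compilable MPI C (not the original application code); the fix is to bias the modulus so that rank 0's left neighbour is N-1 rather than -1.

#include <mpi.h>

/* The buggy neighbour computation was:  left = (rank - 1) % N;
   at rank 0 this yields -1, which the MPI library in the story treats as
   MPI_PROC_NULL, so P0's Recv is silently ignored.  Biasing the modulus fixes it. */
int main(int argc, char **argv)
{
    int rank, N, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &N);

    int right = (rank + 1) % N;
    int left  = (rank - 1 + N) % N;      /* fixed: rank 0 now receives from N-1 */

    MPI_Send(&rank, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
    MPI_Recv(&token, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}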


Page 28: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MPI Bugs – more anecdotal evidence

• Bug encountered at large scale w.r.t. a famous MPI library (Vo)
  – The bug was absent at a smaller scale – it was a concurrency bug
• Attempt to implement collective communication (Thakur)
  – The bug exists for ranges of size parameters
• Wrong assumption that an MPI barrier was irrelevant (Siegel)
  – It was not – a communication race was created
• Other common bugs (we see them a lot; potentially concurrency-dependent)
  – Forgetting to wait for a non-blocking receive to finish
  – Forgetting to free communicators and type objects
• Some codes may be considered buggy if non-determinism arises!
  – Use of MPI_Recv(*) often does not result in non-deterministic execution
  – Need something more than "superficial inspection"

Page 29: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Real bug stories in MPI-land

• Typing a[i][i] = init instead of a[i][j] = init
• Communication races
  – An unintended send matches a "wildcard receive"
• Bugs that show up when ported
  – Runtime buffering changes; deadlocks erupt
  – Sometimes bugs show up when buffering is added!
• Misunderstood "collective" semantics
  – Broadcast does not have "barrier" semantics
• MPI + threads
  – Royal troubles await the newbies

Page 30: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Our Research Agenda in HPC

• Solve FV of pure MPI applications "well"
  – Progress in non-determinism coverage for a fixed test harness
  – MUST integrate with good error monitors
• (Preliminary) work on hybrid MPI + something
  – Something = Pthreads and CUDA so far
  – Evaluated heuristics for deterministic replay of Pthreads + MPI
• Work on CUDA/OpenCL analysis
  – Good progress on a symbolic static analyzer for CUDA kernels
  – (Preliminary) progress on a symbolic test generator for CUDA programs
• (Future) symbolic test generation to "crash" hybrid programs
  – Finding lurking crashes may be a communicable value proposition
• (Future) intelligent schedule-space exploration
  – Focus on non-monolithic MPI programs

Page 31: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Motivation for Coverage of Communication Nondeterminism

Page 32: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Eliminating wasted search in message-passing verification

P0                          P1                              P2
---                         ---                             ---
MPI_Send(to P1…);           MPI_Recv(from P0…);             MPI_Send(to P1…);
MPI_Send(to P1, data=22);   MPI_Recv(from P2…);             MPI_Send(to P1, data=33);
                            MPI_Recv(*, x);
                            if (x==22) then error1
                            else MPI_Recv(*, x);

Page 33: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A frequently followed approach: "boil the whole schedule space" – often very wasteful.

@InProceedings{PADTAD2006:JitterBug,
  author    = {Richard Vuduc and Martin Schulz and Dan Quinlan and Bronis de Supinski and Andreas S{\ae}bj{\"o}rnsen},
  title     = {Improving distributed memory applications testing by message perturbation},
  booktitle = {Proc.~4th Parallel and Distributed Testing and Debugging (PADTAD) Workshop, at the International Symposium on Software Testing and Analysis},
  address   = {Portland, ME, USA},
  month     = {July},
  year      = {2006}
}

Page 34: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Eliminating wasted work in message-passing verification

But consider these two cases… there is no need to play with schedules of deterministic actions.

(Same example as on the previous slide.)

P0                          P1                              P2
---                         ---                             ---
MPI_Send(to P1…);           MPI_Recv(from P0…);             MPI_Send(to P1…);
MPI_Send(to P1, data=22);   MPI_Recv(from P2…);             MPI_Send(to P1, data=33);
                            MPI_Recv(*, x);
                            if (x==22) then error1
                            else MPI_Recv(*, x);
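A direct C rendering of the example (a sketch; tags and payload values are invented): the only scheduling choice that matters is which of the two second sends matches the wildcard receive.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)          /* run with exactly 3 ranks */
{
    int rank, x, payload;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 || rank == 2) {
        payload = 0;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* deterministic match */
        payload = (rank == 0) ? 22 : 33;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* races for the wildcard */
    } else {                              /* rank 1 */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (x == 22) printf("error1 reached\n");                 /* one possible match */
        else MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}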

Page 35: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Need to detect Resource-Dependent Bugs

Page 36: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Example of a Resource-Dependent Bug

P0                  P1
Send(to:1);         Send(to:0);
Recv(from:1);       Recv(from:0);

We know that this program may deadlock with less Send buffering…

Page 37: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

(Same program.) … and with more Send buffering the deadlock may be avoided.
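A compilable sketch of the two-process pattern (the message size is invented): whether it completes depends entirely on whether the MPI library buffers the sends.

#include <mpi.h>

int main(int argc, char **argv)             /* run with exactly 2 ranks */
{
    static int out[1 << 16], in[1 << 16];
    int rank, other;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;

    /* Both ranks send first.  If the runtime buffers the message, both sends
       return and the receives match.  If it does not (e.g. the message exceeds
       the eager threshold), both ranks block inside MPI_Send waiting for the
       other's receive: a head-to-head, buffering-dependent deadlock. */
    MPI_Send(out, 1 << 16, MPI_INT, other, 0, MPI_COMM_WORLD);
    MPI_Recv(in, 1 << 16, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}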

Page 38: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Example of a Resource-Dependent Bug

P0                  P1                  P2
Send(to:1);         Send(to:2);         Recv(from:*);
Send(to:2);         Recv(from:0);       Recv(from:0);

… but this program deadlocks if Send(to:1) has more buffering!


Page 42: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

(Same program.) With Send(to:1) buffered, P0 races ahead to Send(to:2), which can match P2's wildcard Recv(from:*) instead of P1's Send(to:2); P1's Send(to:2) and P2's Recv(from:0) are then mismatched – hence a deadlock.
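The three-process example in compilable MPI C (a sketch): whether it completes depends on which send the wildcard receive matches, which in turn depends on how much buffering the first send gets.

#include <mpi.h>

int main(int argc, char **argv)             /* run with exactly 3 ranks */
{
    int rank, v = 0, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* Send(to:1) */
        MPI_Send(&v, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);   /* Send(to:2) */
    } else if (rank == 1) {
        MPI_Send(&v, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);   /* Send(to:2) */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {                                              /* rank 2 */
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}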

Page 43: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Widely Publicized Misunderstandings

"Your program is deadlock-free if you have successfully tested it under zero buffering."

(The previous example is a counterexample: it completes under zero buffering but deadlocks once Send(to:1) is buffered.)

Page 44: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MPI at fault?

• Perhaps partly
  – Over 17 years of MPI, things have changed
  – Inevitable use of shared-memory cores, GPUs, …
  – Yet many of the issues seem fundamental to:
    • the need for wide adoption across problems, languages, machines
    • the need to give the programmer a better handle on resource usage
• How to evolve out of MPI?
  – Whom do we trust to reset the world?
  – Will they get it any better?
  – What about the train wreck meanwhile?
• Must one completely evolve out of MPI?

Page 45: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Our Impact So Far

[Figure: the hybrid-concurrency stack annotated with our tools – PUG and GKLEE at the GPU/CPU level (Sandy Bridge, GeForce GTX 480, AMD Fusion APU), ISP and DAMPI for the MPI-based user applications and problem-solving environments (e.g., Uintah, Charm++, ADLB), plus useful formalizations to help test these]

Page 46: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Outline for L1

• Dynamic formal verification of MPI
  – It is basically testing which discovers all alternate schedules
• Coverage of communication non-determinism
  – Also gives us a "predictive theory" of MPI behavior
• Centralized approach: ISP
• GEM: tool integration within the Eclipse Parallel Tools Platform
• Demo of GEM

Page 47: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A Simple MPI Example

P0:  Isend(1, req);   Barrier;          Wait(req);
P1:  Irecv(*, req);   Barrier;          Recv(2);      Wait(req);
P2:  Barrier;         Isend(1, req);    Wait(req);

Page 48: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A Simple MPI Example

(Same program as above.)

• Non-blocking send – the send lasts from the Isend to its Wait
• The send buffer can be reclaimed only after the Wait clears
• Forgetting to issue the Wait leaks an MPI "request object"


Page 50: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A Simple MPI Example

(Same program as above.)

• Non-blocking receive – lasts from the Irecv to its Wait
• The receive buffer can be examined only after the Wait clears
• Forgetting to issue the Wait leaks an MPI "request object"

Page 51: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A Simple MPI Example

(Same program as above.)

• Blocking receive in the middle
• Equivalent to an Irecv followed immediately by its Wait
• The data fetched by Recv(2) is available before that of the Irecv

Page 52: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A Simple MPI Example

(Same program as above.)

• Since P0's Isend and P1's Irecv can be "in flight", the Barrier can be crossed
• This allows P2's Isend to race with P0's Isend and match the Irecv(*)

Page 53: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Testing Challenges

(Same program as above.)

• Traditional testing methods may reveal only the P0→P1 match or only the P2→P1 match
• The P2→P1 match may first happen after the code is ported
• Our tools ISP and DAMPI automatically discover and run both tests, regardless of the execution platform
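The example in compilable MPI C (a sketch; buffer contents and tags are invented):

#include <mpi.h>

int main(int argc, char **argv)            /* run with exactly 3 ranks */
{
    int rank, buf = 0, x = 0, y = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* P0 */
        MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {                /* P1 */
        MPI_Irecv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Recv(&y, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else {                               /* P2 */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    /* If the Irecv(*) matches P2's Isend, P1's Recv(from:2) is left without a
       sender and the run deadlocks -- the schedule the tools force and detect. */
    MPI_Finalize();
    return 0;
}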

Page 54: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Flow of ISP

[Diagram: the MPI program is compiled into an executable; processes Proc1 … Procn run with an interposition layer that routes each MPI call to the ISP scheduler before it reaches the MPI runtime]

• The scheduler intercepts MPI calls
• It reorders and/or rewrites the actual calls going into the MPI runtime
• It discovers maximal non-determinism and plays through all choices
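The interposition layer can be realized with MPI's standard profiling interface (PMPI). A minimal sketch of the idea – our illustration, not ISP's code; the scheduler hook is hypothetical: the wrapper sees each call first, consults the scheduler, and only then forwards possibly rewritten arguments to the real library.

#include <mpi.h>

/* Hypothetical scheduler hook: a real tool would do a round-trip to the
   central scheduler here, which may replace MPI_ANY_SOURCE with a specific
   rank in order to force a particular match. */
static int ask_scheduler_for_source(int requested_source)
{
    return requested_source;   /* stub for the sketch */
}

/* PMPI-style wrapper: the application's MPI_Recv lands here first. */
int MPI_Recv(void *buf, int count, MPI_Datatype type, int source, int tag,
             MPI_Comm comm, MPI_Status *status)
{
    int actual_source = ask_scheduler_for_source(source);
    return PMPI_Recv(buf, count, type, actual_source, tag, comm, status);
}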

Page 55: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

ISP Scheduler Actions (animation)

[Animation over the example above: the scheduler collects each process's operations (Isend(1), Irecv(*), Barrier, …) without immediately forwarding them; once every process is fenced it issues the Barrier match-set to the MPI runtime; on a replay it rewrites Irecv(*) into Irecv(2) to force the alternative match – the remaining Recv(2) then has no match-set: Deadlock!]


Page 59: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Formalization of MPI Behavior to build Formal Dynamic Verifier (verification scheduler)

Page 60: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Formal Modeling of MPI Calls

• MPI calls are modeled in terms of four salient events
  – Call issued: all calls are issued in program order
  – Call returns: the code after the call can now be executed
  – Call matches: the event that marks the call committing
  – Call completes: all resources associated with the call are freed

Page 61: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

The Matches-Before Relation of MPI

1. Isend(to: Proc k, …); … Isend(to: Proc k, …)
2. Irecv(from: Proc k, …); … Irecv(from: Proc k, …)
3. Isend( &h ); … Wait( h )
4. Wait(..); … AnyMPIOp
5. Barrier(..); … AnyMPIOp
6. Irecv(from: Proc *, …); … Irecv(from: Proc k, …)   (conditional matches-before)
7. Irecv(from: Proc j, …); … Irecv(from: Proc *, …)   (conditional matches-before)
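A compact C sketch (our paraphrase of the slides, not ISP's data structures) of a per-call record carrying the four salient events, together with a predicate approximating rules 1–7; the conditional rules 6–7 are treated unconditionally here for simplicity.

#include <stdbool.h>

typedef enum { OP_ISEND, OP_IRECV, OP_WAIT, OP_BARRIER, OP_OTHER } OpKind;

typedef struct {
    OpKind kind;
    int proc, pc;        /* issuing process and its program-order position */
    int peer;            /* destination/source rank; WILDCARD for Irecv(*)  */
    int handle;          /* request handle for Isend/Irecv/Wait             */
    bool issued, returned, matched, completed;   /* the four salient events */
} MPIOp;

#define WILDCARD (-1)

/* Does a matches-before b?  (a and b from the same process, a earlier in program order.) */
bool mb_before(const MPIOp *a, const MPIOp *b)
{
    if (a->proc != b->proc || a->pc >= b->pc) return false;
    if (a->kind == OP_ISEND && b->kind == OP_ISEND && a->peer == b->peer) return true;  /* 1 */
    if (a->kind == OP_IRECV && b->kind == OP_IRECV && a->peer == b->peer) return true;  /* 2 */
    if ((a->kind == OP_ISEND || a->kind == OP_IRECV) &&
        b->kind == OP_WAIT && a->handle == b->handle)                     return true;  /* 3 */
    if (a->kind == OP_WAIT)                                               return true;  /* 4 */
    if (a->kind == OP_BARRIER)                                            return true;  /* 5 */
    if (a->kind == OP_IRECV && b->kind == OP_IRECV && a->peer == WILDCARD) return true; /* 6 (conditional) */
    if (a->kind == OP_IRECV && b->kind == OP_IRECV && b->peer == WILDCARD) return true; /* 7 (conditional) */
    return false;
}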

Page 62: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

The ISP Scheduler

• Pick a process Pi and its instruction Op at PC n
• If Op does not have an unmatched ancestor according to MB, then collect Op into the scheduler's reorder buffer
  – Stay with Pi, increment n
• Else switch to Pi+1, until all Pi are "fenced"
  – "Fenced" means all current Ops have unmatched ancestors
• Form match sets according to a priority order
  – If the match set is {s1, s2, …, sK} + R(*), cover all cases using stateless replay
• Issue an eligible set of match-set Ops into the MPI runtime
• Repeat until all processes are finalized or an error is encountered

Theorem: this scheduling method achieves ND-coverage in MPI programs!

Page 63: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

How MB helps predict outcomes

Will this single-process example, called "Auto-send", deadlock?

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);
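The same program in full MPI (a sketch; the shorthand R/B/S/W stands for Irecv/Barrier/Isend/Wait on handles h1 and h2):

#include <mpi.h>

int main(int argc, char **argv)            /* run with a single rank */
{
    int rank, out = 42, in = 0;
    MPI_Request h1, h2;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Irecv(&in, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &h1);   /* R(from:0, h1) */
    MPI_Barrier(MPI_COMM_WORLD);                                 /* B             */
    MPI_Isend(&out, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &h2);   /* S(to:0, h2)   */
    MPI_Wait(&h1, MPI_STATUS_IGNORE);                            /* W(h1)         */
    MPI_Wait(&h2, MPI_STATUS_IGNORE);                            /* W(h2)         */

    MPI_Finalize();
    return 0;
}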


Page 65: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

[Diagram: the matches-before (MB) ordering over these five operations]

Page 66: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect R(from:0, h1)

Scheduler's Reorder Buffer: R(from:0, h1)
The MPI Runtime

Page 67: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect B

Scheduler's Reorder Buffer: R(from:0, h1), B
The MPI Runtime

Page 68: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

P0 is fenced; form the match set { B } and send it to the MPI runtime!

Scheduler's Reorder Buffer: R(from:0, h1), B
The MPI Runtime

Page 69: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect S(to:0, h2)

Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2)
The MPI Runtime

Page 70: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

P0 is fenced. So form the {R, S} match set. Fire!

Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2)
The MPI Runtime

Page 71: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect W(h1)

Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1)
The MPI Runtime

Page 72: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

P0 is fenced. So form the {W(h1)} match set. Fire!

Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1)
The MPI Runtime

Page 73: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect W(h2)

Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1), W(h2)
The MPI Runtime

Page 74: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Fire W(h2)! The program finishes without deadlocks!

Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1), W(h2)
The MPI Runtime

Page 75: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Outline for L2

• Dynamic formal verification of MPI
  – It is basically testing which discovers all alternate schedules
• Coverage of communication non-determinism
  – Also gives us a "predictive theory" of MPI behavior
• Centralized approach: ISP
• GEM: tool integration within the Eclipse Parallel Tools Platform
  – DEMO OF GEM
• Distributed approach: DAMPI

Page 76: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Dynamic Verification at Scale

Page 77: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Why do we need a Distributed Analyzer for MPI Programs (DAMPI)?

• The ROB and the MB graph can get large
  – Limit of ISP: "32 ParMETIS (15K LOC) processes"
• We need to dynamically verify real apps
  – Large data sets, 1000s of processes
  – High-end bugs are often masked when downscaling
  – Downscaling is often IMPOSSIBLE in practice
  – ISP is too sequential! Employ the parallelism of a large cluster!
• What do we give up?
  – We can't do "what if" reasoning
    • What if a PARTICULAR Isend has infinite buffering?
  – We may have to give up precision for scalability

Page 78: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

DAMPI Framework

[Diagram: the MPI program is compiled into an executable; processes Proc1 … Procn run over DAMPI's PnMPI modules and the MPI runtime; alternate matches discovered in a run feed a schedule generator whose epoch decisions drive reruns]

DAMPI – Distributed Analyzer for MPI

Page 79: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Distributed Causality Tracking in DAMPI

• Alternate matches are co-enabled, concurrent actions that can match according to MPI's matching rules
• DAMPI performs an initial run of the MPI program and discovers alternative matches
• We have developed MPI-specific sparse logical-clock tracking
  – Vector clocks (no omissions, less scalable)
  – Lamport clocks (no omissions in practice, more scalable)
• We have gone up to 1000 processes (10K seems possible)

Page 80: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

DAMPI uses Lamport clocks to build happens-before relationships

• Use Lamport clocks to track happens-before
  – Sparse Lamport clocks – only "count" non-deterministic events
  – The MPI MB relation is "baked" into the clock-update rules
  – The clock is incremented after completion of an MPI_Recv(ANY_SOURCE)
  – Nested blocking / non-blocking operations are handled
  – Incoming clocks are compared to detect potential matches

[Diagram: P0 and P2 each Send to P1, piggybacking their clock (0); P1's wildcard receive R1(*) completes and P1's clock becomes 1]
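A toy C sketch of the clock rule just described (our illustration, not DAMPI's implementation): only wildcard receives tick the clock, every message piggybacks the sender's clock, and the log is what the later analysis compares.

#define MAX_ND_EVENTS 1024

typedef struct { int payload; int piggyback; } Msg;

static int my_clock = 0;                         /* this process's sparse Lamport clock */
static int match_log[MAX_ND_EVENTS][2];          /* {receiver clock, matched send's clock} */
static int n_nd_events = 0;

void wrap_send(Msg *m, int payload)
{
    m->payload   = payload;
    m->piggyback = my_clock;                     /* the clock travels with the message */
}

void wrap_wildcard_recv_completed(const Msg *m)
{
    match_log[n_nd_events][0] = my_clock;        /* receiver's clock at the match         */
    match_log[n_nd_events][1] = m->piggyback;    /* clock carried by the matched send     */
    n_nd_events++;
    my_clock += 1;                               /* only non-deterministic events "count" */
}
/* After the run, any concurrent send whose piggybacked clock is lower than
   match_log[k][0] is reported as an eligible alternative match for wildcard
   receive k (see the next slide). */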

Page 81: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

How we use the happens-before relationship to detect alternative matches

• Question: could P2's send have matched P1's receive?
• Answer: yes!
• The earliest message with a clock value lower than the current process clock is an eligible match

[Diagram: the same run; P2's Send(P1), piggybacking clock 0, is an eligible alternative match for P1's R1(*)]

Page 82: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

DAMPI Algorithm Review: (1) discover alternative matches during the initial run

[Diagram: in the initial run P1's R1(*) matches one of the sends; the piggybacked clocks reveal that the other Send(P1) was an eligible alternative]

Page 83: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

DAMPI Algorithm Review: (2) force alternative matches during replay

[Diagram: on replay the wildcard is rewritten to a specific source – R1(2) – so that P2's Send(P1) is the one that matches; the originally matching send is crossed out]

Page 84: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

DAMPI maintains very good scalability vs ISP

[Chart: ParMETIS-3.1 (no wildcards), time in seconds vs. number of tasks (4, 8, 16, 32), comparing ISP and DAMPI]

Page 85: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

DAMPI is also faster at processing interleavings

[Chart: matrix multiplication with wildcard receives, time in seconds vs. number of interleavings (250–1000), comparing ISP and DAMPI]

Page 86: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Results on large applications: SPEC MPI2007 and NAS PB

Slowdown is for one interleaving; no replay was necessary.

[Chart: base vs. DAMPI runtime at 1024 processes for ParMETIS-3.1, 107.leslie3d, 113.GemsFDTD, 126.lammps, 130.socorro, 137.lu, and the NAS benchmarks BT, CG, DT, EP, FT, IS, LU, MG]

Page 87: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

The Devil in the Details of DAMPI

• Textbook Lamport/vector clocks are inapplicable
  – "Clocking" all events won't scale
• So "clock" only non-deterministic events
  – When do non-blocking calls finish?
    • We can't "peek" inside the MPI runtime
    • We don't want to impede execution
    • So we have to infer when they finish
  – Later blocking operations can force matches-before
    • So clocks must be adjusted when blocking operations happen
• Our contributions: sparse VC and sparse LC algorithms
• Handling real corner cases in MPI
  – Synchronous sends "learning" about a non-blocking Recv's start!

Page 88: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Questions from Monday's Lecture

• What bugs are caught by ISP / DAMPI?
  – Deadlocks
  – MPI resource leaks
    • Forgotten dealloc of MPI communicators, MPI type objects
    • Forgotten Wait after Isend or Irecv (request objects leaked)
  – C assert statements (safety checks)
• What have we done wrt the correctness of our MPI verifier ISP's algorithms?
  – Testing + a paper-and-pencil proof using standard ample-set-based proof methods
  – A brief look at the formal transition system of MPI, and ISP's transition system
• Why have HPC folks not built a happens-before model for APIs such as MPI?
  – HPC folks are grappling with many challenges and actually doing a great job
  – There is a lack of "computational thinking" in HPC that must be addressed
    • Non-CS backgrounds are common; see the study by Jan Westerholm in EuroMPI 2010
  – CS folks must step forward to help
    • Why this does not naturally happen: "numerical methods" are not popular in core CS, and there isn't a clearly discernible "HPC industry"
  – Wider use of GPUs, physics-based gaming, … can help push CS toward HPC
  – Mobile devices will use CPUs + GPUs and do "CS problems" and "HPC problems" in a unified setting (e.g., face recognition, …)
• Our next two topics (PUG and XUM) touch on our attempts in this area

Page 89: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Outline for L3

• Formal analysis methods for accelerators/GPUs
  – Same thing
  – In the future there may be a large-scale mish-mash of CPUs and GPUs
  – CPUs will also have mini GPUs, SIMD units, …
    • Again, revisit the Borkar / Chien article
• Regardless…
  – It looks a lot unlike traditional Pthreads/Java/C# threading
  – We would like to explore how to debug these kinds of codes efficiently, and help designers explore their design space while root-causing bugs quickly

Page 90: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Why develop FV methods for CUDA?

• GPUs are key enablers of HPC
  – Many of the top-10 machines are GPU-based
  – I found the presentation by Paul Lindberg eye-opening: http://www.youtube.com/watch?v=vj6A8AKVIuI
• Interesting debugging challenges
  – Races
  – Barrier mismatches
  – Bank conflicts
  – Asynchronous MPI / CUDA interactions

Page 91: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

What are GPUs (aka Accelerators)?

[Images: Sandy Bridge (courtesy anandtech.com), the OpenCL compute model (courtesy Khronos group), GeForce GTX 480 (Nvidia), AMD Fusion APU]

• Three of the world's Top-5 supercomputers are built using GPUs
• The world's greenest supercomputers are also GPU-based
• CPU/GPU integration is a clear trend
• Hand-held devices (iPhone) will also use them

Page 92: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Example: Increment Array Elements – contrast between CPUs and GPUs

CPU program:

void inc_cpu(float* a, float b, int N) {
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main() {
    ...
    inc_cpu(a, b, N);
}

CUDA program:

__global__ void inc_gpu(float* A, float b, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        A[tid] = A[tid] + b;
}

void main() {
    ...
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    inc_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}

Page 93: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

(Same code as above.) Fine-grained GPU threads are scheduled to run like this: tid = 0, 1, 2, 3, …

Page 94: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Why heed many-core / GPU compute?

EAGAN, Minn., May 3, 1991 (John Markoff, NY Times): The Convex Computer Corporation plans to introduce its first supercomputer on Tuesday. But Cray Research Inc., the king of supercomputing, says it is more worried by "killer micros" – compact, extremely fast workstations that sell for less than $100,000.

Take-away: Clayton Christensen's "disruptive innovation" – the GPU is a disruptive technology!

Page 95: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Other Reasons to Study GPU Compute

GPUs offer an eminent setting in which to study heterogeneous CPU organizations and memory hierarchies

Page 96: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

What bugs are caught? What value?

• GPU hardware is still stabilizing
  – Characterize GPU hardware formally
  – Currently, program behavior may change with the platform
  – Micro-benchmarking sheds further light: www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf
• The software is the real issue!

Page 97: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

GPU Software Bugs / Difficulties

• The SIMD model is unfamiliar
  – Synchronization, races, deadlocks, … are all different!
• Machine constants and program assumptions about problem parameters affect correctness / performance
  – GPU "kernel functions" may fail or perform poorly
  – Formal reasoning can help identify performance pitfalls
• The tools are still young, so emulators and debuggers may not match hardware behavior
• Multiple memory subsystems are involved, further complicating semantics / performance

Page 98: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Approaches/Tools for GPU SW Verification

• Instrumented dynamic verification on GPU emulators
  – Boyer et al., STMCS 2008
• Grace: combined static / dynamic analysis
  – Zheng et al., PPoPP 2011
• PUG: an SMT-based static analyzer
  – Lee and Gopalakrishnan, FSE 2010
• GKLEE: an SMT-based test generator
  – Lee, Rajan, Ghosh, and Gopalakrishnan (under submission)

Page 99: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Planned Overall Workflow of PUG (realized)

[Figure: PUG's planned workflow]

Page 100: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI


Workflow and Results from PUG

Page 101: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Examples of a Data Race – increment an N-element vector A by scalar b

[Figure: threads tid = 0 … 15, each updating one element: A[0]+b … A[15]+b]

__global__ void inc_gpu(float* A, float b, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        A[tid] = A[tid - 1] + b;   // reads the neighbouring element that another thread writes: a race
}

Page 102: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

(Same kernel as above.) PUG's SMT encoding of the conflicting accesses:

(and (and (/= t1.x t2.x) ……
  (or (and (bv-lt idx1@t1 N0) (bv-lt idx1@t2 N0) (= (bv-sub idx1@t1 0b0000000001) idx1@t2))
      (and (bv-lt idx1@t1 N0) (bv-lt idx1@t2 N0) (= idx1@t1 idx1@t2))))

Encoding for the write-write race; encoding for the read-write race


Page 104: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Real Example with a Race

__global__ void computeKernel(int *d_in, int *d_out, int *d_sum) {
    d_out[threadIdx.x] = 0;
    for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
        d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
    }
    __syncthreads();

    if (threadIdx.x % 2 == 0) {
        for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
            d_out[threadIdx.x + SIZE/BLOCKSIZE*i] +=
                d_out[threadIdx.x + SIZE/BLOCKSIZE*i + 1];
    ...

/* The counterexample given by PUG is: t1.x = 2, t2.x = 10, i@t1 = 1, i@t2 = 0.
   Threads t1 and t2 are in iterations 1 and 0 respectively.
   t1 generates the access d_out[2 + 8*1] += d_out[2 + 8*1 + 1], i.e. d[10] += d[11];
   t2 generates            d_out[10 + 8*0] += d_out[10 + 8*0 + 1], i.e. d[10] += d[11].
   This is a real race on the writes to d[10] because of the += done by both threads. */


Page 106: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

PUG Divides and Conquers Each Barrier Interval

[Figure: two threads t1, t2 execute accesses a1 … a6; a barrier separates barrier interval BI1 (a1, a2) from BI3 (a3 and, under predicate p, a4 and a5)]

BI1 is conflict-free iff, for all t1, t2: the accesses a1@t1, a2@t1, a1@t2, a2@t2 do not conflict with each other.

BI3 is conflict-free iff the following accesses do not conflict, for all t1, t2: a3@t1, p: a4@t1, p: a5@t1, a3@t2, p: a4@t2, p: a5@t2.

Page 107: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

PUG Results (FSE 2010)

Kernels (in CUDA SDK)   LOC    +O/+C/+R   B.C.   Time (pass)
Bitonic Sort              65              HIGH   2.2s
MatrixMult               102   * *        HIGH   <1s
Histogram64              136              LOW    2.9s
Sobel                    130   *          HIGH   5.6s
Reduction                315              HIGH   3.4s
Scan                     255   * * *      LOW    3.5s
Scan Large               237   * *        LOW    5.7s
Nbody                    206   *          HIGH   7.4s
Bisect Large            1400   * *        HIGH   44s
Radix Sort              1150   * * *      LOW    39s
Eigenvalues             2300   * * *      HIGH   68s

+O: requires no bit-vector overflow; +C: requires constraints on the input values; +R: requires manual refinement; B.C.: how serious the bank conflicts are; Time: SMT solving time

Page 108: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

"GPU class" bugs (tested over 57 assignment submissions)

Defects     Barrier Error or Race    Refinement
            benign      fatal        over #kernels    over #loops
13 (23%)    3           2            17.5%            10.5%

Defects: how many kernels are not well parameterized, i.e. work only in certain configurations.
Refinement: how many loops need automatic refinement (by PUG).

Page 109: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

PUG's Strengths and Weaknesses

• Strengths:
  – SMT-based incisive static analysis avoids interleaving explosion
  – Still obtains coverage guarantees
  – Good for GPU library FV
• Weaknesses:
  – Engineering effort: C++, templates, breaks, …
  – SMT "explosion" for value correctness
  – Does not help test the code on the actual hardware
  – False alarms require manual intervention
    • Handling loops, kernel calling contexts

Page 110: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Thorough internal documentation of PUG is available

• http://www.cs.utah.edu/fv/mediawiki/index.php/PUG
• One recent extension (ask me for our EC2 2011 paper)
  – Symbolic analysis to detect occurrences of non-coalesced memory accesses

Page 111: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Two directions in progress

• Parameterized verification
  – See Li's dissertation for initial results
• Formal testing
  – A brand new code base called GKLEE
  – Joint work with Fujitsu Research

Page 112: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

GKLEE: Formal Testing of GPU Kernels (joint work with Li, Rajan, Ghosh of Fujitsu)

Page 113: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Evaluation of GKLEE vs. PUG

                    n = 4                 n = 16                n = 64         n = 256       n = 1024
Kernels             PUG    GKLEE          PUG    GKLEE          GKLEE          GKLEE         GKLEE
Simple Reduction    2.8    <0.1 (<0.1)    T.O.   <0.1 (<0.1)    <0.1 (<0.1)    0.2 (0.3)     2.3 (2.9)
Matrix Transpose    1.9    <0.1 (<0.1)    T.O.   <0.1 (0.3)     <0.1 (3.2)     <0.1 (63)     0.9 (T.O.)
Bitonic Sort        3.7    0.9 (1)        T.O.   T.O.           T.O.           T.O.          T.O.
Scan Large          ▬      <0.1 (<0.1)    ▬      <0.1 (<0.1)    0.1 (0.2)      1.6 (3)       22 (51)

Execution time of PUG and GKLEE on kernels from the CUDA SDK for functional correctness. n is the number of threads. T.O. means > 5 min.

Page 114: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Evaluation of GKLEE's Test Coverage

Kernels             sc cov.        bc cov.       min #tests   avg. covt     max. covt     exec. time
Bitonic Sort        100% / 100%    51% / 44%     5            78% / 76%     100% / 94%    1s
Merge Sort          100% / 100%    70% / 50%     6            88% / 80%     100% / 95%    0.5s
Word Search         100% / 92%     34% / 25%     2            100% / 81%    100% / 85%    0.1s
Radix Sort          100% / 91%     54% / 35%     6            91% / 68%     100% / 75%    5s
Suffix Tree Match   100% / 90%     38% / 35%     8            90% / 70%     100% / 82%    12s

Test generation results. The traditional metrics sc cov. and bc cov. give source-code and bytecode coverage respectively. The refined metrics avg. covt and max. covt measure the average and maximum coverage over all threads.

All coverages were far better than those achieved through random testing.

Page 115: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Outline for L4

• A brief introduction to MCAPI, MRAPI, MTAPI
• XUM: an eXtensible Utah Multicore system
• What we are able to learn by building a hardware realization of a custom multicore
  – How can we push this direction forward?
  – Any collaborations?
    • Clearly there are others who do this full-time
    • This has been a side project (by two very good students, albeit) in our group
    • So we would like to build on others' creations also…

Page 116: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Multicore Association APIs

• http://www.multicore-association.org
• Reasons for our interest in the MCA APIs
  – Our project through the Semiconductor Research Corporation
  – Collaborator in dynamic verification of MCAPI applications: Prof. Eric Mercer (BYU, Utah)
    – The BYU team has also developed a formal spec for MCAPI and MRAPI, and built golden executable models from these specs
• XUM
  – A Utah project involving two students
    • MS project of Ben Meakin
    • BS project of Grant Ayers
  – An attempt to support MCAPI functions in HW+SW
  – (Later) hoping to support MRAPI

Page 117: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MCAPI (picture courtesy of the Multicore Association)

• A facility to interconnect heterogeneous embedded multicore systems/chips
• These systems could be very minimalistic
  • No OS, different OSes; could be DSPs, CPUs, …
• Standardization (and revision) finished around 2009
• There were no widely used, portable communication APIs in this space
• Currently two commercial implementations
  • Mentor's Open MCAPI
  • Polycore's Messenger
• XUM is the only hardware-assisted implementation of the communication API

Page 118: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MCAPI Calls, Use Cases, Expectations

• Lead figures in MCAPI standardization (so far as our interactions go)
  – Jim Holt (Freescale), Sven Brehmer (Polycore), Markus Levy (Multicore Association)
• "Endpoints" are connected
  – Each endpoint could be a thread, a process, …
  – Blocking and non-blocking communication support
    • MCAPI_Send, MCAPI_Send_I, …
    • Waits, Tests
    • No barriers (in the API); one could implement them
    • Create endpoints (a collective call)
• Use cases
  – Present use cases are in C/Pthreads, with each thread performing MCAPI calls to communicate
  – Very reminiscent of monolithic-style MPI programs (with all their drawbacks)
• General expectation
  – That MCAPI will be used as a standard transport layer on top of which one may implement higher abstractions
  – One project: Chapman (Univ. Houston) work on realizing OpenMP
  – Also suggested: a task-graph (or other) higher-level abstraction to specify computations, with a "smart runtime" employing MCAPI

Page 119: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MCAPI Tool Support

• Currently no "formal" debugging tools
  – Not enough case studies yet (projects underway in Prof. Alper Sen's group)
• MCC: an MCAPI Correctness Checker
  – Subodh Sharma (PhD student)
  – Borrows from the dynamic verification tool designs of ISP (our MPI checker) and Inspect (our Pthreads checker)
  – Dynamic verification against existing MCAPI libraries
  – MCC incurs new headaches
    • Hybrid Pthreads/MCAPI behaviors
    • Deterministic replay of schedules is often difficult
  – Our present conclusion:
    • Don't go there!
    • We know that dynamic verification of hybrid concurrent programs is a royal pain!
  – Waiting for higher abstractions / better practices to emerge in the area
• BYU projects on model checking using an MCAPI golden executable model
  – The main difference is that they rely on an MCAPI operational semantics whereas we capitalize on the behavior of an MCAPI library

Page 120: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MRAPI

• Portable resource API
  – Portable mutexes, mallocs
  – Portable varieties of software-managed shared memory, DMA
  – Pthreads and Unix facilities won't do
    • Not well matched with the requirements of heterogeneous multicores with disparate sets of features / resources
• MTAPI standardization: yet to begin
• One possible usage of MCAPI + MRAPI:
  – An MCAPI send call allocates a buffer using MRAPI calls
  – The MCAPI send happens, say using XUM's network or MRAPI's software DMA
  – MRAPI calls free up the buffer

Page 121: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

XUM is currently being prototyped on the popular XUP board

Two XUPV5-LX110T boards obtained courtesy of Xilinx Inc. !

Page 122: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

XUM architecture

• 32-bit MIPS ISA-compliant cores
• The request network is in-order, dimension-order routed
  • Wormhole flow control
• Each router unit arbitrates round-robin
• The datapath in the request network is 16 bits wide
• The ack network has broadcast and point-to-point transfers
• We can plug in I/O devices as if they are tiles
• All of this exists as VHDL+Verilog mapped onto FPGAs

Page 123: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

XUM architecture

• Current memory architecture:
  • All tiles have memory ports leading to an FCFS arbiter that is backed by a pipelined DDR2 controller to SDRAM
• All tiles can have their own clocks
  • About 4 physical clock sources; PLL primitives available (Xilinx tools)
  • Additional clocks can be synthesized
  • Currently 500 MHz for a flip-flop; 100 MHz realizable. Look at OpenSPARC.

Page 124: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Recent XUM Achievements

• Built a bootloader (FPGA) and a protocol for downloading code images over RS-232
• XUM memory controller
  – Built a fully functional DDR2-SDRAM memory controller that provides usable amounts of memory on-board
    • Support for pipelined transfers
  – Built a simple FCFS-arbitrated memory controller (shared by all cores)
    • All cores share the same address space – no protection, but handy!
• Ported the XUM MIPS cores to 32 bits, and debugged the CPUs
  – Debugged the CPU some (more needed)
    • Found errors in the delay-slot and forwarding logic…
  – Would be a good test vehicle for pipelined-CPU FV methods

Page 125: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Other features / status

More details:
• 5–7 stage in-order pipeline
• No branch predictor (will add); no speculation
• Web documentation status: VHDL/Verilog code available

Software story:
• GCC would be usable
  – Must add some new instructions such as load-immediate-upper
  – Inline assembly for the XUM instructions
• FPU is TBD – new student

RTOS story:
• FreeRTOS (e.g.) – compile using GCC → download

Page 126: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Further software story…

MCAPI and MRAPI realization:
• Our BYU collaborator, Prof. Mercer, has MCAPI and MRAPI golden executable models (as formal state-transition rules)
• We will compile these into detailed C implementations

Programming approach:
• Not recommending straight coding using MCAPI / MRAPI, as the code soon becomes a "rat's nest"
• We will investigate compiling tasking primitives into a runtime that is supported by MCAPI and MRAPI

Other ideas?

Page 127: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

XUM Details

• See Ben Meakin's MS thesis
  – Available from http://www.cs.utah.edu/fv
• The thesis provides:
  – Code for send/receive
  – Correctness properties of interest wrt XUM
    • A good test vehicle for HW FV projects
  – Memory-footprint data
    • Very parsimonious support for MCAPI is possible
  – Latency/throughput measurements on XUM
    • Also a comparison with a Pthread-based baseline

Page 128: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

XUM Communications Datapath

Page 129: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

XUM ISA Extensions

• Send header
• Send word
• Send tail
• Send ack
• Broadcast
• Receive ack

Page 130: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MCAPI Message Send

• Disable interrupts
• asm("sndhd.s …")
• while (i < bufsize)
  – asm("sndw …")
• asm("sendtl …")
• asm("recack …")
• Enable interrupts
• Support for connectionless and connected MCAPI protocols
  – The latter achieved by not issuing a tail flit until the connection needs to be closed

Page 131: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Further ideas / thoughts

• The embedded multicore space is likely very influential
  – Enables development of hardware assist for new APIs and runtime mechanisms
  – Even the HPC space may be influenced by design ideas percolating from below
  – Dynamic formal verification tools may employ "hooks" into the hardware
    • Avoids the "dirty tricks" we had to use in ISP to get control over the MPI runtime very indirectly
      – Faking "Wait" operations, pre-issuing Waits to poke the MPI progress engine, etc.
• In the end, we can sell what we can debug
• Time to market may be minimized through better FV / dynamic verification support provided by HW
• A great teaching tool
  – If the FPGA design tool-chain becomes a bit kinder/gentler
  – Projects such as Lava, Kiwi, … (MSR) provide rays of hope…

Page 132: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Details of ISP and DAMPI

Work out the crooked-barrier example on the board, assisted by formal transition systems:
• a transition system for MPI,
• then a transition system for the ISP centralized scheduler (as an interposition layer),
• then a transition system for DAMPI's distributed scheduler (sparse Lamport-clock based).

The formal transition systems clearly show how the native semantics of MPI has been "tamed" by specific scheduler implementations!

Page 133: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Concluding Remarks

A summary of the explorations of a group (especially its advisor) in "mid-life crisis", wanting to be relevant and wanting to be formal (also wanting to be liked).

In the end it was worth it.

Must now skate to where the puck will be!