(1) Formal Verification for Message Passing and GPU Computing
(2) XUM: An Experimental Multicore supporting MCAPI

Ganesh Gopalakrishnan, School of Computing, University of Utah, and Center for Parallel Computing (CPU)
http://www.cs.utah.edu/fv


TRANSCRIPT

Page 1: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

(1) Formal Verification for Message Passing and GPU Computing

(2) XUM: An Experimental Multicore supporting MCAPI

Ganesh Gopalakrishnan

School of Computing, University of Utah

and

Center for Parallel Computing (CPU)

http://www.cs.utah.edu/fv

Page 2: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

General Theme

• Take FM where it hasn't gone before
  – Only a handful of people work in these crucially important domains
• Explore the space of concurrency based on message-passing APIs
  – A bit of a mid-life crisis for an FV person: learning which areas in SW design need help…

Page 3: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Ideas I hope to present

• L1: How to test message-passing programs used in HPC
  – Recognize the ubiquity of certain APIs in critical areas (e.g., MPI in HPC)
  – With proper semantic characterization, we can formally understand / teach, and formally test
    • No need to outright dismiss these APIs as "too hairy"
    • (To the contrary) be able to realize the fundamental issues that will be faced in any attempt along the same lines
  – Long-lived APIs create a tool ecosystem around them that becomes harder to justify replacing
    • What it takes to print a line in a "real setting"
      – Need to build a stack-walker to peel back the profiling layers and locate the actual source line
      – Expensive – and pointless – to roll your own stack walker

Page 4: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Ideas I hope to present

• L2: How to test at scale
  – The only practical way to detect communication non-determinism (in this domain)
  – Can form the backbone of future large-scale replay-based debugging tools

Page 5: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Ideas I hope to present

• Realize that the multicore landscape is rapidly changing
  – Accelerators (e.g., GPUs) are growing in use
  – Multicore CPUs and GPUs will be integrated more tightly
  – Energy is a first-rate currency
  – Lessons learned from the embedded-systems world are very relevant

Page 6: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Ideas I hope to present

• L3: Creating dedicated verification tools for GPU kernels
  – How symbolic verification methods can be effectively used to analyze GPU kernel functions
  – Status of the tool and future directions

Page 7: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Ideas I hope to present

• L4: Designing an experimental message-passing multicore
  – Implements an emerging message-passing standard called MCAPI in silicon
  – How the design of special instructions can help with fast messaging
  – How features in the Network on Chip (NoC) can help support the semantics of MCAPI
• Community involvement in the creation of such tangible artifacts can be healthy
  – Read "The Future of Microprocessors" in a recent CACM, by Shekhar Borkar and Andrew Chien

Page 8: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Organization

• Today
  – MPI and dynamic FV
• Tomorrow
  – GPU computing and FV
  – XUM

Page 9: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Context/Motivation: Demand for cycles!

• Terascale

• Petascale

• Exascale

• Zettascale

Page 10: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

More compute power enables new discoveries, solves new problems

– Molecular dynamics simulations
  • Better drug design facilitated
  • Sanbonmatsu et al., FSE 2010 keynote: 290 days of simulation to model 2 million atom interactions over 2 nanoseconds
– Better "oil caps" can be designed if we have the right compute infrastructure
  • Gropp, SC 2010 panel

Page 11: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

[Figure: the same APIs appear across scales – high-end HPC/cloud machines, desktop and compute servers, and embedded systems and devices – spanning OpenMP, CUDA/OpenCL, Pthreads, MPI, and the Multicore Association APIs]

Commonality among the different scales; also, "HPC" will increasingly go embedded

Page 12: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Difficult Road Ahead wrt Debugging

• Concurrent software debugging is hard
• It gets harder as the degree of parallelism in applications increases
  – Node level: Message Passing Interface (MPI)
  – Core level: threads, OpenMP, CUDA
• Hybrid programming will be the future
  – MPI + threads
  – MPI + OpenMP
  – MPI + CUDA
• Yet tools are lagging behind!
  – Many tools cannot operate at scale and give measurable coverage

[Figure: the growing gap between HPC applications and HPC correctness tools]

Page 13: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

High-end Debugging Methods are often Expensive, Inconclusive

• Expensive machines, resources
  – $3M of electricity a year (a megawatt)
  – $1B to install hardware
  – Months of planning to get runtime on a cluster
• Debugging tools/methods are primitive
  – The extreme-scale goal is unrealistic w/o better approaches
• Inadequate attention from "CS"
  – Little/no formal software-engineering methods
  – Almost zero critical mass

Page 14: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Importance of Message Passing in HPC (MPI)

• Born ~1994
  • The world's fastest CPU ran at 68 MHz
  • The Internet had 600 sites then!
  • Java was still not around
• Still dominant in 2011
  – Large investments in applications, tooling support
• Credible FV research in HPC must include MPI
• Use of message passing is growing
  • Erlang, actor languages, MCAPI, .NET async … (not yet for HPC)
  • Streams in CUDA, queues in OpenCL, …

Page 15: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Trend: Hybrid Concurrency

[Figure: hardware at the bottom – Sandy Bridge (courtesy anandtech.com), GeForce GTX 480 (Nvidia), AMD Fusion APU, Infiniband-style interconnect – supporting concurrent data structures and high-performance MPI libraries, problem-solving environments (e.g., Uintah, Charm++, ADLB), and both monolithic large-scale MPI-based and PSE-based user applications]

Page 16: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

The MPI verification approach depends on the type of determinism

• Execution-deterministic
  – Basically one computation per input data
• Value-deterministic
  – Multiple computations, but they yield the same "final answer"
• Nondeterministic
  – Basically reactive programs built around message passing, possibly also using threads

Examples to follow

Page 17: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

An example of parallelizing matrix multiplication using message passing

[Figure: the two matrices being multiplied]

Page 18: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

An example of parallelizing matrix multiplication using message passing

[Figure: the data is distributed with MPI_Bcast and MPI_Send; results are collected with MPI_Recv]


Page 21: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MPI_Send
MPI_Recv (from: P0, P1, P2, P3, …);

Send the next row to the first worker, which by now must be free.

Unoptimized initial version: execution-deterministic

Page 22: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MPI_Send
MPI_Recv (from: *)

OR: send the next row to the first worker that returns an answer!

Later optimized version: value-deterministic – opportunistically send work to whichever processor finishes first

Page 23: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MPI_Send
MPI_Recv (from: *)

OR: send the next row to the first worker that returns an answer!

Still more optimized value-deterministic versions: the communications are made non-blocking and software-pipelined (still expected to remain value-deterministic)
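To make the progression concrete, here is a minimal master-side sketch in MPI C (our illustration, not the deck's code; MAXN and the tag-carries-row-index convention are assumptions). Switching the receive source from a fixed rank to MPI_ANY_SOURCE is exactly what turns the execution-deterministic loop into the value-deterministic, opportunistic one.

#include <mpi.h>

#define MAXN 1024   /* assumed upper bound on the matrix dimension, for this sketch */

/* Hypothetical master loop contrasting the two receive disciplines above.
   The row index travels in the message tag so results land in the right
   rows even when they arrive out of order. */
void master(double *A, double *B, double *C, int n, int nworkers, int opportunistic)
{
    MPI_Status st;
    double row[MAXN];
    int next_row = 0, received = 0;

    MPI_Bcast(B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);        /* every worker gets B */

    for (int w = 1; w <= nworkers && next_row < n; ++w, ++next_row)
        MPI_Send(&A[next_row * n], n, MPI_DOUBLE, w, next_row, MPI_COMM_WORLD);

    while (received < n) {
        /* Execution-deterministic version: receive from workers in a fixed order.
           Value-deterministic version: take the first answer that arrives (wildcard). */
        int src = opportunistic ? MPI_ANY_SOURCE : (received % nworkers) + 1;
        MPI_Recv(row, n, MPI_DOUBLE, src, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        for (int j = 0; j < n; ++j)
            C[st.MPI_TAG * n + j] = row[j];
        ++received;
        if (next_row < n) {                                     /* refill whoever answered */
            MPI_Send(&A[next_row * n], n, MPI_DOUBLE, st.MPI_SOURCE, next_row, MPI_COMM_WORLD);
            ++next_row;
        }
    }
}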

Page 24: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Typical MPI Programs

• Value-nondeterministic MPI programs do exist
  – Adaptive dynamic load-balancing libraries
• But most are value-deterministic or execution-deterministic
  – Of course, one does not really know w/o analysis!
• Detect / replay non-determinism over the schedule space
  – Races can creep into MPI programs
    • Forgetting to wait for MPI non-blocking calls to finish
  – Floating point can make things non-deterministic

Page 25: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Gist of the bug-hunting story

• MPI programs "die by the bite of a thousand mosquitoes"
  – No major vulnerabilities one can focus on
    • E.g., in thread programming one can focus on races
  – With MPI, we need comprehensive "bug monitors"
• Building MPI bug monitors requires collaboration
  – Lucky to have collaborations with DOE labs
  – The lack of FV critical mass hurts

Page 26: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A real-world bug

P0                      P1                      P2
---                     ---                     ---
Send( (rank+1)%N );     Send( (rank+1)%N );     Send( (rank+1)%N );
Recv( (rank-1)%N );     Recv( (rank-1)%N );     Recv( (rank-1)%N );

• Expected "circular" message passing
• Found that P0's Recv entirely vanished!!
• REASON:
  – In C, -1 % N is not N-1 but -1 itself
  – In MPI, "-1" is MPI_PROC_NULL
  – A Recv posted on MPI_PROC_NULL is ignored!
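A hypothetical reconstruction of the pattern in compilable MPI C (not the original application code); the fix is to bias the modulus so that rank 0's left neighbour is N-1 rather than -1.

#include <mpi.h>

/* The buggy neighbour computation was:  left = (rank - 1) % N;
   at rank 0 this yields -1, which the MPI library in the story treats as
   MPI_PROC_NULL, so P0's Recv is silently ignored.  Biasing the modulus fixes it. */
int main(int argc, char **argv)
{
    int rank, N, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &N);

    int right = (rank + 1) % N;
    int left  = (rank - 1 + N) % N;      /* fixed: rank 0 now receives from N-1 */

    MPI_Send(&rank, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
    MPI_Recv(&token, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}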


Page 28: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MPI Bugs – more anecdotal evidence

• Bug encountered at large scale w.r.t. a famous MPI library (Vo)
  – The bug was absent at a smaller scale – it was a concurrency bug
• Attempt to implement collective communication (Thakur)
  – The bug exists for ranges of size parameters
• Wrong assumption that an MPI barrier was irrelevant (Siegel)
  – It was not – a communication race was created
• Other common bugs (we see them a lot; potentially concurrency-dependent)
  – Forgetting to wait for a non-blocking receive to finish
  – Forgetting to free communicators and type objects
• Some codes may be considered buggy if non-determinism arises!
  – Use of MPI_Recv(*) often does not result in non-deterministic execution
  – Need something more than "superficial inspection"

Page 29: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Real bug stories in MPI-land

• Typing a[i][i] = init instead of a[i][j] = init
• Communication races
  – An unintended send matches a "wildcard receive"
• Bugs that show up when ported
  – Runtime buffering changes; deadlocks erupt
  – Sometimes bugs show up when buffering is added!
• Misunderstood "collective" semantics
  – Broadcast does not have "barrier" semantics
• MPI + threads
  – Royal troubles await the newbies

Page 30: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Our Research Agenda in HPC

• Solve FV of pure MPI applications "well"
  – Progress in non-determinism coverage for a fixed test harness
  – MUST integrate with good error monitors
• (Preliminary) work on hybrid MPI + something
  – Something = Pthreads and CUDA so far
  – Evaluated heuristics for deterministic replay of Pthreads + MPI
• Work on CUDA/OpenCL analysis
  – Good progress on a symbolic static analyzer for CUDA kernels
  – (Preliminary) progress on a symbolic test generator for CUDA programs
• (Future) symbolic test generation to "crash" hybrid programs
  – Finding lurking crashes may be a communicable value proposition
• (Future) intelligent schedule-space exploration
  – Focus on non-monolithic MPI programs

Page 31: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Motivation for Coverage of Communication Nondeterminism

Page 32: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Eliminating wasted search in message-passing verification

P0                          P1                              P2
---                         ---                             ---
MPI_Send(to P1…);           MPI_Recv(from P0…);             MPI_Send(to P1…);
MPI_Send(to P1, data=22);   MPI_Recv(from P2…);             MPI_Send(to P1, data=33);
                            MPI_Recv(*, x);
                            if (x==22) then error1
                            else MPI_Recv(*, x);

Page 33: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A frequently followed approach: "boil the whole schedule space" – often very wasteful.

@InProceedings{PADTAD2006:JitterBug,
  author    = {Richard Vuduc and Martin Schulz and Dan Quinlan and Bronis de Supinski and Andreas S{\ae}bj{\"o}rnsen},
  title     = {Improving distributed memory applications testing by message perturbation},
  booktitle = {Proc.~4th Parallel and Distributed Testing and Debugging (PADTAD) Workshop, at the International Symposium on Software Testing and Analysis},
  address   = {Portland, ME, USA},
  month     = {July},
  year      = {2006}
}

Page 34: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Eliminating wasted work in message-passing verification

But consider these two cases… there is no need to play with schedules of deterministic actions.

(Same example as on the previous slide.)

P0                          P1                              P2
---                         ---                             ---
MPI_Send(to P1…);           MPI_Recv(from P0…);             MPI_Send(to P1…);
MPI_Send(to P1, data=22);   MPI_Recv(from P2…);             MPI_Send(to P1, data=33);
                            MPI_Recv(*, x);
                            if (x==22) then error1
                            else MPI_Recv(*, x);
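A direct C rendering of the example (a sketch; tags and payload values are invented): the only scheduling choice that matters is which of the two second sends matches the wildcard receive.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)          /* run with exactly 3 ranks */
{
    int rank, x, payload;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 || rank == 2) {
        payload = 0;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* deterministic match */
        payload = (rank == 0) ? 22 : 33;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* races for the wildcard */
    } else {                              /* rank 1 */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (x == 22) printf("error1 reached\n");                 /* one possible match */
        else MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}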

Page 35: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Need to detect Resource-Dependent Bugs

Page 36: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Example of a Resource-Dependent Bug

P0                  P1
Send(to:1);         Send(to:0);
Recv(from:1);       Recv(from:0);

We know that this program may deadlock with less Send buffering…

Page 37: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

(Same program.) … and with more Send buffering the deadlock may be avoided.
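A compilable sketch of the two-process pattern (the message size is invented): whether it completes depends entirely on whether the MPI library buffers the sends.

#include <mpi.h>

int main(int argc, char **argv)             /* run with exactly 2 ranks */
{
    static int out[1 << 16], in[1 << 16];
    int rank, other;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;

    /* Both ranks send first.  If the runtime buffers the message, both sends
       return and the receives match.  If it does not (e.g. the message exceeds
       the eager threshold), both ranks block inside MPI_Send waiting for the
       other's receive: a head-to-head, buffering-dependent deadlock. */
    MPI_Send(out, 1 << 16, MPI_INT, other, 0, MPI_COMM_WORLD);
    MPI_Recv(in, 1 << 16, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}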

Page 38: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Example of a Resource-Dependent Bug

P0                  P1                  P2
Send(to:1);         Send(to:2);         Recv(from:*);
Send(to:2);         Recv(from:0);       Recv(from:0);

… but this program deadlocks if Send(to:1) has more buffering!


Page 42: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

(Same program.) With Send(to:1) buffered, P0 races ahead to Send(to:2), which can match P2's wildcard Recv(from:*) instead of P1's Send(to:2); P1's Send(to:2) and P2's Recv(from:0) are then mismatched – hence a deadlock.
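The three-process example in compilable MPI C (a sketch): whether it completes depends on which send the wildcard receive matches, which in turn depends on how much buffering the first send gets.

#include <mpi.h>

int main(int argc, char **argv)             /* run with exactly 3 ranks */
{
    int rank, v = 0, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* Send(to:1) */
        MPI_Send(&v, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);   /* Send(to:2) */
    } else if (rank == 1) {
        MPI_Send(&v, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);   /* Send(to:2) */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {                                              /* rank 2 */
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}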

Page 43: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Widely Publicized Misunderstandings

"Your program is deadlock-free if you have successfully tested it under zero buffering."

(The previous example is a counterexample: it completes under zero buffering but deadlocks once Send(to:1) is buffered.)

Page 44: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MPI at fault?

• Perhaps partly
  – Over 17 years of MPI, things have changed
  – Inevitable use of shared-memory cores, GPUs, …
  – Yet many of the issues seem fundamental to:
    • the need for wide adoption across problems, languages, machines
    • the need to give the programmer a better handle on resource usage
• How to evolve out of MPI?
  – Whom do we trust to reset the world?
  – Will they get it any better?
  – What about the train wreck meanwhile?
• Must one completely evolve out of MPI?

Page 45: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Our Impact So Far

[Figure: the hybrid-concurrency stack annotated with our tools – PUG and GKLEE at the GPU/CPU level (Sandy Bridge, GeForce GTX 480, AMD Fusion APU), ISP and DAMPI for the MPI-based user applications and problem-solving environments (e.g., Uintah, Charm++, ADLB), plus useful formalizations to help test these]

Page 46: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Outline for L1

• Dynamic formal verification of MPI
  – It is basically testing which discovers all alternate schedules
• Coverage of communication non-determinism
  – Also gives us a "predictive theory" of MPI behavior
• Centralized approach: ISP
• GEM: tool integration within the Eclipse Parallel Tools Platform
• Demo of GEM

Page 47: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A Simple MPI Example

P0:  Isend(1, req);   Barrier;          Wait(req);
P1:  Irecv(*, req);   Barrier;          Recv(2);      Wait(req);
P2:  Barrier;         Isend(1, req);    Wait(req);

Page 48: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A Simple MPI Example

(Same program as above.)

• Non-blocking send – the send lasts from the Isend to its Wait
• The send buffer can be reclaimed only after the Wait clears
• Forgetting to issue the Wait leaks an MPI "request object"


Page 50: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A Simple MPI Example

(Same program as above.)

• Non-blocking receive – lasts from the Irecv to its Wait
• The receive buffer can be examined only after the Wait clears
• Forgetting to issue the Wait leaks an MPI "request object"

Page 51: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A Simple MPI Example

(Same program as above.)

• Blocking receive in the middle
• Equivalent to an Irecv followed immediately by its Wait
• The data fetched by Recv(2) is available before that of the Irecv

Page 52: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

A Simple MPI Example

(Same program as above.)

• Since P0's Isend and P1's Irecv can be "in flight", the Barrier can be crossed
• This allows P2's Isend to race with P0's Isend and match the Irecv(*)

Page 53: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Testing Challenges

(Same program as above.)

• Traditional testing methods may reveal only the P0→P1 match or only the P2→P1 match
• The P2→P1 match may first happen after the code is ported
• Our tools ISP and DAMPI automatically discover and run both tests, regardless of the execution platform
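The example in compilable MPI C (a sketch; buffer contents and tags are invented):

#include <mpi.h>

int main(int argc, char **argv)            /* run with exactly 3 ranks */
{
    int rank, buf = 0, x = 0, y = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* P0 */
        MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {                /* P1 */
        MPI_Irecv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Recv(&y, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else {                               /* P2 */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    /* If the Irecv(*) matches P2's Isend, P1's Recv(from:2) is left without a
       sender and the run deadlocks -- the schedule the tools force and detect. */
    MPI_Finalize();
    return 0;
}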

Page 54: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Flow of ISP

[Diagram: the MPI program is compiled into an executable; processes Proc1 … Procn run with an interposition layer that routes each MPI call to the ISP scheduler before it reaches the MPI runtime]

• The scheduler intercepts MPI calls
• It reorders and/or rewrites the actual calls going into the MPI runtime
• It discovers maximal non-determinism and plays through all choices
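The interposition layer can be realized with MPI's standard profiling interface (PMPI). A minimal sketch of the idea – our illustration, not ISP's code; the scheduler hook is hypothetical: the wrapper sees each call first, consults the scheduler, and only then forwards possibly rewritten arguments to the real library.

#include <mpi.h>

/* Hypothetical scheduler hook: a real tool would do a round-trip to the
   central scheduler here, which may replace MPI_ANY_SOURCE with a specific
   rank in order to force a particular match. */
static int ask_scheduler_for_source(int requested_source)
{
    return requested_source;   /* stub for the sketch */
}

/* PMPI-style wrapper: the application's MPI_Recv lands here first. */
int MPI_Recv(void *buf, int count, MPI_Datatype type, int source, int tag,
             MPI_Comm comm, MPI_Status *status)
{
    int actual_source = ask_scheduler_for_source(source);
    return PMPI_Recv(buf, count, type, actual_source, tag, comm, status);
}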

Page 55: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

ISP Scheduler Actions (animation)

[Animation over the example above: the scheduler collects each process's operations (Isend(1), Irecv(*), Barrier, …) without immediately forwarding them; once every process is fenced it issues the Barrier match-set to the MPI runtime; on a replay it rewrites Irecv(*) into Irecv(2) to force the alternative match – the remaining Recv(2) then has no match-set: Deadlock!]


Page 59: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Formalization of MPI Behavior to build Formal Dynamic Verifier (verification scheduler)

Page 60: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Formal Modeling of MPI Calls

• MPI calls are modeled in terms of four salient events
  – Call issued: all calls are issued in program order
  – Call returns: the code after the call can now be executed
  – Call matches: the event that marks the call committing
  – Call completes: all resources associated with the call are freed

Page 61: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

The Matches-Before Relation of MPI

1. Isend(to: Proc k, …); … Isend(to: Proc k, …)
2. Irecv(from: Proc k, …); … Irecv(from: Proc k, …)
3. Isend( &h ); … Wait( h )
4. Wait(..); … AnyMPIOp
5. Barrier(..); … AnyMPIOp
6. Irecv(from: Proc *, …); … Irecv(from: Proc k, …)   (conditional matches-before)
7. Irecv(from: Proc j, …); … Irecv(from: Proc *, …)   (conditional matches-before)
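A compact C sketch (our paraphrase of the slides, not ISP's data structures) of a per-call record carrying the four salient events, together with a predicate approximating rules 1–7; the conditional rules 6–7 are treated unconditionally here for simplicity.

#include <stdbool.h>

typedef enum { OP_ISEND, OP_IRECV, OP_WAIT, OP_BARRIER, OP_OTHER } OpKind;

typedef struct {
    OpKind kind;
    int proc, pc;        /* issuing process and its program-order position */
    int peer;            /* destination/source rank; WILDCARD for Irecv(*)  */
    int handle;          /* request handle for Isend/Irecv/Wait             */
    bool issued, returned, matched, completed;   /* the four salient events */
} MPIOp;

#define WILDCARD (-1)

/* Does a matches-before b?  (a and b from the same process, a earlier in program order.) */
bool mb_before(const MPIOp *a, const MPIOp *b)
{
    if (a->proc != b->proc || a->pc >= b->pc) return false;
    if (a->kind == OP_ISEND && b->kind == OP_ISEND && a->peer == b->peer) return true;  /* 1 */
    if (a->kind == OP_IRECV && b->kind == OP_IRECV && a->peer == b->peer) return true;  /* 2 */
    if ((a->kind == OP_ISEND || a->kind == OP_IRECV) &&
        b->kind == OP_WAIT && a->handle == b->handle)                     return true;  /* 3 */
    if (a->kind == OP_WAIT)                                               return true;  /* 4 */
    if (a->kind == OP_BARRIER)                                            return true;  /* 5 */
    if (a->kind == OP_IRECV && b->kind == OP_IRECV && a->peer == WILDCARD) return true; /* 6 (conditional) */
    if (a->kind == OP_IRECV && b->kind == OP_IRECV && b->peer == WILDCARD) return true; /* 7 (conditional) */
    return false;
}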

Page 62: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

The ISP Scheduler

• Pick a process Pi and its instruction Op at PC n
• If Op does not have an unmatched ancestor according to MB, then collect Op into the scheduler's reorder buffer
  – Stay with Pi, increment n
• Else switch to Pi+1, until all Pi are "fenced"
  – "Fenced" means all current Ops have unmatched ancestors
• Form match sets according to a priority order
  – If the match set is {s1, s2, …, sK} + R(*), cover all cases using stateless replay
• Issue an eligible set of match-set Ops into the MPI runtime
• Repeat until all processes are finalized or an error is encountered

Theorem: this scheduling method achieves ND-coverage in MPI programs!

Page 63: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

How MB helps predict outcomes

Will this single-process example, called "Auto-send", deadlock?

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);
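The same program in full MPI (a sketch; the shorthand R/B/S/W stands for Irecv/Barrier/Isend/Wait on handles h1 and h2):

#include <mpi.h>

int main(int argc, char **argv)            /* run with a single rank */
{
    int rank, out = 42, in = 0;
    MPI_Request h1, h2;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Irecv(&in, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &h1);   /* R(from:0, h1) */
    MPI_Barrier(MPI_COMM_WORLD);                                 /* B             */
    MPI_Isend(&out, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &h2);   /* S(to:0, h2)   */
    MPI_Wait(&h1, MPI_STATUS_IGNORE);                            /* W(h1)         */
    MPI_Wait(&h2, MPI_STATUS_IGNORE);                            /* W(h2)         */

    MPI_Finalize();
    return 0;
}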


Page 65: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

[Diagram: the matches-before (MB) ordering over these five operations]

Page 66: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect R(from:0, h1)

Scheduler's Reorder Buffer: R(from:0, h1)
The MPI Runtime

Page 67: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect B

Scheduler's Reorder Buffer: R(from:0, h1), B
The MPI Runtime

Page 68: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

P0 is fenced; form the match set { B } and send it to the MPI runtime!

Scheduler's Reorder Buffer: R(from:0, h1), B
The MPI Runtime

Page 69: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect S(to:0, h2)

Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2)
The MPI Runtime

Page 70: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

P0 is fenced. So form the {R, S} match set. Fire!

Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2)
The MPI Runtime

Page 71: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect W(h1)

Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1)
The MPI Runtime

Page 72: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

P0 is fenced. So form the {W(h1)} match set. Fire!

Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1)
The MPI Runtime

Page 73: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect W(h2)

Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1), W(h2)
The MPI Runtime

Page 74: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Fire W(h2)! The program finishes without deadlocks!

Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1), W(h2)
The MPI Runtime

Page 75: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Outline for L2

• Dynamic formal verification of MPI
  – It is basically testing which discovers all alternate schedules
• Coverage of communication non-determinism
  – Also gives us a "predictive theory" of MPI behavior
• Centralized approach: ISP
• GEM: tool integration within the Eclipse Parallel Tools Platform
  – DEMO OF GEM
• Distributed approach: DAMPI

Page 76: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Dynamic Verification at Scale

Page 77: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Why do we need a Distributed Analyzer for MPI Programs (DAMPI)?

• The ROB and the MB graph can get large
  – Limit of ISP: "32 ParMETIS (15K LOC) processes"
• We need to dynamically verify real apps
  – Large data sets, 1000s of processes
  – High-end bugs are often masked when downscaling
  – Downscaling is often IMPOSSIBLE in practice
  – ISP is too sequential! Employ the parallelism of a large cluster!
• What do we give up?
  – We can't do "what if" reasoning
    • What if a PARTICULAR Isend has infinite buffering?
  – We may have to give up precision for scalability

Page 78: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

DAMPI Framework

[Diagram: the MPI program is compiled into an executable; processes Proc1 … Procn run over DAMPI's PnMPI modules and the MPI runtime; alternate matches discovered in a run feed a schedule generator whose epoch decisions drive reruns]

DAMPI – Distributed Analyzer for MPI

Page 79: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Distributed Causality Tracking in DAMPI

• Alternate matches are co-enabled, concurrent actions that can match according to MPI's matching rules
• DAMPI performs an initial run of the MPI program and discovers alternative matches
• We have developed MPI-specific sparse logical-clock tracking
  – Vector clocks (no omissions, less scalable)
  – Lamport clocks (no omissions in practice, more scalable)
• We have gone up to 1000 processes (10K seems possible)

Page 80: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

DAMPI uses Lamport clocks to build happens-before relationships

• Use Lamport clocks to track happens-before
  – Sparse Lamport clocks – only "count" non-deterministic events
  – The MPI MB relation is "baked" into the clock-update rules
  – The clock is incremented after completion of an MPI_Recv(ANY_SOURCE)
  – Nested blocking / non-blocking operations are handled
  – Incoming clocks are compared to detect potential matches

[Diagram: P0 and P2 each Send to P1, piggybacking their clock (0); P1's wildcard receive R1(*) completes and P1's clock becomes 1]
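A toy C sketch of the clock rule just described (our illustration, not DAMPI's implementation): only wildcard receives tick the clock, every message piggybacks the sender's clock, and the log is what the later analysis compares.

#define MAX_ND_EVENTS 1024

typedef struct { int payload; int piggyback; } Msg;

static int my_clock = 0;                         /* this process's sparse Lamport clock */
static int match_log[MAX_ND_EVENTS][2];          /* {receiver clock, matched send's clock} */
static int n_nd_events = 0;

void wrap_send(Msg *m, int payload)
{
    m->payload   = payload;
    m->piggyback = my_clock;                     /* the clock travels with the message */
}

void wrap_wildcard_recv_completed(const Msg *m)
{
    match_log[n_nd_events][0] = my_clock;        /* receiver's clock at the match         */
    match_log[n_nd_events][1] = m->piggyback;    /* clock carried by the matched send     */
    n_nd_events++;
    my_clock += 1;                               /* only non-deterministic events "count" */
}
/* After the run, any concurrent send whose piggybacked clock is lower than
   match_log[k][0] is reported as an eligible alternative match for wildcard
   receive k (see the next slide). */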

Page 81: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

How we use the happens-before relationship to detect alternative matches

• Question: could P2's send have matched P1's receive?
• Answer: yes!
• The earliest message with a clock value lower than the current process clock is an eligible match

[Diagram: the same run; P2's Send(P1), piggybacking clock 0, is an eligible alternative match for P1's R1(*)]

Page 82: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

DAMPI Algorithm Review: (1) discover alternative matches during the initial run

[Diagram: in the initial run P1's R1(*) matches one of the sends; the piggybacked clocks reveal that the other Send(P1) was an eligible alternative]

Page 83: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

DAMPI Algorithm Review: (2) force alternative matches during replay

[Diagram: on replay the wildcard is rewritten to a specific source – R1(2) – so that P2's Send(P1) is the one that matches; the originally matching send is crossed out]

Page 84: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

DAMPI maintains very good scalability vs ISP

[Chart: ParMETIS-3.1 (no wildcards), time in seconds vs. number of tasks (4, 8, 16, 32), comparing ISP and DAMPI]

Page 85: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

DAMPI is also faster at processing interleavings

[Chart: matrix multiplication with wildcard receives, time in seconds vs. number of interleavings (250–1000), comparing ISP and DAMPI]

Page 86: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Results on large applications: SPEC MPI2007 and NAS PB

Slowdown is for one interleaving; no replay was necessary.

[Chart: base vs. DAMPI runtime at 1024 processes for ParMETIS-3.1, 107.leslie3d, 113.GemsFDTD, 126.lammps, 130.socorro, 137.lu, and the NAS benchmarks BT, CG, DT, EP, FT, IS, LU, MG]

Page 87: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

The Devil in the Details of DAMPI

• Textbook Lamport/vector clocks are inapplicable
  – "Clocking" all events won't scale
• So "clock" only non-deterministic events
  – When do non-blocking calls finish?
    • We can't "peek" inside the MPI runtime
    • We don't want to impede execution
    • So we have to infer when they finish
  – Later blocking operations can force matches-before
    • So clocks must be adjusted when blocking operations happen
• Our contributions: sparse VC and sparse LC algorithms
• Handling real corner cases in MPI
  – Synchronous sends "learning" about a non-blocking Recv's start!

Page 88: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Questions from Monday's Lecture

• What bugs are caught by ISP / DAMPI?
  – Deadlocks
  – MPI resource leaks
    • Forgotten dealloc of MPI communicators, MPI type objects
    • Forgotten Wait after Isend or Irecv (request objects leaked)
  – C assert statements (safety checks)
• What have we done wrt the correctness of our MPI verifier ISP's algorithms?
  – Testing + a paper-and-pencil proof using standard ample-set-based proof methods
  – A brief look at the formal transition system of MPI, and ISP's transition system
• Why have HPC folks not built a happens-before model for APIs such as MPI?
  – HPC folks are grappling with many challenges and actually doing a great job
  – There is a lack of "computational thinking" in HPC that must be addressed
    • Non-CS backgrounds are common; see the study by Jan Westerholm in EuroMPI 2010
  – CS folks must step forward to help
    • Why this does not naturally happen: "numerical methods" are not popular in core CS, and there isn't a clearly discernible "HPC industry"
  – Wider use of GPUs, physics-based gaming, … can help push CS toward HPC
  – Mobile devices will use CPUs + GPUs and do "CS problems" and "HPC problems" in a unified setting (e.g., face recognition, …)
• Our next two topics (PUG and XUM) touch on our attempts in this area

Page 89: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Outline for L3

• Formal analysis methods for accelerators/GPUs
  – Same thing
  – In the future there may be a large-scale mish-mash of CPUs and GPUs
  – CPUs will also have mini GPUs, SIMD units, …
    • Again, revisit the Borkar / Chien article
• Regardless…
  – It looks a lot unlike traditional Pthreads/Java/C# threading
  – We would like to explore how to debug these kinds of codes efficiently, and help designers explore their design space while root-causing bugs quickly

Page 90: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Why develop FV methods for CUDA?

• GPUs are key enablers of HPC
  – Many of the top-10 machines are GPU-based
  – I found the presentation by Paul Lindberg eye-opening: http://www.youtube.com/watch?v=vj6A8AKVIuI
• Interesting debugging challenges
  – Races
  – Barrier mismatches
  – Bank conflicts
  – Asynchronous MPI / CUDA interactions

Page 91: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

What are GPUs (aka Accelerators)?

[Images: Sandy Bridge (courtesy anandtech.com), the OpenCL compute model (courtesy Khronos group), GeForce GTX 480 (Nvidia), AMD Fusion APU]

• Three of the world's Top-5 supercomputers are built using GPUs
• The world's greenest supercomputers are also GPU-based
• CPU/GPU integration is a clear trend
• Hand-held devices (iPhone) will also use them

Page 92: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Example: Increment Array Elements – contrast between CPUs and GPUs

CPU program:

void inc_cpu(float* a, float b, int N) {
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main() {
    ...
    inc_cpu(a, b, N);
}

CUDA program:

__global__ void inc_gpu(float* A, float b, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        A[tid] = A[tid] + b;
}

void main() {
    ...
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    inc_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}

Page 93: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

(Same code as above.) Fine-grained GPU threads are scheduled to run like this: tid = 0, 1, 2, 3, …

Page 94: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Why heed many-core / GPU compute?

EAGAN, Minn., May 3, 1991 (John Markoff, NY Times): The Convex Computer Corporation plans to introduce its first supercomputer on Tuesday. But Cray Research Inc., the king of supercomputing, says it is more worried by "killer micros" – compact, extremely fast workstations that sell for less than $100,000.

Take-away: Clayton Christensen's "disruptive innovation" – the GPU is a disruptive technology!

Page 95: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Other Reasons to Study GPU Compute

GPUs offer an eminent setting in which to study heterogeneous CPU organizations and memory hierarchies

Page 96: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

What bugs are caught? What value?

• GPU hardware is still stabilizing
  – Characterize GPU hardware formally
  – Currently, program behavior may change with the platform
  – Micro-benchmarking sheds further light: www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf
• The software is the real issue!

Page 97: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

GPU Software Bugs / Difficulties

• The SIMD model is unfamiliar
  – Synchronization, races, deadlocks, … are all different!
• Machine constants and program assumptions about problem parameters affect correctness / performance
  – GPU "kernel functions" may fail or perform poorly
  – Formal reasoning can help identify performance pitfalls
• The tools are still young, so emulators and debuggers may not match hardware behavior
• Multiple memory subsystems are involved, further complicating semantics / performance

Page 98: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Approaches/Tools for GPU SW Verification

• Instrumented dynamic verification on GPU emulators
  – Boyer et al., STMCS 2008
• Grace: combined static / dynamic analysis
  – Zheng et al., PPoPP 2011
• PUG: an SMT-based static analyzer
  – Lee and Gopalakrishnan, FSE 2010
• GKLEE: an SMT-based test generator
  – Lee, Rajan, Ghosh, and Gopalakrishnan (under submission)

Page 99: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Planned Overall Workflow of PUG (realized)

[Figure: PUG's planned workflow]

Page 100: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI


Workflow and Results from PUG

Page 101: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Examples of a Data Race – increment an N-element vector A by scalar b

[Figure: threads tid = 0 … 15, each updating one element: A[0]+b … A[15]+b]

__global__ void inc_gpu(float* A, float b, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        A[tid] = A[tid - 1] + b;   // reads the neighbouring element that another thread writes: a race
}

Page 102: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

(Same kernel as above.) PUG's SMT encoding of the conflicting accesses:

(and (and (/= t1.x t2.x) ……
  (or (and (bv-lt idx1@t1 N0) (bv-lt idx1@t2 N0) (= (bv-sub idx1@t1 0b0000000001) idx1@t2))
      (and (bv-lt idx1@t1 N0) (bv-lt idx1@t2 N0) (= idx1@t1 idx1@t2))))

Encoding for the write-write race; encoding for the read-write race


Page 104: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Real Example with a Race

__global__ void computeKernel(int *d_in, int *d_out, int *d_sum) {
    d_out[threadIdx.x] = 0;
    for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
        d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
    }
    __syncthreads();

    if (threadIdx.x % 2 == 0) {
        for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
            d_out[threadIdx.x + SIZE/BLOCKSIZE*i] +=
                d_out[threadIdx.x + SIZE/BLOCKSIZE*i + 1];
    ...

/* The counterexample given by PUG is: t1.x = 2, t2.x = 10, i@t1 = 1, i@t2 = 0.
   Threads t1 and t2 are in iterations 1 and 0 respectively.
   t1 generates the access d_out[2 + 8*1] += d_out[2 + 8*1 + 1], i.e. d[10] += d[11];
   t2 generates            d_out[10 + 8*0] += d_out[10 + 8*0 + 1], i.e. d[10] += d[11].
   This is a real race on the writes to d[10] because of the += done by both threads. */


Page 106: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

PUG Divides and Conquers Each Barrier Interval

[Figure: two threads t1, t2 execute accesses a1 … a6; a barrier separates barrier interval BI1 (a1, a2) from BI3 (a3 and, under predicate p, a4 and a5)]

BI1 is conflict-free iff, for all t1, t2: the accesses a1@t1, a2@t1, a1@t2, a2@t2 do not conflict with each other.

BI3 is conflict-free iff the following accesses do not conflict, for all t1, t2: a3@t1, p: a4@t1, p: a5@t1, a3@t2, p: a4@t2, p: a5@t2.

Page 107: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

PUG Results (FSE 2010)

Kernels (in CUDA SDK)   LOC    +O/+C/+R   B.C.   Time (pass)
Bitonic Sort              65              HIGH   2.2s
MatrixMult               102   * *        HIGH   <1s
Histogram64              136              LOW    2.9s
Sobel                    130   *          HIGH   5.6s
Reduction                315              HIGH   3.4s
Scan                     255   * * *      LOW    3.5s
Scan Large               237   * *        LOW    5.7s
Nbody                    206   *          HIGH   7.4s
Bisect Large            1400   * *        HIGH   44s
Radix Sort              1150   * * *      LOW    39s
Eigenvalues             2300   * * *      HIGH   68s

+O: requires no bit-vector overflow; +C: requires constraints on the input values; +R: requires manual refinement; B.C.: how serious the bank conflicts are; Time: SMT solving time

Page 108: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

"GPU class" bugs (tested over 57 assignment submissions)

Defects     Barrier Error or Race    Refinement
            benign      fatal        over #kernels    over #loops
13 (23%)    3           2            17.5%            10.5%

Defects: how many kernels are not well parameterized, i.e. work only in certain configurations.
Refinement: how many loops need automatic refinement (by PUG).

Page 109: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

PUG's Strengths and Weaknesses

• Strengths:
  – SMT-based incisive static analysis avoids interleaving explosion
  – Still obtains coverage guarantees
  – Good for GPU library FV
• Weaknesses:
  – Engineering effort: C++, templates, breaks, …
  – SMT "explosion" for value correctness
  – Does not help test the code on the actual hardware
  – False alarms require manual intervention
    • Handling loops, kernel calling contexts

Page 110: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Thorough internal documentation of PUG is available

• http://www.cs.utah.edu/fv/mediawiki/index.php/PUG
• One recent extension (ask me for our EC2 2011 paper)
  – Symbolic analysis to detect occurrences of non-coalesced memory accesses

Page 111: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Two directions in progress

• Parameterized verification
  – See Li's dissertation for initial results
• Formal testing
  – A brand new code base called GKLEE
  – Joint work with Fujitsu Research

Page 112: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

GKLEE: Formal Testing of GPU Kernels (joint work with Li, Rajan, Ghosh of Fujitsu)

Page 113: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Evaluation of GKLEE vs. PUG

                    n = 4                 n = 16                n = 64         n = 256       n = 1024
Kernels             PUG    GKLEE          PUG    GKLEE          GKLEE          GKLEE         GKLEE
Simple Reduction    2.8    <0.1 (<0.1)    T.O.   <0.1 (<0.1)    <0.1 (<0.1)    0.2 (0.3)     2.3 (2.9)
Matrix Transpose    1.9    <0.1 (<0.1)    T.O.   <0.1 (0.3)     <0.1 (3.2)     <0.1 (63)     0.9 (T.O.)
Bitonic Sort        3.7    0.9 (1)        T.O.   T.O.           T.O.           T.O.          T.O.
Scan Large          ▬      <0.1 (<0.1)    ▬      <0.1 (<0.1)    0.1 (0.2)      1.6 (3)       22 (51)

Execution time of PUG and GKLEE on kernels from the CUDA SDK for functional correctness. n is the number of threads. T.O. means > 5 min.

Page 114: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Evaluation of GKLEE's Test Coverage

Kernels             sc cov.        bc cov.       min #tests   avg. covt     max. covt     exec. time
Bitonic Sort        100% / 100%    51% / 44%     5            78% / 76%     100% / 94%    1s
Merge Sort          100% / 100%    70% / 50%     6            88% / 80%     100% / 95%    0.5s
Word Search         100% / 92%     34% / 25%     2            100% / 81%    100% / 85%    0.1s
Radix Sort          100% / 91%     54% / 35%     6            91% / 68%     100% / 75%    5s
Suffix Tree Match   100% / 90%     38% / 35%     8            90% / 70%     100% / 82%    12s

Test generation results. The traditional metrics sc cov. and bc cov. give source-code and bytecode coverage respectively. The refined metrics avg. covt and max. covt measure the average and maximum coverage over all threads.

All coverages were far better than those achieved through random testing.

Page 115: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Outline for L4

• A brief introduction to MCAPI, MRAPI, MTAPI
• XUM: an eXtensible Utah Multicore system
• What we are able to learn by building a hardware realization of a custom multicore
  – How can we push this direction forward?
  – Any collaborations?
    • Clearly there are others who do this full-time
    • This has been a side project (by two very good students, albeit) in our group
    • So we would like to build on others' creations also…

Page 116: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Multicore Association APIs

• http://www.multicore-association.org
• Reasons for our interest in the MCA APIs
  – Our project through the Semiconductor Research Corporation
  – Collaborator in dynamic verification of MCAPI applications: Prof. Eric Mercer (BYU, Utah)
    – The BYU team has also developed a formal spec for MCAPI and MRAPI, and built golden executable models from these specs
• XUM
  – A Utah project involving two students
    • MS project of Ben Meakin
    • BS project of Grant Ayers
  – An attempt to support MCAPI functions in HW+SW
  – (Later) hoping to support MRAPI

Page 117: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MCAPI (picture courtesy of the Multicore Association)

• A facility to interconnect heterogeneous embedded multicore systems/chips
• These systems could be very minimalistic
  • No OS, different OSes; could be DSPs, CPUs, …
• Standardization (and revision) finished around 2009
• There were no widely used, portable communication APIs in this space
• Currently two commercial implementations
  • Mentor's Open MCAPI
  • Polycore's Messenger
• XUM is the only hardware-assisted implementation of the communication API

Page 118: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MCAPI Calls, Use Cases, Expectations

• Lead figures in MCAPI standardization (so far as our interactions go)
  – Jim Holt (Freescale), Sven Brehmer (Polycore), Markus Levy (Multicore Association)
• "Endpoints" are connected
  – Each endpoint could be a thread, a process, …
  – Blocking and non-blocking communication support
    • MCAPI_Send, MCAPI_Send_I, …
    • Waits, Tests
    • No barriers (in the API); one could implement them
    • Create endpoints (a collective call)
• Use cases
  – Present use cases are in C/Pthreads, with each thread performing MCAPI calls to communicate
  – Very reminiscent of monolithic-style MPI programs (with all their drawbacks)
• General expectation
  – That MCAPI will be used as a standard transport layer on top of which one may implement higher abstractions
  – One project: Chapman (Univ. Houston) work on realizing OpenMP
  – Also suggested: a task-graph (or other) higher-level abstraction to specify computations, with a "smart runtime" employing MCAPI

Page 119: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MCAPI Tool Support

• Currently no "formal" debugging tools
  – Not enough case studies yet (projects underway in Prof. Alper Sen's group)
• MCC: an MCAPI Correctness Checker
  – Subodh Sharma (PhD student)
  – Borrows from the dynamic verification tool designs of ISP (our MPI checker) and Inspect (our Pthreads checker)
  – Dynamic verification against existing MCAPI libraries
  – MCC incurs new headaches
    • Hybrid Pthreads/MCAPI behaviors
    • Deterministic replay of schedules is often difficult
  – Our present conclusion:
    • Don't go there!
    • We know that dynamic verification of hybrid concurrent programs is a royal pain!
  – Waiting for higher abstractions / better practices to emerge in the area
• BYU projects on model checking using an MCAPI golden executable model
  – The main difference is that they rely on an MCAPI operational semantics whereas we capitalize on the behavior of an MCAPI library

Page 120: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MRAPI

• Portable resource API
  – Portable mutexes, mallocs
  – Portable varieties of software-managed shared memory, DMA
  – Pthreads and Unix facilities won't do
    • Not well matched with the requirements of heterogeneous multicores with disparate sets of features / resources
• MTAPI standardization: yet to begin
• One possible usage of MCAPI + MRAPI:
  – An MCAPI send call allocates a buffer using MRAPI calls
  – The MCAPI send happens, say using XUM's network or MRAPI's software DMA
  – MRAPI calls free up the buffer

Page 121: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

XUM is currently being prototyped on the popular XUP board

Two XUPV5-LX110T boards obtained courtesy of Xilinx Inc. !

Page 122: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

XUM architecture

• 32-bit MIPS ISA-compliant cores
• The request network is in-order, dimension-order routed
  • Wormhole flow control
• Each router unit arbitrates round-robin
• The datapath in the request network is 16 bits wide
• The ack network has broadcast and point-to-point transfers
• We can plug in I/O devices as if they are tiles
• All of this exists as VHDL+Verilog mapped onto FPGAs

Page 123: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

XUM architecture

• Current memory architecture:
  • All tiles have memory ports leading to an FCFS arbiter that is backed by a pipelined DDR2 controller to SDRAM
• All tiles can have their own clocks
  • About 4 physical clock sources; PLL primitives available (Xilinx tools)
  • Additional clocks can be synthesized
  • Currently 500 MHz for a flip-flop; 100 MHz realizable. Look at OpenSPARC.

Page 124: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Recent XUM Achievements

• Built a bootloader (FPGA) and a protocol for downloading code images over RS-232
• XUM memory controller
  – Built a fully functional DDR2-SDRAM memory controller that provides usable amounts of memory on-board
    • Support for pipelined transfers
  – Built a simple FCFS-arbitrated memory controller (shared by all cores)
    • All cores share the same address space – no protection, but handy!
• Ported the XUM MIPS cores to 32 bits, and debugged the CPUs
  – Debugged the CPU some (more needed)
    • Found errors in the delay-slot and forwarding logic…
  – Would be a good test vehicle for pipelined-CPU FV methods

Page 125: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Other features / status

More details:
• 5–7 stage in-order pipeline
• No branch predictor (will add); no speculation
• Web documentation status: VHDL/Verilog code available

Software story:
• GCC would be usable
  – Must add some new instructions such as load-immediate-upper
  – Inline assembly for the XUM instructions
• FPU is TBD – new student

RTOS story:
• FreeRTOS (e.g.) – compile using GCC → download

Page 126: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Further software story…

MCAPI and MRAPI realization:
• Our BYU collaborator, Prof. Mercer, has MCAPI and MRAPI golden executable models (as formal state-transition rules)
• We will compile these into detailed C implementations

Programming approach:
• Not recommending straight coding using MCAPI / MRAPI, as the code soon becomes a "rat's nest"
• We will investigate compiling tasking primitives into a runtime that is supported by MCAPI and MRAPI

Other ideas?

Page 127: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

XUM Details

• See Ben Meakin's MS thesis
  – Available from http://www.cs.utah.edu/fv
• The thesis provides:
  – Code for send/receive
  – Correctness properties of interest wrt XUM
    • A good test vehicle for HW FV projects
  – Memory-footprint data
    • Very parsimonious support for MCAPI is possible
  – Latency/throughput measurements on XUM
    • Also a comparison with a Pthread-based baseline

Page 128: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

XUM Communications Datapath

Page 129: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

XUM ISA Extensions

• Send header
• Send word
• Send tail
• Send ack
• Broadcast
• Receive ack

Page 130: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

MCAPI Message Send

• Disable interrupts
• asm("sndhd.s …")
• while (i < bufsize)
  – asm("sndw …")
• asm("sendtl …")
• asm("recack …")
• Enable interrupts
• Support for connectionless and connected MCAPI protocols
  – The latter achieved by not issuing a tail flit until the connection needs to be closed

Page 131: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Further ideas / thoughts

• The embedded multicore space is likely very influential
  – Enables development of hardware assist for new APIs and runtime mechanisms
  – Even the HPC space may be influenced by design ideas percolating from below
  – Dynamic formal verification tools may employ "hooks" into the hardware
    • Avoids the "dirty tricks" we had to use in ISP to get control over the MPI runtime very indirectly
      – Faking "Wait" operations, pre-issuing Waits to poke the MPI progress engine, etc.
• In the end, we can sell what we can debug
• Time to market may be minimized through better FV / dynamic verification support provided by HW
• A great teaching tool
  – If the FPGA design tool-chain becomes a bit kinder/gentler
  – Projects such as Lava, Kiwi, … (MSR) provide rays of hope…

Page 132: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Details of ISP and DAMPI

Work out the crooked-barrier example on the board, assisted by formal transition systems:
• a transition system for MPI,
• then a transition system for the ISP centralized scheduler (as an interposition layer),
• then a transition system for DAMPI's distributed scheduler (sparse Lamport-clock based).

The formal transition systems clearly show how the native semantics of MPI has been "tamed" by specific scheduler implementations!

Page 133: (1) Formal Verification for Message Passing  and GPU Computing (2) XUM: An Experimental  Multicore  supporting MCAPI

Concluding Remarks

A summary of the explorations of a group (especially its advisor) in "mid-life crisis", wanting to be relevant and wanting to be formal (also wanting to be liked).

In the end it was worth it.

Must now skate to where the puck will be!