1
MPI Verification
Ganesh Gopalakrishnan and Robert M. Kirby Students
Yu Yang, Sarvani Vakkalanka, Guodong Li, Subodh Sharma, Anh Vo, Michael DeLisi, Geof Sawaya
(http://www.cs.utah.edu/formal_verification)
School of Computing, University of Utah
Supported by: Microsoft HPC Institutes
NSF CNS 0509379
2
“MPI Verification”
or
How to exhaustively verify MPI programs
without the pain of model building, and considering only “relevant interleavings”
3
Computing is at an inflection point
(photo courtesy of Intel)
4
Our work pertains to these:
MPI programs
MPI libraries
Shared Memory Threads based on Locks
5
Name of the Game: Progress Through Precision
1. Precision in Understanding
2. Precision in Modeling
3. Precision in Analysis
4. Doing Modeling and Analysis with Low Cost
6
1. Need for Precision in Understanding:
The “crooked barrier” quiz
P0---
MPI_Isend ( P2 )
MPI_Barrier
P1---
MPI_Barrier
MPI_Isend( P2 )
P2---
MPI_Irecv ( ANY )
MPI_Barrier
Will P1’s Send Match P2’s Receive ?
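For concreteness, here is one way the quiz could be written as a runnable C/MPI program (our own sketch, not the authors' test harness; it assumes exactly 3 ranks, and a final draining receive is added so that both sends can complete):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, data = 42, recvd = -1;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* run with exactly 3 ranks */

    if (rank == 0) {
        MPI_Isend(&data, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, &req);  /* Isend(P2) */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Isend(&data, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, &req);  /* Isend(P2) */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else {
        MPI_Irecv(&recvd, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                  MPI_COMM_WORLD, &req);                           /* Irecv(ANY) */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Recv(&recvd, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);   /* drain the other send (not part of the quiz) */
    }

    MPI_Finalize();
    return 0;
}

Which Isend the wildcard Irecv ends up matching is exactly what the quiz asks about.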
7
Need for Precision in Understanding:
The “crooked barrier” quiz
P0---
MPI_Isend ( P2 )
MPI_Barrier
P1---
MPI_Barrier
MPI_Isend( P2 )
P2---
MPI_Irecv ( ANY )
MPI_Barrier
It will ! Here is the animation
(Slides 8–12: animation frames of the same program, stepping through how P1's MPI_Isend can match P2's wildcard MPI_Irecv even though it is issued after the barrier.)
13
Would you rather explain each conceivable situation in a large API with an elaborate “bee dance” and informal English… or would you rather specify it mathematically and let the user calculate the outcomes?
P0---
MPI_Isend ( P2 )
MPI_Barrier
P1---
MPI_Barrier
MPI_Isend( P2 )
P2---
MPI_Irecv ( ANY )
MPI_Barrier
14
TLA+ Spec of MPI_Wait (Slide 1/2)
15
TLA+ Spec of MPI_Wait (Slide 2/2)
16
Executable Formal Specification can help validate our understanding of MPI …
(Verification-environment diagram: a TLA+ MPI library model and TLA+ program model are checked with the TLC model checker; an MPIC program model / MPIC IR, produced via Visual Studio 2005 and the Phoenix compiler, is checked with the MPIC model checker. See FMICS 07 and PADTAD 07.)
17
The Histrionics of FV for HPC (1)
18
The Histrionics of FV for HPC (2)
19
Error-trace Visualization in VisualStudio
20
2. Precision in Modeling: The “Byte-range Locking Protocol” Challenge
Asked to see if a new protocol using MPI 1-sided operations was OK…
lock_acquire(start, end) {
    /* Stage 1 */
    val[0] = 1;   /* flag */
    val[1] = start;
    val[2] = end;
    while (1) {
        lock_win
        place val in win
        get values of other processes from win
        unlock_win
        for all i, if (Pi conflicts with my range)
            conflict = 1;
        /* Stage 2 */
        if (conflict) {
            val[0] = 0;
            lock_win
            place val in win
            unlock_win
            MPI_Recv(ANY_SOURCE)
        } else {
            /* lock is acquired */
            break;
        }
    } // end while
}
(Window layout: one (flag, start, end) triple of ints per process, each initialized to (0, -1, -1).)
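As a concrete (but hypothetical) illustration of Stage 1 using MPI one-sided operations – the window name lockwin, the helper name, and the layout assumptions (window on rank 0, disp_unit = sizeof(int)) are ours, not the protocol authors' code:

#include <mpi.h>

/* Returns 1 if some other process currently wants a conflicting byte range. */
static int stage1_try_acquire(MPI_Win lockwin, int myrank, int nprocs,
                              int start, int end, int *others /* 3*nprocs ints */)
{
    int val[3] = { 1, start, end };   /* flag = 1: I want the lock */
    int conflict = 0;

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, lockwin);
    /* place my (flag, start, end) triple in my slot of the window on rank 0 */
    MPI_Put(val, 3, MPI_INT, 0, 3 * myrank, 3, MPI_INT, lockwin);
    /* get the values of the other processes (slots before and after mine) */
    MPI_Get(others, 3 * myrank, MPI_INT,
            0, 0, 3 * myrank, MPI_INT, lockwin);
    MPI_Get(others + 3 * (myrank + 1), 3 * (nprocs - myrank - 1), MPI_INT,
            0, 3 * (myrank + 1), 3 * (nprocs - myrank - 1), MPI_INT, lockwin);
    MPI_Win_unlock(0, lockwin);       /* puts/gets complete when the epoch closes */

    for (int i = 0; i < nprocs; i++) {
        if (i == myrank) continue;
        int flag = others[3 * i], s = others[3 * i + 1], e = others[3 * i + 2];
        if (flag == 1 && s <= end && start <= e)   /* byte ranges overlap */
            conflict = 1;
    }
    return conflict;
}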
21
Precision in Modeling:The “Byte-range Locking Protocol” Challenge
Studied code
Wrote Promela Verification Model (a week)
Applied the SPIN Model Checker
Found Two Deadlocks Previously Unknown
Wrote Paper (EuroPVM/MPI 2006) with Thakur and Gropp – won one of the three best-paper awards
With new insight, Designed Correct AND Faster Protocol!
Still, we felt lucky … what if we had missed the error while hand-modeling
Also hand-modeling was NO FUN – how about running the real MPI code “cleverly”?
22
Measurement under Low Contention
23
Measurement under High Contention
24
4. Modeling and Analysis with Reduced Cost…
(Figure: two card decks, Card Deck 0 and Card Deck 1, each with cards in positions 0–5; a few cards in each deck are highlighted in red.)
• Only the interleavings of the red cards matter
• So don't try all riffle-shuffles: 12! / (6! 6!) = 924
• Instead just try TWO shuffles of the decks !!
25
What works for cards works for MPI (and for PThreads also) !!
P0 (owner of window) and P1 (non-owner of window) each execute:
0: MPI_Init
1: MPI_Win_lock
2: MPI_Accumulate
3: MPI_Win_unlock
4: MPI_Barrier
5: MPI_Finalize
• These are the dependent operations
• 504 interleavings without POR in this example
• 2 interleavings with POR !!
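A minimal C sketch of the two-rank program on this slide (an assumed shape; the exact benchmark is not shown in the deck):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, counter = 0, one = 1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* rank 0 owns (exposes) the window; rank 1 exposes nothing */
    MPI_Win_create(&counter, (rank == 0) ? (MPI_Aint)sizeof(int) : 0, sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* both ranks execute the same sequence, targeting rank 0's window */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
    MPI_Accumulate(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, MPI_SUM, win);
    MPI_Win_unlock(0, win);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

The two lock/accumulate/unlock epochs are the only operations that conflict, so POR only needs to explore their two relative orders.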
26
4. Modeling and Analysis with Reduced Cost: The “Byte-range Locking Protocol” Challenge
Studied code → DID NOT STUDY CODE
Wrote Promela Verification Model (a week) → NO MODELING
Applied the SPIN Model Checker → NEW ISP VERIFIER
Found Two Deadlocks Previously Unknown → FOUND SAME!
Wrote Paper (EuroPVM/MPI 2007) with Thakur and Gropp – won one of the three best-paper awards → DID NOT WIN
Still, we felt lucky … what if we had missed the error while hand-modeling → NO NEED TO FEEL LUCKY (NO LOST INTERLEAVING – but also did not foolishly do ALL interleavings)
Also hand-modeling was NO FUN – how about running the real MPI code “cleverly”? → DIRECT RUNNING WAS FUN
27
3. Precision in Analysis: The “crooked barrier” quiz again …
P0---
MPI_Isend ( P2 )
MPI_Barrier
P1---
MPI_Barrier
MPI_Isend( P2 )
P2---
MPI_Irecv ( ANY )
MPI_Barrier
Our Cluster NEVER gave us the P0 to P2 match !!!
Elusive Interleavings !!
Bites you the hardest when you port to a new platform !!
28
3. Precision in Analysis: The “crooked barrier” quiz again …
P0---
MPI_Isend ( P2 )
MPI_Barrier
P1---
MPI_Barrier
MPI_Isend( P2 )
P2---
MPI_Irecv ( ANY )
MPI_Barrier
SOLVED!! Using the new POE Algorithm
Partial Order Reduction in the presence of Out of Order Operations and Elusive Interleavings
29
Precision in Analysis
POE Works Great (all 41 Umpire Test-Suites Run)
No need to “pad” delay statements to jiggle the schedule and force “the other” interleaving
– This is a very brittle trick anyway!
Prelim Version Under Submission – Detailed Version for EuroPVM…
Jitterbug uses this approach – We don't need it
Siegel (MPI_SPIN): Modeling effort
Marmot: Different Coverage Guarantees…
30
1-4: Finally! Precision and Low Cost in Modeling and Analysis, taking advantage of MPI semantics (in our heads…)
P0---
MPI_Isend ( P2 )
MPI_Barrier
P1---
MPI_Barrier
MPI_Isend( P2 )
P2---
MPI_Irecv ( ANY )
MPI_Barrier
This is how POE does it
31
Discover All Potential Senders by Collecting (but not issuing) operations at runtime…
P0---
MPI_Isend ( P2 )
MPI_Barrier
P1---
MPI_Barrier
MPI_Isend( P2 )
P2---
MPI_Irecv ( ANY )
MPI_Barrier
32
Rewrite “ANY” to ALL POTENTIAL SENDERS
P0---
MPI_Isend ( P2 )
MPI_Barrier
P1---
MPI_Barrier
MPI_Isend( P2 )
P2---
MPI_Irecv ( P0 )
MPI_Barrier
33
Rewrite “ANY” to ALL POTENTIAL SENDERS
P0---
MPI_Isend ( P2 )
MPI_Barrier
P1---
MPI_Barrier
MPI_Isend( P2 )
P2---
MPI_Irecv ( P1 )
MPI_Barrier
34
Recurse over all such configurations !
P0---
MPI_Isend ( P2 )
MPI_Barrier
P1---
MPI_Barrier
MPI_Isend( P2 )
P2---
MPI_Irecv ( P1 )
MPI_Barrier
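The rewriting just described (collect, rewrite “ANY”, recurse) can be pictured with a PMPI-level interposition sketch; this is illustrative only, and scheduler_pick_sender() is a hypothetical placeholder, not ISP's actual interface:

#include <mpi.h>

/* Hypothetical: the verification scheduler returns the concrete sender chosen
 * for the branch being explored (e.g. P0 in one run, P1 in the next). */
extern int scheduler_pick_sender(int tag, MPI_Comm comm);

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
              MPI_Comm comm, MPI_Request *request)
{
    if (source == MPI_ANY_SOURCE)
        source = scheduler_pick_sender(tag, comm);   /* “ANY” rewritten to one sender */
    return PMPI_Irecv(buf, count, datatype, source, tag, comm, request);
}

Exploring one branch per potential sender, and recursing, is what replaces brute-force enumeration of all interleavings.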
35
If we now have P0-P2 doing this, and P3-5 doing the same computation between themselves, no need to interleave these groups…
P0---
MPI_Isend ( P2 )
MPI_Barrier
P1---
MPI_Barrier
MPI_Isend( P2 )
P2---
MPI_Irecv ( * )
MPI_Barrier
P3---
MPI_Isend ( P5 )
MPI_Barrier
P4---
MPI_Barrier
MPI_Isend( P5 )
P5---
MPI_Irecv ( * )
MPI_Barrier
36
Why is all this worth doing ?
37
MPI is the de-facto standard for programming cluster machines
Our focus: Help Eliminate Concurrency Bugs from HPC Programs
Apply similar techniques for other APIs also (e.g. PThreads, OpenMP)
(BlueGene/L - Image courtesy of IBM / LLNL) (Image courtesy of Steve Parker, CSAFE, Utah)
38
The success of MPI (Courtesy of Al Geist, EuroPVM / MPI 2007)
39
The Need for Formal Semantics for MPI
– Send
– Receive
– Send / Receive
– Send / Receive / Replace
– Broadcast
– Barrier
– Reduce
– Rendezvous mode
– Blocking mode
– Non-blocking mode
– Reliance on system buffering
– User-attached buffering
– Restarts/Cancels of MPI Operations
– Non Wildcard receives
– Wildcard receives
– Tag matching
– Communication spaces
An MPI program is an interesting (and legal) combination of elements from these spaces
40
The core count rises but the number of pins on a socket is fixed. This accelerates the decrease in the bytes/flops ratio per socket.
The bandwidth to memory (per core) decreases
The bandwidth to interconnect (per core) decreases
The bandwidth to disk (per core) decreases
MPI Library Implementations Would Also Change
Multi-core – how it affects MPI (Courtesy, Al Geist)
Need Formal Semantics for MPI, because we can’t imitate any existing implementation…
41
Look for commonly committed mistakes automatically:
– Deadlocks
– Communication Races
– Resource Leaks
We are only after “low hanging” bugs…
42
Deadlock pattern…
P0          P1
---         ---
s(P1);      s(P0);
r(P1);      r(P0);

P0          P1
---         ---
Bcast;      Barrier;
Barrier;    Bcast;
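A self-contained sketch of the first (send/send) pattern, assuming two ranks and messages large enough that MPI_Send falls back to rendezvous mode instead of being buffered eagerly:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, peer;
    const int N = 1 << 20;    /* large enough to defeat eager buffering on most MPIs */
    double *sendbuf = malloc(N * sizeof(double));
    double *recvbuf = malloc(N * sizeof(double));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;          /* run with exactly 2 ranks */

    /* s(peer); r(peer); on both ranks: each MPI_Send waits for a receive that
       the other rank has not posted yet -> deadlock */
    MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}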
43
Communication Race Pattern…
P0          P1          P2
---         ---         ---
r(*);       s(P0);      s(P0);
r(P1);

OK: the wildcard receive matches P2's send, and r(P1) then matches P1's send.
NOK: the wildcard receive matches P1's send, leaving r(P1) with no matching send – P0 hangs.
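A runnable rendering of this r(*); r(P1) pattern (our construction, assuming 3 ranks); whether it hangs depends on which send the wildcard receive happens to match:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, x = 0, y = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                        /* r(*)  */
        MPI_Recv(&y, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                        /* r(P1) */
    } else if (rank == 1 || rank == 2) {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);  /* s(P0) */
    }

    MPI_Finalize();
    return 0;
}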
44
Resource Leak Pattern…
P0
---
some_allocation_op(&handle);
FORGOTTEN DEALLOC !!
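A concrete (assumed) instance of the pattern: a derived-datatype handle is committed but never freed.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Datatype row;                     /* the “handle” */

    MPI_Init(&argc, &argv);
    MPI_Type_contiguous(100, MPI_DOUBLE, &row);
    MPI_Type_commit(&row);
    /* ... use "row" in sends / receives ... */
    /* FORGOTTEN DEALLOC: MPI_Type_free(&row); */
    MPI_Finalize();
    return 0;
}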
45
Bugs are hidden within huge state-spaces…
46
Partial Order Reduction Illustrated…
With 3 processes, the interleaved state space in this example has 27 states
Partial-order reduction explores representative sequences from each equivalence class
Delays the execution of independent transitions
In this example, it is possible to “get away” with 7 states (one interleaving)
47
A Deadlock Example… (off by one deadlock)
// Add up integrals calculated by each process
if (my_rank == 0) {
    total = integral;
    for (source = 0; source < p; source++) {   /* off by one: the loop starts at 0,
                                                   so rank 0 waits for a message
                                                   from itself that is never sent */
        MPI_Recv(&integral, 1, MPI_FLOAT, source,
                 tag, MPI_COMM_WORLD, &status);
        total = total + integral;
    }
} else {
    MPI_Send(&integral, 1, MPI_FLOAT, dest,
             tag, MPI_COMM_WORLD);
}
(Figure: P1, P2, P3 each send to P0, while P0 posts receives from sources 0, 1, 2, … – the receive from source 0 can never be matched.)
48
Organization of ISP
(Diagram: the MPI program undergoes simplifications to produce a simplified MPI program, which is compiled into an executable running as Proc 1 … Proc n on the actual MPI library and runtime; each process's PMPI calls go through a scheduler via request/permit handshakes.)
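As with the wildcard-rewriting sketch earlier, the request/permit handshake can be pictured as a PMPI wrapper; scheduler_request_permit() is a hypothetical placeholder, not ISP's API:

#include <mpi.h>

/* Hypothetical: block until the centralized scheduler permits this operation. */
extern void scheduler_request_permit(const char *op, int peer);

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    scheduler_request_permit("MPI_Send", dest);               /* request ... permit */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);  /* issue the real call */
}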
49
Summary (have posters for each)
Formal Semantics for a large subset of MPI 2.0
– Executable semantics for about 150 MPI 2.0 functions
– User interactions through the VisualStudio API
Direct execution of user MPI programs to find issues
– Downscale code, remove data that does not affect control, etc.
– New Partial Order Reduction Algorithm
  » Explores only Relevant Interleavings
– User can insert barriers to contain complexity
  » New Vector-Clock algorithm determines if barriers are safe
– Errors detected
  » Deadlocks
  » Communication races
  » Resource leaks
Direct execution of PThread programs to find issues
– Adaptation of Dynamic Partial Order Reduction reduces interleavings
– Parallel implementation – scales linearly
50
Also built POR explorer for C / Pthreads programs, called “Inspect”
(Diagram: the multithreaded C/C++ program is instrumented, then compiled together with a thread-library wrapper into an executable; threads 1 … n go through a scheduler via request/permit handshakes.)
51
Dynamic POR is almost a “must” !
( Dynamic POR as in Flanagan and Godefroid, POPL 2005)
52
Why Dynamic POR ?
Thread 1: a[ j ]++        Thread 2: a[ k ]--
• Ample Set depends on whether j == k
• Can be very difficult to determine statically
• Can determine dynamically (see the sketch below)
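A tiny PThreads illustration (ours, not from the Inspect tool) of why the dependence is only known at run time – the two updates conflict exactly when j == k:

#include <pthread.h>
#include <stdio.h>

static int a[4];
static int j = 1, k = 1;    /* j == k here, so the two updates touch the same cell */

static void *inc(void *arg) { (void)arg; a[j]++; return NULL; }
static void *dec(void *arg) { (void)arg; a[k]--; return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, inc, NULL);
    pthread_create(&t2, NULL, dec, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* if j != k the threads are independent; if j == k their relative order (and
       the race) matters -- which is what dynamic POR discovers while running */
    printf("a[j] = %d\n", a[j]);
    return 0;
}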
53
Why Dynamic POR ?
The notion of action dependence (crucial to POR methods) is a function of the execution
54
Computation of “ample” sets in Static POR versus in DPOR
Static POR: the ample set is determined using “local” criteria in the current state.
DPOR: from the next move of the red process, look back to the nearest dependent transition and add the red process to the { BT } (“backtrack”) set; blue is already in the { Done } set.
This builds the ample set incrementally, based on observed dependencies.
55
Putting it all together …
We target C/C++ PThread programs
Instrument the given program (largely automated)
Run the concurrent program “till the end”
Record interleaving variants while advancing
When the number of recorded backtrack points reaches a soft limit, spill work to other nodes
In one larger example, an 11-hour run was finished in 11 minutes using 64 nodes
A heuristic to avoid recomputations was essential for the speed-up. First known distributed DPOR.
56
A Simple DPOR Example
{ BT } = {}, { Done } = {}
t0: lock(t); unlock(t)
t1: lock(t); unlock(t)
t2: lock(t); unlock(t)
57
For this example, DPOR ends up exploring all the paths; for other examples, it explores only a proper subset.
58
Idea for parallelization: Explore computations from the backtrack set in other processes.
“Embarrassingly Parallel” – it seems so, anyway !
59
We then devised a work-distribution scheme…
(Diagram: worker a, worker b, and a load balancer; messages exchanged: “request unloading”, “idle node id”, “work description”, “report result”.)
60
Speedup on aget
61
Speedup on bbuf
62
Historical Note
Model Checking – Proposed in 1981
– 2007 ACM Turing Award for Clarke, Emerson, and Sifakis
Bug discovery facilitated by
– The creation of simplified models
– Exhaustively checking the models
  » Exploring only relevant interleavings
63
Looking ahead…
Plans for one year out…
64
Finish tool implementation for MPI and others…
Static analysis to reduce some cost
Inserting barriers (to contain cost) using the new vector-clocking algorithm for MPI
Demonstrate on meaningful apps (e.g. ParMETIS)
Plug into MS VisualStudio
Development of the PThread (“Inspect”) tool with the same capabilities
Evolving these tools to Transactional Memory, Microsoft TPL, OpenMP, …
65
Thanks, Microsoft! – and Dennis Crain, Shahrokh Mortazavi
In these times of unpredictable NSF funding, the HPC Institute Program made it possible for us to produce a great cadre of Formal Verification Engineers:
Robert Palmer (PhD – to join Microsoft soon), Sonjong Hwang (MS), Steve Barrus (BS), Salman Pervez (MS)
Yu Yang (PhD), Sarvani Vakkalanka (PhD), Guodong Li (PhD), Subodh Sharma (PhD), Anh Vo (PhD), Michael DeLisi (BS/MS), Geof Sawaya (BS)
(http://www.cs.utah.edu/formal_verification)
Microsoft HPC Institutes
NSF CNS 0509379
66
Extra Slides
67
Looking Further Ahead: Need to clear the “idea log-jam” in multi-core computing…
“There isn’t such a thing as Republican clean air or Democratic clean air. We all breathe the same air.”
There isn’t such a thing as an architectural-only solution, or a compilers-only solution to future problems in multi-core computing…
68
Now you see it; Now you don’t !
On the menace of non-reproducible bugs.
Deterministic replay must ideally be an option
User-programmable schedulers are greatly emphasized by expert developers
Runtime model-checking methods with state-space reduction hold promise in meshing with current practice…