1
EFFICIENT DYNAMIC VERIFICATION ALGORITHMS FOR MPI APPLICATIONS
Dissertation Defense
Sarvani Vakkalanka
Committee: Prof. Ganesh Gopalakrishnan (advisor), Prof. Mike Kirby (co-advisor), Prof. Suresh Venkatasubramanian, Prof. Matthew Might, Prof. Stephen Siegel (Univ. of Delaware)
2
Necessity for Verification
• Software testing is ad hoc.
• Software errors are expensive – $59.5 Billion/yr (2001 NIST study).
• Software written today is complex and uses many existing libraries.
• Our focus – contribute to parallel scientific software written using MPI.
3
Motivation
• Concurrent software debugging is hard!
• Very little formal support for Message Passing concurrency.
• Active testing (schedule enforcement) is important.
• Reducing redundant (equivalent) verification runs is crucial.
• Verification for portability – another important requirement.
4
Approaches to Verification
• Testing methods suffer from bug omissions.
• Static analysis based methods generate many false alarms.
• Model based verification is tedious.
• Dynamic verification – no false alarms.
5
Contributions
• New dynamic verification algorithms for MPI.
• New Happens-Before models for Message Passing concurrency.
• Verification to handle resource dependency.
• MPI dynamic verification tool ISP that handles non-trivial codes for safety properties.
6
Agenda
• Intro to Dynamic Verification
• Intro to MPI
– Four MPI Operations (S, R, W, B).
– MPI Ordering Guarantees.
– Applying DPOR to MPI.
• Dynamic verification algorithms avoiding redundant searches and handling resource dependencies
• Formal MPI Transition System
• Experimental Results
• Conclusions
7
EFFICIENT DYNAMIC VERIFICATION
Code written using mature libraries (MPI, OpenMP, PThreads, …)
API calls made from real programming languages (C, Fortran, C++)
Runtime semantics determined by realistic compilers and runtimes
Dynamic verification abstracts these details.
(Static analysis and model based verification can play important supportive roles.)
8
Growing Importance of Dynamic Verification
Exponential number of TOTAL Interleavings – most are EQUIVALENT – generate only RELEVANT ones !!
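The explosion is easy to quantify: p processes, each executing n atomic steps, admit (p·n)!/(n!)^p distinct interleavings. A quick sketch (the process and step counts below are illustrative, not taken from the slide):

```python
from math import factorial

def total_interleavings(procs: int, steps: int) -> int:
    """Number of distinct interleavings of `procs` processes,
    each executing `steps` atomic actions: (procs*steps)! / (steps!)^procs."""
    return factorial(procs * steps) // factorial(steps) ** procs

print(total_interleavings(2, 2))   # 2 processes x 2 steps: 6 interleavings
print(total_interleavings(5, 5))   # 5 processes x 5 steps: well over 10 billion
```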
9
[Diagram: five processes P0–P4; TOTAL > 10 Billion Interleavings !! The actions shown include a++, b--, and the shared-global writes g=2 and g=3.]
Dynamic Partial Order Reduction
10
[Diagram, continued: among > 10 billion total interleavings, the writes g=2 and g=3 are the only dependent actions – only these 2 interleavings are RELEVANT; all other actions are pairwise independent.]
DPOR
• A state σ consists of the following sets:
– enabled(σ)
– backtrack(σ): a sufficient subset of enabled(σ).
– If backtrack(σ) = enabled(σ), then the full state space is explored.
• Co-enabledness of transitions.
• Dependence among transitions.
11
Co-enabledness & Dependence
12
[Diagram: from a state, dependent transitions t1 and t2; exploring t1 adds {t2} to the backtrack set and exploring t2 adds {t1}, yielding the backtrack set {t1, t2}.]
DPOR Concepts
• DPOR requires the identification of dependence and co-enabledness among transitions.
• Identifying dependence is simple:
– Two lock accesses on the same mutex.
– Two writes to the same global variable.
– Similar concepts for MPI.
• Identifying co-enabledness is difficult (i.e., can the two transitions ever be simultaneously enabled?).
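The "identifying dependence is simple" bullets can be sketched directly. The tuple encoding (pid, op, object) below is a hypothetical representation for illustration, not ISP's:

```python
# Toy dependence check in the spirit of DPOR: transitions are
# (pid, op, obj) tuples; `op` is one of "lock", "unlock", "read", "write".
def dependent(t1, t2) -> bool:
    p1, op1, o1 = t1
    p2, op2, o2 = t2
    if p1 == p2 or o1 != o2:               # same process, or different objects
        return False
    if {op1, op2} <= {"lock", "unlock"}:   # two accesses to the same mutex
        return True
    return "write" in (op1, op2)           # conflicting accesses to a shared var

assert dependent(("P1", "lock", "l"), ("P2", "lock", "l"))
assert dependent(("P1", "write", "x"), ("P2", "write", "x"))
assert not dependent(("P1", "write", "x"), ("P2", "write", "y"))
assert not dependent(("P1", "read", "x"), ("P2", "read", "x"))
```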
13
14
P1:          P2:
  lock(l)      lock(l)
  x = 1        y = 1
  x = 2        x = 2
  unlock(l)    unlock(l)
Illustration of DPOR Concepts
15
Illustration of DPOR Concepts
16
Thread Verification vs MPI Verification
• Thread verification – well studied!
– Well known dynamic verification tools for thread verification [CHESS, INSPECT].
– Thread verification follows traditional dynamic partial order reduction (DPOR); DPOR does not extend directly to MPI.
• MPI verification – not so!
– Requires a formal definition.
– Out-of-order completion semantics.
– Must define dependence.
INTRODUCTION TO MPI
17
18
IBM Blue Gene(Picture Courtesy IBM)
LANL’s Petascale machine“Roadrunner”(AMD Opteron CPUs and IBM PowerX Cell)
• The choice for ALL large-scale parallel simulations (earthquake, weather, …).
• Runs “everywhere”.
• Very mature codes exist in MPI – tens of person years.
• Performs critical simulations in science and engineering.
The Ubiquity of MPI
19
Overview of Message Passing Interface (MPI) API
• One of the major Standardization Successes.
• Lingua franca of Parallel Computing
• Runs on parallel machines of a WIDE range of sizes
• Standard is published at www.mpi-forum.org
• MPI 2.0 includes over 300 functions
20
MPI Execution Environment
• The MPI execution environment consists of two main components:
– MPI processes.
– The MPI runtime daemon.
• All processes are statically created.
• Process ranks are between 0 and n-1.
• The MPI processes issue instructions to the MPI runtime.
• The MPI runtime implements and executes the MPI library.
21
MPI Execution Contd…
• Every process starts execution with MPI_Init(int *argc, char ***argv);
• MPI_Finalize – at the end.
22
• Abbreviated as S
MPI_Isend (void *buff, …, int dest, int tag, MPI_Comm comm, MPI_Request *handle);
23
• Abbreviated as R
MPI_Irecv (void *buff, …, int src, int tag, MPI_Comm comm, MPI_Request *handle);
24
• Abbreviated as W
MPI_Wait (MPI_Request *handle, MPI_Status *status);
25
• Abbreviated as B.
• All processes must invoke B before any can get past it.
MPI_Barrier (MPI_Comm comm);
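The barrier semantics stated above can be simulated with plain Python threads; threading.Barrier here is a stand-in for MPI_Barrier, not MPI itself:

```python
# Sketch of B's semantics: no process observes "after" until every
# process has invoked the barrier.
import threading

N = 4
barrier = threading.Barrier(N)
events = []
lock = threading.Lock()

def process(rank: int):
    with lock:
        events.append(("before", rank))
    barrier.wait()                  # B: all must arrive before any proceeds
    with lock:
        events.append(("after", rank))

threads = [threading.Thread(target=process, args=(r,)) for r in range(N)]
for t in threads: t.start()
for t in threads: t.join()

# every "before" entry precedes every "after" entry
first_after = min(i for i, (tag, _) in enumerate(events) if tag == "after")
assert all(tag == "before" for tag, _ in events[:first_after])
assert first_after == N
```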
26
MPI Ordering Guarantees
(Slides 26–28: the diagrams illustrating the ordering guarantees were not transcribed.)
29
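The diagrams for these slides were not transcribed. The central guarantee they illustrate is MPI's non-overtaking rule: two sends from the same source to the same destination (on the same communicator and tag) match receives in the order they were issued. A minimal queue model of this rule (a hypothetical sketch, not MPI code):

```python
from collections import defaultdict, deque

class MatchEngine:
    """Pending sends are kept FIFO per (src, dst) channel, so an earlier
    send can never be overtaken by a later one on the same channel."""
    def __init__(self):
        self.pending = defaultdict(deque)

    def send(self, src, dst, payload):
        self.pending[(src, dst)].append(payload)

    def recv(self, dst, src):
        # the receive matches the OLDEST unmatched send on the channel
        return self.pending[(src, dst)].popleft()

m = MatchEngine()
m.send(0, 1, "first")
m.send(0, 1, "second")
assert m.recv(1, 0) == "first"     # matched in issue order
assert m.recv(1, 0) == "second"
```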
Applying DPOR to MPI
Programs like this – almost impossible to test on real platforms.
30
Why DPOR does not work!
31
Modifying Runtime Doesn’t Help!
• Assume that the MPI runtime is modified to support verification
• The sends are matched with receives in the order they are issued to the MPI runtime
• Is this sufficient?
32
Crooked Barrier Example
P0:              P1:                P2:
  Isend(1, req)    Irecv(*, req)      Barrier
  Barrier          Barrier            Isend(1, req)
  Wait(req)        Irecv(2, req1)     Wait(req)
                   Wait(req)
                   Wait(req1)
(Column alignment reconstructed from the transcript.)
Verification Support does not work!
33
Our Main Algorithms
• Partial Order avoiding Elusive Interleavings (POE).
• POEOPT : Reduced interleavings even further.
• POEMSE: Handle resource dependencies.
34
Illustration of POE (slides 34–37; scheduler animation frames condensed)

Example program:
P0:              P1:                P2:
  Isend(1, req)    Irecv(*, req)      Barrier
  Barrier          Barrier            Isend(1, req)
  Wait(req)        Recv(2)            Wait(req)
                   Wait(req)

[Diagram frames: the ISP scheduler intercepts each call before it reaches the MPI runtime, issues sendNext to let processes advance, and collects Isend(1), Irecv(*), and the Barriers, tracking IntraCB edges among each process's calls. Wildcard matches are decided only at such fence points: in one interleaving Irecv(*) is dynamically rewritten to match P0's Isend; in the replay it is rewritten to Irecv(2), matching P2's Isend, after which P1's Recv(2) has No Matching Send – Deadlock!]
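The nondeterminism POE replays in this example can be enumerated by hand: P1's wildcard receive may match the Isend from P0 or the one from P2, and only the second choice strands the later Recv(2). A toy check (process names follow the slide; the encoding is illustrative):

```python
def deadlocks_if_wildcard_matches(sender: str) -> bool:
    """P0 and P2 each issue one Isend to P1; P1 posts Irecv(*) then Recv(2).
    Returns True iff choosing `sender` for Irecv(*) leads to deadlock."""
    available = {"P0", "P2"}
    available.remove(sender)           # Irecv(*) consumes this send
    return "P2" not in available       # Recv(2) still needs a send from P2

assert deadlocks_if_wildcard_matches("P0") is False   # benign interleaving
assert deadlocks_if_wildcard_matches("P2") is True    # POE's replay finds the deadlock
```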
38
Notations
• MPI_Isend: Si,j(k), where
– i is the process issuing the send,
– j is the dynamic execution count of S in process i, and
– k is the destination process rank where the message is to be sent.
• MPI_Irecv: Ri,j(k), where k is the source.
• MPI_Barrier: Bi,j.
• MPI_Wait: Wi,j’(hi,j), where hi,j is the request handle of Si,j(k) or Ri,j(k).
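The notation above can be rendered as a small data type; this encoding is illustrative only, not the dissertation's formalism:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str          # "S", "R", "B", or "W"
    proc: int          # i: the issuing process
    index: int         # j: dynamic execution count within process i
    arg: object = None # dest/src rank, or the handle a Wait waits on

# Constructors mirroring the slide's notation
S = lambda i, j, k: Event("S", i, j, k)
R = lambda i, j, k: Event("R", i, j, k)
B = lambda i, j:    Event("B", i, j)
W = lambda i, j, h: Event("W", i, j, h)

s01 = S(0, 1, 1)       # S0,1(1): P0's first send, destination rank 1
w02 = W(0, 2, s01)     # W0,2(h0,1): waits on that send's handle
assert w02.arg is s01
```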
POE Issue: Redundancy
39
POE explores both match-sets, resulting in 2 interleavings, while just 1 interleaving is sufficient.
SOLUTION: Explore only one of the match-sets for the single wildcard receive.
DOES NOT WORK! BREAKS PERSISTENCE.
POE and Persistent Sets
Add only this match-set to backtrack
40
Maintaining persistent backtrack sets is important.
Otherwise, the verification algorithm is broken.
POE Issue: Buffering Deadlocks
When no sends are buffered
Deadlock!
41
POE Issue: Redundancy
Simple optimization: if there are no more sends targeting a wildcard receive, then add only one of the match-sets to the backtrack set.
42
Redundancy : POEOPT
P0:           P1:            P2:           P3:
  S0,1(1)       R1,1(*)        R2,1(*)       S3,1(2)
  W0,2(h0,1)    W1,2(h1,1)     W2,2(h2,1)    W3,2(h3,1)
                S1,3(3)                      R3,3(1)
                W1,4(h1,3)                   W3,4(h3,3)
                R1,5(*)                      S3,5(1)
                W1,6(h1,5)                   W3,6(h3,5)
(Per-process event listing reconstructed from the transcript; the match-set arrows are not recoverable.)
43
Detecting Matching
• Exploring all non-deterministic matchings in a state is not a solution
• The IntraHB relation is not sufficient to detect matchings across processes
• We introduce the notion of Inter-HB
44
InterHB Relation
45
Redundancy : POEOPT
P0:           P1:            P2:           P3:
  S0,1(1)       R1,1(*)        R2,1(*)       S3,1(2)
  W0,2(h0,1)    W1,2(h1,1)     W2,2(h2,1)    W3,2(h3,1)
                S1,3(3)                      R3,3(1)
                W1,4(h1,3)                   W3,4(h3,3)
                R1,5(*)                      S3,5(1)
                W1,6(h1,5)                   W3,6(h3,5)
(Same program; InterHB edges shown in the original diagram are not recoverable.)
46
Redundancy : POEOPT
P0:           P1:            P2:           P3:           P4:           P5:
  S0,1(1)       R1,1(*)        R2,1(*)       S3,1(2)       R4,1(*)       S5,1(1)
  W0,2(h0,1)    W1,2(h1,1)     W2,2(h2,1)    W3,2(h3,1)    W4,2(h4,1)    W5,2(h5,1)
                R1,3(3)                      S3,3(1)
                W1,4(h1,3)                   W3,4(h3,3)
NO PATH
47
Slack/Buffering Deadlocks
Deadlocks only when S0,1 or S1,1 or both are buffered
48
Buffer All Sends ??? ZERO SLACK
49
Buffer All Sends ??? ZERO SLACK
50
Buffer All Sends ??? INF SLACK
51
Buffer All Sends ??? INF SLACK
52
Buffer All Sends ??? ONLY S0,0
Deadlock!
53
Slack/Buffering : POEMSE
P0:           P1:            P2:
  S0,1(1)       S1,1(2)        R2,1(*)
  W0,2(h0,1)    W1,2(h1,1)     W2,2(h2,1)
  S0,3(2)       R1,3(0)        R2,3(0)
  W0,4(h0,3)    W1,4(h1,3)     W2,4(h2,3)
(Per-process event listing reconstructed from the transcript.)
54
Slack/Buffering : POEMSE
(Same program as the previous slide; this animation frame marks one of the culprit waits as a No-op.)
55
POEMSE
• Finds all paths between a wildcard receive and a matching send.
• If there is a path without a culprit wait in it, then it does nothing.
• If every path contains at least one culprit wait, then the algorithm finds all ways to break the paths by trying to select exactly one wait in every path.
– We call this finding minimal wait sets.
– NP-Complete problem (proved by reduction from 1-in-3 SAT).
– Finding all minimal wait sets is #P-Complete.
56
Minimal Wait Sets
• Find the power-set of all the culprit waits.
• Sort the power-set by size in non-decreasing order.
• For each subset, if it breaks all paths:
– Delete all its supersets from the power-set.
• If it does not break all the paths:
– Delete it from the power-set.
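The procedure above is a minimal-hitting-set computation over the culprit waits. A direct sketch, assuming each path is given as the set of culprit waits on it (the example inputs are invented):

```python
from itertools import combinations

def minimal_wait_sets(culprits, paths):
    """Return all minimal subsets of `culprits` that intersect (break)
    every path, following the slide's sort-and-prune procedure."""
    subsets = [frozenset(c)
               for r in range(1, len(culprits) + 1)
               for c in combinations(sorted(culprits), r)]  # non-decreasing size
    minimal, dominated = [], set()
    for s in subsets:
        if s in dominated:
            continue
        if all(s & path for path in paths):                 # s breaks every path
            minimal.append(s)
            dominated.update(t for t in subsets if s < t)   # delete its supersets
        # subsets that miss some path are simply skipped
    return minimal

result = minimal_wait_sets({"w1", "w2", "w3"},
                           [{"w1", "w2"}, {"w2", "w3"}])
assert set(result) == {frozenset({"w2"}), frozenset({"w1", "w3"})}
```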
57
Slack/Buffering : POEMSE
(Same program as the earlier POEMSE slide; another exploration frame from the original diagram.)
58
59
MPI TRANSITION SYSTEM
MPI State and Transitions
• An MPI function is in one or more of the following states:
– Issued (I)
– Matched (M)
– Complete (C)
– Returned (R)
• A global state is <I, M, C, R, pc> – initial state is
• Two kinds of transitions:
– Process transitions.
– MPI runtime transitions.
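The bookkeeping can be sketched as four growing sets with the obvious ordering constraints. This is an illustrative model, not ISP's implementation; note that a nonblocking operation may be Returned before it is Complete:

```python
# Issued/Matched/Complete/Returned sets: an operation may appear in several
# sets at once, but it can only be Matched after it is Issued, and only
# Complete after it is Matched.
class MPIState:
    def __init__(self):
        self.I, self.M, self.C, self.R = set(), set(), set(), set()

    def issue(self, op):
        self.I.add(op)

    def match(self, op):
        assert op in self.I          # must be issued first
        self.M.add(op)

    def complete(self, op):
        assert op in self.M          # must be matched first
        self.C.add(op)

    def ret(self, op):
        assert op in self.I          # a nonblocking op may return before completing
        self.R.add(op)

st = MPIState()
st.issue("S0,1"); st.ret("S0,1")     # Isend returns immediately
st.match("S0,1"); st.complete("S0,1")
assert "S0,1" in st.I & st.M & st.C & st.R
```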
60
Process Transitions
61
MPI Runtime Book-keeping Sets
• No ancestor of x is in the Ready set.
• All ancestors are matched.
• All ancestors are complete.
62
63
IntraHB - Crooked Barrier Example
P0:              P1:                P2:
  Isend(1, req)    Irecv(*, req)      Barrier
  Barrier          Barrier            Isend(1, req)
  Wait(req)        Irecv(2, req1)     Wait(req)
                   Wait(req)
                   Wait(req1)
(Column alignment reconstructed from the transcript; the IntraHB edges of the original diagram are not recoverable.)
MPI Runtime Transitions
Zero Buffering
64
MPI Runtime Transitions Contd..
65
MPI Runtime Transition
Conditional Happens-Before
Dynamic source re-write
66
Simple MPI Example
67
(Slides 68–70: the example’s diagrams were not transcribed.)
Dependence and Independence Properties
MPI Dependence
71
RESULTS – Real Benchmarks
• Game of Life
– EuroPVM / MPI 2007 versions of Gropp and Lusk done in seconds.
• MADRE – Siegel’s Memory-Aware Data Redistribution Engine
– Found previously documented deadlock.
• ParMETIS – Hypergraph Partitioner
– Initial run of days reduced now to seconds on a laptop.
• MPI-BLAST – Genome sequencer using BLAST
– Runs to completion on small instances.
• A few MPI Spec Benchmarks
– Some benchmarks exhibit interleaving explosion; others OK.
• ADLB
– Initial experiments have been successful.
72
Results
• Resource leak caught in ParMETIS.
• ISP used in the development cycle of an A* algorithm:
– Found 3 deadlocks during various implementation phases.
– All deadlocks were unintentional (not seeded).
73
Results Contd…

Umpire Program             POE                                  Marmot
any_src-can-deadlock7.c    Deadlock detected, 2 interleavings   Deadlock caught in 5/10 runs
any_src-can-deadlock10.c   Deadlock detected, 1 interleaving    Deadlock caught in 7/10 runs
basic-deadlock10.c         Deadlock detected in 1 interleaving  Deadlock caught in 10/10 runs
74
75
POEOPT (results chart not transcribed)
76
POEMSE (results chart not transcribed)
77
How well did we do?
• Verisoft Project
– Used for telephone switch software verification at Bell Labs.
– Available.
• The Java PathFinder Project
– Developed at NASA for Java control software.
– On SourceForge.
• The CHESS Project
– Microsoft Research; available for academic institutions.
– In use within Microsoft product groups and used by academics.
• Inspect: UV group’s unique Pthread verifier.
– Available for download.
• ISP: dynamic verification tool for MPI.
– Implements the dynamic verification algorithms from this dissertation.
– Available for download with the PTP (Parallel Tools Platform).
CONCLUSIONS
• First efficient and practical dynamic reduction based algorithms for real MPI programs.
• Verification for portability with respect to buffering.
• First Happens-Before model for MPI.
• ISP scheduler directly based on our theory.
• ISP + GEM released and demoed widely.
78
Questions & Answers
79
THANK YOU
POE ALGORITHM
80
POE Proof of Correctness
81
POE Illustration
82
83
Dynamic Verification
• No modeling effort for the programmer.
• Program is the model – the actual program is verified.
• Push-button interface: easy to use.
• On the downside – verification is a function of input.
– Most programs are fairly data independent.
Dynamic verification methods are ideal for programmers!