Parallel and Distributed Systems Laboratory
Paradise: A Toolkit for Building Reliable Concurrent Systems
On Building Reliable Concurrent Systems
Vijay K. Garg
Professor, Department of ECE and CS
Director, PDSL
The University of Texas at Austin
Austin, TX 78712
2
Motivation: Reliable Software
Multithreaded distributed programs are prone to errors.
– Concurrency, nondeterminism, process and channel failures
Techniques to ensure program correctness
Before program development: Model Checking
During: Testing and Debugging
After: Software Fault-Tolerance
3
Paradise Approach
Key Abstraction: Global Properties
Model Checking and Verification: Check global properties against model of the program (Promela)
Testing and Debugging: Global breakpoints, Trace analysis
Software Fault-Tolerance: Monitoring for global properties, Controlled Reexecution
4
Talk Outline
Motivation
Monitoring Distributed Systems
– Clock : Tracking Dependency
– Camera : Global Snapshot (Checkpoint)
– Sensor : Detecting Global Properties
– Slicer : Computation Slicing
– Supervisor: Controlling Execution
Other Projects at PDSL
5
Paradise Environment
[Figure: diagram with the labels Program, Monitor, Slicer, Predicate, Observe and Control]
6
Trace Model: Total Order vs Partial Order
Total order: interleaving of events in a trace
Partial order: Lamport’s happened-before model
[Figure: a partial order trace of P1 (events e1, e2, critical section CS1) and P2 (events f1, f2, critical section CS2), shown next to two of its interleavings, a successful trace and a faulty trace, with states labeled ¬CS1 and ¬CS2]
Specification: CS1 Λ CS2
7
Tracking Dependency
computation: a set of events ordered by “happened before” relation
Problem: Timestamp events to answer
– e happened before f ?
– e concurrent with f ?
8
Clocks in a Distributed System
Result: s happened before t iff the vector at s is less than the vector at t.
Vector Clocks [Fidge 89, Mattern 89]
P1: (1,0,0) (2,1,0) (3,1,0)
P2: (0,1,0) (0,2,0)
P3: (0,0,1) (0,0,2) (2,1,3)
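The timestamps on this slide can be reproduced with the usual vector clock rules. The following is a minimal sketch, not code from the talk: the function names (`tick`, `receive`, `happened_before`, `concurrent`) are illustrative, processes are assumed to be indexed from 0, and a receive is assumed to take the component-wise maximum before incrementing the receiver's own entry.

```python
from typing import Tuple

Vector = Tuple[int, ...]


def tick(v: Vector, pid: int) -> Vector:
    """Advance the local component of process `pid` for a new event."""
    return v[:pid] + (v[pid] + 1,) + v[pid + 1:]


def receive(local: Vector, msg: Vector, pid: int) -> Vector:
    """Merge the timestamp carried by a message, then count the receive event."""
    merged = tuple(max(a, b) for a, b in zip(local, msg))
    return tick(merged, pid)


def happened_before(s: Vector, t: Vector) -> bool:
    """s -> t iff s is component-wise <= t and the two vectors differ."""
    return s != t and all(a <= b for a, b in zip(s, t))


def concurrent(s: Vector, t: Vector) -> bool:
    """Two distinct events are concurrent when neither happened before the other."""
    return s != t and not happened_before(s, t) and not happened_before(t, s)


# P1 at (1,0,0) receives a message stamped (0,1,0) from P2.
p1_after_receive = receive((1, 0, 0), (0, 1, 0), pid=0)
# P3 at (0,0,2) receives a message stamped (2,1,0) from P1.
p3_after_receive = receive((0, 0, 2), (2, 1, 0), pid=2)
```

Under these assumptions the two receives yield (2,1,0) and (2,1,3), the values shown for P1 and P3, and (0,2,0) on P2 is concurrent with (3,1,0) on P1.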
9
Dynamic Chain Clocks
Problem with vector clocks: scalability, dynamic process structure
Idea: Computing the “chains” in an online fashion [Aggarwal and Garg PODC 05] for relevant events
[Figure: a computation with 4 processes (P1 to P4) containing the relevant events a to h, and the relevant subcomputation drawn as two rows, a b c d and e f g h]
10
Experimental Results
Simulation of a computation with 1% relevant events
Measured
– number of components vs number of threads
– total time overhead vs number of threads
11
Talk Outline
Motivation and Overview
Instrumentation
– Clock : Tracking Dependency
Property Checking
– Camera : Global Snapshot (Checkpoint)
– Sensor : Detecting Global Properties
– Slicer : Computation Slicing
12
Global Snapshot
Problem: Compute a global snapshot/checkpoint of the system (state of the processes and the channels)
Motivation
– Checkpointing for fault tolerance
– Distributed debugging
– Detecting stable predicates
13
Key Difficulties: Taking care of messages
Two sites A and B with $400 each. Site A sends a message with $100 to site B.
Problem 1: Inconsistent state
– checkpoint (A) before message sent: $400
– message received before checkpoint (B): $500
Problem 2: Messages in transit
– message sent before checkpoint (A): $300
– checkpoint (B) before message received: $400
[Figure: processes P1, P2 and P3 with two global states G1 and G2]
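The arithmetic behind the two problems can be worked through directly. This is only an illustrative sketch of the bank example; the helper names `balance_a` and `balance_b` are made up for it.

```python
INITIAL = {"A": 400, "B": 400}
TRANSFER = 100  # the $100 message from A to B


def balance_a(sent: bool) -> int:
    """Balance at site A before or after it sends the transfer."""
    return INITIAL["A"] - (TRANSFER if sent else 0)


def balance_b(received: bool) -> int:
    """Balance at site B before or after it receives the transfer."""
    return INITIAL["B"] + (TRANSFER if received else 0)


true_total = sum(INITIAL.values())

# Problem 1: A is checkpointed before the send, B after the receive.
# The $100 is counted twice.
problem1_total = balance_a(sent=False) + balance_b(received=True)

# Problem 2: A is checkpointed after the send, B before the receive.
# The $100 is in the channel and is counted nowhere.
problem2_total = balance_a(sent=True) + balance_b(received=False)

# Recording the channel state as part of the checkpoint repairs Problem 2.
problem2_with_channel = problem2_total + TRANSFER
```

The totals come out as 900 for Problem 1 and 700 for Problem 2, against a true total of 800; adding the in-transit message to the second checkpoint restores 800.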
14
Current Algorithms
Key idea: white/red processes and messages
A process must be red to act on a red message
Record white messages received by red processes
[Chandy and Lamport 85] A "marker" message on every channel
[Mattern 89, SBF+ 04] Number of white messages sent on every channel
O(N^2) messages for completely connected topology
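A marker-based snapshot in the style of [Chandy and Lamport 85] can be sketched for the two-site bank example. The sketch assumes FIFO channels modeled as deques and exactly two processes; the `Process` class, the `run_snapshot` scenario and the extra $50 transfer from B to A are all hypothetical additions for illustration. A process is "white" while `recorded` is `None` and "red" afterwards.

```python
from collections import deque

MARKER = "marker"


class Process:
    def __init__(self, balance):
        self.balance = balance
        self.recorded = None    # recorded local state; None while white
        self.channel_log = {}   # incoming channel -> white messages logged
        self.closed = set()     # incoming channels whose marker has arrived

    def record(self, incoming, outgoing):
        """Turn red: record local state and send a marker on every channel."""
        self.recorded = self.balance
        self.channel_log = {src: [] for src in incoming}
        for channel in outgoing:
            channel.append(MARKER)

    def receive(self, src, msg, incoming, outgoing):
        if msg == MARKER:
            if self.recorded is None:
                self.record(incoming, outgoing)  # must be red to act on it
            self.closed.add(src)
            return
        self.balance += msg
        # A red process logs white messages that arrive before the marker.
        if self.recorded is not None and src not in self.closed:
            self.channel_log[src].append(msg)


def run_snapshot():
    a, b = Process(400), Process(400)
    ab, ba = deque(), deque()            # FIFO channels A->B and B->A
    a.balance -= 100; ab.append(100)     # A sends $100 to B
    a.record(incoming=["B"], outgoing=[ab])  # A initiates the snapshot
    b.balance -= 50; ba.append(50)       # B, still white, sends $50 to A
    while ab or ba:
        if ab:
            b.receive("A", ab.popleft(), incoming=["A"], outgoing=[ba])
        if ba:
            a.receive("B", ba.popleft(), incoming=["B"], outgoing=[ab])
    in_transit = sum(a.channel_log["B"]) + sum(b.channel_log["A"])
    return a.recorded, b.recorded, in_transit


state_a, state_b, in_transit = run_snapshot()
```

In this run A records $300, B records $450, and the $50 sent by B while white is logged as channel state, so the snapshot adds up to the true total of $800.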
15
Our Algorithms
Joint work with Rahul Garg and Yogish Sabharwal, IBM IRL (ACM International Conference on Supercomputing 2006)
– Lower Bound: Ω(N log w) messages
– Implementation: Blue Gene/L with MPI.
– w: average number of messages in transit
Algorithm     Number of messages    Message size
Grid Based    O(N^(3/2))            O(N^(1/2) log w)
Tree Based    O(N log N log w)      O(log w + log N)
Centralized   O(N log w)            O(log w + log N)
16
Global Property Detection
Predicate: A global condition expressed using variables on processes
– e.g., more than one process is in critical section, there is no token in the system
Problem: find a global state that satisfies the given predicate
[Figure: processes P1 and P2 with their critical sections marked, and global states G1 and G2]
17
The Main Difficulty in Partial Order
Algorithm for general predicate [Cooper and Marzullo 91]
Too many global states: a computation may contain as many as O(k^n) global states
• k: maximum number of events on a process
• n: number of processes
[Figure: a computation with events e1, e2 on P1 and f1, f2 on P2, and its lattice of global states:
{┴}
{e1, ┴} {f1, ┴}
{e2, e1, ┴} {e1, f1, ┴}
{e2, e1, f1, ┴} {e1, f2, f1, ┴}
{e2, e1, f2, f1, ┴}]
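The lattice of global states can be enumerated by brute force for a computation this small. The sketch below is an illustration only. The dependency e1 → f2 is an assumption inferred from the lattice, which contains no state with f2 but without e1; the original figure's arrows are not available here. A cut is written as a tuple of per-process event counts.

```python
from itertools import product

# Events of each process in local order.
PROCESSES = {"P1": ["e1", "e2"], "P2": ["f1", "f2"]}
# Cross-process dependencies (sender event, receiver event). ASSUMED: e1 -> f2.
MESSAGES = [("e1", "f2")]


def consistent_cuts(processes, messages):
    """Return every consistent cut as a tuple of per-process event counts."""
    names = list(processes)
    cuts = []
    for counts in product(*(range(len(processes[p]) + 1) for p in names)):
        included = set()
        for p, c in zip(names, counts):
            included.update(processes[p][:c])
        # Consistent: a received message must also have been sent.
        if all(send in included for send, recv in messages if recv in included):
            cuts.append(counts)
    return cuts


cuts = consistent_cuts(PROCESSES, MESSAGES)
unconstrained = consistent_cuts(PROCESSES, [])
```

With the assumed dependency the enumeration yields 8 cuts, matching the 8 states of the lattice; without any dependency all (k+1)^n = 9 combinations are consistent, which is the source of the exponential blow-up.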
18
Efficient Predicate Detection for Special Cases
stable predicate: [Chandy and Lamport 85]
• once the predicate becomes true, it stays true, e.g., deadlock
unstable predicate:
• observer independent predicate [Charron-Bost et al 95]: occurs in one interleaving implies occurs in all interleavings, e.g., any disjunction of local predicates
• linear predicate [Chase and Garg 95], e.g., conjunctive predicates such as there is no leader in the system
• relational predicate: x1 + x2 + … + xn ≥ k [Chase and Garg 95], e.g., violation of k-mutual exclusion
19
Linear Predicates
The set of consistent cuts that satisfy a linear predicate is closed under intersection [Chase and Garg 95].
Examples:
– conjunctive predicates: critical1 and critical2
– channel predicates: “all channels are empty”, “there are exactly k messages in the channel from process P to Q”
– some relational predicates: x1 + x2 ≤ k, when xi is monotonically non-decreasing
[Figure: consistent cuts W and X and their intersection W ∩ X]
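Closure under intersection can be checked by brute force on a toy computation. The sketch below is an assumption-laden illustration: two processes with no messages (so every cut is consistent), made-up non-decreasing values for x1 and x2, and a cut written as a tuple of per-process event counts whose intersection is the component-wise minimum.

```python
from itertools import product

# Value of xi after 0, 1, 2 events on process i (monotonically non-decreasing).
X1 = [0, 1, 3]
X2 = [0, 2, 2]
CUTS = list(product(range(3), range(3)))  # no messages: all cuts consistent


def intersection(g, h):
    """Intersection of two cuts given as per-process event counts."""
    return tuple(min(a, b) for a, b in zip(g, h))


def closed_under_intersection(cuts, predicate):
    satisfying = [c for c in cuts if predicate(c)]
    return all(predicate(intersection(g, h))
               for g in satisfying for h in satisfying)


def at_most_3(cut):
    return X1[cut[0]] + X2[cut[1]] <= 3


def at_least_3(cut):
    return X1[cut[0]] + X2[cut[1]] >= 3


bounded_sum_is_closed = closed_under_intersection(CUTS, at_most_3)
lower_bound_is_closed = closed_under_intersection(CUTS, at_least_3)
```

Here the bounded sum x1 + x2 ≤ 3 is closed under intersection, while x1 + x2 ≥ 3 is not: the cuts (2,0) and (1,1) both satisfy it, but their intersection (1,0) has sum 1.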
20
Conjunctive Predicates
A predicate that can be expressed as l1 Λ l2 Λ … Λ ln, where li is local to Pi.
Detect errors that may be hidden in some run due to race conditions.
Examples:
– mutual exclusion problem: (P1 in CS) and (P2 in CS)
– missing primary: (P1 is secondary) and (P2 is secondary) and (P3 is secondary)
Importance: Sufficient for detection of any boolean expression of local predicates
21
Conjunctive Predicates: Centralized Algorithm
(l1 Λ l2 Λ … Λ ln) is true iff there exist states si in Pi such that li is true in si, and si and sj are incomparable for all distinct i, j.
22
Algorithms for Conjunctive Predicates
Centralized Algorithm [Garg and Waldecker 92]
Each non-checker process maintains its local vector and sends the chain clock to the checker process
– whenever the local predicate is true
– at most once in each message interval.
Time complexity: the checker requires at most O(n^2 m) comparisons.
Other algorithms:
– token based algorithm [Garg and Chase 95]
– completely distributed algorithm [Garg and Chase 95]
– keeping queues shorter [Chiou and Korfhage 95]
– avoiding control messages [Hurfin, Mizuno, Raynal, Singhal 96]
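The checker's elimination loop can be sketched as follows. This is a simplified illustration in the spirit of the centralized algorithm, not the published pseudocode: it assumes each process has already reported, in local order, the vector timestamps of the states in which its local predicate held, and the name `detect_conjunction` is made up. A head that happened before another head can never be part of a solution, so it is discarded.

```python
from collections import deque


def happened_before(s, t):
    """s -> t iff s is component-wise <= t and the two vectors differ."""
    return s != t and all(a <= b for a, b in zip(s, t))


def detect_conjunction(reports):
    """Return one pairwise-incomparable state per process, or None.

    `reports[i]` lists the timestamps, in local order, of the states of
    process i in which its local predicate li was true.
    """
    queues = [deque(q) for q in reports]
    while all(queues):  # stop as soon as any queue is empty
        heads = [q[0] for q in queues]
        stale = {i for i, s in enumerate(heads)
                 for j, t in enumerate(heads)
                 if i != j and happened_before(s, t)}
        if not stale:
            return heads  # all heads are pairwise incomparable
        for i in stale:
            queues[i].popleft()
    return None


# Faulty run: P1 and P2 enter their critical sections concurrently.
violation = detect_conjunction([[(1, 0)], [(0, 1)]])
# Correct run: P2 enters only after hearing from P1, at (2,2).
no_violation = detect_conjunction([[(1, 0)], [(2, 2)]])
```

In the first call the two reported states are incomparable, so the conjunction (P1 in CS) and (P2 in CS) is detected; in the second, (1,0) happened before (2,2), the queue of P1 empties, and no violation is reported.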
23
Other Special Classes of Predicates
Relational Predicates
– Let xi: number of tokens at Pi
– Σxi < k: loss of tokens
– Algorithms: max-flow techniques [Groselj 93, Chase and Garg 95, Wu and Chen 98]
– Dilworth's partition [Tomlinson and Garg 96]
24
Relational Predicates
Let xi ≥ 0 be a variable at Pi. Predicates that compare Σ xi with a constant k [Chase and Garg 95]
Algorithm: Consistent cut with minimum value = min cut in the flow graph
25
Predicate Detection in General
Explore the state-space (need to examine all global states) without constructing the graph
breadth first manner [Cooper and Marzullo 91]
depth first manner [Alagar and Venkatesan 94]
lexical order [Garg 03]
[Figure: lattice of global states of the computation with events e1, e2 on P1 and f1, f2 on P2, from {┴} up to {e2, e1, f2, f1, ┴}]
26
Talk Outline
Motivation and Overview
Instrumentation
– Clock : Tracking Dependency
Property Checking
– Camera : Global Snapshot (Checkpoint)
– Sensor : Detecting Global Properties
– Slicer : Computation Slicing
– Supervisor: Controlling Execution
27
The Main Idea of Computation Slicing
[Figure: slicing turns a partial order trace, whose global states suffer from state explosion, into a slice that keeps all red global states]
28
How does Computation Slicing Help?
[Figure: slicing the partial order trace for b1 gives a slice that retains all global states satisfying b1; checking b1 Λ b2 on the trace is replaced by checking b2 on the slice]
29
Example
Detect predicate (x*y + z < 5) Λ (x ≥1) Λ (z ≤3)
Computation (value of the variable after each event):
Process   Variable   Events
P1        x          a: 1, b: 2, c: -1, d: 0
P2        y          e: 0, f: 2, g: 1, h: 3
P3        z          u: 4, v: 1, w: 2, x: 4
Slice with respect to (x ≥ 1) Λ (z ≤ 3): {a,e,f,u,v} {b} {w} {g}
30
Slice
slice: a sub-trace such that:
it contains all consistent cuts of the trace satisfying the given predicate
it contains the least number of consistent cuts
[Garg and Mittal 01, Mittal and Garg 01]
[Figure: the Slicer takes a trace and a predicate as input and outputs a slice]
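For a tiny trace, the cuts of the slice can be computed by brute force. The sketch relies on the lattice-theoretic reading of the definition: the consistent cuts of any sub-trace are closed under union and intersection, so the slice is taken to be the smallest such closed set containing every satisfying cut. The function names, the two-process trace without messages, and the sample predicate are all assumptions for illustration; the efficient algorithms cited in the talk avoid this enumeration.

```python
from itertools import product


def consistent_cuts(sizes, messages):
    """Enumerate cuts as tuples of per-process event counts.

    `messages` holds ((p, i), (q, j)) pairs: the i-th event of process p
    happened before the j-th event of process q (events are 1-based).
    """
    return [cut for cut in product(*(range(n + 1) for n in sizes))
            if all(cut[p] >= i for (p, i), (q, j) in messages if cut[q] >= j)]


def slice_cuts(cuts, predicate):
    """Smallest set containing the satisfying cuts that is closed under
    component-wise min (intersection) and max (union)."""
    result = {c for c in cuts if predicate(c)}
    changed = True
    while changed:
        changed = False
        for g in list(result):
            for h in list(result):
                for c in (tuple(map(min, g, h)), tuple(map(max, g, h))):
                    if c not in result:
                        result.add(c)
                        changed = True
    return result


all_cuts = consistent_cuts(sizes=(2, 2), messages=[])
# Sample predicate: exactly one event has been executed in total.
small_slice = slice_cuts(all_cuts, lambda cut: sum(cut) == 1)
```

The trace has 9 consistent cuts, two of which satisfy the predicate; the slice has 4 cuts, namely (0,0), (1,0), (0,1) and (1,1). The example also shows that a slice may contain cuts that do not satisfy the predicate, since it has to be the cut set of a sub-trace.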
31
Results
Efficient polynomial-time algorithms for computing the slice for:
– linear predicates: [Garg and Mittal 01]
• time-complexity: O(n^2 m)
– general predicate:
• Theorem: Given a computation, if a predicate b can be detected efficiently then the slice for b can also be computed efficiently. [Mittal, Sen and Garg 03]
– combining slices: Boolean operators
– temporal logic operators: EF, AG, EG
– approximate slice: for arbitrary boolean expressions
n: number of processes, m: number of events
32
POTA Architecture [Sen and Garg 04]
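[Figure: block diagram with the components Instrumentor, Translator, Slicer, Predicate Detector and Analyzer; the artifacts Program, Instrumented Program, Specification, Predicate (Specification), Trace, Slice, Trace Slice and Promela; the steps Execute Program and Execute SPIN; and the outcomes yes/witness, no/counterexample, yes, no/counterexample]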
33
Experiments: Dining Philosophers Trace Verification
POTA: Partial Order Trace Analyzer (based on slicing) [Sen and Garg 03]
SPIN: A widely used model checking tool [Holzmann 97]
– SPIN: 250 seconds for n = 6, runs out of memory for n > 6.
– POTA: handles n = 200 in 400 seconds.
Predicate: Two neighboring dining philosophers do not eat concurrently
34
Supervisor: Motivation for Control
Maintain global invariants or proper order of events
– Examples: Distributed Debugging
• ensure that busy1 ∨ busy2 is always true
• ensure that m1 is delivered before m2
– Fault tolerance
• On fault, roll back and execute under control
35
Rollback Recovery for Software Faults
Re-execution Problem:
To re-execute in order to avoid a recurrence of a previously detected failure
– Progressive Retry [Wang et al 97]
– Controlled Re-execution [Tarafdar and Garg 98]
[Figure: processes P1, P2 and P3, showing the faulty state and the restored state]
36
Controlled Re-execution
Add the synchronization necessary to maintain safety property
– e.g., mutual exclusion
[Figure: processes P1 and P2 with critical sections, events a, b, c and d, and global state G1]
37
Results
Efficient algorithms for computing the synchronization for:
– Locks [Tarafdar, Garg DISC98]
• O(nm) algorithm for various types of locks
– disjunctive predicate [Mittal, Garg 00], e.g., (n-1)-mutual exclusion
• time-complexity: O(m^2)
• minimizes the number of synchronization arrows
– region predicate [Mittal, Garg PODC 00], e.g., virtual clocks of processes are “approximately” synchronized
• time-complexity: O(nm^2)
• maximizes the concurrency in the controlled computation
n: number of processes, m: number of events
38
Conclusions
Efficient algorithms possible for monitoring global properties
Observation and Control: a powerful abstraction
Current execution engines are designed for performance rather than fault-tolerance
39
Other Research Projects
Distributed simulation: GVT algorithms, fault-tolerance
Recovery Schemes: Optimistic Message Logging, fault-tolerance without replication
Model Checking: Partial Order Methods
Formal Methods: Petri Nets, Lattice Theory, Max Plus Algebra
40
Questions?